Atlas

NOTE: Based on Jeremie's emails to the atlas-l mailing list.

Maintained by: Pushparajan V

Introduction

  • What is Atlas? - Atlas is an open protocol for building a fully distributed and interoperable world-wide search community. All participants can interact openly and in any role where they believe they can add value to the network. Instead of one monolithic system, Atlas has many independent entities serving different roles. These entities exchange aggregate information, or knowledge, and can decide with whom they want to work.
  • What is a knugget? - Everything within Atlas revolves around one unit, the knugget, shorthand for "knowledge nugget" and pronounced the same as "nugget".

Primary roles within Atlas

There are three primary roles within Atlas:

  • Factory - Responsible to the content.
    • A Factory is akin to a crawler in today's search engines. An Atlas Factory must fetch and process the content as intelligently as possible, performing analysis (such as Natural Language Processing) and normalizing it into distinct units. A Factory shares its highly refined and processed output with one or more Collectors, chosen according to which ones it believes make the best use of that output.
  • Collector - Responsible to the keyword.
    • A Collector absorbs and indexes output from one or more Factories, with one primary goal: ranking. An Atlas Collector must provide the most intelligent ranking and relationship analysis possible. A Collector has to compete for the output of a Factory, as well as compete to provide the best ranking quality for Brokers.
  • Broker - Responsible to the searcher.
    • A Broker must provide a searcher with the best possible results. It does so by combining diverse ranking results from Collectors and also by retrieving content from the original Factories. This last step, a Broker interacting with a Factory, is critical to maintaining a balanced ecosystem. All Factories must be aware of and approve how their results are being used and by whom.

How to get involved

The wire protocol and the further definition of the interactions between these entities are evolving in the open. Anyone interested is welcome to join the discussions and see the initial proposals at http://lists.wikia.com/mailman/listinfo/atlas-l over the coming weeks.

Explanation for Factory, Collector, Broker

Factory-Collector-Broker must interact with each other to complete any search request. Any two roles could be performed by a single entity (whereas if all three are performed by one entity, the result would be a traditional, monolithic search engine).

Reputation and reward are bi-directional between all parties (Factory-Collector, Collector-Broker, and Broker-Factory). Each entity may choose to interact on principle (free, Commons), for attribution (results provided by), or commercially (as a paid service). The Atlas protocol is purely a facilitator and does not restrict how the relationships between entities are formed. Given these motives, it's likely that free networks will tend to become more specialized, commercial ones will compete on quality, and attribution-based networks will mature in both directions.

This simple yet powerful division of roles, responsibilities, and relationships will result in a distributed economic foundation for an Internet Search Infrastructure.

What is a knugget?

  • A knugget is informally thought of as a single search result, not unlike what you often see in typical search results (title, link, clipping).
  • But, a Broker can deliver a single search result by combining a few knuggets.
  • A knugget is a *human* definition, not a technical one. Atlas ultimately serves humans on both the incoming content and the outgoing results, so be warned that deciding what counts as a knugget will involve a lot of judgment calls.
  • The formal definition is: the smallest standalone unit of context that most average people would recognize. What this really means is a string of text that anyone could make some sense of, and could understand, outside of any other context about that text. Examples of knuggets are a title + link, a sentence stating something, a row from a table, an object + description, etc., all with a reference to their source URL of course.
  • Every Factory has a tremendous amount of leeway in how it generates knuggets. Its job is to take content and break it into the smallest and most valuable units possible, doing the best job it can of understanding the content and serving it. A Factory then publishes these knuggets to whatever Collectors it has relationships with, and those Collectors index the individual knuggets.
  • A Collector only indexes a reference to the original knugget and does not copy or store the knugget itself, so when a Broker queries a Collector it only gets back a list of references and rankings.
  • A Broker must then also retrieve the knuggets it is interested in from the original Factory. This second fetch primarily supplies the titles and clippings that a Broker shows in its results, but it also serves critically as a validation stage, confirming that a Collector hasn't been poisoned and that the reputation of the Factory is intact.

So, the pipeline looks like F -> C -> B -> F. The knugget is the vehicle by which all data is moved within Atlas through this pipeline.
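
The F -> C -> B -> F pipeline can be illustrated with a small in-memory sketch. Nothing here is part of the Atlas protocol itself: the class and method names (`produce`, `absorb`, `query`, `search`), the `(factory name, digest)` reference format, the hex MD5 digest, and the term-count ranking are all assumptions made for illustration. The point is only to show that a Collector keeps references rather than content, and that a Broker fetches and validates each knugget from its Factory.

```python
import hashlib
from collections import defaultdict


def digest(text: str) -> str:
    """Hex MD5 of a knugget's text (hypothetical reference format)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class Factory:
    """Owns the content: stores knuggets and serves them by digest."""

    def __init__(self, name: str):
        self.name = name
        self.store = {}  # digest -> knugget text

    def produce(self, text: str):
        ref = (self.name, digest(text))
        self.store[ref[1]] = text
        return ref

    def fetch(self, knugget_digest: str):
        return self.store.get(knugget_digest)


class Collector:
    """Indexes terms against references only; never keeps the text."""

    def __init__(self):
        self.index = defaultdict(set)  # term -> set of references

    def absorb(self, ref, text: str):
        for term in set(text.lower().split()):
            self.index[term].add(ref)

    def query(self, terms):
        # Naive ranking (an assumption): count of matching query terms.
        scores = defaultdict(int)
        for term in terms:
            for ref in self.index.get(term.lower(), ()):
                scores[ref] += 1
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


class Broker:
    """Combines Collector rankings, then fetches from the Factories."""

    def __init__(self, collectors, factories):
        self.collectors = collectors
        self.factories = factories  # factory name -> Factory

    def search(self, query: str):
        results = []
        for collector in self.collectors:
            for (name, d), score in collector.query(query.split()):
                text = self.factories[name].fetch(d)
                # Validation stage: drop anything the Factory does not
                # serve or whose content does not match its digest.
                if text is not None and digest(text) == d:
                    results.append((score, text))
        results.sort(key=lambda r: -r[0])
        return [text for _, text in results]


factory = Factory("f1")
collector = Collector()
for sentence in ["Atlas is an open search protocol",
                 "A Broker serves the searcher"]:
    collector.absorb(factory.produce(sentence), sentence)  # F -> C
broker = Broker([collector], {"f1": factory})
print(broker.search("open protocol"))  # C -> B -> F
```

Because the Broker re-checks every reference against the Factory, a bogus entry planted in a Collector's index simply drops out of the results.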

More about knugget

A knugget should be identified solely by a unique URL that serves also as the method by which it can be retrieved/validated:

      http://a12.factory.com/x/md5=HUXZLQLMuI/KZ5KDcJPcOA==
  • The one critical part of this definition is that the URL path must end in an RFC 3230 digest of the knugget itself. Basing the identifier on a hash results in a number of useful attributes later on.
  • The host domain is the Factory that created it.
  • The optional sub-domain can be used as a simple means to help distribute requests or direct them to specific servers.
  • The path can be used by the Factory as it wishes, but it has one special enhanced meaning for redirects.
  • It is always "http://" so for storing references, only the hostname+path+digest is required.
  • A knugget can be retrieved via a normal GET. The MIME type is still to be determined.
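
As a rough sketch of the URL scheme above, the following builds a knugget URL ending in an RFC 3230 style `md5=<base64>` digest and checks retrieved content against it. The function names, the choice of MD5 (taken from the example URL), and the treatment of the knugget body as raw bytes are assumptions, since the knugget format and MIME type are not yet defined.

```python
import base64
import hashlib
from urllib.parse import urlsplit


def knugget_digest(content: bytes) -> str:
    """Return 'md5=' followed by the base64 of the MD5 of the content."""
    raw = hashlib.md5(content).digest()
    return "md5=" + base64.b64encode(raw).decode("ascii")


def knugget_url(host: str, path: str, content: bytes) -> str:
    """Build http://<host><path>/<digest>; path may be '' or like '/x'."""
    return f"http://{host}{path.rstrip('/')}/{knugget_digest(content)}"


def digest_from_url(url: str) -> str:
    """Extract the trailing digest. Base64 may itself contain '/',
    so split on the last '/md5=' rather than on the last '/'."""
    _, sep, b64 = urlsplit(url).path.rpartition("/md5=")
    if not sep:
        raise ValueError("URL does not end in an md5 digest")
    return "md5=" + b64


def validate(url: str, content: bytes) -> bool:
    """True if the content hashes to the digest named in its URL."""
    return digest_from_url(url) == knugget_digest(content)


body = b"Atlas is an open search protocol"
url = knugget_url("a12.factory.com", "/x", body)
print(url)
print(validate(url, body), validate(url, body + b"!"))
```

Since the identifier is derived from the content, anyone holding the URL can verify a fetched knugget without trusting the server that delivered it.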

Knugget Redirects

Given that there will be a plethora of knuggets produced, and that Factories can be distributed by their very nature, when knuggets are retrieved via HTTP a redirect should be treated a little more intelligently:

  • First, a knugget can *not* be redirected to a different one. The URL is its identifier, and any redirect is for locating that same knugget (it must hash to the same digest), not for finding a different one. Given that, redirects have value only in finding that same knugget elsewhere within the Factory's domain.
  • A 301 can be treated as redirecting the sub-domain, so that all future requests to the old sub-domain go to the new sub-domain given in the Location header.
  • A 302 can be treated as redirecting the sub-domain AND path component, so that any future requests matching the old sub-domain AND path exactly are sent to the new sub-domain and path given in the Location header of the 302 redirect.
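
One possible client-side reading of these redirect rules is sketched below. The `RedirectCache` class, its method names, and the choice to let an exact 302 match take precedence over a 301 host mapping are assumptions, not part of any Atlas specification. The sketch also does not check that the new host stays within the Factory's domain.

```python
from urllib.parse import urlsplit


def split_knugget_url(url: str):
    """Split a knugget URL into (host, path, digest)."""
    parts = urlsplit(url)
    path, sep, b64 = parts.path.rpartition("/md5=")
    if not sep:
        raise ValueError("URL does not end in an md5 digest")
    return parts.hostname, path, "md5=" + b64


class RedirectCache:
    """Remembers redirects so later requests go to the new location."""

    def __init__(self):
        self.host_map = {}  # old host -> new host (learned from 301)
        self.path_map = {}  # (old host, old path) -> (new host, new path)

    def record(self, status: int, requested_url: str, location: str):
        old_host, old_path, old_digest = split_knugget_url(requested_url)
        new_host, new_path, new_digest = split_knugget_url(location)
        # A redirect may only relocate the same knugget.
        if new_digest != old_digest:
            raise ValueError("redirect points at a different knugget")
        if status == 301:
            self.host_map[old_host] = new_host
        elif status == 302:
            self.path_map[(old_host, old_path)] = (new_host, new_path)

    def resolve(self, url: str) -> str:
        host, path, digest = split_knugget_url(url)
        if (host, path) in self.path_map:  # exact sub-domain AND path
            host, path = self.path_map[(host, path)]
        elif host in self.host_map:        # sub-domain only
            host = self.host_map[host]
        return f"http://{host}{path}/{digest}"


cache = RedirectCache()
d = "md5=HUXZLQLMuI/KZ5KDcJPcOA=="
cache.record(301, f"http://a12.factory.com/x/{d}",
             f"http://a13.factory.com/x/{d}")
print(cache.resolve("http://a12.factory.com/y/md5=AAAA"))
```

After the 301 above, every later request to `a12.factory.com` is rewritten to `a13.factory.com`, whatever its path; a 302 would only affect requests with the same sub-domain and path.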


Supporting distributed / p2p based Factories

  • In order to support a Factory that chooses to store knuggets behind a NAT (desktop software), a redirect may also change the transport method from "http://" to a different, UDP-based one.
  • Supporting this is not a *requirement* for everyone requesting knuggets from a Factory. The trade-off is that any Factory must always be able to proxy requests from clients that do not support the alternative transport, and clients that do support it should use an Accept* header (exactly how is yet to be defined) to signal that they do.

This page was last modified 00:03, 18 October 2007. GFDL