GrubWorkUnit

The new Grub workunit format is a plain-text list of HTTP requests. Workunits are served by a dispatch service that uses HTTP basic authentication with a Wikia user account and password.

The dispatch URL in testing is http://dispatch.grub.org/do/workunit, which currently returns 250 entries by default. (Open question: should this count be variable or fixed?)
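
As a rough sketch of how a client might request a workunit, the following uses only the Python standard library. The function names are hypothetical, and the sketch assumes the dispatch service accepts a plain GET with a standard HTTP basic authentication header; the page does not spell out either detail.

```python
import base64
import urllib.request

# Testing URL taken from this page.
DISPATCH_URL = "http://dispatch.grub.org/do/workunit"


def build_dispatch_request(user: str, password: str) -> urllib.request.Request:
    """Build a GET request carrying an HTTP basic authentication header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(DISPATCH_URL)
    req.add_header("Authorization", "Basic " + token)
    return req


def fetch_workunit(user: str, password: str, timeout: float = 30.0) -> bytes:
    """Download one workunit and return its raw bytes (assumed response shape)."""
    with urllib.request.urlopen(build_dispatch_request(user, password), timeout=timeout) as resp:
        return resp.read()
```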

An example of the format is:

GET /cocktail.html HTTP/1.0\r\n
Host: www.informedchoice.info\r\n
User-Agent: Grub WU1\r\n
\r\n
GET / HTTP/1.0\r\n
Host: fbc-greenbrier.org\r\n
User-Agent: Grub WU1\r\n
\r\n
PUT /arcs/jeremie.8bd1e51389ad90c9eb0c8e8cae13ceed53a4bcb6.arc.gz HTTP/1.0\r\n
Host: soap.grub.org\r\n
\r\n
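
A workunit like the one above could be split into individual requests along these lines. This is a minimal sketch, assuming the \r\n shown in the example stands for real CR LF bytes in the file; parse_workunit and WorkunitRequest are illustrative names, not part of any Grub client.

```python
from typing import Dict, List, NamedTuple


class WorkunitRequest(NamedTuple):
    method: str
    path: str
    version: str
    headers: Dict[str, str]


def parse_workunit(raw: bytes) -> List[WorkunitRequest]:
    """Split a workunit on blank lines (CR LF CR LF) and parse each request."""
    requests = []
    for block in raw.split(b"\r\n\r\n"):
        if not block.strip():
            continue  # trailing separator leaves an empty final block
        lines = block.decode("latin-1").split("\r\n")
        method, path, version = lines[0].split(" ", 2)
        headers = {}
        for line in lines[1:]:
            name, _, value = line.partition(":")
            headers[name.strip().lower()] = value.strip()
        requests.append(WorkunitRequest(method, path, version, headers))
    return requests
```

Order is preserved because the list is built in file order, and the final element is expected to be the PUT that names the upload target.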


Notes:

  1. The last entry is a PUT and is the only special case: the content body of this PUT should be a gzip-compressed ARC file in Internet Archive V1 format (http://www.archive.org/web/researcher/ArcFileFormat.php).
  2. Requests may use methods other than GET, but in this version they will never be anything that requires more than a header (such as a POST). \r\n\r\n is always the separator between requests. (Open question: make it GET only?)
  3. The requests may be other versions of HTTP such as 1.1, but will always be Connection: close, and the client can expect server connections to always close once the request is complete.
  4. Any Grub client errors should be encoded as 500 HTTP errors in the resulting ARC file.
  5. Order must be preserved and every request must be represented in the ARC file (none can be skipped) so that the Grub server can validate an entire workunit from the resulting ARC.
  6. Redirects must NOT be followed; the exact web server response should be encoded into the ARC verbatim.
  7. A per-request maximum length should probably be hard-coded into the clients: 10MB or 50MB?
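
Notes 4, 6 and 7 could be combined in a client roughly as follows. This is only a sketch under stated assumptions: run_request and error_response are hypothetical names, the 10MB cap is one of the two values floated above and is applied to the response bytes, and the exact layout of the synthetic 500 response is a guess, since the page does not define it. Sending the request over a raw socket means redirects are never followed and the server's bytes are returned untouched.

```python
import socket

# Assumed cap; note 7 leaves the choice between 10MB and 50MB open.
MAX_RESPONSE_BYTES = 10 * 1024 * 1024


def error_response(reason: str) -> bytes:
    """Build a synthetic HTTP 500 response describing a client-side failure."""
    body = reason.encode("utf-8", "replace")
    head = (
        "HTTP/1.0 500 Grub Client Error\r\n"
        "Content-Type: text/plain\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    )
    return head.encode("ascii") + body


def run_request(host, request, port=80, max_bytes=MAX_RESPONSE_BYTES,
                connect=socket.create_connection, timeout=30.0):
    """Send one raw request and return the raw response, capped at max_bytes.

    `request` is the request block from the workunit including its
    terminating blank line. Any socket error becomes a synthetic 500.
    """
    try:
        conn = connect((host, port), timeout)
    except OSError as exc:
        return error_response(f"connect failed: {exc}")
    try:
        conn.sendall(request)
        chunks = []
        received = 0
        while received < max_bytes:
            chunk = conn.recv(min(65536, max_bytes - received))
            if not chunk:
                break  # server closed the connection, response complete
            chunks.append(chunk)
            received += len(chunk)
        return b"".join(chunks)
    except OSError as exc:
        return error_response(f"transfer failed: {exc}")
    finally:
        conn.close()
```

The connect parameter exists only so the function can be exercised without a network; a real client would leave it at its default. Because every request yields either the server's bytes or a 500, no entry is skipped and the results can be written to the ARC in workunit order.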


A number of early implementations exist in Subversion at:

svn co http://svn.swlabs.org/grubng

Retrieved from "http://search.wikia.com/wiki/GrubWorkUnit"

This page was last modified 19:29, 24 June 2008. GFDL