The Most Important Search Startup Tool? Grub
Posted by Jimbo Wales
When Wikia obtained Grub from LookSmart, we did so with little fanfare. But when it comes down to it, Grub is one of the most important search-related projects on the web. Grub is an open source distributed web crawler that helps create the index Wikia Search employs. Anyone can download a Grub client and use their computer’s idle time to crawl the web. Ideally, thousands if not millions of people all using Grub together will allow Wikia Search to put together a complete, high-quality, and fast index of the web. Wikia, in turn, makes that index available for use by others under a free license.
Seems ho-hum? Maybe, but that’s probably because the description above drowns in “search speak”. “Crawl.” “Index.” In reality, what Grub allows us to do, collectively, is create a copy of the web. Rich Skrenta, the founder of Topix, is now working on another search project called Blekko. On his blog, Skrenta notes that search poses an interesting problem, because one of the first things one needs to do is, literally, copy the web. The whole thing: every blog post, every news article, every comment, every Geocities page, every tweet, every … everything. Wikipedia estimates that, as of 2006, Google’s copy exceeded 25 billion pages. That is a huge barrier to entry for any search project, as one can imagine. And once you have copied it, you need to do it again and again, because the content changes and grows perpetually. As Bernard Lunn noted, “Basic economics mean that only a very small number of players will be able to afford the giant server farms needed to index the whole Web.”
This is an utter waste.
We are all better off when we reduce these extra efforts to copy the web, as every additional attempt is 100% duplicative of the first and, for that reason, a net economic loss for society. All these search startups, including Wikia Search, Blekko, Hakia, ChaCha, Powerset, and the hundreds of others that make a site all about alternative search engines economically viable, need an up-to-date, complete copy of the web to be viable themselves. The big five players aside (who have a massive incentive to keep their copy of the web under lock and key), there really is no reason why everyone else cannot work together to make one publicly available copy. If we do, we all win. Before you can compete on brand with the big boys, you need to take a sizable cost hit to “simply” copy the web. Together we can eliminate that cost, and make the Internet better for everyone else at the same time.
Crazy? Not really. Think back to the first web boom, the mid-to-late 1990s. Apache 1.0 was released in December 1995; eBay, Amazon, Yahoo!, and many others who no longer grace the Internet preceded it. MySQL came out in May of 1995, after both Yahoo! and Amazon. That is half of the LAMP stack, the free software that runs many of the sites we all frequent. Without this suite of tools, the startups of yesteryear had to suffer through demonstrably higher infrastructure costs (fixed costs) than do the companies of today. For Search, the index is the core piece of the infrastructure. It is also a significant reason why Google et al. seem unthreatened in the Search marketplace, as the colossal size of the web is a daunting barrier.
That barrier needs to come down. With Grub, and with a lot of people helping, innovation in Search can become dramatically easier for entrepreneurs. If Grub succeeds, everyone wins.
(co-authored with Dan Lewis)
Tags: Grub

July 17th, 2008 at 5:53 pm
[...] been tied up lately and haven't been able to mention that Jimbo Wales, founder of Wikipedia, has written an article on the Wikia Search blog trying to explain the strategic importance of Grub in his project to compete with Google for [...]