Forum:Heritrix web crawler

The Heritrix web crawler is an LGPL crawler used by the Internet Archive to perform crawls for our partners. We use the output of Heritrix crawls to feed our world-wide Wayback Machine, but Heritrix's design is quite modular and we could easily adapt it to feed into Wikia search as well.

Heritrix was designed to perform large-scale web crawls on server-class machines, so it's not quite like Grubby. However, Heritrix may still be preferable for those Wikia search participants who have servers to dedicate to the cause. Also, Heritrix is written in Java, so it works on platforms other than Windows.

If you don't think Heritrix is a good fit for Wikia, our team may still be able to provide guidance regarding specific changes to Grubby. Off the top of my head, here's a quick list of pitfalls you'll want to avoid:

1. Politeness. This is the single most important feature of Heritrix. It's a really good idea to not upset the webmasters of the world with your crawler. The key things here are honoring robots.txt and making sure you're not crawling sites too quickly and overloading them. Ideally you'd configure the crawler to throttle down the speed for sites such as government sites that typically can't handle large loads.

2. URI Canonicalization: There are often multiple ways to express the same URI; Heritrix uses "URI canonicalization rules" to convert a URI into a standard form. This helps prevent the crawler from fetching the same page more than once under different URIs.

3. Crawler traps: These are websites that are designed to feed the same content to a crawler under millions of different URIs, all dynamically generated. Your crawler becomes trapped in a seemingly infinitely deep site. You're going to need some sort of blacklist to block your crawler from visiting these sites.
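
To make the three pitfalls above a bit more concrete, here is a rough Python sketch. It is not Heritrix code and does not mirror Heritrix's actual rules or configuration. The names (`canonicalize`, `PoliteScheduler`, `looks_like_trap`), the 5-second default delay, and the depth and repeat thresholds are all assumptions chosen for illustration. The trap check is a simple path heuristic added alongside the blacklist, not a replacement for one.

```python
import time
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode
from urllib.robotparser import RobotFileParser

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(uri):
    """Reduce a URI to one standard form (illustrative rules only;
    ignores userinfo and IPv6 literals)."""
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the scheme's default.
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # Sort query parameters so ?a=1&b=2 and ?b=2&a=1 compare equal.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((scheme, host, path, query, ""))


def allowed_by_robots(robots_txt, user_agent, uri):
    """Check a URI against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, uri)


class PoliteScheduler:
    """Tracks, per host, the earliest time the next fetch may start."""

    def __init__(self, default_delay=5.0, clock=time.monotonic):
        self.default_delay = default_delay  # assumed value, in seconds
        self.host_delay = {}    # per-host overrides for fragile sites
        self.next_allowed = {}  # host -> earliest time of the next fetch
        self.clock = clock

    def wait_time(self, host):
        """Seconds to wait before this host may be fetched again."""
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, host):
        """Call after each fetch to start the host's delay window."""
        delay = self.host_delay.get(host, self.default_delay)
        self.next_allowed[host] = self.clock() + delay


def looks_like_trap(uri, blocked_hosts=frozenset(), max_depth=20, max_repeats=3):
    """Blacklist lookup plus a crude check for very deep or repeating paths."""
    parts = urlsplit(uri)
    if (parts.hostname or "").lower() in blocked_hosts:
        return True
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) > max_depth:
        return True
    return any(segments.count(s) >= max_repeats for s in set(segments))
```

A crawl loop would canonicalize each discovered URI before checking it against the set of already-seen URIs, skip anything that `allowed_by_robots` or `looks_like_trap` rejects, and sleep for `wait_time(host)` before each fetch. Slower sites would get a larger entry in `host_delay`.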

And that's really just the tip of the iceberg. Anyway, you can find out more about Heritrix at crawler.archive.org; join our mailing list if you'd like to ask specific questions about crawling the web. The Internet Archive has been doing this sort of thing for years and years.

Good luck...

67.88.206.99 22:23, 27 July 2007 (UTC)

I think we are excited to support a variety of approaches. It is too early to tell which will be the best long term solution or whether (as I suspect) the best long term solution will be to combine the best output of various approaches. --Jimbo Wales 19:58, 28 July 2007 (UTC)


I've worked a lot with Heritrix, and it is not good for active sites, which covers a great many sites, such as http://www.hollisterco.com

Heritrix takes a team of web crawling engineers, and it can't get content it doesn't know about. Since Grub is going to run on the client, let's have it use IE or Firefox to build the links, so we get a true picture of what a browser would see: save a copy, then build the links. Finding links is the hard part =(


We have functional and battle-tested link extraction code for vanilla html, xml, javascript, css, doc, pdf and swf files. With regard to extracting links from AJAX-heavy sites, we have an experimental project underway called "webmonkeys" that uses Firefox to render each page. Such an approach would slow down the crawl, of course.
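
For plain HTML, the basic idea of link extraction can be sketched in a few lines of Python using only the standard library. This is not Heritrix's extractor, and `LinkExtractor` and `extract_links` are made-up names for illustration. It only sees links present in the markup (`href` and `src` attributes), so it misses anything a script builds at runtime, which is exactly the gap a browser-rendering approach is meant to cover.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects href/src attribute values as absolute URIs."""

    LINK_ATTRS = {"href", "src"}

    def __init__(self, base_uri):
        super().__init__()
        self.base_uri = base_uri
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in self.LINK_ATTRS and value:
                # Resolve relative references against the page's own URI.
                self.links.append(urljoin(self.base_uri, value))


def extract_links(html, base_uri):
    """Return the absolute URIs found in a page's static markup."""
    parser = LinkExtractor(base_uri)
    parser.feed(html)
    parser.close()
    return parser.links
```

For example, `extract_links('<a href="/about">About</a>', "http://example.com/dir/page.html")` yields the single URI `http://example.com/about`, while an anchor that navigates only through an `onclick` handler yields nothing.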

It is also true that Heritrix clusters require active maintenance during a crawl. We are slowly but surely moving towards a node-based system, where you can just add more nodes to a running crawl, and node failures are handled gracefully. But those features are still in the early planning stages.

Again, Heritrix is very different from Grub's approach. But I do imagine that full-time Heritrix crawlers could be used to augment Wikia's crawl payloads.

We're here to help. --Pjack 04:25, 16 August 2007 (UTC)

We already have Heritrix running for testing, and I actually expect we'll be using parts of it in conjunction with Grub, possibly soon. While on the surface they could both be called crawlers, they are really quite different things :) -- Jer 18:59, 16 August 2007 (UTC)


This page was last modified 18:59, 16 August 2007. GFDL