Nutch, Hadoop, Lucene
Lucene, originally created by Doug Cutting, is an open source project for indexing and searching documents. Lucene uses inverted indexes, which map each term to the documents that contain it. Nutch, a subproject of Lucene, is a web search engine. Nutch is roughly composed of the following parts: fetcher, parser, and indexer. Nutch uses Hadoop, which implements Google's map/reduce computing paradigm and a Distributed File System (DFS).
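To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is not Lucene's implementation or API. The function names `build_inverted_index` and `search`, the whitespace tokenizer, and the sample documents are all assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Nutch is a search engine",
    2: "Lucene is a search library",
    3: "Hadoop runs on clusters",
}
index = build_inverted_index(docs)
```

With this structure, a query such as `search(index, "search engine")` only touches the posting lists for the two query terms instead of scanning every document, which is the property that makes inverted indexes suitable for search.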
Hadoop is a framework for running applications on large clusters of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named map/reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both map/reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
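The map/reduce paradigm described above can be sketched in a few lines of single-process Python. This is only an illustration of the programming model, not Hadoop's API. The names `map_phase`, `shuffle`, `reduce_phase`, and `run_job`, as well as the word-count task and the sample fragments, are assumptions chosen for the example.

```python
from collections import defaultdict


def map_phase(fragment):
    """Emit a (word, 1) pair for every word in one fragment of the input."""
    for word in fragment.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(fragments):
    """Run map over each fragment, then reduce each group of values."""
    pairs = []
    for fragment in fragments:
        # Each call depends only on its own fragment, so it could run on any
        # node and be re-run after a failure without affecting the result.
        pairs.extend(map_phase(fragment))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input, already split into small fragments of work.
fragments = [
    "nutch uses hadoop",
    "hadoop implements map reduce",
    "nutch uses lucene",
]
counts = run_job(fragments)
```

Because `map_phase` has no side effects and reads only its own fragment, repeating it yields the same pairs. That independence is what lets a framework reschedule a fragment on another node when one fails.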
The intent is to scale Hadoop up to handle thousands of computers. Hadoop has been tested on clusters of 600 nodes.
Hadoop is a Lucene sub-project that contains the distributed computing platform that was formerly a part of Nutch. This includes the Hadoop Distributed Filesystem (HDFS) and an implementation of map/reduce.
- Nutch includes a high-performance multithreaded crawler
- Nutch parses and indexes many document formats out of the box
- Nutch uses a distributed computing platform called Hadoop, an open source implementation of MapReduce. This makes it easy to deploy a Nutch solution over a large number of servers. Furthermore, Hadoop can now natively work with Amazon S3
- Webpages are stored in a Lucene index, allowing for high-performance retrieval
- Nutch is highly customizable; one can extend it by creating plugins (several plugins are already available)
- Nutch is open source and has a very healthy and friendly community
- Nutch is coded in Java, and thus runs on Windows, OS X, and Linux