Nutch is a project led by Doug Cutting (formerly of Apple
VTwin, Xerox PARC, Excite and Lucene) to build an open-source search engine
scalable enough to index the entire web. It can also be used for smaller
projects such as site, multi-site and intranet searching. It includes a Java
crawler, plus an indexer and search engine based on the Lucene
open-source search code library. The project is still in development and not yet
complete; it is partly supported by Yahoo Research and the Internet Archive.
Robot crawler; can use a proxy
Includes hosts via grep; excludes by host names and suffixes
FTP indexing login option
Index logging options
Flexible query parsing
Includes link-analysis module (mainly for multi-site search)
Includes approximately fifteen relevance quality adjustment options
Caches original page for display
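As a rough illustration of the host inclusion and exclusion feature listed above, the sketch below filters candidate URLs by a host pattern, an excluded host name and excluded suffixes. It is not Nutch code and does not use Nutch's configuration format: the rule names, the example hosts, and the reading of "suffixes" as file-name suffixes are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Hypothetical rules for illustration only; none of these names come from Nutch.
INCLUDE_HOST_PATTERNS = [re.compile(r"(^|\.)example\.com$")]  # hosts to crawl
EXCLUDE_HOSTS = {"private.example.com"}                       # hosts to skip
EXCLUDE_SUFFIXES = (".gif", ".jpg", ".zip")                   # assumed: file suffixes


def should_crawl(url: str) -> bool:
    """Return True if the URL passes the exclusion rules and matches an include pattern."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    path = parsed.path.lower()

    # Exclusions are checked first, so they override any include pattern.
    if host in EXCLUDE_HOSTS:
        return False
    if path.endswith(EXCLUDE_SUFFIXES):
        return False

    # A URL is crawled only if its host matches at least one include pattern.
    return any(pattern.search(host) for pattern in INCLUDE_HOST_PATTERNS)
```

Checking exclusions before inclusions is a design choice in this sketch: it lets a broad include pattern be narrowed by a short list of exceptions.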
Articles & Reviews
May 28, 2004, by Philipp Lenssen. Discussion of search engine architecture, Nutch, Lucene, open-source
search engines, web search, spamming, speed, and the future of search.
Nutch: Open Source Search:
April 2004, by Mike Cafarella and Doug Cutting. Describes the issues and challenges of designing a hugely scalable
search engine, the advantages of open-source projects, spam techniques
and responses to them, and cost-effectiveness.