Apache Nutch

by Apache

A mature, production-ready web crawler.

Excellent 10.0

(1 Rating)

Helps with: Scraping

Source Type: Open

License Types:

Apache

What is it all about?

Apache Nutch is a highly extensible and scalable open source web crawler software project.

Key Features

* Fetching and parsing are done separately by default, which reduces the risk of an error corrupting the fetch and parse stages of a crawl.
* Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
* The set of plugins shipped with Nutch for processing document types has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are now all parsed by the Tika plugin. The only parser plugins now shipped with Nutch are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika, and ZIP.
* Distributed filesystem (via Hadoop)
* Link-graph database
* NTLM authentication
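
The separation of fetching from parsing described in the first feature can be illustrated with a small conceptual sketch. This is not Nutch's API: `Segment`, `fetch_stage`, `parse_stage`, and `toy_parser` are hypothetical names invented for illustration, and the example URLs are placeholders. It only shows the general idea that, when raw content is stored first and parsed in a later pass, a parse failure leaves the fetched data intact.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a crawl segment (not a Nutch class)."""
    raw: dict = field(default_factory=dict)           # url -> raw bytes
    parsed: dict = field(default_factory=dict)        # url -> extracted text
    parse_errors: dict = field(default_factory=dict)  # url -> error message


def fetch_stage(segment, pages):
    """Store raw content only; no parsing happens in this stage."""
    for url, content in pages.items():
        segment.raw[url] = content


def parse_stage(segment, parser):
    """Parse stored content in a separate pass.

    A parser failure is recorded per URL and never modifies segment.raw.
    """
    for url, content in segment.raw.items():
        try:
            segment.parsed[url] = parser(content)
        except ValueError as exc:
            segment.parse_errors[url] = str(exc)


def toy_parser(content):
    """Toy parser (assumption): decode UTF-8 and trim whitespace."""
    try:
        return content.decode("utf-8").strip()
    except UnicodeDecodeError as exc:
        raise ValueError("undecodable content") from exc


def run_demo():
    segment = Segment()
    # Simulated fetch results; a real crawler would download these.
    fetch_stage(segment, {
        "http://example.com/a": b" hello ",
        "http://example.com/b": b"\xff\xfe broken",
    })
    parse_stage(segment, toy_parser)
    return segment


if __name__ == "__main__":
    result = run_demo()
    print("parsed:", result.parsed)
    print("errors:", result.parse_errors)
    print("raw entries kept:", len(result.raw))
```

Because the raw content survives a failed parse, the parse pass could in principle be rerun with a different parser without fetching the pages again.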