Compare Apache Nutch vs Web-Scraping-SDK | DiscoverSdk

Compare Products

Apache Nutch Scraping	Web-Scraping-SDK Scraping
Rating Excellent 10.0 (1 rating)	Rating Not yet rated (0 ratings)
Features (Apache Nutch)
* Fetching and parsing are done separately by default, which reduces the risk of an error corrupting the fetch or parse stage of a crawl.
* Plugins have been overhauled as a direct result of removing the legacy Lucene dependency for indexing and search.
* The set of plugins shipped with Nutch for processing document types has been refined. Plain text, XML, OpenDocument (OpenOffice.org), Microsoft Office (Word, Excel, PowerPoint), PDF, RTF, and MP3 (ID3 tags) are now all parsed by the Tika plugin. The only parser plugins now shipped with Nutch are Feed (RSS/Atom), HTML, Ext, JavaScript, SWF, Tika & ZIP.
* Distributed filesystem (via Hadoop)
* Link-graph database
* NTLM authentication

Features (Web-Scraping-SDK)
• XPath-driven extraction of content
• Just one method to implement
• Allows easy file writing, database storage, or formatted string/object return
• PSR2 coding standards
• Uses cURL to retrieve content from the specified source
• Configurable retry count and pause time for failed attempts
• Easily follow links to get additional content
Languages	Languages
Source Type Open	Source Type Open
License Type Apache	License Type Proprietary
OS Type	OS Type
Pricing Free Trial No Card, By Quotation	Pricing Free (see site)
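
The Web-Scraping-SDK feature list above mentions XPath-driven extraction and a configurable retry count and pause time for failed fetches. The snippet below is a minimal Python sketch of that general pattern, not the SDK's own API (the SDK itself is not shown on this page). The names `fetch_with_retry`, `extract`, and `SAMPLE` are hypothetical, and the sketch assumes well-formed markup, since the standard-library parser only handles XML and a limited XPath subset.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, List

# Hypothetical sample document; real pages would come from an HTTP fetch.
SAMPLE = (
    "<html><body>"
    "<h2 class='title'>First</h2><p>skip</p><h2 class='title'>Second</h2>"
    "</body></html>"
)


def fetch_with_retry(fetch: Callable[[], str], retries: int = 3, pause: float = 1.0) -> str:
    """Call fetch() up to `retries` times, sleeping `pause` seconds between failures."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except OSError as exc:  # network errors such as urllib's URLError derive from OSError
            last_error = exc
            if attempt < retries:
                time.sleep(pause)
    raise RuntimeError(f"gave up after {retries} attempts") from last_error


def extract(markup: str, xpath: str) -> List[str]:
    """Return the stripped text of every element matching the XPath expression."""
    root = ET.fromstring(markup)
    return [(node.text or "").strip() for node in root.findall(xpath)]


if __name__ == "__main__":
    # The fetch step is injected as a callable so it can be swapped for a real HTTP call.
    markup = fetch_with_retry(lambda: SAMPLE, retries=3, pause=0.5)
    print(extract(markup, ".//h2[@class='title']"))
```

Passing the fetch step in as a callable keeps the retry logic independent of the transport, so the same loop could wrap a cURL-style HTTP request or a local file read. Real-world HTML is often not well-formed XML, so a dedicated HTML parser would likely be needed in place of `xml.etree.ElementTree`.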
