Apache Nutch

Apache Nutch is a highly extensible and scalable open-source web crawler software project.

Features
Nutch is coded entirely in the Java programming language,
but data is written in language-independent formats. It has a
highly modular architecture, allowing developers to create
plug-ins for media-type parsing, data retrieval, querying and
clustering.

The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.

Original author(s): Doug Cutting, Mike Cafarella
Developer(s): Apache Software Foundation
Stable release: 1.x: 1.19 / 22 August 2022[1]; 2.x: 2.4 / 11 October 2019[1]
Repository: Nutch Repository (https://gitbox.apache.org/repos/asf?p=nutch.git)
Written in: Java
Operating system: Cross-platform
Type: Web crawler
License: Apache License 2.0
Website: nutch.apache.org (https://nutch.apache.org)

History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.

In June, 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.

In January, 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April, 2010, Nutch has been considered an independent, top level project of the Apache Software Foundation.[2]

In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.[3]

While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no
longer the case.

Release history

Version (branch, release date): Description

1.1 (1.x branch, 2010-06-06): This release includes several major upgrades of existing libraries (Hadoop, Solr, Tika, etc.) on which Nutch depends. Various bug fixes, and speedups (e.g., to Fetcher2) have also been included.

1.2 (1.x branch, 2010-10-24): This release includes several improvements (addition of parse-html as a selectable parser again, configurable per-field indexing), new features (including adding timing information to all Tool classes, and implementation of parser timeouts), and bug fixes (fixing an NPE in distributed search, fixing of XML formatting issues per Document fields).

1.3 (1.x branch, 2011-06-07): This release includes several improvements (improved RSS parsing support, tighter integration with Apache Tika, external parsing support, improved language identification and an order of magnitude smaller source release tarball, only about 2 MB).

1.4 (1.x branch, 2011-11-26): This release includes several improvements including allowing Parsers to declare support for multiple MIME types, configurable Fetcher Queue depth, Fetcher speed improvements, tighter Tika integration, and support for HTTP auth in Solr indexing.

1.5 (1.x branch, 2012-06-07): This release includes several improvements including upgrades of several major components including Tika 1.1 and Hadoop 1.0.0, improvements to LinkRank and WebGraph elements as well as a number of new plugins covering blacklisting, filtering and parsing to name a few.

2.0 (2.x branch, 2012-07-07): This release offers users an edition focused on large scale crawling which builds on storage abstraction (via Apache Gora) for big data stores such as Apache Accumulo, Apache Avro, Apache Cassandra, Apache HBase, HDFS, an in memory data store and various high-profile SQL stores.

1.5.1 (1.x branch, 2012-07-10): This release is a maintenance release of the popular 1.5.X mainstream version of Nutch which has been widely adopted within the community.

2.1 (2.x branch, 2012-10-05): This release continues to provide Nutch users with a simplified Nutch distribution building on the 2.x development drive which is growing in popularity amongst the community. As well as addressing ~20 bugs this release also offers improved properties for better Solr configuration, upgrades to various Gora dependencies and the introduction of the option to build indexes in elastic search.

1.6 (1.x branch, 2012-12-06): This release includes over 20 bug fixes, the same in improvements, as well as new functionalities including a new HostNormalizer, the ability to dynamically set fetchInterval by MIME-type and functional enhancements to the Indexer API including the normalization of URLs and the deletion of robots noIndex documents. Other notable improvements include the upgrade of key dependencies to Tika 1.2 and Automaton 1.11-8.

2.2 (2.x branch, 2013-06-08): This release includes over 30 bug fixes and over 25 improvements representing the third release of increasingly popular 2.x Nutch series. This release features inclusion of Crawler-Commons which Nutch now utilizes for improved robots.txt parsing, library upgrades to Apache Hadoop 1.1.1, Apache Gora 0.3, Apache Tika 1.2 and Automaton 1.11-8.

1.7 (1.x branch, 2013-06-24): This release includes over 20 bug fixes, as many improvements; most noticeably featuring a new pluggable indexing architecture which currently supports Apache Solr and Elastic Search. Shadowing the recent Nutch 2.2 release, parsing of Robots.txt is now delegated to Crawler-Commons. Key library upgrades have been made to Apache Hadoop 1.2.0 and Apache Tika 1.3.

2.2.1 (2.x branch, 2013-07-02): This release includes library upgrades to Apache Hadoop 1.2.0 and Apache Tika 1.3, it is predominantly a bug fix for NUTCH-1591 - Incorrect conversion of ByteBuffer to String.

1.8 (1.x branch, 2014-03-17): Although this release includes library upgrades to Crawler Commons 0.3 and Apache Tika 1.5, it also provides over 30 bug fixes as well as 18 improvements.

2.3 (2.x branch, 2015-01-22): Nutch 2.3 release now comes packaged with a self-contained Apache Wicket-based Web Application. The SQL backend for Gora has been deprecated.[4]

1.10 (1.x branch, 2015-05-06): This release includes library upgrades to Tika 1.6, also provides over 46 bug fixes as well as 37 improvements and 12 new features.[5]

1.11 (1.x branch, 2015-12-07): This release includes library upgrades to Hadoop 2.X, Tika 1.11, also provides over 32 bug fixes as well as 35 improvements and 14 new features.[6]

2.3.1 (2.x branch, 2016-01-21): This bug fix release contains around 40 issues addressed.

1.12 (1.x branch, 2016-06-18)

1.13 (1.x branch, 2017-04-02)

1.14 (1.x branch, 2017-12-23)

1.15 (1.x branch, 2018-08-09)

1.16 (1.x branch, 2019-10-11)

2.4 (2.x branch, 2019-10-11): Expected to be the last release on the 2.X series, as "no committer is actively working on it".[7]

1.17 (1.x branch, 2020-07-02)

1.18 (1.x branch, 2021-01-24)

Scalability
IBM Research studied the performance[8] of Nutch/Lucene as part of its Commercial Scale Out (CSO)
project.[9] Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance
level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.

The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31
documents per second.[10]
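
To put the reported crawl rate in more familiar units, the following is a minimal arithmetic sketch. It assumes the average of 755.31 documents per second is sustained continuously, which a real crawl would not do exactly. The function names are hypothetical, and the 100-million-page workload is borrowed from the 2003 demonstration system mentioned under History purely as an illustration; it is not the size of ClueWeb09.

```python
# Average crawl rate reported for the ClueWeb09 crawl (figure from the text).
DOCS_PER_SECOND = 755.31

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def docs_in(seconds: float, rate: float = DOCS_PER_SECOND) -> float:
    """Documents fetched in `seconds`, assuming a constant average rate."""
    return rate * seconds


def days_to_crawl(total_docs: float, rate: float = DOCS_PER_SECOND) -> float:
    """Days needed to fetch `total_docs` at a constant average rate."""
    return total_docs / (rate * SECONDS_PER_DAY)


docs_per_day = docs_in(SECONDS_PER_DAY)

if __name__ == "__main__":
    print(f"Documents per day: {docs_per_day:,.0f}")
    # Illustrative workload only: 100 million pages, as in the 2003 demo.
    print(f"Days for 100 million pages: {days_to_crawl(100_000_000):.2f}")
```

At this rate the crawl covers roughly 65 million documents per day, so a hypothetical 100-million-page crawl would take about a day and a half.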

Related projects
Hadoop – Java framework that supports distributed applications running on large clusters.

Search engines built with Nutch


Common Crawl – publicly available internet-wide crawls, started using Nutch in 2014.[3]
Creative Commons Search – an implementation of Nutch, used in the period of 2004–
2006.[11][12][13]
DiscoverEd – Open educational resources search prototype developed by Creative
Commons
Krugle uses Nutch to crawl web pages for code, archives and technically interesting content.
mozDex (inactive)
Wikia Search – launched 2008, closed down 2009[14][15]

See also
Free and open-source software portal
Faceted search
Information extraction
Enterprise search

References
1. "Apache Nutch™ - Downloads" (https://nutch.apache.org/download/). Retrieved 27 September 2022.
2. "Apache Nutch -" (http://nutch.apache.org/#News). nutch.apache.org.
3. "Common Crawl's Move to Nutch – Common Crawl – Blog" (http://blog.commoncrawl.org/2014/02/common-crawl-move-to-nutch/). blog.commoncrawl.org. Retrieved 2015-10-14.
4. "Nutch 2.3 Release" (http://nutch.apache.org/#22-january-2015-nutch-23-release). Apache Nutch News. The Apache Software Foundation. 22 January 2015. Retrieved 18 January 2016.
5. "Nutch 1.10 Release Notes" (https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12327187). ASF JIRA. The Apache Software Foundation. 6 May 2015. Retrieved 18 January 2016.
6. "Nutch 1.11 Release Notes" (https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12329358). ASF JIRA. The Apache Software Foundation. 7 December 2015. Retrieved 18 January 2016.
7. "Nutch 2.4 Release" (https://nutch.apache.org/news/legacy-nutch-news/#11-october-2019---nutch-24-release). Apache Nutch News. The Apache Software Foundation. 11 October 2019. Retrieved 20 May 2022.
8. "Scalability of the Nutch search engine" (http://www.cecs.uci.edu/~papers/ipdps07/pdfs/SMTPS-201-paper-1.pdf) (PDF).
9. "Base Operating System Provisioning and Bringup for a Commercial Supercomputer" (https://web.archive.org/web/20081203064621/http://weather.ou.edu/~apw/projects/cso/prov_paper.pdf) (PDF). Archived from the original (http://weather.ou.edu/~apw/projects/cso/prov_paper.pdf) (PDF) on December 3, 2008.
10. The Sapphire Web Crawler - Crawl Statistics (http://boston.lti.cs.cmu.edu/crawler/crawlerstats.html). Boston.lti.cs.cmu.edu (2008-10-01). Retrieved on 2013-07-21.
11. "Our Updated Search" (https://creativecommons.org/weblog/entry/4388). Creative Commons. 2004-09-03.
12. "Creative Commons Unique Search Tool Now Integrated into Firefox 1.0" (https://web.archive.org/web/20100107065707/http://creativecommons.org/press-releases/entry/5064). Creative Commons. 2004-11-22. Archived from the original (https://creativecommons.org/press-releases/entry/5064) on 2010-01-07.
13. "New CC search UI" (https://creativecommons.org/weblog/entry/6002). Creative Commons. 2006-08-02.
14. "Where can I get the source code for Wikia Search?" (https://web.archive.org/web/20111104010718/http://answers.wikia.com/wiki/Where_can_I_get_the_source_code_for_Wikia_Search). Archived from the original (http://answers.wikia.com/wiki/Where_can_I_get_the_source_code_for_Wikia_Search) on 2011-11-04. Retrieved 2010-02-12.
15. "Update on Wikia – doing more of what's working | Jimmy Wales" (http://jimmywales.com/2009/03/31/update-on-wikia/). 31 March 2009.

Bibliography
Shoberg, J (October 26, 2006). Building Search Applications with Lucene and Nutch (https://web.archive.org/web/20091202104144/http://www.apress.com/book/view/9781590596876) (1st ed.). Apress. p. 350. ISBN 978-1-59059-687-6. Archived from the original (http://www.apress.com/book/view/9781590596876) on December 2, 2009. Retrieved August 15, 2009.

External links
Official website (https://nutch.apache.org)
Retrieved from "https://en.wikipedia.org/w/index.php?title=Apache_Nutch&oldid=1147629176"
