Web Crawler

This document discusses web crawling and various open source frameworks for web crawling across different programming languages. It notes that Python is highly used for web crawling due to its efficiency, distribution capabilities, libraries like Requests and LXML, support for non-blocking I/O, and scalable options like Scrapy. Common Python web crawling frameworks mentioned include Scrapy, Apache Nutch, BeautifulSoup, and Mechanize. Java, PHP, Ruby, C#, C++, and cross-platform options are also listed along with example frameworks in each.

NSSPL-HP @vishnu simmha

Web Crawling
Research conducted on web crawling and open source frameworks across languages

Open Source Platforms

Web Crawler
(Also known as ants, automatic indexers, bots, web spiders, web robots or web scutters)

Top 5 Web Programming Languages

JAVA
PYTHON
RUBY
PHP
C#, C++, CROSS PLATFORM

Open source frameworks in each language:

1. PYTHON BASED
APACHE NUTCH
SCRAPY
KIMONO
SCRAPING HUB
IMPORT.IO
GRUB

2. JAVA BASED
WEBCOLLECTOR
CRAWLER4J
EX-CRAWLER
BIXO
WEB-HARVEST
JOBO
ARACHNID
SMART AND SIMPLE WEB CRAWLER
WEBLECH
CAPEK
GRUNK
LARM
ARALE
SPINDLE
METIS
APERTURE
HOUNDER
WEB EATER
ANDJING
PYCREEP
LUCENE

3. PHP BASED
SPHIDER
OPEN WEB SPIDER

4. RUBY BASED
ANEMONE
CLOUD-CRAWLER

5. C#, C++ AND CROSS PLATFORM
DATAPARK SEARCH
GNU WGET
GRU
HT://DIG
HTTRACK
ICDL CRAWLER
MNOGOSEARCH
OPEN SEARCH SERVER
ASPSEEK
HYPER ESTRAIER
OPEN WEB SPIDER
PAVUK
XAPIAN
ARACHNODE.NET
CRAWWWLER
OPESE
CCRAWLER

CONCLUSION:
Python is widely used for web crawling.
Reasons:
Efficient to develop in and well suited to distributed crawling
The Requests library is very powerful while being extremely
simple to use. Python also has a great HTML/XML parser in
lxml (a third-party library, not built in) - an alternative to lxml is Beautiful Soup.
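
The parse step described above can be sketched roughly as follows. To stay dependency-free, this sketch swaps in the standard library's html.parser in place of lxml or Beautiful Soup. The names LinkExtractor, extract_links and SAMPLE_HTML are made up for illustration, and the sample page is hard-coded rather than downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


# Hypothetical page content; a real crawler would download it first,
# for example with requests.get(url).text.
SAMPLE_HTML = '<p><a href="/about">About</a> <a href="http://example.org/x">X</a></p>'
links = extract_links(SAMPLE_HTML, "http://example.com/index.html")
print(links)
```

With lxml or Beautiful Soup the parsing part would typically be shorter and more tolerant of broken markup; the overall shape (download, parse, collect links) stays the same.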
A scripting language like Python or Perl offers excellent text
processing abilities in the form of regular expressions and
low-level string operations. Handling character encodings (which
can be a pain with web crawling) is also very easy to do in
Python - one of my favourite libraries is Unidecode.
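
As a rough illustration of the text processing and encoding handling mentioned above, here is a small sketch using only the standard library. The helper names (decode_body, extract_title, to_ascii) are hypothetical, and to_ascii uses unicodedata as a simple stand-in for Unidecode: it only strips accents and does not transliterate other scripts.

```python
import re
import unicodedata

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def decode_body(raw, declared_encoding=None):
    """Decode response bytes: try the declared encoding, then UTF-8."""
    for encoding in (declared_encoding, "utf-8"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue
    # Latin-1 maps every byte to a character, so this never fails.
    return raw.decode("latin-1")


def extract_title(html):
    """Return the page title with whitespace collapsed, or None."""
    match = TITLE_RE.search(html)
    if not match:
        return None
    return re.sub(r"\s+", " ", match.group(1)).strip()


def to_ascii(text):
    """Strip accents by decomposing characters and dropping the marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Hypothetical response body, encoded as Latin-1 but declared as UTF-8.
raw_body = "<html><title>\n Café  Crawler </title></html>".encode("latin-1")
title = extract_title(decode_body(raw_body, "utf-8"))
print(title, "|", to_ascii(title))
```

Regular expressions are fine for pulling out a single well-delimited field like this; for anything structural, a real HTML parser is usually the safer choice.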
With a web crawler, most of your time is spent on network I/O
and thus making it non-blocking is very important for good
throughput. Python has many libraries and frameworks off the
shelf to support this.
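
A minimal sketch of the non-blocking idea, using the standard library's asyncio. The fetch function here is a placeholder that only sleeps to simulate network latency; a real crawler would replace it with an asynchronous HTTP client. The function names and the max_concurrency parameter are assumptions made for this example.

```python
import asyncio


async def fetch(url, delay=0.05):
    """Stand-in for a network request: waits without blocking the event loop."""
    await asyncio.sleep(delay)
    return url, f"<html>content of {url}</html>"


async def crawl(urls, max_concurrency=5):
    """Fetch all URLs concurrently, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # The semaphore caps how many requests are in flight at once.
        async with semaphore:
            return await fetch(url)

    results = await asyncio.gather(*(bounded_fetch(u) for u in urls))
    return dict(results)


urls = [f"http://example.com/page{i}" for i in range(10)]
pages = asyncio.run(crawl(urls))
print(len(pages))
```

Because the waits overlap, ten simulated requests finish in roughly the time of two batches rather than ten sequential delays, which is the throughput gain the paragraph above refers to.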

Scrapy would be a great choice to build a scalable, distributed
crawler. It is built on top of Twisted (an event-driven networking
engine) and is in use by a few big companies in production
systems. It might be overkill if you are doing a weekend project.
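
For a weekend-sized project, the core loop that a framework manages for you (a queue of URLs to visit plus duplicate filtering) can be sketched in a few lines. This is not Scrapy's API, only an illustration of the idea. The get_links callback and the in-memory SITE graph are hypothetical stand-ins for real downloading and parsing.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=100):
    """Breadth-first crawl: visit each URL once, up to max_pages."""
    frontier = deque([start_url])  # URLs waiting to be fetched
    seen = {start_url}             # URLs already queued, to avoid repeats
    visited = []                   # order in which pages were processed

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical in-memory link graph standing in for real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}
order = crawl("/", lambda url: SITE.get(url, []))
print(order)
```

A production crawler adds politeness delays, robots.txt handling, retries and persistent queues on top of this loop, which is where a framework starts to pay off.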
Mechanize is another powerful library that can do pretty much
anything a user can when browsing - it was originally built in
Perl and now comes in Ruby and Python flavors, among others.

It is often said that much of Googlebot is written in Python.
Google's earliest crawlers were implemented in Python, but the
claim is unverified for the current crawler.
Python is an interpreted scripting language. It suits web
crawling well because of its scripting features, its built-in
memory management and its good facilities for calling and
cooperating with other programs.

Excellent for beginners, yet superb for experts
Highly scalable
Suitable for large projects as well as small ones
Rapid development
Portable
Cross-platform
Embeddable
Easily extensible
Object-oriented
Simple yet elegant
Stable and mature
Powerful standard libs
Wealth of 3rd party packages
Java, by contrast, is used where strong security and
portability are required.
Some tasks are best served by specific languages, and for
crawling Python is the best fit.

Bibliography:
www.quora.com
http://stackoverflow.com/questions/5555930/is-there-any-javascript-web-crawler-framework
http://forums.udacity.com/questions/19039/java-vs-python-forwriting-a-web-crawler
http://en.wikipedia.org/wiki/Web_crawler
https://www.coursera.org/
www.google.com
http://opendata-tools.org/en/data/
http://www.garethjames.net/a-guide-to-web-scrapping-tools/
