Web Scraping
From http://scrapy.org/
NP-10

Scrapy is a Python framework for scraping web pages and extracting structured data. It can be used for tasks like data mining, information processing, and archiving. Scrapy includes tools to define items to hold scraped data, spiders to scrape specific domains, and selectors to pull data from pages using XPath or CSS expressions. It can scrape both websites and APIs. The scraped data can then be stored in various formats, such as JSON.

Scrapy at a glance
Scrapy is an application framework for crawling web sites and extracting structured data, which can be used for a wide range of useful applications, like data mining, information processing, or historical archival.
It can also be used to extract data using APIs.
Scrapy is written in Python; to install it:
pip install scrapy
When you need to extract some information from a website, but the website doesn't provide any API or mechanism to access that info programmatically, Scrapy can help you extract that information.
Creating a project
scrapy startproject tutorial
This generates a project directory (sketched below) containing:
scrapy.cfg: the project configuration file
tutorial/: the project's Python module; you'll later import your code from here
tutorial/items.py: the project's items file
tutorial/pipelines.py: the project's pipelines file
tutorial/settings.py: the project's settings file
tutorial/spiders/: a directory where you'll later put your spiders
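A sketch of the resulting layout (the project name "tutorial" is an assumption, mirroring the official Scrapy tutorial this deck follows):

tutorial/
    scrapy.cfg
    tutorial/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py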
Defining our Item
Items are containers that will be loaded with the scraped data.
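A minimal items.py sketch, following the dmoz example from the official tutorial (the field names title/link/desc are the tutorial's choice, not mandated):

from scrapy.item import Item, Field

class DmozItem(Item):
    # one Field per piece of data to capture
    title = Field()
    link = Field()
    desc = Field()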
Our first Spider
Spiders are user-written classes used to scrape information from a domain.
Three main mandatory attributes (illustrated in the sketch below):
name
start_urls
parse()
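A first-spider sketch using the Scrapy 0.16-era API this deck reflects (BaseSpider; newer Scrapy uses scrapy.Spider):

from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"            # unique identifier, used by "scrapy crawl dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # called with the downloaded response of each start URL;
        # here it simply saves the page body to a local file
        filename = response.url.split("/")[-2]
        open(filename, "wb").write(response.body)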
Extracting Items
There are several ways to extract data from web pages; Scrapy uses selectors based on XPath expressions.
Here are some examples of XPath expressions and their meanings:

/html/head/title: selects the <title> element, inside the <head> element of an HTML document
/html/head/title/text(): selects the text inside the aforementioned <title> element.
//td: selects all the <td> elements
//div[@class="mine"]: selects all <div> elements which contain an attribute class="mine"
Selectors have three methods (see the shell example after this list):

select(): returns a list of selectors, each of them representing the nodes selected by the XPath expression given as argument.
extract(): returns a unicode string with the data selected by the XPath selector.
re(): returns a list of unicode strings extracted by applying the regular
expression given as argument.
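These are easiest to try interactively; a sketch using the 0.16-era shell, where hxs is a ready-made selector for the fetched page (newer Scrapy exposes response.xpath instead):

scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
>>> hxs.select('//title')                      # a list of selectors
>>> hxs.select('//title/text()').extract()     # the title text as unicode
>>> hxs.select('//title/text()').re('(\w+):')  # unicode strings matching the regex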
Extracting the data
hxs.select('//ul/li')
hxs.select('//ul/li/text()').extract() #description
hxs.select('//ul/li/a/text()').extract() #title
hxs.select('//ul/li/a/@href').extract() #links
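In context, a parse() sketch that ties these calls to the DmozItem defined earlier (mirrors the official tutorial; hxs is an HtmlXPathSelector built from the response):

from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    items = []
    for site in hxs.select('//ul/li'):
        item = DmozItem()
        item['title'] = site.select('a/text()').extract()
        item['link'] = site.select('a/@href').extract()
        item['desc'] = site.select('text()').extract()
        items.append(item)
    return items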
Crawling
scrapy crawl dmoz
2013-05-06 12:08:02+0700 [scrapy] INFO: Scrapy 0.16.4 started (bot: scrapybot)
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled extensions: FeedExporter,
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled downloader middlewares:
HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware,
RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware,
CookiesMiddleware, HttpCompressionMiddleware, ChunkedTransferMiddleware,
DownloaderStats
2013-05-06 12:08:03+0700 [scrapy] DEBUG: Enabled spider middlewares:
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware,
DepthMiddleware
Storing the scraped data
scrapy crawl dmoz -o items.json -t json

[{"url": ["http://www.network-theory.co.uk/python/intro/"],
"name": ["An Introduction to Python"],
"description": ["By Guido van Rossum, Fred L. Drake, Jr.;
Network Theory Ltd., 2003, ISBN 0954161769. Printed edition of official tutorial,
for v2.x, from Python.org. [Network Theory, online]"]},
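Since the export is plain JSON, reading it back is straightforward; a quick sketch with the standard-library json module (file name taken from the command above):

import json

with open('items.json') as f:
    items = json.load(f)
print(items[0]['name'])  # the first item's 'name' list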
Other languages?
Just Google "scraping with ..." plus your language of choice :D
