web scraping using python

The document provides an overview of web scraping using Python, detailing its definition, purpose, and methods. It highlights tools like Scrapy and Beautiful Soup for extracting and structuring data from web pages, as well as the challenges faced during the scraping process. Additionally, it discusses the advantages of using Scrapy as a framework for efficient web scraping and data management.


Web Scraping with Python

• Dr Vatan Sehrawat
• Asst. Professor, Computer Sc. & Engg. Department
• RBS-SIET Zainabad
• [email protected]
• 8059211113
● What is scraping?
● Why do we scrape?
● How do we do it?
● Challenges
● Scrapy
Scraping

Converting unstructured documents into structured information:
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
What is Web Scraping?

● Web scraping (web harvesting) is a software technique for extracting information from websites
● It focuses on transforming unstructured data on the web (typically HTML) into structured data that can be stored and analyzed
What is Web Scraping?
● Problem:
○ Static websites
○ No access to APIs to extract the data you need
○ Need to extract data periodically
● Manual solution: go to the website and copy the required data
● Smarter solution: web scraping
Why do we scrape?

● Web pages contain a wealth of information (in text form), designed mostly for human consumption
● Static websites (legacy systems)
● Interfacing with third parties that offer no API access
● Websites are more important than APIs
● The data is already available (in the form of web pages)
● No rate limiting
● Anonymous access
Tools for Scraping

● Scrapy
○ Python framework to extract data from webpages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Getting started!

How do we do it?

Web Scraping in Python
● Download the web page with urllib.request (urllib2 in Python 2) or requests
● Parse the page with BeautifulSoup/lxml
● Select elements with XPath or CSS selectors
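The selection step can be tried standalone. A minimal sketch using lxml (assumed to be installed; the HTML fragment is made up for illustration):

```python
from lxml import html

# a small HTML fragment standing in for a downloaded page
page = html.fromstring(
    '<ul><li class="lang">Python</li><li class="lang">Go</li><li>other</li></ul>'
)

# XPath: pick only the <li> elements whose class attribute is "lang"
langs = page.xpath('//li[@class="lang"]/text()')
print(langs)  # ['Python', 'Go']
```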
Fetching the data

● Involves finding the endpoint - a URL or URLs
● Sending HTTP requests to the server
● Using the requests library:

import requests

# .content holds the raw response body as bytes
data = requests.get('http://google.com/')
html = data.content
Use BeautifulSoup for parsing

● Provides simple methods to:
○ search
○ navigate
○ select
● Deals with broken web pages really well
● Auto-detects encoding

Philosophy:
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
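The search/navigate/select methods can be sketched on a small inline page (BeautifulSoup 4 assumed installed; the HTML and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# a stand-in for HTML fetched from a real site
html = """
<html><body>
  <div class="post"><h2>First</h2><span class="author">alice</span></div>
  <div class="post"><h2>Second</h2><span class="author">bob</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# search: find every post container
posts = soup.find_all("div", class_="post")

# select: CSS selectors work too
authors = [tag.get_text() for tag in soup.select("div.post span.author")]

# navigate: descend from the first post to its <h2> child
first_title = posts[0].h2.get_text()

print(first_title, authors)  # First ['alice', 'bob']
```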
Export the data

● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
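The CSV and JSON export targets need only the standard library. A minimal sketch with made-up rows (StringIO stands in for a real file):

```python
import csv
import io
import json

# example rows as a scraper might produce them
rows = [
    {"title": "First", "author": "alice"},
    {"title": "Second", "author": "bob"},
]

# CSV: DictWriter maps dict keys to columns
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: one dumps call handles the whole list
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```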
Challenges

● External sites can change without warning
○ Figuring out the frequency of change is difficult (test, and keep testing)
○ Changes can break scrapers easily
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ you cannot always trust your HTTP library's default behaviour
● Messy HTML markup
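One defensive tactic for the "200 OK that really signals an error" case is to inspect the body as well as the status code. A hedged sketch — the marker strings are assumptions about one hypothetical site, not a standard:

```python
# strings that, on this hypothetical site, appear only on error pages
ERROR_MARKERS = ("temporarily unavailable", "captcha", "rate limit")

def looks_like_error(status_code: int, body: str) -> bool:
    """Treat a response as failed if the status is non-200 OR the body
    contains a known error marker, even when the status is 200 OK."""
    if status_code != 200:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in ERROR_MARKERS)

print(looks_like_error(200, "<h1>Service Temporarily Unavailable</h1>"))  # True
print(looks_like_error(200, "<h1>Product page</h1>"))                     # False
```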
Scrapy - a framework for web scraping

● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Scrapy - a fast, high-level screen scraping
and web crawling framework

● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ Pick a website
○ Define the data you want to scrape
○ Write the spider to extract the data
○ Run the spider
○ Store the data
Why Scrapy?
● Simple
● Fast
● Productive/extensible
● Portable
● Good docs & healthy community
● Commercial support
Advanced Features (built in)
● Interactive shell for trying XPaths (useful for debugging)
● Selecting and extracting data from HTML sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading files
● Middlewares (cookies, HTTP compression, caching, user-agent spoofing, etc.)
