web scraping using python

The document provides an overview of web scraping using Python, detailing its definition, purpose, and methods. It highlights tools like Scrapy and Beautiful Soup for extracting and structuring data from web pages, as well as the challenges faced during the scraping process. Additionally, it discusses the advantages of using Scrapy as a framework for efficient web scraping and data management.


Web Scraping with Python

• Dr Vatan Sehrawat
• Asst. Professor, Computer Sc. & Engg. Department
• RBS-SIET Zainabad
• [email protected]
• 8059211113
● What is scraping?
● Why do we scrape?
● How do we do it?
● Challenges
● Scrapy
Scraping

Converting unstructured documents into structured information:
● Extract data from web pages
● Store the data in structured formats
● Useful when the data is not available directly or via APIs
What is Web Scraping?

● Web scraping (web harvesting) is a software technique for extracting information from websites
● It focuses on transforming unstructured data on the web (typically HTML) into structured data that can be stored and analyzed
What is Web Scraping?
● Problem:
○ Static websites
○ No access to APIs to extract the data you need
○ Need to extract data periodically
● Manual solution: go to the website and copy the required data
● Smarter solution: web scraping
Why do we scrape?

● Web pages contain a wealth of information (in text form), designed mostly for human consumption
● Static websites (legacy systems)
● Interfacing with third parties that offer no API access
● Websites are more important than APIs
● The data is already available (in the form of web pages)
● No rate limiting
● Anonymous access
Tools for Scraping

● Scrapy
○ Python framework to extract data from webpages
● Beautiful Soup
○ Python library to parse HTML/XML documents
● Alternatives
○ Selenium
○ Requests
○ Octoparse
Getting started!

How do we do it?

Web Scraping in Python
● Download the web page with urllib.request (urllib2 in Python 2) or requests
● Parse the page with BeautifulSoup/lxml
● Select elements with XPath or CSS selectors
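The selection step can be tried standalone. A minimal sketch using lxml (assumed to be installed; the HTML fragment is made up for illustration):

```python
from lxml import html

# a small HTML fragment standing in for a downloaded page
page = html.fromstring(
    '<ul><li class="lang">Python</li><li class="lang">Go</li><li>other</li></ul>'
)

# XPath: pick only the <li> elements whose class attribute is "lang"
langs = page.xpath('//li[@class="lang"]/text()')
print(langs)  # ['Python', 'Go']
```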
Fetching the data

● Involves finding the endpoint - a URL or URLs
● Sending HTTP requests to the server
● Using the requests library:

import requests

# .content holds the raw response body as bytes
data = requests.get('http://google.com/')
html = data.content
Use BeautifulSoup for parsing

● Provides simple methods to:
○ search
○ navigate
○ select
● Deals with broken web pages really well
● Auto-detects encoding

Philosophy:
“You didn't write that awful page. You're just trying to get
some data out of it. Beautiful Soup is here to help.”
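The search/navigate/select methods can be sketched on a small inline page (BeautifulSoup 4 assumed installed; the HTML and class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# a stand-in for HTML fetched from a real site
html = """
<html><body>
  <div class="post"><h2>First</h2><span class="author">alice</span></div>
  <div class="post"><h2>Second</h2><span class="author">bob</span></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# search: find every post container
posts = soup.find_all("div", class_="post")

# select: CSS selectors work too
authors = [tag.get_text() for tag in soup.select("div.post span.author")]

# navigate: descend from the first post to its <h2> child
first_title = posts[0].h2.get_text()

print(first_title, authors)  # First ['alice', 'bob']
```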
Export the data

● Database (relational or non-relational)
● CSV
● JSON
● File (XML, YAML, etc.)
● API
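The CSV and JSON export targets need only the standard library. A minimal sketch with made-up rows (StringIO stands in for a real file):

```python
import csv
import io
import json

# example rows as a scraper might produce them
rows = [
    {"title": "First", "author": "alice"},
    {"title": "Second", "author": "bob"},
]

# CSV: DictWriter maps dict keys to columns
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["title", "author"])
writer.writeheader()
writer.writerows(rows)
csv_text = buf.getvalue()

# JSON: one dumps call handles the whole list
json_text = json.dumps(rows, indent=2)

print(csv_text)
print(json_text)
```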
Challenges

● External sites can change without warning
○ Figuring out the frequency of change is difficult (test, and keep testing)
○ Changes can break scrapers easily
● Bad HTTP status codes
○ example: using 200 OK to signal an error
○ you cannot always trust your HTTP library's default behaviour
● Messy HTML markup
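One defensive tactic for the "200 OK that really signals an error" case is to inspect the body as well as the status code. A hedged sketch — the marker strings are assumptions about one hypothetical site, not a standard:

```python
# strings that, on this hypothetical site, appear only on error pages
ERROR_MARKERS = ("temporarily unavailable", "captcha", "rate limit")

def looks_like_error(status_code: int, body: str) -> bool:
    """Treat a response as failed if the status is non-200 OR the body
    contains a known error marker, even when the status is 200 OK."""
    if status_code != 200:
        return True
    lowered = body.lower()
    return any(marker in lowered for marker in ERROR_MARKERS)

print(looks_like_error(200, "<h1>Service Temporarily Unavailable</h1>"))  # True
print(looks_like_error(200, "<h1>Product page</h1>"))                     # False
```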
Scrapy - a framework for web scraping

● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ define a model to store items
○ create your spider to extract items
○ write a Pipeline to store them
Scrapy - a fast, high-level screen scraping
and web crawling framework

● Uses XPath to select elements
● Interactive shell scripting
● Using Scrapy:
○ Pick a website
○ Define the data you want to scrape
○ Write the spider to extract the data
○ Run the spider
○ Store the data
Why Scrapy?
● Simple
● Fast
● Productive/extensible
● Portable
● Good docs & healthy community
● Commercial support
Advanced Features (built in)
● Interactive shell for trying XPaths (useful for debugging)
● Selecting and extracting data from HTML sources
● Cleaning and sanitizing the scraped data
● Generating feed exports (JSON, CSV)
● Media pipeline for downloading files
● Middlewares (cookies, HTTP compression, caching, user-agent spoofing, etc.)
