There’s Always
an API
Sometimes they make you work for it
Hi, I’m Matt Dennewitz
• VP Product, Pitchfork; Dir. Engineering, Wired
• I consult on data for baseball writers and
MLB clubs
• @mattdennewitz on Twitter, Github
Agenda
• 101
• Your first scrape: Google Docs
• Interlude: HTML, JSON, XML, XPath
• Scaling up: Python
• What happens when the data isn’t on the page?
• Advanced topics (time allowing)
What is scraping?
• Extracting information from a document
• Rows from an HTML table
• Text from a PDF
• Images from Craigslist posts or museum websites
• OCR’ing an image and reading its text
• Spidering a website like Google
Tools
• Google Docs (surprise!)
• Chrome Developer Tools
• Python
• Scrapy
Strategy
1. “What do I want?”
2. Case the joint
3. Rob it just a little bit
4. Move in
“What do I want?”
• Envision the data you want, how you need it
• “How will I scrape this data?” Script? Crawler?
• “Do I need to scrape this more than once?”
• “How do I need to shape the data?”
• “What do I need to do to the data after I have it?”
Clean, verify, cross-link with another data set, …?
• “How/to where do I want to output the data?”
Case the joint
• Does the document seem scrape-ready? Does
access come with preconditions?
• Preconditions: password-protected? Online-only?
Needs a special decoder?
• Look at how the data is presented in the
document. Are there external dependencies, or
is it self-contained?
• External deps: more information on secondary
pages, data in other spreadsheets or workbooks
Rob it just a little bit
• Prototype using a subset of the information
• Estimate how long scraping will take, determine
imperative needs like throttling or a specific OS
• Validate your ideas about the data you wish
to extract, correct bugs
• Write unit tests
Oceans 1101
• You’ve created a stable scraper which emits data
in the format you want (CSV, JSON, XML, SQL, …)
to the location you want
• You understand its performance characteristics
• Go!
Interlude: formats
• Data is distributed in mercilessly innumerable
formats
• The Big Three of web scraping
• HTML
• JSON
• XML
Formats: XML
• eXtensible Markup Language
• Well-structured, self-validating, predictable
• Pedantic, though not without its charms
Formats: XML
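• For example, a single prospect record might look like this (an illustrative fragment, not from a real feed; fields borrowed from the JSON sample shown later):
<?xml version="1.0" encoding="UTF-8"?>
<prospects year="2017">
  <player rank="1" position="OF">
    <first_name>Andrew</first_name>
    <last_name>Benintendi</last_name>
    <team>BOS</team>
  </player>
</prospects>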
Formats: HTML
• Hypertext something something something
• XML-like, without the upside
• Needs a stronger class of parser to heal broken code
• Less predictable, far more susceptible to
changes in the wind
Formats: HTML
<p>
  1.
  <strong>
    <span class="playerdef"><a href="http://www.baseballprospectus.com/card/card.php?id=102123">Alex Reyes</a></span>,
    RHP,
    <span class="teamdef"><a href="http://www.baseballprospectus.com/team_audit.php?team=SLN" target="blank">St. Louis Cardinals</a></span>
  </strong><br>
  Scouting Report: <a href="http://www.baseballprospectus.com/article.php?articleid=30958">LINK</a>
</p>
Formats: JSON
• JavaScript Object Notation
• Data objects with simple primitives: int, double,
string, boolean, object (key/value pairs), array
(untyped), null.
• Requires waaaaaaay less parsing, much easier to
serialize
• No schemas, but validation tools exist
• Has taken over for XML in web data transmission
Formats: JSON
{
"prospect_year": "2017",
"player_id": 643217,
"player_first_name": "Andrew",
"player_last_name": "Benintendi",
"rank": 1,
"position": "OF",
"preseason100": 1,
"preseason20": 1,
"team_file_code": "BOS",
}
Bonus: XPath
• XPath is a way to query XML (and HTML)
• It’s got a super goofy syntax
• Very powerful, essential for scraping the web
Bonus: XPath
• XPath: //table/tbody/tr
• HTML (fragment):
<table>
<thead>
<tr>
<th>Name</th><th>HR</th><th>SB</th>
</tr>
</thead>
<tbody>
<tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
</tbody>
</table>
• Result: <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
Bonus: XPath
• XPath: //span[@class="playerdef"]/text()
• HTML:
<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF, …</p>
• Result: “Eloy Jiminez”
Ok, time to scrape
Google Docs
• Fire up Google Docs, start a new spreadsheet
• IMPORTXML and IMPORTHTML are your friends
• Let’s look at IMPORTHTML
IMPORTHTML
• Allows you to pull in specific list or tabular data from a web page
• Syntax:
=IMPORTHTML(url, <"list" or "table">, [index])
IMPORTHTML
• ESPN Home Run Tracker
• Syntax:
=IMPORTHTML("http://www.hittrackeronline.com/?perpage=1000", "table", 17)
• "Give me the 17th table on the page" (the index starts at 1)
IMPORTHTML
• Brooks Baseball Player Pitch Logs
• Syntax:
=IMPORTHTML("http://
www.brooksbaseball.net/pfxVB/
tabdel_expanded.php?
pitchSel=458584&game=gid_2016_06_27_bosm
lb_tbamlb_1/
&s_type=&h_size=700&v_size=500",
"table")
Google Docs
• Useful for pulling in single tables, or keeping
everything in a spreadsheet
• Data doesn’t always exist in a single place
• Spread across several pages
• Spread across several files or APIs
• Automate as much as you can
Python time
• Beautiful language. Transcendental even.
• Robust ecosystem for handling data parsing,
cleaning, making net requests, etc
• A+ community
• Runs anywhere
Python time
• I’m going to use two non-standard packages
today:
• lxml, for HTML parsing and cleaning
• requests, for HTTP fetching
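Python time
• A minimal sketch of the two together (the URL here is a placeholder; the XPath reuses the "playerdef" pattern from the interlude):

import requests
from lxml import html

# Fetch a page and parse its (possibly messy) HTML into a tree
resp = requests.get("http://example.com/some-prospect-list")
tree = html.fromstring(resp.content)

# Evaluate an XPath query against the parsed document
for name in tree.xpath('//span[@class="playerdef"]/text()'):
    print(name)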
Strategy (again)
1. “What do I want?”
2. Case the joint
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB, Baseball America
2. Case the joint
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit: Get a feel for BP's HTML structure, examine MLB's JSON file.
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB.
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit: Get a feel for BP's HTML structure, examine MLB's JSON file.
4. Move in: Write one script for each source.
Strategy (again)
• Fields to export:
• Name
• Rank
• List type (“BP”, “MLB”, …)
• System ID (MLBAM ID, BP player ID, …)
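• For example, one row of the target output might look like this (illustrative values, borrowed from the MLB sample shown later):
name,rank,list_type,system_id
Andrew Benintendi,1,MLB,643217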
BP
• http://www.baseballprospectus.com/article.php?articleid=31160
BP
• First thing to do is inspect the source
• Is there a pattern in the HTML you can engineer for,
or an attribute you can target?
• Let's head to the console! Right-click on one of the capsules and click "Inspect"
BP
• Yes! Player data is in a paragraph tag, <p>, which
contains a <span> with class “playerdef”
• Get used to talking like this
• Using XPath, we can target that <span> and walk
up to its parent element, <p>, which gives us
access to the whole player capsule
BP
• Beware: the “playerdef” class could be used anywhere.
We need to find a reasonable scope for our XPath.
• Luckily for us, player capsules are in a <div> with class
“article”, and that structure appears only once per
article page across BP.
• XPath: //div[@class="article"]//span[@class="playerdef"]/..
• What else?
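BP
• A sketch of the approach (not the exact script linked below; it assumes the capsules appear on the page in rank order):

import csv
import requests
from lxml import html

URL = "http://www.baseballprospectus.com/article.php?articleid=31160"

# Parse the article page, then grab each element wrapping a "playerdef" span
tree = html.fromstring(requests.get(URL).content)
capsules = tree.xpath('//div[@class="article"]//span[@class="playerdef"]/..')

rows = []
for rank, capsule in enumerate(capsules, start=1):
    # The player's name is the text inside the "playerdef" span
    name = "".join(capsule.xpath('.//span[@class="playerdef"]//text()')).strip()
    rows.append({"name": name, "rank": rank, "list_type": "BP"})

with open("bp-2017.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "rank", "list_type"])
    writer.writeheader()
    writer.writerows(rows)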
BP
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-top-101-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-2017.csv
BP
• What did we do?
• Inspected the page
• Found critical path to data, wrote supporting
XPaths
• Scripted collecting and outputting the data
MLB
• http://m.mlb.com/prospects/2017
MLB
• Again, start by inspecting the source
• Try to find “Benintendi” or “Moncada” in
the HTML
• “uhh”
MLB
• Websites love to load data asynchronously.
• LOVE to
• Let’s head to the Inspector’s Network panel to
poke around and find the source
• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I
(Mac), then select “Network”
• Let’s start by looking under “XHR”, the typical
place to look for dynamically loaded data
MLB
• “playerProspects.json” looks promising
• We know it’s a JSON file
• The filename is a pretty dead giveaway
• When we open it up, it has a ton of prospect data
MLB
• Here, we have a JSON file
• Let’s inspect the structure to find exactly what
attributes we would like to scrape
• Fast-forward: the “prospect_players” key has
prospects for all teams! And it has the Top 100
under the “prospects” key.
MLB
{
"prospect_year": "2017",
"player_id": 643217,
"player_first_name": "Andrew",
"player_last_name": "Benintendi",
"rank": 1,
"position": "OF",
"preseason100": 1,
"preseason20": 1,
"team_file_code": "BOS",
}
MLB
• Using Python’s out-of-box JSON parser, we can
easily parse this file and extract players
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-top-100-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-2017.csv
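MLB
• A sketch of the same idea (the JSON URL below is a placeholder; use the playerProspects.json address you found in the Network panel):

import csv
import json
import requests

# Placeholder: substitute the real playerProspects.json URL here
URL = "http://m.mlb.com/.../playerProspects.json"

data = json.loads(requests.get(URL).text)

# Per the structure above: all teams live under "prospect_players",
# and the Top 100 sits under its "prospects" key
prospects = data["prospect_players"]["prospects"]

with open("mlb-2017.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "rank", "list_type", "system_id"])
    writer.writeheader()
    for p in prospects:
        writer.writerow({
            "name": "{} {}".format(p["player_first_name"], p["player_last_name"]),
            "rank": p["rank"],
            "list_type": "MLB",
            "system_id": p["player_id"],  # MLBAM ID
        })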
Recap
• We’ve used the four step approach to plan for
consistent output across disparate systems
• We’ve used tools like the Inspector to probe for
data
• We’ve written very simple yet powerful scripts in
Python to download prospect lists
• We’ve streamlined the data into a consistent shape
• Our scripts are easily reusable
Next steps
• Since we were clever and included system IDs,
we can tie it all together using a baseball player
ID registry
• Chadwick Register
• Smart Fantasy Baseball
• Crunchtime
Tools
• Hopefully there’s time to talk about this
Tools
• requests: A beautiful HTTP library
• lxml: A beautiful XML and HTML parsing library.
Tricky to install on Windows, but binaries are available.
• BeautifulSoup: another A+ HTML parser
• Scrapy: a very robust Python framework for
crawling websites
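Tools
• A toy Scrapy spider, just to show the shape of the framework (the URL and fields are hypothetical; run it with "scrapy runspider spider.py -o out.json"):

import scrapy

class ProspectSpider(scrapy.Spider):
    name = "prospects"
    start_urls = ["http://example.com/prospects"]

    def parse(self, response):
        # Scrapy responses support XPath queries directly
        for row in response.xpath("//table/tbody/tr"):
            yield {
                "name": row.xpath("./td[1]/text()").extract_first(),
                "hr": row.xpath("./td[2]/text()").extract_first(),
            }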
Code
• The code and output from this session are online at: https://github.com/mattdennewitz/2017-sloan-data-scraping
Thanks!
• Questions?
• If we have some time left, we could try a bit of
live coding
• If you have very specific scraping questions, find
me after and let’s talk