There’s Always
an API
Sometimes they make you work for it
Hi, I’m Matt Dennewitz
• VP Product, Pitchfork; Dir. Engineering, Wired
• I consult on data for baseball writers and
MLB clubs
• @mattdennewitz on Twitter, Github
Agenda
• 101
• Your first scrape: Google Docs
• Interlude: HTML, JSON, XML, XPath
• Scaling up: Python
• What happens when the data isn’t on the page?
• Advanced topics (time allowing)
What is scraping?
• Extracting information from a document
• Rows from an HTML table
• Text from a PDF
• Images from Craigslist posts or museum websites
• OCR’ing an image and reading its text
• Spidering a website like Google
Tools
• Google Docs (surprise!)
• Chrome Developer Tools
• Python
• Scrapy
Strategy
1. “What do I want?”
2. Case the joint
3. Rob it just a little bit
4. Move in
“What do I want?”
• Envision the data you want, how you need it
• “How will I scrape this data?” Script? Crawler?
• “Do I need to scrape this more than once?”
• “How do I need to shape the data?”
• “What do I need to do to the data after I have it?”
Clean, verify, cross-link with another data set, …?
• “How/to where do I want to output the data?”
Case the joint
• Does the document seem scrape-ready? Does
access come with preconditions?
• Preconditions: password-protected? Online-only?
Needs a special decoder?
• Look at how the data is presented in the
document. Are there external dependencies, or
is it self-contained?
• External deps: more information on secondary
pages, data in other spreadsheets or workbooks
Rob it just a little bit
• Prototype using a subset of the information
• Estimate how long scraping will take, determine
imperative needs like throttling or a specific OS
• Validate your ideas about the data you wish
to extract, correct bugs
• Write unit tests
Oceans 1101
• You’ve created a stable scraper which emits data
in the format you want (CSV, JSON, XML, SQL, …)
to the location you want
• You understand its performance characteristics
• Go!
Interlude: formats
• Data is distributed in mercilessly innumerable
formats
• The Big Three of web scraping
• HTML
• JSON
• XML
Formats: XML
• eXtensible Markup Language
• Well-structured, self-validating, predictable
• Pedantic, though not without its charms
Formats: XML
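• For example, a single prospect record might look like this (an illustrative fragment, not from a real feed; fields borrowed from the JSON sample shown later):
<?xml version="1.0" encoding="UTF-8"?>
<prospects year="2017">
  <player rank="1" position="OF">
    <first_name>Andrew</first_name>
    <last_name>Benintendi</last_name>
    <team>BOS</team>
  </player>
</prospects>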
Formats: HTML
• Hypertext something something something
• XML-like, without the upside
• Needs a stronger class of parser to heal broken code
• Less predictable, far more susceptible to
changes in the wind
Formats: HTML
<p>
  1.
  <strong>
    <span class="playerdef"><a href="http://www.baseballprospectus.com/card/card.php?id=102123">Alex Reyes</a></span>,
    RHP,
    <span class="teamdef"><a href="http://www.baseballprospectus.com/team_audit.php?team=SLN" target="blank">St. Louis Cardinals</a></span>
  </strong><br>
  Scouting Report: <a href="http://www.baseballprospectus.com/article.php?articleid=30958">LINK</a>
</p>
Formats: JSON
• JavaScript Object Notation
• Data objects with simple primitives: int, double,
string, boolean, object (key/value pairs), array
(untyped), null.
• Requires waaaaaaay less parsing, much easier to
serialize
• No schemas, but validation tools exist
• Has taken over for XML in web data transmission
Formats: JSON
{
"prospect_year": "2017",
"player_id": 643217,
"player_first_name": "Andrew",
"player_last_name": "Benintendi",
"rank": 1,
"position": "OF",
"preseason100": 1,
"preseason20": 1,
"team_file_code": "BOS",
}
Bonus: XPath
• XPath is a way to query XML (and HTML)
• It’s got a super goofy syntax
• Very powerful, essential for scraping the web
Bonus: XPath
• XPath: //table/tbody/tr
• HTML (fragment):
<table>
<thead>
<tr>
<th>Name</th><th>HR</th><th>SB</th>
</tr>
</thead>
<tbody>
<tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
</tbody>
</table>
• Result: <tr><td>Mike Trout</td><td>40</td><td>40</td></tr>
Bonus: XPath
• XPath: //span[@class="playerdef"]/text()
• HTML:
<p>1. <strong><span class="playerdef">Eloy Jiminez</span></strong>, OF, …</p>
• Result: “Eloy Jiminez”
Ok, time to scrape
Google Docs
• Fire up Google Docs, start a new spreadsheet
• IMPORTXML and IMPORTHTML are your friends
• Let’s look at IMPORTHTML
IMPORTHTML
• Allows you to pull in specific list or tabular data from a web page
• Syntax:
=IMPORTHTML(url, <"list" or "table">, [index])
IMPORTHTML
• ESPN Home Run Tracker
• Syntax:
=IMPORTHTML("http://www.hittrackeronline.com/?perpage=1000", "table", 17)
• "Give me the 17th table on the page" (the index starts at 1)
IMPORTHTML
• Brooks Baseball Player Pitch Logs
• Syntax:
=IMPORTHTML("http://
www.brooksbaseball.net/pfxVB/
tabdel_expanded.php?
pitchSel=458584&game=gid_2016_06_27_bosm
lb_tbamlb_1/
&s_type=&h_size=700&v_size=500",
"table")
Google Docs
• Useful for pulling in single tables, or keeping
everything in a spreadsheet
• Data doesn’t always exist in a single place
• Spread across several pages
• Spread across several files or APIs
• Automate as much as you can
Python time
• Beautiful language. Transcendental even.
• Robust ecosystem for handling data parsing,
cleaning, making net requests, etc
• A+ community
• Runs anywhere
Python time
• I’m going to use two non-standard packages
today:
• lxml, for HTML parsing and cleaning
• requests, for HTTP fetching
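Python time
• A minimal sketch of the two together (the URL here is a placeholder; the XPath reuses the "playerdef" pattern from the interlude):

import requests
from lxml import html

# Fetch a page and parse its (possibly messy) HTML into a tree
resp = requests.get("http://example.com/some-prospect-list")
tree = html.fromstring(resp.content)

# Evaluate an XPath query against the parsed document
for name in tree.xpath('//span[@class="playerdef"]/text()'):
    print(name)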
Strategy (again)
1. “What do I want?”
2. Case the joint
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB, Baseball America
2. Case the joint
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit: Get a feel for BP's HTML structure, examine MLB's JSON file.
4. Move in
Strategy (again)
1. “What do I want?”: prospect rankings from BP,
MLB.
2. Case the joint: BP has dirty HTML. MLB loads a
JSON file.
3. Rob it just a little bit: Get a feel for BP's HTML structure, examine MLB's JSON file.
4. Move in: Write one script for each source.
Strategy (again)
• Fields to export:
• Name
• Rank
• List type (“BP”, “MLB”, …)
• System ID (MLBAM ID, BP player ID, …)
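• For example, one row of the target output might look like this (illustrative values, borrowed from the MLB sample shown later):
name,rank,list_type,system_id
Andrew Benintendi,1,MLB,643217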
BP
• http://www.baseballprospectus.com/article.php?articleid=31160
BP
• First thing to do is inspect the source
• Is there a pattern in the HTML you can engineer for,
or an attribute you can target?
• Let's head to the console! Right-click on one of the capsules and click "Inspect"
BP
• Yes! Player data is in a paragraph tag, <p>, which
contains a <span> with class “playerdef”
• Get used to talking like this
• Using XPath, we can target that <span> and walk
up to its parent element, <p>, which gives us
access to the whole player capsule
BP
• Beware: the “playerdef” class could be used anywhere.
We need to find a reasonable scope for our XPath.
• Luckily for us, player capsules are in a <div> with class
“article”, and that structure appears only once per
article page across BP.
• XPath: //div[@class="article"]//span[@class="playerdef"]/..
• What else?
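BP
• A sketch of the approach (not the exact script linked below; it assumes the capsules appear on the page in rank order):

import csv
import requests
from lxml import html

URL = "http://www.baseballprospectus.com/article.php?articleid=31160"

# Parse the article page, then grab each element wrapping a "playerdef" span
tree = html.fromstring(requests.get(URL).content)
capsules = tree.xpath('//div[@class="article"]//span[@class="playerdef"]/..')

rows = []
for rank, capsule in enumerate(capsules, start=1):
    # The player's name is the text inside the "playerdef" span
    name = "".join(capsule.xpath('.//span[@class="playerdef"]//text()')).strip()
    rows.append({"name": name, "rank": rank, "list_type": "BP"})

with open("bp-2017.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "rank", "list_type"])
    writer.writeheader()
    writer.writerows(rows)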
BP
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-top-101-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/bp-2017.csv
BP
• What did we do?
• Inspected the page
• Found critical path to data, wrote supporting
XPaths
• Scripted collecting and outputting the data
MLB
• http://m.mlb.com/prospects/2017
MLB
• Again, start by inspecting the source
• Try to find “Benintendi” or “Moncada” in
the HTML
• “uhh”
MLB
• Websites love to load data asynchronously.
• LOVE to
• Let’s head to the Inspector’s Network panel to
poke around and find the source
• In Chrome: Ctrl+Shift+I (Windows) or Cmd+Opt+I
(Mac), then select “Network”
• Let’s start by looking under “XHR”, the typical
place to look for dynamically loaded data
MLB
• “playerProspects.json” looks promising
• We know it’s a JSON file
• The filename is a pretty dead giveaway
• When we open it up, it has a ton of prospect data
MLB
• Here, we have a JSON file
• Let’s inspect the structure to find exactly what
attributes we would like to scrape
• Fast-forward: the “prospect_players” key has
prospects for all teams! And it has the Top 100
under the “prospects” key.
MLB
{
"prospect_year": "2017",
"player_id": 643217,
"player_first_name": "Andrew",
"player_last_name": "Benintendi",
"rank": 1,
"position": "OF",
"preseason100": 1,
"preseason20": 1,
"team_file_code": "BOS",
}
MLB
• Using Python’s out-of-box JSON parser, we can
easily parse this file and extract players
• Code: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-top-100-2017.py
• Output: https://github.com/mattdennewitz/sloan-scraping/blob/master/mlb-2017.csv
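MLB
• A sketch of the same idea (the JSON URL below is a placeholder; use the playerProspects.json address you found in the Network panel):

import csv
import json
import requests

# Placeholder: substitute the real playerProspects.json URL here
URL = "http://m.mlb.com/.../playerProspects.json"

data = json.loads(requests.get(URL).text)

# Per the structure above: all teams live under "prospect_players",
# and the Top 100 sits under its "prospects" key
prospects = data["prospect_players"]["prospects"]

with open("mlb-2017.csv", "w", newline="") as fp:
    writer = csv.DictWriter(fp, fieldnames=["name", "rank", "list_type", "system_id"])
    writer.writeheader()
    for p in prospects:
        writer.writerow({
            "name": "{} {}".format(p["player_first_name"], p["player_last_name"]),
            "rank": p["rank"],
            "list_type": "MLB",
            "system_id": p["player_id"],  # MLBAM ID
        })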
Recap
• We’ve used the four step approach to plan for
consistent output across disparate systems
• We’ve used tools like the Inspector to probe for
data
• We’ve written very simple yet powerful scripts in
Python to download prospect lists
• We’ve streamlined the data into a consistent shape
• Our scripts are easily reusable
Next steps
• Since we were clever and included system IDs,
we can tie it all together using a baseball player
ID registry
• Chadwick Register
• Smart Fantasy Baseball
• Crunchtime
Tools
• Hopefully there’s time to talk about this
Tools
• requests: A beautiful HTTP library
• lxml: A beautiful XML and HTML parsing library.
Tricky to install on Windows, but binaries are available.
• BeautifulSoup: another A+ HTML parser
• Scrapy: a very robust Python framework for
crawling websites
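Tools
• A toy Scrapy spider, just to show the shape of the framework (the URL and fields are hypothetical; run it with "scrapy runspider spider.py -o out.json"):

import scrapy

class ProspectSpider(scrapy.Spider):
    name = "prospects"
    start_urls = ["http://example.com/prospects"]

    def parse(self, response):
        # Scrapy responses support XPath queries directly
        for row in response.xpath("//table/tbody/tr"):
            yield {
                "name": row.xpath("./td[1]/text()").extract_first(),
                "hr": row.xpath("./td[2]/text()").extract_first(),
            }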
Code
• The code and output from this session are online at: https://github.com/mattdennewitz/2017-sloan-data-scraping
Thanks!
• Questions?
• If we have some time left, we could try a bit of
live coding
• If you have very specific scraping questions, find
me after and let’s talk