tinyurl.com/SunneScrapingTutorial
Scraping Without
Programming
© Samantha Sunne
What is scraping?
It means to grab data
through code, elbow
grease, or whatever other
method you have on hand.
How do journalists usually get data?
From Humans: Ask Nicely, or FOIA (Playing Hardball)
From Computers: Download, or Scrape (Playing Hardball)
Web scraping
Today we will extract data
from a single webpage.
This is different from
web crawling, document
scraping, and other kinds
of scraping.
HTML
We're going to scrape with HTML. This
is sometimes called source code.
This is how a
website looks
to a human.
This is how it
looks to a
computer.
Our goal is to
land
somewhere in
the middle.
Sometimes source code
itself is interesting.
Jeb Bush's campaign
site included a detailed
summary of the movie
Die Hard.
HTML elements
HTML is broken into elements.
Elements are wrapped in tags that look like this:
<h1>element</h1>
For example, tables:
<table>
Here is my table, between these table tags.
</table>
Or headers:
<h1>
Here is my header.
</h1>
There are a lot of different elements, identified by
tags like <h1>, <li> and <a>. If you don't know what a
tag means, use an HTML dictionary.
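If you are curious what "elements wrapped in tags" looks like to a program, here is a minimal Python sketch that lists every opening tag in a small snippet. The snippet and the class name TagLister are made up for illustration; nothing here comes from a real page.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# A made-up snippet for illustration, not taken from any real page.
SAMPLE_HTML = """
<h1>Here is my header.</h1>
<ul>
  <li>First item</li>
  <li><a href="https://example.com">A link</a></li>
</ul>
"""

lister = TagLister()
lister.feed(SAMPLE_HTML)
print(lister.tags)
```

Each name in the printed list is one element: a header, a list, two list items, and a link.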
Nested elements
Elements can be inside
other elements. That
means you can grab an
element and all the
elements inside it.
1. Table cell: One cell in a table has the tag <td>, which stands for "table data."
2. Table row: A table row has the tag <tr>, and contains table cells inside it.
3. Table: A table element contains both table rows and table cells. It has the tag <table>.
You can grab a cell from a
table, a row, or a whole
table.
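The same idea can be sketched in Python: grab the whole table, one row, or one cell. This assumes a tiny, perfectly tidy table (made up here for illustration); real webpages are usually messier, and this simple parser would likely choke on them.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed table for illustration only.
SAMPLE_TABLE = """
<table>
  <tr><td>Alabama</td><td>12</td></tr>
  <tr><td>Alaska</td><td>3</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE_TABLE)   # the whole <table> element
rows = table.findall("tr")            # every <tr> directly inside it

# The whole table: a list of rows, each a list of cell texts.
whole_table = [[td.text for td in tr.findall("td")] for tr in rows]

# One row: the cells of the first <tr>.
first_row = [td.text for td in rows[0].findall("td")]

# One cell: the first <td> of the second <tr>.
one_cell = rows[1].findall("td")[0].text

print(whole_table)
print(first_row)
print(one_cell)
```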
Technique 1
ImportHTML
importHTML
Type this formula in Google Sheets:
=ImportHTML("url", "element")
The url is the link you are scraping.
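The element is the kind of HTML element to grab, either "table" or "list".
Sheets also accepts a third value, a number (starting at 1) that says which table or list on the page you want.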
For example:
=ImportHTML("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/", "table")
This scrapes a table of failed banks from the FDIC.
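For the curious, here is roughly what a formula like this does, sketched in Python. This is not how Google Sheets is implemented; it is a simplified illustration that pulls the rows out of the first table in a chunk of HTML. The class name TableExtractor and the sample bank names are made up, and the sketch does not handle tables nested inside other tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every row in the first <table> it sees."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows, each a list of cell strings
        self._row = None        # the row being built, or None
        self._cell = None       # text pieces of the cell being built, or None
        self._tables_seen = 0

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._tables_seen += 1
        elif self._tables_seen == 1 and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# Made-up sample page; these are not real FDIC records.
SAMPLE_PAGE = """
<table>
  <tr><th>Bank Name</th><th>State</th></tr>
  <tr><td>Example Bank One</td><td>LA</td></tr>
  <tr><td>Example Bank Two</td><td>TX</td></tr>
</table>
"""

extractor = TableExtractor()
extractor.feed(SAMPLE_PAGE)
for row in extractor.rows:
    print(row)
```

Each printed row corresponds to one row that would land in your spreadsheet.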
Hooray!
We scraped a live webpage.
But the ImportHTML formula is pretty limited.
Let's try something more advanced.
Technique 2
ImportXML
Nested elements
Not all data is in a
convenient table.
Instead, you can use an
XPATH.
What is an XPATH?
An XPATH is like an address to a very
specific bit of data.
XPATH Examples
All bold text                                 //b
All headers (large text)                      //h1
All headers containing the word "country"     //h1[contains(.,'country')]
All headers with the class "country-name"     //h1[@class='country-name']
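To see XPATH addresses like these in action outside a spreadsheet, here is a small Python sketch using the built-in ElementTree module. Two assumptions to flag: the sample page is made up, and ElementTree only understands a subset of XPATH. It wants paths to start with ".//" instead of "//", and it has no contains(), so the "containing the word" example is done with an ordinary Python filter instead.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page for illustration only.
SAMPLE = """
<html><body>
  <h1 class="country-name">Our country report</h1>
  <h1>Other news</h1>
  <p>Some <b>bold</b> and <b>bolder</b> text.</p>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# All bold text: //b
bold = [b.text for b in root.findall(".//b")]

# All headers: //h1
headers = [h.text for h in root.findall(".//h1")]

# All headers with the class "country-name": //h1[@class='country-name']
by_class = [h.text for h in root.findall(".//h1[@class='country-name']")]

# All headers containing the word "country".
# ElementTree has no contains(), so filter in Python instead.
with_word = [
    h.text for h in root.findall(".//h1")
    if "country" in "".join(h.itertext())
]

print(bold, headers, by_class, with_word, sep="\n")
```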
Nested elements
You can also use nested
elements in an XPATH,
just like we saw with
HTML.
XPATH Examples
//table[@id='vaccines']/tr[56]/td[3]
//                       on this page
table[@id='vaccines']    in the table called 'vaccines'
tr[56]                   in row 56
td[3]                    in cell 3
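Here is the same kind of nested address tried out in Python, as a rough sketch. The page below is made up and much shorter than 56 rows, so the example asks for row 2, cell 3 instead. Note that XPATH counts from 1, not 0, and that ElementTree needs the path to start with ".//".

```python
import xml.etree.ElementTree as ET

# A made-up page with two tables; only one has the id "vaccines".
SAMPLE = """
<html><body>
  <table id="other"><tr><td>x</td></tr></table>
  <table id="vaccines">
    <tr><td>Region</td><td>Doses</td><td>Rate</td></tr>
    <tr><td>North</td><td>120</td><td>61%</td></tr>
    <tr><td>South</td><td>95</td><td>48%</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# On this page, in the table with id 'vaccines', in row 2, in cell 3.
cell = root.find(".//table[@id='vaccines']/tr[2]/td[3]")
print(cell.text)
```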
Having trouble
finding the
XPATH?
You can also use
the Web Inspector.
XPATH
Now that we know what
XPATH is (more or less),
let's use it to scrape
something a lot more
specific than tables.
importXML
Type this formula in Google Sheets:
=ImportXML("url", "XPATH")
The url is the link you are scraping.
The XPATH is the address of the data.
For example:
=ImportXML("https://source.opennews.org/jobs/", "//h3")
This scrapes all the headers (that is, job posts) from
the OpenNews job board.
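If you ever want to try the same thing in code, a rough Python sketch might look like this. It is not what Google Sheets does internally. It collects the text inside every element with a given tag, which is what the address //h3 asks for. The names TagTextCollector, collect_text and fetch are made up for this example, and the sample job titles are invented.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TagTextCollector(HTMLParser):
    """Collect the text inside every element with a given tag, like //h3."""

    def __init__(self, wanted_tag):
        super().__init__()
        self.wanted_tag = wanted_tag
        self.texts = []
        self._parts = None   # text pieces of the element being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == self.wanted_tag:
            self._parts = []

    def handle_data(self, data):
        if self._parts is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == self.wanted_tag and self._parts is not None:
            # Join the pieces and squeeze runs of whitespace into one space.
            self.texts.append(" ".join("".join(self._parts).split()))
            self._parts = None


def collect_text(html, tag):
    """Return the text of every <tag> element found in an HTML string."""
    collector = TagTextCollector(tag)
    collector.feed(html)
    return collector.texts


def fetch(url):
    """Download a page as text. Needs a network connection."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


# Made-up sample, so the example runs without going online.
SAMPLE = (
    "<h3><a href='#'>Data Reporter</a></h3>"
    "<p>Apply now</p>"
    "<h3>News Apps Developer</h3>"
)
print(collect_text(SAMPLE, "h3"))
```

To try it on a live page, you could call collect_text(fetch(url), "h3") with the link you are scraping. Whether that returns job posts depends on how the page is built at the time you run it.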
ImportHTML and ImportXML
That's just the basics. You can find plenty of in-depth
tutorials on ImportHTML, ImportXML, and other
formulas like ImportFEED.
Technique 3
Point-and-Click Apps
OutWit Hub
OutWit Hub is a desktop
app that can identify
each HTML element on a
webpage and scrape it.
The free version lets you
download 100 rows at a
time.
ParseHub
ParseHub is a desktop
app that can identify and
scrape elements and
sub-elements. The free
version lets you scrape
200 pages at a time.
WebScraper
WebScraper is a browser
extension that helps you
scrape stuff through the
Web Inspector. It only
sometimes works.
Disclaimer:
Free apps come and go.
They may not be up to date when you're reading
this. But that's why we learned the code instead.
And that's it!
Find me with questions.
I also recommend my newsletter Tools for
Reporters for cool stuff like this. Good luck!