2022 Scraping Without Programming Tutorial
com/SunneScrapingTutorial
Scraping Without
Programming
© Samantha Sunne
What is scraping?
How do journalists usually get data?
Web scraping
Today we will extract data
from a single webpage.
This is different from
web crawling, document
scraping, and other kinds
of scraping.
HTML
We're going to scrape with HTML, the markup language that webpages
are written in. A page's HTML is sometimes called its source code.
This is how a
website looks
to a human.
This is how it
looks to a
computer.
Our goal is to
land
somewhere in
the middle.
Sometimes source code
itself is interesting.
Jeb Bush's campaign
site included a detailed
summary of the movie
Die Hard.
HTML elements
<h1>element</h1>
For example, tables:
<table>
Here is my table, between these table tags.
</table>
Or headers:
<h1>
Here is my header.
</h1>
Nested elements
Elements can be inside
other elements. That
means you can grab an
element and all the
elements inside it.
A table element contains both table
rows and table cells. It has the tag <table>.
Rows have the tag <tr>, and cells have the tag <td>.
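To make the nesting concrete, here is a minimal Python sketch. It is not part of the spreadsheet techniques in this tutorial, and the tiny table in it is made up for illustration. It uses Python's built-in `xml.etree.ElementTree`, which only works on tidy, well-formed markup, so treat it as a demonstration of the idea rather than a real scraper.

```python
import xml.etree.ElementTree as ET

# Made-up sample markup, invented for this illustration.
SAMPLE = """
<table>
  <tr><td>Bank</td><td>City</td></tr>
  <tr><td>First Example Bank</td><td>Springfield</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE.strip())

# Grabbing the table gives us every row, and every cell inside each row.
rows = [[cell.text for cell in row.findall("td")] for row in table.findall("tr")]

# iter() walks the table element itself plus everything nested inside it.
nested_tags = [element.tag for element in table.iter()]

print(rows)
print(nested_tags)
```

Once you hold the outer `<table>` element, the rows and cells come along with it. That is the same reason a single formula can pull a whole table at once.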
Technique 1
ImportHTML
importHTML
=ImportHTML("url", "element")
For example:
=ImportHTML("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/", "table")
Hooray!
We scraped a live webpage.
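For the curious, here is a rough Python sketch of the kind of work a table-scraping formula has to do: find the first `<table>` in a page's HTML and collect the text in each cell. This is only an illustration of the idea, not how Google Sheets actually implements ImportHTML. The `TableParser` class and the sample page are hypothetical, written for this example, and the sketch ignores complications such as tables nested inside other tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collects the cell text of the first <table> in a page (simple sketch)."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows, each a list of cell strings
        self._table_index = 0    # how many <table> tags we have seen so far
        self._in_target = False  # True while inside the first table
        self._row = None         # row being built, or None outside a <tr>
        self._cell = None        # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table_index += 1
            self._in_target = self._table_index == 1
        elif self._in_target and tag == "tr":
            self._row = []
        elif self._row is not None and tag in ("td", "th"):
            self._cell = []

    def handle_endtag(self, tag):
        if tag == "table":
            self._in_target = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)


# Made-up sample page, invented for this illustration.
SAMPLE_PAGE = """
<html><body>
<h1>Failed banks (made-up sample)</h1>
<table>
  <tr><th>Bank Name</th><th>City</th></tr>
  <tr><td>First Example Bank</td><td>Springfield</td></tr>
  <tr><td>Second Example Bank</td><td>Riverton</td></tr>
</table>
</body></html>
"""

parser = TableParser()
parser.feed(SAMPLE_PAGE)
table_rows = parser.rows

for row in table_rows:
    print(row)
```

Each printed row corresponds to one spreadsheet row that a formula like ImportHTML would fill in.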
But the ImportHTML formula is pretty limited: it only grabs tables and lists.
Let's try something more advanced.
Technique 2
ImportXML
What is an XPATH?
An XPATH is like an address to a very
specific bit of data.
XPATH Examples
//table[@id='vaccines']/tr[56]/td[3]
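To see how an address like this is read, here is a small Python sketch that follows a similar path through a made-up page: find the table whose id is "vaccines", go to its second row, then take the third cell. The sample data is invented, and the row number is 2 instead of 56 only to keep the example short. Python's built-in `xml.etree.ElementTree` supports only a subset of XPath and needs the path to start with a dot, so this is an approximation of the idea, not a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Made-up sample page, invented for this illustration.
SAMPLE = """
<html><body>
<table id="other"><tr><td>ignore me</td></tr></table>
<table id="vaccines">
  <tr><td>Region</td><td>Doses</td><td>Rate</td></tr>
  <tr><td>North</td><td>1200</td><td>61%</td></tr>
  <tr><td>South</td><td>900</td><td>48%</td></tr>
</table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Read the path left to right:
#   .//table[@id='vaccines']  -> any table on the page with id "vaccines"
#   /tr[2]                    -> its second row (counting starts at 1)
#   /td[3]                    -> the third cell in that row
cell = root.find(".//table[@id='vaccines']/tr[2]/td[3]")
value = cell.text

print(value)
```

Changing the numbers in the brackets moves the address to a different row or cell, just as changing a house number moves you along a street.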
XPATH
Now that we know what
XPATH is (more or less),
let's use it to scrape
something a lot more
specific than tables.
importXML
=ImportXML("url", "XPATH")
For example:
=ImportXML("https://source.opennews.org/jobs/", "//h3")
This scrapes all the headers (that is, job posts) from
the OpenNews job board.
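As a rough illustration of what "//h3" asks for, the Python sketch below collects the text of every `<h3>` element from a made-up page. The job titles and page layout are invented and are not taken from the real OpenNews site. The sketch assumes tidy, well-formed markup, which real webpages often are not.

```python
import xml.etree.ElementTree as ET

# Made-up sample page, invented for this illustration.
SAMPLE = """
<html><body>
<h1>Jobs</h1>
<div><h3><a href="/jobs/1">Data Reporter</a></h3><p>Example Gazette</p></div>
<div><h3><a href="/jobs/2">News Apps Developer</a></h3><p>Sample Tribune</p></div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# ".//h3" means: every <h3> element, no matter how deeply it is nested.
# itertext() gathers the text inside each header, including text in links.
headers = ["".join(h3.itertext()).strip() for h3 in root.findall(".//h3")]

print(headers)
```

The path does not care where each header sits on the page, so one short XPath can collect every job post at once.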
ImportHTML and ImportXML
Technique 3
Point-and-Click Apps
OutWit Hub
OutWit Hub is a desktop
app that can identify
each HTML element on a
webpage and scrape it.
The free version lets you
download 100 rows at a
time.
ParseHub
ParseHub is a desktop
app that can identify and
scrape elements and
sub-elements. The free
version lets you scrape
200 pages at a time.
WebScraper
WebScraper is a browser
extension that helps you
scrape stuff through the
Web Inspector. It only
sometimes works.
Disclaimer:
Free apps come and go.
They may not be up to date when you're reading
this. But that's why we learned the code instead.
And that's it!
Find me with questions.
I also recommend my newsletter Tools for
Reporters for cool stuff like this. Good luck!