2022 Scraping Without Programming Tutorial

The document provides an overview of web scraping techniques that can be used without programming. It explains how to scrape data from websites using Google Sheets formulas like ImportHTML and ImportXML, and describes point-and-click scraping apps like OutWit Hub and ParseHub. The document also discusses HTML elements and XPATHs that can be used to target specific data fields for extraction.

Uploaded by

Faisal Kareem

tinyurl.com/SunneScrapingTutorial

Scraping Without Programming
© Samantha Sunne
What is scraping?

It means to grab data through code, elbow grease, or whatever other method you have on hand.

How do journalists usually get data?

From Humans: Ask Nicely, or FOIA (Playing Hardball)
From Computers: Download, or Scrape (Playing Hardball)

Web scraping
Today we will extract data from a single webpage. This is different from web crawling, document scraping, and other kinds of scraping.

HTML
We're going to scrape with HTML. This is sometimes called source code.

This is how a website looks to a human.

This is how it looks to a computer.

Our goal is to land somewhere in the middle.

Sometimes source code itself is interesting. Jeb Bush's campaign site included a detailed summary of the movie Die Hard.

HTML elements

HTML is broken into elements.

Elements are wrapped in tags that look like this:

<h1>element</h1>

For example, tables:

<table>
Here is my table, between these table tags.
</table>

Or headers:

<h1>
Here is my header.
</h1>


There are a lot of different elements, identified by tags like <h1>, <li> and <a>. If you don't know what a tag means, use an HTML dictionary.

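For the curious, a few lines of Python can list the tags in a page. This is only a sketch and is not needed for the no-code techniques in this tutorial; the sample page and the name TagLister are made up for illustration.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record the name of every opening tag, in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Made-up sample page, used only for illustration.
SAMPLE_HTML = """
<h1>Example header</h1>
<ul>
  <li><a href="/first">First link</a></li>
  <li><a href="/second">Second link</a></li>
</ul>
"""

lister = TagLister()
lister.feed(SAMPLE_HTML)
print(lister.tags)
```

Each name in the printed list is one element: a header, a list, two list items, and the link inside each item.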
Nested elements
Elements can be inside other elements. That means you can grab an element and all the elements inside it.


1. Table cell: One cell in a table has the tag <td>, which stands for "table data."

2. Table row: A table row has the tag <tr>, and contains table cells inside it.

3. Table: A table element contains both table rows and table cells. It has the tag <table>.


You can grab a single cell, a row, or a whole table.

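The nesting described above can be sketched in Python with the standard library. This assumes a small, well-formed table; the sample data is invented, and real pages are often messier than this.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample table, used only for illustration.
SAMPLE_TABLE = """
<table>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>20</td></tr>
</table>
"""

table = ET.fromstring(SAMPLE_TABLE)        # the whole table
rows = table.findall("tr")                 # the rows directly inside the table
second_row = rows[1]                       # one row
second_row_cells = [td.text for td in second_row.findall("td")]
one_cell = second_row.findall("td")[0].text  # one cell

# Grabbing the table also grabs every element nested inside it.
all_nested_tags = [el.tag for el in table.iter()]

print(second_row_cells)
print(one_cell)
print(all_nested_tags)
```

The last list starts with the table itself and then walks through each row and its cells, which is what "an element and all the elements inside it" means in practice.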
Technique 1
ImportHTML
Type this formula in Google Sheets:

=ImportHTML("url", "element")

The url is the link you are scraping.

The element is the kind of HTML element to import, either "table" or "list".


For example:

=ImportHTML("https://www.fdic.gov/resources/resolutions/bank-failures/failed-bank-list/", "table")

This scrapes a table of failed banks from the FDIC.

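If you are curious what a formula like this does behind the scenes, here is a rough Python sketch of the same idea: read the HTML, find the first table, and turn it into rows. It is an illustration under assumptions, not how Google Sheets actually works. The class name TableScraper and the sample data are made up, nested tables are not handled, and fetching a live page is left out.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the cell text of the first <table> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []           # finished rows
        self._row = None         # cells of the row being read
        self._cell = None        # text pieces of the cell being read
        self._finished = False   # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._finished:
            return
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._finished:
            return
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None
        elif tag == "table":
            self._finished = True


# Made-up sample page, used only for illustration.
SAMPLE_HTML = """
<h1>Example list</h1>
<table>
  <tr><th>Bank</th><th>State</th></tr>
  <tr><td>First Example Bank</td><td>LA</td></tr>
  <tr><td>Second Example Bank</td><td>TX</td></tr>
</table>
"""

scraper = TableScraper()
scraper.feed(SAMPLE_HTML)
for row in scraper.rows:
    print(row)
```

Each printed row corresponds to one row that the formula would place in the spreadsheet.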
Hooray!
We scraped a live webpage.
But the ImportHTML formula is pretty limited.
Let's try something more advanced.

Technique 2
ImportXML
Nested elements

Not all data is in a convenient table.

Instead, you can use an XPATH.

What is an XPATH?
An XPATH is like an address to a very specific bit of data.

XPATH Examples

All bold text: //b
All headers (large text): //h1
All headers containing the word "country": //h1[contains(.,'country')]
All headers with the class "country-name": //h1[@class='country-name']

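These expressions can be tried out in Python. The sketch below uses the standard library's ElementTree, which supports only a subset of XPATH: paths start with a dot, and contains() is not available, so that one example is imitated with an ordinary Python filter. The sample page is made up.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page, used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <h1 class="country-name">France</h1>
  <h1>Every country in the list</h1>
  <h1>About this page</h1>
  <p>Some <b>bold</b> and <b>bolder</b> text.</p>
</body></html>
"""

page = ET.fromstring(SAMPLE_PAGE)

# //b : all bold text
bold = [b.text for b in page.findall(".//b")]

# //h1 : all headers
headers = [h.text for h in page.findall(".//h1")]

# //h1[@class='country-name'] : headers with that class
by_class = [h.text for h in page.findall(".//h1[@class='country-name']")]

# //h1[contains(.,'country')] : imitated in Python, since ElementTree
# has no contains() function
with_word = [
    h.text for h in page.findall(".//h1")
    if "country" in "".join(h.itertext())
]

print(bold, headers, by_class, with_word, sep="\n")
```

Note that the header "France" matches the class example but not the contains example, because contains looks at the visible text rather than the class attribute.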
Nested elements

You can also use nested elements in an XPATH, just like we saw with HTML.

XPATH Examples
//table[@id='vaccines']/tr[56]/td[3]

//                       on this page
table[@id='vaccines']    in the table called 'vaccines'
tr[56]                   in row 56
td[3]                    in cell 3

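A nested path like this can also be sketched in Python. The example below assumes a tiny made-up page, so it asks for row 2 rather than row 56, and it relies on the limited XPATH support in the standard library's ElementTree. Positions count from 1, as they do in XPATH.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample page, used only for illustration.
SAMPLE_PAGE = """
<html><body>
  <table id="other"><tr><td>x</td></tr></table>
  <table id="vaccines">
    <tr><td>Region</td><td>Doses</td><td>Percent</td></tr>
    <tr><td>North</td><td>100</td><td>40</td></tr>
    <tr><td>South</td><td>200</td><td>60</td></tr>
  </table>
</body></html>
"""

page = ET.fromstring(SAMPLE_PAGE)

# On this page, in the table called 'vaccines', in row 2, in cell 3.
cell = page.find(".//table[@id='vaccines']/tr[2]/td[3]")
print(cell.text)
```

The other table on the page is skipped because its id does not match, which is the point of addressing the table by name.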
Having trouble finding the XPATH? You can also use the Web Inspector.

XPATH
Now that we know what XPATH is (more or less), let's use it to scrape something a lot more specific than tables.

importXML

Type this formula in Google Sheets:

=ImportXML("url", "XPATH")

The url is the link you are scraping.

The XPATH is the address of the data.


For example:

=ImportXML("https://source.opennews.org/jobs/", "//h3")

This scrapes all the headers (that is, job posts) from the OpenNews job board.

ImportHTML and ImportXML

Those are just the basics. You can find plenty of in-depth tutorials on ImportHTML, ImportXML, and other formulas like ImportFEED.

Technique 3
Point-and-Click Apps
OutWit Hub
OutWit Hub is a desktop app that can identify each HTML element on a webpage and scrape it. The free version lets you download 100 rows at a time.

ParseHub
ParseHub is a desktop app that can identify and scrape elements and sub-elements. The free version lets you scrape 200 pages at a time.

WebScraper
WebScraper is a browser extension that helps you scrape stuff through the Web Inspector. It only sometimes works.

Disclaimer:
Free apps come and go. They may not be up to date when you're reading this. But that's why we learned the code instead.

And that's it!
Find me with questions. I also recommend my newsletter Tools for Reporters for cool stuff like this. Good luck!