How To Scrape Without Programming PDF

This document discusses various techniques for scraping data from websites without programming. It explains how to scrape HTML tables and lists from websites using Google Sheets formulas. It also discusses using XPath to scrape XML data and RSS feeds. The document provides examples of scraping job boards and introduces writing scraping code in Ruby. It recommends some point-and-click scraping applications and tools to use.

Uploaded by

Jesus López

Scraping without programming
© Samantha Sunne
What's scraping?
How do journalists get data?

...from humans?                ...from computers?

Easy way :)    Hard way :(     Easy way :)    Hard way :(
Ask nicely     FOIA            Download       Scrape
Scraping is...
catching, collecting or coaxing data off the web.
Web scrapers are often called:

EXTRACTORS if they pull data from a single webpage
CRAWLERS if they pull data from multiple webpages

You'll need a basic understanding of HTML
HTML is meant for computers, but some of it is understandable to humans.
Sometimes the code itself is interesting.

[Image: Jeb Bush's campaign site, whose code includes a summary of the movie "Die Hard"]

[Image: browser view next to HTML view]

You can see it by right-clicking and clicking View Source.

This is called looking "under the hood."
Technique 1:

import HTML with Google Sheets

HTML content is wrapped in tags that look like this:

<tag>content content</tag>

There are a lot of tags, and a lot of them have different abbreviations, so you'll need to refer to an HTML dictionary.
Some of them are intuitive, like tables:

<table>
Here is my table, everything between these table tags.
</table>
Webpages are actually made up of tons of "elements" - images, text chunks, headers, you name it.
You can grab any of them!

Video from Firefox

Elements can be inside each other. But any word you see immediately after a "<" is the name of the element.

<table>                                    A "table" element
<tr>                                       Inside a "table row" element
<td>Hi! I'm text inside a table.</td>      Inside a "table data" element
</tr>
</table>

What we have here is an element inside an element inside an element.

But no worries, you can grab any of them! The larger elements will just include the smaller ones.
Version for humans:

Hi! I'm text inside a table.

Version for computers:

<table>
<tr>
<td>Hi! I'm text inside a table.</td>
</tr>
</table>
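For anyone curious how a program sees that nesting, here is a minimal Ruby sketch. It assumes the snippet is tidy enough to be read by REXML, the XML reader that ships with Ruby; the names NESTED_HTML and element_outline are made up for this illustration.

```ruby
require 'rexml/document'

# Hypothetical sample markup, matching the table > tr > td nesting above.
NESTED_HTML = <<~HTML
  <table>
    <tr>
      <td>Hi! I'm text inside a table.</td>
    </tr>
  </table>
HTML

# Collect each element's name with its nesting depth, outermost first.
def element_outline(element, depth = 0, outline = [])
  outline << [depth, element.name]
  element.elements.each { |child| element_outline(child, depth + 1, outline) }
  outline
end

doc = REXML::Document.new(NESTED_HTML)

# Print one indented line per element to show what sits inside what.
element_outline(doc.root).each do |depth, name|
  puts "#{'  ' * depth}<#{name}>"
end

# The innermost element is the one that holds the text.
puts REXML::XPath.first(doc, '//td').text
```

Real webpages are usually messier than this, so treat it as a picture of the idea rather than a working scraper.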
How to scrape an HTML table with Google Sheets

You can use this Google Sheets formula to scrape a table or a list:

=ImportHTML("url", "element", number of element)

The "url" is the link you are scraping.

The "element" is the kind of element to import: either "table" or "list".
The number is the element's position among the tables (or lists) on the page.

Still confused? Try this tutorial
Example 1
=importHTML("https://sample.com","table",1)

This would import the first table on sample.com

(The first element is number 1. Sheets counts from 1 here, even though many programming tools count from 0.)

You can use View Source to figure out what element and number you want.
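To make the "element" and "number" arguments concrete, here is a rough Ruby stand-in for what the formula does. It is a sketch, not how Google Sheets works internally: it assumes the page is tidy enough for REXML (bundled with Ruby) to read, and it assumes the documented Sheets behavior that the number counts from 1. The names SAMPLE_PAGE, elements_named and import_html are hypothetical.

```ruby
require 'rexml/document'

# Hypothetical page with two tables and one list.
SAMPLE_PAGE = <<~HTML
  <html><body>
    <table><tr><td>Reporter</td><td>New Orleans</td></tr></table>
    <table>
      <tr><td>Editor</td><td>Chicago</td></tr>
      <tr><td>Producer</td><td>Austin</td></tr>
    </table>
    <ul><li>First item</li><li>Second item</li></ul>
  </body></html>
HTML

# Collect every element inside `parent` whose tag name is in `names`, in page order.
def elements_named(parent, names)
  found = []
  parent.each_recursive { |node| found << node if names.include?(node.name) }
  found
end

# element is "table" or "list"; index counts from 1 (assumed to match Sheets).
def import_html(markup, element, index)
  return nil if index < 1

  doc = REXML::Document.new(markup)
  tags = element == 'table' ? ['table'] : %w[ul ol]
  target = elements_named(doc, tags)[index - 1]
  return nil if target.nil?

  if element == 'table'
    # One array per row, one string per cell.
    elements_named(target, ['tr']).map { |row| row.elements.to_a.map(&:text) }
  else
    # One string per list item.
    target.elements.to_a('li').map(&:text)
  end
end

p import_html(SAMPLE_PAGE, 'table', 2)
p import_html(SAMPLE_PAGE, 'list', 1)
```

Asking for a table or list that is not on the page returns nil here, which roughly corresponds to the error Sheets shows when the number is too high.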

Still confused? Try this tutorial
This scrapes <table>'s

For example, the IRE Job Center

I used this formula to scrape the IRE jobs board: https://www.ire.org/jobs/
It also scrapes lists (<ol>'s and <ul>'s)

Again, refer to the HTML dictionary if you need to know the HTML abbreviation for lists

This time, I scraped a list. Remember, this isn't just any old list, but the first "list" HTML element on the page.

If you aren't sure if there is a List element, check the source code. (Right click > View Source > Ctrl+F for "<li>")

The importHTML formula can only scrape lists and tables.
Technique 2:

import XML with Google Sheets

[Image labels: Browser, XML, HTML]

XML is also source code, but it's in more of a "tree" format.
You can see it by right-clicking and clicking "Inspect."
With XML, you can find the "XPath"
An XPath is like an address, or a map, to a very specific spot in the webpage's code.
It looks like this:

This webpage, in the "wikitable" table, in the first row, in the fourth column
//table[@class='wikitable']/tr[1]/td[4]

//                       On this page
table                    the table
[@class='wikitable']     with the class "wikitable"
/tr[1]                   in the first row
/td[4]                   the fourth column

It's easiest to highlight the element you want and Right Click > Inspect on it.
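Here is a small Ruby sketch of that same XPath at work, under the assumption that the page is tidy enough for REXML (bundled with Ruby) to read. The sample page and the names WIKI_PAGE and cell_at are invented for this illustration.

```ruby
require 'rexml/document'

# Hypothetical page: one table to ignore, and one with the class "wikitable".
WIKI_PAGE = <<~HTML
  <html><body>
    <table class="infobox"><tr><td>ignored</td></tr></table>
    <table class="wikitable">
      <tr><td>A1</td><td>B1</td><td>C1</td><td>D1</td></tr>
      <tr><td>A2</td><td>B2</td><td>C2</td><td>D2</td></tr>
    </table>
  </body></html>
HTML

# Return the text of the first element found at the given XPath, or nil if none.
def cell_at(markup, xpath)
  doc = REXML::Document.new(markup)
  node = REXML::XPath.first(doc, xpath)
  node && node.text
end

# First row, fourth column of the "wikitable" table.
puts cell_at(WIKI_PAGE, "//table[@class='wikitable']/tr[1]/td[4]")
```

Changing tr[1] to tr[2] moves down a row, and changing td[4] moves across columns, which is a quick way to check that you are reading the address correctly.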
How to scrape XML data with Google Sheets

You can scrape everything at a particular XPath using this Google Sheets formula:

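=importxml("url","XPath")

For example,

=importxml("https://sample.com","//table[2]//tr")

returns all the table rows ("tr") from the second table ("table[2]") on sample.com. (Strictly, "table[2]" means the second table inside its parent element, which on a simple page is the second table overall.)

Still confused? Try this tutorial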
XPaths open up a whole new world
all bold text: //b

all top-level headers (the largest text): //h1

all headers containing the word 'parish': //h1[contains(.,'parish')]

all headers with the class 'parish_name': //h1[@class='parish_name']

(XPath matching is case-sensitive, so 'parish' will not match 'Parish'.)

See here for an awesome tutorial!
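The four XPaths above can be tried out in Ruby on a small made-up page. This is only a sketch: it assumes REXML (bundled with Ruby) can read the page, and PARISH_PAGE and texts_at are hypothetical names.

```ruby
require 'rexml/document'

# Hypothetical page with three headers and some bold text.
PARISH_PAGE = <<~HTML
  <html><body>
    <h1 class="parish_name">Orleans parish</h1>
    <h1 class="parish_name">Jefferson parish</h1>
    <h1>About this site</h1>
    <p>Data is <b>updated</b> every <b>week</b>.</p>
  </body></html>
HTML

# Return the text of every element found at the given XPath.
def texts_at(markup, xpath)
  doc = REXML::Document.new(markup)
  REXML::XPath.match(doc, xpath).map(&:text)
end

[
  "//b",
  "//h1",
  "//h1[contains(.,'parish')]",
  "//h1[@class='parish_name']",
  "//h1[contains(.,'Parish')]" # capital P: matching is case-sensitive
].each do |xpath|
  puts "#{xpath} -> #{texts_at(PARISH_PAGE, xpath).inspect}"
end
```

The last XPath is there to show the case-sensitivity trap: it looks almost identical to the third one but matches a different set of headers.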
Here, I scraped:

the first column (<td>)
from all the table rows (<tr>)
in the table (<table>)
on ire.org/jobs
But what if your data isn't in a <table>?

Like the Talking Biz News job board

Here, the page source told me that all the job titles are headers.

So I only scraped <h2>'s

For instance, the dates on the Talking Biz News job board are paragraph (<p>) elements with the class 'post-date'
Technique 3:

import RSS feeds with Google Sheets

This job board is available as an RSS feed

The Arab Reporters for Investigative Journalism
How to scrape an RSS feed with Google Sheets

Google Sheets has a third import function you can use for scraping:

=importFeed("source","items")

It scrapes data from an RSS feed,

which is useful if you want to scrape something at regular intervals.

But we’re not going to do that in this session.

If you’re interested, check out this tutorial.
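An RSS feed is just XML with a predictable shape, so pulling out the items is a short job in code too. The Ruby sketch below uses a made-up feed (SAMPLE_FEED) and a hypothetical helper (feed_item_titles), and reads the feed with REXML, which is bundled with Ruby. It is not what Google Sheets does internally.

```ruby
require 'rexml/document'

# Hypothetical RSS feed for a job board with two postings.
SAMPLE_FEED = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example job board</title>
      <item><title>Data reporter</title><link>https://example.com/jobs/1</link></item>
      <item><title>Investigations editor</title><link>https://example.com/jobs/2</link></item>
    </channel>
  </rss>
XML

# Return the title of every <item> in the feed, in feed order.
def feed_item_titles(feed_xml)
  doc = REXML::Document.new(feed_xml)
  REXML::XPath.match(doc, '//item/title').map(&:text)
end

feed_item_titles(SAMPLE_FEED).each { |title| puts title }
```

The XPath //item/title skips the feed's own title, because that one sits directly under the channel and not inside an item.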
Here, I filled the first column of my Google Sheet with the item titles - which in this case, are the job positions
Technique 4:

Writing code!
What a simple scrape would look like in Ruby

require 'nokogiri'   # gem names are lowercase
require 'open-uri'
require 'csv'

url = "https://ire.org/jobs/"

# Download the page and parse its HTML
html = Nokogiri::HTML(URI.open(url))

# Open a CSV file for writing; every field gets wrapped in single quotes
csv = CSV.open("ire_jobs.csv", "w", col_sep: ",", quote_char: "'", force_quotes: true)

# Write one CSV row per table row, one field per table cell
html.xpath('//table//tr').each do |row|
  tarray = []
  row.xpath('td').each do |cell|
    tarray << cell.text
  end
  csv << tarray
end

csv.close

Copy this script here...
Technique 5:

Point-and-click apps
OutWit Hub

OutWit Hub is a desktop app that can identify each HTML element on a webpage and scrape it.

(The free version limits how many elements you can scrape.)
Some other good tools
● DownThemAll (Firefox add-on)
○ Highlights a whole list of links (or images/media) and downloads them
○ Not as robust as OutWit Hub
● Zapier and IFTTT (websites)
○ These automation tools can help keep track of new data, such as new Instagram posts or tweets
○ Try: save RSS feed content to a Google Sheet or get emailed tweets from a certain area
● Web Scraper (Chrome extension)
○ Has you build a sitemap for a site to be crawled
I wouldn't recommend...
● Import.io (desktop app)
○ $300 a month! (But it does have a seven-day free trial.)
● Scraper (Chrome extension)
○ Extremely basic; can only handle the simplest of tables. You'd be better off with Google Sheets
● Helium Scraper (desktop app)
○ Works like other scraping tools, but costs a lot of money. (Although it does have a free trial.)
● InfoExtractor (website)
○ Doesn't work.
If you want to delve further...
How to Feel Like You’re Hacking
My tutorial on getting data and metadata off the web

Using the web inspector for complex scrapes

Eric Sagara’s tutorial on using Python or Ruby for more advanced scraping
For more tools
I reviewed OutWit Hub and DownThemAll for my newsletter, Tools for Reporters. You can find new, useful tools there - scraping and otherwise - every week.

For questions, contact me at [email protected] or @samanthasunne. If I don't know an answer, I can point you to one.
xkcd
