How To Scrape Without Programming PDF
© Samantha Sunne
What's scraping?
How do journalists get data?
Easy way :)    Hard way :(
Ask nicely    FOIA    Download    Scrape
Scraping is...
catching, collecting or coaxing data off the web.
Web scrapers are often called:
EXTRACTORS if they pull data from a single webpage, or
CRAWLERS if they pull data from multiple webpages
You'll need a basic
understanding of HTML
HTML is meant for computers, but some of it is understandable to humans.
Sometimes the code itself is interesting.
Jeb Bush’s campaign site
Summary of the movie “Die Hard”
Browser: content
HTML: <>content</>
There are a lot of tags, and a lot of them have different abbreviations, so you'll need to refer to an
HTML dictionary.
Some of them are intuitive, like tables:
<table>
Here is my table, everything between these table tags.
</table>
Webpages are actually made up of tons of "elements" - images, text chunks, headers, you name it.
You can grab any of them!
<table>    A "table" element
<tr>    Inside a "table row" element
<td>Hi! I'm text inside a table.</td>    Inside a "table data" element
</tr>
</table>
But no worries, you can grab any of them! The larger elements will just include the smaller ones.
Version for humans:
Hi! I'm text inside a table.
Version for computers:
<table>
<tr>
<td>Hi! I'm text inside a table.</td>
</tr>
</table>
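To see how a larger element includes the smaller ones, here is a minimal Ruby sketch using REXML, the XML parser that comes with Ruby. It assumes the snippet is well-formed (many real webpages are not), and the variable names are only for illustration.

```ruby
require 'rexml/document'

# The same tiny table from the slide, stored as a string
page = <<~HTML
  <table>
    <tr>
      <td>Hi! I'm text inside a table.</td>
    </tr>
  </table>
HTML

doc = REXML::Document.new(page)

# Grab the smallest element: the table data cell
cell = REXML::XPath.first(doc, '//td')
cell_text = cell.text

# Grab the largest element: the whole table
table = REXML::XPath.first(doc, '//table')

# The table's markup contains the cell's markup
table_includes_cell = table.to_s.include?(cell.to_s)

puts cell_text
puts "Table includes the cell? #{table_includes_cell}"
```

Grabbing the table gets you the row and the cell along with it, so you can start big and narrow down later.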
How to scrape an HTML table with Google Sheets
You can use this Google Sheets formula to scrape a table or a list:
Still confused? Try this tutorial
Example 1
=importHTML("https://sample.com","table",1)
(If it's the first table on the page, it's number 1, the second is number 2, and so on.)
You can use View Source to figure out what element and number you want.
Still confused? Try this tutorial
This scrapes <table>s:
https://www.ire.org/jobs/
It also scrapes lists.
HTML
This webpage, in the "wikitable" table, in the first row, in the fourth column
//table[@class='wikitable']/tr[1]/td[4]
It's easiest to highlight the element you want and Right Click > Inspect on it.
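Here is a small Ruby sketch of that XPath at work, using REXML (included with Ruby). The page content is made up for illustration, and the sketch assumes well-formed markup. On a live page the path may need adjusting, since the Inspect view can show extra wrapper elements that are not in the raw source.

```ruby
require 'rexml/document'

# A made-up page with one "wikitable" (hypothetical data, for illustration only)
page = <<~HTML
  <html><body>
    <table class="wikitable">
      <tr><td>Row 1, A</td><td>Row 1, B</td><td>Row 1, C</td><td>Row 1, D</td></tr>
      <tr><td>Row 2, A</td><td>Row 2, B</td><td>Row 2, C</td><td>Row 2, D</td></tr>
    </table>
  </body></html>
HTML

doc = REXML::Document.new(page)

# In the "wikitable" table, in the first row, in the fourth column
xpath = "//table[@class='wikitable']/tr[1]/td[4]"
cell = REXML::XPath.first(doc, xpath)
result = cell.text

puts result
```

Change the numbers in the square brackets to pick a different row or column.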
How to scrape XML data with Google Sheets
You can scrape everything at a particular XPath using this Google Sheets formula:
=importxml("url","XPath")
For example,
=importxml("https://sample.com","//table[2]//tr")
returns all the table rows ("tr") from the second table ("table[2]") in sample.com
See here for an awesome tutorial!
Here, I scraped:
Google Sheets has a third import function you can use for scraping:
=importFeed("source","items")
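A feed is just XML with one <item> per story, so the idea behind pulling "items" can be sketched in a few lines of Ruby with REXML. This is only an illustration of the concept, not how Google Sheets does it internally; the feed contents and URLs are made up.

```ruby
require 'rexml/document'

# A made-up RSS feed (hypothetical titles and links, for illustration only)
feed = <<~XML
  <rss version="2.0">
    <channel>
      <title>Example Newsroom</title>
      <item><title>First story</title><link>https://example.com/1</link></item>
      <item><title>Second story</title><link>https://example.com/2</link></item>
    </channel>
  </rss>
XML

doc = REXML::Document.new(feed)

# Each <item> becomes one row: [title, link]
rows = REXML::XPath.match(doc, '//item').map do |item|
  [item.elements['title'].text, item.elements['link'].text]
end

rows.each { |row| puts row.join(' | ') }
```

Each row here is what would land in one spreadsheet row: a title in one column and a link in the next.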
Writing code!
What a simple scrape would look like in Ruby
require 'nokogiri'
require 'open-uri'
require 'csv'

url = "https://ire.org/jobs/"
# Download the page and parse it into a searchable document
html = Nokogiri::HTML(URI.open(url))
# Open the output file; every value gets wrapped in single quotes
csv = CSV.open("ire_jobs.csv", "w", col_sep: ",", quote_char: "'", force_quotes: true)
# Visit every row of every table on the page
html.xpath('//table//tr').each do |row|
  tarray = []
  # Collect the text of each cell in this row
  row.xpath('td').each do |cell|
    tarray << cell.text
  end
  # Write the row to the CSV file
  csv << tarray
end
csv.close
Copy this script here...
Technique 5:
Point-and-click apps
OutWit Hub
OutWit Hub is a desktop app that can identify each HTML element on a webpage and scrape it.