
Scraping for Journalists

How to grab information from hundreds of sources, put it in data you can interrogate - and still hit deadlines

Paul Bradshaw
This book is for sale at http://leanpub.com/scrapingforjournalists

This version was published on 2016-01-21

This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean
Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get
reader feedback, pivot until you have the right book and build traction once you do.

© 2012 - 2016 Paul Bradshaw


Tweet This Book!
Please help Paul Bradshaw by spreading the word about this book on Twitter!
The suggested hashtag for this book is #scrapingforjournos.
Find out what other people are saying about the book by clicking on this link to search for this hashtag on
Twitter:
https://twitter.com/search?q=#scrapingforjournos
Also By Paul Bradshaw
8000 Holes: How the 2012 Olympic Torch Relay Lost its Way
Model for the 21st Century Newsroom - Redux
Stories and Streams
Organising an Online Investigation Team
Data Journalism Heist
Finding Stories in Spreadsheets
Excel para periodistas
Periodismo de datos: Un golpe rápido
Learning HTML and CSS by making tweetable quotes
For Joseph, who loves robots, Max, who likes asking questions, and Claire, who has all the answers.
Contents

1. Scraper #1: Start scraping in 5 minutes
   How it works: functions and parameters
   What are the parameters? Strings and indexes
   Tables and lists?
   Recap
   Tests
1. Scraper #1: Start scraping in 5 minutes

You can write a very basic scraper by using Google Drive, selecting Create>Spreadsheet, and adapting this
formula - it doesn’t matter where you type it:
=ImportHTML("ENTER THE URL HERE", "table", 1)
This formula will go to the URL you specify, look for a table, and pull the first one into your spreadsheet.

If you’re using a Portuguese, Spanish or German version of Google Docs - or have any problems with
the formula - use semicolons instead of commas. We’re using commas here because this convention
will continue when we get into programming in later chapters.
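In those versions, the same formula would look like this:
=ImportHTML("ENTER THE URL HERE"; "table"; 1)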

Let’s imagine it’s the day after a big horse race where two horses died, and you want some context. Or
let’s say there’s a topical story relating to prisons and you want to get a global overview of the field: you could
use this formula by typing it into the first cell of an empty Google Docs spreadsheet and replacing ENTER
THE URL HERE with http://www.horsedeathwatch.com or http://en.wikipedia.org/wiki/List_of_prisons. Try
it and see what happens. It should look like this:
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "table", 1)

Don’t copy and paste this - it’s always better to type directly to avoid problems with hyphenation
and curly quotation marks, etc.

After a moment, the spreadsheet should start to pull in data from the first table on that webpage.
So, you’ve written a scraper. It’s a very basic one, but by understanding how it works and building on it
you can start to make more and more ambitious scrapers with different languages and tools.

How it works: functions and parameters


=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "table", 1)
The scraping formula above has two core ingredients: a function, and parameters:

• importHTML is the function. Functions (as you might expect) do things. According to Google Docs’ Help pages¹ this one “imports the data in a particular table or list from an HTML page”
• Everything within the parentheses (brackets) is a parameter. Parameters are the ingredients that the function needs in order to work. In this case, there are three: a URL, the word “table”, and the number 1.
You can use different functions in scraping to tackle different problems, or achieve different results. Google
Docs, for example, also has functions called importXML, importFeed and importData - some of which we’ll
cover later. And if you’re writing scrapers with languages like Python, Ruby or PHP you can create your own
functions that extract particular pieces of data from a page or PDF.
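As a small taste of those functions, here is a minimal importXML sketch - note that the XPath query "//h2" is just an illustrative assumption, asking for every second-level heading on the page:
=ImportXML("http://en.wikipedia.org/wiki/List_of_prisons", "//h2")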

What are the parameters? Strings and indexes


Back to the formula:
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "table", 1)
In addition to the function and parameters, there are some other things you should notice:
• Firstly, the = sign at the start. This tells Google Docs that this is a formula, rather than a simple number
or text entry
• Secondly, notice that two of the three parameters use straight quotation marks: the URL, and “table”.
This is because they are strings: strings are basically words, phrases or any other collection (i.e. string)
of characters. The computer treats these differently to other types of information, such as numbers,
dates, or cell references - we’ll come across these again later.
• The third parameter does not use quotation marks, because it is a number. In fact, in this case it’s a
number with a particular meaning: an index - the position of the table we’re looking for (first, second,
third, etc)
Knowing these things helps both in avoiding mistakes (for example, if you omit a quotation mark or use
curly quotation marks it won’t work) and in adapting a scraper…
For example, perhaps the table you got wasn’t the one you wanted. Try replacing the number 1 in your
formula with a number 2. This should now scrape the second table (in Google Docs an index starts from 1).
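The adapted formula looks like this:
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "table", 2)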
Knowing how to search for information about a function (often called ‘documentation’) is important too. The page on Google Docs Help², for example, explains that we can use “list” instead of “table” if we want to grab a list from the webpage.
So try that, and see what happens (make sure the webpage has a list).
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "list", 1)
You can also try replacing either string with a cell reference. For example:
=ImportHTML(A2, "list", 1)
And then in cell A2 type or paste:
http://en.wikipedia.org/wiki/List_of_prisons
Notice that you don’t need quotation marks around the URL if it’s in another cell.
Using cell references like this makes it easier to change your formula: instead of having to edit the whole
formula you only have to change the value of the cell that it’s drawing from.
For scrapers that do all of the above, see this example³.
¹http://support.google.com/docs/bin/answer.py?hl=en&answer=155182
²http://support.google.com/docs/bin/answer.py?hl=en&answer=155182
³https://docs.google.com/spreadsheet/ccc?key=0ApTo6f5Yj1iJdDBSb0FPQm9jUjYzdjcyNWlUTjVYMFE

Tables and lists?


There’s one final element in this scraper that deserves some further exploration: what it means by “table” or
“list”.
When we say “table” or “list” we are specifically asking it to look for an HTML tag in the code of the
webpage. You can - and should - do this yourself…
Look at the raw HTML of your webpage by right-clicking on the page and selecting View Page Source, or using the shortcut CTRL+U (Windows) or CMD+U (Mac) in Firefox, or a plugin like Firebug. You can also view it by selecting Tools > Web Developer > Page Source in Firefox, or View > Developer > View Source in Chrome. Firefox and Chrome are generally better set up for viewing source HTML than other browsers.
You’ll now see the HTML. Use Edit>Find in your browser (or CTRL+F) to search for <table
When =importHTML looks for a table, this is what it looks for - and it will grab everything between
<table> and </table> (which marks the end of the table).
With “list”, =importHTML is looking for the tags <ul> (unordered list - normally displayed as bullet lists)
or <ol> (ordered list - normally displayed as numbered lists). The end of each list is indicated by either </ul>
or </ol>.
Both tables and lists will include other tags, such as <li> (list item), <tr> (table row) and <td> (table data)
which add further structure - and that’s what Google Docs uses to decide how to organise that data across
rows and columns - but you don’t need to worry about them.
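To make that concrete, here is a minimal sketch of the sort of structure =importHTML looks for - the cell contents are just placeholders:

<table>
  <tr>
    <td>First cell</td>
    <td>Second cell</td>
  </tr>
</table>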
How do you know what index number to use? Well, there are two ways: you can look at the raw HTML
and count how many tables there are - and which one you need. Or you can just use trial and error, beginning
with 1, and going up until it grabs the table you want. That’s normally quicker.
Trial and error, by the way, is a common way of learning in scraping - it’s quite typical not to get things
right first time, and you shouldn’t be disheartened if things go wrong at first.
Don’t expect yourself to know everything there is to know about programming: half the fun is solving
the inevitable problems that arise, and half the skill is in the techniques that you use to solve them (some of
which I’ll cover here), and learning along the way.

Scraping tip #1: Finding out about functions


We’ve already mentioned one of those problem-solving techniques, which is to look for the Help pages
relating to the function you’re using - what’s often called the ‘documentation’.
When you come across a function (pretty much any word that comes after the = sign) it’s always a good
idea to Google it. Google Docs has extensive help pages - documentation - that explain what the function
does, as well as discussion around particular questions.
Likewise, as you explore more powerful scrapers such as those hosted on ScraperWiki or GitHub, search for
‘documentation’ and the name of the function to find out more about how it works.

Recap
Before we move on, here’s a summary of what we’ve covered:

• Functions do things…
• they need ingredients to do this, supplied in parameters
• There are different kinds of parameters: strings, for example, are collections of characters, indicated by quotation marks
• and an index is a position indicated by a number, such as first (1), second (2) and so on.
• The strings “table” and “list” in this formula refer to particular HTML tags in the code underlying a page

Although this is described as a ‘scraper’, the results only exist as long as the page does. The advantage
of this is that your spreadsheet will update every time the page does (you can set the spreadsheet
to notify you by email whenever it updates by going to Tools>Notification rules in the Google
spreadsheet and selecting how often you want to be notified of changes).
The disadvantage is that if the webpage disappears, so will your data. So it’s a good idea to keep a
static copy of that data in case the webpage is taken down or changed. You can do this by selecting
all the cells and clicking on Edit>Copy, then going to a new spreadsheet and clicking on Edit>Paste
values only.

We’ll come back to these concepts again and again, beginning with HTML. But before we do that - try
this…

Tests
To reinforce what you’ve just learned - or to test you’ve learned it at all - here are some tasks to get you
solving problems creatively:

• Let’s say you need a list of towns in Hungary (this was an actual task I needed to undertake for a story). What formula would you write to scrape the first table on this page: http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Hungary
• To make things easier for yourself, how can you change the formula so it uses cell references for each
of the three parameters? (Make sure each cell has the relevant parameter in it)
• How can you change one of those cells so that the formula scrapes the second table?
• How can you change it so it scrapes a list instead?
• Look at the source code for the page you’re scraping - try using the Find command (CTRL+F) to count
the tables and work out which one you need to scrape the table of smaller cities - adapt your formula
so it scrapes that
• Try to explain what a parameter is (tip: choose someone who isn’t going to run away screaming)
• Try to explain what an index is
• Try to explain what a string is
• Look for the documentation on related functions like importData and importFeed - can you get those
working?

Once you’re happy that you’ve nailed these core concepts, it’s time to move on to Scraper #2…
