Scraping for Journalists Sample
Paul Bradshaw
This book is for sale at http://leanpub.com/scrapingforjournalists
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing process. Lean
Publishing is the act of publishing an in-progress ebook using lightweight tools and many iterations to get
reader feedback, pivot until you have the right book and build traction once you do.
You can write a very basic scraper by using Google Drive, selecting Create>Spreadsheet, and adapting this
formula - it doesn’t matter where you type it:
=ImportHTML("ENTER THE URL HERE", "table", 1)
This formula will go to the URL you specify, look for a table, and pull the first one into your spreadsheet.
If you’re using a Portuguese, Spanish or German version of Google Docs - or have any problems with
the formula - use semicolons instead of commas. We’re using commas here because this convention
will continue when we get into programming in later chapters.
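In those versions of Google Docs, the prisons formula shown later in this chapter would be typed like this (only the separators change - the URL and other parameters stay the same):

```
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons"; "table"; 1)
```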
Let’s imagine it’s the day after a big horse race where two horses died, and you want some context. Or
let’s say there’s a topical story relating to prisons and you want to get a global overview of the field: you could
use this formula by typing it into the first cell of an empty Google Docs spreadsheet and replacing ENTER
THE URL HERE with http://www.horsedeathwatch.com or http://en.wikipedia.org/wiki/List_of_prisons. Try
it and see what happens. It should look like this:
=ImportHTML("http://en.wikipedia.org/wiki/List_of_prisons", "table", 1)
Don’t copy and paste this - it’s always better to type directly to avoid problems with hyphenation
and curly quotation marks, etc.
After a moment, the spreadsheet should start to pull in data from the first table on that webpage.
So, you’ve written a scraper. It’s a very basic one, but by understanding how it works and building on it
you can start to make more and more ambitious scrapers with different languages and tools.
Scraper #1: Start scraping in 5 minutes
• importHTML is the function. Functions (as you might expect) do things. According to Google Docs’
Help pages¹ this one “imports the data in a particular table or list from an HTML page”.
• Everything within the parentheses (brackets) is a parameter. Parameters are the ingredients that
the function needs in order to work. In this case there are three: a URL, the word “table”, and the
number 1.
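Putting those three ingredients together, the general shape of the function looks like this (the names url, query and index are just labels for this explanation, not something you type):

```
=ImportHTML(url, query, index)
```

Here url is the address of the page in quotation marks, query is either "table" or "list" depending on what you want to grab, and index is a number saying which table or list on the page to pull in - 1 for the first, 2 for the second, and so on.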
You can use different functions in scraping to tackle different problems, or achieve different results. Google
Docs, for example, also has functions called importXML, importFeed and importData - some of which we’ll
cover later. And if you’re writing scrapers with languages like Python, Ruby or PHP you can create your own
functions that extract particular pieces of data from a page or PDF.
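To give a flavour of what such a custom function might look like, here is a minimal sketch in Python that does roughly what ImportHTML does for tables: it pulls the rows of the first table out of a page’s HTML. It uses only Python’s built-in html.parser module to keep it self-contained; in practice scrapers usually lean on libraries like lxml or BeautifulSoup, which we won’t assume here. The names FirstTableParser and import_html_table are invented for this example.

```python
# A rough, illustrative equivalent of =ImportHTML(url, "table", 1),
# working on an HTML string rather than fetching a URL.
from html.parser import HTMLParser


class FirstTableParser(HTMLParser):
    """Collects the cell text of the first <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.rows = []          # finished rows of the first table
        self._row = None        # the row currently being built
        self._in_cell = False
        self._table_depth = 0   # how many <table> tags we are inside
        self._done = False      # True once the first table has closed

    def handle_starttag(self, tag, attrs):
        if self._done:
            return
        if tag == "table":
            self._table_depth += 1
        elif self._table_depth and tag == "tr":
            self._row = []
        elif self._table_depth and self._row is not None and tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if self._done:
            return
        if tag == "table" and self._table_depth:
            self._table_depth -= 1
            if self._table_depth == 0:
                self._done = True
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            # Crude whitespace handling - fine for a sketch.
            self._row[-1] += data.strip()


def import_html_table(html):
    """Return the rows of the first table in an HTML string."""
    parser = FirstTableParser()
    parser.feed(html)
    return parser.rows


page = """
<html><body>
<table>
  <tr><th>Horse</th><th>Date</th></tr>
  <tr><td>Example Runner</td><td>2012-04-14</td></tr>
</table>
<table><tr><td>second table, ignored</td></tr></table>
</body></html>
"""
print(import_html_table(page))
# → [['Horse', 'Date'], ['Example Runner', '2012-04-14']]
```

The point is not the details but the shape: a function that takes a page and hands back structured data is the building block every scraper in this book is made of.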
Recap
Before we move on, here’s a summary of what we’ve covered:
• Functions do things…
Although this is described as a ‘scraper’ the results only exist as long as the page does. The advantage
of this is that your spreadsheet will update every time the page does (you can set the spreadsheet
to notify you by email whenever it updates by going to Tools>Notification rules in the Google
spreadsheet and selecting how often you want to be notified of changes).
The disadvantage is that if the webpage disappears, so will your data. So it’s a good idea to keep a
static copy of that data in case the webpage is taken down or changed. You can do this by selecting
all the cells and clicking on Edit>Copy then going to a new spreadsheet and clicking on Edit>Paste
values only
We’ll come back to these concepts again and again, beginning with HTML. But before we do that - try
this…
Tests
To reinforce what you’ve just learned - or to test you’ve learned it at all - here are some tasks to get you
solving problems creatively:
• Let’s say you need a list of towns in Hungary (this was an actual task I needed to undertake for a story).
What formula would you write to scrape the first table on this page:
http://en.wikipedia.org/wiki/List_of_cities_and_towns_in_Hungary
• To make things easier for yourself, how can you change the formula so it uses cell references for each
of the three parameters? (Make sure each cell has the relevant parameter in it)
• How can you change one of those cells so that the formula scrapes the second table?
• How can you change it so it scrapes a list instead?
• Look at the source code for the page you’re scraping - try using the Find command (CTRL+F) to count
the tables and work out which one contains the smaller cities - then adapt your formula so it
scrapes that table
• Try to explain what a parameter is (tip: choose someone who isn’t going to run away screaming)
• Try to explain what an index is
• Try to explain what a string is
• Look for the documentation on related functions like importData and importFeed - can you get those
working?
Once you’re happy that you’ve nailed these core concepts, it’s time to move on to Scraper #2…