Web Scraping: Tables, PDFs, OCR (Cleo O'Brien-Udry)

Cleo O'Brien-Udry (Yale University) presented a workshop on web scraping. The workshop covered a short review of HTML and basic scraping techniques, scraping tables from webpages, importing PDFs into R, and using optical character recognition (OCR) to pull text from images into R. The running example is a research question about how global voting levels have changed over the last 50 years, using data from the International IDEA website. Tools include the R packages rvest, pdftools, tesseract, and magick, and the SelectorGadget Chrome extension for selecting webpage elements.


Web Scraping: Tables, PDFs, OCR

Cleo O’Brien-Udry

Yale University

25 May 2020

Cleo O’Brien-Udry (Yale University) Web Scraping 25 May 2020 1 / 11


Plan

1 Short review of HTML code/basic web-scraping techniques
2 Scraping tables from a webpage
3 Importing PDFs into R
4 Optical character recognition (pulling text from images into R)



Tools

RStudio: packages rvest, pdftools, tesseract, magick, tidyverse, plyr, data.table
GitHub script, slides, additional resources (https://github.com/cobrienudry/webscrape)
SelectorGadget Chrome extension (https://chrome.google.com/webstore/detail/selectorgadget/mhjhnkcfbdhnjickkkdbjoemdmbfginb?hl=en)
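
A minimal setup sketch for the packages listed above, assuming they are installed from CRAN. The helper name `missing_packages` is hypothetical and not taken from the workshop script. It only reports what is missing, so nothing is installed without you asking for it.

```r
# Packages named on the Tools slide. plyr is listed first so that it is
# loaded before the tidyverse; loading plyr after dplyr masks dplyr verbs.
pkgs <- c("plyr", "tidyverse", "data.table",
          "rvest", "pdftools", "tesseract", "magick")

# Hypothetical helper: return the packages that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(pkgs)
if (length(to_install) > 0) {
  message("Not installed yet: ", paste(to_install, collapse = ", "),
          ". Install them with install.packages().")
}

# Load whatever is already available, in the order given above
invisible(lapply(setdiff(pkgs, to_install), library, character.only = TRUE))
```

pdftools, tesseract, and magick wrap system libraries (Poppler, Tesseract, ImageMagick), so on some platforms those libraries may need to be installed before the R packages will build.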



Quick review

Web scraping: extract data from websites and store it on your computer (or an external server)
1 Find the webpage
2 Identify the location of the relevant data on the webpage
3 Import into R
4 Clean the data
5 Repeat
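
The five steps above can be sketched with rvest as follows. This is an illustration only, not the workshop script: the HTML snippet, the CSS selectors, the link paths, and the helper `scrape_page` are all made up, and the code assumes rvest 1.0.0 or later (for `html_element()` and `html_elements()`).

```r
library(rvest)

# Steps 1-2: a made-up page stands in for a real site. In practice, pass a
# URL to read_html() and find the CSS selector with SelectorGadget.
html <- '
<html><body>
  <h1 class="title">  Example Turnout Page </h1>
  <a class="data-link" href="/example/turnout-2015">2015</a>
  <a class="data-link" href="/example/turnout-2016">2016</a>
</body></html>'

# Hypothetical helper covering steps 3 and 4 for a single page
scrape_page <- function(source) {
  page <- read_html(source)                           # 3: import into R
  title <- page %>% html_element("h1.title") %>% html_text(trim = TRUE)
  links <- page %>% html_elements("a.data-link") %>% html_attr("href")
  list(title = tolower(title), links = links)         # 4: light cleaning
}

# Step 5: repeat over every page of interest (here, the same page twice)
results <- lapply(list(html, html), scrape_page)
```

Wrapping steps 3 and 4 in a function keeps step 5 down to a single `lapply()` call over a vector of URLs.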



Research Question: Global Voting Patterns

How have global levels of voting changed over the last 50 years? Which
countries show similar patterns of turnout and registration; which show
different patterns?

Data we need:
Country voter turnout data
Covariates (country development indicators, V-Dem indicators, etc.)

Source: International IDEA's data tools (https://www.idea.int/data-tools), which include voter turnout data.





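
For plan item 2 (scraping tables from a webpage), a minimal sketch using `rvest::html_table()`. The table below is invented for illustration; its id, column names, and values are assumptions and do not come from the International IDEA site.

```r
library(rvest)

# Made-up HTML table; a real page would be read with read_html(url)
html <- '
<table id="turnout">
  <tr><th>Country</th><th>Year</th><th>Voter Turnout</th></tr>
  <tr><td>Country A</td><td>2015</td><td>61.50 %</td></tr>
  <tr><td>Country B</td><td>2016</td><td>72.00 %</td></tr>
</table>'

# Select the table by its CSS selector and convert it to a data frame
turnout <- read_html(html) %>%
  html_element("table#turnout") %>%
  html_table()

# Clean: simpler column names, and strip the percent sign so the
# turnout column becomes numeric
names(turnout) <- c("country", "year", "turnout_pct")
turnout$turnout_pct <- as.numeric(sub("\\s*%$", "", turnout$turnout_pct))
```

When a page holds several tables, `html_elements("table")` followed by `html_table()` returns a list with one data frame per table.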



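
For plan item 3 (importing PDFs into R), a minimal sketch using `pdftools::pdf_text()`, which returns one character string per page. To stay self-contained, the example first writes a small PDF with made-up rows using base R graphics; in practice `pdf_path` would point to a downloaded file. The row layout (country, year, turnout) is an assumption.

```r
library(pdftools)

# Build a one-page PDF with invented rows so the sketch runs on its own
pdf_path <- tempfile(fileext = ".pdf")
pdf(pdf_path, width = 6, height = 4)
plot.new()
text(0.5, 0.6, "CountryA 2015 61.5")
text(0.5, 0.4, "CountryB 2016 72.0")
invisible(dev.off())

# Import: a character vector with one element per page
pages <- pdf_text(pdf_path)

# Split the first page into non-empty, trimmed lines
lines <- trimws(strsplit(pages[1], "\n")[[1]])
lines <- lines[nzchar(lines)]

# Split each line on whitespace and assemble a data frame
fields <- strsplit(lines, "\\s+")
turnout <- data.frame(
  country     = vapply(fields, `[`, character(1), 1),
  year        = as.integer(vapply(fields, `[`, character(1), 2)),
  turnout_pct = as.numeric(vapply(fields, `[`, character(1), 3))
)
```

Real PDFs rarely split this cleanly: multi-word country names and wrapped cells usually call for regular expressions, or for `pdftools::pdf_data()`, which returns word positions.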



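
For plan item 4 (optical character recognition), a minimal sketch using magick to prepare an image and `tesseract::ocr()` to read it. The image is generated on the fly with invented text so the example is self-contained; it assumes the English language data that ships with the tesseract package is available. OCR output is not guaranteed to match the input exactly, so treat the result as text that still needs checking.

```r
library(magick)
library(tesseract)

# Create a white image with large black text (made-up content).
# A real workflow would start from image_read() on a scanned page.
img <- image_blank(width = 600, height = 150, color = "white")
img <- image_annotate(img, "Voter turnout 2015", size = 48,
                      gravity = "center", color = "black")

# Write the image to disk so it can be passed to tesseract as a file path
img_path <- tempfile(fileext = ".png")
image_write(img, path = img_path, format = "png")

# Run OCR with the English engine; the result is a single character string
eng <- tesseract("eng")
text <- ocr(img_path, engine = eng)
text_clean <- trimws(text)
```

For scanned documents, preprocessing with magick (for example `image_convert(type = "Grayscale")` or `image_resize()`) before calling `ocr()` often improves recognition.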



Other web scraping topics

Python for web scraping
Clicking links
Remote servers



Thank you!

[email protected]
