How To Web Scrape With Python in 4 Minutes
A Beginner’s Guide for Webscraping in Python
Sep 26, 2018 · 5 min read
Photo by Chris Ried on Unsplash
Web Scraping
Web scraping is a technique to automatically access and
extract large amounts of information from a website, which
can save a huge amount of time and effort. In this article, we
will go through an easy example of how to automate
downloading hundreds of files from the New York MTA.
This is a great exercise for beginners who want to
understand how web scraping works. Web scraping can be
slightly intimidating, so this tutorial breaks the process
down step by step.
If you click on this arrow and then click on an area of the site
itself, the code for that particular item will be highlighted in
the console. I’ve clicked on the very first data file, Saturday,
September 22, 2018, and the console has highlighted the link
to that particular file in blue.
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
Notice that each .txt file is linked from an <a> tag like the
one above. As you do more web scraping, you will find that
the <a> tag is used for hyperlinks.
Now that we’ve identified the location of the links, let’s get
started on coding!
Python Code
We start by importing the following libraries.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
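Next, we set the url to the website, access the site with our
requests library, and parse the returned HTML with
BeautifulSoup.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML
soup.find_all('a')  # every <a> tag on the page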
This code gives us every <a> tag on the page. The links that
we are interested in start at position 36 of the results. Not
all of the links are relevant to what we want, but most of
them are, so we can simply slice from position 36 onward.
Below is a subset of what BeautifulSoup returns when we call
the code above.
subset of all <a> tags
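Slicing at a fixed position only works as long as the page layout stays the same. As a hedged alternative, the sketch below filters the tags by their href instead. The sample HTML and the helper name data_file_links are made up for illustration. Only the first data-file href comes from the page shown earlier, and the pattern (a path containing turnstile_ and ending in .txt) is an assumption based on that one link.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page; only the first
# data-file href is taken from the article, the rest is invented.
SAMPLE_HTML = """
<a href="index.html">Home</a>
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
<a href="data/nyct/turnstile/turnstile_180915.txt">Saturday, September 15, 2018</a>
<a href="about.html">About</a>
"""


def data_file_links(html):
    """Return the href of every <a> tag that looks like a turnstile .txt file."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for tag in soup.find_all("a", href=True):  # skip <a> tags without an href
        href = tag["href"]
        if "turnstile_" in href and href.endswith(".txt"):
            links.append(href)
    return links


print(data_file_links(SAMPLE_HTML))
```

Filtering this way keeps working if links are added or removed above the data files, at the cost of relying on the file-naming pattern.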
Next, let’s extract the actual link that we want. Let’s test out
the first link.
one_a_tag = soup.find_all('a')[36]  # the first data-file link
link = one_a_tag['href']  # relative path to the .txt file
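The href we extracted is a relative path, so it has to be combined with the page address before the file can be downloaded. The following is a minimal sketch of that step, assuming the links are relative to the page URL. The helper names build_download_target and download_file are hypothetical, and saving each file under its own name in the current directory is just one reasonable choice.

```python
import posixpath
import urllib.request
from urllib.parse import urljoin, urlparse

PAGE_URL = "http://web.mta.info/developers/turnstile.html"


def build_download_target(page_url, link):
    """Turn a relative href into (absolute URL, local file name)."""
    download_url = urljoin(page_url, link)  # resolve relative to the page
    file_name = posixpath.basename(urlparse(download_url).path)
    return download_url, file_name


def download_file(page_url, link):
    """Download one linked file into the current directory (needs network)."""
    download_url, file_name = build_download_target(page_url, link)
    urllib.request.urlretrieve(download_url, file_name)
    return file_name


link = "data/nyct/turnstile/turnstile_180922.txt"
print(build_download_target(PAGE_URL, link))
```

Calling download_file(PAGE_URL, link) would then fetch the file itself. It is left out of the sketch so that the example runs without network access.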
Last but not least, we should include this line of code to
pause for a second between requests so that we are not
spamming the website. This helps us avoid getting flagged as
a spammer.
time.sleep(1)
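To show where the pause fits, here is a rough sketch of a download loop. The function name download_all and its fetch and delay parameters are hypothetical. fetch stands in for whatever does the actual download (for example, a call to urllib.request.urlretrieve), which keeps the sketch runnable without network access.

```python
import time


def download_all(links, fetch, delay=1.0):
    """Call fetch(link) for each link, sleeping `delay` seconds between calls."""
    results = []
    for i, link in enumerate(links):
        if i > 0:
            time.sleep(delay)  # pause between requests, not before the first
        results.append(fetch(link))
    return results


# Stand-in fetch that only records what it was asked to download.
requested = []
download_all(["a.txt", "b.txt"], fetch=requested.append, delay=0.1)
print(requested)
```

Passing the download step in as a function means the same loop can be tried out with a stand-in first and then pointed at the real site.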