How To Web Scrape With Python in 4 Minutes

The document provides a tutorial on how to web scrape data from the New York MTA website using Python. It explains that web scraping allows automatic extraction of large amounts of information from websites. The tutorial walks through using Python and libraries like Beautiful Soup to inspect the website, find links to turnstile data files, download the files one by one with a pause between each download to avoid overloading the site. It also provides a full code sample to download all the turnstile data files with a for loop.

Uploaded by

vicearellano


How to Web Scrape with Python in 4 Minutes
A Beginner’s Guide for Webscraping in Python
Sep 26, 2018 · 5 min read
Photo by Chris Ried on Unsplash

Web Scraping
Web scraping is a technique to automatically access and
extract large amounts of information from a website, which
can save a huge amount of time and effort. In this article, we
will go through an easy example of how to automate
downloading hundreds of files from the New York MTA.
This is a great exercise for beginners who want to
understand how web scraping works. Web scraping can
be slightly intimidating, so this tutorial breaks the
process down step by step.

New York MTA Data

We will be downloading turnstile data from this site:
http://web.mta.info/developers/turnstile.html

Turnstile data is compiled every week from May 2010 to
present, so hundreds of .txt files exist on the site. Below is a
snippet of what some of the data looks like. Each date is a
link to the .txt file that you can download.
It would be torturous to manually right click on each link
and save to your desktop. Luckily, there’s web-scraping!

Important notes about web scraping:
1. Read through the website’s Terms and Conditions to
understand how you can legally use the data. Most sites
prohibit you from using the data for commercial
purposes.

2. Make sure you are not downloading data at too rapid a
rate, because this can overload the website. You may
also be blocked from the site.

Inspecting the Website

The first thing that we need to do is to figure out where we
can locate the links to the files we want to download inside
the multiple levels of HTML tags. Simply put, there is a lot
of code on a web page and we want to find the relevant
pieces of code that contain our data. If you are not familiar
with HTML tags, refer to W3Schools Tutorials. It is
important to understand the basics of HTML in order to
successfully web scrape.

On the website, right click and click on “Inspect”. This
allows you to see the raw code behind the site.
Once you’ve clicked on “Inspect”, you should see this
console pop up.
Console

Notice that on the top left of the console, there is an arrow
symbol.

If you click on this arrow and then click on an area of the site
itself, the code for that particular item will be highlighted in
the console. I’ve clicked on the very first data file, Saturday,
September 22, 2018 and the console has highlighted in blue
the link to that particular file.
<a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>

Notice that all the .txt files are inside <a> tags on the lines
following the one above. As you do more web scraping, you will
find that the <a> tag is used for hyperlinks.

Now that we’ve identified the location of the links, let’s get
started on coding!

Python Code
We start by importing the following libraries.
import requests
import urllib.request
import time
from bs4 import BeautifulSoup

Next, we set the url to the website and access the site with
our requests library.
url = 'http://web.mta.info/developers/turnstile.html'
response = requests.get(url)

If the access was successful, evaluating response should show
the following output:
<Response [200]>

Next we parse the html with BeautifulSoup so that we can
work with a nicer, nested BeautifulSoup data structure. If
you are interested in learning more about this library, check
out the BeautifulSoup documentation.
soup = BeautifulSoup(response.text, "html.parser")

We use the method .findAll to locate all of our <a> tags.

soup.findAll('a')
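
To make concrete what "locating every <a> tag" means, here is a small sketch that runs without the network or any installs. It is not the BeautifulSoup call from this tutorial: it swaps in Python's built-in html.parser, and both the LinkCollector class and the sample_html fragment are made up for illustration (only the second link mirrors the one we inspected on the MTA page).

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


# Made-up page fragment (an assumption, not the real MTA markup).
sample_html = """
<div>
  <a href="index.html">Home</a>
  <a href="data/nyct/turnstile/turnstile_180922.txt">Saturday, September 22, 2018</a>
</div>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.hrefs)
```

With BeautifulSoup, each element returned by soup.findAll('a') plays the same role as one entry here, and its 'href' is what we will pull out in the next step.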

This code returns every <a> tag on the page as a list. The
links that we are interested in start at entry 36 of that list. Not
all links are relevant to what we want, but most of them are, so we
can easily slice from entry 36. Below is a subset of what
BeautifulSoup returns to us when we call the code above.
subset of all <a> tags
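
Slicing from a fixed position works as long as the page layout does not change. As a hedged alternative (not the approach this tutorial uses), the links could instead be filtered by what their href looks like. The helper name turnstile_links and the sample href list below are hypothetical, chosen only to illustrate the idea.

```python
def turnstile_links(hrefs):
    """Keep only hrefs that look like weekly turnstile .txt files."""
    return [h for h in hrefs if "/turnstile_" in h and h.endswith(".txt")]


# Hypothetical href values, standing in for what the page's <a> tags contain.
sample_hrefs = [
    "http://www.mta.info",
    "about/readme.txt",
    "data/nyct/turnstile/turnstile_180922.txt",
    "data/nyct/turnstile/turnstile_180915.txt",
]

data_links = turnstile_links(sample_hrefs)
print(data_links)
```

The trade-off is that a pattern filter depends on the file naming staying the same, while the slice depends on the number of navigation links staying the same.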

Next, let’s extract the actual link that we want. Let’s test out
the first link.
one_a_tag = soup.findAll('a')[36]
link = one_a_tag['href']

This code saves 'data/nyct/turnstile/turnstile_180922.txt'
to our variable link. The full url to download the data is
actually
'http://web.mta.info/developers/data/nyct/turnstile/turnstile_180922.txt',
which I discovered by clicking on the first
data file on the website as a test. We can use our
urllib.request library to download this file to our
computer. We provide urllib.request.urlretrieve with two
parameters: the file url and the filename. For my files, I named
them “turnstile_180922.txt”, “turnstile_180901.txt”, etc.
download_url = 'http://web.mta.info/developers/' + link
urllib.request.urlretrieve(download_url, './' + link[link.find('/turnstile_')+1:])
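
The URL and filename handling above can be checked without downloading anything. The sketch below shows the slice from the tutorial next to two standard-library alternatives, urllib.parse.urljoin and posixpath.basename; using those alternatives is a suggestion, not what the original code does. The variable names page_url, filename_by_slice and filename_by_basename are made up for this example.

```python
import posixpath
from urllib.parse import urljoin

page_url = "http://web.mta.info/developers/turnstile.html"
link = "data/nyct/turnstile/turnstile_180922.txt"

# Resolve the relative href against the page it was found on.
download_url = urljoin(page_url, link)

# The tutorial's slice: find the '/' right before 'turnstile_' and
# keep everything after it.
filename_by_slice = link[link.find("/turnstile_") + 1:]

# Alternative: take the last component of the path.
filename_by_basename = posixpath.basename(link)

print(download_url)
print(filename_by_slice, filename_by_basename)
```

Both approaches give the same filename here. The slice silently falls back to the whole link if '/turnstile_' is missing (find returns -1), which is worth keeping in mind if the site's naming ever changes.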

Last but not least, we should include this line of code to
pause for a second between downloads so that we are not
spamming the website with requests. This helps us avoid
getting flagged as a spammer.
time.sleep(1)

Now that we understand how to download a file, let’s try
downloading the entire set of data files with a for loop. The
full script combines the steps above: it loops over the <a> tags,
downloads each NY MTA turnstile data file, and pauses for a
second between downloads.
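
A minimal sketch of such a loop follows. It is an illustration under stated assumptions, not the author's notebook code: the function name download_all and its retrieve and pause parameters are hypothetical, added so the loop can be dry-run without touching the network. The second href in the dry run is also made up.

```python
import time
import urllib.request
from urllib.parse import urljoin


def download_all(hrefs, page_url, retrieve=urllib.request.urlretrieve, pause=1.0):
    """Download every href relative to page_url, pausing between files.

    `retrieve` is injectable so the loop can be exercised offline.
    Returns the list of local filenames passed to `retrieve`.
    """
    saved = []
    for link in hrefs:
        download_url = urljoin(page_url, link)
        # Same filename slice as in the single-file example.
        filename = "./" + link[link.find("/turnstile_") + 1:]
        retrieve(download_url, filename)
        saved.append(filename)
        time.sleep(pause)  # wait before the next request
    return saved


# Dry run: record the calls instead of downloading anything.
calls = []
result = download_all(
    [
        "data/nyct/turnstile/turnstile_180922.txt",
        "data/nyct/turnstile/turnstile_180915.txt",
    ],
    "http://web.mta.info/developers/turnstile.html",
    retrieve=lambda url, path: calls.append((url, path)),
    pause=0,
)
print(result)
```

For a real run, assuming entry 36 is still the first data link, the hrefs could come from [a['href'] for a in soup.findAll('a')[36:]], with retrieve and pause left at their defaults.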

You can find my Jupyter Notebook for this on my GitHub.

If you’re a beginner in Python and/or webscraping, check
out this book called Automate the Boring Stuff with Python:
Practical Programming for Total Beginners.

It teaches you both the basics of Python as well as the basics
of webscraping. Skip to Part II of the book if you already
have experience with Python. Full disclosure, as an Amazon
Associate I earn from qualifying purchases.
Thanks for reading and happy web scraping everyone!
