Data Sourcing
Finding Data
I can export it from software (CSV)
I know it exists somewhere in a database (SQL)
It's on this website I visit daily (Scraping)
I have found a service (API) that gives access to it
...
Plan
1. Reading/Writing CSV (the hard way)
2. Consuming an API
3. Scraping a website
Next lectures
Databases & SQL
CSV
A comma-separated values file is a delimited text file that uses a comma to separate
values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is
a data record. Each record consists of one or more fields, separated by commas.
Source: Wikipedia (https://en.wikipedia.org/wiki/Comma-separated_values)
Example
👉 On https://people.sc.fsu.edu/~jburkardt (https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html), let's look at the addresses.csv file
In [ ]:
%%bash
mkdir -p data
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv > data/addresses.csv
cat data/addresses.csv
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
CSV Reading
In [ ]:
import csv

with open('data/addresses.csv') as csvfile:
    reader = csv.reader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a `list`
        print(row)
['John', 'Doe', '120 jefferson st.', 'Riverside', 'NJ', '08075']
['Jack', 'McGinnis', '220 hobo Av.', 'Phila', 'PA', '09119']
['John "Da Man"', 'Repici', '120 Jefferson St.', 'Riverside', 'NJ', '0807
5']
['Stephen', 'Tyler', '7452 Terrace "At the Plaza" road', 'SomeTown', 'SD',
'91234']
['', 'Blankman', '', 'SomeTown', 'SD', '00298']
['Joan "the bone", Anne', 'Jet', '9th, at Terrace plc', 'Desert City', 'C
O', '00123']
CSV with Headers
In [ ]:
%%bash
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv > data/biostats.csv
head -n 3 data/biostats.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
In [ ]:
import csv

with open('data/biostats.csv') as csvfile:
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a dict
        print(row['Name'], row['Sex'], int(row['Age']))
Alex M 41
Bert M 42
Carl M 32
Dave M 39
Elly F 30
Fran F 33
Gwen F 26
Hank M 30
Ivan M 53
Jake M 32
Kate F 47
Luke M 34
Myra F 23
Neil M 36
Omar M 38
Page F 31
Quin M 29
Ruth F 28
Writing a CSV
In [ ]:
beatles = [
    {'first_name': 'John', 'last_name': 'Lennon', 'instrument': 'guitar'},
    {'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'}
]
In [ ]:
import csv

with open('data/beatles.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    for beatle in beatles:
        writer.writerow(beatle)
In [ ]:
%%bash
cat data/beatles.csv
first_name,last_name,instrument
John,Lennon,guitar
Ringo,Starr,drums
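Note: the csv module documentation recommends opening the file with newline='' when writing, so the writer fully controls line endings (otherwise blank lines can appear between rows on some platforms, notably Windows). A minimal variant of the cell above with that option:

import csv

# `newline=''` lets the csv writer control line endings itself,
# avoiding spurious blank lines between rows on some platforms
with open('data/beatles.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    writer.writerows(beatles)  # writerows() takes the whole list at once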
API
An application programming interface (API) is an interface or communication protocol
between a client and a server intended to simplify the building of client-side software. It has
been described as a “contract” between the client and the server.
Source: Wikipedia (https://en.wikipedia.org/wiki/Application_programming_interface)
HTTP
A client-server protocol based on a request/response cycle.
Modern Web API
RESTful ( GET , POST , etc.)
Returns JSON (https://en.wikipedia.org/wiki/JSON#JSON_sample)
👉 Examples (https://github.com/public-apis/public-apis)
Requests: HTTP for Humans™
👉 Documentation (https://pypi.org/project/requests/)
Basic request
In [ ]:
import requests
url = 'https://api.github.com/users/ssaunier'
response = requests.get(url).json()
print(response['name'])
Sébastien Saunier
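In practice, it's worth checking that the request actually succeeded before using the body. A small sketch using only standard requests API:

import requests

url = 'https://api.github.com/users/ssaunier'
response = requests.get(url, timeout=10)  # fail fast instead of hanging forever
response.raise_for_status()               # raise an HTTPError on 4xx/5xx statuses
print(response.json()['name'])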
Example
Let's use the Open Library Books API (https://openlibrary.org/dev/docs/api/books).
This API documentation is not that good, so let's decipher it together!
Query Parameters:
Provide an ISBN ( bibkeys )
Options:
format=json
jscmd=data
Livecode: Let's find the book title behind ISBN 9780747532699
In [ ]:
import requests

isbn = '0-7475-3269-9'
key = f'ISBN:{isbn}'

response = requests.get(
    'https://openlibrary.org/api/books',
    params={'bibkeys': key, 'format': 'json', 'jscmd': 'data'},
).json()

print(response[key]['title'])
Harry Potter and the Philosopher's Stone
Web Scraping
HTTP (again)
This time, we'll have to deal with HTML (~unstructured data)
HTML
Right click -> Inspect Element on any website
<!DOCTYPE html>
<html>
  <head>
    <title>Title of the browser tab</title>
  </head>
  <body>
    <h1>Main title</h1>
    <p>Some content</p>
    <ul id="results">
      <li class="result">Result 1</li>
      <li class="result">Result 2</li>
    </ul>
  </body>
</html>
HTML Vocabulary
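Taking one line from the snippet above: in <li class="result">Result 1</li>, the whole thing is an element, li is its tag name, class="result" is an attribute (whose value is the CSS class result), and Result 1 is the element's text content.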
BeautifulSoup
The Python package to browse HTML (and XML!)
👉 Documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
Typical Web Scraper with BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"  # placeholder: any page you want to scrape

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# You now can query the `soup` object!
soup.title.string
soup.find('h1')
soup.find_all('a')
# etc...
Searching by element name
<p>A paragraph</p>
<article>An article...</article>
<article>Another...</article>
paragraph = soup.find("p")
articles = soup.find_all("article")
Searching by id
<a href="https://www.lewagon.com" id="wagon">Le Wagon</a>
item = soup.find(id="wagon")
Searching by CSS Class
<ul>
  <li class="pizza">Margherita</li>
  <li class="pizza">Calzone</li>
  <li class="pizza">Romana</li>
  <li class="dessert">Tiramisu</li>
</ul>
items = soup.find_all("li", class_="pizza")
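BeautifulSoup also understands full CSS selectors via select() (standard BeautifulSoup API), which comes in handy when a query mixes tag names, classes and nesting:

# equivalent to find_all("li", class_="pizza"), written as a CSS selector
items = soup.select("li.pizza")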
Live-code
Let's scrape IMDb Top 50 (https://www.imdb.com/list/ls055386972/) and extract the following information for each movie:
Title
Duration
Let's build a list ( movies ) of dict ( { 'title': ?, 'duration': ? } )
Solution
In [ ]:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://www.imdb.com/list/ls055386972/",
    headers={"Accept-Language": "en-US"}
)
soup = BeautifulSoup(response.content, "html.parser")

movies = []
for movie in soup.find_all("div", class_="lister-item-content"):
    title = movie.find("h3").find("a").string
    duration = int(movie.find("span", class_="runtime").string.strip(' min'))
    movies.append({'title': title, 'duration': duration})

print(movies[0:2])
[{'title': 'The Godfather', 'duration': 175}, {'title': "Schindler's List", 'duration': 195}]
Bonus
Convert a cURL command to Python Requests (https://sqqihao.github.io/trillworks.html)
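For instance, the curl call from the start of this lecture translates directly (written by hand here; the converter would produce something equivalent):

import requests

# Python equivalent of:
#   curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv
response = requests.get('https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv')
print(response.text)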
Your turn!
There are 3 challenges for this lecture, plus an optional one:
1. Reading and writing CSVs (the hard way!)
2. Making API calls with Python
3. Scraping a website
4. (Optional) Scraping a JavaScript client-side rendered website