01 Python 02 Data Sourcing
Data Sourcing
Finding Data
I can export it from a piece of software (CSV)
I know it exists somewhere in a database (SQL)
It's on this website I visit daily (Scraping)
I have found a service (API) that gives access to it
...
Plan
1. Reading/Writing CSV (the hard way)
2. Consuming an API
3. Scraping a website
Next lectures
Databases & SQL
CSV
A comma-separated values file is a delimited text file that uses a comma to separate
values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is
a data record. Each record consists of one or more fields, separated by commas.
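The "hard way" usually starts when a field itself contains a comma or a quote. The sketch below compares a naive `split(',')` with the standard library's `csv.reader` on two sample lines written in the style of addresses.csv; the text in `raw` is an illustrative assumption, not a download of the real file.

```python
import csv
import io

# Sample text: quoted fields may contain commas, and a doubled quote ("")
# stands for one literal quote character inside a quoted field.
raw = (
    'John,Doe,120 jefferson st.,Riverside, NJ, 08075\n'
    '"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123\n'
)

# Naive approach: splitting on every comma also splits inside quoted fields.
naive_rows = [line.split(',') for line in raw.splitlines()]

# csv.reader applies the quoting rules and yields one list per record.
parsed_rows = list(csv.reader(io.StringIO(raw)))

print(len(naive_rows[1]), len(parsed_rows[1]))
print(parsed_rows[1][0])
```

Both approaches agree on the first line; only the second line, with its quoted commas, shows the difference.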
Example
👉 On https://people.sc.fsu.edu/~jburkardt (https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html), let's look
at addresses.csv
https://kitt.lewagon.com/camps/1173/lectures/content/01-Python_02-Data-Sourcing.html 1/9
09/05/2023 09:48 01-Python_02-Data-Sourcing
In [ ]:
%%bash
mkdir -p data
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv > data/addresses.csv
cat data/addresses.csv
CSV Reading
In [ ]:
import csv
In [ ]:
%%bash
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv > data/biostats.csv
head -n 3 data/biostats.csv
In [ ]:
import csv
Alex M 41
Bert M 42
Carl M 32
Dave M 39
Elly F 30
Fran F 33
Gwen F 26
Hank M 30
Ivan M 53
Jake M 32
Kate F 47
Luke M 34
Myra F 23
Neil M 36
Omar M 38
Page F 31
Quin M 29
Ruth F 28
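The name, sex and age lines above look like what a `csv.DictReader` loop would print. Here is a minimal sketch, assuming the file has a header row, quoted text fields and padding spaces after each comma. The three-line `sample` and the temporary file path are hypothetical stand-ins so the cell runs without the download.

```python
import csv
import os
import tempfile

# Hypothetical sample shaped like biostats.csv (header + two records).
sample = (
    '"Name",     "Sex", "Age", "Height (in)", "Weight (lbs)"\n'
    '"Alex",       "M",   41,       74,      170\n'
    '"Bert",       "M",   42,       68,      166\n'
)

path = os.path.join(tempfile.mkdtemp(), 'biostats_sample.csv')
with open(path, 'w', newline='') as csvfile:
    csvfile.write(sample)

people = []
with open(path, newline='') as csvfile:
    # skipinitialspace=True ignores the padding that follows each comma,
    # so keys and values come out without leading spaces or stray quotes.
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        # Every value is read as a string; convert numbers explicitly.
        people.append((row['Name'], row['Sex'], int(row['Age'])))
        print(row['Name'], row['Sex'], row['Age'])
```

Pointing `path` at data/biostats.csv instead of the temporary file should give one line per person.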
Writing a CSV
In [ ]:
beatles = [
{ 'first_name': 'John', 'last_name': 'lennon', 'instrument': 'guitar'},
{ 'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'}
]
In [ ]:
import csv
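One way to write the list of dictionaries to data/beatles.csv is `csv.DictWriter`. This is a sketch rather than the lecture's exact cell; taking the column order from the keys of the first dictionary is an assumption.

```python
import csv
import os

beatles = [
    {'first_name': 'John', 'last_name': 'lennon', 'instrument': 'guitar'},
    {'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'},
]

os.makedirs('data', exist_ok=True)

# newline='' lets the csv module control line endings itself.
with open('data/beatles.csv', 'w', newline='') as csvfile:
    # Column names (and their order) come from the first dictionary.
    fieldnames = list(beatles[0].keys())
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()        # first line: first_name,last_name,instrument
    writer.writerows(beatles)   # one line per dictionary
```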
In [ ]:
%%bash
cat data/beatles.csv
first_name,last_name,instrument
John,lennon,guitar
Ringo,Starr,drums
API
An application programming interface (API) is an interface or communication protocol
between a client and a server intended to simplify the building of client-side software. It has
been described as a “contract” between the client and the server.
HTTP
A client-server protocol based on a request/response cycle.
👉 Examples (https://github.com/public-apis/public-apis)
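To make the request/response cycle concrete without depending on an external service, the sketch below starts a tiny server on localhost with the standard library and sends it one GET request. The handler name, the path `/users/demo` and the JSON payload are all made up for illustration.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen


class DemoHandler(BaseHTTPRequestHandler):
    """Server side: answers every GET request with a small JSON document."""

    def do_GET(self):
        body = json.dumps({"path": self.path, "name": "demo"}).encode("utf-8")
        self.send_response(200)                               # status line
        self.send_header("Content-Type", "application/json")  # headers
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)                                # body

    def log_message(self, format, *args):
        pass  # keep the notebook output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: send the request, then read status, headers and body.
with urlopen(f"http://127.0.0.1:{port}/users/demo") as response:
    status = response.status
    content_type = response.headers["Content-Type"]
    payload = json.loads(response.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(status, content_type, payload)
```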
Basic request
In [ ]:
import requests
url = 'https://api.github.com/users/ssaunier'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop here on a 4xx/5xx answer
data = response.json()       # parse the JSON body into a dict
print(data['name'])
Sébastien Saunier
Example
Let's use the Open Library Books API (https://openlibrary.org/dev/docs/api/books).
Query Parameters:
In [ ]:
import requests
isbn = '0-7475-3269-9'
key = f'ISBN:{isbn}'
# requests encodes the params dict into the URL's query string
response = requests.get(
    'https://openlibrary.org/api/books',
    params={'bibkeys': key, 'format': 'json', 'jscmd': 'data'},
    timeout=10,
).json()
print(response[key]['title'])
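Query parameters are the `key=value` pairs that follow the `?` in a URL. The sketch below builds that query string by hand with the standard library's `urlencode`, as an illustration of roughly what the `params` argument does. The variable names are hypothetical, and no request is sent.

```python
from urllib.parse import urlencode

base_url = 'https://openlibrary.org/api/books'
params = {'bibkeys': 'ISBN:0-7475-3269-9', 'format': 'json', 'jscmd': 'data'}

# urlencode joins the pairs with '&' and percent-encodes reserved
# characters, e.g. ':' becomes '%3A'.
query_string = urlencode(params)
full_url = f'{base_url}?{query_string}'
print(full_url)
```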
Web Scraping
HTTP (again)
This time, we'll have to deal with HTML (~unstructured data)
HTML
Right click -> Inspect Element on any website
<!DOCTYPE html>
<html>
<head>
<title>Title of the browser tab</title>
</head>
<body>
<h1>Main title</h1>
<p>Some content</p>
<ul id="results">
<li class="result">Result 1</li>
<li class="result">Result 2</li>
</ul>
</body>
</html>
HTML Vocabulary
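The usual vocabulary is tag (`ul`, `li`), attribute (`id="results"`, `class="result"`) and text content (`Result 1`). As a rough illustration, the sketch below walks a fragment of the page above with the standard library's `html.parser` and collects each of these pieces. The class name `VocabularyParser` is made up for this example.

```python
from html.parser import HTMLParser

html = """
<ul id="results">
  <li class="result">Result 1</li>
  <li class="result">Result 2</li>
</ul>
"""


class VocabularyParser(HTMLParser):
    """Collects opening tags with their attributes, and non-blank text."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.start_tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        # skip the whitespace found between tags
        if data.strip():
            self.texts.append(data.strip())


parser = VocabularyParser()
parser.feed(html)
print(parser.start_tags)
print(parser.texts)
```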
BeautifulSoup
The Python package to browse HTML (and XML!)
👉 Documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
<p>A paragraph</p>
<article>An article...</article>
<article>Another...</article>
paragraph = soup.find("p")
articles = soup.find_all("article")
Searching by id
item = soup.find(id="wagon")
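The snippets above can be assembled into one runnable cell by parsing an inline string instead of a downloaded page. This is a sketch assuming `bs4` is installed; the `<div id="wagon">` element and its text are invented so that the id search has something to find.

```python
from bs4 import BeautifulSoup

html = """
<p>A paragraph</p>
<article>An article...</article>
<article>Another...</article>
<div id="wagon">Le Wagon</div>
"""
soup = BeautifulSoup(html, "html.parser")

paragraph = soup.find("p")            # first matching element, or None
articles = soup.find_all("article")   # every matching element
item = soup.find(id="wagon")          # ids are meant to be unique on a page

paragraph_text = str(paragraph.string)
article_texts = [str(article.string) for article in articles]
item_text = str(item.string)
print(paragraph_text, article_texts, item_text)
```

Because `find` returns `None` when nothing matches, checking the result before reading `.string` is a reasonable habit on real pages.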
<ul>
<li class="pizza">Margherita</li>
<li class="pizza">Calzone</li>
<li class="pizza">Romana</li>
<li class="dessert">Tiramisu</li>
</ul>
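A list like this is a natural case for searching by CSS class. Here is a minimal sketch, assuming `bs4` is installed, that keeps the pizzas and leaves out the dessert.

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="pizza">Margherita</li>
  <li class="pizza">Calzone</li>
  <li class="pizza">Romana</li>
  <li class="dessert">Tiramisu</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# `class` is a reserved word in Python, hence the trailing underscore.
pizzas = soup.find_all("li", class_="pizza")
pizza_names = [str(pizza.string) for pizza in pizzas]
print(pizza_names)
```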
Live-code
Let's scrape IMDb Top 50 (https://www.imdb.com/list/ls055386972/) and extract the following information for
each movie:
Title
Duration
Solution
In [ ]:
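import requests
from bs4 import BeautifulSoup

# Download and parse the page (selectors below assume the list markup used in the lecture)
url = "https://www.imdb.com/list/ls055386972/"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

movies = []
for movie in soup.find_all("div", class_="lister-item-content"):
    title = movie.find("h3").find("a").string
    # "142 min" -> 142: strip() removes the characters ' ', 'm', 'i', 'n' from both ends
    duration = int(movie.find("span", class_="runtime").string.strip(' min'))
    movies.append({'title': title, 'duration': duration})
print(movies[0:2])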
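The extraction loop can be tried offline on an inline fragment. The HTML below is a hypothetical imitation of the structure the loop expects (a `div.lister-item-content` holding an `h3 > a` title and a `span.runtime`), not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with the same tag and class names as the loop above.
html = """
<div class="lister-item-content">
  <h3><a href="#">The Shawshank Redemption</a></h3>
  <span class="runtime">142 min</span>
</div>
<div class="lister-item-content">
  <h3><a href="#">The Godfather</a></h3>
  <span class="runtime">175 min</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

movies = []
for movie in soup.find_all("div", class_="lister-item-content"):
    title = str(movie.find("h3").find("a").string)
    runtime = movie.find("span", class_="runtime").string
    # strip(' min') removes the characters ' ', 'm', 'i', 'n' from both ends,
    # which leaves only the digits of a value such as "142 min".
    duration = int(runtime.strip(' min'))
    movies.append({'title': title, 'duration': duration})
print(movies)
```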
Bonus
Your turn!
There are 3 challenges for this lecture: