01 Python 02 Data Sourcing

Uploaded by

AyoubENSAT
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
14 views

01 Python 02 Data Sourcing

Uploaded by

AyoubENSAT
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 9

09/05/2023 09:48 01-Python_02-Data-Sourcing

Data Sourcing

Finding Data
I can export it from a software (CSV)
I know it exists somewhere in a database (SQL)
It's on this website I visit daily (Scraping)
I have found a service (API) that gives access to it
...

Plan
1. Reading/Writing CSV (the hard way)
2. Consuming an API
3. Scraping a website

Next lectures
Databases & SQL

CSV
A comma-separated values file is a delimited text file that uses a comma to separate
values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is
a data record. Each record consists of one or more fields, separated by commas.

Source: Wikipedia (https://en.wikipedia.org/wiki/Comma-separated_values)

Example
👉 On https://people.sc.fsu.edu/~jburkardt (https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html), let's look at addresses.csv

https://kitt.lewagon.com/camps/1173/lectures/content/01-Python_02-Data-Sourcing.html 1/9

In [ ]:

%%bash
mkdir -p data
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv > data/addresses.csv
cat data/addresses.csv

John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123

CSV Reading

In [ ]:

import csv

with open('data/addresses.csv') as csvfile:
    reader = csv.reader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a `list`
        print(row)

['John', 'Doe', '120 jefferson st.', 'Riverside', 'NJ', '08075']
['Jack', 'McGinnis', '220 hobo Av.', 'Phila', 'PA', '09119']
['John "Da Man"', 'Repici', '120 Jefferson St.', 'Riverside', 'NJ', '08075']
['Stephen', 'Tyler', '7452 Terrace "At the Plaza" road', 'SomeTown', 'SD', '91234']
['', 'Blankman', '', 'SomeTown', 'SD', '00298']
['Joan "the bone", Anne', 'Jet', '9th, at Terrace plc', 'Desert City', 'CO', '00123']

CSV with Headers

In [ ]:

%%bash
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv > data/biostats.csv
head -n 3 data/biostats.csv

"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166


CSV with Headers

In [ ]:

import csv

with open('data/biostats.csv') as csvfile:
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a `dict`
        print(row['Name'], row['Sex'], int(row['Age']))

Alex M 41
Bert M 42
Carl M 32
Dave M 39
Elly F 30
Fran F 33
Gwen F 26
Hank M 30
Ivan M 53
Jake M 32
Kate F 47
Luke M 34
Myra F 23
Neil M 36
Omar M 38
Page F 31
Quin M 29
Ruth F 28

Writing a CSV

In [ ]:

beatles = [
{ 'first_name': 'John', 'last_name': 'lennon', 'instrument': 'guitar'},
{ 'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'}
]

In [ ]:

import csv

with open('data/beatles.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    for beatle in beatles:
        writer.writerow(beatle)


In [ ]:

%%bash
cat data/beatles.csv

first_name,last_name,instrument
John,lennon,guitar
Ringo,Starr,drums
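
One detail worth knowing: the `csv` docs recommend opening the file with `newline=''` so the writer controls line endings itself (without it, you can get blank lines between rows on Windows). A variant of the cell above with that fix, using `writerows` to write the whole list in one call:

```python
import csv
import os

os.makedirs('data', exist_ok=True)  # make sure the folder exists

beatles = [
    {'first_name': 'John', 'last_name': 'lennon', 'instrument': 'guitar'},
    {'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'},
]

# newline='' lets the csv module handle line endings itself
with open('data/beatles.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    writer.writerows(beatles)  # same as the loop above, in one call
```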

API
An application programming interface (API) is an interface or communication protocol
between a client and a server intended to simplify the building of client-side software. It has
been described as a “contract” between the client and the server.

Source: Wikipedia (https://en.wikipedia.org/wiki/Application_programming_interface)

HTTP
A client-server protocol based on a request/response cycle.
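
The cycle can be seen with nothing but the standard library: a throwaway local server (purely for illustration) and one hand-made round trip, where the client sends a request and reads back the server's response.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Answers every GET with a tiny plain-text body."""
    def do_GET(self):
        body = b'hello'
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging

server = HTTPServer(('localhost', 0), HelloHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection('localhost', server.server_port)
conn.request('GET', '/')       # the client sends one request...
response = conn.getresponse()  # ...the server sends back one response
status = response.status
body = response.read()
server.shutdown()
print(status, body)
```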

Modern Web API


RESTful ( GET , POST , etc.)
Returns JSON (https://en.wikipedia.org/wiki/JSON#JSON_sample)

👉 Examples (https://github.com/public-apis/public-apis)

Requests: HTTP for Humans™


👉 Documentation (https://pypi.org/project/requests/)

Basic request

In [ ]:

import requests

url = 'https://api.github.com/users/ssaunier'
response = requests.get(url).json()

print(response['name'])

Sébastien Saunier


Example
Let's use the Open Library Books API (https://openlibrary.org/dev/docs/api/books).

This API documentation is not that good - let's decipher it together!

Query Parameters:

Provide an ISBN ( bibkeys )


Options:
format=json
jscmd=data

Livecode: Let's find the book title behind ISBN 9780747532699

In [ ]:

import requests

isbn = '0-7475-3269-9'
key = f'ISBN:{isbn}'

response = requests.get(
'https://openlibrary.org/api/books',
params={'bibkeys': key, 'format':'json', 'jscmd':'data'},
).json()

print(response[key]['title'])

Harry Potter and the Philosopher's Stone
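
One caveat with this API: an unknown ISBN does not raise an error, its key is simply absent from the JSON. A small guard keeps the lookup from crashing on a `KeyError` (the `extract_title` helper is ours, not part of the API):

```python
def extract_title(payload, isbn):
    """Pull a title out of an Open Library /api/books JSON payload, or None."""
    key = f'ISBN:{isbn}'
    book = payload.get(key)  # unknown ISBNs are simply absent
    if book is None:
        return None
    return book.get('title')

# Trimmed-down shape of a successful response
payload = {'ISBN:0-7475-3269-9': {'title': "Harry Potter and the Philosopher's Stone"}}

print(extract_title(payload, '0-7475-3269-9'))  # Harry Potter and the Philosopher's Stone
print(extract_title(payload, '0-0000-0000-0'))  # None
```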

Web Scraping

HTTP (again)
This time, we'll have to deal with HTML (~unstructured data)


HTML
Right click -> Inspect Element on any website

<!DOCTYPE html>
<html>
<head>
<title>Title of the browser tab</title>
</head>
<body>
<h1>Main title</h1>
<p>Some content</p>
<ul id="results">
<li class="result">Result 1</li>
<li class="result">Result 2</li>
</ul>
</body>
</html>

HTML Vocabulary

BeautifulSoup
The Python package to browse HTML (and XML!)

👉 Documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)


Typical Web Scraper with BeautifulSoup

import requests
from bs4 import BeautifulSoup

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# You can now query the `soup` object!
soup.title.string
soup.find('h1')
soup.find_all('a')
# etc...

Searching by element name

<p>A paragraph</p>

<article>An article...</article>
<article>Another...</article>

paragraph = soup.find("p")
articles = soup.find_all("article")

Searching by id

<a href="https://www.lewagon.com" id="wagon">Le Wagon</a>

item = soup.find(id="wagon")


Searching by CSS Class

<ul>
<li class="pizza">Margherita</li>
<li class="pizza">Calzone</li>
<li class="pizza">Romana</li>
<li class="dessert">Tiramisu</li>
</ul>

items = soup.find_all("li", class_="pizza")

Live-code
Let's scrape IMDb Top 50 (https://www.imdb.com/list/ls055386972/) and extract the following information for
each movie:

Title
Duration

Let's build a list ( movies ) of dict ( { 'title': ?, 'duration': ? } )

Solution

In [ ]:

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.imdb.com/list/ls055386972/", headers={"Accept-Language": "en-US"})
soup = BeautifulSoup(response.content, "html.parser")

movies = []
for movie in soup.find_all("div", class_="lister-item-content"):
title = movie.find("h3").find("a").string
duration = int(movie.find("span", class_="runtime").string.strip(' min'))
movies.append({'title': title, 'duration': duration})

print(movies[0:2])

[{'title': 'The Godfather', 'duration': 175}, {'title': "Schindler's List", 'duration': 195}]


Bonus

Convert a cURL command to Python Requests (https://sqqihao.github.io/trillworks.html)
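
The translation is mechanical: the URL's query string becomes `params`, and each `-H` flag becomes one `headers` entry. A hand-done sketch (no request is actually sent; preparing it is enough to inspect what would go over the wire):

```python
import requests

# curl -H "Accept: application/json" "https://api.github.com/users/ssaunier?page=1"
request = requests.Request(
    'GET',
    'https://api.github.com/users/ssaunier',
    params={'page': '1'},                    # appended as ?page=1
    headers={'Accept': 'application/json'},  # each -H flag becomes one entry
)
prepared = request.prepare()  # build the request without sending it

print(prepared.url)                # https://api.github.com/users/ssaunier?page=1
print(prepared.headers['Accept'])  # application/json
```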

Your turn!
There are 3 challenges for this lecture (plus an optional one):

1. Reading and writing CSVs (the hard way!)
2. Making API calls with Python
3. Scraping a website
4. (Optional) Scraping a JavaScript client-side rendered website

