Data Sourcing
Finding Data
I can export it from software (CSV)
I know it exists somewhere in a database (SQL)
It's on this website I visit daily (Scraping)
I have found a service (API) that gives access to it
...
Plan
1. Reading/Writing CSV (the hard way)
2. Consuming an API
3. Scraping a website
Next lectures
Databases & SQL
CSV
A comma-separated values file is a delimited text file that uses a comma to separate
values. A CSV file stores tabular data (numbers and text) in plain text. Each line of the file is
a data record. Each record consists of one or more fields, separated by commas.
Source: Wikipedia (https://en.wikipedia.org/wiki/Comma-separated_values)
Example
👉 On https://people.sc.fsu.edu/~jburkardt (https://people.sc.fsu.edu/~jburkardt/data/csv/csv.html), let's look at the addresses.csv file
In [ ]:
%%bash
mkdir -p data
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv > data/addresses.csv
cat data/addresses.csv
John,Doe,120 jefferson st.,Riverside, NJ, 08075
Jack,McGinnis,220 hobo Av.,Phila, PA,09119
"John ""Da Man""",Repici,120 Jefferson St.,Riverside, NJ,08075
Stephen,Tyler,"7452 Terrace ""At the Plaza"" road",SomeTown,SD, 91234
,Blankman,,SomeTown, SD, 00298
"Joan ""the bone"", Anne",Jet,"9th, at Terrace plc",Desert City,CO,00123
CSV Reading
In [ ]:
import csv

with open('data/addresses.csv') as csvfile:
    reader = csv.reader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a `list`
        print(row)
['John', 'Doe', '120 jefferson st.', 'Riverside', 'NJ', '08075']
['Jack', 'McGinnis', '220 hobo Av.', 'Phila', 'PA', '09119']
['John "Da Man"', 'Repici', '120 Jefferson St.', 'Riverside', 'NJ', '0807
5']
['Stephen', 'Tyler', '7452 Terrace "At the Plaza" road', 'SomeTown', 'SD',
'91234']
['', 'Blankman', '', 'SomeTown', 'SD', '00298']
['Joan "the bone", Anne', 'Jet', '9th, at Terrace plc', 'Desert City', 'C
O', '00123']
CSV with Headers
In [ ]:
%%bash
curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/biostats.csv > data/biostats.csv
head -n 3 data/biostats.csv
"Name", "Sex", "Age", "Height (in)", "Weight (lbs)"
"Alex", "M", 41, 74, 170
"Bert", "M", 42, 68, 166
In [ ]:
import csv

with open('data/biostats.csv') as csvfile:
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    for row in reader:
        # row is a dict
        print(row['Name'], row['Sex'], int(row['Age']))
Alex M 41
Bert M 42
Carl M 32
Dave M 39
Elly F 30
Fran F 33
Gwen F 26
Hank M 30
Ivan M 53
Jake M 32
Kate F 47
Luke M 34
Myra F 23
Neil M 36
Omar M 38
Page F 31
Quin M 29
Ruth F 28
Writing a CSV
In [ ]:
beatles = [
    {'first_name': 'John', 'last_name': 'Lennon', 'instrument': 'guitar'},
    {'first_name': 'Ringo', 'last_name': 'Starr', 'instrument': 'drums'}
]
In [ ]:
import csv

with open('data/beatles.csv', 'w') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    for beatle in beatles:
        writer.writerow(beatle)
In [ ]:
%%bash
cat data/beatles.csv
first_name,last_name,instrument
John,Lennon,guitar
Ringo,Starr,drums
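Note: the csv module documentation recommends opening the file with newline='' when writing, so the writer fully controls line endings (otherwise blank lines can appear between rows on some platforms, notably Windows). A minimal variant of the cell above with that option:

import csv

# `newline=''` lets the csv writer control line endings itself,
# avoiding spurious blank lines between rows on some platforms
with open('data/beatles.csv', 'w', newline='') as csvfile:
    writer = csv.DictWriter(csvfile, fieldnames=beatles[0].keys())
    writer.writeheader()
    writer.writerows(beatles)  # writerows() takes the whole list at once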
API
An application programming interface (API) is an interface or communication protocol
between a client and a server intended to simplify the building of client-side software. It has
been described as a “contract” between the client and the server.
Source: Wikipedia (https://en.wikipedia.org/wiki/Application_programming_interface)
HTTP
A client-server protocol based on a request/response cycle.
Modern Web API
RESTful ( GET , POST , etc.)
Returns JSON (https://en.wikipedia.org/wiki/JSON#JSON_sample)
👉 Examples (https://github.com/public-apis/public-apis)
Requests: HTTP for Humans™
👉 Documentation (https://pypi.org/project/requests/)
Basic request
In [ ]:
import requests
url = 'https://api.github.com/users/ssaunier'
response = requests.get(url).json()
print(response['name'])
Sébastien Saunier
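In practice, it's worth checking that the request actually succeeded before using the body. A small sketch using only standard requests API:

import requests

url = 'https://api.github.com/users/ssaunier'
response = requests.get(url, timeout=10)  # fail fast instead of hanging forever
response.raise_for_status()               # raise an HTTPError on 4xx/5xx statuses
print(response.json()['name'])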
Example
Let's use the Open Library Books API (https://openlibrary.org/dev/docs/api/books).
This API documentation is not that good, so let's decipher it together!
Query Parameters:
Provide an ISBN ( bibkeys )
Options:
format=json
jscmd=data
Livecode: Let's find the book title behind ISBN 9780747532699
In [ ]:
import requests

isbn = '0-7475-3269-9'
key = f'ISBN:{isbn}'

response = requests.get(
    'https://openlibrary.org/api/books',
    params={'bibkeys': key, 'format': 'json', 'jscmd': 'data'},
).json()

print(response[key]['title'])
Harry Potter and the Philosopher's Stone
Web Scraping
HTTP (again)
This time, we'll have to deal with HTML (~unstructured data)
HTML
Right click -> Inspect Element on any website
<!DOCTYPE html>
<html>
  <head>
    <title>Title of the browser tab</title>
  </head>
  <body>
    <h1>Main title</h1>
    <p>Some content</p>
    <ul id="results">
      <li class="result">Result 1</li>
      <li class="result">Result 2</li>
    </ul>
  </body>
</html>
HTML Vocabulary
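Taking one line from the snippet above: in <li class="result">Result 1</li>, the whole thing is an element, li is its tag name, class="result" is an attribute (whose value is the CSS class result), and Result 1 is the element's text content.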
BeautifulSoup
The Python package to browse HTML (and XML!)
👉 Documentation (https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
Typical Web Scraper with BeautifulSoup
import requests
from bs4 import BeautifulSoup

url = "https://www.example.com"  # placeholder: any page you want to scrape

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# You now can query the `soup` object!
soup.title.string
soup.find('h1')
soup.find_all('a')
# etc...
Searching by element name
<p>A paragraph</p>
<article>An article...</article>
<article>Another...</article>
paragraph = soup.find("p")
articles = soup.find_all("article")
Searching by id
<a href="https://www.lewagon.com" id="wagon">Le Wagon</a>
item = soup.find(id="wagon")
Searching by CSS Class
<ul>
  <li class="pizza">Margherita</li>
  <li class="pizza">Calzone</li>
  <li class="pizza">Romana</li>
  <li class="dessert">Tiramisu</li>
</ul>
items = soup.find_all("li", class_="pizza")
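BeautifulSoup also understands full CSS selectors via select() (standard BeautifulSoup API), which comes in handy when a query mixes tag names, classes and nesting:

# equivalent to find_all("li", class_="pizza"), written as a CSS selector
items = soup.select("li.pizza")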
Live-code
Let's scrape IMDb Top 50 (https://www.imdb.com/list/ls055386972/) and extract the following information for each movie:
Title
Duration
Let's build a list ( movies ) of dict ( { 'title': ?, 'duration': ? } )
Solution
In [ ]:
import requests
from bs4 import BeautifulSoup

response = requests.get(
    "https://www.imdb.com/list/ls055386972/",
    headers={"Accept-Language": "en-US"}
)
soup = BeautifulSoup(response.content, "html.parser")

movies = []
for movie in soup.find_all("div", class_="lister-item-content"):
    title = movie.find("h3").find("a").string
    duration = int(movie.find("span", class_="runtime").string.strip(' min'))
    movies.append({'title': title, 'duration': duration})

print(movies[0:2])
[{'title': 'The Godfather', 'duration': 175}, {'title': "Schindler's List", 'duration': 195}]
Bonus
Convert a cURL command to Python Requests (https://sqqihao.github.io/trillworks.html)
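For instance, the curl call from the start of this lecture translates directly (written by hand here; the converter would produce something equivalent):

import requests

# Python equivalent of:
#   curl -s https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv
response = requests.get('https://people.sc.fsu.edu/~jburkardt/data/csv/addresses.csv')
print(response.text)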
Your turn!
There are 3 challenges for this lecture, plus an optional one:
1. Reading and writing CSVs (the hard way!)
2. Making API calls with Python
3. Scraping a website
4. (Optional) Scraping a JavaScript client-side rendered website