Introduction to ETL
in Python
ETL IN PYTHON
Stefano Francavilla
CEO - Geowox
What is ETL?
Extract, Transform, Load
ETL IN PYTHON
What is ETL?
Extract, Transform, Load
ETL IN PYTHON
What is ETL?
Extract, Transform, Load
ETL IN PYTHON
The scene
Private equity fund called "DataCamp
Capital Group" (DCG Capital)
Residential assets
Monthly sales insights
In charge of the ETL pipeline
Stakeholder is the business analyst
ETL IN PYTHON
The pipeline
ETL IN PYTHON
The pipeline
ETL IN PYTHON
The pipeline
ETL IN PYTHON
In this lesson
ETL IN PYTHON
In this lesson
ETL IN PYTHON
In this lesson
ETL IN PYTHON
Requests
GET requests.get('<url>')
request/fetch data from a resource get_url = 'https://example.com/file.zip'
response = requests.get(get_url)
Social network GETs contacts content
response : requests.Response object
POST requests.post('<url>', data={'key':
'value'})
create/update a resource
post_url = 'https://example.com/page'
Social network POSTs user generated post_data = {'company': 'DCG Capital'}
content response = requests(post_url,
data=post_data)
response : requests.Response object
ETL IN PYTHON
Common requests attributes
response = requests.get('https://example.com/ny-properties-onsale.csv')
city, address, price
New York, "441 W 37th St FLOOR 2, New York, NY 10018", "$1,700,000"
New York, "22 W 57th St #Q56, New York, NY 10019", "$3,895,000"
New York, "788 9th Ave APT 1B, New York, NY 10019", "$1,000,000"
ETL IN PYTHON
Common requests attributes
Name Output Example
b'city, address, price\nNew
response.content raw bytes response payload
York...'
character encoded (e.g. UTF-8)
response.text 'city, address, price\nNew York...'
string payload
dictionary-like object which { 'Date': 'Wed, 20 Oct 2021
response.headers contains header payload as key- 18:49:30 GMT', 'Content-Length':
value '218'... }
status code returned by the
response.status_code 200
external service
ETL IN PYTHON
Zipfile
from zipfile import ZipFile class
Built-in zipfile function
Commonly used with two arguments: ZipFile(filepath, mode)
Read mode
with ZipFile(filepath, mode='r') as f:
f.namelist()
f.extract()
f.namelist() returns the list of les inside the opened .zip le
f.extract(filename, path=None) extracts a speci c le to a speci ed directory
ETL IN PYTHON
Zipfile: an example
from zipfile import ZipFile
filepath = "/my/custom/path/example.zip"
with ZipFile(filepath, mode='r') as f:
name_list = f.namelist()
print("List of files:", name_list)
extract_path = f.extract(name_list[0], path="/my/custom/path/")
print("Extract Path:", extract_path)
List of files: ["example.csv"]
Extract path: "/my/custom/path/example.csv"
ETL IN PYTHON
Let's practice!
ETL IN PYTHON
Ask the right
questions
ETL IN PYTHON
Stefano Francavilla
CEO - Geowox
Where we are in the pipeline
ETL IN PYTHON
Dataset example
Date of Sale Postal Description
Address County Price (€)
(dd/mm/yyyy) Code of Property
Second-
Hand
123 WALKINSTOWN PARK, Dublin
12/02/2021 Dublin €297,000.00 Dwelling
WALKINSTOWN, DUBLIN 12 12
house
/Apartment
New
12 Oileain Na
Dublin Dwelling
04/01/2021 Cranoige.Cranogue Isl, Dublin €192,951.00
11 house
Balbutcher Lane, BALLYMUN
/Apartment
ETL IN PYTHON
Open a file
Built-in open() function
Commonly used with 2 arguments: open(filepath, mode)
Most common mode :
Character Meaning
'r' open for reading (default)
'w' open for writing
encoding argument example: open("file.csv", mode="r", encoding="windows-1252")
ETL IN PYTHON
Open a file: example
Read mode
with open('file.csv', mode="r", encoding="windows-1252"):
# Code here
Write mode
with open('file.csv', mode="w", encoding="windows-1252"):
# Code here
ETL IN PYTHON
CSV module
csv implements classes to read and write tabular data in CSV format
Dictionary form with csv.DictReader() and csv.DictWriter() functions
csv.DictReader(file, fieldnames=None...)
csv.DictWriter(file, fieldnames, ...)
keys = column names
values = row values
ETL IN PYTHON
Read in action
Code
with open("file.csv", mode="r") as csv_file:
reader = csv.DictReader(csv_file)
row = next(reader)
print(row)
Output
OrderedDict([
('Date of Sale (dd/mm/yyyy)', '03/01/2021'),('Postal Code', 'Dublin 4'),
('Address', '16 BURLEIGH COURT, BURLINGTON ROAD, DUBLIN 4'),('County', 'Dublin'),
('Price (€)', '€450,000.00'), ...])
ETL IN PYTHON
Write in action
Code
with open("file.csv", mode="w") as csv_file:
new_column_names = {"Date of Sale (dd/mm/yyyy)": "date_of_sale",
"Address": "address", "Postal Code": "postal_code", "County": "county",
"Price (€)": "price", "Description of Property": "description"}
writer = csv.DictWriter(csv_file, fieldnames=new_column_names)
# Write headers as first line
writer.writeheader()
# Write all rows in file
for row in reader:
writer.writerow(row)
ETL IN PYTHON
Let's practice!
ETL IN PYTHON
Extracting
ETL IN PYTHON
Stefano Francavilla
CEO - Geowox
End goal
Automated pipeline
cron
Command line utility used for scheduling
execute.py
1. extract.py
2. transform.py
3. load.py
Download and process property transactions
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
E(xtract)TL
ETL IN PYTHON
In this lesson: E(xtract)
ETL IN PYTHON
Create a folder
Make sure the downloaded_at folder exists
import os
Allows Python to interact with the operating system
os.makedirs() creates a folder recursively
Missing intermediate-level directories are created as well
os.makedirs(path, exist_ok=[True|False])
ETL IN PYTHON
Create a folder: an example
January 1st, 2021
rst time we run the cron job
Save the .zip le in the current month directory:
<root>/source/downloaded_at=2021-02-01
create .../downloaded_at=2021-01-01 folder
but... .../source folder does not exist yet
ETL IN PYTHON
Create a folder: an example
# Create <root>/source/downloaded_at=2021-01-01
path = "root/source/downloaded_at=2021-01-01"
os.makedirs(path, exist_ok=True)
# 1. Create source
# 2. Create downloaded_at=2021-01-01
/source/downloaded_at=2021-01-01/<zipfile_name>.zip
ETL IN PYTHON
Save ZIP file locally
open()
Commonly used with two arguments: open(filepath, mode)
Text vs binary mode :
Character Meaning
'w' open for writing in text format
'wb' open for writing in binary format
Write binary mode
with open('source/downloaded_at.../ppr-all.zip', mode="wb") as f:
f.write(...)
ETL IN PYTHON
Let's practice!
ETL IN PYTHON
Project folder
structure
ETL IN PYTHON
Stefano Francavilla
CEO Geowox
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
ETL IN PYTHON
Extract, Transform and Load
# Import libraries
def methodX():
# Code here
pass
def methodY():
# Code here
pass
def main():
methodX()
methodY()
ETL IN PYTHON
Execute
# Import extract, transform and load
import extract, transform, load
# Ensure execute.py can only be ran from bash
if __name__ == "__main__":
# 1. Run Extract
extract.main()
# 2. Run Transform
transform.main()
# 3. Run Load
load.main()
python execute.py
ETL IN PYTHON
Let's practice!
ETL IN PYTHON