Week 5. Data Pipelines
Keeyong Han
Table Of Contents
1. Recap of the 4th Week
2. Interview Tips #1
3. Overview of Data Pipelines
4. Key Concepts
5. Refreshing Python Syntax
6. First Data Pipeline (ETL)
7. Improving the Data Pipeline
8. Homework #4
9. More on Lab #1
10. Group Project & Final Exam
11. Demo
Recap of the 4th Week
SQL
● The best language to manipulate structured data (small or big)
● The most important skill in the data domain
● Composed of two parts
○ DDL (Data Definition Language): CREATE/ALTER/DROP
○ DML (Data Manipulation Language): INSERT/DELETE/UPDATE/SELECT
■ COUNT DISTINCT
■ NULLIF, COALESCE
■ CTAS
■ WINDOW Functions
Quiz #1 Review
● Best 2 out of 3 quizzes will be graded
● The other two quizzes will follow the same format
Interview Tips #1
Resume
● The audience of your resume is not YOU but hiring managers and recruiters
● Readability matters
○ Work Experience and Projects should be more like telling your story
■ Not a list of what you have done!!
○ Categorize your Skills
○ Customize and optimize for the job you are applying for!
■ Look for frequent keywords in the job description
● Be friends with ChatGPT
Overview of Data Pipelines
Definition of Data Pipeline
● A series of processes that automatically move and transform data from one or
more sources to a destination
○ Most likely the destination is a data warehouse or a data lake
○ The source can be the DW or DL itself in the case of ELT
○ Sometimes the flow is the other way around (Reverse ETL)
● In Airflow, a data pipeline is called a DAG (Directed Acyclic Graph)
ETL (Extract, Transform, Load)
● Extract:
○ The process of reading data from the data source, usually through API calls.
● Transform:
○ The process of changing the format of the raw data into a desired form
● Load:
○ The final step of loading the data into the data warehouse as tables.
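One way to picture the three steps is as three small Python functions chained together. A toy sketch, using an inline string in place of a real data source (everything here is illustrative):

def extract():
    # Extract: read raw data from the source (normally an API call or file download)
    return "country,capital\nFrance,Paris\nKenya,Nairobi"

def transform(raw):
    # Transform: reshape the raw CSV text into a list of (country, capital) tuples
    lines = raw.strip().split("\n")[1:]          # skip the header line
    return [tuple(line.split(",")) for line in lines]

def load(records):
    # Load: in a real pipeline this would INSERT the rows into a warehouse table
    for country, capital in records:
        print(country, capital)

load(transform(extract()))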
[Diagram: data flows between the Data Source and the Data Warehouse — (1) ETL, (2) ELT (the “T”), (3) Reverse ETL]
What is Airflow?
● An open-source project that started at Airbnb (along with Superset)
● The de facto standard data pipeline management framework
○ You can write data pipelines and manage them (scheduling, triggering, …)
○ Python modules are provided along with a Web UI
○ So popular that most cloud providers support Airflow as a managed service
■ Google Cloud supports Airflow (“Cloud Composer”)
● A data pipeline in Airflow is called a DAG (Directed Acyclic Graph)
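To make the DAG idea concrete, here is a minimal sketch of an Airflow DAG with three chained tasks. It assumes Airflow 2.4+; the DAG id, task ids, and schedule are illustrative only:

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read data from the source")

def transform():
    print("reshape the raw data")

def load():
    print("write the data into the warehouse")

with DAG(
    dag_id="example_etl",             # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                # run once a day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)
    extract_task >> transform_task >> load_task   # the directed, acyclic dependency chain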
Key Concepts
Key Concepts to Look at
● Full Refresh, Incremental Update
● Idempotency
○ SQL Transaction
● Backfill
● Upsert
○ The most efficient way of ensuring primary key uniqueness (update if the key exists, insert otherwise)
○ Snowflake supports this with an operation called MERGE
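A minimal upsert sketch using a Snowflake cursor (assumed to already exist as cur); the table raw.stock_price, its columns, and the values are illustrative:

# upsert one (symbol, date) row: update it if the key already exists, insert it otherwise
merge_sql = """
MERGE INTO raw.stock_price t
USING (SELECT %s AS symbol, %s AS date, %s AS close) s
ON t.symbol = s.symbol AND t.date = s.date
WHEN MATCHED THEN UPDATE SET t.close = s.close
WHEN NOT MATCHED THEN INSERT (symbol, date, close) VALUES (s.symbol, s.date, s.close)
"""
cur.execute(merge_sql, ("AAPL", "2024-01-02", 185.64))   # placeholder values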
Full Refresh vs. Incremental Update #1
● Full Refresh
○ Reading the whole data set from the source and recreating the corresponding table
○ The simplest and cleanest approach, but it only works for small data sources
● Incremental Update
○ For large data sources, doing full refresh is not feasible
○ Doing periodic update such as daily or hourly
■ Daily incremental update: reading previous day’s data
■ Hourly incremental update: reading previous hour’s data
○ Effective but more complex to manage
■ What if the entire source data format changes?
■ What if updates from certain hours or days are found to be incorrect or missing?
Full Refresh vs. Incremental Update #2
● Prefer a full refresh whenever possible: everything gets easier
● If incremental update is the only way, the data source needs to support the following:
○ Given a date/time range, returning records added or modified during that window.
○ Deletion shouldn’t be allowed. If deletion is needed, mark deleted records with a deletion flag instead.
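For example, an hourly incremental update usually boils down to a range query against a last-modified timestamp. A sketch, assuming a database cursor cur and a hypothetical source table with modified_at and deleted_flag columns:

from datetime import datetime, timedelta

end = datetime(2024, 1, 2, 10, 0, 0)             # normally "now", truncated to the hour
start = end - timedelta(hours=1)                 # the previous hour's window

cur.execute(
    """
    SELECT id, name, modified_at, deleted_flag   -- deleted rows are flagged, never removed
    FROM analytics.raw.users
    WHERE modified_at >= %s AND modified_at < %s
    """,
    (start, end),
)
rows = cur.fetchall()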
Refreshing Python Syntax
What to learn: Basics
● Data Types
○ Number, string, list, dictionary, …
○ Slice operator
○ Formatted string literals
● For loop
● Functions
○ Definition and invocation
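A few lines that touch each of these, as a quick self-check (all values are made up):

# data types: number, string, list, dictionary
count = 3
name = "Snowflake"
capitals = ["Sukhumi", "Kabul", "Tirana"]
country_to_capital = {"Albania": "Tirana", "Algeria": "Algiers"}

# slice operator and formatted string literals (f-strings)
print(capitals[0:2])                       # ['Sukhumi', 'Kabul']
print(f"{name} stores {count} tables")     # Snowflake stores 3 tables

# for loop over a dictionary
for country, capital in country_to_capital.items():
    print(f"The capital of {country} is {capital}")

# function definition and invocation
def add(a, b):
    return a + b

print(add(2, 5))   # 7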
What to learn: More Advanced Topics
● Import modules
○ Installing modules via pip
● Downloading a web page: the “requests” module
○ This is in essence calling an API
● Error handling
○ try/except
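The two ideas combine naturally: wrap the API call in try/except so a failed request doesn’t leave the pipeline in a bad state. A minimal sketch (the URL is a placeholder):

import requests

url = "https://www.alphavantage.co/query"    # placeholder; any HTTP endpoint works here

try:
    response = requests.get(url, timeout=10)
    response.raise_for_status()              # turn 4xx/5xx responses into exceptions
    print(response.text[:200])               # first 200 characters of the body
except requests.exceptions.RequestException as e:
    print("request failed:", e)
    raise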
Google Colab Practice
● Refreshing Python
First Data Pipeline (ETL)
ETL Overview Practice
● Copy a country/capital CSV file from the web into a table in Snowflake
● Written in Python on Google Colab
country | capital
Abkhazia | Sukhumi
Afghanistan | Kabul
Albania | Tirana
Algeria | Algiers
American Samoa | Pago Pago
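A sketch of the extract/transform part of this practice, assuming the CSV is reachable at a plain URL (the URL below is a placeholder) and using the requests module from earlier:

import requests

csv_url = "https://example.com/country_capital.csv"    # placeholder for the practice file

raw = requests.get(csv_url).text
records = []
for line in raw.strip().split("\n")[1:]:               # skip the "country,capital" header
    country, capital = line.split(",")
    records.append((country, capital))
print(records[:5])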
ETL Overview - 2. Create a table in Snowflake
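A sketch of what this step might look like with the Snowflake Python connector; the connection parameters and the table name dev.raw.country_capital are illustrative assumptions (in practice, keep credentials out of the code):

import snowflake.connector

# illustrative credentials: read these from environment variables or Colab secrets in practice
conn = snowflake.connector.connect(
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    account="YOUR_ACCOUNT",
    warehouse="COMPUTE_WH",
    database="DEV",
)
cur = conn.cursor()

# create the target table for the country/capital records
cur.execute("""
    CREATE TABLE IF NOT EXISTS dev.raw.country_capital (
        country VARCHAR PRIMARY KEY,
        capital VARCHAR
    )
""")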
SQL Transaction
● What if there are tasks that would be left in an incomplete state if they fail
midway?
● In the example below, what if the withdrawal is successful but a problem
occurs during the transfer?
try:
    cur.execute("BEGIN;")
    # execute SQLs that need to be run atomically
    …
    cur.execute("COMMIT;")  # same as cur.execute("END;")
except Exception as e:
    cur.execute("ROLLBACK;")
    raise
ETL v2 Practice
1. (+1) Pick up a stock symbol and get your own API key from Alpha Vantage
2. (+1) Secure your Snowflake credentials and Alpha Vantage API key (don't expose them in the code)
3. (+2) Read the last 90 days of the price info via the API (refer to the code snippet; you need to add “date”)
a. With regard to adding “date”, please look at the next slide
4. (+1) Create a table under the “raw” schema, if it doesn’t exist, to capture the info from the API
a. symbol, date, open, close, high, low, volume: symbol and date together form the primary key
5. (+1) Delete all records from the table
6. (+1) Populate the table with the records from step 2 using INSERT SQL (refer to the relevant code snippet as a
starting point)
7. (+4) Steps 4 and 6 need to be done together
a. Use try/except along with SQL transaction. (use the code here as reference)
8. (+1) Demonstrate that your work ensures idempotency by running your pipeline (from extract to load) twice in a row
and checking the number of records (the number needs to remain the same)
9. (+2) Follow today’s demo and capture a Docker Desktop screenshot showing Airflow running
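A hedged sketch of how the pieces might fit together, assuming the Alpha Vantage TIME_SERIES_DAILY endpoint and an existing Snowflake cursor cur (the table name raw.stock_price and the credential handling are illustrative, not the graded solution):

import requests

def extract_last_90_days(symbol, api_key):
    # call Alpha Vantage's TIME_SERIES_DAILY endpoint and keep the 90 most recent days
    url = ("https://www.alphavantage.co/query"
           f"?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={api_key}")
    data = requests.get(url).json()["Time Series (Daily)"]
    records = []
    for date in sorted(data.keys(), reverse=True)[:90]:
        day = data[date]
        records.append((
            symbol, date,
            float(day["1. open"]), float(day["4. close"]),
            float(day["2. high"]), float(day["3. low"]),
            int(day["5. volume"]),
        ))
    return records

def load(records):
    # create the target table first (DDL auto-commits in Snowflake),
    # then delete + insert inside one transaction so reruns keep the record count stable
    cur.execute("""
        CREATE TABLE IF NOT EXISTS raw.stock_price (
            symbol VARCHAR, date DATE, open FLOAT, close FLOAT,
            high FLOAT, low FLOAT, volume BIGINT,
            PRIMARY KEY (symbol, date)
        )""")
    try:
        cur.execute("BEGIN;")
        cur.execute("DELETE FROM raw.stock_price;")
        cur.executemany(
            "INSERT INTO raw.stock_price (symbol, date, open, close, high, low, volume) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s)",
            records,
        )
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")
        raise

load(extract_last_90_days("AAPL", "YOUR_API_KEY"))   # the symbol and key are placeholders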
[Slides: Stock Info #1 / ETL / Stock Info #2 — visuals not included]