
Week 5: Data Pipelines

Keeyong Han
Table Of Contents
1. Recap of the 4th Week
2. Interview Tips #1
3. Overview of Data Pipelines
4. Key Concepts
5. Refreshing Python Syntax
6. First Data Pipeline (ETL)
7. Improving the Data Pipeline
8. Homework #4
9. More on Lab #1
10. Group Project & Final Exam
11. Demo
Recap of the 4th Week
SQL
● The best language to manipulate structured data (small or big)
● The most important skill in the data domain
● Composed of two parts
○ DDL (Data Definition Language): CREATE/ALTER/DROP
○ DML (Data Manipulation Language): INSERT/DELETE/UPDATE/SELECT
■ COUNT DISTINCT
■ NULLIF, COALESCE
■ CTAS
■ WINDOW Functions
Quiz #1 Review
● Best 2 out of 3 quizzes will be graded
● The other two quizzes will follow the same format
Interview Tips #1
Resume
● The audience of your resume is not YOU but hiring managers and recruiters
● Readability matters
○ Work Experience and Projects should be more like telling your story
■ Not a list of what you have done!!
○ Categorize your Skills
○ Customize and optimize for the job you are applying for!
■ Look for frequent keywords in the job description
● Be friends with ChatGPT
Overview of Data Pipelines
Definition of Data Pipeline

● A series of processes that automatically move and transform data from one or
more sources to a destination
○ Most likely the destination is a data warehouse or a data lake
○ The source can be DW or DL itself in the case of ELT
○ Sometimes the flow is the other way around (Reverse ETL)
● In Airflow, a data pipeline is called a DAG (Directed Acyclic Graph)

[Diagram: Extract → Transform → Load]
ETL (Extract, Transform, Load)

● Extract:
○ The process of reading data from the data source, usually through API calls.
● Transform:
○ The process of changing the format of the raw data into a desired form
● Load:
○ The final step of loading the data into the data warehouse as tables.

[Diagram: Extract → Transform → Load]
Types of Data Pipeline

1. ETL
2. ELT ("T")
3. Reverse ETL

[Diagram: Data Source ↔ Data Warehouse — (1) ETL loads source data into the warehouse, (2) ELT transforms data inside the warehouse, (3) Reverse ETL sends data back to the source]
What is Airflow?
● An open source project that started at Airbnb (along with Superset)
● The de facto data pipeline management framework
○ You can write data pipelines and manage them (scheduling, triggering, …)
○ Python modules are provided along with a Web UI
○ So popular that most cloud providers offer Airflow as a managed service
■ Google Cloud supports Airflow ("Cloud Composer")
● A data pipeline in Airflow is called a DAG (Directed Acyclic Graph)
Key Concepts
Key Concepts to Look at
● Full Refresh, Incremental Update
● Idempotency
○ SQL Transaction
● Backfill
● Upsert
○ Most efficient way of ensuring the primary key uniqueness
○ Snowflake supports an operation called MERGE
Full Refresh vs. Incremental Update #1
● Full Refresh
○ Reading all of the data from the source and recreating the corresponding table
○ The simplest and cleanest approach, but it only works for small data sources
● Incremental Update
○ For large data sources, doing a full refresh is not feasible
○ Doing periodic updates such as daily or hourly
■ Daily incremental update: reading previous day’s data
■ Hourly incremental update: reading previous hour’s data
○ Effective but more complex to manage
■ What if the entire source data format changes?
■ What if updates from certain hours or days are found to be incorrect or missing?
Full Refresh vs. Incremental Update #2
● Prefer full refresh whenever possible: everything gets easier
● If incremental update is the only way, the data source needs to support the
following:
○ Given a date/time range, return the records added or modified during that window.
○ Deletion shouldn't be allowed. If deletion is needed, introduce a deletion flag instead (see the sketch below).
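A minimal sketch of the two update modes, assuming a hypothetical source API that exposes read_all_records() and read_records_between(start, end); the names are illustrative, not from any specific library.

from datetime import date, timedelta

def full_refresh(read_all_records):
    # Read everything and rebuild the table from scratch:
    # the simplest approach, feasible only for small sources.
    return read_all_records()

def daily_incremental(read_records_between, day=None):
    # Read only the records added or modified on the previous day.
    # read_records_between(start, end) is assumed to return records created or
    # modified in [start, end); deletions should show up as a deletion flag,
    # not as rows silently disappearing from the source.
    day = day or date.today() - timedelta(days=1)
    return read_records_between(day, day + timedelta(days=1))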

Idempotency
● Running the data pipeline multiple times with the same input data should not
change the contents of the final table
○ For example, duplicate data should not be created
● All critical points must be executed as one atomic action
○ This is why SQL transactions are an essential technique
Backfill
● Rerunning data pipelines to ensure complete & accurate data from the source
○ There are various reasons why past data might need to be re-read
● For full refresh, running the data pipeline again after fixing issues should work
○ Backfill isn’t really needed
● For incremental update, find missing or wrong time slots and rerun only those
○ If backfill is easy to perform, the lives of data engineers become much easier
■ Airflow makes backfill easier (but learning it takes some effort)
○ Ensuring the primary key uniqueness is required in some use cases
Upsert

● UPSERT = INSERT + UPDATE
○ Operates based on the Primary Key
○ If the record exists, update with new information
○ If the record doesn't exist, insert as a new record
○ What about deletion? Don’t delete records but flag a field (such as deleted)
● Data warehouses support UPSERT
○ Snowflake has a SQL operation called MERGE for this purpose
MERGE SQL: Snowflake’s Upsert
● Practice
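A hedged sketch of what the MERGE practice might look like, run through the Snowflake Python connector; the staging table dev.raw.country_capital_stage holding newly extracted rows is an assumption made for illustration, as is the already-connected cursor cur.

# Upsert rows from a (hypothetical) staging table into the target table:
# update the capital when the country already exists, insert a new row otherwise.
merge_sql = """
MERGE INTO dev.raw.country_capital t
USING dev.raw.country_capital_stage s
  ON t.country = s.country
WHEN MATCHED THEN
  UPDATE SET t.capital = s.capital
WHEN NOT MATCHED THEN
  INSERT (country, capital) VALUES (s.country, s.capital)
"""
cur.execute(merge_sql)   # cur is an already-connected Snowflake cursor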
Refreshing Python Syntax
What to learn: Basics
● Will use Google Colab

● Data Types
○ Number, string, list, dictionary, …
○ Slice operator
○ Formatted string literals
● For loop
● Functions
○ Definition and invocation
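A compact refresher touching each item above; the values are arbitrary examples.

# Data types: number, string, list, dictionary
numbers = [1, 2, 3, 4, 5]
person = {"name": "Ada", "age": 36}

# Slice operator and formatted string literals (f-strings)
print(numbers[1:3])                                       # [2, 3]
print(f"{person['name']} is {person['age']} years old")   # Ada is 36 years old

# Function definition with a for loop inside
def double(values):
    result = []
    for v in values:
        result.append(v * 2)
    return result

# Function invocation
print(double(numbers))                                    # [2, 4, 6, 8, 10]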
What to learn: More Advanced Topics
● Import modules
○ Installing modules via pip
● Downloading a web page: the "requests" module
○ This is in essence calling an API
● Error handling
○ try/except
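A short sketch combining the three topics above (requests comes preinstalled on Colab; otherwise install it with pip). The URL is the country/capital file used in the ETL practice later.

import requests

url = "https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv"

try:
    r = requests.get(url)        # downloading a web page is in essence an API call
    r.raise_for_status()         # turn HTTP error status codes into exceptions
    print(r.text[:100])          # first 100 characters of the response body
except requests.RequestException as e:
    print("download failed:", e)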
Google Colab Practice
● Refreshing Python
First Data Pipeline (ETL)
ETL Overview Practice

● Copy a country/capital CSV file from the web into a table in Snowflake
● Written in Python on Google Colab

[Diagram: S3 CSV file → a table in Snowflake]
ETL Overview - 1. Download the source CSV file

● Download the following file
○ https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv
○ The file has two fields (country & capital)

country capital
Abkhazia Sukhumi
Afghanistan Kabul
Albania Tirana
Algeria Algiers
American Samoa Pago Pago
ETL Overview - 2. Create a table in Snowflake

CREATE TABLE dev.raw.country_capital (
    country varchar primary key,
    capital varchar
);
ETL Overview - Overall Structure

● Written in Python on Colab: consists of three functions (tasks)
○ extract, transform, load (a fuller sketch follows below)

def extract(url):
    # return the data at url
    ...
    return data

def transform(data):
    # transform data into a country/capital list and return it
    ...
    return list

def load(list):
    # insert the records in list into Snowflake
    ...
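A minimal end-to-end sketch of the three tasks for the country/capital file, assuming the snowflake-connector-python package and an already-open cursor named cur; like the first version discussed in the next section, it does not yet skip the header row, deduplicate, or use a transaction.

import requests

def extract(url):
    # Download the CSV file and return its raw text
    return requests.get(url).text

def transform(text):
    # Split the CSV text into [country, capital] pairs
    # (note: this naive version also keeps the header line)
    records = []
    for line in text.strip().split("\n"):
        country, capital = line.split(",")
        records.append([country, capital])
    return records

def load(records):
    # Insert every record into the table created earlier (no DELETE, no transaction yet)
    for country, capital in records:
        cur.execute(
            "INSERT INTO dev.raw.country_capital (country, capital) VALUES (%s, %s)",
            (country, capital),
        )

url = "https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv"
load(transform(extract(url)))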
ETL Practice

● Data Pipeline Example from Google Colab


Break
Improving the Data Pipeline
Three issues were identified

1. The header row is currently being loaded as a record.
2. Idempotency is not being maintained. Duplicate records are created each
time the process is executed.
3. If the SQL fails midway, it could compromise data integrity.
○ Transactions should be used
What is SQL Transaction? (1)

● What if there are tasks that would be left in an incomplete state if they fail
midway?
● In the example below, what if the withdrawal is successful but a problem
occurs during the transfer?

[Diagram: Withdrawal from my account → Transfer to another person's account]
(Bank transfer process: when I send money to another person)
What is SQL Transaction? (2)

● A method for grouping SQL statements that need to be executed as a single operation
● Place the SQL statements between BEGIN and END/COMMIT
● ROLLBACK (or ABORT) is an SQL statement that means "return to the state before BEGIN"
What is SQL Transaction? (3)
[Diagram: BEGIN → SQL 1 → SQL 2 → … → SQL N → COMMIT/END or ROLLBACK/ABORT]
● COMMIT/END: if all operations succeed, the temporary state becomes the final state.
● ROLLBACK/ABORT: if even one operation fails midway, the system reverts to its original state.
● The SQL results during the transaction are a temporary state: they are not visible to other sessions until committed.
Best Practices in using SQL transaction with Python

● In Python, it is common to use try/except to handle errors
○ If an error occurs within the try block, a rollback is explicitly executed
○ If no error occurs, a commit is executed

try:
    cur.execute("BEGIN;")
    # execute SQLs that need to be run atomically
    cur.execute("COMMIT;")   # same as cur.execute("END;")
except Exception as e:
    cur.execute("ROLLBACK;")
    raise
ETL v2 Practice

● Data Pipeline Example from Google Colab
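A hedged sketch of what the v2 load step might look like, reusing the cur, extract, transform, and url names assumed in the earlier sketch and combining the three fixes: skip the header row, wipe the table before re-inserting (idempotency), and wrap the whole load in a transaction.

def load_v2(records):
    # records[0] is the header row ("country", "capital"), so it is skipped below
    try:
        cur.execute("BEGIN;")
        # Full refresh: empty the table first so reruns don't create duplicates
        cur.execute("DELETE FROM dev.raw.country_capital;")
        for country, capital in records[1:]:
            cur.execute(
                "INSERT INTO dev.raw.country_capital (country, capital) VALUES (%s, %s)",
                (country, capital),
            )
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")  # undo everything since BEGIN if any statement failed
        raise

load_v2(transform(extract(url)))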


Demo: Airflow Installation on Docker
Homework
Homework: Stock Price ETL in Python (15 pt)
● This homework should be done in Python using Google Colab. Use this link as a reference.
● Use Alpha Vantage API (look for “TIME_SERIES_DAILY” API)

1. (+1) Pick up a stock symbol and get your own API key from Alpha Vantage
2. (+1) Secure your Snowflake credentials and Alpha Vantage API key (don't expose them in the code)
3. (+2) Read the last 90 days of the price info via the API (refer to the code snippet & you need to add "date")
a. With regard to adding "date", please look at the next slide
4. (+1) Create a table under “raw” schema if it doesn’t exist to capture the info from the API
a. symbol, date, open, close, high, low, volume: symbol and date should be primary keys
5. (+1) Delete all records from the table
6. (+1) Populate the table with the records from step 2 using INSERT SQL (refer to the relevant code snippet as a
starting point)
7. (+4) Steps 4 and 6 need to be done together
a. Use try/except along with SQL transaction. (use the code here as reference)
8. (+1) Demonstrate your work ensures Idempotency by running your pipeline (from extract to load) twice in a row
and checking the number of records (the number needs to remain the same)
9. (+2) Follow today’s demo and capture Docker Desktop screen showing Airflow

(+1) Overall formatting


How to add “date” to return_last_90d_price
import requests
from google.colab import userdata   # Colab Secrets keep the API key out of the code

def return_last_90d_price(symbol):
    vantage_api_key = userdata.get('vantage_api_key')
    url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={vantage_api_key}'
    r = requests.get(url)
    data = r.json()
    results = []   # empty list to hold the 90 days of stock info (open, high, low, close, volume)
    for d in data["Time Series (Daily)"]:   # here d is a date: "YYYY-MM-DD"
        stock_info = data["Time Series (Daily)"][d]
        stock_info["date"] = d
        results.append(stock_info)
    # an example of data["Time Series (Daily)"][d] after adding "date" is
    # {'1. open': '117.3500', '2. high': '119.6600', '3. low': '117.2500', '4. close': '117.8700', '5. volume': '286038878', 'date': '2024-09-17'}
    return results
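A sketch of how steps 4-7 of the homework could fit together, assuming a Snowflake cursor cur and the results list returned above; the table name dev.raw.stock_price is an assumption. Since DDL commits implicitly in Snowflake, the CREATE TABLE sits just before the explicit transaction that covers the delete and the inserts.

def load_stock_prices(cur, symbol, results):
    try:
        # Step 4: create the table under the raw schema if it doesn't exist
        cur.execute("""
            CREATE TABLE IF NOT EXISTS dev.raw.stock_price (
                symbol varchar, date date,
                open float, close float, high float, low float, volume integer,
                PRIMARY KEY (symbol, date)
            );""")
        cur.execute("BEGIN;")
        # Step 5: delete all records so reruns stay idempotent
        cur.execute("DELETE FROM dev.raw.stock_price;")
        # Step 6: insert the records returned by return_last_90d_price
        for r in results:
            cur.execute(
                """INSERT INTO dev.raw.stock_price (symbol, date, open, close, high, low, volume)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                (symbol, r["date"], r["1. open"], r["4. close"], r["2. high"], r["3. low"], r["5. volume"]),
            )
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")
        raise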
Homework: Step 9 screenshot example
More on Lab #1
Building a Stock Price Prediction Analytics using Snowflake & Airflow

[Diagram: yfinance APIs (Stock Info #1, Stock Info #2) → ETL → Data Warehouse → a task to forecast the next 14 days of price (ELT)]
Group Project & Final Exam
Group Project Proposal
● Form a team of 4: due by next week
● Proposal Format
○ One paragraph abstract
○ Dataset or API links and short description (if applicable)
○ Reference links
● Group Project Presentation: Week 16
○ May 8th
○ Each team will be given 10 minutes to present
● Grading criteria
○ More details to be shared
○ Teamwork!
Final Exam
● Set to May 15th: 3:15 - 5:15 PM
● For those of you who have conflicts
○ Make-up Exam date is set to May 21st but tentative at the moment
○ Email me with details of the conflicting class
Airflow Docker Demo
Next Week: Airflow!
Homework: Stock Price ETL in Python (15 pt)
● Implement Full Refresh using yfinance API
● This homework should be done in Python using Google Colab. Use this link as a reference.

1. Pick up a stock symbol


2. (+1) Secure your Snowflake credentials via Secrets (don't expose them in the code)
3. (+2) Read all historical price info via the API (refer to the relevant code snippet)
4. (+1) Create a table under “raw” schema if it doesn’t exist to capture the info from the API
a. symbol, date, open, close, high, low, volume: symbol and date should be primary keys
5. (+1) Delete all records from the table
6. (+2) Populate the table with the records from step 3 using INSERT SQL
7. (+4) Steps 5 and 6 need to be done together
a. Use try/except along with SQL transaction. (use the code here as reference)
8. (+1) Demonstrate your work ensures Idempotency by running it twice in a row and checking
the number of records (the number needs to remain the same)
9. (+2) Follow today’s demo and capture Docker Desktop screen

(+1) Overall formatting
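A minimal sketch of the extract step for this full-refresh version, using the yfinance package (pip install yfinance); the reshaping into per-row dictionaries is an assumption about what the load step might expect.

import yfinance as yf

def extract_all_history(symbol):
    # Full refresh: read the entire price history for the symbol
    df = yf.Ticker(symbol).history(period="max")   # DataFrame indexed by date with Open/High/Low/Close/Volume
    records = []
    for d, row in df.iterrows():
        records.append({
            "symbol": symbol,
            "date": d.date(),
            "open": row["Open"],
            "close": row["Close"],
            "high": row["High"],
            "low": row["Low"],
            "volume": row["Volume"],
        })
    return records

# records = extract_all_history("AAPL")   # hypothetical symbol choice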
