
Week 5: Data Pipelines

Keeyong Han
Table Of Contents
1. Recap of the 4th Week
2. Interview Tips #1
3. Overview of Data Pipelines
4. Key Concepts
5. Refreshing Python Syntax
6. First Data Pipeline (ETL)
7. Improving the Data Pipeline
8. Homework #4
9. More on Lab #1
10. Group Project & Final Exam
11. Demo
Recap of the 4th Week
SQL
● The best language to manipulate structured data (small or big)
● The most important skill in the data domain
● Composed of two parts
○ DDL (Data Definition Language): CREATE/ALTER/DROP
○ DML (Data Manipulation Language): INSERT/DELETE/UPDATE/SELECT
■ COUNT DISTINCT
■ NULLIF, COALESCE
■ CTAS
■ WINDOW Functions
Quiz #1 Review
● Best 2 out of 3 quizzes will be graded
● The other two quizzes will follow the same format
Interview Tips #1
Resume
● The audience of your resume is not YOU but hiring managers and recruiters
● Readability matters
○ Work Experience and Projects should be more like telling your story
■ Not a list of what you have done!!
○ Categorize your Skills
○ Customize and optimize for the job you are applying for!
■ Look for frequent keywords in the job description
● Be friends with ChatGPT
Overview of Data Pipelines
Definition of Data Pipeline

● A series of processes that automatically move and transform data from one or
more sources to a destination
○ Most likely the destination is a data warehouse or a data lake
○ The source can be DW or DL itself in the case of ELT
○ Sometimes the flow is the other way around (Reverse ETL)
● In Airflow, a data pipeline is called a DAG (Directed Acyclic Graph)

[Diagram: Extract → Transform → Load]
ETL (Extract, Transform, Load)

● Extract:
○ The process of reading data from the data source, usually through API calls.
● Transform:
○ The process of changing the format of the raw data into a desired form
● Load:
○ The final step of loading the data into the data warehouse as tables.

[Diagram: Extract → Transform → Load]
Types of Data Pipeline

1. ETL
2. ELT ("T")
3. Reverse ETL

[Diagram: Data Source ↔ Data Warehouse — (1) ETL loads source data into the warehouse, (2) ELT transforms data inside the warehouse, (3) Reverse ETL sends data back to the source]
What is Airflow?
● An open source project that started at Airbnb (along with Superset)
● The de facto data pipeline management framework
○ You can write data pipelines and manage them (scheduling, triggering, …)
○ Python modules are provided along with a Web UI
○ So popular that most cloud providers offer Airflow as a managed service
■ Google Cloud supports Airflow ("Cloud Composer")
● A data pipeline in Airflow is called a DAG (Directed Acyclic Graph)
Key Concepts
Key Concepts to Look at
● Full Refresh, Incremental Update
● Idempotency
○ SQL Transaction
● Backfill
● Upsert
○ Most efficient way of ensuring the primary key uniqueness
○ Snowflake supports an operation called MERGE
Full Refresh vs. Incremental Update #1
● Full Refresh
○ Reading all of the data from the source and recreating the corresponding table
○ The simplest and cleanest approach, but it only works for small data sources
● Incremental Update
○ For large data sources, doing a full refresh is not feasible
○ Doing periodic updates such as daily or hourly
■ Daily incremental update: reading previous day’s data
■ Hourly incremental update: reading previous hour’s data
○ Effective but more complex to manage
■ What if the entire source data format changes?
■ What if updates from certain hours or days are found to be incorrect or missing?
Full Refresh vs. Incremental Update #2
● Prefer full refresh whenever possible: everything gets easier
● If incremental update is the only way, the data source needs to support the
following:
○ Given a date/time range, return the records added or modified during that window.
○ Deletion shouldn't be allowed. If deletion is needed, introduce a deletion flag instead (see the sketch below).
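A minimal sketch of the two update modes, assuming a hypothetical source API that exposes read_all_records() and read_records_between(start, end); the names are illustrative, not from any specific library.

from datetime import date, timedelta

def full_refresh(read_all_records):
    # Read everything and rebuild the table from scratch:
    # the simplest approach, feasible only for small sources.
    return read_all_records()

def daily_incremental(read_records_between, day=None):
    # Read only the records added or modified on the previous day.
    # read_records_between(start, end) is assumed to return records created or
    # modified in [start, end); deletions should show up as a deletion flag,
    # not as rows silently disappearing from the source.
    day = day or date.today() - timedelta(days=1)
    return read_records_between(day, day + timedelta(days=1))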

Idempotency
● Running the data pipeline multiple times with the same input data should not
change the contents of the final table
○ For example, duplicate data should not be created
● All critical points must be executed as one atomic action
○ This is why SQL transactions are an essential technique
Backfill
● Rerunning data pipelines to ensure complete & accurate data from the source
○ There are various reasons why past data might need to be re-read
● For full refresh, running the data pipeline again after fixing issues should work
○ Backfill isn’t really needed
● For incremental update, find missing or wrong time slots and rerun only those
○ If backfill is easy to perform, the lives of data engineers become much easier
■ Airflow makes backfill easier (but learning it takes some effort)
○ Ensuring the primary key uniqueness is required in some use cases
Upsert

● UPSERT = INSERT + UPDATE
○ Operates based on the Primary Key
○ If the record exists, update with new information
○ If the record doesn't exist, insert as a new record
○ What about deletion? Don’t delete records but flag a field (such as deleted)
● Data warehouses support UPSERT
○ Snowflake has a SQL operation called MERGE for this purpose
MERGE SQL: Snowflake’s Upsert
● Practice
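A hedged sketch of what the MERGE practice might look like, run through the Snowflake Python connector; the staging table dev.raw.country_capital_stage holding newly extracted rows is an assumption made for illustration, as is the already-connected cursor cur.

# Upsert rows from a (hypothetical) staging table into the target table:
# update the capital when the country already exists, insert a new row otherwise.
merge_sql = """
MERGE INTO dev.raw.country_capital t
USING dev.raw.country_capital_stage s
  ON t.country = s.country
WHEN MATCHED THEN
  UPDATE SET t.capital = s.capital
WHEN NOT MATCHED THEN
  INSERT (country, capital) VALUES (s.country, s.capital)
"""
cur.execute(merge_sql)   # cur is an already-connected Snowflake cursor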
Refreshing Python Syntax
What to learn: Basics
● Will use Google Colab

● Data Types
○ Number, string, list, dictionary, …
○ Slice operator
○ Formatted string literals
● For loop
● Functions
○ Definition and invocation
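A compact refresher touching each item above; the values are arbitrary examples.

# Data types: number, string, list, dictionary
numbers = [1, 2, 3, 4, 5]
person = {"name": "Ada", "age": 36}

# Slice operator and formatted string literals (f-strings)
print(numbers[1:3])                                       # [2, 3]
print(f"{person['name']} is {person['age']} years old")   # Ada is 36 years old

# Function definition with a for loop inside
def double(values):
    result = []
    for v in values:
        result.append(v * 2)
    return result

# Function invocation
print(double(numbers))                                    # [2, 4, 6, 8, 10]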
What to learn: More Advanced Topics
● Import modules
○ Installing modules via pip
● Downloading a web page: the "requests" module
○ This is in essence calling an API
● Error handling
○ try/except
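A short sketch combining the three topics above (requests comes preinstalled on Colab; otherwise install it with pip). The URL is the country/capital file used in the ETL practice later.

import requests

url = "https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv"

try:
    r = requests.get(url)        # downloading a web page is in essence an API call
    r.raise_for_status()         # turn HTTP error status codes into exceptions
    print(r.text[:100])          # first 100 characters of the response body
except requests.RequestException as e:
    print("download failed:", e)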
Google Colab Practice
● Refreshing Python
First Data Pipeline (ETL)
ETL Overview Practice

● Copy a country/capital CSV file from the web into a table in Snowflake
● Written in Python on Google Colab

[Diagram: S3 CSV file → a table in Snowflake]
ETL Overview - 1. Download the source CSV file

● Download the following file
○ https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv
○ The file has two fields (country & capital)

country capital
Abkhazia Sukhumi
Afghanistan Kabul
Albania Tirana
Algeria Algiers
American Samoa Pago Pago
ETL Overview - 2. Create a table in Snowflake

CREATE TABLE dev.raw.country_capital (
    country varchar primary key,
    capital varchar
);
ETL Overview - Overall Structure

● Written in Python on Colab: consists of three functions (tasks)
○ extract, transform, load (a fuller sketch follows below)

def extract(url):
    # return the data at url
    ...
    return data

def transform(data):
    # transform data into a country/capital list and return it
    ...
    return list

def load(list):
    # insert the records in list into Snowflake
    ...
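A minimal end-to-end sketch of the three tasks for the country/capital file, assuming the snowflake-connector-python package and an already-open cursor named cur; like the first version discussed in the next section, it does not yet skip the header row, deduplicate, or use a transaction.

import requests

def extract(url):
    # Download the CSV file and return its raw text
    return requests.get(url).text

def transform(text):
    # Split the CSV text into [country, capital] pairs
    # (note: this naive version also keeps the header line)
    records = []
    for line in text.strip().split("\n"):
        country, capital = line.split(",")
        records.append([country, capital])
    return records

def load(records):
    # Insert every record into the table created earlier (no DELETE, no transaction yet)
    for country, capital in records:
        cur.execute(
            "INSERT INTO dev.raw.country_capital (country, capital) VALUES (%s, %s)",
            (country, capital),
        )

url = "https://s3-geospatial.s3.us-west-2.amazonaws.com/country_capital.csv"
load(transform(extract(url)))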
ETL Practice

● Data Pipeline Example from Google Colab


Break
Improving the Data Pipeline
Three issues were identified

1. The header row is currently being loaded as a record.
2. Idempotency is not being maintained. Duplicate records are created each
time the process is executed.
3. If the SQL fails midway, it could compromise data integrity.
○ Transactions should be used
What is SQL Transaction? (1)

● What if there are tasks that would be left in an incomplete state if they fail
midway?
● In the example below, what if the withdrawal is successful but a problem
occurs during the transfer?

[Diagram: Withdrawal from my account → Transfer to another person's account]
(Bank transfer process: when I send money to another person)
What is SQL Transaction? (2)

● A method for grouping SQL statements that need to be executed as a single operation
● Place the SQL statements between BEGIN and END/COMMIT
● ROLLBACK (or ABORT) is an SQL statement that means "return to the state before BEGIN"
What is SQL Transaction? (3)
[Diagram: BEGIN → SQL 1 → SQL 2 → … → SQL N → COMMIT/END or ROLLBACK/ABORT]
● COMMIT/END: if all operations succeed, the temporary state becomes the final state.
● ROLLBACK/ABORT: if even one operation fails midway, the system reverts to its original state.
● The SQL results during the transaction are a temporary state: they are not visible to other sessions until committed.
Best Practices in using SQL transaction with Python

● In Python, it is common to use try/except to handle errors
○ If an error occurs within the try block, a rollback is explicitly executed
○ If no error occurs, a commit is executed

try:
    cur.execute("BEGIN;")
    # execute SQLs that need to be run atomically
    cur.execute("COMMIT;")   # same as cur.execute("END;")
except Exception as e:
    cur.execute("ROLLBACK;")
    raise
ETL v2 Practice

● Data Pipeline Example from Google Colab
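A hedged sketch of what the v2 load step might look like, reusing the cur, extract, transform, and url names assumed in the earlier sketch and combining the three fixes: skip the header row, wipe the table before re-inserting (idempotency), and wrap the whole load in a transaction.

def load_v2(records):
    # records[0] is the header row ("country", "capital"), so it is skipped below
    try:
        cur.execute("BEGIN;")
        # Full refresh: empty the table first so reruns don't create duplicates
        cur.execute("DELETE FROM dev.raw.country_capital;")
        for country, capital in records[1:]:
            cur.execute(
                "INSERT INTO dev.raw.country_capital (country, capital) VALUES (%s, %s)",
                (country, capital),
            )
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")  # undo everything since BEGIN if any statement failed
        raise

load_v2(transform(extract(url)))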


Demo: Airflow Installation on Docker
Homework
Homework: Stock Price ETL in Python (15 pt)
● This homework should be done in Python using Google Colab. Use this link as a reference.
● Use Alpha Vantage API (look for “TIME_SERIES_DAILY” API)

1. (+1) Pick up a stock symbol and get your own API key from Alpha Vantage
2. (+1) Secure your Snowflake credentials and Alpha Vantage API key (don't expose them in the code)
3. (+2) Read the last 90 days of the price info via the API (refer to the code snippet & you need to add "date")
a. With regard to adding "date", please look at the next slide
4. (+1) Create a table under “raw” schema if it doesn’t exist to capture the info from the API
a. symbol, date, open, close, high, low, volume: symbol and date should be primary keys
5. (+1) Delete all records from the table
6. (+1) Populate the table with the records from step 2 using INSERT SQL (refer to the relevant code snippet as a
starting point)
7. (+4) Steps 4 and 6 need to be done together
a. Use try/except along with SQL transaction. (use the code here as reference)
8. (+1) Demonstrate your work ensures Idempotency by running your pipeline (from extract to load) twice in a row
and checking the number of records (the number needs to remain the same)
9. (+2) Follow today’s demo and capture Docker Desktop screen showing Airflow

(+1) Overall formatting


How to add “date” to return_last_90d_price
import requests
from google.colab import userdata   # Colab Secrets keep the API key out of the code

def return_last_90d_price(symbol):
    vantage_api_key = userdata.get('vantage_api_key')
    url = f'https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol={symbol}&apikey={vantage_api_key}'
    r = requests.get(url)
    data = r.json()
    results = []   # empty list to hold the 90 days of stock info (open, high, low, close, volume)
    for d in data["Time Series (Daily)"]:   # here d is a date: "YYYY-MM-DD"
        stock_info = data["Time Series (Daily)"][d]
        stock_info["date"] = d
        results.append(stock_info)
    # an example of data["Time Series (Daily)"][d] after adding "date" is
    # {'1. open': '117.3500', '2. high': '119.6600', '3. low': '117.2500', '4. close': '117.8700', '5. volume': '286038878', 'date': '2024-09-17'}
    return results
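A sketch of how steps 4-7 of the homework could fit together, assuming a Snowflake cursor cur and the results list returned above; the table name dev.raw.stock_price is an assumption. Since DDL commits implicitly in Snowflake, the CREATE TABLE sits just before the explicit transaction that covers the delete and the inserts.

def load_stock_prices(cur, symbol, results):
    try:
        # Step 4: create the table under the raw schema if it doesn't exist
        cur.execute("""
            CREATE TABLE IF NOT EXISTS dev.raw.stock_price (
                symbol varchar, date date,
                open float, close float, high float, low float, volume integer,
                PRIMARY KEY (symbol, date)
            );""")
        cur.execute("BEGIN;")
        # Step 5: delete all records so reruns stay idempotent
        cur.execute("DELETE FROM dev.raw.stock_price;")
        # Step 6: insert the records returned by return_last_90d_price
        for r in results:
            cur.execute(
                """INSERT INTO dev.raw.stock_price (symbol, date, open, close, high, low, volume)
                   VALUES (%s, %s, %s, %s, %s, %s, %s)""",
                (symbol, r["date"], r["1. open"], r["4. close"], r["2. high"], r["3. low"], r["5. volume"]),
            )
        cur.execute("COMMIT;")
    except Exception:
        cur.execute("ROLLBACK;")
        raise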
Homework: Step 9 screenshot example
More on Lab #1
Building a Stock Price Prediction Analytics using Snowflake & Airflow

[Diagram: yfinance APIs (Stock Info #1, Stock Info #2) → ETL → Data Warehouse → a task to forecast the next 14 days of price (ELT)]
Group Project & Final Exam
Group Project Proposal
● Form a team of 4: due by next week
● Proposal Format
○ One paragraph abstract
○ Dataset or API links and short description (if applicable)
○ Reference links
● Group Project Presentation: Week 16
○ May 8th
○ Each team will be given 10 minutes to present
● Grading criteria
○ More details to be shared
○ Teamwork!
Final Exam
● Set to May 15th: 3:15 - 5:15 PM
● For those of you who have conflicts
○ Make-up Exam date is set to May 21st but tentative at the moment
○ Email me with details of the conflicting class
Airflow Docker Demo
Next Week: Airflow!
Homework: Stock Price ETL in Python (15 pt)
● Implement Full Refresh using yfinance API
● This homework should be done in Python using Google Colab. Use this link as a reference.

1. Pick up a stock symbol


2. (+1) Secure your Snowflake credentials via Secrets (don't expose them in the code)
3. (+2) Read all historical price info via the API (refer to the relevant code snippet)
4. (+1) Create a table under “raw” schema if it doesn’t exist to capture the info from the API
a. symbol, date, open, close, high, low, volume: symbol and date should be primary keys
5. (+1) Delete all records from the table
6. (+2) Populate the table with the records from step 3 using INSERT SQL
7. (+4) Steps 5 and 6 need to be done together
a. Use try/except along with SQL transaction. (use the code here as reference)
8. (+1) Demonstrate your work ensures Idempotency by running it twice in a row and checking
the number of records (the number needs to remain the same)
9. (+2) Follow today’s demo and capture Docker Desktop screen

(+1) Overall formatting
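A minimal sketch of the extract step for this full-refresh version, using the yfinance package (pip install yfinance); the reshaping into per-row dictionaries is an assumption about what the load step might expect.

import yfinance as yf

def extract_all_history(symbol):
    # Full refresh: read the entire price history for the symbol
    df = yf.Ticker(symbol).history(period="max")   # DataFrame indexed by date with Open/High/Low/Close/Volume
    records = []
    for d, row in df.iterrows():
        records.append({
            "symbol": symbol,
            "date": d.date(),
            "open": row["Open"],
            "close": row["Close"],
            "high": row["High"],
            "low": row["Low"],
            "volume": row["Volume"],
        })
    return records

# records = extract_all_history("AAPL")   # hypothetical symbol choice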
