
08 - Data Pipelines Presentation

The document discusses data pipelines and ETL/ELT processes. It covers key aspects of data pipelines including being holistic, incremental, iterative, reusable, documented, and auditable. It also discusses the steps to design a data pipeline including conceptual, logical and physical models as well as source to target mappings and workflow. The document then covers extract, transform and load steps in ETL/ELT and differences between the two approaches.


Data Pipelines

Data Pipelines Rules

• A key deliverable in business intelligence (BI) is providing consistent, comprehensive, clean, conformed, and current information for business decision making.
1. Holistic: avoid costly overlaps and inconsistencies.
2. Incremental: more manageable and practical.
3. Iterative: discover and learn from each individual project.
4. Reusable: ensure consistency.
5. Documented: identify data for reuse and create leverage for future projects.
6. Auditable: necessary for government regulations and industry standards.
Data Pipeline Design

• A data pipeline is designed as a process. The steps are:
  • Create a stage-related conceptual data integration process model.
  • Create a stage-related logical data integration process model.
  • Design a stage-related physical data integration process model.
  • Design stage-related source to target mappings.
  • Design the overall data pipeline workflow.
• Please refer back to the Information Architecture and Data Architecture.
Data Mapping : Source Tracking
Data Mapping : Source Tracking
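
The mapping slides themselves are diagrams, but as a rough illustration, a stage-related source-to-target mapping can also be captured as plain data so it stays documented and auditable. The table names, column names, and transformation rules below are hypothetical examples, not taken from the slides; Python is used here, as in the later sketches.

```python
# A minimal, hypothetical source-to-target mapping captured as plain Python data.
# Each entry records where a target column comes from and which rule transforms it.
SOURCE_TO_TARGET_MAPPING = [
    {
        "target_table": "dim_customer",
        "target_column": "customer_name",
        "source_system": "crm",
        "source_table": "customers",
        "source_column": "cust_nm",
        "transformation": "trim and title-case",
    },
    {
        "target_table": "dim_customer",
        "target_column": "gender",
        "source_system": "crm",
        "source_table": "customers",
        "source_column": "gender_raw",
        "transformation": "map 'Male' -> 'M', 'Female' -> 'F'",
    },
]

def describe_mapping(mapping):
    """Print a readable lineage line for each target column."""
    for rule in mapping:
        print(
            f"{rule['target_table']}.{rule['target_column']} <- "
            f"{rule['source_system']}.{rule['source_table']}.{rule['source_column']} "
            f"({rule['transformation']})"
        )

if __name__ == "__main__":
    describe_mapping(SOURCE_TO_TARGET_MAPPING)
```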
Data Pipelines Workflow

• The product-specific workflow with all data and data pipeline components documented
Introduction to ETL and ELT

• ETL stands for Extract, Transform and Load. ETL extracts data from source systems, transforms the data for analysis, and loads it into a data warehouse.
• In ELT, extraction of data happens first; the data is then loaded into the target system and the transformation happens inside the target system.
• ETL transforms data first and then loads it, whereas ELT loads first and then transforms inside the target.
• Both approaches achieve moving data from sources to a target data warehouse.
Why ETL

• There are many reasons for adopting ETL:
  • helps companies to analyze their business data
  • provides a method of moving data from various sources into a data warehouse
  • allows verification of data transformation, aggregation, and calculation rules
  • allows sample data comparison between the source and the target system
  • offers deep historical context for the business
Extract
Data is extracted from the source system into the staging area. The staging area gives an opportunity to validate extracted data before it moves into the data warehouse.

• Two reasons:
  • Performance of the source system is not degraded.
  • If corrupted data is copied directly from the source into the data warehouse, rollback will be a challenge.

• The data warehouse needs to integrate source systems that have different DBMSs, hardware, operating systems, and communication protocols.
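
A minimal sketch of this extract-into-staging idea, assuming a SQLite source database, a local staging directory, and a hypothetical table name, with pandas assumed to be available. The point is simply that the source is read with a plain query and the raw copy is parked in staging before any warehouse load.

```python
import pathlib
import sqlite3

import pandas as pd

STAGING_DIR = pathlib.Path("staging")        # hypothetical staging area on disk
SOURCE_DB = "source_system.db"               # hypothetical source database

def extract_to_staging(table_name: str) -> pathlib.Path:
    """Copy a source table into the staging area as a CSV file."""
    STAGING_DIR.mkdir(exist_ok=True)
    with sqlite3.connect(SOURCE_DB) as conn:
        # Keep the extract query simple so the source system is not
        # degraded by heavy transformation work.
        df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    staging_file = STAGING_DIR / f"{table_name}.csv"
    df.to_csv(staging_file, index=False)
    return staging_file
```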
Extract
• Three data extraction methods:
  • Full Extraction
  • Partial Extraction, without update notification
  • Partial Extraction, with update notification

• Examples of validation types:
  • Reconcile records with the source data
  • Make sure that no spam/unwanted data is loaded
  • Data type checks
  • Remove all types of duplicate/fragmented data
  • Check whether all the keys are in place or not
Transform
Data extracted from the source server is commonly raw and not usable in its original form.
• It needs to be cleansed, mapped, and transformed.
• Data that does not require any transformation is called direct move or pass-through data.
Transform
Following are common data integrity problems:
• Different spellings of the same person, like Jon, John, etc.
• Multiple ways to denote a company name, like Google, Google Inc.
• Use of different names, like Cleaveland, Cleveland.
• Different account numbers generated by various applications for the same customer.
• Required fields left blank in some data.
• Invalid products collected at POS, as manual entry can lead to mistakes.
Transform
Validations are done during this stage:
• Filtering: select only certain columns to load
• Using rules and lookup tables for data standardization
• Character set conversion and encoding handling
• Conversion of units of measurement (date/time, currency, numerical conversions)
• Data threshold validation checks (for example, age cannot be more than two digits)
• Data flow validation from the staging area to the intermediate tables
Transform
• Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F")
• Splitting a column into multiple columns and merging multiple columns into a single column
• Transposing rows and columns; using lookups to merge data
• Applying complex data validation (e.g., if the first two columns in a row are empty, automatically reject the row from processing)
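
A minimal pandas sketch of the transformation rules listed on these slides (filtering, cleaning, standardization, threshold checks, column splitting, and row rejection); the column names and thresholds are hypothetical, and pandas is assumed to be available.

```python
import pandas as pd

def transform(staged: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules described above to a staged DataFrame."""
    df = staged.copy()

    # Filtering: keep only the columns the target model needs.
    df = df[["customer_id", "full_name", "gender", "age", "amount"]]

    # Complex validation: reject rows whose first two columns are empty.
    df = df.dropna(subset=["customer_id", "full_name"])

    # Cleaning: map NULL amounts to 0 and standardize gender codes.
    df["amount"] = df["amount"].fillna(0)
    df["gender"] = df["gender"].map({"Male": "M", "Female": "F"})

    # Threshold validation: age cannot be more than two digits.
    df = df[df["age"].between(0, 99)]

    # Splitting: break full_name into first and last name columns.
    df[["first_name", "last_name"]] = df["full_name"].str.split(pat=" ", n=1, expand=True)

    return df
```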
Load
• Loading data into the target data warehouse.
• In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short window (typically overnight).
• The load process should be optimized for performance.

• Types of loading:
  • Initial Load: populating all the data warehouse tables for the first time
  • Incremental Load: applying ongoing changes periodically, as needed
  • Full Refresh: erasing the contents of one or more tables and reloading them with fresh data
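
A minimal sketch of the Initial Load and Full Refresh patterns, assuming pandas and a SQLite file standing in for the warehouse; the database and table names are hypothetical. The Incremental Load pattern is sketched later under Incremental Loading.

```python
import sqlite3

import pandas as pd

WAREHOUSE_DB = "warehouse.db"   # hypothetical target data warehouse

def initial_load(df: pd.DataFrame, table: str) -> None:
    """Initial Load: populate a warehouse table for the first time.

    if_exists="fail" raises an error if the table already exists,
    which keeps this path strictly for first-time population.
    """
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        df.to_sql(table, conn, if_exists="fail", index=False)

def full_refresh(df: pd.DataFrame, table: str) -> None:
    """Full Refresh: erase the table contents and reload with fresh data."""
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
```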
ETL vs ELT
• Despite its many benefits, the ETL process is prone to breaking when any change occurs in the source systems or the target warehouse.
• Instead of transforming the data before it is written, ELT lets the target system do the transformation.
  • The data is first copied to the target and then transformed in place.
• ELT is usually used with NoSQL databases, such as a Hadoop cluster, a data appliance, or a cloud installation.
• The ELT process also works hand-in-hand with data lakes.
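
A minimal ELT-style sketch under the same assumptions (pandas plus a SQLite file standing in for the target system): the raw extract is loaded first, and the transformation then runs as SQL inside the target. Table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

TARGET_DB = "warehouse.db"   # hypothetical target standing in for a cloud warehouse

def elt_load_then_transform(raw: pd.DataFrame) -> None:
    """ELT: copy raw data into the target first, then transform it in place with SQL."""
    with sqlite3.connect(TARGET_DB) as conn:
        # Load: write the raw extract as-is into a landing table.
        raw.to_sql("raw_sales", conn, if_exists="replace", index=False)

        # Transform: the target system does the work, here as plain SQL.
        conn.executescript(
            """
            DROP TABLE IF EXISTS sales_clean;
            CREATE TABLE sales_clean AS
            SELECT
                order_id,
                UPPER(country)      AS country,
                COALESCE(amount, 0) AS amount
            FROM raw_sales
            WHERE order_id IS NOT NULL;
            """
        )

if __name__ == "__main__":
    sample = pd.DataFrame(
        {"order_id": [1, 2, None], "country": ["us", "de", "fr"], "amount": [10.0, None, 5.0]}
    )
    elt_load_then_transform(sample)
```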
ETL vs ELT
Implementing ETL Process

• Extract data from sources: extract or read data from various sources like databases, APIs, files, etc. This step collects all the data needed for transformation.
• Transform data: clean, filter, aggregate, join, enrich, normalize, etc. the extracted data into the desired format.
• Load data into target: write the transformed data into the target database, data warehouse, file, or other system.
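
Putting the three steps together, a minimal hand-rolled ETL sketch in Python, with pandas assumed to be available; the file, database, and table names are hypothetical.

```python
import sqlite3

import pandas as pd

def extract(csv_path: str) -> pd.DataFrame:
    """Extract: read raw records from a source file (hypothetical path)."""
    return pd.read_csv(csv_path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean, filter, and aggregate into the desired shape."""
    df = df.dropna(subset=["customer_id"])
    df["amount"] = df["amount"].fillna(0)
    return df.groupby("customer_id", as_index=False)["amount"].sum()

def load(df: pd.DataFrame, warehouse_db: str, table: str) -> None:
    """Load: write the transformed data into the target warehouse table."""
    with sqlite3.connect(warehouse_db) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

if __name__ == "__main__":
    # Hypothetical file, database, and table names.
    load(transform(extract("sales.csv")), "warehouse.db", "fact_sales")
```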
ETL Process
ETL Architecture

ELT Architecture
Benefits of ETL

• Increased Efficiency: ETL allows for faster data integration and processing, resulting in improved efficiency.
• Data Quality: ETL helps to ensure data quality by validating and cleaning data before it is loaded into the target system.
• Cost Savings: ETL can reduce costs associated with data integration by automating processes.
• Simplified Integration: ELT pipelines load raw data into the warehouse first, simplifying integration of diverse data sources.

Overall, ETL provides many benefits for data integration, including increased efficiency, improved data quality, and cost savings.
Choosing the Right Approach
Data integration is an important part of any business
process. Choosing the right approach for ETL vs ELT is
essential for successful data integration.
“A data pipeline is not a one-size-fits-all solution; ETL or ELT has its limitations and may not be the best choice for every situation.”
Data Pipeline in Practice
Hand-Coded ETL
• Uses custom code and a combination of libraries

• Pros
• Easy for small projects
• Cheap
• Does not require sophisticated data modeling

• Cons:
• Time-consuming
• Complicated
• Hard to document
• Not reusable
ETL Tool
• Time: hand-coded methods can take days or even weeks, depending on the amount and complexity of the data, just to prepare the data for analysis, even before any analysis is done.

• Reusability: processes built in an ETL tool can be saved and directly reused for other processes and data models as well. In manual coding, changes have to be made meticulously by a programmer.

• Management: because of automation, managing datasets becomes easier with an ETL tool. ETL tools provide a broader view of ETL processes, showing where the data is coming from, where it is going, and what calculations have been done on it.
Benefits ETL Tool
• Reusable dimensional processes ➔ productivity gain
• Robust data quality processes
• Workflow, error handling, and restart/recovery functionality
• Self-documentation of processes and workflow
• Data governance
Incremental Loading

• Only load new records: incremental loading only brings in new or updated records from the source systems since the last ETL run, avoiding reprocessing unchanged data.
• Reduce load times: by only processing a subset of data, incremental loading greatly speeds up load times compared to full loads.
• Requires change data capture: to identify new/changed records, source systems must have change data capture processes that flag recently added/updated data.

Incremental loading makes the ETL process more efficient by only loading changes since the previous run, improving load performance.
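
A minimal sketch of watermark-based incremental loading, assuming the source maintains an updated_at column as its change data capture signal and that the warehouse table already exists; the database, table, and column names are hypothetical, and a real pipeline would typically merge/upsert rather than blindly append.

```python
import sqlite3

import pandas as pd

def incremental_load(source_db: str, warehouse_db: str) -> int:
    """Load only rows changed since the last run, using an updated_at watermark."""
    with sqlite3.connect(warehouse_db) as wh:
        # High-water mark: the most recent change already loaded.
        row = wh.execute("SELECT MAX(updated_at) FROM fact_orders").fetchone()
        watermark = row[0] or "1970-01-01 00:00:00"

    with sqlite3.connect(source_db) as src:
        # Change data capture here is simply an updated_at column maintained
        # by the source system; only newer rows are pulled.
        changed = pd.read_sql_query(
            "SELECT * FROM orders WHERE updated_at > ?", src, params=(watermark,)
        )

    if not changed.empty:
        with sqlite3.connect(warehouse_db) as wh:
            # A production pipeline would usually upsert/merge instead of append.
            changed.to_sql("fact_orders", wh, if_exists="append", index=False)
    return len(changed)
```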
Error Handling

• Log errors: implement logging to record errors during ETL execution for debugging.
• Send alerts: configure alerts to notify teams when critical errors occur in ETL pipelines.
• Handle different error types: have specific error handling logic for different types of errors, like data errors, connection errors, transform errors, etc.

Robust error handling with logging, alerts, and handling of error types helps build resilient ETL processes that keep running in case of failures.
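
A minimal sketch of this error-handling pattern using Python's standard logging module: errors are logged, a placeholder alert hook is called for critical failures, and different exception types get different handling. The function, step, and file names are hypothetical.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO, filename="etl.log")
logger = logging.getLogger("etl")

def send_alert(message: str) -> None:
    """Placeholder alert hook; a real pipeline might call email, Slack, or a pager."""
    logger.critical("ALERT: %s", message)

def run_step(name, func, *args):
    """Run one ETL step with logging and error-type-specific handling."""
    try:
        return func(*args)
    except sqlite3.OperationalError as exc:        # connection / database errors
        logger.error("Connection error in %s: %s", name, exc)
        send_alert(f"{name} failed with a connection error")
        raise
    except (KeyError, ValueError) as exc:          # data / transform errors
        logger.error("Data error in %s: %s", name, exc)
        return None                                # skip the step, keep the pipeline running
    except Exception:
        logger.exception("Unexpected error in %s", name)
        raise
```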
Data Quality Checks

• Duplicate record checks: look for duplicate records based on unique identifiers or combinations of columns. Ensure each record is unique.
• Valid value checks: check data values against allowed domains, data types, and ranges to catch bad data.
• Consistency checks: validate ID references, sums, and counts across tables. Data should be consistent.

Doing rigorous data quality checks during ETL helps improve downstream data quality and prevent dirty data issues.
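
A minimal pandas sketch of these three kinds of checks on hypothetical orders and customers tables; the allowed values and ranges are illustrative only.

```python
import pandas as pd

def quality_checks(orders: pd.DataFrame, customers: pd.DataFrame) -> list[str]:
    """Return a list of data quality problems found in the staged data."""
    problems = []

    # Duplicate record check: order_id should uniquely identify a row.
    if orders.duplicated(subset=["order_id"]).any():
        problems.append("duplicate order_id values found")

    # Valid value checks: status must come from an allowed domain,
    # and amounts must fall inside a sane range.
    allowed_status = {"NEW", "SHIPPED", "CANCELLED"}
    if not orders["status"].isin(allowed_status).all():
        problems.append("unexpected status values")
    if not orders["amount"].between(0, 1_000_000).all():
        problems.append("amount out of range")

    # Consistency check: every order must reference an existing customer.
    if not orders["customer_id"].isin(customers["customer_id"]).all():
        problems.append("orders reference unknown customers")

    return problems
```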
Date Dimension

• Temporal analysis: date dimensions enable powerful time-series analysis of data by attributes like day of week, month, quarter, year, etc.
• Hierarchies: date attributes roll up into natural hierarchies such as day, month, quarter, and year.
• Holidays and events: date dimensions can store holidays, festivals, events, etc., enabling analysis by these temporal events.

A well-designed date dimension table structures time data, enabling powerful temporal analysis in a data warehouse.
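
A minimal pandas sketch of generating a date dimension with a few of the attributes mentioned above; the holiday flag is only a stand-in for a real business calendar.

```python
import pandas as pd

def build_date_dimension(start: str, end: str) -> pd.DataFrame:
    """Generate a simple date dimension with common temporal attributes."""
    dates = pd.date_range(start=start, end=end, freq="D")
    dim = pd.DataFrame({"date": dates})
    dim["date_key"] = dim["date"].dt.strftime("%Y%m%d").astype(int)
    dim["day_of_week"] = dim["date"].dt.day_name()
    dim["month"] = dim["date"].dt.month
    dim["quarter"] = dim["date"].dt.quarter
    dim["year"] = dim["date"].dt.year
    # Holiday flags would normally come from a business calendar;
    # this single fixed date is only an illustration.
    dim["is_holiday"] = dim["date"].dt.strftime("%m-%d").eq("01-01")
    return dim

if __name__ == "__main__":
    print(build_date_dimension("2024-01-01", "2024-12-31").head())
```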
Let's Code
Putting everything we have learned into practice
