
08 - Data Pipelines Presentation

The document discusses data pipelines and ETL/ELT processes. It covers key aspects of data pipelines including being holistic, incremental, iterative, reusable, documented, and auditable. It also discusses the steps to design a data pipeline including conceptual, logical and physical models as well as source to target mappings and workflow. The document then covers extract, transform and load steps in ETL/ELT and differences between the two approaches.


Data Pipelines

Data Pipelines Rules

• A key deliverable in business intelligence (BI) is providing consistent, comprehensive, clean, conformed, and current information for business decision making.
1. Holistic: avoid costly overlaps and inconsistencies.
2. Incremental: more manageable and practical.
3. Iterative: discover and learn from each individual project.
4. Reusable: ensure consistency.
5. Documented: identify data for reuse and create leverage for future projects.
6. Auditable: necessary for government regulations and industry standards.
Data Pipeline Design

• A data pipeline is designed as a process. The steps are:
  • Create a stage-related conceptual data integration process model.
  • Create a stage-related logical data integration process model.
  • Design a stage-related physical data integration process model.
  • Design stage-related source to target mappings.
  • Design the overall data pipeline workflow.
• Please refer back to the Information Architecture and Data Architecture.
Data Mapping : Source Tracking
Data Mapping : Source Tracking
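
The mapping slides themselves are diagrams, but as a rough illustration, a stage-related source-to-target mapping can also be captured as plain data so it stays documented and auditable. The table names, column names, and transformation rules below are hypothetical examples, not taken from the slides; Python is used here, as in the later sketches.

```python
# A minimal, hypothetical source-to-target mapping captured as plain Python data.
# Each entry records where a target column comes from and which rule transforms it.
SOURCE_TO_TARGET_MAPPING = [
    {
        "target_table": "dim_customer",
        "target_column": "customer_name",
        "source_system": "crm",
        "source_table": "customers",
        "source_column": "cust_nm",
        "transformation": "trim and title-case",
    },
    {
        "target_table": "dim_customer",
        "target_column": "gender",
        "source_system": "crm",
        "source_table": "customers",
        "source_column": "gender_raw",
        "transformation": "map 'Male' -> 'M', 'Female' -> 'F'",
    },
]

def describe_mapping(mapping):
    """Print a readable lineage line for each target column."""
    for rule in mapping:
        print(
            f"{rule['target_table']}.{rule['target_column']} <- "
            f"{rule['source_system']}.{rule['source_table']}.{rule['source_column']} "
            f"({rule['transformation']})"
        )

if __name__ == "__main__":
    describe_mapping(SOURCE_TO_TARGET_MAPPING)
```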
Data Pipelines Workflow

• The product-specific workflow with all data and data pipeline components documented
Introduction to ETL and ELT

• ETL stands for Extract, Transform and Load. ETL extracts data from source systems, transforms the data for analysis, and loads it into a data warehouse.
• In ELT, extraction of data happens first; the data is then loaded into the target system and the transformation happens inside the target system.
• ETL transforms data first and then loads it, whereas ELT loads first and then transforms inside the target.
• Both approaches achieve moving data from sources to a target data warehouse.
Why ETL

• There are many reasons for adopting ETL:
  • helps companies to analyze their business data
  • provides a method of moving data from various sources into a data warehouse
  • allows verification of data transformation, aggregation, and calculation rules
  • allows sample data comparison between the source and the target system
  • offers deep historical context for the business
Extract
Data is extracted from the source system into the staging area. The staging area gives an opportunity to validate extracted data before it moves into the data warehouse.

• Two reasons:
  • Performance of the source system is not degraded.
  • If corrupted data is copied directly from the source into the data warehouse, rollback will be a challenge.

• The data warehouse needs to integrate source systems that have different DBMSs, hardware, operating systems, and communication protocols.
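
A minimal sketch of this extract-into-staging idea, assuming a SQLite source database, a local staging directory, and a hypothetical table name, with pandas assumed to be available. The point is simply that the source is read with a plain query and the raw copy is parked in staging before any warehouse load.

```python
import pathlib
import sqlite3

import pandas as pd

STAGING_DIR = pathlib.Path("staging")        # hypothetical staging area on disk
SOURCE_DB = "source_system.db"               # hypothetical source database

def extract_to_staging(table_name: str) -> pathlib.Path:
    """Copy a source table into the staging area as a CSV file."""
    STAGING_DIR.mkdir(exist_ok=True)
    with sqlite3.connect(SOURCE_DB) as conn:
        # Keep the extract query simple so the source system is not
        # degraded by heavy transformation work.
        df = pd.read_sql_query(f"SELECT * FROM {table_name}", conn)
    staging_file = STAGING_DIR / f"{table_name}.csv"
    df.to_csv(staging_file, index=False)
    return staging_file
```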
Extract
• Three data extraction methods:
  • Full Extraction
  • Partial Extraction, without update notification
  • Partial Extraction, with update notification

• Examples of validation types:
  • Reconcile records with the source data
  • Make sure that no spam/unwanted data is loaded
  • Data type checks
  • Remove all types of duplicate/fragmented data
  • Check whether all the keys are in place or not
Transform
Data extracted from the source server is commonly raw and not usable in its original form.
• It needs to be cleansed, mapped, and transformed.
• Data that does not require any transformation is called direct move or pass-through data.
Transform
Following are common data integrity problems:
• Different spellings of the same person, like Jon, John, etc.
• Multiple ways to denote a company name, like Google, Google Inc.
• Use of different names, like Cleaveland, Cleveland.
• Different account numbers generated by various applications for the same customer.
• Required fields left blank in some data.
• Invalid products collected at POS, as manual entry can lead to mistakes.
Transform
Validations are done during this stage:
• Filtering: select only certain columns to load
• Using rules and lookup tables for data standardization
• Character set conversion and encoding handling
• Conversion of units of measurement (date/time, currency, numerical conversions)
• Data threshold validation checks (for example, age cannot be more than two digits)
• Data flow validation from the staging area to the intermediate tables
Transform
• Cleaning (for example, mapping NULL to 0, or Gender "Male" to "M" and "Female" to "F")
• Splitting a column into multiple columns and merging multiple columns into a single column
• Transposing rows and columns; using lookups to merge data
• Applying complex data validation (e.g., if the first two columns in a row are empty, automatically reject the row from processing)
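
A minimal pandas sketch of the transformation rules listed on these slides (filtering, cleaning, standardization, threshold checks, column splitting, and row rejection); the column names and thresholds are hypothetical, and pandas is assumed to be available.

```python
import pandas as pd

def transform(staged: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules described above to a staged DataFrame."""
    df = staged.copy()

    # Filtering: keep only the columns the target model needs.
    df = df[["customer_id", "full_name", "gender", "age", "amount"]]

    # Complex validation: reject rows whose first two columns are empty.
    df = df.dropna(subset=["customer_id", "full_name"])

    # Cleaning: map NULL amounts to 0 and standardize gender codes.
    df["amount"] = df["amount"].fillna(0)
    df["gender"] = df["gender"].map({"Male": "M", "Female": "F"})

    # Threshold validation: age cannot be more than two digits.
    df = df[df["age"].between(0, 99)]

    # Splitting: break full_name into first and last name columns.
    df[["first_name", "last_name"]] = df["full_name"].str.split(pat=" ", n=1, expand=True)

    return df
```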
Load
• Loading data into the target data warehouse.
• In a typical data warehouse, a huge volume of data needs to be loaded in a relatively short window (typically overnight).
• The load process should be optimized for performance.

• Types of loading:
  • Initial Load: populating all the data warehouse tables for the first time
  • Incremental Load: applying ongoing changes periodically, as needed
  • Full Refresh: erasing the contents of one or more tables and reloading them with fresh data
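
A minimal sketch of the Initial Load and Full Refresh patterns, assuming pandas and a SQLite file standing in for the warehouse; the database and table names are hypothetical. The Incremental Load pattern is sketched later under Incremental Loading.

```python
import sqlite3

import pandas as pd

WAREHOUSE_DB = "warehouse.db"   # hypothetical target data warehouse

def initial_load(df: pd.DataFrame, table: str) -> None:
    """Initial Load: populate a warehouse table for the first time.

    if_exists="fail" raises an error if the table already exists,
    which keeps this path strictly for first-time population.
    """
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        df.to_sql(table, conn, if_exists="fail", index=False)

def full_refresh(df: pd.DataFrame, table: str) -> None:
    """Full Refresh: erase the table contents and reload with fresh data."""
    with sqlite3.connect(WAREHOUSE_DB) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)
```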
ETL vs ELT
• Despite its many benefits, the ETL process is prone to breaking when any change occurs in the source systems or the target warehouse.
• Instead of transforming the data before it is written, ELT lets the target system do the transformation.
  • The data is first copied to the target and then transformed in place.
• ELT is usually used with NoSQL databases, such as a Hadoop cluster, a data appliance, or a cloud installation.
• The ELT process also works hand-in-hand with data lakes.
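
A minimal ELT-style sketch under the same assumptions (pandas plus a SQLite file standing in for the target system): the raw extract is loaded first, and the transformation then runs as SQL inside the target. Table and column names are hypothetical.

```python
import sqlite3

import pandas as pd

TARGET_DB = "warehouse.db"   # hypothetical target standing in for a cloud warehouse

def elt_load_then_transform(raw: pd.DataFrame) -> None:
    """ELT: copy raw data into the target first, then transform it in place with SQL."""
    with sqlite3.connect(TARGET_DB) as conn:
        # Load: write the raw extract as-is into a landing table.
        raw.to_sql("raw_sales", conn, if_exists="replace", index=False)

        # Transform: the target system does the work, here as plain SQL.
        conn.executescript(
            """
            DROP TABLE IF EXISTS sales_clean;
            CREATE TABLE sales_clean AS
            SELECT
                order_id,
                UPPER(country)      AS country,
                COALESCE(amount, 0) AS amount
            FROM raw_sales
            WHERE order_id IS NOT NULL;
            """
        )

if __name__ == "__main__":
    sample = pd.DataFrame(
        {"order_id": [1, 2, None], "country": ["us", "de", "fr"], "amount": [10.0, None, 5.0]}
    )
    elt_load_then_transform(sample)
```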
ETL vs ELT
Implementing ETL Process

• Extract data from sources: extract or read data from various sources like databases, APIs, files, etc. This step collects all the data needed for transformation.
• Transform data: clean, filter, aggregate, join, enrich, normalize, etc. the extracted data into the desired format.
• Load data into target: write the transformed data into the target database, data warehouse, file, or other system.
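
Putting the three steps together, a minimal hand-rolled ETL sketch in Python, with pandas assumed to be available; the file, database, and table names are hypothetical.

```python
import sqlite3

import pandas as pd

def extract(csv_path: str) -> pd.DataFrame:
    """Extract: read raw records from a source file (hypothetical path)."""
    return pd.read_csv(csv_path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean, filter, and aggregate into the desired shape."""
    df = df.dropna(subset=["customer_id"])
    df["amount"] = df["amount"].fillna(0)
    return df.groupby("customer_id", as_index=False)["amount"].sum()

def load(df: pd.DataFrame, warehouse_db: str, table: str) -> None:
    """Load: write the transformed data into the target warehouse table."""
    with sqlite3.connect(warehouse_db) as conn:
        df.to_sql(table, conn, if_exists="replace", index=False)

if __name__ == "__main__":
    # Hypothetical file, database, and table names.
    load(transform(extract("sales.csv")), "warehouse.db", "fact_sales")
```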
ETL Process
ETL Architecture

ELT Architecture
Benefits of ETL

• Increased Efficiency: ETL allows for faster data integration and processing, resulting in improved efficiency.
• Data Quality: ETL helps to ensure data quality by validating and cleaning data before it is loaded into the target system.
• Cost Savings: ETL can reduce costs associated with data integration by automating processes.
• Simplified Integration: ELT pipelines load raw data into the warehouse first, simplifying integration of diverse data sources.

Overall, ETL provides many benefits for data integration, including increased efficiency, improved data quality, and cost savings.
Choosing the Right Approach
Data integration is an important part of any business
process. Choosing the right approach for ETL vs ELT is
essential for successful data integration.
“A data pipeline is not a one-size-fits-all solution; ETL or ELT has its limitations and may not be the best choice for every situation.”
Data Pipeline in Practice
Hand-Coded ETL
• Uses custom code and a combination of libraries

• Pros
• Easy for small projects
• Cheap
• Does not require sophisticated data modeling

• Cons:
• Time-consuming
• Complicated
• Hard to document
• Not reusable
ETL Tool
• Time: hand-coded methods can take days or even weeks, depending on the amount and complexity of the data, just to prepare the data for analysis, even before any analysis is done.

• Reusability: processes built in an ETL tool can be saved and directly reused for other processes and data models as well. In manual coding, changes have to be made meticulously by a programmer.

• Management: because of automation, managing datasets becomes easier with an ETL tool. ETL tools provide a broader view of ETL processes, showing where the data is coming from, where it is going, and what calculations have been done on it.
Benefits ETL Tool
• Reusable dimensional processes ➔ productivity gain
• Robust data quality processes
• Workflow, error handling, and restart/recovery functionality
• Self-documentation of processes and workflow
• Data governance
Incremental Loading

• Only load new records: incremental loading only brings in new or updated records from the source systems since the last ETL run, avoiding reprocessing unchanged data.
• Reduce load times: by only processing a subset of data, incremental loading greatly speeds up load times compared to full loads.
• Requires change data capture: to identify new/changed records, source systems must have change data capture processes that flag recently added/updated data.

Incremental loading makes the ETL process more efficient by only loading changes since the previous run, improving load performance.
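
A minimal sketch of watermark-based incremental loading, assuming the source maintains an updated_at column as its change data capture signal and that the warehouse table already exists; the database, table, and column names are hypothetical, and a real pipeline would typically merge/upsert rather than blindly append.

```python
import sqlite3

import pandas as pd

def incremental_load(source_db: str, warehouse_db: str) -> int:
    """Load only rows changed since the last run, using an updated_at watermark."""
    with sqlite3.connect(warehouse_db) as wh:
        # High-water mark: the most recent change already loaded.
        row = wh.execute("SELECT MAX(updated_at) FROM fact_orders").fetchone()
        watermark = row[0] or "1970-01-01 00:00:00"

    with sqlite3.connect(source_db) as src:
        # Change data capture here is simply an updated_at column maintained
        # by the source system; only newer rows are pulled.
        changed = pd.read_sql_query(
            "SELECT * FROM orders WHERE updated_at > ?", src, params=(watermark,)
        )

    if not changed.empty:
        with sqlite3.connect(warehouse_db) as wh:
            # A production pipeline would usually upsert/merge instead of append.
            changed.to_sql("fact_orders", wh, if_exists="append", index=False)
    return len(changed)
```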
Error Handling

• Log errors: implement logging to record errors during ETL execution for debugging.
• Send alerts: configure alerts to notify teams when critical errors occur in ETL pipelines.
• Handle different error types: have specific error handling logic for different types of errors, like data errors, connection errors, transform errors, etc.

Robust error handling with logging, alerts, and handling of error types helps build resilient ETL processes that keep running in case of failures.
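
A minimal sketch of this error-handling pattern using Python's standard logging module: errors are logged, a placeholder alert hook is called for critical failures, and different exception types get different handling. The function, step, and file names are hypothetical.

```python
import logging
import sqlite3

logging.basicConfig(level=logging.INFO, filename="etl.log")
logger = logging.getLogger("etl")

def send_alert(message: str) -> None:
    """Placeholder alert hook; a real pipeline might call email, Slack, or a pager."""
    logger.critical("ALERT: %s", message)

def run_step(name, func, *args):
    """Run one ETL step with logging and error-type-specific handling."""
    try:
        return func(*args)
    except sqlite3.OperationalError as exc:        # connection / database errors
        logger.error("Connection error in %s: %s", name, exc)
        send_alert(f"{name} failed with a connection error")
        raise
    except (KeyError, ValueError) as exc:          # data / transform errors
        logger.error("Data error in %s: %s", name, exc)
        return None                                # skip the step, keep the pipeline running
    except Exception:
        logger.exception("Unexpected error in %s", name)
        raise
```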
Data Quality Checks

• Duplicate record checks: look for duplicate records based on unique identifiers or combinations of columns. Ensure each record is unique.
• Valid value checks: check data values against allowed domains, data types, and ranges to catch bad data.
• Consistency checks: validate ID references, sums, and counts across tables. Data should be consistent.

Doing rigorous data quality checks during ETL helps improve downstream data quality and prevent dirty data issues.
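
A minimal pandas sketch of these three kinds of checks on hypothetical orders and customers tables; the allowed values and ranges are illustrative only.

```python
import pandas as pd

def quality_checks(orders: pd.DataFrame, customers: pd.DataFrame) -> list[str]:
    """Return a list of data quality problems found in the staged data."""
    problems = []

    # Duplicate record check: order_id should uniquely identify a row.
    if orders.duplicated(subset=["order_id"]).any():
        problems.append("duplicate order_id values found")

    # Valid value checks: status must come from an allowed domain,
    # and amounts must fall inside a sane range.
    allowed_status = {"NEW", "SHIPPED", "CANCELLED"}
    if not orders["status"].isin(allowed_status).all():
        problems.append("unexpected status values")
    if not orders["amount"].between(0, 1_000_000).all():
        problems.append("amount out of range")

    # Consistency check: every order must reference an existing customer.
    if not orders["customer_id"].isin(customers["customer_id"]).all():
        problems.append("orders reference unknown customers")

    return problems
```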
Date Dimension

• Temporal analysis: date dimensions enable powerful time-series analysis of data by attributes like day of week, month, quarter, year, etc.
• Hierarchies: date attributes roll up into natural hierarchies such as day, month, quarter, and year.
• Holidays and events: date dimensions can store holidays, festivals, events, etc., enabling analysis by these temporal events.

A well-designed date dimension table structures time data, enabling powerful temporal analysis in a data warehouse.
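
A minimal pandas sketch of generating a date dimension with a few of the attributes mentioned above; the holiday flag is only a stand-in for a real business calendar.

```python
import pandas as pd

def build_date_dimension(start: str, end: str) -> pd.DataFrame:
    """Generate a simple date dimension with common temporal attributes."""
    dates = pd.date_range(start=start, end=end, freq="D")
    dim = pd.DataFrame({"date": dates})
    dim["date_key"] = dim["date"].dt.strftime("%Y%m%d").astype(int)
    dim["day_of_week"] = dim["date"].dt.day_name()
    dim["month"] = dim["date"].dt.month
    dim["quarter"] = dim["date"].dt.quarter
    dim["year"] = dim["date"].dt.year
    # Holiday flags would normally come from a business calendar;
    # this single fixed date is only an illustration.
    dim["is_holiday"] = dim["date"].dt.strftime("%m-%d").eq("01-01")
    return dim

if __name__ == "__main__":
    print(build_date_dimension("2024-01-01", "2024-12-31").head())
```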
Let's Code
Putting everything we have learned into practice
