08 - Data Pipelines Presentation
3. Iterative: discover and learn from each individual project
4. Reusable: ensure consistency
5. Documented: identify data for reuse, and create leverage for future projects
6. Auditable: necessary for government regulations and industry standards
Data Pipeline Design
• The product-specific workflow with all data and data pipeline components documented
Introduction to ETL and ELT
• ETL stands for Extract, Transform and Load. ETL extracts data from source systems, transforms the data for analysis, and loads it into a data warehouse.
• In ELT, extraction of the data happens first, then loading into the target system; transformation happens inside the target system.
• ETL transforms data first and then loads it, whereas ELT loads first and then transforms inside the target.
• Both achieve the same goal of moving data from sources to a target data warehouse.
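As a rough illustration of the difference, here is a minimal Python sketch contrasting the two patterns. It assumes SQLite-style connection objects for the source and the warehouse; the table names (orders, orders_raw, orders_clean) are purely illustrative.

```python
def etl(source, warehouse):
    """ETL: transform in the pipeline, then load the finished result."""
    rows = source.execute("SELECT id, amount FROM orders").fetchall()
    # Transformation happens outside the target system.
    cleaned = [(i, round(a, 2)) for i, a in rows if a is not None]
    warehouse.executemany("INSERT INTO orders_clean VALUES (?, ?)", cleaned)

def elt(source, warehouse):
    """ELT: load the raw data first, transform inside the target with SQL."""
    rows = source.execute("SELECT id, amount FROM orders").fetchall()
    warehouse.executemany("INSERT INTO orders_raw VALUES (?, ?)", rows)
    # Transformation happens inside the target system.
    warehouse.execute(
        "INSERT INTO orders_clean "
        "SELECT id, ROUND(amount, 2) FROM orders_raw WHERE amount IS NOT NULL"
    )
```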
Why ETL
• Three reasons:
• Performance of the source system is not degraded.
• If corrupted data is copied directly from the source into the data warehouse, rollback will be a challenge.
• The data warehouse needs to integrate systems that have different DBMSs, hardware, operating systems and communication protocols.
Extract
• Three data extraction methods:
• Full Extraction
• Partial Extraction - without update notification
• Partial Extraction - with update notification
• Types of loading:
• Initial Load — populating all the data warehouse tables.
• Incremental Load — applying ongoing changes periodically, as and when needed.
• Full Refresh — erasing the contents of one or more tables and reloading them with fresh data.
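The three loading strategies can be sketched as follows in Python, assuming a DB-API style warehouse connection, a hypothetical sales table with three columns, and rows that have already been staged. The INSERT OR REPLACE upsert is SQLite-style syntax.

```python
def initial_load(warehouse, staged_rows):
    # Initial Load: populate an empty warehouse table from scratch.
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged_rows)

def incremental_load(warehouse, changed_rows):
    # Incremental Load: apply only the rows that changed since the last run.
    warehouse.executemany(
        "INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", changed_rows
    )

def full_refresh(warehouse, staged_rows):
    # Full Refresh: erase the table contents, then reload with fresh data.
    warehouse.execute("DELETE FROM sales")
    warehouse.executemany("INSERT INTO sales VALUES (?, ?, ?)", staged_rows)
```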
ETL vs ELT
• Despite its many benefits, an ETL process can be prone to breaking when any change occurs in the source systems or the target warehouse.
ETL Process
• Extract data from sources: extract or read data from various sources like databases, APIs, files etc. This step collects all the data needed for transformation.
• Transform data: clean, filter, aggregate, join, enrich, normalize etc. the extracted data into the desired format.
• Load data into target: write the transformed data into the target database, data warehouse, file or other system.
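A minimal end-to-end sketch of these three steps in Python, assuming a CSV file as the source and an SQLite database as the target; the file name, column names and table name are illustrative.

```python
import csv
import sqlite3

def extract(path):
    # Extract: read raw records from a file source.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    # Transform: drop rows with missing amounts, normalize names,
    # and convert amounts to numbers.
    return [
        (r["id"], r["name"].strip().title(), float(r["amount"]))
        for r in records
        if r.get("amount")
    ]

def load(rows, db_path="warehouse.db"):
    # Load: write the transformed rows into the target table.
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT, name TEXT, amount REAL)")
    con.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    load(transform(extract("sales.csv")))
```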
ETL Architecture
• Pros
• Easy for small projects
• Cheap
• Does not require sophisticated data modeling
• Cons:
• Time-consuming and complicated
• Hard to document
• Not reusable
ETL Tool
• Time: hand-coded methods can take days or even weeks, depending on the amount and complexity of the data, just to prepare the data for analysis, before any analysis is even done.
• Self-documentation of processes and workflow
• Data governance
Incremental Loading
• Only load new records: incremental loading only brings in new or updated records from the source systems since the last ETL run, avoiding reprocessing unchanged data.
• Reduce load times: by processing only a subset of the data, incremental loading greatly speeds up load times compared to full loads.
• Requires change data capture: to identify new/changed records, source systems must have change data capture processes that flag recently added/updated data.
Incremental loading makes the ETL process more efficient by only loading
changes since the previous run, improving load performance.
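One common way to implement this is a timestamp watermark, sketched below in Python. It assumes SQLite-style connections, that the source table carries an updated_at column (a simple form of change data capture), and that the sales and etl_watermark tables are hypothetical.

```python
def incremental_load(source, warehouse):
    # Read the high-water mark left by the previous ETL run.
    (last_run,) = warehouse.execute(
        "SELECT MAX(loaded_at) FROM etl_watermark"
    ).fetchone()

    # Extract only rows added or updated since the last run.
    changed = source.execute(
        "SELECT id, name, amount, updated_at FROM sales WHERE updated_at > ?",
        (last_run or "1970-01-01",),
    ).fetchall()

    # Upsert the changed rows; unchanged data is never reprocessed.
    warehouse.executemany(
        "INSERT OR REPLACE INTO sales VALUES (?, ?, ?, ?)", changed
    )

    # Advance the watermark for the next run.
    if changed:
        newest = max(row[3] for row in changed)
        warehouse.execute("INSERT INTO etl_watermark VALUES (?)", (newest,))
```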
Error Handling
Robust error handling, with logging, alerting, and specific handling for different error types, helps build resilient ETL processes that keep running when failures occur.
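A minimal sketch of this pattern in Python, wrapping a pipeline step with logging and retries for transient failures; the send_alert() hook is a hypothetical placeholder for whatever alerting channel is in use.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl")

def run_step(step, *args, retries=3, delay=5):
    for attempt in range(1, retries + 1):
        try:
            return step(*args)
        except Exception:
            # Log the full traceback so failures are auditable.
            log.exception("step %s failed (attempt %d/%d)",
                          step.__name__, attempt, retries)
            if attempt == retries:
                send_alert(f"ETL step {step.__name__} failed after {retries} tries")
                raise  # let the scheduler mark the run as failed
            time.sleep(delay)  # back off before retrying a transient error

def send_alert(message):
    # Placeholder: in practice, post to email, Slack, PagerDuty etc.
    log.error("ALERT: %s", message)
```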
Data Quality Checks