ETL Process in Data Warehouse

The document discusses the ETL (extraction, transformation, loading) process used to integrate data from multiple source systems into a data warehouse. It describes how data is extracted from various sources, transformed for quality and consistency, and loaded into the data warehouse. Key aspects of the ETL process include extracting data from different source systems, cleaning and transforming data during the loading stage, and properly handling slowly changing dimensions when loading data into fact and dimension tables.


ETL Process in Data Warehouse
Data Warehouse
A subject-oriented, integrated, time-variant, non-updatable collection of data used in support of management decision-making processes
 Subject-oriented: e.g. customers, patients, students, products
 Integrated: consistent naming conventions, formats, encoding structures; from multiple data sources
 Time-variant: can study trends and changes
 Non-updatable: read-only, periodically refreshed; never deleted
Data Preprocessing Outline
 ETL
 Extraction
 Transformation
 Loading
ETL Architecture
[Diagram: operational data, file systems, databases, and external data feed a staging area, where the ETL engine extracts, transforms, cleanses, and normalises the data before loading it into the data warehouse and its data marts; the warehouse holds atomic data, summary data, and transient data]
ETL Overview
 Extraction, Transformation, Loading – ETL
 To get data out of the source and load it into the data warehouse – simply a process of copying data from one database to another
 Data is extracted from an OLTP database, transformed to match the data warehouse schema, and loaded into the data warehouse database (a minimal sketch follows this list)
 Many data warehouses also incorporate data from non-OLTP systems such as text files, legacy systems, and spreadsheets; such data also requires extraction, transformation, and loading
 When defining ETL for a data warehouse, it is important to think of ETL as a process, not a physical implementation
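
As a minimal sketch of this copy-and-reshape flow – assuming SQLite databases on both ends, with all table and column names purely hypothetical:

import sqlite3

# Hypothetical connections: an OLTP source and the warehouse database.
source = sqlite3.connect("oltp.db")
warehouse = sqlite3.connect("warehouse.db")

# Extract: pull rows from the operational system.
rows = source.execute(
    "SELECT order_id, customer, amount, order_date FROM orders"
).fetchall()

# Transform: reshape the rows to match the warehouse schema.
transformed = [(oid, cust.strip().upper(), float(amt), dt)
               for (oid, cust, amt, dt) in rows]

# Load: write into the warehouse fact table.
warehouse.executemany(
    "INSERT INTO fact_orders (order_id, customer, amount, order_date) "
    "VALUES (?, ?, ?, ?)",
    transformed,
)
warehouse.commit()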
ETL Overview
 ETL is often a complex combination of process and technology, involving business analysts, database designers, and application developers
 It is not a one-time event, as new data is added to the data warehouse periodically – monthly, daily, or hourly
 Because ETL is an integral, ongoing, and recurring part of a data warehouse, it should be:
 Automated
 Well documented
 Easily changeable
ETL Staging Database
 ETL operations should be performed on a relational database server separate from the source databases and the data warehouse database (sketched below)
 Creates a logical and physical separation between the source systems and the data warehouse
 Minimizes the impact of the intense periodic ETL activity on source and data warehouse databases
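
One way to honour this separation in code is to keep three distinct connections, so the intensive transformation work touches only the staging server; a sketch, with every database name hypothetical:

import sqlite3

# Three physically separate databases: reads hit the source only during
# extraction, heavy transformation work runs in staging, and the
# warehouse is touched only by the final load.
source = sqlite3.connect("source_oltp.db")    # operational system
staging = sqlite3.connect("etl_staging.db")   # dedicated ETL work area
warehouse = sqlite3.connect("warehouse.db")   # presentation layer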
Extraction
 The integration of all of the disparate systems across the enterprise is the real challenge in getting the data warehouse to a state where it is usable
 Data is extracted from heterogeneous data sources
 Each data source has its distinct set of characteristics that need to be managed and integrated into the ETL system in order to effectively extract data
Extraction
 The ETL process needs to effectively integrate systems that have different:
 DBMSs
 Operating systems
 Hardware
 Communication protocols
 Need to have a logical data map before the physical data can be transformed
 The logical data map describes the relationship between the extreme starting points and the extreme ending points of the ETL system, usually presented in a table or spreadsheet – a sketch follows this list
 The analysis of the source system is usually broken into two major phases:
 The data discovery phase
 The anomaly detection phase
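
A minimal sketch of a logical data map as a Python structure rather than a spreadsheet – every source/target pair and transformation rule here is hypothetical:

# Each entry ties an extreme starting point (source column) to an
# extreme ending point (warehouse column) and records the rule applied.
logical_data_map = [
    {"source": "crm.customers.cust_nm",
     "target": "dw.dim_customer.customer_name",
     "transformation": "trim whitespace, title-case"},
    {"source": "erp.orders.ord_dt",
     "target": "dw.fact_orders.order_date",
     "transformation": "parse DD/MM/YYYY into ISO-8601"},
    {"source": "erp.orders.amt",
     "target": "dw.fact_orders.amount",
     "transformation": "cast to decimal(12, 2)"},
]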
Extraction – Data Discovery Phase
 A key criterion for the success of the data warehouse is the cleanliness and cohesiveness of the data within it
 Once you understand what the target needs to look like, you need to identify and examine the data sources
 Understanding the content of the data is crucial for determining the best approach for retrieval, for example (a profiling sketch follows this list):
 NULL values
 Dates in non-date fields
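
A data-discovery sketch that profiles the two examples above – NULL values and dates hiding in non-date fields; the connection, table, and column names are assumptions:

import re
import sqlite3

conn = sqlite3.connect("source_oltp.db")  # hypothetical source extract

# Profile NULL values in a column the target schema requires.
nulls = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE cust_nm IS NULL"
).fetchone()[0]
print(f"NULL customer names: {nulls}")

# Flag date-like strings stored in a free-text (non-date) field.
date_like = re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")
for (comment,) in conn.execute(
    "SELECT comment FROM customers WHERE comment IS NOT NULL"
):
    if date_like.search(comment):
        print("Date found in non-date field:", comment)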


Transformation
 The main step where the ETL process adds value
 Actually changes the data and provides guidance as to whether the data can be used for its intended purposes
Transformation
Data quality paradigm:
 Correct
 Unambiguous
 Consistent
 Complete
 Data quality checks are run at two places – after extraction, and after cleaning and confirming (additional checks are run at this point)
Transformation – Cleaning Data
 Anomaly detection
 Data sampling – e.g. a count(*) of the rows for each value in a department column
 Column property enforcement (a sketch of these checks follows this list)
 Null values in required columns
 Numeric values that fall outside expected highs and lows
 Columns whose lengths are exceptionally short or long
 Columns with values outside of discrete valid value sets
 Adherence to a required pattern, or membership in a set of patterns
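
Each of these column-property rules maps onto a simple profiling query; a sketch over a hypothetical staging table, where the thresholds and valid-value sets are assumptions:

import sqlite3

conn = sqlite3.connect("etl_staging.db")  # hypothetical staging area

checks = {
    # Null values in required columns
    "NULL in required dept": "SELECT COUNT(*) FROM staff WHERE dept IS NULL",
    # Numeric values outside expected highs and lows
    "salary out of range": "SELECT COUNT(*) FROM staff "
                           "WHERE salary NOT BETWEEN 10000 AND 500000",
    # Column lengths exceptionally short or long
    "odd name length": "SELECT COUNT(*) FROM staff "
                       "WHERE LENGTH(name) NOT BETWEEN 2 AND 60",
    # Values outside a discrete valid value set
    "invalid status code": "SELECT COUNT(*) FROM staff "
                           "WHERE status NOT IN ('A', 'I', 'P')",
}

for label, sql in checks.items():
    bad = conn.execute(sql).fetchone()[0]
    if bad:
        print(f"Anomaly – {label}: {bad} rows")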
Transformation – Confirming
 Structure enforcement (a sketch follows below)
 Tables have proper primary and foreign keys
 Obey referential integrity
 Data and rule value enforcement
 Simple business rules
 Logical data checks
[Diagram: staged data passes through cleaning and confirming; if fatal errors are found the process stops, otherwise it proceeds to loading]
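
A sketch of that fatal-error gate: an anti-join finds fact rows that violate referential integrity, and the load proceeds only when none are found (table and key names are hypothetical):

import sqlite3

conn = sqlite3.connect("etl_staging.db")  # hypothetical staging area

# Structure enforcement: fact rows whose customer key has no matching
# dimension row are orphans – a fatal error for this load.
orphans = conn.execute("""
    SELECT COUNT(*)
    FROM fact_orders f
    LEFT JOIN dim_customer d ON f.customer_key = d.customer_key
    WHERE d.customer_key IS NULL
""").fetchone()[0]

if orphans:
    raise SystemExit(f"Fatal errors: {orphans} orphaned fact rows – stopping")
print("Structure checks passed – proceeding to loading")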
Loading
 Loading Dimensions
 Loading Facts
Loading Dimensions
 Physically built to have the minimal set of components
 The primary key is a single field containing a meaningless unique integer – a surrogate key
 The DW owns these keys and never allows any other entity to assign them
 De-normalized flat tables – all attributes in a dimension must take on a single value in the presence of the dimension primary key
 Should possess one or more other fields that compose the natural key of the dimension
 The data loading module consists of all the steps required to administer slowly changing dimensions (SCDs) and write the dimension to disk as a physical table in the proper dimensional format, with correct primary keys, correct natural keys, and final descriptive attributes
 Creating and assigning the surrogate keys occurs in this module (a sketch follows this list)
 The table is definitely staged, since it is the object to be loaded into the presentation system of the data warehouse
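
A minimal sketch of surrogate key assignment during the dimension load – a warehouse-owned counter hands out meaningless integers keyed by the natural key (the function name and sample rows are hypothetical):

# Map each natural key to the surrogate key the warehouse assigned it.
surrogate_by_natural = {}
next_surrogate = 1

def assign_surrogate(natural_key):
    """Return the existing surrogate key, or mint the next integer."""
    global next_surrogate
    if natural_key not in surrogate_by_natural:
        surrogate_by_natural[natural_key] = next_surrogate
        next_surrogate += 1
    return surrogate_by_natural[natural_key]

# Incoming dimension rows carry only the natural key.
for customer_id, name in [("C-1001", "Acme Ltd"),
                          ("C-1002", "Globex"),
                          ("C-1001", "Acme Ltd")]:
    print(assign_surrogate(customer_id), customer_id, name)
# Prints: 1 C-1001 Acme Ltd / 2 C-1002 Globex / 1 C-1001 Acme Ltd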
Loading Dimensions
 When the DW receives notification that an existing row in a dimension has changed, it gives one of three types of responses (contrasted in the sketch below):
 Type 1
 Type 2
 Type 3
Type 1 Dimension: the changed attribute is overwritten in place, so no history is preserved
Type 2 Dimension: the current row is expired and a new row with a new surrogate key is inserted, preserving full history
Type 3 Dimension: a separate attribute holds the prior value alongside the current one, preserving limited history
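
A sketch contrasting the three responses for one changed attribute, here a customer address; the row layout and column names are assumptions:

# Current dimension row before the change arrives.
row = {"surrogate_key": 1, "customer_id": "C-1001",
       "address": "12 Old St", "prior_address": None, "is_current": True}
new_address = "98 New Ave"

# Type 1 – overwrite in place: no history survives.
type1 = {**row, "address": new_address}

# Type 2 – expire the old row and insert a new row under a new
# surrogate key: full history survives.
expired = {**row, "is_current": False}
type2_new = {"surrogate_key": 2, "customer_id": "C-1001",
             "address": new_address, "prior_address": None,
             "is_current": True}

# Type 3 – shift the old value into a dedicated column: limited history.
type3 = {**row, "address": new_address, "prior_address": row["address"]}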
