ETL Process in Data Warehouse
Chirayu Poundarik
Outline
ETL
Extraction
Transformation
Loading
ETL Overview
Automated
Well documented
Easily changeable
Extraction
DBMS
Operating Systems
Hardware
Communication protocols
Need to have a logical data map before the physical data can
be transformed
Target: Table Name, Column Name, Data Type
Source: Table Name, Column Name, Data Type
Transformation
The content of the logical data mapping document has been proven to be the critical
element required to efficiently plan ETL processes
The table type gives us our cue for the ordinal position of our data load
processes: first dimensions, then facts.
The primary purpose of this document is to provide the ETL developer with a
clear-cut blueprint of exactly what is expected from the ETL process. This table must
depict, without question, the course of action involved in the transformation process.
The transformation can contain anything from the absolute solution to nothing at all.
Most often, the transformation can be expressed in SQL. The SQL may or may not
be the complete statement
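As a rough illustration of how a logical data map can drive load ordering, the sketch below models each mapping row as a small record and derives the load order from the table type. All names here (`MappingRow`, `load_order`, the sample tables and columns) are hypothetical and are not part of the original material; the `table_type` field is assumed to hold either "dimension" or "fact".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingRow:
    """One row of a logical data map (field names are illustrative assumptions)."""
    target_table: str
    target_column: str
    target_type: str
    table_type: str       # assumed labels: "dimension" or "fact"
    source_table: str
    source_column: str
    source_type: str
    transformation: str   # free text or partial SQL; may be empty


# Dimensions are loaded before facts.
LOAD_PRIORITY = {"dimension": 0, "fact": 1}


def load_order(rows):
    """Return target table names, dimensions first, then facts, without duplicates."""
    ordered = sorted(rows, key=lambda r: LOAD_PRIORITY[r.table_type])  # stable sort
    seen = []
    for row in ordered:
        if row.target_table not in seen:
            seen.append(row.target_table)
    return seen


# Hypothetical sample map; the transformation column holds partial SQL.
logical_data_map = [
    MappingRow("fact_sales", "amount", "DECIMAL(10,2)", "fact",
               "orders", "amt", "VARCHAR(12)", "CAST(amt AS DECIMAL(10,2))"),
    MappingRow("dim_customer", "customer_name", "VARCHAR(50)", "dimension",
               "customers", "name", "VARCHAR(80)", "TRIM(name)"),
    MappingRow("dim_product", "product_code", "CHAR(8)", "dimension",
               "products", "code", "VARCHAR(8)", ""),
]

print(load_order(logical_data_map))
```

Keeping the map as data rather than prose means the same rows can later feed documentation and load scheduling, which fits the goals of an automated, well-documented process.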
Understanding the content of the data is crucial for determining the best
approach for retrieval
- NULL values. An unhandled NULL value can destroy any ETL process.
NULL values pose the biggest risk when they are in foreign key columns.
An inner join on a column that contains NULL values silently drops the
affected rows, which causes data loss. Remember, in a relational database
NULL is not equal to NULL, so the join condition never matches. Check for
NULL values in every foreign key in the source database. When NULL values
are present, you must outer join the tables.
- Dates in non-date fields. Dates are peculiar elements because they
are the only logical elements that can come in various formats, literally
containing different values and having the exact same meaning.
Fortunately, most database systems support most of the various formats for
display purposes but store them in a single standard format.
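The NULL foreign key problem described above can be reproduced with a minimal sketch using Python's built-in `sqlite3` module. The `customer` and `orders` tables and their contents are invented for illustration; the point is only to contrast an inner join with an outer join when a foreign key is NULL.

```python
import sqlite3

# In-memory database with one order whose foreign key is NULL (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customer VALUES (1, 'Acme');
    INSERT INTO orders VALUES (100, 1, 25.0);
    INSERT INTO orders VALUES (101, NULL, 40.0);
""")

# Inner join: the order with the NULL foreign key is dropped.
inner_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    JOIN customer c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Left outer join: every order is kept; the missing customer shows up as None.
outer_rows = conn.execute("""
    SELECT o.order_id, c.name
    FROM orders o
    LEFT OUTER JOIN customer c ON o.customer_id = c.customer_id
    ORDER BY o.order_id
""").fetchall()

# Profiling check to run against every foreign key column before extraction.
null_fk_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
).fetchone()[0]

print(inner_rows)     # [(100, 'Acme')]
print(outer_rows)     # [(100, 'Acme'), (101, None)]
print(null_fk_count)  # 1
```

Running the NULL count first tells you whether the outer join is needed at all for a given source table.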
Transformation
Main step where the ETL adds value
Actually changes data and provides
guidance on whether the data can be used for its
intended purposes
Performed in staging area
Data Quality paradigm
Correct
Unambiguous
Consistent
Complete
Data quality checks are run at two places: after
extraction, and after cleaning and conforming, where
additional checks are run
Anomaly Detection
Data sampling: count(*) of the rows for each value of a department
column
Numeric values outside expected highs and lows
Columns whose lengths are exceptionally short or long
Columns with values outside a discrete set of valid
values
Adherence to a required pattern, or membership in a set of
patterns
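The checks listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not a prescribed implementation: the function names (`sample_counts`, `column_violations`), the row format (a list of dictionaries), and the sample data are all hypothetical, and treating a missing value as a violation is an added assumption.

```python
import re
from collections import Counter


def sample_counts(rows, column):
    """Rough equivalent of SELECT column, COUNT(*) ... GROUP BY column."""
    return Counter(row[column] for row in rows)


def column_violations(rows, column, min_len=0, max_len=None,
                      valid_values=None, pattern=None):
    """Return (row_index, reason) pairs for values that break a column rule."""
    regex = re.compile(pattern) if pattern else None
    violations = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value is None:  # assumption: a missing value counts as a violation
            violations.append((i, "missing"))
            continue
        text = str(value)
        if len(text) < min_len or (max_len is not None and len(text) > max_len):
            violations.append((i, "length"))
        if valid_values is not None and value not in valid_values:
            violations.append((i, "invalid value"))
        if regex is not None and not regex.fullmatch(text):
            violations.append((i, "pattern"))
    return violations


# Hypothetical staged rows.
rows = [
    {"department": "SALES", "zip": "15213"},
    {"department": "SALES", "zip": "1521"},
    {"department": "HR", "zip": "15217"},
    {"department": "XX", "zip": "ABCDE"},
]

print(sample_counts(rows, "department"))
print(column_violations(rows, "department", valid_values={"SALES", "HR"}))
print(column_violations(rows, "zip", min_len=5, max_len=5, pattern=r"\d{5}"))
```

Returning the offending row index with a reason, instead of raising on the first problem, lets a later step decide which violations are fatal and which can be logged and passed through.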
Transformation - Conforming
Structure Enforcement
Tables
business rules
Logical data checks
[Figure: Staged Data -> Cleaning and Conforming -> Fatal Errors? Yes: Stop. No: Loading.]
Loading
Loading Dimensions
Loading Facts
Loading Dimensions
Type 1 Dimension
Type 2 Dimension
Type 3 Dimensions
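The three dimension types are only named here. Assuming they refer to the usual slowly changing dimension techniques (Type 1 overwrites the attribute, Type 2 adds a new row and keeps history, Type 3 keeps the prior value in an extra column), a minimal in-memory sketch might look like this. The row layout and every field name (`surrogate_key`, `natural_key`, `is_current`, `start_date`, `end_date`) are illustrative assumptions, not taken from the original material.

```python
from datetime import date


def apply_type1(dim_rows, natural_key, attribute, new_value):
    """Type 1: overwrite the attribute in place; no history is kept."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row[attribute] = new_value


def apply_type2(dim_rows, natural_key, attribute, new_value, effective):
    """Type 2: expire the current row and append a new row with a new surrogate key."""
    current = next(r for r in dim_rows
                   if r["natural_key"] == natural_key and r["is_current"])
    current["is_current"] = False
    current["end_date"] = effective
    new_row = dict(
        current,
        surrogate_key=max(r["surrogate_key"] for r in dim_rows) + 1,
        start_date=effective,
        end_date=None,
        is_current=True,
    )
    new_row[attribute] = new_value
    dim_rows.append(new_row)
    return new_row


def apply_type3(dim_rows, natural_key, attribute, new_value):
    """Type 3: keep only the prior value, in a 'previous_<attribute>' column."""
    for row in dim_rows:
        if row["natural_key"] == natural_key:
            row["previous_" + attribute] = row[attribute]
            row[attribute] = new_value


# Hypothetical customer dimension with a single current row.
dim = [{
    "surrogate_key": 1, "natural_key": "C1", "city": "Pittsburgh",
    "start_date": date(2020, 1, 1), "end_date": None, "is_current": True,
}]

apply_type2(dim, "C1", "city", "Boston", date(2024, 6, 1))
print(len(dim), dim[0]["city"], dim[1]["city"])  # 2 Pittsburgh Boston
```

In a real warehouse these would be UPDATE and INSERT statements against the dimension table; the dictionaries stand in for rows only to keep the sketch self-contained.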
Loading facts
Facts
Fact tables hold the measurements of an
enterprise. The relationship between fact
tables and measurements is extremely
simple. If a measurement exists, it can be
modeled as a fact table row. If a fact table
row exists, it is a measurement
Managing Indexes
Performance
Managing Partitions
Questions