ETL Process
FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES
Acknowledgements: Most of these slides have been prepared from various online tutorials and presentations, with credit to their authors, and adapted for our course. Additional slides have been added from the references mentioned in the syllabus.
COMPONENTS OF A DATA WAREHOUSE
2
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
THE ETL CYCLE
4
ETL OVERVIEW (CONT.)
6
DIFFICULTIES IN ETL PROCESS (CONT.)
5) Source system structures keep changing over time
because of new business conditions. ETL functions
must also be modified accordingly.
6) Inconsistency among source systems. The same data is
likely to be represented differently in the various
source systems.
7) Most source systems do not represent data in types or
formats that are meaningful to the users. Many
representations are cryptic and ambiguous.
7
MAIN STEPS IN THE ETL PROCESS
8
ETL STAGING AREA
ETL Overview
Data Extraction
Data Transformation
Data Loading
OVERVIEW OF DATA EXTRACTION
Extraction is the process of reading data from the different source systems so that it can be loaded into the data warehouse.
It is the first step of ETL and is followed by many others.
It is a very complex task for a number of reasons:
● Data is extracted from heterogeneous and inconsistent
data sources
● Most of the data source systems are poorly documented
● Each data source has its own distinct set of characteristics that needs to be managed and integrated into the ETL system in order to extract data effectively.
● Very often, it is not possible to add logic to the source systems to support incremental extraction of data, because of performance concerns or the increased workload it would place on these systems.
11
DATA EXTRACTION ISSUES
o Source identification—identify source applications and
source structures.
o Method of extraction—for each data source, define
whether the extraction process is manual or tool-based.
o Extraction frequency—for each data source, establish
how frequently the data extraction must be done: daily,
weekly, quarterly, and so on.
o Time window—for each data source, denote the time
window for the extraction process.
o Job sequencing—determine whether the beginning of
one job in an extraction job stream has to wait until the
previous job has finished successfully.
o Exception handling—determine how to handle input
records that cannot be extracted.
12
SOURCE IDENTIFICATION
Source identification includes the identification of all
the proper data sources.
Source identification includes examining and verifying that the identified sources will provide the necessary value to the data warehouse.
You need to go through the source identification process for every piece of information you have to store in the data warehouse.
Source identification requires accuracy, considerable time, and comprehensive analysis.
13
SOURCE IDENTIFICATION STEPS
14
DATA IN OPERATIONAL SYSTEMS
Data in the source systems are said to be time-
dependent or temporal. This is because source data
changes with time. The value of a single variable varies
over time.
History cannot be ignored in the data warehouse.
16
DATA IN OPERATIONAL SYSTEMS – PERIODIC STATUS
17
DATA IN OPERATIONAL SYSTEMS (CONT.)
18
DATA EXTRACTION TECHNIQUES
Broadly, there are two major types of data extractions
from the source operational systems:
1. “As Is” or static data is the capture of data at a
given point in time. It is like taking a snapshot of
the relevant source data at a certain point in time.
2. Incremental data capture (data of revisions),
which includes the revisions since the last time
data was captured. Incremental data capture may
be immediate or deferred.
19
IMMEDIATE DATA EXTRACTION
In this option, the data extraction is real-time. It occurs
as the transactions happen at the source databases and
files.
There are three options for immediate data
extraction:
1) Capture through Transaction Logs
2) Capture through Database Triggers
3) Capture in Source Applications
20
IMMEDIATE DATA EXTRACTION (CONT.)
21
CAPTURE THROUGH TRANSACTION LOGS
This option uses the transaction logs of the DBMSs maintained for recovery from possible failures.
As each transaction adds, updates, or deletes a row
from a database table, the DBMS immediately writes
entries on the log file.
This data extraction technique reads the transaction
log and selects all the committed transactions.
There is no extra overhead in the operational systems
because logging is already part of the transaction
processing.
The appropriate transaction logs contain all the
changes to the various source database tables.
22
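A minimal sketch of log-based capture, assuming the DBMS exposes committed changes through a hypothetical change_log table with a monotonically increasing log sequence number (lsn); real log readers and replication tools use vendor-specific interfaces:

def capture_from_log(conn, last_lsn):
    # conn is any DB-API connection; change_log, lsn, and committed are
    # assumed names used only for illustration.
    # Select only the committed changes recorded after the last captured position.
    return conn.execute(
        "SELECT lsn, table_name, operation, row_data "
        "FROM change_log WHERE committed = 1 AND lsn > ? ORDER BY lsn",
        (last_lsn,),
    ).fetchall()

The captured rows would then be written to the target files in the staging area.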
CAPTURE THROUGH TRANSACTION LOGS (CONT.)
Here are the broad steps for using replication to capture changes to source data:
● Identify the source system database table
● Identify and define target files in the staging area
● Create mapping between the source table and target files
● Define the replication mode
● Schedule the replication process
● Capture the changes from the transaction logs
● Transfer captured data from logs to target files
● Verify transfer of data changes
● Confirm success or failure of replication
● In metadata, document the outcome of replication
● Maintain definitions of sources, targets, and mappings
23
CAPTURE THROUGH TRANSACTION LOGS (CONT.)
24
CAPTURE THROUGH DATABASE TRIGGERS
This option is applicable to source systems that are
database applications.
Triggers are special stored procedures (programs) that are stored in the database and fired when certain predefined events occur.
You can create trigger programs for all events for which
you need data to be captured. The output of the trigger
programs is written to a separate file that will be used
to extract data for the data warehouse.
Data capture through database triggers occurs right at
the source and is therefore quite reliable.
However, the execution of trigger procedures during transaction processing puts additional overhead on the source systems.
25
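As an illustration only, the following sketch uses Python's built-in sqlite3 module to create a trigger that writes every update of a customer table into a separate capture table; the table, column, and trigger names are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE customer_changes (
    change_time TEXT DEFAULT CURRENT_TIMESTAMP,
    operation   TEXT,
    id          INTEGER,
    name        TEXT,
    city        TEXT
);
-- Fired on every update; writes the new row image to the capture table
CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes (operation, id, name, city)
    VALUES ('U', NEW.id, NEW.name, NEW.city);
END;
""")
conn.execute("INSERT INTO customer VALUES (1, 'Susan', 'Amman')")
conn.execute("UPDATE customer SET city = 'Irbid' WHERE id = 1")
print(conn.execute("SELECT * FROM customer_changes").fetchall())

The extraction job later reads the capture table instead of scanning the whole customer table.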
CAPTURE IN SOURCE APPLICATIONS
Application programs need to be revised to write all
adds, updates, and deletes to the source files and
database tables.
Unlike the previous two cases, this technique may be
used for all types of source data irrespective of whether
it is in databases, indexed files, or other flat files.
Revising the programs in the source operational
systems could be a huge task if the number of source
system programs is large.
This technique may degrade the performance of the
source applications because of the additional processing
needed to capture the changes on separate files.
26
DEFERRED DATA EXTRACTION
The techniques under deferred data extraction do not
capture the changes in real time. The capture happens
later.
There are two options for deferred data extraction:
27
DEFERRED DATA EXTRACTION
28
CAPTURE BASED ON DATE AND TIME STAMP
Every time a source record is created or updated it may
be marked with a stamp showing the date and time.
The time stamp provides the basis for selecting records
for data extraction. Here the data capture occurs at a
later time, not while each source record is created or
updated.
This technique works well if the number of revised
records is small.
This technique presupposes that all the relevant source
records contain date and time stamps. Provided this is
true, data capture based on date and time stamp can
work for any type of source file.
This technique captures the latest state of the source
data.
29
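A minimal sketch of time-stamp-based capture, assuming every source record carries a last_updated field (an illustrative name):

from datetime import datetime

def extract_incremental(rows, last_extract_time):
    # Select only the records created or updated since the previous extraction run.
    return [r for r in rows if r["last_updated"] > last_extract_time]

source_rows = [
    {"id": 1, "name": "Susan", "last_updated": datetime(2008, 10, 10, 9, 0)},
    {"id": 2, "name": "Omar",  "last_updated": datetime(2008, 10, 12, 14, 30)},
]
print(extract_incremental(source_rows, datetime(2008, 10, 11)))  # only Omar's record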
CAPTURE BY COMPARING FILES
If none of the above techniques are feasible for specific
source files in your environment, then consider this
technique as the last resort.
This technique is also called the snapshot differential
technique because it compares two snapshots of the
source data.
This technique necessitates the keeping of prior copies
of all the relevant source data.
Though simple and straightforward, comparison of full
rows in a large file can be very inefficient.
This method may be the only feasible option for some
legacy data sources that do not have transaction logs or
time stamps on source records.
30
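A small sketch of the snapshot differential idea: two snapshots keyed by the record's primary key are compared to classify inserts, updates, and deletes (the sample data is illustrative):

def snapshot_diff(previous, current):
    # Compare two snapshots keyed by primary key and classify the changes.
    inserts = [k for k in current if k not in previous]
    deletes = [k for k in previous if k not in current]
    updates = [k for k in current if k in previous and current[k] != previous[k]]
    return inserts, updates, deletes

prev = {1: ("Susan", "Medium"), 2: ("Omar", "Low")}
curr = {1: ("Susan", "High"), 3: ("Lina", "Low")}
print(snapshot_diff(prev, curr))   # ([3], [1], [2])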
EVALUATION OF THE TECHNIQUES
To summarize, the following options are available for
data extraction:
1) Capture of Static Data
2) Incremental Data Capture
A. Immediate Data Extraction
Capture through transaction logs
Capture through database triggers
Capture in source applications
B. Deferred Data Extraction
Capture based on date and time stamp
Capture by comparing files
31
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
DATA TRANSFORMATION
33
DATA TRANSFORMATION (CONT.)
You have to transform the data according to standards because it comes from many dissimilar source systems.
You have to ensure that after all the data is put
together, the combined data does not violate any
business rules.
Transformation of source data encompasses a wide
variety of manipulations to change all the extracted
source data into usable information to be stored in the
data warehouse.
One major effort within data transformation is the
improvement of data quality.
34
DATA TRANSFORMATION (CONT.)
Data Quality paradigm:
● Correct
● Unambiguous
● Consistent
● Complete
35
EXAMPLES OF INCONSISTENT DATA REPRESENTATIONS
36
MAJOR TRANSFORMATION TYPES
Format Revisions. You will come across these quite
often. These revisions include changes to the data types
and lengths of individual fields. In source systems,
product package types may be indicated by codes and
names in which the fields are numeric and text data
types. The lengths of the package types may vary
among the different source systems. It is wise to
standardize and change the data type to text to provide
values meaningful to the users.
37
MAJOR TRANSFORMATION TYPES (CONT.)
Decoding of Fields. This is also a common type of
data transformation. When you deal with multiple
source systems, you are bound to have the same data
items described by a plethora of field values. The
classic example is the coding for gender, with one
source system using 1 and 2 for male and female and
another system using M and F. Also, many legacy
systems are known for using cryptic codes to represent
business values. What do the codes AC, IN, RE, and SU
mean in a customer file? You need to decode all such
cryptic codes and change these into values that make
sense to the users. Change the codes to Active, Inactive,
Regular, and Suspended.
38
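A simple sketch of field decoding using lookup dictionaries; the code values come from the examples above, while the record field names are illustrative:

GENDER_CODES = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}
STATUS_CODES = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}

def decode(record):
    # Replace cryptic source codes with values that make sense to the users.
    record["gender"] = GENDER_CODES.get(record["gender"], "Unknown")
    record["status"] = STATUS_CODES.get(record["status"], "Unknown")
    return record

print(decode({"customer": "C100", "gender": "2", "status": "SU"}))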
MAJOR TRANSFORMATION TYPES (CONT.)
Calculated and Derived Values. What if you want to
keep profit margin along with sales and cost amounts
in your data warehouse tables? The extracted data from
the sales system contains sales amounts, sales units,
and operating cost estimates by product. You will have
to calculate the total cost and the profit margin before
data can be stored in the data warehouse. Average
daily balances and operating ratios are examples of
derived fields.
39
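A minimal sketch of computing derived fields before loading, assuming the extracted record carries the sales amount, sales units, and unit cost (illustrative field names):

def add_derived_fields(sale):
    # Derive total cost and profit margin from the extracted sales fields.
    sale["total_cost"] = sale["sales_units"] * sale["unit_cost"]
    sale["profit_margin"] = (sale["sales_amount"] - sale["total_cost"]) / sale["sales_amount"]
    return sale

print(add_derived_fields({"product": "P10", "sales_amount": 500.0,
                          "sales_units": 20, "unit_cost": 15.0}))
# total_cost = 300.0, profit_margin = 0.4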
MAJOR TRANSFORMATION TYPES (CONT.)
Splitting of Single Fields. Earlier legacy systems
stored names and addresses of customers and
employees in large text fields. The first name, middle
initials, and last name were stored as a large text in a
single field. Similarly, some earlier systems stored city,
state, and zip code data together in a single field. You
need to store individual components of names and
addresses in separate fields in your data warehouse for
two reasons. First, you may improve the operating
performance by indexing on individual components.
Second, your users may need to perform analysis by
using individual components such as city, state, and zip
code.
40
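A small sketch of splitting a combined "City, State ZIP" text field into separate components (the input layout is an assumption for illustration):

def split_city_state_zip(value):
    # Split the single text field into city, state, and zip code components.
    city, rest = value.split(",", 1)
    state, zip_code = rest.split()
    return {"city": city.strip(), "state": state, "zip": zip_code}

print(split_city_state_zip("Newark, NJ 07102"))
# {'city': 'Newark', 'state': 'NJ', 'zip': '07102'}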
MAJOR TRANSFORMATION TYPES (CONT.)
Merging of Information. This is not quite the
opposite of splitting of single fields. This type of data
transformation does not literally mean the merging of
several fields to create a single field of data. For
example, information about a product may come from
different data sources. The product code and
description may come from one data source. The
relevant package types may be found in another data
source. The cost data may be from yet another source.
In this case, merging of information denotes the
combination of the product code, description, package
types, and cost into a single entity.
41
MAJOR TRANSFORMATION TYPES (CONT.)
Character set conversion. This type of data
transformation relates to the conversion of character
sets to an agreed standard character set for textual
data in the data warehouse. If you have mainframe
legacy systems as source systems, the source data from
these systems will be in EBCDIC characters. If PC-
based architecture is the choice for your data
warehouse, then you must convert the mainframe
EBCDIC format to the ASCII format. When your source
data is on other types of hardware and operating
systems, you are faced with similar character set
conversions.
42
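A minimal sketch using Python's standard codecs, assuming the mainframe data uses the common EBCDIC code page cp037:

ebcdic_bytes = "DATA WAREHOUSE".encode("cp037")   # simulated mainframe field
ascii_text   = ebcdic_bytes.decode("cp037")       # decode the EBCDIC bytes ...
ascii_bytes  = ascii_text.encode("ascii")         # ... and re-encode them as ASCII
print(ebcdic_bytes, ascii_bytes)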
MAJOR TRANSFORMATION TYPES (CONT.)
Conversion of Units of Measurements. Many
companies today have global branches. Measurements
in many European countries are in metric units. If your
company has overseas operations, you may have to
convert the metrics so that the numbers are all in one
standard unit of measurement.
Date/Time Conversion. This type relates to
representation of date and time in standard formats.
For example, the American and the British date
formats may be standardized to an international
format. The date of October 11, 2008 is written as
10/11/2008 in the U.S. format and as 11/10/2008 in the
British format. This date may be standardized to be
written as 11 OCT 2008.
43
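A short illustration with Python's datetime module, converting the U.S. format above into the standardized form:

from datetime import datetime

us_date = "10/11/2008"                          # U.S. month/day/year
parsed  = datetime.strptime(us_date, "%m/%d/%Y")
print(parsed.strftime("%d %b %Y").upper())      # 11 OCT 2008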
MAJOR TRANSFORMATION TYPES (CONT.)
Summarization. This type of transformation is the
creation of summaries to be loaded into the data
warehouse instead of loading the most granular level of
data. For example, for a credit card company to analyze
sales patterns, it may not be necessary to store in the
data warehouse every single transaction on each credit
card. Instead, you may want to summarize the daily
transactions for each credit card and store the
summary data instead of storing the most granular
data by individual transactions.
44
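A small sketch that summarizes individual credit card transactions into one row per card per day (illustrative data):

from collections import defaultdict

transactions = [
    {"card": "1111", "date": "2008-10-11", "amount": 25.0},
    {"card": "1111", "date": "2008-10-11", "amount": 40.0},
    {"card": "2222", "date": "2008-10-11", "amount": 10.0},
]

daily_totals = defaultdict(float)
for t in transactions:
    daily_totals[(t["card"], t["date"])] += t["amount"]   # one summary row per card per day

print(dict(daily_totals))   # {('1111', '2008-10-11'): 65.0, ('2222', '2008-10-11'): 10.0}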
MAJOR TRANSFORMATION TYPES (CONT.)
45
MAJOR TRANSFORMATION TYPES (CONT.)
46
MAJOR TRANSFORMATION TYPES (CONT.)
Enrichment: This task is the rearrangement and
simplification of individual fields to make them more
useful for the data warehouse environment. You may
use one or more fields from the same input record to
create a better view of the data for the data warehouse.
This principle is extended when one or more fields
originate from multiple records, resulting in a single
field for the data warehouse.
47
MAJOR TRANSFORMATION TYPES (CONT.)
Deduplication. In many companies, the customer files
have several records for the same customer.
In a normal client database some clients may be
represented by several records for various reasons:
● Incorrect or missing data values because of data
entry errors
● Inconsistent naming conventions, such as: ONE vs 1
● Incomplete information because data is not captured
or available
● Physically moved, but clients did not notify change of
address
● Misspelling or falsification of names
48
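A minimal deduplication sketch that collapses records sharing a normalized matching key; the normalization rule and field names are assumptions for illustration:

def matching_key(record):
    # Build a key that tolerates small naming inconsistencies.
    name = record["name"].lower().replace(".", "").strip()
    return (name, record["zip"])

customers = [
    {"name": "John A. Smith", "zip": "07102"},
    {"name": "john a smith",  "zip": "07102"},
    {"name": "Lina Haddad",   "zip": "11118"},
]

seen, unique = set(), []
for c in customers:
    key = matching_key(c)
    if key not in seen:
        seen.add(key)
        unique.append(c)
print(unique)   # the two John Smith records collapse into one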
PROBLEMS DUE TO DATA DUPLICATION
Data duplication can result in costly errors, such as:
● False frequency distributions.
● Incorrect aggregates due to double counting.
● Difficulty with catching fabricated identities by
credit card companies.
49
SLOWLY CHANGING DIMENSIONS
Compared to the fact table, the dimension tables are more
stable and are generally constant over time.
Unlike the fact table, which changes through an increase
in the number of rows, a dimension table does not change
just through the increase in the number of rows, but also
through changes to the attributes themselves.
Many dimensions, though not constant over time, change
slowly.
In the source OLTP systems, the new values overwrite
the old ones.
In the data warehouse, overwriting dimension table attributes is not always the appropriate option.
There are three types of dimension table changes: Type 1 changes, Type 2 changes, and Type 3 changes.
50
SLOWLY CHANGING DIMENSIONS – TYPE 1
These changes usually relate to correction of errors in
source systems.
The old value in the source system needs to be
discarded.
The change in the source system need not be preserved
in the data warehouse.
Overwrite the attribute value in the dimension table row
with the new value.
No other changes are made in the dimension table row.
51
SLOWLY CHANGING DIMENSIONS – TYPE 1
Type 1: Example
Susan's Tax Bracket attribute value changes from Medium to High
52
SLOWLY CHANGING DIMENSIONS – TYPE 2
These changes usually relate to true changes in source
systems.
Every change for the same attribute must be preserved.
53
SLOWLY CHANGING DIMENSIONS – TYPE 2
Type 2 : Example (with timestamps and row indicator)
Susan's Tax Bracket attribute value changes from Medium to High
54
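A minimal sketch of a Type 2 change applied to an in-memory dimension table: the current row is expired and a new row is added so that the full history is preserved. In a real warehouse the new row would also receive a new surrogate key; the field names here are illustrative:

from datetime import date

def apply_type2_change(dim_rows, customer_key, attr, new_value, change_date):
    # Expire the current row and append a new row that carries the changed value.
    for row in dim_rows:
        if row["customer_key"] == customer_key and row["current"]:
            row["current"] = False
            row["end_date"] = change_date
            dim_rows.append(dict(row, **{attr: new_value,
                                         "start_date": change_date,
                                         "end_date": None,
                                         "current": True}))
            break
    return dim_rows

dim = [{"customer_key": 1, "name": "Susan", "tax_bracket": "Medium",
        "start_date": date(2007, 1, 1), "end_date": None, "current": True}]
print(apply_type2_change(dim, 1, "tax_bracket", "High", date(2008, 10, 11)))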
SLOWLY CHANGING DIMENSIONS – TYPE 3
They usually relate to “soft” or tentative changes in the
source systems.
There is a need to keep track of history with old and
new values of the changed attribute.
They provide the ability to track changes both forward and backward.
55
SLOWLY CHANGING DIMENSIONS – TYPE 3
Type 3: Example (with timestamps)
Susan's Tax Bracket attribute value changes from Medium to High
56
PROS AND CONS
Type-1: Overwrite existing value
+ Simple to implement
- No tracking of history
57
TRANSFORMATION FOR DIMENSION ATTRIBUTES
58
AUTOMATIC DATA CLEANSING
1) Statistical Methods
● Identifying outlier fields and records using the mean, standard deviation, range, and other statistical measures (a sketch follows this slide).
2) Pattern-based
● Identify outlier fields and records that do not
conform to existing patterns in the data.
● A pattern is defined by a group of records that have
similar characteristics (“behavior”) for p% of the
fields in the data set, where p is a user-defined
value (usually above 90).
● Techniques such as partitioning, classification, and
clustering can be used to identify patterns that
apply to most records.
59
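A minimal sketch of the statistical method (1) above: values more than a chosen number of standard deviations from the mean are flagged as outlier candidates:

from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    # Flag values more than threshold standard deviations away from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if s and abs(v - m) > threshold * s]

daily_balances = [120, 135, 128, 131, 119, 9800, 127]
print(flag_outliers(daily_balances))   # [9800]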
AUTOMATIC DATA CLEANSING (CONT.)
3) Clustering
● Identify outlier records using clustering based on
Euclidian (or other) distance.
● Clustering the entire record space can reveal outliers that are not identified by field-level inspection.
● Main drawback of this method is computational
time.
4) Association rules
● Association rules with high confidence and support
define a different kind of pattern.
● Records that do not follow these rules are considered
outliers.
60
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
DATA LOADING
Data loading is the process of writing the data into the target database. It includes loading both dimension tables and fact tables.
Because loading the data warehouse may take a large amount of time, loads are generally a cause for great concern. During the loads, the data warehouse has to be taken offline.
Consider dividing up the whole load process into
smaller chunks and populating a few files at a time.
The whole process of moving data into the data
warehouse repository is referred to in several ways:
“loading the data” and “refreshing the data”.
62
TYPES OF DATA LOADING
Initial load—populating all the data warehouse tables
for the very first time.
Incremental load—applying ongoing changes as
necessary in a periodic manner.
Full refresh—completely erasing the contents of one
or more tables and reloading with fresh data (initial
load is a refresh of all the tables).
63
APPLYING DATA: TECHNIQUES AND PROCESSES
Data may be applied to data warehouse in the following
four different modes: load, append, destructive merge,
and constructive merge.
Load: If the target table to be loaded already exists
and data exists in the table, the load process wipes out
the existing data and applies the data from the
incoming file. If the table is already empty before
loading, the load process simply applies the data from
the incoming file.
Append: If data already exists in the table, the append
process unconditionally adds the incoming data,
preserving the existing data in the target table. When
an incoming record is a duplicate of an already existing
record, the incoming record may be allowed to be added
as a duplicate or it may be rejected.
64
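A small sketch of the load and append modes applied to an in-memory target table (Python lists stand in for warehouse tables):

def load(target, incoming):
    # Load mode: wipe out the existing rows, then apply the incoming file.
    target.clear()
    target.extend(incoming)

def append(target, incoming, allow_duplicates=True):
    # Append mode: add incoming rows while preserving the existing data;
    # duplicate incoming records are either allowed in or rejected.
    for row in incoming:
        if allow_duplicates or row not in target:
            target.append(row)

warehouse_table = [("P10", 100)]
append(warehouse_table, [("P10", 100), ("P20", 250)], allow_duplicates=False)
print(warehouse_table)   # [('P10', 100), ('P20', 250)]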
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)
65
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)
66
LOADING CHANGES TO DIMENSION TABLES
67
LOADING DIMENSIONS
68
LOADING DIMENSIONS (CONT.)
69
LOADING FACTS
70
LOADING FACT TABLES (CONT.)
71
LOADING FACT TABLES (CONT.)
Managing Indexes
● Indexes are performance killers at load time
● Drop all indexes before the load
● Separate updates from inserts
● Load the updates
● Rebuild the indexes after the load
72
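A minimal sketch of the drop-load-rebuild pattern using Python's sqlite3 module; the table and index names are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_product ON sales_fact (product_key)")

conn.execute("DROP INDEX idx_product")                     # drop indexes before the load
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",   # bulk load without index upkeep
                 [(i % 100, float(i)) for i in range(10000)])
conn.execute("CREATE INDEX idx_product ON sales_fact (product_key)")  # rebuild afterwards
conn.commit()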
ROLLBACK LOG
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment, where all transactions are managed by the ETL process, the rollback log is an unnecessary feature that must be dealt with to achieve optimal load performance.
Reasons why the data warehouse does not need
rollback logging are:
All data is entered by a managed process—the ETL system.
Data is loaded in bulk.
74
ETL TOOL OPTIONS
Vendors have approached the challenges of ETL and
addressed them by providing tools falling into the
following three broad functional categories:
● Data transformation engines.
● Data capture through replication
● Code generators
Data transformation engines. These tools capture data from a designated set of source systems at user-defined intervals, perform elaborate data transformations, send the results to a target environment, and apply the data to target files. These tools provide you with maximum flexibility for pointing to various source systems, selecting the appropriate data transformation methods, and applying full and incremental loads.
75
ETL TOOL OPTIONS (CONT.)
Data capture through replication: Most of these
tools use the transaction recovery logs maintained by
the DBMS. The changes to the source systems captured
in the transaction logs are replicated in near real time
to the data staging area for further processing. Some of
the tools provide the ability to replicate data through
the use of database triggers. These specialized stored
procedures in the database signal the replication agent
to capture and transport the changes.
76
ETL TOOL OPTIONS (CONT.)
Code generators: These are tools that directly deal
with the extraction, transformation, and loading of
data. The tools enable the process by generating
program code to perform these functions. Code
generators create 3GL/4GL data extraction and
transformation programs. You provide the parameters
of the data sources and the target layouts along with
the business rules. The tools generate most of the
program code in some of the common programming
languages.
77
MAJOR CAPABILITIES OF ETL TOOLS
79
THE END
80