Lecture 7 - ETL
LAB
ETL Process
Source Systems → Destination (Data Warehouse)
2
ETL
• Extract
• Transform
• Load
3
Extract
Staging Area
DR. FAISAL KAMIRAN INFORMATION TECHNOLOGY UNIVERSITY
4
Extract - Staging Area
The staging area acts as a buffer between the data warehouse and the source
data.
Since data may be coming from multiple different sources, it's likely in various
formats, and directly transferring the data to the warehouse may result in
corrupted data. The staging area is used for transforming the data.
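As a minimal sketch of this idea, assume two hypothetical source extracts that use different field names and letter case. The snippet lands the raw rows in a staging list, transforms them there, and only then loads the cleaned rows into the warehouse table. The source layouts and function names are illustrative assumptions, not part of the lecture.

```python
# Hypothetical raw extracts from two source systems (illustrative layouts only).
source_a = [{"last_name": "Johnson", "first_name": "Susan"}]
source_b = [{"LNAME": "WILSON", "FNAME": "ROBERT"}]


def extract(rows, source_name):
    """Copy rows into staging unchanged, tagging each with its origin."""
    return [{"source": source_name, "raw": dict(row)} for row in rows]


def transform(staged_row):
    """Bring a staged row into the single format the warehouse expects."""
    raw = staged_row["raw"]
    if staged_row["source"] == "A":
        last, first = raw["last_name"], raw["first_name"]
    else:
        # Source B stores upper-case names under different column names.
        last, first = raw["LNAME"].title(), raw["FNAME"].title()
    return {"LastName": last, "FirstName": first}


# Staging holds the untouched source data; the warehouse only sees clean rows.
staging_area = extract(source_a, "A") + extract(source_b, "B")
warehouse = [transform(row) for row in staging_area]
print(warehouse)
```

Keeping the raw rows in staging means a failed or incorrect transformation can be rerun without going back to the source systems.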
5
Transform
6
Load
7
ELT
Extract → Load → Transform
8
ETL and ELT
9
Variations of ETL
- Initial
- Incremental
10
Initial Load ETL
11
Incremental ETL
12
Incremental ETL Patterns
Append
Data Warehouse
• New data added at the end
13
Incremental ETL Patterns
In-place update
• Modify existing data (only some rows)
14
Incremental ETL Patterns
Complete replacement
Data Warehouse
• Overwrite existing data
15
Incremental ETL Patterns
16
Incremental ETL Patterns
• Complete replacement
• Rolling Append
Complete replacement and rolling append are not used in modern data
warehouses. However, they may still be found in very old DWHs.
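The four patterns can be sketched on an in-memory list of rows. This is only an illustration under stated assumptions: the `id` key and the helper names are hypothetical, and since the slides do not define rolling append, the fixed-size window used here (append, then keep only the newest rows) is an assumed interpretation.

```python
def append(warehouse, new_rows):
    """Append: new data is added at the end."""
    return warehouse + new_rows


def in_place_update(warehouse, changed_rows, key="id"):
    """In-place update: replace only the existing rows whose key matches."""
    changes = {row[key]: row for row in changed_rows}
    return [changes.get(row[key], row) for row in warehouse]


def complete_replacement(warehouse, full_extract):
    """Complete replacement: overwrite all existing data."""
    return list(full_extract)


def rolling_append(warehouse, new_rows, window):
    """Rolling append (assumed meaning): append, then keep the newest
    `window` rows. `window` must be a positive integer."""
    return (warehouse + new_rows)[-window:]


dwh = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}]
print(append(dwh, [{"id": 3, "v": "c"}]))
print(in_place_update(dwh, [{"id": 2, "v": "B"}]))
print(complete_replacement(dwh, [{"id": 9, "v": "z"}]))
print(rolling_append(dwh, [{"id": 3, "v": "c"}], window=2))
```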
17
Data Transformation
Goals
• Uniformity in data
18
Data Transformation
Goals
• Uniformity in data
• Restructuring
19
Data Transformation Models
20
Data value unification
Suppose that we have data from two different campuses that use different
formats for the Rank column.
21
Data value unification
• Choose one uniform format
• Transform other formats
Faculty Master Dimension
LastName    FirstName   Rank ...
Johnson     Susan
Wilson      Robert
Tolleson    Mary
Zimmerman   Todd
Marcus      Walter
Adleman     Robert
Bonvoy      Janice
Clark       William
Douglas     Thomas
The abbreviated format is our standard one, used in our dimension table in the
DWH.
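A value-unification step could look like the sketch below. The actual Rank values are not visible in the extracted slides, so the long-form names and their abbreviations are hypothetical placeholders, as is the `unify_rank` helper.

```python
# Hypothetical mapping from one campus's long rank names to the
# abbreviated standard; the real values are not shown in the slides.
RANK_TO_STANDARD = {
    "Assistant Professor": "Asst",
    "Associate Professor": "Assoc",
    "Professor": "Full",
}
STANDARD_RANKS = set(RANK_TO_STANDARD.values())


def unify_rank(value):
    """Return the standard abbreviated rank for a source value."""
    cleaned = value.strip()
    if cleaned in STANDARD_RANKS:
        return cleaned  # already in the chosen uniform format
    if cleaned in RANK_TO_STANDARD:
        return RANK_TO_STANDARD[cleaned]
    # Unknown values are surfaced instead of being loaded silently.
    raise ValueError(f"Unknown rank value: {value!r}")


campus_1_ranks = ["Assistant Professor", "Professor"]  # long format
campus_2_ranks = ["Assoc", "Full"]                     # abbreviated format
unified = [unify_rank(r) for r in campus_1_ranks + campus_2_ranks]
print(unified)
```

Raising on an unmapped value is a design choice for this sketch; a real pipeline might instead route such rows to a reject table for review.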
22
Data type and size unification
Campus 1 and Campus 2 used different data sizes for the columns in their source
systems.
23
Data type and size unification
Since we previously chose Campus 2's abbreviated scheme, we will also use
its abbreviated data sizes in the dimension table.
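One way to enforce a single type and size per column is sketched below. The column sizes in `TARGET_SIZES` are made-up assumptions, since the slides do not list the actual sizes, and `conform` is a hypothetical helper.

```python
# Hypothetical target sizes for the dimension table columns
# (the real sizes are not given in the slides).
TARGET_SIZES = {"LastName": 30, "FirstName": 30, "Rank": 5}


def conform(row):
    """Cast every column to str and check that it fits the target size."""
    out = {}
    for column, max_len in TARGET_SIZES.items():
        value = str(row[column]).strip()
        if len(value) > max_len:
            # An oversize value usually means value unification was skipped,
            # so it is reported rather than truncated.
            raise ValueError(f"{column}={value!r} exceeds {max_len} characters")
        out[column] = value
    return out


print(conform({"LastName": "Johnson", "FirstName": "Susan", "Rank": "Asst "}))
```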
24
De-duplication
• Remove duplicate data
Here, Greta Williams is taking classes on both campuses and needs to register at
both. She will have a record in the source systems of both campuses.
25
De-duplication
Campus 1 New Students
LastName    FirstName   Year ...
Jackson     Sally       FR
Thompson    Richard     SO
Williams    Greta       FR
Campus 2 New Students
LastName    FirstName   Year ...
Young       Ted         FR
Williams    Greta       FR
We need to detect that Greta Williams appears in two different systems
but is in fact a single student. This can be done by matching on the CNIC or
any other natural key, and then de-duplicating the matching records.
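Matching on a natural key can be sketched as follows. The student names and years come from the two tables above; the CNIC values are invented for illustration because the slides do not show them, and `deduplicate` is a hypothetical helper.

```python
# CNIC values are made up for illustration; the slides do not show them.
campus_1 = [
    {"cnic": "11111-1111111-1", "LastName": "Jackson", "FirstName": "Sally", "Year": "FR"},
    {"cnic": "22222-2222222-2", "LastName": "Thompson", "FirstName": "Richard", "Year": "SO"},
    {"cnic": "33333-3333333-3", "LastName": "Williams", "FirstName": "Greta", "Year": "FR"},
]
campus_2 = [
    {"cnic": "44444-4444444-4", "LastName": "Young", "FirstName": "Ted", "Year": "FR"},
    {"cnic": "33333-3333333-3", "LastName": "Williams", "FirstName": "Greta", "Year": "FR"},
]


def deduplicate(rows, key="cnic"):
    """Keep the first row seen for each natural-key value."""
    seen = {}
    for row in rows:
        seen.setdefault(row[key], row)
    return list(seen.values())


students = deduplicate(campus_1 + campus_2)
print([s["LastName"] for s in students])
```

Matching on the key rather than on the name avoids merging two different students who happen to share a name.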
26
Dropping columns (Vertical slicing)
28
Correcting known errors
29
Correcting known Errors
30
ETL best practices and guidelines
For incremental ETL, process only the data that has changed in the source systems.
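A common way to pick up only changed data is a "last run" watermark. The sketch below assumes each source row carries a hypothetical `updated_at` timestamp; the slides do not say how changes are detected, so this is one possible approach rather than the prescribed one.

```python
from datetime import datetime

# Hypothetical source rows with a last-modified timestamp.
source_rows = [
    {"id": 1, "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "updated_at": datetime(2024, 1, 9)},
]


def extract_changed(rows, last_run):
    """Return only the rows modified after the previous ETL run."""
    return [row for row in rows if row["updated_at"] > last_run]


last_run = datetime(2024, 1, 4)
changed = extract_changed(source_rows, last_run)
print([row["id"] for row in changed])

# Advance the watermark so the next run skips rows already processed.
if changed:
    last_run = max(row["updated_at"] for row in changed)
```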
31
Process dimension tables before fact tables. If fact tables are processed
first, we might try to load a fact row for a new student whose entry is not yet
present in the dimension table, which will result in a foreign key error.
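The foreign key error can be reproduced with Python's built-in `sqlite3` module. The table and column names (`dim_student`, `fact_enrollment`) are hypothetical, and SQLite is used only because it needs no setup; note that it enforces foreign keys only after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.execute("CREATE TABLE dim_student (student_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE fact_enrollment ("
    " student_id INTEGER REFERENCES dim_student(student_id),"
    " course TEXT)"
)

# Loading the fact row first fails: the student is not in the dimension yet.
try:
    conn.execute("INSERT INTO fact_enrollment VALUES (1, 'DWH')")
    fact_first_ok = True
except sqlite3.IntegrityError:
    fact_first_ok = False

# Loading the dimension row first, then the fact row, succeeds.
conn.execute("INSERT INTO dim_student VALUES (1, 'Greta Williams')")
conn.execute("INSERT INTO fact_enrollment VALUES (1, 'DWH')")
fact_rows = conn.execute("SELECT COUNT(*) FROM fact_enrollment").fetchone()[0]
print(fact_first_ok, fact_rows)
```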
32
ETL best practices and guidelines
DIM1, DIM2, DIM3 → FACT A
33