Data Warehousing: Lecture No 07
Data Warehousing
By: Dr. Syed Aun Irtaza
Extract Transform Load (ETL)
Putting the pieces together
[Figure: data warehouse architecture with everything except the ETL stage washed out. Labels: data sources (operational databases, www data, archived data); Extract, Transform, Load (ETL); data warehouse with metadata; data marts (MOLAP, ROLAP); tools (query/reporting, data analysis, data mining); business users and IT users.]
The ETL Cycle
EXTRACT: The process of reading data from different sources.
TRANSFORM: The process of transforming the extracted data from its original state into a consistent state so that it can be placed into another database.
LOAD: The process of writing the data into the target database.
[Figure: the ETL cycle. Labels: sources (www data; MIS systems such as Acct and HR; legacy systems; archived data; other indigenous applications in COBOL, VB, C++, Java); EXTRACT; temporary data storage; TRANSFORM and CLEANSE; LOAD; data warehouse; OLAP.]
ETL Processing
ETL consists of independent yet interrelated steps.
It is important to look at the big picture.
Data acquisition time may include:
Extracts from source systems
Data movement
Data transformation
Data cleansing
Data loading
Index maintenance
Statistics collection
Overview of Data Extraction
First step of ETL, followed by many other steps.
Physical Extraction
Online Extraction
Offline Extraction
Legacy vs. OLTP
Logical Data Extraction
Full Extraction
All data is extracted completely from the source system.
Incremental Extraction
Only data that changed after a well-defined point or event in time is extracted.
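The difference between the two logical extraction methods can be sketched in a few lines of Python. This is only an illustration under assumptions: the `orders` table, its `updated_at` column, and the sample rows are hypothetical, and an in-memory SQLite database stands in for the source system.

```python
import sqlite3

def full_extract(conn):
    """Full extraction: read every row from the source table."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders ORDER BY id"
    ).fetchall()

def incremental_extract(conn, last_extract_time):
    """Incremental extraction: read only rows changed after a point in time."""
    return conn.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()

# Hypothetical source system. Timestamps are ISO-8601 strings,
# which compare correctly as plain text.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 100.0, "2005-11-13 09:00:00"),
        (2, 250.0, "2005-11-14 01:30:00"),
        (3, 75.5, "2005-11-14 18:45:00"),
    ],
)

all_rows = full_extract(source)
changed_rows = incremental_extract(source, "2005-11-14 00:00:00")
print(len(all_rows), "rows in full extract")
print([row[0] for row in changed_rows], "ids in incremental extract")
```

The incremental query needs the time of the previous extraction as input, so the ETL job has to remember it between runs.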
Physical Data Extraction…
Online Extraction
Data extracted directly from the source system.
May access source tables through an intermediate system.
Intermediate system usually similar to the source system.
Offline Extraction
Data NOT extracted directly from the source system, instead staged
explicitly outside the original source system.
Data Transformation
Basic tasks
1. Selection
2. Splitting/Joining
3. Conversion
4. Summarization
5. Enrichment
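The five basic tasks can be seen together in one small Python sketch. The records, field names, and the city lookup table are hypothetical, and the interpretation of each task (for example, enrichment as adding a looked-up city name) is one plausible reading rather than the only one.

```python
from collections import defaultdict

# Hypothetical extracted records; field names are illustrative only.
raw = [
    {"cust": "Ali Khan", "city_code": "LHR", "amount": "1200.50", "status": "OK"},
    {"cust": "Sara Malik", "city_code": "KHI", "amount": "300.00", "status": "OK"},
    {"cust": "Omar Sheikh", "city_code": "LHR", "amount": "99.50", "status": "VOID"},
]
CITY_NAMES = {"LHR": "Lahore", "KHI": "Karachi"}  # assumed lookup table

# 1. Selection: keep only the records the warehouse needs.
selected = [r for r in raw if r["status"] == "OK"]

transformed = []
for r in selected:
    # 2. Splitting: break a single field into separate fields.
    first, last = r["cust"].split(" ", 1)
    # 3. Conversion: change the representation (string to number).
    amount = float(r["amount"])
    # 5. Enrichment: add information obtained from a lookup.
    city = CITY_NAMES[r["city_code"]]
    transformed.append(
        {"first_name": first, "last_name": last, "city": city, "amount": amount}
    )

# 4. Summarization: aggregate detail rows to a coarser grain.
total_by_city = defaultdict(float)
for r in transformed:
    total_by_city[r["city"]] += r["amount"]

print(dict(total_by_city))
```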
Data Transformation Basic Tasks
Selection
Data Transformation Basic Tasks
Splitting/joining
Data Transformation Basic Tasks
Conversion
Data Transformation: Conversion Example-1
Summarization
Data Transformation Basic Tasks
Enrichment
Data Transformation Basic Tasks: Enrichment Example
Data Freshness
Very fresh data: low update efficiency
Historical data: high update efficiency
Always trade-offs in the light of goals
System performance
Availability of staging table space
Impact on query workload
Data Volatility
Ratio of new to historical data
High percentages of data change (batch update)
Three Loading Strategies
Once we have transformed data, there are
three primary loading strategies:
Extracting Changed Data
Incremental data extraction: Extract only what has changed, say during the last 24 hours if considering nightly extraction.
Very challenging: Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.
Source Systems
Two CDC sources
• Modern systems
• Legacy systems
CDC in Modern Systems
• Time Stamps
  • Works if timestamp column present
  • If column not present, add column
  • May not be possible to modify table, so add triggers
• Triggers
  • Create trigger for each source table
  • Following each DML operation trigger performs updates
  • Record DML operations in a log
• Partitioning
  • Table range partitioned, say along date key
  • Easy to identify new data, say last week’s data
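The trigger approach listed above can be sketched with SQLite from Python. The `customer` table, the `customer_log` table, and the trigger names are hypothetical; a real source system would use its own DBMS trigger syntax, so treat this as an illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical change log populated by the triggers below.
CREATE TABLE customer_log (
    log_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    customer_id INTEGER
);

-- One trigger per DML operation on the source table.
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('I', NEW.id);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('U', NEW.id);
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('D', OLD.id);
END;
""")

# Ordinary DML against the source table; the triggers record each change.
conn.execute("INSERT INTO customer VALUES (1, 'Ali')")
conn.execute("INSERT INTO customer VALUES (2, 'Sara')")
conn.execute("UPDATE customer SET name = 'Ali Khan' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 2")

changes = conn.execute(
    "SELECT op, customer_id FROM customer_log ORDER BY log_id"
).fetchall()
print(changes)
```

The extraction job then reads `customer_log` instead of scanning the source table. The cost is that every DML operation on the source now performs an extra write.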
CDC in Legacy Systems
Changes recorded on tapes: Changes that occur in legacy transaction processing are recorded on log or journal tapes.
Major Transformation Types
Format revision
Decoding of fields
Calculated and derived values
Splitting of single fields
Merging of information
Character set conversion
Unit of measurement conversion
Date/Time conversion
Summarization
Key restructuring
Duplication
Major Transformation Types
Format revision
Decoding of fields
Covered in De-Norm
Calculated and derived values
Covered in issues
Splitting of single fields
Major Transformation Types
Merging of information
Does not simply mean combining columns to create one column.
Information about a product coming from different sources is merged into a single entity.
Character set conversion
For PC architecture, converting legacy EBCDIC to ASCII.
Unit of measurement conversion
For companies with global branches: km vs. mile, or lb vs. kg.
Date/Time conversion
November 14, 2005 is written as 11/14/2005 in the US format and as 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005.
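The three conversions above can be illustrated with a short Python sketch. The function names are hypothetical, and the EBCDIC code page (037, via Python's built-in `cp037` codec) is an assumption, since legacy systems use several EBCDIC variants.

```python
from datetime import datetime

MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

def ebcdic_to_ascii(raw: bytes) -> bytes:
    """Character set conversion: EBCDIC (code page 037 assumed) to ASCII."""
    return raw.decode("cp037").encode("ascii")

def miles_to_km(miles: float) -> float:
    """Unit conversion: one international mile is 1.609344 km."""
    return miles * 1.609344

def lb_to_kg(pounds: float) -> float:
    """Unit conversion: one pound is 0.45359237 kg."""
    return pounds * 0.45359237

def standardize_date(text: str, source_format: str) -> str:
    """Date conversion: parse a regional format and emit DD MON YYYY."""
    d = datetime.strptime(text, source_format)
    return f"{d.day:02d} {MONTHS[d.month - 1]} {d.year}"

print(ebcdic_to_ascii(b"\xc8\xc5\xd3\xd3\xd6"))
print(standardize_date("11/14/2005", "%m/%d/%Y"))  # US format
print(standardize_date("14/11/2005", "%d/%m/%Y"))  # British format
print(round(miles_to_km(10), 3), round(lb_to_kg(10), 3))
```

Note that the date string alone does not say which format it is in; the ETL job must know the convention of each source system and pass the matching format.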
Major Transformation Types
Removing duplication
Inconsistent naming convention ONE vs 1
Incomplete information
Physically moved, but address not changed
Misspelling or falsification of names
Data content defects
Domain value redundancy
Non-standard data formats
Non-atomic data values
Multipurpose data fields
Embedded meanings
Inconsistent data values
Data quality contamination
Data content defects Examples
Embedded Meanings
RC = received, AP = approved, RJ = rejected
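Decoding such embedded codes during transformation usually amounts to a lookup. A minimal sketch, assuming the three codes above are the complete set and that unknown codes should be rejected rather than passed through (both are assumptions, as is the function name):

```python
# Mapping taken from the example above; assumed to be the complete code set.
STATUS_CODES = {"RC": "received", "AP": "approved", "RJ": "rejected"}

def decode_status(code: str) -> str:
    """Replace a cryptic status code with its readable meaning."""
    key = code.strip().upper()
    if key not in STATUS_CODES:
        # Surface unknown codes instead of silently loading them.
        raise ValueError(f"unknown status code: {code!r}")
    return STATUS_CODES[key]

print([decode_status(c) for c in ["RC", "ap", " RJ "]])
```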
Data Cleansing
Other names: Also called data scrubbing or data cleaning.
More than data arranging: A DWH is NOT just about arranging
data; the data should also be clean for the overall health of the organization.
We drink clean water!
Big problem, big effect: Enormous problem, as most data is
dirty. GIGO (garbage in, garbage out).
Dirty is relative: Dirty means the data does not conform to the proper
domain definition, and this varies from domain to domain.
Paradox: Must involve domain expert, as detailed domain
knowledge is required, so it becomes semi-automatic, but
has to be automatic because of large data sets.
Data duplication: Original problem was removing duplicates
in one system, compounded by duplicates from many
systems.
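The idea that dirty data violates a domain definition can be made concrete with rule checks. The sketch below is illustrative only: the record layout, the sample records, and both rules (a minimum age of 18, and a doctorate not preceding graduation) are assumptions that a domain expert would have to supply in practice.

```python
def find_violations(record: dict, current_year: int) -> list:
    """Return the domain-rule violations found in one record."""
    problems = []
    age = current_year - record["year_of_birth"]
    if age < 18:  # assumed rule: the people recorded are adults
        problems.append(f"implausible age {age}")
    if record["doctorate_year"] < record["graduation_year"]:
        problems.append("doctorate precedes graduation")
    return problems

# Hypothetical records: the first is clean, the second is dirty.
records = [
    {"name": "A", "year_of_birth": 1975, "graduation_year": 1997, "doctorate_year": 2003},
    {"name": "B", "year_of_birth": 2009, "graduation_year": 1986, "doctorate_year": 1985},
]

for rec in records:
    print(rec["name"], find_violations(rec, current_year=2014))
```

Writing the rules needs domain knowledge, but once written they can be applied automatically to large data sets, which is the semi-automatic compromise described above.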
Lighter Side of Dirty Data
Year of birth 2009, current year 2014
Graduation in 1986, doctorate in 1985
Who would take it seriously? Computers while