Details Extract Transform Load
Extracting Changed Data
Incremental data extraction
Incremental data extraction captures only what has changed since the
previous run, e.g. during the last 24 hours for a nightly extraction.
Very challenging
Change Data Capture (CDC) is therefore typically the most challenging
technical issue in data extraction.
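A minimal sketch of a nightly incremental extraction, assuming each source row carries a modification timestamp. The sample rows, the `last_modified` field, and the `extract_changed` helper are hypothetical names used only for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical source rows; `last_modified` is an assumed column name.
rows = [
    {"id": 1, "name": "A", "last_modified": datetime(2005, 11, 13, 9, 0)},
    {"id": 2, "name": "B", "last_modified": datetime(2005, 11, 14, 1, 30)},
    {"id": 3, "name": "C", "last_modified": datetime(2005, 11, 14, 22, 15)},
]


def extract_changed(rows, run_time, window=timedelta(hours=24)):
    """Return rows modified in the half-open interval [run_time - window, run_time)."""
    start = run_time - window
    return [r for r in rows if start <= r["last_modified"] < run_time]


# Nightly run at midnight picks up everything changed during 14 Nov 2005.
nightly_run = datetime(2005, 11, 15, 0, 0)
changed = extract_changed(rows, nightly_run)
print([r["id"] for r in changed])  # [2, 3]
```

The half-open interval is a design choice: consecutive nightly runs then neither miss nor double-count a row stamped exactly at midnight.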
Source Systems
Two CDC sources
• Modern systems
• Legacy systems
CDC in Modern Systems
• Time Stamps
  • Works if a timestamp column is present
  • If the column is not present, add one
  • It may not be possible to modify the table; in that case, add triggers
• Triggers
  • Create a trigger for each source table
  • Following each DML operation, the trigger performs updates
  • DML operations are recorded in a log
• Partitioning
  • Table is range partitioned, say along a date key
  • Easy to identify new data, say last week’s data
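The trigger approach can be sketched with Python's built-in `sqlite3` module. This is an illustration under assumptions, not a prescribed design: the `customer` table, the `customer_log` table, and the operation codes `I`/`U`/`D` are hypothetical, and a production source system would use its own DBMS trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);

-- Hypothetical change log; table and column names are illustrative.
CREATE TABLE customer_log (
    log_id      INTEGER PRIMARY KEY AUTOINCREMENT,
    op          TEXT NOT NULL,
    customer_id INTEGER NOT NULL,
    logged_at   TEXT DEFAULT CURRENT_TIMESTAMP
);

-- One trigger per DML operation on the source table.
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('I', NEW.id);
END;

CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('U', NEW.id);
END;

CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO customer_log (op, customer_id) VALUES ('D', OLD.id);
END;
""")

# Ordinary DML against the source table; the triggers fill the log.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Ali')")
conn.execute("INSERT INTO customer (id, name) VALUES (2, 'Sara')")
conn.execute("UPDATE customer SET name = 'Ali K.' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 2")

# The extraction job reads the log instead of scanning the whole table.
change_log = conn.execute(
    "SELECT op, customer_id FROM customer_log ORDER BY log_id"
).fetchall()
print(change_log)  # [('I', 1), ('I', 2), ('U', 1), ('D', 2)]
```

Unlike a timestamp column, the log also captures deletes, which leave no row behind to carry a timestamp.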
CDC in Legacy Systems
• Changes recorded on tapes: changes that occur in legacy
transaction processing are recorded on log or
journal tapes.
• Changes read and removed from tapes: the log or journal
tapes are read, and the update/transaction changes are
stripped off for movement into the data warehouse.
Format revision
Decoding of fields
Covered in De-Norm
Calculated and derived values
Covered in issues
Splitting of single fields
Major Transformation Types
Merging of information
This does not simply mean combining columns to create one column.
For example, information about a product coming from different sources is merged into a single entity.
Character set conversion
Converting legacy EBCDIC data to ASCII for PC architectures.
Unit of measurement conversion
For companies with global branches, e.g. km vs. mile or lb vs. kg.
Date/Time conversion
November 14, 2005 is written as 11/14/2005 in the US format and 14/11/2005 in the British format.
This date may be standardized to be written as 14 NOV 2005.
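The three conversions above can be sketched in a few lines of Python. The helper names are hypothetical, and `cp037` is only one of several EBCDIC code pages, so treating the legacy data as `cp037` is an assumption that would have to be checked against the actual source system.

```python
from datetime import datetime

# Fixed month abbreviations, so the output does not depend on the locale.
MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

KM_PER_MILE = 1.609344   # exact by definition of the international mile
KG_PER_LB = 0.45359237   # exact by definition of the avoirdupois pound


def ebcdic_to_text(raw: bytes) -> str:
    """Decode EBCDIC bytes, assuming code page 037."""
    return raw.decode("cp037")


def miles_to_km(miles: float) -> float:
    return miles * KM_PER_MILE


def lb_to_kg(pounds: float) -> float:
    return pounds * KG_PER_LB


def standardize_date(text: str, source_format: str) -> str:
    """Parse a date in a known source format and emit it as DD MON YYYY."""
    d = datetime.strptime(text, source_format)
    return f"{d.day:02d} {MONTHS[d.month - 1]} {d.year}"


print(ebcdic_to_text(bytes.fromhex("C8C5D3D3D6")))   # HELLO
print(standardize_date("11/14/2005", "%m/%d/%Y"))    # 14 NOV 2005 (US source)
print(standardize_date("14/11/2005", "%d/%m/%Y"))    # 14 NOV 2005 (British source)
```

The source format has to be supplied per source system: a value such as 03/04/2005 is ambiguous on its own and cannot be interpreted from the data alone.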
Key restructuring (inherent meaning at source)
92            42         4979       234
Country_Code  City_Code  Post_Code  Product_Code
Removing duplication
Incorrect or missing value
Inconsistent naming convention, e.g. ONE vs. 1
Incomplete information
Misspelling or falsification of names
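The key restructuring example above (92 42 4979 234) can be sketched as follows. The sketch assumes that restructuring means splitting the source key into its meaningful parts and replacing it with a generic, meaning-free key in the warehouse; the function names and the in-memory key map are hypothetical.

```python
from itertools import count

# Component names taken from the example; the order is assumed fixed.
FIELDS = ("country_code", "city_code", "post_code", "product_code")


def split_source_key(key: str) -> dict:
    """Split a space-separated source key into its named components."""
    parts = key.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} parts, got {len(parts)}")
    return dict(zip(FIELDS, parts))


# Hypothetical in-memory map from source key to generic warehouse key.
_next_key = count(1)
_key_map = {}


def surrogate_key(source_key: str) -> int:
    """Return a stable generic key for a source key, assigning one if new."""
    normalized = " ".join(source_key.split())
    if normalized not in _key_map:
        _key_map[normalized] = next(_next_key)
    return _key_map[normalized]


parts = split_source_key("92 42 4979 234")
print(parts)
print(surrogate_key("92 42 4979 234"))  # 1
```

Keeping the components as separate attributes preserves their meaning, while the generic key stays valid even if, say, the city code is later renumbered at the source.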
Ahsan Abdullah
Data content defects
• Domain value redundancy
• Non-standard data formats
• Non-atomic data values
• Embedded meanings
• Data quality contamination
Data content defects: Examples
Domain value redundancy
Unit of Measure: Dozen, Doz., Dz., 12
Embedded Meanings
RC = received, AP = approved, RJ = rejected
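Both defects are commonly handled with lookup tables during transformation. The sketch below uses the values from the examples above; the canonical form `DOZEN`, the function names, and the choice to return `None` for an unknown status code are assumptions made for illustration.

```python
# Redundant spellings of the same unit of measure, mapped to one canonical value.
UNIT_SYNONYMS = {
    "dozen": "DOZEN",
    "doz.": "DOZEN",
    "dz.": "DOZEN",
    "12": "DOZEN",
}

# Codes with embedded meanings, decoded into explicit values.
STATUS_CODES = {
    "RC": "received",
    "AP": "approved",
    "RJ": "rejected",
}


def standardize_unit(value: str) -> str:
    """Map a unit spelling to its canonical form; pass unknown values through."""
    cleaned = value.strip()
    return UNIT_SYNONYMS.get(cleaned.lower(), cleaned)


def decode_status(code: str):
    """Decode a status code; return None so unknown codes can be flagged."""
    return STATUS_CODES.get(code.strip().upper())


print([standardize_unit(u) for u in ["Dozen", "Doz.", "Dz.", "12"]])
print([decode_status(c) for c in ["RC", "AP", "RJ"]])
```

Unknown values are deliberately not guessed: passing them through or returning `None` lets a data quality check catch them later.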