
Details Extract Transform Load


Data Warehousing and Data Mining

ETL Detail: Data Extraction & Transformation

Extracting Changed Data
Incremental data extraction
Extract only what has changed since the previous run, e.g. during the last 24
hours for a nightly extraction.

Efficient when changes can be identified


This is efficient when the small set of changed data can be identified
cheaply.

Identification could be costly


Unfortunately, for many source systems, identifying the recently
modified data may be difficult or may affect the operation of the source
system.

Very challenging
Change Data Capture (CDC) is therefore typically the most challenging
technical issue in data extraction.
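
As a minimal sketch of incremental extraction, assume each source row carries a last-modified timestamp. The field name `last_modified` and the sample rows are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta


def extract_changed(rows, last_run, ts_field="last_modified"):
    """Return only the rows modified after the previous extraction run."""
    return [row for row in rows if row[ts_field] > last_run]


# Hypothetical source rows; the field names are illustrative only.
now = datetime(2005, 11, 14, 2, 0)
source_rows = [
    {"id": 1, "last_modified": now - timedelta(days=3)},
    {"id": 2, "last_modified": now - timedelta(hours=5)},
    {"id": 3, "last_modified": now - timedelta(hours=30)},
]

# Nightly extraction: pick up whatever changed during the last 24 hours.
changed = extract_changed(source_rows, last_run=now - timedelta(hours=24))
changed_ids = [row["id"] for row in changed]
```

Only the row modified five hours ago is selected. The sketch works only because the changed rows can be identified from a column. When no such column exists, identification becomes the costly part.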

Source Systems
Two CDC sources
• Modern systems
• Legacy systems

CDC in Modern Systems
• Time Stamps
– Works if timestamp column present
– If column not present, add column
– May not be possible to modify table, so add triggers

• Triggers
– Create trigger for each source table
– Following each DML operation, trigger performs updates
– Record DML operations in a log

• Partitioning
– Table range partitioned, say along date key
– Easy to identify new data, say last week's data
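
The trigger approach above can be sketched with Python's built-in `sqlite3` module. The tables `customer` and `change_log` and the trigger names are hypothetical. A production source system would use its own DBMS's trigger syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One trigger per DML operation on the source table; each trigger
# records the operation in a log table that the extraction job reads.
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE change_log (
    log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
    table_name TEXT,
    operation  TEXT,
    row_id     INTEGER,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER customer_ins AFTER INSERT ON customer
BEGIN
    INSERT INTO change_log (table_name, operation, row_id)
    VALUES ('customer', 'INSERT', NEW.id);
END;
CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log (table_name, operation, row_id)
    VALUES ('customer', 'UPDATE', NEW.id);
END;
CREATE TRIGGER customer_del AFTER DELETE ON customer
BEGIN
    INSERT INTO change_log (table_name, operation, row_id)
    VALUES ('customer', 'DELETE', OLD.id);
END;
""")

# Ordinary DML against the source table.
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Hameed')")
conn.execute("UPDATE customer SET name = 'Hameed Khan' WHERE id = 1")
conn.execute("DELETE FROM customer WHERE id = 1")

# The extraction job reads the log instead of scanning the table.
logged = conn.execute(
    "SELECT operation, row_id FROM change_log ORDER BY log_id"
).fetchall()
```

Each DML statement leaves one row in `change_log`, so the nightly job only needs to read the log entries added since its previous run.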
CDC in Legacy Systems
• Changes recorded on tapes: changes that occur in legacy
transaction processing are recorded on log or journal tapes.
• Changes read and removed from tapes: log or journal
tapes are read, and the update/transaction changes are
stripped off for movement into the data warehouse.

• Problems with reading a log/journal tape are many:


– Contains a lot of extraneous data
– Format is often arcane
– Sequencing of data in the log tape often has deep and
complex implications
– Log tape format varies widely from one DBMS to another.
Major Transformation Types
• Format revision
• Decoding of fields
• Calculated and derived values
• Splitting of single fields
• Merging of information
• Character set conversion
• Unit of measurement conversion
• Date/Time conversion
• Summarization
• Key restructuring
• Duplication
Major Transformation Types

• Format revision

• Decoding of fields
Covered in De-Norm
• Calculated and derived values
Covered in issues
• Splitting of single fields

Major Transformation Types

• Merging of information
This does not simply mean combining columns to create one column.
It means merging information about, say, a product that comes from different sources into a single entity.
• Character set conversion
For PC architecture, converting legacy EBCDIC to ASCII.
• Unit of measurement conversion
For companies with global branches: km vs. mile, or lb vs. kg.
• Date/Time conversion
November 14, 2005 is written as 11/14/2005 in the US format and 14/11/2005 in the British format.
This date may be standardized to be written as 14 NOV 2005.
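
A minimal sketch of the three conversions above. It assumes the legacy data uses code page 037, which is only one of several EBCDIC variants. The function names are hypothetical.

```python
from datetime import datetime

KM_PER_MILE = 1.609344   # international mile
KG_PER_LB = 0.45359237   # avoirdupois pound
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def ebcdic_to_text(raw: bytes) -> str:
    """Character set conversion; cp037 is an assumed EBCDIC code page."""
    return raw.decode("cp037")


def miles_to_km(miles: float) -> float:
    """Unit of measurement conversion: miles to kilometres."""
    return miles * KM_PER_MILE


def lb_to_kg(pounds: float) -> float:
    """Unit of measurement conversion: pounds to kilograms."""
    return pounds * KG_PER_LB


def standardize_date(text: str, source_format: str) -> str:
    """Date conversion: parse a source-specific format, emit DD MON YYYY."""
    parsed = datetime.strptime(text, source_format)
    return f"{parsed.day:02d} {MONTHS[parsed.month - 1]} {parsed.year}"


us_date = standardize_date("11/14/2005", "%m/%d/%Y")
uk_date = standardize_date("14/11/2005", "%d/%m/%Y")
legacy_text = ebcdic_to_text(b"\xc8\xc5\xd3\xd3\xd6")
```

Both the US and the British spelling of the date end up as the same standardized string. The source format has to be known per source system, because 11/14/2005 style strings cannot always be told apart by inspection.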

Major Transformation Types

• Aggregation & Summarization
– How are they different?
– Adding like values is summarization.
– Summarization with calculation across a business dimension is aggregation.
– Example: Monthly compensation = monthly sale + bonus
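
A small sketch of the distinction, following the slide's own example. The sample sales and bonus figures are made up for illustration.

```python
from collections import defaultdict

# Hypothetical detail rows; names and amounts are illustrative only.
sales = [
    {"employee": "A", "month": "2005-11", "sale": 100},
    {"employee": "A", "month": "2005-11", "sale": 250},
    {"employee": "B", "month": "2005-11", "sale": 300},
]
bonus = {("A", "2005-11"): 50, ("B", "2005-11"): 20}

# Summarization: add like values (all sales of one employee in one month).
monthly_sale = defaultdict(int)
for row in sales:
    monthly_sale[(row["employee"], row["month"])] += row["sale"]

# Aggregation in the slide's sense: a calculation on top of the summary,
# here monthly compensation = monthly sale + bonus.
monthly_compensation = {
    key: total + bonus.get(key, 0) for key, total in monthly_sale.items()
}
```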

Major Transformation Types
• Key restructuring (inherent meaning at source)

92            42         4979       234
Country_Code  City_Code  Post_Code  Product_Code

– i.e. 92424979234 changed to 12345678

• Removing duplication
Incorrect or missing value
Inconsistent naming convention ONE vs 1
Incomplete information
Misspelling or falsification of names
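
The key restructuring example on this slide can be sketched as follows. The field widths are read off the slide's sample key, and the `SurrogateKeyGenerator` class is a hypothetical helper, not a prescribed design.

```python
def split_source_key(key: str) -> dict:
    """Split the 11-digit source key into the parts that carry meaning."""
    return {
        "country_code": key[0:2],
        "city_code": key[2:4],
        "post_code": key[4:8],
        "product_code": key[8:11],
    }


class SurrogateKeyGenerator:
    """Hand out meaningless warehouse keys, one per distinct source key."""

    def __init__(self, start: int = 1):
        self._next = start
        self._assigned = {}

    def key_for(self, source_key: str) -> int:
        # Reuse the key already assigned to a known source key.
        if source_key not in self._assigned:
            self._assigned[source_key] = self._next
            self._next += 1
        return self._assigned[source_key]


parts = split_source_key("92424979234")
keys = SurrogateKeyGenerator(start=12345678)
warehouse_key = keys.key_for("92424979234")
```

The parts can still be stored as ordinary attributes, while the warehouse key itself no longer embeds country, city, or product codes that may change at the source.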

Ahsan Abdullah
Data content defects
• Domain value redundancy
• Non-standard data formats
• Non-atomic data values
• Embedded meanings
• Data quality contamination

Data content defects Examples
• Domain value redundancy
– Unit of Measure: Dozen, Doz., Dz., 12

• Non-standard data formats
– Phone Numbers: 1234567 or 123.456.7

• Non-atomic data fields
– Name & Addresses: Dr. Hameed Khan, PhD
Data content defects Examples

• Embedded Meanings
– RC, AP, RJ: received, approved, rejected
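
A minimal sketch of how the example defects on these slides might be cleaned. The lookup tables hold only the values shown on the slides, and the function names and the rule for splitting a name are assumptions made for illustration.

```python
# Lookup tables built from the slide examples only.
UNIT_SYNONYMS = {"dozen": "DOZEN", "doz.": "DOZEN", "dz.": "DOZEN", "12": "DOZEN"}
STATUS_CODES = {"RC": "received", "AP": "approved", "RJ": "rejected"}


def standardize_unit(value: str) -> str:
    """Domain value redundancy: map every spelling to one standard value."""
    return UNIT_SYNONYMS.get(value.strip().lower(), value)


def normalize_phone(value: str) -> str:
    """Non-standard formats: keep the digits, drop the punctuation."""
    return "".join(ch for ch in value if ch.isdigit())


def split_name(value: str) -> dict:
    """Non-atomic values: separate title, name, and suffix.

    Assumes a title ends with '.' and a suffix follows a comma.
    """
    name_part, _, suffix = value.partition(",")
    tokens = name_part.split()
    title = tokens.pop(0) if tokens and tokens[0].endswith(".") else ""
    return {"title": title, "name": " ".join(tokens), "suffix": suffix.strip()}


def decode_status(code: str) -> str:
    """Embedded meanings: decode a cryptic code into a readable value."""
    return STATUS_CODES.get(code.strip().upper(), "unknown")


units = [standardize_unit(u) for u in ("Dozen", "Doz.", "Dz.", "12")]
person = split_name("Dr. Hameed Khan, PhD")
```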
