Data Warehousing: Lecture No 07

The document discusses Extract, Transform, Load (ETL) processes in data warehousing. It describes the ETL cycle and the three main steps: extraction, which reads data from different sources; transformation, which transforms extracted data; and load, which writes transformed data to a target data warehouse. It provides details on different types of data extraction and transformation techniques including logical vs physical extraction, and basic transformation tasks like selection, splitting/joining, conversion, summarization, and enrichment. Finally, it covers aspects of data loading strategies and the three primary loading approaches.


Lecture No 07

Data Warehousing
By: Dr. Syed Aun Irtaza
Extract Transform Load (ETL)

Putting the pieces together

[Figure: the multi-tier data warehouse architecture. Data sources (Tier 0): www data, semistructured sources, archived data and operational databases. Data warehouse server (Tier 1): the Extract, Transform, Load (ETL) layer, metadata, the warehouse and the data marts. OLAP servers (Tier 2): MOLAP and ROLAP. Clients (Tier 3): query/reporting, analysis and data mining tools used by IT users and business users. The slide highlights the ETL layer; everything else has a washed-out look.]
The ETL Cycle

 EXTRACT: the process of reading data from different sources.
 TRANSFORM: the process of transforming the extracted data from its original state into a consistent state, so that it can be placed into another database.
 LOAD: the process of writing the data into the target database (the data warehouse).

[Figure: sources such as www data, MIS systems (Acct, HR), legacy systems, archived data and other indigenous applications (COBOL, VB, C++, Java) are EXTRACTed into temporary data storage, TRANSFORMed and CLEANSEd, and then LOADed into the data warehouse, which feeds OLAP.]
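To make the cycle concrete, here is a minimal sketch of one ETL pass in Python. The table names (src_sales, dw_sales), columns and the in-memory SQLite databases are assumptions for illustration only, not part of the lecture.

```python
import sqlite3

def extract(src_conn):
    """EXTRACT: read raw rows from a (hypothetical) source table."""
    return src_conn.execute("SELECT id, amount, sale_date FROM src_sales").fetchall()

def transform(rows):
    """TRANSFORM: bring rows into a consistent state (amounts as floats, trimmed dates)."""
    return [(r[0], float(r[1]), str(r[2]).strip()) for r in rows]

def load(dw_conn, rows):
    """LOAD: write the transformed rows into the target warehouse table."""
    dw_conn.executemany("INSERT INTO dw_sales (id, amount, sale_date) VALUES (?, ?, ?)", rows)
    dw_conn.commit()

# One ETL cycle end to end (in-memory databases just for the sketch).
src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE src_sales (id INTEGER, amount TEXT, sale_date TEXT)")
src.execute("INSERT INTO src_sales VALUES (1, '150.0', ' 2005-11-14 ')")

dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE dw_sales (id INTEGER, amount REAL, sale_date TEXT)")

load(dw, transform(extract(src)))
print(dw.execute("SELECT * FROM dw_sales").fetchall())
```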
ETL Processing
 ETL consists of independent yet interrelated steps.
 It is important to look at the big picture.
 Data acquisition time may include:
  Extracts from source systems → Data Movement → Data Cleansing → Data Transformation → Data Loading → Index Maintenance → Statistics Collection
 (Note: Backup comes as another element, after statistics collection.)
Overview of Data Extraction
 The first step of ETL, followed by many others.
 Source systems for extraction are typically OLTP systems.
 A very complex task, for a number of reasons:
  Source systems are often very complex and poorly documented.
  Data has to be extracted not once, but a number of times.
 The process design depends on:
  Which extraction method to choose?
  How to make the extracted data available for further processing?
Types of Data Extraction
 Logical Extraction
 Full Extraction
 Incremental Extraction

 Physical Extraction
 Online Extraction
 Offline Extraction
 Legacy vs. OLTP

Logical Data Extraction
 Full Extraction
  The data is extracted completely from the source system.
  No need to keep track of changes.
  Source data is made available as-is, without any additional information.

 Incremental Extraction
  Data is extracted after a well-defined point/event in time.
  A mechanism is used to reflect/record the temporal changes in the data (column or table).
  Can have significant performance impacts on the data warehouse server.
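A small sketch of the two logical extraction styles, assuming a hypothetical orders table with a last_modified timestamp column; in practice the watermark would be persisted between extraction runs.

```python
import sqlite3

def full_extract(conn):
    # Full extraction: pull every row, no change tracking needed.
    return conn.execute("SELECT order_id, amount, last_modified FROM orders").fetchall()

def incremental_extract(conn, watermark):
    # Incremental extraction: only rows changed after a well-defined point in time.
    return conn.execute(
        "SELECT order_id, amount, last_modified FROM orders WHERE last_modified > ?",
        (watermark,),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 100.0, "2005-11-13 09:00:00"),
    (2, 250.0, "2005-11-14 18:30:00"),
])

print(full_extract(conn))                                # all rows
print(incremental_extract(conn, "2005-11-14 00:00:00"))  # only rows changed since the watermark
```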
Physical Data Extraction…
 Online Extraction
  Data is extracted directly from the source system.
  May access source tables through an intermediate system.
  The intermediate system is usually similar to the source system.

 Offline Extraction
  Data is NOT extracted directly from the source system; instead it is staged explicitly outside the original source system.
  The data is either already structured or was created by an extraction routine.
  Some of the prevalent structures are:
   Flat files
   Dump files
   Redo and archive logs
   Transportable tablespaces
Physical Data Extraction

 Legacy vs. OLTP

 Data moved from the source system

 Copy made of the source system data

 Staging area used for performance reasons

Data Transformation

 Basic tasks
1. Selection

2. Splitting/Joining

3. Conversion

4. Summarization

5. Enrichment

Data Transformation Basic Tasks

 Selection

Data Transformation Basic Tasks

 Splitting/joining

Data Transformation Basic Tasks

 Conversion

Data Transformation: Conversion Example-1

 Convert common data elements into a consistent form, e.g. name and address:

  Field format                Field data
  First-Family-Title          Muhammad Ibrahim Contractor
  Family-Title-comma-First    Ibrahim Contractor, Muhammad
  Family-comma-First-Title    Ibrahim, Muhammad Contractor

 Translation of dissimilar codes into a standard code:

  "Natl. ID" and "National ID" are both translated to the standard code NID.
  Flat-number variants F/NO-2, F-2, FL.NO.2, FL.2, FL/NO.2, FL-2, FLAT-2, FLAT#, FLAT,2, FLAT-NO-2 and FL-NO.2 are all standardized to "FLAT No. 2".
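A minimal sketch of the code-translation idea above; the lookup tables and the chosen standard forms (NID, "FLAT No. 2") follow the example but are otherwise assumptions.

```python
# Hypothetical lookup tables for translating dissimilar codes into a standard code.
ID_CODES = {"NATL. ID": "NID", "NATIONAL ID": "NID", "NID": "NID"}

FLAT_VARIANTS = {"F/NO-2", "F-2", "FL.NO.2", "FL.2", "FL/NO.2",
                 "FL-2", "FLAT-2", "FLAT#", "FLAT,2", "FLAT-NO-2", "FL-NO.2"}

def standardize_id_code(value: str) -> str:
    """Translate dissimilar ID-card codes into the standard code."""
    return ID_CODES.get(value.strip().upper(), value)

def standardize_flat(value: str) -> str:
    """Map the many flat-number spellings onto one standard form."""
    return "FLAT No. 2" if value.strip().upper() in FLAT_VARIANTS else value

print(standardize_id_code("National ID"))  # NID
print(standardize_flat("FL/NO.2"))         # FLAT No. 2
```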
Data Transformation Basic Tasks: Conversion Example-2

 Data representation change
  EBCDIC to ASCII

 Operating system change
  Mainframe (MVS) to UNIX
  UNIX to NT or XP

 Data type change
  Program (Excel to Access), database format (FoxPro to Access).
  Character, numeric and date types.
  Fixed and variable length.
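A data representation change can be sketched with Python's built-in codecs; cp037 is one common EBCDIC code page and is an assumption here, since the actual legacy code page varies by system.

```python
# EBCDIC (code page cp037) to ASCII conversion using Python's standard codecs.
ebcdic_bytes = "CUSTOMER 0042".encode("cp037")   # pretend these bytes came from a mainframe file
ascii_text = ebcdic_bytes.decode("cp037").encode("ascii").decode("ascii")

print(ebcdic_bytes)   # raw EBCDIC bytes
print(ascii_text)     # 'CUSTOMER 0042'
```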
Data Transformation Basic Tasks

 Summarization

Data Transformation Basic Tasks

 Enrichment

Data Transformation Basic Tasks: Enrichment Example

 Data elements are mapped from source tables and files to destination fact and dimension tables.

  Input Data:
   HAJI MUHAMMAD IBRAHIM, GOVT. CONT.
   K. S. ABDULLAH & BROTHERS,
   MAMOOJI ROAD, ABDULLAH MANZIL
   RAWALPINDI, Ph 67855

  Parsed Data:
   First Name: HAJI MUHAMMAD
   Family Name: IBRAHIM
   Title: GOVT. CONT.
   Firm: K. S. ABDULLAH & BROTHERS
   Firm Location: ABDULLAH MANZIL
   Road: MAMOOJI ROAD
   Phone: 051-67855
   City: RAWALPINDI
   Code: 46200

 Default values are used in the absence of source data.

 Fields are added for unique keys and time.
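A deliberately simplified sketch of the parsing step above, assuming hypothetical rules (regular expressions, an area code of 051 and a default city code of 46200); real name-and-address enrichment uses much richer reference data.

```python
import re

def enrich(record: str, default_code: str = "46200") -> dict:
    """Very simplified enrichment: parse a free-form record into destination fields."""
    fields = {"Code": default_code}                   # default value in the absence of source data
    phone = re.search(r"Ph\s*(\d+)", record)
    if phone:
        fields["Phone"] = "051-" + phone.group(1)     # enrich with an (assumed) area code
    name = re.match(r"([A-Z ]+?) ([A-Z]+),", record)  # last word before the comma is the family name
    if name:
        fields["First Name"], fields["Family Name"] = name.group(1), name.group(2)
    return fields

record = ("HAJI MUHAMMAD IBRAHIM, GOVT. CONT. K. S. ABDULLAH & BROTHERS, "
          "MAMOOJI ROAD, ABDULLAH MANZIL RAWALPINDI, Ph 67855")
print(enrich(record))
```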


Aspects of Data Loading Strategies
 Need to look at:
  Data freshness
  System performance
  Data volatility

 Data freshness
  Very fresh data: low update efficiency
  Historical data: high update efficiency
  Always trade-offs, in the light of the goals

 System performance
  Availability of staging table space
  Impact on the query workload

 Data volatility
  Ratio of new to historical data
  High percentages of data change (batch update)
Three Loading Strategies
 Once we have transformed data, there are three primary loading strategies:

  Full data refresh, with BLOCK INSERT or 'block slamming' into an empty table.

  Incremental data refresh, with BLOCK INSERT or 'block slamming' into existing (populated) tables.

  Trickle/continuous feed, with constant data collection and loading using row-level insert and update operations.
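A rough SQLite/Python sketch of the three strategies; executemany stands in for a real bulk 'block slamming' loader, and the table name and the upsert syntax (SQLite 3.24+) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

rows = [(1, 100.0), (2, 250.0)]

# 1. Full refresh: empty the table, then block-insert everything.
conn.execute("DELETE FROM sales")
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# 2. Incremental refresh: block-insert only the new batch into the populated table.
new_rows = [(3, 80.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?)", new_rows)

# 3. Trickle feed: continuous row-level inserts/updates as changes arrive.
for row_id, amount in [(2, 260.0), (4, 40.0)]:
    conn.execute(
        "INSERT INTO sales VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount = excluded.amount",
        (row_id, amount),
    )

conn.commit()
print(conn.execute("SELECT * FROM sales ORDER BY id").fetchall())
```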
ETL Detail: Data Extraction &
Transformation

Extracting Changed Data
 Incremental data extraction
  Incremental data extraction, i.e. what has changed, say during the last 24 hrs if considering a nightly extraction.

 Efficient when changes can be identified
  This is efficient when the small set of changed data can be identified efficiently.

 Identification could be costly
  Unfortunately, for many source systems, identifying the recently modified data may be difficult, or may affect the operation of the source system.

 Very challenging
  Change Data Capture (CDC) is therefore typically the most challenging technical issue in data extraction.
Source Systems
Two CDC sources
• Modern systems
• Legacy systems

CDC in Modern Systems
• Time stamps
  • Works if a timestamp column is present
  • If the column is not present, add the column
  • It may not be possible to modify the table, so add triggers instead

• Triggers
  • Create a trigger for each source table
  • Following each DML operation, the trigger performs updates
  • DML operations are recorded in a log

• Partitioning
  • Table is range-partitioned, say along a date key
  • Easy to identify new data, say last week's data
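A toy illustration of the trigger approach using SQLite; the change_log table and the single UPDATE trigger are assumptions for the sketch, not a production CDC design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE change_log (table_name TEXT, row_id INTEGER, op TEXT,
                         changed_at TEXT DEFAULT CURRENT_TIMESTAMP);

-- One trigger per DML operation on each source table; only UPDATE is shown here.
CREATE TRIGGER customer_upd AFTER UPDATE ON customer
BEGIN
    INSERT INTO change_log (table_name, row_id, op) VALUES ('customer', NEW.id, 'U');
END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'RAWALPINDI')")
conn.execute("UPDATE customer SET city = 'ISLAMABAD' WHERE id = 1")

# The nightly extract reads only the rows referenced in the change log.
print(conn.execute("SELECT * FROM change_log").fetchall())
```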
CDC in Legacy Systems
 Changes recorded on tapes
  Changes that occur in legacy transaction processing are recorded on the log or journal tapes.

 Changes read and stripped off the tapes
  The log or journal tape is read, and the update/transaction changes are stripped off for movement into the data warehouse.

 Problems with reading a log/journal tape are many:
  ◦ Contains a lot of extraneous data
  ◦ The format is often arcane
  ◦ Often contains addresses instead of data values and keys
  ◦ The sequencing of data on the log tape often has deep and complex implications
  ◦ The log tape varies widely from one DBMS to another
Major Transformation Types
 Format revision
 Decoding of fields
 Calculated and derived values
 Splitting of single fields
 Merging of information
 Character set conversion
 Unit of measurement conversion
 Date/Time conversion
 Summarization
 Key restructuring
 Duplication
Major Transformation Types
 Format revision
 Decoding of fields (covered in de-normalization)
 Calculated and derived values (covered in issues)
 Splitting of single fields
Major Transformation Types
 Merging of information
  Does not really mean combining columns to create one column; rather, information about a product coming from different sources is merged into a single entity.

 Character set conversion
  For PC architectures, converting legacy EBCDIC to ASCII.

 Unit of measurement conversion
  For companies with global branches: km vs. mile, or lb vs. kg.

 Date/Time conversion
  November 14, 2005 is 11/14/2005 in the US format and 14/11/2005 in the British format. This date may be standardized to be written as 14 NOV 2005.
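A small sketch of the date standardization above; since the US and British layouts cannot be told apart automatically, the sketch assumes each feed declares which format it uses.

```python
from datetime import datetime

def standardize_date(value: str, source_format: str) -> str:
    """Parse a US (mm/dd/yyyy) or British (dd/mm/yyyy) date and rewrite it as DD MON YYYY."""
    fmt = {"US": "%m/%d/%Y", "UK": "%d/%m/%Y"}[source_format]
    return datetime.strptime(value, fmt).strftime("%d %b %Y").upper()

print(standardize_date("11/14/2005", "US"))  # 14 NOV 2005
print(standardize_date("14/11/2005", "UK"))  # 14 NOV 2005
```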
Major Transformation Types
 Aggregation & Summarization
  How are they different? Summarization is adding like values; summarization with a calculation across a business dimension is aggregation. Example: monthly compensation = monthly sale + bonus.

 Why are both required?
  Grain mismatch (don't require, don't have space)
  Data marts requiring high detail
  Detail losing its utility
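A toy sketch of the distinction, using made-up detail rows: summarization just adds like values, while the aggregation also applies the calculation (monthly compensation = monthly sale + bonus).

```python
from collections import defaultdict

# (salesperson, month, sale, bonus) — made-up detail rows for the sketch
rows = [
    ("ali",  "2005-11", 1000.0, 50.0),
    ("ali",  "2005-11",  400.0, 20.0),
    ("sara", "2005-11",  900.0, 80.0),
]

# Summarization: add like values (total sale per salesperson per month).
sales = defaultdict(float)
for person, month, sale, bonus in rows:
    sales[(person, month)] += sale

# Aggregation: summarization plus a calculation across the business dimension
# (monthly compensation = monthly sale + bonus).
compensation = defaultdict(float)
for person, month, sale, bonus in rows:
    compensation[(person, month)] += sale + bonus

print(dict(sales))
print(dict(compensation))
```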
Major Transformation Types
 Key restructuring (the key has inherent meaning at the source), as in the sketch after this list:

  92            42          4979        234
  Country_Code  City_Code   Post_Code   Product_Code

  i.e. 92424979234 is changed to a meaningless key such as 12345678.

 Removing duplication
  Inconsistent naming conventions: ONE vs 1
  Incomplete information
  Physically moved, but address not changed
  Misspelling or falsification of names
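A sketch of key restructuring: the production key above carries meaning (country, city, post code, product), so it is replaced by a meaningless surrogate key; the mapping table and the starting value 12345678 are assumptions.

```python
import itertools

surrogate = {}                        # production key -> surrogate key mapping table
next_key = itertools.count(12345678)  # arbitrary starting value, just for the sketch

def restructure_key(production_key: str) -> int:
    """Replace a key with inherent meaning by a meaningless surrogate key."""
    # The old key encodes Country_Code | City_Code | Post_Code | Product_Code:
    parts = (production_key[:2], production_key[2:4], production_key[4:8], production_key[8:])
    print("old key parts:", parts)
    if production_key not in surrogate:
        surrogate[production_key] = next(next_key)
    return surrogate[production_key]

print(restructure_key("92424979234"))  # 12345678, regardless of the embedded meaning
print(restructure_key("92424979234"))  # the same surrogate key on a second load
```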
Data content defects
 Domain value redundancy
 Non-standard data formats
 Non-atomic data values
 Multipurpose data fields
 Embedded meanings
 Inconsistent data values
 Data quality contamination
Data content defects: Examples

 Domain value redundancy
  Unit of measure: Dozen, Doz., Dz., 12

 Non-standard data formats
  Phone numbers: 1234567 or 123.456.7

 Non-atomic data fields
  Names & addresses: Dr. Hameed Khan, PhD
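A short sketch of cleaning two of the defects above (domain value redundancy and non-standard phone formats); the standard forms chosen here are assumptions.

```python
import re

UNIT_OF_MEASURE = {"DOZEN": 12, "DOZ.": 12, "DZ.": 12, "12": 12}  # redundant domain values

def standardize_unit(value: str) -> int:
    """Map every spelling of 'dozen' onto the single standard value 12."""
    return UNIT_OF_MEASURE[value.strip().upper()]

def standardize_phone(value: str) -> str:
    """Normalize a phone number to an assumed standard layout 123-4567."""
    digits = re.sub(r"\D", "", value)   # keep digits only
    return f"{digits[:3]}-{digits[3:]}"

print(standardize_unit("Doz."))        # 12
print(standardize_phone("123.456.7"))  # 123-4567
print(standardize_phone("1234567"))    # 123-4567
```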
Data content defects: Examples

 Embedded meanings
  RC, AP, RJ stand for received, approved, rejected
Data Cleansing
 Other names: also called data scrubbing or cleaning.
 More than data arranging: a DWH is NOT just about arranging data; the data should be clean for the overall health of the organization. We drink clean water!
 Big problem, big effect: an enormous problem, as most data is dirty. GIGO (garbage in, garbage out).
 Dirty is relative: dirty means the data does not conform to the proper domain definition, and this varies from domain to domain.
 Paradox: must involve a domain expert, as detailed domain knowledge is required, so it becomes semi-automatic; but it has to be automatic because of the large data sets.
 Data duplication: the original problem was removing duplicates within one system; it is compounded by duplicates coming from many systems.
Lighter Side of Dirty Data
 Year of birth 2009, current year 2014
 Graduation in 1986, doctorate in 1985
 Who would take it seriously? Computers, while summarizing, aggregating, populating etc.
 Small discrepancies become irrelevant for large averages, but what about sums, medians, maximums, minimums etc.?
Serious Problems due to Dirty Data
 Decision making at the government level on investment in schools and then teachers, based on the rate of birth: wrong data results in over- and under-investment.
 Direct mail marketing: letters sent to wrong addresses are returned, or multiple letters go to the same address, causing loss of money, a bad reputation and wrong identification of the marketing region.
3 Classes of Anomalies…
 Syntactically Dirty Data
 ◦ Lexical errors, e.g. 5 columns expected but 4 recorded.
 ◦ Irregularities, e.g. salary recorded in dollars and in rupees.
 Semantically Dirty Data
 ◦ Integrity constraint violation, e.g. a deleted record still exists.
 ◦ Business rule contradiction, e.g. age and date of birth (min age for the job position is 50 years).
 ◦ Duplication
 Coverage Anomalies
 ◦ Missing attributes
 ◦ Missing records
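A small sketch of checking some of these anomalies on a single record: a lexical check on the column count and business-rule checks on age, date of birth and degree years; the record layout is an assumption.

```python
from datetime import date

EXPECTED_COLS = 5
MIN_AGE_FOR_POSITION = 50

def check_record(cols, today=date(2014, 1, 1)):
    """Return a list of anomaly messages found in one record."""
    errors = []
    if len(cols) != EXPECTED_COLS:                        # lexical error: 5 cols expected, fewer recorded
        errors.append(f"expected {EXPECTED_COLS} columns, got {len(cols)}")
        return errors
    name, birth_year, stated_age, graduation, doctorate = cols
    if today.year - int(birth_year) != int(stated_age):   # business rule: age must match year of birth
        errors.append("age does not match year of birth")
    if int(stated_age) < MIN_AGE_FOR_POSITION:             # business rule: minimum age for the position
        errors.append("below minimum age for the job position")
    if int(doctorate) < int(graduation):                   # doctorate cannot precede graduation
        errors.append("doctorate earlier than graduation")
    return errors

print(check_record(["khan", "2009", "5", "1986", "1985"]))
```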
