
Microsoft PowerPoint - 03 - ETL Process - PPT (Compatibility Mode)

This document discusses the Extract, Transform, Load (ETL) process used in data warehousing. It covers the key components of ETL including extraction, transformation, and loading. Extraction involves reading data from different source systems like operational databases. Transformation prepares the data for loading by cleaning and consolidating it. Loading inserts the transformed data into the data warehouse tables. The document outlines some of the difficulties in ETL including diverse source systems and inconsistent data formats across sources. It also describes the use of a staging area to store data temporarily during the ETL process.


YARMOUK UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY AND


COMPUTER SCIENCES

CIS 367: Data Warehousing

Topic 3: Extract Transform Load (ETL)

Dr. Rafat Hammad

Acknowledgements: Most of these slides have been prepared based on various online tutorials and presentations, with credit to their authors, and adapted for our course. Additional slides have been added from the references mentioned in the syllabus.

COMPONENTS OF A DATA WAREHOUSE

Dr. Rafat Hammad, Yarmouk University


TOPIC 3 : OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading

THE ETL CYCLE



ETL OVERVIEW (CONT.)

 ETL stands for Extract, Transform and Load, which is


a process used to collect data from various sources,
transform the data depending on business rules/needs
and load the data into a destination database.
 ETL is often a complex combination of process and
technology that consumes a significant portion of the
data warehouse development efforts and requires the
skills of business analysts, database designers, and
application developers.
 Because ETL is an integral, ongoing, and recurring
part of a data warehouse, the process should be:
 Automated
 Well documented
 Easily changeable
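As a concrete illustration, the three steps can be sketched as a tiny Python pipeline. This is a minimal sketch only; the source data, column names, and cleaning rules below are hypothetical, and SQLite stands in for the destination database.

```python
import csv
import io
import sqlite3

def extract(csv_text):
    """Extract: read raw rows from a source (CSV text stands in for a source system)."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows):
    """Transform: clean and consolidate per business rules (illustrative rules only)."""
    out = []
    for r in rows:
        out.append({
            "customer": r["customer"].strip().title(),  # trim and normalize names
            "state": r["state"].strip().upper(),        # standardize state codes
            "amount": float(r["amount"]),               # cast text to a number
        })
    return out

def load(rows, conn):
    """Load: insert the transformed rows into a warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, state TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (:customer, :state, :amount)", rows)
    conn.commit()

raw = "customer,state,amount\n  alice ,ny,10.5\nBOB,ca,3\n"
conn = sqlite3.connect(":memory:")
load(transform(extract(raw)), conn)
print(conn.execute("SELECT customer, state, amount FROM sales").fetchall())
# → [('Alice', 'NY', 10.5), ('Bob', 'CA', 3.0)]
```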

DIFFICULTIES IN ETL PROCESS

1) Source systems are very diverse and disparate.
2) There is usually a need to deal with source systems
on multiple platforms and different operating
systems.
3) Many source systems are older legacy applications
running on obsolete database technologies.
4) Generally, historical data on changes in values are
not preserved in source operational systems.
Historical information is critical in a data warehouse.



DIFFICULTIES IN ETL PROCESS (CONT.)
5) Source system structures keep changing over time
because of new business conditions. ETL functions
must also be modified accordingly.
6) Inconsistency among source systems. Same data is
likely to be represented differently in the various
source systems.
7) Most source systems do not represent data in types or
formats that are meaningful to the users. Many
representations are cryptic and ambiguous.

MAIN STEPS IN THE ETL PROCESS



ETL STAGING AREA
 ETL operations should be performed in a separate
intermediate storage area called the "staging area".
 The data staging area sits between the data source(s)
and the data target (the data warehouse).
 Staging areas can be implemented as tables in
relational databases, text-based flat files (or XML
files) stored in file systems, or proprietary formatted
binary files stored in file systems.
 A staging area creates a logical and physical separation
between the source systems and the data warehouse.
 A staging area minimizes the impact of the intense,
periodic ETL activity on the source and data
warehouse databases.
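A minimal sketch of this separation, using two SQLite tables to stand in for the staging area and the warehouse (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Staging table: a temporary landing zone, kept separate from the warehouse table.
conn.execute("CREATE TABLE stg_customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE dw_customers (id INTEGER PRIMARY KEY, name TEXT)")

# 1) Extract: land raw source rows in the staging area exactly as received.
source_rows = [(1, " alice"), (2, "BOB ")]
conn.executemany("INSERT INTO stg_customers VALUES (?, ?)", source_rows)

# 2) Transform and load: clean inside the staging area, then merge into the
#    warehouse, so the heavy ETL work never touches the source system again.
conn.execute("""
    INSERT OR REPLACE INTO dw_customers (id, name)
    SELECT id, UPPER(TRIM(name)) FROM stg_customers
""")
conn.execute("DELETE FROM stg_customers")  # clear the staging area for the next run
conn.commit()

print(conn.execute("SELECT * FROM dw_customers ORDER BY id").fetchall())
# → [(1, 'ALICE'), (2, 'BOB')]
```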

TOPIC 3 : OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading



OVERVIEW OF DATA EXTRACTION
 Extraction is the process of reading data from the different
sources that have to be loaded into the data warehouse.
 It is the first step of ETL, followed by many others.
 Extraction is a very complex task for a number of reasons:
 Data is extracted from heterogeneous and inconsistent
data sources.
 Most of the data source systems are poorly documented.
 Each data source has its distinct set of characteristics
that need to be managed and integrated into the ETL
system in order to extract data effectively.
 Very often, it is not possible to add extra logic to the
source systems to enable incremental extraction of
data, because of performance concerns or the increased
workload on these systems.

DATA EXTRACTION ISSUES


o Source identification—identify source applications and
source structures.
o Method of extraction—for each data source, define
whether the extraction process is manual or tool-based.
o Extraction frequency—for each data source, establish
how frequently the data extraction must be done: daily,
weekly, quarterly, and so on.
o Time window—for each data source, denote the time
window for the extraction process.
o Job sequencing—determine whether the beginning of
one job in an extraction job stream has to wait until the
previous job has finished successfully.
o Exception handling—determine how to handle input
records that cannot be extracted.
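These planning decisions can be recorded per data source, for example as a small configuration object. This is a hypothetical sketch; the field values are purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class ExtractionSpec:
    """One record per data source, covering the planning issues listed above."""
    source: str            # source identification
    method: str            # method of extraction: "manual" or "tool-based"
    frequency: str         # extraction frequency: daily, weekly, quarterly, ...
    time_window: str       # time window in which the extraction may run
    depends_on: list       # job sequencing: jobs that must finish successfully first
    on_bad_record: str     # exception handling policy for unextractable records

orders = ExtractionSpec(
    source="orders_db.orders",
    method="tool-based",
    frequency="daily",
    time_window="01:00-03:00",
    depends_on=["customers_extract"],
    on_bad_record="write to reject file",
)
print(orders.frequency)  # → daily
```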


SOURCE IDENTIFICATION
 Source identification includes the identification of all
the proper data sources.
 It also includes examination and verification that the
identified sources will provide the necessary value to
the data warehouse.
 We need to go through the source identification process
for every piece of information that has to be stored in
the data warehouse.
 Source identification needs accuracy, a lot of time, and
comprehensive analysis.


SOURCE IDENTIFICATION STEPS



DATA IN OPERATIONAL SYSTEMS
 Data in the source systems are said to be time-
dependent or temporal. This is because source data
changes with time. The value of a single variable varies
over time.
 History cannot be ignored in the data warehouse.

 For example, the change of address of a customer who


moves from New York to California. If the state code is
used for analyzing some measurements such as sales,
the sales to the customer prior to the change must be
counted in New York and those after the move must be
counted in California.
 Operational data in the source system may be thought
of as falling into two broad categories: Current Value
and Periodic Status

DATA IN OPERATIONAL SYSTEMS – CURRENT VALUE

 Most of the attributes in the source systems fall into


this category. Here the stored value of an attribute
represents the value of the attribute at this moment of
time. The values are transient or transitory. As
business transactions happen, the values change. There
is no way to predict how long the present value will
stay or when it will get changed next.
 Customer name and address, bank account balances,
and outstanding amounts on individual orders are some
examples of this category.
 Data extraction for preserving the history of the
changes in the data warehouse gets quite involved for
this category of data.



DATA IN OPERATIONAL SYSTEMS – PERIODIC STATUS

 This category is not as common as the previous


category. In this category, the value of the attribute is
preserved as the status every time a change occurs.
 At each of these points in time, the status value is
stored with reference to the time when the new value
became effective. This category also includes events
stored with reference to the time when each event
occurred.
 For operational data in this category, the history of the
changes is preserved in the source systems themselves.
Therefore, data extraction for the purpose of keeping
history in the data warehouse is relatively easier.
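A minimal sketch of periodic-status data, with each status value stored alongside the date it became effective (the statuses and dates below are hypothetical):

```python
# Periodic-status data: every change is stored with the time it became
# effective, so history is preserved in the source system itself.
policy_status = [
    ("2023-01-01", "applied"),
    ("2023-02-15", "approved"),
    ("2023-09-30", "cancelled"),
]

def status_on(history, date):
    """Return the status in effect on a given date (ISO dates compare as strings)."""
    current = None
    for effective, status in history:
        if effective <= date:
            current = status
    return current

print(status_on(policy_status, "2023-06-01"))  # → approved
```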


DATA IN OPERATIONAL SYSTEMS (CONT.)



DATA EXTRACTION TECHNIQUES
 Broadly, there are two major types of data extractions
from the source operational systems:
1. “As Is” or static data is the capture of data at a
given point in time. It is like taking a snapshot of
the relevant source data at a certain point in time.
2. Incremental data capture (data of revisions),
which includes the revisions since the last time
data was captured. Incremental data capture may
be immediate or deferred.


IMMEDIATE DATA EXTRACTION


 In this option, the data extraction is real-time. It occurs
as the transactions happen at the source databases and
files.
 There are three options for immediate data
extraction:
1) Capture through Transaction Logs
2) Capture through Database Triggers
3) Capture in Source Applications



IMMEDIATE DATA EXTRACTION (CONT.)


CAPTURE THROUGH TRANSACTION LOGS


 This option uses the transaction logs of the DBMSs
maintained for recovery from possible failures.
 As each transaction adds, updates, or deletes a row
from a database table, the DBMS immediately writes
entries on the log file.
 This data extraction technique reads the transaction
log and selects all the committed transactions.
 There is no extra overhead in the operational systems
because logging is already part of the transaction
processing.
 The appropriate transaction logs contain all the
changes to the various source database tables.



CAPTURE THROUGH TRANSACTION LOGS (CONT.)
 Here are the broad steps for using replication to capture
changes to source data:
 Identify the source system database table
 Identify and define target files in the staging area
 Create mapping between the source table and target files
 Define the replication mode
 Schedule the replication process
 Capture the changes from the transaction logs
 Transfer captured data from logs to target files
 Verify transfer of data changes
 Confirm success or failure of replication
 In metadata, document the outcome of replication.
Maintain definitions of sources, targets, and mappings
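The capture-and-transfer steps above can be sketched as follows. The change-log record format shown is purely illustrative; real DBMS transaction-log formats are proprietary and normally read through a replication tool.

```python
import json

# Hypothetical change-log records, as a replication tool might expose them;
# the field names ("txn", "committed", "op", ...) are illustrative only.
log_lines = [
    '{"txn": 1, "committed": true,  "table": "orders", "op": "INSERT", "row": {"id": 7}}',
    '{"txn": 2, "committed": false, "table": "orders", "op": "UPDATE", "row": {"id": 3}}',
    '{"txn": 3, "committed": true,  "table": "orders", "op": "DELETE", "row": {"id": 5}}',
]

def capture_committed(lines, table):
    """Capture step: select only committed changes to the mapped source table."""
    changes = []
    for line in lines:
        rec = json.loads(line)
        if rec["committed"] and rec["table"] == table:
            changes.append((rec["op"], rec["row"]))
    return changes

# Transfer step: the captured changes would then be written to target files
# in the staging area.
staged = capture_committed(log_lines, "orders")
print(staged)  # → [('INSERT', {'id': 7}), ('DELETE', {'id': 5})]
```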

CAPTURE THROUGH TRANSACTION LOGS (CONT.)



CAPTURE THROUGH DATABASE TRIGGERS
 This option is applicable to source systems that are
database applications.
 Triggers are special stored procedures (programs) that
are stored in the database and fired when certain
predefined events occur.
 You can create trigger programs for all events for which
you need data to be captured. The output of the trigger
programs is written to a separate file that will be used
to extract data for the data warehouse.
 Data capture through database triggers occurs right at
the source and is therefore quite reliable.
 However, execution of trigger procedures during
transaction processing of the source systems puts
additional overhead on the source systems.
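A minimal sketch of this technique, using an SQLite trigger that writes each insert to a separate capture table (the table names are hypothetical; the captured rows would feed the data warehouse extract):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Separate capture table that the ETL process reads from later.
    CREATE TABLE customers_changes (id INTEGER, name TEXT, op TEXT);

    -- The trigger fires on every insert and records the change right at the source.
    CREATE TRIGGER trg_customers_ins AFTER INSERT ON customers
    BEGIN
        INSERT INTO customers_changes VALUES (NEW.id, NEW.name, 'I');
    END;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Alice')")
print(conn.execute("SELECT * FROM customers_changes").fetchall())
# → [(1, 'Alice', 'I')]
```

A real deployment would define similar triggers for updates and deletes, which is exactly the extra transaction-processing overhead the bullet above warns about.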

CAPTURE IN SOURCE APPLICATIONS


 Application programs need to be revised to write all
adds, updates, and deletes to the source files and
database tables.
 Unlike the previous two cases, this technique may be
used for all types of source data irrespective of whether
it is in databases, indexed files, or other flat files.
 Revising the programs in the source operational
systems could be a huge task if the number of source
system programs is large.
 This technique may degrade the performance of the
source applications because of the additional processing
needed to capture the changes on separate files.



DEFERRED DATA EXTRACTION
 The techniques under deferred data extraction do not
capture the changes in real time. The capture happens
later.
 There are two options for deferred data extraction:

 Capture Based on Date and Time Stamp


 Capture by Comparing Files


DEFERRED DATA EXTRACTION (CONT.)



CAPTURE BASED ON DATE AND TIME STAMP
 Every time a source record is created or updated it may
be marked with a stamp showing the date and time.
 The time stamp provides the basis for selecting records
for data extraction. Here the data capture occurs at a
later time, not while each source record is created or
updated.
 This technique works well if the number of revised
records is small.
 This technique presupposes that all the relevant source
records contain date and time stamps. Provided this is
true, data capture based on date and time stamp can
work for any type of source file.
 This technique captures the latest state of the source
data.
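A minimal sketch of this technique, assuming the source rows carry an `updated_at` time stamp (the table, column names, and dates are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (1, 10.0, "2024-01-05 09:00:00"),
    (2, 20.0, "2024-01-06 14:30:00"),
    (3, 30.0, "2024-01-07 08:15:00"),
])

# The extraction job remembers when it last ran and selects only rows
# revised since then, instead of capturing changes as they happen.
last_extract = "2024-01-06 00:00:00"
changed = conn.execute(
    "SELECT id FROM orders WHERE updated_at > ? ORDER BY id", (last_extract,)
).fetchall()
print(changed)  # → [(2,), (3,)]
```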

CAPTURE BY COMPARING FILES


 If none of the above techniques are feasible for specific
source files in your environment, then consider this
technique as the last resort.
 This technique is also called the snapshot differential
technique because it compares two snapshots of the
source data.
 This technique necessitates the keeping of prior copies
of all the relevant source data.
 Though simple and straightforward, comparison of full
rows in a large file can be very inefficient.
 This method may be the only feasible option for some
legacy data sources that do not have transaction logs or
time stamps on source records.
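A minimal sketch of the snapshot differential, comparing two snapshots keyed by primary key (the data is hypothetical):

```python
def snapshot_diff(previous, current):
    """Compare two full snapshots keyed by primary key and classify each change."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Prior copy of the source data vs. today's extract (illustrative rows).
prev = {1: "Alice,NY", 2: "Bob,CA", 3: "Carol,TX"}
curr = {1: "Alice,CA", 2: "Bob,CA", 4: "Dave,WA"}

ins, upd, dele = snapshot_diff(prev, curr)
print(ins, upd, dele)
# → {4: 'Dave,WA'} {1: 'Alice,CA'} {3: 'Carol,TX'}
```

Note that both snapshots must be held in full, and every row is compared, which is why the slide calls this technique simple but potentially very inefficient on large files.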



EVALUATION OF THE TECHNIQUES
 To summarize, the following options are available for
data extraction:
1) Capture of Static Data
2) Incremental Data Capture
A. Immediate Data Extraction
Capture through transaction logs
Capture through database triggers
Capture in source applications
B. Deferred Data Extraction
Capture based on date and time stamp
Capture by comparing files


TOPIC 3 : OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading

