ETL Process
FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES
Acknowledgements: Most of these slides have been prepared from various online tutorials and presentations, with credit to their authors, and adapted for our course. Additional slides have been added from the references mentioned in the syllabus.
COMPONENTS OF A DATA WAREHOUSE
2
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
THE ETL CYCLE
4
ETL OVERVIEW (CONT.)
6
DIFFICULTIES IN ETL PROCESS (CONT.)
5) Source system structures keep changing over time
because of new business conditions. ETL functions
must also be modified accordingly.
6) Inconsistency among source systems. The same data is
likely to be represented differently in the various
source systems.
7) Most source systems do not represent data in types or
formats that are meaningful to the users. Many
representations are cryptic and ambiguous.
7
MAIN STEPS IN THE ETL PROCESS
8
ETL STAGING AREA
ETL Overview
Data Extraction
Data Transformation
Data Loading
OVERVIEW OF DATA EXTRACTION
Extraction is the process of reading data from the different source systems so that it can be loaded into the data warehouse.
It is the first step of ETL and is followed by many others.
It is a very complex task for a number of reasons:
● Data is extracted from heterogeneous and inconsistent
data sources
● Most of the data source systems are poorly documented
● Each data source has its own distinct set of characteristics that needs to be managed and integrated into the ETL system in order to extract data effectively.
● Very often, it is not possible to add logic to the source systems to support incremental extraction of data, because of performance concerns or the increased workload it would place on these systems.
11
DATA EXTRACTION ISSUES
o Source identification—identify source applications and
source structures.
o Method of extraction—for each data source, define
whether the extraction process is manual or tool-based.
o Extraction frequency—for each data source, establish
how frequently the data extraction must be done: daily,
weekly, quarterly, and so on.
o Time window—for each data source, denote the time
window for the extraction process.
o Job sequencing—determine whether the beginning of
one job in an extraction job stream has to wait until the
previous job has finished successfully.
o Exception handling—determine how to handle input
records that cannot be extracted.
12
SOURCE IDENTIFICATION
Source identification includes the identification of all
the proper data sources.
Source identification includes examining and verifying that the identified sources will provide the necessary value to the data warehouse.
You need to go through the source identification process for every piece of information you have to store in the data warehouse.
Source identification requires accuracy, considerable time, and comprehensive analysis.
13
SOURCE IDENTIFICATION STEPS
14
DATA IN OPERATIONAL SYSTEMS
Data in the source systems are said to be time-
dependent or temporal. This is because source data
changes with time. The value of a single variable varies
over time.
History cannot be ignored in the data warehouse.
16
DATA IN OPERATIONAL SYSTEMS – PERIODIC STATUS
17
DATA IN OPERATIONAL SYSTEMS (CONT.)
18
DATA EXTRACTION TECHNIQUES
Broadly, there are two major types of data extractions
from the source operational systems:
1. “As Is” or static data is the capture of data at a
given point in time. It is like taking a snapshot of
the relevant source data at a certain point in time.
2. Incremental data capture (data of revisions),
which includes the revisions since the last time
data was captured. Incremental data capture may
be immediate or deferred.
19
IMMEDIATE DATA EXTRACTION
In this option, the data extraction is real-time. It occurs
as the transactions happen at the source databases and
files.
There are three options for immediate data
extraction:
1) Capture through Transaction Logs
2) Capture through Database Triggers
3) Capture in Source Applications
20
IMMEDIATE DATA EXTRACTION (CONT.)
21
CAPTURE THROUGH TRANSACTION LOGS
This option uses the transaction logs of the DBMSs maintained for recovery from possible failures.
As each transaction adds, updates, or deletes a row
from a database table, the DBMS immediately writes
entries on the log file.
This data extraction technique reads the transaction
log and selects all the committed transactions.
There is no extra overhead in the operational systems
because logging is already part of the transaction
processing.
The appropriate transaction logs contain all the
changes to the various source database tables.
22
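A minimal sketch of log-based capture, assuming the DBMS exposes committed changes through a hypothetical change_log table with a monotonically increasing log sequence number (lsn); real log readers and replication tools use vendor-specific interfaces:

def capture_from_log(conn, last_lsn):
    # conn is any DB-API connection; change_log, lsn, and committed are
    # assumed names used only for illustration.
    # Select only the committed changes recorded after the last captured position.
    return conn.execute(
        "SELECT lsn, table_name, operation, row_data "
        "FROM change_log WHERE committed = 1 AND lsn > ? ORDER BY lsn",
        (last_lsn,),
    ).fetchall()

The captured rows would then be written to the target files in the staging area.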
CAPTURE THROUGH TRANSACTION LOGS (CONT.)
Here are the broad steps for using replication to capture changes to source data:
● Identify the source system database table
● Identify and define target files in the staging area
● Create mapping between the source table and target files
● Define the replication mode
● Schedule the replication process
● Capture the changes from the transaction logs
● Transfer captured data from logs to target files
● Verify transfer of data changes
● Confirm success or failure of replication
● In metadata, document the outcome of replication
● Maintain definitions of sources, targets, and mappings
23
CAPTURE THROUGH TRANSACTION LOGS (CONT.)
24
CAPTURE THROUGH DATABASE TRIGGERS
This option is applicable to source systems that are
database applications.
Triggers are special stored procedures (programs) that are stored in the database and fired when certain predefined events occur.
You can create trigger programs for all events for which
you need data to be captured. The output of the trigger
programs is written to a separate file that will be used
to extract data for the data warehouse.
Data capture through database triggers occurs right at
the source and is therefore quite reliable.
However, the execution of trigger procedures during transaction processing puts additional overhead on the source systems.
25
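As an illustration only, the following sketch uses Python's built-in sqlite3 module to create a trigger that writes every update of a customer table into a separate capture table; the table, column, and trigger names are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE customer_changes (
    change_time TEXT DEFAULT CURRENT_TIMESTAMP,
    operation   TEXT,
    id          INTEGER,
    name        TEXT,
    city        TEXT
);
-- Fired on every update; writes the new row image to the capture table
CREATE TRIGGER trg_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes (operation, id, name, city)
    VALUES ('U', NEW.id, NEW.name, NEW.city);
END;
""")
conn.execute("INSERT INTO customer VALUES (1, 'Susan', 'Amman')")
conn.execute("UPDATE customer SET city = 'Irbid' WHERE id = 1")
print(conn.execute("SELECT * FROM customer_changes").fetchall())

The extraction job later reads the capture table instead of scanning the whole customer table.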
CAPTURE IN SOURCE APPLICATIONS
Application programs need to be revised to write all
adds, updates, and deletes to the source files and
database tables.
Unlike the previous two cases, this technique may be
used for all types of source data irrespective of whether
it is in databases, indexed files, or other flat files.
Revising the programs in the source operational
systems could be a huge task if the number of source
system programs is large.
This technique may degrade the performance of the
source applications because of the additional processing
needed to capture the changes on separate files.
26
DEFERRED DATA EXTRACTION
The techniques under deferred data extraction do not
capture the changes in real time. The capture happens
later.
There are two options for deferred data extraction:
27
DEFERRED DATA EXTRACTION
28
CAPTURE BASED ON DATE AND TIME STAMP
Every time a source record is created or updated it may
be marked with a stamp showing the date and time.
The time stamp provides the basis for selecting records
for data extraction. Here the data capture occurs at a
later time, not while each source record is created or
updated.
This technique works well if the number of revised
records is small.
This technique presupposes that all the relevant source
records contain date and time stamps. Provided this is
true, data capture based on date and time stamp can
work for any type of source file.
This technique captures the latest state of the source
data.
29
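A minimal sketch of time-stamp-based capture, assuming every source record carries a last_updated field (an illustrative name):

from datetime import datetime

def extract_incremental(rows, last_extract_time):
    # Select only the records created or updated since the previous extraction run.
    return [r for r in rows if r["last_updated"] > last_extract_time]

source_rows = [
    {"id": 1, "name": "Susan", "last_updated": datetime(2008, 10, 10, 9, 0)},
    {"id": 2, "name": "Omar",  "last_updated": datetime(2008, 10, 12, 14, 30)},
]
print(extract_incremental(source_rows, datetime(2008, 10, 11)))  # only Omar's record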
CAPTURE BY COMPARING FILES
If none of the above techniques are feasible for specific
source files in your environment, then consider this
technique as the last resort.
This technique is also called the snapshot differential
technique because it compares two snapshots of the
source data.
This technique necessitates the keeping of prior copies
of all the relevant source data.
Though simple and straightforward, comparison of full
rows in a large file can be very inefficient.
This method may be the only feasible option for some
legacy data sources that do not have transaction logs or
time stamps on source records.
30
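A small sketch of the snapshot differential idea: two snapshots keyed by the record's primary key are compared to classify inserts, updates, and deletes (the sample data is illustrative):

def snapshot_diff(previous, current):
    # Compare two snapshots keyed by primary key and classify the changes.
    inserts = [k for k in current if k not in previous]
    deletes = [k for k in previous if k not in current]
    updates = [k for k in current if k in previous and current[k] != previous[k]]
    return inserts, updates, deletes

prev = {1: ("Susan", "Medium"), 2: ("Omar", "Low")}
curr = {1: ("Susan", "High"), 3: ("Lina", "Low")}
print(snapshot_diff(prev, curr))   # ([3], [1], [2])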
EVALUATION OF THE TECHNIQUES
To summarize, the following options are available for
data extraction:
1) Capture of Static Data
2) Incremental Data Capture
A. Immediate Data Extraction
Capture through transaction logs
Capture through database triggers
Capture in source applications
B. Deferred Data Extraction
Capture based on date and time stamp
Capture by comparing files
31
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
DATA TRANSFORMATION
33
DATA TRANSFORMATION (CONT.)
You have to transform the data according to standards because it comes from many dissimilar source systems.
You have to ensure that after all the data is put
together, the combined data does not violate any
business rules.
Transformation of source data encompasses a wide
variety of manipulations to change all the extracted
source data into usable information to be stored in the
data warehouse.
One major effort within data transformation is the
improvement of data quality.
34
DATA TRANSFORMATION (CONT.)
Data Quality paradigm:
● Correct
● Unambiguous
● Consistent
● Complete
35
EXAMPLES OF INCONSISTENT DATA REPRESENTATIONS
36
MAJOR TRANSFORMATION TYPES
Format Revisions. You will come across these quite
often. These revisions include changes to the data types
and lengths of individual fields. In source systems,
product package types may be indicated by codes and
names in which the fields are numeric and text data
types. The lengths of the package types may vary
among the different source systems. It is wise to
standardize and change the data type to text to provide
values meaningful to the users.
37
MAJOR TRANSFORMATION TYPES (CONT.)
Decoding of Fields. This is also a common type of
data transformation. When you deal with multiple
source systems, you are bound to have the same data
items described by a plethora of field values. The
classic example is the coding for gender, with one
source system using 1 and 2 for male and female and
another system using M and F. Also, many legacy
systems are known for using cryptic codes to represent
business values. What do the codes AC, IN, RE, and SU
mean in a customer file? You need to decode all such
cryptic codes and change these into values that make
sense to the users. Change the codes to Active, Inactive,
Regular, and Suspended.
38
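A simple sketch of field decoding using lookup dictionaries; the code values come from the examples above, while the record field names are illustrative:

GENDER_CODES = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}
STATUS_CODES = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}

def decode(record):
    # Replace cryptic source codes with values that make sense to the users.
    record["gender"] = GENDER_CODES.get(record["gender"], "Unknown")
    record["status"] = STATUS_CODES.get(record["status"], "Unknown")
    return record

print(decode({"customer": "C100", "gender": "2", "status": "SU"}))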
MAJOR TRANSFORMATION TYPES (CONT.)
Calculated and Derived Values. What if you want to
keep profit margin along with sales and cost amounts
in your data warehouse tables? The extracted data from
the sales system contains sales amounts, sales units,
and operating cost estimates by product. You will have
to calculate the total cost and the profit margin before
data can be stored in the data warehouse. Average
daily balances and operating ratios are examples of
derived fields.
39
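A minimal sketch of computing derived fields before loading, assuming the extracted record carries the sales amount, sales units, and unit cost (illustrative field names):

def add_derived_fields(sale):
    # Derive total cost and profit margin from the extracted sales fields.
    sale["total_cost"] = sale["sales_units"] * sale["unit_cost"]
    sale["profit_margin"] = (sale["sales_amount"] - sale["total_cost"]) / sale["sales_amount"]
    return sale

print(add_derived_fields({"product": "P10", "sales_amount": 500.0,
                          "sales_units": 20, "unit_cost": 15.0}))
# total_cost = 300.0, profit_margin = 0.4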
MAJOR TRANSFORMATION TYPES (CONT.)
Splitting of Single Fields. Earlier legacy systems
stored names and addresses of customers and
employees in large text fields. The first name, middle
initials, and last name were stored as a large text in a
single field. Similarly, some earlier systems stored city,
state, and zip code data together in a single field. You
need to store individual components of names and
addresses in separate fields in your data warehouse for
two reasons. First, you may improve the operating
performance by indexing on individual components.
Second, your users may need to perform analysis by
using individual components such as city, state, and zip
code.
40
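A small sketch of splitting a combined "City, State ZIP" text field into separate components (the input layout is an assumption for illustration):

def split_city_state_zip(value):
    # Split the single text field into city, state, and zip code components.
    city, rest = value.split(",", 1)
    state, zip_code = rest.split()
    return {"city": city.strip(), "state": state, "zip": zip_code}

print(split_city_state_zip("Newark, NJ 07102"))
# {'city': 'Newark', 'state': 'NJ', 'zip': '07102'}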
MAJOR TRANSFORMATION TYPES (CONT.)
Merging of Information. This is not quite the
opposite of splitting of single fields. This type of data
transformation does not literally mean the merging of
several fields to create a single field of data. For
example, information about a product may come from
different data sources. The product code and
description may come from one data source. The
relevant package types may be found in another data
source. The cost data may be from yet another source.
In this case, merging of information denotes the
combination of the product code, description, package
types, and cost into a single entity.
41
MAJOR TRANSFORMATION TYPES (CONT.)
Character set conversion. This type of data
transformation relates to the conversion of character
sets to an agreed standard character set for textual
data in the data warehouse. If you have mainframe
legacy systems as source systems, the source data from
these systems will be in EBCDIC characters. If PC-
based architecture is the choice for your data
warehouse, then you must convert the mainframe
EBCDIC format to the ASCII format. When your source
data is on other types of hardware and operating
systems, you are faced with similar character set
conversions.
42
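A minimal sketch using Python's standard codecs, assuming the mainframe data uses the common EBCDIC code page cp037:

ebcdic_bytes = "DATA WAREHOUSE".encode("cp037")   # simulated mainframe field
ascii_text   = ebcdic_bytes.decode("cp037")       # decode the EBCDIC bytes ...
ascii_bytes  = ascii_text.encode("ascii")         # ... and re-encode them as ASCII
print(ebcdic_bytes, ascii_bytes)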
MAJOR TRANSFORMATION TYPES (CONT.)
Conversion of Units of Measurements. Many
companies today have global branches. Measurements
in many European countries are in metric units. If your
company has overseas operations, you may have to
convert the metrics so that the numbers are all in one
standard unit of measurement.
Date/Time Conversion. This type relates to
representation of date and time in standard formats.
For example, the American and the British date
formats may be standardized to an international
format. The date of October 11, 2008 is written as
10/11/2008 in the U.S. format and as 11/10/2008 in the
British format. This date may be standardized to be
written as 11 OCT 2008.
43
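A short illustration with Python's datetime module, converting the U.S. format above into the standardized form:

from datetime import datetime

us_date = "10/11/2008"                          # U.S. month/day/year
parsed  = datetime.strptime(us_date, "%m/%d/%Y")
print(parsed.strftime("%d %b %Y").upper())      # 11 OCT 2008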
MAJOR TRANSFORMATION TYPES (CONT.)
Summarization. This type of transformation is the
creation of summaries to be loaded into the data
warehouse instead of loading the most granular level of
data. For example, for a credit card company to analyze
sales patterns, it may not be necessary to store in the
data warehouse every single transaction on each credit
card. Instead, you may want to summarize the daily
transactions for each credit card and store the
summary data instead of storing the most granular
data by individual transactions.
44
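A small sketch that summarizes individual credit card transactions into one row per card per day (illustrative data):

from collections import defaultdict

transactions = [
    {"card": "1111", "date": "2008-10-11", "amount": 25.0},
    {"card": "1111", "date": "2008-10-11", "amount": 40.0},
    {"card": "2222", "date": "2008-10-11", "amount": 10.0},
]

daily_totals = defaultdict(float)
for t in transactions:
    daily_totals[(t["card"], t["date"])] += t["amount"]   # one summary row per card per day

print(dict(daily_totals))   # {('1111', '2008-10-11'): 65.0, ('2222', '2008-10-11'): 10.0}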
MAJOR TRANSFORMATION TYPES (CONT.)
45
MAJOR TRANSFORMATION TYPES (CONT.)
46
MAJOR TRANSFORMATION TYPES (CONT.)
Enrichment: This task is the rearrangement and
simplification of individual fields to make them more
useful for the data warehouse environment. You may
use one or more fields from the same input record to
create a better view of the data for the data warehouse.
This principle is extended when one or more fields
originate from multiple records, resulting in a single
field for the data warehouse.
47
MAJOR TRANSFORMATION TYPES (CONT.)
Deduplication. In many companies, the customer files
have several records for the same customer.
In a normal client database some clients may be
represented by several records for various reasons:
● Incorrect or missing data values because of data
entry errors
● Inconsistent naming conventions, such as: ONE vs 1
● Incomplete information because data is not captured
or available
● Physically moved, but clients did not notify change of
address
● Misspelling or falsification of names
48
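A minimal deduplication sketch that collapses records sharing a normalized matching key; the normalization rule and field names are assumptions for illustration:

def matching_key(record):
    # Build a key that tolerates small naming inconsistencies.
    name = record["name"].lower().replace(".", "").strip()
    return (name, record["zip"])

customers = [
    {"name": "John A. Smith", "zip": "07102"},
    {"name": "john a smith",  "zip": "07102"},
    {"name": "Lina Haddad",   "zip": "11118"},
]

seen, unique = set(), []
for c in customers:
    key = matching_key(c)
    if key not in seen:
        seen.add(key)
        unique.append(c)
print(unique)   # the two John Smith records collapse into one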
PROBLEMS DUE TO DATA DUPLICATION
Data duplication can result in costly errors, such as:
● False frequency distributions.
● Incorrect aggregates due to double counting.
● Difficulty with catching fabricated identities by
credit card companies.
49
SLOWLY CHANGING DIMENSIONS
Compared to the fact table, the dimension tables are more
stable and are generally constant over time.
Unlike the fact table, which changes through an increase
in the number of rows, a dimension table does not change
just through the increase in the number of rows, but also
through changes to the attributes themselves.
Many dimensions, though not constant over time, change
slowly.
In the source OLTP systems, the new values overwrite
the old ones.
In the data warehouse, overwriting dimension table attributes is not always the appropriate option.
There are three types of dimension table changes: Type 1 changes, Type 2 changes, and Type 3 changes.
50
SLOWLY CHANGING DIMENSIONS – TYPE 1
These changes usually relate to correction of errors in
source systems.
The old value in the source system needs to be
discarded.
The change in the source system need not be preserved
in the data warehouse.
Overwrite the attribute value in the dimension table row
with the new value.
No other changes are made in the dimension table row.
51
SLOWLY CHANGING DIMENSIONS – TYPE 1
Type 1: Example
Susan's Tax Bracket attribute value changes from Medium to High
52
SLOWLY CHANGING DIMENSIONS – TYPE 2
These changes usually relate to true changes in source
systems.
Every change for the same attribute must be preserved.
53
SLOWLY CHANGING DIMENSIONS – TYPE 2
Type 2 : Example (with timestamps and row indicator)
Susan's Tax Bracket attribute value changes from Medium to High
54
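A minimal sketch of a Type 2 change applied to an in-memory dimension table: the current row is expired and a new row is added so that the full history is preserved. In a real warehouse the new row would also receive a new surrogate key; the field names here are illustrative:

from datetime import date

def apply_type2_change(dim_rows, customer_key, attr, new_value, change_date):
    # Expire the current row and append a new row that carries the changed value.
    for row in dim_rows:
        if row["customer_key"] == customer_key and row["current"]:
            row["current"] = False
            row["end_date"] = change_date
            dim_rows.append(dict(row, **{attr: new_value,
                                         "start_date": change_date,
                                         "end_date": None,
                                         "current": True}))
            break
    return dim_rows

dim = [{"customer_key": 1, "name": "Susan", "tax_bracket": "Medium",
        "start_date": date(2007, 1, 1), "end_date": None, "current": True}]
print(apply_type2_change(dim, 1, "tax_bracket", "High", date(2008, 10, 11)))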
SLOWLY CHANGING DIMENSIONS – TYPE 3
They usually relate to “soft” or tentative changes in the
source systems.
There is a need to keep track of history with old and
new values of the changed attribute.
They provide the ability to track changes both forward and backward.
55
SLOWLY CHANGING DIMENSIONS – TYPE 3
Type 3: Example (with timestamps)
Susan's Tax Bracket attribute value changes from Medium to High
56
PROS AND CONS
Type-1: Overwrite existing value
+ Simple to implement
- No tracking of history
57
TRANSFORMATION FOR DIMENSION ATTRIBUTES
58
AUTOMATIC DATA CLEANSING
1) Statistical Methods
● Identifying outlier fields and records using the mean, standard deviation, range, and other statistical measures (a sketch follows this slide).
2) Pattern-based
● Identify outlier fields and records that do not
conform to existing patterns in the data.
● A pattern is defined by a group of records that have
similar characteristics (“behavior”) for p% of the
fields in the data set, where p is a user-defined
value (usually above 90).
● Techniques such as partitioning, classification, and
clustering can be used to identify patterns that
apply to most records.
59
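A minimal sketch of the statistical method (1) above: values more than a chosen number of standard deviations from the mean are flagged as outlier candidates:

from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    # Flag values more than threshold standard deviations away from the mean.
    m, s = mean(values), stdev(values)
    return [v for v in values if s and abs(v - m) > threshold * s]

daily_balances = [120, 135, 128, 131, 119, 9800, 127]
print(flag_outliers(daily_balances))   # [9800]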
AUTOMATIC DATA CLEANSING (CONT.)
3) Clustering
● Identify outlier records using clustering based on
Euclidian (or other) distance.
● Clustering the entire record space can reveal outliers that are not identified by field-level inspection.
● Main drawback of this method is computational
time.
4) Association rules
● Association rules with high confidence and support
define a different kind of pattern.
● Records that do not follow these rules are considered
outliers.
60
TOPIC 3 : OUTLINE
ETL Overview
Data Extraction
Data Transformation
Data Loading
DATA LOADING
Data loading is the process of writing the data into the target database. It includes loading both dimension tables and fact tables.
Because loading the data warehouse may take a large amount of time, loads are generally a cause for great concern. During the loads, the data warehouse has to be taken offline.
Consider dividing up the whole load process into
smaller chunks and populating a few files at a time.
The whole process of moving data into the data
warehouse repository is referred to in several ways:
“loading the data” and “refreshing the data”.
62
TYPES OF DATA LOADING
Initial load—populating all the data warehouse tables
for the very first time.
Incremental load—applying ongoing changes as
necessary in a periodic manner.
Full refresh—completely erasing the contents of one
or more tables and reloading with fresh data (initial
load is a refresh of all the tables).
63
APPLYING DATA: TECHNIQUES AND PROCESSES
Data may be applied to data warehouse in the following
four different modes: load, append, destructive merge,
and constructive merge.
Load: If the target table to be loaded already exists
and data exists in the table, the load process wipes out
the existing data and applies the data from the
incoming file. If the table is already empty before
loading, the load process simply applies the data from
the incoming file.
Append: If data already exists in the table, the append
process unconditionally adds the incoming data,
preserving the existing data in the target table. When
an incoming record is a duplicate of an already existing
record, the incoming record may be allowed to be added
as a duplicate or it may be rejected.
64
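A small sketch of the load and append modes applied to an in-memory target table (Python lists stand in for warehouse tables):

def load(target, incoming):
    # Load mode: wipe out the existing rows, then apply the incoming file.
    target.clear()
    target.extend(incoming)

def append(target, incoming, allow_duplicates=True):
    # Append mode: add incoming rows while preserving the existing data;
    # duplicate incoming records are either allowed in or rejected.
    for row in incoming:
        if allow_duplicates or row not in target:
            target.append(row)

warehouse_table = [("P10", 100)]
append(warehouse_table, [("P10", 100), ("P20", 250)], allow_duplicates=False)
print(warehouse_table)   # [('P10', 100), ('P20', 250)]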
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)
65
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)
66
LOADING CHANGES TO DIMENSION TABLES
67
LOADING DIMENSIONS
68
LOADING DIMENSIONS (CONT.)
69
LOADING FACTS
70
LOADING FACT TABLES (CONT.)
71
LOADING FACT TABLES (CONT.)
Managing Indexes
● Indexes are performance killers at load time
● Drop all indexes before the load
● Separate updates from inserts
● Load the updates
● Rebuild the indexes after the load
72
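A minimal sketch of the drop-load-rebuild pattern using Python's sqlite3 module; the table and index names are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (product_key INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_product ON sales_fact (product_key)")

conn.execute("DROP INDEX idx_product")                     # drop indexes before the load
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)",   # bulk load without index upkeep
                 [(i % 100, float(i)) for i in range(10000)])
conn.execute("CREATE INDEX idx_product ON sales_fact (product_key)")  # rebuild afterwards
conn.commit()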
ROLLBACK LOG
The rollback log, also known as the redo log, is invaluable in transaction (OLTP) systems. But in a data warehouse environment, where all transactions are managed by the ETL process, the rollback log is an unnecessary feature that must be dealt with to achieve optimal load performance.
Reasons why the data warehouse does not need
rollback logging are:
All data is entered by a managed process—the ETL system.
Data is loaded in bulk.
74
ETL TOOL OPTIONS
Vendors have approached the challenges of ETL and
addressed them by providing tools falling into the
following three broad functional categories:
● Data transformation engines.
● Data capture through replication
● Code generators
Data transformation engines. These tools capture data from a designated set of source systems at user-defined intervals, perform elaborate data transformations, send the results to a target environment, and apply the data to target files. These tools provide you with maximum flexibility for pointing to various source systems, selecting the appropriate data transformation methods, and applying full and incremental loads.
75
ETL TOOL OPTIONS (CONT.)
Data capture through replication: Most of these
tools use the transaction recovery logs maintained by
the DBMS. The changes to the source systems captured
in the transaction logs are replicated in near real time
to the data staging area for further processing. Some of
the tools provide the ability to replicate data through
the use of database triggers. These specialized stored
procedures in the database signal the replication agent
to capture and transport the changes.
76
ETL TOOL OPTIONS (CONT.)
Code generators: These are tools that directly deal
with the extraction, transformation, and loading of
data. The tools enable the process by generating
program code to perform these functions. Code
generators create 3GL/4GL data extraction and
transformation programs. You provide the parameters
of the data sources and the target layouts along with
the business rules. The tools generate most of the
program code in some of the common programming
languages.
77
MAJOR CAPABILITIES OF ETL TOOLS
79
THE END
80