
YARMOUK UNIVERSITY

FACULTY OF INFORMATION TECHNOLOGY AND COMPUTER SCIENCES

CIS 367: Data Warehousing

Topic 3: Extract Transform Load (ETL)

Dr. Rafat Hammad

Acknowledgements: Most of these slides have been prepared based on various online tutorials and presentations, with credit to their authors, and
adapted for our course. Additional slides have been added from the references mentioned in the syllabus.
COMPONENTS OF A DATA WAREHOUSE

2
TOPIC 3: OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading
THE ETL CYCLE

4
ETL OVERVIEW (CONT.)

 ETL stands for Extract, Transform and Load, which is
a process used to collect data from various sources,
transform the data depending on business rules/needs,
and load the data into a destination database.
 ETL is often a complex combination of process and
technology that consumes a significant portion of the
data warehouse development effort and requires the
skills of business analysts, database designers, and
application developers.
 Because ETL is an integral, ongoing, and recurring
part of a data warehouse, it should be:
● Automated
● Well documented
● Easily changeable
5
DIFFICULTIES IN ETL PROCESS

1) Source systems are very diverse and disparate


2) There is usually a need to deal with source systems
on multiple platforms and different operating
systems.
3) Many source systems are older legacy applications
running on obsolete database technologies.
4) Generally, historical data on changes in values are
not preserved in source operational systems.
Historical information is critical in a data warehouse.

6
DIFFICULTIES IN ETL PROCESS (CONT.)
5) Source system structures keep changing over time
because of new business conditions. ETL functions
must also be modified accordingly.
6) Inconsistency among source systems. The same data is
likely to be represented differently in the various
source systems.
7) Most source systems do not represent data in types or
formats that are meaningful to the users. Many
representations are cryptic and ambiguous.

7
MAIN STEPS IN THE ETL PROCESS

8
ETL STAGING AREA

 ETL operations should be performed in a separate
intermediate storage area called the "staging area".
 The data staging area sits between the data source(s)
and the target data warehouse.
 Staging areas can be implemented in the form of tables
in relational databases, text-based flat files (or XML
files) stored in file systems, or proprietary formatted
binary files stored in file systems.
 A staging area creates a logical and physical separation
between the source systems and the data warehouse.
 A staging area minimizes the impact of the intense
periodic ETL activity on the source and data warehouse
databases.
9
TOPIC 3: OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading
OVERVIEW OF DATA EXTRACTION
 Extraction is the process of reading, from the different
sources, the data that has to be loaded into the data warehouse.
 It is the first step of ETL, followed by many others.
 It is a very complex task for a number of reasons:
● Data is extracted from heterogeneous and inconsistent
data sources.
● Most of the data source systems are poorly documented.
● Each data source has its distinct set of characteristics
that need to be managed and integrated into the ETL
system in order to extract data effectively.
● Very often, it is not possible to add additional
logic to the source systems to enable incremental
extraction of data, due to the performance impact or the
increased workload on these systems.
11
DATA EXTRACTION ISSUES
o Source identification—identify source applications and
source structures.
o Method of extraction—for each data source, define
whether the extraction process is manual or tool-based.
o Extraction frequency—for each data source, establish
how frequently the data extraction must be done: daily,
weekly, quarterly, and so on.
o Time window—for each data source, denote the time
window for the extraction process.
o Job sequencing—determine whether the beginning of
one job in an extraction job stream has to wait until the
previous job has finished successfully.
o Exception handling—determine how to handle input
records that cannot be extracted.
12
SOURCE IDENTIFICATION
 Source identification includes the identification of all
the proper data sources.
 Source identification includes examination and
verification that the identified sources will provide the
necessary value to the data warehouse.
 We need to go through the source identification process
for every piece of information you have to store in the
data warehouse.
 Source identification needs accuracy, lots of time, and
comprehensive analysis.

13
SOURCE IDENTIFICATION STEPS

14
DATA IN OPERATIONAL SYSTEMS
 Data in the source systems are said to be time-
dependent or temporal. This is because source data
changes with time. The value of a single variable varies
over time.
 History cannot be ignored in the data warehouse.

 For example, the change of address of a customer who


moves from New York to California. If the state code is
used for analyzing some measurements such as sales,
the sales to the customer prior to the change must be
counted in New York and those after the move must be
counted in California.
 Operational data in the source system may be thought
of as falling into two broad categories: Current Value
and Periodic Status
15
DATA IN OPERATIONAL SYSTEMS – CURRENT VALUE

 Most of the attributes in the source systems fall into


this category. Here the stored value of an attribute
represents the value of the attribute at this moment of
time. The values are transient or transitory. As
business transactions happen, the values change. There
is no way to predict how long the present value will
stay or when it will get changed next.
 Customer name and address, bank account balances,
and outstanding amounts on individual orders are some
examples of this category.
 Data extraction for preserving the history of the
changes in the data warehouse gets quite involved for
this category of data.

16
DATA IN OPERATIONAL SYSTEMS – PERIODIC STATUS

 This category is not as common as the previous


category. In this category, the value of the attribute is
preserved as the status every time a change occurs.
 At each of these points in time, the status value is
stored with reference to the time when the new value
became effective. This category also includes events
stored with reference to the time when each event
occurred.
 For operational data in this category, the history of the
changes is preserved in the source systems themselves.
Therefore, data extraction for the purpose of keeping
history in the data warehouse is relatively easier.

17
DATA IN OPERATIONAL SYSTEMS (CONT.)

18
DATA EXTRACTION TECHNIQUES
 Broadly, there are two major types of data extractions
from the source operational systems:
1. “As Is” or static data is the capture of data at a
given point in time. It is like taking a snapshot of
the relevant source data at a certain point in time.
2. Incremental data capture (data of revisions),
which includes the revisions since the last time
data was captured. Incremental data capture may
be immediate or deferred.

19
IMMEDIATE DATA EXTRACTION
 In this option, the data extraction is real-time. It occurs
as the transactions happen at the source databases and
files.
 There are three options for immediate data
extraction:
1) Capture through Transaction Logs
2) Capture through Database Triggers
3) Capture in Source Applications

20
IMMEDIATE DATA EXTRACTION (CONT.)

21
CAPTURE THROUGH TRANSACTION L O G S
 This option uses the transaction logs of the DBMSs
maintained for recovery from possible failures.
 As each transaction adds, updates, or deletes a row
from a database table, the DBMS immediately writes
entries on the log file.
 This data extraction technique reads the transaction
log and selects all the committed transactions.
 There is no extra overhead in the operational systems
because logging is already part of the transaction
processing.
 The appropriate transaction logs contain all the
changes to the various source database tables.

22
CAPTURE THROUGH TRANSACTION LOGS (CONT.)
 Here are the broad steps for using replication to capture
changes to source data
● Identify the source system database table
● Identify and define target files in the staging area
● Create mapping between the source table and target files
● Define the replication mode
● Schedule the replication process
● Capture the changes from the transaction logs
● Transfer captured data from logs to target files
● Verify transfer of data changes
● Confirm success or failure of replication
● In metadata, document the outcome of replication.
Maintain definitions of sources, targets, and mappings
23
CAPTURE THROUGH TRANSACTION LOGS (CONT.)

24
CAPTURE THROUGH DATABASE TRIGGERS
 This option is applicable to source systems that are
database applications.
 Triggers are special stored procedures (programs) that
are stored in the database and fired when certain
predefined events occur.
 You can create trigger programs for all events for which
you need data to be captured. The output of the trigger
programs is written to a separate file that will be used
to extract data for the data warehouse.
 Data capture through database triggers occurs right at
the source and is therefore quite reliable.
 Also, execution of trigger procedures during transaction
processing of the source systems puts additional
overhead on the source systems.
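As an illustration only (not taken from the slides), the following minimal sketch uses Python's built-in sqlite3 module to create a hypothetical audit trigger; the customer and customer_changes table names and columns are assumptions.

import sqlite3

conn = sqlite3.connect("source_system.db")  # hypothetical source database
conn.executescript("""
CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, name TEXT, state TEXT);
CREATE TABLE IF NOT EXISTS customer_changes (   -- written by the trigger, read later by ETL
    customer_id INTEGER, changed_at TEXT, new_name TEXT, new_state TEXT);

-- Fire on every update and write the new values to the capture table
CREATE TRIGGER IF NOT EXISTS trg_customer_update AFTER UPDATE ON customer
BEGIN
    INSERT INTO customer_changes (customer_id, changed_at, new_name, new_state)
    VALUES (NEW.id, datetime('now'), NEW.name, NEW.state);
END;
""")
conn.commit()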
25
CAPTURE IN SOURCE APPLICATIONS
 Application programs need to be revised to write all
adds, updates, and deletes to the source files and
database tables.
 Unlike the previous two cases, this technique may be
used for all types of source data irrespective of whether
it is in databases, indexed files, or other flat files.
 Revising the programs in the source operational
systems could be a huge task if the number of source
system programs is large.
 This technique may degrade the performance of the
source applications because of the additional processing
needed to capture the changes on separate files.

26
DEFERRED DATA EXTRACTION
 The techniques under deferred data extraction do not
capture the changes in real time. The capture happens
later.
 There are two options for deferred data extraction:

● Capture Based on Date and Time Stamp


● Capture by Comparing Files

27
DEFERRED DATA EXTRACTION

28
CAPTURE BASED ON DATE AND TIME STAMP
 Every time a source record is created or updated it may
be marked with a stamp showing the date and time.
 The time stamp provides the basis for selecting records
for data extraction. Here the data capture occurs at a
later time, not while each source record is created or
updated.
 This technique works well if the number of revised
records is small.
 This technique presupposes that all the relevant source
records contain date and time stamps. Provided this is
true, data capture based on date and time stamp can
work for any type of source file.
 This technique captures the latest state of the source
data.
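A minimal sketch of timestamp-based capture, assuming a source table orders with a last_updated column and a stored "last extract" timestamp; the sqlite3 source and all names are illustrative assumptions, not part of the slides.

import sqlite3
from datetime import datetime

def extract_changed_rows(conn, last_extract_time):
    """Select only rows created or updated since the previous extraction run."""
    cur = conn.execute(
        "SELECT order_id, customer_id, amount, last_updated "
        "FROM orders WHERE last_updated > ?",        # hypothetical table and column
        (last_extract_time,),
    )
    return cur.fetchall()

conn = sqlite3.connect("source_system.db")            # hypothetical source database
last_run = "2024-01-01 00:00:00"                       # normally read from ETL metadata
changed = extract_changed_rows(conn, last_run)
new_last_run = datetime.now().strftime("%Y-%m-%d %H:%M:%S")  # persist for the next run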
29
CAPTURE BY COMPARING FILES
 If none of the above techniques are feasible for specific
source files in your environment, then consider this
technique as the last resort.
 This technique is also called the snapshot differential
technique because it compares two snapshots of the
source data.
 This technique necessitates the keeping of prior copies
of all the relevant source data.
 Though simple and straightforward, comparison of full
rows in a large file can be very inefficient.
 This method may be the only feasible option for some
legacy data sources that do not have transaction logs or
time stamps on source records.
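A minimal sketch of the snapshot differential idea, assuming each snapshot is a CSV file keyed by customer_id; file and column names are assumptions for illustration.

import csv

def load_snapshot(path, key="customer_id"):
    """Read a snapshot file into a dict keyed by the record's primary key."""
    with open(path, newline="") as f:
        return {row[key]: row for row in csv.DictReader(f)}

previous = load_snapshot("customers_prior.csv")   # copy kept from the last extraction
current = load_snapshot("customers_today.csv")    # freshly extracted full copy

inserts = [current[k] for k in current.keys() - previous.keys()]
deletes = [previous[k] for k in previous.keys() - current.keys()]
updates = [current[k] for k in current.keys() & previous.keys()
           if current[k] != previous[k]]           # full-row comparison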

30
EVALUATION OF THE TECHNIQUES
 To summarize, the following options are available for
data extraction:
1) Capture of Static Data
2) Incremental Data Capture
A. Immediate Data Extraction
 Capture through transaction logs
 Capture through database triggers
 Capture in source applications
B. Deferred Data Extraction
 Capture based on date and time stamp
 Capture by comparing files

31
TOPIC 3: OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading
DATA TRANSFORMATION

 Transformation is the process of transforming the
extracted data from its original state into a consistent
state so that it can be placed into another database.
 The extracted data is raw data; it cannot be
applied to the data warehouse right away and must
first be made usable in the data warehouse.
 Because operational data is extracted from many old
legacy systems, the quality of the data in those
systems is less likely to be good enough for the data
warehouse.
 You have to enrich and improve the quality of the data
before it can be usable in the data warehouse.

33
DATA TRANSFORMATION (CONT.)
 You have to transform the data according to standards
because it comes from many dissimilar source
systems.
 You have to ensure that after all the data is put
together, the combined data does not violate any
business rules.
 Transformation of source data encompasses a wide
variety of manipulations to change all the extracted
source data into usable information to be stored in the
data warehouse.
 One major effort within data transformation is the
improvement of data quality.

34
DATA TRANSFORMATION (CONT.)
 Data Quality paradigm:
● Correct
● Unambiguous
● Consistent
● Complete

 Data quality checks are run at two places:
● After extraction
● After cleaning and conforming; additional checks are
run at this point

35
EXAMPLES OF INCONSISTENT DATA REPRESENTATIONS

 Date value representations, examples:
● 970314
● 1997-03-14
● 03/14/1997
● 14-MAR-1997
● March 14 1997
● 2450521.5 (Julian date format)

 Gender value representations, examples:
● Male/Female
● M/F
● 0/1
● PM/AM

36
MAJOR TRANSFORMATION TYPES
 Format Revisions. You will come across these quite
often. These revisions include changes to the data types
and lengths of individual fields. In source systems,
product package types may be indicated by codes and
names in which the fields are numeric and text data
types. The lengths of the package types may vary
among the different source systems. It is wise to
standardize and change the data type to text to provide
values meaningful to the users.

37
MAJOR TRANSFORMATION TYPES (CONT.)
 Decoding of Fields. This is also a common type of
data transformation. When you deal with multiple
source systems, you are bound to have the same data
items described by a plethora of field values. The
classic example is the coding for gender, with one
source system using 1 and 2 for male and female and
another system using M and F. Also, many legacy
systems are known for using cryptic codes to represent
business values. What do the codes AC, IN, RE, and SU
mean in a customer file? You need to decode all such
cryptic codes and change these into values that make
sense to the users. Change the codes to Active, Inactive,
Regular, and Suspended.
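A minimal decoding sketch using the cryptic status codes mentioned above; the mapping dictionaries and record layout are illustrative assumptions.

# Map cryptic source codes to values that are meaningful to users
STATUS_DECODE = {"AC": "Active", "IN": "Inactive", "RE": "Regular", "SU": "Suspended"}
GENDER_DECODE = {"1": "Male", "2": "Female", "M": "Male", "F": "Female"}

def decode_record(rec):
    """Return a copy of the record with decoded status and gender fields."""
    out = dict(rec)
    out["status"] = STATUS_DECODE.get(rec["status"], "Unknown")
    out["gender"] = GENDER_DECODE.get(str(rec["gender"]), "Unknown")
    return out

print(decode_record({"status": "SU", "gender": "2"}))
# {'status': 'Suspended', 'gender': 'Female'}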

38
MAJOR TRANSFORMATION TYPES (CONT.)
 Calculated and Derived Values. What if you want to
keep profit margin along with sales and cost amounts
in your data warehouse tables? The extracted data from
the sales system contains sales amounts, sales units,
and operating cost estimates by product. You will have
to calculate the total cost and the profit margin before
data can be stored in the data warehouse. Average
daily balances and operating ratios are examples of
derived fields.

39
MAJOR TRANSFORMATION TYPES (CONT.)
 Splitting of Single Fields. Earlier legacy systems
stored names and addresses of customers and
employees in large text fields. The first name, middle
initials, and last name were stored as a large text in a
single field. Similarly, some earlier systems stored city,
state, and zip code data together in a single field. You
need to store individual components of names and
addresses in separate fields in your data warehouse for
two reasons. First, you may improve the operating
performance by indexing on individual components.
Second, your users may need to perform analysis by
using individual components such as city, state, and zip
code.
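A minimal splitting sketch; it assumes the combined field follows a "City, ST 12345" layout, which real legacy data rarely does this cleanly.

def split_city_state_zip(value):
    """Split a combined 'City, ST 12345' field into its components."""
    city, rest = value.split(",", 1)
    state, zip_code = rest.strip().rsplit(" ", 1)
    return {"city": city.strip(), "state": state.strip(), "zip": zip_code}

print(split_city_state_zip("New York, NY 10001"))
# {'city': 'New York', 'state': 'NY', 'zip': '10001'}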

40
MAJOR TRANSFORMATION TYPES (CONT.)
 Merging of Information. This is not quite the
opposite of splitting of single fields. This type of data
transformation does not literally mean the merging of
several fields to create a single field of data. For
example, information about a product may come from
different data sources. The product code and
description may come from one data source. The
relevant package types may be found in another data
source. The cost data may be from yet another source.
In this case, merging of information denotes the
combination of the product code, description, package
types, and cost into a single entity.

41
MAJOR TRANSFORMATION TYPES (CONT.)
 Character set conversion. This type of data
transformation relates to the conversion of character
sets to an agreed standard character set for textual
data in the data warehouse. If you have mainframe
legacy systems as source systems, the source data from
these systems will be in EBCDIC characters. If PC-
based architecture is the choice for your data
warehouse, then you must convert the mainframe
EBCDIC format to the ASCII format. When your source
data is on other types of hardware and operating
systems, you are faced with similar character set
conversions.
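A minimal sketch using Python's built-in EBCDIC codec (code page 037, codec name "cp037"); the sample bytes are illustrative.

# Bytes as they might arrive from a mainframe extract (EBCDIC code page 037)
ebcdic_bytes = "PRODUCT 123".encode("cp037")

# Convert to a standard character set for the warehouse environment
ascii_text = ebcdic_bytes.decode("cp037").encode("ascii").decode("ascii")
print(ascii_text)  # PRODUCT 123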

42
MAJOR TRANSFORMATION TYPES (CONT.)
 Conversion of Units of Measurements. Many
companies today have global branches. Measurements
in many European countries are in metric units. If your
company has overseas operations, you may have to
convert the metrics so that the numbers are all in one
standard unit of measurement.
 Date/Time Conversion. This type relates to
representation of date and time in standard formats.
For example, the American and the British date
formats may be standardized to an international
format. The date of October 11, 2008 is written as
10/11/2008 in the U.S. format and as 11/10/2008 in the
British format. This date may be standardized to be
written as 11 OCT 2008.
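A minimal sketch of standardizing dissimilar date representations with the Python standard library; the list of accepted input formats is an assumption, and in practice the format should be fixed per source because 10/11/2008 is ambiguous.

from datetime import datetime

INPUT_FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d", "%y%m%d", "%d-%b-%Y"]

def standardize_date(text):
    """Try each known source format and emit one standard form (e.g., 11 OCT 2008)."""
    for fmt in INPUT_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d %b %Y").upper()
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {text}")

print(standardize_date("10/11/2008"))  # '11 OCT 2008' (parsed with the U.S. format first)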

43
MAJOR TRANSFORMATION TYPES (CONT.)
 Summarization. This type of transformation is the
creating of summaries to be loaded in the data
warehouse instead of loading the most granular level of
data. For example, for a credit card company to analyze
sales patterns, it may not be necessary to store in the
data warehouse every single transaction on each credit
card. Instead, you may want to summarize the daily
transactions for each credit card and store the
summary data instead of storing the most granular
data by individual transactions.
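A minimal summarization sketch that rolls individual card transactions up to one row per card per day; the input structure is an assumption.

from collections import defaultdict

transactions = [
    {"card": "1111", "date": "2024-05-01", "amount": 25.00},
    {"card": "1111", "date": "2024-05-01", "amount": 40.50},
    {"card": "2222", "date": "2024-05-01", "amount": 10.00},
]

daily_summary = defaultdict(lambda: {"total": 0.0, "count": 0})
for t in transactions:
    key = (t["card"], t["date"])                 # grain: card per day, not per transaction
    daily_summary[key]["total"] += t["amount"]
    daily_summary[key]["count"] += 1

for (card, day), agg in daily_summary.items():
    print(card, day, agg["total"], agg["count"])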

44
MAJOR TRANSFORMATION TYPES (CONT.)

 Key Restructuring. While extracting data from your


input sources, look at the primary keys of the extracted
records. You will have to come up with keys for the fact
and dimension tables based on the keys in the extracted
records.
 When choosing keys for your data warehouse database
tables, avoid such keys with built-in meanings. Transform
such keys into generic keys generated by the system
itself. This is called key restructuring.
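A minimal sketch of replacing a production key that has built-in meaning with a system-generated surrogate key; the key layout shown is hypothetical.

from itertools import count

surrogate_seq = count(start=1)          # system-generated, meaningless integers
key_map = {}                            # natural (production) key -> surrogate key

def restructure_key(natural_key):
    """Assign a generic surrogate key, ignoring any meaning built into the source key."""
    if natural_key not in key_map:
        key_map[natural_key] = next(surrogate_seq)
    return key_map[natural_key]

# 'NY-12-34567' encodes state and sales office; the warehouse key does not
print(restructure_key("NY-12-34567"))   # 1
print(restructure_key("CA-07-89012"))   # 2
print(restructure_key("NY-12-34567"))   # 1 (same entity, same surrogate key)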

45
MAJOR TRANSFORMATION TYPES (CONT.)

46
MAJOR TRANSFORMATION TYPES (CONT.)
 Enrichment: This task is the rearrangement and
simplification of individual fields to make them more
useful for the data warehouse environment. You may
use one or more fields from the same input record to
create a better view of the data for the data warehouse.
This principle is extended when one or more fields
originate from multiple records, resulting in a single
field for the data warehouse.

47
MAJOR TRANSFORMATION TYPES (CONT.)
 Deduplication. In many companies, the customer files
have several records for the same customer.
 In a normal client database some clients may be
represented by several records for various reasons:
● Incorrect or missing data values because of data
entry errors
● Inconsistent naming convention such as: ONE vs 1
● Incomplete information because data is not captured
or available
● Physically moved, but clients did not notify change of
address
● Misspelling or falsification of names
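A minimal matching sketch for the cases above: records are normalized (case, punctuation, "1" vs "ONE") before comparison so near-duplicates collapse to one customer; the rules shown are illustrative, not a complete deduplication method.

import re

def normalize(name, address):
    """Normalize naming conventions so near-duplicate records compare equal."""
    text = f"{name} {address}".upper()
    text = text.replace(" ONE ", " 1 ")             # ONE vs 1
    text = re.sub(r"[^A-Z0-9 ]", "", text)          # drop punctuation
    return re.sub(r"\s+", " ", text).strip()

customers = [
    {"id": 1, "name": "J. Smith", "address": "1 Main St."},
    {"id": 2, "name": "J Smith",  "address": "One Main St"},
]

seen, unique = {}, []
for c in customers:
    key = normalize(c["name"], c["address"])
    if key not in seen:
        seen[key] = c["id"]
        unique.append(c)                             # keep the first record per customer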

48
PROBLEMS DUE TO DATA DUPLICATION
 Data duplication can result in costly errors, such as:
● False frequency distributions.
● Incorrect aggregates due to double counting.
● Difficulty with catching fabricated identities by
credit card companies.

49
SLOWLY CHANGING DIMENSIONS
 Compared to the fact table, the dimension tables are more
stable and are generally constant over time.
 Unlike the fact table, which changes through an increase
in the number of rows, a dimension table does not change
just through the increase in the number of rows, but also
through changes to the attributes themselves.
 Many dimensions, though not constant over time, change
slowly.
 In the source OLTP systems, the new values overwrite
the old ones.
 In the data warehouse, however, overwriting of dimension
table attributes is not always the appropriate option.
 There are three types of dimension table changes: Type 1
changes, Type 2 changes, and Type 3 changes
50
SLOWLY CHANGING DIMENSIONS – TYPE 1
 These changes usually relate to correction of errors in
source systems.
 The old value in the source system needs to be
discarded.
 The change in the source system need not be preserved
in the data warehouse.
 Overwrite the attribute value in the dimension table row
with the new value.
 No other changes are made in the dimension table row.

 The key of this dimension table or any other key values


are not affected.
 This type is easiest to implement.
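A minimal Type 1 sketch over an in-memory dimension row; in a real warehouse this would be a SQL UPDATE, and the attribute names follow the Susan/Tax Bracket example on the next slide.

customer_dim = {101: {"surrogate_key": 101, "name": "Susan", "tax_bracket": "Medium"}}

def scd_type1_update(dim, surrogate_key, attribute, new_value):
    """Type 1: overwrite the attribute in place; no history, keys unchanged."""
    dim[surrogate_key][attribute] = new_value

scd_type1_update(customer_dim, 101, "tax_bracket", "High")
print(customer_dim[101])   # the old value 'Medium' is simply discarded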

51
SLOWLY CHANGING DIMENSIONS – TYPE 1
Type 1: Example
Susan's Tax Bracket attribute value changes from Medium to High

52
SLOWLY CHANGING DIMENSIONS – TYPE 2
 These changes usually relate to true changes in source
systems.
 Every change for the same attribute must be preserved.

 Add a new dimension table row with the new value of


the changed attribute.
 An effective date field may be included in the dimension
table.
 There are no changes to the original row in the
dimension table.
 The key of the original row is not affected.

 The new row is inserted with a new surrogate key.
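A minimal Type 2 sketch: the old row is left untouched and a new row with a new surrogate key and effective date is added; the structure follows the same Susan example and is an illustrative assumption.

from datetime import date

customer_dim = [
    {"surrogate_key": 101, "customer_id": "C-1", "name": "Susan",
     "tax_bracket": "Medium", "effective_date": date(2020, 1, 1)},
]

def scd_type2_change(dim, customer_id, attribute, new_value, next_key):
    """Type 2: preserve history by adding a new row with a new surrogate key."""
    latest = max((r for r in dim if r["customer_id"] == customer_id),
                 key=lambda r: r["effective_date"])
    new_row = dict(latest, surrogate_key=next_key, effective_date=date.today())
    new_row[attribute] = new_value
    dim.append(new_row)                     # the original row is not modified

scd_type2_change(customer_dim, "C-1", "tax_bracket", "High", next_key=102)
# Queries pick the row whose effective date covers the date of the fact being analyzed.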

53
SLOWLY CHANGING DIMENSIONS – TYPE 2
Type 2 : Example (with timestamps and row indicator)
Susan's Tax Bracket attribute value changes from Medium to High

54
SLOWLY CHANGING DIMENSIONS – TYPE 3
 They usually relate to “soft” or tentative changes in the
source systems.
 There is a need to keep track of history with old and
new values of the changed attribute.
 They provide the ability to track forward and backward

 Add an “old” field in the dimension table for the


affected attribute.
 Push down the existing value of the attribute from the
“current” field to the “old” field.
 Keep the new value of the attribute in the “current”
field.
 The key of the row is not affected.

 No new dimension row is needed.
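A minimal Type 3 sketch with "old" and "current" fields on the same row; field names are illustrative assumptions.

from datetime import date

customer_dim = {101: {"name": "Susan", "current_tax_bracket": "Medium",
                      "old_tax_bracket": None, "effective_date": date(2020, 1, 1)}}

def scd_type3_change(dim, surrogate_key, new_value):
    """Type 3: push the current value down to the 'old' field, keep the new value current."""
    row = dim[surrogate_key]
    row["old_tax_bracket"] = row["current_tax_bracket"]
    row["current_tax_bracket"] = new_value
    row["effective_date"] = date.today()       # same row, same key, no new row added

scd_type3_change(customer_dim, 101, "High")
print(customer_dim[101])   # both 'Medium' (old) and 'High' (current) remain queryable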

55
SLOWLY CHANGING DIMENSIONS – TYPE 3
Type 3: Example (with timestamps)
Susan's Tax Bracket attribute value changes from Medium to High

56
PROS AND CONS
 Type-1: Overwrite existing value
+ Simple to implement
- No tracking of history

 Type-2: Add a new dimension row


+ Accurate historical reporting
+ Pre-computed aggregates unaffected
- Dimension table grows over time

 Type-3: Add a new field


+ Accurate historical reporting to last TWO changes
+ Record keys are unaffected
- Dimension table size increases

57
TRANSFORMATION FOR DIMENSION ATTRIBUTES

58
AUTOMATIC DATA CLEANSING
1) Statistical Methods
● Identifying outlier fields and records using the
values of mean, standard deviation, range, and
other statistical methods (a minimal sketch follows this list).
2) Pattern-based
● Identify outlier fields and records that do not
conform to existing patterns in the data.
● A pattern is defined by a group of records that have
similar characteristics (“behavior”) for p% of the
fields in the data set, where p is a user-defined
value (usually above 90).
● Techniques such as partitioning, classification, and
clustering can be used to identify patterns that
apply to most records.
59
AUTOMATIC DATA CLEANSING (CONT.)
3) Clustering
● Identify outlier records using clustering based on
Euclidean (or other) distance.
● Clustering the entire record space can reveal
outliers that are not identified at the field level
inspection
● Main drawback of this method is computational
time.
4) Association rules
● Association rules with high confidence and support
define a different kind of pattern.
● Records that do not follow these rules are considered
outliers.
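A minimal sketch of the statistical method from item 1 above (mean and standard deviation, i.e., a z-score test); the threshold of 2 standard deviations is used here only because the sample is tiny, and 3 is more usual on real data volumes.

from statistics import mean, stdev

def find_outliers(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations away from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) / sigma > threshold]

order_amounts = [120, 95, 130, 110, 105, 98, 125, 9999]   # 9999 is a likely entry error
print(find_outliers(order_amounts))   # [9999]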
60
TOPIC 3: OUTLINE

 ETL Overview
 Data Extraction
 Data Transformation
 Data Loading
DATA LOADING
 Data Loading is the process of writing the data into
the target data warehouse. It includes loading both dimension
and fact tables.
 Because loading the data warehouse may take a large
amount of time, loads are generally a cause for great
concern. During the loads, the data warehouse has to
be offline.
 Consider dividing up the whole load process into
smaller chunks and populating a few files at a time.
 The whole process of moving data into the data
warehouse repository is referred to in several ways:
"loading the data" and "refreshing the data".

62
TYPES OF DATA LOADING
 Initial load—populating all the data warehouse tables
for the very first time.
 Incremental load—applying ongoing changes as
necessary in a periodic manner.
 Full refresh—completely erasing the contents of one
or more tables and reloading with fresh data (initial
load is a refresh of all the tables).

63
APPLYING DATA: TECHNIQUES AND PROCESSES
 Data may be applied to data warehouse in the following
four different modes: load, append, destructive merge,
and constructive merge.
 Load: If the target table to be loaded already exists
and data exists in the table, the load process wipes out
the existing data and applies the data from the
incoming file. If the table is already empty before
loading, the load process simply applies the data from
the incoming file.
 Append: If data already exists in the table, the append
process unconditionally adds the incoming data,
preserving the existing data in the target table. When
an incoming record is a duplicate of an already existing
record, the incoming record may be allowed to be added
as a duplicate or it may be rejected.
64
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)

 Destructive Merge: In this mode, you apply the


incoming data to the target data. If the primary key of
an incoming record matches with the key of an existing
record, update the matching target record. If the
incoming record is a new record without a match with
any existing record, add the incoming record to the
target table.
 Constructive Merge: This mode is slightly different
from the destructive merge. If the primary key of an
incoming record matches with the key of an existing
record, leave the existing record, add the incoming
record, and mark the added record as superseding the
old record.
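A minimal sketch of the two merge modes over in-memory target structures keyed by primary key; real implementations would use the DBMS's own merge or upsert facilities, and the record layout is an assumption.

def destructive_merge(target, incoming):
    """target maps primary key -> record; matching keys are overwritten, new keys added."""
    for rec in incoming:
        target[rec["key"]] = rec                      # existing version is lost
    return target

def constructive_merge(target, incoming):
    """target maps primary key -> list of versions; existing records are kept."""
    for rec in incoming:
        versions = target.setdefault(rec["key"], [])
        if versions:
            rec = dict(rec, supersedes_prior=True)    # mark the added record
        versions.append(rec)
    return target

history = {"P-100": [{"key": "P-100", "price": 10}]}
constructive_merge(history, [{"key": "P-100", "price": 12}])
# P-100 now has two rows; the newer one is marked as superseding the old record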

65
APPLYING DATA: TECHNIQUES AND PROCESSES (CONT.)

66
LOADING CHANGES TO DIMENSION TABLES

67
LOADING DIMENSIONS

 Physically built to have the minimal sets of


components
 The primary key is a single field containing
meaningless unique integer – Surrogate Keys
 Creating and assigning the surrogate keys occur in
this module
 The data warehouse owns these keys and never
allows any other entity to assign them

68
LOADING DIMENSIONS (CONT.)

69
LOADING FACTS

 When building a fact table, the final ETL step is
converting the natural keys in the new input records
into the correct surrogate keys.
 ETL maintains a special surrogate key lookup table
for each dimension. This table is updated whenever a
new dimension entity is created and whenever a
change occurs on an existing dimension entity
 All of the required lookup tables should be pinned in
memory so that they can be randomly accessed as
each incoming fact record presents its natural keys.
This is one of the reasons for making the lookup
tables separate from the original data warehouse
dimension tables.
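A minimal sketch of the lookup step, assuming the surrogate key lookup tables have already been pinned in memory as dictionaries; all names are illustrative.

# In-memory lookup tables: natural key -> current surrogate key (one per dimension)
customer_lookup = {"C-1001": 15, "C-1002": 16}
product_lookup = {"SKU-9": 7, "SKU-12": 8}

def resolve_fact_keys(fact_record):
    """Replace natural keys in an incoming fact record with warehouse surrogate keys."""
    return {
        "customer_key": customer_lookup[fact_record["customer_id"]],
        "product_key": product_lookup[fact_record["product_code"]],
        "sales_amount": fact_record["sales_amount"],
    }

print(resolve_fact_keys({"customer_id": "C-1001", "product_code": "SKU-9",
                         "sales_amount": 250.0}))
# {'customer_key': 15, 'product_key': 7, 'sales_amount': 250.0}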

70
LOADING FACT TABLES (CONT.)

71
LOADING FACT TABLES (CONT.)

 Managing Indexes
● Performance Killers at load time
● Drop all indexes in pre-load time
● Separate Updates from inserts
● Load updates
● Rebuild indexes
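A minimal sketch of the drop-load-rebuild pattern using sqlite3; the index and table names, and the CSV input, are assumptions for illustration.

import csv
import sqlite3

conn = sqlite3.connect("warehouse.db")                     # hypothetical warehouse database

# 1. Drop indexes before the load so each insert avoids index-maintenance cost
conn.execute("DROP INDEX IF EXISTS idx_sales_customer")

# 2. Bulk-insert the incoming fact rows
with open("sales_increment.csv", newline="") as f:
    rows = [(r["customer_key"], r["product_key"], r["amount"]) for r in csv.DictReader(f)]
conn.executemany(
    "INSERT INTO sales_fact (customer_key, product_key, amount) VALUES (?, ?, ?)", rows)

# 3. Rebuild the indexes once, after the load
conn.execute("CREATE INDEX idx_sales_customer ON sales_fact (customer_key)")
conn.commit()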

72
ROLLBACK LOG
The rollback log, also known as the redo log, is
invaluable in transaction (OLTP) systems. But in a
data warehouse environment where all transactions
are managed by the ETL process, the rollback log is an
unnecessary feature that must be dealt with to achieve
optimal load performance.
 Reasons why the data warehouse does not need
rollback logging are:
 All data is entered by a managed process—the ETL
system.
 Data is loaded in bulk.

 Data can easily be reloaded if a load process fails.

 Each database management system has different


logging features and manages its rollback log
differently
73
DATA REFRESH
 Propagate updates on source data to the warehouse
 When to Refresh?

● Periodically (e.g., every night, every week) or after


significant events
● On every update: not warranted unless warehouse
users require current data (e.g., up-to-the-minute stock
quotes)
● Refresh policy set by administrator based on user
needs and traffic
● Possibly different policies for different sources

74
ETL TOOL OPTIONS
 Vendors have approached the challenges of ETL and
addressed them by providing tools falling into the
following three broad functional categories:
● Data transformation engines.
● Data capture through replication
● Code generators
 Data transformation engines. These tools capture
data from a designated set of source systems at user-
defined intervals, perform elaborate data
transformations, send the results to a target
environment, and apply the data to target files. These
tools provide you with maximum flexibility to point to
various source systems, to select the appropriate data
transformation methods, and to apply full loads and
incremental loads.
75
ETL TOOL OPTIONS (CONT.)
 Data capture through replication: Most of these
tools use the transaction recovery logs maintained by
the D B M S . The changes to the source systems captured
in the transaction logs are replicated in near real time
to the data staging area for further processing. Some of
the tools provide the ability to replicate data through
the use of database triggers. These specialized stored
procedures in the database signal the replication agent
to capture and transport the changes.

76
ETL TOOL OPTIONS (CONT.)
 Code generators: These are tools that directly deal
with the extraction, transformation, and loading of
data. The tools enable the process by generating
program code to perform these functions. Code
generators create 3GL/4GL data extraction and
transformation programs. You provide the parameters
of the data sources and the target layouts along with
the business rules. The tools generate most of the
program code in some of the common programming
languages.

77
MAJOR CAPABILITIES OF ETL TOOLS

 Data extraction from various relational databases of


leading vendors
 Data extraction from old legacy databases, indexed
files, and flat files
 Data transformation from one format to another with
variations in source and target fields
 Performing of standard conversions, key reformatting,
and structural changes
 Provision of audit trails from source to target

 Application of business rules for extraction and


transformation
 Combining of several records from the source systems
into one integrated target record
 Recording and management of metadata
78
ETL SUMMARY AND APPROACH

79
THE END

80
