ETL Architecture
Center of Excellence
Data Warehousing
12/08/2004
Architecture Option 1
Explain the architecture with the diagram and provide pros and cons. Explain how
this option can be implemented in the client environment. This can probably be a
distributed architecture (explained in the ETL Framework session).
Architecture Option 2
Explain the architecture with the diagram and provide pros and cons. Explain how
this option can be implemented in the client environment. This can probably be a
centralized architecture (explained in the ETL Framework session).
Architecture Option 3
Explain the architecture with the diagram and provide pros and cons. Explain how
this option can be implemented in the client environment. This can probably be a
geographically distributed architecture (explained in the ETL Framework session).
Recommendation for ETL Infrastructure
Explain the recommended architecture option for ETL infrastructure and provide
justification. Explain how the client can benefit most from the recommended
architecture option in their current environment. Also provide a small roadmap for
the client on how they should grow with this option.
Modularity
Consistency
Flexibility
Speed
Heterogeneity
Modularity
ETL systems should contain modular elements, which encourage reuse and make
the systems easy to modify when implementing changes.
Consistency
ETL systems should guarantee consistency of data when it is loaded into the data
warehouse.
Flexibility
ETL systems should be flexible: it may be appropriate to accomplish some
transformations in text files and some on the source data system; others may
require the development of custom applications.
Speed
Heterogeneity
ETL systems should be able to work with a wide variety of data in different formats
Features
Rapid development
Development is rapid because there is only one data format for each record type.
Light data transformation
Little or no data transformation is required, since the incoming data is often
already in a format usable in the data warehouse.
Light structural transformation
Because the data comes from a single source, the amount of structural change, such
as table alteration, is also very light.
Simple research requirements
The research efforts to locate data are generally simple.
Heterogeneous Architecture
A heterogeneous architecture for an ETL system is one that extracts data from
multiple sources.
Features
Multiple data sources
More complex development
Substantial research requirements to identify and match data elements
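Matching data elements across multiple sources can be sketched as a per-source
field mapping onto one common layout. This is only a minimal illustration: the
source names, column names, and the to_common_schema helper are hypothetical and
do not come from any particular ETL tool.

```python
# Hypothetical field mappings: source column name -> common warehouse column name.
CRM_MAPPING = {"cust_id": "customer_id", "cust_nm": "customer_name"}
BILLING_MAPPING = {"CustomerNo": "customer_id", "FullName": "customer_name"}


def to_common_schema(record, mapping):
    """Rename the fields of one source record to the common column names."""
    return {target: record[source] for source, target in mapping.items()}


# Two sources that describe the same kind of entity with different field names.
crm_rows = [{"cust_id": 1, "cust_nm": "Acme Ltd"}]
billing_rows = [{"CustomerNo": 2, "FullName": "Globex Inc"}]

combined = [to_common_schema(r, CRM_MAPPING) for r in crm_rows]
combined += [to_common_schema(r, BILLING_MAPPING) for r in billing_rows]
print(combined)
```

The research effort mentioned above is, in practice, the work of filling in
mappings like these for every source.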
Advantages
There are fewer ETL processes to create and maintain.
Disadvantages
Each individual ETL process redundantly performs many of the same actions.
During development, this results in an increase in the amount and complexity of
ETL code to create and test.
Upon implementation, modifications require more effort, as changes potentially
must be made in multiple places.
Conformed table
Advantages
Modularization of ETL processes
The creation of smaller, less complex ETL processes makes troubleshooting
problems and creating future enhancements easier.
By avoiding redundant processing and ETL logic, the conform approach reduces
the chances of error, ultimately improving data quality.
Extensibility
Disadvantages
There are more objects and processes that must be created and maintained.
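The trade-off of the conform approach can be sketched as one shared conform step
whose output (standing in for the conformed table) feeds several smaller loads.
The function names and record layout below are hypothetical, chosen only to
illustrate the idea.

```python
def conform_customer(raw):
    """Shared conform step: standardise a raw customer record once."""
    return {
        "customer_id": int(raw["id"]),
        "name": raw["name"].strip().title(),
        "country": raw["country"].strip().upper(),
    }


def load_sales_dimension(conformed_rows):
    """Downstream load that reuses the conformed rows."""
    return [(r["customer_id"], r["name"]) for r in conformed_rows]


def load_marketing_dimension(conformed_rows):
    """A second downstream load that reuses the same conformed rows."""
    return [(r["customer_id"], r["country"]) for r in conformed_rows]


raw_rows = [{"id": "7", "name": "  acme ltd ", "country": "us "}]
conformed = [conform_customer(r) for r in raw_rows]  # plays the role of the conformed table
sales = load_sales_dimension(conformed)
marketing = load_marketing_dimension(conformed)
print(sales, marketing)
```

The cleansing logic lives in one place, so a change to it is made once; the cost
is the extra object (the conformed rows) and the extra process that maintains it.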
data may not scale as effectively as the database does and will become a bottleneck.
Depending on the architecture, the external mechanism has to control the progress
of the ETL process and provide recovery and restart capability.
The raw data from the source system is copied into staging tables in the
database, where it is cleansed and then loaded into the warehouse tables.
The estimated amount of the data to be extracted and the stage in the ETL process
also impact the decision of how to extract, from a logical and a physical perspective
Full Extraction
The data is extracted completely from the source system
Incremental Extraction
Only the data that has changed since a well-defined event back in history
will be extracted.
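The difference between the two logical extraction methods can be sketched with an
in-memory SQLite table. This assumes the source table carries a last_modified
timestamp that serves as the "well-defined event"; the orders table and its
columns are hypothetical.

```python
import sqlite3

# In-memory database standing in for the source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, last_modified TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, 10.0, "2004-12-01 09:00:00"),
        (2, 25.5, "2004-12-07 14:30:00"),
        (3, 40.0, "2004-12-08 08:15:00"),
    ],
)


def full_extract(conn):
    """Full extraction: read every row, with no change tracking needed."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders ORDER BY id"
    ).fetchall()


def incremental_extract(conn, last_extract_time):
    """Incremental extraction: read only rows changed since the last run."""
    return conn.execute(
        "SELECT id, amount, last_modified FROM orders "
        "WHERE last_modified > ? ORDER BY id",
        (last_extract_time,),
    ).fetchall()


all_rows = full_extract(conn)
changed_rows = incremental_extract(conn, "2004-12-07 00:00:00")
print(len(all_rows), [row[0] for row in changed_rows])
```

The timestamps are stored as ISO-style text so that plain string comparison
orders them correctly.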
Online Extraction
The data is extracted directly from the source system itself.
Offline Extraction
The data is not extracted directly from the source system but is staged
explicitly outside the original source system
Transportation Methodology
Transportation
Transportation is the operation of moving data from one system to another system.
The most common requirements for transportation are in moving data from
Advantages
When source systems and data warehouses use different operating systems
and database systems, flat files are the simplest way to exchange
data between heterogeneous systems with minimal transformations.
Flat files are often the most efficient and easiest-to-manage mechanism for data
transfer.
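Flat-file transportation can be sketched with a delimited file written on the
source side and read on the target side. This is a minimal illustration using
Python's csv module; the in-memory buffer stands in for a file moved between
systems, and the column names are hypothetical.

```python
import csv
import io

source_rows = [
    {"id": "1", "name": "Acme Ltd"},
    {"id": "2", "name": "Globex, Inc"},  # embedded comma is quoted by the writer
]


def export_flat_file(rows, handle):
    """Source side: write the rows as a delimited flat file with a header."""
    writer = csv.DictWriter(handle, fieldnames=["id", "name"])
    writer.writeheader()
    writer.writerows(rows)


def import_flat_file(handle):
    """Target side: read the flat file back into one record per row."""
    return list(csv.DictReader(handle))


buffer = io.StringIO()  # stands in for the transported file
export_flat_file(source_rows, buffer)
buffer.seek(0)
received_rows = import_flat_file(buffer)
print(received_rows)
```

Neither side needs to know the other's operating system or database, only the
agreed file layout.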
These mechanisms also transport the data directly to the target systems, thus
providing both extraction and transformation in a single step
and almost every other Oracle database object) can be directly transported from one
database to another.
Disadvantage
Source and target systems must be running Oracle8i (or higher), must
be running the same operating system, and must use the same character set.
Loading Methodology
Loading Mechanisms
The process of writing the data into the target database.
SQL*Loader
External Tables
OCI and Direct Path APIs
Export/Import
Staging Area
Definition
A place where raw data is brought in, cleaned, combined, archived, and exported to
one or more data marts.
It is also used to get data ready for loading into a presentation server.
Integrates data from many application source systems so there is one common,
system-wide enterprise view of the data.
Data used in the data warehouse is extracted from the data sources, cleansed and
transformed into the data warehouse schema.
Data transformation in the data source systems can interfere with OLTP
performance.
Allows DW professionals to assess data quality problems before the data is loaded
into the warehouse.
We can create a separate database for the data staging area, or create these items
in the data warehouse database.
incoming data
tables to aid in implementing surrogate keys
tables to hold transformed data.
Data Cleaning
Data cleaning may involve checking the spelling of an attribute or
checking the membership of an attribute in a list
Fact Processing
The incoming fact records will have production keys, not data warehouse
keys. The current correct correspondence between a production key and the
data warehouse key must be looked up at load time
Aggregate Processing
Each load of new fact records requires that aggregates be calculated or
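The data cleaning and fact processing steps above can be sketched together: a
membership check against a list of valid values, followed by a lookup from
production key to data warehouse key at load time. The key map, the region list,
and the record layout are hypothetical.

```python
# Hypothetical lookup table: production key -> data warehouse (surrogate) key.
product_key_map = {"P-100": 1, "P-200": 2}
# Hypothetical list of valid values used for the membership check.
VALID_REGIONS = {"NORTH", "SOUTH", "EAST", "WEST"}


def process_facts(records, key_map, valid_regions):
    """Clean each fact record and replace its production key.

    Returns (loaded, rejected): rows ready for the fact table, and rows that
    failed the membership check or had no matching warehouse key.
    """
    loaded, rejected = [], []
    for record in records:
        region = record["region"].strip().upper()
        warehouse_key = key_map.get(record["product_code"])
        if region not in valid_regions or warehouse_key is None:
            rejected.append(record)
            continue
        loaded.append(
            {"product_key": warehouse_key, "region": region, "amount": record["amount"]}
        )
    return loaded, rejected


incoming = [
    {"product_code": "P-100", "region": "north", "amount": 50.0},
    {"product_code": "P-999", "region": "south", "amount": 10.0},  # unknown production key
    {"product_code": "P-200", "region": "mars", "amount": 5.0},  # not in the region list
]
loaded, rejected = process_facts(incoming, product_key_map, VALID_REGIONS)
print(loaded, len(rejected))
```

Collecting rejected rows rather than failing on the first one is a design choice
made for this sketch; a real load might instead stop or route such rows to an
error table.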
sorting on the "many" attribute and verifying that each value has a unique
value on the "one" attribute.
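The sort-and-verify check described above can be sketched as follows: sort on the
"many" attribute, group equal values, and report any value that maps to more than
one value of the "one" attribute. The attribute names and the
one_to_many_violations helper are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def one_to_many_violations(rows, many_attr, one_attr):
    """Return {many_value: [one_values]} for values that break the relationship."""
    ordered = sorted(rows, key=itemgetter(many_attr))  # sort on the "many" attribute
    violations = {}
    for many_value, group in groupby(ordered, key=itemgetter(many_attr)):
        one_values = {row[one_attr] for row in group}
        if len(one_values) > 1:  # more than one parent for the same child value
            violations[many_value] = sorted(one_values)
    return violations


rows = [
    {"product": "A", "category": "Tools"},
    {"product": "B", "category": "Toys"},
    {"product": "A", "category": "Garden"},  # product A now has two categories
]
violations = one_to_many_violations(rows, "product", "category")
print(violations)
```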
Flat Files
Relational tables
whether the data staging area is relational or has more to do with sequential
processing of flat files. Ralph Kimball [1] concludes that
"Most data staging activities are not relational, but rather they are sequential
processing. If your incoming data is in flat-file format you should finish your data
staging processes as flat files before loading it into a relational database." He also
states that if both the source and target databases are relational it may be
appropriate to retain this format and not convert to flat files.
temporarily stores and transforms data extracted from OLTP data sources.
Staging scenarios
Two staging scenarios
Scenario 1: a data staging tool is available
The data is already in a database. The data flow is set up so that data comes out
of the source system, moves through the transformation engine, and into a staging
database.
Scenario 2
In the second scenario, begin with a mainframe legacy system. Then extract the
sought-after data into a flat file, move the file to a staging server, transform its
contents, and load the transformed data into the staging database.
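Scenario 2 can be sketched end to end, assuming the legacy extract is a
fixed-width flat file and using an in-memory SQLite database as the staging
database. The record layout (5-character account id, 20-character name,
8-digit balance in cents) and the stg_account table are hypothetical.

```python
import sqlite3

# Stand-in for the flat file extracted from the mainframe and moved to the
# staging server. Lines are built with format specs so the widths are exact.
extract_lines = [
    f"{42:05d}{'JOHN SMITH':<20}{12550:08d}",
    f"{43:05d}{'MARY JONES':<20}{999:08d}",
]


def parse_line(line):
    """Transform one fixed-width record into typed, cleaned fields."""
    return {
        "account_id": int(line[0:5]),
        "name": line[5:25].strip().title(),
        "balance": int(line[25:33]) / 100,  # cents to currency units
    }


def load_staging(conn, rows):
    """Load the transformed rows into the staging table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS stg_account "
        "(account_id INTEGER, name TEXT, balance REAL)"
    )
    conn.executemany(
        "INSERT INTO stg_account VALUES (:account_id, :name, :balance)", rows
    )


conn = sqlite3.connect(":memory:")
load_staging(conn, [parse_line(line) for line in extract_lines])
staged = conn.execute(
    "SELECT account_id, name, balance FROM stg_account ORDER BY account_id"
).fetchall()
print(staged)
```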
Productivity support
Usability
The data staging system must be as usable as possible; it should have a graphical
user interface.
System Documentation
The data staging system needs to provide a way for developers to easily capture
information about the processes that they are creating.
Metadata driven
Staging Metadata
DBMS metadata
Front Room Metadata
ownership.
DBMS load scripts.
Aggregate definitions.
Aggregate usage statistics, base table usage statistics, and potential
aggregates.
Aggregate modification logs.
and when ? )
Data transformation run time logs, success summaries and time stamps.
Data transformation software version numbers.
Security settings for extract files, extract software, and extract metadata.
Security settings for data transmission (e.g., passwords, certificates).
Data staging area archive logs and recovery procedures.
Data staging archive security settings.
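Some of the staging metadata listed above (run-time logs, success summaries, time
stamps, software version numbers) can be captured automatically around each
transformation. The sketch below is one possible shape for such a log entry; the
TransformRunLog fields and the run_with_log helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TransformRunLog:
    """One run-time log entry for a transformation step (hypothetical layout)."""
    step_name: str
    software_version: str
    started_at: datetime
    finished_at: datetime
    rows_in: int
    rows_out: int
    status: str


def run_with_log(step_name, software_version, transform, rows):
    """Run a transformation and return its output together with a log entry."""
    started_at = datetime.now()
    try:
        result = transform(rows)
        status = "success"
    except Exception:
        result = []
        status = "failed"
    log = TransformRunLog(
        step_name, software_version, started_at, datetime.now(),
        len(rows), len(result), status,
    )
    return result, log


def drop_blank_names(rows):
    """Example transformation: discard rows whose name is empty or whitespace."""
    return [r for r in rows if r["name"].strip()]


rows = [{"name": "Acme"}, {"name": "  "}, {"name": "Globex"}]
output, log = run_with_log("drop_blank_names", "1.0.0", drop_blank_names, rows)
print(log.status, log.rows_in, log.rows_out)
```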
DBMS metadata