BI Unit-II DWH


Unit II

Data Warehousing and Data Mining


a) Data Warehousing (DWH):
Definition, Characteristics, Types,
Data Warehousing Framework,
DWH 3-Tier Architecture,
Alternative Architectures,
Data Warehousing Integration,
Data Warehousing Development Approaches,
Real-time Data Warehousing.
b) Data Mining:
Definition, Characteristics, Benefits, Data Mining Functions, Data
Mining Applications, Data Mining Techniques and Tools.
Text Mining, Web Mining.
Data Warehousing
Data Warehouse:
It is an optimized form of an operational database that contains
only relevant information and provides fast access to data.
 Subject-oriented
Eg: data related to all the departments of an organization
 Integrated: different views of data from sources A, B, and C are
combined into a single unified warehouse view
 Time-variant
 Nonvolatile
Introduction

A warehouse is a repository for data imported from other databases.
Attached to the front end of the warehouse is a set of analytical
procedures for making sense of the data. Retailers, home-shopping
companies, and banks were early adopters of data warehouses.
Different people have different definitions for a data warehouse.
What is Data Warehousing?

 Data warehousing is a technique for collecting and managing data
from varied sources to provide meaningful business insights. It is a
blend of technologies and components which allows the strategic use
of data.
 It is the electronic storage of a large amount of information by a
business, designed for query and analysis instead of transaction
processing. It is a process of transforming data into information and
making it available to users in a timely manner to make a difference.
 The data warehouse is the core of the BI system, which is built for
data analysis and reporting.
A data warehouse system is also known as a decision support
system (DSS) or a business intelligence solution.
History of Data Warehouse

 The data warehouse helps users understand and enhance their
organization's performance. The need to warehouse data evolved as
computer systems became more complex and needed to handle
increasing amounts of information. However, data warehousing
is not a new thing.
 Here are some key events in the evolution of the data warehouse:
 1960s – In a joint research project, Dartmouth and General Mills
develop the terms dimensions and facts.
 1970s – ACNielsen and IRI introduce dimensional data marts
for retail sales.
 1983 – Teradata Corporation introduces a database
management system specifically designed for decision support.
 Data warehousing started in the late 1980s when IBM workers
Paul Murphy and Barry Devlin developed the Business Data
Warehouse.
 However, the real concept was given by Bill Inmon, who is
considered the father of the data warehouse. He has written
about a variety of topics on the building, usage, and
maintenance of the warehouse and the Corporate Information
Factory.
Inmon
Father of the data warehouse
Co-creator of the Corporate
Information Factory.
He has 35 years of
experience in database
technology management
and data warehouse design.
Inmon – Cont'd

 Bill has written about a variety of topics on the building,
usage, and maintenance of the data warehouse and the
Corporate Information Factory.
 He has written more than 650 articles (Datamation,
ComputerWorld, and Byte Magazine).
 Inmon has published 45 books.
 Many of his books have been translated into Chinese, Dutch,
French, German, Japanese, Korean, Portuguese, Russian,
and Spanish.
Definition
 What is Data Warehouse?
A data warehouse is a collection of integrated
databases designed to support a DSS.
 According to the definition of Inmon, the father of data
warehousing (Inmon, 1992a, p. 5):
It is a collection of integrated, subject-oriented
databases designed to support the DSS function,
where each unit of data is non-volatile and relevant
to some moment in time.
Cont'd.

Where is it used?
It is used for evaluating future strategy.
It needs a successful technician with:
 Flexibility.
 Team spirit.
 A good balance of business and technical understanding.
Introduction-Cont’d.
The ultimate use of data warehouse is Mass Customization.
For example, it increased Capital One's customers from
1 million to approximately 9 million in 8 years.
Just like a muscle: DW increases in strength with active use.
With each new test and product, valuable information is
added to the DW, allowing the analyst to learn from the
success and failure of the past.
The key to survival:
Is the ability to analyze, plan, and react to changing
business conditions in a much more rapid fashion.
Data Warehouse
In order for data to be effective, DW must be:
Consistent.
Well integrated.
Well defined.
Time stamped.
DW environment:
The data store, data mart & the metadata.
Bill Inmon provided the following definition:
“A data warehouse is a subject-oriented,
integrated, time-variant and non-volatile
collection of data in support of
management’s decision making process.”
Subject-Oriented:
A data warehouse can be used to analyze a
particular subject area. For example, “sales” can
be a particular subject.
Integrated:
A data warehouse integrates data from multiple
data sources.
For example, source A and source B may have different ways of
identifying a product, but in a data warehouse, there will be only a
single way of identifying a product.
Time-Variant:
Historical data is kept in a data warehouse.
For example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from a data
warehouse.
This contrasts with a transaction system, where
often only the most recent data is kept.
For example, a transaction system may hold the
most recent address of a customer, where a data
warehouse can hold all addresses associated with a
customer.
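The address example above can be sketched in a few lines of code; the data structures and customer values are hypothetical, but they show the contrast between an overwriting transaction store and an append-only, time-stamped warehouse history:

```python
from datetime import date

# OLTP-style store: one current row per customer (overwritten on update)
oltp_customer = {}

# Warehouse-style store: append-only, time-stamped history per customer
dw_customer_history = {}

def update_address(cust_id, address, as_of):
    oltp_customer[cust_id] = address                     # overwrite in place
    dw_customer_history.setdefault(cust_id, []).append(  # append, never alter
        (as_of, address))

update_address(42, "12 Oak St", date(2020, 1, 1))
update_address(42, "7 Elm Ave", date(2023, 6, 15))

print(oltp_customer[42])        # only the most recent address survives
print(dw_customer_history[42])  # every address, each tagged with a date
```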
Non-volatile:
Once data is in the data warehouse, it will not
change. So, historical data in a data warehouse
should never be altered.
Ralph Kimball provided a more concise definition of a data
warehouse:

 "A data warehouse is a copy of transaction data
specifically structured for query and analysis."
 This is a functional view of a data warehouse.
Kimball did not address how the data warehouse is
built, as Inmon did; rather, he focused on the
functionality of a data warehouse.
 A data warehouse (DW) is a database used for
reporting and analysis. The data stored in the
warehouse is uploaded from the operational
systems. The data may pass through an
operational data store for additional operations
before it is used in the DW for reporting.
A data warehouse maintains its
functions in three layers:
Layer 1: Staging – used to store raw data for
use by developers.
Layer 2: Integration – used to integrate data
and to provide a level of abstraction from users.
Layer 3: Access – used to access data.
One thing to mention about data warehouses is
that they can be subdivided into data marts.
Types of Data Warehouse

Three main types of data warehouses are:

1. Enterprise Data Warehouse:
 An Enterprise Data Warehouse is a centralized warehouse. It provides decision
support service across the enterprise. It offers a unified approach for
organizing and representing data. It also provides the ability to classify data
according to subject and to give access according to those divisions.
2. Operational Data Store:
 An Operational Data Store, also called an ODS, is a data store required when
neither a data warehouse nor OLTP systems support an organization's
reporting needs. In an ODS, the data warehouse is refreshed in real time.
Hence, it is widely preferred for routine activities like storing records of
the employees.

3. Data Mart:

A data mart stores subsets of data from a warehouse, focusing
on a specific aspect of a company, such as sales or a marketing
process.
This definition of the data warehouse focuses on data storage.
The main source of the data is cleaned, transformed, catalogued,
and made available for use by managers and other business
professionals for data mining, online analytical processing,
market research, and decision support.
Characteristics of Data Warehouse:

 i. Subject-oriented:
 The warehouse organizes data around the essential subjects of
the business (customers and products) rather than around
applications such as inventory management or order
processing.
 ii. Integrated:
 It is consistent in the way that data from several sources is
extracted and transformed. For example, coding conventions
are standardized: M = male, F = female.
 iii. Time-variant:
 Data are organized by various time periods (e.g., months).
 iv. Non-volatile:
 The warehouse's database is not updated in real time.
There is periodic bulk uploading of transactional and other
data. This makes the data less subject to momentary change.
There are a number of steps and processes in building a
warehouse.
 What Is a Data Warehouse Used For?
 Here are the most common sectors where a data warehouse is used:
 Airline:
 In the airline industry, it is used for operational purposes like crew
assignment, analysis of route profitability, frequent flyer
program promotions, etc.
 Banking:
 It is widely used in the banking sector to manage the resources
available on desk effectively. A few banks also use it for
market research and for performance analysis of products and
operations.
 Healthcare:
 The healthcare sector also uses a data warehouse to strategize and predict outcomes, generate patients'
treatment reports, share data with tie-in insurance companies, medical aid services, etc.
 Public sector:
 In the public sector, a data warehouse is used for intelligence gathering. It helps government
agencies maintain and analyze tax records and health policy records for every individual.
 Investment and insurance sector:
 In this sector, the warehouses are primarily used to analyze data patterns and customer trends, and to
track market movements.
 Retail chain:
 In retail chains, a data warehouse is widely used for distribution and marketing. It also helps to
track items, customer buying patterns, and promotions, and is used for determining pricing policy.
 Telecommunication:
 A data warehouse is used in this sector for product promotions, sales decisions, and
distribution decisions.
 Hospitality industry:
 This industry utilizes warehouse services to design and estimate advertising and
promotion campaigns, targeting clients based on their feedback and travel
patterns.
 Data Warehouse Tools
 Many data warehousing tools are available in the market. Here are some of the most
prominent ones:
 1. MarkLogic:
 MarkLogic is a useful data warehousing solution that makes data integration easier and faster using
an array of enterprise features. This tool helps to perform very complex search operations. It can
query different types of data, such as documents, relationships, and metadata.
 http://developer.marklogic.com/products
 2. Oracle:
 Oracle is the industry-leading database. It offers a wide range of data warehouse solutions,
both on-premises and in the cloud. It helps optimize customer experiences by increasing
operational efficiency.
 https://www.oracle.com/index.html
 3. Amazon Redshift:
 Amazon Redshift is a data warehouse tool. It is a simple and cost-effective tool for analyzing all types of
data using standard SQL and existing BI tools. It also allows running complex queries against
petabytes of structured data, using query optimization techniques.
 https://aws.amazon.com/redshift/

Benefits of a Data Warehouse:
This architectural complexity provides the opportunity to:
a. Maintain data history, even if the source transaction systems do not.
b. Integrate data from multiple source systems, enabling a central view
across the enterprise. This benefit is always valuable, but particularly so
when the organization has grown by merger.
c. Improve data, by providing consistent codes and descriptions,
flagging or even fixing bad data.
d. Present the organization’s information consistently.
e. Provide a single common data model for all data of interest
regardless of the data’s source.
f. Restructure the data so that it makes sense to the business users.
g. Restructure the data so that it delivers excellent query performance,
even for complex analytic queries, without impacting the operational
systems.
h. Add value to operational business applications, notably customer
relationship management (CRM) systems.
Dimensions of Data Warehouse:
 A dimension is a data element that categorizes each item in a data set into
non-overlapping regions. A data warehouse dimension provides the means
to “slice and dice” data in a data warehouse. Dimensions provide structured
labeling information to otherwise unordered numeric measures. For
example, “Customer”, “Date”, and “Product” are all dimensions that could
be applied meaningfully to a sales receipt. A dimensional data element is
similar to a categorical variable in statistics.
 The primary function of dimensions is threefold: to provide filtering,
grouping and labeling. For example, in a data warehouse where each person
is categorized as having a gender of male, female or unknown, a user of the
data warehouse would then be able to filter or categorize each presentation
or report by either filtering based on the gender dimension or displaying
results broken out by the gender.
 Each dimension in a data warehouse may have one or more hierarchies
applied to it. For the “Date” dimension, there are several possible
hierarchies: “Day > Month > Year”, “Day > Week > Year”, “Day > Month >
Quarter > Year”, etc.
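The threefold role of dimensions described above (filtering, grouping, labeling) can be illustrated with a small fact set; the field names and figures below are invented purely for the sketch:

```python
# Hypothetical mini fact set: each row carries dimension values
# (gender, product) and a numeric measure (amount).
sales = [
    {"gender": "M", "product": "Book", "amount": 10.0},
    {"gender": "F", "product": "Book", "amount": 15.0},
    {"gender": "F", "product": "Pen",  "amount": 3.0},
]

# Filtering: keep only rows with a given value of the gender dimension
female_sales = [r for r in sales if r["gender"] == "F"]

# Grouping and labeling: break the measure out by the gender dimension
totals_by_gender = {}
for r in sales:
    totals_by_gender[r["gender"]] = (
        totals_by_gender.get(r["gender"], 0.0) + r["amount"])

print(totals_by_gender)
```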
Business Intelligence
 Business Intelligence is a term commonly associated with
data warehousing.
 In fact, many of the tool vendors position their products
as business intelligence software rather than data
warehousing software.
 There are other occasions where the two terms are used
interchangeably.
 Business intelligence usually refers to the information that is
available for the enterprise to make decisions on.
 A data warehousing (or data mart) system is the backend, or the
infrastructural, component for achieving business intelligence.
 Business intelligence also includes the insight gained from doing
data mining analysis, as well as unstructured data (thus the
need for content management systems).
DATA WAREHOUSING & DATA MINING
Components of Data Warehouse:
 The data warehouse is based on an RDBMS server, which is a central
information repository surrounded by some key components that
make the entire environment functional, manageable, and accessible.
 There are five main components of a data warehouse:
1. Data Warehouse Database
2. Sourcing, Acquisition, Clean-up and Transformation Tools
(ETL)
3. Metadata
4. Query Tools
5. Data Marts
 Three parts of the data
warehouse
 The data warehouse that contains the data and
associated software
 Data acquisition (back-end) software that
extracts data from legacy systems and external
sources, consolidates and summarizes them, and
loads them into the data warehouse
 Client (front-end) software that allows users to
access and analyze data from the warehouse
Data Warehouse Architectures

There are mainly three types of data warehouse architectures:
Single-tier architecture
 The objective of a single layer is to minimize the
amount of data stored.
 The goal is to remove data redundancy.
 This architecture is not frequently used in practice.
Data Warehouse Architectures

Two-tier architecture
 Two-layer architecture physically separates the
available sources and the data warehouse.
 This architecture is not expandable and does not
support a large number of end users.
 It also has connectivity problems because of
network limitations.
Architecture of a two-tier data warehouse
3-Tier Data Warehouse Architecture
(Figure: architecture of a three-tier data warehouse.)
Data warehouses adopt a three-tier architecture.

These three tiers are:

Bottom Tier
Middle Tier
Top Tier
Data Warehouse Architectures
Three-tier architecture
 This is the most widely used architecture. It consists of the Top, Middle, and Bottom
Tiers.
Bottom Tier: The database of the data warehouse serves as the bottom tier. It is usually
a relational database system. Data is cleansed, transformed, and loaded into this layer
using back-end tools.
Middle Tier: The middle tier in a data warehouse is an OLAP server, implemented
using either the ROLAP or MOLAP model. For a user, this application tier
presents an abstracted view of the database. This layer also acts as a mediator between
the end user and the database.
In the middle tier, we have the OLAP server, which can be implemented in either of the
following ways:
By Relational OLAP (ROLAP), which is an extended relational database management
system. ROLAP maps operations on multidimensional data to standard
relational operations.
By the Multidimensional OLAP (MOLAP) model, which directly implements
multidimensional data and operations.
Top Tier: The top tier is a front-end client layer. It holds the tools and APIs that you
use to connect and get data out of the data warehouse. These could be query tools,
reporting tools, managed query tools, analysis tools, and data mining tools.
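How a ROLAP middle tier maps a multidimensional request onto standard relational operations can be sketched with an in-memory SQLite database; the star schema, table names, and figures below are hypothetical:

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to a product dimension.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales  (product_id INTEGER, amount REAL);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Pens');
INSERT INTO fact_sales  VALUES (1, 10.0), (1, 5.0), (2, 3.0);
""")

# A "roll-up by category" in multidimensional terms becomes a plain
# join + GROUP BY in relational terms -- the essence of the ROLAP model.
rows = con.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales f JOIN dim_product d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
print(rows)
```

A MOLAP server would answer the same question from a precomputed array-based cube instead of generating SQL.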
Data Sources:
All the data related to a business organization is stored in
operational databases, external files, and flat files.
These sources are application-oriented.
Eg: the complete data of an organization, such as training details,
customer details, sales, departments, transactions, employee details,
etc.
Data here is present in different formats or in host format.
These sources contain data that is not well documented.
Bottom Tier:
Data warehouse server
The data warehouse server fetches only relevant
information based on a data mining (mining
knowledge from a large amount of data) request.

Eg: customer profile information provided by external
consultants.

Data is fed into the bottom tier by back-end tools
and utilities.
 Bottom tier
 The bottom tier is a warehouse database server that is almost always a relational
database system.
 Back-end tools and utilities are used to feed data into the bottom tier from operational
databases or other external sources.
 These tools and utilities perform data extraction, cleaning, and transformation, as
well as load and refresh functions to update the data warehouse.
 The data are extracted using application program interfaces known as gateways.
 A gateway is supported by the underlying DBMS and allows client programs to
generate SQL code to be executed at a server.
 Examples of gateways include ODBC (Open Database Connectivity) and OLE DB
(Object Linking and Embedding, Database) by Microsoft, and JDBC (Java Database
Connectivity).
 This tier also contains a metadata repository, which stores information about the data
warehouse and its contents.
Backend Tools & Utilities:
Functions performed by backend tools and
utilities are:

Data Extraction
Data Cleaning
Data Transformation
Load
Refresh
Bottom Tier Contains:
 Data warehouse
 Metadata Repository
 Data Marts
 Monitoring and
Administration
Metadata repository:

It describes what is available in the data warehouse.

It contains:
Structure of the data warehouse
Data names and definitions
Source of extracted data
Algorithms used for data cleaning
Sequence of transformations applied to data
Data related to system performance
Data Marts:

 A subset of the data warehouse, containing only small slices
of the warehouse data.
Eg: data pertaining to a single department.

 Two types of data marts:
 Dependent: sourced directly from the data warehouse
 Independent: sourced from one or more data sources
Monitoring & Administration:

 Data Refreshment
 Data source synchronization
 Disaster recovery
 Managing access control and security
 Manage data growth, database performance
 Controlling the number & range of queries
 Limiting the size of data warehouse
Bottom Tier: Data Warehouse Server
(Figure: monitoring and administration components sit over the
metadata repository, the data warehouse, and the data marts,
which are fed from data sources A, B, and C.)
Middle Tier: OLAP Server
It presents users with multidimensional data from the
data warehouse or data marts.
 Typically implemented using two models:
 ROLAP model: presents data in relational tables
 MOLAP model: presents data in array-based structures that
map directly to the data cube's array structure
Middle tier

 The middle tier is an OLAP server that is typically implemented
using either:
 A relational OLAP (ROLAP) model, that is, an extended
relational DBMS that maps operations on multidimensional data
to standard relational operations, or
 A multidimensional OLAP (MOLAP) model, that is, a
special-purpose server that directly implements
multidimensional data and operations.
OLAP Operations

Since OLAP servers are based on a multidimensional view of data,
they support OLAP operations on multidimensional data. The list of
OLAP operations:
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)
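Two of these operations can be roughly illustrated over a tiny "cube" held as a plain dictionary; the dimension values and measures below are invented for the sketch:

```python
# Hypothetical cube: (year, quarter, city) -> sales measure
cube = {
    (2023, "Q1", "Pune"):   100,
    (2023, "Q2", "Pune"):   120,
    (2023, "Q1", "Mumbai"):  80,
}

# Roll-up: climb the Date hierarchy by aggregating quarters into years
by_year = {}
for (year, quarter, city), v in cube.items():
    by_year[year] = by_year.get(year, 0) + v

# Slice: fix one dimension value (city = 'Pune') to cut out a sub-cube
pune_slice = {k: v for k, v in cube.items() if k[2] == "Pune"}

print(by_year)
print(pune_slice)
```

Drill-down is the reverse of the roll-up shown here, and pivot merely reorders the dimension axes for presentation.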
Four types of OLAP servers:

Relational OLAP
ROLAP servers are placed between the relational back-end server
and client front-end tools.
To store and manage warehouse data, ROLAP uses a
relational or extended-relational DBMS.
ROLAP includes the following:
 Implementation of aggregation navigation logic
 Optimization for each DBMS back end
 Additional tools and services
Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines for multidimensional views of
data.
With multidimensional data stores, storage utilization may be low if the data set is sparse.
Many MOLAP servers therefore use two levels of data storage representation to handle
dense and sparse data sets.
Hybrid OLAP (HOLAP)
Hybrid OLAP is a combination of both ROLAP and MOLAP.
It offers the higher scalability of ROLAP and the faster computation of MOLAP.
HOLAP servers allow storing large volumes of detailed information;
the aggregations are stored separately in a MOLAP store.
Specialized SQL Servers
Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
Top Tier: Front-end tools
The top tier is a front-end client layer, which contains query and reporting
tools, analysis tools, and/or data mining tools.
Query and reporting tools

Reporting tools: production reporting tools and report writers
Managed query tools: point-and-click creation of SQL, used, for
example, for a customer mailing list
Analysis tools: prepare charts based on analysis
Data mining tools: mine knowledge, discover hidden pieces of
information, new correlations, and useful patterns
OLAP vs OLTP
Sr. No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | Involves historical processing of information. | Involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | Useful in analyzing the business. | Useful in running the business.
4 | It focuses on information out. | It focuses on data in.
5 | Based on Star Schema, Snowflake Schema, and Fact Constellation Schema. | Based on the Entity-Relationship Model.
6 | Contains historical data. | Contains current data.
7 | Provides summarized and consolidated data. | Provides primitive and highly detailed data.
8 | Provides a summarized and multidimensional view of data. | Provides a detailed and flat relational view of data.
9 | Number of users is in hundreds. | Number of users is in thousands.
10 | Number of records accessed is in millions. | Number of records accessed is in tens.
11 | Database size is from 100 GB to 1 TB. | Database size is from 100 MB to 1 GB.
12 | Highly flexible. | Provides high performance.
From the architecture point of view, there are three data warehouse models:
Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems or
external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data.
It can range in size from a few gigabytes to hundreds of gigabytes, terabytes, or beyond.
Data mart
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users.
Virtual warehouse
A virtual warehouse is a set of views over operational databases.
For efficient query processing, only some of the possible summary views may be
materialized.
Architecture of web-based data warehousing
Alternative Data Warehouse Architectures:
• EDW Architecture
Centralized enterprise data warehouse – similar to hub-and-spoke, but
there are no dependent data marts; instead, a gigantic enterprise DW
serves all the needs of all organizational units. It provides users access
to all data in the DW instead of limiting them to just the data marts.
Alternative Data Warehouse Architectures:

• Data Mart Architecture

Independent data marts – arguably the simplest and least costly
architecture alternative; data marts are developed to operate
independently of each other to serve the needs of individual
organizational units.
Data mart bus – individual marts are linked to each other via some
kind of middleware.
Alternative Data Warehouse Architectures:
• Hub-and-Spoke Data Mart Architecture
Hub-and-spoke – attention is focused on building a scalable and
maintainable infrastructure; this allows for easy customization of
user interfaces and reports.
Alternative Data Warehouse Architectures:
• EDW and ODS (real-time access support)
Alternative Data Warehouse Architectures:
• Distributed Data Warehouse Architecture
Federated – a concession to the natural forces that undermine the
best plans for developing a perfect system. It uses all possible
means to integrate analytical resources from multiple sources to
meet changing needs or business conditions. Essentially, the
federated approach involves integrating disparate systems. It is
good for supplementing data warehouses, but not for replacing
them.
Alternative Architectures for Data Warehouse Efforts

Teradata Corp.'s EDW
 The alternative data warehouse architectures and their basic descriptions:
 Independent data marts – arguably the simplest and least costly
architecture alternative; data marts are developed to operate
independently of each other to serve the needs of individual
organizational units.
 Data mart bus – individual marts are linked to each other via some kind
of middleware.
 Hub-and-spoke – attention is focused on building a scalable and
maintainable infrastructure; this allows for easy customization of user
interfaces and reports.
 Centralized enterprise data warehouse – similar to hub-and-spoke,
but there are no dependent data marts; instead, a gigantic enterprise
DW serves all the needs of all organizational units. It provides users
access to all data in the DW instead of limiting them to just the data marts.
Ten factors that potentially affect the
architecture selection decision:

1. Information interdependence between organizational units
2. Upper management's information needs
3. Urgency of need for a data warehouse
4. Nature of end-user tasks
5. Constraints on resources
6. Strategic view of the data warehouse prior to implementation
7. Compatibility with existing systems
8. Perceived ability of the in-house IT staff
9. Technical issues
10. Social/political factors
Data Warehousing & Application Integration
Data integration

Data integration involves combining data from several disparate
sources, which are stored using various technologies, to provide a
unified view of the data.
Data integration comprises three major processes that, when
correctly implemented, permit data to be accessed and made
accessible to an array of ETL and analysis tools and the data
warehousing environment: data access, data federation (integration
of business views across multiple data stores), and change capture.
Various integration technologies that enable data
and metadata integration are:

1. Enterprise Application Integration (EAI)
2. Service-Oriented Architecture (SOA)
3. Enterprise Information Integration (EII)
4. Extraction, Transformation, and Load (ETL)
Enterprise Application Integration

 Enterprise Application Integration involves integrating
application functionality and is focused on sharing
functionality (rather than data) across systems, thereby
enabling flexibility and reuse.
 EAI is accomplished by using SOA coarse-grained
services.
Enterprise Information Integration

 Enterprise Information Integration (EII) is software that
combines enterprise data and information into a unique
data monitoring interface where data is expressed via a
uniform representation.
 EII is an evolving tool space that promises real-time data
integration from a variety of sources, such as relational
databases, web services, and multidimensional databases.
 EII tools use predefined metadata that makes integrated
data appear relational to end users.
 XML is the most important aspect of EII because it allows
data to be tagged at creation time or later.
ETL – Introduction

 ETL stands for Extract, Transform and Load.
 An ETL tool extracts data from different
RDBMS source systems, transforms the data
by applying calculations, concatenations, etc.,
and then loads the data into the data warehouse
system.
 The data is loaded into the DW system in the
form of dimension and fact tables.
What are ETL Tools?

 ETL is the process of transferring data from a
source database to the destination data warehouse.
 The process has three sub-processes: E for
Extract, T for Transform, and L for Load.
 The data is extracted from the source database
in the extraction process, then transformed into
the required format, and then loaded into the
destination data warehouse.
 The tools that perform all these functions are
called ETL tools.
The lists of the ETL tools are given below:
 • Pervasive Data Integrator
 • Pentaho Data Integration
 • Open Text Integration Center
 • Relational Junction ETL Manager (Sesame Software)
 • CloverETL
 • PowerCenter Informatica
 • Talend Studio for Data Integration
 • Oracle Warehouse Builder (OWB)
 • Oracle Data Integrator (ODI)
 • Data Migrator (IBI)
 • Cognos Data Manager
 • IBM Infosphere Information Server
 • IBM Infosphere Warehouse Edition
 • Adeptia Integration Server
 • SAS Data Management
 • Centerprise Data Integrator
 • Syncsort DMX
 • Sagent Data Flow
 • QlikView Expressor
 • SAP Data Services
 • Elixir Repertoire for Data ETL
 • SQL Server Integration Services (SSIS)
Selection of the ETL Tools for your Data
Warehouse Tasks and its Importance

To experience successful ETL functioning, proper
selection of the ETL tools is essential.
The proper transfer of data between databases is
possible only when we have proper tools for the
process.
If the procedure is done with less appropriate
tools, there will be problems in the
functioning of the complete transfer process.
For proper functioning, you will have to take care
of the sub-processes and each tool used in them.
Hence, care has to be taken in the selection of the
ETL tools.
ETL process
ETL (Extract, Transform and Load) is a process in data warehousing
responsible for pulling data out of the source systems and placing it into a data
warehouse. ETL involves the following tasks:
- extracting the data from source systems (SAP, ERP, other operational
systems); data from different source systems is converted into one consolidated
data warehouse format which is ready for transformation processing.
- transforming the data, which may involve the following tasks:  applying business
rules (so-called derivations, e.g., calculating new measures and dimensions), 
cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F"), 
filtering (e.g., selecting only certain columns to load),  splitting a column into
multiple columns and vice versa,  joining together data from multiple sources
(e.g., lookup, merge),  transposing rows and columns,  applying any kind of
simple or complex data validation (e.g., if the first 3 columns in a row are empty
then reject the row from processing).
- loading the data into a data warehouse, data repository, or other reporting
applications.
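The extract, transform and load tasks above can be sketched as three small functions. This is an illustrative outline in Python, not any product's API; the field names, the filtering rule, and the 10% tax derivation are assumptions made for the example.

```python
# Minimal ETL sketch: extract raw rows from a source, apply the
# transformation rules described above (filtering, cleaning, derivation),
# and load the result into a target store.

def extract(source_rows):
    """Extract: pull raw records from the source system as-is."""
    return list(source_rows)

def transform(rows):
    """Transform: apply business rules, cleaning, and filtering."""
    out = []
    for row in rows:
        # Filtering: reject rows with no customer id
        if not row.get("customer_id"):
            continue
        # Cleaning: map NULL amounts to 0
        amount = row.get("amount") or 0
        out.append({
            "customer_id": row["customer_id"],
            # Cleaning: unify gender codes ("U" for unknown)
            "gender": {"Male": "M", "Female": "F"}.get(row.get("gender"), "U"),
            "amount": amount,
            # Derivation: a new calculated measure (assumed 10% tax rule)
            "amount_with_tax": round(amount * 1.10, 2),
        })
    return out

def load(rows, warehouse):
    """Load: append the transformed rows to the warehouse table."""
    warehouse.extend(rows)
    return len(rows)

warehouse = []
source = [
    {"customer_id": "C1", "gender": "Male", "amount": 100.0},
    {"customer_id": None, "gender": "Female", "amount": 50.0},  # rejected by filter
    {"customer_id": "C2", "gender": "Female", "amount": None},  # NULL mapped to 0
]
loaded = load(transform(extract(source)), warehouse)
print(loaded, warehouse)
```

Each stage only consumes the previous stage's output, which is the property that later lets the stages be separated by a staging area.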
Extract
 The main objective of the extract step is to retrieve all the required data from the
source system with as little resources as possible. The extract step should be designed
in a way that it does not negatively affect the source system in terms of performance,
response time or any kind of locking.
There are several ways to perform the extract:
 Update notification - if the source system is able to provide a notification that a
record has been changed and describe the change, this is the easiest way to get the
data.
 Incremental extract - some systems may not be able to provide notification that an
update has occurred, but they are able to identify which records have been modified
and provide an extract of such records. During further ETL steps, the system needs to
identify these changes and propagate them down. Note that, by using a daily extract, we may
not be able to handle deleted records properly.
 Full extract - some systems are not able to identify which data has been changed at all,
so a full extract is the only way one can get the data out of the system. The full extract
requires keeping a copy of the last extract in the same format in order to be able to
identify changes. Full extract handles deletions as well.
 When using incremental or full extracts, the extract frequency is extremely
important, particularly for full extracts, where the data volumes can be in tens of gigabytes.
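With a full extract, change detection amounts to diffing the current extract against the saved copy of the previous one, which is also what lets it handle deletions. A minimal sketch, assuming each record carries a primary key field `id` (a hypothetical name):

```python
# Sketch of change detection for full extracts: compare the current
# full extract against the kept copy of the last extract to find
# inserts, updates, and deletes.

def diff_full_extracts(previous, current):
    prev = {r["id"]: r for r in previous}
    curr = {r["id"]: r for r in current}
    inserted = [curr[k] for k in curr.keys() - prev.keys()]
    deleted = [prev[k] for k in prev.keys() - curr.keys()]   # full extract catches deletions
    updated = [curr[k] for k in curr.keys() & prev.keys() if curr[k] != prev[k]]
    return inserted, updated, deleted

old = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
new = [{"id": 2, "name": "Robert"}, {"id": 3, "name": "Cara"}]
ins, upd, dele = diff_full_extracts(old, new)
print(ins, upd, dele)
```

An incremental extract would skip the diff entirely, since the source already reports which records changed, but (as noted above) it may miss deletions.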
Extraction: Need for a Staging Area
 A staging area is required during ETL load. There are various reasons why
staging area is required.
 The source systems are only available for a specific period of time to extract data.
This period of time is less than the total data-load time. Therefore, the staging area
allows you to extract the data from the source system and keep it in the staging
area before the time slot ends.
 Staging area is required when you want to get the data from multiple data sources
together or if you want to join two or more systems together. For example, you will
not be able to perform a SQL query joining two tables from two physically different
databases.
 Data extraction time slots for different systems vary as per the time zone and
operational hours.
 Data extracted from source systems can be used in multiple data warehouse
systems, Operational Data Stores, etc.
 ETL allows you to perform complex transformations and requires extra area to
store the data.
Clean
 The cleaning step is one of the most important as it
ensures the quality of the data in the data warehouse.
Cleaning should perform basic data unification rules, such as:
 Making identifiers unique (sex categories
Male/Female/Unknown, M/F/null, Man/Woman/Not
Available are translated to standard
Male/Female/Unknown)
 Convert null values into standardized Not Available/Not
Provided value
 Convert phone numbers, ZIP codes to a standardized form
 Validate address fields, convert them into proper naming,
e.g. Street/St/St./Str./Str
 Validate address fields against each other (State/Country,
City/State, City/ZIP code, City/Street).
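The unification rules listed above can be sketched as small normalisation functions. This is an illustrative sketch; the mapping tables and the chosen standard forms are assumptions, not taken from any particular tool.

```python
import re

# Illustrative data-cleaning rules: unify identifier values, standardise
# nulls, normalise phone numbers, and unify street-suffix spellings.

GENDER_MAP = {"male": "Male", "m": "Male", "man": "Male",
              "female": "Female", "f": "Female", "woman": "Female"}

def clean_gender(value):
    """Translate any known spelling to standard Male/Female/Unknown."""
    if value is None:
        return "Unknown"
    return GENDER_MAP.get(str(value).strip().lower(), "Unknown")

def clean_phone(value):
    """Normalise a phone number to digits only; nulls become a standard value."""
    digits = re.sub(r"\D", "", value or "")
    return digits or "Not Available"

def clean_street(value):
    """Unify trailing St / St. / Str / Str. to the standard 'Street'."""
    return re.sub(r"\b(St\.?|Str\.?)$", "Street", (value or "").strip())

print(clean_gender("WOMAN"), clean_phone("(555) 123-4567"), clean_street("Main St."))
```

Cross-field validation (City/State, City/ZIP) would follow the same pattern but check fields against each other rather than normalising a single value.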
Transform
 The transform step applies a set of rules to transform the data from the
source to the target. This includes converting any measured data to the
same dimension (i.e. conformed dimension) using the same units so
that they can later be joined. The transformation step also requires
joining data from several sources, generating aggregates, generating
surrogate keys, sorting, deriving new calculated values, and applying
advanced validation rules.
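Two of the transform tasks named above, surrogate key generation and converting measures to a conformed unit, can be sketched as follows. The natural keys, the weight measure, and the convert-everything-to-kilograms rule are invented for illustration.

```python
import itertools

# Surrogate keys: warehouse-generated integer keys assigned to natural
# (source-system) keys, stable across repeated loads within this run.
_key_seq = itertools.count(1)
_key_cache = {}

def surrogate_key(natural_key):
    if natural_key not in _key_cache:
        _key_cache[natural_key] = next(_key_seq)
    return _key_cache[natural_key]

# Conformed dimension/measure: convert every weight to kilograms so
# facts from different sources can later be joined and compared.
UNIT_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.453592}

def conform_weight(value, unit):
    return round(value * UNIT_TO_KG[unit], 3)

rows = [("CUST-9", 2000, "g"), ("CUST-7", 1, "kg"), ("CUST-9", 2, "lb")]
transformed = [(surrogate_key(nk), conform_weight(v, u)) for nk, v, u in rows]
print(transformed)
```

Note that the same natural key ("CUST-9") receives the same surrogate key on every occurrence, which is what makes later joins against the dimension table possible.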
 Load
 During the load step, it is necessary to ensure that the load is
performed correctly and with as little resources as possible. The target
of the Load process is often a database. In order to make the load
process efficient, it is helpful to disable any constraints and indexes
before the load and enable them back only after the load completes.
The referential integrity needs to be maintained by the ETL tool to ensure
consistency.
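The load-step advice above (disable indexes before the load, load in one transaction, re-enable afterwards) can be sketched with Python's built-in sqlite3 engine. The table and index names are made up; a real warehouse would use its own bulk-load utilities, but the pattern is the same.

```python
import sqlite3

# Efficient load sketch: drop the index before the bulk insert, perform
# the insert inside a single transaction, then recreate the index.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact_sales (customer_id TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_customer ON fact_sales (customer_id)")

rows = [("C%d" % i, float(i)) for i in range(1000)]

conn.execute("DROP INDEX idx_customer")      # disable the index before loading
with conn:                                    # one transaction for the whole bulk insert
    conn.executemany("INSERT INTO fact_sales VALUES (?, ?)", rows)
conn.execute("CREATE INDEX idx_customer ON fact_sales (customer_id)")  # rebuild once

count = conn.execute("SELECT COUNT(*) FROM fact_sales").fetchone()[0]
print(count)
```

Rebuilding the index once after the load is typically far cheaper than updating it row by row during the insert, which is the point of the advice above.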
Managing ETL Process
 The ETL process seems quite straightforward. As with every application, there is a
possibility that the ETL process fails. This can be caused by missing extracts from one of
the systems, missing values in one of the reference tables, or simply a connection or
power outage. Therefore, it is necessary to design the ETL process keeping fail-recovery
in mind.
Staging
 It should be possible to restart, at least, some of the phases independently from the
others and this can be ensured by implementing proper staging.
 Staging means that the data is simply dumped to a location (called the Staging Area) so
that it can then be read by the next processing phase.
 The staging area is also used during the ETL process to store intermediate results of
processing. However, the staging area should be accessed by the load ETL process
only.
 It should never be available to anyone else, particularly not to end users, as it is not
intended for data presentation and may contain incomplete or mid-processing data.
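File-based staging is one simple way to make the phases independently restartable: each phase reads only from the staging area and writes its result back there, so a failed phase can be rerun without touching the source systems. A minimal sketch, with invented file names:

```python
import json
import os
import tempfile

# Sketch of a file-based staging area: each ETL phase dumps its output
# to staging so the next phase can be restarted independently.

staging_dir = tempfile.mkdtemp(prefix="staging_")

def stage_write(name, rows):
    with open(os.path.join(staging_dir, name), "w") as f:
        json.dump(rows, f)

def stage_read(name):
    with open(os.path.join(staging_dir, name)) as f:
        return json.load(f)

# Phase 1: extract dumps raw rows into the staging area
stage_write("extracted.json", [{"id": 1, "amount": "10"}, {"id": 2, "amount": "20"}])

# Phase 2 (restartable): transform reads only from staging, never the source
transformed = [{"id": r["id"], "amount": float(r["amount"])}
               for r in stage_read("extracted.json")]
stage_write("transformed.json", transformed)

# Phase 3 (restartable): load reads the transformed staging file
warehouse = stage_read("transformed.json")
print(len(warehouse))
```

If the load fails, only phase 3 needs to be rerun from `transformed.json`; the extract does not have to be repeated against the source system.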
ETL Tool Implementation
When you are about to use an ETL tool, there is a fundamental decision to
be made:
Will the company build its own data transformation tool or will it use an
existing tool?
 Building your own data transformation tool (usually a set of shell scripts) is
the preferred approach for a small number of data sources which reside in
storage of the same type. The reason is that the effort to implement the
necessary transformations is small, due to the similar data structures and common
system architecture.
 There are many ready-to-use ETL tools on the market. The main benefit of
using off-the-shelf ETL tools is the fact that they are optimized for the ETL
process by providing connectors to common data sources like databases, flat
files, mainframe systems, XML, etc. They provide a means to implement data
transformations easily and consistently across various data sources.
 The tools also support transformation scheduling, version control,
monitoring and unified metadata management. Some of the ETL tools are
even integrated with BI tools.
 The most well-known commercial tools are:
 Ab Initio
 IBM InfoSphere DataStage
 Informatica
 Oracle Data Integrator
 SAP Data Integrator
There are several open source ETL tools, among others
Apatar, CloverETL, Pentaho and Talend.
Industry News
 Critical Factors for Cloud Deployments: Agility, Flexibility and
Scalability are the Reasons for Choosing Teradata

Some of the Well-Known BI Tools
 Oracle Enterprise BI Server
 SAP Business Objects Enterprise
 SAP NetWeaver BI
 SAS Enterprise BI Server
 Microsoft BI platform
 IBM Cognos Series 8
 Board Management Intelligence Toolkit
 BizzScore Suite
 WebFocus
 QlikView
 Microstrategy
 Oracle Hyperion System
 Actuate
Popular Business Intelligence Tools
Data Warehouse Architecture
 Different data warehousing systems have
different structures. Some may have an
ODS (operational data store), while some
may have multiple data marts. Some may
have a small number of data sources, while
some may have dozens of data sources.
 In view of this, it is far more reasonable to
present the different layers of a data
warehouse architecture rather than
discussing the specifics of any one system.
Data Warehouse Architecture
In general, all data warehouse systems have the following
layers:
Data Source Layer
Data Extraction Layer
Staging Area
ETL Layer
Data Storage Layer
Data Logic Layer
Data Presentation Layer
Metadata Layer
System Operations Layer
Data Source Layer
This represents the different data sources that feed data into the
data warehouse. The data source can be of any format -- plain text
file, relational database, other types of database, Excel file, etc.,
can all act as a data source.
Many different types of data can be a data source:
Operations -- such as sales data, HR data, product data,
inventory data, marketing data, systems data.
ERP (SAP) and CRM (SalesForce.com) Systems
Web server logs with user browsing data.
Internal market research data.
Third-party data, such as census data, demographics data,
or survey data.
All these data sources together form the Data Source Layer.
Data Extraction Layer
Data gets pulled from the data source into the
data warehouse system. There is likely some
minimal data cleansing, but unlikely to be
any major data transformation.
Staging Area
 This is where data sits prior to being
scrubbed and transformed into a data
warehouse / data mart. Having one common
area makes it easier for subsequent data
processing / integration.
ETL Layer
 This is where data gains its "intelligence", as
logic is applied to transform the data from a
transactional nature to an analytical nature.
 This layer is also where data cleansing
happens.
 The ETL design phase is often the most time-
consuming phase in a data warehousing
project, and an ETL tool is often used in this
layer.
 Popular ETL Tools
 IBM WebSphere Information Integration (Ascential
DataStage)
 Ab Initio
 Informatica
Data Storage Layer
 This is where the transformed and cleansed
data sit. Based on scope and functionality, 3
types of entities can be found here: data
warehouse, data mart, and operational
data store (ODS).
In any given system, you may have just one of
the three, two of the three, or all three types.
Data Logic Layer
 This is where business rules are stored.
 Business rules stored here do not affect the underlying data
transformation rules, but do affect what the report looks like.
Data Presentation Layer
 This refers to the information that reaches the users.
This can be in the form of a tabular / graphical report
in a browser, an emailed report that gets
automatically generated and sent every day, or an
alert that warns users of exceptions, among others.
 Usually an OLAP tool and/or a reporting tool is
used in this layer.
OLAP
 OLAP stands for On-Line Analytical Processing. The first
attempt to provide a definition of OLAP was by Dr. Codd, who
proposed 12 rules for OLAP.
 Online analytical processing is an approach to
answering multi-dimensional queries quickly.
OLAP tools, such as Cubes and the Oracle BI suite, help users
analyze multidimensional data effectively from multiple
perspectives.
 OLAP consists of three basic operations: roll-up, drill-down,
and slicing and dicing.
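The three basic operations can be illustrated on a tiny in-memory cube. This is a minimal sketch in plain Python; the fact table, dimension names, and sales figures are invented for the example.

```python
# A miniature cube: (region, year, product, sales) facts.
facts = [
    ("East", 2023, "Laptop", 100),
    ("East", 2023, "Phone",  50),
    ("East", 2024, "Laptop", 120),
    ("West", 2023, "Laptop", 80),
]

def roll_up(facts, dim_index):
    """Roll-up: aggregate away one dimension by summing sales over it."""
    out = {}
    for fact in facts:
        key = tuple(v for i, v in enumerate(fact[:-1]) if i != dim_index)
        out[key] = out.get(key, 0) + fact[-1]
    return out

def slice_(facts, dim_index, value):
    """Slice: fix one dimension to a single value."""
    return [f for f in facts if f[dim_index] == value]

by_region_year = roll_up(facts, 2)                      # drop the product dimension
east_2023 = slice_(slice_(facts, 0, "East"), 1, 2023)   # dicing = slicing twice
print(by_region_year[("East", 2023)], len(east_2023))
```

Drill-down is simply the inverse of roll-up: returning from the `by_region_year` totals to the detailed per-product facts they were aggregated from.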
Metadata Layer
 This is where information about the data stored in the data warehouse
system is stored.
 A logical data model would be an example of something that's in the
metadata layer.
 A metadata tool is often used to manage metadata (typically XML).
System Operations Layer
 This layer includes information on how the data warehouse system
operates, such as ETL job status, system performance, and user
access history.
Data Warehousing: Various Data Warehouse
Design Approaches: Top-Down and Bottom-Up
 Data warehouse design approaches are a very important
aspect of building a data warehouse.
 Selection of the right data warehouse design could save a lot of
time and project cost.
 There are two different Data Warehouse Design Approaches
normally followed when designing a Data Warehouse
solution and based on the requirements of your project you
can choose which one suits your particular scenario.
 These methodologies are a result of research from Bill
Inmon and Ralph Kimball.
 In a nutshell, here are the two approaches: in Bill Inmon’s
enterprise data warehouse approach (the top-down
design), a normalised data model is designed first, then
the dimensional data marts, which contain data required
for specific business processes or specific departments, are
created from the data warehouse. In Ralph Kimball’s
dimensional design approach (the bottom-up design), the
data marts facilitating reports and analysis are created
first; these are then combined together to create a broad
data warehouse.
Inmon’s top-down approach
 Inmon defines a data warehouse as a centralised
repository for the entire enterprise. A data
warehouse stores the “atomic” data at the lowest
level of detail. Dimensional data marts are
created only after the complete data warehouse
has been created. Thus, the data warehouse is at
the centre of the corporate information factory
(CIF), which provides a logical framework
for delivering business intelligence.
Kimball’s bottom-up approach 
 Keeping in mind the most important business aspects or
departments, data marts are created first. These provide a
thin view into the organisational data and, as and when
required, these can be combined into a larger data
warehouse. Kimball defines data warehouse as “a copy of
transaction data specifically structured for query and
analysis”.
 Kimball’s data warehousing architecture is also known as
data warehouse bus (BUS). Dimensional modelling
focuses on ease of end-user accessibility and provides a
high level of performance to the data warehouse.
Ralph Kimball –
Bottom-up Data Warehouse Design Approach
Pros and cons of both approaches 
Real-Time Data Warehousing
 Real-time (active) data warehousing:
the process of loading and
providing data via a data warehouse
as the data become available.
 Active Data Warehousing is the
technical ability to capture transactions
when they change, and integrate them
into the warehouse, along with
maintaining batch or scheduled cycle
refreshes.
 Real-time Data Warehousing describes
a system that reflects the state of
the warehouse in real time.
What is Real-Time Data Warehousing?
 The combination of real-time activity and data
warehousing is called real-time data warehousing. Activity
that happens at the current time is known as real-time
activity. Data is available on completion of the activity.
 Business activity data is captured in real-time data
warehousing as the data occurs. Soon after the business
activity completes, the data of the completed
activity flows into the data warehouse. This data is
available instantly. Real-time data warehousing can be
viewed / utilized as a framework for retrieving information
from data as soon as the data is available.
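A near-real-time loader can be approximated with a queue and a micro-batching consumer thread: transactions are captured as they occur and integrated into the warehouse within a short cycle instead of a nightly batch. This is a toy sketch; all names and the batch size are illustrative.

```python
import queue
import threading
import time

# Near-real-time loading sketch: source transactions are pushed onto a
# queue as they happen, and a loader thread integrates them into the
# warehouse in small micro-batches.

events = queue.Queue()
warehouse = []

def loader(stop):
    # Keep running until asked to stop AND the queue is drained.
    while not stop.is_set() or not events.empty():
        batch = []
        while not events.empty() and len(batch) < 100:  # micro-batch cap
            batch.append(events.get())
        if batch:
            warehouse.extend(batch)   # load as soon as data is available
        time.sleep(0.01)              # short cycle instead of a nightly batch

stop = threading.Event()
t = threading.Thread(target=loader, args=(stop,))
t.start()
for i in range(5):
    events.put({"txn": i})            # transactions captured as they occur
stop.set()
t.join()
print(len(warehouse))
```

Shrinking the cycle time moves the system from scheduled (batch) refreshes toward the active/real-time behaviour described above, at the cost of more load on both source and warehouse.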
Definition
 Active Data Warehousing is the technical ability to capture
transactions when they change, and integrate them into the
warehouse, along with maintaining batch or scheduled cycle
refreshes. An active data warehouse offers the possibility of
automating routine tasks and decisions. The active data warehouse
exports decisions automatically to the On-Line Transaction
Processing (OLTP) systems.
 Real-time Data Warehousing describes a system that reflects the
state of the warehouse in real time.
Levels of data warehouses:
1. Reports what happened
2. Some analysis occurs
3. Provides prediction capabilities
4. Operationalization
5. Becomes capable of making events happen
