Introduction To DWH - 29012014
Agenda
Data Warehouse
A data warehouse is a relational database that is designed for query and analysis rather than
for transaction processing. It usually contains historical data derived from transaction data, but
it can include data from other sources. It separates analysis workload from transaction
workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing
(OLAP) engine, Oracle Warehouse Builder, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.
Historical Database
Because a data warehouse maintains the historical business information required for
analysis, it is also called a historical database.
© 2010 Capgemini - All rights reserved
12/14/2024 12:11 PM 3
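ETL Process
[Diagram: Extract, Transformation, and Load move data from OLTP to the DWH.]
Subject Oriented
Integrated
Nonvolatile
Time Variant
[Diagram, Integrated: Student Register and Student Marks in OLTP are integrated in the DWH.]
[Diagram, Subject Oriented: Current Account, Saving Account, and Check in A/C in OLTP map to the subject Account in the DWH.]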
A DWH is an integrated database: it collects data from multiple OLTP
source systems, integrates the data into a homogeneous format, and delivers the
integrated data to a centralized database called the DWH.
Nonvolatile means that, once entered into the warehouse, data should not change. This is
logical because the purpose of a warehouse is to enable you to analyze what has occurred.
A data warehouse is a non-volatile database.
Once data is entered into the DWH, it does not reflect the changes that take place in the OLTP
database.
OLTP DB (current data only)
Emp no   E name   EBW
7396     rahul    TL     Current

DWH (loaded through ETL)
Emp no   E name   EBW
7396     rahul    SE     Historical
7396     rahul    SSE    Historical
7396     rahul    TL     Current
A data warehouse is a time-variant database that supports business management in analyzing the
business and comparing it across different time periods. This is known as time series analysis.
In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements demand
that historical data be moved to an archive. A data warehouse's focus on change over time is what is
meant by the term time variant.
Typically, data flows from one or more online transaction processing (OLTP) databases into a data
warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before
being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to
a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.
Time hierarchy: Year > Quarter > Month > Week > Date
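The levels of the time hierarchy can be derived from a single calendar date. The following is a minimal sketch, not a definitive implementation: the function name time_dimension_row and the use of ISO 8601 week numbers are assumptions made for illustration.

```python
from datetime import date


def time_dimension_row(d: date) -> dict:
    """Derive the Year, Quarter, Month, Week and Date levels for one date."""
    # Assumption: weeks are numbered according to ISO 8601.
    _, iso_week, _ = d.isocalendar()
    return {
        "date": d.isoformat(),
        "week": iso_week,
        "month": d.month,
        "quarter": (d.month - 1) // 3 + 1,  # months 1-3 -> Q1, 4-6 -> Q2, ...
        "year": d.year,
    }


row = time_dimension_row(date(2014, 1, 29))
print(row)
```

A real time dimension would normally hold one such row per calendar date, so that facts can be summarized at any level.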
OLTP                                        | OLAP
Designed to support business transactions   | Designed to support the decision-making process
Volatile data                               | Non-volatile data
Current data                                | Historical data
Supports normalization                      | Supports de-normalization
Designed for running the business           | Designed for analyzing the business
Designed for clerical access                | Designed for managerial access
Less history (3-6 months)                   | More history (1-30 yrs)
Detailed data                               | Summary data
Application-oriented data                   | Subject-oriented data
E-R modeling                                | Dimensional modeling
Extraction is the process of reading data from different OLTP source systems. The
following are the types of sources from which data is extracted.
OLTP SOURCES:
ERP Sources (SAP, PeopleSoft, Siebel)
Mainframes
Oracle Applications
XML Files
Flat Files
Cobol Files
Relational sources (Oracle, SQL Server, etc.)
Transformation is the process of generating and manipulating the data. The following are the
types of data transformation activities
1. Data Merging
i) Horizontal Merging
ii) Vertical Merging
2. Data Cleansing
3. Data Scrubbing
4. Data Aggregation
Data merging is the process of integrating data from multiple OLTP source systems. There are
two types of data merging
i) Horizontal Merging (Joins)
ii) Vertical Merging (Unions)
Join: A join is required when the sources have different data
definitions (metadata) but share a common column with the same data type.
Union: When two sources have the same data definition, they can
be combined vertically using a union.
Source 1
Eno   Ename    Sal      Job       Dno
142   Arlo     45,000   con       10
144   Arun     58,000   Sr.con    20
145   prasad   65,000   manager   10

Source 2
Eno   Ename    Sal      Job       Dno
146   kiran    45,000   con       10
148   Naren    58,000   Sr.con    20
140   Mahesh   65,000   manager   10

Union
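Both kinds of merging can be tried out with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the table names emp_src1, emp_src2 and dept, and the department names, are hypothetical; only the employee rows come from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp_src1 (eno INTEGER, ename TEXT, sal INTEGER, job TEXT, dno INTEGER);
    CREATE TABLE emp_src2 (eno INTEGER, ename TEXT, sal INTEGER, job TEXT, dno INTEGER);
    -- Hypothetical source with a different definition but a common column (dno)
    CREATE TABLE dept (dno INTEGER, dname TEXT);
    INSERT INTO emp_src1 VALUES
        (142, 'Arlo', 45000, 'con', 10),
        (144, 'Arun', 58000, 'Sr.con', 20),
        (145, 'prasad', 65000, 'manager', 10);
    INSERT INTO emp_src2 VALUES
        (146, 'kiran', 45000, 'con', 10),
        (148, 'Naren', 58000, 'Sr.con', 20),
        (140, 'Mahesh', 65000, 'manager', 10);
    INSERT INTO dept VALUES (10, 'Sales'), (20, 'Finance');
""")

# Vertical merging: same data definition, rows stacked with UNION ALL.
union_rows = conn.execute(
    "SELECT eno, ename, sal, job, dno FROM emp_src1 "
    "UNION ALL "
    "SELECT eno, ename, sal, job, dno FROM emp_src2 "
    "ORDER BY eno"
).fetchall()

# Horizontal merging: different definitions joined on the common column.
join_rows = conn.execute(
    "SELECT e.eno, e.ename, d.dname "
    "FROM emp_src1 AS e JOIN dept AS d ON d.dno = e.dno "
    "ORDER BY e.eno"
).fetchall()

print(len(union_rows), join_rows)
```

UNION ALL keeps every row from both sources; plain UNION would additionally remove duplicate rows.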
Round()
Amount (before)   Amount (after)
$4.6261           $4.63
$3.781            $3.78
$3.4261           $3.43
$2.01             $2.01
$3.1              $3.10

Initcap()
Country (before)   Country (after)
Italy              Italy
india              India
HYDERABAD          Hyderabad

To_Date()
Date (before)   Date (after)
12-05-12        12/05/2012
12/12/2012      12/12/2012
12-1-2012       12/01/2012
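The three standardization steps (Round(), Initcap(), To_Date()) can be sketched in plain Python. The helper names below are hypothetical stand-ins for the database functions, and the day-first reading of the input dates is an assumption. Note that rounding 4.6261 to two decimals gives 4.63.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP


def round_amount(value: str) -> Decimal:
    """Round an amount to two decimal places (half up)."""
    return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def initcap(text: str) -> str:
    """Capitalize the first letter of each word, lower-case the rest."""
    return text.strip().title()


def to_date(text: str) -> str:
    """Parse a few known input layouts and emit one standard DD/MM/YYYY format."""
    # Assumption: all inputs are day-first.
    for fmt in ("%d-%m-%y", "%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(text, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date layout: {text!r}")


print([str(round_amount(v)) for v in ("4.6261", "3.781", "3.4261", "2.01", "3.1")])
print([initcap(c) for c in ("Italy", "india", "HYDERABAD")])
print([to_date(d) for d in ("12-05-12", "12/12/2012", "12-1-2012")])
```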
EMP (source): Emp no, E name, Sal, Dept
Derived attribute: Tax = Sal * 0.17
EMP (target): Emp no, E name, Sal, Dept, Tax
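Deriving the Tax attribute can be sketched as follows. The function name add_tax and the dictionary layout of a row are assumptions for illustration; the rate 0.17 is the one shown above.

```python
from decimal import Decimal

TAX_RATE = Decimal("0.17")  # rate taken from the example: Tax = Sal * 0.17


def add_tax(emp_rows):
    """Return new rows with a derived 'tax' attribute; the input is not modified."""
    return [{**row, "tax": row["sal"] * TAX_RATE} for row in emp_rows]


source = [
    {"emp_no": 142, "ename": "Arlo", "sal": Decimal("45000"), "dept": 10},
    {"emp_no": 144, "ename": "Arun", "sal": Decimal("58000"), "dept": 20},
]
target = add_tax(source)
print(target)
```

Decimal is used instead of float so that currency arithmetic stays exact.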
Data aggregation is the process of calculating summaries from detailed data using aggregate functions.
Ex: Sum(), Count(), Min(), Max()
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
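The query above can be run as-is against an in-memory SQLite database. In this sketch the sample rows are invented for illustration, and an ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SALE (date TEXT, amt INTEGER)")
# Hypothetical detail rows: two sales on the 28th, one on the 29th.
conn.executemany(
    "INSERT INTO SALE VALUES (?, ?)",
    [("2014-01-28", 10), ("2014-01-28", 20), ("2014-01-29", 5)],
)

# Add up amounts by day.
daily_totals = conn.execute(
    "SELECT date, sum(amt) FROM SALE GROUP BY date ORDER BY date"
).fetchall()
print(daily_totals)
```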
Loading is the process of inserting data into a target system. There are two types of
data loads
Initial Load / Full Load:
It is the process of inserting data into an empty target table.
Incremental Load / Delta Load:
It is the process of inserting only the new records after the initial load.
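The difference between the two load types can be sketched with one helper that inserts only rows whose key is not yet in the target. The table name emp_target and the function load_new_rows are hypothetical; real ETL tools typically detect new records with timestamps or change data capture instead of a key lookup.

```python
import sqlite3


def load_new_rows(conn, source_rows):
    """Insert only the rows whose key (eno) is not yet in the target table."""
    existing = {r[0] for r in conn.execute("SELECT eno FROM emp_target")}
    new_rows = [r for r in source_rows if r[0] not in existing]
    conn.executemany("INSERT INTO emp_target VALUES (?, ?, ?)", new_rows)
    return len(new_rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emp_target (eno INTEGER PRIMARY KEY, ename TEXT, sal INTEGER)"
)

# Initial / full load: the target is empty, so every source row is inserted.
first_extract = [(142, "Arlo", 45000), (144, "Arun", 58000)]
initial_count = load_new_rows(conn, first_extract)

# Incremental / delta load: only the record added since the initial load goes in.
second_extract = first_extract + [(145, "prasad", 65000)]
delta_count = load_new_rows(conn, second_extract)

total = conn.execute("SELECT COUNT(*) FROM emp_target").fetchone()[0]
print(initial_count, delta_count, total)
```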
Datamart and Datawarehouse
Dependent Datamart
Independent Datamart
[Diagram: Dependent Datamart, showing an EDW and data marts (DM).]
Dimension tables have a simple primary key, while fact tables have a set of
foreign keys which make up a compound primary key consisting of a combination
of relevant dimension keys.
i. Star schema
ii. Snow-Flake Schema
iii. Galaxy Schema
A star schema is a database design that has a centrally located fact table surrounded by
multiple dimension tables.
Because the design looks like a star, it is called a star schema.
In a data warehouse, facts are numeric. Facts are stored in a table called the fact table.
Not every numeric value is a fact; only numeric values that are key performance indicators are known as
FACTS.
FACTS are business measures used to evaluate the performance of an enterprise.
A fact table contains the facts at the lowest level of granularity.
Fact granularity defines the level of detail.
A dimension is descriptive data that describes the key performance indicators known as FACTS.
Facts are numerical data and they are key performance indicators.
The fact table is surrounded by multiple dimension tables.
Fact tables are normalized.
Dimension tables are de-normalized.
A dimension provides the answer to the following business questions:
WHO? WHAT? WHEN? WHERE?
Advantages :
Provide a direct and intuitive mapping between the business entities being
analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may
anticipate or even require that the data-warehouse schema contain dimension
tables
Disadvantages:
Requires more storage space, because the de-normalized dimension tables hold redundant data.
[Diagram: star schema]
DIM_Customer: Customer_Key (PK)
DIM_Time: Date_Key (PK), Date
DIM_Product: Product_key (PK), Product code, Product name
DIM_Market: Market_key (PK), Market code, Market name
Fact table: Customer_Key (FK), Market_key (FK), Product_key (FK), Quantity, Revenue, Profit
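A cut-down version of this star schema can be built and queried with sqlite3. This is only a sketch: the fact table name FACT_Sales and all sample rows are assumptions, and only two of the four dimensions are included to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DIM_Product (
        Product_key INTEGER PRIMARY KEY, Product_code TEXT, Product_name TEXT);
    CREATE TABLE DIM_Market (
        Market_key INTEGER PRIMARY KEY, Market_code TEXT, Market_name TEXT);
    -- Hypothetical fact table name; its compound primary key is made up
    -- of the foreign keys to the dimensions.
    CREATE TABLE FACT_Sales (
        Product_key INTEGER REFERENCES DIM_Product (Product_key),
        Market_key  INTEGER REFERENCES DIM_Market (Market_key),
        Quantity INTEGER, Revenue INTEGER, Profit INTEGER,
        PRIMARY KEY (Product_key, Market_key));
    INSERT INTO DIM_Product VALUES (1, 'P1', 'Laptop'), (2, 'P2', 'Phone');
    INSERT INTO DIM_Market VALUES (1, 'M1', 'North'), (2, 'M2', 'South');
    INSERT INTO FACT_Sales VALUES
        (1, 1, 2, 2000, 300),
        (1, 2, 1, 1000, 150),
        (2, 1, 5, 2500, 500);
""")

# A typical star query: join the fact table to a dimension and summarize a fact.
revenue_by_product = conn.execute(
    "SELECT p.Product_name, SUM(f.Revenue) "
    "FROM FACT_Sales AS f "
    "JOIN DIM_Product AS p ON p.Product_key = f.Product_key "
    "GROUP BY p.Product_name ORDER BY p.Product_name"
).fetchall()
print(revenue_by_product)
```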
The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the
snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions
are de-normalized with each dimension represented by a single table.
Snowflake schemas are often better with more sophisticated query tools that isolate users from the raw table
structures and for environments having numerous queries with complex criteria
Advantages
Some OLAP multidimensional database modeling tools that use dimensional data marts as
data sources are optimized for snowflake schemas.
A snowflake schema can sometimes reflect the way in which users think about data. Users
may prefer to generate queries using a star schema in some cases, although this may or
may not be reflected in the underlying organization of the database.
A multidimensional view is sometimes added to an existing transactional database to aid
reporting. In this case, the tables which describe the dimensions will already exist and will
typically be normalized. A snowflake schema will therefore be easier to implement.
Disadvantages:
Requires more joins to get information from lookup tables, hence slower query performance.
A slowly changing dimension (SCD) captures the changes that take place over a period of time.
1. SCD Type 1
2. SCD Type 2
3. SCD Type 3
1. SCD Type 1
A Type 1 dimension keeps only the current values. It doesn't maintain history.
2. SCD Type 2
A Type 2 dimension maintains the full history in the target. For each update it inserts
a new record in the target table.
3. SCD Type 3 :
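The difference between Type 1 and Type 2 can be sketched with plain Python lists standing in for the dimension table. The function names and the is_current flag are assumptions; effective dates or version numbers are common alternatives for marking the current row.

```python
def apply_type1(dim, emp_no, new_job):
    """Type 1: overwrite the value in place, so no history is kept."""
    for row in dim:
        if row["emp_no"] == emp_no:
            row["job"] = new_job


def apply_type2(dim, emp_no, new_job):
    """Type 2: expire the current row and insert a new current row."""
    current = next(r for r in dim if r["emp_no"] == emp_no and r["is_current"])
    current["is_current"] = False
    dim.append({**current, "job": new_job, "is_current": True})


type1_dim = [{"emp_no": 7396, "ename": "rahul", "job": "SE"}]
type2_dim = [{"emp_no": 7396, "ename": "rahul", "job": "SE", "is_current": True}]

# The same two promotions are applied to both dimensions.
for job in ("SSE", "TL"):
    apply_type1(type1_dim, 7396, job)
    apply_type2(type2_dim, 7396, job)

print(type1_dim)  # one row holding only the latest job
print(type2_dim)  # three rows: full history, the last one flagged current
```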
In report testing, we mainly consider the following areas.
1) Data level validation: We compare the output of the report against the data in the
data warehouse/database.
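Data level validation can be sketched as a key-by-key comparison between the values shown in a report and the values queried from the warehouse. The function name validate_report and the sample figures are hypothetical.

```python
def validate_report(report_rows, warehouse_rows):
    """Return the keys whose report value differs from, or is missing in, the warehouse."""
    report = dict(report_rows)
    warehouse = dict(warehouse_rows)
    all_keys = report.keys() | warehouse.keys()
    return sorted(k for k in all_keys if report.get(k) != warehouse.get(k))


# Hypothetical daily totals: what the report shows vs. what the warehouse returns.
report_rows = [("2014-01-28", 30), ("2014-01-29", 7)]
warehouse_rows = [("2014-01-28", 30), ("2014-01-29", 5)]

mismatches = validate_report(report_rows, warehouse_rows)
print(mismatches)
```

An empty result means the report agrees with the warehouse for every key that was compared.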