Introduction to Data Warehousing

Agenda

About Data Warehouse


ETL Process
Characteristics of Data Warehousing
Differences between OLTP and OLAP
Data Extraction
Data Transformation
Meta Data
Data Loading
Data Mart & Types of Datamarts

© 2010 Capgemini - All rights reserved


12/14/2024 12:11 PM 2
About Data Warehouse

Data Warehouse
A data warehouse is a relational database that is designed for query and analysis rather than
for transaction processing. It usually contains historical data derived from transaction data, but
it can include data from other sources. It separates analysis workload from transaction
workload and enables an organization to consolidate data from several sources.
In addition to a relational database, a data warehouse environment includes an extraction,
transportation, transformation, and loading (ETL) solution, an online analytical processing
(OLAP) engine, Oracle Warehouse Builder, client analysis tools, and other applications that
manage the process of gathering data and delivering it to business users.

Decision Support System (DSS)

Because a data warehouse is designed to support the decision-making process, it is also called a decision support system (DSS).

Historical Database
Because a data warehouse maintains the historical business information required for analysis, it is also called a historical database.
ETL Process

[Diagram: OLTP sources (Student Register, Student Fee, Student Subject, Student Marks) are integrated and transformed into the DWH subject "Student".]

Characteristics of Data Warehousing
Data warehouses all share the following basic characteristics:

Subject Oriented
Integrated
Nonvolatile
Time Variant

Subject Oriented
A data warehouse is a subject-oriented database that supports the
business needs of department-specific business managers or middle management.
E.g.: student, account, loans, HR, etc.

[Diagram: OLTP sources (Current Account, Saving Account, Check in A/C) are extracted, transformed, and loaded into the DWH subject "Account".]

Integrated:
What is required
Integration is closely related to subject orientation. Data warehouses must put
data from disparate sources into a consistent format. They must resolve such
problems as naming conflicts and inconsistencies among units of measure. When
they achieve this goal, they are said to be integrated.

A DWH is an integrated database: it collects data from multiple OLTP
source systems, converts that data into a homogeneous format, and loads the
integrated data into one centralized database.
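
The integration step can be sketched in a few lines of Python. This is only an illustration: the two source systems, their field names (`cust_nm`, `customer_name`, `amt_cents`, `amount_usd`), and the common target layout are hypothetical, chosen to show one naming conflict and one unit-of-measure conflict being resolved.

```python
# Two hypothetical OLTP sources that describe the same kind of record
# with different column names and different units of measure.
billing_rows = [{"cust_nm": "Arlo", "amt_cents": 4500}]
sales_rows = [{"customer_name": "Arun", "amount_usd": 58.0}]


def from_billing(row):
    """Map a billing row to the common format (amount converted to dollars)."""
    return {"customer": row["cust_nm"], "amount_usd": row["amt_cents"] / 100}


def from_sales(row):
    """Map a sales row to the common format (amount is already in dollars)."""
    return {"customer": row["customer_name"], "amount_usd": row["amount_usd"]}


# The integrated result uses one name and one unit for each attribute.
integrated = [from_billing(r) for r in billing_rows] + [
    from_sales(r) for r in sales_rows
]
print(integrated)
```

Writing one small mapping function per source keeps each naming or unit rule in a single place, which is one reasonable way to organize this kind of conversion.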

Non-Volatile:

Nonvolatile means that, once entered into the warehouse, data should not change. This is
logical because the purpose of a warehouse is to enable you to analyze what has occurred.
A data warehouse is a non-volatile database.
Once data has been entered into the DWH, it does not reflect changes that later take place in the OLTP
database.

[Diagram: OLTP DB > ETL > DWH]

OLTP DB (current value only)
Emp no   E name   EBW
7396     rahul    TL     (current)

DWH (history retained)
Emp no   E name   EBW
7396     rahul    SE     (historical)
7396     rahul    SSE    (historical)
7396     rahul    TL     (current)

Time- Variant:

A data warehouse is a time-variant database: it lets business management analyze the business and
compare it across different time periods. This is known as time series analysis.

In order to discover trends in business, analysts need large amounts of data. This is very much in
contrast to online transaction processing (OLTP) systems, where performance requirements demand
that historical data be moved to an archive. A data warehouse's focus on change over time is what is
meant by the term time variant.

Typically, data flows from one or more online transaction processing (OLTP) databases into a data
warehouse on a monthly, weekly, or daily basis. The data is normally processed in a staging file before
being added to the data warehouse. Data warehouses commonly range in size from tens of gigabytes to
a few terabytes. Usually, the vast majority of the data is stored in a few very large fact tables.

[Diagram: DWH time hierarchy: Year > Quarter > Month > Week > Date]

Differences between OLTP and OLAP

OLTP                                      | OLAP
Designed to support business transactions | Designed to support the decision-making process
Volatile data                             | Non-volatile data
Current data                              | Historical data
Normalized design                         | De-normalized design
Designed for running the business         | Designed for analyzing the business
Designed for clerical access              | Designed for managerial access
Less history (3-6 months)                 | More history (1-30 years)
Detailed data                             | Summary data
Application-oriented data                 | Subject-oriented data
E-R modeling                              | Dimensional modeling

Data Extraction

Extraction is the process of reading data from different OLTP source systems. Data
can be extracted from the following types of sources.

OLTP SOURCES:
ERP sources (SAP, PeopleSoft, Siebel)
Mainframes
Oracle Applications
XML files
Flat files
COBOL files
Relational sources (Oracle, SQL Server, etc.)

Data Transformation

It is the process of generating and manipulating the data. The following are the
types of data transformation activities:

1. Data Merging
i) Horizontal Merging
ii) Vertical Merging

2. Data Cleansing

3. Data Scrubbing

4. Data Aggregation

1. Data Merging

It is the process of integrating data from multiple OLTP source systems. There are
two types of data merging:
i) Horizontal Merging (Joins)
ii) Vertical Merging (Unions)

Join: A join is needed when the sources have different data
definitions but share a common column with the same data type (metadata).

Union: When two sources have the same data definition, they can
be combined vertically using a union.

Meta Data

Metadata is data about data.

Desc Emp                   Desc Dept

Eno     number(2)          Dno     number(2)
Ename   varchar2(10)       Dname   varchar2(10)
Sal     number(7,2)        Loc     varchar2(10)
Job     varchar2(10)
Dno     number(2)

Horizontal Merging

EMP
Eno   Ename    Sal      Job       Dno
142   Arlo     45,000   con       10
144   Arun     58,000   Sr.con    20
145   prasad   65,000   manager   10
147   Ram      36,000   Sr.soft   40

Dept
Dno   Dname   Loc
10    Fs      Hyderabad
20    Ins     Chennai
30    Ban     Bangalore

Join (on Dno)
Eno   Ename    Sal      Job       Dno   Dname   Loc
142   Arlo     45,000   con       10    Fs      Hyderabad
144   Arun     58,000   Sr.con    20    Ins     Chennai
145   prasad   65,000   manager   10    Fs      Hyderabad

Ram (Dno 40) has no matching row in Dept, so the join drops that record.
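
The horizontal merge can be reproduced with Python's built-in sqlite3 module. This is a sketch that assumes the join is an inner join on `Dno` (the result shown on the slide omits employee 147, which suggests it is). The table and column names follow the slide, and SQLite types stand in for the Oracle types in the metadata.

```python
import sqlite3

# In-memory database loaded with the EMP and Dept rows from the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp  (eno INTEGER, ename TEXT, sal INTEGER, job TEXT, dno INTEGER);
    CREATE TABLE dept (dno INTEGER, dname TEXT, loc TEXT);
    INSERT INTO emp VALUES
        (142, 'Arlo',   45000, 'con',     10),
        (144, 'Arun',   58000, 'Sr.con',  20),
        (145, 'prasad', 65000, 'manager', 10),
        (147, 'Ram',    36000, 'Sr.soft', 40);
    INSERT INTO dept VALUES
        (10, 'Fs',  'Hyderabad'),
        (20, 'Ins', 'Chennai'),
        (30, 'Ban', 'Bangalore');
""")

# Horizontal merge: an inner join on the common column dno.
joined = conn.execute("""
    SELECT e.eno, e.ename, e.dno, d.dname, d.loc
    FROM emp AS e
    JOIN dept AS d ON e.dno = d.dno
    ORDER BY e.eno
""").fetchall()

for row in joined:
    print(row)
```

Employee 145 belongs to department 10, so the join pairs it with Fs / Hyderabad. Employee 147 (department 40) and department 30 have no partner row and do not appear in the result.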

Meta Data

Metadata is data about data.

Desc Emp1                  Desc Emp2

Eno     number(2)          Eno     number(2)
Ename   varchar2(10)       Ename   varchar2(10)
Sal     number(7,2)        Sal     number(7,2)
Job     varchar2(10)       Job     varchar2(10)
Dno     number(2)          Dno     number(2)

BI Testing Services | Testing – Center of Excellence


II) Data Merging (Vertical Merging)

Emp1
Eno   Ename    Sal      Job       Dno
142   Arlo     45,000   con       10
144   Arun     58,000   Sr.con    20
145   prasad   65,000   manager   10

Emp2
Eno   Ename    Sal      Job       Dno
146   kiran    45,000   con       10
148   Naren    58,000   Sr.con    20
140   Mahesh   65,000   manager   10

Union
Eno   Ename    Sal      Job       Dno
146   kiran    45,000   con       10
148   Naren    58,000   Sr.con    20
140   Mahesh   65,000   manager   10
142   Arlo     45,000   con       10
144   Arun     58,000   Sr.con    20
145   prasad   65,000   manager   10
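
The vertical merge can be sketched the same way with Python's built-in sqlite3 module. The table names `emp1` and `emp2` follow the metadata slide. The `ORDER BY` is added here only to make the output order predictable; it is not part of the slide.

```python
import sqlite3

# Two sources with the same data definition, as on the slide.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp1 (eno INTEGER, ename TEXT, sal INTEGER, job TEXT, dno INTEGER);
    CREATE TABLE emp2 (eno INTEGER, ename TEXT, sal INTEGER, job TEXT, dno INTEGER);
    INSERT INTO emp1 VALUES
        (142, 'Arlo',   45000, 'con',     10),
        (144, 'Arun',   58000, 'Sr.con',  20),
        (145, 'prasad', 65000, 'manager', 10);
    INSERT INTO emp2 VALUES
        (146, 'kiran',  45000, 'con',     10),
        (148, 'Naren',  58000, 'Sr.con',  20),
        (140, 'Mahesh', 65000, 'manager', 10);
""")

# Vertical merge: UNION stacks the rows of both sources.
merged = conn.execute("""
    SELECT eno, ename, sal, job, dno FROM emp1
    UNION
    SELECT eno, ename, sal, job, dno FROM emp2
    ORDER BY eno
""").fetchall()

for row in merged:
    print(row)
```

`UNION` also removes rows that are identical in both sources. `UNION ALL` would keep such duplicates.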

Data Cleansing

Data cleansing is the process of removing or eliminating unwanted data.

Ex: 1) Removing records that contain nulls
    2) Eliminating duplicate records

It is also the process of correcting inconsistencies and inaccuracies.

Ex: rounding sales amounts to two decimal places with Round()

Sales Amount (source)   Sales Amount (cleansed)
$4.6261                 $4.63
$3.781                  $3.78
$3.4261                 $3.43
$2.01                   $2.01
$3.1                    $3.10
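
The three cleansing steps (drop nulls, drop duplicates, round) can be sketched in Python as follows. The null entry and the repeated `3.781` are hypothetical additions to the slide's sales amounts, included only so that each step has something to do. `Decimal` with `ROUND_HALF_UP` is one reasonable stand-in for a database `Round()` function; it is an assumption, not the deck's stated implementation.

```python
from decimal import Decimal, ROUND_HALF_UP

# Sales amounts from the slide, plus one null and one duplicate (hypothetical).
raw_amounts = ["4.6261", "3.781", None, "3.4261", "2.01", "3.781", "3.1"]

# 1) Remove records that contain nulls.
non_null = [amount for amount in raw_amounts if amount is not None]

# 2) Eliminate duplicate records while keeping the original order.
deduplicated = list(dict.fromkeys(non_null))

# 3) Round every amount to two decimal places.
cleansed = [
    Decimal(amount).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    for amount in deduplicated
]

print([str(amount) for amount in cleansed])
```

Rounding to two places turns 4.6261 into 4.63 and 3.4261 into 3.43. Simply cutting off the extra digits would give 4.62 and 3.42 instead, which is truncation rather than rounding.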

Cleansing Example:

Initcap() applied to Country
Country (source)   COUNTRY (cleansed)
Italy              Italy
india              India
HYDERABAD          Hyderabad

To_Date() applied to Date
Date (source)   Date (cleansed)
12-05-12        12/05/2012
12/12/2012      12/12/2012
12-1-2012       12/01/2012
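
Both cleansing rules can be approximated in Python. This sketch uses `string.capwords` as a rough equivalent of `Initcap()` and `datetime.strptime` in place of `To_Date()`. The helper names `initcap` and `to_standard_date` are hypothetical, and the code assumes the dates are day-first, which the slide does not state.

```python
import string
from datetime import datetime


def initcap(text):
    """Capitalize the first letter of each word and lowercase the rest."""
    return string.capwords(text)


# Accepted input layouts, tried in order (assumed to be day-first).
DATE_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d-%m-%y")


def to_standard_date(text):
    """Parse a date in any accepted layout and return it as DD/MM/YYYY."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%d/%m/%Y")
        except ValueError:
            continue  # This layout did not match; try the next one.
    raise ValueError(f"unrecognized date: {text!r}")


countries = [initcap(c) for c in ["Italy", "india", "HYDERABAD"]]
dates = [to_standard_date(d) for d in ["12-05-12", "12/12/2012", "12-1-2012"]]
print(countries)
print(dates)
```

Raising an error for an unrecognized layout, rather than passing the value through, makes bad dates visible during the load instead of later in a report.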

Data Scrubbing
It is the process of deriving new attributes for the target columns.

[Diagram: source EMP (Emp no, E name, Sal, Dept) > Tax = Sal * 0.17 > target EMP (Emp no, E name, Sal, Dept, Tax)]

Source
Eno   Ename    Sal      Job       Dno
142   Arlo     45,000   con       10
144   Arun     58,000   Sr.con    20
145   prasad   65,000   manager   10

Target
Eno   Ename    Sal      Job       Dno   Tax
142   Arlo     45,000   con       10    7650
144   Arun     58,000   Sr.con    20    9860
145   prasad   65,000   manager   10    11050
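
The derived Tax column can be sketched in Python. The rate 0.17 and the employee rows come from the slide. Using `Decimal` rather than floating point is a choice made here to keep the arithmetic exact; the constant name `TAX_RATE` is hypothetical.

```python
from decimal import Decimal

TAX_RATE = Decimal("0.17")  # Tax = Sal * 0.17, as on the slide

employees = [
    {"eno": 142, "ename": "Arlo", "sal": Decimal("45000"), "job": "con", "dno": 10},
    {"eno": 144, "ename": "Arun", "sal": Decimal("58000"), "job": "Sr.con", "dno": 20},
    {"eno": 145, "ename": "prasad", "sal": Decimal("65000"), "job": "manager", "dno": 10},
]

# Scrubbing: copy each source row and add the derived attribute "tax".
scrubbed = [{**emp, "tax": emp["sal"] * TAX_RATE} for emp in employees]

for emp in scrubbed:
    print(emp["eno"], emp["ename"], emp["sal"], emp["tax"])
```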

Data Aggregation

It is the process of calculating summaries from detailed data using aggregate functions.
Ex: SUM(), COUNT(), MIN(), MAX()
Aggregates
Add up amounts by day
In SQL: SELECT date, SUM(amt) FROM SALE
GROUP BY date
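
The query can be tried out with Python's built-in sqlite3 module. The `sale` rows below are made-up sample data, and the `ORDER BY` is added only to make the output order predictable. Note that some databases treat `date` as a reserved word, so the column may need quoting or a different name there; SQLite accepts it as written.

```python
import sqlite3

# Sample detail rows (hypothetical) for the SALE table used in the query.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (date TEXT, amt INTEGER);
    INSERT INTO sale VALUES
        ('2012-05-12', 100),
        ('2012-05-12', 250),
        ('2012-05-13', 75);
""")

# Aggregation: one summary row per day.
daily_totals = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

for row in daily_totals:
    print(row)
```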

Data Loading

Data loading is the process of inserting data into a target system. There are two types of
data loads:
Initial Load / Full Load:
It is the process of inserting data into an empty target table.
Incremental Load / Delta Load:
It is the process of inserting only the new records after the initial load.
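
The difference between the two load types can be sketched with Python's built-in sqlite3 module. The `target` table and the functions `initial_load` and `incremental_load` are hypothetical names. The sketch detects new records by primary key, which is one simple approach; real ETL tools may rely on timestamps or change logs instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (eno INTEGER PRIMARY KEY, ename TEXT)")


def initial_load(rows):
    """Full load: insert every source row into the empty target table."""
    conn.executemany("INSERT INTO target VALUES (?, ?)", rows)


def incremental_load(rows):
    """Delta load: insert only rows whose key is not yet in the target."""
    existing = {eno for (eno,) in conn.execute("SELECT eno FROM target")}
    new_rows = [row for row in rows if row[0] not in existing]
    conn.executemany("INSERT INTO target VALUES (?, ?)", new_rows)
    return len(new_rows)


initial_load([(142, "Arlo"), (144, "Arun"), (145, "prasad")])

# The second extract repeats employee 145 and brings two new employees.
inserted = incremental_load([(145, "prasad"), (146, "kiran"), (148, "Naren")])
total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
print(inserted, total)
```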

Data Mart & Types of Datamarts

A data mart is a subset of an enterprise data warehouse.

A data mart is a subject-oriented database that supports the business
needs of department-specific business managers.

A data mart is also known as a high performance query structure.

There are two types of data marts:

Dependent Datamart
Independent Datamart

Difference Between Datamart & Dataware house

Datamart                                                    | Datawarehouse
Designed to store department-specific business information | Designed to store enterprise-wide business information
Designed for middle management                              | Designed for top management (e.g. CEO, MD)
A single-subject database                                   | An integration of multiple subject databases

Top Down Datawarehouse Approach (Inmon)

According to Inmon, we first design a very large database that stores
enterprise-wide information, known as the EDW (enterprise data warehouse). The small subject-oriented
databases known as data marts are then derived from it.

Dependent Datamart

[Diagram: one EDW with two data marts (DM) derived from it]

Bottom-Up Datawarehouse Approach

In a top-down approach, data mart development depends on the enterprise
data warehouse; hence such data marts are known as dependent data marts.

Independent Datamart
In a bottom-up approach, data mart development is independent of the
data warehouse.

[Diagram: three data marts (DM) and the EDW]

The fact table holds the main data. It includes a large amount of aggregated data,
such as price and units sold. There may be multiple fact tables in a star schema.
Dimension tables, which are usually smaller than fact tables, include the
attributes that describe the facts. Often this is a separate table for each
dimension. Dimension tables can be joined to the fact table(s) as needed.

Dimension tables have a simple primary key, while fact tables have a set of
foreign keys which make up a compound primary key consisting of a combination
of relevant dimension keys.

Dimensional Modeling - Database Design

Dimensional modeling is a design methodology for designing a data warehouse or data mart.
A data modeler or database architect designs the database using a GUI-based database
design tool called ERwin.
Dimensional modeling uses the following types of schema:

i. Star Schema
ii. Snowflake Schema
iii. Galaxy Schema

Star schema

A star schema is a database design with a centrally located fact table surrounded by
multiple dimension tables.
Because the design looks like a star, it is called a star schema.
In a data warehouse, facts are numeric and are stored in a table called the fact table.
Not every numeric value is a fact; only numeric values that are key performance indicators are known as
facts.
Facts are business measures used to evaluate the performance of an enterprise.
A fact table contains the facts at the lowest level of granularity.
Granularity defines the level of detail of a fact.
A dimension is descriptive data that describes the key performance indicators known as facts.
Fact tables are normalized.
Dimension tables are de-normalized.
A dimension provides the answer to the following business questions:
WHO? WHAT? WHEN? WHERE?
Advantages :

Provide a direct and intuitive mapping between the business entities being
analyzed by end users and the schema design.
Provide highly optimized performance for typical star queries.
Are widely supported by a large number of business intelligence tools, which may
anticipate or even require that the data-warehouse schema contain dimension
tables
Disadvantages:
Requires the most storage space, because the de-normalized dimension tables repeat data.

STAR SCHEMA

Transaction_Fact
  Transaction_ID (PK)
  Date_Key (FK)
  Customer_Key (FK)
  Market_key (FK)
  Product_key (FK)
  Quantity
  Revenue
  Profit

DIM_Customer
  Customer_Key (PK)
  Customer code
  Customer name

DIM_Time
  Date_Key (PK)
  Year
  Quarter
  Month
  Week
  Date

DIM_Product
  Product_key (PK)
  Product code
  Product name

DIM_Market
  Market_key (PK)
  Market code
  Market name

A dimension provides the answer to the following business questions:
WHO? WHAT? WHEN? WHERE?
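
A cut-down version of this star schema can be built and queried with Python's built-in sqlite3 module. The table and column names follow the diagram, written in lowercase with underscores. Only two of the four dimensions are created to keep the sketch short, and all inserted rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one simple primary key each.
    CREATE TABLE dim_time (
        date_key INTEGER PRIMARY KEY,
        year INTEGER, quarter INTEGER, month INTEGER, week INTEGER, date TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_code TEXT, product_name TEXT
    );

    -- Fact table: foreign keys to the dimensions plus numeric measures.
    CREATE TABLE transaction_fact (
        transaction_id INTEGER PRIMARY KEY,
        date_key INTEGER REFERENCES dim_time(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity INTEGER, revenue INTEGER, profit INTEGER
    );

    INSERT INTO dim_time VALUES
        (1, 2012, 2, 5, 19, '2012-05-12'),
        (2, 2013, 1, 1, 2,  '2013-01-10');
    INSERT INTO dim_product VALUES
        (1, 'P1', 'Widget'),
        (2, 'P2', 'Gadget');
    INSERT INTO transaction_fact VALUES
        (1, 1, 1, 2, 200, 50),
        (2, 1, 2, 1, 300, 80),
        (3, 2, 1, 5, 500, 120);
""")

# A typical star query: join the fact table to its dimensions,
# then group the measures by dimension attributes.
revenue_by_year_product = conn.execute("""
    SELECT t.year, p.product_name, SUM(f.revenue)
    FROM transaction_fact AS f
    JOIN dim_time    AS t ON f.date_key = t.date_key
    JOIN dim_product AS p ON f.product_key = p.product_key
    GROUP BY t.year, p.product_name
    ORDER BY t.year, p.product_name
""").fetchall()

for row in revenue_by_year_product:
    print(row)
```

Each dimension is exactly one join away from the fact table, which is what keeps star queries simple.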

Snow Flake Schemas :

The snowflake schema is represented by centralized fact tables which are connected to multiple dimensions. In the
snowflake schema, dimensions are normalized into multiple related tables, whereas the star schema's dimensions
are de-normalized with each dimension represented by a single table.
Snowflake schemas are often better with more sophisticated query tools that isolate users from the raw table
structures and for environments having numerous queries with complex criteria

Advantages

Some OLAP multidimensional database modeling tools that use dimensional data marts as
data sources are optimized for snowflake schemas.
A snowflake schema can sometimes reflect the way in which users think about data. Users
may prefer to generate queries using a star schema in some cases, although this may or
may not be reflected in the underlying organization of the database.
A multidimensional view is sometimes added to an existing transactional database to aid
reporting. In this case, the tables which describe the dimensions will already exist and will
typically be normalized. A snowflake schema will therefore be easier to implement.
Disadvantages:
Requires more joins to get information from lookup tables, hence slower query performance.
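
The extra join can be seen in a small sketch using Python's built-in sqlite3 module. Here a product dimension is normalized by moving the product category into its own lookup table. The `dim_category` table, its columns, and all rows are hypothetical; the deck does not define a snowflaked example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Snowflaked dimension: category lives in a separate lookup table.
    CREATE TABLE dim_category (
        category_key INTEGER PRIMARY KEY,
        category_name TEXT
    );
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        product_name TEXT,
        category_key INTEGER REFERENCES dim_category(category_key)
    );
    CREATE TABLE transaction_fact (
        transaction_id INTEGER PRIMARY KEY,
        product_key INTEGER REFERENCES dim_product(product_key),
        revenue INTEGER
    );

    INSERT INTO dim_category VALUES (1, 'Hardware'), (2, 'Software');
    INSERT INTO dim_product VALUES
        (1, 'Widget', 1), (2, 'Gadget', 1), (3, 'App', 2);
    INSERT INTO transaction_fact VALUES
        (1, 1, 200), (2, 2, 300), (3, 3, 150);
""")

# Two joins are needed to reach the category name:
# fact -> product -> category.
revenue_by_category = conn.execute("""
    SELECT c.category_name, SUM(f.revenue)
    FROM transaction_fact AS f
    JOIN dim_product  AS p ON f.product_key = p.product_key
    JOIN dim_category AS c ON p.category_key = c.category_key
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()

print(revenue_by_category)
```

In a star schema the category name would be stored directly in the product dimension, so the same report would need only one join.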

SLOWLY CHANGING DIMENSIONS

A slowly changing dimension (SCD) captures the changes that take place in dimension data over a period of time.

There are three types of SCD:

1. SCD Type 1

2. SCD Type 2

3. SCD Type 3

SLOWLY CHANGING DIMENSIONS (continued)

1. SCD Type 1

A Type 1 dimension keeps only the current values. It does not maintain history.

2. SCD Type 2

A Type 2 dimension maintains the full history in the target. For each update it inserts
a new record into the target table.

3. SCD Type 3

A Type 3 dimension maintains the current and the previous information (partial history).
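
The three types can be contrasted with a small Python sketch that applies the same two changes (SE to SSE, then SSE to TL) to employee 7396 from the non-volatile example. The column names `role`, `previous_role`, and `is_current`, and the function names, are hypothetical. Real Type 2 implementations often use effective dates or version numbers instead of a current-row flag.

```python
def apply_type1(row, new_role):
    """Type 1: overwrite the value; no history is kept."""
    updated = dict(row)
    updated["role"] = new_role
    return updated


def apply_type2(rows, emp_no, new_role):
    """Type 2: expire the existing rows and insert a new current row."""
    updated = [
        dict(r, is_current=False) if r["emp_no"] == emp_no else dict(r)
        for r in rows
    ]
    name = next(r["name"] for r in rows if r["emp_no"] == emp_no)
    updated.append(
        {"emp_no": emp_no, "name": name, "role": new_role, "is_current": True}
    )
    return updated


def apply_type3(row, new_role):
    """Type 3: keep the current value plus one previous value."""
    updated = dict(row)
    updated["previous_role"] = row["role"]
    updated["role"] = new_role
    return updated


start = {"emp_no": 7396, "name": "rahul", "role": "SE"}

type1 = apply_type1(apply_type1(start, "SSE"), "TL")
type2 = apply_type2(
    apply_type2([dict(start, is_current=True)], 7396, "SSE"), 7396, "TL"
)
type3 = apply_type3(apply_type3(start, "SSE"), "TL")

print(type1)
print(type2)
print(type3)
```

After both changes, Type 1 holds only TL, Type 2 holds three rows (SE, SSE, TL) with only the last one current, and Type 3 holds TL with SSE as the previous value, so the original SE is lost.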

Dirty Dimensions
A dirty dimension is one in which data quality cannot be guaranteed.

For example, data about the same customer can appear multiple times.

Factless Fact Table:

A factless fact table is a fact table that does not have any measures.

It is essentially an intersection of dimensions.

Reports Testing

In reports testing, we mainly consider the following areas:

1) Data-level validation: compare the output of the report against the data in the
data warehouse/database.

2) Formatting-level validation: check prompts (dynamic filters), static filters,
and the different output formats (PDF, CSV, XLS, etc.).

3) Performance testing: check the response time needed to retrieve the data.
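
Data-level validation can be sketched in Python by comparing a report export with the result of the equivalent warehouse query. This is an illustration only: the `sale` table, the CSV layout, and the function `find_mismatches` are hypothetical, and the report is assumed to have been exported as CSV.

```python
import csv
import io
import sqlite3

# Warehouse side: sample data (hypothetical) and the query behind the report.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (date TEXT, amt INTEGER);
    INSERT INTO sale VALUES
        ('2012-05-12', 100), ('2012-05-12', 250), ('2012-05-13', 75);
""")
warehouse_rows = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date"
).fetchall()

# Report side: the exported report, held here as an in-memory CSV string.
report_csv = "date,total\n2012-05-12,350\n2012-05-13,75\n"
report_rows = [
    (record["date"], int(record["total"]))
    for record in csv.DictReader(io.StringIO(report_csv))
]


def find_mismatches(report, warehouse):
    """Return rows that appear on only one side of the comparison."""
    return sorted(set(report) ^ set(warehouse))


mismatches = find_mismatches(report_rows, warehouse_rows)
print("mismatches:", mismatches)
```

An empty list means every report row matches a warehouse row and vice versa. Comparing sets ignores row order and repeated rows, so an ordered or multiset comparison would be needed where those matter.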

Reporting Tools in the market

SAP Crystal Reports


MicroStrategy
IBM Cognos
Actuate
Jaspersoft
Pentaho

Thank You
