
First Data WarehouseAima First Final Updated 9 Sep 2016

This document discusses databases and data warehouses. It defines a database as an organized collection of data consisting of schemas, tables, queries, and other objects. A data warehouse is a separate database used for analysis and reporting rather than transactions. The key differences between databases and data warehouses are that data warehouses use de-normalized data for faster queries, focus on historical data rather than current data, and are optimized for reading rather than writing.

Uploaded by

dinesh

Data Warehousing & Data Mining
Code: ITM 402
What is Data Warehouse?
DATABASE
A database is an organized collection of
data. It is the collection of schemas,
tables, queries, reports, views, and other
objects.
Often abbreviated DB, a database is
basically a collection of information
organized in such a way that a computer
program can quickly select desired pieces
of data. You can think of a database as an
electronic filing system.
Traditional databases are organized by
fields, records, and files.
A field is a single piece of information;
A record is one complete set of fields; AND
A file is a collection of records.
For example, a telephone book is
analogous to a file. It contains a list of
records, each of which consists of three
fields: name, address, and telephone
number.
Data Dictionary
A data dictionary is a file or a set of
files that contains a database's
metadata. The data dictionary
contains records about other objects
in the database, such as data
ownership, data relationships to
other objects, and other data.
The data dictionary is a crucial
component of any relational
database. Ironically, because of its
importance, it is invisible to most
database users. Typically, only
database administrators interact with
the data dictionary.
In a relational database, the metadata in the data dictionary includes the
following:
Names of all tables in the database and their owners
Names of all indexes and the tables and columns to which those indexes
relate
Constraints defined on tables, including primary keys, foreign-key
relationships to other tables, and not-null constraints
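The metadata listed above can be read back from a live database. The following is a minimal sketch using SQLite through Python's standard sqlite3 module; the customer/orders schema and the helper names are hypothetical and exist only to give the catalog something to describe. Other DBMSs expose their data dictionary differently, and SQLite has no notion of table owners.

```python
import sqlite3

# Hypothetical two-table schema, used only to have something to describe.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    CREATE INDEX idx_orders_customer ON orders(customer_id);
""")

def list_objects(kind):
    """Return the names of all catalog objects of one kind ('table' or 'index')."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = ? ORDER BY name", (kind,)
    )
    return [name for (name,) in rows]

def not_null_columns(table):
    """Return the columns of `table` declared NOT NULL."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})") if row[3]]

def foreign_keys(table):
    """Return (column, referenced_table, referenced_column) triples."""
    # PRAGMA foreign_key_list yields (id, seq, table, from, to, ...).
    return [(r[3], r[2], r[4]) for r in conn.execute(f"PRAGMA foreign_key_list({table})")]
```

Calling list_objects("table") or foreign_keys("orders") returns the same kind of information a database administrator would look up in the data dictionary.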
DATA WAREHOUSE
A data warehouse is generally used for large-scale
storage of historical records.
It is used for reporting purposes, whereas a
database is used for day-to-day online
transaction processing (OLTP).
Normally, the data in a data warehouse is
not supposed to be updated; only inserts
should happen in a data warehouse.
Ideally, current data should become part
of the data warehouse after some pre-decided
time period, after which it is used only
for analysis and not for transactions.
The most important difference between them:
in a database, data is normally kept in
normalized form, whereas in a data
warehouse it is purposely de-normalized to
avoid joins and save time when generating
large reports.
If you can't perform analytics to make
sense of your data, you'll have trouble
improving quality and costs, and you
won't succeed.
A database designed to handle
transactions isn't designed to handle
analytics. It isn't structured to do analytics
well. A data warehouse, on the other
hand, is structured to make analytics fast
and easy.
Data Warehouse
A decision support database that is
maintained separately from the
organization's operational database.
Database
Used for Online Transactional Processing (OLTP).
In a database the tables and joins are
complex since they are normalized (for
RDBMS). This is done to reduce redundant
data and to save storage space.
In a data warehouse the tables and
joins are simple since they are de-normalized.
This is done to reduce the
response time for analytical queries.
In a database, entity-relationship modeling
techniques are used for RDBMS database
design. In a data warehouse,
dimensional modeling techniques are used.
A database is optimized for write
operations. A data warehouse is optimized
for read operations.
Database performance is low for
analysis queries.
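The normalized versus de-normalized trade-off can be sketched as follows. This is an illustrative example only: the product, sale, and sale_wide tables and their rows are hypothetical, and SQLite stands in for both the operational database and the warehouse. The same report is answered once with a join and once from a single wide table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (OLTP-style): each product name is stored once.
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sale (sale_id INTEGER PRIMARY KEY, product_id INTEGER, amount REAL);
    INSERT INTO product VALUES (1, 'Screw'), (2, 'Bolt');
    INSERT INTO sale VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 30.0);

    -- De-normalized (warehouse-style): the name is repeated on every row.
    CREATE TABLE sale_wide AS
        SELECT s.sale_id, p.name AS product_name, s.amount
        FROM sale s JOIN product p ON p.product_id = s.product_id;
""")

# Report against the normalized tables: needs a join.
normalized = conn.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sale s JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()

# Same report against the de-normalized table: no join.
denormalized = conn.execute("""
    SELECT product_name, SUM(amount)
    FROM sale_wide GROUP BY product_name ORDER BY product_name
""").fetchall()
```

Both queries return the same totals; the de-normalized form pays for the simpler query with redundant storage of the product name.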
Inmon's definition
A data warehouse is a
-subject-oriented (high level),
-integrated,
-time-variant,
-nonvolatile
collection of data in support of management's
decision-making process.
Subject-oriented
Data warehouse is organized around
subjects such as
sales, product, customer.
It focuses on modeling and analysis of
data for decision makers.
Excludes (deletes) data that is not useful
in the decision support process.
Integration
Data Warehouse is constructed by
integrating multiple heterogeneous
sources.
Data preprocessing is applied to
ensure consistency.
[Figure: RDBMS, legacy system, and flat file sources pass through data processing and data transformation into the data warehouse.]
Integration
In terms of data:
encoding structures
measurement of attributes
physical attributes of data
naming conventions
data type format
remarks

Time-variant
Provides information from historical
perspective e.g. past 5-10 years
Every key structure contains either
implicitly or explicitly an element of
time
Nonvolatile
Data once recorded cannot be updated.
Data warehouse requires two operations
in data accessing:
Initial loading of data
Access of data
How does it differ from a
Database?
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites.
OLAP: On Line Analytical Processing
Describes processing at warehouse.
Warehouse is a Specialized DB
Standard DB (OLTP)          Warehouse (OLAP)
Mostly updates              Mostly reads
Many small transactions     Queries are long and complex
MB - GB of data             GB - TB of data
Current snapshot            History
Index/hash on p.k.          Lots of scans
Raw data                    Summarized, reconciled data
Thousands of users          Hundreds of users
(e.g., clerical users)      (e.g., decision-makers, analysts)
Operational v/s Information
System
Features                     Operational                          Information
Characteristics              Operational processing               Informational processing
Orientation                  Transaction                          Analysis
User                         Clerk, DBA, database professional    Knowledge workers
Function                     Day-to-day operation                 Decision support
Data                         Current                              Historical
View                         Detailed, flat relational            Summarized, multidimensional
DB design                    Application oriented                 Subject oriented
Unit of work                 Short, simple transaction            Complex query
Access                       Read/write                           Mostly read
Focus                        Data in                              Information out
Number of records accessed   Tens                                 Millions
Number of users              Thousands                            Hundreds
DB size                      100 MB to GB                         100 GB to TB
Priority                     High performance, high availability  High flexibility, end-user autonomy
Metric                       Transaction throughput               Query throughput
Data Warehouse
Definitions
A data warehouse is a type of computer
based information system developed to
provide an organization with business
intelligence to support decision making and
to monitor the operations in a company.

Integrates data from many different sources
and makes it available to end users in a
way they can understand and use in a
business context in a timely manner.
Primary difference between
a database and data
warehouse
Database stores information for a
single application, whereas a Data
warehouse stores information from
multiple databases, or multiple
applications, and external
information such as industry
information
The Database Approach
Database approach: data organised as
entities
Entity: object that has data
People
Events
Products
Character: smallest piece of data
Field: single piece of information about entity
Record: collection of fields
The Database Approach
(continued)
File: collection of related records
Database management system
(DBMS): program used to build
databases
Populates with data
Manipulates data
Query: message requesting access
to data
The Database Approach
(continued)
Database has security issues
Database administrator (DBA):
limits user access to database
Requires users to enter codes
DBMS bundled with fourth-generation
languages
Database Models
Database model: general logical
structure
How records stored in database
Records linked differently in different
models
Models constantly changing
The Relational Model

Relational Model: consists of tables
Based on relational algebra
Tuple: record
Attribute: field
Relation: table
Key: identifier field
Used to retrieve records
The Relational Model
(continued)
Primary key: unique key
Uniquely identifies record
Required in table
Composite key: combination of fields
Serves as primary key
Foreign key: shared field
Links tables
Join table: composite of tables
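The key types above can be sketched in SQLite via Python's sqlite3 module. The student, course, and enrolment tables are hypothetical examples, not taken from the slides. The enrolment table uses a composite primary key made of two foreign keys, and the final query joins the three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE course  (course_id  INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE enrolment (
        student_id INTEGER REFERENCES student(student_id),  -- foreign key
        course_id  INTEGER REFERENCES course(course_id),    -- foreign key
        PRIMARY KEY (student_id, course_id)                 -- composite key
    );
    INSERT INTO student VALUES (1, 'Asha'), (2, 'Ravi');
    INSERT INTO course VALUES (10, 'Data Warehousing');
    INSERT INTO enrolment VALUES (1, 10), (2, 10);
""")

def try_enrol(student_id, course_id):
    """Return True if the row is accepted, False if a key constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO enrolment VALUES (?, ?)", (student_id, course_id))
        return True
    except sqlite3.IntegrityError:
        return False

# Join: follow the foreign keys to combine rows from all three tables.
joined = conn.execute("""
    SELECT s.name, c.title
    FROM enrolment e
    JOIN student s ON s.student_id = e.student_id
    JOIN course  c ON c.course_id  = e.course_id
    ORDER BY s.name
""").fetchall()
```

Here try_enrol(1, 10) is rejected because the composite key already exists, and try_enrol(99, 10) is rejected because student 99 does not exist.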
The Relational Model
(continued)
Table relationships with other tables
One-to-many relationship: one
item in table linked to many items in
other table
Many-to-many relationship: many
items in table linked to many items
of other table
The Object-Oriented Model
Object-Oriented model: uses object-
oriented approach
Encapsulation: combined storage of data
and relevant procedures
Allows object to be planted in different
data sets
Inheritance: creates new object by
replicating characteristics of existing (parent)
object
Structured Query Language

Structured query language:
language of choice for DBMSs
Advantages
Standardised language
Used in many host languages
Portable
The Schema and Metadata

Schema: plan
Describes structure of database
Names and sizes of fields
Identifies primary keys
Data dictionary: repository of
information about data
The Schema and Metadata
(continued)
Metadata: data about data
Source of data
Tables related to data
Field information
Usage of data
Population rules
Data Warehouse
Fundamentals
Extraction, transformation, and loading
(ETL): a process that extracts information
from internal and external databases,
transforms the information using a common
set of enterprise definitions, and loads the
information into a data warehouse.
Data mart: contains a subset of data
warehouse information. The ETL process also
gathers data from the data warehouse and
passes it to the data marts.
Data Warehouse
Fundamentals
Components of a Data Warehouse
Metadata means data about data
Multidimensional Analysis and Data Mining
Databases contain information in a
series of two-dimensional tables.
In a data warehouse and data mart,
information is multidimensional: it
contains layers of columns and rows.
Scenario 1
ABC Pvt Ltd is a company with branches
at Mumbai, Delhi, Chennai and Bangalore.
The Sales Manager wants a quarterly sales
report. Each branch has a separate
operational system.
Scenario 1: ABC Pvt Ltd.
[Figure: the Mumbai, Delhi, Chennai and Bangalore branch systems; the Sales Manager asks for sales per item type per branch for the first quarter.]
ETL (Extract, Transform,
Load)
Improve the quality of data before
loading it into the warehouse.
Perform data cleaning and
transformation before loading the
data.
Then load it into the data warehouse.
Use query analysis tools to support
ad hoc queries.
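The ETL steps can be sketched end to end as follows. Everything here is an assumption made for illustration: the two branch extracts, their field names, and the transform and load helpers are hypothetical, and an in-memory SQLite database stands in for the warehouse.

```python
import sqlite3

# Extract: hypothetical rows pulled from two branch systems that use
# different field names, casing, and data types.
mumbai_rows = [{"item": "Screw ", "qty": "10", "amt_inr": "500"}]
delhi_rows = [{"ITEM": "screw", "QTY": 5, "AMOUNT": 250.0}]

def transform(row, branch):
    """Map one source row onto a common (branch, item, qty, amount) layout."""
    lowered = {key.lower(): value for key, value in row.items()}
    amount = lowered.get("amount", lowered.get("amt_inr"))
    item = str(lowered["item"]).strip().title()  # clean whitespace and casing
    return (branch, item, int(lowered["qty"]), float(amount))

def load(conn, rows):
    """Insert the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales "
        "(branch TEXT, item TEXT, qty INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)

warehouse = sqlite3.connect(":memory:")
cleaned = [transform(r, "Mumbai") for r in mumbai_rows]
cleaned += [transform(r, "Delhi") for r in delhi_rows]
load(warehouse, cleaned)

# An ad hoc query over the integrated data.
totals = warehouse.execute(
    "SELECT item, SUM(qty), SUM(amount) FROM sales GROUP BY item"
).fetchall()
```

Because both branches are cleaned to the same item spelling before loading, the query sees one "Screw" row group rather than two.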
OLAP

There are mainly two different types:
Multidimensional OLAP (MOLAP)
and Relational OLAP (ROLAP).
Hybrid OLAP (HOLAP) refers to technologies
that combine MOLAP and ROLAP.
In MOLAP, data is stored in a
multidimensional cube.
The storage is not in the relational
database, but in proprietary formats.
Advantages of MOLAP:

Excellent performance: MOLAP cubes
are built for fast data retrieval, and
are optimal for slicing and dicing
operations.
Can perform complex calculations: All
calculations have been pre-generated
when the cube is created. Hence,
complex calculations are not only
doable, but they return quickly.
Disadvantages of MOLAP:
Limited in the amount of data it can handle:
Because all calculations are performed when the cube is
built, it is not possible to include a large amount of data
in the cube itself. This is not to say that the data in the
cube cannot be derived from a large amount of data.
Indeed, this is possible. But in this case, only summary-level
information will be included in the cube itself.
Requires additional investment: Cube technology is
often proprietary and does not already exist in the
organization. Therefore, to adopt MOLAP technology,
chances are additional investments in human and capital
resources are needed.
ROLAP
This methodology relies on
manipulating the data stored in the
relational database to give the
appearance of traditional OLAP's
slicing and dicing functionality.
In essence, each action of slicing and
dicing is equivalent to adding a
"WHERE" clause in the SQL
statement.
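The idea that each slice or dice adds a WHERE condition can be sketched as below. The sales table, its rows, and the total_sales helper are hypothetical; each dimension value passed to the helper becomes one condition in the generated SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (product TEXT, region TEXT, quarter TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('Screw', 'North', 'Q1', 10), ('Screw', 'South', 'Q1', 20),
        ('Bolt',  'North', 'Q1', 30), ('Bolt',  'North', 'Q2', 40);
""")

DIMENSIONS = {"product", "region", "quarter"}

def total_sales(**filters):
    """Sum `amount`, adding one WHERE condition per fixed dimension value."""
    unknown = set(filters) - DIMENSIONS
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    # Column names come from the whitelist above; values are bound parameters.
    where = " AND ".join(f"{column} = ?" for column in filters) or "1 = 1"
    sql = f"SELECT COALESCE(SUM(amount), 0) FROM sales WHERE {where}"
    return conn.execute(sql, tuple(filters.values())).fetchone()[0]
```

total_sales() covers the whole table, total_sales(quarter="Q1") fixes one dimension (a slice), and total_sales(quarter="Q1", region="North") fixes two (a dice).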
Advantages of ROLAP:
Can handle large amounts of data: The data
size limitation of ROLAP technology is the
limitation on data size of the underlying
relational database.
In other words, ROLAP itself places no
limitation on data amount.
Can leverage functionalities inherent in the relational database:
Often, the relational database already comes with
a host of functionalities. ROLAP technologies,
since they sit on top of the relational database,
can therefore leverage these functionalities.
Disadvantages of ROLAP:
Performance can be slow: Because each
ROLAP report is essentially a SQL query (or
multiple SQL queries) in the relational database,
the query time can be long if the underlying data
size is large.
Limited by SQL functionalities: Because
ROLAP technology mainly relies on generating
SQL statements to query the relational database,
and SQL statements do not fit all needs (for
example, it is difficult to perform complex
calculations using SQL), ROLAP technologies are
therefore traditionally limited by what SQL can
do.
An example of a
Data warehouse:
A star schema data warehouse has a central table (the fact table)
surrounded by dimension tables.
Dimension: Orders
- Order#
- Ordertype
Dimension: Products
- Product#
- Product-name
- Price
Fact table: Orderdetails
- Product#
- Order#
- Qty
- Date#
- Salesman#
Dimension: Salesmen
- Salesman#
- Salesman-name
Dimension: Time
- Date#
- Date-Name
The fixed database structure
implies that application programs
(drilling functions/aggregates) can
be generated automatically!
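The star schema above could be declared roughly as follows. This is a sketch under assumptions: the "#" column names from the slide are rewritten with a `_no` suffix, the Time dimension is named `time_dim`, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per member.
    CREATE TABLE products (product_no INTEGER PRIMARY KEY, product_name TEXT, price REAL);
    CREATE TABLE orders   (order_no INTEGER PRIMARY KEY, ordertype TEXT);
    CREATE TABLE salesmen (salesman_no INTEGER PRIMARY KEY, salesman_name TEXT);
    CREATE TABLE time_dim (date_no INTEGER PRIMARY KEY, date_name TEXT);

    -- Fact table: only foreign keys to the dimensions plus the measure.
    CREATE TABLE orderdetails (
        product_no  INTEGER REFERENCES products(product_no),
        order_no    INTEGER REFERENCES orders(order_no),
        salesman_no INTEGER REFERENCES salesmen(salesman_no),
        date_no     INTEGER REFERENCES time_dim(date_no),
        qty         INTEGER
    );

    INSERT INTO products VALUES (1, 'Screw', 2.0), (2, 'Bolt', 5.0);
    INSERT INTO orders   VALUES (100, 'retail');
    INSERT INTO salesmen VALUES (7, 'Smith');
    INSERT INTO time_dim VALUES (20160101, '1 Jan 2016');
    INSERT INTO orderdetails VALUES (1, 100, 7, 20160101, 10), (2, 100, 7, 20160101, 4);
""")

# A typical star query: join the fact table to the dimensions it needs,
# then aggregate the measure.
turnover_by_salesman = conn.execute("""
    SELECT s.salesman_name, SUM(f.qty * p.price)
    FROM orderdetails f
    JOIN products p ON p.product_no  = f.product_no
    JOIN salesmen s ON s.salesman_no = f.salesman_no
    GROUP BY s.salesman_name
""").fetchall()
```

Every analytical query against this layout has the same shape (fact table joined to a few dimensions, then grouped), which is what makes it possible to generate drill and aggregate queries mechanically.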
Conceptual Modeling of Data Warehouses

Star schema: A fact table in the middle
connected to a set of dimension tables
Snowflake schema: A refinement of star
schema where some dimensional hierarchy is
normalized into a set of smaller dimension
tables, forming a shape similar to snowflake
Roll up to the top level:
Salesman#   Product-name   Turnover   Branch-office#
Smith       Screw          10,000     LA
Smith       Bolt           30,000     LA
Smith       Nut            60,000     LA
Jones       Screw          20,000     SF
Jones       Nut            40,000     SF
...
Roll up can be executed by
removing one or more arguments from
the GROUP BY clause.

Roll up to the product level.
Productname   Turnover
Screw         100,000
Bolt          200,000
Nut           300,000

Roll up to the top level.
Toplevel      Turnover
              600,000
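Rolling up by dropping GROUP BY columns can be sketched as below. Only the five rows visible on the slide are loaded, so the totals here are smaller than the slide's product-level and top-level figures, which include rows elided by "...". The table name and the roll_up helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE turnover (salesman TEXT, product TEXT, turnover INTEGER, branch TEXT);
    INSERT INTO turnover VALUES
        ('Smith', 'Screw', 10000, 'LA'), ('Smith', 'Bolt', 30000, 'LA'),
        ('Smith', 'Nut',   60000, 'LA'), ('Jones', 'Screw', 20000, 'SF'),
        ('Jones', 'Nut',   40000, 'SF');
""")

COLUMNS = {"salesman", "product", "branch"}

def roll_up(*group_by):
    """Aggregate turnover at the level given by the GROUP BY columns."""
    if not set(group_by) <= COLUMNS:
        raise ValueError("unknown grouping column")
    if group_by:
        cols = ", ".join(group_by)
        sql = (f"SELECT {cols}, SUM(turnover) FROM turnover "
               f"GROUP BY {cols} ORDER BY {cols}")
    else:
        # No grouping columns left: the top level, a single grand total.
        sql = "SELECT SUM(turnover) FROM turnover"
    return conn.execute(sql).fetchall()
```

roll_up("salesman", "product") returns the detailed rows, roll_up("product") removes one argument to reach the product level, and roll_up() removes all of them to reach the top level.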
A star schema DW can be illustrated as a multidimensional
cube:
Solution 1: ABC Pvt Ltd.
Extract sales information from each
database.
Store the information in a common
repository at a single site.
Solution 1: ABC Pvt Ltd.
[Figure: the Mumbai, Delhi, Chennai and Bangalore systems feed the Data Warehouse; query & analysis tools produce the report for the Sales Manager.]
Data Warehousing
Architecture
[Figure: data sources (operational DBs, external sources) are extracted, transformed, loaded and refreshed into the data warehouse and data marts (reconciled data), supported by a metadata repository and monitoring & administration; OLAP servers serve the tools for analysis, query/reporting and data mining.]
Data Warehouse Architecture
Data Warehouse server
almost always a relational
DBMS, rarely flat files
OLAP servers
to support and operate on multi-
dimensional data structures
Clients
Query and reporting tools
Analysis tools
Data mining tools
OLTP
OLTP (On-line Transaction Processing)
Operational data
To control and run fundamental business
tasks
Transactions: INSERT, UPDATE, DELETE.
Detailed and current data
Relatively standardized and simple queries
Highly normalized (3NF) with many tables
OLAP
OLAP (On-line Analytical Processing)
Low volume of transactions.
To help with planning, problem solving,
and decision support
Queries are often complex and
involve aggregations
Typically de-normalized with fewer tables;
use of star and/or snowflake schemas.
Historical data, stored in multi-
dimensional schemas (usually star
schema).
OLTP (On-line Transaction
Processing) vs. OLAP (On-
line Analytical
Processing)
We can divide IT systems into
transactional (OLTP) and analytical
(OLAP). In general we can assume
that OLTP systems provide source
data to data warehouses, whereas
OLAP systems help to analyze it.
Data Mart
Introduction

OLAP (Online Analytical Processing)
designates a category of
applications and technologies that
allow the collection, storage,
manipulation and reproduction of
multidimensional data, with the
goal of analysis.
OLTP vs. OLAP
OLTP is highly normalized with many
tables (RDBMS).
OLAP is typically de-normalized with fewer
tables (use of star and/or snowflake
schemas).
Difference between OLTP and
OLAP
OLTP (On-line Transaction Processing) is characterized by
a large number of short on-line transactions (INSERT, UPDATE,
DELETE). The main emphasis for OLTP systems is put on very
fast query processing, maintaining data integrity in multi-access
environments and an effectiveness measured by
number of transactions per second. In an OLTP database there is
detailed and current data, and the schema used to store
transactional databases is the entity model (usually
3NF).

OLAP (On-line Analytical Processing) is characterized by a
relatively low volume of transactions. Queries are often very
complex and involve aggregations. For OLAP systems,
response time is the effectiveness measure. OLAP applications
are widely used by Data Mining techniques. In an OLAP database
there is aggregated, historical data, stored in multi-dimensional
schemas (usually star schema).
Three types of data warehouse
data structure
ROLAP (e.g. star schema)
MOLAP (cubes, slicing, dicing)
Hybrid (HOLAP)
Model of OLAP
OLAP Models:
1: Relational (ROLAP): uses relational
star schema
2: Multidimensional (MOLAP): uses
data cubes
1: ROLAP Model
2: Multi-Dimensional Model
OLAP Servers
Predefined hierarchy allows logical
pre-aggregation and, conversely,
allows for a logical drill-down.
Supports common analytical
operations:
Consolidation.
Drill-down.
Slicing and dicing.
Multi-Dimensional OLAP
Servers
Roll Up (consolidation) - aggregation of data such as simple
roll-ups or complex expressions involving inter-related
data, for example monthly data to
quarterly data.

Drill-Down - is the reverse of consolidation and
involves displaying the detailed data that
comprises the consolidated data, for example
quarterly data to monthly data.

Slicing and Dicing (also called pivoting or rotating) -
refers to the ability to look at the data from
different viewpoints.
Slicing
Multi-Dimensional OLAP
servers
Can store data in a compressed form by
dynamically selecting physical storage
organizations and compression techniques
that maximize space utilization.

Dense data (i.e., data that exists for a high
percentage of cells) can be stored
separately from sparse data (i.e., a
significant percentage of cells are empty).
ON-LINE ANALYTICAL
PROCESSING
Demand for OLAP
To develop Data Mart, three
approaches
In all approaches, Data Marts rest
on Dimensional Model
Data Marts are sufficient for basic
data analysis
Users need to go beyond such
basic analysis

Demand for OLAP
Need for Multidimensional
Analysis
Fast Access & Powerful
Calculations
Limitations of other analysis
methods like:
SQL
Spreadsheets
Report Writers
Demand for OLAP
Traditional tools of report writers,
query products, spreadsheets, &
language interfaces do not match
the user expectations as far as
performing multidimensional
analysis with complex calculations
is concerned.
Tools used with OLTP and basic DW
environments do not match up to
the task

OLAP is the Answer!
OLAP is a category of software technology
that enables analysts, managers, and
executives to gain insight into the data
through fast, consistent, interactive, access in
a wide variety of possible views of
information that has been transformed from
raw data to reflect the real dimensionality of
the enterprise as understood by the user.

Why is OLAP useful?
Facilitates multidimensional data
analysis by pre-computing aggregates
across many sets of dimensions
Provides for:
Greater speed and responsiveness
Improved user interactivity
Data, Data everywhere
yet ...
I can't find the data I need
data is scattered over the network
many versions, subtle differences

I can't get the data I need
need an expert to get the data

I can't understand the data I found
available data poorly documented

I can't use the data I found
results are unexpected
data needs to be transformed
from one form to other
What is a Data
Warehouse?
A single, complete and
consistent store of data
obtained from a variety
of different sources
made available to end
users in a way they can
understand and use in a
business context.

[Barry Devlin]
What are the users
saying...
Data should be integrated
across the enterprise
Summary data has a real
value to the organization
Historical data holds the
key to understanding data
over time
What-if capabilities are
required

What is Data
Warehousing?
A process of
transforming data into
information and making
it available to users in a
timely enough manner
to make a difference

[Forrester Research, April 1996]
Data Warehouses
A data warehouse is based on a
multidimensional data model which
views data in the form of a data cube
A data cube allows data to be modeled
and viewed in multiple dimensions

Data Warehousing --
It is a process
Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
Thus making decisions that
were not previously possible
A decision support database
maintained separately from
the organization's operational
database
Data Integration
Data integration means combining Data
coming from different sources and
providing users with a unified view of
these data.
This process becomes significant in a
variety of situations both commercial
(when two similar companies need to
merge their Databases) and scientific
(combining research results from different
repositories)
Consistency
A simple rule of consistency may state that the
Gender column of a database may only have
the values Male, Female or Unknown. If a
user attempts to enter something else, say A,
then a database consistency rule kicks in and
disallows the entry of such a value.
Consistency means that if anyone inserts data that
breaks a rule, such as a primary key or NOT NULL
constraint, the system does not accept it.
Consistency
Consistency rules also serve another important function:
they make the application developer's work
easier. It is usually much easier to define
consistency rules at the database level rather
than defining them in the application that
connects to the database.
Data integrity is imposed within a database at
its design stage through the use of standard
rules and procedures, and is maintained
through the use of error checking and
validation routines.
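The Gender rule described above can be declared at the database level as a CHECK constraint. The sketch below uses SQLite through Python's sqlite3 module; the person table and the insert_person helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        gender    TEXT NOT NULL
                  CHECK (gender IN ('Male', 'Female', 'Unknown'))
    )
""")

def insert_person(person_id, name, gender):
    """Return True if the row was accepted, False if a rule rejected it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO person VALUES (?, ?, ?)", (person_id, name, gender)
            )
        return True
    except sqlite3.IntegrityError:
        # Raised for CHECK, NOT NULL, and primary key violations alike.
        return False
```

Because the rules live in the table definition, every application that connects to the database gets the same validation without repeating it in application code.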
Subject Oriented
Data warehouses are designed to help you
analyze data. For example, to learn more
about your company's sales data, you can
build a warehouse that concentrates on
sales. Using this warehouse, you can
answer questions like "Who was our best
customer for this item last year?" This
ability to define a data warehouse by
subject matter, sales in this case, makes
the data warehouse subject oriented.
Integrated

Integration is closely related to
subject orientation. Data warehouses
must put data from disparate sources
into a consistent format. They must
resolve such problems as naming
conflicts and inconsistencies among
units of measure. When they achieve
this, they are said to be integrated.
Nonvolatile
Nonvolatile means that, once entered
into the warehouse, data should not
change. This is logical because the
purpose of a warehouse is to enable
you to analyze what has occurred.
The time horizon for the data warehouse is significantly longer than that of operational systems.

Time Variant
In order to discover trends in business,
analysts need large amounts of data.
All data in the data warehouse is identified
with a particular time period.
The data in a data warehouse provides
information from the historical point of view.
Operational database: current value data
Data warehouse data: provides information
from a historical perspective (e.g., past 5-10
years)
What is a Data
Warehouse?
A data warehouse is a relational database
that is designed for query and analysis
rather than for transaction processing. It
usually contains historical data derived
from transaction data, but it can include
data from other sources. It separates
analysis workload from transaction
workload and enables an organization to
consolidate data from several sources.
Data Warehouse
In addition to a relational database, a data
warehouse environment includes
an extraction, transportation,
transformation, and loading (ETL) solution,
an online analytical processing (OLAP)
engine,
client analysis tools, and other applications
that manage the process of gathering data
and delivering it to business users.
Application Areas
Industry                 Application
Finance                  Credit card analysis
Insurance                Claims, fraud analysis
Telecommunication        Call record analysis
Transport                Logistics management
Consumer goods           Promotion analysis
Data service providers   Value added data
Utilities                Power usage analysis
Application-Orientation vs.
Subject-Orientation
Application-Orientation (Operational Database): Loans, Credit Card, Trust, Savings
Subject-Orientation (Data Warehouse): Customer, Vendor, Product, Activity
To summarize ...
OLTP Systems are
used to run a
business

The Data Warehouse
helps to optimize
the business
I. Data Warehouses:
Architecture, Design &
Construction
DW Architecture
Loading, refreshing
Structuring/Modeling
DWs and Data Marts
Query Processing

demos, labs

Data Warehouse
Architecture
[Figure: relational databases, ERP systems, purchased data and legacy data feed extraction and cleansing, then an optimized loader, into the data warehouse engine, which supports analyze and query; a metadata repository sits alongside.]
Components of the
Warehouse
Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Data Mining tools
Loading the Warehouse
Cleaning the data
before it is loaded
Data Integrity Problems
Same person, different spellings
Agarwal, Agrawal, Aggarwal etc...
Multiple ways to denote company name
Persistent Systems, PSPL, Persistent Pvt. LTD.
Use of different names
mumbai, bombay
Different account numbers generated by different
applications for the same customer
Required fields left blank
Invalid product codes collected at point of sale
manual entry leads to mistakes
in case of a problem use 9999999
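A cleaning step for the problems listed above might look like the sketch below. The alias tables, field names, and clean_record helper are all hypothetical; a real project would maintain such mappings as reference data rather than hard-coding them.

```python
# Hypothetical lookup tables mapping known variants to one standard form.
CITY_ALIASES = {"bombay": "Mumbai", "mumbai": "Mumbai"}
COMPANY_ALIASES = {
    "persistent systems": "Persistent Systems",
    "pspl": "Persistent Systems",
    "persistent pvt. ltd.": "Persistent Systems",
}
# Codes that operators key in when the real product code is unknown.
PLACEHOLDER_CODES = {"9999999"}

def clean_record(record):
    """Return (cleaned_record, problems) for one source row."""
    problems = []
    cleaned = dict(record)

    # Standardize names that appear under several spellings.
    city = (record.get("city") or "").strip().lower()
    cleaned["city"] = CITY_ALIASES.get(city, record.get("city"))
    company = (record.get("company") or "").strip().lower()
    cleaned["company"] = COMPANY_ALIASES.get(company, record.get("company"))

    # Flag rows that cannot be repaired automatically.
    if not record.get("customer_name"):
        problems.append("required field customer_name is blank")
    if record.get("product_code") in PLACEHOLDER_CODES:
        problems.append("placeholder product code")
    return cleaned, problems
```

Rows that come back with a non-empty problems list can be routed to manual review instead of being loaded.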

Data Granularity in
Warehouse
Summarized data stored
reduce storage costs
reduce cpu usage
increases performance since smaller
number of records to be processed
design around traditional high level
reporting needs
tradeoff with volume of data to be
stored and detailed usage of data
Granularity in Warehouse
Can not answer some questions with
summarized data
Did Anand call Seshadri last month? Not
possible to answer if total duration of
calls by Anand over a month is only
maintained and individual call details
are not.
Detailed data too voluminous

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1      8
       p1       c1        2      44
       p1       c2        2      4

Answer: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date

sale   prodId   storeId   date   amt
       p1       c1        1      12
       p2       c1        1      11
       p1       c3        1      50
       p2       c2        1      8
       p1       c1        2      44
       p1       c2        2      4

ans    date   sum
       1      81
       2      48
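Both aggregate queries can be run as written against the sale table from the slide. The sketch below loads those six rows into an in-memory SQLite database; only the ORDER BY clause is an addition, to make the row order deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    INSERT INTO sale VALUES
        ('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
        ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4);
""")

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT sum(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, sum(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()
```

day1_total is 81 and by_day is [(1, 81), (2, 48)], matching the answers on the slides.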

Aggregation Using
Hierarchies
Hierarchy: customer -> region -> country

day 2    c1    c2    c3
p1       44    4
p2

day 1    c1    c2    c3
p1       12          50
p2       11    8

         region A    region B
p1       56          54
p2       11          8
(customer c1 in Region A;
customers c2, c3 in Region B)
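Climbing the customer-to-region hierarchy amounts to joining the facts to a mapping table and grouping by the higher level. In this sketch the sale rows and the region assignment come from the slides, while the customer table and its column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
    INSERT INTO sale VALUES
        ('p1', 'c1', 1, 12), ('p2', 'c1', 1, 11), ('p1', 'c3', 1, 50),
        ('p2', 'c2', 1, 8),  ('p1', 'c1', 2, 44), ('p1', 'c2', 2, 4);

    -- Hierarchy level: each customer belongs to exactly one region.
    CREATE TABLE customer (storeId TEXT PRIMARY KEY, region TEXT);
    INSERT INTO customer VALUES ('c1', 'A'), ('c2', 'B'), ('c3', 'B');
""")

# Roll up from customer to region, summing over both days.
by_region = conn.execute("""
    SELECT s.prodId, c.region, SUM(s.amt)
    FROM sale s JOIN customer c ON c.storeId = s.storeId
    GROUP BY s.prodId, c.region
    ORDER BY s.prodId, c.region
""").fetchall()
```

The result reproduces the region table above: p1 totals 56 in region A and 54 in region B, and p2 totals 11 and 8.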

OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up,
from higher level summary to lower level
summary or detailed data
Slice and dice:
project and select
Drill Down/Up:
Drilling down or up is a specific analytical
technique whereby the user navigates
among levels of data ranging from the
Pivot (rotate):
Rotates the data axis to view the data from different perspectives.
Groups data with different dimensions.

Slice
Performs a selection on one dimension of the given cube, resulting
in a sub-cube.
Reduces the dimensionality of the cubes.
Sets one or more dimensions to specific values and keeps a subset
of dimensions for selected values.
Dice
Define a sub-cube by performing a selection of one or more
dimensions.
Refers to range select condition on one dimension, or to select
condition on more than one dimension.
Reduces the number of member values of one or more dimensions.
Other OLAP
Operations

o Moving Averages
o Growth Rates
o Depreciation
o Currency Conversion
o Statistical Functions
o Top N or Bottom N queries

Conceptual vs. Actual
The cube is a logical way of
visualizing the data in an OLAP setting
Not how the data is actually
represented on disk
Two ways of storing data:
ROLAP: Relational OLAP
MOLAP: Multidimensional OLAP

OLAP & CUBE
Construction of the data cube
is key to the operation of OLAP
The computation process
creates a set of aggregates on
the various dimensions of the
data
The CUBE operator

Approaches to OLAP
Servers
It is all about which DBMS you
choose to store your data
warehouse data
ROLAP (star schema, snowflake
schema)
MOLAP (roll up, roll down, slice,
pivot)
Both: HOLAP
Approaches to OLAP Servers
Three possibilities for OLAP servers
(1) Relational OLAP (ROLAP)
Relational and specialized relational DBMS to
store and manage warehouse data
OLAP middleware to support missing pieces
(2) Multidimensional OLAP (MOLAP)
Array-based storage structures
Direct access to array data structures
(3) Hybrid OLAP (HOLAP)
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools

ROLAP
Special schema design: star, snowflake
Proven technology (relational model,
DBMS), tends to outperform specialized
MDDB especially on large data sets
Products
IBM DB2, Oracle, Sybase IQ,
RedBrick, Informix
ROLAP
Defines complex, multi-dimensional data with
simple model
Reduces the number of joins a query has to
process
Allows the data warehouse to evolve with
relatively low maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already
selected technologies.
BUT!!!
SQL for multi-dimensional manipulation of
calculations.

MOLAP
MDDB: a special-purpose data model
Facts stored in multi-dimensional
arrays
Dimensions used to index array
MOLAP (multidimensional OLAP) tools
utilize a pre-calculated data set,
commonly referred to as a data cube,
that contains all the possible answers
to a given range of questions.
MOLAP tools feature very fast
response, and the ability to quickly
write back data into the data set
(budgeting and forecasting are
common applications). Primary
downsides of MOLAP tools are.
Products
Pilot, Arbor Essbase, Gentia
OLAP Needs
User Needs
Multidimensional view
Excellent Performance
Analytical Flexibility
Real-Time Data Access
High Data Capacity

OLAP Needs: User Needs
Excellent Performance
RDBMSs must use several summary tables to store the aggregates
that a MOLAP could store in just one cube.

For example, consider a Sales indicator with three dimensions: Months, Regions,
and Products. The indicator cube will contain seven sets of aggregates:

Sales by month
Sales by product
Sales by region
Sales by month and product
Sales by month and region
Sales by product and region
Sales by month, product, and region

To store these aggregates in an RDBMS, you'd have to create seven summary
tables, one for each aggregate set.
HOW MANY SUMMARY TABLES FOR 6 DIMENSIONS?
(Separate fact table and shrunken dimension table approach for storing
aggregates)
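
A minimal sketch of the counting argument, assuming (as the slide's count of seven for three dimensions suggests) that one summary table is needed per non-empty combination of dimensions. The function name and the placeholder dimension names `d1` to `d6` are illustrative only.

```python
from itertools import combinations

def aggregate_sets(dimensions):
    """Return every non-empty combination of dimensions to group by."""
    return [
        combo
        for size in range(1, len(dimensions) + 1)
        for combo in combinations(dimensions, size)
    ]

# Three dimensions, as in the Sales example.
three = aggregate_sets(["month", "region", "product"])
print(len(three))  # 7, i.e. 2**3 - 1

# Six placeholder dimensions (hypothetical names).
six = aggregate_sets(["d1", "d2", "d3", "d4", "d5", "d6"])
print(len(six))  # 63, i.e. 2**6 - 1
```

Under this assumption the count grows as 2^n - 1, which is why summary tables in an RDBMS become hard to manage as dimensions are added.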
OLAP Needs: User Needs
Analytical Flexibility

Both ROLAP & MOLAP tools offer comparable performance for
Comparative Analysis
Roll-up and Drill-down
Slicing & Dicing

OLAP Needs: User Needs
Real-Time Data Access
MOLAP tools load data into the multidimensional cubes.
Consequently, the data being accessed is only as recent
as the last load.
Some applications require real-time data access
Process of continually refreshing the data attaches higher
costs to operating a MOLAP system
Some MOLAP tools offer reach-through functionality to
access volatile data stored outside the MDDB
Unfortunately, users must be aware of the underlying
database structure
Relational data access is too complex for the typical user

OLAP Needs: User Needs
Real-Time Data Access
ROLAP tools maintain a constant link to the
operational RDBMS, which provides users with
up-to-the-minute, accurate data
(Real-Time Data Warehousing)
Industries & organizations with highly volatile
data particularly benefit from this access to
live, operational data.

ROLAP (relational OLAP) tools do not
use pre-calculated data cubes.
Instead, they intercept the query and
pose the question to the standard
relational database and its tables in
order to bring back the data required
to answer the question.
ROLAP tools feature the ability to ask any
question (you are not limited to the contents of
a cube)
and the ability to drill down to the lowest level
of detail in the database.
Primary downsides of ROLAP tools are slow
response and some limitations on scalability
(depending on the technology architecture that
is utilized). The most common examples of
ROLAP tools are MicroStrategy and Sterling
(Information Advantage).
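
The idea that a ROLAP tool turns a multidimensional question into SQL against relational tables can be sketched roughly as follows. This is an illustration only, not how any particular product works. The function name, the table `sale`, and the columns `amt`, `prodId`, and `storeId` are assumptions chosen for the example; at least one group-by column is assumed, and identifiers are assumed to come from trusted metadata.

```python
def build_rolap_query(fact_table, measure, group_by, filters=None):
    """Translate a simple multidimensional request into one SQL statement.

    group_by: dimension columns to aggregate over (at least one).
    filters:  optional {column: value} pairs that slice the data.
    """
    filters = filters or {}
    select_cols = ", ".join(group_by)
    sql = f"SELECT {select_cols}, SUM({measure}) AS total FROM {fact_table}"
    if filters:
        # Values are passed as parameters; only column names are interpolated.
        where = " AND ".join(f"{col} = ?" for col in filters)
        sql += f" WHERE {where}"
    sql += f" GROUP BY {select_cols}"
    return sql, list(filters.values())

# "Total amount per product, for store c1 only"
sql, params = build_rolap_query("sale", "amt", ["prodId"], {"storeId": "c1"})
print(sql)
print(params)
```

Because the query is built at request time, any combination of dimensions can be asked for, at the price of running the aggregation on every request.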
HOLAP (hybrid OLAP)
HOLAP (hybrid OLAP) addresses the
shortcomings of both of these technologies by
combining the capabilities of both approaches.
HOLAP tools can utilize both pre-calculated
cubes and relational data sources.
The most common example of HOLAP
architecture is OLAP services in Microsoft SQL
Server 7.0. OLAP vendors of all stripes are
working to make their products marketable as
"hybrid" as quickly as possible.
OLAP Needs: User Needs
High Capacity Data
MOLAP products are limited by the size of the
cube defined by the multidimensional view.
When dimension elements are predefined, the
scope of available data is limited at the onset.
ROLAP tools circumvent this barrier. Dynamic
dimensions are not stored in the predefined
multidimensional model, but fetched at run
time from the RDBMS.

OLAP Needs: Needs
Easy Development
MOLAP development is straightforward: it requires no
fine tuning and creates its own aggregates.
ROLAP tools, on the other hand, require a specific
schema for the relational database.
Skilled DBAs must provide the appropriate schema
(star or snowflake schema), tune the database, and
create the appropriate summary tables.
However, many ROLAP tools are metadata-driven,
which means the multidimensional view is generated
and maintained more easily.

Hybrid OLAP - HOLAP
Best of both worlds
Storing detailed data in RDBMS
Storing aggregated data in MDBMS
User access via MOLAP tools

HOLAP
[Figure: HOLAP architecture with an RDBMS server, an MDBMS server, and a client. Labels: multidimensional data, meta data, derived data, user data, multidimensional access, multidimensional viewer, relational viewer, SQL-Read, SQL-Reach Through.]

ROLAP, MOLAP, or HOLAP
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD /MOLAP solution for your data mart

IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.

IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider an HOLAP for your data mart

Conclusions
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDDB -> Cube structures
ROLAP or MOLAP: Data models used play major role in
performance differences
MOLAP: for summarized and relatively smaller volumes
of data (100 GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently
data warehouses are predominantly built using
RDBMSs/ROLAP.
HOLAP is emerging as the OLAP server of choice

Warehouse Models &
Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Star

product
prodId  name  price
p1      bolt  10
p2      nut   5

store
storeId  city
c1       nyc
c2       sfo
c3       la

sale
orderId  date    custId  prodId  storeId  qty  amt
o100     1/7/97  53      p1      c1       1    12
o102     2/7/97  53      p2      c1       2    11
105      3/8/97  111     p1      c3       5    50

customer
custId  name   address    city
53      joe    10 main    sfo
81      fred   12 main    sfo
111     sally  80 willow  la

Star Schema

sale: orderId, date, custId, prodId, storeId, qty, amt
product: prodId, name, price
customer: custId, name, address, city
store: storeId, city
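
As a rough illustration, the star schema above can be built and queried with Python's built-in sqlite3 module. The table and column names and the sample rows come from the slides; the column types and the choice of SQLite are assumptions made only so the example runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId INTEGER PRIMARY KEY, name TEXT,
                       address TEXT, city TEXT);
-- The fact table points at each dimension through a foreign key.
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    TEXT,
    custId  INTEGER REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?)",
                 [("p1", "bolt", 10), ("p2", "nut", 5)])
conn.executemany("INSERT INTO store VALUES (?, ?)",
                 [("c1", "nyc"), ("c2", "sfo"), ("c3", "la")])
conn.executemany("INSERT INTO customer VALUES (?, ?, ?, ?)",
                 [(53, "joe", "10 main", "sfo"),
                  (81, "fred", "12 main", "sfo"),
                  (111, "sally", "80 willow", "la")])
conn.executemany("INSERT INTO sale VALUES (?, ?, ?, ?, ?, ?, ?)",
                 [("o100", "1/7/97", 53, "p1", "c1", 1, 12),
                  ("o102", "2/7/97", 53, "p2", "c1", 2, 11),
                  ("105", "3/8/97", 111, "p1", "c3", 5, 50)])

# A typical star join: the fact table joined to two of its dimensions.
rows = conn.execute("""
    SELECT p.name, s.city, SUM(f.amt)
    FROM sale AS f
    JOIN product AS p ON p.prodId = f.prodId
    JOIN store   AS s ON s.storeId = f.storeId
    GROUP BY p.name, s.city
    ORDER BY p.name, s.city
""").fetchall()
print(rows)  # [('bolt', 'la', 50), ('bolt', 'nyc', 12), ('nut', 'nyc', 11)]
```

Every dimension is one join away from the fact table, which is what keeps queries on a star schema simple.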

Terms
Fact table: sale
Dimension tables: product, customer, store
Measures: qty, amt

Dimension Hierarchies

store -> city -> region
store -> sType

sType
tId  size   location
t1   small  downtown
t2   large  suburbs

store
storeId  cityId  tId  mgr
s5       sfo     t1   joe
s7       sfo     t2   fred
s9       la      t1   nancy

city
cityId  pop  regId
sfo     1M   north
la      5M   south

region
regId  name
north  cold region
south  warm region

snowflake schema
constellations

Cube

Fact table view:

sale
prodId  storeId  amt
p1      c1       12
p2      c1       11
p1      c3       50
p2      c2       8

Multi-dimensional cube:

     c1  c2  c3
p1   12      50
p2   11  8

dimensions = 2

3-D Cube

Fact table view:

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Multi-dimensional cube:

day 1
     c1  c2  c3
p1   12      50
p2   11  8

day 2
     c1  c2  c3
p1   44  4
p2

dimensions = 3
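
One simple way to picture the step from fact table to cube is a dictionary keyed by the dimension values. This is a sketch for illustration, not a description of how a MOLAP engine stores data; the helper name `slice_by_day` is made up for the example.

```python
# Fact table rows from the slide: (prodId, storeId, date, amt)
fact_rows = [
    ("p1", "c1", 1, 12),
    ("p2", "c1", 1, 11),
    ("p1", "c3", 1, 50),
    ("p2", "c2", 1, 8),
    ("p1", "c1", 2, 44),
    ("p1", "c2", 2, 4),
]

# The cube maps each (prodId, storeId, date) cell to its amount.
# Cells with no sale are simply absent.
cube = {(prod, store, day): amt for prod, store, day, amt in fact_rows}

def slice_by_day(cube, day):
    """Fix the date dimension to get a 2-D (product x store) view."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}

day1 = slice_by_day(cube, 1)
print(cube[("p1", "c3", 1)])        # 50
print(cube.get(("p2", "c3", 2)))    # None: an empty cell
print(day1)
```

Slicing on day 1 returns exactly the 2-D cube shown for the two-dimensional case.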

Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
        WHERE date = 1

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result: 81

Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
        GROUP BY date

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

ans
date  sum
1     81
2     48

Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
        GROUP BY date, prodId

sale
prodId  storeId  date  amt
p1      c1       1     12
p2      c1       1     11
p1      c3       1     50
p2      c2       1     8
p1      c1       2     44
p1      c2       2     4

Result:
prodId  date  amt
p1      1     62
p2      1     19
p1      2     48

rollup: from the detailed sale rows to the aggregated result
drill-down: from the aggregated result back to the detailed rows
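
The three aggregate queries from these slides can be checked with a small script. This is a sketch using Python's built-in sqlite3 module, with the sale rows taken from the slides and integer day numbers as shown there; the ORDER BY clauses are added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)"
)
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
)

# Add up amounts for day 1.
day1_total = conn.execute(
    "SELECT SUM(amt) FROM sale WHERE date = 1"
).fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

# Roll-up: summing the (day, product) totals reproduces the per-day totals.
rolled_up = {}
for prod, day, total in by_day_product:
    rolled_up[day] = rolled_up.get(day, 0) + total

print(day1_total)      # 81
print(by_day)          # [(1, 81), (2, 48)]
print(by_day_product)  # [('p1', 1, 62), ('p2', 1, 19), ('p1', 2, 48)]
print(rolled_up)       # {1: 81, 2: 48}
```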

Aggregates
Operators: sum, count, max, min,
median, avg
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
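
Aggregating along a dimension hierarchy with a HAVING clause might look like the sketch below. The sale rows and the store-to-city mapping reuse values from earlier slides; the region `east` for nyc is an assumption invented for this example, as is the threshold of two sales.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale  (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER);
CREATE TABLE store (storeId TEXT PRIMARY KEY, cityId TEXT);
CREATE TABLE city  (cityId TEXT PRIMARY KEY, regId TEXT);
INSERT INTO sale VALUES
    ('p1','c1',1,12), ('p2','c1',1,11), ('p1','c3',1,50),
    ('p2','c2',1,8),  ('p1','c1',2,44), ('p1','c2',2,4);
INSERT INTO store VALUES ('c1','nyc'), ('c2','sfo'), ('c3','la');
-- Assumption: the region 'east' for nyc is made up for this example.
INSERT INTO city VALUES ('nyc','east'), ('sfo','north'), ('la','south');
""")

# Average sale amount by region, walking store -> city -> region,
# keeping only regions with at least two sales (HAVING).
rows = conn.execute("""
    SELECT c.regId, ROUND(AVG(f.amt), 2) AS avg_amt
    FROM sale AS f
    JOIN store AS s ON s.storeId = f.storeId
    JOIN city  AS c ON c.cityId = s.cityId
    GROUP BY c.regId
    HAVING COUNT(*) >= 2
    ORDER BY c.regId
""").fetchall()
print(rows)  # east and north remain; south has only one sale
```

WHERE filters rows before grouping, while HAVING filters the groups after the aggregate has been computed.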

Different forms of OLAP
Three ways of storing data:

Multidimensional OLAP (MOLAP)
Best Query Performance
Relational OLAP (ROLAP)
Ideal for large databases
Hybrid OLAP (HOLAP)
Best of both worlds!

Relational Database Model

       Attribute 1  Attribute 2  Attribute 3  Attribute 4
       Name         Age          Gender       Emp No.
Row 1  Anderson     31           F            1001
Row 2  Green        42           M            1007
Row 3  Lee          22           M            1010
Row 4  Ramos        32           F            1020

The table above illustrates the employee relation.


Multidimensional Database Model
[Figure: two cubes, SALES and FINANCE, with dimension labels Customer, Store, Time, Product, and GL_Line]
The data is found at the intersection of dimensions.
[Figure: examples with two dimensions and three dimensions]
MOLAP Server
The application layer stores data in a multidimensional structure
The presentation layer provides the multidimensional view
Efficient storage and processing
Complexity hidden from the user
Analysis using preaggregated summaries and precalculated measures
[Figure: DSS client, MOLAP engine (application layer), warehouse]
ROLAP Server

The warehouse stores atomic data.
The application layer generates SQL for the three-dimensional view.
The presentation layer provides the multidimensional view.
[Figure: DSS client, ROLAP engine (application layer), multiple SQL, warehouse server]
Creating the Dimensional Model
Identify fact tables
Translate business measures into
fact tables
Analyze source system information
for additional measures
Identify base and derived measures
Document additivity of measures
Identify dimension tables
Link fact tables to the dimension
tables
Create views for users
Dimension Tables
Dimension tables have the following
characteristics:
Contain textual information that
represents the attributes of the
business
Contain relatively static data
Are joined to a fact table through a
foreign key reference
[Figure: Facts (units, price) at the center, linked to Product, Channel, Customer, and Time]
Fact Tables
Fact tables have the following
characteristics:
Contain numeric measures (metrics) of
the business
May contain summarized (aggregated)
data
May contain date-stamped data
Have key value that is typically a
concatenated key composed of the
primary keys of the dimensions
Joined to dimension tables through
foreign keys that reference primary keys
in the dimension tables
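
The key structure described above (a concatenated key made of the dimension keys, each one a foreign key) could be declared as in this sketch. The column names echo the star schema slides, while `time_dim`, the sample values, and the use of SQLite are assumptions for illustration. SQLite only enforces foreign keys after the PRAGMA shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
CREATE TABLE product  (product_id INTEGER PRIMARY KEY, product_desc TEXT);
CREATE TABLE time_dim (day_id INTEGER PRIMARY KEY, month_id INTEGER);
CREATE TABLE sales_fact (
    product_id    INTEGER NOT NULL REFERENCES product(product_id),
    day_id        INTEGER NOT NULL REFERENCES time_dim(day_id),
    sales_units   INTEGER,
    sales_dollars REAL,
    -- Concatenated key built from the dimension keys.
    PRIMARY KEY (product_id, day_id)
);
INSERT INTO product VALUES (1, 'bolt');
INSERT INTO time_dim VALUES (20160909, 201609);
INSERT INTO sales_fact VALUES (1, 20160909, 5, 50.0);
""")

def try_insert(row):
    """Return True if the fact row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

unknown_product = try_insert((99, 20160909, 1, 10.0))  # no such product
duplicate_key = try_insert((1, 20160909, 2, 20.0))     # same product and day
print(unknown_product, duplicate_key)  # False False
```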
Fact Table
Central table
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a
billion)
Access via dimensions

Dimensional Model (Star Schema)
[Figure: a central fact table, Facts (units, price), surrounded by the dimension tables Product, Channel, Customer, and Time]
Star Schema Model
Central fact table
Radiating dimensions
Denormalized model

Product Table: Product_id, Product_desc
Store Table: Store_id, District_id, ...
Sales Fact Table: Product_id, Store_id, Item_id, Day_id, Sales_dollars, Sales_units, ...
Time Table: Day_id, Month_id, Period_id, Year_id
Item Table: Item_id, Item_desc, ...
Star Schema Model

Easy for users to understand
Fast response to queries
Simple metadata
Supported by many front end tools
Less robust to change
Slower to build
Does not support history
Snowflake Schema Model

Product Table: Product_id, Product_desc
Store Table: Store_id, Store_desc, District_id
District Table: District_id, District_desc
Sales Fact Table: Item_id, Store_id, Sales_dollars, Sales_units
Time Table: Week_id, Period_id, Year_id
Item Table: Item_id, Item_desc, Dept_id
Dept Table: Dept_id, Dept_desc, Mgr_id
Mgr Table: Dept_id, Mgr_id, Mgr_name
Snowflake Schema Model

Direct use by some tools
More flexible to change
Provides for speedier data loading
May become large and unmanageable
Degrades query performance
More complex metadata
Using Summary Data
Phase 3: Modeling summaries

Provides fast access to precomputed data
Reduces use of I/O, CPU, and memory
Is distilled from source systems and precalculated summaries
Usually exists in summary fact tables
Architecture of Data
Warehouse
Architecture of a Data
Warehouse with a Staging Area
Architecture of a Data
Warehouse with a Staging Area
and Data Marts
Incorrect Data in the Data
Warehouse
The architect needs to know what to do
about incorrect data in the data warehouse.
The first assumption is that incorrect data
arrives in the data warehouse on an
exception basis.
If data is being entered incorrectly into
the data warehouse on a wholesale basis,
then it is the duty of the architect to find the
offending load process and make adjustments.
How to correct
To correct the offending data, the architect has
three choices.
Example: suppose that on July 1 an entry of Rs 500 is made
in the operational system, on July 2 a snapshot is
taken into the data warehouse, and on July 15 it is
discovered that the July 1 entry should have been
Rs 250 rather than Rs 500.
Choice 1.
Go back to the July 2 snapshot and replace 500 with 250.
This can create problems if any report was produced
between July 2 and July 15.
How to correct
Choice 2.
Enter offsetting entries, i.e. make two
entries: first debit 500, then credit 250.
This can also create problems in some cases.
Choice 3.
Reset the account to the proper value.
This does not correct the original error.
So, depending on the situation, any of these
choices may be appropriate.
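
The three choices can be compared on a toy snapshot history. This is a simplified sketch: the entry layout, the date strings, and the sign convention (a debit as a negative amount) are assumptions made for the example.

```python
# Snapshot history for one account: (date, amount) entries, oldest first.
def balance(entries):
    """Current value of the account as the sum of all recorded entries."""
    return sum(amount for _, amount in entries)

original = [("Jul 2", 500)]  # the incorrect snapshot

# Choice 1: rewrite the July 2 snapshot in place (history is changed).
choice1 = [("Jul 2", 250)]

# Choice 2: keep history and append offsetting entries.
choice2 = original + [("Jul 15", -500), ("Jul 15", 250)]

# Choice 3: keep history and append one entry that resets the
# account to the proper value.
choice3 = original + [("Jul 15", 250 - balance(original))]

for name, entries in [("choice 1", choice1), ("choice 2", choice2),
                      ("choice 3", choice3)]:
    print(name, balance(entries), entries)
```

All three end at a balance of 250. Only choice 1 alters what a report run between July 2 and July 15 would have seen, which is why it can conflict with reports already issued.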
Structuring Data in Data
Warehouse
The simplest and most common data structures found
in the data warehouse are:

1. Simple cumulative structure, i.e. daily
transactions reported from the operational
environment.
Example: Jan 1, Jan 2, Jan 3 data

2. Rolling summary data
After the daily data has accumulated, it is summarized into
data warehouse records.
Example: week 1 data, week 2 data, month 1 data, month 2 data.
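
A rolling summary can be sketched as daily totals being folded into weekly totals. The figures, the day numbering (1 = Jan 1), and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily totals keyed by day number (1 = Jan 1).
daily = {1: 100, 2: 120, 3: 90, 8: 200, 9: 50}

def roll_up_to_weeks(daily_totals):
    """Summarize daily totals into weekly totals (days 1-7 are week 1)."""
    weekly = defaultdict(int)
    for day, amount in daily_totals.items():
        week = (day - 1) // 7 + 1
        weekly[week] += amount
    return dict(weekly)

weekly = roll_up_to_weeks(daily)
print(weekly)  # {1: 310, 2: 250}
```

The same step can be repeated from weeks to months; each level is smaller to store but loses the detail of the level below it.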
Reporting and the architected
environment
Once the data warehouse has been constructed, all reporting and
informational processing will be done from there.

1. Operational reporting for the clerical level:
It focuses on the line item (detailed information).
Example: A cashier has to check the whole day's transactions in the
evening for a balance check.

2. Data warehouse reporting for the management level:
It focuses on summary information.
Example: A bank vice president has to decide how many
ATMs to place in a particular city. He does not
need one day's transactions; he needs a one-month or one-year
summary of the data to make the decision.
Purging Warehouse Data
Data purging means deleting
data from the DW.
Data does not just pour into a data
warehouse; it has its own life
cycle within the data warehouse.
Purging does not always mean that data is fully removed:
it may instead be rolled up to a higher level of
summary, where the detail is lost.
Granularity
Refers to the level of detail of the data
Dual levels of granularity:
1. Low level of granularity (more detail)
2. High level of granularity (less detail, i.e. summary)

Most data in the data warehouse is at the high (summary) level,
but low-level detail is also kept for atomic queries.
Data Granularity in
Database
Data Granularity
A significant difference between an
operational system and a data
warehouse is the granularity of the
data stored.
An operational system typically stores
data at the lowest level of granularity:
the maximum level of detail.
Granularity in Data
Warehouse
However, because the data warehouse contains
data representing a long period in time,
simply storing all detail data from an operational
system can result in an overworked system that
takes too long to query.
A data warehouse typically stores data in different
levels of granularity or summarization, depending
on the data requirements of the business.
If an enterprise needs data to assist strategic
planning, then only highly summarized data is
required.
Granularity in Data
Warehouse
The lower the level of granularity of
data required by the enterprise, the
higher the number of resources
(specifically data storage) required to
build the data warehouse.
The different levels of summarization in
order of increasing granularity are:
Current operational data
Historical operational data
Granularity in Data
Warehouse
Aggregated data
Metadata

Current and historical operational data
are taken, unmodified, directly from
operational systems. Historical data is
operational-level data that is no longer
queried on a regular basis and is often
archived onto secondary storage.
OLAP(Data Warehouse)
vs
Data Mining
OLAP(Data Warehouse)

So, to know what will happen in the future, you need a technique called data mining.
Recommended book
Data Mining: Concepts and Techniques
Authors: Jiawei Han and Micheline Kamber
Publisher: Morgan Kaufmann Publishers
