First Data WarehouseAima First Final Updated 9 Sep 2016
Warehousing
&
Data Mining
Code: ITM 402
What is a Data Warehouse?
DATABASE
A database is an organized collection of
data.[1] It is the collection of schemas,
tables, queries, reports, views, and other
objects.
Often abbreviated DB, a database is
basically a collection of information
organized in such a way that a computer
program can quickly select desired pieces
of data. You can think of a database as an
electronic filing system.
Traditional databases are organized by
fields, records, and files.
A field is a single piece of information;
A record is one complete set of fields; and
A file is a collection of records.
For example, a telephone book is
analogous to a file. It contains a list of
records, each of which consists of three
fields: name, address, and telephone
number.
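As a rough illustration of the field / record / file vocabulary, the telephone book can be sketched in Python. The `Record` class, the `lookup` helper, and the two entries are made up for this sketch and are not part of the original slides.

```python
from dataclasses import dataclass


@dataclass
class Record:
    # One record is one complete set of fields.
    name: str        # field
    address: str     # field
    telephone: str   # field


# The "file" is a collection of records (entries invented for illustration).
phone_book = [
    Record("A. Kumar", "12 Park Street", "555-0101"),
    Record("S. Rao", "7 Lake Road", "555-0102"),
]


def lookup(file, name):
    """Return the telephone field of the first record whose name matches."""
    for record in file:
        if record.name == name:
            return record.telephone
    return None


print(lookup(phone_book, "S. Rao"))
```

A real database adds indexing, so that the lookup does not have to scan every record.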
Data Dictionary
A data dictionary is a file or a set of
files that contains a database's
metadata. The data dictionary
contains records about other objects
in the database, such as data
ownership, data relationships to
other objects, and other data.
The data dictionary is a crucial
component of any relational
database. Ironically, because of its
importance, it is invisible to most
database users. Typically, only
database administrators interact with
the data dictionary.
In a relational database, the metadata in the data dictionary includes the
following:
... measurement of attributes
... physical attribute
... of data remarks
... naming conventions
[Figure: Legacy System -> (load) -> Data Warehouse -> (access)]
How does it differ from a
Database?
OLTP vs. OLAP
OLTP: On Line Transaction Processing
Describes processing at operational sites.
Warehouse is a Specialized DB
Standard DB (OLTP)          Warehouse (OLAP)
Mostly updates              Mostly reads
Many small transactions     Queries are long and complex
MB - GB of data             GB - TB of data
Current snapshot            History
Index/hash on p.k.          Lots of scans
Raw data                    Summarized, reconciled data
Thousands of users          Hundreds of users
(e.g., clerical users)      (e.g., decision-makers, analysts)
Operational v/s Information System
Features                    Operational                          Information
Characteristics             Operational processing               Informational processing
Orientation                 Transaction                          Analysis
User                        Clerk, DBA, database professional    Knowledge workers
Function                    Day-to-day operation                 Decision support
Data                        Current                              Historical
View                        Detailed, flat relational            Summarized, multidimensional
DB design                   Application oriented                 Subject oriented
Unit of work                Short, simple transaction            Complex query
Access                      Read/write                           Mostly read
Focus                       Data in                              Information out
Number of records accessed  Tens                                 Millions
Schema: plan
Describes structure of database
Names and sizes of fields
Identifies primary keys
Data dictionary: repository of
information about data
The Schema and Metadata
(continued)
Metadata: data about data
Source of data
Tables related to data
Field information
Usage of data
Population rules
Data Warehouse
Fundamentals
Extraction, transformation, and loading
(ETL) is a process that extracts information
from internal and external databases,
transforms the information using a common
set of enterprise definitions, and loads the
information into a data warehouse.
[Figure: branch databases in Delhi, Chennai, and Bangalore; the Sales Manager wants the report "Sales per item type per branch for first quarter."]
ETL(Extract, Transform,
Load)
Improve the quality of data before
loading it into the warehouse.
Perform data cleaning and
transformation before loading the
data.
Then load it into the data warehouse.
Use query analysis tools to support
ad hoc queries.
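The extract, clean, load, and query steps can be sketched end to end with Python's built-in sqlite3 module. This is only a minimal sketch: the raw rows, the `CITY_FIXES` mapping, the `sales` table, and the cleaning rules are all assumptions made for illustration, not something the slides specify.

```python
import sqlite3

# Extract: hypothetical raw rows pulled from three branch systems. Note the
# stray whitespace, mixed case, a misspelt city, and one missing amount.
raw_rows = [
    {"branch": "delhi ", "item_type": "Bolt", "amount": "100"},
    {"branch": "Chennai", "item_type": "bolt", "amount": "250"},
    {"branch": "Banglore", "item_type": "Nut", "amount": "30"},
    {"branch": "Bangalore", "item_type": "nut", "amount": "75"},
    {"branch": "Delhi", "item_type": "Nut", "amount": None},
]

# Assumed enterprise-wide spelling for branch names.
CITY_FIXES = {"banglore": "Bangalore"}


def transform(rows):
    """Clean rows: drop rows without an amount, normalize names, cast types."""
    clean = []
    for row in rows:
        if row["amount"] is None:
            continue
        branch = row["branch"].strip()
        branch = CITY_FIXES.get(branch.lower(), branch.title())
        clean.append((branch, row["item_type"].strip().lower(), int(row["amount"])))
    return clean


def load(conn, rows):
    """Load the cleaned rows into a warehouse table."""
    conn.execute("CREATE TABLE sales (branch TEXT, item_type TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)


conn = sqlite3.connect(":memory:")
load(conn, transform(raw_rows))

# An ad hoc query against the warehouse: sales per item type per branch.
report = conn.execute(
    "SELECT branch, item_type, SUM(amount) FROM sales "
    "GROUP BY branch, item_type ORDER BY branch, item_type"
).fetchall()
print(report)
```

Cleaning before loading is what lets the two spellings of Bangalore end up in a single report row.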
OLAP
- Order#
- Ordertype
Roll up to the product level:
Productname    Turnover
Screw          100,000
Bolt           200,000
Nut            300,000
Roll up to the top level:
Toplevel       Turnover
               600,000
A star schema DW can be illustrated as a multidimensional
cube:
Solution 1: ABC Pvt Ltd.
[Figure: data from Delhi, Chennai, and Bangalore is loaded into a Data Warehouse; Query & Analysis tools produce the Report for the Sales Manager]
Data Warehousing Architecture
[Figure: architecture components include Monitoring & Administration, Metadata Repository, OLAP Servers, Data Marts, and Data Mining]
Data Warehouse Architecture
Data Warehouse server
almost always a relational
DBMS, rarely flat files
OLAP servers
to support and operate on multi-
dimensional data structures
Clients
Query and reporting tools
Analysis tools
Data mining tools
OLTP
OLTP (On-line Transaction Processing)
Operational data
To control and run fundamental business
tasks
Transactions:INSERT, UPDATE, DELETE.
Detailed and current data,
Relatively standardized and simple queries
Highly normalized(3NF) with many tables
OLAP
OLAP (On-line Analytical Processing)
Low volume of transactions.
To help with planning, problem solving,
and decision support
Often long and complex queries
involving aggregations
Typically de-normalized with fewer tables;
use of star and/or snowflake schemas.
Historical data, stored in multi-
dimensional schemas (usually star
schema).
OLTP (On-line Transaction
Processing) vs. OLAP (On-
line Analytical
Processing)
We can divide IT systems into
transactional (OLTP) and analytical
(OLAP). In general we can assume
that OLTP systems provide source
data to data warehouses, whereas
OLAP systems help to analyze it.
Data Mart
Introduction
OLTP vs. OLAP
OLTP is highly normalized with many
tables (RDBMS).
OLAP is typically de-normalized with fewer
tables (use of star and/or snowflake
schemas).
Difference between OLTP AND
OLAP
OLTP (On-line Transaction Processing) is characterized by
a large number of short on-line transactions (INSERT, UPDATE,
DELETE). The main emphasis for OLTP systems is on very
fast query processing, on maintaining data integrity in
multi-access environments, and on effectiveness measured by
the number of transactions per second. An OLTP database holds
detailed and current data, and the schema used to store
transactional databases is the entity model (usually
3NF).
Multi-Dimensional OLAP
Servers
Roll UP - aggregation of data such as simple
roll-ups or complex expressions involving inter-
related data, for example Monthly data to
quarterly data.
Slicing
Multi-Dimensional OLAP
servers
Can store data in a compressed form by
dynamically selecting physical storage
organizations and compression techniques
that maximize space utilization.
ON-LINE ANALYTICAL
PROCESSING
Demand for OLAP
To develop Data Mart, three
approaches
In all approaches, Data Marts rest
on Dimensional Model
Data Marts are sufficient for basic
data analysis
Users need to go beyond such
basic analysis
Demand for OLAP
Traditional tools of report writers,
query products, spreadsheets, &
language interfaces do not match
the user expectations as far as
performing multidimensional
analysis with complex calculations
is concerned.
Tools used with OLTP and basic DW
environments do not match up to
the task
OLAP is the Answer!
OLAP is a category of software technology
that enables analysts, managers, and
executives to gain insight into the data
through fast, consistent, interactive, access in
a wide variety of possible views of
information that has been transformed from
raw data to reflect the real dimensionality of
the enterprise as understood by the user.
Why is OLAP useful?
Facilitates multidimensional data
analysis by pre-computing aggregates
across many sets of dimensions
Provides for:
Greater speed and responsiveness
Improved user interactivity
Data, Data everywhere, yet ...
I can't find the data I need
data is scattered over the network
many versions, subtle differences
[Barry Devlin]
What are the users
saying...
Data should be integrated
across the enterprise
Summary data has a real
value to the organization
Historical data holds the
key to understanding data
over time
What-if capabilities are
required
What is Data Warehousing?
A process of transforming data into
information and making
it available to users in a
timely enough manner
to make a difference
Data Warehousing --
It is a process
Technique for assembling and
managing data from various
sources for the purpose of
answering business questions.
This enables decisions that
were not previously possible.
A decision support database
maintained separately from
the organization's operational
database
Data Integration
Data integration means combining Data
coming from different sources and
providing users with a unified view of
these data.
This process becomes significant in a
variety of situations both commercial
(when two similar companies need to
merge their Databases) and scientific
(combining research results from different
repositories)
Consistency
Time Variant
Application-Orientation vs. Subject-Orientation
Application-Orientation     Subject-Orientation
(Operational Database)      (Data Warehouse)
Loans                       Customer
Credit Card                 Vendor
Trust                       Product
Savings                     Activity
To summarize ...
OLTP Systems are
used to run a
business
I. Data Warehouses:
Architecture, Design &
Construction
DW Architecture
Loading, refreshing
Structuring/Modeling
DWs and Data Marts
Query Processing
demos, labs
Data Warehouse Architecture
[Figure: sources (Relational Databases, ERP Systems, Purchased Data, Legacy Data) pass through Extraction and Cleansing and an Optimized Loader into the Data Warehouse Engine, which serves Analyze and Query tools; a Metadata Repository sits alongside]
Components of the
Warehouse
Data Extraction and Loading
The Warehouse
Analyze and Query -- OLAP Tools
Metadata
Loading the Warehouse
Data Granularity in
Warehouse
Summarized data stored
reduce storage costs
reduce cpu usage
increases performance since smaller
number of records to be processed
design around traditional high level
reporting needs
tradeoff with volume of data to be
stored and detailed usage of data
Granularity in Warehouse
Can not answer some questions with
summarized data
Did Anand call Seshadri last month? Not
possible to answer if total duration of
calls by Anand over a month is only
maintained and individual call details
are not.
Detailed data too voluminous
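The granularity trade-off can be made concrete with a small Python sketch. The call records, the second callee "Ravi", and the durations are hypothetical; only the Anand/Seshadri question comes from the slide.

```python
# Detailed call records (hypothetical values).
calls = [
    {"caller": "Anand", "callee": "Seshadri", "minutes": 5},
    {"caller": "Anand", "callee": "Ravi", "minutes": 12},
]

# Summarized data: only the total duration per caller for the month is kept.
monthly_summary = {}
for call in calls:
    monthly_summary[call["caller"]] = monthly_summary.get(call["caller"], 0) + call["minutes"]


def did_call(details, caller, callee):
    """Answerable only from detailed records, not from the monthly total."""
    return any(c["caller"] == caller and c["callee"] == callee for c in details)


print(monthly_summary)                       # total minutes per caller
print(did_call(calls, "Anand", "Seshadri"))  # needs the individual call details
```

The summary takes one row per caller instead of one row per call, but the callee information is gone once the details are discarded.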
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
Aggregation Using Hierarchies
Hierarchy: customer -> region -> country
day 1      c1    c2    c3
p1         12          50
p2         11     8
day 2      c1    c2    c3
p1         44     4
Rolled up to region
(customer c1 in Region A;
customers c2, c3 in Region B):
           region A    region B
p1         56          54
p2         11           8
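The roll-up from customer to region on this slide can be reproduced with a few lines of Python. The dictionary layout and the function name `roll_up_to_region` are choices made for this sketch; the figures and the customer-to-region mapping are the ones shown on the slide.

```python
from collections import defaultdict

# (product, customer, day) -> amount, taken from the two day tables.
sales = {
    ("p1", "c1", 1): 12, ("p1", "c3", 1): 50,
    ("p2", "c1", 1): 11, ("p2", "c2", 1): 8,
    ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}

# One level of the hierarchy: customer -> region.
region_of = {"c1": "A", "c2": "B", "c3": "B"}


def roll_up_to_region(cells):
    """Sum amounts over days and replace each customer by its region."""
    totals = defaultdict(int)
    for (product, customer, _day), amount in cells.items():
        totals[(product, region_of[customer])] += amount
    return dict(totals)


by_region = roll_up_to_region(sales)
print(by_region)
```

The result matches the region table: p1 is 56 in Region A and 54 in Region B, and p2 is 11 and 8.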
OLAP Operations
Roll up (drill-up): summarize data
by climbing up hierarchy or by dimension
reduction
Drill down (roll down): reverse of roll-up,
from higher level summary to lower level
summary or detailed data
Slice and dice: project and select
Drill Down/Up:
Drilling down or up is a specific analytical
technique whereby the user navigates
among levels of data ranging from the ...
Pivot (rotate):
Rotates the data axis to view the data from different perspectives.
Groups data with different dimensions.
Slice
Performs a selection on one dimension of the given cube, resulting
in a sub-cube.
Reduces the dimensionality of the cubes.
Sets one or more dimensions to specific values and keeps a subset
of dimensions for selected values.
Dice
Define a sub-cube by performing a selection of one or more
dimensions.
Refers to range select condition on one dimension, or to select
condition on more than one dimension.
Reduces the number of member values of one or more dimensions.
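A minimal sketch of slice, dice, and pivot on a tiny three-dimensional cube, written in plain Python. The cube is modelled as a dictionary keyed by (product, store, day), using the toy sales figures that appear elsewhere in these slides; the function names are invented for the sketch and do not correspond to any OLAP product's API.

```python
# A tiny cube keyed by (product, store, day); values are sale amounts.
cube = {
    ("p1", "c1", 1): 12, ("p2", "c1", 1): 11,
    ("p1", "c3", 1): 50, ("p2", "c2", 1): 8,
    ("p1", "c1", 2): 44, ("p1", "c2", 2): 4,
}


def slice_cube(cube, day):
    """Slice: fix one dimension (day) to a value; the result has one dimension less."""
    return {(p, s): amt for (p, s, d), amt in cube.items() if d == day}


def dice_cube(cube, products, stores):
    """Dice: select on several dimensions; the sub-cube keeps all three dimensions."""
    return {k: amt for k, amt in cube.items() if k[0] in products and k[1] in stores}


def pivot(table):
    """Pivot: swap the two axes of a 2-D slice."""
    return {(s, p): amt for (p, s), amt in table.items()}


day1 = slice_cube(cube, day=1)
diced = dice_cube(cube, products={"p1"}, stores={"c1", "c2"})
rotated = pivot(day1)
print(day1)
print(diced)
print(rotated)
```

Slicing on day 1 yields a two-dimensional product-by-store table, while the dice keeps the day dimension but with fewer member values per dimension.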
Other OLAP
Operations
o Moving Averages
o Growth Rates
o Depreciation
o Currency Conversion
o Statistical Functions
o Top N or Bottom N queries
Conceptual vs. Actual
The cube is a logical way of
visualizing the data in an OLAP setting
Not how the data is actually
represented on disk
Two ways of storing data:
ROLAP: Relational OLAP
MOLAP: Multidimensional OLAP
OLAP & CUBE
Construction of the data cube
is key to the operation of OLAP
The computation process
creates a set of aggregates on
the various dimensions of the
data
The CUBE operator
Approaches to OLAP
Servers
ROLAP
Special schema design: star, snowflake
Products
IBM DB2, Oracle, Sybase IQ,
RedBrick, Informix
ROLAP
Defines complex, multi-dimensional data with
simple model
Reduces the number of joins a query has to
process
Allows the data warehouse to evolve with
relatively low maintenance
Can contain both detailed and summarized data.
ROLAP is based on familiar, proven, and already
selected technologies.
BUT!!!
SQL for multi-dimensional manipulation of
calculations.
MOLAP
MOLAP tools feature very fast
response, and the ability to quickly
write back data into the data set
(budgeting and forecasting are
common applications). Primary
downsides of MOLAP tools are.
Products
Pilot, Arbor Essbase, Gentia
OLAP Needs
User Needs
Multidimensional view
Excellent Performance
Analytical Flexibility
Real-Time Data Access
High Data Capacity
OLAP Needs: User Needs
Excellent Performance
RDBMSs must use several summary tables to store the aggregates
that a MOLAP could store in just one cube.
For example, consider a Sales indicator with three dimensions: Months, Regions,
and Products. The indicator cube will contain seven sets of aggregates:
Sales by month
Sales by product
Sales by region
Sales by month and product
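One plausible reading of "seven sets of aggregates" is that each non-empty combination of the three dimensions gives one set, since 2^3 - 1 = 7. The slide lists only four of them, so this interpretation is an assumption; the sketch below simply enumerates the combinations.

```python
from itertools import combinations

dimensions = ["month", "region", "product"]

# Every non-empty subset of the dimensions is treated as one set of aggregates
# (an assumption about what the slide counts).
grouping_sets = [
    combo
    for size in range(1, len(dimensions) + 1)
    for combo in combinations(dimensions, size)
]

for combo in grouping_sets:
    print("Sales by " + " and ".join(combo))

print(len(grouping_sets))
```

In a relational database each of these combinations would typically need its own summary table, which is the point the slide makes about RDBMSs versus a single MOLAP cube.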
OLAP Needs: User Needs
Real-Time Data Access
MOLAP tools load data into the multidimensional cubes.
Consequently, the data being accessed is only as recent
as the last load.
Some applications require real-time data access
Process of continually refreshing the data attaches higher
costs to operating a MOLAP system
Some MOLAP tools offer reach-through functionality to
access volatile data stored outside the MDDB
Unfortunately, users must be aware of the underlying
database structure
Relational data access is too complex for the typical user
OLAP Needs: User Needs
Real-Time Data Access
ROLAP tools maintain a constant link to the
operational RDBMS, which provides users with
up-to-the-minute, accurate data
(Real-Time Data Warehousing)
Industries & organizations with highly volatile
data particularly benefit from this access to
live, operational data.
ROLAP (relational OLAP) tools do not
use pre-calculated data cubes.
Instead, they intercept the query and
pose the question to the standard
relational database and its tables in
order to bring back the data required
to answer the question.
ROLAP tools feature the ability to ask any
question (you are not limited to the contents of
a cube)
and the ability to drill down to the lowest level
of detail in the database.
Primary downsides of ROLAP tools are slow
response and some limitations on scalability
(depending on the technology architecture that
is utilized). The most common examples of
ROLAP tools are MicroStrategy and Sterling
(Information Advantage).
HOLAP (hybrid OLAP)
HOLAP (hybrid OLAP) addresses the
shortcomings of both of these technologies by
combining the capabilities of both approaches.
HOLAP tools can utilize both pre-calculated
cubes and relational data sources.
The most common example of HOLAP
architecture is OLAP services in Microsoft SQL
Server 7.0. OLAP vendors of all stripes are
working to make their products marketable as
"hybrid" as quickly as possible.
OLAP Needs: User Needs
High Capacity Data
MOLAP products are limited by the size of the
cube defined by the multidimensional view.
When dimension elements are predefined, the
scope of available data is limited at the outset.
ROLAP tools circumvent this barrier. Dynamic
dimensions are not stored in the predefined
multidimensional model, but fetched at run
time from the RDBMS.
OLAP Needs: Needs
Easy Development
MOLAP development is straightforward: it requires no
fine tuning and creates its own aggregates.
ROLAP tools, on the other hand, require a specific
schema for the relational database.
Skilled DBAs must provide the appropriate schema
(star or snowflake schema), tune the database, and
create the appropriate summary tables.
However, many ROLAP tools are metadata-driven,
which means the multidimensional view is generated
and maintained more easily.
Hybrid OLAP - HOLAP
o Best of both worlds
HOLAP
[Figure: a Client (User) with a Multidimensional Viewer and a Relational Viewer; the MDBMS Server holds multidimensional data, derived data, and metadata; the RDBMS Server is reached through SQL-Read access and SQL-Reach Through]
ROLAP, MOLAP, or HOLAP
IF
A. You require write access
B. Your data is under 50 GB
C. Your timetable to implement is 60-90 days
D. Lowest level already aggregated
E. Data access on aggregated level
F. You're developing a general-purpose application for inventory movement or assets management
THEN
Consider an MDD /MOLAP solution for your data mart
IF
A. Your data is over 100 GB
B. You have a "read-only" requirement
C. Historical data at the lowest level of granularity
D. Detailed access, long-running queries
E. Data assigned to lowest level elements
THEN
Consider an RDBMS/ROLAP solution for your data mart.
IF
A. OLAP on aggregated and detailed data
B. Different user groups
C. Ease of use and detailed data
THEN
Consider an HOLAP for your data mart
Conclusions
ROLAP: RDBMS -> star/snowflake schema
MOLAP: MDDB -> Cube structures
ROLAP or MOLAP: Data models used play major role in
performance differences
MOLAP: for summarized and relatively lesser volumes
of data (100GB)
ROLAP: for detailed and larger volumes of data
Both storage methods have strengths and weaknesses
The choice is requirement specific, though currently
data warehouses are predominantly built using
RDBMSs/ROLAP.
HOLAP is emerging as the OLAP server of choice
Warehouse Models &
Operators
Data Models
relations
stars & snowflakes
cubes
Operators
slice & dice
roll-up, drill down
pivoting
other
Star
product  prodId  name  price
         p1      bolt  10
         p2      nut   5
store    storeId  city
         c1       nyc
         c2       sfo
         c3       la
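Star Schema
sale: orderId, date, custId, prodId, storeId, qty, amt
product: prodId, name, price
customer: custId, name, address, city
store: storeId, city
(sale.custId, sale.prodId, and sale.storeId point to the customer, product, and store tables)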
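The star schema on this slide can be sketched in SQLite from Python. The table and column names follow the slide and the product and store rows come from the earlier "Star" slide; the column types, the customer row, and the three sale rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product  (prodId TEXT PRIMARY KEY, name TEXT, price INTEGER);
CREATE TABLE store    (storeId TEXT PRIMARY KEY, city TEXT);
CREATE TABLE customer (custId TEXT PRIMARY KEY, name TEXT, address TEXT, city TEXT);
CREATE TABLE sale (
    orderId TEXT PRIMARY KEY,
    date    INTEGER,
    custId  TEXT REFERENCES customer(custId),
    prodId  TEXT REFERENCES product(prodId),
    storeId TEXT REFERENCES store(storeId),
    qty     INTEGER,
    amt     INTEGER
);
INSERT INTO product VALUES ('p1', 'bolt', 10), ('p2', 'nut', 5);
INSERT INTO store VALUES ('c1', 'nyc'), ('c2', 'sfo'), ('c3', 'la');
-- The customer and sale rows below are made up for this sketch.
INSERT INTO customer VALUES ('k1', 'Asha', '1 Main St', 'sfo');
INSERT INTO sale VALUES
    ('o1', 1, 'k1', 'p1', 'c1', 2, 20),
    ('o2', 1, 'k1', 'p2', 'c2', 3, 15),
    ('o3', 2, 'k1', 'p1', 'c2', 1, 10);
""")

# Star join: the central sale table joined to two of its surrounding tables,
# then grouped by descriptive attributes.
rows = conn.execute("""
    SELECT store.city, product.name, SUM(sale.amt)
    FROM sale
    JOIN store   ON sale.storeId = store.storeId
    JOIN product ON sale.prodId  = product.prodId
    GROUP BY store.city, product.name
    ORDER BY store.city, product.name
""").fetchall()
print(rows)
```

Every query has the same shape: start from the central sale table, join outward along its keys, and aggregate the numeric columns.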
Terms
Fact table
Dimension tables
Measures
[Figure: the star schema again, with sale as the fact table, product, customer, and store as dimension tables, and qty and amt as measures]
Dimension Hierarchies
store -> sType
store -> city -> region
sType    tId  size   location
         t1   small  downtown
         t2   large  suburbs
store    storeId  cityId  tId  mgr
         s5       sfo     t1   joe
         s7       sfo     t2   fred
         s9       la      t1   nancy
city     cityId  pop  regId
         sfo     1M   north
         la      5M   south
region   regId  name
         north  cold region
         south  warm region
Related schema types: snowflake schema, constellations
Cube
dimensions = 2
3-D Cube
dimensions = 3
Aggregates
Add up amounts for day 1
In SQL: SELECT sum(amt) FROM SALE
WHERE date = 1
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4
Result: 81
Aggregates
Add up amounts by day
In SQL: SELECT date, sum(amt) FROM SALE
GROUP BY date
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4
ans   date  sum
      1     81
      2     48
Another Example
Add up amounts by day, product
In SQL: SELECT prodId, date, sum(amt) FROM SALE
GROUP BY date, prodId
sale  prodId  storeId  date  amt
      p1      c1       1     12
      p2      c1       1     11
      p1      c3       1     50
      p2      c2       1     8
      p1      c1       2     44
      p1      c2       2     4
Result:
sale  prodId  date  amt
      p1      1     62
      p2      1     19
      p1      2     48
Going from the detailed table to the result is a rollup;
going from the result back to the detail is a drill-down.
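The three aggregate queries from the preceding slides can be checked against the sample sale table with Python's sqlite3 module. The column types and the ORDER BY clauses are additions for this sketch, and the last query also selects prodId so that each output row is labelled with its product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (prodId TEXT, storeId TEXT, date INTEGER, amt INTEGER)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?, ?)",
    [("p1", "c1", 1, 12), ("p2", "c1", 1, 11), ("p1", "c3", 1, 50),
     ("p2", "c2", 1, 8), ("p1", "c1", 2, 44), ("p1", "c2", 2, 4)],
)

# Add up amounts for day 1.
day1_total = conn.execute("SELECT SUM(amt) FROM sale WHERE date = 1").fetchone()[0]

# Add up amounts by day.
by_day = conn.execute(
    "SELECT date, SUM(amt) FROM sale GROUP BY date ORDER BY date"
).fetchall()

# Add up amounts by day and product.
by_day_product = conn.execute(
    "SELECT prodId, date, SUM(amt) FROM sale "
    "GROUP BY date, prodId ORDER BY date, prodId"
).fetchall()

print(day1_total)
print(by_day)
print(by_day_product)
```

Dropping prodId from the GROUP BY of the last query gives the by-day totals again, which is the rollup direction; adding it back is the drill-down.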
Aggregates
Operators: sum, count, max, min,
median, avg
Having clause
Using dimension hierarchy
average by region (within store)
maximum by month (within date)
Different forms of
OLAP
Three ways of storing data:
Relational Database Model
[Figure: the user queries a ROLAP Server on top of the Warehouse; analysis uses preaggregated summaries and precalculated measures; fact tables SALES and FINANCE are shown with dimensions such as Time, Product, Customer, and GL_Line]
Fact Tables
Fact tables have the following
characteristics:
Contain numeric measures (metrics) of
the business
May contain summarized (aggregated)
data
May contain date-stamped data
Have key value that is typically a
concatenated key composed of the
primary keys of the dimensions
Joined to dimension tables through
foreign keys that reference primary keys
in the dimension tables
Fact Table
Central table
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a
billion)
Access via dimensions
Dimensional Model (Star Schema)
Fact table: Facts (units, price)
Dimension tables: Product, Channel, Customer, Time
Star Schema Model
Product Table Store Table
Product_id Store_id
Product_desc District_id
...
So, to know what will happen in the future, you need a technique called Data Mining.
Book you can refer to
Data Mining: Concepts and Techniques
Authors: Jiawei Han and Micheline Kamber
Publisher: Morgan Kaufmann Publishers