Data Warehousing

Download as pdf or txt
Download as pdf or txt
You are on page 1of 29

Data Warehousing and OLAP

Decision Support Systems


• Decision-support systems are used to make
business decisions, often based on data collected
by online transaction-processing systems
• Examples of business decisions:
– What items to stock?
– What insurance premium to change?
– To whom to send advertisements?
• Examples of data used for making decisions
– Retail sales transaction details
– Customer profiles (income, age, gender, etc.)

CMPT 354: Database I -- Data Warehousing and OLAP 2


Data and Statistical Analysis
• Data analysis tasks are simplified by
specialized tools and SQL extensions
– Example tasks
• For each product category and each region, what
were the total sales in the last quarter and how do
they compare with the same quarter last year
• As above, for each product category and each
customer category
• Statistical analysis packages (e.g., SAS)
can be interfaced with databases
CMPT 354: Database I -- Data Warehousing and OLAP 3
Data Analysis and OLAP
• Online Analytical Processing (OLAP)
– Interactive analysis of data, allowing data to be
summarized and viewed in different ways in an online
fashion (with negligible delay)
• Multidimensional data: data modeled as dimension
attributes and measure attributes
– Dimension attributes: define the dimensions on which
measure attributes (or aggregates thereof) are viewed,
e.g. the attributes item_name, color, and size of the
sales relation
– Measure attributes: can be aggregated upon, e.g., the
attribute number of the sales relation

CMPT 354: Database I -- Data Warehousing and OLAP 4


Pivot Table
• Values for one of the dimension attributes form the row headers
• Values for another dimension attribute form the column headers
• Other dimension attributes are listed on top
• Values in individual cells are (aggregates of) the values of the
dimension attributes that specify the cell

CMPT 354: Database I -- Data Warehousing and OLAP 5


Relational Representation
• Cross-tabs can be
represented as
relations
– The value all is used to
represent aggregates
– All represents a set
– The SQL:1999
standard uses null
values in place of all
despite confusion with
regular null values
CMPT 354: Database I -- Data Warehousing and OLAP 6
Data Cubes
• A data cube is a multidimensional
generalization of a cross-tab
• Can have n dimensions
• Cross-tabs can be used as views on a data
cube

CMPT 354: Database I -- Data Warehousing and OLAP 7


Online Analytical Processing
• Pivoting: changing the dimensions used in a
cross-tab is called
• Slicing: creating a cross-tab for fixed values only
– Sometimes called dicing, particularly when values for
multiple dimensions are fixed
• Rollup: moving from finer-granularity data to a
coarser granularity
• Drill down: The opposite operation - that of
moving from coarser-granularity data to finer-
granularity data
CMPT 354: Database I -- Data Warehousing and OLAP 8
Hierarchies on Dimensions
• Enable dimensions be viewed at different levels of
detail
– Dimension DateTime can be used to aggregate by hour
of day, date, day of week, month, quarter or year

CMPT 354: Database I -- Data Warehousing and OLAP 9


Cross Tabulation With Hierarchy

CMPT 354: Database I -- Data Warehousing and OLAP 10


OLAP Implementation
• Multidimensional OLAP (MOLAP) systems
– Multidimensional arrays in memory to store data
cubes
• Relational OLAP (ROLAP) systems
– Relational tables to store data cubes
• Hybrid OLAP (HOLAP) systems
– Store some summaries in memory and store the
base data and other summaries in a relational
database
CMPT 354: Database I -- Data Warehousing and OLAP 11
Extended Aggregation in
SQL:1999
• The cube operation computes union of group by’s
on every subset of the specified attributes
select item-name, color, size, sum(number)
from sales
group by cube(item-name, color, size)
• Compute the union of eight different groupings of
the sales relation: { (item-name, color, size), (item-
name, color), (item-name, size), (color, size),
(item-name), (color), (size), ( ) }
• For each grouping, the result contains the null
value for attributes not present in the grouping
CMPT 354: Database I -- Data Warehousing and OLAP 12
OLTP Versus OLAP
OLTP OLAP
users clerk, IT professional knowledge worker
function day to day operations decision support
DB design application-oriented subject-oriented
data current, up-to-date, detailed, flat historical, summarized, multidimensional
relational Isolated integrated, consolidated
usage repetitive ad-hoc
access read/write, index/hash on prim. lots of scans
key
unit of work short, simple transaction complex query
# records tens millions
accessed
#users thousands hundreds
DB size 100MB-GB 100GB-TB
metric transaction throughput query throughput, response

CMPT 354: Database I -- Data Warehousing and OLAP 13


What Is a Data Warehouse?
• “A data warehouse is a subject-oriented,
integrated, time-variant, and nonvolatile
collection of data in support of
management’s decision-making process.”
– W. H. Inmon
• Data warehousing: the process of
constructing and using data warehouses

CMPT 354: Database I -- Data Warehousing and OLAP 14


Subject-Oriented
• Organized around major subjects, such as
customer, product, sales
• Focusing on the modeling and analysis of
data for decision makers, not on daily
operations or transaction processing
• Providing a simple and concise view around
particular subject issues by excluding data
that are not useful in the decision support
process
CMPT 354: Database I -- Data Warehousing and OLAP 15
Integrated
• Integrating multiple, heterogeneous data sources
– Relational databases, flat files, on-line transaction
records
• Data cleaning and data integration
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– When data is moved to the warehouse, it is converted

CMPT 354: Database I -- Data Warehousing and OLAP 16


Time Variant
• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
• Every key structure in the data warehouse
– Contains an element of time, explicitly or implicitly
– But the key of operational data may or may not contain
“time element”

CMPT 354: Database I -- Data Warehousing and OLAP 17


Nonvolatile
• A physically separate store of data
transformed from the operational
environment
• Operational update of data does not occur in
the data warehouse environment
– Does not require transaction processing,
recovery, and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data

CMPT 354: Database I -- Data Warehousing and OLAP 18


Data Warehousing

CMPT 354: Database I -- Data Warehousing and OLAP 19


Collecting Data
• Source driven architecture: data sources transmit
new information to a warehouse, either
continuously or periodically (e.g. at night)
• Destination driven architecture: a warehouse
periodically requests new information from data
sources
• Keeping warehouse exactly synchronized with
data sources (e.g. using two-phase commit) is too
expensive
– Usually OK to have slightly out-of-date data at
warehouse
– Data/updates are periodically downloaded form online
transaction processing (OLTP) systems
CMPT 354: Database I -- Data Warehousing and OLAP 20
Design Issues
• Data cleansing
– Correct mistakes in addresses (misspellings, zip code
errors), and merge address lists from different sources
and purge duplicates
• Update propagating
– Warehouse schema may be a (materialized) view of
schema from data sources
• Summarizing data
– Raw data may be too large to store on-line
– Aggregate values (totals/subtotals) often suffice
– Queries on raw data can often be transformed by query
optimizer to use aggregate values
CMPT 354: Database I -- Data Warehousing and OLAP 21
Warehouse Schemas
• Dimension values are usually encoded
using small integers and mapped to full
values via dimension tables
• Resultant schema is called a star schema
– More complicated schema structures
• Snowflake schema: multiple levels of dimension
tables
• Constellation: multiple fact tables

CMPT 354: Database I -- Data Warehousing and OLAP 22


Data Warehouse Schema

CMPT 354: Database I -- Data Warehousing and OLAP 23


Picture from publib.boulder.ibm.com

Snowflake Schema
A star schema is a
snowflake schema
where each
dimension has only
one single
dimension table

CMPT 354: Database I -- Data Warehousing and OLAP 24


Why Data Mining?
• Evolution of database technology
– To collect a large amount of data Æ primitive
file processing
– To store and query data efficiently Æ DBMS
• New challenges: huge amount of data, how
to analyze and understand?
– Data mining

CMPT 354: Database I -- Data Warehousing and OLAP 25


What Is Data Mining?
• Mining data – mining knowledge
• Data mining is the non-trivial process of
identifying valid, novel, potentially useful,
and ultimately understandable patterns in
data

CMPT 354: Database I -- Data Warehousing and OLAP 26


The KDD Process
Knowledge

Interpretation/
Patterns evaluation
Transformed
data
Data mining
Preprocessed
data
Transformation
Selection Preprocessing
Target data

Data

CMPT 354: Database I -- Data Warehousing and OLAP 27


KDD Process Steps
• Preprocessing
– Data cleaning
– Data integration
• Data selection
• Data transformation
• Data mining
• Pattern evaluation
• Knowledge presentation
CMPT 354: Database I -- Data Warehousing and OLAP 28
Summary
• OLAP
• Data warehousing
– Star schema
– Snowflake schema
• Data mining and KDD process

CMPT 354: Database I -- Data Warehousing and OLAP 29

You might also like