Lecture 7-Data Warehousing-Data Mining
J. LIECH HENRY
Data Warehousing, Data Mining
[email protected]
Course Overview
❚The course:
what and how
❚0. Introduction
❚I. Data Warehousing
❚II. Decision Support
and OLAP
❚III. Data Mining
❚IV. Looking Ahead
❚Demos and Labs
0. Introduction
❚Data Warehousing,
OLAP and data mining:
what and why
(now)?
❚Relation to OLTP
❚A case study
❚demos, labs
A producer wants to know….
❚Which are our lowest/highest margin customers?
❚Who are my customers and what products are they buying?
❚What is the most effective distribution channel?
[Barry Devlin]
What are the users saying...
❚Data should be integrated
across the enterprise
❚Summary data has a real
value to the organization
❚Historical data holds the
key to understanding data
over time
❚What-if capabilities are
required
What is Data Warehousing?
A process of transforming data into information and
making it available to users in a timely enough manner
to make a difference
[Figure: percentage of respondents (0% to 30%) by data warehouse size, from 5GB to 500GB-1TB; series: Initial, Projected 2Q96. Source: META Group, Inc.]
Very Large Data Bases
❚Terabytes -- 10^12 bytes: Walmart -- 24 Terabytes
Explorers, Farmers and Tourists
Data Warehouse Architecture
❚Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data
❚Extraction, Cleansing
❚Optimized Loader
❚Data Warehouse Engine
❚Analyze, Query
❚Metadata Repository
Data Warehouse for Decision
Support & OLAP
❚Putting information technology to help the
knowledge worker make faster and better
decisions
❙Which of my customers are most likely to go
to the competition?
❙What product promotions have the biggest
impact on revenue?
❙How did the share price of software
companies correlate with profits over last 10
years?
Decision Support
❚Used to manage and control business
❚Data is historical or point-in-time
❚Optimized for inquiry rather than update
❚Use of the system is loosely defined and
can be ad-hoc
❚Used by managers and end-users to
understand the business and make
judgements
Data Mining works with Warehouse
Data
❚Data Warehousing
provides the Enterprise
with a memory
Industry                  Application
Finance                   Credit Card Analysis
Insurance                 Claims, Fraud Analysis
Telecommunication         Call record analysis
Transport                 Logistics management
Consumer goods            Promotion analysis
Data Service providers    Value added data
Utilities                 Power usage analysis
Why Separate Data Warehouse?
❚ Performance
❙ Operational databases are designed and tuned for known transactions and workloads.
❙ Complex OLAP queries would degrade performance for operational transactions.
❙ Special data organization, access and implementation
methods are needed for multidimensional views and queries.
❚ Function
❙ Missing data: Decision support requires historical data, which
operational databases do not typically maintain.
❙ Data consolidation: Decision support requires consolidation
(aggregation, summarization) of data from many
heterogeneous sources: operational databases, external sources.
❙ Data quality: Different sources typically use inconsistent data
representations, codes, and formats which have to be
reconciled.
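To illustrate the data quality point, here is a minimal sketch of reconciling one attribute whose codes differ across source systems. The gender attribute, the code values and the function name are assumptions chosen for illustration; they do not come from the lecture.

```python
# Hypothetical mapping from the codes used by different source systems
# to a single warehouse representation ("M" / "F").
CANONICAL_GENDER = {
    "m": "M", "male": "M", "1": "M",
    "f": "F", "female": "F", "0": "F",
}

def reconcile_gender(value):
    """Return the warehouse code for a source value, or None if unknown."""
    return CANONICAL_GENDER.get(str(value).strip().lower())

print(reconcile_gender("Male"), reconcile_gender(0), reconcile_gender("x"))
```

Unknown codes map to None here so that they can be flagged for review rather than silently loaded.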
Application-Orientation vs.
Subject-Orientation
❚Application-orientation (operational database): Loans, Credit Card, Trust, Savings
❚Subject-orientation (data warehouse): Customer, Vendor, Product, Activity
OLTP vs. Data Warehouse
To summarize ...
❚OLTP Systems are
used to “run” a
business
❚The Data
Warehouse helps
to “optimize” the
business
Why Now?
I. Data Warehouses:
Architecture, Design & Construction
❚DW Architecture
❚Loading, refreshing
❚Structuring/Modeling
❚DWs and Data Marts
❚Query Processing
❚demos, labs
Data Warehouse Architecture
❚Sources: Relational Databases, ERP Systems, Purchased Data, Legacy Data
❚Extraction, Cleansing
❚Optimized Loader
❚Data Warehouse Engine
❚Analyze, Query
❚Metadata Repository
Components of the Warehouse
Loading the Warehouse
Data Integrity Problems
❚Extracting
❚Conditioning
❚Scrubbing
❚Merging
❚Householding
❚Enrichment
❚Scoring
❚Loading
❚Validating
❚Delta Updating
Data Transformation Terms
❚Extracting
❙Capture of data from operational source in
“as is” status
❙Sources for data generally in legacy
mainframes in VSAM, IMS, IDMS, DB2; more
data today in relational databases on Unix
❚Conditioning
❙The conversion of data types from the source
to the target data store (warehouse) --
always a relational database
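Conditioning can be sketched as a per-field type conversion. The record layout, field names and date format below are assumptions for illustration, standing in for a fixed-width text extract from a legacy source.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical extracted record: every field arrives as text.
source_row = {"custid": "000123", "dob": "19650412", "balance": "0001520.75"}

def condition(row):
    """Convert source text fields to the types of the target table."""
    return {
        "custid": int(row["custid"]),                           # integer key
        "dob": datetime.strptime(row["dob"], "%Y%m%d").date(),  # DATE column
        "balance": Decimal(row["balance"]),                     # exact decimal
    }

conditioned = condition(source_row)
print(conditioned)
```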
Data Transformation Terms
❚Householding
❙Identifying all members of a household
(living at the same address)
❙Ensures only one mailing is sent to a
household
❙Can result in substantial savings: 1 lakh
catalogues at Rs. 50 each cost Rs. 50
lakhs. A 2% saving would save Rs. 1
lakh.
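A minimal sketch of householding, followed by the savings arithmetic from the slide. The customer records and the crude address normalization are assumptions for illustration; real address matching is considerably more involved.

```python
from collections import defaultdict

# Hypothetical customer records; names and addresses are made up.
customers = [
    {"name": "A. Rao", "address": "12 MG Road, Pune"},
    {"name": "S. Rao", "address": "12  mg road,  pune "},
    {"name": "K. Iyer", "address": "7 Park Street, Kolkata"},
]

def normalize(address):
    """Crude normalization: lowercase and collapse whitespace."""
    return " ".join(address.lower().split())

def households(records):
    """Group customer names by normalized address."""
    groups = defaultdict(list)
    for rec in records:
        groups[normalize(rec["address"])].append(rec["name"])
    return dict(groups)

grouped = households(customers)
mailings = len(grouped)  # one mailing per household, not per customer

# Savings arithmetic from the slide (1 lakh = 100,000).
LAKH = 100_000
total_cost = 1 * LAKH * 50        # 1 lakh catalogues at Rs. 50 each
savings = total_cost * 2 // 100   # a 2% saving
print(mailings, total_cost // LAKH, savings // LAKH)
```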
Data Transformation Terms
❚Enrichment
❙Bring data from external sources to
augment/enrich operational data. Data
sources include Dun & Bradstreet, A.
C. Nielsen, CMIE, IMRA, etc.
❚Scoring
❙Computation of the probability of an
event, e.g., the chance that a customer
will defect to AT&T from MCI, or the chance
that a customer is likely to buy a new
product
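One common way to turn customer attributes into a probability is a logistic model. The sketch below assumes that technique; the lecture does not name one. The feature names and weights are hypothetical, and in practice the weights would be fitted to historical data.

```python
import math

# Hypothetical weights, not fitted to any real data.
WEIGHTS = {"complaints_last_year": 0.8, "years_as_customer": -0.3}
BIAS = -1.0

def defection_score(customer):
    """Return a value in (0, 1), read as a probability of defection."""
    z = BIAS + sum(WEIGHTS[k] * customer[k] for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

loyal = {"complaints_last_year": 0, "years_as_customer": 8}
unhappy = {"complaints_last_year": 4, "years_as_customer": 1}
print(defection_score(loyal) < defection_score(unhappy))
```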
Structuring/Modeling Issues
Data -- Heart of the Data
Warehouse
❚Heart of the data warehouse is the
data itself!
❚Single version of the truth
❚Corporate memory
❚Data is organized in a way that
represents business -- subject
orientation
Data Warehouse Structure
❚Time is part of the key of each table
❙base customer (1985-87)
❘custid, from date, to date, name, phone, dob
❙base customer (1988-90)
❘custid, from date, to date, name, credit rating,
employer
❙customer activity (1986-89) -- monthly
summary
❙customer activity detail (1987-89)
❘custid, activity date, amount, clerk id, order no
❙customer activity detail (1990-91)
❘custid, activity date, amount, line item no, order no
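A minimal sketch of what time as part of the key means in practice, using an in-memory SQLite table. The table follows the base customer layout (custid, from date, to date, name, phone); the sample rows and the helper `phone_as_of` are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE base_customer (
        custid INTEGER, from_date TEXT, to_date TEXT, name TEXT, phone TEXT,
        PRIMARY KEY (custid, from_date)  -- time is part of the key
    )
""")
conn.executemany(
    "INSERT INTO base_customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, "1985-01-01", "1986-06-30", "J. Smith", "555-0100"),
        (1, "1986-07-01", "1987-12-31", "J. Smith", "555-0199"),
    ],
)

def phone_as_of(custid, day):
    """Return the phone number that was valid on the given ISO date."""
    row = conn.execute(
        "SELECT phone FROM base_customer "
        "WHERE custid = ? AND from_date <= ? AND ? <= to_date",
        (custid, day, day),
    ).fetchone()
    return row[0] if row else None

print(phone_as_of(1, "1986-01-15"), phone_as_of(1, "1987-03-01"))
```

Because each version of a customer is a separate row, a change of phone number adds a row instead of overwriting history.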
Data Granularity in Warehouse
Granularity in Warehouse
Vertical Partitioning
❚Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
❚Frequently accessed: Acct. No, Balance
❙Smaller table and so less I/O
❚Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
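The account table split can be sketched as follows. The sample account and the function `split_vertically` are assumptions for illustration. The account number is kept in both halves so the rows can be rejoined.

```python
# Columns assumed to be frequently accessed.
HOT_COLUMNS = ("acct_no", "balance")

def split_vertically(rows, hot_columns=HOT_COLUMNS, key="acct_no"):
    """Split each row into a narrow hot part and a wide cold part."""
    hot, cold = [], []
    for row in rows:
        hot.append({c: row[c] for c in hot_columns})
        cold.append({c: v for c, v in row.items()
                     if c == key or c not in hot_columns})
    return hot, cold

# Hypothetical account record.
accounts = [{
    "acct_no": 1001, "name": "J. Smith", "balance": 2500.0,
    "date_opened": "1990-03-15", "interest_rate": 0.05,
    "address": "14 Hill Road",
}]

hot, cold = split_vertically(accounts)
print(hot[0])
print(sorted(cold[0]))
```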
Schema Design
❚Database organization
❙must look like business
❙must be recognizable by business user
❙approachable by business user
❙must be simple
❚Schema Types
❙Star Schema
❙Fact Constellation Schema
❙Snowflake schema
Dimension Tables
❚Dimension tables
❙Define business in terms already
familiar to users
❙Wide rows with lots of descriptive text
❙Small tables (about a million rows)
❙Joined to fact table by a foreign key
❙heavily indexed
❙typical dimensions
❘time periods, geographic region (markets,
cities), products, customers, salesperson,
etc.
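A minimal sketch of a dimension table joined to a fact table by a foreign key, using in-memory SQLite. The table names, columns and sample rows are assumptions for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Dimension: descriptive text the business user recognizes.
    CREATE TABLE product_dim (
        product_id INTEGER PRIMARY KEY,
        product_name TEXT,
        category TEXT
    );
    -- Fact: numeric measures plus foreign keys to the dimensions.
    CREATE TABLE sales_fact (
        product_id INTEGER REFERENCES product_dim(product_id),
        sale_date TEXT,
        amount REAL
    );
    INSERT INTO product_dim VALUES (1, 'Tea', 'Beverages'), (2, 'Soap', 'Toiletries');
    INSERT INTO sales_fact VALUES
        (1, '1997-01-05', 40.0), (1, '1997-01-06', 60.0), (2, '1997-01-05', 25.0);
""")

# Join the fact table to the dimension, then group by a descriptive attribute.
sales_by_category = dict(db.execute("""
    SELECT d.category, SUM(f.amount)
    FROM sales_fact AS f
    JOIN product_dim AS d ON f.product_id = d.product_id
    GROUP BY d.category
""").fetchall())
print(sales_by_category)
```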
Star Schema
Partitioning
❚Breaking data into several
physical units that can be
handled separately
❚Not a question of whether
to do it in data
warehouses but how to do
it
❚Granularity and
partitioning are key to
effective implementation
of a warehouse
Why Partition?
Criterion for Partitioning
❚Typically partitioned by
❙date
❙line of business
❙geography
❙organizational unit
❙any combination of above
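Partitioning by date, the first criterion listed, can be sketched as below. The row layout and the choice of one partition per year are assumptions for illustration.

```python
from collections import defaultdict

def partition_by_year(rows, date_field="activity_date"):
    """Place each row in a partition named after the year of its date."""
    partitions = defaultdict(list)
    for row in rows:
        year = row[date_field][:4]  # ISO dates start with the year
        partitions[year].append(row)
    return dict(partitions)

# Hypothetical activity rows.
activity = [
    {"custid": 1, "activity_date": "1987-03-01", "amount": 10.0},
    {"custid": 2, "activity_date": "1987-11-20", "amount": 15.0},
    {"custid": 1, "activity_date": "1990-02-02", "amount": 30.0},
]

partitions = partition_by_year(activity)
print({year: len(rows) for year, rows in partitions.items()})
```

Each partition can then be loaded, indexed, archived or dropped on its own, which is the sense of "handled separately".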
Where to Partition?
Data Warehouse vs. Data Marts
❚Levels, from less to more history, normalization and detail:
❙Individually structured (less)
❙Departmentally structured
❙Organizationally structured (more) -- Data Warehouse Data
Data Warehouse and Data Marts
❚OLAP Data Mart: lightly summarized, departmentally structured
❚Data Warehouse Data: organizationally structured, atomic, detailed
Characteristics of the
Departmental Data Mart
❚OLAP
❚Small
❚Flexible
❚Customized by
Department
❚Source is
departmentally
structured data
warehouse
Techniques for Creating
Departmental Data Mart
[Diagram labels: Sales, Finance, Mktg.]
❚OLAP
❚Subset
❚Summarized
❚Superset
❚Indexed
❚Arrayed
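Two of the techniques listed, subsetting and summarizing, can be sketched together. The warehouse rows and the function `build_mart` are assumptions for illustration, not the lecture's own procedure.

```python
from collections import defaultdict

# Hypothetical detailed warehouse rows.
warehouse_rows = [
    {"dept": "Sales", "month": "1997-01", "amount": 100.0},
    {"dept": "Sales", "month": "1997-01", "amount": 50.0},
    {"dept": "Sales", "month": "1997-02", "amount": 70.0},
    {"dept": "Finance", "month": "1997-01", "amount": 30.0},
]

def build_mart(rows, dept):
    """Subset the rows to one department, then summarize by month."""
    totals = defaultdict(float)
    for row in rows:
        if row["dept"] == dept:                    # subset
            totals[row["month"]] += row["amount"]  # summarize
    return dict(totals)

sales_mart = build_mart(warehouse_rows, "Sales")
print(sales_mart)
```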
Data Mart Centric
Data Sources -> Data Marts -> Data Warehouse
Problems with Data Mart Centric
Solution
True Warehouse
Data Sources -> Data Warehouse -> Data Marts
Warehouse Products
❚Computer Associates -- CA-Ingres
❚Hewlett-Packard -- Allbase/SQL
❚Informix -- Informix, Informix XPS
❚Microsoft -- SQL Server
❚Oracle -- Oracle7, Oracle Parallel Server
❚Red Brick -- Red Brick Warehouse
❚SAS Institute -- SAS
❚Software AG -- ADABAS
❚Sybase -- SQL Server, IQ, MPP
Warehouse Server Products
❚Oracle 8
❚Informix
❙Online Dynamic Server
❙XPS -- Extended Parallel Server
❙Universal Server for object relational
applications
❚Sybase
❙Adaptive Server 11.5
❙Sybase MPP
❙Sybase IQ