Business Intelligence/ Data Warehousing: Lakshmi Prashad PMG
Overview
0. Introduction
I. Data Warehousing
II. Decision Support and OLAP
III. Data Mining
What product promotions have the biggest impact on revenue?
What impact will new products/services have on revenue and margins?
Data, Data everywhere yet ... I can't find the data I need
data is scattered over the network
many versions, subtle differences
Data
Evolution
60s: Batch reports
hard to find and analyze information
inflexible and expensive, reprogram every new request
Weather images
Intelligence Agency Videos
Data Warehouse
A data warehouse is a subject-oriented, integrated, time-varying, non-volatile collection of data
Explorers: Seek out the unknown and previously unsuspected rewards hiding in the detailed data
Purchased Data
Legacy Data
Metadata Repository
Decision Support
Used to manage and control business
Data is historical or point-in-time
Optimized for inquiry rather than update
Use of the system is loosely defined and can be ad-hoc
Used by managers and end-users to understand the business and make judgements
Application Areas
Industry: Application
Finance: Credit Card Analysis
Insurance: Claims, Fraud Analysis
Telecommunication: Call record analysis
Transport: Logistics management
Consumer goods: Promotion analysis
Data Service providers: Value added data
Utilities: Power usage analysis
Function
Missing data: Decision support requires historical data, which operational databases do not typically maintain.
Data consolidation: Decision support requires consolidation (aggregation, summarization) of data from many heterogeneous sources: operational databases, external sources.
Data quality: Different sources typically use inconsistent data representations, codes, and formats, which have to be reconciled.
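The data-quality point can be illustrated with a small reconciliation step. This is only a sketch: the source names (`billing`, `crm`) and the code mappings are hypothetical, chosen to show two systems encoding the same attribute differently.

```python
# Hypothetical per-source mappings from local codes to one warehouse code.
CODE_MAPS = {
    "billing": {"M": "male", "F": "female"},
    "crm": {"1": "male", "2": "female"},
}

def reconcile(source, raw_code):
    """Translate a source-specific code to the warehouse representation."""
    return CODE_MAPS[source].get(raw_code.strip().upper(), "unknown")

# (source system, raw code) pairs as they might arrive from extraction.
records = [("billing", "m"), ("crm", "2"), ("crm", "9")]
cleaned = [reconcile(src, code) for src, code in records]
print(cleaned)
```

Codes with no mapping are kept as "unknown" rather than dropped, so they can be reviewed later.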
Operational Systems
Run the business in real time
Based on up-to-the-second data
Optimized to handle large numbers of simple read/write transactions
Optimized for fast response to predefined transactions
Used by people who deal with customers, products -- clerks, salespeople etc.
They are increasingly used by customers
Technology: Volumes
Legacy application, flat files, mainframes: Small-medium
Legacy applications, hierarchical databases, mainframe: Large
ERP, Client/Server, relational databases: Very Large
Legacy application, hierarchical database, mainframe: Very Large
ERP, relational databases, AS/400: Medium
Operational Database: Loans, Credit Card, Trust, Savings
Data Warehouse: Customer, Vendor, Product, Activity
Warehouse (DSS)
Subject Oriented
Used to analyze business
Summarized and refined
Snapshot data
Integrated Data
Ad-hoc access
Knowledge User (Manager)
Data Warehouse
Performance relaxed
Large volumes accessed at a time (millions)
Mostly Read (Batch Update)
Redundancy present
Database Size 100 GB - few terabytes
Data Warehouse
Query throughput is the performance metric
Hundreds of users
Managed by subsets
To summarize ...
OLTP Systems are used to run a business
Why Now?
Data is being produced
ERP provides clean data
The computing power is available
The computing power is affordable
The competitive pressures are strong
Commercial products are available
Million dollar massively parallel hardware is needed to deliver fast response times for complex queries
OLAP servers require massive and unwieldy indices
Complex OLAP queries clog the network with data
Data warehouses must be at least 100 GB to be effective
Source -- Arbor Software Home Page
ERP Systems
Purchased Data
Legacy Data
Metadata Repository
Source Data
Operational/Source Data
Sequential
Legacy
Relational
External
External Sources
Nielsens, Acxiom, CMIE, Vendors, Partners
Nothing could be farther from the truth.
Warehouse data comes from disparate, questionable sources.
Sources for data generally in legacy mainframes in VSAM, IMS, IDMS, DB2; more data today in relational databases on Unix
Conditioning
The conversion of data types from the source to the target data store (warehouse), always a relational database
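A minimal sketch of conditioning, assuming a source record whose fields all arrive as fixed-width text. The field names, the `YYYYMMDD` date layout, and the convention that amounts are stored in cents are assumptions made for illustration only.

```python
from datetime import datetime
from decimal import Decimal

def condition(row):
    """Convert one source record (all strings) to warehouse column types."""
    return {
        "custid": int(row["custid"]),
        "activity_date": datetime.strptime(row["activity_date"], "%Y%m%d").date(),
        # Assumed: the amount arrives in cents with no decimal point.
        "amount": Decimal(row["amount"]) / 100,
    }

source_row = {"custid": "00042", "activity_date": "19970315", "amount": "0012550"}
target_row = condition(source_row)
print(target_row)
```

Decimal is used instead of float so that monetary amounts are not rounded during conversion.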
Scoring
Computation of the probability of an event, e.g., the chance that a customer will defect to AT&T from MCI, or the chance that a customer is likely to buy a new product
Loads
After extracting, scrubbing, cleaning, validating etc., the data needs to be loaded into the warehouse
Issues:
huge volumes of data to be loaded
small time window available when warehouse can be taken off line (usually nights)
when to build index and summary tables
allow system administrators to monitor, cancel, resume, change load rates
recover gracefully -- restart after failure from where you were and without loss of data integrity
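The "recover gracefully" requirement can be sketched as a batched load that records a checkpoint in the same transaction as each batch. This is an illustrative sketch using Python's built-in sqlite3 module, not a description of any particular load utility; the table names and the `fail_after` parameter (which only simulates a crash) are hypothetical.

```python
import sqlite3

def load(conn, rows, batch_size=2, fail_after=None):
    """Load rows in batches; each batch and its checkpoint commit together."""
    conn.execute("CREATE TABLE IF NOT EXISTS fact (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("CREATE TABLE IF NOT EXISTS checkpoint (pos INTEGER)")
    conn.commit()
    saved = conn.execute("SELECT pos FROM checkpoint").fetchone()
    start = saved[0] if saved else 0  # resume where the last run stopped
    for i in range(start, len(rows), batch_size):
        if fail_after is not None and i >= fail_after:
            raise RuntimeError("simulated failure")
        with conn:  # one transaction per batch: all or nothing
            conn.executemany("INSERT INTO fact VALUES (?, ?)", rows[i:i + batch_size])
            conn.execute("DELETE FROM checkpoint")
            conn.execute("INSERT INTO checkpoint VALUES (?)", (i + batch_size,))

conn = sqlite3.connect(":memory:")
rows = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0), (5, 50.0)]
try:
    load(conn, rows, fail_after=2)  # first run stops after one batch
except RuntimeError:
    pass
loaded_before_restart = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
load(conn, rows)  # restart picks up from the checkpoint
loaded_after_restart = conn.execute("SELECT COUNT(*) FROM fact").fetchone()[0]
```

Because the checkpoint is written in the same transaction as the batch, a restart neither skips rows nor loads them twice.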
Refresh
Propagate updates on source data to the warehouse
Issues:
when to refresh
how to refresh -- refresh techniques
When to Refresh?
periodically (e.g., every night, every week) or after significant events
on every update: not warranted unless warehouse data require current data (up to the minute stock quotes)
refresh policy set by administrator based on user needs and traffic
possibly different policies for different sources
Refresh Techniques
Full Extract from base tables
read entire source table: too expensive
may be the only choice for legacy systems
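A rough sketch of a full-extract refresh, assuming the source table fits in a dictionary keyed by primary key. The function name and data are hypothetical. Every source row is read and compared, which is why the cost grows with the size of the table rather than with the number of changes.

```python
def full_extract_refresh(source_rows, warehouse):
    """Re-read every source row and bring the warehouse copy in line.

    source_rows: dict of key -> value read in full from the source table.
    warehouse:   dict of key -> value, updated in place.
    Returns the number of inserted, updated and deleted keys.
    """
    inserted = [k for k in source_rows if k not in warehouse]
    updated = [k for k in source_rows
               if k in warehouse and warehouse[k] != source_rows[k]]
    deleted = [k for k in warehouse if k not in source_rows]
    for k in inserted + updated:
        warehouse[k] = source_rows[k]
    for k in deleted:
        del warehouse[k]
    return len(inserted), len(updated), len(deleted)

warehouse = {1: "a", 2: "b", 3: "c"}
source = {1: "a", 2: "B", 4: "d"}
counts = full_extract_refresh(source, warehouse)
print(counts, warehouse)
```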
customer activity (1986-89) -- monthly summary
customer activity detail (1987-89): custid, activity date, amount, clerk id, order no
customer activity detail (1990-91): custid, activity date, amount, line item no, order no
Granularity in Warehouse
Cannot answer some questions with summarized data
Did Anand call Seshadri last month? Not possible to answer if only the total duration of calls by Anand over a month is maintained and individual call details are not.
Granularity in Warehouse
Tradeoff is to have dual level of granularity
Store summary data on disks
95% of DSS processing done against this data
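The dual level of granularity can be sketched as follows. The call records, the extra name "Priya", and the function names are hypothetical. The summary level answers "how many minutes?" questions cheaply, while the "did X call Y?" question can only be answered from the detail level.

```python
from collections import defaultdict

# Hypothetical call detail records: (caller, callee, month, minutes).
calls = [
    ("Anand", "Seshadri", "1997-03", 12),
    ("Anand", "Priya", "1997-03", 5),
    ("Anand", "Seshadri", "1997-04", 7),
]

# Summary level: total minutes per caller per month.
summary = defaultdict(int)
for caller, _callee, month, minutes in calls:
    summary[(caller, month)] += minutes

def total_minutes(caller, month):
    """Answered from the summary alone."""
    return summary.get((caller, month), 0)

def did_call(caller, callee, month):
    """Needs the detail records; the summary has lost the callee."""
    return any(c == caller and d == callee and m == month
               for c, d, m, _ in calls)

print(total_minutes("Anand", "1997-03"), did_call("Anand", "Seshadri", "1997-03"))
```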
Vertical Partitioning
Original table: Acct. No, Name, Balance, Date Opened, Interest Rate, Address
Frequently accessed: Acct. No, Balance
Rarely accessed: Acct. No, Name, Date Opened, Interest Rate, Address
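A small sketch of this split using Python's built-in sqlite3 module. The table names `acct_hot` and `acct_cold` and the sample row are hypothetical. The frequent query reads only the narrow table; the original row is recovered by joining on the shared account number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acct_hot  (acct_no INTEGER PRIMARY KEY, balance REAL);
CREATE TABLE acct_cold (acct_no INTEGER PRIMARY KEY, name TEXT, date_opened TEXT,
                        interest_rate REAL, address TEXT);
INSERT INTO acct_hot  VALUES (101, 2500.0);
INSERT INTO acct_cold VALUES (101, 'A. Customer', '1995-06-01', 4.5, '12 Example Road');
""")

# Frequent query: touches only the narrow, frequently accessed table.
balance = conn.execute(
    "SELECT balance FROM acct_hot WHERE acct_no = 101").fetchone()[0]

# Rare query: rebuilds the original wide row by joining on the key.
full_row = conn.execute("""
    SELECT h.acct_no, c.name, h.balance, c.date_opened, c.interest_rate, c.address
    FROM acct_hot h JOIN acct_cold c ON c.acct_no = h.acct_no
    WHERE h.acct_no = 101
""").fetchone()
```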
Derived Data
Introduction of derived (calculated) data may often help
Have seen this in the context of dual levels of granularity
Can keep auxiliary views and indexes to speed up query processing
Schema Design
Database organization
must look like business
must be recognizable by business user
must be approachable by business user
must be simple
Schema Types
Star Schema
Fact Constellation Schema
Snowflake Schema
Dimension Tables
Dimension tables
Define business in terms already familiar to users
Wide rows with lots of descriptive text
Small tables (about a million rows)
Joined to fact table by a foreign key
Heavily indexed
Typical dimensions: time periods, geographic region (markets, cities), products, customers, salesperson, etc.
Fact Table
Central table
mostly raw numeric items
narrow rows, a few columns at most
large number of rows (millions to a billion)
access via dimensions
Star Schema
A single fact table and, for each dimension, one dimension table
Does not capture hierarchies directly
[Figure: star schema. A central fact table (date, custno, prodno, cityname, ...) linked to the dimension tables time, cust, prod and city.]
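A star schema along these lines can be sketched with Python's built-in sqlite3 module. The column names beyond custno, prodno and cityname, the `sales` measure, and all sample rows are assumptions; the time dimension is left out to keep the sketch short. The region is stored as a plain attribute of the city table, so the city-to-region hierarchy is not modelled as a table of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cust (custno INTEGER PRIMARY KEY, custname TEXT);
CREATE TABLE prod (prodno INTEGER PRIMARY KEY, prodname TEXT);
CREATE TABLE city (cityname TEXT PRIMARY KEY, region TEXT);
-- Fact table: foreign keys to each dimension plus a numeric measure.
CREATE TABLE fact (date TEXT,
                   custno INTEGER REFERENCES cust,
                   prodno INTEGER REFERENCES prod,
                   cityname TEXT REFERENCES city,
                   sales REAL);
INSERT INTO cust VALUES (1, 'Anand'), (2, 'Seshadri');
INSERT INTO prod VALUES (10, 'Juice'), (11, 'Cola');
INSERT INTO city VALUES ('Mumbai', 'West'), ('Chennai', 'South');
INSERT INTO fact VALUES
  ('1997-03-01', 1, 10, 'Mumbai', 100.0),
  ('1997-03-02', 2, 11, 'Mumbai', 50.0),
  ('1997-03-02', 1, 11, 'Chennai', 70.0);
""")

# Access the facts via a dimension: total sales per product name.
sales_by_product = conn.execute("""
    SELECT p.prodname, SUM(f.sales)
    FROM fact f JOIN prod p ON p.prodno = f.prodno
    GROUP BY p.prodname ORDER BY p.prodname
""").fetchall()
```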
Snowflake schema
Represent dimensional hierarchy directly by normalizing tables.
Easy to maintain and saves storage
[Figure: snowflake schema. The fact table (date, custno, prodno, cityname, ...) is linked to the dimension tables time, cust, prod and city; city is further normalized into region.]
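A sketch of the snowflake variant, again with Python's built-in sqlite3 module and hypothetical sample data. Only the city dimension is shown; it is normalized so that region becomes its own table, and a region-level query follows the hierarchy through two joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region (regionname TEXT PRIMARY KEY);
CREATE TABLE city (cityname TEXT PRIMARY KEY,
                   regionname TEXT REFERENCES region);
CREATE TABLE fact (date TEXT, cityname TEXT REFERENCES city, sales REAL);
INSERT INTO region VALUES ('West'), ('South');
INSERT INTO city VALUES ('Mumbai', 'West'), ('Pune', 'West'), ('Chennai', 'South');
INSERT INTO fact VALUES ('1997-03-01', 'Mumbai', 100.0),
                        ('1997-03-01', 'Pune', 40.0),
                        ('1997-03-02', 'Chennai', 70.0);
""")

# Walk the hierarchy fact -> city -> region.
sales_by_region = conn.execute("""
    SELECT r.regionname, SUM(f.sales)
    FROM fact f
    JOIN city c ON c.cityname = f.cityname
    JOIN region r ON r.regionname = c.regionname
    GROUP BY r.regionname ORDER BY r.regionname
""").fetchall()
```

Each region name is stored once, at the price of an extra join per hierarchy level.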
Fact Constellation
Fact Constellation
Multiple fact tables that share many dimension tables
Booking and Checkout may share many dimension tables in the hotel industry
[Figure: fact constellation. Fact tables Booking and Checkout share the dimension tables Hotels, Customer, Promotion, Travel Agents and Room Type.]
De-normalization
Normalization in a data warehouse may lead to lots of small tables
Can lead to excessive I/Os since many tables have to be accessed
De-normalization is the answer, especially since updates are rare
Organizationally Structured
Data Warehouse
Data
Data Marts
Data Warehouse
True Warehouse
Data Sources
Data Warehouse
Data Marts
-- Ralph Kimball
What Is OLAP?
Online Analytical Processing - coined by EF Codd in a 1994 paper contracted by Arbor Software*
Generally synonymous with earlier terms such as Decision Support, Business Intelligence, Executive Information System
OLAP = Multidimensional Database
MOLAP: Multidimensional OLAP (Arbor Essbase, Oracle Express)
ROLAP: Relational OLAP (Informix MetaCube, Microstrategy DSS Agent)
* Reference: http://www.arborsoft.com/essbase/wht_ppr/coddTOC.html
Result: OLAP shifted from small vertical niche to mainstream DBMS category
Strengths of OLAP
It is a powerful visualization paradigm
It provides fast, interactive response times
It is good for analyzing time series
OLAP Is FASMI
Fast Analysis of Shared Multidimensional Information
Multi-dimensional Data
Hey, I sold $100M worth of goods
Dimensions: Product, Region, Time
Hierarchical summarization paths:
Product: Industry > Category > Product
Region: Country > Region > City > Office
Time: Year > Quarter > Month or Week > Day
[Figure: data cube with axes Product (Juice, Cola, Milk, Cream, Toothpaste, Soap), Region (W, S, N) and Month (1-7).]
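Rolling up along the time hierarchy can be sketched in a few lines. The sales rows and function names are hypothetical, and only the Day > Month > Quarter > Year path is shown (weeks are left out because they do not nest inside months).

```python
from collections import defaultdict

# Hypothetical daily sales: (ISO date, city, amount).
sales = [("1997-01-15", "Mumbai", 10), ("1997-02-03", "Mumbai", 20),
         ("1997-04-20", "Chennai", 5), ("1997-05-01", "Mumbai", 7)]

def quarter(iso_date):
    """Map '1997-04-20' to '1997-Q2'."""
    year, month, _day = iso_date.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"

# Each level maps a day to its ancestor at that level of the hierarchy.
LEVELS = {"day": lambda d: d, "month": lambda d: d[:7],
          "quarter": quarter, "year": lambda d: d[:4]}

def roll_up(rows, level):
    """Total the amounts at the requested level of the time hierarchy."""
    totals = defaultdict(int)
    for day, _city, amount in rows:
        totals[LEVELS[level](day)] += amount
    return dict(totals)

print(roll_up(sales, "quarter"))
```

Drill-down is the reverse direction: asking for "month" after "quarter" splits each total back into finer cells.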
[Figure: sales cube with axes Product (Juice, Cola, Milk, Cream), Region and Date; cell values shown: 10, 47, 30, 12.]
[Figure: cube with axes Product (Household, Telecomm, Video, Audio), Region (Europe, Far East, India) and Sales Channel (Retail, Direct, Special).]
Low-level Details
Drill-Down
Multidimensional Spreadsheets
Analysts need spreadsheets that support
pivot tables (cross-tabs)
drill-down and roll-up
slice and dice
sort
selections
derived attributes
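Two of these operations, the cross-tab and the slice, can be sketched in plain Python. The rows and function names are hypothetical; a real tool would work against the warehouse rather than an in-memory list.

```python
from collections import defaultdict

# Hypothetical sales rows: (product, region, amount).
rows = [("Juice", "N", 10), ("Juice", "S", 4), ("Cola", "N", 47), ("Cola", "N", 3)]

def pivot(rows):
    """Cross-tab: products down the side, regions across, summed amounts."""
    table = defaultdict(lambda: defaultdict(int))
    for product, region, amount in rows:
        table[product][region] += amount
    return {product: dict(by_region) for product, by_region in table.items()}

def slice_region(rows, region):
    """Slice: keep only the rows for one value of the region dimension."""
    return [row for row in rows if row[1] == region]

print(pivot(rows))
print(slice_region(rows, "N"))
```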
Database Layer
Presentation Layer
Generate SQL execution plans in the ROLAP engine to obtain OLAP functionality.
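The relational approach can be sketched as a function that turns a requested set of dimensions into a GROUP BY statement and runs it against a relational store. This is an illustration only, not how any named ROLAP product works; the `sales` table, its columns, and `build_query` are hypothetical.

```python
import sqlite3

DIMENSIONS = ("product", "region", "month")  # hypothetical dimension columns

def build_query(dimensions):
    """Turn a list of requested dimensions into a GROUP BY statement."""
    if not dimensions or any(d not in DIMENSIONS for d in dimensions):
        raise ValueError("expected one or more of %s" % (DIMENSIONS,))
    cols = ", ".join(dimensions)
    return f"SELECT {cols}, SUM(amount) FROM sales GROUP BY {cols} ORDER BY {cols}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, region TEXT, month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [
    ("Juice", "N", "1997-01", 10),
    ("Cola", "N", "1997-01", 47),
    ("Juice", "S", "1997-02", 4),
])

by_region = conn.execute(build_query(["region"])).fetchall()
by_product_region = conn.execute(build_query(["product", "region"])).fetchall()
```

Column names are checked against a fixed list before being placed in the SQL text, since identifiers cannot be passed as bound parameters.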
Database Layer
Presentation Layer
Store atomic data in a proprietary data structure (MDDB), pre-calculate as many outcomes as possible, obtain OLAP functionality via proprietary algorithms running against this data.
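The idea of pre-calculating outcomes can be sketched by aggregating every fact into all combinations of dimensions ahead of time, so that a query becomes a single lookup. This is a toy stand-in for a multidimensional store, not the structure used by any named product; the facts and names are hypothetical, and "*" stands for "all values".

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "region", "month")
facts = [("Juice", "N", "1997-01", 10),
         ("Cola", "N", "1997-01", 47),
         ("Juice", "S", "1997-02", 4)]

def build_cube(facts):
    """Pre-aggregate the measure for every subset of dimensions."""
    cube = defaultdict(int)
    n = len(DIMS)
    for *coords, amount in facts:
        for k in range(n + 1):
            for kept in combinations(range(n), k):
                # Keep the chosen coordinates, replace the rest with '*'.
                key = tuple(coords[i] if i in kept else "*" for i in range(n))
                cube[key] += amount
    return dict(cube)

cube = build_cube(facts)
print(cube[("Juice", "*", "*")], cube[("*", "N", "*")], cube[("*", "*", "*")])
```

The number of stored cells grows quickly with the number of dimensions, which is the storage cost paid for fast lookups.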
From day one establish that warehousing is a joint user/builder project
Establish that maintaining data quality will be an ONGOING joint user/builder responsibility
Train the users one step at a time
Consider doing a high level corporate data model in no more than three weeks
Implement a user accessible automated directory to information stored in the warehouse
Determine a plan to test the integrity of the data in the warehouse
From the start get warehouse users in the habit of 'testing' complex queries
When in a bind, ask others who have done the same thing for advice
Be on the lookout for small, but strategic, projects
Market and sell your data warehousing systems
You will find the need to store data not being captured by any existing system
You will need to validate data not being validated by transaction processing systems
Useful URLs
Ralph Kimball's home page
http://www.rkimball.com
OLAP Council
http://www.olapcouncil.com/