
Intro. To Data Warehouse: Worapoj Kreesuradej, Ph.D. Associate Professor

The document provides an introduction to data warehousing. It defines a data warehouse as a subject-oriented database designed for decision making that is maintained separately from operational databases. The document outlines some key characteristics of a data warehouse such as being integrated, non-volatile, and time-variant. It also describes the major components of a data warehousing process including data extraction, transformation, loading, the data store, data marts, and metadata. Dimensional modeling and star schemas are discussed as common approaches in data warehouse design.

Intro. to Data Warehouse
รศ.ดร. วรพจน์ กรีสุระเดช
Worapoj Kreesuradej, Ph.D.
Associate Professor

Data Mining & Data Exploration Laboratory (DME Lab),


Faculty of Information Technology,
King Mongkut's Institute of Technology Ladkrabang,
Web: www.it.kmitl.ac.th/dme
Email: [email protected]
Book
• Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001.
• Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, John Wiley & Sons, 2002.
Definition of DW
• “A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon
• A decision support database that is maintained separately from the organization’s operational databases.
• A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format. - E. Turban et al.
R. Kimball’s definition of a DW
• A data warehouse is a copy of transactional data specifically structured for querying and analysis.
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems
• Result of application (user)-driven development of operational systems
[Figure: operational applications (Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory, ...) grouped into separate vertical silos by department (Sales Administration, Finance, Manufacturing, ...)]
Problem: Data Management in Large Enterprises
• Two approaches for accessing data:
  • Query-driven (lazy)
  • Warehouse (eager)
The Need for DW
• Query-driven (lazy, on-demand)
[Figure: Clients send queries to an Integration System (with Metadata), which reaches each Source through its own Wrapper]
Disadvantages of Query-Driven Approach
• Delay in query processing
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
The Warehousing Approach
• Information integrated in advance
• Stored in the warehouse for direct querying and analysis
[Figure: an Extractor/Monitor on each Source feeds the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]
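The difference between the query-driven and the warehousing approach can be sketched in a few lines of Python. This is an illustrative toy, not part of the original slides: the sample data and the names `Sources`, `integrate`, `query_driven_sales`, and `Warehouse` are all hypothetical.

```python
from collections import defaultdict


class Sources:
    """Two hypothetical operational sources with different record layouts."""

    def __init__(self):
        self.reads = 0  # how many times the sources were accessed
        self.retail = [{"prod": "A1", "amt": 10.0}, {"prod": "B2", "amt": 5.0}]
        self.catalog = [{"product_code": "A1", "total": 7.5}]

    def read_all(self):
        """Wrapper/extractor step: return rows as (product, amount) pairs."""
        self.reads += 1
        rows = [(r["prod"], r["amt"]) for r in self.retail]
        rows += [(r["product_code"], r["total"]) for r in self.catalog]
        return rows


def integrate(rows):
    """Integration step: total sales per product across all sources."""
    totals = defaultdict(float)
    for product, amount in rows:
        totals[product] += amount
    return dict(totals)


def query_driven_sales(sources):
    """Lazy approach: go back to the sources on every query."""
    return integrate(sources.read_all())


class Warehouse:
    """Eager approach: integrate in advance, answer queries from the copy."""

    def __init__(self, sources):
        self.sources = sources
        self.totals = {}

    def refresh(self):
        self.totals = integrate(self.sources.read_all())

    def sales(self):
        return dict(self.totals)  # no source access at query time
```

In this sketch every call to `query_driven_sales` touches the sources again, while `Warehouse.sales` only reads the stored copy, so its answers reflect the state of the sources at the last `refresh`.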
Advantages of Warehousing Approach
• High query performance
• Doesn’t interfere with local processing at sources
• Information copied at warehouse
  • Can modify, annotate, summarize, restructure, etc.
  • Can store historical information
  • Security, no auditing
Characteristics of DW
Subject oriented: Data are organized by how users refer to them.
Integrated: Inconsistencies are removed in both nomenclature and conflicting information (i.e., data are ‘clean’).
Non-volatile: Read-only data. Data do not change over time.
Time variant: Data are time series, not current status.
Subject Oriented
• Data Warehouse is designed around “subjects” rather than processes
• A company may have
  • Retail Sales System
  • Outlet Sales System
  • Catalog Sales System
• DW will have a Sales Subject Area
Subject Oriented
[Figure: three OLTP systems (Retail Sales System, Outlet Sales System, Catalog Sales System) feed the Sales Subject Area of the Data Warehouse, which holds subject-oriented sales information]
Integrated
• Heterogeneous source systems
• Need to integrate source data
  • For example: product codes could be different in different systems
  • Arrive at a common code in the DW
Integrated
• Information integrated in advance
• Stored in DW for direct querying and analysis
[Figure: an Extractor/Monitor on each Source feeds the Integration System (with Metadata), which loads the Data Warehouse queried by Clients]
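The product-code case mentioned under "Integrated" can be illustrated with a small lookup table. The sketch below is only an assumption about how such a mapping might look: the system names, the local codes, and the `PROD...` warehouse codes are invented for illustration.

```python
# Hypothetical mapping from (source system, local code) to one warehouse code.
CODE_MAP = {
    ("retail", "P-001"): "PROD0001",
    ("outlet", "1"): "PROD0001",
    ("catalog", "SKU1"): "PROD0001",
    ("retail", "P-002"): "PROD0002",
}


def to_warehouse_code(system, local_code):
    """Return the common warehouse code, or raise if the code is unmapped."""
    try:
        return CODE_MAP[(system, local_code)]
    except KeyError:
        raise ValueError(f"no mapping for {local_code!r} from {system!r}") from None


def integrate(rows):
    """Rewrite (system, local_code, quantity) rows to use warehouse codes."""
    return [(to_warehouse_code(s, c), q) for s, c, q in rows]
```

Raising on an unmapped code, rather than passing it through, is one possible design choice: it keeps inconsistent codes out of the warehouse until someone extends the mapping.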
Non-Volatile
• Operational update of data does not occur in the data warehouse environment.
  • Does not require transaction processing, recovery, and concurrency control mechanisms
  • Requires only two operations in data accessing: initial loading of data and access of data.
Non-Volatile (Read-Mostly)
[Figure: users both write to and read from the OLTP system; users only read from the DW]
Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
  • Operational database: current value data.
  • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
Time Variant
• Most business analysis has a time component
• Trend analysis (historical data is required)
[Figure: chart of Sales by year, 2001 to 2004]
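A minimal sketch of the contrast between current-value data and time-variant data, assuming a hypothetical product price that changes over time. The names `current_price`, `price_history`, `record_price`, and `yearly_average` are illustrative and do not come from the slides.

```python
from collections import defaultdict
from datetime import date

# Operational view: only the current value per product, overwritten on update.
current_price = {}

# Warehouse view: dated snapshots are appended and never updated in place.
price_history = []  # list of (snapshot_date, product, price)


def record_price(snapshot_date, product, price):
    """Apply one price change to both views."""
    current_price[product] = price                         # overwrite
    price_history.append((snapshot_date, product, price))  # append only


def yearly_average(product):
    """Average recorded price per year, the kind of input trend analysis needs."""
    by_year = defaultdict(list)
    for snapshot_date, prod, price in price_history:
        if prod == product:
            by_year[snapshot_date.year].append(price)
    return {year: sum(v) / len(v) for year, v in sorted(by_year.items())}
```

Only the history list can answer a question such as "how did the price develop from 2001 to 2002"; the current-value dictionary has already discarded the older values.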
Data Warehousing Process Overview
• The major components of a data warehousing process
  • Data sources
  • Data extraction
  • Data loading
  • Comprehensive database / data store
  • Data mart
  • Metadata
  • Middleware tools / information delivery tools
ETL
• Data Extraction
• Data Cleaning and Transformation
  • Convert from legacy/host format to warehouse format
• Load
  • Sort, summarize, consolidate, compute views, check integrity, build indexes, partition
The ETL Process
[Figure: Source Systems -> (Extract) -> Staging Area (Transform) -> (Load) -> DW Database]
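The extract, transform, and load steps can be sketched with Python's standard `sqlite3` module. This is a toy pipeline under stated assumptions: the source rows, the `sales` table, and the function names are hypothetical, and an in-memory list stands in for the staging area.

```python
import sqlite3

# Hypothetical rows from a legacy source; every value arrives as a string.
SOURCE_ROWS = [
    {"id": "1", "name": " alice ", "amount": "10.50"},
    {"id": "2", "name": "BOB", "amount": "3"},
    {"id": "2", "name": "BOB", "amount": "3"},      # duplicate row
    {"id": "3", "name": "carol", "amount": "n/a"},  # fails the type check
]


def extract():
    """Extract: copy raw rows from the source into the staging area."""
    return [dict(row) for row in SOURCE_ROWS]


def transform(staged):
    """Transform: clean names, convert types, drop bad rows, deduplicate."""
    seen, clean = set(), []
    for row in staged:
        try:
            record = (int(row["id"]), row["name"].strip().title(),
                      float(row["amount"]))
        except ValueError:
            continue  # a real pipeline would log or quarantine this row
        if record not in seen:
            seen.add(record)
            clean.append(record)
    return clean


def load(conn, records):
    """Load: insert into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (id INTEGER, name TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_name ON sales (name)")
    conn.commit()


def run_etl(conn):
    load(conn, transform(extract()))
```

Calling `run_etl(sqlite3.connect(":memory:"))` runs the whole pipeline against a throwaway database.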


Data Staging Area
• A storage area where extracted data is cleaned, transformed and deduplicated.
• Initial storage for data
• Need not be based on the relational model
• Mainly sorting and sequential processing
• Does not provide data access to users
• Analogy: kitchen of a restaurant
ETL Process Issues & Challenges
• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Different currencies, measurement units
• Ensuring data quality
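As one illustration of the "different currencies, measurement units" issue, the sketch below converts source rows to a single currency and a single weight unit. The conversion factors for currencies are placeholders, not real exchange rates, and the field names and the `standardize` function are assumptions made for this example.

```python
# Placeholder factors for illustration; real rates would come from a
# maintained reference table, not from constants in code.
TO_USD = {"USD": 1.0, "THB": 0.03, "EUR": 1.1}
TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def standardize(row):
    """Convert one source row to USD and kilograms; reject unknown units."""
    currency, unit = row["currency"], row["unit"]
    if currency not in TO_USD or unit not in TO_KG:
        raise ValueError(f"unknown currency or unit in {row!r}")
    return {
        "amount_usd": round(row["amount"] * TO_USD[currency], 2),
        "weight_kg": round(row["weight"] * TO_KG[unit], 3),
    }
```

Rejecting rows with unknown units is a simple data-quality gate; whether to reject, quarantine, or default such rows is a project decision.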
Comprehensive Database / Data Store
• Mostly a relational DB
  • Oracle, DB2, Sybase, SQL Server
• New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)
Data Warehouse Design
• OLTP systems are data capture systems
  • “DATA IN” systems
• DW are “DATA OUT” systems
[Figure: OLTP (data in) versus DW (data out)]
Dimensional Modeling
• Facts are stored in FACT tables
• Dimensions are stored in DIMENSION tables
• Dimension tables contain textual descriptors of the business
• Fact and dimension tables form a Star Schema
• “BIG” fact table in center surrounded by “SMALL” dimension tables
Star Schema
SALES (fact table; references TIME, PRODUCT, and CUSTOMER)
  # TIME_KEY
  # PRODUCT_KEY
  # CUSTOMER_KEY
  * PRICE
  * QUANTITY
  * SALES
TIME (referenced by SALES)
  # TIME_KEY
  * ORDERDATE
  * DAY_OF_WEEK
  * DAY_NUMBER_IN_MONTH
  * DAY_NUMBER_IN_YEAR
  * WEEK_NUMBER
  * MONTH
  * QUARTER
  * HOLIDAY_FLAG
  * FISCAL_YEAR
  * FISCAL_QUARTER
CUSTOMER (referenced by SALES)
  # CUSTOMER_KEY
  * CID
  * CNAME
  * STATE
  * CITY
PRODUCT (referenced by SALES)
  # PRODUCT_KEY
  * PID
  * PNAME
  * PCNAME
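A star schema along the lines of the SALES, TIME, PRODUCT, and CUSTOMER tables can be tried out with Python's standard `sqlite3` module. This is a reduced sketch, not the slide's exact design: the table names carry `_dim` and `_fact` suffixes, only a few TIME columns are kept, the sample rows are invented, and `pcname` is assumed to mean a product category name.

```python
import sqlite3


def build_star_schema(conn):
    """Create one fact table surrounded by three dimension tables."""
    conn.executescript("""
    CREATE TABLE time_dim (
        time_key INTEGER PRIMARY KEY, orderdate TEXT,
        month INTEGER, quarter INTEGER);
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY, pid TEXT, pname TEXT, pcname TEXT);
    CREATE TABLE customer_dim (
        customer_key INTEGER PRIMARY KEY, cid TEXT, cname TEXT,
        state TEXT, city TEXT);
    CREATE TABLE sales_fact (
        time_key     INTEGER REFERENCES time_dim(time_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        customer_key INTEGER REFERENCES customer_dim(customer_key),
        price REAL, quantity INTEGER, sales REAL);

    -- Invented sample rows.
    INSERT INTO time_dim VALUES (1, '2003-01-15', 1, 1), (2, '2003-04-20', 4, 2);
    INSERT INTO product_dim VALUES
        (1, 'P1', 'Pen', 'Stationery'), (2, 'P2', 'Lamp', 'Home');
    INSERT INTO customer_dim VALUES (1, 'C1', 'Ann', 'CA', 'Los Angeles');
    INSERT INTO sales_fact VALUES
        (1, 1, 1, 2.0, 10, 20.0),
        (1, 2, 1, 15.0, 1, 15.0),
        (2, 1, 1, 2.0, 5, 10.0);
    """)


def sales_by_quarter_and_category(conn):
    """Join the fact table to two dimensions and aggregate the sales measure."""
    return conn.execute("""
        SELECT t.quarter, p.pcname, SUM(f.sales)
        FROM sales_fact AS f
        JOIN time_dim    AS t ON f.time_key = t.time_key
        JOIN product_dim AS p ON f.product_key = p.product_key
        GROUP BY t.quarter, p.pcname
        ORDER BY t.quarter, p.pcname
    """).fetchall()
```

The query shape is typical for a star schema: filter and group by descriptive columns in the dimension tables, and aggregate the numeric measures in the fact table.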
Data mart
• Data mart = subset of DW for community users, e.g. accounting department
• Sometimes exists as a multidimensional database
• Info mart = summarized data + report for community users
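One way to picture a data mart as a subset of the warehouse, and an info mart as summarized data on top of it, is the following `sqlite3` sketch. The table names, the `subject` column, and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_marts(conn):
    """Derive a departmental data mart and a summary table from the warehouse."""
    conn.executescript("""
    CREATE TABLE warehouse_facts (subject TEXT, region TEXT, amount REAL);
    INSERT INTO warehouse_facts VALUES
        ('accounting', 'north', 100.0),
        ('accounting', 'south', 50.0),
        ('marketing',  'north', 70.0);

    -- Data mart: the subset of warehouse rows one user community needs.
    CREATE TABLE accounting_mart AS
        SELECT region, amount FROM warehouse_facts
        WHERE subject = 'accounting';

    -- Info mart: summarized data ready for reporting.
    CREATE TABLE accounting_info_mart AS
        SELECT COUNT(*) AS n_rows, SUM(amount) AS total FROM accounting_mart;
    """)
```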
Meta Data
• Data about data
• Needed by both information technology personnel and users
  • IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
  • Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
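A metadata entry that serves both audiences might be sketched as a small record type. The `ColumnMetadata` class, its fields, and the sample entry below are assumptions made for illustration, not a standard metadata format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    # Technical metadata, mainly for IT personnel.
    source: str            # where the data comes from
    target_table: str      # warehouse table it is loaded into
    target_column: str     # warehouse column it is loaded into
    refresh_schedule: str  # how often the column is reloaded
    # Business metadata, mainly for users.
    definition: str        # plain-language meaning of the attribute


# Hypothetical catalog with a single entry.
CATALOG = [
    ColumnMetadata(
        source="retail_sales.orders.amt",
        target_table="sales_fact",
        target_column="sales",
        refresh_schedule="nightly",
        definition="Sale amount for one order line",
    ),
]


def describe(table, column):
    """Return the business definition of a warehouse column, if recorded."""
    for entry in CATALOG:
        if (entry.target_table, entry.target_column) == (table, column):
            return entry.definition
    return None
```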
Information Delivery Tools
• Tools
  • Query & reporting
  • OLAP
  • Data mining, visualization, segmentation, clustering
  • New developments: text mining, web mining & personalization
  • Mining multimedia data
Information Delivery Tools
• Commercial tools
  • Crystal Reports, Impromptu, WebFocus
• Increasingly common mode of delivery: Web-enabled
Data Warehouse Architecture
• Data Flow Architecture
• System Architecture
Data Flow Architecture
• Operational data stores (ODS): a type of database often used as an interim area for a data warehouse, especially for customer information files
• MDB = multidimensional databases
System Architectures
• Three parts of the data warehouse
  • The data warehouse that contains the data and associated software
  • Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse
  • Client (front-end) software that allows users to access and analyze data from the warehouse
Data Warehouse Development
• Data warehouse development approaches
  • Inmon Model: EDW approach, enterprise-wide warehouse, top down
  • Kimball Model: data mart approach, data mart, bottom up
• Which model is best?
  • There is no one-size-fits-all strategy to data warehousing
  • When properly executed, both result in an enterprise-wide data warehouse, but with different architectures
The Data Mart Strategy
• The most common approach
• Begins with a single mart, and architected marts are added over time for more subject areas
• Relatively inexpensive and easy to implement
• Can be used as a proof of concept for data warehousing
• Can perpetuate the “silos of information” problem
• Can postpone difficult decisions and activities
• Requires an overall integration plan
The Enterprise-wide Strategy
• A comprehensive warehouse is built initially
• An initial dependent data mart is built using a subset of the data in the warehouse
• Additional data marts are built using subsets of the data in the warehouse
• Like all complex projects, it is expensive, time-consuming, and prone to failure
• When successful, it results in an integrated, scalable warehouse
DW Lifecycle (Ralph Kimball)
Data Warehouse Development
• Some best practices for implementing a data warehouse (Weir, 2002):
  • Project must fit with corporate strategy and business objectives
  • There must be complete buy-in to the project by executives, managers, and users
  • It is important to manage user expectations about the completed project
  • The data warehouse must be built incrementally
  • Build in adaptability
Data Warehouse Development
• Some best practices for implementing a data warehouse (Weir, 2002):
  • The project must be managed by both IT and business professionals
  • Develop a business/supplier relationship
  • Only load data that have been cleansed and are of a quality understood by the organization
  • Do not overlook training requirements
  • Be politically aware
