Data Warehouse Components

The document discusses the key components of a data warehouse, including that it is a subject-oriented collection of integrated and non-volatile data from multiple sources used to support analysis and decision making. It is maintained separately from operational databases for performance and focuses on historical and aggregated data rather than real-time transactions. The document contrasts data warehouses with operational databases and heterogeneous databases.

Uploaded by

durai murugan
DATA WAREHOUSE COMPONENTS

What is a Data Warehouse?
• Loosely speaking, a data warehouse refers to a database that
is maintained separately from an organization’s operational
database

• Officially speaking:
• “A data warehouse is a subject-oriented, integrated, time-
variant, and nonvolatile collection of data in support of
management’s decision-making process.”—W. H. Inmon
Data Warehouse—Subject-Oriented

• Organized around major subjects, such as customer, product, and
sales
• Focuses on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing
• Provides a simple and concise view of particular subject
issues by excluding data that are not useful in the decision
support process
Subject Oriented
• Example for an insurance company:
[Figure: the Applications Area (Auto and Fire Policy Processing Systems, Commercial and Life Insurance Systems, Claims Processing System, Accounting System, Billing System) contrasted with the Data Warehouse, which is organized by subject (Customer, Policy, Losses, Premium).]
Data Warehouse—Integrated
• Constructed by integrating multiple, heterogeneous
data sources
– relational databases, flat files, on-line transaction
records
• Data cleaning and data integration techniques are
applied.
– Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different
data sources
• E.g., Hotel price: currency, tax, breakfast covered, etc.
– Before data is moved to the warehouse, it is
transformed to a common schema.
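The hotel price example above can be sketched as follows. This is only an illustration: the two source layouts, the field names, the exchange rate, and the tax rate are all assumptions made up for the sketch, not part of the original material.

```python
from dataclasses import dataclass

# Hypothetical conversion rates and tax rate, assumed for illustration only.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.25}
TAX_RATE = 0.10


@dataclass(frozen=True)
class HotelPrice:
    """Common warehouse schema: USD, tax included, explicit breakfast flag."""
    hotel: str
    price_usd: float
    breakfast_included: bool


def from_source_a(row: dict) -> HotelPrice:
    # Source A (assumed): USD, price excludes tax, breakfast flag is "Y"/"N".
    return HotelPrice(
        hotel=row["name"].strip().title(),
        price_usd=round(row["price"] * (1 + TAX_RATE), 2),
        breakfast_included=row["bfast"] == "Y",
    )


def from_source_b(row: dict) -> HotelPrice:
    # Source B (assumed): EUR, price includes tax, breakfast is a boolean.
    return HotelPrice(
        hotel=row["hotel_name"].strip().title(),
        price_usd=round(row["amount_eur"] * RATES_TO_USD["EUR"], 2),
        breakfast_included=row["breakfast"],
    )


a = from_source_a({"name": "grand plaza ", "price": 100.0, "bfast": "Y"})
b = from_source_b({"hotel_name": "Grand Plaza", "amount_eur": 88.0, "breakfast": True})
```

Once both records share the same naming, currency, and tax convention, they can be compared or aggregated directly.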
Integrated
• Data is stored once in a single integrated location
(e.g. insurance company)

[Figure: customer data stored in several operational databases (Auto Policy Processing System, Fire Policy Processing System, and the FACTS, LIFE, Commercial, and Accounting applications) is integrated into a single Data Warehouse Database under Subject = Customer.]
Data Warehouse—Time Variant

• The time horizon for the data warehouse is
significantly longer than that of operational
systems
– Operational database: current value data
– Data warehouse data: provide information
from a historical perspective (e.g., past 5-10
years)
Time-Variant
• Data is stored as a series of snapshots or views which record how it is
collected across time.
[Figure: a Data Warehouse Data record, with a Key that contains a Time element, followed by the Data.]
• Data is tagged with some element of time: creation date, as-of
date, etc.
• Data is available on-line for long periods of time for trend
analysis and forecasting. For example, five or more years
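As a rough sketch of time-tagged snapshots, the example below keeps one record per as-of date and answers "what was the value at time T?". The customer ID, dates, and balances are invented for illustration.

```python
from datetime import date

# Append-only snapshots: (customer_id, as_of_date, balance). Values are made up.
snapshots = [
    ("C1", date(2021, 1, 1), 500),
    ("C1", date(2022, 1, 1), 650),
    ("C1", date(2023, 1, 1), 700),
]


def balance_as_of(customer_id, when):
    """Return the most recent snapshot value on or before `when`, or None."""
    candidates = [
        (as_of, value)
        for cid, as_of, value in snapshots
        if cid == customer_id and as_of <= when
    ]
    return max(candidates)[1] if candidates else None


# The full history stays available for trend analysis.
trend = [value for cid, _, value in snapshots if cid == "C1"]
```

An operational database would typically hold only the latest balance; here every earlier value remains queryable.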

Data Warehouse—Nonvolatile
• A physically separate store of data transformed from the
operational environment
• Operational update of data does not occur in the data
warehouse environment
– Does not require transaction processing, recovery,
and concurrency control mechanisms
– Requires only two operations in data accessing:
• initial loading of data and access of data
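The two operations above (loading and access) can be illustrated with a small in-memory store. `WarehouseTable` is a hypothetical class written for this sketch; it deliberately offers no update or delete method.

```python
class WarehouseTable:
    """Append-only table: supports bulk load and read access, nothing else."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        # Loading appends immutable copies; existing rows are never modified.
        self._rows.extend(tuple(r) for r in rows)

    def scan(self, predicate=lambda row: True):
        # Read access returns a new list, so callers cannot alter stored rows.
        return [row for row in self._rows if predicate(row)]


table = WarehouseTable()
table.load([("2023", 10)])      # initial load
table.load([["2024", 12]])      # later load adds rows, never overwrites
recent = table.scan(lambda row: row[0] >= "2024")
```

Because nothing is updated in place, the store needs no concurrency control or transaction recovery of the kind an operational database requires.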
Non-Volatile
• Existing data in the warehouse is not overwritten or
updated.
[Figure: Production Databases, used by Production Applications, support Update, Insert, and Delete. The Data Warehouse Database, in the Data Warehouse Environment, is populated by Load from the production databases and External Sources and is otherwise Read-Only.]
Data Warehouse vs. Heterogeneous DBMS

• Traditional heterogeneous DB integration: a query-driven approach
◦ Build wrappers/mediators on top of heterogeneous databases
◦ When a query is posed to a client site, a meta-dictionary is used to
translate the query into queries appropriate for the individual
heterogeneous sites involved, and the results are integrated into a
global answer set
◦ Requires complex information filtering and competes for resources
at the source sites
• Data warehouse: update-driven, high performance
◦ Information from heterogeneous sources is integrated in advance and
stored in warehouses for direct query and analysis
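The contrast between the two approaches can be sketched in a few lines. The two sources, their field names, and the wrapper functions are hypothetical; the point is only where the integration work happens.

```python
# Two hypothetical sources with different field names.
source_a = [{"cust": "Ann", "amount": 120}, {"cust": "Bob", "amount": 80}]
source_b = [{"customer_name": "Ann", "total": 30}]


def wrap_a():
    # Wrapper: translate source A rows into a common (name, amount) form.
    return [(r["cust"], r["amount"]) for r in source_a]


def wrap_b():
    # Wrapper: translate source B rows into the same common form.
    return [(r["customer_name"], r["total"]) for r in source_b]


def mediator_total(customer):
    # Query-driven: every query contacts and translates all sources.
    return sum(
        amount
        for wrapper in (wrap_a, wrap_b)
        for name, amount in wrapper()
        if name == customer
    )


def build_warehouse():
    # Update-driven: integrate once, in advance, and store the result.
    totals = {}
    for wrapper in (wrap_a, wrap_b):
        for name, amount in wrapper():
            totals[name] = totals.get(name, 0) + amount
    return totals


warehouse = build_warehouse()
```

Both paths give the same answer, but a lookup in `warehouse` touches no source system at query time.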
Data Warehouse vs. Operational DBMS

• OLTP (on-line transaction processing)
– Major task of traditional relational DBMS
– Day-to-day operations: purchasing, inventory,
banking, manufacturing, payroll, registration,
accounting, etc.

• OLAP (on-line analytical processing)
– Major task of data warehouse system
– Data analysis and decision making

• Distinct features (OLTP vs. OLAP):
– User and system orientation: customer vs. market
– Data contents: current, detailed vs. historical,
consolidated
– Database design: ER + application vs. star + subject
– View: current, local vs. evolutionary, integrated
OLTP vs. OLAP
                         OLTP                          OLAP
users                    clerk, IT professional        knowledge worker
function                 day-to-day operations         decision support
DB design                application-oriented          subject-oriented
data                     current, up-to-date,          historical, summarized,
                         detailed, flat relational,    multidimensional,
                         isolated                      integrated, consolidated
usage                    repetitive                    ad hoc
access                   read/write,                   lots of scans
                         index/hash on primary key
unit of work             short, simple transaction     complex query
no. of records accessed  tens                          millions
no. of users             thousands                     hundreds
DB size                  100 MB to GB                  100 GB to TB
metric                   transaction throughput        query throughput, response time
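To make the access-pattern rows of the comparison concrete, here is a small sketch using Python's built-in `sqlite3` module. The `orders` table and its contents are invented for illustration; a real OLTP system and a real warehouse would be separate databases.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id INTEGER PRIMARY KEY, region TEXT, year INTEGER, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [
        (1, "East", 2022, 100.0),
        (2, "East", 2023, 150.0),
        (3, "West", 2022, 200.0),
        (4, "West", 2023, 50.0),
    ],
)

# OLTP-style work: a short transaction touching one row via the primary key.
with conn:
    conn.execute("UPDATE orders SET amount = ? WHERE order_id = ?", (175.0, 2))

# OLAP-style work: a read-only query that scans and summarizes many rows.
summary = conn.execute(
    "SELECT region, year, SUM(amount) FROM orders "
    "GROUP BY region, year ORDER BY region, year"
).fetchall()
```

The update reads and writes a single record; the summary query reads every record and returns consolidated totals.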
Why Separate Data Warehouse?

• High performance for both systems
– DBMS: tuned for OLTP (access methods, indexing, concurrency control,
recovery)
– Warehouse: tuned for OLAP (complex OLAP queries, multidimensional
view, consolidation)
• Different functions and different data:
– missing data: decision support requires historical data which operational
DBs do not typically maintain
– data consolidation: decision support requires consolidation (aggregation,
summarization) of data from heterogeneous sources
– data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
• Note: There are more and more systems which perform OLAP analysis
directly on relational databases
Data warehouse architecture
Design of a Data Warehouse: A Business Analysis Framework
• Four views regarding the design of a data warehouse
– Top-down view
• allows selection of the relevant information necessary for the data
warehouse
– Data source view
• exposes the information being captured, stored, and managed by
operational systems
– Data warehouse view
• consists of fact tables and dimension tables
– Business query view
• sees the perspectives of data in the warehouse from the view of
end-user (profit, etc)
Data Warehouse Design Process
• Top-down, bottom-up approaches or a combination of both
– Top-down: Starts with overall design and planning (mature)
– Bottom-up: Starts with experiments and prototypes (rapid)
• From software engineering point of view
– Waterfall: structured and systematic analysis at each step before
proceeding to the next
– Spiral: rapid generation of increasingly functional systems, with
short turnaround time
• Typical data warehouse design process
– Choose a business process to model, e.g., orders, invoices, etc.
– Choose the grain (atomic level of data) of the business process
– Choose the dimensions that will apply to each fact table record
– Choose the measures that will populate each fact table record
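The four design steps above can be sketched as a tiny star schema, again with the built-in `sqlite3` module. The business process (orders), the grain (one row per product per order date), the two dimensions, and the two measures are choices assumed for this illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimensions chosen for the 'orders' business process.
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
    CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table. Grain: one row per product per order date.
    -- Measures: quantity and amount.
    CREATE TABLE fact_orders (
        date_key    INTEGER REFERENCES dim_date(date_key),
        product_key INTEGER REFERENCES dim_product(product_key),
        quantity    INTEGER,
        amount      REAL
    );

    INSERT INTO dim_date VALUES (20230115, 2023, 1), (20230220, 2023, 2);
    INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
    INSERT INTO fact_orders VALUES
        (20230115, 1, 2, 2000.0),
        (20230115, 2, 1, 300.0),
        (20230220, 1, 1, 1000.0);
    """
)

# A typical analytical query: measures summarized along two dimensions.
rows = conn.execute(
    """
    SELECT p.category, d.month, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.month
    ORDER BY p.category, d.month
    """
).fetchall()
```

Choosing a finer grain (for example, one row per order line) would allow more detailed analysis at the cost of a larger fact table.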
Data Warehouse Architectures: Conceptual View
• Single-layer
– Every data element is stored once only
– Virtual warehouse
[Figure: operational systems and informational systems both work directly on "Real-time data".]

• Two-layer
– Real-time + derived data
– Most commonly used approach in
industry today
[Figure: operational systems work on Real-time data; informational systems work on Derived Data.]
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived
data really requires two steps
[Figure: operational systems work on Real-time data; Reconciled Data forms the physical implementation of the data warehouse; Derived Data sits at the view level and serves the "particular informational needs" of informational systems.]
Data Warehouse: A Multi-Tiered Architecture

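[Figure: four tiers. Data Sources: operational DBs and other sources, fed into the warehouse through Extract, Transform, Load, and Refresh. Data Storage: the Data Warehouse and Data Marts, with Metadata and a Monitor & Integrator. OLAP Engine: OLAP Server, which serves the front end. Front-End Tools: Analysis, Query, Reports, Data mining.]
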
Three Data Warehouse Models
• Enterprise warehouse
– collects all of the information about subjects spanning the
entire organization
• Data Mart
– a subset of corporate-wide data that is of value to a specific
group of users. Its scope is confined to specific, selected
subjects, such as a marketing data mart
• Independent vs. dependent (sourced directly from the warehouse) data mart
• Virtual warehouse
– A set of views over operational databases
– Only some of the possible summary views may be
materialized
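A virtual warehouse can be approximated with ordinary database views. The sketch below uses `sqlite3`; the `sales` table is invented, and since SQLite has no materialized views, a plain table created from the view stands in for a materialized summary (an assumption of this sketch, not a SQLite feature).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Operational table (hypothetical).
    CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, product TEXT, amount REAL);
    INSERT INTO sales VALUES (1, 'Desk', 300.0), (2, 'Desk', 200.0), (3, 'Lamp', 40.0);

    -- Virtual warehouse: a summary view computed on demand.
    CREATE VIEW v_sales_by_product AS
        SELECT product, SUM(amount) AS total FROM sales GROUP BY product;

    -- Stand-in for a materialized view: a stored copy of the view's result.
    CREATE TABLE m_sales_by_product AS SELECT * FROM v_sales_by_product;

    -- A new operational row arrives after materialization.
    INSERT INTO sales VALUES (4, 'Lamp', 60.0);
    """
)

view_rows = conn.execute(
    "SELECT product, total FROM v_sales_by_product ORDER BY product"
).fetchall()
mat_rows = conn.execute(
    "SELECT product, total FROM m_sales_by_product ORDER BY product"
).fetchall()
```

The view always reflects current operational data but is recomputed on every query, which adds load to the operational database. The stored copy is cheap to read but stale until refreshed.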
Data Warehouse Back-End Tools and Utilities
• Data extraction
– get data from multiple, heterogeneous, and external sources
• Data cleaning
– detect errors in the data and rectify them when possible
• Data transformation
– convert data from legacy or host format to warehouse format
• Load
– sort, summarize, consolidate, compute views, check integrity,
and build indices and partitions
• Refresh
– propagate the updates from the data sources to the
warehouse
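The back-end steps listed above can be chained into a very small pipeline. Everything here is a simplified assumption for illustration: the two sources, their fields, the cleaning rule (drop rows with no amount), and the warehouse format (totals per customer in integer cents).

```python
from collections import defaultdict


def extract(sources):
    # Extraction: gather rows from multiple sources (copied, not shared).
    return [dict(row) for source in sources for row in source]


def clean(rows):
    # Cleaning: drop rows that cannot be rectified, normalize names.
    cleaned = []
    for row in rows:
        if row.get("amount") is None:
            continue
        row["customer"] = row["customer"].strip().title()
        cleaned.append(row)
    return cleaned


def transform(rows):
    # Transformation: convert to the assumed warehouse format,
    # (customer, amount in integer cents).
    return [(r["customer"], round(float(r["amount"]) * 100)) for r in rows]


def load(warehouse, rows):
    # Load: sort and summarize into per-customer totals.
    for customer, cents in sorted(rows):
        warehouse[customer] += cents
    return warehouse


legacy = [{"customer": " ann ", "amount": "12.50"}, {"customer": "bob", "amount": None}]
web = [{"customer": "Ann", "amount": 7.25}]

warehouse = load(defaultdict(int), transform(clean(extract([legacy, web]))))

# Refresh: propagate a later source update into the existing warehouse.
load(warehouse, transform(clean(extract([[{"customer": "bob", "amount": 3}]]))))
```

Production tools add scheduling, error reporting, and incremental change detection on top of this basic flow.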
Metadata Repository
• Meta data is the data defining warehouse objects. It stores:
• Description of the structure of the data warehouse
– schema, views, dimensions, hierarchies, derived data definitions, data mart
locations and contents
• Operational meta-data
– data lineage (history of migrated data and transformation path), currency
of data (active, archived, or purged), monitoring information (warehouse
usage statistics, error reports, audit trails)
• The algorithms used for summarization
• The mapping from operational environment to the data warehouse
• Data related to system performance
– warehouse schema, view and derived data definitions
• Business data
– business terms and definitions, ownership of data, charging policies
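One way to picture a metadata repository entry is as a small record per warehouse object. The classes and field names below are hypothetical and cover only a few of the categories listed above (structure, lineage, currency, summarization, ownership).

```python
from dataclasses import dataclass, field


@dataclass
class LineageStep:
    """One hop in the history of migrated data."""
    source: str
    transformation: str


@dataclass
class TableMetadata:
    name: str
    columns: list                    # structural metadata
    owner: str                       # business metadata
    currency: str = "active"         # active, archived, or purged
    summarization: str = ""          # algorithm used to summarize, if any
    lineage: list = field(default_factory=list)  # operational metadata


meta = TableMetadata(
    name="fact_sales",
    columns=["date_key", "product_key", "amount"],
    owner="Finance",
    summarization="SUM(amount) per product per day",
)
meta.lineage.append(LineageStep("orders_db.orders", "convert currency to USD"))
```

Keeping lineage next to the schema lets an analyst trace a warehouse figure back to the operational source and the transformations applied to it.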
Data Mining and Visualization
• Knowledge discovery using a blend of statistical, AI, and
computer graphics techniques
• Goals:
– Explain observed events or conditions
– Confirm hypotheses
– Explore data for new or unexpected relationships
• Data mining techniques
– Statistical regression
– Association rules
– Classification
– Clustering
• Data visualization – representing data in graphical /
multimedia formats for analysis
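As a minimal illustration of the first technique in the list, statistical regression, the sketch below fits a straight line by ordinary least squares in plain Python. The data points are invented, and real data mining tools use far more robust implementations.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly sales figures: year index vs. units sold.
years = [1, 2, 3, 4]
units = [2, 4, 6, 8]

slope, intercept = fit_line(years, units)
forecast_year_5 = slope * 5 + intercept
```

The fitted line both explains the observed trend and gives a simple forecast, which matches the "explain" and "confirm" goals listed above.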
Data Mart
• A subset of a data warehouse that supports
the requirements of a particular department
or business function.

• Characteristics include:
– Unlike data warehouses, data marts do not normally
contain detailed operational data.
– May contain certain levels of aggregation
