Data Warehousing

University of California, Berkeley


School of Information
IS 257: Database Management

IS 257 – Fall 2015 2015.11.03 - SLIDE 1


Lecture Outline
• Data Warehouses
• Introduction to Data Warehouses
• Data Warehousing
– (Based on lecture notes from Modern
Database Management Text (Hoffer, Ramesh,
Topi); Joachim Hammer, University of Florida,
and Joe Hellerstein and Mike Stonebraker of
UCB)



Overview
• Data Warehouses and Merging
Information Resources
• What is a Data Warehouse?
• History of Data Warehousing
• Types of Data and Their Uses
• Data Warehouse Architectures
• Data Warehousing Problems and Issues



Problem: Heterogeneous Information Sources

“Heterogeneities are everywhere”

[Figure: Personal Databases, Scientific Databases, World Wide Web, and Digital Libraries as heterogeneous sources]
• Different interfaces
• Different data representations
• Duplicate and inconsistent information
Slide credit: J. Hammer

Problem: Data Management in Large Enterprises

• Vertical fragmentation of informational systems (vertical stove pipes)
• Result of application (user)-driven development of operational systems

[Figure: stove-pipe applications grouped by department: Sales Administration (Sales Planning, Stock Mngmt, ...), Finance (Suppliers, Debt Mngmt, ...), Manufacturing (Num. Control, Inventory, ...), ...]

Slide credit: J. Hammer

Goal: Unified Access to Data

[Figure: an Integration System sits above the World Wide Web, Digital Libraries, Scientific Databases, and Personal Databases]

• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing
Slide credit: J. Hammer

The Traditional Research Approach

• Query-driven (lazy, on-demand)

[Figure: Clients query an Integration System (with Metadata), which reaches each Source through its own Wrapper]
Slide credit: J. Hammer

Disadvantages of Query-Driven Approach

• Delay in query processing


– Slow or unavailable information sources
– Complex filtering and integration
• Inefficient and potentially expensive for
frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry

Slide credit: J. Hammer


The Warehousing Approach
• Information integrated in advance
• Stored in WH for direct querying and analysis

[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) loads it from each Source through an Extractor/Monitor]
Slide credit: J. Hammer

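The contrast between the query-driven and warehousing approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `live_source`, `QueryDrivenMediator`, and `Warehouse` are hypothetical, sources are modeled as plain callables, and no real wrapper or extractor logic is included.

```python
from typing import Callable, List

Row = dict
# Assumption for this sketch: a source is a callable returning its current rows.
Source = Callable[[], List[Row]]


def live_source(rows: List[Row]) -> Source:
    """Wrap a mutable list so each call sees the source's current contents."""
    return lambda: list(rows)


class QueryDrivenMediator:
    """Lazy, on-demand: every query goes out to every source."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for source in self.sources for row in source() if predicate(row)]


class Warehouse:
    """Eager: rows are integrated in advance; queries read the stored copy."""

    def __init__(self, sources: List[Source]) -> None:
        self.sources = sources
        self.store: List[Row] = []

    def refresh(self) -> None:
        # Periodic extraction; between refreshes the stored copy can be stale.
        self.store = [row for source in self.sources for row in source()]

    def query(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for row in self.store if predicate(row)]
```

A row added to a source after `refresh()` is visible to the mediator at once but reaches the warehouse only at the next refresh, which is the trade-off between currency and query performance.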
Advantages of Warehousing Approach

• High query performance


– But not necessarily most current information
• Doesn’t interfere with local processing at
sources
– Complex queries at warehouse
– OLTP at information sources
• Information copied at warehouse
– Can modify, annotate, summarize, restructure, etc.
– Can store historical information
– Security, no auditing
• Has caught on in industry

Slide credit: J. Hammer


Not Either-Or Decision
• Query-driven approach still better for
– Rapidly changing information
– Rapidly changing information sources
– Truly vast amounts of data from large
numbers of sources
– Clients with unpredictable needs

Slide credit: J. Hammer


Data Warehouse Evolution

[Figure: timeline with ticks at 1960, 1975, 1980, 1985, 1990, 1995, 2000]
• Eras along the axis: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management
• Milestones above the axis: Relational Databases; Company DWs; “Building the DW”, Inmon (1992); Data Replication Tools
• Milestones below the axis: PC’s and Spreadsheets; End-user Interfaces; 1st DW Article; DW Confs.; Vendor DW Frameworks
Slide credit: J. Hammer

What is a Data Warehouse?

“A Data Warehouse is a
– subject-oriented,
– integrated,
– time-variant,
– non-volatile
collection of data used in support of
management decision making
processes.”
-- Inmon & Hackathorn, 1994: viz. Hoffer, Chap 11



DW Definition…
• Subject-Oriented:
– The data warehouse is organized around the
key subjects (or high-level entities) of the
enterprise. Major subjects include
• Customers
• Patients
• Students
• Products
• Etc.



DW Definition…
• Integrated
– The data housed in the data warehouse are
defined using consistent
• Naming conventions
• Formats
• Encoding Structures
• Related Characteristics



DW Definition…
• Time-variant
– The data in the warehouse contain a time
dimension so that they may be used as a
historical record of the business



DW Definition…
• Non-volatile
– Data in the data warehouse are loaded and
refreshed from operational systems, but
cannot be updated by end-users



What is a Data Warehouse?
A Practitioner’s Viewpoint
• “A data warehouse is simply a single,
complete, and consistent store of data
obtained from a variety of sources and
made available to end users in a way they
can understand and use it in a business
context.”
• -- Barry Devlin, IBM Consultant

Slide credit: J. Hammer

A Data Warehouse is...
• Stored collection of diverse data
– A solution to data integration problem
– Single repository of information
• Subject-oriented
– Organized by subject, not by application
– Used for analysis, data mining, etc.
• Optimized differently from transaction-
oriented db
• User interface aimed at executive decision
makers and analysts



… Cont’d
• Large volume of data (Gb, Tb)
• Non-volatile
– Historical
– Time attributes are important
• Updates infrequent
• May be append-only
• Examples
– All transactions ever at WalMart
– Complete client histories at insurance firm
– Stockbroker financial information and portfolios

Slide credit: J. Hammer


Need for Data Warehousing
• Integrated, company-wide view of high-quality
information (from disparate databases)
• Separation of operational and informational systems
and data (for improved performance)



Warehouse is a Specialized DB

Standard (Operational) DB
• Mostly updates
• Many small transactions
• Mb - Gb of data
• Current snapshot
• Index/hash on p.k.
• Raw data
• Thousands of users (e.g., clerical users)

Warehouse (Informational)
• Mostly reads
• Queries are long and complex
• Gb - Tb of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users (e.g., decision-makers, analysts)

Slide credit: J. Hammer

Warehouse vs. Data Mart



Data Warehouse Architectures
• Generic Two-Level Architecture
• Independent Data Mart
• Dependent Data Mart and Operational
Data Store
• Logical Data Mart and @ctive Warehouse
• Three-Layer architecture

All involve some form of extraction, transformation and loading (ETL)



Generic two-level data warehousing
architecture

[Figure: ETL feeds one, company-wide warehouse]

Periodic extraction → data is not completely current in warehouse


Independent data mart data warehousing
architecture

Data marts: mini-warehouses, limited in scope

[Figure: independent data mart architecture]
• Separate ETL for each independent data mart
• Data access complexity due to multiple data marts

Dependent data mart with operational data
store: a three-level architecture

[Figure: dependent data mart architecture with an operational data store (ODS)]
• ODS provides option for obtaining current data
• Single ETL for enterprise data warehouse (EDW)
• Dependent data marts loaded from EDW
• Simpler data access

Logical data mart and real time warehouse
architecture

[Figure: logical data mart and real-time warehouse architecture]
• ODS and data warehouse are one and the same
• Near real-time ETL for Data Warehouse
• Data marts are NOT separate databases, but logical views of the data warehouse
  → Easier to create new data marts
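A logical data mart can be illustrated with an ordinary SQL view. The sketch below uses Python's standard `sqlite3` module; the `sales` table, its columns, and the `west_sales_mart` view are hypothetical names chosen for illustration, not part of the lecture.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory warehouse with one logical data mart."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("west", "laptop", 1200.0), ("west", "phone", 600.0), ("east", "laptop", 1100.0)],
    )
    # The data mart is only a view over warehouse tables, not a separate database.
    conn.execute(
        "CREATE VIEW west_sales_mart AS "
        "SELECT product, SUM(amount) AS total FROM sales "
        "WHERE region = 'west' GROUP BY product"
    )
    return conn


def mart_rows(conn: sqlite3.Connection) -> list:
    """Read the data mart; the view is evaluated against current warehouse data."""
    return conn.execute(
        "SELECT product, total FROM west_sales_mart ORDER BY product"
    ).fetchall()
```

Because the mart is a view, a new row loaded into `sales` shows up in `mart_rows` without a separate ETL step, and adding another mart is a single `CREATE VIEW` statement.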


Data Characteristics
Status vs. Event Data

[Figure: a Status record before and after an Event]

Event = a database action (create/update/delete) that results from a transaction



Data Characteristics
Transient vs. Periodic Data

With transient data, changes to existing records are written over previous records, thus destroying the previous data content



Data Characteristics
Transient vs. Periodic Data

Periodic data are never physically altered or deleted once they have been added to the store
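The difference between transient and periodic data can be sketched with two small in-memory stores. The class names `TransientStore` and `PeriodicStore` and their methods are hypothetical, introduced only for this illustration.

```python
from datetime import date


class TransientStore:
    """Each write overwrites the previous value for the key."""

    def __init__(self) -> None:
        self.rows = {}

    def write(self, key, value, as_of: date) -> None:
        self.rows[key] = value  # previous content is destroyed

    def current(self, key):
        return self.rows[key]


class PeriodicStore:
    """Append-only: every write adds a time-stamped row; nothing is altered."""

    def __init__(self) -> None:
        self.rows = []  # list of (key, value, as_of)

    def write(self, key, value, as_of: date) -> None:
        self.rows.append((key, value, as_of))

    def current(self, key):
        # The most recent row for the key is the current status.
        matching = [r for r in self.rows if r[0] == key]
        return max(matching, key=lambda r: r[2])[1]

    def history(self, key):
        return [(as_of, value) for k, value, as_of in self.rows if k == key]
```

After two writes for the same key, `TransientStore` holds one row while `PeriodicStore` holds both, so only the periodic store can answer questions about the past.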


Other Data Warehouse Changes
• New descriptive attributes
• New business activity attributes
• New classes of descriptive attributes
• Descriptive attributes become more
refined
• Descriptive data are related to one another
• New source of data



The Reconciled Data Layer
• Typical operational data is:
– Transient–not historical
– Not normalized (perhaps due to denormalization for
performance)
– Restricted in scope–not comprehensive
– Sometimes poor quality–inconsistencies and errors
• After ETL, data should be:
– Detailed–not summarized yet
– Historical–periodic
– Normalized–3rd normal form or higher
– Comprehensive–enterprise-wide perspective
– Timely–data should be current enough to assist decision-making
– Quality controlled–accurate with full integrity



Types of Data
• Business Data - represents meaning
– Real-time data (ultimate source of all business data)
– Reconciled data
– Derived data
• Metadata - describes meaning
– Build-time metadata
– Control metadata
– Usage metadata
• Data as a product* - intrinsic meaning
– Produced and stored for its own intrinsic value
– e.g., the contents of a text-book

Slide credit: J. Hammer


Data Warehousing: Two Distinct Issues

• (1) How to get information into warehouse


– “Data warehousing”
• (2) What to do with data once it’s in
warehouse
– “Warehouse DBMS”
• Both rich research areas
• Industry has focused on (2)

Slide credit: J. Hammer


The ETL Process
• Capture/Extract
• Scrub or data cleansing
• Transform
• Load and Index

ETL = Extract, transform, and load



Capture/Extract…obtaining a snapshot of a
chosen subset of the source data for
loading into the data warehouse

Static extract = capturing a snapshot of the source data at a point in time
Incremental extract = capturing changes that have occurred since the last static extract
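Static and incremental extracts can be sketched as two small functions. This assumes, purely for illustration, that every source row carries an `updated_at` value that can be compared with the time of the last extract; real sources often need triggers or logs to supply that information.

```python
def static_extract(source_rows):
    """Snapshot: copy every source row as it stands now."""
    return [dict(row) for row in source_rows]


def incremental_extract(source_rows, last_extract_time):
    """Capture only rows changed since the last extract.

    Assumes each row has an 'updated_at' field (hypothetical) comparable
    with last_extract_time.
    """
    return [dict(row) for row in source_rows if row["updated_at"] > last_extract_time]
```

The incremental extract moves far less data per run, but it finds changes only if the source reliably records when each row was modified.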
Data Extraction
• Source types
– Relational, flat file, WWW, etc.
• How to get data out?
– Replication tool
– Dump file
– Create report
– ODBC or third-party “wrappers”

Slide credit: J. Hammer


Wrapper
• Converts data and queries from one data model to another

[Figure: queries and data pass between Data Model A and Data Model B]

• Extends query capabilities for sources with limited capabilities

[Figure: Queries → Wrapper → Source]

Slide credit: J. Hammer

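A minimal wrapper sketch, assuming the source is a flat CSV file with no query capability of its own. The class name `CsvWrapper` and its `select` method are hypothetical; the wrapper converts CSV lines into dictionaries and adds a simple equality selection on top of the source.

```python
import csv
import io


class CsvWrapper:
    """Wraps CSV text: rows become dicts, and equality selection is added."""

    def __init__(self, csv_text: str) -> None:
        self._text = csv_text

    def select(self, **equals):
        """Return rows whose fields equal the given values (all as strings)."""
        reader = csv.DictReader(io.StringIO(self._text))
        return [
            dict(row)
            for row in reader
            if all(row.get(field) == value for field, value in equals.items())
        ]
```

For example, `CsvWrapper("clerk,age\nMary,25\nJohn,31\n").select(clerk="Mary")` yields the single row for Mary, although the underlying file supports no selection at all.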
Wrapper Generation
• Solution 1: Hard code for each source
• Solution 2: Automatic wrapper generation

[Figure: Wrapper Definition → Wrapper Generator → Wrapper]

Slide credit: J. Hammer

Monitors
• Goal: Detect changes of interest and
propagate to integrator
• How?
– Triggers
– Replication server
– Log sniffer
– Compare query results
– Compare snapshots/dumps

Slide credit: J. Hammer
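Of the change-detection options listed, comparing snapshots is the simplest to sketch. The function below assumes each snapshot is a dictionary mapping a primary key to a row; the name `diff_snapshots` is hypothetical.

```python
def diff_snapshots(old: dict, new: dict):
    """Compare two snapshots keyed by primary key.

    Returns (inserted, deleted, updated), where updated maps each key to
    its (old_row, new_row) pair.
    """
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated
```

This needs nothing from the source beyond periodic dumps, at the cost of comparing every row and missing any change that is undone between two snapshots.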


Scrub/Cleanse…uses pattern recognition
and AI techniques to upgrade data quality

Figure 11-10: Steps in data reconciliation (cont.)

Fixing errors: misspellings, erroneous dates, incorrect field usage, mismatched addresses, missing data, duplicate data, inconsistencies

Also: decoding, reformatting, time stamping, conversion, key generation, merging, error detection/logging, locating missing data

New approaches for Data Cleansing

• It has generally been found that 70-90
percent of the time and effort in large data
management and analysis tasks is taken
up with data cleansing
• New tool “Data Wrangler” from Stanford
and Berkeley CS folks
• http://vis.stanford.edu/wrangler/



Data Cleansing
• Find (& remove) duplicate tuples
– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
– Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found

Slide credit: J. Hammer
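Duplicate detection for names such as Jane Doe vs. Jane Q. Doe can be sketched with a crude matching key. The heuristic below (drop punctuation and single-letter middle initials, then compare first and last name) is an assumption made for illustration; it will merge distinct people who share a name, so a real cleansing step would compare more attributes.

```python
import re


def name_key(name: str) -> tuple:
    """Normalize a name to (first, last), ignoring case and middle initials."""
    tokens = re.sub(r"[^\w\s]", "", name).lower().split()
    tokens = [t for t in tokens if len(t) > 1]  # drop initials such as "q"
    return (tokens[0], tokens[-1]) if tokens else ()


def find_duplicates(names):
    """Group names that share a key; return only groups with more than one."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]
```

Candidate groups found this way would normally be reviewed or checked against further fields before any tuple is removed.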


Transform = convert data from format of
operational system to format of data
warehouse

Figure 11-10: Steps in data reconciliation (cont.)

Record-level:
• Selection–data partitioning
• Joining–data combining
• Aggregation–data summarization

Field-level:
• single-field–from one field to one field
• multi-field–from many fields to one, or one field to many
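The record-level and field-level transformations can be sketched over lists of dictionaries. All function and field names here (`select`, `join`, `aggregate`, `split_name`, `clerk`, `amount`, `name`) are hypothetical and chosen only to mirror the terms on the slide.

```python
from collections import defaultdict


def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def join(left, right, key):
    """Joining: combine rows from two inputs that agree on `key`."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def aggregate(rows, group_key, value_key):
    """Aggregation: sum `value_key` per distinct `group_key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def split_name(row):
    """Multi-field transformation: one 'name' field becomes two fields."""
    first, last = row["name"].split(" ", 1)
    out = {k: v for k, v in row.items() if k != "name"}
    out.update(first_name=first, last_name=last)
    return out
```

A single-field transformation would follow the same pattern as `split_name` but map one input field to one output field, for example converting a code to its decoded value.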


Data Transformations
• Convert data to uniform format
– Byte ordering, string termination
– Internal layout
• Remove, add & reorder attributes
– Add key
– Add data to get history
• Sort tuples

Slide credit: J. Hammer


Load/Index = place transformed data into the
warehouse and create indexes

Figure 11-10: Steps in data reconciliation (cont.)

Refresh mode: bulk rewriting of target data at periodic intervals
Update mode: only changes in source data are written to data warehouse
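Refresh mode and update mode can be contrasted with Python's standard `sqlite3` module. The `customer` table, its columns, and the function names are hypothetical. Note one simplification: this update mode overwrites changed rows in place, whereas a warehouse that keeps periodic data would append time-stamped rows instead.

```python
import sqlite3


def create_target(conn: sqlite3.Connection) -> None:
    """Create the target table and its index (the 'Index' part of Load/Index)."""
    conn.execute("CREATE TABLE IF NOT EXISTS customer (id INTEGER PRIMARY KEY, city TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_customer_city ON customer(city)")


def refresh_load(conn: sqlite3.Connection, rows) -> None:
    """Refresh mode: bulk-rewrite the whole target table."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM customer")
        conn.executemany("INSERT INTO customer (id, city) VALUES (?, ?)", rows)


def update_load(conn: sqlite3.Connection, changed_rows) -> None:
    """Update mode: write only the rows that changed at the source."""
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO customer (id, city) VALUES (?, ?)", changed_rows
        )
```

Refresh mode is simple and self-correcting but rewrites everything; update mode touches far fewer rows but depends on accurate change detection upstream.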


Data Integration
• Receive data (changes) from multiple
wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
– Resolve inconsistencies
– Eliminate duplicates
– Integrate into warehouse (may not be empty)
– Summarize data
– Fetch more data from sources (wh updates)
– etc.

Slide credit: J. Hammer


Warehouse Maintenance
• Warehouse data ≈ materialized view
– Initial loading
– View maintenance
• View maintenance

Slide credit: J. Hammer


Differs from Conventional View Maintenance...

• Warehouses may be highly aggregated and summarized
• Warehouse views may be over history of
base data
• Process large batch updates
• Schema may evolve

Slide credit: J. Hammer


Differs from Conventional View Maintenance...

• Base data doesn’t participate in view maintenance
– Simply reports changes
– Loosely coupled
– Absence of locking, global transactions
– May not be queriable

Slide credit: J. Hammer


Warehouse Maintenance Anomalies

• Materialized view maintenance in loosely coupled, non-transactional environment
• Simple example

[Figure: the Data Warehouse holds the view Sold(item,clerk,age), defined as Sold = Sale ⋈ Emp; an Integrator maintains it from two sources, Sales with Sale(item,clerk) and Comp. with Emp(clerk,age)]
Slide credit: J. Hammer

Warehouse Maintenance Anomalies

[Figure: Data Warehouse view Sold(item,clerk,age); Integrator; sources Sales with Sale(item,clerk) and Comp. with Emp(clerk,age)]

1. Insert into Emp(Mary,25), notify integrator
2. Insert into Sale (Computer,Mary), notify integrator
3. (1) → integrator adds Sale ⋈ (Mary,25)
4. (2) → integrator adds (Computer,Mary) ⋈ Emp
5. View incorrect (duplicate tuple)
Slide credit: J. Hammer
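The five steps can be replayed in a short simulation. The sketch assumes the integrator handles each notification only after both source inserts have happened and joins each change against the other source's current contents; the function name `run_scenario` and the tuple layouts are hypothetical.

```python
def run_scenario():
    """Replay the anomaly; return (maintained_view, correct_view)."""
    sale = []     # source relation Sale(item, clerk)
    emp = []      # source relation Emp(clerk, age)
    sold = []     # warehouse view Sold(item, clerk, age)
    pending = []  # notifications waiting at the integrator

    # Steps 1 and 2: the sources apply their inserts and notify the integrator.
    emp.append(("Mary", 25))
    pending.append(("emp", ("Mary", 25)))
    sale.append(("Computer", "Mary"))
    pending.append(("sale", ("Computer", "Mary")))

    # Steps 3 and 4: each notification is joined with the CURRENT state of
    # the other source, which already reflects both inserts.
    for kind, delta in pending:
        if kind == "emp":
            clerk, age = delta
            sold += [(item, c, age) for item, c in sale if c == clerk]
        else:
            item, clerk = delta
            sold += [(item, clerk, age) for c, age in emp if c == clerk]

    # What Sold = Sale join Emp should contain.
    correct = [(item, c, age) for item, c in sale for c2, age in emp if c == c2]
    return sold, correct
```

The maintained view ends up with the tuple (Computer, Mary, 25) twice while the correct join contains it once, which is step 5: with no locking or global transactions, each change sees the effect of the other.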
Warehouse Specification (ideally)

[Figure: components shown are View Definitions, Warehouse Configuration Module, Integration rules, Change Detection Requirements, Warehouse, Integrator, Metadata, and the Extractor/Monitor for each source]
Slide credit: J. Hammer

Additional Research Issues
• Historical views of non-historical data
• Expiring outdated information
• Crash recovery
• Addition and removal of information
sources
– Schema evolution

Slide credit: J. Hammer


Warehousing and Industry
• Data Warehousing is big business
– $2 billion in 1995
– $3.5 billion in early 1997
– Predicted: $8 billion in 1998 [Metagroup]
• Wal-Mart said to have the largest warehouse
– 1000-CPU, 583 Terabyte, Teradata system
(InformationWeek, Jan 9, 2006)
– “Half a Petabyte” in warehouse (Ziff Davis Internet,
October 13, 2004)
– 1 billion rows of data or more are updated every day
(InformationWeek, Jan 9, 2006)
– Reported to be 2.5 Petabytes in 2008
• http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen



Other Large Data Warehouses

[Figure: other large data warehouses]
(InformationWeek, Jan 9, 2006)

Those are small change today…
• Some databases are larger, however…
– eBay: has two Teradata systems. Its primary data
warehouse is 9.2 petabytes; its “singularity system”
that stores web clicks and other “big” data is more
than 40 petabytes. It includes a single table that’s 1
trillion rows. (2013)
• http://gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen
– Apple: “Multiple Petabytes” in 2013
– Yahoo! for web user behavioral analysis, storing two
petabytes and claimed to be the largest data
warehouse using a heavily modified version of
PostgreSQL (Wikipedia 2012)



More Information on DW
• Agosta, Lou, The Essential Guide to Data
Warehousing. Prentice Hall PTR, 1999.
• Devlin, Barry, Data Warehouse, from
Architecture to Implementation. Addison-Wesley,
1997.
• Inmon, W.H., Building the Data Warehouse.
John Wiley, 1992.
• Widom, J., “Research Problems in Data
Warehousing.” Proc. of the 4th Intl. CIKM Conf.,
1995.
• Chaudhuri, S., Dayal, U., “An Overview of Data
Warehousing and OLAP Technology.” ACM
SIGMOD Record, March 1997.
