
Lecture 1 Introduction To Data Warehousing

This document provides an introduction and overview of data warehousing. It begins with an outline of topics to be covered, including the need for data analysis, problems with heterogeneous data sources, and the goals and approaches of data warehousing. It defines what a data warehouse is, including perspectives from practitioners and researchers. Key aspects of a data warehouse are that it contains integrated data from diverse sources organized by subject for analysis rather than transactions. The document also discusses types of data stored in warehouses and common warehouse architectures.

Uploaded by

lasithrandima123
Copyright
© All Rights Reserved

Introduction to Data Warehousing
Lecture 1
Conducted by
Ms. Akila Brahmana
Department of ICT
Faculty of Technology
University of Ruhuna
Outline of Lecture
❑ Data Warehousing and Information Integration
❑ Brief History of Data Warehousing
❑ What is a Data Warehouse?
❑ Types of Data and Their Uses
❑ Data Warehouse Architectures
❑ Issues in Data Warehousing
Why Do We Need Data Analysis?
❑ To know your customers and yourself better
❑ To form effective business strategies
❑ To provide future direction to business organizations

This kind of data analysis has been going on for a long time, but there
is growing urgency to get it done faster. The main obstacle has been
disparate and heterogeneous data sources.

Data warehousing systems aim to solve this problem!


Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
[Figure: heterogeneous information sources: Personal Databases, World Wide Web, Scientific Databases, Digital Libraries]
❑ Different interfaces
❑ Different data representations
❑ Duplicate and inconsistent information
Goal: Unified Access to Data

[Figure: an Integration System sits above the World Wide Web, Digital Libraries, Scientific Databases and Personal Databases]
❑ Collects and combines information
❑ Provides integrated view, uniform user interface
❑ Supports sharing
The Traditional Research Approach
❑ Query-driven (on-demand)
[Figure: Clients query an Integration System (with Metadata), which reaches each Source through its own Wrapper]
Disadvantages of Query-Driven Approach

❑ Delay in query processing
  ❑ Slow or unavailable information sources
  ❑ Complex filtering and integration
❑ Inefficient and potentially expensive for frequent queries
❑ Competes with local processing at sources
❑ Hasn’t caught on in industry
The Warehousing Approach
❑ Information integrated in advance
❑ Stored in WH for direct querying and analysis
[Figure: Clients query the Data Warehouse; an Integration System (with Metadata) fills it from the Sources through one Extractor/Monitor per Source]
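The warehousing approach can be sketched in a few lines of Python. This is only an illustration: the source names, the records and the `extract` function are hypothetical stand-ins. The point it shows is that the sources are read once, in advance, and client queries then run against the stored copy only.

```python
# Hypothetical sources; in practice these would be separate systems.
SOURCES = {
    "sales_db": [{"customer": "C1", "amount": 120.0}],
    "web_log": [{"customer": "C1", "amount": 30.0},
                {"customer": "C2", "amount": 75.0}],
}

def extract(source_name):
    """Stand-in for an extractor/monitor: returns all rows of one source."""
    return list(SOURCES[source_name])

def load_warehouse():
    """Integrate every source in advance into one stored collection."""
    warehouse = []
    for name in SOURCES:
        for row in extract(name):
            warehouse.append({**row, "source": name})
    return warehouse

def total_by_customer(warehouse):
    """A client query: runs on the warehouse only, never on the sources."""
    totals = {}
    for row in warehouse:
        totals[row["customer"]] = totals.get(row["customer"], 0.0) + row["amount"]
    return totals

warehouse = load_warehouse()
totals = total_by_customer(warehouse)
```

In a query-driven system, by contrast, the calls to `extract` would happen inside every client query.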
Advantages of Warehousing Approach
❑ High query performance
  ❑ But not necessarily most current information
❑ Doesn’t interfere with local processing at sources
  ❑ Complex queries at warehouse
  ❑ OLTP at information sources
❑ Information copied at warehouse
  ❑ Can modify, annotate, summarize, restructure, etc.
  ❑ Can store historical information
  ❑ Security, no auditing
❑ Has caught on in industry
Query-Driven Approach
The query-driven approach is still better for
❑ Rapidly changing information
❑ Rapidly changing information sources
❑ Clients with unpredictable needs
Data Warehouse Evolution
[Figure: timeline with axis marks at 1960, 1975, 1980, 1985, 1990, 1995 and 2000. Labels above the axis: Relational Databases; Company DWs; “Building the DW”, Inmon (1992); Data Replication Tools. Labels on the axis: “Prehistoric Times”; “Middle Ages”; Data Revolution; Information-Based Management. Labels below the axis: PCs and Spreadsheets; End-user Interfaces; 1st DW Article; DW Confs.; Vendor DW Frameworks]
What is a Data Warehouse?
A Practitioner’s Viewpoint

“A data warehouse is simply a single, complete, and
consistent store of data obtained from a variety of sources
and made available to end users in a way they can
understand and use it in a business context.”
-- Barry Devlin, IBM Consultant
A Data Warehouse is...
❑ Stored collection of diverse data
❑ A solution to data integration problem
❑ Single repository of information
❑ Optimized differently from transaction-oriented db
❑ User interface aimed at executives
❑ Large volume of data (GB, TB)
❑ Updates infrequent
❑ May be append-only
A Data Warehouse is... (Cont’d)
❑ Examples
❑ All transactions ever at IBM
❑ Complete client histories at insurance firm
❑ Stockbroker financial information and portfolios
What Is Data Warehousing?

❑ The process of constructing and using data warehouses
❑ Data warehousing is a collection of decision support
technologies, aimed at enabling the knowledge worker
(e.g., chief executive, manager, analyst) to make better
and faster decisions.
- Chaudhuri and Dayal, SIGMOD Record, March 1997
What is a Data Warehouse?
An Alternative Viewpoint

“A DW is a
❑ subject-oriented,
❑ integrated,
❑ time-varying,
❑ non-volatile
collection of data that is used primarily in organizational decision
making.”

-- W.H. Inmon, Building the Data Warehouse, 1992


A Data Warehouse is...
❑ Subject-oriented
❑ Organized by subject, such as customer, product, or sales, not by
application
❑ Focusing on the modeling and analysis of data for decision
makers, not on daily operations or transaction processing.
❑ Used for analysis, data mining, etc.
A Data Warehouse is...
❑ Integrated
❑ Constructed by integrating multiple, heterogeneous data
sources
❑ relational databases, flat files, on-line transaction records
❑ Data cleaning and data integration techniques are applied.
❑ Ensure consistency in naming conventions, encoding
structures, attribute measures, etc. among different data
sources
❑ E.g., Hotel price: currency, tax, breakfast covered, etc.
❑ When data is moved to the warehouse, it is converted.
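The hotel price example can be made concrete with a short Python sketch. The exchange rates, the breakfast cost and the chosen warehouse convention (USD, tax and breakfast included) are all made-up assumptions for illustration.

```python
from dataclasses import dataclass

# Made-up conversion figures, for illustration only.
RATE_TO_USD = {"USD": 1.0, "EUR": 1.1}
BREAKFAST_USD = 10.0

@dataclass
class SourcePrice:
    amount: float
    currency: str
    tax_included: bool
    tax_rate: float            # e.g. 0.1 for 10%
    breakfast_included: bool

def to_warehouse_price(p: SourcePrice) -> float:
    """Convert to the warehouse convention: USD, tax and breakfast included."""
    usd = p.amount * RATE_TO_USD[p.currency]
    if not p.tax_included:
        usd *= 1 + p.tax_rate
    if not p.breakfast_included:
        usd += BREAKFAST_USD
    return round(usd, 2)

# Two sources quoting "100" with different meanings.
a = to_warehouse_price(SourcePrice(100.0, "USD", False, 0.1, False))
b = to_warehouse_price(SourcePrice(100.0, "EUR", True, 0.1, True))
```

Once converted, the two prices can be compared directly, which is not possible while each source keeps its own convention.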
A Data Warehouse is...
❑ Time Variant
❑ The time horizon for the data warehouse is significantly longer
than that of operational systems.
❑ Operational database: current value data.
❑ Data warehouse data: provide information from a historical
perspective (e.g., past 5-10 years)
❑ Every key structure in the data warehouse
❑ Contains an element of time, explicitly or implicitly
❑ But the key of operational data may or may not contain “time
element”.
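A small Python sketch of the difference in key structure, using a hypothetical account balance: the operational store keeps one current value per account, while the warehouse key carries an explicit time element, so older values remain available.

```python
from datetime import date

# Operational store: one current value per account, no time in the key.
operational = {"ACC-1": 500.0}
operational["ACC-1"] = 450.0          # the old balance is overwritten

# Warehouse: the key is (account, snapshot date), so history accumulates.
warehouse = {}
warehouse[("ACC-1", date(2024, 1, 31))] = 500.0
warehouse[("ACC-1", date(2024, 2, 29))] = 450.0

def balance_history(account):
    """Return (date, balance) pairs for one account, oldest first."""
    rows = [(d, v) for (acc, d), v in warehouse.items() if acc == account]
    return sorted(rows)

history = balance_history("ACC-1")
```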
A Data Warehouse is...
❑ Non-Volatile
❑ A physically separate store of data transformed from the operational
environment.
❑ Operational update of data does not occur in the data warehouse
environment.
❑ Does not require transaction processing, recovery, and
concurrency control mechanisms
❑ Requires only two operations in data accessing:
❑ initial loading of data and access of data.
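The two operations can be sketched as a tiny Python class. The class name and methods are hypothetical; the sketch only shows that the interface offers loading and reading, with no in-place update or delete.

```python
class WarehouseTable:
    """Append-only table: rows can be loaded and read, never updated in place."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Bulk load; existing rows are left untouched."""
        self._rows.extend(dict(r) for r in rows)

    def scan(self, predicate=lambda row: True):
        """Read access: return copies of the rows that match the predicate."""
        return [dict(r) for r in self._rows if predicate(r)]

table = WarehouseTable()
table.load([{"id": 1, "qty": 5}, {"id": 2, "qty": 8}])
big = table.scan(lambda r: r["qty"] > 6)
big[0]["qty"] = 0                     # changing a copy does not change the table
```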
Summary
[Figure: layers from top to bottom: Business Information Interface; Data Warehouse; Data Warehouse Population; Operational Systems]
Warehouse is a Specialized DB
Standard DB                          Warehouse
❑ Mostly updates                     ❑ Mostly reads
❑ Many small transactions            ❑ Queries are long and complex
❑ MB - GB of data                    ❑ GB - TB of data
❑ Current snapshot                   ❑ History
❑ Index                              ❑ Lots of scans
❑ Raw data                           ❑ Summarized, reconciled data
❑ Thousands of users                 ❑ Hundreds of users
  (e.g., clerical users)               (e.g., decision-makers, analysts)
Warehousing and Industry
Warehousing is big business
$2 billion in 1995
$3.5 billion in early 1997
$8 billion in 1998
$13 billion in 2018
Predicted to exceed $30 billion by 2025 [Global Market Insights, Inc.]
Types of Data

Business Data - represents meaning


❑ Real-time data (ultimate source of all business data)
❑ Reconciled data
(target data is compared against original source data to ensure that the
migration architecture has transferred the data correctly)
❑ Derived data
Data Warehouse Architectures: Conceptual View
❑ Single-layer
  ❑ Every data element is stored once only
  ❑ Virtual warehouse
  [Figure: operational systems and informational systems both work on the same “Real-time data”]
❑ Two-layer
  ❑ Real-time + derived data
  ❑ Most commonly used approach in industry today
  [Figure: operational systems work on Real-time data; informational systems work on Derived Data]
Three-layer Architecture: Conceptual View
Transformation of real-time data to derived data really
requires two steps
[Figure: three layers, bottom to top: Real-time data (operational systems); Reconciled Data (physical implementation of the data warehouse); Derived Data (view level, “particular informational needs”, informational systems)]
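The two steps can be sketched in Python as follows. The two operational systems, their field names and the summary chosen as the "particular informational need" are hypothetical.

```python
# Layer 1: real-time data as two hypothetical operational systems hold it.
orders_system = [{"cust": "c1", "total": "120.50"}, {"cust": "C2", "total": "80"}]
billing_system = [{"customer_id": "C1", "amount": 19.5}]

def reconcile():
    """Step 1: bring both sources to one naming and typing convention."""
    rows = [{"customer": r["cust"].upper(), "amount": float(r["total"])}
            for r in orders_system]
    rows += [{"customer": r["customer_id"], "amount": float(r["amount"])}
             for r in billing_system]
    return rows

def derive(reconciled):
    """Step 2: summarize reconciled data for one informational need."""
    totals = {}
    for r in reconciled:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

reconciled = reconcile()      # layer 2: reconciled data
derived = derive(reconciled)  # layer 3: derived data
```

Keeping the reconciled layer separate means several derived views can be built from it without going back to the operational systems.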
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse
“Data warehousing”
(2) What to do with data once it’s in warehouse
“Warehouse DBMS”

❑ Both are rich research areas


❑ Industry has focused mainly on (2)
Issues in Data Warehousing
❑ Warehouse Design
❑ Extraction
❑ Wrappers, monitors (change detectors)
❑ Data Transformations
❑ Integration
❑ Cleansing & merging
❑ Warehousing specification & Maintenance
❑ Optimizations
Data Extraction
❑ Database heterogeneity - different DBMSs
❑ Relational, flat file, WWW, etc.
❑ Data heterogeneity - different definitions /representations of data
❑ How to get data out?
❑ Replication tool
❑ Dump file
❑ Create report
❑ ODBC or third-party “wrappers”
Wrapper
❑ Converts data and queries from one data model to another. It
is a software component or interface that mediates between
source and user’s query.

[Figure: a wrapper sits between Data Model A and Data Model B and translates the queries and data that pass between them]
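As a rough illustration of a wrapper, the Python sketch below puts a simple attribute = value query interface in front of a flat file. The file contents and the `FlatFileWrapper` class are hypothetical; real wrappers, such as ODBC drivers, are far more general.

```python
import csv
import io

# Hypothetical source: a flat file that can only be read line by line.
FLAT_FILE = "id,name,city\n1,Jane,Galle\n2,Ravi,Matara\n"

class FlatFileWrapper:
    """Accepts attribute = value queries and answers them from CSV text."""

    def __init__(self, text):
        self._text = text

    def query(self, **conditions):
        """Translate the query into a file scan; return rows as dicts."""
        reader = csv.DictReader(io.StringIO(self._text))
        return [row for row in reader
                if all(row.get(k) == v for k, v in conditions.items())]

wrapper = FlatFileWrapper(FLAT_FILE)
result = wrapper.query(city="Galle")
```

The caller never sees the file format: it asks in one model (attribute conditions) and receives records, while the wrapper handles the source's own model.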
Data Transformations
❑ Convert data to uniform format
❑ Inconsistent field lengths, descriptions

❑ Remove, add & reorder attributes


❑ Add key
❑ Add data to get history
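The transformations listed above can be sketched in Python. The source field names (`NAME`, `ctry`, `fax`) and the target layout are assumptions made for the example.

```python
from datetime import date
from itertools import count

_next_key = count(1)                  # generator for warehouse keys

def transform(record, load_date):
    """Return the record in the (assumed) warehouse layout."""
    return {
        "wh_key": next(_next_key),                      # add key
        "name": record["NAME"].strip().title(),         # uniform format
        "country": record["ctry"].strip().upper()[:2],  # uniform field length
        "load_date": load_date.isoformat(),             # add data to get history
    }                                                   # 'fax' is removed

raw = {"ctry": "lk ", "NAME": "  jane FERNANDO", "fax": "n/a"}
row = transform(raw, date(2024, 1, 31))
```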
Monitors
❑ Goal: Detect changes of interest and propagate to integrator
❑ How?
❑ Triggers
❑ Replication server
❑ Compare query results
❑ Compare snapshots/dumps
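Of the options above, comparing snapshots is the easiest to sketch. The Python example below assumes each snapshot is a dict keyed by primary key; the data is made up.

```python
def diff_snapshots(old, new):
    """Compare two snapshots keyed by primary key; return the changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"inserted": inserted, "deleted": deleted, "updated": updated}

yesterday = {1: {"price": 10}, 2: {"price": 20}, 3: {"price": 30}}
today = {1: {"price": 10}, 2: {"price": 25}, 4: {"price": 40}}
changes = diff_snapshots(yesterday, today)
```

Only the entries in `changes` need to be propagated to the integrator. The cost is that both snapshots must be read in full, which is why triggers or a replication server are preferred when the source supports them.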
Data Integration
❑ Receive data (changes) from multiple wrappers/monitors and
integrate into warehouse
❑ Rule-based
❑ Actions
❑ Resolve inconsistencies
❑ Eliminate duplicates
❑ Integrate into warehouse
❑ Summarize data
❑ Fetch more data from sources (WH updates) etc
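A minimal Python sketch of rule-based integration follows. The rule used here (the record with the later `updated` value wins) is an assumption chosen for the example, as are the record layout and the source labels.

```python
def integrate(warehouse, changes):
    """Apply change records from several monitors to the warehouse dict.

    Assumed rule: when two sources report the same key, the record with the
    later 'updated' value wins; identical repeats are dropped.
    """
    for change in changes:
        key = change["key"]
        current = warehouse.get(key)
        if current is None or change["updated"] > current["updated"]:
            warehouse[key] = {"value": change["value"],
                              "updated": change["updated"]}
    return warehouse

incoming = [
    {"key": "P1", "value": 9.5, "updated": "2024-01-02"},   # from source A
    {"key": "P1", "value": 9.9, "updated": "2024-01-05"},   # from source B
    {"key": "P1", "value": 9.9, "updated": "2024-01-05"},   # duplicate
    {"key": "P2", "value": 4.0, "updated": "2024-01-03"},
]
wh = integrate({}, incoming)
```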
Data Cleansing
❑ Find (& remove) duplicate tuples
❑ e.g., Jane Fernando vs. Jane Q. Fernando
❑ Detect inconsistent, wrong data
❑ Attribute values that don’t match
❑ Patch missing, unreadable data
❑ Notify sources of errors found
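Duplicate detection for the name example can be sketched in Python with a crude match key. Matching on first and last name only is an assumption made for illustration; it would also merge distinct people who share both names, so real cleansing tools compare more attributes. The sketch also assumes every name is non-empty.

```python
def name_key(full_name):
    """Crude match key: lower-cased first and last name, middle parts ignored."""
    parts = full_name.replace(".", "").lower().split()
    return (parts[0], parts[-1])

def find_duplicates(names):
    """Group names that share a match key; return groups with 2+ members."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [g for g in groups.values() if len(g) > 1]

dupes = find_duplicates(["Jane Fernando", "Jane Q. Fernando", "Ravi Perera"])
```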
Data Warehousing Architecture
What is Metadata?
Metadata is simply defined as data about data: data that is used to
describe other data.
For example, the index of a book serves as metadata for the
contents of the book. In other words, metadata is summarized data
that leads us to the detailed data. In terms of a data warehouse,
we can define metadata as follows.
❑ Metadata is the road map to a data warehouse.
❑ Metadata in a data warehouse defines the warehouse objects.
❑ Metadata acts as a directory. This directory helps the decision
support system to locate the contents of a data warehouse.
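The "directory" role of metadata can be illustrated with a small Python catalog. The table names, source names and columns are hypothetical.

```python
# Hypothetical metadata catalog: one entry per warehouse object.
CATALOG = {
    "sales_fact": {
        "source": "orders_system",
        "columns": ["date_key", "product_key", "amount"],
        "refreshed": "daily",
    },
    "product_dim": {
        "source": "inventory_system",
        "columns": ["product_key", "name", "category"],
        "refreshed": "weekly",
    },
}

def tables_with_column(column):
    """Use the catalog as a directory: which objects hold this column?"""
    return sorted(name for name, meta in CATALOG.items()
                  if column in meta["columns"])

found = tables_with_column("product_key")
```

The lookup touches only the catalog, not the warehouse data itself, which is what lets a decision support tool locate content before it runs any query.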
Warehouse Maintenance Differs from Conventional View Maintenance
❑ Warehouses may be highly aggregated and summarized
❑ Warehouse views may be over history of base data
❑ Process large batch updates
❑ Schema may evolve
❑ Simply reports changes
❑ Absence of locking, global transactions
Data Warehousing, Data Mining & Business Intelligence

Data Warehousing is the process of constructing and using data warehouses
Data Warehouse (DW)
❑ An implementation of an informational database used to collect, integrate and
provide sharable data sourced from multiple operational databases for
reporting and analysis
❑ Provide data that is reliable, consistent, understandable
❑ It typically serves as the foundation for a business intelligence system
Data Warehousing, Data Mining & Business Intelligence

❑ Data Mining
❑ Used to extract useful information and patterns from data.
❑ Data mining can be carried out on any traditional database, but because a data
warehouse contains quality data, it is a good base for data mining.
❑ Business Intelligence (BI)
❑ An environment in which business users conduct analyses that yield an overall
understanding of
❑ Where the business has been
❑ Where it is now
❑ Where it will be in the near future (i.e., planning)
❑ Data Mining is a subset of Business Intelligence (BI)
Thank You!
Activity 1

How does data warehousing differ from decision support systems?
