Data Warehouse: Lutfi Freij Konstantin Rimarchuk Vasken Chamlaian John Sahakian Suzan Ton

A data warehouse is a collection of integrated databases designed to support decision making. It contains subject-oriented, non-volatile data that is relevant to a specific point in time. An operational data store feeds current data to the warehouse. A data mart offers a targeted version of the warehouse. Metadata provides information about the data in the warehouse.


Data Warehouse

Lutfi Freij
Konstantin Rimarchuk
Vasken Chamlaian
John Sahakian
Suzan Ton
Inmon
Father of the data warehouse
Co-creator of the Corporate
Information Factory.
He has 35 years of
experience in database
technology management
and data warehouse design.
Inmon-Cont’d
Bill has written about a variety
of topics on the building, usage,
& maintenance of the data warehouse
& the Corporate Information Factory.

He has written more than 650 articles (Datamation, ComputerWorld, and Byte Magazine).
Inmon has published 45 books.
- Many of his books have been translated into Chinese, Dutch, French, German, Japanese, Korean, Portuguese, Russian, and Spanish.
Introduction
What is a Data Warehouse?
A data warehouse is a collection of integrated databases designed to support a decision support system (DSS).
According to the definition by Inmon, the father of data warehousing (Inmon, 1992a, p. 5):
- It is a collection of integrated, subject-oriented databases designed to support the DSS function, where each unit of data is non-volatile and relevant to some moment in time.
Introduction-Cont’d.
Where is it used?
It is used for evaluating future strategy.

It needs a successful technician:
- Flexible.
- Team player.
- Good balance of business and technical understanding.
Introduction-Cont’d.
The ultimate use of a data warehouse is mass customization.
- For example, it increased Capital One’s customers from 1 million to approximately 9 million in 8 years.
Just like a muscle, a DW increases in strength with active use.
- With each new test and product, valuable information is added to the DW, allowing the analyst to learn from the successes and failures of the past.
The key to survival:
- The ability to analyze, plan, and react to changing business conditions in a much more rapid fashion.
Data Warehouse
In order for data to be effective, the DW must be:
- Consistent.
- Well integrated.
- Well defined.
- Time stamped.
DW environment:
- The data store, the data mart & the metadata.
The Data Store
An operational data store (ODS) stores data for a
specific application. It feeds the data warehouse a
stream of desired raw data.

It is the most common component of the DW environment.
A data store is generally subject-oriented, volatile, and current, and is commonly focused on customers, products, orders, policies, claims, etc.
Data Store & Data Warehouse
Data store & data warehouse: Table 10-1, page 296
The data store-Cont’d.
Its day-to-day function is to store the data for a single, specific set of operational applications.
Its function is to feed data to the data warehouse for the purpose of analysis.
The Data Mart
It is a lower-cost, scaled-down version of the DW.
A data mart offers a targeted and less costly method of gaining the advantages associated with data warehousing and can be scaled up to a full DW environment over time.
The Meta Data
The last component of the DW environment.
It is information that is kept about the warehouse rather than information kept within the warehouse.
Legacy systems generally don’t keep a record of the characteristics of the data (such as what pieces of data exist and where they are located).
The metadata is simply data about data.
Conclusion
A data warehouse is a collection of integrated, subject-oriented databases designed to support a DSS.
- Each unit of data is non-volatile and relevant to some moment in time.
An operational data store (ODS) stores data for a specific application. It feeds the data warehouse a stream of desired raw data.
A data mart is a lower-cost, scaled-down version of a data warehouse, usually designed to support a small group of users (rather than the entire firm).
The metadata is information that is kept about the warehouse.
Data Warehouse

Subject oriented
Data integrated
Time variant
Nonvolatile
Characteristics of Data Warehouse

Subject oriented. Data are organized based on how the users refer to them.
Integrated. All inconsistencies regarding naming conventions and value representations are removed.
Nonvolatile. Data are stored in read-only format and do not change over time.
Time variant. Data are not current but normally time series.
Characteristics of Data Warehouse

Summarized. Operational data are mapped into a decision-usable format.
Large volume. Time series data sets are normally quite large.
Not normalized. DW data can be, and often are, redundant.
Metadata. Data about data are stored.
Data sources. Data come from internal and external unintegrated operational systems.
A Data Warehouse is Subject Oriented
Subject Orientation

Application environment: Design activities must be equally focused on both process and database design.
Data warehouse environment: The DW world is primarily void of process design and tends to focus exclusively on issues of data modeling and database design.
Data Integrated
Integration – consistent naming conventions and measurement attributes, accuracy, and common aggregation.
Establishment of a common unit of measure for all synonymous data elements from dissimilar databases.
The data must be stored in the DW in an integrated, globally acceptable manner.
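The integration step can be sketched in code. This is only a minimal illustration, not a definitive implementation: the source field names (`cust_no`, `CustomerNumber`), the gender encodings, and the choice of kilograms as the common unit are all assumptions made up for the example.

```python
# Hypothetical source-specific names mapped to one warehouse naming convention.
FIELD_ALIASES = {"cust_no": "customer_id", "CustomerNumber": "customer_id"}

# Hypothetical source encodings mapped to one warehouse value representation.
GENDER_CODES = {"m": "M", "male": "M", "1": "M", "f": "F", "female": "F", "0": "F"}

# Conversion factors to the assumed common unit of measure (kilograms).
TO_KILOGRAMS = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


def integrate(record):
    """Return a new record that follows the warehouse conventions."""
    integrated = {}
    for name, value in record.items():
        if name in FIELD_ALIASES:
            # Consistent naming: synonymous fields share one warehouse name.
            integrated[FIELD_ALIASES[name]] = value
    # Consistent value representation for an encoded attribute.
    integrated["gender"] = GENDER_CODES[str(record["gender"]).lower()]
    # Common unit of measure for a measurement attribute.
    factor = TO_KILOGRAMS[record["weight_unit"]]
    integrated["weight_kg"] = round(record["weight"] * factor, 3)
    return integrated
```

Two source systems that name the customer key differently and record weight in grams or pounds would both end up with `customer_id`, a single gender code, and `weight_kg`.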
Data Integrated
Time Variant
In an operational application system, the expectation is that all data within the database are accurate as of the moment of access. In the DW, data are simply assumed to be accurate as of some moment in time and not necessarily right now.
One of the places where DW data display time variance is in the structure of the record key. Every primary key contained within the DW must contain, either implicitly or explicitly, an element of time (day, week, month, etc.).
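A key with an explicit element of time might look like the sketch below. The key layout, the day-level granularity, the dates, and the helper name `load_snapshot` are assumptions for illustration only; the store and product values reuse the sales example from the metadata slides.

```python
from datetime import date
from typing import NamedTuple


class SalesKey(NamedTuple):
    """Warehouse key: the business key plus an explicit element of time."""
    store_number: int
    product_code: str
    as_of: date  # day-level granularity is an assumption; week or month also works


def load_snapshot(warehouse, store_number, product_code, as_of, sales):
    """Add one time-stamped record; an existing key is never overwritten."""
    key = SalesKey(store_number, product_code, as_of)
    if key in warehouse:
        raise KeyError(f"snapshot already loaded for {key}")
    warehouse[key] = sales


warehouse = {}
load_snapshot(warehouse, 4056, "KJ596", date(2004, 10, 11), 223.45)
load_snapshot(warehouse, 4056, "KJ596", date(2004, 10, 12), 198.10)
```

Because the date is part of the key, the same store and product can appear once per day, and each record stays tied to its own moment in time.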
Time Variant
Every piece of data contained within the
warehouse must be associated with a
particular point in time if any useful
analysis is to be conducted with it.
Another aspect of time variance in DW
data is that, once recorded, data within the
warehouse cannot be updated or
changed.
Nonvolatility
Typical activities such as deletes, inserts,
and changes that are performed in an
operational application environment are
completely nonexistent in a DW
environment.
Only two data operations are ever
performed in the DW: data loading and
data access
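The "load and access only" rule can be illustrated with a small container that offers no update or delete operation. This is a sketch under assumptions: the class name `NonvolatileStore` and its method names are hypothetical, and a real warehouse enforces this through its loading process rather than a Python class.

```python
class NonvolatileStore:
    """Exposes only the two warehouse operations: data loading and data access."""

    def __init__(self):
        self._rows = []

    def load(self, rows):
        """Append a batch of records; existing records are left untouched."""
        self._rows.extend(dict(row) for row in rows)
        return len(self._rows)

    def access(self, predicate=lambda row: True):
        """Return copies of matching records so callers cannot alter stored data."""
        return [dict(row) for row in self._rows if predicate(row)]
```

Returning copies from `access` is a design choice for the sketch: changing a returned record does not change what the store holds.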
Nonvolatility
Application: The design issues must focus on data integrity and update anomalies. Complex processes must be coded to ensure that the data update processes allow for high integrity of the final product.
DW: Such issues are of no concern in a DW environment because data update is never performed.

Application: Data is placed in normalized form to ensure minimal redundancy (totals that could be calculated would never be stored).
DW: Designers find it useful to store many such calculations or summarizations.

Application: The technologies necessary to support issues of transaction and data recovery, rollback, and detection and remedy of deadlock are quite complex.
DW: Relative simplicity in technology.
Issues of Data Redundancy between
DW and operational environments
The lack of relevancy of issues such as data normalization in the DW environment may suggest the existence of massive data redundancy within the data warehouse and between the operational and DW environments.
Inmon (1992) pointed out and proved that this is not true.


Issues of Data Redundancy between
DW and operational environments
The data being loaded into the DW are filtered and “cleansed” as they pass from the operational database to the warehouse. Because of this cleansing, much of the data that exists in the operational environment never passes to the data warehouse. Only the data necessary for processing by the DSS or EIS are ever actually loaded into the DW.
The time horizons for warehouse and operational data elements differ. Data in the operational environment are fresh, whereas warehouse data are generally much older (so there is minimal opportunity for the data to overlap between the two environments).
The data loaded into the DW often undergo a radical transformation as they pass from the operational to the DW environment, so the data in the DW are not the same.
Given these factors, Inmon suggests that data redundancy between the two environments is a rare occurrence, with a typical redundancy factor of less than 1%.
The Data Warehouse
Architecture
The architecture consists of various interconnected elements:
- Operational and external database layer – the source data for the DW
- Information access layer – the tools the end user accesses to extract and analyze the data
- Data access layer – the interface between the operational and information access layers
- Metadata layer – the data directory or repository of metadata information
Components of the Data
Warehouse Architecture
The Data Warehouse
Architecture
Additional layers are:
- Process management layer – the scheduler or job controller
- Application messaging layer – the “middleware” that transports information around the firm
- Physical data warehouse layer – where the actual data used in the DSS are located
- Data staging layer – all of the processes necessary to select, edit, summarize and load warehouse data from the operational and external databases
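The four staging activities (select, edit, summarize, load) can be sketched as a small pipeline. Everything specific here is an assumption for illustration: the row fields (`store`, `date`, `amount`, `status`), the rule that only completed orders are selected, and the monthly summary level.

```python
from collections import defaultdict


def select(rows):
    """Select: keep only the rows the DSS needs (here, completed orders)."""
    return [row for row in rows if row["status"] == "complete"]


def edit(rows):
    """Edit: cleanse values and derive the month each row belongs to."""
    return [
        {
            "store": row["store"].strip().upper(),
            "month": row["date"][:7],  # "YYYY-MM-DD" -> "YYYY-MM"
            "amount": float(row["amount"]),
        }
        for row in rows
    ]


def summarize(rows):
    """Summarize: total sales per store and month."""
    totals = defaultdict(float)
    for row in rows:
        totals[(row["store"], row["month"])] += row["amount"]
    return dict(totals)


def stage(operational_rows, warehouse):
    """Run select, edit, summarize, then load the result into the warehouse."""
    summary = summarize(edit(select(operational_rows)))
    for (store, month), total in sorted(summary.items()):
        warehouse.append({"store": store, "month": month, "total_sales": total})
    return warehouse
```

Here the warehouse is just a list that `stage` appends to; in practice the load step would write to the physical data warehouse layer.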
Data Warehousing Typology
The virtual data warehouse – the end users
have direct access to the data stores, using tools
enabled at the data access layer
The central data warehouse – a single physical
database contains all of the data for a specific
functional area
The distributed data warehouse – the
components are distributed across several
physical databases
The Metadata
The name suggests some high-level
technological concept, but it really is fairly
simple. Metadata is “data about data”.
With the emergence of the data warehouse as a
decision support structure, the metadata are
considered as much a resource as the business
data they describe.
Metadata are abstractions -- they are high level
data that provide concise descriptions of lower-
level data.
The Metadata

For example, a line in a sales database may contain:
4056 KJ596 223.45
This is mostly meaningless until we consult the metadata that tells us it was store number 4056, product KJ596, and sales of $223.45.
The metadata are essential ingredients in the transformation of raw data into knowledge. They are the “keys” that allow us to handle the raw data.
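The sales-line example can be made concrete with a short sketch in which the metadata is a list of field names and types applied to the raw line. The field names (`store_number`, `product_code`, `sales_usd`) and the whitespace-separated layout are assumptions for illustration.

```python
from decimal import Decimal

# Metadata: for each position in a sales line, the name and type of the value.
SALES_LINE_METADATA = [
    ("store_number", int),
    ("product_code", str),
    ("sales_usd", Decimal),
]


def describe(line):
    """Attach the metadata's names and types to a raw, whitespace-separated line."""
    values = line.split()
    if len(values) != len(SALES_LINE_METADATA):
        raise ValueError("line does not match the metadata layout")
    return {
        name: cast(value)
        for (name, cast), value in zip(SALES_LINE_METADATA, values)
    }
```

Calling `describe("4056 KJ596 223.45")` turns three bare values into a record whose fields say what each value means.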
General Metadata Issues
General metadata issues associated with Data Warehouse use:
- What tables, attributes and keys does the DW contain?
- Where did each set of data come from?
- What transformations were applied with cleansing?
- How have the metadata changed over time?
- How often do the data get reloaded?
- Are there so many data elements that you need to be careful what you ask for?
Components of the Metadata
Transformation maps – records that show
what transformations were applied
Extraction & relationship history – records
that show what data was analyzed
Algorithms for summarization – methods
available for aggregating and summarizing
Data ownership – records that show origin
Patterns of access – records that show
what data are accessed and how often
Typical Mapping Metadata
Transformation mapping records include:
- Identification of original source
- Attribute conversions
- Physical characteristic conversions
- Encoding/reference table conversions
- Naming changes
- Key changes
- Values of default attributes
- Logic to choose from multiple sources
- Algorithmic changes
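A transformation mapping record covering a few of the items above could be sketched as follows. This is a hypothetical structure, not a standard one: the class name `TransformationMap`, its fields, and the sample source names are assumptions, and only source identification, naming changes, conversions, and default values are modeled.

```python
from dataclasses import dataclass
from typing import Any, Callable


def unchanged(value):
    """Default conversion: pass the source value through as-is."""
    return value


@dataclass(frozen=True)
class TransformationMap:
    source_system: str                         # identification of original source
    source_field: str                          # name in the source system
    target_field: str                          # naming change applied in the DW
    convert: Callable[[Any], Any] = unchanged  # attribute or encoding conversion
    default: Any = None                        # value of a default attribute

    def apply(self, source_row):
        """Return (target_field, value) for one source row."""
        raw = source_row.get(self.source_field)
        value = self.default if raw is None else self.convert(raw)
        return self.target_field, value
```

For example, `TransformationMap("orders_db", "cust_no", "customer_id", convert=int)` records that the warehouse field `customer_id` came from `cust_no` in a source called `orders_db` and was converted to an integer on the way in.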
Implementing the Data Warehouse
Kozar’s list of the “seven deadly sins” of data warehouse implementation:
1. “If you build it, they will come” – the DW needs to be
designed to meet people’s needs
2. Omission of an architectural framework – you need
to consider the number of users, volume of data,
update cycle, etc.
3. Underestimating the importance of documenting
assumptions – the assumptions and potential
conflicts must be included in the framework
“Seven Deadly Sins”, continued

4. Failure to use the right tool – a DW project needs different tools than those used to develop an application
5. Life cycle abuse – in a DW, the life cycle really never ends
6. Ignorance about data conflicts – resolving these takes a lot more effort than most people realize
7. Failure to learn from mistakes – since one DW project tends to beget another, learning from the early mistakes will yield higher quality later
Data Warehouse Technologies
No one currently offers an end-to-end DW
solution. Organizations buy bits and pieces from
a number of vendors and hopefully make them
work together.
SAS, IBM, Software AG, Information Builders
and Platinum offer solutions that are at least
fairly comprehensive.
The market is very competitive. Table 10-6 in
the text lists 90 firms that produce DW products.
The Future of Data Warehousing
As the DW becomes a standard part of an organization, there will be efforts to find new ways to use the data. This will likely bring with it several new challenges:
- Regulatory constraints may limit the ability to combine sources of disparate data.
- These disparate sources are likely to contain unstructured data, which is hard to store.
- The Internet makes it possible to access data from virtually “anywhere”. Of course, this just increases the disparity.
Objective
Interesting Facts
Data Can be Used To
Robust Infrastructure
Success of Data Warehouse Projects
Implementing Data Warehouse
Real Time Alerts & Integration
Identity Theft
What Can You Do?
Interesting Facts
Harrah’s Entertainment’s data warehouse holds 30 terabytes, or 30 trillion bytes, of data, roughly three times the number of printed characters in the Library of Congress.
Casinos, retailers, airlines, and banks are piling up data so vast that it would have been unthinkable years ago, a result of the curse of cheap storage.
Interesting Facts
Storage shipments as of 2004: 22 exabytes, or 22 million trillion bytes, of hard disk space, double the amount in 2002.
Equivalent to four times the space needed to store every word ever spoken by every human being who has ever lived.
Should double again in 2006.
Data Can be Used To
Quantify the volume impact of vehicles across the marketing matrix
Account for decay and saturation factors in the determination of investment choices and returns
Execute “what-if” simulations of pricing or promotional scenarios before a proposed action is taken
Provide a continuous planning, measurement, analysis and optimization cycle supported by a software structure
Deliver robust data feeds into other systems supporting supply chain, sales, and financial reporting and endeavors
Robust Infrastructure
Data Identification and Acquisition
Data Cleansing, Mapping, and Transformation
Production System Loading and Ongoing Update
Success of Data Warehouse
Projects
Over half of data warehouse projects are doomed
- Fail due to lack of attention to data quality issues
- More than half only have limited acceptance
- Consistency and accuracy of data
- Most businesses fail to use business intelligence (BI) strategically
- IT organizations build data warehouses with little to no business involvement
“A real-time enterprise
without real-time business
intelligence is a real fast,
dumb organization.”

Stephen Brobst
Chief Technology Officer
Teradata
Success of Data Warehouse
Projects
Most challenging type of deployment for an enterprise
- Large scale and complex system configurations
- Sophisticated data modeling and analysis tools
- High visibility in a broad range of important business functions within the company
- Adoption of Linux-based platform
Implementing Data Warehouse
Challenges:
- Identifying new processes
- Assuring they were of real use
- Implementing and ensuring cultural shifts
- Managing content and new communities towards a common benefit
- Linear models
- Standards, governance, controls, valuation
Teradata
Division of NCR in Dayton, Ohio
Competitor of IBM and Oracle
Multi-million dollar machines to run the world’s biggest data warehouses
- Wal-Mart
- Bank of America
- Verizon Wireless
Teradata’s Success
Conventional IBM or Sun Microsystems machines can be overloaded for a couple of hours to days by a few terabytes of data and/or data queries
IBM cannot return results for certain complex requests
Equivalent to having data but not being able to use it.
Real Time Alerts & Integration
Teradata version 8.0 was released in October 2004
- Improves real-time alerts and integration
Businesses can analyze operational information against historical information to identify events in near real time using the new table design
Used by:
- Continental Airlines in the US: rerouting passengers on delayed flights, reissuing tickets, reserving a room in a hotel booking system
- Southwest Airlines: savings between $1.2 and $1.4 million
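The idea of checking operational information against historical information can be illustrated generically. The sketch below is not Teradata's method and uses no Teradata API; it swaps in a simple standard-deviation test, and the function name `is_unusual`, the threshold of 3, and the delay figures in the usage note are assumptions.

```python
from statistics import mean, stdev


def is_unusual(current, history, threshold=3.0):
    """Flag `current` if it lies more than `threshold` standard deviations
    from the mean of the historical values."""
    if len(history) < 2:
        return False  # not enough history to judge
    spread = stdev(history)
    if spread == 0:
        return current != history[0]
    return abs(current - mean(history)) / spread > threshold
```

With historical delays of `[10, 12, 11, 9, 10, 11, 10]` minutes, a current delay of 45 minutes would be flagged as an event, while a delay of 11 minutes would not.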
Identity Theft
Government regulation of personal data is needed (National Consumer Protection Standards)
ChoicePoint folly
- Georgia-based data-collection company
- Founded in 1997 to analyze insurance claims information, but now provides data to customers including finance companies, law enforcement, and government
- Obtains personal information by perusing public records or purchasing the information from other companies
Identity Theft
Duped by scammers who set up 150 phony accounts to access the personal data of as many as 145,000 people nationwide
Scammers set up user accounts by faxing in phony business licenses, undetected for one year
750 people had their identities stolen
Theft would have gone unnoticed without California identity theft law SB 1386
Identity Theft
MSN Event

Data Warehouse Information Gathering
Over the Phone Interviews
Trash Can Hunting
Gathered from Doctors, Internet Transactions, Telephone Operators (Overseas or Prisoners)
MSN Email
What Can You Do?
Carefully monitor your credit card bills and credit reports
Request a free credit report once a year from the three big credit agencies
- Equifax, Experian, TransUnion
Victims: contact the Federal Trade Commission to report the theft, and monitor credit reports
- 1-800-IDTHEFT
References
Decision Support Systems in the 21st Century, 2nd Edition, by George M. Marakas, Prentice Hall, Upper Saddle River, NJ, 2003

Plugging holes in data warehousing, Seattle Times, http://seattletimes.nwsource.com/html/editorialsopinion/2002191098_credited27.html

Teradata warehouse improves real-time alerts and integration, Cliff Saran, Computer Weekly, Sutton: Oct 12, 2004, p. 22 (1 page)

On the Mark, Mark Hall, Computerworld, Framingham: Oct 18, 2004, Vol. 38, Iss. 42, p. 6 (1 page)

Optimization: It's All About the Data, Ellen Pederson and Mark Anderson, Brandweek

The No-Sacrifice, Affordable Data Warehouse App, Michael Gonzalez, Intelligent Enterprises
References
Convergence - Beyond the Data Warehouse, http://www.dmreview.com/article_sub.cfm?articleId=7071

Micro-segmentation, Computerworld, http://www.computerworld.com/printthis/2001/0,4814,56969,00.html

Too Much Information, Forbes article on data warehouse

When identity thieves strike data warehouses, http://reviews.cnet.com/4520-3513_7-5690533-1.html

Over half of data warehouse projects doomed, Robert Jaques, VNU Business Publications Limited, 25 February 2005

Linux World article, http://www.linuxworld.com/magazine/?issueid=571

Questions?
