Data Warehousing: Hu Yan Huy@cs - Tut.fi
Hu Yan
[email protected]
Outline
• What is data warehousing
• The benefit of data warehousing
• Differences between OLTP and data warehousing
• The architecture of data warehouse
• The main components
• Data flows
• Tools and technologies
• Integration
• The importance of managing meta-data
• Data marts
What is data warehousing?
• A data warehouse is a subject-oriented, integrated,
time-variant, and non-volatile collection of data in support of
management’s decision-making process.
• Data warehousing combines data management and data analysis.
• A data webhouse is a distributed data warehouse that is
implemented over the web with no central data repository.
• Goal: to integrate enterprise-wide corporate data into a
single repository from which users can easily run queries.
What is data warehousing?
• Subject-oriented: the warehouse is organized around the major subjects of the
enterprise rather than the major application areas. This is reflected in the need to
store decision-support data rather than application-oriented data.
• Integrated: the source data come together from different enterprise-wide
application systems. The source data is often inconsistent. The integrated
data source must be made consistent to present a unified view of the data to the
users.
• Time-variant: data in the warehouse is only accurate and valid at some point
in time or over some time interval. The time-variance of the data warehouse is also
shown in the extended time that the data is held, the implicit or explicit association
of time with all data, and the fact that the data represents a series of snapshots.
• Non-volatile: data is not updated in real time but is refreshed from the operational
systems on a regular basis. New data is always added as a supplement to the database,
rather than as a replacement. The database continually absorbs this new data,
incrementally integrating it with previous data.
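The integrated, time-variant, and non-volatile properties can be illustrated with a small sketch. It is only an illustration under stated assumptions: the country encodings, the field names, and the `integrate` and `refresh` helpers are hypothetical and do not come from any particular product.

```python
from datetime import date

# Hypothetical encodings: different source systems spell the same country differently.
COUNTRY_CODES = {"fi": "FI", "fin": "FI", "finland": "FI", "se": "SE", "sweden": "SE"}


def integrate(record):
    """Map one source record onto a single, consistent encoding (integrated)."""
    raw = str(record["country"]).strip().lower()
    return {"customer_id": record["customer_id"], "country": COUNTRY_CODES[raw]}


def refresh(warehouse, source_records, snapshot_date):
    """Append a dated snapshot; existing rows are never modified (non-volatile)."""
    for record in source_records:
        row = integrate(record)
        row["snapshot_date"] = snapshot_date  # explicit association of time with the data
        warehouse.append(row)
    return warehouse


warehouse = []
refresh(warehouse, [{"customer_id": 1, "country": "Finland"}], date(2024, 1, 31))
refresh(warehouse, [{"customer_id": 1, "country": "fi"}], date(2024, 2, 29))
for row in warehouse:
    print(row)
```

Each refresh adds a new snapshot next to the earlier one, so the warehouse holds a series of dated snapshots rather than a single current value.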
The benefits of data warehousing
• Potential benefits of data warehousing:
• high returns on investment
• substantial competitive advantage
• increased productivity of corporate
decision-makers
The difference between OLTP and data warehousing
• A DBMS built for online transaction
processing (OLTP) is generally regarded as
unsuitable for data warehousing because
each system is designed with a differing set
of requirements in mind
[Figure: Typical architecture of a data warehouse. Components shown: operational data source n, operational data store (ODS), warehouse manager, detailed data, DBMS, archive/backup data, OLAP (online analytical processing) tools, end-user access tools]
The main components
• Operational data sources: data for the warehouse is supplied from
mainframe operational data held in first-generation hierarchical and
network databases, departmental data held in proprietary file systems,
private data held on workstations and private servers, and external
systems such as the Internet, commercially available databases, or
databases associated with an organization’s suppliers or customers.
• Operational data store (ODS): a repository of current
and integrated operational data used for analysis. It is often structured
and supplied with data in the same way as the data warehouse, but
may in fact simply act as a staging area for data to be moved into the
warehouse.
The main components
• Load manager: also called the frontend component, it performs
all the operations associated with the extraction and loading of data
into the warehouse. These operations include simple transformations
of the data to prepare the data for entry into the warehouse.
• Warehouse manager: performs all the operations associated with
the management of the data in the warehouse. The operations
performed by this component include analysis of data to ensure
consistency, transformation and merging of source data, creation of
indexes and views, generation of denormalizations and aggregations,
and archiving and backing up data.
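The division of work between the load manager and the warehouse manager can be sketched as follows. This is a minimal illustration using Python's built-in `sqlite3` module as a stand-in for the warehouse DBMS; the table names, the sample rows, and the two function names are assumptions made for the example.

```python
import sqlite3

# Hypothetical extract from one operational source: (order_id, region, amount as text).
SOURCE_ROWS = [(1, " north ", "10.50"), (2, "South", "4.00"), (3, "north", "2.25")]


def load_manager(conn, rows):
    """Extract rows, apply simple transformations, and load them into the warehouse."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_detail "
        "(order_id INTEGER, region TEXT, amount REAL)"
    )
    # Simple transformations: trim and lower-case the region, convert the amount to a number.
    cleaned = [(oid, region.strip().lower(), float(amount)) for oid, region, amount in rows]
    conn.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)", cleaned)


def warehouse_manager(conn):
    """Check consistency, create an index, and generate an aggregation."""
    bad = conn.execute("SELECT COUNT(*) FROM sales_detail WHERE amount < 0").fetchone()[0]
    if bad:
        raise ValueError(f"{bad} rows have a negative amount")
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_region ON sales_detail (region)")
    conn.execute("DROP TABLE IF EXISTS sales_by_region")
    conn.execute(
        "CREATE TABLE sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales_detail GROUP BY region"
    )


conn = sqlite3.connect(":memory:")
load_manager(conn, SOURCE_ROWS)
warehouse_manager(conn)
summary = conn.execute("SELECT region, total FROM sales_by_region ORDER BY region").fetchall()
print(summary)
```

The detailed table stays in place after the aggregation is built, so the same data is available at both the detailed and the summarized level.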
The main components
• Query manager: also called the backend component, it performs all
the operations associated with the management of user queries. The
operations performed by this component include directing queries to
the appropriate tables and scheduling the execution of queries.
• Detailed, lightly summarized, and highly summarized
data; archive/backup data
• meta-data
• End-user access tools: can be categorized into five main groups:
data reporting and query tools, application development tools,
executive information system (EIS) tools, online analytical processing
(OLAP) tools, and data mining tools.
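One way to picture how a query manager directs queries to the appropriate tables is the sketch below. It covers only the routing step, not scheduling. The table names and the `route_query` helper are hypothetical, and the sketch assumes additive measures (such as sums) that can be rolled up from a finer summary table.

```python
# Hypothetical summary tables, keyed by the columns each one is grouped on.
# A real warehouse would read this information from the meta-data.
SUMMARY_TABLES = {
    frozenset(["region"]): "sales_by_region",
    frozenset(["region", "month"]): "sales_by_region_month",
}
DETAIL_TABLE = "sales_detail"


def route_query(group_by):
    """Pick the coarsest summary table that has every requested column.

    Falls back to the detailed data when no summary table is suitable.
    """
    wanted = frozenset(group_by)
    candidates = [cols for cols in SUMMARY_TABLES if wanted <= cols]
    if not candidates:
        return DETAIL_TABLE
    return SUMMARY_TABLES[min(candidates, key=len)]


print(route_query(["region"]))
print(route_query(["month"]))
print(route_query(["customer"]))
```

Queries that a summary table can answer never touch the detailed data, which is the point of keeping lightly and highly summarized data alongside it.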
Data flows
• Inflow: the processes associated with the extraction, cleansing, and loading of
the data from the source systems into the data warehouse.
• Upflow: the processes associated with adding value to the data in the warehouse
through summarizing, packaging, and distribution of the data.
• Outflow: the processes associated with making the data available to the end-users.
• Meta-flow: the processes associated with the management of the meta-data.
[Figure: data flows in a data warehouse. Flows shown: inflow, upflow, downflow, outflow, meta-flow. Components shown: operational data source 1 to n, operational data store (ODS), load manager, warehouse manager, query manager, meta-data, detailed data, lightly summarized data, highly summarized data, DBMS, archive/backup data, and end-user access tools (reporting, query, application development, and EIS (executive information system) tools; OLAP (online analytical processing) tools; data mining tools)]
[Figure: data warehouse and data mart architecture, with parts labelled First Tier, Second Tier, and Third Tier. Components shown: operational data source 2 to n, operational data store (ODS), load manager, warehouse manager, query manager, detailed data, lightly summarized data, DBMS, archive/backup data, data mart with summarized data (relational database) and summarized data (multi-dimensional database), OLAP (online analytical processing) tools, data mining, end-user access tools]