Data Warehousing
informational system, data mart, independent data mart, dependent data mart, enterprise
data warehouse (EDW), operational data store (ODS), logical data mart, real-time data
warehouse, reconciled data, derived data, transient data, periodic data, star schema, grain,
conformed dimension, snowflake schema
Data warehouse : A subject-oriented, integrated, time-variant, nonupdateable collection of data
used in support of management decision-making processes.
Operational system : A system that is used to run a business in real time, based on current data.
Also called a system of record.
Informational system : A system designed to support decision making based on historical
point-in-time and prediction data for complex queries or data-mining applications.
Data mart : A data warehouse that is limited in scope, whose data are obtained by selecting and
summarizing data from a data warehouse or from separate extract, transform, and load
processes from source data systems.
Independent data mart : A data mart filled with data extracted from the operational
environment, without the benefit of a data warehouse.
Dependent data mart : A data mart filled exclusively from an enterprise data warehouse and its
reconciled data.
Enterprise data warehouse (EDW) : A centralized, integrated data warehouse that is the control
point and single source of all data made available to end users for decision support applications.
Operational data store (ODS) : An integrated, subject-oriented, continuously updateable,
current-valued (with recent history), enterprise-wide, detailed database designed to serve
operational users as they do decision support processing.
Logical data mart : A data mart created by a relational view of a data warehouse.
Real-time data warehouse : An enterprise data warehouse that accepts near-real-time feeds
of transactional data from the systems of record, analyzes warehouse data, and in near-real-time
relays business rules to the data warehouse and systems of record so that immediate action
can be taken in response to business events.
Reconciled data: Detailed, current data intended to be the single, authoritative source for all
decision support applications.
Derived data: Data that have been selected, formatted, and aggregated for end-user decision
support applications.
Transient data : Data in which changes to existing records are written over previous records,
thus destroying the previous data content.
Periodic data : Data that are never physically altered or deleted once they have been added to
the store.
Star schema : A simple database design in which dimensional data are separated from fact or
event data. A dimensional model is another name for a star schema.
Grain : The level of detail in a fact table, determined by the intersection of all the components
of the primary key, including all foreign keys and any other primary key elements.
Conformed dimension : One or more dimension tables associated with two or more fact tables
for which the dimension tables have the same business meaning and primary key with each fact
table.
Snowflake schema : An expanded version of a star schema in which dimension tables are
normalized into several related tables.
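The difference between transient and periodic data defined above can be illustrated with a minimal Python sketch. The store layouts, function names, customer key, and dates are hypothetical and chosen only for illustration; they do not come from any particular warehouse product.

```python
from datetime import date

# Transient store: one entry per key; a write replaces the old value.
transient = {}

def transient_write(store, key, value):
    store[key] = value  # the previous content is destroyed

# Periodic store: append-only rows of (key, value, effective_date).
periodic = []

def periodic_write(store, key, value, effective):
    store.append((key, value, effective))  # nothing is altered or deleted

def periodic_as_of(store, key, as_of):
    """Return the value in effect for key on the given date, or None."""
    rows = [r for r in store if r[0] == key and r[2] <= as_of]
    return max(rows, key=lambda r: r[2])[1] if rows else None

# The same customer moves from Denver to Austin in both stores.
transient_write(transient, "C001", "Denver")
transient_write(transient, "C001", "Austin")

periodic_write(periodic, "C001", "Denver", date(2010, 1, 1))
periodic_write(periodic, "C001", "Austin", date(2010, 6, 1))
```

After these writes the transient store holds only the latest city, while the periodic store keeps both rows, so `periodic_as_of` can still answer what was true on an earlier date. That retained history is what makes periodic data suitable for time-variant analysis.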
2) Give two important reasons why an “information gap” often exists between an information
manager’s need and the information generally available.
a. The fragmented way in which organizations have developed information systems, and
their supporting databases, for many years. The emphasis in this text is on a carefully
planned, architectural approach to systems development that should produce a compatible
set of databases. However, in reality, constraints on time and resources cause most
organizations to resort to a “one-thing-at-a-time” approach to developing islands of
information systems. This approach inevitably produces a hodgepodge of uncoordinated
and often inconsistent databases. Usually, databases are based on a variety of hardware,
software platforms, and purchased applications and have resulted from different
organizational mergers, acquisitions, and reorganizations. Under these circumstances, it is
extremely difficult, if not impossible, for managers to locate and use accurate information,
which must be synthesized across these various systems of record.
b. That most systems are developed to support operational processing, with little or no
thought given to the information or analytical tools needed for decision making.
Operational processing, also called transaction processing, captures, stores, and
manipulates data to support daily operations of the organization. It tends to focus database
design on optimizing access to a small set of data related to a transaction. Informational
processing is the analysis of data or other forms of information to support decision making.
It needs large “swatches” of data from which to derive information. Most systems that are
developed internally or purchased from outside vendors are designed to support
operational processing, with little thought given to informational processing.
3) List two major reasons most organizations today need data warehousing.
Two major factors drive the need for data warehousing in most organizations today:
1. A business requires an integrated, company-wide view of high-quality information.
Data in operational systems are typically fragmented and inconsistent, so-called silos, or
islands, of data. They are also generally distributed on a variety of incompatible hardware
and software platforms. For example, one source of customer data may be located on a
UNIX-based server running an Oracle DBMS, whereas another may be located on an SAP
system. Yet, for decision-making purposes, it is often necessary to provide a single,
corporate view of that information.
2. The information systems department must separate informational from operational systems
to improve performance dramatically in managing company data.
An operational system is a system that is used to run a business in real time, based on
current data. Examples of operational systems are sales order processing, reservation
systems, and patient registration systems. Operational systems must process large volumes
of relatively simple read/write transactions and provide fast response.
4) Name and briefly describe the three levels in a data warehouse architecture.
2. Operational metadata, which usually describes the currency level of the stored data (e.g., active,
archived, or purged) and warehouse monitoring information (e.g., usage statistics, error reports,
and audit trails).
3. System performance data, which includes indexes used to improve data access and retrieval
performance.
5. Summarization algorithms, predefined queries, and reports; business data, which includes
business terms and definitions, ownership information, etc.
a. Facts
Facts are numeric measurements (values) that represent a specific business aspect or activity. For
example, sales figures are numeric measurements that represent product and/or service sales. Facts
commonly used in business data analysis are units, costs, prices, and revenues. Facts are normally stored
in a fact table that is the center of the star schema. The fact table contains facts that are linked through
their dimensions. Facts can also be computed or derived at run time. Such computed or derived facts
are sometimes called metrics to differentiate them from stored facts. The fact table is updated
periodically (daily, weekly, monthly, and so on) with data from operational databases.
b. Dimensions
Dimensions are qualifying characteristics that provide additional perspectives to a given fact. Recall that
dimensions are of interest because decision support data are almost always viewed in relation to other
data. For instance, sales might be compared by product from region to region and from one time period
to the next. Dimensions are normally stored in dimension tables. The following diagram depicts a star
schema for sales with product, location, and time dimensions.
c. Attributes
Each dimension table contains attributes. Attributes are often used to search, filter, or classify facts.
Dimensions provide descriptive characteristics about the facts through their attributes. Therefore, the
data warehouse designer must define common business attributes that will be used by the data analyst
to narrow a search, group information, or describe dimensions. For example, region, state, and city are
attributes of the Location dimension; product type and product ID are attributes of the Product dimension.
d. Attribute Hierarchies:
Attributes within dimensions can be ordered in a well-defined attribute hierarchy. The attribute
hierarchy provides a top-down data organization that is used for two main purposes: aggregation and
drill-down/roll-up data analysis. For example, the following figure shows how the location dimension
attributes can be organized in a hierarchy by region, state, city, and store.
The attribute hierarchy provides the capability to perform drill-down and roll-up searches in a data
warehouse. For example, suppose a data analyst looks at the answers to the query: How does the 2009
month-to-date sales performance compare to the 2010 month-to-date sales performance? The data
analyst spots a sharp sales decline for March 2010. The data analyst might decide to drill down inside
the month of March to see how sales by region compared to the previous year.
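The star schema components and the roll-up/drill-down idea described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names (fact_sales, dim_location, dim_product, dim_time), the column names, and every data value are hypothetical, and revenue is computed at run time as a derived fact (metric) rather than stored.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_location (location_id INTEGER PRIMARY KEY,
                           region TEXT, state TEXT, city TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, product_type TEXT);
CREATE TABLE dim_time     (time_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- The fact table's primary key is the combination of its dimension keys,
-- so its grain is one row per time period, location, and product.
CREATE TABLE fact_sales (
    time_id     INTEGER REFERENCES dim_time(time_id),
    location_id INTEGER REFERENCES dim_location(location_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    units INTEGER,
    price REAL,
    PRIMARY KEY (time_id, location_id, product_id)
);
""")
conn.executemany("INSERT INTO dim_location VALUES (?,?,?,?)", [
    (1, "West", "CA", "Fresno"),
    (2, "West", "OR", "Salem"),
    (3, "East", "NY", "Albany"),
])
conn.executemany("INSERT INTO dim_product VALUES (?,?)", [(1, "Widget")])
conn.executemany("INSERT INTO dim_time VALUES (?,?,?)", [(1, 2009, 3), (2, 2010, 3)])
conn.executemany("INSERT INTO fact_sales VALUES (?,?,?,?,?)", [
    (1, 1, 1, 10, 2.0), (1, 2, 1, 5, 2.0), (1, 3, 1, 8, 2.0),  # March 2009
    (2, 1, 1, 4, 2.0),  (2, 2, 1, 5, 2.0), (2, 3, 1, 8, 2.0),  # March 2010
])

def revenue_by(level_columns):
    """Sum March revenue per year at the given level of the location hierarchy.

    level_columns is a hard-coded list of column names, never user input.
    """
    cols = ", ".join(level_columns)
    sql = f"""
        SELECT t.year, {cols}, SUM(f.units * f.price) AS revenue
        FROM fact_sales f
        JOIN dim_time t     ON t.time_id = f.time_id
        JOIN dim_location l ON l.location_id = f.location_id
        WHERE t.month = 3
        GROUP BY t.year, {cols}
        ORDER BY t.year, {cols}
    """
    return conn.execute(sql).fetchall()

rolled_up = revenue_by(["l.region"])                # region level
drilled_down = revenue_by(["l.region", "l.state"])  # one level lower
```

With this sample data, the region-level result shows West revenue falling from 30.0 to 18.0 between March 2009 and March 2010. Drilling down to the state level shows that the decline comes entirely from CA (20.0 to 8.0), while OR is unchanged.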
6) Estimate the number of rows and total size, in bytes, of a fact table, given reasonable
assumptions concerning the database dimensions.
7) Design a data mart using various schemes to normalize and denormalize dimensions and to
account for fact history, hierarchical relationships between dimensions, and changing
dimension attribute values.
8) Develop the requirements for a data mart from questions supporting decision making.
9) Understand the trends that are likely to affect the future of data warehousing in
organizations.
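For the fact table sizing objective in item 6 above, one common back-of-the-envelope approach multiplies the dimension cardinalities, scales the result by the fraction of combinations that actually occur, and then multiplies by an assumed row width. The sketch below follows that approach. Every number in it (1,000 stores, 10,000 products, 24 periods, a 50 percent density, six fields of four bytes each) and the function name are assumptions for illustration, not figures from this document.

```python
from math import prod

def estimate_fact_table(dim_cardinalities, density, fields_per_row, bytes_per_field):
    """Return (rows, size_in_bytes) for a fact table.

    rows = product of the dimension cardinalities x the fraction of
           dimension combinations that actually have a fact row
    size = rows x fields per row x average bytes per field
    """
    rows = int(prod(dim_cardinalities.values()) * density)
    return rows, rows * fields_per_row * bytes_per_field

# Hypothetical assumptions: about half of all store/product/period
# combinations record a sale, and each row holds six 4-byte fields.
rows, size_bytes = estimate_fact_table(
    {"store": 1_000, "product": 10_000, "period": 24},
    density=0.5,
    fields_per_row=6,
    bytes_per_field=4,
)
```

Under these assumptions the estimate is 120,000,000 rows and 2,880,000,000 bytes (about 2.9 GB). Both figures scale linearly with the density and with each dimension's cardinality, which is why the choice of grain has such a large effect on warehouse size.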