DWDM Unit III
UNIT-III
Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing
of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and on-line transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.
Dr.K.M.Rayudu,Professor,Dept. of CSE
Time-variant: Data are stored to provide information from a historical perspective (e.g., the past
5–10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an
element of time.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from
the application data found in the operational environment. Due to this separation, a data warehouse
does not require transaction processing, recovery, and concurrency control mechanisms. It usually
requires only two operations in data accessing: initial loading of data and access of data.
The major task of on-line operational database systems is to perform on-line transaction and
query processing. These systems are called on-line transaction processing (OLTP) systems.
They cover most of the day-to-day operations of an organization, such as purchasing, inventory,
manufacturing, banking, payroll, registration, and accounting.
Data warehouse systems, on the other hand, serve users or knowledge workers in the role of data
analysis and decision making. Such systems can organize and present data in various formats in order
to accommodate the diverse needs of the different users. These systems are known as on-line
analytical processing (OLAP) systems.
A data warehouse is based on a multidimensional data model, which views data in the form of a
data cube.
3.2.1 From tables and spreadsheets to data cubes
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimensions are the perspectives or entities with respect to which an organization wants to keep
records, such as time, item, branch, and location.
A dimension table, such as item (item_name, brand, type) or time (day, week, month, quarter, year),
further describes a dimension.
A fact table contains measures (such as dollars_sold) and keys to each of the related dimension
tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid,
which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
If we continue in this way, we may display any n-D data as a series of (n − 1)-D “cubes.” The data
cube is a metaphor for multidimensional data storage. The actual physical storage of such data may
differ from its logical representation. The important thing to remember is that data cubes are
n-dimensional and do not confine data to 3-D.
A 4-D data cube representation of sales data, according to the dimensions time, item, location, and supplier.
Lattice of cuboids, making up a 4-D data cube for time, item, location, and supplier. Each
cuboid represents a different degree of summarization.
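The lattice described above can be enumerated with a short Python sketch. It assumes no concept hierarchies, so each cuboid is just a subset of the dimensions and an n-D cube has 2^n cuboids. The function name enumerate_cuboids is illustrative and not part of the notes.

```python
from itertools import combinations

def enumerate_cuboids(dimensions):
    """Group every cuboid (subset of the dimensions) by its dimensionality."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        # All k-D cuboids: every way of choosing k of the dimensions.
        lattice[k] = [tuple(c) for c in combinations(dimensions, k)]
    return lattice

dims = ["time", "item", "location", "supplier"]
lattice = enumerate_cuboids(dims)

# lattice[0] holds the apex cuboid (no dimensions, fully summarized);
# lattice[4] holds the base cuboid (all four dimensions, least summarized).
total = sum(len(level) for level in lattice.values())
```

For the four dimensions time, item, location, and supplier, total is 16: one 0-D, four 1-D, six 2-D, four 3-D, and one 4-D cuboid.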
3.2.2 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
The entity-relationship data model is commonly used in the design of relational databases,
where a database schema consists of a set of entities and the relationships between them. Such
a data model is appropriate for on-line transaction processing.
A data warehouse, however, requires a concise, subject-oriented schema that facilitates on-line
data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist
in the form of a star schema, a snowflake schema, or a fact constellation schema.
Star schema:
The most common modeling paradigm is the star schema, in which the data warehouse contains
(1) a large central table (fact table) containing the bulk of the data, with no redundancy, and (2)
a set of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables displayed in a radial pattern
around the central fact table.
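A star schema can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample rows below are assumptions made for illustration only. The sketch shows one central fact table whose keys point at three dimension tables, and a query that joins them to total dollars_sold by quarter and item type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one per dimension (hypothetical columns).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: keys to each dimension plus the measure.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);

-- Made-up sample data.
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'TV-A', 'BrandX', 'home entertainment'),
                            (2, 'Laptop-B', 'BrandY', 'computer');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES (1, 1, 1, 100.0), (1, 2, 1, 200.0),
                              (2, 1, 2, 150.0), (1, 1, 2, 50.0);
""")

# Join the fact table to two of its dimensions and aggregate the measure.
rows = conn.execute("""
    SELECT t.quarter, i.type, SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN time_dim AS t ON f.time_key = t.time_key
    JOIN item_dim AS i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.type
    ORDER BY t.quarter, i.type
""").fetchall()
```

Every analytical query follows the same pattern: start from the fact table and join outward to whichever dimension tables supply the grouping attributes.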
Snowflake schema:
The snowflake schema is a variant of the star schema model, where some dimension tables are
normalized, thereby further splitting the data into additional tables.
The resulting schema graph forms a shape similar to a snowflake.
The major difference between the snowflake and star schema models is that the dimension tables
of the snowflake model may be kept in normalized form to reduce redundancies.
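The difference can be illustrated with a small Python sketch. It assumes a hypothetical item dimension that carries supplier attributes; the row values and the helper names snowflake_item_dim and supplier_type_of are invented for the example. Normalizing moves the repeated supplier attributes into their own table, at the cost of one extra lookup (a join) when they are needed.

```python
# Star-style item dimension: supplier_type is repeated on every item row.
star_item_dim = [
    {"item_key": 1, "item_name": "TV-A", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "TV-B", "supplier_key": 10, "supplier_type": "wholesale"},
    {"item_key": 3, "item_name": "Laptop-C", "supplier_key": 20, "supplier_type": "retail"},
]

def snowflake_item_dim(rows):
    """Split supplier attributes out of the item dimension into a separate table."""
    item_table = [
        {"item_key": r["item_key"], "item_name": r["item_name"], "supplier_key": r["supplier_key"]}
        for r in rows
    ]
    supplier_table = {}
    for r in rows:
        # Each supplier is stored once, keyed by supplier_key.
        supplier_table[r["supplier_key"]] = {"supplier_type": r["supplier_type"]}
    return item_table, supplier_table

item_table, supplier_table = snowflake_item_dim(star_item_dim)

def supplier_type_of(item_key):
    """Recover a supplier attribute through the extra lookup the snowflake form requires."""
    item = next(r for r in item_table if r["item_key"] == item_key)
    return supplier_table[item["supplier_key"]]["supplier_type"]
```

Here three item rows share two supplier rows, so the value "wholesale" is stored once instead of twice.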
A concept hierarchy for location. Due to space limitations, not all of the hierarchy nodes are
shown, indicated by ellipses between nodes.
Hierarchical and lattice structures of attributes in warehouse dimensions: (a) a hierarchy for
location and (b) a lattice for time.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension
or attribute, resulting in a set-grouping hierarchy.
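A set-grouping hierarchy can be sketched as a function that maps raw values to named groups. The price ranges and labels below are hypothetical and chosen only to show the idea of discretizing an attribute.

```python
# Hypothetical price ranges: lower bound exclusive, upper bound inclusive.
PRICE_GROUPS = [
    (0, 100, "inexpensive"),
    (100, 500, "moderately_priced"),
    (500, 1000, "expensive"),
]

def price_group(price):
    """Map a price to its group label, or 'out_of_range' if no group covers it."""
    for low, high, label in PRICE_GROUPS:
        if low < price <= high:
            return label
    return "out_of_range"

prices = [50, 100, 250, 999, 1500]
labels = [price_group(p) for p in prices]
```

Grouping sales by these labels instead of by exact price gives a higher-level view of the same data.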
In the multidimensional model, data are organized into multiple dimensions, and each dimension
contains multiple levels of abstraction defined by concept hierarchies.
This organization provides users with the flexibility to view data from different perspectives.
A number of OLAP data cube operations exist to materialize these different views, allowing
interactive querying and analysis of the data at hand.
Hence, OLAP provides a user-friendly environment for interactive data analysis.
Roll-up:
The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on
a data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
Example: The result of a roll-up operation performed on the central cube by climbing up the
concept hierarchy for location given in Figure.
This hierarchy was defined as the total order “street < city < province or state < country.”
The roll-up operation shown aggregates the data by ascending the location hierarchy from the level
of city to the level of country.
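Roll-up by climbing a concept hierarchy can be sketched as follows. The city-to-country mapping, the sales figures, and the function name roll_up_location are assumptions for illustration. Each cell keyed by (city, quarter) is re-keyed by (country, quarter), and cells that land on the same key are summed.

```python
from collections import defaultdict

# One step of the location hierarchy (city < country), illustrative values.
city_to_country = {
    "Vancouver": "Canada", "Toronto": "Canada",
    "Chicago": "USA", "New York": "USA",
}

# Cells of a 2-D cube: (city, quarter) -> dollars_sold (made-up numbers).
city_cube = {
    ("Vancouver", "Q1"): 100, ("Toronto", "Q1"): 200,
    ("Chicago", "Q1"): 300, ("New York", "Q1"): 400,
    ("Vancouver", "Q2"): 150, ("Chicago", "Q2"): 250,
}

def roll_up_location(cube, mapping):
    """Aggregate city-level cells to the country level."""
    rolled = defaultdict(int)
    for (city, quarter), dollars in cube.items():
        rolled[(mapping[city], quarter)] += dollars
    return dict(rolled)

country_cube = roll_up_location(city_cube, city_to_country)
```

Roll-up by dimension reduction works the same way, except that a dimension is dropped from the key entirely rather than replaced by a coarser level.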
Drill-down:
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions.
Figure 4.12 shows the result of a drill-down operation performed on the central cube by stepping
down a concept hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed
level of month.
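A minimal sketch of drill-down on the time dimension, assuming the month-level data is still available (a summarized view cannot be drilled into unless the detail is kept somewhere). The month names, figures, and function names are illustrative.

```python
# One step of the time hierarchy (month < quarter), first half of a year.
month_to_quarter = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Detailed data: month -> dollars_sold (made-up numbers).
monthly_sales = {"Jan": 10, "Feb": 20, "Mar": 30, "Apr": 40, "May": 50, "Jun": 60}

def quarter_view(monthly):
    """The summarized view a user starts from: totals per quarter."""
    totals = {}
    for month, amount in monthly.items():
        quarter = month_to_quarter[month]
        totals[quarter] = totals.get(quarter, 0) + amount
    return totals

def drill_down(monthly, quarter):
    """Step down from one quarter to the months that make it up."""
    return {m: a for m, a in monthly.items() if month_to_quarter[m] == quarter}

by_quarter = quarter_view(monthly_sales)
q1_detail = drill_down(monthly_sales, "Q1")
```

The monthly figures returned for a quarter always sum back to that quarter's total, which is what makes drill-down the reverse of roll-up.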
Pivot (rotate):
Pivot (also called rotate) is a visualization operation that rotates the data axes in view to provide
an alternative data presentation.
Figure 4.12 shows a pivot operation where the item and location axes in a 2-D slice are rotated.
Other examples include rotating the axes in a 3-D cube, or transforming a 3-D cube into a series
of 2-D planes.
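Pivoting a 2-D slice amounts to swapping its row and column axes. In the sketch below the slice is a nested dictionary; the cities, item types, and values are made up, and the function name pivot is illustrative.

```python
def pivot(slice_2d):
    """Swap the two axes of a 2-D slice stored as {row_key: {col_key: value}}."""
    rotated = {}
    for row_key, row in slice_2d.items():
        for col_key, value in row.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

# Rows are locations, columns are items.
by_location = {
    "Vancouver": {"computer": 825, "phone": 14},
    "Chicago": {"computer": 854, "phone": 89},
}

# After the pivot, rows are items and columns are locations.
by_item = pivot(by_location)
```

No value changes and nothing is aggregated; only the presentation differs, and pivoting twice returns the original slice.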
Other OLAP operations: Some OLAP systems offer additional drilling operations. For example,
drill-across executes queries involving (i.e., across) more than one fact table. The drill-through
operation uses relational SQL facilities to drill through the bottom level of a data cube down to its
back-end relational tables.
Data warehouse systems use back-end tools and utilities to populate and refresh their data. These
tools and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external
sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds
indices and partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
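The functions above can be sketched as a small pipeline. This is a toy illustration under stated assumptions, not how any particular tool works: the sources are in-memory lists, the record fields (city, amount) are invented, cleaning is limited to dropping records with a missing amount and normalizing city names, and load only summarizes per city.

```python
def extract(sources):
    """Gather raw records from several sources into one list."""
    return [record for source in sources for record in source]

def clean(records):
    """Drop records with a missing amount and normalize the spelling of city names."""
    cleaned = []
    for r in records:
        if r.get("amount") is None:
            continue
        cleaned.append({**r, "city": r["city"].strip().title()})
    return cleaned

def transform(records):
    """Convert to the warehouse format: amount stored as a float under dollars_sold."""
    return [{"city": r["city"], "dollars_sold": float(r["amount"])} for r in records]

def load(warehouse, records):
    """Summarize while loading: accumulate dollars_sold per city."""
    for r in records:
        warehouse[r["city"]] = warehouse.get(r["city"], 0.0) + r["dollars_sold"]
    return warehouse

# Two heterogeneous sources with inconsistent naming and types.
source_a = [{"city": " vancouver ", "amount": "100.5"}, {"city": "Chicago", "amount": None}]
source_b = [{"city": "VANCOUVER", "amount": 50}, {"city": "chicago", "amount": "25"}]

warehouse = load({}, transform(clean(extract([source_a, source_b]))))
```

In this sketch a refresh would correspond to running the same steps on newly arrived source records and calling load again on the existing warehouse dictionary.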
Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects.
Metadata are created for the data names and definitions of the given warehouse.
Additional metadata are created and captured for timestamping any extracted data, the source of
the extracted data, and missing fields that have been added by data cleaning or integration
processes.