Module-1: Data Warehousing & Modelling
Module-1
DATA WAREHOUSING & MODELLING
1.1. Introduction
1.2. Data Warehousing: A multitier Architecture
1.3. Data warehouse models: Enterprise warehouse, Data mart and virtual warehouse
1.4. Extraction, Transformation and loading
1.5. Data Cube: A multidimensional data model
1.6. Stars, Snowflakes and Fact constellations: Schemas for multidimensional Data models
1.7. Dimensions: The role of concept Hierarchies
1.8. Measures: Their Categorization and computation
1.9. Typical OLAP Operations
1.10. Outcome
1.11. Important Questions
1.1 Introduction
Data warehouses generalize and consolidate data in multidimensional space. The construction of data
warehouses involves data cleaning, data integration, and data transformation, and can be viewed as an
important preprocessing step for data mining. Moreover, data warehouses provide online analytical
processing (OLAP) tools for the interactive analysis of multidimensional data of varied granularities,
which facilitates effective data generalization and data mining.
Many other data mining functions, such as association, classification, prediction, and clustering, can be
integrated with OLAP operations to enhance interactive mining of knowledge at multiple levels of
abstraction. Hence, the data warehouse has become an increasingly important platform for data analysis
and OLAP, and will provide an effective platform for data mining. Therefore, data warehousing and OLAP
form an essential step in the knowledge discovery process.
Key features:
1. Subject-Oriented: A data warehouse can be used to analyze a particular subject area. For example,
"sales" can be a particular subject.
2. Integrated: A data warehouse integrates data from multiple data sources. For example, source A
and source B may have different ways of identifying a product, but in a data warehouse, there will be only
a single way of identifying a product.
3. Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data that is
3 months, 6 months, 12 months, or even older from a data warehouse. This contrasts with a
transaction system, where often only the most recent data is kept. For example, a transaction system may
hold only the most recent address of a customer, whereas a data warehouse can hold all addresses associated with
a customer.
4. Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
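As a rough illustration of the integrated, time-variant, and non-volatile features, the following Python sketch maps two source-specific product codes to one warehouse key and keeps every customer address instead of overwriting the old one. The mappings, class name, and sample data are hypothetical and serve only to illustrate the idea.

```python
from datetime import date

# Hypothetical mappings from source-specific product codes to one warehouse key.
SOURCE_A_KEYS = {"A-100": "P1", "A-200": "P2"}
SOURCE_B_KEYS = {"sku_9": "P1", "sku_7": "P3"}


def to_warehouse_key(source, code):
    """Integrated: translate a source-specific code into the single warehouse key."""
    mapping = SOURCE_A_KEYS if source == "A" else SOURCE_B_KEYS
    return mapping[code]


class AddressHistory:
    """Time-variant and non-volatile: rows are appended, never updated or deleted."""

    def __init__(self):
        self._rows = []  # (customer_id, address, valid_from)

    def record(self, customer_id, address, valid_from):
        self._rows.append((customer_id, address, valid_from))

    def all_addresses(self, customer_id):
        return [a for c, a, _ in self._rows if c == customer_id]

    def current_address(self, customer_id):
        # The row with the latest valid_from date is the current address.
        rows = [(d, a) for c, a, d in self._rows if c == customer_id]
        return max(rows)[1]


history = AddressHistory()
history.record(1, "12 Oak St", date(2020, 1, 1))
history.record(1, "7 Elm Rd", date(2023, 6, 1))
```

A transaction system would typically expose only what `current_address` returns, while the warehouse can still answer `all_addresses`.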
and decision making. Such systems can organize and present data in various formats in order to
accommodate the diverse needs of different users. These systems are known as online analytical
processing (OLAP) systems. The major distinguishing features of OLTP and OLAP are summarized as
follows:
1.2 Data Warehousing: A Multitiered Architecture
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database system.
Back-end tools and utilities are used to feed data into the bottom tier from operational databases or other
external sources (such as customer profile information provided by external consultants). These tools and
utilities perform data extraction, cleaning, and transformation (e.g., to merge similar data from different
sources into a unified format), as well as load and refresh functions to update the data warehouse. The
data are extracted using application program interfaces known as gateways. A gateway is supported by the
underlying DBMS and allows client programs to generate SQL code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE DB (Object Linking and
Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity). This tier also contains a
metadata repository, which stores information about the data warehouse and its contents.
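The extract, clean, transform, and load steps performed by the back-end tools can be sketched in Python as follows. This is a minimal sketch, not a production pipeline: two in-memory SQLite databases stand in for an operational source and the warehouse, the standard `sqlite3` module plays the role of the gateway, and the table and column names are assumptions made for illustration.

```python
import sqlite3

# Two in-memory databases stand in for an operational source and the warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")

source.execute("CREATE TABLE orders (id INTEGER, item TEXT, amount TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, " TV ", "300.0"), (2, "phone", "150.5"), (3, "Phone", None)],
)
warehouse.execute(
    "CREATE TABLE sales (order_id INTEGER, item TEXT, dollars_sold REAL)"
)


def extract(conn):
    """Extraction: pull raw rows out of the operational database."""
    return conn.execute("SELECT id, item, amount FROM orders").fetchall()


def clean_and_transform(rows):
    """Cleaning drops incomplete rows; transformation unifies the format."""
    cleaned = []
    for order_id, item, amount in rows:
        if amount is None:
            continue  # cleaning: skip rows with a missing amount
        cleaned.append((order_id, item.strip().lower(), float(amount)))
    return cleaned


def load(conn, rows):
    """Load: write the prepared rows into the warehouse table."""
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    conn.commit()


load(warehouse, clean_and_transform(extract(source)))
loaded = warehouse.execute(
    "SELECT order_id, item, dollars_sold FROM sales ORDER BY order_id"
).fetchall()
```

A refresh would rerun the same steps for rows added to the source since the last load.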
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational OLAP (ROLAP)
model or a multidimensional OLAP (MOLAP) model.
A relational OLAP (ROLAP) model is an extended relational DBMS that maps operations on multidimensional data to
standard relational operations.
A multidimensional OLAP (MOLAP) model is a special-purpose server that directly
implements multidimensional data and operations.
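To make the ROLAP idea concrete, the sketch below expresses one multidimensional operation (a roll-up of the location dimension from city to country) as an ordinary relational GROUP BY. SQLite and the sample `sales` table are assumptions chosen for illustration; a real ROLAP server would generate comparable SQL against its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (city TEXT, country TEXT, item TEXT, dollars_sold REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        ("Chicago", "USA", "tv", 100.0),
        ("New York", "USA", "tv", 200.0),
        ("Toronto", "Canada", "tv", 50.0),
        ("Toronto", "Canada", "phone", 25.0),
    ],
)


def roll_up_to_country(conn):
    """Roll-up from city to country, expressed as a relational GROUP BY."""
    return conn.execute(
        "SELECT country, item, SUM(dollars_sold) FROM sales "
        "GROUP BY country, item ORDER BY country, item"
    ).fetchall()


result = roll_up_to_country(conn)
```

A MOLAP server would instead store the aggregated cells directly in an array-like structure rather than computing them through SQL.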
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
1.3 Data Warehouse Models: Enterprise Warehouse, Data Mart, and Virtual Warehouse
1. Enterprise warehouse:
An enterprise warehouse provides corporate-wide data integration, usually from one or more operational systems or
external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a few
gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling and may take years to
design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of users.
The scope is confined to specific selected subjects. For example, a marketing data mart may confine its
subjects to customer, item, and sales. The data contained in data marts tend to be summarized.
Data marts are usually implemented on low-cost departmental servers that are UNIX/LINUX-
or Windows-based. The implementation cycle of a data mart is more likely to be measured in weeks rather
than months or years. However, it may involve complex integration in the long run if its design and
planning were not enterprise-wide.
Depending on the source of data, data marts can be categorized as independent or dependent. Independent
data marts are sourced from data captured from one or more operational systems or external information
providers, or from data generated locally within a particular department or geographic area. Dependent
data marts are sourced directly from enterprise data warehouses.
3. Virtual warehouse:
A virtual warehouse is a set of views over operational databases. For efficient query processing,
only some of the possible summary views may be materialized.
A virtual warehouse is easy to build but requires excess capacity on operational database
servers.
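A virtual warehouse can be sketched as views defined directly over an operational table. In the hedged example below, SQLite stands in for the operational database. SQLite has no materialized views, so a snapshot table created with CREATE TABLE ... AS SELECT is used here to emulate a materialized summary. All table and view names are hypothetical.

```python
import sqlite3

ops = sqlite3.connect(":memory:")
ops.execute("CREATE TABLE orders (item TEXT, quarter TEXT, amount REAL)")
ops.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("tv", "Q1", 100.0), ("tv", "Q1", 50.0), ("phone", "Q2", 30.0)],
)

# A plain view: recomputed from the operational table on every query,
# which is why it consumes capacity on the operational server.
ops.execute(
    """CREATE VIEW sales_by_quarter AS
       SELECT quarter, SUM(amount) AS dollars_sold
       FROM orders GROUP BY quarter"""
)

# Emulated materialized summary: stored once, fast to read, but stale
# until it is rebuilt.
ops.execute(
    """CREATE TABLE sales_by_item_mat AS
       SELECT item, SUM(amount) AS dollars_sold
       FROM orders GROUP BY item"""
)

# A new operational row arrives after the summary was materialized.
ops.execute("INSERT INTO orders VALUES ('phone', 'Q2', 20.0)")

view_rows = ops.execute(
    "SELECT quarter, dollars_sold FROM sales_by_quarter ORDER BY quarter"
).fetchall()
mat_rows = ops.execute(
    "SELECT item, dollars_sold FROM sales_by_item_mat ORDER BY item"
).fetchall()
```

The plain view reflects the new row immediately, while the emulated materialized summary still shows the old phone total.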
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and predefined
queries and reports.
The mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and security (user authorization and
access control).
Data related to system performance, which include indices and profiles that improve data
access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and
replication cycles.
Business metadata, which include business terms and definitions, data ownership information,
and charging policies.
1.5 Data Cube: A Multidimensional Data Model
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimension tables, such as item (item_name, brand, type) or time (day, week, month,
quarter, year), describe the dimensions.
The fact table contains measures (such as dollars_sold) and keys to each of the related
dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D
cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids
forms a data cube.
Given a set of dimensions, we can generate a cuboid for each of the possible subsets of the given
dimensions. The result would form a lattice of cuboids, each showing the data at a different level of
summarization, or group-by. The lattice of cuboids is then referred to as a data cube. Figure shows a
lattice of cuboids forming a data cube for the dimensions time, item, location, and supplier. The cuboid
that holds the lowest level of summarization is called the base cuboid.
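The lattice can be enumerated with a few lines of Python. This sketch generates one cuboid (group-by) per subset of the four dimensions named above, which gives 2^4 = 16 cuboids. It ignores concept hierarchies within each dimension, and the function name is chosen only for illustration.

```python
from itertools import combinations


def cuboids(dimensions):
    """Return every subset of the dimensions, from the base cuboid to the apex."""
    result = []
    for k in range(len(dimensions), -1, -1):
        result.extend(combinations(dimensions, k))
    return result


lattice = cuboids(["time", "item", "location", "supplier"])
base_cuboid = lattice[0]   # all four dimensions: lowest level of summarization
apex_cuboid = lattice[-1]  # no dimensions: highest level of summarization
```

Each tuple in `lattice` corresponds to one group-by, for example `("time", "item")` is the 2-D cuboid that summarizes over location and supplier.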
1.6 Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
The most popular data model for a data warehouse is a multidimensional model, which can exist in the form of a
star schema, a snowflake schema, or a fact constellation schema.
Schemas for multidimensional data models
Star schema: A fact table in the middle connected to a set of dimension tables.
Snowflake schema: A refinement of the star schema where some dimensional hierarchy is
normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore
called a galaxy schema or fact constellation.
Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension. The schema graph
resembles a starburst, with the dimension tables displayed in a radial pattern around the central fact table.
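A minimal star schema can be sketched with SQLite as shown below. The dimension attributes (item_name, brand, type; day, month, quarter, year) and the measure dollars_sold follow the examples used in these notes, while the table names, key columns, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: one per dimension.
    CREATE TABLE dim_item (
        item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
    CREATE TABLE dim_time (
        time_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER,
        quarter TEXT, year INTEGER);

    -- Central fact table: keys to each dimension plus the measure.
    CREATE TABLE fact_sales (
        time_key INTEGER REFERENCES dim_time(time_key),
        item_key INTEGER REFERENCES dim_item(item_key),
        dollars_sold REAL);

    INSERT INTO dim_item VALUES
        (1, 'tv', 'BrandX', 'electronics'),
        (2, 'phone', 'BrandY', 'electronics');
    INSERT INTO dim_time VALUES
        (1, 5, 1, 'Q1', 2024),
        (2, 9, 4, 'Q2', 2024);
    INSERT INTO fact_sales VALUES (1, 1, 100.0), (1, 2, 40.0), (2, 1, 60.0);
    """
)

# A typical star join: the fact table joined to each dimension it references.
rows = conn.execute(
    """
    SELECT t.quarter, i.item_name, SUM(f.dollars_sold)
    FROM fact_sales f
    JOIN dim_time t ON f.time_key = t.time_key
    JOIN dim_item i ON f.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
    """
).fetchall()
```

Every query against this schema follows the same shape: filter or group on dimension attributes, aggregate the measure in the fact table.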
Snowflake schema: The snowflake schema is a variant of the star schema model, where some dimension
tables are normalized, thereby further splitting the data into additional tables. The resulting schema graph
forms a shape similar to a snowflake.
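The normalization step that turns a star dimension into a snowflake can be sketched as follows. Here a hypothetical item dimension that repeats brand attributes on every row is split into an item table and a separate brand table. The choice of brand as the normalized attribute, and all table names, are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Star form: brand attributes are repeated on every item row.
    CREATE TABLE dim_item_star (
        item_key INTEGER PRIMARY KEY, item_name TEXT,
        brand TEXT, brand_country TEXT);
    INSERT INTO dim_item_star VALUES
        (1, 'tv', 'BrandX', 'Japan'),
        (2, 'radio', 'BrandX', 'Japan'),
        (3, 'phone', 'BrandY', 'USA');

    -- Snowflake form: brand is split out into its own smaller table.
    CREATE TABLE dim_brand (
        brand_key INTEGER PRIMARY KEY, brand TEXT, brand_country TEXT);
    CREATE TABLE dim_item (
        item_key INTEGER PRIMARY KEY, item_name TEXT,
        brand_key INTEGER REFERENCES dim_brand(brand_key));

    INSERT INTO dim_brand (brand, brand_country)
        SELECT DISTINCT brand, brand_country FROM dim_item_star ORDER BY brand;
    INSERT INTO dim_item
        SELECT s.item_key, s.item_name, b.brand_key
        FROM dim_item_star s JOIN dim_brand b ON s.brand = b.brand;
    """
)

brand_count = conn.execute("SELECT COUNT(*) FROM dim_brand").fetchone()[0]

# One extra join recovers the original, denormalized dimension.
rebuilt = conn.execute(
    """
    SELECT i.item_key, i.item_name, b.brand, b.brand_country
    FROM dim_item i JOIN dim_brand b ON i.brand_key = b.brand_key
    ORDER BY i.item_key
    """
).fetchall()
original = conn.execute(
    "SELECT * FROM dim_item_star ORDER BY item_key"
).fetchall()
```

The trade-off is visible here: the snowflake form stores each brand once, but queries need an additional join to reach brand attributes.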
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension tables.
This kind of schema can be viewed as a collection of stars, and hence is called a galaxy schema or a fact
constellation.
Holistic: if there is no constant bound on the storage size needed to describe a subaggregate.
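The practical consequence of this definition can be seen in a small Python sketch. It uses the median as a holistic measure and, for contrast, the sum, which can be combined exactly from per-partition results. The partitions and values are made up for illustration.

```python
from statistics import median

# Data split into partitions, as it might be across sub-cubes.
partitions = [[1, 2, 9], [3, 4], [5, 100, 200, 300]]
all_values = [v for part in partitions for v in part]

# sum: one number per partition is enough to get the exact overall result.
sum_from_parts = sum(sum(part) for part in partitions)

# median: combining per-partition medians does NOT give the overall median,
# so no fixed-size summary per partition is sufficient in general.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here the combined sum matches the sum over all values, but the median of the partition medians differs from the true median, so computing the median exactly requires access to the underlying values.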
DRILL DOWN
This is like zooming in on the data. This is the reverse of roll-up.
• This is an appropriate operation
→ when the user needs further details or
→ when the user wants to partition more finely or
→ when the user wants to focus on some particular values of certain dimensions.
• This adds more details to the data.
• Initially, the time hierarchy was "day < month < quarter < year".
• On drill-down, the time dimension is descended from the level of quarter to the level of month.
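Drill-down along the time hierarchy, and its reverse, roll-up, can be sketched in Python as follows. The month-to-quarter mapping and the sales figures are hypothetical, and the function names are chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical fragment of the time hierarchy: month < quarter.
MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

monthly_sales = {"Jan": 10.0, "Feb": 20.0, "Mar": 30.0, "Apr": 5.0}


def roll_up(monthly):
    """Ascend from the month level to the quarter level by summing."""
    totals = defaultdict(float)
    for month, amount in monthly.items():
        totals[MONTH_TO_QUARTER[month]] += amount
    return dict(totals)


def drill_down(quarter, monthly):
    """Descend from one quarter to the months that make it up."""
    return {m: a for m, a in monthly.items() if MONTH_TO_QUARTER[m] == quarter}


quarterly_sales = roll_up(monthly_sales)
q1_detail = drill_down("Q1", monthly_sales)
```

Drill-down is only possible because the more detailed monthly data is still available. The quarterly totals alone could not be split back into months.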
SLICE & DICE
These are operations for browsing the data in the cube.
• These operations allow the user to look at information from different viewpoints.
• A slice is a subset of the cube corresponding to a single value for one or more members of the dimensions.
• A dice operation is done by performing a selection on 2 or more dimensions.
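Both operations amount to selections on the cube's cells, as the following Python sketch suggests. The cube is represented as a plain list of cells, and the dimension names, values, and helper functions are hypothetical.

```python
# A tiny cube stored as a list of cells: three dimensions and one measure.
cube = [
    {"time": "Q1", "item": "tv", "location": "Chicago", "dollars_sold": 100.0},
    {"time": "Q1", "item": "phone", "location": "Toronto", "dollars_sold": 40.0},
    {"time": "Q2", "item": "tv", "location": "Toronto", "dollars_sold": 60.0},
    {"time": "Q2", "item": "phone", "location": "Chicago", "dollars_sold": 25.0},
]


def slice_cube(cells, dimension, value):
    """Slice: fix a single value on one dimension."""
    return [c for c in cells if c[dimension] == value]


def dice_cube(cells, criteria):
    """Dice: select on two or more dimensions; criteria maps dimension -> allowed values."""
    return [
        c for c in cells
        if all(c[dim] in allowed for dim, allowed in criteria.items())
    ]


q1_slice = slice_cube(cube, "time", "Q1")
diced = dice_cube(cube, {"time": {"Q1", "Q2"}, "location": {"Toronto"}})
```

The slice keeps the item and location dimensions for the single quarter Q1, while the dice returns a smaller sub-cube restricted on both time and location.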
1.11 Question Bank
1. What is a data warehouse? Discuss its key features.
2. Differentiate between Operational Database Systems and Data Warehouses.
3. Differentiate between OLAP and OLTP.
4. Why are multidimensional views of data and data cubes used?
5. With a neat diagram, explain data-cube implementations.
6. Describe the Multitiered Architecture of data warehousing.
7. Explain the data warehouse models.