Unit 1
• Independent data marts are sourced from data captured from one or more
operational systems or external information providers
• Dependent data marts are sourced directly from enterprise data warehouses.
• Virtual warehouse
– A virtual warehouse is a set of views over operational databases.
– Only some of the possible summary views may be materialized.
– A virtual warehouse is easy to build but requires excess capacity on
operational database servers.
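As a rough illustration of the virtual warehouse idea, the sketch below defines a summary view over a tiny operational table using Python's built-in sqlite3 module. The table name `orders`, its columns, and the rows are hypothetical. SQLite has no native materialized views, so materialization is emulated here by copying the view's result into a table.

```python
import sqlite3

# Hypothetical operational database with a single `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (city TEXT, item TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Vancouver", "phone", 100.0),
        ("Vancouver", "computer", 250.0),
        ("Toronto", "phone", 80.0),
    ],
)

# A virtual summary view: recomputed from the operational table on every
# query, which is why it consumes capacity on the operational server.
conn.execute(
    "CREATE VIEW sales_by_city AS "
    "SELECT city, SUM(amount) AS total FROM orders GROUP BY city"
)

# Emulated materialization (an assumption of this sketch): store the
# view's current result in a table so later reads skip the aggregation.
conn.execute("CREATE TABLE sales_by_city_mat AS SELECT * FROM sales_by_city")

virtual_rows = conn.execute(
    "SELECT city, total FROM sales_by_city ORDER BY city"
).fetchall()
materialized_rows = conn.execute(
    "SELECT city, total FROM sales_by_city_mat ORDER BY city"
).fetchall()
```

The materialized copy answers queries without touching `orders`, but it goes stale until it is refreshed.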
Data Warehouse Metadata
Metadata are data about data. When used in a data warehouse,
metadata are the data that define warehouse objects.
Metadata are created for the data names and definitions of the given
warehouse.
Additional metadata are created and captured for time stamping any
extracted data, the source of the extracted data, and missing fields that
have been added by data cleaning or integration processes.
A metadata repository should contain:
• A description of the data warehouse structure, which includes the
warehouse schema, view, dimensions, hierarchies, and derived data
definitions, as well as data mart locations and contents;
• Operational metadata, which include data lineage (history of migrated
data and the sequence of transformations applied to it), currency of
data (active, archived, or purged), and monitoring information
(warehouse usage statistics, error reports, and audit trails);
• The algorithms used for summarization, which include measure and
dimension definition algorithms, data on granularity, partitions, subject
areas, aggregation, summarization, and predefined queries and reports;
• The mapping from the operational environment to the data
warehouse, which includes source databases and their contents,
gateway descriptions, data partitions, data extraction, cleaning,
transformation rules and defaults, data refresh and purging rules, and
security (user authorization and access control).
• Data related to system performance, which include indices and profiles
that improve data access and retrieval performance, in addition to
rules for the timing and scheduling of refresh, update, and replication
cycles; and
• Business metadata, which include business terms and definitions, data
ownership information, and charging policies.
Data Warehouse Modeling: Data Cube and Online analytical
processing (OLAP)
• OLAP systems are data warehouse front-end software tools to make aggregate
data available efficiently, for advanced analysis, to managers of an enterprise.
• Data warehouses and OLAP tools are based on a multidimensional data model.
• This model views data in the form of a data cube.
• In this section, you will learn
– how data cubes model n-dimensional data,
– concept hierarchies, and
– how they can be used in basic OLAP operations to allow interactive mining at
multiple levels of abstraction.
• Typical OLAP operations: roll-up, drill-down, slice and dice, pivot (rotate)
Why cubes?
• The cube model is meant to be used by application builders who want to
provide analytical functionality.
• It gives a logical view of the analyzed data:
– how analysts look at data,
– how they think of data,
– not how the data are physically implemented in the data stores.
Data Cube: A Multidimensional Data Model
• What is a data cube?
– It is a multidimensional structure that contains information for
analytical purposes.
– The main constituents of a cube are dimensions and measures (or facts).
– Dimensions define the structure of the cube, over which you slice and
dice.
– Measures provide aggregated numerical values of interest to the end
user.
• In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.
• Ex. AllElectronics may create a sales data warehouse in order to keep
records of the store’s sales with respect to the dimensions time, item,
branch, and location.
– This lets the company keep track of things like monthly sales of items
and the branches and locations at which the items were sold.
• Each dimension may have a table associated with it, called a dimension
table, which further describes the dimension
For example, a dimension table for item may contain the attributes
item_name, brand, and type
• Dimension tables can be specified by users or experts, or generated and
adjusted based on data distributions
• A multidimensional data model is typically organized around a central
theme, such as SALES
• This theme is represented by a fact table
• The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables.
• To gain a better understanding of data cubes and the multidimensional
data model, let’s start by looking at a simple 2-D data cube
• Consider sales data from AllElectronics
• In particular, we will look at the AllElectronics sales data for items sold
per quarter in the city of Vancouver.
• In this 2-D representation, the sales for Vancouver are shown with
respect to
– the time dimension (organized in quarters) and
– the item dimension (organized according to the types of items sold).
– The fact or measure displayed is dollars sold (in thousands)
• 2-D representation of AllElectronics sales data for items sold per quarter
in the city of Vancouver
• Now, suppose that we would like to view the sales data with a third
dimension.
– For instance, suppose we would like to view the data according to time and
item, as well as location, for the cities Chicago, New York, Toronto, and
Vancouver.
– These 3-D data are shown in Table ….
– The 3-D data in the table are represented as a series of 2-D tables.
– Conceptually, we may also represent the same data in the form
of a 3-D data cube.
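A minimal sketch of the two representations, assuming the cube is held as a Python dictionary keyed by (time, item, location). The item types and dollar figures below are made up for illustration and are not the values from the AllElectronics tables. The helper `table_for_city` is a hypothetical name.

```python
from collections import defaultdict

# Hypothetical facts: (quarter, item type, city, dollars sold in thousands).
facts = [
    ("Q1", "home entertainment", "Vancouver", 600),
    ("Q1", "computer", "Vancouver", 800),
    ("Q2", "home entertainment", "Vancouver", 700),
    ("Q1", "home entertainment", "Toronto", 500),
    ("Q2", "computer", "Toronto", 900),
]

# 3-D cube: one cell per (time, item, location) combination.
cube = {(time, item, city): dollars for time, item, city, dollars in facts}


def table_for_city(cube, city):
    """Return the 2-D (time x item) table for a single location."""
    table = defaultdict(dict)
    for (time, item, location), dollars in cube.items():
        if location == city:
            table[time][item] = dollars
    return dict(table)


# The 3-D data viewed as a series of 2-D tables, one per city.
vancouver = table_for_city(cube, "Vancouver")
toronto = table_for_city(cube, "Toronto")
```

Fixing the location to one city recovers the 2-D time-by-item table, so the series of 2-D tables and the 3-D cube carry the same information.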
Efficient Computation of Data Cubes
• At the core of multidimensional data analysis is the efficient
computation of aggregations across many sets of dimensions
• In SQL terms, these aggregations are referred to as group-by’s.
• Each group-by can be represented by a cuboid, where the set of group-by’s
forms a lattice of cuboids defining a data cube.
• We will explore issues relating to the efficient computation of data cubes.
• In the data warehousing research literature, a data cube like those
shown in the figures is often referred to as a cuboid.
• Given a set of dimensions, we can generate a cuboid for each of the
possible subsets of the given dimensions.
• The result would form a lattice of cuboids, each showing the data at a
different level of summarization, or group-by.
• Let us understand through a simple example ……….
For three dimensions (a, b, c), the possible group-by’s are
{(a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ()}
Example
Suppose that you would like to create a data cube for AllElectronics
sales that contains the following: city, item, year, and sales in dollars.
You would like to be able to analyze the data, with queries such as the
following:
– “Compute the sum of sales, grouping by city and item.”
– “Compute the sum of sales, grouping by city.”
– “Compute the sum of sales, grouping by item.”
What is the total number of cuboids, or group-by’s, that can be computed for this
data cube?
Taking the three attributes, city, item, and year, as the dimensions for the data
cube, and sales in dollars as the measure, the total number of cuboids, or
group-by’s, that can be computed for this data cube is 2^3 = 8.
The possible group-by’s are the following:
{(city, item, year), (city, item), (city, year), (item, year),
(city), (item), (year), ()}
where () means that the group-by is empty (i.e., the dimensions are not grouped).
These group-by’s form a lattice of cuboids for the data cube.
The apex cuboid, or 0-D cuboid, refers to the case where the group-by is empty.
It contains the total sum of all sales.
The apex cuboid is the most generalized (least specific) cuboid and is often
denoted by all.
The base cuboid contains all three dimensions, city, item, and year.
It can return the total sales for any combination of the three dimensions
The base cuboid is the least generalized
An SQL query containing no group-by, such as “compute the sum of total
sales,” is a zero-dimensional operation.
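To make the lattice concrete, the following sketch enumerates every group-by over the dimensions city, item, and year and computes the sum of sales for each one. The rows are hypothetical, and the computation is deliberately naive: each cuboid is built from the raw rows, whereas efficient cube computation would reuse lower-level cuboids.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("city", "item", "year")

# Hypothetical rows: (city, item, year, sales_in_dollars).
rows = [
    ("Vancouver", "phone", 2015, 100),
    ("Vancouver", "computer", 2015, 300),
    ("Toronto", "phone", 2015, 200),
    ("Toronto", "phone", 2016, 400),
]


def all_group_bys(dimensions):
    """Yield every subset of the dimensions, from the base cuboid to ()."""
    for size in range(len(dimensions), -1, -1):
        yield from combinations(dimensions, size)


def compute_cuboid(rows, group_by):
    """Sum sales for each distinct combination of the group-by dimensions."""
    positions = [DIMENSIONS.index(dim) for dim in group_by]
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[pos] for pos in positions)
        totals[key] += row[-1]  # the last field is the sales measure
    return dict(totals)


# One entry per cuboid; the key is the tuple of grouped dimensions.
cube = {group_by: compute_cuboid(rows, group_by)
        for group_by in all_group_bys(DIMENSIONS)}

apex = cube[()]          # 0-D cuboid: a single cell with the grand total
base = cube[DIMENSIONS]  # 3-D cuboid: one cell per (city, item, year)
```

Here `cube[("city",)]` answers “compute the sum of sales, grouping by city,” and the apex cuboid holds only the overall total.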
Data Mining Query Language (DMQL)
The DMQL was proposed by Han, Fu, Wang, et al. for the DBMiner data mining
system.
Defining Star Schema in DMQL
A statement such as
compute cube sales_star
Defining Snowflake Schema
Defining Fact Constellation in DMQL
The Role of Concept Hierarchies
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to
higher-level, more general concepts.
A concept hierarchy organizes concepts (attribute values) hierarchically and is usually
associated with each dimension in a data warehouse.
Concept hierarchies facilitate drilling and rolling in data warehouses to view data at
multiple granularities.
Hierarchies can be explicitly specified by domain experts and/or data warehouse
designers.
Consider a concept hierarchy for the dimension location.
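A location hierarchy of this kind can be sketched as a chain of lookup tables, one per step from a lower level to the next, more general level. The level names, dictionary names, and the `generalize` helper below are assumptions made for illustration. The street level is omitted to keep the example short.

```python
# One mapping per step in the hierarchy: city -> province/state -> country.
CITY_TO_PROVINCE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "Chicago": "Illinois",
    "New York": "New York",
}
PROVINCE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "Illinois": "USA",
    "New York": "USA",
}

# Levels ordered from most detailed to most general.
LEVELS = ["city", "province_or_state", "country"]
MAPPINGS = [CITY_TO_PROVINCE, PROVINCE_TO_COUNTRY]


def generalize(value, from_level, to_level):
    """Map a value upward through the hierarchy, one level at a time."""
    start, end = LEVELS.index(from_level), LEVELS.index(to_level)
    if end < start:
        raise ValueError("can only generalize to a higher level")
    for mapping in MAPPINGS[start:end]:
        value = mapping[value]
    return value


country_of_vancouver = generalize("Vancouver", "city", "country")
state_of_chicago = generalize("Chicago", "city", "province_or_state")
```

Rolling up a cube along location amounts to applying such a mapping to every cell and summing the cells that land on the same higher-level value.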
Hierarchical and lattice structures of attributes in warehouse
dimensions: (a) a hierarchy for location and (b) a lattice for time.
Number of Cuboids
How many cuboids are there in an n-dimensional data cube?
– If there were no hierarchies associated with each dimension, then the total
number of cuboids for an n-dimensional data cube, as we have seen, is 2^n.
– For dimensions (Product, Region, City), 2^n = 2^3 = 8 cuboids.
– However, in practice, many dimensions do have hierarchies.
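– For an n-dimensional data cube, the total number of cuboids that can be
generated including hierarchies is
T = (L_1 + 1) × (L_2 + 1) × … × (L_n + 1),
where L_i is the number of levels associated with dimension i. One is added
to each L_i to include the virtual top level, all.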
– For example, consider three dimensions, each with a three-level hierarchy:

Dim_Product                 Dim_Region                      Dim_Time
Class    Item    Product    Country  State      City        Year  Month  Day
Class 1  Item 1  Camera     India    Karnataka  Mysore      2016  2      3
Class 2  Item 2  DVD        India    Tamilnadu  Salem       2015  4      2
Class 3  Item 3  LED        India    Kerala     Kozhikode   2014  5      1
…        …       …          …        …          …           …     …      …
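The count for such a cube can be worked out with the usual textbook formula: the product over all dimensions of (L_i + 1), where L_i is the number of hierarchy levels of dimension i and the extra 1 stands for the top level all. The sketch below applies it to the three dimensions in the table. The function name `total_cuboids` is hypothetical.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Product of (L_i + 1) over all dimensions; the +1 is the level 'all'."""
    return prod(levels + 1 for levels in levels_per_dimension)


# Hierarchy levels taken from the column headers of the example table.
hierarchies = {
    "Dim_Product": ["Class", "Item", "Product"],
    "Dim_Region": ["Country", "State", "City"],
    "Dim_Time": ["Year", "Month", "Day"],
}

with_hierarchies = total_cuboids(len(levels) for levels in hierarchies.values())

# With a single level per dimension the formula reduces to 2^n.
without_hierarchies = total_cuboids([1, 1, 1])
```

With three levels per dimension the result is (3 + 1)^3 = 64 cuboids, compared with 2^3 = 8 when the hierarchies are ignored.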
Construct a lattice of cuboids forming a data cube for the dimensions
time, item, location, and supplier.
• Star schema:
– The most common modeling paradigm is the star schema, in which the
data warehouse contains
1. A large central table (fact table) containing the bulk of the data, with
no redundancy, and
2. A set of smaller attendant tables (dimension tables), one for each
dimension.
– The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.
• Star schema for AllElectronics sales with four dimensions: time, item,
branch, and location.
• The schema contains a central fact table for sales that contains keys to
each of the four dimensions, along with two measures: dollars sold and
units sold
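A small sqlite3 sketch of such a star schema follows. The table names (`sales_fact`, `time_dim`, and so on), most column names, and all rows are assumptions chosen for illustration. Only the four dimensions, the two measures, and the item attributes item_name, brand, and type come from the text above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

-- Hypothetical rows.
INSERT INTO time_dim     VALUES (1, 'Q1', 2016), (2, 'Q2', 2016);
INSERT INTO item_dim     VALUES (1, 'item A', 'brand X', 'phone'),
                                (2, 'item B', 'brand Y', 'computer');
INSERT INTO branch_dim   VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Vancouver', 'Canada'), (2, 'Toronto', 'Canada');
INSERT INTO sales_fact   VALUES (1, 1, 1, 1, 100.0, 2),
                                (1, 2, 1, 1, 300.0, 1),
                                (2, 1, 1, 2, 150.0, 3);
""")

# A typical star join: follow a key from the fact table to one dimension
# table, then aggregate the measures by an attribute of that dimension.
sales_by_city = conn.execute("""
    SELECT l.city, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON f.location_key = l.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
```

Each fact row stores only keys and measures. Descriptive attributes such as city or brand live once in the dimension tables.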
• Snowflake schema:
• Distributive: an aggregate function is distributive if the result derived by applying the function to n
aggregate values (one per partition of the data) is the same as that derived by applying the function
to all the data without partitioning.
• E.g., count(), sum(), min(), max()
• Algebraic: an aggregate function is algebraic if it can be computed by an algebraic function with M
arguments (where M is a bounded integer), each of which is obtained by applying a distributive
aggregate function.
• E.g., avg(), min_N(), standard_deviation()
• Holistic: an aggregate function is holistic if there is no constant bound on the storage size needed to
describe a subaggregate.
• E.g., median(), mode(), rank()
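The three categories can be checked on a small example. The sketch below splits a hypothetical list of values into partitions and compares what can be recovered from per-partition aggregates with the result computed on all the data.

```python
from statistics import median

# Hypothetical data, already split into three partitions.
partitions = [[1, 2, 3], [10, 20], [4]]
all_values = [value for part in partitions for value in part]

# Distributive: combine the per-partition aggregates directly.
sum_from_parts = sum(sum(part) for part in partitions)
count_from_parts = sum(len(part) for part in partitions)  # counts are summed
max_from_parts = max(max(part) for part in partitions)

# Algebraic: avg() is derived from two distributive aggregates (M = 2).
avg_from_parts = sum_from_parts / count_from_parts

# Holistic: the median of partition medians is generally not the median
# of all the data, so no fixed-size subaggregate is enough.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here the sum, count, max, and average computed from the partitions match the values computed on all the data, while the median of medians (4) differs from the true median (3.5).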
Typical OLAP operations
Roll-up:
The roll-up operation (also called the drill-up operation by some vendors) performs aggregation on a
data cube, either by climbing up a concept hierarchy for a dimension or by dimension reduction.
For example, consider a location hierarchy defined as the total order “street < city < province or state <
country.” A roll-up on location aggregates the data by ascending this hierarchy from the level of city to
the level of country.
In other words, rather than grouping the data by city, the resulting cube groups the data by country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the
given cube.
Drill-down
Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a concept hierarchy for a dimension or introducing
additional dimensions.
For example, a drill-down can be performed on the central cube by stepping down a concept
hierarchy for time defined as “day < month < quarter < year.”
Drill-down occurs by descending the time hierarchy from the level of quarter to the more detailed level
of month. The resulting data cube details the total sales per month rather than summarizing them by
quarter.
Slice and dice
The slice operation performs a selection on one dimension of the given cube, resulting in a subcube.
For example, a slice can select the sales data from the central cube for the dimension time
using the criterion time = “Q1”.
The dice operation defines a subcube by performing a selection on two or more dimensions.
Pivot (rotate)
Pivot (also called rotate) is a visualization operation that rotates the data axes in view in order to provide
an alternative presentation of the data.
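The operations above can be sketched on a tiny cube stored as a dictionary keyed by (quarter, item, city). The data and all function names are hypothetical. Drill-down is not shown because it needs the more detailed data (for example, monthly figures) to be available and cannot be derived from the aggregated cells alone.

```python
from collections import defaultdict

CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "USA", "New York": "USA"}

# Hypothetical cube cells: (quarter, item, city) -> dollars sold.
cells = {
    ("Q1", "phone", "Vancouver"): 10,
    ("Q1", "phone", "Toronto"): 5,
    ("Q1", "computer", "Toronto"): 20,
    ("Q2", "phone", "Chicago"): 30,
    ("Q2", "computer", "New York"): 40,
}


def roll_up_location(cells):
    """Roll-up by climbing the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, item, city), dollars in cells.items():
        out[(quarter, item, CITY_TO_COUNTRY[city])] += dollars
    return dict(out)


def roll_up_drop_item(cells):
    """Roll-up by dimension reduction: remove the item dimension."""
    out = defaultdict(int)
    for (quarter, _item, city), dollars in cells.items():
        out[(quarter, city)] += dollars
    return dict(out)


def slice_time(cells, quarter):
    """Slice: select on one dimension (time), leaving an (item, city) subcube."""
    return {(item, city): dollars
            for (q, item, city), dollars in cells.items() if q == quarter}


def dice(cells, quarters, cities):
    """Dice: select on two dimensions (time and location)."""
    return {key: dollars for key, dollars in cells.items()
            if key[0] in quarters and key[2] in cities}


def pivot(table):
    """Pivot: swap the two axes of a 2-D table."""
    return {(b, a): dollars for (a, b), dollars in table.items()}


by_country = roll_up_location(cells)
by_quarter_city = roll_up_drop_item(cells)
q1 = slice_time(cells, "Q1")
q1_toronto = dice(cells, {"Q1"}, {"Toronto"})
q1_rotated = pivot(q1)
```

Roll-up changes the cell values by summing them, whereas slice, dice, and pivot only select or rearrange existing cells.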
Steps for the Design and Construction of Data Warehouses
First, having a data warehouse may provide a competitive advantage by presenting relevant information from
which to measure performance and make critical adjustments in order to help win over competitors.
Second, a data warehouse can enhance business productivity because it is able to quickly and efficiently gather
information that accurately describes the organization.
Third, a data warehouse facilitates customer relationship management because it provides a consistent view of
customers and items across all lines of business, all departments, and all markets.
Finally, a data warehouse may bring about cost reduction by tracking trends, patterns, and exceptions over long
periods in a consistent and reliable manner.
To design an effective data warehouse we need to understand and analyze business needs and
construct a business analysis framework.
The construction of a large and complex information system can be viewed as the construction
of a large and complex building, for which the owner, architect, and builder have different
views.
These views are combined to form a complex framework that represents the top-down,
business-driven, or owner’s perspective, as well as the bottom-up, builder-driven, or
implementer's view of the information system.
Four different views regarding the design of a data warehouse must be considered: the top-down view,
the data source view, the data warehouse view, and the business query view.
The top-down view allows the selection of the relevant information necessary for the data warehouse.
This information matches the current and future business needs.
The data source view exposes the information being captured, stored, and managed by operational
systems. This information may be documented at various levels of detail and accuracy, from individual
data source tables to integrated data source tables.
Data sources are often modeled by traditional data modeling techniques, such as the entity-
relationship model or CASE (computer-aided software engineering) tools.
The data warehouse view includes fact tables and dimension tables. It represents the information that
is stored inside the data warehouse, including precalculated totals and counts, as well as information
regarding the source, date, and time of origin, added to provide historical context.
Finally, the business query view is the perspective of data in the data warehouse from the viewpoint
of the end user.
The warehouse design process consists of the following steps.
1. Choose a business process to model, for example, orders, invoices, shipments, inventory, account
administration, sales, or the general ledger.
If the business process is organizational and involves multiple complex object collections, a data
warehouse model should be followed. However, if the process is departmental and focuses on the
analysis of one kind of business process, a data mart model should be chosen.
2. Choose the grain of the business process. The grain is the fundamental, atomic level of data to be
represented in the fact table for this process, for example, individual transactions, individual daily
snapshots, and so on.
3. Choose the dimensions that will apply to each fact table record. Typical dimensions are time, item,
customer, supplier, warehouse, transaction type, and status.
4. Choose the measures that will populate each fact table record. Typical measures are numeric additive
quantities like dollars sold and units sold.
Indexing OLAP Data: Bitmap Index
• Index on a particular column
• Each value in the column has a bit vector; bit operations are fast
• The length of the bit vector: # of records in the base table
• The i-th bit is set if the i-th row of the base table has the value for the indexed column
• Not suitable for high cardinality domains
– A recent bit compression technique, Word-Aligned Hybrid (WAH), makes it work for high cardinality
domains as well [Wu, et al. TODS’06]
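A minimal sketch of an uncompressed bitmap index, using a Python integer as each bit vector (bit i corresponds to row i). The column names and values are hypothetical, and WAH compression is not implemented here.

```python
def build_bitmap_index(column):
    """Map each distinct value to a bit vector; bit i is set if row i has it."""
    index = {}
    for row_number, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_number)
    return index


def rows_of(bits):
    """Return the row numbers whose bit is set in the bit vector."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical base table with two indexed columns.
region = ["Asia", "Europe", "Asia", "America", "Europe"]
customer_type = ["Retail", "Dealer", "Dealer", "Retail", "Dealer"]

region_index = build_bitmap_index(region)
type_index = build_bitmap_index(customer_type)

# Selections on several columns become bitwise operations on the vectors.
asia_and_dealer = rows_of(region_index["Asia"] & type_index["Dealer"])
europe_or_america = rows_of(region_index["Europe"] | region_index["America"])
```

One bit vector is kept per distinct value, which is why the uncompressed form grows quickly for high cardinality columns.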