A Multi-Dimensional Data Model

A multi-dimensional data model views data in the form of a data cube, which allows data to be modeled and viewed in multiple dimensions. Dimensions represent perspectives or entities that an organization wants to track, and are organized using dimension tables. Facts are numerical measures stored in a fact table along with keys linking to dimension tables. Common schemas for multi-dimensional databases include star schemas with one fact table linked to dimension tables, snowflake schemas with normalized dimension tables, and fact constellations with multiple linked fact tables sharing dimensions.

Uploaded by

Sasi S INDIA
Copyright
© All Rights Reserved

A multi-dimensional data model

From Tables and Spreadsheets to Data Cubes
• A data warehouse is based on a multidimensional data model, which views
data in the form of a data cube.
• In general terms, dimensions are the perspectives or entities with respect
to which an organization wants to keep records.
• A data cube, such as sales, allows data to be modeled and viewed in
multiple dimensions:
– Dimension tables, such as item (item_name, brand, type) or time (day,
week, month, quarter, year)
– A fact table, which contains measures (such as dollars_sold) and keys to
each of the related dimension tables
• Each dimension may have a table associated with it, called a dimension
table.

11/16/20 Data Mining: Concepts and Techniques 2


• Suppose that we would like to view the sales data
with a third dimension. For instance, suppose we
would like to view the data according to time and
item, as well as location for the cities Chicago,
New York, Toronto, and Vancouver.
• These 3-D data are shown in the table, where they
are represented as a series of 2-D tables.
• Conceptually, we may also represent the same
data in the form of a 3-D data cube.
• Suppose that we would now like to view our
sales data with an additional fourth
dimension, such as supplier. We can think of the
resulting 4-D cube as a series of 3-D cubes.
• Given a set of dimensions, we can generate a cuboid for
each of the possible subsets of the given dimensions.
• The result would form a lattice of cuboids, each showing
the data at a different level of summarization.
• The lattice of cuboids is then referred to as a data cube.
• In data warehousing literature, an n-D base cube, which
holds the lowest level of summarization, is called a base
cuboid. The topmost 0-D cuboid, which holds the highest
level of summarization, is called the apex cuboid.
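The lattice described above can be enumerated directly. The following is a minimal sketch, assuming the four dimensions named earlier (time, item, location, supplier); the function name `cuboid_lattice` is hypothetical and chosen only for illustration.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of the dimensions (every cuboid) by its size."""
    lattice = {}
    for k in range(len(dimensions) + 1):
        lattice[k] = [tuple(subset) for subset in combinations(dimensions, k)]
    return lattice


dims = ["time", "item", "location", "supplier"]
lattice = cuboid_lattice(dims)

apex = lattice[0][0]           # the 0-D apex cuboid: no dimensions, one grand total
base = lattice[len(dims)][0]   # the 4-D base cuboid: all dimensions, no summarization
total = sum(len(level) for level in lattice.values())

for k, cuboids in lattice.items():
    print(f"{k}-D cuboids: {cuboids}")
print("number of cuboids:", total)
```

With n dimensions and no concept hierarchies, the lattice contains 2^n cuboids, which is why the count grows quickly as dimensions are added.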
3.2.2 Stars, Snowflakes, and Fact Constellations:
Schemas for Multidimensional Databases

• The most popular data model for a data warehouse is a
multidimensional model. Such a model can exist in the form of a star
schema, a snowflake schema, or a fact constellation schema.

• Star schema: The most common modeling paradigm is the star schema,
in which the data warehouse contains (1) a large central table (fact
table) containing the bulk of the data, with no redundancy, and (2) a set
of smaller attendant tables (dimension tables), one for each dimension.
The schema graph resembles a starburst, with the dimension tables
displayed in a radial pattern around the central fact table.
Example 3.1 Star schema.
• A star schema for AllElectronics sales is shown in Figure
3.4. Sales are considered along four dimensions,
namely, time, item, branch, and location.
• The schema contains a central fact table for sales that
contains keys to each of the four dimensions, along with
two measures: dollars_sold and units_sold. To minimize
the size of the fact table, dimension identifiers (such as
time_key and item_key) are system-generated identifiers.
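A star schema of this shape can be sketched with Python's built-in `sqlite3` module. This is an illustration only, not the AllElectronics schema from Figure 3.4: the `_dim` and `_fact` table-name suffixes, the reduced attribute lists, and every data row are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (attribute lists shortened for the example).
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- Central fact table: one key per dimension plus the two measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);

-- Hypothetical sample rows.
INSERT INTO time_dim VALUES (1, 'Q1', 2020), (2, 'Q2', 2020);
INSERT INTO item_dim VALUES (1, 'laptop', 'computer'), (2, 'tv', 'home entertainment');
INSERT INTO branch_dim VALUES (1, 'B1');
INSERT INTO location_dim VALUES (1, 'Toronto', 'Canada'), (2, 'Chicago', 'USA');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 1, 1000.0, 2),
    (1, 2, 1, 1,  500.0, 1),
    (2, 1, 1, 2,  750.0, 3);
""")

# Each dimension is one join away from the fact table.
rows = conn.execute("""
    SELECT l.city, t.quarter, SUM(f.dollars_sold), SUM(f.units_sold)
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    GROUP BY l.city, t.quarter
    ORDER BY l.city, t.quarter
""").fetchall()
print(rows)
```

The query touches only the dimensions it needs, and every dimension attribute is reachable with a single join, which is the practical appeal of the star layout.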
• Snowflake schema: The snowflake schema is a
variant of the star schema model, where some
dimension tables are normalized, thereby
further splitting the data into additional
tables. The resulting schema graph forms a
shape similar to a snowflake.
Example 3.2 Snowflake schema.
• A snowflake schema for AllElectronics sales is given in Figure 3.5.
Here, the sales fact table is identical to that of the star schema in Figure 3.4. The main
difference between the two schemas is in the definition of dimension tables.
• The single dimension table for item in the star schema is normalized in the snowflake
schema, resulting in new item and supplier tables. For example, the item dimension
table now contains the attributes item_key, item_name, brand, type, and supplier_key,
where supplier_key is linked to the supplier dimension table, containing supplier_key and
supplier_type information.
• Similarly, the single dimension table for location in the star schema can be normalized
into two new tables: location and city. The city_key in the new location table links to the
city dimension.
• Notice that further normalization can be performed on province_or_state and country in
the snowflake schema shown in Figure 3.5, when desirable.


• The major difference between the snowflake and
star schema models is that the dimension tables of
the snowflake model may be kept in normalized
form to reduce redundancies. Such a table is easy
to maintain and saves storage space.
• Although the snowflake schema reduces redundancy,
it is not as popular as the star schema in data
warehouse design, because queries against it
need more joins.
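The trade-off between the two schemas can be seen on a small scale. The sketch below, under the assumption of three made-up item rows, splits a star-style item dimension into item and supplier tables and then joins them back; the helper names `snowflake_item` and `join_back` are hypothetical.

```python
# Star-style item dimension: supplier_type is repeated on every row (hypothetical data).
star_item_rows = [
    {"item_key": 1, "item_name": "laptop", "brand": "A", "type": "computer", "supplier_type": "wholesale"},
    {"item_key": 2, "item_name": "tablet", "brand": "A", "type": "computer", "supplier_type": "wholesale"},
    {"item_key": 3, "item_name": "tv", "brand": "B", "type": "home entertainment", "supplier_type": "retail"},
]


def snowflake_item(star_rows):
    """Normalize: move supplier_type into its own table behind a generated supplier_key."""
    supplier_keys = {}  # supplier_type -> system-generated supplier_key
    item_table = []
    for row in star_rows:
        key = supplier_keys.setdefault(row["supplier_type"], len(supplier_keys) + 1)
        item = {k: v for k, v in row.items() if k != "supplier_type"}
        item["supplier_key"] = key
        item_table.append(item)
    supplier_table = [
        {"supplier_key": key, "supplier_type": supplier_type}
        for supplier_type, key in supplier_keys.items()
    ]
    return item_table, supplier_table


def join_back(item_table, supplier_table):
    """The extra join a query needs in order to see supplier_type again."""
    type_by_key = {s["supplier_key"]: s["supplier_type"] for s in supplier_table}
    joined = []
    for item in item_table:
        row = {k: v for k, v in item.items() if k != "supplier_key"}
        row["supplier_type"] = type_by_key[item["supplier_key"]]
        joined.append(row)
    return joined


item_table, supplier_table = snowflake_item(star_item_rows)
print(supplier_table)
print(join_back(item_table, supplier_table) == star_item_rows)
```

Each supplier type is now stored once rather than once per item, at the cost of one more join whenever a query needs it.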
• Fact constellation: Sophisticated applications
may require multiple fact tables to share
dimension tables. This kind of schema can be
viewed as a collection of stars, and hence is
called a galaxy schema or a fact constellation.
Example 3.3 Fact constellation.
• A fact constellation schema is shown in Figure 3.6.
This schema specifies two fact tables, sales and shipping. The sales table
definition is identical to that of the star schema (Figure 3.4). The shipping
table has five dimensions, or keys: item_key, time_key, shipper_key,
from_location, and to_location, and two measures: dollars_cost and
units_shipped.
• A fact constellation schema allows dimension tables to be shared
between fact tables.
• For example, the dimension tables for time, item, and location are
shared between both the sales and shipping fact tables.


• In data warehousing, there is a distinction between a data warehouse
and a data mart.
• A data warehouse collects information about subjects that span the
entire organization. For data warehouses, the fact constellation
schema is commonly used, since it can model multiple, interrelated
subjects.
• A data mart, on the other hand, is a departmental subset of the data
warehouse that focuses on selected subjects, and thus its scope is
department-wide.
• For data marts, the star or snowflake schema is commonly used.
Cube Definition Syntax (BNF) in DMQL

• Cube Definition (Fact Table)
define cube <cube_name> [<dimension_list>]: <measure_list>
• Dimension Definition (Dimension Table)
define dimension <dimension_name> as (<attribute_or_subdimension_list>)
• Special Case (Shared Dimension Tables)
– The first time, the dimension is defined as part of a cube definition
– Each later use refers back to that first definition:
define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>
Defining Star Schema in DMQL

define cube sales_star [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
Defining Snowflake Schema in DMQL

define cube sales_snowflake [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type,
    supplier(supplier_key, supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street,
    city(city_key, province_or_state, country))
Defining Fact Constellation in DMQL

define cube sales [time, item, branch, location]:
    dollars_sold = sum(sales_in_dollars), avg_sales = avg(sales_in_dollars),
    units_sold = count(*)
define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)
define cube shipping [time, item, shipper, from_location, to_location]:
    dollar_cost = sum(cost_in_dollars), unit_shipped = count(*)
define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name,
    location as location in cube sales, shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales
Measures of Data Cube: Three Categories
• Distributive: An aggregate function is distributive if it can be computed in a
distributed manner as follows. Suppose the data are partitioned into n sets. We
apply the function to each partition, resulting in n aggregate values. If the result
derived by applying the function to the n aggregate values is the same as that
derived by applying the function to the entire data set (without partitioning), the
function can be computed in a distributed manner.
• E.g., count() can be computed for a data cube by first partitioning the cube into
a set of subcubes, computing count() for each subcube, and then summing up the
counts obtained for each subcube. Hence, count() is a distributive aggregate
function. For the same reason, sum(), min(), and max() are distributive aggregate
functions. A measure is distributive if it is obtained by applying a distributive
aggregate function.
• Algebraic: An aggregate function is algebraic if it can be computed by an
algebraic function with M arguments (where M is a bounded integer), each of
which is obtained by applying a distributive aggregate function.
• E.g., avg(), min_N(), standard_deviation(); avg() can be computed as sum()/count()
• Holistic: An aggregate function is holistic if there does not exist an algebraic
function with M arguments (where M is a constant) that characterizes the
computation.
• E.g., median(), mode(), rank()
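The three categories can be compared on one small data set. The sketch below assumes six made-up values split into n = 3 partitions; the variable names are illustrative only.

```python
from statistics import median

# Hypothetical measure values, partitioned into n = 3 sets.
partitions = [[4, 8, 15], [16, 23], [42]]
all_values = [value for part in partitions for value in part]

# Distributive: per-partition results combine into the overall result.
total_count = sum(len(part) for part in partitions)  # count(): sum of the sub-counts
total_sum = sum(sum(part) for part in partitions)    # sum(): sum of the sub-sums
total_max = max(max(part) for part in partitions)    # max(): max of the sub-maxima

# Algebraic: avg() is a function of M = 2 distributive results, sum() and count().
average = total_sum / total_count

# Holistic: median() has no such fixed-size summary. Combining the partition
# medians does not, in general, give the median of the whole data set.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)

print(total_count, total_sum, total_max, average)
print(median_of_medians, true_median)
```

For distributive and algebraic measures a cube can store a few numbers per cell and combine them on the way up the lattice; for a holistic measure such as the median, the example shows why the underlying values are needed.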

Concept Hierarchies

• A concept hierarchy defines a sequence of mappings from
a set of low-level concepts to higher-level, more general
concepts. Consider a concept hierarchy for the dimension
location. City values for location include Vancouver,
Toronto, New York, and Chicago. Each city, however, can be
mapped to the province or state to which it belongs.
• A concept hierarchy that is a total or partial order among
attributes in a database schema is called a schema
hierarchy.
• Concept hierarchies may also be defined by
discretizing or grouping values for a given dimension
or attribute, resulting in a set-grouping hierarchy. A
total or partial order can be defined among groups of
values. An example of a set-grouping hierarchy is
shown in Figure 3.9 for the dimension price, where
an interval ($X ... $Y] denotes the range from
$X (exclusive) to $Y (inclusive).
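Both kinds of hierarchy can be written down as simple lookups. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the price boundaries are made up for the example rather than taken from Figure 3.9.

```python
from bisect import bisect_left

# Schema hierarchy for location: city < province_or_state < country.
CITY_TO_PROVINCE_OR_STATE = {
    "Vancouver": "British Columbia",
    "Toronto": "Ontario",
    "New York": "New York",
    "Chicago": "Illinois",
}
PROVINCE_OR_STATE_TO_COUNTRY = {
    "British Columbia": "Canada",
    "Ontario": "Canada",
    "New York": "USA",
    "Illinois": "USA",
}


def generalize(city, level):
    """Map a city to the requested, more general level of the location hierarchy."""
    if level == "city":
        return city
    region = CITY_TO_PROVINCE_OR_STATE[city]
    if level == "province_or_state":
        return region
    if level == "country":
        return PROVINCE_OR_STATE_TO_COUNTRY[region]
    raise ValueError(f"unknown level: {level}")


# Set-grouping hierarchy for price: assumed interval boundaries in dollars.
PRICE_BOUNDS = [0, 200, 400, 600, 800, 1000]


def price_group(price):
    """Return the interval ($X ... $Y] that contains price (X exclusive, Y inclusive)."""
    i = bisect_left(PRICE_BOUNDS, price)
    if i == 0 or i == len(PRICE_BOUNDS):
        raise ValueError(f"price {price} is outside ($0 ... $1000]")
    return f"(${PRICE_BOUNDS[i - 1]} ... ${PRICE_BOUNDS[i]}]"


print(generalize("Vancouver", "country"))
print(price_group(200), price_group(200.01))
```

`bisect_left` is used so that a price equal to a boundary falls into the interval that ends at that boundary, matching the exclusive-inclusive convention above.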
OLAP Operations in the Multidimensional Data Model

• Roll-up: The roll-up operation (also called the drill-up operation by some
vendors) performs aggregation on a data cube, either by climbing up a
concept hierarchy for a dimension or by dimension reduction.
• When roll-up is performed by dimension reduction, one or more
dimensions are removed from the given cube. For example, consider a
sales data cube containing only the two dimensions location and time.
Roll-up may be performed by removing, say, the time dimension, resulting
in an aggregation of the total sales by location, rather than by location
and by time.
• Drill-down: Drill-down is the reverse of roll-up. It
navigates from less detailed data to more detailed data.
Drill-down can be realized by either stepping down a
concept hierarchy for a dimension or introducing
additional dimensions.
• For example, a drill-down can be performed on the
central cube by stepping down a concept hierarchy for
time defined as “day < month < quarter < year,” for
instance from the level of quarter to the more detailed
level of month.
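Roll-up and drill-down can be sketched on a tiny cube held in a dictionary. The cell values, the month-to-quarter mapping, and the function names below are assumptions made for illustration; the base data are kept at month level so that drill-down has detail to return to.

```python
from collections import defaultdict

MONTH_TO_QUARTER = {
    "Jan": "Q1", "Feb": "Q1", "Mar": "Q1",
    "Apr": "Q2", "May": "Q2", "Jun": "Q2",
}

# Base cuboid: (city, month) -> dollars_sold (hypothetical values).
base = {
    ("Toronto", "Jan"): 100, ("Toronto", "Feb"): 150, ("Toronto", "Apr"): 200,
    ("Chicago", "Jan"): 300, ("Chicago", "May"): 50,
}


def roll_up(cube, coarser_key):
    """Sum the cells whose keys map to the same coarser key."""
    result = defaultdict(int)
    for key, value in cube.items():
        result[coarser_key(key)] += value
    return dict(result)


# Roll-up by climbing the time hierarchy: month -> quarter.
by_quarter = roll_up(base, lambda key: (key[0], MONTH_TO_QUARTER[key[1]]))

# Roll-up by dimension reduction: remove the time dimension altogether.
by_city = roll_up(by_quarter, lambda key: (key[0],))


def drill_down(base_cube, city, quarter):
    """Return the month-level cells behind one (city, quarter) cell."""
    return {
        key: value for key, value in base_cube.items()
        if key[0] == city and MONTH_TO_QUARTER[key[1]] == quarter
    }


print(by_quarter)
print(by_city)
print(drill_down(base, "Toronto", "Q1"))
```

Note that drill-down reads from the more detailed data: the detail cannot be recovered from the aggregated cells alone.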


• Slice and dice: The slice operation performs a selection on one
dimension of the given cube, resulting in a subcube.
• Figure 3.10 shows a slice operation where the sales data are
selected from the central cube for the dimension time using the
criterion time = “Q1”. The dice operation defines a subcube by
performing a selection on two or more dimensions.
• Figure 3.10 shows a dice operation on the central cube based on
the following selection criteria that involve three dimensions:
(location = “Toronto” or “Vancouver”) and (time = “Q1” or “Q2”)
and (item = “home entertainment” or “computer”).
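The two selections above can be expressed with one filtering helper. This is a minimal sketch: the cube cells are made-up values, not the data of Figure 3.10, and the helper name `select` is hypothetical.

```python
DIMS = ("location", "time", "item")

# Cells keyed by (location, time, item); values are hypothetical.
cube = {
    ("Toronto", "Q1", "computer"): 10,
    ("Toronto", "Q3", "computer"): 20,
    ("Vancouver", "Q2", "home entertainment"): 30,
    ("Chicago", "Q1", "computer"): 40,
    ("Vancouver", "Q1", "phone"): 50,
    ("New York", "Q2", "security"): 60,
}


def select(cube, **criteria):
    """Keep the cells whose coordinate on each named dimension is in the allowed set."""
    for name in criteria:
        if name not in DIMS:
            raise KeyError(f"unknown dimension: {name}")
    return {
        key: value for key, value in cube.items()
        if all(key[DIMS.index(name)] in allowed for name, allowed in criteria.items())
    }


# Slice: a selection on one dimension (time = "Q1").
q1_slice = select(cube, time={"Q1"})

# Dice: a selection on three dimensions at once.
diced = select(
    cube,
    location={"Toronto", "Vancouver"},
    time={"Q1", "Q2"},
    item={"home entertainment", "computer"},
)

print(q1_slice)
print(diced)
```

A slice is simply the special case of the same selection with a criterion on a single dimension.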


• Pivot (rotate): Pivot (also called rotate) is a
visualization operation that rotates the data
axes in view in order to provide an alternative
presentation of the data. Figure 3.10 shows a
pivot operation where the item and location
axes in a 2-D slice are rotated.
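Pivoting a 2-D slice amounts to swapping its two axes. The sketch below assumes the slice is stored as a nested dictionary with made-up values; `pivot` is a hypothetical helper name.

```python
def pivot(table):
    """Swap the axes of a 2-D slice stored as {row_label: {column_label: value}}."""
    rotated = {}
    for row, columns in table.items():
        for column, value in columns.items():
            rotated.setdefault(column, {})[row] = value
    return rotated


# Rows are items, columns are locations (hypothetical values).
slice_2d = {
    "computer": {"Toronto": 10, "Vancouver": 20},
    "phone": {"Toronto": 30, "Vancouver": 40},
}

rotated = pivot(slice_2d)  # rows are now locations, columns are items
print(rotated)
```

The values are untouched; only the presentation changes, which is why pivot is described as a visualization operation.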
• Other OLAP operations: Some OLAP systems offer
additional drilling operations. For example, drill-
across executes queries involving (i.e., across)
more than one fact table.
• The drill-through operation uses relational SQL
facilities to drill through the bottom level of a data
cube down to its back-end relational tables.
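Drill-across can be illustrated with two small fact tables that share the item dimension, as in a fact constellation. All rows and helper names below are assumptions for the example, and the measure names follow the text of Example 3.3.

```python
from collections import defaultdict

# Two fact tables sharing the item dimension (hypothetical rows).
sales_fact = [
    {"item_key": 1, "time_key": 1, "dollars_sold": 1000.0},
    {"item_key": 1, "time_key": 2, "dollars_sold": 500.0},
    {"item_key": 2, "time_key": 1, "dollars_sold": 300.0},
]
shipping_fact = [
    {"item_key": 1, "time_key": 1, "dollars_cost": 40.0},
    {"item_key": 2, "time_key": 1, "dollars_cost": 25.0},
    {"item_key": 2, "time_key": 2, "dollars_cost": 5.0},
]
item_dim = {1: "laptop", 2: "tv"}  # shared dimension: item_key -> item_name


def total_by_item(fact_rows, measure):
    """Aggregate one measure of one fact table by the shared item_key."""
    totals = defaultdict(float)
    for row in fact_rows:
        totals[row["item_key"]] += row[measure]
    return totals


def drill_across(sales_rows, shipping_rows):
    """Line up measures from both fact tables on the shared item dimension."""
    sold = total_by_item(sales_rows, "dollars_sold")
    cost = total_by_item(shipping_rows, "dollars_cost")
    return {
        item_dim[key]: {
            "dollars_sold": sold.get(key, 0.0),
            "dollars_cost": cost.get(key, 0.0),
        }
        for key in sorted(set(sold) | set(cost))
    }


report = drill_across(sales_fact, shipping_fact)
print(report)
```

Each fact table is aggregated separately to the shared dimension before the results are combined, which avoids double counting when the two tables have different numbers of rows per item.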
OLAP Systems versus Statistical Databases

• A statistical database (SDB) is a database system that
is designed to support statistical applications.
While SDBs tend to focus on socioeconomic
applications, OLAP has been targeted for
business applications.
A Starnet Query Model for Querying
Multidimensional Databases
• The querying of multidimensional databases
can be based on a starnet model. A starnet
model consists of radial lines emanating from a
central point, where each line represents a
concept hierarchy for a dimension.
• Each abstraction level in the hierarchy is called
a footprint. These represent the granularities
available for use by OLAP operations such as
drill-down and roll-up.
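A starnet can be represented as one ordered list of footprints per dimension. The sketch below is an assumption-laden illustration: it uses the time hierarchy "day < month < quarter < year" and the location attributes that appear earlier in this deck, and the names `STARNET` and `move` are hypothetical.

```python
# One radial line per dimension, ordered from the most detailed footprint outward.
STARNET = {
    "time": ["day", "month", "quarter", "year"],
    "location": ["street", "city", "province_or_state", "country"],
}


def move(dimension, footprint, steps):
    """Move along a radial line: positive steps generalize, negative steps specialize."""
    line = STARNET[dimension]
    position = line.index(footprint) + steps
    if not 0 <= position < len(line):
        raise ValueError(f"no footprint {steps} step(s) from {footprint} on {dimension}")
    return line[position]


def roll_up(dimension, footprint):
    return move(dimension, footprint, +1)


def drill_down(dimension, footprint):
    return move(dimension, footprint, -1)


print(roll_up("time", "quarter"))
print(drill_down("location", "country"))
```

A query's current granularity is then one footprint per dimension, and each roll-up or drill-down changes exactly one of them.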
