Data Mining 4

Uploaded by

writetoaris
Data Warehousing and On-line Analytical Processing

■ Data Warehouse: Basic Concepts


■ Data Warehouse Modeling: Data Cube and OLAP
■ Data Warehouse Implementation

Data warehouses-Overview
■ Data warehouses generalize and consolidate data in multidimensional
space.
■ The construction of data warehouses involves data cleaning, data
integration, and data transformation.
■ It can be viewed as an important preprocessing step for data mining.
■ It provides online analytical processing (OLAP) tools for the interactive
analysis of multidimensional data of varied granularities.
■ OLAP tools facilitate effective data generalization and data mining.
■ Many other DM functions, such as association, classification, prediction,
and clustering, can be integrated with OLAP operations to enhance
interactive mining of knowledge at multiple levels of abstraction.
■ Hence, the DW has become an increasingly important platform for data
analysis and OLAP and will provide an effective platform for data mining.
■ Therefore, data warehousing and OLAP form an essential step in the
knowledge discovery process.

* Data Mining: Concepts and Techniques 2


What is a Data Warehouse?

■ Data warehousing provides architectures and tools for business
executives to systematically organize, understand, and use
their data to make strategic decisions.
■ Data warehouse systems are valuable tools in today’s
competitive, fast-evolving world.
■ In the last several years, many firms have spent millions of
dollars in building enterprise-wide data warehouses.
■ Data warehousing is the latest must-have marketing
weapon—a way to retain customers by learning more about
their needs.

What is a Data Warehouse?
■ Defined in many different ways, but not rigorously.
■ A decision support database (data repository) that is maintained
separately from the organization’s operational database
■ Supports information processing by providing a solid platform of
consolidated, historical data for analysis.
■ “A data warehouse is a subject-oriented, integrated, time-variant,
and nonvolatile collection of data in support of management’s
decision-making process.”—W. H. Inmon
■ The four keywords—subject-oriented, integrated, time-variant, and
nonvolatile—distinguish data warehouses from other data repository
systems, such as relational database systems, transaction processing
systems, and file systems.

Data Warehouse—Subject-Oriented

■ A DW is organized around major subjects, such as
customer, supplier, product, sales
■ It focuses on the modeling and analysis of data for
decision makers, and not on daily operations or
transaction processing of an organization.
■ Provides a simple and concise view around particular
subject issues by excluding data that are not useful in the
decision support process
Data Warehouse—Integrated

■ Constructed by integrating multiple, heterogeneous data sources
■ relational databases, flat files, on-line transaction records
■ Data cleaning and data integration techniques are
applied
■ To ensure consistency in naming conventions,
encoding structures, attribute measures, etc. among
different data sources
■ E.g., Hotel price: currency, tax, breakfast covered, etc.
■ When data is moved to the warehouse, it is
converted.

Data Warehouse—Time Variant

■ The time horizon for the data warehouse is significantly
longer than that of operational systems
■ Operational database: current value data
■ Data warehouse data: provide information from a
historical perspective (e.g., past 5-10 years)
■ Every key structure in the data warehouse
■ Contains an element of time, explicitly or implicitly
■ But the key of operational data may or may not
contain “time element”

Data Warehouse—Nonvolatile
■ A DW is a physically separate store of data transformed
from the operational environment.
■ Due to this separation, operational update of data does
not occur in the data warehouse environment
■ Does not require transaction processing, recovery,
and concurrency control mechanisms
■ Requires only two operations in data accessing:
■ initial loading of data and access of data

OLTP vs. OLAP

Why a Separate Data Warehouse?
■ High performance for both systems
■ DBMS: tuned for OLTP (access methods, indexing, concurrency
control, recovery)
■ Warehouse: tuned for OLAP (complex OLAP queries,
multidimensional view, consolidation)
■ Different functions and different data:
■ missing data: Decision support requires historical data which
operational DBs do not typically maintain
■ data consolidation: DS requires consolidation (aggregation,
summarization) of data from heterogeneous sources
■ data quality: different sources typically use inconsistent data
representations, codes and formats which have to be reconciled
■ Note: There are more and more systems which perform OLAP
analysis directly on relational databases
Data Warehouse: A Multi-Tiered Architecture

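[Figure: multi-tiered data warehouse architecture. Data Sources
(operational DBs and other sources) are extracted, transformed, loaded,
and refreshed, under a monitor and integrator, into Data Storage (the
data warehouse, data marts, and metadata). An OLAP server forms the OLAP
Engine tier and serves the Front-End Tools: analysis, query, reports,
and data mining.]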
Three Data Warehouse Models
■ From the architecture point of view, there are three data
warehouse models:
■ Enterprise warehouse
■ collects all of the information about subjects spanning
the entire organization
■ Data Mart
■ a subset of corporate-wide data that is of value to a
specific group of users. Its scope is confined to
specific, selected groups, such as a marketing data mart
■ Independent vs. dependent (directly from warehouse) data mart
■ Virtual warehouse
■ A set of views over operational databases
■ Only some of the possible summary views may be
materialized
Extraction, Transformation, and Loading (ETL)
■ Data warehouse systems use back-end tools and utilities to populate
and refresh their data (Figure 4.1).
■ These tools and utilities include the following functions:
■ Data extraction
■ get data from multiple, heterogeneous, and external sources
■ Data cleaning
■ detect errors in the data and rectify them when possible
■ Data transformation
■ convert data from legacy or host format to warehouse format
■ Load
■ sort, summarize, consolidate, compute views, check integrity, and
build indices and partitions
■ Refresh
■ propagate the updates from the data sources to the warehouse
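The ETL functions listed above can be sketched as a small pipeline. This is a minimal illustration only: the two source layouts, the field names, and the cleaning rule (drop records with no customer name) are assumptions made for the example, not part of any particular warehouse tool.

```python
# Hypothetical records from two heterogeneous sources (illustrative values only).
source_a = [{"cust": "Ann ", "amount_usd": "120.50", "date": "2024-01-05"}]
source_b = [{"customer_name": "bob", "amount_cents": 9900, "day": "05/01/2024"},
            {"customer_name": "", "amount_cents": 500, "day": "06/01/2024"}]

def extract():
    """Extraction: gather raw records from every source, tagged by origin."""
    return [("a", r) for r in source_a] + [("b", r) for r in source_b]

def clean(tagged):
    """Cleaning: drop records whose customer name is missing (assumed rule)."""
    kept = []
    for origin, r in tagged:
        name = r.get("cust") or r.get("customer_name") or ""
        if name.strip():
            kept.append((origin, r))
    return kept

def transform(tagged):
    """Transformation: convert each source format to one warehouse format."""
    rows = []
    for origin, r in tagged:
        if origin == "a":
            rows.append({"customer": r["cust"].strip().title(),
                         "amount": float(r["amount_usd"]),
                         "date": r["date"]})
        else:
            d, m, y = r["day"].split("/")  # source b stores dates as DD/MM/YYYY
            rows.append({"customer": r["customer_name"].strip().title(),
                         "amount": r["amount_cents"] / 100,
                         "date": f"{y}-{m}-{d}"})
    return rows

def load(rows):
    """Load: sort the rows and build a simple index on customer."""
    rows = sorted(rows, key=lambda r: (r["date"], r["customer"]))
    index = {}
    for pos, r in enumerate(rows):
        index.setdefault(r["customer"], []).append(pos)
    return rows, index

warehouse, customer_index = load(transform(clean(extract())))
```

A refresh step would rerun the same pipeline on records that changed at the sources since the last load.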

Metadata Repository
■ Metadata is the data defining warehouse objects. It stores:
■ Description of the structure of the data warehouse
■ schema, view, dimensions, hierarchies, derived data definitions, data
mart locations and contents
■ Operational meta-data
■ data lineage (history of migrated data and transformation path),
currency of data (active, archived, or purged), monitoring
information (warehouse usage statistics, error reports, audit trails)
■ The algorithms used for summarization
■ The mapping from operational environment to the data warehouse
■ Data related to system performance
■ warehouse schema, view and derived data definitions
■ Business data
■ business terms and definitions, ownership of data, charging policies
4.2 Data Warehouse Modeling: Data Cube and OLAP
■ Data warehouses and OLAP tools are based on a multidimensional
data model, which views data in the form of a data cube.
■ A data cube allows data to be modeled and viewed in multiple
dimensions. It is defined by dimensions and facts.
■ In general terms, dimensions are the perspectives or entities with
respect to which an organization wants to keep records.
■ For example, AllElectronics may create a sales data warehouse in
order to keep records of the store’s sales with respect to the
dimensions time, item, branch, and location.
■ These dimensions allow the store to keep track of things like
monthly sales of items and the branches and locations at which
the items were sold.
■ Each dimension may have a table associated with it, called a
dimension table, which further describes the dimension.
■ For example, a dimension table for item may contain the attributes
item name, brand, and type.

4.2.1 From Tables and Spreadsheets to Data Cubes

■ Dimension tables can be specified by users or experts, or automatically
generated and adjusted based on data distributions.
■ A multidimensional data model is typically organized around a central
theme, such as sales.
■ This theme is represented by a fact table.
■ Facts are numeric measures.
■ Think of them as the quantities by which we want to analyze
relationships between dimensions.
■ Examples of facts for a sales data warehouse include dollars sold (sales
amount in dollars), units sold (number of units sold), and amount
budgeted.
■ The fact table contains the names of the facts, or measures, as well as
keys to each of the related dimension tables.
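The relationship between a fact table and its dimension tables can be sketched with plain dictionaries. The keys, item names, and dollar amounts below are made up for illustration; only the structure (a fact row holds dimension keys plus numeric measures) comes from the text above.

```python
# Hypothetical dimension tables keyed by surrogate key (illustrative values).
item_dim = {1: {"item_name": "TV-100", "brand": "Acme", "type": "home entertainment"},
            2: {"item_name": "PC-200", "brand": "Bolt", "type": "computer"}}
location_dim = {10: {"city": "Vancouver"}, 11: {"city": "Toronto"}}

# Fact table: one key per related dimension, plus the numeric measures.
sales_fact = [
    {"item_key": 1, "location_key": 10, "dollars_sold": 605.0, "units_sold": 2},
    {"item_key": 2, "location_key": 10, "dollars_sold": 825.0, "units_sold": 1},
    {"item_key": 1, "location_key": 11, "dollars_sold": 300.0, "units_sold": 1},
]

def dollars_by(attribute, dim, key_name):
    """Sum dollars_sold grouped by one attribute of one dimension table."""
    totals = {}
    for row in sales_fact:
        value = dim[row[key_name]][attribute]  # follow the key into the dimension
        totals[value] = totals.get(value, 0.0) + row["dollars_sold"]
    return totals

by_city = dollars_by("city", location_dim, "location_key")
by_type = dollars_by("type", item_dim, "item_key")
```

The same fact rows answer both questions; only the dimension attribute used for grouping changes.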

■ Cubes are generally viewed as 3-D geometric structures, but in data
warehousing the data cube is n-dimensional.
■ To gain a better understanding of data cubes and the
multidimensional data model, let’s start by looking at a simple 2-D
data cube that is, in fact, a table or spreadsheet for sales data from
AllElectronics.
■ In particular, we will look at the AllElectronics sales data for items sold
per quarter in the city of Vancouver.
■ These data are shown in Table 4.2.
■ In this 2-D representation, the sales for Vancouver are shown with
respect to the time dimension (organized in quarters) and the item
dimension (organized according to the types of items sold).
■ The fact or measure displayed is dollars sold (in thousands).

■ Now, suppose that we would like to view the sales data with a third
dimension.
■ For instance, suppose we would like to view the data according to
time and item, as well as location, for the cities Chicago, New York,
Toronto, and Vancouver.
■ These 3-D data are shown in Table 4.3. The 3-D data in the table
are represented as a series of 2-D tables.
■ Conceptually, we may also represent the same data in the form of a
3-D data cube, as in Figure 4.3.
■ Suppose that we would now like to view our sales data with an
additional fourth dimension such as supplier.
■ Viewing things in 4-D becomes tricky.
■ However, we can think of a 4-D cube as being a series of 3-D
cubes, as shown in Figure 4.4.

■ If we continue in this way, we may display any n-dimensional data as
a series of (n-1) dimensional “cubes.”
■ The data cube is a metaphor for multidimensional data storage.
■ The actual physical storage of such data may differ from its logical
representation.
■ The important thing to remember is that data cubes are n-dimensional
and do not confine data to 3-D.
■ Tables 4.2 and 4.3 show the data at different degrees of
summarization.
■ In the data warehousing research literature, a data cube like those
shown in Figures 4.3 and 4.4 is often referred to as a cuboid.
■ Given a set of dimensions, we can generate a cuboid for each of the
possible subsets of the given dimensions.
■ The result would form a lattice of cuboids, each showing the data at a
different level of summarization, or group-by.

■ The lattice of cuboids is then referred to as a data cube.
■ Figure 4.5 shows a lattice of cuboids forming a data cube for the
dimensions time, item, location, and supplier.
■ The cuboid that holds the lowest level of summarization is called the
base cuboid.
■ For example, the 4-D cuboid in Figure 4.4 is the base cuboid for the
given time, item, location, and supplier dimensions.
■ Figure 4.3 is a 3-D (nonbase) cuboid for time, item, and location,
summarized for all suppliers.
■ The 0-D cuboid, which holds the highest level of summarization, is
called the apex cuboid.
■ In our example, this is the total sales, or dollars sold, summarized
over all four dimensions.
■ The apex cuboid is typically denoted by all.
■ Figure 4.5 Lattice of cuboids, making up a 4-D data cube for time,
item, location, and supplier. Each cuboid represents a different
degree of summarization.
Cube: A Lattice of Cuboids

■ 0-D (apex) cuboid: all
■ 1-D cuboids: (time), (item), (location), (supplier)
■ 2-D cuboids: (time, item), (time, location), (time, supplier),
(item, location), (item, supplier), (location, supplier)
■ 3-D cuboids: (time, item, location), (time, item, supplier),
(time, location, supplier), (item, location, supplier)
■ 4-D (base) cuboid: (time, item, location, supplier)
4.2.2 Conceptual Modeling of Data Warehouses
■ Modeling data warehouses: dimensions & measures
■ Star schema: A fact table in the middle connected to a
set of dimension tables
■ Snowflake schema: A refinement of star schema
where some dimensional hierarchy is normalized into a
set of smaller dimension tables, forming a shape
similar to snowflake
■ Fact constellations: Multiple fact tables share
dimension tables, viewed as a collection of stars,
therefore called galaxy schema or fact constellation

Example of Star Schema
■ Sales fact table: time_key, item_key, branch_key, location_key
(dimension keys); units_sold, dollars_sold, avg_sales (measures)
■ time dimension table: time_key, day, day_of_the_week, month, quarter,
year
■ item dimension table: item_key, item_name, brand, type, supplier_type
■ branch dimension table: branch_key, branch_name, branch_type
■ location dimension table: location_key, street, city,
state_or_province, country
Example of Snowflake Schema
■ Sales fact table: time_key, item_key, branch_key, location_key
(dimension keys); units_sold, dollars_sold, avg_sales (measures)
■ time dimension table: time_key, day, day_of_the_week, month, quarter,
year
■ item dimension table: item_key, item_name, brand, type, supplier_key
■ supplier dimension table: supplier_key, supplier_type
■ branch dimension table: branch_key, branch_name, branch_type
■ location dimension table: location_key, street, city_key
■ city dimension table: city_key, city, state_or_province, country
Example of Fact Constellation
■ Sales fact table: time_key, item_key, branch_key, location_key
(dimension keys); units_sold, dollars_sold, avg_sales (measures)
■ Shipping fact table: time_key, item_key, shipper_key, from_location,
to_location (dimension keys); dollars_cost, units_shipped (measures)
■ time dimension table: time_key, day, day_of_the_week, month, quarter,
year
■ item dimension table: item_key, item_name, brand, type, supplier_type
■ branch dimension table: branch_key, branch_name, branch_type
■ location dimension table: location_key, street, city,
province_or_state, country
■ shipper dimension table: shipper_key, shipper_name, location_key,
shipper_type
4.2.3 Dimensions: The Role of Concept
Hierarchies

■ A concept hierarchy defines a sequence of mappings from a set
of low-level concepts to higher-level, more general concepts.
■ Consider a concept hierarchy for the dimension location.
■ City values for location include Vancouver, Toronto, New York, and
Chicago.
■ Each city, however, can be mapped to the province or state to
which it belongs.
■ For example, Vancouver can be mapped to British Columbia, and
Chicago to Illinois.
■ The provinces and states can in turn be mapped to the country
(e.g., Canada or the United States) to which they belong.
■ These mappings form a concept hierarchy for the dimension location,
mapping a set of low-level concepts (i.e., cities) to higher-level, more
general concepts (i.e., countries).
■ This concept hierarchy is illustrated in Figure 4.9.
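The location hierarchy described above can be sketched as two mapping tables, with a roll-up that re-aggregates city totals at a higher level. The sales figures are invented for the example, and the function names are illustrative rather than taken from any OLAP product.

```python
# Concept hierarchy city < province_or_state < country, as two mappings.
CITY_TO_REGION = {"Vancouver": "British Columbia", "Toronto": "Ontario",
                  "New York": "New York", "Chicago": "Illinois"}
REGION_TO_COUNTRY = {"British Columbia": "Canada", "Ontario": "Canada",
                     "New York": "United States", "Illinois": "United States"}

def generalize(city, level):
    """Map a city to the requested level of the location hierarchy."""
    if level == "city":
        return city
    region = CITY_TO_REGION[city]
    if level == "province_or_state":
        return region
    if level == "country":
        return REGION_TO_COUNTRY[region]
    raise ValueError(f"unknown level: {level}")

def roll_up(sales_by_city, level):
    """Re-aggregate city-level totals at a more general level."""
    totals = {}
    for city, amount in sales_by_city.items():
        key = generalize(city, level)
        totals[key] = totals.get(key, 0) + amount
    return totals

# Made-up sales totals per city.
sales_by_city = {"Vancouver": 10, "Toronto": 20, "New York": 30, "Chicago": 40}
by_country = roll_up(sales_by_city, "country")
by_region = roll_up(sales_by_city, "province_or_state")
```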
A Concept Hierarchy: Dimension (location)

■ all: all
■ region: Europe, ..., North_America
■ country: Germany, ..., Spain (Europe); Canada, ..., Mexico
(North_America)
■ city: Frankfurt, ... (Germany); Vancouver, ..., Toronto (Canada)
■ office: L. Chan, ..., M. Wind
■ Many concept hierarchies are implicit within the database schema.
■ For example, suppose that the dimension location is described by
the attributes number, street, city, province or state, zip code, and
country.
■ These attributes are related by a total order, forming a concept
hierarchy such as “street < city < province or state < country.”
■ This hierarchy is shown in Figure 4.10(a).
■ Alternatively, the attributes of a dimension may be organized in a
partial order, forming a lattice.
■ An example of a partial order for the time dimension based on the
attributes day, week, month, quarter, and year is “day < {month < quarter;
week} < year.”
■ This lattice structure is shown in Figure 4.10(b).
■ A concept hierarchy that is a total or partial order among attributes
in a database schema is called a schema hierarchy.
■ Concept hierarchies that are common to many applications (e.g.,
for time) may be predefined in the data mining system.

■ DM systems should provide users with the flexibility to tailor
predefined hierarchies according to their particular needs.
■ For example, users may want to define a fiscal year starting on April 1
or an academic year starting on September 1.
■ Concept hierarchies may also be defined by discretizing or
grouping values for a given dimension or attribute, resulting in a
set-grouping hierarchy.
■ A total or partial order can be defined among groups of values.
■ An example of a set-grouping hierarchy is shown in Figure 4.11 for the
dimension price, where an interval ($X … $Y] denotes the range from
$X (exclusive) to $Y (inclusive).
■ There may be more than one concept hierarchy for a given
attribute or dimension, based on different user viewpoints.
■ For instance, a user may prefer to organize price by defining ranges
for inexpensive, moderately priced, and expensive.
■ Concept hierarchies may be provided manually by system users,
domain experts, or knowledge engineers, or may be automatically
generated based on statistical analysis of the data distribution.
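A set-grouping hierarchy for price can be sketched by discretizing values into half-open intervals, exclusive at the lower end and inclusive at the upper end. The interval boundaries below are assumptions chosen for illustration; they are not the ones in Figure 4.11.

```python
from bisect import bisect_left

# Hypothetical upper bounds of the price intervals (illustrative values).
BOUNDS = [200, 400, 600, 800, 1000]

def price_interval(price):
    """Return the (low, high) pair of the interval (low, high] containing price.

    Returns None when the price falls outside (0, 1000].
    """
    if price <= 0 or price > BOUNDS[-1]:
        return None
    i = bisect_left(BOUNDS, price)  # first bound that is >= price
    low = 0 if i == 0 else BOUNDS[i - 1]
    return (low, BOUNDS[i])

groups = [price_interval(p) for p in (150, 200, 200.01, 1000)]
```

Because the upper end is inclusive, a price of exactly 200 belongs to (0, 200], while 200.01 belongs to (200, 400]. A second hierarchy for the same attribute (for example inexpensive, moderately priced, expensive) would only need a different list of bounds.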
4.2.4 Data Cube Measures: Three Categories

■ To see how measures are computed, it is first necessary to study how
measures can be categorized.
■ A data cube measure is a numeric function that can be evaluated
at each point in the data cube space.
■ A measure value is computed for a given point by aggregating the
data corresponding to the respective dimension–value pairs
defining the given point.
■ Measures can be organized into three categories—distributive,
algebraic, and holistic— based on the kind of aggregate functions
used.
■ Distributive: An aggregate function is distributive if the result derived
by applying the function to n aggregate values is the same as that
derived by applying the function on all the data without partitioning
■ E.g., count(), sum(), min(), max()

■ Algebraic: An aggregate function is algebraic if it can be computed by
an algebraic function with M arguments (where M is a bounded
integer), each of which is obtained by applying a distributive
aggregate function
■ E.g., avg(), min_N(), standard_deviation()
■ Holistic: An aggregate function is holistic if there is no constant bound
on the storage size needed to describe a subaggregate.
■ E.g., median(), mode(), rank()
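The three categories can be contrasted on a small partitioned data set. The numbers are arbitrary; the point of this sketch is only which aggregates can be rebuilt from per-partition results.

```python
from statistics import median

partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]  # made-up data, split into chunks
all_values = [v for part in partitions for v in part]

# Distributive: the sum of per-partition sums equals the sum over all the data.
total = sum(sum(part) for part in partitions)

# Algebraic: avg is derived from two distributive aggregates, sum and count.
sub = [(sum(part), len(part)) for part in partitions]
average = sum(s for s, _ in sub) / sum(c for _, c in sub)

# Holistic: per-partition medians do not determine the overall median.
median_of_medians = median(median(part) for part in partitions)
true_median = median(all_values)
```

Here `total` and `average` match the values computed directly on `all_values`, while `median_of_medians` differs from `true_median`, which is why a holistic measure cannot be assembled from fixed-size subaggregates.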

Multidimensional Data

■ Sales volume as a function of product, month, and region
■ Dimensions: Product, Location, Time
■ Hierarchical summarization paths
■ Product: Industry, Category, Product
■ Location: Region, Country, City, Office
■ Time: Year, Quarter, Month or Week, Day
A Sample Data Cube

[Figure: a 3-D sales cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr,
sum), Product (TV, PC, VCR, sum), and Country (U.S.A., Canada, Mexico,
sum). The highlighted cell holds the total annual sales of TVs in U.S.A.]
Cuboids Corresponding to the Cube

■ 0-D (apex) cuboid: all
■ 1-D cuboids: (product), (date), (country)
■ 2-D cuboids: (product, date), (product, country), (date, country)
■ 3-D (base) cuboid: (product, date, country)
Typical OLAP Operations
■ Roll up (drill-up): summarize data
■ by climbing up hierarchy or by dimension reduction
■ Drill down (roll down): reverse of roll-up
■ from higher level summary to lower level summary or
detailed data, or introducing new dimensions
■ Slice and dice: project and select
■ Pivot (rotate):
■ reorient the cube, visualization, 3D to series of 2D planes
■ Other operations
■ drill across: involving (across) more than one fact table
■ drill through: through the bottom level of the cube to its
back-end relational tables (using SQL)
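The core operations can be sketched on a tiny cube stored as a dictionary from (quarter, item type, city) to dollars sold. The cell values and helper names are illustrative assumptions; real OLAP servers expose these operations through their own query languages.

```python
# Hypothetical 3-D cube cells: (quarter, item_type, city) -> dollars_sold.
cube = {
    ("Q1", "TV", "Vancouver"): 10, ("Q1", "PC", "Vancouver"): 20,
    ("Q1", "TV", "Toronto"): 30, ("Q1", "TV", "Chicago"): 40,
    ("Q2", "PC", "Chicago"): 50,
}
CITY_TO_COUNTRY = {"Vancouver": "Canada", "Toronto": "Canada",
                   "Chicago": "United States"}

def roll_up_location(cube, mapping):
    """Roll up by climbing the location hierarchy from city to country."""
    out = {}
    for (quarter, item, city), value in cube.items():
        key = (quarter, item, mapping[city])
        out[key] = out.get(key, 0) + value
    return out

def roll_up_drop(cube, axis):
    """Roll up by dimension reduction: aggregate one dimension away."""
    out = {}
    for key, value in cube.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = out.get(reduced, 0) + value
    return out

def slice_(cube, axis, value):
    """Slice: select a single value on one dimension."""
    return {k: v for k, v in cube.items() if k[axis] == value}

def dice(cube, allowed):
    """Dice: select on several dimensions; allowed maps axis -> set of values."""
    return {k: v for k, v in cube.items()
            if all(k[a] in vals for a, vals in allowed.items())}

def pivot(cube, order):
    """Pivot: reorder the axes without changing any cell value."""
    return {tuple(k[i] for i in order): v for k, v in cube.items()}

by_country = roll_up_location(cube, CITY_TO_COUNTRY)
by_quarter_item = roll_up_drop(cube, axis=2)
q1 = slice_(cube, axis=0, value="Q1")
q1_tv = dice(cube, {0: {"Q1"}, 1: {"TV"}})
rotated = pivot(cube, order=(1, 0, 2))
```

Drill-down is the reverse of either roll-up and needs the more detailed cells, so it reads from `cube` rather than from `by_country`.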

A Star-Net Query Model
[Figure: star-net query model. Radial lines, one per dimension, carry
the abstraction levels: Customer Orders (CONTRACTS, ORDER), Shipping
Method (AIR-EXPRESS, TRUCK), Time (ANNUALLY, QTRLY, DAILY), Product
(PRODUCT ITEM, PRODUCT GROUP, PRODUCT LINE), Location (CITY, COUNTRY,
REGION), Organization (SALES PERSON, DISTRICT, DIVISION), Promotion,
and Customer. Each circle is called a footprint.]
4.3 Design of Data Warehouse: A Business
Analysis Framework
■ Four views regarding the design of a data warehouse
■ Top-down view
■ allows selection of the relevant information necessary for the
data warehouse
■ Data source view
■ exposes the information being captured, stored, and
managed by operational systems
■ Data warehouse view
■ consists of fact tables and dimension tables
■ Business query view
■ sees the perspectives of data in the warehouse from the view
of end-user

Data Warehouse Design Process
■ Top-down, bottom-up approaches or a combination of both
■ Top-down: Starts with overall design and planning (mature)
■ Bottom-up: Starts with experiments and prototypes (rapid)
■ From software engineering point of view
■ Waterfall: structured and systematic analysis at each step before
proceeding to the next
■ Spiral: rapid generation of increasingly functional systems, with short
turnaround time
■ Typical data warehouse design process
■ Choose a business process to model, e.g., orders, invoices, etc.
■ Choose the grain (atomic level of data) of the business process
■ Choose the dimensions that will apply to each fact table record
■ Choose the measure that will populate each fact table record

Data Warehouse Development: A
Recommended Approach
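[Figure: recommended development approach. Starting from "Define a
high-level corporate data model", data marts and the enterprise data
warehouse are built with repeated model refinement; the data marts
become distributed data marts, and together with the enterprise data
warehouse they form a multi-tier data warehouse.]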
Data Warehouse Usage

■ Data warehouses and data marts are used in a wide range of applications.
■ Business executives use the data in data warehouses and data
marts to perform data analysis and make strategic decisions.
■ In many firms, DWs are used as an integral part of a plan-execute-assess
“closed-loop” feedback system for enterprise management.
■ DWs are used extensively in banking and financial services, consumer
goods and retail distribution sectors, and controlled manufacturing
such as demand-based production.
■ Initially, the data warehouse is mainly used for generating reports and
answering predefined queries.
■ Progressively, it is used to analyze summarized and detailed data,
where the results are presented in the form of reports and charts.
■ Later, it is used for strategic purposes, performing multidimensional
analysis and sophisticated slice-and-dice operations.

Data Warehouse Usage

■ Finally, the data warehouse may be employed for knowledge discovery
and strategic decision making using DM tools.
■ The tools for data warehousing can be categorized into access & retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
■ There are three kinds of data warehouse applications:
■ information processing,
■ analytical processing, and
■ data mining.

Data Warehouse Usage
■ Information processing
■ supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts and graphs
■ A current trend is to construct low-cost web-based accessing tools that
are then integrated with web browsers.
■ Analytical processing
■ supports basic OLAP operations, including slice-and-dice, drill-down,
roll-up, and pivoting
■ It generally operates on historic data in both summarized and detailed
forms.
■ The major strength of online analytical processing over information
processing is the multidimensional data analysis of data warehouse data.
■ Data mining
■ supports knowledge discovery by finding hidden patterns
■ supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools
From On-Line Analytical Processing (OLAP)
to On-Line Analytical Mining (OLAM)
■ The DM field has conducted research on various data types,
■ including relational data, data from data warehouses, transaction
data, time-series data, spatial data, text data, and flat files.
■ Multidimensional data mining (also known as exploratory
multidimensional data mining, online analytical mining, or OLAM)
integrates OLAP with data mining to uncover knowledge in
multidimensional databases.
■ Among the many different paradigms and architectures of DM
systems, OLAM is particularly important for the following reasons:
■ High quality of data in data warehouses
■ Most DM tools need to work on integrated, consistent, and
cleaned data, which requires costly data cleaning, data
integration, and data transformation as pre-processing steps.
■ A DW constructed by such pre-processing serves as a valuable
source of high-quality data for OLAP as well as for data mining.
■ Notice that data mining may serve as a valuable tool for data
cleaning and data integration as well.
■ Available information processing infrastructure surrounding data
warehouses
■ Information processing and data analysis infrastructures
constructed surrounding data warehouses include
■ accessing, integration, consolidation, and transformation of
multiple heterogeneous databases, ODBC/OLEDB connections,
Web accessing and service facilities, and reporting and OLAP
analysis tools.
■ It is prudent to use the available infrastructures rather than
constructing everything from scratch.


■ OLAP-based exploration of multidimensional data:
■ Effective data mining needs exploratory data analysis.
■ A user will often want to traverse through a database, select
portions of relevant data, analyze them at different granularities, and
present knowledge/results in different forms.

■ Multidimensional data mining provides facilities for mining on
different subsets of data and at varying levels of abstraction
■ by drilling, pivoting, filtering, dicing, and slicing on a data cube
and/or intermediate data mining results.


■ This, together with data/knowledge visualization tools, greatly
enhances the power and flexibility of data mining.
■ Online selection of data mining functions:
■ Users may not always know the specific kinds of knowledge to be
mined.
■ By integrating OLAP with various DM functions, multidimensional data
mining provides users with the flexibility to select desired data mining
functions and swap data mining tasks dynamically.

4.4 Data Warehouse Implementation
■ Data warehouses contain huge volumes of data.
■ OLAP servers demand that decision support queries be answered in the order of
seconds.
■ Therefore, it is crucial for DW systems to support highly efficient cube
computation techniques, access methods, and query processing techniques.
4.4.1 Efficient Data Cube Computation
■ At the core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
■ In SQL terms, these aggregations are referred to as group-by’s.
■ Each group-by can be represented by a cuboid, where the set of group-by’s
forms a lattice of cuboids defining a data cube.
■ We explore the issues relating to the efficient computation of data cubes.
The compute cube Operator
■ One approach to cube computation extends SQL so as to include a compute
cube operator.
■ The compute cube operator computes aggregates over all subsets of the
dimensions specified in the operation.
■ This can require excessive storage space, especially for large numbers of
dimensions.

■ A data cube is a lattice of cuboids.
■ Ex : Create a data cube for AllElectronics sales that contains the following:
city, item, year, and sales in dollars.
■ You want to be able to analyze the data, with queries such as the following:
■ “Compute the sum of sales, grouping by city and item.”
■ “Compute the sum of sales, grouping by city.”
■ “Compute the sum of sales, grouping by item.”
■ What is the total number of cuboids, or group-by’s, that can be computed
for this data cube?
■ Taking the three attributes, city, item, and year, as the dimensions for the
data cube, and sales in dollars as the measure, the total number of cuboids,
or group-by’s, that can be computed for this data cube is 2^3 = 8.
■ The possible group-by’s are the following: {(city, item, year), (city, item),
(city, year), (item, year), (city), (item), (year), ()} where () means that the
group-by is empty (i.e., the dimensions are not grouped).
■ These group-by’s form a lattice of cuboids for the data cube, as shown in
Figure 4.14.
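The eight group-by's can be enumerated and computed mechanically. The sketch below is an illustrative stand-in for a compute cube operator, not an actual SQL extension; the three base rows are made-up data.

```python
from itertools import combinations

# Hypothetical base data: (city, item, year, sales_in_dollars).
rows = [
    ("Vancouver", "TV", 2023, 100.0),
    ("Vancouver", "PC", 2023, 250.0),
    ("Toronto", "TV", 2024, 80.0),
]
DIMS = ("city", "item", "year")

def compute_cube(rows, dims):
    """Aggregate sum(sales) for every subset of dims (2**len(dims) group-bys)."""
    cube = {}
    for k in range(len(dims), -1, -1):  # base cuboid first, apex last
        for group_by in combinations(range(len(dims)), k):
            cells = {}
            for row in rows:
                key = tuple(row[i] for i in group_by)
                cells[key] = cells.get(key, 0.0) + row[-1]
            cube[tuple(dims[i] for i in group_by)] = cells
    return cube

cube = compute_cube(rows, DIMS)
```

`cube[("city", "item")]` answers "sum of sales, grouping by city and item", and `cube[()]` is the apex cuboid holding the grand total. Every cuboid here is computed from the base rows, which is simple but wasteful; practical methods reuse lower-level cuboids.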

■ The base cuboid contains all three dimensions, city, item, and year.
■ It can return the total sales for any combination of the three dimensions.
■ The apex cuboid, or 0-D cuboid, refers to the case where the group-by is
empty.
■ It contains the total sum of all sales.
■ The base cuboid is the least generalized (most specific) of the cuboids.
■ The apex cuboid is the most generalized (least specific) of the cuboids, and
is often denoted as all.
■ If we start at the apex cuboid and explore downward in the lattice, this is
equivalent to drilling down within the data cube.
■ If we start at the base cuboid and explore upward, this is akin to rolling up.

■ For an n-dimensional data cube, the total number of cuboids that can be generated
(including the cuboids generated by climbing up the hierarchies along each
dimension) is
$$\text{Total number of cuboids} = \prod_{i=1}^{n} (L_i + 1),$$
■ where $L_i$ is the number of levels associated with dimension i. One is added to
$L_i$ to include the virtual top level, all.
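The count is the product of (L_i + 1) over the n dimensions, which a few lines of code can evaluate. The function name is illustrative, and the second call uses an assumed cube of 10 dimensions with 4 levels each.

```python
from math import prod

def total_cuboids(levels):
    """Number of cuboids when dimension i has levels[i] hierarchy levels.

    Each dimension contributes (L_i + 1) choices; the extra one is the
    virtual top level "all".
    """
    return prod(level_count + 1 for level_count in levels)

no_hierarchies = total_cuboids([1, 1, 1])  # three dimensions, one level each
ten_dimensions = total_cuboids([4] * 10)   # ten dimensions, four levels each
```

With one level per dimension the formula reduces to 2^n, so three dimensions give 8 cuboids. Ten dimensions with four levels each already give 5^10 = 9,765,625 cuboids, which is why full materialization is rarely practical.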
■ There are three choices for data cube materialization given a base cuboid:
■ 1. No materialization: Do not precompute any of the “nonbase” cuboids.
■ This leads to computing expensive multidimensional aggregates on-the-fly, which
can be extremely slow.
■ 2. Full materialization: Precompute all of the cuboids. The resulting lattice of
computed cuboids is referred to as the full cube.
■ This choice typically requires huge amounts of memory space in order to store all
of the precomputed cuboids.
■ 3. Partial materialization: Selectively compute a proper subset of the whole set
of possible cuboids. Alternatively, we may compute a subset of the cube, which
contains only those cells that satisfy some user-specified criterion, such as where
the tuple count of each cell is above some threshold.
■ Selection of which cuboids to materialize
■ Based on size, sharing, access frequency, etc.
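The second form of partial materialization, keeping only cells that pass a count threshold, can be sketched as follows. The tuples and the threshold are made up, and the example keeps cells whose count is at least `min_count`.

```python
from collections import Counter

# Hypothetical base tuples, reduced to the two dimensions (city, item).
tuples = [("Vancouver", "TV"), ("Vancouver", "TV"), ("Vancouver", "PC"),
          ("Toronto", "TV"), ("Toronto", "TV"), ("Toronto", "TV")]

def cells_above_threshold(tuples, min_count):
    """Keep only the (city, item) cells whose tuple count reaches min_count."""
    counts = Counter(tuples)
    return {cell: n for cell, n in counts.items() if n >= min_count}

kept = cells_above_threshold(tuples, min_count=2)
```

The sparse cell (Vancouver, PC) is dropped, so only the two frequently populated cells would be stored.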
Indexing OLAP Data: Bitmap Index and Join Index

■ To facilitate efficient data accessing, most data warehouse systems
support index structures and materialized views (using cuboids).
■ The bitmap indexing method is popular in OLAP products because it
allows quick searching in data cubes.
■ It is an alternative representation of the record ID (RID) list.
■ In the bitmap index for a given attribute, there is a distinct bit
vector, Bv, for each value v in the attribute’s domain.
■ If a given attribute’s domain consists of n values, then n bits are
needed for each entry in the bitmap index (there are n bit vectors).
■ If the attribute has the value v for a given row in the data table,
then the bit representing that value is set to 1 in the corresponding
row of the bitmap index.
■ All other bits for that row are set to 0.

Indexing OLAP Data: Bitmap Index
■ It is advantageous compared to hash and tree indices.
■ It is especially useful for low-cardinality domains because
comparison, join, and aggregation operations are then reduced to
bit arithmetic, which substantially reduces the processing time.
■ not suitable for high cardinality domains
■ The length of the bit vector: # of records in the base table
■ The i-th bit is set if the i-th row of the base table has the value for
the indexed column

[Figure: a base table, its bitmap index on Region, and its bitmap index on Type]
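A bitmap index can be sketched with Python integers as bit vectors, one per distinct value, where bit i corresponds to row i of the base table. The table contents are invented for illustration; the column names follow the Region and Type indices mentioned above.

```python
# Hypothetical base table; the row position serves as the record ID (RID).
base = [
    {"region": "Asia", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
    {"region": "Asia", "type": "Dealer"},
    {"region": "America", "type": "Retail"},
    {"region": "Europe", "type": "Dealer"},
]

def build_bitmap_index(rows, column):
    """Return {value: int}, where bit i of the int is 1 iff row i has that value."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def rids(bits):
    """Decode a bit vector back into the list of matching row positions."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

region_idx = build_bitmap_index(base, "region")
type_idx = build_bitmap_index(base, "type")

# The selection "region = Europe AND type = Dealer" is one bitwise AND.
europe_dealers = rids(region_idx["Europe"] & type_idx["Dealer"])
# Counting the rows with region = Asia is a population count.
asia_count = bin(region_idx["Asia"]).count("1")
```

Each column needs one vector per distinct value, which is why the approach suits low-cardinality domains and becomes costly for high-cardinality ones.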

Indexing OLAP Data: Join Indices
■ The join indexing method gained popularity from its use in relational
database query processing.
■ Traditional indexing maps the value in a given column to a list of rows
having that value.
■ In contrast, join indexing registers the joinable rows of two relations
from a relational database.
■ For example, if two relations R(RID, A) and S(B, SID) join on the
attributes A and B, then the join index record contains the pair (RID,
SID), where RID and SID are record identifiers from the R and S
relations, respectively.
■ Hence, the join index records can identify joinable tuples without
performing costly join operations.
■ Join indexing is especially useful for maintaining the relationship between
a foreign key and its matching primary keys, from the joinable relation.
■ The star schema model of data warehouses makes join indexing
attractive for crosstable search.
■ Join indices may span multiple dimensions to form composite join
indices.
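A join index for the relations R(RID, A) and S(B, SID) can be sketched as a precomputed list of (RID, SID) pairs. The tuples and attribute values below are made up for the example.

```python
# Hypothetical relations R(RID, A) and S(B, SID), joined on A = B.
R = [("r1", "Main Street"), ("r2", "Oak Avenue"), ("r3", "Main Street")]
S = [("Main Street", "s1"), ("Pine Road", "s2"), ("Oak Avenue", "s3")]

def build_join_index(r, s):
    """Precompute the (RID, SID) pairs of tuples that join on A = B."""
    sids_by_b = {}
    for b, sid in s:
        sids_by_b.setdefault(b, []).append(sid)
    return [(rid, sid) for rid, a in r for sid in sids_by_b.get(a, [])]

join_index = build_join_index(R, S)
```

Once built, the pairs identify joinable tuples directly, so a query touching both relations can look up partners in `join_index` instead of recomputing the join.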
Efficient Processing of OLAP Queries

■ The purpose of materializing cuboids and constructing OLAP index structures
is to speed up query processing in data cubes.
■ Given materialized views, query processing should proceed as follows:
■ Determine which operations should be performed on the available cuboids
■ Transform any selection, projection, drill-down, roll-up operations specified in
the query into corresponding SQL and/or OLAP operations,
■ For example, slicing and dicing a data cube may correspond to selection and/or
projection operations on a materialized cuboid. (dice = selection + projection)
■ Determine which materialized cuboid(s) should be selected for OLAP op.
■ This involves identifying all of the materialized cuboids that may potentially
be used to answer the query,
■ pruning the set using knowledge of “dominance” relationships among the
cuboids,
■ estimating the costs of using the remaining materialized cuboids, and
■ selecting the cuboid with the least cost.
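The selection step can be sketched in a simplified form. This example assumes no concept hierarchies, treats a cuboid as usable when its dimensions cover those of the query, and uses an estimated row count as the cost. The cuboid sizes are invented, and real systems use richer cost models that also account for available indices.

```python
# Hypothetical materialized cuboids: group-by dimensions -> estimated row count.
materialized = {
    ("city", "item", "year"): 1_000_000,
    ("city", "item"): 50_000,
    ("item", "year"): 8_000,
    ("city",): 100,
}

def pick_cuboid(query_dims, materialized):
    """Return the cheapest materialized cuboid able to answer the query.

    A cuboid is usable when it contains every dimension the query groups
    by; finer cuboids can always be aggregated further.
    """
    needed = set(query_dims)
    usable = {dims: size for dims, size in materialized.items()
              if needed <= set(dims)}
    if not usable:
        return None
    return min(usable, key=usable.get)

for_item = pick_cuboid(("item",), materialized)
for_city_year = pick_cuboid(("city", "year"), materialized)
```

A query grouping by item alone is served from the small (item, year) cuboid, while one grouping by city and year has to fall back to the base cuboid.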

OLAP Server Architectures
■ The physical architecture and implementation of OLAP servers must
consider data storage issues.
■ Implementations of a warehouse server for OLAP processing include the
following:
■ Relational OLAP (ROLAP)
■ Use relational or extended-relational DBMS to store and manage
warehouse data and OLAP middleware
■ Include optimization of DBMS backend, implementation of
aggregation navigation logic, and additional tools and services
■ Greater scalability
■ Multidimensional OLAP (MOLAP)
■ Sparse array-based multidimensional storage engine
■ Fast indexing to pre-computed summarized data
■ Hybrid OLAP (HOLAP) (e.g., Microsoft SQL Server)
■ Flexibility, e.g., low level: relational, high-level: array
■ Specialized SQL servers (e.g., Red Brick)
■ Specialized support for SQL queries over star/snowflake schemas