
DMDW Unit-1

Data Warehousing and Business Intelligence (Jawaharlal Nehru Technological University, Hyderabad)

Unit-I
Data Warehousing and Online Analytical Processing: Basic Concepts
What Is a Data Warehouse?
Loosely speaking, a data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases.
According to William H. Inmon, a leading architect in the construction of data warehouse systems, “A
data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in
support of management’s decision making process”.
Subject-oriented: A data warehouse is organized around major subjects such as customer, supplier,
product, and sales. Rather than concentrating on the day-to-day operations and transaction processing
of an organization, a data warehouse focuses on the modeling and analysis of data for decision makers.
Integrated: A data warehouse is usually constructed by integrating multiple heterogeneous sources,
such as relational databases, flat files, and online transaction records. Data cleaning and data
integration techniques are applied to ensure consistency in naming conventions, encoding structures,
attribute measures, and so on.
Time-variant: Data are stored to provide information from an historic perspective (e.g., the past 5–10
years). Every key structure in the data warehouse contains, either implicitly or explicitly, a time
element.
Nonvolatile: A data warehouse is always a physically separate store of data transformed from the
application data found in the operational environment. Due to this separation, a data warehouse does
not require transaction processing, recovery, and concurrency control mechanisms. It usually requires
only two operations in data accessing: initial loading of data and access of data.
Data warehousing is the process of constructing and using data warehouses. The construction of a
data warehouse requires data cleaning, data integration, and data consolidation.
“How are organizations using the information from data warehouses?” Many organizations use this
information to support business decision-making activities, including:
1. Increasing customer focus, which includes the analysis of customer buying patterns (such as buying
preference, buying time, budget cycles, and appetites for spending).
2. Repositioning products and managing product portfolios by comparing the performance of sales by
quarter, by year, and by geographic region in order to fine-tune production strategies.
3. Analyzing operations and looking for sources of profit.
4. Managing customer relationships, making environmental corrections, and managing the cost of
corporate assets.
The traditional database approach to heterogeneous database integration is to build wrappers and
integrators (or mediators) on top of multiple, heterogeneous databases. When a query is posed to a
client site, a metadata dictionary is used to translate the query into queries appropriate for the
individual heterogeneous sites involved. These queries are then mapped and sent to local query
processors. The results returned from the different sites are integrated into a global answer set. This
query-driven approach requires complex information filtering and integration processes, and
competes with local sites for processing resources. It is inefficient and potentially expensive for
frequent queries, especially queries requiring aggregations.
Data warehousing provides an interesting alternative to this traditional approach. Rather than using a
query-driven approach, data warehousing employs an update-driven approach in which information
from multiple, heterogeneous sources is integrated in advance and stored in a warehouse for direct
querying and analysis.

Differences between Operational Database Systems and Data Warehouses

The major task of online operational database systems is to perform online transaction and query
processing. These systems are called online transaction processing (OLTP) systems. They cover most of
the day-to-day operations of an organization, such as purchasing, inventory, manufacturing,
banking, payroll, registration, and accounting. Data warehouse systems, on the other hand, serve users
or knowledge workers in the role of data analysis and decision making. Such systems can organize and
present data in various formats in order to accommodate the diverse needs of different users. These
systems are known as online analytical processing (OLAP) systems.
The major distinguishing features of OLTP and OLAP are summarized as follows:

Users and system orientation: An OLTP system is customer-oriented and is used for transaction and
query processing by clerks, clients, and information technology professionals. An OLAP system is
market-oriented and is used for data analysis by knowledge workers, including managers, executives,
and analysts.

Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used
for decision making. An OLAP system manages large amounts of historic data, provides facilities for
summarization and aggregation, and stores and manages information at different levels of granularity.

Database design: An OLTP system usually adopts an entity-relationship (ER) data model and an
application-oriented database design. An OLAP system typically adopts either a star or a snowflake
model and a subject-oriented database design.

View: An OLTP system focuses mainly on the current data within an enterprise or department,
without referring to historic data or data in different organizations. In contrast, an OLAP system often
spans multiple versions of a database schema, due to the evolutionary process of an organization.

Access patterns: The access patterns of an OLTP system consist mainly of short, atomic transactions.
Such a system requires concurrency control and recovery mechanisms. However, accesses to OLAP
systems are mostly read-only operations (because most data warehouses store historic rather than up-
to-date information), although many could be complex queries.
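The contrast in access patterns can be made concrete with two illustrative SQL sketches. The schemas are assumptions: accounts is a hypothetical operational table, while sales, location, and time follow the AllElectronics star schema used later in this unit (names such as time may need quoting as identifiers in some SQL dialects).

    -- OLTP: a short, atomic transaction that touches one current record
    UPDATE accounts
    SET balance = balance - 100.00
    WHERE account_id = 42;

    -- OLAP: a read-only aggregation over large volumes of historic data
    SELECT l.country, t.year, SUM(s.dollars_sold) AS total_sales
    FROM sales s
    JOIN location l ON s.location_key = l.location_key
    JOIN time t ON s.time_key = t.time_key
    GROUP BY l.country, t.year;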

But, Why Have a Separate Data Warehouse?
“Why not perform online analytical processing directly on such databases instead of spending
additional time and resources to construct a separate data warehouse?”
A major reason for such a separation is to help promote the high performance of both systems. An
operational database is designed and tuned from known tasks and workloads like indexing and hashing
using primary keys, searching for particular records, and optimizing “canned” queries. On the other
hand, data warehouse queries are often complex. They involve the computation of large data groups at
summarized levels, and may require the use of special data organization, access, and implementation
methods based on multidimensional views. Processing OLAP queries in operational databases would
substantially degrade the performance of operational tasks.
Moreover, an operational database supports the concurrent processing of multiple transactions.
Concurrency control and recovery mechanisms (e.g., locking and logging) are required to ensure the
consistency and robustness of transactions. An OLAP query often needs read-only access of data
records for summarization and aggregation. Decision support requires historic data, whereas
operational databases do not typically maintain historic data.
In this context, the data in operational databases, though abundant, are usually far from complete for
decision making. Decision support requires consolidation (e.g., aggregation and summarization) of
data from heterogeneous sources, resulting in high-quality, clean, integrated data. In contrast,
operational databases contain only detailed raw data, such as transactions, which need to be
consolidated before analysis.

Data Warehousing: A Multitiered Architecture

Data warehouses often adopt a three-tier architecture:
1. The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from operational

databases or other external sources. These tools and utilities perform data extraction, cleaning, and
transformation (e.g., to merge similar data from different sources into a unified format), as well as load
and refresh functions to update the data warehouse. The data are extracted using application program
interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client
programs to generate SQL code to be executed at a server. Examples of gateways include ODBC
(Open Database Connectivity) and OLE DB (Object Linking and Embedding, Database) by Microsoft,
and JDBC (Java Database Connectivity). This tier also contains a metadata repository, which stores
information about the data warehouse and its contents.
2. The middle tier is an OLAP server that is typically implemented using either (1) a
relational OLAP (ROLAP) model (i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations); or (2) a multidimensional OLAP (MOLAP)
model (i.e., a special-purpose server that directly implements multidimensional data and operations).
3. The top tier is a front-end client layer, which contains query and reporting tools, analysis tools,
and/or data mining tools (e.g., trend analysis, prediction, and so on).

Data Warehouse Models: Enterprise Warehouse, Data Mart and Virtual Warehouse

From the architecture point of view, there are three data warehouse models: the enterprise warehouse,
the data mart, and the virtual warehouse.
Enterprise warehouse: An enterprise warehouse collects all of the information about subjects
spanning the entire organization. It provides corporate-wide data integration, usually from one or more
operational systems or external information providers, and is cross-functional in scope. It typically
contains detailed data as well as summarized data, and can range in size from a few gigabytes to
hundreds of gigabytes, terabytes, or beyond. An enterprise data warehouse may be implemented on
traditional mainframes, computer superservers, or parallel architecture platforms. It requires extensive
business modeling and may take years to design and build.

Data mart: A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data mart may
confine its subjects to customer, item, and sales. The data contained in data marts tend to be
summarized.
Data marts are usually implemented on low-cost departmental servers that are Unix/Linux or
Windows based. The implementation cycle of a data mart is more likely to be measured in weeks
rather than months or years. However, it may involve complex integration in the long run if its design
and planning were not enterprise-wide.
Data marts are of two types:
1. Independent data mart
2. Dependent data mart
1. Independent data marts are sourced from data captured from one or more operational systems
or external information providers, or from data generated locally within a particular department
or geographic area.
2. Dependent data marts are sourced directly from enterprise data warehouses.
Virtual warehouse: A virtual warehouse is a set of views over operational databases. For efficient
query processing, only some of the possible summary views may be materialized. A virtual warehouse
is easy to build but requires excess capacity on operational database servers.
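Since a virtual warehouse is just a set of views, a minimal sketch is a single CREATE VIEW over an operational table; the order_lines table and its columns here are hypothetical:

    -- A summary view defined directly over an operational table;
    -- it adds no storage unless it is materialized
    CREATE VIEW sales_summary AS
    SELECT product_id, location_id, SUM(amount) AS total_amount
    FROM order_lines
    GROUP BY product_id, location_id;

Every query against such a view runs on the operational server itself, which is why excess capacity is required there.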
“What are the pros and cons of the top-down and bottom-up approaches to data warehouse
development?”

Top-down development of an enterprise warehouse
Pros:
1. It minimizes integration problems.
Cons:
1. It is expensive.
2. It takes a long time to develop.
3. It lacks flexibility.

Bottom-up approach
Pros:
1. Flexibility, low cost, and rapid return on investment.
Cons:
1. It can lead to problems when integrating various disparate data marts into a consistent enterprise
data warehouse.

A recommended method for the development of data warehouse systems is to implement the
warehouse in an incremental and evolutionary manner. First, a high-level corporate data model is
defined within a reasonably short period (such as one or two months) that provides a corporate-wide,
consistent, integrated view of data among different subjects and potential usages. Second, independent
data marts can be implemented in parallel with the enterprise warehouse based on the same corporate
data model set noted before. Third, distributed data marts can be constructed to integrate different data
marts via hub servers. Finally, a multitier data warehouse is constructed where the enterprise
warehouse is the sole custodian of all warehouse data, which is then distributed to the various
dependent data marts.

Extraction, Transformation, and Loading


Data warehouse systems use back-end tools and utilities to populate and refresh their data. These tools
and utilities include the following functions:
Data extraction, which typically gathers data from multiple, heterogeneous, and external sources.
Data cleaning, which detects errors in the data and rectifies them when possible.
Data transformation, which converts data from legacy or host format to warehouse format.
Load, which sorts, summarizes, consolidates, computes views, checks integrity, and builds indices and
partitions.
Refresh, which propagates the updates from the data sources to the warehouse.
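As a hedged sketch of the transformation and load functions, assume a staging table staging_sales (hypothetical) has been populated by extraction and feeds the sales fact table of the star schema described later in this unit:

    -- Transform staged operational rows into the warehouse format and
    -- load them, summarizing to the fact table's level of granularity
    INSERT INTO sales (time_key, item_key, branch_key, location_key,
                       dollars_sold, units_sold)
    SELECT time_key, item_key, branch_key, location_key,
           SUM(sale_amount), SUM(quantity)
    FROM staging_sales
    GROUP BY time_key, item_key, branch_key, location_key;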

Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given warehouse.

Additional metadata are created and captured for timestamping any extracted data, the source of the
extracted data, and missing fields that have been added by data cleaning or integration processes.
A metadata repository should contain the following:
- A description of the data warehouse structure, which includes the warehouse schema, views,
dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.
- Operational metadata, which include data lineage (history of migrated data and the sequence of
transformations applied to it), currency of data (active, archived, or purged), and monitoring
information (warehouse usage statistics, error reports, and audit trails).
- The algorithms used for summarization, which include measure and dimension definition
algorithms, data on granularity, partitions, subject areas, aggregation, summarization, and
predefined queries and reports.
- Mapping from the operational environment to the data warehouse, which includes source
databases and their contents, gateway descriptions, data partitions, data extraction, cleaning, and
transformation rules and defaults, data refresh and purging rules, and security (user authorization
and access control).
- Data related to system performance, which include indices and profiles that improve data access
and retrieval performance, in addition to rules for the timing and scheduling of refresh, update,
and replication cycles.
- Business metadata, which include business terms and definitions, data ownership information,
and charging policies.
A data warehouse contains different levels of summarization, of which metadata is one. Other types
include current detailed data (which are almost always on disk), older detailed data (which are usually
on tertiary storage), lightly summarized data, and highly summarized data (which may or may not be
physically housed).

Data Warehouse Modeling: Data Cube and OLAP

Data warehouses and OLAP tools are based on a multidimensional data model. This model views
data in the form of a data cube.

Data Cube: A Multidimensional Data Model

“What is a data cube?” A data cube allows data to be modeled and viewed in multiple dimensions. It
is defined by dimensions and facts.
In general terms, dimensions are the perspectives or entities with respect to which an organization
wants to keep records. For example, AllElectronics may create a sales data warehouse in order to keep
records of the store’s sales with respect to the dimensions time, item, branch, and location. Each
dimension may have a table associated with it, called a dimension table, which further describes the
dimension. For example, a dimension table for item may contain the attributes item name, brand, and
type.
A multidimensional data model is typically organized around a central theme, such as sales. This
theme is represented by a fact table. Facts are numeric measures. Examples of facts for a sales data
warehouse include dollars_sold (sales amount in dollars), units_sold (number of units sold), and
amount_budgeted. The fact table contains the names of the facts, or measures, as well as keys to each
of the related dimension tables.
In 2-D representation, the sales for Vancouver are shown with respect to the time dimension
(organized in quarters) and the item dimension (organized according to the types of items sold). The
fact or measure displayed is dollars_sold (in thousands). Now, suppose that we would like to view the
sales data with a third dimension. For instance, suppose we would like to view the data according to

time and item, as well as location, for the cities Chicago, New York, Toronto, and Vancouver.
Suppose that we would now like to view our sales data with an additional fourth dimension such as
supplier. Viewing things in 4-D becomes tricky. However, we can think of a 4-D cube as being a
series of 3-D cubes.

A data cube such as each of the above is often referred to as a cuboid. Given a set of dimensions, we
can generate a cuboid for each of the possible subsets of the given dimensions. The result would form
a lattice of cuboids, each showing the data at a different level of summarization, or group-by. The
lattice of cuboids is then referred to as a data cube.
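The size of this lattice follows directly. With n dimensions and no concept hierarchies, there is one cuboid per subset of the dimensions, or 2^n cuboids in total; for the four dimensions here, 2^4 = 16. If dimension i is instead associated with a concept hierarchy of L_i levels, each dimension can appear at any of its levels or be rolled up to all, so the total number of cuboids is

    T = (L_1 + 1) x (L_2 + 1) x ... x (L_n + 1).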

[Figure: lattice of cuboids formed for the dimensions time, item, location, and supplier]
The cuboid that holds the lowest level of summarization is called the base cuboid. For example, the
4-D cuboid in Figure 4.4 is the base cuboid for the given time, item, location, and supplier dimensions.
Figure 4.3 is a 3-D (nonbase) cuboid for time, item, and location, summarized for all suppliers. The
0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. In our example,
this is the total sales, or dollars_sold, summarized over all four dimensions. The apex cuboid is
typically denoted by all.

Stars, Snowflakes, and Fact Constellations: Schemas for Multidimensional Data Models
The entity-relationship data model is commonly used in the design of relational databases. A data
warehouse, however, requires a concise, subject-oriented schema that facilitates online data analysis.
The most popular data model for a data warehouse is a multidimensional model, which can exist in
the form of a star schema, a snowflake schema, or a fact constellation schema.

Star schema: The most common modeling paradigm is the star schema, in which the data warehouse
contains (1) a large central table (fact table) containing the bulk of the data, with no redundancy, and
(2) a set of smaller attendant tables (dimension tables), one for each dimension.
Example 4.1 Star schema. A star schema for AllElectronics sales is shown in Figure 4.6. Sales are
considered

along four dimensions: time, item, branch, and location. The schema contains a central fact table for
sales that contains keys to each of the four dimensions, along with two measures: dollars_sold and
units_sold. To minimize the size of the fact table, dimension identifiers (e.g., time_key and item_key) are
system-generated identifiers.
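A minimal DDL sketch of this star schema follows. The column lists come from the attributes named in the text; the data types and the attributes of the time and branch dimensions are assumptions (reserved words such as time may need quoting in some dialects):

    CREATE TABLE time (
      time_key INT PRIMARY KEY,              -- system-generated identifier
      day INT, month INT, quarter VARCHAR(2), year INT
    );
    CREATE TABLE item (
      item_key INT PRIMARY KEY,
      item_name VARCHAR(100), brand VARCHAR(50), type VARCHAR(50)
    );
    CREATE TABLE branch (
      branch_key INT PRIMARY KEY,
      branch_name VARCHAR(50), branch_type VARCHAR(50)
    );
    CREATE TABLE location (
      location_key INT PRIMARY KEY,
      street VARCHAR(100), city VARCHAR(50),
      province_or_state VARCHAR(50), country VARCHAR(50)
    );
    -- Central fact table: a key to each dimension plus the two measures
    CREATE TABLE sales (
      time_key     INT REFERENCES time(time_key),
      item_key     INT REFERENCES item(item_key),
      branch_key   INT REFERENCES branch(branch_key),
      location_key INT REFERENCES location(location_key),
      dollars_sold DECIMAL(12,2),
      units_sold   INT
    );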

Notice that in the star schema, each dimension is represented by only one table, and each table
contains a set of attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may introduce some
redundancy. For example, “Urbana” and “Chicago” are both cities in the state of Illinois, USA. Entries
for such cities in the location dimension table will create redundancy among the attributes
province_or_state and country; that is, (..., Urbana, IL, USA) and (..., Chicago, IL, USA). Moreover,
the attributes within a dimension table may form either a hierarchy (total order) or a lattice (partial
order).

Snowflake schema: The snowflake schema is a variant of the star schema model, where some
dimension tables are normalized, thereby further splitting the data into additional tables.
The major difference between the snowflake and star schema models is that the dimension tables of
the snowflake model may be kept in normalized form to reduce redundancies. Such tables are easy to
maintain and save storage space.
Disadvantages
1. The snowflake structure can reduce the effectiveness of browsing.
2. More joins are needed to execute a query, so system performance may be adversely impacted.
Hence, although the snowflake schema reduces redundancy, it is not as popular as the star schema in
data warehouse design.

The main difference between the two schemas is in the definition of dimension tables. The single
dimension table for item in the star schema is normalized in the snowflake schema, resulting in new
item and supplier tables. For example, the item dimension table now contains the attributes item_key,
item_name, brand, type, and supplier_key, where supplier_key is linked to the supplier dimension
table, containing supplier_key and supplier_type information. Similarly, the single dimension table for
location in the star schema can be normalized into two new tables: location and city. The city_key in
the new location table links to the city dimension.
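A hedged DDL sketch of the normalized item dimension (data types assumed):

    -- Snowflake: supplier attributes are split out of item into their
    -- own table, reached through supplier_key
    CREATE TABLE supplier (
      supplier_key  INT PRIMARY KEY,
      supplier_type VARCHAR(50)
    );
    CREATE TABLE item (
      item_key     INT PRIMARY KEY,
      item_name    VARCHAR(100),
      brand        VARCHAR(50),
      type         VARCHAR(50),
      supplier_key INT REFERENCES supplier(supplier_key)
    );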
Fact constellation: Sophisticated applications may require multiple fact tables to share dimension
tables. This kind of schema can be viewed as a collection of stars, and hence is called a galaxy
schema or a fact constellation.
Example 4.3 Fact constellation. A fact constellation schema is shown in Figure 4.8. This schema
specifies two fact tables, sales and shipping. The sales table definition is identical to that of the star
schema (Figure 4.6). The shipping table has five dimensions, or keys—item_key, time_key,
shipper_key, from_location, and to_location—and two measures—dollars_cost and units_shipped. A
fact constellation schema allows dimension tables to be shared between fact tables. For example, the
dimension tables for time, item, and location are shared between the sales and shipping fact tables.
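The sharing can be made concrete in DDL: the second fact table references the same dimension tables as sales, with location playing two roles (the shipper dimension table and all data types are assumptions):

    -- Second fact table in the constellation, sharing time, item, and
    -- location with the sales fact table defined earlier
    CREATE TABLE shipping (
      item_key      INT REFERENCES item(item_key),
      time_key      INT REFERENCES time(time_key),
      shipper_key   INT REFERENCES shipper(shipper_key),
      from_location INT REFERENCES location(location_key),
      to_location   INT REFERENCES location(location_key),
      dollars_cost  DECIMAL(12,2),
      units_shipped INT
    );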

In data warehousing, there is a distinction between a data warehouse and a data mart. A data
warehouse collects information about subjects that span the entire organization, such as customers,

items, sales, assets, and personnel, and thus its scope is enterprise-wide. For data warehouses, the fact
constellation schema is commonly used, since it can model multiple, interrelated subjects. A data
mart, on the other hand, is a department subset of the data warehouse that focuses on selected
subjects, and thus its scope is departmentwide. For data marts, the star or snowflake schema is
commonly used.

Dimensions: The Role of Concept Hierarchies


A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level,
more general concepts. Consider a concept hierarchy for the dimension location. City values for
location include Vancouver, Toronto, New York, and Chicago. Each city, however, can be mapped to
the province or state to which it belongs. For example, Vancouver can be mapped to British Columbia,
and Chicago to Illinois.
The provinces and states can in turn be mapped to the country (e.g., Canada or the United States) to
which they belong. These mappings form a concept hierarchy for the dimension location, mapping a
set of low-level concepts (i.e., cities) to higher-level, more general concepts (i.e., countries).
Many concept hierarchies are implicit within the database schema. For example, suppose that the
dimension location is described by the attributes number, street, city, province_or_state, zip_code, and
country. These attributes are related by a total order, forming a concept hierarchy such as “street < city
< province or state < country.” This hierarchy is shown in Figure 4.10(a). Alternatively, the attributes
of a dimension may
be organized in a partial order, forming a lattice. An example of a partial order for the time dimension
based on the attributes day, week, month, quarter, and year is “day < {month < quarter; week} <
year.” This lattice structure is shown in Figure 4.10(b). A concept hierarchy that is a total or partial
order among attributes in a database schema is called a schema hierarchy. Concept hierarchies that
are common to many applications (e.g., for time) may be predefined in the data mining system.
Concept hierarchies may also be defined by discretizing or grouping values for a given dimension or
attribute, resulting in a set-grouping hierarchy. A total or partial order can be defined among groups
of values. An example of a set-grouping hierarchy is shown in Figure 4.11 for the dimension price,
where an interval ($X...$Y] denotes the range from $X (exclusive) to $Y (inclusive).
There may be more than one concept hierarchy for a given attribute or dimension, based on different
user viewpoints. For instance, a user may prefer to organize price by defining ranges for inexpensive,
moderately priced, and expensive.
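In SQL, such a set-grouping hierarchy can be sketched with a CASE expression; the range boundaries and the sales_detail table (assumed to carry an item price with each sale) are illustrative assumptions:

    -- Map detailed prices into user-defined ranges, then aggregate
    SELECT price_range, SUM(dollars_sold) AS total_sales
    FROM (
      SELECT CASE
               WHEN price <= 100 THEN 'inexpensive'
               WHEN price <= 500 THEN 'moderately priced'
               ELSE 'expensive'
             END AS price_range,
             dollars_sold
      FROM sales_detail
    ) AS graded
    GROUP BY price_range;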
Concept hierarchies may be provided manually by system users, domain experts, or knowledge
engineers, or may be automatically generated based on statistical analysis of the data distribution.
Measures: Their Categorization and Computation
“How are measures computed?”
A data cube measure is a numeric function that can be evaluated at each point in the data cube space.
A measure value is computed for a given point by aggregating the data corresponding to the respective
dimension–value pairs defining the given point.
For example, <time = “Q1”, location = “Vancouver”, item = “computer”> is one such point, defined by
the set of dimension–value pairs time = “Q1”, location = “Vancouver”, and item = “computer”.
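In relational terms, evaluating dollars_sold at that point amounts to an aggregation restricted by those dimension–value pairs; a hedged sketch against the star schema of this unit:

    SELECT SUM(s.dollars_sold) AS dollars_sold
    FROM sales s
    JOIN time t     ON s.time_key = t.time_key
    JOIN location l ON s.location_key = l.location_key
    JOIN item i     ON s.item_key = i.item_key
    WHERE t.quarter = 'Q1'
      AND l.city = 'Vancouver'
      AND i.type = 'computer';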

Measures can be organized into three categories—distributive, algebraic, and holistic—based on the
kind of aggregate functions used.

A measure is distributive if it is obtained by applying a distributive aggregate function. Distributive
measures can be computed efficiently because of the way the computation can be partitioned. Suppose
the data are partitioned into n sets. We apply the function to each partition, resulting in n aggregate
values. If the result derived by applying the function to the n aggregate values is the same as that
derived by applying the function to the entire data set (without partitioning), the function can be
computed in a distributed manner. For example, sum() can be computed for a data cube by first
partitioning the cube into a set of subcubes, computing sum() for each subcube, and then summing up
the partial sums obtained for each subcube. Hence, sum() is a distributive aggregate function. For the
same reason, count(), min(), and max() are distributive aggregate functions.
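The partitioned computation of sum() can be sketched directly in SQL; partitioning by location_key is an arbitrary illustrative choice:

    -- Inner query: one partial sum per partition (subcube)
    -- Outer query: the same function applied to the partial results,
    -- which equals SUM over the unpartitioned table
    SELECT SUM(partial_sum) AS grand_total
    FROM (
      SELECT location_key, SUM(dollars_sold) AS partial_sum
      FROM sales
      GROUP BY location_key
    ) AS per_partition;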

A measure is algebraic if it is obtained by applying an algebraic aggregate function. It can be
computed by an algebraic function with M arguments (where M is a bounded positive integer), each of
which is obtained by applying a distributive aggregate function. For example, avg() (average) can be

computed by sum()/count(), where both sum() and count() are distributive aggregate functions.
Similarly, it can be shown that min_N() and max_N() (which find the N minimum and N maximum
values, respectively, in a given set) and standard_deviation() are algebraic aggregate functions.

A measure is holistic if it is obtained by applying a holistic aggregate function.


An aggregate function is holistic if there is no constant bound on the storage size needed to describe a
subaggregate. That is, there does not exist an algebraic function with M arguments (where M is a
constant) that characterizes the computation. Common examples of holistic functions include
median(), mode(), and rank().

Most large data cube applications require efficient computation of distributive and algebraic measures.
Many efficient techniques for this exist. In contrast, it is difficult to compute holistic measures
efficiently. Efficient techniques to approximate the computation of some holistic measures, however,
do exist.

Typical OLAP Operations


“How are concept hierarchies useful in OLAP?” In the multidimensional model, data are organized
into multiple dimensions, and each dimension contains multiple levels of abstraction defined by
concept hierarchies. This organization provides users with the flexibility to view data from different
perspectives.

Roll-up: The roll-up operation (also called the drill-up operation by some vendors) performs
aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or by dimension
reduction.
Consider the location hierarchy defined earlier as the total order “street < city < province_or_state <
country.” A roll-up along this hierarchy aggregates the data by ascending from the level of city to the
level of country.
When roll-up is performed by dimension reduction, one or more dimensions are removed from the
given cube. For example, consider a sales data cube containing only the location and time dimensions.
Roll-up may be performed by removing, say, the time dimension, resulting in an aggregation of the
total sales by location, rather than by location and by time.
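In a ROLAP server, climbing the location hierarchy translates into grouping by a coarser attribute; a hedged sketch against the star schema of this unit (drill-down, described next, is simply the reverse: grouping by the finer attribute again):

    -- Before roll-up: sales at the city level
    SELECT l.city, SUM(s.dollars_sold) AS total_sales
    FROM sales s JOIN location l ON s.location_key = l.location_key
    GROUP BY l.city;

    -- After roll-up: ascend the hierarchy from city to country
    SELECT l.country, SUM(s.dollars_sold) AS total_sales
    FROM sales s JOIN location l ON s.location_key = l.location_key
    GROUP BY l.country;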

Drill-down: Drill-down is the reverse of roll-up. It navigates from less detailed data to more detailed
data. Drill-down can be realized by either stepping down a concept hierarchy for a dimension or
introducing additional dimensions. Consider a concept hierarchy for time defined as “day < month <
quarter < year.” Drill-down occurs by descending the time hierarchy from the level of quarter to the
more detailed level of month. Because a drill-down adds more detail to the given data, it can also be
performed by adding new dimensions to a cube.

Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting
in a subcube. Figure 4.12 shows a slice operation where the sales data are selected from the central
cube for the dimension time using the criterion time = “Q1.” The dice operation defines a subcube by
performing a selection on two or more dimensions. Figure 4.12 shows a dice operation on the central
cube based on the following selection criteria that involve three dimensions: (location = “Toronto” or
“Vancouver”) and (time = “Q1” or “Q2”) and (item = “home entertainment” or “computer”).
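As noted later for ROLAP servers, each slice or dice is equivalent to adding selection predicates to the SQL; the dice above could be sketched as:

    SELECT t.quarter, l.city, i.type, SUM(s.dollars_sold) AS total_sales
    FROM sales s
    JOIN time t     ON s.time_key = t.time_key
    JOIN location l ON s.location_key = l.location_key
    JOIN item i     ON s.item_key = i.item_key
    WHERE l.city IN ('Toronto', 'Vancouver')             -- location dice
      AND t.quarter IN ('Q1', 'Q2')                      -- time dice
      AND i.type IN ('home entertainment', 'computer')   -- item dice
    GROUP BY t.quarter, l.city, i.type;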

Pivot (rotate): Pivot (also called rotate) is a visualization operation that rotates the data axes in view
to provide an alternative data presentation. Figure 4.12 shows a pivot operation where the item and
location axes in a 2-D slice are rotated.

Other OLAP operations: Some OLAP systems offer additional drilling operations. For example,
drill-across executes queries involving (i.e., across) more than one fact table. The drill-through
operation uses relational SQL facilities to drill through the bottom level of a data cube down to its
back-end relational tables.
Other OLAP operations may include ranking the top N or bottom N items in lists, as well as
computing moving averages, growth rates, interest, internal rates of return, depreciation, currency
conversions, and statistical functions.

OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios,
variance, and so on, and for computing measures across multiple dimensions. OLAP also supports
functional models for forecasting, trend analysis, and statistical analysis.

OLAP SERVERS

Three main types of OLAP Servers

ROLAP stands for Relational OLAP, an application based on relational DBMSs.


MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.

Relational OLAP (ROLAP) Server

ROLAP servers are the intermediate servers that stand between a relational back-end server and client
front-end tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and
OLAP middleware to provide the missing pieces.
ROLAP servers contain optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP systems work primarily from the data that resides in a relational database, where the base data and
dimension tables are stored as relational tables. This model permits the multidimensional analysis of data.
This technique relies on manipulating the data stored in the relational database to give the appearance of
traditional OLAP's slicing and dicing functionality. In essence, each method of slicing and dicing is equivalent to
adding a “WHERE” clause in the SQL statement.

Relational OLAP Architecture

ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method
allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding the need to
structure records around the desired view.

Some products in this segment support robust SQL engines to handle the complexity of multidimensional
analysis. This includes creating multiple SQL statements to handle user requests, being RDBMS-aware, and being
capable of generating SQL statements tuned to the optimizer of the DBMS engine.

Advantages

Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data
size of the underlying RDBMS, so ROLAP itself does not restrict the amount of data.

Can leverage RDBMS features: An RDBMS already comes with a lot of features, and ROLAP technologies, which
work on top of the RDBMS, can take advantage of these functionalities.

Disadvantages

Performance can be slow: Because each ROLAP report is essentially a SQL query (or multiple SQL queries)
against the relational database, query time can be long if the underlying data size is large.

Limited by SQL functionalities: ROLAP technology relies on generating SQL statements to query the relational
database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server


A MOLAP system is based on a native logical model that directly supports multidimensional data and operations.
Data are stored physically in multidimensional arrays, and positional techniques are used to access them.

One of the significant distinctions of MOLAP from ROLAP is that data are summarized and stored in an
optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data are
structured according to the client's reporting requirements, with the calculations pre-generated on the cubes.

MOLAP Architecture

MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.

A MOLAP structure primarily reads precompiled data. It has limited capability to dynamically create
aggregations or to evaluate results that have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited for MOLAP
technology (e.g., financial analysis and budgeting).
Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's LightShip Server, Sinper's
TM/1, Planning Sciences' Gentium, and Kenan Technology's Multiway.
Some of the problems faced by clients relate to maintaining support for multiple subject areas in an RDBMS.
Some vendors solve these problems by providing access from MOLAP tools to detailed data in an RDBMS.

This can be very useful for organizations with performance-sensitive multidimensional analysis requirements
that have built, or are in the process of building, a data warehouse architecture containing multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales region) to
be stored and maintained in a persistent structure. This structure would be provided to reduce the application
overhead of performing calculations and building aggregation during initialization. These structures can be
automatically refreshed at predetermined intervals established by an administrator.

Advantages
1. Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for slicing and
dicing operations.
2. Can perform complex calculations: All calculations are pre-generated when the cube is created. Hence,
complex calculations are not only possible, they also return quickly.
Disadvantages
1. Limited in the amount of information it can handle: Because all calculations are performed when the cube is
built, it is not possible to contain a large amount of data in the cube itself.
2. Requires additional investment: Cube technology is generally proprietary and does not already exist in the
organization. Therefore, adopting MOLAP technology is likely to require additional investments in human and
capital resources.

Hybrid OLAP (HOLAP) Server
HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems keep the
more substantial quantities of detailed data in relational tables, while the aggregations are stored in
pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for detailed data.
Microsoft SQL Server 2000, for example, provides a hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it stores only the aggregate information on the OLAP
server while the detailed data remain in the relational database, so no duplicate copy of the detailed data
is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.

Reference: Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Elsevier.