Data Mining Notes UNIT II

UNIT II

Data Warehouse:

A Data Warehouse (DW) is a relational database that is designed for query and
analysis rather than transaction processing. It includes historical data derived from
transaction data from single or multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses
on providing support for decision-makers for data modeling and analysis.

A Data Warehouse environment contains an extraction, transformation, and loading
(ETL) solution, an online analytical processing (OLAP) engine, client analysis tools, and
other applications that handle the process of gathering information and delivering it to
business users.

Data Warehouse Features

The key features of a data warehouse are discussed below −

 Subject Oriented − A data warehouse is subject oriented because it provides
information around a subject rather than the organization's ongoing operations.
These subjects can be product, customers, suppliers, sales, revenue, etc. A data
warehouse does not focus on the ongoing operations; rather, it focuses on
modeling and analysis of data for decision making.
 Integrated − A data warehouse is constructed by integrating data from
heterogeneous sources such as relational databases, flat files, etc. This integration
enhances the effective analysis of data.
 Time Variant − The data collected in a data warehouse is identified with a
particular time period. The data in a data warehouse provides information from a
historical point of view.
 Non-volatile − Non-volatile means that previous data is not erased when new data
is added. A data warehouse is kept separate from the operational database, and
therefore frequent changes in the operational database are not reflected in the data
warehouse.
Comparison between OLTP and OLAP systems:

Sr.No. | Data Warehouse (OLAP) | Operational Database (OLTP)
1 | It involves historical processing of information. | It involves day-to-day processing.
2 | OLAP systems are used by knowledge workers such as executives, managers, and analysts. | OLTP systems are used by clerks, DBAs, or database professionals.
3 | It is used to analyze the business. | It is used to run the business.
4 | It focuses on information out. | It focuses on data in.
5 | It is based on the Star Schema, Snowflake Schema, and Fact Constellation Schema. | It is based on the Entity-Relationship Model.
6 | It is subject oriented. | It is application oriented.
7 | It contains historical data. | It contains current data.
8 | It provides summarized and consolidated data. | It provides primitive and highly detailed data.
9 | It provides a summarized and multidimensional view of data. | It provides a detailed and flat relational view of data.
10 | The number of users is in hundreds. | The number of users is in thousands.
11 | The number of records accessed is in millions. | The number of records accessed is in tens.
12 | The database size is from 100 GB to 100 TB. | The database size is from 100 MB to 100 GB.
13 | These are highly flexible. | It provides high performance.

Multidimensional Data Model


A multidimensional data model stores data in the form of a data cube. Mostly, data
warehousing supports two- or three-dimensional cubes, although a cube may have many
more dimensions.

A data cube allows data to be viewed in multiple dimensions. Dimensions are
entities with respect to which an organization wants to keep records. For example, in a
store's sales records, dimensions allow the store to keep track of things like monthly sales
of items across branches and locations. A multidimensional database helps to provide
data-related answers to complex business queries quickly and accurately.

Data warehouses and Online Analytical Processing (OLAP) tools are based on a
multidimensional data model. OLAP in data warehousing enables users to view data from
different angles and dimensions.

Schemas for multidimensional Databases:


A schema is a logical description of the entire database. It includes the name and
description of records of all record types, including all associated data items and
aggregates. Much like a database, a data warehouse also needs to maintain a schema. A
database uses the relational model, while a data warehouse uses the Star, Snowflake, or
Fact Constellation schema. This section discusses the schemas used in a data
warehouse.

Star Schema:

 Each dimension in a star schema is represented with only one dimension table.
 This dimension table contains the set of attributes.
 The following diagram shows the sales data of a company with respect to the four
dimensions, namely time, item, branch, and location.
 There is a fact table at the center. It contains the keys to each of the four dimensions.
 The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of
attributes. For example, the location dimension table contains the attribute set
{location_key, street, city, province_or_state, country}. This constraint may cause data
redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian
province of British Columbia. The entries for such cities may cause data redundancy along
the attributes province_or_state and country.

Snowflake Schema:

 Some dimension tables in the Snowflake schema are normalized.
 The normalization splits up the data into additional tables.
 Unlike the Star schema, the dimension tables in a Snowflake schema are normalized.
For example, the item dimension table in the star schema is normalized and split
into two dimension tables, namely the item and supplier tables.
 Now the item dimension table contains the attributes item_key, item_name, type,
brand, and supplier_key.
 The supplier_key is linked to the supplier dimension table. The supplier dimension
table contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the Snowflake schema, redundancy is reduced;
therefore, it becomes easier to maintain and saves storage space.

Fact Constellation Schema:

 A fact constellation has multiple fact tables. It is also known as a galaxy schema.
 The following diagram shows two fact tables, namely sales and shipping.
 The sales fact table is the same as that in the star schema.
 The shipping fact table has five dimensions, namely item_key, time_key,
shipper_key, from_location, and to_location.
 The shipping fact table also contains two measures, namely dollars cost and units
shipped.
 It is also possible to share dimension tables between fact tables. For example, the
time, item, and location dimension tables are shared between the sales and
shipping fact tables.

Schema Definition:

A multidimensional schema is defined using Data Mining Query Language (DMQL).
The two primitives, cube definition and dimension definition, can be used for defining
data warehouses and data marts.

Syntax for Cube Definition:

define cube <cube_name> [<dimension_list>]: <measure_list>

Syntax for Dimension Definition:

define dimension <dimension_name> as (<attribute_or_dimension_list>)

Star Schema Definition:


The star schema that we have discussed can be defined using Data Mining Query Language
(DMQL) as follows −

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

Snowflake Schema Definition:


Snowflake schema can be defined using DMQL as follows −

define cube sales_snowflake [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier (supplier_key,
supplier_type))
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city (city_key, city, province_or_state,
country))

Fact Constellation Schema Definition:


Fact constellation schema can be defined using DMQL as follows −

define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars), units_sold = count(*)

define dimension time as (time_key, day, day_of_week, month, quarter, year)
define dimension item as (item_key, item_name, brand, type, supplier_type)
define dimension branch as (branch_key, branch_name, branch_type)
define dimension location as (location_key, street, city, province_or_state, country)

define cube shipping [time, item, shipper, from_location, to_location]:
dollars_cost = sum(cost_in_dollars), units_shipped = count(*)

define dimension time as time in cube sales
define dimension item as item in cube sales
define dimension shipper as (shipper_key, shipper_name, location as location in cube sales,
shipper_type)
define dimension from_location as location in cube sales
define dimension to_location as location in cube sales

Categorization of Measures - distributive, algebraic, holistic


Measures can be organized into three categories based on the kind of aggregate functions
used:

 distributive,
 algebraic,
 holistic.
Distributive.
An aggregate function is distributive if it can be computed in a distributed manner.
Suppose the data are partitioned into n sets. We apply the function to each partition,
resulting in n aggregate values. If the result derived by applying the function to
the n aggregate values is the same as that derived by applying the function to the entire
data set (without partitioning), the function can be computed in a distributed manner.
For example, count() can be computed for a data cube by first partitioning the cube
into a set of subcubes, computing count() for each subcube, and then summing up the
counts obtained for each subcube. Hence, count() is a distributive
aggregate function. For the same reason, sum(), min(), and max() are distributive aggregate
functions.
A measure is distributive if it is obtained by applying a distributive aggregate
function. Distributive measures can be computed efficiently because they can be computed
in a distributive manner.

Algebraic.
An aggregate function is algebraic if it can be computed by an algebraic function
with m arguments (where m is a bounded positive integer), each of which is obtained by
applying a distributive aggregate function.

For example, avg() (average) can be computed by sum()/count(), where both sum()
and count() are distributive aggregate functions. Similarly, it can be shown that min_N()
and max_N() (which find the N minimum and N maximum values, respectively, in a given
set) and standard_deviation() are algebraic aggregate functions.

A measure is algebraic if it is obtained by applying an algebraic aggregate function.

Holistic.
An aggregate function is holistic if there is no constant bound on the storage size
needed to describe a subaggregate. That is, there does not exist an algebraic function
with m arguments (where m is a constant) that characterizes the computation.

Common examples of holistic functions include median(), mode(), and rank().

A measure is holistic if it is obtained by applying a holistic aggregate function.
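
The distinction can be illustrated with a small Python sketch (the data and the
two-way partitioning below are hypothetical; the partitions simulate distributing
the data across two nodes):

import statistics

data = [4, 8, 15, 16, 23, 42]
partitions = [data[:3], data[3:]]  # pretend the data lives on two nodes

# Distributive: the sum of partial sums equals the sum over the whole set.
partial_sums = [sum(p) for p in partitions]
assert sum(partial_sums) == sum(data)

# Algebraic: avg() is not distributive, but it can be computed from a
# bounded number (m = 2) of distributive sub-aggregates: sum() and count().
partial = [(sum(p), len(p)) for p in partitions]
total, count = map(sum, zip(*partial))
assert total / count == sum(data) / len(data)

# Holistic: median() has no constant-size sub-aggregate; in general all
# values must be kept, so partial medians do not suffice.
assert statistics.median(data) == 15.5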

Concept Hierarchies:

A concept hierarchy represents a series of mappings from a set of low-level concepts to
higher-level, more general concepts. A concept hierarchy organizes information or concepts
in a hierarchical structure or a specific partial order, which is used for expressing
knowledge in brief, high-level terms and for mining knowledge at several levels of
abstraction.

A concept hierarchy consists of a set of nodes organized in a tree, where the nodes represent
values of an attribute, known as concepts. A special node, "ANY", is reserved for the root
of the tree. A level number is assigned to each node in a concept hierarchy: the level of the
root node is one, and the level of a non-root node is one more than the level of its parent.
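
This level numbering can be sketched in a few lines of Python (the hierarchy below
is a hypothetical child-to-parent mapping for locations):

# Hypothetical concept hierarchy for location: child -> parent.
parent = {
    "Vancouver": "British Columbia",
    "Victoria": "British Columbia",
    "British Columbia": "Canada",
    "Canada": "ANY",
}

def level(node):
    # The root "ANY" is at level 1; every other node sits one level
    # below its parent.
    return 1 if node == "ANY" else level(parent[node]) + 1

print(level("Vancouver"))  # 4: ANY < Canada < British Columbia < Vancouver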

Because values are defined by nodes, the levels of nodes can also be used to describe the
levels of values. A concept hierarchy enables raw information to be handled at a higher and
more generalized level of abstraction. There are several types of concept hierarchies, which
are as follows −

Schema Hierarchy − Schema hierarchy represents the total or partial order between
attributes in the database. It can define existing semantic relationships between attributes.
In a database, more than one schema hierarchy can be generated by using multiple
sequences and grouping of attributes.

Set-Grouping Hierarchy − A set-grouping hierarchy organizes values for a given attribute
or dimension into groups of constants or ranges of values. It is also known as an instance
hierarchy because the partial order of the hierarchy is defined on the set of instances or
values of an attribute. These hierarchies often make more practical sense and are used
more often than other hierarchies.

Operation-Derived Hierarchy − An operation-derived hierarchy is defined by a set of
operations on the data. These operations are specified by users, experts, or the data
mining system. These hierarchies are usually defined for numerical attributes.
Such operations can be as simple as range comparisons, or as complex as data clustering
and data distribution analysis algorithms.

Rule-based Hierarchy − In a rule-based hierarchy, either a whole concept hierarchy or a
portion of it is defined by a set of rules and is computed dynamically based on the
current data and the rule definitions. A lattice-like structure is used for graphically
describing this type of hierarchy, in which each child-parent path is associated with a
generalization rule.

Concept hierarchies may be generated statically or dynamically, depending on the data
set: generation based on a static data set is known as static generation, while generation
that depends on the current, changing data is known as dynamic generation of the concept
hierarchy.
[Figure: (a) a hierarchy for location; (b) a lattice for time]

[Figure: a concept hierarchy for the attribute price]

OLAP Operations in the Multidimensional data model:

OLAP Operations

Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data.
Here is the list of OLAP operations −

 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −

 By climbing up a concept hierarchy for a dimension
 By dimension reduction
The following diagram illustrates how roll-up works.

 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the
level of city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed by dimension reduction, one or more dimensions are
removed from the data cube.

Drill-down
Drill-down is the reverse operation of roll-up. It is performed in either of the following
ways −

 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension
The following diagram illustrates how drill-down works −
 Drill-down is performed by stepping down a concept hierarchy for the dimension
time.
 Initially the concept hierarchy was "day < month < quarter < year".
 On drilling down, the time dimension is descended from the level of quarter to the
level of month.
 When drill-down is performed, one or more dimensions are added to the data
cube.
 It navigates from less detailed data to highly detailed data.

Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.
 Here slice is performed for the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube by selecting one or more dimensions.

Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube.
Consider the following diagram that shows the dice operation.

The dice operation on the cube based on the following selection criteria involves three
dimensions −
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")

Pivot
The pivot operation is also known as rotation. It rotates the data axes in view in order to
provide an alternative presentation of data. Consider the following diagram that shows
the pivot operation.
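
These operations map naturally onto ordinary DataFrame manipulations. The following
minimal sketch (assuming pandas is available; the column names and figures are
hypothetical) illustrates roll-up, slice, dice, and pivot on a tiny sales table:

import pandas as pd

# Hypothetical sales data with time, location, and item dimensions.
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "city": ["Toronto", "Vancouver", "Toronto", "Vancouver", "Toronto", "Vancouver"],
    "country": ["Canada"] * 6,
    "item": ["Mobile", "Modem", "Phone", "Mobile", "Modem", "Phone"],
    "dollars_sold": [605, 825, 14, 400, 512, 30],
})

# Roll-up: climb the location hierarchy from city to country.
rollup = sales.groupby(["quarter", "country", "item"])["dollars_sold"].sum()

# Drill-down goes the other way (e.g. quarter -> month) and needs the
# more detailed month-level data to be available.

# Slice: fix one dimension with a single criterion, time = "Q1".
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: select on two or more dimensions at once.
dice = sales[sales["city"].isin(["Toronto", "Vancouver"])
             & sales["quarter"].isin(["Q1", "Q2"])
             & sales["item"].isin(["Mobile", "Modem"])]

# Pivot: rotate the axes for an alternative presentation of the data.
pivot = sales.pivot_table(index="item", columns="city",
                          values="dollars_sold", aggfunc="sum")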

The design of a Data Warehouse:

Business analysts use the information in a data warehouse to measure
performance and make critical adjustments in order to win over other business
stakeholders in the market. Having a data warehouse offers the following advantages −
 Since a data warehouse can gather information quickly and efficiently, it can
enhance business productivity.
 A data warehouse provides a consistent view of customers and items; hence, it
helps us manage customer relationships.
 A data warehouse also helps in bringing down costs by tracking trends and
patterns over a long period in a consistent and reliable manner.
To design an effective and efficient data warehouse, we need to understand and analyze
the business needs and construct a business analysis framework. Each person has
different views regarding the design of a data warehouse. These views are as follows −
 The top-down view − This view allows the selection of relevant information
needed for a data warehouse.
 The data source view − This view presents the information being captured, stored,
and managed by the operational system.
 The data warehouse view − This view includes the fact tables and dimension
tables. It represents the information stored inside the data warehouse.
 The business query view − It is the view of the data from the viewpoint of the end-
user.

Three-Tier Data Warehouse Architecture

Generally, a data warehouse adopts a three-tier architecture. The following are the three
tiers of the data warehouse architecture.
 Bottom Tier − The bottom tier of the architecture is the data warehouse database
server. It is the relational database system. We use back-end tools and utilities
to feed data into the bottom tier. These back-end tools and utilities perform the
extract, clean, load, and refresh functions.
 Middle Tier − In the middle tier, we have the OLAP Server that can be implemented
in either of the following ways.
o By Relational OLAP (ROLAP), which is an extended relational database
management system. The ROLAP maps the operations on multidimensional
data to standard relational operations.
o By Multidimensional OLAP (MOLAP) model, which directly implements the
multidimensional data and operations.
 Top-Tier − This tier is the front-end client layer. This layer holds the query tools
and reporting tools, analysis tools and data mining tools.
The following diagram depicts the three-tier architecture of data warehouse −
Data Warehouse Models

From the perspective of data warehouse architecture, we have the following data
warehouse models −

 Virtual Warehouse
 Data mart
 Enterprise Warehouse

Virtual Warehouse
The view over an operational data warehouse is known as a virtual warehouse. It is easy
to build a virtual warehouse. Building a virtual warehouse requires excess capacity on
operational database servers.

Data Mart
Data mart contains a subset of organization-wide data. This subset of data is valuable to
specific groups of an organization.
In other words, we can claim that data marts contain data specific to a particular group.
For example, the marketing data mart may contain data related to items, customers, and
sales. Data marts are confined to subjects.
Points to remember about data marts −
 Windows-based or Unix/Linux-based servers are used to implement data marts.
They are implemented on low-cost servers.
 The implementation cycle of a data mart is measured in short periods of time, i.e., in
weeks rather than months or years.
 The life cycle of a data mart may be complex in the long run if its planning and
design are not organization-wide.
 Data marts are small in size.
 Data marts are customized by department.
 The source of a data mart is a departmentally structured data warehouse.
 Data marts are flexible.

Enterprise Warehouse
 An enterprise warehouse collects all of the information and the subjects spanning
an entire organization.
 It provides us enterprise-wide data integration.
 The data is integrated from operational systems and external information
providers.
 This information can vary from a few gigabytes to hundreds of gigabytes, terabytes
or beyond.
Data Warehouse Back-End Tools and Utilities:
Data extraction: get data from multiple, heterogeneous, and external sources
Data cleaning: detect errors in the data and rectify them when possible

Data transformation: convert data from legacy or host format to warehouse format
Load: sort, summarize, consolidate, compute views, check integrity, and build indices and
partitions
Refresh: propagate the updates from the data sources to the warehouse
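
A minimal sketch of these back-end steps in Python (the file name, column names, and
cleaning rules below are hypothetical):

import csv

def extract(path):
    # Data extraction: get records from an external source.
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def clean_and_transform(rows):
    for row in rows:
        # Data cleaning: drop records with a missing key.
        if not row.get("item_key"):
            continue
        # Data transformation: convert host format to warehouse format.
        row["dollars_sold"] = float(row["dollars_sold"])
        yield row

def load(rows):
    # Load: summarize and consolidate before storing in the warehouse.
    totals = {}
    for row in rows:
        totals[row["item_key"]] = totals.get(row["item_key"], 0.0) + row["dollars_sold"]
    return totals

warehouse = load(clean_and_transform(extract("sales_export.csv")))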

What is Metadata?

Metadata is simply defined as data about data. The data that is used to represent other
data is known as metadata. For example, the index of a book serves as metadata for the
contents of the book. In other words, we can say that metadata is the summarized data
that leads us to detailed data. In terms of a data warehouse, we can define metadata as
follows.
 Metadata is the road-map to a data warehouse.
 Metadata in a data warehouse defines the warehouse objects.
 Metadata acts as a directory. This directory helps the decision support system to
locate the contents of a data warehouse.
Note − In a data warehouse, we create metadata for the data names and definitions of a
given data warehouse. Along with this metadata, additional metadata is also created for
time-stamping any extracted data and recording the source of the extracted data.

Categories of Metadata

Metadata can be broadly categorized into three categories −


 Business Metadata − It contains data ownership information, business definitions,
and changing policies.
 Technical Metadata − It includes database system names, table and column names
and sizes, data types and allowed values. Technical metadata also includes
structural information such as primary and foreign key attributes and indices.
 Operational Metadata − It includes the currency of data and data lineage. Currency
of data means whether the data is active, archived, or purged. Lineage of data means
the history of the migrated data and the transformations applied to it.
Role of Metadata
Metadata has a very important role in a data warehouse. The role of metadata in a
warehouse is different from that of the warehouse data, yet it plays an important role. The
various roles of metadata are explained below.
 Metadata acts as a directory.
 This directory helps the decision support system to locate the contents of the data
warehouse.
 Metadata helps the decision support system map data when data is transformed
from the operational environment to the data warehouse environment.
 Metadata helps in summarization between current detailed data and highly
summarized data.
 Metadata also helps in summarization between lightly detailed data and highly
summarized data.
 Metadata is used for query tools.
 Metadata is used in extraction and cleansing tools.
 Metadata is used in reporting tools.
 Metadata is used in transformation tools.
 Metadata plays an important role in loading functions.
OLAP:
An Online Analytical Processing (OLAP) server is based on the multidimensional data
model. It allows managers and analysts to get an insight into the information through fast,
consistent, and interactive access to information.

Types of OLAP Servers

We have four types of OLAP servers −

 Relational OLAP (ROLAP)


 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP

ROLAP servers are placed between the relational back-end server and client front-end
tools. To store and manage warehouse data, ROLAP uses a relational or extended-
relational DBMS.
ROLAP includes the following −

 Implementation of aggregation navigation logic.


 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP

MOLAP uses array-based multidimensional storage engines for multidimensional views of
data. With multidimensional data stores, the storage utilization may be low if the data set
is sparse. Therefore, many MOLAP servers use two levels of data storage representation to
handle dense and sparse data sets.
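
The sparse case can be sketched as follows (hypothetical data): only non-empty cells are
stored, keyed by their dimension coordinates, while dense regions would be kept in plain
arrays.

# Sparse storage: keep only the non-empty cells of the cube.
sparse_cube = {
    ("Q1", "Mobile", "Toronto"): 605,
    ("Q2", "Modem", "Vancouver"): 512,
}

def cell(time, item, location):
    # Unmaterialized coordinates consume no storage and read as zero.
    return sparse_cube.get((time, item, location), 0)

print(cell("Q1", "Mobile", "Toronto"))   # 605
print(cell("Q1", "Modem", "Vancouver"))  # 0 (cell not stored)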

Hybrid OLAP

Hybrid OLAP is a combination of both ROLAP and MOLAP. It offers the higher scalability
of ROLAP and the faster computation of MOLAP. HOLAP servers allow storing large
volumes of detailed information; the aggregations are stored separately in a MOLAP store.

Specialized SQL Servers

Specialized SQL servers provide advanced query language and query processing support
for SQL queries over star and snowflake schemas in a read-only environment.
Data Warehouse Implementation

 The big data that is to be analyzed and handled to draw insights from will be
stored in data warehouses.
 These warehouses are run by OLAP servers, which require queries to be processed
within seconds.
 So, a data warehouse needs highly efficient cube computation techniques,
access methods, and query processing techniques.
 The core of multidimensional data analysis is the efficient computation of
aggregations across many sets of dimensions.
 In SQL, aggregations are referred to as group-by's.
 Each group-by can be represented as a cuboid.
 The set of group-by's forms a lattice of cuboids defining a data cube.

Efficient Data Cube Computation

The compute cube Operator and the Curse of Dimensionality

 The compute cube operator computes aggregates over all subsets of the dimensions
specified in the operation.
 It requires excessive storage space, especially for a large number of dimensions.
 A data cube is a lattice of cuboids.

Suppose that we create a data cube for ProElectronics (company) sales that contains the
following: city, item, year, and sales_in_dollars.
 Compute the sum of sales, grouping by city, and item.
 Compute the sum of sales, grouping by city.
 Compute the sum of sales, grouping by item.

What is the total number of cuboids, or group-by’s, that can be computed for this data
cube?

Three attributes: city, item, year (dimensions), and sales_in_dollars (measure). The
total number of cuboids or group-by's computed for this cube is 2^3 = 8. The group-by's
are: {(city, item, year), (city, item), (city, year), (item, year), (city), (item), (year), ()},
where () denotes the empty group-by, i.e., the dimensions are not grouped.
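
A small Python sketch enumerating those eight group-by's (a naive view of full cube
computation):

from itertools import combinations

dimensions = ("city", "item", "year")

# Every subset of the dimensions is one cuboid (one group-by).
cuboids = [combo
           for k in range(len(dimensions) + 1)
           for combo in combinations(dimensions, k)]

print(len(cuboids))  # 8 = 2^3
print(cuboids)       # [(), ('city',), ..., ('city', 'item', 'year')]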

The base cuboid contains all three dimensions. The apex cuboid is empty. Online
analytical processing may need to access different cuboids for different queries, so we
may have to compute all or at least some of the cuboids in the data cube in advance.
Precomputation leads to fast response time and avoids some redundant computation.

A major challenge related to precomputation is storage space if all the
cuboids in the data cube are computed, especially when the cube has many dimensions.
The storage requirements are even more excessive when many of the dimensions have
associated concept hierarchies, each with multiple levels. This problem is referred to as
the curse of dimensionality.

Cube Operation

Cube definition and computation in DMQL


 define cube sales_cube [city, item, year]: sum(sales_in_dollars)
 compute cube sales_cube
This is similar to the following SQL-like statement (using the cube by operator introduced
by Gray et al., 1996):
 SELECT item, city, year, SUM(amount) FROM SALES CUBE BY item, city, year
A data cube can be viewed as a lattice of cuboids −
 The bottom-most cuboid is the base cuboid.
 The top-most cuboid (apex) contains only one cell.
 How many cuboids are there in an n-dimensional cube where dimension i has Li
levels? T = (L1 + 1) × (L2 + 1) × ... × (Ln + 1), i.e., the product of (Li + 1) over all n
dimensions, where the extra 1 accounts for the virtual level all.
 For example, the time dimension as specified above has 4 conceptual levels, or 5 if we
include the virtual level all.
 If the cube has 10 dimensions and each dimension has 5 levels (including all), the total
number of cuboids that can be generated is 5^10 ≈ 9.8 × 10^6.
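
This count is easy to check in code (a direct translation of the formula above):

from math import prod

def num_cuboids(levels_per_dimension):
    # T = product of (L_i + 1) over all dimensions; the +1 is the
    # virtual level "all".
    return prod(l + 1 for l in levels_per_dimension)

# 10 dimensions, 4 conceptual levels each (5 including "all"):
print(num_cuboids([4] * 10))  # 9765625, i.e. 5^10 ~ 9.8 x 10^6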

Data Cube Materialization

There are three choices for data cube materialization given a base cuboid.
 No Materialization
 Full Materialization
 Partial Materialization
Using partial materialization involves three issues −
 Identify the subset of cuboids or subcubes to materialize.
 Exploit the materialized cuboids or subcubes during query processing.
 Efficiently update the materialized cuboids or subcubes during load and refresh.

Selection of which cuboids to materialize −
 Based on their size, the queries in the workload, accessing cost, their frequencies, etc.
Indexing OLAP Data: Bitmap Index

In the bitmap index of a particular column, each distinct value in the column has a bit
vector, and bit operations on these vectors are fast. The length of each bit vector equals
the number of records in the base table. The i-th bit is set if the i-th row of the base table
has that value for the indexed column. Bitmap indexing is not suitable for high-cardinality
domains.
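
A minimal bitmap-index sketch in Python (hypothetical column values), using integers
as bit vectors:

rows = ["Mobile", "Modem", "Mobile", "Phone", "Modem"]

# One bit vector per distinct value; bit i is set if row i has that value.
index = {}
for i, value in enumerate(rows):
    index[value] = index.get(value, 0) | (1 << i)

# Bit operations are fast: rows where the item is Mobile OR Modem.
mask = index["Mobile"] | index["Modem"]
matches = [i for i in range(len(rows)) if mask >> i & 1]
print(matches)  # [0, 1, 2, 4]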

Indexing OLAP Data: Join Indices

The join indexing method gained popularity from its use in relational database query
processing. Join index records can identify joinable tuples without performing costly
join operations. Join indexing is especially useful for maintaining the relationship between
a foreign key and its matching primary keys from the joinable relation.

Suppose that there are 360 time values, 100 items, 50 branches, 30 locations, and
10 million sales tuples in the sales star data cube. If the sales fact table has recorded sales
for only 30 items, the remaining 70 items will obviously not participate in joins. If join
indices are not used, additional I/Os have to be performed to bring the joining portions of
the fact table and dimension tables together.

To further speed up query processing, the join indexing and bitmap indexing
methods can be integrated to form bitmapped join indices. Microsoft SQL Server and
Sybase IQ support bitmap indices. Oracle 8 uses bitmap and join indices.
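
A join index can be sketched as a mapping from each dimension value to the identifiers
of the joinable fact-table tuples (the keys and tuple ids below are hypothetical):

fact_rows = [
    {"tid": "T57", "item_key": "i3", "dollars_sold": 605},
    {"tid": "T238", "item_key": "i5", "dollars_sold": 825},
    {"tid": "T884", "item_key": "i3", "dollars_sold": 14},
]

# Build the join index: dimension key -> ids of joinable fact tuples.
join_index = {}
for row in fact_rows:
    join_index.setdefault(row["item_key"], []).append(row["tid"])

# Joinable tuples are identified without a costly join over the fact table.
print(join_index.get("i3", []))  # ['T57', 'T884']
print(join_index.get("i9", []))  # [] -- an item with no sales never joins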

Efficient Processing OLAP Queries

The purpose of materializing cuboids and constructing OLAP index structures is to speed
up the query processing in data cubes.

Given materialized views, query processing should proceed as follows −

Determine which operations should be performed on the available cuboids −
Transform drill, roll, etc. into the corresponding SQL and/or OLAP operations; e.g.,
dice = selection + projection.

Determine to which materialized cuboid(s) the relevant operations should be applied −
Suppose that the query to be processed is on {brand, province_or_state} with the
selection constant "year = 2004", and that there are four materialized cuboids available:
{year, item_name, city}, {year, brand, country}, {year, brand, province_or_state}, and
{item_name, province_or_state} where year = 2004. Cuboid 2 cannot be used because
country is more general than province_or_state; each of cuboids 1, 3, and 4 can be used,
and cuboid 3 would typically be chosen since it matches the query most closely and
requires the least additional aggregation.
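
The second step can be sketched as a simple filter (the granularity relationships below
are assumptions for illustration): a cuboid is a candidate if every queried attribute
appears in it at the same or a finer granularity.

# Finer-or-equal granularities for each queried attribute (assumed).
finer_or_equal = {
    "brand": {"brand", "item_name"},                    # item_name is finer
    "province_or_state": {"province_or_state", "city"}, # city is finer
    "year": {"year"},
}

materialized = [
    {"year", "item_name", "city"},
    {"year", "brand", "country"},
    {"year", "brand", "province_or_state"},
    {"item_name", "province_or_state"},  # already restricted to year = 2004
]

def can_answer(cuboid, query):
    return all(cuboid & finer_or_equal[attr] for attr in query)

query = ["brand", "province_or_state", "year"]
print([i + 1 for i, c in enumerate(materialized) if can_answer(c, query)])
# [1, 3]; cuboid 4 also qualifies in practice because its contents are
# already restricted to year = 2004, which a fuller planner would exploit.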
From Data warehousing to Data mining:

Data Warehouse Usage

 Three kinds of data warehouse applications −
o Information processing
 supports querying, basic statistical analysis, and reporting using
crosstabs, tables, charts, and graphs
o Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations: slice and dice, drilling, pivoting
o Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools

From On-Line Analytical Processing to On-Line Analytical Mining:


OLAM stands for Online Analytical Mining. It is also known as OLAP mining. It integrates
online analytical processing with data mining and mining knowledge in multi-
dimensional databases. There are several paradigms and architectures of data mining
systems.
Data mining tools must work on integrated, consistent, and cleaned data. This
requires costly pre-processing for data cleaning, data transformation, and data
integration. Thus, a data warehouse constructed by such pre-processing is a valuable
source of high-quality information for both OLAP and data mining. Data mining can
also serve as a valuable tool for data cleaning and data integration.
OLAM is particularly important for the following reasons which are as follows −
High quality of data in data warehouses − Most data mining tools are required to
work on integrated, consistent, and cleaned information, which needs costly data
cleaning, data integration, and data transformation as a pre-processing phase. A data
warehouse constructed by such pre-processing serves as a valuable source of high-
quality data for OLAP and data mining. Data mining can also serve as a valuable tool for
data cleaning and data integration.
Available information processing infrastructure surrounding data warehouses −
Comprehensive data processing and data analysis infrastructures have been or will be
systematically constructed surrounding data warehouses, including the accessing,
integration, consolidation, and transformation of multiple heterogeneous databases,
ODBC/OLE DB connections, Web-accessing and service facilities, and reporting and
OLAP analysis tools. It is prudent to make the best use of the available infrastructures
rather than constructing everything from scratch.
OLAP-based exploratory data analysis − Effective data mining requires exploratory
data analysis. A user often wants to traverse through a database, select portions of
relevant data, analyze them at multiple granularities, and present
knowledge/results in multiple forms.
Online analytical mining provides facilities for data mining on multiple subsets of data
and at several levels of abstraction, by drilling, pivoting, filtering, dicing, and slicing on a
data cube and on intermediate data mining results.
On-line selection of data mining functions − Users may not always know the specific
types of knowledge they want to mine. By integrating OLAP with multiple
data mining functions, online analytical mining provides users with the
flexibility to select desired data mining functions and swap data mining tasks
dynamically.
Architecture for On-Line Analytical Mining:
Online Analytical Mining integrates Online Analytical Processing with data
mining and mining knowledge in multidimensional databases. The following diagram
shows the integration of both OLAP and OLAM.
