
Department: Computer Science

Subject Name: Data Mining

Subject Code: CECS54A

Sem: V

Unit: II

1. Data Warehouse - Basic Concepts

2. Data Warehouse Modeling:

2.1 Data Cube

2.2 OLAP

3. Data Warehouse Design and Usage

4. Data Warehouse Implementation

5. Data Generalization by Attribute Oriented Induction

6. Data Cube Technology

7. Data Cube Computation Methods

8. Exploring Cube Technology

9. Multidimensional Data Analysis in cube space.


1. Data Warehouse - Basic Concepts

A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It includes historical data derived from transaction data from single
or multiple sources.

A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.

What is Data Warehouse?

A Data Warehouse is a group of data specific to the entire organization, not only to a particular
group of users.

It is not used for daily operations and transaction processing but used for making decisions.

A Data Warehouse can be viewed as a data system with the following attributes:

 It is a database designed for investigative tasks, using data from various applications.
 It supports a relatively small number of clients with relatively long interactions.
 It includes current and historical data to provide a historical perspective of information.
 Its usage is read-intensive.
 It contains a few large tables.

"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in


support of management's decisions."

Subject-Oriented

A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view around
a particular subject, such as customer, product, or sales, instead of the organization's
ongoing global operations. This is done by excluding data that are not useful for the
subject and including all data the users need to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources, such as RDBMSs, flat files,
and online transaction records. Data cleaning and integration must be performed during
data warehousing to ensure consistency in naming conventions, attribute types, etc.,
among the different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even further back. This contrasts with a
transaction system, where often only the most current data are kept.
Non-Volatile
The data warehouse is a physically separate data store, transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed.
It usually requires only two data-access procedures: initial loading of data and access to
data. Therefore, the DW does not require transaction processing, recovery, or
concurrency-control capabilities, which allows for a substantial speedup of data retrieval.
Non-volatile means that, once entered into the warehouse, data should not change.
2. Data Warehouse Modeling
 Data warehouse modeling is the process of designing the schemas of the detailed
and summarized information of the data warehouse. The goal of data warehouse
modeling is to develop a schema describing reality, or at least the part of it
that the data warehouse is needed to support.
 Data warehouse modeling is an essential stage of building a data warehouse for
two main reasons. Firstly, through the schema, data warehouse clients can
visualize the relationships among the warehouse data, to use them with greater
ease.
 Secondly, a well-designed schema allows an effective data warehouse structure to
emerge, to help decrease the cost of implementing the warehouse and improve the
efficiency of using it.
 Data modeling in data warehouses is different from data modeling in operational
database systems.
 The primary function of data warehouses is to support decision-support system (DSS)
processes. Thus, the objective of data warehouse modeling is to make the data
warehouse efficiently support complex queries on long-term information.

Goals of Data Warehousing


o To help reporting as well as analysis
o Maintain the organization's historical information
o Be the foundation for decision making

Benefits of Data Warehouse

1. Understand business trends and make better forecasting decisions.


2. Data Warehouses are designed to perform well with enormous amounts of data.
3. The structure of data warehouses is more accessible for end-users to navigate,
understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build
and maintain in data warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from
lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Difference between Operational Database and Data Warehouse

1. Operational systems are designed to support high-volume transaction processing, whereas data warehousing systems are designed to support high-volume analytical processing (i.e., OLAP).
2. Operational systems are usually concerned with current data, whereas data warehousing systems are concerned with historical data.
3. Data within operational systems are updated regularly as needed; a data warehouse is non-volatile: new data may be added regularly, but once added the data are rarely changed.
4. An operational database is designed for real-time business operations and processes; a data warehouse is designed for analysis of business measures by subject area, category, and attribute.
5. An operational database is optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table; a data warehouse is optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
6. An operational database is optimized for validation of incoming data during transactions and uses validation tables; a data warehouse is loaded with consistent, valid data and requires no real-time validation.
7. An operational database supports thousands of concurrent clients; a data warehouse supports relatively few concurrent clients.
8. Operational systems are widely process-oriented; data warehousing systems are widely subject-oriented.
9. Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data; data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
10. An operational database is oriented toward data in; a data warehouse is oriented toward data out.
11. An operational database accesses a small number of records per operation; a data warehouse accesses a large number.
12. Relational databases are created for On-Line Transaction Processing (OLTP); data warehouses are designed for On-Line Analytical Processing (OLAP).

Difference between OLTP and OLAP

OLTP System
An OLTP system handles operational data: the data involved in the operation of a particular
system, for example ATM transactions and bank transactions.

OLAP System
An OLAP system handles historical (archival) data: data accumulated over a long period.
For example, if we collect the last 10 years of information about flight reservations, the data
can reveal meaningful patterns such as trends in reservations. This may provide useful
information like the peak times of travel and what kinds of people travel in the various
classes (economy/business).

A major difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. An OLTP system manages many concurrent users and queries touching
only an individual record or limited groups of records at a time, whereas an OLAP system
must be able to operate on millions of records to answer a single query.

Feature-by-feature comparison of OLTP and OLAP:

Characteristic - OLTP: a system used to manage operational data. OLAP: a system used to manage informational data.

Users - OLTP: clerks, clients, and information technology professionals. OLAP: knowledge workers, including managers, executives, and analysts.

System orientation - OLTP: customer-oriented; transaction and query processing are done by clerks, clients, and IT professionals. OLAP: market-oriented; data analysis is done by knowledge workers, including managers, executives, and analysts.

Data contents - OLTP: manages current data that, typically, are too detailed to be easily used for decision making. OLAP: manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages data at different levels of granularity, which makes the data easier to use for informed decision making.

Database size - OLTP: 100 MB to GB. OLAP: 100 GB to TB.

Database design - OLTP: usually uses an entity-relationship (ER) data model and an application-oriented database design. OLAP: typically uses a star or snowflake model and a subject-oriented database design.

View - OLTP: focuses mainly on the current data within an enterprise or department, without referring to historical data or data in different organizations. OLAP: often spans multiple versions of a database schema, due to the evolutionary process of an organization, and also deals with data that originates from different organizations, integrating information from many data stores.

Volume of data - OLTP: not very large. OLAP: because of their large volume, OLAP data are stored on multiple storage media.

Access patterns - OLTP: consists mainly of short, atomic transactions; such a system requires concurrency control and recovery techniques. OLAP: mostly read-only operations, since most data warehouses store historical data.

Access mode - OLTP: read/write. OLAP: mostly read.

Inserts and updates - OLTP: short, fast inserts and updates initiated by end users. OLAP: periodic, long-running batch jobs refresh the data.

Number of records accessed - OLTP: tens. OLAP: millions.

Normalization - OLTP: fully normalized. OLAP: partially normalized.

Processing speed - OLTP: very fast. OLAP: depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.

2.1 What is a Data Cube?

When data is grouped or combined into multidimensional matrices, the result is called a data
cube. The data cube method has a few alternative names and variants, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)." The general idea of this approach is to materialize certain expensive
computations that are frequently queried.

A data cube is defined by dimensions and facts. Facts are generally numeric quantities
(measures) used for analyzing the relationships between dimensions.
Example: In a 2-D representation, we can look at the AllElectronics sales data
for items sold per quarter in the city of Vancouver. The measure displayed is dollars
sold (in thousands).

3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension: for example,
according to time and item as well as location, for the cities Chicago, New York,
Toronto, and Vancouver. The measure displayed is dollars sold (in thousands). These
3-D data can be shown as a table in which the 3-D data are represented as a series of
2-D tables. Conceptually, we may also represent the same data in the form of a 3-D
data cube.

In data warehousing, data cubes are n-dimensional. The cuboid that holds the lowest level
of summarization is called the base cuboid.

For example, a 4-D cuboid over the dimensions time, item, location, and supplier is the
base cuboid for those four dimensions. It can be drawn as a 4-D data cube representation
of the sales data, with the measure dollars sold (in thousands).

The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, it is the total sales, or dollars sold, summarized over all four
dimensions.

The lattice of cuboids forms a data cube. For the dimensions time, item, location, and
supplier, the lattice of cuboids makes up a 4-D data cube, where each cuboid represents
a different degree of summarization.
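
To make the lattice idea concrete, here is a minimal sketch in Python (using pandas, with toy data and illustrative column names, not the actual AllElectronics tables) that computes every cuboid of a small 3-D sales cube, from the base cuboid down to the apex cuboid:

```python
# Compute the full lattice of cuboids for a toy sales table with
# dimensions (time, item, location); the measure is dollars_sold.
from itertools import combinations
import pandas as pd

sales = pd.DataFrame({
    "time":         ["Q1", "Q1", "Q2", "Q2"],
    "item":         ["TV", "PC", "TV", "PC"],
    "location":     ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [605, 825, 680, 952],
})

dimensions = ["time", "item", "location"]
cuboids = {}
for k in range(len(dimensions), -1, -1):
    for dims in combinations(dimensions, k):
        if dims:                              # the k-D cuboids
            cuboids[dims] = sales.groupby(list(dims))["dollars_sold"].sum()
        else:                                 # the 0-D apex cuboid
            cuboids[dims] = sales["dollars_sold"].sum()

# The 3-D group-by is the base cuboid; the empty group-by is the apex.
print(cuboids[("time", "item", "location")])
print("apex (total dollars_sold):", cuboids[()])
```

With n dimensions the loop produces 2^n cuboids (8 here), which is exactly the lattice described above.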
2.2 What is OLAP (Online Analytical Processing)?

OLAP stands for On-Line Analytical Processing. OLAP is a category of software technology
that enables analysts, managers, and executives to gain insight into information through
fast, consistent, interactive access to a wide variety of possible views of data that has
been transformed from raw information to reflect the real dimensionality of the enterprise
as understood by its users.

OLAP implements multidimensional analysis of business information and supports complex
calculations, trend analysis, and sophisticated data modeling. It is rapidly becoming the
essential foundation for intelligent solutions including business performance management,
planning, budgeting, forecasting, financial reporting, analysis, simulation models,
knowledge discovery, and data warehouse reporting. OLAP enables end users to perform
ad hoc analysis of data in multiple dimensions, providing the insight and understanding
they require for better decision making.

Who uses OLAP and Why?

OLAP applications are used across a variety of organizational functions.

Finance and accounting:

o Budgeting

o Activity-based costing

o Financial performance analysis

o Financial modeling


Sales and Marketing

o Sales analysis and forecasting

o Market research analysis

o Promotion analysis

o Customer analysis

o Market and customer segmentation

Production

o Production planning

o Defect analysis

OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.

The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.

How OLAP Works?

Fundamentally, OLAP has a very simple concept: it pre-calculates most of the queries that are
typically very hard to execute over tabular databases, namely aggregation, joining, and
grouping. These queries are calculated during a process usually called 'building' or
'processing' the OLAP cube. This process typically happens overnight, so by the time end
users get to work, the data will have been updated.
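
As a toy illustration of this build-then-query idea, the sketch below (pandas, illustrative data) precomputes one cuboid during the 'processing' step so that a later query is a simple lookup rather than a scan of the detail rows:

```python
import pandas as pd

detail = pd.DataFrame({
    "month":  ["Jan", "Jan", "Feb", "Feb", "Feb"],
    "item":   ["TV", "PC", "TV", "PC", "TV"],
    "amount": [250.6, 175.0, 300.0, 120.5, 99.9],
})

# "Build"/"process" step (e.g., run overnight): precompute the
# (month, item) aggregation once.
cube = detail.groupby(["month", "item"])["amount"].sum()

# Query time: answered by lookup on the precomputed aggregate.
print(cube.loc[("Feb", "TV")])    # 399.9, no detail scan needed
```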

Three-Tier Data Warehouse Architecture

Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).

A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS.
It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces known as gateways.

A gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object
Linking and Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).
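
As a hedged sketch of how a client program might talk to the warehouse server through an ODBC gateway, the snippet below uses the pyodbc package; the DSN name, credentials, and table are hypothetical, and the exact connection string depends on the driver installed:

```python
import pyodbc  # assumes pyodbc and an ODBC driver are installed

# Hypothetical DSN and credentials; adjust to the local ODBC setup.
conn = pyodbc.connect("DSN=warehouse_dsn;UID=analyst;PWD=secret")
cursor = conn.cursor()

# The client generates SQL that is executed at the server.
cursor.execute("SELECT item, SUM(dollars_sold) FROM sales GROUP BY item")
for row in cursor.fetchall():
    print(row)
conn.close()
```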

A middle-tier which consists of an OLAP server for fast querying of the data warehouse.

The OLAP server is implemented using either

(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions
on multidimensional data to standard relational operations.

(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.

A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle and the top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions,
hierarchies, data mart locations and contents, etc.
2. Operational metadata, which usually describe the currency level of the stored data
(active, archived, or purged) and warehouse monitoring information (usage statistics,
error reports, audit trails, etc.).
3. System performance data, including indices used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, including the source
RDBMSs and their contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business metadata,
which include business terms and definitions, ownership information, etc.

3. Data Warehouse Design

A data warehouse is a single data repository in which records from multiple data sources are
integrated for online business analytical processing (OLAP). This implies that a data
warehouse needs to meet the requirements of all the business stages of the entire
organization. Thus, data warehouse design is a hugely complex, lengthy, and hence
error-prone process. Furthermore, business analytical functions change over time, which
results in changes to the requirements for the system. Therefore, data warehouse and OLAP
systems are dynamic, and the design process is continuous.

Data warehouse design in industry takes an approach different from view materialization. It
sees data warehouses as database systems with particular needs, such as answering
management-related queries. The target of the design becomes how the records from multiple
data sources should be extracted, transformed, and loaded (ETL) to be organized in a
database as the data warehouse.

There are two approaches:

1. "top-down" approach
2. "bottom-up" approach
Top-down Design Approach

In the "Top-Down" design approach, a data warehouse is described as a subject-oriented, time-


variant, non-volatile and integrated data repository for the entire enterprise data from different
sources are validated, reformatted and saved in a normalized (up to 3NF) database as the data
warehouse.

The data warehouse stores "atomic" information, the data at the lowest level of granularity, from
where dimensional data marts can be built by selecting the data required for specific business
subjects or particular departments.

An approach is a data-driven approach as the information is gathered and integrated first and
then business requirements by subjects for building data marts are formulated.

The advantage of this method is which it supports a single integrated data source. Thus data
marts built from it will have consistency when they overlap.

Advantages of top-down design

Data Marts are loaded from the data warehouses.

Developing new data mart from the data warehouse is very easy.

Disadvantages of top-down design

This technique is inflexible to changing departmental needs.

The cost of implementing the project is high.


Bottom-Up Design Approach

In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data
specifical architecture for query and analysis," term the star schema. In this approach, a data mart
is created first to necessary reporting and analytical capabilities for particular business processes
(or subjects). Thus it is needed to be a business-driven approach in contrast to Inmon's data-
driven approach.

Data marts include the lowest grain data and, if needed, aggregated data too. Instead of a
normalized database for the data warehouse, a denormalized dimensional database is adapted to
meet the data delivery requirements of data warehouses. Using this method, to use the set of data
marts as the enterprise data warehouse, data marts should be built with conformed dimensions in
mind, defining that ordinary objects are represented the same in different data marts. The
conformed dimensions connected the data marts to form a data warehouse, which is generally
called a virtual data warehouse.

The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data
mart, a data warehouse for a single subject, takes far less time and effort than developing an
enterprise-wide data warehouse. Also, the risk of failure is even less. This method is inherently
incremental. This method allows the project team to learn and grow.
Advantages of bottom-up design

Documents can be generated quickly.

The data warehouse can be extended to accommodate new business units; it is just a matter of
developing new data marts and then integrating them with the other data marts.

Disadvantages of bottom-up design

The locations of the data warehouse and the data marts are reversed in the bottom-up design
approach.

Differentiate between Top-Down Design Approach and Bottom-Up Design Approach

1. Top-down breaks the vast problem into smaller subproblems; bottom-up solves the essential low-level problems and integrates them into higher ones.
2. Top-down is inherently architected, not a union of several data marts; bottom-up is inherently incremental and can schedule the essential data marts first.
3. Top-down keeps a single, central store of information about the content; bottom-up stores information departmentally.
4. Top-down uses centralized rules and control; bottom-up uses departmental rules and control.
5. Top-down includes redundant information; with bottom-up, redundancy can be removed.
6. Top-down may see quick results if implemented with iterations; bottom-up carries less risk of failure, a favorable return on investment, and proof of techniques.

Data Warehouse Usage

 Three kinds of data warehouse applications


o Information processing
 supports querying, basic statistical analysis, and reporting using crosstabs,
tables, charts and graphs
o Analytical processing
 multidimensional analysis of data warehouse data
 supports basic OLAP operations: slice and dice, drilling, and pivoting (see the sketch after this list)
o Data mining
 knowledge discovery from hidden patterns
 supports associations, constructing analytical models, performing
classification and prediction, and presenting the mining results using
visualization tools.
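
The sketch below illustrates these basic OLAP operations on a toy fact table with pandas; the data and names are illustrative, not from a real warehouse:

```python
import pandas as pd

facts = pd.DataFrame({
    "quarter":      ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":         ["TV", "PC", "TV", "PC", "TV", "PC"],
    "city":         ["Chicago", "Chicago", "Chicago",
                     "Toronto", "Toronto", "Toronto"],
    "dollars_sold": [605, 825, 680, 952, 310, 512],
})

# Roll-up (drilling): aggregate away 'city' to the (quarter, item) level.
rollup = facts.groupby(["quarter", "item"])["dollars_sold"].sum()

# Slice: fix a single dimension value (quarter = 'Q1').
slice_q1 = facts[facts["quarter"] == "Q1"]

# Dice: select a sub-cube on two or more dimensions.
dice = facts[(facts["quarter"] == "Q1") & (facts["city"] == "Chicago")]

# Pivot: rotate the axes to get a crosstab view.
pivot = facts.pivot_table(index="item", columns="quarter",
                          values="dollars_sold", aggfunc="sum")
print(pivot)
```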

4. Data Warehouse Implementation

 Implementing a data warehouse involves the following steps:

 1. Requirements analysis and capacity planning: The first step involves defining
enterprise needs, defining the architecture, carrying out capacity planning, and selecting
the hardware and software tools. This step will involve consulting senior management as
well as the various stakeholders.
 2. Hardware integration: Once the hardware and software have been selected, they need
to be put together by integrating the servers, the storage systems, and the user software
tools.
 3. Modeling: Modeling is a significant step that involves designing the warehouse
schema and views. This may involve using a modeling tool if the data warehouse is
sophisticated.
 4. Physical modeling: For the data warehouse to perform efficiently, physical modeling
is needed. This involves designing the physical data warehouse organization, data
placement, data partitioning, deciding on access methods, and indexing.
 5. Sources: The information for the data warehouse is likely to come from several data
sources. This step involves identifying and connecting the sources using gateways, ODBC
drivers, or other wrappers.
 6. ETL: The data from the source systems will need to go through an ETL process.
Designing and implementing the ETL process may involve selecting a suitable ETL tool
vendor and purchasing and implementing the tools, and may include customizing the tool
to suit the needs of the enterprise.
 7. Populate the data warehouse: Once the ETL tools have been agreed upon, testing
them will be needed, perhaps using a staging area. Once everything is working adequately,
the ETL tools may be used to populate the warehouse given the schema and view
definitions.
 8. User applications: For the data warehouse to be helpful, there must be end-user
applications. This step involves designing and implementing the applications required by
the end users.
 9. Roll-out the warehouse and applications: Once the data warehouse has been
populated and the end-user applications tested, the warehouse system and the applications
may be rolled out for the user community to use.
 Single Table for Base and Summary Facts

 RID   item  ...  day  month  quarter  year  dollars_sold
 1001  TV    ...  15   10     Q4       2010  250.60
 1002  TV    ...  23   10     Q4       2010  175.00
 ...
 5001  TV    ...  all  10     Q4       2010  45,786.08
 ...

 Here the special value "all" marks a summary fact: row 5001 stores the total TV sales
for all days of October 2010, in the same table as the base facts.
 Difference between ROLAP, MOLAP, and HOLAP

ROLAP stands for Relational Online Analytical Processing, MOLAP for Multidimensional Online Analytical Processing, and HOLAP for Hybrid Online Analytical Processing.

Storage of aggregations - ROLAP: the aggregations of a partition are stored in indexed views in the relational database specified in the partition's data source. MOLAP: the aggregations of the partition and a copy of its source data are stored in a multidimensional structure in Analysis Services when the partition is processed. HOLAP: combines attributes of both MOLAP and ROLAP; like MOLAP, it stores the aggregations of the partition in a multidimensional structure in a SQL Server Analysis Services instance.

Storage of source data - ROLAP: does not cause a copy of the source data to be stored in the Analysis Services data folders; instead, when the result cannot be derived from the query cache, the indexed views in the data source are accessed to answer queries. MOLAP: highly optimized to maximize query performance; the storage area can be on the computer where the partition is defined or on another computer running Analysis Services, and because a copy of the source data resides in the multidimensional structure, queries can be resolved without accessing the partition's source data. HOLAP: does not cause a copy of the source data to be stored; for queries that access only the summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.

Query performance - ROLAP: query response is frequently slower than with the MOLAP or HOLAP storage modes, and processing time is also frequently slower. MOLAP: query response times can be reduced substantially by using aggregations; the data in the partition's MOLAP structure are only as current as the most recent processing of the partition. HOLAP: queries that access source data (for example, a drill-down to an atomic cube cell for which there is no aggregation) must retrieve data from the relational database and will not be as fast as they would be if the source data were stored in the MOLAP structure.
5. Data Generalization
What is data generalization?
 Data generalization replaces a data value with a less precise one, using a few
different techniques; this preserves data utility while protecting against some types of
attacks that could lead to re-identification of individuals or could unintentionally reveal
private information.
 A process that abstracts a large set of task-relevant data in a database from a low
conceptual level to higher ones.
 Data Generalization is a summarization of general features of objects in a target class and
produces what is called characteristic rules.
 The data relevant to a user-specified class are normally retrieved by a database query and
run through a summarization module to extract the essence of the data at different levels
of abstractions.
 For example, one may want to characterize the "OurVideoStore" customers who
regularly rent more than 30 movies a year. With concept hierarchies on the attributes
describing the target class, the attribute-oriented induction method can be used, for
example, to carry out data summarization.
 Attribute-Oriented Induction
 The Attribute-Oriented Induction (AOI) approach to data generalization and
summarization-based characterization was first proposed in 1989 (KDD '89 workshop),
a few years before the introduction of the data cube approach.

The data cube approach can be considered a data warehouse-based,
precomputation-oriented, materialized approach.

It performs off-line aggregation before an OLAP or data mining query is submitted for
processing.

The attribute-oriented induction approach, on the other hand, at least in its initial
proposal, is a relational database query-oriented, generalization-based, on-line data
analysis technique.

However, there is no inherent barrier distinguishing the two approaches based on online
aggregation versus offline precomputation.

Some aggregations in the data cube can be computed on-line, while off-line
precomputation of multidimensional space can speed up attribute-oriented induction as
well.

AOI is not confined to categorical data or particular measures.

 Basic Principles of Attribute-Oriented Induction
 A set of basic principles for attribute-oriented induction in relational databases is
summarized as follows (a toy sketch in Python follows this list):
 1. Data focusing: analyze the task-relevant data, including dimensions; the result is the
initial relation.
 2. Attribute removal: remove attribute A if there is a large set of distinct values for A
but either (a) there is no generalization operator on A, or (b) A's higher-level concepts
are expressed in terms of other attributes.
 3. Attribute generalization: if there is a large set of distinct values for A, and there
exists a set of generalization operators on A, then select an operator and generalize A.
 4. Attribute-threshold control: typically 2-8, specified or default.
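
Here is a toy sketch of these principles in Python (pandas); the relation, the concept hierarchy, and the generalization operator are illustrative assumptions:

```python
import pandas as pd

# Initial (task-relevant) relation.
initial = pd.DataFrame({
    "name": ["Ann", "Bob", "Cai", "Dee"],
    "city": ["Delhi", "Mumbai", "Lyon", "Paris"],
    "age":  [21, 24, 33, 38],
})

# Concept hierarchy for 'city' (generalization operator: city -> country).
city_to_country = {"Delhi": "India", "Mumbai": "India",
                   "Lyon": "France", "Paris": "France"}

def age_band(a):  # generalization operator for 'age'
    return "young" if a < 30 else "middle_aged"

# Attribute removal: 'name' has many distinct values and no
# generalization operator, so it is dropped.
work = initial.drop(columns=["name"])

# Attribute generalization: climb each remaining attribute's hierarchy.
work["city"] = work["city"].map(city_to_country)
work["age"] = work["age"].map(age_band)

# Merge identical generalized tuples and keep a count, giving the
# generalized (prime) relation that characterizes the class.
prime = work.value_counts().reset_index(name="count")
print(prime)
```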

6. Data Cube Technology

A data cube is a three-dimensional (or higher) set of values, typically used to describe a
time series of image data. It is an abstraction of data for analyzing aggregated
information from a number of points of view. Because a spectrally resolved image can be
interpreted as a 3-D volume, data cubes are often useful in imaging spectroscopy.

A data cube may also represent the multidimensional extension of a two-dimensional table:
it can be viewed as a group of similar 2-D tables stacked on each other. Data cubes are
used to represent data that is too complex for a single table of columns and rows to
describe.

Data grouped or combined into multidimensional matrices is called a data cube; alternative
names and variants of the data cube approach include "multidimensional databases,"
"materialized views," and "OLAP (On-Line Analytical Processing)."

In computer programming contexts, a data cube is a multidimensional ("n-D") array of
values. The term is usually used where these arrays are massively bigger than the main
memory of the hosting computer; examples include multi-terabyte/petabyte data warehouses
and time series of image data.

A data cube is generated from a subset of the attributes in the database. Certain attributes
are chosen as measure attributes, i.e., the attributes whose values are of interest, while
other attributes are selected as dimensions. The measure attributes are aggregated
according to the dimensions.

For instance, XYZ may create a sales data warehouse to keep records of the store's sales
with respect to the dimensions time, item, branch, and location. For example, the attributes
item name, brand, and type can make up the dimension table for the item dimension.

The data cube technique is a powerful method with many implementations. In many cases,
data cubes are sparse: not every cell in each dimension has matching data in the database.

What is data cube technology used for?

A data cube is a multidimensional structure: an abstraction of data for displaying
aggregated data from a variety of viewpoints.

The measure attribute is aggregated according to the dimensions, and the data is viewed on
the cube in a multidimensional way. The aggregated and summarized facts can be displayed
along variables or attributes; this is where OLAP plays its role.

Data cubes are widely used for straightforward data analysis: they represent quantities of
business interest along with dimensions.

Each cube dimension reflects some characteristic of the database, such as revenue by day,
month, or year.

Data cube classifications

Data cubes are grouped into two classifications, described below.

1. Multidimensional data cube - Most OLAP products are built on a structure in which the
cube is patterned as a multidimensional array. These multidimensional OLAP (MOLAP)
products typically provide better performance than other approaches, primarily because
they can be indexed directly into the data cube structure to retrieve subsets of data.
However, the cube gets sparser as the number of dimensions grows: many cells that
represent specific attribute combinations will contain no aggregated data. This in turn
raises the storage requirements, which can at times exceed desirable thresholds,
rendering the MOLAP solution untenable for massive, multidimensional data sets.
Compression strategies may help, but their use may damage MOLAP's natural indexing.
2. Relational OLAP - Relational OLAP (ROLAP) uses the relational database architecture.
Instead of a multidimensional array, the ROLAP data cube is stored as a series of
relational tables (approximately twice as many as the number of dimensions). Each of
these tables, referred to as a cuboid, denotes a particular view.

7. Data Cube Computation Methods

 Data cube computation is an essential task in data warehouse implementation. The
precomputation of all or part of a data cube can greatly reduce the response time and
enhance the performance of online analytical processing. However, such computation is
challenging because it may require substantial computational time and storage space.
 Efficient methods for data cube computation include:
A. the multiway array aggregation (MultiWay) method for computing full cubes;
B. a method known as BUC, which computes iceberg cubes from the apex cuboid
downward;
C. the Star-Cubing method, which integrates top-down and bottom-up computation;
D. high-dimensional OLAP.

A. Multi-Way Array Aggregation

 An array-based "bottom-up" algorithm
 Uses multidimensional chunks
 No direct tuple comparisons
 Simultaneous aggregation on multiple dimensions
 Intermediate aggregate values are re-used for computing ancestor cuboids
 Full materialization only; cannot do Apriori pruning, so no iceberg optimization

Aggregation Strategy
1. Partition the array into chunks, where a chunk is a small sub-cube that fits in memory.
2. Data addressing: each cell is addressed by a chunk id and an offset within the chunk.
3. Multi-way aggregation: compute the aggregates in a multi-way fashion, visiting the
chunks in an order chosen to minimize memory access and memory space.

Example
 Suppose the data size on dimensions A, B, and C is 40, 400, and 4,000, respectively, and
each dimension is split into four equal chunks, so the array is scanned as 64 chunks.
 Minimum memory is required when the chunks are traversed in the order 1, 2, 3, ..., 64.
 The total memory required is then 40 x 400 (the whole AB plane) + 40 x 1,000 (one row of
AC chunks) + 100 x 1,000 (one BC chunk) = 156,000 memory units.

Summary of Multi-Way
Method
 Cuboids should be sorted and computed according to the data size on each dimension.
 Keep the smallest plane in main memory; fetch and compute only one chunk at a time
for the largest plane.
Limitations
 Full materialization only.
 Computes well only for a small number of dimensions (high-dimensional data require
partial materialization).
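
A minimal NumPy sketch of the core idea (simultaneous aggregation over a dense array, with intermediate cuboids re-used for their ancestors); the array sizes are toy values, not the 40 x 400 x 4,000 example above:

```python
import numpy as np

rng = np.random.default_rng(0)
abc = rng.random((4, 8, 16))     # dense base cuboid ABC (A x B x C)

# One pass over ABC yields all three 2-D cuboids:
ab = abc.sum(axis=2)             # aggregate out C -> AB plane
ac = abc.sum(axis=1)             # aggregate out B -> AC plane
bc = abc.sum(axis=0)             # aggregate out A -> BC plane

# Intermediate aggregates are re-used for ancestor cuboids:
a = ab.sum(axis=1)               # A computed from AB, not from ABC again
apex = a.sum()                   # 0-D apex cuboid
print(ab.shape, ac.shape, bc.shape, a.shape, round(apex, 2))
```

A real MultiWay implementation additionally chunks the array so that only a small part of each plane must be memory-resident at a time, as in the example above.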

B. Bottom-Up Computation (BUC)


Characteristics
 “Top-down” approach
 Partial materialization (iceberg cube computation)
 Divides dimensions into partitions and facilitates iceberg pruning
 No simultaneous aggregation

Iceberg Pruning Process

Partitioning
 Sort the data values.
 Partition them into blocks that fit in memory.
Apriori Pruning
 For each block:
• If it does not satisfy min_sup, its descendants are pruned.
• If it satisfies min_sup, materialize it and make a recursive call including the next
dimension.
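
The following is a minimal sketch of BUC-style iceberg computation in Python with a count measure; the tuples, the dimension order, and the min_sup value are illustrative:

```python
from collections import defaultdict

rows = [("Q1", "TV", "Chicago"), ("Q1", "TV", "Toronto"),
        ("Q1", "PC", "Chicago"), ("Q2", "TV", "Chicago")]
MIN_SUP = 2          # iceberg condition: count >= 2
N_DIMS = 3

def buc(tuples, start_dim, cell):
    # Output the aggregate (count) for the current group-by cell.
    print(cell or "(apex)", "->", len(tuples))
    for d in range(start_dim, N_DIMS):
        # Partition the current tuples on dimension d.
        partitions = defaultdict(list)
        for t in tuples:
            partitions[t[d]].append(t)
        for value, part in partitions.items():
            if len(part) >= MIN_SUP:      # Apriori pruning on min_sup
                buc(part, d + 1, cell + [(d, value)])
            # else: this cell and all its descendants are pruned

buc(rows, 0, [])
```

Because count is antimonotonic, any cell failing min_sup cannot have a descendant that satisfies it, which is what justifies the pruning.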
8. Exploring Cube Technology
Sampling Cubes: OLAP-Based Mining on Sampling Data
When collecting data, we often collect only a subset of the data we would ideally like
to gather. In statistics, this is known as collecting a sample of the data population.

Sampling Cube Framework

The sampling cube is a data cube structure that stores the sample data and their
multidimensional aggregates. It supports OLAP on sample data and calculates confidence
intervals as a quality measure for any multidimensional query. Given a sample data relation
(i.e., base cuboid) R, the sampling cube C_R stores the sample data of R together with their
multidimensional aggregates.

The best way to solve the small sample size problem is to get more data. Fortunately,
there is usually an abundance of additional data available in the cube. The data do not
match the query cell exactly; however, we can consider data from cells that are “close
by.” There are two ways to incorporate such data to enhance the reliability of the query
answer: (1) intracuboid query expansion, where we consider nearby cells within the same
cuboid, and (2) intercuboid query expansion, where we consider more general versions
(from parent cuboids) of the query cell.
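
A minimal sketch of the confidence-interval quality measure attached to a query cell; the sample values are illustrative, and a normal z-value is used for simplicity (a small sample would normally use a t-value):

```python
import math

cell_sample = [62.0, 58.5, 71.0, 64.2, 60.3]   # measure values in one cell

n = len(cell_sample)
mean = sum(cell_sample) / n
var = sum((x - mean) ** 2 for x in cell_sample) / (n - 1)   # sample variance
stderr = math.sqrt(var / n)                                  # standard error

z = 1.96                                       # ~95% confidence level
low, high = mean - z * stderr, mean + z * stderr
print(f"mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A wide interval signals that the cell's sample is too small to be reliable, which is exactly when the query expansion strategies above become useful.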

Ranking Cubes: Efficient Computation of Top-k Queries

Ranking cubes contribute to the efficient processing of top-k queries. Instead of returning a
large set of indiscriminate answers to a query, a top-k query (or ranking query) returns only
the best k results according to a user-specified preference.
The results are returned in ranked order so that the best is at the top. The user-specified
preference generally consists of two components: a selection condition and a ranking
function. Top-k queries are common in many applications, such as searching web databases,
k-nearest-neighbor searches with approximate matches, and similarity queries in multimedia
databases.
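
As a toy sketch of a top-k query in Python, the selection condition and ranking function below are illustrative (loosely in the spirit of a used-car search); `heapq.nsmallest` returns the best k results in ranked order:

```python
import heapq

cars = [
    {"model": "A", "price": 9500,  "mileage": 80_000},
    {"model": "B", "price": 10200, "mileage": 60_000},
    {"model": "C", "price": 8900,  "mileage": 95_000},
    {"model": "D", "price": 9900,  "mileage": 55_000},
]

# Ranking function: prefer prices near 10,000 and low mileage
# (lower score is better; the weights are arbitrary assumptions).
def score(car):
    return abs(car["price"] - 10_000) + 0.01 * car["mileage"]

top_k = heapq.nsmallest(2, cars, key=score)   # best k = 2, ranked
print(top_k)
```

A ranking cube precomputes structures so that such queries touch only a small, promising part of the data instead of scoring every tuple, as this naive sketch does.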

9.Multidimensional Data Analysis in Cube Space


Prediction Cubes: Prediction Mining in Cube Space

Recently, researchers have turned their attention toward multidimensional data mining to
uncover knowledge at varying dimensional combinations and granularities. Such mining is
also known as exploratory multidimensional data mining and online analytical mining
(OLAM).

There are at least four ways in which OLAP-style analysis can be fused with data
mining techniques (a toy sketch of the first idea follows this list):
1. Use cube space to define the data space for mining. Each region in cube space
represents a subset of data over which we wish to find interesting patterns. Cube space
is defined by a set of expert-designed, informative dimension hierarchies, not just
arbitrary subsets of data. Therefore, the use of cube space makes the data space both
meaningful and tractable.
2. Use OLAP queries to generate features and targets for mining. The features and even
the targets (that we wish to learn to predict) can sometimes be naturally defined as
OLAP aggregate queries over regions in cube space.
3. Use data mining models as building blocks in a multistep mining process.
Multidimensional data mining in cube space may consist of multiple steps, where data
mining models can be viewed as building blocks that are used to describe the behavior of
interesting data sets, rather than the end results.
4. Use data cube computation techniques to speed up repeated model construction.
Multidimensional data mining in cube space may require building a model for each
candidate data space, which is usually too expensive to be feasible. However, by
carefully sharing computation across model construction for different candidates based
on data cube computation techniques, efficient mining is achievable.
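
As a toy sketch of the first idea, the snippet below treats each cell in a small cube space as a mining unit and attaches the simplest possible "model" (the positive-class rate of a target attribute) to every cell; real prediction cubes build and score genuine models per region, so this is only an illustration:

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["East", "East", "West", "West", "West"],
    "year":   [2023, 2024, 2023, 2024, 2024],
    "bought": [1, 0, 1, 1, 0],     # target we wish to predict
})

# Each (region, year) cell in cube space gets a score built from the
# subset of data falling in that cell.
pred_cube = data.groupby(["region", "year"])["bought"].mean()
print(pred_cube)
```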

Multifeature cubes enable more in-depth analysis. They can compute complex queries whose
measures depend on groupings of multiple aggregates at varying granularity levels. The
queries posed can be much more elaborate and task-specific than traditional queries, and
many complex data mining queries can be answered by multifeature cubes without a
significant increase in computational cost, in comparison to cube computation for simple
queries with traditional data cubes. A toy sketch follows.
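
A toy pandas sketch of a multifeature-style query: for each item, first find the maximum price, then average the quantity sold only over the tuples at that maximum price (a measure that depends on another aggregate); the data and column names are illustrative:

```python
import pandas as pd

sales = pd.DataFrame({
    "item":  ["TV", "TV", "TV", "PC", "PC"],
    "price": [400, 500, 500, 900, 800],
    "qty":   [3, 2, 4, 1, 5],
})

# Dependent aggregate: per-item maximum price, broadcast to each row.
max_price = sales.groupby("item")["price"].transform("max")

# Grouping condition defined by that aggregate: rows at the max price.
at_max = sales[sales["price"] == max_price]

# Final measure: average qty among the max-price tuples of each item.
result = at_max.groupby("item")["qty"].mean()
print(result)
```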

Exception-Based, Discovery-Driven Cube Space Exploration

In a discovery-driven approach to exploring cube space, precomputed measures indicating
data exceptions are used to guide the user in the data analysis process, at all aggregation
levels. We hereafter refer to these measures as exception indicators. Intuitively, an
exception is a data cube cell value that is significantly different from the value
anticipated, based on a statistical model.
The model considers variations and patterns in the measure value across all the dimensions
to which a cell belongs. For example, if the analysis of item-sales data reveals an increase in
sales in December in comparison to all other months, this may seem like an exception in the
time dimension.
However, it is not an exception if the item dimension is considered,
since there is a similar increase in sales for other items during December.

The model considers exceptions hidden at all aggregated group-bys of a data cube. Visual
cues, such as background color, are used to reflect each cell's degree of exception, based
on the precomputed exception indicators. Efficient algorithms have been proposed for
constructing such cubes.
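
A minimal numeric sketch of an exception indicator on one 2-D slice of a cube: a cell's anticipated value is modeled from the overall mean plus row and column effects, and a standardized residual flags the exceptions. The additive model and the data are illustrative simplifications of the statistical models used in practice:

```python
import numpy as np

# Rows: items, columns: months; the 40 is an injected exception.
sales = np.array([[20., 22., 21.],
                  [19., 40., 20.],
                  [21., 23., 22.]])

grand = sales.mean()
row_eff = sales.mean(axis=1, keepdims=True) - grand   # item effects
col_eff = sales.mean(axis=0, keepdims=True) - grand   # month effects
expected = grand + row_eff + col_eff                  # anticipated values

residual = sales - expected
indicator = np.abs(residual) / residual.std()         # degree of exception
print(np.round(indicator, 2))   # the large entry marks the exception cell
```

In a discovery-driven browser, such indicators would be precomputed for all group-bys and rendered as background-color cues to steer the analyst toward anomalous cells.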
