Unit 2
Subject Code: CECS54A
Sem: V
Unit: II
1. Data Warehouse
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather
than transaction processing. It contains historical data derived from transaction data, drawn
from single or multiple sources.
A Data Warehouse provides integrated, enterprise-wide, historical data and focuses on providing
support for decision-makers for data modeling and analysis.
A Data Warehouse is a collection of data specific to the entire organization, not only to a
particular group of users.
It is not used for daily operations and transaction processing, but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
It supports a relatively small number of clients with relatively long interactions.
It includes current and historical data to provide a historical perspective of information.
Its usage is read-intensive.
It contains a few large tables.
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers.
Therefore, data warehouses typically provide a concise and straightforward view of a
particular subject, such as customer, product, or sales, instead of the organization's
ongoing global operations. This is done by excluding data that are not useful with
respect to the subject and including all data needed by the users to understand the
subject.
Integrated
A data warehouse integrates various heterogeneous data sources, such as RDBMSs, flat files,
and online transaction records. Data cleaning and data integration are performed during
warehousing to ensure consistency in naming conventions, attribute types, etc., among the
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even further back from a data warehouse. This
contrasts with a transaction system, where often only the most current data are kept.
Non-Volatile
The data warehouse is a physically separate data store, transformed from the source
operational RDBMS. Operational updates of data do not occur in the data warehouse,
i.e., update, insert, and delete operations are not performed.
It usually requires only two operations in data accessing: initial loading of data and
read access. Therefore, the DW does not require transaction processing, recovery, or
concurrency-control mechanisms, which allows for a substantial speedup of data retrieval.
Non-volatile means that, once entered into the warehouse, data should not change.
2. Data Warehouse Modeling
Data warehouse modeling is the process of designing the schemas of the detailed
and summarized information of the data warehouse. The goal of data warehouse
modeling is to develop a schema describing the reality, or at least a part of it,
which the data warehouse needs to support.
Data warehouse modeling is an essential stage of building a data warehouse for
two main reasons. Firstly, through the schema, data warehouse clients can
visualize the relationships among the warehouse data, to use them with greater
ease.
Secondly, a well-designed schema allows an effective data warehouse structure to
emerge, to help decrease the cost of implementing the warehouse and improve the
efficiency of using it.
Data modeling in data warehouses is different from data modeling in operational
database systems.
The primary function of data warehouses is to support decision support system (DSS)
processes. Thus, the objective of data warehouse modeling is to make the data warehouse
efficiently support complex queries on long-term information.
Operational systems and data warehousing systems contrast as follows:
o Operational systems are designed to support high-volume transaction processing; data
warehousing systems are typically designed to support high-volume analytical processing
(i.e., OLAP).
o Operational systems are usually concerned with current data; data warehousing systems
are usually concerned with historical data.
o Data within operational systems are mainly updated regularly according to need; data
warehouses are non-volatile: new data may be added regularly, but once added the data are
rarely changed.
o Operational systems are designed for real-time business dealings and processes; data
warehousing systems are designed for analysis of business measures by subject area,
categories, and attributes.
o Operational systems are optimized for a simple set of transactions, generally adding or
retrieving a single row at a time per table; data warehousing systems are optimized for
bulk loads and large, complex, unpredictable queries that access many rows per table.
o Operational systems are optimized for validation of incoming information during
transactions and use validation data tables; data warehouses are loaded with consistent,
valid information and require no real-time validation.
o Operational systems support thousands of concurrent clients; data warehousing systems
support a few concurrent clients relative to OLTP.
o Operational systems are widely process-oriented; data warehousing systems are widely
subject-oriented.
o Operational systems are usually optimized to perform fast inserts and updates of
relatively small volumes of data; data warehousing systems are usually optimized to
perform fast retrievals of relatively large volumes of data.
o Relational databases are created for On-Line Transaction Processing (OLTP); data
warehouses are designed for On-Line Analytical Processing (OLAP).
OLTP System
OLTP systems deal with operational data, i.e., data involved in the operation of a
particular system; for example, ATM transactions and bank transactions.
OLAP System
OLAP systems deal with historical or archival data, i.e., data accumulated over a long
period. For example, if we collect the last 10 years of information about flight
reservations, the data can reveal much meaningful information, such as trends in
reservations. This may provide useful insights, such as peak travel times and what kinds
of people travel in the various classes (Economy/Business).
The major difference between an OLTP and an OLAP system is the amount of data analyzed in
a single transaction. An OLTP system handles many concurrent users and queries touching
only an individual record or limited groups of records at a time, whereas an OLAP system
must have the capability to operate on millions of records to answer a single query.
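To make the contrast concrete, here is a minimal Python sketch (using an in-memory SQLite table with made-up rows; all names are illustrative) of the two access patterns: an OLTP-style single-record lookup versus an OLAP-style aggregation that scans many rows to answer one query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, customer TEXT, item TEXT, amount REAL, year INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
    [(1, "alice", "laptop", 900.0, 2022),
     (2, "bob", "phone", 400.0, 2023),
     (3, "alice", "phone", 350.0, 2023)],
)

# OLTP-style access: touch a single record, identified by its key.
print(conn.execute("SELECT * FROM sales WHERE id = ?", (2,)).fetchone())

# OLAP-style access: scan and aggregate many rows for one analytical answer.
print(conn.execute(
    "SELECT year, item, SUM(amount) FROM sales GROUP BY year, item"
).fetchall())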
The following features further distinguish OLTP and OLAP systems:
Users: OLTP serves clerks, clients, and information technology professionals; OLAP serves
knowledge workers, including managers, executives, and analysts.
Data contents: An OLTP system manages current data that are typically too detailed to be
easily used for decision making. An OLAP system manages a large amount of historical
data, provides facilities for summarization and aggregation, and stores and manages data
at different levels of granularity, which makes the data easier to use in informed
decision making.
Database design: An OLTP system usually uses an entity-relationship (ER) data model and
an application-oriented database design; an OLAP system typically uses either a star or a
snowflake model and a subject-oriented database design.
View: An OLTP system focuses primarily on the current data within an enterprise or
department, without referring to historical information or data in different
organizations. An OLAP system often spans multiple versions of a database schema, due to
the evolutionary process of an organization; OLAP systems also deal with data that
originate from various organizations, integrating information from many data stores.
Volume of data: OLTP data volumes are not very large; because of their large volume,
OLAP data are stored on multiple storage media.
Access patterns: The access patterns of an OLTP system consist mainly of short, atomic
transactions, so such a system requires concurrency control and recovery techniques.
Accesses to OLAP systems are mostly read-only operations, because these data warehouses
store historical data.
Inserts and updates: OLTP systems see short, fast inserts and updates initiated by end
users; in OLAP systems, periodic long-running batch jobs refresh the data.
A data cube is defined by dimensions and facts. Facts are generally numeric quantities,
which are used for analyzing the relationships between dimensions.
Example: In a 2-D representation, we look at the AllElectronics sales data for items
sold per quarter in the city of Vancouver. The measure displayed is dollars_sold (in
thousands).
3-Dimensional Cuboids
Suppose we would like to view the sales data with a third dimension. For example,
suppose we view the data according to time and item, as well as location, for the
cities Chicago, New York, Toronto, and Vancouver. The measure displayed is dollars_sold
(in thousands). These 3-D data are shown in the table; the 3-D data of the table are
represented as a series of 2-D tables.
Conceptually, we may also represent the same data in the form of a 3-D data cube, as
shown in the figure.
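As a rough illustration, the following Python sketch (pandas, with made-up dollars_sold figures, not the actual AllElectronics numbers) builds the 2-D view for Vancouver and the 3-D view as a series of 2-D tables, one block of columns per city.

import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "item":     ["home_ent", "home_ent", "computer", "computer", "computer", "home_ent"],
    "location": ["Vancouver", "Vancouver", "Vancouver", "Toronto", "Toronto", "Toronto"],
    "dollars_sold": [605, 680, 825, 952, 746, 854],   # illustrative values
})

# 2-D view: items sold per quarter, for the city of Vancouver only.
view_2d = sales[sales.location == "Vancouver"].pivot_table(
    index="time", columns="item", values="dollars_sold", aggfunc="sum")
print(view_2d)

# 3-D view: the same data with location as a third dimension, shown as a
# series of 2-D tables (one block of columns per city).
view_3d = sales.pivot_table(index="time", columns=["location", "item"],
                            values="dollars_sold", aggfunc="sum")
print(view_3d)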
In data warehousing, the data cubes are n-dimensional. The cuboid which holds the lowest level
of summarization is called a base cuboid.
For example, the 4-D cuboid in the figure is the base cuboid for the given time, item, location,
and supplier dimensions.
The figure shows a 4-D data cube representation of sales data, according to the dimensions
time, item, location, and supplier. The measure displayed is dollars_sold (in thousands).
The topmost 0-D cuboid, which holds the highest level of summarization, is known as the apex
cuboid. In this example, this is the total sales, or dollars sold, summarized over all four
dimensions.
The lattice of cuboids forms a data cube. The figure shows the lattice of cuboids making
up a 4-D data cube for the dimensions time, item, location, and supplier. Each cuboid
represents a different degree of summarization.
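The lattice itself is easy to enumerate: an n-dimensional cube has 2^n cuboids, one per subset of the dimensions. A short Python sketch for the four dimensions named above:

from itertools import combinations

dimensions = ["time", "item", "location", "supplier"]

# One cuboid per subset of dimensions: 2^4 = 16 cuboids in the lattice.
cuboids = [combo for k in range(len(dimensions) + 1)
           for combo in combinations(dimensions, k)]

print(len(cuboids))   # 16
print(cuboids[0])     # () -> the apex (0-D) cuboid: total over all dimensions
print(cuboids[-1])    # ('time', 'item', 'location', 'supplier') -> base cuboid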
2.2 What is OLAP (Online Analytical Processing)?
OLAP implements the multidimensional analysis of business information and supports the
capability for complex calculations, trend analysis, and sophisticated data modeling. It is
rapidly becoming the essential foundation for intelligent solutions, including Business
Performance Management, Planning, Budgeting, Forecasting, Financial Reporting, Analysis,
Simulation Models, Knowledge Discovery, and Data Warehouse Reporting. OLAP enables
end users to perform ad hoc analysis of data in multiple dimensions, providing the insight
and understanding they require for better decision making.
Typical OLAP application areas include:
o Budgeting
o Activity-based costing
o Promotion analysis
o Customer analysis
Production
o Production planning
o Defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model
more intuitive to them than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using
tabular models.
Fundamentally, OLAP has a very simple concept. It pre-calculates most of the queries that are
typically very hard to execute over tabular databases, namely aggregation, joining, and grouping.
These queries are calculated during a process that is usually called 'building' or 'processing' of
the OLAP cube. This process typically happens overnight, and by the time end users get to
work, the data will have been updated.
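The following Python sketch (pandas, with a made-up fact table) illustrates this build-once, query-fast idea: the group-bys are precomputed in an offline "processing" step, and a query then becomes a simple lookup rather than a scan.

import pandas as pd

fact = pd.DataFrame({
    "item": ["phone", "phone", "laptop", "laptop"],
    "city": ["Vancouver", "Toronto", "Vancouver", "Toronto"],
    "dollars_sold": [400, 350, 900, 1100],
})

# "Processing" the cube: aggregations computed offline (e.g., overnight).
cube = {
    ("item",):        fact.groupby(["item"])["dollars_sold"].sum(),
    ("city",):        fact.groupby(["city"])["dollars_sold"].sum(),
    ("item", "city"): fact.groupby(["item", "city"])["dollars_sold"].sum(),
}

# Query time: answered from the precomputed aggregate, no join or group-by.
print(cube[("item",)]["laptop"])                    # 2000
print(cube[("item", "city")]["phone", "Toronto"])   # 350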
A data warehouse system commonly adopts a three-tier architecture:
A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS.
It may include several specialized data marts and a metadata repository.
Data from operational databases and external sources (such as user profile data provided by
external consultants) are extracted using application program interfaces known as gateways.
A gateway is provided by the underlying DBMS and allows client programs to generate SQL
code to be executed at a server.
Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object
Linking and Embedding, Database), by Microsoft, and JDBC (Java Database Connectivity).
A middle-tier that consists of an OLAP server for fast querying of the data warehouse. The
OLAP server is typically implemented using either:
(1) A Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps functions
on multidimensional data to standard relational operations.
(2) A Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly
implements multidimensional data and operations.
A top-tier that contains front-end tools for displaying results provided by OLAP, as well as
additional tools for data mining of the OLAP-generated data.
The metadata repository stores information that defines DW objects, including parameters
and information for the middle- and top-tier applications.
A data warehouse is a single data repository where data from multiple sources are
integrated for online business analytical processing (OLAP). This implies a data warehouse
needs to meet the requirements from all the business stages within the entire organization. Thus,
data warehouse design is a hugely complex, lengthy, and hence error-prone process.
Furthermore, business analytical functions change over time, which results in changes in the
requirements for the systems. Therefore, data warehouse and OLAP systems are dynamic, and
the design process is continuous.
Data warehouse design takes an approach different from view materialization in industry.
It sees data warehouses as database systems with particular needs, such as answering
management-related queries. The target of the design becomes how the data from multiple
sources should be extracted, transformed, and loaded (ETL) to be organized in a database
as the data warehouse.
1. "top-down" approach
2. "bottom-up" approach
Top-down Design Approach
The data warehouse stores "atomic" information, the data at the lowest level of granularity, from
where dimensional data marts can be built by selecting the data required for specific business
subjects or particular departments.
This is a data-driven approach, as the information is gathered and integrated first, and
the business requirements by subjects for building data marts are formulated afterwards.
The advantage of this method is that it supports a single integrated data source; thus,
data marts built from it will be consistent where they overlap.
Developing new data mart from the data warehouse is very easy.
In the "Bottom-Up" approach, a data warehouse is described as "a copy of transaction data
specifical architecture for query and analysis," term the star schema. In this approach, a data mart
is created first to necessary reporting and analytical capabilities for particular business processes
(or subjects). Thus it is needed to be a business-driven approach in contrast to Inmon's data-
driven approach.
Data marts contain the lowest-grain data and, if needed, aggregated data too. Instead of a
normalized database for the data warehouse, a denormalized dimensional database is adopted
to meet the data delivery requirements of data warehouses. Using this method, to use the
set of data marts as the enterprise data warehouse, data marts should be built with
conformed dimensions in mind, meaning that common objects are represented in the same way
in different data marts. The conformed dimensions connect the data marts to form a data
warehouse, which is generally called a virtual data warehouse.
The advantage of the "bottom-up" design approach is that it has quick ROI, as developing a data
mart, a data warehouse for a single subject, takes far less time and effort than developing an
enterprise-wide data warehouse. Also, the risk of failure is even less. This method is inherently
incremental. This method allows the project team to learn and grow.
Advantages of bottom-up design
New data marts are simply developed and then integrated with the existing data marts.
Note that the positions of the data warehouse and the data marts are reversed in the
bottom-up design approach.
Top-down design vs. bottom-up design:
o Top-down: breaks the vast problem into smaller subproblems. Bottom-up: solves the
essential low-level problems first and integrates them into a higher one.
o Top-down: inherently architected, not a union of several data marts. Bottom-up:
inherently incremental; the most essential data marts can be scheduled first.
o Top-down: may show quick results if implemented with iterations. Bottom-up: less risk
of failure, favorable return on investment, and proof of techniques.
OLAP servers use three storage modes: ROLAP, MOLAP, and HOLAP.
ROLAP (Relational Online Analytical Processing): The ROLAP storage mode causes the
aggregations of the partition to be stored in indexed views in the relational database
that was specified in the partition's data source. ROLAP does not cause a copy of the
source data to be stored in the Analysis Services data folders; instead, when the result
cannot be derived from the query cache, the indexed views in the data source are accessed
to answer queries. Query response is frequently slower with ROLAP storage than with the
MOLAP or HOLAP storage modes, and processing time is also frequently slower.
MOLAP (Multidimensional Online Analytical Processing): The MOLAP storage mode causes the
aggregations of the partition, and a copy of its source data, to be stored in a
multidimensional structure in Analysis Services when the partition is processed. This
MOLAP structure is highly optimized to maximize query performance. The storage can be on
the computer where the partition is defined or on another computer running Analysis
Services. Because a copy of the source data resides in the multidimensional structure,
queries can be resolved without accessing the partition's source data. Query response
times can be reduced substantially by using aggregations; however, the data in the
partition's MOLAP structure is only as current as the most recent processing of the
partition.
HOLAP (Hybrid Online Analytical Processing): The HOLAP storage mode combines attributes
of both MOLAP and ROLAP. Like MOLAP, HOLAP causes the aggregations of the partition to be
stored in a multidimensional structure in an SQL Server Analysis Services instance, but
it does not cause a copy of the source data to be stored. For queries that access only
summary data in the aggregations of a partition, HOLAP is the equivalent of MOLAP.
Queries that access source data, for example a drill-down to an atomic cube cell for
which there is no aggregation, must retrieve data from the relational database and will
not be as fast as they would be if the source data were stored in the MOLAP structure.
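A toy Python sketch of the hybrid idea (illustrative only; a real Analysis Services instance manages this internally): summary queries are served from precomputed aggregations, MOLAP-style, while atomic drill-downs fall back to the relational source, ROLAP-style.

import pandas as pd

source = pd.DataFrame({
    "item": ["phone", "phone", "laptop"],
    "city": ["Vancouver", "Toronto", "Vancouver"],
    "dollars_sold": [400, 350, 900],
})

# Precomputed (MOLAP-style) aggregation kept by the server.
agg_by_item = source.groupby("item")["dollars_sold"].sum()

def holap_query(item, city=None):
    if city is None:                      # summary hit: served from aggregations
        return agg_by_item[item]
    # atomic drill-down: no aggregation covers it, go back to the source rows
    rows = source[(source.item == item) & (source.city == city)]
    return rows["dollars_sold"].sum()

print(holap_query("phone"))               # fast: from the precomputed aggregate
print(holap_query("phone", "Toronto"))    # slower: scans the relational source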
5. Data Generalization
What is data generalization?
Data generalization allows you to replace a data value with a less precise one using a few
different techniques, which preserves data utility and protects against some types of
attacks that could lead to re-identification of individuals or reveal private information
unintentionally.
It is a process that abstracts a large set of task-relevant data in a database from a low
conceptual level to higher ones.
Data Generalization is a summarization of general features of objects in a target class and
produces what is called characteristic rules.
The data relevant to a user-specified class are typically retrieved by a database query and
run through a summarization module to extract the essence of the data at different levels
of abstraction.
For example, one may want to characterize the "OurVideoStore" customers who
regularly rent more than 30 movies a year. With concept hierarchies on the attributes
describing the target class, the attribute-oriented induction method can be used, for
example, to carry out data summarization.
Attribute-Oriented Induction
The Attribute-Oriented Induction (AOI) approach to data generalization and
summarization-based characterization was first proposed in 1989 (KDD '89 workshop), a
few years before the introduction of the data cube approach.
The data cube approach can be considered a data warehouse-based, precomputation-oriented,
materialized approach.
It performs off-line aggregation before an OLAP or data mining query is submitted for
processing.
On the other hand, the attribute-oriented induction approach, at least in its initial
proposal, is a relational database query-oriented, generalization-based, online data
analysis technique.
However, there is no inherent barrier distinguishing the two approaches based on online
aggregation versus offline precomputation.
Some aggregations in the data cube can be computed on-line, while off-line
precomputation of multidimensional space can speed up attribute-oriented induction as
well.
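A minimal Python sketch of the AOI idea, with an invented concept hierarchy and invented data: low-level attribute values are replaced by higher-level concepts, and identical generalized tuples are then merged with a count.

from collections import Counter

# Invented concept hierarchies: city -> country, age -> age bracket.
city_to_country = {"Vancouver": "Canada", "Toronto": "Canada", "Chicago": "USA"}

def generalize_age(age):
    return "young" if age < 30 else "middle_aged" if age < 60 else "senior"

customers = [("Vancouver", 25), ("Toronto", 41), ("Chicago", 28), ("Toronto", 35)]

# Climb the hierarchies, then merge identical generalized tuples and count.
generalized = Counter(
    (city_to_country[city], generalize_age(age)) for city, age in customers)

for tup, count in generalized.items():
    print(tup, "count =", count)   # e.g., ('Canada', 'middle_aged') count = 2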
A data cube may also be seen as the multidimensional extension of a two-dimensional
table. It can be viewed as a group of similar 2-D tables stacked on top of one another.
Data cubes are used to represent data that are too complex for a simple table of columns
and rows to describe. Data in multidimensional matrices, called data cubes, are grouped
or combined. There are a few alternate names for the data cube approach, such as
"multidimensional databases," "materialized views," and "OLAP (On-Line Analytical
Processing)."
A data cube is generated from a subset of the attributes in the database. Particular
attributes are chosen as dimension attributes, i.e., attributes whose values are of
interest. Other attributes are selected as measures; the values of the measures are
aggregated according to the dimensions.
For instance, a company XYZ can create a sales data warehouse to keep records of the
store's sales for the dimensions time, item, branch, and location. For example, the item
dimension table may contain the attributes item name, brand, and type.
The data cube technique, with its many implementations, is a fascinating method. In
several cases, data cubes may be sparse: not every cell in each dimension has matching
data in the database.
It is possible to display the aggregated and summarized facts with variables or
attributes; this is where OLAP plays its role.
Data cubes are widely used for simple data analysis. They represent quantities of
business interest along with dimensions; each cube dimension reflects some characteristic
of the database, such as revenue per day, month, or year.
Aggregation Strategy
1. Partition the array into chunks. A chunk is a small sub-cube that fits in memory.
2. Data addressing: each cell is addressed by a chunk id and an offset.
3. Multi-way aggregation: aggregates are computed simultaneously in a multi-way fashion,
visiting the chunks in an order that
o minimizes memory access, and
o minimizes memory space.
Example
Suppose the size of the array on dimensions A, B, and C is 40, 400, and 4,000,
respectively, and each dimension is partitioned into 4 equal chunks, giving 64 chunks in
total. The minimum memory required when traversing the chunks in the order 1, 2, 3, ...,
64 is 100 × 1,000 (one chunk of the BC plane) + 40 × 1,000 (one row of the AC plane) +
40 × 400 (the whole AB plane) = 156,000 memory units.
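A scaled-down Python/NumPy sketch of the chunking idea (dimension sizes reduced from 40/400/4,000 so it runs quickly; the traversal logic is the same): the three 2-D plane aggregates are accumulated simultaneously while visiting one chunk at a time.

import numpy as np

A, B, C = 4, 40, 400             # scaled-down stand-ins for 40, 400, 4,000
cube = np.random.rand(A, B, C)   # the base cuboid ABC
n = 4                            # 4 chunks per dimension -> 64 chunks

ab = np.zeros((A, B))            # the three 2-D plane aggregates
ac = np.zeros((A, C))
bc = np.zeros((B, C))

a, b, c = A // n, B // n, C // n    # chunk edge lengths
for i in range(n):                  # visit the 64 chunks one at a time
    for j in range(n):
        for k in range(n):
            chunk = cube[i*a:(i+1)*a, j*b:(j+1)*b, k*c:(k+1)*c]
            # each chunk contributes to all three planes simultaneously
            ab[i*a:(i+1)*a, j*b:(j+1)*b] += chunk.sum(axis=2)
            ac[i*a:(i+1)*a, k*c:(k+1)*c] += chunk.sum(axis=1)
            bc[j*b:(j+1)*b, k*c:(k+1)*c] += chunk.sum(axis=0)

assert np.allclose(ab, cube.sum(axis=2))   # same result as a full-array sum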
Summary of the Multi-Way Method
Method
o Cuboids should be sorted and computed according to the data sizes of the dimensions.
o Keep the smallest plane fully in main memory; for the largest plane, fetch and compute
only one chunk at a time.
Limitations
o Requires full materialization.
o Computes well only for a small number of dimensions (for high-dimensional data,
partial materialization is preferred).
The best way to solve the small-sample-size problem, when a query cell contains too few
data points, is to get more data. Fortunately, there is usually an abundance of additional
data available in the cube. These data do not match the query cell exactly; however, we
can consider data from cells that are "close by." There are two ways to incorporate such
data to enhance the reliability of the query answer: (1) intracuboid query expansion,
where we consider nearby cells within the same cuboid, and (2) intercuboid query
expansion, where we consider more general versions (from parent cuboids) of the query cell.
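A toy Python sketch of intracuboid expansion (invented cells and values): if the queried cell holds too few records, values from semantically nearby cells in the same cuboid are pooled in until the sample is large enough.

cuboid = {                         # cells of one cuboid: age group -> values
    "20-29": [3.1, 2.9],           # too few samples for a reliable answer
    "30-39": [3.4, 3.2, 3.3, 3.5],
    "40-49": [3.0, 3.1, 2.8],
}

def expanded_mean(query, neighbors, min_n=4):
    values = list(cuboid[query])
    for cell in neighbors:         # pull in "close by" cells, nearest first
        if len(values) >= min_n:
            break
        values.extend(cuboid.get(cell, []))
    return sum(values) / len(values)

print(expanded_mean("20-29", neighbors=["30-39", "40-49"]))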
The ranking cube supports the efficient processing of top-k queries. Instead of returning
a large set of indiscriminate answers to a query, a top-k query (or ranking query) returns
only the best k results according to a user-specified preference.
The results are returned in ranked order so that the best is at the top. The
user-specified preference generally consists of two components: a selection condition and
a ranking function. Top-k queries are common in many applications, such as searching web
databases, k-nearest-neighbor searches with approximate matches, and similarity queries
in multimedia databases.
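In code form, a top-k query is just a selection condition plus a ranking function, with only the k best results returned in order. A minimal Python sketch with invented product records:

import heapq

products = [
    {"name": "p1", "price": 300, "rating": 4.1},
    {"name": "p2", "price": 550, "rating": 4.8},
    {"name": "p3", "price": 420, "rating": 4.6},
    {"name": "p4", "price": 280, "rating": 3.9},
]

def top_k(records, k, condition, rank):
    # keep only the k best qualifying records, highest rank first
    return heapq.nlargest(k, (r for r in records if condition(r)), key=rank)

# "the best 2 products under $500", ranked by rating
for r in top_k(products, 2, lambda r: r["price"] < 500, lambda r: r["rating"]):
    print(r["name"], r["rating"])    # p3 4.6, then p1 4.1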
Recently, researchers have turned their attention toward multidimensional data mining to
uncover knowledge at varying dimensional combinations and granularities. Such mining is
also known as exploratory multidimensional data mining and online analytical data mining
(OLAM).
There are at least four ways in which OLAP-style analysis can be fused with data
mining techniques:
1. Use cube space to define the data space for mining. Each region in cube space represents
a subset of data over which we wish to find interesting patterns. Cube space
is defined by a set of expert-designed, informative dimension hierarchies, not just
arbitrary subsets of data. Therefore, the use of cube space makes the data space both
meaningful and tractable.
2. Use OLAP queries to generate features and targets for mining. The features, and even
the targets (that we wish to learn to predict), can sometimes be naturally defined as
OLAP aggregate queries over regions in cube space (see the sketch after this list).
3. Use data mining models as building blocks in a multistep mining process.
Multidimensional data mining in cube space may consist of multiple steps, where data
mining models can be viewed as building blocks that are used to describe the behavior of
interesting data sets, rather than the end results.
4. Use data cube computation techniques to speed up repeated model construction.
Multidimensional data mining in cube space may require building a model for each
candidate data space, which is usually too expensive to be feasible. However, by
carefully sharing computation across model construction for different candidates based
on data cube computation techniques, efficient mining is achievable.
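As a small illustration of point 2 above, the following Python sketch (pandas, invented transaction data) derives per-customer features, and even a target, as OLAP-style aggregates over regions of cube space:

import pandas as pd

txns = pd.DataFrame({
    "customer": ["a", "a", "b", "b", "b"],
    "month":    [1, 2, 1, 2, 3],
    "amount":   [120.0, 80.0, 40.0, 55.0, 60.0],
})

# Features: aggregate queries over the (customer) region of cube space.
features = txns.groupby("customer")["amount"].agg(
    total="sum", avg="mean", n_txns="count")

# The target can itself be an aggregate, e.g., spend in the latest month.
target = (txns[txns.month == txns.month.max()]
          .groupby("customer")["amount"].sum().rename("latest_spend"))

print(features.join(target, how="left").fillna(0))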
Multifeature cubes enable more in-depth analysis. They can compute more complex queries,
in which the measures depend on groupings of multiple aggregates at varying granularity
levels.
The queries posed can be much more elaborate and task-specific than traditional queries,
as we shall illustrate in the following examples.
Many complex data mining queries can be answered by multifeature cubes without a
significant increase in computational cost, in comparison to cube computation for simple
queries with traditional data cubes.
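A minimal Python sketch of a multifeature-style query (pandas, invented rows): for each item, find the maximum price, then sum units only over the tuples at that maximum price, a measure that depends on another aggregate within the same grouping.

import pandas as pd

sales = pd.DataFrame({
    "item":  ["tv", "tv", "tv", "phone", "phone"],
    "price": [500, 700, 700, 300, 350],
    "units": [10, 4, 6, 20, 8],
})

def max_price_sales(group):
    max_price = group["price"].max()              # first aggregate
    at_max = group[group["price"] == max_price]   # dependent sub-grouping
    return pd.Series({"max_price": max_price,
                      "units_at_max_price": at_max["units"].sum()})

print(sales.groupby("item").apply(max_price_sales))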
Precomputed measures indicating data exceptions can guide the user in the data analysis
process, at all aggregation levels. We hereafter refer to these measures as exception
indicators. Intuitively, an exception is a data cube cell value that is significantly
different from the value anticipated based on a statistical model.
The model considers variations and patterns in the measure value across all the dimensions
to which a cell belongs. For example, if the analysis of item-sales data reveals an increase in
sales in December in comparison to all other months, this may seem like an exception in the
time dimension.
However, it is not an exception if the item dimension is considered,
since there is a similar increase in sales for other items during December.
The model considers exceptions hidden at all aggregated group-by’s of a data cube.
Visual cues, such as background color, are used to reflect each cell’s degree of exception,
based on the precomputed exception indicators. Efficient algorithms have been proposed
for cube construction.
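A small Python/NumPy sketch of the idea (an additive row/column model standing in for the statistical model; the data are invented): each cell's anticipated value comes from the row, column, and grand means, and cells with large standardized residuals are flagged as exceptions.

import numpy as np

sales = np.array([[50., 52., 80.],     # rows: items, columns: months
                  [48., 50., 78.],
                  [51., 53., 30.]])    # last cell breaks the shared trend

# Anticipated value per cell from an additive row/column model.
expected = (sales.mean(axis=1, keepdims=True)
            + sales.mean(axis=0, keepdims=True)
            - sales.mean())
residual = sales - expected
score = np.abs(residual) / residual.std()   # exception indicator per cell

print(np.round(score, 2))
print("exceptions:", np.argwhere(score > 1.5))   # flags the [2, 2] cell only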