UNIT III
3.1 What Is a Data Warehouse?
The term "data warehouse" was coined by Bill Inmon in 1990, who defined it in the following way: "A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." He defined the terms in this sentence as follows:
Subject-oriented: Data is organized around the major subjects of the enterprise (such as customers, products and sales) rather than around individual applications.
Integrated: Data is gathered into the data warehouse from a variety of sources and merged into a coherent whole.
Time-variant: All data in the data warehouse is identified with a particular time
period.
Non-volatile: Data is stable in a data warehouse. More data is added but data is never
removed.
Data Mart: A departmental subset of the warehouse that focuses on selected subjects. A data mart is a segment of a data warehouse that provides data for reporting and analysis on a section, unit, department or operation of the company, e.g. sales, payroll or production. Data marts are sometimes complete, individual data warehouses, usually smaller than the corporate data warehouse.
• Operational Data:
Focuses on transactional functions such as bank card withdrawals and deposits
Detailed
Updateable
Reflects current data
• Informational Data:
Focuses on providing answers to problems posed by decision makers
Summarized
Non-updateable
Data Warehouse Characteristics
The major distinguishing features between OLTP and OLAP systems are summarized as follows.
2. Data contents: An OLTP system manages current data that, typically, are too detailed to be easily used for decision making. An OLAP system manages large amounts of historical data, provides facilities for summarization and aggregation, and stores and manages information at different levels of granularity. These features make the data easier to use for informed decision making.
4. View: An OLTP system focuses mainly on the current data within an enterprise or
department, without referring to historical data or data in different organizations. In
contrast, an OLAP system often spans multiple versions of a database schema. OLAP
systems also deal with information that originates from different organizations,
integrating information from many data stores. Because of their huge volume, OLAP
data are stored on multiple storage media.
5. Access patterns: The access patterns of an OLTP system consist mainly of short,
atomic transactions. Such a system requires concurrency control and recovery
mechanisms. Accesses to OLAP systems, in contrast, are mostly read-only operations, although many of them may be complex queries.
Star schema: The star schema is a modeling paradigm in which the data
warehouse contains (1) a large central table (fact table), and (2) a set of
smaller attendant tables (dimension tables), one for each dimension. The
schema graph resembles a starburst, with the dimension tables displayed in a
radial pattern around the central fact table.
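For illustration, a minimal star schema for a sales warehouse could be declared as follows. This is a sketch: the table and column names (sales_fact, time_dim, item_dim, location_dim and their columns) are assumed here, not taken from a specific figure in these notes.

    -- Dimension tables: one per dimension, each with its own surrogate key.
    CREATE TABLE time_dim     (time_key     INT PRIMARY KEY, day DATE, quarter VARCHAR(2), year INT);
    CREATE TABLE item_dim     (item_key     INT PRIMARY KEY, item_name VARCHAR(50), brand VARCHAR(30), type VARCHAR(30));
    CREATE TABLE location_dim (location_key INT PRIMARY KEY, city VARCHAR(40), country VARCHAR(40));

    -- Central fact table: foreign keys to every dimension plus the numeric measures.
    CREATE TABLE sales_fact (
        time_key      INT REFERENCES time_dim(time_key),
        item_key      INT REFERENCES item_dim(item_key),
        location_key  INT REFERENCES location_dim(location_key),
        dollars_sold  DECIMAL(12,2),
        units_sold    INT
    );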
3. Slice and dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The figure shows a slice operation in which the sales data are selected from the central cube for the dimension time using the criterion time = "Q2". The dice operation defines a subcube by performing a selection on two or more dimensions (see the SQL sketch after this list).
4. Pivot (rotate): Pivot is a visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. The figure shows a pivot operation in which the item and location axes in a 2-D slice are rotated.
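Against a star schema such as the one sketched earlier, slice and dice reduce to adding selection predicates on the dimension tables. A sketch, assuming the same illustrative table and column names (the city values are likewise only examples):

    -- Slice: fix one dimension (time = 'Q2') and aggregate over the rest.
    SELECT l.city, i.item_name, SUM(f.dollars_sold) AS sales
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN item_dim i     ON f.item_key = i.item_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE t.quarter = 'Q2'
    GROUP BY l.city, i.item_name;

    -- Dice: select on two or more dimensions at once.
    SELECT t.quarter, SUM(f.dollars_sold) AS sales
    FROM sales_fact f
    JOIN time_dim t     ON f.time_key = t.time_key
    JOIN location_dim l ON f.location_key = l.location_key
    WHERE t.quarter IN ('Q1', 'Q2') AND l.city IN ('Chennai', 'Mumbai')
    GROUP BY t.quarter;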
From the architecture point of view, there are three data warehouse models: the
enterprise warehouse, the data mart, and the virtual warehouse.
(i) Independent data marts are sourced from data captured from one or
more operational systems or external information providers, or from data
generated locally within a particular department or geographic area.
EXTRACT
Some of the data elements in the operational database can reasonably be expected to be useful in decision making, but others are of less value for that purpose. For this reason it is necessary to extract the relevant data from the operational database before bringing it into the data warehouse. Many commercial tools are available to help with the extraction process; Data Junction is one such product. The user of one of these tools typically has an easy-to-use windowed interface with which to specify the following (an example extraction query is sketched in SQL after this list):
(i) Which files and tables are to be accessed in the source database.
(ii) Which fields are to be extracted from them. This is often done internally by an SQL SELECT statement.
(iii) What the extracted fields are to be called in the resulting database.
(iv) The target machine and database format of the output.
(v) The schedule on which the extraction process should be repeated.
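A minimal sketch of such an extraction step, assuming a hypothetical operational table customer_orders (the table, column and status names, the date cut-off and the target names are all illustrative; scheduling is left to the extraction tool itself):

    -- (i)-(iii): pick the source table, the fields of interest, and their target names.
    SELECT o.order_id     AS sale_id,
           o.customer_id  AS cust_key,
           o.order_date   AS sale_date,
           o.total_amount AS sale_amount
    FROM customer_orders o
    -- Extract only the data relevant to decision making, e.g. completed orders.
    WHERE o.status = 'COMPLETED'
      AND o.order_date >= DATE '2024-01-01';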
TRANSFORM
The operational databases may have been developed based on differing sets of priorities, which keep changing with the requirements. Those who develop a data warehouse based on these databases are therefore typically faced with inconsistencies among their data sources. The transformation process deals with rectifying such inconsistencies, if any.
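As a sketch of such a transformation (the staging table staging_customers, its columns and the code values are assumed for illustration), inconsistent codes and units can be mapped to a single warehouse-wide representation during the SELECT:

    -- Standardize inconsistent gender codes ('0'/'1' in one source, 'F'/'M' in another)
    -- and inconsistent units (cents vs. dollars) into one warehouse convention.
    SELECT cust_key,
           CASE gender
                WHEN '0' THEN 'F'
                WHEN '1' THEN 'M'
                ELSE gender          -- already stored as 'F'/'M'
           END                       AS gender,
           amount_in_cents / 100.0   AS amount_in_dollars
    FROM staging_customers;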
CLEANSING
Information quality is the key consideration in determining the value of the information. The developer of the data warehouse is not usually in a position to change the quality of the underlying historic data, though a data warehousing project can put a spotlight on data quality issues and lead to improvements for the future. It is therefore usually necessary to go through the data entered into the data warehouse and make it as error-free as possible. This process is known as data cleansing.
Data cleansing must deal with many types of possible errors. These include missing data and incorrect data at a single source, and inconsistent or conflicting data when two or more sources are involved. Several algorithms are used to clean the data; they are discussed in the coming lecture notes.
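A simple sketch of the kinds of checks a cleansing step performs, assuming a hypothetical staging table stg_customer with columns name, bdate, age and city (all names here are illustrative):

    -- Missing data: rows with no usable name or birth date.
    SELECT * FROM stg_customer
    WHERE name IS NULL OR bdate IS NULL;

    -- Incorrect data: values outside the permissible range.
    SELECT * FROM stg_customer
    WHERE bdate > CURRENT_DATE OR age NOT BETWEEN 0 AND 120;

    -- Conflicting data across rows that claim to describe the same person.
    SELECT name, COUNT(DISTINCT city) AS distinct_cities
    FROM stg_customer
    GROUP BY name
    HAVING COUNT(DISTINCT city) > 1;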
LOADING
Loading often implies physical movement of the data from the computer(s) storing
the source database(s) to that which will store the data warehouse database, assuming
it is different. This takes place immediately after the extraction phase. The most
common channel for data movement is a high-speed communication link. For example, Oracle Warehouse Builder is an Oracle tool that provides features for performing the ETL tasks that populate an Oracle data warehouse.
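In SQL terms, the load step often boils down to a bulk INSERT ... SELECT from the cleansed staging area into the warehouse tables. A sketch under the star schema assumed earlier (stg_sales_clean is an assumed staging table name):

    -- Move cleansed, transformed rows from staging into the warehouse fact table.
    INSERT INTO sales_fact (time_key, item_key, location_key, dollars_sold, units_sold)
    SELECT s.time_key, s.item_key, s.location_key, s.dollars_sold, s.units_sold
    FROM stg_sales_clean s;

    -- For large volumes, bulk-load utilities (e.g. SQL*Loader, COPY) or ETL tools
    -- are typically used instead of row-by-row INSERTs.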
The data quality of a source largely depends on the degree to which it is governed by
schema and integrity constraints controlling permissible data values. For sources
without schema, such as files, there are few restrictions on what data can be entered
and stored, giving rise to a high probability of errors and inconsistencies. Database
systems, on the other hand, enforce restrictions of a specific data model (e.g., the
relational approach requires simple attribute values, referential integrity, etc.) as well
as application-specific integrity constraints. Schema-related data quality problems
thus occur because of the lack of appropriate model-specific or application-specific
integrity constraints, e.g., due to data model limitations or poor schema design, or
because only a few integrity constraints were defined to limit the overhead for
integrity control. Instance-specific problems relate to errors and inconsistencies that
cannot be prevented at the schema level (e.g., misspellings).
For both schema- and instance-level problems we can differentiate different problem
scopes: attribute (field), record, record type and source; examples for the various
cases are shown in Tables 1 and 2. Note that uniqueness constraints specified at the
schema level do not prevent duplicated instances, e.g., if information on the same real
world entity is entered twice with different attribute values (see example in Table 2).
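The point that a uniqueness constraint on the key does not catch a real-world entity entered twice can be illustrated with a query such as the following (stg_customer and its columns are assumed names):

    -- cid is unique, so the DBMS accepts both rows; the duplicate only shows up
    -- when we group on descriptive attributes such as name and birth date.
    SELECT name, bdate,
           COUNT(*)  AS occurrences,
           MIN(cid)  AS first_cid,
           MAX(cid)  AS other_cid
    FROM stg_customer
    GROUP BY name, bdate
    HAVING COUNT(*) > 1;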
Multi-source problems
The problems present in single sources are aggravated when multiple sources need to
be integrated. Each source may contain dirty data and the data in the sources may be
represented differently, overlap or contradict. This is because the sources are
typically developed, deployed and maintained independently to serve specific needs.
This results in a large degree of heterogeneity w.r.t. data management systems, data
models, schema designs and the actual data.
At the schema level, data model and schema design differences are to be
addressed by the steps of schema translation and schema integration, respectively.
The main problems w.r.t. schema design are naming and structural conflicts. Naming
conflicts arise when the same name is used for different objects (homonyms) or
different names are used for the same object (synonyms). Structural conflicts occur in
many variations and refer to different representations of the same object in different
sources, e.g., attribute vs. table representation, different component structure,
different data types, different integrity constraints, etc. In addition to schema-level
conflicts, many conflicts appear only at the instance level (data conflicts). All
problems from the single-source case can occur with different representations in
different sources (e.g., duplicated records, contradicting records,…). Furthermore,
even when there are the same attribute names and data types, there may be different
value representations (e.g., for marital status) or different interpretation of the values
(e.g., measurement units Dollar vs. Euro) across sources. Moreover, information in
the sources may be provided at different aggregation levels (e.g., sales per product vs.
sales per product group) or refer to different points in time (e.g. current sales as of
yesterday for source 1 vs. as of last week for source 2).
The two sources in the example of Fig. 3 are both in relational format but exhibit
schema and data conflicts. At the schema level, there are name conflicts (synonyms
Customer/Client, Cid/Cno, Sex/Gender) and structural conflicts (different
representations for names and addresses). At the instance level, we note that there are
different gender representations (“0”/”1” vs. “F”/”M”) and presumably a duplicate
record (Kristen Smith). The latter observation also reveals that while Cid/Cno are
both source-specific identifiers, their contents are not comparable between the
sources; different numbers (11/493) may refer to the same person while different
persons can have the same number (24). Solving these problems requires both
schema integration and data cleaning; the third table shows a possible solution. Note
that the schema conflicts should be resolved first to allow data cleaning, in particular
detection of duplicates based on a uniform representation of names and addresses,
and matching of the Gender/Sex values.
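A sketch of how the two sources of Fig. 3 could be brought to a uniform representation before duplicate detection. The table names Customer/Client and the columns Cid/Cno and Sex/Gender come from the text; the remaining column names are assumed for illustration, and splitting the combined name field is shown only schematically, since robust name and address parsing normally needs a dedicated cleaning routine:

    -- Map both sources into one uniform customer representation.
    SELECT Cid          AS source_id,
           'S1'         AS source,
           Name         AS full_name,   -- still needs splitting into first/last name
           CASE Sex WHEN '0' THEN 'F' WHEN '1' THEN 'M' END AS gender,
           Street       AS address_part1,
           City         AS address_part2
    FROM Customer
    UNION ALL
    SELECT Cno, 'S2',
           FirstName || ' ' || LastName,
           Gender,
           Address,
           CAST(NULL AS VARCHAR(100))
    FROM Client;
    -- Duplicate detection (e.g. the two Kristen Smith records) then runs on this
    -- uniform representation of names, addresses and gender values.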
Data analysis
Metadata reflected in schemas is typically insufficient to assess the data quality
of a source, especially if only a few integrity constraints are enforced. It is thus
important to analyse the actual instances to obtain real (reengineered)
metadata on data characteristics or unusual value patterns. This metadata helps to find data quality problems. Moreover, it can effectively contribute to identifying attribute correspondences between source schemas (schema matching), from which automatic data transformations can be derived.
There are two related approaches for data analysis, data profiling and
data mining. Data profiling focuses on the instance analysis of individual
attributes. It derives information such as the data type, length, value range,
discrete values and their frequency, variance, uniqueness, occurrence of null
values, typical string pattern (e.g., for phone numbers), etc., providing an exact
view of various quality aspects of the attribute.
Table 3 shows examples of how this metadata can help detect data quality problems.
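In SQL, the per-attribute metadata that data profiling derives can be gathered with simple aggregate queries. A sketch for one attribute of an assumed stg_customer table (the table and column names are illustrative):

    -- Profile a single attribute: cardinality, null rate, value range and length.
    SELECT COUNT(*)                                        AS row_count,
           COUNT(DISTINCT phone)                           AS distinct_values,
           SUM(CASE WHEN phone IS NULL THEN 1 ELSE 0 END)  AS null_count,
           MIN(LENGTH(phone))                              AS min_length,
           MAX(LENGTH(phone))                              AS max_length
    FROM stg_customer;

    -- Value frequencies help spot outliers, typos and unexpected string patterns.
    SELECT phone, COUNT(*) AS freq
    FROM stg_customer
    GROUP BY phone
    ORDER BY freq DESC;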
Metadata are data about data. When used in a data warehouse, metadata are the data
that define warehouse objects. Metadata are created for the data names and
definitions of the given warehouse. Additional metadata are created and captured for
time stamping any extracted data, the source of the extracted data, and missing fields
that have been added by data cleaning or integration processes. A metadata repository should contain:
a description of the structure of the data warehouse (schema, dimensions, hierarchies, data mart locations)
operational metadata (data lineage, currency of data, monitoring information)
the algorithms used for summarization
the mapping from the operational environment to the data warehouse
data related to system performance
business metadata (business terms and definitions, data ownership)
ROLAP servers generally offer greater scalability than MOLAP servers.
Cube operation: a data cube over a SALES table is specified with an SQL-style aggregate query; computing the full cube requires computing every group-by of the dimensions, down to the empty group-by ().
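A sketch of such a cube query. The SALES table comes from the notes' example; the column names item, city, year and amount are assumed for illustration:

    -- GROUP BY CUBE (SQL:1999) computes all 2^3 group-bys of (item, city, year)
    -- in one statement, including the empty group-by (), i.e. the grand total.
    SELECT item, city, year, SUM(amount) AS total_sales
    FROM SALES
    GROUP BY CUBE (item, city, year);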
3.4.1 Cube Computation
The multiway array aggregation method (the array-based approach used by MOLAP servers) computes the cube as follows:
Partition the arrays into chunks (a chunk is a small subcube that fits in memory).
Use compressed sparse array addressing: (chunk_id, offset).
Compute the aggregates in "multiway" fashion by visiting cube cells in an order that minimizes the number of times each cell is visited, reducing memory access and storage cost.
3.4.2 Indexing OLAP data
The bitmap indexing method is popular in OLAP products because it allows quick
searching in data cubes.
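In products that support it (Oracle, for example), a bitmap index on a low-cardinality dimension attribute is created with a dedicated statement. A sketch against the star schema assumed earlier (index names are illustrative):

    -- One bit vector per distinct value of the indexed column; AND/OR-ing the
    -- vectors answers multi-attribute predicates very quickly.
    CREATE BITMAP INDEX item_type_bix     ON item_dim(type);
    CREATE BITMAP INDEX location_city_bix ON location_dim(city);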
The join indexing method gained popularity from its use in relational
database query processing. Traditional indexing maps the value in a given column to
a list of rows having that value. In contrast, join indexing registers the joinable rows
of two relations from a relational database. For example, if two relations R(RID, A) and S(B, SID) join on the attributes A and B, then the join index record contains the pair (RID, SID), where RID and SID are record identifiers from the R and S relations, respectively.
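Using the text's notation, a join index for R(RID, A) and S(B, SID) can be materialized as a small table of (RID, SID) pairs; a minimal sketch (the index table name is assumed):

    -- Precompute the joinable row pairs once, so later OLAP queries can use the
    -- index instead of recomputing the join of R and S.
    CREATE TABLE r_s_join_index AS
    SELECT R.RID, S.SID
    FROM R JOIN S ON R.A = S.B;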
Data warehouses support three kinds of data processing:
1. Information processing: querying, reporting and basic statistical analysis.
2. Analytical processing: multidimensional analysis of data warehouse data using basic OLAP operations.
3. Data mining: knowledge discovery from the warehouse data, as discussed in the note below.
Note:
Most data mining tools need to work on integrated, consistent, and cleaned
data, which requires costly data cleaning, data transformation and data integration as
preprocessing steps. A data warehouse constructed by such preprocessing serves as a
valuable source of high quality data for OLAP as well as for data mining.
Effective data mining needs exploratory data analysis. A user will often want
to traverse through a database, select portions of relevant data, analyze them at
different granularities, and present knowledge/results in different forms. On-line
analytical mining provides facilities for data mining on different subsets of data and
at different levels of abstraction, by drilling, pivoting, filtering, dicing and slicing on
a data cube and on some intermediate data mining results.
A metadata directory is used to guide the access of the data cube. The data cube
can be constructed by accessing and/or integrating multiple databases and/or by filtering a
data warehouse via a Database API which may support OLEDB or ODBC connections.
Since an OLAM engine may perform multiple data mining tasks, such as concept
description, association, classification, prediction, clustering, time-series analysis, etc., it
usually consists of multiple, integrated data mining modules and is more sophisticated than
an OLAP engine.