
OLAP AND METADATA REPOSITORY

METADATA REPOSITORY


Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for time stamping any extracted
data, the source of the extracted data, and missing fields that have been added by data cleaning
or integration processes.

A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the warehouse schema, views, dimensions, hierarchies, and derived data definitions, as well as data mart locations and contents.

Operational metadata, which include data lineage (history of migrated data and the
sequence of transformations applied to it), currency of data (active, archived, or
purged), and monitoring information (warehouse usage statistics, error reports, and
audit trails).

The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.

The mapping from the operational environment to the data warehouse, which
includes source databases and their contents, gateway descriptions, data partitions, data
extraction, cleaning, transformation rules and defaults, data refresh and purging
rules,
and security (user authorization and access control).

Data related to system performance, which include indices and profiles that improve
data access and retrieval performance, in addition to rules for the timing and
scheduling of refresh, update, and replication cycles.

Business metadata, which include business terms and definitions, data ownership information, and charging policies.
OLAP (Online Analytical Processing):

OLAP is an approach to answering multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational databases, report writing, and data mining.
OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives.

OLAP consists of three basic analytical operations:

➢ Consolidation (Roll-Up)
➢ Drill-Down
➢ Slicing and Dicing

Consolidation (roll-up) involves the aggregation of data that can be accumulated and computed in one or more dimensions. For example, all sales offices are rolled up to the sales department or sales division to anticipate sales trends.

Drill-down is a technique that allows users to navigate down through the levels of detail. For instance, users can view the sales of the individual products that make up a region's sales.

Slicing and dicing is a feature whereby users can take out (slice) a specific set of data from the OLAP cube and view (dice) the slices from different viewpoints.
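
A minimal sketch of the three operations using Python's pandas library on a toy sales table (the column names and figures are invented for illustration):

    import pandas as pd

    # Toy fact table: one row per (region, office, product) with a sales amount.
    sales = pd.DataFrame({
        "region":  ["East", "East", "East", "West", "West", "West"],
        "office":  ["NY", "NY", "Boston", "LA", "LA", "Seattle"],
        "product": ["Pens", "Paper", "Pens", "Pens", "Paper", "Paper"],
        "amount":  [100, 150, 80, 120, 90, 60],
    })

    # Consolidation (roll-up): aggregate offices up to the region level.
    rollup = sales.groupby("region")["amount"].sum()

    # Drill-down: navigate back to finer detail, e.g. per-product sales per region.
    drilldown = sales.groupby(["region", "product"])["amount"].sum()

    # Slice: fix one dimension (region == "East") to take out a sub-cube.
    east_slice = sales[sales["region"] == "East"]

    # Dice: restrict several dimensions at once to view the slice another way.
    dice = east_slice[east_slice["product"] == "Pens"]

    print(rollup)
    print(drilldown)
    print(dice)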

Types of OLAP:

1. Relational OLAP (ROLAP):

ROLAP works directly with relational databases. The base data and the
dimension tables are stored as relational tables and new tables are created to hold
the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational database
to give the appearance of traditional OLAP's slicing and dicing functionality. In
essence, each action of slicing and dicing is equivalent to adding a "WHERE"
clause in the SQL statement.
ROLAP tools do not use pre-calculated data cubes but instead pose the query to
the standard relational database and its tables in order to bring back the data
required to answer the question.
ROLAP tools feature the ability to ask any question, because the methodology is not limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest level of detail in the database.
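
To make the "WHERE clause" point concrete, here is a hedged sketch using Python's built-in sqlite3 module as a stand-in relational database (the table and column names are invented); the slice-and-dice request is answered by generated SQL over ordinary tables, with no pre-built cube:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("East", "Pens", 100), ("East", "Paper", 150),
         ("West", "Pens", 120), ("West", "Paper", 90)],
    )

    # A ROLAP engine answers a dice by generating SQL on the fly; each
    # slicing/dicing condition becomes a WHERE predicate on the base tables.
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales "
        "WHERE product = ? "      # the dice condition
        "GROUP BY region",        # the remaining dimension
        ("Pens",),
    )
    print(cur.fetchall())  # aggregates computed from base tables, not a cube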

2. Multidimensional OLAP (MOLAP):


MOLAP is the 'classic' form of OLAP and is sometimes referred to as just OLAP.

MOLAP stores data in optimized multi-dimensional array storage, rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.

MOLAP tools generally utilize a pre-calculated data set referred to as a data cube.
The data cube contains all the possible answers to a given range of questions.

MOLAP tools have a very fast response time and the ability to quickly write back
data into the data set.
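
A minimal sketch of the idea, with a NumPy array standing in for the multidimensional store (the dimension sizes and values are invented); the cube is pre-computed once during processing, so queries become cheap array lookups:

    import numpy as np

    regions  = ["East", "West"]
    products = ["Pens", "Paper", "Ink"]

    # "Processing" step: load base data into a dense 2 x 3 x 4 array
    # (region x product x quarter) and pre-compute likely aggregates.
    cube = np.random.default_rng(0).integers(50, 200, size=(2, 3, 4))
    by_region  = cube.sum(axis=(1, 2))  # totals per region
    by_product = cube.sum(axis=(0, 2))  # totals per product

    # Query time: "total sales for Paper" is a constant-time lookup.
    print(by_product[products.index("Paper")])
    print(by_region[regions.index("East")])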

3. Hybrid OLAP (HOLAP):


There is no clear agreement across the industry as to what constitutes Hybrid OLAP,
except that a database will divide data between relational and specialized storage.
For example, for some vendors, a HOLAP database will use relational tables to hold
the larger quantities of detailed data, and use specialized storage for at least some
aspects of the smaller quantities of more-aggregate or less-detailed data.
HOLAP addresses the shortcomings of MOLAP and ROLAP by combining the
capabilities of both approaches.
HOLAP tools can utilize both pre-calculated cubes and relational data sources.
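
One way to picture the hybrid is the hedged sketch below (the routing rule, table, and names are invented for illustration): detail queries go to the relational store, while aggregate queries are served from a small pre-computed cube:

    import sqlite3

    # Relational side: detailed rows stay in ordinary tables.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     [("East", "Pens", 100), ("West", "Paper", 90)])

    # Specialized side: pre-computed aggregates held as cube storage.
    cube_totals = {"East": 100, "West": 90}

    def total_for_region(region):
        return cube_totals[region]  # aggregate query: answered by the cube

    def detail_for_region(region):
        return conn.execute(        # detail query: pushed down to SQL
            "SELECT * FROM sales WHERE region = ?", (region,)).fetchall()

    print(total_for_region("East"))
    print(detail_for_region("East"))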
Data Preprocessing:

Data Integration:

It combines data from multiple sources into a coherent data store, as in data warehousing.
These sources may include multiple databases, data cubes, or flat files.

A data integration system is formally defined as a triple <G, S, M>, where:

G: the global schema

S: the heterogeneous source schemas

M: the mapping between the queries over the source schemas and the global schema
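
A hedged sketch of the triple in Python (the schema and attribute names are invented): M maps each attribute of the global schema G to the attribute carrying the same information in each source schema in S, so a query posed against G can be rewritten against any source:

    # Global schema G: the unified attribute names the warehouse exposes.
    G = ["customer_id", "annual_revenue"]

    # Source schemas S: two heterogeneous operational databases.
    S = {
        "crm_db":   ["cust_id", "yearly_rev"],
        "sales_db": ["customer_number", "revenue_per_year"],
    }

    # Mapping M: global attribute -> its counterpart in each source.
    M = {
        "customer_id":    {"crm_db": "cust_id",
                           "sales_db": "customer_number"},
        "annual_revenue": {"crm_db": "yearly_rev",
                           "sales_db": "revenue_per_year"},
    }

    def rewrite(global_attr, source):
        """Translate an attribute in a query over G into the source's name."""
        return M[global_attr][source]

    print(rewrite("customer_id", "sales_db"))  # -> "customer_number"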


Issues in Data integration:

1. Schema integration and object matching:

How can the data analyst or the computer be sure that customer id in one database and customer number in another refer to the same attribute?

2. Redundancy:

An attribute (annual revenue, for instance) may be redundant if it can be derived from another attribute or set of attributes. Inconsistencies in attribute or dimension naming can also cause redundancies in the resulting data set.
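
One standard way to flag such redundancy is correlation analysis; the sketch below (with invented figures) uses pandas to test whether one attribute tracks another so closely that it can be derived from it:

    import pandas as pd

    # monthly_revenue is (almost exactly) annual_revenue / 12, so the
    # two attributes are redundant with each other.
    df = pd.DataFrame({
        "annual_revenue":  [120.0, 240.0, 360.0, 480.0],
        "monthly_revenue": [10.0, 20.1, 29.9, 40.0],
    })

    corr = df["annual_revenue"].corr(df["monthly_revenue"])
    if corr > 0.99:  # the threshold is a judgment call
        print(f"correlation {corr:.3f}: candidate for redundancy removal")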

3. Detection and resolution of data value conflicts:

For the same real-world entity, attribute values from different sources may differ, for example because of differences in representation, scaling, or encoding (a weight attribute stored in metric units in one system and in British imperial units in another).

Data Transformation:

In data transformation, the data are transformed or consolidated into forms appropriate
for mining.

Data transformation can involve the following:

Smoothing, which works to remove noise from the data. Such techniques include binning,
regression, and clustering.
Aggregation, where summary or aggregation operations are applied to the data. For
example, the daily sales data may be aggregated so as to compute monthly and annual total
amounts. This step is typically used in constructing a data cube for analysis of the data at
multiple granularities.
Generalization of the data, where low-level or "primitive" (raw) data are replaced by higher-level concepts through the use of concept hierarchies. For example, categorical attributes, like street, can be generalized to higher-level concepts, like city or country.
Normalization, where the attribute data are scaled so as to fall within a small specified range, such as -1.0 to 1.0 or 0.0 to 1.0 (a short sketch follows this list).
Attribute construction (or feature construction), where new attributes are constructed
and added from the given set of attributes to help the mining process.
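
A short sketch of two of these transformations in Python with NumPy (the sample values are invented): smoothing by bin means, and min-max normalization into the range 0.0 to 1.0:

    import numpy as np

    prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34],
                      dtype=float)

    # Smoothing by bin means: partition the sorted values into equal-frequency
    # bins and replace every value by the mean of its bin.
    bins = prices.reshape(4, 3)                # 4 bins of 3 values each
    smoothed = np.repeat(bins.mean(axis=1), 3)

    # Min-max normalization: scale values so they fall within [0.0, 1.0].
    normalized = (prices - prices.min()) / (prices.max() - prices.min())

    print(smoothed)
    print(normalized)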

Data Reduction:

Data reduction techniques can be applied to obtain a reduced representation of the data set that
is much smaller in volume, yet closely maintains the integrity of the original data. That is, mining
on the reduced data set should be more efficient yet produce the same (or almost the same)
analytical results.
Strategies for data reduction include the following:
Data cube aggregation, where aggregation operations are applied to the data in the
construction of a data cube.
Attribute subset selection, where irrelevant, weakly relevant, or redundant attributes or
dimensions may be detected and removed.
Dimensionality reduction, where encoding mechanisms are used to reduce the dataset
size.
Numerosity reduction, where the data are replaced or estimated by alternative,
smaller data representations such as parametric models (which need store only the model
parameters instead of the actual data) or nonparametric methods such as clustering,
sampling, and the use of histograms.
Discretization and concept hierarchy generation, where raw data values for attributes are replaced by ranges or higher conceptual levels. Data discretization is a form of numerosity reduction that is very useful for the automatic generation of concept hierarchies. Discretization and concept hierarchy generation are powerful tools for data mining, in that they allow the mining of data at multiple levels of abstraction (a short sketch follows this list).
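
A brief sketch of two of these strategies in Python with NumPy (the values are invented): simple random sampling as a numerosity-reduction step, and equal-width discretization of a numeric attribute into ranges:

    import numpy as np

    rng = np.random.default_rng(42)
    ages = rng.integers(18, 90, size=1000)  # raw attribute values

    # Numerosity reduction by sampling: keep a 5% simple random sample.
    sample = rng.choice(ages, size=50, replace=False)

    # Discretization: replace raw ages with equal-width ranges.
    edges = np.linspace(18, 90, num=5)      # [18, 36, 54, 72, 90]
    labels = ["18-35", "36-53", "54-71", "72-90"]
    band_index = np.digitize(ages, edges[1:-1])
    discretized = [labels[i] for i in band_index]

    print(sample[:10])
    print(discretized[:10])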
