
DATA WAREHOUSE & DATA MINING

SECTION – A
What is a Data Warehouse?
A Data Warehouse (DW) is a relational database that is designed for query and analysis rather than transaction
processing. It includes historical data derived from transaction data from single and multiple sources. A Data
Warehouse provides integrated, enterprise-wide, historical data and focuses on providing support for
decision-makers for data modeling and analysis. A Data Warehouse is a group of data specific to the
entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing, but for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
o It is a database designed for investigative tasks, using data from various applications.
o It supports a relatively small number of clients with relatively long interactions.
o It includes current and historical data to provide a historical perspective of information.
o Its usage is read-intensive.
o It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."

Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data
warehouses typically provide a concise and straightforward view around a particular subject, such as
customer, product, or sales, instead of the global organization's ongoing operations. This is done by
excluding data that are not useful concerning the subject and including all data needed by the users
to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and online
transaction records. It requires performing data cleaning and integration during data warehousing
to ensure consistency in naming conventions, attribute types, etc., among different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months,
6 months, 12 months, or even older data from a data warehouse. This contrasts with a transaction
system, where often only the most recent data is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update,
insert, and delete operations are not performed. It usually requires only two procedures in data
accessing: Initial loading of data and access to data. Therefore, the DW does not require transaction
processing, recovery, and concurrency capabilities, which allows for substantial speedup of data
retrieval. Non-volatile means that, once entered into the warehouse, the data should not change.
Data Warehouse Usage:
1. Data warehouses and data marts are used in a wide range of applications.
2. Business executives use the data in data warehouses and data marts to perform data
analysis and make strategic decisions.
3. In many areas, data warehouses are used as an integral part of enterprise management.
4. The data warehouse is mainly used for generating reports and answering predefined queries.
5. It is used to analyze summarized and detailed data, where the results are presented in the
form of reports and charts.
6. Later, the data warehouse is used for strategic purposes, performing multidimensional
analysis and sophisticated operations.
7. Finally, the data warehouse may be employed for knowledge discovery and strategic
decision making using data mining tools.
8. In this context, the tools for data warehousing can be categorized into access and retrieval
tools, database reporting tools, data analysis tools, and data mining tools.
What is Data Mart?
A Data Mart is a subset of an organizational data store, generally oriented to a specific
purpose or primary data subject, which may be distributed to support business needs. Data
Marts are analytical record stores designed to focus on particular business functions for a
specific community within an organization. Data marts are derived from subsets of data in a
data warehouse, though in the bottom-up data warehouse design methodology, the data
warehouse is created from the union of organizational data marts.

The fundamental use of a data mart is for Business Intelligence (BI) applications. BI is used to
gather, store, access, and analyze records. Data marts can be used by smaller businesses to utilize
the data they have accumulated, since a data mart is less expensive than implementing a data warehouse.

Reasons for creating a data mart

o Provides a collective view of data for a specific group of users
o Easy access to frequently needed data
o Ease of creation
o Improves end-user response time
o Lower cost than implementing a complete data warehouse
o Potential clients are more clearly defined than in a comprehensive data warehouse
o It contains only essential business data and is less cluttered.

Types of Data Marts

There are mainly two approaches to designing data marts. These approaches are

o Dependent Data Marts
o Independent Data Marts

Dependent Data Marts


A dependent data mart is a logical subset or a physical subset of a larger data warehouse. In this
technique, the data marts are treated as subsets of a data warehouse: first a data warehouse is
created, from which various data marts can then be created. These data marts are dependent on the
data warehouse and extract the essential records from it. Because the data warehouse creates the
data marts, there is no need for data mart integration. This is also known as the top-down approach.

Independent Data Marts


The second approach uses independent data marts (IDM). Here, independent data marts are created
first, and then a data warehouse is designed using these multiple data marts. In this approach,
since all the data marts are designed independently, the integration of the data marts is required.
It is also termed the bottom-up approach, as the data marts are integrated to develop the data
warehouse.

What is Meta Data?


Metadata is data about data, or documentation about the information, that is required by the
users. In data warehousing, metadata is one of the essential aspects.

Metadata includes the following:

1. The location and descriptions of warehouse systems and components.
2. Names, definitions, structures, and content of the data warehouse and end-user views.
3. Identification of authoritative data sources.
4. Integration and transformation rules used to populate the data warehouse.
5. Integration and transformation rules used to deliver information to end-user analytical tools.
6. Subscription information for information delivery to analysis subscribers.
7. Metrics used to analyze warehouse usage and performance.
8. Security authorizations, access control lists, etc.

Metadata is used for building, maintaining, managing, and using the data warehouse. It helps
users access, understand, and find the data.

Several examples of metadata are:

1. A library catalog may be considered metadata. The catalog metadata consists of several
predefined components representing specific attributes of a resource, and each item can have
one or more values. These components could be the name of the author, the name of the
document, the publisher's name, the publication date, and the categories to which it belongs.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting
that it is the person's weight and the unit is kilograms. Therefore, (weight, kilograms)
is the metadata about the data value 80.
4. Another example of metadata is data about the tables and figures in a report. A table
(which is a record) has a name (e.g., the table title), and the column names of the table may
be treated as metadata. The figures also have titles or names.

Why is metadata necessary in a data warehouse?

o First, it acts as the glue that links all parts of the data warehouse.
o Next, it provides information about the contents and structure to the developers.
o Finally, it opens the door to the end-users and makes the contents recognizable in their
own terms.

Metadata is like a nerve center. Various processes during the building and administering of
the data warehouse generate parts of the data warehouse metadata, and parts of the metadata
generated by one process are used by another. In the data warehouse, metadata assumes a key
position and enables communication among various processes. It acts as a nerve center in the
data warehouse.

What is Data Cube?


When data is grouped or combined into multidimensional matrices, the result is called a Data Cube.
The data cube method has a few alternative names and variants, such as "multidimensional
databases," "materialized views," and "OLAP (On-Line Analytical Processing)."

The general idea of this approach is to materialize certain expensive computations that are
frequently queried.

For example, a relation with the schema sales (part, supplier, customer, sale-price) can be
materialized into a set of eight views as shown in the figure, where psc indicates a view consisting
of aggregate function values (such as total sales) computed by grouping the three attributes part,
supplier, and customer; p indicates a view composed of the corresponding aggregate function
values calculated by grouping part alone; and so on.
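To make this concrete, the following is a minimal Python sketch (not from the original text) that
materializes all eight group-by views of such a sales(part, supplier, customer, sale-price)
relation; the sample rows and the total-sales aggregate are illustrative assumptions.

from itertools import combinations
from collections import defaultdict

# Hypothetical sales(part, supplier, customer, sale_price) rows.
sales = [
    ("p1", "s1", "c1", 100.0),
    ("p1", "s2", "c1", 150.0),
    ("p2", "s1", "c2", 200.0),
]
dims = ("part", "supplier", "customer")

# Enumerate every subset of {part, supplier, customer}: psc; ps, pc, sc;
# p, s, c; and the empty grouping (grand total) -- eight views in all.
for k in range(len(dims), -1, -1):
    for group in combinations(range(len(dims)), k):
        totals = defaultdict(float)
        for row in sales:
            key = tuple(row[i] for i in group)
            totals[key] += row[3]  # aggregate: total sales
        view_name = "".join(dims[i][0] for i in group) or "(all)"
        print(view_name, dict(totals))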
A data cube is created from a subset of attributes in the database. Specific attributes are chosen to
be measure attributes, i.e., the attributes whose values are of interest. Other attributes are selected
as dimensions or functional attributes. The measure attributes are aggregated according to the
dimensions. For example, XYZ may create a sales data warehouse to keep records of the store's sales
for the dimensions time, item, branch, and location. These dimensions enable the store to keep track
of things like monthly sales of items, and the branches and locations at which the items were sold.
Each dimension may have a table associated with it, known as a dimension table, which describes the
dimension. For example, a dimension table for items may contain the attributes item_name, brand,
and type.

The data cube method is an interesting technique with many applications. Data cubes can be sparse in
many cases, because not every cell in each dimension may have corresponding data in the database;
techniques are needed to handle sparse cubes efficiently. Also, if a query contains constants at levels
lower than those provided in a data cube, it is not clear how to make the best use of the precomputed
results stored in the cube.

OLAP tools are based on the multidimensional data model, which views data in the form of a data cube.
Data cubes usually model n-dimensional data: a data cube enables data to be modeled and viewed in
multiple dimensions. A multidimensional data model is organized around a central theme, like sales or
transactions, and a fact table represents this theme. Facts are numerical measures; thus, the fact table
contains measures (such as Rs_sold) and keys to each of the related dimension tables. Dimensions are
the perspectives or entities with respect to which an organization wants to keep records, and facts are
the quantities used for analyzing the relationships between dimensions.

What is Star Schema?


A star schema is the elementary form of a dimensional model, in which data are organized into facts
and dimensions. A fact is an event that is counted or measured, such as a sale or a log-in. A dimension
includes reference data about the fact, such as date, item, or customer.

A star schema is a relational schema whose design represents a multidimensional data model. It is the
simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram
of this schema resembles a star, with points diverging from a central table. The center of the schema
consists of a large fact table, and the points of the star are the dimension tables.

Fact Tables

A fact table is a table in a star schema that contains facts and is connected to the dimensions. A fact
table has two types of columns: those that contain facts and those that are foreign keys to the
dimension tables. The primary key of the fact table is generally a composite key made up of all of its
foreign keys.

A fact table might contain either detail-level facts or facts that have been aggregated (fact tables
containing aggregated facts are often instead called summary tables). A fact table generally contains
facts at the same level of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a
dimension does not have hierarchies and levels, it is called a flat dimension or list. The primary key of
each dimension table is part of the composite primary key of the fact table. Dimensional attributes
help to define the dimensional values; they are generally descriptive, textual values. Dimension tables
are usually smaller in size than fact tables. Fact tables store data about sales, while dimension tables
store data about geographic regions (markets, cities), clients, products, times, and channels.
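As a hedged illustration of the structures just described, the following Python/SQLite sketch creates a
sales star schema with time, item, branch, and location dimensions around a central fact table; the
table and column names are assumptions for illustration, not a prescribed design.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_item     (item_key     INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT);
CREATE TABLE dim_branch   (branch_key   INTEGER PRIMARY KEY, branch_name TEXT);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);

-- The fact table's primary key is the composite of its foreign keys,
-- and its remaining columns are the measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    branch_key   INTEGER REFERENCES dim_branch(branch_key),
    location_key INTEGER REFERENCES dim_location(location_key),
    rs_sold      REAL,     -- measure (e.g., Rs_sold from the text)
    units_sold   INTEGER,  -- measure
    PRIMARY KEY (time_key, item_key, branch_key, location_key)
);
""")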
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the following
features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the development cycle,
and as the database grows.
o Its design parallels how end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
o Queries are simple, since facts are joined to a small number of de-normalized dimension tables.
o Aggregation and reporting are fast for the same reason, which suits typical BI workloads.

Disadvantage of Star Schema

There are some conditions that a star schema cannot meet: for example, the relationship between a user
and a bank account cannot be described as a star schema, because the relationship between them is
many-to-many.

What is Snowflake Schema?


A snowflake schema is a variant of the star schema. "A schema is known as a snowflake if one or more
dimension tables do not connect directly to the fact table but must join through other dimension
tables." The snowflake schema is an expansion of the star schema in which each point of the star
explodes into more points. It is called a snowflake schema because the diagram of the schema resembles
a snowflake. Snowflaking is a method of normalizing the dimension tables in a star schema; when we
normalize all the dimension tables entirely, the resultant structure resembles a snowflake with the fact
table in the middle.

Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with
each fact surrounded by its associated dimensions, and those dimensions are related to other
dimensions, branching out into a snowflake pattern. The snowflake schema consists of one fact table
linked to many dimension tables, which can be linked to other dimension tables through a many-to-one
relationship. Tables in a snowflake schema are generally normalized to third normal form. Each
dimension table represents exactly one level in a hierarchy. The following diagram shows a snowflake
schema with two dimensions, each having three levels. A snowflake schema can have any number of
dimensions, and each dimension can have any number of levels.
Example: The figure shows a snowflake schema with a Sales fact table and Store, Location, Time, Product,
Line, and Family dimension tables. The Market dimension has two dimension tables, with Store as the
primary dimension table and Location as the outrigger dimension table. The Product dimension has
three dimension tables, with Product as the primary dimension table and the Line and Family tables as
the outrigger dimension tables.

A snowflake schema is designed for flexible querying across more complex dimensions and relationships.
It is suitable for many-to-many and one-to-many relationships between dimension levels.
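A minimal sketch, under the same illustrative assumptions as the star schema above, of snowflaking the
product dimension from this example: Product is the primary dimension table, and Line and Family are
its outrigger tables.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_family  (family_key  INTEGER PRIMARY KEY, family_name TEXT);
CREATE TABLE dim_line    (line_key    INTEGER PRIMARY KEY, line_name TEXT,
                          family_key  INTEGER REFERENCES dim_family(family_key));
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product_name TEXT,
                          line_key    INTEGER REFERENCES dim_line(line_key));
""")

# Querying across levels then requires joining the outrigger tables,
# assuming a fact table with a product_key column, e.g.:
#   SELECT f.family_name, SUM(s.units_sold)
#   FROM fact_sales s
#     JOIN dim_product p ON s.product_key = p.product_key
#     JOIN dim_line    l ON p.line_key    = l.line_key
#     JOIN dim_family  f ON l.family_key  = f.family_key
#   GROUP BY f.family_name;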

Advantage of Snowflake Schema

1. The primary advantage of the snowflake schema is the improvement in query performance due to
minimized disk storage requirements and joins against smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and components.
3. There is no redundancy, so it is easier to maintain.

Disadvantage of Snowflake Schema

1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due
to the increasing number of lookup tables.
2. Queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and therefore longer query execution time.

What is Fact Constellation Schema?


A fact constellation has two or more fact tables sharing one or more dimension tables. It is also
called a galaxy schema.

A fact constellation schema describes the logical structure of a data warehouse or data mart. It can
be designed with a collection of de-normalized fact tables and shared, conformed dimension tables.

The fact constellation is a sophisticated design in which summarizing information is more difficult.
It can be implemented by building aggregate fact tables or by decomposing a complex fact table into
independent simple fact tables.

Example: A fact constellation schema is shown in the figure below.


This schema defines two fact tables, sales and shipping. Sales are analyzed along four dimensions:
time, item, branch, and location. The schema contains a fact table for sales that includes keys to
each of the four dimensions, along with two measures: Rupee_sold and units_sold. The shipping table
has five dimensions, or keys (item_key, time_key, shipper_key, from_location, and to_location), and
two measures (Rupee_cost and units_shipped).
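A hedged sketch of this constellation, again in Python/SQLite: the sales and shipping fact tables share
the time and item dimension keys, and the measures follow the text; the dimension tables themselves
are stubbed for brevity.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared and conformed dimensions (stubbed to their keys here).
CREATE TABLE dim_time     (time_key     INTEGER PRIMARY KEY);
CREATE TABLE dim_item     (item_key     INTEGER PRIMARY KEY);
CREATE TABLE dim_branch   (branch_key   INTEGER PRIMARY KEY);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY);
CREATE TABLE dim_shipper  (shipper_key  INTEGER PRIMARY KEY);

CREATE TABLE fact_sales (
    time_key INTEGER, item_key INTEGER, branch_key INTEGER, location_key INTEGER,
    rupee_sold REAL, units_sold INTEGER
);
CREATE TABLE fact_shipping (
    item_key INTEGER, time_key INTEGER, shipper_key INTEGER,
    from_location INTEGER, to_location INTEGER,  -- both reference dim_location
    rupee_cost REAL, units_shipped INTEGER
);
""")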

The primary disadvantage of the fact constellation schema is that it is a more challenging design
because many variants for specific kinds of aggregation must be considered and selected.

Data Warehouse Process Architecture


The process architecture defines an architecture in which the data from the data warehouse is processed
for a particular computation.

Following are the two fundamental process architectures:

Centralized Process Architecture

In this architecture, the data is collected into single centralized storage and processed by a single
machine with large capacity in terms of memory, processors, and storage.

Centralized process architecture evolved with transaction processing and is well suited for small
organizations with one location of service.

It requires minimal resources both from people and system perspectives.


It is very successful when the collection and consumption of data occur at the same location.
Distributed Process Architecture
In this architecture, information and its processing are allocated across data centers; processing of
the data is localized, and the results are grouped into centralized storage. Distributed architectures
are used to overcome the limitations of centralized process architectures, where all the information
needs to be collected in one central location and results are available in one central location.
There are several architectures of the distributed process:
Client-Server
In this architecture, the client does all the information collecting and presentation, while the server
does the processing and management of data.
Three-tier Architecture
With client-server architecture, the client machines need to be connected to a server machine,
mandating finite states and introducing latencies and overhead in terms of the records to be carried
between clients and servers. A three-tier architecture addresses this by introducing a middle tier
between the clients and the server.
N-tier Architecture
The n-tier or multi-tier architecture is where clients, middleware, applications, and servers are isolated
into tiers.
Cluster Architecture
In this architecture, machines are connected in a network (via software or hardware) to work together
to process information or compute requirements in parallel. Each device in a cluster is associated with
a function that is processed locally, and the result sets are collected to a master server that returns
them to the user.
Peer-to-Peer Architecture
This is a type of architecture where there are no dedicated servers and clients. Instead, all the processing
responsibilities are allocated among all machines, called peers. Each machine can perform the function of
a client or server or just process data.
Difference between OLTP and OLAP
OLTP (On-Line Transaction Processing) is characterized by a large number of short on-line transactions
(INSERT, UPDATE, and DELETE). The primary emphasis of OLTP operations is on very rapid query
processing, maintaining record integrity in multi-access environments, and effectiveness measured by
the number of transactions per second. An OLTP database contains accurate, current records, and the
schema used to store the transactional database is the entity model (usually 3NF).

OLAP (On-Line Analytical Processing) is characterized by a relatively low volume of transactions.
Queries are often very complex and involve aggregations. For OLAP operations, response time is an
effectiveness measure. OLAP applications are widely used in data mining. An OLAP database contains
aggregated, historical information, stored in multi-dimensional schemas (generally the star schema).

Following are the differences between OLTP and OLAP systems.

1) Users: OLTP systems are designed for office workers, while OLAP systems are designed for decision-
makers. Therefore, while an OLTP system may be accessed by hundreds or even thousands of clients in a
huge enterprise, an OLAP system is likely to be accessed only by a select class of managers and may be
used by only dozens of users.
2) Functions: OLTP systems are mission-critical. They support the day-to-day operations of an enterprise
and are largely performance- and availability-driven. They carry out simple, repetitive
operations. OLAP systems are management-critical; they support the decision tasks of an enterprise
through detailed investigation.
3) Nature: Although SQL queries return a set of data, OLTP systems are designed to process one record at
a time, for example, a record related to a customer who may be on the phone or in the store. OLAP systems
are not designed to deal with individual customer records. Instead, they involve queries that deal with
many records at a time and provide summary or aggregate information to a manager. OLAP applications
involve data stored in a data warehouse that has been extracted from many tables, and possibly from more
than one enterprise database.
4) Design: OLTP database operations are designed to be application-oriented, while OLAP operations
are designed to be subject-oriented. OLTP systems view the enterprise's records as a collection of tables
(possibly based on an entity-relationship model). OLAP operations view enterprise information as
multidimensional data.
5) Data: OLTP systems usually deal only with the current status of data. For example, a record about an
employee who left three years ago may not be available in the Human Resources system; the old data
may have been archived on some type of stable storage media and may not be accessible online. On the
other hand, OLAP systems need historical data over several years, since trends are often essential in
decision making.
6) Kind of use: OLTP systems are used for read and write operations, while OLAP systems usually
do not update the data.
7) View: An OLTP system focuses primarily on the current data within an enterprise or department,
without referring to historical data or data in other organizations. In contrast, an OLAP system spans
multiple versions of a database schema, due to the evolutionary process of an organization. OLAP systems
also deal with information that originates from different organizations, integrating information from
many data stores. Because of their huge volume, these data are stored on multiple storage media.
8) Access Patterns: The access pattern of an OLTP system consists primarily of short, atomic transactions.
Such a system requires concurrency control and recovery techniques. However, access to OLAP systems is
mostly read-only, because these data warehouses store historical information.
The biggest difference between an OLTP and an OLAP system is the amount of data analyzed in a
single transaction. Whereas an OLTP system handles many concurrent customers and queries touching
only a single record or a limited collection of records at a time, an OLAP system must be able to
operate on millions of records to answer a single query.
Difference between ROLAP and MOLAP

o ROLAP stands for Relational Online Analytical Processing; MOLAP stands for Multidimensional Online
Analytical Processing.
o ROLAP is usually used when the data warehouse contains relational data; MOLAP is used when the data
warehouse contains relational as well as non-relational data.
o ROLAP contains an analytical server; MOLAP contains an MDDB server.
o ROLAP creates a multidimensional view of the data dynamically; MOLAP contains prefabricated data cubes.
o ROLAP is very easy to implement; MOLAP is difficult to implement.
o ROLAP has a high response time; MOLAP has a lower response time due to its prefabricated cubes.
o ROLAP requires less memory; MOLAP requires a large amount of memory.


Types of OLAP
There are three main types of OLAP servers, as follows:

ROLAP stands for Relational OLAP, an application based on relational DBMSs.

MOLAP stands for Multidimensional OLAP, an application based on multidimensional DBMSs.

HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.

Relational OLAP (ROLAP) Server

These are intermediate servers that stand between a relational back-end server and user front-end
tools. They use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to provide the missing pieces. ROLAP servers contain optimizations for each DBMS back end,
implementation of aggregation navigation logic, and additional tools and services. ROLAP technology
tends to have higher scalability than MOLAP technology. ROLAP systems work primarily from the data that
resides in a relational database, where the base data and dimension tables are stored as relational
tables. This model permits multidimensional analysis of data. The technique relies on manipulating the
data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing
functionality. In essence, each method of slicing and dicing is equivalent to adding a "WHERE" clause
to the SQL statement.
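For instance, slicing the fact_sales star schema assumed earlier on a single year reduces, in this
hedged sketch, to nothing more than a WHERE clause; the column names and the year are illustrative.

# The "slice" on the time dimension is just the WHERE predicate.
slice_query = """
SELECT l.city, SUM(s.rs_sold) AS total_sales
FROM fact_sales s
  JOIN dim_time     t ON s.time_key     = t.time_key
  JOIN dim_location l ON s.location_key = l.location_key
WHERE t.year = 2023
GROUP BY l.city;
"""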

Relational OLAP Architecture


ROLAP Architecture includes the following components

o Database server.
o ROLAP server.
o Front-end tool.

Relational OLAP (ROLAP) has been among the fastest-growing OLAP technology segments in the market.
This method allows multiple multidimensional views of two-dimensional relational tables to be created,
avoiding structuring records around the desired view.

Some products in this segment have supported reliable SQL engines to handle the complexity of
multidimensional analysis. This includes creating multiple SQL statements to handle user requests,
being 'RDBMS'-aware, and being capable of generating SQL statements based on the optimizer of the
DBMS engine.

Advantages
Can handle large amounts of information: the data size limitation of ROLAP technology depends on the
data size of the underlying RDBMS, so ROLAP itself does not restrict the data amount.

Can leverage functionalities inherent in the relational database: the RDBMS already comes with a lot
of features, so ROLAP technologies (which work on top of the RDBMS) can take advantage of these
functionalities.

Disadvantages
Performance can be slow: each ROLAP report is essentially a SQL query (or multiple SQL queries) against
the relational database, so query time can be prolonged if the underlying data size is large.

Limited by SQL functionalities: ROLAP technology relies on developing SQL statements to query
the relational database, and SQL statements do not suit all needs.

Multidimensional OLAP (MOLAP) Server

A MOLAP system is based on a native logical model that directly supports multidimensional data and
operations. Data are stored physically into multidimensional arrays, and positional techniques are used to
access them.

One of the significant distinctions of MOLAP compared with ROLAP is that the data are summarized and
stored in an optimized format in a multidimensional cube instead of in a relational database. In the
MOLAP model, data are structured into proprietary formats according to clients' reporting requirements,
with the calculations pre-generated on the cubes.

MOLAP Architecture
MOLAP Architecture includes the following components

o Database server.
o MOLAP server.
o Front-end tool.

MOLAP structures primarily read precompiled data, and have limited capabilities for dynamically
creating aggregations or for evaluating results that have not been pre-calculated and stored.

Applications requiring iterative and comprehensive time-series analysis of trends are well suited for
MOLAP technology (e.g., financial analysis and budgeting).

Examples include Arbor Software's Essbase, Oracle's Express Server, Pilot Software's Lightship Server,
Sinper's TM/1, Planning Science's Gentium, and Kenan Technology's Multiway.

Some of the problems faced by clients are related to maintaining support for multiple subject areas in
an RDBMS. Some vendors can solve these problems by allowing access from MOLAP tools to detailed data
in an RDBMS.

This can be very useful for organizations with performance-sensitive multidimensional analysis
requirements that have built, or are in the process of building, a data warehouse architecture
containing multiple subject areas.

An example would be the creation of sales data measured by several dimensions (e.g., product and sales
region) to be stored and maintained in a persistent structure. This structure would be provided to
reduce the application overhead of performing calculations and building aggregations during
initialization. These structures can be automatically refreshed at predetermined intervals established
by an administrator.

Advantages
Excellent performance: a MOLAP cube is built for fast information retrieval and is optimal for slicing
and dicing operations.
Can perform complex calculations: all evaluations have been pre-generated when the cube is created.
Hence, complex calculations are not only possible, but they return quickly.

Disadvantages
Limited in the amount of information it can handle: because all calculations are performed when the
cube is built, it is not possible to include a large amount of data in the cube itself.

Requires additional investment: cube technology is generally proprietary and may not already exist in
the organization. Therefore, adopting MOLAP technology is likely to require additional investments in
human and capital resources.

Hybrid OLAP (HOLAP) Server

HOLAP incorporates the best features of MOLAP and ROLAP into a single architecture. HOLAP systems
store larger quantities of detailed data in the relational tables, while the aggregations are stored
in pre-calculated cubes. HOLAP can also drill through from the cube down to the relational tables for
detailed data. Microsoft SQL Server 2000 provides a hybrid OLAP server.

Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
2. It provides fast access at all levels of aggregation.
3. HOLAP balances the disk space requirement, as it only stores the aggregate information on the OLAP
server while the detail records remain in the relational database, so no duplicate copy of the detail
records is maintained.

Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.

Other Types

There are also less popular types of OLAP upon which one may stumble every so often. Some of the less
popular variants in the OLAP industry are listed below.

Web-Enabled OLAP (WOLAP) Server


WOLAP refers to an OLAP application that is accessible via a web browser. Unlike traditional
client/server OLAP applications, WOLAP has a three-tiered architecture consisting of three components:
a client, middleware, and a database server.
Desktop OLAP (DOLAP) Server
DOLAP permits a user to download a section of the data from the database or source, and work with that
dataset locally, or on their desktop.

Mobile OLAP (MOLAP) Server


Mobile OLAP enables users to access and work on OLAP data and applications remotely through the use
of their mobile devices.

Spatial OLAP (SOLAP) Server


SOLAP includes the capabilities of both Geographic Information Systems (GIS) and OLAP into a single user
interface. It facilitates the management of both spatial and non-spatial data.

Three-Tier Data Warehouse Architecture


Data Warehouses usually have a three-level (tier) architecture that includes:

1. Bottom Tier (Data Warehouse Server)


2. Middle Tier (OLAP Server)
3. Top Tier (Front end Tools).

A bottom-tier that consists of the Data Warehouse server, which is almost always an RDBMS. It may
include several specialized data marts and a metadata repository.

Data from operational databases and external sources (such as user profile data provided by external
consultants) are extracted using application program interfaces called gateways. A gateway is provided
by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

Examples of gateways include ODBC (Open Database Connectivity) and OLE-DB (Object Linking and
Embedding, Database) by Microsoft, and JDBC (Java Database Connectivity).

A middle tier consists of an OLAP server for fast querying of the data warehouse. The OLAP server is
implemented using either:

(1) a Relational OLAP (ROLAP) model, i.e., an extended relational DBMS that maps operations on
multidimensional data to standard relational operations; or

(2) a Multidimensional OLAP (MOLAP) model, i.e., a special-purpose server that directly implements
multidimensional data and operations.

A top tier contains front-end tools for displaying results provided by OLAP, as well as additional
tools for data mining of the OLAP-generated data. The overall Data Warehouse architecture is shown in
the figure.

The metadata repository stores information that defines DW objects. It includes the following
parameters and information for the middle- and top-tier applications:

1. A description of the DW structure, including the warehouse schema, dimensions, hierarchies, data
mart locations and contents, etc.
2. Operational metadata, which usually describes the currency level of the stored data (active,
archived, or purged) and warehouse monitoring information (usage statistics, error reports, audits,
etc.).
3. System performance data, which includes indices used to improve data access and retrieval
performance.
4. Information about the mapping from operational databases, which includes source RDBMSs and their
contents, cleaning and transformation rules, etc.
5. Summarization algorithms, predefined queries and reports, and business data, which include business
terms and definitions, ownership information, etc.

Distributed Data Warehouses


The concept of a distributed data warehouse suggests two types of distributed data warehouses and
their modifications: local enterprise warehouses, which are distributed throughout the enterprise, and
a global warehouse, as shown in the figure:

Characteristics of Local data warehouses


o Activity appears at the local level
o The bulk of the operational processing is performed locally
o The local site is autonomous
o Each local data warehouse has its own unique architecture and data contents
o The data is unique and of prime importance to that locality only
o The majority of the records are local and not replicated
o Any intersection of data between local data warehouses is circumstantial
o Local warehouses serve different technical communities
o The scope of a local data warehouse is confined to the local site
o Local warehouses also include historical data and are integrated only within the local site.
Virtual Data Warehouses
A virtual data warehouse is created in the following stages:
1. Installing a set of data access, data dictionary, and process management facilities.
2. Training end-clients.
3. Monitoring how the DW facilities are used.
4. Based upon actual usage, physically creating the data warehouse to provide the high-frequency results.
This strategy means that end users are allowed to access operational databases directly, using whatever
tools are implemented in the data access network. It provides ultimate flexibility as well as the
minimum amount of redundant data that must be loaded and maintained. The data warehouse is a great
idea, but it is difficult to build and requires investment. Why not use a cheap and fast method:
eliminate the transformation phase, the repositories for metadata, and the separate database? This
method is termed the 'virtual data warehouse.'
To accomplish this, there is a need to define four kinds of data:
1. A data dictionary including the definitions of the various databases.
2. A description of the relationship between the data components.
3. A description of how the user will interface with the system.
4. The algorithms and business rules that describe what to do and how to do it.
Disadvantages
1. Since queries compete with production transactions, performance can be degraded.
2. There is no metadata, no summary records, and no individual DSS (Decision Support System)
integration or history. All queries must be repeated, causing an additional burden on the system.
3. There is no refreshing process, causing the queries to be very complex.

Warehouse Manager
The warehouse manager is responsible for the warehouse management process. It consists of third-party
system software, C programs, and shell scripts. The size and complexity of a warehouse manager vary
between specific solutions.
Warehouse Manager Architecture
A warehouse manager includes the following −
• The controlling process
• Stored procedures or C with SQL
• Backup/Recovery tool
• SQL scripts

Functions of Warehouse Manager


A warehouse manager performs the following functions −
• Analyzes the data to perform consistency and referential integrity checks.
• Creates indexes, business views, and partition views against the base data.
• Generates new aggregations and updates the existing aggregations.
• Generates normalizations.
• Transforms and merges the source data of the temporary store into the published data warehouse.
• Backs up the data in the data warehouse.
• Archives the data that has reached the end of its captured life.
Note − A warehouse manager analyzes query profiles to determine whether the indexes and aggregations
are appropriate.
SECTION - B
Data Mining Query Language
Data Mining is a process in which useful data are extracted and processed from a heap of
unprocessed raw data. By aggregating these datasets into a summarized format, many problems arising in
finance, marketing, and many other fields can be solved. In the modern world, with its enormous data,
Data Mining is one of the growing fields of technology, with applications in many industries we depend
on in our lives. Much development and research have been carried out in this field, and many systems
have been proposed. Since there are numerous processes and functions to be performed in Data Mining, a
well-developed user interface is needed. Even though there are many well-developed user interfaces for
relational systems, Han, Fu, Wang, et al. proposed the Data Mining Query Language (DMQL) to support
building more advanced systems and to enable further research in this field. DMQL cannot be considered
a standard language; it is a derived language that serves as a general query language for performing
data mining tasks. DMQL is implemented in the DBMiner system for collecting data from several layers
of databases.

Ideas in designing DMQL: DMQL is designed based on the Structured Query Language (SQL), which is a
relational query language.
• Data Mining request: For a given data mining task, the corresponding datasets must be specified in
the form of a data mining request. For example, since the user can request any specific part of a
dataset in the database, the data miner can use a database query to retrieve the suitable datasets
before the process of data mining. If retrieving that specific data directly is not possible, the data
miner collects supersets from which the required data can be derived. This demonstrates the need for a
query language in data mining, acting as one of its subtasks. Since the extraction of relevant data
from huge datasets cannot be performed manually, many development methods exist in data mining; even
so, the task of collecting the relevant data requested by the user may sometimes fail. With DMQL, a
command can retrieve specific datasets or data from the database, which gives the desired result to
the user and provides a satisfying experience in fulfilling users' expectations.
• Background Knowledge: Prior knowledge of datasets and their relationships in a database helps in
mining the data. Knowing the relationships, or any other useful information, can ease the process of
extraction and aggregation. For instance, the concept hierarchy over a number of datasets can increase
the efficiency and accuracy of the process by making it easy to collect the desired data. Knowing the
hierarchy, the data can be generalized with ease.
• Generalization: When the data in the datasets of a data warehouse are not generalized, they are
often in the form of unprocessed primitive integrity constraints and roughly associated multi-valued
datasets and dependencies. Using the generalization concept through a query language can help process
the raw data into a precise abstraction. It also supports multi-level collection of data with quality
aggregation. For larger databases, generalization plays a major role in producing desirable results at
a conceptual level of data collection.
• Flexibility and Interaction: To avoid collecting less desirable or unwanted data from databases,
efficient exposure values, or thresholds, must be specified for flexible data mining and to provide
the interaction that makes the user experience engaging. Such threshold values can be provided with
data mining queries.
The four parameters of data mining (a sketch of a DMQL request combining them follows this list):
• The first parameter is the set of relevant data to be fetched from the database, specified in the
form of a relational query. By specifying this primitive, the relevant data are retrieved.
• The second parameter is the kind of knowledge to be mined. This primitive includes generalization,
association, classification, characterization, and discrimination rules.
• The third parameter is the hierarchy of datasets, the generalization relation, or background
knowledge, as discussed earlier in the design of DMQL.
• The final parameter is the interestingness of the patterns found, which can be represented by
specific threshold values that in turn depend on the type of rules used in data mining.
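As a hedged sketch of how these parameters combine in practice, the request below follows the general
DMQL syntax proposed by Han et al.; the database name, relations, attributes, and threshold values are
illustrative assumptions, not taken from the text.

# A hypothetical DMQL request held as a Python string. The FROM/WHERE
# clauses select the relevant data (parameter 1), "mine associations"
# names the kind of knowledge (parameter 2), and the thresholds bound
# the interestingness of the reported rules (parameter 4).
dmql_request = """
use database AllElectronics_db
mine associations as buying_patterns
in relevance to C.age, I.type, P.amount
from customer C, item I, purchases P
where C.cust_id = P.cust_id and I.item_id = P.item_id
with support threshold = 0.05
with confidence threshold = 0.7
"""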
Kinds of thresholds in rule mining:
In the process of data mining, maintaining a set of threshold values is very important in extracting
useful and engaging datasets from a heap of data. This threshold value also helps in measuring the
relevance of the data and it helps in a driving search for interesting datasets.
The types of thresholds in rule mining can be categorized into three classes.
• Significance Threshold: To be presented by the data mining process, a dataset must be verified as
having at least some rationally significant evidence of a pattern within itself. In mining association
rules, this is called the minimum support threshold, and the patterns that meet this minimum support
threshold are called frequent itemsets. For characteristic rules, it is called the noise threshold;
patterns that cannot cross this threshold are treated as noise.
• Rule Redundancy Threshold: This threshold prevents redundancy in the rules that are going to be
presented; that is, the rules to be reported should not be the same as existing ones.
• Rule Confidence Threshold: For a rule (X -> Y), the confidence is the probability of Y under the
condition X; this probability must pass the rule confidence threshold for the rule to be reported.
Example: a small transaction table (1 indicates the item is present in the transaction):

Transaction ID   Rice   Pulse   Oil   Milk   Apple
t1               1      1       1     0      0
t2               0      1       1     1      0
t3               0      0       0     1      1
t4               1      1       0     1      0
t5               1      1       1     0      1
t6               1      1       1     1      1
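A minimal Python sketch of the support and confidence computations described above, run over this
transaction table; the rule and the 0.5/0.7 threshold values are illustrative assumptions.

# Transactions t1..t6 from the table, as sets of purchased items.
transactions = [
    {"Rice", "Pulse", "Oil"},
    {"Pulse", "Oil", "Milk"},
    {"Milk", "Apple"},
    {"Rice", "Pulse", "Milk"},
    {"Rice", "Pulse", "Oil", "Apple"},
    {"Rice", "Pulse", "Oil", "Milk", "Apple"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(x, y):
    # P(Y | X) for the rule X -> Y.
    return support(x | y) / support(x)

min_support, min_confidence = 0.5, 0.7  # illustrative thresholds

x, y = {"Rice"}, {"Pulse"}
print(support(x | y))    # 4/6 ~ 0.67: above min_support, so frequent
print(confidence(x, y))  # 4/4 = 1.0: above min_confidence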
