R18CSE4102-UNIT 1 Data Mining Notes

This document discusses the key components of data warehousing and online analytical processing (OLAP). It begins by explaining the basic concepts of data warehousing and how it helps with maintaining historical records and analyzing data to improve business understanding. It then describes the typical components of a data warehouse including the source data, data staging, data storage, information delivery, and metadata components. It also provides details on relational OLAP (ROLAP), multidimensional OLAP (MOLAP), hybrid OLAP (HOLAP), and specialized SQL servers. Finally, it outlines common OLAP operations like roll-up, drill-down, slice, dice, and pivot.


(R18CSE4102) - Data Mining

UNIT I Data Warehousing, Business Analysis and On-Line Analytical Processing (OLAP) :
Basic Concepts – Data Warehousing Components – Building a Data Warehouse – Database
Architectures for Parallel Processing – Parallel DBMS Vendors – Multidimensional Data
Model – Data Warehouse Schemas for Decision Support, Concept Hierarchies -
Characteristics of OLAP Systems – Typical OLAP Operations, OLAP and OLTP.

Data Warehousing, Business Analysis and On-Line Analytical Processing (OLAP):


An Online Analytical Processing (OLAP) server is based on the multidimensional data model. It
allows managers and analysts to gain insight into information through fast, consistent, and
interactive access to it. This chapter covers the types of OLAP servers, operations on OLAP, and
the differences between OLAP, statistical databases, and OLTP.

Types of OLAP Servers


There are four types of OLAP servers −
 Relational OLAP (ROLAP)
 Multidimensional OLAP (MOLAP)
 Hybrid OLAP (HOLAP)
 Specialized SQL Servers

Relational OLAP
ROLAP servers are placed between a relational back-end server and client front-end tools. To
store and manage warehouse data, ROLAP uses a relational or extended-relational DBMS.
ROLAP includes the following −
 Implementation of aggregation navigation logic.
 Optimization for each DBMS back end.
 Additional tools and services.

Multidimensional OLAP
MOLAP uses array-based multidimensional storage engines to provide multidimensional views of data.
With multidimensional data stores, storage utilization may be low if the data set is sparse.
Therefore, many MOLAP servers use two levels of data storage representation to handle dense
and sparse data sets.

Hybrid OLAP
Hybrid OLAP is a combination of ROLAP and MOLAP. It offers the higher scalability of
ROLAP and the faster computation of MOLAP. HOLAP servers allow large volumes of detailed
data to be stored, while the aggregations are kept separately in a MOLAP store.

Specialized SQL Servers


Specialized SQL servers provide advanced query language and query processing support for
SQL queries over star and snowflake schemas in a read-only environment.

OLAP Operations
Since OLAP servers are based on a multidimensional view of data, we will discuss OLAP
operations on multidimensional data.
Here is the list of OLAP operations −
 Roll-up
 Drill-down
 Slice and dice
 Pivot (rotate)

Roll-up
Roll-up performs aggregation on a data cube in any of the following ways −
 By climbing up a concept hierarchy for a dimension
 By dimension reduction
The following diagram illustrates how roll-up works.
 Roll-up is performed by climbing up a concept hierarchy for the dimension location.
 Initially the concept hierarchy was "street < city < province < country".
 On rolling up, the data is aggregated by ascending the location hierarchy from the level of
city to the level of country.
 The data is grouped into countries rather than cities.
 When roll-up is performed by dimension reduction, one or more dimensions are removed from the data cube.
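As an illustrative sketch, assuming a small hypothetical fact table, rolling up the location hierarchy from city to country amounts to a grouped aggregation (shown here with pandas):

```python
import pandas as pd

# Hypothetical sales data recorded at the city level of the location hierarchy
sales = pd.DataFrame({
    "country": ["Canada", "Canada", "India", "India"],
    "city":    ["Toronto", "Vancouver", "Delhi", "Mumbai"],
    "sales":   [400, 300, 250, 150],
})

# Roll-up: climb the location hierarchy from city to country,
# aggregating the sales measure at the coarser level
rolled_up = sales.groupby("country", as_index=False)["sales"].sum()
print(rolled_up)
```

The city column disappears from the result, illustrating how roll-up produces a coarser, more aggregated view of the cube.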

Drill-down
Drill-down is the reverse operation of roll-up. It is performed by either of the following ways −
 By stepping down a concept hierarchy for a dimension
 By introducing a new dimension.
The following diagram illustrates how drill-down works −

 Drill-down is performed by stepping down a concept hierarchy for the dimension time.
 Initially the concept hierarchy was "day < month < quarter < year."
 On drilling down, the time dimension is descended from the level of quarter to the level of
month.
 When drill-down is performed by introducing a new dimension, one or more dimensions are added to the data cube.
 It navigates from less detailed data to more detailed data.
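As a sketch with hypothetical data, drill-down on the time hierarchy can be seen as moving from a quarter-level summary back to the month-level rows behind it:

```python
import pandas as pd

# Hypothetical sales stored at the month level of the time hierarchy
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "sales":   [100, 120, 130, 90, 110, 95],
})

# Summary view at the quarter level
by_quarter = sales.groupby("quarter", as_index=False)["sales"].sum()

# Drill-down: descend the time hierarchy from quarter to month,
# exposing the more detailed monthly figures behind each quarter
by_month = sales[["quarter", "month", "sales"]]
print(by_quarter)
print(by_month)
```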
Slice
The slice operation selects one particular dimension from a given cube and provides a new
sub-cube. Consider the following diagram that shows how slice works.

 Here slice is performed on the dimension "time" using the criterion time = "Q1".
 It forms a new sub-cube by fixing a single value along that one dimension.
Dice
Dice selects two or more dimensions from a given cube and provides a new sub-cube. Consider
the following diagram that shows the dice operation.
The dice operation on the cube, based on the following selection criteria, involves three
dimensions:
 (location = "Toronto" or "Vancouver")
 (time = "Q1" or "Q2")
 (item = "Mobile" or "Modem")
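As an illustrative sketch, assuming a small hypothetical cube held as a fact table, slice and dice are simple selections on the dimension columns:

```python
import pandas as pd

# Hypothetical cube held as a fact table with three dimensions
cube = pd.DataFrame({
    "location": ["Toronto", "Vancouver", "Toronto", "Delhi"],
    "time":     ["Q1", "Q1", "Q2", "Q3"],
    "item":     ["Mobile", "Modem", "Mobile", "Mobile"],
    "sales":    [605, 825, 680, 500],
})

# Slice: fix one dimension with a single criterion (time = "Q1")
slice_q1 = cube[cube["time"] == "Q1"]

# Dice: select on two or more dimensions to form a smaller sub-cube
dice = cube[
    cube["location"].isin(["Toronto", "Vancouver"])
    & cube["time"].isin(["Q1", "Q2"])
    & cube["item"].isin(["Mobile", "Modem"])
]
print(slice_q1)
print(dice)
```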
Pivot
The pivot operation is also known as rotation. It rotates the data axes in the view in order to
provide an alternative presentation of the data. Consider the following diagram that shows the
pivot operation.
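A minimal sketch of pivoting, assuming a hypothetical 2-D view with items as rows and quarters as columns, is simply a transpose of the presentation:

```python
import pandas as pd

# Hypothetical 2-D view of a cube: sales by item and quarter
sales = pd.DataFrame({
    "item":    ["Mobile", "Mobile", "Modem", "Modem"],
    "quarter": ["Q1", "Q2", "Q1", "Q2"],
    "sales":   [605, 680, 825, 952],
})

# Original presentation: items as rows, quarters as columns
view = sales.pivot(index="item", columns="quarter", values="sales")

# Pivot (rotate): swap the axes so quarters become rows and items columns
rotated = view.T
print(view)
print(rotated)
```

The underlying data is unchanged; only the presentation axes are rotated.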
Basic Concept of Data warehouse:

A data warehouse is a database designed to enable business intelligence activities: it exists to
help users understand and enhance their organization's performance. It is designed for query and
analysis rather than for transaction processing, and usually contains historical data derived
from transaction data, though it can include data from other sources. Data warehouses separate
the analysis workload from the transaction workload and enable an organization to consolidate
data from several sources.
This helps in:
 Maintaining historical records
 Analyzing the data to gain a better understanding of the business and to improve the
business
Data warehousing Components:

Components or Building Blocks of Data Warehouse

Architecture is the proper arrangement of the elements. We build a data warehouse with software
and hardware components. To suit the requirements of our organization, we arrange these
building blocks in a particular way, and we may want to strengthen one part with extra tools and
services. All of this depends on our circumstances.

The figure shows the essential elements of a typical warehouse. The Source Data component
appears on the left. The Data Staging element serves as the next building block. In the middle is
the Data Storage component that manages the data warehouse's data. This element not only stores
and manages the data; it also keeps track of the data using the metadata repository. The
Information Delivery component, shown on the right, consists of all the different ways of making
the information in the data warehouse available to users.

Source Data Component

Source data coming into the data warehouses may be grouped into four broad categories:

Production Data: This type of data comes from the various operational systems of the enterprise.
Based on the data requirements of the data warehouse, we choose segments of the data from the
various operational systems.
Internal Data: In each organization, users keep their "private" spreadsheets, reports, customer
profiles, and sometimes even departmental databases. This is the internal data, part of which
could be useful in a data warehouse.

Archived Data: Operational systems are mainly intended to run the current business. In
every operational system, we periodically take the old data and store it in archived files.

External Data: Most executives depend on information from external sources for a large
percentage of the information they use. They use statistics relating to their industry
produced by external agencies.

Data Staging Component

After data has been extracted from the various operational systems and external sources, we
have to prepare it for storage in the data warehouse. The extracted data coming from
several different sources needs to be changed, converted, and made ready in a format that is
suitable for querying and analysis.

We will now discuss the three primary functions that take place in the staging area.

1) Data Extraction: This function has to deal with numerous data sources, and we have to employ
the appropriate technique for each one.
2) Data Transformation: As we know, data for a data warehouse comes from many different
sources. If data extraction for a data warehouse poses big challenges, data transformation
presents even more significant ones. We perform several individual tasks as part of data
transformation.

First, we clean the data extracted from each source. Cleaning may be the correction of
misspellings, the provision of default values for missing data elements, or the elimination of
duplicates when we bring in the same data from various source systems.

Standardization of data elements forms a large part of data transformation. Data transformation
also involves many forms of combining pieces of data from different sources: we combine data
from a single source record or related data parts from many source records.
On the other hand, data transformation also includes purging source data that is not useful and
separating out source records into new combinations. Sorting and merging of data take place on a
large scale in the data staging area. When the data transformation function ends, we have a
collection of integrated data that is cleaned, standardized, and summarized.
3) Data Loading: Two distinct categories of tasks form the data loading function. When we
complete the design and construction of the data warehouse and go live for the first time, we
do the initial loading of the data into the data warehouse storage. The initial load moves
high volumes of data and uses up a substantial amount of time; after that, incremental loads
feed the ongoing changes into the warehouse.
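The staging functions above can be sketched in miniature. This is only an illustration: the source rows, field names, and cleaning rules are all hypothetical.

```python
# A minimal extract-transform-load sketch; all source data and field
# names here are hypothetical illustrations of the staging functions.

raw_orders = [  # rows extracted from a hypothetical operational source
    {"cust": "  Alice ", "city": "NY",  "amount": "120.50"},
    {"cust": "Bob",      "city": None,  "amount": "75.00"},
    {"cust": "  Alice ", "city": "NY",  "amount": "120.50"},  # duplicate
]

def transform(rows):
    """Clean, standardize, and de-duplicate the extracted rows."""
    seen, cleaned = set(), []
    for r in rows:
        rec = (
            r["cust"].strip(),        # clean stray whitespace
            r["city"] or "UNKNOWN",   # default value for missing data
            float(r["amount"]),       # convert to a uniform numeric type
        )
        if rec not in seen:           # eliminate duplicates
            seen.add(rec)
            cleaned.append(rec)
    return cleaned

warehouse = []                         # stands in for warehouse storage
warehouse.extend(transform(raw_orders))  # the initial load
print(warehouse)
```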
Data Storage Components
Data storage for the data warehouse is a separate repository. The data repositories for the
operational systems generally include only the current data; they also hold the data structured
in a highly normalized form for fast and efficient processing. The data warehouse repository, by
contrast, keeps large volumes of historical data for analysis.
Information Delivery Component
The information delivery element is used to enable the process of subscribing to data warehouse
files and having them transferred to one or more destinations according to some user-specified
scheduling algorithm.
Metadata Component
Metadata in a data warehouse is similar to the data dictionary or the data catalog in a database
management system. In the data dictionary, we keep the data about the logical data structures,
the data about the records and addresses, the information about the indexes, and so on.
Data Marts
A data mart includes a subset of corporate-wide data that is of value to a specific group of
users; its scope is confined to particular selected subjects. Data in a data warehouse should be
fairly current, though not necessarily up to the minute, although developments in the data
warehouse industry have made regular incremental data loads more achievable. Data marts are
smaller than data warehouses and usually contain data for a single department or subject area.
The current trend in data warehousing is to develop a data warehouse with several smaller
related data marts for particular kinds of queries and reports.
Management and Control Component
The management and control elements coordinate the services and functions within the data
warehouse. These components control the data transformation and the data transfer into the data
warehouse storage; they also moderate the data delivery to the clients. They work with the
database management system and ensure that data is correctly stored in the repositories. They
monitor the movement of data into the staging area and from there into the data warehouse
storage itself.
Database Architectures for Parallel Processing
Shared memory system
A shared memory system uses multiple processors attached to a global shared memory via an
interconnection channel or communication bus.
Shared memory systems have a large amount of cache memory at each processor, so repeated
referencing of the shared memory is reduced.
If a processor performs a write operation to a memory location, the cached copies of that
location at the other processors must be updated or invalidated.

Advantages of Shared memory system


 Data is easily accessible to any processor.
 One processor can send messages to the others efficiently.
Disadvantages of Shared memory system
 Waiting time of processors increases as more processors are added.
 The shared bus becomes a bandwidth bottleneck.
Shared Disk System
 A shared disk system uses multiple processors that can access multiple disks via an
interconnection channel, and every processor has local memory.
 Each processor has its own memory, so memory access is efficient; data sharing takes place
through the shared disks.
 Systems built this way are called clusters.
Advantages of Shared Disk System
 Fault tolerance is achieved using shared disk system.
Fault tolerance: If a processor (or its memory) fails, the other processors can complete the
task. This is called fault tolerance.
Disadvantage of Shared Disk System
 A shared disk system has limited scalability, as a large amount of data travels through the
interconnection channel.
 If more processors are added, the existing processors are slowed down.
Applications of Shared Disk System

Digital Equipment Corporation (DEC): DEC clusters running the Rdb relational database used the
shared disk system; Rdb is now owned by Oracle.

Parallel DBMS Vendors

A parallel database is one that involves multiple processors working in parallel on the
database to provide its services.
A parallel database system seeks to improve performance through parallelization of various
operations, such as loading data, building indexes, and evaluating queries. Parallel systems
improve processing and I/O speeds by using multiple CPUs and disks in parallel.
Working of parallel database
Let us discuss how a parallel database works, step by step −
Step 1 − Parallel processing divides a large task into many smaller tasks and executes the
smaller tasks concurrently on several CPUs, completing the whole task more quickly.
Step 2 − The driving force behind parallel database systems is the demand of applications that
have to query extremely large databases (of the order of terabytes) or that have to process a
large number of transactions per second.
Step 3 − In parallel processing, many operations are performed simultaneously, as opposed to
serial processing, in which the computational steps are performed sequentially.
This working of the parallel database is explained in the diagram given below −
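Step 1 above can be sketched in ordinary Python. This is only an illustration of dividing a task into subtasks executed concurrently; a real parallel DBMS spreads work across multiple CPUs and disks (note that CPython threads share one interpreter, so this sketch shows the decomposition rather than a true speedup):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical large task: summing one million numbers, divided into
# four smaller subtasks that workers execute concurrently
def partial_sum(bounds):
    lo, hi = bounds
    return sum(range(lo, hi))

chunks = [(0, 250_000), (250_000, 500_000),
          (500_000, 750_000), (750_000, 1_000_000)]

with ThreadPoolExecutor(max_workers=4) as pool:
    # Each worker computes one partial result; the results are combined
    total = sum(pool.map(partial_sum, chunks))

print(total)  # 499999500000, the same answer as the single large task
```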
Performance measures
There are two main measures of performance of a database system, which are explained below −
Throughput − the number of tasks that can be completed in a given time interval. A system
that processes a large number of small transactions can improve throughput by processing many
transactions in parallel.
Response time − the amount of time it takes to complete a single task from the time it is
submitted. A system that processes large transactions can improve response time, as well as
throughput, by performing the subtasks of each transaction in parallel.
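The two measures can be made concrete with some back-of-the-envelope arithmetic; the numbers here are purely illustrative:

```python
# Illustrative numbers only: how throughput and response time behave
# when work is spread across parallel workers.
tasks = 1200                     # small transactions completed
interval_s = 60                  # measurement window in seconds
throughput = tasks / interval_s  # tasks completed per second
print(throughput)                # 20.0 tasks/second

# One large transaction split into 4 equal parallel subtasks
serial_time_s = 8.0              # time if run on a single processor
workers = 4
response_time_s = serial_time_s / workers  # ideal case, ignoring overheads
print(response_time_s)           # 2.0 seconds
```

In practice coordination overheads keep the real speedup below this ideal, but the direction of the effect is the same.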
Benefits of parallel Database
The benefits of the parallel database are explained below −
Speed
Speed is the main advantage of parallel databases. The server breaks up a user's database
request into parts and sends each part to a separate computer.
The computers work on the pieces and combine the outputs, returning them to the user. This
speeds up most requests for data, so that large databases can be queried more quickly.
Capacity
As more users request access to the database, network administrators add more machines to the
parallel server, increasing its overall capacity.
For example, a parallel database enables a large online store to give thousands of users
simultaneous access to its information. With a single server, this level of performance is not
feasible.
Reliability
Despite the failure of any computer in the cluster, a properly configured parallel database will
continue to work. The database server senses that there is no response from a single computer
and redirects its function to the other computers.
Many companies, such as online retailers, want their database to be accessible as much of the
time as possible. This is where a parallel database stands out.
This approach also helps technicians conduct scheduled maintenance on a computer-by-computer
basis: they send the server a command to take the affected machine offline, then perform the
required maintenance and updates.

Multidimensional Data Model:


What is Multi-Dimensional Data Model?

A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.

The dimensions are the perspectives or entities concerning which an organization keeps records.
For example, a shop may create a sales data warehouse to keep records of the store's sales for the
dimensions time, item, and location. These dimensions allow the store to keep track of things, for
example, monthly sales of items and the locations at which the items were sold. Each dimension
has a table related to it, called a dimensional table, which describes the dimension further. For
example, a dimensional table for an item may contain the attributes item_name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This
theme is represented by a fact table. Facts are numerical measures. The fact table contains the
names of the facts or measures of the related dimensional tables.

Consider the data of a shop for items sold per quarter in the city of Delhi. The data is shown in
the table. In this 2D representation, the sales for Delhi are shown for the time dimension
(organized in quarters) and the item dimension (classified according to the types of an item sold).
The fact or measure displayed is rupees_sold (in thousands).
Now suppose we want to view the sales data with a third dimension. For example, suppose the
data is organized according to time and item, as well as location, for the cities Chennai,
Kolkata, Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are
represented as a series of 2D tables.

Conceptually, it may also be represented by the same data in the form of a 3D data cube, as
shown in fig:
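The "series of 2D tables" view can be sketched directly; the fact rows and figures below are hypothetical:

```python
import pandas as pd

# Hypothetical fact rows: a measure (rupees_sold) recorded against
# three dimensions - time, item, and location
facts = pd.DataFrame({
    "time":        ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":        ["Mobile", "Modem", "Mobile", "Modem", "Mobile", "Mobile"],
    "location":    ["Delhi", "Delhi", "Delhi", "Delhi", "Chennai", "Chennai"],
    "rupees_sold": [605, 825, 680, 952, 300, 310],
})

# Render the 3D cube as a series of 2D tables, one per location
for loc, grp in facts.groupby("location"):
    table = grp.pivot_table(index="item", columns="time",
                            values="rupees_sold", aggfunc="sum")
    print(loc)
    print(table)
```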
Data warehouse Schemas:
Introduction to Data warehouse Schema
The Data Warehouse Schema is a structure that logically defines the contents of the Data
Warehouse, facilitating the operations performed on the Data Warehouse and the maintenance
activities of the Data Warehouse system. It usually includes a detailed description of the
databases, tables, views, indexes, and data, which are regularly structured using predefined
design types such as the Star Schema, Snowflake Schema, and Galaxy Schema (also known as the
Fact Constellation Schema).
A schema is a logical description of the entire database. In a data warehouse, it includes the
names and descriptions of records, of all data items, and of the different aggregates associated
with the data. Just as a database has a schema, a schema must be maintained for a data
warehouse as well. There are different schemas based on the setup and data which are maintained
in a data warehouse.
Types of Data Warehouse Schema
Following are the three major types of schemas:
 Star Schema
 Snowflake Schema
 Galaxy Schema
Fact tables and dimension tables form the basis of any schema in the data warehouse, and they
are important to understand. A fact table holds data corresponding to a business process; every
row represents an event that can be associated with that process, and it stores the quantitative
information for analysis. A dimension table stores data about how the data in the fact table is
being analyzed; dimension tables facilitate the fact table in gathering the different dimensions
along which the measures are to be taken.
Let us have a look at all these in detail.
1. Star Schema
Here are some of the basic points of star schema which are as follows:
In a star schema there is one fact table in the middle and a number of associated dimension
tables. This structure resembles a star, and hence it is known as a star schema.
The fact table here holds the primary information in the data warehouse. It is surrounded by the
smaller dimension lookup tables, which hold the details for the different dimensions. The
primary key present in each dimension table is related to a foreign key present in the fact
table.
This means the fact table has two types of columns: foreign keys to the dimension tables, and
measures, which contain numeric facts. At the center of the star there is a fact table, and the
points of the star are the dimension tables.
The fact table is typically in 3NF, while the dimension tables are in denormalized form. Every
dimension in a star schema should be represented by only one dimension table, the dimension
table should be joined to the fact table, and the fact table should have a key and measures.
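A minimal star schema can be sketched in SQL (run here through Python's sqlite3); the table and column names are illustrative, not a prescribed design:

```python
import sqlite3

# A star schema sketch: one central fact table whose foreign keys point
# at the primary keys of the surrounding dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_item     (item_id INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE dim_location (loc_id  INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE fact_sales (
    item_id INTEGER REFERENCES dim_item(item_id),
    loc_id  INTEGER REFERENCES dim_location(loc_id),
    rupees_sold REAL                      -- the numeric measure
);
INSERT INTO dim_item     VALUES (1, 'Mobile', 'BrandA'), (2, 'Modem', 'BrandB');
INSERT INTO dim_location VALUES (1, 'Delhi', 'India'), (2, 'Chennai', 'India');
INSERT INTO fact_sales   VALUES (1, 1, 605), (2, 1, 825), (1, 2, 680);
""")

# A typical star join: the fact table joined to its dimension tables
rows = con.execute("""
    SELECT i.item_name, l.city, SUM(f.rupees_sold)
    FROM fact_sales f
    JOIN dim_item i     ON f.item_id = i.item_id
    JOIN dim_location l ON f.loc_id  = l.loc_id
    GROUP BY i.item_name, l.city
    ORDER BY i.item_name, l.city
""").fetchall()
print(rows)
```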
2. Snowflake Schema
Here are some of the basic points of snowflake schema which are as follows:
The snowflake schema acts like an extended version of the star schema: additional normalized
dimension tables are added to the star schema. This schema is known as a snowflake due to its
structure.
In this schema, the centralized fact table is connected to multiple dimensions, and the
dimensions are in normalized form, split across multiple related tables. The snowflake
structure is more detailed and structured when compared to the star schema.
There are multiple levels of relationships, with child tables that have multiple parent tables.
In a snowflake schema, only the dimension tables are affected, not the fact tables.
The difference between the star and snowflake schemas is that the dimensions of the snowflake
schema are maintained in such a way that they reduce the redundancy of data. The tables are
easy to manage and maintain, and they also save storage space.
However, because of this, more joins are needed in the query in order to execute it. The further
expansion of the tables is what leads to snowflaking: when a dimension table contains
low-cardinality attributes, those attributes are split out into segregated normalized tables,
which are then joined back to the original dimension table through referential constraints. This
schema may hamper performance, as more tables are required to satisfy the joins.
The advantage of the snowflake schema is that it uses less disk space, and the implementation of
new dimensions is easy. A normalized dimension table can also be shared by several sources.
3. Fact Constellation Schema or Galaxy Schema
Here are some of the basic points of fact constellation schema which are as follows:
A fact constellation consists of multiple fact tables, two or more of which share the same
dimension tables. This schema is also known as a galaxy schema.
It is viewed as a collection of stars, and hence the name galaxy. The shared dimensions in this
schema are known as conformed dimensions. The dimensions in this schema may be separated into
segregated dimension tables having different levels of hierarchy.
As an example, we can consider a four-level geography hierarchy of region, country, state, and
city. Another way of creating a galaxy schema is by splitting one star schema into more star
schemas.
The dimensions are large and built on the basis of hierarchy. This schema is useful when
aggregation across fact tables is necessary. Fact constellations are considered more complex
than star and snowflake schemas; they are more flexible, but harder to implement and maintain.
This type of schema is usually used for sophisticated applications. The large number of tables
present in this schema makes it difficult and complex to implement, and the architecture is thus
more complex when compared to the star and snowflake schemas.

Characteristics of OLAP
In the FASMI characterization of OLAP methods, the term is derived from the first letters of the
following characteristics:

Fast
The system is targeted to deliver most responses to users within about five seconds, with the
simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is relevant for the
application and the user, while keeping it easy enough for the target user. Although some
preprogramming may be needed, the user must be able to define new ad hoc calculations as part of
the analysis and to report on the data in any desired way, without having to program; products
(such as Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility
are excluded.
Share
The system implements all the security requirements for confidentiality and, if multiple write
access is needed, concurrent update locking at an appropriate level. Not all applications need
users to write data back, but for the increasing number that do, the system should be able to
handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. An OLAP system must provide a multidimensional conceptual view of
the data, including full support for hierarchies, as this is certainly the most logical way to
analyze businesses and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity should
be handled in an efficient manner.
The main characteristics of OLAP are as follows:
1. Multidimensional conceptual view: OLAP systems let business users have a dimensional
and logical view of the data in the data warehouse. It helps in carrying slice and dice
operations.
2. Multi-User Support: Since OLAP techniques are shared, an OLAP system should provide
normal database operations, including retrieval, update, concurrency control,
integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end tools; the
OLAP operations sit between data sources (e.g., data warehouses) and an OLAP
front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database
size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for users.
9. OLAP allows users to drill down for greater detail or roll up to aggregations of metrics
along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
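The operations named above (roll-up, drill-down, slice, dice, and pivot) can be sketched on a toy fact table. The following is a minimal, illustrative example using Python's pandas library; the table, column names, and figures are invented for demonstration:

```python
import pandas as pd

# Toy fact table: one row per sale, with three dimensions and one measure.
sales = pd.DataFrame({
    "region":  ["North", "North", "South", "South", "North", "South"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q1"],
    "product": ["TV", "TV", "TV", "Radio", "Radio", "Radio"],
    "amount":  [100, 150, 200, 80, 60, 90],
})

# Roll-up: aggregate away the 'product' dimension (finer -> coarser view).
rollup = sales.groupby(["region", "quarter"])["amount"].sum()

# Drill-down: re-introduce 'product' for greater detail.
drilldown = sales.groupby(["region", "quarter", "product"])["amount"].sum()

# Slice: fix one dimension to a single value (quarter = Q1).
slice_q1 = sales[sales["quarter"] == "Q1"]

# Dice: restrict several dimensions to subsets of values.
dice = sales[(sales["region"] == "North") & (sales["product"].isin(["TV", "Radio"]))]

# Pivot: rotate the view so that regions become columns.
pivot = sales.pivot_table(index="quarter", columns="region",
                          values="amount", aggfunc="sum")
print(pivot)
```

The same pattern scales to a real fact table with millions of rows; an OLAP server differs mainly in that it precomputes and indexes these aggregates rather than scanning the raw data on every query.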
Benefits of OLAP
OLAP offers several benefits for businesses:
 OLAP helps managers in decision-making through the multidimensional record views
that it is efficient at providing, thus increasing their productivity.
 OLAP functions are self-sufficient owing to the inherent flexibility of the organized
databases they operate on.
 It facilitates simulation of business models and problems, through extensive management
of analysis-capabilities.
 In conjunction with a data warehouse, OLAP can be used to support a reduction in the
application backlog, faster data retrieval, and a reduction in query drag.
Motivations for using OLAP
1) Understanding and improving sales: For enterprises that have many products and use a
number of channels for selling them, OLAP can help in finding the most profitable products
and the most popular channels. In some cases, it may even be possible to find the most profitable
customers. For example, consider the telecommunication industry and only one
product, communication minutes: there is a large amount of data if a company wants to analyze
the sales of the product for every hour of the day (24 hours), distinguish weekdays from
weekends (2 values), and split the regions to which calls are made into 50 regions.
2) Understanding and decreasing the costs of doing business: Improving sales is one way of
improving a business; the other is to analyze costs and control them as much as possible
without affecting sales. OLAP can assist in analyzing the costs associated with sales. In some cases,
it may also be possible to identify expenditures that produce a high return on investment
(ROI). For example, recruiting a top salesperson may involve high costs, but the revenue
generated by the salesperson may justify the investment.
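As a quick check of the telecom example in point (1), the number of aggregate cells implied by those three dimensions is simply the product of their cardinalities:

```python
# Dimensions from the telecom example in point (1):
# 24 hours x 2 day types (weekday/weekend) x 50 call regions.
hours, day_types, regions = 24, 2, 50
cells = hours * day_types * regions
print(cells)  # 2400 aggregate cells per product
```

Even for a single product, the company must maintain 2,400 aggregate values; this is why OLAP engines precompute and store such cubes rather than recomputing them per query.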
OLAP Vs OLTP:
Definition of OLTP
OLTP is an Online Transaction Processing system. The main focus of an OLTP system is to
record the current updates, insertions, and deletions as transactions occur. OLTP queries
are simple and short and hence require less processing time and less space.
An OLTP database is updated frequently. A transaction may fail midway, which can affect
data integrity, so an OLTP system has to take special care of data integrity. OLTP
databases have normalized tables (3NF).
The best example of an OLTP system is an ATM, in which we modify the status of our account
using short transactions. OLTP systems are the source of data for OLAP.
Definition of OLAP
OLAP is an Online Analytical Processing system. An OLAP database stores historical data that
has been loaded from OLTP systems. It allows a user to view different summaries of multidimensional
data. Using OLAP, you can extract information from a large database and analyze it for decision
making.
OLAP also allows a user to execute complex queries to extract multidimensional data. Unlike in
OLTP, a query that fails midway in OLAP does not harm data integrity, because the user only
reads data from the large database for analysis; the user can simply run the query again and
extract the data.
OLAP transactions are long and hence take comparatively more time to process and
require more space. Transactions in OLAP are less frequent than in OLTP, and the tables in
an OLAP database may not even be normalized. Examples of OLAP use are viewing a
financial report, budgeting, marketing management, sales reports, etc.
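The contrast between the two workloads can be sketched with a small, self-contained example (an invented account table in an in-memory SQLite database): the OLTP-style statement updates one current row, while the OLAP-style query is a read-only aggregate over many rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL)")
cur.executemany("INSERT INTO account VALUES (?, ?)",
                [(1, 500.0), (2, 1200.0), (3, 300.0)])

# OLTP-style transaction: short, touches one row, changes current state
# (think of the ATM example above: withdraw 100 from account 1).
cur.execute("UPDATE account SET balance = balance - 100 WHERE id = 1")
conn.commit()

# OLAP-style query: read-only aggregation over many rows for analysis.
cur.execute("SELECT COUNT(*), SUM(balance), AVG(balance) FROM account")
count, total, avg = cur.fetchone()
print(count, total, avg)
```

In a real deployment the two queries would not run against the same database: the update would hit the operational OLTP system, and the aggregate would run against the data warehouse that the OLTP system feeds.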
Below is the difference between OLAP and OLTP in Data Warehouse:
OLTP vs OLAP

| Parameter | OLTP | OLAP |
| --- | --- | --- |
| Process | It is an online transactional system; it manages database modification. | It is an online analysis and data retrieval process. |
| Characteristic | Characterized by large numbers of short online transactions. | Characterized by a large volume of data. |
| Functionality | OLTP is an online database modifying system. | OLAP is an online database query management system. |
| Method | OLTP uses a traditional DBMS. | OLAP uses a data warehouse. |
| Query | Insert, update, and delete information in the database. | Mostly select operations. |
| Tables | Tables in an OLTP database are normalized. | Tables in an OLAP database are not normalized. |
| Source | OLTP and its transactions are the sources of data. | Different OLTP databases become the source of data for OLAP. |
| Data integrity | The OLTP database must maintain data integrity constraints. | The OLAP database is not frequently modified, so data integrity is not an issue. |
| Response time | Milliseconds. | Seconds to minutes. |
| Data quality | The data in an OLTP database is always detailed and organized. | The data in an OLAP process might not be organized. |
| Usefulness | Helps to control and run fundamental business tasks. | Helps with planning, problem-solving, and decision support. |
| Operation | Allows read/write operations. | Mostly read; rarely write. |
| Audience | It is a customer-oriented process. | It is a market-oriented process. |
| Query type | Queries are standardized and simple. | Complex queries involving aggregations. |
| Back-up | Complete backups of the data combined with incremental backups. | Needs a backup only from time to time; backup is less critical than in OLTP. |
| Design | DB design is application-oriented; e.g., the design changes with the industry (retail, airline, banking, etc.). | DB design is subject-oriented; e.g., the design changes with subjects such as sales, marketing, purchasing, etc. |
| User type | Used by data-critical users such as clerks, DBAs, and database professionals. | Used by knowledge workers such as analysts, managers, and CEOs. |
| Purpose | Designed for real-time business operations. | Designed for analysis of business measures by category and attribute. |
| Performance metric | Transaction throughput is the performance metric. | Query throughput is the performance metric. |
| Number of users | Allows thousands of users. | Allows only hundreds of users. |
| Productivity | Helps to increase users' self-service and productivity. | Helps to increase the productivity of business analysts. |
| Challenge | Data warehouses historically have been development projects that may prove costly to build. | An OLAP cube is not an open SQL Server data warehouse, so technical knowledge and experience are essential to manage the OLAP server. |
| Process speed | Provides fast results for routinely used data. | Ensures that responses to queries are consistently quick. |
| Ease of use | Easy to create and maintain. | Lets the user create a view with the help of a spreadsheet. |
| Style | Designed for fast response time and low data redundancy; normalized. | A data warehouse is created uniquely so that it can integrate different data sources to build a consolidated database. |