DWM Unit 1
A data warehouse is not used for daily operations or transaction processing; it is used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
• It is a database designed for investigative tasks, using data from various applications.
• It supports a relatively small number of clients with relatively long interactions.
• It includes current and historical data to provide a historical perspective of information.
• Its usage is read-intensive.
• It contains a few large tables.
"Data Warehouse is a subject-oriented, integrated, and time-variant store of information in
support of management's decisions."
Basic Concepts
Characteristics of Data Warehouse
Subject-Oriented
A data warehouse targets the modeling and analysis of data for decision-makers. Therefore, data warehouses typically provide a concise and straightforward view around a particular subject, such as customer, product, or sales, instead of the organization's ongoing global operations. This is done by excluding data that is not useful for the subject and including all data the users need to understand the subject.
Integrated
A data warehouse integrates various heterogeneous data sources like RDBMS, flat files, and
online transaction records. It requires performing data cleaning and integration during data
warehousing to ensure consistency in naming conventions, attribute types, etc., among
different data sources.
Time-Variant
Historical information is kept in a data warehouse. For example, one can retrieve data from 3 months, 6 months, 12 months ago, or even older, from a data warehouse. This contrasts with a transaction system, where often only the most current file is kept.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source
operational RDBMS. The operational updates of data do not occur in the data warehouse, i.e.,
update, insert, and delete operations are not performed. It usually requires only two procedures
in data accessing: Initial loading of data and access to data. Therefore, the DW does not require
transaction processing, recovery, and concurrency capabilities, which allows for substantial
speedup of data retrieval. Non-volatile means that once data has entered the warehouse, it should not change.
Data Warehouse Architecture
Data extracted from the source systems is stored in an area called the data staging area, where it is cleaned, transformed, combined, and de-duplicated to prepare the data for the data warehouse. The data staging area is usually a set of machines where simple activities like sorting and sequential processing take place. The data staging area does not provide query or presentation services; as soon as a system provides query or presentation services, it is classified as a presentation server. A presentation server is the destination machine on which data is loaded from the data staging area and stored directly for querying by end-users, report writers, and other applications.
There are three different types of systems required for a data warehouse –
1. Source Systems
2. Data Staging Area
3. Presentation Server
The data moves from the data source area through the staging area to the presentation server.
The entire process is better known as ETL (extract, transform, and load) or ETT (extract,
transform, and transfer).
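As a rough illustration of this flow, the following Python sketch extracts records from a source file, cleans and de-duplicates them in a staging step, and loads the result for the presentation server. The file names, column names, and cleaning rules are illustrative assumptions, not part of any particular warehouse.

```python
# Minimal ETL sketch: source system -> staging area -> presentation server.
# All file names and column names here are illustrative assumptions.
import csv

def extract(source_file):
    """Extract raw records from an operational source system (a CSV file here)."""
    with open(source_file, newline="") as f:
        return list(csv.DictReader(f))

def transform(records):
    """Staging-area work: clean, standardize, and de-duplicate the records."""
    cleaned, seen = [], set()
    for r in records:
        key = (r["customer_id"], r["order_date"])
        if key in seen:                                 # drop duplicate transactions
            continue
        seen.add(key)
        r["country"] = r["country"].strip().upper()     # standardize naming conventions
        r["amount"] = float(r["amount"])                # standardize attribute types
        cleaned.append(r)
    return cleaned

def load(records, warehouse_file):
    """Load the prepared data onto the presentation server (another CSV here)."""
    with open(warehouse_file, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)

# Example usage (assumes orders_source.csv exists with the columns used above):
# load(transform(extract("orders_source.csv")), "warehouse_orders.csv")
```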
Warehouse Manager –
The warehouse manager is responsible for the warehouse management process.
The operations performed by the warehouse manager include the analysis and aggregation of data, backup and archiving of data, and de-normalization of the data.
Query Manager –
Query Manager performs all the tasks associated with the management of user queries.
The complexity of the query manager is determined by the end-user access tools and the features provided by the database.
Detailed Data –
It is used to store all the detailed data in the database schema.
Detailed data is loaded into the data warehouse to complement the data collected.
Summarized Data –
Summarized Data is a part of the data warehouse that stores predefined aggregations
These aggregations are generated by the warehouse manager.
Archive and Backup Data –
The Detailed and Summarized Data are stored for the purpose of archiving and backup.
The data is relocated to storage archives such as magnetic tapes or optical disks.
Metadata –
Metadata is basically data about data.
It is used in the extraction and loading process, the warehouse management process, and the query management process.
End User Access Tools –
End-user access tools consist of analysis, reporting, and data mining tools.
By using end-user access tools, users can interact with the warehouse.
Building a Data Warehouse –
Some steps that are needed for building any data warehouse are as follows:
1. To extract the transactional data from different data sources:
For building a data warehouse, data is extracted from various data sources and stored in a central storage area. For extraction of the data, Microsoft provides a tool that comes free of cost with Microsoft SQL Server.
2. To transform the transactional data:
Many companies store their data in various DBMSs such as MS Access, MS SQL Server, Oracle, Sybase, etc. They also save data in
spreadsheets, flat files, mail systems, etc. Relating the data from all these sources is done while building a data warehouse.
3. To load the transformed data into the dimensional database:
After building a dimensional model, the data is loaded into the dimensional database. This process may combine several columns together or split one field into several columns (a small sketch of such a transformation appears after this list). There are two stages at which transformation of the data can be performed: while loading the data into the dimensional model, or while extracting the data from its origins.
4. To purchase a front-end reporting tool:
Top-notch analytical tools are available in the market from several major vendors. Microsoft has also released its own cost-effective tool, Data Analyzer.
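As referenced in step 3, a transformation may combine several columns into one or split one field into several. The small Python sketch below illustrates both; the field names (full_name, city, state) are hypothetical and not tied to any specific tool.

```python
# Illustrative transformation step: split one field into several columns
# and combine several columns into one. Field names are hypothetical.
def transform_row(row):
    first, _, last = row["full_name"].partition(" ")    # split one field into two columns
    return {
        "first_name": first,
        "last_name": last,
        "location": f'{row["city"]}, {row["state"]}',   # combine two columns into one
    }

print(transform_row({"full_name": "Asha Rao", "city": "Hubli", "state": "Karnataka"}))
# {'first_name': 'Asha', 'last_name': 'Rao', 'location': 'Hubli, Karnataka'}
```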
Database architecture for parallel processing in data warehouse
A parallel DBMS is a DBMS that runs across multiple processors or CPUs and is mainly
designed to execute query operations in parallel, wherever possible. A parallel DBMS links a number of smaller machines to achieve the same throughput as expected from a single large machine.
In Parallel Databases, mainly there are three architectural designs for parallel DBMS. They are
as follows:
1. Shared Memory Architecture
2. Shared Disk Architecture
3. Shared Nothing Architecture
1. Shared Memory Architecture – In Shared Memory Architecture, there are multiple CPUs attached to an interconnection network. They share a single global main memory and common disk arrays. Note that in this architecture, a single copy of a multi-threaded operating system and a multi-threaded DBMS can support these multiple CPUs. Shared memory is a tightly coupled architecture in which multiple CPUs share their memory. It is also known as symmetric multiprocessing (SMP). This architecture covers a wide range of systems, from personal workstations that support a few microprocessors in parallel to large RISC-based machines.
Advantages :
1. It has high-speed data access for a limited number of processors.
2. The communication is efficient.
Disadvantages :
1. It cannot scale beyond 80 or 100 CPUs in parallel.
2. The bus or the interconnection network gets blocked as a large number of CPUs are added.
2. Shared Disk Architectures :
In Shared Disk Architecture, various CPUs are attached to an interconnection network. In this,
each CPU has its own memory and all of them have access to the same disk. Also, note that
here the memory is not shared among CPUs therefore each node has its own copy of the
operating system and DBMS. Shared disk architecture is a loosely coupled architecture
optimized for applications that are inherently centralized. They are also known as clusters.
Advantages :
1. The interconnection network is no longer a bottleneck since each CPU has its own memory.
2. Load-balancing is easier in shared disk architecture.
3. There is better fault tolerance.
Disadvantages :
1. As the number of CPUs increases, the problems of interference and memory contention also increase.
2. A scalability problem also exists.
3. Shared Nothing Architecture :
In Shared Nothing Architecture, multiple CPUs are attached to an interconnection network, and each CPU has its own local memory and its own disk; the nodes share neither memory nor disks with one another.
Advantages :
1. It has better scalability as no sharing of resources is done
2. Multiple CPUs can be added
Disadvantages:
1. The cost of communication is higher, as it involves the sending of data and software interaction at both ends.
2. The cost of non-local disk access is higher than in shared disk architectures.
Note that this technology is typically used for very large databases of the order of 10^12 bytes (terabytes), or for systems that process thousands of transactions per second.
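A rough sketch of the shared-nothing idea is given below: each worker process owns one partition of the data, scans only that partition, and the coordinator merges the partial results. The data and the partitioning are invented for the example.

```python
# Shared-nothing style parallel query sketch: each worker process owns one
# partition of the data and computes a partial result; results are merged.
from multiprocessing import Pool

def partial_sum(partition):
    """Each node scans only its local partition (no shared memory or disk)."""
    return sum(row["amount"] for row in partition)

if __name__ == "__main__":
    # Invented example data, already split into per-node partitions.
    partitions = [
        [{"amount": 120.0}, {"amount": 80.0}],
        [{"amount": 45.5}],
        [{"amount": 300.0}, {"amount": 12.5}],
    ]
    with Pool(processes=len(partitions)) as pool:
        partials = pool.map(partial_sum, partitions)   # parallel partition scans
    print("total sales:", sum(partials))               # merge step on the coordinator
```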
Parallel DBMS vendors
A DBMS accepts requests from the application and instructs the operating system to transfer the appropriate data. The major DBMS vendors are Oracle, IBM, Microsoft, and Sybase (see Oracle Database, DB2, SQL Server, and ASE). MySQL and SQLite are very popular open source products.
Multi-Dimensional Data Model
It is a method used for organizing data in the database, with a good arrangement and assembly of the contents of the database.
The multidimensional data model allows customers to ask analytical questions related to market or business trends, unlike relational databases, which allow customers to access data in the form of queries. It lets users rapidly receive answers to their requests by creating and examining the data comparatively quickly.
OLAP (online analytical processing) and data warehousing use multidimensional databases.
It is used to show multiple dimensions of the data to users.
It represents data in the form of data cubes. Data cubes allow us to model and view the data from many dimensions and perspectives. A cube is defined by dimensions and facts and is represented by a fact table. Facts are numerical measures, and fact tables contain the measures of the related dimension tables or the names of the facts.
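A minimal sketch of a fact table and the cube built from it is shown below using pandas. The dimensions (location, time, item) and measures (dollars_sold, units_sold) follow the ones mentioned in these notes; the rows themselves are made up.

```python
# Fact table with three dimensions and two measures, aggregated into a cube.
import pandas as pd

fact = pd.DataFrame([
    # location, time, item, dollars_sold, units_sold  (made-up rows)
    ("Delhi",   "Q1", "Car", 500.0, 5),
    ("Delhi",   "Q2", "Bus", 300.0, 2),
    ("Kolkata", "Q1", "Car", 450.0, 4),
    ("Kolkata", "Q2", "Bus", 150.0, 1),
], columns=["location", "time", "item", "dollars_sold", "units_sold"])

# Aggregating over the dimensions gives one cell of the cube per combination.
cube = fact.groupby(["location", "time", "item"])[["dollars_sold", "units_sold"]].sum()
print(cube)
```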
Working on a Multidimensional Data Model
The multidimensional data model works on the basis of pre-decided steps.
The following stages should be followed by every project for building a Multi Dimensional
Data Model :
Stage 1 : Assembling data from the client : In the first stage, a multidimensional data model collects the correct data from the client. Mostly, software professionals give the client clarity about the range of data that can be obtained with the selected technology, and collect the complete data in detail.
Stage 2 : Grouping different segments of the system : In the second stage, the multidimensional data model recognizes and classifies all the data into the respective sections they belong to, and also makes it problem-free to apply step by step.
Stage 3 : Noticing the different proportions : The third stage forms the basis on which the design of the system rests. In this stage, the main factors are recognized according to the user's point of view. These factors are also known as "Dimensions".
Stage 4 : Preparing the actual-time factors and their respective qualities : In the fourth
stage, the factors which are recognized in the previous step are used further for identifying the
related qualities. These qualities are also known as “attributes” in the database.
Stage 5 : Finding the actuality of factors which are listed previously and their qualities
In the fifth stage, a multidimensional data model separates and differentiates the actualities from the factors collected by it. These play a significant role in the arrangement of a multidimensional data model.
Stage 6 : Building the Schema to place the data, with respect to the information collected
from the steps above : In the sixth stage, on the basis of the data which was collected
previously, a Schema is built.
1. Star Schema
• There is a fact table at the center. It contains the keys to each of the four dimensions.
• The fact table also contains the attributes, namely dollars sold and units sold.
Note − Each dimension has only one dimension table, and each table holds a set of attributes. For example, the location dimension table contains the attribute set {location_key, street, city, province_or_state, country}. This constraint may cause data redundancy. For example, "Vancouver" and "Victoria" are both cities in the Canadian province of British Columbia. The entries for such cities may cause data redundancy along the attributes province_or_state and country.
Characteristics of Star Schema
The star schema is highly suitable for data warehouse database design because of the following features:
o It creates a de-normalized database that can quickly provide query responses.
o It provides a flexible design that can be changed easily or added to throughout the
development cycle, and as the database grows.
o Its design parallels the way end-users typically think of and use the data.
o It reduces the complexity of metadata for both developers and end-users.
Advantages of Star Schema
o Query performance
o Load performance and administration
o Built-in referential integrity
o Easily understood
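A minimal star schema matching the description above (a central sales fact table with keys to four dimension tables, and a location dimension holding street, city, province_or_state, and country) can be sketched through Python's sqlite3 module. The location attributes and fact measures follow these notes; the remaining dimension names and attributes (time, item, branch) are assumptions based on the standard sales example.

```python
# Star schema sketch: one central fact table with foreign keys to four
# de-normalized dimension tables. Executed on an in-memory SQLite database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch_dim   (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT,
                           province_or_state TEXT, country TEXT);
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    branch_key   INTEGER REFERENCES branch_dim(branch_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL,
    units_sold   INTEGER
);
""")
```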
2.Snowflake Schema
• Some dimension tables in the Snowflake schema are normalized.
• The normalization splits up the data into additional tables.
• Unlike the star schema, the dimension tables in a snowflake schema are normalized. For example, the item dimension table of the star schema is normalized and split into two dimension tables, namely the item and supplier tables.
• Now the item dimension table contains the attributes item_key, item_name, type, brand,
and supplier-key.
• The supplier key is linked to the supplier dimension table. The supplier dimension table
contains the attributes supplier_key and supplier_type.
Note − Due to normalization in the snowflake schema, redundancy is reduced; therefore, it becomes easier to maintain and saves storage space.
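The normalization described above can be sketched as a change to the item dimension of the star schema example: the supplier attributes move into their own table, and the item table keeps only a supplier_key. The snippet below, again using sqlite3, is only an illustration.

```python
# Snowflake schema sketch: the item dimension is normalized into an item
# table and a separate supplier table linked by supplier_key.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supplier_dim (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item_dim (
    item_key     INTEGER PRIMARY KEY,
    item_name    TEXT,
    type         TEXT,
    brand        TEXT,
    supplier_key INTEGER REFERENCES supplier_dim(supplier_key)
);
""")
```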
Advantage of Snowflake Schema
1. The primary advantage of the snowflake schema is the improvement in query performance due to minimized disk storage requirements and the joining of smaller lookup tables.
2. It provides greater scalability in the interrelationship between dimension levels and
components.
3. No redundancy, so it is easier to maintain.
Disadvantage of Snowflake Schema
1. The primary disadvantage of the snowflake schema is the additional maintenance effort required due to the increasing number of lookup tables. It is also known as a multi-fact star schema.
2. The queries are more complex and hence more difficult to understand.
3. More tables mean more joins, and hence more query execution time.
3.Fact Constellation Schema
• A fact constellation has multiple fact tables. It is also known as galaxy schema.
• The following diagram shows two fact tables, namely sales and shipping.
Fact tables store data about sales, while dimension tables store data about the geographic region (markets, cities), clients, products, times, and channels.
c. Schema Definition
Multidimensional schema is defined using Data Mining Query Language (DMQL). The two
primitives, cube definition and dimension definition, can be used for defining the data
warehouses and data marts.
Concept Hierarchy
A concept hierarchy represents a series of mappings from a set of low-level concepts to higher-level, more general concepts. A concept hierarchy organizes information or concepts in a hierarchical structure or a specific partial order, which is used for expressing knowledge in brief, high-level terms and for mining knowledge at several levels of abstraction.
A concept hierarchy includes a set of nodes organized in a tree, where the nodes define values of an attribute, known as concepts. A specific node, "ANY", is reserved for the root of the tree. A level number is assigned to each node in a concept hierarchy. The level of the root node is one. The level of a non-root node is one more than the level of its parent. Because values are defined by nodes, the levels of nodes can also be used to describe the levels of values. A concept hierarchy enables raw information to be handled at a higher, more generalized level of abstraction.
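A tiny sketch of such a hierarchy for a location attribute is given below, with "ANY" at level 1 and each child one level below its parent, as described above. The particular values are invented.

```python
# Concept hierarchy for "location": ANY (level 1) -> country -> province -> city.
# Values are invented; levels follow the "parent level + 1" rule from the notes.
hierarchy = {
    "ANY": ["Canada", "India"],
    "Canada": ["British Columbia"],
    "British Columbia": ["Vancouver", "Victoria"],
    "India": ["Karnataka"],
    "Karnataka": ["Hubli"],
}

def level(node, root="ANY"):
    """Level of a node: the root is at level 1, a child is one level below its parent."""
    if node == root:
        return 1
    for parent, children in hierarchy.items():
        if node in children:
            return level(parent) + 1
    raise ValueError(f"unknown concept: {node}")

print(level("Vancouver"))   # 4  (ANY -> Canada -> British Columbia -> Vancouver)
```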
There are several types of concept hierarchies which are as follows −
Schema Hierarchy − Schema hierarchy represents the total or partial order between attributes
in the database. It can define existing semantic relationships between attributes. In a database,
more than one schema hierarchy can be generated by using multiple sequences and grouping of
attributes.
Set-Grouping Hierarchy − A set-grouping hierarchy organizes values for a given attribute or dimension into groups of constants or range values. It is also known as an instance hierarchy because the partial order of the hierarchy is defined on the set of instances or values of an attribute. These hierarchies often make more practical sense and are therefore used more than other hierarchies.
Operation-Derived Hierarchy − Operation-derived hierarchy is represented by a set of
operations on the data. These operations are defined by users, professionals, or the data mining
system. These hierarchies are usually defined for numerical attributes. Such operations can be as simple as a range value comparison or as complex as a data clustering or data distribution analysis algorithm.
Rule-based Hierarchy − In a rule-based hierarchy, either a whole concept hierarchy or a portion of it is defined by a set of rules and is computed dynamically based on the current data and the rule definitions. A lattice-like structure is used for graphically describing this type of hierarchy, in which each child-parent path is associated with a generalization rule.
The generation of concept hierarchies can be static or dynamic, depending on the data sets involved: when the generation of a concept hierarchy is based on a static data set it is known as static generation, and when it is based on a dynamic data set it is known as dynamic generation of the concept hierarchy.
Types of concept hierarchy
1] Binning
• Binning is a top-down splitting technique based on a specified number of bins.
• The basic notion is that for accurate discretization, the relative class frequencies should
be fairly consistent within an interval.
• Therefore, if two adjacent intervals have a very similar distribution of classes, then the
intervals can be merged.
Data Discretization
• Dividing the range of a continuous attribute into intervals.
• Interval labels can then be used to replace actual data values.
• Reduce the number of values for a given continuous attribute.
• Some classification algorithms only accept categorical attributes.
• This leads to a concise, easy-to-use, knowledge-level representation of mining results.
• Discretization techniques can be categorized based on whether it uses class information
or not such as follows:
o Supervised Discretization - This discretization process uses class information.
o Unsupervised Discretization - This discretization process does not use class
information.
• Discretization techniques can be categorized based on which direction it proceeds as
follows:
Top-down Discretization -
• The process starts by first finding one or a few points (called split points or cut points) to split the entire attribute range, and then repeats this recursively on the resulting intervals.
Bottom-up Discretization -
• Starts by considering all of the continuous values as potential split-points.
• Removes some by merging neighborhood values to form intervals, and then recursively
applies this process to the resulting intervals.
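A short sketch of unsupervised, top-down discretization by binning is shown below using pandas: equal-width bins via pandas.cut and equal-frequency bins via pandas.qcut. The age values are invented.

```python
# Unsupervised discretization of a continuous attribute (age) into intervals.
import pandas as pd

ages = pd.Series([13, 15, 16, 19, 20, 21, 25, 30, 35, 40, 52, 70])

# Equal-width binning: the value range is split into 3 intervals of equal width.
equal_width = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])

# Equal-frequency binning: each interval holds roughly the same number of values.
equal_freq = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Interval labels now replace the actual data values.
print(pd.DataFrame({"age": ages, "equal_width": equal_width, "equal_freq": equal_freq}))
```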
Characteristics of OLAP
In the FASMI characteristics of OLAP methods, the term is derived from the first letters of the following characteristics:
Fast
It means that the system is targeted to deliver most responses to users within about five seconds, with the simplest analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
It means that the system can cope with any business logic and statistical analysis that is relevant for the application and the user, and keeps it easy enough for the target user. Although some
pre-programming may be needed, it is necessary to allow the user to define new ad-hoc calculations as part of the analysis and to report on the data in any desired way, without having to program; so products (like Oracle Discoverer) that do not allow adequate end-user-oriented calculation flexibility are excluded.
Share
It means that the system implements all the security requirements for confidentiality and, if multiple write access is needed, concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. OLAP system must provide a multidimensional conceptual view of
the data, including full support for hierarchies, as this is certainly the most logical method to
analyze business and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity should
be handled in an efficient manner.
Some more characteristics of OLAP are as follows:
1. Multidimensional conceptual view: OLAP systems let business users have a
dimensional and logical view of the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, they should provide normal database operations, including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end tools. The OLAP operations should sit between data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database size should not significantly degrade the reporting performance of the OLAP system.
6. OLAP provides for distinguishing between zero values and missing values so that
aggregates are computed correctly.
7. OLAP system should ignore all missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for the users.
9. OLAP allows users to drill down for greater detail or roll up to aggregate metrics along a single business dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
OLAP Operations in DBMS
OLAP stands for Online Analytical Processing. It is a software technology that allows users to analyze information from multiple database systems at the same time. It is based on the multidimensional data model and allows the user to query multi-dimensional data (e.g., Delhi -> 2018 -> Sales data). OLAP databases are divided into one or more cubes, and these cubes are known as hyper-cubes.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
1. Drill down: In drill-down operation, the less detailed data is converted into highly
detailed data. It can be done by:
• Moving down in the concept hierarchy
• Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down the concept hierarchy of the Time dimension (Quarter -> Month).
2. Roll up: It is just opposite of the drill-down operation. It performs aggregation on the
OLAP cube. It can be done by:
• Climbing up in the concept hierarchy
3. Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions.
In the cube given in the overview section, a sub-cube is selected by selecting following
dimensions with criteria:
• Location = “Delhi” or “Kolkata”
• Time = “Q1” or “Q2”
• Item = “Car” or “Bus”
4. Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new sub-cube. In the cube given in the overview section, slice is performed on the dimension Time = “Q1”.
5. Pivot: It is also known as the rotation operation, as it rotates the current view to get a new view of the representation. In the sub-cube obtained after the slice operation, performing a pivot operation gives a new view of it.
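The five operations can be imitated on a small invented fact table using pandas; the dimension values (Delhi/Kolkata, Q1/Q2, Car/Bus) follow the examples used in this section, and the figures are made up.

```python
# OLAP operations sketched with pandas on an invented sales fact table.
import pandas as pd

fact = pd.DataFrame([
    ("Delhi",   "Q1", "Jan", "Car", 200.0),
    ("Delhi",   "Q1", "Feb", "Bus", 150.0),
    ("Delhi",   "Q2", "Apr", "Car", 300.0),
    ("Kolkata", "Q1", "Jan", "Car", 450.0),
    ("Kolkata", "Q2", "May", "Bus", 120.0),
], columns=["location", "quarter", "month", "item", "dollars_sold"])

# Roll up: aggregate upward in the Time hierarchy (month -> quarter).
roll_up = fact.groupby(["location", "quarter", "item"])["dollars_sold"].sum()

# Drill down: move back down the hierarchy to the more detailed month level.
drill_down = fact.groupby(["location", "quarter", "month", "item"])["dollars_sold"].sum()

# Slice: fix a single dimension value (Time = "Q1") to get a sub-cube.
slice_q1 = fact[fact["quarter"] == "Q1"]

# Dice: select a sub-cube using criteria on two or more dimensions.
dice = fact[fact["location"].isin(["Delhi", "Kolkata"])
            & fact["quarter"].isin(["Q1", "Q2"])
            & fact["item"].isin(["Car", "Bus"])]

# Pivot: rotate the view, e.g. locations as rows and items as columns.
pivot = fact.pivot_table(values="dollars_sold", index="location",
                         columns="item", aggfunc="sum")

print(roll_up, slice_q1, dice, pivot, sep="\n\n")
```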
Difference between OLAP and OLTP:
1. Definition – OLAP is well-known as an online database query management system, whereas OLTP is well-known as an online database modifying system.
12. Backup and Recovery – OLAP only needs backup from time to time as compared to OLTP, whereas in OLTP the backup and recovery process is maintained rigorously.