Decision Support System: Unit 1


Unit 1

Decision support system


A decision support system (DSS) is a computer-based information system that supports business or
organizational decision-making activities. DSSs serve the management, operations, and planning levels
of an organization and help people make decisions about problems that may be rapidly changing and not easily specified in
advance. Decision support systems can be fully computerized, human-powered, or a combination of both.

DSSs include knowledge-based systems. A properly designed DSS is an interactive software-based
system intended to help decision makers compile useful information from a combination of raw data,
documents, and personal knowledge, or business models, to identify and solve problems and make
decisions.

Typical information that a decision support application might gather and present includes:

 inventories of information assets (including legacy and relational data sources, cubes, data
warehouses, and data marts),
 comparative sales figures between one period and the next,
 projected revenue figures based on product sales assumptions.
Components

1. the database (or knowledge base),
2. the model (i.e., the decision context and user criteria), and
3. the user interface.

Operational Systems vs. Decision Support Systems


Operational systems are OLTP systems; they are used to run the day-to-day core business of the company
and are often called the "bread and butter" systems.

Decision support systems, on the other hand, are not meant to run the core business processes; they are used to
observe how the business runs and to support the strategic decisions that improve it.

Data warehouse
A data warehouse is a database used for reporting and data analysis. It is a central repository of data
which is created by integrating data from one or more disparate sources. Data warehouses store current
as well as historical data and are used for creating trending reports for senior management, such
as annual and quarterly comparisons.

The data stored in the warehouse is uploaded from the operational systems (such as marketing, sales,
etc.). The data may pass through an operational data store for additional
operations before it is used in the DW for reporting.

Benefits of a data warehouse


A data warehouse maintains a copy of information from the source transaction systems. This architectural
complexity provides the opportunity to:

 Congregate data from multiple sources into a single database so a single query engine can be
used to present data.
 Mitigate the problem of database isolation level lock contention in transaction processing systems
caused by attempts to run large, long running, analysis queries in transaction processing databases.
 Maintain data history, even if the source transaction systems do not.
 Integrate data from multiple source systems, enabling a central view across the enterprise. This
benefit is always valuable, but particularly so when the organization has grown by merger.
 Improve data quality, by providing consistent codes and descriptions, flagging or even fixing bad
data.
 Present the organization's information consistently.
 Provide a single common data model for all data of interest regardless of the data's source.
 Restructure the data so that it makes sense to the business users.
 Restructure the data so that it delivers excellent query performance, even for complex analytic
queries, without impacting the operational systems.
 Add value to operational business applications, notably customer relationship
management (CRM) systems.

Subject Oriented

Data warehouses are designed to help you analyze data. For example, to learn more
about your company's sales data, you can build a warehouse that concentrates on
sales. Using this warehouse, you can answer questions like "Who was our best
customer for this item last year?" This ability to define a data warehouse by subject
matter, sales in this case, makes the data warehouse subject oriented.
Integrated

Integration is closely related to subject orientation. Data warehouses must put data
from disparate sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this,
they are said to be integrated.
Nonvolatile

Nonvolatile means that, once entered into the warehouse, data should not change. This
is logical because the purpose of a warehouse is to enable you to analyze what has
occurred.

Time Variant

In order to discover trends in business, analysts need large amounts of data. This is
very much in contrast to online transaction processing (OLTP) systems, where
performance requirements demand that historical data be moved to an archive. A data
warehouse's focus on change over time is what is meant by the term time variant.

Data mart
A data mart is the access layer of the data warehouse environment that is used to get data out to the
users. Data marts are small slices of the data warehouse.

Design schemas

 Star schema - fairly popular design choice; enables a relational database to emulate the
analytical functionality of a multidimensional database
 Snowflake schema
Reasons for creating a data mart

 Easy access to frequently needed data


 Creates a collective view for a group of users
 Improves end-user response time
 Ease of creation
 Lower cost than implementing a full data warehouse
 Potential users are more clearly defined than in a full data warehouse
 Contains only business essential data and is less cluttered
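
As a rough sketch of how a departmental data mart can be carved out of the warehouse, the statements below materialize a sales-only subset. The table and column names (sales_fact, date_dim, product_dim, and so on) are hypothetical and used purely for illustration; depending on the database, CREATE TABLE ... AS or an equivalent SELECT INTO would be used.

-- Minimal sketch: build a small sales data mart as a subset of the warehouse.
-- All table and column names are assumed for the example.
CREATE TABLE sales_mart AS
SELECT d.calendar_year,
       d.calendar_month,
       p.product_category,
       SUM(f.sales_amount)  AS total_sales,
       SUM(f.quantity_sold) AS total_units
FROM   sales_fact  f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP BY d.calendar_year, d.calendar_month, p.product_category;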

Top-down and Bottom-up Approaches


Top-down: take the big picture of the organization, take a top-down approach, and build a mammoth enterprise-wide data warehouse first.

Bottom-up: take the big picture of the organization, then take a bottom-up approach: look at the individual local and
departmental requirements and build bite-size departmental data marts.
Data Warehouse and Components
What is granularity in a data warehouse?
Granularity refers to the level of detail of the data stored in the fact tables of a data warehouse. High
granularity refers to data that is at or near the transaction level. Data that is at the transaction
level is usually referred to as atomic-level data. Low granularity refers to data that is summarized
or aggregated, usually from the atomic-level data. Summarized data can be lightly summarized, as
in daily or weekly summaries, or highly summarized, as in yearly averages and totals.
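
As an illustration of these two levels of detail, the hypothetical query below takes transaction-level (atomic) rows from an assumed sales_fact table and produces a lightly summarized daily table; yearly totals would be a further, highly summarized level.

-- Atomic grain: one row per individual sales transaction (sales_fact, assumed).
-- Lightly summarized grain: one row per product per day.
CREATE TABLE daily_sales_summary AS
SELECT sale_date,
       product_key,
       SUM(sales_amount) AS daily_sales_amount,
       COUNT(*)          AS transaction_count
FROM   sales_fact
GROUP BY sale_date, product_key;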
Granularity
The single most important aspect and issue of the design of the data warehouse is the issue of
granularity. It refers to the detail or summarization of the units of data in the data warehouse. The more
detail there is, the lower the granularity level. The less detail there is, the higher the granularity level.

Granularity is a major design issue in the data warehouse as it profoundly affects the volume of data.

Granularity in the Data Warehouse


Granularity is the most important issue for the data warehouse architect because it affects all the
environments that depend on the data warehouse for data. The main issue of granularity is that of
getting it at the right level: the level of granularity needs to be neither too high nor too low.

Raw Estimates
The starting point for determining the appropriate level of granularity is a rough estimate of
the number of rows that will be in the data warehouse. If there are only very few rows, almost any
level of granularity will do. After these row projections are made, the
index data space projections are calculated: identify the length of
the key or data element and determine whether the key will exist for each and every entry in
the primary table.
Data in the data warehouse grows at a rate rarely seen in operational systems. The combination of historical data
and detailed data produces a phenomenal growth rate; data warehouses were among the first systems
routinely measured in terabytes and petabytes. As the data keeps growing, part of it
becomes inactive; such data is sometimes called dormant data, and it is usually better to
move dormant data to external (overflow) storage media.
Data stored externally is much less expensive to keep than data that resides on
disk storage, but because it is external it is harder to retrieve, which
causes performance issues, and these issues in turn influence the choice of granularity.
It is usually the rough estimates that tell whether overflow storage should be considered or
not.
Levels of Granularity
After this simple analysis is done, the next step is to determine the level of granularity for the
data that will reside on disk storage. Determining the level of granularity requires some
degree of common sense and intuition. A very low level of granularity does not make
much sense, because many resources are needed to store, analyze, and process the data; a very high
level of granularity means that detailed analysis has to be done against data that
resides in external storage. This is a tricky trade-off, so the only practical way to handle it is to
put the data in front of the users and let them decide what level of detail they need, iterating on
their feedback. The process to follow is:
 Build a small subset quickly and gather feedback on it
 Prototype
 Look at what other people have done
 Work with an experienced user
 Look at what the organization has now
 Hold sessions with simulated output

Strategic Information System


A Strategic Information System (SIS) is a system that helps companies change or otherwise alter their
business strategy and/or structure. It is typically utilized to streamline and quicken the reaction time to
environmental changes and to aid the organization in achieving a competitive advantage.

Key features of Strategic Information Systems are the following:

1) Decision support systems that enable the development of a strategic approach to aligning Information Systems (IS)
or Information Technologies (IT) with an organization's business strategies.

2) Primarily enterprise resource planning (ERP) solutions that integrate/link the business processes to meet the
enterprise objectives for the optimization of the enterprise resources.

3) Database systems with "data mining" capabilities to make the best use of available corporate
information for marketing, production, promotion, and innovation. SIS also facilitate
identification of data collection strategies to help optimize database-marketing opportunities.

4) Real-time information systems intended to maintain rapid response and quality indicators.

OLAP Operations
One of the most compelling front-end applications for OLAP is a PC spreadsheet program. Below is the list of some
popular operations that are supported by the multidimensional spreadsheet applications.

Roll-up: A roll-up involves summarizing the data along a dimension. The summarization rule might be computing
totals along a hierarchy or applying a set of formulas such as "profit = sales - expenses".

 Takes the current aggregation level of fact values and does a further aggregation on one or more of the dimensions.
 Equivalent to doing a GROUP BY on this dimension using the attribute hierarchy.
 Decreases the number of dimensions - removes row headers.

SELECT [attribute list], SUM([measure])
FROM [table list]
WHERE [condition list]
GROUP BY [grouping list];
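
As a concrete illustration of the template above (the sales_fact, date_dim, and product_dim tables are assumed, not taken from the text), a roll-up from the month level to the year level along the time hierarchy could look like this:

-- Roll-up: aggregate monthly sales up to the year level.
SELECT d.calendar_year,
       p.product_category,
       SUM(f.sales_amount) AS yearly_sales
FROM   sales_fact  f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   product_dim p ON f.product_key = p.product_key
GROUP BY d.calendar_year, p.product_category;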
Drill-up/down: allows the user to navigate among levels of data ranging from the most summarized (up) to the most
detailed (down).[5] A drill-down operation, for example, moves from a summary product category
down to the sales figures for the individual products that make it up.

 Opposite of roll-up.
 Presents data at a lower level of a dimension hierarchy, thereby viewing data at a more detailed level within a
dimension.
 Increases the number of dimensions - adds new row headers.
Slice: the act of picking a rectangular subset of a cube by choosing a single value for one of its
dimensions, creating a new cube with one fewer dimension.[5] For example, the
sales figures of all sales regions and all product categories of the company in the year 2004 can be "sliced" out
of the data cube.

 Performs a selection on one dimension of the given cube, resulting in a sub-cube.
 Reduces the dimensionality of the cube.
 Sets one dimension to a specific value and keeps the remaining dimensions for that selected value.

Dice: The dice operation produces a subcube by allowing the analyst to pick specific values of multiple dimensions.
[6] For example, the new cube might show the sales figures of a limited number of product
categories, while the time and region dimensions cover the same range as before. (A concrete slice and dice are sketched in SQL below.)

 Defines a sub-cube by performing a selection on one or more dimensions.
 Refers to a range-select condition on one dimension, or to select conditions on more than one dimension.
 Reduces the number of member values of one or more dimensions.
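
In a relational (ROLAP) setting both operations boil down to WHERE clauses, as the ROLAP discussion later in these notes also points out. The sketch below reuses the hypothetical star schema from the roll-up example: the slice fixes a single value of the time dimension, while the dice selects specific values on two dimensions.

-- Slice: fix one dimension (year = 2004), keep the others.
SELECT r.region_name, p.product_category, SUM(f.sales_amount) AS sales
FROM   sales_fact  f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   region_dim  r ON f.region_key  = r.region_key
JOIN   product_dim p ON f.product_key = p.product_key
WHERE  d.calendar_year = 2004
GROUP BY r.region_name, p.product_category;

-- Dice: select specific values on more than one dimension.
SELECT d.calendar_year, r.region_name, SUM(f.sales_amount) AS sales
FROM   sales_fact  f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   region_dim  r ON f.region_key  = r.region_key
JOIN   product_dim p ON f.product_key = p.product_key
WHERE  p.product_category IN ('Clothing', 'Footwear')
  AND  d.calendar_year IN (2003, 2004)
GROUP BY d.calendar_year, r.region_name;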

Pivot (or rotate) allows an analyst to rotate the cube in space to see its various faces. For example, cities could be
arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could replace
products with time periods to see data across time for a single product

 Rotates the data axis to view the data from different perspectives.
 Groups data with different dimensions.
Unit-2
1. What is Dimensional Modelling?

Dimensional modeling (DM) is the name of a set of techniques and concepts used in data warehouse
design. It is considered to be different from entity-relationship modeling (ER). Dimensional Modeling does
not necessarily involve a relational database. The same modeling approach, at the logical level, can be
used for any physical form, such as multidimensional database or even flat files.

DM is a design technique for databases intended to support end-user queries in a data warehouse. It is
oriented around understandability and performance. Dimensional modeling always uses the concepts of
facts (measures) and dimensions (context). Facts are typically (but not always) numeric values that can
be aggregated, and dimensions are groups of hierarchies and descriptors that define the facts. For
example, a sales amount is a fact, while date, product, and store are dimensions that give it context.

Dimensional modeling process


The dimensional model is built on a star-like schema, with dimensions surrounding the fact table. To build
the schema, the following design model is used:

1. Choose the business process


2. Declare the grain
3. Identify the dimensions
4. Identify the fact
Benefits of dimensional modeling
Understandability

Query performance

Extensibility

Er Modeling Versus Dimension Modeling

 An E-R diagram (used in OLTP or transactional systems) has a highly normalized model (even at a logical
level), whereas a dimensional model aggregates most of the attributes and hierarchies of a dimension into a
single entity.
 An E-R diagram is a complex maze of hundreds of entities linked with each other, whereas the dimensional
model has a logically grouped set of star schemas.
 The E-R diagram is split as per the entities. A dimensional model is split as per the dimensions and facts.
 In an E-R diagram all attributes for an entity, textual as well as numeric, belong to the entity table,
whereas a 'dimension' entity in a dimensional model has mostly textual attributes, and the 'fact' entity has
mostly numeric attributes.

Relational Data Modeling vs. Dimensional Data Modeling

 Storage: Relational - data is stored in an RDBMS. Dimensional - data is stored in an RDBMS or in multidimensional databases.
 Unit of storage: Relational - tables. Dimensional - cubes.
 Normalization and use: Relational - data is normalized and used for OLTP; optimized for OLTP processing. Dimensional - data is denormalized and used in the data warehouse and data marts; optimized for OLAP.
 Structure: Relational - several tables and chains of relationships among them. Dimensional - few tables; fact tables are connected to dimension tables.
 Volatility: Relational - volatile (several updates) and time variant. Dimensional - non-volatile and time invariant.
 Query language: Relational - SQL is used to manipulate data. Dimensional - MDX is used to manipulate data.
 Level of data: Relational - detailed level of transactional data. Dimensional - summary of bulky transactional data (aggregates and measures) used in business decisions.
 Reporting: Relational - normal reports. Dimensional - user-friendly, interactive, drag-and-drop multidimensional OLAP reports.
 Typical use: Relational - data design used for business transaction systems. Dimensional - data design used for analysis systems.
 Goal: Relational - reduce every piece of information to its simplest form (a debit transaction, a customer record, an address). Dimensional - break up information into 'Facts' (things a company measures) and 'Dimensions' (how we measure them: by time, region, or customer).
 Workload: Relational - suited for concurrent handling of many small transactions by many users; only a limited amount of data history is normally kept. Dimensional - suited for reading or analyzing large amounts of data by a modest number of users; many years of data history may be kept.
 Users: Relational - the user is usually constrained by an application that understands the data design; users are typically operations staff. Dimensional - the simpler data design makes it easier for users to analyze data in any way they choose; users are typically analysts, company strategists, or even executives.

Star schemas
A star schema is a type of relational database schema that is composed of a single, central
fact table that is surrounded by dimension tables.

The following figure shows a star schema with a single fact table and four dimension tables.
A star schema can have any number of dimension tables. The branches at the end of the
links connecting the tables indicate a many-to-one relationship between the fact table and
each dimension table.

Figure 1. Star schema with a single fact table with links to multiple dimension tables
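
A minimal DDL sketch of such a schema is shown below. The table and column names are hypothetical; the point is the pattern of one central fact table referencing each dimension table through a foreign key (the many-to-one links described above).

-- Dimension tables (one per dimension; names assumed for illustration).
CREATE TABLE date_dim (
    date_key      INTEGER PRIMARY KEY,
    calendar_date DATE,
    calendar_year INTEGER
);

CREATE TABLE product_dim (
    product_key      INTEGER PRIMARY KEY,
    product_name     VARCHAR(100),
    product_category VARCHAR(50)
);

-- Central fact table: a many-to-one foreign key to each dimension.
CREATE TABLE sales_fact (
    date_key      INTEGER REFERENCES date_dim(date_key),
    product_key   INTEGER REFERENCES product_dim(product_key),
    sales_amount  DECIMAL(12,2),
    quantity_sold INTEGER
);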
Factless Fact Table 

A factless fact table is a fact table that does not have any measures. It is essentially an
intersection of dimensions. On the surface, a factless fact table does not make sense,
since a fact table is, after all, about facts. However, there are situations where having
this kind of relationship makes sense in data warehousing.

For example, think about a record of student attendance in classes. In this case, the fact
table would consist of 3 dimensions: the student dimension, the time dimension, and the
class dimension; each row simply records that a given student attended a given class at a given time.

The only measure that you can possibly attach to each combination is "1" to show the
presence of that particular combination. However, adding a fact that always shows 1 is
redundant because we can simply use the COUNT function in SQL to answer the same
questions.

Factless fact tables offer the most flexibility in data warehouse design. For example,
one can easily answer the following questions with this factless fact table:

 How many students attended a particular class on a particular day?


 How many classes on average does a student attend on a given day?

Without using a factless fact table, we will need two separate fact tables to answer the
above two questions. With the above factless fact table, it becomes the only fact table
that's needed.
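
A minimal sketch of the attendance example, with assumed table names, is shown below; the fact table carries only dimension keys, and the questions above are answered with COUNT rather than with a stored measure.

-- Factless fact table: dimension keys only, no numeric measure.
-- Assumes student_dim, class_dim, and date_dim dimension tables already exist.
CREATE TABLE attendance_fact (
    student_key INTEGER REFERENCES student_dim(student_key),
    class_key   INTEGER REFERENCES class_dim(class_key),
    date_key    INTEGER REFERENCES date_dim(date_key)
);

-- How many students attended a particular class on a particular day?
SELECT COUNT(*) AS students_present
FROM   attendance_fact
WHERE  class_key = 101        -- hypothetical class key
  AND  date_key  = 20240115;  -- hypothetical date key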

Star Schema Keys

Surrogate Keys
 A substitution for the natural primary key.

 It is just a unique identifier or number for each row that can be used for the primary key to the table.
The only requirement for a surrogate primary key is that it is unique for each row in the table.

 Data warehouses typically use a surrogate key (also known as an artificial or identity key) for the dimension
tables' primary keys. An Oracle sequence or a SQL Server identity value can be used to generate the surrogate key.

 It is useful because the natural primary key (e.g., Customer Number in the Customer table) can change, and this
makes updates more difficult.

 In a data warehouse, a surrogate key is a necessary generalization of the natural production key and is
one of the basic elements of data warehouse design.

 Every join between dimension tables and fact tables in a data warehouse environment should be based on
surrogate keys, not natural keys.

 It is up to the data extract logic to systematically look up and replace every incoming natural key with a
data warehouse surrogate key each time either a dimension record or a fact record is brought into the
data warehouse environment.

 One of the most important uses of surrogate keys is the need to encode uncertain knowledge (when you have
an "I don't know" situation, you may want more than just one special key for the anonymous customer).
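
A small sketch of a dimension table keyed by a surrogate rather than by the natural key is shown below. Identity/sequence syntax varies by product (Oracle sequences, SQL Server identity columns, and so on, as noted above); the ANSI-style GENERATED clause is used here as one assumed possibility, and the names are hypothetical.

-- customer_key is the surrogate key; customer_number is the natural key
-- carried over from the source system.
CREATE TABLE customer_dim (
    customer_key    INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    customer_number VARCHAR(20),
    customer_name   VARCHAR(100),
    city            VARCHAR(50)
);

-- Fact rows join on customer_key, never on customer_number; the extract
-- logic looks up the surrogate key for each incoming natural key.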
Advantages of the Star Schema:

1. Provides a direct mapping between the business entities and the schema design.
2. Provides highly optimized performance for star queries.
3. It is widely supported by a lot of business intelligence tools.

Star schemas are denormalized, meaning the normal rules of normalization applied to transactional
relational databases are relaxed during star schema design and implementation. The benefits of star
schema denormalization are:

 Simpler queries - star schema join logic is generally simpler than the join logic required to retrieve
data from highly normalized transactional schemas.
 Simplified business reporting logic - when compared to highly normalized schemas, the star
schema simplifies common business reporting logic, such as period-over-period and as-of reporting.
 Query performance gains - star schemas can provide performance enhancements for read-only
reporting applications when compared to highly normalized schemas.
 Fast aggregations - the simpler queries against a star schema can result in improved
performance for aggregation operations.
 Feeding cubes - star schemas are used by all OLAP systems to build proprietary OLAP
cubes efficiently; in fact, most major OLAP systems provide a ROLAP mode of operation which can
use a star schema directly as a source without building a proprietary cube structure.
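
For example, the period-over-period reporting mentioned above stays simple against a star schema: one pass over the fact table with conditional aggregation per period. The tables and years below are assumed for illustration.

-- Compare sales of two periods (2003 vs. 2004) per product category.
SELECT p.product_category,
       SUM(CASE WHEN d.calendar_year = 2003 THEN f.sales_amount ELSE 0 END) AS sales_2003,
       SUM(CASE WHEN d.calendar_year = 2004 THEN f.sales_amount ELSE 0 END) AS sales_2004
FROM   sales_fact  f
JOIN   date_dim    d ON f.date_key    = d.date_key
JOIN   product_dim p ON f.product_key = p.product_key
WHERE  d.calendar_year IN (2003, 2004)
GROUP BY p.product_category;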
Disadvantages 

The main disadvantage of the star schema is that data integrity is not enforced as well as it is in a
highly normalized database. One-off inserts and updates can result in data anomalies
which normalized schemas are designed to avoid. Generally speaking, star schemas are loaded in a
highly controlled fashion via batch processing or near-real-time "trickle feeds" to compensate for the lack
of protection afforded by normalization.

Snowflake schemas
The snowflake schema consists of one fact table that is connected to many dimension
tables, which can be connected to other dimension tables through a many-to-one
relationship.

Tables in a snowflake schema are usually normalized to the third normal form. Each
dimension table represents exactly one level in a hierarchy.

The following figure shows a snowflake schema with two dimensions, each having three
levels. A snowflake schema can have any number of dimensions and each dimension can
have any number of levels.

Figure 1. Snowflake schema with two dimensions and three levels each
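
As a sketch (with assumed names), the star-schema product dimension sketched earlier would be split so that each hierarchy level gets its own normalized table; category attributes then live in a separate table reached through an extra join.

-- Snowflaked product dimension: the category level is held in its own table.
CREATE TABLE category_dim (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(100),
    category_key INTEGER REFERENCES category_dim(category_key)
);

-- The fact table still references only the lowest level (product_dim);
-- queries reach category_name through one more join than in a star schema.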
Advantages of the Snowflake Schema:
1. It provides greater flexibility in the interrelationships between dimension levels and components.
2. No redundancy, so it is easier to maintain.

Disadvantages of the Snowflake Schema:

1. Queries are more complex and hence more difficult to understand.
2. More tables and more joins, so longer query execution time.

 Starflake schemas
A starflake schema is a combination of a star schema and a snowflake schema.
Starflake schemas are snowflake schemas where only some of the dimension tables
have been denormalized.

 Starflake schemas aim to leverage the benefits of both star schemas and snowflake
schemas. The hierarchies of star schemas are denormalized, while the hierarchies of
snowflake schemas are normalized.
 Starflake schemas are normalized to remove any redundancies in the dimensions. To
normalize the schema, the shared dimensional hierarchies are placed in outriggers.
 The following figure depicts a sample starflake schema:
 Figure 1. Starflake schema with one fact and two dimensions that share an outrigger

 Many-to-one relationships
A many-to-one relationship refers to one table or entity that contains values and
refers to another table or entity that has unique values. Many-to-one relationships
are often enforced by foreign key and primary key relationships, and the
relationships typically are between fact and dimension tables or entities and between
levels in a hierarchy.
Snowflake Schema vs Star Schema

 Ease of maintenance/change: Snowflake - no redundancy, hence easier to maintain and change. Star - has redundant data, hence less easy to maintain/change.
 Ease of use: Snowflake - more complex queries, hence less easy to understand. Star - less complex queries, easy to understand.
 Query performance: Snowflake - more foreign keys, hence longer query execution time. Star - fewer foreign keys, hence shorter query execution time.
 Normalization: Snowflake - normalized tables. Star - de-normalized tables.
 Type of data warehouse: Snowflake - good for the data warehouse core, to simplify complex relationships (many:many). Star - good for data marts with simple relationships (1:1 or 1:many).
 Joins: Snowflake - higher number of joins. Star - fewer joins.
 Dimension tables: Snowflake - may have more than one dimension table for each dimension. Star - contains only a single dimension table for each dimension.
 When to use: Snowflake - when the dimension table is relatively big in size, snowflaking is better as it reduces space. Star - when the dimension table contains fewer rows, the star schema is the better choice.
OLTP vs. OLAP

OLTP (On-line Transaction Processing) is characterized by a large number of short on-line
transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put
on very fast query processing, maintaining data integrity in multi-access environments, and
effectiveness measured by the number of transactions per second. In an OLTP database there
is detailed and current data, and the schema used to store transactional databases is the entity
model (usually 3NF).

OLAP (On-line Analytical Processing) is characterized by a relatively low volume of
transactions. Queries are often very complex and involve aggregations. For OLAP systems,
response time is an effectiveness measure. OLAP applications are widely used by data
mining techniques. In an OLAP database there is aggregated, historical data, stored in multi-
dimensional schemas (usually the star schema).

We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we
can assume that OLTP systems provide source data to data warehouses, whereas OLAP
systems help to analyze it.

The following table summarizes the major differences between OLTP and OLAP system
design.
OLTP System (Online Transaction Processing; operational system) vs. OLAP System (Online Analytical Processing; data warehouse)

 Source of data: OLTP - operational data; OLTP systems are the original source of the data. OLAP - consolidated data; OLAP data comes from the various OLTP databases.
 Purpose of data: OLTP - to control and run fundamental business tasks. OLAP - to help with planning, problem solving, and decision support.
 What the data reveals: OLTP - a snapshot of ongoing business processes. OLAP - multi-dimensional views of various kinds of business activities.
 Inserts and updates: OLTP - short and fast inserts and updates initiated by end users. OLAP - periodic long-running batch jobs refresh the data.
 Queries: OLTP - relatively standardized and simple queries returning relatively few records. OLAP - often complex queries involving aggregations.
 Processing speed: OLTP - typically very fast. OLAP - depends on the amount of data involved; batch data refreshes and complex queries may take many hours; query speed can be improved by creating indexes.
 Space requirements: OLTP - can be relatively small if historical data is archived. OLAP - larger due to the existence of aggregation structures and history data; requires more indexes than OLTP.
 Database design: OLTP - highly normalized with many tables. OLAP - typically de-normalized with fewer tables; uses star and/or snowflake schemas.
 Backup and recovery: OLTP - back up religiously; operational data is critical to run the business, and data loss is likely to entail significant monetary loss and legal liability. OLAP - instead of regular backups, some environments may consider simply reloading the OLTP data as a recovery method.

Online analytical processing


Online analytical processing, or OLAP (pron.: /ˈoʊlæp/), is an approach to answering multi-dimensional
analytical (MDA) queries swiftly.[1] OLAP is part of the broader category of business intelligence, which
also encompasses relational databases, report writing, and data mining.

OLAP tools enable users to analyze multidimensional data interactively from multiple perspectives. OLAP
consists of three basic analytical operations: consolidation (roll-up), drill-down, and slicing and dicing.
[6] Consolidation involves the aggregation of data that can be accumulated and computed in one or more
dimensions. For example, all sales offices are rolled up to the sales department or sales division to
anticipate sales trends. By contrast, drill-down is a technique that allows users to navigate through the
details. For instance, users can view the sales by the individual products that make up a region's sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set of data of the OLAP
cube and view (dicing) the slices from different viewpoints.

OLAP Architecture
OLAP systems have a structured architecture based on three essential components:
Database - the data source used for OLAP analysis. The database can be a relational
database organized for multidimensional storage, a multidimensional database, a data
warehouse, etc.
OLAP server - manages the multidimensional data structure and at the same
time acts as the link between the database and the OLAP client.
OLAP client - provides the data analysis applications and also supports
the generation of results (graphs, reports, etc.).
OLAP tools enable users to store data in both relational databases and multidimensional
databases. Considering how the data is stored, OLAP systems can be classified as:
- ROLAP systems
- MOLAP systems
- HOLAP systems

 WOLAP - Web-based OLAP


 DOLAP - Desktop OLAP
 RTOLAP - Real-Time OLAP

ROLAP works directly with relational databases. The base data and the dimension tables are stored as
relational tables, and new tables are created to hold the aggregated information. It depends on a
specialized schema design. This methodology relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each
action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement. ROLAP tools
do not use pre-calculated data cubes but instead pose the query to the standard relational database and
its tables in order to bring back the data required to answer the question. ROLAP tools feature the ability
to ask any question because the methodology is not limited to the contents of a cube. ROLAP also has
the ability to drill down to the lowest level of detail in the database.
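
For instance, a ROLAP deployment might maintain a summary table such as the hypothetical one below, and each slice or dice the user performs is translated into a WHERE clause against it (or against the base tables).

-- Pre-aggregated ROLAP summary table (all names assumed for illustration).
CREATE TABLE sales_by_month_region AS
SELECT d.calendar_year,
       d.calendar_month,
       r.region_name,
       SUM(f.sales_amount) AS sales_amount
FROM   sales_fact f
JOIN   date_dim   d ON f.date_key   = d.date_key
JOIN   region_dim r ON f.region_key = r.region_key
GROUP BY d.calendar_year, d.calendar_month, r.region_name;

-- A slice/dice by the user becomes a WHERE clause on the summary table.
SELECT calendar_month, sales_amount
FROM   sales_by_month_region
WHERE  calendar_year = 2004
  AND  region_name   = 'North';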

Advantages:

 Can handle large amounts of data: The data size limitation of ROLAP technology
is the limitation on data size of the underlying relational database. In other words,
ROLAP itself places no limitation on data amount.
 Can leverage functionalities inherent in the relational database: Often, relational
database already comes with a host of functionalities. ROLAP technologies,
since they sit on top of the relational database, can therefore leverage these
functionalities.

Disadvantages:

 Performance can be slow: Because each ROLAP report is essentially a SQL
query (or multiple SQL queries) in the relational database, the query time can be
long if the underlying data size is large.
 Limited by SQL functionalities: Because ROLAP technology mainly relies on
generating SQL statements to query the relational database, and SQL
statements do not fit all needs (for example, it is difficult to perform complex
calculations using SQL), ROLAP technologies are therefore traditionally limited
by what SQL can do. ROLAP vendors have mitigated this risk by building into the
tool out-of-the-box complex functions as well as the ability to allow users to
define their own functions.

MOLAP systems have focused on optimizing flexibility and storage techniques and on the
concept of the transaction [2].
MOLAP systems are much faster in terms of data aggregation and in terms of queries; however,
they generate large volumes of data. Query response time is improved because
aggregations of the data are precalculated and responses to queries are prepared before the
application is launched.
Analyzing disk space and the response-time performance of complex queries, we can say that
MOLAP cubes are best.
Arguments in favour of MOLAP systems include [2]:
- Relational tables are not suitable for multidimensional data;
- Multidimensional arrays allow storage of multidimensional data efficiently;
- The SQL language is not suitable for multidimensional operations.

Advantages:

 Excellent performance: MOLAP cubes are built for fast data retrieval, and are
optimal for slicing and dicing operations.
 Can perform complex calculations: All calculations have been pre-generated
when the cube is created. Hence, complex calculations are not only doable, but
they return quickly.

Disadvantages:

 Limited in the amount of data it can handle: Because all calculations are
performed when the cube is built, it is not possible to include a large amount of
data in the cube itself. This is not to say that the data in the cube cannot be
derived from a large amount of data. Indeed, this is possible. But in this case,
only summary-level information will be included in the cube itself.
 Requires additional investment: Cube technology is often proprietary and may
not already exist in the organization. Therefore, to adopt MOLAP technology,
chances are that additional investments in human and capital resources are needed.

HOLAP (Hybrid OLAP) can be called a compromise between the first two approaches; it
attempts to combine the advantages of MOLAP and ROLAP to give users the best
possible solution in terms of performance. In HOLAP, the base (detail) data are stored in a relational
database, while the aggregated data are saved in a HOLAP cube.
HOLAP is not the most advanced option in terms of scalability, nor in terms of speed;
however, it provides an acceptable level of both.
Comparison
Each type has certain benefits, although there is disagreement about the specifics of the benefits
between providers.

 Some MOLAP implementations are prone to database explosion, a phenomenon causing vast
amounts of storage space to be used by MOLAP databases when certain common conditions are
met: high number of dimensions, pre-calculated results and sparse multidimensional data.

 MOLAP generally delivers better performance due to specialized indexing and storage
optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized
storage typically includes compression techniques.[15]

 ROLAP is generally more scalable.[15] However, large-volume pre-processing is difficult to
implement efficiently, so it is frequently skipped. ROLAP query performance can therefore suffer
tremendously.

 Since ROLAP relies more on the database to perform calculations, it has more limitations in the
specialized functions it can use.

 HOLAP encompasses a range of solutions that attempt to mix the best of ROLAP and MOLAP. It
can generally pre-process swiftly, scale well, and offer good function support.

Cube browsing is fastest when using MOLAP. This is so even in cases where no aggregations have
been done. The data is stored in a compressed multidimensional format and can be accessed more quickly
than in the relational database. Browsing is very slow in ROLAP and about the same in HOLAP. Processing
time is slower in ROLAP, especially at higher levels of aggregation.

MOLAP storage takes up more space than HOLAP, as data is copied, and at very low levels of
aggregation it takes up more room than ROLAP. ROLAP takes almost no storage space as data is not
duplicated. However, ROLAP aggregations take up more space than MOLAP or HOLAP aggregations.

All data is stored in the cube in MOLAP, and data can be viewed even when the original data source is not
available. In ROLAP data cannot be viewed unless connected to the data source.

MOLAP can handle only a limited amount of data, as all data is stored in the cube.
UNIT-3
Data Mining
Data mining is the computational process of discovering patterns in large data sets. The overall goal of the
data mining process is to extract information from a data set and transform it into an understandable
structure for further use.

WHAT ARE THE MOST USED DATA MINING TECHNIQUES?

CLASSIFICATION

Classification is probably the most widely used data mining technique.

Most decision making models are usually based upon classification methods. These
techniques, also called classifiers, enable the categorisation of data (or entities) into
pre-defined classes.

The use of classification algorithms involves a training set consisting of pre-classified
examples. In the tax audit domain, the two classes could be compliant filings versus
non-compliant filings, and the training set would be assembled from historical audits.
The classifier calibration algorithm uses the pre-classified examples to determine a set
of parameters required for proper discrimination between the classes. The algorithm
then encodes these parameters into a model called a classifier. Once such a classifier is
calibrated, it can assign new filings to either of the classes.

There are many algorithms that can be used for classification, such as decision trees,
neural networks, logistic regression, etc.

Using this data mining technique, the data mining tool learns from examples or from the data
(data warehouses, databases, etc.) how to partition or classify certain objects (an object can be
an entity, an action, or any other piece of information that can be formalised). As a result, the
data mining software formulates classification rules.

 Example - customer database

o Question - does the customer belong to the loyal ones?
o Typical rule formulated:

if PURCHASED = monthly and PROFIT > 5000$ and INCIDENTS = 0 then CUSTOMER_TYPE = LOYAL
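
Purely as an illustration of how such a rule could be applied once it has been learned (the customer table and the purchase_frequency, profit, and incidents columns are assumed; a real classifier would derive the thresholds from the training set), the rule can be expressed as a CASE expression:

-- Apply the illustrative rule to label customers; names and thresholds assumed.
SELECT customer_id,
       CASE
           WHEN purchase_frequency = 'monthly'
                AND profit > 5000
                AND incidents = 0
           THEN 'LOYAL'
           ELSE 'OTHER'
       END AS customer_type
FROM   customer;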

 
CLUSTERING (SEGMENTATION)

Clustering is a data mining technique used to discover and explore groupings within
data or entities. Clustering approaches are mainly used for segmentation - for
example, they can be used to identify polluted soil areas. The clustering method allows entities
to be partitioned into distinct groups, also called "segments". The main difference
between classification and clustering is that clustering structures data without
knowing anything about classes, while the classification method assigns new cases to
classes that are known a priori.

Cluster analysis is a visual method that helps to understand data structure.

ASSOCIATION

Association rules are basic types of patterns or regularities that are found in
transactional-type data. This data mining technique has its origins in traditional retail
marketing where it can discover affinities between items that occur within a particular
shopping trip (for example, what items typically co-occur as contents of a shopping
basket). Hence, an alternative name for this type of analysis is “market-basket
analysis”.

From a set of transaction data (for example tax filings, or insurance claims), association
rules can discover characteristics within a transaction that imply the presence of other
characteristics in the same transaction. For two sets of characteristics X and Y, an
association rule is usually denoted X => Y, to convey that the presence of the characteristics
X in a transaction frequently implies the presence of the characteristics Y.

With the help of association methods, data mining software creates rules that associate
one attribute of a relation with another. Discovering these rules is very efficient with
set-oriented approaches.

 Example - customer database in a supermarket

o 56% of the customers who purchase Article1 also purchase Article2

56% is the confidence factor of the rule.
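
A back-of-the-envelope way to check such a rule in SQL, assuming a hypothetical purchases table with one row per customer per purchased article, is to compute the confidence directly (for the example above the result would be 56):

-- Confidence of the rule "Article1 => Article2":
--   customers buying both articles / customers buying Article1.
SELECT COUNT(DISTINCT a2.customer_id) * 100.0
       / COUNT(DISTINCT a1.customer_id) AS confidence_pct
FROM   purchases a1
LEFT JOIN purchases a2
       ON  a1.customer_id = a2.customer_id
       AND a2.article     = 'Article2'
WHERE  a1.article = 'Article1';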

SEQUENCE/TEMPORAL

Sequential patterns involve mining frequently occurring patterns of activity over a
period of time. In many situations, not only may the coexistence of items within a
transaction be important (which would be discovered by association rules algorithms),
but also the order in which those items appear across ordered transactions, and the
amount of time between transactions (which would be discovered by sequential pattern
detection algorithms). Thus, sequential pattern detection methods are similar to
association rules, except that they look for patterns across time (as opposed to patterns
within transactions). This could be a pattern that represents a sequence of tax filings
over time, or a sequence of purchases over time, etc.

Sequence rules differ from other data mining methods in that they take the temporal factor into account.
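
As a deliberately simplified sketch of the idea (real sequential-pattern algorithms handle time windows, support thresholds, and longer sequences), the query below counts, in a hypothetical purchases table, how often one article is bought by the same customer before another article on a later date:

-- Count ordered pairs: article_a bought before article_b by the same customer.
SELECT a.article AS first_article,
       b.article AS later_article,
       COUNT(*)  AS occurrences
FROM   purchases a
JOIN   purchases b
       ON  a.customer_id   = b.customer_id
       AND b.purchase_date > a.purchase_date
WHERE  a.article <> b.article
GROUP BY a.article, b.article
ORDER BY occurrences DESC;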

Prediction
Prediction is a wide topic and ranges from predicting the failure of components or machinery, to
identifying fraud, and even to the prediction of company profits. Used in combination with the other
data mining techniques, prediction involves analyzing trends, classification, pattern matching,
and relation. By analyzing past events or instances, you can make a prediction about an event.
Using credit card authorization, for example, you might combine decision tree analysis of
individual past transactions with classification and historical pattern matching to identify whether
a transaction is fraudulent. If a purchase of flights to the US can be matched with
transactions in the US, it is likely that the transactions are valid.
Sequential patterns
Often used over longer-term data, sequential patterns are a useful method for identifying
trends, or regular occurrences of similar events. For example, with customer data you can
identify that customers buy a particular collection of products together at different times of the
year. In a shopping basket application, you can use this information to automatically suggest
that certain items be added to a basket based on their frequency and past purchasing history.
Decision trees
Related to most of the other techniques (primarily classification and prediction), the decision
tree can be used either as a part of the selection criteria, or to support the use and selection of
specific data within the overall structure. Within the decision tree, you start with a simple
question that has two (or sometimes more) answers. Each answer leads to a further question to
help classify or identify the data so that it can be categorized, or so that a prediction can be
made based on each answer.
Figure 4. Decision tree - an example of classifying an incoming error condition.
Decision trees are often used with classification systems to attribute type information,
and with predictive systems, where different predictions might be based on past historical experience that helps drive
the structure of the decision tree and the output.
Data Mining Task Primitives
Each user will have a data mining task in mind, that is, some form of data analysis that
he or she would like to have performed. A data mining task can be specified in the form
of a data mining query, which is input to the data mining system. A data mining query is
defined in terms of data mining task primitives. These primitives allow the user to
interactively communicate with the data mining system during discovery in order to direct
the mining process, or examine the findings from different angles or depths. The data
mining primitives specify the following.

The set of task-relevant data to be mined: This specifies the portions of the database
or the set of data in which the user is interested. This includes the database attributes
or data warehouse dimensions of interest (referred to as the relevant attributes or
dimensions).

The kind of knowledge to be mined: This specifies the data mining functions to be
performed,such as characterization, discrimination, association or correlation analysis,
classification, prediction, clustering, outlier analysis, or evolution analysis.

The background knowledge to be used in the discovery process: This knowledge about
the domain to be mined is useful for guiding the knowledge discovery process and
for evaluating the patterns found. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction; a concept
hierarchy for the attribute (or dimension) age is a typical example. User beliefs regarding
relationships in the data are another form of background knowledge.

The interestingness measures and thresholds for pattern evaluation: These may be used
to guide the mining process or, after discovery, to evaluate the discovered patterns.
Different kinds of knowledge may have different interestingness measures. For
example, interestingness measures for association rules include support and
confidence. Rules whose support and confidence values are below user-specified
thresholds are considered uninteresting.
The expected representation for visualizing the discovered patterns: This refers to
the form in which discovered patterns are to be displayed, which may include rules,
tables, charts, graphs, decision trees, and cubes.

OLAP vs Data Mining

Data mining as a part of knowledge discovery in databases:


Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied
data. It comprises six phases: data selection, data cleansing, enrichment, data transformation or
encoding, data mining, and the reporting and display of the discovered information.
