DWDM Unit 1
Data Warehouse
Data Warehouse is a relational database management system (RDBMS) construct designed for query and analysis
rather than for transaction processing. It can be loosely described as any centralized data repository which can be
queried for business benefit. It is a database that stores information oriented to satisfy decision-making requests. It
is a group of decision support technologies aimed at enabling the knowledge worker (executive, manager, and
analyst) to make better and faster decisions. So, data warehousing supports architectures and tools for business
executives to systematically organize, understand, and use their information to make strategic decisions.
A data warehouse environment contains an extraction, transformation, and loading (ETL) solution, an online analytical
processing (OLAP) engine, client analysis tools, and other applications that handle the process of gathering
information and delivering it to business users.
A Data Warehouse is a group of data specific to the entire organization, not only to a particular group of users.
It is not used for daily operations and transaction processing but used for making decisions.
A Data Warehouse can be viewed as a data system with the following attributes:
It is a database designed for investigative tasks, using data from various applications.
Integrated
A data warehouse integrates various heterogeneous data sources, like RDBMSs, flat files, and online transaction
records. It requires performing data cleaning and integration during data warehousing to ensure consistency in
naming conventions, attribute types, etc., among different data sources.
Non-Volatile
The data warehouse is a physically separate data storage, which is transformed from the source operational
RDBMS. The operational updates of data do not occur in the data warehouse, i.e., update, insert, and delete
operations are not performed. It usually requires only two procedures in data accessing: Initial loading of data and
access to data. Therefore, the DW does not require transaction processing, recovery, and concurrency capabilities,
which allows for substantial speedup of data retrieval. Non-volatile means that, once entered into the warehouse,
data should not change.
1) Business User: Business users require a data warehouse to view summarized data from the past. Since these
people are non-technical, the data may be presented to them in an elementary form.
2) Store historical data: A data warehouse is required to store time-variant data from the past. This input is
made to be used for various purposes.
4) For data consistency and quality: By bringing data from different sources to a common place, the user can
effectively work to bring uniformity and consistency to the data.
5) High response time: A data warehouse has to be ready for somewhat unexpected loads and types of queries,
which demands a significant degree of flexibility and quick response time.
3. The structure of data warehouses is more accessible for end-users to navigate, understand, and query.
4. Queries that would be complex in many normalized databases could be easier to build and maintain in data
warehouses.
5. Data warehousing is an efficient method to manage demand for lots of information from lots of users.
6. Data warehousing provides the capability to analyze large amounts of historical data.
Prerequisites
Before learning about Data Warehouse, you must have the fundamental knowledge of basic database concepts
such as schema, ER model, structured query language, etc.
Difference between Operational Systems and Data Warehousing Systems
Operational systems are designed to support high-volume transaction processing; data warehousing systems are typically designed to support high-volume analytical processing (i.e., OLAP).
Data within operational systems is updated regularly according to need; a data warehouse is non-volatile: new data may be added regularly, but once added it is rarely changed.
Operational systems are optimized for a simple set of transactions, generally adding or retrieving a single row at a time per table; data warehousing systems are optimized for bulk loads and large, complex, unpredictable queries that access many rows per table.
Operational systems are optimized for validation of incoming information during transactions and use validation data tables; a data warehouse is loaded with consistent, valid information and requires no real-time validation.
Operational systems support thousands of concurrent clients; data warehousing systems support relatively few concurrent clients compared with OLTP.
Operational systems are widely process-oriented; data warehousing systems are widely subject-oriented.
Operational systems are usually optimized to perform fast inserts and updates of relatively small volumes of data; data warehousing systems are usually optimized to perform fast retrievals of relatively large volumes of data.
Relational databases are created for On-Line Transaction Processing (OLTP); data warehouses are designed for On-Line Analytical Processing (OLAP).
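The contrast can be made concrete with a small sketch. The following Python snippet (a minimal illustration using sqlite3; the orders table and its columns are hypothetical, not from the text) contrasts an OLTP-style single-row operation with an OLAP-style aggregate query that scans many rows:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                [(1, "North", 120.0), (2, "South", 75.5), (3, "North", 200.0)])

# OLTP style: touch a single row per transaction (insert, lookup, update).
con.execute("UPDATE orders SET amount = 130.0 WHERE order_id = 1")

# OLAP style: scan many rows to answer an analytical question.
for region, total in con.execute("SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)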
Statisticians − There are generally only a handful of sophisticated analysts (statisticians and operations
research types) in any organization. Though few, they are among the best users of the data warehouse.
Knowledge Workers − A relatively small number of analysts perform the bulk of new queries and analyses
against the data warehouse. These are the users who have the “Designer” or “Analyst” versions of user access
tools.
After a few iterations, their queries and documents generally get published for the benefit of the information
consumers. Knowledge Workers are often intensely engaged with the data warehouse design and place the
highest demands on the ongoing data warehouse operations team for training and support.
Information Consumers − Some users of the data warehouse are information consumers and they will probably
never compose a valid ad hoc query. They use static or simple interactive documents that have been developed.
It is simple to forget about these users because they generally communicate with the data warehouse only
through the work product of others.
Executives − Executives are a specific case of the Information Consumers group. Some executives issue their
own queries, but an executive’s slightest musing can create a flurry of activity among the other types of users. A
wise data warehouse designer will develop a polished digital dashboard for executives, provided it is easy
and economical to do so. Generally, this must follow other data warehouse work, but it never hurts to impress
the bosses.
5. Integration and transformation rules used to deliver information to end-user analytical tools.
Metadata is used for building, maintaining, managing, and using the data warehouse. Metadata allows users
to understand the content and find the data they need.
2. The table of contents and the index in a book may be treated as metadata for the book.
3. Suppose we say that a data item about a person is 80. This must be defined by noting that it is the person's
weight and that the unit is kilograms. Therefore, (weight, kilograms) is the metadata about the data value 80
(see the sketch after this list).
4. Other examples of metadata are data about the tables and figures in a report like this book. A table (which is a
record) has a name (e.g., a table title), and the column names of the tables may be treated as metadata.
The figures also have titles or names.
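To make example 3 concrete, a data value and its metadata can be sketched as a small Python structure (a hypothetical illustration; the field names are assumptions, not a standard):

# The raw data item alone is ambiguous:
value = 80

# Metadata gives it meaning: what is measured, in which unit, about what.
record = {
    "value": value,
    "metadata": {"attribute": "weight", "unit": "kilograms", "subject": "person"},
}
print(record["metadata"]["attribute"], record["value"], record["metadata"]["unit"])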
Next, it provides information about the contents and structures to the developers.
Finally, it opens the doors to the end-users and makes the contents recognizable in their terms.
Metadata is like a nerve center. Various processes during the building and administering of the data warehouse
generate parts of the data warehouse metadata, and other processes use parts of the metadata generated
elsewhere. In the data warehouse, metadata assumes a key position and enables communication among the
various processes. It acts as a nerve center in the data warehouse.
Types of Metadata:
There are many types of metadata that can be used to describe different aspects of data, such as its content,
format, structure, and provenance. Some common types of metadata include:
1. Descriptive metadata: This type of metadata provides information about the content, structure, and format of
data, and may include elements such as title, author, subject, and keywords. Descriptive metadata helps to
identify and describe the content of data and can be used to improve the discoverability of data through search
engines and other tools.
2. Administrative metadata: This type of metadata provides information about the management and technical
characteristics of data, and may include elements such as file format, size, and creation date. Administrative
metadata helps to manage and maintain data over time and can be used to support data governance and
preservation.
3. Structural metadata: This type of metadata provides information about the relationships and organization of
data, and may include elements such as links, tables of contents, and indices. Structural metadata helps to
organize and connect data and can be used to facilitate the navigation and discovery of data.
4. Provenance metadata: This type of metadata provides information about the history and origin of data, and may
include elements such as the creator, date of creation, and sources of data. Provenance metadata helps to
provide context and credibility to data and can be used to support data governance and preservation.
5. Rights metadata: This type of metadata provides information about the ownership, licensing, and access
controls of data, and may include elements such as copyright, permissions, and terms of use. Rights metadata
helps to manage and protect the intellectual property rights of data and can be used to support data governance
and compliance.
6. Educational metadata: This type of metadata provides information about the educational value and learning
objectives of data, and may include elements such as learning outcomes, educational levels, and competencies.
Educational metadata can be used to support the discovery and use of educational resources, and to support
the design and evaluation of learning environments.
Ease of creation
Potential clients are more clearly defined than in a comprehensive data warehouse
Other than these two categories, one more type exists, called "Hybrid Data Marts."
Designing
The design step is the first in the data mart process. This phase covers all of the functions from initiating the request
for a data mart through gathering data about the requirements and developing the logical and physical design of the
data mart.
It involves the following tasks:
Constructing
This step contains creating the physical database and logical structures associated with the data mart to provide
fast and efficient access to the data.
It involves the following tasks:
2. Creating the schema objects, such as tables and indexes, described in the design step.
Populating
This step includes all of the tasks related to getting data from the source, cleaning it up, modifying it to the right
format and level of detail, and moving it into the data mart.
It involves the following tasks:
2. Extracting data
Accessing
This step involves putting the data to use: querying the data, analyzing it, creating reports, charts and graphs and
publishing them.
It involves the following tasks:
1. Set up an intermediate layer (meta layer) for the front-end tool to use. This layer translates database
operations and object names into business terms so that end-clients can interact with the data mart
using words that relate to the business functions.
2. Set up and manage database structures, such as summarized tables, that help queries submitted through the
front-end tools execute rapidly and efficiently (see the sketch below).
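As an illustration of task 2, the sketch below (Python with sqlite3; the table and column names are hypothetical) precomputes a summary table so that front-end queries read a few aggregated rows instead of scanning the detail table each time:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales_detail (product TEXT, month TEXT, amount REAL)")
con.executemany("INSERT INTO sales_detail VALUES (?, ?, ?)",
                [("pen", "Jan", 10.0), ("pen", "Jan", 15.0), ("ink", "Feb", 7.5)])

# Summary table maintained in the data mart; front-end tools query it
# directly instead of re-aggregating the detail rows on every request.
con.execute("""CREATE TABLE sales_by_product_month AS
               SELECT product, month, SUM(amount) AS total_amount
               FROM sales_detail GROUP BY product, month""")

for row in con.execute("SELECT * FROM sales_by_product_month"):
    print(row)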
Managing
This step contains managing the data mart over its lifetime; ongoing management functions are performed here.
Difference between Data Warehouse and Data Mart
A Data Warehouse is a vast repository of information collected from various departments within a corporation; a data mart is a subtype of a data warehouse, architected to meet the requirements of a specific user group.
A Data Warehouse may hold multiple subject areas; a data mart holds only one subject area, for example, Finance or Sales.
In data warehousing, the fact constellation schema is used; in a data mart, the star schema and snowflake schema are used.
Data Warehouse applications are designed to support the user ad-hoc data requirements, an activity recently
dubbed online analytical processing (OLAP). These include applications such as forecasting, profiling, summary
reporting, and trend analysis.
Production databases are updated continuously, either by hand or via OLTP applications. In contrast, a warehouse
database is updated from operational systems periodically, usually during off-hours. As OLTP data accumulates in
production databases, it is regularly extracted, filtered, and then loaded into a dedicated warehouse server that is
accessible to users. As the warehouse is populated, it must be restructured: tables de-normalized, data cleansed of
errors and redundancies, and new fields and keys added to reflect the needs of the user for sorting, combining, and
summarizing data.
Data warehouses and their architectures vary depending upon the elements of an organization's situation.
Three common architectures are:
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day
transactions of an organization.
Flat Files
A flat file stores data in a plain format with no structural relationships among records; flat files exported from
source systems are a common input to the warehouse.
Meta Data
A set of data that defines and gives information about other data.
Metadata is used in the data warehouse for a variety of purposes. It summarizes necessary information about data,
which can make finding and working with particular instances of data easier. For example, author, date created,
date modified, and file size are examples of very basic document metadata.
The principal purpose of a data warehouse is to provide information to the business managers for strategic
decision-making. These customers interact with the warehouse using end-client access tools.
The examples of some of the end-user access tools can be:
Data Warehouse Staging Area is a temporary location where a record from source systems is copied.
1. Separation: Analytical and transactional processing should be kept apart as much as possible.
2. Scalability: Hardware and software architectures should be easy to upgrade as the data volume that has to be
managed and processed, and the number of user requirements that have to be met, progressively increase.
3. Extensibility: The architecture should be able to perform new operations and technologies without redesigning
the whole system.
4. Security: Monitoring accesses are necessary because of the strategic data stored in the data warehouses.
5. Administerability: Data Warehouse management should not be complicated.
Single-Tier Architecture
Single-tier architecture is not frequently used in practice. Its purpose is to minimize the amount of data stored; to
reach this goal, it removes data redundancies.
The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual.
This means that the data warehouse is implemented as a multidimensional view of operational data created by
specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and
transactional processing. Analysis queries are submitted to operational data after the middleware interprets them;
in this way, queries affect transactional workloads.
Two-Tier Architecture
Although it is typically called two-layer architecture to highlight the separation between physically available sources
and the data warehouse, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. The data is stored initially in
corporate relational databases or legacy databases, or it may come from information systems outside the
corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill
gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction,
Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse,
validate, filter, and load source data into a data warehouse (see the sketch after this list).
3. Data Warehouse layer: Information is saved to one logically centralized individual repository: a data warehouse.
The data warehouses can be directly accessed, but it can also be used as a source for creating data marts,
which partially replicate data warehouse contents and are designed for specific enterprise departments. Meta-
data repositories store information on sources, access procedures, data staging, users, data mart schema, and
so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze
information, and simulate hypothetical business scenarios. It should feature aggregate information navigators,
complex query optimizers, and customer-friendly GUIs.
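The data staging stage (stage 2 above) can be sketched in Python. This is a minimal, hypothetical ETL flow under assumed source fields, not any specific tool:

# Extract: rows pulled from a heterogeneous source (names and gaps vary).
source_rows = [
    {"cust": " Alice ", "city": "NY", "sales": "120"},
    {"cust": "Bob", "city": None, "sales": "75"},
]

def transform(row):
    # Cleanse and integrate: trim names, fill gaps, enforce one standard schema.
    return {
        "customer": row["cust"].strip(),
        "city": row["city"] or "UNKNOWN",
        "sales": float(row["sales"]),
    }

# Load: in a real system this would be written to the warehouse tables.
warehouse = [transform(r) for r in source_rows]
print(warehouse)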
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer,
and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between
the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for a whole
enterprise. At the same time, it separates the problems of source data extraction and integration from those of data
warehouse population. In some cases, the reconciled layer is also directly used to better accomplish some
operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate
applications, or generating data flows to feed external processes periodically in order to benefit from cleaning and
integration.
A star schema is a relational schema whose design represents a multidimensional data model. The star schema is
the simplest data warehouse schema. It is known as a star schema because the entity-relationship diagram of this
schema resembles a star, with points diverging from a central table. The center of the schema consists of a large
fact table, and the points of the star are the dimension tables.
A fact table may contain either detail-level facts or facts that have been aggregated (fact tables that contain
aggregated facts are often instead called summary tables). A fact table generally contains facts at the same level
of aggregation.
Dimension Tables
A dimension is a structure usually composed of one or more hierarchies that categorize data. If a dimension has
no hierarchies and levels, it is called a flat dimension or list. The primary keys of each of the dimension tables
are part of the composite primary key of the fact table. Dimensional attributes help to define the dimensional values;
they are generally descriptive, textual values. Dimension tables are usually smaller in size than fact tables.
Fact tables store data about sales, while dimension tables store data about geographic regions (markets, cities),
clients, products, times, and channels.
It provides a flexible design that can be changed easily or added to throughout the development cycle, and as
the database grows.
It provides a design that parallels how end-users typically think of and use the data.
Query Performance
Because a star schema database has a small number of tables and clear join paths, queries run faster than they do
against OLTP systems. Small single-table queries, frequently of a dimension table, are almost instantaneous. Large
join queries that contain multiple tables take only seconds or minutes to run.
In a star schema database design, the dimensions are connected only through the central fact table. When two
dimension tables are used in a query, only one join path, intersecting the fact table, exists between those two tables.
Easily Understood
A star schema is simple to understand and navigate, with dimensions joined only through the fact table. These joins
are more significant to the end-user because they represent the fundamental relationship between parts of the
underlying business. Users can also browse dimension table attributes before constructing a query.
The TIME table has a column for each day, month, quarter, and year. The ITEM table has columns for item_key,
item_name, brand, type, and supplier_type. The BRANCH table has columns for branch_key, branch_name, and
branch_type. The LOCATION table has columns of geographic data, including street, city, state, and country.
In this scenario, the SALES table contains only four columns with IDs from the dimension tables, TIME, ITEM,
BRANCH, and LOCATION, instead of four columns for time data, four columns for ITEM data, three columns for
BRANCH data, and four columns for LOCATION data. Thus, the size of the fact table is significantly reduced. When
we need to change an item, we need only make a single change in the dimension table, instead of making many
changes in the fact table.
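A minimal sketch of this star schema in Python with sqlite3 follows (column lists are abbreviated, and the measures units_sold and dollars_sold are assumed for illustration; the dimension names follow the example above):

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time (time_key INTEGER PRIMARY KEY, day TEXT, month TEXT, quarter TEXT, year INTEGER);
CREATE TABLE item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT, supplier_type TEXT);
CREATE TABLE branch (branch_key INTEGER PRIMARY KEY, branch_name TEXT, branch_type TEXT);
CREATE TABLE location (location_key INTEGER PRIMARY KEY, street TEXT, city TEXT, state TEXT, country TEXT);
-- Fact table: one foreign key per dimension plus the measures.
CREATE TABLE sales (
    time_key INTEGER REFERENCES time(time_key),
    item_key INTEGER REFERENCES item(item_key),
    branch_key INTEGER REFERENCES branch(branch_key),
    location_key INTEGER REFERENCES location(location_key),
    units_sold INTEGER, dollars_sold REAL
);
""")

# A typical star join: each dimension connects to SALES through one join path.
query = """SELECT l.country, t.quarter, SUM(s.dollars_sold)
           FROM sales s
           JOIN time t ON s.time_key = t.time_key
           JOIN location l ON s.location_key = l.location_key
           GROUP BY l.country, t.quarter"""
print(list(con.execute(query)))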
We can create even more complex star schemas by normalizing a dimension table into several tables. The
normalized dimension table is called a Snowflake.
Snowflaking is used to improve the performance of specific queries. The schema is diagrammed with each fact
surrounded by its associated dimensions, and those dimensions are related to other dimensions, branching out into
a snowflake pattern.
The snowflake schema consists of one fact table which is linked to many dimension tables, which can be linked to
other dimension tables through a many-to-one relationship. Tables in a snowflake schema are generally normalized
to the third normal form. Each dimension table represents exactly one level in a hierarchy.
The following diagram shows a snowflake schema with two dimensions, each having three levels. A snowflake
schema can have any number of dimensions, and each dimension can have any number of levels.
A star schema stores all attributes for a dimension in one denormalized table. This requires more disk space than a
more normalized snowflake schema. Snowflaking normalizes the dimension by moving attributes with low
cardinality into separate dimension tables that relate to the core dimension table by using foreign keys. Snowflaking
for the sole purpose of minimizing disk space is not recommended, because it can adversely impact query
performance.
In a snowflake schema, tables are normalized to remove redundancy: dimension tables are broken into
multiple dimension tables.
The figure shows a simple STAR schema for sales in a manufacturing company. The sales fact table includes quantity,
price, and other relevant metrics. SALESREP, CUSTOMER, PRODUCT, and TIME are the dimension tables.
The STAR schema for sales, as shown above, contains only five tables, whereas the normalized (snowflake) version
extends to eleven tables. We will notice that in the snowflake schema, the attributes with low cardinality in each
original dimension table are removed to form separate tables. These new tables are connected back to the original
dimension table through artificial keys.
A snowflake schema is designed for flexible querying across more complex dimensions and relationships. It is
suitable for many-to-many and one-to-many relationships between dimension levels (a sketch follows below).
2. It provides greater scalability in the interrelationship between dimension levels and components.
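To see the difference from the star version, here is a hedged sketch (Python with sqlite3) of snowflaking the ITEM dimension; splitting out a SUPPLIER table is an assumed example of moving a low-cardinality attribute into its own table:

import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- The star version keeps supplier_type denormalized inside ITEM.
-- The snowflake version moves the low-cardinality supplier attribute to its
-- own table, referenced from ITEM by a foreign key.
CREATE TABLE supplier (supplier_key INTEGER PRIMARY KEY, supplier_type TEXT);
CREATE TABLE item (
    item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT, type TEXT,
    supplier_key INTEGER REFERENCES supplier(supplier_key)
);
""")

# Queries on supplier_type now need one extra join compared with the star
# schema, which is why snowflaking purely to save disk space can hurt performance.
query = """SELECT s.supplier_type, COUNT(*)
           FROM item i JOIN supplier s ON i.supplier_key = s.supplier_key
           GROUP BY s.supplier_type"""
print(list(con.execute(query)))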
A fact constellation schema describes a logical structure of a data warehouse or data mart. It can be designed
with a collection of de-normalized fact, shared, and conformed dimension tables.
A fact constellation schema is a sophisticated database design in which it is difficult to summarize information. It
can be implemented between aggregate fact tables or by decomposing a complex fact table into independent
simpler fact tables.
Information Processing
It deals with querying, statistical analysis, and reporting via tables, charts, or graphs. Nowadays, information
processing in a data warehouse is commonly done by constructing low-cost, web-based access tools that are
typically integrated with web browsers.
Analytical Processing
It supports various online analytical processing operations, such as drill-down, roll-up, and pivoting. The historical
data is processed in both summarized and detailed formats.
OLAP is implemented on data warehouses or data marts. The primary objective of OLAP is to support the ad-hoc
querying needed by decision support systems (DSS). The multidimensional view of data is fundamental to OLAP
applications. OLAP is an operational view, not a data structure or schema, and the complex nature of OLAP
applications requires a multidimensional view of the data.
Data Mining
It helps in the analysis of hidden patterns and associations, constructing analytical models, performing classification
and prediction, and presenting the mining results using visualization tools.
Data mining is the process of discovering meaningful new correlations, patterns, and trends by sifting through large
amounts of data stored in repositories, using pattern recognition technologies as well as statistical and
mathematical techniques.
It is the process of selection, exploration, and modeling of large quantities of data to discover regularities or
relations that are at first unknown, with the aim of obtaining precise and useful results for the owner of the database.
It is the process of inspection and analysis, by automatic or semi-automatic means, of large quantities of records to
discover meaningful patterns and rules.
Typical application areas include:
Finance: budgeting, activity-based costing
Sales: promotion analysis, customer analysis
Production: production planning, defect analysis
OLAP cubes have two main purposes. The first is to provide business users with a data model more intuitive to them
than a tabular model. This model is called a Dimensional Model.
The second purpose is to enable fast query response that is usually difficult to achieve using tabular models.
1) Multidimensional Conceptual View: This is the central feature of an OLAP system. By requiring a
multidimensional view, it is possible to carry out operations like slice and dice.
2) Transparency: Make the technology, underlying information repository, computing operations, and the dissimilar
nature of source data totally transparent to users. Such transparency helps to improve the efficiency and
productivity of the users.
3) Accessibility: It provides access only to the data that is actually required to perform the particular analysis,
present a single, coherent, and consistent view to the clients. The OLAP system must map its own logical schema to
the heterogeneous physical data stores and perform any necessary transformations. The OLAP operations should
be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
4) Consistent Reporting Performance: To make sure that users do not experience any significant degradation in
reporting performance as the number of dimensions or the size of the database increases. That is, the
performance of OLAP should not suffer as the number of dimensions is increased. Users must observe consistent
run time, response time, or machine utilization every time a given query is run.
5) Client/Server Architecture: Make the server component of OLAP tools sufficiently intelligent that the various
clients can be attached with a minimum of effort and integration programming. The server should be capable of
mapping and consolidating data between dissimilar databases.
6) Generic Dimensionality: An OLAP method should treat each dimension as equivalent in both its structure and
operational capabilities. Additional operational capabilities may be granted to selected dimensions, but such
additional functions should be grantable to any dimension.
7) Dynamic Sparse Matrix Handling: The OLAP system should adapt its physical schema to the specific analytical
model being created and loaded so that sparse matrix handling is optimized. When encountering a sparse matrix,
the system must be able to dynamically deduce the distribution of the data and adjust storage and access paths
to obtain and maintain a consistent level of performance.
8) Multiuser Support: OLAP tools must provide concurrent data access, data integrity, and access security.
10) Intuitive Data Manipulation: Data manipulation fundamental to the consolidation path, such as reorientation
(pivoting), drill-down and roll-up, and other manipulations, should be accomplished naturally and precisely via
point-and-click and drag-and-drop actions on the cells of the analytical model. This avoids the use of a menu or
multiple trips to a user interface.
11) Flexible Reporting: Reporting should give business users the ability to arrange columns, rows, and cells in a
manner that facilitates simple manipulation, analysis, and synthesis of data.
12) Unlimited Dimensions and Aggregation Levels: The number of data dimensions should be unlimited. Each of
these common dimensions must allow a practically unlimited number of user-defined aggregation levels within
any given consolidation path.
Characteristics of OLAP
The FASMI characterization of OLAP methods derives its name from the first letters of the following characteristics:
Fast
The system is targeted to deliver most responses to users within about five seconds, with the most elementary
analyses taking no more than one second and very few taking more than 20 seconds.
Analysis
The system can cope with any business logic and statistical analysis that is relevant for the application and the
user, and keeps it easy enough for the target user. Although some pre-programming may be needed, the system
should allow the user to define new ad hoc calculations as part of the analysis and to report on the data in any
desired way, without having to program; products (like Oracle Discoverer) that do not allow adequate
end-user-oriented calculation flexibility are excluded.
Shared
The system implements all the security requirements for confidentiality and, if multiple write access is needed,
concurrent update locking at an appropriate level. Not all applications need users to write data back, but for the
increasing number that do, the system should be able to handle multiple updates in a timely, secure manner.
Multidimensional
This is the basic requirement. An OLAP system must provide a multidimensional conceptual view of the data,
including full support for hierarchies, as this is certainly the most logical way to analyze businesses and organizations.
Information
The system should be able to hold all the data needed by the applications. Data sparsity should be handled in an
efficient manner.
The main characteristics of OLAP are as follows:
1. Multidimensional conceptual view: OLAP systems let business users have a dimensional and logical view of
the data in the data warehouse. It helps in carrying out slice and dice operations.
2. Multi-User Support: Since OLAP systems are shared, they should provide normal database operations,
including retrieval, update, concurrency control, integrity, and security.
3. Accessibility: OLAP acts as a mediator between data warehouses and front-end. The OLAP operations should
be sitting between data sources (e.g., data warehouses) and an OLAP front-end.
4. Storing OLAP results: OLAP results are kept separate from data sources.
5. Uniform reporting performance: Increasing the number of dimensions or the database size should not
significantly degrade the reporting performance of the OLAP system.
7. An OLAP system should ignore missing values and compute correct aggregate values.
8. OLAP facilitates interactive querying and complex analysis for users.
9. OLAP allows users to drill down for greater detail or roll up for aggregations of metrics along a single business
dimension or across multiple dimensions.
10. OLAP provides the ability to perform intricate calculations and comparisons.
11. OLAP presents results in a number of meaningful ways, including charts and graphs.
Benefits of OLAP
OLAP holds several benefits for businesses:
1. OLAP helps managers in decision-making through the multidimensional views of data that it efficiently
provides, thus increasing their productivity.
2. OLAP functions are self-sufficient owing to the inherent flexibility of the underlying organized databases.
3. It facilitates simulation of business models and problems through extensive management of analysis
capabilities.
4. In conjunction with a data warehouse, OLAP can be used to support a reduction in the application backlog,
faster data retrieval, and a reduction in query drag.
2) Understanding and decreasing the costs of doing business: Improving sales is one method of improving a
business; the other is to analyze costs and control them as much as possible without affecting sales. OLAP can
assist in analyzing the costs related to sales. In some cases, it may also be feasible to identify expenditures that
produce a high return on investment (ROI). For example, recruiting a top salesperson may involve high costs, but
the revenue generated by the salesperson may justify the investment.
Types of OLAP
There are three main types of OLAP servers:
HOLAP stands for Hybrid OLAP, an application using both relational and multidimensional techniques.
ROLAP servers use a relational or extended-relational DBMS to store and manage warehouse data, and OLAP
middleware to provide missing pieces.
ROLAP servers include optimization for each DBMS back end, implementation of aggregation navigation logic, and
additional tools and services.
ROLAP technology tends to have higher scalability than MOLAP technology.
ROLAP Architecture includes the following components:
Database server.
ROLAP server.
Front-end tool.
Relational OLAP (ROLAP) is the latest and fastest-growing OLAP technology segment in the market. This method
allows multiple multidimensional views of two-dimensional relational tables to be created, avoiding structuring
the data around the desired view.
Some products in this segment have supported strong SQL engines to handle the complexity of multidimensional
analysis. This includes creating multiple SQL statements to handle user requests, being 'RDBMS'-aware, and
being capable of generating SQL statements based on the optimizer of the DBMS engine.
Advantages
Can handle large amounts of information: The data size limitation of ROLAP technology depends on the data size
of the underlying RDBMS; ROLAP itself does not restrict the data amount.
Can leverage the relational database's features: RDBMSs already come with a lot of features, so ROLAP
technologies, which work on top of the RDBMS, can leverage these functionalities.
Disadvantages
Performance can be slow: Because each ROLAP report is a SQL query (or multiple SQL queries) against the
relational database, query time can be long if the underlying data size is large.
Limited by SQL functionalities: ROLAP technology relies upon generating SQL statements to query the relational
database, and SQL statements do not suit all needs.
One of the significant distinctions of MOLAP from ROLAP is that data is summarized and stored in an
optimized format in a multidimensional cube, instead of in a relational database. In the MOLAP model, data is
structured into proprietary formats according to clients' reporting requirements, with the calculations pre-generated
on the cubes.
MOLAP Architecture
MOLAP Architecture includes the following components:
Database server.
MOLAP server.
Front-end tool.
A MOLAP structure primarily reads precompiled data. It has limited capabilities to dynamically create
aggregations or to evaluate results that have not been pre-calculated and stored.
Applications requiring iterative and comprehensive time-series analysis of trends are well suited for MOLAP
technology (e.g., financial analysis and budgeting).
This can be very useful for organizations with performance-sensitive multidimensional analysis requirements and
that have built or are in the process of building a data warehouse architecture that contains multiple subject areas.
An example would be the creation of sales data measured by several dimensions (e.g., product and sales region) to
be stored and maintained in a persistent structure. This structure would be provided to reduce the application
overhead of performing calculations and building aggregation during initialization. These structures can be
automatically refreshed at predetermined intervals established by an administrator.
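A minimal sketch of the precomputation idea in plain Python (the product/region facts are hypothetical): the aggregations are computed once, up front, and a query then reads the precomputed cell directly instead of scanning detail rows.

from collections import defaultdict

# Detail facts: (product, region, amount).
facts = [("pen", "East", 10.0), ("pen", "West", 4.0), ("ink", "East", 7.0)]

# Build the "cube" once: pre-aggregate every (product, region) cell.
cube = defaultdict(float)
for product, region, amount in facts:
    cube[(product, region)] += amount

# Query time: a direct cube lookup, as in a MOLAP structure.
print(cube[("pen", "East")])   # -> 10.0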
Advantages
Excellent Performance: A MOLAP cube is built for fast information retrieval, and is optimal for slicing and dicing
operations.
Can perform complex calculations: All calculations have been pre-generated when the cube is created. Hence,
complex calculations are not only possible, but they return quickly.
Disadvantages
Limited in the amount of information it can handle: Because all calculations are performed when the cube is built, it
is not possible to contain a large amount of data in the cube itself.
Requires additional investment: Cube technology is generally proprietary and does not already exist in the
organization. Therefore, to adopt MOLAP technology, chances are other investments in human and capital
resources are needed.
Advantages of HOLAP
1. HOLAP provides the benefits of both MOLAP and ROLAP.
3. HOLAP balances the disk space requirement, as it stores only the aggregate information on the OLAP server
while the detail records remain in the relational database, so no duplicate copy of the detail records is maintained.
Disadvantages of HOLAP
1. HOLAP architecture is very complicated because it supports both MOLAP and ROLAP servers.
Other Types
There are also less popular types of OLAP, offered by various vendors in the industry, upon which one may stumble
every so often.
Consider the OLAP operations to be performed on multidimensional data. The figure shows data cubes for the
sales of a shop. The cube contains three dimensions: location, time, and item, where location is aggregated with
respect to city values, time is aggregated with respect to quarters, and item is aggregated with respect to item
types.
Roll-Up
The roll-up operation (also known as drill-up or the aggregation operation) performs aggregation on a data cube by
climbing up a concept hierarchy, i.e., by dimension reduction. Roll-up is like zooming out on the data cube. The figure
shows the result of roll-up operations performed on the dimension location. The hierarchy for location is defined
as the order street < city < province or state < country. The roll-up operation aggregates the data by ascending the
location hierarchy from the level of the city to the level of the country.
When a roll-up is performed by dimension reduction, one or more dimensions are removed from the cube. For
example, consider a sales data cube having two dimensions, location and time. Roll-up may be performed by
removing the time dimension, resulting in an aggregation of the total sales by location, rather than by location
and by time.
Example
Consider the following cube illustrating the temperature on certain days, recorded weekly:
Temperature 64 65 68 69 70 71 72
Week1 1 0 1 0 1 0 0
Week2 0 0 0 1 0 0 1
Consider that we want to set up levels (hot (80-85), mild (70-75), cool (64-69)) in temperature from the above
cube.
To do this, we have to group the columns and add up the values according to the concept hierarchy. This operation
is known as a roll-up (a sketch follows the result below).
Temperature cool mild hot
Week1 2 1 1
Week2 2 1 1
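The grouping above can be sketched in plain Python; the range boundaries follow the concept hierarchy given in the example (cool 64-69, mild 70-75, hot 80-85):

# Week 1 cube: temperature value -> count of days.
week1 = {64: 1, 65: 0, 68: 1, 69: 0, 70: 1, 71: 0, 72: 0}

def level(t):
    # Concept hierarchy from the example: cool (64-69), mild (70-75), hot (80-85).
    if 64 <= t <= 69:
        return "cool"
    if 70 <= t <= 75:
        return "mild"
    return "hot"

# Roll-up: climb the hierarchy by summing counts within each level.
rolled = {}
for t, count in week1.items():
    rolled[level(t)] = rolled.get(level(t), 0) + count
print(rolled)   # {'cool': 2, 'mild': 1}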
Drill-Down
The drill-down operation is the reverse of roll-up: it navigates from less detailed data to more detailed data, for
example by stepping down a concept hierarchy for a dimension. Because a drill-down adds more detail to the given
data, it can also be performed by adding a new dimension to a cube. For example, a drill-down on the central cubes
of the figure can occur by introducing an additional dimension, such as a customer group.
Example
Drill-down adds more detail to the given data, breaking each week into days:
Temperature cool mild hot
Day 1 0 0 0
Day 2 0 0 0
Day 3 0 0 1
Day 4 0 1 0
Day 5 1 0 0
Day 6 0 0 0
Day 7 1 0 0
Day 8 0 0 0
Day 9 1 0 0
Day 10 0 1 0
Day 11 0 1 0
Day 12 0 1 0
Day 13 0 0 1
Day 14 0 0 0
Slice
A slice is a subset of the cube corresponding to a single value for one or more members of a dimension. For
example, a slice operation is executed when the customer wants a selection on one dimension of a three-
dimensional cube, resulting in a two-dimensional slice. So, the slice operation performs a selection on one dimension
of the given cube, thus resulting in a subcube.
For example, if we make the selection, temperature=cool we will obtain the following cube:
Temperature cool
Day 1 0
Day 2 0
Day 3 0
Day 4 0
Day 5 1
Day 6 1
Day 7 1
Day 8 1
Day 9 1
Day 11 0
Day 12 0
Day 14 0
In the data cube of the shop figure shown earlier, a slice would operate on the dimension "time" using the criterion time = "Q1".
Dice
The dice operation defines a subcube by performing a selection on two or more dimensions.
For example, applying the selection (time = day 3 OR time = day 4) AND (temperature = cool OR temperature =
hot) to the original cube, we get the following subcube (still two-dimensional):
Temperature cool hot
Day 3 0 1
Day 4 0 0
A dice operation on the shop data cube, based on selection criteria involving three dimensions, works in the same way; a sketch of both slice and dice follows.
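Both slice and dice can be sketched as filters over cube cells in plain Python (cells keyed by (day, temperature level); the sample values are illustrative):

# Cube cells: (day, temperature_level) -> count.
cells = {
    ("Day 3", "hot"): 1,
    ("Day 4", "mild"): 1,
    ("Day 5", "cool"): 1,
}

# Slice: fix a single value on one dimension (temperature = "cool").
slice_cool = {k: v for k, v in cells.items() if k[1] == "cool"}

# Dice: select on two or more dimensions:
# (day in {Day 3, Day 4}) AND (temperature in {cool, hot}).
dice = {k: v for k, v in cells.items()
        if k[0] in {"Day 3", "Day 4"} and k[1] in {"cool", "hot"}}

print(slice_cool)   # {('Day 5', 'cool'): 1}
print(dice)         # {('Day 3', 'hot'): 1}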
Pivot
The pivot operation is also called rotation. Pivot is a visualization operation that rotates the data axes in view in
order to provide an alternative presentation of the data. It may involve swapping the rows and columns or moving
one of the row dimensions into the column dimensions.
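A rotation can be sketched as transposing the rows and columns of the presented result (plain Python; the values follow the dice example above):

# Rows: days; columns: temperature levels.
header = ["cool", "mild", "hot"]
rows = {"Day 3": [0, 0, 1], "Day 4": [0, 1, 0]}

# Pivot (rotate): temperature levels become the rows, days the columns.
pivoted = {lvl: [rows[d][i] for d in rows] for i, lvl in enumerate(header)}
print(pivoted)   # {'cool': [0, 0], 'mild': [0, 1], 'hot': [1, 0]}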
Other OLAP operations may include ranking the top-N or bottom-N elements in lists, as well as computing moving
averages, growth rates, interest, internal rates of return, depreciation, currency conversions, and statistical
functions.
OLAP offers analytical modeling capabilities, including a calculation engine for deriving ratios, variance, etc.,
and for computing measures across multiple dimensions. It can generate summarization, aggregation, and
hierarchies at each granularity level and at every dimension intersection. OLAP also provides functional models for
forecasting, trend analysis, and statistical analysis. In this context, the OLAP engine is a powerful data analysis tool.