
2nd year Master University Amar Télidji Laghouat

Decision Support Systems
Chapter 2: Data Warehouse Concepts
Younes Guellouma

Introduction
What is a Data Warehouse
Data Integration
Data representation
Multidimensional representation
Logical design of DW
OLAP Operations
Physical Design of DW
Query Languages for OLAP
SQL and ROLAP Systems
SQL and Analytical Queries
Query Language for MOLAP

Introduction
Why Business Intelligence (BI)

Wikipedia
Business intelligence (BI) consists of strategies,
methodologies, and technologies used by enterprises for
data analysis and management of business
information. Common functions of BI technologies include
reporting, online analytical processing, analytics, dashboard
development, data mining, process mining, complex event
processing, business performance management,
benchmarking, text mining, predictive analytics, and
prescriptive analytics.

Why Data Warehouse?

• Data collected from various sources and stored in various
databases cannot be directly visualized.
• The data first needs to be integrated and then processed
before visualization takes place.

Needs

Internal Data Sources vs External Data Sources

Sources of Big data

Solution

1. I can't find the data I need:
• The data is scattered across the network.
• Multiple versions, large differences.
2. I can't retrieve the data I need.
3. I can't understand the retrieved data:
• The data is poorly documented.
4. I can't use my data:
• The results are unexpected.
• The results need to be transformed into another format.

Big Data ?Vs

Data Lake

• Recommended for external data.
• On-the-fly backup of data deemed interesting.
• Data in their original form (raw data).
• Fast access to data, NoSQL-type queries.

Solutions
Microsoft Azure Data Lake

Big Data vs Data Warehouse

Big Data | Data Warehouse
Large volumes of data on which technologies can be applied | Historical data of a company
Storage technology | Data organization architecture
Structured, semi-structured, and unstructured data | Only structured data
Uses a distributed file system (HDFS) | Does not use a distributed file system for processing
Based on NoSQL query system | Uses SQL queries to search from relational databases

Data Lake vs Data Warehouse

Aspect | Data Lake | Data Warehouse
Data Structure | Raw | Processed
Usage | Store data before determining usage | Store data related to the subject in current use
Users | Data scientists | Business professionals
Accessibility | Fast | More complex

What is a Data Warehouse
Operational DBMS

• They consist of tables having attributes and are populated
by tuples.
• They generally use the E-R data model.
• They are used to store transactional data.
• The information content is generally recent.
• They are therefore called OLTP systems.
• Their goals are data accuracy and consistency,
concurrency, recoverability, and reliability (ACID
properties).

Data Warehouse

Formal Definition
"A data warehouse is a subject-oriented, integrated, time-variant, and
non-volatile collection of data in support of management's decision-making
process."

It means:

• Subject-Oriented: The stored data targets specific subjects.
Example: It may store data regarding total sales, number of customers,
etc., and not general data on everyday operations.
• Integrated: Data may be distributed across heterogeneous sources
which have to be integrated.
Example: Sales data may be in an RDB, while customer information is
on flat files, etc.
• Time-Variant: Data stored may not be current but vary with time and
include a time element.
Example: Data of sales over the last 5 years, etc.
• Non-Volatile: It is separate from the Enterprise Operational Database
and is not subject to frequent modification. It generally has only two
operations performed on it: loading of data and access of data.

Data Processing Systems

What is a DPS
A data processing system (DPS) is a hardware and software set-up that collects,
processes, and manages data to produce meaningful
information. It automates tasks such as data input, storage,
transformation, and output, making it easier to organize,
analyze, and use data for various purposes, such as
decision-making, reporting, or operational tasks.

Two main DPS

1. Online Transaction Processing (OLTP),
2. Online Analytical Processing (OLAP).

OLTP

OLTP systems are designed to manage and facilitate
high-volume transactional data. They support day-to-day
operations, such as order entry, inventory management, and
financial transactions. OLTP systems focus on:
• Transactional Data: Handling detailed, current data used
in everyday transactions.
• Efficiency and Speed: Quick processing of multiple
concurrent transactions with minimal latency.
• Data Integrity and Accuracy: Ensuring data consistency
and correctness through ACID (Atomicity, Consistency,
Isolation, Durability) properties.
Example Use Cases: Banking systems, e-commerce websites,
and CRM (Customer Relationship Management) systems.

OLAP

OLAP systems are designed for complex queries and data
analysis to support decision-making processes. They are
optimized for read-heavy operations on large datasets, often
aggregating data over time. OLAP systems focus on:
• Analytical Data: Aggregating historical data to identify
trends and perform complex calculations.
• Data Aggregation: Enabling multi-dimensional views of
data, such as analyzing sales across different regions and
time periods.
• Query Optimization: Optimized for complex queries that
can take longer to process, making them suitable for
in-depth analysis.
Example Use Cases: Data warehouses, business intelligence
systems, and reporting platforms.

OLTP vs OLAP

Aspect | OLTP | OLAP
Purpose | Manage day-to-day transactional data | Support data analysis and decision-making
Data Type | Current, detailed, and short transactions | Historical, aggregated, and multi-dimensional data
Operations | Insert, update, delete (frequent write operations) | Complex queries, primarily read operations
Response Time | Fast, milliseconds | Slower, seconds to minutes
Users | End-users and operational staff | Data analysts and business decision-makers
Database Design | Normalized, reducing redundancy | Denormalized, optimized for query performance

Data Integration
Data Integration is Hard

• Data warehouses combine data from multiple sources
• Data must be translated into a consistent format
• Data integration represents 80% of effort for a typical data warehouse
project!
• Some reasons why it's hard:
  • Metadata is often poor or non-existent
  • Data quality is often bad
  • Inconsistent semantics

Federated Databases

1. Federated databases are an alternative to data warehouses.
2. Data warehouse: create a copy of all the data using ETL, then execute
queries against the copy.
3. Federated database: pull data from the sources as needed to answer
queries, using mediators.

Mediator

Mediators play a crucial role in federated database systems by acting as
intermediaries that facilitate communication and data integration among
the various databases. Their primary functions include:

• Query Processing: Mediators receive queries from users and transform
them into appropriate queries for the individual databases. They
handle the complexities of routing queries to the correct databases
and merging the results.
• Data Integration: Mediators aggregate and harmonize data from
multiple sources, presenting it in a unified format to users. This can
involve resolving data discrepancies and applying necessary
transformations.
• Access Control: They manage permissions and security, ensuring that
users have appropriate access to the data stored in the federated
databases.
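
The query-processing and data-integration roles above can be sketched in a few lines. This is only an illustrative toy, not a real federated engine: the two in-memory SQLite databases, their table names, and the function `total_sales_by_customer` are all assumptions made for the example.

```python
import sqlite3

# Two independent "source" databases (hypothetical schemas for illustration).
sales_db = sqlite3.connect(":memory:")
sales_db.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
sales_db.executemany("INSERT INTO sales VALUES (?, ?)",
                     [(1, 100.0), (2, 250.0), (1, 50.0)])

crm_db = sqlite3.connect(":memory:")
crm_db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
crm_db.executemany("INSERT INTO customers VALUES (?, ?)",
                   [(1, "Alice"), (2, "Bob")])

def total_sales_by_customer():
    """Mediator: send one sub-query to each source, then merge the results."""
    totals = dict(sales_db.execute(
        "SELECT customer_id, SUM(amount) FROM sales GROUP BY customer_id"))
    names = dict(crm_db.execute("SELECT id, name FROM customers"))
    # The merge happens in the mediator; nothing is copied to a central store.
    return {names[cid]: total for cid, total in totals.items()}

result = total_sales_by_customer()
print(result)
```

The data stays in the sources; only the answer to the current query is assembled, which is the main contrast with a warehouse copy.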

Extract, Transform & Load

ETL (Extract, Transform, Load) is a data integration process used to gather data from various sources,
transform it into a suitable format for analysis, and load it into a target
data storage system, typically a data warehouse or data lake.


1. Extract: This phase involves collecting data from diverse sources, which
can include databases, flat files, APIs, and more. The goal is to gather
all relevant data needed for analysis.
2. Transform: During this phase, the extracted data is cleansed and
transformed to ensure its quality and suitability for analysis. This may
involve:
• Data cleansing (removing duplicates, correcting errors)
• Data conversion (changing data types or formats)
• Aggregation (summarizing data)
• Applying business rules or calculations
3. Load: In the final phase, the transformed data is loaded into a target
storage system, such as a data warehouse, where it can be accessed for
reporting, analysis, and decision-making.
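
The three phases can be sketched as three small functions. This is a minimal illustration under stated assumptions: the CSV content, the `fact_orders` table, and the cleansing rules (drop unparsable amounts, drop duplicate order ids, normalize region names) are invented for the example.

```python
import csv
import io
import sqlite3

# A small CSV export standing in for a real source system (made-up data).
RAW_CSV = """order_id,region,amount
1,north,100.50
2,South,200
2,South,200
3,NORTH,abc
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: convert types, drop invalid rows, remove duplicates."""
    seen, clean = set(), []
    for row in rows:
        try:
            amount = float(row["amount"])      # data conversion
        except ValueError:
            continue                           # cleansing: invalid amount
        if row["order_id"] in seen:            # cleansing: duplicate order
            continue
        seen.add(row["order_id"])
        clean.append((int(row["order_id"]),
                      row["region"].strip().title(), amount))
    return clean

def load(rows, conn):
    """Load: write the cleaned rows into the target store."""
    conn.execute("CREATE TABLE fact_orders "
                 "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO fact_orders VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute(
    "SELECT * FROM fact_orders ORDER BY order_id").fetchall()
print(loaded)
```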

ETL characteristics

Key Characteristics of ETL:

1. Batch Processing: ETL processes are typically executed on a scheduled
basis (e.g., daily, weekly) rather than in real-time.
2. Data Movement: ETL involves physically moving data from source
systems to a centralized repository.
3. Support for Analytics: ETL processes prepare data for business
intelligence and analytics applications, enabling organizations to gain
insights from their data.

Mediators vs ETL

Aspect | ETL System | Mediator
Purpose | Extract, transform, and load data into a target data store (e.g., data warehouse). | Facilitate real-time integration and access to multiple databases in a federated system.
Data Movement | Physically moves and stores data from source systems to a target location. | Provides virtual access to data without moving it; data remains in source systems.
Processing Type | Typically batch processing, running at scheduled intervals. | Real-time access and processing of queries as they come in.
Transformations | Applies significant transformations to ensure data quality and consistency before loading. | May perform light transformations on-the-fly to facilitate query processing.
Data Storage | Data is stored in a target repository, like a data warehouse. | Data is not stored; access is provided directly to underlying databases.
Use Cases | Data warehousing, reporting, and historical analysis. | Federated querying, real-time data access, and integration across heterogeneous sources.

Data representation
How to store Data

Data in a data warehouse should be stored in a structured manner to facilitate efficient querying and
analysis.

Two abstraction levels:

1. Logical Design:
Represents the abstract structure of the data warehouse, focusing on
the relationships between different data entities and their attributes. It
defines how data will be organized and how users will interact with the
data without specifying the technical details of how the data will be
stored.
2. Physical Design:
Refers to the actual implementation details of the data warehouse,
specifying how data will be stored in a particular database
management system (DBMS). It includes the physical storage structure,
data types, indexing strategies, and performance optimization
techniques.

Hypercubes

A hypercube represents a multidimensional data structure that allows for the analysis
of data across multiple dimensions simultaneously.

• A hypercube is used to represent multidimensional data in a structured
format that enables efficient querying and analysis.
• Each dimension of the hypercube corresponds to a different attribute
or category of data (e.g., time, product, geography).
• The intersections of these dimensions (cells of the hypercube) contain
the measures or facts that can be analyzed (e.g., sales amounts,
quantities sold).
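
As a rough illustration, a small cube can be held in a dictionary whose keys are cell coordinates and whose values are the measure. The dimension names, members, and numbers below are made up for the example, and the helper `members` is hypothetical.

```python
# Each key is a cell coordinate (time, product, region); the value is the
# measure stored in that cell (here: units sold, made-up numbers).
DIMENSIONS = ("time", "product", "region")

cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q1", "Modem", "Toronto"): 80,
    ("Q2", "Mobile", "Toronto"): 150,
    ("Q1", "Mobile", "Vancouver"): 90,
}

def members(dimension):
    """List the distinct members that appear along one dimension."""
    axis = DIMENSIONS.index(dimension)
    return sorted({coords[axis] for coords in cube})

cell = cube[("Q1", "Mobile", "Toronto")]   # one intersection of dimensions
products = members("product")
print(cell, products)
```

A sparse dictionary like this only stores non-empty cells; real MOLAP engines use more elaborate array layouts.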

Hypercube

Key Characteristics of Hypercubes in OLAP

• Multidimensional Structure: A hypercube allows data to be viewed
from multiple perspectives simultaneously, facilitating complex
analysis and reporting.
• Dimensions: Each axis of the hypercube represents a different
dimension, and the number of dimensions can vary based on the data
being analyzed.
• Measures (Facts): The values stored at the intersections of the
dimensions represent the data to be analyzed, such as aggregated
sales figures or other metrics.
• Drill-down and Roll-up: Users can navigate the hypercube by drilling
down into more detailed data or rolling up to view aggregated data
across dimensions.

Logical Data Models

Definition
A logical data model represents the structure of the data without getting
into the specifics of how that data will be physically stored or
implemented. It focuses on the relationships between different data
entities and their attributes.

Purpose
Logical models are used to define data requirements, relationships, and
structures in a way that is understandable to stakeholders, including
business analysts and data architects.

Star Schema

A star schema is a type of database schema used in data warehousing and OLAP (Online
Analytical Processing) systems that organizes data into a simple, intuitive
structure. It is called a "star" schema because the diagram of its layout
resembles a star, with a central fact table connected to several dimension
tables.


A star schema consists of:

• Fact Table: The central table that contains quantitative data
(measures) related to a specific business process or event. Each row in
the fact table typically represents a transaction or event and includes
foreign keys to related dimension tables.
• Dimension Tables: Surrounding the fact table, dimension tables store
descriptive attributes (dimensions) that provide context to the data in
the fact table. Each dimension table contains attributes that describe
the data (e.g., time, product, customer).
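
A minimal sketch of such a layout, run against an in-memory SQLite database. The table and column names (`fact_sales`, `dim_product`, `dim_date`) and the sample rows are assumptions chosen for illustration, not a schema taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, month TEXT);
-- Fact table: measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    quantity   INTEGER,
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Phone', 'Electronics'), (2, 'Shirt', 'Clothing');
INSERT INTO dim_date VALUES (1, 2023, 'January'), (2, 2023, 'February');
INSERT INTO fact_sales VALUES (1, 1, 10, 1000.0), (2, 1, 20, 800.0), (1, 2, 15, 1500.0);
""")

# A typical star join: one hop from the fact table to a dimension.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```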
Characteristics of Star Schema

1. Simplicity: The star schema is straightforward and easy to understand,
making it intuitive for users to query and analyze data.
2. Denormalization: Dimension tables are often denormalized, meaning
that they may contain redundant data to simplify the schema. This
allows for faster query performance but may lead to some data
duplication.
3. Performance: Star schemas are optimized for read-heavy operations
typical in reporting and analytical applications. The simplified
structure allows for efficient querying and data retrieval.
4. Flexibility: It allows users to drill down into details or roll up to
summary data easily, making it suitable for various analytical queries.

Snowflake Schema

The snowflake model (or snowflake schema) is a type of database schema
used in data warehousing. It is an extension of the star schema but with a
more normalized structure. In a snowflake schema, the dimension tables are
organized in a way that reduces data redundancy by splitting the data into
additional tables, which resembles a snowflake shape when visualized.
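
To make the normalization concrete, the sketch below moves the product category out of the product dimension into its own table, so a category query needs one extra join. All table names and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The category attribute is split out of dim_product into its own table.
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    amount     REAL
);
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Clothing');
INSERT INTO dim_product VALUES (1, 'Phone', 1), (2, 'Modem', 1), (3, 'Shirt', 2);
INSERT INTO fact_sales VALUES (1, 1000.0), (2, 300.0), (3, 800.0);
""")

# Two joins are needed to reach the category name (fact -> product -> category).
snowflake_revenue = conn.execute("""
    SELECT c.category_name, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_product  p ON p.product_id  = f.product_id
    JOIN dim_category c ON c.category_id = p.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(snowflake_revenue)
```

Each category name is stored once, at the cost of a longer join path than in the star layout.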

Constellation Schema

The constellation schema (also known as a galaxy schema or fact
constellation) is a type of database schema commonly used in data
warehousing. It consists of multiple fact tables that share dimension tables.
This schema is suitable for complex data warehouses that need to support
multiple business processes, as it allows for the representation of multiple
stars (star schemas) in a single schema, forming a constellation-like
structure.

Hierarchy Concept

A hierarchy refers to the organization of data into levels of granularity that allow users
to navigate and analyze data from various perspectives. This concept is
particularly important for understanding OLAP operations.

Definition
• A hierarchy is a structured way of organizing data where elements are
arranged in levels from the most general (high-level) to the most
specific (low-level).
• Hierarchies often represent relationships in data, such as geographical
locations (Country → State → City) or time periods (Year → Quarter →
Month → Day).

Levels of Hierarchy:
Each level in a hierarchy represents a different level of detail. For example,
in a time hierarchy, the levels could be Year, Quarter, Month, and Day.

Purpose
• Hierarchies help users navigate large datasets more intuitively,
allowing them to analyze data at different levels of granularity.
• They enable users to perform operations such as aggregation and
summarization, making it easier to derive insights.

OLAP Operations

The essential OLAP operations
empower users to interact with and analyze multidimensional data
efficiently. They facilitate detailed exploration of data, enabling business
users and analysts to derive insights and make informed decisions based
on complex datasets. Each operation serves a distinct purpose, allowing
for a comprehensive understanding of the underlying data in a data
warehouse environment.

Slice Operation

Slice
The slice operation fixes one dimension of a multidimensional cube to a
single value, resulting in a new sub-cube that contains only the relevant data.

Example:
Consider a sales data cube with dimensions: Time, Product, and Region. If
you perform a slice on the Time dimension to focus only on "the first
quarter of one year (Q1)", the resulting cube will only contain the Q1 sales
data, for all regions and all products.
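
A slice can be sketched on a dictionary-based cube as follows. The cube contents and the helper `slice_cube` are illustrative assumptions; coordinates are ordered (time, product, region).

```python
# Cells are keyed by (time, product, region); values are made-up sales figures.
cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q1", "Modem", "Vancouver"): 80,
    ("Q2", "Mobile", "Toronto"): 150,
    ("Q3", "Modem", "Toronto"): 60,
}

def slice_cube(cube, axis, value):
    """Fix one dimension to a single value and drop it from the coordinates."""
    return {
        coords[:axis] + coords[axis + 1:]: measure
        for coords, measure in cube.items()
        if coords[axis] == value
    }

q1 = slice_cube(cube, 0, "Q1")   # axis 0 is the Time dimension
print(q1)
```

The result has one dimension fewer than the original cube: only (product, region) remain.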

Dice Operation

Dice
The dice operation produces a sub-cube by selecting multiple dimensions
and specific values within those dimensions.

Example:
Using the same sales data cube, if you want to see the sales data for
"Mobiles" or "Modems" sold in "Toronto" or in "Vancouver" during the first
quarter "Q1" or the second one "Q2", you would dice the cube to focus on
the dimensions Product, Region, and Time with those specific values. The
resulting sub-cube would contain only the relevant data for mobiles or
modems, Toronto or Vancouver, and Q1 or Q2.
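
The same selection can be sketched on a small dictionary-based cube. The cell values and the helper `dice` are assumptions for illustration; coordinates are ordered (time, product, region).

```python
# Cells are keyed by (time, product, region); values are made-up sales figures.
cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q2", "Modem", "Vancouver"): 80,
    ("Q3", "Mobile", "Toronto"): 150,     # excluded: Q3
    ("Q1", "Mobile", "Montreal"): 70,     # excluded: Montreal
    ("Q2", "Laptop", "Toronto"): 200,     # excluded: Laptop
}

def dice(cube, allowed):
    """Keep cells whose coordinate on every listed axis is in the allowed set."""
    return {
        coords: measure
        for coords, measure in cube.items()
        if all(coords[axis] in values for axis, values in allowed.items())
    }

sub_cube = dice(cube, {
    0: {"Q1", "Q2"},               # Time
    1: {"Mobile", "Modem"},        # Product
    2: {"Toronto", "Vancouver"},   # Region
})
print(sub_cube)
```

Unlike a slice, the dice keeps all three dimensions; it only shrinks the set of members on each one.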

Roll-up Operation

Roll-Up
This operation aggregates data by climbing up the hierarchy of a
dimension. It reduces the data’s detail level by summarizing it.

Example:
If you have sales data by City, rolling up would aggregate the data to the
Country level, providing total sales for each country instead of individual
cities.
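
A roll-up along a City → Country hierarchy can be sketched as a sum over a parent mapping. The mapping, the figures, and the helper `roll_up` are made up for the example.

```python
from collections import defaultdict

# One step of a geography hierarchy: each city points to its country.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}

# Detailed measures at the City level (made-up numbers).
sales_by_city = {"Toronto": 395, "Vancouver": 605, "Chicago": 440}

def roll_up(measures, parent_of):
    """Aggregate measures one level up the hierarchy by summing per parent."""
    totals = defaultdict(int)
    for member, value in measures.items():
        totals[parent_of[member]] += value
    return dict(totals)

sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
print(sales_by_country)
```

Drill-down is the inverse navigation: it returns from `sales_by_country` to the stored city-level detail.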

Drill-Down Operation

Drill-Down
This operation increases the level of detail in the data by navigating from
less detailed data to more detailed data.

Example
In a time dimension, if you are viewing sales data aggregated by Year, you
can drill down to see the data broken down by Quarter or Month.

Other OLAP Operations

Aggregate
This operation computes summary statistics for data, such as sums,
averages, counts, etc., often at different levels of granularity. For example,
in a sales data cube, you might aggregate total sales figures to calculate
the average sales per month for a specific product category.


Pivot (Rotate)
The pivot operation allows users to rotate the data axes in view, enabling
them to visualize the data from different perspectives. For example, in a
sales report showing total sales by Region (rows) and Product (columns),
you can pivot the report to show Product by Region instead. This change in
perspective may reveal different trends or insights in the data.
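
On a two-dimensional report, a pivot amounts to swapping the row and column keys. The report values and the helper `pivot` below are illustrative assumptions.

```python
# Report with Region as rows and Product category as columns (made-up values).
report = {
    "North": {"Electronics": 2200.0, "Clothing": 1700.0},
    "South": {"Electronics": 1500.0, "Clothing": 700.0},
}

def pivot(table):
    """Swap the row and column axes of a nested-dictionary report."""
    rotated = {}
    for row_key, columns in table.items():
        for col_key, value in columns.items():
            rotated.setdefault(col_key, {})[row_key] = value
    return rotated

by_product = pivot(report)   # now Product as rows, Region as columns
print(by_product)
```

The cell values are unchanged; only the orientation of the view differs, so pivoting twice returns the original report.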

Filtering
Filtering allows users to restrict the data displayed in an OLAP query based
on specific criteria. For instance, if you only want to see sales data for a
specific product category, you can apply a filter to show data only for that
category while excluding others.

Logical design vs Physical design

Logical design of a data warehouse outlines how data should be organized
and accessed based on business needs, while the physical design translates
that structure into a technical implementation, focusing on how data is
stored and optimized for performance. Both designs are crucial in the
development of a successful data warehouse, ensuring that it meets user
requirements while performing efficiently.

Logical Design | Physical Design
Focuses on the structure and relationships of data. | Focuses on the implementation and optimization of data storage.
Defines entities, attributes, and relationships. | Involves indexes, storage formats, and partitioning.
Independent of any DBMS or storage technology. | Dependent on the specific DBMS and hardware used.
Includes schema design (e.g., star or snowflake schemas). | Includes data placement, indexing, materialized views, etc.
Used to model how data is organized logically. | Used to optimize data retrieval and storage performance.

OLAP Physical Models

OLAP models or systems dictate how data are physically stored, indexed, and
accessed, thus representing physical models. Each approach offers different
trade-offs in terms of storage efficiency, query performance, and flexibility,
depending on the specific needs of the OLAP system and the type of data
being analyzed.

ROLAP
• Physical Model: Uses a relational database to store data in a star or
snowflake schema. Data are stored in tables with joins between fact
and dimension tables.
• Storage Mechanism: Since it relies on relational databases, it stores
data in rows and columns, utilizing SQL for querying.
• Characteristics: Can handle large volumes of data and is more flexible
with data updates but may have slower query performance for complex
multidimensional queries compared to other systems.


MOLAP
• Physical Model: Uses a multidimensional data storage system, often in
the form of proprietary cube structures.
• Storage Mechanism: Data is stored in multidimensional arrays
(hypercubes) where each cell represents a data point at the
intersection of dimensions.
• Characteristics: Offers fast query performance due to pre-aggregation
but can be limited by storage space and the complexity of updating
data.


HOLAP
• Physical Model: Combines both ROLAP and MOLAP, using relational
databases for detailed data storage and multidimensional cubes for
aggregated data.
• Storage Mechanism: Stores high-level aggregations in MOLAP
structures for quick access, while detailed transactional data is stored
in ROLAP tables.
• Characteristics: Provides a balance between performance and storage
capacity, enabling both detailed and aggregated queries.


SOLAP
• Physical Model: Focuses on incorporating spatial data (like
geographical information) into OLAP systems, using specialized storage
techniques to handle spatial data types and relationships.
• Storage Mechanism: Often utilizes spatial databases (e.g., PostgreSQL
with PostGIS, or Oracle Spatial) that support spatial indexing and
querying capabilities.
• Characteristics: Optimized for spatial data analysis, enabling queries
that involve geographical dimensions and spatial relationships.

Index Selection

The index selection problem in data warehousing is the challenge of
determining the optimal set of indexes to create on the data warehouse
tables to improve query performance. Indexes are crucial for enabling faster
data retrieval by allowing the database to quickly locate and access rows
based on key columns. However, indexes also consume storage space and
can slow down data loading and updates, as they need to be maintained
alongside the data.


1. Query Performance:
Indexes are primarily used to speed up data retrieval. By creating
indexes on columns frequently used in filters, joins, and aggregations,
the data warehouse can serve queries more efficiently. The goal is to
choose indexes that minimize query execution time, particularly for
complex and frequent queries in data warehouses where users expect
fast access to large datasets.
2. Storage Costs:
Indexes require additional storage space. In large-scale data
warehouses, where tables can be extremely large, storage overhead for
indexes can become significant. Index selection must balance the
benefit of faster queries against the cost of increased storage
requirements.


3. Maintenance Overhead:
Every time data is loaded, updated, or deleted, indexes must be
maintained. This can increase the time and computational resources
required for ETL (Extract, Transform, Load) processes. Choosing a large
number of indexes or indexing many columns can lead to high
maintenance costs, particularly in data warehouses with frequent
updates.
4. Workload Characteristics:
Understanding the workload is crucial to index selection. Different
types of queries (e.g., range queries, joins, or aggregations) benefit
from different types of indexes (e.g., B-trees, bitmap indexes). Index
selection involves analyzing query patterns, such as which columns are
commonly filtered, joined, or grouped, and choosing indexes that
optimize those operations.
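
The effect of one index on one query can be observed with SQLite's `EXPLAIN QUERY PLAN`. This is a small sketch: the `sales` table, the index name `idx_sales_category`, and the query are assumptions, and the exact wording of the plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, category TEXT, "
             "region TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO sales (category, region, total_amount) VALUES (?, ?, ?)",
    [("Electronics", "North", 1000.0), ("Clothing", "South", 700.0)] * 50)

# A workload query that filters on the category column.
QUERY = "SELECT SUM(total_amount) FROM sales WHERE category = 'Electronics'"

def plan(sql):
    """Return the optimizer's plan; the last column of each row is the detail text."""
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

plan_before = plan(QUERY)    # no usable index yet: full table scan
conn.execute("CREATE INDEX idx_sales_category ON sales (category)")
plan_after = plan(QUERY)     # the optimizer can now search via the index

print(plan_before)
print(plan_after)
```

The index speeds up this filter but must be maintained on every load, which is the trade-off described in points 2 and 3.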

Materialized Views

Materialized views in a data warehouse are database objects that store the
results of a query physically on disk, as opposed to a standard view that
dynamically computes results at query time. Materialized views are
particularly useful in data warehousing because they allow for faster query
performance on large datasets by pre-computing and storing frequently
accessed data.

Benefits of Materialized Views

• Improved Query Performance: Queries that would normally require
complex calculations or aggregations can instead retrieve
pre-computed results, drastically reducing query execution time.
• Efficiency for Aggregation and Join Operations: Materialized views are
ideal for queries involving costly operations like JOINs, GROUP BY, and
other aggregations. By storing these results, the database can avoid
recalculating them each time they’re needed.
• Reduced Load on Source Tables: By accessing the materialized view
rather than recalculating data from the underlying tables, the load on
those tables is reduced, which can be beneficial in data warehousing
environments with large and complex datasets.

Example

Suppose we have a sales data warehouse with tables Sales, Product, and
Date. We frequently need to query total sales revenue by product category
and month. A materialized view can pre-compute and store these results:

-- Pre-compute total revenue per product category and month.
CREATE MATERIALIZED VIEW Sales_Monthly_Revenue AS
SELECT P.Category, D.Month, SUM(S.Amount) AS Total_Revenue
FROM Sales S
NATURAL JOIN Product P
NATURAL JOIN Date D
GROUP BY P.Category, D.Month;

The Sales_Monthly_Revenue materialized view stores the total revenue
by product category and month, calculated by joining the Sales, Product, and
Date tables.
Example

Once created, you can query the materialized view as if it were a regular
table:

SELECT Category, Month, Total_Revenue
FROM Sales_Monthly_Revenue
WHERE Category = 'Electronics';

This query retrieves total monthly revenue for the 'Electronics' category,
utilizing the pre-computed results stored in the materialized view, thus
enhancing performance.
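
One consequence of storing results is that they can become stale until the view is refreshed. The sketch below illustrates this with SQLite, which has no `CREATE MATERIALIZED VIEW`; the view is emulated by a summary table rebuilt on demand. The simplified single-table schema and the function names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (category TEXT, month TEXT, amount REAL);
INSERT INTO sales VALUES
    ('Electronics', 'January', 1000.0),
    ('Electronics', 'January', 500.0),
    ('Clothing',    'January', 800.0);
""")

def refresh_monthly_revenue():
    """Recompute and store the aggregate, like refreshing a materialized view."""
    conn.executescript("""
        DROP TABLE IF EXISTS sales_monthly_revenue;
        CREATE TABLE sales_monthly_revenue AS
        SELECT category, month, SUM(amount) AS total_revenue
        FROM sales
        GROUP BY category, month;
    """)

def electronics_revenue():
    """Read the pre-computed value instead of re-aggregating the base table."""
    return conn.execute(
        "SELECT total_revenue FROM sales_monthly_revenue "
        "WHERE category = 'Electronics' AND month = 'January'").fetchone()[0]

refresh_monthly_revenue()
before = electronics_revenue()
conn.execute("INSERT INTO sales VALUES ('Electronics', 'January', 250.0)")
stale = electronics_revenue()      # stored result does not see the new row
refresh_monthly_revenue()
fresh = electronics_revenue()      # up to date again after the refresh
print(before, stale, fresh)
```

In a warehouse, the refresh is typically scheduled right after each ETL load.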

Execution Plans

The execution plan selection problem arises in query optimization within
database management systems (DBMS). When a query is executed, there can
be multiple ways (execution plans) to retrieve the required data, often
involving different join methods, indexing strategies, and access paths. The
execution plan selection problem is about choosing the most efficient plan
from these alternatives, aiming to minimize resource usage, such as CPU
time, memory, and I/O operations, to reduce query response time.

Query Languages for OLAP
SQL for ROLAP

SQL?
ROLAP systems rely on SQL to query data stored in relational databases.
SQL is used to perform operations like data retrieval, filtering, aggregation,
and joining tables. Many ROLAP tools extend SQL capabilities to handle
OLAP-specific queries, providing functionalities like cube operations (e.g.,
GROUP BY CUBE in SQL).

SQL3

SQL 3
The SQL:1999 standard, also known as SQL3, introduced several enhancements
for performing analytic queries. These extensions focus on advanced
querying capabilities, particularly for decision support and business
intelligence tasks.

Aggregation Functions

Aggregate functions are often used with the GROUP BY clause of the SELECT
statement. The GROUP BY clause splits the result-set into groups of values
and the aggregate function can be used to return a single value for each
group.

The most commonly used SQL aggregate functions are:

• MIN() - returns the smallest value within the selected column
• MAX() - returns the largest value within the selected column
• COUNT() - returns the number of rows in a set
• SUM() - returns the total sum of a numerical column
• AVG() - returns the average value of a numerical column

Aggregate functions ignore NULL values; the exception is COUNT(*), which counts every row.
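
The treatment of NULL values is easy to check on a tiny table. In this sketch (run with SQLite; the table and rows are made up), `COUNT(*)` counts every row while `COUNT(amount)` and the other aggregates, including the standard `AVG()`, skip the NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [
    ("Electronics", 1000.0),
    ("Electronics", None),      # a missing amount (NULL)
    ("Clothing", 800.0),
    ("Clothing", 600.0),
])

# One result row per category; each aggregate returns a single value per group.
rows = conn.execute("""
    SELECT category,
           COUNT(*),        -- counts rows, NULL amounts included
           COUNT(amount),   -- counts only non-NULL amounts
           MIN(amount), MAX(amount), SUM(amount), AVG(amount)
    FROM sales
    GROUP BY category
    ORDER BY category
""").fetchall()
for row in rows:
    print(row)
```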

Window Functions

Window functions are used to perform calculations on rows related to the
current row. These are particularly useful for ranking and aggregate
computations.

Syntax
SELECT column_name, ...,
       window_function(column_name) OVER (
           [PARTITION BY column_name]
           [ORDER BY column_name]
       ) AS new_column
FROM table_name;

window_function is any aggregate or ranking function.
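
A small runnable instance of this syntax, using one ranking function and one aggregate as window functions. It assumes an SQLite build with window-function support (version 3.25 or later); the table and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (category TEXT, month TEXT, total_amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Electronics", "January", 1000.0),
    ("Electronics", "February", 1500.0),
    ("Clothing", "January", 800.0),
    ("Clothing", "February", 700.0),
])

# Unlike GROUP BY, the window functions keep every input row and add
# per-partition results next to it.
ranked = conn.execute("""
    SELECT category, month, total_amount,
           RANK() OVER (PARTITION BY category
                        ORDER BY total_amount DESC) AS rank_in_category,
           SUM(total_amount) OVER (PARTITION BY category) AS category_total
    FROM sales
    ORDER BY category, rank_in_category
""").fetchall()
for row in ranked:
    print(row)
```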

Recursive Queries and CTEs

1. Recursive queries allow you to handle hierarchical data, such as
organizational structures or graphs.
2. Common Table Expressions (CTEs) allow you to break down complex queries
into more manageable parts, enhancing readability and reusability.
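
A recursive CTE can walk a parent/child hierarchy of any depth. The sketch below uses SQLite and a hypothetical `location` table; the CTE name `subtree` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE location (name TEXT PRIMARY KEY, parent TEXT)")
conn.executemany("INSERT INTO location VALUES (?, ?)", [
    ("Canada", None),
    ("Ontario", "Canada"),
    ("Toronto", "Ontario"),
    ("British Columbia", "Canada"),
    ("Vancouver", "British Columbia"),
])

# Anchor member: start at the root. Recursive member: add the children of
# every row found so far, until no new rows appear.
descendants = conn.execute("""
    WITH RECURSIVE subtree(name, depth) AS (
        SELECT name, 0 FROM location WHERE name = 'Canada'
        UNION ALL
        SELECT l.name, s.depth + 1
        FROM location l
        JOIN subtree s ON l.parent = s.name
    )
    SELECT name, depth FROM subtree ORDER BY depth, name
""").fetchall()
print(descendants)
```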

CUBE and ROLLUP

The CUBE operator generates a result set that contains all combinations of groupings across
the specified dimensions. It performs an aggregation across all
dimensions, including all possible subtotal and grand total combinations.

The ROLLUP operator generates subtotals that are a subset of those provided by CUBE, but in a
hierarchical way, rolling up from the most detailed to the least detailed
level.
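
The difference between the two operators is the set of groupings they produce. The following sketch emulates both in plain Python for two dimensions (category, year); the sample rows and helper names are illustrative assumptions, and `None` plays the role of SQL NULL in subtotal rows.

```python
from itertools import combinations

# (category, year, quantity): a small made-up sample.
ROWS = [
    ("Clothing", 2022, 18), ("Clothing", 2023, 55),
    ("Electronics", 2022, 13), ("Electronics", 2023, 37),
]

def aggregate(rows, keep):
    """Sum quantity per group; dimensions not in `keep` become None."""
    totals = {}
    for category, year, quantity in rows:
        key = (category if 0 in keep else None, year if 1 in keep else None)
        totals[key] = totals.get(key, 0) + quantity
    return totals

def rollup(rows):
    """ROLLUP(category, year): groupings (category, year), (category), ()."""
    result = {}
    for keep in [(0, 1), (0,), ()]:
        result.update(aggregate(rows, keep))
    return result

def cube(rows):
    """CUBE(category, year): one grouping per subset of the dimensions."""
    result = {}
    for size in range(2, -1, -1):
        for keep in combinations((0, 1), size):
            result.update(aggregate(rows, keep))
    return result

rollup_rows = rollup(ROWS)
cube_rows = cube(ROWS)
print(len(rollup_rows), len(cube_rows))
```

CUBE adds the per-year subtotals, such as `(None, 2023)`, that ROLLUP(category, year) leaves out.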

ROLL-Up Operation

In OLAP (Online Analytical Processing), ROLL-UP and DRILL-DOWN operations
are commonly used to navigate through different levels of data aggregation.
Let's assume you have a table sales with columns id, year, month, category,
region, quantity, and total_amount.

id | year | month | category | region | quantity | total_amount
1 | 2023 | 'January' | 'Electronics' | 'North' | 10 | 1000.00
2 | 2023 | 'January' | 'Clothing' | 'North' | 20 | 800.00
3 | 2023 | 'February' | 'Electronics' | 'South' | 15 | 1500.00
4 | 2023 | 'February' | 'Clothing' | 'South' | 10 | 700.00
5 | 2023 | 'March' | 'Electronics' | 'North' | 12 | 1200.00
6 | 2023 | 'March' | 'Clothing' | 'North' | 25 | 900.00
7 | 2022 | 'January' | 'Electronics' | 'South' | 8 | 800.00
8 | 2022 | 'February' | 'Clothing' | 'South' | 18 | 600.00
9 | 2022 | 'March' | 'Electronics' | 'North' | 5 | 500.00

SELECT category, year, SUM(quantity) AS total_quantity
FROM sales
GROUP BY ROLLUP (category, year)
ORDER BY category, year;

Results:

category | year | total_quantity
Clothing | 2022 | 18
Clothing | 2023 | 55
Clothing | NULL | 73
Electronics | 2022 | 13
Electronics | 2023 | 37
Electronics | NULL | 50
NULL | NULL | 123

A NULL year marks the subtotal for a category; the last row is the grand total.

Drill-Down

The DRILL-DOWN operation moves from higher-level aggregated data to a

more detailed level, providing more granular data.

Using the same sales table, let’s say you want to go from yearly revenue
totals down to monthly totals.


SELECT year, month, category, SUM(total_amount) AS total
FROM sales
GROUP BY year, month, category
ORDER BY year, month, category;
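
A short sketch of the drill-down on the sales table from the earlier slide, run with Python's sqlite3 module. It first aggregates by year, then adds month to the GROUP BY to move one level down; category is left out here to keep the output short.

```python
import sqlite3

SALES_ROWS = [
    (1, 2023, "January", "Electronics", "North", 10, 1000.0),
    (2, 2023, "January", "Clothing", "North", 20, 800.0),
    (3, 2023, "February", "Electronics", "South", 15, 1500.0),
    (4, 2023, "February", "Clothing", "South", 10, 700.0),
    (5, 2023, "March", "Electronics", "North", 12, 1200.0),
    (6, 2023, "March", "Clothing", "North", 25, 900.0),
    (7, 2022, "January", "Electronics", "South", 8, 800.0),
    (8, 2022, "February", "Clothing", "South", 18, 600.0),
    (9, 2022, "March", "Electronics", "North", 5, 500.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, year INTEGER, month TEXT, category TEXT,"
    " region TEXT, quantity INTEGER, total_amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)", SALES_ROWS)

# Higher level: one row per year.
yearly = conn.execute(
    "SELECT year, SUM(total_amount) FROM sales GROUP BY year ORDER BY year"
).fetchall()

# Drill down: the same measure, now split by month within each year.
monthly = conn.execute(
    "SELECT year, month, SUM(total_amount) FROM sales"
    " GROUP BY year, month ORDER BY year, month"
).fetchall()

print(yearly)
print(monthly)
```

Each yearly total equals the sum of that year's monthly rows, which is what makes the reverse roll-up possible.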

Slice

The SLICE operation involves fixing one dimension and filtering the data to
focus on specific values.

SELECT year, month, category, SUM(total_amount) AS total
FROM sales
WHERE year = 2023
  AND region = 'North'
GROUP BY year, month, category
ORDER BY month, category;

year  month    category     total
2023  January  Clothing     800.00
2023  January  Electronics  1000.00
2023  March    Clothing     900.00
2023  March    Electronics  1200.00

Dice

The DICE operation involves applying multiple filters across several
dimensions to select a more specific subset of data.

year  month     category     total_amount
2023  February  Electronics  1500.00
2023  January   Electronics  1000.00
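
The slide does not show the query behind this table. A plausible reconstruction, assuming the same filters as the Rotate query on the next slide (year, category, region, and month), is sketched below with Python's sqlite3 module; it returns the two rows listed above.

```python
import sqlite3

SALES_ROWS = [
    (1, 2023, "January", "Electronics", "North", 10, 1000.0),
    (2, 2023, "January", "Clothing", "North", 20, 800.0),
    (3, 2023, "February", "Electronics", "South", 15, 1500.0),
    (4, 2023, "February", "Clothing", "South", 10, 700.0),
    (5, 2023, "March", "Electronics", "North", 12, 1200.0),
    (6, 2023, "March", "Clothing", "North", 25, 900.0),
    (7, 2022, "January", "Electronics", "South", 8, 800.0),
    (8, 2022, "February", "Clothing", "South", 18, 600.0),
    (9, 2022, "March", "Electronics", "North", 5, 500.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, year INTEGER, month TEXT, category TEXT,"
    " region TEXT, quantity INTEGER, total_amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)", SALES_ROWS)

# Dice: restrict several dimensions at once (time, product, and region).
dice_rows = conn.execute(
    """
    SELECT year, month, category, SUM(total_amount) AS total_amount
    FROM sales
    WHERE year = 2023
      AND category = 'Electronics'
      AND region IN ('North', 'South')
      AND month IN ('January', 'February')
    GROUP BY year, month, category
    ORDER BY month, category
    """
).fetchall()

print(dice_rows)
```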

Rotate

The Rotate (pivot) operation can be achieved by manually restructuring the
query to swap rows and columns. Let us rotate the previous Dice result: the
months "January" and "February" become columns, and their totals, 1000.00
and 1500.00, appear in a single row.

SELECT SUM(CASE WHEN month = 'January' THEN total_amount
                ELSE 0 END) AS total_january,
       SUM(CASE WHEN month = 'February' THEN total_amount
                ELSE 0 END) AS total_february
FROM sales
WHERE year = 2023
  AND category = 'Electronics'
  AND region IN ('North', 'South')
  AND month IN ('January', 'February');

total_january  total_february
1000.00        1500.00
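
The same CASE-based technique extends to a full cross-tabulation. The sketch below, run with Python's sqlite3 module, puts category on the rows and the three months of 2023 on the columns; the column aliases are hypothetical.

```python
import sqlite3

SALES_ROWS = [
    (1, 2023, "January", "Electronics", "North", 10, 1000.0),
    (2, 2023, "January", "Clothing", "North", 20, 800.0),
    (3, 2023, "February", "Electronics", "South", 15, 1500.0),
    (4, 2023, "February", "Clothing", "South", 10, 700.0),
    (5, 2023, "March", "Electronics", "North", 12, 1200.0),
    (6, 2023, "March", "Clothing", "North", 25, 900.0),
    (7, 2022, "January", "Electronics", "South", 8, 800.0),
    (8, 2022, "February", "Clothing", "South", 18, 600.0),
    (9, 2022, "March", "Electronics", "North", 5, 500.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER, year INTEGER, month TEXT, category TEXT,"
    " region TEXT, quantity INTEGER, total_amount REAL)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?, ?)", SALES_ROWS)

# One conditional SUM per month turns month values into columns,
# while GROUP BY keeps category on the rows.
pivot_rows = conn.execute(
    """
    SELECT category,
           SUM(CASE WHEN month = 'January' THEN total_amount ELSE 0 END) AS jan,
           SUM(CASE WHEN month = 'February' THEN total_amount ELSE 0 END) AS feb,
           SUM(CASE WHEN month = 'March' THEN total_amount ELSE 0 END) AS mar
    FROM sales
    WHERE year = 2023
    GROUP BY category
    ORDER BY category
    """
).fetchall()

for row in pivot_rows:
    print(row)
```

The columns must be listed by hand, which is why pivoting in plain SQL is described as a manual restructuring of the query.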

Query Language for MOLAP: CQL

CQL, or Cell Query Language, is a less common query language typically
associated with certain MOLAP (Multidimensional Online Analytical
Processing) databases. Its main function is to allow direct access to
individual cells within a multidimensional data cube, rather than querying
larger data aggregations as one would in a traditional OLAP query language.

Common commands in CQL include:

• CELL: Retrieves a single cell value on the basis of specified dimensions.

• CELL UPDATE: Updates the value of a specific cell in the cube.
• CELL RANGES: Fetches a defined range of cells from the cube.

MDX

MDX (Multi-Dimensional Expressions):

MDX is the primary query language for MOLAP systems, designed
specifically to interact with multidimensional data structures. It allows
users to define and execute complex queries that involve hierarchies,
dimensions, and measures.

Cube implementation in MOLAP

In MOLAP, a cube is implemented as a multidimensional data structure
optimized for storing and retrieving aggregated data across various
dimensions.
Implementation
• Data Storage: Cubes are usually stored in a highly compressed and
multidimensional array format, allowing for efficient data retrieval.
• Pre-Aggregation: Aggregates are pre-computed for each combination
of dimensions and stored in the cube. For example, if the cube has
dimensions Product, Time, and Region, it pre-aggregates measures like
Sales for each possible combination (e.g., total sales for each product
in each region for each time period).
• Data Loading: The data from the operational database (typically a
relational database) is loaded into the MOLAP cube. During this
process, the MOLAP engine performs calculations and stores
aggregations at multiple levels to ensure quick access to both detailed
and summarized data.
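
The pre-aggregation step can be sketched in a few lines of Python. This is a toy model, not how any particular MOLAP engine stores its data: a dictionary stands in for the compressed array, the marker ALL is an assumed convention for "aggregated over this dimension", and the fact rows are invented.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # assumed marker meaning "aggregated over this dimension"

# Fact rows as (product, time, region, sales); illustrative values only.
facts = [
    ("Laptop", "2023", "North", 100.0),
    ("Laptop", "2023", "South", 50.0),
    ("Phone", "2023", "North", 70.0),
    ("Phone", "2022", "North", 30.0),
]


def build_cube(rows):
    """Pre-compute SUM(sales) for every combination of member-or-ALL."""
    cube = defaultdict(float)
    for *members, sales in rows:
        # Each dimension contributes either its member or ALL, so one
        # fact row updates 2^3 = 8 cells of the cube.
        for key in product(*[(member, ALL) for member in members]):
            cube[key] += sales
    return dict(cube)


cube = build_cube(facts)

# Queries become plain lookups; nothing is summed at query time.
print(cube[("Laptop", ALL, ALL)])     # total Laptop sales
print(cube[(ALL, "2023", "North")])   # 2023 sales in the North
print(cube[(ALL, ALL, ALL)])          # grand total
```

The trade-off is storage: the number of cells grows quickly with the number of dimensions and members, which is why real engines compress the array and may pre-compute only part of it.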
Lattices

A Lattice
is a structure that organizes the various levels of aggregation in a cube.
Each node in the lattice represents a level of data aggregation in different
dimensions.

• Levels of Aggregation: In a multidimensional cube with multiple
dimensions, lattices help represent the hierarchy of possible
aggregations (e.g., aggregating Sales by Product and Time but not by
Region).
• Hierarchical Structure: The lattice forms a hierarchy of aggregated
data, allowing the cube to store aggregations at various levels, from
the most granular (detailed data) to the most summarized (total across
all dimensions).
• Efficient querying: When a query is executed, the MOLAP engine uses
the lattice structure to find the closest preaggregated data points. This
approach minimizes computational load by avoiding the real-time
calculation of common aggregations.
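
A minimal sketch of the idea, assuming three dimensions and ignoring hierarchy levels inside each dimension: every subset of dimensions is one node of the lattice, and a query can be answered from any pre-aggregated node that contains all the dimensions it needs. The function names are hypothetical, and "closest" is approximated here by the node with the fewest dimensions.

```python
from itertools import combinations

DIMENSIONS = ("Product", "Time", "Region")


def lattice_nodes(dims):
    """Every node of the lattice: one per subset of dimensions."""
    return [
        frozenset(combo)
        for size in range(len(dims) + 1)
        for combo in combinations(dims, size)
    ]


def best_source(query_dims, materialized):
    """Materialized node with the fewest dimensions that covers the query."""
    candidates = [node for node in materialized if set(query_dims) <= node]
    return min(candidates, key=len) if candidates else None


nodes = lattice_nodes(DIMENSIONS)

# Suppose only two nodes were pre-aggregated (assumption for the example).
materialized = [frozenset(DIMENSIONS), frozenset({"Product", "Time"})]

print(len(nodes))
print(sorted(best_source({"Product"}, materialized)))
print(sorted(best_source({"Region"}, materialized)))
```

A query by Product alone is served from the (Product, Time) node rather than from the full detail, since fewer rows have to be summed.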
Basic Structure of an MDX Query

The fundamental structure of an MDX query is as follows:

SELECT
<Axis Specifications>
ON <Axis>,
<Axis Specifications>
ON <Axis>, ...
FROM <Cube Name>
WHERE <Slicer>
In this structure:

• SELECT defines the axes (e.g., columns, rows, pages, sections and
chapters) and specifies which dimensions or measures to include. Axes
can also be referenced by number, as AXIS(0), AXIS(1), ..., where
AXIS(0) is COLUMNS and AXIS(1) is ROWS. An MDX query can have up to
128 axes.
• FROM specifies the cube from which the data is retrieved.
• WHERE is an optional slicer that filters the data by specific members or
conditions.
Tuples and Sets

In MDX, sets and tuples are fundamental concepts for working with
multidimensional data, but they serve different purposes and have distinct
characteristics.

Tuples
• A tuple is an ordered collection of members, typically from different
dimensions.
• Each tuple specifies a single cell in the multidimensional cube by fixing
one member per dimension. This is like pinpointing a specific
intersection in the data.
• Tuples are surrounded by parentheses () and can contain one or more
members (one per dimension).
• A tuple with a single member is called a scalar tuple, while a tuple with
multiple members is often referred to as a multidimensional tuple.

Sets
• A set is an ordered collection of tuples or members that all share the
same dimensionality, i.e. the same dimensions in the same order.
• Sets are surrounded by curly braces {} and can include multiple tuples
or members.
• Sets are commonly used in MDX to define multiple points or ranges in
the cube, such as a set of all years in a time dimension or all products
in a product category.

Tuples and Sets: Differences

Feature                Tuples                                               Sets
Definition             A single point in a multidimensional space (a cell)  A collection of points (cells) or members
Syntax                 Parentheses ()                                       Curly braces {}
Usage                  Defines a specific cell in the cube                  Defines multiple cells or members
Dimension restriction  Contains exactly one member per dimension            Can include multiple tuples or members
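
As a loose analogy only, the distinction can be mimicked in Python: a cube is modeled as a dictionary keyed by (category, year), a tuple addresses one cell, and a list of tuples plays the role of an MDX set. The dictionary layout is an assumption and the 2022 figure is made up.

```python
# Cube cells keyed by (category, year); the values are Sales.
cells = {
    ("Beverages", "2023"): 500_000,
    ("Snacks", "2023"): 350_000,
    ("Beverages", "2022"): 420_000,  # invented value for illustration
}

# A tuple fixes one member per dimension and addresses a single cell.
beverages_2023 = ("Beverages", "2023")

# A set groups several tuples of the same dimensionality; a Python list
# stands in for it here.
both_2023 = [("Beverages", "2023"), ("Snacks", "2023")]

single_value = cells[beverages_2023]
set_values = [cells[member] for member in both_2023]

print(single_value)
print(set_values)
```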

Examples

SELECT ([Measures].[Sales]) ON COLUMNS,
       ([Product].[Category].[Beverages],
        [Time].[Year].[2023]) ON ROWS
FROM [SalesCube]

produces a single row, because the tuple on ROWS addresses one cell:

                  Sales
Beverages (2023)  500,000

To return several rows, we place a set of tuples, enclosed in curly braces,
on ROWS:

SELECT {[Measures].[Sales]} ON COLUMNS,
       {([Product].[Category].[Beverages],
         [Time].[Year].[2023]),
        ([Product].[Category].[Snacks],
         [Time].[Year].[2023])} ON ROWS
FROM [SalesCube]

produces:

                  Sales
Beverages (2023)  500,000
Snacks (2023)     350,000

Roll-up in MDX

In MDX, we can roll up by specifying a higher level in a hierarchy.

SELECT
[Time].[Month].Members ON ROWS,
[Region].[All Regions] ON COLUMNS
FROM [Sales]
WHERE ([Measures].[Total Sales])

Drill Down with MDX

To drill down in MDX, we specify a lower level in the hierarchy.

SELECT
[Time].[Day].Members ON ROWS,
[Region].[All Regions] ON COLUMNS
FROM [Sales]
WHERE ([Measures].[Total Sales])

Slice in MDX

In MDX, use the WHERE clause to filter on a specific member of a dimension.

SELECT
[Time].[Month].Members ON ROWS,
[Customer].[All Customers] ON COLUMNS
FROM [Sales]
WHERE ([Region].[North America])

Dice in MDX

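MDX allows filtering on multiple dimension members. In engines such as SSAS,
a hierarchy cannot appear both on an axis and in the WHERE clause, so here
the Time filter is applied on the ROWS axis (assuming days are the children
of [Time].[January]) and the Region filter in the slicer.
SELECT
[Time].[January].Children ON ROWS,
[Customer].[All Customers] ON COLUMNS
FROM [Sales]
WHERE ([Region].[North America])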

Pivot in MDX

In MDX, we define the dimensions for rows and columns explicitly, so a pivot
amounts to swapping the sets assigned to ON ROWS and ON COLUMNS.

Can MDX be used in ROLAP?

ROLAP systems like SQL Server (with SQL Server Analysis Services, SSAS) and
Mondrian (an OLAP server for relational databases) can use MDX even
though they are SQL-based: they act as middleware layers that expose
multidimensional cubes built on top of relational data.

Building an OLAP Cube on Top of SQL Data

• Relational tables in a SQL database are used as a data source for
building a multidimensional model, or OLAP cube.
• Dimensions (e.g., time, product, customer) and facts (measures like
sales, profit) are defined based on relational data tables.
• This structure is then pre-aggregated or indexed in ways that make it
efficient for multidimensional queries, effectively transforming SQL
data into a multidimensional model.

Translating MDX Queries to SQL-Based Data Retrieval

Although MDX operates on multidimensional structures, SQL Server SSAS
and Mondrian convert MDX queries into SQL queries on the relational
back-end to retrieve the needed data.

Translation
The engine dynamically translates MDX queries into optimized SQL queries,
enabling it to perform operations like roll-ups, drill-downs, and other
OLAP functions directly on the underlying relational database. The SQL
generation engine can translate even complex MDX expressions into SQL,
making it possible to work with large relational datasets.
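
To give a feel for what such a translation involves, here is a deliberately tiny sketch. It is not how SSAS or Mondrian work internally: the function build_sql, its parameters, and the flat sales table are all hypothetical. It maps a request of the form "measure by row level, filtered by a slicer" onto a GROUP BY query and runs it with Python's sqlite3 module.

```python
import sqlite3

# Known column names; identifiers cannot be bound as SQL parameters,
# so they are validated against these whitelists instead.
LEVELS = {"year", "month", "category", "region"}
MEASURES = {"quantity", "total_amount"}


def build_sql(row_level, measure, slicer):
    """Return (sql, params) for SUM(measure) by row_level, filtered by slicer."""
    if row_level not in LEVELS or measure not in MEASURES:
        raise ValueError("unknown level or measure")
    if not set(slicer) <= LEVELS:
        raise ValueError("unknown slicer dimension")
    sql = f"SELECT {row_level}, SUM({measure}) FROM sales"
    if slicer:
        sql += " WHERE " + " AND ".join(f"{col} = ?" for col in slicer)
    sql += f" GROUP BY {row_level} ORDER BY {row_level}"
    return sql, list(slicer.values())


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (year INTEGER, month TEXT, category TEXT,"
    " region TEXT, quantity INTEGER, total_amount REAL)"
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)",
    [
        (2023, "January", "Electronics", "North", 10, 1000.0),
        (2023, "January", "Clothing", "North", 20, 800.0),
        (2023, "February", "Electronics", "South", 15, 1500.0),
        (2023, "March", "Electronics", "North", 12, 1200.0),
        (2023, "March", "Clothing", "North", 25, 900.0),
        (2022, "March", "Electronics", "North", 5, 500.0),
    ],
)

# Roughly: months on ROWS, total_amount as the measure,
# sliced on region = North and year = 2023.
sql, params = build_sql("month", "total_amount", {"region": "North", "year": 2023})
result = conn.execute(sql, params).fetchall()
print(sql)
print(result)
```

A real engine also has to handle hierarchies, calculated members, several axes, and caching, none of which this sketch attempts.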

Optimizing for MDX Performance

These systems use several strategies to optimize MDX query performance on
relational backends:

• Aggregations: They precompute aggregations for frequently queried
dimensions and hierarchies to speed up query responses.
• Caching: Frequently accessed data is cached, allowing MDX queries to
retrieve results quickly without hitting the SQL database every time.
• Indexes and Partitions: Data partitions and indexing strategies on the
SQL side help accelerate SQL execution, especially for large datasets.
SSAS, for example, can partition data by time or other dimensions, and
Mondrian can leverage database indexes to speed up query execution.
