
2nd year Master, University Amar Télidji Laghouat

Decision Support
Systems
Chapter 2:
Datawarehouses
Younes Guellouma
Chapter 2: Datawarehouse
Concepts

Younes Guellouma

1
Introduction
What is a Datawarehouse
Data Integration
Data representation
Multidimensional representation
Logical design of DW
OLAP Operations
Physical Design of DW
Query Languages for OLAP
SQL and ROLAP Systems
SQL and Analytical Queries
Query Language for MOLAP

2
Introduction
Why Business Intelligence (BI)?

Wikipedia
Business intelligence (BI) consists of strategies,
methodologies, and technologies used by enterprises for
data analysis and management of business
information. Common functions of BI technologies include
reporting, online analytical processing, analytics, dashboard
development, data mining, process mining, complex event
processing, business performance management,
benchmarking, text mining, predictive analytics, and
prescriptive analytics.

3
Why Business Intelligence (BI)?

4
Why Data Warehouse?

• Data collected from various sources and stored in various databases cannot be directly visualized.
• The data first needs to be integrated and then processed before visualization takes place.

5
Needs

6
Internal Data Sources vs External Data Sources

7
Sources of Big data

8
Solution

1. I can’t find the data I need:
   • The data is scattered across the network.
   • Multiple versions, large differences.
2. I can’t retrieve the data I need.
3. I can’t understand the retrieved data:
   • The data is poorly documented.
4. I can’t use my data:
   • The results are unexpected.
   • The results need to be transformed into another format.

9
The Vs of Big Data

10
DataLake

• Recommended for external data


• On-the-fly backup of data deemed interesting.
• Data in their original form (raw data).
• Fast access to data, NoSQL-type queries.

Solutions
Microsoft Azure Datalake

11
Big Data vs Data Warehouse

Big Data | Data Warehouse
Large volumes of data on which technologies can be applied | Historical data of a company
Storage technology | Data organization architecture
Structured, semi-structured, and unstructured data | Only structured data
Uses a distributed file system (HDFS) | Does not use a distributed file system for processing
Based on NoSQL query systems | Uses SQL queries to search relational databases

12
Datalake vs Datawarehouse

 | Data Lake | Data Warehouse
Data Structure | Raw | Processed
Usage | Store data before determining usage | Store data related to the subject in current use
Users | Data scientists | Business professionals
Accessibility | Fast | More complex

13
What is a Datawarehouse
Operational DBMS

• They consist of tables having attributes and are populated by tuples.
• They generally use the E-R data model.
• They are used to store transactional data.
• The information content is generally recent.
• These are thus called OLTP systems.
• Their goals are data accuracy and consistency, concurrency, recoverability, and reliability (ACID properties).

14
Datawarehouse

Formal Definition
“A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.”

15
Datawarehouse

It means:

• Subject-Oriented: The stored data target specific subjects.
  Example: It may store data regarding total sales, number of customers, etc., and not general data on everyday operations.
• Integrated: Data may be distributed across heterogeneous sources which have to be integrated.
  Example: Sales data may be in an RDB, while customer information is on flat files, etc.
• Time Variant: Data stored may not be current but vary with time and include a time element.
  Example: Data of sales over the last 5 years, etc.
• Non-Volatile: It is separate from the Enterprise Operational Database and is not subject to frequent modification. It generally has only two operations performed on it: loading of data and access of data.

15
Data Processing Systems

What is a DPS
A DPS is a hardware and software set-up that collects, processes, and manages data to produce meaningful information. It automates tasks such as data input, storage, transformation, and output, making it easier to organize, analyze, and use data for various purposes, such as decision-making, reporting, or operational tasks.

Two main types of DPS


1. Online Transaction Processing (OLTP),
2. Online Analytical Processing (OLAP).

16
OLTP

OLTP systems are designed to manage and facilitate high-volume transactional data. They support day-to-day operations, such as order entry, inventory management, and financial transactions. OLTP systems focus on:
• Transactional Data: Handling detailed, current data used
in everyday transactions.
• Efficiency and Speed: Quick processing of multiple
concurrent transactions with minimal latency.
• Data Integrity and Accuracy: Ensuring data consistency
and correctness through ACID (Atomicity, Consistency,
Isolation, Durability) properties.
Example Use Cases: Banking systems, e-commerce websites,
and CRM (Customer Relationship Management) systems.
17
OLAP

OLAP systems are designed for complex queries and data analysis to support decision-making processes. They are optimized for read-heavy operations on large datasets, often aggregating data over time. OLAP systems focus on:
• Analytical Data: Aggregating historical data to identify
trends and perform complex calculations.
• Data Aggregation: Enabling multi-dimensional views of
data, such as analyzing sales across different regions and
time periods.
• Query Optimization: Optimized for complex queries that
can take longer to process, making them suitable for
in-depth analysis.
Example Use Cases: Data warehouses, business intelligence
systems, and reporting platforms.

18
OLTP vs OLAP

 | OLTP | OLAP
Purpose | Manage day-to-day transactional data | Support data analysis and decision-making
Data Type | Current, detailed, and short transactions | Historical, aggregated, and multi-dimensional data
Operations | Insert, update, delete (frequent write operations) | Complex queries, primarily read operations
Response Time | Fast, milliseconds | Slower, seconds to minutes
Users | End-users and operational staff | Data analysts and business decision-makers
Database Design | Normalized, reducing redundancy | Denormalized, optimized for query performance

19
Data Integration
Data Integration is Hard

• Data warehouses combine data from multiple sources.
• Data must be translated into a consistent format.
• Data integration represents 80% of the effort in a typical data warehouse project!
• Some reasons why it’s hard:
  • Metadata is often poor or non-existent.
  • Data quality is often bad.
  • Inconsistent semantics.

20
Federated Databases

1. An alternative to data warehouses.
2. Data warehouse: create a copy of all the data using ETL, then execute queries against the copy.
3. Federated database: pull data from the sources as needed to answer queries, using mediators.

21
Federated Databases

21
Mediator

Mediators play a crucial role in federated database systems by acting as intermediaries that facilitate communication and data integration among the various databases. Their primary functions include:

• Query Processing: Mediators receive queries from users and transform them into appropriate queries for the individual databases. They handle the complexities of routing queries to the correct databases and merging the results.
• Data Integration: Mediators aggregate and harmonize data from
multiple sources, presenting it in a unified format to users. This can
involve resolving data discrepancies and applying necessary
transformations.
• Access Control: They manage permissions and security, ensuring that
users have appropriate access to the data stored in the federated
databases.

22
Extract, Transform & Load

ETL (Extract, Transform, Load)


is a data integration process used to gather data from various sources,
transform it into a suitable format for analysis, and load it into a target
data storage system, typically a data warehouse or data lake.

23
Extract, Transform & Load

1. Extract: This phase involves collecting data from diverse sources, which
can include databases, flat files, APIs, and more. The goal is to gather
all relevant data needed for analysis.
2. Transform: During this phase, the extracted data is cleansed and
transformed to ensure its quality and suitability for analysis. This may
involve:
• Data cleansing (removing duplicates, correcting errors)
• Data conversion (changing data types or formats)
• Aggregation (summarizing data)
• Applying business rules or calculations
3. Load: In the final phase, the transformed data is loaded into a target
storage system, such as a data warehouse, where it can be accessed for
reporting, analysis, and decision-making.
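
To make the three phases concrete, here is a minimal SQL sketch, assuming a hypothetical staging table stg_sales (the landing area filled by the extract step) and a hypothetical warehouse table fact_sales; the names and columns are illustrative, not from the slides.

-- Transform + Load: cleanse the extracted staging rows and insert them
-- into the warehouse fact table.
INSERT INTO fact_sales (sale_date, product_id, region, amount)
SELECT DISTINCT                            -- data cleansing: remove duplicates
       CAST(s.sale_date AS DATE),          -- data conversion: unify the date type
       s.product_id,
       UPPER(TRIM(s.region)),              -- data cleansing: normalize region names
       s.amount
FROM stg_sales s
WHERE s.amount IS NOT NULL;                -- business rule: drop incomplete records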

23
ETL characteristics

Key Characteristics of ETL:


1. Batch Processing: ETL processes are typically executed on a scheduled
basis (e.g., daily, weekly) rather than in real-time.
2. Data Movement: ETL involves physically moving data from source
systems to a centralized repository.
3. Support for Analytics: ETL processes prepare data for business
intelligence and analytics applications, enabling organizations to gain
insights from their data.

24
Mediators Vs ETL

Aspect | ETL System | Mediator
Purpose | Extract, transform, and load data into a target data store (e.g., a data warehouse). | Facilitate real-time integration and access to multiple databases in a federated system.
Data Movement | Physically moves and stores data from source systems to a target location. | Provides virtual access to data without moving it; data remains in source systems.
Processing Type | Typically batch processing, running at scheduled intervals. | Real-time access and processing of queries as they come in.
Transformations | Applies significant transformations to ensure data quality and consistency before loading. | May perform light transformations on-the-fly to facilitate query processing.
Data Storage | Data is stored in a target repository, like a data warehouse. | Data is not stored; access is provided directly to underlying databases.
Use Cases | Data warehousing, reporting, and historical analysis. | Federated querying, real-time data access, and integration across heterogeneous sources.

25
Data representation
How to store Data

Data in a Datawarehouse should be stored in a structured manner to facilitate efficient querying and analysis.

Two abstraction levels:

1. Logical Design:
Represents the abstract structure of the data warehouse, focusing on
the relationships between different data entities and their attributes. It
defines how data will be organized and how users will interact with the
data without specifying the technical details of how the data will be
stored.
2. Physical Design:
Refers to the actual implementation details of the data warehouse,
specifying how data will be stored in a particular database
management system (DBMS). It includes the physical storage structure,
data types, indexing strategies, and performance optimization
techniques.

26
Hypercubes

Hypercube
represents a multidimensional data structure that allows for the analysis
of data across multiple dimensions simultaneously.

• A hypercube is used to represent multidimensional data in a structured format that enables efficient querying and analysis.
• Each dimension of the hypercube corresponds to a different attribute
or category of data (e.g., time, product, geography).
• The intersections of these dimensions (cells of the hypercube) contain
the measures or facts that can be analyzed (e.g., sales amounts,
quantities sold).

27
Hypercube

Key Characteristics of Hypercubes in OLAP


• Multidimensional Structure: A hypercube allows data to be viewed
from multiple perspectives simultaneously, facilitating complex
analysis and reporting.
• Dimensions: Each axis of the hypercube represents a different
dimension, and the number of dimensions can vary based on the data
being analyzed.
• Measures (Facts): The values stored at the intersections of the
dimensions represent the data to be analyzed, such as aggregated
sales figures or other metrics.
• Drill-down and Roll-up: Users can navigate the hypercube by drilling
down into more detailed data or rolling up to view aggregated data
across dimensions.

28
Hypercube

29
Logical Data Models

Definition
A logical data model represents the structure of the data without getting
into the specifics of how that data will be physically stored or
implemented. It focuses on the relationships between different data
entities and their attributes.

Purpose
Logical models are used to define data requirements, relationships, and
structures in a way that is understandable to stakeholders, including
business analysts and data architects.

30
Star Schema

Star Schema
is a type of database schema used in data warehousing and OLAP (Online
Analytical Processing) systems that organizes data into a simple, intuitive
structure. It is called a ”star” schema because the diagram of its layout
resembles a star, with a central fact table connected to several dimension
tables.

A star schema consists of:


• Fact Table: The central table that contains quantitative data
(measures) related to a specific business process or event. Each row in
the fact table typically represents a transaction or event and includes
foreign keys to related dimension tables.
• Dimension Tables: Surrounding the fact table, dimension tables store
descriptive attributes (dimensions) that provide context to the data in
the fact table. Each dimension table contains attributes that describe
the data (e.g., time, product, customer).
31
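
To make the layout concrete, here is a minimal DDL sketch of a star schema; the table and column names (fact_sales, dim_time, dim_product) are illustrative assumptions, not from the slides.

-- Dimension tables hold descriptive attributes.
CREATE TABLE dim_time (
    time_id  INT PRIMARY KEY,
    day      INT,
    month    INT,
    quarter  INT,
    year     INT
);
CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    name       VARCHAR(100),
    category   VARCHAR(50)
);
-- The central fact table holds measures plus foreign keys to the dimensions.
CREATE TABLE fact_sales (
    time_id    INT REFERENCES dim_time(time_id),
    product_id INT REFERENCES dim_product(product_id),
    quantity   INT,            -- measure
    amount     DECIMAL(10,2)   -- measure
);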
Characteristics of Star Schema

1. Simplicity: The star schema is straightforward and easy to understand, making it intuitive for users to query and analyze data.
2. Denormalization: Dimension tables are often denormalized, meaning that they may contain redundant data to simplify the schema. This allows for faster query performance but may lead to some data duplication.
3. Performance: Star schemas are optimized for read-heavy operations typical in reporting and analytical applications. The simplified structure allows for efficient querying and data retrieval.
4. Flexibility: It allows users to drill down into details or roll up to summary data easily, making it suitable for various analytical queries.
32
Example

33
Snowflake Schema

The snowflake model (or snowflake schema) is a type of database schema used in data warehousing. It is an extension of the star schema but with a more normalized structure. In a snowflake schema, the dimension tables are organized in a way that reduces data redundancy by splitting the data into additional tables, which resembles a snowflake shape when visualized.
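
Continuing the illustrative DDL from the star schema sketch above, normalizing the product dimension produces the snowflake shape: the category attribute moves into its own table (again, all names are hypothetical).

-- Snowflake: dim_product no longer stores the category text, only a key to it.
CREATE TABLE dim_category (
    category_id INT PRIMARY KEY,
    category    VARCHAR(50)
);
CREATE TABLE dim_product_sf (
    product_id  INT PRIMARY KEY,
    name        VARCHAR(100),
    category_id INT REFERENCES dim_category(category_id)  -- normalized link
);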

34
Example

35
Constellation Schema

The constellation schema (also known as a galaxy schema or fact constellation) is a type of database schema commonly used in data warehousing. It consists of multiple fact tables that share dimension tables. This schema is suitable for complex data warehouses that need to support multiple business processes, as it allows for the representation of multiple stars (star schemas) in a single schema, forming a constellation-like structure.

36
Example

37
Hierarchy Concept

Hierarchy
refers to the organization of data into levels of granularity that allow users
to navigate and analyze data from various perspectives. This concept is
particularly important to understand OLAP operations.

Definition
• A hierarchy is a structured way of organizing data where elements are
arranged in levels from the most general (high-level) to the most
specific (low-level).
• Hierarchies often represent relationships in data, such as geographical
locations (Country → State → City) or time periods (Year → Quarter →
Month → Day).

38
Hierarchy Concept

Levels of Hierarchy:
Each level in a hierarchy represents a different level of detail. For example,
in a time hierarchy, the levels could be Year, Quarter, Month, and Day.

Purpose
• Hierarchies help users navigate large datasets more intuitively,
allowing them to analyze data at different levels of granularity.
• They enable users to perform operations such as aggregation and
summarization, making it easier to derive insights.

39
Example

40
OLAP Operations

The essential OLAP operations
empower users to interact with and analyze multidimensional data efficiently. They facilitate detailed exploration of data, enabling business users and analysts to derive insights and make informed decisions based on complex datasets. Each operation serves a distinct purpose, allowing for a comprehensive understanding of the underlying data in a data warehouse environment.

41
Slice Operation

Slice
The slice operation selects a single value along one dimension of a multidimensional cube, resulting in a new sub-cube that contains only the relevant data.

Example:
Consider a sales data cube with dimensions Time, Product, and Region. If you perform a slice on the Time dimension to focus only on the first quarter of a year (Q1), the resulting cube will contain sales data for all regions and all products, but only for the Q1 time period.

42
Example

43
Dice Operation

Dice
The dice operation produces a sub-cube by selecting multiple dimensions
and specific values within those dimensions.

Example:
Using the same sales data cube, if you want to see the sales data for
”Mobiles” or ”Modems” sold in ”Toronto” or in ”Vancouver” during the first
quarter ”Q1” or the second one ”Q2”, you would dice the cube to focus on
the dimensions Product, Region, and Time with those specific values. The
resulting sub-cube would contain only the relevant data for mobiles or
modems, Toronto or Vancouver, and Q1 or Q2.

44
Example

45
Roll-up Operation

Roll-Up
This operation aggregates data by climbing up the hierarchy of a
dimension. It reduces the data’s detail level by summarizing it.

Example:
If you have sales data by City, rolling up would aggregate the data to the
Country level, providing total sales for each country instead of individual
cities.

46
Example

47
Drill-Down Operation

Drill-Down
This operation increases the level of detail in the data by navigating from
less detailed data to more detailed data.

Example
In a time dimension, if you are viewing sales data aggregated by Year, you
can drill down to see the data broken down by Quarter or Month.

48
Example

49
Other OLAP Operations

Aggregate
This operation computes summary statistics for data, such as sums,
averages, counts, etc., often at different levels of granularity. For example,
in a sales data cube, you might aggregate total sales figures to calculate
the average sales per month for a specific product category.

Pivot (Rotate)
The pivot operation allows users to rotate the data axes in view, enabling
them to visualize the data from different perspectives. For example, in a
sales report showing total sales by Region (rows) and Product (columns),
you can pivot the report to show Product by Region instead. This change in
perspective may reveal different trends or insights in the data.

Filtering
Filtering allows users to restrict the data displayed in an OLAP query based
on specific criteria. For instance, if you only want to see sales data for a
specific product category, you can apply a filter to show data only for that
category while excluding others.

50
Logical design vs Physical design

Logical design of a data warehouse outlines how data should be organized and accessed based on business needs, while the physical design translates that structure into a technical implementation, focusing on how data is stored and optimized for performance. Both designs are crucial in the development of a successful data warehouse, ensuring that it meets user requirements while performing efficiently.

51
Logical design vs Physical design

Logical Design | Physical Design
Focuses on the structure and relationships of data. | Focuses on the implementation and optimization of data storage.
Defines entities, attributes, and relationships. | Involves indexes, storage formats, and partitioning.
Independent of any DBMS or storage technology. | Dependent on the specific DBMS and hardware used.
Includes schema design (e.g., star or snowflake schemas). | Includes data placement, indexing, materialized views, etc.
Used to model how data is organized logically. | Used to optimize data retrieval and storage performance.

52
OLAP Physical Models

OLAP models or systems dictate how data are physically stored, indexed, and
accessed, thus representing physical models. Each approach offers different
trade-offs in terms of storage efficiency, query performance, and flexibility,
depending on the specific needs of the OLAP system and the type of data
being analyzed.

ROLAP
• Physical Model: Uses a relational database to store data in a star or
snowflake schema. Data are stored in tables with joins between fact
and dimension tables.
• Storage Mechanism: Since it relies on relational databases, it stores
data in rows and columns, utilizing SQL for querying.
• Characteristics: Can handle large volumes of data and is more flexible
with data updates but may have slower query performance for complex
multidimensional queries compared to other systems.

MOLAP
• Physical Model: Uses a multidimensional data storage system, often in
the form of proprietary cube structures.
• Storage Mechanism: Data is stored in multidimensional arrays
(hypercubes) where each cell represents a data point at the
intersection of dimensions.
• Characteristics: Offers fast query performance due to pre-aggregation
but can be limited by storage space and the complexity of updating
data.

HOLAP
• Physical Model: Combines both ROLAP and MOLAP, using relational
databases for detailed data storage and multidimensional cubes for
aggregated data.
• Storage Mechanism: Stores high-level aggregations in MOLAP
structures for quick access, while detailed transactional data is stored
in ROLAP tables.
• Characteristics: Provides a balance between performance and storage
capacity, enabling both detailed and aggregated queries.

SOLAP
• Physical Model: Focuses on incorporating spatial data (like
geographical information) into OLAP systems, using specialized storage
techniques to handle spatial data types and relationships.
• Storage Mechanism: Often utilizes spatial databases (e.g., PostgreSQL
with PostGIS, or Oracle Spatial) that support spatial indexing and
querying capabilities.
• Characteristics: Optimized for spatial data analysis, enabling queries
that involve geographical dimensions and spatial relationships.

53
Index Selection

The index selection problem in data warehousing is the challenge of determining the optimal set of indexes to create on the data warehouse tables to improve query performance. Indexes are crucial for enabling faster data retrieval by allowing the database to quickly locate and access rows based on key columns. However, indexes also consume storage space and can slow down data loading and updates, as they need to be maintained alongside the data.

54
Index Selection

1. Query Performance:
Indexes are primarily used to speed up data retrieval. By creating
indexes on columns frequently used in filters, joins, and aggregations,
the data warehouse can serve queries more efficiently. The goal is to
choose indexes that minimize query execution time, particularly for
complex and frequent queries in data warehouses where users expect
fast access to large datasets.
2. Storage Costs:
Indexes require additional storage space. In large-scale data
warehouses, where tables can be extremely large, storage overhead for
indexes can become significant. Index selection must balance the
benefit of faster queries against the cost of increased storage
requirements.

55
Index Selection

3. Maintenance Overhead:
Every time data is loaded, updated, or deleted, indexes must be
maintained. This can increase the time and computational resources
required for ETL (Extract, Transform, Load) processes. Choosing a large
number of indexes or indexing many columns can lead to high
maintenance costs, particularly in data warehouses with frequent
updates.
4. Workload Characteristics:
Understanding the workload is crucial to index selection. Different
types of queries (e.g., range queries, joins, or aggregations) benefit
from different types of indexes (e.g., B-trees, bitmap indexes). Index
selection involves analyzing query patterns, such as which columns are
commonly filtered, joined, or grouped, and choosing indexes that
optimize those operations.
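
As a small illustration of these trade-offs, assuming the sales table used later in this chapter, one might index only the columns that dominate the query workload; the index names are hypothetical.

-- Index frequently filtered/joined columns; each index costs storage and ETL time.
CREATE INDEX idx_sales_year     ON sales (year);            -- speeds filters on year
CREATE INDEX idx_sales_cat_year ON sales (category, year);  -- supports filter + GROUP BY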

56
Materialized Views

Materialized views in a data warehouse are database objects that store the
results of a query physically on disk, as opposed to a standard view that
dynamically computes results at query time. Materialized views are
particularly useful in data warehousing because they allow for faster query
performance on large datasets by pre-computing and storing frequently
accessed data.

57
Benefits of Materialized Views

• Improved Query Performance: Queries that would normally require complex calculations or aggregations can instead retrieve pre-computed results, drastically reducing query execution time.
• Efficiency for Aggregation and Join Operations: Materialized views are
ideal for queries involving costly operations like JOINs, GROUP BY, and
other aggregations. By storing these results, the database can avoid
recalculating them each time they’re needed.
• Reduced Load on Source Tables: By accessing the materialized view
rather than recalculating data from the underlying tables, the load on
those tables is reduced, which can be beneficial in data warehousing
environments with large and complex datasets.

58
Example

Suppose we have a sales data warehouse with tables Sales, Product, and
Date. We frequently need to query total sales revenue by product category
and month. A materialized view can pre-compute and store these results:

CREATE MATERIALIZED VIEW Sales_Monthly_Revenue AS
SELECT P.Category, D.Month, SUM(S.Amount) AS Total_Revenue
FROM Sales S
  NATURAL JOIN Product P
  NATURAL JOIN Date D
GROUP BY P.Category, D.Month;

The Sales_Monthly_Revenue materialized view stores the total revenue by product category and month, calculated by joining the Sales, Product, and Date tables.

59
Example

Once created, you can query the materialized view as if it were a regular
table:

SELECT Category, Month, Total_Revenue
FROM Sales_Monthly_Revenue
WHERE Category = 'Electronics';

This query retrieves total monthly revenue for the 'Electronics' category, utilizing the pre-computed results stored in the materialized view, thus enhancing performance.

60
Execution Plans

The execution plan selection problem arises in query optimization within database management systems (DBMS). When a query is executed, there can be multiple ways (execution plans) to retrieve the required data, often involving different join methods, indexing strategies, and access paths. The execution plan selection problem is about choosing the most efficient plan from these alternatives, aiming to minimize resource usage, such as CPU time, memory, and I/O operations, to reduce query response time.
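
Most DBMSs expose the chosen plan for inspection. As one concrete case (PostgreSQL syntax, using the sales table from later in this chapter), EXPLAIN shows which plan the optimizer selected:

EXPLAIN ANALYZE
SELECT category, SUM(total_amount)
FROM sales
WHERE year = 2023
GROUP BY category;
-- The output lists the scan type (sequential vs. index), the aggregation strategy,
-- and estimated vs. actual costs, i.e. the alternatives the optimizer weighed.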

61
Query Languages for OLAP
SQL for ROLAP

SQL?
ROLAP systems rely on SQL to query data stored in relational databases.
SQL is used to perform operations like data retrieval, filtering, aggregation,
and joining tables. Many ROLAP tools extend SQL capabilities to handle
OLAP-specific queries, providing functionalities like cube operations (e.g.,
GROUP BY CUBE in SQL).

62
SQL3

SQL 3
SQL:1999 standard, also known as SQL3, introduced several enhancements
for performing analytic queries. These extensions focus on advanced
querying capabilities, particularly for decision support and business
intelligence tasks.

63
Aggregation Functions

Aggregate functions are often used with the GROUP BY clause of the SELECT
statement. The GROUP BY clause splits the result-set into groups of values
and the aggregate function can be used to return a single value for each
group.

The most commonly used SQL aggregate functions are:

• MIN() - returns the smallest value within the selected column


• MAX() - returns the largest value within the selected column
• COUNT() - returns the number of rows in a set
• SUM() - returns the total sum of a numerical column

Aggregate functions ignore null values (except for COUNT()).
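
For instance, on the sales table used later in this chapter, the aggregate functions combine with GROUP BY as follows (a sketch):

SELECT category,
       COUNT(*)      AS nb_rows,         -- rows per group
       SUM(quantity) AS total_quantity,  -- total per group
       MIN(quantity) AS min_quantity,
       MAX(quantity) AS max_quantity
FROM sales
GROUP BY category;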

64
Window Functions

Window functions are used to perform calculations on rows related to the current row. These are particularly useful for ranking and aggregate computations.

Syntax
SELECT column_name, ...,
       window_function(column_name)
         OVER ([PARTITION BY column_name] [ORDER BY column_name]) AS new_column
FROM table_name;

window_function is any aggregate or ranking function.
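
As an illustration on the sales table introduced later in this chapter, a ranking window function can order months by revenue within each category without collapsing the rows (a sketch):

SELECT category, month,
       SUM(total_amount) AS monthly_total,
       RANK() OVER (PARTITION BY category
                    ORDER BY SUM(total_amount) DESC) AS rank_in_category
FROM sales
GROUP BY category, month;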

65
Recursive Queries and CTEs

1. Recursive queries allow you to handle hierarchical data, such as organizational structures or graphs.
2. Common Table Expressions (CTEs) allow you to break down complex queries into more manageable parts, enhancing readability and reusability. A combined sketch follows.
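
A minimal sketch of both features, assuming PostgreSQL-style WITH RECURSIVE and a hypothetical categories table in which each row references its parent:

-- Recursive CTE: walk a category hierarchy from the roots downwards.
WITH RECURSIVE category_tree AS (
    SELECT id, name, parent_id, 1 AS depth
    FROM categories
    WHERE parent_id IS NULL                        -- anchor: top-level categories
    UNION ALL
    SELECT c.id, c.name, c.parent_id, t.depth + 1
    FROM categories c
    JOIN category_tree t ON c.parent_id = t.id     -- recursive step: the children
)
SELECT * FROM category_tree;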

66
CUBE and ROLLUP

The CUBE operator
generates a result set that contains all combinations of groupings across the specified dimensions. It performs an aggregation across all dimensions, including all possible subtotal and grand total combinations.

ROLLUP
generates subtotals that are a subset of those provided by CUBE, but in a
hierarchical way, rolling up from the most detailed to the least detailed
level.
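
On the sales table of the next slide, CUBE would add the per-year subtotals that ROLLUP omits; a sketch:

SELECT category, year, SUM(quantity) AS total_quantity
FROM sales
GROUP BY CUBE (category, year)   -- (category, year), (category), (year), and grand total
ORDER BY category, year;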

67
ROLL-Up Operation

In OLAP (Online Analytical Processing), ROLL-UP and DRILL-DOWN operations are commonly used to navigate through different levels of data aggregation. Let’s assume you have a table sales with the columns shown below: id, year, month, category, region, quantity, and total_amount.

id year month category region quantity total_amount


1 2023 ’January’ ’Electronics’ ’North’ 10 1000.00
2 2023 ’January’ ’Clothing’ ’North’ 20 800.00
3 2023 ’February’ ’Electronics’ ’South’ 15 1500.00
4 2023 ’February’ ’Clothing’ ’South’ 10 700.00
5 2023 ’March’ ’Electronics’ ’North’ 12 1200.00
6 2023 ’March’ ’Clothing’ ’North’ 25 900.00
7 2022 ’January’ ’Electronics’ ’South’ 8 800.00
8 2022 ’February’ ’Clothing’ ’South’ 18 600.00
9 2022 ’March’ ’Electronics’ ’North’ 5 500.00

68
ROLL-Up Operation

SELECT category, year, SUM(quantity) AS total_quantity
FROM sales
GROUP BY ROLLUP (category, year)
ORDER BY category, year;

Results :

category    year  total_quantity
Clothing    2022  18
Clothing    2023  55
Clothing    NULL  73
Electronics 2022  13
Electronics 2023  37
Electronics NULL  50
NULL        NULL  123

The NULL rows are the ROLLUP subtotals (one per category) and the final grand total.

69
Drill-Down

The DRILL-DOWN operation moves from higher-level aggregated data to a more detailed level, providing more granular data.

Using the same sales table, let’s say you want to go from yearly revenue totals down to monthly totals.

70
Drill-Down

SELECT year, month, category, SUM(total_amount) AS total
FROM sales
GROUP BY year, month, category
ORDER BY month, category;

year month category total


2022 February Clothing 600.00
2023 February Clothing 700.00
2023 February Electronics 1500.00
2023 January Clothing 800.00
2023 January Electronics 1000.00
2022 January Electronics 800.00
2023 March Clothing 900.00
2022 March Electronics 500.00
2023 March Electronics 1200.00

71
Slice

The SLICE operation involves fixing one dimension and filtering the data to
focus on specific values.

SELECT year, month, category, SUM(total_amount) AS total
FROM sales
WHERE year = 2023
  AND region = 'North'
GROUP BY year, month, category
ORDER BY month, category;

year month category total_amount


2023 January Clothing 800.00
2023 January Electronics 1000.00
2023 March Clothing 900.00
2023 March Electronics 1200.00

72
Dice

The DICE operation involves applying multiple filters across several dimensions to select a more specific subset of data.

SELECT year, month, category, SUM(total_amount) AS total
FROM sales
WHERE year = 2023
  AND category = 'Electronics'
  AND region IN ('North', 'South')
  AND month IN ('January', 'February')
GROUP BY year, month, category
ORDER BY month, category;

year month category total_amount


2023 February Electronics 1500.00
2023 January Electronics 1000.00

73
Rotate

The rotate operation can be achieved by manually restructuring your query to swap rows and columns. Let us try to rotate the previous dice result: we want the totals of ”January” and ”February” as columns, with the values ”1000.00” and ”1500.00” on a single row.

SELECT SUM(CASE WHEN month = 'January' THEN total_amount ELSE 0 END) AS total_january,
       SUM(CASE WHEN month = 'February' THEN total_amount ELSE 0 END) AS total_february
FROM sales
WHERE year = 2023
  AND category = 'Electronics'
  AND region IN ('North', 'South')
  AND month IN ('January', 'February');

total_january total_february
1000.00       1500.00

74
Query Language for MOLAP: CQL

CQL, or Cell Query Language, is a less common query language typically associated with certain MOLAP (Multidimensional Online Analytical Processing) databases. Its main function is to allow direct access to individual cells within a multidimensional data cube, rather than querying larger data aggregations as one would in a traditional OLAP query language.

Common commands in CQL include:

• CELL: Retrieves a single cell value on the basis of specified dimensions.


• CELL UPDATE: Updates the value of a specific cell in the cube.
• CELL RANGES: Fetches a defined range of cells from the cube.

75
MDX

MDX (Multi-Dimensional Expressions):


MDX is the primary query language for MOLAP systems, designed
specifically to interact with multidimensional data structures. It allows
users to define and execute complex queries that involve hierarchies,
dimensions, and measures.

76
Cube implementation in MOLAP

In MOLAP, a cube is implemented as a multidimensional data structure optimized for storing and retrieving aggregated data across various dimensions.
Implementation
• Data Storage: Cubes are usually stored in a highly compressed and
multidimensional array format, allowing for efficient data retrieval.
• Pre-Aggregation: Aggregates are pre-computed for each combination
of dimensions and stored in the cube. For example, if the cube has
dimensions Product, Time, and Region, it pre-aggregates measures like
Sales for each possible combination (e.g., total sales for each product
in each region for each time period).
• Data Loading: The data from the operational database (typically a
relational database) is loaded into the MOLAP cube. During this
process, the MOLAP engine performs calculations and stores
aggregations at multiple levels to ensure quick access to both detailed
and summarized data.
77
Lattices

A Lattice
is a structure that organizes the various levels of aggregation in a cube.
Each node in the lattice represents a level of data aggregation in different
dimensions.

• Levels of Aggregation: In a multidimensional cube with multiple dimensions, lattices help represent the hierarchy of possible aggregations (e.g., aggregating Sales by Product and Time but not by Region).
• Hierarchical Structure: The lattice forms a hierarchy of aggregated
data, allowing the cube to store aggregations at various levels, from
the most granular (detailed data) to the most summarized (total across
all dimensions).
• Efficient querying: When a query is executed, the MOLAP engine uses
the lattice structure to find the closest preaggregated data points. This
approach minimizes computational load by avoiding the real-time
calculation of common aggregations.
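
In relational terms, each node of the lattice corresponds to one GROUP BY combination; a GROUPING SETS sketch on the sales table from the SQL section makes a few chosen nodes explicit:

-- Compute three nodes of the aggregation lattice in a single pass:
-- (category, year), (category), and the apex () holding the grand total.
SELECT category, year, SUM(quantity) AS total_quantity
FROM sales
GROUP BY GROUPING SETS ((category, year), (category), ());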
78
Lattices

79
Basic Structure of an MDX Query

The fundamental structure of an MDX query is as follows:

SELECT <Axis Specifications> ON <Axis>,
       <Axis Specifications> ON <Axis>, ...
FROM <Cube Name>
WHERE <Slicer>

In this structure:

• SELECT defines the axes (e.g., columns, rows, pages, chapters, and sections) and specifies which dimensions or measures to include. Axes can also be referred to by number, as axis(0), axis(1), .... MDX can have up to 128 axes.
• FROM specifies the cube from which the data is retrieved.
• WHERE is an optional slicer that filters the data by specific members or
conditions.
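
A minimal concrete query following this structure (the cube and member names are illustrative):

SELECT [Measures].[Total Sales] ON COLUMNS,
       [Product].[Category].Members ON ROWS
FROM [Sales]
WHERE ([Time].[Year].[2023])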
80
Tuples and Sets

In MDX, sets and tuples are fundamental concepts for working with
multidimensional data, but they serve different purposes and have distinct
characteristics.

Tuples
• A tuple is an ordered collection of members, typically from different
dimensions.
• Each tuple specifies a single cell in the multidimensional cube by fixing
one member per dimension. This is like pinpointing a specific
intersection in the data.
• Tuples are surrounded by parentheses () and can contain one or more
members (one per dimension).
• A tuple with a single member is called a scalar tuple, while a tuple with
multiple members is often referred to as a multidimensional tuple.

Sets
• A set is an unordered collection of tuples or members, typically from
the same or similar dimensions.
• Sets are surrounded by curly braces and can include multiple tuples
or members.
• Sets are commonly used in MDX to define multiple points or ranges in
the cube, such as a set of all years in a time dimension or all products
in a product category.

81
Tuples and Sets: Differences

Features | Tuples | Sets
Definition | A single point in a multidimensional space (a cell). | A collection of points (cells) or members.
Syntax | Parentheses () | Curly braces {}
Usage | Defines a specific cell in the cube. | Defines multiple cells or members.
Dimension Restriction | Contains exactly one member per dimension. | Can include multiple tuples or members.

82
Examples

SELECT ([Measures].[Sales]) ON COLUMNS,
       ([Product].[Category].[Beverages], [Time].[Year].[2023]) ON ROWS
FROM [SalesCube]

produces (a single tuple pinpoints a single cell):

                 Sales
Beverages (2023) 500,000
83
Examples

To retrieve several cells at once, for example Beverages and Snacks for 2023, we use a set of tuples:

SELECT ([Measures].[Sales]) ON COLUMNS,
       { ([Product].[Category].[Beverages], [Time].[Year].[2023]),
         ([Product].[Category].[Snacks], [Time].[Year].[2023]) } ON ROWS
FROM [SalesCube]

producing:

                 Sales
Beverages (2023) 500,000
Snacks (2023)    350,000

84
Roll-up in MDX

In MDX, we can roll up by specifying a higher level in a hierarchy.

SELECT
[Time].[Month].Members ON ROWS,
[Region].[All Regions] ON COLUMNS
FROM [Sales]
WHERE ([Measures].[Total Sales])

85
Drill Down with MDX

To drill down in MDX, we specify a lower level in the hierarchy.


SELECT
[Time].[Day].Members ON ROWS,
[Region].[All Regions] ON COLUMNS
FROM [Sales]
WHERE ([Measures].[Total Sales])

86
Slice in MDX

In MDX, use the WHERE clause to filter on a specific member of a dimension.

SELECT
[Time].[Month].Members ON ROWS,
[Customer].[All Customers] ON COLUMNS
FROM [Sales]
WHERE ([Region].[North America])

87
Dice in MDX

MDX allows filtering on multiple dimension members within the WHERE clause.
SELECT
[Time].[Day].Members ON ROWS,
[Customer].[All Customers] ON COLUMNS
FROM [Sales]
WHERE ([Region].[North America], [Time].[January])

88
Pivot in MDX

In MDX, we define the dimensions for rows and columns explicitly.
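
For instance, swapping the axes of the earlier roll-up query pivots the view, putting months on columns and regions on rows (same illustrative cube):

SELECT [Time].[Month].Members ON COLUMNS,
       [Region].[All Regions] ON ROWS
FROM [Sales]
WHERE ([Measures].[Total Sales])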

89
Can MDX be used in ROLAP?

ROLAP systems like SQL Server (with SQL Server Analysis Services, SSAS) and Mondrian (an OLAP server for relational databases) can use MDX even though they are SQL-based, by acting as middleware layers that interact with multidimensional cubes built on top of relational data.

90
Building an OLAP Cube on Top of SQL Data

• Relational tables in a SQL database are used as a data source for building a multidimensional model, or OLAP cube.
• Dimensions (e.g., time, product, customer) and facts (measures like
sales, profit) are defined based on relational data tables.
• This structure is then pre-aggregated or indexed in ways that make it
efficient for multidimensional queries, effectively transforming SQL
data into a multidimensional model.

91
Translating MDX Queries to SQL-Based Data Retrieval

Although MDX operates on multidimensional structures, SQL Server SSAS and Mondrian convert MDX queries into SQL queries on the relational back-end to retrieve the needed data.

Translation
The middleware dynamically translates MDX queries into optimized SQL queries, enabling it to perform operations like roll-ups, drill-downs, and other OLAP functions directly on the underlying relational database. The SQL generation engine can translate even complex MDX expressions into SQL, making it possible to work with large relational datasets.

92
Optimizing for MDX Performance

These systems use several strategies to optimize MDX query performance on relational backends:

• Aggregations: They precompute aggregations for frequently queried dimensions and hierarchies to speed up query responses.
• Caching: Frequently accessed data is cached, allowing MDX queries to
retrieve results quickly without hitting the SQL database every time.
• Indexes and Partitions: Data partitions and indexing strategies on the
SQL side help accelerate SQL execution, especially for large datasets.
SSAS, for example, can partition data by time or other dimensions, and
Mondrian can leverage database indexes to speed up query execution.

93