DSS ch2
Decision Support
Systems
Chapter 2:
Datawarehouses
Younes Guellouma
Introduction
What is Datawarehouse
Data Integration
Data representation
Multidimensional representation
Logical design of DW
OLAP Operations
Physical Design of DW
Query Languages for OLAP
SQL and ROLAP Systems
SQL and Analytical Queries
Query Language for MOLAP
Introduction
Why Business Intelligence (BI)
Wikipedia
Business intelligence (BI) consists of strategies,
methodologies, and technologies used by enterprises for
data analysis and management of business
information. Common functions of BI technologies include
reporting, online analytical processing, analytics, dashboard
development, data mining, process mining, complex event
processing, business performance management,
benchmarking, text mining, predictive analytics, and
prescriptive analytics.
Why Data Warehouse?
Needs
Internal Data Sources vs External Data Sources
Sources of Big Data
Solution
Big Data ?Vs
DataLake
Solutions
Microsoft Azure Datalake
BigData vs Datawarehouse
Datalake vs Datawarehouse
What is Datawarehouse
Operational DBMS
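Datawarehouse
Formal Definition
“A data warehouse is a subject-oriented, integrated, time-variant and
non-volatile collection of data in support of management decision-making
process.”
It means:
• Subject-oriented: data are organized around the major subjects of the
business (e.g., customers, products, sales) rather than around applications.
• Integrated: data coming from heterogeneous sources are brought into a
consistent format.
• Time-variant: data are kept with a time reference, so that history can be
analyzed.
• Non-volatile: once loaded, data are not modified by day-to-day operations;
they are read and periodically refreshed.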
Data Processing Systems
What is a DPS
A DPS is a hardware and software set-up that collects,
processes, and manages data to produce meaningful
information. It automates tasks such as data input, storage,
transformation, and output, making it easier to organize,
analyze, and use data for various purposes, such as
decision-making, reporting, or operational tasks.
OLTP vs OLAP
Criterion        | OLTP                                               | OLAP
Purpose          | Manage day-to-day transactional data               | Support data analysis and decision-making
Data Type        | Current, detailed, and short transactions          | Historical, aggregated, and multi-dimensional data
Operations       | Insert, update, delete (frequent write operations) | Complex queries, primarily read operations
Response Time    | Fast, milliseconds                                 | Slower, seconds to minutes
Users            | End-users and operational staff                    | Data analysts and business decision-makers
Database Design  | Normalized, reducing redundancy                    | Denormalized, optimized for query performance
Data Integration
Data Integration is Hard
Federated Databases
Mediator
Extract, Transform & Load
1. Extract: This phase involves collecting data from diverse sources, which
can include databases, flat files, APIs, and more. The goal is to gather
all relevant data needed for analysis.
2. Transform: During this phase, the extracted data is cleansed and
transformed to ensure its quality and suitability for analysis. This may
involve:
• Data cleansing (removing duplicates, correcting errors)
• Data conversion (changing data types or formats)
• Aggregation (summarizing data)
• Applying business rules or calculations
3. Load: In the final phase, the transformed data is loaded into a target
storage system, such as a data warehouse, where it can be accessed for
reporting, analysis, and decision-making.
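The three phases above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the raw rows, the field names (`product`, `amount`) and the in-memory SQLite target table `sales_by_product` are all hypothetical.

```python
import sqlite3

# Hypothetical raw records, standing in for rows extracted from source systems.
RAW_ROWS = [
    {"product": "Mobile", "amount": "100.0"},
    {"product": "Mobile", "amount": "100.0"},   # exact duplicate
    {"product": "Modem", "amount": "40.5"},
    {"product": "Modem", "amount": "n/a"},      # invalid amount
    {"product": "Mobile", "amount": "50.0"},
]

def extract():
    """Extract: collect raw rows from the (here, in-memory) source."""
    return list(RAW_ROWS)

def transform(rows):
    """Transform: drop duplicates and bad rows, convert types, aggregate."""
    seen, totals = set(), {}
    for row in rows:
        key = (row["product"], row["amount"])
        if key in seen:
            continue                       # cleansing: remove duplicates
        seen.add(key)
        try:
            amount = float(row["amount"])  # conversion: text -> number
        except ValueError:
            continue                       # cleansing: skip invalid values
        totals[row["product"]] = totals.get(row["product"], 0.0) + amount
    return sorted(totals.items())          # aggregation: total per product

def load(conn, rows):
    """Load: write the transformed rows into the target table."""
    conn.execute("CREATE TABLE sales_by_product (product TEXT, total REAL)")
    conn.executemany("INSERT INTO sales_by_product VALUES (?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(conn, transform(extract()))
result = conn.execute(
    "SELECT product, total FROM sales_by_product ORDER BY product").fetchall()
print(result)
```

Treating two identical rows as duplicates is a simplification chosen for the example; a real pipeline would normally rely on a business key to decide what counts as a duplicate.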
ETL characteristics
Mediators vs ETL
Data representation
1. Logical Design:
Represents the abstract structure of the data warehouse, focusing on
the relationships between different data entities and their attributes. It
defines how data will be organized and how users will interact with the
data without specifying the technical details of how the data will be
stored.
2. Physical Design:
Refers to the actual implementation details of the data warehouse,
specifying how data will be stored in a particular database
management system (DBMS). It includes the physical storage structure,
data types, indexing strategies, and performance optimization
techniques.
Hypercubes
Hypercube
A hypercube is a multidimensional data structure that allows for the analysis
of data across multiple dimensions simultaneously.
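One simple way to picture a hypercube is as a mapping from one member per dimension to a measure value. The sketch below assumes three dimensions (Time, Product, Region) and made-up sales figures; real OLAP engines use far more compact storage.

```python
# A tiny sales cube: each cell is addressed by one member per dimension.
# The dimension order (time, product, region) and all figures are illustrative.
cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q1", "Modem", "Toronto"): 30,
    ("Q1", "Mobile", "Vancouver"): 80,
    ("Q2", "Mobile", "Toronto"): 150,
    ("Q2", "Modem", "Vancouver"): 45,
}

DIMENSIONS = ("time", "product", "region")

def members(cube, dimension):
    """Return the sorted distinct members appearing on one dimension."""
    axis = DIMENSIONS.index(dimension)
    return sorted({key[axis] for key in cube})

def cell(cube, time, product, region):
    """Look up one cell; empty cells of the sparse cube read as 0."""
    return cube.get((time, product, region), 0)

print(members(cube, "time"))
print(cell(cube, "Q1", "Mobile", "Toronto"))
print(cell(cube, "Q2", "Modem", "Toronto"))
```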
Logical Data Models
Definition
A logical data model represents the structure of the data without getting
into the specifics of how that data will be physically stored or
implemented. It focuses on the relationships between different data
entities and their attributes.
Purpose
Logical models are used to define data requirements, relationships, and
structures in a way that is understandable to stakeholders, including
business analysts and data architects.
Star Schema
A star schema is a type of database schema used in data warehousing and OLAP
(Online Analytical Processing) systems that organizes data into a simple,
intuitive structure. It is called a "star" schema because the diagram of its
layout resembles a star, with a central fact table connected to several
dimension tables.
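As a rough illustration, a star schema with one fact table and two dimension tables could be declared as follows. The table and column names (`fact_sales`, `dim_product`, `dim_date`, `revenue`) are assumptions made for this sketch, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: one row per product / per date.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER, quarter TEXT);
-- Central fact table: foreign keys to the dimensions plus a numeric measure.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    revenue    REAL
);
INSERT INTO dim_product VALUES (1, 'Mobile', 'Devices'), (2, 'Modem', 'Network');
INSERT INTO dim_date    VALUES (1, 2023, 'Q1'), (2, 2023, 'Q2');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 150.0), (2, 1, 40.0);
""")

# A typical star join: the fact table joined to its dimensions, then grouped.
rows = conn.execute("""
    SELECT p.category, d.quarter, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_id = f.product_id
    JOIN dim_date d    ON d.date_id = f.date_id
    GROUP BY p.category, d.quarter
    ORDER BY p.category, d.quarter
""").fetchall()
print(rows)
```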
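Snowflake Schema
A snowflake schema is a variant of the star schema in which dimension tables
are normalized into several related tables.
Constellation Schema
A constellation schema (also called a galaxy schema) contains several fact
tables that share dimension tables.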
Hierarchy Concept
Hierarchy
A hierarchy refers to the organization of data into levels of granularity that
allow users to navigate and analyze data from various perspectives. This
concept is particularly important for understanding OLAP operations.
Definition
• A hierarchy is a structured way of organizing data where elements are
arranged in levels from the most general (high-level) to the most
specific (low-level).
• Hierarchies often represent relationships in data, such as geographical
locations (Country → State → City) or time periods (Year → Quarter →
Month → Day).
Levels of Hierarchy:
Each level in a hierarchy represents a different level of detail. For example,
in a time hierarchy, the levels could be Year, Quarter, Month, and Day.
Purpose
• Hierarchies help users navigate large datasets more intuitively,
allowing them to analyze data at different levels of granularity.
• They enable users to perform operations such as aggregation and
summarization, making it easier to derive insights.
OLAP Operations
Slice Operation
Slice
The slice operation fixes a single value on one dimension of a
multidimensional cube, resulting in a sub-cube with one dimension fewer that
contains only the relevant data.
Example:
Consider a sales data cube with dimensions: Time, Product, and Region. If
you perform a slice on the Time dimension to focus only on the first
quarter of one year (Q1), the resulting sub-cube contains only the Q1 sales
data, for all products and all regions.
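The slice on Q1 can be sketched on a small in-memory cube. The dictionary representation, the helper name `slice_cube` and the figures are assumptions made for illustration.

```python
# Cells are keyed by (time, product, region); the figures are made up.
cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q1", "Modem", "Vancouver"): 30,
    ("Q2", "Mobile", "Toronto"): 150,
    ("Q2", "Modem", "Vancouver"): 45,
}

def slice_cube(cube, axis, member):
    """Fix one dimension (by position) to a single member and drop it."""
    return {
        key[:axis] + key[axis + 1:]: value
        for key, value in cube.items()
        if key[axis] == member
    }

q1 = slice_cube(cube, 0, "Q1")  # axis 0 is Time
print(q1)
```

The result is keyed by (product, region) only: the Time dimension has disappeared because it is fixed to Q1.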
Dice Operation
Dice
The dice operation produces a sub-cube by selecting multiple dimensions
and specific values within those dimensions.
Example:
Using the same sales data cube, suppose you want to see the sales data for
"Mobiles" or "Modems" sold in "Toronto" or "Vancouver" during the first
quarter (Q1) or the second quarter (Q2). You would dice the cube on the
dimensions Product, Region, and Time with those specific values. The
resulting sub-cube contains only the data for mobiles or modems, Toronto or
Vancouver, and Q1 or Q2.
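A dice can be sketched the same way, by keeping the cells whose members belong to a chosen set on every dimension. The cube contents and the helper name `dice` are hypothetical.

```python
# Cells are keyed by (time, product, region); the figures are made up.
cube = {
    ("Q1", "Mobile", "Toronto"): 120,
    ("Q1", "Modem", "Vancouver"): 30,
    ("Q2", "Mobile", "Montreal"): 90,
    ("Q3", "Mobile", "Toronto"): 150,
    ("Q2", "Laptop", "Vancouver"): 200,
}

def dice(cube, time, product, region):
    """Keep the cells whose member on every dimension is in the given set."""
    return {
        key: value
        for key, value in cube.items()
        if key[0] in time and key[1] in product and key[2] in region
    }

sub_cube = dice(
    cube,
    time={"Q1", "Q2"},
    product={"Mobile", "Modem"},
    region={"Toronto", "Vancouver"},
)
print(sub_cube)
```

Unlike a slice, the sub-cube keeps all three dimensions; each one is merely restricted to fewer members.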
Roll-up Operation
Roll-Up
This operation aggregates data by climbing up the hierarchy of a
dimension. It reduces the data’s detail level by summarizing it.
Example:
If you have sales data by City, rolling up would aggregate the data to the
Country level, providing total sales for each country instead of individual
cities.
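A minimal sketch of this roll-up, assuming a hypothetical City → Country mapping and made-up figures:

```python
from collections import defaultdict

# Hypothetical City -> Country hierarchy and city-level sales figures.
CITY_TO_COUNTRY = {"Toronto": "Canada", "Vancouver": "Canada", "Chicago": "USA"}
sales_by_city = {"Toronto": 120, "Vancouver": 80, "Chicago": 200}

def roll_up(sales, parent_of):
    """Aggregate a measure from one hierarchy level to its parent level."""
    totals = defaultdict(int)
    for member, amount in sales.items():
        totals[parent_of[member]] += amount
    return dict(totals)

sales_by_country = roll_up(sales_by_city, CITY_TO_COUNTRY)
print(sales_by_country)
```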
Drill-Down Operation
Drill-Down
This operation increases the level of detail in the data by navigating from
less detailed data to more detailed data.
Example
In a time dimension, if you are viewing sales data aggregated by Year, you
can drill down to see the data broken down by Quarter or Month.
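Drill-down requires the data to be stored at a finer level than the one currently displayed. The sketch below assumes sales kept at month level and a helper `aggregate` (a hypothetical name) that sums over the first levels of the Year → Quarter → Month hierarchy.

```python
from collections import defaultdict

# Sales stored at the finest level (year, quarter, month); figures are made up.
sales = {
    (2023, "Q1", "Jan"): 10, (2023, "Q1", "Feb"): 20, (2023, "Q1", "Mar"): 30,
    (2023, "Q2", "Apr"): 40, (2024, "Q1", "Jan"): 50,
}

def aggregate(sales, depth):
    """Sum sales by the first `depth` levels of the time hierarchy."""
    totals = defaultdict(int)
    for key, amount in sales.items():
        totals[key[:depth]] += amount
    return dict(totals)

by_year = aggregate(sales, 1)      # coarse view
by_quarter = aggregate(sales, 2)   # drill down one level
print(by_year)
print(by_quarter)
```

Going from `by_quarter` back to `by_year` is the corresponding roll-up.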
Other OLAP Operations
Aggregate
This operation computes summary statistics for data, such as sums,
averages, counts, etc., often at different levels of granularity. For example,
in a sales data cube, you might aggregate total sales figures to calculate
the average sales per month for a specific product category.
Pivot (Rotate)
The pivot operation allows users to rotate the data axes in view, enabling
them to visualize the data from different perspectives. For example, in a
sales report showing total sales by Region (rows) and Product (columns),
you can pivot the report to show Product by Region instead. This change in
perspective may reveal different trends or insights in the data.
Filtering
Filtering allows users to restrict the data displayed in an OLAP query based
on specific criteria. For instance, if you only want to see sales data for a
specific product category, you can apply a filter to show data only for that
category while excluding others.
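The pivot described above amounts to swapping the two axes of the report. A minimal sketch on a nested dictionary, with made-up regions and figures:

```python
# Report with Region as rows and Product as columns; figures are made up.
report = {
    "North": {"Mobile": 120, "Modem": 30},
    "South": {"Mobile": 80, "Modem": 45},
}

def pivot(table):
    """Swap the row and column axes of a nested-dict report."""
    rotated = {}
    for row_label, row in table.items():
        for col_label, value in row.items():
            rotated.setdefault(col_label, {})[row_label] = value
    return rotated

by_product = pivot(report)
print(by_product)
```

The data are unchanged; only the presentation differs, and pivoting twice gives back the original report.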
Logical design vs Physical design
OLAP Physical Models
OLAP models or systems dictate how data are physically stored, indexed, and
accessed, thus representing physical models. Each approach offers different
trade-offs in terms of storage efficiency, query performance, and flexibility,
depending on the specific needs of the OLAP system and the type of data
being analyzed.
ROLAP
• Physical Model: Uses a relational database to store data in a star or
snowflake schema. Data are stored in tables with joins between fact
and dimension tables.
• Storage Mechanism: Since it relies on relational databases, it stores
data in rows and columns, utilizing SQL for querying.
• Characteristics: Can handle large volumes of data and is more flexible
with data updates but may have slower query performance for complex
multidimensional queries compared to other systems.
MOLAP
• Physical Model: Uses a multidimensional data storage system, often in
the form of proprietary cube structures.
• Storage Mechanism: Data is stored in multidimensional arrays
(hypercubes) where each cell represents a data point at the
intersection of dimensions.
• Characteristics: Offers fast query performance due to pre-aggregation
but can be limited by storage space and the complexity of updating
data.
HOLAP
• Physical Model: Combines both ROLAP and MOLAP, using relational
databases for detailed data storage and multidimensional cubes for
aggregated data.
• Storage Mechanism: Stores high-level aggregations in MOLAP
structures for quick access, while detailed transactional data is stored
in ROLAP tables.
• Characteristics: Provides a balance between performance and storage
capacity, enabling both detailed and aggregated queries.
SOLAP
• Physical Model: Focuses on incorporating spatial data (like
geographical information) into OLAP systems, using specialized storage
techniques to handle spatial data types and relationships.
• Storage Mechanism: Often utilizes spatial databases (e.g., PostgreSQL
with PostGIS, or Oracle Spatial) that support spatial indexing and
querying capabilities.
• Characteristics: Optimized for spatial data analysis, enabling queries
that involve geographical dimensions and spatial relationships.
Index Selection
1. Query Performance:
Indexes are primarily used to speed up data retrieval. By creating
indexes on columns frequently used in filters, joins, and aggregations,
the data warehouse can serve queries more efficiently. The goal is to
choose indexes that minimize query execution time, particularly for
complex and frequent queries in data warehouses where users expect
fast access to large datasets.
2. Storage Costs:
Indexes require additional storage space. In large-scale data
warehouses, where tables can be extremely large, storage overhead for
indexes can become significant. Index selection must balance the
benefit of faster queries against the cost of increased storage
requirements.
3. Maintenance Overhead:
Every time data is loaded, updated, or deleted, indexes must be
maintained. This can increase the time and computational resources
required for ETL (Extract, Transform, Load) processes. Choosing a large
number of indexes or indexing many columns can lead to high
maintenance costs, particularly in data warehouses with frequent
updates.
4. Workload Characteristics:
Understanding the workload is crucial to index selection. Different
types of queries (e.g., range queries, joins, or aggregations) benefit
from different types of indexes (e.g., B-trees, bitmap indexes). Index
selection involves analyzing query patterns, such as which columns are
commonly filtered, joined, or grouped, and choosing indexes that
optimize those operations.
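The effect of an index on a filtered query can be observed by comparing execution plans before and after creating it. The sketch below uses SQLite's `EXPLAIN QUERY PLAN`; the table `sales` and the index name `idx_sales_region` are assumptions, and the exact wording of the plan depends on the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", "Mobile", 100.0), ("South", "Modem", 40.0), ("North", "Modem", 60.0)],
)

QUERY = "SELECT SUM(revenue) FROM sales WHERE region = 'North'"

def plan(conn, sql):
    """Return the textual steps of SQLite's query plan for `sql`."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

plan_before = plan(conn, QUERY)  # no index yet: full table scan
conn.execute("CREATE INDEX idx_sales_region ON sales (region)")
plan_after = plan(conn, QUERY)   # the planner can now search the index

print(plan_before)
print(plan_after)
```

On a table this small the index brings no measurable gain; the point is only to show how a candidate index changes the plan, which is the information index selection relies on.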
Materialized Views
Materialized views in a data warehouse are database objects that store the
results of a query physically on disk, as opposed to a standard view that
dynamically computes results at query time. Materialized views are
particularly useful in data warehousing because they allow for faster query
performance on large datasets by pre-computing and storing frequently
accessed data.
Benefits of Materialized Views
Example
Suppose we have a sales data warehouse with tables Sales, Product, and
Date. We frequently need to query total sales revenue by product category
and month. A materialized view can pre-compute and store these results.
Once created, the materialized view can be queried as if it were a regular
table.
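The scenario above can be sketched as follows. SQLite has no `CREATE MATERIALIZED VIEW` statement (PostgreSQL and Oracle do), so this sketch emulates one with a summary table that is rebuilt on demand. The column names and the name `mv_sales_by_category_month` are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE "Date"  (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE Sales   (product_id INTEGER, date_id INTEGER, revenue REAL);
INSERT INTO Product VALUES (1, 'Beverages'), (2, 'Snacks');
INSERT INTO "Date"  VALUES (1, 2023, 1), (2, 2023, 2);
INSERT INTO Sales   VALUES (1, 1, 500.0), (1, 1, 250.0), (2, 1, 100.0), (2, 2, 300.0);
""")

def refresh_sales_summary(conn):
    """Recompute the stored result, like a complete refresh of a materialized view."""
    conn.executescript("""
    DROP TABLE IF EXISTS mv_sales_by_category_month;
    CREATE TABLE mv_sales_by_category_month AS
        SELECT p.category, d.year, d.month, SUM(s.revenue) AS total_revenue
        FROM Sales s
        JOIN Product p ON p.product_id = s.product_id
        JOIN "Date" d  ON d.date_id = s.date_id
        GROUP BY p.category, d.year, d.month;
    """)

refresh_sales_summary(conn)

# The stored result is queried like a regular table, without re-running the joins.
rows = conn.execute("""
    SELECT category, month, total_revenue
    FROM mv_sales_by_category_month
    WHERE year = 2023
    ORDER BY category, month
""").fetchall()
print(rows)
```

The stored result becomes stale when Sales changes, which is why a refresh step is needed after each load.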
Execution Plans
Query Languages for OLAP
SQL for ROLAP
SQL?
ROLAP systems rely on SQL to query data stored in relational databases.
SQL is used to perform operations like data retrieval, filtering, aggregation,
and joining tables. Many ROLAP tools extend SQL capabilities to handle
OLAP-specific queries, providing functionalities like cube operations (e.g.,
GROUP BY CUBE in SQL).
SQL3
The SQL:1999 standard, also known as SQL3, introduced several enhancements
for performing analytic queries. These extensions focus on advanced
querying capabilities, particularly for decision support and business
intelligence tasks.
Aggregation Functions
Aggregate functions are often used with the GROUP BY clause of the SELECT
statement. The GROUP BY clause splits the result-set into groups of values
and the aggregate function can be used to return a single value for each
group.
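A short runnable illustration, assuming a hypothetical `sales` table with a `region` column and a `revenue` measure:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 100.0), ("North", 60.0), ("South", 40.0)],
)

# One output row per region; each aggregate collapses a group to a single value.
rows = conn.execute("""
    SELECT region, COUNT(*), SUM(revenue), AVG(revenue), MAX(revenue)
    FROM sales
    GROUP BY region
    ORDER BY region
""").fetchall()
print(rows)
```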
Window Functions
Syntax
SELECT column_name,...,
window_function(column_name) OVER([PARTITION BY
column_name] [ORDER BY column_name]) AS new_column
FROM table_name;
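As one possible instance of this syntax, the sketch below computes a running total of revenue per region. The `sales` table and its columns are assumptions; the example needs SQLite 3.25 or later, which recent Python distributions bundle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # window functions need SQLite 3.25 or later
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("North", 1, 100.0), ("North", 2, 60.0), ("South", 1, 40.0), ("South", 2, 70.0)],
)

# Unlike GROUP BY, every input row is kept; the window adds a running total
# computed per region (PARTITION BY) in month order (ORDER BY).
rows = conn.execute("""
    SELECT region, month, revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY month) AS running_total
    FROM sales
    ORDER BY region, month
""").fetchall()
print(rows)
```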
Recursive Queries and CTEs
CUBE and ROLLUP
ROLLUP
ROLLUP generates subtotals that are a subset of those provided by CUBE, but
in a hierarchical way, rolling up from the most detailed level to the least
detailed one (the grand total).
ROLL-Up Operation
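In systems that support it, this is written `GROUP BY ROLLUP(year, quarter)`. SQLite does not implement ROLLUP, so the sketch below spells out by hand the grouping sets it would produce: (year, quarter), (year), and the grand total. The `sales` table and its contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, quarter TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(2023, "Q1", 100.0), (2023, "Q2", 150.0), (2024, "Q1", 120.0)],
)

# Equivalent of GROUP BY ROLLUP(year, quarter), written out by hand:
# grouping sets (year, quarter), (year) and () combined with UNION ALL.
rows = conn.execute("""
    SELECT year, quarter, SUM(revenue) FROM sales GROUP BY year, quarter
    UNION ALL
    SELECT year, NULL, SUM(revenue) FROM sales GROUP BY year
    UNION ALL
    SELECT NULL, NULL, SUM(revenue) FROM sales
""").fetchall()
for row in rows:
    print(row)
```

NULL marks a rolled-up column. CUBE would additionally produce the subtotals by quarter alone, which ROLLUP omits.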
Drill-Down
Using the same sales table, let’s say you want to go from yearly revenue
totals down to monthly totals.
Slice
The slice operation fixes one dimension to a specific value and filters the
data accordingly; in SQL this corresponds to a condition in the WHERE clause.
Dice
Rotate
Query Language for MOLAP: CQL
MDX
Cube implementation in MOLAP
Lattice
A lattice is a structure that organizes the various levels of aggregation in a
cube. Each node in the lattice represents a level of data aggregation over a
given combination of dimensions.
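For a cube with n dimensions and, as a simplifying assumption, a single level per dimension, the lattice has 2^n nodes: one per subset of dimensions kept in the group-by. The sketch below enumerates them; the function names are hypothetical.

```python
from itertools import combinations

def cuboid_lattice(dimensions):
    """List every group-by (cuboid) of the cube, from most to least detailed."""
    nodes = []
    for size in range(len(dimensions), -1, -1):
        nodes.extend(combinations(dimensions, size))
    return nodes

def parents(node, dimensions):
    """Cuboids one level more detailed, from which `node` can be aggregated."""
    return [tuple(d for d in dimensions if d in node or d == extra)
            for extra in dimensions if extra not in node]

DIMS = ("Time", "Product", "Region")
lattice = cuboid_lattice(DIMS)
print(len(lattice))
print(parents(("Time",), DIMS))
```

The first node is the base cuboid (all dimensions) and the last one, the empty tuple, is the grand total. Dimension hierarchies would add further nodes to the lattice.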
Basic Structure of an MDX Query
SELECT
<Axis Specifications>
ON <Axis>,
<Axis Specifications>
ON <Axis>, ...
FROM <Cube Name>
WHERE <Slicer>
In this structure:
• SELECT defines the axes (e.g., columns, rows, pages, sections and
chapters) and specifies which dimensions or measures to include. Axes
can be referred to by name or by number: AXIS(0), AXIS(1), ... An MDX
query can have up to 128 axes.
• FROM specifies the cube from which the data is retrieved.
• WHERE is an optional slicer that filters the data by specific members or
conditions.
Tuples and Sets
In MDX, sets and tuples are fundamental concepts for working with
multidimensional data, but they serve different purposes and have distinct
characteristics.
Tuples
• A tuple is an ordered collection of members, typically from different
dimensions.
• Each tuple specifies a single cell in the multidimensional cube by fixing
one member per dimension. This is like pinpointing a specific
intersection in the data.
• Tuples are surrounded by parentheses () and can contain one or more
members (one per dimension).
• A tuple with a single member is called a scalar tuple, while a tuple with
multiple members is often referred to as a multidimensional tuple.
Sets
• A set is an ordered collection of tuples or members that share the same
dimensionality, i.e. that come from the same dimensions.
• Sets are surrounded by curly braces {} and can include multiple tuples
or members.
• Sets are commonly used in MDX to define multiple points or ranges in
the cube, such as a set of all years in a time dimension or all products
in a product category.
Tuples and Sets: Differences
Examples
                    Sales
Beverages (2023)    500,000
Snacks (2023)       350,000
Roll-up in MDX
SELECT
[Time].[Month].Members ON ROWS,
[Region].[All Regions] ON COLUMNS
FROM [Sales]
WHERE ([Measures].[Total Sales])
Drill Down with MDX
Slice in MDX
SELECT
[Time].[Month].Members ON ROWS,
[Customer].[All Customers] ON COLUMNS
FROM [Sales]
WHERE ([Region].[North America])
Dice in MDX
Pivot in MDX
Can MDX be used in ROLAP?
ROLAP systems such as SQL Server (with SQL Server Analysis Services, SSAS) and
Mondrian (an OLAP server for relational databases) can use MDX even
though they are SQL-based: they act as middleware layers that expose
multidimensional cubes built on top of relational data.
Building an OLAP Cube on Top of SQL Data
Translating MDX Queries to SQL-Based Data Retrieval
Translation
The OLAP engine dynamically translates MDX queries into optimized SQL queries,
enabling it to perform operations like roll-ups, drill-downs, and other OLAP
functions directly on the underlying relational database. Its SQL generation
engine can translate even complex MDX expressions into SQL, making it possible
to work with large relational datasets.
Optimizing for MDX Performance