Data Warehousing and Online Analytical Processing

Data Warehousing is a technology that consolidates structured data from various sources into a centralized repository for analysis and reporting, supporting business decision-making. Key characteristics include being subject-oriented, integrated, non-volatile, and time-variant, allowing for comprehensive data analysis. The architecture involves data sources, staging areas, storage, and presentation tools, with distinctions made between OLTP and OLAP systems in terms of their purposes and data handling.

Uploaded by

Chandrani Ghosh

Basic Concepts of Data Warehousing

Data Warehousing is a technology that aggregates structured data from multiple sources
into a centralized repository for analysis and reporting. It is designed to support business
decision-making by providing a consolidated view of the organization’s data.

Key Characteristics of a Data Warehouse:

1. Subject-Oriented:
○ Explanation: The data warehouse is organized around key business
subjects such as customers, sales, products, etc., rather than individual
transactions or processes. This orientation makes it easier for businesses to
analyze data from a particular perspective.
○ Example: If a company wants to analyze customer purchasing behavior, a
data warehouse might contain data specifically related to customers, such as
demographics, purchase history, and preferences.
2. Integrated:
○ Explanation: Data from different sources (e.g., CRM, ERP, financial
systems) is combined and standardized in the data warehouse. This
integration ensures that data from various sources can be used together for
analysis.
○ Example: A company's sales data from an ERP system and customer
feedback from a CRM system might be in different formats. In the data
warehouse, this data is integrated into a consistent format, allowing for
comprehensive analysis.
3. Non-Volatile:
○ Explanation: Once data is entered into the data warehouse, it is not altered.
This ensures that historical data remains intact for long-term analysis.
○ Example: Sales data for the year 2020 will remain unchanged in the
warehouse, even if there are changes in the operational systems in 2021.
This allows for accurate trend analysis over multiple years.
4. Time-Variant:
○ Explanation: Data warehouses store historical data, enabling analysis of
changes over time. This time-variant nature allows businesses to identify
trends, patterns, and anomalies.
○ Example: A retail company can analyze monthly sales data over the past
five years to identify seasonal trends and make future sales forecasts.
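
The non-volatile and time-variant properties can be illustrated with a small sketch. The class name, its methods, and the figures below are hypothetical and chosen only for illustration; a real warehouse would use a database, not an in-memory list.

```python
from datetime import date


class AppendOnlyWarehouse:
    """Toy store illustrating non-volatile, time-variant data (hypothetical design)."""

    def __init__(self):
        self._rows = []  # rows are only ever appended, never updated or deleted

    def load(self, as_of, subject, value):
        # Every row carries the date it describes, so history is preserved.
        self._rows.append((as_of, subject, value))

    def history(self, subject):
        # Return all recorded values for a subject, oldest first.
        return sorted((d, v) for d, s, v in self._rows if s == subject)


wh = AppendOnlyWarehouse()
wh.load(date(2020, 12, 31), "sales_total", 100)  # made-up figures
wh.load(date(2021, 12, 31), "sales_total", 120)
print(wh.history("sales_total"))
```

Loading the 2021 figure adds a new row and leaves the 2020 row untouched, which is what makes trend analysis across years possible.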

Data Warehousing Architecture


The architecture of a data warehouse is designed to manage the flow of data from various
sources to end-user applications efficiently. Here's a breakdown of the components:

1. Data Sources
● Operational Databases:
○ Explanation: These databases are used for day-to-day operations and
contain detailed transaction data. Examples include systems like ERP
(Enterprise Resource Planning) or CRM (Customer Relationship
Management).
○ Example: A retail company’s POS (Point of Sale) system records daily
transactions, which serve as a data source for the warehouse.
● External Data Sources:
○ Explanation: Data can also come from external sources, such as market
research reports, financial market data, or social media feeds.
○ Example: A company might use economic indicators from a government
database to complement its sales data.

2. Data Staging Area
● ETL Process:
○ Extract:
■ Explanation: Data is collected from multiple sources, which might
have different formats and structures.
■ Example: Extracting customer data from a CRM system and sales
data from an ERP system.
○ Transform:
■ Explanation: The extracted data is cleaned and transformed into a
consistent format. This step may include removing duplicates,
correcting errors, and standardizing data types.
■ Example: Converting all date formats to a standard YYYY-MM-DD
format and ensuring consistency in naming conventions (e.g., "USA"
vs. "United States").
○ Load:
■ Explanation: The transformed data is then loaded into the data
warehouse.
■ Example: After transformation, the cleaned customer and sales data
are loaded into the customer and sales tables in the data warehouse.
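
The three ETL steps can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: the source rows, the field names, the two date formats, and the alias table are all assumptions made up for the example.

```python
from datetime import datetime

COUNTRY_ALIASES = {"United States": "USA"}  # assumed naming convention
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")     # assumed source date formats


def extract():
    # Stand-ins for rows pulled from a CRM and an ERP system (made-up data).
    crm = [{"customer": "Ann", "country": "United States", "signup": "03/15/2021"}]
    erp = [{"customer": "Ann", "country": "USA", "signup": "2021-03-15"},
           {"customer": "Bob", "country": "USA", "signup": "2021-04-01"}]
    return crm + erp


def standardize_date(text):
    # Try each known format and emit YYYY-MM-DD.
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def transform(rows):
    seen, clean = set(), []
    for row in rows:
        record = (row["customer"],
                  COUNTRY_ALIASES.get(row["country"], row["country"]),
                  standardize_date(row["signup"]))
        if record not in seen:  # drop duplicates after standardization
            seen.add(record)
            clean.append(record)
    return clean


def load(records, table):
    # Append the cleaned records to the target table.
    table.extend(records)


customer_table = []
load(transform(extract()), customer_table)
print(customer_table)
```

The two "Ann" rows only become recognizable as duplicates after the country name and date format have been standardized, which is why deduplication runs at the end of the transform step.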

3. Data Storage
● Data Warehouse Database:
○ Explanation: The central repository where integrated, time-variant, and
non-volatile data is stored. This database is designed for fast query
performance, supporting complex analytical queries.
○ Example: A retail company’s data warehouse might store sales data,
customer data, and product data, enabling comprehensive analysis across
these domains.
● Data Marts:
○ Explanation: Data marts are smaller, specialized subsets of the data
warehouse that focus on specific business areas. They allow for faster
access to relevant data for particular departments or functions.
○ Example: A marketing data mart might contain only customer
demographics and purchase history, allowing the marketing team to
perform targeted analysis.

4. Data Presentation
● OLAP (Online Analytical Processing) Tools:
○ Explanation: These tools allow users to perform complex,
multidimensional analysis of the data stored in the warehouse. They enable
slicing, dicing, drilling down, and pivoting of data, offering different
perspectives on the same data.
○ Example: A sales manager could use an OLAP tool to view sales data by
region, product, and time period, identifying trends and outliers.
● Reporting Tools:
○ Explanation: These tools generate reports that summarize the data stored
in the warehouse. They can produce both standard reports (e.g., monthly
sales reports) and ad-hoc reports tailored to specific queries.
○ Example: A financial analyst might generate a report showing quarterly
revenue growth across different product lines.

Properties of Data Warehouse Architectures

1. Separation: Analytical processing (OLAP) should be kept separate from
transactional processing (OLTP) to prevent conflicts and ensure that each system
operates efficiently.
2. Scalability: The architecture should be able to scale to handle increasing data
volumes and user demands over time.
3. Extensibility: The architecture should be flexible enough to incorporate new
technologies and operations without requiring a complete redesign.
4. Security: The architecture must include robust security measures to protect
sensitive data.
5. Administerability: The data warehouse should be easy to manage and maintain,
with tools for monitoring performance, managing storage, and ensuring data
quality.

Types of Data Warehouse Architectures

1. Single-Tier Architecture

● Explanation: This architecture minimizes the amount of data stored by creating a
virtual data warehouse, where data is processed on the fly rather than being stored
in a central repository.
● Example: In a single-tier architecture, queries are run directly against operational
data, with middleware interpreting the queries to provide a multidimensional view
of the data. However, this can impact the performance of operational systems.

2. Two-Tier Architecture
● Explanation: This architecture separates the data warehouse from the source
systems, with an ETL process used to extract, cleanse, and integrate data before
loading it into the warehouse.
● Example: In a two-tier architecture, data from multiple sources is processed in a
staging area before being stored in the data warehouse. The data warehouse serves
as the central repository, and data marts can be created for specific departments.

3. Three-Tier Architecture
● Explanation: This architecture includes a reconciled layer between the source
systems and the data warehouse. The reconciled layer standardizes data across the
enterprise, providing a consistent data model that feeds the data warehouse and
data marts.
● Example: A large enterprise might use a three-tier architecture to ensure that data
from different departments is standardized before being loaded into the data
warehouse. The reconciled layer helps manage the complexity of integrating data
from multiple sources.


Data Warehouse vs. Heterogeneous DBMS

When comparing Data Warehouses and Heterogeneous Database Management
Systems (DBMS), it's important to recognize their fundamental differences in
architecture, data processing, and intended use cases. Below is a detailed explanation of
each system, followed by a tabular comparison.

1. Traditional Heterogeneous DB Integration

Query-Driven Approach:

● Explanation: In a heterogeneous DBMS, the integration is typically done
on demand. When a user submits a query at a client site, the system uses a
meta-dictionary to translate this query into formats understandable by the various
heterogeneous databases involved. The results from these different systems are
then integrated to form a global answer set.
● Challenge: This approach can be resource-intensive, as it requires real-time
querying and integration across multiple systems, which can lead to complex
information filtering and competition for computational resources.

Example:

● Suppose a company has sales data stored in a SQL database and customer
feedback in a NoSQL database. When a user queries for a combined report on
sales and customer feedback, the system must translate and execute queries on
both databases, then combine the results.

2. Data Warehouse: Update-Driven, High Performance

Update-Driven Approach:

● Explanation: Unlike the query-driven approach, data warehouses integrate and
store information from various heterogeneous sources in advance. This data is
preprocessed and loaded into the warehouse, making it available for direct
querying and analysis without the need for real-time data integration.
● Advantage: This leads to higher performance, especially for complex queries,
since the data is already centralized and prepared for analysis.

Example:

● A retail company may consolidate its sales data from multiple regional databases
into a central data warehouse. This centralized data can be queried for
comprehensive analysis, such as monthly sales trends across different regions.
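
The contrast between the query-driven and update-driven approaches can be sketched as follows. The two regional sources and their figures are made up, and plain dictionaries stand in for real databases; the point is only where the integration work happens.

```python
# Two hypothetical regional sources (made-up monthly sales figures).
SOURCES = {
    "east": [("2024-01", 100), ("2024-02", 150)],
    "west": [("2024-01", 80), ("2024-02", 70)],
}


def query_driven_total(month):
    # Query-driven: visit every source at query time and merge the answers.
    return sum(amount for rows in SOURCES.values()
               for m, amount in rows if m == month)


def build_warehouse():
    # Update-driven: integrate once, ahead of time, into a central store.
    warehouse = {}
    for rows in SOURCES.values():
        for month, amount in rows:
            warehouse[month] = warehouse.get(month, 0) + amount
    return warehouse


WAREHOUSE = build_warehouse()


def update_driven_total(month):
    # Queries now read the prepared store without touching the sources.
    return WAREHOUSE.get(month, 0)


print(query_driven_total("2024-01"), update_driven_total("2024-01"))
```

Both functions return the same answer; the difference is that the update-driven version pays the integration cost once during loading, so each later query is a single lookup.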

3. OLTP vs. OLAP

OLTP (Online Transaction Processing):

● Explanation: OLTP systems are designed to handle day-to-day transactional
operations such as purchasing, inventory management, banking, payroll, etc. They
focus on fast query processing and maintaining data integrity in multi-access
environments.
● Example: A banking system where transactions like deposits and withdrawals are
processed instantly and reflected in the account balance.

OLAP (Online Analytical Processing):

● Explanation: OLAP systems are designed for complex queries that involve data
analysis and decision-making processes. These systems are typically used in data
warehouses where historical and consolidated data is analyzed to support strategic
decisions.
● Example: A financial analyst uses OLAP tools to analyze historical sales data to
forecast future sales trends.

4. Distinct Features: OLTP vs. OLAP

● User and System Orientation:
○ OLTP: Customer-oriented; focuses on managing current transactions and
ensuring data integrity.
○ OLAP: Market-oriented; focuses on analyzing data to derive insights and
support decision-making.
● Data Contents:
○ OLTP: Contains current, detailed data used for routine operations.
○ OLAP: Contains historical, consolidated data used for analysis.
● Database Design:
○ OLTP: Typically uses an Entity-Relationship (ER) model combined with
specific application logic.
○ OLAP: Often uses a star schema or other multidimensional modeling
techniques centered around key subjects.
● View:
○ OLTP: Provides a current, localized view of ongoing transactions.
○ OLAP: Offers an evolutionary, integrated view of data over time.
● Access Patterns:
○ OLTP: Designed for frequent updates and short, simple queries.
○ OLAP: Primarily supports complex, read-only queries that may require
extensive computation.

OLTP vs. OLAP


Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) are
two fundamental paradigms within data management systems, each serving distinct
purposes and operational requirements. Below is a detailed explanation of their
differences, followed by a tabular comparison.

1. OLTP (Online Transaction Processing)

Purpose:

● OLTP systems are designed to manage and facilitate the day-to-day operations of
an organization. These systems are optimized for transaction processing, which
involves the quick and reliable handling of short, atomic transactions.

Characteristics:

● Data Handling: OLTP systems focus on processing large volumes of simple,
routine transactions such as insertions, updates, and deletions. Each transaction is
typically small in scope, affecting only a small part of the database.
● Data Structure: Data is highly normalized to reduce redundancy and ensure
consistency. The database design usually follows the Entity-Relationship (ER)
model, which supports fast query processing and efficient storage.
● Response Time: OLTP systems prioritize fast response times, often requiring
sub-second response times to ensure smooth operation, particularly in
customer-facing applications.
● Examples:
○ Banking Systems: Transactions like deposits, withdrawals, and transfers
are processed instantly, with the database reflecting changes in account
balances immediately.
○ Retail Systems: Point-of-sale systems that process sales transactions,
update inventory levels, and generate receipts in real-time.

2. OLAP (Online Analytical Processing)

Purpose:

● OLAP systems are designed for complex data analysis and decision-making
processes. These systems enable users to query and analyze large volumes of
historical and aggregated data, helping them identify trends, patterns, and insights
that inform strategic business decisions.

Characteristics:

● Data Handling: OLAP systems handle large volumes of data that have been
aggregated and summarized from various sources. The focus is on read-heavy
operations, where users execute complex queries that may span large portions of
the database.
● Data Structure: OLAP databases often use multidimensional models like star
schemas or snowflake schemas, where data is organized around central facts and
related dimensions. This structure is optimized for fast retrieval and analysis.
● Response Time: While OLAP systems may not require the sub-second response
times of OLTP systems, they are designed to handle complex queries efficiently,
even those that involve large datasets.
● Examples:
○ Business Intelligence Tools: A company might use an OLAP system to
analyze sales performance across different regions and time periods,
helping to identify trends and forecast future sales.
○ Financial Reporting: Financial analysts might use OLAP to generate
reports that consolidate financial data from multiple departments over
several years, providing insights into the company's financial health.

3. Key Differences: OLTP vs. OLAP

● User and System Orientation:
○ OLTP: Primarily customer-oriented, focused on supporting transactional
processes essential for day-to-day operations.
○ OLAP: Market-oriented, designed to support business analysts and
decision-makers by providing insights into data over time.
● Data Contents:
○ OLTP: Contains current, detailed data essential for running the operational
aspects of a business.
○ OLAP: Contains historical, consolidated data that supports analytical tasks
and strategic planning.
● Database Design:
○ OLTP: Employs an ER model with highly normalized tables, which helps
in maintaining data integrity and optimizing transactional operations.
○ OLAP: Utilizes a star schema or other multidimensional models, which
simplifies complex queries and enhances analytical capabilities.
● View:
○ OLTP: Provides a real-time view of current data, reflecting the latest
transactional activities.
○ OLAP: Offers an evolutionary, integrated view of data over time, making it
ideal for trend analysis and historical comparisons.
● Access Patterns:
○ OLTP: Designed for frequent updates and simple queries, ensuring the
quick processing of transactions.
○ OLAP: Supports complex, read-only queries that may involve large
datasets and require significant computation.
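
The difference in access patterns can be seen in a small sketch using Python's built-in sqlite3 module. The table names, columns, and figures are assumptions made for illustration, and a single in-memory database stands in for what would normally be two separate systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("CREATE TABLE sales (region TEXT, month TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 500), (2, 300)])
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("East", "2024-01", 100), ("East", "2024-02", 150),
    ("West", "2024-01", 80), ("West", "2024-02", 70),
])
conn.commit()

# OLTP-style work: a short, atomic transfer touching two rows.
with conn:  # commits on success, rolls back on error
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 2")

# OLAP-style work: a read-only aggregate over the whole sales table.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

balances = conn.execute("SELECT id, balance FROM accounts ORDER BY id").fetchall()
print(balances, totals)
```

The transfer changes two rows and must either fully succeed or fully fail, while the aggregate query reads every sales row and changes nothing.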

Need Of Separate Data Warehouse

Separating a data warehouse from operational databases is crucial for maintaining high
performance and ensuring the efficient execution of both Online Transaction Processing
(OLTP) and Online Analytical Processing (OLAP) tasks. Here’s why this separation is
necessary:

1. High Performance for Both Systems

● OLTP Systems:
○ Optimization for Transactions: OLTP systems are specifically tuned for
fast transaction processing. This includes optimized access methods,
indexing, concurrency control, and recovery mechanisms that ensure quick
and reliable processing of high volumes of simple transactions, such as
inserting, updating, or deleting records.
○ Real-Time Operations: These systems are designed to support real-time
operations, where the focus is on handling a large number of short, atomic
transactions efficiently. Any delay or performance issue in these systems
can directly impact day-to-day business operations.
● Data Warehouse (OLAP Systems):
○ Optimization for Analytical Queries: Data warehouses, on the other
hand, are tuned for complex OLAP queries that involve large-scale data
analysis. These systems support multidimensional views of data and are
designed to handle complex aggregations, summarizations, and
consolidations of data from multiple sources.
○ Batch Processing and Historical Data: Unlike OLTP systems, data
warehouses are optimized for batch processing and handling large datasets.
They are built to store and process historical data, which is crucial for trend
analysis, forecasting, and decision support.

2. Different Functions and Different Data

● Historical Data Requirements:
○ OLTP Systems: Typically, operational databases do not maintain extensive
historical data. They are designed to manage current data that is frequently
updated to reflect the latest transactions.
○ Data Warehouse: Decision support systems (DSS) require access to
historical data to analyze trends, make forecasts, and support strategic
decision-making. Data warehouses store this historical data, which
operational databases are not equipped to handle.
● Data Consolidation:
○ OLTP Systems: Operational databases are usually focused on a specific
area of operations, such as sales, inventory, or payroll, and do not perform
extensive data consolidation.
○ Data Warehouse: A data warehouse consolidates data from various
heterogeneous sources, which often involves aggregating and summarizing
data to provide a unified view for analysis. This consolidation is essential
for creating comprehensive reports and conducting in-depth analysis across
different domains of an organization.
● Data Quality and Reconciliation:
○ OLTP Systems: Different operational databases may use varying data
representations, codes, and formats, leading to inconsistencies when data
from multiple sources needs to be combined.
○ Data Warehouse: To ensure data quality, data warehouses perform data
cleansing and reconciliation. This process involves standardizing data
representations, resolving discrepancies, and ensuring that the data is
accurate and consistent across all sources before it is loaded into the
warehouse. This is crucial for reliable and meaningful analysis.

Multidimensional Data Model

The multidimensional data model is central to Online Analytical Processing (OLAP) and
is designed to enable efficient querying and analysis of data. This model structures data in
a way that allows users to view and interact with it from multiple perspectives. The
primary components of the multidimensional model are data cubes, star schemas,
snowflake schemas, and fact constellation schemas.

1. Data Cubes

Definition:

● A data cube is a multidimensional array of values, typically used to represent data
along multiple dimensions. It allows users to view and analyze data from various
perspectives, such as by time, location, product, etc.

Components:

● Dimensions: These are the perspectives or angles from which the data is analyzed.
Common dimensions include time, geography, and product categories.
● Measures: These are quantitative data points stored in the cube, such as sales
revenue, quantities sold, or profit margins.
● Cells: Each cell in the cube represents a unique combination of dimension values
and contains a measure value.

Example:

● Consider a retail company that wants to analyze sales data. A data cube could have
dimensions such as Time (years, quarters, months), Location (regions, stores), and
Product (categories, individual items). Each cell in the cube might contain the total
sales revenue for a specific product in a specific region during a specific time
period.

Benefits:

● Speed: Data cubes allow for fast querying and retrieval of aggregated data.
● Multidimensional Analysis: Users can perform complex queries and analyze data
across multiple dimensions.
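
One simple way to picture dimensions, measures, and cells is a dictionary keyed by one value per dimension. This is only a sketch: the dimension values and revenue figures are made up, and real OLAP engines use far more compact storage.

```python
# Each cell is keyed by one value per dimension: (time, location, product).
# The revenue figures are made up for illustration.
cube = {
    ("2024-Q1", "North", "Laptop"): 120,
    ("2024-Q1", "South", "Laptop"): 90,
    ("2024-Q2", "North", "Laptop"): 150,
    ("2024-Q1", "North", "Phone"): 60,
}


def cell(time, location, product):
    # Look up the measure stored at one intersection of dimension values.
    return cube.get((time, location, product), 0)


def total(time=None, location=None, product=None):
    # Aggregate the measure over every dimension left unspecified.
    wanted = (time, location, product)
    return sum(value for key, value in cube.items()
               if all(w is None or w == k for w, k in zip(wanted, key)))


print(cell("2024-Q1", "North", "Laptop"))
print(total(product="Laptop"))
```

Here `cell` reads a single measure value, while `total` shows how leaving a dimension unspecified turns the lookup into an aggregation across that dimension.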


Data Cube: Detailed Explanation

A data cube is a fundamental concept in the multidimensional data model used for Online
Analytical Processing (OLAP). It represents data in a multidimensional format, allowing
users to analyze and explore data from various perspectives efficiently. Here's a detailed
breakdown of the data cube concept:

1. Definition

A data cube is a multidimensional array of values, organized along different dimensions.
It allows for the representation of complex data relationships and enables users to
perform sophisticated queries and analyses. The cube structure facilitates the exploration
of data across multiple dimensions and hierarchies.

2. Key Components

● Dimensions:
○ Definition: Dimensions are perspectives or categories by which data is
analyzed. They represent different angles or attributes from which data can
be viewed and queried.
○ Examples: Common dimensions include Time (year, quarter, month),
Location (country, city, store), and Product (category, brand, item).
● Measures:
○ Definition: Measures are quantitative data points stored in the cube. They
are the numerical values that users analyze and aggregate.
○ Examples: Measures include sales revenue, profit margins, quantities sold,
and customer counts.
● Cells:
○ Definition: Each cell in a data cube contains a value representing a specific
combination of dimension values. It is the intersection point of the
dimensions and contains the measure for that intersection.
○ Example: In a sales data cube, a cell might contain the total sales revenue
for a specific product in a particular region during a specific month.

3. Structure and Visualization

● Multidimensional Array:
○ A data cube is essentially a multidimensional array where each dimension
represents a different axis. For example, a three-dimensional cube might
have axes for Time, Location, and Product.
● Hierarchies:
○ Definition: Dimensions often have hierarchies that represent different
levels of granularity. For example, the Time dimension might have
the hierarchy Year > Quarter > Month > Day.
○ Purpose: Hierarchies allow users to drill down or roll up the data. Drilling
down means moving from higher-level summaries to more detailed levels,
while rolling up involves aggregating data to higher-level summaries.
● Slicing, Dicing, and Pivoting:
○ Slicing: Extracting a slice of the cube (two-dimensional when the cube has
three dimensions) by selecting a specific value for one dimension. For
example, viewing sales data for a particular year.
○ Dicing: Creating a sub-cube by selecting specific values for multiple
dimensions. For example, viewing sales data for specific products in certain
regions and months.
○ Pivoting: Rotating the data cube to view the data from different
perspectives. For example, switching the axes of the cube to analyze data
from a different angle.
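
Slicing, dicing, and pivoting can be sketched on a small dictionary-based cube. The helper names, the axis numbering, and the sales figures are all hypothetical and serve only to make the three operations concrete.

```python
# Cells keyed by (year, region, product); values are made-up sales figures.
cube = {
    (2023, "East", "TV"): 10, (2023, "East", "PC"): 20,
    (2023, "West", "TV"): 30, (2023, "West", "PC"): 40,
    (2024, "East", "TV"): 50, (2024, "East", "PC"): 60,
    (2024, "West", "TV"): 70, (2024, "West", "PC"): 80,
}


def slice_cube(cube, axis, value):
    # Fix one dimension at a single value and drop that axis from the keys.
    return {key[:axis] + key[axis + 1:]: v
            for key, v in cube.items() if key[axis] == value}


def dice_cube(cube, allowed):
    # Keep cells whose value on each listed axis is in the allowed set.
    return {key: v for key, v in cube.items()
            if all(key[axis] in values for axis, values in allowed.items())}


def pivot_cube(cube, order):
    # Reorder the axes; the cell values themselves are unchanged.
    return {tuple(key[i] for i in order): v for key, v in cube.items()}


sales_2024 = slice_cube(cube, axis=0, value=2024)
east_tv = dice_cube(cube, {1: {"East"}, 2: {"TV"}})
by_product_first = pivot_cube(cube, order=(2, 1, 0))
print(sales_2024)
```

The slice loses one dimension (its keys have two parts instead of three), the dice keeps all three dimensions but fewer cells, and the pivot keeps every cell and only changes the order of the axes.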

4. [Figure: 3D data cube of sales by Product, Date, and Country]

Explanation of the Data Cube:

1. Dimensions:
○ The data cube represents data across three dimensions:
■ Product (shown on the vertical axis): This dimension includes
different product types such as TV, VCR, and PC.
■ Date (shown on the horizontal axis): This dimension is divided into
different time periods, such as 1st Quarter (1Qtr), 2nd Quarter
(2Qtr), 3rd Quarter (3Qtr), and 4th Quarter (4Qtr).
■ Country (shown on the depth axis): This dimension represents
different geographical locations like USA, Canada, and Mexico.
2. Cells in the Cube:
○ Each cell in the cube represents a specific data point, such as the sales of a
particular product in a specific quarter in a particular country. For example,
one cell could represent the sales of VCRs in the 2nd quarter in the USA.
3. Aggregation:
○ Along the edges of the cube, you can see sum labels. These represent
aggregated values. For example:
■ Total sales of all products across all quarters in the USA.
■ Total sales of TVs across all quarters and countries.
○ These aggregate values help provide summarized information for quicker
analysis.
4. Highlighted Total:
○ The diagram highlights Total annual sales of TV in U.S.A., which is the
sum of all sales of TVs in all quarters in the USA. This is shown as a
specific part of the cube that has been summed over a particular axis
(quarters).
5. Concepts Involved:
○ Slicing: Looking at one specific slice of the cube, for example, just the
sales for TV products or just the data for the USA.
○ Dicing: Examining a more specific sub-cube by selecting values on two or
more dimensions, such as sales of TVs in the USA for the 1st and 2nd
quarters.
○ Roll-up: Summing data across a particular dimension, such as getting total
sales across all products or all countries.
○ Drill-down: Breaking the aggregated data into finer levels, for example,
breaking down total sales by quarters.
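
The "sum" cells along the edges of the cube can be computed by aggregating over one or more axes. In the sketch below the dimension values follow the description above, but the sales numbers are made up (every cell holds 10) so the totals are easy to verify by hand.

```python
from itertools import product

PRODUCTS = ["TV", "VCR", "PC"]
QUARTERS = ["1Qtr", "2Qtr", "3Qtr", "4Qtr"]
COUNTRIES = ["USA", "Canada", "Mexico"]

# Made-up sales: every cell holds 10 units, so the totals are easy to check.
cube = {key: 10 for key in product(PRODUCTS, QUARTERS, COUNTRIES)}


def sum_over(cube, prod=None, quarter=None, country=None):
    # None plays the role of the "sum" label: aggregate over that axis.
    wanted = (prod, quarter, country)
    return sum(v for key, v in cube.items()
               if all(w is None or w == k for w, k in zip(wanted, key)))


tv_usa_annual = sum_over(cube, prod="TV", country="USA")  # summed over quarters
usa_total = sum_over(cube, country="USA")                 # all products, all quarters
tv_total = sum_over(cube, prod="TV")                      # all quarters, all countries
print(tv_usa_annual, usa_total, tv_total)
```

`tv_usa_annual` corresponds to the highlighted total in the diagram: the Product and Country axes are fixed and the Date axis is summed away.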

[Figure: table and 3D data cube of AllElectronics sales by time, item, and location]

This table and diagram represent a 3D view of sales data for AllElectronics, according
to the three dimensions:

1. Time (quarters Q1 to Q4)
2. Item (home entertainment, computer, phone, security)
3. Location (Chicago, New York, Toronto, Vancouver)

The values represent dollars sold (in thousands) for each combination of these three
dimensions.

Explanation:

● The table shows the sales values broken down by item, time (quarters), and
location. For example:
○ In Q1, in Chicago, computer sales were $882,000, and phone sales were
$89,000.
○ In Q4, in Vancouver, computer sales were $927,000, and phone sales
were $1,038,000.
● The cube diagram is a visual representation of this data:
○ The X-axis represents the item types (home entertainment, computer, etc.).
○ The Y-axis represents the quarters (time).
○ The Z-axis represents the locations (cities).

[Figure: 4D data cube of AllElectronics sales, shown as one 3D cube per supplier]

This diagram builds on the previous data cube by introducing an additional dimension:
Supplier. It now represents a 4D data cube with the following four dimensions:

1. Time (quarters Q1 to Q4)
2. Item (home entertainment, computer, phone, security)
3. Location (Chicago, New York, Toronto, Vancouver)
4. Supplier (SUP1, SUP2, SUP3)

The measure displayed is dollars sold (in thousands), but for simplicity, only some
values are shown.

Explanation:

● Multiple Cubes for Suppliers:
○ The diagram shows three separate cubes, one for each supplier (SUP1,
SUP2, SUP3). Each cube is similar to the 3D data cube seen before but
now represents the sales data for a particular supplier.
○ For example:
■ In the SUP1 cube, during Q1, in Chicago, sales of computers were
$825,000, and sales of security items were $400,000.
○ The other cubes (SUP2, SUP3) would contain similar data but for the
corresponding supplier.
● Purpose:
○ By adding the supplier dimension, we can analyze how different suppliers
perform across the other dimensions (time, location, and item). This allows
for a deeper understanding of supplier-specific performance.

This 4D data cube is a useful representation for conducting multidimensional analysis,
such as comparing how supplier-specific sales vary by product, location, and time.
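
Viewing a 4D cube as a series of 3D cubes amounts to grouping the cells by one dimension. In the sketch below the two SUP1 values are the ones quoted in the example above; the SUP2 value and the helper names are made up for illustration.

```python
from collections import defaultdict

# Cells keyed by (supplier, time, item, location); dollars sold in thousands.
# The two SUP1 values come from the example above; the SUP2 row is made up.
cube_4d = {
    ("SUP1", "Q1", "computer", "Chicago"): 825,
    ("SUP1", "Q1", "security", "Chicago"): 400,
    ("SUP2", "Q1", "computer", "Chicago"): 700,
}


def split_by_supplier(cube):
    # View the 4D cube as one 3D cube per supplier.
    cubes = defaultdict(dict)
    for (supplier, time, item, location), value in cube.items():
        cubes[supplier][(time, item, location)] = value
    return dict(cubes)


def total_by_supplier(cube):
    # Roll up time, item, and location to compare suppliers directly.
    totals = defaultdict(int)
    for (supplier, *_), value in cube.items():
        totals[supplier] += value
    return dict(totals)


per_supplier = split_by_supplier(cube_4d)
print(per_supplier["SUP1"][("Q1", "computer", "Chicago")])
print(total_by_supplier(cube_4d))
```

Each entry of `per_supplier` has three-part keys, matching the 3D cubes in the diagram, while `total_by_supplier` collapses everything except the supplier dimension.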

[Figure: OLAP operations (roll-up, drill-down, slice, dice, pivot) on a data cube]

The image illustrates various OLAP (Online Analytical Processing) operations
performed on multidimensional data. These operations include roll-up, drill-down, slice,
dice, and pivot (also called rotate). I'll explain each operation in detail using the picture
as an example.

1. Roll-up:

● Definition: Aggregating data by climbing up a hierarchy or reducing dimensions.
● Example in the image: The cube at the top right shows a roll-up on the location
dimension from cities (Chicago, New York, Toronto) to the country level (USA,
Canada). This reduces the level of detail of the data.

2. Drill-down:

● Definition: The opposite of roll-up; increasing the level of detail of the data by
moving down a hierarchy or introducing more dimensions.
● Example in the image: A drill-down operation is performed from the USA level
to the city level, showing finer details for each city (e.g., Chicago, New York).

3. Slice:

● Definition: Cutting out a single layer of data from the cube by fixing one
dimension at a particular value.
● Example in the image: A slice is applied by selecting data for a specific item type
(e.g., "home entertainment"), which results in a 2D matrix for location and time.
The original cube is "sliced" along the item dimension.

4. Dice:

● Definition: Selecting a subcube by specifying a range or specific values for
multiple dimensions.
● Example in the image: A dice operation is applied by selecting data for certain
locations (Chicago, New York) and certain item types (home entertainment,
security), resulting in a subcube.

5. Pivot (Rotate):

● Definition: Rotating the cube to view data from different perspectives by
swapping dimensions.
● Example in the image: The pivot operation rotates the data cube so that different
dimensions (e.g., item types and location) are placed on different axes, providing a
new view of the data.

Summary:

● Roll-up aggregates data to a higher level.
● Drill-down provides more detailed data.
● Slice selects a specific data layer.
● Dice narrows the data cube to a smaller subcube based on criteria.
● Pivot rotates the cube to change the perspective.

These operations allow for flexible and dynamic data analysis in OLAP systems.
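
The operations above can be sketched on a tiny in-memory cube. This is an illustrative sketch only: the cell values, the CITY_TO_COUNTRY mapping, and the helper names (roll_up, slice_cube, dice, pivot) are assumptions made for the example and are not taken from the figure. Drill-down is the reverse of roll_up, that is, going back from the country-level result to the city-level CUBE.

```python
from collections import defaultdict

# Hypothetical cube cells keyed by (quarter, city, item); values are sales.
CUBE = {
    ("Q1", "Chicago", "computer"): 825,
    ("Q1", "Chicago", "security"): 400,
    ("Q1", "New York", "computer"): 500,
    ("Q1", "Toronto", "computer"): 600,
    ("Q2", "New York", "computer"): 950,
    ("Q2", "Toronto", "security"): 300,
}

# Assumed location hierarchy: city -> country.
CITY_TO_COUNTRY = {"Chicago": "USA", "New York": "USA", "Toronto": "Canada"}


def roll_up(cube):
    """Roll-up: climb the location hierarchy from city to country."""
    out = defaultdict(int)
    for (quarter, city, item), sales in cube.items():
        out[(quarter, CITY_TO_COUNTRY[city], item)] += sales
    return dict(out)


def slice_cube(cube, item):
    """Slice: fix the item dimension, leaving a 2D (quarter, city) view."""
    return {(q, c): s for (q, c, i), s in cube.items() if i == item}


def dice(cube, cities, items):
    """Dice: keep the subcube restricted on the city and item dimensions."""
    return {k: s for k, s in cube.items() if k[1] in cities and k[2] in items}


def pivot(view):
    """Pivot: swap the two axes of a 2D view."""
    return {(b, a): s for (a, b), s in view.items()}


rolled = roll_up(CUBE)
computers = slice_cube(CUBE, "computer")
sub = dice(CUBE, {"Chicago", "New York"}, {"computer", "security"})
rotated = pivot(computers)
```

Each helper returns a new dictionary and leaves CUBE unchanged, which mirrors how OLAP operations produce new views of the same underlying data.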


Modeling of Data Warehouses

A data warehouse model defines how data is structured, stored, and accessed in a data
warehouse. The modeling process involves designing an efficient and scalable
architecture to support querying, analysis, and reporting on large amounts of data. Data
warehouse modeling provides a structured way to manage complex data and is crucial for
data integration, storage, and retrieval.

Model: In the context of data warehouses, a model is an abstract representation that
defines the organization and relationships of data. A model provides a blueprint for how
data is to be arranged in the data warehouse, ensuring it is structured for easy retrieval,
querying, and analysis. The model allows for the seamless integration of data from
various sources and helps in decision-making.

Types of Data Warehouse Models



There are three primary types of models used in data warehouse design:

1. Conceptual Model
○ This is a high-level representation of the data warehouse's structure. It
focuses on defining the entities and relationships that exist within the data
warehouse.
○ At this stage, no technical details are included, such as how the data will be
stored or the specific database design.
○ Example: Identifying key data entities such as "Sales," "Products," "Customers,"
and "Regions" and defining how they relate to each other.
2. Logical Model
○ The logical model builds upon the conceptual model and specifies the
logical structure of the data. This includes fact tables, dimension tables,
keys, relationships, and constraints.
○ This model does not include implementation-specific details such as the
physical storage medium but describes how the data is logically arranged.
○ Example: Designing a star schema with a "Sales" fact table (holding the sales
amount measure) and dimension tables like "Time," "Product," and "Region."
3. Physical Model
○ The physical model describes how the data will actually be stored and
retrieved in the data warehouse. This involves specifying the database
schema, indexes, partitions, storage engines, and other physical
considerations.
○ It also includes details about data loading strategies, optimization, and
storage allocation.
○ Example: Defining the physical database tables, setting up indexes for efficient
queries, and optimizing disk storage.

Data Warehouse Modeling Techniques

1. Star Schema

Definition:

● The star schema is a type of multidimensional database schema that organizes data
into facts and dimensions. It is characterized by a central fact table surrounded by
dimension tables.

Components:

● Fact Table: Contains quantitative data (measures) and foreign keys to dimension
tables. Examples include sales figures, order quantities, or financial metrics.
● Dimension Tables: Contain descriptive attributes related to the dimensions.
Examples include time (dates, months), product (product names, categories), and
location (city, state).

Structure:

● The fact table is at the center of the schema, and dimension tables radiate outwards
like the points of a star.

Example:

● For a sales analysis, the fact table might include columns for sales amount,
quantity sold, and foreign keys linking to dimension tables like Time, Product, and
Store. Each dimension table provides descriptive attributes for each dimension.

Benefits:

● Simplicity: The star schema is straightforward and easy to understand, making it
easier to design and query.
● Performance: Queries on star schemas can be optimized due to their simplicity
and the denormalized structure of dimension tables.

How the Star Schema Works:

1. Fact Table:
○ The central table in a star schema stores quantitative data or measures,
such as sales figures, units sold, or revenue. Each row in the fact table
corresponds to a specific event or transaction (e.g., a sale).
2. Dimension Tables:
○ The fact table is linked to several dimension tables. These tables store
descriptive attributes (e.g., time, product, location) that provide context to
the measures in the fact table. Each dimension is connected to the fact table
through a foreign key.
3. Joining Fact and Dimension Tables:
○ The star schema allows efficient querying by joining the fact table with
dimension tables using foreign keys (e.g., time_key, item_key, branch_key,
location_key).
○ Users can run queries like "What are the total sales for a specific product in
a particular location during a specific time period?" by joining the sales fact
table with relevant dimension tables.
4. Advantages:
○ The structure simplifies data queries and analysis, as users can easily
retrieve relevant information by connecting the measures in the fact table
with detailed attributes in the dimension tables.
○ This schema is ideal for data warehousing and reporting, enabling
businesses to extract insights from large datasets.
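
The join described in step 3 can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: the table names (sales_fact, time_dim, item_dim, location_dim), the sample rows, and the year 2024 are hypothetical, and the branch dimension is left out to keep the example short. The query answers "total sales for a specific product in a particular location during a specific time period".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT);
-- Fact table: one row per sale, with foreign keys to each dimension.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
INSERT INTO time_dim VALUES (1, 'Q1', 2024), (2, 'Q2', 2024);
INSERT INTO item_dim VALUES (1, 'computer'), (2, 'security');
INSERT INTO location_dim VALUES (1, 'Chicago'), (2, 'Toronto');
INSERT INTO sales_fact VALUES
    (1, 1, 1, 3, 3000.0),
    (1, 1, 1, 2, 2000.0),
    (1, 2, 1, 1,  400.0),
    (2, 1, 2, 4, 4000.0);
""")

# Join the fact table to each dimension through its foreign key,
# then filter on descriptive attributes stored in the dimensions.
query = """
SELECT SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN time_dim     AS t ON f.time_key = t.time_key
JOIN item_dim     AS i ON f.item_key = i.item_key
JOIN location_dim AS l ON f.location_key = l.location_key
WHERE i.item_name = ? AND l.city = ? AND t.quarter = ? AND t.year = ?
"""
total = conn.execute(query, ("computer", "Chicago", "Q1", 2024)).fetchone()[0]
print(total)
```

Every dimension is exactly one join away from the fact table, which is the property that keeps star-schema queries simple.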

The star schema in the diagram represents a way to organize data for efficient analysis.
The fact table in the center stores measurable data about sales (e.g., units sold, dollars
sold). Around it are dimension tables (e.g., time, item, branch, location) that
describe the context of the sales, such as when and where the sales happened or which
products were sold.

2. Snowflake Schema

Definition:

● The snowflake schema is a variation of the star schema where dimension tables are
normalized, splitting them into related sub-dimension tables. This results in a
structure resembling a snowflake.

Components:

● Fact Table: Similar to the star schema, it contains measures and foreign keys to
dimension tables.
● Normalized Dimension Tables: Dimension tables are decomposed into multiple
related tables to reduce redundancy and improve data integrity.

Structure:


● The snowflake schema features a more complex structure with normalized tables
connected by relationships, resembling a snowflake.

Example:

● In a snowflake schema for sales analysis, the Product dimension table might be
split into sub-tables for Product Category and Product Subcategory. The Store
dimension might be split into City and State tables.

Benefits:

● Normalization: Reduces redundancy and improves data integrity by organizing
data into normalized tables.
● Space Efficiency: More efficient in terms of storage compared to the star schema.

Drawbacks:

● Complexity: The schema is more complex, which can make querying and design
more challenging.

The image represents a Snowflake Schema used in data warehousing. A snowflake
schema is an extension of a star schema where dimension tables are normalized into
multiple related tables, forming a "snowflake" shape.

Key Components:

1. Fact Table (Sales Fact Table):
○ Contains measures (like units_sold, dollars_sold, and avg_sales), which are
the quantitative data used for analysis.
○ Linked to several dimension tables via foreign keys (time_key, item_key,
branch_key, location_key).
2. Dimension Tables:
○ Time: Captures details like day, month, quarter, year, etc., identified by
time_key.
○ Item: Stores item details such as item_name, brand, type, linked by
item_key. It further connects to the Supplier table via supplier_key.
○ Branch: Holds branch information such as branch_name, branch_type,
identified by branch_key.
○ Location: Stores street and links to City via city_key.
○ City: Contains city information like province_or_state, country, etc.
3. Normalization:
○ In the snowflake schema, dimension tables like item and location are
normalized into sub-dimensions (supplier and city, respectively). This
reduces redundancy but increases the number of joins required for
querying.

This structure is efficient in terms of storage, but may result in slower query performance
due to the multiple table joins required.
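
The extra join introduced by normalization can be sketched with Python's built-in sqlite3 module. This is an illustrative sketch, not the schema from the figure: the table names (sales_fact, location_dim, city_dim), the reduced column set, and the sample rows are assumptions. Because country lives in the city table, a query by country has to go through location first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Sub-dimension split out of location by normalization.
CREATE TABLE city_dim (city_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
CREATE TABLE location_dim (
    location_key INTEGER PRIMARY KEY,
    street       TEXT,
    city_key     INTEGER REFERENCES city_dim(city_key)
);
CREATE TABLE sales_fact (
    location_key INTEGER REFERENCES location_dim(location_key),
    dollars_sold REAL
);
INSERT INTO city_dim VALUES (1, 'Chicago', 'USA'), (2, 'Toronto', 'Canada');
INSERT INTO location_dim VALUES
    (10, '1 Main St', 1), (11, '2 Lake Ave', 1), (12, '3 King St', 2);
INSERT INTO sales_fact VALUES (10, 100.0), (11, 250.0), (12, 400.0);
""")

# Two joins are needed to reach country: fact -> location -> city.
rows = conn.execute("""
SELECT c.country, SUM(f.dollars_sold)
FROM sales_fact AS f
JOIN location_dim AS l ON f.location_key = l.location_key
JOIN city_dim     AS c ON l.city_key = c.city_key
GROUP BY c.country
ORDER BY c.country
""").fetchall()
print(rows)
```

In a star schema the country column would sit directly in the location table, so the same question would need only one join. The trade-off is that each city's country would then be repeated on every location row.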

3. Fact Constellation Schema

Definition:

● The fact constellation schema, also known as a galaxy schema, is a more complex
schema that includes multiple fact tables sharing dimension tables. It supports
multiple business processes or subject areas.

Components:

● Fact Tables: Multiple fact tables that store measures related to different processes
or areas, such as sales and inventory.
● Shared Dimension Tables: Dimension tables that are used by multiple fact tables.

Structure:

● The schema features a constellation of fact tables connected by shared dimension
tables, allowing for comprehensive analysis across different business areas.

Example:

● For a retail company, a fact constellation schema might include fact tables for
Sales, Inventory, and Purchasing, all linked to common dimension tables like
Time, Product, and Store.

Benefits:

● Comprehensive Analysis: Supports analysis across multiple business processes
and dimensions.
● Flexibility: Allows for the integration of different fact tables and dimensions,
providing a broader view of the data.

Drawbacks:

● Complexity: More complex to design and manage due to the involvement of
multiple fact tables and dimensions.

The image represents a Fact Constellation Schema (also known as a Galaxy Schema).
This schema is a collection of multiple fact tables that share common dimension tables. It
allows complex data analysis across different business processes.

Key Components:

1. Sales Fact Table:
○ Measures: units_sold, dollars_sold, avg_sales.
○ Linked to dimensions via foreign keys: time_key, item_key, branch_key,
location_key.
2. Shipping Fact Table:
○ Measures: dollars_cost, units_shipped.
○ Linked to dimensions via foreign keys: time_key, item_key, shipper_key,
from_location, to_location.
3. Shared Dimensions:
○ Time: Tracks time-related data (e.g., day_of_the_week, month, year),
shared by both fact tables.
○ Item: Contains item-specific details like item_name, brand, supplier_type,
shared by both fact tables.
○ Location: Represents geographical details (city, province_or_state,
country), linked to both sales and shipping processes.
4. Unique Dimensions:
○ Branch: Linked to the Sales Fact Table, contains branch-specific details
(branch_name, branch_type).
○ Shipper: Linked to the Shipping Fact Table, contains details about shippers
(shipper_name, shipper_type).

Key Points:

● Fact Constellation allows analysis of multiple business processes (e.g., sales and
shipping) in the same schema.
● Shared dimensions reduce redundancy and ensure consistency across the
processes.
● Multiple fact tables increase complexity but allow more comprehensive reporting
and querying.
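
How a shared dimension ties two fact tables together can be sketched with Python's built-in sqlite3 module. This is a minimal sketch under stated assumptions: only the shared item dimension is modeled, and the table names and sample rows are hypothetical. Each fact table is aggregated on its own before the results are combined, so that rows from one process do not multiply rows from the other.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One shared dimension, referenced by both fact tables.
CREATE TABLE item_dim (item_key INTEGER PRIMARY KEY, item_name TEXT);
CREATE TABLE sales_fact    (item_key INTEGER, units_sold INTEGER);
CREATE TABLE shipping_fact (item_key INTEGER, units_shipped INTEGER);
INSERT INTO item_dim VALUES (1, 'computer'), (2, 'security');
INSERT INTO sales_fact VALUES (1, 3), (1, 2), (2, 1);
INSERT INTO shipping_fact VALUES (1, 4), (2, 1), (2, 2);
""")

# Aggregate each process separately, then line the totals up per item.
rows = conn.execute("""
SELECT i.item_name, s.sold, h.shipped
FROM item_dim AS i
JOIN (SELECT item_key, SUM(units_sold) AS sold
      FROM sales_fact GROUP BY item_key) AS s ON s.item_key = i.item_key
JOIN (SELECT item_key, SUM(units_shipped) AS shipped
      FROM shipping_fact GROUP BY item_key) AS h ON h.item_key = i.item_key
ORDER BY i.item_name
""").fetchall()
print(rows)
```

Because both fact tables use the same item_key values, the sales and shipping totals can be compared item by item without any mapping between the two processes.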

Cube: A Lattice of Cuboids

1. Concept of Cuboids

In the context of data warehousing and OLAP (Online Analytical Processing), a cuboid
is the result of aggregating the cube's data over one particular choice of dimensions (one
group-by). Each cuboid can be seen as a view of the larger data cube at a specific
level of granularity.

2. Lattice of Cuboids

The lattice of cuboids refers to the hierarchical structure formed by all possible cuboids
within a data cube. This structure is organized based on the levels of aggregation for each
dimension.

● Granularity Levels:
○ Base Cuboid: Contains data at the finest level of granularity, with no
aggregation. For example, sales data for each individual transaction.
○ Aggregated Cuboids: Represent data aggregated along different
dimensions or hierarchies. For instance, total sales by month or by city.
● Lattice Structure:
○ The lattice is hierarchical, where each level represents a different degree of
data aggregation. The base cuboid is at the bottom, while higher-level
cuboids aggregate data over one or more dimensions, up to the apex cuboid,
which holds a single total over all dimensions.



3. Visualization and Examples

Consider a data cube with dimensions for Time (Year, Month), Location (Country, City),
and Product (Category, Item). The lattice of cuboids for this cube would include:

● Base Cuboid: Sales data for each individual combination of Time, Location, and Product
(e.g., sales for each item in each city for each month).

Example:
| Time | Location    | Product | Sales  |
|------|-------------|---------|--------|
| Jan  | New York    | Widget  | $1,000 |
| Jan  | Los Angeles | Gadget  | $500   |

● Aggregated Cuboids: These include:
○ Time Aggregation:
■ Total sales for each month (ignoring Location and Product).
■ Total sales for each year (ignoring Location and Product).
○ Location Aggregation:
■ Total sales for each city (ignoring Time and Product).
■ Total sales for each country (ignoring Time and Product).
○ Product Aggregation:
■ Total sales for each product category (ignoring Time and Location).
■ Total sales for each product item (ignoring Time and Location).

Example:
| Time | Location | Product | Sales |
|--------|-------------|---------|-------|
| Jan | All Cities | Widget | $1,500|
| Jan | All Cities | Gadget | $500 |

● Combinations of Aggregations:
○ Total sales for each combination of Year and City (e.g., total sales for each
city in 2024).
○ Total sales for each combination of Product and Month (e.g., total sales for
each product in January).
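
The lattice can be enumerated directly: each subset of the dimensions gives one cuboid. The sketch below is illustrative and makes several assumptions: the BASE rows (including the Chicago row) are hypothetical sample data, the helper name cuboid is made up for the example, and concept hierarchies such as Month to Year are ignored, so three dimensions yield 2**3 = 8 cuboids.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("time", "location", "product")

# Hypothetical base-cuboid rows: (time, location, product, sales).
BASE = [
    ("Jan", "New York", "Widget", 1000),
    ("Jan", "Chicago", "Widget", 500),
    ("Jan", "Los Angeles", "Gadget", 500),
]


def cuboid(rows, keep):
    """Sum sales over every dimension that is not named in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


# One cuboid per subset of dimensions, from the base cuboid (all three
# dimensions kept) down to the apex cuboid (no dimensions kept).
lattice = {
    keep: cuboid(BASE, keep)
    for k in range(len(DIMENSIONS), -1, -1)
    for keep in combinations(DIMENSIONS, k)
}

print(len(lattice))
print(lattice[("time", "product")])
print(lattice[()])
```

The ("time", "product") cuboid corresponds to an "All Cities" view, and the cuboid keyed by the empty tuple is the grand total over all dimensions.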

4. Importance and Use Cases

● Efficient Querying: The lattice structure allows for efficient querying and data
retrieval at various levels of aggregation. Users can drill down into more detailed
data or roll up to higher-level summaries.
● Data Analysis: The lattice of cuboids supports complex data analysis by
providing different views of data. Users can analyze trends, compare performance
across dimensions, and identify patterns.
● Performance Optimization: Pre-aggregating data into cuboids can improve
query performance by reducing the amount of computation required at query time.

5. Challenges

● Storage Requirements: Storing multiple cuboids, especially for large datasets
with many dimensions, can require significant storage space.
● Maintenance: Keeping cuboids up-to-date with changes in the underlying data
requires efficient ETL (Extract, Transform, Load) processes and can be complex to
manage.

Data Mining Query Language (DMQL)


DMQL is a high-level query language designed for data mining, particularly for defining
data cubes and dimensions and for performing operations on data for analysis.

1. Cube Definition (Fact Table)

To define a data cube (the fact table that stores measures), we use the following syntax:


define cube <cube_name> [<dimension_list>]:<measure_list>

● <cube_name>: The name of the cube.


● <dimension_list>: A list of dimensions relevant to the cube.
● <measure_list>: A list of measures that are to be analyzed (e.g., sum, count, etc.).

Example:
define cube sales_cube [time, location, product]:dollars_sold, units_sold

In this example:

● The cube sales_cube is defined with three dimensions: time, location, and product.
● The measures tracked are dollars_sold and units_sold.

2. Dimension Definition (Dimension Table)

To define a dimension (the dimension table associated with a cube), the
following syntax is used:

define dimension <dimension_name> as (<attribute_or_subdimension_list>)

● <dimension_name>: The name of the dimension.
● <attribute_or_subdimension_list>: A list of attributes or subdimensions related
to the dimension.

Example:
define dimension time as (day, month, year)

In this example:

● The time dimension is defined with attributes day, month, and year.

3. Special Case: Shared Dimension Tables


If a dimension is shared across multiple cubes, it is defined in full only the first time it is
used. When referenced by subsequent cubes, we use the shared dimension syntax:

define dimension <dimension_name> as <dimension_name_first_time> in cube <cube_name_first_time>

● <dimension_name_first_time>: The name of the shared dimension.
● <cube_name_first_time>: The name of the cube where the dimension was first
defined.

Example:
define dimension time as time in cube sales_cube

In this example:

● The time dimension is shared between the sales_cube and another cube, so it is
referred to from its first definition.

Defining a Star Schema in DMQL

In a star schema, we define a central fact table that is connected to various dimension
tables. The fact table contains the quantitative data (measures), and the dimension tables
hold descriptive data that helps in slicing and dicing the measures.

Below is the breakdown of the DMQL (Data Mining Query Language) syntax used to
define a star schema:

1. Cube Definition (Fact Table)

The central fact table is defined using the define cube statement. In this case, the cube
sales_star has four dimensions: time, item, branch, and location. The cube also defines
measures that aggregate data.

define cube sales_star [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)

● dollars_sold: The total sales in dollars (aggregated using sum).
● avg_sales: The average sales in dollars (calculated using avg).
● units_sold: The count of items sold, obtained by counting the sales records with count(*).
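
To make the three measure definitions concrete, the sketch below computes them for a single cell of the cube. The list of sales amounts is hypothetical sample data, and plain Python stands in for the DMQL aggregate functions; it is not a DMQL implementation.

```python
from statistics import mean

# Hypothetical sales_in_dollars values for the fact rows in one cube cell.
sales_in_dollars = [1200.0, 800.0, 1000.0]

dollars_sold = sum(sales_in_dollars)  # corresponds to sum(sales_in_dollars)
avg_sales = mean(sales_in_dollars)    # corresponds to avg(sales_in_dollars)
units_sold = len(sales_in_dollars)    # corresponds to count(*): one per row

print(dollars_sold, avg_sales, units_sold)
```

Note that count(*) counts fact rows, so units_sold equals the number of items sold only when each row records a single item.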

2. Dimension Definitions (Dimension Tables)

Next, we define each dimension table that connects to the fact table. These dimensions
store the descriptive attributes related to each aspect of the sales data.

Dimension: Time
define dimension time as (time_key, day, day_of_week, month, quarter, year)

● Attributes: time_key, day, day_of_week, month, quarter, year.
● This dimension tracks the time-based attributes of the sales transactions.

Dimension: Item
define dimension item as (item_key, item_name, brand, type, supplier_type)

● Attributes: item_key, item_name, brand, type, supplier_type.
● This dimension stores details about the item being sold, such as its name, brand,
type, and supplier type.

Dimension: Branch
define dimension branch as (branch_key, branch_name, branch_type)

● Attributes: branch_key, branch_name, branch_type.
● This dimension provides information about the branch where the sale occurred.

Dimension: Location
define dimension location as (location_key, street, city, province_or_state, country)

● Attributes: location_key, street, city, province_or_state, country.
● This dimension captures the geographical location of the branch or customer.

A Fact Constellation schema can also be defined in DMQL (Data Mining Query Language), here with
two cubes: sales and shipping. The Fact Constellation schema allows for multiple fact
tables (cubes) that share dimensions, enabling complex queries and analysis. The sales
cube is defined as follows:

Fact Constellation Schema

define cube sales [time, item, branch, location]:
dollars_sold = sum(sales_in_dollars),
avg_sales = avg(sales_in_dollars),
units_sold = count(*)