DWDM Notes
Three-Tier Data Warehouse Architecture:
Each tier plays a crucial role in managing, processing, and analyzing large volumes of data
efficiently.
The bottom tier is the database server, which is almost always a relational database system. It
stores and manages data extracted from operational databases and external sources. Back-end
tools and utilities handle data extraction, cleaning, transformation, loading, and refreshing to
ensure that the warehouse remains updated.
Data extraction is facilitated using application program interfaces (APIs) known as gateways,
such as ODBC (Open Database Connectivity), OLE-DB, and JDBC (Java Database Connectivity).
This tier also includes a metadata repository, which stores detailed information about the data
warehouse structure, sources, and transformations.
The middle tier consists of an Online Analytical Processing (OLAP) server that supports
complex analytical operations on multidimensional data. This tier can be implemented using
either a Relational OLAP (ROLAP) model or a Multidimensional OLAP (MOLAP) model.
The top tier is the front-end client layer, which provides tools for user interaction, data
exploration, and decision-making. This tier includes query and reporting tools, analysis tools,
and data mining applications.
Users can generate reports, analyze trends, and perform predictive modeling to gain insights
from the data. The integration of analytical and visualization tools at this level makes it easier for
businesses and analysts to make data-driven decisions.
The three-tier architecture of a data warehouse ensures efficient data management, processing
and analysis. The bottom tier focuses on data storage and integration, the middle tier handles
analytical processing, and the top tier provides user-friendly tools for decision-making. This
structured approach enhances data accessibility, improves analytical performance, and supports
organizations in making informed business decisions.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
Drill down
Roll up
Slice
Dice
Pivot
Drill down: In the drill-down operation, less detailed data is converted into more detailed data.
It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in
the concept hierarchy of the Time dimension (Quarter -> Month).
Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the number of dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in
the concept hierarchy of the Location dimension (City -> Country).
Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions and
specifying criteria on them. In the cube given in the overview section, a sub-cube is selected
by applying criteria on multiple dimensions.
Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new
sub-cube. In the cube given in the overview section, Slice is performed on the dimension Time =
“Q1”.
Pivot: It is also known as rotation operation as it rotates the current view to get a new view of the
representation. In the sub-cube obtained after the slice operation, performing pivot operation
gives a new view of it.
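A minimal Python (pandas) sketch of these operations on a small hypothetical sales table; the column names and values are illustrative assumptions, not the cube from the overview section:

    import pandas as pd

    # Hypothetical fact data: one row per (time, location, item) with a sales measure.
    sales = pd.DataFrame({
        "quarter":    ["Q1", "Q1", "Q2", "Q2"],
        "month":      ["Jan", "Feb", "Apr", "May"],
        "city":       ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "country":    ["India", "India", "India", "India"],
        "item":       ["Phone", "Laptop", "Phone", "Laptop"],
        "units_sold": [100, 80, 120, 90],
    })

    # Roll-up: climb the Location hierarchy (city -> country).
    rollup = sales.groupby(["quarter", "country"])["units_sold"].sum()

    # Drill-down: step down the Time hierarchy (quarter -> month).
    drilldown = sales.groupby(["month", "city"])["units_sold"].sum()

    # Slice: fix a single dimension value (Time = "Q1").
    slice_q1 = sales[sales["quarter"] == "Q1"]

    # Dice: select a sub-cube with criteria on two or more dimensions.
    dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Delhi"])]

    # Pivot: rotate the view (cities as rows, quarters as columns).
    pivot = sales.pivot_table(index="city", columns="quarter",
                              values="units_sold", aggfunc="sum")

A real OLAP server executes these against the cube itself, but the groupby/pivot_table pattern captures the same ideas.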
Schema in a data warehouse:
Schema is a logical description of the entire database. It includes the name and description of records of
all record types, including all associated data items and aggregates. Much like a database, a data
warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse
uses the Star, Snowflake, or Fact Constellation schema.
Star Schema:
There is a fact table at the center. It contains the keys to each of the four dimension tables.
The fact table also contains the attributes, namely dollars sold and units sold.
Snowflake Schema:
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example,
the item dimension table of the Star schema is normalized and split into two dimension tables, namely
the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Fact Constellation Schema:
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key,
from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and
location dimension tables are shared between the sales and shipping fact tables.
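A minimal Python (pandas) sketch, with hypothetical tables and columns, of how a fact table joins to its dimension tables in a star-style schema:

    import pandas as pd

    # Hypothetical dimension tables.
    item_dim = pd.DataFrame({"item_key": [1, 2], "item_name": ["Phone", "Laptop"], "brand": ["A", "B"]})
    time_dim = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"]})

    # Hypothetical fact table: foreign keys to each dimension plus the measures.
    sales_fact = pd.DataFrame({
        "item_key":     [1, 2, 1],
        "time_key":     [10, 10, 11],
        "units_sold":   [100, 40, 120],
        "dollars_sold": [50000, 60000, 60000],
    })

    # A star-schema query joins the fact table to its dimension tables
    # and then aggregates the measures.
    report = (sales_fact
              .merge(item_dim, on="item_key")
              .merge(time_dim, on="time_key")
              .groupby(["brand", "quarter"])[["units_sold", "dollars_sold"]]
              .sum())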
Full Cube, Iceberg Cube, and Closed Cube:
The Full Cube computes all possible group-by aggregates for a dataset, making it the most
exhaustive but also the most computationally expensive approach. Since it generates every
possible aggregation, including those with little or no relevance, it leads to the highest
computation cost and memory usage. This makes it the least efficient method, suitable primarily
for small datasets where a complete set of aggregations is necessary.
The Iceberg Cube, on the other hand, improves efficiency by computing only those group-by
aggregates that meet a specific threshold, such as a minimum count or sum. By pruning
unnecessary computations and ignoring low-support aggregations, it significantly reduces both
computation cost and memory requirements. This approach is particularly useful for large
datasets where only meaningful and high-support aggregations are required, making it a more
efficient alternative to the Full Cube.
The Closed Cube optimizes computation by eliminating redundant aggregations while still
maintaining completeness. Unlike the Full Cube, which stores every possible combination, the
Closed Cube avoids unnecessary computations by keeping only the essential and non-redundant
aggregates. This leads to lower memory usage and improved efficiency compared to the Full
Cube, though it is not as selective as the Iceberg Cube. The Closed Cube is best suited for
scenarios where minimizing redundancy is important while still retaining full analytical
capabilities.
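As a rough illustration (not from the notes), a full cube over d dimensions materializes all 2^d group-bys. A minimal Python sketch enumerating them for a hypothetical three-dimensional dataset:

    import itertools
    import pandas as pd

    # Hypothetical dataset with three dimensions and one measure.
    df = pd.DataFrame({
        "city":       ["Delhi", "Delhi", "Mumbai"],
        "item":       ["Phone", "Laptop", "Phone"],
        "quarter":    ["Q1", "Q1", "Q2"],
        "units_sold": [100, 40, 120],
    })

    dims = ["city", "item", "quarter"]
    full_cube = {}
    # Every subset of the dimensions defines one group-by (2^3 = 8 of them,
    # including the empty grouping, i.e. the grand total).
    for r in range(len(dims) + 1):
        for group in itertools.combinations(dims, r):
            if group:
                full_cube[group] = df.groupby(list(group))["units_sold"].sum()
            else:
                full_cube[()] = df["units_sold"].sum()

    print(len(full_cube))  # 8 group-bys for 3 dimensions

An iceberg or closed cube would retain only a subset of these cells instead of materializing all of them.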
Multiway Array Aggregation and Iceberg Cube Computation:
Handling large-scale data efficiently is a critical challenge in data warehousing and OLAP
(Online Analytical Processing). Multiway Array Aggregation (MAA) and Iceberg Cube
Computation are two important techniques designed to optimize the processing of large datasets,
particularly in data cube computations. Their effectiveness is assessed based on their ability to
reduce computation time, optimize storage, and improve query performance.
Multiway Array Aggregation (MAA) is a method used in data cube computation, particularly for
large-scale multidimensional databases. It efficiently processes data by taking advantage of
array-based storage and simultaneous aggregations across multiple dimensions.
MAA stores data in an array format rather than in relational tables, reducing memory overhead.
Instead of computing the cube level by level, MAA processes multiple dimensions
simultaneously.
Cache Optimization:
Since arrays provide better cache locality, access times are faster, improving performance.
Scalability for Large Datasets:
Works efficiently for high-dimensional data cubes by partitioning the data and computing
aggregations in an optimized manner.
MAA is most effective for dense data cubes (i.e., where most combinations of attributes have
data).
For sparse datasets, alternative storage strategies like prefix trees or hash-based approaches may
be more efficient.
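A simplified Python (NumPy) sketch of the array-based idea behind MAA, using a hypothetical dense 3-D cube; the real algorithm also partitions the array into chunks and orders the chunk scan to minimize memory, which is omitted here:

    import numpy as np

    # Hypothetical dense cube: axis 0 = city, axis 1 = item, axis 2 = quarter.
    cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)

    # Aggregating over one axis yields each 2-D group-by directly from the array.
    by_item_quarter = cube.sum(axis=0)   # aggregate cities away
    by_city_quarter = cube.sum(axis=1)   # aggregate items away
    by_city_item    = cube.sum(axis=2)   # aggregate quarters away

    # Lower-level aggregates are derived from already-computed ones; this reuse
    # of intermediate results is the core of simultaneous, multiway aggregation.
    by_city = by_city_quarter.sum(axis=1)
    grand_total = by_city.sum()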
An Iceberg Cube is an optimization technique where only a subset of data cube aggregations that
meet a minimum support threshold (e.g., frequency, sum, or count) are computed and stored. It is
designed to prune irrelevant or insignificant aggregations, reducing computational cost.
Unlike a full data cube, which computes and stores all possible aggregations, Iceberg Cube only
retains the most useful subsets, drastically reducing storage.
Iceberg Cube precomputes and indexes only frequent and relevant aggregations, leading to faster
query response times.
Since Iceberg Cube prunes less significant aggregations early, it optimizes computation and
reduces processing overhead.
If the threshold is too high, important aggregations may be lost. If it's too low, efficiency gains
are reduced.
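A minimal Python sketch of the iceberg condition, keeping only group-by cells whose count meets a hypothetical threshold; real iceberg algorithms (e.g., BUC) push the threshold into the computation to prune early rather than filtering afterwards:

    import itertools
    import pandas as pd

    # Hypothetical transaction-style data.
    df = pd.DataFrame({
        "city":    ["Delhi", "Delhi", "Mumbai", "Delhi"],
        "item":    ["Phone", "Phone", "Laptop", "Laptop"],
        "quarter": ["Q1", "Q1", "Q1", "Q2"],
    })

    min_count = 2          # the iceberg condition: keep only cells with count >= 2
    dims = ["city", "item", "quarter"]
    iceberg_cells = {}
    for r in range(1, len(dims) + 1):
        for group in itertools.combinations(dims, r):
            counts = df.groupby(list(group)).size()
            kept = counts[counts >= min_count]   # prune low-support cells
            if not kept.empty:
                iceberg_cells[group] = kept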
Different Types of Data Mining Patterns:
Data mining involves identifying patterns in large datasets to uncover useful insights. The major
types of data mining patterns include association patterns, sequential patterns, classification
patterns, clustering patterns, outlier patterns, and trend patterns.
1. Association Patterns
Association patterns identify relationships between items in a dataset. They are widely used in
market basket analysis, where businesses determine frequently bought item combinations. For
example, customers who buy bread often purchase butter as well. Algorithms like Apriori and
FP-Growth are used to discover such patterns.
Example: In a supermarket, analysis of customer purchases reveals that 80% of customers who
buy diapers also buy baby wipes. This helps businesses in product placement and promotions.
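A minimal Python sketch (with made-up transactions) of the support and confidence measures behind such rules:

    # Hypothetical market-basket data: support and confidence for one rule.
    transactions = [
        {"bread", "butter"},
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"butter", "milk"},
        {"bread", "butter"},
    ]

    n = len(transactions)
    support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n
    support_bread = sum("bread" in t for t in transactions) / n
    confidence = support_bread_butter / support_bread  # estimate of P(butter | bread)
    print(support_bread_butter, confidence)  # 0.6 and 0.75 on this toy data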
2. Sequential Patterns
Sequential patterns focus on detecting trends that follow a sequence over time. These patterns are
useful in customer behavior analysis and medical diagnosis. For instance, an e-commerce
platform may observe that customers who buy a phone later purchase accessories like chargers
and earphones. GSP and SPADE algorithms are commonly used for sequential pattern mining.
Example: In e-commerce, data analysis shows that customers who buy a smartphone often
purchase a phone case within a week, followed by wireless earphones within a month.
Application: Customer Purchase Behavior Analysis.
3. Classification Patterns
Classification patterns categorize data into predefined labels, making them useful for fraud
detection, spam filtering, and medical diagnosis. For example, emails can be classified as spam
or not spam using classification algorithms like Decision Trees, Naïve Bayes, and Support
Vector Machines (SVM).
Example: An email filtering system classifies emails into spam or non-spam based on keywords,
sender details, and email history.
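A minimal scikit-learn sketch of classification on a tiny, made-up email corpus using Naïve Bayes; the texts and labels are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical labeled corpus (1 = spam, 0 = not spam).
    emails = ["win a free prize now", "meeting at noon tomorrow",
              "free lottery winner claim prize", "project report attached"]
    labels = [1, 0, 1, 0]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    model = MultinomialNB()
    model.fit(X, labels)

    # Classify a new email based on its word counts.
    new = vectorizer.transform(["claim your free prize"])
    print(model.predict(new))  # likely [1] (spam) on this toy data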
4. Clustering Patterns
Clustering patterns group similar data points into clusters without predefined categories. This is
useful in customer segmentation, image recognition, and social network analysis. For example,
businesses can identify high-value customers and budget-conscious shoppers using clustering
techniques such as K-Means, Hierarchical Clustering, and DBSCAN.
Example: A bank segments its customers into high-income investors, middle-class savers, and
low-income borrowers based on their financial transactions and spending habits.
5. Outlier Patterns
Outlier patterns detect anomalies or unusual data points. These are essential in fraud detection
and cybersecurity. For example, an unusually large transaction on a credit card may indicate
fraud. Isolation Forest and One-Class SVM are popular outlier detection techniques.
Example: A credit card company detects an unusually high-value transaction from a customer
who typically makes small purchases, potentially indicating fraud.
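A minimal scikit-learn sketch of outlier detection with Isolation Forest on made-up transaction amounts:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical transaction amounts: mostly small purchases plus one extreme value.
    amounts = np.array([[25], [30], [22], [27], [31], [26], [5000]])

    model = IsolationForest(contamination=0.15, random_state=0)
    labels = model.fit_predict(amounts)   # -1 marks points flagged as outliers
    print(labels)  # the 5000 transaction is expected to be flagged as -1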
6. Trend Patterns
Trend patterns analyze data over time to identify changes and shifts. These patterns are
commonly used in stock market prediction, climate change studies, and social media trend
analysis. Time-series analysis and regression models help track patterns such as increasing
smartphone adoption or declining traditional newspaper readership.
Example: Data analysis in social media trends shows that short-video content (e.g., TikTok,
Instagram Reels) has gained massive popularity over the past five years, replacing traditional
blogging.
The Apriori Algorithm:
The Apriori algorithm follows a systematic process to discover frequent itemsets and generate
association rules:
The algorithm starts by scanning the dataset to count individual items (1-itemsets) and their
frequencies. A minimum support threshold is set to determine whether an item is considered
frequent. If an item meets the threshold, it is included in the frequent 1-itemsets.
Once frequent 1-itemsets are identified, the algorithm generates candidate 2-itemsets by
combining frequent items. This process continues iteratively, forming larger k-itemsets until no
more frequent itemsets can be found.
The algorithm applies the Apriori Property to prune itemsets that do not meet the minimum
support. If an itemset is infrequent, all of its supersets are eliminated from consideration,
significantly reducing the number of combinations that need to be evaluated.
Transaction ID    Items
2                 Bread, Butter
3                 Bread, Milk
4                 Butter, Milk
Count the frequency of each item and apply the minimum support threshold (e.g., 50% = at least
3 out of 5 transactions).
Form pairs like {Bread, Butter}, {Bread, Milk}, {Butter, Milk} and calculate their support.
Combine frequent 2-itemsets to create triplets like {Bread, Butter, Milk} and check their support.
Example rules:
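The notes do not list the rules; below is a minimal Apriori-style sketch in Python over a hypothetical five-transaction dataset (consistent with the "3 out of 5" threshold above, but not the exact table from the notes), showing how frequent itemsets and rule confidence are computed:

    from itertools import combinations

    # Hypothetical five-transaction dataset.
    transactions = [
        {"Bread", "Butter", "Milk"},
        {"Bread", "Butter"},
        {"Bread", "Milk"},
        {"Butter", "Milk"},
        {"Bread", "Butter"},
    ]
    min_support = 3  # at least 3 of 5 transactions

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    # Level 2: candidate pairs built only from frequent 1-itemsets (Apriori property).
    candidates = [a | b for a, b in combinations(frequent, 2)]
    frequent_pairs = [c for c in candidates if support(c) >= min_support]

    # Rule X -> Y with confidence = support(X union Y) / support(X).
    for pair in frequent_pairs:
        for x in pair:
            conf = support(pair) / support(frozenset([x]))
            print({x}, "->", set(pair - {x}), "confidence =", round(conf, 2))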
Each node in the network follows Bayes' theorem, which states:
P(A | B) = P(B | A) P(A) / P(B)
where P(A | B) is the probability of event A occurring given that event B has occurred.
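A minimal numeric sketch of Bayes' theorem in Python, with made-up probabilities for a disease/test example:

    # Hypothetical prior and likelihoods for a disease (A) and a positive test (B).
    p_a = 0.01              # P(A): prior probability of the disease
    p_b_given_a = 0.95      # P(B | A): test is positive given the disease
    p_b_given_not_a = 0.05  # P(B | not A): false positive rate

    # P(B) by total probability, then P(A | B) by Bayes' theorem.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))  # about 0.161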
Bayesian Belief Networks are extensively used in classification problems, where the goal is
to categorize an instance into predefined classes based on observed attributes. Their key roles
include:
Handling Uncertainty:
BBNs explicitly model dependencies between features, unlike Naïve Bayes, which assumes
independence.
Given observed evidence (e.g., symptoms in medical diagnosis), BBNs compute the posterior
probability of each class and classify accordingly.
This makes them useful in medical diagnosis, spam filtering, and fraud detection.
Incremental Learning:
BBNs can update their probabilities dynamically as new data arrives, making them suitable
for real-time applications.
Advantages of SVMs:
High-Dimensional Data Handling: SVMs are particularly effective in scenarios where the
number of dimensions exceeds the number of samples, and they perform well when the number
of features is large.
Robustness to Overfitting: By focusing on maximizing the margin between classes, SVMs
are less prone to overfitting, especially in high-dimensional spaces.
Kernel Trick for Non-Linear Data: SVMs can efficiently perform a non-linear classification
using what is called the kernel trick, implicitly mapping their inputs into high-dimensional
feature spaces.
Limitations:
Computational Complexity: The training time of SVMs can be high, especially for large
datasets, making them less suitable for large-scale applications.
Choice of Kernel: Selecting an appropriate kernel function is crucial, as an improper
choice can lead to reduced model performance.
Interpretability: SVMs are often considered less interpretable compared to other models
like decision trees.
Support Vector Machines are powerful tools for classification tasks, offering advantages in
handling high-dimensional data, robustness to overfitting, and flexibility through kernel
functions. However, considerations regarding computational resources and kernel selection are
essential for optimal performance.
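A minimal scikit-learn sketch (illustrative, not from the notes) of a non-linear SVM with the RBF kernel on synthetic data:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Hypothetical non-linearly separable data (two interleaving half-moons).
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel lets the SVM separate classes that are not linearly separable
    # (the "kernel trick"); C controls the margin/overfitting trade-off.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data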
K-Means relies on computing cluster centroids, while K-Medoids selects actual data points
as representative cluster centers. Their performance depends on factors like data distribution,
presence of outliers, cluster shapes, and dataset size.
Effectiveness in Different Data Distributions
When dealing with Gaussian (normally distributed) data, K-Means performs efficiently as it
assumes that clusters are spherical and evenly distributed. Since it minimizes the sum of squared
distances between points and centroids, it effectively separates well-defined clusters. In contrast,
K-Medoids does not provide a significant advantage in this case and tends to be computationally
slower. For normal distributions, K-Means is the preferred choice due to its speed and accuracy.
However, in datasets with unequal cluster sizes or varying densities, K-Means struggles as it
assigns equal importance to all data points, often leading to centroid shifts towards denser
regions. K-Medoids, being more robust, performs better in this scenario by selecting medoids
that represent actual data points rather than relying on averages. For datasets with uneven
clusters, K-Medoids provides better stability and accuracy.
One major limitation of K-Means is its sensitivity to outliers. Since it computes centroids based
on mean values, a few extreme points can significantly distort the cluster assignments. K-
Medoids, however, selects medoids from existing data points, making it resistant to noise and
outliers. For datasets with potential outliers, such as fraud detection in banking transactions, K-
Medoids is the more effective clustering method.
K-Means is ideal for Gaussian-distributed, large-scale datasets due to its speed and efficiency,
while K-Medoids excels in handling outliers and uneven cluster sizes. For complex, non-
spherical data distributions, neither method performs well, and density-based approaches such
as DBSCAN are generally preferred.
Example of K-Means vs. K-Medoids in Clustering
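The worked example itself is not included in the notes; below is a minimal one-dimensional Python illustration (with made-up values) of why a mean-based centre is pulled by an outlier while a medoid is not:

    import numpy as np

    # Hypothetical 1-D cluster with one extreme outlier.
    points = np.array([10.0, 11.0, 12.0, 13.0, 200.0])

    # K-Means style centre: the mean is dragged towards the outlier.
    centroid = points.mean()                       # 49.2

    # K-Medoids style centre: the actual data point minimising total distance to the others.
    distances = np.abs(points[:, None] - points[None, :]).sum(axis=1)
    medoid = points[distances.argmin()]            # 12.0

    print(centroid, medoid)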