DWDM Notes
Three-Tier Data Warehouse Architecture:
Each tier plays a crucial role in managing, processing, and analyzing large volumes of data
efficiently.
The bottom tier is the database server, which is almost always a relational database system. It
stores and manages data extracted from operational databases and external sources. Back-end
tools and utilities handle data extraction, cleaning, transformation, loading, and refreshing to
ensure that the warehouse remains updated.
Data extraction is facilitated using application program interfaces (APIs) known as gateways,
such as ODBC (Open Database Connectivity), OLE-DB, and JDBC (Java Database Connectivity).
This tier also includes a metadata repository, which stores detailed information about the data
warehouse structure, sources, and transformations.
The middle tier consists of an Online Analytical Processing (OLAP) server that supports
complex analytical operations on multidimensional data. This tier can be implemented using
either a Relational OLAP (ROLAP) model or a Multidimensional OLAP (MOLAP) model.
The top tier is the front-end client layer, which provides tools for user interaction, data
exploration, and decision-making. This tier includes query and reporting tools, analysis tools,
and data mining applications.
Users can generate reports, analyze trends, and perform predictive modeling to gain insights
from the data. The integration of analytical and visualization tools at this level makes it easier for
businesses and analysts to make data-driven decisions.
The three-tier architecture of a data warehouse ensures efficient data management, processing
and analysis. The bottom tier focuses on data storage and integration, the middle tier handles
analytical processing, and the top tier provides user-friendly tools for decision-making. This
structured approach enhances data accessibility, improves analytical performance, and supports
organizations in making informed business decisions.
OLAP operations:
There are five basic analytical operations that can be performed on an OLAP cube:
Drill down
Roll up
Slice
Dice
Pivot
Drill down: In the drill-down operation, less detailed data is converted into more detailed data.
It can be done by:
Moving down in the concept hierarchy
Adding a new dimension
In the cube given in the overview section, the drill-down operation is performed by moving down in
the concept hierarchy of the Time dimension (Quarter -> Month).
Roll up: It is just the opposite of the drill-down operation. It performs aggregation on the OLAP
cube. It can be done by:
Climbing up in the concept hierarchy
Reducing the number of dimensions
In the cube given in the overview section, the roll-up operation is performed by climbing up in
the concept hierarchy of the Location dimension (City -> Country).
Dice: It selects a sub-cube from the OLAP cube by selecting two or more dimensions and
specifying criteria on them. In the cube given in the overview section, a sub-cube is selected
by applying criteria on multiple dimensions.
Slice: It selects a single dimension from the OLAP cube, which results in the creation of a new
sub-cube. In the cube given in the overview section, Slice is performed on the dimension Time =
“Q1”.
Pivot: It is also known as rotation operation as it rotates the current view to get a new view of the
representation. In the sub-cube obtained after the slice operation, performing pivot operation
gives a new view of it.
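A minimal Python (pandas) sketch of these operations on a small hypothetical sales table; the column names and values are illustrative assumptions, not the cube from the overview section:

    import pandas as pd

    # Hypothetical fact data: one row per (time, location, item) with a sales measure.
    sales = pd.DataFrame({
        "quarter":    ["Q1", "Q1", "Q2", "Q2"],
        "month":      ["Jan", "Feb", "Apr", "May"],
        "city":       ["Delhi", "Mumbai", "Delhi", "Mumbai"],
        "country":    ["India", "India", "India", "India"],
        "item":       ["Phone", "Laptop", "Phone", "Laptop"],
        "units_sold": [100, 80, 120, 90],
    })

    # Roll-up: climb the Location hierarchy (city -> country).
    rollup = sales.groupby(["quarter", "country"])["units_sold"].sum()

    # Drill-down: step down the Time hierarchy (quarter -> month).
    drilldown = sales.groupby(["month", "city"])["units_sold"].sum()

    # Slice: fix a single dimension value (Time = "Q1").
    slice_q1 = sales[sales["quarter"] == "Q1"]

    # Dice: select a sub-cube with criteria on two or more dimensions.
    dice = sales[sales["quarter"].isin(["Q1", "Q2"]) & sales["city"].isin(["Delhi"])]

    # Pivot: rotate the view (cities as rows, quarters as columns).
    pivot = sales.pivot_table(index="city", columns="quarter",
                              values="units_sold", aggfunc="sum")

A real OLAP server executes these against the cube itself, but the groupby/pivot_table pattern captures the same ideas.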
Schema in a data warehouse:
Schema is a logical description of the entire database. It includes the name and description of records of
all record types, including all associated data items and aggregates. Much like a database, a data
warehouse also needs to maintain a schema. A database uses the relational model, while a data warehouse
uses the Star, Snowflake, or Fact Constellation schema.
Star Schema:
There is a fact table at the center. It contains the keys to each of the four dimension tables.
The fact table also contains the attributes, namely dollars sold and units sold.
Snowflake Schema:
Some dimension tables in the Snowflake schema are normalized.
The normalization splits up the data into additional tables.
Unlike the Star schema, the dimension tables in a Snowflake schema are normalized. For example,
the item dimension table of the Star schema is normalized and split into two dimension tables, namely
the item and supplier tables.
Now the item dimension table contains the attributes item_key, item_name, type, brand, and
supplier_key.
The supplier key is linked to the supplier dimension table. The supplier dimension table contains
the attributes supplier_key and supplier_type.
Fact Constellation Schema:
A fact constellation has multiple fact tables. It is also known as a galaxy schema.
The following diagram shows two fact tables, namely sales and shipping.
The sales fact table is the same as that in the star schema.
The shipping fact table has five dimensions, namely item_key, time_key, shipper_key,
from_location, and to_location.
The shipping fact table also contains two measures, namely dollars sold and units sold.
It is also possible to share dimension tables between fact tables. For example, the time, item, and
location dimension tables are shared between the sales and shipping fact tables.
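A minimal Python (pandas) sketch, with hypothetical tables and columns, of how a fact table joins to its dimension tables in a star-style schema:

    import pandas as pd

    # Hypothetical dimension tables.
    item_dim = pd.DataFrame({"item_key": [1, 2], "item_name": ["Phone", "Laptop"], "brand": ["A", "B"]})
    time_dim = pd.DataFrame({"time_key": [10, 11], "quarter": ["Q1", "Q2"]})

    # Hypothetical fact table: foreign keys to each dimension plus the measures.
    sales_fact = pd.DataFrame({
        "item_key":     [1, 2, 1],
        "time_key":     [10, 10, 11],
        "units_sold":   [100, 40, 120],
        "dollars_sold": [50000, 60000, 60000],
    })

    # A star-schema query joins the fact table to its dimension tables
    # and then aggregates the measures.
    report = (sales_fact
              .merge(item_dim, on="item_key")
              .merge(time_dim, on="time_key")
              .groupby(["brand", "quarter"])[["units_sold", "dollars_sold"]]
              .sum())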
Full Cube, Iceberg Cube, and Closed Cube:
The Full Cube computes all possible group-by aggregates for a dataset, making it the most
exhaustive but also the most computationally expensive approach. Since it generates every
possible aggregation, including those with little or no relevance, it leads to the highest
computation cost and memory usage. This makes it the least efficient method, suitable primarily
for small datasets where a complete set of aggregations is necessary.
The Iceberg Cube, on the other hand, improves efficiency by computing only those group-by
aggregates that meet a specific threshold, such as a minimum count or sum. By pruning
unnecessary computations and ignoring low-support aggregations, it significantly reduces both
computation cost and memory requirements. This approach is particularly useful for large
datasets where only meaningful and high-support aggregations are required, making it a more
efficient alternative to the Full Cube.
The Closed Cube optimizes computation by eliminating redundant aggregations while still
maintaining completeness. Unlike the Full Cube, which stores every possible combination, the
Closed Cube avoids unnecessary computations by keeping only the essential and non-redundant
aggregates. This leads to lower memory usage and improved efficiency compared to the Full
Cube, though it is not as selective as the Iceberg Cube. The Closed Cube is best suited for
scenarios where minimizing redundancy is important while still retaining full analytical
capabilities.
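As a rough illustration (not from the notes), a full cube over d dimensions materializes all 2^d group-bys. A minimal Python sketch enumerating them for a hypothetical three-dimensional dataset:

    import itertools
    import pandas as pd

    # Hypothetical dataset with three dimensions and one measure.
    df = pd.DataFrame({
        "city":       ["Delhi", "Delhi", "Mumbai"],
        "item":       ["Phone", "Laptop", "Phone"],
        "quarter":    ["Q1", "Q1", "Q2"],
        "units_sold": [100, 40, 120],
    })

    dims = ["city", "item", "quarter"]
    full_cube = {}
    # Every subset of the dimensions defines one group-by (2^3 = 8 of them,
    # including the empty grouping, i.e. the grand total).
    for r in range(len(dims) + 1):
        for group in itertools.combinations(dims, r):
            if group:
                full_cube[group] = df.groupby(list(group))["units_sold"].sum()
            else:
                full_cube[()] = df["units_sold"].sum()

    print(len(full_cube))  # 8 group-bys for 3 dimensions

An iceberg or closed cube would retain only a subset of these cells instead of materializing all of them.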
Multiway Array Aggregation and Iceberg Cube Computation:
Handling large-scale data efficiently is a critical challenge in data warehousing and OLAP
(Online Analytical Processing). Multiway Array Aggregation (MAA) and Iceberg Cube
Computation are two important techniques designed to optimize the processing of large datasets,
particularly in data cube computations. Their effectiveness is assessed based on their ability to
reduce computation time, optimize storage, and improve query performance.
Multiway Array Aggregation (MAA) is a method used in data cube computation, particularly for
large-scale multidimensional databases. It efficiently processes data by taking advantage of
array-based storage and simultaneous aggregations across multiple dimensions.
MAA stores data in an array format rather than in relational tables, reducing memory overhead.
Instead of computing the cube level by level, MAA processes multiple dimensions
simultaneously.
Cache Optimization:
Since arrays provide better cache locality, access times are faster, improving performance.
Scalability for Large Datasets:
Works efficiently for high-dimensional data cubes by partitioning the data and computing
aggregations in an optimized manner.
MAA is most effective for dense data cubes (i.e., where most combinations of attributes have
data).
For sparse datasets, alternative storage strategies like prefix trees or hash-based approaches may
be more efficient.
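A simplified Python (NumPy) sketch of the array-based idea behind MAA, using a hypothetical dense 3-D cube; the real algorithm also partitions the array into chunks and orders the chunk scan to minimize memory, which is omitted here:

    import numpy as np

    # Hypothetical dense cube: axis 0 = city, axis 1 = item, axis 2 = quarter.
    cube = np.arange(2 * 3 * 4).reshape(2, 3, 4)

    # Aggregating over one axis yields each 2-D group-by directly from the array.
    by_item_quarter = cube.sum(axis=0)   # aggregate cities away
    by_city_quarter = cube.sum(axis=1)   # aggregate items away
    by_city_item    = cube.sum(axis=2)   # aggregate quarters away

    # Lower-level aggregates are derived from already-computed ones; this reuse
    # of intermediate results is the core of simultaneous, multiway aggregation.
    by_city = by_city_quarter.sum(axis=1)
    grand_total = by_city.sum()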
An Iceberg Cube is an optimization technique where only a subset of data cube aggregations that
meet a minimum support threshold (e.g., frequency, sum, or count) are computed and stored. It is
designed to prune irrelevant or insignificant aggregations, reducing computational cost.
Unlike a full data cube, which computes and stores all possible aggregations, Iceberg Cube only
retains the most useful subsets, drastically reducing storage.
Iceberg Cube precomputes and indexes only frequent and relevant aggregations, leading to faster
query response times.
Since Iceberg Cube prunes less significant aggregations early, it optimizes computation and
reduces processing overhead.
If the threshold is too high, important aggregations may be lost. If it's too low, efficiency gains
are reduced.
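A minimal Python sketch of the iceberg condition, keeping only group-by cells whose count meets a hypothetical threshold; real iceberg algorithms (e.g., BUC) push the threshold into the computation to prune early rather than filtering afterwards:

    import itertools
    import pandas as pd

    # Hypothetical transaction-style data.
    df = pd.DataFrame({
        "city":    ["Delhi", "Delhi", "Mumbai", "Delhi"],
        "item":    ["Phone", "Phone", "Laptop", "Laptop"],
        "quarter": ["Q1", "Q1", "Q1", "Q2"],
    })

    min_count = 2          # the iceberg condition: keep only cells with count >= 2
    dims = ["city", "item", "quarter"]
    iceberg_cells = {}
    for r in range(1, len(dims) + 1):
        for group in itertools.combinations(dims, r):
            counts = df.groupby(list(group)).size()
            kept = counts[counts >= min_count]   # prune low-support cells
            if not kept.empty:
                iceberg_cells[group] = kept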
Different Types of Data Mining Patterns:
Data mining involves identifying patterns in large datasets to uncover useful insights. The major
types of data mining patterns include association patterns, sequential patterns, classification
patterns, clustering patterns, outlier patterns, and trend patterns.
1. Association Patterns
Association patterns identify relationships between items in a dataset. They are widely used in
market basket analysis, where businesses determine frequently bought item combinations. For
example, customers who buy bread often purchase butter as well. Algorithms like Apriori and
FP-Growth are used to discover such patterns.
Example: In a supermarket, analysis of customer purchases reveals that 80% of customers who
buy diapers also buy baby wipes. This helps businesses in product placement and promotions.
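A minimal Python sketch (with made-up transactions) of the support and confidence measures behind such rules:

    # Hypothetical market-basket data: support and confidence for one rule.
    transactions = [
        {"bread", "butter"},
        {"bread", "milk"},
        {"bread", "butter", "milk"},
        {"butter", "milk"},
        {"bread", "butter"},
    ]

    n = len(transactions)
    support_bread_butter = sum({"bread", "butter"} <= t for t in transactions) / n
    support_bread = sum("bread" in t for t in transactions) / n
    confidence = support_bread_butter / support_bread  # estimate of P(butter | bread)
    print(support_bread_butter, confidence)  # 0.6 and 0.75 on this toy data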
2. Sequential Patterns
Sequential patterns focus on detecting trends that follow a sequence over time. These patterns are
useful in customer behavior analysis and medical diagnosis. For instance, an e-commerce
platform may observe that customers who buy a phone later purchase accessories like chargers
and earphones. GSP and SPADE algorithms are commonly used for sequential pattern mining.
Example: In e-commerce, data analysis shows that customers who buy a smartphone often
purchase a phone case within a week, followed by wireless earphones within a month.
Application: Customer Purchase Behavior Analysis.
3. Classification Patterns
Classification patterns categorize data into predefined labels, making them useful for fraud
detection, spam filtering, and medical diagnosis. For example, emails can be classified as spam
or not spam using classification algorithms like Decision Trees, Naïve Bayes, and Support
Vector Machines (SVM).
Example: An email filtering system classifies emails into spam or non-spam based on keywords,
sender details, and email history.
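A minimal scikit-learn sketch of classification on a tiny, made-up email corpus using Naïve Bayes; the texts and labels are illustrative assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Hypothetical labeled corpus (1 = spam, 0 = not spam).
    emails = ["win a free prize now", "meeting at noon tomorrow",
              "free lottery winner claim prize", "project report attached"]
    labels = [1, 0, 1, 0]

    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(emails)

    model = MultinomialNB()
    model.fit(X, labels)

    # Classify a new email based on its word counts.
    new = vectorizer.transform(["claim your free prize"])
    print(model.predict(new))  # likely [1] (spam) on this toy data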
4. Clustering Patterns
Clustering patterns group similar data points into clusters without predefined categories. This is
useful in customer segmentation, image recognition, and social network analysis. For example,
businesses can identify high-value customers and budget-conscious shoppers using clustering
techniques such as K-Means, Hierarchical Clustering, and DBSCAN.
Example: A bank segments its customers into high-income investors, middle-class savers, and
low-income borrowers based on their financial transactions and spending habits.
5. Outlier Patterns
Outlier patterns detect anomalies or unusual data points. These are essential in fraud detection
and cybersecurity. For example, an unusually large transaction on a credit card may indicate
fraud. Isolation Forest and One-Class SVM are popular outlier detection techniques.
Example: A credit card company detects an unusually high-value transaction from a customer
who typically makes small purchases, potentially indicating fraud.
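A minimal scikit-learn sketch of outlier detection with Isolation Forest on made-up transaction amounts:

    import numpy as np
    from sklearn.ensemble import IsolationForest

    # Hypothetical transaction amounts: mostly small purchases plus one extreme value.
    amounts = np.array([[25], [30], [22], [27], [31], [26], [5000]])

    model = IsolationForest(contamination=0.15, random_state=0)
    labels = model.fit_predict(amounts)   # -1 marks points flagged as outliers
    print(labels)  # the 5000 transaction is expected to be flagged as -1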
6. Trend Patterns
Trend patterns analyze data over time to identify changes and shifts. These patterns are
commonly used in stock market prediction, climate change studies, and social media trend
analysis. Time-series analysis and regression models help track patterns such as increasing
smartphone adoption or declining traditional newspaper readership.
Example: Data analysis in social media trends shows that short-video content (e.g., TikTok,
Instagram Reels) has gained massive popularity over the past five years, replacing traditional
blogging.
The Apriori Algorithm:
The Apriori algorithm follows a systematic process to discover frequent itemsets and generate
association rules:
The algorithm starts by scanning the dataset to count individual items (1-itemsets) and their
frequencies. A minimum support threshold is set to determine whether an item is considered
frequent. If an item meets the threshold, it is included in the frequent 1-itemsets.
Once frequent 1-itemsets are identified, the algorithm generates candidate 2-itemsets by
combining frequent items. This process continues iteratively, forming larger k-itemsets until no
more frequent itemsets can be found.
The algorithm applies the Apriori Property to prune itemsets that do not meet the minimum
support. If an itemset is infrequent, all of its supersets are eliminated from consideration,
significantly reducing the number of combinations that need to be evaluated.
Transaction ID    Items
2                 Bread, Butter
3                 Bread, Milk
4                 Butter, Milk
Count the frequency of each item and apply the minimum support threshold (e.g., 50% = at least
3 out of 5 transactions).
Form pairs like {Bread, Butter}, {Bread, Milk}, {Butter, Milk} and calculate their support.
Combine frequent 2-itemsets to create triplets like {Bread, Butter, Milk} and check their support.
Example rules:
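The notes do not list the rules; below is a minimal Apriori-style sketch in Python over a hypothetical five-transaction dataset (consistent with the "3 out of 5" threshold above, but not the exact table from the notes), showing how frequent itemsets and rule confidence are computed:

    from itertools import combinations

    # Hypothetical five-transaction dataset.
    transactions = [
        {"Bread", "Butter", "Milk"},
        {"Bread", "Butter"},
        {"Bread", "Milk"},
        {"Butter", "Milk"},
        {"Bread", "Butter"},
    ]
    min_support = 3  # at least 3 of 5 transactions

    def support(itemset):
        return sum(itemset <= t for t in transactions)

    # Level 1: frequent single items.
    items = {i for t in transactions for i in t}
    frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

    # Level 2: candidate pairs built only from frequent 1-itemsets (Apriori property).
    candidates = [a | b for a, b in combinations(frequent, 2)]
    frequent_pairs = [c for c in candidates if support(c) >= min_support]

    # Rule X -> Y with confidence = support(X union Y) / support(X).
    for pair in frequent_pairs:
        for x in pair:
            conf = support(pair) / support(frozenset([x]))
            print({x}, "->", set(pair - {x}), "confidence =", round(conf, 2))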
Each node in the network follows Bayes' theorem, which states:
P(A | B) = P(B | A) P(A) / P(B)
where P(A | B) is the probability of event A occurring given that event B has occurred.
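A minimal numeric sketch of Bayes' theorem in Python, with made-up probabilities for a disease/test example:

    # Hypothetical prior and likelihoods for a disease (A) and a positive test (B).
    p_a = 0.01              # P(A): prior probability of the disease
    p_b_given_a = 0.95      # P(B | A): test is positive given the disease
    p_b_given_not_a = 0.05  # P(B | not A): false positive rate

    # P(B) by total probability, then P(A | B) by Bayes' theorem.
    p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)
    p_a_given_b = p_b_given_a * p_a / p_b
    print(round(p_a_given_b, 3))  # about 0.161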
Bayesian Belief Networks are extensively used in classification problems, where the goal is
to categorize an instance into predefined classes based on observed attributes. Their key roles
include:
Handling Uncertainty:
BBNs explicitly model dependencies between features, unlike Naïve Bayes, which assumes
independence.
Given observed evidence (e.g., symptoms in medical diagnosis), BBNs compute the posterior
probability of each class and classify accordingly.
This makes them useful in medical diagnosis, spam filtering, and fraud detection.
Incremental Learning:
BBNs can update their probabilities dynamically as new data arrives, making them suitable
for real-time applications.
Advantages of SVMs:
High-Dimensional Data Handling: SVMs are particularly effective in scenarios where the
number of dimensions exceeds the number of samples, and they perform well when the number
of features is large.
Robustness to Overfitting: By focusing on maximizing the margin between classes, SVMs
are less prone to overfitting, especially in high-dimensional spaces.
Kernel Trick for Non-Linear Data: SVMs can efficiently perform a non-linear classification
using what is called the kernel trick, implicitly mapping their inputs into high-dimensional
feature spaces.
Limitations:
Computational Complexity: The training time of SVMs can be high, especially for large
datasets, making them less suitable for large-scale applications.
Choice of Kernel: Selecting an appropriate kernel function is crucial, as an improper
choice can lead to reduced model performance.
Interpretability: SVMs are often considered less interpretable compared to other models
like decision trees.
Support Vector Machines are powerful tools for classification tasks, offering advantages in
handling high-dimensional data, robustness to overfitting, and flexibility through kernel
functions. However, considerations regarding computational resources and kernel selection are
essential for optimal performance.
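A minimal scikit-learn sketch (illustrative, not from the notes) of a non-linear SVM with the RBF kernel on synthetic data:

    from sklearn.datasets import make_moons
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    # Hypothetical non-linearly separable data (two interleaving half-moons).
    X, y = make_moons(n_samples=200, noise=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The RBF kernel lets the SVM separate classes that are not linearly separable
    # (the "kernel trick"); C controls the margin/overfitting trade-off.
    clf = SVC(kernel="rbf", C=1.0, gamma="scale")
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out data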
K-Means relies on computing cluster centroids, while K-Medoids selects actual data points
as representative cluster centers. Their performance depends on factors like data distribution,
presence of outliers, cluster shapes, and dataset size.
Effectiveness in Different Data Distributions
When dealing with Gaussian (normally distributed) data, K-Means performs efficiently as it
assumes that clusters are spherical and evenly distributed. Since it minimizes the sum of squared
distances between points and centroids, it effectively separates well-defined clusters. In contrast,
K-Medoids does not provide a significant advantage in this case and tends to be computationally
slower. For normal distributions, K-Means is the preferred choice due to its speed and accuracy.
However, in datasets with unequal cluster sizes or varying densities, K-Means struggles as it
assigns equal importance to all data points, often leading to centroid shifts towards denser
regions. K-Medoids, being more robust, performs better in this scenario by selecting medoids
that represent actual data points rather than relying on averages. For datasets with uneven
clusters, K-Medoids provides better stability and accuracy.
One major limitation of K-Means is its sensitivity to outliers. Since it computes centroids based
on mean values, a few extreme points can significantly distort the cluster assignments. K-
Medoids, however, selects medoids from existing data points, making it resistant to noise and
outliers. For datasets with potential outliers, such as fraud detection in banking transactions, K-
Medoids is the more effective clustering method.
K-Means is ideal for Gaussian-distributed, large-scale datasets due to its speed and efficiency,
while K-Medoids excels in handling outliers and uneven cluster sizes. For complex, non-
spherical data distributions, neither method performs well, and density-based approaches such
as DBSCAN are generally preferred.
Example of K-Means vs. K-Medoids in Clustering
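The worked example itself is not included in the notes; below is a minimal one-dimensional Python illustration (with made-up values) of why a mean-based centre is pulled by an outlier while a medoid is not:

    import numpy as np

    # Hypothetical 1-D cluster with one extreme outlier.
    points = np.array([10.0, 11.0, 12.0, 13.0, 200.0])

    # K-Means style centre: the mean is dragged towards the outlier.
    centroid = points.mean()                       # 49.2

    # K-Medoids style centre: the actual data point minimising total distance to the others.
    distances = np.abs(points[:, None] - points[None, :]).sum(axis=1)
    medoid = points[distances.argmin()]            # 12.0

    print(centroid, medoid)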