Data Mining Clustering Questions
K-means
o Runtime complexity: O(nkT) (n: data points, k: clusters, T: iterations)
o Noise and outliers: Sensitive to noise and outliers. Can be improved with techniques like k-medoids.
o Data shapes: Assumes spherical clusters. Sensitive to initial centroid placement.
Hierarchical Clustering
o Runtime complexity: O(n^2 log n) (agglomerative) or O(n log n) (divisive)
o Noise and outliers: Can handle outliers depending on the linkage method (e.g., Ward's linkage is less sensitive).
o Data shapes: Works for various data shapes, but interpretation of the dendrogram can be challenging.
Explanation:
Runtime Complexity:
o K-means: Linear in the number of data points, clusters, and iterations.
o Hierarchical Clustering: Roughly quadratic in the number of data points for
agglomerative methods (around O(n^2 log n)); efficient divisive variants can be
closer to O(n log n).
o DBSCAN: Around O(n log n) with a spatial index in the best case, quadratic in the worst case.
o Association Analysis: Exponential in the maximum itemset size and
linear in the number of data points and frequent itemsets.
Noise & Outlier Handling:
o K-means: Sensitive to noise and outliers because it minimizes
distances to centroids.
o Hierarchical Clustering: Can be more resilient depending on the linkage
method. Ward's linkage minimizes variance within clusters, making it
less sensitive to outliers.
o DBSCAN: Robust to noise and outliers by focusing on data density.
o Association Analysis: Not directly designed for outlier handling.
Data Shape Handling:
o K-means: Assumes spherical clusters. Sensitive to initial centroid
placement. Can struggle with elongated or irregular shapes.
o Hierarchical Clustering: Can handle various data shapes, but
interpretation of the hierarchical structure (dendrogram) can be
challenging.
o DBSCAN: Works well for clusters of arbitrary shape. Struggles with clusters of
widely varying density and with high-dimensional data, and may miss clusters in
low-density regions.
o Association Analysis: Works for categorical data. Doesn't directly group
data points, but frequent itemsets can reveal clusters indirectly.
Data size and complexity: K-means is efficient thanks to its linear complexity, and
DBSCAN with a spatial index also scales well to large datasets. Hierarchical
clustering can become computationally expensive as the number of points grows.
Noise and outliers: DBSCAN is a good choice if your data is noisy or has
outliers. Hierarchical clustering can be helpful with carefully chosen linkage
methods.
Data shape: K-means works best with spherical clusters, DBSCAN is flexible
for various shapes, and hierarchical clustering can handle diverse shapes but
with a more complex interpretation.
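To make these trade-offs concrete, here is a minimal sketch that runs all three algorithms on the same synthetic data. It assumes scikit-learn is available; the dataset, parameter values, and cluster counts are illustrative choices, not part of the original notes.

```python
# Hypothetical comparison sketch (assumes scikit-learn is installed).
import numpy as np
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Toy data: two spherical blobs plus two moon/S-shaped clusters.
X_blobs, _ = make_blobs(n_samples=300, centers=2, random_state=42)
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
X = np.vstack([X_blobs, X_moons + 8])  # shift the moons away from the blobs

# K-means: fast and linear, but assumes roughly spherical clusters and needs K up front.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Agglomerative clustering (Ward linkage): handles varied shapes better,
# but is more expensive for large n.
ward_labels = AgglomerativeClustering(n_clusters=4, linkage="ward").fit_predict(X)

# DBSCAN: no K required; eps and min_samples define the density threshold,
# and points labeled -1 are treated as noise.
dbscan_labels = DBSCAN(eps=0.4, min_samples=5).fit_predict(X)

print("K-means clusters:      ", np.unique(kmeans_labels))
print("Ward clusters:         ", np.unique(ward_labels))
print("DBSCAN clusters/noise: ", np.unique(dbscan_labels))
```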
Additional Considerations:
Parameter Tuning: DBSCAN requires parameter tuning for its eps (epsilon)
and minPts (minimum points) parameters. These define the density threshold
and minimum neighborhood size for a point to be considered a core point.
Experimentation might be needed to find the optimal values for your data.
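One common heuristic for choosing eps is to inspect a sorted k-distance curve, with k tied to minPts. The sketch below assumes scikit-learn; the dataset, minPts value, and the crude "knee" proxy are all illustrative assumptions rather than a prescribed procedure.

```python
# Hypothetical eps-selection sketch using the k-distance heuristic.
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=500, noise=0.06, random_state=0)
min_pts = 5  # minPts; a common rule of thumb is dimensionality + 1 or higher

# Distance to the farthest of each point's min_pts nearest neighbors
# (the point itself counts as one of them here), sorted ascending.
# The "knee" of this curve is a reasonable starting value for eps.
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)
k_distances = np.sort(distances[:, -1])
eps_guess = k_distances[int(0.95 * len(k_distances))]  # crude stand-in for the knee

labels = DBSCAN(eps=eps_guess, min_samples=min_pts).fit_predict(X)
print(f"eps={eps_guess:.3f}, "
      f"clusters={len(set(labels)) - (1 if -1 in labels else 0)}, "
      f"noise points={np.sum(labels == -1)}")
```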
Data Preprocessing: Consider preprocessing your data to reduce the impact
of noise. Techniques like normalization or outlier removal can help improve
the performance of any clustering algorithm.
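As a small illustration of the preprocessing step, here is a minimal sketch of standardizing features before clustering. It assumes scikit-learn; the toy data is made up to show why scaling matters when one feature has a much larger range than the others.

```python
# Minimal preprocessing sketch: standardize features so distance-based
# algorithms are not dominated by the feature with the largest scale.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = np.array([[1.0, 2000.0], [2.0, 2100.0], [8.0, 100.0], [9.0, 150.0]])
X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)
```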
Conclusion:
For a situation with a lot of noise, linearly separable data, and spherical and S-shaped
clusters, DBSCAN is the most suitable choice due to its robustness to noise, its
flexibility with cluster shapes, and its efficient handling of large datasets thanks to
near-linear complexity when a spatial index is used. If desired, you can compare its results with those of
hierarchical clustering using Ward's linkage to see if the interpretation of the
dendrogram provides additional insights.
Defining K Beforehand:
Yes, you need to specify the value of K beforehand in K-means. The algorithm
doesn't automatically determine the optimal number of clusters. Choosing the right K
is essential for accurate results. Here are some methods to help you decide:
Domain knowledge: If you have prior understanding of the data, you might
have an idea of the natural number of clusters.
The Elbow Method: This method plots the distortion (sum of squared
distances to centroids) against different values of K. The "elbow" in the curve
often suggests the optimal K where adding more clusters doesn't significantly
reduce distortion.
Silhouette Analysis: This method measures the silhouette coefficient, which
considers both the distance within a cluster and the distance to neighboring
clusters. Higher silhouette scores indicate better clustering.
Example:
Imagine you have a dataset of colored dots representing customer purchases (red
for electronics, blue for clothes). Domain knowledge alone already suggests K = 2
here, one cluster per product category.
1. The Elbow Method
This method involves calculating a metric called the Within-Cluster Sum of Squared
Errors (WSS) for various values of K. WSS represents the total squared distance of
all data points to their assigned cluster centroid.
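A minimal sketch of the elbow method follows. It assumes scikit-learn, whose KMeans `inertia_` attribute is exactly this WSS value; the dataset and the range of K are illustrative.

```python
# Elbow-method sketch: compute WSS (inertia) for several values of K.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    # inertia_ = sum of squared distances of points to their closest centroid (WSS)
    print(f"K={k}: WSS={km.inertia_:.1f}")
# Plot WSS against K and look for the "elbow" where the curve flattens.
```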
2. Silhouette Coefficient
The Silhouette Coefficient (S) is another approach that measures the separation
between clusters and the compactness within clusters. It ranges from -1 to 1, where:
1: Represents the best case, indicating a data point is well-placed within its
cluster and far from other clusters.
0: Suggests overlapping clusters.
-1: Indicates a data point is closer to points in another cluster, potentially
assigned to the wrong cluster.
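For each point, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to the nearest other cluster. A minimal sketch of silhouette analysis over several candidate K values, assuming scikit-learn and illustrative data:

```python
# Silhouette-analysis sketch: higher mean silhouette indicates more compact,
# better-separated clusters; values near 0 suggest overlap, and negative values
# suggest points that may be assigned to the wrong cluster.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

for k in range(2, 9):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=7).fit_predict(X)
    print(f"K={k}: mean silhouette={silhouette_score(X, labels):.3f}")
```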
Structure: A dendrogram is a tree diagram of the clustering process. Each leaf is a
single data point, each internal branch represents a merge of two clusters, and the
height of a branch corresponds to the distance at which that merge occurs.
Interpretation:
By tracing branches upwards, you can see how clusters form and merge
based on their similarity.
The lower in the dendrogram (closer to the leaves) two clusters or data points
merge, the more similar they are.
By analyzing the branch heights, you can assess the relative distances
between clusters.
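A minimal sketch of building and drawing a dendrogram with SciPy follows; Ward linkage and the toy dataset are illustrative choices.

```python
# Dendrogram sketch: linkage() builds the merge tree, dendrogram() draws it.
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=40, centers=3, random_state=1)

Z = linkage(X, method="ward")   # each row of Z records one merge and its distance
dendrogram(Z)                   # branch height = distance at which clusters merge
plt.xlabel("data points")
plt.ylabel("merge distance")
plt.show()
```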
Cutting the Dendrogram: The Art of Choosing Clusters
Unfortunately, there is no single "correct" way to cut a dendrogram; the optimal
number of clusters depends on several factors. Common approaches include cutting
at a fixed height (a distance threshold) or cutting so that a desired number of
clusters remains, as sketched below.
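A minimal sketch of both cuts using SciPy's `fcluster`; the threshold value and dataset are illustrative.

```python
# Cutting sketch: the same linkage matrix can be cut by height or by cluster count.
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=40, centers=3, random_state=1)
Z = linkage(X, method="ward")

labels_by_height = fcluster(Z, t=10.0, criterion="distance")  # cut at merge distance 10
labels_by_count = fcluster(Z, t=3, criterion="maxclust")      # keep at most 3 clusters
print(labels_by_height)
print(labels_by_count)
```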
Limitations
Subjective Dendrogram Cut: Deciding the cut point for the desired number of
clusters can be subjective.
Single Linkage: Uses the minimum distance between any two points (one from
each cluster). Sensitive to outliers and can create elongated clusters, because
merges can be driven by a single very close pair.
Complete Linkage: Uses the maximum distance between any two points (one from
each cluster). More conservative in merging and tends to produce well-separated
clusters, but might miss subtle similarities between clusters.
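To see the difference in behavior, here is a small sketch comparing single and complete linkage on the same data; the moon-shaped dataset and the two-cluster cut are illustrative assumptions.

```python
# Linkage-comparison sketch: single linkage tends to chain along elongated
# structures, while complete linkage prefers compact, well-separated groups.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=3)

single_labels = fcluster(linkage(X, method="single"), t=2, criterion="maxclust")
complete_labels = fcluster(linkage(X, method="complete"), t=2, criterion="maxclust")

# fcluster labels start at 1, so skip index 0 when counting cluster sizes.
print("single linkage cluster sizes:  ", np.bincount(single_labels)[1:])
print("complete linkage cluster sizes:", np.bincount(complete_labels)[1:])
```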
Premise: The Apriori Principle states that any subset of a frequent itemset
must also be frequent. In simpler terms, if a group of products (A, B, C) is
frequently bought together, then any smaller combination of those products
(A, B), (A, C), or (B, C) must also appear together frequently in transactions.
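As a tiny illustration of the principle, the sketch below checks that every subset of a frequent itemset is itself frequent, which is exactly the property Apriori exploits to prune candidates. The transactions and support threshold are made up for the example.

```python
# Apriori-principle sketch: every subset of a frequent itemset is also frequent.
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"},
]
min_support = 0.4  # an itemset is "frequent" if it appears in >= 40% of transactions

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

frequent = {"A", "B", "C"}            # suppose {A, B, C} turned out to be frequent
assert support(frequent) >= min_support
for size in (1, 2):
    for subset in combinations(frequent, size):
        # The Apriori principle guarantees this check never fails:
        assert support(set(subset)) >= min_support
print("all subsets of the frequent itemset are frequent, as expected")
```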
Why it Matters: Read in reverse, the principle means that if an itemset is infrequent,
every superset of it must also be infrequent, so those supersets can be discarded
without ever counting them.
Benefits: Far fewer candidate itemsets to generate and test, which makes frequent-
itemset mining feasible on large transaction databases.
Support: The fraction of all transactions that contain the itemset, e.g.
support({A, B}) = (transactions containing both A and B) / (total transactions).
Confidence: For a rule X -> Y, the fraction of transactions containing X that also
contain Y: confidence(X -> Y) = support(X ∪ Y) / support(X).
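A minimal sketch computing both metrics for a made-up rule on toy transactions; the item names, transactions, and the rule {A} -> {B} are all illustrative.

```python
# Support/confidence sketch for the rule {A} -> {B}.
transactions = [
    {"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"A", "B", "C"}, {"B", "C"},
]

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"A"}, {"B"}
rule_support = support(antecedent | consequent)   # how often A and B co-occur
confidence = rule_support / support(antecedent)   # of the transactions with A, how many also have B

print(f"support({{A, B}}) = {rule_support:.2f}")   # 3/5 = 0.60
print(f"confidence(A -> B) = {confidence:.2f}")    # 0.60 / 0.80 = 0.75
```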
These two metrics work together to assess the significance of discovered patterns:
By analyzing both support and confidence, we can identify the most interesting
itemsets for further exploration and strategic decisions. Here's how interestingness is
determined:
High Support and High Confidence: This is the ideal scenario. It suggests a
frequent co-occurrence (support) and a strong association between specific
items within those co-occurrences (confidence). These itemsets are prime
candidates for actions like targeted promotions, product placement strategies,
or recommendation systems.
High Support but Low Confidence: While many transactions might contain
the itemset together (high support), there might not be a strong association
between specific items within that group (low confidence). This pattern might
be less interesting for targeted actions but could still provide insights into
general customer buying habits.
Low Support and Low Confidence: This suggests an infrequent co-
occurrence and a weak association. These itemsets are generally not
considered very interesting for further analysis.