DW&M Unit 3 Part II

The document discusses various clustering techniques including K-means clustering, hierarchical clustering, and density-based clustering. It describes the basic concepts and algorithms of K-means clustering including choosing the number of clusters, calculating distances, and assigning points to clusters. It also discusses hierarchical clustering approaches, linkage methods, and evaluating clustering results.

Unit III

3.2 Cluster Analysis: Basic Concepts, Categorization of Major Clustering Methods, Partitioning Methods: The Basic K-means Algorithm, Strengths and Weaknesses of K-means Algorithm, Hierarchical Methods: Agglomerative versus Divisive Hierarchical Clustering, Density-Based Methods: DBSCAN - The DBSCAN Algorithm, Strengths and Weaknesses, Evaluation of Clustering, Outlier Analysis.
Unit III

3.3 Association Rule Mining: Market Basket Analysis, Frequent Itemsets, Closed Itemsets, and Association Rules, Apriori Algorithm: Apriori Principle, Apriori Algorithm, Computational Complexity, Rule Generation, Confidence of association rule.
Clustering
• Clustering is a technique that groups similar objects such that:
• The objects in the same group are more similar to each other than to objects in other groups.
• A group of similar objects is called a Cluster.
Clustering methods

K means algorithm
• K-means is an iterative algorithm that tries to partition the dataset into K predefined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group.
K means algorithm
• K-Means clustering is an unsupervised
iterative clustering technique.
• It partitions the given data set into k
predefined distinct clusters.
• A cluster is defined as a collection of data
points exhibiting certain similarities.
K means algorithm
K-Means Clustering Algorithm steps
Step-01:
Choose the number of clusters K.

Step-02:
Randomly select any K data points as the initial cluster centers.

Step-03:
Calculate the distance between each data point and each cluster center.
Steps contd..
Step-04:
Assign each data point to the cluster whose center is nearest to it.

Step-05:
Re-compute the center of each newly formed cluster (e.g. as the mean of its points).

Step-06:
Keep repeating Step-03 to Step-05 until the centers stop changing (convergence) or the maximum number of iterations is reached (a sketch follows below).
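A minimal NumPy sketch of these steps (the names k_means, points and centers are illustrative, not from the slides; empty clusters are not handled):

import numpy as np

def k_means(points, centers, max_iter=100):
    """Repeat Steps 03-05 until the centers stop changing or max_iter is reached."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Step-03: Euclidean distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to its nearest center.
        labels = dists.argmin(axis=1)
        # Step-05: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step-06: stop when the centers no longer change (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers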
Use K-Means Algorithm to create two clusters
• Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
• Distance between A(2, 2) and C1(2, 2): ρ(A, C1) = 0
• Distance between C(1, 1) and C1(2, 2): ρ(C, C1) = 1.41

Given Points    Distance from center      Distance from center      Point belongs
                (2, 2) of Cluster-01      (1, 1) of Cluster-02      to Cluster

A(2, 2)         0                         1.41                      C1
B(3, 2)         1                         2.24                      C1
C(1, 1)         1.41                      0                         C2
D(3, 1)         1.41                      2                         C1
E(1.5, 0.5)     1.58                      0.71                      C2
Iteration 1
• For Cluster-01 (A, B, D): Center of Cluster-01 = (2.67, 1.67)
• For Cluster-02 (C, E): Center of Cluster-02 = (1.25, 0.75)
• We continue with iteration-02, iteration-03 and so on until the centers do not change anymore.
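As a check, running the k_means sketch above on the five example points with the initial centers A(2, 2) and C(1, 1) reproduces the iteration-01 centers:

pts = [(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)]   # A, B, C, D, E
labels, centers = k_means(pts, [(2, 2), (1, 1)])
print(labels)    # [0 0 1 0 1] -> A, B, D in Cluster-01; C, E in Cluster-02
print(centers)   # approx. [[2.67 1.67] [1.25 0.75]], matching iteration-01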
Elbow method to find the optimal number of clusters (K)
• Run K-means for a range of K values and plot the within-cluster sum of squares (WCSS) against K.
• The "elbow" of the curve, where the decrease in WCSS slows down sharply, suggests a suitable value of K.
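A hedged sketch of the elbow method, assuming scikit-learn and matplotlib are available (the function name elbow_plot is illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=8):
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)          # within-cluster sum of squares
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS (inertia)")
    plt.show()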
Hierarchical clustering
• In this algorithm, we develop the hierarchy of
clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.

Hierarchical clustering is divided into two types:

• Agglomerative Hierarchical Clustering.


• Divisive Hierarchical Clustering
• Agglomerative: Agglomerative clustering is a bottom-up approach, in which the algorithm starts by treating every data point as its own cluster and keeps merging the closest clusters until only one cluster is left.
• Divisive: The divisive algorithm is the reverse of the agglomerative one; it is a top-down approach that starts with all data points in a single cluster and recursively splits it.
EXAMPLE (Proximity matrix)
• The smallest entry in the proximity matrix is 1, between E and A.
• Hence E and A are merged first into the cluster (E, A).
EXAMPLE
• The distances from the new cluster (E, A) to the remaining points are recomputed, e.g. dist{(E, A), D} = min[dist(E, D), dist(A, D)] = 3.
• In the updated proximity matrix the minimum distance is again 1, so the next closest pair is merged.
EXAMPLE
• Merging continues in the same way until all points form a single cluster; the sequence of merges is drawn as a dendrogram.
• Now we can set a threshold distance and draw a horizontal line; suppose we set this threshold to 12 and draw a horizontal line.
• The number of clusters is the number of vertical lines in the dendrogram that are intersected by the line drawn at the threshold.
• The longer a vertical line in the dendrogram, the greater the distance between the clusters it joins.
Linkage methods
• The way the distance between two clusters is measured is crucial for hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering.
• These measures are called Linkage methods (a sketch follows below).
• Single Linkage: the distance between the two closest points of the two clusters.
• Complete Linkage: the distance between the two farthest points of the two clusters.
• Average Linkage: the average of all pairwise distances between points of the two clusters.
• Centroid Linkage: the distance between the centroids of the two clusters.
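A small sketch of agglomerative clustering with SciPy, assuming scipy and matplotlib are available; the method argument selects one of the linkage rules above, and the toy points are the ones from the K-means example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A, B, C, D, E

Z = linkage(X, method="single")      # or "complete", "average", "centroid"
dendrogram(Z)                        # tree of merges
plt.show()

# Cut the dendrogram at a chosen threshold distance to get flat clusters;
# with t=1.5 this gives two clusters: {A, B, D} and {C, E}.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)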
Density-based clustering
• Partition-based and hierarchical clustering techniques work well when clusters are compact and roughly spherical in shape.
• However, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more effective.
DBSCAN
• DBSCAN stands for Density-Based Spatial
Clustering of Applications with Noise.
DBSCAN
• It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
Two key parameters of DBSCAN
• eps: The distance that specifies the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.
• minPts: The minimum number of data points required to form a dense region (cluster); a point needs at least minPts points within eps to be a core point.
Classification of points
• Based on these two parameters, points are classified as:
• Core point: a point with at least minPts points (including itself) within distance eps.
• Border point: a point with fewer than minPts points within eps, but lying in the neighborhood of a core point.
• Outlier (noise): a point that is neither a core point nor a border point.
DBSCAN example (figure): Eps = 0.6 and MinPts = 4
DBSCAN
• Directly Density-Reachable: a point q is directly density-reachable from a core point p if q lies within eps of p.
• Density-Reachable: q is density-reachable from p if there is a chain of points from p to q in which each point is directly density-reachable from the previous one.
• Density-Connected: p and q are density-connected if both are density-reachable from some common point (see the sketch below).
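A minimal DBSCAN sketch, assuming scikit-learn is available; eps and min_samples correspond to the eps and minPts parameters above, and the toy data are made up for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1], [0.9, 1.2], [1, 0.9],
              [5, 5], [5.1, 5], [4.9, 5.1], [5, 4.9],
              [9, 1]])                        # the last point is isolated

labels = DBSCAN(eps=0.6, min_samples=4).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise/outliers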
Evaluation metrics
• Silhouette Score: The Silhouette Coefficient of a sample is calculated from the mean intra-cluster distance and the mean distance to the nearest other cluster; the score of a clustering is the mean over all samples. The Silhouette Coefficient ranges over [-1, 1].

s = (nc - ic) / max(ic, nc)

where,
ic = mean of the intra-cluster distance
nc = mean of the nearest-cluster distance
Calinski-Harabasz Index
• This index is the ratio between two quantities: between-cluster dispersion and within-cluster dispersion. Higher values indicate better-separated clusters.

CH(k) = [B(k) / W(k)] x [(n - k) / (k - 1)]

where,
n = number of data points
k = number of clusters
W(k) = within-cluster variation
B(k) = between-cluster variation
Davies-Bouldin index
• The Davies-Bouldin index is based on the ratio of within-cluster distances to between-cluster distances; lower values indicate better clustering.
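A hedged sketch computing the three measures with scikit-learn (assumed available) on labels produced by any clustering algorithm; the toy data are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # closer to +1 is better
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better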
Market Basket Analysis
• Market Basket Analysis is one of the key
techniques used by large retailers to uncover
associations between items.
• It works by looking for combinations of items
that occur together frequently in transactions.
Association Rule Mining
• Association Rule Mining is used when you want to find associations between different objects in a set, or frequent patterns in a transaction database, relational database, or any other information repository.
Apriori
• The Apriori algorithm is based on the Apriori principle: any subset of a frequent itemset must also be frequent (equivalently, any superset of an infrequent itemset is infrequent).
• Given (example transaction table shown in the original slides).
Apriori
• Support: The support of an item measures how popular it is in the data set.
• In mathematical terms, the support of item A is the ratio of transactions involving A to the total number of transactions.
• S(grapes) = ?
• Confidence: The likelihood that a customer who bought A also bought B. It divides the number of transactions involving both A and B by the number of transactions involving A.

• Confidence(A => B) = support(A ∪ B) / support(A)


• Lift: The increase in the sale of A when you sell B (a small computation sketch follows below).

• Lift(A => B) = Confidence(A => B) / Support(B)

• Lift(A => B) = 1 means that there is no correlation within the itemset.
• Lift(A => B) > 1 means that there is a positive correlation within the itemset, i.e., products in the itemset, A and B, are more likely to be bought together.
• Lift(A => B) < 1 means that there is a negative correlation within the itemset, i.e., products in the itemset, A and B, are unlikely to be bought together.
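A small, self-contained sketch of support, confidence and lift computed from a made-up transaction list (item names and values are purely illustrative):

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    return support(set(A) | set(B), transactions) / support(A, transactions)

def lift(A, B, transactions):
    return confidence(A, B, transactions) / support(B, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

print(support({"bread"}, transactions))               # 3/4 = 0.75
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.67
print(lift({"bread"}, {"milk"}, transactions))        # 0.67 / 0.75 ≈ 0.89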
• Association Rule-based algorithms are viewed
as a two-step approach:
1. Frequent Itemset Generation: Find all
frequent item-sets with support >= pre-
determined min_support count
2. Rule Generation: List all Association Rules
from frequent item-sets. Calculate Support and
Confidence for all rules. Prune rules that fail
min_support and min_confidence thresholds.
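A compact, self-contained sketch of this two-step approach: level-wise frequent-itemset generation using the Apriori principle, followed by rule generation. The helper names and the min_support / min_confidence values are illustrative, not from the slides.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Step 1: grow itemsets level by level, pruning those below min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Apriori principle: only supersets of frequent itemsets can be frequent.
        survivors = list(level)
        current = {a | b for a, b in combinations(survivors, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

def generate_rules(frequent, min_confidence):
    """Step 2: list rules A => B from each frequent itemset, prune by confidence."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[A]
                if conf >= min_confidence:
                    rules.append((set(A), set(itemset - A), sup, conf))
    return rules

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"},
                 {"bread", "milk", "butter"}, {"milk", "butter"}]]
freq = apriori_frequent_itemsets(transactions, min_support=0.5)
for A, B, sup, conf in generate_rules(freq, min_confidence=0.6):
    print(A, "=>", B, "support", sup, "confidence", round(conf, 2))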
Example
• Given (example transaction table shown in the original slides).
