DW&M Unit 3 Part II
Step-02:
Randomly select any K data points as cluster
centers.
Step-03:
Calculate the distance between each data point
and each cluster center.
Step-04:
Assign each data point to the cluster whose
center is nearest to that data point.
Step-05:
Re-compute the center of newly formed clusters.
Step-06:
Keep repeating Step-03 to Step-05 until
convergence or the maximum number of
iterations is reached.
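The steps above can be sketched in Python. This is a minimal illustration under the stated procedure, not an optimized implementation; `kmeans` is a hypothetical helper name, not a library API:

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Toy k-means following Steps 2-6 above."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # Step 2: random initial centers
    for _ in range(max_iter):                    # Step 6: repeat until convergence
        clusters = [[] for _ in range(k)]
        for p in points:                         # Steps 3-4: assign to nearest center
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        new_centers = [                          # Step 5: recompute centers
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:               # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)], k=2)
```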
Use K-Means Algorithm to create two
clusters
• Assume A(2, 2) and C(1, 1) are centers of the two
clusters.
• Distance between A(2, 2) and center C1(2, 2):
ρ(A, C1) = 0
• Distance between C(1, 1) and center C1(2, 2):
ρ(C, C1) = 1.41
Given Points   Distance from center (2, 2)   Distance from center (1, 1)   Point belongs to
               of Cluster-01                 of Cluster-02                 Cluster
A(2, 2)        0                             1.41                          C1
B(3, 2)        1                             2.24                          C1
C(1, 1)        1.41                          0                             C2
D(3, 1)        1.41                          2                             C1
E(1.5, 0.5)    1.58                          0.71                          C2
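The distances in the table can be verified with the Euclidean distance; this is a quick check of the assignment step, not part of the original slides:

```python
import math

centers = {"C1": (2, 2), "C2": (1, 1)}   # initial centers A and C
points = {"A": (2, 2), "B": (3, 2), "C": (1, 1), "D": (3, 1), "E": (1.5, 0.5)}

for name, p in points.items():
    d1 = math.dist(p, centers["C1"])     # distance to Cluster-01's center
    d2 = math.dist(p, centers["C2"])     # distance to Cluster-02's center
    print(name, round(d1, 2), round(d2, 2), "C1" if d1 <= d2 else "C2")
```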
Iteration 1
• For Cluster-01:
• Center of Cluster-01 = (2.67, 1.67)
• For Cluster-02:
• Center of Cluster-02 = (1.25, 0.75)
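The recomputed centers are just the coordinate-wise means of the points assigned to each cluster; a quick check (`center` is a hypothetical helper):

```python
cluster_1 = [(2, 2), (3, 2), (3, 1)]   # A, B, D
cluster_2 = [(1, 1), (1.5, 0.5)]       # C, E

def center(points):
    # Coordinate-wise mean, rounded to 2 decimals as in the slides.
    return tuple(round(sum(c) / len(points), 2) for c in zip(*points))

print(center(cluster_1))  # (2.67, 1.67)
print(center(cluster_2))  # (1.25, 0.75)
```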
EXAMPLE
(agglomerative clustering; the dendrogram figures on these slides did not survive extraction, and only the following fragments are recoverable)
• The smallest entry in the distance matrix is 1, between E and A; hence merge E and A.
• Distance from the merged cluster {E, A} to D: d = min[ dist(E, D), dist(A, D) ] = 3.
• Now we can set a threshold distance and draw a
horizontal line across the dendrogram; suppose we
set this threshold to 12.
• The number of clusters is the number of vertical
lines intersected by the horizontal line drawn at
the threshold.
• The longer the vertical lines in the
dendrogram, the greater the distance between
those clusters.
Linkage methods
• How the distance between two clusters is
measured is crucial for hierarchical clustering.
• There are various ways to calculate the
distance between two clusters, and these
ways decide the rule for clustering.
• These measures are called Linkage methods.
Linkage methods
• Single Linkage: distance between the two closest points of the two clusters.
• Complete Linkage: distance between the two farthest points of the two clusters.
Linkage methods
• Average Linkage: mean of all pairwise distances between points of the two clusters.
• Centroid Linkage: distance between the centroids of the two clusters.
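The four linkage methods can be compared on two small clusters; a minimal sketch with hypothetical example points:

```python
import math
from itertools import product

cluster_a = [(0, 0), (1, 0)]
cluster_b = [(4, 0), (5, 0)]

# All pairwise distances between the two clusters.
pair_dists = [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]

single   = min(pair_dists)                    # closest pair
complete = max(pair_dists)                    # farthest pair
average  = sum(pair_dists) / len(pair_dists)  # mean of all pairs

# Centroid linkage: distance between the two cluster centroids.
centroid = math.dist((0.5, 0), (4.5, 0))

print(single, complete, average, centroid)  # 3.0 5.0 4.0 4.0
```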
Density-based clustering
• Partition-based and hierarchical clustering
techniques work well with compact, regularly
shaped clusters.
• However, for arbitrarily shaped clusters or for
detecting outliers, density-based techniques
are more effective.
DBSCAN
• DBSCAN stands for Density-Based Spatial
Clustering of Applications with Noise.
DBSCAN
• It can find arbitrarily shaped clusters and
handle data containing noise (i.e. outliers).
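A naive version of DBSCAN can be written directly from its definition: core points (at least `min_pts` neighbors within `eps`) grow clusters, and unreachable points are labelled noise. This is a simplified sketch, not the optimized algorithm used by real libraries:

```python
import math

def dbscan(points, eps, min_pts):
    """Naive DBSCAN: returns a cluster id per point, or -1 for noise."""
    labels = [None] * len(points)          # None = unvisited

    def neighbors(i):
        # Indices within eps of point i (includes i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: mark as noise
            labels[i] = -1
            continue
        labels[i] = cluster                # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster        # noise becomes a border point
            if labels[j] is not None:
                continue                   # already claimed
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:         # j is also a core point: keep expanding
                queue.extend(js)
        cluster += 1
    return labels

# Two dense blobs plus one isolated outlier (hypothetical data).
pts = [(0, 0), (0, 0.5), (0.5, 0), (5, 5), (5, 5.5), (5.5, 5), (10, 10)]
print(dbscan(pts, eps=1.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```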
Silhouette Coefficient
• S = (nc − ic) / max(ic, nc)
where,
ic = mean of the intra-cluster distance
nc = mean of the nearest-cluster distance
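The formula can be evaluated for a single point; a minimal sketch with hypothetical example points (`silhouette` is not a library function):

```python
import math

def silhouette(p, own_cluster, nearest_cluster):
    """s = (nc - ic) / max(ic, nc) for one point p.

    own_cluster: the other points in p's cluster.
    nearest_cluster: points of the closest other cluster.
    """
    ic = sum(math.dist(p, q) for q in own_cluster) / len(own_cluster)
    nc = sum(math.dist(p, q) for q in nearest_cluster) / len(nearest_cluster)
    return (nc - ic) / max(ic, nc)

s = silhouette((0, 0), [(0, 1), (1, 0)], [(4, 0), (5, 0)])
print(round(s, 2))  # 0.78 -- close to +1 means the point sits well in its cluster
```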
Calinski-Harabasz Index
• This index is the ratio of between-cluster
dispersion to within-cluster dispersion:
CH(k) = [ B(k) / W(k) ] × [ (n − k) / (k − 1) ]
where,
n = number of data points
k = number of clusters
W(k) = within-cluster variation
B(k) = between-cluster variation.
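The index can be computed directly from the formula; a toy sketch on 1-D data (hypothetical values, `calinski_harabasz` is not a library call):

```python
def calinski_harabasz(clusters):
    """CH(k) = [B(k)/W(k)] * [(n-k)/(k-1)] for a list of 1-D clusters."""
    all_pts = [x for cl in clusters for x in cl]
    n, k = len(all_pts), len(clusters)
    grand_mean = sum(all_pts) / n
    centers = [sum(cl) / len(cl) for cl in clusters]
    # B(k): between-cluster variation (weighted by cluster size).
    b = sum(len(cl) * (c - grand_mean) ** 2 for cl, c in zip(clusters, centers))
    # W(k): within-cluster variation.
    w = sum((x - c) ** 2 for cl, c in zip(clusters, centers) for x in cl)
    return (b / w) * ((n - k) / (k - 1))

print(calinski_harabasz([[0, 1], [10, 11]]))  # 200.0 -- well-separated clusters
```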
Davies-Bouldin Index
• The Davies-Bouldin index is based on the principle
of within-cluster and between-cluster distances.
Market Basket Analysis
• Market Basket Analysis is one of the key
techniques used by large retailers to uncover
associations between items.
• It works by looking for combinations of items
that occur together frequently in transactions.
Association Rule Mining
• Association Rule Mining is used when you
want to find an association between different
objects in a set, find frequent patterns in a
transaction database, relational databases or
any other information repository.
Apriori
• Apriori algorithm assumes that any subset of a
frequent itemset must be frequent.
Apriori
• Support: it measures how popular an item is in the dataset.
• In mathematical terms, the support of item A
is nothing but the ratio of transactions
involving A to the total number of
transactions.
• S(grapes)=??
• Confidence(A → B): the likelihood that a
customer who bought A also bought B. It divides
the number of transactions involving both A and
B by the number of transactions involving A.
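Both measures can be computed directly from their definitions; a minimal sketch on a hypothetical transaction set (the slide's own transaction table is not reproduced here):

```python
# Toy transaction data (assumed for illustration).
transactions = [
    {"milk", "bread", "grapes"},
    {"milk", "bread"},
    {"bread", "grapes"},
    {"milk"},
]

def support(items):
    # Fraction of transactions that contain every item in `items`.
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(a, b):
    # Confidence(A -> B) = support(A and B) / support(A).
    return support(a | b) / support(a)

print(support({"grapes"}))                # 0.5  (2 of 4 transactions)
print(confidence({"bread"}, {"grapes"}))  # 2 of the 3 bread transactions have grapes
```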