DW&M Unit 3 Part II

The document discusses various clustering techniques including K-means clustering, hierarchical clustering, and density-based clustering. It describes the basic concepts and algorithms of K-means clustering including choosing the number of clusters, calculating distances, and assigning points to clusters. It also discusses hierarchical clustering approaches, linkage methods, and evaluating clustering results.

Unit III

3.2 Cluster Analysis: Basic Concepts, Categorization of Major Clustering Methods, Partitioning Methods: The Basic K-means Algorithm, Strengths and Weaknesses of K-means Algorithm, Hierarchical Methods: Agglomerative versus Divisive Hierarchical Clustering, Density-Based Methods: DBSCAN - The DBSCAN Algorithm, Strengths and Weaknesses, Evaluation of Clustering, Outlier Analysis.
Unit III

3.3 Association Rule Mining: Market Basket Analysis, Frequent Itemsets, Closed Itemsets, and Association Rules, Apriori Algorithm: Apriori Principle, Apriori Algorithm, Computational Complexity, Rule Generation, Confidence of association rule.
Clustering
• Clustering is a technique that groups similar objects such that:
• The objects in the same group are more similar to each other than to objects in other groups.
• A group of similar objects is called a Cluster.
Clustering methods

K means algorithm
• K-means is an iterative algorithm that tries to partition the dataset into K predefined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group.
K means algorithm
• K-Means clustering is an unsupervised
iterative clustering technique.
• It partitions the given data set into k
predefined distinct clusters.
• A cluster is defined as a collection of data
points exhibiting certain similarities.
K means algorithm
K-Means Clustering Algorithm steps
Step-01:
Choose the number of clusters K.

Step-02:
Randomly select any K data points as the initial cluster centers.

Step-03:
Calculate the distance between each data point and each cluster center.
Steps contd..
Step-04:
Assign each data point to the cluster whose center is nearest to it.

Step-05:
Re-compute the center of each newly formed cluster (e.g. as the mean of its points).

Step-06:
Keep repeating Step-03 to Step-05 until the centers stop changing (convergence) or the maximum number of iterations is reached (a sketch follows below).
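A minimal NumPy sketch of these steps (the names k_means, points and centers are illustrative, not from the slides; empty clusters are not handled):

import numpy as np

def k_means(points, centers, max_iter=100):
    """Repeat Steps 03-05 until the centers stop changing or max_iter is reached."""
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Step-03: Euclidean distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        # Step-04: assign each point to its nearest center.
        labels = dists.argmin(axis=1)
        # Step-05: recompute each center as the mean of its assigned points.
        new_centers = np.array([points[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        # Step-06: stop when the centers no longer change (convergence).
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers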
Use K-Means Algorithm to create two clusters
• Assume A(2, 2) and C(1, 1) are the centers of the two clusters.
• Distance between A(2, 2) and C1(2, 2): ρ(A, C1) = 0
• Distance between C(1, 1) and C1(2, 2): ρ(C, C1) = 1.41

Given Points    Distance from center      Distance from center      Point belongs
                (2, 2) of Cluster-01      (1, 1) of Cluster-02      to Cluster

A(2, 2)         0                         1.41                      C1
B(3, 2)         1                         2.24                      C1
C(1, 1)         1.41                      0                         C2
D(3, 1)         1.41                      2                         C1
E(1.5, 0.5)     1.58                      0.71                      C2
Iteration 1
• For Cluster-01 (A, B, D): Center of Cluster-01 = (2.67, 1.67)
• For Cluster-02 (C, E): Center of Cluster-02 = (1.25, 0.75)
• We continue with iteration-02, iteration-03 and so on until the centers do not change anymore.
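As a check, running the k_means sketch above on the five example points with the initial centers A(2, 2) and C(1, 1) reproduces the iteration-01 centers:

pts = [(2, 2), (3, 2), (1, 1), (3, 1), (1.5, 0.5)]   # A, B, C, D, E
labels, centers = k_means(pts, [(2, 2), (1, 1)])
print(labels)    # [0 0 1 0 1] -> A, B, D in Cluster-01; C, E in Cluster-02
print(centers)   # approx. [[2.67 1.67] [1.25 0.75]], matching iteration-01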
Elbow method to find the optimal number of clusters (K)
• Run K-means for a range of K values and plot the within-cluster sum of squares (WCSS) against K.
• The "elbow" of the curve, where the decrease in WCSS slows down sharply, suggests a suitable value of K.
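A hedged sketch of the elbow method, assuming scikit-learn and matplotlib are available (the function name elbow_plot is illustrative):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_max=8):
    wcss = []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        wcss.append(km.inertia_)          # within-cluster sum of squares
    plt.plot(range(1, k_max + 1), wcss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WCSS (inertia)")
    plt.show()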
Hierarchical clustering
• In this algorithm, we develop the hierarchy of
clusters in the form of a tree, and this tree-
shaped structure is known as the dendrogram.

Hierarchical clustering is divided into two types:

• Agglomerative Hierarchical Clustering.


• Divisive Hierarchical Clustering
• Agglomerative: Agglomerative clustering is a bottom-up approach, in which the algorithm starts by treating every data point as its own cluster and keeps merging the closest clusters until only one cluster is left.
• Divisive: The divisive algorithm is the reverse of the agglomerative one; it is a top-down approach that starts with all data points in a single cluster and recursively splits it.
EXAMPLE (Proximity matrix)
• The smallest entry in the proximity matrix is 1, between E and A.
• Hence E and A are merged first into the cluster (E, A).
EXAMPLE
• The distances from the new cluster (E, A) to the remaining points are recomputed, e.g. dist{(E, A), D} = min[dist(E, D), dist(A, D)] = 3.
• In the updated proximity matrix the minimum distance is again 1, so the next closest pair is merged.
EXAMPLE
• Merging continues in the same way until all points form a single cluster; the sequence of merges is drawn as a dendrogram.
• Now we can set a threshold distance and draw a horizontal line; suppose we set this threshold to 12 and draw a horizontal line.
• The number of clusters is the number of vertical lines in the dendrogram that are intersected by the line drawn at the threshold.
• The longer a vertical line in the dendrogram, the greater the distance between the clusters it joins.
Linkage methods
• The way the distance between two clusters is measured is crucial for hierarchical clustering.
• There are various ways to calculate the distance between two clusters, and these ways decide the rule for clustering.
• These measures are called Linkage methods (a sketch follows below).
• Single Linkage: the distance between the two closest points of the two clusters.
• Complete Linkage: the distance between the two farthest points of the two clusters.
• Average Linkage: the average of all pairwise distances between points of the two clusters.
• Centroid Linkage: the distance between the centroids of the two clusters.
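A small sketch of agglomerative clustering with SciPy, assuming scipy and matplotlib are available; the method argument selects one of the linkage rules above, and the toy points are the ones from the K-means example:

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])  # A, B, C, D, E

Z = linkage(X, method="single")      # or "complete", "average", "centroid"
dendrogram(Z)                        # tree of merges
plt.show()

# Cut the dendrogram at a chosen threshold distance to get flat clusters;
# with t=1.5 this gives two clusters: {A, B, D} and {C, E}.
labels = fcluster(Z, t=1.5, criterion="distance")
print(labels)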
Density-based clustering
• Partition-based and hierarchical clustering techniques work well when clusters are compact and roughly spherical in shape.
• However, when it comes to arbitrarily shaped clusters or detecting outliers, density-based techniques are more effective.
DBSCAN
• DBSCAN stands for Density-Based Spatial
Clustering of Applications with Noise.
DBSCAN
• It is able to find arbitrarily shaped clusters and clusters with noise (i.e. outliers).
• The main idea behind DBSCAN is that a point belongs to a cluster if it is close to many points from that cluster.
Two key parameters of DBSCAN
• eps: The distance that specifies the neighborhoods. Two points are considered neighbors if the distance between them is less than or equal to eps.
• minPts: The minimum number of data points required to form a dense region (cluster); a point needs at least minPts points within eps to be a core point.
Classification of points
• Based on these two parameters, points are classified as:
• Core point: a point with at least minPts points (including itself) within distance eps.
• Border point: a point with fewer than minPts points within eps, but lying in the neighborhood of a core point.
• Outlier (noise): a point that is neither a core point nor a border point.
DBSCAN example (figure): Eps = 0.6 and MinPts = 4
DBSCAN
• Directly Density-Reachable: a point q is directly density-reachable from a core point p if q lies within eps of p.
• Density-Reachable: q is density-reachable from p if there is a chain of points from p to q in which each point is directly density-reachable from the previous one.
• Density-Connected: p and q are density-connected if both are density-reachable from some common point (see the sketch below).
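A minimal DBSCAN sketch, assuming scikit-learn is available; eps and min_samples correspond to the eps and minPts parameters above, and the toy data are made up for illustration:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 1], [1.1, 1], [0.9, 1.2], [1, 0.9],
              [5, 5], [5.1, 5], [4.9, 5.1], [5, 4.9],
              [9, 1]])                        # the last point is isolated

labels = DBSCAN(eps=0.6, min_samples=4).fit_predict(X)
print(labels)   # points labelled -1 are treated as noise/outliers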
Evaluation metrics
• Silhouette Score: The Silhouette Coefficient of a sample is calculated from the mean intra-cluster distance and the mean distance to the nearest other cluster; the score of a clustering is the mean over all samples. The Silhouette Coefficient ranges over [-1, 1].

s = (nc - ic) / max(ic, nc)

where,
ic = mean of the intra-cluster distance
nc = mean of the nearest-cluster distance
Calinski-Harabasz Index
• This index is the ratio between two quantities: between-cluster dispersion and within-cluster dispersion. Higher values indicate better-separated clusters.

CH(k) = [B(k) / W(k)] x [(n - k) / (k - 1)]

where,
n = number of data points
k = number of clusters
W(k) = within-cluster variation
B(k) = between-cluster variation
Davies-Bouldin index
• The Davies-Bouldin index is based on the ratio of within-cluster distances to between-cluster distances; lower values indicate better clustering.
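A hedged sketch computing the three measures with scikit-learn (assumed available) on labels produced by any clustering algorithm; the toy data are illustrative:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

X = np.array([[2, 2], [3, 2], [1, 1], [3, 1], [1.5, 0.5]])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels))          # closer to +1 is better
print(calinski_harabasz_score(X, labels))   # higher is better
print(davies_bouldin_score(X, labels))      # lower is better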
Market Basket Analysis
• Market Basket Analysis is one of the key
techniques used by large retailers to uncover
associations between items.
• It works by looking for combinations of items
that occur together frequently in transactions.
Association Rule Mining
• Association Rule Mining is used when you want to find associations between different objects in a set, or frequent patterns in a transaction database, relational database, or any other information repository.
Apriori
• The Apriori algorithm is based on the Apriori principle: any subset of a frequent itemset must also be frequent (equivalently, any superset of an infrequent itemset is infrequent).
• Given (example transaction table shown in the original slides).
Apriori
• Support: The support of an item measures how popular it is in the data set.
• In mathematical terms, the support of item A is the ratio of transactions involving A to the total number of transactions.
• S(grapes) = ?
• Confidence: The likelihood that a customer who bought A also bought B. It divides the number of transactions involving both A and B by the number of transactions involving A.

• Confidence(A => B) = support(A ∪ B) / support(A)


• Lift: The increase in the sale of A when you sell B (a small computation sketch follows below).

• Lift(A => B) = Confidence(A => B) / Support(B)

• Lift(A => B) = 1 means that there is no correlation within the itemset.
• Lift(A => B) > 1 means that there is a positive correlation within the itemset, i.e., products in the itemset, A and B, are more likely to be bought together.
• Lift(A => B) < 1 means that there is a negative correlation within the itemset, i.e., products in the itemset, A and B, are unlikely to be bought together.
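A small, self-contained sketch of support, confidence and lift computed from a made-up transaction list (item names and values are purely illustrative):

def support(itemset, transactions):
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(A, B, transactions):
    return support(set(A) | set(B), transactions) / support(A, transactions)

def lift(A, B, transactions):
    return confidence(A, B, transactions) / support(B, transactions)

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
]

print(support({"bread"}, transactions))               # 3/4 = 0.75
print(confidence({"bread"}, {"milk"}, transactions))  # 0.5 / 0.75 ≈ 0.67
print(lift({"bread"}, {"milk"}, transactions))        # 0.67 / 0.75 ≈ 0.89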
• Association Rule-based algorithms are viewed
as a two-step approach:
1. Frequent Itemset Generation: Find all
frequent item-sets with support >= pre-
determined min_support count
2. Rule Generation: List all Association Rules
from frequent item-sets. Calculate Support and
Confidence for all rules. Prune rules that fail
min_support and min_confidence thresholds.
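A compact, self-contained sketch of this two-step approach: level-wise frequent-itemset generation using the Apriori principle, followed by rule generation. The helper names and the min_support / min_confidence values are illustrative, not from the slides.

from itertools import combinations

def apriori_frequent_itemsets(transactions, min_support):
    """Step 1: grow itemsets level by level, pruning those below min_support."""
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    frequent, current = {}, [frozenset([i]) for i in items]
    while current:
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Apriori principle: only supersets of frequent itemsets can be frequent.
        survivors = list(level)
        current = {a | b for a, b in combinations(survivors, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

def generate_rules(frequent, min_confidence):
    """Step 2: list rules A => B from each frequent itemset, prune by confidence."""
    rules = []
    for itemset, sup in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for A in map(frozenset, combinations(itemset, r)):
                conf = sup / frequent[A]
                if conf >= min_confidence:
                    rules.append((set(A), set(itemset - A), sup, conf))
    return rules

transactions = [frozenset(t) for t in
                [{"bread", "milk"}, {"bread", "butter"},
                 {"bread", "milk", "butter"}, {"milk", "butter"}]]
freq = apriori_frequent_itemsets(transactions, min_support=0.5)
for A, B, sup, conf in generate_rules(freq, min_confidence=0.6):
    print(A, "=>", B, "support", sup, "confidence", round(conf, 2))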
Example
• Given (example transaction table shown in the original slides).
