Clustering Techniques - Hierarchical, K-Means Clustering

Hierarchical and k-means clustering are common clustering techniques. Hierarchical clustering finds successive clusters using previously established clusters, in either an agglomerative or a divisive manner. K-means clustering partitions data into k clusters by minimizing distances between data points and cluster centroids, iteratively reassigning points until the centroids converge. While useful for data exploration, k-means clustering has weaknesses such as sensitivity to initialization and the need to pre-specify the number of clusters.


Clustering Techniques –
Hierarchical, K-means Clustering

1
INTRODUCTION
What is clustering?

• Clustering is the classification of objects into different groups or, more precisely, the partitioning of a data set into subsets (clusters), so that the data in each subset (ideally) share some common trait, often according to some defined distance measure.

2
Types of clustering:
1. Hierarchical algorithms: these find successive clusters using previously established clusters (a short agglomerative-clustering sketch follows this slide).
   1. Agglomerative ("bottom-up"): agglomerative algorithms begin with each element as a separate cluster and merge them into successively larger clusters.
   2. Divisive ("top-down"): divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
2. Partitional clustering: partitional algorithms determine all clusters at once. They include:
   – K-means and derivatives
   – Fuzzy c-means clustering
   – QT clustering algorithm
3
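
To make the agglomerative ("bottom-up") idea concrete, here is a minimal sketch using SciPy's hierarchical-clustering routines. The sample points and the choice of single linkage are illustrative assumptions, not part of the original slides.

# Minimal agglomerative clustering sketch (illustrative data).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six 2-D points forming two loose groups (hypothetical example data).
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8],
              [5.0, 7.0], [4.5, 6.5], [5.5, 7.5]])

# "Bottom-up": start with every point as its own cluster and repeatedly
# merge the two closest clusters; 'single' linkage uses nearest-neighbor distance.
Z = linkage(X, method="single", metric="euclidean")

# Cut the merge tree so that exactly two clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]

Cutting the same merge tree at different heights yields different numbers of clusters, which is what distinguishes hierarchical from partitional methods.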
Common distance measures:

• The distance measure determines how the similarity of two elements is calculated, and it influences the shape of the clusters.
They include:
1. The Euclidean distance (also called 2-norm distance) is given by:
   $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$
2. The Manhattan distance (also called taxicab norm or 1-norm) is given by:
   $d(p, q) = \sum_{i=1}^{n} |p_i - q_i|$
4
3. The maximum norm is given by:
   $d(p, q) = \max_{1 \le i \le n} |p_i - q_i|$
4. The Mahalanobis distance corrects the data for different scales and correlations in the variables.
5. Inner product space: the angle between two vectors can be used as a distance measure when clustering high-dimensional data.
6. Hamming distance (sometimes edit distance) measures the minimum number of substitutions required to change one member into another.
5
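
As a quick reference, the sketch below evaluates several of these measures with NumPy; the two example points are made up for illustration.

# Sketch of several distance measures (illustrative points).
import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 0.0, 3.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))  # 2-norm distance
manhattan = np.sum(np.abs(p - q))          # 1-norm (taxicab) distance
maximum   = np.max(np.abs(p - q))          # maximum norm

# Angle between the vectors, usable as a dissimilarity for high-dimensional data.
cos_sim = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
angle = np.arccos(np.clip(cos_sim, -1.0, 1.0))

print(euclidean, manhattan, maximum, angle)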
K-MEANS CLUSTERING
• The k-means algorithm clusters n objects, based on their attributes, into k partitions, where k < n.
• It is similar to the expectation-maximization algorithm for mixtures of Gaussians in that both attempt to find the centers of natural clusters in the data.
• It assumes that the object attributes form a vector space.
6
• An algorithm for partitioning (or clustering) N data points into K disjoint subsets $S_j$, each containing $N_j$ data points, so as to minimize the sum-of-squares criterion

$J = \sum_{j=1}^{K} \sum_{n \in S_j} \| x_n - \mu_j \|^2$

where $x_n$ is a vector representing the nth data point and $\mu_j$ is the geometric centroid of the data points in $S_j$.
7
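
The criterion is straightforward to state in code. The small sketch below computes J for a given assignment; the function and argument names are my own, chosen for illustration.

# Sketch: the sum-of-squares criterion J for a given clustering.
import numpy as np

def sum_of_squares(X, labels, centroids):
    # J = sum over clusters j of sum over points n in S_j of ||x_n - mu_j||^2
    J = 0.0
    for j, mu in enumerate(centroids):
        members = X[labels == j]          # points currently assigned to cluster j
        J += np.sum((members - mu) ** 2)  # squared Euclidean distances to centroid
    return J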
• Simply speaking, k-means clustering is an algorithm that classifies or groups objects, based on their attributes/features, into K groups.
• K is a positive integer.
• The grouping is done by minimizing the sum of squared distances between each data point and the corresponding cluster centroid.

8
How does the K-Means Clustering algorithm work?

9
• Step 1: Begin with a decision on the value of k = number of clusters.
• Step 2: Put any initial partition that classifies the data into k clusters. You may assign the training samples randomly, or systematically as follows:
1. Take the first k training samples as single-element clusters.
2. Assign each of the remaining (N - k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.

10
• Step 3: Take each sample in sequence and compute its distance from the centroid of each of the clusters. If a sample is not currently in the cluster with the closest centroid, switch this sample to that cluster and update the centroids of both the cluster gaining the new sample and the cluster losing the sample.
• Step 4: Repeat Step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments (a code sketch of these steps follows).

11
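
A minimal NumPy sketch of Steps 1-4 is given below. The names are my own, the initial centroids are drawn at random from the data, and, for brevity, centroids are recomputed once per pass (the batch variant) rather than after every single switch as Step 3 describes; both variants converge in the same sense.

# Sketch of the k-means loop from Steps 1-4 (batch-update variant).
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: choose k, and take k distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iters):
        # Step 3: distance from every point to every centroid, then
        # reassign each point to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Step 4: converged once a full pass causes no new assignments.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Recompute each centroid as the mean of its current members.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids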
A simple example showing the working of the k-means algorithm
(using K=2)

12
Step 1:
Initialization: we randomly choose the following two centroids (k=2) for the two clusters.
In this case the two centroids are m1 = (1.0, 1.0) and m2 = (5.0, 7.0).

13
Step 2:
• Using these centroids, we obtain two clusters containing {1, 2, 3} and {4, 5, 6, 7}.
• Their new centroids are the means of the member points (the table of values appears only in the original slide).
14
Step 3:
• Now using these centroids, we compute the Euclidean distance of each object to each centroid, as shown in the slide's table (not reproduced here).
• Therefore, the new clusters are: {1, 2} and {3, 4, 5, 6, 7}.
• The next centroids are: m1 = (1.25, 1.5) and m2 = (3.9, 5.1).

15
• Step 4:
The clusters obtained are {1, 2} and {3, 4, 5, 6, 7}.
• Since there is no change in the clusters, the algorithm halts here, and the final result consists of the two clusters {1, 2} and {3, 4, 5, 6, 7}.

16
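
The slides' table of data points is not reproduced in this text version. The sketch below re-runs the K=2 example starting from the slide's initial centroids; the seven points are an assumed reconstruction, chosen to be consistent with the clusters and final centroids quoted above, not the original table.

# Re-running the K=2 example; the points are an assumed reconstruction
# consistent with the final centroids m1=(1.25,1.5) and m2=(3.9,5.1).
import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0], [5.0, 7.0],
              [3.5, 5.0], [4.5, 5.0], [3.5, 4.5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])  # the slide's initial m1, m2

for _ in range(10):  # a few passes suffice for this small example
    # Assign each point to its nearest centroid (Euclidean distance).
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Recompute the centroids as the cluster means.
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(2)])

print(labels)     # [0 0 1 1 1 1 1] -> clusters {1,2} and {3,4,5,6,7}
print(centroids)  # approximately [[1.25 1.5 ] [3.9  5.1 ]]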
PLOT (figure omitted)
17
(with K=3)
Step 1 and Step 2 (figures omitted)
18
PLOT (figure omitted)
19
Weaknesses of K-Means Clustering
1. When the number of data points is small, the initial grouping determines the clusters significantly.
2. The number of clusters, K, must be determined beforehand. A further disadvantage is that the algorithm does not yield the same result on each run, since the resulting clusters depend on the initial random assignments.
3. We never know the real clusters: with the same data, if the points are presented in a different order the algorithm may produce different clusters when the number of data points is small.
4. It is sensitive to the initial conditions: different initial conditions may produce different clusterings, and the algorithm may become trapped in a local optimum (the sketch below illustrates this).

20
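
Weaknesses 2 and 4 are easy to observe by running k-means several times with different random initializations and comparing the final sum-of-squares. The sketch below uses scikit-learn's KMeans with a single random initialization per run; the blob data set is made up for illustration.

# Sketch: sensitivity of k-means to initialization (illustrative data).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# One random initialization per run; different seeds can land in different
# local optima with different final sum-of-squares (inertia).
for seed in range(5):
    km = KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    print(f"seed={seed}  inertia={km.inertia_:.1f}")

In practice, k-means++ initialization and several restarts (scikit-learn's defaults) mitigate, but do not eliminate, this weakness.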
Applications of K-Means Clustering
• It is relatively efficient and fast: it computes its result in O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.
• k-means clustering can be applied to machine learning or data mining.
• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (a process known as vector quantization); the same idea underlies k-means image segmentation.
• It is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization (a short palette sketch follows).
21
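
As an example of the palette application, the following sketch quantizes an RGB image to k colors by clustering its pixels. The image array is assumed to come from elsewhere, and the use of scikit-learn here is my choice, not the slides'.

# Sketch: color-palette selection / image quantization with k-means.
# `image` is assumed to be an (H, W, 3) uint8 RGB array loaded elsewhere.
import numpy as np
from sklearn.cluster import KMeans

def quantize(image, k=16):
    pixels = image.reshape(-1, 3).astype(float)     # one row per pixel
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(pixels)
    palette = km.cluster_centers_.astype(np.uint8)  # the k representative colors
    # Replace every pixel with its nearest palette color.
    return palette[km.labels_].reshape(image.shape)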
CONCLUSION
• The k-means algorithm is useful for undirected knowledge discovery and is relatively simple.
• K-means has found widespread usage in many fields, ranging from unsupervised learning for neural networks to pattern recognition, classification analysis, artificial intelligence, image processing, machine vision, and many others.

22
