
Clustering

Most of the topics are based on the textbook Introduction to Data Mining by Tan et al. and on
video lectures by Andrew Ng.
Clustering is an unsupervised learning technique.

Clusters are potential classes and cluster analysis is the study of techniques for automatically
finding classes.

Dividing objects into groups is clustering, and assigning particular objects to these groups is
called classification

Clustering is similar to classification, except that here we don't know the labels

Ex: human genome analysis, social network analysis, market segmentation and astronomical data
analysis

So, why can’t we get labels?

They may be too costly to obtain – e.g., labelling Amazon web pages

They may not exist (there is no ground truth) – e.g., classifying animals into species

We can represent each object by the index of the prototype associated with it. This type of
compression is called vector quantization

Cluster validity – methods for evaluating the goodness of the clusters produced by a clustering
algorithm

The greater the similarity within a group and the greater the difference between groups, the
better or more distinct the clustering is

The definition of a cluster is imprecise, and the best definition depends on the nature of the
data and the desired results

Clustering can be used either for utility or for understanding

Understanding:

Clustering is used in biology, information retrieval, climate science, business, etc. to
understand the patterns in the data

Utility: here cluster analysis is only a starting point for other purposes, such as:

Summarization

Compression

Efficiently finding nearest neighbours

Different types of clustering:

Hierarchical versus partitional clustering: a partitional clustering divides the objects into
non-overlapping groups, while a hierarchical clustering is a set of nested clusters organised
as a tree

Exclusive: assigning each object to a single cluster


Overlapping or non-exclusive clustering: situations in which a point can reasonably be placed
in more than one cluster

Fuzzy: Every object belongs to every cluster with a membership weight that is between 0 and 1
(all the weights must sum to 1)

Probabilistic clustering computes the probability with which each point belongs to each cluster
(all the probabilities must sum to 1)

Complete and partial clusterings: a complete clustering assigns every object to a cluster,
whereas a partial clustering does not.

Types of cluster:

Well separated

Prototype based – (centre-based clusters)

Graph based – connected component

Density based

Shared property
There are three major algorithms used in clustering:

1) k-means
2) DBSCAN
3) agglomerative hierarchical clustering

In this exercise we are mainly going to concentrate on k-means clustering, which is a
prototype-based, partitional clustering technique

Algorithm:

1: Select K points as initial centroids

2: repeat

3: Form K clusters by assigning each point to its closest centroid

4: Recompute the centroid of each cluster

5: until centroids do not change

Step 5 is often replaced by a weaker condition, e.g., repeat until only 1% of the points change
clusters
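These steps translate almost line for line into R. Below is a minimal sketch, assuming x is a
numeric matrix with one row per point (and at least two columns); the function name
simple_kmeans and the convergence tolerance are illustrative, and empty clusters are not handled.

# Minimal sketch of the algorithm above; simple_kmeans is a made-up
# name and empty clusters are not handled.
simple_kmeans <- function(x, k, max_iter = 100) {
  centroids <- x[sample(nrow(x), k), , drop = FALSE]     # step 1
  for (i in seq_len(max_iter)) {                         # step 2
    # step 3: Euclidean distance from every point to every centroid
    d <- as.matrix(dist(rbind(centroids, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # step 4: recompute each centroid as the mean of its points
    new_centroids <- t(sapply(seq_len(k), function(j)
      colMeans(x[cluster == j, , drop = FALSE])))
    # step 5: stop when the centroids no longer change
    if (max(abs(new_centroids - centroids)) < 1e-9) break
    centroids <- new_centroids
  }
  list(cluster = cluster, centers = centroids)
}

This is essentially the "Lloyd" variant that the built-in kmeans() function (discussed below)
also offers.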

K-means uses a proximity measure to assign each point to its closest centre, where the proximity
measure characterizes the similarity or dissimilarity between objects

Some proximity measures:

Euclidean and Manhattan – used for data points

Cosine and Jaccard – used for documents
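These measures can be computed directly in base R; the vectors x1 and x2 below are made-up
examples.

x1 <- c(1, 2, 3)
x2 <- c(5, 10, 4)

sqrt(sum((x1 - x2)^2))   # Euclidean distance
sum(abs(x1 - x2))        # Manhattan distance
# cosine similarity (the cosine distance is 1 minus this)
sum(x1 * x2) / (sqrt(sum(x1^2)) * sqrt(sum(x2^2)))
# for binary vectors a and b, the Jaccard similarity would be
# sum(a & b) / sum(a | b)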

Objective: minimise the SSE (sum of squared errors) from each point to its centroid

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2

where C_i is the i-th cluster and c_i is its centroid
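As a quick numeric check of this formula, here is a toy 1-D example with made-up data:

x  <- c(1, 2, 3, 10, 11, 12)       # data points
ci <- c(1, 1, 1, 2, 2, 2)          # cluster of each point
centroids <- tapply(x, ci, mean)   # c_1 = 2, c_2 = 11
sum((x - centroids[ci])^2)         # SSE = 1 + 0 + 1 + 1 + 0 + 1 = 4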

If we have the true labels, we can calculate the accuracy with which objects are allocated to
their respective clusters

Time and space complexity:


Storage required – O((m + K)n)
Time required – O(I * K * m * n), i.e. linear in m
Here I is the number of iterations, K the number of clusters, m the number of data points, and
n the number of attributes. I and K are usually much smaller than m

Process in k-means

Programming in R:

1) Using the library


2) Implementing the algorithm

Prototype:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
       trace = FALSE)

#use required parameters depending on the application

#x and centers are required parameters; the others are optional

kmeans(dataframe, number of clusters) – generally used

fit <- kmeans(data, 2) – here 2 is the number of clusters


fit$centers – gives centers of each cluster

fit$cluster – gives which points belong to which cluster

fit$size – gives number of points in each cluster

fit$iter – number of iterations it required to converge
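Putting these together, here is a small end-to-end sketch; the two-cluster synthetic data is
made up for illustration.

set.seed(42)   # fix the random initialisation for reproducibility
data <- rbind(matrix(rnorm(50, mean = 0), ncol = 2),
              matrix(rnorm(50, mean = 5), ncol = 2))
fit <- kmeans(data, 2)

fit$centers        # centres of the two clusters
fit$size           # points per cluster
fit$tot.withinss   # the SSE objective discussed above
plot(data, col = fit$cluster)           # points coloured by cluster
points(fit$centers, pch = 8, cex = 2)   # mark the centroids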

Cluster Evaluation:
The evaluation measures that are applied to judge various aspects of cluster validity are
traditionally classified into three types

Unsupervised:
Measures the goodness of cluster structure without respect to external information. An
example of this is SSE.
Supervised:
Measures the extent to which the clustering structure discovered by a clustering algorithm
matches some external structure, such as externally provided class labels
Relative:
Compares different clusterings or clusters. A relative cluster evaluation measure is a
supervised or unsupervised evaluation measure that is used for the purpose of comparison

Issues:
Choosing initial clusters
Choosing number of clusters
Handling empty clusters
Outliers

Choosing initial clusters:


One way of choosing initial clusters is random initialization:

Randomly pick K objects and set the initial centroids equal to these objects

K-means can converge to different solutions depending on the centroid initialization (so the
centroid initialisation matters)

We want the global optimum; the clustering should not get stuck at a local optimum
To address this, try multiple random initialisations and keep the run whose cost function value
is lowest

This is most helpful when the number of clusters is small
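With the library function, the nstart argument does exactly this: kmeans() runs that many random
initialisations and keeps the one with the lowest total within-cluster SSE. Reusing the
synthetic data from the sketch above:

fit <- kmeans(data, centers = 2, nstart = 25)
fit$tot.withinss   # cost of the best of the 25 runs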

Choosing number of clusters:

There is no single best way to do this; one way is visualisation

Elbow method for choosing clusters:

The distortion (cost function value) decreases as we increase the number of clusters

Choose the number of clusters at the elbow point (before the elbow, the distortion drops
rapidly; after the elbow, it decreases only slowly)
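The elbow curve can be drawn with a few lines of R; this sketch reuses the synthetic data from
the earlier example.

k_values <- 1:10
distortion <- sapply(k_values, function(k)
  kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(k_values, distortion, type = "b",
     xlab = "number of clusters k", ylab = "distortion (SSE)")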
Questions:

1) Consider the points x1 <- c(1,2,3,6) and x2 <- c(5,10,4,12). Compute the Euclidean
distance.
2) Consider the points x1 <- c(1,2) and x2 <- c(5,10). Compute the cosine distance.
3) Using the data x <- c(1,2,2.5,3,3.5,4,4.5,5,7,8,8.5,9,9.5,10), find the clusters
where k = 2.
4) For the given sonar_test.csv, find the clustering with k = 2 for the first two
columns.
5) Find the testing error.
6) For the given sonar_test.csv, find the clustering with k = 2 for the entire data
set and find the testing error.
7) For the given sonar_test.csv, do hierarchical clustering with 4 cuts.
