Clustering
Most of the topics are based on the textbook Introduction to Data Mining by Tan et al. and the
video lectures by Andrew Ng
Clustering is an unsupervised learning technique
Clusters are potential classes and cluster analysis is the study of techniques for automatically
finding classes.
Dividing objects into groups is clustering, and assigning particular objects to these groups is
called classification
Ex: human genome analysis, social network analysis, market segmentation and astronomical data
analysis
We can represent each object by the index of the prototype associated with it. This type of
compression is called vector quantization
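The idea of vector quantization can be sketched in a few lines of Python (an illustrative sketch, not from the notes; the prototypes and points are made-up data):

```python
import math

# Hypothetical codebook of prototypes and some points to compress
prototypes = [(0.0, 0.0), (5.0, 5.0)]
points = [(0.2, 0.1), (4.8, 5.1), (0.1, 0.4)]

def nearest_prototype(p, prototypes):
    """Return the index of the prototype closest to point p."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(p, prototypes[i]))

# Each object is replaced by the index of its nearest prototype
codes = [nearest_prototype(p, prototypes) for p in points]
print(codes)  # [0, 1, 0]
```

Each point is now stored as a single small integer, which is what makes this a form of compression.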
Cluster validity – methods for evaluating the goodness of the clusters produced by a clustering
algorithm
The greater the similarity within a group and the greater the difference between groups, the better
or more distinct the clustering
The definition of cluster is imprecise and the best definition depends on the nature of data and
the desired results
Understanding:
Clustering is used in biology, information retrieval, climate science, business etc. to understand
various patterns
Utility: here cluster analysis is only a starting point for other purposes, such as:
Summarization
Compression
Fuzzy: Every object belongs to every cluster with a membership weight that is between 0 and 1
(all the weights must sum to 1)
Probabilistic: clustering computes the probability with which each point belongs to each cluster (all
the probabilities must sum to 1)
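The constraint that each point's weights sum to 1 can be illustrated with a toy fuzzy membership computation (an assumed scheme for illustration, not the notes' method: weights inversely proportional to distance, then normalized):

```python
import math

centroids = [(0.0, 0.0), (4.0, 0.0)]  # made-up centroids
point = (1.0, 0.0)

# Inverse-distance weights, normalized so memberships sum to 1
inv = [1.0 / math.dist(point, c) for c in centroids]
weights = [w / sum(inv) for w in inv]
print(weights)  # closer centroid gets the larger weight
```

Here the point is 1 unit from the first centroid and 3 units from the second, so its memberships come out to 0.75 and 0.25.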
Complete and partial clusterings: a complete clustering assigns every object to a cluster, whereas a
partial clustering does not.
Types of cluster:
Well separated
Density based
Shared property
There are three major algorithms used in clustering:
1) k-means
2) DBSCAN
3) agglomerative hierarchical clustering
In this exercise we are mainly going to concentrate on K-means clustering, which is prototype-based
partitional clustering
Algorithm:
1: Select K points as initial centroids
2: repeat
3: Form K clusters by assigning each point to its closest centroid
4: Recompute the centroid of each cluster
5: until centroids do not change
Step 5 is often replaced by a weaker condition, e.g. repeat until only 1% of points change clusters
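A minimal Python sketch of the basic K-means loop (illustrative only; the data and names are assumptions, not from the notes):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)      # pick K points as initial centroids
    clusters = []
    for _ in range(max_iter):                 # repeat ...
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to its closest centroid
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        new_centroids = [                     # recompute the centroid of each cluster
            tuple(sum(v) / len(v) for v in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:        # ... until centroids do not change
            break
        centroids = new_centroids
    return centroids, clusters

random.seed(0)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, 2)
print(sorted(len(c) for c in clusters))  # the two obvious groups of 3 points each
```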
K-means uses a proximity measure to assign each point to its closest center, where the proximity
measure characterizes the similarity or dissimilarity between objects
Objective: Minimise the SSE (sum of squared errors) from each point to its centroid

SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2
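The SSE objective translates directly into code (a sketch with made-up centroids and clusters, only for illustration):

```python
import math

def sse(centroids, clusters):
    """Sum of squared distances from each point to its cluster's centroid."""
    return sum(math.dist(c, p) ** 2
               for c, cluster in zip(centroids, clusters)
               for p in cluster)

centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = [[(0.0, 1.0), (1.0, 0.0)], [(10.0, 11.0)]]
print(sse(centroids, clusters))  # 1 + 1 + 1 = 3.0
```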
If we have the true output labels, we can calculate the accuracy of objects allocated to their
respective clusters
Process in kmeans
Programming in R:
Prototype:
Cluster Evaluation:
The evaluation measures that are applied to judge various aspects of cluster validity are
traditionally classified into three types
Unsupervised:
Measures the goodness of cluster structure without respect to external information. An
example of this is SSE.
Supervised:
Measures the extent to which the clustering structure discovered by a clustering algorithm
matches some external structure. Here external structure can be externally provided class
labels
Relative:
Compares different clusterings or clusters. A relative cluster evaluation measure is a
supervised or unsupervised evaluation measure that is used for the purpose of comparison
Issues:
Choosing initial clusters
Choosing number of clusters
Handling empty clusters
Outliers
K-means can converge to different solutions depending on the centroid initialisation (so random
centroid initialisation is important)
Ideally the algorithm should reach the global optimum; the clusters should not get stuck at a local
optimum
To address this, try multiple random initialisations and keep the result whose cost function value is
lowest
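The restart strategy can be sketched as follows (a self-contained illustration; the `kmeans`/`sse` helpers and the data are assumptions, not from the notes):

```python
import math
import random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)  # random centroid initialisation
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: math.dist(p, centroids[j]))].append(p)
        new = [tuple(sum(v) / len(v) for v in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return centroids, clusters

def sse(centroids, clusters):
    """Cost function: sum of squared point-to-centroid distances."""
    return sum(math.dist(c, p) ** 2
               for c, cl in zip(centroids, clusters) for p in cl)

def best_of_n(points, k, n=10):
    """Run k-means n times and keep the run with the lowest SSE."""
    runs = [kmeans(points, k) for _ in range(n)]
    return min(runs, key=lambda run: sse(*run))

random.seed(1)
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = best_of_n(points, 2)
```

With several restarts, at least one run is very likely to start from a good initialisation, so the lowest-SSE run approximates the global optimum.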
Choose the number of clusters at the elbow point (before the elbow, distortion goes down rapidly;
after the elbow, distortion goes down slowly)
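A tiny sketch of picking the elbow from an SSE-versus-k curve (the SSE values here are made up for illustration, and the drop-ratio rule is one simple heuristic, not the only way):

```python
# Hypothetical SSE for k = 1..5: drops rapidly, then flattens
sse_by_k = {1: 100.0, 2: 40.0, 3: 12.0, 4: 10.0, 5: 9.0}

# How much SSE falls when k increases by one
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in range(2, 6)}

# Elbow heuristic: the k where the drop flattens most sharply afterwards
elbow = max(range(2, 5), key=lambda k: drops[k] / drops[k + 1])
print(elbow)  # 3: the drop goes from 28 down to 2 after k = 3
```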
Questions: