ML Module 5: Clustering
Topic: Clustering
Contents:
• Introduction to clustering
• Types of clustering methods
• K-means
• K-medoids
• Issues with clustering
• Applications of clustering
Unsupervised Machine Learning
• Unsupervised learning is the training of a machine using information that is
neither classified nor labelled, allowing the algorithm to act on that
information without guidance.
• The learning algorithm is generally run until the final analysis results no
longer change, no matter how many additional times the algorithm is passed
over the data.
Figure: A data set with clear cluster structure.
• The k-means algorithm identifies the centroid, the mean of a group of points, as
the representative of each cluster; the clustering it converges to is not
guaranteed to be globally optimal.
• Similarly, the k-medoid algorithm identifies the medoid, which is the most
representative point for a group of points.
• We can also infer that in most cases the centroid does not correspond to
an actual data point, whereas the medoid is always an actual data point.
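As a small illustration of this centroid/medoid distinction, the snippet below (a toy example, not taken from the referenced texts) computes both for a handful of 1-D points:

```python
import numpy as np

def medoid(points):
    """Return the data point whose total distance to all other points is smallest."""
    pts = np.asarray(points, dtype=float)
    dists = np.abs(pts[:, None] - pts[None, :])   # pairwise distances (1-D case)
    return pts[dists.sum(axis=1).argmin()]

points = [1, 2, 3, 6, 9]
print("centroid (mean):", np.mean(points))   # 4.2 -- need not be an actual data point
print("medoid:", medoid(points))             # 3.0 -- always an actual data point
```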
K-means - A centroid-based technique
• The principle of the k-means algorithm is to assign each of the ‘n’ data
points to one of the K clusters where ‘K’ is a user-defined parameter as the
number of clusters desired.
• The objective is to maximize the homogeneity within the clusters and to
maximize the differences between the clusters; this is formalised below as
minimising the within-cluster sum of squared distances.
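In standard notation (the symbols here are not from the original slides), the quantity k-means minimises is the within-cluster sum of squared distances

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}

where C_j is the j-th cluster and \mu_j is its centroid (mean); a smaller J means greater homogeneity within the clusters.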
• As the first step, we choose four random points from the data set as the
centroids, represented by the * signs, and assign each data point to the
nearest centroid to create four clusters.
• In the second step, each centroid is updated to the mean of the points
assigned to it, and the points are then reassigned to the nearest updated
centroid.
• After three iterations, the centroids stop moving, there is no further scope
for refinement, and the k-means algorithm terminates.
• This gives the most logical grouping of the data set into four clusters, where
the homogeneity within each group is highest and the difference between
the groups is maximum.
• The k-means algorithm works by placing sample cluster centers
on an n-dimensional plot and then evaluating whether moving them in
any single direction would result in a new center with higher density,
that is, with more observations closer to it.
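A minimal NumPy sketch of this assign-and-update loop (an illustrative implementation of plain k-means, not code from the referenced texts; X is assumed to be an array of feature vectors, one row per point):

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Plain k-means: pick k random data points as the initial centroids, then
    alternate between assigning points to the nearest centroid and moving each
    centroid to the mean of the points assigned to it."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster
        # (kept in place if the cluster happens to be empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                  else centroids[j] for j in range(k)])
        if np.allclose(new_centroids, centroids):   # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids
```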
• It will always help if we have some prior knowledge about the number of
clusters and we start our k-means algorithm with that prior knowledge.
• For a small data set, a rule of thumb that is sometimes followed is
K ≈ √(n/2),
which means that K is set to the square root of n/2 for a data set of n
examples; for example, n = 200 gives K = 10.
• Unfortunately, this rule of thumb does not work well for large data sets.
• Another common approach is to select the initial centroids at random, but this
often leads to higher squared error in the final clustering, resulting in a
sub-optimal clustering solution.
• The assumption behind selecting random centroids is that multiple subsequent
runs will minimize the sum of squared errors (SSE) and identify the optimal
clusters.
• However, this is often not true, depending on the spread of the data set and
the number of clusters sought.
• One effective approach is to employ the hierarchical clustering technique
on a sample of points from the data set and thereby arrive at K initial clusters.
• The centroids of these initial K clusters are then used as the initial centroids
for k-means, as sketched below.
• This approach is practical when the sample has a small number of points
and K is relatively small compared to the number of data points.
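One way to realise this initialisation, as a sketch that assumes SciPy and scikit-learn are available (the sample size and the Ward linkage are arbitrary illustrative choices):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

def kmeans_with_hierarchical_init(X, k, sample_size=200, seed=0):
    """Hierarchically cluster a small sample, then use the K centroids found
    there as the initial centroids for k-means on the full data set."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(sample_size, len(X)), replace=False)
    sample = X[idx]
    tree = linkage(sample, method="ward")                # agglomerative clustering
    labels = fcluster(tree, t=k, criterion="maxclust")   # cut the tree into K groups
    init = np.array([sample[labels == j].mean(axis=0) for j in range(1, k + 1)])
    return KMeans(n_clusters=k, init=init, n_init=1).fit(X)
```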
• The result of the clustering largely depends on the initial random selection
of cluster centres.
• The complexity of the k-means algorithm is O(nKt), where 'n' is the total
number of data points or objects in the data set, K is the number of
clusters, and 't' is the number of iterations.
• Let us take an example of eight data points; for simplicity, we can consider
them to be 1-D data with values 1, 2, 3, 6, 9, 10, 11, and 25.
• Point 25 is an outlier, and it affects the cluster formation negatively when
the mean of the points is taken as the centroid.
• With K = 2, the initial clusters we arrive at are {1, 2, 3, 6} and {9, 10, 11, 25}.
• The mean of the cluster {1, 2, 3, 6} is 3, and the mean of the cluster
{9, 10, 11, 25} is 13.75.
• So, the SSE within the clusters is
(1 − 3)² + (2 − 3)² + (3 − 3)² + (6 − 3)² + (9 − 13.75)² + (10 − 13.75)² + (11 − 13.75)² + (25 − 13.75)²
= 14 + 170.75 = 184.75.
• If we compare this with the clusters {1, 2, 3, 6, 9} and {10, 11, 25},
• the mean of the cluster {1, 2, 3, 6, 9} is 4.2 and the mean of the cluster
{10, 11, 25} is about 15.33, giving a total SSE of about 42.8 + 140.67 = 183.47.
• In both groupings the outlier 25 drags the mean of its cluster away from the
remaining points; using the medoid (an actual data point) as the cluster
representative is more robust to such outliers.
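These figures can be checked with a few lines of Python (a verification sketch, not part of the original solution):

```python
def sse(cluster):
    """Sum of squared deviations of the points from their cluster mean."""
    m = sum(cluster) / len(cluster)
    return sum((x - m) ** 2 for x in cluster)

partition_a = [[1, 2, 3, 6], [9, 10, 11, 25]]     # means 3 and 13.75
partition_b = [[1, 2, 3, 6, 9], [10, 11, 25]]     # means 4.2 and ~15.33

print(sum(sse(c) for c in partition_a))   # 184.75
print(sum(sse(c) for c in partition_b))   # ~183.47
```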
• For large values of 'n' and 'K', however, the medoid calculation (which checks
candidate medoids against the other points in each cluster) becomes much
costlier than that of the k-means algorithm.
Hierarchical algorithms
• The hierarchical clustering methods are used to group the data into a
hierarchy or tree-like structure.
• It predicts groupings within a dataset by calculating the distance between
each observation and its nearest neighbor and generating a link between
them.
• It then uses those distances to predict subgroups within the dataset.
• If carrying out a statistical study or analyzing biological or environmental
data, hierarchical clustering might be your ideal machine learning solution.
• To visually inspect the results of your hierarchical clustering,
generate a dendrogram, a visualization tool that depicts the
similarities and branching between groups in a data cluster (see figure).
• Several different algorithms can be used to build a dendrogram, and
the algorithm you choose dictates where and how branching
occurs within the clusters.
• In hierarchical clustering, the distance between observations is
commonly measured in one of three ways: Euclidean, Manhattan, or cosine distance.
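For reference, the three measures can be computed as follows (a minimal NumPy sketch; the example vectors are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 1.0, 2.0])

euclidean = np.linalg.norm(a - b)          # straight-line distance
manhattan = np.abs(a - b).sum()            # city-block distance
# Cosine distance = 1 - cosine similarity of the two vectors.
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)
```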
• Hierarchical clustering algorithms are more computationally
expensive than k-means algorithms because with each iteration of
hierarchical clustering, many observations must be compared to many
other observations.
• Weakness: In comparison to k-means clustering, the hierarchical
clustering algorithm is a slower, chunkier unsupervised clustering
algorithm.
• However, the benefit is that hierarchical clustering algorithms are not
subject to errors caused by center convergence at areas of local
minimum density (as exhibited by the k-means clustering algorithm).
• There are two main hierarchical clustering methods:
agglomerative clustering and divisive clustering.
• Agglomerative clustering is a bottom-up technique which starts
with individual objects as clusters and then iteratively merges
them to form larger clusters.
• On the other hand, the divisive method starts with one cluster
with all given objects and then splits it iteratively to form smaller
clusters. See Figure on next slide.
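A short sketch of bottom-up (agglomerative) clustering with a dendrogram, assuming SciPy and Matplotlib are available; the sample data and the Ward linkage are illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Two loose 2-D groups of points, purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

Z = linkage(X, method="ward")   # agglomerative: repeatedly merge the two closest clusters
dendrogram(Z)                   # the branching shows the order in which merges happened
plt.show()
```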
Density based Methods
• Density-based spatial clustering of applications with noise
(DBSCAN) is an unsupervised learning method that works by clustering
core samples (dense areas of a dataset) while simultaneously
demarcating non-core samples (portions of the dataset that are
comparatively sparse).
• When we use the partitioning and hierarchical clustering methods, the
resulting clusters are spherical or nearly spherical in nature.
• For clusters of other shapes, such as S-shaped or unevenly shaped
clusters, these two types of methods do not provide accurate results.
• The density-based clustering approach provides a solution for identifying
clusters of arbitrary shapes.
• The principle is based on identifying the dense areas and sparse areas
within the data set and then running the clustering algorithm.
• DBSCAN is one of the popular density-based algorithms; it creates
clusters by using connected regions with high density.
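A minimal DBSCAN usage sketch with scikit-learn; the eps and min_samples values are arbitrary and would normally be tuned to the data:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: non-spherical clusters that k-means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
print("cluster labels found:", np.unique(db.labels_))   # -1 marks noise (sparse, non-core points)
```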
Applications of clustering
• Text data mining
• Market segmentation
• Anomaly detection
• Data Mining
• Image processing and segmentation
• Identification of human errors during data entry
• Conducting accurate basket analysis, etc.
• Recommendation engines
Problems
1) Apply the k-means algorithm to the given data for K = 3. Use C1(2),
C2(16), C3(38) as the initial cluster centres.
• Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30
Soln:
Calculating the distance between each data point and the cluster centres, we
get the distance table (next slide).
By assigning each data point to the cluster centre whose distance from it is the
minimum over all the cluster centres, we get the assignment table (next slide).
• Similarly, using the new cluster centres we can calculate the distances
from them and allocate clusters based on minimum distance.
• It is found that there is no difference in the clusters formed, and hence we
stop the procedure.
• The final clustering result is given in the following table.
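The iterations above can be reproduced with a short script (a verification sketch rather than the original solution tables); with the given data and initial centres 2, 16 and 38 it converges after the second pass to the clusters {2, 3, 4, 6}, {12, 14, 15, 16, 21, 23, 25} and {30, 31, 35, 38}, with centres 3.75, 18 and 33.5.

```python
def one_d_kmeans(data, centres):
    """1-D k-means: assign each point to the nearest centre, move each centre to
    the mean of its points, and stop when the centres no longer change."""
    while True:
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        new_centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
        if new_centres == centres:
            return clusters, centres
        centres = new_centres

data = [2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30]
clusters, centres = one_d_kmeans(data, [2, 16, 38])
print(clusters)   # [[2, 4, 6, 3], [12, 15, 16, 14, 21, 23, 25], [31, 38, 35, 30]]
print(centres)    # [3.75, 18.0, 33.5]
```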
2) Apply k-means clustering to the dataset given in the table below. Tabulate
all the assignments.
Soln:
After the second iteration, the assignments have not changed; hence the
algorithm is stopped and the points are clustered.
3) Apply the k-medoid algorithm to cluster the following dataset of 6 objects into
two clusters, that is, K = 2.
Reference
• Anuradha Srinivasaraghavan and Vincy Joseph, "Machine Learning".
• Saikat Dutt, Subramanian Chandramouli, and Amit Kumar Das, "Machine Learning".