ML Module5 Clustering


Subject: Machine Learning

Topic: Clustering
Contents:
• Introduction to clustering
• Types of clustering methods
• K-means
• K-medoids
• Issues with clustering
• Applications of clustering
Unsupervised Machine Learning
• Unsupervised learning is the training of a machine using information that is
neither classified nor labeled, allowing the algorithm to act on that
information without guidance.
• The task of the machine is to group unsorted information according to
similarities, patterns and differences, without any prior training on the data.
• A large family of unsupervised learning algorithms is categorized as
clustering.
Clustering
▪ Clustering: the process of grouping a set of objects into classes of
similar objects
▪ Documents within a cluster should be similar.
▪ Documents from different clusters should be dissimilar.
Clustering
• In cluster analysis, a data set is divided into different groups based on the
similarity of the data.
• Clustering is a form of machine learning - the machine in this case is your
computer, and learning refers to an algorithm that is repeated over and over
until a certain set of predetermined conditions is met.
• Learning algorithms are generally run until the point where the final results
will not change, no matter how many additional times the algorithm is passed
over the data.
A data set with clear cluster structure

• How would you design an algorithm for finding the three clusters in this
case?
Data set for the conference attendees
• The primary driver of clustering is knowledge discovery rather than
prediction.
• Clustering is defined as an unsupervised machine learning task
that automatically divides the data into clusters or groups of
similar items.
• The primary guideline of the clustering task is that the data points inside a
cluster should be very similar to each other but very different from those
outside the cluster.
• Through clustering, we are trying to label the objects with class
labels.
• But clustering is somewhat different from the classification and
numeric prediction discussed in supervised learning chapters.
• In each of these cases, the goal was to create a model that
relates features to an outcome or to other features and the
model identifies patterns within the data.
• In contrast, clustering creates new data.
• Unlabeled objects are given a cluster label which is inferred
entirely from the relationship of attributes within the data.
Hard vs. soft clustering
▪ Hard clustering: Each document belongs to exactly one cluster
▪ More common and easier to do
▪ Soft clustering: A document can belong to more than one cluster.
▪ Makes more sense for applications like creating browsable
hierarchies
▪ You may want to put a pair of sneakers in two clusters: (i) sports
apparel and (ii) shoes
▪ You can only do that with a soft clustering approach.
Clustering Techniques
• The major clustering techniques are
• Partitioning methods,
• Hierarchical methods, and
• Density-based methods.
Partitioning methods
• Partitioning method: construct a partition of n documents into a set of K
clusters.
• Given: a set of documents and the number K.
• Find: a partition into K clusters that optimizes the chosen partitioning
criterion.
• Finding the globally optimal partition would require exhaustively enumerating
all possible partitions, which is intractable.
• Effective heuristic methods: the K-means and K-medoids algorithms.
• Two of the most important algorithms for partitioning-based clustering are
k-means and k-medoids.

• In the k-means algorithm, each cluster is represented by its centroid, which
is normally the mean of the group of points in the cluster.
• Similarly, the k-medoids algorithm identifies the medoid, which is the most
centrally located, representative point of a group of points.
• In most cases the centroid does not correspond to an actual data point,
whereas the medoid is always an actual data point.
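To make the centroid/medoid distinction concrete, a small NumPy sketch with made-up points:

```python
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 2.0], [10.0, 8.0]])

# Centroid: the mean of the points; usually not an actual data point.
centroid = points.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to all others.
pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
medoid = points[pairwise.sum(axis=1).argmin()]

print("centroid:", centroid)  # [4.0, 3.0] - not one of the points
print("medoid:  ", medoid)    # one of the original points
```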
K-means - A centroid-based technique
• The principle of the k-means algorithm is to assign each of the ‘n’ data
points to one of the K clusters, where ‘K’ is a user-defined parameter
specifying the number of clusters desired.
• The objective is to maximize the homogeneity within the clusters and also to
maximize the differences between the clusters.
• The homogeneity and differences are measured in terms of the distance
between the objects or points in the data set.
Algorithm of K-means
Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to form K
clusters
Step 3: Measure the distance of each point in the cluster from the centroid
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of
the clusters.
Step 5: Identify the new centroid of each cluster as the mean of the points
assigned to it
Step 6: Repeat Steps 2 to 5 to refine until centroids do not change
end loop
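A minimal NumPy sketch of these steps (the function name, convergence test, and the assumption that no cluster becomes empty are illustrative, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Plain k-means sketch: X is an (n, d) array, k the desired number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points from the data as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Steps 2-3: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recompute each centroid as the mean of the points assigned to it
        # (this sketch assumes no cluster ends up empty).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 6: stop refining once the centroids no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 4: the SSE measures the quality of the final clustering.
    sse = float(np.sum((X - centroids[labels]) ** 2))
    return centroids, labels, sse
```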
Example
• Let us fix K = 4, implying that we want to create four clusters out of this
data set.
• As the first step, we assign four random points from the data set as the
centroids, represented by the * signs, and we assign the data points to the
nearest centroid to create four clusters.
• In the second step, on the basis of the distance of the points from the
corresponding centroids, the centroids are updated and the points are
reassigned to the updated centroids.
• After three iterations, we find that the centroids are no longer moving, as
there is no scope for further refinement, and thus the k-means algorithm
terminates.
• This provides us with the most logical grouping of the data set into four
clusters, where the homogeneity within the groups is highest and the
difference between the groups is maximum.
• The k-means algorithm works by placing sample cluster centers on an
n-dimensional plot and then evaluating whether moving them in any single
direction would result in a new center with higher density - with more
observations closer to it.
• The centers are moved from regions of lower density to regions of higher
density until all centers are within a region of local maximum density - a
true center of the cluster, where each cluster gets a maximum number of
points closest to its cluster center.
Strengths & Weaknesses of the K-means algorithm
Choosing appropriate number of clusters
• One of the most important success factors in arriving at a correct clustering
is to start with the correct assumption about the number of clusters.
• Different numbers of starting clusters lead to completely different splits of
the data.
• It will always help if we have some prior knowledge about the number of
clusters and we start our k-means algorithm with that prior knowledge.
• For a small data set, a rule of thumb that is sometimes followed is

  K = √(n/2)

which means that K is set to the square root of n/2 for a data set of n
examples. For example, for n = 200 data points this suggests K ≈ 10.
• Unfortunately, this thumb rule does not work well for large data sets.
• There are several statistical methods to arrive at a suitable number of
clusters.
Elbow method
• Measures the homogeneity or heterogeneity within the clusters for various
values of ‘K’ and helps in arriving at the optimal ‘K’.
• The homogeneity will increase or heterogeneity will decrease with
increasing ‘K’ as the number of data points inside each cluster reduces
with this increase.
• But these iterations take significant computation effort, and after a certain
point, the increase in homogeneity benefit is no longer in accordance with
the investment required to achieve it, as is evident from the figure.
• This point is known as the elbow point, and the ‘K’ value at this point
produces the optimal clustering performance.
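In practice the elbow method is often run with a sketch like the following, using scikit-learn's KMeans; the data set and the range of K values are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(300, 2)  # placeholder data set

# Compute the within-cluster SSE (scikit-learn calls it inertia_) for a range of K.
sse = {}
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = model.inertia_

# Plotting sse against k and looking for the "elbow" (where the decrease in SSE
# flattens out) gives the K value to use.
for k, v in sse.items():
    print(k, round(v, 2))
```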
Choosing the initial centroids
• One common practice is to choose random points in the data space, based on
the required number of clusters, and refine these points as we move through
the iterations.
• But this often leads to a higher squared error in the final clustering, thus
resulting in a sub-optimal clustering solution.
• The assumption behind selecting random centroids is that multiple subsequent
runs will minimize the SSE and identify the optimal clusters.
• But this is often not true, depending on the spread of the data set and the
number of clusters sought.
• One effective approach is to employ the hierarchical clustering technique on
sample points from the data set and then arrive at K sample clusters; see the
sketch below.
• The centroids of these initial K clusters are used as the initial centroids.
• This approach is practical when the data set has a small number of points and
K is relatively small compared to the number of data points.
• There are procedures, such as bisecting k-means and the use of
post-processing to fix initial clustering issues, which can produce
better-quality initial centroids and thus a better SSE for the final clusters.
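A minimal sketch of this hierarchical-initialization idea using scikit-learn; the data, sample size, and K are placeholders rather than values from the slides:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

X = np.random.rand(5000, 2)  # placeholder data set
k = 4

# Run hierarchical clustering on a small sample of the data ...
sample = X[np.random.choice(len(X), size=200, replace=False)]
hier_labels = AgglomerativeClustering(n_clusters=k).fit_predict(sample)

# ... and use the centroids of the resulting K clusters as the initial centroids.
init_centroids = np.array([sample[hier_labels == j].mean(axis=0) for j in range(k)])

kmeans = KMeans(n_clusters=k, init=init_centroids, n_init=1).fit(X)
```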
Recomputing cluster centroids
• In the k-means algorithm, the iterative step is to recalculate the centroids
of the clusters after each iteration.
• The proximities of the data points to each other within a cluster are
measured, with the aim of minimizing the distances.
• The distance of each data point from its nearest centroid can also be
calculated, and minimizing these distances yields the refined centroid.
• The Euclidean distance between two data points x and y with m attributes is
measured as

  dist(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₘ − yₘ)² )

• The measure of quality of clustering uses the SSE technique. The formula used
is

  SSE = Σ_{k=1}^{K} Σ_{x ∈ C_k} dist(c_k, x)²

  where dist() calculates the Euclidean distance between the centroid c_k of
  the cluster C_k and the data points x in that cluster.
• The summation of such squared distances over all the ‘K’ clusters gives the
total sum of squared error.
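As a quick illustration (the data, labels, and centroids are made up), the total SSE of a clustering can be computed directly from this formula:

```python
import numpy as np

def total_sse(X, labels, centroids):
    """Sum of squared Euclidean distances of each point to its cluster centroid."""
    return float(np.sum((X - centroids[labels]) ** 2))

X = np.array([[1.0], [2.0], [9.0], [10.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1.5], [9.5]])
print(total_sse(X, labels, centroids))  # 0.25 + 0.25 + 0.25 + 0.25 = 1.0
```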
• The lower the SSE for a clustering solution, the better the representative
position of the centroid.
• It is observed that the centroid that minimizes the SSE of a cluster is its
mean.
• One limitation of the squared-error method is that, in the presence of
outliers in the data set, the squared error can distort the mean value of the
clusters.
• Because of the distance-based approach from the centroid to all points in
the data set, the k-means method may not always converge to the global
optimum and often terminates at a local optimum.

• The result of the clustering largely depends on the initial random selection
of cluster centres.
• The complexity of the k-means algorithm is O(nKt), where ‘n’ is the total
number of data points or objects in the data set, K is the number of
clusters, and ‘t’ is the number of iterations.
• Normally, ‘K’ and ‘t’ are kept much smaller than ‘n’, and thus, the k-means
method is relatively scalable and efficient in processing large data sets.
Issues of clustering
• Representation for clustering
• Document representation
• Vector space? Normalization?
• Centroids aren’t length normalized
• Need a notion of similarity/distance
• How many clusters?
• Fixed a priori?
• Completely data driven?
• Avoid “trivial” clusters - too large or small
• If a cluster's too large, then for navigation purposes you've wasted an extra user click
without whittling down the set of documents much.
K-Medoids
• The k-means algorithm is sensitive to outliers in the data set and
inadvertently produces skewed clusters when the means of the data points are
used as centroids.
• Let us take an example of eight data points; for simplicity, we can consider
them to be 1D data with values 1, 2, 3, 6, 9, 10, 11, and 25.
• Point 25 is the outlier, and it affects the cluster formation negatively when
the means of the points are used as centroids.
• With K = 2, the initial clusters we arrive at are {1, 2, 3, 6} and
{9, 10, 11, 25}.
• The mean of the cluster {1, 2, 3, 6} is 3, and the mean of the cluster
{9, 10, 11, 25} is 13.75.
• So, the SSE within the clusters is 14 + 170.75 = 184.75.
• If we compare this with the clusters {1, 2, 3, 6, 9} and {10, 11, 25}, the
mean of the cluster {1, 2, 3, 6, 9} is 4.2, and the mean of the cluster
{10, 11, 25} is approximately 15.33.
• So, the SSE within the clusters is 42.8 + 140.67 ≈ 183.47.
• Because the SSE of the second clustering is lower, k-means tends to put
point 9 in the same cluster with 1, 2, 3, and 6, though the point is logically
nearer to points 10 and 11.
• This skewness is introduced by the outlier point 25, which shifts the mean
away from the centre of the cluster.
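A quick NumPy sketch that verifies these SSE values for the two candidate clusterings:

```python
import numpy as np

def sse(cluster):
    c = np.asarray(cluster, dtype=float)
    return float(np.sum((c - c.mean()) ** 2))

# Clustering 1: {1, 2, 3, 6} and {9, 10, 11, 25}
print(sse([1, 2, 3, 6]) + sse([9, 10, 11, 25]))   # 14.0 + 170.75 = 184.75

# Clustering 2: {1, 2, 3, 6, 9} and {10, 11, 25}
print(sse([1, 2, 3, 6, 9]) + sse([10, 11, 25]))   # 42.8 + 140.67 ≈ 183.47
```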
• k-medoids provides a solution to this problem. Instead of considering the
mean of the data points in the cluster, k-medoids considers k
representative data points from the existing points in the data set as the
centres of the clusters.
• It then assigns the data points according to their distance from these
centres to form k clusters.
• The SSE is calculated as

  SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(x, o_i)²

  where o_i is the representative point or object (medoid) of cluster C_i.
• Thus, the k-medoids method groups n objects in k clusters by
minimizing the SSE.
• Because of the use of medoids, which are actual representative data points,
k-medoids is less influenced by the outliers in the data.
• One of the practical implementations of the k-medoids principle is
Partitioning Around Medoids (PAM).
• Step 1: Randomly choose k points in the data set as the initial
representative points.
loop
• Step 2: Assign each of the remaining points to the cluster which has
the nearest representative point.
• Step 3: Randomly select a non-representative point r in each cluster.
• Step 4: Tentatively swap the representative point j with r and compute the
new SSE after swapping.
• Step 5: If SSE_new < SSE_old, then accept the swap of j with r to form the
new set of k representative objects;
• Step 6: Refine the k clusters on the basis of the nearest representative
point. Logic continues until there is no change.
• end loop
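A minimal NumPy sketch of this PAM loop (function and variable names are illustrative; classical PAM typically evaluates every possible swap rather than one random candidate per cluster):

```python
import numpy as np

def pam(X, k, max_iters=100, seed=0):
    """Simple PAM-style k-medoids sketch; X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Step 1: randomly choose k points as the initial representative points (medoids).
    medoid_idx = rng.choice(n, size=k, replace=False)

    def assign_and_sse(m_idx):
        # Step 2: assign every point to the nearest representative point.
        d = np.linalg.norm(X[:, None, :] - X[m_idx][None, :, :], axis=2)
        labels = d.argmin(axis=1)
        return labels, float(np.sum(d[np.arange(n), labels] ** 2))

    labels, sse = assign_and_sse(medoid_idx)
    for _ in range(max_iters):
        improved = False
        for j in range(k):
            # Steps 3-5: try swapping medoid j with a random non-representative point r.
            candidates = np.setdiff1d(np.arange(n), medoid_idx)
            r = rng.choice(candidates)
            trial = medoid_idx.copy()
            trial[j] = r
            trial_labels, trial_sse = assign_and_sse(trial)
            if trial_sse < sse:            # accept the swap only if the SSE decreases
                medoid_idx, labels, sse = trial, trial_labels, trial_sse
                improved = True
        if not improved:                   # Step 6: stop when no swap helps
            break
    return X[medoid_idx], labels, sse
```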
• Though the k-medoids algorithm provides an effective way to eliminate the
influence of noise or outliers in the data set, which was the problem in the
k-means algorithm, it is expensive in terms of calculations.
• The complexity of each iteration in the k-medoids algorithm is O(k(n − k)²).
• For large values of ‘n’ and ‘k’, this calculation becomes much costlier than
that of the k-means algorithm.
Hierarchical algorithms
• The hierarchical clustering methods are used to group the data into
hierarchy or tree-like structure.
• It predicts groupings within a dataset by calculating the distance and
generating a link between each singular observation and its nearest
neighbor.
• It then uses those distances to predict subgroups within a dataset.
• If carrying out a statistical study or analyzing biological or environmental
data, hierarchical clustering might be your ideal machine learning solution.
• To visually inspect the results of your hierarchical clustering, generate a
dendrogram - a visualization tool that depicts the similarities and branching
between groups in a data cluster (see the figure).
• Several different algorithms can be used to build a dendrogram, and the
algorithm you choose dictates where and how branching occurs within the
clusters.

• In hierarchical clustering, the distance between observations is
measured in three different ways: Euclidean, Manhattan, or Cosine.
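A minimal SciPy sketch that builds and plots a dendrogram; the data, linkage method, and metric are illustrative choices:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

X = np.random.rand(20, 2)  # placeholder observations

# Build the hierarchy; 'metric' can be 'euclidean', 'cityblock' (Manhattan) or
# 'cosine', and 'method' (single, complete, average, ward, ...) dictates where
# and how branching occurs.
Z = linkage(X, method="average", metric="euclidean")

dendrogram(Z)   # visualize the merges as a tree
plt.show()
```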
• Hierarchical clustering algorithms are more computationally
expensive than k-means algorithms because with each iteration of
hierarchical clustering, many observations must be compared to many
other observations.
• Weakness: In comparison to k-means clustering, the hierarchical
clustering algorithm is a slower, chunkier unsupervised clustering
algorithm.
• However, the benefit is that hierarchical clustering algorithms are not
subject to errors caused by center convergence at areas of local minimum
density (as exhibited with the k-means clustering algorithms).
• There are two main hierarchical clustering methods:
agglomerative clustering and divisive clustering.
• Agglomerative clustering is a bottom-up technique which starts
with individual objects as clusters and then iteratively merges
them to form larger clusters.
• On the other hand, the divisive method starts with one cluster
with all given objects and then splits it iteratively to form smaller
clusters. See Figure on next slide.
Density based Methods
• Density-based spatial clustering of applications with noise (DBSCAN) is an
unsupervised learning method that works by clustering core samples (dense
areas of a data set) while simultaneously marking out non-core samples
(portions of the data set that are comparatively sparse).
• When we use the partitioning and hierarchical clustering methods, the
resulting clusters are spherical or nearly spherical in nature.
• In the case of other cluster shapes, such as S-shaped or unevenly shaped
clusters, the above two types of methods do not provide accurate results.
• The density based clustering approach provides a solution to identify
clusters of arbitrary shapes.
• The principle is based on identifying the dense areas and sparse areas within
the data set and then running the clustering algorithm.
• DBSCAN is one of the popular density-based algorithms; it creates clusters
by using connected regions of high density.
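A minimal scikit-learn sketch of DBSCAN on a non-spherical (two half-moons) data set; the eps and min_samples values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: an "S-shaped" data set where k-means struggles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighbourhood radius, min_samples the density threshold for a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

labels = db.labels_   # cluster index per point; -1 marks noise/outliers
print(set(labels))
```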
Applications of clustering
• Text data mining
• Market segmentation
• Anomaly detection
• Data Mining
• Image processing and segmentation
• Identification of human errors during data entry
• Conducting accurate basket analysis, etc.
• Recommendation engines
Problems
1) Apply the k-means algorithm to the given data for k = 3. Use C1(2),
C2(16), C3(38) as the initial cluster centres.
• Data: 2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30

Soln:
Calculating the distance between each data point and cluster
centres, we get the following table.(Next slide)
By assigning the data points to the cluster center whose distance from it is
minimum of all the cluster centers, we get the following table.
• Similarly, using the new cluster centers we can calculate the
distance from it and allocate clusters based on minimum
distance.
• It is found that there is no difference in the clusters formed, and hence we
stop this procedure.
• The final clustering result is given in the following table.
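Since the distance tables are not reproduced here, the following scikit-learn sketch reproduces this problem with the given initial centres; the printed grouping is computed by the code:

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([2, 4, 6, 3, 31, 12, 15, 16, 38, 35, 14, 21, 23, 25, 30],
                dtype=float).reshape(-1, 1)
init_centres = np.array([[2.0], [16.0], [38.0]])  # C1, C2, C3 as given

km = KMeans(n_clusters=3, init=init_centres, n_init=1).fit(data)

for c in range(3):
    print(f"Cluster {c + 1}:", sorted(data[km.labels_ == c].ravel().tolist()),
          "centre =", round(km.cluster_centers_[c, 0], 2))
# Expected grouping: {2, 3, 4, 6}, {12, 14, 15, 16, 21, 23, 25}, {30, 31, 35, 38}
```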
2) Apply k-means clustering for the datasets given in table below. Tabulate
all the assignments.
Soln:
After the second iteration, the assignments have not changed; hence the
algorithm is stopped and the points are clustered.
3) Apply k-medoid algorithm to cluster the following dataset of 6 objects into
two clusters, that is k=2.
Reference
• “Machine Learning” - Anuradha Srinivasaraghavan, Vincy Joseph
• “Machine Learning” - Saikat Dutt, Subramanian Chandramouli, Amit Kumar Das
