
Unit-V

Clustering Fundamentals
Conventionally, each group is called a cluster and the process of finding the function G is called clustering.
Types of Clustering
Hard clustering techniques, where each element must belong to a single cluster.
The alternative approach, called soft clustering (or fuzzy clustering), is based on a membership score that defines how much the elements are "compatible" with each cluster.
The generic clustering function becomes:
G(x_i) = m_i = (m_i1, m_i2, ..., m_ik)
• How to determine the correct number of clusters (K)?
Refer to this site: https://builtin.com/data-science/elbow-method
Elbow method
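A minimal sketch of the elbow method using scikit-learn: plot K-Means inertia (within-cluster sum of squared distances) against K and look for the "elbow" where the curve flattens. The dataset and the range 1-10 are illustrative assumptions.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
ks = range(1, 11)
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

plt.plot(ks, inertias, marker='o')
plt.xlabel('Number of clusters K')
plt.ylabel('Inertia')
plt.title('Elbow method')
plt.show()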
K-Means Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as the initial cluster centers.
3. Assign each object to its closest cluster center according to the Euclidean distance function.
4. Calculate the centroid (mean) of all objects in each cluster and use these as the new centers.
5. Repeat steps 3 and 4 until the same points are assigned to each cluster in consecutive rounds.
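A minimal NumPy sketch of these steps, assuming X is an (n_samples, n_features) array. This is illustrative only; edge cases such as empty clusters are not handled.

import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: select k points at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each center as the mean of the objects in its cluster.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop once assignments (and hence the centers) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers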
Advantages
• Easy to implement.
• With a large number of variables, K-Means may be computationally faster than hierarchical clustering (if K is small).
• K-Means may produce tighter clusters than hierarchical clustering.
• An instance can change cluster (move to another cluster) when the centroids are recomputed.
Disadvantages
.Difficult to predict the number of clusters (K-Value)
• Initial seeds have a strong impact on the final results
• The order of the data has an impact on the final
results
• Sensitive to scale: rescaling your datasets
(normalization or standardization) will completely
change results. While this itself is not bad, not
realizing that you have to spend extra attention(on to
scaling your data might be bad).
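A brief sketch of the scale sensitivity, assuming an illustrative dataset where one feature's range is artificially exaggerated; standardizing the features before clustering can change the assignments completely.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
X[:, 1] *= 1000  # one large-scale feature now dominates Euclidean distances

labels_raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

X_scaled = StandardScaler().fit_transform(X)  # zero mean, unit variance
labels_scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
# labels_raw and labels_scaled can differ substantially.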
The general concept of clustering
• The k-Nearest Neighbors (k-NN) algorithm
• Gaussian mixture
• The K-Means algorithm
• Common methods for selecting the optimal number of clusters (inertia, silhouette plots, Calinski-Harabasz index, and cluster instability)
• Evaluation methods based on the ground truth (homogeneity, completeness, and Adjusted Rand Index)
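A hedged sketch of how these selection and evaluation methods can be computed with scikit-learn. The dataset is illustrative, and ground-truth labels (y_true) are assumed available for the external criteria.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             homogeneity_score, completeness_score,
                             adjusted_rand_score)

X, y_true = make_blobs(n_samples=300, centers=4, random_state=1)
labels = KMeans(n_clusters=4, n_init=10, random_state=1).fit_predict(X)

# Internal criteria (no ground truth needed):
print(silhouette_score(X, labels))
print(calinski_harabasz_score(X, labels))
# External criteria (require the ground-truth labels y_true):
print(homogeneity_score(y_true, labels))
print(completeness_score(y_true, labels))
print(adjusted_rand_score(y_true, labels))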
• Let's consider a dataset of m-dimensional samples:
X = {x_1, x_2, ..., x_N}, where x_i ∈ ℝ^m
• Let's assume that it's possible to find a criterion (not necessarily a unique one) so that each sample can be associated with a specific group according to its peculiar features and the overall structure of the dataset:
G(x_i) = k_j, where k_j identifies one of the groups
• Conventionally, each group is called a cluster, and the process of finding the function, G, is called clustering.
• We don't impose any restriction on the clusters; however, as our approach is unsupervised, there should be a similarity criterion to join some elements and separate other ones.
• Different clustering algorithms are based on alternative strategies to solve this problem, and can yield very different results.
• In the following graph, there's an example of clustering based on four sets of bidimensional samples; the decision to assign a point to a cluster depends only on its features and sometimes on the position of a set of other points (its neighborhood):
• Hard clustering techniques, where each element must belong to a single cluster.
• The alternative approach, called soft clustering (sometimes called fuzzy clustering), is based on a membership score that defines how much the elements are compatible with each cluster. The generic clustering function becomes as follows:
G(x_i) = m_i = (m_i1, m_i2, ..., m_ik)
• A vector, m_i, represents the relative membership of x_i, and it's often normalized as a probability distribution (that is, the sum is always forced to be equal to 1).
• In other scenarios, the single degrees are kept bounded between 0 and 1 and, hence, are considered as different probabilities.
• This is often a consequence of the underlying algorithm. As we are going to see in the Gaussian mixture section, a sample implicitly belongs to all distributions, so for each of them we obtain a probability that is equivalent to a membership degree.
• In the majority of cases, hard clustering is the most appropriate choice, especially if the assigned cluster is a piece of information that's immediately employed in other tasks.
• In all of these cases, a soft approach whose output is a vector must be transformed into a single prediction (normally using the argmax(•) operator).
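A minimal sketch of soft memberships and their conversion to a single prediction via argmax, using scikit-learn's GaussianMixture on an illustrative dataset.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=2)
gm = GaussianMixture(n_components=3, random_state=2).fit(X)

m = gm.predict_proba(X)      # membership vectors m_i; each row sums to 1
hard = np.argmax(m, axis=1)  # collapse each vector to a single cluster index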
