
Machine Learning Clustering Algorithms-I

[UNIT-I]
• What is Machine Learning

In the real world, we are surrounded by humans who can learn from their experiences, while computers and machines simply work on our instructions.

But can a machine also learn from experience or past data the way humans do? This is where Machine Learning comes in.
Machine Learning

• Machine learning is a subset of artificial intelligence that focuses primarily on creating algorithms that enable a computer to learn independently from data and previous experiences.

• Machine learning algorithms build a mathematical model that, without being explicitly programmed, helps make predictions or decisions with the assistance of sample historical data, known as training data.
How does Machine Learning work

• A machine learning system builds prediction models, learns from previous data, and predicts the output for new data whenever it receives it.

• The more data that is available, the better the model that can be built, and hence the more accurate the predicted output.
Classification of Machine Learning
1) Supervised Learning
In supervised learning, sample-labeled data are provided to the machine
learning system for training, and the system then predicts the output based on
the training data.

Supervised learning can be grouped further in two categories of algorithms:

• Classification
• Regression
Classification

• Classification is defined as the process of recognition, understanding, and grouping of objects and ideas into preset categories.

• In short, classification is a form of “pattern recognition”

Fig: Classification of vegetables and groceries


Types of Classification

• Decision Tree Classification: This type divides a dataset into segments based on
particular feature variables. The divisions’ threshold values are typically the mean or
mode of the feature variable if they happen to be numerical.

• K-Nearest Neighbor: This classification type identifies the K nearest neighbors to a given observation point. It then uses those K points to evaluate the proportions of each class and predicts the class with the highest proportion.

• Logistic Regression: This classification type isn't complex so it can be easily adopted with
minimal training. It predicts the probability of Y being associated with the X input
variable.
• Naïve Bayes: This classifier is one of the most effective yet simplest algorithms. It’s based
on Bayes’ theorem, which describes how event probability is evaluated based on the
previous knowledge of conditions that could be related to the event.

• Random Forest Classification: Random forest processes many decision trees, each one
predicting a value for target variable probability. You then arrive at the final output by
averaging the probabilities.

• Support Vector Machines: This algorithm extends the support vector classifier in a way that makes it well suited to non-linear decision boundaries. It does this by enlarging the feature variable space using special functions known as kernels.
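
To make these ideas concrete, here is a minimal sketch (not from the slides) that trains two of the classifiers above on a tiny, made-up produce dataset; the feature values, labels, and parameters are illustrative assumptions only.

# Hypothetical toy data: [weight in grams, width in cm] for two kinds of produce.
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

X = [[150, 7], [170, 8], [140, 6], [1200, 20], [1100, 18], [1300, 22]]
y = ["apple", "apple", "apple", "melon", "melon", "melon"]

tree = DecisionTreeClassifier().fit(X, y)            # decision tree classifier
knn = KNeighborsClassifier(n_neighbors=3).fit(X, y)  # K-nearest neighbors

# Both models assign a preset category to a new, unseen observation.
print(tree.predict([[160, 7]]))
print(knn.predict([[1150, 19]]))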
Regression

• It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values. Therefore,
regression algorithms help predict continuous variables such as house prices, market trends,
weather patterns, and oil and gas prices.
Types of Regression:

• Decision Tree Regression: The primary purpose of this regression is to divide the dataset into smaller subsets. These subsets are then used to predict the value of the target variable for any data point that falls into them.

• Principal Components Regression: a statistical technique for regression analysis used to reduce a dataset's dimensionality by projecting it onto a lower-dimensional subspace.
• This is done by finding a set of orthogonal (i.e., uncorrelated) linear combinations of the original variables, called principal components, that capture the most variance in the data.
• Polynomial Regression: Polynomial regression is a type of regression analysis used in
statistics and machine learning when the relationship between the independent variable
(input) and the dependent variable (output) is not linear.
• Random Forest Regression: Random Forest regression is heavily used in Machine
Learning. It uses multiple decision trees to predict the output. Random data points are
chosen from the given dataset and used to build a decision tree via this algorithm.

• Simple Linear Regression: Simple Linear Regression is a type of Regression algorithm that
models the relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is linear or a sloped
straight line, hence it is called Simple Linear Regression.
• Support Vector Regression: This regression type solves both linear and non-linear
models. It uses non-linear kernel functions, like polynomials, to find an optimal solution
for non-linear models.
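
As a hedged illustration of regression (not part of the original material), the sketch below fits a simple linear regression to made-up house-size/price data; the numbers are assumptions for demonstration only.

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: house size in square metres vs. price.
X = np.array([[50], [60], [80], [100], [120]])
y = np.array([150000, 180000, 240000, 300000, 360000])

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # learned slope and intercept
print(model.predict([[90]]))           # predicted (continuous) price for a 90 m2 house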
2) Unsupervised Learning

• Unsupervised learning is a learning method in which a machine learns without any supervision.

• The training is provided to the machine with a set of data that has not been labeled, classified, or categorized, and the algorithm needs to act on that data without any supervision.

• The goal of unsupervised learning is to restructure the input data into new features or a
group of objects with similar patterns.
• In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from a huge amount of data.

• It can be further classified into two categories of algorithms:


• Clustering
• Association
Clustering

• Clustering or cluster analysis is a machine learning technique which groups an unlabelled dataset.

• It can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with the possible similarities remain in a group that has less or no similarities with another group."

• Note: It does this by finding similar patterns in the unlabelled dataset, such as shape, size, color, or behavior, and divides the data as per the presence and absence of those patterns.
• Example: Let's understand the clustering technique with the real-world example of a mall:

• When we visit any shopping mall, we can observe that things with similar usage are grouped together.

• For example, t-shirts are grouped in one section and trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are kept in separate areas so that we can easily find what we need.

• The clustering technique works in the same way. Another example of clustering is grouping documents according to topic.

• The diagram below illustrates the working of the clustering algorithm: the different fruits are divided into several groups with similar properties.
Clustering Algorithms

1.K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).

2.Mean-shift algorithm: Mean-shift algorithm tries to find the dense areas in the smooth
density of data points. It is an example of a centroid-based model, that works on
updating the candidates for centroid to be the center of the points within a given region.

3. DBSCAN Algorithm: It stands for Density-Based Spatial Clustering of Applications with Noise. It is an example of a density-based model, similar to mean-shift but with some remarkable advantages. In this algorithm, areas of high density are separated from areas of low density. Because of this, clusters can be found with arbitrary shapes.
4. Expectation-Maximization Clustering using GMM: This algorithm can be used as an
alternative for the k-means algorithm or for those cases where K-means can fail. In GMM, it
is assumed that the data points are Gaussian distributed.

5. Agglomerative Hierarchical algorithm: The agglomerative hierarchical algorithm performs bottom-up hierarchical clustering. Each data point is treated as a single cluster at the outset, and clusters are then successively merged. The cluster hierarchy can be represented as a tree structure.

6. Affinity Propagation: It is different from other clustering algorithms in that it does not require the number of clusters to be specified. In this algorithm, pairs of data points exchange messages until convergence. Its O(N²T) time complexity (N points, T iterations) is the main drawback of this algorithm.
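
As a rough sketch (assuming scikit-learn and a made-up toy dataset), the snippet below runs a partitioning, a density-based, and an agglomerative algorithm from the list above on the same unlabeled points; the hyperparameters are illustrative only.

import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering

X = np.array([[1, 2], [1, 4], [2, 3],      # one dense region (toy data)
              [8, 8], [9, 9], [8, 10]])    # another dense region

print(KMeans(n_clusters=2, n_init=10).fit_predict(X))        # partitioning (K must be specified)
print(DBSCAN(eps=2.0, min_samples=2).fit_predict(X))         # density-based
print(AgglomerativeClustering(n_clusters=2).fit_predict(X))  # bottom-up hierarchical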
Applications of Clustering

• In Identification of Cancer Cells: The clustering algorithms are widely used for the
identification of cancerous cells. It divides the cancerous and non-cancerous data sets
into different groups.
• In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.

• Customer Segmentation: It is used in market research to segment customers based on their choices and preferences.

• In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
Association

• Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another and maps them accordingly so that the relationship can be used profitably.

• It tries to find interesting relations or associations among the variables of the dataset.
• It is based on different rules for discovering the interesting relations between variables in the database.
• The association rule learning is one of the very important concepts of machine learning,
and it is employed in Market Basket analysis, Web usage mining, continuous
production, etc.

• Here market basket analysis is a technique used by the various big retailers to discover
the associations between items.

• For example, if a customer buys bread, they are also likely to buy butter, eggs, or milk, so these products are kept on the same shelf or nearby. Consider the diagram below:
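
As a small illustration of the idea (made-up baskets, no association-rule library assumed), the sketch below computes the support and confidence of the rule {bread} -> {butter}:

# Hypothetical shopping baskets.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "eggs"},
    {"milk", "eggs"},
    {"bread", "butter", "eggs"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
both = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = both / n          # fraction of baskets containing bread and butter (3/5)
confidence = both / bread   # of the baskets with bread, how many also have butter (3/4)
print(support, confidence)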
Applications of Association Rule Learning

• Market Basket Analysis: It is one of the popular examples and applications of association
rule mining. This technique is commonly used by big retailers to determine the
association between items.

• Medical Diagnosis: Association rules help in identifying the probability of illness for a particular disease, which can support diagnosis and treatment.

• Protein Sequence: The association rules help in determining the synthesis of artificial
Proteins.
3) Reinforcement Learning

• Reinforcement learning is a feedback-based learning method in which a learning agent gets a reward for each right action and a penalty for each wrong action.

• The agent learns automatically from this feedback and improves its performance.
• In reinforcement learning, the agent interacts with the environment and explores it.

• The goal of an agent is to get the most reward points, and hence, it improves its
performance.
• Note: A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
Unsupervised Learning
• Unsupervised learning is a type of machine learning in which models are trained using
unlabeled datasets and are allowed to act on that data without any supervision

• Example: Suppose the unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs. The algorithm has never been trained on the given dataset, which means it does not have any idea about the features of the dataset.

• The task of the unsupervised learning algorithm is to identify the image features on its own. It will perform this task by clustering the image dataset into groups according to the similarities between images.
Why use Unsupervised Learning?

• Unsupervised learning helps to find useful insights from the data.


• Unsupervised learning is much like the way a human learns to think from their own experiences, which makes it closer to real AI.

• Unsupervised learning works on unlabeled and uncategorized data, which makes unsupervised learning more important.

• In the real world, we do not always have input data with corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
Working of unsupervised learning can be understood by the below
diagram:
• Here, we have taken unlabeled input data, which means it is not categorized and corresponding outputs are also not given. This unlabeled input data is fed to the machine learning model in order to train it.

• First, the model will interpret the raw data to find hidden patterns, and then a suitable algorithm such as k-means clustering or hierarchical clustering is applied.

• Once a suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
• Clustering: Clustering is a method of grouping the objects into clusters such that objects
with most similarities remain in a group and have less or no similarities with the objects
of another group.

• Cluster analysis finds the commonalities between the data objects and categorizes them
as per the presence and absence of those commonalities.

• Association: An association rule is an unsupervised learning method which is used for finding relationships between variables in a large database.

• It determines the set of items that occur together in the dataset. Association rule makes
marketing strategy more effective.
• Such as people who buy X items (suppose bread) also tend to purchase Y (Butter/Jam)
items. A typical example of an Association rule is Market Basket Analysis.
Unsupervised Learning algorithms:

• K-means clustering
• Hierarchical clustering
• Neural Networks
• Principal Component Analysis
• Apriori algorithm
Advantages of Unsupervised Learning

• Unsupervised learning is used for more complex tasks as compared to supervised learning because, in unsupervised learning, we don't have labeled input data.

• Unsupervised learning is preferable because it is easier to obtain unlabeled data than labeled data.
Disadvantages of Unsupervised Learning
• Unsupervised learning is intrinsically more difficult than supervised learning
as it does not have corresponding output.

• The result of the unsupervised learning algorithm might be less accurate as the input data is not labeled, and the algorithm does not know the exact output in advance.
Supervised Learning vs Unsupervised Learning

• Supervised learning algorithms are trained using labeled data; unsupervised learning algorithms are trained using unlabeled data.

• A supervised learning model takes direct feedback to check whether it is predicting the correct output; an unsupervised learning model does not take any feedback.

• A supervised learning model predicts the output; an unsupervised learning model finds the hidden patterns in data.

• In supervised learning, input data is provided to the model along with the output; in unsupervised learning, only input data is provided to the model.

• The goal of supervised learning is to train the model so that it can predict the output when given new data; the goal of unsupervised learning is to find hidden patterns and useful insights from the unknown dataset.

• Supervised learning needs supervision to train the model; unsupervised learning does not need any supervision.

• Supervised learning can be categorized into Classification and Regression problems; unsupervised learning can be classified into Clustering and Association problems.
Clustering Techniques
• Clustering has been studied extensively for more than 40 years and across many disciplines
due to its broad applications.

• Partitioning methods: k-Means algorithm [1957, 1967], k-Medoids algorithm, k-Modes [1998], Fuzzy c-means algorithm [1999]

• Hierarchical methods: Divisive, Agglomerative

• Density-based methods: STING [1997], DBSCAN [1996], CLIQUE [1998]
Clustering: Some Examples
• Document/Image/Webpage clustering
• Image segmentation (clustering pixels)
• Clustering web search results
• Clustering (people) nodes in (social) networks/graphs
• ... and many more.

Picture courtesy: http://people.cs.uchicago.edu/~pff/segment/
Types of Clustering
• Flat or Partitional clustering
  • Partitions are independent of each other

• Hierarchical clustering
  • Partitions can be visualized using a tree structure (a dendrogram)
  • Hierarchical clustering gives us a clustering at multiple levels of granularity
  • In hierarchical clustering, we can look at a clustering for any given number of clusters by "cutting" the dendrogram at an appropriate level, so the number of clusters does not have to be specified in advance
• Partitioning Clustering
• It is a type of clustering that divides the data into non-hierarchical groups. It is also known as the centroid-based method.

• In this type, the dataset is divided into a set of k groups, where k defines the number of pre-defined groups. The cluster centers are created in such a way that the distance between the data points of one cluster and its centroid is minimal compared to the distance to other cluster centroids.
• Hierarchical Clustering:

• Hierarchical clustering can be used as an alternative to partitional clustering, as there is no requirement to pre-specify the number of clusters to be created.

• In this technique, the dataset is divided into clusters to create a tree-like structure, which is also called a dendrogram.

• The observations, or any number of clusters, can be selected by cutting the tree at the correct level.

• The most common example of this method is the Agglomerative Hierarchical algorithm.
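
A minimal sketch of this idea with SciPy (made-up points, illustrative parameters): it performs agglomerative clustering, draws the dendrogram, and then cuts the tree into two clusters.

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = [[1, 2], [2, 2], [2, 3], [8, 8], [8, 9], [9, 9]]   # toy points

Z = linkage(X, method="ward")   # bottom-up (agglomerative) merges
dendrogram(Z)                   # the tree-like structure (dendrogram)
plt.title("Dendrogram (toy data)")
plt.show()

print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters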
• K-means is a clustering algorithm designed to divide data
points into distinct clusters depending on the similarities
between the observations. Data points that share relevant
characteristics are grouped together, implying that data
points in different clusters would be dissimilar.
How to choose the value of K?

• The Elbow Method

• The method consists of plotting the total WCSS (within-cluster sum of squares) as a function of the number of clusters, and picking the "elbow" of the curve as the appropriate number of clusters. See the example below.
Implementing K-means Clustering

• Start by visualizing some data points:

import matplotlib.pyplot as plt

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

plt.scatter(x, y)
plt.show()
Result
• Now we utilize the elbow method to visualize the inertia for different values of K:

from sklearn.cluster import KMeans

data = list(zip(x, y))


inertias = []

for i in range(1,11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1,11), inertias, marker='o')


plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
Result
• The elbow method shows that 2 is a good value for K, so we
retrain and visualize the result:

• kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()
• Result
• Import the modules you need.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
• Create arrays that resemble two variables in a dataset. Note
that while we only use two variables here, this method will
work with any number of variables:
• x = [4, 5, 10, 4, 3, 11, 14 , 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]
• Turn the data into a set of points:
• data = list(zip(x, y))
print(data)
• Result:
• [(4, 21), (5, 19), (10, 24), (4, 17), (3, 16),
(11, 25), (14, 24), (6, 22), (10, 21), (12, 21)]

• In order to find the best value for K, we need to run K-means across our data for a range of possible values. We only have 10 data points, so the maximum number of clusters is 10. So for each value K in range(1,11), we train a K-means model and plot the inertia at that number of clusters:
• inertias = []

for i in range(1,11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1,11), inertias, marker='o')


plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
• Result:
We can see that the "elbow" on the graph above (where the inertia becomes more linear) is at K=2. We can then fit our K-means algorithm one more time and plot the different clusters assigned to the data:

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)

plt.scatter(x, y, c=kmeans.labels_)
plt.show()
Result:
The below diagram explains the working of the K-means Clustering
Algorithm:
K-means: An Illustration
Initializing K-means.

• Initializing K-means is the first step of the K-means clustering algorithm, which is an unsupervised machine learning technique used for partitioning data points into K clusters.

• The goal of K-means is to minimize the sum of squared distances between data points and their corresponding cluster centers, often referred to as centroids.
General process for initializing K-means

1. Choose the number of clusters (K): Decide the number of clusters you want
to partition your data into. This is a user-specified parameter and requires
domain knowledge or experimentation to determine the optimal value.

2. Select initial centroids: Randomly select K data points from your dataset as
the initial centroids. These data points will serve as the starting positions
for the cluster centers.

3. Assign data points to clusters: Calculate the distance between each data
point and all the centroids. Assign each data point to the cluster with the
nearest centroid. This step forms the initial clustering.
Cont.

4. Update centroids: Calculate the mean (center) of the data points within each cluster and move the centroids to these new positions.

5. Repeat steps 3 and 4: Continue steps 3 and 4 iteratively until convergence or until a specified number of iterations is reached. Convergence happens when the cluster assignments stop changing significantly or when the centroids stop moving substantially.

6. Final clustering: Once convergence is reached, the algorithm stops, and you have your final clustering result.
What have we learned so far…
• Step 1: Begin with a decision on the value of k = number of clusters .

• Step 2: Put any initial partition that classifies the data into k clusters. You
may assign the training samples randomly, or systematically as the
following:

1. Take the first k training samples as single-element clusters.

2. Assign each of the remaining (N-k) training samples to the cluster with the nearest centroid. After each assignment, recompute the centroid of the gaining cluster.
• Step 3: Take each sample in sequence and compute its distance from the centroid
of each of the clusters. If a sample is not currently in the cluster with the closest
centroid, switch this sample to that cluster and update the centroid of the cluster
gaining the new sample and the cluster losing the sample.

• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments.
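
The batch (Lloyd's) variant of this procedure can be sketched in plain NumPy as follows; this is an illustrative implementation only (random initialization, Euclidean distance, no handling of empty clusters), reusing the toy x/y points from the earlier example.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Steps 1-2: pick k of the data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 3: assign every point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when a full pass changes nothing.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[4, 21], [5, 19], [10, 24], [4, 17], [3, 16],
              [11, 25], [14, 24], [6, 22], [10, 21], [12, 21]], dtype=float)
print(kmeans(X, k=2))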
Uses of Clustering

• Market Segmentation – Businesses use clustering to group their customers and use targeted advertisements to attract a larger audience.
• Market Basket Analysis – Shop owners analyze their sales and figure out which items are frequently bought together by customers. For example, according to one study in the USA, diapers and beer were often bought together by fathers.
• Social Network Analysis – Social media sites use your data to understand your browsing behaviour and provide you with targeted friend recommendations or content recommendations.
• Medical Imaging – Doctors use clustering to find diseased areas in diagnostic images like X-rays.
• Anomaly Detection – Clustering can be used to find outliers in a real-time data stream or to identify fraudulent transactions.
Hard Clustering
•Definition: In hard clustering, each data point is assigned to one and only one cluster. The
assignment is definitive, meaning a data point belongs to a single cluster with certainty.
•Algorithms: Common hard clustering algorithms include K-Means, Hierarchical Clustering
(Agglomerative and Divisive), and DBSCAN.
Limitations:
•Not Flexible: It doesn’t handle overlapping clusters well.
•Sensitivity to Outliers: Outliers can significantly affect the clustering results.
Soft Clustering
•Definition: In soft clustering, data points can belong to multiple clusters with varying
degrees of membership. This approach provides a probability or degree of membership for
each cluster.
•Algorithms: Fuzzy C-Means (FCM) is a well-known soft clustering algorithm. Other
probabilistic models include Gaussian Mixture Models (GMM).
Applications: Suitable for scenarios where clusters may overlap or where data points are not
distinctly separable, such as in image segmentation or gene expression data.
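
As a hedged sketch of soft clustering (made-up points, illustrative parameters), the snippet below fits a Gaussian Mixture Model with scikit-learn and prints each point's degree of membership in the two components.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [5.0, 5.0], [5.2, 4.8], [4.9, 5.1],
              [3.0, 3.0]])                      # a point roughly between the groups

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.predict(X))         # hard labels, for comparison
print(gmm.predict_proba(X))   # soft memberships: each row sums to 1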
Hard vs Soft Clustering
Hard Clustering
• Also known as Exclusive Clustering or Crisp Clustering.
• Each data point is assigned to exactly one cluster.
• Results in non-overlapping clusters with clear boundaries.
• E.g., K-means
Advantages

• Simplicity and Ease of Implementation: Hard clustering algorithms are straightforward to understand and
implement.

• Computational Efficiency: Ideal for handling large datasets efficiently.

• Clear Cluster Membership: Each data point unambiguously belongs to a single cluster.

Disadvantages

• Sensitive to Initial Placement: Results can vary depending on the initial cluster centroids.

• Limited Handling of Overlapping Data: May struggle with complex data structures that have overlapping
clusters.

• Impact of Outliers: Outliers can significantly affect cluster assignments.


Soft Clustering
• Also known as Fuzzy Clustering.

• Allows data points to belong to multiple clusters with varying membership degrees.

• Provides more flexibility and captures uncertainty in cluster assignments.

• E.g., Fuzzy C-means (FCM)


Advantages

• Handling Overlapping Data: Well-suited for datasets with complex or overlapping structures.

• Robustness to Outliers: The ability of a clustering algorithm to minimize the influence of outliers on the assignment of data points to clusters (a probability distribution over clusters instead of a hard assignment to a single cluster).

Disadvantages
• Computational Complexity: Soft clustering methods can be more computationally expensive
than their hard clustering counterparts.

• Determining the Number of Clusters: Requires the pre-specification of the number of clusters
or fuzziness coefficient.

• Interpretability Challenges: Fuzzy memberships might be more challenging to interpret.


How to Find the Optimal Number of Clusters

• The elbow method is a popular technique used to determine the optimal number
of clusters (k) in a clustering algorithm, such as K-means.

• It involves plotting the cost (inertia) of clustering as a function of the number of clusters and looking for the "elbow point" in the plot.

• The elbow point is the value of k at which the inertia starts to level off or
decrease at a slower rate.

• This point indicates that adding more clusters does not significantly improve the
clustering quality and suggests the appropriate number of clusters for the data.
Cont.
• The elbow method helps us find the optimal number of clusters for
our data.

• It involves analyzing the inertia (within-cluster sum of squares) as a function of the number of clusters.
Elbow Method

• The Elbow Method is a technique that we use to determine the number of centroids (k) to use in a k-means clustering algorithm.

• In this method, to determine the k-value we iterate over k = 1 to k = n (here n is a hyperparameter that we choose as per our requirement).

• For every value of k, we calculate the within-cluster sum of squares (WCSS) value.

• Note: WCSS is defined as the sum of squared distances between the centroids and each point.
Implementation of the Elbow Method Using Sklearn in Python

• We will see how to implement the elbow method in 4 steps. First, we will create a small dataset of points, then we will apply k-means on this dataset and calculate the WCSS value for k between 1 and 9.

Step 1: Importing the required libraries

from sklearn.cluster import KMeans
from sklearn import metrics
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
Step 2: Creating and Visualizing the data
We will create a random array and visualize its distribution
# Creating the data
x1 = np.array([3, 1, 1, 2, 1, 6, 6, 6, 5, 6,\
7, 8, 9, 8, 9, 9, 8, 4, 4, 5, 4])
x2 = np.array([5, 4, 5, 6, 5, 8, 6, 7, 6, 7, \
1, 2, 1, 2, 3, 2, 3, 9, 10, 9, 10])
X = np.array(list(zip(x1, x2))).reshape(len(x1), 2)

# Visualizing the data


plt.plot()
plt.xlim([0, 10])
plt.ylim([0, 10])
plt.title('Dataset')
plt.scatter(x1, x2)
plt.show()
Output:

From the above visualization, we can see that the optimal number of clusters should be around 3. But visualizing the data alone cannot always give the right answer.

Note: In the code below, distortion is computed as the average Euclidean distance from each point to its nearest cluster center, while inertia is the sum of squared distances of the points to their nearest cluster center.
Step 3: Building the clustering model and calculating the values of the Distortion and Inertia:

distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)

    distortion = sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                  'euclidean'), axis=1)) / X.shape[0]
    distortions.append(distortion)
    inertias.append(kmeanModel.inertia_)

    mapping1[k] = distortion
    mapping2[k] = kmeanModel.inertia_
Step 4: Tabulating and Visualizing the Results
a) Using the different values of Distortion:

for key, val in mapping1.items():
    print(f'{key} : {val}')
Output:
Next, we will plot the graph of k versus distortion:

plt.plot(K, distortions, 'bx-')


plt.xlabel('Values of K')
plt.ylabel('Distortion')
plt.title('The Elbow Method using Distortion')
plt.show()
Output:
b) Using the different values of Inertia:

for key, val in mapping2.items():
    print(f'{key} : {val}')

Output:
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
Output:
Elbow Method: Step by Step
• Choose a range of k values to consider (e.g., 1 to 10).

• For each k value, run the K-means algorithm with k clusters on the
data.

• Calculate the inertia (sum of squared distances within clusters) for each k.

• Plot the inertia values against the corresponding k values.


Cont.
Identifying the Elbow Point
• The elbow point is the optimal k value.

• It is the point where inertia starts to level off or decreases at a slower rate.

• Adding more clusters beyond this point may not significantly improve
clustering quality.
Limitations of the Elbow Method
• The elbow method may not always yield a clear-cut elbow point,
especially for complex datasets.

• It is subjective, and the optimal k value may vary based on interpretation.
Weaknesses of K-Means Clustering

1. When the number of data points is small, the initial grouping will determine the clusters significantly.

2. The number of clusters, K, must be determined beforehand. A disadvantage is that the algorithm does not yield the same result with each run, since the resulting clusters depend on the initial random assignments.

3. We never know the real clusters; using the same data, if it is input in a different order it may produce different clusters when the number of data points is small.

4. It is sensitive to the initial conditions. Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.
Applications of K-Means Clustering
• It is relatively efficient and fast. It computes the result in O(tkn), where n is the number of objects or points, k is the number of clusters, and t is the number of iterations.

• k-means clustering can be applied to machine learning or data mining.

• It is used on acoustic data in speech understanding to convert waveforms into one of k categories (known as Vector Quantization or Image Segmentation).

• It is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
CONCLUSION
• K-means algorithm is useful for undirected knowledge discovery and
is relatively simple.

• K-means has found widespread usage in a lot of fields, ranging from unsupervised learning in neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision, to many others.
Drawback of standard K-means algorithm:

• One disadvantage of the K-means algorithm is that it is sensitive to the initialization of the centroids or the mean points.

• So, if a centroid is initialized to be a "far-off" point, it might just end up with no points associated with it, and at the same time, more than one cluster might end up linked with a single centroid.

• Similarly, more than one centroid might be initialized into the same cluster, resulting in poor clustering.

• To overcome the above-mentioned drawback, we use K-means++.
K Means++
This algorithm ensures a smarter initialization of the centroids and
improves the quality of the clustering.

Apart from initialization, the rest of the algorithm is the same as the standard K-means algorithm.

That is, K-means++ is the standard K-means algorithm coupled with a smarter initialization of the centroids.
Kmeans++
• KMeans++ is an improved initialization technique for the KMeans
clustering algorithm.
• The primary goal of KMeans++ is to provide a better initial set of
cluster centroids, which in turn leads to faster convergence and
higher-quality clustering results compared to the standard random
initialization used in traditional KMeans.

(Problem with standard K-means: different initializations will result in different clusters.)
Kmeans++ Algorithm
• Select the First Centroid:
• Choose one data point randomly from the dataset as the first centroid.
• Calculate Distances:
• For each data point that has not been selected as a centroid, calculate the distance (usually
using Euclidean distance) from that point to the nearest centroid that has already been
chosen.
• Select Subsequent Centroids:
• The next centroid is chosen from the remaining data points with a probability proportional to
the square of the distance from the nearest existing centroid. This ensures that points that are
far from existing centroids are more likely to be selected as new centroids.
• Repeat Step 2 and 3:
• Repeat the distance calculation and centroid selection process until the desired number of
centroids is reached.
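
The seeding procedure above can be sketched in a few lines of NumPy (illustrative only; it reuses the five points from the numerical example that follows). In scikit-learn, this initialization is available as KMeans(init='k-means++'), which is the default.

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick the first centroid uniformly at random.
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Step 2: squared distance of every point to its nearest chosen centroid.
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centroids], axis=0)
        # Step 3: sample the next centroid with probability proportional to d2.
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X = np.array([[2, 3], [3, 2], [8, 8], [10, 9], [11, 10]], dtype=float)
print(kmeans_pp_init(X, k=2))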
Kmeans++ Numerical
• Let's say we have a small dataset of two-dimensional points:
• Data Points:
• (2, 3), (3, 2), (8, 8), (10, 9), (11, 10)
• We want to perform KMeans clustering with K=2 clusters.
Step 1
• Select First Centroid:
• Choose a random data point as the first centroid. Let's say we choose (2, 3).
Step 2
• Calculate Distances:
• Calculate the squared distances from each data point to the chosen centroid (2,
3):
• Distance from (3, 2) = (3 - 2)^2 + (2 - 3)^2 = 2
• Distance from (8, 8) = (8 - 2)^2 + (8 - 3)^2 = 61
• Distance from (10, 9) = (10 - 2)^2 + (9 - 3)^2 = 100
• Distance from (11, 10) = (11 - 2)^2 + (10 - 3)^2 = 130
Step 3
• Select Subsequent Centroid:
• The next centroid is selected based on the calculated squared distances. The probability of selecting a point
as the next centroid is proportional to its squared distance from the nearest existing centroid.
• Calculate the probabilities for each point:
• Probability for (3, 2) = 2 / (2 + 61 + 100 + 130) ≈ 0.0068
• Probability for (8, 8) = 61 / (2 + 61 + 100 + 130) ≈ 0.2081
• Probability for (10, 9) = 100 / (2 + 61 + 100 + 130) ≈ 0.3412
• Probability for (11, 10) = 130 / (2 + 61 + 100 + 130) ≈ 0.4436
• Choose the next centroid based on these probabilities. For simplicity, let's assume (11, 10) is selected as the
second centroid.
Step 4
• Repeat Steps 2 and 3:
• Recalculate distances from each data point to the two centroids (2, 3) and (11, 10) and calculate
new probabilities.
• Choose the third centroid based on the probabilities.
KMedians
• "KMedians," another clustering algorithm that is similar to KMeans but uses
medians instead of means to calculate the cluster centroids.
• Medians is particularly useful when dealing with data that might have outliers or
when you want to create clusters based on the median values of the data points
rather than the mean values.
Kmedians Algorithm
• Initialization:
• Just like KMeans, KMedians also requires an initial set of cluster centroids. These centroids can be randomly chosen
data points or selected using other methods.
• Assignment Step:
• Each data point is assigned to the nearest centroid based on the Manhattan distance (also known as L1 distance)
rather than Euclidean distance, which is used in KMeans. The Manhattan distance is the sum of the absolute
differences between the coordinates of the data point and the centroid.
• Update Step:
• After the assignment step, the medians of the data points in each cluster are calculated. The median value of a set of
numbers is the middle value when they are sorted in order. It's a more robust measure than the mean because it's
less affected by outliers.
• Iteration:
• The assignment and update steps are repeated iteratively until convergence. In each iteration, data points are
reassigned to the nearest centroids based on Manhattan distance, and then the centroids are updated using the
medians of the data points in each cluster.
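
A rough NumPy sketch of this loop (illustrative only; no empty-cluster handling), using the one-dimensional points from the numerical example that follows:

import numpy as np

def kmedians(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialization: k data points chosen as the initial medians.
    medians = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: nearest median under Manhattan (L1) distance.
        dists = np.abs(X[:, None, :] - medians[None, :, :]).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: coordinate-wise median of each cluster.
        new_medians = np.array([np.median(X[labels == j], axis=0) for j in range(k)])
        if np.allclose(new_medians, medians):
            break
        medians = new_medians
    return labels, medians

X = np.array([[2], [3], [7], [8], [10], [12], [15], [20], [25]], dtype=float)
print(kmedians(X, k=2))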
Advantages of KMedians
• The main advantage of KMedians over KMeans is its robustness to outliers.

• Outliers can significantly influence the mean-based centroids in KMeans, whereas medians are less affected by extreme values.

• However, KMedians can also be computationally more expensive than KMeans, since calculating medians involves sorting data points.
Kmedians Numerical
• Let's consider a small dataset of one-dimensional points:
• Data Points: 2, 3, 7, 8, 10, 12, 15, 20, 25
• Assuming we want to perform KMedians clustering with K=2 clusters.
Step 1
• Initialization:
• Choose two initial medians randomly or using some other initialization method.
Let's say we start with medians at 7 and 20.
Step 2
• Assignment Step:
• For each data point, calculate the Manhattan distance (L1 distance) from each of the cluster medians. Assign each data point to the
nearest median's cluster.

Data Point | Distance to Median 7 | Distance to Median 20 | Assigned Cluster
-----------|----------------------|-----------------------|-----------------
     2     |          5           |          18           |        1
     3     |          4           |          17           |        1
     7     |          0           |          13           |        1
     8     |          1           |          12           |        1
    10     |          3           |          10           |        1
    12     |          5           |           8           |        1
    15     |          8           |           5           |        2
    20     |         13           |           0           |        2
    25     |         18           |           5           |        2
Step 3
• Update Step (Calculation of Medians):
• For each cluster, calculate the median of the data points in that cluster. The
median is the middle value when the data points are sorted.

Cluster 1 Median: 7.5


Cluster 2 Median: 20

• Repeat the assignment and update steps until convergence. In each iteration,
reassign data points to the nearest medians and update the medians based on
the median of the data points in each cluster.
K-Medoid
• K-Medoids is particularly useful when dealing with non-numerical data or when you want
clusters that are centered around actual data points.

• "K-Medoids," is a clustering algorithm similar to KMeans but uses actual data points as
cluster representatives (medoids) instead of the mean or centroid.

• Medoids are representative objects of a data set or a cluster within a data set whose sum of
dissimilarities to all the objects in the cluster is minimal.

• Medoids are similar in concept to means or centroids, but medoids are always restricted to
be members of the data set.

• Medoids are most commonly used on data when a mean or centroid cannot be defined,
such as graphs.
K-Medoid
1. First, we select K random data points from the dataset and use them as
medoids.

2. Now, we will calculate the distance of each data point from the medoids. You
can use any of the Euclidean, Manhattan distance, or squared Euclidean
distance as the distance measure.

3. Once we find the distance of each data point from the medoids, we will
assign the data points to the clusters associated with each medoid. The data
points are assigned to the medoids at the closest distance.

4. After determining the clusters, we will calculate the sum of the distance of all
the non-medoid data points to the medoid of each cluster. Let the cost be Ci.
5. Now, we will select a random data point Dj from the dataset
and swap it with a medoid Mi. Here, Dj becomes a
temporary medoid. After swapping, we will calculate the
distance of all the non-medoid data points to the current
medoid of each cluster. Let this cost be Cj.

6. If Ci > Cj, the medoid set with Dj as one of the medoids is made permanent. Otherwise, we undo the swap, and Mi is reinstated as the medoid.

7. Repeat 4 to 6 until no change occurs in the clusters.
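
A brute-force sketch of this swap-based procedure in NumPy (illustrative assumptions: Manhattan distance and the five toy points used earlier; real implementations such as PAM are organized more efficiently):

import numpy as np

def total_cost(X, medoid_idx):
    # Step 4: sum of distances from every point to its nearest medoid.
    d = np.abs(X[:, None, :] - X[medoid_idx][None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

def kmedoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))   # steps 1-3
    improved = True
    while improved:                                              # step 7
        improved = False
        for m in range(k):
            for j in range(len(X)):
                if j in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = j                                 # step 5: trial swap
                if total_cost(X, candidate) < total_cost(X, medoids):
                    medoids = candidate                          # step 6: keep the swap
                    improved = True
    d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)
    return d.argmin(axis=1), X[medoids]

X = np.array([[2, 3], [3, 2], [8, 8], [10, 9], [11, 10]], dtype=float)
print(kmedoids(X, k=2))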


K-Medoids Numerical
