Machine Learning Clustering Algorithms
[UNIT-I]
• What is Machine Learning?
Humans learn from experience. But can a machine also learn from experience or past data like humans? This is where Machine Learning comes in.
Machine Learning
• The amount of data helps to build a better model: the more training data available, the more accurately the model can predict the output.
Classification of Machine Learning
1) Supervised Learning
In supervised learning, sample-labeled data are provided to the machine
learning system for training, and the system then predicts the output based on
the training data.
• Classification
• Regression
Classification
• Decision Tree Classification: This type divides a dataset into segments based on
particular feature variables. The divisions’ threshold values are typically the mean or
mode of the feature variable if they happen to be numerical.
• K-Nearest Neighbor: This Classification type identifies the K nearest neighbors to a given
observation point. It then uses K points to evaluate the proportions of each type of target
variable and predicts the target variable that has the highest ratio.
• Logistic Regression: This classification type is not complex, so it can be adopted easily with minimal training. It predicts the probability that Y is associated with the X input variable.
• Naïve Bayes: This classifier is one of the most effective yet simplest algorithms. It’s based
on Bayes’ theorem, which describes how event probability is evaluated based on the
previous knowledge of conditions that could be related to the event.
• Random Forest Classification: Random forest processes many decision trees, each one
predicting a value for target variable probability. You then arrive at the final output by
averaging the probabilities.
• Support Vector Machines: This algorithm employs support vector classifiers with an important modification, making it well suited to evaluating non-linear decision boundaries. This is made possible by enlarging the feature-variable space with special functions known as kernels.
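To make the supervised workflow concrete, here is a minimal scikit-learn sketch of two of the classifiers above (the toy data, the train/test split, and the parameter choices are illustrative assumptions, not part of the original notes):

# Minimal sketch: training two of the classifiers above on labeled toy data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Generate a small labeled dataset (supervised learning requires labels).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (LogisticRegression(), KNeighborsClassifier(n_neighbors=5)):
    model.fit(X_train, y_train)                 # learn from the training data
    print(type(model).__name__, model.score(X_test, y_test))  # test accuracy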
Regression
• It is a supervised machine learning technique, used to predict the value of the dependent
variable for new, unseen data. It models the relationship between the input features and the
target variable, allowing for the estimation or prediction of numerical values. Therefore,
regression algorithms help predict continuous variables such as house prices, market trends,
weather patterns, and oil and gas prices.
Types of Regression:
• Decision Tree Regression: The primary purpose of this regression is to divide the dataset into smaller subsets. These subsets are then used to predict the value of a data point relevant to the problem statement.
• Simple Linear Regression: Simple Linear Regression is a type of Regression algorithm that
models the relationship between a dependent variable and a single independent
variable. The relationship shown by a Simple Linear Regression model is linear or a sloped
straight line, hence it is called Simple Linear Regression.
• Support Vector Regression: This regression type solves both linear and non-linear
models. It uses non-linear kernel functions, like polynomials, to find an optimal solution
for non-linear models.
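As a quick illustration of regression, here is a minimal Simple Linear Regression sketch with scikit-learn (the data points are made up for illustration):

# Minimal sketch: fitting a Simple Linear Regression model.
import numpy as np
from sklearn.linear_model import LinearRegression

# One independent variable (X) and one continuous dependent variable (y).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("prediction for x = 6:", model.predict([[6]])[0])  # predict a new value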
2) Unsupervised Learning
• The training is provided to the machine with the set of data that has not been labeled,
classified, or categorized, and the algorithm needs to act on that data without any
supervision.
• The goal of unsupervised learning is to restructure the input data into new features or a
group of objects with similar patterns.
• In unsupervised learning, we don't have a predetermined result. The machine tries to
find useful insights from a huge amount of data.
• Clustering can be defined as "a way of grouping the data points into different clusters, consisting of similar data points. The objects with possible similarities remain in a group that has less or no similarity with another group."
• Note: It does it by finding some similar patterns in the unlabelled dataset such as shape,
size, color, behavior, etc., and divides them as per the presence and absence of those
similar patterns.
• Example: Let's understand the clustering technique with the real-world example of a shopping mall:
• When we visit any shopping mall, we can observe that things with similar usage are grouped together.
• For example, the t-shirts are grouped in one section and the trousers in another; similarly, in the vegetable section, apples, bananas, mangoes, etc., are kept in separate sections so that we can easily find things.
• The clustering technique also works in the same way. Other examples of
clustering are grouping documents according to the topic.
• The below diagram explains the working of the clustering algorithm. We can
see the different fruits are divided into several groups with similar
properties.
Clustering Algorithms
1.K-Means algorithm: The k-means algorithm is one of the most popular clustering
algorithms. It classifies the dataset by dividing the samples into different clusters of equal
variances. The number of clusters must be specified in this algorithm. It is fast with fewer
computations required, with the linear complexity of O(n).
2.Mean-shift algorithm: The mean-shift algorithm tries to find the dense areas in a smooth density of data points. It is an example of a centroid-based model that works by updating candidate centroids to be the center of the points within a given region.
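Both algorithms are available in scikit-learn. The minimal sketch below (the toy blob data is an assumption for illustration) contrasts K-means, where the number of clusters must be specified, with mean-shift, which infers it:

# Minimal sketch: K-means vs. mean-shift on toy 2-D data.
import numpy as np
from sklearn.cluster import KMeans, MeanShift

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),   # blob around (0, 0)
               rng.normal(6, 1, (50, 2))])  # blob around (6, 6)

km = KMeans(n_clusters=2, n_init=10).fit(X)  # number of clusters is given
ms = MeanShift().fit(X)                      # number of clusters is inferred
print("K-means centers:\n", km.cluster_centers_)
print("Mean-shift found", len(ms.cluster_centers_), "clusters")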
Applications of Clustering
• In Identification of Cancer Cells: Clustering algorithms are widely used for the identification of cancerous cells. They divide cancerous and non-cancerous data sets into different groups.
• In Search Engines: Search engines also work on the clustering technique. The search
result appears based on the closest object to the search query. It does it by grouping
similar data objects in one group that is far from the other dissimilar objects. The
accurate result of a query depends on the quality of the clustering algorithm used.
• In Biology: It is used in the biology stream to classify different species of plants and
animals using the image recognition technique.
Association
• Association rule learning is a type of unsupervised learning technique that checks for the
dependency of one data item on another data item and maps accordingly so that it can
be more profitable.
• It tries to find some interesting relations or associations among the variables of the dataset.
• It is based on different rules to discover the interesting relations between variables in the
database.
• The association rule learning is one of the very important concepts of machine learning,
and it is employed in Market Basket analysis, Web usage mining, continuous
production, etc.
• Here, market basket analysis is a technique used by various big retailers to discover the associations between items.
• For example, if a customer buys bread, he is also likely to buy butter, eggs, or milk, so these products are stored on the same shelf or nearby, as the sketch below illustrates.
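A hedged code sketch of this idea, assuming the third-party mlxtend library (not part of scikit-learn) and a toy set of transactions:

# Minimal sketch: mining association rules from toy transactions with mlxtend.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

transactions = [["bread", "butter", "milk"],
                ["bread", "butter"],
                ["bread", "eggs"],
                ["butter", "milk"]]

# One-hot encode the transactions into a boolean DataFrame.
te = TransactionEncoder()
df = pd.DataFrame(te.fit(transactions).transform(transactions),
                  columns=te.columns_)

frequent = apriori(df, min_support=0.5, use_colnames=True)   # frequent itemsets
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])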
Applications of Association Rule Learning
• Market Basket Analysis: It is one of the popular examples and applications of association
rule mining. This technique is commonly used by big retailers to determine the
association between items.
• Medical Diagnosis: With the help of association rules, patients can be cured easily, as it
helps in identifying the probability of illness for a particular disease.
• Protein Sequence: The association rules help in determining the synthesis of artificial
Proteins.
3) Reinforcement Learning
• The agent receives rewards or penalties as feedback for its actions; it learns automatically from this feedback and improves its performance.
• In reinforcement learning, the agent interacts with the environment and explores it.
• The goal of an agent is to get the most reward points, and hence, it improves its
performance.
• Note: A robotic dog that automatically learns the movement of its limbs is an example of reinforcement learning.
Unsupervised Learning
• Unsupervised learning is a type of machine learning in which models are trained using
unlabeled datasets and are allowed to act on that data without any supervision
• The task of the unsupervised learning algorithm is to identify the image features on its own. An unsupervised learning algorithm will perform this task by clustering the image dataset into groups according to similarities between images.
Why use Unsupervised Learning?
• In the real world, we do not always have input data with a corresponding output, so to solve such cases we need unsupervised learning.
Working of Unsupervised Learning
Working of unsupervised learning can be understood by the below
diagram:
• Here, we have taken an unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to the
machine learning model in order to train it.
• Firstly, it will interpret the raw data to find the hidden patterns, and then a suitable algorithm such as k-means clustering or hierarchical clustering will be applied.
• Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
• Clustering: Clustering is a method of grouping the objects into clusters such that objects
with most similarities remain in a group and have less or no similarities with the objects
of another group.
• Cluster analysis finds the commonalities between the data objects and categorizes them
as per the presence and absence of those commonalities.
• Association: It determines the sets of items that occur together in the dataset. Association rule learning makes marketing strategies more effective.
• For example, people who buy X items (suppose bread) also tend to purchase Y items (butter/jam). A typical example of an association rule is Market Basket Analysis.
Unsupervised Learning algorithms:
• K-means clustering
• KNN (k-nearest neighbors)
• Hierarchical clustering
• Neural Networks
• Principal Component Analysis
• Apriori algorithm
Supervised vs. Unsupervised Learning

Supervised Learning | Unsupervised Learning
--------------------------------------------------------------------------
Takes direct feedback to check if it is predicting the correct output or not. | Does not take any feedback.
Predicts the output. | Finds the hidden patterns in data.
Input data is provided to the model along with the output. | Only input data is provided to the model.
The goal is to train the model so that it can predict the output when given new data. | The goal is to find hidden patterns and useful insights from the unknown dataset.
Needs supervision to train the model. | Does not need any supervision to train the model.
Clustering Techniques
• Hierarchical methods: Divisive, Agglomerative
• Density-based methods: STING [1997], DBSCAN [1996], CLIQUE [1998]
Clustering: Some Examples
Document/Image/Webpage Clustering
Image Segmentation (clustering pixels)
Clustering web search results
Clustering (people) nodes in (social) networks/graphs
.. and many more.
K-Means in Python: Example
• Import the modules you need:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

• Create arrays that resemble two variables in a dataset. Note that while we only use two variables here, this method will work with any number of variables:

x = [4, 5, 10, 4, 3, 11, 14, 6, 10, 12]
y = [21, 19, 24, 17, 16, 25, 24, 22, 21, 21]

• Turn the data into a set of points:

data = list(zip(x, y))
print(data)

Result:
[(4, 21), (5, 19), (10, 24), (4, 17), (3, 16), (11, 25), (14, 24), (6, 22), (10, 21), (12, 21)]

• Visualize the data points:

plt.scatter(x, y)
plt.show()

Result: a scatter plot of the ten raw data points.

• Now we utilize the elbow method to visualize the inertia for different values of K:

inertias = []

for i in range(1, 11):
    kmeans = KMeans(n_clusters=i)
    kmeans.fit(data)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()

Result: the elbow plot shows that 2 is a good value for K, so we retrain and visualize the result:

kmeans = KMeans(n_clusters=2)
kmeans.fit(data)
plt.scatter(x, y, c=kmeans.labels_)
plt.show()

Result: a scatter plot with each point colored by its assigned cluster.
The below diagram explains the working of the K-means Clustering
Algorithm:
K-means: An Illustration
Initializing K-means.
1. Choose the number of clusters (K): Decide the number of clusters you want
to partition your data into. This is a user-specified parameter and requires
domain knowledge or experimentation to determine the optimal value.
2. Select initial centroids: Randomly select K data points from your dataset as
the initial centroids. These data points will serve as the starting positions
for the cluster centers.
3. Assign data points to clusters: Calculate the distance between each data
point and all the centroids. Assign each data point to the cluster with the
nearest centroid. This step forms the initial clustering.
Cont.
• Step 2: Start from any initial partition that classifies the data into k clusters; you may assign the training samples randomly or systematically.
• Step 4: Repeat step 3 until convergence is achieved, that is, until a pass through the training samples causes no new assignments. A minimal code sketch of this loop is given below.
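A minimal NumPy sketch of this assign/update loop (the standard algorithm on 2-D data; the variable names and convergence check are illustrative assumptions):

# Minimal sketch of the K-means loop: assign points, update centroids, repeat.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (assumes no cluster becomes empty, for simplicity).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # convergence: no new movement
            break
        centroids = new_centroids
    return labels, centroids

For example, labels, centers = kmeans(np.array(list(zip(x, y)), dtype=float), 2) reproduces the two-cluster fit from the earlier tutorial code.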
Hard vs. Soft Clustering
Advantages of Hard Clustering
• Simplicity and Ease of Implementation: Hard clustering algorithms are straightforward to understand and implement.
• Clear Cluster Membership: Each data point unambiguously belongs to a single cluster.
Disadvantages
• Sensitive to Initial Placement: Results can vary depending on the initial cluster centroids.
• Limited Handling of Overlapping Data: May struggle with complex data structures that have overlapping clusters.
Advantages of Soft Clustering
• Handling Overlapping Data: Well-suited for datasets with complex or overlapping structures.
Disadvantages
• Computational Complexity: Soft clustering methods can be more computationally expensive
than their hard clustering counterparts.
• Determining the Number of Clusters: Requires the pre-specification of the number of clusters
or fuzziness coefficient.
The Elbow Method
• The elbow method is a popular technique used to determine the optimal number of clusters (k) in a clustering algorithm, such as K-means.
• The elbow point is the value of k at which the inertia starts to level off or
decrease at a slower rate.
• This point indicates that adding more clusters does not significantly improve the
clustering quality and suggests the appropriate number of clusters for the data.
Cont.
• The elbow method helps us find the optimal number of clusters for
our data.
From the above visualization, we can see that the optimal number of clusters should be around
3. But visualizing the data alone cannot always give the right answer.
Note: Here, distortion is the average distance from each point to its nearest cluster center (computed with the Euclidean distance in the code below), while inertia is the sum of squared distances to the nearest center.
Step 3: Building the clustering model and calculating the values of the Distortion and Inertia:

import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# X is the dataset being clustered, assumed to be defined earlier
# (e.g., as a NumPy array of data points).
distortions = []
inertias = []
mapping1 = {}
mapping2 = {}
K = range(1, 10)

for k in K:
    # Building and fitting the model
    kmeanModel = KMeans(n_clusters=k).fit(X)
    # Distortion: average Euclidean distance to the nearest cluster center
    distortions.append(sum(np.min(cdist(X, kmeanModel.cluster_centers_,
                                        'euclidean'), axis=1)) / X.shape[0])
    # Inertia: sum of squared distances to the nearest cluster center
    inertias.append(kmeanModel.inertia_)
    mapping1[k] = distortions[-1]
    mapping2[k] = inertias[-1]
plt.plot(K, inertias, 'bx-')
plt.xlabel('Values of K')
plt.ylabel('Inertia')
plt.title('The Elbow Method using Inertia')
plt.show()
Output: an elbow plot of inertia against the values of K.
Elbow Method: Step by Step
• Choose a range of k values to consider (e.g., 1 to 10).
• For each k value, run the K-means algorithm with k clusters on the data and record the inertia (or distortion).
• Plot the inertia against k and locate the "elbow" point where the curve starts to flatten.
• Adding more clusters beyond this point may not significantly improve clustering quality.
Limitations of the Elbow Method
• The elbow method may not always yield a clear-cut elbow point,
especially for complex datasets.
Weaknesses of K-means
• When the number of data points is small, the initial grouping will determine the clusters significantly.
• We never know the real clusters: using the same data, if it is input in a different order, it may produce different clusters when the number of data points is few.
Other Uses
• K-means is also used for choosing color palettes on old-fashioned graphical display devices and for image quantization.
CONCLUSION
• The K-means algorithm is useful for undirected knowledge discovery and is relatively simple.
• K-means has found widespread usage in many fields, ranging from unsupervised learning of neural networks, pattern recognition, classification analysis, artificial intelligence, image processing, and machine vision, to many others.
Drawback of standard K-means algorithm:
• The final clusters depend heavily on the random initial placement of the centroids, so different runs can produce poor or inconsistent clusterings.
K-Means++
This algorithm ensures a smarter initialization of the centroids and
improves the quality of the clustering.
Apart from initialization, the rest of the algorithm is the same as the
standard K-means algorithm.
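In scikit-learn, this initialization is chosen through the init parameter of KMeans (where "k-means++" is in fact the default). A minimal sketch reusing the toy points from the earlier example:

# Minimal sketch: K-means with K-means++ initialization in scikit-learn.
from sklearn.cluster import KMeans

data = [(4, 21), (5, 19), (10, 24), (4, 17), (3, 16),
        (11, 25), (14, 24), (6, 22), (10, 21), (12, 21)]

kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10).fit(data)
print(kmeans.cluster_centers_)
print(kmeans.labels_)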
K-Medians: Example
Assignment step with initial medians 7 and 20 (distances are absolute differences):

Data Point | Distance to Median 7 | Distance to Median 20 | Assigned Cluster
-----------|----------------------|-----------------------|-----------------
2          | 5                    | 18                    | 1
3          | 4                    | 17                    | 1
7          | 0                    | 13                    | 1
8          | 1                    | 12                    | 1
10         | 3                    | 10                    | 1
12         | 5                    | 8                     | 1
15         | 8                    | 5                     | 2
20         | 13                   | 0                     | 2
25         | 18                   | 5                     | 2
Step 3
• Update Step (Calculation of Medians):
• For each cluster, calculate the median of the data points in that cluster. The
median is the middle value when the data points are sorted.
• Repeat the assignment and update steps until convergence. In each iteration,
reassign data points to the nearest medians and update the medians based on
the median of the data points in each cluster.
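A minimal 1-D K-medians sketch matching the worked example above (initial medians 7 and 20; the loop structure is an illustrative assumption):

# Minimal sketch: 1-D K-medians with absolute (Manhattan) distances.
import numpy as np

X = np.array([2, 3, 7, 8, 10, 12, 15, 20, 25], dtype=float)
medians = np.array([7.0, 20.0])  # initial medians from the example

for _ in range(100):
    # Assignment step: nearest median by absolute distance.
    labels = np.abs(X[:, None] - medians[None, :]).argmin(axis=1)
    # Update step: each median becomes the median of its cluster.
    new_medians = np.array([np.median(X[labels == j]) for j in range(2)])
    if np.allclose(new_medians, medians):  # convergence
        break
    medians = new_medians

print("medians:", medians)
print("labels:", labels)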
K-Medoid
• K-Medoids is particularly useful when dealing with non-numerical data or when you want
clusters that are centered around actual data points.
• "K-Medoids," is a clustering algorithm similar to KMeans but uses actual data points as
cluster representatives (medoids) instead of the mean or centroid.
• Medoids are representative objects of a data set or a cluster within a data set whose sum of
dissimilarities to all the objects in the cluster is minimal.
• Medoids are similar in concept to means or centroids, but medoids are always restricted to
be members of the data set.
• Medoids are most commonly used on data when a mean or centroid cannot be defined,
such as graphs.
K-Medoid
1. First, we select K random data points from the dataset and use them as
medoids.
2. Now, we will calculate the distance of each data point from the medoids. You
can use any of the Euclidean, Manhattan distance, or squared Euclidean
distance as the distance measure.
3. Once we find the distance of each data point from the medoids, we will
assign the data points to the clusters associated with each medoid. The data
points are assigned to the medoids at the closest distance.
4. After determining the clusters, we will calculate the sum of the distance of all
the non-medoid data points to the medoid of each cluster. Let the cost be Ci.
5. Now, we will select a random data point Dj from the dataset and swap it with a medoid Mi, so that Dj becomes a temporary medoid. After swapping, we will calculate the distance of all the non-medoid data points to the current medoid of each cluster. Let this cost be Cj.
6. If Cj < Ci, the swap is kept and Dj becomes a permanent medoid; otherwise, the swap is undone. The swap steps are repeated until the total cost no longer decreases. A small code sketch of this procedure follows.
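A small self-contained sketch of this swap-based idea (a simplified PAM-style loop on toy 2-D data with Manhattan distance; the data and helper names are illustrative assumptions, not a production implementation):

# Minimal sketch: swap-based K-medoids (simplified PAM).
import numpy as np

def total_cost(X, medoids):
    # Sum over all points of the distance to their nearest medoid (the cost C).
    d = np.abs(X[:, None, :] - X[medoids][None, :, :]).sum(axis=2)  # Manhattan
    return d.min(axis=1).sum()

def k_medoids(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), k, replace=False))  # step 1: random medoids
    improved = True
    while improved:                      # repeat until the cost stops decreasing
        improved = False
        for m in range(k):               # try swapping each medoid Mi ...
            for j in range(len(X)):      # ... with each non-medoid point Dj
                if j in medoids:
                    continue
                candidate = medoids.copy()
                candidate[m] = j
                # Keep the swap only if the new cost Cj beats the old cost Ci.
                if total_cost(X, np.array(candidate)) < total_cost(X, np.array(medoids)):
                    medoids, improved = candidate, True
    return medoids

X = np.array([[1, 2], [2, 1], [1, 1], [8, 9], [9, 8], [8, 8]], dtype=float)
print("medoid indices:", k_medoids(X, 2))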