UNIT-5 Material
Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden
patterns or data groupings without the need for human intervention.
(or)
As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a labeled training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It can be compared to the learning that takes place in
the human brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained on this dataset, which
means it has no prior knowledge of its features. The task of the unsupervised learning algorithm
is to identify the image features on its own. It will perform this task by clustering the image
dataset into groups according to the similarities between images.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like the way a human learns to think from their own
experiences, which makes it closer to true AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it
all the more important.
o In the real world, we do not always have input data with corresponding outputs, so to
solve such cases we need unsupervised learning.
Working of Unsupervised Learning
The working of unsupervised learning can be understood as follows:
Here, we take unlabeled input data, which means it is not categorized and no corresponding
outputs are given. This unlabeled input data is fed to the machine learning model in order to
train it. First, the model interprets the raw data to find the hidden patterns in it, and then a
suitable algorithm such as k-means clustering or hierarchical clustering is applied.
Once the suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
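For illustration, here is a minimal sketch of feeding unlabeled data to a clustering model using
scikit-learn's KMeans on synthetic points (the dataset and parameter values are arbitrary
choices for the example):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# unlabeled 2-D points; no output labels are given to the model
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

# the model groups the points purely by similarity
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
print(labels[:10])  # cluster index assigned to the first ten points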
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in one group and have little or no similarity with the
objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them according to the presence or absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules make marketing strategies more
effective; for example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical application of association rules is Market Basket Analysis
(a small code sketch follows below).
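As a rough sketch of market basket analysis, the third-party mlxtend library can mine
association rules from a few hypothetical transactions (the transactions and thresholds below
are made up for illustration):

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# hypothetical shopping baskets
transactions = [['bread', 'butter'], ['bread', 'jam'],
                ['bread', 'butter', 'jam'], ['milk', 'bread']]

# one-hot encode the transactions
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# frequent itemsets and rules such as {bread} -> {butter}
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[['antecedents', 'consequents', 'support', 'confidence']])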
Note: We will learn these algorithms in later chapters.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than supervised learning as it does
not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data is
not labeled, and algorithms do not know the exact output in advance.
2) Explain K-means clustering algorithm with an example? Write its limitations?
K-Means Clustering is an unsupervised learning algorithm which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be created
in the process: if K=2, there will be two clusters; for K=3, there will be three clusters, and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group, whose members share similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.
The algorithm takes the unlabeled dataset as input, divides the dataset into k-number of
clusters, and repeats the process until it does not find the best clusters. The value of k should
be predetermined in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster contains data points with some commonalities and is well separated from
other clusters.
The working of the K-means clustering algorithm can be explained with the following steps:
o Let's take the number of clusters as K=2, i.e., we will try to group the dataset into two
different clusters.
o We need to choose K random points or centroids to form the clusters. These points can
be either points from the dataset or any other points; here we select two points that are
not part of our dataset.
o Now we assign each data point of the scatter plot to its closest K-point or centroid. We
compute this by calculating the distance between two points and drawing a median line
between the two centroids. Points on the left side of this line are nearer to the K1 (blue)
centroid, and points on the right side are closer to the yellow centroid; we color them
blue and yellow for clear visualization.
o As we need to find the closest cluster, we repeat the process by choosing new centroids.
To choose the new centroids, we compute the center of gravity (mean) of the points in
each cluster and move the centroids there.
o Next, we reassign each data point to its nearest new centroid by repeating the same
process of finding a median line. After this step, one yellow point falls on the blue side of
the line and two blue points fall on the yellow side, so these three points are assigned to
the other centroid.
As reassignment has taken place, we go back to the centroid-update step and find new
centroids or K-points.
o We repeat the process by finding the center of gravity of each cluster, which gives the
new centroids.
o With the new centroids, we again draw the median line and reassign the data points.
o When no data point changes sides of the line, i.e., no reassignment takes place, the
model has converged.
As our model is ready, we can remove the assumed centroids, and we are left with the two final
clusters.
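The assign-and-update loop described above can be sketched in a few lines of NumPy (a
simplified illustration; initialization and stopping criteria are kept deliberately basic):

import numpy as np

def simple_kmeans(X, k, iters=10):
    rng = np.random.default_rng(0)
    centroids = X[rng.choice(len(X), k, replace=False)]      # pick k random points as centroids
    for _ in range(iters):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids

(Empty clusters are not handled here; library implementations such as scikit-learn's KMeans
deal with such edge cases.)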
Agglomerative (merging) clustering can use different linkage criteria to measure the distance
between clusters:
Nearest (single-link) clustering
Average-link clustering
Farthest (complete-link) clustering
Clustering by division or Divisive splitting
In this approach, we follow a top-down strategy: we start with a single cluster containing all the
points and repeatedly split it. The algorithm for performing divisive clustering is as follows:
Construct a single cluster containing all points.
For a given number of epochs or until clustering is satisfactory.
Split the cluster into two clusters with the largest inter-cluster distance.
Repeat the above steps.
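Divisive clustering is rarely available off the shelf, but the bottom-up (agglomerative)
counterpart with the linkage criteria listed above can be sketched with SciPy (the two-blob toy
data below is made up for the example):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)),
               rng.normal(5, 0.5, (20, 2))])      # two well-separated blobs

# method='single' = nearest, 'average' = average, 'complete' = farthest linkage
Z = linkage(X, method='average')
labels = fcluster(Z, t=2, criterion='maxclust')   # cut the dendrogram into 2 clusters
print(labels)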
Next, we discuss how to perform K-means clustering for image segmentation.
K-Means Clustering
K-means clustering is a very popular clustering algorithm which is applied when we have a
dataset with unknown labels. The goal is to find certain groups based on some kind of
similarity in the data, with the number of groups represented by K. This algorithm is generally
used in areas like market segmentation, customer segmentation, etc., but it can also be used
to segment different objects in an image on the basis of pixel values.
The algorithm for image segmentation works as follows:
1. First, we need to select the value of K in K-means clustering.
2. Select a feature vector for every pixel (color values such as RGB, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure
the similarity between any two points/pixels.
4. Apply the K-means algorithm to the feature vectors to obtain the cluster centers.
5. Apply the connected-components algorithm.
6. Merge any component whose size is less than a threshold into an adjacent component that is
similar to it, until no more components can be merged.
Following are the steps for applying the K-means clustering algorithm:
Select K points and assign each of them one cluster center.
Until the cluster centers no longer change, perform the following steps:
Allocate each point to the nearest cluster center, ensuring that every cluster
center has at least one point.
Replace each cluster center with the mean of the points assigned to it.
End
The optimal value of K?
For a certain class of clustering algorithms, there is a parameter commonly referred to as K
that specifies the number of clusters to detect. We may have a predefined value of K if we have
domain knowledge about how many categories the data contains. But before calculating the
optimal value of K, we first need to define the objective function for the above algorithm. The
objective function can be given by:

WCSS = Σ_j Σ_{x_i ∈ C_j} ||x_i − μ_j||²

where j indexes the clusters, x_i are the points belonging to the j-th cluster C_j, and μ_j is the
centroid of that cluster. This objective function is called the within-cluster sum of squares
(WCSS) distance.
A good way to find the optimal value of K is to brute-force a small range of values (e.g., 1-10)
and plot the graph of WCSS distance vs. K. The point where the graph bends sharply downward
can be considered the optimal value of K. This method is called the Elbow method.
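A short sketch of the Elbow method with scikit-learn (the synthetic data and the K range 1-10
are arbitrary choices); KMeans exposes the WCSS of a fit as its inertia_ attribute:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# WCSS (inertia_) for K = 1..10; look for the "elbow" in the plot
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS')
plt.show()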
For image segmentation, we plot the histogram of the image and try to find peaks, valleys in
it. Then, we will perform the peakiness test on that histogram.
Implementation
In this implementation, we will be performing Image Segmentation using K-Means
clustering. We will be using OpenCV k-Means API to perform this clustering.
Python3
# imports
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (12, 50)
# load image and flatten it into an (N, 3) array of BGR pixel values
img = cv.imread('road.jpg')
Z = img.reshape((-1, 3))
# convert to np.float32, as required by cv.kmeans
Z = np.float32(Z)
# run k-means on the pixel values and rebuild the segmented image
# (repeat with K = 3, 4, 5, ... to reproduce the outputs below)
K = 3
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret, label, center = cv.kmeans(Z, K, None, criteria, 10, cv.KMEANS_RANDOM_CENTERS)
center = np.uint8(center)
segmented = center[label.flatten()].reshape(img.shape)
plt.imshow(cv.cvtColor(segmented, cv.COLOR_BGR2RGB))
plt.show()
Output:
Image Segmentation for K=3,4,5
Image Segmentation for K=6,7,8
4) Explain clustering for preprocessing?
Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data usually is noisy (contains errors,
outliers, duplicates), incomplete (some values are missed), could be stored in different places
and different formats. The task of Data Preprocessing is to handle these issues.
In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection
stage and Model Training / Tuning.
Parameter Estimation
Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2, the result will
be the same as that of hierarchical clustering with the single-link metric, with the dendrogram
cut at height ε. Therefore, minPts must be at least 3. However, larger values are usually
better for data sets with noise and will yield more significant clusters. As a rule of
thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very
large data, for noisy data or for data that contains many duplicates.
ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance to
the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good
values of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part
of the data will not be clustered; whereas for a too high value of ε, clusters will merge and
the majority of objects will be in the same cluster. In general, small values of ε are
preferable, and as a rule of thumb, only a small fraction of points should be within this
distance of each other.
Distance function: The choice of distance function is tightly linked to the choice of ε, and has
a major impact on the outcomes. In general, it will be necessary to first identify a reasonable
measure of similarity for the data set, before the parameter ε can be chosen. There is no
estimation for this parameter, but the distance functions need to be chosen appropriately
for the data set.
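A sketch of the k-distance graph described above, using scikit-learn's NearestNeighbors on
synthetic data (the dataset and minPts value are illustrative assumptions):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
minPts = 4  # 2 * dim for 2-D data, per the rule of thumb above

# distance of every point to its (minPts-1)-th nearest neighbour
# (n_neighbors=minPts because the query point itself is returned at distance 0)
nn = NearestNeighbors(n_neighbors=minPts).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])[::-1]          # ordered from largest to smallest

plt.plot(k_dist)
plt.xlabel('points (sorted)')
plt.ylabel('distance to (minPts-1)-th neighbour')
plt.show()  # pick eps near the "elbow" of this curve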
DBSCAN Python Implementation Using Scikit-learn
In a DBSCAN scatter plot, the points labelled -1 are noise points (outliers), typically drawn in
black. Next, apply DBSCAN to cluster non-spherical data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# two concentric circles: a non-spherical dataset
X, y = make_circles(n_samples=750, factor=0.3, noise=0.1)
X = StandardScaler().fit_transform(X)

# fit DBSCAN; points labelled -1 are treated as noise/outliers
y_pred = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
plt.show()

print('Number of clusters: {}'.format(len(set(y_pred[np.where(y_pred != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_pred)))
print('Completeness: {}'.format(metrics.completeness_score(y, y_pred)))
print("V-measure: %0.3f" % metrics.v_measure_score(y, y_pred))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(y, y_pred))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(y, y_pred))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, y_pred))
DBSCAN recovers the two rings correctly. If we compare with K-means, K-means would split
this data along a straight boundary and mix the two rings, giving a completely incorrect
clustering result.
Best Case: If an indexing system is used to store the dataset such that neighborhood queries
are executed in logarithmic time, we get O(n log n) average runtime complexity.
Worst Case: Without the use of index structure or on degenerated data (e.g. all points within
a distance less than ε), the worst-case run time complexity remains O(n²).
Average Case: Same as best/worst case depending on data and implementation of the
algorithm.
Conclusion
Density-based clustering algorithms can learn clusters of arbitrary shape, and with the Level Set
Tree algorithm, one can learn clusters in datasets that exhibit wide differences in density.
However, I should point out that these algorithms are somewhat more arduous to tune
compared to parametric clustering algorithms like K-Means. Parameters like the epsilon for
DBSCAN or for the Level Set Tree are less intuitive to reason about compared to the number of
clusters parameter for K-Means, so it’s more difficult to choose good initial parameter values
for these algorithms.
7) Explain Gaussian mixtures?
Gaussian Mixture Model or Mixture of Gaussian as it is sometimes called, is not so much a
model as it is a probability distribution. It is a universally used model for generative
unsupervised learning or clustering. It is also called Expectation-Maximization Clustering or EM
Clustering and is based on the optimization strategy. Gaussian Mixture models are used for
representing Normally Distributed subpopulations within an overall population. The advantage
of Mixture models is that they do not require knowing which subpopulation a data point belongs to. It
allows the model to learn the subpopulations automatically. This constitutes a form of
unsupervised learning.
A Gaussian is a type of distribution, and it is a popular and mathematically convenient type of
distribution. A distribution is a listing of outcomes of an experiment and the probability
associated with each outcome. Let’s take an example to understand. We have a data table that
lists a set of cyclist’s speeds.
Speed (km/h) | Frequency
1 | 4
2 | 9
3 | 6
4 | 7
5 | 3
6 | 2
Here, we can see that the cyclist reaches a speed of 1 km/h four times, 2 km/h nine times,
3 km/h six times, and so on. Notice how the frequency rises, reaches a peak, and then falls
again: it looks like it follows a kind of bell curve, which we can represent with a bell-shaped
curve otherwise known as a Gaussian distribution.
A Gaussian distribution is a type of distribution in which half of the data falls to the left of the
mean and the other half to the right. It is a symmetric distribution, and one can see intuitively
that this makes it very mathematically convenient.
So, what do we need to define a Gaussian or Normal distribution? We need a mean, which is
the average of all the data points and defines the centre of the curve, and a standard deviation,
which describes how spread out the data is. A Gaussian distribution is a great choice for
modelling data in cases where the data rises to a single peak and then decreases. Similarly, in a
multi-Gaussian (mixture) distribution, we will have multiple peaks with multiple means and
multiple standard deviations.
The formula for a Gaussian distribution in terms of the mean μ and the standard deviation σ is
called the probability density function:

f(x) = (1 / (σ√(2π))) · exp(−(x − μ)² / (2σ²))

For a given point x, we can compute the associated y value, which is the probability density at
that x. So, for any x value, we can calculate how likely that value is under the curve, i.e., how
likely it is to be part of the modelled dataset.
This is a function of a continuous random variable whose integral across an interval gives the
probability that the value of the variable lies within the same interval.
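For example, the density and interval probabilities of a Gaussian can be evaluated with SciPy
(the mean and standard deviation below are hypothetical values for the cyclist example):

from scipy.stats import norm

mu, sigma = 3.0, 1.2   # hypothetical mean and standard deviation of the speeds

print(norm.pdf(2.5, loc=mu, scale=sigma))                   # density at x = 2.5
print(norm.cdf(4.0, loc=mu, scale=sigma)
      - norm.cdf(2.0, loc=mu, scale=sigma))                 # P(2 <= X <= 4)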
What is a Gaussian Mixture Model?
Sometimes our data has multiple distributions, or multiple peaks; it does not always have a
single peak. One can often notice this just by looking at the data set: there may be two or more
peaks, with the frequencies rising and falling several times. If multiple Gaussian distributions
can together represent this data, then we can build what we call a Gaussian Mixture Model.
In other words, if we have three Gaussian distributions GD1, GD2, GD3 with means µ1, µ2, µ3
and variances σ1², σ2², σ3², then for a given set of data points the GMM will identify the
probability of each data point belonging to each of these distributions.
A Gaussian mixture is thus a probability distribution built from multiple Gaussian probability
distributions, p(x) = Σ_k π_k · N(x | µ_k, Σ_k), where the mixing weights π_k sum to 1.
The probability density function of a d-dimensional Gaussian distribution with mean vector µ
and covariance matrix Σ is defined as:

N(x | µ, Σ) = 1 / ((2π)^(d/2) |Σ|^(1/2)) · exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
The reason the standard deviation (covariance) appears is that the σ² (or Σ) term in the
denominator of the exponent takes the variation of each cluster into account when measuring
how well a point fits, whereas K-means only calculates the conventional Euclidean distance;
i.e., K-means calculates distances while a GMM calculates probabilities (weights).
This means that the K-means algorithm gives you a hard assignment: it simply says that a data
point is a part of this class or that class. In a lot of cases we just want that hard assignment, but
in a lot of cases a soft assignment is better. Sometimes we want not only the most likely class
(e.g., 70% likely to belong to this class) but also the probabilities of it belonging to the other
classes: a list of probability values over multiple distributions, e.g. 60% likely this class and 40%
likely that class. That is why we incorporate the standard deviation into the model.
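The hard-versus-soft assignment difference can be seen directly in scikit-learn:
KMeans.fit_predict returns one label per point, while GaussianMixture.predict_proba returns a
probability for each component (the synthetic data below is just for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=[1.0, 2.5, 0.5], random_state=0)

hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)   # one cluster index per point
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft = gmm.predict_proba(X)                                             # one probability per component

print(hard[0])   # e.g. 2
print(soft[0])   # e.g. [0.05 0.15 0.80]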
Expectation-Maximization (EM) Algorithm: EM can be used for variables that are not directly
observable but can be deduced from the values of other observed variables. It can be used with
unlabeled data for classification and is one of the popular approaches to maximizing the
likelihood.
Basic Idea of the EM Algorithm: Given a set of incomplete data and a set of starting parameters:
E-Step: Using the observed data and the current values of the parameters, estimate the
(expected) values of the hidden data.
M-Step: Using the hidden data estimated in the E-step, update the parameters by maximizing
the expected complete-data (joint) likelihood.
Usage of EM Algorithm
1. Can be used to fill missing data.
2. To find the values of latent variables.
The disadvantage of the EM algorithm is that it converges slowly and only to a local optimum.
Comparing to Gradient Descent
Gradient descent computes the derivative (gradient), which tells us in which direction we
should move the parameters of our model so that the model's objective function better fits the
data. But what if we cannot compute a gradient, i.e., we cannot take the derivative of a random
variable? The Gaussian mixture model contains latent random variables; it is a stochastic, i.e.
non-deterministic, model. Since we cannot compute the derivative of a random variable, we
cannot use plain gradient descent, which is why EM is used instead.
Applications
GMM is widely used in the field of signal processing.
GMM provides good results in language Identification.
Customer Churn is another example.
GMM finds its use case in anomaly detection.
GMM is also used to track the object in a video frame.
GMM can also be used to classify songs based on genres.
The following numbered steps walk through applying PCA to the breast_cancer dataset.
2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569 data items with 30
input attributes. There are two output classes: benign and malignant. Because there are 30
input features, it is impossible to visualize this data directly.
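A minimal sketch of the loading step (the code itself was not included above; this assumes
sklearn.datasets.load_breast_cancer):

from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
print(data.data.shape)     # (569, 30): 569 samples, 30 input features
print(data.target_names)   # ['malignant' 'benign']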
3. Apply PCA
Standardize the dataset prior to PCA.
Import PCA from sklearn.decomposition.
Choose the number of principal components.
Let us set it to 3. After executing this code, we find that the dimensions of x are (569, 3), while
the dimension of the actual data is (569, 30). Thus, it is clear that with PCA the number of
dimensions has been reduced from 30 to 3. If we chose n_components=2, the dimensions would
be reduced to 2.
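A sketch of the standardize-then-PCA step described above (the variable names x and principal
are chosen to match the attributes referenced in the following steps):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

scaled = StandardScaler().fit_transform(data.data)   # standardize the 30 features
principal = PCA(n_components=3)                      # keep 3 principal components
x = principal.fit_transform(scaled)
print(x.shape)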
Output:
(569,3)
4. Check Components
principal.components_ provides an array in which the number of rows equals the number of
principal components, while the number of columns equals the number of features in the
actual data. We can easily see that there are three rows, as n_components was chosen to be 3,
and each row has 30 columns, as in the actual data.
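For example, the shape of the components array can be checked directly (continuing from the
sketch above):

print(principal.components_.shape)   # (3, 30): one row per principal component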
Python3
plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')
Output:
For three principal components, we need to plot a 3d graph. x[:,0] signifies the first principal
component. Similarly, x[:,1] and x[:,2] represent the second and the third principal
component.
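A sketch of the 3-D scatter plot described above (continuing with x and data from the earlier
sketches; the axis labels are assumptions):

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # enables the 3-D projection on older matplotlib

fig = plt.figure(figsize=(10, 10))
ax = fig.add_subplot(projection='3d')
ax.scatter(x[:, 0], x[:, 1], x[:, 2], c=data['target'], cmap='plasma')
ax.set_xlabel('pc1')
ax.set_ylabel('pc2')
ax.set_zlabel('pc3')
plt.show()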
6. Calculate variance ratio
The explained_variance_ratio_ attribute provides an idea of how much variation is explained by
each principal component.
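A one-line sketch of this step (continuing from the PCA sketch above):

print(principal.explained_variance_ratio_)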
Output:
array([0.44272026, 0.18971182, 0.09393163])
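The plotting code below references X, y, and X_pca, which are not defined in this excerpt. A
minimal setup sketch, assuming a toy non-linear two-class dataset from make_circles and an
ordinary two-component PCA:

from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)  # two concentric classes
X_pca = PCA(n_components=2).fit_transform(X)   # linear PCA cannot separate the two circles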
plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c = y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()
Code: Applying kernel PCA on this dataset with RBF kernel with a gamma value of 15.
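The kernel PCA fit itself is not shown above; a sketch with scikit-learn's KernelPCA, using the
RBF kernel and gamma=15 as stated:

from sklearn.decomposition import KernelPCA

X_kpca = KernelPCA(kernel='rbf', gamma=15).fit_transform(X)  # project into the kernel feature space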
plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c = y)
plt.show()
In the kernel space the two classes are linearly separable. Kernel PCA uses a kernel function
to project the dataset into a higher-dimensional space, where it is linearly separable.
Finally, we applied the kernel PCA to a non-linear dataset using scikit-learn.