
UNIT-5

Unsupervised Learning Techniques


1) What is Unsupervised Learning Technique explain with an example?

Unsupervised learning, also known as unsupervised machine learning, uses machine learning
algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden
patterns or data groupings without the need for human intervention.

(or)

As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a labeled training dataset. Instead, the model itself finds the hidden
patterns and insights in the given data. It can be compared to the learning that takes place in
the human brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained on an unlabeled
dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem
because, unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of the dataset, group the
data according to similarities, and represent the dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained on the given dataset,
which means it has no prior idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on its own. The algorithm
will perform this task by clustering the image dataset into groups according to the similarities
between images.
Why use Unsupervised Learning?
Below are the main reasons that describe the importance of unsupervised learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much closer to how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes it
especially important in practice.
o In the real world, we do not always have input data with corresponding outputs, so to
solve such cases we need unsupervised learning.
Working of Unsupervised Learning
The working of unsupervised learning can be summarized as follows:

Here, we take unlabeled input data, which means it is not categorized and no corresponding
outputs are given. This unlabeled input data is fed to the machine learning model in order to
train it. The model first interprets the raw data to find hidden patterns and then applies a
suitable algorithm such as k-means clustering or hierarchical clustering.
Once a suitable algorithm is applied, it divides the data objects into groups according to the
similarities and differences between the objects.
Types of Unsupervised Learning Algorithm:
The unsupervised learning algorithm can be further categorized into two types of problems:
o Clustering: Clustering is a method of grouping objects into clusters such that objects
with the most similarities remain in the same group and have little or no similarity with
the objects of other groups. Cluster analysis finds the commonalities between the data
objects and categorizes them according to the presence or absence of those commonalities.
o Association: An association rule is an unsupervised learning method used for finding
relationships between variables in a large database. It determines the sets of items that
occur together in the dataset. Association rules make marketing strategies more
effective: for example, people who buy item X (say, bread) also tend to purchase item Y
(butter or jam). A typical example of association rules is Market Basket Analysis, as
illustrated by the small sketch below.
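To make the association idea concrete, here is a minimal, self-contained sketch (using made-up transactions, not a real dataset, and not the full Apriori algorithm) that estimates the support and confidence of the rule bread → butter:

# A minimal illustration of an association rule (not the full Apriori
# algorithm): estimating support and confidence for "bread -> butter"
# from a small, made-up list of market-basket transactions.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "jam"},
    {"bread", "butter"},
    {"milk", "eggs"},
    {"bread", "butter", "jam"},
]

n = len(transactions)
bread = sum(1 for t in transactions if "bread" in t)
bread_and_butter = sum(1 for t in transactions if {"bread", "butter"} <= t)

support = bread_and_butter / n          # how often the pair occurs overall
confidence = bread_and_butter / bread   # how often butter accompanies bread

print(f"support(bread, butter) = {support:.2f}")          # 0.60
print(f"confidence(bread -> butter) = {confidence:.2f}")  # 0.75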
Note: We will learn these algorithms in later chapters.
Unsupervised Learning algorithms:
Below is the list of some popular unsupervised learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchical clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
Advantages of Unsupervised Learning
o Unsupervised learning is used for more complex tasks as compared to supervised
learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.
Disadvantages of Unsupervised Learning
o Unsupervised learning is intrinsically more difficult than supervised learning as it does
not have corresponding output.
o The result of the unsupervised learning algorithm might be less accurate as input data is
not labeled, and algorithms do not know the exact output in advance.
2) Explain K-means clustering algorithm with an example? Write its limitations?

K-Means Clustering is an unsupervised learning algorithm which groups the unlabeled dataset
into different clusters. Here K defines the number of pre-defined clusters that need to be
created in the process: if K=2, there will be two clusters; for K=3, there will be three clusters;
and so on.
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a
way that each data point belongs to only one group, and points in the same group share similar
properties.
It allows us to cluster the data into different groups and provides a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main aim
of the algorithm is to minimize the sum of distances between the data points and their
corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until the clusters stop improving. The value of k should be predetermined
in this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other clusters.

How does the K-Means Algorithm Work?


The working of the K-Means algorithm is explained in the steps below:
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points as the initial centroids (they need not be points from the input
dataset).
Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e., reassign each data point to the new closest centroid of its
cluster.
Step-6: If any reassignment occurred, go to Step-4; otherwise, go to FINISH.
Step-7: The model is ready.
Let's understand the above steps by considering the visual plots:
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables is
given below:

o Let's take the number of clusters as K=2, i.e., we will try to group the dataset into two
different clusters.
o We need to choose k random points or centroids to form the clusters. These points can
be either points from the dataset or any other points; here we select two points as the
k centroids, which are not part of our dataset.
o Now we assign each data point of the scatter plot to its closest centroid. We compute
this using the usual distance between two points and draw a median line between the
two centroids.
Points on the left side of this line are nearer to the K1 (blue) centroid, and points on the right
of the line are closer to the yellow centroid; we color them blue and yellow for clear
visualization.

o As we need to find the closest clusters, we repeat the process by choosing new
centroids. To choose the new centroids, we compute the center of gravity of the points
in each cluster and place the new centroids there.
o Next, we reassign each data point to its new nearest centroid. For this, we repeat the
same process of drawing a median line between the centroids.
After redrawing the line, one yellow point falls on the left side of the line and two blue points
fall on the right side, so these three points are assigned to the new, nearer centroids.

As reassignment has taken place, we again go to Step-4, which is finding new centroids.
o We repeat the process of finding the center of gravity of each cluster, which gives new
centroids.
o With the new centroids, we again draw the median line and reassign the data points.
o Now there are no data points left on the wrong side of the line, which means the
clusters have converged and our model is formed.
As our model is ready, we can remove the assumed centroid markers, and we are left with the
two final clusters.
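To make the walkthrough concrete, here is a minimal sketch of the same procedure using scikit-learn's KMeans on synthetic two-variable data (make_blobs is used here as a stand-in for the unspecified M1/M2 example dataset above):

# Minimal sketch of the K-means steps above using scikit-learn.
# The two-variable data is synthetic (make_blobs), standing in for the
# example dataset used in the walkthrough.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Two variables (M1, M2) with two natural groups
X, _ = make_blobs(n_samples=200, centers=2, cluster_std=0.8, random_state=42)

# K=2: initialize centroids, assign points, recompute centroids, repeat
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=labels, cmap='viridis', s=20)
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            c='red', marker='x', s=100, label='final centroids')
plt.xlabel('M1')
plt.ylabel('M2')
plt.legend()
plt.show()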

Applications of K-Means Clustering Algorithm


Below are the applications mentioned:
 Market segmentation
 Document clustering
 Image segmentation
 Image compression
 Vector quantization
 Cluster analysis
 Feature learning or dictionary learning
 Identifying crime-prone areas
 Insurance fraud detection
 Public transport data analysis
 Clustering of IT assets
 Customer segmentation
 Identifying Cancerous data
 Used in search engines
 Drug Activity Prediction
Advantages of K-Means Clustering Algorithm
Below are the advantages mentioned:
 It is fast
 Robust
 Easy to understand
 Comparatively efficient
 Gives the best results when the clusters are distinct and well separated
 Produce tighter clusters
 When centroids are recomputed, the cluster changes.
 Flexible
 Easy to interpret
 Relatively low computational cost
 Enhances Accuracy
 Works better with spherical clusters
Disadvantages of K-Means Clustering Algorithm
Below are the disadvantages mentioned:
 Needs prior specification of the number of cluster centers
 If two clusters overlap heavily, K-means cannot distinguish them and cannot tell that there
are two clusters
 Different representations (e.g., scalings) of the data can lead to different results
 Euclidean distance can weigh the features unequally
 It converges only to a local optimum of the squared-error objective
 Sometimes choosing the centroids randomly does not give fruitful results
 It can be used only when the mean of a cluster is defined
 Cannot handle outliers and noisy data
 Does not work well for non-linearly separable data sets
 Lacks consistency across runs
 Sensitive to the scale of the features
 Can become very expensive on very large data sets
 Prediction issues

3) Explain how clustering is used for image segmentation?


Segmentation by Clustering
This is a method of performing pixel-wise image segmentation: we try to cluster together the
pixels that belong together. There are two approaches to performing segmentation by
clustering.
 Clustering by merging (agglomerative)
 Clustering by division (divisive)
Clustering by merging or Agglomerative Clustering:
This follows a bottom-up approach: each point starts as its own cluster, and the closest
clusters are merged repeatedly. The algorithm for agglomerative clustering is as follows:
 Take each point as a separate cluster.
 For a given number of epochs, or until the clustering is satisfactory:
 Merge the two clusters with the smallest inter-cluster distance.
 Repeat the above step.
Agglomerative clustering is represented by a dendrogram. The merging can be performed in
three ways: by selecting the closest pair for merging (nearest clustering), by selecting the
farthest pair for merging (farthest clustering), or by selecting the pair at an average distance,
neither closest nor farthest (average clustering).
Clustering by division or Divisive splitting
This follows a top-down approach: we start with one cluster containing all points and
repeatedly split it. The algorithm for divisive clustering is as follows:
 Construct a single cluster containing all points.
 For a given number of epochs, or until the clustering is satisfactory:
 Split the cluster into the two clusters with the largest inter-cluster distance.
 Repeat the above steps.
Below, we discuss how to perform K-Means clustering for image segmentation.
K-Means Clustering
K-means clustering is a very popular clustering algorithm which is applied when we have a
dataset with unknown labels. The goal is to find groups in the data based on some kind of
similarity, with the number of groups represented by K. This algorithm is generally used in
areas like market segmentation and customer segmentation, but it can also be used to
segment different objects in images on the basis of their pixel values.
The algorithm for image segmentation works as follows:
1. First, select the value of K for K-means clustering.
2. Select a feature vector for every pixel (color values such as RGB, texture, etc.).
3. Define a similarity measure between feature vectors, such as Euclidean distance, to measure
the similarity between any two pixels.
4. Apply the K-means algorithm to the pixel feature vectors.
5. Apply the connected-components algorithm.
6. Merge any component smaller than a threshold with an adjacent component that is similar
to it, until no more merges are possible.
Following are the steps for applying the K-means clustering algorithm:
 Select K points and assign them one cluster center each.
 Until the cluster centers stop changing, perform the following steps:
 Allocate each point to the nearest cluster center (each cluster center should end
up with at least one point).
 Replace each cluster center with the mean of the points assigned to it.
 End
The optimal value of K
For a certain class of clustering algorithms, there is a parameter commonly referred to as K
that specifies the number of clusters to detect. We may have a predefined value of K if we
have domain knowledge about how many categories the data contains. But before calculating
the optimal value of K, we first need to define the objective function for the above algorithm.
The objective function can be given by:

J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2

where j indexes the clusters, C_j is the set of points belonging to the j-th cluster, and \mu_j is
the centroid of that cluster. The above objective function is called the within-cluster sum of
squares (WCSS).
A good way to find the optimal value of K is to brute-force a small range of values (e.g., 1 to 10)
and plot the graph of the WCSS distance vs. K. The point where the graph bends sharply
downward can be considered the optimal value of K. This method is called the Elbow method,
sketched below. For image segmentation, we can also plot the histogram of the image, find its
peaks and valleys, and then perform a peakiness test on that histogram.
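Here is a minimal sketch of the elbow method on synthetic data (make_blobs is used as a stand-in dataset; scikit-learn exposes the WCSS of a fitted KMeans model as the inertia_ attribute):

# Sketch of the elbow method: compute the WCSS (KMeans exposes it as
# inertia_) for K = 1..10 on toy data and look for the bend in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # toy data

ks = range(1, 11)
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in ks]

plt.plot(ks, wcss, marker='o')
plt.xlabel('K')
plt.ylabel('WCSS (inertia)')
plt.title('Elbow method')
plt.show()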
Implementation
 In this implementation, we will be performing Image Segmentation using K-Means
clustering. We will be using OpenCV k-Means API to perform this clustering.
 Python3

# imports
import numpy as np
import cv2 as cv
import matplotlib.pyplot as plt

plt.rcParams["figure.figsize"] = (12, 50)

# load image and reshape the pixels into an N x 3 array of color values
img = cv.imread('road.jpg')
Z = img.reshape((-1, 3))

# convert to np.float32, as required by cv.kmeans()
Z = np.float32(Z)

# define the stopping criteria and the number of clusters (K), then apply cv.kmeans()
# TERM_CRITERIA_EPS: stop when the epsilon value is reached
# TERM_CRITERIA_MAX_ITER: stop when the max iteration count is reached
criteria = (cv.TERM_CRITERIA_EPS + cv.TERM_CRITERIA_MAX_ITER, 10, 1.0)

fig, ax = plt.subplots(10, 2, sharey=True)

for i in range(10):
    K = i + 3
    # apply the K-means algorithm
    ret, label, center = cv.kmeans(Z, K, None, criteria, 10,
                                   cv.KMEANS_RANDOM_CENTERS)
    # convert the centers back into uint8 and rebuild the segmented image
    center = np.uint8(center)
    res = center[label.flatten()]
    res2 = res.reshape((img.shape))
    # plot the original image and the K-means segmented image
    ax[i, 1].imshow(res2)
    ax[i, 1].set_title('K = %s Image' % K)
    ax[i, 0].imshow(img)
    ax[i, 0].set_title('Original Image')

Output:
Image Segmentation for K=3,4,5
Image Segmentation for K=6,7,8
4) Explain clustering for preprocessing?
Data Preprocessing or Data Preparation is a data mining technique that transforms raw data into
an understandable format for ML algorithms. Real-world data is usually noisy (it contains errors,
outliers, and duplicates), incomplete (some values are missing), and may be stored in different
places and in different formats. The task of data preprocessing is to handle these issues.
In the common ML pipeline, the Data Preprocessing stage sits between the Data Collection stage
and Training/Tuning the Model.

Importance of Data Preprocessing stage


1. Different ML models have different required input data (numerical data, images in specific
format, etc). Without the right data, nothing will work.
2. Because of “bad” data, ML models will not give any useful results, or may even give wrong
answers that lead to wrong decisions (the GIGO principle).
3. The higher the quality of the data, the less data is needed.
Note: Nowadays the preprocessing stage is the most laborious step; it may take 60–80% of an ML
engineer's effort.
Before starting data preparation, it is recommended to determine what input-data requirements
the ML algorithm has in order to get quality results. Here we consider the K-means clustering
algorithm.
K-means input data requirements:
 Numerical variables only. K-means uses distance-based measurements to determine the
similarity between data points. If you have categorical data, use K-modes clustering, if data
is mixed, use K-prototype clustering.
 Data has no noise or outliers. K-means is very sensitive to outliers and noisy data.
 Data has symmetric distribution of variables (it isn’t skewed). Real data always has outliers
and noise, and it’s difficult to get rid of it. Transformation data to normal distribution helps
to reduce the impact of these issues. In this way, it’s much easier for the algorithm to
identify clusters.
 Variables on the same scale: they have the same mean and variance, usually in a range of -1.0 to
1.0 (standardized data) or 0.0 to 1.0 (normalized data). For the ML algorithm to consider all
attributes as equal, they must all have the same scale.
 There is no collinearity (a high level of correlation between two variables). Correlated
variables are not useful for ML segmentation algorithms because they represent the same
characteristic of a segment. So correlated variables are nothing but noise.
 A small number of dimensions. As the number of dimensions (variables) increases, a distance-
based similarity measure converges to a constant value between any given examples. The
more variables there are, the more difficult it is to find clear differences between instances.
Note: there is no strict rule for how many dimensions count as "few"; in practice, the fewer the
better, and the clustering results should always be validated.
Besides the requirements above, there are a few fundamental model assumptions:
 the variance of the distribution of each attribute (variable) is spherical (or in other words,
the boundaries between k-means clusters are linear);
 all variables have the same variance;
 each cluster has roughly equal numbers of observations.
These assumptions are beyond the data preprocessing stage. There is no way to validate them
before getting model results.
Stages of Data preprocessing for K-means Clustering
1. Data Cleaning
 Removing duplicates
 Removing irrelevant observations and errors
 Removing unnecessary columns
 Handling inconsistent data
 Handling outliers and noise
2. Handling missing data
3. Data Integration
4. Data Transformation
 Feature Construction
 Handling skewness
 Data Scaling
5. Data Reduction
 Removing dependent (highly correlated) variables
 Feature selection
 PCA
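The following is a minimal sketch of these stages applied to a small, made-up pandas DataFrame (the column names and values are purely illustrative); the exact cleaning steps would of course depend on the real dataset:

# Sketch of the preprocessing stages above on a made-up numeric DataFrame.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.DataFrame({'income': [40, 42, 41, 400, np.nan, 39],
                   'age':    [25, 27, 26, 28, 30, np.nan],
                   'spend':  [5, 6, 5, 50, 7, 5]})

# 1. Data cleaning: drop duplicates and clip extreme outliers
df = df.drop_duplicates()
df = df.clip(lower=df.quantile(0.01), upper=df.quantile(0.99), axis=1)

# 2. Handle missing data: simple median imputation
df = df.fillna(df.median())

# 4. Data transformation: reduce skewness, then scale to zero mean / unit variance
df = np.log1p(df)
scaled = StandardScaler().fit_transform(df)

# 5. Data reduction: keep a few uncorrelated components before running K-means
reduced = PCA(n_components=2).fit_transform(scaled)
print(reduced.shape)   # (number of rows, 2)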
5) Explain the usage of clustering for semi-supervised learning?
Semi-supervised clustering is a method that partitions unlabeled data by making use of domain
knowledge. This knowledge is generally expressed as pairwise constraints between instances or
as an additional set of labeled instances.
The quality of unsupervised clustering can be substantially improved using some weak form of
supervision, for instance in the form of pairwise constraints (i.e., pairs of objects labeled as
belonging to the same or to different clusters). A clustering procedure that relies on such user
feedback or guidance constraints is known as semi-supervised clustering.
There are several methods for semi-supervised clustering that can be divided into two classes
which are as follows −
Constraint-based semi-supervised clustering − Uses user-provided labels or constraints to guide
the algorithm toward a more appropriate data partitioning. This includes modifying the
objective function based on the constraints, or initializing and constraining the clustering
process based on the labeled objects. A minimal sketch of this idea is given below.
Distance-based semi-supervised clustering − Employs an adaptive distance measure that is
trained to satisfy the labels or constraints in the supervised data. Several adaptive distance
measures have been used, including a string-edit distance trained using Expectation-
Maximization (EM) and a Euclidean distance modified by a shortest-path algorithm.
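As a simple illustration of the constraint-based idea (this seeding variant is just one possibility, not a specific algorithm named above), a few labeled instances can be used to initialize the K-means centroids so that the resulting partition respects the known classes:

# Sketch: seed the K-means centroids with a handful of labeled instances.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=300, centers=3, random_state=7)

# Pretend only 5 instances per class are labeled (the "domain knowledge")
rng = np.random.default_rng(0)
labeled_idx = np.concatenate(
    [rng.choice(np.where(y_true == c)[0], 5, replace=False) for c in range(3)])

# Initialize each centroid at the mean of its labeled instances
init_centroids = np.vstack(
    [X[labeled_idx][y_true[labeled_idx] == c].mean(axis=0) for c in range(3)])

km = KMeans(n_clusters=3, init=init_centroids, n_init=1, random_state=0)
labels = km.fit_predict(X)
print(labels[:10])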
An interesting clustering method, known as CLTree (CLustering based on decision TREEs),
integrates unsupervised clustering with the idea of supervised classification. It is an instance of
constraint-based semi-supervised clustering. It turns a clustering task into a classification task
by treating the set of points to be clustered as belonging to one class, labeled "Y", and adding a
set of relatively uniformly distributed "nonexistence" points with another class label, "N".
The problem of partitioning the data space into data (dense) regions and empty (sparse)
regions can then be changed into a classification problem: the original points are the "Y"
points, and the added, uniformly distributed points are the "N" points.
The original clustering problem is thus changed into a classification problem, which works out a
decision boundary that distinguishes the "Y" and "N" points. A decision tree induction method
can be used to partition the space, and the clusters are then recognized from the "Y" points
only.
However, inserting a large number of "N" points into the original data can introduce
unnecessary computational overhead. Moreover, it is unlikely that the added points would
truly be uniformly distributed in a very high-dimensional space, as this would require an
exponential number of points.
6) Describe DBSCAN?
Clustering analysis is an unsupervised learning method that separates the data points into
several specific bunches or groups, such that the data points in the same groups have similar
properties and data points in different groups have different properties in some sense.
It comprises many different methods based on different distance measures, e.g., K-Means
(distance between points), Affinity Propagation (graph distance), Mean-shift (distance between
points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance
to centers), and Spectral clustering (graph distance).
Fundamentally, all clustering methods use the same approach: first we calculate similarities and
then we use them to cluster the data points into groups. Here we will focus on the Density-
Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.
Why do we need a Density-Based clustering algorithm like DBSCAN when we already have K-
means clustering?
K-Means clustering may cluster loosely related observations together. Every observation
becomes a part of some cluster eventually, even if the observations are scattered far away in
the vector space. Since clusters depend on the mean value of cluster elements, each data point
plays a role in forming the clusters. A slight change in data points might affect the clustering
outcome. This problem is greatly reduced in DBSCAN due to the way clusters are formed. This is
usually not a big problem unless we come across some odd shape data.
Another challenge with k-means is that you need to specify the number of clusters (“k”) in
order to use it. Much of the time, we won’t know what a reasonable k value is a priori.
What's nice about DBSCAN is that you don't have to specify the number of clusters to use it. All
you need is a function to calculate the distance between values and some guidance on what
amount of distance is considered "close". DBSCAN also produces more reasonable results
than k-means across a variety of different distributions.

Density-Based Clustering Algorithms

Density-Based Clustering refers to unsupervised learning methods that identify distinctive
groups/clusters in the data, based on the idea that a cluster in the data space is a contiguous
region of high point density, separated from other such clusters by contiguous regions of low
point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a base algorithm for
density-based clustering. It can discover clusters of different shapes and sizes from a large
amount of data containing noise and outliers.
The DBSCAN algorithm uses two parameters:
 minPts: The minimum number of points (a threshold) clustered together for a region to be
considered dense.
 eps (ε): A distance measure that will be used to locate the points in the neighborhood of any
point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies within
a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity based chaining-approach to determine
whether points are located in a particular cluster. For example, p and q points could be
connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.
There are three types of points after the DBSCAN clustering is complete:
 Core — a point that has at least minPts points within distance ε of itself.
 Border — a point that has at least one core point within distance ε.
 Noise — a point that is neither a core nor a border point; it has fewer than minPts points
within distance ε of itself.
Algorithmic steps for DBSCAN clustering
 The algorithm proceeds by arbitrarily picking up a point in the dataset (until all points have
been visited).
 If there are at least ‘minPoint’ points within a radius of ‘ε’ to the point then we consider all
these points to be part of the same cluster.
 The clusters are then expanded by recursively repeating the neighborhood calculation for
each neighboring point

Parameter Estimation

Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
 minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2, the result will
be the same as of hierarchical clustering with the single link metric, with the dendrogram cut
at height ε. Therefore, minPts must be chosen at least 3. However, larger values are usually
better for data sets with noise and will yield more significant clusters. As a rule of
thumb, minPts = 2·dim can be used, but it may be necessary to choose larger values for very
large data, for noisy data or for data that contains many duplicates.
 ε: The value for ε can then be chosen by using a k-distance graph (a short sketch of this is given
after this list), plotting the distance to the k = minPts-1 nearest neighbor ordered from the
largest to the smallest value. Good
values of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part
of the data will not be clustered; whereas for a too high value of ε, clusters will merge and
the majority of objects will be in the same cluster. In general, small values of ε are
preferable, and as a rule of thumb, only a small fraction of points should be within this
distance of each other.
 Distance function: The choice of distance function is tightly linked to the choice of ε, and has
a major impact on the outcomes. In general, it will be necessary to first identify a reasonable
measure of similarity for the data set, before the parameter ε can be chosen. There is no
estimation for this parameter, but the distance functions need to be chosen appropriately
for the data set.
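As a small illustration of the k-distance heuristic just described (on synthetic make_blobs data, which is an assumption for demonstration), one can plot each point's distance to its (minPts-1)-th nearest neighbour, sorted, and look for the elbow:

# Sketch of the k-distance graph: distance of every point to its
# (minPts-1)-th nearest neighbour, sorted from largest to smallest.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=1)

min_pts = 4                                    # e.g. minPts = 2 * dim for 2-D data
nn = NearestNeighbors(n_neighbors=min_pts).fit(X)
distances, _ = nn.kneighbors(X)                # column 0 is the point itself (distance 0)

k_distances = np.sort(distances[:, -1])[::-1]  # distance to the (minPts-1)-th neighbour
plt.plot(k_distances)
plt.xlabel('Points sorted by distance')
plt.ylabel('Distance to %d-th nearest neighbour' % (min_pts - 1))
plt.title('k-distance graph for choosing eps')
plt.show()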
DBSCAN Python Implementation Using Scikit-learn

Let us first apply DBSCAN to cluster spherical data.


We first generate 750 training data points in three spherical blobs with corresponding labels,
then standardize the features of the training data, and finally apply DBSCAN from the sklearn
library.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Plot result
import matplotlib.pyplot as plt
%matplotlib inline

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each) for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
DBSCAN to cluster spherical data

The black data points represent outliers in the above result. Next, apply DBSCAN to cluster non-
spherical data.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.datasets import make_circles
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

X, y = make_circles(n_samples=750, factor=0.3, noise=0.1)
X = StandardScaler().fit_transform(X)
y_pred = DBSCAN(eps=0.3, min_samples=10).fit_predict(X)

plt.scatter(X[:, 0], X[:, 1], c=y_pred)
print('Number of clusters: {}'.format(len(set(y_pred[np.where(y_pred != -1)]))))
print('Homogeneity: {}'.format(metrics.homogeneity_score(y, y_pred)))
print('Completeness: {}'.format(metrics.completeness_score(y, y_pred)))
print("V-measure: %0.3f" % metrics.v_measure_score(y, y_pred))
print("Adjusted Rand Index: %0.3f" % metrics.adjusted_rand_score(y, y_pred))
print("Adjusted Mutual Information: %0.3f" % metrics.adjusted_mutual_info_score(y, y_pred))
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, y_pred))

DBSCAN to cluster non-spherical data

The result is essentially perfect. If we compare this with K-means, K-means would give a
completely incorrect output on this non-spherical data (K-means clustering result).

The Complexity of DBSCAN

 Best Case: If an indexing system is used to store the dataset such that neighborhood queries
are executed in logarithmic time, we get O(n log n) average runtime complexity.
 Worst Case: Without the use of index structure or on degenerated data (e.g. all points within
a distance less than ε), the worst-case run time complexity remains O(n²).
 Average Case: Same as best/worst case depending on data and implementation of the
algorithm.

Conclusion

Density-based clustering algorithms can learn clusters of arbitrary shape, and with the Level Set
Tree algorithm, one can learn clusters in datasets that exhibit wide differences in density.
However, these algorithms are somewhat more difficult to tune compared to parametric
clustering algorithms like K-Means. Parameters like the epsilon for
DBSCAN or for the Level Set Tree are less intuitive to reason about compared to the number of
clusters parameter for K-Means, so it’s more difficult to choose good initial parameter values
for these algorithms.
7) Explain Gaussian mixtures?
Gaussian Mixture Model or Mixture of Gaussian as it is sometimes called, is not so much a
model as it is a probability distribution. It is a universally used model for generative
unsupervised learning or clustering. It is also called Expectation-Maximization Clustering or EM
Clustering and is based on the optimization strategy. Gaussian Mixture models are used for
representing normally distributed subpopulations within an overall population. The advantage
of mixture models is that they do not require knowing which subpopulation a data point belongs
to; the model learns the subpopulations automatically. This constitutes a form of
unsupervised learning.
A Gaussian is a type of distribution, and it is a popular and mathematically convenient type of
distribution. A distribution is a listing of outcomes of an experiment and the probability
associated with each outcome. Let’s take an example to understand. We have a data table that
lists a set of cyclist’s speeds.
Speed (km/h)   Frequency
1              4
2              9
3              6
4              7
5              3
6              2
Here, we can see that a cyclist reaches a speed of 1 km/h four times, 2 km/h nine times, 3 km/h
six times, and so on. We can notice how the frequency goes up, reaches a peak, and then goes
down again. It looks like it follows a kind of bell curve, and we can represent it using such a bell
curve, otherwise known as a Gaussian distribution.

A Gaussian distribution is a symmetric distribution: half of the data falls to the left of the mean
and the other half falls to the right of it. One can notice intuitively that it is very mathematically
convenient.
So, what do we need to define a Gaussian (Normal) distribution? We need a mean, which is the
average of all the data points and defines the centre of the curve, and a standard deviation,
which describes how spread out the data is. A Gaussian distribution is a good model for data
that rises to a peak and then decreases. Similarly, in a multi-Gaussian distribution, we will have
multiple peaks with multiple means and multiple standard deviations.
The formula for the Gaussian distribution in terms of the mean \mu and the standard deviation
\sigma is called the Probability Density Function:

f(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
For a given point X, we can compute the associated Y values. Y values are the probabilities for
those X values. So, for any X value, we can calculate the probability of that X value being a part
of the curve or being a part of the dataset.
This is a function of a continuous random variable whose integral across an interval gives the
probability that the value of the variable lies within the same interval.
What is a Gaussian Mixture Model?
Sometimes our data has multiple distributions or it has multiple peaks. It does not always have
one peak, and one can notice that by looking at the data set. It will look like there are multiple
peaks happening here and there. There are two peak points and the data seems to be going up
and down twice or maybe three times or four times. But if there are Multiple Gaussian
distributions that can represent this data, then we can build what we called a Gaussian Mixture
Model.
In other words, if we have three Gaussian distributions GD1, GD2, and GD3 with means \mu_1,
\mu_2, \mu_3 and variances \sigma_1^2, \sigma_2^2, \sigma_3^2, then for a given set of data
points the GMM will identify the probability of each data point belonging to each of these
distributions.
It is a probability distribution that consists of multiple probability distributions, i.e., multiple
Gaussians.
The probability density function of a d-dimensional Gaussian distribution is defined as:

N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)
Why do we use the Variance-Covariance Matrix?


The covariance is a measure of how changes in one variable are associated with changes in a
second variable. It is not about the independence of the two variables but about how they
change together. The variance-covariance matrix measures how these variables are related to
each other; in that sense it is very similar to the standard deviation, except that when we have
more dimensions the covariance matrix gives a more accurate description than the standard
deviation alone.

where V is the c × c variance-covariance matrix, N is the number of scores in each of the c
datasets, x_i is a deviation score from the i-th dataset, \sum x_i^2 / N is the variance of the
elements from the i-th dataset, and \sum x_i x_j / N is the covariance of the elements from the
i-th and j-th datasets.
The probability given by a mixture of K Gaussians, where K is the number of distributions, is:

p(x) = \sum_{k=1}^{K} w_k \, N(x \mid \mu_k, \Sigma_k)

Once we multiply the d-dimensional Gaussian density by W, the prior probability (weight) of
each of our Gaussians, we get the probability of a given data point X under the mixture. If we
were to plot multiple Gaussian distributions separately, we would get multiple bell curves; what
we really want is a single continuous curve that is a combination of those bell curves. Once we
have that combined curve, then for a given data point it can tell us the probability that the point
belongs to each specific class.
Now, we would like to find the maximum likelihood estimate of X (the data point we want to
predict the probability) i.e. we want to maximize the likelihood that X belongs to a particular
class or we want to find a class that this data point X is most likely to be part of.
It is very similar to the k-means algorithm. It uses the same optimization strategy which is the
expectation maximization algorithm.
K-Means VS Gaussian Mixture Model

The standard deviation enters the picture because the \sigma^2 term in the denominator of the
Gaussian takes the variation of each cluster into account, whereas K-means only calculates the
conventional Euclidean distance; i.e., K-means calculates distances and a GMM calculates
(probability) weights.
This means that the k-means algorithm gives a hard assignment: it says that a data point is
either part of one class or part of another. In a lot of cases that hard assignment is all we want,
but in many other cases a soft assignment is better. Sometimes we want not only the most
likely class (e.g., 70% likely to belong to this class) but also the probabilities of belonging to the
other classes; a point could be 60% likely to belong to one class and 40% likely to belong to
another. That is why we incorporate the standard deviation.
Expectation-Maximization (EM) Algorithm: EM can be used for variables that are not directly
observable but can be deduced from the values of other observed variables. It can be used with
unlabeled data for classification, and it is one of the popular approaches for maximizing the
likelihood.
Basic idea of the EM algorithm: given a set of incomplete data and a set of starting parameters,
E-Step: Using the given data and the current values of the parameters, estimate the values of the
hidden variables.
M-Step: Using the estimates from the E-step, update the parameters to maximize the expected
joint likelihood of the observed data and the hidden variables.
Usage of EM Algorithm
1. Can be used to fill missing data.
2. To find the values of latent variables.
The disadvantage of EM algorithm is that it has slow convergence and it converges up to local
optima only.
Comparing to Gradient Descent
Gradient descent computes a derivative, which tells us in what direction to move the model's
parameters so that the model better fits the data. But what if we cannot compute a gradient,
i.e., a derivative, of a variable? The Gaussian mixture model contains hidden random variables;
it is a stochastic, non-deterministic model. We cannot compute the derivative of a random
variable, which is why gradient descent is not used here and EM is used instead.
Applications
 GMM is widely used in the field of signal processing.
 GMM provides good results in language Identification.
 Customer Churn is another example.
 GMM finds use cases in Anomaly Detection.
 GMM is also used to track the object in a video frame.
 GMM can also be used to classify songs based on genres.
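To make this concrete, here is a minimal sketch of fitting a Gaussian mixture with scikit-learn's GaussianMixture (which is fitted with the EM algorithm internally) on synthetic data; note the soft assignments returned by predict_proba, in contrast to the hard assignments of K-means:

# Minimal sketch: fit a 3-component GMM on toy data and inspect the
# hard labels (predict) and soft, per-component probabilities (predict_proba).
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=1.5, random_state=3)

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=0)
gmm.fit(X)

hard_labels = gmm.predict(X)        # hard assignment, like K-means
soft_labels = gmm.predict_proba(X)  # probability of each point under each component

print("estimated means:\n", gmm.means_)
print("first point's component probabilities:", soft_labels[0].round(3))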

8) Define PCA? Explain implementation of PCA with Scikit-learn?


Principal Component Analysis (PCA) is a statistical procedure that uses an orthogonal
transformation that converts a set of correlated variables to a set of uncorrelated variables.
PCA is the most widely used tool in exploratory data analysis and in machine learning for
predictive models. Moreover, PCA is an unsupervised statistical technique used to examine
the interrelations among a set of variables. It is also known as a general factor analysis where
regression determines a line of best fit.
WHY PCA?
 When there are many input attributes, it is difficult to visualize the data. There is a very
famous term, the 'curse of dimensionality', in the machine learning domain.
 Basically, it refers to the fact that a higher number of attributes in a dataset adversely
affects the accuracy and training time of the machine learning model.
 Principal Component Analysis (PCA) is a way to address this issue and is used for better
data visualization and improving accuracy.
How does PCA work?
 PCA is an unsupervised pre-processing task that is carried out before applying any ML
algorithm. PCA is based on “orthogonal linear transformation” which is a mathematical
technique to project the attributes of a data set onto a new coordinate system. The
attribute which describes the most variance is called the first principal component and is
placed at the first coordinate.
 Similarly, the attribute which stands second in describing variance is called a second
principal component and so on. In short, the complete dataset can be expressed in terms
of principal components. Usually, more than 90% of the variance is explained by
two/three principal components.
 Principal component analysis, or PCA, thus converts data from high dimensional space to
low dimensional space by selecting the most important attributes that capture maximum
information about the dataset.
Python Implementation:
 To implement PCA in Scikit learn, it is essential to standardize/normalize the data before
applying PCA.
 PCA is imported from sklearn.decomposition. We need to select the required number of
principal components.
 Usually, n_components is chosen to be 2 for better visualization but it matters and
depends on data.
 The attributes are passed to PCA using the fit and transform methods.
 The values of the principal components can be checked using components_, while the variance
explained by each principal component can be obtained from explained_variance_ratio_.
1. Import all the libraries
 Python3
# import all libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

2. Loading Data
Load the breast_cancer dataset from sklearn.datasets. The dataset has 569 data items with 30
input attributes. There are two output classes: benign and malignant. With 30 input features, it
is impossible to visualize this data directly.
 Python3

# import the breast_cancer dataset
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
data.keys()

# Check the output classes
print(data['target_names'])

# Check the input attributes
print(data['feature_names'])

Output:

3. Apply PCA
 Standardize the dataset prior to PCA.
 Import PCA from sklearn.decomposition.
 Choose the number of principal components.
Let us set it to 3. After executing this code, we find that the dimensions of x are (569, 3) while
the dimensions of the actual data are (569, 30). Thus, it is clear that with PCA the number of
dimensions has been reduced from 30 to 3. If we chose n_components=2, the dimensions
would be reduced to 2.

# construct a dataframe using pandas
df1 = pd.DataFrame(data['data'], columns=data['feature_names'])

# Scale data before applying PCA
scaling = StandardScaler()

# Use fit and transform method
scaling.fit(df1)
Scaled_data = scaling.transform(df1)

# Set the n_components=3
principal = PCA(n_components=3)
principal.fit(Scaled_data)
x = principal.transform(Scaled_data)

# Check the dimensions of data after PCA
print(x.shape)

Output:
(569,3)
4. Check Components
The principal.components_ provide an array in which the number of rows tells the number of
principal components while the number of columns is equal to the number of features in
actual data. We can easily see that there are three rows as n_components was chosen to be
3. However, each row has 30 columns as in actual data.
 Python3

# Check the values of the eigenvectors
# produced by the principal components
principal.components_
5. Plot the components (Visualization)
Plot the principal components for better data visualization. Though we took n_components=3,
we plot both a 2D graph (using the first two principal components) and a 3D graph (using all
three principal components). The colors show the two output classes of the original dataset,
benign and malignant. It is clear that the principal components show a clear separation
between the two output classes.
 Python3

plt.figure(figsize=(10,10))
plt.scatter(x[:,0],x[:,1],c=data['target'],cmap='plasma')
plt.xlabel('pc1')
plt.ylabel('pc2')

Output:
For three principal components, we need to plot a 3d graph. x[:,0] signifies the first principal
component. Similarly, x[:,1] and x[:,2] represent the second and the third principal
component.
 Python3

# import relevant libraries for 3d graph
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(10,10))

# choose projection 3d for creating a 3d graph
axis = fig.add_subplot(111, projection='3d')

# x[:,0] is pc1, x[:,1] is pc2 and x[:,2] is pc3
axis.scatter(x[:,0], x[:,1], x[:,2], c=data['target'], cmap='plasma')
axis.set_xlabel("PC1", fontsize=10)
axis.set_ylabel("PC2", fontsize=10)
axis.set_zlabel("PC3", fontsize=10)

Output:
6. Calculate variance ratio
The explained_variance_ratio_ attribute provides an idea of how much variation is explained by
each principal component.
 Python3

# check how much variance is explained by each principal component
print(principal.explained_variance_ratio_)

Output:
array([0.44272026, 0.18971182, 0.09393163])
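A natural follow-up (a sketch continuing from the Scaled_data array prepared above) is to look at the cumulative explained variance, or to let PCA pick the number of components needed to retain, say, 95% of the variance:

# Cumulative explained variance, and automatic selection of the number of
# components that retain 95% of the variance (uses Scaled_data from above).
import numpy as np
from sklearn.decomposition import PCA

pca_full = PCA().fit(Scaled_data)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative[:5])          # variance retained by the first 5 components

pca_95 = PCA(n_components=0.95).fit(Scaled_data)
print(pca_95.n_components_)    # number of components needed for 95% variance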

9) Describe Kernel PCA?


PRINCIPAL COMPONENT ANALYSIS is a tool used to reduce the dimensionality of the data. It
allows us to reduce the dimensionality without much loss of information. PCA reduces the
dimensionality by finding a few orthogonal linear combinations (principal components) of the
original variables with the largest variance.
The first principal component captures most of the variance in the data. The second principal
component is orthogonal to the first and captures the variance remaining after the first
principal component, and so on. There are as many principal components as the number of
original variables.
These principal components are uncorrelated and are ordered in such a way that the first
several principal components explain most of the variance of the original data.
KERNEL PCA:
PCA is a linear method, so it works best on datasets that are linearly separable. It does an
excellent job for such datasets, but if we apply it to non-linear datasets we might get a result
that is not the optimal dimensionality reduction.
Kernel PCA uses a kernel function to project the dataset into a higher-dimensional feature
space, where it becomes linearly separable. It is similar to the idea behind Support Vector
Machines. There are various kernels, such as linear, polynomial, and Gaussian (RBF).
Code: Create a dataset which is nonlinear and then apply PCA on the dataset.

import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=500, noise=0.02, random_state=417)

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Code: Let’s apply PCA on this dataset

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

plt.title("PCA")
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("Component 1")
plt.ylabel("Component 2")
plt.show()

As you can see, PCA failed to distinguish the two classes.

Code: Applying kernel PCA on this dataset with RBF kernel with a gamma value of 15.

from sklearn.decomposition import KernelPCA

kpca = KernelPCA(kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)

plt.title("Kernel PCA")
plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=y)
plt.show()
In the kernel space the two classes are linearly separable. Kernel PCA uses a kernel function
to project the dataset into a higher-dimensional space, where it is linearly separable.
Finally, we applied the kernel PCA to a non-linear dataset using scikit-learn.
