ML and AI: Unit 04 and Unit 05
It is shown that the k-nearest neighbor algorithm (kNN) outperforms the first nearest
neighbor algorithm only under certain conditions. Data sets must contain moderate
amounts of noise. Training examples from the different classes must belong to clusters that
allow an increase in the value of k without reaching into clusters of other classes. Methods
for choosing the value of k for kNN are investigated. It is shown that one-fold cross-validation
on a restricted number of values for k suffices for best performance. It is also shown that for
best performance the votes of the k-nearest neighbors of a query should be weighted in
inverse proportion to their distances from the query.
The primary contributions of this dissertation are (a) several improvements to existing
distance-based algorithms, (b) several new distance-based algorithms, and (c) an
experimentally supported understanding of the conditions under which various distance-
based algorithms are likely to give good performance.
K-Nearest Neighbors
The K-Nearest Neighbors algorithm is a supervised machine learning algorithm for labeling
an unknown data point given existing labeled data.
The nearness of points is typically determined using a distance measure, such as the
Euclidean distance, computed over the features of the data. The algorithm will classify a
point based on the labels of the K nearest neighbor points, where the value of K can be
specified.
KNN of Unknown Data Point
To classify an unknown data point using the KNN (K-Nearest Neighbor) algorithm: compute the distance from the unknown point to every labeled point, select the K points with the smallest distances, and assign the unknown point the label that occurs most often among those K neighbors.
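As a rough illustration of these steps, here is a minimal NumPy sketch of a KNN classifier; the toy data and the choice of k = 3 are invented for the example.

    import numpy as np
    from collections import Counter

    def knn_classify(query, points, labels, k=3):
        """Classify `query` by majority vote among its k nearest labeled points."""
        # Euclidean distance from the query to every labeled point
        distances = np.linalg.norm(points - query, axis=1)
        # Indices of the k closest points
        nearest = np.argsort(distances)[:k]
        # Majority vote among their labels
        return Counter(labels[nearest]).most_common(1)[0][0]

    # Toy data: two features per point, two classes
    points = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0], [0.9, 1.1]])
    labels = np.array(["A", "A", "B", "B", "A"])
    print(knn_classify(np.array([1.1, 1.0]), points, labels, k=3))  # -> "A"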
Normalizing Data
Normalization is a process of converting the numeric columns in the dataset to a common
scale while retaining the underlying differences in the range of values.
For example, Min-max normalization converts each value of the numeric column to a value
between 0 and 1 using the formula Normalized value = (NumericValue - MinValue) /
(MaxValue - MinValue). A downside of Min-max Normalization is that it does not handle
outliers very well.
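A small sketch of min-max normalization as defined above (the numeric column is a made-up example):

    import numpy as np

    def min_max_normalize(column):
        """Scale a numeric column to the [0, 1] range."""
        col = np.asarray(column, dtype=float)
        return (col - col.min()) / (col.max() - col.min())

    ages = [18, 25, 40, 60]          # hypothetical numeric column
    print(min_max_normalize(ages))   # approx. [0. 0.17 0.52 1.]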
Regression in KNN Algorithm
K-Nearest Neighbor algorithm uses ‘feature similarity’ to predict values of any new data
points. This means that the new point is assigned a value based on how closely it resembles
the points in the training set. In the regression setting, the average of the neighbors' values is
taken as the final prediction, whereas in the classification setting the mode (most frequent value)
of the neighbors' labels is taken as the final prediction.
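For the regression case, the only change from the classification sketch above is that the k nearest values are averaged instead of voted on; a minimal sketch under the same assumptions:

    import numpy as np

    def knn_regress(query, points, values, k=3):
        """Predict a value for `query` as the mean of its k nearest neighbors' values."""
        distances = np.linalg.norm(points - query, axis=1)
        nearest = np.argsort(distances)[:k]
        return values[nearest].mean()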
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or tests are performed on the basis of the features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
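A short sketch of fitting a CART-style decision tree with scikit-learn; the tiny weather-like dataset and its encoding are invented for illustration:

    from sklearn.tree import DecisionTreeClassifier

    # Hypothetical encoded features: [outlook (0=sunny, 1=overcast, 2=rain), windy (0/1)]
    X = [[0, 0], [1, 0], [2, 1], [2, 0], [1, 1]]
    y = ["play", "play", "no play", "play", "play"]

    # scikit-learn's DecisionTreeClassifier implements an optimized CART algorithm
    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(tree.predict([[2, 1]]))  # e.g. ['no play']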
Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine. Consider the below diagram, in which two different categories are classified using a
decision boundary or hyperplane:
Example: SVM can be understood with the example that we used for the KNN classifier.
Suppose we see a strange cat that also has some features of dogs; if we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created using the
SVM algorithm. We first train our model with many images of cats and dogs so that it can
learn the different features of cats and dogs, and then we test it with this strange creature.
The support vector machine creates a decision boundary between these two classes (cat and
dog) and chooses the extreme cases (support vectors) of cats and dogs. On the basis of the
support vectors, it will classify the creature as a cat. Consider the below diagram
SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:
o Linear SVM: Linear SVM is used for linearly separable data, which means that if a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means that if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
The dimensions of the hyperplane depend on the number of features present in the dataset:
if there are 2 features (as shown in the image), then the hyperplane is a straight line, and
if there are 3 features, then the hyperplane is a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, which means the maximum
distance between the hyperplane and the nearest data points.
Support Vectors:
The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed support vectors. Since these vectors support the hyperplane,
they are called support vectors.
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
Since this is a 2-D space, we can easily separate these two classes with a straight line.
But there can be multiple lines that separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from
both classes. These points are called support vectors. The distance between the vectors
and the hyperplane is called the margin, and the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
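A minimal scikit-learn sketch of a linear SVM on two features, in the spirit of the (x1, x2) example above; the two well-separated blobs are synthetic stand-ins for the green and blue points:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)

    clf = SVC(kernel="linear").fit(X, y)
    print(clf.support_vectors_)        # the points that define the maximum-margin hyperplane
    print(clf.predict([[0.2, 0.1]]))   # -> [0]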
Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:
So now, SVM will divide the datasets into classes in the following way. Consider the below
image:
Since we are in 3-D space, the separating boundary looks like a plane parallel to the x-axis. If we
convert it back to 2-D space with z = 1, it becomes:
Hence, we get a circle of radius 1 in the case of non-linear data.
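The mapping described above can be written directly; a small sketch that adds the third dimension z = x² + y² to a few made-up 2-D points:

    import numpy as np

    points_2d = np.array([[0.2, 0.1], [0.9, 0.3], [-0.8, 0.7]])   # hypothetical 2-D points
    z = points_2d[:, 0] ** 2 + points_2d[:, 1] ** 2               # z = x^2 + y^2
    points_3d = np.column_stack([points_2d, z])
    # Points inside the circle x^2 + y^2 = 1 now have z < 1, points outside have z > 1,
    # so a plane at z = 1 separates them linearly in the new space.
    print(points_3d)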
Kernel methods are approaches for dealing with linearly inseparable data or non-linear data
sets like those presented in fig-1. The concept is to use a mapping function to project
nonlinear combinations of the original features onto a higher-dimensional space, where the
data becomes linearly separable. The two-dimensional dataset (X1, X2) is projected into a
new three-dimensional feature space (Z1, Z2, Z3) in the diagram above, where the classes
become separable.
To grasp it completely, Assume we have two vectors, x and x*, in a 2D space (illustrated in
fig-1) and want to find a linear classifier by performing a dot product between them.
Unfortunately, in our current 2D vector space, the data is not linearly separable. We can
address this challenge by mapping the two vectors to a 3D space.
x→ϕ(x)
x∗→ϕ(x*)
where ϕ(x) and ϕ(x*) are 3D representations of x and x*, respectively. Now we can discover
our linear classifier in the 3D space by using the dot product of ϕ(x) and ϕ(x*), replacing the
original 2D dot product as below:
xᵀ x*  →  ϕ(x)ᵀ ϕ(x*)
A mapping function can be used to convert the training data into a higher-dimensional
feature space, and then a linear SVM model can be trained to classify the data in this new
feature space following the method outlined above. Using the mapping function, the new
data may then be fed into the model for categorization. However, this method is
computationally intensive. So, what is the solution?
The approach is to use a method to avoid explicitly mapping the input data to a high-
dimensional feature space in order to train linear learning algorithms to learn a nonlinear
function or decision boundary. This is known as a kernel trick. It should be noted that the
kernel trick is significantly more general than SVM.
In real-world applications, data may contain numerous features, and transformations using
multiple polynomial combinations of these features will result in extremely large and
prohibitive processing costs.
Kernel Trick
This problem can be solved using the kernel trick. Instead of explicitly applying the
transformation ϕ(x) and representing the data by these transformed coordinates in the
higher dimensional feature space, kernel methods represent the data only through a set of
pairwise similarity comparisons between the original data observations x (with the original
coordinates in the lower dimensional space).
Our kernel function takes in lower-dimensional inputs and outputs the dot product of
converted vectors in higher-dimensional space. Other theorems guarantee that such kernel
functions exist under certain conditions.
The kernel function simplifies the process of determining the mapping function. As a result,
the kernel function in the altered space specifies the inner product. Different types of kernel
functions are listed below. However, based on the requirement that the kernel function is
symmetric, one can create their own kernel functions.
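A small numerical sketch of the kernel trick: for the degree-2 polynomial kernel, the kernel value computed in the original space equals the dot product of the explicitly mapped vectors (the two vectors are arbitrary examples):

    import numpy as np

    x = np.array([1.0, 2.0])
    y = np.array([3.0, 4.0])

    def phi(v):
        """Explicit degree-2 polynomial feature map for a 2-D vector."""
        return np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])

    explicit = phi(x) @ phi(y)      # dot product in the higher-dimensional space
    kernel = (x @ y) ** 2           # polynomial kernel computed in the original space
    print(explicit, kernel)         # both are 121.0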
Polynomial Kernel
The polynomial kernel is a kernel function that allows the learning of non-linear models by
representing the similarity of vectors (training samples) in a feature space over polynomials
of the original variables. It is often used with support vector machines (SVMs) and other
kernelized models.
Sigmoid Kernel
It is primarily used in neural networks. This kernel function is similar to the activation
function for neurons in a two-layer perceptron model of a neural network.
Linear Kernel
It is the most fundamental sort of kernel and is usually one-dimensional in structure. When
there are numerous characteristics, it proves to be the best function. The linear kernel is
commonly used for text classification issues since most of these problems can be linearly
split. Other functions are slower than linear kernel functions.
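The three kernels above can be written as simple functions of two vectors; a rough sketch (the gamma, r, and d values shown are arbitrary defaults chosen for illustration):

    import numpy as np

    def linear_kernel(x, y):
        return x @ y

    def polynomial_kernel(x, y, gamma=1.0, r=1.0, d=3):
        return (gamma * (x @ y) + r) ** d

    def sigmoid_kernel(x, y, gamma=0.1, r=0.0):
        return np.tanh(gamma * (x @ y) + r)

In scikit-learn these correspond to kernel='linear', kernel='poly' and kernel='sigmoid' in SVC.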
A task involving machine learning may not be linear, but it has a number of well known steps:
Problem definition.
Preparation of Data.
One good way to come to terms with a new problem is to work through identifying and
defining the problem in the best possible way and to learn a model that captures meaningful
information from the data. While problems in Pattern Recognition and Machine Learning can
take many forms, they are usually grouped into the following categories:
Supervised Learning:
The system is presented with example inputs and their desired outputs, given
by a “teacher”, and the goal is to learn a general rule that maps inputs to
outputs.
Unsupervised Learning:
No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself
(discovering hidden patterns in data) or a means towards an end (feature
learning).
Reinforcement Learning:
A system interacts with a dynamic environment in which it must perform a
certain goal (such as driving a vehicle or playing a game against an
opponent). The system is provided feedback in terms of rewards and
punishments as it navigates its problem space.
Semi-supervised Learning:
The teacher gives an incomplete training signal: a training set with some
(often many) of the target outputs missing.
We will focus on unsupervised learning and data clustering in this blog post.
Unsupervised Learning
In some pattern recognition problems, the training data consists of a set of input vectors x
without any corresponding target values. The goal in such unsupervised learning problems
may be to discover groups of similar examples within the data, where it is called clustering,
or to determine how the data is distributed in the space, known as density estimation. To put
it in simpler terms: for an n-sampled space x1 to xn, true class labels are not provided.
There may be cases where we don't know how many or what classes the data is
divided into. Example: Data Mining
We may want to use clustering to gain some insight into the structure of the
data before designing a classifier.
Clustering can be considered the most important unsupervised learning problem; so, as
every other problem of this kind, it deals with finding a structure in a collection of unlabeled
data. A loose definition of clustering could be “the process of organizing objects into groups
whose members are similar in some way”. A cluster is therefore a collection of objects which
are “similar” between them and are “dissimilar” to the objects belonging to other clusters.
Distance-based clustering
Given a set of points, with a notion of distance between points, group the points into some
number of clusters so that internal (within-cluster) distances are small, i.e., members of a
cluster are close/similar to each other.
The goal of clustering is to determine the internal grouping in a set of unlabeled data. But
how to decide what constitutes a good clustering? It can be shown that there is no absolute
“best” criterion which would be independent of the final aim of the clustering. Consequently,
it is the user who should supply this criterion, in such a way that the result of the clustering
will suit their needs.
In the above image, how do we know what is the best clustering solution?
To find a particular clustering solution, we need to define the similarity measures for the
clusters.
Proximity Measures
For clustering, we need to define a proximity measure for two data points. Proximity here
means how similar/dissimilar the samples are with respect to each other.
A “good” proximity measure is VERY application dependent. The clusters should be invariant
under the transformations “natural” to the problem. Also, while clustering it is not advised to
Clustering Algorithms
Exclusive Clustering
Overlapping Clustering
Hierarchical Clustering
Probabilistic Clustering
In the first case data are grouped in an exclusive way, so that if a certain data point belongs
to a definite cluster then it could not be included in another cluster. A simple example of that
is shown in the figure below, where the separation of points is achieved by a straight line on
a bi-dimensional plane.
On the contrary, the second type, overlapping clustering, uses fuzzy sets to cluster data,
so that each point may belong to two or more clusters with different degrees of membership.
A hierarchical clustering algorithm is based on the union between the two nearest clusters.
The beginning condition is realized by setting every data point as a cluster. After a few iterations, it reaches the final clusters wanted.
In this blog we will talk about four of the most used clustering algorithms:
K-means
Fuzzy K-means
Hierarchical clustering
Mixture of Gaussians
Each of these algorithms belongs to one of the clustering types listed above: K-means is an
exclusive clustering algorithm, Fuzzy K-means is an overlapping clustering algorithm,
Hierarchical clustering is obvious, and Mixture of Gaussians is a probabilistic clustering
algorithm. We will discuss each clustering method in the following paragraphs.
K-Means Clustering
K-means is one of the simplest unsupervised learning algorithms that solves the well known
clustering problem. The procedure follows a simple and easy way to classify a given data set
through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to
define k centres, one for each cluster. These centroids should be placed in a smart way,
because different locations cause different results. So, the better choice is to place them as
far away from each other as possible. The next step is to take each point belonging to a
given data set and associate it to the nearest centroid. When no point is pending, the first
step is completed and an early groupage is done. At this point we need to re-calculate k new
centroids as barycenters of the clusters resulting from the previous step. After we have these
k new centroids, a new binding has to be done between the same data set points and the
nearest new centroid. A loop has been generated. As a result of this loop, we may notice that
the k centroids change their location step by step until no more changes are done; in other
words, the centroids do not move any more.
Finally, this algorithm aims at minimizing an objective function, in this case a squared error
function:
J = Σ_{j=1..k} Σ_{i=1..n} || x_i(j) − c_j ||²
where || x_i(j) − c_j ||² is a chosen distance measure between a data point x_i(j) and the
cluster centre c_j, and J is an indicator of the distance of the n data points from their
respective cluster centres.
The algorithm is composed of the following steps:
1) Place k points into the space represented by the objects that are being clustered;
these points represent the initial cluster centers.
2) Calculate the distance between each data point and the cluster centers.
3) Assign each data point to the cluster center whose distance from the cluster
center is the minimum of all the cluster centers.
4) Recalculate the new cluster centers as the means of the points assigned to them.
5) Recalculate the distance between each data point and the newly obtained cluster
centers.
6) If no data point was reassigned then stop, otherwise repeat from step 3).
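A minimal NumPy sketch of the steps above, assuming the data is a 2-D array and k is chosen in advance (and that no cluster ever becomes empty):

    import numpy as np

    def k_means(X, k, n_iter=100, seed=0):
        """Plain k-means: returns final centroids and the cluster index of each point."""
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 1: random centres
        for _ in range(n_iter):
            # steps 2-3: distance to every centroid, assign each point to the nearest one
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            assignments = distances.argmin(axis=1)
            # step 4: recompute each centroid as the mean (barycentre) of its cluster
            new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_centroids, centroids):               # step 6: no change -> stop
                break
            centroids = new_centroids
        return centroids, assignments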
Although it can be proved that the procedure will always terminate, the k-means algorithm
does not necessarily find the most optimal configuration, corresponding to the global
objective function minimum. The algorithm is also significantly sensitive to the initial
randomly selected cluster centres. The k-means algorithm can be run multiple times to reduce
this effect. As we are going to see, it is a good candidate for extension to work with fuzzy
feature vectors.
The k-means procedure can be viewed as a greedy algorithm for partitioning the n samples
into k clusters so as to minimize the sum of the squared distances to the cluster centers. It
does have some weaknesses:
The way to initialize the means was not specified. One popular way to start is
to randomly choose k of the samples.
The results depend on the value of k and there is no optimal way to describe
a best “k”.
This last problem is particularly troublesome, since we often have no way of knowing how
many clusters exist. In the example shown above, the same algorithm applied to the same
data produces the following 3-means clustering. Is it better or worse than the 2-means
clustering?
Unfortunately there is no general theoretical solution to find the optimal number of clusters
for any given data set. A simple approach is to compare the results of multiple runs with
different k values and choose the best one according to a given criterion, but we need to be
careful because increasing k results in smaller error function values by definition, but also in
an increasing risk of overfitting.
In fuzzy clustering, each point has a probability of belonging to each cluster, rather than
completely belonging to just one cluster as it is the case in the traditional k-means. Fuzzy k-
means specifically tries to deal with the problem where points are somewhat in between
centers or otherwise ambiguous by replacing distance with probability, which of course could
be some function of distance, such as having probability relative to the inverse of the
distance. Fuzzy k-means uses a weighted centroid based on those probabilities. Processes of
initialization, iteration, and termination are the same as the ones used in k-means. The
resulting clusters are best analyzed as probabilistic distributions rather than a hard
assignment of labels. One should realize that k-means is a special case of fuzzy k-means when
the probability function used is simply 1 if the data point is closest to a centroid and 0
otherwise.
For a better understanding, we may consider this simple mono-dimensional example. Given a
certain data set, suppose to represent it as distributed on an axis. The figure below shows
this:
Looking at the picture, we may identify two clusters in proximity of the two data
concentrations. We will refer to them using ‘A’ and ‘B’. In the first approach shown in this
tutorial, the k-means algorithm, we associated each data point with a specific centroid, so it
belonged exclusively to a well-defined cluster. In the fuzzy approach, instead, a data point
does not belong exclusively to a well-defined cluster, but can be placed in a middle way: the
membership function follows a smoother curve to indicate that every data point may belong
to several clusters with different degrees of membership.
In the figure above, the data point shown as a red marked spot belongs more to the B cluster
than to the A cluster. The value 0.2 of ‘m’ indicates the degree of membership to A for that
data point.
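A rough sketch of the membership idea described above: each point gets a weight for each centroid that decays with distance. This uses a simplified inverse-distance weighting normalized to sum to 1, not the full fuzzy c-means update, and the centroids are assumed to be already known:

    import numpy as np

    def fuzzy_memberships(X, centroids, eps=1e-9):
        """Return an (n_points, n_clusters) matrix of membership degrees in [0, 1]."""
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        weights = 1.0 / (distances + eps)            # closer centroid -> larger weight
        return weights / weights.sum(axis=1, keepdims=True)

    def weighted_centroids(X, memberships):
        """Recompute centroids as membership-weighted means of the data."""
        return (memberships.T @ X) / memberships.sum(axis=0)[:, None]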
Hierarchical Clustering
Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic
process of hierarchical clustering is this:
Start by assigning each item to a cluster, so that if you have N items, you now
have N clusters, each containing just one item. Let the distances (similarities)
between the clusters be the same as the distances (similarities) between the
items they contain.
Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that now you have one cluster less.
Compute distances (similarities) between the new cluster and each of the old
clusters.
Repeat steps 2 and 3 until all items are clustered into a single cluster of size
N.
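The agglomerative procedure above is implemented in SciPy; a short sketch (the sample points are arbitrary):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [9.0, 0.1]])

    # 'single' linkage merges the closest pair of clusters at each step,
    # starting from one cluster per point, exactly as described above.
    Z = linkage(X, method="single")
    print(fcluster(Z, t=3, criterion="maxclust"))   # e.g. [1 1 2 2 3]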
There’s another way to deal with clustering problems: a model-based approach, which
consists in using certain models for clusters and attempting to optimize the fit between the
data and the model. In practice, each cluster can be mathematically represented by a
parametric distribution, such as a Gaussian. The entire data set is therefore modelled by a
mixture of these distributions.
A mixture model with high likelihood tends to have the following traits:
component distributions have high “peaks” (data in one cluster are tight);
the mixture model “covers” the data well (dominant patterns in the data are
captured by component distributions).
Mixture of Gaussians
The most widely used clustering method of this kind is based on learning a mixture of
Gaussians:
A mixture model is a combination of k component distributions that collectively make up the
mixture distribution f(x):
f(x) = Σ_k α_k f_k(x)
The α_k represents the contribution of the kth component in constructing f(x). In practice,
parametric distributions (e.g. Gaussians) are often used, since a lot of work has been done to
understand their behaviour. If you substitute each f_k(x) with a Gaussian, you get what is
known as a Gaussian mixture model (GMM).
The EM Algorithm
The model assumes that the data are generated by a mixture of normal distributions (note
that this is a very strong assumption, in particular when you fix the number of components).
EM is an algorithm for maximizing a likelihood function when some of the variables in your
model are unobserved (i.e. when you have latent variables).
You might fairly ask: if we're just trying to maximize a function, why don't we just use the
existing machinery for maximizing functions? Well, if you try to maximize this by taking
derivatives and setting them to zero, you find that in many cases the first-order conditions
don’t have a solution. There’s a chicken-and-egg problem in that to solve for your model
parameters you need to know the distribution of your unobserved data; but the distribution
of your unobserved data is a function of your model parameters. EM works around this by
iteratively guessing a distribution for the unobserved data, then estimating the model
parameters by maximizing something that is
a lower bound on the actual likelihood function, and repeating until convergence:
E-step: For each datapoint that has missing values, use your model equation
to solve for the distribution of the missing data given your current guess of
the model parameters and given the observed data (note that you are solving
for a distribution for each missing value, not for the expected value). Now
that we have a distribution for each missing value, we can calculate
the expectation of the likelihood function with respect to the unobserved
variables. If our guess for the model parameter was correct, this expected
likelihood will be the actual likelihood of our observed data; if the
parameters were not correct, it will just be a lower bound.
M-step: Given the expected likelihood computed in the E-step, re-estimate the model
parameters by maximizing it; the updated parameters are then used in the next E-step,
and the two steps are repeated until convergence.
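In practice, the EM procedure for a mixture of Gaussians is available off the shelf; a minimal sketch with scikit-learn (the data here is synthetic):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

    # GaussianMixture runs EM internally: E-step (responsibilities), M-step (update the
    # mixing weights alpha_k, means and covariances), repeated until the likelihood converges.
    gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(gmm.weights_)               # the alpha_k mixing coefficients
    print(gmm.predict_proba(X[:3]))   # soft cluster responsibilities for three points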
There are a number of problems with clustering. Among them:
dealing with a large number of dimensions and a large number of data items can
be problematic because of time complexity;
the result of the clustering algorithm (which in many cases can be arbitrary
itself) can be interpreted in different ways.
K-means is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group having similar properties.
It allows us to cluster the data into different groups and provides a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.
It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data points and their
corresponding cluster centroids.
The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters, and
repeats the process until it finds the best clusters. The value of k should be predetermined in
this algorithm.
The k-means clustering algorithm mainly performs two tasks:
o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.
Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.
The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work?
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids. (They can be points other than those from the input dataset.)
Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.
Step-4: Calculate the variance and place a new centroid of each cluster.
Step-5: Repeat the third step, which means reassign each data point to the new closest
centroid of each cluster.
Step-6: If any reassignment occurs, go back to Step-4; otherwise the model is ready.
Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroid to form the cluster. These
points can be either the points from the dataset or any other point. So, here we are
selecting the below two points as k points, which are not the part of our dataset.
Consider the below image:
o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median between both
the centroids. Consider the below image:
From the above image, it is clear that the points on the left side of the line are near the K1 or
blue centroid, and the points to the right of the line are close to the yellow centroid. Let's color
them blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of these
centroids, and will find new centroids as below:
o Next, we will reassign each datapoint to the new centroid. For this, we will repeat
the same process of finding a median line. The median will be like below image:
From the above image, we can see that one yellow point is on the left side of the line, and two
blue points are to the right of the line. So, these three points will be assigned to new centroids.
A dataset contains a huge number of input features in various cases, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.
Handling the high-dimensional data is very difficult in practice, commonly known as the curse
of dimensionality. If the dimensionality of the input dataset increases, any machine learning
algorithm and model becomes more complex. As the number of features increases, the
number of samples needed to cover the feature space also increases, and the chance of
overfitting also increases. If the machine learning model is trained on high-dimensional data, it becomes
overfitted and results in poor performance.
Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.
Some benefits of applying dimensionality reduction technique to the given dataset are given
below:
o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Less Computation training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.
There are also some disadvantages of applying the dimensionality reduction, which are given
below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, sometimes the principal components
required to consider are unknown.
There are two ways to apply the dimension reduction technique, which are given below:
Feature Selection
Feature selection is the process of selecting the subset of the relevant features and leaving
out the irrelevant features present in a dataset to build a model of high accuracy. In other
words, it is a way of selecting the optimal features from the input dataset.
1. Filters Methods
In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of filters method are:
o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.
2. Wrappers Methods
The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features
to increase the accuracy of the model. This method is more accurate than the filtering method
but more complex to run. Some common techniques of wrapper methods are:
o Forward Selection
o Backward Selection
o Bi-directional Elimination
3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:
o LASSO
o Elastic Net
o Ridge Regression, etc.
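As rough sketches of two of the families described above: a filter-style selection with the chi-square test, and an embedded-style selection where LASSO drives the coefficients of unhelpful features to zero during training (the datasets and thresholds are placeholders):

    import numpy as np
    from sklearn.datasets import load_iris, make_regression
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.linear_model import Lasso

    # Filter method: score each feature independently against the target (chi-square test)
    X, y = load_iris(return_X_y=True)
    filter_mask = SelectKBest(score_func=chi2, k=2).fit(X, y).get_support()
    print(filter_mask)                       # boolean mask over the original features

    # Embedded method: LASSO shrinks unimportant coefficients to exactly zero while training
    Xr, yr = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=0)
    lasso = Lasso(alpha=1.0).fit(Xr, yr)
    print(np.flatnonzero(lasso.coef_))       # indices of the features that survived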
Feature Extraction:
Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.
Principal Component Analysis (PCA)
PCA works by considering the variance of each attribute, because high variance indicates a
good split between the classes, and hence it reduces the dimensionality. Some real-world
applications of PCA are image processing, movie recommendation systems, and optimizing
the power allocation in various communication channels.
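A minimal PCA sketch with scikit-learn; it projects the data onto the directions of highest variance, which is the property the paragraph above relies on:

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA

    X, _ = load_iris(return_X_y=True)

    pca = PCA(n_components=2)             # keep the 2 highest-variance directions
    X_reduced = pca.fit_transform(X)      # 4 original features -> 2 principal components
    print(pca.explained_variance_ratio_)  # share of total variance captured by each component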
Backward Feature Elimination
The backward feature elimination technique is mainly used while developing Linear
Regression or Logistic Regression model. Below steps are performed in this technique to
reduce the dimensionality or in feature selection:
o In this technique, firstly, all the n variables of the given dataset are taken to train the
model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n
times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance
of the model, and then we will drop that variable or features; after that, we will be left
with n-1 features.
o Repeat the complete process until no feature can be dropped.
In this technique, by selecting the optimum performance of the model and the maximum
tolerable error rate, we can define the optimal number of features required for the machine
learning algorithm.
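The loop described above can be run by hand, but scikit-learn's SequentialFeatureSelector with direction='backward' implements essentially the same idea; a sketch in which the estimator and the target number of features are arbitrary choices:

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SequentialFeatureSelector
    from sklearn.linear_model import LogisticRegression

    X, y = load_iris(return_X_y=True)

    # Repeatedly drop the feature whose removal hurts cross-validated performance the least
    sfs = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=2,
        direction="backward",
    ).fit(X, y)
    print(sfs.get_support())   # mask of the features that were kept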
Forward Feature Selection
Forward feature selection follows the inverse process of the backward elimination process. It
means, in this technique, we don't eliminate the feature; instead, we will find the best
features that can produce the highest increase in the performance of the model. Below steps
are performed in this technique:
o We start with a single feature only, and progressively we will add each feature at a
time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process will be repeated until we get a significant increase in the performance of
the model.
Missing Value Ratio
If a dataset has too many missing values, then we drop those variables as they do not carry
much useful information. To perform this, we can set a threshold level, and if a variable has
missing values more than that threshold, we will drop that variable. The higher the threshold
value, the more efficient the reduction.
Low Variance Filter
Similar to the missing value ratio technique, data columns with only small changes in the data
carry less information. Therefore, we need to calculate the variance of each variable, and all
data columns with variance lower than a given threshold are dropped, because low-variance
features will not affect the target variable.
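Both of these filters reduce to a one-line threshold check; a small pandas/scikit-learn sketch in which the example columns, the 20% missing-value threshold and the 0.1 variance threshold are all arbitrary:

    import pandas as pd
    from sklearn.feature_selection import VarianceThreshold

    df = pd.DataFrame({
        "age":            [22, 35, 41, 29],
        "constant":       [1, 1, 1, 1],
        "mostly_missing": [3.0, None, None, None],
    })

    # Missing value ratio: drop columns whose fraction of missing values exceeds 20%
    df = df.loc[:, df.isna().mean() <= 0.2]

    # Low variance filter: drop near-constant columns
    reduced = VarianceThreshold(threshold=0.1).fit_transform(df)
    print(reduced)   # only the 'age' column survives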
High Correlation Filter
High Correlation refers to the case when two variables carry approximately similar
information. Due to this factor, the performance of the model can be degraded. This
correlation between the independent numerical variable gives the calculated value of the
correlation coefficient. If this value is higher than the threshold value, we can remove one of
the variables from the dataset. We can consider those variables or features that show a high
correlation with the target variable.
Random Forest
Random Forest is a popular and very useful feature selection algorithm in machine learning.
This algorithm contains an in-built feature importance package, so we do not need to program
it separately. In this technique, we need to generate a large set of trees against the target
variable, and with the help of usage statistics of each attribute, we need to find the subset of
features.
The random forest algorithm takes only numerical variables, so we need to convert the input
data into numeric data using one-hot encoding.
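A sketch of using the built-in feature importances of a random forest for selection; the 0.1 importance threshold is arbitrary, and categorical inputs would first need one-hot encoding as noted above:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier

    X, y = load_iris(return_X_y=True)

    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    print(forest.feature_importances_)                  # usage statistics per feature
    selected = np.flatnonzero(forest.feature_importances_ > 0.1)
    print(selected)                                     # indices of the retained features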
Factor Analysis
Factor analysis is a technique in which each variable is kept within a group according to its
correlation with other variables; this means variables within a group can have a high correlation
between themselves, but a low correlation with variables of other groups.
We can understand it with an example: suppose we have two variables, income and spend.
These two variables have a high correlation, which means people with high income spend
more, and vice versa. So, such variables are put into a group, and that group is known as
the factor. The number of these factors will be reduced as compared to the original dimension
of the dataset.
Auto-encoders
An auto-encoder is a neural network used for unsupervised dimensionality reduction; it consists of two parts:
Encoder: The function of the encoder is to compress the input to form the latent-space
representation.
Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
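A minimal PyTorch sketch of the two parts; the layer sizes are arbitrary and training code is omitted:

    import torch.nn as nn

    class AutoEncoder(nn.Module):
        def __init__(self, n_inputs=784, n_latent=32):
            super().__init__()
            # Encoder: compress the input down to the latent-space representation
            self.encoder = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU(),
                                         nn.Linear(128, n_latent))
            # Decoder: reconstruct the input from the latent-space representation
            self.decoder = nn.Sequential(nn.Linear(n_latent, 128), nn.ReLU(),
                                         nn.Linear(128, n_inputs))

        def forward(self, x):
            return self.decoder(self.encoder(x))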
Unit-05
Machine learning algorithm analytics
In this blog, we will discuss the various ways to check the performance of our machine
learning or deep learning model and why to use one in place of the other. We will discuss
terms like:
1. Confusion matrix
2. Accuracy
3. Precision
4. Recall
5. Specificity
6. F1 score
7. Precision-Recall or PR curve
8. ROC (Receiver Operating Characteristic) curve
9. PR vs ROC curve.
For simplicity, we will mostly discuss things in terms of a binary classification problem where,
let's say, we have to find whether an image is of a cat or a dog, or whether a patient has cancer
(positive) or is found healthy (negative). Some common terms to be clear with are: true positive
(TP, a positive instance correctly predicted positive), true negative (TN, a negative instance
correctly predicted negative), false positive (FP, a negative instance wrongly predicted positive)
and false negative (FN, a positive instance wrongly predicted negative).
Confusion matrix
It’s just a representation of the above parameters in a matrix format. Better visualization is
always good :)
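A small sketch of computing the matrix with scikit-learn; the label vectors are made up:

    from sklearn.metrics import confusion_matrix

    y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = has cancer, 0 = healthy
    y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

    # Rows are actual classes, columns are predicted classes:
    # [[TN, FP],
    #  [FN, TP]]
    print(confusion_matrix(y_true, y_pred))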
Accuracy
The most commonly used metric to judge a model, and yet it is actually not a clear indicator of
performance; the problem is worst when the classes are imbalanced.
Take for example a cancer detection model. The chances of actually having cancer are very
low. Let’s say out of 100, 90 of the patients don’t have cancer and the remaining 10 actually
have it. We don’t want to miss on a patient who is having cancer but goes undetected (false
negative). Detecting everyone as not having cancer gives an accuracy of 90% straight. The
model did nothing here but just gave cancer free for all the 100 predictions.
Precision
Percentage of positive instances out of the total predicted positive instances. Here
denominator is the model prediction done as positive from the whole given dataset. Take it
as to find out ‘how much the model is right when it says it is right’.
Recall
Percentage of positive instances out of the total actual positive instances. Therefore, the
denominator (TP + FN) here is the actual number of positive instances present in the dataset.
Take it as to find out ‘how much extra right ones, the model missed when it showed the right
ones’.
Specificity
Percentage of negative instances out of the total actual negative instances. Therefore
denominator (TN + FP) here is the actual number of negative instances present in the
dataset. It is similar to recall but the shift is on the negative instances. Like finding out how
many healthy patients who did not have cancer were told they don't have cancer. It is kind of
the opposite of recall.
F1 score
It is the harmonic mean of precision and recall. This takes the contribution of both, so the higher
the F1 score, the better. See that due to the product in the numerator if one goes low, the
final F1 score goes down significantly. So a model does well on the F1 score if the predicted
positives are actually positives (precision) and it doesn't miss out on positives, predicting them
as positive (recall).
One drawback is that both precision and recall are given equal importance, due to which,
according to our application we may need one higher than the other and F1 score may not
be the exact metric for it. Therefore either weighted-F1 score or seeing the PR or ROC curve
can help.
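The formulas above, written out once; the counts are taken from an assumed confusion matrix:

    def classification_metrics(tp, fp, tn, fn):
        precision   = tp / (tp + fp)            # right when it says "positive"
        recall      = tp / (tp + fn)            # how many actual positives were found
        specificity = tn / (tn + fp)            # how many actual negatives were found
        f1 = 2 * precision * recall / (precision + recall)   # harmonic mean
        return precision, recall, specificity, f1

    print(classification_metrics(tp=3, fp=1, tn=3, fn=1))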
PR curve
It is the curve between precision and recall for various threshold values. In the figure below
we have 6 predictors showing their respective precision-recall curve for various threshold
values. The top right part of the graph is the ideal space where we get high precision and
recall. Based on our application we can choose the predictor and the threshold value. PR AUC
is just the area under the curve. The higher its numerical value the better.
ROC curve
ROC stands for receiver operating characteristic and the graph is plotted against TPR and FPR
for various threshold values. As TPR increases FPR also increases. As you can see in the first
figure, we have four categories and we want the threshold value that leads us closer to the
top left corner. Comparing different predictors (here 3) on a given dataset also becomes easy
as you can see in figure 2, one can choose the threshold according to the application at hand.
ROC AUC is just the area under the curve, the higher its numerical value the better.
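A sketch of producing both curves from predicted scores with scikit-learn; the scores and labels are invented:

    from sklearn.metrics import precision_recall_curve, roc_curve, auc

    y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
    y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]   # model-predicted probabilities

    precision, recall, pr_thresholds = precision_recall_curve(y_true, y_scores)
    fpr, tpr, roc_thresholds = roc_curve(y_true, y_scores)

    print("PR  AUC:", auc(recall, precision))   # area under the precision-recall curve
    print("ROC AUC:", auc(fpr, tpr))            # area under the ROC curve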
PR vs ROC curve
Due to the absence of TN in the precision-recall equation, PR curves are useful for imbalanced
classes. In the case of class imbalance, when there is a majority negative class, the metric
doesn't take much into consideration the high number of TRUE NEGATIVES of the majority
negative class, giving better resistance to the imbalance. This is the case, for example, when
detecting cancer patients, which has a high class imbalance because very few have it out of all
the diagnosed; we certainly don't want to miss a person having cancer and going undetected.
Due to the consideration of TN or the negative class in the ROC equation, it is useful when
both the classes are important to us, like the detection of cats and dogs. The consideration of
true negatives makes sure that both the classes are given equal importance.
Ensemble Methods, what are they? Ensemble methods are a machine learning technique that
combines several base models in order to produce one optimal predictive model. To better
understand this definition, let's take a step back into the ultimate goal of machine learning and
model building. This is going to make more sense as I dive into specific examples and explain
why Ensemble Methods are used.
I will largely utilize Decision Trees to outline the definition and practicality of Ensemble
Methods (however it is important to note that Ensemble Methods do not only pertain to
Decision Trees).
A Decision Tree determines the predictive value based on series of questions and conditions.
For instance, this simple Decision Tree determining on whether an individual should play
outside or not. The tree takes several weather factors into account, and given each factor
either makes a decision or asks another question. In this example, every time it is overcast,
we will play outside. However, if it is raining, we must ask if it is windy or not? If windy, we
will not play. But given no wind, tie those shoelaces tight because we're going outside to play.
Decision Trees can also solve quantitative problems with the same format. In the tree to the
left, we want to know whether or not to invest in a commercial real estate property. Are there
poor economic conditions? How much will the investment return? These questions are
answered through the structure of the tree to arrive at a predicted value.
When making Decision Trees, there are several factors we must take into consideration: On
what features do we make our decisions on? What is the threshold for classifying each
question into a yes or no answer? In the first Decision Tree, what if we wanted to ask
ourselves whether we have friends to play with or not? If we have friends, we will play every
time. If not, we might continue to ask ourselves questions about the weather. By adding an
additional question, we change the structure and the predictions of the tree.
This is where Ensemble Methods come in handy! Rather than just relying on one Decision
Tree and hoping we made the right decision at each split, Ensemble Methods allow us to take
a sample of Decision Trees into account, calculate which features to use or questions to ask
at each split, and make a final predictor based on the aggregated results of the sampled
Decision Trees.
Types of Ensemble Methods
1. BAGGing, or Bootstrap AGGregating. Given a dataset, bootstrapped subsamples are
pulled. A Decision Tree is formed on each
bootstrapped sample. The results of each tree are aggregated to yield the strongest, most
accurate predictor.
2. Random Forest Models. Random Forest Models can be thought of as BAGGing, with a
slight tweak. When deciding where to split and how to make decisions, BAGGed Decision
Trees have the full disposal of features to choose from. Therefore, although the
bootstrapped samples may be slightly different, the data is largely going to break off at the
same features throughout each model. On the contrary, Random Forest models decide where to
split based on a random selection of features. Rather than splitting at similar features at each
node throughout, Random Forest models implement a level of differentiation because each
tree will split based on different features. This level of differentiation provides a greater
ensemble to aggregate over, ergo producing a more accurate predictor. Refer to the image: as
in BAGGing, bootstrapped subsamples are pulled from a larger dataset and a decision tree is
formed on each subsample. HOWEVER, each decision tree is split on a different subset of features.
3. Boosting is an ensemble learning method that combines a set of weak learners into a
strong learner to minimize training errors. In boosting, a random sample of data is selected,
fitted with a model and then trained sequentially—that is, each model tries to compensate
for the weaknesses of its predecessor. With each iteration, the weak rules from each
individual classifier are combined to form one, strong prediction rule.
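A compact scikit-learn sketch of the three ideas above: BAGGing of decision trees, a Random Forest that additionally randomizes the features considered at each split, and boosting that fits weak learners sequentially. The dataset and hyperparameters are arbitrary:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # 1. BAGGing: each tree sees a bootstrapped subsample; predictions are aggregated
    bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=0)

    # 2. Random Forest: bagging plus a random subset of features at every split
    forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

    # 3. Boosting: weak learners are trained sequentially, each focusing on the examples
    #    its predecessors got wrong, then combined into one strong prediction rule
    boosting = AdaBoostClassifier(n_estimators=100, random_state=0)

    for model in (bagging, forest, boosting):
        print(type(model).__name__, model.fit(X, y).score(X, y))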
A Generative Model is a powerful way of learning any kind of data distribution using
unsupervised learning and it has achieved tremendous success in just few years. All types of
generative models aim at learning the true data distribution of the training set so as to
generate new data points with some variations. But it is not always possible to learn the
exact distribution of our data either implicitly or explicitly and so we try to model a
distribution which is as similar as possible to the true data distribution. For this, we can
leverage the power of neural networks to learn a function which can approximate the model
distribution to the true distribution. Two of the most commonly used approaches are
Variational Autoencoders (VAE) and Generative Adversarial Networks (GAN). VAE aims at
maximizing the lower bound of the data log-likelihood and GAN aims at achieving an
equilibrium between Generator and Discriminator. In this blogpost, I will be explaining the
working of VAE and GANs and the intuition behind them.
Variational Autoencoder
I am assuming that the reader is already familiar with the working of a vanilla autoencoder.
We know that we can use an autoencoder to encode an input image to a much smaller
dimensional representation which can store latent information about the input data
distribution. But in a vanilla autoencoder, the encoded vector can only be mapped to the
corresponding input using a decoder. It certainly can't be used to generate similar images with
some variability.
To achieve this, the model needs to learn the probability distribution of the training data.
VAE is one of the most popular approaches for learning a complicated data distribution, such
as that of images, using neural networks in an unsupervised fashion. It is a probabilistic model
rooted in Bayesian inference, i.e., the model aims to learn the underlying probability
distribution of the training data so that it could easily sample new data from that learned
distribution. The idea is to learn a low-dimensional latent representation of the training data
called latent variables (variables which are not directly observed but are rather inferred
through a mathematical model) which we assume to have generated our actual training
data. These latent variables can store useful information about the type of output the model
needs to generate. The probability distribution of the latent variables z is denoted by P(z), and a
Gaussian distribution is selected as a prior to learn the distribution P(z) so as to easily sample
new data points at inference time. The primary objective is then to model the data with
parameters that maximize the likelihood of the training data X. In short, we are assuming that
a low-dimensional latent vector
has generated our data x (x ∈ X) and we can map this latent vector to data x using a
deterministic function f(z;θ) parameterized by theta which we need to evaluate (see fig.
1[1]). Under this generative process, our aim is to maximize the probability of each data point
in X, which is given by P(X) = ∫ P(X|z; θ) P(z) dz (Equation 1).
The intuition behind this maximum likelihood estimation is that if the model can generate
training samples from these latent variables then it can also generate similar samples with
some variations. In other words, if we sample a large number of latent variables from P(z)
and generate x from these variables then the generated x should match the data distribution
Pdata(x). Now we have two questions which we need to answer. How to capture the
distribution of latent variables and how to integrate Equation 1 over all the dimensions of z?
Obviously it is a tedious task to manually specify the relevant information we would like to
encode in latent vector to generate the output image. Rather we rely on neural networks to
compute z, just with an assumption that this latent vector can be well approximated by a
normal distribution so that it is easy to sample at inference time. Any distribution in d
dimensions can be generated by mapping a set of d normally distributed variables through a
sufficiently complicated function, and the inverse of this function can be used to learn the
latent variables themselves.
In equation 1, integration is carried over all the dimensions of z and is therefore intractable;
we need a way of estimating P(X) without explicitly computing the integral in equation 1. The
idea of VAE is to infer P(z) using P(z|X), which we don't know. We infer
P(z|X) using a method called variational inference which is basically an optimization problem
in Bayesian statistics. We first model P(z|X) using simpler distribution Q(z|X) which is easy to
find and we try to minimize the difference between P(z|X) and Q(z|X) using KL-divergence
metric approach so that our hypothesis is close to the true distribution. This is followed by a
lot of mathematical equations which I will not be explaining here but you can find it in the
original paper. But I must say that those equations are not very difficult to understand once
you grasp the intuition behind VAE. The resulting (variational lower bound) equation is:
log P(X) − D_KL[Q(z|X) || P(z|X)] = E[log P(X|z)] − D_KL[Q(z|X) || P(z)]
The above equation has a very nice interpretation. The term Q(z|X) is basically our encoder
net, z is our encoded representation of data x(x ∈ X) and P(X|z) is our decoder net. So in the
above equation our goal is to maximize the log-likelihood of our data distribution under
some error given by D_KL[Q(z|X) || P(z|X)]. It can easily be seen that VAE is trying to maximize
the lower bound of log(P(X)), since P(z|X) is not tractable but the KL-divergence term is >= 0.
This is the same as maximizing E[log P(X|z)] and minimizing D_KL[Q(z|X) || P(z)]. We know that
maximizing E[log P(X|z)] is a maximum likelihood estimation, and it is modelled using a decoder
net. As I said earlier, we want our latent representation to be close to Gaussian, and hence we
assume P(z) to be N(0, 1). Following this assumption, Q(z|X) should also be close to this
distribution. If we assume that it is a Gaussian with parameters μ(X) and Σ(X), the error due to
the difference between these two distributions, i.e., P(z) and Q(z|X), is given by the
KL-divergence in closed form.
Considering we are optimizing the lower variational bound, our optimization function is:
E[log P(X|z)] − D_KL[Q(z|X) || N(0, 1)]
Hence, our loss function will contain two terms. First one is reconstruction loss of the input
to output and the second loss is KL-divergence term. Now we can train the network using
backpropagation algorithm. But there is a problem: the first term doesn't only depend on the
parameters of P but also on the parameters of Q, yet this dependency doesn't appear in the
above equation. So how do we backpropagate through the layer where we sample z randomly
from the distribution Q(z|X), i.e., N(μ(X), Σ(X)), so that P can decode?
Gradients can’t flow through random nodes. We use reparameterization trick (see fig) to
make the network differentiable. We sample from N(μ(X), Σ(X)) by first sampling ε ∼ N(0, I),
This has been very beautifully shown in figure 2 [1]. It should be noted that the feedforward
step is identical for both of these networks (left & right), but gradients can only backpropagate
through the right network, where the reparameterization trick is used.
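A minimal PyTorch sketch of the reparameterization trick described above; mu and log_var would normally come from the encoder network and are just placeholders here:

    import torch

    def reparameterize(mu, log_var):
        """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, I).

        Because the randomness is isolated in eps, gradients can flow back
        through mu and log_var to the encoder parameters."""
        std = torch.exp(0.5 * log_var)
        eps = torch.randn_like(std)
        return mu + std * eps

    mu = torch.zeros(4, requires_grad=True)       # hypothetical encoder outputs
    log_var = torch.zeros(4, requires_grad=True)
    z = reparameterize(mu, log_var)               # differentiable sample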
At inference time, we can simply sample z from N(0, 1) and feed it to decoder net to
generate new data point. Since we are optimizing the lower variational bound, the quality of
the generated image is somewhat poor as compared to state-of-the-art techniques like
Generative Adversarial Networks.
The best thing about VAE is that it learns both a generative model and an inference model.
Although both VAE and GANs are very exciting approaches to learn the underlying data
distribution using unsupervised learning but GANs yield better results as compared to VAE. In
VAE, we optimize the lower variational bound whereas in GAN, there is no such assumption.
In fact, GANs don’t deal with any explicit probability density estimation. The failure of VAE in
generating sharp images implies that the model is not able to learn the true posterior
distribution. VAE and GAN mainly differ in the way of training. Let’s now dive into Generative
Adversarial Networks.
Yann LeCun says that adversarial training is the coolest thing since sliced bread. Seeing the
popularity of Generative Adversarial Networks and the quality of the results they produce, I
think most of us would agree with him. Adversarial training has completely changed the way
we teach the neural networks to do a specific task. Generative Adversarial Networks don’t
work with any explicit density estimation like Variational Autoencoders. Instead, it is based
on game theory approach with an objective to find Nash equilibrium between the two
networks, Generator and Discriminator. The idea is to sample from a simple distribution like
Gaussian and then learn to transform this noise to the data distribution using universal
function approximators such as neural networks.
This is achieved by adversarial training of these two networks. A generator model G learns to
capture the data distribution and a discriminator model D estimates the probability that a
sample came from the data distribution rather than model distribution. Basically the task of
the Generator is to generate natural looking images and the task of the Discriminator is to
decide whether the image is fake or real. This can be thought of as a mini-max two player
game where the performance of both the networks improves over time. In this game, the
generator tries to fool the discriminator by generating images that look as real as possible, and
the discriminator tries not to get fooled by the generator by improving its discriminative
capabilities.
We define a prior on input noise variables P(z), and the generator then maps this to the data
distribution using a complex differentiable function with parameters θg. In addition to this,
we have another network called the Discriminator, which takes an input x and, using another
differentiable function with parameters θd, outputs a single scalar value denoting the
probability that x comes from the true data distribution Pdata(x). The objective function of
this minimax game is:
min_G max_D V(D, G) = E_{x ∼ Pdata(x)}[log D(x)] + E_{z ∼ P(z)}[log(1 − D(G(z)))]
In the above equation, if the input to the Discriminator comes from true data distribution
then D(x) should output 1 to maximize the above objective function w.r.t D whereas if the
image has been generated from the Generator then D(G(z)) should output 1 to minimize the
objective function w.r.t G. The latter basically implies that G should generate such realistic
images which can fool D. We maximize the above function w.r.t parameters of Discriminator
using Gradient Ascent and minimize the same w.r.t parameters of Generator using Gradient
Descent. But there is a problem in optimizing generator objective. At the start of the game
when the generator hasn’t learned anything, the gradient is usually very small and when it is
doing very well, the gradients are very high (see Fig. 4). But we want the opposite behaviour.
Fig.4. Cost for the Generator as a function of Discriminator response on the generated
image
Therefore, instead of minimizing log(1 − D(G(z))) for the generator, we maximize log D(G(z)).
Training alternates between a few steps of optimizing D and one step of optimizing G on the
mini-batch. The process of training stops when the Discriminator is unable to distinguish ρg
and ρdata, i.e., D(x, θd) = ½, or when ρg = ρdata.
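A compressed PyTorch sketch of one training step of the minimax game described above. The tiny fully connected networks and the data are placeholders, and the generator uses the non-saturating loss (maximizing log D(G(z))) for the gradient reasons just mentioned, not the raw minimax term:

    import torch
    import torch.nn as nn

    G = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))                # generator
    D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())   # discriminator
    opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
    opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
    bce = nn.BCELoss()

    real = torch.randn(64, 2) + 3.0          # stand-in for samples from Pdata(x)
    noise = torch.randn(64, 16)              # z ~ P(z)

    # Discriminator step: push D(real) toward 1 and D(G(z)) toward 0
    fake = G(noise)
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: push D(G(z)) toward 1 so that the generator fools the discriminator
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()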
One of the earliest GAN models employing a Convolutional Neural Network
was DCGAN, which stands for Deep Convolutional Generative Adversarial Networks. This
network takes as input 100 random numbers drawn from a uniform distribution and outputs
an image of desired shape. The network consists of many convolutional, deconvolutional and
fully connected layers. The network uses many deconvolutional layers to map the input noise
to the desired output image. Batch Normalization is used to stabilize the training of the
network. ReLU activation is used in generator for all layers except the output layer which
uses tanh layer and Leaky ReLU is used for all layers in the Discriminator. This network was
trained using mini-batch stochastic gradient descent and Adam optimizer was used to
accelerate training with tuned hyperparameters. The results of the paper were quite
interesting. The authors showed that the generators have interesting vector arithmetic
properties, with which generated images can be manipulated in semantically meaningful ways.
One of the most widely used variations of GANs is the conditional GAN, which is constructed by
simply adding a conditional vector along with the noise vector (see Fig. 7). Prior to cGAN, we
were generating images randomly from random samples of noise z. What if we want to
generate an image with some desired features. Is there any way to provide this extra
information to the model anyhow about what type of image we want to generate? The
answer is yes, and Conditional GAN is the way to do that. By conditioning the model on
additional information, it is possible to direct the data generation process. Conditional GANs
are used in a variety of tasks such as
text to image generation, image to image translation, automated image tagging etc. A unified
structure of both the networks has been shown in the diagram below.
One of the cool things about GANs is that they can be trained even with small training data.
Indeed the results of GANs are promising but the training procedure is not trivial especially
setting up the hyperparameters of the network. Moreover, GANs are difficult to optimize as
they don’t converge easily. Of course there are some tips and tricks to hack GANs but they
may not always help. You can find some of these tips here. Also, we don't have any criteria
for the quantitative evaluation of the results except to check whether the generated images
look perceptually realistic or not.
A deep Boltzmann machine is a model with more hidden layers with directionless
connections between the nodes as shown in Fig. 7.7. DBM learns the features hierarchically
from the raw data and the features extracted in one layer are applied as hidden variables as
input to the subsequent layer.
Application of autoencoders
So far we have seen a variety of autoencoders and each of them is good at a specific task.
Let’s find out some of the tasks they can do
Data Compression
Although autoencoders are designed for data compression yet they are hardly used for this
purpose in practical situations. The reasons are:
Lossy compression: The output of the autoencoder is not exactly the same as
the input, it is a close but degraded representation. For lossless compression,
they are not the way to go.
Data-specific: Autoencoders are only able to meaningfully compress data
similar to what they have been trained on. Since they learn features specific
for the given training data, they are different from a standard data
compression algorithm like jpeg or gzip. Hence, we can’t expect an
autoencoder trained on handwritten digits to compress landscape photos.
Since we have more efficient and simpler algorithms like JPEG, LZMA and LZSS (used in WinRAR
in tandem with Huffman coding), autoencoders are not generally used for compression. That
said, autoencoders have seen use for image denoising and dimensionality reduction in recent
years.
Image Denoising
Autoencoders are very good at denoising images. When an image gets corrupted or there is
a bit of noise in it, we call this image a noisy image.
To obtain proper information about the content of the image, we perform image denoising.
Dimensionality Reduction
The autoencoders convert the input into a reduced representation which is stored in the
middle layer called code. This is where the information from the input has been compressed
and by extracting this layer from the model, each node can now be treated as a variable.
Thus we can conclude that by trashing out the decoder part, an autoencoder can be used
for dimensionality reduction with the output being the code layer.
Feature Extraction
Encoding part of Autoencoders helps to learn important hidden features present in the
input data, in the process to reduce the reconstruction error. During encoding, a new set of
combinations of original features is generated.
Image Generation
One of the applications of autoencoders is to convert a black and white picture into a
coloured image. Or we can convert a coloured image into a grayscale image.
***