ML and AI: Unit 04 and Unit 05

Unit-04

Distance-based machine learning algorithms:


Distance-based algorithms are machine learning algorithms that classify queries by
computing distances between these queries and a number of internally stored exemplars.
Exemplars that are closest to the query have the largest influence on the classification
assigned to the query. Two specific distance-based algorithms, the nearest neighbor
algorithm and the nearest-hyperrectangle algorithm, are studied in detail.

It is shown that the k-nearest neighbor algorithm (kNN) outperforms the first nearest
neighbor algorithm only under certain conditions. Data sets must contain moderate
amounts of noise. Training examples from the different classes must belong to clusters that
allow an increase in the value of k without reaching into clusters of other classes. Methods
for choosing the value of k for kNN are investigated. It is shown that one-fold cross-validation
on a restricted number of values for k suffices for best performance. It is also shown that for
best performance the votes of the k-nearest neighbors of a query should be weighted in
inverse proportion to their distances from the query.

Principal component analysis is shown to reduce the number of relevant dimensions


substantially in several domains. Two methods for learning feature weights for a weighted
Euclidean distance metric are proposed. These methods improve the performance of kNN
and NN in a variety of domains.

The nearest-hyperrectangle algorithm (NGE) is found to give predictions that are


substantially inferior to those given by kNN in a variety of domains. Experiments performed
to understand this inferior performance led to the discovery of several improvements to
NGE. Foremost of these is BNGE, a batch algorithm that avoids construction of overlapping
hyperrectangles from different classes. Although it is generally superior to NGE, BNGE is still
significantly inferior to kNN in a variety of domains. Hence, a hybrid algorithm (KBNGE), that
uses BNGE in parts of the input space that can be represented by a single hyperrectangle
and kNN otherwise, is introduced.

The primary contributions of this dissertation are (a) several improvements to existing
distance-based algorithms, (b) several new distance-based algorithms, and (c) an
experimentally supported understanding of the conditions under which various distance-
based algorithms are likely to give good performance.

K-Nearest Neighbors
The K-Nearest Neighbors algorithm is a supervised machine learning algorithm for labeling
an unknown data point given existing labeled data.

The nearness of points is typically determined by using distance algorithms such as the
Euclidean distance formula based on parameters of the data. The algorithm will classify a
point based on the labels of the K nearest neighbor points, where the value of K can be
specified.
KNN of Unknown Data Point
To classify the unknown data point using the KNN (K-Nearest Neighbor) algorithm:

 Normalize the numeric data


 Find the distance between the unknown data point and all training data points
 Sort the distance and find the nearest k data points
 Classify the unknown data point based on the most instances of nearest k points
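The four steps above can be sketched directly in Python. This is a minimal illustration, not an optimized implementation; the array names (X_train, y_train, query) and the toy data are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    # Assumes the numeric data has already been normalized to a common scale.
    # Step 1: distance between the unknown point and all training points.
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Step 2: sort the distances and keep the indices of the k nearest points.
    nearest = np.argsort(distances)[:k]
    # Step 3: classify by the most frequent label among the k nearest neighbours.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative dataset: two features, two classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.8], [8.0, 8.0], [7.5, 8.2]])
y_train = np.array(["A", "A", "B", "B"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "A"
```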

Normalizing Data
Normalization is a process of converting the numeric columns in the dataset to a common
scale while retaining the underlying differences in the range of values.

For example, Min-max normalization converts each value of the numeric column to a value
between 0 and 1 using the formula Normalized value = (NumericValue - MinValue) /
(MaxValue - MinValue). A downside of Min-max Normalization is that it does not handle
outliers very well.
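A quick sketch of min-max normalization as described by the formula above; the column values are placeholders.

```python
import numpy as np

def min_max_normalize(column):
    # Normalized value = (NumericValue - MinValue) / (MaxValue - MinValue), mapping into [0, 1].
    col = np.asarray(column, dtype=float)
    return (col - col.min()) / (col.max() - col.min())

ages = [18, 25, 40, 60]
print(min_max_normalize(ages))  # [0.  0.1667  0.5238  1.]
```

Note that a single extreme outlier would compress all other values towards 0, which is the weakness mentioned above.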
Regression in KNN Algorithm
K-Nearest Neighbor algorithm uses ‘feature similarity’ to predict values of any new data
points. This means that the new point is assigned a value based on how closely it resembles
the points in the training set. During regression, the average of the values of the k nearest
neighbours is taken as the final prediction, whereas during classification the mode of those
values is taken as the final prediction.

Decision Tree Classification Algorithm

o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two types of nodes: the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question, and based on the answer (Yes/No), it further
splits the tree into subtrees.
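A minimal sketch of training a CART-style decision tree with scikit-learn's DecisionTreeClassifier; the iris dataset and hyperparameters are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Internal (decision) nodes test feature values; leaf nodes hold the predicted class.
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```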

Support Vector Machine Algorithm

Support Vector Machine or SVM is one of the most popular Supervised Learning algorithms,
which is used for Classification as well as Regression problems. However, primarily, it is used
for Classification problems in Machine Learning.

The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme
cases are called support vectors, and hence the algorithm is termed a Support Vector
Machine. Consider the below diagram in which there are two different categories that are
classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we have used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs. If we want a model that
can accurately identify whether it is a cat or a dog, such a model can be created by using the
SVM algorithm. We will first train our model with lots of images of cats and dogs so that it can
learn about the different features of cats and dogs, and then we test it with this strange creature.
The SVM creates a decision boundary between the two classes (cat and dog) and chooses the
extreme cases (support vectors) of each class; on the basis of these support vectors, it will
classify the new creature as a cat. Consider the below diagram.

SVM algorithm can be used for Face detection, image classification, text categorization, etc.
Types of SVM
SVM can be of two types:

o Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset
can be classified into two classes by using a single straight line, then such data is
termed linearly separable data, and the classifier used is called a Linear SVM classifier.
o Non-linear SVM: Non-Linear SVM is used for non-linearly separable data, which
means if a dataset cannot be classified by using a straight line, then such data is
termed non-linear data and the classifier used is called a Non-linear SVM classifier.

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in
n-dimensional space, but we need to find out the best decision boundary that helps to classify
the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which
means if there are 2 features (as shown in the image), then the hyperplane will be a straight
line. And if there are 3 features, then the hyperplane will be a 2-dimensional plane.

We always create a hyperplane that has a maximum margin, which means the maximum
distance between the hyperplane and the nearest data points of either class.

Support Vectors:

The data points or vectors that are closest to the hyperplane and which affect the position
of the hyperplane are termed Support Vectors. Since these vectors support the hyperplane,
they are called support vectors.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have
a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We
want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue.
Consider the below image:
Since it is a 2-d space, by just using a straight line we can easily separate these two classes.
But there can be multiple lines that can separate these classes. Consider the below image:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary
or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from
both the classes. These points are called support vectors. The distance between the vectors
and the hyperplane is called the margin. And the goal of SVM is to maximize this margin.
The hyperplane with maximum margin is called the optimal hyperplane.
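A minimal sketch of the linear SVM described above, using scikit-learn; the generated two-feature blobs stand in for the green/blue points (x1, x2).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two linearly separable clusters playing the role of the green/blue classes.
X, y = make_blobs(n_samples=100, centers=2, random_state=42)

clf = SVC(kernel="linear", C=1.0)   # finds the maximum-margin hyperplane
clf.fit(X, y)

print("number of support vectors:", len(clf.support_vectors_))
print("hyperplane coefficients:", clf.coef_, "intercept:", clf.intercept_)
```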

Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear
data, we cannot draw a single straight line. Consider the below image:
So to separate these data points, we need to add one more dimension. For linear data, we
have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It
can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below
image:

Since we are in 3-d space, it looks like a plane parallel to the x-axis. If we convert it back to
2-d space with z = 1, then it will become as:
Hence we get a circle of radius 1 in the case of non-linear data.

Non-linearity and Kernel methods:

Kernel methods are approaches for dealing with linearly inseparable data or non-linear data
sets like those presented in fig-1. The concept is to use a mapping function to project
nonlinear combinations of the original features onto a higher-dimensional space, where the
data becomes linearly separable. The two-dimensional dataset (X1, X2) is projected into a
new three-dimensional feature space (Z1, Z2, Z3) in the diagram above, where the classes
become separable.

To grasp it completely, Assume we have two vectors, x and x*, in a 2D space (illustrated in
fig-1) and want to find a linear classifier by performing a dot product between them.
Unfortunately, in our current 2D vector space, the data is not linearly separable. We can
address this challenge by mapping the two vectors to a 3D space.

x→ϕ(x)

x∗→ϕ(x*)

Where ϕ(x) and ϕ(x*) are 3D representations of x and x*, respectively. Now we can discover
our linear classifier in 3D space by replacing the 2D dot product with the dot product of ϕ(x)
and ϕ(x*), as below.

xᵀ x* → ϕ(x)ᵀ ϕ(x*)

A mapping function can be used to convert the training data into a higher-dimensional
feature space, and then a linear SVM model can be trained to classify the data in this new
feature space following the method outlined above. Using the mapping function, the new
data may then be fed into the model for categorization. However, this method is
computationally intensive. So, what is the solution?
The approach is to use a method to avoid explicitly mapping the input data to a high-
dimensional feature space in order to train linear learning algorithms to learn a nonlinear
function or decision boundary. This is known as a kernel trick. It should be noted that the
kernel trick is significantly more general than SVM.

The Kernel Trick


We’ve seen how higher-dimensional transformations can help us separate data so that
classification predictions can be made. It appears that we will have to operate on the higher
dimensional vectors in the modified feature space in order to train a support vector
classifier and maximize our objective function.

In real-world applications, data may contain numerous features, and transformations using
multiple polynomial combinations of these features will result in extremely large and
prohibitive processing costs.

Kernel Trick
This problem can be solved using the kernel trick. Instead of explicitly applying the
transformations (x) and representing the data by these transformed coordinates in the
higher dimensional feature space, kernel methods represent the data only through a set of
pairwise similarity comparisons between the original data observations x (with the original
coordinates in the lower dimensional space).

Our kernel function takes in lower-dimensional inputs and outputs the dot product of
converted vectors in higher-dimensional space. Other theorems guarantee that such kernel
functions exist under certain conditions.

If a function can be written as an inner product of a mapping function, we only need to know
that function, not the mapping function itself. Such a function is called a Kernel Function.

Types of Kernel Functions


The kernel function is a function that may be expressed as the dot product of the mapping
function (the kernel method) and looks like this:

K(xi, xj) = ϕ(xi) · ϕ(xj)

The kernel function simplifies the process of determining the mapping function. As a result,
the kernel function in the altered space specifies the inner product. Different types of kernel
functions are listed below. However, based on the requirement that the kernel function is
symmetric, one can create their own kernel functions.

Polynomial Kernel
The polynomial kernel is a kernel function that allows the learning of non-linear models by
representing the similarity of vectors (training samples) in a feature space over polynomials
of the original variables. It is often used with support vector machines (SVMs) and other
kernelized models.

F(x, xj) = (x · xj + 1)^d

Sigmoid Kernel
It is primarily used in neural networks. This kernel function is similar to the activation
function for neurons in a two-layer perceptron model of a neural network.

F(x, xj) = tanh(α x · xj + c)

Linear Kernel
It is the most fundamental sort of kernel and is usually one-dimensional in structure. When
there are numerous characteristics, it proves to be the best function. The linear kernel is
commonly used for text classification issues since most of these problems can be linearly
split. Other functions are slower than linear kernel functions.

F(x, xj) = x · xj (i.e. the sum of the products of the corresponding components)

Radial Basis Function (RBF) Kernel


The radial basis function kernel, often known as the RBF kernel, is a prominent kernel
function that is utilized in a variety of kernelized learning techniques. It is most typically
used in support vector machine classification. The RBF kernel is defined on two samples x
and x’, which are represented as feature vectors in some input space, as

F(x, x’) = exp(−||x − x’||² / (2σ²))

where σ is a free bandwidth parameter.
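A minimal sketch of the kernel trick in practice: scikit-learn's SVC with an RBF kernel separates concentric-circles data without ever constructing the higher-dimensional features explicitly. The dataset and parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Non-linearly separable data: one class inside a ring formed by the other class.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X, y)   # kernel trick, no explicit mapping

print("linear kernel accuracy:", linear_svm.score(X, y))  # poor on circular data
print("RBF kernel accuracy:   ", rbf_svm.score(X, y))     # close to 1.0
```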
Unsupervised Learning and Data Clustering

A task involving machine learning may not be linear, but it has a number of well known steps:

 Problem definition.

 Preparation of Data.

 Learn an underlying model.

 Improve the underlying model by quantitative and qualitative evaluations.

 Present the model.

One good way to come to terms with a new problem is to work through identifying and

defining the problem in the best possible way and learn a model that captures meaningful

information from the data. While problems in Pattern Recognition and Machine Learning can

be of various types, they can be broadly classified into three categories:

 Supervised Learning:
The system is presented with example inputs and their desired outputs, given
by a “teacher”, and the goal is to learn a general rule that maps inputs to
outputs.

 Unsupervised Learning:
No labels are given to the learning algorithm, leaving it on its own to find
structure in its input. Unsupervised learning can be a goal in itself
(discovering hidden patterns in data) or a means towards an end (feature
learning).

 Reinforcement Learning:
A system interacts with a dynamic environment in which it must perform a
certain goal (such as driving a vehicle or playing a game against an
opponent). The system is provided feedback in terms of rewards and
punishments as it navigates its problem space.

Between supervised and unsupervised learning is semi-supervised learning, where the

teacher gives an incomplete training signal: a training set with some (often many) of the

target outputs missing. We will focus on unsupervised learning and data clustering in this

blog post.

Unsupervised Learning

In some pattern recognition problems, the training data consists of a set of input vectors x

without any corresponding target values. The goal in such unsupervised learning problems

may be to discover groups of similar examples within the data, where it is called clustering,

or to determine how the data is distributed in the space, known as density estimation. To put
it in simpler terms, for an n-sample space x1 to xn, true class labels are not provided for each
sample, hence it is known as learning without a teacher.

Issues with Unsupervised Learning:

 Unsupervised Learning is harder as compared to Supervised Learning tasks.

 How do we know if results are meaningful since no answer labels are


available?

 Let the expert look at the results (external evaluation)

 Define an objective function on clustering (internal evaluation)

Why Unsupervised Learning is needed despite these issues


 Annotating large datasets is very costly and hence we can label only a few
examples manually. Example: Speech Recognition

 There may be cases where we don’t know how many or what classes the data is
divided into. Example: Data Mining

 We may want to use clustering to gain some insight into the structure of the
data before designing a classifier.

Unsupervised Learning can be further classified into two categories:

 Parametric Unsupervised Learning


In this case, we assume a parametric distribution of data. It assumes that
sample data comes from a population that follows a probability distribution
based on a fixed set of parameters. Theoretically, in a normal family of
distributions, all members have the same shape and are parameterized by
mean and standard deviation. That means if you know the mean and
standard deviation, and that the distribution is normal, you know the
probability of any future observation. Parametric Unsupervised Learning
involves construction of Gaussian Mixture Models and using Expectation-
Maximization algorithm to predict the class of the sample in question. This
case is much harder than the standard supervised learning because there are
no answer labels available and hence there is no correct measure of accuracy
available to check the result.

 Non-parametric Unsupervised Learning


In the non-parameterized version of unsupervised learning, the data is grouped
into clusters, where each cluster (hopefully) says something about categories
and classes present in the data. This method is commonly used to model and
analyze data with small sample sizes. Unlike parametric models,
nonparametric models do not require the modeler to make any assumptions
about the distribution of the population, and so are sometimes referred to as
a distribution-free method.
What is Clustering

Clustering can be considered the most important unsupervised learning problem; so, as

every other problem of this kind, it deals with finding a structure in a collection of unlabeled

data. A loose definition of clustering could be “the process of organizing objects into groups

whose members are similar in some way”. A cluster is therefore a collection of objects which

are “similar” between them and are “dissimilar” to the objects belonging to other clusters.

Distance-based clustering

Given a set of points, with a notion of distance between points, grouping the points into

some number of clusters, such that

 internal (intra-cluster) distances should be small, i.e. members of a cluster
are close/similar to each other.

 external (inter-cluster) distances should be large, i.e. members of different
clusters are dissimilar.

The Goals of Clustering

The goal of clustering is to determine the internal grouping in a set of unlabeled data. But

how to decide what constitutes a good clustering? It can be shown that there is no absolute

“best” criterion which would be independent of the final aim of the clustering. Consequently,
it is the user who should supply this criterion, in such a way that the result of the clustering

will suit their needs.

In the above image, how do we know what is the best clustering solution?

To find a particular clustering solution , we need to define the similarity measures for the

clusters.

Proximity Measures

For clustering, we need to define a proximity measure for two data points. Proximity here

means how similar/dissimilar the samples are with respect to each other.

 Similarity measure S(xi,xk): large if xi,xk are similar

 Dissimilarity(or distance) measure D(xi,xk): small if xi,xk are similar

There are various similarity measures which can be used.

 Vectors: Cosine Distance

 Sets: Jaccard Distance


 Points: Euclidean Distance (i.e. the Minkowski distance with q = 2)

A “good” proximity measure is VERY application dependent. The clusters should be invariant

under the transformations “natural” to the problem. Also, while clustering it is not advised to

normalize data that are drawn from multiple distributions.
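A short sketch of the three proximity measures mentioned above; the sample vectors and sets are placeholders.

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 4.0, 6.0])

# Euclidean distance (Minkowski distance with q = 2): small when points are close.
euclidean = np.linalg.norm(a - b)

# Cosine distance: 1 - cosine similarity, compares the direction of the vectors.
cosine = 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Jaccard distance between sets: 1 - |intersection| / |union|.
s1, s2 = {"apple", "banana", "cherry"}, {"banana", "cherry", "date"}
jaccard = 1 - len(s1 & s2) / len(s1 | s2)

print(euclidean, cosine, jaccard)  # ~3.742, 0.0 (same direction), 0.5
```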

Clustering Algorithms

Clustering algorithms may be classified as listed below:

 Exclusive Clustering

 Overlapping Clustering

 Hierarchical Clustering

 Probabilistic Clustering
In the first case data are grouped in an exclusive way, so that if a certain data point belongs

to a definite cluster then it could not be included in another cluster. A simple example of that

is shown in the figure below, where the separation of points is achieved by a straight line on

a bi-dimensional plane.

On the contrary, the second type, the overlapping clustering, uses fuzzy sets to cluster data,

so that each point may belong to two or more clusters with different degrees of

membership. In this case, data will be associated to an appropriate membership value.

A hierarchical clustering algorithm is based on the union between the two nearest clusters.

The beginning condition is realized by setting every data point as a cluster. After a few

iterations it reaches the final clusters wanted.

Finally, the last kind of clustering uses a completely probabilistic approach.

In this blog we will talk about four of the most used clustering algorithms:

 K-means

 Fuzzy K-means

 Hierarchical clustering

 Mixture of Gaussians
Each of these algorithms belongs to one of the clustering types listed above. While, K-means

is an exclusive clustering algorithm, Fuzzy K-means is an overlapping clustering algorithm,

Hierarchical clustering is obvious and lastly Mixture of Gaussians is a probabilistic

clustering algorithm. We will discuss about each clustering method in the following

paragraphs.

K-Means Clustering

K-means is one of the simplest unsupervised learning algorithms that solves the well known

clustering problem. The procedure follows a simple and easy way to classify a given data set

through a certain number of clusters (assume k clusters) fixed a priori. The main idea is to

define k centres, one for each cluster. These centroids should be placed in a smart way
because different locations cause different results. So, the better choice is to place them as
far away from each other as possible. The next step is to take each point belonging to a

given data set and associate it to the nearest centroid. When no point is pending, the first
step is completed and an early grouping is done. At this point we need to re-calculate k new

centroids as barycenters of the clusters resulting from the previous step. After we have these

k new centroids, a new binding has to be done between the same data set points and the

nearest new centroid. A loop has been generated. As a result of this loop we may notice that

the k centroids change their location step by step until no more changes are done. In other

words centroids do not move any more.

Finally, this algorithm aims at minimizing an objective function, in this case a squared error
function. The objective function

J = Σj=1..k Σi=1..n ||xi(j) − cj||²

where ||xi(j) − cj||² is a chosen distance measure between a data point xi(j) and the cluster
centre cj, is an indicator of the distance of the n data points from their respective cluster
centres.

The algorithm is composed of the following steps:

 Let X = {x1,x2,x3,……..,xn} be the set of data points and V = {v1,v2,…….,vc} be


the set of centers.

 Randomly select ‘c’ cluster centers.

 Calculate the distance between each data point and cluster centers.

 Assign the data point to the cluster center whose distance from the cluster
center is minimum of all the cluster centers.

 Recalculate the new cluster center using:

vi = (1/ci) Σj=1..ci xj

where ‘ci’ represents the number of data points in the ith cluster.

 Recalculate the distance between each data point and new obtained cluster
centers.

 If no data point was reassigned then stop, otherwise repeat from step 3).
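The steps above map almost line for line onto a small NumPy implementation. This is a bare-bones sketch (fixed initialization, no handling of the empty-cluster case discussed below), not production code.

```python
import numpy as np

def kmeans(X, c, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 2: randomly select 'c' cluster centers from the data points.
    centers = X[rng.choice(len(X), size=c, replace=False)]
    for _ in range(n_iter):
        # Steps 3-4: assign each point to the nearest cluster center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 5: recalculate each center as the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(c)])
        # Steps 6-7: stop once the centers (and hence the assignments) no longer change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
centers, labels = kmeans(X, c=2)
print(centers)
```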

Although it can be proved that the procedure will always terminate, the k-means algorithm

does not necessarily find the most optimal configuration, corresponding to the global

objective function minimum. The algorithm is also significantly sensitive to the initial

randomly selected cluster centres. The k-means algorithm can be run multiple times to

reduce this effect.


K-means is a simple algorithm that has been adapted to many problem domains. As we are

going to see, it is a good candidate for extension to work with fuzzy feature vectors.

The k-means procedure can be viewed as a greedy algorithm for partitioning the n samples

into k clusters so as to minimize the sum of the squared distances to the cluster centers. It

does have some weaknesses:

 The way to initialize the means was not specified. One popular way to start is
to randomly choose k of the samples.

 It can happen that the set of samples closest to mi is empty, so that mi


cannot be updated. This is a problem which needs to be handled during the
implementation, but is generally ignored.

 The results depend on the value of k and there is no optimal way to describe
a best “k”.

This last problem is particularly troublesome, since we often have no way of knowing how

many clusters exist. In the example shown above, the same algorithm applied to the same

data produces the following 3-means clustering. Is it better or worse than the 2-means

clustering?
Unfortunately there is no general theoretical solution to find the optimal number of clusters

for any given data set. A simple approach is to compare the results of multiple runs with

different k classes and choose the best one according to a given criterion, but we need to be

careful because increasing k results in smaller error function values by definition, but also

increases the risk of overfitting.

Fuzzy K-Means Clustering

In fuzzy clustering, each point has a probability of belonging to each cluster, rather than

completely belonging to just one cluster as it is the case in the traditional k-means. Fuzzy k-

means specifically tries to deal with the problem where points are somewhat in between

centers or otherwise ambiguous by replacing distance with probability, which of course could

be some function of distance, such as having probability relative to the inverse of the

distance. Fuzzy k-means uses a weighted centroid based on those probabilities. Processes of

initialization, iteration, and termination are the same as the ones used in k-means. The

resulting clusters are best analyzed as probabilistic distributions rather than a hard

assignment of labels. One should realize that k-means is a special case of fuzzy k-means when

the probability function used is simply 1 if the data point is closest to a centroid and 0

otherwise.

The fuzzy k-means algorithm is the following:

 Assume a fixed number of clusters K.


 Initialization: Randomly initialize the k-means μk associated with the clusters
and compute the probability that each data point Xi is a member of a given
cluster K,
P(PointXiHasLabelK|Xi,K).

 Iteration: Recalculate the centroid of the cluster as the weighted centroid
given the probabilities of membership of all data points Xi:

μk = Σi P(k|Xi) Xi / Σi P(k|Xi)

 Termination: Iterate until convergence or until a user-specified number of


iterations has been reached (the iteration may be trapped at some local
maxima or minima)

For a better understanding, we may consider this simple mono-dimensional example. Given a

certain data set, suppose to represent it as distributed on an axis. The figure below shows

this:

Looking at the picture, we may identify two clusters in proximity of the two data

concentrations. We will refer to them using ‘A’ and ‘B’. In the first approach shown in this

tutorial — the k-means algorithm — we associated each data point to a specific centroid;

therefore, this membership function looked like this:


In the Fuzzy k-means approach, instead, the same given data point does not belong

exclusively to a well defined cluster, but it can be placed in a middle way. In this case, the

membership function follows a smoother line to indicate that every data point may belong to

several clusters with different extent of membership.

In the figure above, the data point shown as a red marked spot belongs more to the B cluster

rather than the A cluster. The value 0.2 of ‘m’ indicates the degree of membership to A for

such data point.

Hierarchical Clustering Algorithms

Given a set of N items to be clustered, and an N*N distance (or similarity) matrix, the basic

process of hierarchical clustering is this:

 Start by assigning each item to a cluster, so that if you have N items, you now
have N clusters, each containing just one item. Let the distances (similarities)
between the clusters be the same as the distances (similarities) between the
items they contain.

 Find the closest (most similar) pair of clusters and merge them into a single
cluster, so that now you have one cluster less.

 Compute distances (similarities) between the new cluster and each of the old
clusters.
 Repeat steps 2 and 3 until all items are clustered into a single cluster of size
N.
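A minimal sketch of this bottom-up (agglomerative) procedure with SciPy; the toy data and the choice of 'single' linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two obvious groups of points.
X = np.array([[1, 1], [1.2, 1.1], [0.9, 0.8], [8, 8], [8.2, 7.9], [7.8, 8.1]])

# Steps 1-4: start with every point as its own cluster and repeatedly merge the
# closest pair; 'linkage' records the whole merge history (the dendrogram).
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram to obtain a flat clustering with 2 clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 1 2 2 2]
```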

Clustering as a Mixture of Gaussians

There’s another way to deal with clustering problems: a model-based approach, which

consists in using certain models for clusters and attempting to optimize the fit between the

data and the model.

In practice, each cluster can be mathematically represented by a parametric distribution, like

a Gaussian. The entire data set is therefore modelled by a mixture of these distributions.

A mixture model with high likelihood tends to have the following traits:

 component distributions have high “peaks” (data in one cluster are tight);

 the mixture model “covers” the data well (dominant patterns in the data are
captured by component distributions).

Main advantages of model-based clustering:

 well-studied statistical inference techniques available;

 flexibility in choosing the component distribution;


 obtain a density estimation for each cluster;

 a “soft” classification is available.

Mixture of Gaussians

The most widely used clustering method of this kind is based on learning a mixture of

Gaussians:

A mixture model is a mixture of k component distributions that collectively make a mixture
distribution f(x):

f(x) = α1 f1(x) + α2 f2(x) + … + αk fk(x)

The αk represents the contribution of the kth component in constructing f(x). In practice,
parametric distributions (e.g. Gaussians) are often used, since a lot of work has been done to
understand their behaviour. If you substitute each fk(x) with a Gaussian, you get what is known
as a Gaussian mixture model (GMM).

The EM Algorithm

Expectation-Maximization assumes that your data is composed of multiple multivariate

normal distributions (note that this is a very strong assumption, in particular when you fix the

number of clusters!). Alternatively put, EM is an algorithm for maximizing a likelihood

function when some of the variables in your model are unobserved (i.e. when you have
latent variables).

You might fairly ask, if we’re just trying to maximize a function, why don’t we just use the

existing machinery for maximizing a function. Well, if you try to maximize this by taking

derivatives and setting them to zero, you find that in many cases the first-order conditions

don’t have a solution. There’s a chicken-and-egg problem in that to solve for your model

parameters you need to know the distribution of your unobserved data; but the distribution

of your unobserved data is a function of your model parameters.

Expectation-Maximization tries to get around this by iteratively guessing a distribution for

the unobserved data, then estimating the model parameters by maximizing something that is

a lower bound on the actual likelihood function, and repeating until convergence:

The Expectation-Maximization algorithm

 Start with guess for values of your model parameters

 E-step: For each datapoint that has missing values, use your model equation
to solve for the distribution of the missing data given your current guess of
the model parameters and given the observed data (note that you are solving
for a distribution for each missing value, not for the expected value). Now
that we have a distribution for each missing value, we can calculate
the expectation of the likelihood function with respect to the unobserved
variables. If our guess for the model parameter was correct, this expected
likelihood will be the actual likelihood of our observed data; if the
parameters were not correct, it will just be a lower bound.

 M-step: Now that we’ve got an expected likelihood function with no


unobserved variables in it, maximize the function as you would in the fully
observed case, to get a new estimate of your model parameters.

 Repeat until convergence.
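In practice the EM loop above is rarely written by hand; a sketch using scikit-learn's GaussianMixture, which runs EM internally, is shown below. The generated data and the number of components are assumptions for the example.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two Gaussian blobs as toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(6, 1, size=(100, 2))])

# GaussianMixture fits the mixing weights, means and covariances by EM.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
gmm.fit(X)

print("mixing weights (the alpha_k):", gmm.weights_)
print("component means:", gmm.means_)
# "Soft" classification: posterior probability of each component for each point.
print(gmm.predict_proba(X[:3]))
```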


Problems associated with clustering

There are a number of problems with clustering. Among them:

 dealing with large number of dimensions and large number of data items can
be problematic because of time complexity;

 the effectiveness of the method depends on the definition of “distance” (for


distance-based clustering). If an obvious distance measure doesn’t exist we
must “define” it, which is not always easy, especially in multidimensional
spaces;

 the result of the clustering algorithm (that in many cases can be arbitrary
itself) can be interpreted in different ways.

Possible Applications

Clustering algorithms can be applied in many fields, for instance:

 Marketing: finding groups of customers with similar behavior given a large


database of customer data containing their properties and past buying
records;

 Biology: classification of plants and animals given their features;

 Insurance: identifying groups of motor insurance policy holders with a high


average claim cost; identifying frauds;

 Earthquake studies: clustering observed earthquake epicenters to identify


dangerous zones;

 World Wide Web: document classification; clustering weblog data to discover


groups of similar access patterns.
K-Means Clustering Algorithm
K-Means Clustering is an unsupervised learning algorithm that is used to solve the clustering
problems in machine learning or data science. In this topic, we will learn what is K-means
clustering algorithm, how the algorithm works, along with the Python implementation of k-
means clustering.

What is K-Means Algorithm?

K-Means Clustering is an Unsupervised Learning algorithm, which groups the unlabeled


dataset into different clusters. Here K defines the number of pre-defined clusters that need
to be created in the process, as if K=2, there will be two clusters, and for K=3, there will be
three clusters, and so on.

It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such
a way that each data point belongs to only one group of points with similar properties.

It allows us to cluster the data into different groups and provides a convenient way to discover
the categories of groups in an unlabeled dataset on its own, without the need for any training.

It is a centroid-based algorithm, where each cluster is associated with a centroid. The main
aim of this algorithm is to minimize the sum of distances between the data point and their
corresponding clusters.

The algorithm takes the unlabeled dataset as input, divides the dataset into k clusters,
and repeats the process until the best clusters are found. The value of k
should be predetermined in this algorithm.

The k-means clustering algorithm mainly performs two tasks:

o Determines the best value for K center points or centroids by an iterative process.
o Assigns each data point to its closest k-center. Those data points which are near to the
particular k-center, create a cluster.

Hence each cluster has datapoints with some commonalities, and it is away from other
clusters.

The below diagram explains the working of the K-means Clustering Algorithm:
How does the K-Means Algorithm Work

The working of the K-Means algorithm is explained in the below steps:

Step-1: Select the number K to decide the number of clusters.

Step-2: Select random K points or centroids. (They need not be points from the input dataset.)

Step-3: Assign each data point to their closest centroid, which will form the predefined K
clusters.

Step-4: Calculate the variance and place a new centroid of each cluster.

Step-5: Repeat the third step, which means reassign each datapoint to the new closest
centroid of each cluster.

Step-6: If any reassignment occurs, then go to step-4 else go to FINISH.

Step-7: The model is ready.

Let's understand the above steps by considering the visual plots:

Suppose we have two variables M1 and M2. The x-y axis scatter plot of these two variables
is given below:
o Let's take number k of clusters, i.e., K=2, to identify the dataset and to put them into
different clusters. It means here we will try to group these datasets into two different
clusters.
o We need to choose some random k points or centroid to form the cluster. These
points can be either the points from the dataset or any other point. So, here we are
selecting the below two points as k points, which are not the part of our dataset.
Consider the below image:

o Now we will assign each data point of the scatter plot to its closest K-point or
centroid. We will compute it by applying some mathematics that we have studied to
calculate the distance between two points. So, we will draw a median between both
the centroids. Consider the below image:

From the above image, it is clear that points on the left side of the line are near the K1 or blue
centroid, and points on the right of the line are close to the yellow centroid. Let's color them
blue and yellow for clear visualization.
o As we need to find the closest cluster, we will repeat the process by choosing new
centroids. To choose the new centroids, we will compute the center of gravity of
each cluster and will find the new centroids as below:

o Next, we will reassign each datapoint to the new centroid. For this, we will repeat
the same process of finding a median line. The median will be like below image:

From the above image, we can see that one yellow point is on the left side of the line and two
blue points are on the right side of the line. So, these three points will be assigned to new centroids.
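The two-variable (M1, M2) walk-through above can be reproduced with scikit-learn's KMeans; the generated points are a stand-in for the scatter plot in the figures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the M1/M2 scatter plot: two groups of points.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2, 2], 0.5, size=(50, 2)),
               rng.normal([7, 7], 0.5, size=(50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)   # repeats assign/update until the centroids stop moving

print("final centroids:\n", kmeans.cluster_centers_)
print("first 10 cluster labels:", labels[:10])
```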

Dimensionality Reduction Technique

What is Dimensionality Reduction


The number of input features, variables, or columns present in a given dataset is known as
dimensionality, and the process to reduce these features is called dimensionality reduction.

A dataset contains a huge number of input features in various cases, which makes the
predictive modeling task more complicated. Because it is very difficult to visualize or make
predictions for a training dataset with a high number of features, dimensionality reduction
techniques are required in such cases.

Dimensionality reduction can be defined as "a way of converting a higher-dimensional
dataset into a lower-dimensional dataset while ensuring that it provides similar information."
These techniques are widely used in machine learning for obtaining a better-fitting
predictive model while solving classification and regression problems.
It is commonly used in the fields that deal with high-dimensional data, such as speech
recognition, signal processing, bioinformatics, etc. It can also be used for data visualization,
noise reduction, cluster analysis, etc.

The Curse of Dimensionality

Handling the high-dimensional data is very difficult in practice, commonly known as the curse
of dimensionality. If the dimensionality of the input dataset increases, any machine learning
algorithm and model becomes more complex. As the number of features increases, the
number of samples needed to generalize well also increases, and the chance of overfitting
increases. If a machine learning model is trained on high-dimensional data, it often becomes
overfitted and results in poor performance.

Hence, it is often required to reduce the number of features, which can be done with
dimensionality reduction.

Benefits of applying Dimensionality Reduction

Some benefits of applying dimensionality reduction technique to the given dataset are given
below:

o By reducing the dimensions of the features, the space required to store the dataset
also gets reduced.
o Less computation and training time is required for reduced dimensions of features.
o Reduced dimensions of features of the dataset help in visualizing the data quickly.
o It removes the redundant features (if present) by taking care of multicollinearity.

Disadvantages of dimensionality Reduction

There are also some disadvantages of applying the dimensionality reduction, which are given
below:
o Some data may be lost due to dimensionality reduction.
o In the PCA dimensionality reduction technique, sometimes the principal components
required to consider are unknown.

Approaches of Dimension Reduction

There are two ways to apply the dimension reduction technique, which are given below:

Feature Selection

Feature selection is the process of selecting the subset of the relevant features and leaving
out the irrelevant features present in a dataset to build a model of high accuracy. In other
words, it is a way of selecting the optimal features from the input dataset.

Three methods are used for the feature selection:

1. Filters Methods

In this method, the dataset is filtered, and a subset that contains only the relevant features is
taken. Some common techniques of filters method are:

o Correlation
o Chi-Square Test
o ANOVA
o Information Gain, etc.

2. Wrappers Methods

The wrapper method has the same goal as the filter method, but it uses a machine learning
model for its evaluation. In this method, some features are fed to the ML model and the
performance is evaluated. The performance decides whether to add or remove those features
to increase the accuracy of the model. This method is more accurate than the filter method
but more complex to work with. Some common techniques of wrapper methods are:

o Forward Selection
o Backward Selection
o Bi-directional Elimination

3. Embedded Methods: Embedded methods check the different training iterations of the
machine learning model and evaluate the importance of each feature. Some common
techniques of Embedded methods are:

o LASSO
o Elastic Net
o Ridge Regression, etc.
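A brief sketch of one filter method (chi-square scores) and one wrapper-style method (recursive feature elimination) with scikit-learn; the dataset is chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter method: keep the 5 features with the highest chi-square score.
filter_selector = SelectKBest(score_func=chi2, k=5).fit(X, y)
print("filter keeps features:", filter_selector.get_support(indices=True))

# Wrapper-style method: recursively drop features based on a model's fitted weights.
wrapper_selector = RFE(LogisticRegression(max_iter=5000), n_features_to_select=5).fit(X, y)
print("wrapper keeps features:", wrapper_selector.get_support(indices=True))
```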

Feature Extraction:

Feature extraction is the process of transforming the space containing many dimensions into
space with fewer dimensions. This approach is useful when we want to keep the whole
information but use fewer resources while processing the information.

Some common feature extraction techniques are:

a. Principal Component Analysis


b. Linear Discriminant Analysis
c. Kernel PCA
d. Quadratic Discriminant Analysis

Common techniques of Dimensionality Reduction


a. Principal Component Analysis
b. Backward Elimination
c. Forward Selection
d. Score comparison
e. Missing Value Ratio
f. Low Variance Filter
g. High Correlation Filter
h. Random Forest
i. Factor Analysis
j. Auto-Encoder

Principal Component Analysis (PCA)

Principal Component Analysis is a statistical process that converts the observations of


correlated features into a set of linearly uncorrelated features with the help of orthogonal
transformation. These new transformed features are called the Principal Components. It is
one of the popular tools that is used for exploratory data analysis and predictive modeling.

PCA works by considering the variance of each attribute, because an attribute with high
variance shows a good split between the classes, and hence it reduces the dimensionality.
Some real-world applications of PCA are image processing, movie recommendation systems,
and optimizing the power allocation in various communication channels.
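A minimal sketch of PCA with scikit-learn, reducing a dataset to two principal components; the dataset and the number of components are illustrative choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scales

pca = PCA(n_components=2)                      # keep the 2 directions of highest variance
X_reduced = pca.fit_transform(X_scaled)

print("original shape:", X.shape, "-> reduced shape:", X_reduced.shape)
print("variance explained by each component:", pca.explained_variance_ratio_)
```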
Backward Feature Elimination

The backward feature elimination technique is mainly used while developing Linear
Regression or Logistic Regression model. Below steps are performed in this technique to
reduce the dimensionality or in feature selection:

o In this technique, firstly, all the n variables of the given dataset are taken to train the
model.
o The performance of the model is checked.
o Now we will remove one feature each time and train the model on n-1 features for n
times, and will compute the performance of the model.
o We will check the variable that has made the smallest or no change in the performance
of the model, and then we will drop that variable or features; after that, we will be left
with n-1 features.
o Repeat the complete process until no feature can be dropped.

In this technique, by selecting the optimum performance of the model and the maximum
tolerable error rate, we can define the optimal number of features required for the machine
learning algorithm.

Forward Feature Selection

Forward feature selection follows the inverse process of the backward elimination process. It
means, in this technique, we don't eliminate the feature; instead, we will find the best
features that can produce the highest increase in the performance of the model. Below steps
are performed in this technique:

o We start with a single feature only, and progressively we will add each feature at a
time.
o Here we will train the model on each feature separately.
o The feature with the best performance is selected.
o The process is repeated until adding a new feature no longer gives a significant
increase in the performance of the model.

Missing Value Ratio

If a dataset has too many missing values, then we drop those variables as they do not carry
much useful information. To perform this, we can set a threshold level, and if a variable has
missing values more than that threshold, we will drop that variable. The higher the threshold
value, the more efficient the reduction.
Low Variance Filter

As with the missing value ratio technique, data columns with very few changes in the data
carry less information. Therefore, we need to calculate the variance of each variable, and all
data columns with variance lower than a given threshold are dropped, because low variance
features will not affect the target variable.

High Correlation Filter

High Correlation refers to the case when two variables carry approximately similar
information. Due to this factor, the performance of the model can be degraded. The
correlation between two independent numerical variables is measured by the correlation
coefficient. If this value is higher than the threshold value, we can remove one of the
variables from the dataset, preferring to keep the variable that shows a higher correlation
with the target variable.

Random Forest

Random Forest is a popular and very useful feature selection algorithm in machine learning.
This algorithm contains an in-built feature importance package, so we do not need to program
it separately. In this technique, we need to generate a large set of trees against the target
variable, and with the help of usage statistics of each attribute, we need to find the subset of
features.

The random forest algorithm takes only numerical variables, so we need to convert the input
data into numeric data using one-hot encoding.

Factor Analysis

Factor analysis is a technique in which each variable is kept within a group according to its
correlation with other variables. This means variables within a group can have a high
correlation among themselves, but a low correlation with variables of other groups.

We can understand it with an example: suppose we have two variables, Income and Spending.
These two variables have a high correlation, which means people with high income spend
more, and vice versa. So, such variables are put into a group, and that group is known as
the factor. The number of these factors will be reduced as compared to the original dimension
of the dataset.
Auto-encoders

One of the popular methods of dimensionality reduction is the auto-encoder, which is a type
of ANN (artificial neural network) whose main aim is to copy the inputs to the outputs. The
input is compressed into a latent-space representation, and the output is then reconstructed
from this representation. It has mainly two parts:

Encoder: The function of the encoder is to compress the input to form the latent-space
representation.

Decoder: The function of the decoder is to recreate the output from the latent-space
representation.
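A minimal sketch of an auto-encoder in Keras: the encoder compresses the input to a small latent representation and the decoder reconstructs it. The layer sizes, input dimension, and random placeholder data are assumptions made for the example.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim, latent_dim = 30, 2          # e.g. 30 input features compressed to 2

# Encoder: compress the input into the latent-space representation.
encoder = keras.Sequential([layers.Input(shape=(input_dim,)),
                            layers.Dense(16, activation="relu"),
                            layers.Dense(latent_dim)])
# Decoder: recreate the output from the latent-space representation.
decoder = keras.Sequential([layers.Input(shape=(latent_dim,)),
                            layers.Dense(16, activation="relu"),
                            layers.Dense(input_dim)])

autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")     # learn to copy inputs to outputs

X = np.random.rand(500, input_dim).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=32, verbose=0)
print(encoder.predict(X, verbose=0).shape)             # (500, 2): the reduced data
```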
Unit-05
Machine learning algorithm analytics

In this blog, we will discuss the various ways to check the performance of our machine

learning or deep learning model and why to use one in place of the other. We will discuss

terms like:

1. Confusion matrix

2. Accuracy

3. Precision

4. Recall

5. Specificity

6. F1 score

7. Precision-Recall or PR curve

8. ROC (Receiver Operating Characteristics) curve

9. PR vs ROC curve.

For simplicity, we will mostly discuss things in terms of a binary classification problem where

let’s say we’ll have to find if an image is of a cat or a dog. Or a patient is having cancer

(positive) or is found healthy (negative). Some common terms to be clear with are:

True positives (TP): Predicted positive and are actually positive.

False positives (FP): Predicted positive and are actually negative.


True negatives (TN): Predicted negative and are actually negative.

False negatives (FN): Predicted negative and are actually positive.

So let's get started!

Confusion matrix

It’s just a representation of the above parameters in a matrix format. Better visualization is

always good :)

Accuracy

Accuracy is the most commonly used metric to judge a model, but it is actually not a clear
indicator of performance. The worst case happens when classes are imbalanced.

Take for example a cancer detection model. The chances of actually having cancer are very

low. Let’s say out of 100, 90 of the patients don’t have cancer and the remaining 10 actually

have it. We don’t want to miss a patient who has cancer but goes undetected (a false
negative). Predicting everyone as not having cancer gives an accuracy of 90% straight away.
The model did nothing here but just predicted cancer-free for all 100 patients.

We surely need better alternatives.

Precision

Percentage of positive instances out of the total predicted positive instances:
Precision = TP / (TP + FP). Here the denominator is the number of predictions the model made
as positive over the whole dataset. Take it as finding out ‘how often the model is right when it
says it is right’.

Recall/Sensitivity/True Positive Rate

Percentage of positive instances out of the total actual positive instances:
Recall = TP / (TP + FN). The denominator (TP + FN) here is the actual number of positive
instances present in the dataset. Take it as finding out ‘how many of the right ones the model
missed when it showed the right ones’.

Specificity

Percentage of negative instances out of the total actual negative instances:
Specificity = TN / (TN + FP). The denominator (TN + FP) here is the actual number of negative
instances present in the dataset. It is similar to recall, but the focus is on the negative
instances, like finding out how many healthy patients did not have cancer and were told they
don’t have cancer. It is a kind of measure of how well separated the classes are.


F1 score

It is the harmonic mean of precision and recall: F1 = 2 · (Precision · Recall) / (Precision + Recall).
This takes the contribution of both, so the higher the F1 score, the better. Note that due to the
product in the numerator, if one goes low the final F1 score goes down significantly. So a model
does well on F1 score if the predicted positives are actually positive (precision) and it doesn't
miss positives by predicting them negative (recall).

One drawback is that both precision and recall are given equal importance, whereas depending
on our application we may need one to be higher than the other, so the F1 score may not be the
exact metric for it. Therefore either a weighted F1 score or looking at the PR or ROC curve
can help.
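A brief sketch computing these metrics with scikit-learn on a made-up, imbalanced set of predictions (the label vectors below are placeholders, not real model output).

```python
from sklearn.metrics import (confusion_matrix, accuracy_score,
                             precision_score, recall_score, f1_score)

# 1 = has cancer (positive), 0 = healthy (negative); made-up labels for illustration.
y_true = [0] * 85 + [1] * 15
y_pred = [0] * 80 + [1] * 5 + [0] * 5 + [1] * 10   # 5 FP and 5 FN by construction

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("accuracy:   ", accuracy_score(y_true, y_pred))
print("precision:  ", precision_score(y_true, y_pred))
print("recall:     ", recall_score(y_true, y_pred))
print("specificity:", tn / (tn + fp))   # no direct scikit-learn helper for specificity
print("F1 score:   ", f1_score(y_true, y_pred))
```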

PR curve

It is the curve between precision and recall for various threshold values. In the figure below

we have 6 predictors showing their respective precision-recall curve for various threshold

values. The top right part of the graph is the ideal space where we get high precision and

recall. Based on our application we can choose the predictor and the threshold value. PR AUC

is just the area under the curve. The higher its numerical value the better.
ROC curve

ROC stands for receiver operating characteristic, and the graph plots TPR against FPR
for various threshold values. As TPR increases, FPR also increases. As you can see in the first

figure, we have four categories and we want the threshold value that leads us closer to the

top left corner. Comparing different predictors (here 3) on a given dataset also becomes easy

as you can see in figure 2, one can choose the threshold according to the application at hand.

ROC AUC is just the area under the curve, the higher its numerical value the better.
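Both curves and their AUC values can be computed with scikit-learn, as in the sketch below; the ground-truth labels and scores are placeholder model outputs.

```python
from sklearn.metrics import precision_recall_curve, roc_curve, auc

# Placeholder ground truth and predicted scores (e.g. probabilities of the positive class).
y_true  = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.3, 0.35, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]

# PR curve: precision vs recall over all thresholds, and the area under it.
precision, recall, _ = precision_recall_curve(y_true, y_score)
print("PR AUC: ", auc(recall, precision))

# ROC curve: TPR vs FPR over all thresholds, and the area under it.
fpr, tpr, _ = roc_curve(y_true, y_score)
print("ROC AUC:", auc(fpr, tpr))
```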
PR vs ROC curve

Both metrics are widely used to judge a model's performance.

Which one to use, PR or ROC?

The answer lies in TRUE NEGATIVES.

Due to the absence of TN in the precision-recall equations, these metrics are useful for
imbalanced classes, i.e. when the negative class is in the majority. They do not take into
consideration the high number of TRUE NEGATIVES of the majority negative class, giving
better resistance to the imbalance. This is important when the detection of the positive class
is very important.

Like to detect cancer patients, which has a high class imbalance because very few have it out

of all the diagnosed. We certainly don’t want to miss a person having cancer and going
undetected (recall), and we want to be sure the detected one actually has it (precision).

Due to the consideration of TN or the negative class in the ROC equation, it is useful when

both the classes are important to us, like the detection of cats and dogs. The importance of

true negatives makes sure that both the classes are given importance, like the output of a

CNN model in determining the image is of a cat or a dog.


Ensemble Methods in Machine Learning:

Ensemble Methods, what are they? Ensemble methods are machine learning techniques that
combine several base models in order to produce one optimal predictive model. To better
understand this definition, let's take a step back and consider the ultimate goal of machine
learning and model building. This is going to make more sense as I dive into specific examples and why

Ensemble methods are used.

I will largely utilize Decision Trees to outline the definition and practicality of Ensemble

Methods (however it is important to note that Ensemble Methods do not only pertain to

Decision Trees).

A Decision Tree determines the predicted value based on a series of questions and conditions. For instance, consider a simple Decision Tree that decides whether an individual should play outside or not. The tree takes several weather factors into account and, for each factor, either makes a decision or asks another question. In this example, every time it is overcast we will play outside. However, if it is raining, we must ask whether it is windy or not. If it is windy, we will not play. But given no wind, tie those shoelaces tight, because we're going outside to play.
Decision Trees can solve quantitative problems as well, using the same format. In the tree to the left, we want to know whether or not to invest in a commercial real estate property. Is it an office building? A warehouse? An apartment building? Good economic conditions? Poor economic conditions? How much will an investment return? These questions are answered and solved using this decision tree.

When making Decision Trees, there are several factors we must take into consideration: On what features do we base our decisions? What is the threshold for classifying each question into a yes or no answer? In the first Decision Tree, what if we wanted to ask ourselves whether we have friends to play with or not? If we have friends, we will play every time. If not, we might continue to ask ourselves questions about the weather. By adding an additional question, we hope to define the Yes and No classes more sharply.

This is where Ensemble Methods come in handy! Rather than just relying on one Decision

Tree and hoping we made the right decision at each split, Ensemble Methods allow us to take

a sample of Decision Trees into account, calculate which features to use or questions to ask

at each split, and make a final predictor based on the aggregated results of the sampled

Decision Trees.
Types of Ensemble Methods

1. BAGGing, or Bootstrap AGGregating. BAGGing gets its name because it


combines Bootstrapping and Aggregation to form one ensemble model. Given a
sample of data, multiple bootstrapped subsamples are pulled. A Decision Tree is
formed on each of the bootstrapped subsamples. After each subsample Decision
Tree has been formed, an algorithm is used to aggregate over the Decision Trees
to form the most efficient predictor. The image below will help explain:

Given a Dataset, bootstrapped subsamples are pulled. A Decision Tree is formed on each
bootstrapped sample. The results of each tree are aggregated to yield the strongest, most
accurate predictor.
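A minimal scikit-learn sketch of this idea (the dataset and hyperparameters are illustrative assumptions, not part of the original text):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each of the 50 trees is trained on a bootstrapped subsample of the data;
# their predictions are aggregated by majority vote.
bagged_trees = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

print(cross_val_score(bagged_trees, X, y, cv=5).mean())
```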

2. Random Forest Models. Random Forest Models can be thought of as BAGGing, with a

slight tweak. When deciding where to split and how to make decisions, BAGGed Decision

Trees have the full disposal of features to choose from. Therefore, although the

bootstrapped samples may be slightly different, the data is largely going to break off at the

same features throughout each model. By contrast, Random Forest models decide where to

split based on a random selection of features. Rather than splitting at similar features at each

node throughout, Random Forest models implement a level of differentiation because each

tree will split based on different features. This level of differentiation provides a greater

ensemble to aggregate over, ergo producing a more accurate predictor. Refer to the image

for a better understanding.


Similar to BAGGing, bootstrapped subsamples are pulled from a larger dataset. A decision

tree is formed on each subsample. HOWEVER, the decision tree is split on different features

(in this diagram the features are represented by shapes).
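A comparable random forest sketch; the key difference from plain bagging is max_features, which restricts each split to a random subset of the features (the dataset and values here are illustrative assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# max_features="sqrt" means each split considers only a random subset of features,
# which de-correlates the trees in the ensemble.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)

print(cross_val_score(forest, X, y, cv=5).mean())
```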

3. Boosting is an ensemble learning method that combines a set of weak learners into a
strong learner to minimize training errors. In boosting, a random sample of data is selected,
fitted with a model and then trained sequentially—that is, each model tries to compensate
for the weaknesses of its predecessor. With each iteration, the weak rules from each
individual classifier are combined to form one, strong prediction rule.
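A minimal boosting sketch using AdaBoost with decision stumps as the weak learners (the dataset, choice of weak learner, and number of rounds are assumptions made for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# Each new stump focuses on the examples the previous stumps got wrong;
# the weighted vote of all stumps forms the final strong classifier.
boosted = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=100, random_state=0)

print(cross_val_score(boosted, X, y, cv=5).mean())
```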

Deep Generative Models

A Generative Model is a powerful way of learning any kind of data distribution using unsupervised learning, and generative models have achieved tremendous success in just a few years. All types of

generative models aim at learning the true data distribution of the training set so as to

generate new data points with some variations. But it is not always possible to learn the

exact distribution of our data either implicitly or explicitly and so we try to model a

distribution which is as similar as possible to the true data distribution. For this, we can

leverage the power of neural networks to learn a function which can approximate the model

distribution to the true distribution.


Two of the most commonly used and efficient approaches are Variational Autoencoders

(VAE) and Generative Adversarial Networks (GAN). VAE aims at maximizing the lower bound

of the data log-likelihood and GAN aims at achieving an equilibrium between Generator and

Discriminator. In this section, I will explain the working of VAEs and GANs and the intuition behind them.

Variational Autoencoder

I am assuming that the reader is already familiar with the working of a vanilla autoencoder.

We know that we can use an autoencoder to encode an input image into a much lower-dimensional representation which can store latent information about the input data

distribution. But in a vanilla autoencoder, the encoded vector can only be mapped to the

corresponding input using a decoder. It certainly can’t be used to generate similar images

with some variability.

To achieve this, the model needs to learn the probability distribution of the training data.

VAE is one of the most popular approaches for learning complicated data distributions, such as images, using neural networks in an unsupervised fashion. It is a probabilistic graphical model

rooted in Bayesian inference i.e., the model aims to learn the underlying probability

distribution of the training data so that it could easily sample new data from that learned

distribution. The idea is to learn a low-dimensional latent representation of the training data

called latent variables (variables which are not directly observed but are rather inferred

through a mathematical model) which we assume to have generated our actual training

data. These latent variables can store useful information about the type of output the model

needs to generate. The probability distribution of latent variables z is denoted by P(z). A

Gaussian distribution is selected as a prior to learn the distribution P(z) so as to easily sample

new data points during inference time.


Now the primary objective is to model the data with some parameters which maximize the likelihood of the training data X. In short, we are assuming that a low-dimensional latent vector has generated our data x (x ∈ X), and we can map this latent vector to data x using a deterministic function f(z; θ) parameterized by θ which we need to evaluate (see Fig. 1 [1]). Under this generative process, our aim is to maximize the probability of each data point in X, which is given as

Pθ(X) = ∫ Pθ(X, z) dz = ∫ Pθ(X|z) Pθ(z) dz        (1)

Here, f(z; θ) has been replaced by a distribution Pθ(X|z).

Fig. 1. Latent vector mapped to data distribution using parameter θ [1]

The intuition behind this maximum likelihood estimation is that if the model can generate

training samples from these latent variables then it can also generate similar samples with

some variations. In other words, if we sample a large number of latent variables from P(z)

and generate x from these variables then the generated x should match the data distribution

Pdata(x). Now we have two questions to answer: how do we capture the distribution of the latent variables, and how do we integrate Equation 1 over all the dimensions of z?

Obviously it is a tedious task to manually specify the relevant information we would like to encode in the latent vector to generate the output image. Instead we rely on neural networks to compute z, with the assumption that this latent vector can be well approximated by a normal distribution so that it is easy to sample at inference time. If z is normally distributed in an n-dimensional space, it is always possible to generate any kind of distribution from it using a sufficiently complicated function, and the inverse of this function can be used to learn the latent variables themselves.

In equation 1, the integration is carried out over all the dimensions of z and is therefore intractable. It can in principle be calculated using Monte-Carlo integration, but that is not easy to implement. So we follow another approach to approximately maximize Pθ(X) in equation 1. The idea of VAE is to infer P(z) using P(z|X), which we don't know. We infer P(z|X) using a method called variational inference, which is basically an optimization problem in Bayesian statistics. We first model P(z|X) with a simpler distribution Q(z|X) which is easy to find, and we try to minimize the difference between P(z|X) and Q(z|X) using the KL-divergence, so that our hypothesis is close to the true distribution. This is followed by a lot of mathematical equations which I will not explain here, but you can find them in the original paper. I must say that those equations are not very difficult to understand once you get the intuition behind VAE.

The final objective function of the VAE is:
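The equation image is not reproduced in these notes; the standard form of the VAE objective (the variational lower bound on log P(X)), which the interpretation below refers to, is:

log P(X) − D_KL[Q(z|X) ‖ P(z|X)] = E_{z ∼ Q(z|X)}[log P(X|z)] − D_KL[Q(z|X) ‖ P(z)]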

The above equation has a very nice interpretation. The term Q(z|X) is basically our encoder net, z is our encoded representation of the data x (x ∈ X), and P(X|z) is our decoder net. So in the above equation our goal is to maximize the log-likelihood of our data distribution, up to an error given by D_KL[Q(z|X) || P(z|X)]. It can easily be seen that VAE is trying to maximize a lower bound of log(P(X)), since P(z|X) is not tractable but the KL-divergence term is >= 0.

This is the same as maximizing E[log P(X|z)] and minimizing D_KL[Q(z|X) || P(z)]. We know that maximizing E[log P(X|z)] is a maximum likelihood estimation and is modeled using a decoder net. As said earlier, we want our latent representation to be close to Gaussian, and hence we assume P(z) to be N(0, 1). Following this assumption, Q(z|X) should also be close to this distribution. If we assume that it is a Gaussian with parameters μ(X) and Σ(X), the error due to the difference between these two distributions, i.e., P(z) and Q(z|X), given by the KL-divergence, results in the closed-form solution given below.
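The closed-form expression referred to here is the standard KL-divergence between two Gaussians; for Q(z|X) = N(μ(X), Σ(X)) and P(z) = N(0, I) in k dimensions it is:

D_KL[N(μ(X), Σ(X)) ‖ N(0, I)] = ½ ( tr(Σ(X)) + μ(X)ᵀμ(X) − k − log det Σ(X) )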

Considering we are optimizing the lower variational bound, our optimization function is: log(P(X|z)) − D_KL[Q(z|X) ‖ P(z)], where the solution of the second term is shown above.

Hence, our loss function contains two terms: the first is the reconstruction loss of the input to the output, and the second is the KL-divergence term. Now we can train the network using the backpropagation algorithm. But there is a problem: the first term depends not only on the parameters of P but also on the parameters of Q, and this dependency doesn't appear in the above equation. So how do we backpropagate through the layer where we sample z randomly from the distribution Q(z|X), i.e., N(μ(X), Σ(X)), so that P can decode? Gradients can't flow through random nodes. We use the reparameterization trick (see figure) to make the network differentiable: we sample from N(μ(X), Σ(X)) by first sampling ε ∼ N(0, I) and then computing z = μ(X) + Σ^(1/2)(X) ∗ ε.
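A minimal PyTorch-style sketch of the reparameterized sampling step and the resulting two-term loss (a diagonal Gaussian Q(z|X) is assumed, with the encoder producing mu and logvar; the decoder, encoder, and data pipeline are assumed to exist elsewhere):

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # z = mu + sigma * eps with eps ~ N(0, I); the randomness sits in eps,
    # so gradients can flow back into mu and logvar.
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + std * eps

def vae_loss(x_reconstructed, x, mu, logvar):
    # Term 1: reconstruction loss between the decoder output and the input.
    recon = F.mse_loss(x_reconstructed, x, reduction="sum")
    # Term 2: closed-form KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl
```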

This is shown in Fig. 2 [1]. It should be noted that the feedforward step is identical for both of these networks (left and right), but gradients can only backpropagate through the right network.


Fig.2. Reparameterization trick used to backpropagate through random nodes [1]

At inference time, we can simply sample z from N(0, 1) and feed it to the decoder net to generate a new data point. Since we are optimizing the lower variational bound, the quality of the generated images is somewhat poor compared to state-of-the-art techniques like Generative Adversarial Networks.

The best thing about VAE is that it learns both a generative model and an inference model. Although both VAE and GANs are very exciting approaches for learning the underlying data distribution with unsupervised learning, GANs yield better results than VAE. In VAE we optimize the lower variational bound, whereas in GAN there is no such assumption. In fact, GANs don't deal with any explicit probability density estimation. The failure of VAE to generate sharp images suggests that the model is not able to learn the true posterior distribution. VAE and GAN mainly differ in the way they are trained. Let's now dive into Generative Adversarial Networks.

Generative Adversarial Networks

Yann LeCun says that adversarial training is the coolest thing since sliced bread. Seeing the

popularity of Generative Adversarial Networks and the quality of the results they produce, I

think most of us would agree with him. Adversarial training has completely changed the way we teach neural networks to do a specific task. Generative Adversarial Networks don't work with any explicit density estimation like Variational Autoencoders. Instead, they are based on a game-theoretic approach, with the objective of finding a Nash equilibrium between two networks, the Generator and the Discriminator. The idea is to sample from a simple distribution like a Gaussian and then learn to transform this noise to the data distribution using universal function approximators such as neural networks.

This is achieved by adversarial training of these two networks. A generator model G learns to

capture the data distribution and a discriminator model D estimates the probability that a

sample came from the data distribution rather than the model distribution. Basically, the task of

the Generator is to generate natural looking images and the task of the Discriminator is to

decide whether the image is fake or real. This can be thought of as a mini-max two player

game where the performance of both the networks improves over time. In this game, the

generator tries to fool the discriminator by generating images that look as real as possible, and the discriminator tries not to get fooled by the generator by improving its discriminative capability. The image below shows the basic architecture of a GAN.

Fig.3. Building block of Generative Adversarial Network

We define a prior on the input noise variables, P(z); the generator then maps this to the data distribution using a complex differentiable function with parameters θ_g. In addition, we have another network called the Discriminator which takes an input x and, using another differentiable function with parameters θ_d, outputs a single scalar value denoting the probability that x comes from the true data distribution Pdata(x). The objective function of the GAN is defined as
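The equation image is not reproduced in these notes; the standard GAN minimax objective, which the following discussion refers to, is:

min_G max_D V(D, G) = E_{x ∼ Pdata(x)}[log D(x)] + E_{z ∼ P(z)}[log(1 − D(G(z)))]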

In the above equation, if the input to the Discriminator comes from the true data distribution then D(x) should output 1 to maximize the objective function w.r.t. D, whereas if the image has been generated by the Generator then D(G(z)) should output 1 to minimize the objective function w.r.t. G. The latter basically implies that G should generate realistic images which can fool D. We maximize the above function w.r.t. the parameters of the Discriminator using gradient ascent and minimize the same w.r.t. the parameters of the Generator using gradient descent. But there is a problem in optimizing the generator objective: at the start of the game, when the generator hasn't learned anything, the gradient is usually very small, and when it is doing very well, the gradients are very high (see Fig. 4). But we want the opposite behaviour. We therefore maximize E[log(D(G(z)))] rather than minimizing E[log(1 − D(G(z)))].

Fig.4. Cost for the Generator as a function of Discriminator response on the generated
image

The training process consists of the simultaneous application of stochastic gradient descent to the Discriminator and the Generator. While training, we alternate between k steps of optimizing D and one step of optimizing G on each mini-batch. Training stops when the Discriminator is unable to distinguish p_g and p_data, i.e. D(x, θ_d) = ½, or when p_g = p_data.
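A minimal PyTorch-style sketch of one training iteration (the generator G, discriminator D, their optimizers, and the data pipeline are assumed to be defined elsewhere, and D is assumed to output a probability of shape (n, 1)):

```python
import torch
import torch.nn.functional as F

def gan_train_step(G, D, opt_G, opt_D, real_batch, z_dim=100, k=1):
    n = real_batch.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # k discriminator steps: push D(real) towards 1 and D(fake) towards 0.
    for _ in range(k):
        fake = G(torch.randn(n, z_dim)).detach()   # detach so G is not updated here
        d_loss = (F.binary_cross_entropy(D(real_batch), ones) +
                  F.binary_cross_entropy(D(fake), zeros))
        opt_D.zero_grad()
        d_loss.backward()
        opt_D.step()

    # One generator step with the non-saturating loss: maximize log D(G(z)).
    g_loss = F.binary_cross_entropy(D(G(torch.randn(n, z_dim))), ones)
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```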
One of the earliest GAN models employing a Convolutional Neural Network was DCGAN, which stands for Deep Convolutional Generative Adversarial Network. This network takes as input 100 random numbers drawn from a uniform distribution and outputs an image of the desired shape. The network consists of many convolutional, deconvolutional and fully connected layers, and it uses many deconvolutional layers to map the input noise to the desired output image. Batch Normalization is used to stabilize the training of the network. ReLU activation is used in the generator for all layers except the output layer, which uses tanh, and Leaky ReLU is used for all layers in the Discriminator. The network was trained using mini-batch stochastic gradient descent, and the Adam optimizer with tuned hyperparameters was used to accelerate training. The results of the paper were quite interesting: the authors showed that the generators have interesting vector arithmetic properties which can be used to manipulate images in the way we want.

Fig.5. Generator of DCGAN

Fig.6. Discriminator of DCGAN

One of the most widely used variations of GANs is the conditional GAN (cGAN), which is constructed by simply adding a conditioning vector along with the noise vector (see Fig. 7). Prior to cGAN, we were generating images randomly from random samples of noise z. What if we want to generate an image with some desired features? Is there any way to provide this extra information to the model about what type of image we want to generate? The answer is yes, and the Conditional GAN is the way to do that. By conditioning the model on additional information, which is provided to both the generator and the discriminator, it is possible to direct the data generation process. Conditional GANs are used in a variety of tasks such as text-to-image generation, image-to-image translation, automated image tagging, etc. A unified structure of both networks is shown in the diagram below.

Fig. 7. A basic example of cGAN with y as the conditioning vector
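In a simple cGAN, the conditioning usually amounts to concatenating a label vector y (e.g., a one-hot class encoding) to the generator's noise input and to the discriminator's input. A minimal sketch of that wiring (shapes and names here are illustrative assumptions):

```python
import torch

batch, z_dim, n_classes = 32, 100, 10
z = torch.randn(batch, z_dim)                                      # random noise
y = torch.eye(n_classes)[torch.randint(0, n_classes, (batch,))]    # one-hot labels

g_input = torch.cat([z, y], dim=1)            # generator sees noise + condition
# x = G(g_input)                              # would produce images of the requested class
# d_input = torch.cat([x.flatten(1), y], dim=1)  # discriminator also sees the condition
```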

One of the cool things about GANs is that they can be trained even with a small amount of training data. Indeed, the results of GANs are promising, but the training procedure is not trivial, especially setting up the hyperparameters of the network. Moreover, GANs are difficult to optimize, as they don't converge easily. Of course there are some tips and tricks for making GANs work, but they may not always help. Also, we don't have clear criteria for the quantitative evaluation of the results, except to check whether the generated images are perceptually realistic or not.


Boltzmann machine in deep learning

A deep Boltzmann machine (DBM) is a model with multiple hidden layers and undirected connections between the nodes, as shown in Fig. 7.7. A DBM learns features hierarchically from the raw data, and the features extracted in one layer are applied as hidden variables that serve as input to the subsequent layer.

Deep autoencoders: A deep autoencoder is composed of two symmetrical deep-belief


networks having four to five shallow layers. One of the networks represents the encoding
half of the net and the second network makes up the decoding half. They have more layers
than a simple autoencoder and thus are able to learn more complex features. The layers are
restricted Boltzmann machines, the building blocks of deep-belief networks.

Application of autoencoders

So far we have seen a variety of autoencoders and each of them is good at a specific task.
Let’s find out some of the tasks they can do

Data Compression

Although autoencoders are designed for data compression, they are hardly used for this purpose in practical situations. The reasons are:

 Lossy compression: The output of the autoencoder is not exactly the same as
the input, it is a close but degraded representation. For lossless compression,
they are not the way to go.
 Data-specific: Autoencoders are only able to meaningfully compress data
similar to what they have been trained on. Since they learn features specific
for the given training data, they are different from a standard data
compression algorithm like jpeg or gzip. Hence, we can’t expect an
autoencoder trained on handwritten digits to compress landscape photos.
Since we have simpler and more efficient algorithms like JPEG, LZMA, and LZSS (used in WinRAR in tandem with Huffman coding), autoencoders are not generally used for compression, although they have seen use for image denoising and dimensionality reduction in recent years.
Image Denoising

Autoencoders are very good at denoising images. When an image gets corrupted or there is
a bit of noise in it, we call this image a noisy image.

To obtain proper information about the content of the image, we perform image denoising.
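The training recipe for a denoising autoencoder is simple: corrupt the input, but ask the network to reconstruct the clean version. A minimal PyTorch-style sketch (the tiny network, noise level, and stand-in batch are assumptions made to keep the example self-contained):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A tiny fully connected autoencoder, just to make the example self-contained.
autoencoder = nn.Sequential(
    nn.Linear(784, 64), nn.ReLU(),    # encoder
    nn.Linear(64, 784), nn.Sigmoid()  # decoder
)

x = torch.rand(16, 784)                       # stand-in for a batch of clean images
noisy_x = x + 0.3 * torch.randn_like(x)       # corrupt the input with Gaussian noise
loss = F.mse_loss(autoencoder(noisy_x), x)    # ...but reconstruct the clean target
loss.backward()
```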

Dimensionality Reduction

Autoencoders convert the input into a reduced representation which is stored in the middle layer, called the code. This is where the information from the input has been compressed, and by extracting this layer from the model, each node can be treated as a variable. Thus we can conclude that by discarding the decoder part, an autoencoder can be used for dimensionality reduction, with the output being the code layer.
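A minimal sketch of that idea: train the full autoencoder, then keep only the encoder and use its output (the code layer) as the reduced representation (the architecture and sizes here are illustrative assumptions):

```python
import torch
import torch.nn as nn

input_dim, code_dim = 784, 32

encoder = nn.Sequential(nn.Linear(input_dim, 128), nn.ReLU(), nn.Linear(128, code_dim))
decoder = nn.Sequential(nn.Linear(code_dim, 128), nn.ReLU(), nn.Linear(128, input_dim))
autoencoder = nn.Sequential(encoder, decoder)   # trained end-to-end to reconstruct x

# After training, discard the decoder and use the code layer as low-dimensional features.
x = torch.rand(16, input_dim)
features = encoder(x)                           # shape: (16, 32)
```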

Feature Extraction

The encoding part of an autoencoder helps to learn the important hidden features present in the input data while reducing the reconstruction error. During encoding, a new set of combinations of the original features is generated.

Image Generation

The Variational Autoencoder (VAE) discussed above is a generative model, used to generate images that the model has not seen yet. The idea is that given input images, like images of faces or scenery, the system will generate similar images. It can be used to:

 generate new characters of animation


 generate fake human images
Image colourisation

One of the applications of autoencoders is to convert a black and white picture into a
coloured image. Or we can convert a coloured image into a grayscale image.

Applications of Deep Learning Across Industries

1. Self Driving Cars


2. News Aggregation and Fraud News Detection
3. Natural Language Processing
4. Virtual Assistants
5. Entertainment
6. Visual Recognition
7. Fraud Detection
8. Healthcare
9. Personalisations
10. Detecting Developmental Delay in Children
11. Colourisation of Black and White images
12. Adding sounds to silent movies
13. Automatic Machine Translation
14. Automatic Handwriting Generation
15. Automatic Game Playing
16. Language Translations
17. Pixel Restoration
18. Photo Descriptions
19. Demographic and Election Predictions
20. Deep Dreaming

***
