DL Unit - 2
Gradient Descent
Gradient descent is an optimization algorithm used in machine learning to minimize the cost
function by iteratively adjusting parameters in the direction of the negative gradient, aiming to
find the optimal set of parameters.
The cost function represents the discrepancy between the predicted output of the model and the
actual output. The goal of gradient descent is to find the set of parameters that minimizes this
discrepancy and improves the model’s performance.
What is a Cost Function
It is a function that measures the performance of a model for any given data. Cost Function
quantifies the error between predicted values and expected values and presents it in the form of a
single real number.
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing the cost function, we modify the parameters using the gradient descent algorithm over the given data. The mathematical representation of the update is:

θ = θ − α · ∇J(θ)

where θ is the parameter vector, α is the learning rate, and ∇J(θ) is the gradient of the cost function with respect to θ.
The algorithm operates by calculating the gradient of the cost function, which indicates the
direction and magnitude of steepest ascent. However, since the objective is to minimize the cost
function, gradient descent moves in the opposite direction of the gradient, known as the negative
gradient direction.
By iteratively updating the model’s parameters in the negative gradient direction, gradient
descent gradually converges towards the optimal set of parameters that yields the lowest cost.
The learning rate, a hyperparameter, determines the step size taken in each iteration, influencing
the speed and stability of convergence.
Gradient descent can be applied to various machine learning algorithms, including linear
regression, logistic regression, neural networks, and support vector machines. It provides a
general framework for optimizing models by iteratively refining their parameters based on the
cost function.
Example of Gradient Descent
Let’s say you are playing a game where the players are at the top of a mountain and are asked to reach the lake at its lowest point. Additionally, they are blindfolded. So, what approach do you think would make you reach the lake?
Take a moment to think about this before you read on.
The best way is to observe the ground and find where the land descends. From that position, take
a step in the descending direction and iterate this process until we reach the lowest point.
The goal of the gradient descent algorithm is to minimize the given function (say cost function).
To achieve this goal, it performs two steps iteratively:
● Compute the gradient (slope), the first order derivative of the function at that point
● Make a step (move) in the direction opposite to the gradient: move from the current point by alpha times the gradient at that point, against the direction in which the slope increases
Alpha is called the learning rate – a tuning parameter in the optimization process. It decides the length of the steps.
Working of Gradient Descent
● Gradient descent is an optimization algorithm used to minimize the cost function of a
model.
● The cost function measures how well the model fits the training data and is defined based
on the difference between the predicted and actual values.
● The gradient of the cost function is the derivative with respect to the model’s parameters
and points in the direction of the steepest ascent.
● The algorithm starts with an initial set of parameters and updates them in small steps to
minimize the cost function.
● In each iteration of the algorithm, the gradient of the cost function with respect to each
parameter is computed.
● The gradient tells us the direction of the steepest ascent, and by moving in the opposite
direction, we can find the direction of the steepest descent.
● The size of the step is controlled by the learning rate, which determines how quickly the
algorithm moves towards the minimum.
● The process is repeated until the cost function converges to a minimum, indicating that
the model has reached the optimal set of parameters.
● There are different variations of gradient descent, including batch gradient descent,
stochastic gradient descent, and mini-batch gradient descent, each with its own
advantages and limitations.
● Efficient implementation of gradient descent is essential for achieving good performance
in machine learning tasks. The choice of the learning rate and the number of iterations
can significantly impact the performance of the algorithm.
Gradient Descent Algorithm
The Gradient Descent Algorithm iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate), and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimise the function (to maximise it, we would add). This process can be written as:

p(n+1) = p(n) − η · ∇f(p(n))

There’s an important parameter η which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance.
Gradient Descent method’s steps are:
1. choose a starting point (initialisation)
2. calculate gradient at this point
3. make a scaled step in the opposite direction to the gradient (objective: minimise)
4. repeat points 2 and 3 until one of the criteria is met:
● maximum number of iterations reached
● step size is smaller than the tolerance (due to scaling or a small gradient).
A typical implementation of this procedure takes 5 parameters (a sketch follows the list):
1. starting point - in our case we define it manually, but in practice it is often a random initialisation
2. gradient function - has to be specified beforehand
3. learning rate - scaling factor for step sizes
4. maximum number of iterations
5. tolerance to conditionally stop the algorithm (in this case the default value is 0.01)
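A minimal Python sketch of such a function, matching the five parameters above (the function name and the example are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    """Minimise a function by repeatedly stepping against its gradient.

    start      -- starting point (scalar or numpy array)
    gradient   -- callable returning the gradient at a point
    learn_rate -- scaling factor (learning rate) for step sizes
    max_iter   -- maximum number of iterations
    tol        -- stop when the step becomes smaller than this
    """
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        step = learn_rate * gradient(x)     # scale the gradient
        if np.all(np.abs(step) < tol):      # step-size stopping criterion
            break
        x = x - step                        # move against the gradient
    return x

# Example: minimise f(x) = x^2, whose gradient is 2x.
minimum = gradient_descent(start=10.0, gradient=lambda x: 2 * x,
                           learn_rate=0.1, max_iter=100)
print(minimum)  # close to 0
```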
Variants of Gradient Descent

1. Batch Gradient Descent:
In batch gradient descent, the gradient of the cost function is computed over the entire training dataset before each parameter update:

θ = θ − α · ∇J(θ)

This gives stable updates but is computationally expensive on large datasets.

2. Stochastic Gradient Descent (SGD):
In stochastic gradient descent, the parameters are updated using the gradient computed on a single randomly selected training example:

θ = θ − α · ∇Ji(θ)

where:
θ is the parameter vector,
α is the learning rate, and
∇Ji(θ) is the gradient of the cost function J with respect to θ, computed on a single randomly selected training example i.
3. Mini-batch Gradient Descent:
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic
gradient descent. In mini-batch gradient descent, the gradients are computed on a small random
subset of the training dataset, typically between 10 and 1000 examples, called a mini-batch. This
reduces the computational cost of the algorithm compared to batch gradient descent, while also
reducing the variance of the updates compared to SGD. Mini-batch gradient descent is widely
used in deep learning because it strikes a good balance between convergence speed and stability.
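As an illustration, here is a minimal numpy sketch of mini-batch gradient descent for linear regression with MSE loss (the names and defaults are illustrative). Setting batch_size to the dataset size recovers batch gradient descent, and batch_size=1 recovers SGD:

```python
import numpy as np

def minibatch_gd(X, y, learn_rate=0.01, batch_size=32, epochs=100):
    """Mini-batch gradient descent for linear regression (MSE loss)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)                # reshuffle each epoch
        for i in range(0, n, batch_size):
            batch = idx[i:i + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of (1/m) * sum((Xb @ theta - yb)^2)
            grad = 2 / len(batch) * Xb.T @ (Xb @ theta - yb)
            theta -= learn_rate * grad
    return theta
```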
6. Adagrad:
Adagrad is a variant of gradient descent that adapts the learning rate for each parameter based on its historical gradient values. Parameters with historically large gradients have their effective learning rates reduced aggressively, while parameters with small gradients keep comparatively larger learning rates. This helps to normalize the updates and can be useful when the cost function has a lot of curvature or gradients at very different scales.
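A minimal numpy sketch of one Adagrad step, following the standard accumulate-squared-gradients rule (the names are illustrative, not from the notes):

```python
import numpy as np

def adagrad_update(theta, grad, G, learn_rate=0.01, eps=1e-8):
    """One Adagrad step: G accumulates squared gradients elementwise,
    so frequently/strongly updated parameters get smaller effective steps."""
    G += grad ** 2                                   # running sum of squared gradients
    theta -= learn_rate * grad / (np.sqrt(G) + eps)  # per-parameter scaled step
    return theta, G
```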
7. RMSProp:
RMSProp is a variant of gradient descent that also adapts the learning rate for each parameter,
but instead of using the historical gradient values, it uses a moving average of the squared
gradient values. This helps to reduce the learning rate for parameters that have large squared
gradient values, which can cause the algorithm to oscillate or diverge.
In simple terms, RMSProp uses an adaptive learning rate instead of treating the learning rate as a fixed hyperparameter, so the effective learning rate varies over time. RMSProp uses the same exponentially-weighted-average idea as gradient descent with momentum; the difference lies in how the parameters are updated.
Parameter Update
The recursive formula of an exponentially weighted average (EWA) is given by:

Vt = β · Vt-1 + (1 − β) · θt

where:
Vt: moving-average value at time t
θt: the current value being averaged
β: the moving-average parameter

In RMSProp, we find the EWA of the squared gradients and update our parameters using those EWAs. On each iteration t, we calculate dW and db on the current minibatch, then calculate vdW and vdb using the following formulae:

vdW = β · vdW + (1 − β) · (dW)²
vdb = β · vdb + (1 − β) · (db)²

We update our parameters after calculating the exponentially weighted averages. Substituting the scaled gradients into the plain updates W = W − (learning rate) · dW and b = b − (learning rate) · db, we get:

W = W − (learning rate) · dW / (√vdW + ε)
b = b − (learning rate) · db / (√vdb + ε)
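A minimal numpy sketch of the RMSProp update just described (the names are illustrative):

```python
import numpy as np

def rmsprop_update(W, dW, v, learn_rate=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step using an exponentially weighted average of
    squared gradients (the same update applies to b and db)."""
    v = beta * v + (1 - beta) * dW ** 2           # EWA of squared gradients
    W -= learn_rate * dW / (np.sqrt(v) + eps)     # scaled parameter update
    return W, v
```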
8. Adam:
Adam derives its name from adaptive moment estimation. This optimization algorithm is an extension of stochastic gradient descent that updates network weights during training. It is a hybrid of the “gradient descent with momentum” and RMSProp algorithms.
It is an adaptive learning rate method that calculates individual learning rates for various
parameters.
Adam can be used instead of the classical stochastic gradient descent procedure to update
network weights iteratively based on training data.
The Adam optimizer employs a hybrid of two gradient descent methods:
Momentum: This algorithm is used to speed up the gradient descent algorithm by taking the ”exponentially weighted average” of the gradients into account. Using averages makes the algorithm converge towards the minima more quickly. In the notation used below:
● Wt = weights at time t
● Wt+1 = weights at time t+1
● αt = learning rate at time t
● ∂L/∂Wt = derivative of the loss function with respect to the weights at time t
● Vt = sum of the squares of past gradients [i.e. accumulated (∂L/∂Wt-1)²] (initially, Vt = 0)
● β = moving-average parameter (a constant, typically 0.9)
● ε = a small positive constant (10⁻⁸)
Adam Optimizer takes the strengths or positive characteristics of the previous two methods and
builds on them to provide a more optimized gradient descent.
In this case, we control the gradient descent rate so that there is minimal oscillation when it reaches the global minimum, while taking steps large enough to get past the local-minima hurdles along the way. Combining the features of the above methods lets Adam reach the global minimum efficiently.
Mathematical Aspect of Adam Optimizer:
Combining the formulas used in the previous two methods (momentum and RMSProp), we get the following update rules:

mt = β1 · mt-1 + (1 − β1) · (∂L/∂Wt)          (momentum term)
vt = β2 · vt-1 + (1 − β2) · (∂L/∂Wt)²         (RMSProp term)
m̂t = mt / (1 − β1^t),  v̂t = vt / (1 − β2^t)   (bias correction)
Wt+1 = Wt − αt · m̂t / (√v̂t + ε)
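A minimal numpy sketch of one Adam step combining the two moment estimates above (the names and defaults are illustrative):

```python
import numpy as np

def adam_update(W, dW, m, v, t, learn_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1): a momentum-style EWA of gradients (m)
    combined with an RMSProp-style EWA of squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * dW          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dW ** 2     # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    W -= learn_rate * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```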
Autoencoders
Autoencoders are a specific type of feedforward neural network where the output is trained to match the input. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses
the input and produces the code, the decoder then reconstructs the input only using this code.
To build an autoencoder we need 3 things: an encoding method, decoding method, and a loss
function to compare the output with the target.
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of
important properties:
● Data-specific: Autoencoders are only able to meaningfully compress data similar to
what they have been trained on. Since they learn features specific for the given
training data, they are different from a standard data compression algorithm like
gzip. So we can’t expect an autoencoder trained on handwritten digits to compress
landscape photos.
● Lossy: The output of the autoencoder will not be exactly the same as the input, it
will be a close but degraded representation. If you want lossless compression they
are not the way to go.
● Unsupervised: To train an autoencoder we don’t need to do anything fancy, just
throw the raw input data at it. Autoencoders are considered an unsupervised learning
technique since they don’t need explicit labels to train on. But to be more precise
they are self-supervised because they generate their own labels from the training
data.
Architecture of autoencoders
An autoencoder consists of three components:
● Encoder: An encoder is a feedforward, fully connected neural network that
compresses the input into a latent-space representation of reduced
dimension. The compressed image is a distorted version of the original image.
● Code: This part of the network contains the reduced representation of the input
that is fed into the decoder.
● Decoder: Decoder is also a feedforward network like the encoder and has a
similar structure to the encoder. This network is responsible for reconstructing the
input back to the original dimensions from the code.
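As an illustration, here is a minimal fully connected autoencoder in PyTorch, assuming flattened 784-dimensional inputs such as MNIST (the layer sizes are arbitrary choices, not from the notes):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_size=32):
        super().__init__()
        # Encoder: compresses the input down to the code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_size),
        )
        # Decoder: reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)       # latent-space representation
        return self.decoder(code)    # reconstruction of the input
```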
Training of autoencoders
You need to set 4 hyperparameters before training an autoencoder:
1. Code size: The code size or the size of the bottleneck is the most important
hyperparameter used to tune the autoencoder. The bottleneck size decides how much the
data has to be compressed. This can also act as a regularisation term.
2. Number of layers: Like all neural networks, an important hyperparameter to tune
autoencoders is the depth of the encoder and the decoder. While a higher depth increases
model complexity, a lower depth is faster to process.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use
per layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.
4. Reconstruction Loss: The loss function we use to train the autoencoder is highly
dependent on the type of input and output we want the autoencoder to adapt to. If we are
working with image data, the most popular loss functions for reconstruction are MSE
Loss and L1 Loss. In case the inputs and outputs are within the range [0,1], as in MNIST,
we can also make use of Binary Cross Entropy as the reconstruction loss.
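A sketch of a single training step for the Autoencoder class sketched earlier, using MSE as the reconstruction loss (BCELoss would be the drop-in alternative for [0,1] inputs; the names are illustrative):

```python
import torch

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()   # or torch.nn.BCELoss() for [0,1] inputs

def train_step(batch):
    # batch: float tensor of shape (batch_size, 784)
    optimizer.zero_grad()
    reconstruction = model(batch)
    loss = criterion(reconstruction, batch)  # target is the input itself
    loss.backward()
    optimizer.step()
    return loss.item()
```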
There are several types of regularization that can be used with autoencoders, including:
● L1 Regularization: This method adds a penalty to the loss function for the sum of the
absolute values of the model weights. This encourages the model to learn sparse
representations, where many of the weights are set to zero.
● L2 Regularization: This method adds a penalty to the loss function for the sum of the
squares of the model weights. This encourages the model to learn small, non-zero
weights.
● Dropout: This method randomly sets a fraction of the model's activations to zero during
each training iteration. This helps prevent the model from relying too heavily on any one
set of activations.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:

Loss = Σi (yi − ŷi)² + λ · Σj |βj|

Again, if lambda is zero we get back OLS (ordinary least squares), whereas a very large value will shrink the coefficients to zero and the model will under-fit.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function; the second term below is the L2 regularization element:

Loss = Σi (yi − ŷi)² + λ · Σj βj²
Here too, if lambda is zero we get back OLS. However, if lambda is very large it adds too much weight to the penalty and leads to under-fitting. Having said that, how lambda is chosen is important. This technique works very well to avoid the over-fitting issue.
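For reference, both penalties are available directly in scikit-learn, where the alpha parameter plays the role of lambda (the data here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data with some truly-zero coefficients
X = np.random.randn(100, 5)
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * np.random.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks coefficients toward zero
print(lasso.coef_)  # sparse coefficients
print(ridge.coef_)  # small but non-zero coefficients
```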
Types of autoencoders
There are five popular types of autoencoders:
1. Undercomplete autoencoders
2. Sparse autoencoders
3. Contractive autoencoders
4. Denoising autoencoders
5. Variational Autoencoders (for generative modeling)
1. Undercomplete autoencoders
An Undercomplete autoencoder is one of the simplest types of autoencoders.
The way it works is very straightforward—
Undercomplete autoencoder takes in an image and tries to predict the same image as output, thus
reconstructing the image from the compressed bottleneck region.
Undercomplete autoencoders are truly unsupervised as they do not take any form of label, the
target being the same as the input.
The primary use of autoencoders like this is the generation of the latent space or bottleneck, which forms a compressed substitute of the input data and can be easily decompressed back with the help of the network when needed.
This form of compression in the data can be modeled as a form of dimensionality reduction. When we think of dimensionality reduction, we tend to think of methods like PCA (Principal Component Analysis) that find a lower-dimensional hyperplane to represent higher-dimensional data without losing too much information.
However—
PCA can only build linear relationships. As a result, it is at a disadvantage compared with methods like undercomplete autoencoders, which can learn non-linear relationships and therefore perform better at dimensionality reduction.
This form of nonlinear dimensionality reduction where the autoencoder learns a non-linear
manifold is also termed as manifold learning.
Effectively, if we remove all non-linear activations from an undercomplete autoencoder and use only linear layers, we reduce the undercomplete autoencoder to something that works on an equal footing with PCA.
The loss function used to train an undercomplete autoencoder is called the reconstruction loss, as it is a check of how well the image has been reconstructed from the input data. Although the reconstruction loss can be anything depending on the input and output, we will use an L1 loss (also called the norm loss) to depict the term, represented by:

L(x, x̂) = |x − x̂|

where x̂ represents the predicted output and x represents the ground truth.
As the loss function has no explicit regularization term, the only method to ensure that the model
is not memorizing the input data is by regulating the size of the bottleneck and the number of
hidden layers within this part of the network—the architecture.
2. Sparse autoencoders
Sparse autoencoders are similar to the undercomplete autoencoders in that they use the same
image as input and ground truth. However—
The means by which the encoding of information is regulated is significantly different.
While undercomplete autoencoders are regulated and fine-tuned by regulating the size of the
bottleneck, the sparse autoencoder is regulated by changing the number of nodes at each hidden
layer.
Since it is not possible to design a neural network that has a flexible number of nodes at its
hidden layers, sparse autoencoders work by penalizing the activation of some neurons in hidden
layers.
In other words, the loss function has a term that calculates the number of neurons that have been
activated and provides a penalty that is directly proportional to that.
This penalty, called the sparsity function, prevents the neural network from activating more
neurons and serves as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes,
sparsity regularizer works by creating a penalty on the number of nodes activated.
This form of regularization allows the network to have nodes in hidden layers dedicated to finding specific features in images during training, treating the regularization problem as separate from the latent-space problem.
We can thus set latent space dimensionality at the bottleneck without worrying about
regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into the
loss function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer to the loss as we do for general regularizers:

L = L_reconstruction + λ · Σi |ai^(h)|

where h represents the hidden layer, i indexes the activations for an image in the minibatch, and a represents the activation.
KL-Divergence: In this case, we consider the activations over a collection of samples at once rather than summing them as in the L1 loss method. We constrain the average activation of each neuron over this collection.

Considering the ideal distribution to be a Bernoulli distribution with mean ρ, we include the KL divergence within the loss to reduce the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:

L = L_reconstruction + Σj KL(ρ ‖ ρ̂j)

where

ρ̂j = (1/m) · Σi aj(xi)

and j denotes the specific neuron in layer h, with the average taken over a collection of m samples, each denoted as x.
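A minimal PyTorch sketch of the L1 sparsity penalty added to the reconstruction loss (the function and names are illustrative, not from the notes):

```python
import torch

def sparse_loss(x, reconstruction, activations, lam=1e-4):
    """Reconstruction loss plus an L1 penalty on hidden activations,
    encouraging most neurons to stay inactive."""
    recon = torch.nn.functional.mse_loss(reconstruction, x)
    sparsity = activations.abs().sum()   # L1 sparsity regularizer
    return recon + lam * sparsity
```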
3. Contractive autoencoders
Similar to other autoencoders, contractive autoencoders perform the task of learning a
representation of the image while passing it through a bottleneck and reconstructing it in the
decoder.
The contractive autoencoder also has a regularization term to prevent the network from learning the identity function and simply mapping the input to the output.
Contractive autoencoders work on the basis that similar inputs should have similar encodings
and a similar latent space representation. It means that the latent space should not vary by a huge
amount for minor variations in the input.
To train a model that works under this constraint, we have to ensure that the derivatives of the hidden layer activations are small with respect to the input data. Mathematically:

‖∇x h(x)‖²F = Σij (∂hj(x) / ∂xi)²  should be small
While the reconstruction loss wants the model to tell differences between two inputs and observe variations in the data, the Frobenius norm of the derivatives says that the model should be able to ignore minor variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a network
where the hidden layers now capture only the most essential information. This information is
necessary to separate images and ignore information that is non-discriminatory in nature, and
therefore, not important.
The total loss function can be mathematically expressed as:

L = L_reconstruction + λ · ‖∇x h(x)‖²F

where h is the hidden layer for which the gradient is calculated with respect to the input x. The gradient is summed over all training samples, and the Frobenius norm of the Jacobian is taken.
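For a single sigmoid encoder layer h = sigmoid(Wx + b), the Frobenius norm of the Jacobian has a closed form, which is one common way contractive autoencoders are implemented; a minimal PyTorch sketch under that assumption (illustrative):

```python
import torch

def contractive_penalty(h, W):
    """||J||_F^2 for h = sigmoid(W x + b):
    dh_j/dx_i = h_j (1 - h_j) W_ji, so the squared Frobenius norm is
    sum_j [h_j (1 - h_j)]^2 * sum_i W_ji^2."""
    dh = (h * (1 - h)) ** 2        # shape: (batch, hidden)
    w_sq = (W ** 2).sum(dim=1)     # shape: (hidden,)
    return (dh * w_sq).sum()       # summed over the batch
```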
4. Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an
image.
As opposed to autoencoders we’ve already covered, this is the first of its kind that does not have
the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been added via
digital alterations. The noisy image is fed to the encoder-decoder architecture, and the output is
compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input where the
noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs this by
mapping the input data into a lower-dimensional manifold (like in undercomplete autoencoders),
where filtering of noise becomes much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction.
The loss function generally used in these types of networks is L2 or L1 loss.
5. Variational autoencoders
Standard autoencoders learn to represent the input just in a compressed form called the latent space or the bottleneck. The latent space formed after training such a model is therefore not necessarily continuous and, in effect, might not be easy to interpolate.

For example, a standard autoencoder would learn each latent attribute of the input as a single, fixed value.
While these attributes explain the image and can be used in reconstructing the image from the
compressed latent space, they do not allow the latent attributes to be expressed in a probabilistic
fashion.
Variational autoencoders deal with this specific topic and express their latent attributes as a
probability distribution, leading to the formation of a continuous latent space that can be easily
sampled and interpolated.
When fed the same input, a variational autoencoder would instead construct each latent attribute as a probability distribution, described by a mean and a variance.
The latent attributes are then sampled from the latent distribution formed and fed to the decoder,
reconstructing the input.
While this seems easy in theory, it becomes impossible to implement because backpropagation
cannot be defined for a random sampling process performed before feeding the data to the
decoder.
To get past this hurdle, we use the reparameterization trick: a cleverly defined way to bypass the sampling process inside the neural network.
In the reparameterization trick, we randomly sample a value ε from a unit Gaussian, scale it by the latent distribution’s standard deviation σ, and shift it by the mean μ of the same distribution:

z = μ + σ · ε

Now the sampling process is handled outside the backpropagation pipeline, and the sampled value ε acts just like another input to the model, fed in at the bottleneck.
Diagrammatically, ε enters the network as a separate input that is scaled by σ and shifted by μ before reaching the decoder.
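A minimal PyTorch sketch of the reparameterization trick as just described (assuming the encoder outputs a mean and a log-variance; the names are illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    random draw outside the backpropagated computation."""
    std = torch.exp(0.5 * logvar)   # sigma from log-variance
    eps = torch.randn_like(std)     # unit-Gaussian sample
    return mu + std * eps
```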
The variational autoencoder thus allows us to learn smooth latent state representations of the
input data.
To train a VAE, we use two loss terms: the reconstruction loss and the KL divergence.
While reconstruction loss enables the distribution to correctly describe the input, by focusing
only on minimizing the reconstruction loss, the network learns very narrow distributions—akin
to discrete latent attributes.
The KL divergence loss prevents the network from learning narrow distributions and tries to
bring the distribution closer to a unit normal distribution.
The summarized loss function can be expressed as:

L = L_reconstruction + β · KL( N(μ, σ) ‖ N(0, 1) )

where N(0, 1) denotes the unit normal distribution and β denotes a weighting factor.
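A minimal PyTorch sketch of this combined loss, using the standard closed-form KL divergence between N(μ, σ) and N(0, 1) (the names are illustrative):

```python
import torch

def vae_loss(x, reconstruction, mu, logvar, beta=1.0):
    """Reconstruction loss plus the KL divergence between the learned
    latent distribution N(mu, sigma) and the unit normal N(0, 1)."""
    recon = torch.nn.functional.mse_loss(reconstruction, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```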
The primary use of variational autoencoders can be seen in generative modeling.
Sampling from the trained latent distribution and feeding the result to the decoder leads to new data being generated by the autoencoder. For example, a variational autoencoder trained on MNIST digits can generate new digit-like samples in this way.
Data Augmentation
Data augmentation is a technique for artificially increasing the size of the training set by creating modified copies of existing data. It includes making minor changes to the dataset or using deep learning to generate new data points.
Augmented vs. Synthetic data
Augmented data is derived from the original data with some minor changes. In the case of image augmentation, we make geometric and color-space transformations (flipping, resizing, cropping, brightness, contrast) to increase the size and diversity of the training set.
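As an illustration, such transformations can be composed with torchvision (the specific values are arbitrary choices, not from the notes):

```python
from torchvision import transforms

# Typical geometric and color-space augmentations for image data
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # flipping
    transforms.RandomResizedCrop(224),                      # resizing + cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast
    transforms.ToTensor(),
])
```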
Synthetic data is generated artificially without using the original dataset. It often uses DNNs
(Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate synthetic
data.
Use of Data Augmentation
1. To prevent models from overfitting.