DL Unit - 2
Gradient Descent
Gradient descent is an optimization algorithm used in machine learning to minimize the cost
function by iteratively adjusting parameters in the direction of the negative gradient, aiming to
find the optimal set of parameters.
The cost function represents the discrepancy between the predicted output of the model and the
actual output. The goal of gradient descent is to find the set of parameters that minimizes this
discrepancy and improves the model’s performance.
What is a Cost Function
It is a function that measures the performance of a model for any given data. Cost Function
quantifies the error between predicted values and expected values and presents it in the form of a
single real number.
After making a hypothesis with initial parameters, we calculate the cost function. Then, with the goal of reducing the cost function, we modify the parameters using the gradient descent algorithm over the given data. The mathematical representation of the update is:

θ = θ − α · ∇J(θ)

where θ is the parameter vector, α is the learning rate, and ∇J(θ) is the gradient of the cost function with respect to θ.
The algorithm operates by calculating the gradient of the cost function, which indicates the
direction and magnitude of steepest ascent. However, since the objective is to minimize the cost
function, gradient descent moves in the opposite direction of the gradient, known as the negative
gradient direction.
By iteratively updating the model’s parameters in the negative gradient direction, gradient
descent gradually converges towards the optimal set of parameters that yields the lowest cost.
The learning rate, a hyperparameter, determines the step size taken in each iteration, influencing
the speed and stability of convergence.
Gradient descent can be applied to various machine learning algorithms, including linear
regression, logistic regression, neural networks, and support vector machines. It provides a
general framework for optimizing models by iteratively refining their parameters based on the
cost function.
Example of Gradient Descent
Let’s say you are playing a game where the players are at the top of a mountain and are asked to reach the lake at its lowest point. Additionally, they are blindfolded. So, what approach do you think would make you reach the lake?
Take a moment to think about this before you read on.
The best way is to observe the ground and find where the land descends. From that position, take
a step in the descending direction and iterate this process until we reach the lowest point.
The goal of the gradient descent algorithm is to minimize the given function (say cost function).
To achieve this goal, it performs two steps iteratively:
● Compute the gradient (slope), the first order derivative of the function at that point
● Make a step (move) in the direction opposite to the gradient: move from the current point by alpha times the gradient at that point, against the direction in which the slope increases
Alpha is called the learning rate – a tuning parameter in the optimization process. It decides the length of the steps.
Working of Gradient Descent
● Gradient descent is an optimization algorithm used to minimize the cost function of a
model.
● The cost function measures how well the model fits the training data and is defined based
on the difference between the predicted and actual values.
● The gradient of the cost function is the derivative with respect to the model’s parameters
and points in the direction of the steepest ascent.
● The algorithm starts with an initial set of parameters and updates them in small steps to
minimize the cost function.
● In each iteration of the algorithm, the gradient of the cost function with respect to each
parameter is computed.
● The gradient tells us the direction of the steepest ascent, and by moving in the opposite
direction, we can find the direction of the steepest descent.
● The size of the step is controlled by the learning rate, which determines how quickly the
algorithm moves towards the minimum.
● The process is repeated until the cost function converges to a minimum, indicating that
the model has reached the optimal set of parameters.
● There are different variations of gradient descent, including batch gradient descent,
stochastic gradient descent, and mini-batch gradient descent, each with its own
advantages and limitations.
● Efficient implementation of gradient descent is essential for achieving good performance
in machine learning tasks. The choice of the learning rate and the number of iterations
can significantly impact the performance of the algorithm.
Gradient Descent Algorithm
The Gradient Descent Algorithm iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate), and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimise the function (to maximise it, we would add). This process can be written as:

p(n+1) = p(n) − η · ∇f(p(n))

There’s an important parameter η which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance.
Gradient Descent method’s steps are:
1. choose a starting point (initialisation)
2. calculate gradient at this point
3. make a scaled step in the opposite direction to the gradient (objective: minimise)
4. repeat points 2 and 3 until one of the criteria is met:
● maximum number of iterations reached
● step size is smaller than the tolerance (due to scaling or a small gradient).
A typical implementation of this procedure takes 5 parameters (a sketch follows the list):
1. starting point - in our case we define it manually, but in practice it is often a random initialisation
2. gradient function - has to be specified beforehand
3. learning rate - scaling factor for step sizes
4. maximum number of iterations
5. tolerance to conditionally stop the algorithm (in this case the default value is 0.01)
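A minimal Python sketch of such a function, matching the five parameters above (the function name and the example are illustrative, not from the notes):

```python
import numpy as np

def gradient_descent(start, gradient, learn_rate, max_iter, tol=0.01):
    """Minimise a function by repeatedly stepping against its gradient.

    start      -- starting point (scalar or numpy array)
    gradient   -- callable returning the gradient at a point
    learn_rate -- scaling factor (learning rate) for step sizes
    max_iter   -- maximum number of iterations
    tol        -- stop when the step becomes smaller than this
    """
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        step = learn_rate * gradient(x)     # scale the gradient
        if np.all(np.abs(step) < tol):      # step-size stopping criterion
            break
        x = x - step                        # move against the gradient
    return x

# Example: minimise f(x) = x^2, whose gradient is 2x.
minimum = gradient_descent(start=10.0, gradient=lambda x: 2 * x,
                           learn_rate=0.1, max_iter=100)
print(minimum)  # close to 0
```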
Variants of Gradient Descent

1. Batch Gradient Descent:
In batch gradient descent, the gradient of the cost function is computed over the entire training dataset before each parameter update:

θ = θ − α · ∇J(θ)

This gives stable updates but is computationally expensive on large datasets.

2. Stochastic Gradient Descent (SGD):
In stochastic gradient descent, the parameters are updated using the gradient computed on a single randomly selected training example:

θ = θ − α · ∇Ji(θ)

where:
θ is the parameter vector,
α is the learning rate, and
∇Ji(θ) is the gradient of the cost function J with respect to θ, computed on a single randomly selected training example i.
3. Mini-batch Gradient Descent:
Mini-batch gradient descent is a compromise between batch gradient descent and stochastic
gradient descent. In mini-batch gradient descent, the gradients are computed on a small random
subset of the training dataset, typically between 10 and 1000 examples, called a mini-batch. This
reduces the computational cost of the algorithm compared to batch gradient descent, while also
reducing the variance of the updates compared to SGD. Mini-batch gradient descent is widely
used in deep learning because it strikes a good balance between convergence speed and stability.
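As an illustration, here is a minimal numpy sketch of mini-batch gradient descent for linear regression with MSE loss (the names and defaults are illustrative). Setting batch_size to the dataset size recovers batch gradient descent, and batch_size=1 recovers SGD:

```python
import numpy as np

def minibatch_gd(X, y, learn_rate=0.01, batch_size=32, epochs=100):
    """Mini-batch gradient descent for linear regression (MSE loss)."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        idx = np.random.permutation(n)                # reshuffle each epoch
        for i in range(0, n, batch_size):
            batch = idx[i:i + batch_size]
            Xb, yb = X[batch], y[batch]
            # Gradient of (1/m) * sum((Xb @ theta - yb)^2)
            grad = 2 / len(batch) * Xb.T @ (Xb @ theta - yb)
            theta -= learn_rate * grad
    return theta
```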
6. Adagrad:
Adagrad is a variant of gradient descent that adapts the learning rate for each parameter based on its historical gradient values. Parameters with historically large gradients have their effective learning rates reduced aggressively, while parameters with small gradients keep comparatively larger learning rates. This helps to normalize the updates and can be useful when the cost function has a lot of curvature or gradients at very different scales.
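A minimal numpy sketch of one Adagrad step, following the standard accumulate-squared-gradients rule (the names are illustrative, not from the notes):

```python
import numpy as np

def adagrad_update(theta, grad, G, learn_rate=0.01, eps=1e-8):
    """One Adagrad step: G accumulates squared gradients elementwise,
    so frequently/strongly updated parameters get smaller effective steps."""
    G += grad ** 2                                   # running sum of squared gradients
    theta -= learn_rate * grad / (np.sqrt(G) + eps)  # per-parameter scaled step
    return theta, G
```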
7. RMSProp:
RMSProp is a variant of gradient descent that also adapts the learning rate for each parameter,
but instead of using the historical gradient values, it uses a moving average of the squared
gradient values. This helps to reduce the learning rate for parameters that have large squared
gradient values, which can cause the algorithm to oscillate or diverge.
In simple terms, RMSProp uses an adaptive learning rate instead of treating the learning rate as a fixed hyperparameter, so the effective learning rate varies over time. RMSProp uses the same exponentially-weighted-average idea as gradient descent with momentum; the difference lies in how the parameters are updated.
Parameter Update
The recursive formula of an exponentially weighted average (EWA) is given by:

Vt = β · Vt-1 + (1 − β) · θt

where:
Vt: moving-average value at time t
θt: the current value being averaged
β: the moving-average parameter

In RMSProp, we find the EWA of the squared gradients and update our parameters using those EWAs. On each iteration t, we calculate dW and db on the current minibatch, then calculate vdW and vdb using the following formulae:

vdW = β · vdW + (1 − β) · (dW)²
vdb = β · vdb + (1 − β) · (db)²

We update our parameters after calculating the exponentially weighted averages. Substituting the scaled gradients into the plain updates W = W − (learning rate) · dW and b = b − (learning rate) · db, we get:

W = W − (learning rate) · dW / (√vdW + ε)
b = b − (learning rate) · db / (√vdb + ε)
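A minimal numpy sketch of the RMSProp update just described (the names are illustrative):

```python
import numpy as np

def rmsprop_update(W, dW, v, learn_rate=0.001, beta=0.9, eps=1e-8):
    """One RMSProp step using an exponentially weighted average of
    squared gradients (the same update applies to b and db)."""
    v = beta * v + (1 - beta) * dW ** 2           # EWA of squared gradients
    W -= learn_rate * dW / (np.sqrt(v) + eps)     # scaled parameter update
    return W, v
```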
8. Adam:
Adam derives its name from adaptive moment estimation. This optimization algorithm is an extension of stochastic gradient descent that updates network weights during training. It is a hybrid of the “gradient descent with momentum” and RMSProp algorithms.
It is an adaptive learning rate method that calculates individual learning rates for various
parameters.
Adam can be used instead of the classical stochastic gradient descent procedure to update
network weights iteratively based on training data.
The Adam optimizer employs a hybrid of two gradient descent methods:
Momentum: This algorithm is used to speed up the gradient descent algorithm by taking the ”exponentially weighted average” of the gradients into account. Using averages makes the algorithm converge towards the minima more quickly. In the notation used below:
● Wt = weights at time t
● Wt+1 = weights at time t+1
● αt = learning rate at time t
● ∂L/∂Wt = derivative of the loss function with respect to the weights at time t
● Vt = sum of the squares of past gradients [i.e. accumulated (∂L/∂Wt-1)²] (initially, Vt = 0)
● β = moving-average parameter (a constant, typically 0.9)
● ε = a small positive constant (10⁻⁸)
Adam Optimizer takes the strengths or positive characteristics of the previous two methods and
builds on them to provide a more optimized gradient descent.
In this case, we control the gradient descent rate so that there is minimal oscillation when it reaches the global minimum, while taking steps large enough to get past the local-minima hurdles along the way. Combining the features of the above methods lets Adam reach the global minimum efficiently.
Mathematical Aspect of Adam Optimizer:
Combining the formulas used in the previous two methods (momentum and RMSProp), we get the following update rules:

mt = β1 · mt-1 + (1 − β1) · (∂L/∂Wt)          (momentum term)
vt = β2 · vt-1 + (1 − β2) · (∂L/∂Wt)²         (RMSProp term)
m̂t = mt / (1 − β1^t),  v̂t = vt / (1 − β2^t)   (bias correction)
Wt+1 = Wt − αt · m̂t / (√v̂t + ε)
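A minimal numpy sketch of one Adam step combining the two moment estimates above (the names and defaults are illustrative):

```python
import numpy as np

def adam_update(W, dW, m, v, t, learn_rate=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (t starts at 1): a momentum-style EWA of gradients (m)
    combined with an RMSProp-style EWA of squared gradients (v)."""
    m = beta1 * m + (1 - beta1) * dW          # first moment (momentum)
    v = beta2 * v + (1 - beta2) * dW ** 2     # second moment (RMSProp)
    m_hat = m / (1 - beta1 ** t)              # bias correction
    v_hat = v / (1 - beta2 ** t)
    W -= learn_rate * m_hat / (np.sqrt(v_hat) + eps)
    return W, m, v
```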
Autoencoders
Autoencoders are a specific type of feedforward neural network where the output is trained to match the input. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.
An autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses
the input and produces the code, the decoder then reconstructs the input only using this code.
To build an autoencoder we need 3 things: an encoding method, decoding method, and a loss
function to compare the output with the target.
Autoencoders are mainly a dimensionality reduction (or compression) algorithm with a couple of
important properties:
● Data-specific: Autoencoders are only able to meaningfully compress data similar to
what they have been trained on. Since they learn features specific for the given
training data, they are different from a standard data compression algorithm like
gzip. So we can’t expect an autoencoder trained on handwritten digits to compress
landscape photos.
● Lossy: The output of the autoencoder will not be exactly the same as the input, it
will be a close but degraded representation. If you want lossless compression they
are not the way to go.
● Unsupervised: To train an autoencoder we don’t need to do anything fancy, just
throw the raw input data at it. Autoencoders are considered an unsupervised learning
technique since they don’t need explicit labels to train on. But to be more precise
they are self-supervised because they generate their own labels from the training
data.
Architecture of autoencoders
An autoencoder consists of three components:
● Encoder: An encoder is a feedforward, fully connected neural network that
compresses the input into a latent-space representation of reduced
dimension. The compressed image is a distorted version of the original image.
● Code: This part of the network contains the reduced representation of the input
that is fed into the decoder.
● Decoder: Decoder is also a feedforward network like the encoder and has a
similar structure to the encoder. This network is responsible for reconstructing the
input back to the original dimensions from the code.
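As an illustration, here is a minimal fully connected autoencoder in PyTorch, assuming flattened 784-dimensional inputs such as MNIST (the layer sizes are arbitrary choices, not from the notes):

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, code_size=32):
        super().__init__()
        # Encoder: compresses the input down to the code
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_size),
        )
        # Decoder: reconstructs the input from the code
        self.decoder = nn.Sequential(
            nn.Linear(code_size, 128), nn.ReLU(),
            nn.Linear(128, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)       # latent-space representation
        return self.decoder(code)    # reconstruction of the input
```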
Training of autoencoders
You need to set 4 hyperparameters before training an autoencoder:
1. Code size: The code size or the size of the bottleneck is the most important
hyperparameter used to tune the autoencoder. The bottleneck size decides how much the
data has to be compressed. This can also act as a regularisation term.
2. Number of layers: Like all neural networks, an important hyperparameter to tune
autoencoders is the depth of the encoder and the decoder. While a higher depth increases
model complexity, a lower depth is faster to process.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use
per layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.
4. Reconstruction Loss: The loss function we use to train the autoencoder is highly
dependent on the type of input and output we want the autoencoder to adapt to. If we are
working with image data, the most popular loss functions for reconstruction are MSE
Loss and L1 Loss. In case the inputs and outputs are within the range [0,1], as in MNIST,
we can also make use of Binary Cross Entropy as the reconstruction loss.
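A sketch of a single training step for the Autoencoder class sketched earlier, using MSE as the reconstruction loss (BCELoss would be the drop-in alternative for [0,1] inputs; the names are illustrative):

```python
import torch

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()   # or torch.nn.BCELoss() for [0,1] inputs

def train_step(batch):
    # batch: float tensor of shape (batch_size, 784)
    optimizer.zero_grad()
    reconstruction = model(batch)
    loss = criterion(reconstruction, batch)  # target is the input itself
    loss.backward()
    optimizer.step()
    return loss.item()
```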
There are several types of regularization that can be used with autoencoders, including:
● L1 Regularization: This method adds a penalty to the loss function for the sum of the
absolute values of the model weights. This encourages the model to learn sparse
representations, where many of the weights are set to zero.
● L2 Regularization: This method adds a penalty to the loss function for the sum of the
squares of the model weights. This encourages the model to learn small, non-zero
weights.
● Dropout: This method randomly sets a fraction of the model's activations to zero during
each training iteration. This helps prevent the model from relying too heavily on any one
set of activations.
A regression model that uses the L1 regularization technique is called Lasso Regression, and a model that uses L2 is called Ridge Regression.
Lasso Regression (Least Absolute Shrinkage and Selection Operator) adds the “absolute value of magnitude” of the coefficients as a penalty term to the loss function:

Loss = Σi (yi − ŷi)² + λ · Σj |βj|

Again, if lambda is zero we get back OLS (ordinary least squares), whereas a very large value will shrink the coefficients to zero and the model will under-fit.

Ridge regression adds the “squared magnitude” of the coefficients as a penalty term to the loss function; the second term below is the L2 regularization element:

Loss = Σi (yi − ŷi)² + λ · Σj βj²
Here too, if lambda is zero we get back OLS. However, if lambda is very large it adds too much weight to the penalty and leads to under-fitting. Having said that, how lambda is chosen is important. This technique works very well to avoid the over-fitting issue.
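For reference, both penalties are available directly in scikit-learn, where the alpha parameter plays the role of lambda (the data here is synthetic, for illustration only):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic regression data with some truly-zero coefficients
X = np.random.randn(100, 5)
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + 0.1 * np.random.randn(100)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1: drives some coefficients to exactly zero
ridge = Ridge(alpha=0.1).fit(X, y)  # L2: shrinks coefficients toward zero
print(lasso.coef_)  # sparse coefficients
print(ridge.coef_)  # small but non-zero coefficients
```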
Types of autoencoders
There are five popular types of autoencoders:
1. Undercomplete autoencoders
2. Sparse autoencoders
3. Contractive autoencoders
4. Denoising autoencoders
5. Variational Autoencoders (for generative modeling)
1. Undercomplete autoencoders
An Undercomplete autoencoder is one of the simplest types of autoencoders.
The way it works is very straightforward—
Undercomplete autoencoder takes in an image and tries to predict the same image as output, thus
reconstructing the image from the compressed bottleneck region.
Undercomplete autoencoders are truly unsupervised as they do not take any form of label, the
target being the same as the input.
The primary use of autoencoders like this is the generation of the latent space or bottleneck, which forms a compressed substitute of the input data and can be easily decompressed back with the help of the network when needed.
This form of compression in the data can be modeled as a form of dimensionality reduction. When we think of dimensionality reduction, we tend to think of methods like PCA (Principal Component Analysis) that find a lower-dimensional hyperplane to represent higher-dimensional data without losing too much information.
However—
PCA can only build linear relationships. As a result, it is at a disadvantage compared with methods like undercomplete autoencoders, which can learn non-linear relationships and therefore perform better at dimensionality reduction.
This form of nonlinear dimensionality reduction where the autoencoder learns a non-linear
manifold is also termed as manifold learning.
Effectively, if we remove all non-linear activations from an undercomplete autoencoder and use only linear layers, we reduce the undercomplete autoencoder to something that works on an equal footing with PCA.
The loss function used to train an undercomplete autoencoder is called the reconstruction loss, as it is a check of how well the image has been reconstructed from the input data. Although the reconstruction loss can be anything depending on the input and output, we will use an L1 loss (also called the norm loss) to depict the term, represented by:

L(x, x̂) = |x − x̂|

where x̂ represents the predicted output and x represents the ground truth.
As the loss function has no explicit regularization term, the only method to ensure that the model
is not memorizing the input data is by regulating the size of the bottleneck and the number of
hidden layers within this part of the network—the architecture.
2. Sparse autoencoders
Sparse autoencoders are similar to the undercomplete autoencoders in that they use the same
image as input and ground truth. However—
The means by which the encoding of information is regulated is significantly different.
While undercomplete autoencoders are regulated and fine-tuned by regulating the size of the
bottleneck, the sparse autoencoder is regulated by changing the number of nodes at each hidden
layer.
Since it is not possible to design a neural network that has a flexible number of nodes at its
hidden layers, sparse autoencoders work by penalizing the activation of some neurons in hidden
layers.
In other words, the loss function has a term that calculates the number of neurons that have been
activated and provides a penalty that is directly proportional to that.
This penalty, called the sparsity function, prevents the neural network from activating more
neurons and serves as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes,
sparsity regularizer works by creating a penalty on the number of nodes activated.
This form of regularization allows the network to have nodes in hidden layers dedicated to finding specific features in images during training, treating the regularization problem as separate from the latent-space problem.
We can thus set latent space dimensionality at the bottleneck without worrying about
regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into the
loss function.
L1 Loss: Here, we add the magnitude of the sparsity regularizer to the loss as we do for general regularizers:

L = L_reconstruction + λ · Σi |ai^(h)|

where h represents the hidden layer, i indexes the activations for an image in the minibatch, and a represents the activation.
KL-Divergence: In this case, we consider the activations over a collection of samples at once rather than summing them as in the L1 loss method. We constrain the average activation of each neuron over this collection.

Considering the ideal distribution to be a Bernoulli distribution with mean ρ, we include the KL divergence within the loss to reduce the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:

L = L_reconstruction + Σj KL(ρ ‖ ρ̂j)

where

ρ̂j = (1/m) · Σi aj(xi)

and j denotes the specific neuron in layer h, with the average taken over a collection of m samples, each denoted as x.
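A minimal PyTorch sketch of the L1 sparsity penalty added to the reconstruction loss (the function and names are illustrative, not from the notes):

```python
import torch

def sparse_loss(x, reconstruction, activations, lam=1e-4):
    """Reconstruction loss plus an L1 penalty on hidden activations,
    encouraging most neurons to stay inactive."""
    recon = torch.nn.functional.mse_loss(reconstruction, x)
    sparsity = activations.abs().sum()   # L1 sparsity regularizer
    return recon + lam * sparsity
```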
3. Contractive autoencoders
Similar to other autoencoders, contractive autoencoders perform the task of learning a
representation of the image while passing it through a bottleneck and reconstructing it in the
decoder.
The contractive autoencoder also has a regularization term to prevent the network from learning the identity function and simply mapping the input to the output.
Contractive autoencoders work on the basis that similar inputs should have similar encodings
and a similar latent space representation. It means that the latent space should not vary by a huge
amount for minor variations in the input.
To train a model that works under this constraint, we have to ensure that the derivatives of the hidden layer activations are small with respect to the input data. Mathematically:

‖∇x h(x)‖²F = Σij (∂hj(x) / ∂xi)²  should be small
While the reconstruction loss wants the model to tell differences between two inputs and observe variations in the data, the Frobenius norm of the derivatives says that the model should be able to ignore minor variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a network
where the hidden layers now capture only the most essential information. This information is
necessary to separate images and ignore information that is non-discriminatory in nature, and
therefore, not important.
The total loss function can be mathematically expressed as:

L = L_reconstruction + λ · ‖∇x h(x)‖²F

where h is the hidden layer for which the gradient is calculated with respect to the input x. The gradient is summed over all training samples, and the Frobenius norm of the Jacobian is taken.
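For a single sigmoid encoder layer h = sigmoid(Wx + b), the Frobenius norm of the Jacobian has a closed form, which is one common way contractive autoencoders are implemented; a minimal PyTorch sketch under that assumption (illustrative):

```python
import torch

def contractive_penalty(h, W):
    """||J||_F^2 for h = sigmoid(W x + b):
    dh_j/dx_i = h_j (1 - h_j) W_ji, so the squared Frobenius norm is
    sum_j [h_j (1 - h_j)]^2 * sum_i W_ji^2."""
    dh = (h * (1 - h)) ** 2        # shape: (batch, hidden)
    w_sq = (W ** 2).sum(dim=1)     # shape: (hidden,)
    return (dh * w_sq).sum()       # summed over the batch
```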
4. Denoising autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from an
image.
As opposed to autoencoders we’ve already covered, this is the first of its kind that does not have
the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been added via
digital alterations. The noisy image is fed to the encoder-decoder architecture, and the output is
compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input where the
noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs this by
mapping the input data into a lower-dimensional manifold (like in undercomplete autoencoders),
where filtering of noise becomes much easier.
Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction.
The loss function generally used in these types of networks is L2 or L1 loss.
5. Variational autoencoders
Standard autoencoders learn to represent the input just in a compressed form called the latent space or the bottleneck. The latent space formed after training such a model is therefore not necessarily continuous and, in effect, might not be easy to interpolate.

For example, a standard autoencoder would learn each latent attribute of the input as a single, fixed value.
While these attributes explain the image and can be used in reconstructing the image from the
compressed latent space, they do not allow the latent attributes to be expressed in a probabilistic
fashion.
Variational autoencoders deal with this specific topic and express their latent attributes as a
probability distribution, leading to the formation of a continuous latent space that can be easily
sampled and interpolated.
When fed the same input, a variational autoencoder would instead construct each latent attribute as a probability distribution, described by a mean and a variance.
The latent attributes are then sampled from the latent distribution formed and fed to the decoder,
reconstructing the input.
While this seems easy in theory, it becomes impossible to implement because backpropagation
cannot be defined for a random sampling process performed before feeding the data to the
decoder.
To get past this hurdle, we use the reparameterization trick: a cleverly defined way to bypass the sampling process inside the neural network.
In the reparameterization trick, we randomly sample a value ε from a unit Gaussian, scale it by the latent distribution’s standard deviation σ, and shift it by the mean μ of the same distribution:

z = μ + σ · ε

Now the sampling process is handled outside the backpropagation pipeline, and the sampled value ε acts just like another input to the model, fed in at the bottleneck.
Diagrammatically, ε enters the network as a separate input that is scaled by σ and shifted by μ before reaching the decoder.
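A minimal PyTorch sketch of the reparameterization trick as just described (assuming the encoder outputs a mean and a log-variance; the names are illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), keeping the
    random draw outside the backpropagated computation."""
    std = torch.exp(0.5 * logvar)   # sigma from log-variance
    eps = torch.randn_like(std)     # unit-Gaussian sample
    return mu + std * eps
```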
The variational autoencoder thus allows us to learn smooth latent state representations of the
input data.
To train a VAE, we use two loss terms: the reconstruction loss and the KL divergence.
While reconstruction loss enables the distribution to correctly describe the input, by focusing
only on minimizing the reconstruction loss, the network learns very narrow distributions—akin
to discrete latent attributes.
The KL divergence loss prevents the network from learning narrow distributions and tries to
bring the distribution closer to a unit normal distribution.
The summarized loss function can be expressed as:

L = L_reconstruction + β · KL( N(μ, σ) ‖ N(0, 1) )

where N(0, 1) denotes the unit normal distribution and β denotes a weighting factor.
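A minimal PyTorch sketch of this combined loss, using the standard closed-form KL divergence between N(μ, σ) and N(0, 1) (the names are illustrative):

```python
import torch

def vae_loss(x, reconstruction, mu, logvar, beta=1.0):
    """Reconstruction loss plus the KL divergence between the learned
    latent distribution N(mu, sigma) and the unit normal N(0, 1)."""
    recon = torch.nn.functional.mse_loss(reconstruction, x, reduction='sum')
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kld
```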
The primary use of variational autoencoders can be seen in generative modeling.
Sampling from the trained latent distribution and feeding the result to the decoder leads to new data being generated by the autoencoder. For example, a variational autoencoder trained on MNIST digits can generate new digit-like samples in this way.
Data Augmentation
Data augmentation is a technique for artificially increasing the size of the training set by creating modified copies of existing data. It includes making minor changes to the dataset or using deep learning to generate new data points.
Augmented vs. Synthetic data
Augmented data is derived from the original data with some minor changes. In the case of image augmentation, we make geometric and color-space transformations (flipping, resizing, cropping, brightness, contrast) to increase the size and diversity of the training set.
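As an illustration, such transformations can be composed with torchvision (the specific values are arbitrary choices, not from the notes):

```python
from torchvision import transforms

# Typical geometric and color-space augmentations for image data
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                      # flipping
    transforms.RandomResizedCrop(224),                      # resizing + cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # brightness/contrast
    transforms.ToTensor(),
])
```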
Synthetic data is generated artificially without using the original dataset. It often uses DNNs
(Deep Neural Networks) and GANs (Generative Adversarial Networks) to generate synthetic
data.
Use of Data Augmentation
1. To prevent models from overfitting.