
ADAM - Adaptive moment estimation

What is Deep Learning?

Deep Learning is a subset of machine learning focused on neural networks with many layers (deep
neural networks). It mimics the way the human brain processes data by learning features and
patterns directly from raw input.

Deep learning is widely used in tasks like image recognition, speech processing, and natural language
understanding.

Key Concepts:

• Neural Networks: The building blocks of deep learning. They consist of layers of neurons that
process input data and learn features automatically.

• Learning Process: The network adjusts its parameters (weights and biases) by minimizing an
error through backpropagation and optimization techniques (e.g., Adam optimizer).

Simple Example: Classifying Cats vs. Dogs

1. Input: Pictures of cats and dogs.

2. Process:

o The network takes raw image pixels as input.

o It learns features like edges, shapes, and patterns in initial layers.

o Higher layers combine these features to identify complex patterns like "cat ears" or
"dog snout."

3. Output: The network predicts if the image is of a cat or a dog.
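
As a concrete illustration, here is a minimal sketch (not from the original notes) of such a classifier trained with the Adam optimizer, assuming PyTorch is available and that images arrive as 3x64x64 tensors with labels 0 (cat) or 1 (dog):

import torch
import torch.nn as nn

# Early layers learn edges/shapes; deeper layers combine them into patterns.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2),   # two output scores: cat vs. dog
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# One training step on a dummy batch (random tensors stand in for real images).
images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 2, (8,))

optimizer.zero_grad()
loss = loss_fn(model(images), labels)
loss.backward()      # backpropagation computes gradients of the loss
optimizer.step()     # Adam updates the weights and biases

The random batch only stands in for real images; in practice a data loader over labelled cat/dog images would feed the same training loop.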

Weights and Biases in the Learning Process

In the context of neural networks, weights and biases are the key parameters that a network learns
during training to map inputs to outputs.

1. Weights:

• Definition: Weights determine the importance of a particular feature in the input data. They
are the values assigned to the connections between neurons.

• Role: Each input feature (e.g., pixel value in an image) is multiplied by its corresponding
weight. The weight adjusts how much influence that input feature has on the output.

• Training: During the learning process, weights are updated to minimize the difference
between the predicted output and the actual output (using optimization algorithms like
gradient descent).
2. Biases:

• Definition: A bias is an additional parameter added to the weighted sum of inputs to shift
the output activation function. It allows the model to fit the data more flexibly.

• Role: Bias ensures that the neuron can produce a non-zero output even when all inputs are
zero. This improves the learning capability of the network.

• Training: Like weights, biases are also updated during the training process.

3. Why Are Weights and Biases Important?

• Weights control how features are combined to predict an outcome.

• Biases add flexibility to the model, ensuring it can represent complex data patterns.

4. Learning Process:

1. Forward Pass:

o Inputs are multiplied by weights and added to biases.

o This result is passed through an activation function to produce the output.

2. Backward Pass:

o The network calculates the error (loss).

o Weights and biases are updated using optimization algorithms to reduce the loss.
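
As a small illustration of these two passes, here is a minimal NumPy sketch for a single neuron; the input values, target, and learning rate are made up for the example:

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.1, 0.4, -0.2])   # weights
b = 0.05                         # bias
y_true = 1.0                     # target output
lr = 0.1                         # learning rate

# Forward pass: weighted sum plus bias, passed through a sigmoid activation.
z = np.dot(w, x) + b
y_pred = 1.0 / (1.0 + np.exp(-z))

# Backward pass: squared-error loss and its gradients (chain rule through the sigmoid).
loss = (y_pred - y_true) ** 2
dz = 2 * (y_pred - y_true) * y_pred * (1 - y_pred)
dw = dz * x
db = dz

# Update: move the weights and bias against the gradient to reduce the loss.
w = w - lr * dw
b = b - lr * db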
Simple Analogy:

Think of a neural network as trying to fit a line through data points:

• Weights determine the slope of the line.

• Bias determines the intercept.

In higher-dimensional problems, weights and biases perform similar roles, scaling and shifting the
data to achieve the best fit for the target output.

What is an optimization technique and how is it related to deep learning?

Optimization Techniques:

Optimization techniques are mathematical methods used to minimize or maximize a function. In
deep learning, they help minimize the loss function, which measures the difference between the
predicted output and the actual target.

Common Optimization Techniques in Deep Learning:

1. Gradient Descent:

o Adjusts model parameters (weights and biases) by calculating gradients of the loss
function.

o Variants: Batch Gradient Descent, Stochastic Gradient Descent (SGD), Mini-Batch
Gradient Descent.

2. Advanced Optimizers:

o Momentum: Speeds up convergence by considering the past gradient direction.

o RMSProp: Adapts learning rates based on recent gradient magnitudes.

o Adam (Adaptive Moment Estimation): Combines momentum and RMSProp for
efficient optimization.

3. Regularization Techniques:

o Add penalties to the loss function to avoid overfitting (e.g., L1, L2 regularization).
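
To make the gradient descent update from item 1 above concrete, here is a small sketch on a toy loss L(w) = (w - 3)^2; the learning rate and starting point are illustrative:

# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w = 0.0     # initial parameter
lr = 0.1    # learning rate

for step in range(100):
    w = w - lr * grad(w)   # basic gradient descent update

print(w)    # ends up very close to the minimizer w = 3

Stochastic and mini-batch variants follow the same rule but estimate the gradient from one example or a small batch instead of the full dataset.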

Relation to Deep Learning:


• Purpose: Optimization techniques adjust the neural network’s weights and biases to
minimize the loss function.

• Impact: Efficient optimization ensures faster convergence and better accuracy.

• Key Role: Without optimization, training deep neural networks would be computationally
infeasible due to the large number of parameters.

What is the Adam optimizer? Why do we use it? Applications? Advantages and disadvantages? Conclusion?

Adam Optimization: An Overview

Adam (short for Adaptive Moment Estimation) is a popular optimization algorithm in deep learning.
It combines the strengths of two other optimization methods:

• Momentum: Helps smooth out updates by considering the direction of past gradients.

• RMSProp: Adapts the learning rate for each parameter based on the magnitude of recent
gradients.

Adam is well-suited for handling large datasets and models with a high number of parameters,
making it a preferred choice for many machine learning and deep learning tasks.

How Adam Works:

Adam maintains two moving averages for each parameter during training:

1. First moment (mean): The exponentially decaying average of past gradients (Vdw).

2. Second moment (uncentered variance): The exponentially decaying average of the squared
gradients (Sdw).
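
Putting the two moments together, here is a minimal NumPy sketch of one Adam step with bias correction; the hyperparameter values shown are the commonly used defaults and the toy loss is only for illustration:

import numpy as np

def adam_step(w, grad, v, s, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment (Vdw): exponentially decaying average of past gradients.
    v = beta1 * v + (1 - beta1) * grad
    # Second moment (Sdw): exponentially decaying average of squared gradients.
    s = beta2 * s + (1 - beta2) * grad ** 2
    # Bias correction compensates for the zero-initialised averages.
    v_hat = v / (1 - beta1 ** t)
    s_hat = s / (1 - beta2 ** t)
    # Adaptive update: each parameter gets its own effective step size.
    w = w - lr * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s

# Example: minimise the toy loss L(w) = ||w - 3||^2.
w = np.zeros(2)
v = np.zeros_like(w)
s = np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * (w - 3.0)
    w, v, s = adam_step(w, grad, v, s, t, lr=0.01)

print(w)   # ends up close to [3, 3]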

Advantages of Adam:

1. Adaptive Learning Rates: Adjusts learning rates for each parameter independently, making it
effective for problems with sparse gradients.

Sparse Gradients

Sparse gradients refer to a situation during training where most of the gradient values for a
model's parameters are zero or near-zero. This typically happens when the input data has sparse
features, meaning many features are zero or inactive. For example:

o Text data represented as one-hot encoded vectors.

o High-dimensional data with most entries as zeros.

(See the sketch after this list for a small illustration.)

2. Combines Momentum and RMSProp: Benefits from the advantages of both methods,
resulting in faster convergence.

3. Robust: Performs well even with noisy datasets or gradients.

4. Efficient: Requires minimal memory and computational resources compared to second-order
methods.

5. Wide Applicability: Works well for large datasets and models with high-dimensional
parameter spaces.
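
The following small NumPy sketch (illustrative, not from the original notes) shows why one-hot inputs produce sparse gradients in a linear layer y = W x:

import numpy as np

vocab_size, out_dim = 10, 4
W = np.random.randn(out_dim, vocab_size)

x = np.zeros(vocab_size)
x[3] = 1.0                      # one-hot input: only feature 3 is active

y = W @ x
grad_y = np.ones(out_dim)       # stand-in for the gradient flowing back from the loss

# The gradient of the loss w.r.t. W is an outer product, so every column of
# grad_W except column 3 is exactly zero: the gradient is sparse.
grad_W = np.outer(grad_y, x)
print(np.count_nonzero(grad_W, axis=0))   # [0 0 0 4 0 0 0 0 0 0]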

Disadvantages of Adam:

1. Tuning Sensitivity: Sensitive to the choice of hyperparameters (α,β1,β2).

2. Overfitting: Can lead to overfitting in some cases due to rapid convergence.

3. Generalization: May not generalize as well as simpler methods like SGD in certain problems.

4. Plateau in Convergence: May stop improving after a certain point, especially in highly
complex loss landscapes.

Applications of Adam:

1. Deep Learning: Training neural networks for tasks like image recognition, natural language
processing (NLP), and speech recognition.

2. Reinforcement Learning: Optimization in policy-gradient methods.

3. Generative Models: Training GANs (Generative Adversarial Networks) and variational
autoencoders.

4. Sparse Data: Effective in handling problems with sparse datasets or features, such as text
data in NLP.

5. Time-Series Analysis: Applied in forecasting models, such as recurrent neural networks
(RNNs) and transformers.

Summary:

Adam is a highly versatile optimization algorithm that excels in a wide range of deep learning
applications due to its efficiency and ability to adapt learning rates. However, it requires careful
hyperparameter tuning and may not always outperform simpler methods for problems with well-
structured data.

First, understand the Root Mean Square Propagation (RMSProp) and Momentum (mini-batch
gradient descent) optimization algorithms.

ADAM: One of the most widely used techniques for optimizing machine learning models,
especially in deep learning and deep neural networks.

ANNs, CNNs, and RNNs are commonly optimized using ADAM.

Batch gradient descent: moves directly toward the minimum, but it is slow.

Momentum (mini-batch gradient descent) optimizer: uses momentum and velocity to move toward
the minimum; it circles the minimum like a satellite and reaches it after a few oscillations around
the point.

NAG (Nesterov Accelerated Gradient): the radius of the oscillations is reduced.

Adagrad: slightly misses the minimum because of learning rate decay (the learning rate keeps
shrinking as training progresses).

Root Mean Square Propagation (RMSProp): reaches the minimum properly.
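
The momentum behaviour described above can be written as a short update rule; a minimal sketch on a toy loss, with illustrative values for the learning rate and beta:

# Toy loss L(w) = (w - 3)^2 with gradient dL/dw = 2 * (w - 3).
def grad(w):
    return 2.0 * (w - 3.0)

w, velocity = 0.0, 0.0
lr, beta = 0.1, 0.9

for _ in range(200):
    velocity = beta * velocity + (1 - beta) * grad(w)   # EWMA of past gradients
    w = w - lr * velocity                               # momentum step

print(w)   # oscillates around the minimum and then settles near w = 3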

Two important concepts:

1. Momentum (Momentum & NAG techniques)


2. Learning rate decay (Adagrad & RMS Prop)

ADAM merges both the Momentum and Learning rate decay concepts.

ADAM is formed by combining the Root Mean Square Propagation (RMSProp) and Momentum
(mini-batch gradient descent) optimization algorithms.

ADAM advantages: efficient, requires little memory, works very well in practice.

Weight updates in plain mini-batch gradient descent take place in a zig-zag pattern, and a lot of
training time is wasted moving in the vertical direction instead of moving straight toward the
local minimum.

To reduce the time taken to train the model and to have a straighter path we use optimization.

These optimizers reduce the time taken to train the model and thus speed up the model's learning.

Graph of cost vs. weight parameter: this is how the cost reaches the local minimum.

If there are 2 weight parameters:

The red line is mini-batch gradient descent; the green line is batch gradient descent.

The right-side figure is the contour of the 3D figure on the left, i.e. a 2D top view of the left figure.
Exponentially weighted moving average: the average is calculated as we move forward in time and
encounter new points, with higher weight given to newer points and lower weight to older points.

The average at time stamp t is given by the equation below:

V_t = Beta * V_{t-1} + (1 - Beta) * Theta_t

Beta is a hyperparameter, 0 < Beta < 1; Beta = 0.9 works best in practice.

V_{t-1} is the previous average.

Theta_t is the current data point.
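
A short sketch of this moving average on an illustrative data series:

beta = 0.9
data = [10.0, 12.0, 9.0, 14.0, 11.0, 13.0]   # theta values arriving over time

v = 0.0            # V_0 (the previous average starts at zero)
averages = []
for theta in data:
    v = beta * v + (1 - beta) * theta        # V_t = beta * V_{t-1} + (1 - beta) * theta_t
    averages.append(v)

print(averages)    # newer points carry more weight than older ones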
Momentum in mini-batch gradient descent:

Let's assume W and B are the y and x axes. Since the movement along W (the vertical direction) is
large, dW^2 is also large. When dW is divided by the square root of S_dW, the overall value (the
EWMA-scaled update) shrinks and the vertical movement is reduced. If the root term becomes too
small, the fraction can become too large and the update may overshoot; to avoid this, epsilon is
introduced.

RMSProp:

V_dW and S_dW are EWMAs: S_dW is the EWMA of the squared gradient dW^2, and V_dW is the
EWMA of dW without squaring.

W and B are the parameters: W is the weight and B is the bias.
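
Putting the pieces together, a minimal sketch of the RMSProp update for the two parameters W and B; the learning rate, beta, and epsilon values are illustrative, and dW/dB stand in for gradients computed on a mini-batch:

import numpy as np

def rmsprop_step(w, b, dW, dB, s_dW, s_dB, lr=0.01, beta=0.9, eps=1e-8):
    # S_dW and S_dB: EWMAs of the squared gradients for W and B.
    s_dW = beta * s_dW + (1 - beta) * dW ** 2
    s_dB = beta * s_dB + (1 - beta) * dB ** 2
    # Dividing by sqrt(S) shrinks the larger (e.g. vertical) updates; eps keeps
    # the fraction from blowing up when sqrt(S) is very small.
    w = w - lr * dW / (np.sqrt(s_dW) + eps)
    b = b - lr * dB / (np.sqrt(s_dB) + eps)
    return w, b, s_dW, s_dB

# Example usage with scalar parameters and toy gradients:
w, b, s_dW, s_dB = 0.0, 0.0, 0.0, 0.0
w, b, s_dW, s_dB = rmsprop_step(w, b, dW=4.0, dB=0.5, s_dW=s_dW, s_dB=s_dB)
print(w, b)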

You might also like