Adam 1
Deep Learning is a subset of machine learning focused on neural networks with many layers (deep
neural networks). It mimics the way the human brain processes data by learning features and
patterns directly from raw input.
Deep learning is widely used in tasks like image recognition, speech processing, and natural language
understanding.
Key Concepts:
• Neural Networks: The building blocks of deep learning. They consist of layers of neurons that
process input data and learn features automatically.
• Learning Process: The network adjusts its parameters (weights and biases) by minimizing an
error through backpropagation and optimization techniques (e.g., Adam optimizer).
2. Process:
o Lower layers learn simple features from the raw input (e.g., edges and textures).
o Higher layers combine these features to identify complex patterns like "cat ears" or
"dog snout."
In the context of neural networks, weights and biases are the key parameters that a network learns
during training to map inputs to outputs.
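As a rough illustration, here is a minimal sketch of how a single neuron maps inputs to an output using weights and a bias; the feature and parameter values below are made up purely for illustration.

    import numpy as np

    # Hypothetical input features, weights, and bias (values chosen only for illustration).
    x = np.array([0.5, 0.8, 0.1])   # input features (e.g., pixel intensities)
    w = np.array([0.9, -0.4, 0.2])  # weights: how much each feature matters
    b = 0.3                         # bias: shifts the weighted sum

    # A neuron computes a weighted sum of its inputs, adds the bias,
    # and passes the result through an activation function (sigmoid here).
    z = np.dot(w, x) + b
    output = 1 / (1 + np.exp(-z))   # sigmoid activation
    print(output)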
1. Weights:
• Definition: Weights determine the importance of a particular feature in the input data. They
are the values assigned to the connections between neurons.
• Role: Each input feature (e.g., pixel value in an image) is multiplied by its corresponding
weight. The weight adjusts how much influence that input feature has on the output.
• Training: During the learning process, weights are updated to minimize the difference
between the predicted output and the actual output (using optimization algorithms like
gradient descent).
2. Biases:
• Definition: A bias is an additional parameter added to the weighted sum of inputs before the
activation function, shifting the activation's output. It allows the model to fit the data more flexibly.
• Role: Bias ensures that the neuron can produce a non-zero output even when all inputs are
zero. This improves the learning capability of the network.
• Training: Like weights, biases are also updated during the training process.
• Biases add flexibility to the model, ensuring it can represent complex data patterns.
4. Learning Process:
1. Forward Pass:
o Inputs are multiplied by the weights, the biases are added, and activations are applied
layer by layer to produce a prediction.
2. Backward Pass:
o The loss between the prediction and the target is computed and gradients are
propagated backward through the network.
o Weights and biases are updated using optimization algorithms to reduce the loss.
Simple Analogy:
Think of fitting a straight line y = wx + b: the weight w scales the input (the slope) and the bias b
shifts the line up or down (the intercept). In higher-dimensional problems, weights and biases perform
similar roles, scaling and shifting the data to achieve the best fit for the target output (a small sketch
follows below).
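A minimal sketch of this analogy: fitting y = wx + b to a few made-up points with plain gradient descent. The data, learning rate, and number of epochs are arbitrary choices for illustration, not values from the notes.

    # Fit y = w*x + b with gradient descent; data generated from y = 2x + 1.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]

    w, b = 0.0, 0.0   # parameters start at zero
    lr = 0.01         # learning rate
    n = len(xs)

    for epoch in range(2000):
        # Forward pass: predictions from the current weight and bias.
        preds = [w * x + b for x in xs]
        # Backward pass: gradients of the mean squared error w.r.t. w and b.
        dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        # Update: the weight scales the input, the bias shifts the line.
        w -= lr * dw
        b -= lr * db

    print(round(w, 2), round(b, 2))   # approaches w ≈ 2, b ≈ 1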
Optimization Techniques:
1. Gradient Descent:
o Adjusts model parameters (weights and biases) by computing gradients of the loss
function and stepping in the direction that reduces the loss.
2. Advanced Optimizers:
o Methods such as Momentum, RMSProp, and Adam adapt the update step (and learning
rate) per parameter for faster, more stable convergence.
3. Regularization Techniques:
o Add penalties to the loss function to avoid overfitting (e.g., L1, L2 regularization).
• Key Role: Without optimization, training deep neural networks would be computationally
infeasible due to the large number of parameters.
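As a small sketch of points 1 and 3 above, here is one gradient-descent step with an L2 penalty added to the loss. The parameter values, learning rate, and penalty strength are assumed only for illustration.

    import numpy as np

    # Hypothetical parameters and gradient of the unregularized loss.
    w = np.array([0.8, -1.5, 0.3])
    grad_loss = np.array([0.2, -0.1, 0.05])

    lr = 0.1    # learning rate
    lam = 0.01  # L2 regularization strength

    # L2 regularization adds lam * ||w||^2 to the loss,
    # which contributes 2 * lam * w to the gradient.
    grad = grad_loss + 2 * lam * w

    # Plain gradient-descent step on the regularized loss.
    w = w - lr * grad
    print(w)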
What is the Adam optimizer? Why do we use it? What are its applications, advantages, and disadvantages? Conclusion?
Adam (short for Adaptive Moment Estimation) is a popular optimization algorithm in deep learning.
It combines the strengths of two other optimization methods:
• Momentum: Helps smooth out updates by considering the direction of past gradients.
• RMSProp: Adapts the learning rate for each parameter based on the magnitude of recent
gradients.
Adam is well-suited for handling large datasets and models with a high number of parameters,
making it a preferred choice for many machine learning and deep learning tasks.
Adam maintains two moving averages for each parameter during training:
1. First moment (mean): The exponentially decaying average of past gradients (Vdw).
2. Second moment (uncentered variance): The exponentially decaying average of the squared
gradients (Sdw).
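Putting the two moments together, here is a minimal sketch of the Adam update for a single parameter. The toy loss f(w) = w^2 and the hyperparameter values are assumptions for illustration, not part of the original notes.

    import numpy as np

    # Minimize f(w) = w^2 with Adam; the gradient of f is 2w.
    grad = lambda w: 2 * w

    w = 5.0
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
    v_dw = 0.0   # first moment: EWMA of past gradients (momentum)
    s_dw = 0.0   # second moment: EWMA of past squared gradients (RMSProp)

    for t in range(1, 201):
        dw = grad(w)
        v_dw = beta1 * v_dw + (1 - beta1) * dw         # momentum-style average
        s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2    # RMSProp-style average
        v_hat = v_dw / (1 - beta1 ** t)                # bias correction
        s_hat = s_dw / (1 - beta2 ** t)
        w -= lr * v_hat / (np.sqrt(s_hat) + eps)       # adaptive update

    print(round(w, 4))   # close to the minimum at w = 0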
Advantages of Adam:
1. Adaptive Learning Rates: Adjusts learning rates for each parameter independently, making it
effective for problems with sparse gradients.
Sparse Gradients
Sparse gradients refer to a situation during training where most of the gradient values for a model's
parameters are zero or near-zero (a small sketch follows the advantages list below). This typically happens when:
2. Data Characteristics: The input data has sparse features, meaning many features are zero or inactive. For
example:
o Text data represented as one-hot encoded vectors.
o High-dimensional data with most entries as zeros.
3. Combines Momentum and RMSProp: Benefits from the advantages of both methods,
resulting in faster convergence.
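To make the sparse-gradient point above concrete (the sketch promised earlier), here the input is a one-hot vector, so the gradient of the loss with respect to the weights is non-zero only at the single active feature. The vocabulary size and target value are invented for illustration.

    import numpy as np

    # One-hot encoded input: only one feature is active (e.g., one word out of a vocabulary of 8).
    x = np.zeros(8)
    x[3] = 1.0

    w = np.random.randn(8) * 0.1   # weights of a single linear output
    y_true = 1.0

    # Forward pass: linear prediction and squared-error loss.
    y_pred = np.dot(w, x)
    error = y_pred - y_true

    # Backward pass: d(loss)/dw = 2 * error * x, so the gradient is zero
    # everywhere except at the active feature -> a sparse gradient.
    grad_w = 2 * error * x
    print(grad_w)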
Disadvantages of Adam:
3. Generalization: May not generalize as well as simpler methods like SGD in certain problems.
4. Plateau in Convergence: May stop improving after a certain point, especially in highly
complex loss landscapes.
Applications of Adam:
1. Deep Learning: Training neural networks for tasks like image recognition, natural language
processing (NLP), and speech recognition.
4. Sparse Data: Effective in handling problems with sparse datasets or features, such as text
data in NLP.
Summary:
Adam is a highly versatile optimization algorithm that excels in a wide range of deep learning
applications due to its efficiency and ability to adapt learning rates. However, it requires careful
hyperparameter tuning and may not always outperform simpler methods for problems with well-
structured data.
First, understand the Root Mean Square Propagation (RMSProp) and Momentum (mini-batch gradient
descent) optimization algorithms.
ADAM: one of the most widely used techniques for optimizing machine learning models.
Used in deep learning and deep neural networks.
Batch gradient descent: moves directly toward the minimum, but each step uses the entire dataset, so it is slow.
Momentum (mini-batch gradient descent) optimizer: uses momentum (a velocity term built from past
gradients) to move toward the minimum; like a satellite, it circles the minimum and settles on it after a few oscillations.
Adagrad: can slightly miss the minimum because its learning rate keeps decaying (the accumulated
squared gradients make the effective learning rate smaller and smaller).
Root Mean Square Propagation (RMSProp): reaches the minimum properly, using an exponentially weighted average of squared gradients to keep the learning-rate decay in check.
ADAM is formed by combining both the Root Mean Square Propagation (RMSProp) and Momentum (mini-batch
gradient descent) optimization algorithms.
Weight updates in plain mini-batch gradient descent follow a zig-zag path, and a lot of training time is
wasted oscillating in the vertical direction instead of moving straight toward the local minimum.
To reduce training time and follow a straighter path toward the minimum, we use these optimizers.
Optimization reduces the time taken to train the model and thus speeds up learning (a momentum
sketch follows below).
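A minimal sketch of the momentum idea on a hypothetical elongated, bowl-shaped loss (steep in the vertical B direction, shallow in the horizontal W direction). The loss, learning rate, and beta are made up for illustration; the point is that the velocity term averages out the oscillating vertical gradients so the path to the minimum is straighter.

    # Gradients of a toy elongated bowl: steep in b (vertical), shallow in w (horizontal).
    grad = lambda w, b: (0.2 * w, 10.0 * b)

    w, b = 8.0, 1.0
    v_dw, v_db = 0.0, 0.0
    lr, beta = 0.05, 0.9

    for _ in range(300):
        dw, db = grad(w, b)
        # EWMA of the gradients: the oscillating vertical gradients cancel out,
        # while the consistent horizontal gradient keeps accumulating.
        v_dw = beta * v_dw + (1 - beta) * dw
        v_db = beta * v_db + (1 - beta) * db
        w -= lr * v_dw
        b -= lr * v_db

    print(round(w, 3), round(b, 3))   # both move toward the minimum at (0, 0)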
Graph of cost versus the weight parameter: this is how the cost reaches the local minimum.
The red line is mini-batch gradient descent; the green line is batch gradient descent.
The right-hand figure is the contour of the 3D figure on the left, i.e., a 2D top view of it.
Exponentially weighted moving average (EWMA): the average is updated as we move forward in time and
meet new points, giving higher weight to newer points and lower weight to older points (see the sketch below).
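A minimal sketch of the EWMA update described above; the data series and beta are made up for illustration.

    # Exponentially weighted moving average: v = beta * v + (1 - beta) * x
    data = [10, 12, 9, 11, 30, 13, 10, 12]
    beta = 0.9

    v = 0.0
    for x in data:
        # Each new point gets weight (1 - beta); older points decay by a factor of beta per step.
        v = beta * v + (1 - beta) * x
        print(round(v, 2))
    # (Without bias correction the early averages are pulled toward zero,
    #  which is why Adam applies a bias-correction term.)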
Let's assume W and B are the y-axis and x-axis. Since the oscillation along W (the vertical direction) is
large, dw and hence dw^2 are large, so Sdw is large. When dw is divided by the square root of Sdw, the
update in the vertical direction shrinks and the vertical oscillation is reduced. If the root value becomes
too small, the fraction can become too big and the update may overshoot; to avoid this, epsilon is
introduced in the denominator.
RMS Prop:
Vdw & Sdw are both EWMAs: the EWMA of dw^2 is Sdw, and the EWMA of dw (without squaring) is Vdw.
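A minimal sketch of the RMSProp update using Sdw and epsilon as described above. The toy loss (steep in B, shallow in W) and the hyperparameters are assumptions for illustration.

    import numpy as np

    # Gradients of a toy loss that is steep in b and shallow in w, as in the contour picture above.
    grad = lambda w, b: (0.2 * w, 10.0 * b)

    w, b = 8.0, 1.0
    s_dw, s_db = 0.0, 0.0
    lr, beta, eps = 0.05, 0.9, 1e-8

    for _ in range(300):
        dw, db = grad(w, b)
        s_dw = beta * s_dw + (1 - beta) * dw ** 2   # EWMA of squared gradients
        s_db = beta * s_db + (1 - beta) * db ** 2
        # Dividing by sqrt(S) shrinks the large vertical step and enlarges the small horizontal one;
        # eps keeps the denominator from getting too small and the step from blowing up.
        w -= lr * dw / (np.sqrt(s_dw) + eps)
        b -= lr * db / (np.sqrt(s_db) + eps)

    print(round(w, 3), round(b, 3))   # both end up near the minimum at (0, 0)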