Adam 1
Deep Learning is a subset of machine learning focused on neural networks with many layers (deep
neural networks). It mimics the way the human brain processes data by learning features and
patterns directly from raw input.
Deep learning is widely used in tasks like image recognition, speech processing, and natural language
understanding.
Key Concepts:
• Neural Networks: The building blocks of deep learning. They consist of layers of neurons that
process input data and learn features automatically.
• Learning Process: The network adjusts its parameters (weights and biases) by minimizing an
error through backpropagation and optimization techniques (e.g., Adam optimizer).
2. Process:
o Lower layers learn simple features from the raw input (e.g., edges and textures).
o Higher layers combine these features to identify complex patterns like "cat ears" or
"dog snout."
In the context of neural networks, weights and biases are the key parameters that a network learns
during training to map inputs to outputs.
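As a rough illustration, here is a minimal sketch of how a single neuron maps inputs to an output using weights and a bias; the feature and parameter values below are made up purely for illustration.

    import numpy as np

    # Hypothetical input features, weights, and bias (values chosen only for illustration).
    x = np.array([0.5, 0.8, 0.1])   # input features (e.g., pixel intensities)
    w = np.array([0.9, -0.4, 0.2])  # weights: how much each feature matters
    b = 0.3                         # bias: shifts the weighted sum

    # A neuron computes a weighted sum of its inputs, adds the bias,
    # and passes the result through an activation function (sigmoid here).
    z = np.dot(w, x) + b
    output = 1 / (1 + np.exp(-z))   # sigmoid activation
    print(output)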
1. Weights:
• Definition: Weights determine the importance of a particular feature in the input data. They
are the values assigned to the connections between neurons.
• Role: Each input feature (e.g., pixel value in an image) is multiplied by its corresponding
weight. The weight adjusts how much influence that input feature has on the output.
• Training: During the learning process, weights are updated to minimize the difference
between the predicted output and the actual output (using optimization algorithms like
gradient descent).
2. Biases:
• Definition: A bias is an additional parameter added to the weighted sum of inputs before the
activation function, shifting the activation's output. It allows the model to fit the data more flexibly.
• Role: Bias ensures that the neuron can produce a non-zero output even when all inputs are
zero. This improves the learning capability of the network.
• Training: Like weights, biases are also updated during the training process.
• Biases add flexibility to the model, ensuring it can represent complex data patterns.
4. Learning Process:
1. Forward Pass:
o Inputs are multiplied by the weights, the biases are added, and activations are applied
layer by layer to produce a prediction.
2. Backward Pass:
o The loss between the prediction and the target is computed and gradients are
propagated backward through the network.
o Weights and biases are updated using optimization algorithms to reduce the loss.
Simple Analogy:
Think of fitting a straight line y = wx + b: the weight w scales the input (the slope) and the bias b
shifts the line up or down (the intercept). In higher-dimensional problems, weights and biases perform
similar roles, scaling and shifting the data to achieve the best fit for the target output (a small sketch
follows below).
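A minimal sketch of this analogy: fitting y = wx + b to a few made-up points with plain gradient descent. The data, learning rate, and number of epochs are arbitrary choices for illustration, not values from the notes.

    # Fit y = w*x + b with gradient descent; data generated from y = 2x + 1.
    xs = [1.0, 2.0, 3.0, 4.0]
    ys = [3.0, 5.0, 7.0, 9.0]

    w, b = 0.0, 0.0   # parameters start at zero
    lr = 0.01         # learning rate
    n = len(xs)

    for epoch in range(2000):
        # Forward pass: predictions from the current weight and bias.
        preds = [w * x + b for x in xs]
        # Backward pass: gradients of the mean squared error w.r.t. w and b.
        dw = (2 / n) * sum((p - y) * x for p, y, x in zip(preds, ys, xs))
        db = (2 / n) * sum(p - y for p, y in zip(preds, ys))
        # Update: the weight scales the input, the bias shifts the line.
        w -= lr * dw
        b -= lr * db

    print(round(w, 2), round(b, 2))   # approaches w ≈ 2, b ≈ 1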
Optimization Techniques:
1. Gradient Descent:
o Adjusts model parameters (weights and biases) by computing gradients of the loss
function and stepping in the direction that reduces the loss.
2. Advanced Optimizers:
o Methods such as Momentum, RMSProp, and Adam adapt the update step (and learning
rate) per parameter for faster, more stable convergence.
3. Regularization Techniques:
o Add penalties to the loss function to avoid overfitting (e.g., L1, L2 regularization).
• Key Role: Without optimization, training deep neural networks would be computationally
infeasible due to the large number of parameters.
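As a small sketch of points 1 and 3 above, here is one gradient-descent step with an L2 penalty added to the loss. The parameter values, learning rate, and penalty strength are assumed only for illustration.

    import numpy as np

    # Hypothetical parameters and gradient of the unregularized loss.
    w = np.array([0.8, -1.5, 0.3])
    grad_loss = np.array([0.2, -0.1, 0.05])

    lr = 0.1    # learning rate
    lam = 0.01  # L2 regularization strength

    # L2 regularization adds lam * ||w||^2 to the loss,
    # which contributes 2 * lam * w to the gradient.
    grad = grad_loss + 2 * lam * w

    # Plain gradient-descent step on the regularized loss.
    w = w - lr * grad
    print(w)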
What is the Adam optimizer? Why do we use it? What are its applications, advantages, and disadvantages? Conclusion?
Adam (short for Adaptive Moment Estimation) is a popular optimization algorithm in deep learning.
It combines the strengths of two other optimization methods:
• Momentum: Helps smooth out updates by considering the direction of past gradients.
• RMSProp: Adapts the learning rate for each parameter based on the magnitude of recent
gradients.
Adam is well-suited for handling large datasets and models with a high number of parameters,
making it a preferred choice for many machine learning and deep learning tasks.
Adam maintains two moving averages for each parameter during training:
1. First moment (mean): The exponentially decaying average of past gradients (Vdw).
2. Second moment (uncentered variance): The exponentially decaying average of the squared
gradients (Sdw).
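Putting the two moments together, here is a minimal sketch of the Adam update for a single parameter. The toy loss f(w) = w^2 and the hyperparameter values are assumptions for illustration, not part of the original notes.

    import numpy as np

    # Minimize f(w) = w^2 with Adam; the gradient of f is 2w.
    grad = lambda w: 2 * w

    w = 5.0
    lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
    v_dw = 0.0   # first moment: EWMA of past gradients (momentum)
    s_dw = 0.0   # second moment: EWMA of past squared gradients (RMSProp)

    for t in range(1, 201):
        dw = grad(w)
        v_dw = beta1 * v_dw + (1 - beta1) * dw         # momentum-style average
        s_dw = beta2 * s_dw + (1 - beta2) * dw ** 2    # RMSProp-style average
        v_hat = v_dw / (1 - beta1 ** t)                # bias correction
        s_hat = s_dw / (1 - beta2 ** t)
        w -= lr * v_hat / (np.sqrt(s_hat) + eps)       # adaptive update

    print(round(w, 4))   # close to the minimum at w = 0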
Advantages of Adam:
1. Adaptive Learning Rates: Adjusts learning rates for each parameter independently, making it
effective for problems with sparse gradients.
Sparse Gradients
Sparse gradients refer to a situation during training where most of the gradient values for a model's
parameters are zero or near-zero (a small sketch follows the advantages list below). This typically happens when:
2. Data Characteristics: The input data has sparse features, meaning many features are zero or inactive. For
example:
o Text data represented as one-hot encoded vectors.
o High-dimensional data with most entries as zeros.
3. Combines Momentum and RMSProp: Benefits from the advantages of both methods,
resulting in faster convergence.
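To make the sparse-gradient point above concrete (the sketch promised earlier), here the input is a one-hot vector, so the gradient of the loss with respect to the weights is non-zero only at the single active feature. The vocabulary size and target value are invented for illustration.

    import numpy as np

    # One-hot encoded input: only one feature is active (e.g., one word out of a vocabulary of 8).
    x = np.zeros(8)
    x[3] = 1.0

    w = np.random.randn(8) * 0.1   # weights of a single linear output
    y_true = 1.0

    # Forward pass: linear prediction and squared-error loss.
    y_pred = np.dot(w, x)
    error = y_pred - y_true

    # Backward pass: d(loss)/dw = 2 * error * x, so the gradient is zero
    # everywhere except at the active feature -> a sparse gradient.
    grad_w = 2 * error * x
    print(grad_w)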
Disadvantages of Adam:
3. Generalization: May not generalize as well as simpler methods like SGD in certain problems.
4. Plateau in Convergence: May stop improving after a certain point, especially in highly
complex loss landscapes.
Applications of Adam:
1. Deep Learning: Training neural networks for tasks like image recognition, natural language
processing (NLP), and speech recognition.
4. Sparse Data: Effective in handling problems with sparse datasets or features, such as text
data in NLP.
Summary:
Adam is a highly versatile optimization algorithm that excels in a wide range of deep learning
applications due to its efficiency and ability to adapt learning rates. However, it requires careful
hyperparameter tuning and may not always outperform simpler methods for problems with well-
structured data.
First, understand the Root Mean Square Propagation (RMSProp) and Momentum (mini-batch gradient
descent) optimization algorithms.
ADAM: one of the most widely used techniques for optimizing machine learning models.
Used in deep learning and deep neural networks.
Batch gradient descent: moves directly toward the minimum, but each step uses the entire dataset, so it is slow.
Momentum (mini-batch gradient descent) optimizer: uses momentum (a velocity term built from past
gradients) to move toward the minimum; like a satellite, it circles the minimum and settles on it after a few oscillations.
Adagrad: can slightly miss the minimum because its learning rate keeps decaying (the accumulated
squared gradients make the effective learning rate smaller and smaller).
Root Mean Square Propagation (RMSProp): reaches the minimum properly, using an exponentially weighted average of squared gradients to keep the learning-rate decay in check.
ADAM is formed by combining both the Root Mean Square Propagation (RMSProp) and Momentum (mini-batch
gradient descent) optimization algorithms.
Weight updates in plain mini-batch gradient descent follow a zig-zag path, and a lot of training time is
wasted oscillating in the vertical direction instead of moving straight toward the local minimum.
To reduce training time and follow a straighter path toward the minimum, we use these optimizers.
Optimization reduces the time taken to train the model and thus speeds up learning (a momentum
sketch follows below).
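A minimal sketch of the momentum idea on a hypothetical elongated, bowl-shaped loss (steep in the vertical B direction, shallow in the horizontal W direction). The loss, learning rate, and beta are made up for illustration; the point is that the velocity term averages out the oscillating vertical gradients so the path to the minimum is straighter.

    # Gradients of a toy elongated bowl: steep in b (vertical), shallow in w (horizontal).
    grad = lambda w, b: (0.2 * w, 10.0 * b)

    w, b = 8.0, 1.0
    v_dw, v_db = 0.0, 0.0
    lr, beta = 0.05, 0.9

    for _ in range(300):
        dw, db = grad(w, b)
        # EWMA of the gradients: the oscillating vertical gradients cancel out,
        # while the consistent horizontal gradient keeps accumulating.
        v_dw = beta * v_dw + (1 - beta) * dw
        v_db = beta * v_db + (1 - beta) * db
        w -= lr * v_dw
        b -= lr * v_db

    print(round(w, 3), round(b, 3))   # both move toward the minimum at (0, 0)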
Graph of cost versus the weight parameter: this is how the cost reaches the local minimum.
The red line is mini-batch gradient descent; the green line is batch gradient descent.
The right-hand figure is the contour of the 3D figure on the left, i.e., a 2D top view of it.
Exponentially weighted moving average (EWMA): the average is updated as we move forward in time and
meet new points, giving higher weight to newer points and lower weight to older points (see the sketch below).
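A minimal sketch of the EWMA update described above; the data series and beta are made up for illustration.

    # Exponentially weighted moving average: v = beta * v + (1 - beta) * x
    data = [10, 12, 9, 11, 30, 13, 10, 12]
    beta = 0.9

    v = 0.0
    for x in data:
        # Each new point gets weight (1 - beta); older points decay by a factor of beta per step.
        v = beta * v + (1 - beta) * x
        print(round(v, 2))
    # (Without bias correction the early averages are pulled toward zero,
    #  which is why Adam applies a bias-correction term.)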
Let's assume W and B are the y-axis and x-axis. Since the oscillation along W (the vertical direction) is
large, dw and hence dw^2 are large, so Sdw is large. When dw is divided by the square root of Sdw, the
update in the vertical direction shrinks and the vertical oscillation is reduced. If the root value becomes
too small, the fraction can become too big and the update may overshoot; to avoid this, epsilon is
introduced in the denominator.
RMS Prop:
Vdw & Sdw are both EWMAs: the EWMA of dw^2 is Sdw, and the EWMA of dw (without squaring) is Vdw.
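A minimal sketch of the RMSProp update using Sdw and epsilon as described above. The toy loss (steep in B, shallow in W) and the hyperparameters are assumptions for illustration.

    import numpy as np

    # Gradients of a toy loss that is steep in b and shallow in w, as in the contour picture above.
    grad = lambda w, b: (0.2 * w, 10.0 * b)

    w, b = 8.0, 1.0
    s_dw, s_db = 0.0, 0.0
    lr, beta, eps = 0.05, 0.9, 1e-8

    for _ in range(300):
        dw, db = grad(w, b)
        s_dw = beta * s_dw + (1 - beta) * dw ** 2   # EWMA of squared gradients
        s_db = beta * s_db + (1 - beta) * db ** 2
        # Dividing by sqrt(S) shrinks the large vertical step and enlarges the small horizontal one;
        # eps keeps the denominator from getting too small and the step from blowing up.
        w -= lr * dw / (np.sqrt(s_dw) + eps)
        b -= lr * db / (np.sqrt(s_db) + eps)

    print(round(w, 3), round(b, 3))   # both end up near the minimum at (0, 0)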