
Module 3

OPTIMIZATION FOR TRAINING DEEP MODELS


In the context of deep learning, optimization refers to the process of adjusting a model's
parameters to minimize (or maximize) an objective function computed on the input data.
The objective function, often called a cost or loss function, measures how well the model's
predictions match the true outputs or labels in the dataset.

1. How Learning Differs from Pure Optimization


➢ In machine learning, we aim to improve a performance measure P (like accuracy on
new data) rather than directly optimizing it. Instead, we optimize a different cost
function J(θ) on training data, hoping it will improve P.
➢ Unlike in pure optimization, where minimizing J is the final goal, machine learning
optimizes J to indirectly improve the model’s performance on unseen data.
➢ The training cost function J(θ) is usually an average over all training examples,
written as:
J(θ) = E_(x,y)∼p̂_data L(f(x; θ), y)
➢ Here:
• L is the loss per example,
• f(x;θ) is the model’s prediction for input x,
• p̂_data is the empirical distribution defined by the training data.
➢ In supervised learning, y is the known target output, and the cost function depends on
the difference between f(x;θ) (the model’s prediction) and y (the target).
➢ This setup can be adapted for different purposes, such as:
• Adding regularization (including θ or x in the cost function),
• Applying it to unsupervised learning (excluding y from the arguments).
➢ Ideally, we would want to minimize an objective function defined over the true data
distribution Pdata, not just the training data. This ideal cost function is:
J*(θ) = E_(x,y)∼p_data L(f(x; θ), y)

1.1 Empirical Risk Minimization


• Empirical risk is the average loss computed over a given set of training data.
• The goal of a machine learning algorithm is to reduce the expected generalization
error, known as the risk.
• The expectation is taken over the true data-generating distribution p_data(x, y).
• If p_data(x, y) were known, risk minimization would be a straightforward optimization
task.
• When p_data(x, y) is unknown but we have a training set, it becomes a machine learning
problem.
• Machine learning is converted to an optimization problem by minimizing the
expected loss on the training set.
• This involves replacing p_data(x, y) with the empirical distribution p̂_data(x, y).
• Empirical risk is minimized using the formula:

E_(x,y)∼p̂_data [L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)),

where m is the number of training examples.
• The training process based on minimizing this average training error is known as
empirical risk minimization.
• Empirical risk minimization assumes optimizing empirical risk will reduce true risk.
• Empirical risk minimization is prone to overfitting, where models memorize the
training data.
• Loss functions like 0-1 loss lack useful derivatives, making empirical risk
minimization challenging for gradient descent.
• Modern deep learning avoids pure empirical risk minimization by optimizing a
different quantity.
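To make the averaging in the formula above concrete, here is a minimal Python sketch (the squared-error loss and linear model are illustrative assumptions, not choices made in the text):

import numpy as np

def empirical_risk(params, X, y, predict, loss):
    """Average per-example loss over the training set, as in the formula above."""
    predictions = predict(params, X)        # f(x; theta) for every training example
    return np.mean(loss(predictions, y))    # (1/m) * sum_i L(f(x_i; theta), y_i)

# Illustrative choices: a linear model with squared-error loss
predict = lambda w, X: X @ w
squared_loss = lambda y_hat, y: (y_hat - y) ** 2

X = np.random.randn(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * np.random.randn(100)

print(empirical_risk(np.zeros(3), X, y, predict, squared_loss))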

1.2 Surrogate Loss Functions and Early Stopping


• Sometimes, the loss function we want to minimize, like classification error, is hard to
optimize, especially when the problem is complex. In these cases, we use a simpler
loss function, called a surrogate loss, that’s easier to work with but still helps
improve the model’s performance.
• For example, the negative log-likelihood is often used as a substitute for the 0-1 loss
because it allows the model to estimate the probability of each class and make better
predictions on average.
• In some cases, using a surrogate loss actually allows the model to learn more. For
instance, when training with the log-likelihood surrogate, the 0-1 loss on the test set
might continue to decrease even after the 0-1 loss on the training set has reached zero.
• This happens because, even when the expected classification error is zero, the model
can still become more confident by pushing the classes further apart, making the
classifier more robust and reliable. This process extracts more useful information
from the training data than just minimizing the 0-1 loss.
• A key difference between optimization for training deep models and general
optimization is that training algorithms don't stop when they reach a local minimum.
• Instead, the algorithm usually minimizes the surrogate loss function but halts when a
convergence criterion is met, which is typically based on the true underlying loss
function, such as 0-1 loss measured on a validation set.
• The stopping condition is designed to prevent overfitting by halting training as soon
as overfitting begins.
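The early-stopping behaviour described above can be sketched as a simple training loop (the model interface, train_step, and zero_one_error used here are placeholders for illustration, not a specific library API):

def train_with_early_stopping(model, train_step, zero_one_error, val_data,
                              max_epochs=200, patience=10):
    """Minimize the surrogate loss, but halt based on validation 0-1 loss."""
    best_error = float("inf")
    best_params = model.get_params()
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)                            # one epoch of surrogate-loss minimization
        val_error = zero_one_error(model, val_data)  # true underlying loss on held-out data

        if val_error < best_error:
            best_error = val_error
            best_params = model.get_params()         # remember the best parameters so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # overfitting has likely begun

    model.set_params(best_params)
    return model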

1.3 Batch and Minibatch Algorithms


• One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples.
In machine learning, optimization involves adjusting model parameters to minimize the
objective function, which often averages over training examples.

The goal is to maximize the log-likelihood of the data:

θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i), y^(i); θ)

The objective function is the expected log-likelihood over the training data:

J(θ) = E_(x,y)∼p̂_data log p_model(x, y; θ)

The gradient of the objective function, used to update parameters, is:

∇_θ J(θ) = E_(x,y)∼p̂_data ∇_θ log p_model(x, y; θ)

• Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
• Batch size refers to the number of training examples used in one iteration of model
training in algorithms like minibatch gradient descent.
• Optimization algorithms that use only a single example at a time are sometimes called
stochastic or sometimes online methods.
• The term online is usually reserved for the case where the examples are drawn from a
stream of continually created examples rather than from a fixed-size training set over
which several passes are made.
• Most algorithms used for deep learning fall somewhere in between, using more than
one but fewer than all of the training examples. These were traditionally called
minibatch or minibatch stochastic methods, and it is now common to simply call
them stochastic methods.
Minibatch SGD can be viewed as following the gradient of the true generalization error, so
long as no examples are repeated. The equivalence is easiest to derive when both x and y are
discrete. In this case, the generalization error can be written as

J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),

with the exact gradient

g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).

Sampling a minibatch of examples {x^(1), ..., x^(m)} with corresponding targets y^(i) from
p_data and computing the gradient of the loss with respect to the parameters for that minibatch,

ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)),

gives an unbiased estimate of the exact gradient g.
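A minimal numpy sketch of this estimator (using a linear model with squared-error loss purely as illustrative choices) samples a minibatch and averages the per-example gradients:

import numpy as np

def minibatch_gradient(theta, X, y, batch_size, rng):
    """Unbiased estimate: (1/m) * gradient of sum_i L(f(x_i; theta), y_i)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    residual = Xb @ theta - yb            # f(x; theta) - y for squared-error loss
    return (2.0 / batch_size) * Xb.T @ residual

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01 * rng.normal(size=1000)

g_hat = minibatch_gradient(np.zeros(5), X, y, batch_size=32, rng=rng)
print(g_hat)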
2. Challenges in Neural Network Optimization

Optimization is difficult because finding the best solution is hard. In traditional machine
learning, problems are often designed to be easier to solve by ensuring they have a simple
structure (the convex case). With neural networks, the problems are non-convex, which makes
them considerably harder to solve.
Even in the convex case there are still challenges. Some of the main
challenges involved in optimizing deep learning models are:

2.1 Ill-Conditioning
• One challenge that can come up even when optimizing simpler (convex) functions is
called ill-conditioning of the Hessian matrix (H). This is a common issue in many
types of optimization problems.
• In neural network training, ill-conditioning is a known problem. Ill-conditioning can
manifest by causing SGD to get “stuck” in the sense that even very small steps
increase the cost function.
• A second-order Taylor series expansion of the cost function predicts that a gradient
descent step of −εg will add (1/2)ε²gᵀHg − εgᵀg to the cost. Ill-conditioning becomes a
problem when (1/2)ε²gᵀHg exceeds εgᵀg.

• To see if ill-conditioning is slowing down the training of a neural network, we can
look at two things:
• the squared gradient norm (gᵀg)
• the term involving the Hessian matrix (gᵀHg).
• In many cases, the gradient norm doesn't shrink much as training goes on, but the
gᵀHg term increases a lot. This causes learning to slow down even though the
gradient (the signal that tells the model how to improve) is still strong.
• The reason for this is that the learning rate has to be reduced to avoid making steps
that are too big due to the stronger curvature of the cost function.
• For example, the gradient norm can increase substantially over the course of successful
training of a neural network.
• In some cases, like training a convolutional network, the gradient (the signal guiding
the model) keeps increasing during training instead of decreasing, which is unusual
for a converging model.
• Despite the rising gradient, the model still works well, with its performance
improving over time and the classification error dropping on the validation set.
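As a minimal illustration of the two quantities above, consider a synthetic quadratic cost J(θ) = ½ θᵀHθ with an ill-conditioned Hessian (an assumption made purely for demonstration); the curvature term gᵀHg can dominate the squared gradient norm gᵀg:

import numpy as np

H = np.diag([1.0, 100.0])          # ill-conditioned Hessian (condition number 100)
theta = np.array([1.0, 1.0])
epsilon = 0.01                     # learning rate

for step in range(5):
    g = H @ theta                                  # gradient of the quadratic cost
    gTg = g @ g                                    # squared gradient norm
    gTHg = g @ H @ g                               # curvature term
    # Cost change predicted by the second-order Taylor expansion for a step of -epsilon * g
    predicted_change = 0.5 * epsilon**2 * gTHg - epsilon * gTg
    print(f"step {step}: gTg={gTg:.2f}  gTHg={gTHg:.2f}  predicted change={predicted_change:.4f}")
    theta = theta - epsilon * g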

2.2 Local Minima


• A convex optimization problem is one where the function you're trying to minimize
has a special property: any local minimum is also a global minimum. This means
that if you find a minimum, you know it’s the best possible solution.
• Some convex functions might have a flat region (where the function stays the same)
at the bottom, but any point in that flat region is also a good solution.
• For non-convex functions (like neural networks), there can be many local minima.
A local minimum is a point where the function value is lower than its neighbors, but
there might be other points with even lower values.
• Neural networks, in particular, have very many local minima, but in practice this turns out not to be a major problem.
• A model is identifiable if it can be trained with enough data to narrow down the best
set of parameters (settings). If a model isn’t identifiable, it means there could be many
ways to set its parameters that give equally good results.
• Latent variables in models (hidden variables not directly observed) can make models
non-identifiable because different settings of these variables can lead to the same
output.
• In neural networks, there’s a problem called weight space symmetry. This happens
when the network can swap the weights (connections) between certain layers or units
and still give the same result.
• For example, in a network with several layers and units, you can swap the weights
between units in the same layer in many ways (like swapping weights for unit 1 with
unit 2). This leads to many different local minima that look different but are
essentially the same solution.

2.3 Plateaus, Saddle Points and Other Flat Regions


• In high-dimensional non-convex functions, saddle points are more common than
local minima (or maxima). A saddle point is a point where the function has a
gradient of zero, but it isn’t a minimum or maximum.
• Around a saddle point, the function can go higher (like a maximum) in some
directions and lower (like a minimum) in others.
• The Hessian matrix, which helps describe how the function curves, has both positive
and negative eigenvalues at a saddle point. In directions with positive eigenvalues,
the function value increases (higher cost), and in directions with negative
eigenvalues, it decreases (lower cost).
• Saddle points are tricky for optimization algorithms because the gradient (the
direction of steepest descent) becomes very small near them, making it hard for the
algorithm to know which way to go.
• Gradient descent, which is designed to move downhill, often has a hard time with
saddle points. However, empirically, it seems to escape from saddle points in many
cases. Visualizations from Goodfellow et al. (2015) show that, even near a saddle
point (where all the weights are zero), gradient descent can find a way out.


• Visualizations of neural network cost functions show few prominent obstacles. The main
one is a saddle point near the initialization, but SGD escapes it quickly.
• Most of the training time is spent traversing a relatively flat region of the cost function,
possibly because of high noise in the gradient, poor conditioning of the Hessian, or the
need to navigate around large obstacles.
• Newton’s method is a more advanced technique for optimization that’s designed to
find points where the gradient is zero. However, without adjustments, it can get stuck
at a saddle point because it’s designed to find any critical point (minima, maxima, or
saddle points).
• This is why second-order methods (like Newton’s method) haven’t fully replaced
gradient descent in training neural networks, especially in high dimensions.
• A modified version of Newton's method, called the saddle-free Newton method,
avoids saddle points and works better than traditional second-order methods. However,
scaling this method to large neural networks is still a challenge, but it shows
potential for improvement.

2.4 Cliffs and Exploding Gradients


• Neural networks with many layers can have very steep regions, like cliffs, caused by
large weights. In these areas, the gradient update can move the parameters too far,
sometimes causing them to jump off the cliff.
• To avoid this problem, gradient clipping is used (see the sketch after this list). It limits
the size of the gradient update, reflecting the fact that the gradient indicates only the best
direction to move in, not the best step size.

• In highly nonlinear deep neural networks or recurrent neural networks, the
objective function can have sharp nonlinearities due to the multiplication of several
parameters. These nonlinearities can cause very high derivatives in some areas.
• When the parameters get close to these cliff regions, a gradient descent update can
move the parameters too far, possibly undoing much of the optimization work that has
been done.
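A minimal sketch of norm-based gradient clipping (the threshold and learning rate are arbitrary illustrative values, not ones prescribed by the text):

import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its norm does not exceed threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

# Example: a cliff-like gradient gets rescaled before the parameter update
g = np.array([300.0, -400.0])                # norm 500: far too large a step
g_clipped = clip_gradient(g, threshold=5.0)  # norm reduced to 5, same direction
theta = np.zeros(2) - 0.1 * g_clipped
print(g_clipped, theta)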

2.5 Long-Term Dependencies


• Neural networks with many layers, like feedforward and recurrent networks, create
deep computational graphs, which can lead to challenges during training. Recurrent
networks, in particular, repeatedly apply the same operation across time steps, making
these issues more severe.

• When a matrix W is multiplied repeatedly over t steps, the result Wᵗ depends on its
eigenvalues (λ).
• If ∣λ∣>1, the values explode, causing instability.
• If ∣λ∣<1, the values vanish, making it hard to update parameters effectively.
• This scaling issue also affects gradients, leading to the vanishing and exploding
gradient problem.
• Exploding gradients create cliff structures (as discussed earlier), which can cause
instability. Gradient clipping helps manage these issues by limiting gradient size.
• Repeated multiplication by W works like the power method, amplifying components
aligned with the largest eigenvector of W and discarding others.
• Recurrent networks face this problem more acutely because they reuse the same
matrix W at each time step. Feedforward networks avoid much of the issue because
they use different weights for each layer.
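A small numpy demonstration of this scaling effect (the diagonal matrices are arbitrary examples chosen so the eigenvalues are easy to see):

import numpy as np

t = 50                             # number of repeated applications (e.g., time steps)
W_explode = np.diag([1.2, 0.9])    # largest eigenvalue > 1
W_vanish = np.diag([0.9, 0.5])     # all eigenvalues < 1
x = np.array([1.0, 1.0])

print(np.linalg.matrix_power(W_explode, t) @ x)  # component with |lambda| > 1 explodes
print(np.linalg.matrix_power(W_vanish, t) @ x)   # all components shrink toward zero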

2.6 Inexact Gradients


• Most optimization methods assume we have the exact gradient or Hessian matrix,
but in practice, we often only have noisy or biased estimates.
• Deep learning algorithms commonly estimate gradients using minibatches of training
data, which introduces some noise.
• For some advanced models, the objective function and its gradient are intractable
(too complex to compute exactly), so we rely on approximations.

2.7 Poor Correspondence between Local and Global Structure


• Optimization problems can still arise even after overcoming local issues like cliffs,
saddle points, or poorly conditioned gradients.
• The local direction of improvement may not lead to better results globally, meaning
the path taken during training doesn’t reach areas of much lower cost.
• This mismatch between local and global optimization makes it harder to achieve the
best possible results, even if there are no local minima or saddle points.
• Future research is needed to better understand these issues and improve the training
process.
• Optimization methods that rely on moving downhill locally can fail if the local slope
doesn’t lead toward the global solution, even when there are no saddle points or local
minima.
• Some cost functions don’t have minima, only asymptotes (values that decrease
without ever reaching a minimum). If the training starts on the wrong side of a
“mountain,” the algorithm may struggle to traverse it.
• In higher dimensions, algorithms can often go around such mountains, but this can
make the training path long and increase the training time.

2.8 Theoretical Limits of Optimization


• There are theoretical limits to how well any optimization algorithm can perform on
neural networks, but these usually don’t affect practical applications.
• Some results apply only to networks with units that output discrete values, but most
neural network units output smooth values, making optimization easier using local
search.
• Certain problem types are proven to be intractable, but it’s often unclear if a specific
problem belongs to such a category.

3. Basic Algorithms


3.1 Stochastic Gradient Descent
• Stochastic gradient descent (SGD) and its variants are probably the most used
optimization algorithms for machine learning in general and for deep learning in
particular.
• It is possible to obtain an unbiased estimate of the gradient by taking the average
gradient on a minibatch of m examples drawn from the data-generating distribution.
• A crucial parameter for the SGD algorithm is the learning rate. Previously, we have
described SGD as using a fixed learning rate ε.
• In practice, it is necessary to gradually decrease the learning rate over time, so we
now denote the learning rate at iteration k as ε_k.
• A sufficient condition to guarantee convergence of SGD is that

Σ_{k=1}^∞ ε_k = ∞  and  Σ_{k=1}^∞ ε_k² < ∞.

In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ,  with α = k/τ; after iteration τ, ε is left constant.
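A minimal SGD loop with this linear decay schedule (the data, the linear model with squared-error loss, and the hyperparameter values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=500)

theta = np.zeros(3)
eps0, eps_tau, tau = 0.1, 0.001, 200     # initial rate, final rate, decay horizon
batch_size = 32

for k in range(400):
    alpha = min(k / tau, 1.0)
    eps_k = (1 - alpha) * eps0 + alpha * eps_tau     # linear decay, then held constant
    idx = rng.choice(len(X), size=batch_size, replace=False)
    g = (2.0 / batch_size) * X[idx].T @ (X[idx] @ theta - y[idx])   # minibatch gradient
    theta = theta - eps_k * g

print(theta)   # should end up close to theta_true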

3.2 Momentum
Momentum is used to speed up optimization, especially when:
• Gradients are small but consistent.
• Gradients are noisy.
• The optimization landscape has high curvature (sharp changes).

• Momentum accumulates a moving average of past gradients, smoothing out updates and
accelerating learning in consistent directions.

• A new variable, v, represents the velocity of parameters in the optimization space.

Update Rules

1. Velocity update:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)) ]

o α: Momentum hyperparameter (controls decay of past gradients, 0 ≤ α < 1).
o ε: Learning rate.
o ∇_θ: Gradient of the loss function with respect to the parameters.
o m: Minibatch size.

2. Parameter update:

θ ← θ + v

Acceleration: If the gradients g are consistent (point in the same direction), momentum increases
the effective step size.
• Final velocity reaches a terminal speed: Terminal speed = ε‖g‖ / (1 − α).
• Example: With α = 0.9, the step size is amplified 10× compared to standard gradient
descent.

Newton's Second Law: the momentum update can be interpreted as simulating a particle of
unit mass with position θ(t) moving through parameter space.

The force causes acceleration:

f(t) = ∂²θ(t)/∂t²

Instead of a second-order equation, introducing the velocity v(t) simplifies the dynamics into a
pair of first-order equations:

• Velocity is the rate of change of position: v(t) = ∂θ(t)/∂t
• Force is the rate of change of velocity: ∂v(t)/∂t = f(t)
• Momentum helps solve issues like poor conditioning of the Hessian matrix by
smoothing the optimization path. It avoids zig-zagging in narrow valleys, as seen with
standard gradient descent, and efficiently moves along the valley's length, reducing
wasted steps.
• This behaviour allows momentum to accelerate convergence, especially on quadratic
loss functions with elongated contour shapes.
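A minimal sketch of the momentum update above (the ill-conditioned quadratic objective with elongated contours is an illustrative assumption chosen to show the smoothing effect):

import numpy as np

A = np.diag([1.0, 50.0])           # quadratic cost J(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
v = np.zeros(2)                    # velocity
alpha, eps = 0.9, 0.02             # momentum coefficient and learning rate

for _ in range(100):
    g = grad(theta)
    v = alpha * v - eps * g        # accumulate a decaying average of past gradients
    theta = theta + v              # move by the velocity, not the raw gradient

print(theta)                       # moves toward the minimum at the origin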
3.3 Nesterov Momentum

Nesterov Momentum improves upon standard momentum by evaluating the gradient
after applying the current velocity update. This "look-ahead" approach allows it to add a
correction factor, leading to better convergence paths.

In convex problems, it significantly accelerates convergence, reducing the error rate from
O(1/k) to O(1/k^2). However, in stochastic gradient descent scenarios, it does not
improve convergence rates.

Velocity update:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ + αv), y^(i)) ]

Parameter update: θ ← θ + v
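The only change from the standard momentum sketch above is where the gradient is evaluated; a minimal illustration (reusing the same illustrative quadratic objective):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.02

for _ in range(100):
    g = grad(theta + alpha * v)    # "look-ahead": gradient at the interim point theta + alpha*v
    v = alpha * v - eps * g
    theta = theta + v

print(theta)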

4. Parameter Initialization


Deep learning models are sensitive to initial conditions, and poor initialization can lead to
failure or slow convergence. The initial point affects convergence speed, solution quality,
and generalization.

Symmetry breaking is crucial to avoid identical behaviour in units, which would prevent
learning. Random initialization from high-entropy distributions ensures diversity across
units without the computational cost of methods like Gram-Schmidt orthogonalization.

Weights are typically initialized using Gaussian or uniform distributions. The scale of
the initialization is critical:

• Large initial weights help break symmetry but can lead to exploding gradients or
saturation of activation functions.
• Too-small weights can suppress activation values, slowing learning.

Common heuristics include:


• Initializing each weight from a uniform distribution U(−1/√m, 1/√m), where m is the
number of inputs to the layer.
• Glorot Initialization (2010) uses a scaled distribution based on the sum of the input
units m and output units n, suggesting the normalized initialization

W_{i,j} ∼ U( −√(6/(m+n)), √(6/(m+n)) )

• Orthogonal Initialization suggests using random orthogonal matrices to maintain
activation and gradient norms, especially for deep networks.
• One drawback to scaling rules that set all of the initial weights to have the same
standard deviation, such as 1 /√m, is that every individual weight becomes extremely
small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to
have exactly k non-zero weights.
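A minimal sketch of the normalized (Glorot) initialization described above, with random sampling providing the symmetry breaking (the layer sizes are arbitrary example values):

import numpy as np

def glorot_uniform(m, n, rng):
    """Sample an n-by-m weight matrix from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

rng = np.random.default_rng(0)
W1 = glorot_uniform(784, 256, rng)   # e.g., 784 inputs feeding 256 hidden units
b1 = np.zeros(256)                   # biases are often started at zero (see below)
print(W1.shape, float(W1.std()))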

Bias Initialization:

• Biases are often set heuristically. For instance:


o Output biases may be initialized to match the expected marginal statistics of
the output.
o ReLU units' biases might be set to slightly positive values (e.g., 0.1) to avoid
saturation.
o LSTM forget gate biases are sometimes initialized to 1 to ensure proper
learning.

Variance/Precision Parameter Initialization: Parameters like the variance (β) in models such
as linear regression can typically be initialized to 1.

Learning-Based Initialization: Sometimes, initialization can be done by training an
unsupervised model on the same inputs or using a supervised task related to the main task,
providing better convergence and generalization.

Hyperparameter Search: The choice of initialization scale is often treated as a
hyperparameter, and techniques like random search can help find optimal values.

5. Algorithms with Adaptive Learning Rates

1. The learning rate is crucial but hard to set. Momentum helps, but adds another
hyperparameter. A better approach might be to use different learning rates for each
parameter, adjusting them automatically during training.
2. Delta-Bar-Delta Algorithm (Jacobs, 1988):
o Idea: Adjust each parameter's learning rate based on the behaviour of its partial
derivative during training.
o Concept: If the partial derivative keeps the same sign, increase the learning rate for
that parameter; if it keeps changing sign, decrease it.
5.1 AdaGrad

• The AdaGrad algorithm adapts learning rates for each model parameter based on the
sum of squared gradients.
• Parameters with large gradients have smaller learning rates, while those with small
gradients have larger ones. This helps make more progress in flatter directions of the
parameter space.
• Theoretically good for convex problems.
• In deep learning, accumulating squared gradients can overly reduce the learning rate,
making training slow.
• AdaGrad works well for some models but not all.

5.2 RMSProp

• RMSProp (Hinton, 2012) improves on AdaGrad by using an exponentially weighted
moving average of the squared gradients instead of accumulating them from the start of
training.
• This helps in non-convex settings, like neural networks, where AdaGrad might slow
down.
• Discards old gradients, allowing faster convergence when the model reaches a convex
region.
• Trains more effectively by adapting learning rates, avoiding excessive slowdowns.
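A minimal sketch contrasting the two accumulators (AdaGrad's running sum of squared gradients versus RMSProp's exponentially weighted moving average); the decay rate, damping constant, learning rate, and toy quadratic objective are illustrative assumptions:

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

eps, rho, delta = 0.5, 0.9, 1e-7     # learning rate, RMSProp decay rate, numerical damping

theta_ada, r_ada = np.array([5.0, 5.0]), np.zeros(2)
theta_rms, r_rms = np.array([5.0, 5.0]), np.zeros(2)

for _ in range(100):
    g = grad(theta_ada)
    r_ada = r_ada + g * g                      # AdaGrad: accumulate all squared gradients
    theta_ada = theta_ada - eps * g / (delta + np.sqrt(r_ada))

    g = grad(theta_rms)
    r_rms = rho * r_rms + (1 - rho) * g * g    # RMSProp: discard old gradients gradually
    theta_rms = theta_rms - eps * g / (delta + np.sqrt(r_rms))

print(theta_ada, theta_rms)   # compare how far each has moved toward the minimum at the origin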
5.3 Adam

• Adam (Kingma and Ba, 2014) combines ideas from RMSProp and momentum.
• Momentum in Adam is incorporated as an exponentially weighted moving average of the
gradient (an estimate of its first moment); a second moving average of the squared
gradient estimates the second moment, and both estimates are bias-corrected.
• It improves optimization by adapting the learning rate per parameter and considering past
gradients, making it more efficient for training deep models.
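A minimal sketch of the Adam update (the hyperparameter values and toy quadratic objective are illustrative assumptions):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
s = np.zeros(2)                          # first-moment estimate (moving average of gradients)
r = np.zeros(2)                          # second-moment estimate (moving average of squared gradients)
eps, rho1, rho2, delta = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    s = rho1 * s + (1 - rho1) * g        # momentum-like update of the first moment
    r = rho2 * r + (1 - rho2) * g * g    # RMSProp-like update of the second moment
    s_hat = s / (1 - rho1 ** t)          # bias correction for the first moment
    r_hat = r / (1 - rho2 ** t)          # bias correction for the second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)

print(theta)   # approaches the minimum at the origin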
5.4 Choosing the Right Optimization Algorithm

• We looked at different algorithms designed to help optimize deep learning models by
adjusting the learning rate for each model parameter.
• A common question is: Which algorithm should you choose?
• There's no clear answer to this. A study compared many optimization algorithms
across different tasks.
• It found that algorithms with adaptive learning rates, like RMSProp and AdaDelta,
worked well in most cases. But no one algorithm is the best for all situations.
• Right now, the most popular optimization algorithms are SGD, SGD with
momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam.
• The choice of algorithm often depends on which one you’re most comfortable with,
since that makes tuning the hyperparameters easier.
