
Module 3

OPTIMIZATION FOR TRAINING DEEP MODELS


In the context of deep learning, optimization refers to the process of adjusting a model's
parameters to minimize (or maximize) an objective function computed on the input data.
The objective function, often called a cost or loss function, measures how well the model's
predictions match the true outputs or labels in the dataset.

1. How Learning Differs from Pure Optimization


➢ In machine learning, we aim to improve a performance measure P (like accuracy on
new data) rather than directly optimizing it. Instead, we optimize a different cost
function J(θ) on training data, hoping it will improve P.
➢ Unlike in pure optimization, where minimizing J is the final goal, machine learning
optimizes J to indirectly improve the model’s performance on unseen data.
➢ The training cost function J(θ) is usually an average over all training examples,
written as:
J(θ) = E_(x,y)∼p̂_data L(f(x; θ), y)
➢ Here:
• L is the loss per example,
• f(x;θ) is the model’s prediction for input x,
• p̂_data is the empirical distribution defined by the training data.
➢ In supervised learning, y is the known target output, and the cost function depends on
the difference between f(x;θ) (the model’s prediction) and y (the target).
➢ This setup can be adapted for different purposes, such as:
• Adding regularization (including θ or x in the cost function),
• Applying it to unsupervised learning (excluding y from the arguments).
➢ Ideally, we would want to minimize an objective function defined over the true data
distribution Pdata, not just the training data. This ideal cost function is:
J*(θ) = E_(x,y)∼p_data L(f(x; θ), y)

1.1 Empirical Risk Minimization


• Empirical risk is the average loss computed over a given set of training data.
• The goal of a machine learning algorithm is to reduce the expected generalization
error, known as the risk.
• The expectation is taken over the true data-generating distribution p_data(x, y).
• If p_data(x, y) were known, risk minimization would be a straightforward optimization
task.
• When p_data(x, y) is unknown but we have a training set, it becomes a machine learning
problem.
• Machine learning is converted to an optimization problem by minimizing the
expected loss on the training set.
• This involves replacing p_data(x, y) with the empirical distribution p̂_data(x, y).
• Empirical risk is minimized using the formula:

E_(x,y)∼p̂_data [L(f(x; θ), y)] = (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)),

where m is the number of training examples.
• The training process based on minimizing this average training error is known as
empirical risk minimization.
• Empirical risk minimization assumes optimizing empirical risk will reduce true risk.
• Empirical risk minimization is prone to overfitting, where models memorize the
training data.
• Loss functions like 0-1 loss lack useful derivatives, making empirical risk
minimization challenging for gradient descent.
• Modern deep learning avoids pure empirical risk minimization by optimizing a
different quantity.
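To make the averaging in the formula above concrete, here is a minimal Python sketch (the squared-error loss and linear model are illustrative assumptions, not choices made in the text):

import numpy as np

def empirical_risk(params, X, y, predict, loss):
    """Average per-example loss over the training set, as in the formula above."""
    predictions = predict(params, X)        # f(x; theta) for every training example
    return np.mean(loss(predictions, y))    # (1/m) * sum_i L(f(x_i; theta), y_i)

# Illustrative choices: a linear model with squared-error loss
predict = lambda w, X: X @ w
squared_loss = lambda y_hat, y: (y_hat - y) ** 2

X = np.random.randn(100, 3)
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * np.random.randn(100)

print(empirical_risk(np.zeros(3), X, y, predict, squared_loss))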

1.2 Surrogate Loss Functions and Early Stopping


• Sometimes, the loss function we want to minimize, like classification error, is hard to
optimize, especially when the problem is complex. In these cases, we use a simpler
loss function, called a surrogate loss, that’s easier to work with but still helps
improve the model’s performance.
• For example, the negative log-likelihood is often used as a substitute for the 0-1 loss
because it allows the model to estimate the probability of each class and make better
predictions on average.
• In some cases, using a surrogate loss actually allows the model to learn more. For
instance, when training with the log-likelihood surrogate, the 0-1 loss on the test set
might continue to decrease even after the 0-1 loss on the training set has reached zero.
• This happens because, even when the expected classification error is zero, the model
can still become more confident by pushing the classes further apart, making the
classifier more robust and reliable. This process extracts more useful information
from the training data than just minimizing the 0-1 loss.
• A key difference between optimization for training deep models and general
optimization is that training algorithms don't stop when they reach a local minimum.
• Instead, the algorithm usually minimizes the surrogate loss function but halts when a
convergence criterion is met, which is typically based on the true underlying loss
function, such as 0-1 loss measured on a validation set.
• The stopping condition is designed to prevent overfitting by halting training as soon
as overfitting begins.
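The early-stopping behaviour described above can be sketched as a simple training loop (the model interface, train_step, and zero_one_error used here are placeholders for illustration, not a specific library API):

def train_with_early_stopping(model, train_step, zero_one_error, val_data,
                              max_epochs=200, patience=10):
    """Minimize the surrogate loss, but halt based on validation 0-1 loss."""
    best_error = float("inf")
    best_params = model.get_params()
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_step(model)                            # one epoch of surrogate-loss minimization
        val_error = zero_one_error(model, val_data)  # true underlying loss on held-out data

        if val_error < best_error:
            best_error = val_error
            best_params = model.get_params()         # remember the best parameters so far
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                                # overfitting has likely begun

    model.set_params(best_params)
    return model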

1.3 Batch and Minibatch Algorithms


• One aspect of machine learning algorithms that separates them from general
optimization algorithms is that the objective function usually decomposes as a sum
over the training examples.
In machine learning, optimization involves adjusting model parameters to minimize the
objective function, which often averages over training examples.

The goal is to maximize the log-likelihood of the data:

θ_ML = argmax_θ Σ_{i=1}^m log p_model(x^(i), y^(i); θ)

The objective function is the expected log-likelihood over the training data:

J(θ) = E_(x,y)∼p̂_data log p_model(x, y; θ)

The gradient of the objective function, used to update parameters, is:

∇_θ J(θ) = E_(x,y)∼p̂_data ∇_θ log p_model(x, y; θ)

• Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
• Batch size refers to the number of training examples used in one iteration of model
training in algorithms like minibatch gradient descent.
• Optimization algorithms that use only a single example at a time are sometimes called
stochastic or sometimes online methods.
• The term online is usually reserved for the case where the examples are drawn from a
stream of continually created examples rather than from a fixed-size training set over
which several passes are made.
• Most algorithms used for deep learning fall somewhere in between, using more than
one but fewer than all of the training examples. These were traditionally called
minibatch or minibatch stochastic methods, and it is now common to simply call
them stochastic methods.
Minibatch SGD can be viewed as following the gradient of the true generalization error, so
long as no examples are repeated. The equivalence is easiest to derive when both x and y are
discrete. In this case, the generalization error can be written as

J*(θ) = Σ_x Σ_y p_data(x, y) L(f(x; θ), y),

with the exact gradient

g = ∇_θ J*(θ) = Σ_x Σ_y p_data(x, y) ∇_θ L(f(x; θ), y).

Sampling a minibatch of examples {x^(1), ..., x^(m)} with corresponding targets y^(i) from
p_data and computing the gradient of the loss with respect to the parameters for that minibatch,

ĝ = (1/m) ∇_θ Σ_i L(f(x^(i); θ), y^(i)),

gives an unbiased estimate of the exact gradient g.
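A minimal numpy sketch of this estimator (using a linear model with squared-error loss purely as illustrative choices) samples a minibatch and averages the per-example gradients:

import numpy as np

def minibatch_gradient(theta, X, y, batch_size, rng):
    """Unbiased estimate: (1/m) * gradient of sum_i L(f(x_i; theta), y_i)."""
    idx = rng.choice(len(X), size=batch_size, replace=False)   # sample a minibatch
    Xb, yb = X[idx], y[idx]
    residual = Xb @ theta - yb            # f(x; theta) - y for squared-error loss
    return (2.0 / batch_size) * Xb.T @ residual

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
theta_true = rng.normal(size=5)
y = X @ theta_true + 0.01 * rng.normal(size=1000)

g_hat = minibatch_gradient(np.zeros(5), X, y, batch_size=32, rng=rng)
print(g_hat)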
2. Challenges in Neural Network Optimization

Optimization is difficult because finding the best solution is hard. In traditional machine
learning, problems are often designed to be easier to solve by ensuring they have a simple
structure (the convex case). With neural networks, the problems are non-convex, which makes
them considerably harder to solve.
Even in the convex case there are still challenges. Some of the main
challenges involved in optimizing deep learning models are:

2.1 Ill-Conditioning
• One challenge that can come up even when optimizing simpler (convex) functions is
called ill-conditioning of the Hessian matrix (H). This is a common issue in many
types of optimization problems.
• In neural network training, ill-conditioning is a known problem. Ill-conditioning can
manifest by causing SGD to get “stuck” in the sense that even very small steps
increase the cost function.
• A second-order Taylor series expansion of the cost function predicts that a gradient
descent step of −εg will add (1/2)ε²gᵀHg − εgᵀg to the cost. Ill-conditioning becomes a
problem when (1/2)ε²gᵀHg exceeds εgᵀg.

• To see if ill-conditioning is slowing down the training of a neural network, we can
look at two things:
• the squared gradient norm (gᵀg)
• the term involving the Hessian matrix (gᵀHg).
• In many cases, the gradient norm doesn't shrink much as training goes on, but the
gᵀHg term increases a lot. This causes learning to slow down even though the
gradient (the signal that tells the model how to improve) is still strong.
• The reason for this is that the learning rate has to be reduced to avoid making steps
that are too big due to the stronger curvature of the cost function.
• For example, the gradient norm can increase substantially over the course of successful
training of a neural network.
• In some cases, like training a convolutional network, the gradient (the signal guiding
the model) keeps increasing during training instead of decreasing, which is unusual
for a converging model.
• Despite the rising gradient, the model still works well, with its performance
improving over time and the classification error dropping on the validation set.
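As a minimal illustration of the two quantities above, consider a synthetic quadratic cost J(θ) = ½ θᵀHθ with an ill-conditioned Hessian (an assumption made purely for demonstration); the curvature term gᵀHg can dominate the squared gradient norm gᵀg:

import numpy as np

H = np.diag([1.0, 100.0])          # ill-conditioned Hessian (condition number 100)
theta = np.array([1.0, 1.0])
epsilon = 0.01                     # learning rate

for step in range(5):
    g = H @ theta                                  # gradient of the quadratic cost
    gTg = g @ g                                    # squared gradient norm
    gTHg = g @ H @ g                               # curvature term
    # Cost change predicted by the second-order Taylor expansion for a step of -epsilon * g
    predicted_change = 0.5 * epsilon**2 * gTHg - epsilon * gTg
    print(f"step {step}: gTg={gTg:.2f}  gTHg={gTHg:.2f}  predicted change={predicted_change:.4f}")
    theta = theta - epsilon * g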

2.2 Local Minima


• A convex optimization problem is one where the function you're trying to minimize
has a special property: any local minimum is also a global minimum. This means
that if you find a minimum, you know it’s the best possible solution.
• Some convex functions might have a flat region (where the function stays the same)
at the bottom, but any point in that flat region is also a good solution.
• For non-convex functions (like neural networks), there can be many local minima.
A local minimum is a point where the function value is lower than its neighbors, but
there might be other points with even lower values.
• Neural networks, in particular, have very many local minima, but in practice this turns out not to be a major problem.
• A model is identifiable if it can be trained with enough data to narrow down the best
set of parameters (settings). If a model isn’t identifiable, it means there could be many
ways to set its parameters that give equally good results.
• Latent variables in models (hidden variables not directly observed) can make models
non-identifiable because different settings of these variables can lead to the same
output.
• In neural networks, there’s a problem called weight space symmetry. This happens
when the network can swap the weights (connections) between certain layers or units
and still give the same result.
• For example, in a network with several layers and units, you can swap the weights
between units in the same layer in many ways (like swapping weights for unit 1 with
unit 2). This leads to many different local minima that look different but are
essentially the same solution.

2.3 Plateaus, Saddle Points and Other Flat Regions


• In high-dimensional non-convex functions, saddle points are more common than
local minima (or maxima). A saddle point is a point where the function has a
gradient of zero, but it isn’t a minimum or maximum.
• Around a saddle point, the function can go higher (like a maximum) in some
directions and lower (like a minimum) in others.
• The Hessian matrix, which helps describe how the function curves, has both positive
and negative eigenvalues at a saddle point. In directions with positive eigenvalues,
the function value increases (higher cost), and in directions with negative
eigenvalues, it decreases (lower cost).
• Saddle points are tricky for optimization algorithms because the gradient (the
direction of steepest descent) becomes very small near them, making it hard for the
algorithm to know which way to go.
• Gradient descent, which is designed to move downhill, often has a hard time with
saddle points. However, empirically, it seems to escape from saddle points in many
cases. Visualizations from Goodfellow et al. (2015) show that, even near a saddle
point (where all the weights are zero), gradient descent can find a way out.


• Visualizations of neural network cost functions show few prominent obstacles. The main
one is a saddle point near the initialization, but SGD escapes it quickly.
• Most of the training time is spent traversing a relatively flat region of the cost function,
possibly because of high noise in the gradient, poor conditioning of the Hessian, or the
need to navigate around large obstacles.
• Newton’s method is a more advanced technique for optimization that’s designed to
find points where the gradient is zero. However, without adjustments, it can get stuck
at a saddle point because it’s designed to find any critical point (minima, maxima, or
saddle points).
• This is why second-order methods (like Newton’s method) haven’t fully replaced
gradient descent in training neural networks, especially in high dimensions.
• A modified version of Newton's method, called the saddle-free Newton method,
avoids saddle points and works better than traditional second-order methods. However,
scaling this method to large neural networks is still a challenge, but it shows
potential for improvement.

2.4 Cliffs and Exploding Gradients


• Neural networks with many layers can have very steep regions, like cliffs, caused by
large weights. In these areas, the gradient update can move the parameters too far,
sometimes causing them to jump off the cliff.
• To avoid this problem, gradient clipping is used (see the sketch after this list). It limits
the size of the gradient update, reflecting the fact that the gradient indicates only the best
direction to move in, not the best step size.

• In highly nonlinear deep neural networks or recurrent neural networks, the
objective function can have sharp nonlinearities due to the multiplication of several
parameters. These nonlinearities can cause very high derivatives in some areas.
• When the parameters get close to these cliff regions, a gradient descent update can
move the parameters too far, possibly undoing much of the optimization work that has
been done.
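A minimal sketch of norm-based gradient clipping (the threshold and learning rate are arbitrary illustrative values, not ones prescribed by the text):

import numpy as np

def clip_gradient(g, threshold):
    """Rescale g so that its norm does not exceed threshold, keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

# Example: a cliff-like gradient gets rescaled before the parameter update
g = np.array([300.0, -400.0])                # norm 500: far too large a step
g_clipped = clip_gradient(g, threshold=5.0)  # norm reduced to 5, same direction
theta = np.zeros(2) - 0.1 * g_clipped
print(g_clipped, theta)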

2.5 Long-Term Dependencies


• Neural networks with many layers, like feedforward and recurrent networks, create
deep computational graphs, which can lead to challenges during training. Recurrent
networks, in particular, repeatedly apply the same operation across time steps, making
these issues more severe.

• When a matrix W is multiplied repeatedly over t steps, the result Wᵗ depends on its
eigenvalues (λ).
• If ∣λ∣>1, the values explode, causing instability.
• If ∣λ∣<1, the values vanish, making it hard to update parameters effectively.
• This scaling issue also affects gradients, leading to the vanishing and exploding
gradient problem.
• Exploding gradients create cliff structures (as discussed earlier), which can cause
instability. Gradient clipping helps manage these issues by limiting gradient size.
• Repeated multiplication by W works like the power method, amplifying components
aligned with the largest eigenvector of W and discarding others.
• Recurrent networks face this problem more acutely because they reuse the same
matrix W at each time step. Feedforward networks avoid much of the issue because
they use different weights for each layer.
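A small numpy demonstration of this scaling effect (the diagonal matrices are arbitrary examples chosen so the eigenvalues are easy to see):

import numpy as np

t = 50                             # number of repeated applications (e.g., time steps)
W_explode = np.diag([1.2, 0.9])    # largest eigenvalue > 1
W_vanish = np.diag([0.9, 0.5])     # all eigenvalues < 1
x = np.array([1.0, 1.0])

print(np.linalg.matrix_power(W_explode, t) @ x)  # component with |lambda| > 1 explodes
print(np.linalg.matrix_power(W_vanish, t) @ x)   # all components shrink toward zero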

2.6 Inexact Gradients


• Most optimization methods assume we have the exact gradient or Hessian matrix,
but in practice, we often only have noisy or biased estimates.
• Deep learning algorithms commonly estimate gradients using minibatches of training
data, which introduces some noise.
• For some advanced models, the objective function and its gradient are intractable
(too complex to compute exactly), so we rely on approximations.

2.7 Poor Correspondence between Local and Global Structure


• Optimization problems can still arise even after overcoming local issues like cliffs,
saddle points, or poorly conditioned gradients.
• The local direction of improvement may not lead to better results globally, meaning
the path taken during training doesn’t reach areas of much lower cost.
• This mismatch between local and global optimization makes it harder to achieve the
best possible results, even if there are no local minima or saddle points.
• Future research is needed to better understand these issues and improve the training
process.
• Optimization methods that rely on moving downhill locally can fail if the local slope
doesn’t lead toward the global solution, even when there are no saddle points or local
minima.
• Some cost functions don’t have minima, only asymptotes (values that decrease
without ever reaching a minimum). If the training starts on the wrong side of a
“mountain,” the algorithm may struggle to traverse it.
• In higher dimensions, algorithms can often go around such mountains, but this can
make the training path long and increase the training time.

2.8 Theoretical Limits of Optimization


• There are theoretical limits to how well any optimization algorithm can perform on
neural networks, but these usually don’t affect practical applications.
• Some results apply only to networks with units that output discrete values, but most
neural network units output smooth values, making optimization easier using local
search.
• Certain problem types are proven to be intractable, but it’s often unclear if a specific
problem belongs to such a category.

3. Basic Algorithms


3.1 Stochastic Gradient Descent
• Stochastic gradient descent (SGD) and its variants are probably the most used
optimization algorithms for machine learning in general and for deep learning in
particular.
• It is possible to obtain an unbiased estimate of the gradient by taking the average
gradient on a minibatch of m examples drawn from the data-generating distribution.
• A crucial parameter for the SGD algorithm is the learning rate. Previously, we have
described SGD as using a fixed learning rate ε.
• In practice, it is necessary to gradually decrease the learning rate over time, so we
now denote the learning rate at iteration k as ε_k.
• A sufficient condition to guarantee convergence of SGD is that

Σ_{k=1}^∞ ε_k = ∞  and  Σ_{k=1}^∞ ε_k² < ∞.

In practice, it is common to decay the learning rate linearly until iteration τ:

ε_k = (1 − α) ε_0 + α ε_τ,  with α = k/τ; after iteration τ, ε is left constant.
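A minimal SGD loop with this linear decay schedule (the data, the linear model with squared-error loss, and the hyperparameter values are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=500)

theta = np.zeros(3)
eps0, eps_tau, tau = 0.1, 0.001, 200     # initial rate, final rate, decay horizon
batch_size = 32

for k in range(400):
    alpha = min(k / tau, 1.0)
    eps_k = (1 - alpha) * eps0 + alpha * eps_tau     # linear decay, then held constant
    idx = rng.choice(len(X), size=batch_size, replace=False)
    g = (2.0 / batch_size) * X[idx].T @ (X[idx] @ theta - y[idx])   # minibatch gradient
    theta = theta - eps_k * g

print(theta)   # should end up close to theta_true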

3.2 Momentum
Momentum is used to speed up optimization, especially when:
• Gradients are small but consistent.
• Gradients are noisy.
• The optimization landscape has high curvature (sharp changes).

• Momentum accumulates a moving average of past gradients, smoothing out updates and
accelerating learning in consistent directions.

• A new variable, v, represents the velocity of parameters in the optimization space.

Update Rules

1. Velocity update:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ), y^(i)) ]

o α: Momentum hyperparameter (controls decay of past gradients, 0 ≤ α < 1).
o ε: Learning rate.
o ∇_θ: Gradient of the loss function with respect to the parameters.
o m: Minibatch size.

2. Parameter update:

θ ← θ + v

Acceleration: If the gradients g are consistent (point in the same direction), momentum increases
the effective step size.
• Final velocity reaches a terminal speed: Terminal speed = ε‖g‖ / (1 − α).
• Example: With α = 0.9, the step size is amplified 10× compared to standard gradient
descent.

Newton's Second Law: the momentum update can be interpreted as simulating a particle of
unit mass with position θ(t) moving through parameter space.

The force causes acceleration:

f(t) = ∂²θ(t)/∂t²

Instead of a second-order equation, introducing the velocity v(t) simplifies the dynamics into a
pair of first-order equations:

• Velocity is the rate of change of position: v(t) = ∂θ(t)/∂t
• Force is the rate of change of velocity: ∂v(t)/∂t = f(t)
• Momentum helps solve issues like poor conditioning of the Hessian matrix by
smoothing the optimization path. It avoids zig-zagging in narrow valleys, as seen with
standard gradient descent, and efficiently moves along the valley's length, reducing
wasted steps.
• This behaviour allows momentum to accelerate convergence, especially on quadratic
loss functions with elongated contour shapes.
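A minimal sketch of the momentum update above (the ill-conditioned quadratic objective with elongated contours is an illustrative assumption chosen to show the smoothing effect):

import numpy as np

A = np.diag([1.0, 50.0])           # quadratic cost J(theta) = 0.5 * theta^T A theta
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
v = np.zeros(2)                    # velocity
alpha, eps = 0.9, 0.02             # momentum coefficient and learning rate

for _ in range(100):
    g = grad(theta)
    v = alpha * v - eps * g        # accumulate a decaying average of past gradients
    theta = theta + v              # move by the velocity, not the raw gradient

print(theta)                       # moves toward the minimum at the origin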
3.3 Nesterov Momentum

Nesterov Momentum improves upon standard momentum by evaluating the gradient
after applying the current velocity update. This "look-ahead" approach allows it to add a
correction factor, leading to better convergence paths.

In convex problems, it significantly accelerates convergence, reducing the error rate from
O(1/k) to O(1/k^2). However, in stochastic gradient descent scenarios, it does not
improve convergence rates.

Velocity update:

v ← αv − ε ∇_θ [ (1/m) Σ_{i=1}^m L(f(x^(i); θ + αv), y^(i)) ]

Parameter update: θ ← θ + v
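The only change from the standard momentum sketch above is where the gradient is evaluated; a minimal illustration (reusing the same illustrative quadratic objective):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.02

for _ in range(100):
    g = grad(theta + alpha * v)    # "look-ahead": gradient at the interim point theta + alpha*v
    v = alpha * v - eps * g
    theta = theta + v

print(theta)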

4. Parameter Initialization


Deep learning models are sensitive to initial conditions, and poor initialization can lead to
failure or slow convergence. The initial point affects convergence speed, solution quality,
and generalization.

Symmetry breaking is crucial to avoid identical behaviour in units, which would prevent
learning. Random initialization from high-entropy distributions ensures diversity across
units without the computational cost of methods like Gram-Schmidt orthogonalization.

Weights are typically initialized using Gaussian or uniform distributions. The scale of
the initialization is critical:

• Large initial weights help break symmetry but can lead to exploding gradients or
saturation of activation functions.
• Too-small weights can suppress activation values, slowing learning.

Common heuristics include:


• Initializing each weight from a uniform distribution U(−1/√m, 1/√m), where m is the
number of inputs to the layer.
• Glorot Initialization (2010) uses a scaled distribution based on the sum of the input
units m and output units n, suggesting the normalized initialization

W_{i,j} ∼ U( −√(6/(m+n)), √(6/(m+n)) )

• Orthogonal Initialization suggests using random orthogonal matrices to maintain
activation and gradient norms, especially for deep networks.
• One drawback to scaling rules that set all of the initial weights to have the same
standard deviation, such as 1 /√m, is that every individual weight becomes extremely
small when the layers become large. Martens (2010) introduced an alternative
initialization scheme called sparse initialization in which each unit is initialized to
have exactly k non-zero weights.
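A minimal sketch of the normalized (Glorot) initialization described above, with random sampling providing the symmetry breaking (the layer sizes are arbitrary example values):

import numpy as np

def glorot_uniform(m, n, rng):
    """Sample an n-by-m weight matrix from U(-sqrt(6/(m+n)), sqrt(6/(m+n)))."""
    limit = np.sqrt(6.0 / (m + n))
    return rng.uniform(-limit, limit, size=(n, m))

rng = np.random.default_rng(0)
W1 = glorot_uniform(784, 256, rng)   # e.g., 784 inputs feeding 256 hidden units
b1 = np.zeros(256)                   # biases are often started at zero (see below)
print(W1.shape, float(W1.std()))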

Bias Initialization:

• Biases are often set heuristically. For instance:


o Output biases may be initialized to match the expected marginal statistics of
the output.
o ReLU units' biases might be set to slightly positive values (e.g., 0.1) to avoid
saturation.
o LSTM forget gate biases are sometimes initialized to 1 to ensure proper
learning.

Variance/Precision Parameter Initialization: Parameters like the variance (β) in models such
as linear regression can typically be initialized to 1.

Learning-Based Initialization: Sometimes, initialization can be done by training an
unsupervised model on the same inputs or using a supervised task related to the main task,
providing better convergence and generalization.

Hyperparameter Search: The choice of initialization scale is often treated as a
hyperparameter, and techniques like random search can help find optimal values.

5. Algorithms with Adaptive Learning Rates

1. The learning rate is crucial but hard to set. Momentum helps, but adds another
hyperparameter. A better approach might be to use different learning rates for each
parameter, adjusting them automatically during training.
2. Delta-Bar-Delta Algorithm (Jacobs, 1988):
o Idea: Adjust each parameter's learning rate based on the behaviour of its partial
derivative during training.
o Concept: If the partial derivative keeps the same sign, increase the learning rate for
that parameter; if it keeps changing sign, decrease it.
5.1 AdaGrad

• The AdaGrad algorithm adapts learning rates for each model parameter based on the
sum of squared gradients.
• Parameters with large gradients have smaller learning rates, while those with small
gradients have larger ones. This helps make more progress in flatter directions of the
parameter space.
• Theoretically good for convex problems.
• In deep learning, accumulating squared gradients can overly reduce the learning rate,
making training slow.
• AdaGrad works well for some models but not all.

5.2 RMSProp

• RMSProp (Hinton, 2012) improves on AdaGrad by using an exponentially weighted
moving average of the squared gradients instead of accumulating them from the start of
training.
• This helps in non-convex settings, like neural networks, where AdaGrad might slow
down.
• Discards old gradients, allowing faster convergence when the model reaches a convex
region.
• Trains more effectively by adapting learning rates, avoiding excessive slowdowns.
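A minimal sketch contrasting the two accumulators (AdaGrad's running sum of squared gradients versus RMSProp's exponentially weighted moving average); the decay rate, damping constant, learning rate, and toy quadratic objective are illustrative assumptions:

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

eps, rho, delta = 0.5, 0.9, 1e-7     # learning rate, RMSProp decay rate, numerical damping

theta_ada, r_ada = np.array([5.0, 5.0]), np.zeros(2)
theta_rms, r_rms = np.array([5.0, 5.0]), np.zeros(2)

for _ in range(100):
    g = grad(theta_ada)
    r_ada = r_ada + g * g                      # AdaGrad: accumulate all squared gradients
    theta_ada = theta_ada - eps * g / (delta + np.sqrt(r_ada))

    g = grad(theta_rms)
    r_rms = rho * r_rms + (1 - rho) * g * g    # RMSProp: discard old gradients gradually
    theta_rms = theta_rms - eps * g / (delta + np.sqrt(r_rms))

print(theta_ada, theta_rms)   # compare how far each has moved toward the minimum at the origin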
5.3 Adam

• Adam (Kingma and Ba, 2014) combines ideas from RMSProp and momentum.
• Momentum in Adam is incorporated as an exponentially weighted moving average of the
gradient (an estimate of its first moment); a second moving average of the squared
gradient estimates the second moment, and both estimates are bias-corrected.
• It improves optimization by adapting the learning rate per parameter and considering past
gradients, making it more efficient for training deep models.
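A minimal sketch of the Adam update (the hyperparameter values and toy quadratic objective are illustrative assumptions):

import numpy as np

A = np.diag([1.0, 50.0])
grad = lambda theta: A @ theta

theta = np.array([5.0, 5.0])
s = np.zeros(2)                          # first-moment estimate (moving average of gradients)
r = np.zeros(2)                          # second-moment estimate (moving average of squared gradients)
eps, rho1, rho2, delta = 0.1, 0.9, 0.999, 1e-8

for t in range(1, 201):
    g = grad(theta)
    s = rho1 * s + (1 - rho1) * g        # momentum-like update of the first moment
    r = rho2 * r + (1 - rho2) * g * g    # RMSProp-like update of the second moment
    s_hat = s / (1 - rho1 ** t)          # bias correction for the first moment
    r_hat = r / (1 - rho2 ** t)          # bias correction for the second moment
    theta = theta - eps * s_hat / (np.sqrt(r_hat) + delta)

print(theta)   # approaches the minimum at the origin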
5.4 Choosing the Right Optimization Algorithm

• We looked at different algorithms designed to help optimize deep learning models by
adjusting the learning rate for each model parameter.
• A common question is: Which algorithm should you choose?
• There's no clear answer to this. A study compared many optimization algorithms
across different tasks.
• It found that algorithms with adaptive learning rates, like RMSProp and AdaDelta,
worked well in most cases. But no one algorithm is the best for all situations.
• Right now, the most popular optimization algorithms are SGD, SGD with
momentum, RMSProp, RMSProp with momentum, AdaDelta, and Adam.
• The choice of algorithm often depends on which one you’re most comfortable with,
since that makes tuning the hyperparameters easier.
