Deep Learning Module 3
• The training process based on minimizing this average training error is known as
empirical risk minimization.
• Empirical risk minimization assumes optimizing empirical risk will reduce true risk.
• Empirical risk minimization is prone to overfitting, where models memorize the
training data.
• Loss functions like 0-1 loss lack useful derivatives, making empirical risk
minimization challenging for gradient descent.
• Modern deep learning avoids pure empirical risk minimization by instead optimizing a surrogate loss (such as the negative log-likelihood, a differentiable stand-in for 0-1 loss).
The objective function is the expected log-likelihood over the training data:

$$J(\theta) = \mathbb{E}_{(x,y) \sim \hat{p}_\text{data}} \left[ \log p_\text{model}(y \mid x; \theta) \right]$$
• Optimization algorithms that use the entire training set are called batch or
deterministic gradient methods, because they process all of the training examples
simultaneously in a large batch.
• Batch size refers to the number of training examples used in one iteration of model
training in algorithms like minibatch gradient descent.
• Optimization algorithms that use only a single example at a time are sometimes called
stochastic or sometimes online methods.
• The term online is usually reserved for the case where the examples are drawn from a
stream of continually created examples rather than from a fixed-size training set over
which several passes are made.
• Most algorithms used for deep learning fall somewhere in between, using more than one but fewer than all of the training examples. These were traditionally called minibatch or minibatch stochastic methods, and it is now common to simply call them stochastic methods.
The equivalence is easiest to derive when both x and y are discrete. In this case, the generalization error can be written as a sum

$$J^*(\theta) = \sum_x \sum_y p_\text{data}(x, y)\, L(f(x; \theta), y),$$

with the exact gradient

$$g = \nabla_\theta J^*(\theta) = \sum_x \sum_y p_\text{data}(x, y)\, \nabla_\theta L(f(x; \theta), y).$$

An unbiased estimate of this gradient is obtained by sampling a minibatch of examples $\{x^{(1)}, \dots, x^{(m)}\}$ with corresponding targets $y^{(i)}$ and computing the gradient of the loss with respect to the parameters for that minibatch:

$$\hat{g} = \frac{1}{m} \nabla_\theta \sum_{i=1}^{m} L(f(x^{(i)}; \theta), y^{(i)}).$$
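A minimal NumPy sketch of this estimator; the linear model, squared-error loss, data, and step size below are illustrative assumptions, not from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y = X @ w_true + noise (hypothetical setup for illustration)
X = rng.normal(size=(1000, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true + 0.1 * rng.normal(size=1000)

def minibatch_grad(theta, batch_size=32):
    """Unbiased minibatch estimate of the gradient of the mean squared
    error for a linear model f(x; theta) = x @ theta."""
    idx = rng.choice(len(X), size=batch_size, replace=False)
    residual = X[idx] @ theta - y[idx]             # f(x; theta) - y on the batch
    return 2.0 * X[idx].T @ residual / batch_size  # grad of mean((f - y)^2)

theta, eps = np.zeros(5), 0.1
for step in range(500):                # plain minibatch SGD loop
    theta = theta - eps * minibatch_grad(theta)
print(theta)                           # approaches w_true
```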
1.2 Challenges in Neural Network Optimization
Optimization is hard because finding the best solution is generally intractable. In traditional machine learning, problems are often designed to be easier to solve by ensuring they have a simple structure (the convex case). With neural networks, the problems are more complicated and harder to solve (the non-convex case).
Even in the simpler convex case, challenges remain. Some of the main challenges involved in optimizing deep learning models are:
1.2.1 Ill-Conditioning
• One challenge that can come up even when optimizing simpler (convex) functions is
called ill-conditioning of the Hessian matrix (H). This is a common issue in many
types of optimization problems.
• In neural network training, ill-conditioning is a known problem. Ill-conditioning can
manifest by causing SGD to get “stuck” in the sense that even very small steps
increase the cost function.
• A second-order Taylor series expansion of the cost function predicts that a gradient descent step of $-\varepsilon g$ will add $\frac{1}{2}\varepsilon^2 g^\top H g - \varepsilon g^\top g$ to the cost. Ill-conditioning of the gradient becomes a problem when $\frac{1}{2}\varepsilon^2 g^\top H g$ exceeds $\varepsilon g^\top g$, because the step then increases the cost (a small numeric illustration follows).
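A small numeric check of this condition; the Hessian, gradient, and learning rate are made-up values chosen to show the curvature term dominating:

```python
import numpy as np

# Hypothetical ill-conditioned quadratic: curvature 100 along one axis, 1 along the other.
H = np.diag([100.0, 1.0])
g = np.array([1.0, 1.0])
eps = 0.05  # learning rate

curvature_term = 0.5 * eps**2 * g @ H @ g   # (1/2) eps^2 g^T H g
descent_term = eps * g @ g                  # eps g^T g
print(curvature_term, descent_term, curvature_term - descent_term)
# 0.12625 vs 0.1: the predicted change in cost is positive, so the step
# *increases* the cost; only a much smaller eps would make progress.
```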
• Empirically, neural network cost functions show few large obstacles along the training trajectory. The main one is a saddle point near initialization, which SGD escapes easily.
• Most training time is spent traversing a relatively flat region of the cost function, possibly because of high gradient noise, poor conditioning of the Hessian, or the need to take an indirect path around large obstacles.
• Newton’s method is a second-order optimization technique that solves for points where the gradient is zero. Without modification, it can jump to a saddle point, because it is attracted to any critical point (minimum, maximum, or saddle point).
• This is why second-order methods (like Newton’s method) haven’t fully replaced
gradient descent in training neural networks, especially in high dimensions.
• A modified version of Newton’s method, called the saddle-free Newton method, avoids saddle points and performs better than traditional second-order methods. Scaling it to large neural networks remains a challenge, but it shows potential for improvement.
• When a matrix W is multiplied repeatedly over t steps (computing $W^t$), the result depends on its eigenvalues λ: with the eigendecomposition $W = V \operatorname{diag}(\lambda) V^{-1}$, we have $W^t = V \operatorname{diag}(\lambda)^t V^{-1}$ (a short demonstration follows this list).
• If |λ| > 1, the corresponding components explode, causing instability.
• If |λ| < 1, the corresponding components vanish, making it hard to update parameters effectively.
• This scaling issue also affects gradients, leading to the vanishing and exploding
gradient problem.
• Exploding gradients create cliff structures (as discussed earlier), which can cause
instability. Gradient clipping helps manage these issues by limiting gradient size.
• Repeated multiplication by W works like the power method, amplifying components
aligned with the largest eigenvector of W and discarding others.
• Recurrent networks face this problem more acutely because they reuse the same
matrix W at each time step. Feedforward networks avoid much of the issue because
they use different weights for each layer.
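A short NumPy demonstration of this effect; the diagonal matrices and step count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.normal(size=3)

# Two diagonal matrices: eigenvalues above vs. below 1 in magnitude.
W_explode = np.diag([1.2, 0.5, 0.9])   # largest |lambda| > 1
W_vanish = np.diag([0.9, 0.5, 0.3])    # all |lambda| < 1

for W, name in [(W_explode, "explode"), (W_vanish, "vanish")]:
    v = h.copy()
    for _ in range(50):                 # reuse the same W, as in an RNN
        v = W @ v
    print(name, np.linalg.norm(v))
# The 'explode' norm grows like 1.2**50 ~ 9e3; the 'vanish' norm shrinks
# like 0.9**50 ~ 5e-3. Only the component aligned with the largest
# eigenvector survives, exactly as the power method predicts.
```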
1.3.2 Momentum
Momentum is used to speed up optimization, especially when:
• Gradients are small but consistent.
• Gradients are noisy.
• The optimization landscape has high curvature (sharp changes).
• Momentum accumulates a moving average of past gradients, smoothing out updates and
accelerating learning in consistent directions.
Update Rules
1. Velocity update: $v \leftarrow \alpha v - \varepsilon \nabla_\theta J(\theta)$
2. Parameter update: $\theta \leftarrow \theta + v$
(A sketch of these rules appears at the end of this section.)
Acceleration: if the gradients g are consistent (point in the same direction), momentum increases the effective step size.
• With a constant gradient g, the velocity reaches a terminal speed at which the step size is $\varepsilon \lVert g \rVert / (1 - \alpha)$.
• Example: with α = 0.9, the step size is amplified 10× (since 1/(1 − α) = 10) compared to standard gradient descent.
• Momentum helps solve issues like poor conditioning of the Hessian matrix by
smoothing the optimization path. It avoids zig-zagging in narrow valleys, as seen with
standard gradient descent, and efficiently moves along the valley's length, reducing
wasted steps.
• This behaviour allows momentum to accelerate convergence, especially on quadratic
loss functions with elongated contour shapes.
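A minimal sketch of these update rules on a made-up ill-conditioned quadratic; the objective, α = 0.9, and ε are illustrative choices:

```python
import numpy as np

def grad(theta):
    """Gradient of an ill-conditioned quadratic J = 0.5 * theta^T H theta."""
    H = np.diag([100.0, 1.0])   # elongated 'valley' contours
    return H @ theta

theta = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.005         # momentum coefficient and learning rate

for _ in range(200):
    v = alpha * v - eps * grad(theta)   # velocity update
    theta = theta + v                   # parameter update
print(theta)   # moves toward the minimum at the origin without zig-zagging
```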
1.3.3 Nesterov Momentum
Nesterov momentum is a variant of momentum in which the gradient is evaluated after the current velocity has been applied. In convex batch-gradient problems, it improves the convergence rate of the excess error from O(1/k) to O(1/k²) after k steps. In stochastic gradient descent settings, however, it does not improve the rate of convergence.
Velocity update: $v \leftarrow \alpha v - \varepsilon \nabla_\theta J(\theta + \alpha v)$
Parameter update: $\theta \leftarrow \theta + v$
The gradient is evaluated at the look-ahead point $\theta + \alpha v$; this is the only difference from standard momentum.
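The only change from the momentum sketch above is where the gradient is evaluated (same toy quadratic assumptions):

```python
import numpy as np

def grad(theta):
    H = np.diag([100.0, 1.0])
    return H @ theta

theta = np.array([1.0, 1.0])
v = np.zeros(2)
alpha, eps = 0.9, 0.005

for _ in range(200):
    lookahead = theta + alpha * v          # apply the current velocity first
    v = alpha * v - eps * grad(lookahead)  # Nesterov velocity update
    theta = theta + v
print(theta)
```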
1.4 Parameter Initialization Strategies
Symmetry breaking is crucial: if two hidden units start with identical parameters and receive the same inputs, a deterministic algorithm updates them identically, so they never learn different features. Random initialization from a high-entropy distribution ensures diversity across units without the computational cost of methods like Gram-Schmidt orthogonalization.
Weights are typically initialized using Gaussian or uniform distributions. The scale of
the initialization is critical:
• Large initial weights help break symmetry but can lead to exploding gradients or saturation of activation functions.
• Too-small weights can suppress activation values, slowing learning (see the sketch below).
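A hedged sketch of scale-sensitive weight initialization; the Glorot/Xavier-style uniform range used here is one common heuristic, not one prescribed by these notes:

```python
import numpy as np

def init_weights(fan_in, fan_out, rng=np.random.default_rng(0)):
    """Glorot/Xavier-style uniform initialization: the scale shrinks as the
    layer gets wider, keeping activations and gradients at a moderate size."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W = init_weights(256, 128)
print(W.std())   # roughly sqrt(2 / (fan_in + fan_out)) ~ 0.072
```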
Bias Initialization:
• Biases can usually be initialized to zero. Common exceptions: setting the bias of ReLU hidden units to a small positive value so they are active at initialization, and setting output-unit biases to match the marginal statistics of the targets.
1.5 Algorithms with Adaptive Learning Rates
1. The learning rate is one of the most important and most difficult hyperparameters to set. Momentum helps, but adds another hyperparameter. A better approach is to use a separate learning rate for each parameter and adjust these rates automatically throughout training.
2. Delta-Bar-Delta Algorithm (Jacobs, 1988):
o Idea: adapt each parameter’s learning rate individually, based on the history of the sign of its partial derivative.
o Concept: if the partial derivative for a parameter keeps the same sign, increase that parameter’s learning rate; if the sign changes, decrease it. A sketch of this rule follows.
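A hedged sketch of the delta-bar-delta rule; the hyperparameters kappa, phi, and decay, the helper name, and the toy objective are illustrative assumptions:

```python
import numpy as np

def delta_bar_delta_step(theta, grad_fn, lr, g_bar,
                         kappa=0.01, phi=0.5, decay=0.7):
    """One delta-bar-delta step. lr and g_bar are per-parameter arrays:
    lr holds individual learning rates, g_bar a smoothed past gradient."""
    g = grad_fn(theta)
    agree = np.sign(g) == np.sign(g_bar)        # same direction as the running average?
    lr = np.where(agree, lr + kappa, lr * phi)  # additive increase, multiplicative decrease
    g_bar = decay * g_bar + (1 - decay) * g     # update the 'bar' (smoothed) gradient
    return theta - lr * g, lr, g_bar

# Toy usage on a quadratic bowl
grad_fn = lambda th: 2.0 * th
theta = np.array([5.0, -3.0])
lr = np.full(2, 0.01)
g_bar = np.zeros(2)
for _ in range(100):
    theta, lr, g_bar = delta_bar_delta_step(theta, grad_fn, lr, g_bar)
print(theta)   # learning rates ramp up while the gradient sign is stable
```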
1.5.1 AdaGrad
• The AdaGrad algorithm adapts learning rates for each model parameter based on the
sum of squared gradients.
• Parameters with large gradients have smaller learning rates, while those with small
gradients have larger ones. This helps make more progress in flatter directions of the
parameter space.
• Theoretically good for convex problems.
• In deep learning, accumulating squared gradients can overly reduce the learning rate,
making training slow.
• AdaGrad works well for some models but not all.
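A compact sketch of the AdaGrad update described above; ε, δ, and the toy objective are illustrative choices:

```python
import numpy as np

def adagrad_step(theta, g, r, eps=0.1, delta=1e-7):
    """AdaGrad: accumulate squared gradients per parameter, then scale each
    parameter's step by the inverse square root of its accumulated value."""
    r = r + g * g                                    # running sum of squared gradients
    theta = theta - (eps / (delta + np.sqrt(r))) * g
    return theta, r

grad_fn = lambda th: np.array([100.0, 1.0]) * th     # steep vs. flat coordinate
theta = np.array([1.0, 1.0])
r = np.zeros(2)
for _ in range(500):
    theta, r = adagrad_step(theta, grad_fn(theta), r)
print(theta)
# Although the raw gradients differ by 100x, both coordinates make identical
# progress: AdaGrad equalizes progress across steep and flat directions.
```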
1.5.2 RMSProp