
Chapter 4: Optimization Algorithm

Revised: 2025/04/01

1 Optimization Problem in Deep Learning


In deep learning, the optimization problem can typically be formulated as follows:

min_θ L(θ)    (1)

where L(θ) is the loss function, representing the discrepancy between the predicted outputs of the neural
network and the actual target values from the training data, and θ denotes the parameters (weights and
biases) of the model.

1.1 Gradients and Optimization Process


The role of gradients in the optimization process is central. The gradient of the loss function, denoted
by ∇θ L(θ), guides the direction in which the parameters need to be updated to minimize the loss:

θnew = θold − α∇θ L(θold ) (2)

where α is the learning rate, a crucial hyperparameter that influences the step size of the parameter updates.

1.2 Key Problems in Optimization


However, there are several challenges in this optimization process:

1. Choosing the Right Learning Rate (α): If α is too small, the model will converge very slowly,
requiring many iterations. If α is too large, the model may overshoot the minimum or even diverge.

2. Non-Convexity of Loss Functions: In deep learning, loss functions are generally non-convex, leading
to the presence of many local minima, saddle points, and plateaus, which complicates the optimization
process.

3. High-Dimensional Parameter Space: Deep learning models often have a very large number of
parameters, making the optimization landscape extremely high-dimensional and complex.


1.3 Local and Global Optimization


1.3.1 Local Optimum

A local optimum of a function f : Rn → R is a point xlocal ∈ Rn such that there exists an ϵ > 0 where
for all x ∈ Rn within the neighborhood Nϵ (xlocal ) = {x | ∥x − xlocal ∥ < ϵ}, the following condition holds:

f (xlocal ) ≤ f (x). (3)

This definition implies that within this ϵ-neighborhood, xlocal provides the minimum value of the function f ,
but outside this neighborhood, there might be points where f attains lower values.

1.3.2 Global Optimum

A global optimum of a function f : Rn → R is a point xglobal ∈ Rn such that for all x ∈ Rn , the following
condition holds:
f (xglobal ) ≤ f (x). (4)

This definition implies that xglobal provides the minimum value of the function f across the entire domain.

1.3.3 Achieving Global Optimization

Global optimization can be achieved under specific conditions:


Convexity: If the function f is convex over its domain, and the domain is a convex set, then any local
minimum is also a global minimum. Mathematically, a function f is convex if for all x, y ∈ Rn and for all
λ ∈ [0, 1], the following condition holds:

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y). (5)

Strong Convexity and Smoothness: Additionally, if the function f is strongly convex and smooth,
not only does the global minimum exist, but it also guarantees convergence rates for optimization algorithms.
These conditions are sufficient but not necessary for global optimization. In non-convex problems typ-
ical of deep learning, reaching a global optimum is not guaranteed, but heuristic algorithms may still find
satisfactory solutions.

2 Hessian matrix
In deep learning, the Hessian matrix is a square matrix of second-order partial derivatives of a scalar-
valued function, typically the loss function, with respect to its input parameters. This matrix is crucial for
understanding the local curvature of the loss surface, which in turn affects the convergence properties of
optimization algorithms used in training deep neural networks.
Consider a neural network with a loss function L(θ), where θ = (θ1 , θ2 , ..., θn )T represents the vector of
parameters. The Hessian matrix H of the loss function with respect to the parameters is defined as:

H = \begin{bmatrix}
\frac{\partial^2 L}{\partial \theta_1^2} & \frac{\partial^2 L}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 L}{\partial \theta_1 \partial \theta_n} \\
\frac{\partial^2 L}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 L}{\partial \theta_2^2} & \cdots & \frac{\partial^2 L}{\partial \theta_2 \partial \theta_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 L}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 L}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 L}{\partial \theta_n^2}
\end{bmatrix}    (6)

In this section, we will explore the concept of the Hessian matrix and its pivotal role in the optimization
landscape of deep learning. The Hessian matrix, a square matrix of second-order partial derivatives, is
fundamental in understanding the curvature of multidimensional loss functions, which are at the heart of
training neural networks.

2.1 Taylor Expansion


Consider a multivariate scalar-valued function f : Rn → R. The Taylor expansion of f at a point x ∈ Rn
in the direction of a vector h ∈ Rn up to second order is given by:

f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T H(f)(x) h + o(∥h∥²),    (7)
where ∇f (x) is the gradient of f at x, and H(f )(x) is the Hessian matrix of f at x, which needs to be
defined.

2.2 Definition of the Hessian Matrix


The Hessian matrix H of a scalar-valued function f : Rn → R, with respect to a vector x ∈ Rn , is defined
as:

H(f)(x) = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.    (8)

Each element of this matrix, ∂²f/∂x_i∂x_j, represents the second-order partial derivative of the function f with respect to the variables x_i and x_j. The matrix captures the local curvature of the function f around the point x.

2.3 Eigenvalues and Eigenvectors


Let’s consider a square matrix A ∈ Rn×n . An eigenvector of A is a non-zero vector v ∈ Rn such that,
when A acts on v, the vector v is only scaled and not rotated. Mathematically, this relationship is defined as:

Av = λv, (9)

where λ ∈ R is the scalar known as the eigenvalue corresponding to the eigenvector v.


To find the eigenvalues of A, we solve the characteristic equation:

det(A − λI) = 0, (10)



where I is the identity matrix of the same dimensions as A. Solving this equation will provide us with the
eigenvalues λi .
Once we have the eigenvalues, we can find the corresponding eigenvectors by solving:

(A − λi I)vi = 0 (11)

for each eigenvalue λi . This is a system of linear equations, which can be solved for the eigenvector vi .

2.4 Importance in Deep Learning


In the context of deep learning, the Hessian matrix is used to understand the curvature of the loss
function, which impacts the efficiency and convergence of optimization algorithms such as gradient descent.
Key applications include:

1. Identifying Saddle Points and Local Minima: The eigenvalues of the Hessian can indicate whether
a critical point is a local minimum, maximum, or a saddle point.
A critical point (a point where the gradient ∇f = 0) can be a local minimum, a local maximum, or a
saddle point. The nature of this point can often be determined by examining the Hessian matrix at that
point:

• Local Minima: If the Hessian matrix at a critical point x is positive definite (all its eigenvalues
are positive), then f is convex in a neighborhood around x, and x is a local minimum.
• Local Maxima: If the Hessian matrix at a critical point x is negative definite (all its eigenvalues
are negative), then f is concave in a neighborhood around x, and x is a local maximum.
• Saddle Point: If the Hessian matrix at a critical point x has both positive and negative eigenvalues,
then x is a saddle point. This is because f exhibits concave behavior in some directions and convex
behavior in others.
• Indeterminate: If the Hessian is semidefinite (some eigenvalues are zero, others are positive or
negative), then the test is inconclusive; the critical point could be any of the three types.

2. Adapting Learning Rates: In second-order optimization methods, the Hessian can be used to modify
the learning rate dynamically, leading to faster convergence.
The idea behind using the Hessian matrix to adapt learning rates is based on the principle of second-order
optimization. Specifically:

(a) In regions where the loss function is sharply curved (large eigenvalues of H), a smaller learning rate
is preferred to avoid overshooting the minimum. This can be achieved by scaling the learning rate
inversely with the curvature, for example, by using the eigenvalues of the Hessian.
(b) In contrast, in regions where the loss function is flat (small eigenvalues of H), a larger learning
rate can be employed to expedite convergence. Again, this adjustment can be informed by the
eigenvalues of the Hessian.

(c) The natural gradient, which scales the gradient by the inverse of the Hessian (in the ideal case) or an
approximation thereof, can be used to perform this adaptation automatically. However, computing
the full Hessian and its inverse is computationally expensive for large models.
(d) Approximations like the diagonal of the Hessian or low-rank approximations can be used to reduce
computational complexity. Algorithms such as AdaGrad, RMSprop, and Adam can be seen as
applying this principle in a more heuristic manner, where the scale of parameter updates is adjusted
based on accumulated squared gradients (a proxy for curvature).

3. Analyzing Loss Surface: Understanding the geometry of the loss surface helps in developing more
robust and efficient training strategies.

• The eigenvalues of the Hessian matrix indicate the curvature of the loss surface:
– Positive Eigenvalues: If all the eigenvalues of H at a particular point θ are positive, the loss
surface is convex around that point, indicating a local minimum.
– Negative Eigenvalues: If all eigenvalues are negative, the point is a local maximum.
– Mixed Eigenvalues: If eigenvalues are mixed (some positive, some negative), the point is a
saddle point, which can often impede gradient-based optimization methods.
• The condition number of the Hessian matrix, given by the ratio of the largest to the smallest
eigenvalue, provides information about the loss surface’s geometry:
– A high condition number indicates a significant difference in curvature along different di-
rections, leading to ill-conditioned optimization problems where gradient descent can oscillate
or converge very slowly.
– A low condition number suggests that the surface is more spherical, allowing for more stable
and faster convergence.
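
To make the eigenvalue test and the condition number concrete, here is a minimal NumPy sketch. The function f(x, y) = x² − y² and its hand-computed Hessian are assumptions chosen purely for illustration; for a general loss the Hessian would be evaluated at the critical point of interest, but the classification logic is the same.

import numpy as np

# Illustrative function: f(x, y) = x^2 - y^2, a textbook saddle.
# Its Hessian is constant: d2f/dx2 = 2, d2f/dy2 = -2, mixed partials = 0.
H = np.array([[2.0, 0.0],
              [0.0, -2.0]])

eigenvalues, _ = np.linalg.eigh(H)   # eigh is appropriate because H is symmetric

if np.all(eigenvalues > 0):
    kind = "local minimum (positive definite)"
elif np.all(eigenvalues < 0):
    kind = "local maximum (negative definite)"
elif np.any(eigenvalues > 0) and np.any(eigenvalues < 0):
    kind = "saddle point (indefinite)"
else:
    kind = "inconclusive (semidefinite)"

print(eigenvalues)                                        # [-2.  2.]
print("critical point at the origin is a", kind)          # saddle point
print(np.abs(eigenvalues).max() / np.abs(eigenvalues).min())  # condition number of H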

2.5 Challenges and Solutions


While the Hessian matrix provides valuable insights, its computation and storage are costly, especially for
large neural networks. Various approximations and strategies, such as diagonal approximations, quasi-Newton
methods, and Hessian-free optimizations, are employed to mitigate these challenges.

3 Basic Learning Algorithms


3.1 Batch Gradient Descent
Batch Gradient Descent is an optimization algorithm commonly used in machine learning to minimize
the cost function, typically representing the error between the predicted outputs of the model and the actual
values. It is called ’Batch’ because, in each iteration, the entire dataset is used to calculate the gradient of
the cost function.

3.1.1 Mathematical Formulation

Given a cost function L(θ), where θ represents the parameters of the model, the goal of Batch Gradient
Descent is to find the set of parameters θ that minimizes L(θ). The update rule for the parameter θ in each
iteration is given by:

θ := θ − α∇L(θ), (12)

where:

• α is the learning rate, a scalar that determines the step size at each iteration.

• ∇L(θ) is the gradient of the cost function with respect to the parameters θ.

L(θ) = (1/(2m)) Σ_{i=1}^{m} L(f_θ(x^{(i)}), y^{(i)})    (13)
where:

• m is the number of training examples.

• fθ (x(i) ) is the hypothesis function, predicting the output given input x(i) .

• y (i) is the actual output for the i-th training example.

3.1.2 Algorithm (Pseudocode)

Algorithm 1 Batch Gradient Descent


1: Initialize the parameter vector θ (e.g., to zeros or small random values).

2: Set the learning rate α.


3: repeat
4: Compute the gradient ∇L(θ) over the entire dataset.
5: Update the parameters: θ ← θ − α ∇L(θ).
6: until convergence (e.g., change in cost function below a threshold or a max iteration count).

3.1.3 Notes

• The learning rate α needs to be chosen carefully, as a value too large can lead to divergence, while a
value too small leads to a slow convergence.

• Convergence is typically determined when the change in cost function between iterations is below a
predetermined threshold.
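
As a concrete sketch of Algorithm 1, the following NumPy snippet runs Batch Gradient Descent on a small synthetic linear-regression problem with the squared-error cost of Eq. (13). The data, learning rate, and iteration budget are arbitrary choices for the example rather than recommended settings.

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])  # bias column + one feature
theta_true = np.array([2.0, -3.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)                # synthetic targets

def full_gradient(theta):
    # Gradient of L(theta) = 1/(2m) * sum_i (f_theta(x_i) - y_i)^2 over the whole dataset
    m = len(y)
    return X.T @ (X @ theta - y) / m

theta = np.zeros(2)        # initialize the parameter vector
alpha = 0.1                # learning rate
for _ in range(1000):      # fixed iteration budget stands in for a convergence test
    theta = theta - alpha * full_gradient(theta)

print(theta)               # approaches [2.0, -3.0]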

3.2 Stochastic Gradient Descent


Stochastic Gradient Descent (SGD) is a variation of the gradient descent algorithm that updates the
model parameters more frequently. Unlike Batch Gradient Descent, which uses the entire dataset to compute
the gradient of the cost function at each step, SGD updates the parameters for each training example. This
approach can lead to faster convergence with large datasets.

3.2.1 Mathematical Formulation

Just like in Batch Gradient Descent, the goal of SGD is to minimize the cost function L(θ). However,
instead of computing the gradient of the cost function based on the whole dataset, SGD computes the gradient
based on a single training example (x(i) , y (i) ). Therefore, the update rule for the parameter θ at each iteration
i is given by:

θ := θ − α∇Li (θ), (14)

where:

• α is the learning rate.

• ∇Li (θ) is the gradient of the cost function with respect to the parameters θ, computed using only the
i-th training example.

The cost function for the i-th example is typically represented as:

L_i(θ) = (1/2) L(f_θ(x^{(i)}), y^{(i)}),    (15)

3.2.2 Algorithm (Pseudocode)

Algorithm 2 Stochastic Gradient Descent


1: Initialize the parameter vector θ (e.g., zeros or small random values).

2: Set the learning rate α.


3: repeat
4: Shuffle the dataset randomly.
5: for i = 1 to m do
6: Compute the gradient ∇Li (θ) using the i-th example.
7: Update θ ← θ − α ∇Li (θ).
8: end for
9: until convergence (e.g., change in cost function below a threshold or reaching a max iteration limit).

3.2.3 Notes

• The learning rate α is crucial in SGD as well. However, due to the noisy gradient estimates, the learning
path may be more erratic compared to Batch Gradient Descent.

• Convergence in SGD may appear noisier and less steady; however, it often reaches an acceptable set of
parameters much quicker than Batch Gradient Descent, especially for large datasets.

• SGD can navigate out of local minima more effectively due to the noise, which can be beneficial for
non-convex optimization problems.
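
A minimal NumPy sketch of Algorithm 2 on the same kind of synthetic regression data (all values illustrative): the parameters are updated once per training example, and the dataset is reshuffled on every pass.

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)

theta = np.zeros(2)
alpha = 0.05
for epoch in range(20):
    for i in rng.permutation(len(y)):            # shuffle, then visit one example at a time
        grad_i = (X[i] @ theta - y[i]) * X[i]    # gradient from the i-th example only
        theta = theta - alpha * grad_i

print(theta)   # noisy updates, but the result is close to [2.0, -3.0]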

3.2.4 Improvements over Batch Gradient Descent

SGD provides several improvements over Batch Gradient Descent:

• Speed: By updating parameters more frequently, SGD can make significant progress towards the min-
imum even before it has seen all the data, which can be especially beneficial when dealing with large
datasets.

• Memory Efficiency: SGD requires less memory as it updates parameters using one data point at a
time.

• Escape Local Minima: The inherent noise in the gradient estimation due to sampling individual data
points can help the algorithm escape from local minima.

3.3 Mini-Batch Gradient Descent


Mini-Batch Gradient Descent is an optimization technique that represents a compromise between the full
dataset computation of Batch Gradient Descent (BGD) and the single example update of Stochastic Gradient
Descent (SGD). It aims to blend the advantages of both.

3.3.1 Mathematical Formulation

In Mini-Batch Gradient Descent, the dataset is divided into smaller subsets (mini-batches) containing n
examples each. The cost function L(θ) and parameters θ are updated for each mini-batch rather than the
entire dataset or a single data point. The update rule is given by:

θ := θ − α∇Lbatch (θ), (16)

where:

• α is the learning rate.

• ∇Lbatch (θ) is the gradient of the cost function with respect to the parameters θ, averaged over the
mini-batch.

The cost function for the mini-batch is:

L_batch(θ) = (1/(2n)) Σ_{i=1}^{n} L(f_θ(x^{(i)}), y^{(i)}),    (17)
where n is the size of the mini-batch.

3.3.2 Algorithm (Pseudocode)

Algorithm 3 Mini-Batch Gradient Descent


1: Initialize the parameter vector θ (e.g., zeros or small random values).

2: Set the learning rate α.


3: repeat
4: for each mini-batch in the dataset do
5: Compute the gradient ∇Lbatch (θ) using the current mini-batch.
6: Update θ ← θ − α ∇Lbatch (θ).
7: end for
8: until convergence (e.g., change in cost function below a threshold or a maximum number of epochs).

3.3.3 Notes

• The mini-batch size is a hyperparameter that balances the computational efficiency of BGD with the
update frequency of SGD.

• The learning rate α still requires careful tuning, and learning rate schedules can be applied.

3.3.4 Comparison with BGD and SGD

• Advantages

– Against BGD, Mini-Batch Gradient Descent:


∗ Reduces the variance in parameter updates, leading to more stable convergence than SGD.
∗ Utilizes vectorized operations more efficiently, making better use of hardware accelerations.
∗ Offers faster convergence on large datasets compared to BGD, due to more frequent updates.
– Against SGD, Mini-Batch Gradient Descent:
∗ Minimizes the noise in the parameter updates, leading to smoother convergence.
∗ Allows for parallelism in computing resources, speeding up the training process.
∗ Provides a balance between the robustness of SGD and the efficiency of BGD.

• Disadvantages

– Compared to BGD, Mini-Batch Gradient Descent:


∗ May not reach the exact convergence of BGD due to mini-batch sampling.
∗ Requires the setting of an additional hyperparameter, the mini-batch size.
– Compared to SGD, Mini-Batch Gradient Descent:
∗ Can potentially miss out on some of the regularization effects of the high variance updates of
SGD.
∗ May require more tuning for the mini-batch size to find the optimal balance.
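
The following NumPy sketch of Algorithm 3 uses an assumed mini-batch size of 16 on the same synthetic regression setup; only the batching of the gradient computation differs from the two variants sketched above.

import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 1))])
y = X @ np.array([2.0, -3.0]) + 0.1 * rng.normal(size=100)

theta, alpha, batch_size = np.zeros(2), 0.1, 16
for epoch in range(200):
    order = rng.permutation(len(y))                   # new mini-batch split every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ theta - yb) / len(idx)    # gradient averaged over the mini-batch
        theta = theta - alpha * grad

print(theta)   # close to [2.0, -3.0]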

4 Gradient Descent with Momentum


Gradient Descent with Momentum is a variation of the standard gradient descent algorithm, which
incorporates a momentum term to improve convergence speed and reduce oscillations. It’s particularly useful
for navigating ravines, where the surface curves much more steeply in one dimension than in another, which
are common around local optima.
The primary purpose of introducing momentum in gradient descent algorithms is to accelerate con-
vergence, especially in the face of small, but consistent, gradients (slopes). By accumulating velocity from
past gradients, momentum helps propel the parameter updates to traverse flat regions faster and dampen
oscillations in steep directions.

[Figure: gradient descent trajectories toward the optimum θ*. Without momentum the path zigzags across the ravine; with momentum the path is smoother and more direct.]

4.1 Standard Momentum


The update rules for Gradient Descent with standard Momentum involve modifications to the standard
gradient descent algorithm by introducing a velocity term. Mathematically, the update rules at iteration t
are given by:

vt := γvt−1 − α∇L(θt−1 ), (18)

θt := θt−1 + vt , (19)

where:

• θt is the parameter vector at iteration t.

• vt is the velocity at iteration t, initialized to zero.

• γ is the momentum coefficient, typically set between 0 (no momentum) and 1.

• α is the learning rate.

• ∇L(θt−1 ) is the gradient of the loss function L with respect to the parameters θ at iteration t − 1.

4.1.1 Algorithm (Pseudocode)



Algorithm 4 Gradient Descent with Momentum


1: Initialize the parameter vector θ and velocity v to zero.

2: Choose a learning rate α and momentum coefficient γ.


3: repeat
4: Compute the gradient ∇L(θ) at the current parameters.
5: Update velocity: v ← γ v − α ∇L(θ)
6: Update parameters: θ ← θ + v
7: until convergence (e.g., when change in cost is below a threshold or after a max iteration count).

4.1.2 Notes

• The momentum term γ helps in smoothing out the updates and can prevent the algorithm from getting
stuck in local minima.

• Careful tuning of the learning rate α and momentum coefficient γ is required for optimal performance.

• Gradient Descent with Momentum can lead to faster convergence and reduced oscillation compared to
standard gradient descent.
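
A minimal NumPy sketch of the update rules (18)-(19) on an assumed ill-conditioned quadratic loss; the matrix A, learning rate, and momentum coefficient are illustrative choices, not tuned recommendations.

import numpy as np

A = np.diag([100.0, 1.0])          # quadratic loss L = 0.5 * theta^T A theta: steep vs. flat directions
theta = np.array([1.0, 1.0])
v = np.zeros(2)                    # velocity, initialized to zero
alpha, gamma = 0.01, 0.9           # learning rate and momentum coefficient

for _ in range(200):
    grad = A @ theta               # gradient of the quadratic loss
    v = gamma * v - alpha * grad   # accumulate velocity (Eq. 18)
    theta = theta + v              # move along the velocity (Eq. 19)

print(theta)                       # approaches the minimum at [0, 0]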

4.2 Analogy to Physics


In physics, momentum is a measure of the quantity of motion of a moving body, typically defined as
the product of the mass and velocity of the object. In the context of optimization algorithms, particularly
gradient descent, momentum is used as an analogy to accelerate the search for minima, similar to how a ball
gains speed when rolling downhill.

• Physics Momentum: In classical mechanics, the momentum (p) of an object is given by:

p = mv, (20)

where:

– m is the mass of the object.


– v is the velocity of the object.

The update of momentum in a physical system due to a force (F ) over a time interval (∆t) is described
by Newton’s second law:

F = ∆p/∆t = m · ∆v/∆t,    (21)

which integrates to:

v_t = v_{t−1} + (F/m) ∆t.    (22)

• Momentum in Optimization: In optimization, particularly gradient descent with momentum, we analogize the ’velocity’ of the parameters with the updates made to those parameters. The ’mass’
concept does not directly apply, but the momentum term (γ) plays a similar role to mass in controlling
the rate of change of velocity. The update rule in optimization with momentum is:

vt = γvt−1 − α∇L(θ), (23)

and the parameter update is analogous to changing the position according to the velocity, as in physics:

θt = θt−1 + vt , (24)

where:

– γ acts as a momentum coefficient, dictating the proportion of the previous update’s ’velocity’
retained in the current update. A higher value of γ implies that more of the previous velocity is
retained, leading to a smoother and potentially faster convergence. In a physical system, if we
think of γ as analogous to the retention of motion (opposite to friction), then indeed (1 − γ) could
represent the portion of motion lost, similar to how friction counteracts movement.
– α∇L(θ) represents the ’force’ (the gradient of the loss function) influencing the parameter updates.
– θt represents the ’position’ in the parameter space, analogous to an object’s position in physical
space.

4.3 Nesterov Accelerated Gradient (NAG)


Nesterov Accelerated Gradient (NAG) is an optimization technique that seeks to improve the conver-
gence rate of the standard gradient descent algorithm, particularly in the context of deep learning and high-
dimensional data spaces.

4.3.1 Rationale

The NAG method improves upon traditional momentum by incorporating a forward-looking step. This
means that instead of calculating the gradient at the current position, NAG first makes a step in the direction
of the accumulated momentum, and then computes the gradient. This lookahead approach allows NAG to
adjust its updates more responsively, leading to faster convergence and reduced oscillations.

4.3.2 Mathematical Formulation

Let’s denote:

• θ as the parameter vector.

• α as the learning rate.

• γ as the momentum factor.



• L as the loss function.

The update rules in Nesterov Accelerated Gradient are given by:

vt = γvt−1 − α∇L(θt−1 + γvt−1 ), (25)

θt = θt−1 + vt , (26)

where:

• vt is the velocity at time step t.

• ∇L(θt−1 + γvt−1 ) is the gradient of the loss function evaluated after the lookahead based on the previous
velocity.

4.3.3 Algorithm (Pseudocode)

Algorithm 5 Nesterov Accelerated Gradient


1: Initialize the parameters θ and velocity v to zero.

2: Choose learning rate α and momentum coefficient γ.


3: repeat
4: Compute the gradient at the lookahead position: ∇L(θ + γ v).
5: Update velocity: v ← γ v − α ∇L(θ + γ v).
6: Update parameters: θ ← θ + v.

7: until convergence (e.g., based on changes in the cost function or max iterations).

4.3.4 Notes

• The NAG method typically requires less hyperparameter tuning compared to standard gradient descent
and is particularly effective in scenarios with high curvature and noisy gradients.

• By calculating the gradient after a preliminary move in the direction of the momentum, NAG can avoid
overshooting and dampen oscillations, leading to improved convergence rates.
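
For comparison with the momentum sketch above, here is a minimal NumPy sketch of the NAG updates (25)-(26) on the same assumed quadratic loss; the only difference is that the gradient is evaluated at the lookahead point θ + γv.

import numpy as np

A = np.diag([100.0, 1.0])               # same illustrative ill-conditioned quadratic
theta, v = np.array([1.0, 1.0]), np.zeros(2)
alpha, gamma = 0.01, 0.9

for _ in range(200):
    lookahead = theta + gamma * v       # provisional step along the accumulated momentum
    grad = A @ lookahead                # gradient at the lookahead position
    v = gamma * v - alpha * grad
    theta = theta + v

print(theta)                            # approaches [0, 0]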

5 Adaptive Learning Algorithms


5.1 AdaGrad (Adaptive Gradient Algorithm)
AdaGrad is an optimization method that adapts the learning rate to the parameters, performing smaller
updates for parameters associated with frequently occurring features, and larger updates for parameters
associated with infrequent features. It is particularly well-suited for dealing with sparse data.

5.1.1 Rationale

The key idea behind AdaGrad is to modify the general learning rate at each time step for each parameter,
based on the past gradients that have been computed for that parameter. This approach aims to improve
convergence performance over standard stochastic gradient descent, especially in scenarios with large-scale
and sparse data.

5.1.2 Mathematical Formulation

Let’s denote:

• θ as the parameter vector.

• α as the initial learning rate.

• L as the loss function.

• gt,i as the gradient of the loss function with respect to the i-th parameter at time step t.

The AdaGrad update rule is given by:

G_{t,ii} = Σ_{τ=1}^{t} g_{τ,i}²,    (27)

θ_{t+1,i} = θ_{t,i} − (α / √(G_{t,ii} + ϵ)) · g_{t,i},    (28)

where G_{t,ii} is the sum of the squares of the past gradients with respect to the i-th parameter up to time step t, and ϵ is a smoothing term that avoids division by zero (usually on the order of 10⁻⁸).

5.1.3 Algorithm (Pseudocode)

Algorithm 6 AdaGrad

1: Initialize parameters θ (e.g., zeros or small random values).
2: Choose an initial learning rate α.
3: Set the gradient accumulator G to a vector (or matrix) of zeros.
4: repeat
5:    Compute the current gradient g_t.
6:    Accumulate squared gradients: G_ii ← G_ii + g_{t,i}² for each parameter i.
7:    Update each parameter: θ_i ← θ_i − (α / √(G_ii + ϵ)) g_{t,i}.
8: until convergence (e.g., a threshold on parameter or cost changes, or a maximum iteration count).

5.1.4 Notes

• AdaGrad’s main advantage is that it eliminates the need to manually tune the learning rate. Most implementations use a default value of 0.01 for α.

• The continuously accumulating denominator can cause the effective learning rate to shrink and eventually become infinitesimally small, causing the algorithm to stop learning. This might necessitate switching to a different optimization algorithm after a certain point, or modifying AdaGrad to reset the accumulation periodically.

• AdaGrad is particularly effective for problems with sparse gradients (e.g., natural language processing and image recognition).
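
A minimal NumPy sketch of Eqs. (27)-(28) on an assumed toy quadratic loss with very different curvature along its two coordinates (all constants are illustrative); the per-parameter accumulator G scales down the steps along the direction that keeps producing large gradients.

import numpy as np

def loss_grad(theta):
    # gradient of a simple quadratic loss with curvatures 100 and 1
    return np.array([100.0, 1.0]) * theta

theta = np.array([1.0, 1.0])
G = np.zeros(2)                  # accumulated squared gradients, one entry per parameter
alpha, eps = 0.5, 1e-8

for _ in range(500):
    g = loss_grad(theta)
    G += g ** 2                              # accumulate squared gradients (Eq. 27)
    theta -= alpha / np.sqrt(G + eps) * g    # per-parameter scaled update (Eq. 28)

print(theta)   # both coordinates decrease toward 0 at similar rates despite very different curvatures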

5.2 RMSprop (Root Mean Square Propagation)


RMSprop is an adaptive learning rate method that addresses some of the issues encountered with the
conventional implementation of gradient descent, particularly in dealing with fast-changing curvatures in loss
functions associated with deep learning.

5.2.1 Rationale

The primary idea behind RMSprop is to maintain a moving average of the squares of gradients and to
divide the gradient by the square root of this average. This mechanism equips the algorithm to adapt the
learning rate to the topology of the error landscape: decreasing it when the gradient is large and steep, and
increasing it when the landscape is flat. This adaptiveness makes RMSprop particularly effective for deep
neural networks.

5.2.2 Mathematical Formulation

Let’s denote:

• θ as the parameter vector.

• α as the learning rate.

• L as the loss function.



• E[g 2 ]t as the expected value of the squared gradients at time step t.

The RMSprop updates are governed by the following equations:

E[g 2 ]t = βE[g 2 ]t−1 + (1 − β)(∇L(θ))2 , (29)

θ_{t+1} = θ_t − (α / √(E[g²]_t + ϵ)) · ∇L(θ_t),    (30)
where:

• E[g 2 ]t is the decaying average of past squared gradients.

• β is the decay rate, a hyperparameter typically set between 0.9 and 0.99.

• ϵ is a small constant added for numerical stability, usually around 10−8 .

5.2.3 Algorithm (Pseudocode)

Algorithm 7 RMSprop

1: Initialize parameters θ (e.g., to zeros or small random values) and the moving average of squared gradients E[g²] (often to zero).
2: Choose learning rate α, decay rate β, and stabilizing term ϵ.
3: repeat
4:    Compute the gradient: g ← ∇L(θ).
5:    Update the expected squared gradient: E[g²] ← β E[g²] + (1 − β) g².
6:    Update the parameters: θ ← θ − (α / √(E[g²] + ϵ)) g.
7: until convergence (e.g., based on small changes in the loss function or reaching a maximum number of iterations).

5.2.4 Notes

• RMSprop automatically adjusts the learning rate during the optimization process, making it less sensitive
to hyperparameters.

• It has proven to be effective for training deep neural networks with noisy or sparse gradients.

• However, RMSprop, like other adaptive learning rate methods, may still require tuning of the learning
rate and decay rate.
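
A minimal NumPy sketch of Eqs. (29)-(30) on the same assumed toy quadratic as above; replacing AdaGrad's ever-growing sum with a decaying average keeps the effective learning rate from shrinking to zero. The constants are illustrative only.

import numpy as np

def loss_grad(theta):
    return np.array([100.0, 1.0]) * theta       # same toy quadratic as the AdaGrad sketch

theta = np.array([1.0, 1.0])
Eg2 = np.zeros(2)                                # moving average of squared gradients
alpha, beta, eps = 0.01, 0.9, 1e-8

for _ in range(500):
    g = loss_grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2       # decaying average (Eq. 29)
    theta -= alpha / np.sqrt(Eg2 + eps) * g      # normalized step (Eq. 30)

print(theta)   # both coordinates end up in a small neighborhood of 0 (roughly the size of alpha)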

5.3 Adam Optimization Algorithm


Adam, short for Adaptive Moment Estimation, is an algorithm for gradient-based optimization of stochas-
tic objective functions, based on adaptive estimates of lower-order moments. It is designed to combine the
advantages of two other extensions of stochastic gradient descent: AdaGrad and RMSprop.

5.3.1 Rationale

Adam is widely used in deep learning applications. It computes individual adaptive learning rates for
different parameters from estimates of first and second moments of the gradients. The method is straightfor-
ward to implement, computationally efficient, invariant to diagonal rescaling of the gradients, and well-suited
for problems that are large in terms of data and/or parameters.

5.3.2 Mathematical Formulation

The Adam update rule is defined as follows:


Initialize time step t ← 0, moment vectors m0 ← 0 (1st moment), v0 ← 0 (2nd moment), and parameter
vector θ0 .
Hyperparameters: step size α (usually 10−3 ), exponential decay rates for moment estimates β1 (usually
0.9), β2 (usually 0.999), and small constant ϵ (usually 10−8 ).
At each iteration t:

1. t ← t + 1.

2. Compute the gradient: gt ← ∇θ L(θt−1 ).

3. Update biased first moment estimate: mt ← β1 · mt−1 + (1 − β1 ) · gt .

4. Update biased second raw moment estimate: vt ← β2 · vt−1 + (1 − β2 ) · gt2 .

5. Correct bias in first moment: m̂_t ← m_t / (1 − β_1^t).

6. Correct bias in second raw moment: v̂_t ← v_t / (1 − β_2^t).

7. Update parameters: θ_t ← θ_{t−1} − α · m̂_t / (√v̂_t + ϵ).

5.3.3 Algorithm (Pseudocode)

Algorithm 8 Adam (Adaptive Moment Estimation)

1: Initialize parameters θ, moment vectors m = 0, v = 0, and time step t = 0.
2: Choose learning rate α, decay rates β1, β2, and smoothing term ϵ.
3: repeat
4:    t ← t + 1 (increment time step)
5:    Compute the gradient g_t of the objective function w.r.t. θ.
6:    Update biased first moment estimate: m ← β1 m + (1 − β1) g_t.
7:    Update biased second moment estimate: v ← β2 v + (1 − β2) g_t².
8:    Compute bias-corrected first moment: m̂ ← m / (1 − β1^t).
9:    Compute bias-corrected second moment: v̂ ← v / (1 − β2^t).
10:   Update parameters: θ ← θ − α m̂ / (√v̂ + ϵ).
11: until convergence (e.g., changes in θ or loss below a threshold, or max iterations).

5.3.4 Notes

• Adam combines the benefits of AdaGrad, which works well with sparse gradients, and RMSprop, which works well in online and non-stationary settings.

• It is recommended to stick to the default values of the hyperparameters: α = 0.001, β1 = 0.9, β2 = 0.999, and ϵ = 10⁻⁸.

• While Adam generally works well in practice, its effectiveness can depend on proper tuning of its hyper-
parameters and the characteristics of the task.
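
A minimal NumPy sketch of Algorithm 8 with the default hyperparameters quoted above, again on an assumed toy quadratic loss; the loss function and iteration count are illustrative choices.

import numpy as np

def loss_grad(theta):
    return np.array([100.0, 1.0]) * theta       # toy quadratic, as in the previous sketches

theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)                  # first and second moment estimates
alpha, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8

for t in range(1, 5001):
    g = loss_grad(theta)
    m = beta1 * m + (1 - beta1) * g              # biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2         # biased second raw moment
    m_hat = m / (1 - beta1 ** t)                 # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)

print(theta)   # both coordinates are driven toward 0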

6 Optimization Algorithm Choosing


Choosing the right optimization algorithm for training machine learning models is crucial for performance
and efficiency. The selection depends on various factors including the specific characteristics of the data, the
type of model, and the computational resources available. Here we summarize key points to consider when
selecting an optimization algorithm.

6.1 Considerations for Selection


1. Data Sparsity: In tasks involving sparse input features (e.g., text classification or natural language
processing), algorithms that adapt learning rates to individual parameters such as AdaGrad or Adam
often excel. These methods can assign larger updates to infrequent parameters, improving model
performance on rare features.

2. Scale of Data and Model: For large-scale datasets and high-dimensional models, stochastic methods
like SGD, RMSprop, or Adam are usually preferred due to their computational efficiency and reduced
memory requirements.

3. Landscape Complexity: If the optimization landscape is characterized by numerous saddle points or


flat regions, momentum-based methods (e.g., Nesterov Accelerated Gradient or Adam) can help
the optimizer traverse these regions more effectively.

4. Noise in Gradients: When gradient estimates are particularly noisy (e.g., due to small batch sizes or
inherently noisy data), adaptive methods such as RMSprop and Adam can smooth out updates by
normalizing gradient steps.

5. Learning Rate Scheduling: Consider whether the algorithm automatically adapts its learning rate
or if manual scheduling is required. While Adam and AdaGrad dynamically adjust the learning rate,
simpler methods such as SGD may need scheduled adjustments (e.g., step decay or exponential decay)
to achieve stable convergence.

6. Hyperparameters: The number and sensitivity of hyperparameters can affect both the tuning process
and the robustness of the algorithm. Optimizers with fewer hyperparameters or with well-established
defaults (such as Adam) tend to simplify training and can be good starting points.

7. Convergence Behavior: Different algorithms exhibit different convergence profiles. While some may
converge rapidly but risk overshooting optimal values, others may converge more slowly but steadily.
Reviewing empirical performance on similar tasks or well-known benchmarks can guide the choice of
algorithm.

6.2 General Recommendations


Although there is no universally optimal optimization algorithm, the following recommendations can
serve as starting points:

• Adam is a robust default choice for many scenarios due to its resilience to hyperparameter settings and
strong empirical performance.

• For sparse data, AdaGrad can be particularly beneficial early on, while Adam often provides better
performance over longer training runs.

• When you require fine control over the learning process or are dealing with more stable, well-conditioned
data, SGD with Nesterov momentum can yield reliable convergence.

• It is essential to monitor the model's performance closely and be prepared to adjust parameters
or switch optimizers if training stagnates or diverges.

• Finally, experimenting with multiple algorithms and comparing their performance on a validation
set is often the most reliable way to determine the best fit for your specific problem.

In conclusion, the choice of optimization algorithm can profoundly affect both the training efficiency
and final performance of machine learning models. By analyzing the key characteristics of your problem—
such as data sparsity, scale, landscape complexity, and gradient noise—and understanding the strengths and
limitations of each optimizer, you can make informed decisions that lead to more effective and efficient training
outcomes.

7 Coefficient Initialization
Proper initialization of the weights (coefficients) in neural networks is crucial for ensuring effective and
efficient training. Improper initialization can cause issues such as vanishing or exploding gradients, leading to slow convergence or even failure of the training process.

7.1 Theories Behind Initialization


• Symmetry Breaking: If all weights are initialized to the same value, all neurons in a layer will follow
the same gradient and effectively be the same unit. Random initialization breaks this symmetry and
allows the neurons to learn different features.

• Vanishing and Exploding Gradients: Very small initial weights can lead to gradients diminishing
through layers, making deep networks hard to train (vanishing gradients). Conversely, very large initial
weights can lead to gradients growing exponentially (exploding gradients).

• Balanced Propagation: Initial weights should be set to maintain the variance of activations and gra-
dients across layers. This balance helps avoid vanishing and exploding gradients as the signal propagates
through the network.

7.2 Simple Initialization Methods


7.2.1 Zero Initialization

Setting all weights to zero fails because symmetry is never broken: every neuron in a layer receives the same gradient and learns the same features.

7.2.2 Random Initialization

Weights are initialized randomly but must be scaled appropriately to prevent vanishing/exploding gra-
dients.

7.3 Xavier/Glorot Initialization


Xavier or Glorot initialization aims to address the problem of initializing neural network weights in such
a way that the variance of activations remains the same across every layer. This helps in preventing the
gradients from becoming too small (vanishing) or too large (exploding), which can significantly hinder the
learning process.

7.3.1 Rationale

The central idea behind Xavier/Glorot initialization is to maintain the flow of gradients so that they
neither vanish nor explode as they propagate through successive layers in deep networks. This is achieved by
balancing the variance of the outputs of each layer with that of the inputs.

7.3.2 Mathematical Derivation

• Forward Propagation Consider a fully connected layer with nin input neurons and nout output neu-
rons. The output z of this layer (before activation) can be expressed as:

z = W x + b, (31)

where W represents the weights, x represents the inputs, and b represents the biases. Assuming biases
are initialized to zero and neglecting them for simplicity, the variance of z can be given by:

Var(z) = nin · Var(W ) · Var(x). (32)

To ensure that the signal does not diminish or blow up as it passes through the layer, we set Var(z) =
Var(x). Therefore,

n_in · Var(W) = 1  ⇒  Var(W) = 1/n_in.    (33)
• Backward Propagation During backpropagation, gradients with respect to the outputs of the layer
are propagated backward to the inputs. For the gradients not to vanish or explode, a similar condition
needs to be met:

n_out · Var(W) = 1  ⇒  Var(W) = 1/n_out.    (34)

7.3.3 Balancing Forward and Backward Propagation

To balance the variance during both forward and backward propagation, Xavier/Glorot initialization
averages the two variances:

Var(W) = 2 / (n_in + n_out).    (35)
This initialization ensures that the layers in the neural network are neither too weakly nor too strongly
activated, allowing gradients to flow appropriately.

7.3.4 Practical Implementation

For a uniform distribution, the weights should be sampled from:

W ∼ U[ −√(6/(n_in + n_out)), √(6/(n_in + n_out)) ].    (36)

For a normal distribution, weights should typically be sampled from:

W ∼ N(0, √(2/(n_in + n_out))).    (37)
Xavier/Glorot Initialization offers an effective method for initializing the weights of layers that use com-
mon non-linear activations in neural networks. By accounting for the number of input and output units,
this approach aids in maintaining a stable gradient flow, enabling more effective and efficient training across
various network depths.
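
A minimal NumPy sketch of Eqs. (36)-(37) for a single fully connected layer; the layer sizes and the batch of random inputs are assumptions for illustration. It also checks that the pre-activation variance stays on the same order as the input variance.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 128

# Uniform variant (Eq. 36)
limit = np.sqrt(6.0 / (n_in + n_out))
W_uniform = rng.uniform(-limit, limit, size=(n_out, n_in))

# Normal variant (Eq. 37)
std = np.sqrt(2.0 / (n_in + n_out))
W_normal = rng.normal(0.0, std, size=(n_out, n_in))

x = rng.normal(size=(n_in, 1000))    # inputs with unit variance
z = W_normal @ x                     # pre-activations of the layer
# z.var() is about 2*n_in/(n_in + n_out) times x.var(): neither vanishing nor exploding
print(x.var(), z.var())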

7.4 He Initialization
He Initialization is a method designed specifically for layers with ReLU (Rectified Linear Unit) activations
in neural networks. It addresses some of the shortcomings in initializing deep networks, particularly those
related to the variance of activations in networks using ReLU.

7.4.1 Rationale

ReLU activations are not bounded from above and are zero for all negative inputs. This characteristic can
lead to issues with standard initialization methods, such as Xavier/Glorot, which assume that the activation
functions are linear or close to linear. He Initialization aims to preserve the variance of activations and
backpropagated gradients throughout the network, which helps in preventing the vanishing or exploding
gradient problems in deep networks with ReLU activations.

7.4.2 Mathematical Derivation

Consider a layer in a neural network with nin input neurons. Let’s denote the weight matrix associated
with this layer as W and its elements as Wij . We assume that the layer is followed by a ReLU activation
function.
When using ReLU, only the positive half of the activation distribution contributes to the forward pass
and the gradient computation. If we initialize the weights from a distribution with variance Var(W ), then
the variance of the output (before activation) from this layer would be nin Var(W ) due to the linearity of
expectation.
However, considering the ReLU’s effect, half of the activations will be zero, and the variance of the post-
activation outputs will be halved. To counteract this effect and keep the variance in activations consistent
across layers, He Initialization proposes to double the variance used in Xavier/Glorot Initialization for layers
followed by ReLU activations.
Therefore, we adjust the variance of the weights initialization as follows:

Var(W) = 2 / n_in.    (38)

7.4.3 Practical Implementation

For layers using ReLU activations, the weights should be initialized with values drawn from a random distribution with mean zero and variance 2/n_in. Typically, a Gaussian (normal) distribution is used:

W ∼ N(0, √(2/n_in)).    (39)

Alternatively, for a uniform distribution, the range can be derived from the square root of the variance, leading to:

W ∼ U[ −√(6/n_in), √(6/n_in) ].    (40)
He Initialization provides an effective way to initialize the weights of layers that use ReLU activations
in deep neural networks. By compensating for the reduced variance in ReLU units, this method helps in
maintaining a healthy gradient flow across many layers, enabling deeper networks to be trained more effectively
and efficiently.
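
A minimal NumPy sketch of Eq. (39) for a ReLU layer (layer sizes and inputs are illustrative); it checks that the mean squared activation after the ReLU stays close to the mean squared input, which is the property the derivation above aims to preserve.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 256

W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # He normal initialization (Eq. 39)
x = rng.normal(size=(n_in, 1000))                              # inputs with unit second moment
a = np.maximum(0.0, W @ x)                                     # ReLU activation

# Both mean squares are close to 1: the signal neither shrinks nor grows across the layer.
print(np.mean(x ** 2), np.mean(a ** 2))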

7.5 Orthogonal Initialization


Orthogonal Initialization is an approach used in deep learning to initialize the weights of certain layers in
a neural network, particularly Recurrent Neural Networks (RNNs). It is based on the concept of maintaining
orthogonality in the weight matrices.

7.5.1 Rationale

The main idea behind Orthogonal Initialization is to preserve the length (norm) of vectors through layers
of the network. When the weight matrix is orthogonal, it ensures that the singular values are equal to
one. This property helps in preventing the vanishing or exploding gradients problems, which are particularly
prevalent in RNNs due to their deep unfolded structure through time.
Orthogonality ensures that during backpropagation, the gradients are neither inflated nor diminished
excessively, leading to a more stable and efficient training process, especially in models where maintaining the
dynamics over time is crucial.

7.5.2 Mathematical Background

An orthogonal matrix Q is a square matrix whose columns and rows are orthogonal unit vectors, i.e., Q^T Q = Q Q^T = I, where I is the identity matrix. In the context of neural networks, orthogonal initialization sets the weight matrix such that:

W ∈ R^{n×n}, such that W^T W = I,    (41)

where n is the number of neurons in the layer. For non-square matrices where the number of inputs differs from the number of outputs (n_in ≠ n_out), a semi-orthogonal matrix is used, satisfying W^T W = I for n_in > n_out or W W^T = I for n_in < n_out.

7.5.3 Implementation

Orthogonal initialization involves generating a random square matrix A (usually through Gaussian dis-
tribution) and then applying QR factorization to obtain the orthogonal or semi-orthogonal matrix Q. This
matrix Q is then used as the weight matrix for the neural network layer.
In practice, for layers where the number of inputs differs from the number of outputs, a rectangular
orthogonal matrix is constructed by taking the appropriate number of rows or columns from the orthogonal
matrix Q.
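
A minimal NumPy sketch of the QR-based procedure described above; the helper name orthogonal_init and the layer shapes are assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n_out, n_in):
    # Draw a Gaussian matrix, orthogonalize it with QR, and transpose if needed
    # so the result has shape (n_out, n_in): orthogonal when square,
    # semi-orthogonal otherwise.
    A = rng.normal(size=(max(n_out, n_in), min(n_out, n_in)))
    Q, _ = np.linalg.qr(A)                 # Q has orthonormal columns
    return Q if Q.shape == (n_out, n_in) else Q.T

W_sq = orthogonal_init(4, 4)
print(np.allclose(W_sq.T @ W_sq, np.eye(4)))       # True: W is orthogonal

W_rect = orthogonal_init(3, 5)                     # wide layer: n_out < n_in
print(np.allclose(W_rect @ W_rect.T, np.eye(3)))   # True: rows are orthonormal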

7.5.4 Advantages and Use Cases

Orthogonal Initialization is particularly beneficial for:

• Deep Recurrent Neural Networks (RNNs), including LSTMs and GRUs, where it helps in mitigating the
vanishing and exploding gradients problem.

• Layers where maintaining the energy of the input signal is crucial for the learning process.

7.6 Practical Recommendations


• Use Xavier initialization for layers with Sigmoid or Tanh activations.

• Use He initialization for layers with ReLU activations.



• For recurrent neural networks, consider orthogonal initialization for hidden-to-hidden weights.

• Initialization can be combined with other techniques like Batch Normalization to further stabilize and
accelerate training.

• Regularly monitor the scale of activations and gradients during initial epochs and adjust initialization
as needed.

8 Gradient Variation as a Diagnostic Tool


Detecting overfitting and distribution shifts in machine learning models is crucial for ensuring robust and
generalizable model performance. One sophisticated method for diagnosing these issues involves analyzing the
variation in gradients between in-sample (training) and out-sample (validation or test) data. This approach
leverages the foundational principles of gradient-based learning and optimization to offer insights into the
model’s learning dynamics and generalization capabilities.

8.1 Detecting Overfitting Through Gradient Monitoring


Overfitting occurs when a model learns patterns specific to the training data at the expense of its
generalization to unseen data. This phenomenon can often be detected through a careful observation of the
model’s gradients:

• Gradient Dynamics: A key indicator of overfitting is when the loss on the training set consistently
decreases while the loss on the validation set stagnates or increases. The divergence in gradient magni-
tudes and directions between training and validation phases signals that the model updates are overly
tailored to the training data.

• Practical Implementation: Tools and libraries such as TensorFlow and PyTorch enable practitioners
to log and analyze gradient information. By comparing the gradients from the training and validation
phases, one can identify discrepancies that may indicate overfitting.
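
As an illustrative PyTorch sketch of this idea (the tiny linear model, the random data, and the specific comparison of norms and cosine similarity are assumptions, not a prescribed recipe), one can compute full-parameter gradients on a training batch and a validation batch and compare them:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(10, 1)                 # placeholder model
loss_fn = nn.MSELoss()

x_train, y_train = torch.randn(64, 10), torch.randn(64, 1)   # placeholder batches
x_val, y_val = torch.randn(64, 10), torch.randn(64, 1)

def flat_grad(x, y):
    # Gradient of the loss on one batch, flattened into a single vector.
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()])

g_train = flat_grad(x_train, y_train)
g_val = flat_grad(x_val, y_val)
cosine = F.cosine_similarity(g_train, g_val, dim=0)
print(g_train.norm().item(), g_val.norm().item(), cosine.item())

A persistently growing gap between the two gradient norms, or a cosine similarity drifting toward zero or below over the course of training, is the kind of divergence between training and validation behavior that the text describes.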

8.2 Identifying Distribution Shifts with Gradient Variation


Distribution shifts pose a significant challenge, as a model trained on one data distribution may perform
poorly on a different distribution. Gradient variation analysis can also shed light on this issue:

• Shift Detection: A sudden change in the behavior of gradients when the model is evaluated on new
data may suggest a distribution shift. Specifically, if the model’s performance degrades and the gradients
exhibit significantly different patterns compared to those observed during training, it might indicate that
the model is encountering data with a different underlying distribution.

• Application Strategies: Beyond detection, addressing distribution shifts may involve domain adap-
tation techniques or retraining the model with a dataset that better represents the target distribution.
Monitoring gradient variation provides a quantitative basis for deciding when such steps are necessary.

8.3 Application Considerations and Best Practices


While gradient variation analysis offers valuable insights, it should be applied judiciously within the
broader context of model evaluation:

• Complementary Diagnostics: Gradient monitoring is most effective when used alongside other di-
agnostic tools and evaluation metrics. It is a part of a comprehensive model validation and diagnostics
strategy.

• Interpretation and Action: Identifying gradient variations is just the first step. The critical task is in-
terpreting these variations correctly and taking appropriate actions, such as adjusting model complexity,
incorporating regularization, or modifying the training data.

In conclusion, analyzing the variation in gradients between in-sample and out-sample data provides a
powerful mechanism for diagnosing overfitting and distribution shifts in machine learning models. By enabling
a deeper understanding of model behavior across different data sets, this approach aids in the development
of more robust, generalizable models.
