
To appear as a part of an upcoming textbook on deep learning.

Backpropagation and Optimization in Deep Learning: Tutorial and Survey

Benyamin Ghojogh, Ali Ghodsi   {bghojogh, ali.ghodsi}@uwaterloo.ca


Waterloo, Ontario, Canada

Abstract

This is a tutorial and survey paper on backpropagation and optimization in neural networks. It starts with gradient descent, line-search, momentum, and steepest descent. Then, backpropagation is introduced. Afterwards, stochastic gradient descent, mini-batch stochastic gradient descent, and their convergence rates are discussed. Adaptive learning rate methods, including AdaGrad, RMSProp, and Adam, are explained. Then, algorithms for sharpness-aware minimization are introduced. Finally, convergence guarantees for optimization in over-parameterized neural networks are discussed.

1. Introduction

Machine learning is nothing but optimization plus some other tools such as linear algebra, probability, and statistics. Deep learning, as a family of machine learning algorithms, is also merely optimization, with novelty in devising practical loss functions. In fact, deep learning minimizes some loss function with optimization algorithms. Different optimization algorithms for training neural networks have been proposed, some of which are backpropagation (Rumelhart et al., 1986), genetic algorithms (Montana & Davis, 1989; Leung et al., 2003), and belief propagation as in restricted Boltzmann machines (Hinton & Salakhutdinov, 2006).

The most well-known and widely used optimization algorithm for deep learning is backpropagation. Backpropagation is a combination of gradient descent and the chain rule in derivatives. It is possible to use second-order optimization, such as Newton's method (Ghojogh et al., 2023a), for training neural networks; however, first-order optimization, i.e., gradient descent, is usually used in deep learning. This is because calculation of the Hessian in second-order optimization is not efficient, and fast first-order optimization has been found to be sufficient for training neural networks.

This chapter discusses backpropagation and optimization in deep learning. It starts with gradient descent, line-search, momentum, and steepest descent. Then, backpropagation is introduced. Afterwards, stochastic gradient descent, mini-batch stochastic gradient descent, and their convergence rates are discussed. Adaptive learning rate methods, including AdaGrad, RMSProp, and Adam, are explained. Then, algorithms for sharpness-aware minimization are introduced. Finally, convergence guarantees for optimization in over-parameterized neural networks are discussed.

2. Gradient Descent

2.1. Gradient Descent

Backpropagation and many optimization algorithms in neural networks are usually first-order optimization methods. The most well-known first-order optimization algorithm is gradient descent. In fact, as will be explained in this chapter, backpropagation is nothing but gradient descent combined with the chain rule in derivatives. Therefore, we start with gradient descent. Gradient descent is one of the fundamental first-order methods. It was first suggested by Cauchy in 1847 (Lemaréchal, 2012) and by Hadamard in 1908 (Hadamard, 1908), and its convergence was later analyzed in (Curry, 1944).

Consider the following unconstrained optimization of the function f(.) with respect to the variable w:

minimize_w f(w).    (1)

Numerical optimization for unconstrained problems starts with a random feasible initial point and iteratively updates it by a step ∆w:

w(k+1) := w(k) + ∆w.    (2)

It continues until convergence to (or getting sufficiently close to) the desired optimal point w∗.

Assume the gradient of the function f(w) is L-smooth, where L is the Lipschitz constant¹. In gradient descent, the update at every iteration is (see (Ghojogh et al., 2021; 2023a) for proof):

∆w = −(1/L) ∇f(w(k))  =⇒  w(k+1) := w(k) − (1/L) ∇f(w(k)).    (3)

¹ The function f(.) is Lipschitz with Lipschitz constant L if |f(w1) − f(w2)| ≤ L ∥w1 − w2∥2, ∀w1, w2 ∈ D. In other words, the Lipschitz constant can be seen as an upper bound on the slope of the function in its domain D. Note that here, the gradient of the function is assumed to be Lipschitz smooth, meaning that L is an upper bound on the slope of the gradient of the function, i.e., |∇f(w1) − ∇f(w2)| ≤ L ∥w1 − w2∥2, ∀w1, w2 ∈ D.

The problem is that the Lipschitz constant L is often not known or is hard to compute. Hence, rather than ∆w = −(1/L) ∇f(w(k)), we use:

∆w = −η∇f(w(k)),  i.e.,  w(k+1) := w(k) − η∇f(w(k)),    (4)

where η > 0 is the step size, also called the learning rate in the data science literature. If the optimization problem is maximization rather than minimization, the step should be ∆w = η∇f(w(k)) rather than Eq. (4); in that case, the method is called gradient ascent. The learning rate can be found by line-search, which is often used in optimization but not in deep learning. Line-search will be discussed in Section 2.3.

For a convex function, the series of solutions converges to the optimal solution while the function value decreases iteratively until the local minimum:

{w(0), w(1), w(2), . . . } → w∗,
f(w(0)) ≥ f(w(1)) ≥ f(w(2)) ≥ · · · ≥ f(w∗).

If the optimization problem is convex, the solution is the global solution; otherwise, the solution is a local one. As Fig. 1 illustrates, gradient descent works relatively well even for non-convex cost functions; it might oscillate for a non-convex cost, but its overall pattern is decreasing.

Figure 1. The gradient descent steps for (a) a convex cost function and (b) a non-convex cost function. From left to right, the figures depict the optimization steps on the cost function, the cost value at the iterations of optimization, and the cost value over a large number of iterations.

2.2. Convergence Criteria

For all numerical optimization methods, including gradient descent, there exist several convergence criteria for stopping the updates of the solution and terminating the optimization. Some of them are:

• Small norm of gradient:
  ∥∇f(w(k+1))∥2 ≤ ϵ,
  where ϵ is a small positive number. The reason for this criterion is the first-order optimality condition, stating that at a local optimum, ∥∇f(w∗)∥2 = 0. If the function is not convex, this criterion has the risk of stopping at a saddle point.

• Small change of the cost function:
  |f(w(k+1)) − f(w(k))| ≤ ϵ.

• Small change of the gradient of the function:
  |∇f(w(k+1)) − ∇f(w(k))| ≤ ϵ.

• Reaching the maximum desired number of iterations.

2.3. Line-search

It was explained that the step size of gradient descent requires knowledge of the Lipschitz constant for the smoothness of the gradient. However, the exact Lipschitz constant may not be known, especially since it is usually hard to compute. Alternatively, a suitable step size η can be found by a search, named line-search. The line-search of
every optimization iteration starts with η = 1, and if it does not satisfy:

f(w(k) + ∆w) − f(w(k)) < 0,    (5)

with the step ∆w = −η∇f(w(k)), i.e.:

f(w(k) + ∆w) < f(w(k))  =⇒  f(w(k) − η∇f(w(k))) < f(w(k)),    (6)

the step size is halved, η ← η/2. This halving of the step size is repeated until this condition is satisfied, i.e., until there is a decrease in the objective function. Note that this decrease will happen when the step size becomes small enough to satisfy (see (Ghojogh et al., 2021; 2023a) for proof):

η < 1/L.    (7)

A more sophisticated line-search method is the Armijo line-search (Armijo, 1966), also called backtracking line-search. Another more sophisticated line-search is the Wolfe conditions (Wolfe, 1969). More details of these can be studied in (Ghojogh et al., 2021; 2023a). The algorithm of gradient descent with line-search is Algorithm 1. As this algorithm shows, line-search has its own internal iterations inside every iteration of gradient descent.

1  Initialize w(0)
2  for iteration k = 0, 1, . . . do
3      Initialize η := 1
4      for iteration τ = 1, 2, . . . do
5          Check the line-search condition, i.e., Eq. (6)
6          if not satisfied then
7              η ← (1/2) × η
8          else
9              w(k+1) := w(k) − η∇f(w(k))
10             break the loop
11     Check the convergence criterion
12     if converged then
13         return w(k+1)
Algorithm 1: Gradient descent with line-search

2.4. Momentum

Gradient descent and other first-order methods can have a momentum term. Momentum, proposed in (Rumelhart et al., 1986), makes the change of solution ∆w a little similar to the previous change of solution. Therefore, the change adds a history of the previous change to Eq. (4):

(∆w)(k) := α(∆w)(k−1) − η(k) ∇f(w(k)),    (8)

where α > 0 is the momentum parameter, which weights the importance of the history compared to the descent direction. We use this (∆w)(k) in Eq. (2) for updating the solution:

w(k+1) := w(k) + (∆w)(k).

Because of faithfulness to the track of previous updates, momentum reduces the amount of oscillation of the updates in gradient descent optimization. This effect is shown in Fig. 2. Moreover, the addition of terms in Eq. (8), illustrated in Fig. 3, explains mathematically how the reduction of oscillation works when using the momentum term.

Figure 2. Gradient descent (a) without and (b) with momentum. Each contour shows the same cost value in the optimization. As the figure shows, momentum reduces the oscillation of optimization.

Figure 3. Update of the solution in gradient descent (a) without and (b) with momentum, according to Eq. (8). According to the addition of terms in this equation, oscillation is reduced.

2.5. Steepest Descent

Steepest descent is similar to gradient descent, but there is a difference between them. In steepest descent, the solution moves toward the negative gradient as much as possible to reach the smallest function value that can be achieved at every iteration. Hence, the step size at iteration k of steepest descent is calculated as (Chong & Zak, 2004):

η(k) := arg min_η f(w(k) − η∇f(w(k))),    (9)

and then, the solution is updated using Eq. (4) as in gradient descent:

w(k+1) := w(k) − η∇f(w(k)).

3. Backpropagation

Consider a feed-forward neural network with four layers, depicted in Fig. 4. Note that, depending on whether one counts weights or nodes as the layers of a neural network, one might say the network of this figure has either three or four layers, respectively. Here, for the sake of explaining backpropagation, the nodes are considered to be the layers of the network.

Figure 4. A feed-forward neural network with four layers.

Every neuron i in the neural network is depicted in Fig. 5. Let wji denote the weight connecting neuron i to neuron j. Let ai and zi be the output of neuron i before and after applying its activation function σi(.): R → R, respectively:

ai = Σ_{ℓ=1}^{m} wiℓ zℓ,    (10)
zi := σi(ai).    (11)

Figure 5. Neuron i in the neural network.

Consider three neurons in three successive layers of a network, as illustrated in Fig. 6. Consider Eq. (10), i.e., ai = Σℓ wiℓ zℓ, which sums over the neurons in layer ℓ. By the chain rule in derivatives, the gradient of the error e with respect to the weight between neurons ℓ and i is:

∂e/∂wiℓ = (∂e/∂ai) × (∂ai/∂wiℓ) =(a) δi × zℓ,    (12)

where (a) is because ai = Σℓ wiℓ zℓ and we define:

δi := ∂e/∂ai.    (13)

Figure 6. Three neurons in three successive layers of a network.

If layer i is the last layer, δi can be computed by the derivative of the error (loss function) with respect to the output. However, if i is one of the hidden layers, δi is computed by the chain rule as:

δi = ∂e/∂ai = Σj ((∂e/∂aj) × (∂aj/∂ai)) = Σj (δj × (∂aj/∂ai)).    (14)

The term ∂aj/∂ai is calculated by the chain rule as:

∂aj/∂ai = (∂aj/∂zi) × (∂zi/∂ai) =(a) wji σ′(ai),    (15)

where (a) is because aj = Σi wji zi and zi = σ(ai), and σ′(.) denotes the derivative of the activation function. Putting Eq. (15) in Eq. (14) gives:

δi = σ′(ai) Σj (δj wji).    (16)

Putting this equation in Eq. (12) gives:

∂e/∂wiℓ = zℓ σ′(ai) Σj (δj wji).    (17)
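To make Eqs. (10)-(17) concrete, the following is a minimal NumPy sketch of one forward and one backward pass for a single training example. The specific choices here (one hidden layer, sigmoid activations, and the squared error e = (1/2)(z2 − y)²) are assumptions made only for this illustration; they are not prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))   # weights w_{il} from input neuron l to hidden neuron i
W2 = rng.normal(size=(1, 4))   # weights w_{ji} from hidden neuron i to output neuron j
x = rng.normal(size=3)         # one input instance (assumed toy data)
y = 1.0                        # its target label (assumed)

# Forward pass: Eqs. (10) and (11), layer by layer.
a1 = W1 @ x                    # pre-activations a_i of the hidden neurons
z1 = sigmoid(a1)               # outputs z_i of the hidden neurons
a2 = W2 @ z1                   # pre-activation of the output neuron
z2 = sigmoid(a2)               # network output

# Last layer: delta = de/da = (z2 - y) * sigma'(a2), i.e., Eq. (13) evaluated at the output.
delta2 = (z2 - y) * sigmoid_prime(a2)

# Hidden layer: delta_i = sigma'(a_i) * sum_j delta_j * w_{ji}, i.e., Eq. (16).
delta1 = sigmoid_prime(a1) * (W2.T @ delta2)

# Gradients: de/dw_{il} = delta_i * z_l, i.e., Eqs. (12) and (17).
grad_W2 = np.outer(delta2, z1)   # same shape as W2
grad_W1 = np.outer(delta1, x)    # same shape as W1
```

Subtracting η(k) times these gradients from W1 and W2 is then exactly the weight update discussed next.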

Backpropagation uses the gradient in Eq. (17) for updating the weight wiℓ, ∀i, ℓ, by gradient descent:

wiℓ(k+1) := wiℓ(k) − η(k) (∂e/∂wiℓ), ∀i, ℓ.    (18)

Therefore, for the weights of the last layer (if i denotes the neurons in the last layer), the gradient descent update becomes:

wiℓ(k+1) := wiℓ(k) − η(k) zℓ (∂e/∂ai), ∀i, ℓ,    (19)

according to Eqs. (12) and (13). For the weights of the other layers (if i denotes the neurons in a hidden layer), the gradient descent update becomes:

wiℓ(k+1) := wiℓ(k) − η(k) zℓ σ′(ai) Σj (δj wji), ∀i, ℓ,    (20)

according to Eq. (17).

The backpropagation algorithm is given in Algorithm 2. It tunes the weights from the last layer to the first layer in every iteration of optimization. Therefore, backpropagation, proposed in 1986 (Rumelhart et al., 1986), is actually gradient descent with the chain rule in derivatives, because of having layers of parameters (so that the gradients for the weights of every layer depend on the gradients for the weights of later layers in the network). It is the most well-known optimization method used for training neural networks.

1  Initialize w(0)
2  for iteration k = 0, 1, . . . do
3      Initialize the learning rate η(0)
4      for layer r from the last layer to the first layer do
5          for neuron i in the layer r do
6              for neuron ℓ in the layer (r − 1) do
7                  if layer r is the last layer then
8                      wiℓ(k+1) := wiℓ(k) − η(k) zℓ (∂e/∂ai)
9                  else
10                     wiℓ(k+1) := wiℓ(k) − η(k) zℓ σ′(ai) Σj (δj wji)
11     Check the convergence criterion
12     if converged then
13         return all weights of the neural network
14     Adapt (update) the learning rate η(k).
Algorithm 2: Backpropagation algorithm

4. Stochastic Gradient Descent

4.1. Algorithm of Stochastic Gradient Descent

Assume there is a dataset of n data instances, {xi ∈ Rd, i = 1, . . . , n}, and their labels {li ∈ R, i = 1, . . . , n}. Let the cost function f(.) be decomposed into a summation of n terms {fi(w), i = 1, . . . , n}. In other words, the neural network has a loss value fi(w) for every input data instance xi. Therefore, the total loss is the average of the loss values over the n data instances, and the optimization problem becomes:

minimize_w (1/n) Σ_{i=1}^{n} fi(w).    (21)

In this case, the full gradient is the average gradient, i.e.:

∇f(w) = (1/n) Σ_{i=1}^{n} ∇fi(w),    (22)

so the update in gradient descent, i.e., Eq. (3), becomes:

∆w = −η∇f(w(k)) = −(η/n) Σ_{i=1}^{n} ∇fi(w(k)).

This is what gradient descent uses for updating the solution at every iteration.

Calculation of the full gradient is time-consuming and inefficient for large values of n, especially as it needs to be recalculated at every iteration of gradient descent. Stochastic Gradient Descent (SGD), also called the stochastic gradient method, approximates gradient descent stochastically and samples (i.e., bootstraps) one of the points at every iteration for updating the solution. Therefore, it uses:

w(k+1) := w(k) − η(k) ∇fi(w(k)),    (23)

rather than Eq. (4), w(k+1) := w(k) − η∇f(w(k)). The idea of stochastic approximation was first proposed in 1951 (Robbins & Monro, 1951). It was first used for machine learning in 1998 (Bottou et al., 1998).

As Eq. (23) states, SGD often uses an adaptive step size which changes in every iteration. The step size can be decreasing because in the initial iterations, where the solution is far from the optimal solution, the step size can be large; however, it should be small in the last iterations, which are supposed to be close to the optimal solution. Some well-known adaptations for the step size are:

η(k) := 1/k,
η(k) := 1/√k,
η(k) := η.
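As a concrete illustration of Eq. (23) with a decaying step size of the 1/√k type listed above, the following NumPy sketch runs SGD on an assumed least-squares problem with per-sample losses fi(w) = (1/2)(xi⊤w − li)². The synthetic data, the 0.1 scaling of the step size, and the iteration count are arbitrary choices for the example, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))          # data instances x_i (assumed synthetic data)
w_true = rng.normal(size=d)
labels = X @ w_true                  # labels l_i (noise-free by assumption)

def grad_fi(w, i):
    # Gradient of the stand-in loss f_i(w) = 0.5 * (x_i^T w - l_i)^2.
    return (X[i] @ w - labels[i]) * X[i]

w = np.zeros(d)
for k in range(1, 20001):
    i = rng.integers(n)              # sample (bootstrap) one point, as in Eq. (23)
    eta_k = 0.1 / np.sqrt(k)         # scaled decaying step size (0.1 is an arbitrary choice)
    w = w - eta_k * grad_fi(w, i)    # w^(k+1) := w^(k) - eta^(k) * grad f_i(w^(k))

print(np.linalg.norm(w - w_true))    # should be small after enough iterations
```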

4.2. Convergence Analysis of Stochastic Gradient Descent

Proposition 1 (Convergence rate of gradient descent with full gradient). Consider a convex and differentiable function f(.), with domain D, whose gradient is L-smooth. Let f∗ be the minimum of the cost function and w∗ be the minimizer. Starting from the initial solution w(0), after t iterations of the optimization algorithm, the convergence rate of gradient descent is:

f(w(t+1)) − f∗ ≤ 2L∥w(0) − w∗∥2² / (t + 1) = O(1/t).    (24)

Proposition 2 (Convergence rate of stochastic gradient descent). Consider a function f(w) = Σ_{i=1}^{n} fi(w) which is bounded below and where each fi is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth. Assume E[∥∇fi(wk)∥2² | wk] ≤ β², where β is a constant. Depending on the step size, the convergence rate of SGD is:

f(w(t+1)) − f∗ ≤ O(1/log t)   if η(τ) = 1/τ,    (25)
f(w(t+1)) − f∗ ≤ O(log t/√t)   if η(τ) = 1/√τ,    (26)
f(w(t+1)) − f∗ ≤ O(1/t + η)   if η(τ) = η,    (27)

where τ denotes the iteration index. If the functions fi are µ-strongly convex, then the convergence rate of SGD is:

f(w(t+1)) − f∗ ≤ O(1/t)   if η(τ) = 1/(µτ),    (28)
f(w(t+1)) − f∗ ≤ O((1 − µ/L)^t + η)   if η(τ) = η.    (29)

Eqs. (27) and (29) show that, with a fixed step size η, SGD converges sublinearly for a non-convex function and exponentially for a strongly convex function in the initial iterations. However, in the late iterations, where t → ∞, it stagnates to a neighborhood O(η) around the optimal point and never reaches it. For example, for Eq. (27), there is:

lim_{t→∞} f(w(t+1)) − f∗ ≤ lim_{t→∞} O(1/t + η) = O(η)  =⇒  f(w(t+1)) = f∗ + O(η).

This is while gradient descent has the convergence rate in Eq. (24), which converges to the solution in the late iterations where t → ∞:

lim_{t→∞} f(w(t+1)) − f∗ ≤ lim_{t→∞} O(1/t) = O(0) = 0  =⇒  f(w(t+1)) = f∗.

Therefore, SGD has less accuracy than gradient descent. The advantage of SGD over gradient descent is that every one of its iterations is much faster than an iteration of gradient descent, because of fewer computations for the gradient. This faster pace per iteration matters more when n is huge. In summary, SGD converges quickly, but to an optimal solution of low accuracy.

It is noteworthy that the full gradient is not available in SGD to use for checking convergence, as discussed before. One can use other criteria or merely check the norm of the gradient for the sampled point. SGD can be used with the line-search methods and momentum, too.

5. Mini-Batch Stochastic Gradient Descent

Gradient descent uses the entire n data points and SGD uses one randomly sampled point at every iteration. For large datasets, gradient descent is very slow and intractable in every iteration, while SGD will need a significant number of iterations to roughly cover all data. Besides, SGD has low accuracy in convergence to the optimal solution. There can be a middle-case scenario where a batch of b randomly sampled points is used at every iteration. This method is named mini-batch SGD or the hybrid deterministic-stochastic gradient method. This batch-wise approach is wise for large datasets.

Usually, before the start of optimization, the n data points are randomly divided into ⌊n/b⌋ batches of size b. This is equivalent to simple random sampling for sampling points into batches without replacement. Suppose the dataset is denoted by D (where |D| = n) and the i-th batch is Bi (where |Bi| = b). The batches are disjoint:

∪_{i=1}^{⌊n/b⌋} Bi = D,    (30)
Bi ∩ Bj = ∅, ∀i, j ∈ {1, . . . , ⌊n/b⌋}, i ≠ j.    (31)

Another less-used approach for making batches is to sample points for a batch during optimization. This is equivalent to bootstrapping for sampling points into batches with replacement. In this case, the batches are not disjoint anymore and Eqs. (30) and (31) do not hold.

Definition 1 (Epoch). In mini-batch SGD, when all ⌊n/b⌋ batches of data are used for optimization once, an epoch is completed. After completion of an epoch, the next epoch is started, and epochs are repeated until convergence of the optimization.

In mini-batch SGD, if the k-th iteration of optimization uses the k′-th batch, the update of the solution is done as:

w(k+1) := w(k) − η(k) (1/b) Σ_{i∈Bk′} ∇fi(w(k)).    (32)
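The following is a minimal NumPy sketch of Eq. (32) together with Definition 1: each epoch shuffles the data, cuts it into disjoint batches as in Eqs. (30) and (31), and applies the batch-averaged gradient update. The synthetic least-squares task, batch size, fixed step size, and epoch count are assumptions for illustration only, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 128, 5, 16               # n divisible by b, so there are n/b = 8 batches
X = rng.normal(size=(n, d))        # synthetic data instances (assumption)
w_true = rng.normal(size=d)
labels = X @ w_true                # synthetic regression targets (assumption)

w = np.zeros(d)
eta = 0.05                         # fixed step size (assumption)
for epoch in range(100):           # one epoch = every batch used once (Definition 1)
    # Simple random sampling without replacement: shuffle, then cut into disjoint
    # batches, matching Eqs. (30) and (31).
    perm = rng.permutation(n)
    for start in range(0, n, b):
        batch = perm[start:start + b]
        residual = X[batch] @ w - labels[batch]
        grad = X[batch].T @ residual / b   # (1/b) * sum of grad f_i over the batch, Eq. (32)
        w = w - eta * grad

print(np.linalg.norm(w - w_true))  # approaches zero as the epochs proceed
```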

The scale factor 1/b is sometimes dropped for simplicity.

Mini-batch SGD is used significantly in deep learning and neural networks (Bottou et al., 1998; Goodfellow et al., 2016). Because of dividing the data into batches, mini-batch SGD can be solved on parallel servers as a distributed optimization method, making it suitable for optimization in deep learning using GPU cores. Note that the literature and codes of deep learning usually refer to mini-batch SGD as SGD for simplicity and brevity; therefore, do not confuse it with the one-sample SGD discussed in Section 4.

Proposition 3 (Convergence rates for mini-batch SGD). Consider a function f(w) = Σ_{i=1}^{n} fi(w) which is bounded below and where each fi is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth, and assume η(k) = η is fixed. The batch-wise gradient is an approximation to the full gradient with some error et for the t-th iteration:

(1/b) Σ_{i∈Bt′} ∇fi(w(t)) = ∇f(w(t)) + et.    (33)

The convergence rate of mini-batch SGD for non-convex and convex functions is:

O(1/t + ∥et∥2²),    (34)

where t denotes the iteration index. If the functions fi are µ-strongly convex, then the convergence rate of mini-batch SGD is:

O((1 − µ/L)^t + ∥et∥2²).    (35)

By increasing the batch size, (1/b) Σ_{i∈Bt′} ∇fi(w(t)) gets closer to ∇f(w(t)), so the error et in Eq. (33) is reduced. Therefore, the convergence rate of mini-batch SGD, i.e., Eq. (34), gets closer to that of gradient descent, i.e., Eq. (24), if the batch size increases.

If the batches are sampled without replacement (i.e., sampling batches by simple random sampling before the start of optimization) or with replacement (i.e., bootstrapping during optimization), the expected error is (Ghojogh et al., 2020, Proposition 3):

E[∥et∥2²] = (1 − b/n) σ²/b,    (36)
E[∥et∥2²] = σ²/b,    (37)

respectively, where σ² is the variance of the whole dataset. According to Eqs. (36) and (37), the accuracy of SGD with sampling without and with replacement increases by b → n (increasing the batch size toward the size of the dataset) and b → ∞ (increasing the batch size to infinity), respectively. However, this increase makes every iteration slower, so there is a trade-off between accuracy and speed.

6. Adaptive Learning Rate

Recall from Section 2.3 that a suitable learning rate can be found by line-search. However, the learning rate in deep learning is usually set to an initial value and is adapted in different iterations. The learning rate can be adapted in stochastic gradient descent optimization methods. The three most well-known methods for adapting the learning rate are AdaGrad, RMSProp, and Adam. These adaptive learning rate methods are introduced in the following.

6.1. Adaptive Gradient (AdaGrad)

The Adaptive Gradient (AdaGrad) method (Duchi et al., 2011) updates the solution iteratively as:

w(k+1) := w(k) − η(k) G−1 ∇fi(w(k)),    (38)

where G is a (d × d) diagonal matrix whose (j, j)-th element is:

G(j, j) := √(ε + Σ_{τ=0}^{k} (∇j fiτ(w(τ)))²),    (39)

where ε ≥ 0 is for stability (making G full rank), iτ is the randomly sampled point (from {1, . . . , n}) at iteration τ, and ∇j fiτ(.) is the j-th dimension of the derivative of fiτ(.); note that the gradient ∇fiτ(.) is d-dimensional. Putting Eq. (39) in Eq. (38) simplifies AdaGrad to:

wj(k+1) := wj(k) − [η(k) / √(ε + Σ_{τ=0}^{k} (∇j fiτ(w(τ)))²)] ∇fj(wj(k)).    (40)

AdaGrad keeps a history of the sampled points and the derivatives taken for them. During the iterations so far, if a dimension has changed significantly, it dampens the learning rate for that dimension (see the inverse in Eq. (38)); hence, it gives more weight for changing the dimensions which have not changed noticeably. In this way, all dimensions have a fair chance to change.

6.2. Root Mean Square Propagation (RMSProp)

Root Mean Square Propagation (RMSProp) was first proposed by Hinton in (Tieleman & Hinton, 2012), which was an unpublished slide deck for academic lectures at the University of Toronto. It is an improved version of Rprop (resilient backpropagation) (Riedmiller & Braun, 1992), which uses the sign of the gradient in optimization. Inspired by momentum in Eq. (8):

(∆w)(k) := α(∆w)(k−1) − η(k) ∇f(w(k)),

it updates a scalar variable v as (Hinton et al., 2012):

v(k+1) := γ v(k) + (1 − γ) ∥∇fi(w(k))∥2²,    (41)

where γ ∈ [0, 1] is the forgetting factor (e.g., γ = 0.9). Then, it uses this v to weight the learning rate:

w(k+1) := w(k) − [η(k) / √(ε + v(k+1))] ∇fj(wj(k)),    (42)

where ε ≥ 0 is for stability, so as not to have division by zero. Comparing Eqs. (40) and (42) shows that RMSProp has a similar form to AdaGrad.

6.3. Adam Optimizer

The Adam (Adaptive Moment Estimation) optimizer (Kingma & Ba, 2014) improves over RMSProp by adding a momentum term. It updates the vector m ∈ Rd and the scalar v as:

m(k+1) := γ1 m(k) + (1 − γ1) ∇fi(w(k)),    (43)
v(k+1) := γ2 v(k) + (1 − γ2) ∥∇fi(w(k))∥2²,    (44)

where γ1, γ2 ∈ [0, 1]. It normalizes these variables as:

m̂(k+1) := m(k+1) / (1 − γ1^k),    (45)
v̂(k+1) := v(k+1) / (1 − γ2^k).    (46)

Then, it updates the solution as:

w(k+1) := w(k) − [η(k) / √(ε + v̂(k+1))] m̂(k+1),    (47)

which is stochastic gradient descent with momentum while using RMSProp. The Adam optimizer is one of the most widely used optimizers in neural networks. In summary, most deep learning libraries offer SGD (i.e., mini-batch SGD) and Adam as options for optimizers.

7. Sharpness-Aware Minimization (SAM)

Although backpropagation can find a local minimum of the loss function, not all local minima are equally good. It has been empirically observed (Foret et al., 2021) that it is better for the converged local minimum to be smooth rather than sharp, meaning that the neighborhood of the found local minimum is better to be almost flat rather than being a single sharp local minimum (see Fig. 7). It has been observed that there may be a connection between the smoothness of the found local minimum and the generalization of the neural network to unseen test data (Foret et al., 2021). However, some works have cast doubt on this observation, stating that the smoothness of the local minimum is not the only factor for generalization (Wen et al., 2024).

Figure 7. Visual comparison of (a) a sharp local minimum and (b) an almost flat (smooth) local minimum. The credit of the image is for (Foret et al., 2021). See (Li et al., 2018) for visualization of loss functions in neural networks.

7.1. Vanilla SAM

As a result, Sharpness-Aware Minimization (SAM) has been proposed (Foret et al., 2021). This method uses a zero-sum min-max loss function for finding a flat (smooth) solution while minimizing the loss function of the neural network (Foret et al., 2021):

minimize_w LSAM(w) + λ∥w∥2²,    (48)
LSAM(w) := maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ),    (49)

where w is the weights of the neural network, L(.) is the loss function of the neural network, LSAM(.) is the SAM loss function, λ ≥ 0 is the regularization parameter, ρ ≥ 0 is a hyperparameter, and p ∈ [1, ∞], where p = 2 is recommended (Foret et al., 2021). It is possible to drop the regularization term and write the loss as a min-max optimization problem:

minimize_w maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ).    (50)

Eq. (50) first finds the maximum loss in a neighborhood of radius ρ around the solution w and then minimizes that. This forces the loss over the whole local neighborhood of the solution to be small and hence flat or smooth.

Note that Eq. (49) can be restated as (Tahmasebi et al., 2024):

LSAM(w) := maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ)
         = L(w) − L(w) + maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ)
         = L(w) + [maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ) − L(w)].    (51)

The first term in this equation is the empirical loss and the second term (in brackets) is the sharpness measure. The sharpness term can be generalized into a class of sharpness measurements (Tahmasebi et al., 2024).
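The sharpness term in Eq. (51) is defined through a maximization over the ρ-ball, which is not evaluated exactly in practice. Before the gradient-based approximation derived next in the text, one crude way to build intuition is to probe the neighborhood with random perturbations of norm ρ and keep the largest observed loss increase. The sketch below does only that; the two-variable stand-in loss, the value of ρ, and the number of probes are assumptions for illustration, and this is not the SAM algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w):
    # Stand-in loss with one sharp direction (w[0]) and one flat direction (w[1]).
    return 50.0 * w[0] ** 2 + 0.1 * w[1] ** 2

def sharpness_estimate(w, rho=0.05, num_probes=2000):
    # Monte-Carlo lower bound on  max_{||eps||_2 <= rho} L(w + eps) - L(w)  from Eq. (51).
    base = loss(w)
    worst = 0.0
    for _ in range(num_probes):
        eps = rng.normal(size=w.shape)
        eps = rho * eps / np.linalg.norm(eps)   # perturbation on the sphere of radius rho
        worst = max(worst, loss(w + eps) - base)
    return worst

w = np.zeros(2)                  # a test point where the two directions differ in sharpness
print(sharpness_estimate(w))     # dominated by the sharp w[0]-direction, about 50 * rho**2
```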

Consider the inner maximization in Eq. (50):

ϵ∗(w) := arg max_{∥ϵ∥p ≤ ρ} L(w + ϵ)
       ≈(a) arg max_{∥ϵ∥p ≤ ρ} L(w) + ϵ⊤ ∇w L(w)
       =(b) arg max_{∥ϵ∥p ≤ ρ} ϵ⊤ ∇w L(w),

where ∇w denotes the derivative with respect to w, (a) is because of the first-order Taylor series expansion of L(w + ϵ) around ϵ = 0, and (b) is because L(w) is not a function of ϵ. This maximization is a classical dual norm problem whose solution is (Foret et al., 2021):

ϵ∗(w) = ρ sign(∇w L(w)) |∇w L(w)|^(q−1) / (∥∇w L(w)∥q^q)^(1/p),    (52)

where (1/p) + (1/q) = 1. With p = 2, it simplifies to:

ϵ∗(w) = ρ ∇w L(w) / ∥∇w L(w)∥2.    (53)

It is noteworthy that there are discussions in the literature that Eq. (53) gives an upper bound, and not an exact bound, on the classification error. We refer interested readers to (Xie et al., 2024) for those discussions.

Let us put the found ϵ, i.e., Eq. (53), in Eq. (50) and calculate the gradient of the loss function:

∇w L(w + ϵ∗(w)) =(a) (d(w + ϵ∗(w))/dw) ∇w L(w)|w+ϵ∗(w)
                = ∇w L(w)|w+ϵ∗(w) + (dϵ∗(w)/dw) ∇w L(w)|w+ϵ∗(w)
                ≈(b) ∇w L(w)|w+ϵ∗(w),    (54)

where (a) is because of the chain rule in derivatives and (b) is because the second term contains a multiplication of two derivatives, which is small compared to the first term and hence can be ignored. Eq. (54) is the gradient of the SAM loss function and it can be used in backpropagation. This gradient can be numerically approximated by deep learning libraries such as PyTorch (Paszke et al., 2019).

7.2. Efficient SAM

Calculation of the gradient in vanilla SAM is time-consuming and not efficient. Therefore, Efficient SAM (ESAM) (Du et al., 2022) has been proposed for improving the computational efficiency of SAM. ESAM has two methodologies, i.e., Stochastic Weight Perturbation (SWP) and Sharpness-sensitive Data Selection (SDS).

7.2.1. Stochastic Weight Perturbation (SWP)

SWP (Du et al., 2022) efficiently approximates ϵ∗(w) using a random subset of weights rather than all weights of the network. Let n denote the number of weights in the neural network and the weights of the network be {w1, . . . , wn}. In each iteration of backpropagation, SWP makes a random binary gradient mask m = [m1, . . . , mn]⊤ where mi ∼ Bernoulli(β), i.i.d., and β is the parameter of the Bernoulli distribution. The solution of the inner maximization, i.e., ϵ∗(w), is approximated by (Du et al., 2022):

ϵ̂(w) ≈ (1/β) m ⊙ ϵ∗(w),    (55)

with ⊙ denoting the elementwise product, which is used in the gradient of the loss in Eq. (54). In other words, SWP does not use the entire set of weights and, instead, uses a subset of the weights for ϵ∗(w). This simplification does not affect the expectation of ϵ∗(w):

E[ϵ̂(w)i] =(55) (1/β) E[mi ϵ∗(w)i] =(a) (1/β) E[mi] E[ϵ∗(w)i] =(b) (1/β) β E[ϵ∗(w)i] = E[ϵ∗(w)i],

where ϵ̂(w)i denotes the i-th element of ϵ̂(w), (a) is because mi and ϵ∗(w)i are independent, and (b) is because of the expectation of the Bernoulli distribution of mi.

7.2.2. Sharpness-sensitive Data Selection (SDS)

SDS (Du et al., 2022) efficiently approximates ϵ∗(w) using a subset of the mini-batch rather than the entire mini-batch. It splits the mini-batch B as (Du et al., 2022):

B+ := {(xi, yi) ∈ B | L(w + ϵ∗(w)) − L(w) > α},
B− := {(xi, yi) ∈ B | L(w + ϵ∗(w)) − L(w) < α},    (56)

where B+ is the sharpness-sensitive subset of the mini-batch and α > 0 is a hyperparameter. SDS uses the sharpness-sensitive B+ rather than the entire mini-batch B in the loss function of SAM. Therefore, calculation is faster with fewer samples in the mini-batch.

7.3. Adaptive SAM

SAM uses a fixed radius when considering ∥ϵ∥p ≤ ρ in its optimization. Therefore, it is sensitive to re-scaling the weights of the neural network by, for example, a scaling matrix A (Kwon et al., 2021):

maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ) ≠ maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(Aw + ϵ).

In other words, neural networks with different weight scales have different sharpness values.

Adaptive SAM (ASAM) (Kwon et al., 2021) makes SAM robust to weight re-scaling. We define a normalization operator of w, denoted by Tw^(−1), where:

T_Aw^(−1) A = T_w^(−1),    (57)

for any invertible scaling matrix A. ASAM defines and optimizes the adaptive sharpness of w (Kwon et al., 2021):

maximize_{ϵ: ∥Tw^(−1) ϵ∥p ≤ ρ} L(w + ϵ) − L(w),    (58)

while the regular sharpness of w, which is sensitive to scale, is defined as in Eq. (51):

maximize_{ϵ: ∥ϵ∥p ≤ ρ} L(w + ϵ) − L(w).

The adaptive sharpness, defined in Eq. (58), is scale-invariant because for the scaled weights Aw, there is:

maximize_{ϵ: ∥T_Aw^(−1) ϵ∥p ≤ ρ} L(Aw + ϵ) =(a) maximize_{ϵ: ∥T_Aw^(−1) ϵ∥p ≤ ρ} L(w + A^(−1) ϵ)
   =(b) maximize_{ϵ′: ∥T_Aw^(−1) A ϵ′∥p ≤ ρ} L(w + ϵ′)
   =(57) maximize_{ϵ′: ∥T_w^(−1) ϵ′∥p ≤ ρ} L(w + ϵ′) =(c) maximize_{ϵ: ∥T_w^(−1) ϵ∥p ≤ ρ} L(w + ϵ),

where (a) is because of left-multiplying the input of the loss with A, (b) is because of the change of variable ϵ′ := A^(−1) ϵ (so that ϵ = Aϵ′), and (c) is because of renaming the dummy variable ϵ′ to ϵ.

8. Convergence Guarantees for Optimization in Over-parameterized Neural Networks

There are some theories on the convergence of optimization in over-parameterized neural networks, i.e., networks with many more learnable parameters than the number of training data instances. In other words, over-parameterized neural networks are wide networks having a sufficient number of neurons in their hidden layers. The assumption of being over-parameterized is for making sure that the network has enough capability for learning. These theories explain why shallow (Soltanolkotabi et al., 2018) and deep (Allen-Zhu et al., 2019b;a) neural networks work.

8.1. Convergence Guarantees for Optimization in Shallow Networks

Consider a shallow network with one hidden layer and an output layer. Let d be the dimensionality of the input data, k be the number of hidden neurons, and suppose the last layer has one neuron for regressing the label of the data. This network maps the data x ∈ Rd to a scalar output as:

x ↦ v⊤ σ(W x),

where W ∈ Rk×d is the weight matrix between the input and the hidden layer, v ∈ Rk is the weight vector of the output layer, and σ(.) is the activation function. Suppose the loss function for the data instance xi is the mean squared error between the output of the network and the target label yi:

L(W) = (1/(2n)) Σ_{i=1}^{n} (yi − v⊤ σ(W xi))²,

where n is the number of training instances.

Proposition 4 ((Soltanolkotabi et al., 2018, Theorem 1)). In the above-mentioned shallow network, suppose the activation function is a quadratic function, i.e., σ(z) = z². If k ≥ 2d and the weights of the output layer v contain at least d positive values and d negative values, then:

1. There are no spurious local minima, i.e., all local minima are global minima.

2. All saddle points have a direction of strictly negative curvature. In other words, at a saddle point of weights Ws for the loss function, there is a direction U ∈ Rk×d such that:

vec(U)⊤ ∇²L(Ws) vec(U) < 0,

where ∇²L(Ws) is the second-order derivative of the loss at the saddle point and vec(.) is the vectorization operator which vectorizes the matrix.

3. If d ≤ n ≤ cd², where c > 0 is a constant, the global optimum of the loss function L(W) is zero.

There are other forms of Proposition 4, with more relaxed assumptions and general forms of the activation function, which can be found in (Soltanolkotabi et al., 2018).

Item 1 in Proposition 4 states that (almost) all local minima of the neural network are good enough, equal to the global minimum. In other words, the situation in Fig. 8 occurs. Therefore, it does not matter much which solution is found based on different initializations of the network's weights; the network usually converges to the global minimum of the loss function even if it is non-convex (Feizi et al., 2017; Chizat & Bach, 2018; Du et al., 2018; 2019; Chen et al., 2019). Different local solutions usually yield similar performances (Choromanska et al., 2015). This happens because the neural network, in a sense, maps data to a reproducing kernel Hilbert space (Ghojogh et al., 2023b) and almost all local minima are global in that higher-dimensional space (Dauphin et al., 2014).

Figure 8. In a neural network, (almost) all local solutions are good enough, being the global solution!

Item 2 in Proposition 4 discusses that, in neural networks, there is always a way to escape the saddle points toward local minima by following some directions with negative
curvature. In fact, first-order optimization, including the gradient descent used in backpropagation, avoids saddle points for different initializations (Dauphin et al., 2014; Lee et al., 2019; Panageas et al., 2019; Chen et al., 2019).

Item 3 explains that any solution found in the network (which is the global solution according to Item 1) gives zero loss on the training data. It means that all local solutions can fit the training data perfectly. However, whether the network that works perfectly on the training data generalizes well to unseen test data is another concern, not addressed in this proposition.

8.2. Convergence Guarantees for Optimization in Deep Networks

There also exist convergence guarantees for optimization in deep neural networks (Allen-Zhu et al., 2019b;a).

Proposition 5 ((Allen-Zhu et al., 2019b, Theorems 1 and 2)). Consider a fully-connected l-layer neural network with the mean squared error loss function and the ReLU activation function. Without loss of generality², assume the data instances are normalized to have unit length and the last dimension of the data instances is 1/√2. Let d be the dimensionality of the data and δ be a lower bound on the Euclidean distance of every two points in the training dataset:

∥xi − xj∥2 ≥ δ, ∀i, j ∈ {1, . . . , n},

where n is the number of training data instances. Assume the weights of the network are randomly initialized.

• Consider a parameter m ≥ Ω(poly(n, l, δ−1) d), where Ω(.) denotes the lower-bound complexity and poly(.) is a polynomial function of its input arguments. Having the learning rate η = Θ(dδ / (poly(n, l) m)), gradient descent converges to a small loss value less than ϵ after

T = Θ(poly(n, l) log(ϵ−1) / δ²)    (59)

iterations, with high probability at least 1 − e^(−Ω(log²(m))).

• Consider a parameter m ≥ Ω(poly(n, l, δ−1) d / b), where b ∈ {1, . . . , n} is the mini-batch size. Having the learning rate η = Θ(bdδ / (poly(n, l) m log²(m))), mini-batch SGD converges to a small loss value less than ϵ after

T = Θ(poly(n, l) log(ϵ−1) log²(m) / (δ² b))    (60)

iterations, with high probability at least 1 − e^(−Ω(log²(m))).

² It is always possible to normalize the data and also add an auxiliary dimension with value 1/√2.

Proposition 5 explains that deep over-parameterized neural networks converge to a solution, with a sufficiently small loss value, in polynomial time.

8.3. Other Works on Convergence Guarantees for Optimization in Neural Networks

Note that convergence guarantees for optimization in neural networks using different activation functions have been discussed in the literature. For example, in addition to the above-mentioned references, convergence guarantees exist for networks with quadratic activation (Du & Lee, 2018), ReLU activation (Cao & Gu, 2020; Zhang et al., 2019; Sharifnassab et al., 2020), and leaky ReLU (Brutzkus et al., 2017).

Convergence analysis has also been done for other network structures, such as ResNet (Du et al., 2018; Shamir, 2018), and for other types of data, such as structured data (Li & Liang, 2018). Analysis of critical points, i.e., points at which the sign of the gradient of the loss function changes, has also been discussed for neural networks (Zhou & Liang, 2017; Nouiehed & Razaviyayn, 2022). Moreover, convergence analysis for binary classification (Liang et al., 2018), loss surface analysis with an algebraic geometry approach (Mehta et al., 2021), and error bounds on gradient descent in networks (Cao & Gu, 2019) exist in the literature.

Acknowledgement

Some of the materials in this tutorial paper have been covered by Prof. Ali Ghodsi's (Data Science Courses) and Benyamin Ghojogh's videos on YouTube. Moreover, some parts of this tutorial paper were inspired by the lectures of Prof. Kimon Fountoulakis at the University of Waterloo.

References

Allen-Zhu, Zeyuan, Li, Yuanzhi, and Liang, Yingyu. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32, 2019a.

Allen-Zhu, Zeyuan, Li, Yuanzhi, and Song, Zhao. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.

Armijo, Larry. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.

Bottou, Léon et al. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Brutzkus, Alon, Globerson, Amir, Malach, Eran, and Shalev-Shwartz, Shai. SGD learns over-parameterized
networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.

Cao, Yuan and Gu, Quanquan. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32, 2019.

Cao, Yuan and Gu, Quanquan. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3349–3356, 2020.

Chen, Yuxin, Chi, Yuejie, Fan, Jianqing, and Ma, Cong. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176:5–37, 2019.

Chizat, Lenaic and Bach, Francis. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.

Chong, Edwin KP and Zak, Stanislaw H. An Introduction to Optimization. John Wiley & Sons, 2004.

Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pp. 192–204. PMLR, 2015.

Curry, Haskell B. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3):258–261, 1944.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.

Du, Jiawei, Yan, Hanshu, Feng, Jiashi, Zhou, Joey Tianyi, Zhen, Liangli, Goh, Rick Siow Mong, and Tan, Vincent YF. Efficient sharpness-aware minimization for improved training of neural networks. In International Conference on Learning Representations (ICLR), 2022.

Du, Simon and Lee, Jason. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning, pp. 1329–1338. PMLR, 2018.

Du, Simon, Lee, Jason, Li, Haochuan, Wang, Liwei, and Zhai, Xiyu. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019.

Du, Simon S, Zhai, Xiyu, Poczos, Barnabas, and Singh, Aarti. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.

Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, and Tse, David. Porcupine neural networks: (Almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.

Foret, Pierre, Kleiner, Ariel, Mobahi, Hossein, and Neyshabur, Behnam. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations (ICLR), 2021.

Ghojogh, Benyamin, Nekoei, Hadi, Ghojogh, Aydin, Karray, Fakhri, and Crowley, Mark. Sampling algorithms, from survey sampling to Monte Carlo methods: Tutorial and literature review. arXiv preprint arXiv:2011.00901, 2020.

Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. KKT conditions, first-order and second-order optimization, and distributed optimization: Tutorial and survey. arXiv preprint arXiv:2110.01858, 2021.

Ghojogh, Benyamin, Crowley, Mark, Karray, Fakhri, and Ghodsi, Ali. Background on optimization. In Elements of Dimensionality Reduction and Manifold Learning, pp. 75–120. Springer, 2023a.

Ghojogh, Benyamin, Crowley, Mark, Karray, Fakhri, and Ghodsi, Ali. Background on kernels. Elements of Dimensionality Reduction and Manifold Learning, pp. 43–73, 2023b.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016.

Hadamard, Jacques. Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées, volume 33. Imprimerie nationale, 1908.

Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. Technical report, Department of Computer Science, University of Toronto, 2012.

Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kwon, Jungmin, Kim, Jeongseop, Park, Hyunseo, and Choi, In Kwon. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905–5914. PMLR, 2021.

Lee, Jason D, Panageas, Ioannis, Piliouras, Georgios, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176:311–337, 2019.

Lemaréchal, Claude. Cauchy and the gradient method. Doc Math Extra, 251(254):10, 2012.

Leung, Frank Hung-Fat, Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter Kwong-Shun. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.

Li, Hao, Xu, Zheng, Taylor, Gavin, Studer, Christoph, and Goldstein, Tom. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31, 2018.

Li, Yuanzhi and Liang, Yingyu. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in Neural Information Processing Systems, 31, 2018.

Liang, Shiyu, Sun, Ruoyu, Li, Yixuan, and Srikant, Rayadurgam. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pp. 2835–2843. PMLR, 2018.

Mehta, Dhagash, Chen, Tianran, Tang, Tingting, and Hauenstein, Jonathan D. The loss surface of deep linear networks viewed through the algebraic geometry lens. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5664–5680, 2021.

Montana, David J and Davis, Lawrence. Training feedforward neural networks using genetic algorithms. In IJCAI, volume 89, pp. 762–767, 1989.

Nouiehed, Maher and Razaviyayn, Meisam. Learning deep models: Critical points and local openness. INFORMS Journal on Optimization, 4(2):148–173, 2022.

Panageas, Ioannis, Piliouras, Georgios, and Wang, Xiao. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32, 2019.

Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Riedmiller, Martin and Braun, Heinrich. Rprop - a fast adaptive learning algorithm. In Proceedings of the International Symposium on Computer and Information Science VII, 1992.

Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Shamir, Ohad. Are ResNets provably better than linear predictors? Advances in Neural Information Processing Systems, 31, 2018.

Sharifnassab, Arsalan, Salehkaleybar, Saber, and Golestani, S Jamaloddin. Bounds on over-parameterization for guaranteed existence of descent paths in shallow ReLU networks. In International Conference on Learning Representations, 2020.

Soltanolkotabi, Mahdi, Javanmard, Adel, and Lee, Jason D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.

Tahmasebi, Behrooz, Soleymani, Ashkan, Bahri, Dara, Jegelka, Stefanie, and Jaillet, Patrick. A universal class of sharpness-aware minimization algorithms. In International Conference on Machine Learning (ICML), 2024.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

Wen, Kaiyue, Li, Zhiyuan, and Ma, Tengyu. Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. Advances in Neural Information Processing Systems, 36, 2024.

Wolfe, Philip. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.

Xie, Wanyun, Latorre, Fabian, Antonakopoulos, Kimon, Pethick, Thomas, and Cevher, Volkan. Improving SAM requires rethinking its optimization formulation. In International Conference on Machine Learning (ICML), 2024.

Zhang, Xiao, Yu, Yaodong, Wang, Lingxiao, and Gu, Quanquan. Learning one-hidden-layer ReLU networks via gradient descent. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1524–1534. PMLR, 2019.

Zhou, Yi and Liang, Yingbin. Critical points of neural networks: Analytical forms and landscape properties. arXiv preprint arXiv:1710.11205, 2017.
