Figure 1. The gradient descent steps for (a) a convex cost function and (b) a non-convex cost function. From left to right, the panels depict the optimization steps on the cost function, the cost value across the iterations of optimization, and the cost value over a large number of iterations.
∆w = −η∇f(w^(k)), i.e.,

w^(k+1) := w^(k) − η∇f(w^(k)),    (4)

where η > 0 is the step size, also called the learning rate in the data science literature. If the optimization problem is maximization rather than minimization, the step should be ∆w = η∇f(w^(k)) rather than Eq. (4); in that case, the method is called gradient ascent. The learning rate can be found by line-search, which is often used in optimization but not in deep learning. Line-search will be discussed in Section 2.3.
For a convex function, the series of solutions converges to the optimal solution while the function value decreases iteratively until the local minimum:

{w^(0), w^(1), w^(2), . . .} → w*,
f(w^(0)) ≥ f(w^(1)) ≥ f(w^(2)) ≥ · · · ≥ f(w*).

If the optimization problem is a convex problem, the solution is the global solution; otherwise, the solution is a local one. As Fig. 1 illustrates, gradient descent works relatively well even for non-convex cost functions; it might oscillate for a non-convex cost, but its overall pattern is decreasing.
There are several criteria for convergence of the iterations:

• Small norm of gradient:
∥∇f(w^(k+1))∥₂ ≤ ϵ,
where ϵ is a small positive number. The reason for this criterion is the first-order optimality condition, stating that at the local optimum, ∥∇f(w*)∥₂ = 0. If the function is not convex, this criterion has the risk of stopping at a saddle point.

• Small change of the cost function:
|f(w^(k+1)) − f(w^(k))| ≤ ϵ.

• Small change of the gradient of the function:
∥∇f(w^(k+1)) − ∇f(w^(k))∥₂ ≤ ϵ.

• Reaching the maximum desired number of iterations.

2.3. Line-search
It was explained that the step size of gradient descent requires knowledge of the Lipschitz constant for the smoothness of the gradient. However, the exact Lipschitz constant may not be known, especially as it is usually hard to compute. Alternatively, a suitable step size η can be found by a search, which is named line-search. The line-search procedure for gradient descent is summarized in the pseudocode below.
Initialize w^(0)
for iteration k = 0, 1, . . . do
    Initialize η := 1
    for iteration τ = 1, 2, . . . do
        Check the line-search condition, i.e., Eq. (6)
        if not satisfied then
            η ← (1/2) × η
        else
            w^(k+1) := w^(k) − η∇f(w^(k))
            break the loop
    Check the convergence criterion
    if converged then
        return w^(k+1)

Figure 2. Gradient descent (a) without and (b) with momentum. Each contour shows the same cost value in the optimization. As the figure shows, momentum reduces the oscillation of optimization.
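To make the line-search procedure concrete, the following is a minimal Python sketch of gradient descent with the step-size halving shown in the pseudocode above. Since Eq. (6) is not reproduced in this excerpt, an Armijo-type sufficient-decrease test is assumed as the line-search condition; the quadratic cost, the constant c, and all names are illustrative choices, not taken from the text.

```python
import numpy as np

def gradient_descent_line_search(f, grad, w0, c=1e-4, tol=1e-6, max_iter=1000):
    """Gradient descent where the step size eta is found by halving it
    until a sufficient-decrease (Armijo-type) condition holds."""
    w = w0.astype(float)
    for _ in range(max_iter):
        g = grad(w)
        if np.linalg.norm(g) <= tol:       # convergence criterion: small gradient norm
            break
        eta = 1.0                          # initialize eta := 1
        # halve eta until the sufficient-decrease condition is satisfied
        while f(w - eta * g) > f(w) - c * eta * (g @ g):
            eta *= 0.5
        w = w - eta * g                    # w(k+1) := w(k) - eta * grad f(w(k))
    return w

# toy convex cost: f(w) = 0.5 * w^T A w - b^T w, whose minimizer solves A w = b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda w: 0.5 * (w @ A @ w) - b @ w
grad = lambda w: A @ w - b
w_found = gradient_descent_line_search(f, grad, np.zeros(2))
print(w_found, np.linalg.solve(A, b))      # the two should roughly match
```

Halving the trial step, as in the pseudocode, is only one backtracking schedule; any shrinking factor in (0, 1) would serve the same purpose.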
3. Backpropagation
Consider a feed-forward neural network with four layers
depicted in Fig. 4. Note that, depending on whether the weights or the nodes are counted as the layers of a neural network, one might say the network of this figure has either three or four layers, respectively. Here, for the sake of explaining backpropagation, the nodes are considered to be the layers of the network.
Every neuron i in the neural network is depicted in Fig. 5. Let w_{ji} denote the weight connecting neuron i to neuron j. Let a_i and z_i be the output of neuron i before and after applying its activation function σ_i(.) : ℝ → ℝ, respectively:

a_i = Σ_{ℓ=1}^m w_{iℓ} z_ℓ,    (10)
z_i := σ_i(a_i).    (11)

Figure 6. Three neurons in three successive layers of a network.

Consider three neurons in three successive layers of a network, as illustrated in Fig. 6. Consider Eq. (10), i.e., a_i = Σ_ℓ w_{iℓ} z_ℓ, which sums over the neurons in layer ℓ. By the chain rule in derivatives, the gradient of the error e with respect to the weight between neurons ℓ and i is:

∂e/∂w_{iℓ} = (∂e/∂a_i) × (∂a_i/∂w_{iℓ}) =(a) δ_i × z_ℓ,    (12)

where (a) is because a_i = Σ_ℓ w_{iℓ} z_ℓ and we define:

δ_i := ∂e/∂a_i.    (13)

If layer i is the last layer, δ_i can be computed by the derivative of the error (loss function) with respect to the output. However, if i is one of the hidden layers, δ_i is computed by the chain rule as:

δ_i = ∂e/∂a_i = Σ_j (∂e/∂a_j) × (∂a_j/∂a_i) = Σ_j δ_j × (∂a_j/∂a_i).    (14)

The term ∂a_j/∂a_i is calculated by the chain rule as:

∂a_j/∂a_i = (∂a_j/∂z_i) × (∂z_i/∂a_i) =(a) w_{ji} σ′(a_i),    (15)

where (a) is because a_j = Σ_i w_{ji} z_i and z_i = σ(a_i), and σ′(.) denotes the derivative of the activation function. Putting Eq. (15) in Eq. (14) gives:

δ_i = σ′(a_i) Σ_j (δ_j w_{ji}).    (16)

Putting this equation in Eq. (12) gives:

∂e/∂w_{iℓ} = z_ℓ σ′(a_i) Σ_j (δ_j w_{ji}).    (17)
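As a concrete illustration of Eqs. (10)-(17), the following Python sketch runs one backpropagation pass for a small fully-connected network, computing the δ terms and the weight gradients layer by layer. The sigmoid activation, the squared-error loss, and all names are illustrative choices made here, not prescribed by the text.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop(x, y, weights):
    """One backpropagation pass for a fully-connected network.
    weights[r] has shape (n_out, n_in); error e = 0.5 * ||z_last - y||^2."""
    # forward pass: a_i = sum_l w_il * z_l (Eq. 10),  z_i = sigma(a_i) (Eq. 11)
    zs = [x]
    for W in weights:
        zs.append(sigmoid(W @ zs[-1]))
    # delta of the last layer: de/da_i = (z_i - y_i) * sigma'(a_i)
    z_last = zs[-1]
    delta = (z_last - y) * z_last * (1.0 - z_last)
    grads = [None] * len(weights)
    for r in reversed(range(len(weights))):
        # de/dw_il = delta_i * z_l   (Eq. 12)
        grads[r] = np.outer(delta, zs[r])
        if r > 0:
            z = zs[r]
            # delta_l = sigma'(a_l) * sum_i delta_i * w_il   (Eq. 16)
            delta = (weights[r].T @ delta) * z * (1.0 - z)
    return grads

rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
grads = backprop(rng.normal(size=3), rng.normal(size=2), weights)
print([g.shape for g in grads])   # gradient shapes match the weight shapes
```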
The resulting backpropagation procedure, which updates the weights layer by layer, can be summarized as the following pseudocode:

Initialize w^(0)
for iteration k = 0, 1, . . . do
    Initialize the learning rate η^(k)
    for layer r from the last layer to the first layer do
        for neuron i in layer r do
            for neuron ℓ in layer (r − 1) do
                if layer r is the last layer then
                    w_{iℓ}^(k+1) := w_{iℓ}^(k) − η^(k) z_ℓ (∂e/∂a_i)
                else
                    w_{iℓ}^(k+1) := w_{iℓ}^(k) − η^(k) z_ℓ σ′(a_i) Σ_j (δ_j w_{ji})

4. Stochastic Gradient Descent
4.1. Algorithm of Stochastic Gradient Descent
Assume there is a dataset of n data instances, {x_i ∈ ℝ^d}_{i=1}^n, and their labels {l_i ∈ ℝ}_{i=1}^n. Let the cost function f(.) be decomposed into a summation of n terms {f_i(w)}_{i=1}^n. In other words, the neural network has a loss value f_i(w) per every input data instance x_i. Therefore, the total loss is the average of the loss values over the n data instances and the optimization problem becomes:

minimize_w (1/n) Σ_{i=1}^n f_i(w).    (21)
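Although the per-iteration update rule of SGD is not reproduced in this excerpt, a minimal sketch under the standard formulation (sample one index i uniformly at random and step along ∇f_i(w)) is given below for the finite-sum objective of Eq. (21). The toy least-squares loss and all names are illustrative.

```python
import numpy as np

def sgd(grad_fi, n, w0, eta=0.01, n_iters=5000, seed=0):
    """Stochastic gradient descent on f(w) = (1/n) * sum_i f_i(w):
    at every iteration, sample one index i and step along grad f_i(w)."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float)
    for _ in range(n_iters):
        i = rng.integers(n)          # one randomly sampled data instance
        w -= eta * grad_fi(w, i)     # w(k+1) := w(k) - eta * grad f_i(w(k))
    return w

# toy per-sample loss: f_i(w) = 0.5 * (x_i^T w - l_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
l = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=100)
grad_fi = lambda w, i: (X[i] @ w - l[i]) * X[i]
print(sgd(grad_fi, n=100, w0=np.zeros(3)))   # roughly recovers [1, -2, 0.5]
```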
4.2. Convergence Analysis of Stochastic Gradient Descent
Proposition 1 (Convergence rate of gradient descent with full gradient). Consider a convex and differentiable function f(.), with domain D, whose gradient is L-smooth. Let f* be the minimum of the cost function and w* be the minimizer. Starting from the initial solution w^(0), after t iterations of the optimization algorithm, the convergence rate of gradient descent is:

f(w^(t+1)) − f* ≤ 2L∥w^(0) − w*∥₂² / (t + 1) = O(1/t).    (24)

Proposition 2 (Convergence rate of stochastic gradient descent). Consider a function f(w) = Σ_{i=1}^n f_i(w) which is bounded below, where each f_i is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth. Assume E[∥∇f_i(w^(k))∥₂² | w^(k)] ≤ β², where β is a constant. Depending on the step size, the convergence rate of SGD is:

f(w^(t+1)) − f* ≤ O(1 / log t)    if η^(τ) = 1/τ,    (25)
f(w^(t+1)) − f* ≤ O(log t / √t)    if η^(τ) = 1/√τ,    (26)
f(w^(t+1)) − f* ≤ O(1/t + η)    if η^(τ) = η,    (27)

where τ denotes the iteration index. If the functions f_i's are µ-strongly convex, then the convergence rate of SGD is:

f(w^(t+1)) − f* ≤ O(1/t)    if η^(τ) = 1/(µτ),    (28)
f(w^(t+1)) − f* ≤ O((1 − µ/L)^t + η)    if η^(τ) = η.    (29)

Eqs. (27) and (29) show that with a fixed step size η, SGD converges sublinearly for a non-convex function and exponentially for a strongly convex function in the initial iterations. However, in the late iterations, where t → ∞, it stagnates to a neighborhood O(η) around the optimal point and never reaches it. For example, for Eq. (27):

lim_{t→∞} f(w^(t+1)) − f* ≤ lim_{t→∞} O(1/t + η) = O(η)  ⟹  f(w^(t+1)) = f* + O(η).

This is while gradient descent has the convergence rate in Eq. (24), which converges to the solution in the late iterations, where t → ∞:

lim_{t→∞} f(w^(t+1)) − f* ≤ lim_{t→∞} O(1/t) = O(0) = 0  ⟹  f(w^(t+1)) = f*.

Therefore, SGD has less accuracy than gradient descent. The advantage of SGD over gradient descent is that every iteration of SGD is much faster than every iteration of gradient descent, because of fewer computations for the gradient. This per-iteration speedup shows off more when n is huge. In summary, SGD converges quickly, but to a solution of lower accuracy.
It is noteworthy that the full gradient is not available in SGD to use for checking convergence, as discussed before. One can use other criteria or merely check the norm of the gradient for the sampled point. SGD can be used with line-search methods and momentum, too.

5. Mini-Batch Stochastic Gradient Descent
Gradient descent uses the entire n data points and SGD uses one randomly sampled point at every iteration. For large datasets, gradient descent is very slow and intractable in every iteration, while SGD needs a significant number of iterations to roughly cover all data. Besides, SGD has low accuracy in convergence to the optimal solution. There can be a middle-case scenario where a batch of b randomly sampled points is used at every iteration. This method is named mini-batch SGD or the hybrid deterministic-stochastic gradient method. This batch-wise approach is well suited to large datasets.
Usually, before the start of optimization, the n data points are randomly divided into ⌊n/b⌋ batches of size b. This is equivalent to simple random sampling, i.e., sampling points into batches without replacement. Suppose the dataset is denoted by D (where |D| = n) and the i-th batch is B_i (where |B_i| = b). The batches are disjoint:

∪_{i=1}^{⌊n/b⌋} B_i = D,    (30)
B_i ∩ B_j = ∅,  ∀i, j ∈ {1, . . . , ⌊n/b⌋}, i ≠ j.    (31)

Another less-used approach for making batches is to sample points for a batch during optimization. This is equivalent to bootstrapping, i.e., sampling points into batches with replacement. In this case, the batches are not disjoint anymore and Eqs. (30) and (31) do not hold.
Definition 1 (Epoch). In mini-batch SGD, when all ⌊n/b⌋ batches of data have been used for optimization once, an epoch is completed. After completion of an epoch, the next epoch is started, and epochs are repeated until convergence of the optimization.
In mini-batch SGD, if the k-th iteration of optimization uses the k′-th batch, the update of the solution is done as:

w^(k+1) := w^(k) − η^(k) (1/b) Σ_{i∈B_{k′}} ∇f_i(w^(k)).    (32)

The scale factor 1/b is sometimes dropped for simplicity.
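As a concrete illustration of Eqs. (30)-(32) and Definition 1, the following Python sketch splits the data indices into ⌊n/b⌋ disjoint batches before optimization and then repeats epochs over them. The toy least-squares loss and all names are illustrative stand-ins, not from the text.

```python
import numpy as np

def minibatch_sgd(grad_fi, n, w0, b=10, eta=0.05, n_epochs=50, seed=0):
    """Mini-batch SGD: indices are shuffled into floor(n/b) disjoint batches
    before optimization (Eqs. 30-31); each epoch visits every batch once."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    batches = [perm[i * b:(i + 1) * b] for i in range(n // b)]
    w = w0.astype(float)
    for _ in range(n_epochs):                 # one epoch = one pass over all batches
        for batch in batches:
            g = np.mean([grad_fi(w, i) for i in batch], axis=0)
            w -= eta * g                      # Eq. (32): step along the batch-averaged gradient
    return w

# toy per-sample loss: f_i(w) = 0.5 * (x_i^T w - l_i)^2
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
l = X @ np.array([1.0, -2.0, 0.5])
grad_fi = lambda w, i: (X[i] @ w - l[i]) * X[i]
print(minibatch_sgd(grad_fi, n=100, w0=np.zeros(3)))
```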
Mini-batch SGD is used extensively in deep learning and neural networks (Bottou et al., 1998; Goodfellow et al., 2016). Because of dividing the data into batches, mini-batch SGD can be solved on parallel servers as a distributed optimization method, making it suitable for optimization in deep learning using GPU cores. Note that the literature and code bases of deep learning usually refer to mini-batch SGD as SGD for simplicity and brevity; therefore, do not confuse it with the one-sample SGD discussed in Section 4.
Proposition 3 (Convergence rates for mini-batch SGD). Consider a function f(w) = Σ_{i=1}^n f_i(w) which is bounded below, where each f_i is differentiable. Let the domain of the function f(.) be D and its gradient be L-smooth, and assume η^(k) = η is fixed. The batch-wise gradient is an approximation to the full gradient with some error e_t for the t-th iteration:

(1/b) Σ_{i∈B_{t′}} ∇f_i(w^(t)) = ∇f(w^(t)) + e_t.    (33)

The convergence rates of mini-batch SGD for non-convex and convex functions are:

O(1/t + ∥e_t∥₂²),    (34)

where t denotes the iteration index. If the functions f_i's are µ-strongly convex, then the convergence rate of mini-batch SGD is:

O((1 − µ/L)^t + ∥e_t∥₂²).    (35)

By increasing the batch size, (1/b) Σ_{i∈B_{t′}} ∇f_i(w^(t)) gets closer to ∇f(w^(t)), so the error e_t in Eq. (33) is reduced. Therefore, the convergence rate of mini-batch SGD, i.e., Eq. (34), gets closer to that of gradient descent, i.e., Eq. (24), if the batch size increases.
If the batches are sampled without replacement (i.e., sampling batches by simple random sampling before the start of optimization) or with replacement (i.e., bootstrapping during optimization), the expected error is (Ghojogh et al., 2020, Proposition 3):

E[∥e_t∥₂²] = (1 − b/n) (σ²/b),    (36)
E[∥e_t∥₂²] = σ²/b,    (37)

respectively, where σ² is the variance of the whole dataset. According to Eqs. (36) and (37), the accuracy of SGD by sampling without and with replacement increases by b → n (increasing the batch size toward the size of the dataset) and b → ∞ (increasing the batch size to infinity), respectively. However, this increase makes every iteration slower, so there is a trade-off between accuracy and speed.

6. Adaptive Learning Rate
Recall from Section 2.3 that a suitable learning rate can be found by line-search. However, the learning rate in deep learning is usually set to an initial value and then adapted across iterations. The learning rate can be adapted in stochastic gradient descent optimization methods. The three most well-known methods for adapting the learning rate are AdaGrad, RMSProp, and Adam. These adaptive learning rate methods are introduced in the following.

6.1. Adaptive Gradient (AdaGrad)
The Adaptive Gradient (AdaGrad) method (Duchi et al., 2011) updates the solution iteratively as:

w^(k+1) := w^(k) − η^(k) G^{−1} ∇f_i(w^(k)),    (38)

where G is a (d × d) diagonal matrix whose (j, j)-th element is:

G(j, j) := √(ε + Σ_{τ=0}^k (∇_j f_{i_τ}(w^(τ)))²),    (39)

where ε ≥ 0 is for stability (making G full rank), i_τ is the randomly sampled point (from {1, . . . , n}) at iteration τ, and ∇_j f_{i_τ}(.) is the j-th dimension of the derivative of f_{i_τ}(.); note that ∇f_{i_τ}(.) is d-dimensional. Putting Eq. (39) in Eq. (38) simplifies AdaGrad to:

w_j^(k+1) := w_j^(k) − (η^(k) / √(ε + Σ_{τ=0}^k (∇_j f_{i_τ}(w^(τ)))²)) ∇_j f_i(w^(k)).    (40)

AdaGrad keeps a history of the sampled points and takes derivatives of them to use. During the iterations so far, if a dimension has changed significantly, AdaGrad dampens the learning rate for that dimension (see the inverse in Eq. (38)); hence, it gives more weight to changing the dimensions which have not changed noticeably. In this way, all dimensions have a fair chance to change.

6.2. Root Mean Square Propagation (RMSProp)
Root Mean Square Propagation (RMSProp) was first proposed by Hinton in (Tieleman & Hinton, 2012), which was an unpublished slide deck for academic lectures at the University of Toronto. It is an improved version of Rprop (resilient backpropagation) (Riedmiller & Braun, 1992), which uses the sign of the gradient in optimization. Inspired by momentum in Eq. (8):

(∆w)^(k) := α(∆w)^(k−1) − η^(k) ∇f(w^(k)),

it updates a scalar variable v as (Hinton et al., 2012):

v^(k+1) := γ v^(k) + (1 − γ) ∥∇f_i(w^(k))∥₂²,    (41)
w^(k+1) := w^(k) − (η^(k) / √(ε + v^(k+1))) ∇f_i(w^(k)),    (42)

where ε ≥ 0 is for stability, so as not to have division by zero. Comparing Eqs. (40) and (42) shows that RMSProp has a form similar to AdaGrad.
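To make the comparison between Eqs. (40) and (42) concrete, the following sketch contrasts the two accumulators on the same stream of stochastic gradients: AdaGrad sums all past squared partial derivatives per dimension, while RMSProp keeps an exponentially decaying average of the squared gradient norm, as written in Eq. (41). The toy loss, the hyperparameter values, and all names are illustrative.

```python
import numpy as np

def adagrad_step(w, g, hist_sq, eta=0.1, eps=1e-8):
    """AdaGrad (Eqs. 39-40): per-dimension sum of squared partial derivatives."""
    hist_sq += g ** 2
    return w - eta * g / np.sqrt(eps + hist_sq), hist_sq

def rmsprop_step(w, g, v, eta=0.1, gamma=0.9, eps=1e-8):
    """RMSProp (Eqs. 41-42) with the scalar accumulator v used in the text."""
    v = gamma * v + (1.0 - gamma) * (g @ g)   # decaying average of the squared gradient norm
    return w - eta * g / np.sqrt(eps + v), v

# illustrative stochastic gradients of f_i(w) = 0.5 * (x_i^T w - l_i)^2
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
l = X @ np.array([0.5, 1.5, -1.0])
w_a, w_r = np.zeros(3), np.zeros(3)
hist, v = np.zeros(3), 0.0
for _ in range(500):
    i = rng.integers(50)
    grad = lambda w: (X[i] @ w - l[i]) * X[i]
    w_a, hist = adagrad_step(w_a, grad(w_a), hist)
    w_r, v = rmsprop_step(w_r, grad(w_r), v)
print(w_a, w_r)
```

Note that common library implementations keep a per-dimension decaying average for RMSProp rather than the single scalar v used here; the scalar form follows Eq. (41) as written in this text.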
6.3. Adam Optimizer
The Adam (Adaptive Moment Estimation) optimizer (Kingma & Ba, 2014) improves over RMSProp by adding a momentum term. It updates the vector m ∈ ℝ^d and the scalar v as:

m^(k+1) := γ₁ m^(k) + (1 − γ₁) ∇f_i(w^(k)),    (43)
v^(k+1) := γ₂ v^(k) + (1 − γ₂) ∥∇f_i(w^(k))∥₂²,    (44)

where γ₁, γ₂ ∈ [0, 1]. It normalizes these variables as:

m̂^(k+1) := m^(k+1) / (1 − γ₁^k),    (45)
v̂^(k+1) := v^(k+1) / (1 − γ₂^k).    (46)

Then, it updates the solution as:

w^(k+1) := w^(k) − (η^(k) / √(ε + v̂^(k+1))) m̂^(k+1),    (47)

which is stochastic gradient descent with momentum while using RMSProp. The Adam optimizer is one of the most widely used optimizers for neural networks. In summary, most deep learning libraries provide SGD (i.e., mini-batch SGD) and Adam as optimizer options.

7. Sharpness-Aware Minimization (SAM)
Although backpropagation can find any local minimum of the loss function, not all local minima are equally good. It has been empirically observed (Foret et al., 2021) that it is better for the converged local minimum to be smooth rather than sharp, meaning that the neighborhood of the found local minimum is better to be almost flat rather than being a single sharp local minimum (see Fig. 7). It has been observed that there may be a connection between the smoothness of the found local minimum and the generalization of the neural network to unseen test data (Foret et al., 2021). However, some works have doubted this observation and state that the smoothness of the local minimum is not the only factor for generalization (Wen et al., 2024).

Figure 7. Visual comparison of (a) a sharp local minimum and (b) an almost flat (smooth) local minimum. The credit of the image is for (Foret et al., 2021). See (Li et al., 2018) for visualization of loss functions in neural networks.

SAM poses a zero-sum min-max loss function for finding a flat (smooth) solution while minimizing the loss function of the neural network (Foret et al., 2021):

minimize_w L_SAM(w) + λ∥w∥₂²,    (48)
L_SAM(w) := maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(w + ϵ),    (49)

where w is the weights of the neural network, L(.) is the loss function of the neural network, L_SAM(.) is the SAM loss function, λ ≥ 0 is the regularization parameter, ρ ≥ 0 is a hyperparameter, and p ∈ [1, ∞], where p = 2 is recommended (Foret et al., 2021). It is possible to drop the regularization term and write the loss as a min-max optimization problem:

minimize_w maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(w + ϵ).    (50)

Eq. (50) first finds the maximum loss in a neighborhood of radius ϵ around the solution w and then minimizes that maximum. This forces the loss in the whole local neighborhood of the solution to be small and hence flat or smooth.
Note that Eq. (49) can be restated as (Tahmasebi et al., 2024):

L_SAM(w) := maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(w + ϵ)
= L(w) − L(w) + maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(w + ϵ)
= L(w) + maximize_{ϵ: ∥ϵ∥_p ≤ ρ} (L(w + ϵ) − L(w)),    (51)

where the first term is the empirical loss and the second term is the sharpness.
Consider the inner maximization in Eq. (50):

ϵ*(w) := arg max_{∥ϵ∥_p ≤ ρ} L(w + ϵ)
≈(a) arg max_{∥ϵ∥_p ≤ ρ} L(w) + ϵ^⊤ ∇_w L(w)
=(b) arg max_{∥ϵ∥_p ≤ ρ} ϵ^⊤ ∇_w L(w),

where ∇_w denotes the derivative with respect to w, (a) is because of the first-order Taylor series expansion of L(w + ϵ) around 0, and (b) is because L(w) is not a function of ϵ. This maximization is a classical dual norm problem whose solution is (Foret et al., 2021):

ϵ*(w) = ρ sign(∇_w L(w)) |∇_w L(w)|^{q−1} / (∥∇_w L(w)∥_q^q)^{1/p},    (52)

where (1/p) + (1/q) = 1. With p = 2, it simplifies to:

ϵ*(w) = ρ ∇_w L(w) / ∥∇_w L(w)∥₂.    (53)

It is noteworthy that there are discussions in the literature that Eq. (53) is an upper bound, and not an exact bound, on the classification error. We refer interested readers to (Xie et al., 2024) for those discussions.
Let us put the found ϵ, i.e., Eq. (53), in Eq. (50) and calculate the gradient of the loss function:

∇_w L(w + ϵ*(w)) =(a) (d(w + ϵ*(w))/dw) ∇_w L(w)|_{w+ϵ*(w)}
= ∇_w L(w)|_{w+ϵ*(w)} + (dϵ*(w)/dw) ∇_w L(w)|_{w+ϵ*(w)}
≈(b) ∇_w L(w)|_{w+ϵ*(w)},    (54)

where (a) is because of the chain rule in derivatives and (b) is because the second term contains a multiplication of two derivatives, which is small compared to the first term and hence can be ignored. Eq. (54) is the gradient of the SAM loss function, and it can be used in backpropagation. This gradient can be numerically computed in deep learning libraries such as PyTorch (Paszke et al., 2019).

7.2. Efficient SAM
Calculation of the gradient in vanilla SAM is time consuming and not efficient. Therefore, Efficient SAM (ESAM) (Du et al., 2022) was proposed for improving the computational efficiency of SAM. ESAM has two methodologies, i.e., Stochastic Weight Perturbation (SWP) and Sharpness-sensitive Data Selection (SDS).

7.2.1. Stochastic Weight Perturbation (SWP)
SWP (Du et al., 2022) efficiently approximates ϵ*(w) using a random subset of weights rather than all weights of the network. Let n denote the number of weights in the neural network and the weights of the network be {w₁, . . . , w_n}. In each iteration of backpropagation, SWP makes a random binary gradient mask m = [m₁, . . . , m_n]^⊤, where m_i ~ i.i.d. Bernoulli(β) and β is the parameter of the Bernoulli distribution. The solution of the inner maximization, i.e., ϵ*(w), is approximated by (Du et al., 2022):

ϵ̂(w) ≈ (1/β) m ⊙ ϵ*(w),    (55)

which is used in the gradient of the loss in Eq. (54). In other words, SWP does not use the entire set of weights and, instead, uses a subset of weights for ϵ*(w). This simplification does not affect the expectation of ϵ*(w):

E[ϵ̂(w)_i] =(55) (1/β) E[m_i ϵ*(w)_i] =(a) (1/β) E[m_i] E[ϵ*(w)_i] =(b) (1/β) β E[ϵ*(w)_i] = E[ϵ*(w)_i],

where ϵ̂(w)_i denotes the i-th element of ϵ̂(w), (a) is because m_i and ϵ*(w)_i are independent, and (b) is because the expectation of the Bernoulli distribution of m_i is β.

7.2.2. Sharpness-sensitive Data Selection (SDS)
SDS (Du et al., 2022) efficiently approximates ϵ*(w) using a subset of the mini-batch rather than the entire mini-batch. It splits the mini-batch B as (Du et al., 2022):

B⁺ := {(x_i, y_i) ∈ B | L(w + ϵ*(w)) − L(w) > α},
B⁻ := {(x_i, y_i) ∈ B | L(w + ϵ*(w)) − L(w) < α},    (56)

where B⁺ is the sharpness-sensitive subset of the mini-batch and α > 0 is a hyperparameter. SDS uses the sharpness-sensitive B⁺ rather than the entire mini-batch B in the loss function of SAM. Therefore, the calculation is faster with fewer samples in the mini-batch.

7.3. Adaptive SAM
SAM uses a fixed radius ρ when considering ∥ϵ∥_p ≤ ρ in its optimization. Therefore, it is sensitive to re-scaling of the weights of the neural network by, for example, a scaling matrix A (Kwon et al., 2021):

maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(w + ϵ) ≠ maximize_{ϵ: ∥ϵ∥_p ≤ ρ} L(Aw + ϵ).

In other words, neural networks with different weight scales have different sharpness values.
Adaptive SAM (ASAM) (Kwon et al., 2021) makes SAM robust to weight re-scaling. We define a normalization operator of w, denoted by T_w^{−1}, where:

T_{Aw}^{−1} A = T_w^{−1},    (57)
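Putting Eqs. (53) and (54) together, one basic SAM update with p = 2 can be sketched as follows: compute the gradient at w, form the perturbation ϵ*(w), re-evaluate the gradient at w + ϵ*(w), and take an ordinary descent step with it. The toy loss and all parameter values below are illustrative stand-ins, not the setup of (Foret et al., 2021) or of ESAM/ASAM.

```python
import numpy as np

def sam_step(w, loss_grad, eta=0.1, rho=0.05):
    """One sharpness-aware minimization step with p = 2 (Eqs. 53-54)."""
    g = loss_grad(w)
    eps_star = rho * g / (np.linalg.norm(g) + 1e-12)   # perturbation of Eq. (53)
    g_sam = loss_grad(w + eps_star)                    # gradient at the perturbed point, Eq. (54)
    return w - eta * g_sam                             # ordinary descent step with the SAM gradient

# toy non-convex loss L(w) = 0.25 * ||w||^4 - ||w||^2, with gradient (||w||^2) w - 2 w
loss_grad = lambda w: (w @ w) * w - 2.0 * w
w = np.array([2.0, -1.5])
for _ in range(100):
    w = sam_step(w, loss_grad)
print(w, np.linalg.norm(w))   # the norm approaches sqrt(2) for this toy loss
```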
curvature. In fact, first-order optimization, including the gradient descent used in backpropagation, avoids saddle points for different initializations (Dauphin et al., 2014; Lee et al., 2019; Panageas et al., 2019; Chen et al., 2019).
Item 3 explains that any solution found in the network (which is the global solution according to item 1) is the zero loss for the training data. It means that all local solutions can fit the training data perfectly. However, whether the network that works perfectly on the training data generalizes well to unseen test data is another concern, not addressed in this proposition.

8.2. Convergence Guarantees for Optimization in Deep Networks
There also exist convergence guarantees for optimization in deep neural networks (Allen-Zhu et al., 2019b;a).
Proposition 5 ((Allen-Zhu et al., 2019b, Theorems 1 and 2)). Consider a fully-connected l-layer neural network with the mean squared error loss function and the ReLU activation function. Without loss of generality², assume the data instances are normalized to have unit length and the last dimension of the data instances is 1/√2. Let d be the dimensionality of the data and δ be a lower bound on the Euclidean distance of every two points in the training dataset:

∥x_i − x_j∥₂ ≥ δ,  ∀i, j ∈ {1, . . . , n},

where n is the number of training data instances. Assume the weights of the network are randomly initialized.

• Consider a parameter m ≥ Ω(poly(n, l, δ⁻¹) d), where Ω(.) denotes the lower-bound complexity and poly(.) is a polynomial function of its input arguments. Having the learning rate η = Θ(dδ / (poly(n, l) m)), gradient descent converges to a small loss value less than ϵ after

T = Θ(poly(n, l) log(ϵ⁻¹) / δ²)    (59)

iterations, with high probability at least 1 − e^{−Ω(log²(m))}.

• Consider a parameter m ≥ Ω(poly(n, l, δ⁻¹) d / b), where b ∈ {1, . . . , n} is the mini-batch size. Having the learning rate η = Θ(b dδ / (poly(n, l) m log²(m))), mini-batch SGD converges to a small loss value less than ϵ after

T = Θ(poly(n, l) log(ϵ⁻¹) log²(m) / (δ² b))    (60)

iterations, with high probability at least 1 − e^{−Ω(log²(m))}.

²It is always possible to normalize the data and also add an auxiliary dimension with value 1/√2.

Proposition 5 explains that deep over-parameterized neural networks converge to the solution, with a sufficiently small loss value, in polynomial time.

8.3. Other Works on Convergence Guarantees for Optimization in Neural Networks
Note that convergence guarantees for optimization in neural networks using different activation functions have been discussed in the literature. For example, in addition to the above-mentioned references, convergence guarantees exist for networks with quadratic activation (Du & Lee, 2018), ReLU activation (Cao & Gu, 2020; Zhang et al., 2019; Sharifnassab et al., 2020), and leaky ReLU activation (Brutzkus et al., 2017).
Convergence analyses have also been done for other network structures, such as ResNet (Du et al., 2018; Shamir, 2018), and for other types of data, such as structured data (Li & Liang, 2018). Analysis of critical points, i.e., points at which the sign of the gradient of the loss function changes, has also been carried out for neural networks (Zhou & Liang, 2017; Nouiehed & Razaviyayn, 2022). Moreover, convergence analysis for binary classification (Liang et al., 2018), loss surface analysis with an algebraic geometry approach (Mehta et al., 2021), and error bounds on gradient descent in networks (Cao & Gu, 2019) exist in the literature.

Acknowledgement
Some of the materials in this tutorial paper have been covered by Prof. Ali Ghodsi's (Data Science Courses) and Benyamin Ghojogh's videos on YouTube. Moreover, some parts of this tutorial paper were inspired by the lectures of Prof. Kimon Fountoulakis at the University of Waterloo.

References
Allen-Zhu, Zeyuan, Li, Yuanzhi, and Liang, Yingyu. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in neural information processing systems, 32, 2019a.
Allen-Zhu, Zeyuan, Li, Yuanzhi, and Song, Zhao. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pp. 242–252. PMLR, 2019b.
Armijo, Larry. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16(1):1–3, 1966.
Bottou, Léon et al. Online learning and stochastic approximations. On-line learning in neural networks, 17(9):142, 1998.
Brutzkus, Alon, Globerson, Amir, Malach, Eran, and Shalev-Shwartz, Shai. SGD learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017.
Cao, Yuan and Gu, Quanquan. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in neural information processing systems, 32, 2019.
Cao, Yuan and Gu, Quanquan. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 3349–3356, 2020.
Chen, Yuxin, Chi, Yuejie, Fan, Jianqing, and Ma, Cong. Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming, 176:5–37, 2019.
Chizat, Lenaic and Bach, Francis. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
Chong, Edwin KP and Zak, Stanislaw H. An introduction to optimization. John Wiley & Sons, 2004.
Choromanska, Anna, Henaff, Mikael, Mathieu, Michael, Arous, Gérard Ben, and LeCun, Yann. The loss surfaces of multilayer networks. In Artificial intelligence and statistics, pp. 192–204. PMLR, 2015.
Curry, Haskell B. The method of steepest descent for non-linear minimization problems. Quarterly of Applied Mathematics, 2(3):258–261, 1944.
Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in neural information processing systems, 27, 2014.
Du, Jiawei, Yan, Hanshu, Feng, Jiashi, Zhou, Joey Tianyi, Zhen, Liangli, Goh, Rick Siow Mong, and Tan, Vincent YF. Efficient sharpness-aware minimization for improved training of neural networks. In International Conference on Learning Representations (ICLR), 2022.
Du, Simon and Lee, Jason. On the power of over-parametrization in neural networks with quadratic activation. In International conference on machine learning, pp. 1329–1338. PMLR, 2018.
Du, Simon, Lee, Jason, Li, Haochuan, Wang, Liwei, and Zhai, Xiyu. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
Du, Simon S, Zhai, Xiyu, Poczos, Barnabas, and Singh, Aarti. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
Feizi, Soheil, Javadi, Hamid, Zhang, Jesse, and Tse, David. Porcupine neural networks: (almost) all local optima are global. arXiv preprint arXiv:1710.02196, 2017.
Foret, Pierre, Kleiner, Ariel, Mobahi, Hossein, and Neyshabur, Behnam. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations (ICLR), 2021.
Ghojogh, Benyamin, Nekoei, Hadi, Ghojogh, Aydin, Karray, Fakhri, and Crowley, Mark. Sampling algorithms, from survey sampling to Monte Carlo methods: Tutorial and literature review. arXiv preprint arXiv:2011.00901, 2020.
Ghojogh, Benyamin, Ghodsi, Ali, Karray, Fakhri, and Crowley, Mark. KKT conditions, first-order and second-order optimization, and distributed optimization: Tutorial and survey. arXiv preprint arXiv:2110.01858, 2021.
Ghojogh, Benyamin, Crowley, Mark, Karray, Fakhri, and Ghodsi, Ali. Background on optimization. In Elements of Dimensionality Reduction and Manifold Learning, pp. 75–120. Springer, 2023a.
Ghojogh, Benyamin, Crowley, Mark, Karray, Fakhri, and Ghodsi, Ali. Background on kernels. Elements of Dimensionality Reduction and Manifold Learning, pp. 43–73, 2023b.
Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep learning. MIT press, 2016.
Hadamard, Jacques. Mémoire sur le problème d'analyse relatif à l'équilibre des plaques élastiques encastrées, volume 33. Imprimerie nationale, 1908.
Hinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent. Technical report, Department of Computer Science, University of Toronto, 2012.
Hinton, Geoffrey E and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kwon, Jungmin, Kim, Jeongseop, Park, Hyunseo, and Choi, In Kwon. ASAM: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. In International Conference on Machine Learning, pp. 5905–5914. PMLR, 2021.
Lee, Jason D, Panageas, Ioannis, Piliouras, Georgios, Simchowitz, Max, Jordan, Michael I, and Recht, Benjamin. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176:311–337, 2019.
Lemaréchal, Claude. Cauchy and the gradient method. Doc Math Extra, 251(254):10, 2012.
Leung, Frank Hung-Fat, Lam, Hak-Keung, Ling, Sai-Ho, and Tam, Peter Kwong-Shun. Tuning of the structure and parameters of a neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, 2003.
Li, Hao, Xu, Zheng, Taylor, Gavin, Studer, Christoph, and Goldstein, Tom. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
Li, Yuanzhi and Liang, Yingyu. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in neural information processing systems, 31, 2018.
Liang, Shiyu, Sun, Ruoyu, Li, Yixuan, and Srikant, Rayadurgam. Understanding the loss surface of neural networks for binary classification. In International Conference on Machine Learning, pp. 2835–2843. PMLR, 2018.
Mehta, Dhagash, Chen, Tianran, Tang, Tingting, and Hauenstein, Jonathan D. The loss surface of deep linear networks viewed through the algebraic geometry lens. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(9):5664–5680, 2021.
Montana, David J and Davis, Lawrence. Training feedforward neural networks using genetic algorithms. In IJCAI, volume 89, pp. 762–767, 1989.
Nouiehed, Maher and Razaviyayn, Meisam. Learning deep models: Critical points and local openness. INFORMS Journal on Optimization, 4(2):148–173, 2022.
Panageas, Ioannis, Piliouras, Georgios, and Wang, Xiao. First-order methods almost always avoid saddle points: The case of vanishing step-sizes. Advances in Neural Information Processing Systems, 32, 2019.
Paszke, Adam, Gross, Sam, Massa, Francisco, Lerer, Adam, Bradbury, James, Chanan, Gregory, Killeen, Trevor, Lin, Zeming, Gimelshein, Natalia, Antiga, Luca, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019.
Riedmiller, Martin and Braun, Heinrich. Rprop - a fast adaptive learning algorithm. In Proceedings of the International Symposium on Computer and Information Science VII, 1992.
Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
Rumelhart, David E, Hinton, Geoffrey E, and Williams, Ronald J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.
Shamir, Ohad. Are ResNets provably better than linear predictors? Advances in neural information processing systems, 31, 2018.
Sharifnassab, Arsalan, Salehkaleybar, Saber, and Golestani, S Jamaloddin. Bounds on over-parameterization for guaranteed existence of descent paths in shallow ReLU networks. In International conference on learning representations, 2020.
Soltanolkotabi, Mahdi, Javanmard, Adel, and Lee, Jason D. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. IEEE Transactions on Information Theory, 65(2):742–769, 2018.
Tahmasebi, Behrooz, Soleymani, Ashkan, Bahri, Dara, Jegelka, Stefanie, and Jaillet, Patrick. A universal class of sharpness-aware minimization algorithms. In International Conference on Machine Learning (ICML), 2024.
Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5 - rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Wen, Kaiyue, Li, Zhiyuan, and Ma, Tengyu. Sharpness minimization algorithms do not only minimize sharpness to achieve better generalization. Advances in Neural Information Processing Systems, 36, 2024.
Wolfe, Philip. Convergence conditions for ascent methods. SIAM Review, 11(2):226–235, 1969.
Xie, Wanyun, Latorre, Fabian, Antonakopoulos, Kimon, Pethick, Thomas, and Cevher, Volkan. Improving SAM requires rethinking its optimization formulation. In International Conference on Machine Learning (ICML), 2024.