Notes ch6
STOCHASTIC GRADIENT DESCENT
1 Problem formulation
This chapter considers the following stochastic optimization problem
\[ \min_{x \in \mathbb{R}^d} \; f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[F(x; \xi)], \tag{1} \]
where ξ ∼ D denotes the random data sample and D denotes the data distribution. Since D is typically unknown in machine learning, the closed form of f(x) is also unknown.
• Let $\mathcal{F}_k = \{x_k, \xi_{k-1}, x_{k-1}, \cdots, \xi_0\}$ be the filtration containing all historical variables at and before iteration k. Note that $\xi_k$ does not belong to $\mathcal{F}_k$.
2 Stochastic gradient descent
At each iteration k, SGD performs the update
\[ x_{k+1} = x_k - \gamma \nabla F(x_k; \xi_k), \tag{2} \]
where γ is the learning rate, and ξ_k ∼ D is a random data sample drawn at iteration k. Since ξ_k is a random variable for any k = 0, 1, · · · , each variable x_k is also a random variable for k = 1, 2, · · · .
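In code, one SGD iteration is a single line. Below is a minimal Python sketch of the loop, assuming a user-supplied stoch_grad(x, xi) returning ∇F(x; ξ) and a sampler draw_sample() standing in for ξ ∼ D (both hypothetical helpers, not defined in these notes):

```python
import numpy as np

def sgd(x0, stoch_grad, draw_sample, gamma=0.1, num_iters=1000):
    """Plain SGD: x_{k+1} = x_k - gamma * grad F(x_k; xi_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        xi = draw_sample()                  # xi_k ~ D, a fresh sample each iteration
        x = x - gamma * stoch_grad(x, xi)   # stochastic gradient step
    return x
```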
3 Convergence analysis
To facilitate convergence analysis, we introduce the following assumption:

Assumption 3.1. Given the filtration $\mathcal{F}_k$, we assume
\[ \mathbb{E}[\nabla F(x_k; \xi_k) \mid \mathcal{F}_k] = \nabla f(x_k), \tag{3} \]
\[ \mathbb{E}[\|\nabla F(x_k; \xi_k) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \le \sigma^2. \tag{4} \]

The above assumption indicates that, conditioned on the filtration $\mathcal{F}_k$, the stochastic gradient ∇F(x_k; ξ_k) is an unbiased estimate of ∇f(x_k), and its variance is bounded by σ².
Under the above assumption, it is easy to verify that
\[
\begin{aligned}
\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k]
&= \mathbb{E}[\|\nabla F(x_k;\xi_k) - \nabla f(x_k) + \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \\
&= \|\nabla f(x_k)\|^2 + \mathbb{E}[\|\nabla F(x_k;\xi_k) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \\
&\le \|\nabla f(x_k)\|^2 + \sigma^2,
\end{aligned} \tag{5}
\]
where the second equality holds because the cross term vanishes due to (3), and the last inequality holds due to (4).
3.1 Smooth nonconvex problem

Theorem 3.2. Suppose f(x) is L-smooth and Assumption 3.1 holds. If γ ≤ 1/L, SGD will converge at the following rate
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2\Delta_0}{\gamma(K+1)} + \gamma L\sigma^2, \tag{6} \]
where $\Delta_0 = f(x_0) - f^\star$. If we further choose $\gamma = \Big(\big(\tfrac{2\Delta_0}{(K+1)L\sigma^2}\big)^{-\frac{1}{2}} + L\Big)^{-1}$, SGD converges as
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \sqrt{\frac{8L\Delta_0\sigma^2}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{7} \]
Remark. If σ² = 0, the stochastic gradient reduces to the true gradient, and hence SGD reduces to GD. Substituting σ² = 0 into the SGD rate (7), we recover the O(L/K) rate of GD. In other words, our convergence rate for SGD is consistent with that of GD.
Proof. Since f(x) is L-smooth, we have
\[
\begin{aligned}
\mathbb{E}[f(x_{k+1}) \mid \mathcal{F}_k]
&\le f(x_k) + \mathbb{E}[\langle \nabla f(x_k), x_{k+1} - x_k \rangle \mid \mathcal{F}_k] + \frac{L}{2}\,\mathbb{E}[\|x_{k+1} - x_k\|^2 \mid \mathcal{F}_k] \\
&= f(x_k) - \gamma\,\mathbb{E}[\langle \nabla f(x_k), \nabla F(x_k;\xi_k) \rangle \mid \mathcal{F}_k] + \frac{L\gamma^2}{2}\,\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{\le} f(x_k) - \gamma\Big(1 - \frac{L\gamma}{2}\Big)\|\nabla f(x_k)\|^2 + \frac{L\gamma^2\sigma^2}{2} \\
&\overset{(b)}{\le} f(x_k) - \frac{\gamma}{2}\|\nabla f(x_k)\|^2 + \frac{L\gamma^2\sigma^2}{2},
\end{aligned} \tag{8}
\]
where inequality (a) holds due to Assumption 3.1 and (5), and inequality (b) holds if γ ≤ 1/L. Taking the total expectation over the filtration $\mathcal{F}_k$, we have
\[ \mathbb{E}[f(x_{k+1})] \le \mathbb{E}[f(x_k)] - \frac{\gamma}{2}\,\mathbb{E}[\|\nabla f(x_k)\|^2] + \frac{L\gamma^2\sigma^2}{2}, \tag{9} \]
which is equivalent to
\[ \mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2}{\gamma}\big(\mathbb{E}[f(x_k)] - \mathbb{E}[f(x_{k+1})]\big) + \gamma L\sigma^2. \tag{10} \]
Averaging (10) over k = 0, 1, · · · , K, telescoping the right-hand side, and using $f(x_{K+1}) \ge f^\star$, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2(f(x_0) - f^\star)}{\gamma(K+1)} + \gamma L\sigma^2. \tag{11} \]
For the chosen stepsize, let $\gamma_1 = \big(\tfrac{2\Delta_0}{(K+1)L\sigma^2}\big)^{\frac{1}{2}}$, so that $\gamma = (\gamma_1^{-1} + L)^{-1} \le \gamma_1$ and $\tfrac{1}{\gamma} = \tfrac{1}{\gamma_1} + L$. Substituting into (11) gives
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2(f(x_0) - f^\star)}{\gamma(K+1)} + \gamma_1 L\sigma^2 = 2\sqrt{\frac{2L\Delta_0\sigma^2}{K+1}} + \frac{2L\Delta_0}{K+1}, \tag{14} \]
which is exactly (7) since $2\sqrt{2a} = \sqrt{8a}$.
3.2 Smooth and convex problem
Theorem 3.3. Suppose f(x) is convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/(2L), SGD will converge at the following rate
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\Delta_0}{\gamma(K+1)} + \gamma\sigma^2, \tag{15} \]
where $\Delta_0 = \|x_0 - x^\star\|^2$. If we further choose $\gamma = \Big(\big(\tfrac{\Delta_0}{(K+1)\sigma^2}\big)^{-\frac{1}{2}} + 2L\Big)^{-1}$, SGD converges as follows
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le 2\sqrt{\frac{\sigma^2\Delta_0}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{16} \]
Proof. Convexity and L-smoothness of f(x) imply, respectively,
\[ \langle \nabla f(x_k), x_k - x^\star \rangle \ge f(x_k) - f(x^\star) \quad \text{and} \quad \|\nabla f(x_k)\|^2 \le 2L\big(f(x_k) - f(x^\star)\big), \]
which we refer to below as (17) and (19). Expanding the SGD update, we have
\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - x^\star\|^2 \mid \mathcal{F}_k]
&= \|x_k - x^\star\|^2 - 2\gamma\,\mathbb{E}[\langle \nabla F(x_k;\xi_k), x_k - x^\star \rangle \mid \mathcal{F}_k] + \gamma^2\,\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{\le} \|x_k - x^\star\|^2 - 2\gamma\langle \nabla f(x_k), x_k - x^\star \rangle + \gamma^2\|\nabla f(x_k)\|^2 + \gamma^2\sigma^2 \\
&\overset{(b)}{\le} \|x_k - x^\star\|^2 - 2\gamma(1 - \gamma L)\big(f(x_k) - f(x^\star)\big) + \gamma^2\sigma^2 \\
&\overset{(c)}{\le} \|x_k - x^\star\|^2 - \gamma\big(f(x_k) - f(x^\star)\big) + \gamma^2\sigma^2,
\end{aligned}
\]
where (a) holds due to Assumption 3.1 and (5), (b) holds due to (17) and (19), and (c) holds if we choose γ ≤ 1/(2L). Taking expectation over the filtration $\mathcal{F}_k$ and rearranging, we have
\[ \mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\mathbb{E}[\|x_k - x^\star\|^2] - \mathbb{E}[\|x_{k+1} - x^\star\|^2]}{\gamma} + \gamma\sigma^2. \tag{21} \]
Averaging over k = 0, 1, · · · , K and telescoping, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\|x_0 - x^\star\|^2}{\gamma(K+1)} + \gamma\sigma^2. \tag{22} \]
Substituting the chosen stepsize $\gamma = \big(\big(\tfrac{\Delta_0}{(K+1)\sigma^2}\big)^{-\frac{1}{2}} + 2L\big)^{-1}$ into (22) and proceeding as in the proof of Theorem 3.2, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le 2\sqrt{\frac{\sigma^2\Delta_0}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{24} \]
3.3 Smooth and strongly convex problem

Theorem 3.4. Suppose f(x) is µ-strongly convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/L, SGD will converge at the following rate
\[ \mathbb{E}[f(x_k)] - f^\star \le (1 - \gamma\mu)^k \Delta_0 + \frac{\gamma L\sigma^2}{\mu}, \tag{25} \]
where $\Delta_0 = f(x_0) - f^\star$. If we further choose $\gamma = \min\big\{\tfrac{1}{L}, \tfrac{1}{\mu K}\ln\tfrac{\mu^2\Delta_0 K}{L\sigma^2}\big\}$, SGD will converge at the following rate
\[ \mathbb{E}[f(x_K)] - f^\star = \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{26} \]
Proof. Since f(x) is µ-strongly convex, it holds from Theorem 3.8 of our notes “Ch0” that
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2\mu}\|\nabla f(y) - \nabla f(x)\|^2. \tag{27} \]
Setting $x = x^\star$ in (27) and using $\nabla f(x^\star) = 0$ yields the PL inequality $2\mu(f(y) - f^\star) \le \|\nabla f(y)\|^2$. Combining it with (8) and taking the total expectation, we have
\[ \mathbb{E}[f(x_{k+1})] - f^\star \le (1 - \gamma\mu)\big(\mathbb{E}[f(x_k)] - f^\star\big) + \frac{L\gamma^2\sigma^2}{2}. \tag{29} \]
Iterating the above inequality from k = 0 to K − 1 and using $\sum_{j=0}^{K-1}(1-\gamma\mu)^j \le \tfrac{1}{\gamma\mu}$, we have
\[ \mathbb{E}[f(x_K)] - f^\star \le (1 - \gamma\mu)^K\big(f(x_0) - f^\star\big) + \frac{\gamma L\sigma^2}{\mu}. \tag{30} \]
We let $\Delta_0 = f(x_0) - f^\star$. With the fact that 1 − x ≤ exp(−x) when x ∈ (0, 1), the above inequality becomes
\[ \mathbb{E}[f(x_K)] - f^\star \le \Delta_0 \exp(-\gamma\mu K) + \frac{\gamma L\sigma^2}{\mu}. \tag{31} \]
Now we let
\[ \gamma = \min\Big\{\gamma_1, \frac{1}{L}\Big\} \le \gamma_1 \quad \text{where} \quad \gamma_1 = \frac{1}{\mu K}\ln\frac{\mu^2\Delta_0 K}{L\sigma^2}. \tag{32} \]
Since $\exp(-\min\{a, b\}) \le \exp(-a) + \exp(-b)$ for any $a \in \mathbb{R}$ and $b \in \mathbb{R}$, we have
\[
\begin{aligned}
\mathbb{E}[f(x_K)] - f^\star
&\le \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big) + \Delta_0 \exp(-\gamma_1\mu K) + \frac{\gamma_1 L\sigma^2}{\mu} \\
&\le \frac{L\sigma^2}{\mu^2 K}\Big[1 + \ln\frac{\mu^2\Delta_0 K}{L\sigma^2}\Big] + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big) \\
&= \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{33}
\end{aligned}
\]
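To make the bound (25) concrete, here is a small numpy simulation (our own illustration, not part of the notes) of SGD on the strongly convex quadratic f(x) = ½‖x‖², for which µ = L = 1; the noise level and stepsize are arbitrary choices. With a constant γ, the suboptimality decays linearly at first and then plateaus near the noise floor γLσ²/µ predicted by (25):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = L = 1.0                  # f(x) = 0.5 * ||x||^2 is 1-strongly convex, 1-smooth
sigma, gamma = 1.0, 0.01      # total gradient-noise variance sigma^2, constant stepsize
x = np.full(10, 5.0)          # x_0, so Delta_0 = 0.5 * ||x_0||^2 = 125

for k in range(5000):
    noise = sigma * rng.standard_normal(x.size) / np.sqrt(x.size)  # E||noise||^2 = sigma^2
    x = x - gamma * (x + noise)   # grad f(x) = x, plus zero-mean noise

print("final suboptimality f(x) - f* =", 0.5 * x @ x)
print("noise floor gamma*L*sigma^2/mu =", gamma * L * sigma**2 / mu)
```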
4 Mini-batch SGD

Instead of using a single sample per iteration, mini-batch SGD draws a batch of $B$ samples $\{\xi_k^{(b)}\}_{b=1}^{B}$ at iteration $k$ and updates
\[ g_k = \frac{1}{B}\sum_{b=1}^{B}\nabla F(x_k; \xi_k^{(b)}), \tag{34a} \]
\[ x_{k+1} = x_k - \gamma g_k. \tag{34b} \]
Accordingly, we define the filtration
\[ \mathcal{F}_k^B = \big\{x_k, \{\xi_{k-1}^{(b)}\}_{b=1}^{B}, x_{k-1}, \{\xi_{k-2}^{(b)}\}_{b=1}^{B}, \cdots, x_0\big\}. \tag{35} \]
Assumption 4.1. Given the filtration $\mathcal{F}_k^B$, we assume
\[ \mathbb{E}[\nabla F(x_k; \xi_k^{(b)}) \mid \mathcal{F}_k^B] = \nabla f(x_k), \tag{36} \]
\[ \mathbb{E}[\|\nabla F(x_k; \xi_k^{(b)}) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \le \sigma^2. \tag{37} \]
Moreover, we assume $\{\xi_k^{(b)}\}_{b=1}^{B}$ are independent of each other for any $k = 0, 1, \cdots$.
Under Assumption 4.1, the mini-batch gradient $g_k$ remains unbiased:
\[ \mathbb{E}[g_k \mid \mathcal{F}_k^B] = \frac{1}{B}\sum_{b=1}^{B}\mathbb{E}[\nabla F(x_k; \xi_k^{(b)}) \mid \mathcal{F}_k^B] = \nabla f(x_k). \tag{38} \]
Moreover, its variance satisfies
\[ \mathbb{E}[\|g_k - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \overset{(a)}{=} \frac{1}{B^2}\sum_{b=1}^{B}\mathbb{E}[\|\nabla F(x_k; \xi_k^{(b)}) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \overset{(b)}{\le} \frac{\sigma^2}{B}, \tag{39} \]
where the first equality (a) holds due to the independence between $\{\xi_k^{(b)}\}_{b=1}^{B}$ and Eq. (36), and (b) holds due to Eq. (37). According to (39), the variance of the mini-batch stochastic gradient $g_k$ is $B$-times smaller than that of the single-sample stochastic gradient ∇F(x; ξ).
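As a quick numerical check of (39), the following numpy sketch (a synthetic least-squares example of our own, with F(x; ξ) = ½(aᵀx − y)² and ξ = (a, y)) estimates the variance of the mini-batch gradient for several batch sizes and shows that it shrinks roughly like 1/B:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x_star = rng.standard_normal(d)   # synthetic ground-truth parameter
x = np.zeros(d)                   # point where gradients are evaluated
grad_f = x - x_star               # since E[a a^T] = I, grad f(x) = x - x_star

def mini_batch_grad(x, B):
    """g_k = (1/B) sum_b grad F(x; xi^(b)) for F(x; xi) = 0.5*(a^T x - y)^2."""
    A = rng.standard_normal((B, d))
    y = A @ x_star + 0.1 * rng.standard_normal(B)
    return A.T @ (A @ x - y) / B

for B in (1, 4, 16):
    grads = np.array([mini_batch_grad(x, B) for _ in range(20000)])
    var = np.mean(np.sum((grads - grad_f) ** 2, axis=1))
    print(f"B = {B:2d}:  E||g_k - grad f(x)||^2 ~ {var:.3f}")   # shrinks roughly like 1/B
```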
Theorem 4.2. Suppose f(x) is µ-strongly convex and L-smooth. Under Assumption 4.1, and with the stepsize tuned as in Theorem 3.4 (with σ² replaced by σ²/B), mini-batch SGD will converge as follows
\[ \mathbb{E}[f(x_K)] - f^\star = \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 B K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{42} \]
This follows from Theorem 3.4 with σ² replaced by σ²/B, in view of the unbiasedness (38) and the reduced variance (39).
5 Experiments
5.1 Linear regression
• We use different learning rates to implement stochastic gradient descent for a linear regression task, where the loss function is $\ell(X\theta; y) = \frac{1}{2}\|X\theta - y\|^2$. We randomly generated a dataset $X$ (N = 100, d = 5) and a label vector $y$ (N = 100), and set the initial parameter θ. In the experiment, we set the number of epochs to 10 and the batch size to 1. We used two different learning rates, varying from low to high, and the results are shown in Figure 1; a code sketch of this setup follows below.
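As a reference, here is a minimal reproduction sketch (our own illustration; the exact data generation, initialization, and learning rates of the original experiment are not fully specified in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, epochs = 100, 5, 10
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

def run_sgd(lr):
    theta = np.zeros(d)
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(N):                 # batch size 1: one sample per step
            grad = (X[i] @ theta - y[i]) * X[i]      # grad of 0.5*(x_i^T theta - y_i)^2
            theta -= lr * grad
        losses.append(0.5 * np.sum((X @ theta - y) ** 2))
    return losses

for lr in (0.001, 0.05):                             # low vs. high learning rate
    print(f"lr={lr}: final loss {run_sgd(lr)[-1]:.4f}")
```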
Figure 1: The loss descent plots for stochastic gradient descent with different learning rates.
We can observe that a smaller learning rate leads to more cautious and stable parameter updates and a better final convergence result, but slower training. Moderately increasing the learning rate accelerates training, but sacrifices some of the final accuracy. An excessively large learning rate, however, can make training unstable, causing oscillations, and may trap the model in poor local minima or even cause divergence. Therefore, a decaying learning rate is commonly used in practice: it converges quickly in the early stage and still ensures a good convergence result in the later stage.
Figure 2: Loss curves trained using two learning rate reduction strategies in the SGD algorithm.
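The notes do not specify the two reduction strategies in Figure 2; as an illustration, two common schedules are step decay and exponential decay, sketched here with hypothetical hyperparameters:

```python
def step_decay(lr0, k, drop=0.5, every=100):
    """Halve the learning rate every `every` iterations."""
    return lr0 * drop ** (k // every)

def exp_decay(lr0, k, rate=0.99):
    """Multiply the learning rate by `rate` at every iteration."""
    return lr0 * rate ** k

# Usage inside the SGD loop: gamma_k = step_decay(0.05, k), then
# theta -= gamma_k * grad.
```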
• In the previous task's setup, we set the learning rate to 0.05 and the number of epochs to 10, and conducted stochastic gradient descent experiments with different batch sizes, as shown in Figure 3.
Figure 3: The loss descent plots for stochastic gradient descent with different batch sizes.
We can observe that when the batch size is larger, the convergence is slightly faster and a better final result is obtained: the more data used for each parameter update, the better the update represents the gradient of the overall loss function, resulting in higher gradient accuracy.
Figure 4: Loss curves trained using two batch sizes in the CIFAR-10 classification experiment.
We can observe that increasing the batch size improves both the convergence speed and the final result of the model for a fixed number of iterations. This is because, with a larger batch size, the information captured by each batch is closer to the true distribution, resulting in more accurate gradient estimates. As a result, the model achieves better convergence; a sketch of such an experiment is given below.
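For reference, a minimal PyTorch sketch of such a CIFAR-10 comparison (the model, learning rate, and the two batch sizes are illustrative stand-ins, since the original experiment's configuration is not given in these notes):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

def train(batch_size, num_iters=500):
    data = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # linear classifier
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.CrossEntropyLoss()
    it, losses = iter(loader), []
    for _ in range(num_iters):            # fixed number of iterations for both batch sizes
        try:
            images, labels = next(it)
        except StopIteration:             # restart the loader when an epoch ends
            it = iter(loader)
            images, labels = next(it)
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

for bs in (32, 256):                      # small vs. large batch size
    print(f"batch={bs}: final loss {train(bs)[-1]:.3f}")
```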