
CHAPTER 6. STOCHASTIC GRADIENT DESCENT

Jinghua Huang, Pengfei Wu, Kun Yuan

October 24, 2023

1 Problem formulation
This chapter considers the following stochastic optimization problem

    min_{x ∈ ℝ^d} f(x) = E_{ξ∼D}[F(x; ξ)]                                    (1)

where ξ ∼ D denotes a random data sample and D denotes the data distribution. Since D is typically unknown in machine learning, the closed form of f(x) is also unknown.

Notation. We introduce the following notation:

• Let x⋆ := arg min_{x∈ℝ^d} f(x) be the optimal solution to problem (1).

• Let f⋆ := min_{x∈ℝ^d} f(x) be the optimal function value.

• Let F_k = {x_k, ξ_{k−1}, x_{k−1}, …, ξ_0} be the filtration containing all historical variables at and before iteration k. Note that ξ_k does not belong to F_k.

2 Stochastic gradient descent


Since f(x) does not have a closed form, we cannot access its gradient directly. However, since F(x; ξ) is known, we can use ∇_x F(x; ξ) to approximate the true gradient ∇f(x). Throughout this lecture, we write ∇F(x; ξ) = ∇_x F(x; ξ) for notational simplicity. Given an arbitrary initialization x_0, stochastic gradient descent (SGD) iterates as follows:

    x_{k+1} = x_k − γ∇F(x_k; ξ_k),   ∀ k = 0, 1, 2, …                        (2)

where γ is the learning rate and ξ_k ∼ D is a random sample drawn at iteration k. Since ξ_k is a random variable for each k = 0, 1, …, every iterate x_k is also a random variable for k = 1, 2, …. A minimal implementation sketch follows.
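To make the recursion (2) concrete, here is a minimal sketch of the SGD loop in Python. The choice F(x; ξ) = ½(aᵀx − b)² for a sample ξ = (a, b), i.e., single-sample least squares, is an illustrative assumption and not part of the notes.

```python
import numpy as np

def sgd(grad_F, x0, gamma, num_iters, rng):
    """Run SGD (2): x_{k+1} = x_k - gamma * grad_F(x_k, xi_k)."""
    x = x0.copy()
    for k in range(num_iters):
        xi = rng.standard_normal(x0.shape[0] + 1)  # draw a fresh sample xi_k ~ D
        x = x - gamma * grad_F(x, xi)
    return x

# Illustrative F(x; xi): least squares on one sample xi = (a, b), so that
# F(x; xi) = 0.5 * (a @ x - b)**2 and grad_F(x, xi) = (a @ x - b) * a.
def grad_F(x, xi):
    a, b = xi[:-1], xi[-1]
    return (a @ x - b) * a

rng = np.random.default_rng(0)
x_hat = sgd(grad_F, x0=np.zeros(5), gamma=0.05, num_iters=1000, rng=rng)
```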

3 Convergence analysis
To facilitate convergence analysis, we introduce the following assumption:

Assumption 3.1. Given the filtration F_k, we assume

    E[∇F(x_k; ξ_k) | F_k] = ∇f(x_k),                                         (3)

    E[‖∇F(x_k; ξ_k) − ∇f(x_k)‖² | F_k] ≤ σ².                                 (4)

The above assumption states that, conditioned on the filtration F_k, the stochastic gradient ∇F(x_k; ξ_k) is an unbiased estimate of ∇f(x_k), and that its variance is bounded by σ². Under the above assumption, it is easy to verify that

    E[‖∇F(x_k; ξ_k)‖² | F_k] = E[‖∇F(x_k; ξ_k) − ∇f(x_k) + ∇f(x_k)‖² | F_k]
                             = ‖∇f(x_k)‖² + E[‖∇F(x_k; ξ_k) − ∇f(x_k)‖² | F_k]
                             ≤ ‖∇f(x_k)‖² + σ²                               (5)

where the second equality holds due to (3) (the cross term vanishes by unbiasedness) and the last inequality holds due to (4).
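As a sanity check, the conditional mean and variance in (3)–(4) can be estimated by Monte Carlo at a fixed point x. The snippet below uses the same illustrative least-squares model as the sketch above; for that model ∇f(x) = x when a ∼ N(0, I) and b ∼ N(0, 1).

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 5, 100_000
x = rng.standard_normal(d)  # a fixed iterate x_k

# Draw many samples xi = (a, b) and form one stochastic gradient per row.
A = rng.standard_normal((n_samples, d))
b = rng.standard_normal(n_samples)
grads = (A @ x - b)[:, None] * A

g_mean = grads.mean(axis=0)                            # ≈ ∇f(x), cf. (3)
variance = ((grads - g_mean) ** 2).sum(axis=1).mean()  # ≈ E‖∇F − ∇f‖², cf. (4)
print(np.linalg.norm(g_mean - x), variance)            # mean error ≈ 0, variance finite
```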

3.1 Smooth and non-convex problem

Theorem 3.2. Suppose f(x) is L-smooth and Assumption 3.1 holds. If γ ≤ 1/L, SGD will converge at the following rate

    (1/(K+1)) Σ_{k=0}^{K} E[‖∇f(x_k)‖²] ≤ 2∆_0/(γ(K+1)) + γLσ²,              (6)

where ∆_0 = f(x_0) − f⋆. If we further choose γ = [(2∆_0/((K+1)Lσ²))^{−1/2} + L]^{−1}, SGD converges as

    (1/(K+1)) Σ_{k=0}^{K} E[‖∇f(x_k)‖²] ≤ √(8L∆_0σ²/(K+1)) + 2L∆_0/(K+1).    (7)

Remark. If σ² = 0, the stochastic gradient reduces to the true gradient, and hence SGD reduces to GD. Substituting σ² = 0 into the SGD rate (7), we recover the O(L/K) rate of GD. In other words, our convergence rate for SGD is consistent with that of GD.

Proof. Since f(x) is L-smooth, we have

    E[f(x_{k+1}) | F_k] ≤ f(x_k) + E[⟨∇f(x_k), x_{k+1} − x_k⟩ | F_k] + (L/2) E[‖x_{k+1} − x_k‖² | F_k]
                        = f(x_k) − γ E[⟨∇f(x_k), ∇F(x_k; ξ_k)⟩ | F_k] + (Lγ²/2) E[‖∇F(x_k; ξ_k)‖² | F_k]
                    (a) ≤ f(x_k) − γ(1 − Lγ/2) ‖∇f(x_k)‖² + Lγ²σ²/2
                    (b) ≤ f(x_k) − (γ/2) ‖∇f(x_k)‖² + Lγ²σ²/2                (8)

where inequality (a) holds due to Assumption 3.1 together with (5), and inequality (b) holds if γ ≤ 1/L. Taking total expectation over the filtration F_k, we have

    E[f(x_{k+1})] ≤ E[f(x_k)] − (γ/2) E[‖∇f(x_k)‖²] + Lγ²σ²/2                (9)

which is equivalent to

    E[‖∇f(x_k)‖²] ≤ (2/γ)(E[f(x_k)] − E[f(x_{k+1})]) + γLσ².                 (10)

Averaging over k = 0, 1, …, K and telescoping, we have

    (1/(K+1)) Σ_{k=0}^{K} E[‖∇f(x_k)‖²] ≤ 2(f(x_0) − f⋆)/(γ(K+1)) + γLσ².    (11)

Defining ∆_0 := f(x_0) − f⋆, if we set

    γ = [(2∆_0/((K+1)Lσ²))^{−1/2} + L]^{−1},                                 (12)

it then holds that

    γ ≤ min{1/L, γ_1},   where γ_1 = (2∆_0/((K+1)Lσ²))^{1/2}.                (13)

Substituting (12) and (13) into (11), and using 1/γ = 1/γ_1 + L in the first term, we have

    (1/(K+1)) Σ_{k=0}^{K} E[‖∇f(x_k)‖²] ≤ 2(f(x_0) − f⋆)/(γ(K+1)) + γ_1 Lσ²
                                        = 2√(2L∆_0σ²/(K+1)) + 2L∆_0/(K+1),   (14)

which matches the rate (7) since 2√2 = √8.
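The stepsize in Theorem 3.2 depends only on L, σ², ∆_0, and the horizon K, so it can be computed before training. A small helper, with purely illustrative constants:

```python
import math

def nonconvex_sgd_stepsize(L, sigma2, delta0, K):
    """Stepsize (12): gamma = [(2*delta0/((K+1)*L*sigma2))^(-1/2) + L]^(-1)."""
    gamma1 = math.sqrt(2 * delta0 / ((K + 1) * L * sigma2))
    return 1.0 / (1.0 / gamma1 + L)

def nonconvex_sgd_bound(L, sigma2, delta0, K):
    """Right-hand side of rate (7)."""
    return math.sqrt(8 * L * delta0 * sigma2 / (K + 1)) + 2 * L * delta0 / (K + 1)

# Illustrative constants (assumed, not from the notes):
print(nonconvex_sgd_stepsize(L=1.0, sigma2=1.0, delta0=10.0, K=10_000))
print(nonconvex_sgd_bound(L=1.0, sigma2=1.0, delta0=10.0, K=10_000))
```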

3.2 Smooth and convex problem

Theorem 3.3. Suppose f(x) is convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/(2L), SGD will converge at the following rate

    (1/(K+1)) Σ_{k=0}^{K} E[f(x_k) − f(x⋆)] ≤ ∆_0/(γ(K+1)) + γσ²             (15)

where ∆_0 = ‖x_0 − x⋆‖². If we further choose γ = [(∆_0/((K+1)σ²))^{−1/2} + 2L]^{−1}, SGD converges as follows

    (1/(K+1)) Σ_{k=0}^{K} E[f(x_k) − f(x⋆)] ≤ 2√(σ²∆_0/(K+1)) + 2L∆_0/(K+1). (16)

Proof. According to Lemma 3.4 in Chapter 1, if f(x) is L-smooth, we have

    ‖∇f(x_k)‖² ≤ 2L(f(x_k) − f⋆),   ∀ k = 0, 1, ….                           (17)

Also, since f(x) is convex, we have

    f⋆ − f(x_k) ≥ ⟨∇f(x_k), x⋆ − x_k⟩,   ∀ k = 0, 1, ….                      (18)

With the recursion of SGD, we have

    E[‖x_{k+1} − x⋆‖² | F_k] = E[‖x_k − x⋆ − γ∇F(x_k; ξ_k)‖² | F_k]
                             = ‖x_k − x⋆‖² − 2γ E[⟨x_k − x⋆, ∇F(x_k; ξ_k)⟩ | F_k] + γ² E[‖∇F(x_k; ξ_k)‖² | F_k]
                         (a) ≤ ‖x_k − x⋆‖² − 2γ⟨x_k − x⋆, ∇f(x_k)⟩ + γ²‖∇f(x_k)‖² + γ²σ²
                         (b) ≤ ‖x_k − x⋆‖² − 2γ(f(x_k) − f⋆) + 2γ²L(f(x_k) − f⋆) + γ²σ²
                             = ‖x_k − x⋆‖² − 2γ(1 − γL)(f(x_k) − f⋆) + γ²σ²
                         (c) ≤ ‖x_k − x⋆‖² − γ(f(x_k) − f⋆) + γ²σ²           (19)

where (a) holds due to Assumption 3.1 and (5), (b) holds due to (17) and (18), and (c) holds if we choose γ ≤ 1/(2L). Taking total expectation over the filtration F_k, we have

    E[‖x_{k+1} − x⋆‖²] ≤ E[‖x_k − x⋆‖²] − γ E[f(x_k) − f(x⋆)] + γ²σ²         (20)

which is equivalent to

    E[f(x_k) − f(x⋆)] ≤ (E[‖x_k − x⋆‖²] − E[‖x_{k+1} − x⋆‖²])/γ + γσ².       (21)

Averaging over k = 0, 1, …, K and telescoping, we have

    (1/(K+1)) Σ_{k=0}^{K} E[f(x_k) − f(x⋆)] ≤ ‖x_0 − x⋆‖²/(γ(K+1)) + γσ².    (22)

Similar to the arguments in (12)–(14), if we choose

    γ = [(∆_0/((K+1)σ²))^{−1/2} + 2L]^{−1},                                  (23)

SGD will converge as follows

    (1/(K+1)) Σ_{k=0}^{K} E[f(x_k) − f(x⋆)] ≤ 2√(σ²∆_0/(K+1)) + 2L∆_0/(K+1). (24)

Remark. If σ² = 0, we recover the O(L/K) rate of GD in the convex setting.
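Analogously to the non-convex case, the stepsize (23) and the bound (16) can be computed in advance. A minimal helper, where delta0 = ‖x_0 − x⋆‖² and all inputs are illustrative:

```python
import math

def convex_sgd_stepsize(L, sigma2, delta0, K):
    """Stepsize (23): gamma = [(delta0/((K+1)*sigma2))^(-1/2) + 2L]^(-1)."""
    gamma1 = math.sqrt(delta0 / ((K + 1) * sigma2))
    return 1.0 / (1.0 / gamma1 + 2 * L)

def convex_sgd_bound(L, sigma2, delta0, K):
    """Right-hand side of rate (16)."""
    return 2 * math.sqrt(sigma2 * delta0 / (K + 1)) + 2 * L * delta0 / (K + 1)
```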

3.3 Smooth and strongly convex problem

Theorem 3.4. Suppose f(x) is µ-strongly convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/L, SGD will converge at the following rate

    E[f(x_k)] − f⋆ ≤ (1 − γµ)^k ∆_0 + γLσ²/µ,                                (25)

where ∆_0 = f(x_0) − f⋆. If we further choose γ = min{1/L, (1/(µK)) ln(µ²∆_0K/(Lσ²))}, SGD will converge at the following rate

    E[f(x_K)] − f⋆ = Õ( Lσ²/(µ²K) + ∆_0 exp(−(µ/L)K) )                       (26)

where the Õ(·) notation hides all logarithmic factors.

Proof. Since f(x) is µ-strongly convex, it holds from Theorem 3.8 of our notes “Ch0” that

    f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + (1/(2µ)) ‖∇f(y) − ∇f(x)‖².                (27)

Letting y = x_k and x = x⋆, we have

    ‖∇f(x_k)‖² ≥ 2µ(f(x_k) − f⋆).                                            (28)

Substituting the above inequality into (9), we have

    E[f(x_{k+1})] − f⋆ ≤ (1 − γµ)(E[f(x_k)] − f⋆) + Lγ²σ²/2.                 (29)

Iterating the above inequality, and bounding the accumulated noise via Σ_{j=0}^{K−1} (1 − γµ)^j ≤ 1/(γµ), we have

    E[f(x_K)] − f⋆ ≤ (1 − γµ)^K (f(x_0) − f⋆) + γLσ²/µ.                      (30)

Let ∆_0 = f(x_0) − f⋆. Using the fact that 1 − x ≤ exp(−x) for x ∈ (0, 1), the above inequality becomes

    E[f(x_K)] − f⋆ ≤ ∆_0 exp(−γµK) + γLσ²/µ.                                 (31)

Now we let

    γ = min{γ_1, 1/L} ≤ γ_1,   where γ_1 = (1/(µK)) ln(µ²∆_0K/(Lσ²)).        (32)

Since exp(−min{a, b}) ≤ exp(−a) + exp(−b) for any a ∈ ℝ and b ∈ ℝ, we have

    E[f(x_K)] − f⋆ ≤ ∆_0 exp(−(µ/L)K) + ∆_0 exp(−γ_1µK) + γ_1Lσ²/µ
                   ≤ (Lσ²/(µ²K)) [1 + ln(µ²∆_0K/(Lσ²))] + ∆_0 exp(−(µ/L)K)
                   = Õ( Lσ²/(µ²K) + ∆_0 exp(−(µ/L)K) )                       (33)

where we hide the logarithmic factor in the Õ(·) notation.
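The behaviour in (25), geometric decay down to a noise floor of order γLσ²/µ, is easy to visualize by simulating SGD on a strongly convex quadratic with artificial gradient noise. This simulation is illustrative and not one of the experiments in Section 5.

```python
import numpy as np

rng = np.random.default_rng(2)
mu, L, sigma, gamma = 0.5, 2.0, 1.0, 0.05  # gamma <= 1/L, as Theorem 3.4 requires
d, K = 10, 2000

# f(x) = 0.5 * x^T H x with eigenvalues in [mu, L]; x* = 0 and f* = 0.
H = np.diag(np.linspace(mu, L, d))
x = np.ones(d)

gaps = []
for k in range(K):
    noise = sigma / np.sqrt(d) * rng.standard_normal(d)  # E||noise||^2 = sigma^2
    g = H @ x + noise                                    # unbiased stochastic gradient
    x = x - gamma * g
    gaps.append(0.5 * x @ H @ x)                         # suboptimality f(x_k) - f*

# The suboptimality decays geometrically, then plateaus at a level bounded by
# the theoretical noise floor gamma * L * sigma^2 / mu from (25).
print(gaps[-1], gamma * L * sigma**2 / mu)
```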

4 Mini-batch stochastic gradient descent


When training deep neural networks, it is common to sample a batch of data to estimate the true gradient. Mini-batch SGD iterates as follows:

    g_k = (1/B) Σ_{b=1}^{B} ∇F(x_k; ξ_k^{(b)}),                              (34a)

    x_{k+1} = x_k − γ g_k,                                                   (34b)

where B is the batch size.
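A minimal sketch of the update (34), reusing the illustrative least-squares grad_F from the Section 2 sketch:

```python
import numpy as np

def minibatch_sgd(grad_F, x0, gamma, batch_size, num_iters, rng):
    """Mini-batch SGD (34): average B independent sample gradients, then step."""
    x = x0.copy()
    for k in range(num_iters):
        # Draw B independent samples xi_k^{(1)}, ..., xi_k^{(B)} ~ D.
        batch = [rng.standard_normal(x0.shape[0] + 1) for _ in range(batch_size)]
        g = np.mean([grad_F(x, xi) for xi in batch], axis=0)  # (34a)
        x = x - gamma * g                                     # (34b)
    return x
```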

4.1 Mini-batch SGD enjoys smaller variance


The following lemma establishes that mini-batch sampling gives SGD a more accurate gradient estimate. To state the result, we first introduce the filtration

    F_k^B = {x_k, {ξ_{k−1}^{(b)}}_{b=1}^{B}, x_{k−1}, {ξ_{k−2}^{(b)}}_{b=1}^{B}, …, x_0}.   (35)
Assumption 4.1. Given the filtration F_k^B, we assume

    E[∇F(x_k; ξ_k^{(b)}) | F_k^B] = ∇f(x_k),                                 (36)

    E[‖∇F(x_k; ξ_k^{(b)}) − ∇f(x_k)‖² | F_k^B] ≤ σ².                         (37)

Moreover, we assume {ξ_k^{(b)}}_{b=1}^{B} are mutually independent for any k = 0, 1, ….

With the above assumption, it is easy to verify that

    E[g_k | F_k^B] = (1/B) Σ_{b=1}^{B} E[∇F(x_k; ξ_k^{(b)}) | F_k^B] = ∇f(x_k),            (38)

and the variance is derived as

    E[‖g_k − ∇f(x_k)‖² | F_k^B] (a)= (1/B²) Σ_{b=1}^{B} E[‖∇F(x_k; ξ_k^{(b)}) − ∇f(x_k)‖² | F_k^B] (b)≤ σ²/B   (39)

where equality (a) holds due to the independence of {ξ_k^{(b)}}_{b=1}^{B} and Eq. (36), which make the cross terms vanish, and inequality (b) holds due to Eq. (37). According to (39), the variance of the mini-batch stochastic gradient g_k is B times smaller than that of the single-sample stochastic gradient ∇F(x; ξ).
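The 1/B variance reduction in (39) can be confirmed numerically by averaging B independent copies of the noisy gradient and comparing empirical variances, again under the illustrative least-squares model (for which ∇f(x) = x):

```python
import numpy as np

rng = np.random.default_rng(3)
d, B, trials = 5, 32, 20_000
x = rng.standard_normal(d)  # a fixed iterate x_k

def batch_gradient(batch_size):
    """One realization of g_k in (34a) for the least-squares model."""
    A = rng.standard_normal((batch_size, d))
    b = rng.standard_normal(batch_size)
    return ((A @ x - b)[:, None] * A).mean(axis=0)

single = np.array([batch_gradient(1) for _ in range(trials)])
batched = np.array([batch_gradient(B) for _ in range(trials)])

var_single = ((single - x) ** 2).sum(axis=1).mean()    # E||grad F - grad f||^2
var_batched = ((batched - x) ** 2).sum(axis=1).mean()  # E||g_k - grad f||^2
print(var_single / var_batched)                        # ratio should be close to B
```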

4.2 Convergence of mini-batch SGD


The convergence analysis of mini-batch SGD is almost the same as that of vanilla SGD, except that the variance of the stochastic gradient g_k becomes σ²/B.

Theorem 4.2. Under Assumption 4.1, mini-batch SGD will converge as follows:

• If f(x) is L-smooth, mini-batch SGD converges as

    (1/(K+1)) Σ_{k=0}^{K} E[‖∇f(x_k)‖²] = O( √(L∆_0^f σ²/(B(K+1))) + L∆_0^f/(K+1) ),       (40)

  where ∆_0^f = f(x_0) − f⋆.

• If f(x) is L-smooth and convex, mini-batch SGD converges as

    (1/(K+1)) Σ_{k=0}^{K} E[f(x_k) − f⋆] = O( √(L∆_0^x σ²/(B(K+1))) + L∆_0^x/(K+1) ),      (41)

  where ∆_0^x = ‖x_0 − x⋆‖².

• If f(x) is L-smooth and µ-strongly convex, mini-batch SGD converges as

    E[f(x_K)] − f⋆ = Õ( Lσ²/(µ²BK) + ∆_0^f exp(−(µ/L)K) )                                  (42)

  where ∆_0^f = f(x_0) − f⋆ and Õ(·) hides all logarithmic factors.

5 Experiments
5.1 Linear regression
• We implement stochastic gradient descent with different learning rates for a linear regression task, where the loss function is ℓ(Xθ; y) = ½‖Xθ − y‖². We randomly generated a dataset X (N = 100, d = 5) and a label vector y (N = 100), and initialized the parameter θ. In the experiment, we set the number of epochs to 10 and the batch size to 1. We used two different learning rates, varying from low to high; the results are shown in Figure 1, and a sketch of the setup follows the figure.

Figure 1: The loss descent plots for stochastic gradient descent with different learning rates.
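A sketch of this experiment under the stated setup (N = 100, d = 5, batch size 1, 10 epochs); the data-generating distribution and the two learning-rate values are assumptions, since the notes do not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, epochs = 100, 5, 10
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)  # assumed data model

def run_sgd(lr):
    theta = np.zeros(d)
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(N):               # batch size 1: one sample per step
            theta -= lr * (X[i] @ theta - y[i]) * X[i]
        losses.append(0.5 * np.mean((X @ theta - y) ** 2))
    return losses

for lr in (0.005, 0.05):                           # a low and a high learning rate
    print(lr, run_sgd(lr)[-1])
```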

We can observe that a smaller learning rate leads to more cautious, stable parameter updates and a better final result, but training is slower. Moderately increasing the learning rate accelerates training at the cost of some final accuracy. An excessively large learning rate, however, can make training unstable, causing oscillations or even divergence. Therefore, in practice we usually use a decaying learning rate, which converges quickly in the early stage and still ensures a good result in the later stage.

• We also tried decaying learning rates γ_k = γ_0 · 0.2^{k+1} (exponential decay) and γ_k = γ_0/(0.2k + 1) (inverse-time decay), where k is the epoch index. Different decay strategies yield different results; the experimental results are shown in Figure 2, and a sketch of the two schedules follows the figure.

Figure 2: Loss curves for the two learning-rate decay strategies in the SGD algorithm.
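A minimal sketch of the two decay schedules; the exponential form follows our reading of the formula above, and γ_0 is an assumed initial value:

```python
def exponential_decay(gamma0, k):
    """gamma_k = gamma_0 * 0.2^(k+1): aggressive exponential decay per epoch."""
    return gamma0 * 0.2 ** (k + 1)

def inverse_time_decay(gamma0, k):
    """gamma_k = gamma_0 / (0.2*k + 1): slower, inverse-time decay per epoch."""
    return gamma0 / (0.2 * k + 1)

gamma0 = 0.05        # assumed initial learning rate
for k in range(10):  # k indexes epochs
    print(k, exponential_decay(gamma0, k), inverse_time_decay(gamma0, k))
```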

• Keeping the setup of the previous task, we set the learning rate to 0.05 and the number of epochs to 10, and ran stochastic gradient descent with different batch sizes, as shown in Figure 3.

Figure 3: The loss descent plots for stochastic gradient descent with different batch sizes.

We can observe that a larger batch size yields slightly faster convergence and ultimately a better final result, because the more data used for each parameter update, the better the mini-batch gradient represents the gradient of the overall loss function, resulting in higher gradient accuracy.

5.2 Image classification


• We trained on the CIFAR-10 dataset using the ResNet-18 architecture, choosing SGD as the optimizer and investigating the impact of different batch sizes on the results. We performed a comparative experiment with the batch size set to 16 and 128, while keeping the learning rate constant at 0.005. We trained for nearly 1200 steps and recorded the loss during training, which is shown in Figure 4; a sketch of the training setup follows the discussion below.

Figure 4: Loss curves for the two batch sizes in the CIFAR-10 classification experiment.

We can observe that, for a fixed number of iterations, increasing the batch size improves both the convergence speed and the final quality of the model. This is because with a larger batch size, the information captured by each batch is closer to the true data distribution, resulting in more accurate gradient estimates and, consequently, better convergence.
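A hedged PyTorch sketch of this experiment. The hyperparameters match the text (lr = 0.005, batch sizes 16 and 128, about 1200 steps); the plain ToTensor preprocessing and the absence of momentum or weight decay are assumptions, since the notes do not specify them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

def train(batch_size, num_steps=1200, lr=0.005):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_set = datasets.CIFAR10("./data", train=True, download=True,
                                 transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)

    model = models.resnet18(num_classes=10).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()

    losses, step = [], 0
    while step < num_steps:
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            losses.append(loss.item())
            step += 1
            if step >= num_steps:
                break
    return losses

# Compare the two batch sizes used in the experiment.
loss_16, loss_128 = train(batch_size=16), train(batch_size=128)
```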

