Notes ch6
STOCHASTIC GRADIENT DESCENT
1 Problem formulation
This chapter considers the following stochastic optimization problem
\[ \min_{x \in \mathbb{R}^d} \; f(x) := \mathbb{E}_{\xi \sim \mathcal{D}}[F(x; \xi)], \tag{1} \]
where ξ ∼ D denotes the random data sample and D denotes the data distribution. Since D is typically unknown in machine learning, the closed form of f(x) is also unknown.
• Let $\mathcal{F}_k = \{x_k, \xi_{k-1}, x_{k-1}, \cdots, \xi_0\}$ be the filtration containing all historical variables at and before iteration k. Note that $\xi_k$ does not belong to $\mathcal{F}_k$.
2 Stochastic gradient descent
At each iteration k, SGD performs the update
\[ x_{k+1} = x_k - \gamma \nabla F(x_k; \xi_k), \tag{2} \]
where γ is the learning rate, and ξ_k ∼ D is a random data sample drawn at iteration k. Since ξ_k is a random variable for any k = 0, 1, · · · , each variable x_k is also a random variable for k = 1, 2, · · · .
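In code, one SGD iteration is a single line. Below is a minimal Python sketch of the loop, assuming a user-supplied stoch_grad(x, xi) returning ∇F(x; ξ) and a sampler draw_sample() standing in for ξ ∼ D (both hypothetical helpers, not defined in these notes):

```python
import numpy as np

def sgd(x0, stoch_grad, draw_sample, gamma=0.1, num_iters=1000):
    """Plain SGD: x_{k+1} = x_k - gamma * grad F(x_k; xi_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        xi = draw_sample()                  # xi_k ~ D, a fresh sample each iteration
        x = x - gamma * stoch_grad(x, xi)   # stochastic gradient step
    return x
```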
3 Convergence analysis
To facilitate convergence analysis, we introduce the following assumption:

Assumption 3.1. Given the filtration $\mathcal{F}_k$, we assume
\[ \mathbb{E}[\nabla F(x_k; \xi_k) \mid \mathcal{F}_k] = \nabla f(x_k), \tag{3} \]
\[ \mathbb{E}[\|\nabla F(x_k; \xi_k) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \le \sigma^2. \tag{4} \]

The above assumption indicates that, conditioned on the filtration $\mathcal{F}_k$, the stochastic gradient ∇F(x_k; ξ_k) is an unbiased estimate of ∇f(x_k), and its variance is bounded by σ².
Under the above assumption, it is easy to verify that
\[
\begin{aligned}
\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k]
&= \mathbb{E}[\|\nabla F(x_k;\xi_k) - \nabla f(x_k) + \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \\
&= \|\nabla f(x_k)\|^2 + \mathbb{E}[\|\nabla F(x_k;\xi_k) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k] \\
&\le \|\nabla f(x_k)\|^2 + \sigma^2,
\end{aligned} \tag{5}
\]
where the second equality holds because the cross term vanishes due to (3), and the last inequality holds due to (4).
3.1 Smooth nonconvex problem

Theorem 3.2. Suppose f(x) is L-smooth and Assumption 3.1 holds. If γ ≤ 1/L, SGD will converge at the following rate
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2\Delta_0}{\gamma(K+1)} + \gamma L\sigma^2, \tag{6} \]
where $\Delta_0 = f(x_0) - f^\star$. If we further choose $\gamma = \Big(\big(\tfrac{2\Delta_0}{(K+1)L\sigma^2}\big)^{-\frac{1}{2}} + L\Big)^{-1}$, SGD converges as
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \sqrt{\frac{8L\Delta_0\sigma^2}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{7} \]
Remark. If σ² = 0, the stochastic gradient reduces to the true gradient, and hence SGD reduces to GD. Substituting σ² = 0 into the SGD rate (7), we recover the O(L/K) rate of GD. In other words, our convergence rate for SGD is consistent with that of GD.
Proof. Since f(x) is L-smooth, we have
\[
\begin{aligned}
\mathbb{E}[f(x_{k+1}) \mid \mathcal{F}_k]
&\le f(x_k) + \mathbb{E}[\langle \nabla f(x_k), x_{k+1} - x_k \rangle \mid \mathcal{F}_k] + \frac{L}{2}\,\mathbb{E}[\|x_{k+1} - x_k\|^2 \mid \mathcal{F}_k] \\
&= f(x_k) - \gamma\,\mathbb{E}[\langle \nabla f(x_k), \nabla F(x_k;\xi_k) \rangle \mid \mathcal{F}_k] + \frac{L\gamma^2}{2}\,\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{\le} f(x_k) - \gamma\Big(1 - \frac{L\gamma}{2}\Big)\|\nabla f(x_k)\|^2 + \frac{L\gamma^2\sigma^2}{2} \\
&\overset{(b)}{\le} f(x_k) - \frac{\gamma}{2}\|\nabla f(x_k)\|^2 + \frac{L\gamma^2\sigma^2}{2},
\end{aligned} \tag{8}
\]
where inequality (a) holds due to Assumption 3.1 and (5), and inequality (b) holds if γ ≤ 1/L. Taking the total expectation over the filtration $\mathcal{F}_k$, we have
\[ \mathbb{E}[f(x_{k+1})] \le \mathbb{E}[f(x_k)] - \frac{\gamma}{2}\,\mathbb{E}[\|\nabla f(x_k)\|^2] + \frac{L\gamma^2\sigma^2}{2}, \tag{9} \]
which is equivalent to
\[ \mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2}{\gamma}\big(\mathbb{E}[f(x_k)] - \mathbb{E}[f(x_{k+1})]\big) + \gamma L\sigma^2. \tag{10} \]
Averaging (10) over k = 0, 1, · · · , K, telescoping the right-hand side, and using $f(x_{K+1}) \ge f^\star$, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2(f(x_0) - f^\star)}{\gamma(K+1)} + \gamma L\sigma^2. \tag{11} \]
For the chosen stepsize, let $\gamma_1 = \big(\tfrac{2\Delta_0}{(K+1)L\sigma^2}\big)^{\frac{1}{2}}$, so that $\gamma = (\gamma_1^{-1} + L)^{-1} \le \gamma_1$ and $\tfrac{1}{\gamma} = \tfrac{1}{\gamma_1} + L$. Substituting into (11) gives
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[\|\nabla f(x_k)\|^2] \le \frac{2(f(x_0) - f^\star)}{\gamma(K+1)} + \gamma_1 L\sigma^2 = 2\sqrt{\frac{2L\Delta_0\sigma^2}{K+1}} + \frac{2L\Delta_0}{K+1}, \tag{14} \]
which is exactly (7) since $2\sqrt{2a} = \sqrt{8a}$.
3.2 Smooth and convex problem
Theorem 3.3. Suppose f(x) is convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/(2L), SGD will converge at the following rate
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\Delta_0}{\gamma(K+1)} + \gamma\sigma^2, \tag{15} \]
where $\Delta_0 = \|x_0 - x^\star\|^2$. If we further choose $\gamma = \Big(\big(\tfrac{\Delta_0}{(K+1)\sigma^2}\big)^{-\frac{1}{2}} + 2L\Big)^{-1}$, SGD converges as follows
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le 2\sqrt{\frac{\sigma^2\Delta_0}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{16} \]
Proof. Convexity and L-smoothness of f(x) imply, respectively,
\[ \langle \nabla f(x_k), x_k - x^\star \rangle \ge f(x_k) - f(x^\star) \quad \text{and} \quad \|\nabla f(x_k)\|^2 \le 2L\big(f(x_k) - f(x^\star)\big), \]
which we refer to below as (17) and (19). Expanding the SGD update, we have
\[
\begin{aligned}
\mathbb{E}[\|x_{k+1} - x^\star\|^2 \mid \mathcal{F}_k]
&= \|x_k - x^\star\|^2 - 2\gamma\,\mathbb{E}[\langle \nabla F(x_k;\xi_k), x_k - x^\star \rangle \mid \mathcal{F}_k] + \gamma^2\,\mathbb{E}[\|\nabla F(x_k;\xi_k)\|^2 \mid \mathcal{F}_k] \\
&\overset{(a)}{\le} \|x_k - x^\star\|^2 - 2\gamma\langle \nabla f(x_k), x_k - x^\star \rangle + \gamma^2\|\nabla f(x_k)\|^2 + \gamma^2\sigma^2 \\
&\overset{(b)}{\le} \|x_k - x^\star\|^2 - 2\gamma(1 - \gamma L)\big(f(x_k) - f(x^\star)\big) + \gamma^2\sigma^2 \\
&\overset{(c)}{\le} \|x_k - x^\star\|^2 - \gamma\big(f(x_k) - f(x^\star)\big) + \gamma^2\sigma^2,
\end{aligned}
\]
where (a) holds due to Assumption 3.1 and (5), (b) holds due to (17) and (19), and (c) holds if we choose γ ≤ 1/(2L). Taking expectation over the filtration $\mathcal{F}_k$ and rearranging, we have
\[ \mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\mathbb{E}[\|x_k - x^\star\|^2] - \mathbb{E}[\|x_{k+1} - x^\star\|^2]}{\gamma} + \gamma\sigma^2. \tag{21} \]
Averaging over k = 0, 1, · · · , K and telescoping, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le \frac{\|x_0 - x^\star\|^2}{\gamma(K+1)} + \gamma\sigma^2. \tag{22} \]
Substituting the chosen stepsize $\gamma = \big(\big(\tfrac{\Delta_0}{(K+1)\sigma^2}\big)^{-\frac{1}{2}} + 2L\big)^{-1}$ into (22) and proceeding as in the proof of Theorem 3.2, we have
\[ \frac{1}{K+1}\sum_{k=0}^{K}\mathbb{E}[f(x_k) - f(x^\star)] \le 2\sqrt{\frac{\sigma^2\Delta_0}{K+1}} + \frac{2L\Delta_0}{K+1}. \tag{24} \]
3.3 Smooth and strongly convex problem

Theorem 3.4. Suppose f(x) is µ-strongly convex and L-smooth. Under Assumption 3.1, if γ ≤ 1/L, SGD will converge at the following rate
\[ \mathbb{E}[f(x_k)] - f^\star \le (1 - \gamma\mu)^k \Delta_0 + \frac{\gamma L\sigma^2}{\mu}, \tag{25} \]
where $\Delta_0 = f(x_0) - f^\star$. If we further choose $\gamma = \min\big\{\tfrac{1}{L}, \tfrac{1}{\mu K}\ln\tfrac{\mu^2\Delta_0 K}{L\sigma^2}\big\}$, SGD will converge at the following rate
\[ \mathbb{E}[f(x_K)] - f^\star = \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{26} \]
Proof. Since f(x) is µ-strongly convex, it holds from Theorem 3.8 of our notes “Ch0” that
\[ f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{1}{2\mu}\|\nabla f(y) - \nabla f(x)\|^2. \tag{27} \]
Setting $x = x^\star$ in (27) and using $\nabla f(x^\star) = 0$ yields the PL inequality $2\mu(f(y) - f^\star) \le \|\nabla f(y)\|^2$. Combining it with (8) and taking the total expectation, we have
\[ \mathbb{E}[f(x_{k+1})] - f^\star \le (1 - \gamma\mu)\big(\mathbb{E}[f(x_k)] - f^\star\big) + \frac{L\gamma^2\sigma^2}{2}. \tag{29} \]
Iterating the above inequality from k = 0 to K − 1 and using $\sum_{j=0}^{K-1}(1-\gamma\mu)^j \le \tfrac{1}{\gamma\mu}$, we have
\[ \mathbb{E}[f(x_K)] - f^\star \le (1 - \gamma\mu)^K\big(f(x_0) - f^\star\big) + \frac{\gamma L\sigma^2}{\mu}. \tag{30} \]
We let $\Delta_0 = f(x_0) - f^\star$. With the fact that 1 − x ≤ exp(−x) when x ∈ (0, 1), the above inequality becomes
\[ \mathbb{E}[f(x_K)] - f^\star \le \Delta_0 \exp(-\gamma\mu K) + \frac{\gamma L\sigma^2}{\mu}. \tag{31} \]
Now we let
\[ \gamma = \min\Big\{\gamma_1, \frac{1}{L}\Big\} \le \gamma_1 \quad \text{where} \quad \gamma_1 = \frac{1}{\mu K}\ln\frac{\mu^2\Delta_0 K}{L\sigma^2}. \tag{32} \]
Since $\exp(-\min\{a, b\}) \le \exp(-a) + \exp(-b)$ for any $a \in \mathbb{R}$ and $b \in \mathbb{R}$, we have
\[
\begin{aligned}
\mathbb{E}[f(x_K)] - f^\star
&\le \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big) + \Delta_0 \exp(-\gamma_1\mu K) + \frac{\gamma_1 L\sigma^2}{\mu} \\
&\le \frac{L\sigma^2}{\mu^2 K}\Big[1 + \ln\frac{\mu^2\Delta_0 K}{L\sigma^2}\Big] + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big) \\
&= \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{33}
\end{aligned}
\]
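To make the bound (25) concrete, here is a small numpy simulation (our own illustration, not part of the notes) of SGD on the strongly convex quadratic f(x) = ½‖x‖², for which µ = L = 1; the noise level and stepsize are arbitrary choices. With a constant γ, the suboptimality decays linearly at first and then plateaus near the noise floor γLσ²/µ predicted by (25):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = L = 1.0                  # f(x) = 0.5 * ||x||^2 is 1-strongly convex, 1-smooth
sigma, gamma = 1.0, 0.01      # total gradient-noise variance sigma^2, constant stepsize
x = np.full(10, 5.0)          # x_0, so Delta_0 = 0.5 * ||x_0||^2 = 125

for k in range(5000):
    noise = sigma * rng.standard_normal(x.size) / np.sqrt(x.size)  # E||noise||^2 = sigma^2
    x = x - gamma * (x + noise)   # grad f(x) = x, plus zero-mean noise

print("final suboptimality f(x) - f* =", 0.5 * x @ x)
print("noise floor gamma*L*sigma^2/mu =", gamma * L * sigma**2 / mu)
```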
4 Mini-batch SGD

Instead of using a single sample per iteration, mini-batch SGD draws a batch of $B$ samples $\{\xi_k^{(b)}\}_{b=1}^{B}$ at iteration $k$ and updates
\[ g_k = \frac{1}{B}\sum_{b=1}^{B}\nabla F(x_k; \xi_k^{(b)}), \tag{34a} \]
\[ x_{k+1} = x_k - \gamma g_k. \tag{34b} \]
Accordingly, we define the filtration
\[ \mathcal{F}_k^B = \big\{x_k, \{\xi_{k-1}^{(b)}\}_{b=1}^{B}, x_{k-1}, \{\xi_{k-2}^{(b)}\}_{b=1}^{B}, \cdots, x_0\big\}. \tag{35} \]
Assumption 4.1. Given the filtration $\mathcal{F}_k^B$, we assume
\[ \mathbb{E}[\nabla F(x_k; \xi_k^{(b)}) \mid \mathcal{F}_k^B] = \nabla f(x_k), \tag{36} \]
\[ \mathbb{E}[\|\nabla F(x_k; \xi_k^{(b)}) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \le \sigma^2. \tag{37} \]
Moreover, we assume $\{\xi_k^{(b)}\}_{b=1}^{B}$ are independent of each other for any $k = 0, 1, \cdots$.
Under Assumption 4.1, the mini-batch gradient $g_k$ remains unbiased:
\[ \mathbb{E}[g_k \mid \mathcal{F}_k^B] = \frac{1}{B}\sum_{b=1}^{B}\mathbb{E}[\nabla F(x_k; \xi_k^{(b)}) \mid \mathcal{F}_k^B] = \nabla f(x_k). \tag{38} \]
Moreover, its variance satisfies
\[ \mathbb{E}[\|g_k - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \overset{(a)}{=} \frac{1}{B^2}\sum_{b=1}^{B}\mathbb{E}[\|\nabla F(x_k; \xi_k^{(b)}) - \nabla f(x_k)\|^2 \mid \mathcal{F}_k^B] \overset{(b)}{\le} \frac{\sigma^2}{B}, \tag{39} \]
where the first equality (a) holds due to the independence between $\{\xi_k^{(b)}\}_{b=1}^{B}$ and Eq. (36), and (b) holds due to Eq. (37). According to (39), the variance of the mini-batch stochastic gradient $g_k$ is $B$-times smaller than that of the single-sample stochastic gradient ∇F(x; ξ).
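As a quick numerical check of (39), the following numpy sketch (a synthetic least-squares example of our own, with F(x; ξ) = ½(aᵀx − y)² and ξ = (a, y)) estimates the variance of the mini-batch gradient for several batch sizes and shows that it shrinks roughly like 1/B:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
x_star = rng.standard_normal(d)   # synthetic ground-truth parameter
x = np.zeros(d)                   # point where gradients are evaluated
grad_f = x - x_star               # since E[a a^T] = I, grad f(x) = x - x_star

def mini_batch_grad(x, B):
    """g_k = (1/B) sum_b grad F(x; xi^(b)) for F(x; xi) = 0.5*(a^T x - y)^2."""
    A = rng.standard_normal((B, d))
    y = A @ x_star + 0.1 * rng.standard_normal(B)
    return A.T @ (A @ x - y) / B

for B in (1, 4, 16):
    grads = np.array([mini_batch_grad(x, B) for _ in range(20000)])
    var = np.mean(np.sum((grads - grad_f) ** 2, axis=1))
    print(f"B = {B:2d}:  E||g_k - grad f(x)||^2 ~ {var:.3f}")   # shrinks roughly like 1/B
```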
Theorem 4.2. Suppose f(x) is µ-strongly convex and L-smooth. Under Assumption 4.1, and with the stepsize tuned as in Theorem 3.4 (with σ² replaced by σ²/B), mini-batch SGD will converge as follows
\[ \mathbb{E}[f(x_K)] - f^\star = \tilde{O}\Big(\frac{L\sigma^2}{\mu^2 B K}\Big) + \Delta_0 \exp\Big(-\frac{\mu}{L}K\Big). \tag{42} \]
This follows from Theorem 3.4 with σ² replaced by σ²/B, in view of the unbiasedness (38) and the reduced variance (39).
5 Experiments
5.1 Linear regression
• We use different learning rates to implement stochastic gradient descent for a linear regression task, where the loss function is $\ell(X\theta; y) = \frac{1}{2}\|X\theta - y\|^2$. We randomly generated a dataset $X$ (N = 100, d = 5) and a label vector $y$ (N = 100), and set the initial parameter θ. In the experiment, we set the number of epochs to 10 and the batch size to 1. We used two different learning rates, varying from low to high, and the results are shown in Figure 1; a code sketch of this setup follows below.
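As a reference, here is a minimal reproduction sketch (our own illustration; the exact data generation, initialization, and learning rates of the original experiment are not fully specified in these notes):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, epochs = 100, 5, 10
X = rng.standard_normal((N, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

def run_sgd(lr):
    theta = np.zeros(d)
    losses = []
    for _ in range(epochs):
        for i in rng.permutation(N):                 # batch size 1: one sample per step
            grad = (X[i] @ theta - y[i]) * X[i]      # grad of 0.5*(x_i^T theta - y_i)^2
            theta -= lr * grad
        losses.append(0.5 * np.sum((X @ theta - y) ** 2))
    return losses

for lr in (0.001, 0.05):                             # low vs. high learning rate
    print(f"lr={lr}: final loss {run_sgd(lr)[-1]:.4f}")
```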
Figure 1: The loss descent plots for stochastic gradient descent with different learning rates.
We can observe that a smaller learning rate leads to more cautious and stable parameter updates and a better final convergence result, but slower training. Moderately increasing the learning rate accelerates training, but sacrifices some of the final accuracy. An excessively large learning rate, however, can make training unstable, causing oscillations, and may trap the model in poor local minima or even cause divergence. Therefore, a decaying learning rate is commonly used in practice: it converges quickly in the early stage and still ensures a good convergence result in the later stage.
Figure 2: Loss curves trained using two learning rate reduction strategies in the SGD algorithm.
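The notes do not specify the two reduction strategies in Figure 2; as an illustration, two common schedules are step decay and exponential decay, sketched here with hypothetical hyperparameters:

```python
def step_decay(lr0, k, drop=0.5, every=100):
    """Halve the learning rate every `every` iterations."""
    return lr0 * drop ** (k // every)

def exp_decay(lr0, k, rate=0.99):
    """Multiply the learning rate by `rate` at every iteration."""
    return lr0 * rate ** k

# Usage inside the SGD loop: gamma_k = step_decay(0.05, k), then
# theta -= gamma_k * grad.
```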
• In the previous task's setup, we set the learning rate to 0.05 and the number of epochs to 10, and conducted stochastic gradient descent experiments with different batch sizes, as shown in Figure 3.
Figure 3: The loss descent plots for stochastic gradient descent with different batch sizes.
We can observe that when the batch size is larger, the convergence is slightly faster and a better final result is obtained: the more data used for each parameter update, the better the update represents the gradient of the overall loss function, resulting in higher gradient accuracy.
Figure 4: Loss curves trained using two batch sizes in the CIFAR-10 classification experiment.
We can observe that increasing the batch size improves both the convergence speed and the final result of the model for a fixed number of iterations. This is because, with a larger batch size, the information captured by each batch is closer to the true distribution, resulting in more accurate gradient estimates. As a result, the model achieves better convergence; a sketch of such an experiment is given below.
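For reference, a minimal PyTorch sketch of such a CIFAR-10 comparison (the model, learning rate, and the two batch sizes are illustrative stand-ins, since the original experiment's configuration is not given in these notes):

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

def train(batch_size, num_iters=500):
    data = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor())
    loader = torch.utils.data.DataLoader(data, batch_size=batch_size, shuffle=True)
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # linear classifier
    opt = torch.optim.SGD(model.parameters(), lr=0.05)
    loss_fn = nn.CrossEntropyLoss()
    it, losses = iter(loader), []
    for _ in range(num_iters):            # fixed number of iterations for both batch sizes
        try:
            images, labels = next(it)
        except StopIteration:             # restart the loader when an epoch ends
            it = iter(loader)
            images, labels = next(it)
        opt.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

for bs in (32, 256):                      # small vs. large batch size
    print(f"batch={bs}: final loss {train(bs)[-1]:.3f}")
```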