
Statistics & Discrete Methods of Data Sciences CS395T(51800), CSE392(63625), M393C(54377)

MW 11:00a.m.-12:30p.m. GDC 5.304


Lecture Notes: Stochastic Gradient Descent, [email protected]

1 Stochastic Gradient Descent


Consider the following optimization problem:

max_x f(x) := E_ξ[g(x, ξ)]    (1)

where ξ is a one-dimensional random variable and x is a one-dimensional parameter we want to estimate.
One can think of the objective function here as a parameter estimation process. It has the gradient descent
form:

x_{t+1} = x_t + ε_t ∇f(x_t)

Since in practice we only observe samples drawn from a distribution whose form is unknown to us, the
expectation in Equation 1 can be approximated as:
f(x) = E_ξ[g(x, ξ)] ≈ (1/N) ∑_{i=1}^{N} g(x, ξ_i)

where {ξ_i} are N observations. The general gradient descent scheme needs to estimate or calculate:

∇f(x) = ∇E_ξ[g(x, ξ)] ≈ (1/N) ∑_{i=1}^{N} ∇_x g(x, ξ_i)

which is inefficient for large-scale optimization. Stochastic gradient descent (SGD) claims that, by
replacing the gradient with:

∇f(x) ≈ ∇_x g(x, ξ_t)

one can guarantee convergence similar to the original gradient descent scheme. Here ξ_t is a single sample
drawn from the p.d.f. of ξ. (In other words, t ∈ {1, 2, . . . , N} if we only have N real observations instead
of knowing the p.d.f. of ξ.)
We claim that, under the assumptions that:

• the iteration scheme is x_{t+1} = x_t + ε_t ∇f(x_t), with ∇f(x_t) estimated as above, and

• the step sizes {ε_t} satisfy ∑_{t=1}^{∞} ε_t = ∞ and ∑_{t=1}^{∞} ε_t² < ∞,

stochastic gradient descent will converge to a local optimum with probability one. Moreover, if:

• the iteration scheme is x_{t+1} = x_t + ε∇f(x_t) + √(2εσ^{-1}) η_t with noise η_t ∼ N(0, 1), i.i.d.,

then stochastic gradient descent will converge to the global optimum with probability one as σ → ∞. This
scheme with an added noise term is commonly called noisy stochastic gradient descent (NSGD).
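As a concrete illustration, here is a minimal Python sketch of the two update rules above. It is not code from the lecture: the function names sgd_step and nsgd_step, the toy objective g(x, ξ) = −(ξ − x)², and the particular step-size schedule are illustrative assumptions.

```python
# Minimal sketch of the SGD and NSGD updates for maximizing E_xi[g(x, xi)].
# The toy objective g(x, xi) = -(xi - x)^2 and all names here are illustrative.
import numpy as np

def sgd_step(x, xi, grad_g, eps):
    """One SGD step: x <- x + eps * grad_x g(x, xi), using a single sample xi."""
    return x + eps * grad_g(x, xi)

def nsgd_step(x, xi, grad_g, eps, sigma, rng):
    """One NSGD step: SGD step plus sqrt(2 * eps / sigma) * N(0, 1) noise."""
    noise = np.sqrt(2.0 * eps / sigma) * rng.standard_normal()
    return sgd_step(x, xi, grad_g, eps) + noise

grad_g = lambda x, xi: 2.0 * (xi - x)     # gradient of g(x, xi) = -(xi - x)^2

rng = np.random.default_rng(0)
x_sgd = x_nsgd = 0.0
for t in range(10_000):
    xi = rng.normal(loc=3.0, scale=1.0)   # one sample per iteration
    eps = 0.5 / (t + 1)                   # sum eps_t = inf, sum eps_t^2 < inf
    x_sgd = sgd_step(x_sgd, xi, grad_g, eps)
    x_nsgd = nsgd_step(x_nsgd, xi, grad_g, eps, sigma=100.0, rng=rng)
print(x_sgd, x_nsgd)                      # both should approach the true mean 3.0
```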

1.1 Example: Stochastic Gradient Descent for Estimation of the Mean


Why does this method work? The best way to understand the principle of SGD is to examine the simplest
case: recursive estimation of the mean in one dimension.
We consider a simple optimization problem. We want to optimize the following objective function:

f(x) = −E_ξ[(ξ − x)²]

For example, if ξ ∼ N(µ, σ²) (with σ a constant), then the optimal solution is the mean of ξ: x* =
argmax f(x) = E(ξ) = µ. Applying the stochastic gradient descent scheme to this problem gives:

x_{t+1} = x_t + ε_t(ξ_t − x_t)


where ξ_t are i.i.d. random variables. Now if we set the step size ε_t = 1/(t+1), then we will find:

x_{t+1} = (t/(t+1)) x_t + (1/(t+1)) ξ_t = (1/(t+1)) ∑_{j=0}^{t} ξ_j

which is exactly the unbiased estimate of x when you have t+1 observations {ξ_j}_{j=0}^{t}. Moreover, as t → ∞,

E_ξ[(x_t − x*)] → 0

and

E_ξ[(x_t − x*)²] → (1/t) Var(ξ)

These are the expected behaviors for the evolution of x_t.
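A quick numerical check of this fact (my own sketch, with an arbitrary choice of distribution): with ε_t = 1/(t+1) the SGD recursion reproduces the running sample mean exactly.

```python
# SGD with step size eps_t = 1/(t+1) is just the running average of the samples.
# The distribution N(2, 0.5^2) and the sample size are illustrative choices.
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=2.0, scale=0.5, size=10_000)

x = 0.0
for t, xi in enumerate(samples):
    eps = 1.0 / (t + 1)
    x = x + eps * (xi - x)     # x_{t+1} = x_t + eps_t * (xi_t - x_t)

print(x, samples.mean())       # agree up to floating-point error
```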
For the general step size case, we are interested in the estimation of the parameter x. How does the
estimate evolve, and why are the conditions ∑_{t=1}^{∞} ε_t = ∞ and ∑_{t=1}^{∞} ε_t² < ∞ essential to guarantee
convergence? To answer these questions, we now write down how E_ξ(x_t) evolves with the step sizes ε_t:

E_ξ(x_{t+1}) = E_ξ(x_t) + ε_t E_ξ(ξ_t − x_t)
            = E_ξ(x_t) + ε_t (x* − E_ξ(x_t))
            = (1 − ε_t) E_ξ(x_t) + ε_t x*
            = (1 − ε_t) [(1 − ε_{t−1}) E_ξ(x_{t−1}) + ε_{t−1} x*] + ε_t x*
            = ...
            = [∏_{j=0}^{t} (1 − ε_j)] E_ξ(x_0) + [∑_{j=0}^{t−1} (∏_{k=j+1}^{t} (1 − ε_k)) ε_j + ε_t] x*

Hence, we have

E_ξ(x_{t+1} − x*) = [∏_{j=0}^{t} (1 − ε_j)] E_ξ(x_0 − x*)

The term [∏_{j=0}^{t} (1 − ε_j)] E_ξ(x_0 − x*) is the bias term here. We want E_ξ(x_t − x*) to converge to 0 as t goes to
infinity, so that the iterates reach the maximum of f(x). From this we deduce that we need

∑_{j=0}^{t} log(1 − ε_j) → −∞ as t → ∞

Since the ε_j are small, we apply a Taylor expansion (log(1 − ε_j) ≈ −ε_j) and get:

∑_{j=0}^{t} ε_j → ∞ as t → ∞

which gives a "first order" condition on the step sizes. This indicates that wherever the initialization x_0 is,
SGD will always reach the optimum in expectation.
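The role of this first condition can also be seen numerically. The sketch below (my own illustration; the two schedules are arbitrary examples) compares the bias factor ∏_{j=0}^{t}(1 − ε_j) for a schedule whose sum diverges against one whose sum converges.

```python
# The bias factor prod_{j<=t} (1 - eps_j) vanishes when sum eps_j diverges
# (eps_j = 1/(j+2)) but stays bounded away from zero when sum eps_j converges
# (eps_j = 1/(j+2)^2).  Both schedules are illustrative examples.
import numpy as np

def bias_factor(eps_schedule, T=100_000):
    j = np.arange(T)
    return np.prod(1.0 - eps_schedule(j))

print(bias_factor(lambda j: 1.0 / (j + 2)))       # ~1e-5: the initialization is forgotten
print(bias_factor(lambda j: 1.0 / (j + 2) ** 2))  # ~0.5: the bias never disappears
```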
For the estimation of variance, one can deduce that:

Var_ξ(x_{t+1}) = Var_ξ(x_t + ε_t(ξ_t − x_t))
             = Var_ξ((1 − ε_t)x_t + ε_t ξ_t)
             = (1 − ε_t)² Var_ξ(x_t) + ε_t² Var_ξ(ξ_t)


Since we assume the ξ_t are drawn from a normal distribution with fixed standard deviation σ, so that Var(ξ_t) = σ², we can
derive the asymptotic behavior (left as an exercise; you will see that ∑_{t=1}^{∞} ε_t² < ∞ is needed in order
to make SGD converge to the optimal solution):

Var_ξ(x_t) = O(ε_t) as t → ∞

Therefore

E_ξ[(x_t − x*)²] = Var_ξ(x_t − x*) + E_ξ[(x_t − x*)]² = O(ε_t)


For a general convex function f(x), we might not be able to get such asymptotic bounds, but we can still
prove a convergence bound; a detailed discussion can be found in Chapter 14.3 of [UML].
For a general non-convex function f(x), we can establish similar asymptotic normality for the evolution of x_t
by approximating the function g(x, ξ) locally by a quadratic form. Details of the proof can be found here.

1.2 Simulated Annealing


In Section 1.1 we used a simple example to show that SGD converges to a local optimum. One potential
drawback is that SGD might not converge to the global optimum: SGD does not have the ability
to "jump out" of saddle points. We are going to introduce one method that overcomes this difficulty, but
first we want to review a classical method called simulated annealing.
Simulated annealing (SA) is a probabilistic technique that attempts to approximate the global optimum
of the optimization problem:

max_x f(x) := E_ξ[g(x, ξ)].    (2)

In each iteration, the method chooses a temperature σ and generates samples from the following distri-
bution:

ρ_σ(x) = exp(σ(f(x) − f_0)) / Z

where Z is the partition function, Z = ∫ exp(σ(f(x) − f_0)) dx, which integrates over all possible values and
normalizes the probability density function so that it integrates to 1, no matter which σ and
f(x) you take. Here f_0 is the value of the ground or initial state you start from.
If you move from the current position x_t to a new random candidate position x_{t+1}, the algorithm accepts the
move with probability

min{1, exp(σ(f(x_{t+1}) − f(x_t)))}

(the partition function Z cancels in this ratio), so when f(x_{t+1}) ≥ f(x_t) the algorithm will always accept the move,
and when f(x_{t+1}) < f(x_t) it will accept the move with probability less than 1.
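A minimal sketch of this accept/reject rule in Python (my own illustration; the random-walk proposal, the example objective f, and the increasing schedule of σ values are assumptions not specified in the notes):

```python
# Simulated annealing sketch for maximizing f.  The Gaussian random-walk proposal,
# the example objective, and the schedule of sigma values are illustrative choices.
import numpy as np

def simulated_annealing(f, x0, sigmas, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for sigma in sigmas:
        x_new = x + step * rng.standard_normal()   # propose a random nearby candidate
        delta = f(x_new) - f(x)
        # accept with probability min{1, exp(sigma * (f(x_new) - f(x)))}
        if delta >= 0 or rng.random() < np.exp(sigma * delta):
            x = x_new
    return x

# Example: a multimodal objective whose global maximum is near x = 2.09.
f = lambda x: -0.1 * (x - 2.0) ** 2 + np.cos(3.0 * x)
print(simulated_annealing(f, x0=-5.0, sigmas=np.linspace(0.1, 50.0, 20_000)))
```

Raising σ over the iterations (i.e. lowering the temperature) makes downhill moves increasingly unlikely, so the chain settles near the global maximum.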
When σ → ∞, one can argue that the limiting distribution ρ_∞(x) = lim_{σ→∞} ρ_σ(x) is the uniform distribution
over the set {argmax f(x)} (by setting f_0 = max f(x)). That means that if a random
variable X ∼ ρ_∞(x), then Pr(X ∈ {argmax f(x)}) = 1.
Why are we introducing this? If we knew ρ_∞(x), then any sample drawn from this distribution would give
us the global optimum. However, this is somewhat self-contradictory, since we would need to know f_0 = max f(x) to
draw samples from the distribution ρ_σ(x) for large enough σ.
We claim in the following section that, by adding a noise term to the SGD scheme (as in
NSGD), the distribution of the iterates x_t will eventually converge to ρ_∞(x), and therefore x_t converges to the
global optimum with probability one.

1.3 Fokker-Planck Equation


This section mainly interprets a simplified version of the noisy SGD of [WMY]. This can be viewed as an MCMC
interpretation of why NSGD converges to the global optimum.
Consider the following NSGD iteration scheme



x_{t+1} = x_t + ε∇f(x_t) + √(2εσ^{-1}) η_t,    η_t ∼ N(0, 1)

as a sampling procedure, where x_t ∼ ρ_σ^{(t)}(x). For simplicity, we consider the step size ε to be fixed. Assume
x_0 ∼ ρ_σ^{(0)}(x) and that we have calculated ρ_σ^{(t)}(x); how can we compute ρ_σ^{(t+1)}(x) in terms of ρ_σ^{(t)}(x) from the iteration
scheme?
One way to do that is to derive in the distributional sense, meaning that for any test function h, we calculate
the expectation and using Taylor expansion to approximate it:


E_ξ(h(x_{t+1})) = E_ξ[h(x_t + ε∇f(x_t) + √(2εσ^{-1}) η_t)]
             = E_ξ[h(x_t) + (ε∇f(x_t) + √(2εσ^{-1}) η_t)∇h(x_t) + (1/2)(ε∇f(x_t) + √(2εσ^{-1}) η_t)² Δh(x_t)]    (3)
             = E_ξ[h(x_t) + ε(∇h(x_t)∇f(x_t) + σ^{-1}Δh(x_t)η_t²) + O(ε²)]

We leave out the term √(2εσ^{-1}) η_t ∇h(x_t) due to the fact that E(√(2εσ^{-1}) η_t) = 0.
Next, we rewrite equation (3) and apply integration by parts:

(1/ε)(E[h(x_{t+1})] − E[h(x_t)]) = E[∇h(x_t)∇f(x_t) + σ^{-1}Δh(x_t)η_t²]

(1/ε)(∫ ρ_σ^{(t+1)}(x)h(x)dx − ∫ ρ_σ^{(t)}(x)h(x)dx) = ∫ ρ_σ^{(t)}(x)∇h(x)∇f(x)dx + σ^{-1} ∫ ρ_σ^{(t)}(x)Δh(x)dx
                                               = −∫ ∇·(ρ_σ^{(t)}(x)∇f(x))h(x)dx + σ^{-1} ∫ Δρ_σ^{(t)}(x)h(x)dx    (4)
                                               = ∫ [−∇·(ρ_σ^{(t)}(x)∇f(x)) + σ^{-1}Δρ_σ^{(t)}(x)] h(x)dx

Therefore

(ρ_σ^{(t+1)}(x) − ρ_σ^{(t)}(x)) / ε = −∇·(ρ_σ^{(t)}(x)∇f(x)) + σ^{-1}Δρ_σ^{(t)}(x).

Letting ε → 0, we consider the following evolution partial differential equation:

∂ρ_σ^{(t)}(x)/∂t = −∇·(ρ_σ^{(t)}(x)∇f(x)) + σ^{-1}Δρ_σ^{(t)}(x)

This is the Fokker-Planck equation.
When t → ∞, we have ρ_σ^{(t)} → ρ_σ in distribution, where ρ_σ is the simulated-annealing density from Section 1.2, and you can verify further that

∂ρ_σ/∂t = 0,

so NSGD generates a sequence of distributions that converges to a specific stationary distribution, and samples drawn from this
distribution equal the global optimum with probability one, as long as you choose σ large enough.
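As a consistency check (a short calculation not spelled out in the notes), plug the simulated-annealing density ρ_σ(x) = exp(σ(f(x) − f_0))/Z from Section 1.2 into the right-hand side of the Fokker-Planck equation. Since ∇ρ_σ = σρ_σ∇f, we get

−∇·(ρ_σ∇f) + σ^{-1}Δρ_σ = −∇·(ρ_σ∇f) + σ^{-1}∇·(σρ_σ∇f) = 0,

so the stationary density of the noisy dynamics is exactly ρ_σ, which concentrates on {argmax f(x)} as σ → ∞.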
More discussion regarding general NSGD and its convergence rate can be found in [YG].

References
[UML] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algo-
rithms. Cambridge University Press, 2014.
[WMY] Welling, Max, and Yee W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. Pro-
ceedings of the 28th International Conference on Machine Learning (ICML-11). 2011.
[YG] Yin, G. Rates of convergence for a class of global stochastic optimization algorithms. SIAM Journal
on Optimization 10.1 (1999): 99-120.
