Lecture Note SGD
Consider maximizing the objective
$$f(x) = \mathbb{E}_\xi[g(x, \xi)] \tag{1}$$
where $\xi$ is a one-dimensional random variable and $x$ is a one-dimensional parameter we want to estimate. One can think of the objective function here as a parameter estimation process. It has the gradient descent form:
$$x_{t+1} = x_t + \epsilon_t \nabla f(x_t)$$
Since in practice we observe samples drawn from a distribution that is unknown to us, the expectation in Equation 1 can be approximated as:
$$f(x) = \mathbb{E}_\xi[g(x, \xi)] \approx \frac{1}{N} \sum_{i=1}^{N} g(x, \xi_i)$$
where $\{\xi_i\}$ are $N$ observations. The general gradient descent scheme needs to estimate or calculate:
$$\nabla f(x) = \nabla \mathbb{E}_\xi[g(x, \xi)] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_x g(x, \xi_i)$$
which is inefficient for large-scale optimization. Stochastic gradient descent (SGD) claims that, by replacing the gradient with
$$\nabla f(x) \approx \nabla_x g(x, \xi_t),$$
one can guarantee convergence similar to the original gradient descent scheme. Here $\xi_t$ is one sample drawn from the p.d.f. of $\xi$. (In other words, $t \in \{1, 2, \ldots, N\}$ if we only have $N$ real observations instead of knowing the p.d.f. of $\xi$.)
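To make the contrast concrete, here is a minimal Python sketch (not from the notes; `grad_g` and the sample array are illustrative) of a full-gradient step versus a single-sample SGD step:

```python
import numpy as np

def grad_g(x, xi):
    # Illustrative per-sample gradient, taking g(x, xi) = -(x - xi)^2 / 2
    # (the mean-estimation example used later in these notes).
    return xi - x

def full_gradient_step(x, samples, eps):
    # Classical scheme: average the gradient over all N observations, O(N) per step.
    return x + eps * np.mean([grad_g(x, xi) for xi in samples])

def sgd_step(x, samples, eps, rng):
    # SGD: replace the full gradient by the gradient at a single sample xi_t.
    xi_t = samples[rng.integers(len(samples))]
    return x + eps * grad_g(x, xi_t)

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=1000)
print(full_gradient_step(0.0, samples, 0.1), sgd_step(0.0, samples, 0.1, rng))
```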
We claim that, under the assumption that:
• the step sizes satisfy $\sum_{t=1}^{\infty} \epsilon_t = \infty$ and $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$,
the stochastic gradient descent will converge to a local optimum with probability one. Moreover, if:
• the iteration scheme is $x_{t+1} = x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t$ with noise $\eta_t \sim N(0, 1)$, i.i.d.,
the stochastic gradient descent will converge to the global optimum with probability one as $\sigma \to \infty$. This scheme with an added noise term is commonly called Noisy Stochastic Gradient Descent (NSGD).
As a concrete example, consider estimating the mean $x^* = \mathbb{E}[\xi]$ by taking $g(x, \xi) = -\frac{1}{2}(x - \xi)^2$, so that $\nabla_x g(x, \xi) = \xi - x$ and the SGD iteration becomes:
$$x_{t+1} = x_t + \epsilon_t (\xi_t - x_t)$$
where $\xi_t$ are i.i.d. random variables. Now if we set the step size $\epsilon_t = \frac{1}{t+1}$, then we will find:
$$x_{t+1} = \frac{t}{t+1} x_t + \frac{1}{t+1} \xi_t = \frac{1}{t+1} \sum_{j=0}^{t} \xi_j$$
which is exactly the unbiased estimate of $x$ when you have $t+1$ observations $\{\xi_j\}_{j=0}^{t}$. Moreover, as $t \to \infty$,
$$\mathbb{E}_\xi[x_t - x^*] \to 0$$
and
$$\mathbb{E}_\xi[(x_t - x^*)^2] \to \frac{1}{t} \mathrm{Var}(\xi)$$
These are the expected behaviors for the evolution of x.
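A quick numerical check (a sketch, assuming $\xi \sim N(3, 1)$, which is not from the notes) confirms both behaviors: the iterate equals the running sample mean, and its squared error decays like $\mathrm{Var}(\xi)/t$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_star, var_xi, T = 3.0, 1.0, 10_000
xi = rng.normal(loc=x_star, scale=np.sqrt(var_xi), size=T)

x = xi[0]                          # x_1 = xi_0
for t in range(1, T):
    eps_t = 1.0 / (t + 1)
    x = x + eps_t * (xi[t] - x)    # SGD step for the mean-estimation example

print(x, np.mean(xi))              # identical up to floating point: running mean
print((x - x_star) ** 2, var_xi / T)   # squared error ~ Var(xi)/t
```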
For the general step size case, we are interested in the estimation of the parameter $x$. How is the estimation changed, and why is the condition
$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty$$
essential for the guarantee of convergence?
In order to answer these questions, we now write down how $\mathbb{E}_\xi(x_t)$ evolves with step size $\epsilon_t$. Subtracting $x^*$ from both sides of the iteration and taking expectations (note $\mathbb{E}_\xi[\xi_t] = x^*$) gives
$$\mathbb{E}_\xi(x_{t+1} - x^*) = (1 - \epsilon_t)\, \mathbb{E}_\xi(x_t - x^*)$$
Hence, we have
$$\mathbb{E}_\xi(x_{t+1} - x^*) = \prod_{j=0}^{t} (1 - \epsilon_j)\, \mathbb{E}_\xi(x_0 - x^*)$$
The term $\left[\prod_{j=0}^{t} (1 - \epsilon_j)\right] \mathbb{E}_\xi(x_0 - x^*)$ is the bias term here. We wish $\mathbb{E}_\xi(x_t - x^*)$ to converge to 0 as $t$ goes to infinity, thereby reaching the maximum of $f(x)$. Then we can deduce that
$$\sum_{j=0}^{t} \log(1 - \epsilon_j) \to -\infty \quad \text{as } t \to \infty$$
which gives you something like a "first order" condition w.r.t. the step size: since $\log(1 - \epsilon_j) \approx -\epsilon_j$ for small $\epsilon_j$, this is exactly $\sum_{t=1}^{\infty} \epsilon_t = \infty$. This indicates that wherever the initialization $x_0$ is, SGD will always reach the optimum.
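A small numerical check (a sketch) shows why $\sum_t \epsilon_t = \infty$ matters: with $\epsilon_t = 1/(t+1)$ the bias factor $\prod_j (1 - \epsilon_j)$ vanishes, while with the summable choice $\epsilon_t = 1/(t+1)^2$ it stalls at a positive constant:

```python
import numpy as np

T = 100_000
for eps in (lambda t: 1.0 / (t + 1), lambda t: 1.0 / (t + 1) ** 2):
    bias = 1.0
    for t in range(1, T):      # start at t = 1 so eps(t) < 1
        bias *= 1.0 - eps(t)
    print(bias)                # ~0 for 1/(t+1); ~0.5 constant for 1/(t+1)^2
```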
For the estimation of the variance, one can deduce from the same iteration that:
$$\mathbb{E}_\xi[(x_{t+1} - x^*)^2] = (1 - \epsilon_t)^2\, \mathbb{E}_\xi[(x_t - x^*)^2] + \epsilon_t^2\, \mathrm{Var}(\xi)$$
Therefore the noise injected at each step accumulates through $\sum_t \epsilon_t^2$, and the condition $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$ is the "second order" condition that keeps this accumulated variance finite, so that $\mathbb{E}_\xi[(x_t - x^*)^2]$ can still converge to 0.
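Conversely, a constant step size violates $\sum_t \epsilon_t^2 < \infty$; a quick sketch (illustrative parameters, not from the notes) shows the squared error plateauing at a noise floor of roughly $\frac{\epsilon}{2}\mathrm{Var}(\xi)$ instead of vanishing:

```python
import numpy as np

rng = np.random.default_rng(1)
x_star, T, runs = 3.0, 5_000, 200

for eps_fn in (lambda t: 1.0 / (t + 1), lambda t: 0.1):   # decaying vs constant
    x = np.zeros(runs)                 # run 200 independent chains in parallel
    for t in range(T):
        xi_t = rng.normal(x_star, 1.0, size=runs)
        x = x + eps_fn(t) * (xi_t - x)
    print(np.mean((x - x_star) ** 2))  # ~ 1/T versus ~ eps/2 noise floor
```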
To see why adding noise helps reach the global optimum, we first recall a classical sampling method. In each iteration, the method chooses a temperature $\sigma$ and generates samples from the following distribution:
$$\rho_\sigma(x) = \frac{\exp(\sigma(f(x) - f_0))}{Z}$$
where $Z$ is the partition function $Z = \int \exp(\sigma(f(x) - f_0))\, dx$, which sums over all possible values and normalizes the probability density function so that it integrates to 1, no matter what $\sigma$ and $f(x)$ you take. $f_0$ is the ground or initial state you start from.
If you are searching from position $x_t$ to a new random position $x_{t+1}$, then this algorithm will accept your move with probability
$$\min\{1, \exp(\sigma(f(x_{t+1}) - f(x_t)))\}$$
(this is the ratio $\rho_\sigma(x_{t+1})/\rho_\sigma(x_t)$, in which the partition function $Z$ cancels),
so when $f(x_{t+1}) \geq f(x_t)$, the algorithm will always accept your move, and when $f(x_{t+1}) < f(x_t)$, it will accept the move with probability less than 1.
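A minimal sketch of this acceptance rule (assuming a Gaussian random-walk proposal and an illustrative objective, neither of which the notes specify):

```python
import numpy as np

def metropolis_step(x_t, f, sigma, rng, step=0.5):
    # Propose a new random position (a Gaussian random walk is assumed here).
    x_new = x_t + step * rng.normal()
    delta = f(x_new) - f(x_t)
    # Accept with probability min{1, exp(sigma * delta)}: moves that
    # increase f are always accepted, downhill moves only sometimes.
    if delta >= 0 or rng.random() < np.exp(sigma * delta):
        return x_new
    return x_t

rng = np.random.default_rng(1)
f = lambda x: -(x - 1.0) ** 2      # illustrative objective, maximum at x = 1
x = 0.0
for _ in range(5_000):
    x = metropolis_step(x, f, sigma=50.0, rng=rng)
print(x)                           # hovers near argmax f = 1 for large sigma
```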
When $\sigma \to \infty$, one can claim that the distribution $\rho_\infty(x) = \lim_{\sigma \to \infty} \rho_\sigma(x)$ is the uniform distribution on the set $\{\operatorname{argmax} f(x)\}$ (by setting $f_0 = \max f(x)$). That means if a random variable $X \sim \rho_\infty(x)$, then $\Pr(X \in \{\operatorname{argmax} f(x)\}) = 1$.
Why are we introducing this? If we knew $\rho_\infty(x)$, then any sample drawn from this distribution would give you the global optimum. However, this is somewhat self-contradictory, since we would need to know $f_0 = \max f(x)$ to draw samples from the distribution $\rho_\sigma(x)$ for large enough $\sigma$.
We will claim in the following section that, by adding a noise term to the SGD scheme (as done in NSGD), the distribution of the evolution of $x_t$ will eventually converge to $\rho_\infty(x)$, and thereby $x_t$ converges to the global optimum with probability one.
We now regard the NSGD iteration
$$x_{t+1} = x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t, \qquad \eta_t \sim N(0, 1)$$
as a sampling procedure, where $x_t \sim \rho_\sigma^{(t)}(x)$. For simplicity, we consider the step size $\epsilon$ to be fixed. Assume $x_0 \sim \rho_\sigma^{(0)}(x)$ and we have calculated $\rho_\sigma^{(t)}(x)$; how can we calculate $\rho_\sigma^{(t+1)}(x)$ in terms of $\rho_\sigma^{(t)}(x)$ from the iteration scheme?
One way to do that is to derive it in the distributional sense, meaning that for any test function $h$, we calculate the expectation and use a Taylor expansion to approximate it:
$$\begin{aligned}
\mathbb{E}(h(x_{t+1})) &= \mathbb{E}\left[h\left(x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right)\right] \\
&= \mathbb{E}\left[h(x_t) + \left(\epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right) \nabla h(x_t) + \frac{1}{2}\left(\epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right)^2 \Delta h(x_t)\right] \\
&= \mathbb{E}\left[h(x_t) + \epsilon\left(\nabla h(x_t) \nabla f(x_t) + \sigma^{-1} \Delta h(x_t)\, \eta_t^2\right) + O(\epsilon^2)\right]
\end{aligned} \tag{3}$$
We leave out the first-order term $\sqrt{2\sigma^{-1}\epsilon}\, \eta_t \nabla h(x_t)$ due to the fact that $\mathbb{E}(\sqrt{2\sigma^{-1}\epsilon}\, \eta_t) = 0$.
Next, we rewrite equation (3) and apply integration by parts:
$$\frac{1}{\epsilon}\, \mathbb{E}[h(x_{t+1}) - h(x_t)] = \mathbb{E}\left[\nabla h(x_t) \nabla f(x_t) + \sigma^{-1} \Delta h(x_t)\, \eta_t^2\right]$$
$$\begin{aligned}
\frac{1}{\epsilon}\left[\int \rho_\sigma^{(t+1)}(x) h(x)\, dx - \int \rho_\sigma^{(t)}(x) h(x)\, dx\right] &= \int \rho_\sigma^{(t)}(x) \nabla h(x) \nabla f(x)\, dx + \sigma^{-1} \int \rho_\sigma^{(t)}(x) \Delta h(x)\, dx \\
&= -\int \nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) h(x)\, dx + \sigma^{-1} \int \Delta \rho_\sigma^{(t)}(x)\, h(x)\, dx \\
&= \int \left[-\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x)\right] h(x)\, dx
\end{aligned} \tag{4}$$
Therefore
$$\frac{\rho_\sigma^{(t+1)}(x) - \rho_\sigma^{(t)}(x)}{\epsilon} = -\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x).$$
By setting $\epsilon \to 0$, we consider the following evolution partial differential equation:
$$\frac{\partial}{\partial t} \rho_\sigma^{(t)}(x) = -\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x)$$
This is the Fokker-Planck equation.
When $t \to \infty$, we have $\rho_\sigma^{(t)} \xrightarrow{D} \rho_\sigma$; you can verify further that
$$\frac{\partial}{\partial t} \rho_\sigma = 0$$
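Indeed, the stationary solution is exactly the $\rho_\sigma(x) = \exp(\sigma(f(x) - f_0))/Z$ introduced earlier; a short check (filling in a step the notes leave implicit) uses $\nabla \rho_\sigma = \sigma \rho_\sigma \nabla f$:
$$\sigma^{-1} \Delta \rho_\sigma = \sigma^{-1} \nabla \cdot (\sigma \rho_\sigma \nabla f) = \nabla \cdot (\rho_\sigma \nabla f),$$
so the drift and diffusion terms on the right-hand side of the Fokker-Planck equation cancel and $\frac{\partial}{\partial t} \rho_\sigma = 0$,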
so NSGD generates distributions that converge to a specific distribution, where all samples drawn from this distribution equal the global optimum with probability one, as long as you choose $\sigma$ large enough.
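As a final sketch (illustrative double-well objective and constants, not from the notes), the NSGD/Langevin iteration escapes a local maximum and settles near the global one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative objective f(x) = -(x^2 - 1)^2 + 0.3x: local maxima near
# x = -1 and x = 1; the tilt +0.3x makes x ~ 1 the global maximum.
grad_f = lambda x: -4.0 * x * (x ** 2 - 1.0) + 0.3

eps, sigma = 1e-3, 5.0
x = -1.0                           # start in the wrong (local) basin
for _ in range(200_000):
    # NSGD step: gradient ascent plus sqrt(2 eps / sigma) Gaussian noise.
    x = x + eps * grad_f(x) + np.sqrt(2.0 * eps / sigma) * rng.normal()
print(x)                           # typically near the global maximum x ~ 1
```

Note the tradeoff behind the choice $\sigma = 5$ here: a larger $\sigma$ concentrates $\rho_\sigma$ more tightly around the global maximum but makes crossing between basins exponentially slower, which is why annealing-type schedules increase $\sigma$ gradually.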
More discussions regarding general NSGD and its convergence rate can be found in [YG].
References
[UML] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[WMY] Welling, Max, and Yee W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[YG] Yin, G. Rates of Convergence for a Class of Global Stochastic Optimization Algorithms. SIAM Journal on Optimization 10.1 (1999): 99-120.