Lecture Note SGD
Consider maximizing the objective
$$f(x) = \mathbb{E}_\xi[g(x, \xi)] \tag{1}$$
where $\xi$ is a one-dimensional random variable and $x$ is a one-dimensional parameter we want to estimate. One can think of the objective function here as a parameter estimation process. It has the gradient descent form:
$$x_{t+1} = x_t + \epsilon_t \nabla f(x_t)$$
Since in practice we observe samples drawn from a distribution that is unknown to us, the expectation in Equation 1 can be approximated as:
$$f(x) = \mathbb{E}_\xi[g(x, \xi)] \approx \frac{1}{N} \sum_{i=1}^{N} g(x, \xi_i)$$
where $\{\xi_i\}$ are $N$ observations. The general gradient descent scheme needs to estimate or calculate:
$$\nabla f(x) = \nabla \mathbb{E}_\xi[g(x, \xi)] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_x g(x, \xi_i)$$
which is inefficient for large-scale optimization. Stochastic gradient descent (SGD) claims that, by replacing the gradient with
$$\nabla f(x) \approx \nabla_x g(x, \xi_t),$$
one can guarantee convergence similar to the original gradient descent scheme. Here $\xi_t$ is one sample drawn from the p.d.f. of $\xi$. (In other words, $t \in \{1, 2, \ldots, N\}$ if we only have $N$ real observations instead of knowing the p.d.f. of $\xi$.)
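To make the contrast concrete, here is a minimal Python sketch (not from the notes; `grad_g` and the sample array are illustrative) of a full-gradient step versus a single-sample SGD step:

```python
import numpy as np

def grad_g(x, xi):
    # Illustrative per-sample gradient, taking g(x, xi) = -(x - xi)^2 / 2
    # (the mean-estimation example used later in these notes).
    return xi - x

def full_gradient_step(x, samples, eps):
    # Classical scheme: average the gradient over all N observations, O(N) per step.
    return x + eps * np.mean([grad_g(x, xi) for xi in samples])

def sgd_step(x, samples, eps, rng):
    # SGD: replace the full gradient by the gradient at a single sample xi_t.
    xi_t = samples[rng.integers(len(samples))]
    return x + eps * grad_g(x, xi_t)

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=1.0, size=1000)
print(full_gradient_step(0.0, samples, 0.1), sgd_step(0.0, samples, 0.1, rng))
```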
We claim that, under the assumption that:
• the step sizes satisfy $\sum_{t=1}^{\infty} \epsilon_t = \infty$ and $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$,
the stochastic gradient descent will converge to a local optimum with probability one. Moreover, if:
• the iteration scheme is $x_{t+1} = x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t$ with noise $\eta_t \sim N(0, 1)$, i.i.d.,
the stochastic gradient descent will converge to the global optimum with probability one as $\sigma \to \infty$. This scheme with an added noise term is commonly called Noisy Stochastic Gradient Descent (NSGD).
As a concrete example, consider estimating the mean $x^* = \mathbb{E}[\xi]$ by taking $g(x, \xi) = -\frac{1}{2}(x - \xi)^2$, so that $\nabla_x g(x, \xi) = \xi - x$ and the SGD iteration becomes:
$$x_{t+1} = x_t + \epsilon_t (\xi_t - x_t)$$
where $\xi_t$ are i.i.d. random variables. Now if we set the step size $\epsilon_t = \frac{1}{t+1}$, then we will find:
$$x_{t+1} = \frac{t}{t+1} x_t + \frac{1}{t+1} \xi_t = \frac{1}{t+1} \sum_{j=0}^{t} \xi_j$$
which is exactly the unbiased estimate of $x$ when you have $t+1$ observations $\{\xi_j\}_{j=0}^{t}$. Moreover, as $t \to \infty$,
$$\mathbb{E}_\xi[x_t - x^*] \to 0$$
and
$$\mathbb{E}_\xi[(x_t - x^*)^2] \to \frac{1}{t} \mathrm{Var}(\xi)$$
These are the expected behaviors for the evolution of x.
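A quick numerical check (a sketch, assuming $\xi \sim N(3, 1)$, which is not from the notes) confirms both behaviors: the iterate equals the running sample mean, and its squared error decays like $\mathrm{Var}(\xi)/t$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_star, var_xi, T = 3.0, 1.0, 10_000
xi = rng.normal(loc=x_star, scale=np.sqrt(var_xi), size=T)

x = xi[0]                          # x_1 = xi_0
for t in range(1, T):
    eps_t = 1.0 / (t + 1)
    x = x + eps_t * (xi[t] - x)    # SGD step for the mean-estimation example

print(x, np.mean(xi))              # identical up to floating point: running mean
print((x - x_star) ** 2, var_xi / T)   # squared error ~ Var(xi)/t
```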
For the general step size case, we are interested in the estimation of the parameter $x$. How is the estimation changed, and why is the condition
$$\sum_{t=1}^{\infty} \epsilon_t = \infty, \qquad \sum_{t=1}^{\infty} \epsilon_t^2 < \infty$$
essential for the guarantee of convergence?
In order to answer these questions, we now write down how $\mathbb{E}_\xi(x_t)$ evolves with step size $\epsilon_t$. Subtracting $x^*$ from both sides of the iteration and taking expectations (note $\mathbb{E}_\xi[\xi_t] = x^*$) gives
$$\mathbb{E}_\xi(x_{t+1} - x^*) = (1 - \epsilon_t)\, \mathbb{E}_\xi(x_t - x^*)$$
Hence, we have
$$\mathbb{E}_\xi(x_{t+1} - x^*) = \prod_{j=0}^{t} (1 - \epsilon_j)\, \mathbb{E}_\xi(x_0 - x^*)$$
The term $\left[\prod_{j=0}^{t} (1 - \epsilon_j)\right] \mathbb{E}_\xi(x_0 - x^*)$ is the bias term here. We wish $\mathbb{E}_\xi(x_t - x^*)$ to converge to 0 as $t$ goes to infinity, thereby reaching the maximum of $f(x)$. Then we can deduce that
$$\sum_{j=0}^{t} \log(1 - \epsilon_j) \to -\infty \quad \text{as } t \to \infty$$
which gives you something like a "first order" condition w.r.t. the step size: since $\log(1 - \epsilon_j) \approx -\epsilon_j$ for small $\epsilon_j$, this is exactly $\sum_{t=1}^{\infty} \epsilon_t = \infty$. This indicates that wherever the initialization $x_0$ is, SGD will always reach the optimum.
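A small numerical check (a sketch) shows why $\sum_t \epsilon_t = \infty$ matters: with $\epsilon_t = 1/(t+1)$ the bias factor $\prod_j (1 - \epsilon_j)$ vanishes, while with the summable choice $\epsilon_t = 1/(t+1)^2$ it stalls at a positive constant:

```python
import numpy as np

T = 100_000
for eps in (lambda t: 1.0 / (t + 1), lambda t: 1.0 / (t + 1) ** 2):
    bias = 1.0
    for t in range(1, T):      # start at t = 1 so eps(t) < 1
        bias *= 1.0 - eps(t)
    print(bias)                # ~0 for 1/(t+1); ~0.5 constant for 1/(t+1)^2
```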
For the estimation of the variance, one can deduce from the same iteration that:
$$\mathbb{E}_\xi[(x_{t+1} - x^*)^2] = (1 - \epsilon_t)^2\, \mathbb{E}_\xi[(x_t - x^*)^2] + \epsilon_t^2\, \mathrm{Var}(\xi)$$
Therefore the noise injected at each step accumulates through $\sum_t \epsilon_t^2$, and the condition $\sum_{t=1}^{\infty} \epsilon_t^2 < \infty$ is the "second order" condition that keeps this accumulated variance finite, so that $\mathbb{E}_\xi[(x_t - x^*)^2]$ can still converge to 0.
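Conversely, a constant step size violates $\sum_t \epsilon_t^2 < \infty$; a quick sketch (illustrative parameters, not from the notes) shows the squared error plateauing at a noise floor of roughly $\frac{\epsilon}{2}\mathrm{Var}(\xi)$ instead of vanishing:

```python
import numpy as np

rng = np.random.default_rng(1)
x_star, T, runs = 3.0, 5_000, 200

for eps_fn in (lambda t: 1.0 / (t + 1), lambda t: 0.1):   # decaying vs constant
    x = np.zeros(runs)                 # run 200 independent chains in parallel
    for t in range(T):
        xi_t = rng.normal(x_star, 1.0, size=runs)
        x = x + eps_fn(t) * (xi_t - x)
    print(np.mean((x - x_star) ** 2))  # ~ 1/T versus ~ eps/2 noise floor
```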
To see why adding noise helps reach the global optimum, we first recall a classical sampling method. In each iteration, the method chooses a temperature $\sigma$ and generates samples from the following distribution:
$$\rho_\sigma(x) = \frac{\exp(\sigma(f(x) - f_0))}{Z}$$
where $Z$ is the partition function $Z = \int \exp(\sigma(f(x) - f_0))\, dx$, which sums over all possible values and normalizes the probability density function so that it integrates to 1, no matter what $\sigma$ and $f(x)$ you take. $f_0$ is the ground or initial state you start from.
If you are searching from position $x_t$ to a new random position $x_{t+1}$, then this algorithm will accept your move with probability
$$\min\{1, \exp(\sigma(f(x_{t+1}) - f(x_t)))\}$$
(this is the ratio $\rho_\sigma(x_{t+1})/\rho_\sigma(x_t)$, in which the partition function $Z$ cancels),
so when $f(x_{t+1}) \geq f(x_t)$, the algorithm will always accept your move, and when $f(x_{t+1}) < f(x_t)$, it will accept the move with probability less than 1.
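A minimal sketch of this acceptance rule (assuming a Gaussian random-walk proposal and an illustrative objective, neither of which the notes specify):

```python
import numpy as np

def metropolis_step(x_t, f, sigma, rng, step=0.5):
    # Propose a new random position (a Gaussian random walk is assumed here).
    x_new = x_t + step * rng.normal()
    delta = f(x_new) - f(x_t)
    # Accept with probability min{1, exp(sigma * delta)}: moves that
    # increase f are always accepted, downhill moves only sometimes.
    if delta >= 0 or rng.random() < np.exp(sigma * delta):
        return x_new
    return x_t

rng = np.random.default_rng(1)
f = lambda x: -(x - 1.0) ** 2      # illustrative objective, maximum at x = 1
x = 0.0
for _ in range(5_000):
    x = metropolis_step(x, f, sigma=50.0, rng=rng)
print(x)                           # hovers near argmax f = 1 for large sigma
```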
When $\sigma \to \infty$, one can claim that the distribution $\rho_\infty(x) = \lim_{\sigma \to \infty} \rho_\sigma(x)$ is the uniform distribution on the set $\{\operatorname{argmax} f(x)\}$ (by setting $f_0 = \max f(x)$). That means if a random variable $X \sim \rho_\infty(x)$, then $\Pr(X \in \{\operatorname{argmax} f(x)\}) = 1$.
Why are we introducing this? If we knew $\rho_\infty(x)$, then any sample drawn from this distribution would give you the global optimum. However, this is somewhat self-contradictory, since we would need to know $f_0 = \max f(x)$ to draw samples from the distribution $\rho_\sigma(x)$ for large enough $\sigma$.
We will claim in the following section that, by adding a noise term to the SGD scheme (as done in NSGD), the distribution of the evolution of $x_t$ will eventually converge to $\rho_\infty(x)$, and thereby $x_t$ converges to the global optimum with probability one.
We now regard the NSGD iteration
$$x_{t+1} = x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t, \qquad \eta_t \sim N(0, 1)$$
as a sampling procedure, where $x_t \sim \rho_\sigma^{(t)}(x)$. For simplicity, we consider the step size $\epsilon$ to be fixed. Assume $x_0 \sim \rho_\sigma^{(0)}(x)$ and we have calculated $\rho_\sigma^{(t)}(x)$; how can we calculate $\rho_\sigma^{(t+1)}(x)$ in terms of $\rho_\sigma^{(t)}(x)$ from the iteration scheme?
One way to do that is to derive it in the distributional sense, meaning that for any test function $h$, we calculate the expectation and use a Taylor expansion to approximate it:
$$\begin{aligned}
\mathbb{E}(h(x_{t+1})) &= \mathbb{E}\left[h\left(x_t + \epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right)\right] \\
&= \mathbb{E}\left[h(x_t) + \left(\epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right) \nabla h(x_t) + \frac{1}{2}\left(\epsilon \nabla f(x_t) + \sqrt{2\sigma^{-1}\epsilon}\, \eta_t\right)^2 \Delta h(x_t)\right] \\
&= \mathbb{E}\left[h(x_t) + \epsilon\left(\nabla h(x_t) \nabla f(x_t) + \sigma^{-1} \Delta h(x_t)\, \eta_t^2\right) + O(\epsilon^2)\right]
\end{aligned} \tag{3}$$
We leave out the first-order term $\sqrt{2\sigma^{-1}\epsilon}\, \eta_t \nabla h(x_t)$ due to the fact that $\mathbb{E}(\sqrt{2\sigma^{-1}\epsilon}\, \eta_t) = 0$.
Next, we rewrite equation (3) and apply integration by parts:
$$\frac{1}{\epsilon}\, \mathbb{E}[h(x_{t+1}) - h(x_t)] = \mathbb{E}\left[\nabla h(x_t) \nabla f(x_t) + \sigma^{-1} \Delta h(x_t)\, \eta_t^2\right]$$
$$\begin{aligned}
\frac{1}{\epsilon}\left[\int \rho_\sigma^{(t+1)}(x) h(x)\, dx - \int \rho_\sigma^{(t)}(x) h(x)\, dx\right] &= \int \rho_\sigma^{(t)}(x) \nabla h(x) \nabla f(x)\, dx + \sigma^{-1} \int \rho_\sigma^{(t)}(x) \Delta h(x)\, dx \\
&= -\int \nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) h(x)\, dx + \sigma^{-1} \int \Delta \rho_\sigma^{(t)}(x)\, h(x)\, dx \\
&= \int \left[-\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x)\right] h(x)\, dx
\end{aligned} \tag{4}$$
Therefore
$$\frac{\rho_\sigma^{(t+1)}(x) - \rho_\sigma^{(t)}(x)}{\epsilon} = -\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x).$$
By setting $\epsilon \to 0$, we consider the following evolution partial differential equation:
$$\frac{\partial}{\partial t} \rho_\sigma^{(t)}(x) = -\nabla \cdot \left(\rho_\sigma^{(t)}(x) \nabla f(x)\right) + \sigma^{-1} \Delta \rho_\sigma^{(t)}(x)$$
This is the Fokker-Planck equation.
When $t \to \infty$, we have $\rho_\sigma^{(t)} \xrightarrow{D} \rho_\sigma$; you can verify further that
$$\frac{\partial}{\partial t} \rho_\sigma = 0$$
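Indeed, the stationary solution is exactly the $\rho_\sigma(x) = \exp(\sigma(f(x) - f_0))/Z$ introduced earlier; a short check (filling in a step the notes leave implicit) uses $\nabla \rho_\sigma = \sigma \rho_\sigma \nabla f$:
$$\sigma^{-1} \Delta \rho_\sigma = \sigma^{-1} \nabla \cdot (\sigma \rho_\sigma \nabla f) = \nabla \cdot (\rho_\sigma \nabla f),$$
so the drift and diffusion terms on the right-hand side of the Fokker-Planck equation cancel and $\frac{\partial}{\partial t} \rho_\sigma = 0$,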
so NSGD generates distributions that converge to a specific distribution, where all samples drawn from this distribution equal the global optimum with probability one, as long as you choose $\sigma$ large enough.
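As a final sketch (illustrative double-well objective and constants, not from the notes), the NSGD/Langevin iteration escapes a local maximum and settles near the global one:

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative objective f(x) = -(x^2 - 1)^2 + 0.3x: local maxima near
# x = -1 and x = 1; the tilt +0.3x makes x ~ 1 the global maximum.
grad_f = lambda x: -4.0 * x * (x ** 2 - 1.0) + 0.3

eps, sigma = 1e-3, 5.0
x = -1.0                           # start in the wrong (local) basin
for _ in range(200_000):
    # NSGD step: gradient ascent plus sqrt(2 eps / sigma) Gaussian noise.
    x = x + eps * grad_f(x) + np.sqrt(2.0 * eps / sigma) * rng.normal()
print(x)                           # typically near the global maximum x ~ 1
```

Note the tradeoff behind the choice $\sigma = 5$ here: a larger $\sigma$ concentrates $\rho_\sigma$ more tightly around the global maximum but makes crossing between basins exponentially slower, which is why annealing-type schedules increase $\sigma$ gradually.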
More discussions regarding general NSGD and its convergence rate can be found in [YG].
References
[UML] Shalev-Shwartz, Shai, and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[WMY] Welling, Max, and Yee W. Teh. Bayesian Learning via Stochastic Gradient Langevin Dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011.
[YG] Yin, G. Rates of Convergence for a Class of Global Stochastic Optimization Algorithms. SIAM Journal on Optimization 10.1 (1999): 99-120.