CS236 Lecture 4
Stefano Ermon
Stanford University
We want to learn the full distribution so that later we can answer any probabilistic inference query.
In this setting we can view the learning problem as density estimation.
We want to construct Pθ as "close" as possible to Pdata (recall we assume we are given a dataset D of samples from Pdata).
How do we evaluate "closeness"?
KL-divergence is one possibility:
D(P_{\text{data}} \,\|\, P_\theta) = \mathbb{E}_{x \sim P_{\text{data}}}\!\left[ \log \frac{P_{\text{data}}(x)}{P_\theta(x)} \right] = \sum_x P_{\text{data}}(x) \log \frac{P_{\text{data}}(x)}{P_\theta(x)}
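As a quick illustration, here is a minimal numeric sketch of the sum form above for two small discrete distributions; the probability tables are made up for the example, not taken from the lecture.

```python
import numpy as np

# Minimal numeric sketch of D(P_data || P_theta) for discrete distributions.
# The probability values below are illustrative assumptions.
p_data  = np.array([0.1, 0.4, 0.4, 0.1])
p_theta = np.array([0.25, 0.25, 0.25, 0.25])

# sum_x P_data(x) * log(P_data(x) / P_theta(x))
kl = np.sum(p_data * np.log(p_data / p_theta))
print(kl)  # nonnegative; zero only when the two distributions coincide
```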
Variance of the Monte Carlo (sample-average) estimate \hat{g} = \frac{1}{T}\sum_{t=1}^{T} g(x^t):

V_P[\hat{g}] = V_P\!\left[ \frac{1}{T} \sum_{t=1}^{T} g(x^t) \right] = \frac{V_P[g(x)]}{T}
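A quick numerical check of this 1/T behaviour, with g(x) = x and x ~ Uniform(0, 1) as an assumed example (so V[g(x)] = 1/12):

```python
import numpy as np

rng = np.random.default_rng(0)
T, num_trials = 50, 10_000  # assumed sizes for the experiment

# g(x) = x with x ~ Uniform(0, 1), so V[g(x)] = 1/12
samples = rng.uniform(size=(num_trials, T))
g_hat = samples.mean(axis=1)          # one Monte Carlo estimate per trial

print(g_hat.var())   # empirical variance of the estimator
print((1 / 12) / T)  # predicted V[g(x)] / T
```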
Likelihood of the dataset D under an autoregressive model:

L(\theta, \mathcal{D}) = \prod_{j=1}^{m} P_\theta\!\left(x^{(j)}\right) = \prod_{j=1}^{m} \prod_{i=1}^{n} p_{\text{neural}}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i\right)
\ell(\theta) = \log L(\theta, \mathcal{D}) = \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i\right)
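To make the sums concrete, here is a minimal sketch of evaluating this log-likelihood for binary data; a hypothetical per-position logistic model stands in for p_neural(x_i | x_<i; θ_i), and the dataset, sizes, and parameterization are all assumptions for illustration.

```python
import torch

# Stand-in dataset D of m binary samples of length n (assumed sizes).
n, m = 10, 64
torch.manual_seed(0)
X = torch.randint(0, 2, (m, n)).float()

# theta_i: weights of a logistic conditional for position i, taking x_<i (plus a bias) as input.
thetas = [torch.zeros(i + 1, requires_grad=True) for i in range(n)]

def log_likelihood(X, thetas):
    ll = 0.0
    for i, theta in enumerate(thetas):
        # features: bias term plus the preceding variables x_<i
        feats = torch.cat([torch.ones(X.shape[0], 1), X[:, :i]], dim=1)
        p_i = torch.sigmoid(feats @ theta)  # p_neural(x_i = 1 | x_<i; theta_i)
        ll = ll + (X[:, i] * torch.log(p_i) + (1 - X[:, i]) * torch.log(1 - p_i)).sum()
    return ll

print(log_likelihood(X, thetas))  # sum over the m examples and n positions
```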
1. Initialize θ^0 at random
2. Compute ∇_θ ℓ(θ) (by backpropagation)
3. θ^{t+1} = θ^t + α_t ∇_θ ℓ(θ)
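For intuition, a minimal sketch of these three steps on a toy concave objective ℓ(θ) = -(θ - 3)^2 with an analytic gradient; the objective and step size are assumed for the example, and for a neural model the gradient would come from backpropagation instead.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal()              # 1. initialize theta^0 at random
alpha = 0.1                       # fixed step size (assumed)

for t in range(100):
    grad = -2.0 * (theta - 3.0)   # 2. compute the gradient (analytic here)
    theta = theta + alpha * grad  # 3. gradient *ascent* step

print(theta)  # converges to the maximizer theta = 3
```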
What is the gradient with respect to θ_i?

\nabla_{\theta_i} \ell(\theta) = \nabla_{\theta_i} \sum_{j=1}^{m} \sum_{i=1}^{n} \log p_{\text{neural}}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i\right) = \sum_{j=1}^{m} \nabla_{\theta_i} \log p_{\text{neural}}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i\right)

Only the i-th conditional depends on θ_i, so all other terms of the inner sum drop out.
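Continuing the log-likelihood sketch above, a quick numerical check (with autodiff standing in for backpropagation) that the gradient with respect to θ_i indeed only involves the i-th conditional; the chosen position i is arbitrary.

```python
import torch

i = 3  # an arbitrary position (assumed)

# Gradient of the full log-likelihood with respect to theta_i ...
ll = log_likelihood(X, thetas)
(g_full,) = torch.autograd.grad(ll, thetas[i])

# ... equals the gradient of the i-th conditional's terms alone.
feats = torch.cat([torch.ones(X.shape[0], 1), X[:, :i]], dim=1)
p_i = torch.sigmoid(feats @ thetas[i])
ll_i = (X[:, i] * torch.log(p_i) + (1 - X[:, i]) * torch.log(1 - p_i)).sum()
(g_i,) = torch.autograd.grad(ll_i, thetas[i])

print(torch.allclose(g_full, g_i))  # True
```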
The full gradient sums over all examples and positions:

\nabla_\theta \ell(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} \nabla_\theta \log p_{\text{neural}}\!\left(x_i^{(j)} \mid x_{<i}^{(j)}; \theta_i\right)
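Putting the pieces together, a sketch of one gradient-ascent step on the log-likelihood from the earlier snippet, with autodiff playing the role of backpropagation; the step size is an assumed value.

```python
alpha = 0.01  # step size alpha_t (assumed)

ll = log_likelihood(X, thetas)       # l(theta) summed over all m examples
ll.backward()                        # backprop: accumulates grad of l w.r.t. every theta_i

with torch.no_grad():
    for theta in thetas:
        theta += alpha * theta.grad  # ascent: theta^{t+1} = theta^t + alpha * grad
        theta.grad.zero_()           # clear gradients for the next iteration
```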
[Figure: polynomial fits to data, one model that overfits, a restricted hypothesis space of low-degree polynomials, and one with the right tradeoff]
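One way to see the tradeoff suggested by the figure is to fit polynomials of increasing degree to a few noisy samples of a smooth function; the data, degrees, and noise level below are all made up for illustration.

```python
import numpy as np

# Fit polynomials of increasing degree to noisy samples of a smooth target
# and compare error on a dense test grid (all values are assumptions).
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(-1, 1, 15))
y_train = np.sin(3 * x_train) + 0.1 * rng.normal(size=15)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (1, 4, 12):  # restricted hypothesis space -> reasonable -> prone to overfitting
    coeffs = np.polyfit(x_train, y_train, degree)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, test_mse)
```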