Chapter 15 Bayesian Inference for Gaussian Process (Lecture on 02/23/2021)
When the data $(y, x)$ exhibit a highly nonlinear relationship, one way to build a regression model is to assume $y = f(x) + \epsilon$, where $f$ is an unknown function. We use a GP as the prior for $f$.
Assume $f \sim GP(\mu, C_\nu(\cdot, \cdot; \theta))$, which means that

$$\left(f(x_1), \cdots, f(x_n)\right)^T \sim MVN(\mu \mathbf{1}, H) \tag{15.1}$$

where $H = (H_{ij})_{i,j=1}^n$ and $H_{ij} = C_\nu(x_i, x_j; \theta)$. A commonly used covariance kernel is the Matern covariance kernel, defined as
$$C_\nu(d; \phi, \sigma^2) = \sigma^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\sqrt{2\nu}\, \phi d\right)^\nu K_\nu\left(\sqrt{2\nu}\, \phi d\right) \tag{15.2}$$

where $d = \|x_i - x_j\|$ is the distance between inputs, $\phi$ is the decay parameter, $\nu$ is the smoothness parameter, and $K_\nu$ is the modified Bessel function of the second kind.
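As a quick check that is not spelled out in the notes: taking $\nu = 1/2$ in (15.2) and using the identity $K_{1/2}(x) = \sqrt{\pi/(2x)}\, e^{-x}$ together with $\Gamma(1/2) = \sqrt{\pi}$ recovers the exponential kernel used below:

$$C_{1/2}(d; \phi, \sigma^2) = \sigma^2 \frac{2^{1/2}}{\sqrt{\pi}} (\phi d)^{1/2} \sqrt{\frac{\pi}{2\phi d}}\, e^{-\phi d} = \sigma^2 e^{-\phi d}.$$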
The model is therefore

$$y = f(x) + \epsilon, \qquad \epsilon \overset{i.i.d.}{\sim} N(0, \tau^2), \quad f \sim GP(\mu, C_\nu) \tag{15.3}$$

and the GP prior on $f$ implies

$$\left(f(x_1), \cdots, f(x_n)\right)^T \sim MVN(\mu \mathbf{1}, \Sigma) \tag{15.4}$$
where $\Sigma = \sigma^2 H(\phi)$. For simplicity, assume we work with the exponential covariance function

$$C_\nu(x_i, x_j; \phi, \sigma^2) = \sigma^2 \exp(-\phi \|x_i - x_j\|) \tag{15.5}$$

Then $H(\phi)$ is an $n \times n$ matrix whose $(i, j)$th entry is $\exp(-\phi \|x_i - x_j\|)$.
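To make (15.5) concrete, here is a minimal sketch of building $H(\phi)$ in Python; the function name `exp_cov` and the example inputs are illustrative, not from the lecture.

```python
import numpy as np

def exp_cov(x, phi):
    """H(phi) with entries exp(-phi * ||x_i - x_j||); x has shape (n, p)."""
    diffs = x[:, None, :] - x[None, :, :]   # (n, n, p) pairwise differences
    d = np.linalg.norm(diffs, axis=-1)      # Euclidean distances ||x_i - x_j||
    return np.exp(-phi * d)                 # (n, n) matrix, ones on the diagonal

# Example: five one-dimensional inputs on [0, 1]
x = np.linspace(0.0, 1.0, 5).reshape(-1, 1)
H = exp_cov(x, phi=2.0)
```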
Let $\mu \sim P_\mu(\mu) = N(\mu | a_\mu, b_\mu)$, $\sigma^2 \sim P_{\sigma^2}(\sigma^2) = IG(\sigma^2 | a_\sigma, b_\sigma)$, $\tau^2 \sim P_{\tau^2}(\tau^2) = IG(\tau^2 | a_\tau, b_\tau)$, and $\phi \sim P_\phi(\phi) = Unif(a_\phi, b_\phi)$ be the prior distributions for the model parameters. Ultimately the model can be written hierarchically as:
$$\begin{aligned}
y &\sim N\left(\left(f(x_1), \cdots, f(x_n)\right)^T, \tau^2 I\right) \\
\left(f(x_1), \cdots, f(x_n)\right)^T &\sim MVN(\mu \mathbf{1}, \Sigma) \\
\mu &\sim N(\mu | a_\mu, b_\mu), \quad \sigma^2 \sim IG(\sigma^2 | a_\sigma, b_\sigma) \\
\tau^2 &\sim IG(\tau^2 | a_\tau, b_\tau), \quad \phi \sim Unif(a_\phi, b_\phi)
\end{aligned} \tag{15.6}$$
For notational simplicity, denote $\theta = \left(f(x_1), \cdots, f(x_n)\right)^T$; our job is to estimate the posterior distribution of $\mu, \tau^2, \sigma^2, \phi$ and $\theta$.

There are issues with directly building a sampler for $\mu, \tau^2, \sigma^2, \phi$ and $\theta$ together: it is hard for such a chain to converge. Therefore, in practice, we usually marginalize out $\theta$ first.
Since $y \sim N(\theta, \tau^2 I)$ and $\theta \sim N(\mu \mathbf{1}, \Sigma)$, marginally $y \sim N(\mu \mathbf{1}, \Sigma + \tau^2 I)$, and mixing will be better after this marginalization. The marginal posterior is
$$\begin{aligned}
p(\mu, \tau^2, \sigma^2, \phi | y) \propto\ & N(y | \mu \mathbf{1}, \tau^2 I + \sigma^2 H(\phi))\, IG(\sigma^2 | a_\sigma, b_\sigma)\, IG(\tau^2 | a_\tau, b_\tau) \\
& \times Unif(\phi | a_\phi, b_\phi)\, N(\mu | a_\mu, b_\mu)
\end{aligned} \tag{15.7}$$
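Every update below needs to evaluate (15.7), typically on the log scale. Here is a minimal sketch, assuming $b_\mu$ is the prior variance of $\mu$ (consistent with the completion of squares in (15.8) below) and that $IG(a, b)$ uses shape $a$ and scale $b$; all function and variable names (`log_marginal_posterior`, `H_of_phi`, `hyp`) are illustrative.

```python
import numpy as np
from scipy import stats

def log_marginal_posterior(mu, tau2, sigma2, phi, y, H_of_phi, hyp):
    """Unnormalized log of (15.7); hyp is a dict of prior hyperparameters."""
    # Zero density outside the prior support.
    if not (hyp["a_phi"] < phi < hyp["b_phi"]) or tau2 <= 0 or sigma2 <= 0:
        return -np.inf
    n = y.shape[0]
    cov = tau2 * np.eye(n) + sigma2 * H_of_phi(phi)
    loglik = stats.multivariate_normal.logpdf(y, mean=mu * np.ones(n), cov=cov)
    # The uniform prior on phi only adds a constant on its support.
    logprior = (stats.norm.logpdf(mu, hyp["a_mu"], np.sqrt(hyp["b_mu"]))
                + stats.invgamma.logpdf(sigma2, hyp["a_sigma"], scale=hyp["b_sigma"])
                + stats.invgamma.logpdf(tau2, hyp["a_tau"], scale=hyp["b_tau"]))
    return loglik + logprior
```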
Now we consider the full conditional of each parameter. If a full conditional has a closed form, the parameter can be sampled with a Gibbs step; otherwise it is sampled using a Metropolis-Hastings step.
Firstly,

$$\begin{aligned}
p(\mu | \text{rest}, y) &\propto N(y | \mu \mathbf{1}, \tau^2 I + \sigma^2 H(\phi)) \times N(\mu | a_\mu, b_\mu) \\
&\propto \exp\left\{-\frac{(y - \mu \mathbf{1})^T (\tau^2 I + \sigma^2 H(\phi))^{-1} (y - \mu \mathbf{1})}{2}\right\} \times \exp\left\{-\frac{(\mu - a_\mu)^2}{2 b_\mu}\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[\mu^2 \left(\mathbf{1}^T (\sigma^2 H(\phi) + \tau^2 I)^{-1} \mathbf{1} + \frac{1}{b_\mu}\right) - 2\mu \left(\mathbf{1}^T (\sigma^2 H(\phi) + \tau^2 I)^{-1} y + \frac{a_\mu}{b_\mu}\right)\right]\right\}
\end{aligned} \tag{15.8}$$

Completing the square shows that the full conditional of $\mu$ is $N(a_{\mu|\cdot}, b_{\mu|\cdot})$, where

$$b_{\mu|\cdot} = \frac{1}{\mathbf{1}^T (\sigma^2 H(\phi) + \tau^2 I)^{-1} \mathbf{1} + \frac{1}{b_\mu}}, \qquad a_{\mu|\cdot} = b_{\mu|\cdot} \left[\mathbf{1}^T (\sigma^2 H(\phi) + \tau^2 I)^{-1} y + \frac{a_\mu}{b_\mu}\right] \tag{15.9}$$
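The closed form (15.9) translates directly into a Gibbs draw. This sketch assumes the inverse $(\sigma^2 H(\phi) + \tau^2 I)^{-1}$ has already been computed for the current parameter values; variable names are illustrative.

```python
import numpy as np

def draw_mu(y, cov_inv, a_mu, b_mu, rng):
    """Gibbs draw of mu from (15.9); cov_inv = (sigma^2 H(phi) + tau^2 I)^{-1}."""
    one = np.ones(y.shape[0])
    b_post = 1.0 / (one @ cov_inv @ one + 1.0 / b_mu)    # b_{mu|.} in (15.9)
    a_post = b_post * (one @ cov_inv @ y + a_mu / b_mu)  # a_{mu|.} in (15.9)
    return rng.normal(a_post, np.sqrt(b_post))
```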
In addition,

$$p(\phi, \tau^2, \sigma^2 | y, \mu) \propto N(y | \mu \mathbf{1}, \tau^2 I + \sigma^2 H(\phi))\, IG(\sigma^2 | a_\sigma, b_\sigma)\, IG(\tau^2 | a_\tau, b_\tau)\, Unif(\phi | a_\phi, b_\phi) \tag{15.10}$$

$\phi, \tau^2, \sigma^2$ do not have closed-form full conditionals, so they need to be updated using Metropolis-Hastings.
Should we update $\phi, \tau^2, \sigma^2$ all together, or one at a time? The answer is case by case; updating $\sigma^2, \tau^2$ together and updating $\phi$ separately using Metropolis-Hastings tends to give better mixing, as in the sketch below.
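A generic random-walk Metropolis-Hastings step, sketched below, can implement either blocking choice; the helper name and step-size choice are illustrative, not values from the lecture.

```python
import numpy as np

def mh_update(current, logpost, step, rng):
    """One random-walk MH step with a symmetric Gaussian proposal."""
    proposal = current + step * rng.standard_normal(np.shape(current))
    log_ratio = logpost(proposal) - logpost(current)
    if np.log(rng.uniform()) < log_ratio:
        return proposal   # accept
    return current        # reject: keep the current value
```

With this helper, one iteration updates $(\sigma^2, \tau^2)$ as a two-dimensional block (e.g., proposing on the log scale so the values stay positive) and then updates $\phi$ alone, each time plugging the other parameters' current values into the log posterior of (15.7).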
Once posterior samples of $\mu, \phi, \tau^2, \sigma^2$ are obtained, the next step is to get the posterior samples of $\theta$. Note that

$$\begin{aligned}
p(\theta | \text{rest}, y) &\propto N(y | \theta, \tau^2 I)\, N(\theta | \mu \mathbf{1}, \sigma^2 H(\phi)) \\
&\propto \exp\left\{-\frac{(y - \theta)^T (y - \theta)}{2 \tau^2}\right\} \times \exp\left\{-\frac{(\theta - \mu \mathbf{1})^T (\sigma^2 H(\phi))^{-1} (\theta - \mu \mathbf{1})}{2}\right\} \\
&\propto \exp\left\{-\frac{1}{2}\left[\theta^T \left(\frac{I}{\tau^2} + \frac{H(\phi)^{-1}}{\sigma^2}\right) \theta - 2 \theta^T \left(\frac{y}{\tau^2} + \frac{H(\phi)^{-1} \mathbf{1} \mu}{\sigma^2}\right)\right]\right\}
\end{aligned} \tag{15.11}$$
so $\theta | \text{rest}, y \sim MVN(\mu_{\theta|\cdot}, \Sigma_{\theta|\cdot})$, where

$$\Sigma_{\theta|\cdot}(\phi, \mu, \sigma^2, \tau^2) = \left(\frac{I}{\tau^2} + \frac{H(\phi)^{-1}}{\sigma^2}\right)^{-1}, \qquad \mu_{\theta|\cdot}(\phi, \mu, \sigma^2, \tau^2) = \Sigma_{\theta|\cdot} \left(\frac{y}{\tau^2} + \frac{H(\phi)^{-1} \mathbf{1} \mu}{\sigma^2}\right) \tag{15.12}$$
For the $l$th post burn-in sample of the parameters, denoted $(\phi_l, \mu_l, \sigma_l^2, \tau_l^2)$, we can obtain a sample of $\theta$ by drawing from $N\left(\mu_{\theta|\cdot}(\phi_l, \mu_l, \sigma_l^2, \tau_l^2), \Sigma_{\theta|\cdot}(\phi_l, \mu_l, \sigma_l^2, \tau_l^2)\right)$.