
Chapter 15 Bayesian Inference for Gaussian Process (Lecture on 02/23/2021)

When the data (y, x) exhibit a highly nonlinear relationship, one way to build a regression model is to assume y = f(x) + ϵ, where f is an unknown function. We use a GP as the prior for f. Assuming f ∼ GP(μ, C_ν(⋅, ⋅)) means that

(f(x_1), …, f(x_n))^T ∼ N(μ1, H)    (15.1)

where H = (H_{ij})_{i,j=1}^n and H_{ij} = C_ν(x_i, x_j). A commonly used covariance kernel is the Matern covariance kernel, defined as

C_ν(d; ϕ, σ²) = σ² (2^{1−ν}/Γ(ν)) (√(2ν) ϕ d)^ν K_ν(√(2ν) ϕ d)    (15.2)

where d = ||x_i − x_j|| is the Euclidean distance between x_i and x_j, and K_ν is the modified Bessel function of the second kind.
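
As a quick illustration, here is a minimal sketch of evaluating the Matern kernel (15.2) in Python; the function name matern_cov and its argument names are ours (not from the lecture), and scipy's kv is used for K_ν.

    import numpy as np
    from scipy.special import gamma, kv   # kv is the modified Bessel function K_nu

    def matern_cov(d, phi, sigma2, nu):
        """Matern covariance (15.2) evaluated at distances d (array-like)."""
        d = np.asarray(d, dtype=float)
        u = np.sqrt(2.0 * nu) * phi * d
        val = sigma2 * (2.0 ** (1.0 - nu) / gamma(nu)) * u ** nu * kv(nu, u)
        # at d = 0 the formula is numerically 0 * infinity; the limit is sigma^2
        return np.where(d == 0.0, sigma2, val)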

Now, the regression model can be written as

y = f(x) + ϵ,    ϵ ∼ N(0, τ²) i.i.d.,    f ∼ GP(μ, C_ν)    (15.3)

As Bayesian statisticians, we want to find the posterior distribution of f, that is, f | y_1, …, y_n.

Suppose we have data (y_1, x_1), …, (y_n, x_n). Then

(f(x_1), …, f(x_n))^T ∼ MVN(μ1, Σ)    (15.4)

where Σ = σ² H(ϕ). For simplicity, assume we work with the exponential covariance function

C_ν(x_i, x_j; ϕ, σ²) = σ² exp(−ϕ ||x_i − x_j||)    (15.5)

so that H(ϕ) is an n × n matrix whose (i, j)th entry is exp(−ϕ ||x_i − x_j||).
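
For this exponential covariance, H(ϕ) is easy to build from the pairwise distances. A minimal sketch (the function name exp_corr_matrix is ours):

    import numpy as np
    from scipy.spatial.distance import cdist

    def exp_corr_matrix(X, phi):
        """H(phi) with (i, j) entry exp(-phi * ||x_i - x_j||), for X an (n, p) array."""
        D = cdist(X, X)            # Euclidean distance matrix ||x_i - x_j||
        return np.exp(-phi * D)
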
Let μ ∼ P_μ(μ) = N(μ | a_μ, b_μ), σ² ∼ P_{σ²}(σ²) = IG(σ² | a_σ, b_σ), τ² ∼ P_{τ²}(τ²) = IG(τ² | a_τ, b_τ), and ϕ ∼ P_ϕ(ϕ) = Unif(a_ϕ, b_ϕ) be the prior distributions for the model parameters. Ultimately it turns out that the model can be written hierarchically as:

y ∼ N((f(x_1), …, f(x_n))^T, τ² I)
(f(x_1), …, f(x_n))^T ∼ MVN(μ1, Σ)
μ ∼ N(μ | a_μ, b_μ),    σ² ∼ IG(σ² | a_σ, b_σ)
τ² ∼ IG(τ² | a_τ, b_τ),    ϕ ∼ Unif(a_ϕ, b_ϕ)    (15.6)

Just for notational simplicity, denote θ = (f(x_1), …, f(x_n))^T. Our job is to estimate the posterior distribution of μ, τ², σ², ϕ and θ.

1. There are some issues with directly building a sampler for μ, τ², σ², ϕ and θ together: it is hard for the chain to converge. Therefore, in practice, we usually marginalize out θ first.

2. σ² and ϕ cannot be jointly estimated consistently. One solution is to estimate σ² without any constraints, while putting a uniform prior on a small range for ϕ. Typically the range is set as [3/max(d), 3/min(d)], where d = {||x_i − x_j|| : ||x_i − x_j|| ≠ 0} is the set of nonzero pairwise distances (see the sketch below).
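
A minimal sketch of computing this default range from the design points; the function name phi_prior_range is ours.

    import numpy as np
    from scipy.spatial.distance import pdist

    def phi_prior_range(X):
        """Return (a_phi, b_phi) = (3 / max(d), 3 / min(d)) from the design X."""
        d = pdist(X)               # all pairwise Euclidean distances
        d = d[d > 0]               # drop zero distances (duplicated points)
        return 3.0 / d.max(), 3.0 / d.min()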

Since y | θ ∼ N(θ, τ² I) and θ ∼ N(μ1, Σ), marginally y ∼ N(μ1, Σ + τ² I). Mixing will be improved when θ is integrated out. The full posterior for μ, τ², σ², ϕ is therefore

p(μ, τ², σ², ϕ | y) ∝ N(y | μ1, τ² I + σ² H(ϕ)) IG(σ² | a_σ, b_σ) IG(τ² | a_τ, b_τ)
                      × Unif(ϕ | a_ϕ, b_ϕ) N(μ | a_μ, b_μ)    (15.7)
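
For the Metropolis-Hastings steps below it is enough to evaluate (15.7) on the log scale. A minimal sketch, assuming the exponential covariance of (15.5) and treating b_μ as the prior variance of μ; the function name log_marginal_post is ours.

    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.stats import invgamma, multivariate_normal, norm

    def log_marginal_post(mu, tau2, sigma2, phi, y, X,
                          a_mu, b_mu, a_sig, b_sig, a_tau, b_tau, a_phi, b_phi):
        """Log of (15.7), with theta integrated out."""
        if tau2 <= 0 or sigma2 <= 0 or not (a_phi < phi < b_phi):
            return -np.inf                       # outside the prior support
        n = len(y)
        cov = tau2 * np.eye(n) + sigma2 * np.exp(-phi * cdist(X, X))
        lp = multivariate_normal.logpdf(y, mean=mu * np.ones(n), cov=cov)
        lp += invgamma.logpdf(sigma2, a_sig, scale=b_sig)
        lp += invgamma.logpdf(tau2, a_tau, scale=b_tau)
        lp += norm.logpdf(mu, loc=a_mu, scale=np.sqrt(b_mu))
        return lp                                # the uniform prior on phi only adds a constant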

Now we consider the full conditionals for each parameter. If a full conditional has a closed form, the parameter can be sampled with a Gibbs step; otherwise it is sampled with a Metropolis-Hastings step. Firstly,

p(μ | rest, y) ∝ N(y | μ1, τ² I + σ² H(ϕ)) × N(μ | a_μ, b_μ)
             ∝ exp{−(y − μ1)^T (τ² I + σ² H(ϕ))^{−1} (y − μ1) / 2} × exp{−(μ − a_μ)² / (2 b_μ)}
             ∝ exp{−(1/2) [μ² (1^T (σ² H(ϕ) + τ² I)^{−1} 1 + 1/b_μ) − 2μ (1^T (σ² H(ϕ) + τ² I)^{−1} y + a_μ/b_μ)]}    (15.8)

Therefore, by completing the square, we have μ | rest, y ∼ N(a_{μ|⋅}, b_{μ|⋅}), where

b_{μ|⋅} = 1 / (1^T (σ² H(ϕ) + τ² I)^{−1} 1 + 1/b_μ)
a_{μ|⋅} = b_{μ|⋅} [1^T (σ² H(ϕ) + τ² I)^{−1} y + a_μ/b_μ]    (15.9)

Therefore, we can sample μ from its full conditional distribution.
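
A minimal sketch of this Gibbs step, again under the exponential covariance; function and variable names are ours.

    import numpy as np
    from scipy.spatial.distance import cdist

    def sample_mu(y, X, sigma2, tau2, phi, a_mu, b_mu, rng):
        """Draw mu from N(a_{mu|.}, b_{mu|.}) in (15.9)."""
        n = len(y)
        V = sigma2 * np.exp(-phi * cdist(X, X)) + tau2 * np.eye(n)
        ones = np.ones(n)
        V_inv_1 = np.linalg.solve(V, ones)       # (sigma^2 H(phi) + tau^2 I)^{-1} 1
        V_inv_y = np.linalg.solve(V, y)          # (sigma^2 H(phi) + tau^2 I)^{-1} y
        b_post = 1.0 / (ones @ V_inv_1 + 1.0 / b_mu)
        a_post = b_post * (ones @ V_inv_y + a_mu / b_mu)
        return rng.normal(a_post, np.sqrt(b_post))

Solving the two linear systems instead of forming the matrix inverse keeps the step numerically stable.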

In addition,

p(ϕ, τ², σ² | y, μ) ∝ N(y | μ1, τ² I + σ² H(ϕ)) IG(σ² | a_σ, b_σ) IG(τ² | a_τ, b_τ) Unif(ϕ | a_ϕ, b_ϕ)    (15.10)

ϕ, τ² and σ² do not have closed-form full conditionals. They need to be updated using Metropolis-Hastings.

Should we update ϕ, τ², σ² all together, or should we update them one at a time? The answer is case by case. Updating σ² and τ² together while updating ϕ separately, each with an M-H step, tends to give better mixing.
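
A minimal sketch of one random-walk Metropolis-Hastings step for ϕ; it only assumes a function returning the log posterior (15.7)/(15.10) as a function of ϕ with the other parameters held fixed (for instance a wrapper around log_marginal_post above). The names mh_step_phi and step are ours.

    import numpy as np

    def mh_step_phi(phi, step, log_post_phi, rng):
        """One random-walk M-H update for phi; log_post_phi(phi) is the log of
        (15.7) with mu, tau^2, sigma^2 fixed at their current values."""
        phi_prop = phi + step * rng.standard_normal()
        log_ratio = log_post_phi(phi_prop) - log_post_phi(phi)
        if np.log(rng.uniform()) < log_ratio:
            return phi_prop                      # accept the proposed value
        return phi                               # reject: keep the current value

In practice the proposal scale step is tuned so the acceptance rate is moderate, and (σ², τ²) can be updated with an analogous joint random-walk step.
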
Once posterior samples of μ, ϕ, τ², σ² are obtained, the next step is to get the posterior samples of θ. Note that

p(θ | rest, y) ∝ N(y | θ, τ² I) N(θ | μ1, σ² H(ϕ))
            ∝ exp{−(y − θ)^T (y − θ) / (2τ²)} × exp{−(θ − μ1)^T (σ² H(ϕ))^{−1} (θ − μ1) / 2}
            ∝ exp{−(1/2) [θ^T (I/τ² + H(ϕ)^{−1}/σ²) θ − 2 θ^T (y/τ² + H(ϕ)^{−1} 1μ/σ²)]}    (15.11)

After completing the square, we get θ | rest, y ∼ N(μ_{θ|⋅}(ϕ, μ, σ², τ²), Σ_{θ|⋅}(ϕ, μ, σ², τ²))

where

Σ_{θ|⋅}(ϕ, μ, σ², τ²) = (I/τ² + H(ϕ)^{−1}/σ²)^{−1}
μ_{θ|⋅}(ϕ, μ, σ², τ²) = Σ_{θ|⋅} (y/τ² + H(ϕ)^{−1} 1μ/σ²)    (15.12)

For the lth post burn-in sample of the parameters, denoted (ϕ_l, μ_l, σ²_l, τ²_l), we can obtain a sample of θ by drawing from N(μ_{θ|⋅}(ϕ_l, μ_l, σ²_l, τ²_l), Σ_{θ|⋅}(ϕ_l, μ_l, σ²_l, τ²_l)).
