Gaussian Processes in Machine Learning Tutorial
The Prediction Problem
[Figure: atmospheric CO2 concentration (ppm, roughly 320–420) against year, 1960–2020, with a “?” marking the values to be predicted.]
The Prediction Problem
Ubiquitous questions:
• Model fitting
  • how do I fit the parameters?
  • what about overfitting?
• Model selection
  • how do I find out which model to use?
  • how sure can I be?
• Interpretation
  • what is the accuracy of the predictions?
  • can I trust the predictions, even if
    • . . . I am not sure about the parameters?
    • . . . I am not sure of the model structure?
Gaussian processes solve some of the above, and provide a practical framework
to address the remaining issues.
Outline
The Gaussian Distribution
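For reference, the density of a D-dimensional Gaussian with mean µ and covariance Σ is

    p(x|\mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).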
Conditionals and Marginals of a Gaussian
Both the conditionals and the marginals of a joint Gaussian are again Gaussian.
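Concretely, for a joint Gaussian

    p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right),

the marginal and conditional are

    p(x) = \mathcal{N}(a, A), \qquad p(x|y) = \mathcal{N}\big(a + B C^{-1}(y - b),\; A - B C^{-1} B^\top\big).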
What is a Gaussian Process?
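A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is fully specified by its mean function m(x) and covariance function k(x, x'), written f(x) ∼ GP(m(x), k(x, x')).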
The marginalization property
Recall:

    p(x) = \int p(x, y)\, dy.

For Gaussians:

    p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right) \;\Longrightarrow\; p(x) = \mathcal{N}(a, A).
Random functions from a Gaussian Process
To get an indication of what this distribution over functions looks like, focus on a
finite subset of function values f = (f(x_1), f(x_2), \ldots, f(x_n))^\top, for which

    f \sim \mathcal{N}(0, \Sigma), \qquad \Sigma_{ij} = k(x_i, x_j).
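As a concrete illustration, here is a minimal NumPy sketch of this construction: pick a finite set of inputs, build Σ from a covariance function (a squared exponential with unit parameters is assumed here, purely for illustration), and draw f ∼ N(0, Σ).

```python
import numpy as np

def sq_exp_cov(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance: variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(-5, 5, 200)                        # finite subset of inputs
Sigma = sq_exp_cov(x, x) + 1e-10 * np.eye(len(x))  # small jitter for numerical stability

rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=3)  # f ~ N(0, Sigma)
```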
Some values of the random function
[Figure: some sampled values of the random function, output f(x) against input x ∈ [−5, 5].]
Sequential Generation
    p(f_1, \ldots, f_n \,|\, x_1, \ldots, x_n) = \prod_{i=1}^{n} p(f_i \,|\, f_{i-1}, \ldots, f_1, x_i, \ldots, x_1).
Function drawn at random from a Gaussian Process with Gaussian covariance
[Figure: the sampled function shown as a surface over a two-dimensional input in [−6, 6] × [−6, 6].]
Maximum likelihood, parametric model
Supervised parametric learning:
• data: x, y
• model: y = fw (x) + ε
Gaussian likelihood:
    p(y|x, w, M_i) \propto \prod_{c} \exp\left(-\tfrac{1}{2}\,(y_c - f_w(x_c))^2 / \sigma^2_{\text{noise}}\right).

Maximum likelihood fits the parameters by maximizing this likelihood, w_{ML} = \arg\max_w p(y|x, w, M_i), and predictions plug in the estimate w_{ML}.
Bayesian Inference, parametric model
Supervised parametric learning:
• data: x, y
• model: y = fw (x) + ε
Gaussian likelihood:
    p(y|x, w, M_i) \propto \prod_{c} \exp\left(-\tfrac{1}{2}\,(y_c - f_w(x_c))^2 / \sigma^2_{\text{noise}}\right).

Parameter prior:

    p(w|M_i)

Posterior parameter distribution by Bayes' rule:

    p(w|x, y, M_i) = \frac{p(w|M_i)\, p(y|x, w, M_i)}{p(y|x, M_i)}
Bayesian Inference, parametric model, cont.
Making predictions:
    p(y_*|x_*, x, y, M_i) = \int p(y_*|w, x_*, M_i)\, p(w|x, y, M_i)\, dw

Marginal likelihood:

    p(y|x, M_i) = \int p(w|M_i)\, p(y|x, w, M_i)\, dw.

Model probability:

    p(M_i|x, y) = \frac{p(M_i)\, p(y|x, M_i)}{p(y|x)}
Non-parametric Gaussian process models
In our non-parametric model, the “parameter” is the function itself!

Gaussian likelihood:

    y|x, f(x), M_i \sim \mathcal{N}(f, \sigma^2_{\text{noise}} I)

(Zero mean) Gaussian process prior:

    f(x)|M_i \sim \mathcal{GP}(m(x) \equiv 0,\; k(x, x'))
Prior and Posterior
[Figure: functions drawn from the GP prior (left) and from the posterior after conditioning on a few observations (right); output f(x) against input x ∈ [−5, 5].]
Predictive distribution:
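For the Gaussian likelihood this is available in closed form; with K = K(X, X), the standard result (the “main result” recalled a little later) is

    f_*|x_*, X, y \sim \mathcal{N}\big(k(x_*, X)[K + \sigma_n^2 I]^{-1} y,\;\; k(x_*, x_*) - k(x_*, X)[K + \sigma_n^2 I]^{-1} k(X, x_*)\big).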
Graphical model for Gaussian Process
Thanks to the marginalization property, all the (infinitely many) function values we are not interested in can simply be ignored. This explains why we can do inference using a finite amount of computation!
Some interpretation
Recall our main result:

    \mu(x_*) = k(x_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, y \;=\; \sum_{c=1}^{n} \beta_c\, y^{(c)} \;=\; \sum_{c=1}^{n} \alpha_c\, k(x_*, x^{(c)}),

    V(x_*) = k(x_*, x_*) - k(x_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, k(X, x_*).

The mean is a linear combination of the observed targets y, and equivalently a linear combination of covariance functions centred on the training inputs. In the variance, the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
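As a concrete illustration, here is a minimal NumPy sketch of these predictive equations, using a Cholesky factorization of K + σ_n² I; the squared exponential covariance, data, and noise level are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def k(a, b, ell=1.0, v=1.0):
    """Squared exponential covariance matrix between input vectors a and b."""
    return v**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

X = np.array([-4.0, -2.5, 0.0, 1.0, 3.0])   # illustrative training inputs
y = np.sin(X)                                # illustrative training targets
Xs = np.linspace(-5, 5, 100)                 # test inputs
sigma_n = 0.1                                # noise standard deviation

K = k(X, X) + sigma_n**2 * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # [K + sigma_n^2 I]^{-1} y
Ks = k(Xs, X)                                          # k(x_*, X)
mu = Ks @ alpha                                        # predictive mean
w = np.linalg.solve(L, Ks.T)                           # L^{-1} k(X, x_*)
var = np.diag(k(Xs, Xs)) - np.sum(w**2, axis=0)        # predictive variance
```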
The marginal likelihood
Log marginal likelihood:
    \log p(y|x, M_i) = -\tfrac{1}{2}\, y^\top K^{-1} y - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log(2\pi)

is the combination of a data fit term and a complexity penalty. Occam’s Razor is
automatic.

The gradient with respect to the hyperparameters θ_j is

    \frac{\partial \log p(y|x, \theta, M_i)}{\partial \theta_j} = \tfrac{1}{2}\, y^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \tfrac{1}{2}\,\mathrm{trace}\!\left(K^{-1} \frac{\partial K}{\partial \theta_j}\right)
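A minimal NumPy sketch of these two expressions, for a squared exponential covariance with a single length-scale hyperparameter (the parameterization, the fixed signal variance, and the noise level are illustrative assumptions):

```python
import numpy as np

def neg_log_marglik_and_grad(log_ell, X, y, v=1.0, sigma_n=0.1):
    """Return -log p(y|X) and its derivative w.r.t. log(lengthscale) for a SE covariance."""
    ell = np.exp(log_ell)
    d2 = (X[:, None] - X[None, :])**2
    K_se = v**2 * np.exp(-0.5 * d2 / ell**2)
    K = K_se + sigma_n**2 * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    n = len(X)
    nlml = 0.5 * y @ Kinv @ y + 0.5 * np.linalg.slogdet(K)[1] + 0.5 * n * np.log(2 * np.pi)
    dK = K_se * (d2 / ell**2)          # dK/d(log ell); the noise term does not depend on ell
    alpha = Kinv @ y
    grad = -(0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Kinv @ dK))
    return nlml, grad
```

Minimizing this negative log marginal likelihood with a gradient-based optimizer (e.g. scipy.optimize.minimize) is how the length scale in the example on the next slide would typically be fitted.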
Example: Fitting the length scale parameter
Parameterized covariance function:

    k(x, x') = v^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma_n^2\,\delta_{xx'}.

[Figure: the observations together with the mean posterior predictive function for a too short, a good, and a too long length scale, over x ∈ [−10, 10].]

The mean posterior predictive function is plotted for three different length scales (the
green curve corresponds to optimizing the marginal likelihood). Notice that an
almost exact fit to the data can be achieved by reducing the length scale – but the
marginal likelihood does not favour this!
Why, in principle, does Bayesian Inference work?
Occam’s Razor

[Figure: the evidence P(Y|M_i) plotted across all possible data sets Y for a model that is too simple, one that is “just right”, and one that is too complex.]
An illustrative analogous example
Imagine the simple task of fitting the variance, σ2 , of a zero-mean Gaussian to a
set of n scalar observations.
The log likelihood is

    \log p(y|\mu, \sigma^2) = -\tfrac{1}{2}\sum_i (y_i - \mu)^2/\sigma^2 - \tfrac{n}{2}\log(\sigma^2) - \tfrac{n}{2}\log(2\pi)
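For concreteness, setting the derivative with respect to σ² to zero gives

    \sigma^2_{ML} = \frac{1}{n}\sum_i (y_i - \mu)^2,

the point at which the data-fit term (which prefers larger σ²) and the −(n/2) log σ² term (which prefers smaller σ²) balance.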
From random functions to covariance functions
From random functions to covariance functions II
Consider the class of functions (sums of squared exponentials):

    f(x) = \lim_{n\to\infty} \frac{1}{n} \sum_i \gamma_i \exp(-(x - i/n)^2), \quad \text{where } \gamma_i \sim \mathcal{N}(0, 1)\ \forall i

         = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2)\, du, \quad \text{where } \gamma(u) \sim \mathcal{N}(0, 1)\ \forall u.

The mean is zero and the covariance is

    E[f(x) f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big)\, du
                  = \int \exp\left(-2\Big(u - \frac{x + x'}{2}\Big)^2 + \frac{(x + x')^2}{2} - x^2 - x'^2\right) du
                  \;\propto\; \exp\left(-\frac{(x - x')^2}{2}\right).
Thus, the squared exponential covariance function is equivalent to regression
using infinitely many Gaussian shaped basis functions placed everywhere, not just
at your training points!
Using finitely many basis functions may be dangerous!
[Figure: predictions from a model with finitely many basis functions, over x ∈ [−10, 10].]
Model Selection in Practice; Hyperparameters
There are two types of task: finding the form and the parameters of the covariance function.
Typically, our prior is too weak to quantify aspects of the covariance function.
We use a hierarchical model with hyperparameters. E.g., in ARD:
    k(x, x') = v_0^2 \exp\left(-\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2 v_d^2}\right), \qquad \text{hyperparameters } \theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).

[Figure: functions drawn using ARD covariances with three different hyperparameter settings, plotted over the two-dimensional input (x1, x2) ∈ [−2, 2]².]
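A minimal NumPy sketch of this ARD covariance (the function name and the example hyperparameter values are illustrative):

```python
import numpy as np

def ard_se_cov(X1, X2, v0=1.0, lengthscales=(1.0, 1.0)):
    """ARD squared exponential: k(x, x') = v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2))."""
    ell = np.asarray(lengthscales)
    d2 = ((X1[:, None, :] - X2[None, :, :]) / ell) ** 2   # per-dimension scaled squared distances
    return v0**2 * np.exp(-0.5 * d2.sum(axis=-1))

# A long length scale in the second dimension makes the function nearly constant along x2.
X = np.random.default_rng(1).uniform(-2, 2, size=(5, 2))
K = ard_se_cov(X, X, v0=1.0, lengthscales=(0.5, 10.0))
```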
Rational quadratic covariance function
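The rational quadratic covariance function, with shape parameter α and length scale ℓ (the parameterization used in the figure on the next slide), is

    k_{RQ}(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}.

It can be viewed as a scale mixture of squared exponentials with different length scales, and for α → ∞ it recovers the squared exponential covariance.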
Rational quadratic covariance function II
[Figure: rational quadratic covariance as a function of input distance for α = 1/2, α = 2 and α → ∞ (left), and corresponding random functions f(x) over x ∈ [−5, 5] (right).]
Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:
    k(x, x') = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{\ell}\, |x - x'|\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{\ell}\, |x - x'|\right),

where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the
characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the
hyperparameter ν can control the degree of smoothness.

Special cases include ν = 1/2, giving the exponential covariance exp(−|x − x'|/ℓ) (the Ornstein–Uhlenbeck process), and ν → ∞, which recovers the squared exponential covariance.
Matérn covariance functions II
Univariate Matérn covariance function with unit characteristic length scale and
unit variance:
[Figure: Matérn covariance as a function of input distance for several values of ν (including ν = 2 and ν → ∞), and corresponding random functions f(x) over x ∈ [−5, 5].]
Periodic, smooth functions
To create a distribution over periodic functions of x, we can first map the inputs
to u = (sin(x), cos(x))^\top, and then measure distances in the u space. Combined
with the SE covariance function with characteristic length scale ℓ, we get:

    k_{\text{periodic}}(x, x') = \exp\big(-2\sin^2(\pi(x - x'))/\ell^2\big)
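To see the connection, note that for u = (sin x, cos x)^\top,

    |u - u'|^2 = (\sin x - \sin x')^2 + (\cos x - \cos x')^2 = 2 - 2\cos(x - x') = 4\sin^2\!\big(\tfrac{x - x'}{2}\big),

so the SE covariance exp(−|u − u'|²/(2ℓ²)) in u-space equals exp(−2 sin²((x − x')/2)/ℓ²); the π in the formula above corresponds to inputs with period one (for instance, x measured in years for the CO2 data).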
[Figures: random functions drawn with the periodic covariance function; and the CO2 concentration data (ppm, 1960–2020).]
Covariance Function
The covariance function consists of several terms, parameterized by a total of 11
hyperparameters.
Mean Seasonal Component
[Figure: contour plot of the mean seasonal component as a function of month (J–D) and year (1960–2020), with contour levels between roughly −3.6 and 3.1 ppm.]
Binary Gaussian Process Classification
[Figure: a one-dimensional binary classification example, shown over the input x.]

The posterior over the latent function values is

    p(f|D, \theta) = \frac{p(y|f)\, p(f|X, \theta)}{p(D|\theta)} = \frac{\mathcal{N}(f|0, K)}{p(D|\theta)} \prod_{i=1}^{m} \Phi(y_i f_i),

which is non-Gaussian.

Approximating this posterior by a Gaussian, q(f|D, \theta) = \mathcal{N}(f|m, A), gives approximate predictive moments

    \mu_* = k_*^\top K^{-1} m,
    \qquad
    \sigma_*^2 = k(x_*, x_*) - k_*^\top \big(K^{-1} - K^{-1} A K^{-1}\big)\, k_*.
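Under such a Gaussian approximation, the predictive class probability follows from the standard Gaussian–probit integral:

    p(y_* = 1 | x_*, D, \theta) \simeq \int \Phi(f_*)\, \mathcal{N}(f_*|\mu_*, \sigma_*^2)\, df_* = \Phi\!\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right).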
Laplace’s method and Expectation Propagation
Laplace’s method: find the Maximum A Posteriori (MAP) latent values f_MAP, and use a
local (Gaussian) expansion around this point, as suggested by Williams and Barber [10].

Expectation Propagation: approximate each non-Gaussian likelihood term by a local Gaussian, refined iteratively by moment matching, Minka [6]; see Kuss and Rasmussen [3] for an assessment of both approximations for binary GP classification.
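A minimal NumPy sketch of the Newton iteration used to find f_MAP, with a logistic likelihood for simplicity (the slides use the probit Φ); the plain matrix inversion is for clarity rather than the numerically stable formulation:

```python
import numpy as np

def laplace_mode(K, y, n_iter=20):
    """Newton iterations for the MAP latent values in GP classification.

    K: n x n covariance matrix; y: labels in {-1, +1}. Logistic likelihood assumed.
    Returns the mode f_MAP and the negative log-likelihood Hessian diagonal W."""
    t = (y + 1) / 2.0                       # labels recoded as 0/1
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))       # sigmoid(f)
        grad = t - pi                       # d log p(y|f) / df
        W = pi * (1.0 - pi)                 # -d^2 log p(y|f) / df^2 (diagonal)
        # Newton step: f <- (K^{-1} + W)^{-1} (W f + grad)
        f = np.linalg.solve(np.linalg.inv(K) + np.diag(W), W * f + grad)
    return f, W

# The resulting Gaussian approximation is q(f|D) = N(f_MAP, (K^{-1} + diag(W))^{-1}),
# whose mean m and covariance A plug into the predictive equations above.
```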
Gaussian process latent variable models
Find the best latent inputs by maximizing the marginal likelihood under the
constraint that all visible variables must share the same latent values.
Computationally, this isn’t too expensive, as all dimensions are modeled using the
same covariance matrix K.
Sparse Approximations
Recall the graphical model for a Gaussian process. Inference is expensive because
the latent variables are fully connected.
[Figure: graphical model with training inputs x_i, latent function values f_i and outputs y_i, together with test points x_*, f_*, y_*; the latent variables are fully connected.]

Exact inference: O(n³). Sparse approximations: solve a smaller, sparse approximation of the original problem.
Inducing Variables
[Figure: the graphical model augmented with inducing variables u_1, u_2, . . . at inducing inputs s_1, s_2, . . . .]

The u = (u_1, u_2, \ldots)^\top are called inducing variables. The inducing variables have
associated inducing inputs, s, but no associated output values.
The Central Approximations
In a unifying treatment, Quiñonero-Candela and Rasmussen [2] assume that training and test
sets are conditionally independent given u.

Assume: p(f, f_*) \simeq q(f, f_*), where

    q(f, f_*) = \int q(f_*|u)\, q(f|u)\, p(u)\, du.
Training and test conditionals
The exact conditionals are

    p(f|u) = \mathcal{N}\big(K_{f,u} K_{u,u}^{-1} u,\; K_{f,f} - Q_{f,f}\big), \qquad p(f_*|u) = \mathcal{N}\big(K_{*,u} K_{u,u}^{-1} u,\; K_{*,*} - Q_{*,*}\big),

writing Q_{a,b} \equiv K_{a,u} K_{u,u}^{-1} K_{u,b} as in [2]. These equations are easily recognized as the usual predictive equations for GPs; the different sparse approximations correspond to different additional simplifications of these conditionals.
Example: Subset of Regressors
    q_{\text{SOR}}(f, f_*) = \mathcal{N}\left(0, \begin{bmatrix} Q_{f,f} & Q_{f,*} \\ Q_{*,f} & Q_{*,*} \end{bmatrix}\right),
Example: Sparse parametric Gaussian processes
Snelson and Ghahramani [8] introduced the idea of sparse GP inference based on
a pseudo data set, integrating out the targets, and optimizing the inputs.
The Bayesian Committee Machine [9] uses a block-diagonal (instead of diagonal) approximation
to the training conditional, and takes the inducing variables to be the test cases.
Sparse approximations
The inducing inputs (or expansion points, or support vectors) may be a subset of
the training data, or completely free.
Conclusions
Complex non-linear inference problems can be solved by manipulating plain old
Gaussian distributions.

GPs are a simple and intuitive means of specifying prior information and explaining data;
they are equivalent to several other models (RVMs, splines) and closely related to SVMs.
Outlook:
A few references
[1] Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian Process Classifiers. IEEE
Transactions on Neural Networks, 11(6):1458–1464.
[2] Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse
approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.
[3] Kuss, M. and Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian
process classification. Journal of Machine Learning Research, 6:1679–1704.
[4] Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian
process latent variable models. Journal of Machine Learning Research, 6:1783–1816.
[5] MacKay, D. J. C. (1999). Comparison of Approximate Methods for Handling Hyperparameters.
Neural Computation, 11(5):1035–1068.
[6] Minka, T. P. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis,
Massachusetts Institute of Technology.
[7] Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error
Bounds and Sparse Approximations. PhD thesis, School of Informatics, University of Edinburgh.
http://www.cs.berkeley.edu/~mseeger.
[8] Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In
Advances in Neural Information Processing Systems 18. MIT Press.
[9] Tresp, V. (2000). A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741.
[10] Williams, C. K. I. and Barber, D. (1998). Bayesian Classification with Gaussian Processes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.