Gaussian Processes in Machine Learning Tutorial
The Prediction Problem
[Figure: atmospheric CO2 concentration (ppm, roughly 320–420) against year, 1960–2020, with a “?” marking the values to be predicted.]
The Prediction Problem
Ubiquitous questions:
• Model fitting
  • how do I fit the parameters?
  • what about overfitting?
• Model selection
  • how do I find out which model to use?
  • how sure can I be?
• Interpretation
  • what is the accuracy of the predictions?
  • can I trust the predictions, even if
    • . . . I am not sure about the parameters?
    • . . . I am not sure of the model structure?
Gaussian processes solve some of the above, and provide a practical framework
to address the remaining issues.
Outline
The Gaussian Distribution
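For reference, the density of a D-dimensional Gaussian with mean µ and covariance Σ is

    p(x|\mu, \Sigma) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\left(-\tfrac{1}{2}(x - \mu)^\top \Sigma^{-1} (x - \mu)\right).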
Conditionals and Marginals of a Gaussian
Both the conditionals and the marginals of a joint Gaussian are again Gaussian.
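Concretely, for a joint Gaussian

    p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right),

the marginal and conditional are

    p(x) = \mathcal{N}(a, A), \qquad p(x|y) = \mathcal{N}\big(a + B C^{-1}(y - b),\; A - B C^{-1} B^\top\big).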
What is a Gaussian Process?
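A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is fully specified by its mean function m(x) and covariance function k(x, x'), written f(x) ∼ GP(m(x), k(x, x')).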
The marginalization property
Recall:

    p(x) = \int p(x, y)\, dy.

For Gaussians:

    p(x, y) = \mathcal{N}\left(\begin{bmatrix} a \\ b \end{bmatrix}, \begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right) \;\Longrightarrow\; p(x) = \mathcal{N}(a, A).
Random functions from a Gaussian Process
To get an indication of what this distribution over functions looks like, focus on a
finite subset of function values f = (f(x_1), f(x_2), \ldots, f(x_n))^\top, for which

    f \sim \mathcal{N}(0, \Sigma), \qquad \Sigma_{ij} = k(x_i, x_j).
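As a concrete illustration, here is a minimal NumPy sketch of this construction: pick a finite set of inputs, build Σ from a covariance function (a squared exponential with unit parameters is assumed here, purely for illustration), and draw f ∼ N(0, Σ).

```python
import numpy as np

def sq_exp_cov(x1, x2, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance: variance * exp(-(x - x')^2 / (2 * lengthscale^2))."""
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

x = np.linspace(-5, 5, 200)                        # finite subset of inputs
Sigma = sq_exp_cov(x, x) + 1e-10 * np.eye(len(x))  # small jitter for numerical stability

rng = np.random.default_rng(0)
f_samples = rng.multivariate_normal(np.zeros(len(x)), Sigma, size=3)  # f ~ N(0, Sigma)
```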
Some values of the random function
[Figure: some sampled values of the random function, output f(x) against input x ∈ [−5, 5].]
Sequential Generation
    p(f_1, \ldots, f_n \,|\, x_1, \ldots, x_n) = \prod_{i=1}^{n} p(f_i \,|\, f_{i-1}, \ldots, f_1, x_i, \ldots, x_1).
Function drawn at random from a Gaussian Process with Gaussian covariance
[Figure: the sampled function shown as a surface over a two-dimensional input in [−6, 6] × [−6, 6].]
Maximum likelihood, parametric model
Supervised parametric learning:
• data: x, y
• model: y = fw (x) + ε
Gaussian likelihood:
    p(y|x, w, M_i) \propto \prod_{c} \exp\left(-\tfrac{1}{2}\,(y_c - f_w(x_c))^2 / \sigma^2_{\text{noise}}\right).

Maximum likelihood fits the parameters by maximizing this likelihood, w_{ML} = \arg\max_w p(y|x, w, M_i), and predictions plug in the estimate w_{ML}.
Bayesian Inference, parametric model
Supervised parametric learning:
• data: x, y
• model: y = fw (x) + ε
Gaussian likelihood:
    p(y|x, w, M_i) \propto \prod_{c} \exp\left(-\tfrac{1}{2}\,(y_c - f_w(x_c))^2 / \sigma^2_{\text{noise}}\right).

Parameter prior:

    p(w|M_i)

Posterior parameter distribution by Bayes' rule:

    p(w|x, y, M_i) = \frac{p(w|M_i)\, p(y|x, w, M_i)}{p(y|x, M_i)}
Bayesian Inference, parametric model, cont.
Making predictions:
    p(y_*|x_*, x, y, M_i) = \int p(y_*|w, x_*, M_i)\, p(w|x, y, M_i)\, dw

Marginal likelihood:

    p(y|x, M_i) = \int p(w|M_i)\, p(y|x, w, M_i)\, dw.

Model probability:

    p(M_i|x, y) = \frac{p(M_i)\, p(y|x, M_i)}{p(y|x)}
Non-parametric Gaussian process models
In our non-parametric model, the “parameter” is the function itself!

Gaussian likelihood:

    y|x, f(x), M_i \sim \mathcal{N}(f, \sigma^2_{\text{noise}} I)

(Zero mean) Gaussian process prior:

    f(x)|M_i \sim \mathcal{GP}(m(x) \equiv 0,\; k(x, x'))
Prior and Posterior
[Figure: functions drawn from the GP prior (left) and from the posterior after conditioning on a few observations (right); output f(x) against input x ∈ [−5, 5].]
Predictive distribution:
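For the Gaussian likelihood this is available in closed form; with K = K(X, X), the standard result (the “main result” recalled a little later) is

    f_*|x_*, X, y \sim \mathcal{N}\big(k(x_*, X)[K + \sigma_n^2 I]^{-1} y,\;\; k(x_*, x_*) - k(x_*, X)[K + \sigma_n^2 I]^{-1} k(X, x_*)\big).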
Graphical model for Gaussian Process
Thanks to the marginalization property, all the (infinitely many) function values we are not interested in can simply be ignored. This explains why we can do inference using a finite amount of computation!
Some interpretation
Recall our main result:

    \mu(x_*) = k(x_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, y \;=\; \sum_{c=1}^{n} \beta_c\, y^{(c)} \;=\; \sum_{c=1}^{n} \alpha_c\, k(x_*, x^{(c)}),

    V(x_*) = k(x_*, x_*) - k(x_*, X)\,[K(X, X) + \sigma_n^2 I]^{-1}\, k(X, x_*).

The mean is a linear combination of the observed targets y, and equivalently a linear combination of covariance functions centred on the training inputs. In the variance, the first term is the prior variance, from which we subtract a (positive) term telling how much the data X has explained. Note that the variance is independent of the observed outputs y.
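As a concrete illustration, here is a minimal NumPy sketch of these predictive equations, using a Cholesky factorization of K + σ_n² I; the squared exponential covariance, data, and noise level are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def k(a, b, ell=1.0, v=1.0):
    """Squared exponential covariance matrix between input vectors a and b."""
    return v**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

X = np.array([-4.0, -2.5, 0.0, 1.0, 3.0])   # illustrative training inputs
y = np.sin(X)                                # illustrative training targets
Xs = np.linspace(-5, 5, 100)                 # test inputs
sigma_n = 0.1                                # noise standard deviation

K = k(X, X) + sigma_n**2 * np.eye(len(X))
L = np.linalg.cholesky(K)
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # [K + sigma_n^2 I]^{-1} y
Ks = k(Xs, X)                                          # k(x_*, X)
mu = Ks @ alpha                                        # predictive mean
w = np.linalg.solve(L, Ks.T)                           # L^{-1} k(X, x_*)
var = np.diag(k(Xs, Xs)) - np.sum(w**2, axis=0)        # predictive variance
```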
The marginal likelihood
Log marginal likelihood:
    \log p(y|x, M_i) = -\tfrac{1}{2}\, y^\top K^{-1} y - \tfrac{1}{2}\log|K| - \tfrac{n}{2}\log(2\pi)

is the combination of a data fit term and a complexity penalty. Occam’s Razor is
automatic.

The gradient with respect to the hyperparameters θ_j is

    \frac{\partial \log p(y|x, \theta, M_i)}{\partial \theta_j} = \tfrac{1}{2}\, y^\top K^{-1} \frac{\partial K}{\partial \theta_j} K^{-1} y - \tfrac{1}{2}\,\mathrm{trace}\!\left(K^{-1} \frac{\partial K}{\partial \theta_j}\right)
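A minimal NumPy sketch of these two expressions, for a squared exponential covariance with a single length-scale hyperparameter (the parameterization, the fixed signal variance, and the noise level are illustrative assumptions):

```python
import numpy as np

def neg_log_marglik_and_grad(log_ell, X, y, v=1.0, sigma_n=0.1):
    """Return -log p(y|X) and its derivative w.r.t. log(lengthscale) for a SE covariance."""
    ell = np.exp(log_ell)
    d2 = (X[:, None] - X[None, :])**2
    K_se = v**2 * np.exp(-0.5 * d2 / ell**2)
    K = K_se + sigma_n**2 * np.eye(len(X))
    Kinv = np.linalg.inv(K)
    n = len(X)
    nlml = 0.5 * y @ Kinv @ y + 0.5 * np.linalg.slogdet(K)[1] + 0.5 * n * np.log(2 * np.pi)
    dK = K_se * (d2 / ell**2)          # dK/d(log ell); the noise term does not depend on ell
    alpha = Kinv @ y
    grad = -(0.5 * alpha @ dK @ alpha - 0.5 * np.trace(Kinv @ dK))
    return nlml, grad
```

Minimizing this negative log marginal likelihood with a gradient-based optimizer (e.g. scipy.optimize.minimize) is how the length scale in the example on the next slide would typically be fitted.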
Example: Fitting the length scale parameter
Parameterized covariance function:

    k(x, x') = v^2 \exp\left(-\frac{(x - x')^2}{2\ell^2}\right) + \sigma_n^2\,\delta_{xx'}.

[Figure: the observations together with the mean posterior predictive function for a too short, a good, and a too long length scale, over x ∈ [−10, 10].]

The mean posterior predictive function is plotted for three different length scales (the
green curve corresponds to optimizing the marginal likelihood). Notice that an
almost exact fit to the data can be achieved by reducing the length scale – but the
marginal likelihood does not favour this!
Why, in principle, does Bayesian Inference work?
Occam’s Razor

[Figure: the evidence P(Y|M_i) plotted across all possible data sets Y for a model that is too simple, one that is “just right”, and one that is too complex.]
An illustrative analogous example
Imagine the simple task of fitting the variance, σ2 , of a zero-mean Gaussian to a
set of n scalar observations.
The log likelihood is

    \log p(y|\mu, \sigma^2) = -\tfrac{1}{2}\sum_i (y_i - \mu)^2/\sigma^2 - \tfrac{n}{2}\log(\sigma^2) - \tfrac{n}{2}\log(2\pi)
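For concreteness, setting the derivative with respect to σ² to zero gives

    \sigma^2_{ML} = \frac{1}{n}\sum_i (y_i - \mu)^2,

the point at which the data-fit term (which prefers larger σ²) and the −(n/2) log σ² term (which prefers smaller σ²) balance.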
From random functions to covariance functions
From random functions to covariance functions II
Consider the class of functions (sums of squared exponentials):

    f(x) = \lim_{n\to\infty} \frac{1}{n} \sum_i \gamma_i \exp(-(x - i/n)^2), \quad \text{where } \gamma_i \sim \mathcal{N}(0, 1)\ \forall i

         = \int_{-\infty}^{\infty} \gamma(u) \exp(-(x - u)^2)\, du, \quad \text{where } \gamma(u) \sim \mathcal{N}(0, 1)\ \forall u.

The mean is zero and the covariance is

    E[f(x) f(x')] = \int \exp\big(-(x - u)^2 - (x' - u)^2\big)\, du
                  = \int \exp\left(-2\Big(u - \frac{x + x'}{2}\Big)^2 + \frac{(x + x')^2}{2} - x^2 - x'^2\right) du
                  \;\propto\; \exp\left(-\frac{(x - x')^2}{2}\right).
Thus, the squared exponential covariance function is equivalent to regression
using infinitely many Gaussian shaped basis functions placed everywhere, not just
at your training points!
Using finitely many basis functions may be dangerous!
[Figure: predictions from a model with finitely many basis functions, over x ∈ [−10, 10].]
Model Selection in Practice; Hyperparameters
There are two types of task: finding the form and the parameters of the covariance function.
Typically, our prior is too weak to quantify aspects of the covariance function.
We use a hierarchical model with hyperparameters. E.g., in ARD:
    k(x, x') = v_0^2 \exp\left(-\sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{2 v_d^2}\right), \qquad \text{hyperparameters } \theta = (v_0, v_1, \ldots, v_D, \sigma_n^2).

[Figure: functions drawn using ARD covariances with three different hyperparameter settings, plotted over the two-dimensional input (x1, x2) ∈ [−2, 2]².]
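A minimal NumPy sketch of this ARD covariance (the function name and the example hyperparameter values are illustrative):

```python
import numpy as np

def ard_se_cov(X1, X2, v0=1.0, lengthscales=(1.0, 1.0)):
    """ARD squared exponential: k(x, x') = v0^2 exp(-sum_d (x_d - x'_d)^2 / (2 v_d^2))."""
    ell = np.asarray(lengthscales)
    d2 = ((X1[:, None, :] - X2[None, :, :]) / ell) ** 2   # per-dimension scaled squared distances
    return v0**2 * np.exp(-0.5 * d2.sum(axis=-1))

# A long length scale in the second dimension makes the function nearly constant along x2.
X = np.random.default_rng(1).uniform(-2, 2, size=(5, 2))
K = ard_se_cov(X, X, v0=1.0, lengthscales=(0.5, 10.0))
```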
Rational quadratic covariance function
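The rational quadratic covariance function, with shape parameter α and length scale ℓ (the parameterization used in the figure on the next slide), is

    k_{RQ}(x, x') = \left(1 + \frac{(x - x')^2}{2\alpha\ell^2}\right)^{-\alpha}.

It can be viewed as a scale mixture of squared exponentials with different length scales, and for α → ∞ it recovers the squared exponential covariance.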
Rational quadratic covariance function II
[Figure: rational quadratic covariance as a function of input distance for α = 1/2, α = 2 and α → ∞ (left), and corresponding random functions f(x) over x ∈ [−5, 5] (right).]
Matérn covariance functions
Stationary covariance functions can be based on the Matérn form:
    k(x, x') = \frac{1}{\Gamma(\nu)\, 2^{\nu - 1}} \left(\frac{\sqrt{2\nu}}{\ell}\, |x - x'|\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}}{\ell}\, |x - x'|\right),

where K_ν is the modified Bessel function of the second kind of order ν, and ℓ is the
characteristic length scale.

Sample functions from Matérn forms are ⌈ν⌉ − 1 times differentiable. Thus, the
hyperparameter ν can control the degree of smoothness.

Special cases include ν = 1/2, giving the exponential covariance exp(−|x − x'|/ℓ) (the Ornstein–Uhlenbeck process), and ν → ∞, which recovers the squared exponential covariance.
Matérn covariance functions II
Univariate Matérn covariance function with unit characteristic length scale and
unit variance:
[Figure: Matérn covariance as a function of input distance for several values of ν (including ν = 2 and ν → ∞), and corresponding random functions f(x) over x ∈ [−5, 5].]
Periodic, smooth functions
To create a distribution over periodic functions of x, we can first map the inputs
to u = (sin(x), cos(x))^\top, and then measure distances in the u space. Combined
with the SE covariance function with characteristic length scale ℓ, we get:

    k_{\text{periodic}}(x, x') = \exp\big(-2\sin^2(\pi(x - x'))/\ell^2\big)
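To see the connection, note that for u = (sin x, cos x)^\top,

    |u - u'|^2 = (\sin x - \sin x')^2 + (\cos x - \cos x')^2 = 2 - 2\cos(x - x') = 4\sin^2\!\big(\tfrac{x - x'}{2}\big),

so the SE covariance exp(−|u − u'|²/(2ℓ²)) in u-space equals exp(−2 sin²((x − x')/2)/ℓ²); the π in the formula above corresponds to inputs with period one (for instance, x measured in years for the CO2 data).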
[Figures: random functions drawn with the periodic covariance function; and the CO2 concentration data (ppm, 1960–2020).]
Covariance Function
The covariance function consists of several terms, parameterized by a total of 11
hyperparameters.
Mean Seasonal Component
[Figure: contour plot of the mean seasonal component as a function of month (J–D) and year (1960–2020), with contour levels between roughly −3.6 and 3.1 ppm.]
Binary Gaussian Process Classification
[Figure: a one-dimensional binary classification example, shown over the input x.]

The posterior over the latent function values is

    p(f|D, \theta) = \frac{p(y|f)\, p(f|X, \theta)}{p(D|\theta)} = \frac{\mathcal{N}(f|0, K)}{p(D|\theta)} \prod_{i=1}^{m} \Phi(y_i f_i),

which is non-Gaussian.

Approximating this posterior by a Gaussian, q(f|D, \theta) = \mathcal{N}(f|m, A), gives approximate predictive moments

    \mu_* = k_*^\top K^{-1} m,
    \qquad
    \sigma_*^2 = k(x_*, x_*) - k_*^\top \big(K^{-1} - K^{-1} A K^{-1}\big)\, k_*.
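Under such a Gaussian approximation, the predictive class probability follows from the standard Gaussian–probit integral:

    p(y_* = 1 | x_*, D, \theta) \simeq \int \Phi(f_*)\, \mathcal{N}(f_*|\mu_*, \sigma_*^2)\, df_* = \Phi\!\left(\frac{\mu_*}{\sqrt{1 + \sigma_*^2}}\right).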
Laplace’s method and Expectation Propagation
Laplace’s method: find the Maximum A Posteriori (MAP) latent values f_MAP, and use a
local (Gaussian) expansion around this point, as suggested by Williams and Barber [10].

Expectation Propagation: approximate each non-Gaussian likelihood term by a local Gaussian, refined iteratively by moment matching, Minka [6]; see Kuss and Rasmussen [3] for an assessment of both approximations for binary GP classification.
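A minimal NumPy sketch of the Newton iteration used to find f_MAP, with a logistic likelihood for simplicity (the slides use the probit Φ); the plain matrix inversion is for clarity rather than the numerically stable formulation:

```python
import numpy as np

def laplace_mode(K, y, n_iter=20):
    """Newton iterations for the MAP latent values in GP classification.

    K: n x n covariance matrix; y: labels in {-1, +1}. Logistic likelihood assumed.
    Returns the mode f_MAP and the negative log-likelihood Hessian diagonal W."""
    t = (y + 1) / 2.0                       # labels recoded as 0/1
    f = np.zeros(len(y))
    for _ in range(n_iter):
        pi = 1.0 / (1.0 + np.exp(-f))       # sigmoid(f)
        grad = t - pi                       # d log p(y|f) / df
        W = pi * (1.0 - pi)                 # -d^2 log p(y|f) / df^2 (diagonal)
        # Newton step: f <- (K^{-1} + W)^{-1} (W f + grad)
        f = np.linalg.solve(np.linalg.inv(K) + np.diag(W), W * f + grad)
    return f, W

# The resulting Gaussian approximation is q(f|D) = N(f_MAP, (K^{-1} + diag(W))^{-1}),
# whose mean m and covariance A plug into the predictive equations above.
```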
Gaussian process latent variable models
Find the best latent inputs by maximizing the marginal likelihood under the
constraint that all visible variables must share the same latent values.
Computationally, this isn’t too expensive, as all dimensions are modeled using the
same covariance matrix K.
Sparse Approximations
Recall the graphical model for a Gaussian process. Inference is expensive because
the latent variables are fully connected.
[Figure: graphical model with training inputs x_i, latent function values f_i and outputs y_i, together with test points x_*, f_*, y_*; the latent variables are fully connected.]

Exact inference: O(n³). Sparse approximations: solve a smaller, sparse approximation of the original problem.
Inducing Variables
[Figure: the graphical model augmented with inducing variables u_1, u_2, . . . at inducing inputs s_1, s_2, . . . .]

The u = (u_1, u_2, \ldots)^\top are called inducing variables. The inducing variables have
associated inducing inputs, s, but no associated output values.
The Central Approximations
In a unifying treatment, Quiñonero-Candela and Rasmussen [2] assume that training and test
sets are conditionally independent given u.

Assume: p(f, f_*) \simeq q(f, f_*), where

    q(f, f_*) = \int q(f_*|u)\, q(f|u)\, p(u)\, du.
Training and test conditionals
The exact conditionals are

    p(f|u) = \mathcal{N}\big(K_{f,u} K_{u,u}^{-1} u,\; K_{f,f} - Q_{f,f}\big), \qquad p(f_*|u) = \mathcal{N}\big(K_{*,u} K_{u,u}^{-1} u,\; K_{*,*} - Q_{*,*}\big),

writing Q_{a,b} \equiv K_{a,u} K_{u,u}^{-1} K_{u,b} as in [2]. These equations are easily recognized as the usual predictive equations for GPs; the different sparse approximations correspond to different additional simplifications of these conditionals.
Example: Subset of Regressors
    q_{\text{SOR}}(f, f_*) = \mathcal{N}\left(0, \begin{bmatrix} Q_{f,f} & Q_{f,*} \\ Q_{*,f} & Q_{*,*} \end{bmatrix}\right),
Example: Sparse parametric Gaussian processes
Snelson and Ghahramani [8] introduced the idea of sparse GP inference based on
a pseudo data set, integrating out the targets, and optimizing the inputs.
The Bayesian Committee Machine [9] uses a block-diagonal (instead of diagonal) approximation
to the training conditional, and takes the inducing variables to be the test cases.
Sparse approximations
The inducing inputs (or expansion points, or support vectors) may be a subset of
the training data, or completely free.
Conclusions
Complex non-linear inference problems can be solved by manipulating plain old
Gaussian distributions.

GPs are a simple and intuitive means of specifying prior information and explaining data;
they are equivalent to several other models (RVMs, splines) and closely related to SVMs.
Outlook:
A few references
[1] Gibbs, M. N. and MacKay, D. J. C. (2000). Variational Gaussian Process Classifiers. IEEE
Transactions on Neural Networks, 11(6):1458–1464.
[2] Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse
approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959.
[3] Kuss, M. and Rasmussen, C. E. (2005). Assessing approximate inference for binary Gaussian
process classification. Journal of Machine Learning Research, 6:1679–1704.
[4] Lawrence, N. (2005). Probabilistic non-linear principal component analysis with Gaussian
process latent variable models. Journal of Machine Learning Research, 6:1783–1816.
[5] MacKay, D. J. C. (1999). Comparison of Approximate Methods for Handling Hyperparameters.
Neural Computation, 11(5):1035–1068.
[6] Minka, T. P. (2001). A Family of Algorithms for Approximate Bayesian Inference. PhD thesis,
Massachusetts Institute of Technology.
[7] Seeger, M. (2003). Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error
Bounds and Sparse Approximations. PhD thesis, School of Informatics, University of Edinburgh.
http://www.cs.berkeley.edu/~mseeger.
[8] Snelson, E. and Ghahramani, Z. (2006). Sparse Gaussian processes using pseudo-inputs. In
Advances in Neural Information Processing Systems 18. MIT Press.
[9] Tresp, V. (2000). A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741.
[10] Williams, C. K. I. and Barber, D. (1998). Bayesian Classification with Gaussian Processes. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342–1351.