Gaussian Processes: Probabilistic Inference (CO-493)

This document discusses a lecture on Gaussian processes given by Marc Deisenroth at Imperial College London. It provides an overview of Bayesian linear regression as a refresher. It then discusses placing priors over functions and defining Gaussian processes, which allow defining a distribution directly over functions. It describes how Gaussian processes can be used for inference and fitting nonlinear functions.

Probabilistic Inference (CO-493)

Gaussian Processes
Marc Deisenroth
Department of Computing [email protected]
Imperial College London

January 22, 2019


Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 2
Bayesian Linear Regression: Model

Prior: p(θ) = N(m_0, S_0)
Likelihood: p(y | x, θ) = N(y | φ(x)ᵀ θ, σ²)
    ⟹ y = φ(x)ᵀ θ + e,   e ∼ N(0, σ²)

[Graphical model: input x and parameter θ (prior N(m_0, S_0)) generate the observation y with noise σ]

§ Parameter θ becomes a latent (random) variable
§ Distribution p(θ) induces a distribution over plausible functions
§ Choose a conjugate Gaussian prior
§ Closed-form computations
§ Gaussian posterior

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 3
Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 4
Distribution over Functions
Consider a linear regression setting
    y = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)

[Figure: prior p(a, b) in the a–b parameter plane]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 5
Sampling from the Prior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    f_i(x) = a_i + b_i x,   [a_i, b_i] ∼ p(a, b)

[Figure: samples [a_i, b_i] from the prior (left) and the corresponding straight lines f_i(x) (right)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 6
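
A minimal NumPy sketch of this prior-sampling step (not part of the original slides; sample counts and plotting ranges are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_samples, sigma_n = 10, 0.5

# Sample parameters [a_i, b_i] from the prior p(a, b) = N(0, I)
ab = rng.standard_normal((n_samples, 2))

# Evaluate the corresponding functions f_i(x) = a_i + b_i x on a grid
x = np.linspace(-10, 10, 100)
F = ab[:, [0]] + ab[:, [1]] * x          # shape (n_samples, 100)

# Noisy observations would be y = f(x) + e, e ~ N(0, sigma_n^2)
y = F + sigma_n * rng.standard_normal(F.shape)
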
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    X = [x_1, ..., x_N],  y = [y_1, ..., y_N]   Training data

[Figure: noisy training observations (x_i, y_i)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 7
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    p(a, b | X, y) = N(m_N, S_N)   Posterior

[Figure: posterior p(a, b | X, y) in the a–b parameter plane]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 8
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    f_i(x) = a_i + b_i x,   [a_i, b_i] ∼ p(a, b | X, y)

[Figure: posterior samples [a_i, b_i] (left) and the corresponding straight lines f_i(x), which pass close to the data (right)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 9
Fitting Nonlinear Functions

§ Fit nonlinear functions using (Bayesian) linear regression:
  Linear combination of nonlinear features
§ Example: Radial-basis-function (RBF) network

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   θ_i ∼ N(0, σ_p²)

  where
    φ_i(x) = exp(−½ (x − µ_i)ᵀ(x − µ_i))

  for given “centers” µ_i

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 10
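
A brief NumPy sketch of this RBF feature construction and of sampling functions from the induced prior (illustrative, not from the slides; sigma_p and the grid are assumptions, the 25 centers in [−5, 3] follow the next slide):

import numpy as np

rng = np.random.default_rng(1)
centers = np.linspace(-5, 3, 25)          # centers mu_i
sigma_p = 1.0

def rbf_features(x, centers):
    # phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

x = np.linspace(-5, 5, 200)
Phi = rbf_features(x, centers)            # shape (200, 25)

# Sample weights theta_i ~ N(0, sigma_p^2) and form f(x) = sum_i theta_i phi_i(x)
theta = sigma_p * rng.standard_normal(centers.shape[0])
f = Phi @ theta
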
Illustration: Fitting a Radial Basis Function Network

φi pxq “ exp ´ 12 px ´ µi qJ px ´ µi q
` ˘

f(x) 2

-2

-5 0 5
x
§ Place Gaussian-shaped basis functions φi at 25 input locations µi ,
linearly spaced in the interval r´5, 3s
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 11
Samples from the RBF Prior

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   p(θ) = N(0, I)

[Figure: sample functions drawn from the RBF prior over x ∈ [−5, 5]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 12
Samples from the RBF Posterior

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   p(θ | X, y) = N(m_N, S_N)

[Figure: sample functions drawn from the RBF posterior, together with the training data]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 13
RBF Posterior

[Figure: RBF posterior mean and uncertainty over x ∈ [−5, 5]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 14
Limitations

[Figure: RBF posterior; to the right of the basis-function locations the model cannot express any variability]

§ Feature engineering (what basis functions to use?)
§ Finite number of features:
  § Above: Without basis functions on the right, we cannot express
    any variability of the function
  § Ideally: Add more (infinitely many) basis functions

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 15
Approach

§ Instead of sampling parameters, which induce a distribution over
  functions, sample functions directly
  Place a prior on functions
  Make assumptions on the distribution of functions
§ Intuition: function = infinitely long vector of function values
  Make assumptions on the distribution of function values
  Gaussian process

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 16
Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 17
Reference

http://www.gaussianprocess.org/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 18
Problem Setting

[Figure: latent function f(x) over x ∈ [−5, 8]]

Objective
For a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a
distribution over functions p(f) that explains the data
Probabilistic regression problem

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 19
Some Application Areas

§ Reinforcement learning and robotics


§ Bayesian optimization (experimental design)
§ Geostatistics
§ Sensor networks
§ Time-series modeling and forecasting
§ High-energy physics
§ Medical applications
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 20
Gaussian Process

§ We will place a distribution p(f) on functions f
§ Informally, a function can be considered an infinitely long vector
  of function values f = [f_1, f_2, f_3, ...]
§ A Gaussian process is a generalization of a multivariate Gaussian
  distribution to infinitely many variables.

Definition (Rasmussen & Williams, 2006)
A Gaussian process (GP) is a collection of random variables f_1, f_2, ...,
any finite number of which is Gaussian distributed.

§ A Gaussian distribution is specified by a mean vector µ and a
  covariance matrix Σ
§ A Gaussian process is specified by a mean function m(·) and a
  covariance function (kernel) k(·, ·)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 21
Mean Function

[Figure: GP samples and their mean function over x ∈ [−5, 8]]

§ The “average” function of the distribution over functions
§ Allows us to bias the model (can make sense in
  application-specific settings)
§ “Agnostic” mean function in the absence of data or prior
  knowledge: m(·) ≡ 0 everywhere (for symmetry reasons)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 22
Covariance Function
[Figure: GP samples over x ∈ [−5, 8]]

§ The covariance function (kernel) is symmetric and positive
  semi-definite
§ It allows us to compute covariances/correlations between
  (unknown) function values by just looking at the corresponding
  inputs:

    Cov[f(x_i), f(x_j)] = k(x_i, x_j)

  Kernel trick (Schölkopf & Smola, 2002)
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 23
GP Regression as a Bayesian Inference Problem

Objective
For a set of observations y_i = f(x_i) + e, e ∼ N(0, σ_n²), find a
(posterior) distribution over functions p(f | X, y) that explains the
data. Here: X training inputs, y training targets

Training data: X, y. Bayes’ theorem yields

    p(f | X, y) = p(y | f, X) p(f) / p(y | X)

Prior: p(f) = GP(m, k)   Specify mean function m and kernel k.
Likelihood (noise model): p(y | f, X) = N(f(X), σ_n² I)
Marginal likelihood (evidence): p(y | X) = ∫ p(y | f, X) p(f | X) df
Posterior: p(f | y, X) = GP(m_post, k_post)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 25
GP Prior
§ Treat a function as a long vector of function values:
  f = [f_1, f_2, ...]
  Look at a distribution over function values f_i = f(x_i)
§ Consider a finite number N of function values f and all other
  (infinitely many) function values f̃. Informally:

    p(f, f̃) = N( [µ_f, µ_f̃],  [[Σ_ff, Σ_ff̃], [Σ_f̃f, Σ_f̃f̃]] )

  where Σ_f̃f̃ ∈ R^{m×m} and Σ_ff̃ ∈ R^{N×m}, m → ∞.
§ Σ_ff^{(i,j)} = Cov[f(x_i), f(x_j)] = k(x_i, x_j)
§ Key property: The marginal remains finite

    p(f) = ∫ p(f, f̃) df̃ = N(µ_f, Σ_ff)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 26
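
A short NumPy sketch of this finite-marginal property: evaluating a zero-mean GP prior at N inputs just means sampling from an N-dimensional Gaussian whose covariance matrix is built from the kernel (illustrative; the Gaussian kernel, its hyper-parameters, and the jitter value are assumptions — the kernel itself is only defined later in the deck):

import numpy as np

def k_gauss(A, B, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^2 / ell^2), 1-D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

rng = np.random.default_rng(2)
x = np.linspace(-5, 5, 100)                   # finite set of inputs
K = k_gauss(x, x) + 1e-10 * np.eye(len(x))    # Sigma_ff (+ jitter for numerical stability)

# p(f) = N(0, Sigma_ff): draw three prior function samples
f_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
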
GP Prior (2)

§ In practice, we always have finite training and test inputs
  x_train, x_test.
§ Define f_* := f_test, f := f_train.
§ Then, we obtain the finite marginal

    p(f, f_*) = ∫ p(f, f_*, f_other) df_other = N( [µ_f, µ_*],  [[Σ_ff, Σ_f*], [Σ_*f, Σ_**]] )

Computing the joint distribution of an arbitrary number of
training and test inputs boils down to manipulating
(finite-dimensional) Gaussian distributions

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 27
GP Posterior Predictions

    y = f(x) + e,   e ∼ N(0, σ_n²)

§ Objective: Find p(f(X_*) | X, y, X_*) for training data X, y and test
  inputs X_*.
§ GP prior at training inputs: p(f | X) = N(m(X), K)
§ Gaussian likelihood: p(y | f, X) = N(f(X), σ_n² I)
§ With f ∼ GP it follows that f, f_* are jointly Gaussian distributed:

    p(f, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

§ Due to the Gaussian likelihood, we also get (f is unobserved)

    p(y, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K + σ_n² I, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 28
GP Posterior Predictions
Prior:

    p(y, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K + σ_n² I, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

Posterior predictive distribution p(f_* | X, y, X_*) at test inputs X_*,
obtained by Gaussian conditioning:

    p(f_* | X, y, X_*) = N( E[f_* | X, y, X_*], V[f_* | X, y, X_*] )

    E[f_* | X, y, X_*] = m_post(X_*)
                       = m(X_*) + k(X_*, X)(K + σ_n² I)^{-1} (y − m(X))
                         (prior mean + “Kalman gain” × error)

    V[f_* | X, y, X_*] = k_post(X_*, X_*)
                       = k(X_*, X_*) − k(X_*, X)(K + σ_n² I)^{-1} k(X, X_*)
                         (prior variance minus a term ≥ 0)

(A NumPy sketch of these equations follows this slide.)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 29
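
A compact NumPy sketch of the predictive equations above with m ≡ 0, using a Cholesky factorization instead of an explicit inverse (illustrative; the Gaussian kernel, its hyper-parameters, and the toy data are assumptions):

import numpy as np

def k_gauss(A, B, sigma_f=1.0, ell=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def gp_predict(X, y, X_star, sigma_n=0.1, sigma_f=1.0, ell=1.0):
    # Posterior mean and covariance of f_* at test inputs X_star (zero prior mean)
    K = k_gauss(X, X, sigma_f, ell) + sigma_n**2 * np.eye(len(X))
    k_star = k_gauss(X, X_star, sigma_f, ell)          # k(X, X_*)
    L = np.linalg.cholesky(K)                          # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_star.T @ alpha                            # k(X_*, X)(K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, k_star)
    cov = k_gauss(X_star, X_star, sigma_f, ell) - v.T @ v
    return mean, cov

# Toy usage
rng = np.random.default_rng(3)
X = np.linspace(-4, 4, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-5, 5, 100)
mu, Sigma = gp_predict(X, y, X_star)
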
GP Posterior
Posterior over functions (with training data X, y):

    p(f(·) | X, y) = p(y | f(·), X) p(f(·) | X) / p(y | X)

Using the properties of Gaussians, we obtain (with K := k(X, X))

    p(y | f(·), X) p(f(·) | X) = N(y | f(X), σ_n² I) GP(m(·), k(·, ·))
                               = Z × GP(m_post(·), k_post(·, ·))

    m_post(·) = m(·) + k(·, X)(K + σ_n² I)^{-1} (y − m(X))
    k_post(·, ·) = k(·, ·) − k(·, X)(K + σ_n² I)^{-1} k(X, ·)

Marginal likelihood:

    Z = p(y | X) = ∫ p(y | f, X) p(f | X) df = N(y | m(X), K + σ_n² I)

Prediction at x_*:  p(f(x_*) | X, y, x_*) = N(m_post(x_*), k_post(x_*, x_*))
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 30
Illustration: Inference with Gaussian Processes

[Figure: GP prior over x ∈ [−5, 8]: zero mean, constant marginal variance, sample functions]

Prior belief about the function

Predictive (marginal) mean and variance:

    E[f(x_*) | x_*, ∅] = m(x_*) = 0
    V[f(x_*) | x_*, ∅] = σ²(x_*) = k(x_*, x_*)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 31
Illustration: Inference with Gaussian Processes

[Figure sequence (one slide per added observation): GP posterior over x ∈ [−5, 8]; the mean passes close to the data and the marginal variance shrinks near observations]

Posterior belief about the function

Predictive (marginal) mean and variance:

    E[f(x_*) | x_*, X, y] = m(x_*) = k(X, x_*)ᵀ (K + σ_n² I)^{-1} y
    V[f(x_*) | x_*, X, y] = σ²(x_*) = k(x_*, x_*) − k(X, x_*)ᵀ (K + σ_n² I)^{-1} k(X, x_*)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 31
Covariance Function

§ A Gaussian process is fully specified by a mean function m and a
  kernel/covariance function k
§ The covariance function (kernel) is symmetric and positive
  semi-definite
§ The covariance function encodes high-level structural assumptions
  about the latent function f (e.g., smoothness, differentiability,
  periodicity)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 33
Gaussian Covariance Function

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

§ σ_f: Amplitude of the latent function
§ ℓ: Length-scale. How far do we have to move in input space
  before the function value changes significantly, i.e., when do
  function values become uncorrelated?
  Smoothness parameter
§ Assumption on latent function: Smooth (infinitely differentiable)


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 34
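
A one-function NumPy sketch of this kernel for vector-valued inputs (illustrative; the array shapes and default hyper-parameters are assumptions — unlike the 1-D helpers in the earlier sketches, this version handles D-dimensional inputs):

import numpy as np

def k_gauss(X1, X2, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^T (x_i - x_j) / ell^2)
    # X1: (N, D) array, X2: (M, D) array -> (N, M) kernel matrix
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / ell**2)

K = k_gauss(np.random.randn(5, 2), np.random.randn(3, 2), sigma_f=2.0, ell=0.5)
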
Amplitude Parameter σ_f²

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figures: samples from GP priors with signal variances 4.0, 2.0, 1.0, 0.5 over x ∈ [0, 1]]

§ Controls the amplitude (vertical magnitude) of the function we
  wish to model

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 35
Length-Scale ℓ

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figure: correlation k_Gauss(x_i, x_j)/σ_f² as a function of the distance ‖x_i − x_j‖ for length-scales 0.05, 0.1, 0.2, 0.5, 5.0]

§ How “wiggly” is the function?
§ How much information can we transfer to other function values?
§ How far do we have to move in input space from x to x′ to make
  f(x) and f(x′) uncorrelated?
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 36
Length-Scale ℓ (2)

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figures: samples from GP priors with length-scales 0.05, 0.1, 0.2, 0.5 over x ∈ [0, 1]; shorter length-scales give wigglier samples]

Explore interactive diagrams at https://drafts.distill.pub/gp/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 37
Matérn Covariance Function
    k_Mat,3/2(x_i, x_j) = σ_f² (1 + √3 ‖x_i − x_j‖ / ℓ) exp(−√3 ‖x_i − x_j‖ / ℓ)

§ σ_f: Amplitude of the latent function
§ ℓ: Length-scale. How far do we have to move in input space
  before the function value changes significantly?
§ Assumption on latent function: once differentiable


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 38
Periodic Covariance Function
    k_per(x_i, x_j) = σ_f² exp(−2 sin²(κ(x_i − x_j)/2) / ℓ²)
                    = k_Gauss(u(x_i), u(x_j)),   u(x) = [cos(κx), sin(κx)]ᵀ

    κ: Periodicity parameter

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 39
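
Short NumPy sketches of these two kernels as reconstructed above (illustrative; 1-D inputs, the default hyper-parameters, and the exact constant in the periodic kernel are assumptions):

import numpy as np

def k_matern32(x1, x2, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell), r = |x_i - x_j|
    r = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(3.0) * r / ell
    return sigma_f**2 * (1.0 + s) * np.exp(-s)

def k_periodic(x1, x2, sigma_f=1.0, ell=1.0, kappa=2.0 * np.pi):
    # k(x_i, x_j) = sigma_f^2 exp(-2 sin^2(kappa (x_i - x_j) / 2) / ell^2)
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-2.0 * np.sin(kappa * d / 2.0) ** 2 / ell**2)

x = np.linspace(0.0, 1.0, 50)
K_mat, K_per = k_matern32(x, x), k_periodic(x, x)
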
Creating New Covariance Functions

Assume k_1 and k_2 are valid covariance functions and u(·) is a
(nonlinear) transformation of the input space. Then
§ k_1 + k_2 is a valid covariance function
§ k_1 k_2 is a valid covariance function
§ k(u(x), u(x′)) is a valid covariance function (MacKay, 1998)
  Periodic covariance function and Manifold Gaussian Process
  (Calandra et al., 2016)
  Automatic Statistician (Lloyd et al., 2014)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 40
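
A minimal sketch of these composition rules as plain Python closures over kernel functions such as those above (illustrative; the warping function and the "locally periodic" example are assumptions, not from the slides):

import numpy as np

def k_gauss(x1, x2, sigma_f=1.0, ell=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def k_sum(k1, k2):
    return lambda x1, x2: k1(x1, x2) + k2(x1, x2)        # k1 + k2

def k_prod(k1, k2):
    return lambda x1, x2: k1(x1, x2) * k2(x1, x2)        # k1 * k2

def k_warped(k, u):
    return lambda x1, x2: k(u(x1), u(x2))                # k(u(x), u(x'))

# Example: product of a Gaussian kernel and a warped Gaussian kernel
u = lambda x: np.sin(2.0 * np.pi * x)                    # toy warping of the input space
k_composed = k_prod(k_gauss, k_warped(k_gauss, u))
x = np.linspace(0, 1, 20)
K = k_composed(x, x)
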
Hyper-Parameters of a GP

The GP possesses a set of hyper-parameters:
§ Parameters of the mean function
§ Parameters of the covariance function (e.g., length-scales and
  signal variance)
§ Likelihood parameters (e.g., noise variance σ_n²)

Train a GP to find a good set of hyper-parameters
Model selection to find good mean and covariance functions
(can also be automated: Automatic Statistician (Lloyd et al., 2014))

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 41
Gaussian Process Training: Hyper-Parameters

GP Training
Find good hyper-parameters θ (kernel/mean
function parameters ψ, noise variance σ_n²)

[Graphical model: latent f with kernel parameters ψ; observations y_i at inputs x_i with noise σ_n, plate over i = 1, ..., N]

§ Place a prior p(θ) on hyper-parameters
§ Posterior over hyper-parameters:

    p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),   p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df

§ Choose hyper-parameters θ_*, such that

    θ_* ∈ arg max_θ  log p(θ) + log p(y | X, θ)

Maximize the marginal likelihood if p(θ) = U (uniform prior)


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 43
Training via Marginal Likelihood Maximization

GP Training
Maximize the evidence/marginal likelihood (probability of the data
given the hyper-parameters, where the unwieldy f has been
integrated out). Also called Maximum Likelihood Type-II.

Marginal likelihood (with a prior mean function m(·) ≡ 0):

    p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
                = ∫ N(y | f(X), σ_n² I) N(f(X) | 0, K) df = N(y | 0, K + σ_n² I)

Learning the GP hyper-parameters:

    θ_* ∈ arg max_θ  log p(y | X, θ)

    log p(y | X, θ) = −½ yᵀ K_θ^{-1} y − ½ log |K_θ| + const,   K_θ := K + σ_n² I

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 44
Training via Marginal Likelihood Maximization

Log-marginal likelihood:

    log p(y | X, θ) = −½ yᵀ K_θ^{-1} y − ½ log |K_θ| + const,   K_θ := K + σ_n² I

§ Automatic trade-off between data fit and model complexity
§ Gradient-based optimization of hyper-parameters θ:

    ∂ log p(y | X, θ)/∂θ_i = ½ yᵀ K_θ^{-1} (∂K_θ/∂θ_i) K_θ^{-1} y − ½ tr(K_θ^{-1} ∂K_θ/∂θ_i)
                           = ½ tr((ααᵀ − K_θ^{-1}) ∂K_θ/∂θ_i),   α := K_θ^{-1} y

(A NumPy sketch of the log-marginal likelihood follows this slide.)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 45
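
A NumPy sketch of the log-marginal likelihood and a simple hyper-parameter search over it (illustrative; the Gaussian kernel, the toy data, and the use of a coarse grid instead of the gradients above are assumptions made for brevity):

import numpy as np
from itertools import product

def k_gauss(x1, x2, sigma_f, ell):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def log_marginal_likelihood(X, y, sigma_f, ell, sigma_n):
    # log p(y | X, theta) = -1/2 y^T K_theta^{-1} y - 1/2 log|K_theta| - N/2 log(2 pi)
    N = len(X)
    K_theta = k_gauss(X, X, sigma_f, ell) + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(4)
X = np.linspace(-4, 4, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

# Crude grid search over (ell, sigma_n) with sigma_f fixed; in practice one
# would use the analytic gradients with a gradient-based optimizer.
grid = product(np.logspace(-1, 1, 20), np.logspace(-2, 0, 20))
best_ell, best_sigma_n = max(
    grid, key=lambda p: log_marginal_likelihood(X, y, 1.0, p[0], p[1]))
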
Example: Training Data

[Figure: noisy training data (x_i, y_i) over x ∈ [−4, 4]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 46
Example: Marginal Likelihood Contour

[Figure: contours of the log-marginal likelihood over log-length-scale log(ℓ) and log-noise log(σ_n)]

§ Three local optima. What do you expect?


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 47
Demo

https://drafts.distill.pub/gp/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 48
Marginal Likelihood and Parameter Learning

§ The marginal likelihood is non-convex
§ Especially in the very-small-data regime, a GP can end up in
  three different situations when optimizing the hyper-parameters:
  § Short length-scales, low noise (highly nonlinear mean function
    with little noise)
  § Long length-scales, high noise (everything is considered noise)
  § Hybrid
§ Re-start hyper-parameter optimization from random
  initialization to mitigate the problem
§ With increasing data set size the GP typically ends up in the
  “hybrid” mode. Other modes are unlikely.
§ Ideally, we would integrate the hyper-parameters out
  No closed-form solution; use Markov chain Monte Carlo

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 49
Model Selection—Mean Function and Kernel

§ Assume we have a finite set of models M_i, each one specifying a
  mean function m_i and a kernel k_i. How do we find the best one?
§ Some options:
  § Cross validation
  § Bayesian Information Criterion, Akaike Information Criterion
  § Compare marginal likelihood values (assuming a uniform prior on
    the set of models)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 50
Example

[Figures: GP fits to the same data set for four different kernels, with log-marginal likelihood (LML) values:
  Constant kernel, LML = −1.1073
  Linear kernel, LML = −1.0065
  Matérn kernel, LML = −0.8625
  Gaussian kernel, LML = −0.69308]

§ Four different kernels (mean function fixed to m ≡ 0)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood values for each (optimized) model
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 51
Application Areas
[Figure: robotics example: a quantity modeled over angle (rad) and angular velocity (rad/s)]

§ Reinforcement learning and robotics
  Model value functions and/or dynamics with GPs
§ Bayesian optimization (experimental design)
  Model unknown utility functions with GPs
§ Geostatistics
  Spatial modeling (e.g., landscapes, resources)
§ Sensor networks
§ Time-series modeling and forecasting
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 52
Limitations of Gaussian Processes

Computational and memory complexity
Training set size: N
§ Training scales in O(N³)
§ Prediction (variances) scales in O(N²)
§ Memory requirement: O(ND + N²)

Practical limit N ≈ 10,000

Some solution approaches:
§ Sparse GPs with inducing variables (e.g., Snelson & Ghahramani,
  2006; Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009;
  Hensman et al., 2013; Matthews et al., 2016)
§ Combination of local GP expert models (e.g., Tresp, 2000; Cao &
  Fleet, 2014; Deisenroth & Ng, 2015)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 53
Tips and Tricks for Practitioners

§ To set initial hyper-parameters, use domain knowledge.
§ Standardize input data and set initial length-scales ℓ to ≈ 0.5
  (see the short sketch after this slide).
§ Standardize targets y and set the initial signal variance to σ_f ≈ 1.
§ Often useful: Set the initial noise level relatively high (e.g.,
  σ_n ≈ 0.5 × the σ_f amplitude), even if you think your data have low
  noise. The optimization surface for your other parameters will be
  easier to move in.
§ When optimizing hyper-parameters, random restarts or other
  tricks to avoid local optima are advised.
§ Mitigate numerical instability (Cholesky
  decomposition of K + σ_n² I) by penalizing high signal-to-noise
  ratios σ_f/σ_n

https://drafts.distill.pub/gp

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 54
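
A small NumPy sketch of the standardization and initialization advice above (the data and the exact default values are illustrative, not prescribed by the slides):

import numpy as np

def standardize(A):
    # Zero mean, unit variance per column
    mu, sd = A.mean(axis=0), A.std(axis=0)
    return (A - mu) / sd, mu, sd

rng = np.random.default_rng(5)
X_raw = rng.uniform(0, 100, size=(50, 2))            # raw inputs on an arbitrary scale
y_raw = 3.0 * X_raw[:, 0] + rng.normal(0, 5, 50)

X, X_mu, X_sd = standardize(X_raw)
y, y_mu, y_sd = standardize(y_raw[:, None])

# Initial hyper-parameters on the standardized scale
ell0 = 0.5                   # length-scale
sigma_f0 = 1.0               # signal standard deviation
sigma_n0 = 0.5 * sigma_f0    # start with a relatively high noise level
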
Appendix

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 55
The Gaussian Distribution

    p(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − µ)ᵀ Σ^{−1} (x − µ))

§ Mean vector µ: average of the data
§ Covariance matrix Σ: spread of the data

[Figures: a 1-D Gaussian density p(x) with mean and 95% confidence bound, and a 2-D Gaussian density p(x, y)]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 56
Conditional
    p(x, y) = N( [µ_x, µ_y],  [[Σ_xx, Σ_xy], [Σ_yx, Σ_yy]] )

    p(x | y) = N(µ_{x|y}, Σ_{x|y})
    µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y)
    Σ_{x|y} = Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx

[Figure: joint density p(x, y), an observation of y, and the resulting conditional p(x | y)]

Conditional p(x | y) is also Gaussian
Computationally convenient

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 57
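
A small NumPy sketch of these conditioning formulas for a toy 2-D joint Gaussian (the numbers are illustrative):

import numpy as np

# Joint p(x, y) = N([mu_x, mu_y], [[S_xx, S_xy], [S_yx, S_yy]]) with scalar blocks
mu_x, mu_y = 0.0, 1.0
S_xx, S_xy, S_yy = 2.0, 1.2, 1.5
S_yx = S_xy

y_obs = -1.0                                  # observed value of y

# Conditional p(x | y): mu_{x|y} = mu_x + S_xy S_yy^{-1} (y - mu_y)
#                       S_{x|y}  = S_xx - S_xy S_yy^{-1} S_yx
mu_cond = mu_x + S_xy / S_yy * (y_obs - mu_y)
S_cond = S_xx - S_xy / S_yy * S_yx

# For matrix-valued blocks the same formulas apply with np.linalg.solve:
# mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)
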
Marginal

    p(x, y) = N( [µ_x, µ_y],  [[Σ_xx, Σ_xy], [Σ_yx, Σ_yy]] )

Marginal distribution:

    p(x) = ∫ p(x, y) dy = N(µ_x, Σ_xx)

[Figure: joint density p(x, y) and its marginal p(x)]

§ The marginal of a joint Gaussian distribution is Gaussian
§ Intuitively: Ignore (integrate out) everything you are not
  interested in

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 58
The Gaussian Distribution in the Limit

Consider the joint Gaussian distribution p(x, x̃), where x ∈ R^D and
x̃ ∈ R^k, k → ∞, are random variables.
Then

    p(x, x̃) = N( [µ_x, µ_x̃],  [[Σ_xx, Σ_xx̃], [Σ_x̃x, Σ_x̃x̃]] )

where Σ_x̃x̃ ∈ R^{k×k} and Σ_xx̃ ∈ R^{D×k}, k → ∞.

However, the marginal remains finite

    p(x) = ∫ p(x, x̃) dx̃ = N(µ_x, Σ_xx)

where we integrate out an infinite number of random variables x̃_i.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 59
Marginal and Conditional in the Limit
§ In practice, we consider finite training and test data x_train, x_test
§ Then, x = {x_train, x_test, x_other}
  (x_other plays the role of x̃ from the previous slide)

    p(x) = N( [µ_train, µ_test, µ_other],
              [[Σ_train,        Σ_train,test,   Σ_train,other],
               [Σ_test,train,   Σ_test,         Σ_test,other ],
               [Σ_other,train,  Σ_other,test,   Σ_other      ]] )

    p(x_train, x_test) = ∫ p(x_train, x_test, x_other) dx_other

    p(x_test | x_train) = N(µ_*, Σ_*)
    µ_* = µ_test + Σ_test,train Σ_train^{-1} (x_train − µ_train)
    Σ_* = Σ_test − Σ_test,train Σ_train^{-1} Σ_train,test

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 60
Gaussian Process Training: Hierarchical Inference

θ: Collection of all hyper-parameters

[Graphical model: latent f with kernel parameters ψ; observations y_i at inputs x_i with noise σ_n, plate over i = 1, ..., N]

§ Level-1 inference (posterior on f):

    p(f | X, y, θ) = p(y | X, f) p(f | X, θ) / p(y | X, θ)
    p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df

§ Level-2 inference (posterior on θ):

    p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 61
GP as the Limit of an Infinite RBF Network
Consider the universal function approximator

    f(x) = lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{i∈Z} γ_n exp(−(x − (i + n/N))² / λ²),   x ∈ R, λ ∈ R⁺

with γ_n ∼ N(0, 1) (random weights)
Gaussian-shaped basis functions (with variance λ²/2) everywhere
on the real axis

    f(x) = Σ_{i∈Z} ∫_i^{i+1} γ(s) exp(−(x − s)² / λ²) ds = ∫_{−∞}^{∞} γ(s) exp(−(x − s)² / λ²) ds

§ Mean: E[f(x)] = 0
§ Covariance: Cov[f(x), f(x′)] = θ_1² exp(−(x − x′)²/(2λ²)) for a suitable θ_1²

GP with mean 0 and Gaussian covariance function
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 62
References I
[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of
LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.
[2] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for Regression. In Proceedings
of the IEEE International Joint Conference on Neural Networks, 2016.
[3] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process
Predictions. http://arxiv.org/abs/1410.7827, 2014.
[4] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[5] M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In IEEE
International Conference on Robotics and Automation, Seattle, WA, May 2015.
[6] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine
Learning, 2015.
[7] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient
Reinforcement Learning. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2011.
[8] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing,
72(7–9):1508–1524, Mar. 2009.
[9] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with
Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
[10] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process
State-Space Models with Particle MCMC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,
editors, Advances in Neural Information Processing Systems, pages 3156–3164. Curran Associates, Inc., 2013.
[11] N. HajiGhassemi and M. P. Deisenroth. Approximate Inference for Long-Term Forecasting with Periodic Gaussian
Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, April 2014.
[12] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In A. Nicholson and P. Smyth, editors,
Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 63
References II
[13] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient
Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, Feb. 2008.
[14] M. C. H. Lee, H. Salimbeni, M. P. Deisenroth, and B. Glocker. Patch Kernels for Gaussian Processes in High-Dimensional
Imaging Problems. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
[15] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and
Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages
1–11, 2014.
[16] D. J. C. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning,
volume 168, pages 133–165. Springer, Berlin, Germany, 1998.
[17] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On Sparse Variational Methods and the
Kullback-Leibler Divergence between Stochastic Processes. In Proceedings of the International Conference on Artificial
Intelligence and Statistics, 2016.
[18] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing
of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the
International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE Computer Society, 2008.
[19] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression.
Journal of Machine Learning Research, 6(2):1939–1960, 2005.
[20] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine
Learning. The MIT Press, Cambridge, MA, USA, 2006.
[21] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling.
Philosophical Transactions of the Royal Society (Part A), 371(1984), Feb. 2013.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels—Support Vector Machines, Regularization, Optimization, and Beyond.
Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2002.
[23] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt,
editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.
[24] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2009.
[25] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 64
