Gaussian Processes: Probabilistic Inference (CO-493)

This document discusses a lecture on Gaussian processes given by Marc Deisenroth at Imperial College London. It provides an overview of Bayesian linear regression as a refresher. It then discusses placing priors over functions and defining Gaussian processes, which allow defining a distribution directly over functions. It describes how Gaussian processes can be used for inference and fitting nonlinear functions.

Probabilistic Inference (CO-493)

Gaussian Processes
Marc Deisenroth
Department of Computing [email protected]
Imperial College London

January 22, 2019


Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 2
Bayesian Linear Regression: Model

Prior: p(θ) = N(m_0, S_0)
Likelihood: p(y | x, θ) = N(y | φ(x)ᵀ θ, σ²)
    ⟹ y = φ(x)ᵀ θ + e,   e ∼ N(0, σ²)

[Graphical model: input x and parameter θ (prior N(m_0, S_0)) generate the observation y with noise σ]

§ Parameter θ becomes a latent (random) variable
§ Distribution p(θ) induces a distribution over plausible functions
§ Choose a conjugate Gaussian prior
§ Closed-form computations
§ Gaussian posterior

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 3
Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 4
Distribution over Functions
Consider a linear regression setting
    y = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)

[Figure: prior p(a, b) in the a–b parameter plane]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 5
Sampling from the Prior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    f_i(x) = a_i + b_i x,   [a_i, b_i] ∼ p(a, b)

[Figure: samples [a_i, b_i] from the prior (left) and the corresponding straight lines f_i(x) (right)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 6
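
A minimal NumPy sketch of this prior-sampling step (not part of the original slides; sample counts and plotting ranges are illustrative):

import numpy as np

rng = np.random.default_rng(0)
n_samples, sigma_n = 10, 0.5

# Sample parameters [a_i, b_i] from the prior p(a, b) = N(0, I)
ab = rng.standard_normal((n_samples, 2))

# Evaluate the corresponding functions f_i(x) = a_i + b_i x on a grid
x = np.linspace(-10, 10, 100)
F = ab[:, [0]] + ab[:, [1]] * x          # shape (n_samples, 100)

# Noisy observations would be y = f(x) + e, e ~ N(0, sigma_n^2)
y = F + sigma_n * rng.standard_normal(F.shape)
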
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    X = [x_1, ..., x_N],  y = [y_1, ..., y_N]   Training data

[Figure: noisy training observations (x_i, y_i)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 7
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    p(a, b) = N(0, I)
    p(a, b | X, y) = N(m_N, S_N)   Posterior

[Figure: posterior p(a, b | X, y) in the a–b parameter plane]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 8
Sampling from the Posterior over Functions
Consider a linear regression setting
    y = f(x) + e = a + bx + e,   e ∼ N(0, σ_n²)
    f_i(x) = a_i + b_i x,   [a_i, b_i] ∼ p(a, b | X, y)

[Figure: posterior samples [a_i, b_i] (left) and the corresponding straight lines f_i(x), which pass close to the data (right)]
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 9
Fitting Nonlinear Functions

§ Fit nonlinear functions using (Bayesian) linear regression:
  Linear combination of nonlinear features
§ Example: Radial-basis-function (RBF) network

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   θ_i ∼ N(0, σ_p²)

  where
    φ_i(x) = exp(−½ (x − µ_i)ᵀ(x − µ_i))

  for given “centers” µ_i

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 10
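
A brief NumPy sketch of this RBF feature construction and of sampling functions from the induced prior (illustrative, not from the slides; sigma_p and the grid are assumptions, the 25 centers in [−5, 3] follow the next slide):

import numpy as np

rng = np.random.default_rng(1)
centers = np.linspace(-5, 3, 25)          # centers mu_i
sigma_p = 1.0

def rbf_features(x, centers):
    # phi_i(x) = exp(-0.5 * (x - mu_i)^2) for scalar inputs
    return np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

x = np.linspace(-5, 5, 200)
Phi = rbf_features(x, centers)            # shape (200, 25)

# Sample weights theta_i ~ N(0, sigma_p^2) and form f(x) = sum_i theta_i phi_i(x)
theta = sigma_p * rng.standard_normal(centers.shape[0])
f = Phi @ theta
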
Illustration: Fitting a Radial Basis Function Network

φi pxq “ exp ´ 12 px ´ µi qJ px ´ µi q
` ˘

f(x) 2

-2

-5 0 5
x
§ Place Gaussian-shaped basis functions φi at 25 input locations µi ,
linearly spaced in the interval r´5, 3s
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 11
Samples from the RBF Prior

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   p(θ) = N(0, I)

[Figure: sample functions drawn from the RBF prior over x ∈ [−5, 5]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 12
Samples from the RBF Posterior

    f(x) = Σ_{i=1}^n θ_i φ_i(x),   p(θ | X, y) = N(m_N, S_N)

[Figure: sample functions drawn from the RBF posterior, together with the training data]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 13
RBF Posterior

[Figure: RBF posterior mean and uncertainty over x ∈ [−5, 5]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 14
Limitations

[Figure: RBF posterior; to the right of the basis-function locations the model cannot express any variability]

§ Feature engineering (what basis functions to use?)
§ Finite number of features:
  § Above: Without basis functions on the right, we cannot express
    any variability of the function
  § Ideally: Add more (infinitely many) basis functions

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 15
Approach

§ Instead of sampling parameters, which induce a distribution over
  functions, sample functions directly
  Place a prior on functions
  Make assumptions on the distribution of functions
§ Intuition: function = infinitely long vector of function values
  Make assumptions on the distribution of function values
  Gaussian process

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 16
Overview

Bayesian Linear Regression (1-Slide Refresher)

Priors over Functions

Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 17
Reference

http://www.gaussianprocess.org/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 18
Problem Setting

[Figure: latent function f(x) over x ∈ [−5, 8]]

Objective
For a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a
distribution over functions p(f) that explains the data
Probabilistic regression problem

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 19
Some Application Areas

§ Reinforcement learning and robotics


§ Bayesian optimization (experimental design)
§ Geostatistics
§ Sensor networks
§ Time-series modeling and forecasting
§ High-energy physics
§ Medical applications
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 20
Gaussian Process

§ We will place a distribution p(f) on functions f
§ Informally, a function can be considered an infinitely long vector
  of function values f = [f_1, f_2, f_3, ...]
§ A Gaussian process is a generalization of a multivariate Gaussian
  distribution to infinitely many variables.

Definition (Rasmussen & Williams, 2006)
A Gaussian process (GP) is a collection of random variables f_1, f_2, ...,
any finite number of which is Gaussian distributed.

§ A Gaussian distribution is specified by a mean vector µ and a
  covariance matrix Σ
§ A Gaussian process is specified by a mean function m(·) and a
  covariance function (kernel) k(·, ·)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 21
Mean Function

[Figure: GP samples and their mean function over x ∈ [−5, 8]]

§ The “average” function of the distribution over functions
§ Allows us to bias the model (can make sense in
  application-specific settings)
§ “Agnostic” mean function in the absence of data or prior
  knowledge: m(·) ≡ 0 everywhere (for symmetry reasons)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 22
Covariance Function
[Figure: GP samples over x ∈ [−5, 8]]

§ The covariance function (kernel) is symmetric and positive
  semi-definite
§ It allows us to compute covariances/correlations between
  (unknown) function values by just looking at the corresponding
  inputs:

    Cov[f(x_i), f(x_j)] = k(x_i, x_j)

  Kernel trick (Schölkopf & Smola, 2002)
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 23
GP Regression as a Bayesian Inference Problem

Objective
For a set of observations y_i = f(x_i) + e, e ∼ N(0, σ_n²), find a
(posterior) distribution over functions p(f | X, y) that explains the
data. Here: X training inputs, y training targets

Training data: X, y. Bayes’ theorem yields

    p(f | X, y) = p(y | f, X) p(f) / p(y | X)

Prior: p(f) = GP(m, k)   Specify mean function m and kernel k.
Likelihood (noise model): p(y | f, X) = N(f(X), σ_n² I)
Marginal likelihood (evidence): p(y | X) = ∫ p(y | f, X) p(f | X) df
Posterior: p(f | y, X) = GP(m_post, k_post)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 25
GP Prior
§ Treat a function as a long vector of function values:
  f = [f_1, f_2, ...]
  Look at a distribution over function values f_i = f(x_i)
§ Consider a finite number N of function values f and all other
  (infinitely many) function values f̃. Informally:

    p(f, f̃) = N( [µ_f, µ_f̃],  [[Σ_ff, Σ_ff̃], [Σ_f̃f, Σ_f̃f̃]] )

  where Σ_f̃f̃ ∈ R^{m×m} and Σ_ff̃ ∈ R^{N×m}, m → ∞.
§ Σ_ff^{(i,j)} = Cov[f(x_i), f(x_j)] = k(x_i, x_j)
§ Key property: The marginal remains finite

    p(f) = ∫ p(f, f̃) df̃ = N(µ_f, Σ_ff)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 26
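
A short NumPy sketch of this finite-marginal property: evaluating a zero-mean GP prior at N inputs just means sampling from an N-dimensional Gaussian whose covariance matrix is built from the kernel (illustrative; the Gaussian kernel, its hyper-parameters, and the jitter value are assumptions — the kernel itself is only defined later in the deck):

import numpy as np

def k_gauss(A, B, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^2 / ell^2), 1-D inputs
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

rng = np.random.default_rng(2)
x = np.linspace(-5, 5, 100)                   # finite set of inputs
K = k_gauss(x, x) + 1e-10 * np.eye(len(x))    # Sigma_ff (+ jitter for numerical stability)

# p(f) = N(0, Sigma_ff): draw three prior function samples
f_samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
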
GP Prior (2)

§ In practice, we always have finite training and test inputs
  x_train, x_test.
§ Define f_* := f_test, f := f_train.
§ Then, we obtain the finite marginal

    p(f, f_*) = ∫ p(f, f_*, f_other) df_other = N( [µ_f, µ_*],  [[Σ_ff, Σ_f*], [Σ_*f, Σ_**]] )

Computing the joint distribution of an arbitrary number of
training and test inputs boils down to manipulating
(finite-dimensional) Gaussian distributions

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 27
GP Posterior Predictions

    y = f(x) + e,   e ∼ N(0, σ_n²)

§ Objective: Find p(f(X_*) | X, y, X_*) for training data X, y and test
  inputs X_*.
§ GP prior at training inputs: p(f | X) = N(m(X), K)
§ Gaussian likelihood: p(y | f, X) = N(f(X), σ_n² I)
§ With f ∼ GP it follows that f, f_* are jointly Gaussian distributed:

    p(f, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

§ Due to the Gaussian likelihood, we also get (f is unobserved)

    p(y, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K + σ_n² I, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 28
GP Posterior Predictions
Prior:

    p(y, f_* | X, X_*) = N( [m(X), m(X_*)],  [[K + σ_n² I, k(X, X_*)], [k(X_*, X), k(X_*, X_*)]] )

Posterior predictive distribution p(f_* | X, y, X_*) at test inputs X_*,
obtained by Gaussian conditioning:

    p(f_* | X, y, X_*) = N( E[f_* | X, y, X_*], V[f_* | X, y, X_*] )

    E[f_* | X, y, X_*] = m_post(X_*)
                       = m(X_*) + k(X_*, X)(K + σ_n² I)^{-1} (y − m(X))
                         (prior mean + “Kalman gain” × error)

    V[f_* | X, y, X_*] = k_post(X_*, X_*)
                       = k(X_*, X_*) − k(X_*, X)(K + σ_n² I)^{-1} k(X, X_*)
                         (prior variance minus a term ≥ 0)

(A NumPy sketch of these equations follows this slide.)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 29
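
A compact NumPy sketch of the predictive equations above with m ≡ 0, using a Cholesky factorization instead of an explicit inverse (illustrative; the Gaussian kernel, its hyper-parameters, and the toy data are assumptions):

import numpy as np

def k_gauss(A, B, sigma_f=1.0, ell=1.0):
    d2 = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def gp_predict(X, y, X_star, sigma_n=0.1, sigma_f=1.0, ell=1.0):
    # Posterior mean and covariance of f_* at test inputs X_star (zero prior mean)
    K = k_gauss(X, X, sigma_f, ell) + sigma_n**2 * np.eye(len(X))
    k_star = k_gauss(X, X_star, sigma_f, ell)          # k(X, X_*)
    L = np.linalg.cholesky(K)                          # K + sigma_n^2 I = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = k_star.T @ alpha                            # k(X_*, X)(K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, k_star)
    cov = k_gauss(X_star, X_star, sigma_f, ell) - v.T @ v
    return mean, cov

# Toy usage
rng = np.random.default_rng(3)
X = np.linspace(-4, 4, 20)
y = np.sin(X) + 0.1 * rng.standard_normal(20)
X_star = np.linspace(-5, 5, 100)
mu, Sigma = gp_predict(X, y, X_star)
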
GP Posterior
Posterior over functions (with training data X, y):

    p(f(·) | X, y) = p(y | f(·), X) p(f(·) | X) / p(y | X)

Using the properties of Gaussians, we obtain (with K := k(X, X))

    p(y | f(·), X) p(f(·) | X) = N(y | f(X), σ_n² I) GP(m(·), k(·, ·))
                               = Z × GP(m_post(·), k_post(·, ·))

    m_post(·) = m(·) + k(·, X)(K + σ_n² I)^{-1} (y − m(X))
    k_post(·, ·) = k(·, ·) − k(·, X)(K + σ_n² I)^{-1} k(X, ·)

Marginal likelihood:

    Z = p(y | X) = ∫ p(y | f, X) p(f | X) df = N(y | m(X), K + σ_n² I)

Prediction at x_*:  p(f(x_*) | X, y, x_*) = N(m_post(x_*), k_post(x_*, x_*))
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 30
Illustration: Inference with Gaussian Processes

[Figure: GP prior over x ∈ [−5, 8]: zero mean, constant marginal variance, sample functions]

Prior belief about the function

Predictive (marginal) mean and variance:

    E[f(x_*) | x_*, ∅] = m(x_*) = 0
    V[f(x_*) | x_*, ∅] = σ²(x_*) = k(x_*, x_*)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 31
Illustration: Inference with Gaussian Processes

[Figure sequence (one slide per added observation): GP posterior over x ∈ [−5, 8]; the mean passes close to the data and the marginal variance shrinks near observations]

Posterior belief about the function

Predictive (marginal) mean and variance:

    E[f(x_*) | x_*, X, y] = m(x_*) = k(X, x_*)ᵀ (K + σ_n² I)^{-1} y
    V[f(x_*) | x_*, X, y] = σ²(x_*) = k(x_*, x_*) − k(X, x_*)ᵀ (K + σ_n² I)^{-1} k(X, x_*)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 31
Covariance Function

§ A Gaussian process is fully specified by a mean function m and a
  kernel/covariance function k
§ The covariance function (kernel) is symmetric and positive
  semi-definite
§ The covariance function encodes high-level structural assumptions
  about the latent function f (e.g., smoothness, differentiability,
  periodicity)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 33
Gaussian Covariance Function

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

§ σ_f: Amplitude of the latent function
§ ℓ: Length-scale. How far do we have to move in input space
  before the function value changes significantly, i.e., when do
  function values become uncorrelated?
  Smoothness parameter
§ Assumption on latent function: Smooth (infinitely differentiable)


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 34
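
A one-function NumPy sketch of this kernel for vector-valued inputs (illustrative; the array shapes and default hyper-parameters are assumptions — unlike the 1-D helpers in the earlier sketches, this version handles D-dimensional inputs):

import numpy as np

def k_gauss(X1, X2, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 * exp(-(x_i - x_j)^T (x_i - x_j) / ell^2)
    # X1: (N, D) array, X2: (M, D) array -> (N, M) kernel matrix
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sigma_f**2 * np.exp(-d2 / ell**2)

K = k_gauss(np.random.randn(5, 2), np.random.randn(3, 2), sigma_f=2.0, ell=0.5)
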
Amplitude Parameter σ_f²

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figures: samples from GP priors with signal variances 4.0, 2.0, 1.0, 0.5 over x ∈ [0, 1]]

§ Controls the amplitude (vertical magnitude) of the function we
  wish to model

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 35
Length-Scale ℓ

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figure: correlation k_Gauss(x_i, x_j)/σ_f² as a function of the distance ‖x_i − x_j‖ for length-scales 0.05, 0.1, 0.2, 0.5, 5.0]

§ How “wiggly” is the function?
§ How much information can we transfer to other function values?
§ How far do we have to move in input space from x to x′ to make
  f(x) and f(x′) uncorrelated?
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 36
Length-Scale ℓ (2)

    k_Gauss(x_i, x_j) = σ_f² exp(−(x_i − x_j)ᵀ(x_i − x_j) / ℓ²)

[Figures: samples from GP priors with length-scales 0.05, 0.1, 0.2, 0.5 over x ∈ [0, 1]; shorter length-scales give wigglier samples]

Explore interactive diagrams at https://drafts.distill.pub/gp/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 37
Matérn Covariance Function
    k_Mat,3/2(x_i, x_j) = σ_f² (1 + √3 ‖x_i − x_j‖ / ℓ) exp(−√3 ‖x_i − x_j‖ / ℓ)

§ σ_f: Amplitude of the latent function
§ ℓ: Length-scale. How far do we have to move in input space
  before the function value changes significantly?
§ Assumption on latent function: once differentiable


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 38
Periodic Covariance Function
    k_per(x_i, x_j) = σ_f² exp(−2 sin²(κ(x_i − x_j)/2) / ℓ²)
                    = k_Gauss(u(x_i), u(x_j)),   u(x) = [cos(κx), sin(κx)]ᵀ

    κ: Periodicity parameter

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 39
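
Short NumPy sketches of these two kernels as reconstructed above (illustrative; 1-D inputs, the default hyper-parameters, and the exact constant in the periodic kernel are assumptions):

import numpy as np

def k_matern32(x1, x2, sigma_f=1.0, ell=1.0):
    # k(x_i, x_j) = sigma_f^2 (1 + sqrt(3) r / ell) exp(-sqrt(3) r / ell), r = |x_i - x_j|
    r = np.abs(x1[:, None] - x2[None, :])
    s = np.sqrt(3.0) * r / ell
    return sigma_f**2 * (1.0 + s) * np.exp(-s)

def k_periodic(x1, x2, sigma_f=1.0, ell=1.0, kappa=2.0 * np.pi):
    # k(x_i, x_j) = sigma_f^2 exp(-2 sin^2(kappa (x_i - x_j) / 2) / ell^2)
    d = x1[:, None] - x2[None, :]
    return sigma_f**2 * np.exp(-2.0 * np.sin(kappa * d / 2.0) ** 2 / ell**2)

x = np.linspace(0.0, 1.0, 50)
K_mat, K_per = k_matern32(x, x), k_periodic(x, x)
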
Creating New Covariance Functions

Assume k_1 and k_2 are valid covariance functions and u(·) is a
(nonlinear) transformation of the input space. Then
§ k_1 + k_2 is a valid covariance function
§ k_1 k_2 is a valid covariance function
§ k(u(x), u(x′)) is a valid covariance function (MacKay, 1998)
  Periodic covariance function and Manifold Gaussian Process
  (Calandra et al., 2016)
  Automatic Statistician (Lloyd et al., 2014)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 40
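
A minimal sketch of these composition rules as plain Python closures over kernel functions such as those above (illustrative; the warping function and the "locally periodic" example are assumptions, not from the slides):

import numpy as np

def k_gauss(x1, x2, sigma_f=1.0, ell=1.0):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def k_sum(k1, k2):
    return lambda x1, x2: k1(x1, x2) + k2(x1, x2)        # k1 + k2

def k_prod(k1, k2):
    return lambda x1, x2: k1(x1, x2) * k2(x1, x2)        # k1 * k2

def k_warped(k, u):
    return lambda x1, x2: k(u(x1), u(x2))                # k(u(x), u(x'))

# Example: product of a Gaussian kernel and a warped Gaussian kernel
u = lambda x: np.sin(2.0 * np.pi * x)                    # toy warping of the input space
k_composed = k_prod(k_gauss, k_warped(k_gauss, u))
x = np.linspace(0, 1, 20)
K = k_composed(x, x)
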
Hyper-Parameters of a GP

The GP possesses a set of hyper-parameters:
§ Parameters of the mean function
§ Parameters of the covariance function (e.g., length-scales and
  signal variance)
§ Likelihood parameters (e.g., noise variance σ_n²)

Train a GP to find a good set of hyper-parameters
Model selection to find good mean and covariance functions
(can also be automated: Automatic Statistician (Lloyd et al., 2014))

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 41
Gaussian Process Training: Hyper-Parameters

GP Training
Find good hyper-parameters θ (kernel/mean
function parameters ψ, noise variance σ_n²)

[Graphical model: latent f with kernel parameters ψ; observations y_i at inputs x_i with noise σ_n, plate over i = 1, ..., N]

§ Place a prior p(θ) on hyper-parameters
§ Posterior over hyper-parameters:

    p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),   p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df

§ Choose hyper-parameters θ_*, such that

    θ_* ∈ arg max_θ  log p(θ) + log p(y | X, θ)

Maximize the marginal likelihood if p(θ) = U (uniform prior)


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 43
Training via Marginal Likelihood Maximization

GP Training
Maximize the evidence/marginal likelihood (probability of the data
given the hyper-parameters, where the unwieldy f has been
integrated out). Also called Maximum Likelihood Type-II.

Marginal likelihood (with a prior mean function m(·) ≡ 0):

    p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
                = ∫ N(y | f(X), σ_n² I) N(f(X) | 0, K) df = N(y | 0, K + σ_n² I)

Learning the GP hyper-parameters:

    θ_* ∈ arg max_θ  log p(y | X, θ)

    log p(y | X, θ) = −½ yᵀ K_θ^{-1} y − ½ log |K_θ| + const,   K_θ := K + σ_n² I

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 44
Training via Marginal Likelihood Maximization

Log-marginal likelihood:

    log p(y | X, θ) = −½ yᵀ K_θ^{-1} y − ½ log |K_θ| + const,   K_θ := K + σ_n² I

§ Automatic trade-off between data fit and model complexity
§ Gradient-based optimization of hyper-parameters θ:

    ∂ log p(y | X, θ)/∂θ_i = ½ yᵀ K_θ^{-1} (∂K_θ/∂θ_i) K_θ^{-1} y − ½ tr(K_θ^{-1} ∂K_θ/∂θ_i)
                           = ½ tr((ααᵀ − K_θ^{-1}) ∂K_θ/∂θ_i),   α := K_θ^{-1} y

(A NumPy sketch of the log-marginal likelihood follows this slide.)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 45
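
A NumPy sketch of the log-marginal likelihood and a simple hyper-parameter search over it (illustrative; the Gaussian kernel, the toy data, and the use of a coarse grid instead of the gradients above are assumptions made for brevity):

import numpy as np
from itertools import product

def k_gauss(x1, x2, sigma_f, ell):
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return sigma_f**2 * np.exp(-d2 / ell**2)

def log_marginal_likelihood(X, y, sigma_f, ell, sigma_n):
    # log p(y | X, theta) = -1/2 y^T K_theta^{-1} y - 1/2 log|K_theta| - N/2 log(2 pi)
    N = len(X)
    K_theta = k_gauss(X, X, sigma_f, ell) + sigma_n**2 * np.eye(N)
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)

rng = np.random.default_rng(4)
X = np.linspace(-4, 4, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

# Crude grid search over (ell, sigma_n) with sigma_f fixed; in practice one
# would use the analytic gradients with a gradient-based optimizer.
grid = product(np.logspace(-1, 1, 20), np.logspace(-2, 0, 20))
best_ell, best_sigma_n = max(
    grid, key=lambda p: log_marginal_likelihood(X, y, 1.0, p[0], p[1]))
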
Example: Training Data

[Figure: noisy training data (x_i, y_i) over x ∈ [−4, 4]]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 46
Example: Marginal Likelihood Contour

[Figure: contours of the log-marginal likelihood over log-length-scale log(ℓ) and log-noise log(σ_n)]

§ Three local optima. What do you expect?


Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 47
Demo

https://drafts.distill.pub/gp/

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 48
Marginal Likelihood and Parameter Learning

§ The marginal likelihood is non-convex
§ Especially in the very-small-data regime, a GP can end up in
  three different situations when optimizing the hyper-parameters:
  § Short length-scales, low noise (highly nonlinear mean function
    with little noise)
  § Long length-scales, high noise (everything is considered noise)
  § Hybrid
§ Re-start hyper-parameter optimization from random
  initialization to mitigate the problem
§ With increasing data set size the GP typically ends up in the
  “hybrid” mode. Other modes are unlikely.
§ Ideally, we would integrate the hyper-parameters out
  No closed-form solution; use Markov chain Monte Carlo

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 49
Model Selection—Mean Function and Kernel

§ Assume we have a finite set of models M_i, each one specifying a
  mean function m_i and a kernel k_i. How do we find the best one?
§ Some options:
  § Cross validation
  § Bayesian Information Criterion, Akaike Information Criterion
  § Compare marginal likelihood values (assuming a uniform prior on
    the set of models)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 50
Example

[Figures: GP fits to the same data set for four different kernels, with log-marginal likelihood (LML) values:
  Constant kernel, LML = −1.1073
  Linear kernel, LML = −1.0065
  Matérn kernel, LML = −0.8625
  Gaussian kernel, LML = −0.69308]

§ Four different kernels (mean function fixed to m ≡ 0)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood values for each (optimized) model
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 51
Application Areas
[Figure: robotics example: a quantity modeled over angle (rad) and angular velocity (rad/s)]

§ Reinforcement learning and robotics
  Model value functions and/or dynamics with GPs
§ Bayesian optimization (experimental design)
  Model unknown utility functions with GPs
§ Geostatistics
  Spatial modeling (e.g., landscapes, resources)
§ Sensor networks
§ Time-series modeling and forecasting
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 52
Limitations of Gaussian Processes

Computational and memory complexity
Training set size: N
§ Training scales in O(N³)
§ Prediction (variances) scales in O(N²)
§ Memory requirement: O(ND + N²)

Practical limit N ≈ 10,000

Some solution approaches:
§ Sparse GPs with inducing variables (e.g., Snelson & Ghahramani,
  2006; Quiñonero-Candela & Rasmussen, 2005; Titsias, 2009;
  Hensman et al., 2013; Matthews et al., 2016)
§ Combination of local GP expert models (e.g., Tresp, 2000; Cao &
  Fleet, 2014; Deisenroth & Ng, 2015)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 53
Tips and Tricks for Practitioners

§ To set initial hyper-parameters, use domain knowledge.
§ Standardize input data and set initial length-scales ℓ to ≈ 0.5
  (see the short sketch after this slide).
§ Standardize targets y and set the initial signal variance to σ_f ≈ 1.
§ Often useful: Set the initial noise level relatively high (e.g.,
  σ_n ≈ 0.5 × the σ_f amplitude), even if you think your data have low
  noise. The optimization surface for your other parameters will be
  easier to move in.
§ When optimizing hyper-parameters, random restarts or other
  tricks to avoid local optima are advised.
§ Mitigate numerical instability (Cholesky
  decomposition of K + σ_n² I) by penalizing high signal-to-noise
  ratios σ_f/σ_n

https://drafts.distill.pub/gp

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 54
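
A small NumPy sketch of the standardization and initialization advice above (the data and the exact default values are illustrative, not prescribed by the slides):

import numpy as np

def standardize(A):
    # Zero mean, unit variance per column
    mu, sd = A.mean(axis=0), A.std(axis=0)
    return (A - mu) / sd, mu, sd

rng = np.random.default_rng(5)
X_raw = rng.uniform(0, 100, size=(50, 2))            # raw inputs on an arbitrary scale
y_raw = 3.0 * X_raw[:, 0] + rng.normal(0, 5, 50)

X, X_mu, X_sd = standardize(X_raw)
y, y_mu, y_sd = standardize(y_raw[:, None])

# Initial hyper-parameters on the standardized scale
ell0 = 0.5                   # length-scale
sigma_f0 = 1.0               # signal standard deviation
sigma_n0 = 0.5 * sigma_f0    # start with a relatively high noise level
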
Appendix

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 55
The Gaussian Distribution

    p(x | µ, Σ) = (2π)^{−D/2} |Σ|^{−1/2} exp(−½ (x − µ)ᵀ Σ^{−1} (x − µ))

§ Mean vector µ: average of the data
§ Covariance matrix Σ: spread of the data

[Figures: a 1-D Gaussian density p(x) with mean and 95% confidence bound, and a 2-D Gaussian density p(x, y)]

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 56
Conditional
    p(x, y) = N( [µ_x, µ_y],  [[Σ_xx, Σ_xy], [Σ_yx, Σ_yy]] )

    p(x | y) = N(µ_{x|y}, Σ_{x|y})
    µ_{x|y} = µ_x + Σ_xy Σ_yy^{-1} (y − µ_y)
    Σ_{x|y} = Σ_xx − Σ_xy Σ_yy^{-1} Σ_yx

[Figure: joint density p(x, y), an observation of y, and the resulting conditional p(x | y)]

Conditional p(x | y) is also Gaussian
Computationally convenient

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 57
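
A small NumPy sketch of these conditioning formulas for a toy 2-D joint Gaussian (the numbers are illustrative):

import numpy as np

# Joint p(x, y) = N([mu_x, mu_y], [[S_xx, S_xy], [S_yx, S_yy]]) with scalar blocks
mu_x, mu_y = 0.0, 1.0
S_xx, S_xy, S_yy = 2.0, 1.2, 1.5
S_yx = S_xy

y_obs = -1.0                                  # observed value of y

# Conditional p(x | y): mu_{x|y} = mu_x + S_xy S_yy^{-1} (y - mu_y)
#                       S_{x|y}  = S_xx - S_xy S_yy^{-1} S_yx
mu_cond = mu_x + S_xy / S_yy * (y_obs - mu_y)
S_cond = S_xx - S_xy / S_yy * S_yx

# For matrix-valued blocks the same formulas apply with np.linalg.solve:
# mu_cond = mu_x + S_xy @ np.linalg.solve(S_yy, y_obs - mu_y)
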
Marginal

    p(x, y) = N( [µ_x, µ_y],  [[Σ_xx, Σ_xy], [Σ_yx, Σ_yy]] )

Marginal distribution:

    p(x) = ∫ p(x, y) dy = N(µ_x, Σ_xx)

[Figure: joint density p(x, y) and its marginal p(x)]

§ The marginal of a joint Gaussian distribution is Gaussian
§ Intuitively: Ignore (integrate out) everything you are not
  interested in

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 58
The Gaussian Distribution in the Limit

Consider the joint Gaussian distribution p(x, x̃), where x ∈ R^D and
x̃ ∈ R^k, k → ∞, are random variables.
Then

    p(x, x̃) = N( [µ_x, µ_x̃],  [[Σ_xx, Σ_xx̃], [Σ_x̃x, Σ_x̃x̃]] )

where Σ_x̃x̃ ∈ R^{k×k} and Σ_xx̃ ∈ R^{D×k}, k → ∞.

However, the marginal remains finite

    p(x) = ∫ p(x, x̃) dx̃ = N(µ_x, Σ_xx)

where we integrate out an infinite number of random variables x̃_i.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 59
Marginal and Conditional in the Limit
§ In practice, we consider finite training and test data x_train, x_test
§ Then, x = {x_train, x_test, x_other}
  (x_other plays the role of x̃ from the previous slide)

    p(x) = N( [µ_train, µ_test, µ_other],
              [[Σ_train,        Σ_train,test,   Σ_train,other],
               [Σ_test,train,   Σ_test,         Σ_test,other ],
               [Σ_other,train,  Σ_other,test,   Σ_other      ]] )

    p(x_train, x_test) = ∫ p(x_train, x_test, x_other) dx_other

    p(x_test | x_train) = N(µ_*, Σ_*)
    µ_* = µ_test + Σ_test,train Σ_train^{-1} (x_train − µ_train)
    Σ_* = Σ_test − Σ_test,train Σ_train^{-1} Σ_train,test

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 60
Gaussian Process Training: Hierarchical Inference

θ: Collection of all hyper-parameters

[Graphical model: latent f with kernel parameters ψ; observations y_i at inputs x_i with noise σ_n, plate over i = 1, ..., N]

§ Level-1 inference (posterior on f):

    p(f | X, y, θ) = p(y | X, f) p(f | X, θ) / p(y | X, θ)
    p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df

§ Level-2 inference (posterior on θ):

    p(θ | X, y) = p(y | X, θ) p(θ) / p(y | X)

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 61
GP as the Limit of an Infinite RBF Network
Consider the universal function approximator

    f(x) = lim_{N→∞} (1/N) Σ_{n=1}^N Σ_{i∈Z} γ_n exp(−(x − (i + n/N))² / λ²),   x ∈ R, λ ∈ R⁺

with γ_n ∼ N(0, 1) (random weights)
Gaussian-shaped basis functions (with variance λ²/2) everywhere
on the real axis

    f(x) = Σ_{i∈Z} ∫_i^{i+1} γ(s) exp(−(x − s)² / λ²) ds = ∫_{−∞}^{∞} γ(s) exp(−(x − s)² / λ²) ds

§ Mean: E[f(x)] = 0
§ Covariance: Cov[f(x), f(x′)] = θ_1² exp(−(x − x′)²/(2λ²)) for a suitable θ_1²

GP with mean 0 and Gaussian covariance function
Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 62
References I
[1] G. Bertone, M. P. Deisenroth, J. S. Kim, S. Liem, R. R. de Austri, and M. Welling. Accelerating the BSM Interpretation of
LHC Data with Machine Learning. arXiv preprint arXiv:1611.02704, 2016.
[2] R. Calandra, J. Peters, C. E. Rasmussen, and M. P. Deisenroth. Manifold Gaussian Processes for Regression. In Proceedings
of the IEEE International Joint Conference on Neural Networks, 2016.
[3] Y. Cao and D. J. Fleet. Generalized Product of Experts for Automatic and Principled Fusion of Gaussian Process
Predictions. http://arxiv.org/abs/1410.7827, 2014.
[4] N. A. C. Cressie. Statistics for Spatial Data. Wiley-Interscience, 1993.
[5] M. Cutler and J. P. How. Efficient Reinforcement Learning for Robots using Informative Simulated Priors. In IEEE
International Conference on Robotics and Automation, Seattle, WA, May 2015.
[6] M. P. Deisenroth and J. W. Ng. Distributed Gaussian Processes. In Proceedings of the International Conference on Machine
Learning, 2015.
[7] M. P. Deisenroth, C. E. Rasmussen, and D. Fox. Learning to Control a Low-Cost Manipulator using Data-Efficient
Reinforcement Learning. In Proceedings of Robotics: Science and Systems, Los Angeles, CA, USA, June 2011.
[8] M. P. Deisenroth, C. E. Rasmussen, and J. Peters. Gaussian Process Dynamic Programming. Neurocomputing,
72(7–9):1508–1524, Mar. 2009.
[9] M. P. Deisenroth, R. Turner, M. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust Filtering and Smoothing with
Gaussian Processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
[10] R. Frigola, F. Lindsten, T. B. Schön, and C. E. Rasmussen. Bayesian Inference and Learning in Gaussian Process
State-Space Models with Particle MCMC. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,
editors, Advances in Neural Information Processing Systems, pages 3156–3164. Curran Associates, Inc., 2013.
[11] N. HajiGhassemi and M. P. Deisenroth. Approximate Inference for Long-Term Forecasting with Periodic Gaussian
Processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, April 2014.
[12] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian Processes for Big Data. In A. Nicholson and P. Smyth, editors,
Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2013.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 63
References II
[13] A. Krause, A. Singh, and C. Guestrin. Near-Optimal Sensor Placements in Gaussian Processes: Theory, Efficient
Algorithms and Empirical Studies. Journal of Machine Learning Research, 9:235–284, Feb. 2008.
[14] M. C. H. Lee, H. Salimbeni, M. P. Deisenroth, and B. Glocker. Patch Kernels for Gaussian Processes in High-Dimensional
Imaging Problems. In NIPS Workshop on Practical Bayesian Nonparametrics, 2016.
[15] J. R. Lloyd, D. Duvenaud, R. Grosse, J. B. Tenenbaum, and Z. Ghahramani. Automatic Construction and
Natural-Language Description of Nonparametric Regression Models. In AAAI Conference on Artificial Intelligence, pages
1–11, 2014.
[16] D. J. C. MacKay. Introduction to Gaussian Processes. In C. M. Bishop, editor, Neural Networks and Machine Learning,
volume 168, pages 133–165. Springer, Berlin, Germany, 1998.
[17] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On Sparse Variational Methods and the
Kullback-Leibler Divergence between Stochastic Processes. In Proceedings of the International Conference on Artificial
Intelligence and Statistics, 2016.
[18] M. A. Osborne, S. J. Roberts, A. Rogers, S. D. Ramchurn, and N. R. Jennings. Towards Real-Time Information Processing
of Sensor Network Data Using Computationally Efficient Multi-output Gaussian Processes. In Proceedings of the
International Conference on Information Processing in Sensor Networks, pages 109–120. IEEE Computer Society, 2008.
[19] J. Quiñonero-Candela and C. E. Rasmussen. A Unifying View of Sparse Approximate Gaussian Process Regression.
Journal of Machine Learning Research, 6(2):1939–1960, 2005.
[20] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine
Learning. The MIT Press, Cambridge, MA, USA, 2006.
[21] S. Roberts, M. A. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian Processes for Time Series Modelling.
Philosophical Transactions of the Royal Society (Part A), 371(1984), Feb. 2013.
[22] B. Schölkopf and A. J. Smola. Learning with Kernels—Support Vector Machines, Regularization, Optimization, and Beyond.
Adaptive Computation and Machine Learning. The MIT Press, Cambridge, MA, USA, 2002.
[23] E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt,
editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. The MIT Press, Cambridge, MA, USA, 2006.
[24] M. K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In Proceedings of the International
Conference on Artificial Intelligence and Statistics, 2009.
[25] V. Tresp. A Bayesian Committee Machine. Neural Computation, 12(11):2719–2741, 2000.

Gaussian Processes Marc Deisenroth @Imperial College London, January 22, 2019 64
