Gaussian Processes: Probabilistic Inference (CO-493)
Gaussian Processes
Marc Deisenroth
Department of Computing [email protected]
Imperial College London
Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training
Bayesian Linear Regression: Model
Prior: p(θ) = N(m₀, S₀)
Likelihood: p(y | x, θ) = N(y | φ(x)ᵀθ, σ²)
  ⟹ y = φ(x)ᵀθ + e,   e ∼ N(0, σ²)
§ Parameter θ becomes a latent (random) variable
§ The distribution p(θ) induces a distribution over plausible functions
§ Choose a conjugate Gaussian prior
§ Closed-form computations
§ Gaussian posterior (see the sketch below)
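As a minimal numerical sketch of this closed-form posterior, the following numpy snippet uses the standard Bayesian linear regression update S_N = (S₀⁻¹ + σ⁻²ΦᵀΦ)⁻¹, m_N = S_N(S₀⁻¹m₀ + σ⁻²Φᵀy); the toy features and data are placeholders, not from the slides:

```python
import numpy as np

def blr_posterior(Phi, y, m0, S0, sigma2):
    """Closed-form Bayesian linear regression posterior p(theta | X, y) = N(mN, SN).

    Phi: (N, D) feature matrix with rows phi(x_n); sigma2: noise variance.
    """
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)   # posterior covariance
    mN = SN @ (S0_inv @ m0 + Phi.T @ y / sigma2)        # posterior mean
    return mN, SN

# toy example: linear features phi(x) = [1, x], so theta = (a, b) and y = a + b x + noise
rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=20)
y = 1.0 + 0.7 * x + 0.1 * rng.standard_normal(20)
Phi = np.stack([np.ones_like(x), x], axis=1)
mN, SN = blr_posterior(Phi, y, m0=np.zeros(2), S0=np.eye(2), sigma2=0.01)
```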
Overview
Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training
Distribution over Functions
Consider a linear regression setting
y = a + b·x + e,   e ∼ N(0, σ_n²),   p(a, b) = N(0, I)
[Figure: Gaussian prior p(a, b) over the parameters in the (a, b)-plane.]
Sampling from the Prior over Functions
Consider a linear regression setting
y = f(x) + e = a + b·x + e,   e ∼ N(0, σ_n²),   p(a, b) = N(0, I)
f_i(x) = a_i + b_i·x,   [a_i, b_i] ∼ p(a, b)
[Figure: samples [a_i, b_i] from the prior p(a, b) (left) and the corresponding functions f_i(x) = a_i + b_i·x (right).]
Sampling from the Posterior over Functions
Consider a linear regression setting
y = f(x) + e = a + b·x + e,   e ∼ N(0, σ_n²),   p(a, b) = N(0, I)
Training data: X = [x₁, …, x_N],   y = [y₁, …, y_N]
[Figure: noisy training data (x_n, y_n).]
Posterior over parameters: p(a, b | X, y) = N(m_N, S_N)
[Figure: posterior p(a, b | X, y) in the (a, b)-plane.]
Sampling functions from the posterior: f_i(x) = a_i + b_i·x,   [a_i, b_i] ∼ p(a, b | X, y)
[Figure: samples [a_i, b_i] from the posterior (left) and the corresponding functions f_i(x) (right).]
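A short numpy sketch of these two sampling steps; the toy data below are placeholders, and the posterior N(m_N, S_N) is the standard Bayesian linear regression posterior with prior p(a, b) = N(0, I):

```python
import numpy as np

rng = np.random.default_rng(1)

# placeholder training data from a "true" line, for illustration only
x = rng.uniform(-8, 8, size=15)
y = -1.0 + 0.8 * x + 0.5 * rng.standard_normal(x.shape)
sigma2 = 0.5**2                                       # noise variance sigma_n^2

Phi = np.stack([np.ones_like(x), x], axis=1)          # features [1, x] -> parameters (a, b)
SN = np.linalg.inv(np.eye(2) + Phi.T @ Phi / sigma2)  # prior p(a, b) = N(0, I)
mN = SN @ (Phi.T @ y / sigma2)

xs = np.linspace(-10, 10, 100)
# straight lines sampled from the prior and from the posterior over (a, b)
prior_lines = [a + b * xs for a, b in rng.multivariate_normal(np.zeros(2), np.eye(2), 5)]
post_lines = [a + b * xs for a, b in rng.multivariate_normal(mN, SN, 5)]
```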
Fitting Nonlinear Functions
Use nonlinear basis functions: f(x) = Σ_i θ_i φ_i(x), where
φ_i(x) = exp(−½ (x − µ_i)ᵀ(x − µ_i))
Illustration: Fitting a Radial Basis Function Network
φ_i(x) = exp(−½ (x − µ_i)ᵀ(x − µ_i))
[Figure: RBF-network fit f(x) on x ∈ [−5, 5].]
§ Place Gaussian-shaped basis functions φ_i at 25 input locations µ_i, linearly spaced in the interval [−5, 3]
Samples from the RBF Prior
f(x) = Σ_{i=1}^{n} θ_i φ_i(x),   p(θ) = N(0, I)
[Figure: functions sampled from the RBF prior on x ∈ [−5, 5].]
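A minimal sketch of this construction with 25 unit-width Gaussian bumps on [−5, 3] (as on the previous slide) and weights drawn from p(θ) = N(0, I); the grid and number of samples are arbitrary choices:

```python
import numpy as np

mu = np.linspace(-5, 3, 25)                    # basis-function centres

def features(x):
    """phi_i(x) = exp(-0.5 (x - mu_i)^2), evaluated for all centres."""
    return np.exp(-0.5 * (x[:, None] - mu[None, :])**2)   # shape (len(x), 25)

rng = np.random.default_rng(2)
xs = np.linspace(-5, 5, 200)
Phi = features(xs)

# samples from the prior over functions: f(x) = sum_i theta_i phi_i(x), theta ~ N(0, I)
theta = rng.standard_normal((25, 5))           # 5 independent weight vectors
prior_samples = Phi @ theta                    # each column is one sampled function
```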
Samples from the RBF Posterior
f(x) = Σ_{i=1}^{n} θ_i φ_i(x),   p(θ | X, y) = N(m_N, S_N)
[Figure: functions sampled from the RBF posterior.]
RBF Posterior
[Figure: RBF posterior over functions on x ∈ [−5, 5].]
Limitations
[Figure: RBF posterior on x ∈ [−5, 5]; there are no basis functions to the right of x = 3.]
§ Feature engineering (what basis functions to use?)
§ Finite number of features:
§ Above: Without basis functions on the right, we cannot express
any variability of the function
§ Ideally: Add more (infinitely many) basis functions
Approach
Overview
Gaussian Processes
Definition and Derivation
Inference
Covariance Functions and Hyper-Parameters
Training
Reference
http://www.gaussianprocess.org/
Problem Setting
[Figure: noisy observations of an unknown function f(x) on x ∈ [−5, 8].]
Objective
For a set of observations y_i = f(x_i) + ε, ε ∼ N(0, σ_ε²), find a distribution over functions p(f) that explains the data
⟹ Probabilistic regression problem
Some Application Areas
Gaussian Process
Definition: a Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. A GP is fully specified by a mean function m(·) and a covariance function (kernel) k(·, ·): f ∼ GP(m, k).
Mean Function
[Figure: illustration of the GP mean function.]
Covariance Function
[Figure: illustration of the GP covariance function via sampled functions.]
GP Regression as a Bayesian Inference Problem
Objective
For a set of observations y_i = f(x_i) + e, e ∼ N(0, σ_n²), find a (posterior) distribution over functions p(f | X, y) that explains the data. Here: X are the training inputs and y the training targets.
GP Prior
§ Treat a function as a long vector of function values: f = [f₁, f₂, …]
  Look at a distribution over function values f_i = f(x_i)
§ Consider a finite number of N function values f and all other (infinitely many) function values f̃. Informally:
  p(f, f̃) = N( [µ_f; µ_f̃],  [[Σ_ff, Σ_ff̃]; [Σ_f̃f, Σ_f̃f̃]] )
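To make the "long vector of function values" concrete, the sketch below draws samples of f = [f(x₁), …, f(x_n)] at a finite grid of inputs; the squared-exponential kernel used here is an assumed choice (covariance functions are discussed later in the deck):

```python
import numpy as np

def sq_exp_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential (Gaussian) covariance function (assumed kernel choice)."""
    d2 = (X1[:, None] - X2[None, :])**2
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

rng = np.random.default_rng(3)
xs = np.linspace(-5, 5, 200)                            # finite set of inputs
K = sq_exp_kernel(xs, xs) + 1e-10 * np.eye(len(xs))     # jitter for numerical stability

# the vector f = [f(x_1), ..., f(x_200)] is jointly Gaussian: f ~ N(0, K)
L = np.linalg.cholesky(K)
prior_samples = L @ rng.standard_normal((len(xs), 5))   # 5 prior function samples
```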
GP Posterior Predictions
y = f(x) + e,   e ∼ N(0, σ_n²)
§ Objective: find p(f(X*) | X, y, X*) for training data X, y and test inputs X*
§ GP prior at the training inputs: p(f | X) = N(m(X), K)
§ Gaussian likelihood: p(y | f, X) = N(f(X), σ_n² I)
§ With f ∼ GP it follows that f and f* are jointly Gaussian:
  p(f, f* | X, X*) = N( [m(X); m(X*)],  [[K, k(X, X*)]; [k(X*, X), k(X*, X*)]] )
GP Posterior Predictions
Integrating out the observation noise, the joint prior over the observed targets y and the test function values f* is
p(y, f* | X, X*) = N( [m(X); m(X*)],  [[K + σ_n² I, k(X, X*)]; [k(X*, X), k(X*, X*)]] )
Conditioning this Gaussian on y yields the posterior predictive moments
E[f* | X, y, X*] = m(X*) + k(X*, X)(K + σ_n² I)⁻¹ (y − m(X))
V[f* | X, y, X*] = k_post(X*, X*) = k(X*, X*) − k(X*, X)(K + σ_n² I)⁻¹ k(X, X*)
where k(X*, X*) is the prior variance and the subtracted term is ≥ 0, so observing data can only reduce the predictive uncertainty.
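A compact numpy sketch of these predictive equations; a zero prior mean and the sq_exp_kernel from the earlier sketch are assumptions, while the conditioning formulas are the ones above:

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, sigma_n2):
    """GP posterior mean and covariance at test inputs Xs, given training data (X, y).

    Implements  E[f*] = k(X*, X)(K + sigma_n^2 I)^-1 y
                V[f*] = k(X*, X*) - k(X*, X)(K + sigma_n^2 I)^-1 k(X, X*)
    under a zero prior mean.
    """
    K = kernel(X, X) + sigma_n2 * np.eye(len(X))
    Ks = kernel(Xs, X)                   # k(X*, X)
    Kss = kernel(Xs, Xs)                 # k(X*, X*)
    # solve with a Cholesky factor instead of forming the inverse explicitly
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + sigma_n^2 I)^-1 y
    v = np.linalg.solve(L, Ks.T)
    mean = Ks @ alpha
    cov = Kss - v.T @ v
    return mean, cov
```

With the kernel from the earlier sketch and a handful of noisy observations, gp_predict returns the posterior mean and the reduced posterior covariance at any set of test inputs.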
GP Posterior
Posterior over functions (with training data X, y):
p(f(·) | X, y) = p(y | f(·), X) p(f(·) | X) / p(y | X)
Using the properties of Gaussians, we obtain (with K := k(X, X))
p(y | f(·), X) p(f(·) | X) = N(y | f(X), σ_n² I) · GP(m(·), k(·, ·))
                           = Z × GP(m_post(·), k_post(·, ·))
m_post(·) = m(·) + k(·, X)(K + σ_n² I)⁻¹ (y − m(X))
k_post(·, ·) = k(·, ·) − k(·, X)(K + σ_n² I)⁻¹ k(X, ·)
Marginal likelihood:
Z = p(y | X) = ∫ p(y | f, X) p(f | X) df = N(y | m(X), K + σ_n² I)
Prediction at x*: p(f(x*) | X, y, x*) = N(m_post(x*), k_post(x*, x*))
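The posterior-belief figures on the following slides can be reproduced with a sketch like the one below, which draws sample functions from the posterior GP; it reuses gp_predict and sq_exp_kernel from the earlier sketches, assumes a zero prior mean, and the training data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)

# hypothetical training data; the sine function is only for illustration
X = np.array([-4.0, -3.0, -1.0, 0.0, 2.0])
y = np.sin(X) + 0.1 * rng.standard_normal(X.shape)

Xs = np.linspace(-5, 8, 300)
kernel = lambda A, B: sq_exp_kernel(A, B, lengthscale=1.0, variance=1.0)
mean, cov = gp_predict(X, y, Xs, kernel, sigma_n2=0.01)

# posterior function samples and a 95% credible band
L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(Xs)))      # jitter for stability
posterior_samples = mean[:, None] + L @ rng.standard_normal((len(Xs), 3))
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
lower, upper = mean - 1.96 * std, mean + 1.96 * std
```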
Illustration: Inference with Gaussian Processes
[Figure: prior belief about the function, shown as the GP prior mean and uncertainty on x ∈ [−5, 8].]
[Figure sequence: posterior belief about the function as observations are added one by one; the posterior mean passes close to the data and the uncertainty shrinks near the observations.]
Covariance Function
Gaussian Covariance Function
k(x, x') = σ_f² exp(−‖x − x'‖² / (2ℓ²))
[Figure: GP samples for length-scales ℓ ∈ {0.05, 0.1, 0.2, 0.5, 5.0} and the corresponding correlation between f(x) and f(x') as a function of the distance ‖x − x'‖; longer length-scales give smoother, more slowly varying functions.]
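A small sketch of the correlation curves shown in the figure; the parameterization with signal variance σ_f² and length-scale ℓ is the common one and is assumed here:

```python
import numpy as np

# correlation between f(x) and f(x') under the Gaussian kernel, as a function of distance
dist = np.linspace(0.0, 1.0, 200)                 # || x - x' ||
curves = {}
for ell in [0.05, 0.1, 0.2, 0.5, 5.0]:            # length-scales from the figure
    curves[ell] = np.exp(-0.5 * dist**2 / ell**2) # k(x, x') / sigma_f^2
    # small ell: correlation drops quickly, giving wiggly samples
    # large ell: function values stay correlated over long distances, giving smooth samples
```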
Creating New Covariance Functions
Sums and products of valid covariance functions are again valid covariance functions, so structured kernels can be built by combining simple ones.
Hyper-Parameters of a GP
The hyper-parameters of a GP are the mean-function and kernel parameters (e.g., length-scales and signal variance) and the noise variance σ_n².
Gaussian Process Training: Hyper-Parameters
Find good hyper-parameters θ (kernel/mean-function parameters ψ, noise variance σ_n²)
§ Place a prior p(θ) on the hyper-parameters
§ Posterior over hyper-parameters:
  p(θ | X, y) = p(θ) p(y | X, θ) / p(y | X),    p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
§ Choose hyper-parameters θ* that maximize the marginal likelihood p(y | X, θ)
GP Training
Maximize the evidence/marginal likelihood (the probability of the data given the hyper-parameters, where the unwieldy f has been integrated out). Also called Maximum Likelihood Type-II.
Training via Marginal Likelihood Maximization
Marginal likelihood (with a prior mean function m(·) ≡ 0):
p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
            = ∫ N(y | f(X), σ_n² I) N(f(X) | 0, K) df = N(y | 0, K + σ_n² I)
Log-marginal likelihood:
log p(y | X, θ) = −½ yᵀ K_θ⁻¹ y − ½ log |K_θ| + const,   K_θ := K + σ_n² I
Gradient of the log-marginal likelihood with respect to the hyper-parameters:
∂ log p(y | X, θ) / ∂θ_i = ½ yᵀ K_θ⁻¹ (∂K_θ/∂θ_i) K_θ⁻¹ y − ½ tr(K_θ⁻¹ ∂K_θ/∂θ_i)
                         = ½ tr( (ααᵀ − K_θ⁻¹) ∂K_θ/∂θ_i ),   α := K_θ⁻¹ y
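A minimal sketch of type-II maximum likelihood for a squared-exponential kernel, optimizing the length-scale, signal variance, and noise variance in log-space with scipy; the data and initial values are placeholders, and gradients are handled numerically by the optimizer rather than via the analytic expression above:

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_marginal_likelihood(log_params, X, y):
    """-log p(y | X, theta) for a squared-exponential kernel, theta in log-space."""
    ell, sf2, sn2 = np.exp(log_params)            # length-scale, signal var, noise var
    K = sf2 * np.exp(-0.5 * (X[:, None] - X[None, :])**2 / ell**2)
    K_theta = K + sn2 * np.eye(len(X))
    L = np.linalg.cholesky(K_theta)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # 0.5 y^T K^-1 y + 0.5 log|K| + 0.5 N log(2 pi)
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * len(X) * np.log(2 * np.pi)

# placeholder data; in practice X, y are the training set
rng = np.random.default_rng(6)
X = rng.uniform(-4, 4, 30)
y = np.sin(X) + 0.1 * rng.standard_normal(30)

res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B")
ell_opt, sf2_opt, sn2_opt = np.exp(res.x)
```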
Example: Training Data
[Figure: training data (x, f(x)) for the marginal-likelihood example.]
Example: Marginal Likelihood Contour
[Figure: contours of the log-marginal likelihood as a function of the hyper-parameters; one axis is the log-noise log(σ_n).]
https://drafts.distill.pub/gp/
Marginal Likelihood and Parameter Learning
Model Selection—Mean Function and Kernel
Example
[Figure: GP posterior fits to the same dataset with four different kernels.]
§ Four different kernels (mean function fixed to m ≡ 0)
§ MAP hyper-parameters for each kernel
§ Log-marginal likelihood (LML) values for each (optimized) model:
  Constant kernel: LML = −1.1073
  Linear kernel: LML = −1.0065
  Matérn kernel: LML = −0.8625
  Gaussian kernel: LML = −0.69308
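A comparison of this kind can be reproduced with, for instance, scikit-learn's GP implementation; this is an alternative implementation rather than the one used for the slide figures, and the kernels, noise level, and data below are placeholders:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, DotProduct, Matern, RBF

rng = np.random.default_rng(7)
X = rng.uniform(-4, 4, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)

kernels = {
    "constant": ConstantKernel(),
    "linear": DotProduct(),
    "matern": Matern(nu=1.5),
    "gaussian": RBF(),
}
for name, kernel in kernels.items():
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=0.1**2, normalize_y=True)
    gpr.fit(X, y)   # optimizes the kernel hyper-parameters by maximizing the LML
    print(name, gpr.log_marginal_likelihood_value_)
```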
Application Areas
[Figure: application example with the angle in rad on the horizontal axis and the angular velocity in rad/s on the vertical axis.]
Limitations of Gaussian Processes
Exact GP training and inference scale cubically in the number of training points N (O(N³) computation, O(N²) memory), which limits standard GPs to moderately sized datasets.
Tips and Tricks for Practitioners
https://drafts.distill.pub/gp
Appendix
The Gaussian Distribution
p(x | µ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x − µ)ᵀ Σ⁻¹ (x − µ))
[Figure: a univariate Gaussian density p(x) and a bivariate Gaussian density, shown as a surface and as contours with samples.]
Conditional
Joint Gaussian:
p(x, y) = N( [µ_x; µ_y],  [[Σ_xx, Σ_xy]; [Σ_yx, Σ_yy]] )
[Figure: joint density p(x, y), an observation of y, and the resulting conditional p(x | y).]
Conditioning on an observed y gives
µ_{x|y} = µ_x + Σ_xy Σ_yy⁻¹ (y − µ_y)
Σ_{x|y} = Σ_xx − Σ_xy Σ_yy⁻¹ Σ_yx
The conditional p(x | y) is again Gaussian, which is computationally convenient.
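A small sketch of these conditioning formulas; the joint mean and covariance below are placeholder values:

```python
import numpy as np

def condition_gaussian(mu, Sigma, idx_x, idx_y, y_obs):
    """Condition a joint Gaussian N(mu, Sigma) on observing the components idx_y = y_obs."""
    mu_x, mu_y = mu[idx_x], mu[idx_y]
    Sxx = Sigma[np.ix_(idx_x, idx_x)]
    Sxy = Sigma[np.ix_(idx_x, idx_y)]
    Syy = Sigma[np.ix_(idx_y, idx_y)]
    G = Sxy @ np.linalg.inv(Syy)              # gain matrix Sigma_xy Sigma_yy^-1
    mu_cond = mu_x + G @ (y_obs - mu_y)       # conditional mean
    Sigma_cond = Sxx - G @ Sxy.T              # conditional covariance
    return mu_cond, Sigma_cond

# placeholder joint distribution over (x, y)
mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu_c, Sigma_c = condition_gaussian(mu, Sigma, idx_x=[0], idx_y=[1], y_obs=np.array([2.0]))
```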
Marginal
[Figure: joint density p(x, y) and the marginal over x.]
Marginal distribution:
p(x) = ∫ p(x, y) dy = N(µ_x, Σ_xx)
The Gaussian Distribution in the Limit
Marginal and Conditional in the Limit
§ In practice, we consider finite training and test data x_train, x_test
§ Then x = {x_train, x_test, x_other}  (x_other plays the role of x̃ from the previous slide)
  p(x) = N( [µ_train; µ_test; µ_other],
            [[Σ_train, Σ_train,test, Σ_train,other];
             [Σ_test,train, Σ_test, Σ_test,other];
             [Σ_other,train, Σ_other,test, Σ_other]] )
  p(x_train, x_test) = ∫ p(x_train, x_test, x_other) dx_other
  p(x_test | x_train) = N(µ*, Σ*)
Gaussian Process Training: Hierarchical Inference
p(f | X, y, θ) = p(y | X, f) p(f | X, θ) / p(y | X, θ)
p(y | X, θ) = ∫ p(y | f, X) p(f | X, θ) df
GP as the Limit of an Infinite RBF Network
Consider the universal function approximator
f(x) = lim_{N→∞} (1/N) Σ_{i∈Z} Σ_{n=1}^{N} γ_n exp(−(x − (i + n/N))² / λ²),   x ∈ R, λ ∈ R⁺
with γ_n ∼ N(0, 1) (random weights)
⟹ Gaussian-shaped basis functions (with variance λ²/2) everywhere on the real axis
In the limit, the sum becomes an integral against a random weight function γ(s):
f(x) = Σ_{i∈Z} ∫_i^{i+1} γ(s) exp(−(x − s)² / λ²) ds = ∫_{−∞}^{∞} γ(s) exp(−(x − s)² / λ²) ds
§ Mean: E[f(x)] = 0
§ Covariance: Cov[f(x), f(x')] = θ₁² exp(−(x − x')² / (2λ²)) for a suitable θ₁²
⟹ A GP with zero mean and a Gaussian covariance function