
Introduction to Gaussian Process Models

César Lincoln Cavalcante Mattos

Federal University of Ceará (UFC)


Department of Teleinformatics Engineering (DETI)
Graduate Program in Teleinformatics Engineering (PPGETI)

November 2015

Agenda

1 Introduction to GPs
- Why GPs?
- Basic Definitions
- GPs for Regression
- Covariance Function and Hyperparameters Optimization
- From Feature Space to GPs
2 Dynamical System Identification
3 Advanced Topics
- Sparse Models
- Classification
- Robust Learning
- Unsupervised Learning
- Deep Models
- More Nonlinear Dynamical Models
4 Conclusion

Where I come from

Fortaleza, Ceará, Brazil

2.55 million inhabitants.
5th largest city in Brazil.
34 km of beaches.
Around 25-30 °C all year.
2nd Brazilian tourism destination.

Photos: Beira Mar Avenue; Iracema guerreira statue; jangada at the sunset; Dragão do Mar Center of Art and Culture.

Federal University of Ceará (UFC)


Graduate Program in Teleinformatics Engineering (PPGETI)

UFC
8 campi.
114 undergraduate courses.
146 graduate courses.
2,150 professors.
26,800 undergraduate students.
6,000 graduate students.

PPGETI
200 master's dissertations.
75 PhD theses.

CENTAURO - Center of Reference for Automation and Robotics

Automation, Robotics, Electromagnetic Compatibility, Industrial Processes and Machine Learning.
20 collaborators, around 1/3 working with ML, led by Prof. Guilherme Barreto.
International collaboration with Portugal, Germany, Finland and England.


Basics

Why Gaussian Processes?

Parametric models act as a bottleneck between the training set and the predictions.
The complexity of the model should grow with the amount of data available.
Balance between data fit and regularization.
Principled way to find the (few) hyperparameters.
Models uncertainty with probability distributions (Bayesian treatment).
Most of the theoretical background comes from the useful properties of the Gaussian distribution.


Basics

Multivariate Gaussian Distribution

Definition
If a vector of random variables f ∈ R^N follows a multivariate Gaussian distribution, we can express it by

p(\mathbf{f} \mid \boldsymbol{\mu}, \mathbf{K}) = \frac{1}{(2\pi)^{N/2} |\mathbf{K}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{f} - \boldsymbol{\mu})^\top \mathbf{K}^{-1} (\mathbf{f} - \boldsymbol{\mu}) \right),   (1)

where the distribution is completely defined by its mean vector μ ∈ R^N and its covariance matrix K ∈ R^{N×N}.


Basics

Two Important Properties


Consider the following collection of random variables:

\mathbf{f} = \begin{bmatrix} \mathbf{f}_1 \\ \mathbf{f}_2 \end{bmatrix} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}), \quad \boldsymbol{\mu} = \begin{bmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{bmatrix}, \quad \mathbf{K} = \begin{bmatrix} \mathbf{K}_{11} & \mathbf{K}_{12} \\ \mathbf{K}_{21} & \mathbf{K}_{22} \end{bmatrix}.   (2)

Marginalization
The observation of a larger collection of variables does not affect the distribution of smaller subsets, which implies that f_1 ~ N(μ_1, K_11) and f_2 ~ N(μ_2, K_22).

Conditioning
Conditioning on Gaussians results in a new Gaussian distribution, given by

p(\mathbf{f}_1 \mid \mathbf{f}_2 = \mathbf{y}) = \mathcal{N}(\mathbf{f}_1 \mid \boldsymbol{\mu}_1 + \mathbf{K}_{12} \mathbf{K}_{22}^{-1} (\mathbf{y} - \boldsymbol{\mu}_2),\ \mathbf{K}_{11} - \mathbf{K}_{12} \mathbf{K}_{22}^{-1} \mathbf{K}_{21}).   (3)
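The conditioning identity in Eq. (3) is the computational core of GP prediction. A minimal numpy sketch checking it on an arbitrary 3-D joint Gaussian (all numbers below are made up for illustration):

```python
import numpy as np

# Arbitrary example: a 3-D Gaussian partitioned into f1 = (f[0],), f2 = (f[1], f[2]).
mu = np.array([0.0, 1.0, -1.0])
K = np.array([[2.0, 0.8, 0.3],
              [0.8, 1.5, 0.5],
              [0.3, 0.5, 1.0]])
i1, i2 = [0], [1, 2]
y = np.array([0.5, -0.2])          # observed value of f2

K11 = K[np.ix_(i1, i1)]
K12 = K[np.ix_(i1, i2)]
K22 = K[np.ix_(i2, i2)]

# Eq. (3): mean = mu1 + K12 K22^{-1} (y - mu2), cov = K11 - K12 K22^{-1} K21
cond_mean = mu[i1] + K12 @ np.linalg.solve(K22, y - mu[i2])
cond_cov = K11 - K12 @ np.linalg.solve(K22, K12.T)
print(cond_mean, cond_cov)
```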

Basics

Gaussian Process

Definition
A GP defines a distribution over functions

f : \mathcal{X} \rightarrow \mathbb{R},   (4)

such that, for any finite subset {x_1, x_2, ..., x_N} ⊂ X of the domain, the vector of evaluations f = [f(x_1), f(x_2), ..., f(x_N)]^T follows a multivariate Gaussian distribution:

\mathbf{f} \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K}).   (5)

In the infinite case, we have a GP prior for the function f(·).
By the marginalization property, we can analyze the infinite object f(·) by analyzing any finite subset f.
The vector of evaluations f is a single sample from a GP.


Basics

Samples from a GP

Figure 1 : Samples from the GP prior.
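For reference, samples like those in Figure 1 can be generated with a few lines of numpy. A minimal sketch, assuming a 1-D input grid and illustrative hyperparameter values (not necessarily those used in the figure):

```python
import numpy as np

def se_kernel(xa, xb, sigma_f2=1.0, w=1.0):
    # k(x, x') = sigma_f^2 exp(-0.5 w^2 (x - x')^2), Eq. (10) with D = 1
    return sigma_f2 * np.exp(-0.5 * w**2 * (xa[:, None] - xb[None, :])**2)

x = np.linspace(-5, 5, 200)
K = se_kernel(x, x) + 1e-8 * np.eye(len(x))   # jitter for numerical stability
L = np.linalg.cholesky(K)
samples = L @ np.random.randn(len(x), 3)      # three functions from the prior
```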


GPs for Regression

Nonlinear Regression

Let a dataset be given by

\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N} = (\mathbf{X}, \mathbf{y}),   (6)

where x_i ∈ R^D are the inputs, X ∈ R^{N×D}, and y ∈ R^N are the observed outputs.
A general nonlinear task can be modeled as

y_i = f(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2).   (7)

The values f_i = f(x_i) are not observed directly and f(·) is unknown.


GPs for Regression

Standard GP modeling
Choose a multivariate Gaussian prior for the unknown function:

\mathbf{f} = f(\mathbf{X}) \sim \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K}),   (8)
K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j),   (9)

where K ∈ R^{N×N} is the covariance matrix, obtained with a kernel function k(·,·).
A common choice is the squared exponential function:

k(\mathbf{x}_i, \mathbf{x}_j) = \sigma_f^2 \exp\left( -\frac{1}{2} \sum_{d=1}^{D} w_d^2 (x_{id} - x_{jd})^2 \right),   (10)

where the vector θ = [σ_f², w_1², ..., w_D²]^T is comprised of the hyperparameters which characterize the covariance of the model.
The hyperparameters w_1², ..., w_D² are responsible for the so-called automatic relevance determination (ARD) of the input dimensions.
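A direct numpy transcription of Eq. (10) might look as follows; the inputs and hyperparameter values are placeholders for illustration:

```python
import numpy as np

def ard_se_kernel(X1, X2, sigma_f2, w2):
    """ARD squared exponential kernel of Eq. (10).
    X1: (N1, D), X2: (N2, D); w2: (D,) vector of squared ARD weights."""
    diff = X1[:, None, :] - X2[None, :, :]             # (N1, N2, D)
    return sigma_f2 * np.exp(-0.5 * np.sum(w2 * diff**2, axis=-1))

# A weight w_d^2 near zero effectively switches off input dimension d.
X = np.random.randn(5, 3)
K = ard_se_kernel(X, X, sigma_f2=1.0, w2=np.array([1.0, 0.5, 1e-6]))
```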



GPs for Regression

Standard GP modeling

Likelihood
Considering the observation of a Gaussian noisy version of f, we have

p(\mathbf{y} \mid \mathbf{f}) = \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma_n^2 \mathbf{I}),   (11)

where I ∈ R^{N×N} is the identity matrix.

Marginal likelihood
The marginal distribution of y is calculated by integrating out f:

p(\mathbf{y} \mid \mathbf{X}) = \int p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f} \mid \mathbf{X})\, d\mathbf{f} = \int \mathcal{N}(\mathbf{y} \mid \mathbf{f}, \sigma_n^2 \mathbf{I})\, \mathcal{N}(\mathbf{f} \mid \mathbf{0}, \mathbf{K})\, d\mathbf{f}   (12)
= \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K} + \sigma_n^2 \mathbf{I}).   (13)

GPs for Regression

Standard GP modeling
Inference
Inference for f_*, given a new input x_*, is obtained by conditioning:

\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_{f*} \\ \mathbf{k}_{*f} & k_{**} \end{bmatrix} \right),   (14)

p(f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{k}_{*f} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},\ k_{**} - \mathbf{k}_{*f} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_{f*}),   (15)

where

\mathbf{K} = k(\mathbf{X}, \mathbf{X}),   (16)
\mathbf{k}_{*f} = [k(\mathbf{x}_*, \mathbf{x}_1), \ldots, k(\mathbf{x}_*, \mathbf{x}_N)],   (17)
\mathbf{k}_{f*} = \mathbf{k}_{*f}^\top,   (18)
k_{**} = k(\mathbf{x}_*, \mathbf{x}_*).   (19)
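Putting Eqs. (14)-(19) together, prediction reduces to a few linear-algebra operations. A minimal sketch following the standard Cholesky-based recipe; here kernel stands for any covariance function, such as the illustrative sketches above:

```python
import numpy as np

def gp_predict(X, y, Xs, kernel, sigma_n2):
    """Predictive mean and variance of Eq. (15) via a Cholesky factorization."""
    Ky = kernel(X, X) + sigma_n2 * np.eye(len(X))     # K + sigma_n^2 I
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ksf = kernel(Xs, X)                               # k_{*f} for each test input
    mean = Ksf @ alpha                                # Eq. (15), predictive mean
    V = np.linalg.solve(L, Ksf.T)
    var = np.diag(kernel(Xs, Xs)) - np.sum(V**2, axis=0)  # Eq. (15), variance
    return mean, var
```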

GPs for Regression

Standard GP modeling

Figure 2 : Posterior predictive distribution of a GP.


GPs for Regression

Samples from a GP

Figure 3 : Samples from the GP prior, with σ_f² = 1, w = 1 and σ_n² = 0.


GPs for Regression

Samples from a GP

Figure 4 : Samples from the posterior (after the observation of y), without noise (σ_n² = 0).

GPs for Regression

Samples from a GP

Figure 5 : Samples from the posterior (after the observation of y), with noise (σ_n² = 0.01).

GPs for Regression

Samples from a GP

Figure 6 : Samples with σ_f² = 0.1, w = 1 and σ_n² = 0. Figure 7 : Samples with σ_f² = 2, w = 1 and σ_n² = 0.

GPs for Regression

Samples from a GP

Figure 8 : Samples with σ_f² = 1, w = 2 and σ_n² = 0. Figure 9 : Samples with σ_f² = 1, w = 0.5 and σ_n² = 0.

Covariance Matrix

Hyperparameters Optimization
The noise variance σ_n² is included in the vector of hyperparameters θ, which is optimized by maximizing the marginal log-likelihood L(θ) = log p(y | X, θ), the so-called evidence of the model:

\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{2} \underbrace{\log |\mathbf{K} + \sigma_n^2 \mathbf{I}|}_{\text{model capacity}} - \frac{1}{2} \underbrace{\mathbf{y}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}}_{\text{data fitting}} - \frac{N}{2} \log(2\pi).   (20)

Figure 10 : Bayesian model selection.



Covariance Matrix

Hyperparameters Optimization

The hyperparameters are optimized with the help of analytical gradients:

\frac{\partial \mathcal{L}(\boldsymbol{\theta})}{\partial \theta_i} = -\frac{1}{2} \mathrm{Tr}\left( (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \frac{\partial (\mathbf{K} + \sigma_n^2 \mathbf{I})}{\partial \theta_i} \right) + \frac{1}{2} \mathbf{y}^\top (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \frac{\partial (\mathbf{K} + \sigma_n^2 \mathbf{I})}{\partial \theta_i} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}.   (21)

No need for cross-validation to perform model selection.
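In practice, Eqs. (20)-(21) are handed to a gradient-based optimizer. A sketch of the objective with scipy, log-parameterized so all hyperparameters stay positive; for brevity it relies on the optimizer's numerical gradients where the analytical ones of Eq. (21) would normally be supplied, and the kernel signature is an assumption of this sketch:

```python
import numpy as np
from scipy.optimize import minimize

def neg_evidence(log_theta, X, y, kernel):
    """Negative of Eq. (20); log_theta = [log sigma_f^2, log w^2, log sigma_n^2]."""
    sigma_f2, w2, sigma_n2 = np.exp(log_theta)
    Ky = kernel(X, X, sigma_f2, w2) + sigma_n2 * np.eye(len(X))
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (np.sum(np.log(np.diag(L)))      # 0.5 log|K + sigma_n^2 I|
            + 0.5 * y @ alpha               # 0.5 y^T (K + sigma_n^2 I)^{-1} y
            + 0.5 * len(y) * np.log(2 * np.pi))

# res = minimize(neg_evidence, x0=np.zeros(3), args=(X, y, kernel_fn),
#                method='L-BFGS-B')   # kernel_fn: any kernel with this signature
```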


Covariance Matrix

Kernel Function

The choice of the kernel function directly affects the characteristics of the model.
The squared exponential, for example, imposes a certain degree of smoothness.
Any function that generates a positive semidefinite covariance matrix is acceptable.
New kernel functions can be created by linear combination of valid kernels, increasing the expressiveness of the model (see the sketch below).
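For instance, the sum of two valid kernels is again a valid kernel. A small sketch combining a squared exponential with a periodic kernel (1-D inputs, illustrative hyperparameters):

```python
import numpy as np

def se(xa, xb, sigma_f2=1.0, w=1.0):
    return sigma_f2 * np.exp(-0.5 * w**2 * (xa[:, None] - xb[None, :])**2)

def periodic(xa, xb, sigma_f2=1.0, ell=1.0, period=1.0):
    d = np.pi * np.abs(xa[:, None] - xb[None, :]) / period
    return sigma_f2 * np.exp(-2.0 * np.sin(d)**2 / ell**2)

def se_plus_periodic(xa, xb):
    # A sum of positive semidefinite kernels is positive semidefinite.
    return se(xa, xb) + periodic(xa, xb)
```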


Covariance Matrix

Samples from different kernel functions

Figure: samples from the squared exponential, linear, Matérn 3/2 and periodic kernels.

Alternative View

From Feature Space to GPs


Let φ : R^D → R^Q be a mapping function and w ∈ R^Q a vector of weights:

y_i = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + \epsilon_i, \quad \epsilon_i \sim \mathcal{N}(0, \sigma_n^2).   (22)

If we consider the prior p(w) = N(w | 0, Σ_w) and Φ = φ(X) is a matrix where each row is given by φ(x_i)^T, we have the posterior

p(\mathbf{w} \mid \mathbf{y}, \mathbf{X}) = \frac{p(\mathbf{y} \mid \mathbf{w}, \mathbf{X})\, p(\mathbf{w})}{p(\mathbf{y} \mid \mathbf{X})} = \mathcal{N}\left( \mathbf{w} \,\middle|\, \frac{1}{\sigma_n^2} \mathbf{A}^{-1} \boldsymbol{\Phi}^\top \mathbf{y},\ \mathbf{A}^{-1} \right),   (23)

where A = (1/σ_n²) Φ^T Φ + Σ_w^{-1} ∈ R^{Q×Q}.
Prediction is performed by averaging the weights:

p(f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \int p(f_* \mid \mathbf{x}_*, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{y}, \mathbf{X})\, d\mathbf{w}   (24)
= \mathcal{N}\left( f_* \,\middle|\, \frac{1}{\sigma_n^2} \boldsymbol{\phi}(\mathbf{x}_*)^\top \mathbf{A}^{-1} \boldsymbol{\Phi}^\top \mathbf{y},\ \boldsymbol{\phi}(\mathbf{x}_*)^\top \mathbf{A}^{-1} \boldsymbol{\phi}(\mathbf{x}_*) \right).   (25)

Alternative View

From Feature Space to GPs


We can rewrite the predictive distribution:

p(f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}\big( f_* \,\big|\, \boldsymbol{\phi}(\mathbf{x}_*)^\top \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top (\boldsymbol{\Phi} \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},
\boldsymbol{\phi}(\mathbf{x}_*)^\top \boldsymbol{\Sigma}_w \boldsymbol{\phi}(\mathbf{x}_*) - \boldsymbol{\phi}(\mathbf{x}_*)^\top \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top (\boldsymbol{\Phi} \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top + \sigma_n^2 \mathbf{I})^{-1} \boldsymbol{\Phi} \boldsymbol{\Sigma}_w \boldsymbol{\phi}(\mathbf{x}_*) \big).   (26)

Now we apply the kernel trick:

\boldsymbol{\Phi} \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top = (\boldsymbol{\Phi} \boldsymbol{\Sigma}_w^{1/2})(\boldsymbol{\Sigma}_w^{1/2} \boldsymbol{\Phi}^\top) = k(\mathbf{X}, \mathbf{X}) = \mathbf{K},   (27)
\boldsymbol{\phi}(\mathbf{x}_*)^\top \boldsymbol{\Sigma}_w \boldsymbol{\Phi}^\top = k(\mathbf{x}_*, \mathbf{X}) = \mathbf{k}_{*f},   (28)
\boldsymbol{\Phi} \boldsymbol{\Sigma}_w \boldsymbol{\phi}(\mathbf{x}_*) = k(\mathbf{X}, \mathbf{x}_*) = \mathbf{k}_{f*},   (29)
\boldsymbol{\phi}(\mathbf{x}_*)^\top \boldsymbol{\Sigma}_w \boldsymbol{\phi}(\mathbf{x}_*) = k(\mathbf{x}_*, \mathbf{x}_*) = k_{**}.   (30)

Finally, we get the standard GP prediction expression:

p(f_* \mid \mathbf{x}_*, \mathbf{X}, \mathbf{y}) = \mathcal{N}(\mathbf{k}_{*f} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y},\ k_{**} - \mathbf{k}_{*f} (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_{f*}).   (31)
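The equivalence between Eqs. (25) and (31) can be checked numerically: with an explicit feature map, the weight-space and kernel predictions coincide. A sketch with quadratic polynomial features (all settings arbitrary):

```python
import numpy as np

np.random.seed(0)
phi = lambda x: np.stack([np.ones_like(x), x, x**2], axis=1)  # quadratic features
Sigma_w, sigma_n2 = np.eye(3), 0.1

X, y = np.random.randn(20), np.random.randn(20)
xs = np.array([0.3])
Phi, phis = phi(X), phi(xs)

# Weight-space predictive mean, Eq. (25).
A = Phi.T @ Phi / sigma_n2 + np.linalg.inv(Sigma_w)
mean_w = phis @ np.linalg.solve(A, Phi.T @ y) / sigma_n2

# Function-space mean with k(a, b) = phi(a)^T Sigma_w phi(b), Eq. (31).
K = Phi @ Sigma_w @ Phi.T
ksf = phis @ Sigma_w @ Phi.T
mean_f = ksf @ np.linalg.solve(K + sigma_n2 * np.eye(len(X)), y)

assert np.allclose(mean_w, mean_f)   # the two views give the same prediction
```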

Introduction

Dynamical System Identification


Dynamical Systems
A process whose states present a temporal dependency, i.e., its outputs are a function of its past.

Black-box modeling
The model is obtained only from the system's inputs and outputs.

Figure 11 : Model M obtained after the identification of the system P.



Introduction

Dynamical System Identification


NARX (nonlinear autoregressive with exogenous inputs) Models
Given a dynamical system with input u_i and output y_i, we have:

\mathbf{x}_i = [y_{i-1}, y_{i-2}, \ldots, y_{i-L_y}, u_{i-1}, u_{i-2}, \ldots, u_{i-L_u}]^\top,   (32)
y_i = f(\mathbf{x}_i) + \epsilon_i,   (33)

where x_i is the regressor vector (or state), L_y and L_u are some specified lags, f(·) is the transition function and ε_i is a Gaussian noise. A regressor-construction sketch follows Figure 12.

Figure 12 : Structure of an autoregressive model.
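Building the regressor matrix of Eq. (32) from recorded input/output sequences is a simple slicing exercise. A sketch (the function name and lag handling are illustrative):

```python
import numpy as np

def build_narx_regressors(u, y, Ly, Lu):
    """Stack regressors x_i = [y_{i-1},...,y_{i-Ly}, u_{i-1},...,u_{i-Lu}] (Eq. 32)."""
    start = max(Ly, Lu)
    X, targets = [], []
    for i in range(start, len(y)):
        X.append(np.concatenate([y[i - Ly:i][::-1],   # most recent output first
                                 u[i - Lu:i][::-1]])) # most recent input first
        targets.append(y[i])
    return np.array(X), np.array(targets)

# Example: X, t = build_narx_regressors(u, y, Ly=2, Lu=2); then fit a GP on (X, t).
```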



GP-NARX

Dynamical System Identification

Figure 13 : GP-NARX model for system identification.



Experiments

Dynamical System Identification

Validation
1-step ahead prediction: the prediction is performed based on past inputs and known observed outputs.
Infinite-step ahead prediction (free simulation): the prediction is performed based on past inputs and past predictions (see the sketch after the metrics below).

Metrics
Root Mean Square Error: \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \mu_i)^2}.
Negative Log Density Error: \mathrm{NLD} = \frac{1}{2N} \sum_{i=1}^{N} \left[ \log(2\pi) + \log(\sigma_i^2) + \frac{(y_i - \mu_i)^2}{\sigma_i^2} \right].
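A sketch of the free-simulation loop, reusing the NARX lag convention above and propagating only the predictive mean (full uncertainty propagation is beyond this sketch; predict is a stand-in for a trained GP's mean prediction):

```python
import numpy as np

def free_simulation(u, y_init, Ly, Lu, predict):
    """Free simulation: feed predictions back in place of observed outputs.
    y_init: the first max(Ly, Lu) observed outputs, used only to start;
    predict(x): returns the GP predictive mean for one regressor x."""
    y_sim = list(y_init)
    for i in range(len(y_init), len(u)):
        x = np.concatenate([np.array(y_sim[i - Ly:i])[::-1],   # past predictions
                            u[i - Lu:i][::-1]])                # past inputs
        y_sim.append(predict(x))
    return np.array(y_sim)

# rmse = np.sqrt(np.mean((y_test - y_sim)**2))   # RMSE as defined above
```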

Artificial datasets¹

1. Output: y_i = y_{i-1} y_{i-2} (y_{i-1} + 2.5) / (1 + y_{i-1}² + y_{i-2}²) + u_{i-1}, noise N(0, 0.29).
   Estimation: u_i = U(-2, 2), 300 samples. Test: u_i = sin(2πi/25), 100 samples.

2. Output: y_i = y_{i-1} / (1 + y_{i-1}²) + u_{i-1}³, noise N(0, 0.65).
   Estimation: u_i = U(-2, 2), 300 samples. Test: u_i = sin(2πi/25) + sin(2πi/10), 100 samples.

3. Output: y_i = 0.8 y_{i-1} + (u_{i-1} - 0.8) u_{i-1} (u_{i-1} + 0.5), noise N(0, 0.07).
   Estimation: u_i = U(-1, 1), 300 samples. Test: u_i = sin(2πi/25), 100 samples.

4. Output: y_i = y_{i-1} - 0.5 tanh(y_{i-1} + u_{i-1}³), noise N(0, 0.0025).
   Estimation: u_i = N(u_i | 0, 1), -1 ≤ u_i ≤ 1, 150 samples. Test: same distribution, 150 samples.

5. Output: y_i = 0.3 y_{i-1} + 0.6 y_{i-2} + 0.3 sin(3πu_{i-1}) + 0.1 sin(5πu_{i-1}), noise N(0, 0.18).
   Estimation: u_i = U(-1, 1), 500 samples. Test: u_i = sin(2πi/250), 500 samples.

¹ Narendra, K.S., Parthasarathy, K., Identification and control of dynamical systems using neural networks, 1990; Kocijan, J. et al., Dynamic systems identification with Gaussian processes, 2005.
Artificial-1 dataset

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.


Artificial-2 dataset

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.


Artificial-3 dataset

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.


Artificial-4 dataset

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX - 1-step ahead prediction. GP-NARX - Free simulation.


Artificial-5 dataset (SE kernel)

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX SE - 1-step ahead prediction. GP-NARX SE - Free simulation.


Artificial-5 dataset (SE+Periodic kernel)

Linear ARX - 1-step ahead prediction. Linear ARX - Free simulation.

GP-NARX SE+Periodic - 1-step ahead prediction. GP-NARX SE+Periodic - Free simulation.


Mackey-Glass Time Series (100 training samples)

GP-NAR - SE+Linear kernel - 1-step ahead prediction.

GP-NAR - SE+Linear kernel - Free simulation.



Robotics and Control

GP for Learning in Robotics and Control²


Probabilistic Inference for Learning Control (PILCO)
Model-based policy search for autonomous learning.
Considers model uncertainty with the use of GPs.

Figure 14 : Examples of systems controlled with PILCO. Videos available at http://www.youtube.com/user/PilcoLearner

² Deisenroth, M. P., Fox, D. and Rasmussen, C. E., Gaussian processes for data-efficient learning in robotics and control, 2015.

Sparse Models

Sparse GP Approximations

In order to use GPs for large datasets, we need to avoid their O(N³) complexity and O(N²) storage demands.
Most of the sparse approximation schemes consist of replacing the full kernel matrix K_N by K_{NM} K_M^{-1} K_{MN}, obtaining O(M²N) complexity and O(M²) storage demands, where M < N (see the sketch below).

Figure 15 : Visualization of the sparse approximation.
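The low-rank replacement mentioned above can be illustrated with a Nyström-style construction from M inducing points Z (a sketch of the idea; practical sparse GP methods such as FITC or variational inducing-point approximations build on it):

```python
import numpy as np

def nystrom_approx(X, Z, kernel):
    """Low-rank approximation K_N ~ K_NM K_M^{-1} K_MN from inducing points Z."""
    Kmm = kernel(Z, Z) + 1e-8 * np.eye(len(Z))    # jitter for invertibility
    Knm = kernel(X, Z)
    return Knm @ np.linalg.solve(Kmm, Knm.T)      # rank-M matrix, M = len(Z)

# Downstream algebra on the rank-M structure costs O(M^2 N) instead of O(N^3).
```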


Classification

GPs for Binary Classification


The observed output (a class probability) is related to the latent function f(·) by a non-Gaussian likelihood p(y_i = 1 | f(x_i)):

Figure 16 : Squashing a latent function into a class probability.

Need to apply approximate inference methods:


- Sampling methods (e.g. MCMC).
- Laplace approximation.
- Expectation Propagation (EP).
- Variational Bayes (VB).

Robust Learning

Training in the Presence of Outliers


We can use heavy-tailed distributions to account for non-Gaussian noise in the form of outliers.
Approximate inference is also necessary.

Figure 17 : Comparison between the Gaussian and heavy-tailed distributions.



Robust Learning

Robust Latent NARX GP (GP-RLARX)³

Performs autoregression with latent (non-observed) variables, instead of the outputs.
Chooses a Student-t likelihood, expressed as a mixture of Gaussians.
Uses a variational approximation for inference.

x_i = f(x_{i-1}, \ldots, x_{i-L_x}, u_{i-1}, \ldots, u_{i-L_u}) + \epsilon_i^{(x)}, \quad \epsilon_i^{(x)} \sim \mathcal{N}(\epsilon_i^{(x)} \mid 0, \sigma_x^2),   (34)
y_i = x_i + \epsilon_i^{(y)}, \quad \epsilon_i^{(y)} \sim \mathcal{N}(\epsilon_i^{(y)} \mid 0, \tau_i^{-1}), \quad \tau_i \sim \Gamma(\tau_i \mid \alpha, \beta).   (35)

³ Mattos, C. L. C., et al., Latent Autoregressive Gaussian Process Models for Robust System Identification, submitted to DYCOPS 2016.
GP-RLARX for System Identification

Artificial 1. Artificial 2. Artificial 3.

Artificial 4. Artificial 5.

RMSE values for free simulation with different levels of contamination by outliers.

Unsupervised Learning

GPs for Unsupervised Learning

We have observed (noisy) data but do not know the latent inputs which generated it.
Related to the problem of nonlinear dimensionality reduction.

Gaussian Process Latent Variable Model (GPLVM)⁴

- GP prior over the unknown mapping f(·) to the outputs.
- Prior over the latent input space, e.g. p(\mathbf{X}) = \prod_{i=1}^{N} \mathcal{N}(\mathbf{x}_i \mid \mathbf{0}, \sigma_x^2 \mathbf{I}).
- Propagation of the uncertainty over a nonlinear function is intractable.
- Usually applies variational approximations⁵.

⁴ Lawrence, N., Gaussian process latent variable models for visualisation of high dimensional data, 2004.
⁵ Titsias, M., Lawrence, N., Bayesian Gaussian process latent variable model, 2010.

Deep Models

Deep GPs⁶

If we consider a GPLVM with a GP prior on the inputs, we have a Deep GP model with two layers.
Multiple hidden layers provide a powerful hierarchical structure for both deep unsupervised and supervised learning.
Inference is usually performed with a variational approximation.

Figure 18 : Deep GP hierarchical structure.

⁶ Damianou, A. and Lawrence, N., Deep Gaussian Processes, 2013.

More Nonlinear Dynamical Models

VGPDS (Variational GP Dynamical Systems)⁷

Figure 19 : Dynamical latent variables GP model.

⁷ Damianou, A., et al., Variational Gaussian Process Dynamical Systems, 2011.

More Nonlinear Dynamical Models

GP-SSM (GP State-Space Models)⁸

Figure 20 : State-space model with GP transition.

⁸ Frigola, R. et al., Variational Gaussian Process State-Space Models, 2014.

More Nonlinear Dynamical Models

RGP (Recurrent Gaussian Process)⁹

A Deep GP model with a latent autoregressive structure and a specific variational approximation:

\mathbf{x}_i^{(h)} = f^{(h)}\left( \hat{\mathbf{x}}_i^{(h)} \right) + \epsilon_i^{(h)}, \quad f^{(h)} \sim \mathcal{N}\left( \mathbf{0}, \mathbf{K}_f^{(h)} \right), \quad 1 \le h \le H,
\mathbf{y}_i = f^{(H+1)}\left( \hat{\mathbf{x}}_i^{(H+1)} \right) + \epsilon_i^{(H+1)}, \quad f^{(H+1)} \sim \mathcal{N}\left( \mathbf{0}, \mathbf{K}_f^{(H+1)} \right),

where

\hat{\mathbf{x}}_i^{(h)} = \left[ x_{i-1}^{(1)}, \ldots, x_{i-L}^{(1)}, u_{i-1}, \ldots, u_{i-L_u} \right]^\top, \quad \text{if } h = 1,
\hat{\mathbf{x}}_i^{(h)} = \left[ x_{i-1}^{(h)}, \ldots, x_{i-L}^{(h)}, x_i^{(h-1)}, \ldots, x_{i-L+1}^{(h-1)} \right]^\top, \quad \text{if } 1 < h \le H,
\hat{\mathbf{x}}_i^{(h)} = \left[ x_i^{(H)}, \ldots, x_{i-L+1}^{(H)} \right]^\top, \quad \text{if } h = H + 1.

⁹ Work in progress!
RGP for System Identification (free simulation)

GP-NARX. RGP with 2 hidden layers.

GP-NARX. RGP with 2 hidden layers.



Final Remarks

GP is a powerful nonparametric learning framework.
Versatile method: regression, classification, unsupervised learning, etc.
Can be made even more flexible by the introduction of new kernel functions and architectures.
Optimization of hyperparameters directly from the marginal likelihood.
Computational cost is usually high, but can be alleviated with sparse approximations.
Output is a fully defined distribution.


References

1. NARENDRA, K. S.; PARTHASARATHY, K. Identification and control of dynamical systems using neural networks. IEEE Transactions on Neural Networks, v. 1, n. 1, p. 4-27, 1990.
2. KOCIJAN, J.; GIRARD, A.; BANKO, B.; MURRAY-SMITH, R. Dynamic systems identification with Gaussian processes. Mathematical and Computer Modelling of Dynamical Systems, v. 11, n. 4, p. 411-424, 2005.
3. RASMUSSEN, C. E.; WILLIAMS, C. K. I. Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press, 2006.
4. TITSIAS, M. K.; LAWRENCE, N. D. Bayesian Gaussian process latent variable model. International Conference on Artificial Intelligence and Statistics, p. 207-215, 2010.
5. DAMIANOU, A.; LAWRENCE, N. Deep Gaussian processes. Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, p. 207-215, 2013.
6. FRIGOLA, R.; CHEN, Y.; RASMUSSEN, C. E. Variational Gaussian process state-space models. Advances in Neural Information Processing Systems, p. 3680-3688, 2014.

Thank you for your attention!


Questions?

César Lincoln Cavalcante Mattos


[email protected]