Sparse Gaussian Processes using Pseudo-inputs
Edward Snelson and Zoubin Ghahramani
Abstract
1 Introduction
The Gaussian process (GP) is a popular and elegant method for Bayesian non-linear non-
parametric regression and classification. Unfortunately its non-parametric nature causes
computational problems for large data sets, due to an unfavourable N³ scaling for training,
where N is the number of data points. In recent years there have been many attempts to
make sparse approximations to the full GP in order to bring this scaling down to M²N,
where M ≪ N [1, 2, 3, 4, 5, 6, 7, 8, 9]. Most of these methods involve selecting a subset
of the training points of size M (active set) on which to base computation. A typical way of
choosing such a subset is through some sort of information criterion. For example, Seeger
et al. [7] employ a very fast approximate information gain criterion, which they use to
greedily select points into the active set.
A major problem common to these methods is that they lack a reliable way of learning
kernel hyperparameters, because the active set selection interferes with this learning proce-
dure. Seeger et al. [7] construct an approximation to the full GP marginal likelihood, which
they try to maximize to find the hyperparameters. However, as the authors state, they have
persistent difficulty in practically doing this through gradient ascent. The reason for this
is that reselecting the active set causes non-smooth fluctuations in the marginal likelihood
and its gradients, meaning that they cannot get smooth convergence. Therefore the speed
of active set selection is somewhat undermined by the difficulty of selecting hyperparame-
ters. Inappropriately learned hyperparameters will adversely affect the quality of solution,
especially if one is trying to use them for automatic relevance determination (ARD) [10].
In this paper we circumvent this problem by constructing a GP regression model that en-
ables us to find active set point locations and hyperparameters in one smooth joint optimiza-
tion. The covariance function of our GP is parameterized by the locations of pseudo-inputs
— an active set not constrained to be a subset of the data, found by a continuous optimiza-
tion. This is a further major advantage, since we can improve the quality of our fit by the
fine tuning of their precise locations.
Our model is closely related to several sparse GP approximations, in particular Seeger’s
method of projected latent variables (PLV) [7, 8]. We discuss these relations in section 3.
In principle we could also apply our technique of moving active set points off data points to
approximations such as PLV. However we empirically demonstrate that a crucial difference
between PLV and our method (SPGP) prevents this idea from working for PLV.
We provide here a concise summary of GPs for regression, but see [11, 12, 13, 10] for
more detailed reviews. We have a data set D consisting of N input vectors X = {x_n}_{n=1}^N
of dimension D and corresponding real-valued targets y = {y_n}_{n=1}^N. We place a zero
mean Gaussian process prior on the underlying latent function f(x) that we are trying to
model. We therefore have a multivariate Gaussian distribution on any finite subset of latent
variables; in particular, at X: p(f|X) = N(f|0, K_N), where N(f|m, V) is a Gaussian
distribution with mean m and covariance V. In a Gaussian process the covariance matrix
is constructed from a covariance function, or kernel, K, which expresses some prior notion
of smoothness of the underlying function: [K_N]_{nn'} = K(x_n, x_{n'}). Usually the covariance
function depends on a small number of hyperparameters θ, which control these smoothness
properties. For our experiments later on we will use the standard Gaussian covariance with
ARD hyperparameters:
K(x_n, x_{n'}) = c exp[ −(1/2) Σ_{d=1}^D b_d (x_n^{(d)} − x_{n'}^{(d)})² ],   θ = {c, b} .   (1)

The observed targets are assumed to be the latent function values corrupted by i.i.d. Gaussian noise of variance σ², giving the marginal likelihood

p(y|X, θ, σ²) = N(y | 0, K_N + σ²I) ,   (2)

which is typically used to train the GP by finding a (local) maximum with respect to the
hyperparameters θ and σ².
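To make (1) and (2) concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of the ARD covariance and the full-GP log marginal likelihood; the helper names and the toy data are assumptions made purely for the example.

```python
# Minimal sketch of the ARD covariance (1) and the full-GP log marginal
# likelihood (2). The data and helper names are illustrative, not the paper's code.
import numpy as np

def ard_kernel(X1, X2, c, b):
    """K(x, x') = c * exp(-0.5 * sum_d b_d (x^(d) - x'^(d))^2)."""
    diff = X1[:, None, :] - X2[None, :, :]                 # shape (N1, N2, D)
    return c * np.exp(-0.5 * np.einsum('ijd,d->ij', diff**2, b))

def gp_log_marginal_likelihood(X, y, c, b, sigma2):
    """log N(y | 0, K_N + sigma2 * I), maximised to train the hyperparameters."""
    N = len(X)
    K = ard_kernel(X, X, c, b) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)                              # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_N + sigma2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 log|K_N + sigma2 I|
            - 0.5 * N * np.log(2 * np.pi))

# Toy usage on made-up 2-D data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
print(gp_log_marginal_likelihood(X, y, c=1.0, b=np.array([1.0, 1.0]), sigma2=0.01))
```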
Prediction is made by considering a new input point x and conditioning on the observed
data and hyperparameters. The distribution of the target value at the new point is then:
p(y|x, D, θ) = N( y | k_x^⊤ (K_N + σ²I)^{−1} y ,  K_xx − k_x^⊤ (K_N + σ²I)^{−1} k_x + σ² ) ,   (3)

where [k_x]_n = K(x_n, x) and K_xx = K(x, x). The GP is a non-parametric model, because
the training data are explicitly required at test time in order to construct the predictive
distribution, as is clear from the above expression.
GPs are prohibitive for large data sets because training requires O(N³) time due to the
inversion of the covariance matrix. Once the inversion is done, prediction is O(N) for the
predictive mean and O(N²) for the predictive variance per new test case.
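A hedged sketch of the predictive distribution (3), under the same assumptions as the previous snippet: the O(N³) Cholesky factorisation is done once, after which each test case costs O(N) for the mean and O(N²) for the variance. The isotropic kernel and function names below are illustrative.

```python
# Sketch of full-GP prediction (3): train once in O(N^3), then each test case
# costs O(N) for the mean and O(N^2) for the variance. Illustrative kernel/data.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def gp_fit(X, y, sigma2):
    K = rbf(X, X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # O(N^3), done once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_N + sigma2 I)^{-1} y
    return L, alpha

def gp_predict(Xstar, X, L, alpha, sigma2):
    Kx = rbf(X, Xstar)                                     # [k_x]_n = K(x_n, x); needs all training inputs
    mean = Kx.T @ alpha                                    # O(N) per test case
    v = np.linalg.solve(L, Kx)                             # O(N^2) per test case
    var = np.diag(rbf(Xstar, Xstar)) - np.sum(v**2, axis=0) + sigma2
    return mean, var

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
L, alpha = gp_fit(X, y, sigma2=0.01)
print(gp_predict(np.array([[0.5]]), X, L, alpha, sigma2=0.01))
```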
2 Sparse Pseudo-input Gaussian processes (SPGPs)
In order to derive a sparse model that is computationally tractable for large data sets, which
still preserves the desirable properties of the full GP, we examine in detail the GP predictive
distribution (3). Consider the mean and variance of this distribution as functions of x, the
new input. Regarding the hyperparameters as known and fixed for now, these functions
are effectively parameterized by the locations of the N training input and target pairs,
X and y. In this paper we consider a model with likelihood given by the GP predictive
distribution, and parameterized by a pseudo data set. The sparsity in the model will arise
because we will generally consider a pseudo data set D̄ of size M < N : pseudo-inputs
X̄ = {x̄_m}_{m=1}^M and pseudo targets f̄ = {f̄_m}_{m=1}^M. We have denoted the pseudo targets
f̄ instead of ȳ because, as they are not real observations, it does not make much sense
to include a noise variance for them. They are therefore equivalent to the latent function
values f . The actual observed target value will of course be assumed noisy as before. These
assumptions therefore lead to the following single data point likelihood:
p(y|x, X̄, f̄) = N( y | k_x^⊤ K_M^{−1} f̄ ,  K_xx − k_x^⊤ K_M^{−1} k_x + σ² ) ,   (4)

where [K_M]_{mm'} = K(x̄_m, x̄_{m'}) and [k_x]_m = K(x̄_m, x), for m = 1, . . . , M.
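The following NumPy sketch (ours) evaluates the mean k_x^⊤ K_M^{−1} f̄ and the input-dependent noise K_xx − k_x^⊤ K_M^{−1} k_x + σ² of (4) at a test input; the names Xbar, fbar and spgp_point are illustrative.

```python
# Sketch of the single-point SPGP likelihood (4): a parameterised mean function
# plus input-dependent noise. Xbar, fbar and spgp_point are illustrative names.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def spgp_point(x, Xbar, fbar, sigma2, jitter=1e-8):
    """Mean and variance of p(y | x, Xbar, fbar) in (4)."""
    KM = rbf(Xbar, Xbar) + jitter * np.eye(len(Xbar))      # K_M (jitter for stability)
    kx = rbf(Xbar, x.reshape(1, -1))[:, 0]                 # [k_x]_m = K(xbar_m, x)
    mean = kx @ np.linalg.solve(KM, fbar)                  # k_x^T K_M^{-1} fbar
    var = (rbf(x.reshape(1, -1), x.reshape(1, -1))[0, 0]   # K_xx
           - kx @ np.linalg.solve(KM, kx) + sigma2)        # - k_x^T K_M^{-1} k_x + sigma2
    return mean, var

Xbar = np.array([[-1.0], [0.0], [1.0]])                    # M = 3 pseudo-inputs
fbar = np.array([0.5, -0.2, 0.8])                          # pseudo targets
print(spgp_point(np.array([0.0]), Xbar, fbar, sigma2=0.01))  # near a pseudo-input: noise ~ sigma2
print(spgp_point(np.array([5.0]), Xbar, fbar, sigma2=0.01))  # far away: noise ~ K_xx + sigma2
```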
This can be viewed as a standard regression model with a particular form of parameterized
mean function and input-dependent noise model. The target data are generated i.i.d. given
the inputs, giving the complete data likelihood:
p(y|X, X̄, f̄) = ∏_{n=1}^N p(y_n|x_n, X̄, f̄) = N( y | K_NM K_M^{−1} f̄ ,  Λ + σ²I ) ,   (5)

where [K_NM]_{nm} = K(x_n, x̄_m) and Λ is diagonal with Λ_nn = K_nn − k_n^⊤ K_M^{−1} k_n, the input-dependent noise of (4) evaluated at each training input.
Learning in the model involves finding a suitable setting of the parameters – an appropriate
pseudo data set that explains the real data well. However rather than simply maximize the
likelihood with respect to X̄ and f̄ it turns out that we can integrate out the pseudo targets
f̄ . We place a Gaussian prior on the pseudo targets:
p(f̄|X̄) = N(f̄ | 0, K_M) .   (6)
This is a very reasonable prior because we expect the pseudo data to be distributed in a
very similar manner to the real data, if they are to model them well. It is not easy to place a
prior on the pseudo-inputs and still remain with a tractable model, so we will find these by
maximum likelihood (ML). For the moment though, consider the pseudo-inputs as known.
We find the posterior distribution over pseudo targets f̄ using Bayes rule on (5) and (6):
p(f̄|D, X̄) = N( f̄ | K_M Q_M^{−1} K_MN (Λ + σ²I)^{−1} y ,  K_M Q_M^{−1} K_M ) ,   (7)

where Q_M = K_M + K_MN (Λ + σ²I)^{−1} K_NM.
Given a new input x∗ , the predictive distribution is then obtained by integrating the likeli-
hood (4) with the posterior (7):
p(y_*|x_*, D, X̄) = ∫ df̄ p(y_*|x_*, X̄, f̄) p(f̄|D, X̄) = N(y_* | μ_*, σ_*²) ,   (8)

where μ_* = k_*^⊤ Q_M^{−1} K_MN (Λ + σ²I)^{−1} y
      σ_*² = K_** − k_*^⊤ (K_M^{−1} − Q_M^{−1}) k_* + σ².
Note that inversion of the matrix Λ + σ²I is not a problem because it is diagonal. The
computational cost is dominated by the matrix multiplication K_MN (Λ + σ²I)^{−1} K_NM in
the calculation of Q_M, which is O(M²N). After various precomputations, prediction can
then be made in O(M) for the mean and O(M²) for the variance per test case.
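Putting (5)–(8) together, the following NumPy sketch (illustrative, with a plain squared-exponential kernel standing in for (1), and our own helper names) shows the O(M²N) precomputation and the O(M)/O(M²) per-test-case prediction. Note that, unlike the full GP, only the M pseudo-inputs and the precomputed quantities are needed at test time.

```python
# Sketch of SPGP precomputation and prediction, eqs (5)-(8). The kernel and the
# helper names are illustrative; an ARD kernel as in (1) would be used in practice.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def spgp_fit(X, y, Xbar, sigma2, c=1.0, jitter=1e-8):
    M = len(Xbar)
    KM = rbf(Xbar, Xbar, c=c) + jitter * np.eye(M)
    KNM = rbf(X, Xbar, c=c)                                    # N x M
    KMinv_KMN = np.linalg.solve(KM, KNM.T)                     # K_M^{-1} K_MN, M x N
    Lam = c - np.sum(KNM.T * KMinv_KMN, axis=0)                # Lambda_nn = K_nn - k_n^T K_M^{-1} k_n
    A = 1.0 / (Lam + sigma2)                                   # (Lambda + sigma2 I)^{-1}, element-wise
    QM = KM + (KNM.T * A) @ KNM                                # the O(M^2 N) step
    w = np.linalg.solve(QM, KNM.T @ (A * y))                   # Q_M^{-1} K_MN (Lambda + sigma2 I)^{-1} y
    return dict(KM=KM, QM=QM, w=w, Xbar=Xbar, sigma2=sigma2, c=c)

def spgp_predict(xstar, model):
    ks = rbf(model['Xbar'], xstar.reshape(1, -1), c=model['c'])[:, 0]   # [k_*]_m = K(xbar_m, x_*)
    mean = ks @ model['w']                                     # O(M) per test case
    var = (model['c']                                          # K_**
           - ks @ np.linalg.solve(model['KM'], ks)             # - k_*^T K_M^{-1} k_*
           + ks @ np.linalg.solve(model['QM'], ks)             # + k_*^T Q_M^{-1} k_*
           + model['sigma2'])                                  # O(M^2) per test case
    return mean, var

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
Xbar = X[rng.choice(len(X), size=20, replace=False)]           # initialise on random data points
model = spgp_fit(X, y, Xbar, sigma2=0.01)
print(spgp_predict(np.array([0.5]), model))
```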
Figure 1: Predictive distributions (mean and two standard deviation lines) for: (a) full GP,
(b) SPGP trained using gradient ascent on (9), (c) SPGP trained using gradient ascent on
(10). Initial pseudo point positions are shown at the top as red crosses; final pseudo point
positions are shown at the bottom as blue crosses (the y location on the plots of these
crosses is not meaningful).
We are left with the problem of finding the pseudo-input locations X̄ and hyperparameters
Θ = {θ, σ²}. We can do this by computing the marginal likelihood from (5) and (6):

p(y|X, X̄, Θ) = ∫ df̄ p(y|X, X̄, f̄) p(f̄|X̄) = N( y | 0 ,  K_NM K_M^{−1} K_MN + Λ + σ²I ) .   (9)
The marginal likelihood can then be maximized with respect to all these parameters
{X̄, Θ} by gradient ascent. The details of the gradient calculations are long and tedious
and therefore omitted here for brevity. They closely follow the derivations of hyperparam-
eter gradients of Seeger et al. [7] (see also section 3), and as there, can be most efficiently
coded with Cholesky factorisations. Note that KM , KMN and Λ are all functions of the
M pseudo-inputs X̄ and θ. The exact form of the gradients will of course depend on the
functional form of the covariance function chosen, but our method will apply to any co-
variance that is differentiable with respect to the input points. It is worth saying that the
SPGP can be viewed as a standard GP with a particular non-stationary covariance function
parameterized by the pseudo-inputs.
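A hedged sketch of this joint optimisation: the negative log of (9) is treated as an objective over the stacked pseudo-inputs and (log) hyperparameters and handed to an off-the-shelf L-BFGS routine. For brevity the N×N covariance is formed explicitly and gradients are taken by finite differences; the paper itself uses analytic gradients and Cholesky factorisations, keeping each evaluation at O(M²N). All names below are our own.

```python
# Sketch of learning pseudo-inputs and hyperparameters jointly by maximising (9).
# For clarity the N x N covariance is formed explicitly and gradients are taken
# numerically by scipy; the paper instead uses analytic gradients, so that each
# likelihood evaluation stays O(M^2 N). All names here are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_log_marglik(params, X, y, M):
    N, D = X.shape
    Xbar = params[:M * D].reshape(M, D)                     # pseudo-inputs
    c, ell, sigma2 = np.exp(params[M * D:])                 # log-parameterised hyperparameters

    def k(A, B):
        return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

    KM = k(Xbar, Xbar) + 1e-8 * np.eye(M)
    KNM = k(X, Xbar)
    Qnn = KNM @ np.linalg.solve(KM, KNM.T)                  # K_NM K_M^{-1} K_MN
    Lam = np.clip(c - np.diag(Qnn), 0.0, None)              # diagonal of Lambda
    C = Qnn + np.diag(Lam) + sigma2 * np.eye(N)             # covariance in (9)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))     # constant term dropped

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)
M = 8
x0 = np.concatenate([X[rng.choice(len(X), M, replace=False)].ravel(),  # pseudo-inputs on data
                     np.log([1.0, 1.0, 0.1])])                          # c, lengthscale, sigma2
res = minimize(neg_log_marglik, x0, args=(X, y, M), method='L-BFGS-B')
print(res.fun, res.x[-3:])
```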
Since we now have MD + |Θ| parameters to fit, instead of just |Θ| for the full GP, one may
be worried about overfitting. However, consider the case where we let M = N and X̄ = X
– the pseudo-inputs coincide with the real inputs. At this point the marginal likelihood is
equal to that of a full GP (2). This is because at this point KMN = KM = KN and Λ = 0.
Moreover the predictive distribution (8) also collapses to the full GP predictive distribution
(3). These are clearly desirable properties of the model, and they give confidence that a
good solution will be found when M < N . However it is the case that hyperparameter
learning complicates matters, and we discuss this further in section 4.
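A quick numerical check of this limiting case (our own, with an illustrative kernel): when X̄ = X, the low-rank term K_NM K_M^{−1} K_MN reproduces K_N exactly, Λ vanishes, and (9) agrees with the full-GP marginal likelihood (2).

```python
# Numerical check that with Xbar = X the SPGP marginal likelihood (9) collapses
# to the full-GP marginal likelihood (2). Illustrative kernel and data.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

def log_mvn_zero_mean(y, C):
    L = np.linalg.cholesky(C)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
N, sigma2 = 30, 0.01
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

KN = rbf(X, X)
full_gp = log_mvn_zero_mean(y, KN + sigma2 * np.eye(N))     # full GP, eq (2)

Xbar = X.copy()                                             # M = N, pseudo-inputs on the inputs
KM = rbf(Xbar, Xbar) + 1e-8 * np.eye(N)
KNM = rbf(X, Xbar)
Qnn = KNM @ np.linalg.solve(KM, KNM.T)                      # equals K_N here
Lam = np.clip(np.diag(KN) - np.diag(Qnn), 0.0, None)        # ~ 0 here
spgp = log_mvn_zero_mean(y, Qnn + np.diag(Lam) + sigma2 * np.eye(N))

print(full_gp, spgp)                                        # agree up to numerical error
```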
Figure 2: Sample data drawn from the marginal likelihood of: (a) a full GP, (b) SPGP, (c)
PLV. For (b) and (c), the blue crosses show the location of the 10 pseudo-input points.
As discussed earlier, the major difference between our method and these other methods
is that they do not use this marginal likelihood to learn the locations of active set input points
– only the hyperparameters are learnt from (10). This raises the question of what would
happen if we tried to use their marginal likelihood approximation (10) instead of (9) to try
to learn pseudo-input locations by gradient ascent. We show that the Λ that appears in the
SPGP marginal likelihood (9) is crucial for finding pseudo-input points by gradients.
Figure 1 shows what happens when we try to optimize these two likelihoods using gradient
ascent with respect to the pseudo inputs, on a simple 1D data set. Plotted are the predictive
distributions, initial and final locations of the pseudo inputs. Hyperparameters were fixed
to their true values for this example. The initial pseudo-input locations were chosen adver-
sarially: all towards the left of the input space (red crosses). Using the SPGP likelihood, the
pseudo-inputs spread themselves along the extent of the training data, and the predictive
distribution matches the full GP very closely (Figure 1(b)). Using the PLV likelihood, the
points begin to spread, but very quickly become stuck as the gradient pushing the points
towards the right becomes tiny (Figure 1(c)).
Figure 2 compares data sampled from the marginal likelihoods (9) and (10), given a partic-
ular setting of the hyperparameters and a small number of pseudo-input points. The major
difference between the two is that the SPGP likelihood has a constant marginal variance of
K_nn + σ², whereas the PLV marginal variance decreases to σ² away from the pseudo-inputs. Alternatively,
the noise component of the PLV likelihood is a constant σ², whereas the SPGP noise grows
to K_nn + σ² away from the pseudo-inputs. If one is in the situation of Figure 1(c), under
the SPGP likelihood, moving the rightmost pseudo-input slightly to the right will imme-
diately start to reduce the noise in this region from K_nn + σ² towards σ². Hence there
will be a strong gradient pulling it to the right. With the PLV likelihood, the noise is fixed
at σ² everywhere, and moving the point to the right does not improve the quality of fit of
the mean function enough locally to provide a significant gradient. Therefore the points
become stuck, and we believe this effect accounts for the failure of the PLV likelihood in
Figure 1(c).
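As described above, the PLV marginal likelihood (10) shares the low-rank covariance K_NM K_M^{−1} K_MN but carries only a constant σ² on its diagonal, with no Λ term. The short sketch below (our own, with an illustrative kernel and made-up numbers) computes the two marginal variances at points increasingly far from a cluster of pseudo-inputs: the SPGP variance stays at K_nn + σ², while the PLV variance decays to σ².

```python
# Sketch of the marginal (prior) variance at a test point under the SPGP
# covariance (9) versus a PLV-style covariance with only constant sigma^2 noise,
# as described in the text. Kernel, names and numbers are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

Xbar = np.array([[-1.0], [0.0], [1.0]])                # pseudo-inputs clustered on the left
KM = rbf(Xbar, Xbar) + 1e-8 * np.eye(3)
sigma2 = 0.01
xs = np.linspace(-1.0, 6.0, 8).reshape(-1, 1)          # points moving away to the right

kx = rbf(xs, Xbar)                                     # rows are k_x^T
q = np.sum(kx * np.linalg.solve(KM, kx.T).T, axis=1)   # k_x^T K_M^{-1} k_x

spgp_var = q + (1.0 - q) + sigma2                      # low-rank + Lambda + sigma2 = K_nn + sigma2
plv_var = q + sigma2                                   # low-rank + sigma2 only
print(np.round(spgp_var, 3))                           # constant at K_nn + sigma2 = 1.01
print(np.round(plv_var, 3))                            # decays towards sigma2 = 0.01
```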
It should be emphasised that the global optimum of the PLV likelihood (10) may well be a
good solution, but it is going to be difficult to find with gradients. The SPGP likelihood (9)
also suffers from local optima of course, but not so catastrophically. It may be interesting
in the future to compare which performs better for hyperparameter optimization.
4 Experiments
In the previous section we showed our gradient method successfully learning the pseudo-
inputs on a 1D example. There the initial pseudo input points were chosen adversarially, but
on a real problem it is sensible to initialize by randomly placing them on real data points,
Figure 3: Our results have been added to plots reproduced with kind permission from [7].
The plots show mean square test error as a function of active/pseudo set size M . Top row
– data set kin-40k, bottom row – pumadyn-32nm¹. We have added circles which show
SPGP with both hyperparameter and pseudo-input learning from random initialisation. For
kin-40k the squares show SPGP with hyperparameters obtained from a full GP and fixed.
For pumadyn-32nm the squares show hyperparameters initialized from a full GP. random,
info-gain and smo-bart are explained in the text. The horizontal lines are a full GP trained
on a subset of the data.
and this is what we do for all of our experiments. To compare our results to other methods
we have run experiments on exactly the same data sets as in Seeger et al. [7], following
precisely their preprocessing and testing methods. In Figure 3, we have reproduced their
learning curves for two large data sets¹, superimposing our mean squared test errors.
Seeger et al. compare three methods: random, info-gain and smo-bart. random involves
picking an active set of size M randomly from among training data. info-gain is their own
greedy subset selection method, which is extremely cheap to train – barely more expensive
than random. smo-bart is Smola and Bartlett’s [1] more expensive greedy subset selection
method. Also shown with horizontal lines is the test error for a full GP trained on a subset
of the data of size 2000 for data set kin-40k and 1024 for pumadyn-32nm. For these learning
curves, they do not actually learn hyperparameters by maximizing their approximation to
the marginal likelihood (10). Instead they fix them to those obtained from the full GP².
For kin-40k we follow Seeger et al.’s procedure of setting the hyperparameters from the full
GP on a subset. We then optimize the pseudo-input positions, and plot the results as red
squares. We see the SPGP learning curve lying significantly below all three other methods
in Figure 3. We rapidly approach the error of a full GP trained on 2000 points, using a
pseudo set of only a few hundred points. We then try the harder task of also finding the
hyperparameters at the same time as the pseudo-inputs. The results are plotted as blue
circles. The method performs extremely well for small M , but we see some overfitting
¹ kin-40k: 10000 training, 30000 test, 9 attributes; see www.igi.tugraz.at/aschwaig/data.html. pumadyn-32nm: 7168 training, 1024 test, 33 attributes; see www.cs.toronto.edu/~delve.
² Seeger et al. have a separate section testing their likelihood approximation (10) to learn hyperparameters, in conjunction with the active set selection methods. They show that it can be used to reliably learn hyperparameters with info-gain for active set sizes of 100 and above. They have more trouble reliably learning hyperparameters for very small active sets.
[Figure 4 panels: standard GP, SPGP — discussed below.]
behaviour for large M which seems to be caused by the noise hyperparameter being driven
too small (the blue circles have higher likelihood than the red squares below them).
For data set pumadyn-32nm, we again try to jointly find hyperparameters and pseudo-
inputs. Again Figure 3 shows SPGP with extremely low error for small pseudo set size
– with just 10 pseudo-inputs we are already close to the error of a full GP trained on 1024
points. However, in this case increasing the pseudo set size does not decrease our error. In
this problem there is a large number of irrelevant attributes, and the relevant ones need to
be singled out by ARD. Although the hyperparameters learnt by our method are reasonable
(2 out of the 4 relevant dimensions are found), they are not good enough to get down to the
error of the full GP. However if we initialize our gradient algorithm with the hyperparam-
eters of the full GP, we get the points plotted as squares (this time red likelihoods > blue
likelihoods, so it is a problem of local optima not overfitting). Now with only a pseudo set
of size 25 we reach the performance of the full GP, and significantly outperform the other
methods (which also had their hyperparameters set from the full GP).
Another main difference between the methods lies in training time. Our method performs
optimization over a potentially large parameter space, and hence is relatively expensive to
train. On the face of it methods such as info-gain and random are extremely cheap. How-
ever all these methods must be combined with obtaining hyperparameters in some way –
either by a full GP on a subset (generally expensive), or by gradient ascent on an approx-
imation to the likelihood. When you consider this combined task, and that all methods
involve some kind of gradient based procedure, then none of the methods are particularly
cheap. We believe that the gain in accuracy achieved by our method can often be worth the
extra training time associated with optimizing in a larger parameter space.
Although GPs are very flexible regression models, they are still limited by the form of the
covariance function. For example it is difficult to model non-stationary processes with a GP
because it is hard to construct sensible non-stationary covariance functions. Although the
SPGP is not specifically designed to model non-stationarity, the extra flexibility associated
with moving pseudo inputs around can actually achieve this to a certain extent. Figure
4 shows the SPGP fit to some data with an input dependent noise variance. The SPGP
achieves a much better fit to the data than the standard GP by moving almost all the pseudo-
input points outside the region of data³. It will be interesting to test these capabilities further
in the future. The extension to classification is also a natural avenue to explore.
We have demonstrated a significant decrease in test error over the other methods for a given
small pseudo/active set size. Our method runs into problems when we consider much larger
³ It should be said that there are local optima in this problem, and other solutions looked closer to the standard GP. We ran the method 5 times with random initialisations. All runs had higher likelihood than the GP; the one with the highest likelihood is plotted.
pseudo set size and/or high dimensional input spaces, because the space in which we are
optimizing becomes impractically big. However we have currently only tried using an ‘off
the shelf’ conjugate gradient minimizer, or L-BFGS, and there are certainly improvements
that can be made in this area. For example we can try optimizing subsets of variables
iteratively (chunking), or stochastic gradient ascent, or we could make a hybrid by picking
some points randomly and optimizing others. In general though we consider our method
most useful when one wants a very sparse (hence fast prediction) and accurate solution.
One further way in which to deal with large D is to learn a low dimensional projection of
the input space. This has been considered for GPs before [14], and could easily be applied
to our model.
In conclusion, we have presented a new method for sparse GP regression, which shows
a significant performance gain over other methods especially when searching for an ex-
tremely sparse solution. We have shown that the added flexibility of moving pseudo-input
points which are not constrained to lie on the true data points leads to better solutions, and
even some non-stationary effects can be modelled. Finally we have shown that hyperpa-
rameters can be jointly learned with pseudo-input points with reasonable success.
Acknowledgements
Thanks to the authors of [7] for agreeing to make their results and plots available for repro-
duction. Thanks to all at the Sheffield GP workshop for helping to clarify this work.
References
[1] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural
Information Processing Systems 13. MIT Press, 2000.
[2] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In
Advances in Neural Information Processing Systems 13. MIT Press, 2000.
[3] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[4] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668,
2002.
[5] L. Csató. Gaussian Processes — Iterative Sparse Approximations. PhD thesis, Aston Univer-
sity, UK, 2002.
[6] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the
informative vector machine. In Advances in Neural Information Processing Systems 15. MIT
Press, 2002.
[7] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse
Gaussian process regression. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth
International Workshop on Artificial Intelligence and Statistics, 2003.
[8] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds
and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
[9] J. Quiñonero Candela. Learning with Uncertainty — Gaussian Processes and Relevance Vector
Machines. PhD thesis, Technical University of Denmark, 2004.
[10] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks
and Machine Learning, NATO ASI Series, pages 133–166. Kluwer Academic Press, 1998.
[11] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in
Neural Information Processing Systems 8. MIT Press, 1996.
[12] C. E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-Linear Re-
gression. PhD thesis, University of Toronto, 1996.
[13] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis,
Cambridge University, 1997.
[14] F. Vivarelli and C. K. I. Williams. Discovering hidden features with Gaussian processes regres-
sion. In Advances in Neural Information Processing Systems 11. MIT Press, 1998.