Sparse Gaussian Processes using Pseudo-inputs
Edward Snelson and Zoubin Ghahramani
Abstract
1 Introduction
The Gaussian process (GP) is a popular and elegant method for Bayesian non-linear non-
parametric regression and classification. Unfortunately its non-parametric nature causes
computational problems for large data sets, due to an unfavourable N³ scaling for training,
where N is the number of data points. In recent years there have been many attempts to
make sparse approximations to the full GP in order to bring this scaling down to M²N,
where M ≪ N [1, 2, 3, 4, 5, 6, 7, 8, 9]. Most of these methods involve selecting a subset
of the training points of size M (active set) on which to base computation. A typical way of
choosing such a subset is through some sort of information criterion. For example, Seeger
et al. [7] employ a very fast approximate information gain criterion, which they use to
greedily select points into the active set.
A major problem common to these methods is that they lack a reliable way of learning
kernel hyperparameters, because the active set selection interferes with this learning proce-
dure. Seeger et al. [7] construct an approximation to the full GP marginal likelihood, which
they try to maximize to find the hyperparameters. However, as the authors state, they have
persistent difficulty in practically doing this through gradient ascent. The reason for this
is that reselecting the active set causes non-smooth fluctuations in the marginal likelihood
and its gradients, meaning that they cannot get smooth convergence. Therefore the speed
of active set selection is somewhat undermined by the difficulty of selecting hyperparame-
ters. Inappropriately learned hyperparameters will adversely affect the quality of solution,
especially if one is trying to use them for automatic relevance determination (ARD) [10].
In this paper we circumvent this problem by constructing a GP regression model that en-
ables us to find active set point locations and hyperparameters in one smooth joint optimiza-
tion. The covariance function of our GP is parameterized by the locations of pseudo-inputs
— an active set not constrained to be a subset of the data, found by a continuous optimiza-
tion. This is a further major advantage, since we can improve the quality of our fit by the
fine tuning of their precise locations.
Our model is closely related to several sparse GP approximations, in particular Seeger’s
method of projected latent variables (PLV) [7, 8]. We discuss these relations in section 3.
In principle we could also apply our technique of moving active set points off data points to
approximations such as PLV. However we empirically demonstrate that a crucial difference
between PLV and our method (SPGP) prevents this idea from working for PLV.
We provide here a concise summary of GPs for regression, but see [11, 12, 13, 10] for
more detailed reviews. We have a data set D consisting of N input vectors X = {x_n}_{n=1}^N
of dimension D and corresponding real-valued targets y = {y_n}_{n=1}^N. We place a zero
mean Gaussian process prior on the underlying latent function f(x) that we are trying to
model. We therefore have a multivariate Gaussian distribution on any finite subset of latent
variables; in particular, at X: p(f|X) = N(f|0, K_N), where N(f|m, V) is a Gaussian
distribution with mean m and covariance V. In a Gaussian process the covariance matrix
is constructed from a covariance function, or kernel, K, which expresses some prior notion
of smoothness of the underlying function: [K_N]_{nn'} = K(x_n, x_{n'}). Usually the covariance
function depends on a small number of hyperparameters θ, which control these smoothness
properties. For our experiments later on we will use the standard Gaussian covariance with
ARD hyperparameters:
K(x_n, x_{n'}) = c exp[ −(1/2) Σ_{d=1}^D b_d (x_n^{(d)} − x_{n'}^{(d)})² ],   θ = {c, b} .   (1)

The observed targets are assumed to be the latent function values corrupted by i.i.d. Gaussian noise of variance σ², giving the marginal likelihood

p(y|X, θ, σ²) = N(y | 0, K_N + σ²I) ,   (2)

which is typically used to train the GP by finding a (local) maximum with respect to the
hyperparameters θ and σ².
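To make (1) and (2) concrete, here is a minimal NumPy sketch (our own illustration, not the paper's code) of the ARD covariance and the full-GP log marginal likelihood; the helper names and the toy data are assumptions made purely for the example.

```python
# Minimal sketch of the ARD covariance (1) and the full-GP log marginal
# likelihood (2). The data and helper names are illustrative, not the paper's code.
import numpy as np

def ard_kernel(X1, X2, c, b):
    """K(x, x') = c * exp(-0.5 * sum_d b_d (x^(d) - x'^(d))^2)."""
    diff = X1[:, None, :] - X2[None, :, :]                 # shape (N1, N2, D)
    return c * np.exp(-0.5 * np.einsum('ijd,d->ij', diff**2, b))

def gp_log_marginal_likelihood(X, y, c, b, sigma2):
    """log N(y | 0, K_N + sigma2 * I), maximised to train the hyperparameters."""
    N = len(X)
    K = ard_kernel(X, X, c, b) + sigma2 * np.eye(N)
    L = np.linalg.cholesky(K)                              # the O(N^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_N + sigma2 I)^{-1} y
    return (-0.5 * y @ alpha
            - np.sum(np.log(np.diag(L)))                   # -0.5 log|K_N + sigma2 I|
            - 0.5 * N * np.log(2 * np.pi))

# Toy usage on made-up 2-D data
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
print(gp_log_marginal_likelihood(X, y, c=1.0, b=np.array([1.0, 1.0]), sigma2=0.01))
```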
Prediction is made by considering a new input point x and conditioning on the observed
data and hyperparameters. The distribution of the target value at the new point is then:
p(y|x, D, θ) = N( y | k_x^⊤ (K_N + σ²I)^{−1} y ,  K_xx − k_x^⊤ (K_N + σ²I)^{−1} k_x + σ² ) ,   (3)

where [k_x]_n = K(x_n, x) and K_xx = K(x, x). The GP is a non-parametric model, because
the training data are explicitly required at test time in order to construct the predictive
distribution, as is clear from the above expression.
GPs are prohibitive for large data sets because training requires O(N³) time due to the
inversion of the covariance matrix. Once the inversion is done, prediction is O(N) for the
predictive mean and O(N²) for the predictive variance per new test case.
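A hedged sketch of the predictive distribution (3), under the same assumptions as the previous snippet: the O(N³) Cholesky factorisation is done once, after which each test case costs O(N) for the mean and O(N²) for the variance. The isotropic kernel and function names below are illustrative.

```python
# Sketch of full-GP prediction (3): train once in O(N^3), then each test case
# costs O(N) for the mean and O(N^2) for the variance. Illustrative kernel/data.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def gp_fit(X, y, sigma2):
    K = rbf(X, X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                              # O(N^3), done once
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K_N + sigma2 I)^{-1} y
    return L, alpha

def gp_predict(Xstar, X, L, alpha, sigma2):
    Kx = rbf(X, Xstar)                                     # [k_x]_n = K(x_n, x); needs all training inputs
    mean = Kx.T @ alpha                                    # O(N) per test case
    v = np.linalg.solve(L, Kx)                             # O(N^2) per test case
    var = np.diag(rbf(Xstar, Xstar)) - np.sum(v**2, axis=0) + sigma2
    return mean, var

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(100)
L, alpha = gp_fit(X, y, sigma2=0.01)
print(gp_predict(np.array([[0.5]]), X, L, alpha, sigma2=0.01))
```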
2 Sparse Pseudo-input Gaussian processes (SPGPs)
In order to derive a sparse model that is computationally tractable for large data sets, which
still preserves the desirable properties of the full GP, we examine in detail the GP predictive
distribution (3). Consider the mean and variance of this distribution as functions of x, the
new input. Regarding the hyperparameters as known and fixed for now, these functions
are effectively parameterized by the locations of the N training input and target pairs,
X and y. In this paper we consider a model with likelihood given by the GP predictive
distribution, and parameterized by a pseudo data set. The sparsity in the model will arise
because we will generally consider a pseudo data set D̄ of size M < N : pseudo-inputs
X̄ = {x̄_m}_{m=1}^M and pseudo targets f̄ = {f̄_m}_{m=1}^M. We have denoted the pseudo targets
f̄ instead of ȳ because, as they are not real observations, it does not make much sense
to include a noise variance for them. They are therefore equivalent to the latent function
values f . The actual observed target value will of course be assumed noisy as before. These
assumptions therefore lead to the following single data point likelihood:
p(y|x, X̄, f̄) = N( y | k_x^⊤ K_M^{−1} f̄ ,  K_xx − k_x^⊤ K_M^{−1} k_x + σ² ) ,   (4)

where [K_M]_{mm'} = K(x̄_m, x̄_{m'}) and [k_x]_m = K(x̄_m, x), for m = 1, . . . , M.
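The following NumPy sketch (ours) evaluates the mean k_x^⊤ K_M^{−1} f̄ and the input-dependent noise K_xx − k_x^⊤ K_M^{−1} k_x + σ² of (4) at a test input; the names Xbar, fbar and spgp_point are illustrative.

```python
# Sketch of the single-point SPGP likelihood (4): a parameterised mean function
# plus input-dependent noise. Xbar, fbar and spgp_point are illustrative names.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def spgp_point(x, Xbar, fbar, sigma2, jitter=1e-8):
    """Mean and variance of p(y | x, Xbar, fbar) in (4)."""
    KM = rbf(Xbar, Xbar) + jitter * np.eye(len(Xbar))      # K_M (jitter for stability)
    kx = rbf(Xbar, x.reshape(1, -1))[:, 0]                 # [k_x]_m = K(xbar_m, x)
    mean = kx @ np.linalg.solve(KM, fbar)                  # k_x^T K_M^{-1} fbar
    var = (rbf(x.reshape(1, -1), x.reshape(1, -1))[0, 0]   # K_xx
           - kx @ np.linalg.solve(KM, kx) + sigma2)        # - k_x^T K_M^{-1} k_x + sigma2
    return mean, var

Xbar = np.array([[-1.0], [0.0], [1.0]])                    # M = 3 pseudo-inputs
fbar = np.array([0.5, -0.2, 0.8])                          # pseudo targets
print(spgp_point(np.array([0.0]), Xbar, fbar, sigma2=0.01))  # near a pseudo-input: noise ~ sigma2
print(spgp_point(np.array([5.0]), Xbar, fbar, sigma2=0.01))  # far away: noise ~ K_xx + sigma2
```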
This can be viewed as a standard regression model with a particular form of parameterized
mean function and input-dependent noise model. The target data are generated i.i.d. given
the inputs, giving the complete data likelihood:
p(y|X, X̄, f̄) = ∏_{n=1}^N p(y_n|x_n, X̄, f̄) = N( y | K_NM K_M^{−1} f̄ ,  Λ + σ²I ) ,   (5)

where [K_NM]_{nm} = K(x_n, x̄_m) and Λ is diagonal with Λ_nn = K_nn − k_n^⊤ K_M^{−1} k_n, the input-dependent noise of (4) evaluated at each training input.
Learning in the model involves finding a suitable setting of the parameters – an appropriate
pseudo data set that explains the real data well. However rather than simply maximize the
likelihood with respect to X̄ and f̄ it turns out that we can integrate out the pseudo targets
f̄ . We place a Gaussian prior on the pseudo targets:
p(f̄|X̄) = N(f̄ | 0, K_M) .   (6)
This is a very reasonable prior because we expect the pseudo data to be distributed in a
very similar manner to the real data, if they are to model them well. It is not easy to place a
prior on the pseudo-inputs and still remain with a tractable model, so we will find these by
maximum likelihood (ML). For the moment though, consider the pseudo-inputs as known.
We find the posterior distribution over pseudo targets f̄ using Bayes rule on (5) and (6):
p(f̄|D, X̄) = N( f̄ | K_M Q_M^{−1} K_MN (Λ + σ²I)^{−1} y ,  K_M Q_M^{−1} K_M ) ,   (7)

where Q_M = K_M + K_MN (Λ + σ²I)^{−1} K_NM.
Given a new input x∗ , the predictive distribution is then obtained by integrating the likeli-
hood (4) with the posterior (7):
p(y_*|x_*, D, X̄) = ∫ df̄ p(y_*|x_*, X̄, f̄) p(f̄|D, X̄) = N(y_* | μ_*, σ_*²) ,   (8)

where μ_* = k_*^⊤ Q_M^{−1} K_MN (Λ + σ²I)^{−1} y
      σ_*² = K_** − k_*^⊤ (K_M^{−1} − Q_M^{−1}) k_* + σ².
Note that inversion of the matrix Λ + σ²I is not a problem because it is diagonal. The
computational cost is dominated by the matrix multiplication K_MN (Λ + σ²I)^{−1} K_NM in
the calculation of Q_M, which is O(M²N). After various precomputations, prediction can
then be made in O(M) for the mean and O(M²) for the variance per test case.
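Putting (5)–(8) together, the following NumPy sketch (illustrative, with a plain squared-exponential kernel standing in for (1), and our own helper names) shows the O(M²N) precomputation and the O(M)/O(M²) per-test-case prediction. Note that, unlike the full GP, only the M pseudo-inputs and the precomputed quantities are needed at test time.

```python
# Sketch of SPGP precomputation and prediction, eqs (5)-(8). The kernel and the
# helper names are illustrative; an ARD kernel as in (1) would be used in practice.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(X1, X2, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(X1, X2, 'sqeuclidean') / ell**2)

def spgp_fit(X, y, Xbar, sigma2, c=1.0, jitter=1e-8):
    M = len(Xbar)
    KM = rbf(Xbar, Xbar, c=c) + jitter * np.eye(M)
    KNM = rbf(X, Xbar, c=c)                                    # N x M
    KMinv_KMN = np.linalg.solve(KM, KNM.T)                     # K_M^{-1} K_MN, M x N
    Lam = c - np.sum(KNM.T * KMinv_KMN, axis=0)                # Lambda_nn = K_nn - k_n^T K_M^{-1} k_n
    A = 1.0 / (Lam + sigma2)                                   # (Lambda + sigma2 I)^{-1}, element-wise
    QM = KM + (KNM.T * A) @ KNM                                # the O(M^2 N) step
    w = np.linalg.solve(QM, KNM.T @ (A * y))                   # Q_M^{-1} K_MN (Lambda + sigma2 I)^{-1} y
    return dict(KM=KM, QM=QM, w=w, Xbar=Xbar, sigma2=sigma2, c=c)

def spgp_predict(xstar, model):
    ks = rbf(model['Xbar'], xstar.reshape(1, -1), c=model['c'])[:, 0]   # [k_*]_m = K(xbar_m, x_*)
    mean = ks @ model['w']                                     # O(M) per test case
    var = (model['c']                                          # K_**
           - ks @ np.linalg.solve(model['KM'], ks)             # - k_*^T K_M^{-1} k_*
           + ks @ np.linalg.solve(model['QM'], ks)             # + k_*^T Q_M^{-1} k_*
           + model['sigma2'])                                  # O(M^2) per test case
    return mean, var

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(2000)
Xbar = X[rng.choice(len(X), size=20, replace=False)]           # initialise on random data points
model = spgp_fit(X, y, Xbar, sigma2=0.01)
print(spgp_predict(np.array([0.5]), model))
```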
Figure 1: Predictive distributions (mean and two standard deviation lines) for: (a) full GP,
(b) SPGP trained using gradient ascent on (9), (c) SPGP trained using gradient ascent on
(10). Initial pseudo point positions are shown at the top as red crosses; final pseudo point
positions are shown at the bottom as blue crosses (the y location on the plots of these
crosses is not meaningful).
We are left with the problem of finding the pseudo-input locations X̄ and hyperparameters
Θ = {θ, σ²}. We can do this by computing the marginal likelihood from (5) and (6):

p(y|X, X̄, Θ) = ∫ df̄ p(y|X, X̄, f̄) p(f̄|X̄) = N( y | 0 ,  K_NM K_M^{−1} K_MN + Λ + σ²I ) .   (9)
The marginal likelihood can then be maximized with respect to all these parameters
{X̄, Θ} by gradient ascent. The details of the gradient calculations are long and tedious
and therefore omitted here for brevity. They closely follow the derivations of hyperparam-
eter gradients of Seeger et al. [7] (see also section 3), and as there, can be most efficiently
coded with Cholesky factorisations. Note that KM , KMN and Λ are all functions of the
M pseudo-inputs X̄ and θ. The exact form of the gradients will of course depend on the
functional form of the covariance function chosen, but our method will apply to any co-
variance that is differentiable with respect to the input points. It is worth saying that the
SPGP can be viewed as a standard GP with a particular non-stationary covariance function
parameterized by the pseudo-inputs.
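A hedged sketch of this joint optimisation: the negative log of (9) is treated as an objective over the stacked pseudo-inputs and (log) hyperparameters and handed to an off-the-shelf L-BFGS routine. For brevity the N×N covariance is formed explicitly and gradients are taken by finite differences; the paper itself uses analytic gradients and Cholesky factorisations, keeping each evaluation at O(M²N). All names below are our own.

```python
# Sketch of learning pseudo-inputs and hyperparameters jointly by maximising (9).
# For clarity the N x N covariance is formed explicitly and gradients are taken
# numerically by scipy; the paper instead uses analytic gradients, so that each
# likelihood evaluation stays O(M^2 N). All names here are illustrative.
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def neg_log_marglik(params, X, y, M):
    N, D = X.shape
    Xbar = params[:M * D].reshape(M, D)                     # pseudo-inputs
    c, ell, sigma2 = np.exp(params[M * D:])                 # log-parameterised hyperparameters

    def k(A, B):
        return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

    KM = k(Xbar, Xbar) + 1e-8 * np.eye(M)
    KNM = k(X, Xbar)
    Qnn = KNM @ np.linalg.solve(KM, KNM.T)                  # K_NM K_M^{-1} K_MN
    Lam = np.clip(c - np.diag(Qnn), 0.0, None)              # diagonal of Lambda
    C = Qnn + np.diag(Lam) + sigma2 * np.eye(N)             # covariance in (9)
    L = np.linalg.cholesky(C)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L)))     # constant term dropped

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(200)
M = 8
x0 = np.concatenate([X[rng.choice(len(X), M, replace=False)].ravel(),  # pseudo-inputs on data
                     np.log([1.0, 1.0, 0.1])])                          # c, lengthscale, sigma2
res = minimize(neg_log_marglik, x0, args=(X, y, M), method='L-BFGS-B')
print(res.fun, res.x[-3:])
```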
Since we now have MD + |Θ| parameters to fit, instead of just |Θ| for the full GP, one may
be worried about overfitting. However, consider the case where we let M = N and X̄ = X
– the pseudo-inputs coincide with the real inputs. At this point the marginal likelihood is
equal to that of a full GP (2). This is because at this point KMN = KM = KN and Λ = 0.
Moreover the predictive distribution (8) also collapses to the full GP predictive distribution
(3). These are clearly desirable properties of the model, and they give confidence that a
good solution will be found when M < N . However it is the case that hyperparameter
learning complicates matters, and we discuss this further in section 4.
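A quick numerical check of this limiting case (our own, with an illustrative kernel): when X̄ = X, the low-rank term K_NM K_M^{−1} K_MN reproduces K_N exactly, Λ vanishes, and (9) agrees with the full-GP marginal likelihood (2).

```python
# Numerical check that with Xbar = X the SPGP marginal likelihood (9) collapses
# to the full-GP marginal likelihood (2). Illustrative kernel and data.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

def log_mvn_zero_mean(y, C):
    L = np.linalg.cholesky(C)
    a = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ a - np.sum(np.log(np.diag(L))) - 0.5 * len(y) * np.log(2 * np.pi)

rng = np.random.default_rng(4)
N, sigma2 = 30, 0.01
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)

KN = rbf(X, X)
full_gp = log_mvn_zero_mean(y, KN + sigma2 * np.eye(N))     # full GP, eq (2)

Xbar = X.copy()                                             # M = N, pseudo-inputs on the inputs
KM = rbf(Xbar, Xbar) + 1e-8 * np.eye(N)
KNM = rbf(X, Xbar)
Qnn = KNM @ np.linalg.solve(KM, KNM.T)                      # equals K_N here
Lam = np.clip(np.diag(KN) - np.diag(Qnn), 0.0, None)        # ~ 0 here
spgp = log_mvn_zero_mean(y, Qnn + np.diag(Lam) + sigma2 * np.eye(N))

print(full_gp, spgp)                                        # agree up to numerical error
```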
Figure 2: Sample data drawn from the marginal likelihood of: (a) a full GP, (b) SPGP, (c)
PLV. For (b) and (c), the blue crosses show the location of the 10 pseudo-input points.
As discussed earlier, the major difference between our method and these other methods
is that they do not use this marginal likelihood to learn the locations of active set input points
– only the hyperparameters are learnt from (10). This raises the question of what would
happen if we tried to use their marginal likelihood approximation (10) instead of (9) to try
to learn pseudo-input locations by gradient ascent. We show that the Λ that appears in the
SPGP marginal likelihood (9) is crucial for finding pseudo-input points by gradients.
Figure 1 shows what happens when we try to optimize these two likelihoods using gradient
ascent with respect to the pseudo inputs, on a simple 1D data set. Plotted are the predictive
distributions, initial and final locations of the pseudo inputs. Hyperparameters were fixed
to their true values for this example. The initial pseudo-input locations were chosen adver-
sarially: all towards the left of the input space (red crosses). Using the SPGP likelihood, the
pseudo-inputs spread themselves along the extent of the training data, and the predictive
distribution matches the full GP very closely (Figure 1(b)). Using the PLV likelihood, the
points begin to spread, but very quickly become stuck as the gradient pushing the points
towards the right becomes tiny (Figure 1(c)).
Figure 2 compares data sampled from the marginal likelihoods (9) and (10), given a partic-
ular setting of the hyperparameters and a small number of pseudo-input points. The major
difference between the two is that the SPGP likelihood has a constant marginal variance of
K_nn + σ², whereas the PLV marginal variance decreases to σ² away from the pseudo-inputs. Alternatively,
the noise component of the PLV likelihood is a constant σ², whereas the SPGP noise grows
to K_nn + σ² away from the pseudo-inputs. If one is in the situation of Figure 1(c), under
the SPGP likelihood, moving the rightmost pseudo-input slightly to the right will imme-
diately start to reduce the noise in this region from K_nn + σ² towards σ². Hence there
will be a strong gradient pulling it to the right. With the PLV likelihood, the noise is fixed
at σ² everywhere, and moving the point to the right does not improve the quality of fit of
the mean function enough locally to provide a significant gradient. Therefore the points
become stuck, and we believe this effect accounts for the failure of the PLV likelihood in
Figure 1(c).
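As described above, the PLV marginal likelihood (10) shares the low-rank covariance K_NM K_M^{−1} K_MN but carries only a constant σ² on its diagonal, with no Λ term. The short sketch below (our own, with an illustrative kernel and made-up numbers) computes the two marginal variances at points increasingly far from a cluster of pseudo-inputs: the SPGP variance stays at K_nn + σ², while the PLV variance decays to σ².

```python
# Sketch of the marginal (prior) variance at a test point under the SPGP
# covariance (9) versus a PLV-style covariance with only constant sigma^2 noise,
# as described in the text. Kernel, names and numbers are illustrative.
import numpy as np
from scipy.spatial.distance import cdist

def rbf(A, B, ell=1.0, c=1.0):
    return c * np.exp(-0.5 * cdist(A, B, 'sqeuclidean') / ell**2)

Xbar = np.array([[-1.0], [0.0], [1.0]])                # pseudo-inputs clustered on the left
KM = rbf(Xbar, Xbar) + 1e-8 * np.eye(3)
sigma2 = 0.01
xs = np.linspace(-1.0, 6.0, 8).reshape(-1, 1)          # points moving away to the right

kx = rbf(xs, Xbar)                                     # rows are k_x^T
q = np.sum(kx * np.linalg.solve(KM, kx.T).T, axis=1)   # k_x^T K_M^{-1} k_x

spgp_var = q + (1.0 - q) + sigma2                      # low-rank + Lambda + sigma2 = K_nn + sigma2
plv_var = q + sigma2                                   # low-rank + sigma2 only
print(np.round(spgp_var, 3))                           # constant at K_nn + sigma2 = 1.01
print(np.round(plv_var, 3))                            # decays towards sigma2 = 0.01
```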
It should be emphasised that the global optimum of the PLV likelihood (10) may well be a
good solution, but it is going to be difficult to find with gradients. The SPGP likelihood (9)
also suffers from local optima of course, but not so catastrophically. It may be interesting
in the future to compare which performs better for hyperparameter optimization.
4 Experiments
In the previous section we showed our gradient method successfully learning the pseudo-
inputs on a 1D example. There the initial pseudo input points were chosen adversarially, but
on a real problem it is sensible to initialize by randomly placing them on real data points,
Figure 3: Our results have been added to plots reproduced with kind permission from [7].
The plots show mean square test error as a function of active/pseudo set size M . Top row
– data set kin-40k, bottom row – pumadyn-32nm¹. We have added circles which show
SPGP with both hyperparameter and pseudo-input learning from random initialisation. For
kin-40k the squares show SPGP with hyperparameters obtained from a full GP and fixed.
For pumadyn-32nm the squares show hyperparameters initialized from a full GP. random,
info-gain and smo-bart are explained in the text. The horizontal lines are a full GP trained
on a subset of the data.
and this is what we do for all of our experiments. To compare our results to other methods
we have run experiments on exactly the same data sets as in Seeger et al. [7], following
precisely their preprocessing and testing methods. In Figure 3, we have reproduced their
learning curves for two large data sets¹, superimposing our mean squared test errors.
Seeger et al. compare three methods: random, info-gain and smo-bart. random involves
picking an active set of size M randomly from among training data. info-gain is their own
greedy subset selection method, which is extremely cheap to train – barely more expensive
than random. smo-bart is Smola and Bartlett’s [1] more expensive greedy subset selection
method. Also shown with horizontal lines is the test error for a full GP trained on a subset
of the data of size 2000 for data set kin-40k and 1024 for pumadyn-32nm. For these learning
curves, they do not actually learn hyperparameters by maximizing their approximation to
the marginal likelihood (10). Instead they fix them to those obtained from the full GP².
For kin-40k we follow Seeger et al.’s procedure of setting the hyperparameters from the full
GP on a subset. We then optimize the pseudo-input positions, and plot the results as red
squares. We see the SPGP learning curve lying significantly below all three other methods
in Figure 3. We rapidly approach the error of a full GP trained on 2000 points, using a
pseudo set of only a few hundred points. We then try the harder task of also finding the
hyperparameters at the same time as the pseudo-inputs. The results are plotted as blue
circles. The method performs extremely well for small M , but we see some overfitting
¹ kin-40k: 10000 training, 30000 test, 9 attributes; see www.igi.tugraz.at/aschwaig/data.html. pumadyn-32nm: 7168 training, 1024 test, 33 attributes; see www.cs.toronto.edu/~delve.
² Seeger et al. have a separate section testing their likelihood approximation (10) to learn hyperparameters, in conjunction with the active set selection methods. They show that it can be used to reliably learn hyperparameters with info-gain for active set sizes of 100 and above. They have more trouble reliably learning hyperparameters for very small active sets.
[Figure 4 panels: standard GP, SPGP — discussed below.]
behaviour for large M which seems to be caused by the noise hyperparameter being driven
too small (the blue circles have higher likelihood than the red squares below them).
For data set pumadyn-32nm, we again try to jointly find hyperparameters and pseudo-
inputs. Again Figure 3 shows SPGP with extremely low error for small pseudo set size
– with just 10 pseudo-inputs we are already close to the error of a full GP trained on 1024
points. However, in this case increasing the pseudo set size does not decrease our error. In
this problem there is a large number of irrelevant attributes, and the relevant ones need to
be singled out by ARD. Although the hyperparameters learnt by our method are reasonable
(2 out of the 4 relevant dimensions are found), they are not good enough to get down to the
error of the full GP. However if we initialize our gradient algorithm with the hyperparam-
eters of the full GP, we get the points plotted as squares (this time red likelihoods > blue
likelihoods, so it is a problem of local optima not overfitting). Now with only a pseudo set
of size 25 we reach the performance of the full GP, and significantly outperform the other
methods (which also had their hyperparameters set from the full GP).
Another main difference between the methods lies in training time. Our method performs
optimization over a potentially large parameter space, and hence is relatively expensive to
train. On the face of it methods such as info-gain and random are extremely cheap. How-
ever all these methods must be combined with obtaining hyperparameters in some way –
either by a full GP on a subset (generally expensive), or by gradient ascent on an approx-
imation to the likelihood. When you consider this combined task, and that all methods
involve some kind of gradient based procedure, then none of the methods are particularly
cheap. We believe that the gain in accuracy achieved by our method can often be worth the
extra training time associated with optimizing in a larger parameter space.
Although GPs are very flexible regression models, they are still limited by the form of the
covariance function. For example it is difficult to model non-stationary processes with a GP
because it is hard to construct sensible non-stationary covariance functions. Although the
SPGP is not specifically designed to model non-stationarity, the extra flexibility associated
with moving pseudo inputs around can actually achieve this to a certain extent. Figure
4 shows the SPGP fit to some data with an input dependent noise variance. The SPGP
achieves a much better fit to the data than the standard GP by moving almost all the pseudo-
input points outside the region of data³. It will be interesting to test these capabilities further
in the future. The extension to classification is also a natural avenue to explore.
We have demonstrated a significant decrease in test error over the other methods for a given
small pseudo/active set size. Our method runs into problems when we consider much larger
³ It should be said that there are local optima in this problem, and other solutions looked closer to the standard GP. We ran the method 5 times with random initialisations. All runs had higher likelihood than the GP; the one with the highest likelihood is plotted.
pseudo set size and/or high dimensional input spaces, because the space in which we are
optimizing becomes impractically big. However we have currently only tried using an ‘off
the shelf’ conjugate gradient minimizer, or L-BFGS, and there are certainly improvements
that can be made in this area. For example we can try optimizing subsets of variables
iteratively (chunking), or stochastic gradient ascent, or we could make a hybrid by picking
some points randomly and optimizing others. In general though we consider our method
most useful when one wants a very sparse (hence fast prediction) and accurate solution.
One further way in which to deal with large D is to learn a low dimensional projection of
the input space. This has been considered for GPs before [14], and could easily be applied
to our model.
In conclusion, we have presented a new method for sparse GP regression, which shows
a significant performance gain over other methods especially when searching for an ex-
tremely sparse solution. We have shown that the added flexibility of moving pseudo-input
points which are not constrained to lie on the true data points leads to better solutions, and
even some non-stationary effects can be modelled. Finally we have shown that hyperpa-
rameters can be jointly learned with pseudo-input points with reasonable success.
Acknowledgements
Thanks to the authors of [7] for agreeing to make their results and plots available for repro-
duction. Thanks to all at the Sheffield GP workshop for helping to clarify this work.
References
[1] A. J. Smola and P. Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural
Information Processing Systems 13. MIT Press, 2000.
[2] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In
Advances in Neural Information Processing Systems 13. MIT Press, 2000.
[3] V. Tresp. A Bayesian committee machine. Neural Computation, 12:2719–2741, 2000.
[4] L. Csató and M. Opper. Sparse online Gaussian processes. Neural Computation, 14:641–668,
2002.
[5] L. Csató. Gaussian Processes — Iterative Sparse Approximations. PhD thesis, Aston Univer-
sity, UK, 2002.
[6] N. D. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: the
informative vector machine. In Advances in Neural Information Processing Systems 15. MIT
Press, 2002.
[7] M. Seeger, C. K. I. Williams, and N. D. Lawrence. Fast forward selection to speed up sparse
Gaussian process regression. In C. M. Bishop and B. J. Frey, editors, Proceedings of the Ninth
International Workshop on Artificial Intelligence and Statistics, 2003.
[8] M. Seeger. Bayesian Gaussian Process Models: PAC-Bayesian Generalisation Error Bounds
and Sparse Approximations. PhD thesis, University of Edinburgh, 2003.
[9] J. Quiñonero Candela. Learning with Uncertainty — Gaussian Processes and Relevance Vector
Machines. PhD thesis, Technical University of Denmark, 2004.
[10] D. J. C. MacKay. Introduction to Gaussian processes. In C. M. Bishop, editor, Neural Networks
and Machine Learning, NATO ASI Series, pages 133–166. Kluwer Academic Press, 1998.
[11] C. K. I. Williams and C. E. Rasmussen. Gaussian processes for regression. In Advances in
Neural Information Processing Systems 8. MIT Press, 1996.
[12] C. E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-Linear Re-
gression. PhD thesis, University of Toronto, 1996.
[13] M. N. Gibbs. Bayesian Gaussian Processes for Regression and Classification. PhD thesis,
Cambridge University, 1997.
[14] F. Vivarelli and C. K. I. Williams. Discovering hidden features with Gaussian processes regres-
sion. In Advances in Neural Information Processing Systems 11. MIT Press, 1998.