
Deep Kernel Learning

Andrew Gordon Wilson∗ (CMU), Zhiting Hu∗ (CMU), Ruslan Salakhutdinov (University of Toronto), Eric P. Xing (CMU)

∗Authors contributed equally. Appearing in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain. JMLR: W&CP volume 51. Copyright 2016 by the authors.

Abstract

We introduce scalable deep kernels, which combine the structural properties of deep learning architectures with the non-parametric flexibility of kernel methods. Specifically, we transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation. These closed-form kernels can be used as drop-in replacements for standard kernels, with benefits in expressive power and scalability. We jointly learn the properties of these kernels through the marginal likelihood of a Gaussian process. Inference and learning cost O(n) for n training points, and predictions cost O(1) per test point. On a large and diverse collection of applications, including a dataset with 2 million examples, we show improved performance over scalable Gaussian processes with flexible kernel learning models, and stand-alone deep architectures.

1 Introduction

“How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” questioned MacKay (1998). It was the late 1990s, and researchers had grown frustrated with the many design choices associated with neural networks – regarding architecture, activation functions, and regularisation – and the lack of a principled framework to guide in these choices.

Gaussian processes had recently been popularised within the machine learning community by Neal (1996), who had shown that Bayesian neural networks with infinitely many hidden units converged to Gaussian processes with a particular kernel (covariance) function. Gaussian processes were subsequently viewed as flexible and interpretable alternatives to neural networks, with straightforward learning procedures. Where neural networks used finitely many highly adaptive basis functions, Gaussian processes typically used infinitely many fixed basis functions. As argued by MacKay (1998), Hinton et al. (2006), and Bengio (2009), neural networks could automatically discover meaningful representations in high-dimensional data by learning multiple layers of highly adaptive basis functions. By contrast, Gaussian processes with popular kernel functions were used typically as simple smoothing devices.

Recent approaches (e.g., Yang et al., 2015; Lloyd et al., 2014; Wilson, 2014; Wilson and Adams, 2013) have demonstrated that one can develop more expressive kernel functions, which are indeed able to discover rich structure in data without human intervention. Such methods effectively use infinitely many adaptive basis functions. The relevant question then becomes not which paradigm (e.g., kernel methods or neural networks) replaces the other, but whether we can combine the advantages of each approach. Indeed, deep neural networks provide a powerful mechanism for creating adaptive basis functions, with inductive biases which have proven effective for learning in many application domains, including visual object recognition, speech perception, language understanding, and information retrieval (Krizhevsky et al., 2012; Hinton et al., 2012; Socher et al., 2011; Kiros et al., 2014; Xu et al., 2015).

In this paper, we combine the non-parametric flexibility of kernel methods with the structural properties of deep neural networks. In particular, we use deep feedforward fully-connected and convolutional networks, in combination with spectral mixture covariance functions (Wilson and Adams, 2013), inducing points (Quiñonero-Candela and Rasmussen, 2005), structure exploiting algebra (Saatchi, 2011), and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to create scalable expressive closed form covariance kernels for Gaussian processes.


As a non-parametric method, the information capacity of our model grows with the amount of available data, but its complexity is automatically calibrated through the marginal likelihood of the Gaussian process, without the need for regularization or cross-validation (Rasmussen and Ghahramani, 2001; Rasmussen and Williams, 2006; Wilson, 2014). The flexibility and automatic calibration provided by the non-parametric layer typically provides a high standard of performance, while reducing the need for extensive hand tuning from the user.

We further build on the ideas in KISS-GP (Wilson and Nickisch, 2015) and extensions (Wilson et al., 2015), so that our deep kernel learning model can scale linearly with the number of training instances n, instead of O(n³) as is standard with Gaussian processes (GPs), while retaining a fully non-parametric representation. Our approach also scales as O(1) per test point, instead of the standard O(n²) for GPs, allowing for very fast prediction times. Because KISS-GP creates an approximate kernel from a user specified kernel for fast computations, independently of a specific inference procedure, we can view the resulting kernel as a scalable deep kernel. We demonstrate the value of this scalability in the experimental results section, where it is the large datasets that provide the greatest opportunities for our model to discover expressive statistical representations.

We begin by reviewing related work in section 2, and providing background on Gaussian processes in section 3. In section 4 we derive scalable closed form deep kernels, and describe how to perform efficient automatic learning of these kernels through the Gaussian process marginal likelihood. In section 5, we show substantially improved performance over standard Gaussian processes, expressive kernel learning approaches, deep neural networks, and Gaussian processes applied to the outputs of trained deep networks, on a wide range of datasets. We also interpret the learned kernels to gain new insights into our modelling problems.

2 Related Work

Given the intuitive value of combining kernels and neural networks, it is encouraging that various distinct forms of such combinations have been considered in different contexts.

The Gaussian process regression network (Wilson et al., 2012) replaces all weight connections and activation functions in a Bayesian neural network with Gaussian processes, allowing the authors to model input dependent correlations between multiple tasks. Alternatively, Damianou and Lawrence (2013) replace every activation function in a Bayesian neural network with a Gaussian process transformation, in an unsupervised setting. While promising, both models are very task specific, and require sophisticated approximate Bayesian inference which is much more demanding than what is required by standard Gaussian processes or deep learning models, and typically does not scale beyond a few thousand training points. Similarly, Salakhutdinov and Hinton (2008) combine deep belief networks (DBNs) with Gaussian processes, showing improved performance over standard GPs with RBF kernels, in the context of semi-supervised learning. However, their model relies heavily on unsupervised pre-training of DBNs, with the GP component unable to scale beyond a few thousand training points. Likewise, Calandra et al. (2014) combine a feedforward neural network transformation with a Gaussian process, showing an ability to learn sharp discontinuities. However, similar to many other approaches, the resulting model can only scale to at most a few thousand data points.

In a frequentist setting, Yang et al. (2014) combine convolutional networks, with parameters pre-trained on ImageNet, with a scalable Fastfood (Le et al., 2013) expansion for the RBF kernel applied to the final layer. The resulting method is scalable and flexible, but the network parameters generally must first be trained separately from the Fastfood features, and the combined model remains parametric, due to the parametric expansion provided by Fastfood. Careful attention must still be paid to training procedures, regularization, and manual calibration of the network architecture. In a similar manner, Huang et al. (2015) and Snoek et al. (2015) have combined deep architectures with parametric Bayesian models. Huang et al. (2015) pursue an unsupervised pre-training procedure using deep autoencoders, showing improved performance over GPs using standard kernels. Snoek et al. (2015) show promising performance on Bayesian optimisation tasks, for tuning the parameters of a deep neural network, but do not use a Bayesian (marginal likelihood) objective for training network parameters.

Our approach is distinct in that we combine deep feedforward and convolutional architectures with spectral mixture covariances (Wilson and Adams, 2013), inducing points, Kronecker and Toeplitz algebra, and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to derive expressive and scalable closed form kernels, where all parameters are trained jointly with a unified supervised objective, as part of a non-parametric Gaussian process framework, without requiring approximate Bayesian inference. Moreover, the simple joint learning procedure in our approach can be applied in general settings.

Indeed, we show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We also show that jointly training all the weights of a deep kernel through the marginal likelihood of a non-parametric GP provides significant advantages over training a GP applied to the output layer of a trained deep neural network. Moreover, we achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes

We briefly review Gaussian processes (GPs) and the computational requirements for predictions and kernel learning (see e.g., Rasmussen and Williams (2006) for a comprehensive discussion of GPs).

We assume a dataset D of n input (predictor) vectors X = {x_1, ..., x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^⊤. If f(x) ~ GP(µ, k_γ), then any collection of function values f has a joint Gaussian distribution,

$\mathbf{f} = f(X) = [f(\mathbf{x}_1), \dots, f(\mathbf{x}_n)]^\top \sim \mathcal{N}(\boldsymbol{\mu},\, K_{X,X})\,, \qquad (1)$

with mean vector and covariance matrix defined by the mean function and covariance kernel of the Gaussian process: µ_i = µ(x_i), and (K_{X,X})_{ij} = k_γ(x_i, x_j), where the kernel k_γ is parametrized by γ. Assuming additive Gaussian noise, y(x) | f(x) ~ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

$\mathbf{f}_* \mid X_*, X, \mathbf{y}, \gamma, \sigma^2 \sim \mathcal{N}(\mathbb{E}[\mathbf{f}_*], \mathrm{cov}(\mathbf{f}_*))\,, \qquad (2)$
$\mathbb{E}[\mathbf{f}_*] = \boldsymbol{\mu}_{X_*} + K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1}(\mathbf{y} - \boldsymbol{\mu}_X)\,,$
$\mathrm{cov}(\mathbf{f}_*) = K_{X_*,X_*} - K_{X_*,X}[K_{X,X} + \sigma^2 I]^{-1} K_{X,X_*}\,.$

K_{X_*,X}, for example, represents the n_* × n matrix of covariances between the GP evaluated at X_* and X. µ_X and µ_{X_*} are mean vectors evaluated at X and X_*, and K_{X,X} is the n × n covariance matrix evaluated at training inputs X. All covariance matrices implicitly depend on the kernel hyperparameters γ.

GPs with RBF kernels correspond to models which have an infinite basis expansion in a dual space, and have compelling theoretical properties: these models are universal approximators, and have prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006). Indeed the properties of the distribution over functions induced by a Gaussian process are controlled by the kernel function. For example, the popular RBF kernel,

$k_{\mathrm{RBF}}(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\tfrac{1}{2}\,\|\mathbf{x} - \mathbf{x}'\|^2 / \ell^2\right) \qquad (3)$

encodes the inductive bias that function values closer together in the input space are more correlated. The complexity of the functions in the input space is determined by the interpretable length-scale hyperparameter ℓ. Shorter length-scales correspond to functions which vary more rapidly with the inputs x.

The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters γ, provides a principled probabilistic framework for kernel learning:

$\log p(\mathbf{y} \mid \gamma, X) \propto -\left[\mathbf{y}^\top (K_\gamma + \sigma^2 I)^{-1}\mathbf{y} + \log\left|K_\gamma + \sigma^2 I\right|\right]\,, \qquad (4)$

where we have used K_γ as shorthand for K_{X,X} given γ. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning is performed by optimizing Eq. (4) with respect to γ.

The computational bottleneck for inference is solving the linear system (K_{X,X} + σ²I)^{-1}y, and for kernel learning it is computing the log determinant log|K_{X,X} + σ²I|. The standard approach is to compute the Cholesky decomposition of the n × n matrix K_{X,X}, which requires O(n³) operations and O(n²) storage. After inference, the predictive mean costs O(n), and the predictive variance costs O(n²), per test point x_*.

4 Deep Kernel Learning

In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic GP framework. Specifically, starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, we transform the inputs (predictors) x as

$k(\mathbf{x}_i, \mathbf{x}_j \mid \boldsymbol{\theta}) \rightarrow k(g(\mathbf{x}_i, \mathbf{w}), g(\mathbf{x}_j, \mathbf{w}) \mid \boldsymbol{\theta}, \mathbf{w})\,, \qquad (5)$

where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(x_i, x_j | θ). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):

$k_{\mathrm{SM}}(\mathbf{x}, \mathbf{x}' \mid \boldsymbol{\theta}) = \sum_{q=1}^{Q} a_q\, \frac{|\Sigma_q|^{1/2}}{(2\pi)^{D/2}} \exp\!\left(-\tfrac{1}{2}\left\|\Sigma_q^{1/2}(\mathbf{x} - \mathbf{x}')\right\|^2\right) \cos\langle \mathbf{x} - \mathbf{x}',\, 2\pi\boldsymbol{\mu}_q\rangle\,. \qquad (6)$

The parameters of the spectral mixture kernel θ = {a_q, Σ_q, µ_q} are mixture weights, bandwidths (inverse length-scales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation g(x, w) captures non-stationary and hierarchical structure.
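To make Eqs. (3), (5), and (6) concrete, the following NumPy sketch builds a deep kernel matrix by passing inputs through a toy feedforward transformation g(x, w) and applying an RBF or spectral mixture base kernel to the transformed inputs. It is an illustrative sketch only: the two-layer map with random weights, the folded-in SM normalisation constant, and all dimensions are our own assumptions, and no KISS-GP approximation (introduced below) is applied.

```python
import numpy as np

def g(X, W1, b1, W2, b2):
    """Toy deep transformation g(x, w): two fully-connected layers with ReLU."""
    H = np.maximum(X @ W1 + b1, 0.0)
    return H @ W2 + b2                                   # n x d_out transformed inputs

def rbf_kernel(Z1, Z2, lengthscale=1.0):
    """Eq. (3): k_RBF(z, z') = exp(-0.5 * ||z - z'||^2 / l^2)."""
    sq = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2.0 * Z1 @ Z2.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def sm_kernel(Z1, Z2, weights, scales, means):
    """Eq. (6) with diagonal Sigma_q; the constant |Sigma_q|^{1/2}/(2*pi)^{D/2}
    is folded into the mixture weight a_q for brevity.
    weights: (Q,) mixture weights a_q
    scales:  (Q, d) square roots of the diagonal of Sigma_q (bandwidths)
    means:   (Q, d) frequencies mu_q"""
    diff = Z1[:, None, :] - Z2[None, :, :]               # n1 x n2 x d
    K = np.zeros((Z1.shape[0], Z2.shape[0]))
    for a_q, s_q, mu_q in zip(weights, scales, means):
        quad = np.sum((diff * s_q)**2, axis=-1)          # ||Sigma_q^{1/2}(z - z')||^2
        K += a_q * np.exp(-0.5 * quad) * np.cos(2.0 * np.pi * np.sum(diff * mu_q, axis=-1))
    return K

rng = np.random.default_rng(0)
n, D, d_hidden, d_out, Q = 50, 10, 20, 2, 4
X = rng.standard_normal((n, D))
W1, b1 = 0.1 * rng.standard_normal((D, d_hidden)), np.zeros(d_hidden)
W2, b2 = 0.1 * rng.standard_normal((d_hidden, d_out)), np.zeros(d_out)

Z = g(X, W1, b1, W2, b2)                                 # Eq. (5): transform the inputs
K_deep_rbf = rbf_kernel(Z, Z)                            # deep kernel with RBF base kernel
K_deep_sm = sm_kernel(Z, Z, weights=np.ones(Q) / Q,
                      scales=np.ones((Q, d_out)),
                      means=rng.uniform(0.0, 1.0, (Q, d_out)))
```

In the full method, g(x, w) is a trained fully-connected or convolutional network, and the resulting covariance matrix is further approximated for scalability as described next.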


Figure 1: Deep Kernel Learning: A Gaussian process with a deep kernel maps D dimensional inputs x through L parametric hidden layers followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters θ. Overall, a Gaussian process with a deep kernel produces a probabilistic mapping with an infinite number of adaptive basis functions parametrized by γ = {w, θ}. All parameters γ are learned jointly through the marginal likelihood of the Gaussian process.

We use the deep kernel of Eq. (5) as the covariance function of a Gaussian process to model data D = {x_i, y_i}_{i=1}^n. Conditioned on all kernel hyperparameters, we can interpret our model as applying a Gaussian process with base kernel k_θ to the final hidden layer of a deep network. Since a GP with (RBF or SM) base kernel k_θ corresponds to an infinite basis function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.

We emphasize, however, that we jointly learn all deep kernel hyperparameters, γ = {w, θ}, which include w, the weights of the network, and θ the parameters of the base kernel, by maximizing the log marginal likelihood L of the Gaussian process (see Eq. (4)). Indeed compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g., standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures. In the experiments, we show that jointly learning all deep kernel parameters has advantages over training a GP applied to the output layer of a trained deep neural network.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:

$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\frac{\partial K_\gamma}{\partial \boldsymbol{\theta}}\,, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial K_\gamma}\frac{\partial K_\gamma}{\partial g(\mathbf{x}, \mathbf{w})}\frac{\partial g(\mathbf{x}, \mathbf{w})}{\partial \mathbf{w}}\,.$

The implicit derivative of the log marginal likelihood with respect to our n × n data covariance matrix K_γ is given by

$\frac{\partial \mathcal{L}}{\partial K_\gamma} = \frac{1}{2}\left(K_\gamma^{-1}\mathbf{y}\mathbf{y}^\top K_\gamma^{-1} - K_\gamma^{-1}\right)\,, \qquad (7)$

where we have absorbed the noise covariance σ²I into our covariance matrix, and treat it as part of the base kernel hyperparameters θ. ∂K_γ/∂θ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly, ∂K_γ/∂g(x, w) are the implicit derivatives of the deep kernel with respect to g, holding θ fixed. The derivatives with respect to the weight variables ∂g(x, w)/∂w are computed using standard backpropagation.

For scalability, we replace all instances of K_γ with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)

$K_\gamma \approx M\, K^{\mathrm{deep}}_{U,U} M^\top := K_{\mathrm{KISS}}\,, \qquad (8)$

where M is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and K^{deep}_{U,U} is a covariance matrix created from our deep kernel, evaluated over m latent inducing points U = [u_i]_{i=1...m}. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of K_{U,U} into a Kronecker product of Toeplitz matrices for extremely fast matrix vector multiplications (MVMs), without requiring any grid structure in the data inputs X or the transformed inputs g(x, w). Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a stand-alone kernel, $k(\mathbf{x}, \mathbf{z}) = \mathbf{m}_{\mathbf{x}}^\top K^{\mathrm{deep}}_{U,U}\mathbf{m}_{\mathbf{z}}$, which can be combined with Gaussian processes or other kernel machines for scalable learning.

For inference we solve $K_{\mathrm{KISS}}^{-1}\mathbf{y}$ using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is j ≪ n, and in practice j depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points n. For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015) with extensions in Wilson et al. (2015).
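The following NumPy/SciPy sketch illustrates the two computational ideas above under simplifying assumptions of ours: the implicit gradient of Eq. (7) computed densely on a small subset, and a structured-kernel-interpolation approximation in the spirit of Eq. (8) on a one-dimensional inducing grid, solved with conjugate gradients. It uses local linear interpolation (two weights per row) rather than the local cubic interpolation (four weights per row) of the paper, does not exploit Kronecker or Toeplitz structure, and omits the log-determinant estimation.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

def rbf(a, b, ell=0.3):
    """1-D RBF base kernel on (already transformed) inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

def dL_dK(K, y):
    """Eq. (7): implicit derivative of the log marginal likelihood w.r.t. K
    (noise variance assumed to be absorbed into K)."""
    alpha = np.linalg.solve(K, y)
    return 0.5 * (np.outer(alpha, alpha) - np.linalg.inv(K))

def interpolation_weights(x, u):
    """Sparse local *linear* interpolation onto a regular grid u (an assumption;
    the paper uses local cubic interpolation with 4 weights per row)."""
    spacing = u[1] - u[0]
    idx = np.clip(np.floor((x - u[0]) / spacing).astype(int), 0, len(u) - 2)
    frac = (x - u[idx]) / spacing
    rows = np.repeat(np.arange(len(x)), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1.0 - frac, frac], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(len(x), len(u)))

rng = np.random.default_rng(0)
n, m, noise = 2000, 200, 0.1
x = np.sort(rng.uniform(-1.0, 1.0, n))
y = np.sin(6.0 * x) + noise * rng.standard_normal(n)

# Exact gradient of Eq. (7) on a small subset (O(n^3); for illustration only).
K_small = rbf(x[:100], x[:100]) + noise**2 * np.eye(100)
grad_small = dL_dK(K_small, y[:100])

# KISS-style approximation, Eq. (8): K ~= M K_UU M^T, with sparse M.
u = np.linspace(-1.0, 1.0, m)                 # inducing points on a regular grid
M = interpolation_weights(x, u)               # sparse n x m interpolation matrix
K_uu = rbf(u, u)

def kiss_mvm(v):
    """MVM with (M K_UU M^T + sigma^2 I) in O(n + m^2) time."""
    return M @ (K_uu @ (M.T @ v)) + noise**2 * v

A = LinearOperator((n, n), matvec=kiss_mvm)
alpha, info = cg(A, y)                        # linear conjugate gradients for K^{-1} y
```

In the paper, this approximate kernel, combined with Kronecker and Toeplitz algebra over the inducing grid, is what makes O(n) training and O(1) prediction possible.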


Table 1: Comparative RMSE performance and runtime on UCI regression datasets, with n training points and d input dimensions. The results are averaged over 5 equal partitions (90% train, 10% test) of the data ± one standard deviation. Here, "best" denotes the best-performing kernel according to Yang et al. (2015) (note that the best-performing kernel is often GP-SM). Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, Fastfood finite basis function expansions are used for approximation in GP (RBF, SM, best). We verified on datasets with n < 10,000 that exact GPs with RBF kernels provide comparable performance to the Fastfood expansions. For datasets with n < 6,000 we used a fully-connected DNN with a [d-1000-500-50-2] architecture, and for n > 6,000 we used a [d-1000-1000-500-50-2] architecture. DNN+GP is a GP applied to the fixed, pre-trained output layer of the DNN. We used an RBF kernel and the KISS-GP approximation for direct comparison with our proposed deep kernel learning (DKL). We consider scalable DKL with RBF and SM base kernels. For the SM base kernel, we set Q = 4 for datasets with n < 10,000 training instances, and use Q = 6 for larger datasets.
Datasets  n  d  GP-RBF  GP-SM  GP-best  DNN  DNN+GP  DKL-RBF  DKL-SM  DNN  DKL-RBF  DKL-SM
(columns 4–10 report RMSE; the last three columns report runtime in seconds)

Gas 2,565 128 0.21±0.07 0.14±0.08 0.12±0.07 0.11±0.05 0.11±0.05 0.11±0.05 0.09±0.06 7.4 7.8 10.5
Skillcraft 3,338 19 1.26±3.14 0.25±0.02 0.25±0.02 0.25±0.00 0.25±0.00 0.25±0.00 0.25±0.00 15.8 15.9 17.1
SML 4,137 26 6.94±0.51 0.27±0.03 0.26±0.04 0.25±0.02 0.25±0.01 0.24±0.01 0.23±0.01 1.1 1.5 1.9
Parkinsons 5,875 20 3.94±1.31 0.00±0.00 0.00±0.00 0.31±0.04 0.31±0.04 0.29±0.04 0.29±0.04 3.2 3.4 6.5
Pumadyn 8,192 32 1.00±0.00 0.21±0.00 0.20±0.00 0.25±0.02 0.25±0.02 0.24±0.02 0.23±0.02 7.5 7.9 9.8
PoleTele 15,000 26 12.6±0.3 5.40±0.3 4.30±0.2 3.42±0.05 3.36±0.04 3.28±0.04 3.11±0.07 8.0 8.3 27.0
Elevators 16,599 18 0.12±0.00 0.090±0.001 0.089±0.002 0.099±0.001 0.097±0.002 0.084±0.002 0.084±0.002 8.9 9.2 11.8
Kin40k 40,000 8 0.34±0.01 0.19±0.02 0.06±0.00 0.11±0.01 0.11±0.01 0.05±0.00 0.03±0.01 19.8 20.7 25.0
Protein 45,730 9 1.64±1.66 0.50±0.02 0.47±0.01 0.49±0.01 0.49±0.01 0.46±0.01 0.43±0.01 143 155 144
KEGG 48,827 22 0.33±0.17 0.12±0.01 0.12±0.01 0.12±0.01 0.12±0.00 0.11±0.00 0.10±0.01 31.3 34.2 61.0
CTslice 53,500 385 7.13±0.11 2.21±0.06 0.59±0.07 0.41±0.06 0.41±0.02 0.36±0.01 0.34±0.02 36.4 44.3 80.4
KEGGU 63,608 27 0.29±0.12 0.12±0.00 0.12±0.00 0.12±0.00 0.12±0.00 0.11±0.00 0.11±0.00 39.5 43.0 41.1
3Droad 434,874 3 12.9±0.1 10.3±0.2 9.90±0.10 7.36±0.07 7.04±0.06 6.91±0.04 6.91±0.04 239 256 292
Song 515,345 90 0.55±0.00 0.46±0.00 0.45±0.00 0.45±0.02 0.45±0.01 0.44±0.00 0.43±0.01 518 539 590
Buzz 583,250 77 0.88±0.01 0.51±0.01 0.51±0.01 0.49±0.00 0.49±0.00 0.48±0.00 0.46±0.01 486 523 770
Electric 2M 11 0.23±0.00 0.053±0.000 0.053±0.000 0.058±0.002 0.054±0.002 0.050±0.002 0.048±0.002 3458 3542 4881
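For reference, the evaluation protocol stated in the caption (average test RMSE ± one standard deviation over 5 equal 90%/10% partitions) can be written as a short helper; the `fit` and `predict` callables below are hypothetical stand-ins for any model in the table, and random resampling of the partitions is our own assumption since the exact splitting scheme is not spelled out here.

```python
import numpy as np

def evaluate_rmse(X, y, fit, predict, n_splits=5, seed=0):
    """Mean test RMSE +/- one standard deviation over 90%/10% train/test partitions."""
    rng = np.random.default_rng(seed)
    rmses = []
    for _ in range(n_splits):
        perm = rng.permutation(len(y))
        cut = int(0.9 * len(y))
        train_idx, test_idx = perm[:cut], perm[cut:]
        model = fit(X[train_idx], y[train_idx])
        residual = predict(model, X[test_idx]) - y[test_idx]
        rmses.append(np.sqrt(np.mean(residual**2)))
    return float(np.mean(rmses)), float(np.std(rmses))
```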

KISS-GP training scales as O(n + h(m)) (where h(m) is typically close to linear in m), versus conventional scalable GP approaches which require O(m²n + m³) (Quiñonero-Candela and Rasmussen, 2005) computations and need m ≪ n for tractability, which results in severe deteriorations in predictive performance. The ability to have large m ≈ n allows KISS-GP to have near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a non-parametric representation, while providing linear scaling in n and O(1) time per test point prediction (Wilson et al., 2015). We empirically demonstrate this scalability and accuracy in our experiments of section 5.

5 Experiments

We evaluate the proposed deep kernel learning method on a wide range of regression problems, including a large and diverse collection of regression tasks from the UCI repository (section 5.1), orientation extraction from face patches (section 5.2), magnitude recovery of handwritten digits (section 5.3), and step function recovery (section 5.4 and the supplementary material). We show that the proposed algorithm substantially outperforms GPs with expressive kernel learning approaches, and deep neural networks, without any significant increases in computational overhead.

All experiments were performed on a Linux machine with eight 4.0GHz CPU cores and 32GB RAM. We implemented DNNs based on Caffe (Jia et al., 2014), a general deep learning platform.

For our deep kernel learning model, we first train a deep neural network using SGD with the squared loss objective, and rectified linear activation functions. After the neural network has been pre-trained, a KISS-GP model was fitted using the top-level features of the DNN model as inputs. Using this pre-training initialization, our joint deep kernel learning (DKL) model of section 4 is then trained by optimizing all the hyperparameters γ of our deep kernel, by backpropagating derivatives through the marginal likelihood of the Gaussian process (see Eq. 7).

5.1 UCI regression tasks

We consider a large set of UCI regression problems of varying sizes and properties. Table 1 reports test root mean squared error (RMSE) for 1) many scalable Gaussian process kernel learning methods based on Fastfood (Yang et al., 2015); 2) stand-alone deep neural networks (DNNs); and 3) our proposed combined deep kernel learning (DKL) model using both RBF and SM base kernels.

For smaller datasets, where the number of training examples n < 6,000, we used a fully-connected neural network with a d-1000-500-50-2 architecture; for larger datasets we used a d-1000-1000-500-50-2 architecture¹.

¹We found [d-1000-1000-500-50] architectures provide a similar level of performance, but scalable Kronecker algebra is most effective if the network maps into D ≤ 5 dimensional spaces.
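As a condensed illustration of this training procedure, the sketch below jointly optimises the network weights w and base kernel hyperparameters θ through an exact GP marginal likelihood in PyTorch. It is a small-data sketch under our own simplifying assumptions: an exact Cholesky-based objective stands in for the KISS-GP approximation, the Caffe-based pre-training stage is omitted, and the architecture and learning rate are illustrative rather than those used in the paper.

```python
import torch
import torch.nn as nn

class DeepKernelGP(nn.Module):
    """Deep kernel (Eq. 5) with an RBF base kernel (Eq. 3) and an exact-GP
    negative log marginal likelihood (Eq. 4) as the training objective."""
    def __init__(self, d_in, d_feat=2):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(d_in, 100), nn.ReLU(),
                               nn.Linear(100, 50), nn.ReLU(),
                               nn.Linear(50, d_feat))          # g(x, w)
        self.log_ell = nn.Parameter(torch.zeros(()))           # base kernel length-scale
        self.log_noise = nn.Parameter(torch.tensor(-1.0))      # observation noise

    def kernel(self, X):
        Z = self.g(X)
        sq = (Z * Z).sum(1, keepdim=True) + (Z * Z).sum(1) - 2.0 * Z @ Z.T
        return torch.exp(-0.5 * sq.clamp_min(0.0) / torch.exp(2.0 * self.log_ell))

    def neg_log_marginal_likelihood(self, X, y):
        n = y.shape[0]
        K = self.kernel(X) + torch.exp(2.0 * self.log_noise) * torch.eye(n)
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y.unsqueeze(1), L).squeeze(1)
        # 0.5 y^T K^{-1} y + 0.5 log|K|, with the constant term dropped
        return 0.5 * (y @ alpha) + torch.log(torch.diagonal(L)).sum()

# Toy data; in the paper the network is first pre-trained with a squared loss,
# and gamma = {w, theta} is then optimised jointly through the marginal likelihood.
torch.manual_seed(0)
X = torch.randn(500, 16)
y = torch.sin(3.0 * X[:, 0]) + 0.1 * torch.randn(500)

model = DeepKernelGP(d_in=16)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(200):
    opt.zero_grad()
    loss = model.neg_log_marginal_likelihood(X, y)
    loss.backward()        # gradients flow through K_gamma into both w and theta
    opt.step()
```

With the KISS-GP machinery of section 4 substituted for the dense Cholesky factorisation, the same joint objective scales to the dataset sizes in Table 1.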

Figure 2: Left: Randomly sampled training and test examples. Right: The two dimensional outputs of the convolutional network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.

Table 1 shows that on most of the datasets, our DKL method strongly outperforms not only Gaussian processes with the standard RBF kernel, but also the best-performing kernels selected from a wide range of alternative kernel learning procedures (Yang et al., 2015). We further compared DKL to stand-alone deep neural networks which have the exact same architecture as the DNN component of DKL, and to DNN+GP, which is a GP applied to a pre-trained DNN. We see that DNN+GP outperforms stand-alone DNNs, showing the non-parametric flexibility of kernel methods. By combining KISS-GP with DNNs as part of a joint DKL procedure, we obtain consistently better results than DNN and DNN+GP over all 16 datasets. Moreover, using a spectral mixture base kernel (Eq. (6)) to create a deep kernel provides notable additional performance improvements. By effectively learning the salient features from raw data, plain DNNs generally achieve competitive performance compared to expressive GPs. Combining the complementary advantages of these approaches into scalable deep kernels consistently brings substantial additional performance gains.

We next investigate the runtime of DKL. Table 1, right panel, compares DKL with a stand-alone DNN in terms of runtime for evaluating the objective and derivatives (i.e. one forward and backpropagation pass for DNN; one computation of the marginal likelihood and all relevant derivatives for DKL). We see that in addition to improving accuracy, combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs: KISS-GP imposes an additional runtime of about 10% over a stand-alone DNN. Overall, these results show the general applicability and practical significance of our scalable DKL approach.

5.2 Face orientation extraction

We now consider the task of predicting the orientation of a face extracted from a gray-scale image patch, explored in Salakhutdinov and Hinton (2008). We investigate our DKL procedure for efficiently learning meaningful representations from high-dimensional, highly-structured image data.

Table 2: RMSE performance on Olivetti and MNIST. For comparison, in the face orientation extraction, we trained DKL on the same amount (12,000) of training instances as with DBN+GP, but used all labels; whereas DBN+GP (as with GP) scaled to only 1,000 labeled images and modeled the remaining data through unsupervised pretraining of a DBN. CNN+GP is a GP applied to a fixed, pre-trained CNN. We used an RBF base kernel within GPs.

Datasets  GP  DBN+GP  CNN  CNN+GP  DKL
Olivetti  16.33  6.42  6.34  6.42  6.07
MNIST  1.25  1.03  0.59  0.56  0.53

The Olivetti face data set contains ten 64×64 images of forty different people, for 400 images total. Following Salakhutdinov and Hinton (2008), we constructed datasets of 28×28 images by randomly rotating (uniformly from −90° to +90°), cropping, and subsampling the original 400 images. We then randomly select 30 people uniformly and collect their images as training data, while using the images of the remaining 10 people as test data. Figure 2 shows randomly sampled examples from the training and test data.

For training DKL on the Olivetti face patches we used a convolutional network consisting of 2 convolutional layers followed by 4 fully-connected layers, mapping a face patch to a 2-dimensional feature vector, with an SM base kernel. We describe this convolutional architecture in detail in the supplementary material.

Table 2 shows the RMSE of the predicted face orientations using four models. The DBN+GP model, proposed by Salakhutdinov and Hinton (2008), first extracts features from raw data using a Deep Belief Network (DBN), and then applies a Gaussian process with an RBF kernel. However, their approach could only handle up to a few thousand labelled datapoints, due to the O(n³) complexity of standard Gaussian processes. The remaining data were modeled through unsupervised learning of a DBN, leaving the large amount of available labels unused.

Our proposed deep kernel methods, by contrast, scale linearly with the size of training data, and are capable of directly modeling the full labeled data to accurately recover salient patterns. Figure 2, right panel, shows that the deep kernel discovers features essential for orientation prediction, while filtering out irrelevant factors such as identities and scales.

Figure 3: Left: RMSE vs. n, the number of training examples. Middle: Runtime vs. n. Right: Total training time vs. n. The dashed line in black indicates a slope of 1. CNNs are used within DKL. We set Q = 4 for the SM kernel.

Figure 3, left panel, further validates the benefit of scaling to large data. As more training data are used, our model continues to increase in accuracy. Indeed, it is the large datasets that will provide the greatest opportunities for our model to discover expressive statistical representations.

Figure 4: The log spectral densities of the DKL-SM and DKL-SE base kernels are in black and red, respectively.

In Figure 4 we show the spectral density (the Fourier transform) of the base kernels learned through our deep kernel learning method. The expressive spectral mixture (SM) kernel discovers a structure with two peaks in the frequency domain. The RBF kernel is only able to use a single Gaussian in the spectral domain, centred at the origin. In an attempt to capture the significant mass near a frequency of 25, the RBF kernel spectral density spreads itself across the whole frequency domain, missing the important local correlations near a frequency s = 0, thus erroneously discarding much of the network features as white noise, since a broad spectral peak corresponds to a short length-scale. This result provides intuition for why spectral mixture base kernels generally perform much better than RBF base kernels, despite the flexibility of the deep architecture.

We further see the benefit of an SM base kernel in Figure 5, where we show the learned covariance matrices constructed from the whole deep kernels (composition of base kernel and deep architecture) for RBF and SM base kernels. The covariance matrix is evaluated on a set of test inputs, where we randomly sample 400 instances from the test set and sort them in terms of the orientation angles of the input faces. We see that the deep kernels with both RBF and SM base kernels discover that faces with similar rotation angles are highly correlated, concentrating their largest entries on the diagonal (i.e., face pairs with similar orientations). Deep kernel learning with an SM base kernel captures these correlations more strongly than the RBF base kernel, which is somewhat more diffuse.

In Figure 5, right panel, we also show the learned covariance matrix for an RBF kernel with a standard Gaussian process applied to the raw data inputs. We see that the entries are very diffuse. In essence, through deep kernel learning, we can learn a metric where faces with similar rotation angles are highly correlated, and thus overcome the fundamental limitations of a Euclidean distance metric (used by standard kernels), where similar rotation angles are not particularly correlated, regardless of what hyper-parameters are learned with Euclidean kernels.

We next measure the scalability of our model. Figure 3, middle panel, shows the runtimes in seconds, as a function of training instances, for evaluating the objective and any relevant derivatives. We see that, with the scalable KISS-GP, the joint model achieves a roughly linear asymptotic scaling, with a slope of 1. In Figure 3, right panel, we show how the total training time (i.e., the time for CNN pre-training plus the time for DKL with CNN architecture joint training) changes with varying the data size n. In addition to the linear scaling which is necessary for modeling large data, the fixed added time in combining KISS-GP with CNNs is modest, especially considering the gains in performance and expressive power.

5.3 Digit magnitude extraction

We map images of handwritten digits to a single real-value that is as close as possible to the integer represented by the digit in the image, as in Salakhutdinov and Hinton (2008). The MNIST digit dataset contains 60,000 training and 10,000 test 28×28 images of ten handwritten digits (0 to 9). We used a convolutional neural network with a similar architecture as the LeNet (LeCun et al., 1998) (detailed in the supplementary material). Table 2 shows that a CNN performs considerably better than GP and DBN+GP, and DKL (with CNN architecture) further improves over CNN.
Figure 5: Left: The induced covariance matrix using DKL-SM kernel on a set of test cases, where the test samples are ordered by the orientations of the input faces. Middle: The respective covariance matrix using DKL-RBF kernel. Right: The respective covariance matrix using regular RBF kernel. The models are trained with n = 12,000, and Q = 4 for the SM base kernel.

5.4 Step function recovery

We have so far considered RMSE for comparison to alternative methods where posterior predictive distributions are not readily available, or on problems where RMSE has historically been used as a benchmark. However, an advantage of DKL over stand-alone deep architectures is the ability to naturally produce a posterior predictive distribution, which is especially useful in applications such as reinforcement learning and Bayesian optimisation. In Figure 6, we consider an example where we use DKL to learn the posterior predictive distribution for a step function with many challenging discontinuities. This problem is particularly difficult for conventional GP approaches, due to strong smoothness assumptions intrinsic to popular kernels. GPs with SM kernels improve upon RBF kernels, but neither can properly adapt to the many sharp changes in covariance structure. By contrast, DKL-SM accurately encodes the discontinuities of the function, and has reasonable uncertainty over the whole domain. Further details are in the supplement.

Figure 6: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with SM base kernel. We set Q = 4 for SM kernels.

6 Discussion

We have explored scalable deep kernels, which combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods. In particular, we transform the inputs of a base kernel with a deep architecture, and then leverage local kernel interpolation, inducing points, and structure exploiting algebra (e.g., Kronecker and Toeplitz methods) for a scalable kernel representation. These scalable kernels can then be combined with Gaussian process inference and learning procedures for O(n) training and O(1) testing time. Moreover, we use spectral mixture covariances as a base kernel, which provides a significant additional boost in representational power. Overall, our scalable deep kernels can be used in place of standard kernels, following the same inference and learning procedures, but with benefits in expressive power and efficiency. We show on a wide range of experiments the general applicability and practical significance of our approach, consistently outperforming scalable GPs with expressive kernels, stand-alone DNNs, and GPs applied to the outputs of trained DNNs.

A major challenge in developing expressive kernel learning approaches is the Euclidean and absolute distance based metrics which are pervasive in most families of kernel functions, such as the ARD and Matérn kernels. Indeed, although intuitive in some cases, one cannot expect Euclidean and absolute distance as measures of similarity to be generally applicable, and they are especially problematic in high dimensional input spaces (Aggarwal et al., 2001). Modern approaches attempt to learn a flexible parametric family, for example, through weighted combinations of known kernels (e.g., Gönen and Alpaydın, 2011), but are still fundamentally limited to these standard notions of distance.

As we have seen in the Olivetti faces examples, our approach allows for the whole functional form of the metric to be learned in a flexible manner, through expressive transformations of the input space. We expect such metric learning to be particularly valuable in high dimensional classification problems, which we view as a promising direction for future research. We hope that this work will help bring together research on neural networks and kernel methods, to inspire many new models and unifying perspectives which combine the complementary advantages of these approaches.


Acknowledgements: We thank NIH R01GM093156 and NIH R01GM087694, NSF IIS-1218282, and ONR N000141410684 grants for support.

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. Springer.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.

Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. (2014). Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876.

Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics.

Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Huang, W., Zhao, D., Sun, F., Liu, H., and Chang, E. (2015). Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3576–3582. AAAI Press.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Unifying visual-semantic embeddings with multimodal neural language models. TACL.

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

Le, Q., Sarlos, T., and Smola, A. (2013). Fastfood – computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI).

MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, chapter 11, pages 133–165. Springer-Verlag.

Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. The Journal of Machine Learning Research, 7:2651–2667.

Neal, R. (1996). Bayesian Learning for Neural Networks. Springer Verlag.

Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.

Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Saatchi, Y. (2011). Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge.

Salakhutdinov, R. and Hinton, G. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249–1256.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Ali, M., and Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning.

Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24, pages 801–809.

Wilson, A. G. (2014). Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes. PhD thesis, University of Cambridge.

Wilson, A. G. and Adams, R. P. (2013). Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML).

Wilson, A. G., Dann, C., and Nickisch, H. (2015). Thoughts on massively scalable Gaussian processes. Technical report, Carnegie Mellon University. http://www.cs.cmu.edu/~andrewgw/msgp.html.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In International Conference on Machine Learning (ICML), Edinburgh. Omnipress.

Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML.

Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. (2014). Deep fried convnets. arXiv preprint arXiv:1412.7149.

Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. (2015). À la carte – learning fast kernels. Artificial Intelligence and Statistics.
