Deep Kernel Learning
As a non-parametric method, the information capacity of our model grows with the amount of available data, but its complexity is automatically calibrated through the marginal likelihood of the Gaussian process, without the need for regularization or cross-validation (Rasmussen and Ghahramani, 2001; Rasmussen and Williams, 2006; Wilson, 2014). The flexibility and automatic calibration provided by the non-parametric layer typically provide a high standard of performance, while reducing the need for extensive hand tuning from the user.

We further build on the ideas in KISS-GP (Wilson and Nickisch, 2015) and extensions (Wilson et al., 2015), so that our deep kernel learning model can scale linearly with the number of training instances n, instead of O(n³) as is standard with Gaussian processes (GPs), while retaining a fully non-parametric representation. Our approach also scales as O(1) per test point, instead of the standard O(n²) for GPs, allowing for very fast prediction times. Because KISS-GP creates an approximate kernel from a user-specified kernel for fast computations, independently of a specific inference procedure, we can view the resulting kernel as a scalable deep kernel. We demonstrate the value of this scalability in the experimental results section, where it is the large datasets that provide the greatest opportunities for our model to discover expressive statistical representations.

We begin by reviewing related work in section 2, and providing background on Gaussian processes in section 3. In section 4 we derive scalable closed-form deep kernels, and describe how to perform efficient automatic learning of these kernels through the Gaussian process marginal likelihood. In section 5, we show substantially improved performance over standard Gaussian processes, expressive kernel learning approaches, deep neural networks, and Gaussian processes applied to the outputs of trained deep networks, on a wide range of datasets. We also interpret the learned kernels to gain new insights into our modelling problems.

2 Related Work

Given the intuitive value of combining kernels and neural networks, it is encouraging that various distinct forms of such combinations have been considered in different contexts.

The Gaussian process regression network (Wilson et al., 2012) replaces all weight connections and activation functions in a Bayesian neural network with Gaussian processes, allowing the authors to model input-dependent correlations between multiple tasks. Alternatively, Damianou and Lawrence (2013) replace every activation function in a Bayesian neural network with a Gaussian process transformation, in an unsupervised setting. While promising, both models are very task specific, require sophisticated approximate Bayesian inference which is much more demanding than what is required by standard Gaussian processes or deep learning models, and typically do not scale beyond a few thousand training points. Similarly, Salakhutdinov and Hinton (2008) combine deep belief networks (DBNs) with Gaussian processes, showing improved performance over standard GPs with RBF kernels, in the context of semi-supervised learning. However, their model relies heavily on unsupervised pre-training of DBNs, with the GP component unable to scale beyond a few thousand training points. Likewise, Calandra et al. (2014) combine a feedforward neural network transformation with a Gaussian process, showing an ability to learn sharp discontinuities. However, similar to many other approaches, the resulting model can only scale to at most a few thousand data points.

In a frequentist setting, Yang et al. (2014) combine convolutional networks, with parameters pre-trained on ImageNet, with a scalable Fastfood (Le et al., 2013) expansion for the RBF kernel applied to the final layer. The resulting method is scalable and flexible, but the network parameters generally must first be trained separately from the Fastfood features, and the combined model remains parametric, due to the parametric expansion provided by Fastfood. Careful attention must still be paid to training procedures, regularization, and manual calibration of the network architecture. In a similar manner, Huang et al. (2015) and Snoek et al. (2015) have combined deep architectures with parametric Bayesian models. Huang et al. (2015) pursue an unsupervised pre-training procedure using deep autoencoders, showing improved performance over GPs using standard kernels. Snoek et al. (2015) show promising performance on Bayesian optimisation tasks, for tuning the parameters of a deep neural network, but do not use a Bayesian (marginal likelihood) objective for training network parameters.

Our approach is distinct in that we combine deep feedforward and convolutional architectures with spectral mixture covariances (Wilson and Adams, 2013), inducing points, Kronecker and Toeplitz algebra, and local kernel interpolation (Wilson and Nickisch, 2015; Wilson et al., 2015), to derive expressive and scalable closed-form kernels, where all parameters are trained jointly with a unified supervised objective, as part of a non-parametric Gaussian process framework, without requiring approximate Bayesian inference. Moreover, the simple joint learning procedure in our approach can be applied in general settings. Indeed we show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance.
We also show that jointly training all the weights of a deep kernel through the marginal likelihood of a non-parametric GP provides significant advantages over training a GP applied to the output layer of a trained deep neural network. Moreover, we achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes

We briefly review Gaussian processes (GPs) and the computational requirements for predictions and kernel learning (see e.g., Rasmussen and Williams (2006) for a comprehensive discussion of GPs).

We assume a dataset D of n input (predictor) vectors X = {x_1, ..., x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^T. If f(x) ∼ GP(µ, k_γ), then any collection of function values f has a joint Gaussian distribution,

f = f(X) = [f(x_1), ..., f(x_n)]^T ∼ N(µ, K_{X,X}),   (1)

with mean vector and covariance matrix defined by the mean function and covariance kernel of the Gaussian process: µ_i = µ(x_i), and (K_{X,X})_{ij} = k_γ(x_i, x_j), where the kernel k_γ is parametrized by γ. Assuming additive Gaussian noise, y(x)|f(x) ∼ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

f_* | X_*, X, y, γ, σ² ∼ N(E[f_*], cov(f_*)),   (2)
E[f_*] = µ_{X_*} + K_{X_*,X} [K_{X,X} + σ²I]^{-1} (y − µ_X),
cov(f_*) = K_{X_*,X_*} − K_{X_*,X} [K_{X,X} + σ²I]^{-1} K_{X,X_*}.

K_{X_*,X}, for example, represents the n_* × n matrix of covariances between the GP evaluated at X_* and X. µ_X and µ_{X_*} are mean vectors evaluated at X and X_*, and K_{X,X} is the n × n covariance matrix evaluated at training inputs X. All covariance matrices implicitly depend on the kernel hyperparameters γ.

GPs with RBF kernels correspond to models which have an infinite basis expansion in a dual space, and have compelling theoretical properties: these models are universal approximators, and have prior support to within an arbitrarily small epsilon band of any continuous function (Micchelli et al., 2006). Indeed the properties of the distribution over functions induced by a Gaussian process are controlled by the kernel function. For example, the popular RBF kernel,

k_RBF(x, x') = exp(−(1/2) ||x − x'||² / ℓ²),   (3)

encodes the inductive bias that function values closer together in the input space are more correlated. The complexity of the functions in the input space is determined by the interpretable length-scale hyperparameter ℓ. Shorter length-scales correspond to functions which vary more rapidly with the inputs x.

The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters γ, provides a principled probabilistic framework for kernel learning:

log p(y | γ, X) ∝ −[y^T (K_γ + σ²I)^{-1} y + log |K_γ + σ²I|],   (4)

where we have used K_γ as shorthand for K_{X,X} given γ. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning is performed by optimizing Eq. (4) with respect to γ.

The computational bottleneck for inference is solving the linear system (K_{X,X} + σ²I)^{-1} y, and for kernel learning is computing the log determinant log |K_{X,X} + σ²I|. The standard approach is to compute the Cholesky decomposition of the n × n matrix K_{X,X}, which requires O(n³) operations and O(n²) storage. After inference, the predictive mean costs O(n), and the predictive variance costs O(n²), per test point x_*.

4 Deep Kernel Learning

In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic GP framework.

Specifically, starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, we transform the inputs (predictors) x as

k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w),   (5)

where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(x_i, x_j | θ). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):

k_SM(x, x' | θ) = Σ_{q=1}^{Q} a_q (|Σ_q|^{1/2} / (2π)^{D/2}) exp(−(1/2) ||Σ_q^{1/2} (x − x')||²) cos⟨x − x', 2π µ_q⟩.   (6)
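To make the construction above concrete, the following is a minimal NumPy sketch of a deep RBF kernel (Eqs. (3) and (5)) built on the features of a small fully-connected network, together with the model-fit and complexity terms of the log marginal likelihood of Eq. (4). The helper names (mlp_forward, deep_rbf_kernel) are illustrative only; the experiments in this paper use Caffe-based networks, spectral mixture base kernels, and the KISS-GP approximation rather than this code.

```python
# Minimal NumPy sketch (not the paper's implementation) of a deep RBF kernel,
# Eqs. (3) and (5), and the log marginal likelihood objective of Eq. (4),
# up to an additive constant. Helper names are illustrative.
import numpy as np

def mlp_forward(X, weights):
    """Deep transformation g(x, w): a small fully-connected network with ReLU units."""
    H = X
    for W, b in weights[:-1]:
        H = np.maximum(H @ W + b, 0.0)   # hidden layers with rectified linear units
    W, b = weights[-1]
    return H @ W + b                     # final low-dimensional features

def rbf_kernel(A, B, lengthscale):
    """Base kernel k_RBF(x, x') = exp(-0.5 ||x - x'||^2 / lengthscale^2), Eq. (3)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def deep_rbf_kernel(XA, XB, weights, lengthscale):
    """Deep kernel of Eq. (5): the base kernel applied to g(x, w)."""
    return rbf_kernel(mlp_forward(XA, weights), mlp_forward(XB, weights), lengthscale)

def log_marginal_likelihood(X, y, weights, lengthscale, sigma2):
    """Model-fit and complexity terms of Eq. (4), via a Cholesky factorization."""
    n = X.shape[0]
    K = deep_rbf_kernel(X, X, weights, lengthscale) + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return -0.5 * y @ alpha - np.sum(np.log(np.diag(L)))

# illustrative usage: random data and a [D-10-2] network
rng = np.random.default_rng(0)
X, y = rng.standard_normal((50, 5)), rng.standard_normal(50)
weights = [(0.1 * rng.standard_normal((5, 10)), np.zeros(10)),
           (0.1 * rng.standard_normal((10, 2)), np.zeros(2))]
print(log_marginal_likelihood(X, y, weights, lengthscale=1.0, sigma2=0.1))
```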
Figure 1: Deep Kernel Learning: A Gaussian process with a deep kernel maps D-dimensional inputs x through L parametric hidden layers, followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters θ. Overall, a Gaussian process with a deep kernel produces a probabilistic mapping with an infinite number of adaptive basis functions parametrized by γ = {w, θ}. All parameters γ are learned jointly through the marginal likelihood of the Gaussian process.

The parameters of the spectral mixture kernel, θ = {a_q, Σ_q, µ_q}, are mixture weights, bandwidths (inverse length-scales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation g(x, w) captures non-stationary and hierarchical structure.

We use the deep kernel of Eq. (5) as the covariance function of a Gaussian process to model data D = {x_i, y_i}_{i=1}^{n}. Conditioned on all kernel hyperparameters, we can interpret our model as applying a Gaussian process with base kernel k_θ to the final hidden layer of a deep network. Since a GP with (RBF or SM) base kernel k_θ corresponds to an infinite basis function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.

We emphasize, however, that we jointly learn all deep kernel hyperparameters, γ = {w, θ}, which include w, the weights of the network, and θ, the parameters of the base kernel, by maximizing the log marginal likelihood L of the Gaussian process (see Eq. (4)). Indeed compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g., standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures. In the experiments, we show that jointly learning all deep kernel parameters has advantages over training a GP applied to the output layer of a trained deep neural network.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:

∂L/∂θ = (∂L/∂K_γ)(∂K_γ/∂θ),   ∂L/∂w = (∂L/∂K_γ)(∂K_γ/∂g(x, w))(∂g(x, w)/∂w).

The implicit derivative of the log marginal likelihood with respect to our n × n data covariance matrix K_γ is given by

∂L/∂K_γ = (1/2)(K_γ^{-1} y y^T K_γ^{-1} − K_γ^{-1}),   (7)

where we have absorbed the noise covariance σ²I into our covariance matrix, and treat it as part of the base kernel hyperparameters θ. ∂K_γ/∂θ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly, ∂K_γ/∂g(x, w) are the implicit derivatives of the deep kernel with respect to g, holding θ fixed. The derivatives with respect to the weight variables, ∂g(x, w)/∂w, are computed using standard backpropagation.

For scalability, we replace all instances of K_γ with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)

K_γ ≈ M K^{deep}_{U,U} M^T := K_KISS,   (8)

where M is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and K_{U,U} is a covariance matrix created from our deep kernel, evaluated over m latent inducing points U = [u_i]_{i=1,...,m}. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of K_{U,U} into a Kronecker product of Toeplitz matrices for extremely fast matrix-vector multiplications (MVMs), without requiring any grid structure in the data inputs X or the transformed inputs g(x, w). Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a stand-alone kernel, k(x, z) = m_x^T K^{deep}_{U,U} m_z, which can be combined with Gaussian processes or other kernel machines for scalable learning.

For inference we solve K_KISS^{-1} y using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix-vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is j ≪ n, and in practice j depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points n.
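The sketch below illustrates the structure of Eq. (8) and the LCG solve: the KISS-GP covariance is never formed explicitly, only its action on a vector through the sparse interpolation matrix M and the inducing-point covariance K_{U,U}. For brevity it uses local linear interpolation (2 weights per row) on a one-dimensional grid of toy features, whereas the method described above uses local cubic interpolation (4 weights per row) and Kronecker/Toeplitz algebra for K_{U,U}; all names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the KISS-GP structure of Eq. (8), K ≈ M K_UU M^T, and an LCG solve
# that only uses matrix-vector multiplies. For brevity this uses local linear
# interpolation (2 weights per row) on a 1-D grid of toy features; the method
# described in the text uses local cubic interpolation (4 weights per row) and
# Kronecker/Toeplitz algebra for K_UU. All names are illustrative.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import LinearOperator, cg

def interpolation_matrix(x, grid):
    """Sparse M: each row holds the interpolation weights of x_i onto the grid."""
    n, m = len(x), len(grid)
    idx = np.clip(np.searchsorted(grid, x) - 1, 0, m - 2)
    w = (x - grid[idx]) / (grid[idx + 1] - grid[idx])   # weight on right neighbour
    rows = np.repeat(np.arange(n), 2)
    cols = np.stack([idx, idx + 1], axis=1).ravel()
    vals = np.stack([1.0 - w, w], axis=1).ravel()
    return csr_matrix((vals, (rows, cols)), shape=(n, m))

def rbf(a, b, ell):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 2000)                    # stand-in for deep features g(x, w)
y = np.sin(x) + 0.1 * rng.standard_normal(x.size)
grid = np.linspace(0.0, 10.0, 200)                  # m inducing points U on a regular grid
M = interpolation_matrix(x, grid)
K_uu = rbf(grid, grid, ell=1.0)                     # would be the deep kernel on U
sigma2 = 0.01

def matvec(v):
    """(M K_UU M^T + sigma^2 I) v, without ever forming the n x n matrix."""
    return M @ (K_uu @ (M.T @ v)) + sigma2 * v

A = LinearOperator((x.size, x.size), matvec=matvec, dtype=np.float64)
alpha, info = cg(A, y)                              # LCG solve for K_KISS^{-1} y
```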
Table 1: Comparative RMSE performance and runtime on UCI regression datasets, with n training points and d the input dimensions. The results are averaged over 5 equal partitions (90% train, 10% test) of the data ± one standard deviation. The best column denotes the best-performing kernel according to Yang et al. (2015) (note that often the best-performing kernel is GP-SM). Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, the Fastfood finite basis function expansions are used for approximation in GP (RBF, SM, best). We verified on datasets with n < 10,000 that exact GPs with RBF kernels provide comparable performance to the Fastfood expansions. For datasets with n < 6,000 we used a fully-connected DNN with a [d-1000-500-50-2] architecture, and for n > 6,000 we used a [d-1000-1000-500-50-2] architecture. DNN+GP is a GP applied to the fixed pre-trained output layer of the DNN. We used an RBF kernel and the KISS-GP approximation for direct comparison with our proposed deep kernel learning (DKL). We consider scalable DKL with RBF and SM base kernels. For the SM base kernel, we set Q = 4 for datasets with n < 10,000 training instances, and use Q = 6 for larger datasets.
Datasets  n  d  |  RMSE: GP-RBF  GP-SM  GP-best  DNN  DNN+GP  DKL-RBF  DKL-SM  |  Runtime(s): DNN  DKL-RBF  DKL-SM
Gas 2,565 128 0.21±0.07 0.14±0.08 0.12±0.07 0.11±0.05 0.11±0.05 0.11±0.05 0.09±0.06 7.4 7.8 10.5
Skillcraft 3,338 19 1.26±3.14 0.25±0.02 0.25±0.02 0.25±0.00 0.25±0.00 0.25±0.00 0.25±0.00 15.8 15.9 17.1
SML 4,137 26 6.94±0.51 0.27±0.03 0.26±0.04 0.25±0.02 0.25±0.01 0.24±0.01 0.23±0.01 1.1 1.5 1.9
Parkinsons 5,875 20 3.94±1.31 0.00±0.00 0.00±0.00 0.31±0.04 0.31±0.04 0.29±0.04 0.29±0.04 3.2 3.4 6.5
Pumadyn 8,192 32 1.00±0.00 0.21±0.00 0.20±0.00 0.25±0.02 0.25±0.02 0.24±0.02 0.23±0.02 7.5 7.9 9.8
PoleTele 15,000 26 12.6±0.3 5.40±0.3 4.30±0.2 3.42±0.05 3.36±0.04 3.28±0.04 3.11±0.07 8.0 8.3 27.0
Elevators 16,599 18 0.12±0.00 0.090±0.001 0.089±0.002 0.099±0.001 0.097±0.002 0.084±0.002 0.084±0.002 8.9 9.2 11.8
Kin40k 40,000 8 0.34±0.01 0.19±0.02 0.06±0.00 0.11±0.01 0.11±0.01 0.05±0.00 0.03±0.01 19.8 20.7 25.0
Protein 45,730 9 1.64±1.66 0.50±0.02 0.47±0.01 0.49±0.01 0.49±0.01 0.46±0.01 0.43±0.01 143 155 144
KEGG 48,827 22 0.33±0.17 0.12±0.01 0.12±0.01 0.12±0.01 0.12±0.00 0.11±0.00 0.10±0.01 31.3 34.2 61.0
CTslice 53,500 385 7.13±0.11 2.21±0.06 0.59±0.07 0.41±0.06 0.41±0.02 0.36±0.01 0.34±0.02 36.4 44.3 80.4
KEGGU 63,608 27 0.29±0.12 0.12±0.00 0.12±0.00 0.12±0.00 0.12±0.00 0.11±0.00 0.11±0.00 39.5 43.0 41.1
3Droad 434,874 3 12.9±0.1 10.3±0.2 9.90±0.10 7.36±0.07 7.04±0.06 6.91±0.04 6.91±0.04 239 256 292
Song 515,345 90 0.55±0.00 0.46±0.00 0.45±0.00 0.45±0.02 0.45±0.01 0.44±0.00 0.43±0.01 518 539 590
Buzz 583,250 77 0.88±0.01 0.51±0.01 0.51±0.01 0.49±0.00 0.49±0.00 0.48±0.00 0.46±0.01 486 523 770
Electric 2M 11 0.23±0.00 0.053±0.000 0.053±0.000 0.058±0.002 0.054±0.002 0.050±0.002 0.048±0.002 3458 3542 4881
For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015), with extensions in Wilson et al. (2015).

KISS-GP training scales as O(n + h(m)) (where h(m) is typically close to linear in m), versus conventional scalable GP approaches which require O(m²n + m³) computations (Quiñonero-Candela and Rasmussen, 2005) and need m ≪ n for tractability, which results in severe deteriorations in predictive performance. The ability to have large m ≈ n allows KISS-GP to have near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a non-parametric representation, while providing linear scaling in n and O(1) time per test point prediction (Wilson et al., 2015). We empirically demonstrate this scalability and accuracy in our experiments of section 5.

5 Experiments

All experiments were performed on a Linux machine with eight 4.0GHz CPU cores and 32GB RAM. We implemented DNNs based on Caffe (Jia et al., 2014), a general deep learning platform.

For our deep kernel learning model, we first train a deep neural network using SGD with the squared loss objective, and rectified linear activation functions. After the neural network has been pre-trained, a KISS-GP model is fitted using the top-level features of the DNN as inputs. Using this pre-training initialization, our joint deep kernel learning (DKL) model of section 4 is then trained by optimizing all the hyperparameters γ of our deep kernel, backpropagating derivatives through the marginal likelihood of the Gaussian process (see Eq. (7)).
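A minimal PyTorch sketch of this joint training loop is given below. It assumes an exact (non-KISS) marginal likelihood for clarity, so it only scales to small n; the actual implementation uses Caffe-based networks and the KISS-GP machinery of section 4, and the module and parameter names here are illustrative.

```python
# Minimal PyTorch sketch of the joint training step described above: the network
# weights w and base kernel hyperparameters theta are updated together by
# backpropagating through the GP marginal likelihood. An exact (non-KISS)
# marginal likelihood is used for clarity, so this only scales to small n;
# the paper's implementation uses Caffe-based networks and KISS-GP.
import torch
import torch.nn as nn

class DKLRegressor(nn.Module):
    def __init__(self, d_in):
        super().__init__()
        # g(x, w): fully-connected network mapping inputs to low-dimensional features
        self.net = nn.Sequential(nn.Linear(d_in, 100), nn.ReLU(),
                                 nn.Linear(100, 50), nn.ReLU(),
                                 nn.Linear(50, 2))
        # base kernel hyperparameters theta and noise variance, in log space
        self.log_ell = nn.Parameter(torch.tensor(0.0))
        self.log_sigma2 = nn.Parameter(torch.tensor(-2.0))

    def neg_log_marginal_likelihood(self, X, y):
        Z = self.net(X)                                           # deep features g(x, w)
        sq = ((Z * Z).sum(1, keepdim=True) + (Z * Z).sum(1)
              - 2.0 * Z @ Z.t()).clamp_min(0.0)
        K = torch.exp(-0.5 * sq / torch.exp(self.log_ell) ** 2)   # RBF base kernel, Eq. (3)
        K = K + torch.exp(self.log_sigma2) * torch.eye(len(X))
        L = torch.linalg.cholesky(K)
        alpha = torch.cholesky_solve(y.unsqueeze(1), L)
        return 0.5 * y @ alpha.squeeze(1) + torch.log(torch.diagonal(L)).sum()

X, y = torch.randn(500, 10), torch.randn(500)        # placeholder data
model = DKLRegressor(d_in=10)                        # self.net could first be pre-trained
opt = torch.optim.Adam(model.parameters(), lr=1e-3)  # joint update of gamma = {w, theta}
for step in range(100):
    opt.zero_grad()
    loss = model.neg_log_marginal_likelihood(X, y)
    loss.backward()                                  # chain rule of Eq. (7), via autograd
    opt.step()
```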
Figure 2: Left: Randomly sampled training and test examples. Right: The two dimensional outputs of the convolutional
network on a set of test cases. Each point is shown using a line segment that has the same orientation as the input face.
For larger datasets we used a d-1000-1000-500-50-2 architecture¹.

¹ We found [d-1000-1000-500-50] architectures provide a similar level of performance, but scalable Kronecker algebra is most effective if the network maps into D ≤ 5 dimensional spaces.

Table 1 shows that on most of the datasets, our DKL method strongly outperforms not only Gaussian processes with the standard RBF kernel, but also the best-performing kernels selected from a wide range of alternative kernel learning procedures (Yang et al., 2015). We further compared DKL to stand-alone deep neural networks which have the exact same architecture as the DNN component of DKL, and to DNN+GP, which is a GP applied to a pre-trained DNN. We see that DNN+GP outperforms stand-alone DNNs, showing the non-parametric flexibility of kernel methods. By combining KISS-GP with DNNs as part of a joint DKL procedure, we obtain consistently better results than DNN and DNN+GP over all 16 datasets. Moreover, using a spectral mixture base kernel (Eq. (6)) to create a deep kernel provides notable additional performance improvements. By effectively learning the salient features from raw data, plain DNNs generally achieve competitive performance compared to expressive GPs. Combining the complementary advantages of these approaches into scalable deep kernels consistently brings substantial additional performance gains.

We next investigate the runtime of DKL. Table 1, right panel, compares DKL with a stand-alone DNN in terms of runtime for evaluating the objective and derivatives (i.e. one forward and backpropagation pass for the DNN; one computation of the marginal likelihood and all relevant derivatives for DKL). We see that in addition to improving accuracy, combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs: KISS-GP imposes an additional runtime of about 10% over a stand-alone DNN. Overall, these results show the general applicability and practical significance of our scalable DKL approach.

5.2 Face orientation extraction

We now consider the task of predicting the orientation of a face extracted from a gray-scale image patch, explored in Salakhutdinov and Hinton (2008). We investigate our DKL procedure for efficiently learning meaningful representations from high-dimensional, highly-structured image data.

Table 2: RMSE performance on Olivetti and MNIST. For comparison, in the face orientation extraction, we trained DKL on the same amount (12,000) of training instances as with DBN+GP, but used all labels; whereas DBN+GP (as with GP) scaled to only 1,000 labeled images and modeled the remaining data through unsupervised pretraining of a DBN. CNN+GP is a GP applied to the fixed pre-trained CNN. We used an RBF base kernel within GPs.

Datasets  GP     DBN+GP  CNN   CNN+GP  DKL
Olivetti  16.33  6.42    6.34  6.42    6.07
MNIST     1.25   1.03    0.59  0.56    0.53

The Olivetti face data set contains ten 64×64 images of forty different people, for 400 images total. Following Salakhutdinov and Hinton (2008), we constructed datasets of 28×28 images by randomly rotating (uniformly from −90° to +90°), cropping, and subsampling the original 400 images. We then randomly select 30 people uniformly and collect their images as training data, while using the images of the remaining 10 people as test data. Figure 2 shows randomly sampled examples from the training and test data.

For training DKL on the Olivetti face patches we used a convolutional network consisting of 2 convolutional layers followed by 4 fully-connected layers, mapping a face patch to a 2-dimensional feature vector, with an SM base kernel. We describe this convolutional architecture in detail in the supplementary material.

Table 2 shows the RMSE of the predicted face orientations using five models. The DBN+GP model, proposed by Salakhutdinov and Hinton (2008), first extracts features from raw data using a Deep Belief Network (DBN), and then applies a Gaussian process with an RBF kernel. However, their approach could only handle up to a few thousand labelled datapoints, due to the O(n³) complexity of standard Gaussian processes. The remaining data were modeled through unsupervised learning of a DBN, leaving the large amount of available labels unused.

Our proposed deep kernel methods, by contrast, scale linearly with the size of training data, and are capable of directly modeling the full labeled data to accurately recover salient patterns. Figure 2, right panel, shows that the deep kernel discovers features essential for orientation prediction, while filtering out irrelevant factors such as identities and scales.
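For reference, the following sketch reproduces the style of face-patch construction described above, using the copy of the Olivetti faces shipped with scikit-learn. The specific crop and subsampling choices (central 56×56 crop, stride-2 subsampling, one patch per image) are assumptions for illustration and are not taken from the paper, which generates 12,000 labelled patches.

```python
# Sketch of the Olivetti face-patch construction described above, using the
# copy of the data shipped with scikit-learn. The crop and subsampling choices
# (central 56x56 crop, stride-2 subsampling, one patch per image) are assumed
# for illustration; the paper generates 12,000 labelled patches.
import numpy as np
from scipy.ndimage import rotate
from sklearn.datasets import fetch_olivetti_faces

rng = np.random.default_rng(0)
faces = fetch_olivetti_faces()                     # 400 images of 40 people, 64x64

def make_patch(img):
    angle = rng.uniform(-90.0, 90.0)               # orientation label, in degrees
    rotated = rotate(img, angle, reshape=False, mode='nearest')
    crop = rotated[4:60, 4:60]                     # central 56x56 crop
    return crop[::2, ::2], angle                   # subsample to 28x28

patches, angles = zip(*(make_patch(im) for im in faces.images))
X = np.stack(patches).reshape(len(patches), -1)    # flattened 784-dimensional inputs
y = np.array(angles)

# split by identity: 30 randomly chosen people for training, 10 held out for testing
train_people = rng.choice(40, size=30, replace=False)
train_mask = np.isin(faces.target, train_people)
X_train, y_train = X[train_mask], y[train_mask]
X_test, y_test = X[~train_mask], y[~train_mask]
```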
[Figure 3 plots: curves for CNN, DKL-RBF, and DKL-SM; vertical axes show RMSE, runtime (s), and total training time; horizontal axes show #Training Instances (×10⁴).]
Figure 3: Left: RMSE vs. n, the number of training examples. Middle: Runtime vs. n. Right: Total training time vs. n. The dashed line in black indicates a slope of 1. CNNs are used within DKL. We set Q = 4 for the SM kernel.
Figure 5: Left: The induced covariance matrix using the DKL-SM kernel on a set of test cases, where the test samples are ordered by the orientations of the input faces. Middle: The respective covariance matrix using the DKL-RBF kernel. Right: The respective covariance matrix using the regular RBF kernel. The models are trained with n = 12,000, and Q = 4 for the SM base kernel.
Figure 6: Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with an SM base kernel. We set Q = 4 for SM kernels.

... distributions are not readily available, or on problems where RMSE has historically been used as a benchmark. However, an advantage of DKL over stand-alone deep architectures is the ability to naturally produce a posterior predictive distribution, which is especially useful in applications such as reinforcement learning and Bayesian optimisation. In Figure 6, we consider an example where we use DKL to learn the posterior predictive distribution for a step function with many challenging discontinuities. This problem is particularly difficult for conventional GP approaches, due to strong smoothness assumptions intrinsic to popular kernels. GPs with SM kernels improve upon RBF kernels, but neither can properly adapt to the many sharp changes in covariance structure. By contrast, DKL-SM accurately encodes the discontinuities of the function, and has reasonable uncertainty over the whole domain. Further details are in the supplement.

6 Discussion

We have explored scalable deep kernels, which combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods. In particular, we transform the inputs of a base kernel with a deep architecture, and then leverage local kernel interpolation, inducing points, and structure-exploiting algebra (e.g., Kronecker and Toeplitz methods) for a scalable kernel representation. These scalable kernels can then be combined with Gaussian process inference and learning procedures for O(n) training and O(1) testing time. Moreover, we use spectral mixture covariances as a base kernel, which provides a significant additional boost in representational power. Overall, our scalable deep kernels can be used in place of standard kernels, following the same inference and learning procedures, but with benefits in expressive power and efficiency. We show on a wide range of experiments the general applicability and practical significance of our approach, consistently outperforming scalable GPs with expressive kernels, stand-alone DNNs, and GPs applied to the outputs of trained DNNs.

A major challenge in developing expressive kernel learning approaches is the Euclidean and absolute distance based metrics which are pervasive in most families of kernel functions, such as the ARD and Matérn kernels. Indeed, although intuitive in some cases, one cannot expect Euclidean and absolute distance as measures of similarity to be generally applicable, and they are especially problematic in high dimensional input spaces (Aggarwal et al., 2001). Modern approaches attempt to learn a flexible parametric family, for example through weighted combinations of known kernels (e.g., Gönen and Alpaydın, 2011), but are still fundamentally limited to these standard notions of distance.

As we have seen in the Olivetti faces examples, our approach allows for the whole functional form of the metric to be learned in a flexible manner, through expressive transformations of the input space. We expect such metric learning to be particularly valuable in high dimensional classification problems, which we view as a promising direction for future research. We hope that this work will help bring together research on neural networks and kernel methods, to inspire many new models and unifying perspectives which combine the complementary advantages of these approaches.
Acknowledgements: We thank NIH R01GM093156 and NIH R01GM087694, NSF IIS-1218282, and ONR N000141410684 grants for support.

References

Aggarwal, C. C., Hinneburg, A., and Keim, D. A. (2001). On the surprising behavior of distance metrics in high dimensional space. Springer.

Bengio, Y. (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.

Calandra, R., Peters, J., Rasmussen, C. E., and Deisenroth, M. P. (2014). Manifold Gaussian processes for regression. arXiv preprint arXiv:1402.5876.

Damianou, A. and Lawrence, N. (2013). Deep Gaussian processes. In Artificial Intelligence and Statistics.

Gönen, M. and Alpaydın, E. (2011). Multiple kernel learning algorithms. Journal of Machine Learning Research, 12:2211–2268.

Hinton, G. E., Deng, L., Yu, D., Dahl, G. E., Mohamed, A., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82–97.

Hinton, G. E., Osindero, S., and Teh, Y. W. (2006). A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554.

Huang, W., Zhao, D., Sun, F., Liu, H., and Chang, E. (2015). Scalable Gaussian process regression using deep neural networks. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 3576–3582. AAAI Press.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., and Darrell, T. (2014). Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093.

Kiros, R., Salakhutdinov, R., and Zemel, R. (2014). Unifying visual-semantic embeddings with multimodal neural language models. TACL.

Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems.

Le, Q., Sarlos, T., and Smola, A. (2013). Fastfood: computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244–252.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.

Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI).

MacKay, D. J. (1998). Introduction to Gaussian processes. In Bishop, C. M., editor, Neural Networks and Machine Learning, chapter 11, pages 133–165. Springer-Verlag.

Micchelli, C. A., Xu, Y., and Zhang, H. (2006). Universal kernels. The Journal of Machine Learning Research, 7:2651–2667.

Neal, R. (1996). Bayesian Learning for Neural Networks. Springer Verlag.

Quiñonero-Candela, J. and Rasmussen, C. (2005). A unifying view of sparse approximate Gaussian process regression. The Journal of Machine Learning Research, 6:1939–1959.

Rasmussen, C. E. and Ghahramani, Z. (2001). Occam's razor. In Neural Information Processing Systems (NIPS).

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Saatchi, Y. (2011). Scalable Inference for Structured Gaussian Process Models. PhD thesis, University of Cambridge.

Salakhutdinov, R. and Hinton, G. (2008). Using deep belief nets to learn covariance kernels for Gaussian processes. Advances in Neural Information Processing Systems, 20:1249–1256.

Snoek, J., Rippel, O., Swersky, K., Kiros, R., Satish, N., Sundaram, N., Patwary, M., Ali, M., and Adams, R. P. (2015). Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning.

Socher, R., Huang, E., Pennington, J., Ng, A., and Manning, C. (2011). Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In Advances in Neural Information Processing Systems 24, pages 801–809.

Wilson, A. G. (2014). Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge.

Wilson, A. G. and Adams, R. P. (2013). Gaussian process kernels for pattern discovery and extrapolation. International Conference on Machine Learning (ICML).

Wilson, A. G., Dann, C., and Nickisch, H. (2015). Thoughts on massively scalable Gaussian processes. Technical Report, Carnegie Mellon University. http://www.cs.cmu.edu/~andrewgw/msgp.html.

Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. In International Conference on Machine Learning (ICML), Edinburgh. Omnipress.

Wilson, A. G. and Nickisch, H. (2015). Kernel interpolation for scalable structured Gaussian processes (KISS-GP). International Conference on Machine Learning (ICML).

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A. C., Salakhutdinov, R., Zemel, R. S., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. ICML.

Yang, Z., Moczulski, M., Denil, M., de Freitas, N., Smola, A., Song, L., and Wang, Z. (2014). Deep fried convnets. arXiv preprint arXiv:1412.7149.

Yang, Z., Smola, A. J., Song, L., and Wilson, A. G. (2015). A la carte - learning fast kernels. Artificial Intelligence and Statistics.