
Maximizing Overall Diversity for Improved Uncertainty Estimates in Deep Ensembles

Siddhartha Jain,*1 Ge Liu,*1 Jonas Mueller,2 David Gifford1

* These authors contributed equally. 1 CSAIL, MIT. 2 Amazon Web Services.
[email protected], [email protected], [email protected], [email protected]
arXiv:1906.07380v2 [cs.LG] 12 Feb 2020

Abstract

The inaccuracy of neural network models on inputs that do not stem from the distribution underlying the training data is problematic and at times unrecognized. Uncertainty estimates of model predictions are often based on the variation in predictions produced by a diverse ensemble of models applied to the same input. Here we describe Maximize Overall Diversity (MOD), an approach to improve ensemble-based uncertainty estimates by encouraging larger overall diversity in ensemble predictions across all possible inputs. We apply MOD to regression tasks including 38 Protein-DNA binding datasets, 9 UCI datasets, and the IMDB-Wiki image dataset. We also explore variants that utilize adversarial training techniques and data density estimation. For out-of-distribution test examples, MOD significantly improves predictive performance and uncertainty calibration without sacrificing performance on test data drawn from the same distribution as the training data. We also find that in Bayesian optimization tasks, the performance of UCB acquisition is improved via MOD uncertainty estimates.
Introduction

Model ensembling provides a simple, yet extremely effective technique for improving the predictive performance of arbitrary supervised learners each trained with empirical risk minimization (ERM) (Breiman 1996; Brown 2004). Often, ensembles are utilized not only to improve predictions on test examples stemming from the same underlying distribution as the training data, but also to provide estimates of model uncertainty when learners are presented with out-of-distribution (OOD) examples that may look different than the data encountered during training (Lakshminarayanan, Pritzel, and Blundell 2017; Osband et al. 2016). The widespread success of ensembles crucially relies on the variance-reduction produced by aggregating predictions that are statistically prone to different types of individual errors (Kuncheva and Whitaker 2003). Thus, prediction improvements are best realized by using a large ensemble with many base models, and a large ensemble is also typically employed to produce stable distributional estimates of model uncertainty (Breiman 1996; Papadopoulos, Edwards, and Murray 2001).

Practical applications of massive neural networks (NN) are commonly limited to small ensembles because of the unwieldy nature of these models (Osband et al. 2016; Balan et al. 2015; Beluch et al. 2018). Although supervised learning performance may be enhanced by an ensemble comprised of only a few ERM-trained models, the resulting ensemble-based uncertainty estimates can exhibit excessive sampling variability in low-density regions of the underlying training distribution. Consider the example of an ensemble comprised of five models whose predictions just might agree at points far from the training data by chance. Figure 1 depicts an example of this phenomenon, which we refer to as uncertainty collapse, since the resulting ensemble-based uncertainty estimates would indicate these predictions are of high confidence despite not being supported by any nearby training datapoints.

Unreliable uncertainty estimates are highly undesirable in applications where future input queries may not stem from the same distribution. A shift in input distribution can be caused by sampling bias, covariate shift, and the adaptive experimentation that occurs in bandits, Bayesian optimization (BO), and reinforcement learning (RL) contexts. Here, we propose Maximize Overall Diversity (MOD), a technique to stabilize OOD model uncertainty estimates produced by an ensemble of arbitrary neural networks. The core idea is to consider all possible inputs and encourage as much overall diversity in the corresponding model ensemble outputs as can be tolerated without diminishing the ensemble's predictive performance. MOD utilizes an auxiliary loss function and data-augmentation strategy that is easily integrated into any existing training procedure.

Copyright © 2020, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Related Work

NN ensembles have been previously demonstrated to produce useful uncertainty estimates for sequential experimentation applications in Bayesian optimization and reinforcement learning (Papadopoulos, Edwards, and Murray 2001; Lakshminarayanan, Pritzel, and Blundell 2017; Riquelme, Tucker, and Snoek 2018). Proposed methods to improve ensembles include adversarial training to enforce smoothness (Lakshminarayanan, Pritzel, and Blundell 2017), and maximizing ensemble output diversity over the training data (Brown 2004). Recent work has proposed regularizers based on augmented out-of-distribution examples, but is primarily specific to classification tasks and non-trivially requires auxiliary generators of OOD examples (Lee et al. 2018) or existing examples from other classes (Vyas et al. 2018). Another line of related work solely aims at producing better out-of-distribution detectors (Liang, Li, and Srikant 2017; Choi and Jang 2018; Ren et al. 2019).

Our work seeks to improve uncertainty estimates in regression settings, where OOD data can stem from an arbitrary unknown distribution, and robust prediction on OOD data is desired rather than just detection of OOD examples. We propose a simple technique to regularize ensemble behavior over all possible inputs that does not require training of an additional generator. Consideration of all possible inputs has previously been advocated by (Hooker and Rosset 2012), although not in the context of uncertainty estimation. Pearce et al. (2018) propose a regularizer to ensure an ensemble approximates a valid Bayesian posterior, but their methodology is only applicable to homoskedastic noise unlike ours. Hafner et al. (2018) also aim to control Bayesian NN output-behavior beyond the training distribution, but our methods do not require the Bayesian formulation they impose and can be applied to arbitrary NN ensembles, which are one of the most straightforward methods used for quantifying NN uncertainty (Papadopoulos, Edwards, and Murray 2001; Lakshminarayanan, Pritzel, and Blundell 2017; Riquelme, Tucker, and Snoek 2018). Malinin and Gales (2018) focus on incorporating distributional uncertainty into uncertainty estimates via an additional prior distribution, whereas our focus is on improving model uncertainty in model ensembles.

Methods

We consider standard regression, assuming continuous target values are generated via Y = f(X) + ε with ε ∼ N(0, σ²_X), such that σ_X may heteroscedastically depend on feature values X. Given a limited training dataset D = {x_n, y_n}_{n=1}^N, where x_n ∼ P_in specifies the underlying data distribution from which the in-distribution examples in the training data are sampled, our goal is to learn an ensemble of M neural networks that accurately models both the underlying function f(x) as well as the uncertainty in ensemble estimates of f(x). Of particular concern are scenarios where test examples x may stem from a different distribution P_out ≠ P_in, which we refer to as out-of-distribution (OOD) examples. As in (Lakshminarayanan, Pritzel, and Blundell 2017), each network m (with parameters θ_m) in our NN ensemble outputs both an estimated mean µ_m(x) to predict f(x) and an estimated variance σ²_m(x) to predict σ²_x, and the per-network loss function L(θ_m; x_n, y_n) = −log p_{θ_m}(y_n | x_n) is chosen as the negative log-likelihood (NLL) under the Gaussian assumption y_n ∼ N(µ_m(x_n), σ²_m(x_n)). While traditional bagging provides different training data to each ensemble member, we simply train each NN using the entire dataset, since the randomness of separate NN initializations and SGD training suffice to produce comparable performance to bagging of NN models (Lakshminarayanan, Pritzel, and Blundell 2017; Lee et al. 2015; Osband et al. 2016).

Following (Lakshminarayanan, Pritzel, and Blundell 2017), we estimate P_{Y|X=x} (and NLL with respect to the ensemble) by treating the aggregate ensemble output as a single Gaussian distribution N(µ̂(x), σ̂²(x)). Here, the ensemble estimate of f(x) is given by µ̂(x) = mean{µ_m(x)}_{m=1}^M, and the uncertainty in the target value is given by σ̂²(x) = σ²_eps(x) + σ²_mod(x), based on the noise-level estimate σ²_eps(x) = mean{σ²_m(x)}_{m=1}^M and the model-uncertainty estimate σ²_mod(x) = variance{µ_m(x)}_{m=1}^M. While we focus on Gaussian likelihoods for simplicity, our proposed methodology is applicable to general parametric conditional distributions.
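As a concrete illustration of the aggregation just described, the sketch below shows one way the per-network Gaussian NLL and the ensemble quantities µ̂(x), σ²_eps(x), and σ²_mod(x) could be computed in PyTorch (the experiments in this paper use PyTorch); the assumption that each network returns a (mean, variance) pair and the tensor shapes are illustrative, not the authors' exact implementation.

```python
import torch

def gaussian_nll(mu, var, y):
    # Per-network loss L(theta_m; x_n, y_n) = -log p_theta_m(y_n | x_n)
    # under y_n ~ N(mu_m(x_n), sigma_m^2(x_n)); the variance is clamped for stability.
    var = var.clamp(min=1e-6)
    return 0.5 * (torch.log(var) + (y - mu) ** 2 / var).mean()

def ensemble_predict(models, x):
    # Aggregate M networks into a single Gaussian N(mu_hat(x), sigma_hat^2(x)).
    mus, variances = zip(*[m(x) for m in models])   # each model returns (mean, variance)
    mus = torch.stack(mus)                          # (M, batch)
    variances = torch.stack(variances)              # (M, batch)
    mu_hat = mus.mean(dim=0)                        # ensemble mean estimate of f(x)
    sigma2_eps = variances.mean(dim=0)              # noise-level estimate
    sigma2_mod = mus.var(dim=0, unbiased=False)     # model-uncertainty estimate
    return mu_hat, sigma2_eps + sigma2_mod, sigma2_mod
```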
Figure 1: Regression on synthesized data with 95% confidence intervals (CI). The training examples are depicted as black dots and the ground-truth function as a grey dotted line. The predicted conditional mean and CI from individual networks are drawn in colored dashed lines/bands and the overall ensemble conditional mean/CI are depicted via the smooth purple line/band.

Maximizing Overall Diversity (MOD)

Assuming X ∈ 𝒳, Y ∈ [−C, C] have been scaled to bounded regions, MOD encourages higher ensemble diversity by introducing an auxiliary loss that is computed over augmented data sampled from another distribution Q_X. Like P_in, Q_X is also defined over the input feature space, but differs from the underlying training data distribution and instead describes OOD examples that could be encountered at test time. The underlying population objective we target is

    min_{θ_1,...,θ_M}  L_in − γ L_out,   where
    L_in = (1/M) Σ_{m=1}^{M} E_{P_in}[L(θ_m, x, y)],   L_out = E_Q[σ²_mod(x)]     (1)

with L as the original supervised learning loss function (e.g. NLL), and a user-specified penalty γ > 0. Since NLL entails a proper scoring rule (Lakshminarayanan, Pritzel, and Blundell 2017), minimizing the above objective with a sufficiently small value of γ will ensure the ensemble seeks to recover P_{Y|X=x} for inputs x that lie in the support of the training distribution P_in and otherwise outputs large model uncertainty for OOD x that lie outside this support. As it is difficult in most applications to specify how future OOD examples may look, we aim to ensure our ensemble outputs high uncertainty estimates for any possible P_out by taking the entire input space into consideration. To account for any possible OOD distribution, we simply pick Q_X as the uniform distribution over 𝒳, the bounded region of all possible inputs x. This choice is motivated by Theorem 1 below, which states that the uniform distribution most closely approximates all possible OOD distributions in the minimax sense.

Theorem 1. The uniform distribution Q_X equals arg min_{Q ∈ P} max_{P_out ∈ P} KL(P_out || Q), where for discrete 𝒳, P denotes the set of all distributions, and for continuous 𝒳, P is the set of all distributions with density functions that are bounded within some interval [a, b].

Proof. For the discrete case with |𝒳| = N: let P_out, Q have corresponding pmfs p, q, so KL(P_out || Q) = Σ_{x∈𝒳} p(x) log p(x) − Σ_{x∈𝒳} p(x) log q(x). When Q is the uniform distribution, the worst-case P_out is one that puts all its mass on a single point x, which corresponds to KL(P_out || Q) = log N. For any non-uniform Q′: there exists x₀ where q′(x₀) < q(x₀) = 1/N. Thus for P′_out which puts all its mass on x₀, we have KL(P′_out || Q′) > log N. The proof for the continuous case is similar. □

In practice, we approximate L_in using the average loss over the training data as in ERM, and train each θ_m with respect to its contribution to this term independently of the others as in bagging. To approximate L_out, we similarly utilize an empirical average based on augmented examples {x̃_j}_{j=1}^K sampled uniformly throughout the feature space 𝒳. Uniformly sampling from the input space takes constant time to compute. We expect only a marginal increase in terms of training time since the computation of back-propagation is largely parallelized and thus an increase in minibatch size would only cause an increase in memory consumption rather than computation time. The formal MOD procedure is detailed in Algorithm 1. We advocate selecting γ as the largest value for which estimates of L_in (on held-out validation data) do not indicate worse predictive performance. This strategy naturally favors smaller values of γ as the sample size N grows, thus resulting in lower model uncertainty estimates (with γ → 0 as N → ∞ when P_in is supported everywhere and our NN are universal approximators).

We also experiment with an alternative choice of Q_X being the uniform distribution over the finite training data (i.e. q(x) = 1/N ∀x ∈ D and = 0 otherwise). We call this alternative method MOD-in, and note its similarity to the diversity-encouraging penalty proposed by (Brown 2004), which is also measured over the training data. Note that MOD in contrast considers Q_X to be uniformly distributed over all possible test inputs rather than only the training examples. Maximizing diversity solely over the training data may fail to control ensemble behavior at OOD points that do not lie near any training example, and thus fail to prevent uncertainty collapse.

Maximizing Reweighted Diversity (MOD-R)

Aiming for high-quality OOD uncertainty estimates, we are mostly concerned with regularizing the ensemble variance around points located in low-density regions of the training data distribution. To obtain a simple estimate that intuitively reflects the inverse of the local density of P_in at a particular set of feature values, one can compute the feature-distance to the nearest training data points (Papernot and McDaniel 2018). Under this perspective, we want to encourage greater model uncertainty for the lowest-density points that lie furthest from the training data. Commonly used covariance kernels for Gaussian Process regressors (e.g. radial basis functions) explicitly enforce a high amount of uncertainty on points that lie far from the training data. As calculating the distance of each point to the entire training set may be undesirably inefficient for large datasets, we only compute the distance of our augmented data to a current minibatch B during training. Specifically, we use these distances to compute the following:

    w̃_b = [ Σ_{i=1}^{k} ||x̃_b − x_i^b||²₂ ] / [ max_b Σ_{i=1}^{k} ||x̃_b − x_i^b||²₂ ]     (2)

where w̃_b are weights for each of the augmented points x̃_b, and x_i^b (i = 1, ..., k) are members of the minibatch B that are the k nearest neighbors of x̃_b. Throughout this paper, we use k = 5.

The w̃_b are thus inversely related to a crude density estimate of the training distribution P_in evaluated at each augmented sample x̃_b. Rather than optimizing the loss L_out, which uniformly weights each augmented sample (as done in Algorithm 1), we can instead form a weighted loss computed over the minibatch of augmented samples as Σ_{b=1}^{|B|} w̃_b · σ²_mod(x̃_b), which should increase the model uncertainty for augmented inputs proportionally to their distance from the training data. We call this variant of our methodology with augmented input reweighting MOD-R.
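A minimal sketch of how the weights w̃_b from equation (2) could be computed for a minibatch is given below; the torch.cdist-based k-nearest-neighbor search and the tensor shapes are assumptions made for illustration.

```python
import torch

def modr_weights(x_aug, x_batch, k=5):
    # x_aug: (B_aug, d) augmented inputs sampled uniformly from the input space.
    # x_batch: (B, d) current training minibatch.
    # Returns w_tilde of shape (B_aug,): sum of squared distances to the k nearest
    # minibatch points, normalized by the largest such sum (equation (2)).
    dists = torch.cdist(x_aug, x_batch) ** 2            # (B_aug, B) squared L2 distances
    knn_dists, _ = dists.topk(k, dim=1, largest=False)  # k smallest distances per augmented point
    scores = knn_dists.sum(dim=1)                        # sum over the k nearest neighbors
    return scores / scores.max()
```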
Maximizing Overall Diversity with Adversarial Optimization (MOD-Adv)

We also consider another variant of MOD that utilizes adversarial training techniques. Here, we maximize the variance on relatively over-confident points in out-of-distribution regions, which are likely to comprise the worst-case P_out. Specifically, we formulate a maximin optimization for the MOD penalty, max_Θ min_x σ²_mod(x), and thus the full training objective becomes min_{θ_1,...,θ_M} L_in − γ · min_x σ²_mod(x). We call this variant MOD-Adv. In practice, we obtain the augmented points by taking a single gradient step in the direction of lower variance (σ²_mod), starting from uniformly sampled points. The extra gradient step can double the computation time compared to MOD. The full algorithm is given in Algorithm 1. Note that MOD-Adv is different from traditional adversarial training in two aspects: first, it takes a gradient step with regard to the model uncertainty measurement (the variance of the ensemble mean prediction) instead of with regard to the predicted score of another class; second, the adversarial step is taken starting from a uniformly sampled example instead of a training example. We apply MOD-Adv only to regression tasks with continuous features since it is more natural to apply gradient descent on them.

Algorithm 1 MOD Training Procedure (+ Variants)
Input: Training data D = {(x_n, y_n)}_{n=1}^N, penalty γ > 0, batch size |B|
Output: Parameters of ensemble of M neural networks θ_1, ..., θ_M
Initialize θ_1, ..., θ_M randomly; initialize w_b = 1 for b = 1, ..., |B|
repeat
    Sample minibatch from training data: {(x_b, y_b)}_{b=1}^{|B|}
    Sample |B| augmented inputs x̃_1, ..., x̃_{|B|} uniformly at random from 𝒳
    if MOD-Adv then
        x̃_b ← x̃_b − α_adv · ∇_{x̃_b} σ²_mod(x̃_b)   for all 1 ≤ b ≤ |B|
    for m = 1, ..., M do
        if MOD-R then w_b = w̃_b (defined in equation (2))
        Update θ_m via SGD with gradient
            ∇_{θ_m} [ (1/|B|) Σ_{b=1}^{|B|} L(θ_m; (x_b, y_b)) − γ Σ_{b=1}^{|B|} w_b · σ²_mod(x̃_b) ]
until iteration limit reached
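The sketch below translates Algorithm 1 into a single PyTorch-style minibatch update, reusing the gaussian_nll, ensemble_predict, and modr_weights helpers sketched earlier; the bounded-input sampler, the α_adv default, and the optional MOD-R/MOD-Adv branches are assumptions about one possible wiring of the procedure rather than the authors' released code.

```python
import torch

def mod_training_step(models, optimizers, x, y, bounds, gamma,
                      variant="MOD", alpha_adv=0.5, k=5):
    # One minibatch update following Algorithm 1 (MOD, with optional MOD-R / MOD-Adv).
    lo, hi = bounds                                   # (d,) tensors bounding the input space X
    B, d = x.shape
    # Sample |B| augmented inputs uniformly at random from X.
    x_aug = lo + (hi - lo) * torch.rand(B, d)

    if variant == "MOD-Adv":
        # Single gradient step toward LOWER ensemble variance,
        # i.e. toward relatively over-confident out-of-distribution points.
        x_aug.requires_grad_(True)
        var = ensemble_predict(models, x_aug)[2].sum()
        grad, = torch.autograd.grad(var, x_aug)
        x_aug = (x_aug - alpha_adv * grad).detach()

    w = modr_weights(x_aug, x, k) if variant == "MOD-R" else torch.ones(B)

    for model, opt in zip(models, optimizers):
        mu, var = model(x)
        loss_in = gaussian_nll(mu, var, y)               # (1/|B|) sum of per-example NLL
        sigma2_mod = ensemble_predict(models, x_aug)[2]  # ensemble variance on augmented inputs
        loss = loss_in - gamma * (w * sigma2_mod).sum()  # maximize (weighted) overall diversity
        opt.zero_grad()
        loss.backward()
        opt.step()
```

In this sketch γ would be chosen as described above, i.e. the largest value whose held-out estimate of L_in does not indicate worse predictive performance.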
Experiments

Baseline Methods

Here, we evaluate various alternative strategies for improving model ensembles. All strategies are applied to the same base NN ensemble, which is taken to be the Deep Ensembles (DeepEns) model of (Lakshminarayanan, Pritzel, and Blundell 2017) previously described in Methods.

Deep Ensembles with Adversarial Training (DeepEns+AT) (Lakshminarayanan, Pritzel, and Blundell 2017) used this strategy to improve their basic DeepEns model. The idea is to adversarially sample inputs that lie close to the training data but on which the NLL loss is high (assuming they share the same label as their neighboring training example). Then, we include these adversarial points as augmented data when training the ensemble, which smooths the function learned by the ensemble. Starting from training example x, we sample augmented datapoint x′ = x + δ·sign(∇_x L(θ, x, y)), with the label for x′ assumed to be the same as that for the corresponding x. L here denotes the NLL loss function, and the values for hyperparameter δ that we search over include 0.05, 0.1, 0.2.
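A hedged sketch of this adversarial data augmentation (the fast-gradient-sign step x′ = x + δ·sign(∇_x L)) follows; it assumes each network outputs a (mean, variance) pair for the Gaussian NLL.

```python
import torch

def adversarial_augment(model, x, y, delta=0.1):
    # DeepEns+AT: perturb a training input in the direction that increases the
    # Gaussian NLL, and reuse the original label y for the perturbed point x'.
    x_adv = x.clone().detach().requires_grad_(True)
    mu, var = model(x_adv)                       # assumes the network outputs (mean, variance)
    var = var.clamp(min=1e-6)
    nll = 0.5 * (torch.log(var) + (y - mu) ** 2 / var).mean()
    grad = torch.autograd.grad(nll, x_adv)[0]
    x_prime = (x_adv + delta * grad.sign()).detach()
    return x_prime, y                            # included as extra training pairs
```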
Negative Correlation (NegCorr) This method from (Liu and Yao 1999; Shui et al. 2018) minimizes the empirical correlation between predictions of different ensemble members over the training data. It adds a penalty to the loss of the form Σ_m [(µ_m(x) − µ̂(x)) · Σ_{m′≠m} (µ_{m′}(x) − µ̂(x))], where µ_m(x) is the prediction of the mth ensemble member and µ̂(x) is the mean ensemble prediction. This penalty is weighted by a user-specified penalty γ, as done in our methodology.
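For comparison with the MOD penalty, the negative-correlation penalty described above could be computed as in the following sketch, where member predictions are assumed to be stacked along the first dimension.

```python
import torch

def negcorr_penalty(member_means):
    # member_means: (M, batch) predictions mu_m(x) of each ensemble member.
    # Penalty: sum_m (mu_m - mu_bar) * sum_{m' != m} (mu_m' - mu_bar), averaged over the minibatch.
    mu_bar = member_means.mean(dim=0, keepdim=True)   # mean ensemble prediction
    dev = member_means - mu_bar                        # (M, batch) deviations from the mean
    total_dev = dev.sum(dim=0, keepdim=True)           # sum of deviations over all members
    penalty = (dev * (total_dev - dev)).sum(dim=0)     # excludes the m' = m term
    return penalty.mean()                               # weighted by gamma when added to the loss
```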

Experiment Details

All experiments were run on Nvidia TitanX 1080 Ti and Nvidia TitanX 2080 Ti GPUs with PyTorch version 1.0. Unless otherwise indicated, all p-values were computed using a single-tailed paired t-test per dataset, and the p-values are combined using Fisher's method to produce an overall p-value across all datasets in a task. All hyperparameters – including learning rate, ℓ2-regularization, γ for MOD/Negative Correlation, and adversarial training δ – were tuned based on validation set NLL. In every regression task, the search for hyperparameter γ was over the values 0.01, 0.1, 1, 5, 10, 20, 50. For MOD-Adv, we search for δ over 0.2, 1.0, 3.0, 5.0 for UCI and 0.1, 0.5, 1 for the image data.
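As a hedged illustration of this evaluation protocol (not the authors' analysis scripts), per-dataset one-tailed paired t-tests could be combined via Fisher's method with SciPy as follows.

```python
from scipy import stats

def combined_pvalue(method_nll, baseline_nll):
    # method_nll / baseline_nll: lists of per-run NLL arrays, one matched pair per dataset.
    pvals = []
    for a, b in zip(method_nll, baseline_nll):
        t, p_two_sided = stats.ttest_rel(a, b)      # paired t-test per dataset
        p = p_two_sided / 2 if t < 0 else 1 - p_two_sided / 2  # one-tailed: method NLL < baseline NLL
        pvals.append(p)
    # Fisher's method combines the per-dataset p-values into one overall p-value.
    _, p_combined = stats.combine_pvalues(pvals, method="fisher")
    return p_combined
```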
Univariate Regression

We first consider a one-dimensional regression toy dataset that is similar to the one used by (Blundell et al. 2015). We generated training data from the function

    y = 0.3x + 0.3 sin(2πx) + 0.3 sin(4πx) + ε,   with ε ∼ N(0, 0.02).

Here, the training data only contain samples drawn from two limited-size regions. Using the standard NLL loss as well as the auxiliary MOD penalty, we train a deep ensemble of 4 neural networks of identical architecture consisting of 1 hidden layer with 50 units, ReLU activation, two sigmoid outputs to estimate the mean and variance of P_{Y|X=x}, and L2 regularization. To depict the improvement gained by simply adding ensemble members, we also train an ensemble of 120 networks with the same architecture. Figure 1 shows the predictions and 95% confidence intervals of the ensembles. MOD is able to produce more reliable uncertainty estimates on the lefthand regions that lack training data, whereas standard deep ensembles exhibit uncertainty collapse, even with many networks. MOD also properly inflates the predictive uncertainty in the center region where no training data is found. Using a smaller γ = 3 in MOD ensures the ensemble predictive performance remains strong for in-distribution inputs that lie near the training data and the ensemble exhibits adequate levels of certainty around these points. While the larger γ = 10 value leads to overly conservative uncertainty estimates that are large everywhere, we note the mean of the ensemble predictions remains highly accurate for in-distribution inputs.
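A small sketch of how such a toy training set could be generated is shown below; the two sampling intervals are illustrative assumptions standing in for the "two limited-size regions", and the noise term treats 0.02 as the variance of ε.

```python
import numpy as np

def make_toy_data(n_per_region=40, seed=0):
    rng = np.random.default_rng(seed)
    # Two limited-size input regions (illustrative choices).
    x = np.concatenate([rng.uniform(-0.9, -0.4, n_per_region),
                        rng.uniform(0.4, 0.9, n_per_region)])
    noise = rng.normal(0.0, np.sqrt(0.02), x.shape)   # eps ~ N(0, 0.02), 0.02 read as variance
    y = 0.3 * x + 0.3 * np.sin(2 * np.pi * x) + 0.3 * np.sin(4 * np.pi * x) + noise
    return x, y
```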
Protein Binding Microarray Data

We next study scientific data with discrete features by predicting Protein-DNA binding. This is a collection of 38 different microarray datasets, each of which contains measurements of the binding affinity of a single transcription factor (TF) protein against all possible 8-base DNA sequences (Barrera et al. 2016). We consider each dataset as a separate task with Y taken to be the binding affinity scaled to the interval [0, 1] and X the one-hot embedded DNA sequence. As we ignore reverse-complements, there are ∼32,000 possible values of X.

Regression We trained a small ensemble of 4 neural networks with the same architecture as in the previous experiments. We consider 2 different OOD test sets, one comprised of the sequences with the top 10% Y-values and the other comprised of the sequences with more than 80% of the positions in X being G or C (GC-content). For each OOD set, we use the remainder of the sequences as the corresponding in-distribution set. We separate them into an extremely small training set (300 examples) and validation set (300 examples), and use the rest as the in-distribution test set. We compare MOD along with 3 alternative sampling distributions (MOD-in, MOD-R, and MOD-Adv) against the 3 baselines previously mentioned. We search over 0, 1e-3, 0.01, 0.05, 0.1 for the l2 penalty and 0.01 for the learning rate.

Table 1 and Appendix Table 1 show mean OOD and in-distribution performance across 38 TFs (averaged over 10 runs using random data splits and NN initializations). MOD methods have significantly improved performance on all metrics and OOD setups compared to DeepEns/DeepEns+AT, both in terms of the # of TFs outperforming and the overall p-value, and are on par with DeepEns+AT in-distribution. The re-weighting scheme (MOD-R) further improved the performance on the top 10% Y-value OOD setup. Figure 2 shows the calibration curve on two of the TFs where the deep ensembles are over-confident on top 10% Y-value OOD examples. MOD-R and MOD improve the calibration results by a significant margin compared to most of the baselines.

Figure 2: Calibration curves for regression models trained on two of the UCI datasets (top) and two DNA TF binding datasets (bottom). A perfect calibration curve should lie on the diagonal, and an over-confident model has a calibration curve where the model's expected confidence level is higher than the observed confidence level (below the diagonal). MOD-R and MOD significantly improve the over-confident predictions from the deep ensembles trained without the augmentation loss.

Bayesian Optimization Next, we compared how the MOD, MOD-R, and MOD-in ensembles performed against the DeepEns, DeepEns+AT, and NegCorr ensembles in 38 Bayesian optimization tasks using the same protein binding data (Hashimoto, Yadlowsky, and Duchi 2018). For each TF, we performed 30 rounds of DNA-sequence acquisition, acquiring batches of 10 sequences per round in an attempt to maximize binding affinity. We used the upper confidence bound (UCB) as our acquisition function (Chen et al. 2017), ordering the candidate points via µ̂(x) + β · σ_mod(x) (with UCB coefficient β = 1).

At every acquisition iteration, we randomly held out 10% of the training set as the validation set and chose the γ penalty (for MOD, MOD-in, MOD-R, and NegCorr) that produced the best validation NLL (out of the choices: 0, 5, 10, 20, 40, 80). The stopping epoch is chosen based on the validation NLL not increasing for 10 epochs, with an upper limit of 30 epochs. Optimization was done with a learning rate of 0.01, an L2 penalty of 0.01, and the Adam optimizer. For each of the 38 TFs, we performed 20 Bayesian optimization runs with different seed sequences (using the same seeds for all the methods) and using 200 points randomly sampled from the bottom 90% of Y values as our initial training set.

We evaluated on the metric of simple regret, r_T = max_{x∈X} f(x) − max_{t∈[1,T]} f(x_t) (the second term in the subtraction quantifies the best point acquired so far and the first term is the global best). The results are presented in Table 2. MOD outperforms all other methods in both the number of TFs with better regret and the combined p-value. MOD-R is also strong, outperforming all other methods except MOD, with respect to which it is about equivalent in terms of statistical significance. Figure 3 shows r_T for the TFs OVOL2 and HESX1, a task in which MOD and MOD-R outperform the other methods.
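A sketch of the UCB ordering and the simple-regret metric used in these experiments is given below, reusing the ensemble_predict helper sketched in Methods; how candidate sequences are represented is left as an assumption.

```python
import torch

def ucb_scores(models, candidates, beta=1.0):
    # Order candidate points by mu_hat(x) + beta * sigma_mod(x), as in the acquisition step.
    mu_hat, _, sigma2_mod = ensemble_predict(models, candidates)
    return mu_hat + beta * sigma2_mod.sqrt()

def simple_regret(f_best_global, acquired_values):
    # r_T = max_x f(x) - max_{t in [1, T]} f(x_t)
    return f_best_global - max(acquired_values)
```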
Table 1: NLL on OOD/in-distribution test sets averaged across 38 TFs over 10 replicate runs (see Appendix Table 5 for RMSE).
MOD out-performance p-value is the combined p-value of MOD NLL being less than the NLL of the method in the corresponding
row. Bold indicates best in category and bold+italicized indicates second best. In case of a tie in the means, the method with
lower standard deviation is highlighted.
(MOD out-performance) (MOD out-performance)
Methods Out-of-distribution NLL # of TFs p-value In-distribution NLL # of TFs p-value
(OOD as sequences with top 10% binding affinity)
DeepEns 0.7485±0.124 26 1.7e-05 -0.4266±0.031 32 7.7e-09
DeepEns+AT 0.7438±0.122 25 0.001 -0.4312±0.033 26 0.005
NegCorr 0.7358±0.118 27 0.061 -0.4314±0.032 17 0.761
MOD 0.7153±0.117 − − -0.4312±0.031 − −
MOD-R 0.7225±0.116 22 0.359 -0.4325±0.032 16 0.777
MOD-in 0.7326±0.121 26 0.012 -0.4317±0.032 19 0.535
(OOD as sequences with >80% GC content)
DeepEns -0.6938±0.052 20 0.022 -0.5649±0.029 34 3.1e-11
DeepEns+AT -0.7010±0.041 23 0.007 -0.5740±0.027 21 0.292
NegCorr -0.6805±0.065 25 0.011 -0.5700±0.026 25 0.017
MOD -0.7007±0.047 − − -0.5729±0.027 − −
MOD-R -0.6959±0.040 24 0.004 -0.5720±0.027 22 0.357
MOD-in -0.6948±0.054 21 0.103 -0.5711±0.028 22 0.163

Table 2: Regret (r_T) comparison. Each cell shows the number of TFs (out of 38) for which the method in the corresponding row outperforms the method in the corresponding column (lower r_T). The number in parentheses is the combined (across 38 TFs) p-value of the MOD/-in/-R regret being less than the regret of the method in the corresponding column.
vs        DeepEns      DeepEns+AT   NegCorr
MOD-in    21 (0.111)   21 (0.041)   19 (0.356)
MOD       26 (0.003)   24 (0.004)   20 (0.001)
MOD-R     22 (0.019)   23 (0.007)   22 (0.017)
vs        MOD-in       MOD          MOD-R
MOD-in    −            17 (0.791)   16 (0.51)
MOD       19 (0.002)   −            22 (0.173)
MOD-R     20 (0.052)   14 (0.674)   −

Figure 3: Regret for two Bayesian optimization tasks (averaged over 20 replicate runs). The bands depict 50% confidence intervals, and the x-axis indicates the number of DNA sequences whose binding affinity has been profiled so far.

UCI Regression Datasets

We next experimented with 9 real-world datasets with continuous inputs in some applicable bounded domain. We follow the experimental setup that (Lakshminarayanan, Pritzel, and Blundell 2017) and (Hernández-Lobato et al. 2017) used to evaluate deep ensembles and deep Bayesian regressors. We split off all datapoints whose y-values fall in the top 5% as an OOD test set (so datapoints with such large y-values are never encountered during training). We simulate the situation where the training set is limited and thus used 40% of the data for training and 10% for validation. The remaining data is used as an in-distribution test set. The analysis is repeated for 10 random splits of the data to ensure robustness. We again use an ensemble of 4 fully-connected neural networks with the same architecture as above and the NLL training loss, searching over the hyperparameter values: L2 penalty ∈ {0, 0.001, 0.01, 0.05, 0.1}, learning rate ∈ {0.0005, 0.001, 0.01}. We report the negative log-likelihood (NLL) on both in- and out-of-distribution test sets for ensembles trained via different strategies (including MOD-Adv) and examine the calibration curves.

As shown in Table 3, MOD outperforms DeepEns in 6 out of the 9 datasets on OOD NLL, and has a significant overall p-value compared to all baselines. MOD-Adv ranks first in OOD NLL in terms of averaged ranks across all datasets, showing better robustness than MOD. The MOD loss leads to higher-quality uncertainties on OOD data while also improving the in-distribution performance of DeepEns.

Figure 2 shows the calibration curve on two of the datasets where the basic deep ensembles exhibit over-confidence on OOD data. Note that retaining accurate calibration on OOD data is extremely difficult for most machine learning methods. MOD and MOD-R improve calibration by a significant margin compared to most of the baselines, validating the effectiveness of our MOD procedure.
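The calibration curves in Figure 2 compare expected and observed confidence levels; a sketch of how such a curve could be computed for Gaussian predictive distributions follows (the set of confidence levels is an illustrative choice).

```python
import numpy as np
from scipy import stats

def calibration_curve(mu_hat, sigma_hat, y_true, levels=np.linspace(0.1, 0.9, 9)):
    # For each expected confidence level, measure the fraction of targets that fall
    # inside the central predictive interval of that level (the observed confidence).
    observed = []
    for level in levels:
        z = stats.norm.ppf(0.5 + level / 2.0)             # interval half-width in std units
        inside = np.abs(y_true - mu_hat) <= z * sigma_hat
        observed.append(inside.mean())
    return levels, np.array(observed)                      # points below the diagonal indicate over-confidence
```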
Table 3: Averaged NLL on out-of-distribution/in-distribution test examples over 10 replicate runs for the UCI datasets; the top 5% of samples were held out as the OOD test set (see Appendix Table 6 for RMSE). MOD outperformance p-value is the combined (via
Fisher’s method) p-value of MOD NLL being less than the NLL of the method in the corresponding column (with p-value per
dataset being computed using a paired single tailed t-test). Bold indicates best in category and bold+italicized indicates second
best. In case of a tie in the means, the method with the lower standard deviation is highlighted.
Datasets DeepEns DeepEns+AT NegCorr MOD MOD-R MOD-in MOD-Adv
Out-of-distribution NLL
concrete -0.831±0.237 -0.915±0.204 -0.913±0.277 -0.904±0.118 -0.910±0.193 -0.924±0.188 -0.950±0.200
yacht -1.597±0.840 -1.762±0.647 -1.972±0.570 -1.797±0.437 -1.761±0.578 -1.638±0.663 -1.948±0.343
naval-propulsion-plant -2.580±0.103 -1.380±0.087 -2.618±0.056 -2.729±0.071 -2.130±0.069 -2.057±0.055 -2.629±0.068
wine-quality-red 0.133±0.132 0.115±0.086 0.113±0.104 0.153±0.107 0.084±0.072 0.085±0.065 0.217±0.114
power-plant -1.734±0.054 -1.731±0.088 -1.659±0.075 -1.638±0.151 -1.644±0.120 -1.731±0.050 -1.669±0.066
protein-tertiary-structure 1.162±0.231 1.178±0.158 1.231±0.130 1.197±0.137 1.194±0.214 1.154±0.132 1.299±0.252
kin8nm -1.980±0.053 -1.970±0.093 -2.036±0.046 -1.999±0.049 -2.003±0.095 -1.993±0.078 -2.027±0.085
bostonHousing 1.591±0.680 1.243±0.690 1.821±0.913 0.568±0.959 0.460±0.648 0.923±0.733 1.517±0.711
energy -1.590±0.253 -1.784±0.153 -1.718±0.193 -1.736±0.117 -1.741±0.264 -1.733±0.199 -1.772±0.242
MOD outperformance p-value 0.002 4.9e-07 0.034 - 1.6e-04 4.6e-05 0.027
In-distribution NLL
concrete -1.075±0.094 -1.129±0.084 -1.089±0.102 -1.155±0.086 -1.137±0.132 -1.090±0.092 -1.047±0.177
yacht -3.286±0.692 -3.245±0.822 -3.570±0.166 -3.500±0.190 -3.461±0.252 -3.339±0.815 -3.556±0.203
naval-propulsion-plant -2.735±0.077 -1.513±0.042 -2.810±0.042 -2.857±0.067 -2.297±0.061 -2.238±0.047 -2.817±0.046
wine-quality-red -0.070±0.853 -0.341±0.068 -0.266±0.291 -0.337±0.069 -0.348±0.045 -0.351±0.055 -0.170±0.505
power-plant -1.521±0.015 -1.525±0.018 -1.524±0.023 -1.523±0.012 -1.522±0.017 -1.524±0.013 -1.523±0.016
protein-tertiary-structure -0.514±0.013 -0.519±0.007 -0.544±0.012 -0.533±0.009 -0.532±0.012 -0.529±0.008 -0.540±0.012
kin8nm -1.305±0.016 -1.315±0.020 -1.334±0.015 -1.317±0.019 -1.315±0.017 -1.322±0.015 -1.314±0.020
bostonHousing -0.901±0.154 -0.937±0.144 -0.656±0.671 -0.953±0.147 -0.883±0.188 -0.925±0.180 -0.728±0.376
energy -2.426±0.151 -2.517±0.098 -2.620±0.130 -2.507±0.153 -2.525±0.098 -2.522±0.129 -2.638±0.137
MOD outperformance p-value 2.4e-08 6.9e-11 0.116 - 8.9e-06 2.2e-06 0.046

The selection of γ is critical for MOD, so we also examine the effect of the choice of γ on the in-distribution performance for the 9 UCI and 38 TF binding regression tasks. As shown in Figure 4, γ generally does not affect or hurt in-distribution NLL until it gets too large, at which point it fairly consistently starts hurting it. When γ is selected properly it may even improve the in-distribution performance slightly, as shown in the previous tables.

Figure 4: The effect of different γ on in-distribution test performance (NLL).

Age Prediction from Images

To demonstrate the effectiveness of MOD, MOD-Adv, and MOD-R on high-dimensional data, we consider supervised learning with image data. Here, we use a dataset of human images collected from IMDB and Wikimedia and annotated with age and gender information (Rothe, Timofte, and Van Gool 2015). The IMDB/Wiki parts of the dataset consist of 460K+/62K+ images respectively. 28,601 images in the Wiki dataset are of males and the rest are of females.

In the context of the Wiki images, we tried to predict a person's age given their image, using 2,000 images of males as the training set. For the OOD dataset, we hold out the oldest 10% of the people as the OOD set. We used the Wide Residual Network architecture (Zagoruyko and Komodakis 2016) with a depth of 4 and a width factor of 2. As before, we used an ensemble of size 4. The search for the optimal γ value was over 0, 2, 5, 10, 20, 40, 80. The stopping epoch is chosen based on the validation NLL not increasing for 10 epochs, with an upper limit of 30 epochs. Optimization was done with a learning rate of 0.001, an l2 penalty of 0.001, and the Adam optimizer. The NLL results are in Table 4 whereas the RMSE results are in the Appendix. Both MOD and MOD-Adv get the best results on OOD NLL, with the improvement being statistically significant over the other methods. MOD gets an NLL of 1.129 on OOD data, MOD-Adv gets 1.155, and MOD-R gets 1.185; this is in contrast to DeepEns, which gets only 1.304 on OOD. Thus both MOD and MOD-R show significant improvements in NLL on the OOD data. In addition, while DeepEns+AT has a better mean in-distribution NLL compared to MOD, the focus of this paper is out-of-distribution uncertainty, on which MOD and MOD-Adv perform very well. Notably, every MOD variant improves performance both in and out of distribution, so augmenting the loss function with the MOD penalty should not make your model worse.
Table 4: Image regression results showing mean performance across 20 randomly seeded runs (along with ± one standard deviation). In-Dist refers to the in-distribution test set. OOD refers to the out-of-distribution test set. Bold indicates best in category and bold+italicized indicates second best. In case of a tie in means, the method with the lower standard deviation is highlighted.
Methods       OOD NLL            In-Dist NLL
DeepEns       1.3100 ± 0.2486    -0.2193 ± 0.0207
DeepEns+AT    1.2348 ± 0.1291    -0.2419 ± 0.0213
NegCorr       1.1731 ± 0.1978    -0.2286 ± 0.0179
MOD-in        1.2625 ± 0.1961    -0.2301 ± 0.0128
MOD           1.1294 ± 0.1707    -0.2306 ± 0.0148
MOD-R         1.1847 ± 0.2442    -0.2285 ± 0.0191
MOD-Adv       1.1547 ± 0.1865    -0.2305 ± 0.0149

Conclusion

We have introduced a loss function and data augmentation strategy that helps stabilize distributional uncertainty estimates obtained from model ensembling. Our method increases model uncertainty over the entire input space while simultaneously maintaining predictive performance, which helps mitigate the uncertainty collapse that may arise in small model ensembles. We further proposed two variants of our method: MOD-R assesses the distance of an augmented sample from the training distribution and aims to ensure higher model uncertainty in low-density regions, and MOD-Adv uses adversarial optimization to improve model uncertainty on relatively over-confident regions more efficiently. Our methods produce improvements to both in- and out-of-distribution NLL, out-of-distribution RMSE, and calibration on a variety of datasets drawn from biology, vision, and common UCI benchmarks. We also showed MOD is useful in hard Bayesian optimization tasks. Future work could develop techniques to generate OOD augmented samples for structured data, as well as applying ensembles with improved uncertainty-awareness to currently challenging tasks such as exploration in reinforcement learning.

References

[Balan et al. 2015] Balan, A. K.; Rathod, V.; Murphy, K. P.; and Welling, M. 2015. Bayesian dark knowledge. In Advances in Neural Information Processing Systems.
[Barrera et al. 2016] Barrera, L. A.; Vedenko, A.; Kurland, J. V.; Rogers, J. M.; Gisselbrecht, S. S.; Rossin, E. J.; Woodard, J.; Mariani, L.; Kock, K. H.; Inukai, S.; et al. 2016. Survey of variation in human transcription factors reveals prevalent DNA binding changes. Science 351(6280):1450–1454.
[Beluch et al. 2018] Beluch, W. H.; Genewein, T.; Nürnberger, A.; and Köhler, J. M. 2018. The power of ensembles for active learning in image classification. In IEEE Conference on Computer Vision and Pattern Recognition.
[Blundell et al. 2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424.
[Breiman 1996] Breiman, L. 1996. Bagging predictors. Machine Learning 24:123–140.
[Brown 2004] Brown, G. 2004. Diversity in neural network ensembles. Ph.D. Dissertation, University of Birmingham.
[Chen et al. 2017] Chen, R. Y.; Sidor, S.; Abbeel, P.; and Schulman, J. 2017. UCB exploration via Q-ensembles. arXiv:1706.01502.
[Choi and Jang 2018] Choi, H., and Jang, E. 2018. Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392.
[Hafner et al. 2018] Hafner, D.; Tran, D.; Lillicrap, T.; Irpan, A.; and Davidson, J. 2018. Reliable uncertainty estimates in deep neural networks using noise contrastive priors. arXiv:1807.09289.
[Hashimoto, Yadlowsky, and Duchi 2018] Hashimoto, T. B.; Yadlowsky, S.; and Duchi, J. C. 2018. Derivative free optimization via repeated classification. In International Conference on Artificial Intelligence and Statistics.
[Hernández-Lobato et al. 2017] Hernández-Lobato, J. M.; Requeima, J.; Pyzer-Knapp, E. O.; and Aspuru-Guzik, A. 2017. Parallel and distributed thompson sampling for large-scale accelerated exploration of chemical space. arXiv:1706.01825.
[Hooker and Rosset 2012] Hooker, G., and Rosset, S. 2012. Prediction-focused regularization using data-augmented regression. Statistics and Computing 1:237–349.
[Kuncheva and Whitaker 2003] Kuncheva, L. I., and Whitaker, C. J. 2003. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning 51:181–207.
[Lakshminarayanan, Pritzel, and Blundell 2017] Lakshminarayanan, B.; Pritzel, A.; and Blundell, C. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems.
[Lee et al. 2015] Lee, S.; Purushwalkam, S.; Cogswell, M.; Crandall, D.; and Batra, D. 2015. Why M heads are better than one: Training a diverse ensemble of deep networks. arXiv:1511.06314.
[Lee et al. 2018] Lee, K.; Lee, H.; Lee, K.; and Shin, J. 2018. Training confidence-calibrated classifiers for detecting out-of-distribution samples. In International Conference on Learning Representations.
[Liang, Li, and Srikant 2017] Liang, S.; Li, Y.; and Srikant, R. 2017. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690.
[Liu and Yao 1999] Liu, Y., and Yao, X. 1999. Ensemble learning via negative correlation. Neural Networks 12(10):1399–1404.
[Malinin and Gales 2018] Malinin, A., and Gales, M. 2018. Predictive uncertainty estimation via prior networks. In Advances in Neural Information Processing Systems, 7047–7058.
[Osband et al. 2016] Osband, I.; Blundell, C.; Pritzel, A.; and
Van Roy, B. 2016. Deep exploration via bootstrapped DQN.
In Advances in Neural Information Processing Systems.
[Papadopoulos, Edwards, and Murray 2001] Papadopoulos,
G.; Edwards, P. J.; and Murray, A. F. 2001. Confidence esti-
mation methods for neural networks: A practical comparison.
IEEE Transactions on Neural Networks 12:1278–1287.
[Papernot and McDaniel 2018] Papernot, N., and McDaniel,
P. 2018. Deep k-nearest neighbors: Towards confident,
interpretable and robust deep learning. arXiv preprint
arXiv:1803.04765.
[Pearce et al. 2018] Pearce, T.; Zaki, M.; Brintrup, A.; and
Neel, A. 2018. Uncertainty in neural networks: Bayesian
ensembling. arXiv preprint arXiv:1810.05546.
[Ren et al. 2019] Ren, J.; Liu, P. J.; Fertig, E.; Snoek, J.;
Poplin, R.; DePristo, M. A.; Dillon, J. V.; and Lakshmi-
narayanan, B. 2019. Likelihood ratios for out-of-distribution
detection. arXiv preprint arXiv:1906.02845.
[Riquelme, Tucker, and Snoek 2018] Riquelme, C.; Tucker,
G.; and Snoek, J. 2018. Deep bayesian bandits showdown:
An empirical comparison of bayesian deep networks for
thompson sampling. In International Conference on Learn-
ing Representations.
[Rothe, Timofte, and Van Gool 2015] Rothe, R.; Timofte, R.;
and Van Gool, L. 2015. Dex: Deep expectation of appar-
ent age from a single image. In Proceedings of the IEEE
International Conference on Computer Vision Workshops,
10–15.
[Shui et al. 2018] Shui, C.; Mozafari, A. S.; Marek, J.; Hedhli,
I.; and Gagne, C. 2018. Diversity regularization in deep
ensembles. arXiv:1802.07881.
[Vyas et al. 2018] Vyas, A.; Jammalamadaka, N.; Zhu, X.;
Das, D.; Kaul, B.; and Willke, T. L. 2018. Out-of-distribution
detection using an ensemble of self supervised leave-out
classifiers. In Proceedings of the European Conference on
Computer Vision (ECCV), 550–564.
[Zagoruyko and Komodakis 2016] Zagoruyko, S., and Ko-
modakis, N. 2016. Wide residual networks. arXiv preprint
arXiv:1605.07146.
Appendix

Table 5: RMSE on OOD/in-distribution test set averaged across 38 TFs over 10 replicate runs. MOD out-performance p-value is
the combined p-value of MOD RMSE being less than the RMSE of the method in the corresponding row. Bold indicates best in
category and bold+italicized indicates second best. In case of a tie in the means, the method with lower standard deviation is
highlighted.
(MOD out-performance) (MOD out-performance)
Methods Out-of-distribution RMSE # of TFs p-value In-distribution RMSE # of TFs p-value
(OOD as sequences with top 10% binding affinity)
DeepEns 0.2837±0.011 26 1.3e-04 0.1591±0.005 34 2.1e-12
DeepEns+AT 0.2812±0.011 23 0.087 0.1582±0.005 28 3.6e-05
NegCorr 0.2814±0.010 20 0.124 0.1583±0.005 20 0.492
MOD 0.2802±0.010 0 0.0e+00 0.1581±0.005 0 0.0e+00
MOD-R 0.2795±0.010 17 0.933 0.1579±0.005 13 0.969
MOD-in 0.2801±0.010 16 0.617 0.1581±0.005 18 0.633
(OOD as sequences with >80% GC content)
DeepEns 0.1190±0.004 25 0.008 0.1415±0.003 36 8.5e-24
DeepEns+AT 0.1180±0.003 19 0.029 0.1394±0.003 22 0.106
NegCorr 0.1179±0.004 21 0.052 0.1403±0.003 30 4.7e-08
MOD 0.1173±0.003 0 0.0e+00 0.1394±0.003 0 0.0e+00
MOD-R 0.1177±0.003 22 0.112 0.1398±0.003 19 0.079
MOD-in 0.1177±0.004 21 0.401 0.1403±0.003 27 2.8e-06
Table 6: RMSE on out-of-distribution/in-distribution test examples for the UCI datasets (over 10 replicate runs). Examples with
Y -values in the top 5% were held out as the OOD test set.
Datasets DeepEns DeepEns+AT NegCorr MOD MOD-R MOD-in MOD-Adv
Out-of-distribution RMSE
concrete 0.105±0.016 0.102±0.011 0.096±0.012 0.105±0.014 0.113±0.028 0.099±0.011 0.095±0.012
yacht 0.038±0.041 0.039±0.044 0.018±0.005 0.026±0.005 0.026±0.007 0.039±0.045 0.021±0.005
naval-propulsion-plant 0.111±0.008 0.126±0.015 0.112±0.011 0.109±0.008 0.125±0.008 0.129±0.007 0.122±0.009
wine-quality-red 0.239±0.013 0.236±0.009 0.234±0.008 0.239±0.011 0.234±0.010 0.235±0.008 0.242±0.014
power-plant 0.041±0.002 0.042±0.004 0.044±0.004 0.048±0.012 0.047±0.010 0.041±0.003 0.043±0.003
protein-tertiary-structure 0.301±0.009 0.301±0.003 0.294±0.003 0.299±0.008 0.300±0.011 0.300±0.007 0.300±0.004
kin8nm 0.042±0.005 0.040±0.005 0.037±0.003 0.039±0.003 0.041±0.005 0.039±0.004 0.039±0.004
bostonHousing 0.221±0.023 0.213±0.021 0.210±0.016 0.209±0.017 0.200±0.014 0.212±0.016 0.217±0.022
energy 0.060±0.014 0.054±0.012 0.053±0.009 0.053±0.005 0.054±0.013 0.054±0.009 0.047±0.009
MOD outperformance p-value 0.028 0.102 0.989 0.0e+00 0.090 0.017 0.124
In-distribution RMSE
concrete 0.086±0.004 0.085±0.004 0.084±0.004 0.083±0.003 0.083±0.004 0.084±0.004 0.084±0.004
yacht 0.017±0.019 0.019±0.025 0.010±0.002 0.012±0.003 0.012±0.004 0.018±0.023 0.010±0.002
naval-propulsion-plant 0.080±0.003 0.089±0.003 0.079±0.004 0.077±0.003 0.088±0.003 0.091±0.004 0.085±0.004
wine-quality-red 0.170±0.005 0.170±0.005 0.169±0.004 0.170±0.003 0.169±0.004 0.169±0.005 0.170±0.004
power-plant 0.053±0.001 0.053±0.001 0.052±0.001 0.053±0.001 0.053±0.001 0.053±0.001 0.052±0.001
protein-tertiary-structure 0.164±0.002 0.164±0.001 0.161±0.001 0.163±0.000 0.163±0.001 0.163±0.001 0.164±0.001
kin8nm 0.074±0.002 0.072±0.003 0.071±0.002 0.072±0.002 0.073±0.001 0.072±0.001 0.073±0.002
bostonHousing 0.085±0.008 0.084±0.008 0.083±0.007 0.084±0.009 0.084±0.008 0.084±0.009 0.085±0.009
energy 0.042±0.004 0.039±0.003 0.036±0.004 0.039±0.004 0.039±0.003 0.039±0.003 0.037±0.004
MOD outperformance p-value 4.0e-06 5.9e-07 0.943 0.0e+00 8.9e-04 0.002 0.002

Table 7: Image regression results showing mean performance across 20 randomly seeded runs (along with ± one standard
deviation). In-Dist refers to the in-distribution test set. OOD refers to the out of distribution test set. Bold indicates best in
category and bold+italicized indicates second best. In case of a tie in means, the lower standard deviation method is highlighted.
Methods OOD RMSE In-Dist RMSE
DeepEns 0.388 ± 0.021 0.196 ± 0.003
DeepEns+AT 0.378 ±0.016 0.192±0.004
NegCorr 0.376± 0.015 0.194± 0.002
MOD-in 0.382±0.016 0.194±0.002
MOD 0.375±0.014 0.194±0.002
MOD-R 0.377±0.019 0.193±0.002
MOD-Adv 0.374±0.016 0.193±0.002
