
Batch Normalization Embeddings for Deep Domain Generalization

Mattia Segu, Alessio Tonioni & Federico Tombari


Google
{msegu,alessiot,tombari}@google.com
arXiv:2011.12672v3 [cs.LG] 18 May 2021

Abstract

Domain generalization aims at training machine learning models to perform robustly across different and unseen domains. Several recent methods use multiple datasets to train models to extract domain-invariant features, hoping to generalize to unseen domains. Instead, first we explicitly train domain-dependent representations by using ad-hoc batch normalization layers to collect independent domain statistics. Then, we propose to use these statistics to map domains in a shared latent space, where membership to a domain can be measured by means of a distance function. At test time, we project samples from an unknown domain into the same space and infer properties of their domain as a linear combination of the known ones. We apply the same mapping strategy at training and test time, learning both a latent representation and a powerful but lightweight ensemble model. We show a significant increase in classification accuracy over current state-of-the-art techniques on popular domain generalization benchmarks: PACS, Office-31 and Office-Caltech.

Figure 1: Visualization of our method on the PACS dataset when the domains Art Painting, Photo and Cartoon are available at training time. We propose to use batch normalization layers to implicitly learn a domain space onto which we map both known (training) and unknown (testing) domains. At test time, we project each target sample independently into the domain space and locate it with respect to the known domains using the corresponding distances D_{a,t}, D_{p,t} and D_{c,t}. Properties of the unknown domain are revealed by the location of the unseen sample. We leverage these hints to improve the classification of each test sample by means of a linear combination of domain-specific classifiers, weighted by the inverse of the distances.

1. Introduction

Machine learning models trained on a certain data distribution often fail to generalize to samples from different distributions. This phenomenon is commonly referred to in the literature as domain shift between training and testing data [48, 34], and is one of the biggest limitations of data-driven algorithms. Assuming the availability of a few annotated samples from the test domain, the problem can be mitigated by fine-tuning the model with explicit supervision [53] or with domain adaptation techniques [51]. Unfortunately, this assumption does not always hold in practice, as it is often unfeasible in real scenarios to collect samples for every possible environment.

Domain generalization refers to algorithms that address the domain shift problem by learning models robust to unseen domains. Several works leverage different domains at training time to learn a domain-invariant feature extractor [42, 14, 23, 41, 28]. Other works focus on optimizing the model parameters to obtain consistent performance across domains via ad-hoc training policies [49, 46, 50, 27], while a different line of work requires modifications to the model architecture to achieve domain invariance [21, 25, 10, 36]. While these methods try to extract domain-invariant features, we go in the opposite direction and explicitly leverage domain-specific representations by collecting domain-dependent batch normalization (BN) statistics for each of the domains available at training time. By doing so, we train a lightweight ensemble of domain-specific models sharing all parameters except for the BN statistics. Peculiarly to our proposal, we use the accumulated statistics to map each domain as a point in a latent space of domains.
We will refer to this mapping as the Batch Normalization Embedding (BNE) of a domain. Fig. 1 sketches a visualization of such a space for the case of three domains available at training time (e.g. Photo, Art Painting and Cartoon). At convergence, each training domain is mapped to a single point in the domain space. Then, at test time, unseen samples from unknown domains can be mapped to the same space by means of their instance normalization statistics. By measuring the distances between the instance normalization statistics of the test sample (black dot) and the accumulated population statistics of each domain (colored dots), we can infer properties of the unknown test domain. Specifically, we leverage the reciprocal of such distances at test time to weigh the domain-specific predictions of our lightweight ensemble and accurately classify an unseen sample from an unknown domain. The same combination of domain-specific models can be used at training time on samples from the known domains to force the ensemble to learn a meaningful latent space and logits that can be linearly combined according to the proposed weighting strategy.

To sum up the contributions of our work: (i) we propose to use domain-specific batch normalization statistics accumulated on convolutional layers to map image samples into a latent space where membership to a domain can be measured according to the distance from the domain BNEs; (ii) we propose to use this concept to learn a lightweight ensemble model that shares all parameters except the normalization statistics and can generalize better to unseen domains; (iii) compared to previous work, we do not discard domain-specific attributes but exploit them to learn a domain latent space and map unknown domains with respect to known ones; (iv) our method can be applied to any modern Convolutional Neural Network (CNN) that relies on batch normalization layers, and scales gracefully with the number of domains available at training time.

2. Related Work

Domain Generalization. Most domain generalization works attempt to expose the model to domain shift at training time to generalize to unseen domains. Invariance can be encouraged at multiple levels:

Feature-level denotes methods deriving domain-invariant features by minimizing a discrepancy between multiple training domains. Ghifary et al. [14] brought domain generalization to the attention of the deep learning community by training multi-task autoencoders to transform images from one source domain into different ones, thereby learning invariant features. Analogously, Li et al. [28] extended adversarial autoencoders by minimizing the Maximum Mean Discrepancy measure to align the distributions of the source domains to an arbitrary prior distribution via adversarial feature learning. Conditional Invariant Adversarial Networks [30] have been proposed to learn domain-invariant representations, whereas Domain Separation Networks [4] extract image representations partitioned into two sub-spaces: one unique to each domain and one shared. Differently, Motiian et al. [41] propose to learn a discriminative embedding subspace via a Siamese architecture [23]. Episodic training [27] was proposed to train a generic model while exposing it to domain shift: in each episode, a feature extractor is trained with a badly tuned classifier (or vice versa) to obtain robust features. Recently, [40] proposed a method to simultaneously discover latent domains by clustering features together and minimizing the feature discrepancy between them. For all these methods, the limited variety of domains to which the model can be exposed at training time can limit the magnitude of the shift to which the model learns invariance.

Data-level denotes methods attempting to reduce the training-set domain bias by augmenting the cardinality and variety of the samples. Data augmentation methods based on domain-guided perturbations of input samples [46] or on adversarial examples [50] have been proposed with the purpose of training a model to be robust to distribution shift. Domain randomization was adopted [49, 33] to solve the analogous problem of transferring a model from synthetic to real data by extending synthetic data with random renderings. By performing data augmentation, these methods force the feature extractor to learn domain-invariant features, while we argue that discarding domain-specific information might be detrimental to performance.

Model-based denotes methods relying on ad-hoc architectures to tackle the domain generalization problem. [25] introduced a low-rank parameterized CNN model, a dynamically parameterized neural network that generalizes the shallow binary undo-bias method [21]. Similarly, a structured low-rank constraint is exploited to align multiple domain-specific networks and a domain-invariant one in [10]. Mancini et al. [36] train multiple domain-specific classifiers and estimate the probabilities that a target sample belongs to each source domain to fuse the classifiers' predictions. A recent work [6] proposes an alternative approach to tackle domain generalization by teaching a model to simultaneously solve jigsaw puzzles and perform well on a task of interest. Most of these methods require changes to state-of-the-art architectures, resulting in an increased number of parameters or complexity of the network.

Meta-learning denotes methods relying on special training policies to train models robust to domain shift. [26] extend to domain generalization the widely used model-agnostic meta-learning framework [13]. [2] propose a novel regularization function in a meta-learning framework to make a model trained on one domain perform well on another domain. [19] propose a training heuristic that iteratively discards the dominant features activated on the training data, challenging the model to learn more robust representations.
A gradient-based meta-train procedure was introduced by [11] to expose the optimization to domain shift while regularizing the semantic structure of the feature space. These methods simulate unseen domains by splitting the training data into a meta-training set and a meta-test set, and are therefore inherently bounded by the variety of the samples available at training time.

Batch Normalization for distribution alignment. The use of separate batch normalization statistics to align a training distribution to a test one was first introduced for domain adaptation [7, 8, 31]. The same domain-dependent batchnorm layer has been adapted to the multi-domain scenario [39, 37] and exploited in a graph-based method [38] that leverages domain meta-data to better align unknown domains to the known ones. All these works, however, require some representation of the target domain to perform the alignment during training, using either samples or meta-data describing the target domain. Our approach, instead, does not rely on any external source of information regarding the target domain. Domain-specific normalization layers have only recently been proposed for domain generalization in [45], where a cluster of networks is trained to learn an optimal mixture of instance and batch normalization.

3. Method

The core idea of our method is to exploit domain-specific batch normalization statistics to map known and unknown domains into a shared latent space, where the domain membership of samples can be measured according to their distance from the domain embeddings of the known domains.

3.1. Problem Formulation

Let X and Y denote the input (e.g. images) and the output (e.g. object categories) spaces of a model. Let D = {d_i}_{i=1}^K denote the set of the K source domains available at training time. Each domain d_i can be described by an unknown conditional probability distribution p^y_{x,d_i} = p(y|x, d_i) over the space X x Y. The aim of a machine learning model is to learn the probability distribution p^y_x = p(y|x) of the training set [5] by training models to learn a mapping X -> Y. We propose to use a lightweight ensemble of models to learn a mapping (X, D) -> Y that leverages the domain label to model a set of conditional distributions {p^y_{x,d_i}}_{i=1}^K, each conditioned on the domain membership. Let t be a generic target domain available only at testing time and following the unknown probability distribution p^y_{x,t} over the same space. Since it is not possible to learn the target distribution p^y_{x,t} during training, the goal of our method is to accurately estimate it as a mixture (i.e. a linear combination) of the learned source distributions p^y_{x,d_i}.

For each source domain d in D, a training set S_d = {(x_d^1, y_d^1), ..., (x_d^{n_d}, y_d^{n_d})} containing n_d labelled samples is provided. The test set T = {x_t^1, ..., x_t^{m_t}} is composed of m_t unlabelled samples collected from the unknown marginal distribution p^x_t of the target domain t. As opposed to the domain adaptation setting, we assume that target samples are not available at training time, and that each of them might belong to a different unseen domain.

3.2. Multi-Source Domain Alignment Layer

Neural networks are particularly prone to capturing dataset bias in their internal representations [32], making internal feature distributions highly domain-dependent. To capture and alleviate the distribution shift that is inherent in the multi-source setting, we draw inspiration from [8, 39, 37, 45] and adopt batch normalization layers [20] to normalize the activations of each domain to the same reference distribution via domain-specific normalization statistics.

At inference time, the activations of a certain domain d are normalized by matching their first and second order moments, namely (mu_d, sigma_d^2), to those of a reference Gaussian with zero mean and unitary variance:

    \mathrm{BN}(z; d) = \frac{z - \mu_d}{\sqrt{\sigma_d^2 + \epsilon}},    (1)

where z is an input activation extracted from the marginal distribution q_d^z of the activations of domain d; mu_d = E_{z ~ q_d^z}[z] and sigma_d^2 = Var_{z ~ q_d^z}[z] are the population statistics for the domain d, and epsilon > 0 is a small constant to avoid numerical instability. At training time, the layer collects and applies domain-specific batch statistics (mu~_d, sigma~_d^2), while updating the corresponding moving averages to approximate the domain population statistics.

At inference time, if the domain label d of a test sample is unknown or does not belong to D, we can still rely on normalization by instance statistics, i.e. the degenerate case of batch statistics with batch size equal to 1. Fig. 2 (a) depicts the functioning of a multi-source domain alignment layer. Our method builds on the observation that, for convolutional layers, instance statistics and batch statistics are approximations of the same underlying distribution with different degrees of noise. Since the population statistics are a temporal integration of the batch statistics, the validity of this statement extends to the comparison with them. For example, the statistics for a single channel in the case of a batch normalization layer applied on a 2D feature map of size H x W, for a generic batch size B (batch statistics) and for B = 1 (instance statistics), are computed as:

    \tilde{\mu} = \frac{1}{B \cdot H \cdot W} \sum_{b,h,w} z_{b,h,w} \;\overset{(B=1)}{=}\; \frac{1}{H \cdot W} \sum_{h,w} z_{h,w},    (2)

    \tilde{\sigma}^2 = \frac{1}{B \cdot H \cdot W} \sum_{b,h,w} (z_{b,h,w} - \tilde{\mu})^2 \;\overset{(B=1)}{=}\; \frac{1}{H \cdot W} \sum_{h,w} (z_{h,w} - \tilde{\mu})^2,    (3)

where mu~ and sigma~^2 are respectively the batch mean and variance and z_{b,h,w} is the value of a single element of the feature map. If we consider z_{b,h,w} to be described by a normally distributed random variable Z ~ N(mu, sigma^2), then the instance and batch statistics are estimates of the parameters of the same Gaussian computed over a different number of samples, H*W and B*H*W respectively. In the next section, we explain how we exploit this property to map source domains and unseen samples from unknown domains into the same latent space.
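To make the behaviour of the multi-source domain alignment layer concrete, the following minimal NumPy sketch is our own illustration, not the authors' released code; the class name MultiSourceBN, the momentum value exposed as a parameter, and the shared affine parameters gamma/beta (standard in batch normalization but not mentioned in Eq. 1) are assumptions. The layer keeps one pair of running population statistics per training domain, updates them with the domain batch statistics of Eqs. 2-3 during training, normalizes with the statistics of the declared domain as in Eq. 1, and falls back to instance statistics when the domain is unknown.

    import numpy as np

    class MultiSourceBN:
        """Sketch of a multi-source domain alignment layer (Eqs. 1-3).

        One running (mean, variance) pair is kept per known domain; the affine
        parameters are shared across domains, like all other network weights
        in the lightweight ensemble.
        """

        def __init__(self, num_channels, domains, eps=1e-5, momentum=0.01):
            self.eps = eps
            self.momentum = momentum
            self.gamma = np.ones(num_channels)   # shared affine scale (assumption)
            self.beta = np.zeros(num_channels)   # shared affine shift (assumption)
            # Domain-specific population statistics, i.e. the per-layer pieces of e_d.
            self.mu = {d: np.zeros(num_channels) for d in domains}
            self.var = {d: np.ones(num_channels) for d in domains}

        def __call__(self, x, domain=None, training=False):
            # x has shape (B, H, W, C); statistics are computed per channel.
            if domain is None:
                # Unknown domain: degenerate case B = 1, i.e. instance statistics.
                mu = x.mean(axis=(1, 2), keepdims=True)
                var = x.var(axis=(1, 2), keepdims=True)
            elif training:
                # Batch statistics of the current domain batch (Eqs. 2-3) ...
                mu = x.mean(axis=(0, 1, 2))
                var = x.var(axis=(0, 1, 2))
                # ... used to update the moving averages of that domain.
                self.mu[domain] = (1 - self.momentum) * self.mu[domain] + self.momentum * mu
                self.var[domain] = (1 - self.momentum) * self.var[domain] + self.momentum * var
            else:
                # Known domain at inference: use its population statistics (Eq. 1).
                mu, var = self.mu[domain], self.var[domain]
            z = (x - mu) / np.sqrt(var + self.eps)
            return self.gamma * z + self.beta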
(a) Multi-Source Domain Alignment Layer. (b) t-SNE plot of instance and population statistics for seen and unseen domains.

Figure 2: Our method on PACS [25] with Sketch as the unknown domain. A Multi-Source Domain Alignment Layer (a) collects domain-specific population statistics and computes instance statistics for test samples. After training, the population and instance statistics map respectively the source domains and the test samples into a latent space, where domain similarity can be measured by distances between embedding vectors. In (b), we visualize the learned domain space L^1 by means of a t-SNE plot of instance normalization and population statistics for a model trained with our method. Each test sample from the unseen domain Sketch can be localized through its instance statistics (cyan dots) with respect to the known domains, embedded by the population statistics (green dots). Considering a test sample embedding, e.g. r_t, the estimated distances (orange arrows) will be used to weigh the predictions of the domain-specific classifiers.

3.3. Domain Localization in the Batchnorm Latent Space

The domain alignment layers described in Sec. 3.2 allow us to learn the multiple source distributions {p^y_{x,d}}_{d in D} distinctly. By leveraging them, we can learn a lightweight ensemble of domain-specific models, where every network shares all the weights except for the normalization statistics. Since such a lightweight ensemble nicely embodies the multiple source distributions, we propose to reduce the domain shift on the unknown target domain by interpolating across these distributions to estimate the unknown distribution p^y_{x,t}. The resulting target distribution is a weighted mixture of the distributions in the ensemble, where the choice of the weights depends on the similarity of a test sample to each source domain.

We denote with l in B = {1, 2, ..., L} in superscript notation the different batch normalization layers in the model. For each of them we can define a latent space L^l spanned by the activation statistics at the l-th layer of the model. In this space, we observe that single samples x are mapped via their instance statistics (mu~, sigma~^2), whereas the population statistics accumulated for each domain (mu_d, sigma_d^2) are used to represent domain centroids. Fig. 2-(b) shows a visualization of the latent space L^1 for the PACS dataset [25], composed of 4 domains, 3 of which (e.g. Art Painting, Cartoon, Photo) are assumed available at training time. Population (big green dots) and instance (small dots) statistics are respectively used to project domains and individual samples. We rely on t-SNE [35] to visualize the instance and population statistics of source and target samples, and we observe how the latent space that we propose allows a spontaneous and stark division between domain clusters. Considering all the latent spaces at different layers, we define a batch normalization embedding (BNE) for a certain domain d as the stacking of the population statistics computed at every layer:
    e_d = [e_d^1, e_d^2, ..., e_d^L] = [(\mu_d^1, (\sigma_d^1)^2), (\mu_d^2, (\sigma_d^2)^2), ..., (\mu_d^L, (\sigma_d^L)^2)].    (4)

For a target sample x_t from an unknown domain t, we can derive a projection to the same space by forward propagating it through the network and computing its instance statistics. The latent embedding r_t of x_t is defined as the stacked vector of its instance statistics at the different batch normalization layers in the network:

    r_t = [r_t^1, r_t^2, ..., r_t^L] = [(\mu_t^1, (\sigma_t^1)^2), (\mu_t^2, (\sigma_t^2)^2), ..., (\mu_t^L, (\sigma_t^L)^2)].    (5)

Each r_t^l represents the instance statistics collected at a certain layer l during forward propagation and can be used to map the sample x_t into the latent space L^l of layer l. Once the BNE for the test sample is available, it is possible to measure the similarity of the target sample x_t to one of the known domains d as the inverse of the distance between r_t and e_d. By extension, this allows a soft 1-Nearest-Neighbour domain classification of any test sample.

To compute a distance between two points in L^l, we consider the means and variances of the corresponding batch normalization layer as the parameters of a multivariate Gaussian distribution. We can hence adopt a distance on the space of probability measures, i.e. a symmetric and positive definite function that satisfies the triangle inequality. We select the Wasserstein distance for the special case of two multivariate Gaussian distributions, but we report a comparison to alternative distances in the supplementary material. Let p ~ N(mu_p, C_p) and q ~ N(mu_q, C_q) be two normal distributions on R^n, with expected values mu_p, mu_q in R^n and covariance matrices C_p, C_q in R^{n x n}. Denoting with ||.||_2 the Euclidean norm on R^n, the 2-Wasserstein distance is:

    W(p, q) = \phi((\mu_p, C_p), (\mu_q, C_q)) = \|\mu_p - \mu_q\|_2^2 + \mathrm{Tr}\big(C_p + C_q - 2 (C_q^{1/2} C_p C_q^{1/2})^{1/2}\big),    (6)

with Tr being the trace of the matrix. We rely on Eq. 6 to measure the distance between a test sample x_t and the domain d by summing over the batch normalization layers l in B the distances between the activation embeddings r_t^l and e_d^l:

    D_L(e_d, r_t) = \sum_{l \in B} W(e_d^l, r_t^l) = \sum_{l \in B} \phi\big((\mu_d^l, \mathrm{Diag}((\sigma_d^l)^2)), (\mu_{x_t}^l, \mathrm{Diag}((\sigma_{x_t}^l)^2))\big).    (7)

Eq. 2 shows that instance and batch statistics differ only in the number of samples over which they are estimated, making the comparison meaningful. The similarity of a test sample x_t to the domain d is defined as the reciprocal of the distance from that domain and is denoted as w_d^t.
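As a concrete reading of Eqs. 4-7, the sketch below (ours; the helper names are hypothetical) treats each per-layer statistics pair as a diagonal Gaussian and sums the distances of Eq. 6 across layers. With diagonal covariances the trace term of Eq. 6 reduces to the squared Euclidean distance between the standard deviation vectors, which is what the first function computes.

    import numpy as np

    def gaussian_w2(mu_p, var_p, mu_q, var_q):
        """Distance of Eq. 6 between two diagonal Gaussians.

        For diagonal covariances the trace term collapses to the squared
        Euclidean distance between the standard deviation vectors.
        """
        return np.sum((mu_p - mu_q) ** 2) + np.sum((np.sqrt(var_p) - np.sqrt(var_q)) ** 2)

    def bne_distance(e_d, r_t):
        """Distance D_L(e_d, r_t) of Eq. 7.

        Both embeddings are lists with one (mean, variance) pair per batch
        normalization layer, stacked as in Eqs. 4-5.
        """
        return sum(gaussian_w2(mu_d, var_d, mu_t, var_t)
                   for (mu_d, var_d), (mu_t, var_t) in zip(e_d, r_t))

    def domain_similarities(embeddings, r_t):
        """Reciprocal-distance similarities w_d^t for every known domain d."""
        return {d: 1.0 / (bne_distance(e_d, r_t) + 1e-12) for d, e_d in embeddings.items()}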
Once the similarity to each source domain is computed, we can use it to recover the unknown target distribution p^y_{x,t} as a mixture (i.e. a linear combination) of the learned source distributions p^y_{x,d}, weighted by the corresponding domain similarity:

    p_{x,t}^y = \frac{\sum_{d \in D} w_d^t \, p_{x,d}^y}{\sum_{d \in D} w_d^t}.    (8)

We denote with f(.) the result of a forward pass in a neural network. We obtain the final prediction of our lightweight ensemble model f(x_t) as a linear combination of the domain-dependent models f(x_t | d):

    f(x_t) = \frac{\sum_{d \in D} w_d^t \, f(x_t | d)}{\sum_{d \in D} w_d^t}.    (9)

Fig. 2-(b) shows a visualization of our localization technique superimposed over the t-SNE plot of instance and population statistics. By measuring the distance of a test sample (yellow dot) from the training domain embeddings (big green dots), we obtain ad-hoc mixture coefficients for each test sample.

Our formulation allows us to navigate the latent space of the batchnorm statistics. Specifically, if a test sample belongs to one of the source domains, our method assigns a high weight to the prediction of the corresponding domain-specific model. On the other hand, if the test sample does not belong to any of the source domains, the final prediction will be expressed as a linear combination of the domain-dependent models embodied in our lightweight ensemble.
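A minimal sketch of the resulting test-time prediction of Eq. 9 follows; it assumes a forward(x, d) callable that returns f(x | d), i.e. the prediction obtained by normalizing with the statistics of domain d, and a dictionary of similarities w_d^t computed as in the previous sketch (both are hypothetical interfaces, not the authors' code).

    def ensemble_predict(forward, x_t, similarities):
        """Distance-weighted combination of the domain-specific predictions (Eq. 9).

        similarities maps each known domain d to w_d^t = 1 / D_L(e_d, r_t).
        """
        total = sum(similarities.values())
        return sum((w / total) * forward(x_t, d) for d, w in similarities.items())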
3.4. Training Policy

To encourage a well-defined latent space for every batch normalization layer, we replicate at training time the distance weighting procedure described in Eq. 9 to compute predictions on samples from the known domains. Each training batch is composed of K domain batches with an equal number of samples. During every training step, (i) the domain batches are first propagated to update the corresponding domain population statistics (mu_d, sigma_d^2). Then, (ii) all individual samples are propagated assuming an unknown domain to collect their instance statistics and compute the domain similarities w_d^t, as in Sec. 3.3. Finally, (iii) each sample is propagated under K different domain assumptions (i.e. through the corresponding domain-specific branches) and the resulting domain-specific predictions are weighted according to Eq. 9. Applying this procedure during training promotes the creation of a well-defined latent space.

Since we initialize our model with weights pre-trained on ImageNet [9], each domain-specific batch normalization branch needs to be specialized before starting the distance training (DT) procedure described above, otherwise convergence problems might occur. We thus warm up the domain-specific batch normalization statistics by pre-training the model on the whole dataset following the standard procedure, except for the accumulation and application of domain-specific batch normalization statistics.
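The overall schedule can thus be summarized as a warm-up phase followed by distance training. The sketch below is our paraphrase of this paragraph (function names and epoch arguments are placeholders, not values from the paper); the per-batch distance training step is detailed in Algorithm 1 of the supplementary material.

    def train(model, train_batches, warmup_epochs, dt_epochs):
        """Two-phase schedule: warm-up of the domain-specific statistics, then DT."""
        # Phase 1 (warm-up): standard training, but accumulating and applying
        # domain-specific batch normalization statistics, so that each branch
        # adapts from the ImageNet initialization before distances are used.
        for _ in range(warmup_epochs):
            for batch in train_batches:
                warmup_step(model, batch)              # placeholder, no distance weighting

        # Phase 2 (distance training, DT): weight the domain-specific predictions
        # by the reciprocal distances, as in Eq. 9.
        for _ in range(dt_epochs):
            for batch in train_batches:
                distance_training_step(model, batch)   # placeholder, see Algorithm 1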
4. Experiments

4.1. Experimental Settings

By means of a synecdoche, we name our method after BNE, its main component.

Datasets. We evaluate our method on three domain generalization benchmarks:

PACS [25] features 4 domains (Art Painting, Cartoon, Photo, Sketch) with a significant domain shift. Each domain includes samples from 7 different categories, for a total of 9991 samples. Some examples are shown in Fig. 2.

Office-31 [44] was originally introduced for domain adaptation and has subsequently been used for domain generalization. The dataset is composed of 3 different sources and 31 categories, representing images captured with a Webcam and a dSLR camera or collected from the Amazon website.

Office-Caltech [15] is a variant of Office-31 featuring one additional domain, derived from the Caltech-256 dataset [16]. The dataset is composed of the 10 categories shared between Caltech-256 and the domains in Office-31.

Evaluation Protocol. Coherently with other works, we evaluate both the AlexNet [24] and the ResNet-18 [17] architectures. For the experiments on PACS and Office-31, we follow the standard leave-one-domain-out evaluation procedure, where the model is trained on all domains but one and tested on the left-out one. For Office-Caltech we do the same, but we also test following a leave-two-domains-out procedure. Since the original version of AlexNet does not include batch normalization layers, we adopt a variant with batch normalization applied on the activations of each convolutional layer [47]. Since the goal of domain generalization is to leverage multiple sources to learn models that are robust on any target domain, the natural deep-learning baseline to compare against consists in training directly on the merged set of source domains; we refer to it as DeepAll. We compare our method against this strong baseline and several deep-learning-based state-of-the-art methods for domain generalization. Since different methods rely on different initializations of the network weights, which result in different baselines, we compare with methods providing their own baseline and report for every competitor: the performance on each unseen domain, the average baseline performance (Avg. DA), the average performance of the method itself (Avg.) and the relative gain (Delta%).
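For reference, a short sketch of how we read the leave-one-domain-out protocol and the reported Delta% column follows; the relative-gain formula is our interpretation of the tables, not an equation stated in the paper.

    def leave_one_domain_out(domains):
        """Enumerate (training domains, held-out test domain) splits."""
        return [([d for d in domains if d != test], test) for test in domains]

    def relative_gain(avg_method, avg_baseline):
        """Our reading of the Delta% columns: relative gain of the method's average
        accuracy over the average accuracy of its own baseline (Avg. DA)."""
        return 100.0 * (avg_method - avg_baseline) / avg_baseline

    # Example: the four PACS splits, and the CrossGrad row of Tab. 1,
    # where relative_gain(77.8, 79.1) evaluates to roughly -1.64.
    pacs_splits = leave_one_domain_out(["Art Painting", "Cartoon", "Photo", "Sketch"])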
4.2. Domain Generalization for Classification

We compare BNE against several methods for object classification on commonly used benchmarks. Additional experiments are reported in the supplementary material.

PACS. We first benchmark our method on the PACS dataset [25], which presents a challenging domain generalization setting for object recognition. Every test uses 3 domains as the training set and one as the unknown test set; for each of these leave-one-out configurations, we train a model from the same initialization for 60 epochs. We test our method using the ResNet-18 architecture and report the results in Tab. 1. Overall, our proposal obtains the best absolute accuracy on 1 out of 4 target sets, with the second-best average accuracy (Avg.) of 83.1% and a relative gain (Delta%) of +5.86, making it the second most effective algorithm on this dataset. Since all the networks are initialized with weights trained on ImageNet, they are implicitly biased towards the Photo domain, as testified by the higher accuracy on it when treated as the test set. Sketch, instead, is arguably the most challenging domain, as testified by the lower accuracy achieved by all methods. It is in this scenario that our method provides the biggest gain (+9.6% absolute gain in accuracy over our baseline). The best performance on this dataset is obtained by DSON [45], which proposes to learn an ad-hoc mixture of instance and batch normalization statistics to improve generalization. This characteristic makes it a perfect candidate to be extended with the domain embedding strategy that we propose in Sec. 3.3. An in-depth discussion on this topic is provided in Sec. 4.4.

Office-31. We evaluate our method on Office-31 [44] and follow the leave-one-domain-out protocol. To compare with published results, we use AlexNet initialized with ImageNet weights and train it with our method for 100 epochs with learning rate 10^-4. Tab. 2 shows that our approach obtains the best absolute accuracy in two out of three test scenarios and a relative gain comparable to or better than the alternatives. The Amazon target domain proves to be the most challenging setting, as its images are acquired in ideal conditions (i.e. white backgrounds, studio lighting, ...) that are fairly different from the ImageNet domain. In this challenging generalization scenario, BNE boosts the absolute accuracy by an impressive +11.2% with respect to the baseline and +2.3% with respect to the closest competitor.

Office-Caltech. We also evaluate our method on Office-Caltech [15] and follow the standard evaluation procedure for this dataset, enumerating cases with a single target domain, either Amazon or Caltech, and scenarios with pairs of target domains: Dslr-Webcam and Amazon-Caltech. We use AlexNet initialized with ImageNet weights to compare with published results, and train BNE for 100 epochs. Tab. 3 shows that our approach achieves the best average accuracy and the best gain with respect to the baseline. The performance boost delivered by our method is especially evident in the challenging scenario where 2 domains are treated as targets, e.g. +6.4% absolute accuracy on Dslr-Webcam.
Table 1: State-of-the-art comparison on PACS with ResNet-18. Methods with * do not use domain labels.

Method Art Cartoon Photo Sketch Avg. DA Avg. ∆%


CrossGrad - [46] 78.7 73.3 94.0 65.1 79.1 77.8 -1.64
MetaReg - [2] 79.9 75.1 95.2 69.5 79.9 81.7 +2.25
MLDG - [26] 79.5 77.3 94.3 71.5 79.1 80.7 +2.02
Epi-FCR - [27] 82.1 77.0 93.9 73.0 79.1 81.5 +3.03
JiGen* - [6] 79.4 75.3 96.0 71.4 79.1 80.5 +1.77
MASF - [11] 80.3 77.1 94.99 71.69 79.2 81.0 +1.01
D-SAM - [12] 77.3 72.4 95.3 77.8 79.5 80.7 +1.47
MMLD* - [40] 81.3 77.2 96.1 72.3 78.7 81.8 +3.93
DSON - [45] 84.7 77.6 95.9 82.2 78.9 85.1 +7.85
DeepAll 75.8 73 94.4 70.9 - 78.5 -
BNE (Ours) 78.8 78.9 94.8 79.7 78.5 83.1 +5.86

Table 2: State-of-the-art comparison on Office-31 with AlexNet.

Method Amazon Dslr Webcam Avg. DA Avg. ∆%


UB - [21] 42.4 98.5 93.4 74.2 78.1 +5.26
DSN - [4] 44.0 99.0 94.5 74.2 79.2 +6.74
MTAE - [14] 43.7 99.0 94.2 74.2 79.0 +6.47
DGLRC - [10] 45.4 99.4 95.3 74.2 80.0 +7.82
MCIT - [43] 51.7 97.9 94.0 74.2 81.2 +9.43
DeepAll 43.8 94.1 88.4 - 75.4 -
BNE (Ours) 54.0 99.4 92.3 75.4 81.9 +8.62

4.3. Ablation Study

To measure the impact on performance of the different components of our method, we run ablation experiments on the PACS dataset using the ResNet-18 backbone, and report the results in Tab. 4. We compare again with the DeepAll baseline and against a DiscoveryNet (DNet, row (d)), a variant of BNE inspired by [37]. While in BNE we propose to assign domain membership by looking at the distance between the latent batch normalization embeddings, with DNet domain membership is learned through a domain classification network. On row (a) we show the performance gain attributable to the usage of separate batchnorm statistics for each training domain, while using at inference time the projection and weighting strategy described in Sec. 3.3; row (b) extends this approach by leveraging distance weighting (DT) at training time, as described in Sec. 3.4; finally, row (c) includes a warm-up phase in the initial training to help the population statistics converge to stable values before starting distance training. By comparing the average accuracy (Avg.) across the four possible target sets, it is clear how every component contributes to an increase in performance with respect to the baseline. By comparing line (c) to (d), we can notice how our proposal is more effective than the DNet variant inspired by the domain mapping strategy from [37], while also requiring fewer parameters. More details on DNet and extensive comparisons are reported in the supplementary material.

4.4. Choosing a Distance Metric

A crucial component of our method is the distance function used to locate each test sample with respect to the known domains. As mentioned in Sec. 3.3, we picked the Wasserstein distance for this task after a detailed preliminary study among different options, reported in Tab. 5. We considered three different distances: using a fixed value for the distance (Uniform), equivalent to averaging the predictions of the domain-specific branches; using the Bhattacharyya distance; and using the Wasserstein distance. The basic Uniform distance setting is similar to [45], with the main difference being the normalization layer used: a domain-specific batchnorm in our case, a learned mixture of instance and batch normalizations for [45]. Their approach thus also requires learning additional parameters. In our experiments, the Wasserstein distance proves to be a more principled choice, consistently delivering the best performance both on average and on any specific left-out domain. Measuring the similarity of each sample to the training domains by means of a distance function is always more effective than blindly averaging predictions; therefore, the effectiveness of our method derives from the accurate sample-wise domain attribution rather than from the chosen normalization layer. This observation leaves open the opportunity of combining our distance weighting scheme with other recently proposed normalization methods, e.g. [45].
Table 3: State-of-the-art comparison on Office-Caltech with AlexNet.

Method Amazon Caltech Dslr-Webcam Amazon-Caltech Avg. DA Avg. ∆%
UB - [21] 91.0 86.0 80.5 70.0 84.5 81.9 -3.08
DSN - [4] - - 85.8 81.2 84.5 - -
MTAE - [14] 93.1 86.2 85.3 80.5 84.5 86.3 +2.13
DGLRC - [10] 94.2 87.6 86.3 82.2 84.5 87.6 +3.67
MDA - [18] 93.5 86.9 84.9 82.6 84.5 87.0 +2.96
CIDG - [29] 93.2 85.1 83.7 65.91 84.5 82.0 -2.96
MCIT - [43] 93.3 86.3 85.2 82.7 84.5 86.9 +2.84
DeepAll 91.7 82.1 83.4 84.5 - 85.4 -
BNE (Ours) 93.7 85.9 89.8 87.7 85.4 89.3 +4.57

Table 4: Comparison of different variants of our method on PACS with Resnet.

Method DT Warm-up Art Cartoon Photo Sketch Avg. ∆%


DeepAll - - 75.8 73.0 94.4 70.9 78.5 -
(a) BNE ✗ ✗ 74.7 71.7 93.0 74.8 78.6 +0.03
(b) BNE ✓ ✗ 79.8 76.0 92.5 72.5 80.2 +2.13
(c) BNE ✓ ✓ 78.8 78.9 94.8 79.7 83.1 +5.86
(d) DNet ✓ ✓ 77.3 73.8 94.2 71.2 79.1 +0.08

Table 5: Comparison of different distance measures with our method on PACS with Alexnet.

Method Distance Art Cartoon Photo Sketch Average


DeepAll - 64.4 65.4 88.0 53.8 67.9
BNE (Ours) Uniform 64.9 64.0 88.7 61.7 69.8
BNE (Ours) Bhattacharyya 66.3 64.6 89.4 64.3 71.2
BNE (Ours) Wasserstein 66.7 65.7 89.5 66.8 72.2

5. Conclusions

Our method allows us to navigate the latent space of batch normalization statistics, describing unknown domains as a combination of the known ones. We rely on domain-specific normalization layers to disentangle independent representations for each training domain, and then use such implicit embeddings to localize unseen samples from unknown domains. Our method outperforms many alternatives on a number of domain generalization benchmarks [25, 44, 15], underlining the advantage of maintaining specific domain representations over forcing invariant representations. We believe that our work highlights interesting properties of batch normalization layers that have not been extensively explored yet. Our formulation could also be easily extended to the domain adaptation setting by injecting unlabelled samples from the target domain during training. If a few target samples happened to be available at the same time, they could all be used to retrieve a less biased estimate of the statistics of the unseen domain. We plan to explore these directions in future work.
References

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. 2015.
[2] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. In Advances in Neural Information Processing Systems, pages 998–1008, 2018.
[3] Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In Advances in Neural Information Processing Systems, pages 7694–7705, 2018.
[4] Konstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan. Domain separation networks. In Advances in Neural Information Processing Systems, pages 343–351, 2016.
[5] John S Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.
[6] Fabio M Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2229–2238, 2019.
[7] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Autodial: Automatic domain alignment layers. In 2017 IEEE International Conference on Computer Vision (ICCV), pages 5077–5085. IEEE, 2017.
[8] Fabio Maria Carlucci, Lorenzo Porzi, Barbara Caputo, Elisa Ricci, and Samuel Rota Bulò. Just dial: Domain alignment layers for unsupervised domain adaptation. In International Conference on Image Analysis and Processing, pages 357–369. Springer, 2017.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Zhengming Ding and Yun Fu. Deep domain generalization with structured low-rank constraint. IEEE Transactions on Image Processing, 27(1):304–313, 2017.
[11] Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. In Advances in Neural Information Processing Systems, pages 6447–6458, 2019.
[12] Antonio D'Innocente and Barbara Caputo. Domain generalization with domain-specific aggregation modules. In German Conference on Pattern Recognition, pages 187–198. Springer, 2018.
[13] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1126–1135. JMLR.org, 2017.
[14] Muhammad Ghifary, W Bastiaan Kleijn, Mengjie Zhang, and David Balduzzi. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE International Conference on Computer Vision, pages 2551–2559, 2015.
[15] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.
[16] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. 2007.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Shoubo Hu, Kun Zhang, Zhitang Chen, and Laiwan Chan. Domain generalization via multidomain discriminant analysis. In Conference on Uncertainty in Artificial Intelligence, volume 35. NIH Public Access, 2019.
[19] Zeyi Huang, Haohan Wang, Eric P Xing, and Dong Huang. Self-challenging improves cross-domain generalization. arXiv preprint arXiv:2007.02454, 2020.
[20] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.
[21] Aditya Khosla, Tinghui Zhou, Tomasz Malisiewicz, Alexei A Efros, and Antonio Torralba. Undoing the damage of dataset bias. In European Conference on Computer Vision, pages 158–171. Springer, 2012.
[22] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML Deep Learning Workshop, volume 2. Lille, 2015.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[25] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5542–5550, 2017.
[26] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Learning to generalize: Meta-learning for domain generalization. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[27] Da Li, Jianshu Zhang, Yongxin Yang, Cong Liu, Yi-Zhe Song, and Timothy M Hospedales. Episodic training for domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 1446–1455, 2019.
[28] Haoliang Li, Sinno Jialin Pan, Shiqi Wang, and Alex C Kot. Domain generalization with adversarial feature learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5400–5409, 2018.
[29] Ya Li, Mingming Gong, Xinmei Tian, Tongliang Liu, and Dacheng Tao. Domain generalization via conditional invariant representations. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[30] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[31] Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
[32] Yanghao Li, Naiyan Wang, Jianping Shi, Jiaying Liu, and Xiaodi Hou. Revisiting batch normalization for practical domain adaptation. arXiv preprint arXiv:1603.04779, 2016.
[33] Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Deep drone racing: From simulation to reality with domain randomization. IEEE Transactions on Robotics, 2019.
[34] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2507–2516, 2019.
[35] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[36] Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Best sources forward: domain generalization through source-specific nets. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1353–1357. IEEE, 2018.
[37] Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Robust place categorization with deep domain generalization. IEEE Robotics and Automation Letters, 3(3):2093–2100, 2018.
[38] Massimiliano Mancini, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Adagraph: Unifying predictive and continuous domain adaptation through graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6568–6577, 2019.
[39] Massimiliano Mancini, Lorenzo Porzi, Samuel Rota Bulò, Barbara Caputo, and Elisa Ricci. Boosting domain adaptation by discovering latent domains. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3771–3780, 2018.
[40] Toshihiko Matsuura and Tatsuya Harada. Domain generalization using a mixture of multiple latent domains. In AAAI, pages 11749–11756, 2020.
[41] Saeid Motiian, Marco Piccirilli, Donald A Adjeroh, and Gianfranco Doretto. Unified deep supervised domain adaptation and generalization. In Proceedings of the IEEE International Conference on Computer Vision, pages 5715–5725, 2017.
[42] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In International Conference on Machine Learning, pages 10–18, 2013.
[43] Mohammad Mahfujur Rahman, Clinton Fookes, Mahsa Baktashmotlagh, and Sridha Sridharan. Multi-component image translation for deep domain generalization. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 579–588. IEEE, 2019.
[44] Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models to new domains. In European Conference on Computer Vision, pages 213–226. Springer, 2010.
[45] Seonguk Seo, Yumin Suh, Dongwan Kim, Jongwoo Han, and Bohyung Han. Learning to optimize domain specific normalization for domain generalization. arXiv preprint arXiv:1907.04275, 2019.
[46] Shiv Shankar, Vihari Piratla, Soumen Chakrabarti, Siddhartha Chaudhuri, Preethi Jyothi, and Sunita Sarawagi. Generalizing across domains via cross-gradient training. arXiv preprint arXiv:1804.10745, 2018.
[47] Marcel Simon, Erik Rodner, and Joachim Denzler. Imagenet pre-trained models with batch normalization. arXiv preprint arXiv:1612.01452, 2016.
[48] Masashi Sugiyama and Amos J Storkey. Mixture regression for covariate shift. In Advances in Neural Information Processing Systems, pages 1337–1344, 2007.
[49] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30. IEEE, 2017.
[50] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. In Advances in Neural Information Processing Systems, pages 5334–5344, 2018.
[51] Mei Wang and Weihong Deng. Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153, 2018.
[52] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[53] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320–3328, 2014.
6. Supplementary Material

We provide supplementary material to further validate our method and complement the experimental section included in the main paper. Sec. 6.1 provides an algorithmic overview of the proposed training policy, and additional training details are reported in Sec. 6.2. Sec. 6.3 shows additional results obtained with ResNet-18 [17] on Office-31 [44] and Office-Caltech [15], and with AlexNet on PACS [25]. In Sec. 6.4, we conduct a qualitative analysis to verify our choices in terms of batch sizes and distance measures; moreover, we validate BNE against other popular normalization strategies. In Sec. 6.5, we validate quantitatively our latent space proposal. Finally, we extensively compare the performance of BNE against the DNet variant that leverages a domain discovery network.

N.B.: Blue references point to the original manuscript.

6.1. Training Policy

We here provide a formalization of the distance training policy described in Sec. 3.4.

Let T = {b_d}_{d in D} be a training batch composed of K domain batches, each containing n samples from the corresponding domain d: b_d = {(x_d^i, y_d^i)}_{i=1}^n. In Alg. 1, we illustrate the training procedure for a single training batch T using the same notation as in the original manuscript. During every training step, first, the domain batches are propagated to update the corresponding domain embeddings e_d (l:2-6). Then, each individual sample x_t is propagated using instance normalization to collect its instance statistics (mu_t^l, (sigma_t^l)^2) for all l in B (l:8). Given these statistics, we compute the target embedding r_t (l:9-10) and the domain similarities w_d^t (l:12), as in Sec. 3.3. Each sample is propagated under K different domain assumptions (i.e. through the corresponding domain-specific branches) (l:13). The resulting domain-specific predictions are weighted according to Eq. 9 to compute the final prediction (l:14). Finally, the cross-entropy loss between the final predictions f(x_t) and the corresponding ground truth y_t is computed (l:15) and back-propagated to update the weights theta of the model (l:16). Applying this procedure during training encourages the creation of a batch normalization latent space.

6.2. Training Settings

Coherently with other works, we evaluate both the AlexNet [24] and the more recent ResNet-18 [17] architectures. Before training each network, we initialize it with weights pre-trained on ImageNet and fine-tune the last fully-connected layer on the dataset of interest for 20 epochs. To train AlexNet [24], we use SGD as the optimizer with momentum 0.95 and L2 regularization on the network weights with weight decay 5 x 10^-5. The initial learning rate is 10^-3, exponentially decayed with decay rate 0.95. ResNet-18 is trained with Adam [22] and weight decay 10^-6; the initial learning rate is 10^-4. Coherently with previous works ([8, 7, 39]), we also compute gradients through the mean and standard deviation computation of the batch normalization layers. All the input images are normalized according to the statistics computed on ImageNet. At training time, data augmentation is performed by first resizing the input image to 256, then randomly cropping to 224x224 for ResNet-18 and 227x227 for AlexNet; finally, a random horizontal flip is performed. Every training batch is composed of 16 samples per domain for ResNet-18 and 6 for AlexNet.

All the models are implemented in TensorFlow 2.0 [1]. We initialize both AlexNet and ResNet-18 using the publicly available Caffe weights pre-trained on ImageNet, after carefully converting them.¹

¹ ResNet-18 and AlexNet ImageNet weights available at https://github.com/heuritech/convnets-keras and https://github.com/cvjena/cnn-models.

6.3. Additional Results

We here provide additional results with the ResNet-18 [17] architecture for the Office-31 [44] dataset and with the AlexNet [47] architecture for PACS [25]. In the original manuscript, we already provide results with AlexNet and ResNet-18 respectively to compare against recently published works. Moreover, we expand the experimental setting with the addition of the Office-Caltech [15] dataset, for which we present results with both ResNet-18 and AlexNet.

6.3.1 PACS

In Tab. 6, we extend the comparison on PACS considering AlexNet, to compare against a vast literature of published works relying on this older architecture. Once again our proposal achieves absolute performance comparable to the state of the art even if starting from a weaker baseline. Indeed, when comparing the relative gain in performance provided by our method (Delta%), we clearly outperform any previously published solution with an increase of +6.33%, while the second best obtains +4.88%. Once again, when considering Sketch as the unseen domain, our method boosts the performance by a +13% absolute gain in accuracy over our baseline.

6.3.2 Office-31

In Tab. 7, we extend the comparison on Office-31 considering ResNet-18, as it is a good example of a modern architecture with native batch normalization layers. The results confirm that our method is able to improve performance over DeepAll across all three tests.
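Since the models are implemented in TensorFlow 2.0, the input pipeline described in Sec. 6.2 could be sketched roughly as follows. This is our illustration only: the exact resize and normalization conventions are not specified in the paper (the converted Caffe weights could, for instance, expect BGR mean subtraction instead), and the ImageNet channel statistics below are the commonly used ones, assumed here.

    import tensorflow as tf

    # Commonly used ImageNet channel statistics for inputs scaled to [0, 1] (assumption).
    IMAGENET_MEAN = tf.constant([0.485, 0.456, 0.406])
    IMAGENET_STD = tf.constant([0.229, 0.224, 0.225])

    def augment(image, crop_size=224):
        """Training-time preprocessing of Sec. 6.2: resize to 256, random crop,
        random horizontal flip, ImageNet normalization.
        crop_size is 224 for ResNet-18 and 227 for AlexNet."""
        image = tf.image.resize(image, [256, 256]) / 255.0   # assumes uint8-valued input
        image = tf.image.random_crop(image, size=[crop_size, crop_size, 3])
        image = tf.image.random_flip_left_right(image)
        return (image - IMAGENET_MEAN) / IMAGENET_STD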
Algorithm 1 Training step for a batch T
 1: for b_d in T do                                            ▷ for every domain batch
 2:     Collect domain batch statistics (mu~_d, sigma~_d^2)    ▷ forward propagation
 3:     mu^_d^l <- 0.99 mu^_d^l + 0.01 mu~_d^l,  for all l in B          ▷ update domain population mean
 4:     (sigma^_d^l)^2 <- 0.99 (sigma^_d^l)^2 + 0.01 (sigma~_d^l)^2,  for all l in B   ▷ update domain population variance
 5:     e_d^l <- (mu^_d^l, (sigma^_d^l)^2),  for all l in B              ▷ update domain layer embeddings
 6:     e_d <- [e_d^1, e_d^2, ..., e_d^L]                      ▷ update domain embedding
 7: for (x_t, y_t) in T do                                     ▷ for every sample
 8:     Collect instance statistics (mu_t, sigma_t^2)          ▷ forward propagation
 9:     r_t^l <- (mu_t^l, (sigma_t^l)^2),  for all l in B                ▷ define target layer embeddings
10:     r_t <- [r_t^1, r_t^2, ..., r_t^L]                      ▷ define target embedding
11:     D_L(e_d, r_t) = sum_{l in B} W(e_d^l, r_t^l),  for all d in D    ▷ compute domain distances
12:     w_d^t = 1 / D_L(e_d, r_t),  for all d in D                       ▷ compute domain similarities
13:     f_d^t <- f(x_t | d),  for all d in D                             ▷ compute domain-specific predictions
14:     f(x_t) = sum_{d in D} w_d^t f_d^t / sum_{d in D} w_d^t           ▷ compute final prediction
15: L(theta; T) = sum_{(x_t, y_t) in T} XE(f(x_t), y_t)        ▷ compute cross-entropy loss
16: theta <- theta - eta * grad_theta L(theta; T)              ▷ update weights
Table 6: State-of-the-art comparison on PACS with AlexNet.

Method Art Cartoon Photo Sketch Avg. DA Avg. ∆%


DICA - [42] 64.6 64.5 91.8 51.1 68.7 68.0 -1.02
D-MTAE - [14] 60.3 58.7 91.1 47.9 68.7 64.5 -6.11
DSN - [4] 61.1 66.5 83.3 58.6 68.7 67.4 -1.89
TF-CNN - [25] 62.9 67.0 89.5 57.5 67.1 69.2 +3.13
CIDDG - [30] 62.7 69.7 78.7 64.5 71.7 68.9 -3.91
Fusion - [36] 64.1 66.8 90.2 60.1 67.1 70.3 +4.77
CrossGrad - [46] 64.1 66.8 90.2 60.1 68.7 70.3 +2.33
MetaReg - [2] 69.8 70.4 91.1 59.3 69.3 72.6 +4.76
MLDG - [26] 66.2 66.9 88.0 59.0 67.2 70.0 +4.17
Epi-FCD - [27] 64.7 72.3 86.1 65.0 68.7 72.0 +4.80
JiGen - [6] 67.6 71.7 89.0 65.2 71.5 73.4 +2.66
MASF - [11] 70.4 72.5 90.7 67.3 71.7 75.2 +4.88
DeepAll 64.4 65.4 88.0 53.8 - 67.9 -
BNE (Ours) 66.7 65.7 89.5 66.8 67.9 72.2 +6.33

Table 7: State-of-the-art comparison on Office-31 with ResNet-18.

Method Amazon Dslr Webcam Avg. DA Avg. ∆%


DeepAll 55.1 99.0 92.6 - 82.2 -
BNE (Ours) 55.5 99.3 95.4 82.2 83.4 +1.42

6.3.3 Office-Caltech

In Tab. 8 we show additional results for Office-Caltech using the ResNet-18 architecture. The same good property observed with AlexNet is confirmed when considering ResNet-18 as the architecture, with a clear +5.5% gain over DeepAll.
Table 8: State-of-the-art comparison on Office-Caltech with ResNet-18.

Method Amazon Caltech Dslr-Webcam Amazon-Caltech Avg. DA Avg. ∆%
DeepAll 92.7 83.1 85.3 80.7 - 85.5 -
BNE (Ours) 92.9 87.4 93.0 87.3 85.5 90.2 +5.50

6.4. Ablation Study

We here provide additional ablation studies to better highlight different characteristics of our method with respect to the chosen batch size, the contribution of its individual components, and the normalization strategy adopted. The experiments are conducted on the PACS dataset [25].

6.4.1 Batch Size

We study the impact of different batch sizes on the performance of our method in Tab. 9. As expected, and as already documented in several recent works leveraging batch normalization layers [3, 52], the larger the batch size, the better the generalization capability. In particular, for our method, the larger the batch size used at training time, the better the approximation of the true population statistics (i.e., the better the domain embeddings e_d). This translates into better final performance, as detailed in Tab. 9, where we observe an increase of +2.9 average accuracy between batch size 16 and 64.
tween using batch size 16 and 64. Freeze BatchNorm) and on any specific domain.

6.5. Latent Space Validation


6.4.2 Method Components We now want to investigate how well we are able to
collect domain specific attributes of samples by projecting
In the main paper we measured the contribution in achiev- them to the batchnorm latent space. We trained ResNet-
ing the final performance of the different components of our 18 until convergence without distance training and warm-up
methods. The proposed setting leveraged the PACS dataset on the PACS dataset considering Photo or Sketch as unseen
and the ResNet-18 architecture. We here consider ablation domains. Once trained, we forward every training sample
experiments on the PACS dataset using the AlexNet archi- through the network and compute its instance statistics to
tecture and report the results in Tab.3, comparing again with project it to the batchnorm latent space. After the projection
the DeepAll baseline. On row (a) we show the performance we measure the distance from every domain embedding: if
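For reference, one possible way to configure the Freeze BatchNorm baseline (c) in PyTorch is sketched below (assuming a recent torchvision); this is an illustrative configuration, not necessarily the exact setup used in our experiments. Putting the BN modules in eval mode makes them normalize with the stored ImageNet running statistics and prevents any further update of those estimates.

import torch.nn as nn
from torchvision.models import resnet18

def freeze_bn_statistics(model: nn.Module) -> nn.Module:
    # Keep the pre-trained BatchNorm running statistics fixed while the
    # affine parameters and the rest of the network are still trained.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()          # normalize with the stored running statistics
            m.momentum = 0.0  # no further updates of the running estimates
    return model

model = freeze_bn_statistics(resnet18(weights="IMAGENET1K_V1"))

Note that calling model.train() switches the BN modules back to training mode, so the freeze has to be re-applied at the beginning of every training epoch.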
6.5. Latent Space Validation

We now want to investigate how well we are able to collect domain-specific attributes of samples by projecting them to the batchnorm latent space. We trained ResNet-18 until convergence, without distance training and warm-up, on the PACS dataset considering Photo or Sketch as unseen domains. Once trained, we forward every training sample through the network and compute its instance statistics to project it to the batchnorm latent space. After the projection, we measure the distance from every domain embedding: if the closest domain matches the real domain, then the latent space effectively represents membership to a certain domain. In Tab. 12 we report the average value of the reciprocal of the distance of every training sample with respect to the centroid of the three training domains. The higher values on the diagonal confirm our intuition that the batchnorm latent space can be used to implicitly encode domain attributes.

Furthermore, we investigate the relationship between the measured distances and the prediction accuracy of the same ResNet-18 trained on PACS without DT. For each test sample we measure the prediction accuracy obtained using only the predictions from either the closest domain branch, the second closest or the third (i.e., the farthest away). We run the test for all 4 possible unseen domains following the leave-one-domain-out protocol and report the average accuracy in Tab. 13. The results show a clear correlation between distances and accuracy, as trusting the “closest” domain branch clearly results in a higher accuracy than the others.
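A minimal sketch of this validation is given below, assuming the per-sample distances to each domain embedding have already been computed (e.g., as in the inference sketch above). It aggregates the reciprocal distances per source domain, as reported in Tab. 12, and measures how often the closest embedding matches the true domain; names and shapes are illustrative assumptions.

import numpy as np

def latent_space_validation(distances, sample_domains, num_domains):
    # distances:      (N, D) per-sample distances to the D domain embeddings
    # sample_domains: (N,) ground-truth source-domain index of each sample
    similarities = 1.0 / distances  # reciprocal of the distance
    # Average similarity of each source domain (row) to each embedding (column).
    avg_similarity = np.stack([similarities[sample_domains == d].mean(axis=0)
                               for d in range(num_domains)])
    # Nearest-embedding domain classification accuracy.
    accuracy = (distances.argmin(axis=1) == sample_domains).mean()
    return avg_similarity, accuracy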

Table 9: Comparison of different batch sizes (per domain) on PACS with AlexNet.

Method Batch Size Art Cartoon Photo Sketch Average


DeepAll - 64.4 65.4 88.0 53.8 67.9
BNE (Ours) 16 65.7 66.5 88.9 56.0 69.3
BNE (Ours) 32 67.9 65.7 89.4 62.3 71.3
BNE (Ours) 64 66.7 65.7 89.5 66.8 72.2

Table 10: Comparison of different variants of our method on PACS with AlexNet.

Method DT Warm-up Art Cartoon Photo Sketch Avg.


DeepAll - - 64.4 65.4 88.0 53.8 67.9
(a) BNE ✗ ✗ 64.4 65.9 89.6 54.2 68.53
(b) BNE ✓ ✗ 63.7 67.9 84.6 66.6 70.7
(c) BNE ✓ ✓ 66.7 65.7 89.5 66.8 72.2

Table 11: Comparison of different normalization strategies on PACS with ResNet-18.

Method Art Cartoon Photo Sketch Average


(a) InstanceNorm 62.6 72.7 79.7 71.7 71.7
(b) BatchNorm 75.8 73.0 94.4 70.9 78.5
(c) Freeze BatchNorm 75.0 76.8 92.6 73.9 79.6
(d) BNE (Ours) 78.8 78.9 94.8 79.7 83.1

Table 12: Analysis of the average similarity value as a domain classification metric with ResNet-18 on PACS without distance training. For each source domain (row), the classified domain is the one with the highest similarity.

(a) Photo unseen.

Source    Art    Cartoon   Sketch

Art       8.24   5.35      4.21
Cartoon   6.58   7.02      5.97
Sketch    3.94   4.56      10.19

(b) Sketch unseen.

Source    Art    Cartoon   Photo

Art       1.18   0.70      1.15
Cartoon   0.94   1.02      0.90
Photo     1.19   0.70      1.25

6.6. Domain Discovery Net

In Sec. 4.3, we compared the performance of BNE with DNet. For this purpose, we follow [37] and implement a domain discovery network (DNet) that takes as input the activations after the first convolutional block and directly outputs the probability for the input sample to belong to each one of the training domains. This probability distribution is used to weigh the domain-specific predictions of our lightweight ensemble. Analogously to [37], we implement DNet as a lateral branch to our lightweight ensemble, composed of a global pooling layer, followed by a ReLU non-linearity, a fully-connected layer and a softmax activation. DNet is trained in an end-to-end fashion together with the main classifier. We consider two options: (i) training DNet by applying only a cross-entropy loss on the image classification logits with respect to the input categories; (ii) training DNet by directly supervising the classification of samples into the correct domain using domain labels.
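A lateral branch with the structure described above might look like the following sketch; the channel count and number of domains are illustrative assumptions (e.g., the 64 channels produced by the first ResNet-18 block), and this is not the exact DNet of [37].

import torch
import torch.nn as nn

class DNet(nn.Module):
    # Domain discovery branch: global pooling -> ReLU -> fully-connected -> softmax.
    # It consumes the activations after the first convolutional block and outputs
    # a probability distribution over the training domains, used to weigh the
    # domain-specific predictions of the lightweight ensemble.
    def __init__(self, in_channels: int, num_domains: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling
        self.relu = nn.ReLU(inplace=True)
        self.fc = nn.Linear(in_channels, num_domains)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.relu(self.pool(feats).flatten(1))
        return torch.softmax(self.fc(x), dim=1)  # domain membership probabilities

# e.g., feats from the first ResNet-18 block: (batch, 64, 56, 56)
dnet = DNet(in_channels=64, num_domains=3)

When the cross-entropy on domain logits (XE in Tab. 14) is enabled, an additional loss between this output and the domain labels is added to the image classification loss.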

Table 13: Analysis of the classification accuracy considering predictions from the closest, second-closest (second) and third-closest (third) domain branches to the target sample. We also report the average accuracy (Avg.) over all leave-one-domain-out tests.

Method Art Painting Cartoon Photo Sketch Avg.


Closest 68.02 63.10 92.75 71.09 72.20
Second 65.14 60.75 86.83 57.11 64.58
Third 47.95 60.66 67.31 57.90 58.08

Table 14: Comparison of BNE and DNet with and without a cross-entropy loss applied also on the domain logits (XE). We report the domain classification accuracy for runs with different unseen domains, the average domain classification accuracy (Avg. Domain) and the average image classification accuracy (Avg. Class).

Method XE Art Cartoon Photo Sketch Avg. Domain Avg. Class


BNE ✗ 71.2 82.2 75.0 60.1 72.1 83.1
DNet ✗ 33.3 33.3 33.3 33.3 33.3 79.1
BNE ✓ 75.9 82.7 74.5 58.7 73.0 80.8
DNet ✓ 96.0 87.8 62.7 84.3 82.7 74.9

In Tab. 14 we compare the domain classification accuracy (Avg. Domain) of BNE and that of DNet, with or without applying a cross-entropy loss over the domain labels, across four tests considering different unseen domains on PACS. When cross-entropy is not applied on the domain logits, BNE largely outperforms DNet. The 33% Avg. Domain for DNet without cross-entropy on the domain logits denotes that the discovery network learns to disregard the multi-domain BN layer, always predicting the same domain class and thus leveraging only one branch of the multi-domain BN layer. However, when cross-entropy is applied also on the domain logits, the domain classification branch of DNet can adapt its parameters to predict the domain classes well (Avg. Domain). This, however, comes at the cost of a remarkable drop in image classification accuracy (Avg. Class), which may be partially explained by DNet overfitting more to the training data and being less able to generalize to the test data. BNE instead provides a meaningful domain representation even without cross-entropy on the domain logits, largely outperforming the image classification accuracy of DNet. Nevertheless, since our representation is not parametric, we cannot witness a visible increase in Avg. Domain when applying a cross-entropy loss also on the domain membership assigned through our representation.

The advantages of BNE over DNet are clear: BNE leverages all the activations throughout the network to estimate the domain membership, while DNet must rely only on the activations of the first layer due to the fixed input size of the domain classification branch. Moreover, our method allows a parameter-free domain representation, while DNet relies on a lateral branch to the main network. Finally, BNE maps samples into a latent space where distances from the domain embeddings can be computed, while DNet can only output the distance of the input sample from the training domains.
