Domain-Adversarial Neural Networks
Editors: Urun Dogan, Marius Kloft, Francesco Orabona, and Tatiana Tommasi
Abstract
We introduce a new representation learning approach for domain adaptation, in which
data at training and test time come from similar but different distributions. Our approach
is directly inspired by the theory on domain adaptation suggesting that, for effective do-
main transfer to be achieved, predictions must be made based on features that cannot
discriminate between the training (source) and test (target) domains.
The approach implements this idea in the context of neural network architectures that
are trained on labeled data from the source domain and unlabeled data from the target do-
main (no labeled target-domain data is necessary). As the training progresses, the approach
promotes the emergence of features that are (i) discriminative for the main learning task
on the source domain and (ii) indiscriminate with respect to the shift between the domains.
We show that this adaptation behaviour can be achieved in almost any feed-forward model
by augmenting it with a few standard layers and a new gradient reversal layer. The resulting
augmented architecture can be trained using standard backpropagation and stochastic gra-
dient descent, and can thus be implemented with little effort using any of the deep learning
packages.
We demonstrate the success of our approach for two distinct classification problems
(document sentiment analysis and image classification), where state-of-the-art domain
adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application.
Keywords: domain adaptation, neural network, representation learning, deep learning,
synthetic data, image classification, sentiment analysis, person re-identification
© 2016 Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky.
1. Introduction
The cost of generating labeled data for a new machine learning task is often an obstacle
for applying machine learning methods. In particular, this is a limiting factor for the further progress of deep neural network architectures, which have already brought impressive
advances to the state-of-the-art across a wide variety of machine-learning tasks and applications. For problems lacking labeled data, it may still be possible to obtain training sets
that are big enough for training large-scale deep models, but that suffer from the shift in
data distribution from the actual data encountered at “test time”. One important example
is training an image classifier on synthetic or semi-synthetic images, which may come in
abundance and be fully labeled, but which inevitably have a distribution that is different
from real images (Liebelt and Schmid, 2010; Stark et al., 2010; Vázquez et al., 2014; Sun and
Saenko, 2014). Another example is in the context of sentiment analysis in written reviews,
where one might have labeled data for reviews of one type of product (e.g., movies), while
having the need to classify reviews of other products (e.g., books).
Learning a discriminative classifier or other predictor in the presence of a shift be-
tween training and test distributions is known as domain adaptation (DA). The proposed
approaches build mappings between the source (training-time) and the target (test-time)
domains, so that the classifier learned for the source domain can also be applied to the
target domain, when composed with the learned mapping between domains. The appeal
of the domain adaptation approaches is the ability to learn a mapping between domains in
the situation when the target domain data are either fully unlabeled (unsupervised domain
adaptation) or have few labeled samples (semi-supervised domain adaptation). Below, we
focus on the harder unsupervised case, although the proposed approach (domain-adversarial
learning) can be generalized to the semi-supervised case rather straightforwardly.
Unlike many previous papers on domain adaptation that worked with fixed feature
representations, we focus on combining domain adaptation and deep feature learning within
one training process. Our goal is to embed domain adaptation into the process of learning
representation, so that the final classification decisions are made based on features that
are both discriminative and invariant to the change of domains, i.e., have the same or
very similar distributions in the source and the target domains. In this way, the obtained
feed-forward network can be applicable to the target domain without being hindered by
the shift between the two domains. Our approach is motivated by the theory on domain
adaptation (Ben-David et al., 2006, 2010), that suggests that a good representation for
cross-domain transfer is one for which an algorithm cannot learn to identify the domain of
origin of the input observation.
We thus focus on learning features that combine (i) discriminativeness and (ii) domain-
invariance. This is achieved by jointly optimizing the underlying features as well as two
discriminative classifiers operating on these features: (i) the label predictor that predicts
class labels and is used both during training and at test time and (ii) the domain classifier
that discriminates between the source and the target domains during training. While the
parameters of the classifiers are optimized in order to minimize their error on the training set,
the parameters of the underlying deep feature mapping are optimized in order to minimize
the loss of the label classifier and to maximize the loss of the domain classifier. The latter
update thus works adversarially to the domain classifier, and it encourages domain-invariant
features to emerge in the course of the optimization.
Crucially, we show that all three training processes can be embedded into an appro-
priately composed deep feed-forward network, called domain-adversarial neural network
(DANN) (illustrated in Figure 1) that uses standard layers and loss functions,
and can be trained using standard backpropagation algorithms based on stochastic gradi-
ent descent or its modifications (e.g., SGD with momentum). The approach is generic as
a DANN version can be created for almost any existing feed-forward architecture that is
trainable by backpropagation. In practice, the only non-standard component of the pro-
posed architecture is a rather trivial gradient reversal layer that leaves the input unchanged
during forward propagation and reverses the gradient by multiplying it by a negative scalar
during the backpropagation.
We provide an experimental evaluation of the proposed domain-adversarial learning
idea over a range of deep architectures and applications. We first consider the simplest
DANN architecture where the three parts (label predictor, domain classifier and feature
extractor) are linear, and demonstrate the success of domain-adversarial learning for such
architecture. The evaluation is performed for synthetic data as well as for the sentiment
analysis problem in natural language processing, where DANN improves the state-of-the-art
marginalized Stacked Autoencoders (mSDA) of Chen et al. (2012) on the common Amazon
reviews benchmark.
We further evaluate the approach extensively for an image classification task, and present
results on traditional deep learning image data sets—such as MNIST (LeCun et al., 1998)
and SVHN (Netzer et al., 2011)—as well as on Office benchmarks (Saenko et al., 2010),
where domain-adversarial learning allows obtaining a deep architecture that considerably
improves over previous state-of-the-art accuracy.
Finally, we evaluate domain-adversarial descriptor learning in the context of person
re-identification application (Gong et al., 2014), where the task is to obtain good pedes-
trian image descriptors that are suitable for retrieval and verification. We again apply domain-adversarial learning, but with a descriptor predictor trained with a Siamese-like loss in place of the label predictor trained with a classification loss. In a series of experiments, we
demonstrate that domain-adversarial learning can improve cross-data-set re-identification
considerably.
2. Related Work
The general approach of achieving domain adaptation has been explored under many facets. Over the
years, a large part of the literature has focused mainly on linear hypotheses (see for instance
Blitzer et al., 2006; Bruzzone and Marconcini, 2010; Germain et al., 2013; Baktashmotlagh
et al., 2013; Cortes and Mohri, 2014). More recently, non-linear representations have become
increasingly studied, including neural network representations (Glorot et al., 2011; Li et al.,
2014) and most notably the state-of-the-art mSDA (Chen et al., 2012). That literature has
mostly focused on exploiting the principle of robust representations, based on the denoising
autoencoder paradigm (Vincent et al., 2008).
Concurrently, multiple methods of matching the feature distributions in the source and
the target domains have been proposed for unsupervised domain adaptation. Some ap-
proaches perform this by reweighing or selecting samples from the source domain (Borg-
wardt et al., 2006; Huang et al., 2006; Gong et al., 2013), while others seek an explicit
feature space transformation that would map source distribution into the target one (Pan
et al., 2011; Gopalan et al., 2011; Baktashmotlagh et al., 2013). An important aspect
of the distribution matching approach is the way the (dis)similarity between distributions
is measured. Here, one popular choice is matching the distribution means in the kernel-
reproducing Hilbert space (Borgwardt et al., 2006; Huang et al., 2006), whereas Gong et al.
(2012) and Fernando et al. (2013) map the principal axes associated with each of the dis-
tributions.
Our approach also attempts to match feature space distributions; however, this is accomplished by modifying the feature representation itself rather than by reweighing or geometric
transformation. Also, our method uses a rather different way to measure the disparity be-
tween distributions based on their separability by a deep discriminatively-trained classifier.
Note also that several approaches perform a gradual transition from the source to the target domain
(Gopalan et al., 2011; Gong et al., 2012) by slowly changing the training distribution.
Among these methods, Chopra et al. (2013) does this in a “deep” way by the layerwise
training of a sequence of deep autoencoders, while gradually replacing source-domain sam-
ples with target-domain samples. This improves over a similar approach of Glorot et al.
(2011) that simply trains a single deep autoencoder for both domains. In both approaches,
the actual classifier/predictor is learned in a separate step using the feature representation
learned by autoencoder(s). In contrast to Glorot et al. (2011); Chopra et al. (2013), our
approach performs feature learning, domain adaptation and classifier learning jointly, in a
unified architecture, and using a single learning algorithm (backpropagation). We therefore
argue that our approach is simpler (both conceptually and in terms of its implementation).
Our method also achieves considerably better results on the popular Office benchmark.
While the above approaches perform unsupervised domain adaptation, there are ap-
proaches that perform supervised domain adaptation by exploiting labeled data from the
target domain. In the context of deep feed-forward architectures, such data can be used
to “fine-tune” the network trained on the source domain (Zeiler and Fergus, 2013; Oquab
et al., 2014; Babenko et al., 2014). Our approach does not require labeled target-domain
data. At the same time, it can easily incorporate such data when they are available.
An idea related to ours is described in Goodfellow et al. (2014). While their goal is
quite different (building generative deep networks that can synthesize samples), the way
they measure and minimize the discrepancy between the distribution of the training data
and the distribution of the synthesized data is very similar to the way our architecture
measures and minimizes the discrepancy between feature distributions for the two domains.
Moreover, the authors mention the problem of saturating sigmoids which may arise at the
early stages of training due to the significant dissimilarity of the domains. The technique
they use to circumvent this issue (the “adversarial” part of the gradient is replaced by a
gradient computed with respect to a suitable cost) is directly applicable to our method.
Also, recent and concurrent reports by Tzeng et al. (2014); Long and Wang (2015)
focus on domain adaptation in feed-forward networks. Their set of techniques measures and
minimizes the distance between the data distribution means across domains (potentially,
after embedding distributions into RKHS). Their approach is thus different from our idea
of matching distributions by making them indistinguishable for a discriminative classifier.
Below, we compare our approach to Tzeng et al. (2014); Long and Wang (2015) on the
Office benchmark. Another approach to deep domain adaptation, which is arguably more
different from ours, has been developed in parallel by Chen et al. (2015).
From a theoretical standpoint, our approach is directly derived from the seminal theo-
retical works of Ben-David et al. (2006, 2010). Indeed, DANN directly optimizes the notion
of H-divergence. We do note the work of Huang and Yates (2012), in which HMM repre-
sentations are learned for word tagging using a posterior regularizer that is also inspired
by Ben-David et al.'s work. In addition to the tasks being different (Huang and Yates, 2012, focus on word tagging problems), we would argue that the DANN learning objective
more closely optimizes the H-divergence, with Huang and Yates (2012) relying on cruder
approximations for efficiency reasons.
A part of this paper has been published as a conference paper (Ganin and Lempitsky,
2015). This version extends Ganin and Lempitsky (2015) very considerably by incorporat-
ing the report Ajakan et al. (2014) (presented as part of the Second Workshop on Transfer
and Multi-Task Learning), which brings in new terminology, in-depth theoretical analy-
sis and justification of the approach, extensive experiments with the shallow DANN case
on synthetic data as well as on a natural language processing task (sentiment analysis).
Furthermore, in this version we go beyond classification and evaluate domain-adversarial
learning for descriptor learning setting within the person re-identification application.
3. Domain Adaptation
We consider classification tasks where X is the input space and Y = {0, 1, . . . , L−1} is the
set of L possible labels. Moreover, we have two different distributions over X × Y, called the
source domain D_S and the target domain D_T. An unsupervised domain adaptation learning
algorithm is then provided with a labeled source sample S drawn i.i.d. from D_S, and an
unlabeled target sample T drawn i.i.d. from D_T^X, where D_T^X is the marginal distribution of
D_T over X:

\[ S = \{(x_i, y_i)\}_{i=1}^{n} \sim (D_S)^n; \qquad T = \{x_i\}_{i=n+1}^{N} \sim (D_T^X)^{n'}, \]

with N = n + n' being the total number of samples. The goal of the learning algorithm is
to build a classifier η : X → Y with a low target risk

\[ R_{D_T}(\eta) \;=\; \Pr_{(x,y)\sim D_T}\bigl[\eta(x) \neq y\bigr]. \]
The H-divergence was introduced for domain adaptation by Ben-David et al. (2006, 2010), building on earlier work of Kifer
et al. (2004). Note that we assume in Definition 1 below that the hypothesis class H is a
(discrete or continuous) set of binary classifiers η : X → {0, 1}.
Definition 1 (Ben-David et al., 2006, 2010; Kifer et al., 2004) Given two domain
distributions D_S^X and D_T^X over X, and a hypothesis class H, the H-divergence between
D_S^X and D_T^X is

\[ d_{\mathcal{H}}(D_S^X, D_T^X) \;=\; 2 \sup_{\eta \in \mathcal{H}} \left| \Pr_{x \sim D_S^X}\bigl[\eta(x) = 1\bigr] - \Pr_{x \sim D_T^X}\bigl[\eta(x) = 1\bigr] \right|. \]
That is, the H-divergence relies on the capacity of the hypothesis class H to distinguish
between examples generated by D_S^X from examples generated by D_T^X. Ben-David et al.
(2006, 2010) proved that, for a symmetric hypothesis class H, one can compute the empirical
H-divergence between two samples S ∼ (D_S^X)^n and T ∼ (D_T^X)^{n'} by computing

\[ \hat{d}_{\mathcal{H}}(S, T) \;=\; 2\left(1 - \min_{\eta \in \mathcal{H}}\left[ \frac{1}{n}\sum_{i=1}^{n} I[\eta(x_i) = 0] + \frac{1}{n'}\sum_{i=n+1}^{N} I[\eta(x_i) = 1] \right]\right), \tag{1} \]
where I[a] is the indicator function which is 1 if predicate a is true, and 0 otherwise.
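To make Equation (1) concrete, the following sketch (our own illustration, not part of the original formulation) estimates the empirical H-divergence with scikit-learn's logistic regression standing in for the minimization over H; a logistic model only approximates the exact minimum over a hypothesis class, so the returned value is a plug-in estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def empirical_h_divergence(xs, xt):
    """Plug-in estimate of Equation (1): train a domain classifier and
    insert its per-domain error rates into the min(...) term."""
    x = np.vstack([xs, xt])
    # Train eta to output 1 on source and 0 on target, so that the two
    # indicator sums of Equation (1) are exactly its per-domain error rates.
    y = np.concatenate([np.ones(len(xs)), np.zeros(len(xt))])
    eta = LogisticRegression(max_iter=1000).fit(x, y)
    err_s = np.mean(eta.predict(xs) == 0)   # (1/n)  sum I[eta(x_i) = 0]
    err_t = np.mean(eta.predict(xt) == 1)   # (1/n') sum I[eta(x_i) = 1]
    return 2.0 * (1.0 - (err_s + err_t))

# Identical distributions give a value near 0; well-separated ones near 2.
rng = np.random.default_rng(0)
print(empirical_h_divergence(rng.normal(0, 1, (150, 2)),
                             rng.normal(3, 1, (150, 2))))
```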
The empirical source risk of a classifier η on the source sample is

\[ R_S(\eta) \;=\; \frac{1}{n} \sum_{i=1}^{n} I\bigl[\eta(x_i) \neq y_i\bigr]. \]
Similarly, the prediction layer G_y learns a function G_y : R^D → [0, 1]^L that is parameterized by a pair (V, c) ∈ R^{L×D} × R^L:

\[ G_y(G_f(x); V, c) \;=\; \mathrm{softmax}\bigl(V G_f(x) + c\bigr), \quad \text{with } \mathrm{softmax}(a)_i = \frac{\exp(a_i)}{\sum_{j=1}^{|a|} \exp(a_j)}. \]
The network is then trained by minimizing, over the source sample, the prediction loss plus an optional regularizer:

\[ \min_{W,V,b,c} \left[ \frac{1}{n} \sum_{i=1}^{n} L_y^i(W, b, V, c) \;+\; \lambda \cdot R(W, b) \right], \tag{5} \]

where L_y^i(W, b, V, c) = L_y(G_y(G_f(x_i; W, b); V, c), y_i) denotes the prediction loss on the i-th example, and R(W, b) is an optional regularizer that is weighted
by hyper-parameter λ.
The heart of our approach is to design a domain regularizer directly derived from the
H-divergence of Definition 1. To this end, we view the output of the hidden layer Gf (·)
(Equation 4) as the internal representation of the neural network. Thus, we denote the
source sample representations as
\[ S(G_f) = \{\, G_f(x) \mid x \in S \,\}. \]
Similarly, given an unlabeled sample from the target domain we denote the corresponding
representations
\[ T(G_f) = \{\, G_f(x) \mid x \in T \,\}. \]
Based on Equation (1), the empirical H-divergence of a symmetric hypothesis class H
between samples S(G_f) and T(G_f) is given by

\[ \hat{d}_{\mathcal{H}}\bigl(S(G_f), T(G_f)\bigr) \;=\; 2\left(1 - \min_{\eta \in \mathcal{H}}\left[ \frac{1}{n}\sum_{i=1}^{n} I\bigl[\eta(G_f(x_i)) = 0\bigr] + \frac{1}{n'}\sum_{i=n+1}^{N} I\bigl[\eta(G_f(x_i)) = 1\bigr] \right]\right). \tag{6} \]
2. For brevity of notation, we will sometimes drop the dependence of Gf on its parameters (W, b) and
shorten Gf (x; W, b) to Gf (x).
Let us consider H as the class of hyperplanes in the representation space. Inspired by the
Proxy A-distance (see Section 3.2), we suggest estimating the "min" part of Equation (6)
by a domain classification layer G_d that learns a logistic regressor G_d : R^D → [0, 1],
parameterized by a vector-scalar pair (u, z) ∈ R^D × R, that models the probability that a
given input is from the source domain D_S^X or the target domain D_T^X. Thus,

\[ G_d(G_f(x); u, z) \;=\; \mathrm{sigm}\bigl(u^\top G_f(x) + z\bigr), \tag{7} \]

with sigm(a) = 1/(1 + exp(−a)). The domain classifier is trained with the logistic loss L_d(G_d(G_f(x_i)), d_i),
where d_i denotes the binary variable (domain label) for the i-th example, which indicates
whether x_i comes from the source distribution (x_i ∼ D_S^X if d_i = 0) or from the target distribution (x_i ∼ D_T^X if d_i = 1).
Recall that for the examples from the source distribution (d_i = 0), the corresponding
labels y_i ∈ Y are known at training time. For the examples from the target domain, we
do not know the labels at training time, and we want to predict such labels at test time.
This enables us to add a domain adaptation term to the objective of Equation (5), giving
the following regularizer:
\[ R(W, b) \;=\; \max_{u,z}\left[ -\frac{1}{n}\sum_{i=1}^{n} L_d^i(W, b, u, z) \;-\; \frac{1}{n'}\sum_{i=n+1}^{N} L_d^i(W, b, u, z) \right], \tag{8} \]
where L_d^i(W, b, u, z) = L_d(G_d(G_f(x_i; W, b); u, z), d_i). This regularizer seeks to approximate
the H-divergence of Equation (6), as 2(1 − R(W, b)) is a surrogate for d̂_H(S(G_f), T(G_f)). In
line with Theorem 2, the optimization problem given by Equations (5) and (8) implements a
trade-off between the minimization of the source risk RS (·) and the divergence dˆH (·, ·). The
hyper-parameter λ is then used to tune the trade-off between these two quantities during
the learning process.
For learning, we first note that we can rewrite the complete optimization objective of
Equation (5) as follows:

\[ E(W, V, b, c, u, z) \;=\; \frac{1}{n}\sum_{i=1}^{n} L_y^i(W, b, V, c) \;-\; \lambda\left( \frac{1}{n}\sum_{i=1}^{n} L_d^i(W, b, u, z) + \frac{1}{n'}\sum_{i=n+1}^{N} L_d^i(W, b, u, z) \right), \tag{9} \]

where we are seeking the parameters Ŵ, V̂, b̂, ĉ, û, ẑ that deliver a saddle point given by

\[ (\hat{W}, \hat{V}, \hat{b}, \hat{c}) \;=\; \operatorname*{argmin}_{W,V,b,c} E(W, V, b, c, \hat{u}, \hat{z}), \qquad (\hat{u}, \hat{z}) \;=\; \operatorname*{argmax}_{u,z} E(\hat{W}, \hat{V}, \hat{b}, \hat{c}, u, z). \]

Thus, the optimization problem involves a minimization with respect to some parameters,
as well as a maximization with respect to the others.
Note: In the pseudo-code of Algorithm 1, e(y) refers to a "one-hot" vector, consisting of all 0s except for a 1 at position y,
and ⊙ is the element-wise product.
We propose to tackle this problem with a simple stochastic gradient procedure, in which
updates are made in the opposite direction of the gradient of Equation (9) for the minimizing
parameters, and in the direction of the gradient for the maximizing parameters. Stochastic
estimates of the gradient are made, using a subset of the training samples to compute the
averages. Algorithm 1 provides the complete pseudo-code of this learning procedure. In
words, during training, the neural network (parameterized by W, b, V, c) and the domain
regressor (parameterized by u, z) are competing against each other, in an adversarial way,
over the objective of Equation (9). For this reason, we refer to networks trained according
to this objective as Domain-Adversarial Neural Networks (DANN). DANN will effectively
attempt to learn a hidden layer Gf (·) that maps an example (either source or target) into
a representation allowing the output layer Gy (·) to accurately classify source samples, but
crippling the ability of the domain regressor Gd (·) to detect whether each example belongs
to the source or target domains.
Training DANN then parallels the single layer case and consists in optimizing

\[ E(\theta_f, \theta_y, \theta_d) \;=\; \frac{1}{n}\sum_{i=1}^{n} L_y^i(\theta_f, \theta_y) \;-\; \lambda\left( \frac{1}{n}\sum_{i=1}^{n} L_d^i(\theta_f, \theta_d) + \frac{1}{n'}\sum_{i=n+1}^{N} L_d^i(\theta_f, \theta_d) \right). \tag{10} \]

In analogy with the single-layer case, we seek a saddle point of (10), which can be found as a stationary point of the following stochastic gradient updates:

\[ \theta_f \;\leftarrow\; \theta_f - \mu\left( \frac{\partial L_y^i}{\partial \theta_f} - \lambda \frac{\partial L_d^i}{\partial \theta_f} \right), \tag{13} \]
\[ \theta_y \;\leftarrow\; \theta_y - \mu \frac{\partial L_y^i}{\partial \theta_y}, \tag{14} \]
\[ \theta_d \;\leftarrow\; \theta_d - \mu \lambda \frac{\partial L_d^i}{\partial \theta_d}, \tag{15} \]

where μ is the learning rate.
Figure 1: The proposed architecture includes a deep feature extractor (green) and a deep
label predictor (blue), which together form a standard feed-forward architecture.
Unsupervised domain adaptation is achieved by adding a domain classifier (red)
connected to the feature extractor via a gradient reversal layer that multiplies
the gradient by a certain negative constant during the backpropagation-based
training. Otherwise, the training proceeds standardly and minimizes the label
prediction loss (for source examples) and the domain classification loss (for all
samples). Gradient reversal ensures that the feature distributions over the two
domains are made similar (as indistinguishable as possible for the domain classi-
fier), thus resulting in the domain-invariant features.
The updates of Equations (13)-(15) are very similar to stochastic gradient descent updates for a feed-forward deep model that comprises a feature extractor fed into the label predictor and into the domain classifier (with loss weighted by λ). The only difference is
that in (13), the gradients from the class and domain predictors are subtracted, instead of
being summed (the difference is important, as otherwise SGD would try to make features
dissimilar across domains in order to minimize the domain classification loss). Since SGD—
and its many variants, such as ADAGRAD (Duchi et al., 2010) or ADADELTA (Zeiler,
2012)—is the main learning algorithm implemented in most libraries for deep learning, it
would be convenient to frame an implementation of our stochastic saddle point procedure
as SGD.
Fortunately, such a reduction can be accomplished by introducing a special gradient
reversal layer (GRL), defined as follows. The gradient reversal layer has no parameters
associated with it. During the forward propagation, the GRL acts as an identity trans-
formation. During the backpropagation however, the GRL takes the gradient from the
subsequent level and changes its sign, i.e., multiplies it by −1, before passing it to the
preceding layer. Implementing such a layer using existing object-oriented packages for deep
learning is simple, requiring only to define procedures for the forward propagation (identity
transformation), and backpropagation (multiplying by −1). The layer requires no parame-
ter update.
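As an illustration, a minimal sketch of such a layer in PyTorch-style code might look as follows (the class and function names are our own; the officially released implementation, mentioned at the end of this section, is a Caffe extension):

```python
import torch

class GradReverse(torch.autograd.Function):
    """Gradient reversal layer: identity on the forward pass,
    sign-flipped (and optionally scaled) gradient on the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)  # identity transformation

    @staticmethod
    def backward(ctx, grad_output):
        # Multiply the incoming gradient by -lambda; the layer itself
        # has no parameters and requires no update.
        return grad_output.neg() * ctx.lambd, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)
```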
The GRL as defined above is inserted between the feature extractor Gf and the domain
classifier Gd , resulting in the architecture depicted in Figure 1. As the backpropagation
process passes through the GRL, the partial derivatives of the loss that is downstream
the GRL (i.e., L_d) with respect to the layer parameters that are upstream of the GRL (i.e., θ_f) get
multiplied by −1; in other words, ∂L_d/∂θ_f is effectively replaced with −∂L_d/∂θ_f. Therefore, running SGD in
the resulting model implements the updates of Equations (13)-(15) and converges to a saddle
point of Equation (10).
Mathematically, we can formally treat the gradient reversal layer as a “pseudo-function”
R(x) defined by two (incompatible) equations describing its forward and backpropagation
behaviour:
\[ R(x) = x, \tag{16} \]
\[ \frac{dR}{dx} = -I, \tag{17} \]

where I is an identity matrix. We can then define the objective "pseudo-function" of
(θ_f, θ_y, θ_d) that is being optimized by the stochastic gradient descent within our method:

\[ \tilde{E}(\theta_f, \theta_y, \theta_d) \;=\; \frac{1}{n}\sum_{i=1}^{n} L_y\bigl(G_y(G_f(x_i; \theta_f); \theta_y),\, y_i\bigr) \;-\; \lambda\left( \frac{1}{n}\sum_{i=1}^{n} L_d\bigl(G_d(R(G_f(x_i; \theta_f)); \theta_d),\, d_i\bigr) + \frac{1}{n'}\sum_{i=n+1}^{N} L_d\bigl(G_d(R(G_f(x_i; \theta_f)); \theta_d),\, d_i\bigr) \right). \tag{18} \]
Running the updates (13)-(15) can then be implemented as SGD on (18), and leads
to the emergence of features that are domain-invariant and discriminative at the same
time. After the learning, the label predictor G_y(G_f(x; θ_f); θ_y) can be used to predict labels
for samples from the target domain (as well as from the source domain). Note that we
release the source code for the gradient reversal layer along with usage examples as
an extension to Caffe (Jia et al., 2014).
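For concreteness, a minimal sketch of one SGD step on the pseudo-objective (18), built on the grad_reverse helper above, might look as follows (the three sub-networks are toy placeholders of our own choosing; any feed-forward G_f, G_y, G_d can be substituted). Note that folding λ into the reversal layer scales only the gradient flowing into the feature extractor, while the domain classifier trains at full strength, matching the practice discussed in Section 5.2.2.

```python
import torch
import torch.nn as nn

# Toy placeholder sub-networks (our own choices, not the paper's architectures).
feature = nn.Sequential(nn.Linear(2, 15), nn.Sigmoid())    # G_f
label_predictor = nn.Linear(15, 2)                          # G_y
domain_predictor = nn.Linear(15, 1)                         # G_d

params = (list(feature.parameters()) + list(label_predictor.parameters())
          + list(domain_predictor.parameters()))
optimizer = torch.optim.SGD(params, lr=0.01)
ce_loss, bce_loss = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()

def dann_step(xs, ys, xt, lambd):
    """One SGD step on the pseudo-objective of Equation (18)."""
    fs, ft = feature(xs), feature(xt)
    loss_y = ce_loss(label_predictor(fs), ys)       # L_y on source only
    # Domain loss on both domains; features pass through the GRL, so the
    # domain classifier minimizes L_d while the feature extractor maximizes it.
    logit_s = domain_predictor(grad_reverse(fs, lambd)).squeeze(1)
    logit_t = domain_predictor(grad_reverse(ft, lambd)).squeeze(1)
    loss_d = (bce_loss(logit_s, torch.zeros(len(xs)))    # d_i = 0: source
              + bce_loss(logit_t, torch.ones(len(xt))))  # d_i = 1: target
    optimizer.zero_grad()
    (loss_y + loss_d).backward()
    optimizer.step()
    return loss_y.item(), loss_d.item()
```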
5. Experiments
In this section, we present a variety of empirical results for both shallow domain-adversarial
neural networks (Subsection 5.1) and deep ones (Subsections 5.2 and 5.3).
(a) Standard NN. For the "domain classification", we use a non-adversarial domain regressor on the hidden
neurons learned by the standard NN. (This is equivalent to running Algorithm 1 without Lines 22 and 31.)
(b) DANN (Algorithm 1).
Figure 2: The inter-twinning moons toy problem. Examples from the source sample are represented as "+" (label 1) and "−" (label 0), while examples from the unlabeled
target sample are represented as black dots. See text for the figure discussion.
As a first experiment, we study the behaviour of the proposed algorithm on a variant of the inter-twinning moons 2D problem, where the target distribution is a rotation of the source
one. As the source sample S, we generate a lower moon and an upper moon, labeled 0 and 1
respectively, each containing 150 examples. The target sample T is obtained by
the following procedure: (1) we generate a sample S′ the same way S has been generated;
(2) we rotate each example by 35°; and (3) we remove all the labels. Thus, T contains 300
unlabeled examples. We have represented those examples in Figure 2.
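A small sketch for reproducing this toy setup (our own illustration; scikit-learn's make_moons stands in for the exact generator, and the noise level is our choice):

```python
import numpy as np
from sklearn.datasets import make_moons

def make_domains(n=300, angle_deg=35.0, seed=0):
    # Source: two labeled moons (150 examples per class).
    xs, ys = make_moons(n_samples=n, noise=0.05, random_state=seed)
    # Target: a fresh sample S', rotated by 35 degrees, labels removed.
    xt, _ = make_moons(n_samples=n, noise=0.05, random_state=seed + 1)
    theta = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return xs, ys, xt @ rot.T

xs, ys, xt = make_domains()
```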
We study the adaptation capability of DANN by comparing it to the standard neural net-
work (NN). In these toy experiments, both algorithms share the same network architecture,
with a hidden layer size of 15 neurons. We train the NN using the same procedure as the
DANN. That is, we keep updating the domain regressor component using target sample T
(with a hyper-parameter λ = 6; the same value is used for DANN), but we disable the adver-
sarial back-propagation into the hidden layer. To do so, we execute Algorithm 1 by omitting
the lines numbered 22 and 31. This allows recovering the NN learning algorithm (based
on the source risk minimization of Equation (5) without any regularizer) and simultaneously training the domain regressor of Equation (7) to discriminate between source and target
domains. With this toy experiment, we will first illustrate how DANN adapts its decision
boundary when compared to NN. Moreover, we will also illustrate how the representation
given by the hidden layer is less adapted to the source domain task with DANN than with
NN (this is why we need a domain regressor in the NN experiment). We recall that this is
the founding idea behind our proposed algorithm. The analysis of the experiment appears
in Figure 2, where upper graphs relate to standard NN, and lower graphs relate to DANN.
By looking at the lower and upper graphs pairwise, we compare NN and DANN from four
different perspectives, described in detail below.
to (roughly) capture the rotation angle of the domain classification problem. Hence, we
observe that the adaptation regularizer of DANN prevents these kinds of neurons from being
produced. It is indeed striking to see that the two predominant patterns in the NN neurons
(i.e., the two parallel lines crossing the plane from lower left to upper right) vanish
in the DANN neurons.
Table 1: Classification accuracy on the Amazon reviews data set, and Pairwise Poisson
binomial test.
• For the DANN algorithm, the adaptation parameter λ is chosen among 9 values
between 10^{-2} and 1 on a logarithmic scale. The hidden layer size l is either 50 or 100.
Finally, the learning rate μ is fixed at 10^{-3}.
• For the NN algorithm, we use exactly the same hyper-parameters grid and training
procedure as DANN above, except that we do not need an adaptation parameter.
Note that one can train NN by using the DANN implementation (Algorithm 1) with
λ = 0.
• For the SVM algorithm, the hyper-parameter C is chosen among 10 values between
10^{-5} and 1 on a logarithmic scale. This range of values is the same as used by Chen
et al. (2012) in their experiments.
As presented in Section 5.1.2, we used reverse cross-validation to select the hyper-parameters
for all three learning algorithms, with early stopping as the stopping criterion for DANN
and NN.
The “Original data” part of Table 1a shows the target test accuracy of all algorithms,
and Table 1b reports the probability that one algorithm is significantly better than the oth-
ers according to the Poisson binomial test (Lacoste et al., 2012). We note that DANN has
a significantly better performance than NN and SVM, with respective probabilities 0.87
and 0.83. As the only difference between DANN and NN is the domain adaptation regu-
larizer, we conclude that our approach successfully helps to find a representation suitable
for the target domain.
Figure 3: Proxy A-distances (PAD): (a) DANN on original data; (b) DANN & NN with 100 hidden neurons; (c) DANN on mSDA representations. Note that the PAD values of mSDA representations
are symmetric when swapping source and target samples.
or mSDA and DANN combined. Recall that PAD, as described in Section 3.2, is a metric
estimating the similarity of the source and the target representations. More precisely, to
obtain a PAD value, we use the following procedure: (1) we construct the data set U of
Equation (2) using both source and target representations of the training samples; (2) we
randomly split U into two subsets of equal size; (3) we train linear SVMs on the first subset
of U using a large range of C values; (4) we compute the error of all obtained classifiers
on the second subset of U; and (5) we use the lowest error to compute the PAD value of
Equation (3), as sketched below.
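A sketch of this five-step procedure (our own rendering; the choice of LinearSVC and of the candidate C grid are ours):

```python
import numpy as np
from sklearn.svm import LinearSVC

def proxy_a_distance(source_repr, target_repr, seed=0):
    """PAD = 2(1 - 2*epsilon), where epsilon is the lowest domain-
    classification test error over a range of C values (steps 1-5)."""
    # Step 1: build U with domain labels (0 = source, 1 = target).
    x = np.vstack([source_repr, target_repr])
    d = np.concatenate([np.zeros(len(source_repr)), np.ones(len(target_repr))])
    # Step 2: random split into two halves.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    half = len(x) // 2
    train, test = idx[:half], idx[half:]
    # Steps 3-5: train linear SVMs over a C grid, keep the lowest test error.
    errors = []
    for c in np.logspace(-5, 0, 10):
        svm = LinearSVC(C=c, max_iter=10000).fit(x[train], d[train])
        errors.append(np.mean(svm.predict(x[test]) != d[test]))
    return 2.0 * (1.0 - 2.0 * min(errors))
```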
Firstly, Figure 3a compares the PAD of DANN representations obtained in the experi-
ments of Section 5.1.3 (using the hyper-parameters values leading to the results of Table 1)
to the PAD computed on raw data. As expected, the PAD values are driven down by the
DANN representations.
Secondly, Figure 3b compares the PAD of DANN representations to the PAD of standard
NN representations. As the PAD is influenced by the hidden layer size (the discriminating
power tends to increase with the representation length), we fix here the size to 100 neurons
for both algorithms. We also fix the adaptation parameter of DANN to λ ≈ 0.31; it was
the value that was selected most of the time during our preceding experiments on the
Amazon reviews data set. Again, DANN clearly leads to the lowest PAD values.
Lastly, Figure 3c presents two sets of results related to Section 5.1.4 experiments. On
one hand, we reproduce the results of Chen et al. (2012), who noticed that the mSDA
representations have greater PAD values than the original (raw) data. Although the mSDA
approach clearly helps to adapt to the target task, this seems to contradict the theory of Ben-David et al. On the other hand, we observe that, when running DANN on top of mSDA
(using the hyper-parameters values leading to the results of Table 1), the obtained represen-
tations have much lower PAD values. These observations might explain the improvements
provided by DANN when combined with the mSDA procedure.
5.2.1 Baselines
The following baselines are evaluated in the experiments of this subsection. The source-only
model is trained without consideration for target-domain data (no domain classifier branch
included into the network). The train-on-target model is trained on the target domain with
class labels revealed. This model serves as an upper bound on DA methods, assuming that
target data are abundant and the shift between the domains is considerable.
In addition, we compare our approach against the recently proposed unsupervised DA
method based on subspace alignment (SA) (Fernando et al., 2013), which is simple to set up
and test on new data sets, but has also been shown to perform very well in experimental
comparisons with other "shallow" DA methods. To boost the performance of this baseline,
we pick its most important free parameter (the number of principal components) from the
range {2, . . . , 60}, so that the test performance on the target domain is maximized. To apply
SA in our setting, we train a source-only model and then consider the activations of the last
hidden layer in the label predictor (before the final linear classifier) as descriptors/features,
and learn the mapping between the source and the target domains (Fernando et al., 2013).
Since the SA baseline requires training a new classifier after adapting the features, and
in order to put all the compared settings on an equal footing, we retrain the last layer of
the label predictor using a standard linear SVM (Fan et al., 2008) for all four considered
methods (including ours; the performance on the target domain remains approximately the
same after the retraining).
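For reference, a compact sketch of the core of the SA baseline as we understand it from Fernando et al. (2013): both domains are projected on their top principal components, and the source basis X_s is aligned to the target basis X_t via M = X_s^T X_t (the default number of components below is our own placeholder; in the experiments it is selected from {2, . . . , 60}):

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_alignment(fs, ft, n_components=30):
    """Align the source PCA subspace with the target one; a new classifier
    is then retrained on the aligned source features."""
    xs = PCA(n_components).fit(fs).components_.T   # D x d source basis
    xt = PCA(n_components).fit(ft).components_.T   # D x d target basis
    m = xs.T @ xt                                  # d x d alignment matrix
    fs_aligned = fs @ xs @ m                       # source in aligned space
    ft_proj = ft @ xt                              # target in its own subspace
    return fs_aligned, ft_proj
```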
For the Office data set (Saenko et al., 2010), we directly compare the performance of
our full network (feature extractor and label predictor) against recent DA approaches using
previously published results.
[Architecture diagrams; each domain branch ends in a GRL followed by fully-connected layers (100 units with ReLU, then 1 unit with logistic output).]
(a) MNIST architecture; inspired by the classical LeNet-5 (LeCun et al., 1998).
(c) GTSRB architecture; we used the single-CNN baseline from Cireşan et al. (2012) as our starting point.
For the loss functions, we set Ly and Ld to be the logistic regression loss and the
binomial cross-entropy respectively. Following Srivastava et al. (2014) we also use dropout
and ℓ2-norm restriction when we train the SVHN architecture.
The other hyper-parameters are not selected through a grid search as in the small scale
experiments of Section 5.1, which would be computationally costly. Instead, the learning
rate is adjusted during the stochastic gradient descent using the following formula:
\[ \mu_p = \frac{\mu_0}{(1 + \alpha \cdot p)^{\beta}}, \]
where p is the training progress linearly changing from 0 to 1, µ0 = 0.01, α = 10 and
β = 0.75 (the schedule was optimized to promote convergence and low error on the source
domain). A momentum term of 0.9 is also used.
The domain adaptation parameter λ is initiated at 0 and is gradually changed to 1 using
the following schedule:
\[ \lambda_p = \frac{2}{1 + \exp(-\gamma \cdot p)} - 1, \]
where γ was set to 10 in all experiments (the schedule was not optimized/tweaked). This
strategy allows the domain classifier to be less sensitive to noisy signal at the early stages of
the training procedure. Note however that these λ_p were used only for updating the feature
extractor component; for updating the domain classification component, we used a fixed λ = 1, to ensure that the latter trains as fast as the feature extractor does.6
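In code form, the two schedules read (a direct transcription of the formulas above):

```python
import numpy as np

def lr_schedule(p, mu0=0.01, alpha=10.0, beta=0.75):
    """Learning rate mu_p; p is the training progress in [0, 1]."""
    return mu0 / (1.0 + alpha * p) ** beta

def lambda_schedule(p, gamma=10.0):
    """Adaptation factor lambda_p, ramping smoothly from 0 to 1."""
    return 2.0 / (1.0 + np.exp(-gamma * p)) - 1.0
```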
Figure 5: The effect of adaptation on the distribution of the extracted features (best viewed
in color). The figure shows t-SNE (van der Maaten, 2013) visualizations of the
CNN's activations (a) when no adaptation was performed and (b) when
our adaptation procedure was incorporated into training. Blue points
correspond to the source domain examples, while red ones correspond to the target
domain. In all cases, the adaptation in our method makes the two distributions
of features much closer.
5.2.3 Visualizations
We use t-SNE (van der Maaten, 2013) projection to visualize feature distributions at dif-
ferent points of the network, while color-coding the domains (Figure 5). As we already
observed with the shallow version of DANN (see Figure 2), there is a strong correspondence
6. Equivalently, one can use the same λp for both feature extractor and domain classification components,
but use a learning rate of µ/λp for the latter.
between the success of the adaptation in terms of the classification accuracy for the target
domain, and the overlap between the domain distributions in such visualizations.
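Such plots can be produced along the following lines (our own sketch, using scikit-learn's t-SNE rather than the Barnes-Hut implementation of van der Maaten (2013)):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_domains(feat_source, feat_target):
    # Embed concatenated source/target features into 2D and color by domain.
    emb = TSNE(n_components=2).fit_transform(
        np.vstack([feat_source, feat_target]))
    n = len(feat_source)
    plt.scatter(emb[:n, 0], emb[:n, 1], c="blue", s=4, label="source")
    plt.scatter(emb[n:, 0], emb[n:, 1], c="red", s=4, label="target")
    plt.legend()
    plt.show()
```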
Figure 6: Examples of domain pairs used in the experiments (target domains: MNIST-M, SVHN, MNIST, GTSRB). See Section 5.2.4 for details.
Table 2: Classification accuracies for digit image classifications for different source and
target domains. MNIST-M corresponds to difference-blended digits over non-
uniform background. The first row corresponds to the lower performance bound
(i.e., if no adaptation is performed). The last row corresponds to training on
the target domain data with known class labels (upper bound on the DA perfor-
mance). For each of the two DA methods (ours and Fernando et al., 2013) we
show how much of the gap between the lower and the upper bounds was covered
(in brackets). For all five cases, our approach outperforms Fernando et al. (2013)
considerably, and covers a large portion of the gap.
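In formula form, the bracketed coverage numbers presumably correspond to (our notation):

```latex
\mathrm{coverage} \;=\;
\frac{\mathrm{acc}_{\mathrm{method}} - \mathrm{acc}_{\mathrm{source\text{-}only}}}
     {\mathrm{acc}_{\mathrm{train\text{-}on\text{-}target}} - \mathrm{acc}_{\mathrm{source\text{-}only}}}.
```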
[Figure 7 plot: validation error (0.1 to 0.2) versus batches seen (×10^5), for the curves Real, Syn, Syn Adapted, Syn + Real, and Syn + Real Adapted.]
Figure 7: Results for the traffic signs classification in the semi-supervised setting. Syn
and Real denote available labeled data (100,000 synthetic and 430 real images
respectively); Adapted means that ≈ 31,000 unlabeled target domain images were
used for adaptation. The best performance is achieved by employing both the
labeled samples and the large unlabeled corpus in the target domain.
feature distributions. We observe quite a strong separation between the domains when we
feed them into the CNN trained solely on MNIST, whereas for the SVHN-trained network
the features are much more intermixed. This difference probably explains why our method
succeeded in improving the performance by adaptation in the SVHN → MNIST scenario
(see Table 2) but not in the opposite direction (SA is not able to perform adaptation in
this case either). Unsupervised adaptation from MNIST to SVHN gives a failure example
for our approach: it does not manage to improve upon the performance of the non-adapted
model, which achieves ≈ 0.25 accuracy (we are unaware of any unsupervised DA methods
capable of performing such adaptation).
Synthetic Signs → GTSRB. Overall, this setting is similar to the Syn Numbers → SVHN
experiment, except the distribution of the features is more complex due to the significantly
larger number of classes (43 instead of 10). For the source domain we obtained 100,000
synthetic images (which we call Syn Signs) simulating various imaging conditions. In the
target domain, we use 31,367 random training samples for unsupervised adaptation and the
rest for evaluation. Once again, our method achieves an appreciable increase in performance,
confirming its suitability for synthetic-to-real data adaptation.
As an additional experiment, we also evaluate the proposed algorithm for semi-supervised
domain adaptation, i.e., when one is additionally provided with a small amount of labeled
target data. Here, we reveal 430 labeled examples (10 samples per class) and add them
to the training set for the label predictor. Figure 7 shows the change of the validation
error throughout the training. While the graph clearly suggests that our method can be
beneficial in the semi-supervised setting, a thorough verification of the semi-supervised setting is
left for future work.
Office data set. We finally evaluate our method on the Office data set, which is a collection of
three distinct domains: Amazon, DSLR, and Webcam. Unlike previously discussed data
sets, Office is rather small-scale with only 2817 labeled images spread across 31 different
categories in the largest domain. The amount of available data is crucial for a successful
training of a deep model, hence we opted for fine-tuning a CNN pre-trained on
ImageNet (AlexNet from the Caffe package, see Jia et al., 2014), as is done in some
recent DA works (Donahue et al., 2014; Tzeng et al., 2014; Hoffman et al., 2013; Long and
Wang, 2015). We make our approach more comparable with Tzeng et al. (2014) by using
exactly the same network architecture, replacing the domain mean-based regularization with our
domain classifier.
Following previous works, we assess the performance of our method across three transfer
tasks most commonly used for evaluation. Our training protocol is adopted from Gong
et al. (2013); Chopra et al. (2013); Long and Wang (2015) as during adaptation we use
all available labeled source examples and unlabeled target examples (the premise of our
method is the abundance of unlabeled data in the target domain). Also, all source domain
data are used for training. Under this “fully-transductive” setting, our method is able
to improve previously-reported state-of-the-art accuracy for unsupervised adaptation very
considerably (Table 3), especially in the most challenging Amazon → Webcam scenario
(the two domains with the largest domain shift).
Interestingly, in all three experiments we observe a slight over-fitting (performance on
the target domain degrades while accuracy on the source continues to improve) as training
progresses; however, this does not ruin the validation accuracy. Moreover, switching off the
domain classifier branch makes this effect far more apparent, from which we conclude that
our technique serves as a regularizer.
Figure 8: Matching and non-matching pairs of probe-gallery images from different person
re-identification data sets. The three data sets are treated as different domains
in our experiments.
experiments with all images of the whole CUHK data set as source domain and VIPeR and
PRID data sets as target domains as in the original paper (Yi et al., 2014).
Following Yi et al. (2014), we augmented our data with mirror images, and at
test time we calculate the similarity score between two images as the mean of the four scores
corresponding to the different flips of the two compared images. In the case of CUHK, where there
are 4 images (including mirror images) for each of the two camera views for each person,
the scores of all 16 combinations are averaged.
[Figure 9 plots: CMC curves showing identification rate (%) against rank (up to rank 40) for the DML baseline and its domain-adapted version. Panels: (a) Whole CUHK → VIPeR; (b) CUHK/p1 → VIPeR; (c) PRID → VIPeR; (d) Whole CUHK → PRID; (e) CUHK/p1 → PRID; (f) VIPeR → PRID; (g) VIPeR → CUHK/p1; (h) PRID → CUHK/p1.]
Figure 9: Results on VIPeR, PRID and CUHK/p1 with and without domain-adversarial
learning. Across the eight domain pairs domain-adversarial learning improves re-
identification accuracy. For some domain pairs the improvement is considerable.
Figure 10: The effect of adaptation shown by t-SNE visualizations of source and target
domain descriptors in the VIPeR → CUHK/p1 experiment. VIPeR is depicted in green and CUHK/p1 in red. As in the image classification case,
domain-adversarial learning ensures a closer match between the source and the
target distributions.
6. Conclusion
The paper proposes a new approach to domain adaptation of feed-forward neural networks,
which allows large-scale training based on large amounts of annotated data in the source
domain and large amounts of unannotated data in the target domain. Similarly to many
previous shallow and deep DA techniques, the adaptation is achieved through aligning the
distributions of features across the two domains. However, unlike previous approaches, the
alignment is accomplished through standard backpropagation training.
The approach is motivated and supported by the domain adaptation theory of Ben-David
et al. (2006, 2010). The main idea behind DANN is to encourage the network's hidden layer to
learn a representation which is predictive of the source example labels, but uninformative
about the domain of the input (source or target). We implement this new approach within
both shallow and deep feed-forward architectures. The latter allows simple implementation
within virtually any deep learning package through the introduction of a simple gradient
reversal layer. We have shown that our approach is flexible and achieves state-of-the-art
30
Domain-Adversarial Neural Networks
results on a variety of domain adaptation benchmarks, namely for sentiment analysis and
image classification tasks.
A convenient aspect of our approach is that the domain adaptation component can be
added to almost any neural network architecture that is trainable with backpropagation.
Towards this end, we have demonstrated experimentally that the approach is not confined
to classification tasks but can be used in other feed-forward architectures, e.g., for descriptor
learning for person re-identification.
Acknowledgments
This work has been supported by Natural Sciences and Engineering Research Council
(NSERC) Discovery grants 262067 and 0122405 as well as the Russian Ministry of Science
and Education grant RFMEFI57914X0071. Computations were performed on the Colosse
supercomputer grid at Université Laval, under the auspices of Calcul Québec and Compute
Canada. The operations of Colosse are funded by the NSERC, the Canada Foundation
for Innovation (CFI), NanoQuébec, and the Fonds de recherche du Québec – Nature et
technologies (FRQNT). We also thank the Graphics & Media Lab, Faculty of Computa-
tional Mathematics and Cybernetics, Lomonosov Moscow State University for providing
the synthetic road signs data set.
References
Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, and Mario Marchand.
Domain-adversarial neural networks. NIPS 2014 Workshop on Transfer and Multi-task
learning: Theory Meets Practice, 2014. URL http://arxiv.org/abs/1412.4446.
Pablo Arbelaez, Michael Maire, Charless Fowlkes, and Jitendra Malik. Contour detection
and hierarchical image segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 33, 2011.
Artem Babenko, Anton Slesarev, Alexander Chigorin, and Victor S. Lempitsky. Neural
codes for image retrieval. In ECCV, pages 584–599, 2014.
Mahsa Baktashmotlagh, Mehrtash Tafazzoli Harandi, Brian C. Lovell, and Mathieu Salz-
mann. Unsupervised domain adaptation by domain invariant projection. In ICCV, pages
769–776, 2013.
Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of repre-
sentations for domain adaptation. In NIPS, pages 137–144, 2006.
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jen-
nifer Wortman Vaughan. A theory of learning from different domains. Machine Learning,
79(1-2):151–175, 2010.
John Blitzer, Ryan T. McDonald, and Fernando Pereira. Domain adaptation with struc-
tural correspondence learning. In Conference on Empirical Methods in Natural Language
Processing, pages 120–128, 2006.
Lorenzo Bruzzone and Mattia Marconcini. Domain adaptation problems: A DASVM classi-
fication technique and a circular validation strategy. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 32(5):770–787, 2010.
Minmin Chen, Zhixiang Eddie Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized de-
noising autoencoders for domain adaptation. In ICML, pages 767–774, 2012.
Qiang Chen, Junshi Huang, Rogerio Feris, Lisa M. Brown, Jian Dong, and Shuicheng Yan.
Deep domain adaptation for describing people based on fine-grained clothing attributes.
In CVPR, June 2015.
S. Chopra, S. Balakrishnan, and R. Gopalan. Dlid: Deep learning for domain adaptation
by interpolating between domains. In ICML Workshop on Challenges in Representation
Learning, 2013.
Dan Cireşan, Ueli Meier, Jonathan Masci, and Jürgen Schmidhuber. Multi-column deep
neural network for traffic sign classification. Neural Networks, 32:333–338, 2012.
Corinna Cortes and Mehryar Mohri. Domain adaptation and sample bias correction theory
and algorithm for regression. Theor. Comput. Sci., 519:103–126, 2014.
Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and
Trevor Darrell. Decaf: A deep convolutional activation feature for generic visual recog-
nition. In ICML, 2014.
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online
learning and stochastic optimization. Technical report, EECS Department, University of
California, Berkeley, Mar 2010.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. LIB-
LINEAR: A library for large linear classification. Journal of Machine Learning Research,
9:1871–1874, 2008.
Basura Fernando, Amaury Habrard, Marc Sebban, and Tinne Tuytelaars. Unsupervised
visual domain adaptation using subspace alignment. In ICCV, 2013.
Pascal Germain, Amaury Habrard, François Laviolette, and Emilie Morvant. A PAC-
Bayesian approach for domain adaptation with specialization to linear classifiers. In
ICML, pages 738–746, 2013.
Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation for large-scale
sentiment classification: A deep learning approach. In ICML, pages 513–520, 2011.
Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsuper-
vised domain adaptation. In CVPR, pages 2066–2073, 2012.
Boqing Gong, Kristen Grauman, and Fei Sha. Connecting the dots with landmarks: Dis-
criminatively learning domain-invariant features for unsupervised domain adaptation. In
ICML, pages 222–230, 2013.
Shaogang Gong, Marco Cristani, Shuicheng Yan, and Chen Change Loy. Person re-
identification. Springer, 2014.
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Domain adaptation for object
recognition: An unsupervised approach. In ICCV, pages 999–1006, 2011.
Doug Gray, Shane Brennan, and Hai Tao. Evaluating appearance models for recognition,
reacquisition, and tracking. In IEEE International Workshop on Performance Evaluation
for Tracking and Surveillance, Rio de Janeiro, 2007.
Martin Hirzer, Csaba Beleznai, Peter M. Roth, and Horst Bischof. Person re-identification
by descriptive and discriminative classification. In SCIA, 2011.
Judy Hoffman, Eric Tzeng, Jeff Donahue, Yangqing Jia, Kate Saenko, and Trevor Darrell.
One-shot adaptation of supervised deep convolutional models. CoRR, abs/1312.6204,
2013. URL http://arxiv.org/abs/1312.6204.
Fei Huang and Alexander Yates. Biased representation learning for domain adaptation. In
Joint Conference on Empirical Methods in Natural Language Processing and Computa-
tional Natural Language Learning, pages 1313–1323, 2012.
Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard
Schölkopf. Correcting sample selection bias by unlabeled data. In NIPS, pages 601–608,
2006.
Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Gir-
shick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast
feature embedding. CoRR, abs/1408.5093, 2014.
Daniel Kifer, Shai Ben-David, and Johannes Gehrke. Detecting change in data streams. In
Very Large Data Bases, pages 180–191, 2004.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep
convolutional neural networks. In NIPS, pages 1097–1105, 2012.
Alexandre Lacoste, François Laviolette, and Mario Marchand. Bayesian comparison of
machine learning algorithms on single and multiple datasets. In AISTATS, pages 665–
675, 2012.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to docu-
ment recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
Wei Li and Xiaogang Wang. Locally aligned feature transforms across views. In CVPR,
pages 3594–3601, 2013.
Yujia Li, Kevin Swersky, and Richard Zemel. Unsupervised domain adaptation by domain
invariant projection. In NIPS 2014 Workshop on Transfer and Multitask Learning, 2014.
Joerg Liebelt and Cordelia Schmid. Multi-view object class detection with a 3d geometric
model. In CVPR, 2010.
Chunxiao Liu, Chen Change Loy, Shaogang Gong, and Guijin Wang. POP: person re-
identification post-rank optimisation. In ICCV, pages 441–448, 2013.
Mingsheng Long and Jianmin Wang. Learning transferable features with deep adaptation
networks. CoRR, abs/1502.02791, 2015.
Andy Jinhua Ma, Jiawei Li, Pong C. Yuen, and Ping Li. Cross-domain person reidentifi-
cation using domain adaptation ranking svms. IEEE Transactions on Image Processing,
24(5):1599–1613, 2015.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Domain adaptation: Learning
bounds and algorithms. In COLT, 2009a.
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. Multiple source adaptation
and the rényi divergence. In UAI, pages 367–374, 2009b.
Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng.
Reading digits in natural images with unsupervised feature learning. In NIPS Workshop
on Deep Learning and Unsupervised Feature Learning, 2011.
M. Oquab, L. Bottou, I. Laptev, and J. Sivic. Learning and transferring mid-level image
representations using convolutional neural networks. In CVPR, 2014.
Sakrapee Paisitkriangkrai, Chunhua Shen, and Anton van den Hengel. Learning to rank
in person re-identification with metric ensembles. CoRR, abs/1503.01543, 2015. URL
http://arxiv.org/abs/1503.01543.
Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. Domain adaptation
via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210,
2011.
Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. Adapting visual category models
to new domains. In ECCV, pages 213–226, 2010.
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhut-
dinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal
of Machine Learning Research, 15(1):1929–1958, 2014.
Michael Stark, Michael Goesele, and Bernt Schiele. Back to the future: Learning shape
models from 3d CAD data. In BMVC, pages 1–11, 2010.
Baochen Sun and Kate Saenko. From virtual to reality: Fast adaptation of virtual object
detectors to real domains. In BMVC, 2014.
Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. Deep domain
confusion: Maximizing for domain invariance. CoRR, abs/1412.3474, 2014. URL http:
//arxiv.org/abs/1412.3474.
Laurens van der Maaten. Barnes-Hut-SNE. CoRR, abs/1301.3342, 2013. URL http:
//arxiv.org/abs/1301.3342.
David Vázquez, Antonio Manuel López, Javier Marín, Daniel Ponsa, and David Gerónimo
Gomez. Virtual and real world adaptation for pedestrian detection. IEEE Transactions
on Pattern Analysis and Machine Intelligence, 36(4):797–809, 2014.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting
and composing robust features with denoising autoencoders. In ICML, pages 1096–1103,
2008.
Dong Yi, Zhen Lei, and Stan Z. Li. Deep metric learning for practical person re-
identification. CoRR, abs/1407.4979, 2014. URL http://arxiv.org/abs/1407.4979.
Matthew D. Zeiler and Rob Fergus. Visualizing and understanding convolutional networks.
CoRR, abs/1311.2901, 2013. URL http://arxiv.org/abs/1311.2901.
Ziming Zhang and Venkatesh Saligrama. Person re-identification via structured prediction.
CoRR, abs/1406.4444, 2014. URL http://arxiv.org/abs/1406.4444.
Rui Zhao, Wanli Ouyang, and Xiaogang Wang. Person re-identification by saliency learning.
CoRR, abs/1412.1908, 2014. URL http://arxiv.org/abs/1412.1908.
Erheng Zhong, Wei Fan, Qiang Yang, Olivier Verscheure, and Jiangtao Ren. Cross valida-
tion framework to choose amongst models and datasets for transfer learning. In Machine
Learning and Knowledge Discovery in Databases, pages 547–562. Springer, 2010.