Deep Visual Domain Adaptation: A Survey

Mei Wang, Weihong Deng∗
Beijing University of Posts and Telecommunications, Beijing, 100876, China

Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.083
Abstract

Deep domain adaptation has emerged as a new learning technique to address the lack of massive amounts of labeled data. Compared to conventional methods, which learn shared feature subspaces or reuse important source instances with shallow representations, deep domain adaptation methods leverage deep networks to learn more transferable representations by embedding domain adaptation in the pipeline of deep learning. There have been comprehensive surveys for shallow domain adaptation, but few timely reviews of the emerging deep learning based methods. In this paper, we provide a comprehensive survey of deep domain adaptation methods for computer vision applications.

∗Corresponding author. Tel: +86 10 62283059; Fax: +86 10 62285019
Email address: [email protected] (Weihong Deng)
1. Introduction
Over the past few years, machine learning has achieved great success and has benefited real-world applications. However, collecting and annotating datasets for every new task and domain are extremely expensive and time-consuming processes, so sufficient training data are not always available. Fortunately, the big data era makes a large amount of data available for other domains and tasks. For instance, although publicly available large-scale labeled video databases contain only a small number of samples (statistically, the YouTube Faces dataset (YTF) consists of just 3.4K videos), the number of labeled still images is more than sufficient [1]. Hence, skillfully using auxiliary data for a current task with scarce data can be helpful for real-world applications.
However, due to many factors (e.g., illumination, pose, and image quality), there is always a distribution change or domain shift between two domains that can degrade the performance, as shown in Fig. 1. Mimicking the human vision system, domain adaptation (DA) is a particular case of transfer learning (TL) that utilizes labeled data in one or more relevant source domains to execute new tasks in a target domain. Over the past decades, various shallow DA methods have been proposed to address the domain shift between the source and target domains. The common algorithms for shallow DA can mainly be categorized into two classes: instance-based DA and feature-based DA.
Figure 1: (a) Some object images from the "Bike" and "Laptop" categories in the Amazon, DSLR, Webcam, and Caltech-256 databases. (b) Some digit images from the MNIST, USPS, and SVHN databases. (c) Some face images from the LFW, BCS and CUFS databases. Real-world computer vision applications, such as face recognition, must learn to adapt to distributions specific to each domain.
For the first class, the source samples are reweighted to reduce the discrepancy, and the model is trained on the reweighted samples. For the second class, a common shared space is generally learned in which the distributions of the two datasets are matched.
Recently, neural-network-based deep learning approaches have achieved many inspiring results in visual categorization applications, such as image classification [8], face recognition [9], and object detection [10]. Simulating the perception of the human brain, deep networks can represent high-level abstractions by multiple layers of non-linear transformations. Existing deep network architectures [11] include convolutional neural networks (CNNs) [8, 12, 13, 14], deep belief networks (DBNs) [15], and stacked autoencoders (SAEs) [16], among others. Although some studies have shown that deep networks can learn more transferable representations that disentangle the exploratory factors of variation underlying the data samples, the transferability of features sharply decreases in higher layers. Therefore, recent work has addressed this problem by deep DA, which combines deep learning and DA.
There have been other surveys on TL and DA over the past few years [18, 19, 20, 21, 22, 23]. Pan et al. [18] categorized TL under three subsettings, including inductive TL, transductive TL, and unsupervised TL, but they only studied homogeneous feature spaces. Shao et al. [19] categorized TL into feature-representation-level and classifier-level knowledge transfer. Day et al. [20] discussed methods for heterogeneous
TL that operate under various settings, requirements, and domains. Zhang et al. [22] were the first to summarize the transferring criteria in detail at the concept level. These surveys only cover the methodologies of shallow TL or DA. The work presented by Csurka et al. [23] briefly analyzed the state-of-the-art shallow DA methods and categorized the deep DA methods into three subsettings based on training loss: classification loss, discrepancy loss and adversarial loss. However, Csurka's work mainly focused on shallow methods, and it only discussed deep DA in image classification applications.
In this paper, we focus on analyzing and discussing deep DA methods. Specifically, our contributions are as follows: 1) we present a taxonomy of different deep DA scenarios according to the properties of the data that define how the two domains diverge; 2) extending Csurka's work, we improve and detail the three subsettings (training with classification loss, discrepancy loss and adversarial loss) and summarize the different approaches used in different DA scenes; 3) considering the distance of the source and target domains, multi-step DA methods are studied and categorized into hand-crafted, feature-based and representation-based mechanisms; and 4) we provide a survey of successful computer vision applications of deep DA. The remainder of this survey is structured as follows. In Section 2, we
first define some notations, and then we categorize deep DA into different
settings (given in Fig. 2). In the next three sections, different approaches
are discussed for each setting, which are given in Table 1 and Table 2 in
detail. Then, in Section 6, we introduce some successful computer vision applications of deep DA. Finally, the conclusion of this paper and a discussion of future works are presented in Section 7.
2. Overview

2.1. Notations and Definitions

The training dataset with labeled data is the source domain D^s = {X^s, P(X)^s}, and the test dataset with a small amount of labeled data or no labeled data is the target domain D^t = {X^t, P(X)^t}. The partially labeled part, D^{tl}, and the unlabeled part, D^{tu}, form the entire target domain, that is, D^t = D^{tl} ∪ D^{tu}. Each domain comes with its task: the former is T^s = {Y^s, P(Y^s|X^s)}, and the latter is T^t = {Y^t, P(Y^t|X^t)}. Similarly, P(Y^s|X^s) can be learned
from the source labeled data {x_i^s, y_i^s}, while P(Y^t|X^t) can be learned from the labeled target data {x_i^{tl}, y_i^{tl}} and the unlabeled data {x_i^{tu}}.
The case of traditional machine learning is D^s = D^t and T^s = T^t. For TL, Pan et al. [18] summarized that the differences between datasets can be caused by domain divergence D^s ≠ D^t (i.e., distribution shift or feature space difference) or task divergence T^s ≠ T^t (i.e., conditional distribution shift or label space difference), or both. Based on this summary, Pan et al. categorized TL into three main groups: inductive, transductive and unsupervised TL.
According to this classification, DA methods are transductive TL solutions with the assumption that the tasks are the same, i.e., T^s = T^t, and the differences are only caused by domain divergence. DA can further be divided into two cases according to the feature spaces. In homogeneous DA, the feature spaces of the source and target domains are identical (X^s = X^t) with the same dimension (d^s = d^t); hence, the source and target datasets generally differ in their data distributions (P(X)^s ≠ P(X)^t). In heterogeneous DA, the feature spaces of the two domains are not identical, and the dimensions may also differ.
Figure 2: An overview of different settings of domain adaptation
1. In the supervised DA, a small amount of labeled target data, D^{tl}, are present. However, the labeled data are commonly not sufficient for the tasks.

2. In the semi-supervised DA, both limited labeled data, D^{tl}, and redundant unlabeled data, D^{tu}, in the target domain are available in the training stage, which allows the networks to learn the structure information of the target domain.

3. In the unsupervised DA, no labeled but sufficient unlabeled target domain data, D^{tu}, are observable when training the network.

All of the above DA settings assume that the source and target domains are directly related; thus, transferring knowledge can be accomplished in one step. We call them one-step DA. In reality, however, this assumption occasionally does not hold: there may be little overlap between the two domains, and performing one-step DA will not be effective. Fortunately, some intermediate domains can be introduced as
bridges to connect two seemingly unrelated domains, and then one-step DA is performed via this bridge; this is named multi-step (or transitive) DA [24, 25]. For
example, face images and vehicle images are dissimilar from each other due to different shapes and other aspects, and thus one-step DA would fail. However, some intermediate images, such as 'football helmet' images, can be introduced as an intermediate domain to achieve smooth knowledge transfer. Fig. 3 shows the differences between the learning processes of one-step and multi-step DA techniques.
Figure 3: Different learning processes between (a) traditional machine learning, (b) one-
step domain adaptation and (c) multi-step domain adaptation [18].
Table 1: Different deep approaches to one-step DA.

One-step DA approaches | Brief description | Subsettings
Discrepancy-based | fine-tuning the deep network with labeled or unlabeled target data to diminish the domain shift | class criterion [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]; statistic criterion [37, 34, 38, 32, 39, 40, 41, 42, 43]; architecture criterion [44, 45, 46, 47, 48, 49]; geometric criterion [50]
Adversarial-based | using domain discriminators to encourage domain confusion through an adversarial objective | generative models [51, 52, 53]; non-generative models [54, 55, 56, 58]
Reconstruction-based | using data reconstruction as an auxiliary task to ensure feature invariance | encoder-decoder reconstruction [59, 60, 61, 43]; adversarial reconstruction [62, 63, 64]
Table 2: Different approaches to multi-step DA.

Multi-step DA approaches | Brief description
Hand-crafted | users determine the intermediate domains based on experience [65]
Instance-based | selecting certain parts of data from the auxiliary datasets to compose the intermediate domains [25, 50]
Representation-based | freezing the weights of one network and using its intermediate representations as input to the new network [66]
3. Categorization of Deep Domain Adaptation

In a broad sense, deep DA is any method that utilizes a deep network to enhance the performance of DA. In some such methods, shallow DA is adopted, whereas the deep networks only extract vectorial features and are not helpful for transferring knowledge directly. For example, [71] extracted the convolutional activations from a CNN as the tensor representation and then performed tensor-aligned invariant subspace learning to realize DA. In a narrow sense, deep DA embeds DA into the process of learning representations, so that the learned features are semantically meaningful and domain invariant. With "good" feature representations, the performance of the target task would improve significantly. In this paper, we focus on the narrow definition and discuss how to utilize deep networks to learn "good" feature representations with extra training criteria.
In one-step DA, the deep approaches can be summarized into three cases, following [23]. Table 1 lists these three cases with brief descriptions.
The first case is the discrepancy-based deep DA approach, which assumes that fine-tuning the deep network model with labeled or unlabeled target data can diminish the shift between the two domains. Class criterion, statistic criterion, architecture criterion and geometric criterion are the four major techniques for performing fine-tuning.
• Class Criterion: uses the class label information as a guide for transferring knowledge between different domains. When labeled samples from the target domain are available in supervised DA, soft labels and metric learning are always effective [26, 27, 30, 31, 28]. When such samples are unavailable, some other techniques can be adopted to substitute for class labeled data, such as pseudo labels [32, 33, 34, 29] and attribute representations [35, 26].

• Statistic Criterion: aligns the statistical distribution shift between the source and target domains using some mechanisms. The most commonly used methods for comparing and reducing the distribution shift are maximum mean discrepancy (MMD) [37, 34, 38, 32, 39, 40], correlation alignment (CORAL) [41, 42] and the Kullback-Leibler (KL) divergence [43].

• Architecture Criterion: aims at improving the ability to learn more transferable features by adjusting the architectures of deep networks [44, 45, 46, 47, 48, 49].

• Geometric Criterion: bridges the source and target domains according to their geometric properties, since exploiting the relationship of geometric structures can reduce the domain shift [50].
The second case can be referred to as the adversarial-based deep DA approach [54]. In this case, a domain discriminator that classifies whether a data point is drawn from the source or the target domain is used to encourage domain confusion through an adversarial objective that minimizes the distance between the empirical source and target mapping distributions. Furthermore, the adversarial-based deep DA approach can be categorized into two cases based on whether there are generative models.

• Generative Models: combine the discriminative model with a generative component, in general based on GANs [51, 52].

• Non-Generative Models: rather than generating samples, the feature extractor learns a discriminative representation using the labels in the source domain and maps the target data to the same space through a domain confusion loss [54, 55, 56, 58].
The third case is the reconstruction-based deep DA approach, which assumes that the data reconstruction of source or target samples can be helpful for improving the performance of DA. The reconstructor can ensure both the specificity of intra-domain representations and the indistinguishability of inter-domain representations.
• Encoder-Decoder Reconstruction: by using stacked autoencoders (SAEs), encoder-decoder reconstruction methods combine an encoder network for representation learning with a decoder network for data reconstruction [59, 60, 61, 43].

• Adversarial Reconstruction: the reconstruction error is measured as the difference between the reconstructed and original images within each image domain by a cyclic mapping obtained via a GAN discriminator, as in dual GAN [62], cycle GAN [63] and disco GAN [64].
Table 3: Different approaches used in different DA settings.

Approach | Supervised DA | Unsupervised DA
Discrepancy-based: Class Criterion | √ |
Discrepancy-based: Statistic Criterion | | √
Discrepancy-based: Architecture Criterion | √ | √
Discrepancy-based: Geometric Criterion | √ |
Adversarial-based: Generative Model | | √
Adversarial-based: Non-Generative Model | | √
Reconstruction-based: Encoder-Decoder Model | | √
Reconstruction-based: Adversarial Model | | √
In multi-step DA, the key is how to select and utilize intermediate domains; the selection mechanisms fall into three types: hand-crafted, feature-based and representation-based.

• Hand-Crafted: users determine the intermediate domains based on experience [65].
• Instance-Based: selecting certain parts of the data from the auxiliary datasets to compose the intermediate domains to train the deep network [25, 50].

• Representation-Based: transfer is enabled by freezing the previously trained network and using its intermediate representations as input to the new one [66].
4. One-Step Domain Adaptation

As mentioned in Section 2.1, the data in the target domain may be labeled, partially labeled or unlabeled, and deep DA can accordingly be divided into supervised, semi-supervised and unsupervised settings. Since the semi-supervised setting can be handled by combining supervised and unsupervised methods, we only focus on the first and third settings in this paper. The cases where the different approaches are mainly used for each DA setting are shown in Table 3.
The discrepancy-based approaches have been studied for years and have produced more methods in many research works, whereas the adversarial-based and reconstruction-based approaches are relatively new research topics that have recently been attracting more attention.
4.1. Homogeneous Domain Adaptation
4.1.1. Discrepancy-Based Approaches
Yosinski et al. [72] proved that transferable features learned by deep networks have limitations due to fragile co-adaptation and representation specificity and that fine-tuning can enhance generalization performance. Fine-tuning (which can also be viewed as a discrepancy-based deep DA approach) trains a base network with source data and then directly reuses the first n layers to construct a target network. The remaining layers of the target network are randomly initialized and trained with a discrepancy-based loss. During training, the first n layers of the target network can be fine-tuned or frozen depending on the size of the target dataset and its similarity to the source dataset [73]. Some common rules of thumb for navigating the four major scenarios are given in Table 4.
Table 4: Some common rules of thumb for deciding whether the first n layers should be fine-tuned or frozen [73]. The table is organized by the size of the target dataset (low, medium, high) and its similarity to the source dataset.
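The freeze-or-fine-tune decision above maps directly to a few lines of code. The following is our own minimal PyTorch-style sketch, not code from [72] or [73]; the AlexNet backbone, the default value of n and the helper name are assumptions for illustration. It reuses a source-pretrained network, freezes its first n feature blocks, and re-initializes the classifier head for the target label space.

```python
import torch.nn as nn
from torchvision import models

def build_target_model(num_classes: int, n_frozen: int = 6) -> nn.Module:
    """Reuse a source-pretrained network: freeze the first n_frozen feature
    blocks and re-initialize the task head for the target label space."""
    net = models.alexnet(pretrained=True)     # stands in for the source network
    for i, layer in enumerate(net.features):
        if i < n_frozen:                      # frozen: keeps source knowledge
            for p in layer.parameters():
                p.requires_grad = False
    # the remaining (higher) layers stay trainable and are fine-tuned
    net.classifier[6] = nn.Linear(4096, num_classes)
    return net
```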
• Class Criterion
Figure 4: The average accuracy over the validation set for a network trained with different strategies. 1) Baseline B: the network is trained on dataset B. 2) BnB: the first n layers are reused from baseline B and frozen; the higher layers are trained on dataset B. 3) BnB+: the same as BnB, but all layers are fine-tuned. 4) AnB: the first n layers are reused from the network trained on dataset A and frozen; the higher layers are trained on dataset B. 5) AnB+: the same as AnB, but all layers are fine-tuned [72].
The class criterion is the most basic training loss in deep DA. After pre-training the network with source data, the remaining layers of the target model use the class label information as a guide to train the network. Hence, a small number of labeled samples from the target dataset is assumed to be available.

Ideally, the class label information is given directly in supervised DA. Most work commonly uses the negative log-likelihood of the ground-truth class with softmax as the training loss, L = -\sum_{i=0}^{N} y_i \log \hat{y}_i, where \hat{y}_i are the softmax predictions of the model, which represent class probabilities [26, 27, 30, 74]. To extend this, Hinton et al. [31] modified the softmax function to the soft label loss:
q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}    (1)

where z_i is the logit of class i and T is a temperature that is normally set to 1 in standard softmax; a higher value of T produces a softer probability distribution over classes.
By learning from soft labels, the information about the learned function that resides in the ratios of very small probabilities can be obtained; for example, when recognizing digits, a version of a 2 that resembles a 3 retains a relatively high probability for class 3 in the soft targets. Tzeng et al. [26] jointly optimized the domain confusion loss (which will be presented in Section 4.1.2) and the soft label loss; using soft labels rather than hard labels can preserve the relationships between classes across domains. Gebru et al. [35] modified existing adaptation algorithms based on [26] and utilized soft label loss at the fine-grained class level, L_{csoft}, and the attribute level, L_{asoft}.
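As an illustration of Eq. (1), the sketch below shows a temperature-softened soft label loss in PyTorch. It is a minimal, hypothetical implementation of the general idea, not the exact loss of [26] or [35]; the function name and the choice of a KL penalty with a T^2 scale follow common distillation practice.

```python
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, T: float = 2.0):
    """Soft label loss built on Eq. (1): soften both logit sets with
    temperature T and match the distributions with a KL penalty."""
    q = F.softmax(teacher_logits / T, dim=1)          # soft targets q_i
    log_p = F.log_softmax(student_logits / T, dim=1)  # student log-probabilities
    # T^2 keeps gradient magnitudes comparable across temperatures
    return F.kl_div(log_p, q, reduction="batchmean") * (T ** 2)
```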
Figure 5: Deep DA by combining domain confusion loss and soft label loss [26].
In addition to softmax loss, metric learning can be used as the training loss, encouraging samples with the same labels to be close while those with different labels are far away. Based on this idea, [28] constructed the semantic alignment loss and the separation loss accordingly. Deep transfer metric learning is proposed by [30], which applies the marginal Fisher analysis criterion and the MMD criterion (described in Statistic Criterion):

J = S_c^{(M)} - \alpha S_b^{(M)} + \beta D_{ts}^{(M)}(X^s, X^t) + \gamma \sum_{m=1}^{M} (\|W^{(m)}\|_F^2 + \|b^{(m)}\|_2^2)    (2)
where α, β and γ are regularization parameters, W^{(m)} and b^{(m)} are the weights and biases of the m-th layer of the network, D_{ts}^{(M)}(X^s, X^t) is the MMD between the representations of the source and target domains at the top layer M, and S_c and S_b define the intra-class compactness and the inter-class separability, respectively.
However, what can we do if there is no class label information in the target domain? As we all know, humans can identify unseen classes given only a high-level description. For instance, when provided the description "tall brown animals with long necks", we are able to recognize giraffes. To imitate this ability, [75] introduced high-level semantic attributes per class. Assume that a^c = (a_1^c, ..., a_M^c) is the attribute representation for class c, which has fixed-length binary values with M attributes shared by all the classes. The classifiers provide estimates of p(a_m|x) for each attribute a_m. In the test stage, each target class y obtains its attribute vector a^y in a deterministic way, i.e., p(a|y) = [[a = a^y]]. By applying Bayes' rule, p(y|a) = \frac{p(y)}{p(a^y)} [[a = a^y]], the posterior of a test class can be calculated as follows:

p(y|x) = \sum_{a \in \{0,1\}^M} p(y|a) p(a|x) = \frac{p(y)}{p(a^y)} \prod_{m=1}^{M} p(a_m^y | x)    (3)
Gebru et al. [35] drew inspiration from these works and leveraged attributes to improve performance in the DA of fine-grained recognition. To prevent the independent fine-grained and attribute classifiers from obtaining conflicting labels, an attribute consistency loss is also implemented.

Occasionally, when fine-tuning the network in unsupervised DA, a label of the target data, which is called a pseudo label, can preliminarily be obtained
based on the maximum posterior probability. Yan et al. [34] initialized the target model using the source data and then defined the class posterior probability p(y_j^t = c | x_j^t) by the output of the target model. With p(y_j^t = c | x_j^t), they assigned the pseudo label \hat{y}_j^t to x_j^t by \hat{y}_j^t = \arg\max_c p(y_j^t = c | x_j^t). In [29], two different networks assign pseudo labels to unlabeled samples, while another network is trained with these samples to obtain target discriminative representations. The deep transfer network (DTN) [33] used some base classifiers, e.g., SVMs and MLPs, to obtain pseudo labels for the target samples, estimated the conditional distribution of the target samples, and matched both the marginal and the conditional distributions with the MMD criterion. When casting the classifier adaptation into the residual learning framework, [32] used the pseudo labels to build the conditional entropy E(D^t, f^t), which ensures that the target classifier f^t fits the target-specific structures well.
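The pseudo-label assignment \hat{y}_j^t = \arg\max_c p(y_j^t = c | x_j^t) can be sketched in a few lines. This is our own minimal PyTorch illustration, not the authors' code; the confidence threshold is an extra assumption often used in practice but not part of [34].

```python
import torch

@torch.no_grad()
def assign_pseudo_labels(model, target_loader, threshold: float = 0.9):
    """Pseudo-label unlabeled target samples by the maximum class posterior,
    keeping only confident predictions (the threshold is our addition)."""
    model.eval()
    xs, ys = [], []
    for x in target_loader:                     # batches of unlabeled x^t
        probs = torch.softmax(model(x), dim=1)  # p(y^t = c | x^t)
        conf, pseudo = probs.max(dim=1)         # argmax_c p(y^t = c | x^t)
        keep = conf > threshold
        xs.append(x[keep])
        ys.append(pseudo[keep])
    return torch.cat(xs), torch.cat(ys)
```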
• Statistic Criterion

Although some discrepancy-based approaches search for pseudo labels, attribute labels or other substitutes for labeled target data, more work focuses on learning domain-invariant representations by minimizing the domain discrepancy between the distributions. Maximum mean discrepancy (MMD) is one of the most widely used metrics; it compares two distributions s and t by embedding them in a reproducing kernel Hilbert space:

MMD^2(s, t) = \sup_{\|\phi\|_H \le 1} \left\| E_{x^s \sim s}[\phi(x^s)] - E_{x^t \sim t}[\phi(x^t)] \right\|_H^2    (4)

where φ represents the kernel-induced feature map into the reproducing kernel Hilbert space (RKHS) and \|\phi\|_H \le 1 defines a set of functions in the unit ball of the RKHS H.
Figure 6: Different approaches with the MMD metric: (a) the deep adaptation network (DAN) architecture [38], (b) the joint adaptation network (JAN) architecture [37] and (c) the residual transfer network (RTN) architecture [32].
Based on the above, Ghifary et al. [40] proposed a model that introduced the MMD metric in feedforward neural networks with a single hidden layer. The MMD metric is computed between the representations of each domain to reduce the distribution mismatch in the latent space. The empirical estimate of MMD is as follows:

MMD^2(D^s, D^t) = \left\| \frac{1}{M} \sum_{i=1}^{M} \phi(x_i^s) - \frac{1}{N} \sum_{j=1}^{N} \phi(x_j^t) \right\|_H^2    (5)
Subsequently, Tzeng et al. [39] and Long et al. [38] extended MMD to deep CNN models with great success. The deep domain confusion network (DDC) by Tzeng et al. [39] used two CNNs for the source and target domains with shared weights. The network is optimized for classification loss in the source domain, while the domain difference is measured by an adaptation layer with the MMD loss:
L = L_C(X^L, y) + \lambda \, MMD^2(X^s, X^t)    (6)

where L_C(X^L, y) denotes the classification loss on the available labeled data, X^L, with the ground-truth labels, y, and MMD^2(X^s, X^t) denotes the distance between the source and target data. DDC only adapts one layer of the network, resulting in a reduction in
the transferability of multiple layers. Rather than using a single layer and linear MMD, Long et al. [38] proposed the deep adaptation network (DAN), which matches the shift in marginal distributions across domains by adding multiple adaptation layers and exploring multiple kernels, assuming that the conditional distributions remain unchanged. However, this assumption is rather strong in practical applications; in other words, the source classifier cannot be directly used in the target domain. To relax it, the joint adaptation network (JAN) [37] aligns the joint distributions of multiple domain-specific layers based on a joint MMD criterion, and residual transfer networks (RTNs) [32] added a gated residual layer for classifier adaptation. More recently, [34] proposed a weighted MMD model that introduces an auxiliary weight for each class in the source domain when the class weights in the target domain are not the same as those in the source domain.
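For reference, a minimal PyTorch sketch of the linear (identity feature map) MMD of Eq. (5) and the DDC-style objective of Eq. (6) is given below. This is our own illustration, not the DDC implementation; the trade-off weight is illustrative, and multi-kernel variants such as DAN would replace linear_mmd2 with a sum of Gaussian-kernel MMD terms.

```python
# f_src, f_tgt: torch.Tensor mini-batches of features, shape (batch, dim)
def linear_mmd2(f_src, f_tgt):
    """Empirical MMD^2 of Eq. (5) with the identity feature map phi(x) = x:
    the squared distance between the two domain means."""
    delta = f_src.mean(dim=0) - f_tgt.mean(dim=0)
    return (delta * delta).sum()

def ddc_objective(class_loss, f_src, f_tgt, lam: float = 0.25):
    """Total loss of Eq. (6); lam is an illustrative trade-off weight."""
    return class_loss + lam * linear_mmd2(f_src, f_tgt)
```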
Another family of methods aligns the second-order statistics between domains with correlation alignment (CORAL). Sun et al. [41] extended CORAL to deep neural networks (deep CORAL) with a nonlinear transformation:

L_{CORAL} = \frac{1}{4d^2} \| C_S - C_T \|_F^2    (7)

where \| \cdot \|_F^2 denotes the squared matrix Frobenius norm, d is the dimension of the features, and C_S and C_T denote the covariance matrices of the source and target data, respectively.
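Eq. (7) translates almost directly into code. The sketch below is our own minimal PyTorch rendering of the deep CORAL loss under the stated definitions, not the authors' implementation.

```python
# f_src, f_tgt: torch.Tensor mini-batches of features, shape (batch, dim)
def coral_loss(f_src, f_tgt):
    """Deep CORAL loss of Eq. (7): squared Frobenius distance between the
    source and target feature covariances, scaled by 1 / (4 d^2)."""
    d = f_src.size(1)

    def covariance(f):
        n = f.size(0)
        fc = f - f.mean(dim=0, keepdim=True)   # center the features
        return fc.t() @ fc / (n - 1)           # C_S or C_T

    diff = covariance(f_src) - covariance(f_tgt)
    return (diff * diff).sum() / (4.0 * d * d) # ||C_S - C_T||_F^2 / (4 d^2)
```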
By the Taylor expansion of the Gaussian kernel, MMD can be viewed as minimizing the distance between the weighted sums of all raw moments [78]. The interpretation of MMD as a moment matching procedure motivated Zellinger et al. [79] to match the higher-order moments of the domain distributions directly with the central moment discrepancy (CMD):

CMD_K(X^s, X^t) = \frac{1}{|b-a|} \| E(X^s) - E(X^t) \|_2 + \sum_{k=2}^{K} \frac{1}{|b-a|^k} \| C_k(X^s) - C_k(X^t) \|_2    (8)

where C_k(X) = E((x - E(X))^k) is the vector of all k-th order sample central moments, E(X) = \frac{1}{|X|} \sum_{x \in X} x is the empirical expectation, and [a, b] is the interval in which the features are assumed to lie.
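A direct transcription of Eq. (8) is sketched below (our own illustration), assuming the features have been pre-scaled to the interval [a, b], e.g., by a bounded activation; K = 5 is an illustrative default.

```python
# f_src, f_tgt: torch.Tensor mini-batches of features, assumed in [a, b]^d
def cmd(f_src, f_tgt, K: int = 5, a: float = 0.0, b: float = 1.0):
    """Central moment discrepancy of Eq. (8): match the means plus the
    first K central moments of the two feature distributions."""
    ms, mt = f_src.mean(dim=0), f_tgt.mean(dim=0)
    loss = (ms - mt).norm(p=2) / abs(b - a)        # first-order term
    for k in range(2, K + 1):
        cs = ((f_src - ms) ** k).mean(dim=0)       # C_k(X^s)
        ct = ((f_tgt - mt) ** k).mean(dim=0)       # C_k(X^t)
        loss = loss + (cs - ct).norm(p=2) / abs(b - a) ** k
    return loss
```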
• Architecture Criterion

Some other methods optimize the architecture of the network to minimize the distribution discrepancy. This adaptation behavior can be achieved in most deep DA models, in both supervised and unsupervised settings.

Rozantsev et al. [47] considered that the weights in corresponding layers of the two streams should not be shared but related by a weight regularizer r_w(\cdot) to account for the differences between the two domains. The weight regularizer can be expressed as an exponential loss function:

r_w(\theta_j^s, \theta_j^t) = \exp(\| \theta_j^s - \theta_j^t \|^2) - 1    (9)

where \theta_j^s and \theta_j^t denote the parameters of the j-th layer of the source and target models, respectively. To further relax this restriction, they allow the weights in one stream to undergo a linear transformation:

r_w(\theta_j^s, \theta_j^t) = \exp(\| a_j \theta_j^s + b_j - \theta_j^t \|^2) - 1    (10)

where a_j and b_j are scalar parameters that encode the linear transformation.
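Eqs. (9) and (10) amount to a one-line penalty between corresponding layer weights. A minimal sketch is given below; Eq. (9) is recovered with a = 1 and b = 0.

```python
import torch

def weight_regularizer(theta_s, theta_t, a=1.0, b=0.0):
    """Exponential weight regularizer of Eq. (10); with a = 1 and b = 0
    this reduces to the untransformed penalty of Eq. (9)."""
    return torch.exp((a * theta_s + b - theta_t).pow(2).sum()) - 1.0
```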
The work of Shu et al. [81] is similar to [47], penalizing the deviation between the corresponding source and target weight matrices through weakly parameter-shared layers.
Figure 7: The two-stream architecture with related weights [47].
Batch normalization (BN) normalizes the mean and standard deviation for each individual feature channel such that each layer receives data from a similar distribution, irrespective of whether it comes from the source or the target domain. Therefore, Li et al. [44] used BN to align the distributions by recomputing the mean and standard deviation in the target domain (adaptive BN, AdaBN):

BN(X^t) = \lambda \frac{x - \mu(X^t)}{\sigma(X^t)} + \beta    (12)

where λ and β are the learned scale and shift parameters of the BN layer, and μ(X^t) and σ(X^t) are the mean and standard deviation computed independently for each feature channel on the target data. Based on [44], [83] endowed BN layers with a set of alignment parameters that can be learned automatically and can decide the degree of feature alignment required at different levels of the deep network. Furthermore, Ulyanov et al. [84] found that when replacing BN layers with instance normalization (IN) layers, where μ(x) and σ(x) are computed independently for each channel and each sample, the performance of DA can be further improved.
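The AdaBN idea of Eq. (12) requires no gradient steps at all: only the batch normalization statistics are recomputed on target data. The following is our own minimal PyTorch sketch of that procedure, not the authors' code; it assumes the loader yields unlabeled target images.

```python
import torch

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, n_batches: int = 100):
    """AdaBN-style sketch: recompute every BatchNorm layer's running mean
    and standard deviation on target data, keeping learned weights fixed."""
    for m in model.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            m.reset_running_stats()   # forget the source-domain statistics
            m.momentum = None         # switch to a cumulative moving average
    model.train()                     # BN updates its statistics in train mode
    for i, x in enumerate(target_loader):
        if i >= n_batches:
            break
        model(x)                      # forward passes re-estimate mu, sigma
    model.eval()
```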
Occasionally, neurons are not effective for all domains because of the
presence of domain biases. For example, when recognizing people, the tar-
get domain typically contains one person centered with minimal background
clutter, whereas the source dataset contains many people with more clutter.
Thus, the neurons that capture the features of other people and clutter are useless. Domain-guided dropout was proposed by [48] to solve the problem of multi-DA, and it mutes non-related neurons for each domain. Rather than assigning a specific dropout rate, it depends on the gain of the loss function for each neuron on the domain sample when the neuron is removed:

s_i = L(g(x)_{\setminus i}) - L(g(x))    (13)

where L is the softmax loss function and g(x)_{\setminus i} is the feature vector after setting the response of the i-th neuron to zero. In [85], each source domain is assigned different parameters, \Theta^{(i)} = \Theta^{(0)} + \Delta^{(i)}, where \Theta^{(0)} is a domain-general model and \Delta^{(i)} is a domain-specific bias term. After the low-rank parameterized CNNs are trained, \Theta^{(0)} can serve as the classifier for the target domain.
• Geometric Criterion

The geometric criterion mitigates the domain shift by integrating intermediate subspaces on a geodesic path from the source to the target domains. A geodesic flow curve is constructed to connect the source and target domains on the Grassmannian, where the source and target subspaces are points on a Grassmann manifold. By sampling a fixed [86] or infinite [87] number of subspaces along the geodesic, the source and target data are projected onto the obtained intermediate subspaces to align the distributions.
Inspired by the intermediate representations on the geodesic path, Chopra et al. [50] proposed a model called deep learning for domain adaptation by interpolating between domains (DLID), which generates intermediate datasets by gradually interpolating between the source and target domains. Once the intermediate datasets are generated, a deep nonlinear feature extractor using predictive sparse decomposition is trained on them in an unsupervised manner.
4.1.2. Adversarial-Based Approaches
Recently, great success has been achieved by the GAN method [88], which estimates generative models via an adversarial process. A GAN consists of two models: a generative model G that captures the data distribution and a discriminative model D that distinguishes whether a sample comes from G or from the training data by predicting a binary label. The networks are trained on a minimax objective:

\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (14)
In DA, this principle has been employed to ensure that the network cannot distinguish between the source and target domains. [58] proposed a unified framework for adversarial-based approaches and summarized the existing methods as instantiations of it, as shown in Fig. 8.
Figure 8: Generalized architecture for adversarial domain adaptation. Existing adversarial adaptation methods can be viewed as instantiations of a framework with different choices regarding their properties [58].
• Generative Models

Synthetic target data with ground-truth annotations are an appealing alternative to address the problem of a lack of training data. First, with the help of source data, generators render unlimited quantities of synthetic target data, which are paired with synthetic source data to share labels, or appear as if they were sampled from the target domain while maintaining their labels. Then, the synthetic data with labels are used to train the target model as if no domain shift existed. The coupled GAN (CoGAN) [51], for example, can render target data paired with synthetic source ones. It consists of a pair of GANs: GAN1 for generating source data and GAN2 for generating target data. The weights of the first few layers in the generative models and of the last few layers in the discriminative models are tied, which allows CoGAN to learn joint distributions of images from the two domains that share the labels. Therefore, the shared labels of synthetic target samples can be used to train the target model.
More work focuses on generating synthetic data that are similar to the target data while maintaining annotations. Yoo et al. [89] transferred knowledge from the source domain to pixel-level target images with GANs: a domain discriminator ensures the invariance of content to the source domain, while a real/fake discriminator encourages the generated images to resemble the target domain.
Similarly, [52] exploited GANs conditioned on a noise vector and source images (Fig. 10). To ensure that the synthetic images are generated from the same source images, a content-similarity loss is used that penalizes large differences between source and synthetic images for foreground pixels only, by a masked pairwise mean squared error [91]. The goal of the network is to learn G, D and T by solving the optimization problem:

\min_{G, T} \max_D \; \alpha L_d(D, G) + \beta L_t(T, G) + \gamma L_c(G)    (15)

where α, β, and γ are parameters that control the trade-off between the losses, and L_d, L_t and L_c are the adversarial loss, softmax loss and content-similarity loss, respectively.
Figure 10: The model that exploits GANs conditioned on a noise vector and source images [52].
• Non-Generative Models

The key of non-generative models is to learn domain-invariant representations: if the feature representations are domain-invariant, a classifier can be directly used in the target domain even if it is trained on source samples. Therefore, whether the representations are domain-confused or not is crucial to transferring knowledge. Inspired by GANs, a domain confusion loss, which is produced by the discriminator, is introduced to improve the performance of deep DA without generators.
The domain-adversarial neural network (DANN) [55] integrates a gradient reversal layer (GRL) into a standard architecture to ensure that the feature distributions over the two domains are made similar. The network consists of shared feature extraction layers and two classifiers. DANN minimizes the domain confusion loss (for all samples) and the label prediction loss (for source samples) while maximizing the domain confusion loss via the use of the GRL.
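The GRL at the heart of DANN is a tiny identity-forward, negated-backward operator. Below is a commonly used PyTorch-style sketch of the idea (our own illustration, with lambd as the confusion trade-off); the domain classifier placed after grad_reverse is then trained with an ordinary cross-entropy domain loss.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; multiplies the gradient by -lambd on
    the backward pass, so the feature extractor is trained to *maximize*
    the domain classifier's loss while the classifier trains normally."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

# usage: domain_logits = domain_classifier(grad_reverse(features, lambd))
```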
In contrast to the above methods, the adversarial discriminative domain
adaptation (ADDA) [58] considers independent source and target mappings
by untying the weights, and the parameters of the target model are initialized by the pre-trained source ones. This is more flexible because it allows more domain-specific feature extraction to be learned. ADDA minimizes the distance between the source and target representations by iteratively minimizing the following functions, which is most similar to the original GAN:

\min_{M^s, C} L_{cls}(X^s, Y^s) = -E_{(x^s, y^s) \sim (X^s, Y^s)} \sum_{k=1}^{K} 1_{[k=y^s]} \log C(M^s(x^s))

\min_{D} L_{advD}(X^s, X^t, M^s, M^t) = -E_{x^s \sim X^s}[\log D(M^s(x^s))] - E_{x^t \sim X^t}[\log(1 - D(M^t(x^t)))]

\min_{M^s, M^t} L_{advM}(X^s, X^t, D) = -E_{x^t \sim X^t}[\log D(M^t(x^t))]    (16)

where the mappings M^s and M^t are learned from the source and target data, respectively, and C is the source classifier. The first function is optimized by training the classifier using the labeled source data. The second function L_{advD} is minimized to train the discriminator, while the third function L_{advM} learns a target representation that the discriminator cannot distinguish from the source distribution over the binary domain labels.
Figure 12: The adversarial discriminative domain adaptation (ADDA) architecture [58].
Unlike previous methods that match the entire source and target domains, Cao et al. introduced a selective adversarial
network (SAN) [92] to address partial transfer learning from large domains
to small domains, which assumes that the target label space is a subspace of
the source label space. It simultaneously avoids negative transfer by filtering out outlier source classes, and it promotes positive transfer by matching the data distributions in the shared label space, splitting the domain discriminator into many class-wise domain discriminators. [93] encoded domain labels and class labels to produce four groups of pairs and replaced the typical binary adversarial discriminator with a four-class discriminator. Volpi et al. [94] trained a feature generator S to perform data augmentation in the source feature space and obtained domain-invariant features by playing a minimax game against features from S.
4.1.3. Reconstruction-Based Approaches

In DA, the data reconstruction of source or target samples is an auxiliary task that simultaneously focuses on creating a shared representation between the two domains and keeping the individual characteristics of each domain.

• Encoder-Decoder Reconstruction
The basic autoencoder framework [98] is a feedforward neural network that includes encoding and decoding processes. The autoencoder first encodes an input to some hidden representation and then decodes this hidden representation back to a reconstruction. For DA, stacked denoising autoencoders (SDAs) [99] learn hidden representations that can represent both the source and target domain data. Thus, a linear classifier that is trained on the labeled data of the source domain can make predictions on the target domain data with these representations. Despite their remarkable results, SDAs are limited by their high computational cost and lack of scalability to high-dimensional features. To address these crucial limitations, Chen et al. [100] proposed the marginalized SDA (mSDA), which marginalizes noise with linear denoisers so that its parameters can be computed in closed form without stochastic gradient descent.
The deep reconstruction classification network (DRCN) proposed in [60] learns a shared encoding representation that provides useful information for cross-domain object recognition. DRCN is a CNN architecture that combines two pipelines with a shared encoder. After a representation is provided by the encoder, the first pipeline, which is a CNN, works for supervised classification with source labels, whereas the second pipeline, which is a deconvolutional network, optimizes for unsupervised reconstruction with target data:
\min \; \lambda L_c(\{\theta_{enc}, \theta_{lab}\}) + (1 - \lambda) L_r(\{\theta_{enc}, \theta_{dec}\})    (17)

where λ is a hyper-parameter that controls the trade-off between classification and reconstruction. θ_{enc}, θ_{dec} and θ_{lab} denote the parameters of the encoder, decoder and source classifier, respectively. L_c is the cross-entropy loss for classification, and L_r is the reconstruction loss.
Figure 13: The deep reconstruction classification network (DRCN) architecture [60].
Domain separation networks (DSNs) [61] explicitly separate a private subspace for each domain from a representation subspace shared across domains, and a shared decoder reconstructs the input samples from both parts. Considering that the reconstruction loss is simple and that the private features are only used for reconstruction in DSNs, [101] reinforced them by incorporating hybrid adversarial learning in a separation network and an adaptation network.
Dual learning was first proposed by Xia et al. [102] to reduce the requirement of labeled data in natural language processing. Dual learning trains two "opposite" language translators, e.g., A-to-B and B-to-A. The two translators represent a primal-dual pair that evaluates how likely the translated
sentences belong to the targeted language, and the closed loop measures the disparity between the reconstructed and the original sentences. Inspired by dual learning, adversarial reconstruction is adopted in deep DA with the help of dual GANs.
Zhu et al. [63] proposed cycle GAN, which can translate the characteristics of one image domain into the other in the absence of any paired training examples. Compared to dual learning, cycle GAN uses two generators rather than translators, which learn a mapping G : X → Y and an inverse mapping F : Y → X. Two discriminators, D_X and D_Y, measure how realistic the generated image is (G(X) ≈ Y or F(Y) ≈ X) by an adversarial loss and how well the original input is reconstructed after a sequence of two generations (F(G(X)) ≈ X or G(F(Y)) ≈ Y) by a cycle consistency loss (reconstruction loss). Thus, the distribution of images from G(X) (or F(Y)) becomes indistinguishable from the distribution Y (or X).
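The cycle consistency term can be sketched in a few lines. The following is our own minimal PyTorch-style illustration, where G and F_inv stand for the two generators G and F; the L1 penalty follows the choice in the cycle GAN paper.

```python
def cycle_consistency_loss(G, F_inv, x, y):
    """Cycle consistency (reconstruction) loss: mapping across domains and
    back should reproduce the input, F(G(x)) ~ x and G(F(y)) ~ y."""
    forward_cycle = (F_inv(G(x)) - x).abs().mean()   # X -> Y -> X
    backward_cycle = (G(F_inv(y)) - y).abs().mean()  # Y -> X -> Y
    return forward_cycle + backward_cycle
```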
The dual GAN [62] and the disco GAN [64] were proposed at the same
time, where the core idea is similar to cycle GAN. In dual GAN, the gen-
erator is configured with skip connections between mirrored downsampling
and upsampling layers [103, 53], making it a U-shaped net to share low-level
Figure 14: The cycle GAN architecture [63].
information (e.g., object shapes, textures, clutter, and so forth). For discriminators, the Markovian patch-GAN [104] architecture is employed to capture local high-frequency information. In disco GAN, various forms of distance functions, such as mean squared error (MSE), cosine distance, and hinge loss, can be used as the reconstruction loss, and the network is applied to translate images, changing specified attributes, including hair color and gender.
Some methods combine several criteria discussed above. For example, [26] combined the domain confusion loss and the soft label loss, while [32] used both the statistic criterion (MMD) and the architecture criterion (adapting the classifier by a residual function) for unsupervised DA. [34] introduced class-specific auxiliary weights into the MMD with the help of pseudo labels. DSNs [61] decompose the representations into private and shared representations, where the MMD criterion or the domain confusion loss makes the shared representations similar, and soft subspace orthogonality constraints ensure dissimilarity between the private and shared representations. [47] used the MMD between
the learned source and target representations and also allowed the weights of
the corresponding layers to differ. [43] learned domain-invariant representa-
tions by encoder-decoder reconstruction approaches and the KL divergence.
4.2. Heterogeneous Domain Adaptation
In heterogeneous DA, the feature spaces of the source and target domains are not the same, X^s ≠ X^t, and the dimensions of the feature spaces may also differ. According to the divergence of the feature spaces, heterogeneous DA can be further divided into two scenarios. In one scenario, the source and target domains both contain images, and the divergence of the feature spaces is mainly caused by different sensory devices (e.g., visual light (VIS) vs. near-infrared (NIR), or RGB vs. depth) or different styles of images (e.g., sketches vs. photos). In the other scenario, there are different types of media in the source and target domains (e.g., text vs. image, or language vs. image). Obviously, the cross-domain gap in the second scenario is much larger.
In shallow heterogeneous DA, symmetric feature transformations such as heterogeneous feature augmentation (HFA) learn two projection matrices P and Q that map the source and target features into a common subspace, and use augmented features of the form [Px^s, x^s, 0_{d_t}]^T and [Qx^t, 0_{d_s}, x^t]^T to augment the transformed data with their original features and zeros. These projection matrices are found using a standard SVM with hinge loss in both the linear and nonlinear cases, and an alternating optimization algorithm is proposed to simultaneously solve the dual SVM and find the optimal transformation. In contrast, an asymmetric transformation transforms one of the source and target feature spaces to align with the other.
[107] proposed a sparse and class-invariant feature transformation matrix to map the weight vector of classifiers learned from the source domain to the target domain. The asymmetric regularized cross-domain transfer (ARC-t) [108] used asymmetric, non-linear transformations learned in a Gaussian RBF kernel space to map the target data to the source domain. Extended from [109], ARC-t performs asymmetric transformation based on metric learning and transfers knowledge between domains with different dimensions through changes of the regularizer. Since we focus on deep DA, we refer interested readers to [20], which summarizes shallow approaches to heterogeneous DA.
However, as for deep methods, not much work has focused on heterogeneous DA so far, and specialized, effective methods of heterogeneous deep DA remain to be developed. The methods of homogeneous deep DA typically share the first n layers between the source and target domains, which limits the feature spaces of the input to the same dimension. However, in heterogeneous DA, the dimensions of the feature spaces of the source domain may differ from those of the target domain.
In the first scenario of heterogeneous DA, the images in different domains can be directly resized into the same dimensions, so the Class Criterion and
Statistic Criterion are still effective and are mainly used. For example, given an RGB image and its paired depth image, [110] used the mid-level representation learned by CNNs as a supervisory signal to re-train a CNN on depth images. To transform an RGB object detector into an RGB-D detector without needing complete RGB-D data, Hoffman et al. [111] first trained an RGB network using labeled RGB data from all categories, fine-tuned the network with labeled depth data from partial categories, and then combined mid-level RGB and depth representations at fc6 to incorporate both modalities into the final object class prediction. [112] first trained the network using a large face database of photos and then fine-tuned it using a small database of composite sketches; [113] transferred VIS deep networks to the NIR domain in the same way.

In the second scenario, the features of different media cannot be directly resized into the same dimensions, so generative adversarial approaches are mainly used. One method employed a compound loss function that consists of a multiclass GAN loss, a regularizing component and an f-constancy component to transfer unlabeled face photos to emoji images. To generate images of birds and flowers based on text, [118] trained a GAN conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network. [119]
proposed stacked generative adversarial networks (StackGAN) with conditioning augmentation for synthesizing photo-realistic images from text. It decomposes the synthesis problem into several sketch-refinement processes: the Stage-I GAN sketches the primitive shape and basic colors of the object to yield a low-resolution image, and the Stage-II GAN completes the details of the object to yield a high-resolution image.
For facial photo-sketch synthesis, one can train two generators, G_A and G_B, to generate sketches from photos and photos from sketches, respectively. Based on cycle GAN [63], [120] proposed a multi-adversarial network to avoid artifacts of facial photo-sketch synthesis by leveraging the implicit presence of feature maps of different resolutions in the generator subnetwork.
5. Multi-Step Domain Adaptation

In multi-step DA, the key problem is how to select and utilize the intermediate domain, for which there are three types of solutions.

• Hand-Crafted: users determine the intermediate domain based on experience; that is, it is decided in advance. For example, when the source domain consists of image data and the target domain is composed of text data, some annotated images can be chosen as the intermediate domain. Using nighttime light intensity as a proxy for economic activity, Xie et al. [65] transferred knowledge from daytime satellite imagery to poverty prediction with the help of nighttime light intensity information as an intermediate domain.
• Instance-Based: certain parts of the data from the auxiliary datasets are selected to compose the intermediate domains, transferring knowledge step by step from the source to the target domain.
Tan et al. [25] proposed distant domain transfer learning (DDTL), in which long-distance domains cannot be related by a single intermediate domain but can be related via multiple intermediate domains. DDTL gradually selects unlabeled data from the intermediate domains by minimizing the reconstruction errors on the selected instances in the source and intermediate domains and on all the instances in the target domain simultaneously. With the removal of unrelated source data, the selected intermediate domains gradually move from the source domain toward the target domain:
J_1(f_e, f_d, v_S, v_I) = \frac{1}{n_S} \sum_{i=1}^{n_S} v_S^i \| \hat{x}_S^i - x_S^i \|_2^2 + \frac{1}{n_I} \sum_{i=1}^{n_I} v_I^i \| \hat{x}_I^i - x_I^i \|_2^2 + \frac{1}{n_T} \sum_{i=1}^{n_T} \| \hat{x}_T^i - x_T^i \|_2^2 + R(v_S, v_I)    (19)
where \hat{x}_S^i, \hat{x}_T^i and \hat{x}_I^i are the reconstructions of the source data x_S^i, target data x_T^i and intermediate data x_I^i based on the autoencoder, respectively, and f_e and f_d are the parameters of the encoder and decoder, respectively. v_S = (v_S^1, ..., v_S^{n_S})^T and v_I = (v_I^1, ..., v_I^{n_I})^T, with v_S^i, v_I^i ∈ {0, 1}, are selection indicators for the i-th source and intermediate instances, and R(v_S, v_I) is a regularization term.
• Representation-Based: transfer is enabled by freezing the previously trained networks and using their intermediate representations as input to the new network. Rusu et al. [66] introduced progressive networks that have the ability to accumulate and transfer knowledge to new domains over a sequence of experiences. To avoid the target model losing its ability to solve the source domain, they constructed a new neural network for each domain, with transfer enabled via lateral connections to the features of previously learned networks. In this process, the parameters of the previously learned networks are frozen to remember the knowledge of the intermediate domains.
6. Applications of Deep Domain Adaptation

Deep DA techniques have been successfully applied to many computer vision tasks, such as image classification, face recognition, object detection, style translation, and so forth. In this section, we present different application examples using various visual deep DA methods. Because the information on commonly used datasets for evaluating performance is provided in [22] in detail, we do not introduce it in this paper.
6.1. Image Classification

Because different papers use different experimental protocols, it is hard to perform a fair comparison among all the methods directly. Thus, similar to the work of Pan [18], we show the comparison results between the proposed methods and the non-adaptation baselines as reported in the original papers, presented in Table 5.

In [37], [79], and [26], the authors used the Office-31 dataset (https://cs.stanford.edu/~jhoffman/domainadapt/) as one of the evaluation datasets, as shown in Fig. 1(a). The Office dataset is a computer vision classification dataset with images from three distinct domains: Amazon (A), DSLR (D), and Webcam (W). The largest domain,
Table 5: Comparison between transfer learning and non-adaptation learning methods (accuracy, %).

Office-31 dataset, reported in [37]:
Source vs. Target | AlexNet | DDC | DAN | RTN | JAN | DANN
A vs. W | 61.6±0.5 | 61.8±0.4 | 68.5 | 73.3±0.3 | 75.2±0.4 | 73.0±0.5
D vs. W | 95.4±0.3 | 95.0±0.5 | 96.0±0.3 | 96.8±0.2 | 96.6±0.2 | 96.4±0.3
W vs. D | 99.0±0.2 | 98.5±0.4 | 99.0±0.3 | 99.6±0.1 | 99.6±0.1 | 99.2±0.3
A vs. D | 63.8±0.5 | 64.4±0.3 | 67.0±0.4 | 71.0±0.2 | 72.8±0.3 | 72.3±0.3
D vs. A | 51.1±0.6 | 52.1±0.6 | 54.0±0.5 | 50.5±0.3 | 57.5±0.2 | 53.4±0.4
W vs. A | 49.8±0.4 | 52.2±0.4 | 53.1±0.5 | 51.0±0.1 | 56.3±0.2 | 51.2±0.5
Avg | 70.1 | 70.6 | 72.9 | 73.7 | 76.3 | 74.3

Office-31 dataset, reported in [79]:
Source vs. Target | AlexNet | Deep CORAL | CMD | DLID | AdaBN | DANN
A vs. W | 61.6 | 66.4 | 77.0±0.6 | 51.9 | 74.2 | 73.0
D vs. W | 95.4 | 95.7 | 96.3±0.4 | 78.2 | 95.7 | 96.4
W vs. D | 99.0 | 99.2 | 99.2±0.2 | 89.9 | 99.8 | 99.2
A vs. D | 63.8 | 66.8 | 79.6±0.6 | - | 73.1 | -
D vs. A | 51.1 | 52.8 | 63.8±0.7 | - | 59.8 | -
W vs. A | 49.8 | 51.5 | 63.3±0.6 | - | 57.4 | -
Avg | 70.1 | 72.1 | 79.9 | - | 76.7 | -

Office-31 dataset, reported in [26]:
Source vs. Target | AlexNet | DLID | DANN | Soft Labels | Domain Confusion | Confusion+Soft
A vs. W | 56.5±0.3 | 51.9 | 53.6±0.2 | 82.7±0.7 | 82.8±0.9 | 82.7±0.8
D vs. W | 92.4±0.3 | 78.2 | 71.2±0.0 | 95.9±0.6 | 95.6±0.4 | 95.7±0.5
W vs. D | 93.6±0.2 | 89.9 | 83.5±0.0 | 98.3±0.3 | 97.5±0.2 | 97.6±0.2
A vs. D | 64.6±0.4 | - | - | 84.9±1.2 | 85.9±1.1 | 86.1±1.2
D vs. A | 47.6±0.1 | - | - | 66.0±0.5 | 66.2±0.4 | 66.2±0.3
W vs. A | 42.7±0.1 | - | - | 65.2±0.6 | 64.9±0.5 | 65.0±0.5
Avg | 66.2 | - | - | 82.17 | 82.13 | 82.22

MNIST, USPS and SVHN digit datasets, reported in [58]:
Source vs. Target | VGG-16 | DANN | CoGAN | ADDA
M vs. U | 75.2±1.6 | 77.1±1.8 | 91.2±0.8 | 89.4±0.2
U vs. M | 57.1±1.7 | 73.0±2.0 | 89.1±0.8 | 90.1±0.8
S vs. M | 60.1±1.1 | 73.9 | - | 76.0±1.8
Amazon, has 2,817 labeled images over its 31 classes, which consist of objects commonly encountered in office settings. Using this dataset, previous works can show the performance of their methods across all six possible DA tasks. [37] reported comparison experiments among the standard AlexNet [8], the DANN method [55], and the MMD algorithm and its variations, such as DDC [39], DAN [38], JAN [37] and RTN [32]. Zellinger et al. [79] evaluated their proposed CMD algorithm in comparison to other discrepancy-based methods (DDC, deep CORAL [41], DLID [50], AdaBN [44]) and the adversarial-based method DANN. [26] proposed an algorithm combining soft label loss and domain confusion loss and compared it with DANN and DLID under a supervised DA setting.
In [58], the MNIST (http://yann.lecun.com/exdb/mnist/) (M), USPS (http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html) (U), and SVHN (http://ufldl.stanford.edu/housenumbers/) (S) digit datasets (shown in Fig. 1(b)) are used for a cross-domain hand-written digit recognition task, and the baseline model is VGG-16 [12].

6.2. Face Recognition

The performance of face recognition degrades when the test data come from a domain that differs from the training data. [121] proposed a bi-shifting autoencoder network (BAE) for face recognition across view angle, ethnicity, and imaging sensor.
In BAE, source domain samples are shifted to the target domain, and sparse reconstruction with several local neighbors from the target domain is used to ensure its correctness, and vice versa. The single sample per person domain adaptation network (SSPP-DAN) in [122] generates synthetic images with varying poses to increase the number of samples in the source domain and bridges the gap between the synthetic and source domains by adversarial training with a GRL in real-world face recognition. [1] improved the performance of video face recognition by using an adversarial-based approach with large-scale unlabeled videos, labeled still images and synthesized images. Considering that age variations are a difficult problem for smile detection and that networks trained on the current benchmarks do not perform well on young children, Xia et al. [123] applied DAN [38] and JAN [37] (mentioned in Section 4.1.1) to two baseline deep models, i.e., AlexNet and ResNet, to address this problem.

Figure 17: The single sample per person domain adaptation network (SSPP-DAN) architecture [122].
6.3. Object Detection

Recent progress in object detection is driven by R-CNN-style detectors ([124], [10], [125]). They are composed of a window selection mechanism and classifiers that are pre-trained on labeled bounding boxes, using the features extracted from CNNs. At test time, the classifier decides whether a region obtained by sliding windows contains the object. Although the R-CNN algorithm is effective, a large amount of bounding box labeled data is required to train each detection category. To solve the problem of lacking labeled data, and considering the window selection mechanism as domain independent, deep DA methods can be used in the classifiers to adapt to the target domain.
Because R-CNNs train classifiers on regions just as in classification, weakly labeled data (such as image-level class labels) are directly useful for the detector. Most works learn the detector with limited bounding box labeled data and massive weakly labeled data. The large-scale detection through adaptation (LSDA) framework [126] trains a classification layer for the target domain and then uses a pre-trained source model along with output layer adaptation techniques to update the target classification parameters. [127] exploited the relatedness between labeled source objects and target objects and transferred the bounding box labeled information from source objects to target objects accordingly. Extending [126] and [127], Tang et al. [128] transferred visual similarity (based on the LSDA model) and semantic similarity (based on word vectors) for training an object detector on weakly labeled categories. [129] incorporated both an image-level and an instance-level adaptation component into Faster R-CNN, and detectors can also be progressively fine-tuned with domain-transfer samples and pseudo-labeled samples.
6.4. Semantic Segmentation

Fully convolutional network models (FCNs) for dense prediction have proven successful for semantic segmentation, but their performance also degrades under domain shift. Therefore, some work has explored using weak labels to improve the performance of semantic segmentation. Hong et al. [131] used a novel encoder-decoder architecture with an attention model by transferring weak class labeled knowledge in the source domain, while [132, 133] transferred weak object location knowledge. Other work improves segmentation performance on real images with the help of virtual ones, using the global label distribution loss of the images and the local label distribution loss of the landmark superpixels in the target domain to regularize the fine-tuning.
Rather than adapting in the feature space, [138] used a GAN to address domain shift, in which a generator projects the features to the image space and a discriminator operates on this projected image space.
6.5. Image-to-Image Translation

Deep DA methods can also achieve image-to-image translation. In particular, when the feature spaces of the source and target images are not the same, image-to-image translation should be performed by heterogeneous DA. Most approaches to image-to-image translation use a dataset of paired images and incorporate a DA algorithm into generative networks. Isola et al.
[53] proposed the pix2pix framework, which uses a conditional GAN to learn a mapping from source to target images. Tzeng et al. [56] utilized a domain confusion loss and a pairwise loss to adapt from simulation to real-world data on a PR2 robot. However, several other methods address the unpaired setting, such as CoGAN [51], cycle GAN [63], dual GAN [62] and disco GAN [64].
Matching the statistical distribution by fine-tuning a deep network is another way to achieve image-to-image translation. Gatys et al. [139] fine-tuned the CNN to achieve DA by a total loss that is a linear combination of the content loss and the style loss, such that the target image is rendered in the style of the source image while maintaining the content. The content loss minimizes the mean squared difference of the feature representations between the original image and the generated image in the higher layers, while the style loss minimizes the element-wise mean squared difference between their Gram matrices on each layer. [46] demonstrated that matching the Gram matrices of feature maps is equivalent to minimizing the MMD. Rather than MMD, [42] proposed a deep generative correlation alignment network (DGCAN) that bridges the domain discrepancy between CAD synthetic and real images by applying the content and CORAL losses to different layers.
6.6. Person Re-Identification

Person re-identification (re-ID) has become increasingly popular. Given video sequences of a person, person re-ID recognizes whether this person has been captured by another camera, to compensate for the limitations of fixed devices. Recently, deep DA methods have been applied to re-ID when models trained on one dataset are used directly on another. Xiao et al. [48] learned generic feature representations from multiple datasets with the domain-guided dropout algorithm. The similarity-preserving generative adversarial network (SPGAN) [140] translated the labeled source images to the target domain, preserving self-similarity and domain-dissimilarity in an unsupervised manner, and then trained re-ID models with the translated images using supervised feature learning methods.
6.7. Image Captioning

Recently, image captioning, which automatically describes an image with a natural sentence, has been an emerging challenge in computer vision and natural language processing. Due to the lack of paired image-sentence training data, DA leverages different types of data in other source domains to tackle this challenge. Chen et al. [141] proposed a novel adversarial training procedure (captioner vs. critics) for cross-domain image captioning using paired source data and unpaired target data. One captioner adapts the sentence style from the source to the target domain, whereas two critics, namely the domain critic and the multi-modal critic, aim at distinguishing them. Zhao et al. [142] fine-tuned the pre-trained source model on limited data in the target domain via a dual learning mechanism.
7. Conclusion
In a broad sense, deep DA utilizes deep networks to enhance the performance of DA, whereas in a narrow sense it embeds DA directly into the deep learning pipeline. In this survey paper, we have focused on this narrow definition, and we have reviewed deep DA techniques on visual categorization tasks.
Deep DA is classified as homogeneous DA and heterogeneous DA, and it can be further divided into supervised, semi-supervised and unsupervised settings. The first setting is the simplest but is generally limited by the need for labeled data; thus, most previous works focused on the unsupervised case. Semi-supervised deep DA is a hybrid method that combines the methods of the supervised and unsupervised settings.
Furthermore, the approaches of deep DA can be classified into one-step DA and multi-step DA considering the distance between the source and target domains. When the distance is small, one-step DA can be used based on the training loss; it consists of the discrepancy-based approach, the adversarial-based approach, and the reconstruction-based approach. When the source and target domains are not directly related, multi-step (or transitive) DA can be used. The key of multi-step DA is to select and utilize intermediate domains that bridge the source and target domains.
Although deep DA has achieved success recently, many issues remain to be addressed. First, most existing algorithms focus on homogeneous deep DA, which assumes that the feature spaces of the source and target domains are the same. However, this assumption may not hold in many applications. We expect to transfer knowledge without this severe limitation and to take advantage of existing datasets to help with more tasks. Heterogeneous deep DA may attract increasing attention in the future.
In addition, deep DA techniques have been successfully applied to many real-world applications, such as face recognition, object detection, semantic segmentation and person re-identification. How to achieve these tasks with no or a very limited amount of data is probably one of the main challenges that deep DA should address in the next few years.
Finally, since existing deep DA methods aim at aligning marginal distributions, they commonly assume a shared label space across the source and target domains. However, in realistic scenarios, the images of the source and target domains may come from different sets of categories, or only a few categories of interest may be shared. Recently, some papers [92, 143, 144] have begun to focus on this issue, and we believe it is worthy of more attention.
8. Acknowledgements
This work was partially supported by the National Natural Science Foundation of China under Grant Nos. 61573068, 61471048, and 61375031, and the Beijing Nova Program under Grant No. Z161100004916088.
References
[1] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker, “Unsuper-
vised domain adaptation for face recognition in unlabeled videos,” arXiv preprint
arXiv:1708.02191, 2017.
[3] W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for per-
sonalized facial action unit detection,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2013, pp. 3515–3522.
[4] B. Gong, K. Grauman, and F. Sha, “Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation,” in International Conference on Machine Learning, 2013, pp. 222–230.
[5] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer
component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp.
199–210, 2011.
[6] M. Gheisari and M. S. Baghshah, “Unsupervised domain adaptation via representation learning and adaptive classifier learning,” Neurocomputing, vol. 165, pp. 300–311, 2015.
[7] S. Pachori, A. Deshpande, and S. Raman, “Hashing in the zero shot framework with domain adaptation,” Neurocomputing, 2018.
[9] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
human-level performance in face verification,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2014, pp. 1701–1708.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep
neural network architectures and their applications,” Neurocomputing, vol. 234, pp.
11–26, 2017.
[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp. 3371–3408, 2010.
[19] L. Shao, F. Zhu, and X. Li, “Transfer learning for visual categorization: A survey,”
IEEE transactions on neural networks and learning systems, vol. 26, no. 5, pp.
1019–1034, 2015.
[22] J. Zhang, W. Li, and P. Ogunbona, “Transfer learning for cross-dataset recognition:
A survey,” 2017.
[23] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,” arXiv preprint arXiv:1702.05374, 2017.
[24] B. Tan, Y. Song, E. Zhong, and Q. Yang, “Transitive transfer learning,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2015, pp. 1155–1164.
[30] J. Hu, J. Lu, and Y.-P. Tan, “Deep transfer metric learning,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325–333.
[31] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
arXiv preprint arXiv:1503.02531, 2015.
[32] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Unsupervised domain adaptation with residual transfer networks,” in Advances in Neural Information Processing Systems, 2016, pp. 136–144.
[33] X. Zhang, F. X. Yu, S.-F. Chang, and S. Wang, “Deep transfer network: Unsupervised domain adaptation,” arXiv preprint arXiv:1503.00591, 2015.
[34] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind the class weight bias: Weighted maximum mean discrepancy for unsupervised domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2272–2281.
[36] W. Ge and Y. Yu, “Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] M. Long, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation networks,” in International Conference on Machine Learning, 2017, pp. 2208–2217.
[38] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep adaptation networks,” in International Conference on Machine Learning, 2015, pp. 97–105.
[39] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion:
Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.
[40] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neural networks for object recognition,” in Pacific Rim International Conference on Artificial Intelligence. Springer, 2014, pp. 898–904.
[41] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adapta-
tion,” in Computer Vision–ECCV 2016 Workshops. Springer, 2016, pp. 443–450.
[42] X. Peng and K. Saenko, “Synthetic to real adaptation with deep generative corre-
lation alignment networks,” arXiv preprint arXiv:1701.05524, 2017.
[43] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, “Supervised representation learning: Transfer learning with deep autoencoders,” in IJCAI, 2015, pp. 4119–4125.
[44] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting batch normalization for practical domain adaptation,” arXiv preprint arXiv:1603.04779, 2016.
[45] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” arXiv preprint arXiv:1703.06868, 2017.
[46] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” arXiv
preprint arXiv:1701.01036, 2017.
[47] A. Rozantsev, M. Salzmann, and P. Fua, “Beyond sharing weights for deep domain
adaptation,” arXiv preprint arXiv:1603.06432, 2016.
[48] T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representations with domain guided dropout for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1249–1258.
[49] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with residual adapters,” in Advances in Neural Information Processing Systems, 2017.
[50] S. Chopra, S. Balakrishnan, and R. Gopalan, “Dlid: Deep learning for domain adaptation by interpolating between domains,” in ICML Workshop on Challenges in Representation Learning, 2013.
[51] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in Neural Information Processing Systems, 2016, pp. 469–477.
[53] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with
conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.
[54] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
[55] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropagation,” in International Conference on Machine Learning, 2015, pp. 1180–1189.
[56] E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and T. Darrell, “Adapting deep visuomotor representations with weak pairwise constraints,” CoRR, vol. abs/1511.07111, 2015.
[57] K.-C. Peng, Z. Wu, and J. Ernst, “Zero-shot deep domain adaptation,” arXiv
preprint arXiv:1707.01922, 2017.
[62] Z. Yi, H. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual learning for image-to-image translation,” arXiv preprint arXiv:1704.02510, 2017.
[63] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation
using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
[64] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover cross-domain relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192, 2017.
[65] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer learning from deep features for remote sensing and poverty mapping,” 2015.
[66] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
[67] J. Hoffman, E. Tzeng, J. Donahue, Y. Jia, K. Saenko, and T. Darrell,
“One-shot adaptation of supervised deep convolutional models,” arXiv preprint
arXiv:1312.6204, 2013.
[70] L. Zhang, Z. He, and Y. Liu, “Deep object recognition across domains based on adaptive extreme learning machine,” Neurocomputing, vol. 239, pp. 194–203, 2017.
[71] H. Lu, L. Zhang, Z. Cao, W. Wei, K. Xian, C. Shen, and A. van den Hengel, “When unsupervised domain adaptation meets tensor representations,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
[72] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in
deep neural networks?” in Advances in neural information processing systems, 2014,
pp. 3320–3328.
[74] X. Wang, X. Duan, and X. Bai, “Deep sketch feature for cross-domain image re-
trieval,” Neurocomputing, vol. 207, pp. 387–397, 2016.
[75] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 951–958.
[76] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J.
Smola, “Integrating structured biological data by kernel maximum mean discrep-
ancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
[77] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation.”
in AAAI, vol. 6, no. 7, 2016, p. 8.
[81] X. Shu, G.-J. Qi, J. Tang, and J. Wang, “Weakly-shared deep transfer networks
for heterogeneous-domain knowledge propagation,” in Proceedings of the 23rd ACM
international conference on Multimedia. ACM, 2015, pp. 35–44.
[82] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” in International Conference on Machine
Learning, 2015, pp. 448–456.
[83] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò, “Autodial: Automatic
domain alignment layers,” in International Conference on Computer Vision, 2017.
[84] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis,” arXiv preprint arXiv:1701.02096, 2017.
[85] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier do-
main generalization,” in Computer Vision (ICCV), 2017 IEEE International Con-
ference on. IEEE, 2017, pp. 5543–5551.
[86] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition: An unsupervised approach,” in Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011, pp. 999–1006.
[87] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 2066–2073.
[89] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, “Pixel-level domain transfer,”
in European Conference on Computer Vision. Springer, 2016, pp. 517–532.
[91] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image
using a multi-scale deep network,” in Advances in neural information processing
systems, 2014, pp. 2366–2374.
[92] Z. Cao, M. Long, J. Wang, and M. I. Jordan, “Partial transfer learning with selective
adversarial networks,” arXiv preprint arXiv:1707.07901, 2017.
[93] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto, “Few-shot adversarial domain adaptation,” in Advances in Neural Information Processing Systems, 2017, pp. 6673–6683.
[94] R. Volpi, P. Morerio, S. Savarese, and V. Murino, “Adversarial feature augmentation
for unsupervised domain adaptation,” arXiv preprint arXiv:1711.08561, 2017.
[96] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation learning for domain adaptation,” in AAAI, 2018.
[98] Y. Bengio, “Learning deep architectures for ai,” Foundations and Trends in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.
[99] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment classification: A deep learning approach,” in International Conference on Machine Learning, 2011, pp. 513–520.
[101] J.-C. Tsai and J.-T. Chien, “Adversarial domain separation and adaptation,” in
Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International
Workshop on. IEEE, 2017, pp. 1–6.
[102] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma, “Dual learning for
machine translation,” in Advances in Neural Information Processing Systems, 2016,
pp. 820–828.
[103] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[104] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian
generative adversarial networks,” in European Conference on Computer Vision.
Springer, 2016, pp. 702–716.
[105] L. Duan, D. Xu, and I. Tsang, “Learning with augmented features for heterogeneous
domain adaptation,” arXiv preprint arXiv:1206.4660, 2012.
[106] C. Wang and S. Mahadevan, “Heterogeneous domain adaptation using manifold alignment,” in IJCAI Proceedings - International Joint Conference on Artificial Intelligence, vol. 22, no. 1, 2011, p. 1541.
[107] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, “Heterogeneous domain adaptation for multiple classes,” in Artificial Intelligence and Statistics, 2014, pp. 1095–1103.
[108] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain adaptation using asymmetric kernel transforms,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1785–1792.
[109] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models
to new domains,” in European conference on computer vision. Springer, 2010, pp.
213–226.
[110] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for supervision trans-
fer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, 2016, pp. 2827–2836.
[112] P. Mittal, M. Vatsa, and R. Singh, “Composite sketch recognition via deep network - a transfer learning approach,” in Biometrics (ICB), 2015 International Conference on. IEEE, 2015, pp. 251–256.
[113] X. Liu, L. Song, X. Wu, and T. Tan, “Transferring deep representation for nir-vis
heterogeneous face recognition,” in Biometrics (ICB), 2016 International Confer-
ence on. IEEE, 2016, pp. 1–8.
[114] W.-Y. Chen, T.-M. H. Hsu, Y.-H. H. Tsai, Y.-C. F. Wang, and M.-S. Chen, “Transfer neural trees for heterogeneous domain adaptation,” in European Conference on Computer Vision. Springer, 2016, pp. 399–414.
[115] S. Rota Bulo and P. Kontschieder, “Neural decision forests for semantic image labelling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 81–88.
[116] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo, “Deep neural decision
forests,” in Proceedings of the IEEE International Conference on Computer Vision,
2015, pp. 1467–1475.
[119] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “Stack-
gan: Text to photo-realistic image synthesis with stacked generative adversarial
networks,” in IEEE Int. Conf. Comput. Vision (ICCV), 2017, pp. 5907–5915.
[121] M. Kan, S. Shan, and X. Chen, “Bi-shifting auto-encoder for unsupervised domain adaptation,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3846–3854.
[122] S. Hong, W. Im, J. Ryu, and H. S. Yang, “Sspp-dan: Deep domain adaptation network for face recognition with single sample per person,” arXiv preprint arXiv:1702.04069, 2017.
[123] Y. Xia, D. Huang, and Y. Wang, “Detecting smiles of young children via deep
transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 1673–1681.
[124] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on
computer vision, 2015, pp. 1440–1448.
[125] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
[127] M. Rochan and Y. Wang, “Weakly supervised localization of novel objects using
appearance transfer,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2015, pp. 4315–4324.
[128] Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen, “Large scale
semi-supervised object detection using visual and semantic knowledge transfer,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 2119–2128.
[129] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster
r-cnn for object detection in the wild,” arXiv preprint arXiv:1803.03243, 2018.
[130] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-supervised object detection through progressive domain adaptation,” arXiv preprint arXiv:1803.11365, 2018.
[131] S. Hong, J. Oh, H. Lee, and B. Han, “Learning transferrable knowledge for semantic segmentation with deep convolutional neural network,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3204–3212.
[132] A. Kolesnikov and C. H. Lampert, “Seed, expand and constrain: Three principles
for weakly-supervised image segmentation,” in European Conference on Computer
Vision. Springer, 2016, pp. 695–711.
[133] W. Shimoda and K. Yanai, “Distinct class-specific saliency maps for weakly supervised semantic segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 218–234.
[134] J. Hoffman, D. Wang, F. Yu, and T. Darrell, “Fcns in the wild: Pixel-level adversarial and constraint-based adaptation,” arXiv preprint arXiv:1612.02649, 2016.
[135] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
[136] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. F. Wang, and M. Sun,
“No more discrimination: Cross city adaptation of road scene segmenters,” arXiv
preprint arXiv:1704.08509, 2017.
[137] Y. Chen, W. Li, and L. Van Gool, “Road: Reality oriented adaptation for semantic
segmentation of urban scenes,” arXiv preprint arXiv:1711.11556, 2017.
[139] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional
neural networks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 2414–2423.
[140] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” arXiv preprint arXiv:1711.07027, 2017.
[141] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun, “Show,
adapt and tell: Adversarial training of cross-domain image captioner,” in The IEEE
International Conference on Computer Vision (ICCV), vol. 2, 2017.
[142] W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao, “Dual learning for cross-domain image captioning,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 29–38.
[143] P. P. Busto and J. Gall, “Open set domain adaptation,” in The IEEE International Conference on Computer Vision (ICCV), 2017.
[144] J. Zhang, Z. Ding, W. Li, and P. Ogunbona, “Importance weighted adversarial nets for partial domain adaptation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
Mei Wang received the B.E. degree in information and communication engineering from the Dalian University of Technology (DUT), Dalian, China, in 2013, and the M.E. degree in communication engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China.
Weihong Deng received the B.E. degree in information engineering and the Ph.D. degree in signal and information processing from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2004 and 2009, respectively. From Oct. 2007 to Dec. 2008, he was a postgraduate exchange student in the School of Information Technologies, University of Sydney, Australia. He is currently a professor in the School of Information and Communication Engineering, BUPT. He serves as associate editor for IEEE Access and guest editor for the Image and Vision Computing Journal, and he is a reviewer for dozens of international journals.