
Communicated by Dr. Ivor Tsang

Accepted Manuscript

Deep Visual Domain Adaptation: A Survey

Mei Wang, Weihong Deng

PII: S0925-2312(18)30668-4
DOI: 10.1016/j.neucom.2018.05.083
Reference: NEUCOM 19644

To appear in: Neurocomputing

Received date: 8 January 2018


Revised date: 22 May 2018
Accepted date: 24 May 2018

Please cite this article as: Mei Wang, Weihong Deng, Deep Visual Domain Adaptation: A Survey,
Neurocomputing (2018), doi: 10.1016/j.neucom.2018.05.083

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service
to our customers we are providing this early version of the manuscript. The manuscript will undergo
copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please
note that during the production process errors may be discovered which could affect the content, and
all legal disclaimers that apply to the journal pertain.

Deep Visual Domain Adaptation: A Survey


Mei Wang, Weihong Deng∗
School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China
Abstract

Deep domain adaptation has emerged as a new learning technique to address the lack of massive amounts of labeled data. Compared to conventional methods, which learn shared feature subspaces or reuse important source instances with shallow representations, deep domain adaptation methods leverage deep networks to learn more transferable representations by embedding domain adaptation in the pipeline of deep learning. There have been comprehensive surveys for shallow domain adaptation, but few timely reviews of the emerging deep learning based methods. In this paper, we provide a comprehensive survey of deep domain adaptation methods for computer vision applications with four major contributions. First, we present a taxonomy of different deep domain adaptation scenarios according to the properties of data that define how two domains are diverged. Second, we summarize deep domain adaptation approaches into several categories based on training loss, and briefly analyze and compare the state-of-the-art methods under these categories. Third, we overview the computer vision applications that go beyond image classification, such as face recognition, semantic segmentation and object detection. Fourth, some potential deficiencies of current methods and several future directions are highlighted.

Keywords: Deep domain adaptation, deep networks, transfer learning, computer vision applications

∗Corresponding author. Tel: +86 10 62283059; Fax: +86 10 62285019
Email address: [email protected] (Weihong Deng)

Preprint submitted to Elsevier May 29, 2018

IP
1. Introduction

CR
Over the past few years, machine learning has achieved great success and
has benefited real-world applications. However, collecting and annotating

US
datasets for every new task and domain are extremely expensive and time-
consuming processes, sufficient training data may not always be available.
AN
Fortunately, the big data era makes a large amount of data available for
other domains and tasks. For instance, although large-scale labeled video
databases that are publicly available only contain a small number of samples,
M

statistically, the YouTube face dataset (YTF) consists of 3.4K videos. The
number of labeled still images is more than sufficient [1]. Hence, skillfully
ED

using the auxiliary data for the current task with scarce data will be helpful
for real-world applications.
PT

However, due to many factors (e.g., illumination, pose, and image qual-
ity), there is always a distribution change or domain shift between two do-
CE

mains that can degrade the performance, as shown in Fig. 1. Mimicking the
human vision system, domain adaptation (DA) is a particular case of trans-
AC

fer learning (TL) that utilizes labeled data in one or more relevant source
domains to execute new tasks in a target domain. Over the past decades,
various shallow DA methods have been proposed to solve a domain shift be-
tween the source and target domains. The common algorithms for shallow

2
ACCEPTED MANUSCRIPT

Figure 1: (a) Some object images from the "Bike" and "Laptop" categories in the Amazon, DSLR, Webcam, and Caltech-256 databases. (b) Some digit images from the MNIST, USPS, and SVHN databases. (c) Some face images from the LFW, BCS and CUFS databases. Real-world computer vision applications, such as face recognition, must learn to adapt to distributions specific to each domain.

The common algorithms for shallow DA can mainly be categorized into two classes: instance-based DA [2, 3] and feature-based DA [4, 5, 6, 7]. The first class reduces the discrepancy by reweighting the source samples and trains on the weighted source samples. For the second class, a common shared space is generally learned in which the distributions of the two datasets are matched.

Recently, neural-network-based deep learning approaches have achieved many inspiring results in visual categorization applications, such as image classification [8], face recognition [9], and object detection [10]. Simulating the perception of the human brain, deep networks can represent high-level abstractions by multiple layers of non-linear transformations. Existing deep network architectures [11] include convolutional neural networks (CNNs) [8, 12, 13, 14], deep belief networks (DBNs) [15], and stacked autoencoders (SAEs) [16], among others. Although some studies have shown that deep networks can learn more transferable representations that disentangle the exploratory factors of variations underlying the data samples and group features hierarchically in accordance with their relatedness to invariant factors, Donahue et al. [17] showed that a domain shift still affects their performance. The deep features eventually transition from general to specific, and the transferability of the representation sharply decreases in higher layers. Therefore, recent work has addressed this problem by deep DA, which combines deep learning and DA.

There have been other surveys on TL and DA over the past few years [18, 19, 20, 21, 22, 23]. Pan et al. [18] categorized TL under three subsettings, including inductive TL, transductive TL, and unsupervised TL, but they only studied homogeneous feature spaces. Shao et al. [19] categorized TL techniques into feature-representation-level knowledge transfer and classifier-level knowledge transfer. The survey written by Patel [21] only focused on DA, a subtopic of TL. [20] discussed 38 methods for heterogeneous TL that operate under various settings, requirements, and domains. Zhang et al. [22] were the first to summarize several transferring criteria in detail at the concept level. These five surveys only cover methodologies on shallow TL or DA. The work presented by Csurka et al. [23] briefly analyzed the state-of-the-art shallow DA methods and categorized the deep DA methods into three subsettings based on training loss: classification loss, discrepancy loss and adversarial loss. However, Csurka's work mainly focused on shallow methods, and it only discussed deep DA in image classification applications.

In this paper, we focus on analyzing and discussing deep DA methods. Specifically, the key contributions of this survey are as follows: 1) We present a taxonomy of different deep DA scenarios according to the properties of data that define how two domains are diverged. 2) Extending Csurka's work, we improve and detail the three subsettings (training with classification loss, discrepancy loss and adversarial loss) and summarize the different approaches used in different DA scenarios. 3) Considering the distance between the source and target domains, multi-step DA methods are studied and categorized into hand-crafted, feature-based and representation-based mechanisms. 4) We provide a survey of many computer vision applications, such as image classification, face recognition, style translation, object detection, semantic segmentation and person re-identification.

The remainder of this survey is structured as follows. In Section II, we first define some notations, and then we categorize deep DA into different settings (given in Fig. 2). In the next three sections, different approaches are discussed for each setting, which are given in detail in Table 1 and Table 2. Then, in Section VI, we introduce some successful computer vision applications of deep DA. Finally, the conclusion of this paper and a discussion of future works are presented in Section VII.

2. Overview

2.1. Notations and Definitions


In this section, we introduce some notations and definitions that are used in this survey. The notations and definitions match those of the survey papers by [18, 23] to maintain consistency across surveys. A domain $\mathcal{D}$ consists of a feature space $\mathcal{X}$ and a marginal probability distribution $P(X)$, where $X = \{x_1, ..., x_n\} \in \mathcal{X}$. Given a specific domain $\mathcal{D} = \{\mathcal{X}, P(X)\}$, a task $\mathcal{T}$ consists of a label space $\mathcal{Y}$ and an objective predictive function $f(\cdot)$, which can also be viewed as a conditional probability distribution $P(Y|X)$ from a probabilistic perspective. In general, we can learn $P(Y|X)$ in a supervised manner from the labeled data $\{x_i, y_i\}$, where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$.

Assume that we have two domains: the training dataset with sufficient labeled data is the source domain $\mathcal{D}^s = \{\mathcal{X}^s, P(X)^s\}$, and the test dataset with a small amount of labeled data or no labeled data is the target domain $\mathcal{D}^t = \{\mathcal{X}^t, P(X)^t\}$. The partially labeled part, $\mathcal{D}^{tl}$, and the unlabeled part, $\mathcal{D}^{tu}$, form the entire target domain, that is, $\mathcal{D}^t = \mathcal{D}^{tl} \cup \mathcal{D}^{tu}$. Each domain comes with its task: the former is $\mathcal{T}^s = \{\mathcal{Y}^s, P(Y^s|X^s)\}$, and the latter is $\mathcal{T}^t = \{\mathcal{Y}^t, P(Y^t|X^t)\}$. Similarly, $P(Y^s|X^s)$ can be learned from the source labeled data $\{x_i^s, y_i^s\}$, while $P(Y^t|X^t)$ can be learned from labeled target data $\{x_i^{tl}, y_i^{tl}\}$ and unlabeled target data $\{x_i^{tu}\}$.

2.2. Different Settings of Domain Adaptation

The case of traditional machine learning is $\mathcal{D}^s = \mathcal{D}^t$ and $\mathcal{T}^s = \mathcal{T}^t$. For TL, Pan et al. [18] summarized that the differences between different datasets can be caused by domain divergence $\mathcal{D}^s \neq \mathcal{D}^t$ (i.e., distribution shift or feature space difference) or task divergence $\mathcal{T}^s \neq \mathcal{T}^t$ (i.e., conditional distribution shift or label space difference), or both. Based on this summary, Pan et al. categorized TL into three main groups: inductive, transductive and unsupervised TL.

According to this classification, DA methods are transductive TL solutions with the assumption that the tasks are the same, i.e., $\mathcal{T}^s = \mathcal{T}^t$, and the differences are only caused by domain divergence, $\mathcal{D}^s \neq \mathcal{D}^t$. Therefore, DA can be split into two main categories based on different domain divergences (distribution shift or feature space difference): homogeneous and heterogeneous DA. Then, we can further categorize DA into supervised, semi-supervised and unsupervised DA in consideration of the labeled data of the target domain. The classification is given in Fig. 2.


• In the homogeneous DA setting, the feature spaces between the
CE

source and target domains are identical (X s = X t ) with the same di-
mension (ds = dt ). Hence, the source and target datasets are generally
AC

different in terms of data distributions (P (X)s 6= P (X)t ).


In addition, we can further categorize the homogeneous DA setting into
three cases:

7
ACCEPTED MANUSCRIPT

Figure 2: An overview of different settings of domain adaptation

1. In supervised DA, a small amount of labeled target data, $\mathcal{D}^{tl}$, is present. However, the labeled data are commonly not sufficient for tasks.

2. In semi-supervised DA, both limited labeled data, $\mathcal{D}^{tl}$, and redundant unlabeled data, $\mathcal{D}^{tu}$, in the target domain are available in the training stage, which allows the networks to learn the structure information of the target domain.

3. In unsupervised DA, no labeled but sufficient unlabeled target domain data, $\mathcal{D}^{tu}$, are observable when training the network.

• In the heterogeneous DA setting, the feature spaces between the source and target domains are nonequivalent ($\mathcal{X}^s \neq \mathcal{X}^t$), and the dimensions may also generally differ ($d^s \neq d^t$). Similar to the homogeneous setting, the heterogeneous DA setting can also be divided into supervised, semi-supervised and unsupervised DA.

All of the above DA settings assume that the source and target domains are directly related; thus, transferring knowledge can be accomplished in one step. We call them one-step DA. In reality, however, this assumption occasionally does not hold; there may be little overlap between the two domains, and

performing one-step DA will not be effective. Fortunately, there are some intermediate domains that are able to draw the source and target domains closer than their original distance. Thus, we use a series of intermediate bridges to connect two seemingly unrelated domains and then perform one-step DA via this bridge, named multi-step (or transitive) DA [24, 25]. For example, face images and vehicle images are dissimilar to each other due to different shapes and other aspects, and thus one-step DA would fail. However, some intermediate images, such as 'football helmet' images, can be introduced as an intermediate domain to achieve a smooth knowledge transfer. Fig. 3 shows the differences between the learning processes of one-step and multi-step DA techniques.

Figure 3: Different learning processes between (a) traditional machine learning, (b) one-step domain adaptation and (c) multi-step domain adaptation [18].

3. Approaches of Deep Domain Adaptation

In a broad sense, deep DA is a method that utilizes a deep network to enhance the performance of DA. Under this definition, shallow methods with deep features [17, 67, 68, 69, 70] can be considered as a deep DA approach.

Table 1: Different Deep Approaches to One-Step DA

• Discrepancy-based: fine-tuning the deep network with labeled or unlabeled target data to diminish the domain shift. Subsettings: class criterion [26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]; statistic criterion [37, 34, 38, 32, 39, 40, 41, 42, 43]; architecture criterion [44, 45, 46, 47, 48, 49]; geometric criterion [50].

• Adversarial-based: using domain discriminators to encourage domain confusion through an adversarial objective. Subsettings: generative models [51, 52, 53]; non-generative models [26, 54, 55, 56, 57, 58].

• Reconstruction-based: using the data reconstruction as an auxiliary task to ensure feature invariance. Subsettings: encoder-decoder reconstruction [59, 60, 61, 43]; adversarial reconstruction [62, 63, 64].

Table 2: Different Deep Approaches to Multi-Step DA

• Hand-crafted: users determine the intermediate domains based on experience [65].

• Instance-based: selecting certain parts of data from the auxiliary datasets to compose the intermediate domains [25, 50].

• Representation-based: freezing the weights of one network and using its intermediate representations as input to the new network [66].

DA is adopted by shallow methods, whereas deep networks only extract vectorial features and are not helpful for transferring knowledge directly. For example, [71] extracted the convolutional activations from a CNN as the tensor representation, and then performed tensor-aligned invariant subspace learning to realize DA. This approach reliably outperforms current state-of-the-art approaches based on traditional hand-crafted features because sufficiently representational and transferable features can be extracted through deep networks, which work better on discrimination tasks [17].

In a narrow sense, deep DA is based on deep learning architectures designed for DA and can obtain a firsthand effect from deep networks via back-propagation. The intuitive idea is to embed DA into the process of learning representation and to learn a deep feature representation that is both semantically meaningful and domain invariant. With "good" feature representations, the performance of the target task would improve significantly. In this paper, we focus on the narrow definition and discuss how to utilize deep networks to learn "good" feature representations with extra training criteria.

3.1. Categorization of One-Step Domain Adaptation

In one-step DA, the deep approaches can be summarized into three cases, following [23]. Table 1 shows these three cases with brief descriptions. The first case is the discrepancy-based deep DA approach, which assumes that fine-tuning the deep network model with labeled or unlabeled target data can diminish the shift between the two domains. Class criterion, statistic criterion, architecture criterion and geometric criterion are the four major techniques for performing fine-tuning.

• Class Criterion: uses the class label information as a guide for transferring knowledge between different domains. When the labeled samples from the target domain are available in supervised DA, soft label and metric learning are always effective [26, 27, 30, 31, 28]. When such samples are unavailable, some other techniques can be adopted to substitute for class labeled data, such as pseudo labels [32, 33, 34, 29] and attribute representation [35, 26].

• Statistic Criterion: aligns the statistical distribution shift between the source and target domains using some mechanisms. The most commonly used methods for comparing and reducing distribution shift are maximum mean discrepancy (MMD) [37, 34, 38, 32, 39, 40], correlation alignment (CORAL) [41, 42], Kullback-Leibler (KL) divergence [43] and $\mathcal{H}$ divergence, among others.

• Architecture Criterion: aims at improving the ability to learn more transferable features by adjusting the architectures of deep networks. The techniques that are proven to be cost effective include adaptive batch normalization (BN) [44, 45, 46], weakly related weights [47], domain-guided dropout [48], and so forth.

• Geometric Criterion: bridges the source and target domains according to their geometrical properties. This criterion assumes that the relationship of geometric structures can reduce the domain shift [50].

[50].
The second case can be referred to as an adversarial-based deep DA ap-

CR
proach [54]. In this case, a domain discriminator that classifies whether a
data point is drawn from the source or target domain is used to encourage

US
domain confusion through an adversarial objective to minimize the distance
between the empirical source and target mapping distributions. Further-
more, the adversarial-based deep DA approach can be categorized into two
AN
cases based on whether there are generative models.
• Generative Models: combine the discriminative model with a gen-
M

erative component in general based on generative adversarial networks


(GANs). One of the typical cases is to use source images, noise vectors
ED

or both to generate simulated samples that are similar to the target


samples and preserve the annotation information of the source domain
PT

[51, 52, 53].


• Non-Generative Models: rather than generating models with input
CE

image distributions, the feature extractor learns a discriminative rep-


resentation using the labels in the source domain and maps the target
data to the same space through a domain-confusion loss, thus resulting
AC

in the domain-invariant representations [58, 26, 54, 55, 56].


The third case can be referred to as a reconstruction-based DA approach,
which assumes that the data reconstruction of the source or target samples

13
ACCEPTED MANUSCRIPT

can be helpful for improving the performance of DA. The reconstructor can
ensure both specificity of intra-domain representations and indistinguishabil-
ity of inter-domain representations.

• Encoder-Decoder Reconstruction: by using stacked autoencoders (SAEs), encoder-decoder reconstruction methods combine the encoder network for representation learning with a decoder network for data reconstruction [59, 60, 61, 43].

• Adversarial Reconstruction: the reconstruction error is measured as the difference between the reconstructed and original images within each image domain by a cyclic mapping obtained via a GAN discriminator, such as dual GAN [62], cycle GAN [63] and disco GAN [64].

Table 3: Different Approaches Used in Different Domain Adaptation Settings

Approach (subsettings)                                                    Supervised DA   Unsupervised DA
Discrepancy-based (class, statistic, architecture, geometric criteria)         √               √
Adversarial-based (generative, non-generative models)                                          √
Reconstruction-based (encoder-decoder, adversarial models)                                     √

3.2. Categorization of Multi-Step Domain Adaptation

In multi-step DA, we first determine the intermediate domains that are more related to the source and target domains than their direct connection. Second, the knowledge transfer process is performed between the source, intermediate and target domains by one-step DA with less information loss. Thus, the key of multi-step DA is how to select and utilize the intermediate domains; the selection can fall into three categories following [18]: hand-crafted, feature-based and representation-based mechanisms.

• Hand-Crafted: users determine the intermediate domains based on experience [65].

• Instance-Based: selecting certain parts of data from the auxiliary datasets to compose the intermediate domains to train the deep network [25, 50].

• Representation-Based: transfer is enabled by freezing the previously trained network and using its intermediate representations as input to the new one [66].

4. One-Step Domain Adaptation

As mentioned in Section 2.1, the data in the target domain can be of three types, regardless of homogeneous or heterogeneous DA: 1) supervised DA with labeled data, 2) semi-supervised DA with labeled and unlabeled data and 3) unsupervised DA with unlabeled data. The second setting can be accomplished by combining the methods of settings 1 and 3; thus, we only focus on the first and third settings in this paper. The cases where the different approaches are mainly used for each DA setting are shown in Table 3. As shown, more work is focused on unsupervised scenarios because supervised DA has its limitations. When only a few labeled data in the target domain are available, using the source and target labeled data to train the parameters of models typically results in overfitting to the source distribution. In addition, the discrepancy-based approaches have been studied for years and have produced more methods in many research works, whereas the adversarial-based and reconstruction-based approaches are relatively new research topics but have recently been attracting more attention.

4.1. Homogeneous Domain Adaptation

4.1.1. Discrepancy-Based Approaches
Yosinski et al. [72] proved that transferable features learned by deep networks have limitations due to fragile co-adaptation and representation specificity and that fine-tuning can enhance generalization performance. Fine-tuning (which can also be viewed as a discrepancy-based deep DA approach) trains a base network with source data and then directly reuses the first n layers to construct a target network. The remaining layers of the target network are randomly initialized and trained with a loss based on discrepancy. During training, the first n layers of the target network can be fine-tuned or frozen depending on the size of the target dataset and its similarity to the source dataset [73]. Some common rules of thumb for navigating the four major scenarios are given in Table 4.

Table 4: Some Common Rules of Thumb for Deciding Whether the First n Layers Should Be Fine-tuned or Frozen [73]

Distance between                          The Size of the Target Dataset
Source and Target          Low                    Medium                 High
Low                        Freeze                 Try Freeze or Tune     Tune
Medium                     Try Freeze or Tune     Tune                   Tune
High                       Try Freeze or Tune     Tune                   Tune
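
As a concrete illustration of the freeze-or-tune decision in Table 4, the following is a minimal PyTorch sketch that freezes the first n layer blocks of a pre-trained source network and leaves the remaining layers trainable on target data; the helper name, the toy architecture and the block granularity are illustrative assumptions, not a prescribed implementation.

```python
import torch.nn as nn

def freeze_first_n_layers(model, n):
    """Freeze parameters of the first n child modules; later layers stay trainable."""
    for i, child in enumerate(model.children()):
        if i < n:
            for p in child.parameters():
                p.requires_grad = False

# Example: reuse the first 2 blocks of a (hypothetical) source network for the target task
net = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3), nn.ReLU()),   # layer block 1 (frozen)
    nn.Sequential(nn.Conv2d(16, 32, 3), nn.ReLU()),  # layer block 2 (frozen)
    nn.Sequential(nn.Flatten(), nn.LazyLinear(10)),  # randomly re-initialized head
)
freeze_first_n_layers(net, 2)
trainable = [p for p in net.parameters() if p.requires_grad]  # pass these to the optimizer
```
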

• Class Criterion

Figure 4: The average accuracy over the validation set for a network trained with different strategies. 1) Baseline B: the network is trained on dataset B. 2) BnB: the first n layers are reused from baseline B and frozen; the higher layers are trained on dataset B. 3) BnB+: the same as BnB, but all layers are fine-tuned. 4) AnB: the first n layers are reused from the network trained on dataset A and frozen; the higher layers are trained on dataset B. 5) AnB+: the same as AnB, but all layers are fine-tuned [72].

The class criterion is the most basic training loss in deep DA. After pre-training the network with source data, the remaining layers of the target model use the class label information as a guide to train the network. Hence, a small number of labeled samples from the target dataset is assumed to be available.

Ideally, the class label information is given directly in supervised DA. Most work commonly uses the negative log-likelihood of the ground truth class with softmax as the training loss, $L = -\sum_{i=0}^{N} y_i \log \hat{y}_i$ (where $\hat{y}_i$ are the softmax predictions of the model, which represent class probabilities) [26, 27, 30, 74]. To extend this, Hinton et al. [31] modified the softmax function to a soft label loss:

$$q_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)} \qquad (1)$$

where $z_i$ is the logit output computed for each class and $T$ is a temperature that is normally set to 1 in standard softmax but takes a higher value to produce a softer probability distribution over classes. By using it, much of the information about the learned function that resides in the ratios of very small probabilities can be obtained. For example, when recognizing digits, one version of a 2 may obtain a probability of $10^{-6}$ of being a 3 and $10^{-9}$ of being a 7; in other words, this version of a 2 looks more similar to a 3 than to a 7. Inspired by Hinton, [26] fine-tuned the network by simultaneously minimizing the domain confusion loss (belonging to adversarial-based approaches, which will be presented in Section 4.1.2) and the soft label loss. Using soft labels rather than hard labels can preserve the relationships between classes across domains. Gebru et al. [35] modified existing adaptation algorithms based on [26] and utilized soft label loss at the fine-grained class level, $L_{c_{soft}}$, and attribute level, $L_{a_{soft}}$.
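
To make Eq. (1) concrete, the following is a minimal PyTorch sketch of temperature-scaled soft labels in the spirit of Hinton et al. [31]; the function and variable names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between softened teacher and student distributions."""
    # Softened probabilities q_i = exp(z_i/T) / sum_j exp(z_j/T), as in Eq. (1)
    teacher_soft = F.softmax(teacher_logits / temperature, dim=1)
    student_log_soft = F.log_softmax(student_logits / temperature, dim=1)
    # Average negative log-likelihood under the softened teacher distribution
    return -(teacher_soft * student_log_soft).sum(dim=1).mean()

# Toy usage: a higher T flattens the distribution, exposing inter-class ratios
logits = torch.tensor([[10.0, 2.0, -1.0]])
print(F.softmax(logits / 1.0, dim=1))  # close to one-hot
print(F.softmax(logits / 4.0, dim=1))  # softer; preserves class relationships
```
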

Figure 5: Deep DA by combining domain confusion loss and soft label loss [26].

In addition to softmax loss, there are other methods that can be used as the training loss to fine-tune the target model in supervised DA. Embedding metric learning in deep networks is another method that can make the distance of samples from different domains with the same labels closer, while keeping those with different labels far apart. Based on this idea, [28] constructed the semantic alignment loss and the separation loss accordingly. Deep transfer metric learning was proposed by [30], which applies the marginal Fisher analysis criterion and the MMD criterion (described in Statistic Criterion) to minimize their distribution difference:

$$\min J = S_c^{(M)} - \alpha S_b^{(M)} + \beta D_{ts}^{(M)}(X^s, X^t) + \gamma \sum_{m=1}^{M} \left( \|W^{(m)}\|_F^2 + \|b^{(m)}\|_2^2 \right) \qquad (2)$$

where $\alpha$, $\beta$ and $\gamma$ are regularization parameters and $W^{(m)}$ and $b^{(m)}$ are the weights and biases of the $m$th layer of the network. $D_{ts}^{(M)}(X^s, X^t)$ is the MMD between representations of the source and target domains. $S_c$ and $S_b$ define the intra-class compactness and the inter-class separability.

However, what can we do if there is no class label information in the target domain directly? As we all know, humans can identify unseen classes given only a high-level description. For instance, when provided the description "tall brown animals with long necks", we are able to recognize giraffes. To imitate this ability of humans, [75] introduced high-level semantic attributes per class. Assume that $a^c = (a_1^c, ..., a_m^c)$ is the attribute representation for class $c$, which has fixed-length binary values with $m$ attributes over all the classes. The classifiers provide estimates of $p(a_m|x)$ for each attribute $a_m$. In the test stage, each target class $y$ obtains its attribute vector $a^y$ in a deterministic way, i.e., $p(a|y) = [[a = a^y]]$. By applying Bayes' rule, $p(y|a) = \frac{p(y)}{p(a^y)}[[a = a^y]]$, the posterior of a test class can be calculated as follows:

$$p(y|x) = \sum_{a \in \{0,1\}^M} p(y|a)p(a|x) = \frac{p(y)}{p(a^y)} \prod_{m=1}^{M} p(a_m^y|x) \qquad (3)$$

Gebru et al. [35] drew inspiration from these works and leveraged attributes to improve performance in the DA of fine-grained recognition. Multiple independent softmax losses simultaneously operate at the attribute and class levels to fine-tune the target model. To prevent the independent classifiers from producing conflicting labels at the attribute and class levels, an attribute consistency loss is also implemented.

Occasionally, when fine-tuning the network in unsupervised DA, a label of target data, which is called a pseudo label, can preliminarily be obtained based on the maximum posterior probability. Yan et al. [34] initialized the target model using the source data and then defined the class posterior probability $p(y_j^t = c|x_j^t)$ by the output of the target model. With $p(y_j^t = c|x_j^t)$, they assigned pseudo-label $\hat{y}_j^t$ to $x_j^t$ by $\hat{y}_j^t = \arg\max_c p(y_j^t = c|x_j^t)$. In [29], two different networks assign pseudo-labels to unlabeled samples, and another network is trained on these samples to obtain target discriminative representations. The deep transfer network (DTN) [33] used some base classifiers, e.g., SVMs and MLPs, to obtain the pseudo labels for the target samples in order to estimate the conditional distribution of the target samples and match both the marginal and the conditional distributions with the MMD criterion. When casting the classifier adaptation into the residual learning framework, [32] used the pseudo label to build the conditional entropy $E(\mathcal{D}^t, f^t)$, which ensures that the target classifier $f^t$ fits the target-specific structures well.

• Statistic Criterion

Although some discrepancy-based approaches search for pseudo labels, attribute labels or other substitutes for labeled target data, more work focuses on learning domain-invariant representations by minimizing the domain distribution discrepancy in unsupervised DA.

MMD is an effective metric for comparing the distributions between two datasets by a kernel two-sample test [76]. Given two distributions $s$ and $t$, the MMD is defined as follows:

$$MMD^2(s, t) = \sup_{\|\phi\|_{\mathcal{H}} \leq 1} \left\| \mathbb{E}_{x^s \sim s}[\phi(x^s)] - \mathbb{E}_{x^t \sim t}[\phi(x^t)] \right\|_{\mathcal{H}}^2 \qquad (4)$$

where $\phi$ represents the kernel function that maps the original data to a reproducing kernel Hilbert space (RKHS) and $\|\phi\|_{\mathcal{H}} \leq 1$ defines a set of functions in the unit ball of the RKHS $\mathcal{H}$.

Figure 6: Different approaches with the MMD metric. (a) The deep adaptation network (DAN) architecture [38], (b) the joint adaptation network (JAN) architecture [37] and (c) the residual transfer network (RTN) architecture [32].

Based on the above, Ghifary et al. [40] proposed a model that introduced the MMD metric into feedforward neural networks with a single hidden layer. The MMD metric is computed between the representations of each domain to reduce the distribution mismatch in the latent space. The empirical estimate of MMD is as follows:

$$MMD^2(\mathcal{D}^s, \mathcal{D}^t) = \left\| \frac{1}{M} \sum_{i=1}^{M} \phi(x_i^s) - \frac{1}{N} \sum_{j=1}^{N} \phi(x_j^t) \right\|_{\mathcal{H}}^2 \qquad (5)$$
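
As a concrete counterpart to the empirical estimate in Eq. (5), the following is a minimal PyTorch sketch of a (biased) MMD estimator with a Gaussian RBF kernel; the kernel choice, the fixed bandwidth and all names are illustrative assumptions rather than the exact implementations used in [40, 39, 38].

```python
import torch

def mmd2_rbf(xs, xt, sigma=1.0):
    """Biased empirical MMD^2 between batches xs (M, d) and xt (N, d)
    using a Gaussian RBF kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    def kernel(a, b):
        d2 = torch.cdist(a, b, p=2.0) ** 2  # pairwise squared Euclidean distances
        return torch.exp(-d2 / (2 * sigma ** 2))

    # E[k(xs, xs')] + E[k(xt, xt')] - 2 E[k(xs, xt)]
    return kernel(xs, xs).mean() + kernel(xt, xt).mean() - 2 * kernel(xs, xt).mean()

# Toy usage: two batches of features drawn from shifted distributions
xs = torch.randn(64, 128)
xt = torch.randn(64, 128) + 0.5
print(mmd2_rbf(xs, xt).item())  # a larger shift yields a larger MMD^2
```
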
Subsequently, Tzeng et al. [39] and Long et al. [38] extended MMD to deep CNN models and achieved great success. The deep domain confusion network (DDC) by Tzeng et al. [39] used two CNNs for the source and target domains with shared weights. The network is optimized for classification loss in the source domain, while the domain difference is measured by an adaptation layer with the MMD metric:

$$L = L_C(X^L, y) + \lambda\, MMD^2(X^s, X^t) \qquad (6)$$

where the hyperparameter $\lambda$ is a penalty parameter, $L_C(X^L, y)$ denotes the classification loss on the available labeled data, $X^L$, with the ground-truth labels, $y$, and $MMD^2(X^s, X^t)$ denotes the distance between the source and target data. DDC adapts only one layer of the network, resulting in a reduction in the transferability of multiple layers. Rather than using a single layer and linear MMD, Long et al. [38] proposed the deep adaptation network (DAN) that matches the shift in marginal distributions across domains by adding multiple adaptation layers and exploring multiple kernels, assuming that the conditional distributions remain unchanged. However, this assumption is rather strong in practical applications; in other words, the source classifier cannot be directly used in the target domain. To make it more generalized, the joint adaptation network (JAN) [37] aligns the shift in the joint distributions of input features and output labels in multiple domain-specific layers based on a joint maximum mean discrepancy (JMMD) criterion. [33] proposed DTN, where both the marginal and the conditional distributions are matched based on MMD. The shared feature extraction layer learns a subspace to match the marginal distributions of the source and the target samples, and the discrimination layer matches the conditional distributions by classifier transduction. In addition to adapting features using MMD, residual transfer networks (RTNs) [32] added a gated residual layer for classifier adaptation. More recently, [34] proposed a weighted MMD model that introduces an auxiliary weight for each class in the source domain when the class weights in the target domain are not the same as those in the source domain.
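
To sketch how a DDC-style objective (Eq. (6)) embeds the MMD penalty in the training loss, the following assumes the `mmd2_rbf` helper above and a shared-weight feature extractor; all module names and the toy architecture are illustrative assumptions, not the actual DDC implementation.

```python
import torch.nn as nn

# Hypothetical shared-weight networks: `features` maps inputs to embeddings,
# `classifier` maps embeddings to class logits.
features = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU())
classifier = nn.Linear(256, 10)
ce = nn.CrossEntropyLoss()
lam = 0.25  # penalty parameter lambda from Eq. (6)

def ddc_loss(xs, ys, xt):
    """Source classification loss plus an MMD penalty between domain features."""
    fs, ft = features(xs), features(xt)  # shared weights for both domains
    return ce(classifier(fs), ys) + lam * mmd2_rbf(fs, ft)
```
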

If $\phi$ is a characteristic kernel (e.g., a Gaussian kernel or Laplace kernel), MMD compares all orders of statistic moments. In contrast to MMD, CORAL [77] learned a linear transformation that aligns the second-order statistics between domains. Sun et al. [41] extended CORAL to deep neural networks (deep CORAL) with a nonlinear transformation:

$$L_{CORAL} = \frac{1}{4d^2} \|C_S - C_T\|_F^2 \qquad (7)$$

where $\|\cdot\|_F^2$ denotes the squared matrix Frobenius norm, and $C_S$ and $C_T$ denote the covariance matrices of the source and target data, respectively.
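
A minimal PyTorch sketch of the deep CORAL loss in Eq. (7), assuming batches of d-dimensional features from each domain; the unbiased covariance estimate and the names are illustrative assumptions.

```python
import torch

def coral_loss(fs, ft):
    """Deep CORAL loss between source features fs (M, d) and target features ft (N, d)."""
    d = fs.size(1)

    def covariance(x):
        # Unbiased covariance: (X - mean)^T (X - mean) / (n - 1)
        xm = x - x.mean(dim=0, keepdim=True)
        return xm.t() @ xm / (x.size(0) - 1)

    cs, ct = covariance(fs), covariance(ft)
    # Squared Frobenius norm of the covariance gap, scaled by 1 / (4 d^2)
    return (cs - ct).pow(2).sum() / (4 * d * d)
```
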
By the Taylor expansion of the Gaussian kernel, MMD can be viewed as minimizing the distance between the weighted sums of all raw moments [78]. The interpretation of MMD as a moment matching procedure motivated Zellinger et al. [79] to match the higher-order moments of the domain distributions, called central moment discrepancy (CMD). An empirical estimate of the CMD metric for the domain discrepancy in the activation space $[a, b]^N$ is given by

$$CMD_K(X^s, X^t) = \frac{1}{|b - a|} \left\| \mathbb{E}(X^s) - \mathbb{E}(X^t) \right\|_2 + \sum_{k=2}^{K} \frac{1}{|b - a|^k} \left\| C_k(X^s) - C_k(X^t) \right\|_2 \qquad (8)$$

where $C_k(X) = \mathbb{E}\big((x - \mathbb{E}(X))^k\big)$ is the vector of all $k$th-order sample central moments and $\mathbb{E}(X) = \frac{1}{|X|}\sum_{x \in X} x$ is the empirical expectation.

The association loss $L_{assoc}$ proposed by [80] is an alternative discrepancy measure; it enforces statistical associations between source and target data by making the two-step round-trip probabilities $P_{ij}^{aba}$ similar to the uniform distribution over the class labels.

• Architecture Criterion

Some other methods optimize the architecture of the network to minimize the distribution discrepancy. This adaptation behavior can be achieved in most deep DA models, in both supervised and unsupervised settings.

Rozantsev et al. [47] considered that the weights in corresponding layers are not shared but related by a weight regularizer $r_w(\cdot)$ to account for the differences between the two domains. The weight regularizer $r_w(\cdot)$ can be expressed as the exponential loss function:

$$r_w(\theta_j^s, \theta_j^t) = \exp\left(\left\|\theta_j^s - \theta_j^t\right\|^2\right) - 1 \qquad (9)$$

where $\theta_j^s$ and $\theta_j^t$ denote the parameters of the $j$th layer of the source and target models, respectively. To further relax this restriction, they allow the weights in one stream to undergo a linear transformation:

$$r_w(\theta_j^s, \theta_j^t) = \exp\left(\left\|a_j \theta_j^s + b_j - \theta_j^t\right\|^2\right) - 1 \qquad (10)$$

where $a_j$ and $b_j$ are scalar parameters that encode the linear transformation. The work of Shu et al. [81] is similar to [47] in using weakly parameter-shared layers. The penalty term $\Omega$ controls the relatedness of the parameters:

$$\Omega = \sum_{l=1}^{L} \left( \left\|W_S^{(l)} - W_T^{(l)}\right\|_F^2 + \left\|b_S^{(l)} - b_T^{(l)}\right\|_F^2 \right) \qquad (11)$$

where $\{W_S^{(l)}, b_S^{(l)}\}_{l=1}^{L}$ and $\{W_T^{(l)}, b_T^{(l)}\}_{l=1}^{L}$ are the parameters of the $l$th layer in the source and target domains, respectively.


Figure 7: The two-stream architecture with related weights [47].

Li et al. [44] hypothesized that the class-related knowledge is stored in the weight matrix, whereas domain-related knowledge is represented by the statistics of the batch normalization (BN) layer [82]. BN normalizes the mean and standard deviation for each individual feature channel such that each layer receives data from a similar distribution, irrespective of whether it comes from the source or the target domain. Therefore, Li et al. used BN to align the distributions by recomputing the mean and standard deviation in the target domain:

$$BN(X^t) = \lambda \left( \frac{x - \mu(X^t)}{\sigma(X^t)} \right) + \beta \qquad (12)$$

where $\lambda$ and $\beta$ are parameters learned from the target data and $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation computed independently for each feature channel. Based on [44], [83] endowed BN layers with a set of alignment parameters that can be learned automatically and can decide the degree of feature alignment required at different levels of the deep network. Furthermore, Ulyanov et al. [84] found that when replacing BN layers with instance normalization (IN) layers, where $\mu(x)$ and $\sigma(x)$ are computed independently for each channel and each sample, the performance of DA can be further improved.
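
The following is a minimal PyTorch sketch of this adaptive-BN idea: a trained model is run forward over unlabeled target batches in training mode so that the BN layers re-estimate their running statistics while all weights stay frozen. The function name, the momentum handling and the loop are illustrative assumptions, not the exact procedure of [44].

```python
import torch
import torch.nn as nn

@torch.no_grad()
def adapt_bn_statistics(model, target_loader, momentum=0.1):
    """Recompute BN running mean/std on unlabeled target batches, weights frozen."""
    model.train()  # training mode so BN layers update their running statistics
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d):
            module.momentum = momentum  # how quickly target statistics take over
    for x_t in target_loader:           # unlabeled target images
        model(x_t)                      # forward pass only; no loss, no optimizer
    model.eval()                        # inference now uses target-domain statistics
```
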
Occasionally, neurons are not effective for all domains because of the presence of domain biases. For example, when recognizing people, the target domain typically contains one person centered with minimal background clutter, whereas the source dataset contains many people with more clutter. Thus, the neurons that capture the features of other people and clutter are useless. Domain-guided dropout was proposed by [48] to solve the problem of multi-DA, and it mutes non-related neurons for each domain. Rather than assigning dropout with a specific dropout rate, it depends on the gain of the loss function of each neuron on the domain sample when the neuron is removed:

$$s_i = L(g(x)_{\setminus i}) - L(g(x)) \qquad (13)$$

where $L$ is the softmax loss function and $g(x)_{\setminus i}$ is the feature vector after setting the response of the $i$th neuron to zero. In [85], each source domain is assigned different parameters, $\Theta^{(i)} = \Theta^{(0)} + \Delta^{(i)}$, where $\Theta^{(0)}$ is a domain-general model and $\Delta^{(i)}$ is a domain-specific bias term. After the low-rank parameterized CNNs are trained, $\Theta^{(0)}$ can serve as the classifier for the target domain.

• Geometric Criterion

The geometric criterion mitigates the domain shift by integrating intermediate subspaces on a geodesic path from the source to the target domain. A geodesic flow curve is constructed to connect the source and target domains on the Grassmannian; the source and target subspaces are points on a Grassmann manifold. By sampling a fixed [86] or infinite [87] number of subspaces along the geodesic, we can form intermediate subspaces that help to find the correlations between domains. Then, both the source and target data are projected to the obtained intermediate subspaces to align the distributions.

Inspired by the intermediate representations on the geodesic path, Chopra et al. [50] proposed a model called deep learning for DA by interpolating between domains (DLID). DLID generates intermediate datasets, starting with all the source data samples and gradually replacing source data with target data. Each dataset is a single point on an interpolating path between the source and target domains. Once the intermediate datasets are generated, a deep nonlinear feature extractor using predictive sparse decomposition is trained in an unsupervised manner.

4.1.2. Adversarial-Based Approaches
Recently, great success has been achieved by the GAN method [88], which estimates generative models via an adversarial process. A GAN consists of two models: a generative model $G$ that captures the data distribution and a discriminative model $D$ that distinguishes whether a sample comes from $G$ or from the training data by predicting a binary label. The networks are trained on the label prediction loss in a minimax fashion: $G$ is optimized to minimize the loss while $D$ is simultaneously trained to maximize the probability of assigning the correct label:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \qquad (14)$$

In DA, this principle has been employed to ensure that the network cannot distinguish between the source and target domains. [58] proposed a unified framework for adversarial-based approaches and summarized the existing approaches according to whether a generator is used, which loss function is employed, and whether weights are shared across domains. In this paper, we categorize the adversarial-based approaches into only two subsettings: generative models and non-generative models.

Figure 8: Generalized architecture for adversarial domain adaptation. Existing adversarial adaptation methods can be viewed as instantiations of a framework with different choices regarding their properties [58].

• Generative Models

Synthetic target data with ground-truth annotations are an appealing alternative to address the problem of a lack of training data. First, with the help of source data, generators render unlimited quantities of synthetic target data, which are, for example, paired with synthetic source data to share labels, or appear as if they were sampled from the target domain while maintaining the source labels. Then, the synthetic data with labels are used to train the target model as if no DA were required. Adversarial-based approaches with generative models are able to learn such a transformation in an unsupervised manner based on GANs.

The core idea of CoGAN [51] is to generate synthetic target data that are paired with synthetic source ones. It consists of a pair of GANs: GAN$_1$ for generating source data and GAN$_2$ for generating target data. The weights of the first few layers in the generative models and the last few layers in the discriminative models are tied. This weight-sharing constraint allows CoGAN to achieve a domain-invariant feature space without correspondence supervision. A trained CoGAN can adapt the input noise vector to paired images that are from the two distributions and share the labels. Therefore, the shared labels of synthetic target samples can be used to train the target model.


CR
US
AN

Figure 9: The CoGAN architecture. [51]



More work focuses on generating synthetic data that are similar to the target data while maintaining annotations. Yoo et al. [89] transferred knowledge from the source domain to pixel-level target images with GANs. A domain discriminator ensures the invariance of content to the source domain, and a real/fake discriminator supervises the generator to produce images similar to the target domain. Shrivastava et al. [90] developed a method for simulated+unsupervised (S+U) learning that uses a combined objective of minimizing an adversarial loss and a self-regularization loss, where the goal is to improve the realism of synthetic images using unlabeled real data. In contrast to other works in which the generator is conditioned only on a noise vector or on source images, Bousmalis et al. [52] proposed a model that exploits GANs conditioned on both. The classifier $T$ is trained to predict the class labels of both source and synthetic images, while the discriminator is trained to predict the domain labels of target and synthetic images. In addition, to obtain synthetic images with similar foregrounds and different backgrounds from the same source images, a content-similarity loss is used that penalizes large differences between source and synthetic images for foreground pixels only, by a masked pairwise mean squared error [91]. The goal of the network is to learn $G$, $D$ and $T$ by solving the optimization problem:

$$\min_{G,T} \max_D V(D, G) = \alpha L_d(D, G) + \beta L_t(T, G) + \gamma L_c(G) \qquad (15)$$

where $\alpha$, $\beta$, and $\gamma$ are parameters that control the trade-off between the losses, and $L_d$, $L_t$ and $L_c$ are the adversarial loss, softmax loss and content-similarity loss, respectively.

Figure 10: The model that exploits GANs conditioned on a noise vector and source images [52].

• Non-Generative Models

The key of deep DA is learning domain-invariant representations from source and target samples. With these representations, the distributions of both domains can be similar enough that the classifier is fooled and can be directly used in the target domain even if it is trained on source samples. Therefore, whether the representations are domain-confused or not is crucial to transferring knowledge. Inspired by GANs, a domain confusion loss, which is produced by the discriminator, is introduced to improve the performance of deep DA without generators.

Figure 11: The domain-adversarial neural network (DANN) architecture [55].


The domain-adversarial neural network (DANN) [55] integrates a gradient reversal layer (GRL) into the standard architecture to ensure that the feature distributions over the two domains are made similar. The network consists of shared feature extraction layers and two classifiers. DANN minimizes the domain confusion loss (for all samples) and the label prediction loss (for source samples) while maximizing the domain confusion loss via the use of the GRL.
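
A minimal PyTorch sketch of a gradient reversal layer: the forward pass is the identity, and the backward pass negates (and scales) the gradient, so the feature extractor is updated to maximize the domain classifier's loss. The class name and the scaling parameter are illustrative assumptions rather than the original implementation of [55].

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse and scale the gradient flowing back into the feature extractor
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Usage inside a DANN-style forward pass (features, domain_head are assumed modules):
# domain_logits = domain_head(grad_reverse(features(x), lam=0.5))
```
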
In contrast to the above methods, adversarial discriminative domain adaptation (ADDA) [58] considers independent source and target mappings by untying the weights, and the parameters of the target model are initialized by the pre-trained source model. This is more flexible because it allows more domain-specific feature extraction to be learned. ADDA minimizes the distance between the source and target representations by iteratively minimizing the following functions, which is most similar to the original GAN:

$$\min_{M^s, C} L_{cls}(X^s, Y^s) = -\mathbb{E}_{(x^s, y^s) \sim (X^s, Y^s)} \sum_{k=1}^{K} \mathbb{1}_{[k = y^s]} \log C(M^s(x^s))$$

$$\min_{D} L_{advD}(X^s, X^t, M^s, M^t) = -\mathbb{E}_{x^s \sim X^s} [\log D(M^s(x^s))] - \mathbb{E}_{x^t \sim X^t} [\log(1 - D(M^t(x^t)))]$$

$$\min_{M^s, M^t} L_{advM}(M^s, M^t) = -\mathbb{E}_{x^t \sim X^t} [\log D(M^t(x^t))] \qquad (16)$$
where the mappings $M^s$ and $M^t$ are learned from the source and target data, $X^s$ and $X^t$, and $C$ represents a classifier working on the source domain. The first classification loss function $L_{cls}$ is optimized by training the source model using the labeled source data. The second function $L_{advD}$ is minimized to train the discriminator, while the third function $L_{advM}$ learns a representation that is domain invariant.
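
The following is a minimal PyTorch sketch of the alternating updates implied by Eq. (16), assuming a frozen source mapping `Ms`, a trainable target mapping `Mt` and a discriminator `D` that ends with a sigmoid; all names and the single-step structure are illustrative assumptions, not the exact ADDA implementation.

```python
import torch
import torch.nn.functional as F

def adda_steps(Ms, Mt, D, xs, xt, opt_D, opt_Mt):
    """One discriminator update (L_advD) and one target-mapping update (L_advM)."""
    # --- Discriminator step: source features -> label 1, target features -> label 0
    ds, dt = D(Ms(xs).detach()), D(Mt(xt).detach())
    loss_D = F.binary_cross_entropy(ds, torch.ones_like(ds)) + \
             F.binary_cross_entropy(dt, torch.zeros_like(dt))
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Mapping step: fool D so target features look like source features
    dt = D(Mt(xt))
    loss_M = F.binary_cross_entropy(dt, torch.ones_like(dt))
    opt_Mt.zero_grad(); loss_M.backward(); opt_Mt.step()
```
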


Tzeng et al. [26] proposed adding an additional domain classification
layer that performs binary domain classification and designed a domain con-
fusion loss to encourage its prediction to be as close as possible to a uniform

33
ACCEPTED MANUSCRIPT

T
IP
Figure 12: The Adversarial discriminative domain adaptation (ADDA) architecture. [58]

CR
distribution over binary labels. Unlike previous methods that match the en-

US
tire source and target domains, Cao et al. introduced a selective adversarial
network (SAN) [92] to address partial transfer learning from large domains
to small domains, which assumes that the target label space is a subspace of
AN
the source label space. It simultaneously avoids negative transfer by filter-
ing out outlier source classes, and it promotes positive transfer by matching
M

the data distributions in the shared label space via splitting the domain dis-
criminator into many class-wise domain discriminators. [93] encoded domain
ED

labels and class labels to produce four groups of pairs, and replaced the typ-
ical binary adversarial discriminator by a four-class discriminator. Volpi et
PT

al. [94] trained a feature generator (S) to perform data augmentation in the
source feature space and obtained a domain invariant feature through playing
a minimax game against features from S.
CE

Rather than using discriminator to classify domain label, some papers


make some other explorations. Inspired by Wasserstein GAN [95], Shen et al.
AC

[96] utilized discriminator to estimate empirical Wasserstein distance between


the source and target samples and optimized the feature extractor network
to minimize the distance in an adversarial manner. In [97], two classifiers

34
ACCEPTED MANUSCRIPT

are treated as discriminators and are trained to maximize the discrepancy


to detect target samples outside the support of the source, while a feature
extractor is trained to minimize the discrepancy by generating target features

T
near the support.

4.1.3. Reconstruction-Based Approaches
In DA, the data reconstruction of source or target samples is an auxiliary
task that simultaneously focuses on creating a shared representation between
the two domains and keeping the individual characteristics of each domain.
• Encoder-Decoder Reconstruction
The basic autoencoder framework [98] is a feedforward neural network that includes encoding and decoding processes. The autoencoder first encodes an input to some hidden representation, and then it decodes this hidden representation back to a reconstructed version. The DA approaches based on encoder-decoder reconstruction typically learn the domain-invariant representation by a shared encoder and maintain the domain-specific representation by a reconstruction loss in the source and target domains.

Glorot and Bengio [99] proposed extracting a high-level representation based on stacked denoising autoencoders (SDAs) [16]. By reconstructing the union of data from various domains with the same network, the high-level representations can represent both the source and target domain data. Thus, a linear classifier that is trained on the labeled data of the source domain can make predictions on the target domain data with these representations. Despite their remarkable results, SDAs are limited by their high computational cost and lack of scalability to high-dimensional features. To address these crucial limitations, Chen et al. [100] proposed the marginalized SDA (mSDA), which marginalizes noise with linear denoisers; thus, the parameters can be computed in closed form and do not require stochastic gradient descent.

The deep reconstruction classification network (DRCN) proposed in [60] learns a shared encoding representation that provides useful information for cross-domain object recognition. DRCN is a CNN architecture that combines two pipelines with a shared encoder. After a representation is provided by the encoder, the first pipeline, which is a CNN, works for supervised classification with source labels, whereas the second pipeline, which is a deconvolutional network, optimizes for unsupervised reconstruction with target data:

$$\min \; \lambda L_c(\{\theta_{enc}, \theta_{lab}\}) + (1 - \lambda) L_r(\{\theta_{enc}, \theta_{dec}\}) \qquad (17)$$

where $\lambda$ is a hyper-parameter that controls the trade-off between classification and reconstruction, and $\theta_{enc}$, $\theta_{dec}$ and $\theta_{lab}$ denote the parameters of the encoder, decoder and source classifier, respectively. $L_c$ is the cross-entropy loss for classification, and $L_r$ is the squared loss $\|x - f_r(x)\|_2^2$ for reconstruction, in which $f_r(x)$ is the reconstruction of $x$.
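
A minimal PyTorch sketch of the DRCN-style joint objective in Eq. (17), assuming a shared encoder, a source classification head and a reconstruction decoder; the toy modules and names are illustrative assumptions.

```python
import torch.nn as nn

# Hypothetical modules: a shared encoder, a source classifier and a decoder
encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU())  # theta_enc
classifier = nn.Linear(128, 10)                                        # theta_lab
decoder = nn.Linear(128, 784)                                          # theta_dec
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()
lam = 0.7  # trade-off between classification and reconstruction

def drcn_loss(xs, ys, xt):
    """lambda * L_c on labeled source data + (1 - lambda) * L_r on target data."""
    cls_loss = ce(classifier(encoder(xs)), ys)           # supervised pipeline
    rec_loss = mse(decoder(encoder(xt)), xt.flatten(1))  # reconstruction pipeline
    return lam * cls_loss + (1 - lam) * rec_loss
```
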
Domain separation networks (DSNs) [59] explicitly and jointly model both private and shared components of the domain representations. A shared-weight encoder learns to capture shared representations, while a private encoder is used for the domain-specific components in each domain. Additionally, a shared decoder learns to reconstruct the input samples from both the private and shared representations. Then, a classifier is trained on the shared representation. By partitioning the space in such a manner, the shared representations will not be influenced by domain-specific representations, such that a better transfer ability can be obtained.

Figure 13: The deep reconstruction classification network (DRCN) architecture [60].

Finding that the separation loss is simple and that the private features are only used for reconstruction in DSNs, [101] reinforced them by incorporating hybrid adversarial learning in a separation network and an adaptation network.

Zhuang et al. [43] proposed transfer learning with deep autoencoders (TLDA), which consists of two encoding layers. The distance between the domain distributions is minimized with KL divergence in the embedding encoding layer, and the label information of the source domain is encoded using a softmax loss in the label encoding layer. Ghifary et al. [61] extended the autoencoder into a model that jointly learns two types of data-reconstruction tasks taken from related domains: one is self-domain reconstruction, and the other is between-domain reconstruction.


• Adversarial Reconstruction
Dual learning was first proposed by He et al. [102] to reduce the
requirement of labeled data in natural language processing. Dual learning
trains two "opposite" language translators, e.g., A-to-B and B-to-A. The two
translators represent a primal-dual pair that evaluates how likely the
translated sentences belong to the targeted language, and the closed loop
measures the disparity between the reconstructed and the original ones.
Inspired by dual learning, adversarial reconstruction is adopted in deep DA
with the help of dual GANs.

Zhu et al. [63] proposed a cycle GAN that can translate the characteristics
of one image domain into the other in the absence of any paired training
examples. Compared to dual learning, cycle GAN uses two generators rather
than translators, which learn a mapping G : X → Y and an inverse mapping
F : Y → X. Two discriminators, D_X and D_Y, measure how realistic the
generated image is (G(X) ≈ Y or F(Y) ≈ X) by an adversarial loss and how
well the original input is reconstructed after a sequence of two generations
(F(G(X)) ≈ X or G(F(Y)) ≈ Y) by a cycle consistency loss (reconstruction
loss). Thus, the distribution of images from G(X) (or F(Y)) is
indistinguishable from the distribution Y (or X).

\begin{aligned}
L_{GAN}(G, D_Y, X, Y) &= \mathbb{E}_{y \sim p_{data}(y)}[\log D_Y(y)] + \mathbb{E}_{x \sim p_{data}(x)}[\log(1 - D_Y(G(x)))] \\
L_{cyc}(G, F) &= \mathbb{E}_{x \sim p_{data}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{data}(y)}[\|G(F(y)) - y\|_1]
\end{aligned} \tag{18}

where L_GAN is the adversarial loss produced by discriminator D_Y with
mapping function G : X → Y, and L_cyc is the reconstruction loss using the
L1 norm.
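The generator-side combination of these two terms can be sketched as follows
in PyTorch; the module names, the probability-style discriminator outputs
and the cycle weight are our own illustrative assumptions, not specifics of
[63].

    import torch

    def cycle_gan_generator_loss(G, F_inv, D_X, D_Y, x, y, lam=10.0):
        """Generator-side cycle GAN objective of Eq. (18); a sketch. G: X->Y,
        F_inv: Y->X, and D_X/D_Y output probabilities in (0, 1)."""
        eps = 1e-8
        fake_y, fake_x = G(x), F_inv(y)
        # adversarial terms: generators try to make the discriminators say "real"
        adv = -(torch.log(D_Y(fake_y) + eps).mean() +
                torch.log(D_X(fake_x) + eps).mean())
        # cycle-consistency (reconstruction) terms with the L1 norm
        cyc = (F_inv(fake_y) - x).abs().mean() + (G(fake_x) - y).abs().mean()
        return adv + lam * cyc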
Figure 14: The cycle GAN architecture. [63]

The dual GAN [62] and the disco GAN [64] were proposed at the same time,
where the core idea is similar to cycle GAN. In dual GAN, the generator is
configured with skip connections between mirrored downsampling and
upsampling layers [103, 53], making it a U-shaped net to share low-level
information (e.g., object shapes, textures, clutter, and so forth). For
discriminators, the Markovian patch-GAN [104] architecture is employed to
capture local high-frequency information. In disco GAN, various forms of
distance functions, such as mean-square error (MSE), cosine distance, and
hinge loss, can be used as the reconstruction loss, and the network is
applied to translate images, changing specified attributes including hair
color, gender and orientation while maintaining all other components.

4.1.4. Hybrid Approaches


To obtain better performance, some of the aforementioned methods have been
used simultaneously. [26] combined a domain confusion loss and a soft label
loss, while [32] used both statistic (MMD) and architecture criteria
(adapting the classifier by a residual function) for unsupervised DA. [34]
introduced class-specific auxiliary weights assigned by the pseudo-labels
into the original MMD. In DSNs [59], encoder-decoder reconstruction
approaches separate representations into private and shared representations,
while the MMD criterion or domain confusion loss helps make the shared
representations similar, and soft subspace orthogonality constraints ensure
dissimilarity between the private and shared representations. [47] used the
MMD between the learned source and target representations and also allowed
the weights of the corresponding layers to differ. [43] learned
domain-invariant representations by encoder-decoder reconstruction
approaches and the KL divergence.

4.2. Heterogeneous Domain Adaptation

In heterogeneous DA, the feature spaces of the source and target domains
are not the same, X_s ≠ X_t, and the dimensions of the feature spaces may
also differ. According to the divergence of feature spaces, heterogeneous DA
can be further divided into two scenarios. In one scenario, the source and
target domains both contain images, and the divergence of feature spaces is
mainly caused by different sensory devices (e.g., visual light (VIS) vs.
near-infrared (NIR) or RGB vs. depth) and different styles of images (e.g.,
sketches vs. photos). In the other scenario, there are different types of
media in the source and target domains (e.g., text vs. image and language
vs. image). Obviously, the cross-domain gap of the second scenario is much
larger.
Most shallow heterogeneous DA methods fall into two categories: symmetric
transformation and asymmetric transformation. The symmetric transformation
learns feature transformations to project the source and target features
onto a common subspace. Heterogeneous feature augmentation (HFA) [105] first
transformed the source and target data into a common subspace using
projection matrices P and Q respectively, and then proposed two new feature
mapping functions, \varphi_s(x_s) = [P x_s, x_s, \mathbf{0}_{d_t}]^T and
\varphi_t(x_t) = [Q x_t, \mathbf{0}_{d_s}, x_t]^T, to augment the
transformed data with their original features and zeros. These projection
matrices are found using a standard SVM with hinge loss in both the linear
and nonlinear cases, and an alternating optimization algorithm is proposed
to simultaneously solve the dual SVM and find the optimal transformations.
[106] treated each input domain as a manifold represented by a Laplacian
matrix and used labels rather than correspondences to align the manifolds.
The asymmetric transformation transforms one of the source and target
feature spaces to align with the other. [107] proposed a sparse and
class-invariant feature transformation matrix to map the weight vectors of
classifiers learned from the source domain to the target domain. The
asymmetric regularized cross-domain transfer (ARC-t) [108] used asymmetric,
non-linear transformations learned in Gaussian RBF kernel space to map the
target data to the source domain. Extended from [109], ARC-t performed
asymmetric transformation based on metric learning and transferred knowledge
between domains with different dimensions through changes of the
regularizer. Since we focus on deep DA, we refer the interested readers to
[20], which summarizes shallow approaches of heterogeneous DA.
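As a small illustration of the HFA-style augmented mappings above, the
following NumPy sketch builds \varphi_s and \varphi_t for single samples;
the function name and shapes are our own, and learning P and Q via the
alternating SVM optimization of [105] is outside its scope.

    import numpy as np

    def hfa_augment(xs, xt, P, Q):
        """HFA-style feature augmentation in the spirit of [105]; a sketch.

        xs: (ds,) source sample; xt: (dt,) target sample;
        P: (dc, ds) and Q: (dc, dt) project both into a dc-dim. subspace.
        """
        ds, dt = xs.shape[0], xt.shape[0]
        phi_s = np.concatenate([P @ xs, xs, np.zeros(dt)])  # [P xs, xs, 0_dt]
        phi_t = np.concatenate([Q @ xt, np.zeros(ds), xt])  # [Q xt, 0_ds, xt]
        return phi_s, phi_t

Both augmented vectors live in the same (d_c + d_s + d_t)-dimensional space,
so a single classifier can be trained on them jointly.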
However, as for deep methods, there is not much work focused on
heterogeneous DA so far. Few methods specifically designed for heterogeneous
deep DA have been proposed, and heterogeneous deep DA is still performed
similarly to some approaches of homogeneous DA.
4.2.1. Discrepancy-Based Approach

In discrepancy-based approaches, the network generally shares or reuses the
first n layers between the source and target domains, which limits the
feature spaces of the input to the same dimension. However, in heterogeneous
DA, the dimensions of the feature spaces of the source domain may differ
from those of the target domain.
In the first scenario of heterogeneous DA, the images in different domains
can be directly resized into the same dimensions, so the Class Criterion and
Statistic Criterion are still effective and are mainly used. For example,
given an RGB image and its paired depth image, [110] used the mid-level
representation learned by CNNs as a supervisory signal to re-train a CNN on
depth images. To transform an RGB object detector into an RGB-D detector
without needing complete RGB-D data, Hoffman et al. [111] first trained an
RGB network using labeled RGB data from all categories, fine-tuned the
network with labeled depth data from partial categories, and then combined
mid-level RGB and depth representations at fc6 to incorporate both
modalities into the final object class prediction. [112] first trained the
network using a large face database of photos and then fine-tuned it using a
small database of composite sketches; [113] transferred the VIS deep
networks to the NIR domain in the same way.
In the second scenario, the features of different media cannot be directly
resized into the same dimensions. Therefore, discrepancy-based methods fail
to work without additional processing. [81] proposed weakly shared DTNs to
transfer labeled information across heterogeneous domains, particularly from
the text domain to the image domain. DTNs take paired data, such as text and
image, as input to two SAEs, followed by weakly parameter-shared network
layers at the top. Chen et al. [114] proposed transfer neural trees (TNTs),
which consist of two stream networks to learn a domain-invariant feature
representation for each modality. Then, a transfer neural decision forest
(Transfer-NDF) [115, 116] is used with stochastic pruning for adapting
representative neurons in the prediction layer.


4.2.2. Adversarial-Based Approach

Generative models can synthesize heterogeneous target data while
transferring some information of the source domain to them. [117] employed a
compound loss function that consists of a multiclass GAN loss, a
regularizing component and an f-constancy component to transfer unlabeled
face photos to emoji images. To generate images of birds and flowers based
on text, [118] trained a GAN conditioned on text features encoded by a
hybrid character-level convolutional-recurrent neural network. [119]
proposed stacked generative adversarial networks (StackGAN) with
conditioning augmentation for synthesizing photo-realistic images from text.
It decomposes the synthesis problem into several sketch-refinement
processes: the Stage-I GAN sketches the primitive shape and basic colors of
the object to yield a low-resolution image, and the Stage-II GAN completes
the details of the object to produce a high-resolution photo-realistic
image.


Figure 15: The StackGAN architecture. [119]

4.2.3. Reconstruction-Based Approach

Adversarial reconstruction can be used in heterogeneous DA as well. For
example, the cycle GAN [63], dual GAN [62] and disco GAN [64] used two
generators, G_A and G_B, to generate sketches from photos and photos from
sketches, respectively. Based on cycle GAN [63], [120] proposed a
multi-adversarial network to avoid artifacts of facial photo-sketch
synthesis by leveraging the implicit presence of feature maps of different
resolutions in the generator subnetwork.

5. Multi-Step Domain Adaptation

For multi-step DA, the selection of the intermediate domain is
problem-specific, and different problems may have different strategies.
5.1. Hand-Crafted Approaches

Occasionally, the intermediate domain can be selected by experience; that
is, it is decided in advance. For example, when the source domain is image
data and the target domain is composed of text data, some annotated images
will clearly be crawled as intermediate domain data.
Based on the common knowledge that nighttime light intensities can be used
as a proxy for economic activity, Xie et al. [65] transferred knowledge from
daytime satellite imagery to poverty prediction with the help of some
nighttime light intensity information as an intermediate domain.

5.2. Instance-Based Approaches

In other problems where there are many candidate intermediate domains, some
automatic selection criterion should be considered. Similar to the
instance-transfer approaches proposed by Pan [18], because the samples of
the source domain cannot be used directly, a mixture of certain parts of the
source and target data can be useful for constructing the intermediate
domain.

Tan et al. [25] proposed distant domain transfer learning (DDTL), where
long-distance domains fail to transfer knowledge through only one
intermediate domain but can be related via multiple intermediate domains.
DDTL gradually selects unlabeled data from the intermediate domains by
simultaneously minimizing the reconstruction errors on the selected
instances in the source and intermediate domains and on all the instances in
the target domain. With the removal of the unrelated source data, the
selected intermediate domains gradually become closer to the target domain
from the source domain:
J_1(f_e, f_d, v_S, v_I) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \left\|\hat{x}_S^i - x_S^i\right\|_2^2 + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \left\|\hat{x}_I^i - x_I^i\right\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \left\|\hat{x}_T^i - x_T^i\right\|_2^2 + R(v_S, v_I) \tag{19}
where \hat{x}_S^i, \hat{x}_T^i and \hat{x}_I^i are reconstructions of the
source data x_S^i, target data x_T^i and intermediate data x_I^i based on
the autoencoder, respectively, and f_e and f_d are the parameters of the
encoder and decoder, respectively. v_S = (v_S^1, ..., v_S^{n_S})^T and
v_I = (v_I^1, ..., v_I^{n_I})^T, with v_S^i, v_I^i ∈ {0, 1}, are selection
indicators for the ith source and intermediate instances, respectively.
R(v_S, v_I) is a regularization term that avoids all values of v_S and v_I
being zero.
The DLID model [50] mentioned in Section 4.1.1 (Geometric Criterion)
constructs the intermediate domains with a subset of the source and target
domains, where source samples are gradually replaced by target samples.
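The selective reconstruction objective of Eq. (19) can be sketched as
follows in PyTorch; the module names, the relaxation of the binary
indicators to tensors, and the concrete form of the regularizer are our own
illustrative assumptions rather than the exact choices of [25].

    import torch

    def ddtl_objective(enc, dec, xs, xi, xt, v_s, v_i, reg_weight=1.0):
        """Selective reconstruction objective of Eq. (19); a sketch.

        xs, xi, xt: source / intermediate / target batches; v_s, v_i:
        per-instance selection indicators (0/1 values held in tensors).
        """
        def rec_err(x):  # per-instance squared L2 reconstruction error
            return ((dec(enc(x)) - x) ** 2).flatten(1).sum(dim=1)
        j = (v_s * rec_err(xs)).mean() \
            + (v_i * rec_err(xi)).mean() \
            + rec_err(xt).mean()
        # R(v_S, v_I): discourage selecting no instances at all (our own form)
        j = j - reg_weight * (v_s.mean() + v_i.mean())
        return j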

5.3. Representation-Based Approaches

Representation-based approaches freeze previously trained networks and use
their intermediate representations as input to a new network. Rusu et al.
[66] introduced progressive networks, which have the ability to accumulate
and transfer knowledge to new domains over a sequence of experiences. To
avoid the target model losing its ability to solve the source domain, they
constructed a new neural network for each domain, while transfer is enabled
via lateral connections to features of previously learned networks. In the
process, the parameters of the previously trained networks are frozen so
that the knowledge of intermediate domains is remembered.

Figure 16: The progressive network architecture. [66]
Figure 16: The progressive network architecture. [66]

46
ACCEPTED MANUSCRIPT

6. Application of Deep Domain Adaptation

Deep DA techniques have recently been successfully applied in many
real-world applications, including image classification, object recognition,
face recognition, object detection, style translation, and so forth. In this
section, we present different application examples using various visual deep
DA methods. Because detailed information on the datasets commonly used for
evaluating performance is provided in [22], we do not introduce it in this
paper.

6.1. Image Classification

Because image classification is a basic task of computer vision
applications, most of the algorithms mentioned above were originally
proposed to solve such problems. Therefore, we do not discuss this
application repeatedly, but we show how much benefit deep DA methods for
image classification can bring. Because different papers often use different
parameters, experimental protocols and tuning strategies in the
preprocessing steps, it is quite difficult to perform a fair comparison
among all the methods directly. Thus, similar to the work of Pan [18], we
show the comparison results between the proposed deep DA methods and
non-adaptation methods using only deep networks. A list of simple
experiments taken from some published deep DA papers is presented in
Table 5.
Table 5: Comparison between Transfer Learning and Non-Adaptation Learning Methods

(a) Office-31 dataset, ACC (unit: %) [37]
Source vs. Target   AlexNet    DDC        DAN        RTN        JAN        DANN
A vs. W             61.6±0.5   61.8±0.4   68.5       73.3±0.3   75.2±0.4   73.0±0.5
D vs. W             95.4±0.3   95.0±0.5   96.0±0.3   96.8±0.2   96.6±0.2   96.4±0.3
W vs. D             99.0±0.2   98.5±0.4   99.0±0.3   99.6±0.1   99.6±0.1   99.2±0.3
A vs. D             63.8±0.5   64.4±0.3   67.0±0.4   71.0±0.2   72.8±0.3   72.3±0.3
D vs. A             51.1±0.6   52.1±0.6   54.0±0.5   50.5±0.3   57.5±0.2   53.4±0.4
W vs. A             49.8±0.4   52.2±0.4   53.1±0.5   51.0±0.1   56.3±0.2   51.2±0.5
Avg                 70.1       70.6       72.9       73.7       76.3       74.3

(b) Office-31 dataset, ACC (unit: %) [79]
Source vs. Target   AlexNet   Deep CORAL   CMD        DLID   AdaBN   DANN
A vs. W             61.6      66.4         77.0±0.6   51.9   74.2    73.0
D vs. W             95.4      95.7         96.3±0.4   78.2   95.7    96.4
W vs. D             99.0      99.2         99.2±0.2   89.9   99.8    99.2
A vs. D             63.8      66.8         79.6±0.6   -      73.1    -
D vs. A             51.1      52.8         63.8±0.7   -      59.8    -
W vs. A             49.8      51.5         63.3±0.6   -      57.4    -
Avg                 70.1      72.1         79.9       -      76.7    -

(c) Office-31 dataset, ACC (unit: %) [26]
Source vs. Target   AlexNet    DLID   DANN       Soft Labels   Domain Confusion   Confusion+Soft
A vs. W             56.5±0.3   51.9   53.6±0.2   82.7±0.7      82.8±0.9           82.7±0.8
D vs. W             92.4±0.3   78.2   71.2±0.0   95.9±0.6      95.6±0.4           95.7±0.5
W vs. D             93.6±0.2   89.9   83.5±0.0   98.3±0.3      97.5±0.2           97.6±0.2
A vs. D             64.6±0.4   -      -          84.9±1.2      85.9±1.1           86.1±1.2
D vs. A             47.6±0.1   -      -          66.0±0.5      66.2±0.4           66.2±0.3
W vs. A             42.7±0.1   -      -          65.2±0.6      64.9±0.5           65.0±0.5
Avg                 66.2       -      -          82.17         82.13              82.22

(d) MNIST, USPS and SVHN digits datasets, ACC (unit: %) [58]
Source vs. Target   VGG-16     DANN       CoGAN      ADDA
M vs. U             75.2±1.6   77.1±1.8   91.2±0.8   89.4±0.2
U vs. M             57.1±1.7   73.0±2.0   89.1±0.8   90.1±0.8
S vs. M             60.1±1.1   73.9       -          76.0±1.8

In [37], [79], and [26], the authors used the Office-31 dataset¹ as one of
the evaluation data sets, as shown in Fig. 1(a). The Office dataset is a
computer vision classification data set with images from three distinct
domains: Amazon (A), DSLR (D), and Webcam (W). The largest domain,
Amazon, has 2817 labeled images in its 31 classes, which consist of objects
commonly encountered in office settings. By using this dataset, previous
works can show the performance of methods across all six possible DA tasks.
[37] showed comparison experiments among the standard AlexNet [8], the DANN
method [55], and the MMD algorithm and its variations, such as DDC [39],
DAN [38], JAN [37] and RTN [32]. Zellinger et al. [79] evaluated their
proposed CMD algorithm in comparison to other discrepancy-based methods
(DDC, deep CORAL [41], DLID [50], AdaBN [44]) and the adversarial-based
method DANN. [26] proposed an algorithm combining soft label loss and domain
confusion loss, and they also compared it with DANN and DLID under a
supervised DA setting.
In [58], the MNIST² (M), USPS³ (U), and SVHN⁴ (S) digit datasets (shown in
Fig. 1(b)) are used for a cross-domain hand-written digit recognition task,
and the experiment showed the comparison results of some adversarial-based
methods, such as DANN, CoGAN [51] and ADDA [58], where the baseline is
VGG-16 [12].

¹ https://cs.stanford.edu/~jhoffman/domainadapt/
² http://yann.lecun.com/exdb/mnist/
³ http://statweb.stanford.edu/~tibs/ElemStatLearn/data.html
⁴ http://ufldl.stanford.edu/housenumbers/

is VGG-16 [12].

6.2. Face Recognition


PT

The performance of face recognition significantly degrades when there are


variations in the test images that are not present in the training images. The
CE

dataset shift can be caused by poses, resolution, illuminations, expressions,


and modality. Kan et al. [121] proposed a bi-shifting auto-encoder network
AC

(BAE) for face recognition across view angle, ethnicity, and imaging sensor.

2
http://yann.lecun.com/exdb/mnist/
3
http://statweb.stanford.edu/∼tibs/ElemStatLearn/data.html
4
http://ufldl.stanford.edu/housenumbers/

49
ACCEPTED MANUSCRIPT

In BAE, source domain samples are shifted to the target domain, and sparse
reconstruction with several local neighbors from the target domain is used
to ensure the correctness of the shift, and vice versa. The single sample
per person domain adaptation network (SSPP-DAN) in [122] generates synthetic
images with varying poses to increase the number of samples in the source
domain and bridges the gap between the synthetic and source domains by
adversarial training with a GRL in real-world face recognition. [1] improved
the performance of video face recognition by using an adversarial-based
approach with large-scale unlabeled videos, labeled still images and
synthesized images. Considering that age variations are a difficult problem
for smile detection and that networks trained on the current benchmarks do
not perform well on young children, Xia et al. [123] applied DAN [38] and
JAN [37] (mentioned in Section 4.1.1) to two baseline deep models, i.e.,
AlexNet and ResNet, to transfer the knowledge from adults to infants.


Figure 17: The single sample per person domain adaptation network (SSPP-DAN) architecture. [122]

6.3. Object Detection

Recent advances in object detection are driven by region-based
convolutional neural networks (R-CNNs [10], fast R-CNNs [124] and faster
R-CNNs [125]). They are composed of a window selection mechanism and
classifiers that are pre-trained on labeled bounding boxes using the
features extracted from CNNs. At test time, the classifier decides whether a
region obtained by sliding windows contains the object. Although the R-CNN
algorithm is effective, a large amount of bounding box labeled data is
required to train each detection category. To solve the problem of lacking
labeled data, and considering the window selection mechanism as being domain
independent, deep DA methods can be used in the classifiers to adapt to the
target domain.
Because R-CNNs train classifiers on regions just as in classification,
weakly labeled data (such as image-level class labels) are directly useful
for the detector. Most works learn the detector with limited bounding box
labeled data and massive weakly labeled data. The large-scale detection
through adaptation (LSDA) [126] trains a classification layer for the target
domain and then uses a pre-trained source model along with output layer
adaptation techniques to update the target classification parameters
directly. Rochan et al. [127] used word vectors to establish the semantic
relatedness between weakly labeled source objects and target objects and
then transferred the bounding box labeled information from source objects to
target objects based on their relatedness. Extending [126] and [127], Tang
et al. [128] transferred visual (based on the LSDA model) and semantic
similarity (based on word vectors) for training an object detector on weakly
labeled categories. [129] incorporated both an image-level and an
instance-level adaptation component into faster R-CNN and minimized the
domain discrepancy based on adversarial training. By using bounding box
labeled data in a source domain and weakly labeled data in a target domain,
[130] progressively fine-tuned the pre-trained model with domain-transfer
samples and pseudo-labeling samples.

6.4. Semantic Segmentation

Fully convolutional network models (FCNs) for dense prediction have proven
successful for semantic segmentation, but their performance also degrades
under domain shifts. Therefore, some work has explored using weak labels to
improve the performance of semantic segmentation. Hong et al. [131] used a
novel encoder-decoder architecture with an attention model by transferring
weak class labeled knowledge in the source domain, while [132, 133]
transferred weak object location knowledge.
AN
attention model by transferring weak class labeled knowledge in the source
domain, while [132, 133] transferred weak object location knowledge.
M

Much attention has also been paid to deep unsupervised DA in seman-


tic segmentation. Hoffman et al. [134] first introduced it, in which global
ED

domain alignment is performed using FCNs with adversarial-based training,


while transferring spatial layout is achieved by leveraging class-aware con-
strained multiple instance loss. Zhang et al. [135] enhanced the segmentation
PT

performance on real images with the help of virtual ones. It uses the global
label distribution loss of the images and local label distribution loss of the
CE

landmark superpixels in the target domain to effectively regularize the fine-


tuning of the semantic segmentation network. Chen et al. [136] proposed
AC

a framework for cross-city semantic segmentation. The framework assigns


pseudo labels to pixels/grids in the target domain and jointly utilizes global
and class-wise alignment by domain adversarial learning to minimize domain
shift. In [137], a target guided distillation module adapts the style from the

52
ACCEPTED MANUSCRIPT

real images by imitating the pre-trained source network, and a spatial-aware


adaptation module leverages the intrinsic spatial structure to reduce the do-
main divergence. Rather than operating a simple adversarial objective on

T
the feature space, [138] used a GAN to address domain shift in which a gen-

IP
erator projects the features to the image space and a discriminator operates
on this projected image space.

CR
US
AN
M
ED

Figure 18: The architecture of pixel-level adversarial and constraint-based adaptation.


[134]
PT

6.5. Image-to-Image Translation

Image-to-image translation has recently achieved great success with deep
DA, and it has been applied to various tasks, such as style transfer.
Especially when the feature spaces of the source and target images are not
the same, image-to-image translation should be performed by heterogeneous
DA.
Most approaches to image-to-image translation use a dataset of paired
images and incorporate a DA algorithm into generative networks. Isola et al.
[53] proposed the pix2pix framework, which uses a conditional GAN to learn a
mapping from source to target images. Tzeng et al. [56] utilized a domain
confusion loss and a pairwise loss to adapt from simulation to real-world
data on a PR2 robot. However, several other methods also address the
unpaired setting, such as CoGAN [51], cycle GAN [63], dual GAN [62] and
disco GAN [64].

IP
setting, such as CoGAN [51], cycle GAN [63], dual GAN [62] and disco GAN
[64].

CR
Matching the statistical distribution by fine-tuning a deep network is an-
other way to achieve image-to-image translation. Gatys et al. [139] fine-tuned

US
the CNN to achieve DA by the total loss, which is a linear combination be-
tween the content and the style loss, such that the target image is rendered in
the style of the source image maintaining the content. The content loss min-
AN
imizes the mean squared difference of the feature representation between the
original image and generated image in higher layers, while the style loss mini-
M

mizes the element-wise mean squared difference between the Gram matrix of
them on each layer. [46] demonstrated that matching the Gram matrices of
ED

feature maps is equivalent to minimizing the MMD. Rather than MMD, [42]
proposed a deep generative correlation alignment network (DGCAN) that
PT

bridges the domain discrepancy between CAD synthetic and real images by
applying the content and CORAL losses to different layers.
CE

6.6. Person Re-identification

In the community, person re-identification (re-ID) has become increas-


AC

ingly popular. When given video sequences of a person, person re-ID recog-
nizes whether this person has been in another camera to compensate for the
limitations of fixed devices. Recently, deep DA methods have been used in re-
ID when models trained on one dataset are directly used on another. Xiao

54
ACCEPTED MANUSCRIPT

et al. [48] proposed the domain-guided dropout algorithm to discard use-


less neurons for re-identifying persons on multiple datasets simultaneously.
Inspired by cycle GAN and Siamese network, the similarity preserving gener-

T
ative adversarial network (SPGAN) [140] translated the labeled source image

IP
to the target domain, preserving self similarity and domain-dissimilarity in
an unsupervised manner, and then it trains re-ID models with the translated

CR
images using supervised feature learning methods.

6.7. Image Captioning

US
Recently, image captioning, which automatically describes an image with
a natural sentence, has been an emerging challenge in computer vision and
AN
natural language processing. Due to lacking of paired image-sentence train-
ing data, DA leverages different types of data in other source domains to
M

tackle this challenge. Chen et al. [141] proposed a novel adversarial training
procedure (captioner v.s. critics) for cross-domain image captioning using
ED

paired source data and unpaired target data. One captioner adapts the sen-
tence style from source to target domain, whereas two critics, namely domain
critic and multi-modal critic, aim at distinguishing them. Zhao et al. [142]
PT

fine-tuned the pre-trained source model on limited data in the target domain
via a dual learning mechanism.
CE

7. Conclusion

In a broad sense, deep DA is utilizing deep networks to enhance the
performance of DA, for example, shallow DA methods applied to features
extracted by deep networks. In a narrow sense, deep DA is based on deep
learning architectures designed for DA and optimized by back propagation. In
this survey paper, we focus on this narrow definition, and we have reviewed
deep DA techniques on visual categorization tasks.
Deep DA is classified as homogeneous DA and heterogeneous DA, and it can be
further divided into supervised, semi-supervised and unsupervised settings.
The first setting is the simplest but is generally limited due to the need
for labeled data; thus, most previous works focused on unsupervised cases.
Semi-supervised deep DA is a hybrid method that combines the methods of the
supervised and unsupervised settings.

Furthermore, the approaches of deep DA can be classified into one-step DA
and multi-step DA considering the distance between the source and target
domains. When the distance is small, one-step DA can be used based on the
training loss. It consists of the discrepancy-based approach, the
adversarial-based approach, and the reconstruction-based approach. When the
source and target domains are not directly related, multi-step (or
transitive) DA can be used. The key of multi-step DA is to select and
utilize intermediate domains, thus falling into three categories, including
hand-crafted, instance-based and representation-based selection mechanisms.
Although deep DA has achieved success recently, many issues remain to be
addressed. First, most existing algorithms focus on homogeneous deep DA,
which assumes that the feature spaces between the source and target domains
are the same. However, this assumption may not be true in many applications.
We expect to transfer knowledge without this severe limitation and take
advantage of existing datasets to help with more tasks. Heterogeneous deep
DA may attract increasingly more attention in the future.
In addition, deep DA techniques have been successfully applied in many
real-world applications, including image classification and style
translation. We have also found that only a few papers address adaptation
beyond classification and recognition, such as object detection, face
recognition, semantic segmentation and person re-identification. How to
achieve these tasks with no or a very limited amount of data is probably one
of the main challenges that should be addressed by deep DA in the next few
years.

Finally, since existing deep DA methods aim at aligning marginal
distributions, they commonly assume a shared label space across the source
and target domains. However, in realistic scenarios, the images of the
source and target domains may come from different sets of categories, or
only a few categories of interest are shared. Recently, some papers
[92, 143, 144] have begun to focus on this issue, and we believe it is
worthy of more attention.
8. Acknowledgements

This work was partially supported by the National Natural Science
Foundation of China under Grant Nos. 61573068, 61471048, and 61375031, and
Beijing Nova Program under Grant No. Z161100004916088.
References
[1] K. Sohn, S. Liu, G. Zhong, X. Yu, M.-H. Yang, and M. Chandraker, “Unsuper-
vised domain adaptation for face recognition in unlabeled videos,” arXiv preprint
arXiv:1708.02191, 2017.
AC

[2] L. Bruzzone and M. Marconcini, “Domain adaptation problems: A dasvm classifi-


cation technique and a circular validation strategy,” IEEE transactions on pattern
analysis and machine intelligence, vol. 32, no. 5, pp. 770–787, 2010.


[3] W.-S. Chu, F. De la Torre, and J. F. Cohn, “Selective transfer machine for per-
sonalized facial action unit detection,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2013, pp. 3515–3522.

T
[4] B. Gong, K. Grauman, and F. Sha, “Connecting the dots with landmarks: Discrim-
inatively learning domain-invariant features for unsupervised domain adaptation,”

IP
in International Conference on Machine Learning, 2013, pp. 222–230.

CR
[5] S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang, “Domain adaptation via transfer
component analysis,” IEEE Transactions on Neural Networks, vol. 22, no. 2, pp.
199–210, 2011.

US
[6] M. Gheisari and M. S. Baghshah, “Unsupervised domain adaptation via representa-
tion learning and adaptive classifier learning,” Neurocomputing, vol. 165, pp. 300–
AN
311, 2015.

[7] S. Pachori, A. Deshpande, and S. Raman, “Hashing in the zero shot framework with
M

domain adaptation,” Neurocomputing, 2017.

[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep


ED

convolutional neural networks,” in Advances in neural information processing sys-


tems, 2012, pp. 1097–1105.
PT

[9] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to
human-level performance in face verification,” in Proceedings of the IEEE conference
on computer vision and pattern recognition, 2014, pp. 1701–1708.
CE

[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Rich feature hierarchies for
accurate object detection and semantic segmentation,” in Proceedings of the IEEE
AC

conference on computer vision and pattern recognition, 2014, pp. 580–587.

[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep
neural network architectures and their applications,” Neurocomputing, vol. 234, pp.
11–26, 2017.


[12] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale
image recognition,” arXiv preprint arXiv:1409.1556, 2014.

[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Van-

T
houcke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2015, pp. 1–9.

IP
[14] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,”

CR
in Proceedings of the IEEE conference on computer vision and pattern recognition,
2016, pp. 770–778.

US
[15] G. E. Hinton, S. Osindero, and Y.-W. Teh, “A fast learning algorithm for deep belief
nets,” Neural computation, vol. 18, no. 7, pp. 1527–1554, 2006.
AN
[16] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, “Stacked de-
noising autoencoders: Learning useful representations in a deep network with a local
denoising criterion,” Journal of Machine Learning Research, vol. 11, no. Dec, pp.
M

3371–3408, 2010.

[17] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell,


ED

“Decaf: A deep convolutional activation feature for generic visual recognition,” in


International conference on machine learning, 2014, pp. 647–655.
PT

[18] S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on


knowledge and data engineering, vol. 22, no. 10, pp. 1345–1359, 2010.
CE

[19] L. Shao, F. Zhu, and X. Li, “Transfer learning for visual categorization: A survey,”
IEEE transactions on neural networks and learning systems, vol. 26, no. 5, pp.
1019–1034, 2015.
AC

[20] O. Day and T. M. Khoshgoftaar, “A survey on heterogeneous transfer learning,”


Journal of Big Data, vol. 4, no. 1, p. 29, 2017.


[21] V. M. Patel, R. Gopalan, R. Li, and R. Chellappa, “Visual domain adaptation: A


survey of recent advances,” IEEE signal processing magazine, vol. 32, no. 3, pp.
53–69, 2015.

T
[22] J. Zhang, W. Li, and P. Ogunbona, “Transfer learning for cross-dataset recognition:
A survey,” 2017.

IP
[23] G. Csurka, “Domain adaptation for visual applications: A comprehensive survey,”

CR
arXiv preprint arXiv:1702.05374, 2017.

[24] B. Tan, Y. Song, E. Zhong, and Q. Yang, “Transitive transfer learning,” in Proceed-

and Data Mining. US


ings of the 21th ACM SIGKDD International Conference on Knowledge Discovery
ACM, 2015, pp. 1155–1164.
AN
[25] B. Tan, Y. Zhang, S. J. Pan, and Q. Yang, “Distant domain transfer learning,” in
AAAI, 2017, pp. 2604–2610.

[26] E. Tzeng, J. Hoffman, T. Darrell, and K. Saenko, “Simultaneous deep transfer


M

across domains and tasks,” in Proceedings of the IEEE International Conference on


Computer Vision, 2015, pp. 4068–4076.
ED

[27] X. Peng, J. Hoffman, X. Y. Stella, and K. Saenko, “Fine-to-coarse knowledge transfer


for low-res image classification,” in Image Processing (ICIP), 2016 IEEE Interna-
PT

tional Conference on. IEEE, 2016, pp. 3683–3687.

[28] S. Motiian, M. Piccirilli, D. A. Adjeroh, and G. Doretto, “Unified deep supervised


CE

domain adaptation and generalization,” in The IEEE International Conference on


Computer Vision (ICCV), vol. 2, 2017.
AC

[29] K. Saito, Y. Ushiku, and T. Harada, “Asymmetric tri-training for unsupervised


domain adaptation,” arXiv preprint arXiv:1702.08400, 2017.

[30] J. Hu, J. Lu, and Y.-P. Tan, “Deep transfer metric learning,” in Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 325–333.


[31] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,”
arXiv preprint arXiv:1503.02531, 2015.

[32] M. Long, H. Zhu, J. Wang, and M. I. Jordan, “Unsupervised domain adaptation with

T
residual transfer networks,” in Advances in Neural Information Processing Systems,
2016, pp. 136–144.

IP
[33] X. Zhang, F. X. Yu, S.-F. Chang, and S. Wang, “Deep transfer network: Unsuper-

CR
vised domain adaptation,” arXiv preprint arXiv:1503.00591, 2015.

[34] H. Yan, Y. Ding, P. Li, Q. Wang, Y. Xu, and W. Zuo, “Mind the class weight bias:

preprint arXiv:1705.00609, 2017.


US
Weighted maximum mean discrepancy for unsupervised domain adaptation,” arXiv

[35] T. Gebru, J. Hoffman, and L. Fei-Fei, “Fine-grained recognition in the wild: A


AN
multi-task domain adaptation approach,” arXiv preprint arXiv:1709.02476, 2017.

[36] W. Ge and Y. Yu, “Borrowing treasures from the wealthy: Deep transfer learning
M

through selective joint fine-tuning,” arXiv preprint arXiv:1702.08690, 2017.

[37] M. Long, J. Wang, and M. I. Jordan, “Deep transfer learning with joint adaptation
ED

networks,” arXiv preprint arXiv:1605.06636, 2016.

[38] M. Long, Y. Cao, J. Wang, and M. Jordan, “Learning transferable features with deep
PT

adaptation networks,” in International Conference on Machine Learning, 2015, pp.


97–105.
CE

[39] E. Tzeng, J. Hoffman, N. Zhang, K. Saenko, and T. Darrell, “Deep domain confusion:
Maximizing for domain invariance,” arXiv preprint arXiv:1412.3474, 2014.

[40] M. Ghifary, W. B. Kleijn, and M. Zhang, “Domain adaptive neural networks for ob-
AC

ject recognition,” in Pacific Rim International Conference on Artificial Intelligence.


Springer, 2014, pp. 898–904.

[41] B. Sun and K. Saenko, “Deep coral: Correlation alignment for deep domain adapta-
tion,” in Computer Vision–ECCV 2016 Workshops. Springer, 2016, pp. 443–450.


[42] X. Peng and K. Saenko, “Synthetic to real adaptation with deep generative corre-
lation alignment networks,” arXiv preprint arXiv:1701.05524, 2017.

[43] F. Zhuang, X. Cheng, P. Luo, S. J. Pan, and Q. He, “Supervised representation

T
learning: Transfer learning with deep autoencoders.” in IJCAI, 2015, pp. 4119–
4125.

IP
[44] Y. Li, N. Wang, J. Shi, J. Liu, and X. Hou, “Revisiting batch normalization for

CR
practical domain adaptation,” arXiv preprint arXiv:1603.04779, 2016.

[45] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive in-

US
stance normalization,” arXiv preprint arXiv:1703.06868, 2017.

[46] Y. Li, N. Wang, J. Liu, and X. Hou, “Demystifying neural style transfer,” arXiv
preprint arXiv:1701.01036, 2017.
AN
[47] A. Rozantsev, M. Salzmann, and P. Fua, “Beyond sharing weights for deep domain
adaptation,” arXiv preprint arXiv:1603.06432, 2016.
M

[48] T. Xiao, H. Li, W. Ouyang, and X. Wang, “Learning deep feature representations
with domain guided dropout for person re-identification,” in Proceedings of the IEEE
ED

Conference on Computer Vision and Pattern Recognition, 2016, pp. 1249–1258.

[49] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual domains with
PT

residual adapters,” arXiv preprint arXiv:1705.08045, 2017.

[50] S. Chopra, S. Balakrishnan, and R. Gopalan, “Dlid: Deep learning for domain
CE

adaptation by interpolating between domains,” in ICML workshop on challenges in


representation learning, vol. 2, no. 6, 2013.

[51] M.-Y. Liu and O. Tuzel, “Coupled generative adversarial networks,” in Advances in
AC

neural information processing systems, 2016, pp. 469–477.

[52] K. Bousmalis, N. Silberman, D. Dohan, D. Erhan, and D. Krishnan, “Unsupervised


pixel-level domain adaptation with generative adversarial networks,” arXiv preprint
arXiv:1612.05424, 2016.


[53] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation with
conditional adversarial networks,” arXiv preprint arXiv:1611.07004, 2016.

[54] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette,

T
M. Marchand, and V. Lempitsky, “Domain-adversarial training of neural networks,”
Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.

IP
[55] Y. Ganin and V. Lempitsky, “Unsupervised domain adaptation by backpropaga-

CR
tion,” in International Conference on Machine Learning, 2015, pp. 1180–1189.

[56] E. Tzeng, C. Devin, J. Hoffman, C. Finn, P. Abbeel, S. Levine, K. Saenko, and

US
T. Darrell, “Adapting deep visuomotor representations with weak pairwise con-
straints,” CoRR, vol. abs/1511.07111, 2015.
AN
[57] K.-C. Peng, Z. Wu, and J. Ernst, “Zero-shot deep domain adaptation,” arXiv
preprint arXiv:1707.01922, 2017.

[58] E. Tzeng, J. Hoffman, K. Saenko, and T. Darrell, “Adversarial discriminative do-


M

main adaptation,” arXiv preprint arXiv:1702.05464, 2017.

[59] K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan, and D. Erhan, “Domain


ED

separation networks,” in Advances in Neural Information Processing Systems, 2016,


pp. 343–351.
PT

[60] M. Ghifary, W. B. Kleijn, M. Zhang, D. Balduzzi, and W. Li, “Deep reconstruction-


classification networks for unsupervised domain adaptation,” in European Confer-
CE

ence on Computer Vision. Springer, 2016, pp. 597–613.

[61] M. Ghifary, W. Bastiaan Kleijn, M. Zhang, and D. Balduzzi, “Domain generalization


AC

for object recognition with multi-task autoencoders,” in Proceedings of the IEEE


international conference on computer vision, 2015, pp. 2551–2559.

[62] Z. Yi, H. Zhang, P. T. Gong et al., “Dualgan: Unsupervised dual learning for image-
to-image translation,” arXiv preprint arXiv:1704.02510, 2017.


[63] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation
using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.

[64] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim, “Learning to discover cross-domain

T
relations with generative adversarial networks,” arXiv preprint arXiv:1703.05192,
2017.

IP
[65] M. Xie, N. Jean, M. Burke, D. Lobell, and S. Ermon, “Transfer learning from deep

CR
features for remote sensing and poverty mapping,” 2015.

[66] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick,

US
K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv
preprint arXiv:1606.04671, 2016.
AN
[67] J. Hoffman, E. Tzeng, J. Donahue, Y. Jia, K. Saenko, and T. Darrell,
“One-shot adaptation of supervised deep convolutional models,” arXiv preprint
arXiv:1312.6204, 2013.
M

[68] A. Raj, V. P. Namboodiri, and T. Tuytelaars, “Subspace alignment based domain


adaptation for rcnn detector,” arXiv preprint arXiv:1507.05578, 2015.
ED

[69] H. V. Nguyen, H. T. Ho, V. M. Patel, and R. Chellappa, “Dash-n: Joint hierarchical


domain adaptation and feature learning,” IEEE Transactions on Image Processing,
vol. 24, no. 12, pp. 5479–5491, 2015.
PT

[70] L. Zhang, Z. He, and Y. Liu, “Deep object recognition across domains based on
CE

adaptive extreme learning machine,” Neurocomputing, vol. 239, pp. 194–203, 2017.

[71] H. Lu, L. Zhang, Z. Cao, W. Wei, K. Xian, C. Shen, and A. van den Hengel,
“When unsupervised domain adaptation meets tensor representations,” in The IEEE
AC

International Conference on Computer Vision (ICCV), vol. 2, 2017.

[72] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are features in
deep neural networks?” in Advances in neural information processing systems, 2014,
pp. 3320–3328.


[73] B. Chu, V. Madhavan, O. Beijbom, J. Hoffman, and T. Darrell, “Best practices


for fine-tuning visual classifiers to new domains,” in Computer Vision–ECCV 2016
Workshops. Springer, 2016, pp. 435–442.

T
[74] X. Wang, X. Duan, and X. Bai, “Deep sketch feature for cross-domain image re-
trieval,” Neurocomputing, vol. 207, pp. 387–397, 2016.

IP
[75] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object

CR
classes by between-class attribute transfer,” in Computer Vision and Pattern Recog-
nition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 951–958.

US
[76] K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J.
Smola, “Integrating structured biological data by kernel maximum mean discrep-
ancy,” Bioinformatics, vol. 22, no. 14, pp. e49–e57, 2006.
AN
[77] B. Sun, J. Feng, and K. Saenko, “Return of frustratingly easy domain adaptation.”
in AAAI, vol. 6, no. 7, 2016, p. 8.
M

[78] Y. Li, K. Swersky, and R. Zemel, “Generative moment matching networks,” in


Proceedings of the 32nd International Conference on Machine Learning (ICML-15),
ED

2015, pp. 1718–1727.

[79] W. Zellinger, T. Grubinger, E. Lughofer, T. Natschläger, and S. Saminger-Platz,


PT

“Central moment discrepancy (cmd) for domain-invariant representation learning,”


arXiv preprint arXiv:1702.08811, 2017.
CE

[80] P. Haeusser, T. Frerix, A. Mordvintsev, and D. Cremers, “Associative domain adap-


tation,” in International Conference on Computer Vision (ICCV), vol. 2, no. 5, 2017,
p. 6.
AC

[81] X. Shu, G.-J. Qi, J. Tang, and J. Wang, “Weakly-shared deep transfer networks
for heterogeneous-domain knowledge propagation,” in Proceedings of the 23rd ACM
international conference on Multimedia. ACM, 2015, pp. 35–44.


[82] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network train-
ing by reducing internal covariate shift,” in International Conference on Machine
Learning, 2015, pp. 448–456.

T
[83] F. M. Carlucci, L. Porzi, B. Caputo, E. Ricci, and S. R. Bulò, “Autodial: Automatic
domain alignment layers,” in International Conference on Computer Vision, 2017.

IP
[84] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Improved texture networks: Maximiz-

CR
ing quality and diversity in feed-forward stylization and texture synthesis,” arXiv
preprint arXiv:1701.02096, 2017.

US
[85] D. Li, Y. Yang, Y.-Z. Song, and T. M. Hospedales, “Deeper, broader and artier do-
main generalization,” in Computer Vision (ICCV), 2017 IEEE International Con-
ference on. IEEE, 2017, pp. 5543–5551.
AN
[86] R. Gopalan, R. Li, and R. Chellappa, “Domain adaptation for object recognition:
An unsupervised approach,” in Computer Vision (ICCV), 2011 IEEE International
M

Conference on. IEEE, 2011, pp. 999–1006.

[87] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel for unsupervised
ED

domain adaptation,” in Computer Vision and Pattern Recognition (CVPR), 2012


IEEE Conference on. IEEE, 2012, pp. 2066–2073.
PT

[88] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,


A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neu-
ral information processing systems, 2014, pp. 2672–2680.
CE

[89] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon, “Pixel-level domain transfer,”
in European Conference on Computer Vision. Springer, 2016, pp. 517–532.
AC

[90] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb, “Learn-


ing from simulated and unsupervised images through adversarial training,” arXiv
preprint arXiv:1612.07828, 2016.


[91] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image
using a multi-scale deep network,” in Advances in neural information processing
systems, 2014, pp. 2366–2374.

T
[92] Z. Cao, M. Long, J. Wang, and M. I. Jordan, “Partial transfer learning with selective
adversarial networks,” arXiv preprint arXiv:1707.07901, 2017.

IP
[93] S. Motiian, Q. Jones, S. Iranmanesh, and G. Doretto, “Few-shot adversarial domain

CR
adaptation,” in Advances in Neural Information Processing Systems, 2017, pp. 6673–
6683.

US
[94] R. Volpi, P. Morerio, S. Savarese, and V. Murino, “Adversarial feature augmentation
for unsupervised domain adaptation,” arXiv preprint arXiv:1711.08561, 2017.

[95] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein gan,” arXiv preprint


AN
arXiv:1701.07875, 2017.

[96] J. Shen, Y. Qu, W. Zhang, and Y. Yu, “Wasserstein distance guided representation
M

learning for domain adaptation,” 2017.

[97] K. Saito, K. Watanabe, Y. Ushiku, and T. Harada, “Maximum classifier discrepancy


ED

for unsupervised domain adaptation,” arXiv preprint arXiv:1712.02560, 2017.

[98] Y. Bengio, “Learning deep architectures for ai,” Foundations and Trends in Machine
PT

Learning, vol. 2, no. 1, pp. 1–127, 2009.

[99] X. Glorot, A. Bordes, and Y. Bengio, “Domain adaptation for large-scale sentiment
CE

classification: A deep learning approach,” in Proceedings of the 28th international


conference on machine learning (ICML-11), 2011, pp. 513–520.

[100] M. Chen, Z. Xu, K. Weinberger, and F. Sha, “Marginalized denoising autoencoders


AC

for domain adaptation,” arXiv preprint arXiv:1206.4683, 2012.

[101] J.-C. Tsai and J.-T. Chien, “Adversarial domain separation and adaptation,” in
Machine Learning for Signal Processing (MLSP), 2017 IEEE 27th International
Workshop on. IEEE, 2017, pp. 1–6.


[102] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma, “Dual learning for
machine translation,” in Advances in Neural Information Processing Systems, 2016,
pp. 820–828.

T
[103] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for
biomedical image segmentation,” in International Conference on Medical Image

IP
Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.

CR
[104] C. Li and M. Wand, “Precomputed real-time texture synthesis with markovian
generative adversarial networks,” in European Conference on Computer Vision.
Springer, 2016, pp. 702–716.

US
[105] L. Duan, D. Xu, and I. Tsang, “Learning with augmented features for heterogeneous
domain adaptation,” arXiv preprint arXiv:1206.4660, 2012.
AN
[106] C. Wang and S. Mahadevan, “Heterogeneous domain adaptation using manifold
alignment,” in IJCAI proceedings-international joint conference on artificial intelli-
M

gence, vol. 22, no. 1, 2011, p. 1541.

[107] J. T. Zhou, I. W. Tsang, S. J. Pan, and M. Tan, “Heterogeneous domain adaptation


ED

for multiple classes,” in Artificial Intelligence and Statistics, 2014, pp. 1095–1103.

[108] B. Kulis, K. Saenko, and T. Darrell, “What you saw is not what you get: Domain
PT

adaptation using asymmetric kernel transforms,” in Computer Vision and Pattern


Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1785–1792.
CE

[109] K. Saenko, B. Kulis, M. Fritz, and T. Darrell, “Adapting visual category models
to new domains,” in European conference on computer vision. Springer, 2010, pp.
213–226.
AC

[110] S. Gupta, J. Hoffman, and J. Malik, “Cross modal distillation for supervision trans-
fer,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition, 2016, pp. 2827–2836.


[111] J. Hoffman, S. Gupta, J. Leong, S. Guadarrama, and T. Darrell, “Cross-modal


adaptation for rgb-d detection,” in Robotics and Automation (ICRA), 2016 IEEE
International Conference on. IEEE, 2016, pp. 5032–5039.

T
[112] P. Mittal, M. Vatsa, and R. Singh, “Composite sketch recognition via deep network-
a transfer learning approach,” in Biometrics (ICB), 2015 International Conference

IP
on. IEEE, 2015, pp. 251–256.

CR
[113] X. Liu, L. Song, X. Wu, and T. Tan, “Transferring deep representation for nir-vis
heterogeneous face recognition,” in Biometrics (ICB), 2016 International Confer-
ence on. IEEE, 2016, pp. 1–8.

US
[114] W.-Y. Chen, T.-M. H. Hsu, Y.-H. H. Tsai, Y.-C. F. Wang, and M.-S. Chen, “Trans-
fer neural trees for heterogeneous domain adaptation,” in European Conference on
AN
Computer Vision. Springer, 2016, pp. 399–414.

[115] S. Rota Bulo and P. Kontschieder, “Neural decision forests for semantic image la-
M

belling,” in Proceedings of the IEEE Conference on Computer Vision and Pattern


Recognition, 2014, pp. 81–88.
ED

[116] P. Kontschieder, M. Fiterau, A. Criminisi, and S. Rota Bulo, “Deep neural decision
forests,” in Proceedings of the IEEE International Conference on Computer Vision,
2015, pp. 1467–1475.
PT

[117] Y. Taigman, A. Polyak, and L. Wolf, “Unsupervised cross-domain image genera-


tion,” arXiv preprint arXiv:1611.02200, 2016.
CE

[118] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative


adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
AC

[119] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas, “Stack-
gan: Text to photo-realistic image synthesis with stacked generative adversarial
networks,” in IEEE Int. Conf. Comput. Vision (ICCV), 2017, pp. 5907–5915.


[120] L. Wang, V. A. Sindagi, and V. M. Patel, “High-quality facial photo-sketch synthesis


using multi-adversarial networks,” arXiv preprint arXiv:1710.10182, 2017.

[121] M. Kan, S. Shan, and X. Chen, “Bi-shifting auto-encoder for unsupervised domain

T
adaptation,” in Proceedings of the IEEE International Conference on Computer
Vision, 2015, pp. 3846–3854.

IP
[122] S. Hong, W. Im, J. Ryu, and H. S. Yang, “Sspp-dan: Deep domain adapta-

CR
tion network for face recognition with single sample per person,” arXiv preprint
arXiv:1702.04069, 2017.

US
[123] Y. Xia, D. Huang, and Y. Wang, “Detecting smiles of young children via deep
transfer learning,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2017, pp. 1673–1681.
AN
[124] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on
computer vision, 2015, pp. 1440–1448.
M

[125] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object
detection with region proposal networks,” in Advances in neural information pro-
ED

cessing systems, 2015, pp. 91–99.

[126] J. Hoffman, S. Guadarrama, E. S. Tzeng, R. Hu, J. Donahue, R. Girshick, T. Darrell,


PT

and K. Saenko, “Lsda: Large scale detection through adaptation,” in Advances in


Neural Information Processing Systems, 2014, pp. 3536–3544.
CE

[127] M. Rochan and Y. Wang, “Weakly supervised localization of novel objects using
appearance transfer,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2015, pp. 4315–4324.
AC

[128] Y. Tang, J. Wang, B. Gao, E. Dellandréa, R. Gaizauskas, and L. Chen, “Large scale
semi-supervised object detection using visual and semantic knowledge transfer,” in
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
2016, pp. 2119–2128.


[129] Y. Chen, W. Li, C. Sakaridis, D. Dai, and L. Van Gool, “Domain adaptive faster
r-cnn for object detection in the wild,” arXiv preprint arXiv:1803.03243, 2018.

[130] N. Inoue, R. Furuta, T. Yamasaki, and K. Aizawa, “Cross-domain weakly-

T
supervised object detection through progressive domain adaptation,” arXiv preprint
arXiv:1803.11365, 2018.

IP
[131] S. Hong, J. Oh, H. Lee, and B. Han, “Learning transferrable knowledge for semantic

CR
segmentation with deep convolutional neural network,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, 2016, pp. 3204–3212.

[132] A. Kolesnikov and C. H. Lampert, “Seed, expand and constrain: Three principles for weakly-supervised image segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 695–711.

[133] W. Shimoda and K. Yanai, “Distinct class-specific saliency maps for weakly supervised semantic segmentation,” in European Conference on Computer Vision. Springer, 2016, pp. 218–234.

[134] J. Hoffman, D. Wang, F. Yu, and T. Darrell, “FCNs in the wild: Pixel-level adversarial and constraint-based adaptation,” arXiv preprint arXiv:1612.02649, 2016.

[135] Y. Zhang, P. David, and B. Gong, “Curriculum domain adaptation for semantic segmentation of urban scenes,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.


[136] Y.-H. Chen, W.-Y. Chen, Y.-T. Chen, B.-C. Tsai, Y.-C. F. Wang, and M. Sun, “No more discrimination: Cross city adaptation of road scene segmenters,” arXiv preprint arXiv:1704.08509, 2017.

[137] Y. Chen, W. Li, and L. Van Gool, “ROAD: Reality oriented adaptation for semantic segmentation of urban scenes,” arXiv preprint arXiv:1711.11556, 2017.

[138] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa, “Learning from synthetic data: Addressing domain shift for semantic segmentation,” 2017.


[139] L. A. Gatys, A. S. Ecker, and M. Bethge, “Image style transfer using convolutional
neural networks,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, 2016, pp. 2414–2423.

[140] W. Deng, L. Zheng, G. Kang, Y. Yang, Q. Ye, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” arXiv preprint arXiv:1711.07027, 2017.

[141] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun, “Show, adapt and tell: Adversarial training of cross-domain image captioner,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.

[142] W. Zhao, W. Xu, M. Yang, J. Ye, Z. Zhao, Y. Feng, and Y. Qiao, “Dual learning for cross-domain image captioning,” in Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. ACM, 2017, pp. 29–38.

[143] P. P. Busto and J. Gall, “Open set domain adaptation,” in Proceedings of the IEEE International Conference on Computer Vision, 2017.

[144] J. Zhang, Z. Ding, W. Li, and P. Ogunbona, “Importance weighted adversarial nets for partial domain adaptation,” arXiv preprint arXiv:1803.09210, 2018.


Mei Wang received the B.E. degree in information and communication engineering from the Dalian University of Technology (DUT), Dalian, China, in 2013, and the M.E. degree in communication engineering from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2016. Since September 2018, she has been a Ph.D. student in the School of Information and Communication Engineering, BUPT. Her research interests include pattern recognition and computer vision, with a particular emphasis on deep face recognition and transfer learning.

Weihong Deng received the B.E. degree in information engineering and the Ph.D. degree in signal and information processing from the Beijing University of Posts and Telecommunications (BUPT), Beijing, China, in 2004 and 2009, respectively. From October 2007 to December 2008, he was a postgraduate exchange student in the School of Information Technologies, University of Sydney, Australia. He is currently a professor in the School of Information and Communication Engineering, BUPT. His research interests include statistical pattern recognition and computer vision, with a particular emphasis on face recognition. He has published over 100 technical papers in international journals and conferences, such as IEEE TPAMI and CVPR. He serves as an associate editor for IEEE Access and as a guest editor for the Image and Vision Computing journal, and he reviews for dozens of international journals, such as IEEE TPAMI/TIP/TIFS/TNNLS/TMM/TSMC, IJCV, and PR/PRL. His dissertation, titled “Highly accurate face recognition algorithms,” received the Outstanding Doctoral Dissertation Award from the Beijing Municipal Commission of Education in 2011. He was supported by the New Century Excellent Talents program of the Ministry of Education of China in 2013 and by the Beijing Nova Program in 2016.
