Transfer Learning - Qiang Yang
Transfer learning deals with how systems can quickly adapt themselves to new
situations, new tasks and new environments. It gives machine learning systems the
ability to leverage auxiliary data and models to help solve target problems when there
is only a small amount of data available in the target domain. This makes such systems
more reliable and robust, keeping a machine learning model that faces
unforeseeable changes from deviating too much from its expected performance. At an
enterprise level, transfer learning allows knowledge to be reused so experience gained
once can be repeatedly applied to the real world.
This self-contained, comprehensive reference text begins by describing the standard
algorithms and then demonstrates how these are used in different transfer learning
paradigms and applications. It offers a solid grounding for newcomers as well as new
insights for seasoned researchers and developers.
Q I A N G YA N G
Hong Kong University of Science and Technology
YU ZHANG
Southern University of Science and Technology
W E N Y UA N DA I
4Paradigm Co., Ltd.
S I N N O J I A L I N PA N
Nanyang Technological University
University Printing House, Cambridge CB2 8BS, United Kingdom
One Liberty Plaza, 20th Floor, New York, NY 10006, USA
477 Williamstown Road, Port Melbourne, VIC 3207, Australia
314–321, 3rd Floor, Plot 3, Splendor Forum, Jasola District Centre,
New Delhi – 110025, India
79 Anson Road, #06–04/06, Singapore 079906
www.cambridge.org
Information on this title: www.cambridge.org/9781107016903
DOI: 10.1017/9781139061773
© Qiang Yang, Yu Zhang, Wenyuan Dai and Sinno Jialin Pan 2020
This publication is in copyright. Subject to statutory exception
and to the provisions of relevant collective licensing agreements,
no reproduction of any part may take place without the written
permission of Cambridge University Press.
First published 2020
Printed in the United Kingdom by TJ International, Padstow Cornwall
A catalogue record for this publication is available from the British Library.
ISBN 978-1-107-01690-3 Hardback
Cambridge University Press has no responsibility for the persistence or accuracy of
URLs for external or third-party internet websites referred to in this publication
and does not guarantee that any content on such websites is, or will remain,
accurate or appropriate.
Contents
Preface
References
Index
Preface
This book is about the foundations, methods, techniques and applications of trans-
fer learning. Transfer learning deals with how learning systems can quickly adapt
themselves to new situations, new tasks and new environments. Transfer learning
is a particularly important area of machine learning, which we can understand
from several angles. First, the ability to learn from small data seems to be a partic-
ularly strong aspect of human intelligence. For example, we observe that babies
learn from only a few examples and can quickly and effectively generalize from
the few examples to concepts. This ability to learn from small data can be partly
explained by the ability of humans to leverage and adapt the previous experience
and pretrained models to help solve future target problems. Adaptation is an in-
nate ability of intelligent beings and artificially intelligent agents should certainly
be endowed with transfer learning ability.
Second, in machine learning practice, we observe that we are often surrounded
with lots of small-sized data sets, which are often isolated and fragmented. Many
organizations do not have the ability to collect a huge amount of big data due to a
number of constraints that range from resource limitations to organizational
interests, and to regulations and concerns for user privacy. This small-data challenge
is a serious problem faced by many organizations applying AI technology to their
problems. Transfer learning is a suitable solution for addressing this challenge be-
cause it can leverage many auxiliary data and external models, and adapt them to
solve the target problems.
Third, transfer learning can make AI and machine learning systems more reli-
able and robust. It is often the case that, when building a machine learning model,
one cannot foresee all future situations. In machine learning, this problem is of-
ten addressed using a technique known as regularization, which leaves room for
future changes by limiting the complexity of the models. Transfer learning takes
this approach further, by allowing the model to be complex while being prepared
for changes when they actually come.
In addition, when facing unforeseeable changes and taking a learned model
across domain boundaries, transfer learning still makes sure that the model per-
formance does not deviate from the expected performance too much. In this way,
learning. Part II, which includes Chapters 15–22, covers many application fields
of transfer learning. We give concluding remarks in Chapter 23.
The book is an accumulation of hard research work, spanning over a decade, by a
group of researchers consisting mainly of Professor Qiang Yang’s current and
former graduate students, postdoctoral researchers and research associates. We
assigned each chapter to one or more of these authors, and the four main editors
wrote the remaining chapters, worked in depth on each chapter to help refine
the content, or did both.
The following is a list of these authors.
Finally, we wish to thank Yutao Deng, whose managerial work helped keep
the schedules and coordinate the team. To all, our sincere thanks! Without their
tremendous effort, the book would have been impossible to complete.
We the editors also wish to thank our colleagues, organizations and collabo-
rators over the years. We thank the Hong Kong University of Science
and Technology, the Hong Kong CERG Fund, the Hong Kong Innovation and Technol-
ogy Fund, the 4Paradigm Corp., Nanyang Technological University Singapore, WeBank
and many others for their generous support.
Finally, we wish to acknowledge the support of our families, whose patience
and encouragement allowed us to finally complete the book.
PART I
Even though a machine learning model can be made to be of high quality, it can
also make mistakes, especially when the model is applied to different scenarios
from its training environments. For example, if a new photo is taken in an outdoor
environment with different light intensities and levels of noise such as shadows,
sunlight from different angles and occlusion by passersby, the recognition
capability of the system may dramatically drop. This is because the model trained
by the machine learning system is applied to a “different” scenario. This drop in
performance shows that models can become outdated and need updating when new
situations occur. It is this need to update or transfer models from one scenario to
another that lends importance to the topic of the book.
The need for transfer learning is not limited to image understanding. Another
example is understanding Twitter text messages by natural language processing
(NLP) techniques. Suppose we wish to classify Twitter messages into different
user moods, such as happy or sad, based on their content. When one model is built using
a collection of Twitter messages and then applied to new data, the performance
drops quite dramatically as a different community of people will very likely ex-
press their opinions differently. This happens when we have teenagers in one
group and grown-ups in another.
As the previous examples demonstrate, a major challenge in practicing ma-
chine learning in many applications is that models do not work well in new task
domains. The reason why they do not work well may be due to one of several
reasons: lack of new training data due to the small data challenge, changes of
circumstances and changes of tasks. For example, in a new situation, high-quality
training data may be in short supply, if not impossible to obtain, for model
retraining, as in the case of medical diagnosis and medical imaging data.
Machine learning models cannot do well without sufficient training data. Obtain-
ing and labeling new data often takes much effort and resources in a new appli-
cation domain, which is a major obstacle in realizing AI in the real world. Having
well-designed AI systems without the needed training data is like having a sports
car without fuel.
This discussion highlights a major roadblock in bringing machine learning
to the practical world: it would be impossible to collect large quantities of data in
every domain before applying machine learning. Here we summarize some of the
reasons to develop such a transfer learning methodology:
1) Many applications only have small data: the current success of machine learn-
ing relies on the availability of a large amount of labeled data. However, high-
quality labeled data are often in short supply. Traditional machine learning
methods often cannot generalize well to new scenarios, a phenomenon known
as overfitting, and fail in many such cases.
2) Machine learning models need to be robust: traditional machine learning of-
ten makes an assumption that both the training and test data are drawn from
the same distribution. However, this assumption is too strong to hold in many
allow one to focus on adapting the new parts for the book-recommendation task,
which allows one to further exploit the underlying similarities between the data
sets. Then, book domain classification and user preference learning models can
be adapted from those of the movie domain.
Based on the transfer learning methodologies, once we obtain a well-developed
model in one domain, we can bring this model to benefit other similar domains.
Hence, having an accurate “distance” measure between any task domains is nec-
essary in developing a sound transfer learning methodology. If the distance be-
tween two domains is large, then we may not wish to apply transfer learning as
the learning might turn out to produce a negative effect. On the other hand, if two
domains are “close by,” transfer learning can be fruitfully applied.
In machine learning, the distance between domains can often be measured in
terms of the features that are used to describe the data. In image analysis, fea-
tures can be pixels or patches in an image pattern, such as the color or shape.
In NLP, features can be words or phrases. Once we know that two domains are
close to each other, we can ensure that AI models can be propagated from the
well-developed domains to less-developed domains, making the application of AI
less data dependent. And this can be a good sign for successful transfer learning
applications.
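One widely used feature-level distance of this kind is the maximum mean discrepancy (MMD), which compares the average kernel features of two samples. The NumPy sketch below is our own minimal illustration — the synthetic Gaussian domains and the hand-picked RBF bandwidth are assumptions, not a procedure prescribed in this chapter:

```python
import numpy as np

def rbf_kernel(a, b, gamma=0.5):
    """Pairwise RBF kernel matrix between the rows of a and b."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def mmd2(xs, xt, gamma=0.5):
    """Biased empirical estimate of the squared MMD between two samples."""
    k_ss = rbf_kernel(xs, xs, gamma).mean()
    k_tt = rbf_kernel(xt, xt, gamma).mean()
    k_st = rbf_kernel(xs, xt, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, (200, 2))     # a reference domain
near = rng.normal(0.0, 1.0, (200, 2))     # a "close-by" domain
far = rng.normal(3.0, 1.0, (200, 2))      # a distant domain

print(mmd2(base, near))   # small: the domains look alike
print(mmd2(base, far))    # large: transfer is riskier here
```

When the two samples come from the same distribution the estimate stays near zero, and it grows as the domains drift apart, matching the intuition that transfer is safer between "close-by" domains.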
Being able to transfer knowledge from one domain to another allows machine
learning systems to extend their range of applicability beyond their original cre-
ation. This generalization ability helps make AI more accessible and more robust
in many areas where AI talents or resources such as computing power, data and
hardware might be scarce. In a way, transfer learning allows the promotion of AI
as a more inclusive technology that serves everyone.
To give an intuitive example, we can use an analogy to highlight the key insights
behind transfer learning. Consider driving in different countries in the world. In
the USA and China, for example, the driver sits on the left side of the car and drives
on the right side of the road. In Britain, the driver sits on the right side of the car,
and drives on the left side of the road. For a traveler used to driving in the
USA, it is particularly hard to switch to driving in Britain. Transfer learning,
however, tells us to find the invariant in the two driving domains that is a common
feature. On closer observation, one can find that, no matter where one drives, the
driver sits closest to the center of the road or, equivalently, farthest from the side
of the road. This fact allows human drivers to smoothly
“transfer” from one country to another. Thus, the insight behind transfer learning
is to find the “invariant” between domains and tasks.
Transfer learning has been studied under different terminologies in AI, such as
knowledge reuse and case-based reasoning (CBR), learning by analogy, domain adaptation, pre-training,
fine-tuning, and so on. In the fields of education and learning psychology, trans-
fer of learning has a similar notion as transfer learning in machine learning. In
particular, transfer of learning refers to the process in which past experience ac-
quired from previous source tasks can be used to influence future learning and
Definition 1.1 (transfer learning) Given a source domain Ds and learning task
Ts , a target domain Dt and learning task Tt , transfer learning aims to help improve
the learning of the target predictive function f t (·) for the target domain using the
knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.
A transfer learning process is illustrated in Figure 1.1. The process on the left
corresponds to a traditional machine learning process. The process on the right
corresponds to a transfer learning process. As we can see, transfer learning makes
use of not only the data in the target task domain as input to the learning algo-
rithm, but also any of the learning process in the source domain, including the
training data, models and task description. This figure shows a key concept of
transfer learning: it counters the lack of training data problem in the target do-
main with more knowledge gained from the source domain.
As a domain contains two components, D = {X, P^X}, the condition Ds ≠ Dt implies
that either Xs ≠ Xt or P_s^X ≠ P_t^X. Similarly, as a task is defined as a pair of
components T = {Y, P^{Y|X}}, the condition Ts ≠ Tt implies that either Ys ≠ Yt or
P_s^{Y|X} ≠ P_t^{Y|X}. When the target domain and the source domain are the same, that
is, Ds = Dt, and their learning tasks are the same, that is, Ts = Tt, the learning
problem becomes a traditional machine learning problem.
Based on this definition, we can formulate different ways to categorize exist-
ing transfer learning studies into different settings. For instance, based on the ho-
mogeneity of the feature spaces and/or label spaces, we can categorize transfer
learning into two settings: (1) homogeneous transfer learning and (2) heteroge-
neous transfer learning, whose definitions are described as follows (Pan, 2014).1
Besides using the homogeneity of the feature spaces and label spaces, we can
also categorize existing transfer learning studies into the following three settings
by considering whether labeled data and unlabeled data are available in the tar-
get domain: supervised transfer learning, semi-supervised transfer learning and
unsupervised transfer learning. In supervised transfer learning, only a few labeled
data are available in the target domain for training, and we do not use the unla-
beled data for training. For unsupervised transfer learning, there are only unla-
beled data available in the target domain. In semi-supervised transfer learning,
sufficient unlabeled data and a few labeled data are assumed to be available in
the target domain.
To design a transfer learning algorithm, we need to consider the following three
main research issues: (1) when to transfer, (2) what to transfer and (3) how to
transfer.
When to transfer asks in which situations transferring skills should be done.
Likewise, we are interested in knowing in which situations knowledge should not
be transferred. In some situations, when the source domain and the target do-
main are not related to each other, brute-force transfer may be unsuccessful. In
the worst case, it may even hurt the performance of learning in the target domain,
a situation which is often referred to as negative transfer. Most current studies
on transfer learning focus on “what to transfer” and “how to transfer,” by implic-
itly assuming that the source domain and the target domain are related to each
other. However, how to avoid negative transfer is an important open issue that is
attracting more and more attention.
What to transfer determines which part of knowledge can be transferred across
domains or tasks. Some knowledge is specific for individual domains or tasks,
and some knowledge may be common between different domains such that it
may help improve performance for the target domain or task. Note that the term
1 In the rest of book, without explicit specification, the term “transfer learning” denotes
homogeneous transfer learning.
learning model using sufficient source data, which could be quite different from
the target data. After the deep model is trained, a few target labeled data are used
to fine-tune part of the parameters of the pretrained deep model, for example, to
fine-tune parameters of several layers while fixing parameters of other layers.
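This layer-freezing recipe can be illustrated without any deep learning library. In the toy two-layer network below — entirely our own construction, with random first-layer weights standing in for a source-pretrained model — only the output layer is updated on the few target examples while the first layer stays fixed:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(x, w1, w2):
    h = np.tanh(x @ w1)   # hidden layer (frozen during fine-tuning)
    return h @ w2         # output layer (fine-tuned on the target data)

# Pretend w1 was pretrained on plentiful source-domain data.
w1 = rng.normal(size=(5, 8))

# A handful of labeled target examples.
x_t = rng.normal(size=(20, 5))
y_t = np.sin(x_t.sum(axis=1))

# Fine-tune only w2 by gradient descent on the squared error;
# w1 never receives a gradient step, so the source features are preserved.
w2 = np.zeros(8)
lr = 0.1
for _ in range(500):
    h = np.tanh(x_t @ w1)
    err = h @ w2 - y_t
    w2 -= lr * h.T @ err / len(x_t)

print(np.mean((forward(x_t, w1, w2) - y_t) ** 2))  # fit on the target data improves
```

In a real deep model the frozen part would be several pretrained layers rather than one random matrix, but the parameter-partitioning logic is the same.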
Different from the three aforementioned categories of approaches, relation-
based transfer learning approaches assume that some relationships between ob-
jects (i.e., instances) are similar across domains or tasks. Once these common re-
lationships are extracted, then they can be used as knowledge for transfer learn-
ing. Note that, in this category, data in the source domain and the target domain
are not required to be independent and identically distributed as the other three
categories.
[Figure fragment: supervised learning, semi-supervised learning and transfer learning contrasted by the lack of labeled data in the target domain/task and by the exploration of target domain/task unlabeled data.]
to choose the data from which it learns. However, active learning assumes that
there is a budget for the active learner to pose queries in the domain of interest. In
some real world applications, the budget may be quite limited, which means that
the labeled data queried by active learning may not be sufficient to learn
an accurate classifier in the domain of interest.
Transfer learning, in contrast, allows the domains, tasks and distributions used
in the training phase and the testing phase to be different. The main idea behind
transfer learning is to borrow labeled data or extract knowledge from some related
domains to help a machine learning algorithm to achieve greater performance in
the domain of interest. Thus, transfer learning can be referred to as a different
strategy for learning models with minimal human supervision, compared to semi-
supervised and active learning.
One of the most related learning paradigms to transfer learning is multi-task
learning. Although both transfer learning and multitask learning aim to general-
ize commonality across tasks, transfer learning is focused on learning on a target
task, where some source task(s) is(are) used as auxiliary information, while mul-
titask learning aims to learn a set of target tasks jointly to improve the general-
ization performance of each learning task without any source or auxiliary tasks.
As most existing multitask learning methods consider all tasks to have the same
importance, while transfer learning only takes the performance of the target task
into consideration, some detailed designs of the learning algorithms are differ-
ent. However, most existing multitask learning algorithms can be adapted to the
transfer learning setting.
We summarize the relationships between transfer learning and other machine
learning paradigms in Figure 1.2, and the difference between transfer learning and
multitask learning in Figure 1.3.
[Figure 1.3 fragment: transfer learning exploits the commonality between source and target tasks, whereas multitask learning exploits the commonality among the target tasks themselves.]
often too coarse to serve the purpose well in measuring the distance in the trans-
ferability between two domains or tasks. Second, if the domains have different
feature spaces and/or label spaces, one has to first project the data onto the same
feature and/or label space, and then apply the statistical distance measures as a
follow-up step. Therefore, more research needs to be done on a general notion of
distances between two domains or tasks.
diversity, to allow a system to explore new topics as well as cater to users’ recent
choices. Relating to transfer learning, the work shows that the recommendation
strategy in balancing exploration and exploitation can indeed be transferred be-
tween domains.
ally difficult problems such as question answering problems (Devlin et al., 2018).
It has accomplished surprising results by leading in many tasks in the open com-
petition SQuAD 2.0 (Rajpurkar et al., 2016). The source domain consists of an ex-
tremely large collection of natural language text corpus, with which BERT trained
a model based on bidirectional Transformers with the attention
mechanism. The pretrained model is capable of making a variety of predictions in
a language model more accurate than before, and the predictive power increases
with increasing amounts of training data in the source domain. Then, the BERT
model is applied to a specific task in a target domain by adding additional small
layers to the source model in such tasks as Next Sentence classification, Ques-
tion Answering and Named Entity Recognition (NER). The transfer learning ap-
proach corresponds to model-based transfer, where most parameters stay
the same but a selected few are adapted with the new data in
the target domain.
There have been some surveys on transfer learning in machine learning lit-
erature. Pan and Yang (2010) and Taylor and Stone (2009) give early surveys of
the work on transfer learning, where the former focused on machine learning in
classification and regression areas and the latter on reinforcement learning ap-
proaches. This book aims to give a comprehensive survey that covers both these
areas, as well as the more recent advances of transfer learning with deep learning.
transfer learning methods can also be useful when multiple source do-
mains exist.
Chapter 3 covers feature-based transfer learning. Features constitute a major el-
ement of machine learning. They can be straightforward attributes in the
input data, such as pixels in images or words and phrases in a text docu-
ment, or they can be composite features composed by certain nonlin-
ear transformations of input features. Together these features comprise
a high-dimensional feature space. Feature-based transfer identifies
common subspaces of features between source and target domains, and
allows transfer to happen in these subspaces. This style of transfer learn-
ing is particularly useful when no clear instances can be directly trans-
ferred, but some common “style” of learning can be transferred.
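A minimal linear version of this idea — our own sketch, not an algorithm from Chapter 3 — learns a single subspace from the pooled source and target inputs and represents both domains in it:

```python
import numpy as np

rng = np.random.default_rng(3)
x_s = rng.normal(size=(100, 10))        # source-domain inputs
x_t = rng.normal(size=(80, 10)) + 0.5   # target-domain inputs, shifted

# Learn a shared low-dimensional subspace from the pooled, centered data.
pooled = np.vstack([x_s, x_t])
pooled = pooled - pooled.mean(axis=0)
_, _, vt = np.linalg.svd(pooled, full_matrices=False)
basis = vt[:3]                          # top three principal directions

# Both domains are now described in the same three-dimensional feature space,
# where a model trained on source data can be applied to target data.
z_s = x_s @ basis.T
z_t = x_t @ basis.T
print(z_s.shape, z_t.shape)
```

Practical feature-based methods replace this plain principal-subspace step with objectives that explicitly pull the two domains' distributions together in the learned subspace.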
Chapter 4 discusses model-based transfer learning. Model-based transfer is when
parts of a learning model can be transferred to a target domain from a
source domain, where the learning in the target domain can be “fine-
tuned” based on the transferred model. Model-based transfer learning
is particularly useful when one has a fairly complete collection of data
in a source domain, and the model in the source domain can be made
very powerful in terms of coverage. Then learning in a target domain cor-
responds to adapting the general model from the source domain to a spe-
cific model in a target domain on the “edge” of a network of
domains.
Chapter 5 explores relation-based transfer learning. This chapter is particularly
useful when knowledge is coded in terms of a knowledge graph or in rela-
tional logic form. When some dictionary of translation can be instituted,
and when knowledge exists in the form of some encoded rules, this type
of transfer learning can be particularly useful.
Chapter 6 presents heterogeneous transfer learning. Sometimes, when we deal
with transfer learning, the target domain may have a completely differ-
ent feature representation from that of the source domain. For example,
we may have collected labeled data about images, but the target task is to
classify text documents. If there is some relationship between the images
and the text documents, transfer learning can still happen at the seman-
tic level, where the semantics of the common knowledge between the
source and the target domains can be extracted as a “bridge” to enable
the knowledge transfer.
Chapter 7 discusses adversarial transfer learning. Machine learning, especially
deep learning, can be designed to generate data and at the same time
classify data. This dual relationship in machine learning can be exploited
to mimic the power of imitation and creation in humans. This learn-
ing process can be modeled as a game between multiple models, and is
called adversarial learning. Adversarial learning can be very useful in em-
powering a transfer learning process, which is the subject of this chapter.
netics domain is full of data of very high dimensionality and low sample
sizes. We give an overview of works in this area.
Chapter 21 presents applications of transfer learning in activity recognition based
on sensors. Activity recognition refers to finding people’s activities from
sensor readings, which can be very useful for assisted living, security and
a wide range of other applications. A challenge in this domain is the lack
of labeled data, and this challenge is particularly well suited for transfer learning
to address.
Chapter 22 discusses applications of transfer learning in urban computing. There
are many machine learning problems to address in urban computing,
ranging from traffic prediction to pollution forecasting. When data have been
collected in one city, the model can be transferred to a newly considered
city via transfer learning, especially when there is not sufficient high-
quality data in the new city.
Chapter 23 gives a summary of the whole book with an outlook for future works.
2
Instance-Based Transfer Learning
2.1 Introduction
is that the input instances of the source domain and the target domain have the
same or very similar support, which means that the features for most instances
have a similar range of values. Furthermore, the output labels of the source and
target tasks are the same. This assumption ensures that knowledge can be trans-
ferred across domains via instances. According to the definitions of a domain and
a task, this assumption implies that, in instance-based transfer learning, the dif-
ference between domains/tasks is caused only by the difference in the marginal
distributions of the features (i.e., P_s^X ≠ P_t^X) or in the conditional probabilities
(i.e., P_s^{Y|X} ≠ P_t^{Y|X}).
When P_s^X ≠ P_t^X but P_s^{Y|X} = P_t^{Y|X}, we refer to the problem setting as noninduc-
tive transfer learning.1 For example, suppose a hospital, either private or public,
aims to learn a prediction model for a specific disease from its own patients’ elec-
tronic medical records. Here we consider each hospital as a different domain. As
the populations of patients of different hospitals are different, the marginal prob-
abilities P_s^X are different across different domains. However, as the reasons that
cause the specific disease are the same, the conditional probabilities P^{Y|X} across
different domains remain the same. When P_s^{Y|X} ≠ P_t^{Y|X}, we refer to the problem
setting as inductive transfer learning. For instance, consider avian influenza virus
as the specific disease in the previous example. As avian influenza virus has been
evolving, the reasons causing avian influenza virus may change across different
subtypes of avian influenza virus, for example, H1N1 versus H5N8. Here we con-
sider learning a prediction model for each subtype of avian influenza virus for a
specific hospital as a different task. As the reasons that cause different subtypes of
avian influenza virus are different, the conditional probabilities P^{Y|X} are different
across different tasks. In noninductive transfer learning, as the conditional prob-
abilities across domains are the same, that is, P_s^{Y|X} = P_t^{Y|X}, it can be theoretically
proven that, even without any labeled data in the target domain, an optimal pre-
dictive model can be learned from the source domain-labeled data and the target
domain-unlabeled data. While in the inductive transfer learning case, as the con-
ditional probabilities are different across tasks, a few labeled data in the target
domain would then be required to help transfer the conditional proba-
bility or the discriminative function from the source task to the target task. Since
the assumptions of noninductive transfer learning and inductive transfer learn-
ing are different, the designs of instance-based transfer learning approaches for
these two settings are different. In the following, we will review the motivations,
basic ideas and representative methods for noninductive and inductive transfer
learning in detail.
1 Note that here we do not adopt the term “transductive transfer learning” used by Pan and Yang
(2010) because the term “transductive” has been widely used to distinguish whether a model has
an out-of-sample generalization ability, which may cause some confusion if used to define
transfer learning problem settings.
where ℓ(x, y, θ_t) is a loss function in terms of the parameters θ_t. Since there are no
target domain-labeled data, one cannot optimize (2.1) directly. It has been proven
by Pan (2014) that, by using Bayes’ rule and the definition of expectation, the
optimization (2.1) can be rewritten as follows,

θ_t^* = argmin_{θ_t ∈ Θ} E_{(x,y)∼P_s^{X,Y}} [ (P_t(x, y) / P_s(x, y)) ℓ(x, y, θ_t) ],   (2.2)

which aims to learn the optimal parameter θ_t^* by minimizing the weighted ex-
pected risk over source domain-labeled data. In noninductive transfer learning,
as P_s^{Y|X} = P_t^{Y|X}, by decomposing the joint distribution P^{X,Y} = P^{Y|X} P^X, we obtain
P_t(x, y)/P_s(x, y) = P_t(x)/P_s(x). Hence, (2.2) can be further rewritten as

θ_t^* = argmin_{θ_t ∈ Θ} E_{(x,y)∼P_s^{X,Y}} [ (P_t(x) / P_s(x)) ℓ(x, y, θ_t) ].   (2.3)

Approximating this expectation with the n_s labeled source instances, and writing
β(x) = P_t(x)/P_s(x), yields the weighted empirical risk2

θ_t^* = argmin_{θ_t ∈ Θ} Σ_{i=1}^{n_s} β(x_i^s) ℓ(x_i^s, y_i^s, θ_t).   (2.4)

Therefore, to properly reuse the source domain-labeled data to learn a target model,
one needs to estimate the weights {β(x_i^s)}. As shown in (2.4), to estimate {β(x_i^s)},
2 In practice, a regularization term is added to avoid model overfitting.
that is, density ratios, only input instances without labels from the source domain
and the target domain are required. A simple solution to estimate {β(x_i^s)} is to
first estimate P_t^X and P_s^X, respectively, and then compute the ratio
P_t(x_i^s)/P_s(x_i^s) for each specific source domain instance x_i^s. However, it is
well known that density estimation is itself a difficult task (Tsuboi et al., 2009), es-
pecially when data are of high dimension. In this case, the error caused by density
estimation will be propagated to the density ratio estimation.
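The two-step plug-in estimator just described can be made concrete in one dimension. The sketch below is our own construction — Gaussian fits stand in for the density-estimation step, and the shift sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(6)

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# One-dimensional covariate shift: source N(0, 1), target N(1, 1).
x_s = rng.normal(0.0, 1.0, 300)
x_t = rng.normal(1.0, 1.0, 300)

# Step 1: estimate each density separately (here, by fitting a Gaussian).
mu_s, sd_s = x_s.mean(), x_s.std()
mu_t, sd_t = x_t.mean(), x_t.std()

# Step 2: plug the two estimates into the ratio at each source instance.
beta = gaussian_pdf(x_s, mu_t, sd_t) / gaussian_pdf(x_s, mu_s, sd_s)

# The true ratio N(1,1)/N(0,1) equals exp(x - 0.5); compare at the source points.
true = np.exp(x_s - 0.5)
print(np.abs(beta - true).max())  # estimation error at the source points
```

Even in this easy one-dimensional, well-specified case the plug-in ratio inherits the errors of both density fits; in high dimensions the problem is much worse, which motivates the direct estimators reviewed next.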
In the literature (Quionero-Candela et al., 2009), more promising solutions have
been proposed to estimate P_t^X / P_s^X directly, bypassing the density estimation step. In
the following sections, we introduce how to directly estimate the density ratio by
reviewing several representative methods.
P_t(x) / P_s(x) = (P(δ = 1) / P(δ = 0)) · (1 / P(δ = 1|x) − 1).

Therefore, the density ratio for each source domain data instance can be esti-
mated as P_t(x)/P_s(x) ∝ 1/P_{s,t}(δ = 1|x). To compute the probability P(δ = 1|x), we regard it
as a binary classification problem and train a classifier to solve it. After calculating
the ratio for each source data instance, a model can be trained by either reweight-
ing each source data instance or performing importance sampling on the source
data set.
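The recipe in the last paragraph can be sketched end to end with a plain NumPy logistic regression. All names, constants and the gradient-descent fitting routine below are our own choices, not the exact formulation of Zadrozny (2004): train a domain classifier to get P(δ = 1|x), turn its output into instance weights, and fit the final model by reweighted source-domain training in the spirit of (2.4).

```python
import numpy as np

rng = np.random.default_rng(4)

def fit_logistic(x, y, weights=None, lr=0.5, steps=2000):
    """Weighted logistic regression by gradient descent; returns (w, b)."""
    if weights is None:
        weights = np.ones(len(y))
    weights = weights / weights.sum()
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        g = weights * (p - y)          # per-instance gradient factor
        w -= lr * x.T @ g
        b -= lr * g.sum()
    return w, b

# Source and target share P(Y|X) but differ in P(X) (covariate shift).
x_s = rng.normal(-1.0, 1.0, (500, 2))
y_s = (x_s[:, 0] + x_s[:, 1] > 0).astype(float)
x_t = rng.normal(+1.0, 1.0, (500, 2))   # unlabeled at training time

# Step 1: domain classifier, with delta = 1 marking source instances.
x_all = np.vstack([x_s, x_t])
delta = np.concatenate([np.ones(len(x_s)), np.zeros(len(x_t))])
w_d, b_d = fit_logistic(x_all, delta)
p_src = 1.0 / (1.0 + np.exp(-(x_s @ w_d + b_d)))   # P(delta=1 | x) on source data

# Step 2: density-ratio weights, beta(x) ∝ (1 - P(delta=1|x)) / P(delta=1|x).
beta = (1.0 - p_src) / np.clip(p_src, 1e-6, None)

# Step 3: reweighted source training, an empirical instance of (2.4).
w, b = fit_logistic(x_s, y_s, weights=beta)

# Evaluate on the target domain (labels known here only for checking).
y_t = (x_t[:, 0] + x_t[:, 1] > 0).astype(float)
acc = np.mean(((x_t @ w + b) > 0) == y_t)
print(acc)
```

Source instances that look like target instances receive large weights, so the final model pays most attention to the part of the source sample that overlaps the target distribution.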
Following the idea of Zadrozny (2004), Bickel et al. (2007) propose a framework
to integrate the density ratio estimation step and the model training step with
reweighted source data instances. Let P X denote the probability density of x in
the union data set of the source domain and the target domain. We can use any
classifier to estimate the probability P (δ = 1|x). Suppose the classifier is param-
eterized by v and the parameters for the final learning model that is trained on
the reweighted source domain data are denoted by w. All the parameters can be
optimized using the maximum a posterior (MAP) approach:
where Ds and Dt denote the source data set and the target data set, respectively.
Note that P (w, v|Ds , Dt ) is proportional to P (Ds |w, v)P (Ds , Dt |v)P (w)P (v). There-
fore, the MAP solution can be found by maximizing P (Ds |w, v)P (Ds , Dt |v)P (w)P (v).
where Φ transforms each source domain data instance into the RKHS F, and μ(P_t^X) is the expectation of the target domain instances in the RKHS, that is, μ(P_t^X) = E_{P_t^X}[Φ(x)].

In practice, one can optimize the following empirical objective:

\min_{\beta} \left\| \frac{1}{n_s}\sum_{i=1}^{n_s} \beta_i \Phi(x_i^s) - \frac{1}{n_t}\sum_{i=1}^{n_t} \Phi(x_i^t) \right\|^2 \quad \text{s.t.} \quad \beta_i \ge 0, \ \left| \frac{1}{n_s}\sum_{i=1}^{n_s} \beta_i - 1 \right| \le \epsilon, \qquad (2.7)

where ε is a positive real number. After solving for the optimal β, that is, the weights, β can be incorporated into (2.4) with any specified loss function to learn a predictive model θ_t^* for the target domain.
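A minimal sketch of solving an objective of the form (2.7) with an RBF kernel and projected gradient descent. The toy samples, kernel width, and step size are hypothetical, and the box constraint is simplified to rescaling β to mean one after each step (which lies inside the feasible set for any ε ≥ 0); this is an illustration, not the solver used in the original work.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF kernel matrix between two 1-D samples."""
    d = a[:, None] - b[None, :]
    return np.exp(-gamma * d ** 2)

rng = np.random.default_rng(1)
x_s = rng.normal(0.0, 1.0, 100)   # source sample (hypothetical)
x_t = rng.normal(1.0, 1.0, 100)   # target sample (hypothetical)
n_s, n_t = len(x_s), len(x_t)

K = rbf(x_s, x_s)                  # n_s x n_s source kernel block
kappa = rbf(x_s, x_t).sum(axis=1)  # sum_j k(x_i^s, x_j^t)

# Expanding the squared RKHS norm in (2.7) gives a quadratic in beta;
# descend its gradient and project onto {beta >= 0, mean(beta) = 1}.
beta = np.ones(n_s)
for _ in range(2000):
    grad = 2.0 * K @ beta / n_s ** 2 - 2.0 * kappa / (n_s * n_t)
    beta = np.clip(beta - 1.0 * grad, 0.0, None)  # beta_i >= 0
    beta *= n_s / beta.sum()                      # (1/n_s) sum beta_i = 1
```

Source instances lying in regions where the target sample is dense end up with larger weights, which is exactly the reweighting that (2.4) consumes.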
where α = (α_1, …, α_b)^T are coefficients to be learned and φ_l(·) is the l-th base function, which can be linear or nonlinear. In this way, P_t(x) can be approximated by P̂_t(x) = ω̂(x)P_s(x). The coefficients α can be learned by minimizing a loss function between P_t(x) and P̂_t(x). Different loss functions lead to different specific methods.
For instance, Sugiyama et al. (2008) propose to use Kullback–Leibler (KL) di-
vergence as the loss function. The resultant method is known as KL Importance
Estimation Procedure (KLIEP), whose objective is written as follows,
D_{KL}(P_t^X \,\|\, \hat{P}_t^X) = \int_{\mathcal{X}_t} P_t(x) \log \frac{P_t(x)}{\hat{\omega}(x) P_s(x)} \, dx \qquad (2.8)

= \int_{\mathcal{X}_t} P_t(x) \log \frac{P_t(x)}{P_s(x)} \, dx - \int_{\mathcal{X}_t} P_t(x) \log \hat{\omega}(x) \, dx. \qquad (2.9)
Note that, in (2.9), the ground-truth marginal probability of the target domain data, P_t^X, is used. However, it can be shown that, empirically, minimizing the aforementioned KL divergence can be approximated by solving the following optimization problem, in which the ground-truth marginal probability of the target domain data is canceled out:

\max_{\alpha} \frac{1}{n_t} \sum_{j=1}^{n_t} \log \sum_{l=1}^{b} \alpha_l \phi_l(x_j^t) \quad \text{s.t.} \quad \frac{1}{n_s} \sum_{i=1}^{n_s} \sum_{l=1}^{b} \alpha_l \phi_l(x_i^s) = 1, \ \alpha_l \ge 0 \ \forall l \in \{1, \ldots, b\}.
Another example of a loss function on the discrepancy between ω(x) and ω̂(x) is the squared loss (Kanamori et al., 2009). The resultant optimization problem can be written as follows,

\min_{\alpha} \int_{\mathcal{X}_s \cup \mathcal{X}_t} \left( \hat{\omega}(x) - \omega(x) \right)^2 P_s(x) \, dx.

Besides the KL divergence and the squared loss, many other forms of loss functions can be used.
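Under the squared loss, the basis-expansion model admits a closed-form fit. The following sketch, in the spirit of least-squares density-ratio fitting, uses Gaussian basis functions centered at a few target points (a common heuristic; the data, bandwidth, centers, and ridge term are all hypothetical, and non-negativity is enforced post hoc rather than inside the solver).

```python
import numpy as np

rng = np.random.default_rng(2)
x_s = rng.normal(0.0, 1.0, 300)    # source sample (hypothetical)
x_t = rng.normal(1.0, 1.0, 300)    # target sample (hypothetical)

centers = x_t[:20]                 # Gaussian basis centres
def phi(x):
    """Basis functions phi_l(x), l = 1..b, evaluated at each point of x."""
    return np.exp(-0.5 * (np.atleast_1d(x)[:, None] - centers[None, :]) ** 2)

# Empirical squared-loss fit: minimize 0.5 a^T H a - h^T a + 0.5*lam*||a||^2.
H = phi(x_s).T @ phi(x_s) / len(x_s)   # E_s[phi phi^T]
h = phi(x_t).mean(axis=0)              # E_t[phi]
lam = 0.1
alpha = np.linalg.solve(H + lam * np.eye(len(centers)), h)
alpha = np.clip(alpha, 0.0, None)      # keep the ratio model non-negative

def ratio(x):
    """Estimated density ratio omega_hat(x) = sum_l alpha_l phi_l(x)."""
    return phi(x) @ alpha
```

The estimated ratio is larger where the target density exceeds the source density, as the true ratio N(1,1)/N(0,1) is.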
ities, that is, P^s_{Y|X} ≠ P^t_{Y|X}. As the conditional probability changes across different tasks, if there are no labeled data in the target domain, then it is very difficult, if not impossible, to adapt P^s_{Y|X} to construct a precise P^t_{Y|X}. Therefore, in most instance-based inductive transfer learning approaches, besides a set of source domain labeled data D_s = {(x_i^s, y_i^s)}_{i=1}^{n_s}, a small set of target domain labeled data D_t = {(x_i^t, y_i^t)}_{i=1}^{n_t} is also required as input.³ The goal is still to learn a precise predictive model for unseen target domain data.
where the α_j are the model parameters of an SVM, the ξ_j are slack variables that absorb errors, and C is a parameter that controls how much penalty is imposed on the misclassified examples. In the inductive transfer learning setting, Wu and Dietterich (2004) propose to modify the objective function and constraints by treating the source domain labeled data and the target domain labeled data differently. Suppose α_j^s and ξ_j^s denote the model parameter and slack variable for the source domain instance x_j^s for j ∈ {1, …, n_s}, respectively. Similarly, α_j^t and ξ_j^t denote the parameter and slack variable for the target domain instance x_j^t, respectively, for j ∈ {1, …, n_t}. The parameters C_s and C_t are trade-off parameters. Then the revised objective function of SVMs is formulated as

\min \ \sum_{j=1}^{n_s} \alpha_j^s + \sum_{j=1}^{n_t} \alpha_j^t + C_s \sum_{j=1}^{n_s} \xi_j^s + C_t \sum_{j=1}^{n_t} \xi_j^t,

\text{s.t.} \quad y_i^t \Big( \sum_{j=1}^{n_t} y_j^t \alpha_j^t K(x_j^t, x_i^t) + \sum_{j=1}^{n_s} y_j^s \alpha_j^s K(x_j^s, x_i^t) + b \Big) \ge 1 - \xi_i^t, \quad i \in \{1, \ldots, n_t\},

\quad\ \ y_i^s \Big( \sum_{j=1}^{n_t} y_j^t \alpha_j^t K(x_j^t, x_i^s) + \sum_{j=1}^{n_s} y_j^s \alpha_j^s K(x_j^s, x_i^s) + b \Big) \ge 1 - \xi_i^s, \quad i \in \{1, \ldots, n_s\}.
Generally speaking, the revised SVM jointly optimizes the losses on the labeled
data of both the source domain and the target domain.
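The core idea, penalizing source and target errors with different costs C_s and C_t, can be illustrated with a primal analogue of the revised SVM: subgradient descent on a hinge loss over both domains. The toy tasks, costs, and learning rate below are hypothetical, and this is a simplified sketch rather than the dual formulation above.

```python
import numpy as np

rng = np.random.default_rng(3)
# Source task: labels follow sign(x1); target task: a slightly rotated rule.
X_s = rng.normal(0, 1, (100, 2)); y_s = np.sign(X_s[:, 0])
X_t = rng.normal(0, 1, (20, 2));  y_t = np.sign(X_t[:, 0] + 0.5 * X_t[:, 1])

C_s, C_t = 0.1, 1.0    # trust the scarce target labels more than source ones
w = np.zeros(2); b = 0.0
for _ in range(2000):
    gw, gb = w.copy(), 0.0              # gradient of (1/2)||w||^2
    for X, y, C in [(X_s, y_s, C_s), (X_t, y_t, C_t)]:
        viol = y * (X @ w + b) < 1      # margin violations -> hinge subgradient
        gw -= C * (y[viol, None] * X[viol]).sum(axis=0)
        gb -= C * y[viol].sum()
    w -= 0.01 * gw; b -= 0.01 * gb
```

Because C_t > C_s, the learned boundary leans toward the target rule while the plentiful source data still shapes it where target evidence is absent.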
Liao et al. (2005) further extend this idea to logistic regression and propose the "Migratory-Logit" algorithm. Migratory-Logit models the difference between two domains by introducing a new "auxiliary variable" μ_i for each source data instance (x_i^s, y_i^s). The parameter μ_i can be geometrically understood as an "intercept term" that makes x_i^s migrate toward class y_i^s in the target domain. It measures how mismatched the source data instance x_i^s is with respect to the target domain distribution P_t^X, and thus controls the importance of source data instances. For a target domain data instance (x_i^t, y_i^t), the posterior probability of its label y_i^t is the same as in traditional logistic regression, that is, P(y_i^t|x_i^t; w) = δ(y_i^t w^T x_i^t), where w is the parameter vector and δ(a) = 1/(1 + exp(−a)) is the sigmoid function. For a source domain instance (x_i^s, y_i^s), the posterior probability of y_i^s is defined as

P(y_i^s \mid x_i^s; w, \mu_i) = \delta(y_i^s w^T x_i^s + y_i^s \mu_i).

Then, all the parameters can be learned by maximizing the log-likelihood, with the optimization problem formulated as

\max_{w, \boldsymbol{\mu}} \ L(w, \boldsymbol{\mu}; D_s \cup D_t) \quad \text{s.t.} \quad \frac{1}{n_s} \sum_{i=1}^{n_s} y_i^s \mu_i \le C, \quad y_i^s \mu_i \ge 0, \ \forall i \in \{1, 2, \ldots, n_s\},

where C is a hyperparameter that controls the overall importance of the source domain data set.
The aforementioned approaches assume that, in the target domain, only la-
beled data are available as inputs for transfer learning algorithms. In many sce-
narios, plenty of unlabeled data may be available in the target domain as well.
Jiang and Zhai (2007) propose a general semi-supervised framework for instance-
based inductive transfer learning, where both labeled and unlabeled data in the
target domain are utilized with the source domain labeled data to train a target
predictive model.
In the work by Jiang and Zhai (2007), a parameter α_i is introduced for each source domain instance (x_i^s, y_i^s) ∈ D_s to measure how different P_s(y_i^s|x_i^s) is from P_t(y_i^s|x_i^s). Another parameter β_i is introduced for each source domain instance (x_i^s, y_i^s) ∈ D_s to approximate the density ratio P_t(x_i^s)/P_s(x_i^s). Then, for each target domain unlabeled instance x_i^{t,u} ∈ D_t and each possible label y, a parameter γ_i(y) is used to measure how likely the true label of x_i^{t,u} is y. Let D_t = D_l ∪ D_u, where D_l = {(x_j^{t,l}, y_j^{t,l})}_{j=1}^{n_{t,l}} represents the subset of target domain labeled instances and D_u = {x_k^{t,u}}_{k=1}^{n_{t,u}} represents the subset of target domain unlabeled instances. To find an optimal classifier in terms of parameters θ, Jiang and Zhai (2007) propose to solve the following optimization problem:

\theta^* = \arg\max_{\theta} \ \frac{\lambda_s}{C_s} \sum_{i=1}^{n_s} \alpha_i \beta_i \log P(y_i^s | x_i^s; \theta) + \frac{\lambda_{t,l}}{C_{t,l}} \sum_{j=1}^{n_{t,l}} \log P(y_j^{t,l} | x_j^{t,l}; \theta) + \frac{\lambda_{t,u}}{C_{t,u}} \sum_{k=1}^{n_{t,u}} \sum_{y \in \mathcal{Y}} \gamma_k(y) \log P(y | x_k^{t,u}; \theta) + \log P(\theta),

where C_s = \sum_{i=1}^{n_s} α_i β_i, C_{t,l} = n_{t,l}, and C_{t,u} = \sum_{k=1}^{n_{t,u}} \sum_{y \in \mathcal{Y}} γ_k(y) are normalization factors, the regularization parameters λ_s, λ_{t,l} and λ_{t,u} control the relative importance of each part with their sum equal to 1, and the prior P(θ) encodes the normal prior for θ. In this way, the source domain labeled data, the target domain labeled data and the target domain unlabeled data are fully utilized to learn the optimal solution θ^*.
domain data instance has a higher loss, its weight should be increased in the next iteration. For each source domain instance, in contrast, a higher loss suggests that it may not be helpful to the target task, and so its weight will be decreased in the next iteration. The rule to update the weight of each source domain instance is

w_i^s = w_i^s \, \theta^{\ell(h(x_i^s), y_i^s)}, \quad \text{where } \theta = 1 \big/ \left( 1 + \sqrt{2 \ln n_s / (n_s + n_t)} \right).

With these update rules, TrAdaBoost iteratively reweights both the source domain labeled data and the target domain labeled data to reduce the impact of misleading data instances in the source domain, and learns a series of classifiers to construct an ensemble classifier for the target domain.
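One round of this reweighting can be sketched as follows. The weak-learner training is omitted; the error indicators are assumed to be 0/1 vectors, the target factor β_t follows the usual AdaBoost-style rule, and the source factor θ follows the formula above. This is a simplified sketch, not the full TrAdaBoost algorithm.

```python
import numpy as np

def tradaboost_update(w_s, w_t, err_s, err_t, n_s, n_t):
    """One TrAdaBoost-style reweighting step.

    err_s, err_t are 0/1 indicators of whether each instance is
    misclassified by the current weak learner h.
    """
    # Weighted error of h on the *target* data only.
    eps = np.sum(w_t * err_t) / np.sum(w_t)
    eps = min(eps, 0.499)                       # keep beta_t well defined
    beta_t = eps / (1.0 - eps)                  # AdaBoost-style factor (< 1)
    theta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_s) / (n_s + n_t)))
    new_w_s = w_s * theta ** err_s              # shrink misclassified source
    new_w_t = w_t * beta_t ** (-err_t)          # grow misclassified target
    return new_w_s, new_w_t
```

Iterating this update over rounds, each followed by retraining the weak learner on the reweighted union of both data sets, yields the ensemble described above.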
where G is the output image, S is the source image providing style and T is the target image offering content. Here L_content is defined as

L_{content}(G, T, l) = \frac{1}{2} \sum_{i,j} \left( G_{i,j}^l - T_{i,j}^l \right)^2, \qquad (2.11)

where l stands for the l-th layer of the deep learning model, i stands for the feature map of the i-th filter in the layer and j stands for the j-th element of the vectorized feature map. In addition, the style loss is formulated as

L_{style}(G, S) = \sum_{l}^{L} w_l E_l = \sum_{l}^{L} w_l \sum_{i,j} \left( \mathrm{Gram}(G)_{i,j}^l - \mathrm{Gram}(S)_{i,j}^l \right)^2, \qquad (2.12)

where Gram(·)_{i,j}^l, the style representation, is defined as the inner product between the vectorized feature maps i and j in layer l, that is, Gram(G)_{i,j}^l = \sum_k F_{ik}^l F_{jk}^l. Specifically, in the work by Gatys et al. (2016), a nineteen-layer Visual Geometry Group (VGG) network is used as the base model and all of its max-pooling layers are replaced by mean-pooling layers. First, the style and content features are extracted from the source and target images. Then, a random white-noise image G_0 is passed through the network and its style features G^l and content features F^l are computed. Gradients with respect to the pixel values can be computed using error back-propagation and are used to iteratively update the generated image G.
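The style representation and the two per-layer losses are straightforward to compute from a layer's feature map. A minimal numpy sketch (the feature maps here are arbitrary arrays standing in for network activations, not real VGG features):

```python
import numpy as np

def gram(F):
    """Gram matrix Gram_ij = sum_k F_ik F_jk for a feature map F of shape
    (num_filters, num_positions) -- the style representation of one layer."""
    return F @ F.T

def style_layer_loss(F_g, F_s):
    """E_l: squared difference between Gram matrices of generated and style
    feature maps at one layer (any normalization constant omitted)."""
    return np.sum((gram(F_g) - gram(F_s)) ** 2)

def content_loss(F_g, F_t):
    """L_content for one layer: half the squared feature difference (2.11)."""
    return 0.5 * np.sum((F_g - F_t) ** 2)
```

In the full method these scalars would be backpropagated to the pixels of G; here they only illustrate the quantities in (2.11) and (2.12).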
Although the task studied in Gatys et al.’s (2016) work is about style transfer for
images, the idea of generating new instances in the target domain by capturing
some important properties of the source domain can be applied to many other
transfer learning applications. We will review more generative models for transfer
learning later in Chapter 7.
3
Feature-Based Transfer Learning
3.1 Introduction
As discussed in the previous chapter, a common assumption behind instance-based approaches is that the source domain data and the target domain data have similar or the same support. However, this assumption may be too strong to be satisfied in many real world scenarios, where the source domain data and the target domain data have some nonoverlapping features. For example, consider sentiment classification on customers' reviews of different types of products. Here, each type of product can be regarded as a domain, where customers may use common as well as domain-specific words to express their opinions. For instance, the word "boring" may be used to express negative sentiment in the DVD domain, while it is never used to express opinions in the furniture domain. Therefore, some words or features are observed in some domain(s) but not in others. This means that some features are source (or target) domain specific and have no support in the other domain. In this case, reweighting or resampling instances cannot help much to reduce the discrepancy between domains. To address this issue, in this chapter, we introduce another approach to transfer learning known as feature-based transfer learning, which allows transfer learning to operate in an abstracted "feature space" instead of the raw input space. In this chapter, we focus on introducing homogeneous feature-based transfer learning methods. Recall that, in homogeneous transfer learning, we assume that X_s ∩ X_t ≠ ∅ and Y_s = Y_t. Note that, in an extreme case, there may be no overlapping features across the source domain and the target domain, but there may exist some translators between the two spaces to enable successful transfer learning. This is referred to as heterogeneous transfer learning, which will be reviewed in Chapter 6.
A common idea behind feature-based transfer learning approaches is to learn a pair of mapping functions {ϕ_s(·), ϕ_t(·)} that map data from the source domain and the target domain, respectively, to a common feature space in which the difference between domains can be reduced. After that, a target classifier is trained in the new feature space with the mapped source domain and target domain data. For testing on unseen target domain data, one first maps the data into the new feature space and then applies the trained target classifier to make predictions.
The detailed motivations and assumptions behind learning the pair of feature mappings differ across feature-based approaches. In this chapter, we summarize the main existing feature-based transfer learning approaches and classify them into three categories. The first category aims to learn transferable features across the given target and source domains by explicitly minimizing the domain differences (known as the domain discrepancy). Another category aims to learn universal features that are expected to be high-quality features across all domains. The third category is based on "feature augmentation" across domains, which seeks to extend the feature space by considering extra correlations learned from data.
where φ(x) maps each instance to the Hilbert space H associated with the kernel k(x_i, x_j) = φ(x_i)^T φ(x_j), and n_s and n_t are the sample sizes of the source and the target domains, respectively. By using the kernel trick, the MMD distance in (3.1) can be simplified as

\mathrm{MMD}^2 = \frac{1}{n_s^2} \sum_{i,j} k(x_i^s, x_j^s) - \frac{2}{n_s n_t} \sum_{i,j} k(x_i^s, x_j^t) + \frac{1}{n_t^2} \sum_{i,j} k(x_i^t, x_j^t).
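This kernel-trick form of the (biased) empirical MMD can be computed directly from the three kernel blocks. A small numpy sketch with an RBF kernel on 1-D toy samples (the data and bandwidth are hypothetical):

```python
import numpy as np

def mmd2(xs, xt, gamma=0.5):
    """Biased empirical squared MMD with an RBF kernel (kernel-trick form)."""
    k = lambda a, b: np.exp(-gamma * (a[:, None] - b[None, :]) ** 2)
    # mean of k(s,s) + mean of k(t,t) - 2 * mean of k(s,t)
    return k(xs, xs).mean() + k(xt, xt).mean() - 2.0 * k(xs, xt).mean()

rng = np.random.default_rng(4)
same = mmd2(rng.normal(0, 1, 200), rng.normal(0, 1, 200))  # matched domains
diff = mmd2(rng.normal(0, 1, 200), rng.normal(2, 1, 200))  # shifted domain
```

As expected, the estimate is near zero for two samples from the same distribution and clearly positive for a shifted one, which is what makes MMD usable as a training criterion for domain discrepancy.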
MMD Embedding
With the MMD distance, Pan et al. (2008b) propose a dimensionality reduction algorithm, known as MMD embedding (MMDE), for transfer learning, whose high-level idea can be formulated as

\min_{\varphi} \ \mathrm{MMD}\big(\varphi(X_s), \varphi(X_t)\big) + \lambda \, \Omega(\varphi) \quad \text{s.t. original data properties are preserved}, \qquad (3.3)

where ϕ is the mapping to be learned, which maps the original data to a low-dimensional space shared across domains. The first term in (3.3) aims to minimize the MMD distance between the distributions of the source and the target domain data, Ω(ϕ) is a regularization term on the mapping ϕ, and the constraints ensure that the original data properties are preserved.
Based on the definition of the MMD distance, (3.3) can be written as
where K is the kernel matrix induced by the kernel function k(xi , x j ) = ψ(xi )T ψ(x j ),
and ψ(·) is defined as ψ(x) = φ(ϕ(x)) or ψ = φ ◦ ϕ.
In general, the optimization problem (3.4) is computationally intractable, as the kernel function k(x_i, x_j) can be highly nonlinear in the mapping ϕ(·), which is unknown and to be learned. To make it computationally solvable, Pan et al. (2008b) propose to first transform the optimization problem (3.4) into a kernel matrix learning problem as follows,
the variance in the new feature space as colored maximum variance unfolding (MVU) (Weinberger et al., 2004) does. The first constraint preserves the pairwise distances and the second constraint guarantees that the embedded data are centered. After solving (3.5), principal component analysis (PCA) is applied to K to obtain the leading eigenvectors, which reconstruct the desired feature mapping for the source and the target domain data.

One disadvantage of MMDE is that it is a transductive learning method, which cannot generalize to out-of-sample data. Moreover, the optimization problem in (3.5) is a semidefinite programming (SDP) problem, which is computationally expensive to solve.
x_S ∈ X_S from the source domain and x_T ∈ X_T from the target domain are transformed by the first several layers of the CNN. The transformation by these first layers can be considered as an approximation of ψ(·) in MMDE.
Figure 3.1 Deep CNN for both classification loss as well as domain invariance: labeled and unlabeled data pass through shared layers and an fc_adapt layer toward a classification loss and a domain loss, where the dashed line means weight sharing (adapted from Tzeng et al. [2014]).
Figure 3.2 The architecture of the JAN (adapted from Long et al. [2017]) based on the AlexNet.
Given a mapping function U, its first derivative U′, and its inverse ξ = (U′)^{-1}, the Bregman divergence is defined as

D_W\big(\varphi(X_s) \,\|\, \varphi(X_t)\big) = \int d\Big( \xi\big(P_s^{\varphi(x)}\big), \xi\big(P_t^{\varphi(x)}\big) \Big) \, d\mu,

where

d\Big( \xi\big(P_s^{\varphi(x)}\big), \xi\big(P_t^{\varphi(x)}\big) \Big) = \Big( U\big(\xi(P_t^{\varphi(x)})\big) - U\big(\xi(P_s^{\varphi(x)})\big) \Big) - P_s^{\varphi(x)} \Big( \xi\big(P_t^{\varphi(x)}\big) - \xi\big(P_s^{\varphi(x)}\big) \Big),

with dμ being the Lebesgue measure for φ(x), and the probability densities of the source and the target domains in the projected space denoted by P_s^{φ(x)} and P_t^{φ(x)}, respectively.
P_i(h; \mu_i, \Sigma_i) = \frac{1}{2} (h - \mu_i)^T \Sigma_i^{-1} (h - \mu_i)
Figure 3.3 Low-level representations are specific to each modality (bottom elements), where modality-specific CNNs (or MLPs) feed a pool5 layer, and a high-level cross-modal representation is shared across all modalities (highlighted on the top) (adapted from Castrejon et al. [2016]).
where q = softmax(θ_D^T f(x; θ_repr)) and y_D denotes the domain from which the example is drawn. For a particular domain classifier θ_D, the loss that seeks to "maximally confuse" the two domains, computed as the cross entropy between the predicted domain labels and a uniform distribution over domain labels, is defined as

L_D(x_S, x_T, \theta_{repr}; \theta_D) = - \sum_{d} \frac{1}{D} \ln q_d. \qquad (3.9)
Finally, we can learn a model based on the new representation {âi } for the target
domain with the associated labels.
R_{sample}(Y, \mu) = \frac{1}{2NCHW} \sum_{n=1}^{N} \left\| T_{\{N\}\times\{H,W,C\}}(Y)_n - \mu_{z_n} \right\|^2. \qquad (3.12)
Figure 3.4 (a) Sample clustering; (b) spatial clustering (adapted from Liao et al. [2016]).
R_{spatial}(Y, \mu) = \frac{1}{2NCHW} \sum_{i=1}^{NHW} \left\| T_{\{N,H,W\}\times\{C\}}(Y)_i - \mu_{z_i} \right\|^2. \qquad (3.13)

Moreover, clustering can be performed on the channels by using the following loss:

R_{channel}(Y, \mu) = \frac{1}{2NCHW} \sum_{i=1}^{NC} \left\| T_{\{N,C\}\times\{H,W\}}(Y)_i - \mu_{z_i} \right\|^2. \qquad (3.14)
In Liao et al. (2016), the authors focused on investigating whether the representation for clustering is applicable to unseen categories, which is a zero-shot learning problem. Given features trained with the loss in (3.12), one can learn the output embedding E via a structured SVM without regularization as

\min_{E} \frac{1}{N} \sum_{n=1}^{N} \max_{y \in \mathcal{Y}} \max\left\{ 0, \ \Delta(y_n, y) + x_n^T E \left[ \phi(y) - \phi(y_n) \right] \right\}, \qquad (3.15)

where x_n and y_n are the feature and the class label of the n-th example, Δ is the 0-1 loss function, and φ is the class-attribute matrix provided by the Caltech-UCSD (University of California, San Diego) Birds data set, with each entry indicating how likely one attribute is present in a given class.
where 0 denotes a zero vector in the F -dimensional space. The first part of the
augmented feature represents the general feature, while the second and third parts
represent the source and target domain specific features, respectively.
It’s easy to generalize this method to a kernelized version. Assume that each
data point x is projected to a RKHS with the corresponding kernel k : X ×X → R.
k can be written as the dot product of two vectors k(x, x ) = 〈Φ(x), Φ(x )〉X . We can
define Φs and Φt in terms of Φ as
Denote the expanded kernel by k(x, x ). When x and x are from the same domain,
k(x, x ) = 〈Φ(x), Φ(x )〉X + 〈Φ(x), Φ(x )〉X = 2k(x, x ). When x and x are from differ-
w s · x i ≈ w t · x i ⇐⇒ 〈h c , h s , h t 〉 · 〈0, x i , −x i 〉 ≈ 0 (3.18)
In this way, one can construct a feature map for unlabeled data as
After that, any standard semi-supervised learning classifiers can be applied with
the feature maps defined for source domain labeled data and labeled and unla-
beled target domain data.
4
Model-Based Transfer Learning
4.1 Introduction
Figure 4.1 (a) Source model (the dashed line). (b) Target model (the solid line) learned only with limited target data (the crosses). (c) The target model (the solid line) transferred with the source model (the dashed line) as a prior.
target domain by exploiting knowledge from auxiliary task(s). For example, Evgeniou and Pontil (2004) propose a regularized multitask learning method based on support vector machines (SVMs), where the optimization objective is to minimize the loss over all tasks equally. Hence, the final learned model achieves the best overall performance by balancing all the tasks. However, this may not guarantee optimal performance on the desired target task. In the transfer learning setting, one focuses only on the performance of the target task. This difference can be eliminated by simply changing the weight assignment for different tasks in the objective function of multitask learning. In this chapter, related multitask learning algorithms will only be briefly mentioned, as they will be introduced in more detail in Chapter 9.
Based on specific assumptions in different model-based transfer learning
methods, we classify these algorithms into two categories: transferring knowl-
edge through shared model components (Section 4.2) and transferring knowledge
through regularization (Section 4.3). The first category, transferring knowledge
through shared model components, covers the transfer learning algorithms that
establish the target model by reusing some components in the source model or
reusing some hyperparameters of the source model (Li et al., 2006; Tommasi et al.,
2010; Jie et al., 2011). Moreover, there are methods that learn both the source
and target models simultaneously (Lawrence and Platt, 2004; Bonilla et al., 2007;
Schwaighofer et al., 2005).
The second category aims to transfer knowledge through regularization. Regu-
larization is a technique used to solve ill-posed machine learning problems and to
prevent model overfitting by restricting model flexibility. In model-based transfer
learning algorithms, the regularization is used to constrain parameters based on
some prior hypotheses. The SVM has been a commonly used base model in this
category because of its nice computational properties and good performance in
many applications. With the introduction of deep models, some approaches have
transferred model parameters in a pretrained deep learning model from auxiliary
tasks, where the parameters are used to initialize target domain models.
4.2 Transfer through Shared Model Components
where the conditional probability p(y_i|z_i) gives the relationship between the observations and the latent variables. (4.2) consists of two parts: the prior and the likelihood p(y_i|z_i).

Suppose that we have M related but different tasks, each of which is modeled as a GP on the corresponding training data {(X_m, y_m)}. The probability distribution of y = (y_1^T, …, y_M^T)^T is

p(\mathbf{y} \mid \mathbf{X}, \theta) = \prod_{m=1}^{M} p(\mathbf{y}_m \mid \mathbf{X}_m, \theta). \qquad (4.3)
and uses the informative vector machine (IVM) to find a sparse representation to reduce the computation and speed up the training.
Schwaighofer et al. (2005) combine the hierarchical Bayesian learning and GP
together for multitask learning. In this algorithm, the hierarchical Bayesian mod-
eling essentially learns the mean and covariance functions of the GP. The algo-
rithm takes two steps:
(1) learn a common collaborative kernel matrix from the data via a simple and
efficient expectation-maximization (EM) algorithm;
(2) generalize the covariance matrix by using a generalized Nyström method.
Consider Figure 4.2 as an example. The soft label l(bottle) is a K-dimensional vector, where each dimension indicates the similarity of bottles to each of the K categories. In this example, the soft label of a bottle will have a higher weight on the mug than on the keyboard, since bottles and mugs are more visually similar. Thus, training with these soft labels enforces the relationship that bottles and mugs should be closer in the feature space than bottles and keyboards.
Figure 4.2 An illustration of the soft labeling method: source activations per class are computed with a high-temperature softmax (adapted from Tzeng et al. [2014]).
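The soft labels above are produced by a softmax with a raised temperature, which spreads probability mass over visually related classes. A minimal sketch (the logit values are made up for illustration):

```python
import numpy as np

def soft_labels(logits, temperature=5.0):
    """Temperature-scaled softmax; higher temperature -> softer distribution."""
    z = logits / temperature
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical class scores for a bottle image: [bottle, mug, keyboard].
logits = np.array([8.0, 3.0, 0.5])
hard = soft_labels(logits, temperature=1.0)   # near one-hot
soft = soft_labels(logits, temperature=5.0)   # mug keeps visible mass
```

At temperature 1 almost all mass sits on "bottle"; at temperature 5 the "mug" entry retains a clearly larger share than "keyboard", which is the similarity structure the soft-label training objective exploits.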
s(x_i, y) = \bar{w} \cdot \bar{\phi}(x_i, y) = w^{(0)} \cdot \phi^{(0)}(x, y) + \sum_{z=1}^{F} w^{(y,z)} \cdot \phi^{(y,z)}\big( s_p(x, z), y \big),

where φ(·) is a feature-mapping function and w(·) is the parameter that separates the corresponding two classes by a hyperplane. The score of the new class y is calculated using the model trained on the target data as well as the prior knowledge from the source data.
where J is the original objective function and J˜ is the regularized objective func-
tion with the regularization term Ω(·) on parameter θ with a regularization
weight α.
Evgeniou and Pontil (2004) propose that the model parameter can be decom-
posed into two parts, a task-specific part and a task-invariant part. The target and
source model parameters can be modeled as
θ s = θ 0 + vs (4.6)
θ t = θ 0 + vt . (4.7)
Figure 4.3 Adapting the parameters θ to θ̃ to detect a new class "lions" using a regularizer Ω(·).
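The decomposition in (4.6) and (4.7) can be sketched for two toy regression tasks: a shared part θ₀ and penalized task-specific parts v_s, v_t are fit jointly by gradient descent. The data, true parameters, and λ below are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = np.array([1.0, -2.0])
# Two related tasks whose true parameters deviate slightly from a shared one.
X_s = rng.normal(0, 1, (100, 2)); y_s = X_s @ (theta_true + np.array([0.3, 0.0]))
X_t = rng.normal(0, 1, (10, 2));  y_t = X_t @ (theta_true + np.array([0.0, 0.2]))

theta0 = np.zeros(2); v_s = np.zeros(2); v_t = np.zeros(2)
lam = 1.0                          # penalizes task-specific deviations
for _ in range(2000):
    g_s = X_s.T @ (X_s @ (theta0 + v_s) - y_s) / len(y_s)
    g_t = X_t.T @ (X_t @ (theta0 + v_t) - y_t) / len(y_t)
    theta0 -= 0.05 * (g_s + g_t)           # shared part sees both tasks
    v_s    -= 0.05 * (g_s + lam * v_s)     # specific parts are regularized
    v_t    -= 0.05 * (g_t + lam * v_t)
```

The shared θ₀ lands between the two task optima, while the small penalized offsets v_s and v_t absorb the task-specific deviations; this is the mechanism the regularization-based methods in this section build on.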
(1) SVMs elegantly separate the data using a hyperplane and only a few data de-
termine the boundary, which makes the model transfer intuitively easy and
the computing cost relatively low.
(2) The objective function of SVMs is simple in that it is convenient to add con-
straints and regularizers.
We can show that (4.5) can be generalized to SVMs. A standard SVM has the following objective function:

\min_{\mathbf{w}} \ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i[\mathbf{w} \cdot \mathbf{x}_i + b] \ge 1 \ \forall i. \qquad (4.8)
Yang et al. (2007c) propose an adaptive SVM (A-SVM), which learns a new decision boundary that is close to the original decision boundary. In A-SVM, the target model is defined as f_t(x) = f_s(x) + Δf(x), where Δf(x) is a perturbation function that shifts the source decision boundary to fit the target data.
Similar to Yang et al. (2007c), Jiang et al. (2008) propose a cross-domain SVM
(CD-SVM) algorithm to transfer knowledge from an SVM trained from a source
task to a new task. The motivation behind CD-SVM is that, if a support vector
learned by a source SVM falls in the neighborhood of some target domain training
data, then it tends to have a distribution similar to the source domain, and thus
can be used to help train a new SVM for the target domain. Therefore, in CD-SVM, the target domain SVM can be optimized by adding neighborhood constraints on the support vectors learned in the source domain.
Aytar and Zisserman (2011) improve the method of Yang et al. (2007c) for the application of object category detection and propose a deformable adaptive SVM (DA-SVM). It utilizes trained image detectors of other categories as a regularization term so as to train a detector for a new category with as few training samples as possible. Duan et al. (2009) propose a domain transfer SVM
(DT-SVM) for video concept detection. DT-SVM tries to decrease the mismatch
across domain distributions, which are measured by MMD and, at the same time,
learn a decision function for the target domain. In video-concept detection appli-
cations, the change of a key frame is very frequent, which makes the feature rep-
resentations difficult to capture without a large amount of data. To address this
problem, DT-SVM proposes a unified framework to simultaneously learn an opti-
mal kernel function as well as a robust SVM classifier. Bruzzone and Marconcini
(2010) propose a domain adaptation SVM that exploits a semi-supervised method
to adapt the traditional SVM to a new domain while validating the adapted classifier
with noisy labels. Xu et al. (2014a) propose an adaptive structural SVM (A-SSVM)
to adapt the classifier parameters between domains. This method introduces a
data-dependent regularization term for source domain selection and integrates
different feature extraction methods. By doing this, A-SSVM is able to capture the
structural knowledge through feature space and trained parameters.
Tommasi et al. (2010) propose an SVM-based adaptation algorithm that exploits prior knowledge to imitate the human ability of recognizing objects even from a single view. This algorithm selects and adapts the weights of prior knowledge from different categories by assuming that the new categories are similar to some of the existing ones. This method modifies the objective function of conventional least-squares SVMs (LS-SVMs) by changing the regularization term, and the modified objective function is formulated as

\min_{\mathbf{w}_t, b} \ \frac{1}{2}\|\mathbf{w}_t - \beta \mathbf{w}_s\|^2 + \frac{C}{2} \sum_{i=1}^{l} \left[ y_i - \mathbf{w}_t \cdot \phi(\mathbf{x}_i) - b \right]^2,

where w_s and w_t are the parameters of the source and target models, respectively. The regularization term constrains the target model parameter to be close to the source parameter, with β, a scaling factor between 0 and 1, controlling the closeness.
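For a linear feature map, an objective of this form has a closed-form minimizer. A simplified sketch (the bias term is omitted for brevity, and the source parameters and toy target data are hypothetical):

```python
import numpy as np

def transfer_lssvm(X_t, y_t, w_s, beta=0.8, C=1.0):
    """Closed-form minimizer of
    0.5*||w - beta*w_s||^2 + (C/2) * sum_i (y_i - w . x_i)^2
    (linear features, no bias). Setting the gradient to zero gives
    (I + C X^T X) w = beta*w_s + C X^T y."""
    d = X_t.shape[1]
    A = np.eye(d) + C * X_t.T @ X_t
    b = beta * w_s + C * X_t.T @ y_t
    return np.linalg.solve(A, b)

rng = np.random.default_rng(6)
w_source = np.array([1.0, 1.0])               # hypothetical source model
X = rng.normal(0, 1, (5, 2))                  # very few target labels
y = X @ np.array([1.2, 0.8])                  # target concept, near the source
w = transfer_lssvm(X, y, w_source)
```

With C = 0 the solution collapses to the scaled source model βw_s, and increasing C pulls w toward fitting the target data; the optimality of w also guarantees its squared error on the target data never exceeds that of βw_s.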
where k(·, ·) is a kernel function and α_i is a coefficient. Hence the overall objective function is defined as

[k, f] = \arg\min_{k, f} \ \Omega\left( \mathrm{DIST}_k^2(D_s, D_t^l) \right) + \theta R(k, f, D_t^l). \qquad (4.10)

This objective function consists of two terms. The first term minimizes the distributional distance DIST(·, ·) between the two domains, where D_t^l is the set of labeled instances in the target domain. In the second term, the function R(·) represents the structural risk of the classifier f(·) and the kernel k(·) given the target data D_t^l. Here the kernel function k(·) is assumed to be a linear combination of the base kernels {k_j}, that is,

k = \sum_{j=1}^{M} d_j k_j, \qquad (4.11)
where M is the total number of source models. It is noteworthy that neither this method nor A-SVM (Yang et al., 2007c) utilizes the abundant unlabeled data in the target domain.
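A combined kernel of the form (4.11) is easy to materialize once the base kernels are chosen. A small sketch with two RBF base kernels of different widths (the mixing weights and bandwidths are hypothetical; in the methods above they would be learned):

```python
import numpy as np

def combined_kernel(X1, X2, weights, gammas):
    """k = sum_j d_j k_j with RBF base kernels of different widths."""
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return sum(d * np.exp(-g * sq) for d, g in zip(weights, gammas))

X = np.random.default_rng(7).normal(0, 1, (10, 3))
K = combined_kernel(X, X, weights=[0.7, 0.3], gammas=[0.1, 1.0])
```

Because a non-negative combination of positive semidefinite kernels is itself positive semidefinite, K remains a valid kernel matrix, which is what lets the optimization in (4.10) search over the weights d_j directly.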
Duan et al. (2012c) propose an adaptive MKL (A-MKL) method. This algorithm
learns a kernel function and a classifier by optimizing both structural risk and
distribution discrepancy between the source and target domain.
Besides exploring the input kernel, Guo and Wang (2013) propose a domain
adaptive input-output kernel learning (DA-IOKL) algorithm that simultaneously
learns both the input and the output kernels with a discriminative vector-valued
decision function by reducing the data mismatch based on the MMD distance and
minimizing the structural risk.
Figure 4.4 Overview of the experimental settings in CNN: two networks, baseA and baseB, are trained on their own inputs and labels (adapted from Yosinski et al. [2014]).
Figure 4.6 An LSTM model for natural language classification, with an embedding layer, LSTM hidden layers and a softmax output (adapted from Mou et al. [2016]). LSTM, long short-term memory network.
Figure 4.7 The DeViSE model (adapted from Frome et al. [2013])
Frome et al. (2013) learn semantic knowledge in a text domain and transfer the knowledge to a visual object recognition domain. First, a skip-gram
5
Relation-Based Transfer Learning

5.1 Introduction

mula in Figure 5.1, the student may infer that a director is a member of the movies his/her employed actors participate in, by substituting similar relations into the target movie domain.
Instead of transferring relations as in first-order relation-based transfer learning approaches, second-order relation-based transfer learning approaches can also be used. Second-order approaches assume that two related relational domains share similar relation-independent structural regularities, which can be extracted from the source domain. These regularities can then be transferred to the target domain. In fact, many abstract rules about relations stay valid across several different real world domains. For example, the distributional hypothesis, initially discovered in linguistics (Harris, 1954), finds that words with similar distributional characteristics tend to be semantically related. Recently, it was found that this distributional characteristic is also valid for social networks (Mitzlaff et al., 2014). Likewise, papers' citation structures also tend to be semantically similar in citation networks (Ganguly and Pudi, 2017). In Figure 5.1, a relation-independent structural pattern is represented as a second-order logic formula with predicate variables, which is learned from a source domain. This second-order relation can be instantiated with relations in the target domain to obtain new rules, which is a form of transfer learning.
There also exist works that consider transferring across different networks (Ye et al., 2013; Fang et al., 2013, 2015), where the structural knowledge of the networks is assumed to be transferable. Ye et al. (2013) propose constructing generalizable latent features through matrix factorization over both the source and the target networks, and then adopting an AdaBoost-style algorithm with instance weighting to train a target classifier. Fang et al. (2013) constructed a label propagation matrix to capture the influence of the structural information on the labels of nodes in a network. The goal was to discover common signature subgraphs between two networks to construct new structural features for the target
network. The relational knowledge contained in edges is then transferred across
relational domains by discovering common latent structural features shared by
the source and the target networks. For the co-extraction of sentiment and topic
lexicons across domains with no labeled data in the target domain, Li et al. (2012)
proposed a two-stage relation-based transfer learning framework by leveraging
transferable syntactic relations between topical and sentimental words. In the first
stage, a simple strategy is proposed to generate a few high-quality sentiment and
topic seeds for the target domain and in the second stage, a novel relational adap-
tive bootstrapping method is applied to expand the seeds by exploiting the rela-
tions between topic and opinion words.
There has not been much research about "when to transfer" for relational domains. The weighted pseudo-log-likelihood (WPLL) can be used as a metric to measure the "degree" to which a set of formulas is satisfied. Zhuo and Yang (2014) developed a score function based on WPLL to measure the similarity between the source and target domains, thereby capturing their transferability.
Relation-based transfer learning for MLNs falls into two categories: shallow transfer, where the source and the target domains share the same types of objects and relations, and deep transfer, where the types of objects and relations are different across domains (Davis and Domingos, 2009; Van Haaren et al., 2015).
These two categories respectively use the first-order and second-order relation-
based transfer learning techniques based on MLN.
In the first-order relation-based transfer learning approach, we aim to find an
explicit mapping of predicates across domains to generate new formulas for the
target domain. In the second-order approach, we extract the structural regulari-
ties from the source domain in the form of second-order logic and then transfer
them to the target domain.
\log P_w(X = x) = \sum_{r \in R} c_r \sum_{k=1}^{g_r} \ln P_w(X_{r,k} = x_{r,k} \mid MB_x(X_{r,k})),  (5.1)
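As a numerical illustration of (5.1), the WPLL score can be sketched given per-grounding conditional probabilities; here c_r = 1/g_r is one common weighting choice, and the probability tables are made up rather than produced by a real MLN:

```python
import math

def wpll(domain_probs):
    """Weighted pseudo-log-likelihood, as in (5.1).

    domain_probs maps each first-order predicate r to a list of the
    conditional probabilities P_w(X_{r,k} = x_{r,k} | MB_x(X_{r,k}))
    of its groundings. Each predicate is weighted by c_r = 1/g_r so
    that predicates with many groundings do not dominate the score.
    """
    score = 0.0
    for probs in domain_probs.values():
        c_r = 1.0 / len(probs)  # one common choice for c_r
        score += c_r * sum(math.log(p) for p in probs)
    return score

# Toy example: two predicates with hypothetical grounding probabilities.
probs = {"AdvisedBy": [0.9, 0.8], "Publish": [0.7]}
print(wpll(probs))
```

Higher (less negative) scores indicate that the current formulas fit the grounded data better, which is what the transferability measure above compares across domains.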
actor is mapped to the student, it cannot be mapped to other types. Two predi-
cates are compatible if they have the same number of arguments and the types of
arguments are compatible according to the current type constraints. With a new
compatible mapping that has no conflicts with other mappings, the mapping and
corresponding type-mapping constraints are updated.
After this construction, legal mappings are evaluated based on the WPLL score
of the MLN model that consists of only the translated clauses. The best local pred-
icate mapping is the one with the highest WPLL score. The process is iterated to
find local mappings for all source clauses. Table 5.1 illustrates the output of the
mapping algorithm. The mapped structure is then revised to fit the data in the
target domain based on various criteria:
clique are correlated. For cliques having at least one true grounding in the target domain, the legal instantiations can be directly picked, refined, or used as seeds for the search for formulas in the target domain.
Different from DTM, which applies an auxiliary tool, namely the second-order clique, to collect candidate reliable second-order formulas, the deep transfer learning algorithm proposed by Van Haaren et al. (2015) directly computes the posterior distributions of all second-order formulas given the data in the source domain and then uses those posterior distributions as prior distributions over second-order formulas in the target domain to train the MLN in the target domain.
Figure 5.2 The analogy between debugging computer viruses and diagnosing human diseases based on structural similarities. The dashed lines bridge analogs across domains
6.1 Introduction
As reviewed in previous chapters, the majority of work done in the area of trans-
fer learning focuses on cases where examples in a source domain and those in a
target domain share the same representation structure but follow different proba-
bility distributions. In this chapter, we introduce heterogeneous transfer learning,
which pushes the boundary further by allowing a source domain and a target do-
main to lie in incommensurable feature spaces or different label spaces.
Even though transfer learning is a powerful framework to apply in many sit-
uations, homogeneous transfer learning only focuses on generalization perfor-
mance across the same domain representations. As such, homogeneous transfer
learning is limited. As an example, consider the situation in Figure 6.1. In this ex-
ample, adding more annotated high-resolution photographs to a source domain offers little help in classifying the categories of sketch images, given that the categories of televisions and computer monitors are "visually" similar. This is partly
due to the limitation of homogeneous transfer learning. In this case, however,
heterogeneous transfer learning, which considers knowledge across domains with
different feature and label spaces, can bring out similar knowledge in the two do-
mains. Heterogeneous transfer learning enables different perspectives or aspects
of knowledge to be transferred from a source to a target domain. In the example in
Figure 6.1, text documents, lying in a completely different feature space from im-
ages, characterize televisions and computer monitors with more descriptive and
discriminative abilities. Therefore, they can provide additional knowledge to fur-
ther improve the classification of sketch images in the target domain. By borrow-
ing the knowledge learned from classifying high-resolution photographs in those
categories associated with televisions and computer monitors, say TV boxes and
keyboards, heterogeneous transfer learning can search for more hints about the
visual differences between televisions and monitors.
In addition, it may often be the case that a source domain with the same feature and label representation as a target domain is not easily accessible to users. Consider the task of human activity recognition using sensor data collected from cellphones. This task requires many sensor records to be annotated with activity
names as labels. However, the annotation of such sensor data is especially laborious and expensive (Wei et al., 2016a). In this case, finding a source domain with sufficient labeled sensor records is as difficult as, if not more difficult than, building a model in the target domain. In contrast, heterogeneous transfer learning offers greater flexibility for source domain selection by allowing a source domain to be chosen from
a different feature space or a different label space. For the activity recognition ex-
ample, heterogeneous transfer learning algorithms can transfer knowledge from
social media messages to sensor records, which may greatly help improve activity
recognition performance.
Last but not least, heterogeneous transfer learning comes closer to human intelligence, where knowledge can easily be transferred between different types of signals. This is evidenced by the multimodal sensory system of the brain. The multimodal sensory neural system of humans can integrate signals from different sensory modalities, such as visual, auditory, tactile and olfactory stimuli. When signals in some of the modalities are absent or insufficient, the system can leverage knowledge from other modalities to guarantee the effectiveness of perception (Recanzone, 2009). For example, we usually understand others' speech based on the auditory stimuli from sound bites. However, if someone is whispering in a voice too low for us to hear clearly, the multimodal sensory system is capable of transferring knowledge from visual stimuli, such as the shape of the mouth, to improve speech understanding.
The rest of this chapter is organized as follows. In Section 6.2, we first give a
formal definition regarding heterogeneous transfer learning. Section 6.3 details
existing solutions toward heterogeneous transfer learning, and discusses their
where c_{i,j} indicates the degree of semantic relatedness between the j-th example in a source domain, x_j^s, and the i-th example in a target domain, x_i^t.
Besides the annotated correspondence shown in Figure 6.2, the correspondence set C can also be built from labels. If the label of a source instance, say "horse," and the label of a target instance, say "pony," are semantically close, we can also infer that the source instance and the target instance are in correspondence. The correspondence is a prerequisite for heterogeneous transfer learning
algorithms to build either instance or feature mappings, as homogeneous transfer
learning algorithms do, to enable knowledge transfer.
6.3 Methodologies
Heterogeneous transfer learning can be categorized into two groups according
to the type of "heterogeneity" being referred to. The first branch addresses the problem of knowledge transfer under feature-space mismatch, that is, X_s ≠ X_t. The second branch enables knowledge transfer even if the label spaces of the two domains are different, that is, Y_s ≠ Y_t.
There have been two categories of alignment strategies according to the way
mappings are built: (1) latent space-based methods that learn a latent space spanned by multiple latent factors shared across domains and (2) translation-based
methods that directly translate from a source feature space to a target feature
space. Figure 6.4 presents the general ideas of these two kinds of methods and
differences between them. To be more specific, the latent space-based alignment
Figure 6.4 Overview of the two strategies for the alignment of domains in differ-
ent feature spaces
Table 6.1 The explored research works so far in heterogeneous transfer learning
across domains in different feature spaces.
(Techniques: latent factor analysis, DL, manifold alignment and deep learning, grouped as latent space based or translation based; approaches: single-level alignment and multi-level alignment.)
the feature representation of the target domain is enriched with these shared latent factors, which encode knowledge from one or multiple source domains and improve performance on various tasks.
Yang et al. (2009) first proposed and investigated heterogeneous transfer learning. This work leverages a large corpus of unlabeled text documents, the source domain, to help better cluster images in the target domain. The authors put forward
a probabilistic approach named annotation-based probabilistic latent semantic
analysis (aPLSA). The core of aPLSA lies in employing the image–text multiview
data, which are tagged images on Flickr in the empirical studies of this work. Both
images and their auxiliary tags are projected into a common semantic latent space
where latent factors dictating the distribution of low-level features of images are
finally output as clusters. Specifically, let Z = \{z_i\}_{i=1}^{d_c}, X_s, X_t and F denote the latent variable set, tags, image instances and low-level image features, respectively. The latent variables \{z_i\}_{i=1}^{d_c}, meanwhile, are regarded as the clusters that are finally desired. Mathematically, the goal of this model is to cluster target images, that is, to assign to each specific target image x_i^t the z_i \in Z with the highest probability in a probabilistic fashion:
To extract Z, aPLSA follows two chains, as Figure 6.5 shows. The first chain decides the low-level features F from images Xt by passing through Z:
P(f \mid x_i^t) = \sum_{z \in Z} P(f \mid z)\, P(z \mid x_i^t).  (6.2)
The other chain is inferred from auxiliary tags, and characterizes the correlation
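The chain in (6.2) is simply a matrix product of two conditional probability tables; a toy sketch with made-up distributions:

```python
import numpy as np

# Hypothetical conditional tables for (6.2):
# P(f|z): rows are low-level features f, columns are latent factors z.
P_f_given_z = np.array([[0.7, 0.1],
                        [0.2, 0.3],
                        [0.1, 0.6]])
# P(z|x): rows are latent factors z, columns are target images x.
P_z_given_x = np.array([[0.9, 0.2],
                        [0.1, 0.8]])

# P(f|x) = sum_z P(f|z) P(z|x), i.e. a matrix product.
P_f_given_x = P_f_given_z @ P_z_given_x

# Each column remains a valid distribution over features.
print(P_f_given_x.sum(axis=0))
```

Clustering a target image then amounts to reading off the column of P(z|x) with the largest entry.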
where the first loss term ensures that projections of source examples in the la-
tent space preserve the original structure as much as possible. The same applies
to the second loss, while the third term measures and lessens the difference between the two projections. Specifically, \ell(Z_s, X_s) = \|X_s - Z_s P_s\|_F^2, with P_s representing the projection matrix that maps X_s to Z_s. Similarly, \ell(Z_t, X_t) = \|X_t - Z_t P_t\|_F^2. As for \ell(Z_s, Z_t), a strong hidden assumption is made that the source and target domains are semantically similar, so that their projections should be semantically close. As a consequence, \ell(Z_s, Z_t) = \frac{1}{2}\big(\|X_s - Z_t P_s\|_F^2 + \|X_t - Z_s P_t\|_F^2\big).
Clearly, HeMap does not require any correspondence data between a source
and a target domain while aPLSA, as we mentioned earlier, does. The performance of HeMap relies heavily on the data themselves. Only when a source and a target
domain are semantically sufficiently close to each other is HeMap expected to
learn an effective latent space that encodes the shared semantic knowledge from
both domains.
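The HeMap loss terms above can be evaluated directly; a minimal sketch with random toy data and least-squares projections (the actual method optimizes the latent representations and projections jointly; sizes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
ns, nt, ds, dt, k = 8, 6, 5, 4, 3   # toy sizes; k is the latent dimension

Xs = rng.normal(size=(ns, ds))
Xt = rng.normal(size=(nt, dt))
Zs = rng.normal(size=(ns, k))       # latent images of source examples
Zt = rng.normal(size=(nt, k))       # latent images of target examples

def proj(Z, X):
    """Least-squares P with X ~ Z P, plus its loss ||X - Z P||_F^2."""
    P, *_ = np.linalg.lstsq(Z, X, rcond=None)
    return P, np.linalg.norm(X - Z @ P, "fro") ** 2

Ps, loss_s = proj(Zs, Xs)                      # l(Zs, Xs)
Pt, loss_t = proj(Zt, Xt)                      # l(Zt, Xt)
# l(Zs, Zt) swaps the projections, as in the text; it implicitly assumes
# equally sized domains, so only overlapping rows are compared here.
m = min(ns, nt)
loss_st = 0.5 * (np.linalg.norm(Xs[:m] - Zt[:m] @ Ps, "fro") ** 2
                 + np.linalg.norm(Xt[:m] - Zs[:m] @ Pt, "fro") ** 2)
total = loss_s + loss_t + loss_st
print(total)
```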
Singh and Gordon (2008) first propose a collective matrix factorization (CMF) to
extract shared factors (common interests of users) among multiple relations (mul-
tiple user-item relations) in the field of recommendation. Specifically, CMF simul-
taneously factorizes multiple matrices with correspondence in rows or columns
while enforcing the factorized latent factors to be the same.
Subsequently, CMF and its variants have been extensively investigated for trans-
fer learning (Gupta et al., 2010; Zhu et al., 2011; Wang et al., 2011; Long et al.,
2014). Zhu et al. (2011) provide an example approach for tackling heterogeneous transfer learning problems. In this work, the proposed approach, known as heterogeneous transfer learning for image classification (HTLIC), conducts knowledge
transfer from sufficient unlabeled text documents to images in a target domain.
To bridge the gap between domains, the images Xt are treated as the target do-
main and the text corpora Xs as the source domain. It is assumed that the cor-
respondence mapping between the two domains is not given. Thus, HTLIC must
make full use of certain auxiliary tagged images from an online source (Flickr)
A = \{(x_i^{at}, x_i^{as})\}_{i=1}^{l}, where x_i^{at} \in \mathbb{R}^{d_t} has the same representational structure as x_i^t, and x_i^{as} \in \mathbb{R}^{d_s} is the corresponding d_s-dimensional tag vector of the i-th image.
Their approach then constructs two matrices with the columns aligned, which are further jointly factorized following the operations of CMF. On one hand, the authors build a matrix, denoted as G, that characterizes the correlation between low-level image features and tags from A.
The other matrix captures the relationship between unlabeled documents and
tags in A. The matrix, denoted as F ∈ Rn s ×d t , can be inferred from Xs and Xas .
Clearly, the constructed matrices G and F are aligned in terms of the columns,
both lying in the feature space of the source domain. Subsequently, HTLIC applies
CMF to jointly factorize G and F and formulates the following objective function:
\min_{U, V, W} \lambda \|G - UV^T\|_F^2 + (1 - \lambda) \|F - WV^T\|_F^2 + R(U, V, W),  (6.7)
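A minimal sketch of the joint factorization in (6.7) via alternating least squares, taking R(U, V, W) to be a small ridge penalty on each factor (an assumption; the regularizer in the original work may differ):

```python
import numpy as np

rng = np.random.default_rng(1)
dF, dG, dV, k = 6, 5, 4, 2            # toy sizes; k latent factors
V0 = rng.normal(size=(dV, k))
G = rng.normal(size=(dG, k)) @ V0.T   # two matrices sharing latent factors
F = rng.normal(size=(dF, k)) @ V0.T

lam, ridge = 0.5, 1e-6

def solve(A, B, reg):
    """argmin_X ||A - X B^T||_F^2 + reg ||X||_F^2 (ridge least squares)."""
    return A @ B @ np.linalg.inv(B.T @ B + reg * np.eye(B.shape[1]))

V = rng.normal(size=(dV, k))
for _ in range(200):                  # alternating minimization of (6.7)
    U = solve(G, V, ridge / lam)
    W = solve(F, V, ridge / (1 - lam))
    # V sees both reconstruction terms, weighted by lambda and 1 - lambda.
    M = lam * (U.T @ U) + (1 - lam) * (W.T @ W) + ridge * np.eye(k)
    V = (lam * G.T @ U + (1 - lam) * F.T @ W) @ np.linalg.inv(M)

obj = (lam * np.linalg.norm(G - U @ V.T, "fro") ** 2
       + (1 - lam) * np.linalg.norm(F - W @ V.T, "fro") ** 2)
print(obj)
```

Because G and F are built from the same latent factors, the shared V recovers an accurate joint factorization and the objective approaches zero.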
Figure 6.6 The overview of HFA (adapted from Duan et al. [2012b]). Examples
from heterogeneous feature spaces are all transformed into the augmented fea-
ture space in the middle
\min_{P_s, P_t} \min_{w, b, \xi^s, \xi^t} \frac{1}{2}\|w\|^2 + C\Big(\sum_{j=1}^{n_s} \xi_j^s + \sum_{i=1}^{n_t} \xi_i^t\Big),

\min_{P_s, P_t} \min_{y^u, w, b, \xi^s, \xi^t, \xi^u} \frac{1}{2}\big(\|w\|^2 + b^2\big) - \rho + \frac{C}{2}\Big(\sum_{j=1}^{n_s} (\xi_j^s)^2 + \sum_{i=1}^{n_t} (\xi_i^t)^2\Big) + \frac{C_u}{2} \sum_{i=1}^{n_u} (\xi_i^u)^2,
Dictionary Learning
Olshausen and Field (1997) first introduced the idea of learning an over-complete dictionary from data, rather than using off-the-shelf bases, to sparsely code any signals in the data set. Learning robust dictionaries plays a key role in a wide range of applications of DL and sparse coding. In heterogeneous transfer learning, some
research works (Wang et al., 2012; Shekhar et al., 2013; Zhuang et al., 2013) learn
a dictionary for each domain and enable the semantic meanings of the dictionar-
ies to be coupled across domains. To couple the semantic meanings, this line of
methods requires correspondence across domains.
Before proceeding to detail these works, we first elaborate on the definition of coupled dictionaries. Suppose that dsj is the j-th dictionary atom in a source
domain’s representational structure, and dit is the i -th dictionary atom in a target
domain’s representation. If dsj represents a group of instances semantically relat-
ing to “sports” in the source domain and so does dit , we could say that dsj and
dit share a latent factor. Two dictionaries Ds and Dt are coupled, if and only if all
dictionary atoms correspondingly share latent factors.
Early works couple dictionaries by enforcing the sparse codes of a pair of in-
stances known to have correspondence across domains to be the same (Yang et al.,
2010; Zhu et al., 2014). Wang et al. (2012) point out that such a strong assumption
Figure 6.7 The overview of the semi-coupled DL method (adapted from Wang
et al. [2012])
would impair the flexibility of representation. Instead, they relaxed this assumption by first learning two sets of sparse codes for the two domains, respectively, and then bridging them with a stable transformation. Figure 6.7 provides an intuitive overview of the proposed method, called semi-coupled DL (SCDL). The formulation of the objective is:
where λs, λt and λW are trade-off parameters. The third term of (6.10) models the linear transformation between sparse codes across domains. The ℓ1 norms of Zs and Zt ensure the sparsity of the sparse codes. The constraints imposed guarantee that each dictionary atom is well normalized. To facilitate classification, Zhuang et al. (2013) impose a structured sparsity constraint on the sparse codes based on (6.10). The structured sparsity constraint, achieved by an ℓ1/ℓ2 norm, can produce more discriminative dictionaries, with each atom capturing the shared structures within the same class of each domain.
Furthermore, Jia et al. (2010) propose a model called factorized latent spaces
with structured sparsity, which not only constrains sparse codes with structured
sparsity, but also dictionaries. As a result, original examples in either domain can
be represented by only a subset of dictionary atoms. A limitation may be that simply enforcing the sparse codes to be inter-translatable or identical does not guarantee that dictionary atoms from different domains lie in a common latent space. To
address this limitation, Yu et al. (2014) propose formulating the coupled DL as a
co-clustering problem with cluster centers as dictionaries. Each cluster, consist-
ing of examples from heterogeneous domains, is regarded as a latent factor shared
by all heterogeneous domains.
Another approach is proposed by Shekhar et al. (2013), who put forward a model
called shared domain-adapted DL (SDDL), which projects both domains into a
common low-dimensional space and then learns a shared discriminative dictio-
nary in this latent space, which is illustrated in Figure 6.8. Different from previ-
ous models, SDDL does not require the existence of correspondence that is used
to couple different domains. Instead, SDDL aims to learn a single shared dictio-
nary that can optimally reconstruct both domains. In detail, the projection ma-
trices and the shared discriminative dictionary are learned jointly in the model,
which facilitates learning common internal structures from both domains accord-
ing to Shekhar et al. (2013). The reasons for the pre-projections are given as fol-
lows: (1) heterogeneous feature spaces across different domains should be com-
parable; (2) irrelevant and noisy information is disregarded after projection
and (3) the low-dimensional space is much more computationally efficient.
Figure 6.8 The overview of the proposed SDDL method (adapted from Shekhar
et al. [2013])
Clearly, the objective is composed of two parts: (1) C 1 that minimizes the rep-
resentation error in the low-dimensional projected space and (2) C 2 that is the
regularizer to preserve the variance in the original data as principal component
analysis does. The definitions of C 1 and C 2 are given in (6.12) and (6.13), respec-
tively,
C_1(D, P_s, P_t, Z_s, Z_t) = \|P_s X_s - D Z_s\|_F^2 + \|P_t X_t - D Z_t\|_F^2,  (6.12)
After mapping the original data with two projection matrices Ps ∈ Rdc ×d s and
Pt ∈ Rdc ×d t into the d c -dimensional latent space, SDDL learns a shared dictionary
with K atoms, that is, D ∈ Rdc ×K . Simultaneously, we infer the sparse representa-
tions Zs and Zt over the shared dictionary for the source and target domains, re-
spectively. During the testing phase, a testing target example is first projected into
the latent space with Pt, as Figure 6.8 shows. Then, its sparse representation over the shared dictionary D is inferred and passed on for classification or other tasks.
SDDL is effective only when the two domains do not differ substantially; otherwise, a shared dictionary in the low-dimensional space may be incapable of reconstructing both domains. If correspondence data across domains are accessible, coupled DL methods such as SCDL are more promising. Overall, this line of methods stands out especially in visual applications, given that sparse coding has proven highly effective for representing images.
Manifold Alignment
C = \frac{1}{2} \sum_{i=1}^{m} \mu_i \sum_{j=1}^{n_i} \sum_{j'=1}^{n_i} \big\| f_i^T x_j^i - f_i^T x_{j'}^i \big\|^2 W_i(j, j'),  (6.14)
where W_i(j, j') is the similarity between the j-th example x_j^i and the j'-th example x_{j'}^i in the i-th domain. The second type of topology is preserved via the labels of examples in the work by Wang and Mahadevan (2011); that is, examples across domains with the same label should be similar (minimizing (6.15)) and those with different labels should be separated (maximizing (6.16)),
A = \frac{1}{2} \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{j=1}^{n_a} \sum_{j'=1}^{n_b} \big\| f_a^T x_j^a - f_b^T x_{j'}^b \big\|^2 W_s^{a,b}(j, j'),  (6.15)

B = \frac{1}{2} \sum_{a=1}^{m} \sum_{b=1}^{m} \sum_{j=1}^{n_a} \sum_{j'=1}^{n_b} \big\| f_a^T x_j^a - f_b^T x_{j'}^b \big\|^2 W_d^{a,b}(j, j'),  (6.16)
where W_s^{a,b}(j, j') = 1 if x_j^a and x_{j'}^b carry the same label and W_s^{a,b}(j, j') = 0 otherwise; W_d^{a,b}(j, j') acts in the opposite way. Combining (6.14), (6.15) and (6.16), the final objective to minimize is O = (A + C)/B.
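The label-based similarity matrices in (6.15) and (6.16) are easy to construct; a toy sketch of the terms A and B for two domains, with hypothetical linear maps f_a and f_b:

```python
import numpy as np

# Two toy domains with labels; f_a, f_b map each domain to a shared 1-D space.
Xa = np.array([[0.0], [1.0], [2.0]]); ya = np.array([0, 1, 1])
Xb = np.array([[0.2, 0.1], [1.5, 0.3]]); yb = np.array([0, 1])
fa = np.array([1.0])        # hypothetical mapping for domain a
fb = np.array([0.8, 0.5])   # hypothetical mapping for domain b

# W_s(j, j') = 1 for same-label pairs, W_d(j, j') = 1 for different labels.
Ws = (ya[:, None] == yb[None, :]).astype(float)
Wd = 1.0 - Ws

pa, pb = Xa @ fa, Xb @ fb   # projected examples
D = (pa[:, None] - pb[None, :]) ** 2   # pairwise squared distances

A = 0.5 * (D * Ws).sum()    # same-label pairs should be close  (6.15)
B = 0.5 * (D * Wd).sum()    # different-label pairs should be far (6.16)
print(A, B)
```

A good pair of mappings makes A small and B large, which is exactly what minimizing (A + C)/B encourages.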
When there exist few or no labeled data in the target domain, the alignment
in Wang and Mahadevan’s (2011) work tends to be ineffective. In this case, if
where c, \theta_y, f^s, f^t, x_i^t and \theta_{x_i^t} denote the c-th class label, the classifier associated with the c-th class, feature representations of an example belonging to the c-th class in a source domain, feature representations of an example belonging to the c-th class in a target domain, the example represented by f^t in the target domain, and the classifier associated with that example, respectively. TLRisk classifies a target example x_i^t by directly evaluating the empirical risk R(x_i^t, c) and pinpointing the class c that minimizes this loss. According to Dai et al. (2008), R(x_i^t, c) \propto \Delta(\theta_{x_i^t}, \theta_y) \propto \mathrm{KL}\big(p(f^t \mid \theta_y) \,\|\, p(f^t \mid \theta_{x_i^t})\big). A source domain is translated and contributes to calculating p(f^t \mid \theta_y):

p(f^t \mid \theta_y) = \int_{\mathcal{X}_s} \sum_{c \in \mathcal{Y}} p(f^t \mid f^s)\, p(f^s \mid c)\, p(c \mid \theta_y)\, df^s + \sum_{c \in \mathcal{Y}} p(f^t \mid c)\, p(c \mid \theta_y).  (6.19)
p(f^t | f^s) in (6.19) is the translator built from the correspondence. Later, Chen et al. (2010b) followed this work and first applied heterogeneous transfer learning to visual contextual advertising, which recommends advertisements for images without surrounding text.
Pointing out that the robustness of TLRisk relies heavily on high-quality correspondence data, Kulis et al. (2011) propose a method, called asymmetric regularized cross-domain transformation (ARC-t), that leverages the labels of both domains to learn a translator. The method imposes similarity and dissimilarity constraints
– a pair of examples across domains carrying the same label should be as simi-
lar as possible after the translation, while a pair of examples with different labels
is expected to be dissimilar after the translation. The objective function can be
expressed as:
\min_T \Omega(T) + \lambda \sum_{i,j} c\big((x_j^s)^T T x_i^t\big),  (6.20)

where \Omega regularizes the complexity of the translator T. The function c(\cdot) is defined as c\big((x_j^s)^T T x_i^t\big) = \big(\max(0,\, l - (x_j^s)^T T x_i^t)\big)^2 if x_j^s and x_i^t are from the same category, and as c\big((x_j^s)^T T x_i^t\big) = \big(\max(0,\, (x_j^s)^T T x_i^t - u)\big)^2 if they carry different labels.
Translated source examples can, as a consequence, be used together with target examples for various tasks such as classification.
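The two branches of c(·) above are squared hinge penalties; a minimal sketch (l and u are the margin hyperparameters from the text; the default values here are arbitrary):

```python
def arc_t_loss(s, same_label, l=1.0, u=-1.0):
    """ARC-t penalty on the translated similarity s = (x_s)^T T x_t.

    Same-label pairs are pushed above the lower margin l; different-label
    pairs are pushed below the upper margin u.
    """
    if same_label:
        return max(0.0, l - s) ** 2   # want s >= l
    return max(0.0, s - u) ** 2       # want s <= u

# A same-label pair that is already similar enough incurs no loss...
print(arc_t_loss(1.5, same_label=True))   # -> 0.0
# ...while an insufficiently similar same-label pair is penalized quadratically.
print(arc_t_loss(0.5, same_label=True))   # -> 0.25
```

Summing these penalties over all cross-domain pairs, plus the regularizer Ω(T), reproduces the objective in (6.20).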
Hoffman et al. (2013) designed an end-to-end model called max-margin do-
main transforms (MMDT), which simultaneously learns a classifier and a transla-
tor using labeled examples from both domains. MMDT adopts a linear translation
matrix bridging domains, and incorporates it into an SVM-style max-margin clas-
sifier. The overall objective is formulated as
\min_{T, w, b} \frac{1}{2}\|T\|_F^2 + \frac{1}{2}\|w\|_2^2

s.t. \; y_j^s \begin{bmatrix} x_j^s \\ 1 \end{bmatrix}^T \begin{bmatrix} w \\ b \end{bmatrix} \ge 1 \quad \forall j \in D_s,

\quad\;\; y_i^t \begin{bmatrix} x_i^t \\ 1 \end{bmatrix}^T T^T \begin{bmatrix} w \\ b \end{bmatrix} \ge 1 \quad \forall i \in D_t,  (6.21)
Figure 6.9 An illustration of the label propagation process from text to images
(adapted from Qi et al. [2011a])
f^t(x_i^t) = \sum_{j=1}^{n_s} y_j^s\, T(x_j^s, x_i^t),  (6.22)
where T (·, ·) is the translator function. The authors define the translator function
as the inner product of a source example and a target example in a hypothetical
topic space, that is,
T (xsj , xit ) = 〈Ps xsj , Pt xit 〉 = (Ps xsj )T Pt xit = (xsj )T Sxit . (6.23)
Therefore, TTI actually combines the ideas of both latent space and translation. To learn the translator function, TTI exploits both labeled target examples and the correspondence between domains. Specifically, the optimization problem is formulated as
\min_S \sum_{i=1}^{n_t} \sum_{j=1}^{n_s} \gamma\big(y_i^t\, y_j^s\, (x_j^s)^T S x_i^t\big) + \lambda \sum_{i,j} \chi\big(c_{i,j} \cdot (x_j^s)^T S x_i^t\big) + \Omega(S),  (6.24)
where the first term minimizes the losses of predicted labels of target examples,
and the second term maximizes the consistency between the known correspon-
dence c i , j and the translator function value. Note that the function χ(a) is small if
a is large. The last term in (6.24) controls the complexity of S.
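Equations (6.22) and (6.23) amount to a bilinear scoring rule; a toy sketch, with random matrices standing in for the learned projections Ps and Pt:

```python
import numpy as np

rng = np.random.default_rng(2)
ds, dt, k = 6, 4, 3            # feature dims and topic-space dim (toy sizes)
Ps = rng.normal(size=(k, ds))  # stand-ins for the learned projections
Pt = rng.normal(size=(k, dt))
S = Ps.T @ Pt                  # translator: T(xs, xt) = xs^T S xt  (6.23)

Xs = rng.normal(size=(5, ds))  # source examples with labels in {-1, +1}
ys = np.array([1, -1, 1, 1, -1])
xt = rng.normal(size=dt)       # a test target example

# Label propagation from source examples to the target example, as in (6.22).
f_t = sum(y * (x @ S @ xt) for x, y in zip(Xs, ys))
label = 1 if f_t >= 0 else -1
print(label)
```

The factorization S = Ps^T Pt means that every translator value is an inner product in the shared topic space, which is what lets TTI be interpreted as both latent space based and translation based.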
Unlike the aforementioned works, which learn a translator from the data themselves,
Zhou et al. (2014b) borrow an idea from multitask learning and learn a translator based on the source and target predictive models, that is, w_s and w_t. Specifically, the problem is formulated as a non-negative least absolute shrinkage and selection operator (lasso) problem:
\min_T \frac{1}{n_c} \sum_{c=1}^{n_c} \big\| w_c^t - T w_c^s \big\|_2^2 + \lambda \sum_{i=1}^{d_t} \|t_i\|_1 \quad \text{s.t.}\; t_i \ge 0,  (6.25)
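Problem (6.25) can be approximated with proximal (projected) gradient descent: a gradient step on the least-squares term, soft-thresholding for the ℓ1 penalty, and projection onto the non-negative orthant. A sketch on toy model weights (step size and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
nc, ds, dt = 4, 5, 5
T_true = np.abs(rng.normal(size=(dt, ds)))   # a non-negative ground truth
Ws = rng.normal(size=(ds, nc))               # columns are source models w_c^s
Wt = T_true @ Ws                             # target models w_c^t = T w_c^s

lam, lr = 1e-3, 0.01
T = np.zeros((dt, ds))
for _ in range(2000):
    grad = 2.0 / nc * (T @ Ws - Wt) @ Ws.T   # gradient of the LS term
    T = T - lr * grad
    T = np.maximum(T - lr * lam, 0.0)        # soft-threshold, then T >= 0

err = np.linalg.norm(T @ Ws - Wt, "fro")
print(err)
```

The non-negativity constraint keeps the learned translator interpretable as an additive combination of source model weights.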
Multi-level Alignment
Over the years, researchers have realized that the single-level alignment approach rests on a strong assumption: that a single level of operations is sufficient to align heterogeneous domains. In fact, the complex interrelationship between different domains could be characterized by a hierarchy, which calls for "deep alignment." Ideally, all the techniques mentioned earlier, including latent space and translation-based techniques, could achieve multi-level alignment by being applied repeatedly in sequence. However, not all of them have been investigated for multi-level alignment so far. Here, we introduce the DL-based and deep learning-based methods that have been explored.
Dictionary Learning
As a deep alignment version of the SDDL model (Shekhar et al., 2013), Nguyen et al. (2015) present the so-called domain adaptation using a sparse and hierarchical network (DASH-N). Like SDDL, DASH-N projects both domains into a common latent space in which a shared discriminative dictionary is learned, but it differs by projecting multiple times and learning multiple shared dictionaries in a hierarchical network. Figure 6.10 shows the carefully designed architecture of the hierarchical network. The authors tailor
Figure 6.10 Overview of the DASH-N model (adapted from Nguyen et al. [2015])
DASH-N for visual applications. First of all, DASH-N performs dimension reduc-
tion and contrast normalization for input images from heterogeneous domains
using corresponding projection matrices, that is, P1s and P1t . Second, DASH-N ob-
tains the sparse codes by applying a shared dictionary D1 in the low-dimensional
space. Third, DASH-N performs max pooling. These three steps, corresponding to procedures (a)-(c) in Figure 6.10, are repeated in the next layers. The number of repetitions equals the number of alignment levels. All the projections and shared dictionaries are learned jointly, with the final classification performance as the supervision. The multiple levels of alignment, together with this end-to-end learning scheme, ensure that the source domain can greatly benefit the target domain of interest.
Deep Learning
Deep neural networks have achieved tremendous success and state-of-the-art performance in computer vision as well as other machine learning tasks (Bengio, 2009). This success is partly attributable to the capability of deep neural networks to learn extremely powerful hierarchical nonlinear representations of inputs. Motivated by recent advances in deep learning, several heterogeneous deep learning approaches (Zhou et al., 2014a; Shu et al., 2015; Wang et al., 2018a) have been proposed.
Zhou et al. (2014a) propose a hybrid heterogeneous transfer learning (HHTL)
algorithm, which alternates between learning robust representations and learning translators in a layer-wise fashion. Inspired by the effectiveness of the marginalized stacked denoising autoencoder (mSDA) (Chen et al., 2012b) in homogeneous transfer learning, the authors adopt mSDA to learn high-level feature representations.
Figure 6.11 Overview of the weakly shared deep transfer network (adapted from
Shu et al. [2015])
\min_{W_*} \sum_{m=1}^{M} \big\| X_* - W_* \tilde{X}_*^m \big\|_F^2,  (6.26)
where λ balances between the alignment and the complexity of the translator T1. Note that H1s(c) and H1t(c) represent subsets of H1s and H1t that have the correspondence that is indispensable for aligning features across domains.
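For a single layer, (6.26) is ordinary least squares over the M corrupted copies and has a closed-form solution. A sketch with explicit masking noise; the actual mSDA marginalizes out the corruption analytically instead of sampling:

```python
import numpy as np

rng = np.random.default_rng(4)
d, n, M, p = 5, 40, 10, 0.3    # features, examples, corrupted copies, drop prob

X = rng.normal(size=(d, n))    # columns are examples

# Accumulate the normal equations of min_W sum_m ||X - W X_m~||_F^2.
P = np.zeros((d, d))
Q = np.zeros((d, d))
for _ in range(M):
    Xm = X * (rng.random(size=(d, n)) > p)   # masking-noise corruption
    P += X @ Xm.T
    Q += Xm @ Xm.T
W = P @ np.linalg.inv(Q + 1e-8 * np.eye(d))  # closed-form reconstruction weights

H = np.tanh(W @ X)             # nonlinear high-level representation
print(H.shape)
```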
This process can be recursively carried out by replacing X_* in (6.26) with H_1^* and, in the next layer, replacing H_1^{*(c)} in (6.27) with H_2^{*(c)}. As a consequence, a series of weight matrices {W_l^*}_{l=1}^L, high-level representations {H_l^*}_{l=1}^L and translators {T_l}_{l=1}^L can be obtained. A classifier f is then trained on the augmented source domain data H^s = [H_1^s, · · · , H_L^s]. A testing target example x^t is first augmented as h^t = [T_1 h_1^t, · · · , T_L h_L^t], and its label is predicted by applying f, that is, f(h^t).
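The layer-wise augmentation scheme above can be sketched in a few lines of numpy. Everything here is a hypothetical stand-in: toy data, a single least-squares "denoising" layer instead of the full mSDA, and the first twenty rows playing the role of the corresponded cross-domain pairs used to fit the translators:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 40 source examples (5-d), 30 target examples (3-d).
Xs, Xt = rng.normal(size=(40, 5)), rng.normal(size=(30, 3))

def msda_layer(X, noise=0.3):
    """One denoising layer: reconstruct X from a randomly corrupted copy."""
    X_corrupt = X * (rng.random(X.shape) > noise)      # mask-out corruption
    W, *_ = np.linalg.lstsq(X_corrupt, X, rcond=None)  # least-squares reconstruction
    return np.tanh(X @ W)                              # nonlinear hidden representation

def learn_translator(Hs_c, Ht_c):
    """Translator T_l mapping target representations into the source space,
    fit on the corresponded subsets."""
    T, *_ = np.linalg.lstsq(Ht_c, Hs_c, rcond=None)
    return T

L = 2
Hs, Ht, feats = Xs, Xt, []
for l in range(L):
    Hs, Ht = msda_layer(Hs), msda_layer(Ht)
    T = learn_translator(Hs[:20], Ht[:20])   # first 20 rows stand in for corresponded pairs
    feats.append((Hs, Ht @ T))               # target features translated into source space

# Augmented source features [H^s_1, ..., H^s_L]; target analogue via translators.
Hs_aug = np.hstack([f[0] for f in feats])
Ht_aug = np.hstack([f[1] for f in feats])
print(Hs_aug.shape, Ht_aug.shape)  # (40, 10) (30, 10)
```

A classifier trained on `Hs_aug` could then be applied to `Ht_aug`, mirroring the augmented-feature prediction step described above.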
Unfortunately, the alternating, layer-wise training method in Zhou et al.'s (2014a) work is highly inefficient. Shu et al. (2015) address this problem by designing a deep neural network architecture named the weakly shared deep transfer network (WSDTN), as shown in Figure 6.11. WSDTN learns
Figure 6.12 Overview of the deep asymmetric transfer network (DATN), comprising distribution matching, asymmetric mapping, representation matching and classifier adaptation
where l_min denotes the index of the lowest layer from which the weakly shared constraints are imposed.
Both HHTL and WSDTN, however, are not end-to-end, because they consist of two separate stages, that is, learning invariant representations across domains and training a classifier. Wang et al. (2018a) propose an end-to-end deep asymmetric transfer network (DATN) for unbalanced domain adaptation, as shown in Figure 6.12.
DATN adopts a classical Siamese structure similar to WSDTN, but differs in the alignment of domains in the top layers. The alignment is twofold: (1) a translator is learned to bridge the hidden representations of the two domains, and (2) the distributions of the hidden representations across domains are encouraged to be as close as possible.
where the L-th layer is the topmost layer. The distribution discrepancy, which is
minimized to guarantee the second type of alignment, is measured using MMD
(Gretton et al., 2012) as
\mathcal{L}_{dist} = \left\| \frac{1}{n_s} \sum_{j=1}^{n_s} h_j^{s,L} - \frac{1}{n_t} \sum_{i=1}^{n_t} h_i^{t,L} \right\|_2^2. (6.29)
where 1{·} is an indicator function, h_i^{t,L} denotes the hidden representation of the i-th labeled target example at the L-th layer, and w_{s,c} represents the softmax parameters trained on source examples for the c-th class. The overall objective is a linear combination of (6.28), (6.29) and (6.30).
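The distribution-matching term of (6.29) is simply the squared Euclidean distance between the mean source and target representations, that is, an MMD with a linear kernel. A minimal numpy sketch, with random vectors standing in for the learned hidden representations:

```python
import numpy as np

def mmd_linear(Hs, Ht):
    """Squared distance between domain means of top-layer representations,
    i.e., an MMD estimate with a linear kernel (cf. (6.29))."""
    return float(np.sum((Hs.mean(axis=0) - Ht.mean(axis=0)) ** 2))

rng = np.random.default_rng(1)
Hs = rng.normal(loc=0.0, size=(100, 8))       # source representations h^{s,L}
Ht_far = rng.normal(loc=2.0, size=(80, 8))    # shifted target distribution
Ht_near = rng.normal(loc=0.0, size=(80, 8))   # matching target distribution

# Representations drawn from a shifted distribution give a larger discrepancy.
print(mmd_linear(Hs, Ht_far) > mmd_linear(Hs, Ht_near))  # True
```

Minimizing such a term during training pushes the two domains' hidden representations toward a common distribution.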
Multi-level alignment is more powerful and flexible than single-level alignment. However, if the ground-truth alignment between domains is simple enough to be captured at a single level, single-level alignment is preferable because of its lower computational cost and wider applicability.
reused by the target domain. Shi et al. (2013a) first propose a probabilistic translation method to align the label spaces without requiring an explicit semantic relationship between labels. The key is a decision rule:
p(y^t|x^t) = \sum_{y^s} p(y^s|x^t)\, p(y^t|y^s), (6.31)
where the posterior probability p(y^s|x^t) can be obtained by applying the classifier pre-trained on the source domain to target examples. The estimation of p(y^t|y^s) follows
p(y^t|y^s) = \frac{1}{p(y^s)}\, p(y^t, y^s) = \frac{1}{\sum_{x^s} p(x^s)} \sum_{x^s} p(y^t|x^s)\, p(x^s), (6.32)
where x^s denotes a source example with y^s as its label. p(x^s) can be estimated as the proportion of x^s in all source examples. p(y^t|x^s), similar to p(y^s|x^t), can be obtained by applying the classifier trained on the target domain to source examples.
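Once p(y^t|y^s) has been estimated via (6.32), the decision rule (6.31) reduces to a vector–matrix product. A small numeric sketch with hypothetical posteriors (four source classes, three target classes):

```python
import numpy as np

# Hypothetical posteriors p(y^s | x^t) for one target example over 4 source classes.
p_ys_given_xt = np.array([0.6, 0.2, 0.1, 0.1])

# Hypothetical p(y^t | y^s): rows are source labels, columns the 3 target labels,
# as would be estimated from (6.32) by averaging target-classifier posteriors
# over the source examples carrying each source label.
p_yt_given_ys = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
    [0.1, 0.1, 0.8],
])

# Decision rule (6.31): p(y^t|x^t) = sum_{y^s} p(y^s|x^t) p(y^t|y^s).
p_yt_given_xt = p_ys_given_xt @ p_yt_given_ys
print(p_yt_given_xt)  # [0.56 0.24 0.2 ]
```

The result is a proper distribution over the target labels as long as each row of p(y^t|y^s) and the source posterior each sum to one.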
Figure 6.13 The alignment of label spaces for regression problems (adapted from Shi et al. [2010a]): (a) grouping and initializing target outputs; (b) modification of outputs toward group centers
The aforementioned works are for classification problems, where the label space is categorical and discrete. Shi et al. (2010a) first propose a heterogeneous regression model to unify two label spaces for regression problems. The basic idea, as shown in Figure 6.13, is to assign source examples regression values in the label space of the target domain while preserving the similarity between examples.
Qi et al. (2011b) claim that the aforementioned label alignment is disadvantageous, considering that the relationship between labels may vary across examples. Although the label "mountain" seems irrelevant to the label "castle," a source image labeled as "mountain" and a target image that is labeled as "castle" but contains a castle built on a mountain are obviously correlated. Therefore, the authors
propose aligning labels by building a feature-level translator similar to Qi et al.
6.4 Applications
Heterogeneous transfer learning techniques have been applied successfully in
many real world applications. In applications to images, most works focus on im-
proving the clustering or classification performance of images with the help of
text documents (Yang et al., 2009; Qi et al., 2011a; Zhu et al., 2011). In addition,
heterogeneous transfer learning is referred to as “heterogeneous domain adap-
tation” in the computer vision community (Duan et al., 2012b; Hoffman et al.,
2013; Wu et al., 2013; Li et al., 2014). In this area, the goal is to enable knowledge
transfer between images or videos in different feature representational structures.
For example, Wu et al. (2013) address video activity recognition in the target domain by transferring knowledge from a closely related source domain. However, the source domain is best represented by optical flow features, which differ from the silhouette features in the target domain. Yet another widely studied
application is cross-language transfer learning (Dai et al., 2008; Ling et al., 2008;
Zhang et al., 2010a; Huang et al., 2013; Gouws et al., 2015). For example, Ling et al.
(2008) leverage labeled English web pages to help the classification of Chinese web
pages. Li et al. (2014) propose a HFA method to solve the cross-lingual sentiment
classification problem.
Human activity recognition enables a wide spectrum of machine learning ap-
plications. The success of human activity recognition relies on sufficient anno-
tated sensor records, while annotating raw sensor readings either in real time
or post hoc is particularly challenging. Fortunately, people nowadays proactively
share happenings about and around them, as well as their whereabouts on social
media platforms such as Twitter. Such platforms thus provide a huge and rich se-
mantic repository of activities that people are performing at different times and
locations. Wei et al. (2016a) first propose transferring knowledge from social media messages, often represented as bags of words, to physical sensor records characterized by numerical values.
Several researchers have applied heterogeneous transfer learning to the recom-
mendation problem. In the work by Li et al. (2009a), a method is put forward to transfer rating knowledge from a source domain (movie recommendation) to a target domain (book recommendation), where a shared common subspace, learned via a codebook-based method, links the source and target domains. This method works even when the two domains have no overlapping items or products.
In Shi et al.'s (2012) work, the target task is to predict movie ratings in the Internet Movie Database (IMDb). Five different data sets are available to form the source domains: a genre database, a sound-technique database, information about running times, an actor graph in which two movies are connected if they share common actors or actresses, and a director graph defined similarly. Shi
et al. (2012) build a gradient boosting consensus model, which integrates all the
five data sets in different feature spaces, to accurately predict ratings in the “out-
of-sample” condition.
Several public data sets exist on which heterogeneous transfer learning tech-
niques can be fairly compared.
Office1 : The data set is a standard domain adaptation data set in computer vi-
sion. This data set contains 4,106 images in thirty-one categories collected from
three sources: amazon (object images in Amazon), dslr (high-resolution images
taken from a digital SLR camera) and webcam (low-resolution images taken from a
web camera). amazon and webcam, as the source domains, are represented by 800-dimensional speeded-up robust features (SURF) (Bay et al., 2008), while dslr is represented by 600-dimensional SURF features.
IXMAS2 : This data set consists of five views of actions, each of which is taken from a different camera. The actions cover eleven classes, and each action is executed three times by twelve subjects. Each view is represented in both optical flows and silhouettes. Each time, one of the views acts as the target domain, and the other views together act as the source domain. The target domain adopts the silhouette representation while the source uses optical flows.
Cross-lingual sentiment (CLS)3 : This data set contains 800,000 product reviews in four languages: English, German, French and Japanese. The reviews cover three categories: books, DVDs and music. For each category and each language, the data set is officially split into a training set, a test set and an unlabeled set. The training and test sets each include 2,000 reviews, and the sizes of the unlabeled sets vary from 9,000 to 170,000. English is regarded as the source domain and each of the other three languages acts as the target domain in turn.
Data sets providing cross-domain correspondences usually come from Flickr. Therefore, here we introduce a few heterogeneous data sets with correspondences that can facilitate training heterogeneous transfer learning algorithms.
FLICKR30K4 : Flickr30K contains 31,783 images, each of which is annotated with
five descriptive sentences by workers on Amazon Mechanical Turk. Overall, there
are 158,915 crowd-sourced captions.
MIRFLICKR5 : The data set has two versions, MIRFLICKR-25000 and MIRFLICKR-1M, which consist of 25,000 and one million tagged images, respectively. In addition, the MIRFLICKR-25000 data set is fully labeled with thirty-nine tags.
1 https://people.eecs.berkeley.edu/~jhoffman/domainadapt/
2 http://4drepository.inrialpes.fr/public/viewgroup/6
3 www.uni-weimar.de/en/media/chairs/webis/corpora/corpus-webis-cls-10/
4 https://illinois.edu/fb/sec/229675
5 http://press.liacs.nl/mirflickr/
NUS-WIDE6 : The data set includes 269,648 images associated with tags from
Flickr. Six types of low-level features are extracted, including 64-dimensional color
histogram, 144-dimensional color correlogram, 73-dimensional edge direction
histogram, 128-dimensional wavelet texture, 225-dimensional block-wise color moments and 500-dimensional bag of words based on scale-invariant feature transform (SIFT) descriptors. Moreover, all the images are labeled with eighty-one concepts for the sake of evaluation.
In Table 6.3, we show results from a few published papers on heterogeneous
transfer learning. The table presents the performance comparison of different
heterogeneous transfer learning algorithms and non-transfer methods. Qi et al. (2011a) crawl tagged images from Flickr and text documents from Wikipedia using the names of ten categories as keywords. For each category, the authors build a category/non-category binary classification task. We present the comparison result on the bird/non-bird task as an example in the first line of the table. According to Qi et al. (2011a), the translation-based method TTI outperforms HTLIC (Zhu et al., 2011), a latent space-based method, and TLRisk (Dai et al., 2008), another translation-based method. Note that SVM-t (an SVM trained only on the target domain) does not transfer any knowledge.
In addition, we also show comparison results on domain adaptation data sets
in computer vision reported by Hoffman et al. (2013). Note that T-SVM denotes
the transductive SVMs (Joachims, 1999), which does not transfer knowledge from
the source domain, but takes full advantage of unlabeled examples in the target
domain. Finally, we present results on the cross-view activity recognition data set
and cross-lingual sentiment classification data set. Generally speaking, compared
to the non-transfer methods SVM-t and T-SVM, heterogeneous transfer learning
does contribute to target domains by borrowing knowledge from source domains.
6 http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
7
Adversarial Transfer Learning
7.1 Introduction
variants have also been developed for the GAN framework. In the next section, we
give details on the operations of GANs.
Adversarial learning works naturally with transfer learning. As a generative
model, GANs can generate and augment the target domain data, in a new type of
transfer learning known as “data augmentation.” This can be achieved by “trans-
lating” source domain samples to a target domain while retaining their label infor-
mation at the same time. The learning-based data augmentation approach differs
from traditional instance-based transfer learning models as it “creates” additional
target domain data. In contrast, the traditional models such as TrAdaBoost and
kernel mean matching learn the weights for the labeled source domain samples
only. Adversarial learning can also be used to learn a shared latent feature space
across domains by minimizing the task loss in the source domain and maximiz-
ing the domain confusion loss. Instead of learning domain invariant features as
reviewed in Chapter 3, adversarial features for transfer learning are learned by
solving a min-max game.
In this chapter, we first introduce GANs and then present adversarial transfer
learning models.
Figure 7.1 The GAN framework. Two sub-networks, a generator and a discriminator, compete against each other. The generator maps a vector sampled from a prior distribution to the data space. The discriminator tries to distinguish true data samples from the generated samples, while the generator aims to fool the discriminator
In a GAN, the relationship between the discriminator and the generator is like that between the police and a thief: the police try to discern the thief from ordinary people, and the thief aims to fool the police, which forms an adversarial objective. The interaction between the generator and the discriminator can be formulated as a two-player min-max game:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))], (7.1)
where p_{data} denotes the true data distribution and p_z denotes the prior noise distribution.
Both the generator and the discriminator are multi-layer perceptrons. The model can be trained with gradient descent algorithms in an alternating manner, as outlined in Algorithm 7.1. In each iteration of the optimization process, the discriminator is updated first with the generator fixed, and then the generator is optimized with the discriminator fixed. This process repeats until the model converges.
Theoretical analysis shows that, given infinite model capacity and training time, there is a global optimum such that p_G = p_{data}. If the generator is fixed and the discriminator is trained to optimality, the generator's objective can be rewritten as
C(G) = -\log 4 + 2 \cdot \mathrm{JSD}(p_{data} \,\|\, p_G), (7.2)
where \mathrm{JSD}(\cdot) denotes the Jensen–Shannon divergence. (7.2) shows that the objective for the generator is to minimize the Jensen–Shannon divergence between the generated distribution p_G and the true data distribution p_{data}, and that the global optimum is achieved when p_G = p_{data}. If both the generator and the discriminator have enough capacity, p_G will converge to p_{data} as expected.
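This fixed-point property can be checked numerically for discrete distributions: plugging the optimal discriminator D*(x) = p_data(x)/(p_data(x) + p_G(x)) into the value function recovers −log 4 plus twice the Jensen–Shannon divergence. A small numpy sketch with hypothetical three-outcome distributions:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Discrete distributions standing in for p_data and p_G (hypothetical values).
p_data = np.array([0.5, 0.3, 0.2])
p_g = np.array([0.2, 0.3, 0.5])

# Optimal discriminator for a fixed generator: D*(x) = p_data / (p_data + p_G).
d_star = p_data / (p_data + p_g)

# Value function at D*: E_data[log D*] + E_G[log(1 - D*)].
v = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Matches C(G) = -log 4 + 2 * JSD(p_data || p_G).
print(np.isclose(v, -np.log(4.0) + 2.0 * jsd(p_data, p_g)))  # True
```

When p_G equals p_data, the JSD term vanishes and the value collapses to −log 4, the global optimum.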
In practice, optimizing the objective as defined in (7.1) might cause gradient
vanishing; that is, the gradient value used to update the network parameters dur-
ing a learning process approaches zero when iterating through too many layers,
which makes the learning stop. This is because, in the early stage of training,
the generated samples are poor and the discriminator can easily distinguish the
generated samples from the true data samples, and, as a result, the gradient of
log(1 − D(G(z))) vanishes. To provide sufficient gradient values, the generator is
trained to maximize log(D(G(z))) instead, which is referred to as non-saturating
GAN (NS-GAN).
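The vanishing-gradient argument can be made concrete in logit space. Writing D(G(z)) = σ(o) for a logit o (our notation, not the book's), the saturating loss log(1 − σ(o)) has gradient −σ(o) with respect to o, while the non-saturating loss log σ(o) has gradient 1 − σ(o). A quick numeric check with illustrative logit values:

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Logits o of D(G(z)); early in training D confidently rejects fakes, so o << 0.
o = np.array([-8.0, -4.0, 0.0])
d = sigmoid(o)

# Gradient w.r.t. the logit of the saturating loss log(1 - sigmoid(o)): -sigmoid(o).
grad_saturating = -d
# Gradient of the non-saturating loss log(sigmoid(o)): 1 - sigmoid(o).
grad_non_saturating = 1.0 - d

print(grad_saturating)      # roughly [-0.0003, -0.018, -0.5]: vanishes for o << 0
print(grad_non_saturating)  # roughly [ 1.0,     0.98,   0.5]: stays informative
```

The saturating gradient is tiny exactly where the generator most needs a learning signal, which is why the log D(G(z)) objective is used in practice.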
In spite of the strong learning capacity, GANs are notoriously difficult to train.
Common problems include the following.
(1) Mode collapse, where the model fails to generate samples in certain regions of the data space.
(2) The min-max game failing to reach an equilibrium.
(3) Unrealistic generated samples.
A large body of research work addresses the aforementioned issues from various
perspectives. Radford et al. (2015) propose the deep convolutional GAN (DCGAN), which adopts a CNN as the generator. CNNs have been successful in discriminative tasks, and combining them with the GAN objective makes them applicable to unsupervised representation learning. Salimans et al. (2016) propose two tech-
niques to stabilize the training procedure of the GAN, namely feature matching
and minibatch discrimination. Feature matching requires that the activations of
the generated samples and the true data samples in intermediate layers of the
discriminator are similar. Minibatch discrimination encourages the discrimina-
tor to consider multiple samples in combination instead of an individual sample.
There are also attempts to extend GANs to other information-theoretic measures such as total variation distance (Zhao et al., 2016), f-divergence (Nowozin et al., 2016) and the Wasserstein distance (Arjovsky and Bottou, 2017; Arjovsky et al., 2017). To improve the quality of generated samples, Denton et al. (2015) develop LapGAN, which integrates multiple conditional GANs within a Laplacian pyramid. At each level of the pyramid, a generative model trained with the GAN objective upscales low-resolution images to fine-grained ones.
Figure 7.2 Overview of SimGAN. The generator refines synthetic images from a simulator to improve realism, as guided by the discriminator. In addition to the adversarial loss, a self-regularization loss is introduced to preserve the annotations from the simulator after the refinement
as
\ell_{reg} = \|\psi(\hat{x}^t) - \psi(x^s)\|_1, (7.3)
where \ell_{reg} denotes the self-regularization loss, ψ denotes a mapping from the image space to a new space, and x^s and \hat{x}^t denote the synthetic and refined images, respectively. In practice, the mapping ψ is usually the identity mapping, that is, ψ(x) = x, and the self-regularization loss \ell_{reg} is then the per-pixel difference between the synthetic and refined images. Minimizing the self-regularization loss encourages the refined image to preserve the annotations of the simulator.
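With the identity mapping ψ, the self-regularization term is just a per-pixel L1 distance between the synthetic input and its refinement. A minimal sketch with hypothetical images:

```python
import numpy as np

def self_reg_loss(x_synthetic, x_refined):
    """Per-pixel L1 difference with the identity mapping psi(x) = x."""
    return float(np.abs(x_refined - x_synthetic).mean())

rng = np.random.default_rng(2)
x_syn = rng.random((32, 32))                              # hypothetical synthetic image
x_ref_close = x_syn + 0.01 * rng.normal(size=(32, 32))    # refinement that stays close
x_ref_far = rng.random((32, 32))                          # refinement that forgot the input

# A refiner that drifts from its input pays a larger self-regularization penalty.
print(self_reg_loss(x_syn, x_ref_close) < self_reg_loss(x_syn, x_ref_far))  # True
```

Because annotations such as gaze direction are defined on the synthetic image, keeping this penalty small keeps the simulator's labels valid for the refined image.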
In SimGAN, two additional modifications are made to the vanilla GAN in order
to improve realism of refined images and stabilize training. The first modifica-
tion is the local adversarial loss where the discriminator classifies local patches
sampled from a refined image. This modification can avoid artifacts. The second
modification is updating the discriminator with a history of refined images, which
stabilizes the training procedure. The SimGAN is evaluated on the MPIIGaze data
set for gaze estimation (Zhang et al., 2015b; Wood et al., 2016) and the hand pose
estimation data set, the New York University (NYU) hand pose data set (Tomp-
son et al., 2014). In quantitative evaluation, SimGAN outperforms state-of-the-art
models on the MPIIGaze data set with a relative improvement of 21 percent. On
the NYU hand pose data set, SimGAN, which does not require any label in the
target domain, outperforms a model that is trained with real labeled images by
8.8 percent.
Another type of model builds bi-directional mapping between the source and
target domains. It can be helpful for applications such as image editing. If the re-
lationship between faces with black hair and those with blonde hair is known,
one can imagine how a person would look with a different hair color. Paired data are traditionally necessary to build such correspondences (Isola et al., 2017). However, with adversarial learning, models can discover cross-domain relations without paired data.
Figure 7.3 The network architecture of CycleGAN. (a) The bi-directional mappings G and F are learned simultaneously. (b) The cycle-consistency loss encourages the two mappings to be inverses of each other
A typical model to address this setting is CycleGAN (Zhu et al., 2017), whose framework is shown in Figure 7.3. Let G denote the mapping from source domain samples to target domain samples. There are infinitely many ways to map the source domain samples to the target domain such that the generated target samples follow the target domain distribution, so learning the mapping G alone is an under-constrained problem.
To address this issue, an inverse mapping F is introduced to map from the target domain to the source domain. The two mappings G and F are learned simultaneously and encouraged to be bijections. To learn the mapping G, two losses are considered. The first loss is an adversarial loss that ensures that the translated sample G(x^s) is indistinguishable from target domain samples, defined as
\mathcal{L}_{GAN}(G, D_t) = \mathbb{E}_{x^t \sim p(X^t)}[\log D_t(x^t)] + \mathbb{E}_{x^s \sim p(X^s)}[\log (1 - D_t(G(x^s)))]. (7.4)
The second loss is a cycle-consistency loss that encourages F(G(x^s)) ≈ x^s and G(F(x^t)) ≈ x^t, defined as
\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x^s \sim p(X^s)}[\|F(G(x^s)) - x^s\|_1] + \mathbb{E}_{x^t \sim p(X^t)}[\|G(F(x^t)) - x^t\|_1]. (7.5)
Putting (7.4) and (7.5) together, the full objective of the CycleGAN is formulated as
where λ balances the importance of the adversarial loss and the cycle-consistency
loss. Qualitative analyses show that meaningful correspondences across domains
can be established. “Real versus fake” perceptual studies on Amazon Mechanical
Turk show that CycleGAN can fool human annotators on around 25 percent of
trials. Yet their performance is still weaker than the model with strong paired su-
pervision data. Also, failure cases are observed when there are geometric changes.
Researchers have proposed several models with similar characteristics (Kim et al., 2017; Yi et al., 2017; Zhu et al., 2017), which differ from CycleGAN in implementation details. DiscoGAN (Kim et al., 2017) adopts a network architecture similar to DCGAN. CycleGAN adapts the architecture from Johnson et al.'s (2016a) work, which uses residual blocks and instance normalization in the generator, and uses PatchGAN as the discriminator. Different from DCGAN, the PatchGAN discriminator decides whether the input image is real or fake at the patch level. The patch-level discriminator has few parameters and can be applied to images of arbitrary size. DualGAN (Yi et al., 2017) also uses PatchGAN as the discriminator, and it adopts the U-shaped network proposed in the work by Isola et al. (2017) as the generator.
V(G, C, D) = \frac{1}{n_s} \sum_{i=1}^{n_s} \mathcal{L}_y^i(G, C) - \lambda \left( \frac{1}{n_s} \sum_{i=1}^{n_s} \mathcal{L}_d^i(G, D) + \frac{1}{n_t} \sum_{i=1}^{n_t} \mathcal{L}_d^i(G, D) \right), (7.6)
where the hyperparameter λ balances the two terms. As G is trained to minimize the label prediction loss while maximizing the domain classification loss, a gradient reversal layer (GRL) is proposed: the gradient from the domain classifier D to the feature extractor G is multiplied by a negative constant during the back-propagation optimization.
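A minimal sketch of the GRL idea, with hand-rolled forward/backward methods rather than any particular framework's API: the forward pass is the identity, and the backward pass flips (and scales) the incoming gradient.

```python
import numpy as np

class GradientReversal:
    """Sketch of a gradient reversal layer: identity in the forward pass,
    gradient multiplied by -lam in the backward pass."""
    def __init__(self, lam=0.5):
        self.lam = lam

    def forward(self, x):
        return x                          # features pass through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output    # reversed and scaled gradient

grl = GradientReversal(lam=0.5)
h = np.array([1.0, -2.0, 3.0])            # features from the extractor G
g = np.array([0.1, 0.4, -0.2])            # gradient from the domain classifier D

print(grl.forward(h))                     # unchanged features
print(grl.backward(g))                    # elementwise: [-0.05, -0.2, 0.1]
```

Placed between G and D, this layer lets a single gradient-descent pass simultaneously train D to classify domains and train G to confuse it.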
Several models are developed on the basis of DANN. Bousmalis et al. (2016)
assume that modeling domain-specific features helps extract domain-invariant
features. They propose a domain separation network that decomposes feature
representations into private and shared parts. Tzeng et al. (2017) unify existing
domain adaptation models with adversarial learning in a framework. This unified
framework considers various design choices and facilitates exploration of novel
architectures.
Another model, known as the joint adaptation network, outperforms DANN on several image classification data sets (Long et al., 2017). This model matches the joint distributions of the activations from both source and target domains in multiple layers by minimizing the joint maximum
mean discrepancy (JMMD). The JMMD is parameterized by a multi-layer neural network.
Adversarial learning can also be used to learn a bi-directional mapping between the data space and a latent space: an encoder E is trained jointly with the generator G, and the discriminator D distinguishes joint pairs (x, E(x)) from (G(z), z) with the value function
V(D, E, G) = \mathbb{E}_{x \sim p_{data}}[\mathbb{E}_{z \sim p_E(\cdot|x)}[\log D(x, z)]] + \mathbb{E}_{z \sim p_z}[\mathbb{E}_{x \sim p_G(\cdot|z)}[\log (1 - D(x, z))]].
The learned representations of the encoder are then applied to other supervised learning tasks and achieve results competitive with unsupervised and self-supervised feature learning models, as shown by Donahue et al. (2016).
7.4 Discussion
Adversarial transfer learning models have great potential as they combine two prominent approaches for learning with limited data. They allow "translation" between two domains, which is helpful for artistic creation applications such as image/video editing. They also provide a learning-based data augmentation approach.
For example, we can train a self-driving car with the images from computer games,
which are refined by a GAN. In terms of discriminative feature learning, adver-
sarial transfer learning measures the domain discrepancy with a parameterized
network, which avoids hand-crafted statistical distances such as MMD and KL di-
vergence.
Adversarial transfer learning is a fast-advancing approach and there are plenty of open challenges for future research, for example, how to incorporate target domain label information, and how to address heterogeneous transfer learning settings where either the feature spaces or the label spaces of the two domains are different. It is expected that the two lines of research, generative adversarial learning and transfer learning, can be connected in a principled way and that novel ideas will be exchanged between them.
8
Transfer Learning in Reinforcement Learning
8.1 Introduction
Reinforcement learning is a paradigm of machine learning in which the learner interacts with an unknown environment. In reinforcement learning (Sutton and Barto, 1998), this interaction can be modeled via a Markov decision process (MDP), where
the agent sequentially takes actions and receives corresponding rewards. The re-
wards can be time delayed. Guided by this limited reward signal, reinforcement learning aims to acquire a policy that decides how to take actions in future situations. An optimal policy is defined as one that maximizes the cumulative rewards.
We take the game playing in Figure 8.1 as an example, as often adopted in reinforcement learning research (Silver et al., 2016). In each step, an intelligent agent must decide how to make a move, for example, fire or go left, according to the current state of the game. The agent should learn this policy from the delayed reward, that is, pass or fail at the end, to optimize the success rate.
Reinforcement learning significantly differs from supervised learning in sev-
eral aspects. Supervised learning learns from labeled training samples provided
by an oracle teacher and optimizes the generalization performance measured on
unseen testing data. Unlike the limited reward signal in reinforcement learning
settings, in supervised learning, the labels describe the correct action in various
situations, for example, the correct move in each round of a game. Clearly, such
high-quality and informative labeled samples are not available in many real world
applications, which is the target application area of reinforcement learning.
A major challenge for reinforcement learning is the exploitation-exploration trade-off, which refers to the critical decision of agents when interacting with an environment. To maximize the cumulative rewards, an agent is advised to exploit the actions that have performed best in past observations. Since only the rewards corresponding to the selected actions are observed, the agent should also explore unattempted actions. Short-term rewards may be sacrificed in pursuit of long-term cumulative rewards. Theoretically, an optimal policy should never stop the exploration (Lai and Robbins, 1985). For example, to maximize the
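The exploitation-exploration trade-off can be illustrated with an ε-greedy policy on a toy three-armed bandit; the reward means below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
true_means = np.array([0.2, 0.5, 0.8])   # hypothetical Bernoulli reward means

def eps_greedy(eps, steps=5000):
    """Average reward of an epsilon-greedy agent on a 3-armed bandit."""
    counts, values = np.zeros(3), np.zeros(3)
    total = 0.0
    for _ in range(steps):
        # Explore with probability eps, otherwise exploit the best estimate.
        a = rng.integers(3) if rng.random() < eps else int(values.argmax())
        r = float(rng.random() < true_means[a])      # Bernoulli reward
        counts[a] += 1
        values[a] += (r - values[a]) / counts[a]     # incremental mean estimate
        total += r
    return total / steps

r_explore, r_greedy = eps_greedy(0.1), eps_greedy(0.0)
# Pure exploitation locks onto the first arm tried and never discovers the best arm.
print(r_explore > r_greedy)  # True
```

The ε = 0.1 agent sacrifices some short-term reward on exploratory pulls but discovers the 0.8 arm; the purely greedy agent remains stuck exploiting an inferior arm.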
8.2 Background
In this section, we introduce fundamental concepts of reinforcement learning.
Then, we discuss essential components of transfer learning, including “what to
transfer,” “how to transfer” and “when to transfer” in the context of reinforcement
learning. Finally, we introduce different objectives of transfer learning for rein-
forcement learning.
The optimal Q-function is also known to satisfy the Bellman equation, which is defined as
Q^*(s_n, a_n) = \mathbb{E}\left[ R_M(s_{n+1}) + \gamma \max_{a_{n+1}} Q^*(s_{n+1}, a_{n+1}) \right]. (8.2)
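Equation (8.2) characterizes Q* as the fixed point of the Bellman operator, and value iteration approaches that fixed point by applying the operator repeatedly. A small sketch on a hypothetical two-state, two-action MDP, checking the Bellman residual after convergence:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (hypothetical numbers): P[a][s, s'] and rewards R[s'].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # transition probabilities for action 0
              [[0.1, 0.9], [0.7, 0.3]]])   # transition probabilities for action 1
R = np.array([0.0, 1.0])                   # reward received on entering state s'
gamma = 0.9

def bellman_backup(Q):
    """One application of the Bellman operator of (8.2)."""
    V = Q.max(axis=1)                      # max_{a'} Q(s', a')
    return np.array([[P[a, s] @ (R + gamma * V) for a in range(2)]
                     for s in range(2)])

# Q-value iteration: iterate the contraction until it reaches its fixed point.
Q = np.zeros((2, 2))                       # Q[s, a]
for _ in range(500):
    Q = bellman_backup(Q)

# At the fixed point, Q satisfies the Bellman equation up to numerical tolerance.
print(np.allclose(Q, bellman_backup(Q), atol=1e-6))  # True
```

Because the operator is a γ-contraction, the residual shrinks geometrically, so a few hundred iterations suffice on this toy problem.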
When the state or action space is very large or even continuous, we usually rep-
resent the value function with a functional approximation, where the feature func-
tions are used as the representation of an MDP. Among traditional reinforcement
learning methods, linear function approximation plays a dominant role. In contrast, deep reinforcement learning methods exploit powerful deep neural networks, including multi-layer perceptrons, deep convolutional neural networks
(Mnih et al., 2013), deep recurrent networks (Hausknecht and Stone, 2015) and
so on. Deep reinforcement learning learns useful representations for the value
function by leveraging the representation learning ability of deep neural networks.
Furthermore, deep reinforcement learning is capable of learning the value func-
tion and the policy in an end-to-end manner. As one of the representative deep
reinforcement learning methods, the deep Q-network (DQN) with experience replay improves game-playing performance significantly (Mnih et al., 2013); it adopts a convolutional neural network to extract representations directly from the raw frames of the game.
and “task” for an MDP M , respectively. The domain of an MDP M , that is, DM , in-
cludes the state space S and the action space A. In a continuous MDP, the domain
mainly indicates the continuous state variables and action space. If two MDPs be-
long to different domains, either the state space or the action space is different.
Transfer learning for MDPs with different domains depends on the handcrafted
or learned inter-domain mapping between the source and target domains.
Given a domain M , the task describes the remaining components of an MDP,
including the transition function P_M and the reward function R_M. MDPs with different tasks have distinct dynamics or reward functions. As we discussed earlier, P_M and R_M can be unknown to the agent, which necessitates exploitation and exploration.
In the following, we illustrate different domains and different tasks based on the
mountain car problem.
According to Pan and Yang (2010), the essential issues in designing a successful transfer learning algorithm for reinforcement learning include deciding "what to transfer," "how to transfer" and "when to transfer."
"What to transfer" categorizes transfer learning algorithms for reinforcement learning into instance-based transfer learning, feature-based transfer learning and model-based transfer learning. Instance-based transfer learning identifies and reuses a subset of source experiences when learning the target MDP. Feature-based transfer learning algorithms extract high-level and abstract concepts from the source MDPs and accordingly change the state or action space of the target MDP so that it focuses more on promising regions of the state or action space or utilizes a more powerful function approximator. Model-based transfer learning reuses the value function or the transition function learned from the source experiences in the target MDP.
“How to transfer” decides on which algorithm to discover and reuse the related
knowledge. The method used by a knowledge-transfer algorithm heavily relies
upon "what to transfer." In the context of instance-based transfer learning, "how to transfer" mainly indicates the criteria with which to identify related source experiences. In the context of feature-based transfer learning, "how to transfer" relates to how the representation of the source knowledge can be reused by the target domain. In the context of model-based transfer learning, "how to transfer" considers how to reuse the source experiences in the target domain.
“When to transfer” mainly indicates the timing of using transfer learning. Source
MDPs are not guaranteed to be helpful in improving the performance of the target
MDP. When facing dramatically different source and target MDPs, the brute-force
knowledge transfer may jeopardize the target performance via the so-called nega-
tive transfer. When facing multiple source MDPs, when to transfer emphasizes the
necessity of selective transfer by identifying the similarity between the source and
target MDPs. “When to transfer” calls for a deeper theoretical understanding of
transfer learning, including how to measure the similarity between different MDPs,
how to guarantee the avoidance of negative transfer and so on.
In reinforcement learning, transfer learning aims to improve the cumulative
rewards in three circumstances: the jump-start improvement, the asymptotic
improvement and the learning speed improvement (Lazaric, 2012). These three objectives can be used
to measure the effectiveness of transfer learning algorithms. We separately dis-
cuss these objectives in the following. When a transfer learning algorithm returns
a learned policy $\pi_t$ for the target MDP, to better understand the improvement,
the gap between the action-value function of $\pi_t$ and that of the optimal policy $\pi^*$ can be
decomposed as

$Q^{\pi_t} - Q^* \le \epsilon_{\mathrm{approx}}(Q^{\pi_t}, Q^*) + \epsilon_{\mathrm{est}}(N_t) + \epsilon_{\mathrm{opt}}.$ (8.3)

In (8.3), the approximation error $\epsilon_{\mathrm{approx}}(Q^{\pi_t}, Q^*)$ denotes the asymptotic
error caused by the bias of the function approximation. In an MDP with small
state and action spaces, the agent can learn the optimal value function exactly
and suffers no approximation error. The estimation error $\epsilon_{\mathrm{est}}(N_t)$ is due to
estimating the value function from a finite number $N_t$ of experiences. As a result,
the estimation error decreases and converges to a stable value as the target-domain
experience increases. Finally, the optimization error $\epsilon_{\mathrm{opt}}$ is caused by reaching a
non-global optimum when optimizing the function approximator. The optimization
error often occurs in deep reinforcement learning.
Jump-start improvement: The advantage of knowledge transfer can be empirically
measured by the performance improvement at the beginning of the learning
process compared to algorithms without transfer learning. An intuitive way
to achieve knowledge transfer is to directly use the policies or value functions
learned in the source MDPs to initialize the target one. If source and target MDPs
are similar enough, the transferred policy or value function can achieve better
performance compared to the random initialization, thereby leading to the jump-
start improvement as shown in Figure 8.3. Jump-start improvement does not guar-
antee the asymptotic improvement and the learning speed improvement.
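In a tabular setting, the jump-start idea amounts to warm-starting the target value table from a source task. The following is a minimal sketch assuming identical state and action spaces; otherwise an inter-task mapping would be needed first.

```python
import numpy as np

def init_target_q(source_q, n_states, n_actions, transfer=True):
    """Warm-start a target Q-table from a source task (jump-start) or
    start flat (no transfer). A hypothetical tabular setting in which
    source and target share state and action spaces."""
    if transfer and source_q is not None:
        return source_q.copy()               # reuse source values
    return np.zeros((n_states, n_actions))   # learn from scratch

source_q = np.array([[0.0, 1.0],
                     [0.5, 0.2]])
q_transfer = init_target_q(source_q, 2, 2, transfer=True)
q_scratch = init_target_q(None, 2, 2, transfer=False)
```

The greedy return of `q_transfer` at episode zero, compared with that of `q_scratch`, is precisely the jump-start gap of Figure 8.3.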
Learning speed improvement: The critical motivation for applying transfer learn-
ing in reinforcement learning is to improve the efficiency of learning by reducing
the required target-domain interactions with the environment. That is, transfer
learning can learn more efficiently than the non-transfer cases. Thus, the learn-
ing speed improvement can be used to measure whether knowledge transfer can
reduce the estimation error much faster as a function of the interactive expe-
rience, as shown in (8.3). Transfer learning achieves this improvement by guid-
ing the exploitation and exploration more efficiently in the target MDP. Learning
speed improvement can be achieved via any of the instance-based, model-based
and feature-based transfer learning methods. In instance-based transfer learning,
reusing the experiences in source MDPs is equivalent to interacting with related
environments without cost. In the model-based transfer, the policy or the value
function learned in the source MDP is used. In the feature-based transfer, the extracted
high-level representation changes that of the target MDP. All the transferred
source knowledge guides the agent to focus on the regions of the state and
action spaces that are more likely to be optimal in the source MDPs, speeding up
exploitation and exploration. The learning speed improvement can be measured
empirically by time to threshold and area ratio and analyzed theoretically by finite-
sample analysis (Taylor and Stone, 2009). Given a performance threshold, time to
threshold compares the number of interactions with the environment required by re-
inforcement learning algorithms with and without transfer learning in the target
domain. Time to threshold, however, ignores the learning curve of the algorithms
and selecting the performance threshold could be tricky as well. Area ratio quanti-
fies the improvement of the area under the performance curve compared to algo-
rithms without transfer learning. Besides the empirical measurement, theoretical
analysis of the estimation error in (8.3) and “sample complexity” (Brunskill and
Li, 2013) provide more solid verification.
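The two empirical measures can be computed directly from learning curves; a sketch with hypothetical curves:

```python
def time_to_threshold(curve, threshold):
    """Index of the first episode whose performance reaches the threshold;
    None if it is never reached (a known weakness of the metric)."""
    for t, perf in enumerate(curve):
        if perf >= threshold:
            return t
    return None

def area_ratio(curve_transfer, curve_baseline):
    """Relative gain in area under the learning curve over the baseline;
    positive values indicate a transfer benefit."""
    auc = lambda c: sum((c[i] + c[i + 1]) / 2 for i in range(len(c) - 1))
    return (auc(curve_transfer) - auc(curve_baseline)) / auc(curve_baseline)

baseline = [0.0, 0.2, 0.4, 0.6, 0.8]   # hypothetical no-transfer curve
with_tl = [0.4, 0.6, 0.8, 0.9, 0.95]   # hypothetical curve with transfer
```

Here `time_to_threshold(with_tl, 0.6)` is smaller than for the baseline, and the area ratio is positive, reflecting both a jump-start and a learning speed gain.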
8.3 Inter-task Transfer Learning
$\Lambda_{s_i} = \frac{1}{N_t} \sum_{n=1}^{N_t} P\big(\langle s_n, a_n, r_n, s'_n \rangle \mid \hat{M}_{s_i}\big).$ (8.4)
Given the task compliance, all the source experiences are weighted and the weighted
source experiences are used to help the learning of the target MDP.
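A sketch of compliance-based weighting in the spirit of (8.4), assuming each source model exposes a likelihood for a target transition (the `source_model` callables here are hypothetical):

```python
import numpy as np

def compliance(target_batch, source_model):
    """Average likelihood of the target experiences under an estimated
    source model, following (8.4); `source_model` is a hypothetical
    callable returning P(<s, a, r, s'> | M_hat_si)."""
    return float(np.mean([source_model(*e) for e in target_batch]))

def source_weights(target_batch, source_models):
    """Normalized compliance weights over the source MDPs."""
    lam = np.array([compliance(target_batch, m) for m in source_models])
    return lam / lam.sum()

# Toy target batch and two hypothetical source models.
batch = [(0, 1, 1.0, 1), (1, 0, 0.0, 0)]
model_a = lambda s, a, r, s2: 0.8   # explains the target data well
model_b = lambda s, a, r, s2: 0.2   # explains it poorly
weights = source_weights(batch, [model_a, model_b])
```

Experiences from the more compliant source then receive proportionally larger weight when training the target learner.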
Genevay and Laroche (2016) apply an instance-based transfer method to the
spoken dialogue system. Different from Lazaric et al. (2008) where the compliance
of a source MDP is manually defined, Genevay and Laroche (2016) formulate the
source selection problem as a multi-armed stochastic bandit problem that treats
each learned policy πsi from the i -th source MDP as an arm. Pulling the i -th arm
is equivalent to applying πsi to the target user and observing the corresponding
discounted reward in the dialogue system. The multi-armed bandit guarantees
that the most useful source MDP can be identified with high probability. Further-
more, the usefulness of certain source experience is defined by whether it contains
complementary information to the target MDP. As a result, Genevay and Laroche
(2016) select source experiences that are far from the target training data via a
density-based criterion. Finally, the proposed method learns from the transferred
experiences by using any batch reinforcement learning algorithm as the initialization.
Empirically, this method successfully achieves both the jump-start improvement
and the asymptotic improvement.
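The source-selection formulation can be sketched with a standard UCB1 bandit, used here as an illustrative stand-in for the exact algorithm of Genevay and Laroche (2016):

```python
import math
import random

def ucb_source_selection(pull, n_sources, horizon, seed=0):
    """UCB1 over source policies: pulling arm i means running source
    policy pi_si on the target task and observing its discounted return."""
    random.seed(seed)
    counts = [0] * n_sources
    values = [0.0] * n_sources
    for t in range(1, horizon + 1):
        if t <= n_sources:
            arm = t - 1                       # play every arm once first
        else:
            arm = max(range(n_sources),
                      key=lambda i: values[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
    return max(range(n_sources), key=lambda i: counts[i])

# Source policy 1 yields a clearly higher return on the toy target task.
best = ucb_source_selection(lambda i: random.gauss((0.2, 0.8)[i], 0.1),
                            n_sources=2, horizon=200)
```

With enough pulls, the bandit concentrates on the most useful source MDP with high probability, as the text describes.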
Laroche and Barlier (2017) propose an instance-based transfer learning method
named transfer reinforcement learning with shared dynamics (TRLSD) that aims
at improving the learning speed when facing MDPs with the shared dynamics.
TRLSD is inspired by robotics applications where an agent takes advantage of
the shared transition function P to understand the complex environment. TRLSD
learns the shared transition function by using experiences from all MDPs and es-
timates the task-specific reward function using the target experiences only. More
concretely, TRLSD first translates the source experience $\langle s_m, a_m, r_m, s'_m \rangle$ to the
target MDP via the reward proxy $\hat{r}_m$ and then learns the target policy from all the
translated source experiences added to target experiences using the fitted-Q iter-
ation. In the process, it is found that the reward proxy learned from limited target
experiences suffers from a high uncertainty. To explore the shared dynamics and
reward function more efficiently, TRLSD adopts the optimism in the face of uncer-
tainty heuristic and explicitly models the uncertainty of the reward function with
the upper confidence reinforcement learning.
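The division of labor in TRLSD, with shared dynamics pooled across tasks and rewards estimated from the target only, can be sketched with count-based models (a simplification of the actual fitted-Q machinery):

```python
from collections import defaultdict

def estimate_models(source_batches, target_batch):
    """Sketch of the TRLSD idea: the transition model is estimated from
    experiences pooled across all tasks (shared dynamics), while the
    reward model uses target experiences only (task-specific rewards)."""
    counts = defaultdict(lambda: defaultdict(int))
    for batch in source_batches + [target_batch]:
        for s, a, r, s2 in batch:
            counts[(s, a)][s2] += 1
    P = {sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
         for sa, nxt in counts.items()}
    r_sum, r_cnt = defaultdict(float), defaultdict(int)
    for s, a, r, s2 in target_batch:   # rewards from the target only
        r_sum[(s, a)] += r
        r_cnt[(s, a)] += 1
    R = {sa: r_sum[sa] / r_cnt[sa] for sa in r_sum}
    return P, R

source = [[(0, 0, 5.0, 1), (0, 0, 5.0, 1)]]   # source rewards differ
target = [(0, 0, 1.0, 1), (0, 0, 1.0, 0)]
P, R = estimate_models(source, target)
```

Note how the source batch sharpens the transition estimate but its (different) rewards never contaminate the target reward model.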
Action-Based Transfer
The action-based transfer learning methods adapt the target action space by
leveraging high-level knowledge from the source MDPs.
Among all action-based transfer learning algorithms, the option-based transfer
Figure 8.4 In this maze example (adapted from Hengst [2002]), the robot attempts
to move from the initial position to the goal through three rooms.
Unlike the option-based transfer, another line of action-based transfer learning
shrinks the action space and focuses more on potentially optimal actions. Sherstov
and Stone (2005) generate a set of synthetic MDPs by randomly perturbing a
single source MDP. The actions that are not optimal in any generated MDPs are
discarded. The agent then solves the target MDP considering only the remaining
actions. Intuitively, by discarding non-optimal actions, the resulting smaller action
space alleviates the need to explore all the actions, thereby improving the learning
speed. Sherstov and Stone (2005) select actions by heuristics. Azar et al. (2013)
propose an uncertain model upper confidence bound (umUCB) method that provides
a theoretical mechanism for eliminating actions. umUCB is tailored for the
multi-armed bandit problem with multiple source tasks and is guaranteed to avoid
negative transfer.
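The elimination idea can be sketched on a one-state MDP, where the optimal action is simply the reward argmax; this is a toy stand-in for the full construction of Sherstov and Stone (2005):

```python
import numpy as np

def transferable_actions(base_rewards, n_perturbed=100, noise=0.1, seed=0):
    """Keep only actions that are optimal in at least one randomly
    perturbed copy of the source MDP; shown here on a one-state MDP,
    where the optimal action is just the reward argmax."""
    rng = np.random.default_rng(seed)
    keep = set()
    for _ in range(n_perturbed):
        r = base_rewards + rng.normal(0.0, noise, size=len(base_rewards))
        keep.add(int(np.argmax(r)))
    return keep

# Actions 2 and 3 are near-optimal and survive; 0 and 1 are discarded.
actions = transferable_actions(np.array([0.0, 0.1, 0.9, 0.95]))
```

The target learner then explores only the surviving actions, which is the source of the learning speed improvement discussed above.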
Feature Function Based Transfer
In reinforcement learning, searching for the optimal policy is equivalent to learning
the optimal value function. Without loss of generality, we assume that the hypothesis
space $\mathcal{H}$ of the value function can be approximately represented as a
linear combination of $d$ feature functions, that is,

$\mathcal{H} = \Big\{ h : h(s,a) = \sum_{j=1}^{d} \phi_j(s, a; \theta_\phi)\, w_j \Big\}.$ (8.5)
Figure 8.5 The dynamics of an MDP in (a) can be represented by the state transition
graph in (b). The proto-value function method proposed in Mahadevan and
Maggioni (2007) transfers the eigenvectors of the graph Laplacian.
that the one-step expected reward is a linear combination of the $d$ feature functions
$\phi_i(s, a; \theta_\phi)$, that is,

$r(s,a) = \mathbb{E}_{s' \sim P}\big[ r(s, a, s') \big] = \sum_{i=1}^{d} \phi_i(s, a; \theta_\phi)\, w_i.$ (8.6)
According to the Bellman equation, the action-value function under the policy $\pi$
can be written as

$Q^{\pi}(s,a) = \mathbb{E}_{\pi}\Big[\sum_{t=\tau}^{\infty} \gamma^{t-\tau} \boldsymbol{\phi}_{t+1} \,\Big|\, s_\tau = s, a_\tau = a\Big]^{T}\mathbf{w},$ (8.7)

where $\boldsymbol{\psi}^{\pi}(s,a) \equiv \mathbb{E}_{\pi}\big[\sum_{t=\tau}^{\infty} \gamma^{t-\tau} \boldsymbol{\phi}_{t+1} \,\big|\, s_\tau = s, a_\tau = a\big]$ is regarded as the successor
feature functions. Clearly, when using the tabular representation of the state and action
spaces, the successor feature functions represent the prediction of the future occurrence
of all other states under the policy $\pi$. For instance, in Figure 8.4, if the
feature function φ(·) represents the position of the robot, the successor feature
functions can indicate the trajectory under the policy $\pi$. The successor feature
functions can be learned via the Bellman equation. Intuitively, in the MDP, the transition
function is summarized by the successor feature functions and the reward
function is modeled by w. The successor feature functions decouple the dynamics
and the reward function. When facing the source and target MDPs with the shared
transition function but different reward functions, the agent could directly exploit
the learned successor feature functions and estimate w in the target MDP.
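A tabular sketch of this decoupling, with one-hot state features so that $\psi$ predicts future state occupancies; only the reward weights $\mathbf{w}$ need to be re-estimated in a target task with shared dynamics:

```python
import numpy as np

def learn_sf(transitions, policy_next, n_states, n_actions,
             gamma=0.9, lr=0.1, epochs=2000):
    """TD learning of tabular successor features with one-hot state
    features phi (a minimal sketch of the decoupling in the text):
    psi(s,a) <- psi(s,a) + lr * (phi(s') + gamma * psi(s',a') - psi(s,a))."""
    psi = np.zeros((n_states, n_actions, n_states))
    eye = np.eye(n_states)
    for _ in range(epochs):
        for s, a, s2 in transitions:
            target = eye[s2] + gamma * psi[s2, policy_next[s2]]
            psi[s, a] += lr * (target - psi[s, a])
    return psi

# Two-state chain under a fixed policy; all transitions lead to state 1.
psi = learn_sf([(0, 0, 1), (1, 0, 1)], policy_next=[0, 0],
               n_states=2, n_actions=1)
w_new = np.array([0.0, 1.0])   # new task: reward only in state 1
q_new = psi @ w_new            # reuse psi, re-estimate only w
```

Because the dynamics are unchanged, the learned `psi` is reused as-is and a new action-value function is obtained by a single matrix product with the new reward weights.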
Other definitions of the transferable feature functions also exist. For example,
Drummond (2002) decomposes the state spaces of the source MDPs into sub-
tasks and treats the independent value function of each subtask as transferable
feature functions. Snel and Whiteson (2014) adopt feature function selection
to identify transferable feature functions. Walsh et al. (2006) and Lazaric
(2008) share a similar idea to multi-task learning (Zhang and Yang, 2017b). When
facing multiple source MDPs, Walsh et al. (2006) and Lazaric (2008) assume that
the state aggregation or the subset of feature functions that perform well on all
source MDPs is transferable. Bou-Ammar et al. (2014) solve a sequential trans-
fer learning problem by assuming that feature functions for the source and target
MDPs can be factorized into an invariant part and a task-specific part, and esti-
mate both parts via sparse coding.
Deep reinforcement learning (Mnih et al., 2013), particularly the DQN algo-
rithm, proposes the extraction of transferable feature functions in an end-to-end
manner by taking advantage of powerful deep neural networks as the function
approximator. For example, DQN successfully learns to play very complex Atari
games based on the input images. To learn such complex deep neural networks,
however, DQN calls for massive experience, which emphasizes the necessity of
knowledge transfer. In the target MDP, DQN leverages the feature functions extracted
by the deep neural network trained on the source MDPs. Therefore, the
reasonable feature functions can speed up learning the policy in the target MDP.
and then initializes the target MDP accordingly. Intuitively, in the hierarchical
Bayesian model, the global distribution Ωψ is estimated by using all the source
MDPs and serves as an informative prior for the target MDP.
For instance, Wilson et al. (2007) parameterize the transition and reward func-
tions and apply model-based Bayesian reinforcement learning to the target MDP.
The model by Wilson et al. (2007) is not limited to the specific form of the prior of
ψ. In comparison, Lazaric and Ghavamzadeh (2010) parameterize the value func-
tion based on a normal-inverse-Wishart hyper-prior.
Table 8.2 The handcrafted inter-task mapping for the mountain car problem
(Taylor et al., 2008a). The goal is to transfer from the three-dimensional problem to
the two-dimensional case.

Inter-task mapping for the mountain car example

Action mapping:
    χ_A(Neutral) = Neutral
    χ_A(North) = Right
    χ_A(East) = Right
    χ_A(South) = Left
    χ_A(West) = Left

State mapping:
    χ_S(x) = x
    χ_S(ẋ) = ẋ
    χ_S(y) = x
    χ_S(ẏ) = ẋ
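Applied to experiences, the mapping of Table 8.2 translates each 3D transition into a 2D one. A sketch follows; note that we read the table's duplicated "South" row as "West", which is an assumption on our part:

```python
# chi_A from Table 8.2 (the "West" entry is our reading of the
# duplicated "South" row in the source table).
ACTION_MAP = {"Neutral": "Neutral", "North": "Right", "East": "Right",
              "South": "Left", "West": "Left"}

def map_state(state3d):
    """chi_S: both (x, x_dot) and (y, y_dot) of the 3D task map onto the
    single (x, x_dot) pair of the 2D task."""
    x, x_dot, y, y_dot = state3d
    return (x, x_dot)

def translate(experience3d):
    """Translate one 3D experience <s, a, r, s'> into a 2D experience."""
    s, a, r, s2 = experience3d
    return (map_state(s), ACTION_MAP[a], r, map_state(s2))

exp2d = translate(((-0.5, 0.01, -0.4, 0.02), "North", -1.0,
                   (-0.49, 0.02, -0.38, 0.03)))
```

The translated experiences can then seed the 2D learner, exactly as the instance-based methods above reuse translated source experiences.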
experience. TCB learns a mapping to preserve such geometry structure and, af-
ter learning the mapping, TCB adopts the translated source experiences to warm-start
the target domain. To maximize the cumulative rewards, TCB explores not only
the reward function but also the learning process of the mappings. TCB
achieves both the jump-start and learning speed improvements in the application
of recommender systems.
For other applications, such as robotics, such auxiliary guidance may be unavailable.
Bou-Ammar et al. (2012) present a transfer fitted-Q-iteration (TrFQI) algorithm
that automatically constructs the correspondence between source and target
experiences and learns the inter-task mapping. More concretely, TrFQI and
transfer least squares policy iteration (TrLSPI) calculate the similarity between
each source and target experience pair via sparse coding. Based on the estimated
correspondences, TrFQI and TrLSPI approximate the inter-task mapping with a
Gaussian process.
When there is no auxiliary guidance, Bou-Ammar et al. (2015) propose the ex-
ploitation of the unsupervised loss to avoid this requirement. More concretely, all
the source and target state variables are mapped to a common representation.
The proposed method learns this mapping with an unsupervised loss by preserv-
ing the local manifold geometry. Then, similar to other instance-based transfer
learning methods, Bou-Ammar et al. (2015) reuse the translated experiences as
the initialization, which is far better than the random initialization.
Action-Based Transfer
In the option-based transfer, the agent discovers the options from the source
MDPs and augments the target action space. The source options, however, are
defined on the source state and action spaces. Therefore, the discovered options
cannot be directly transferred to the target domain. Hence, the concept of options
is generalized to abstract concepts to facilitate the reuse in the target domain.
Konidaris and Barto (2007) propose portable options that are transferable across
MDPs with different state spaces but the same action space. Konidaris and Barto
(2007) decompose the state space into the problem space and agent space. The
problem space describes problem-specific properties and the agent space mod-
els the agent-specific characteristics that are invariant across learning problems.
Taking the robot in Figure 8.4 as an example, the problem state records its location
within the environment and the agent state includes the internal sensors and
actuators of the robot. Clearly, when facing different environments, the agent state
remains the same. Konidaris and Barto (2007) first discover the portable options
within the fixed agent space and then augment the shared action space with the
portable options.
Konidaris and Barto (2007), however, heavily rely on the manual decomposi-
tion of the state space. Topin et al. (2015) propose the portable option discovery
(POD), which is tailored for object-oriented MDPs and automatically decides
the mapping between the source and target MDPs. First, POD creates an abstract
domain with an abstract state space $S$ and decides a mapping $\chi_s : S_s \to S$. The
policy of an option can be mapped to the abstract states accordingly. Then, in the
target MDP, POD searches for another mapping from the abstract domain to the
target MDP, that is, $\chi_t : S \to S_t$. In short, POD translates the source options to
the target MDP via an abstract domain. Furthermore, POD automatically decides
the mappings χs and χt with the highest proportion of the state-action pairs that
are preserved between the abstract policy and the source/target policy.
9.1 Introduction
As discussed in Chapter 1, similar to transfer learning, multi-task learning (Caru-
ana, 1997) also aims to generalize knowledge across different tasks. Different from
transfer learning, which assumes some source domain(s) are available as inputs
for solving a learning problem in a target domain, in multi-task learning, there are
no source domains, but multiple target domains, each of which has insufficient la-
beled data to train a classifier independently. The goal of multi-task learning is to
jointly learn the multiple target tasks by exploiting useful information from related
learning tasks to help alleviate the data sparsity problem. In this sense, multi-task
learning exhibits similar characteristics to transfer learning. However, multi-task
learning is different from transfer learning in terms of the objective. That is, multi-task
learning aims to improve the performance of all the tasks at hand, while
transfer learning cares about the performance of the target task only, not that of the
source tasks. Hence, the roles of different tasks in multi-task learning are equally
important but, in transfer learning, the target task is more important than the source
tasks. From the perspective of the flow of knowledge transfer, in transfer learning
knowledge flows from the source task(s) to the target task, while in multi-task
learning knowledge can flow between any pair of tasks, as illustrated in
Figure 9.1. So multi-task learning
and transfer learning are two different settings in terms of knowledge transfer. In
terms of learning algorithms, many multi-task learning algorithms can be revised
for transfer learning problems. Moreover, in the works by Xue et al. (2007) and
Zhang and Yeung (2010a, 2014), a new multi-task learning setting called asymmetric
multi-task learning is investigated. This setting considers a different scenario
where a new task arrives after multiple tasks have been learned jointly via
some multi-task learning method. This setting can be viewed as a hybrid of multi-task learning
and transfer learning where multi-task learning happens for old tasks and transfer
learning leverages knowledge from the old tasks to the new task.
Based on the assumption that all the tasks, or at least some of them, are related,
learning multiple tasks together has been found, both empirically and theoretically,
to achieve better performance than learning them individually. According to different natures of
Figure 9.1 An illustration for the difference between transfer learning and multi-
task learning from the perspective of the flow of knowledge transfer
present the definition of multi-task learning and then introduce different settings
in multi-task learning, that is, multi-task supervised learning, multi-task unsu-
pervised learning, multi-task semi-supervised learning, multi-task active learn-
ing, multi-task reinforcement learning, multi-task online learning and multi-task
multi-view learning. For each setting, we introduce representative models. More-
over, we also present parallel and distributed multi-task models where there are a
large number of tasks or data in different tasks located in different machines. For
a more detailed survey on multi-task learning, please refer to the work by Zhang
and Yang (2017b).
According to the definition of multi-task learning, there are two basic elements.
The first element is the task relatedness. The task relatedness is defined according
to our understanding about how all the tasks are related and it can be used to
design multi-task models. The second element is the nature of the learning task. In
machine learning, learning tasks can have multiple choices, including supervised
learning tasks such as classification and regression tasks, unsupervised learning
tasks such as data clustering tasks, semi-supervised learning tasks, active learning
tasks, reinforcement learning tasks, online learning tasks and multi-view learning
tasks. Hence different learning tasks correspond to different settings in multi-task
learning. In the following sections, we will introduce different settings in multi-
task learning as well as representative models.
and $y^i_j$ is the label of $\mathbf{x}^i_j$. The goal of multi-task supervised learning is to learn $m$
functions $\{f_i(\mathbf{x})\}_{i=1}^m$ based on the training data sets of the $m$ tasks such that $f_i(\mathbf{x}^i_j)$
can approximate $y^i_j$ well. After the learning process, $f_i(\cdot)$ will be used to make
predictions on the labels of new data instances in the $i$-th task.
Figure 9.2 A multi-task feedforward neural network with one input layer, one
hidden layer and one output layer
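The architecture of Figure 9.2, a hidden layer shared by all tasks with one output head per task, can be sketched as follows. Sizes and initialization are illustrative, and no training loop is shown:

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiTaskNet:
    """Minimal forward-pass sketch of the architecture in Figure 9.2:
    one hidden layer shared by all tasks, one output head per task."""

    def __init__(self, d_in, d_hidden, n_tasks):
        self.W_shared = rng.normal(0.0, 0.1, (d_in, d_hidden))
        self.heads = [rng.normal(0.0, 0.1, (d_hidden, 1))
                      for _ in range(n_tasks)]

    def forward(self, x, task):
        h = np.tanh(x @ self.W_shared)   # representation shared by tasks
        return h @ self.heads[task]      # task-specific prediction

net = MultiTaskNet(d_in=4, d_hidden=8, n_tasks=3)
x = rng.normal(size=(5, 4))
y0, y1 = net.forward(x, 0), net.forward(x, 1)
```

Gradients from every task's loss would update `W_shared`, which is how the shared hidden layer transfers knowledge across tasks.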
$\hat{\mathbf{x}}^i_j = \mathbf{U}^T \mathbf{x}^i_j$ to construct the shared feature representation and then learn a linear
function, defined as $f_i(\mathbf{x}^i_j) = (\mathbf{a}^i)^T \hat{\mathbf{x}}^i_j + b_i$, on the shared feature representation.
The MTFL method formulates its objective function as

$\min_{\mathbf{A},\mathbf{U},\mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{a}^i)^T \mathbf{U}^T \mathbf{x}^i_j + b_i\big) + \lambda \|\mathbf{A}\|_{2,1}^2 \quad \mathrm{s.t.}\ \mathbf{U}\mathbf{U}^T = \mathbf{I},$ (9.1)

where $\mathbf{U}$ has more columns than rows and $\mathbf{A}$ is assumed to be row-sparse
based on the $\ell_{2,1}$ regularization.
The $\ell_{p,q}$ regularization makes $\mathbf{W}$ row-sparse, and hence only features useful for
all the tasks will be preserved. The $\ell_{p,q}$ regularization has several instantiations,
for example, the $\ell_{2,1}$ regularization (Obozinski et al., 2006, 2010) and the $\ell_{\infty,1}$
regularization (Liu et al., 2009b). To obtain a more compact subset of useful features
for all the tasks, Gong et al. (2013) propose a capped-$\ell_{p,1}$ penalty, that is,
$\sum_{i=1}^{d} \min(\|\mathbf{w}^i\|_p, \theta)$, which reduces to the $\ell_{p,1}$ regularization when $\theta$ is large
enough. Besides the $\ell_{p,q}$ regularization, Lozano and Swirszcz (2012) propose a
multi-level Lasso to decompose $w_{ji}$, the $(j,i)$-th entry in $\mathbf{W}$, as $w_{ji} = \theta_j \hat{w}_{ji}$, where
$w_{ji}$ will be 0 when either $\theta_j$ or $\hat{w}_{ji}$ becomes 0. So, based on the $\ell_1$ regularization
on $\theta_j$ and $\hat{w}_{ji}$, the objective function of the multi-level Lasso is formulated as

$\min_{\boldsymbol{\theta}, \hat{\mathbf{W}}, \mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{k=1}^{n_i} l\big(y^i_k, (\mathbf{w}^i)^T \mathbf{x}^i_k + b_i\big) + \lambda_1 \|\boldsymbol{\theta}\|_1 + \lambda_2 \|\hat{\mathbf{W}}\|_1 \quad \mathrm{s.t.}\ w_{ji} = \theta_j \hat{w}_{ji},\ \theta_j \ge 0.$ (9.3)
It is easy to see that a zero $\theta_j$ will filter out the $j$-th feature for all the tasks but
a zero $\hat{w}_{ji}$ can do that for the $i$-th task only, making their impacts different. Then
the multi-level Lasso is extended in the works by Wang et al. (2014) and Han et al.
(2014) to more general settings.
In the second way, Zhang et al. (2010c) give a probabilistic interpretation for
$\ell_{p,1}$-regularized multi-task feature selection methods in which the $\ell_{p,1}$ regularizer
corresponds to a generalized normal distribution prior:

$w_{ji} \sim \mathcal{GN}(0, \rho_j, p).$
Then Zhang et al. (2010c) extend this prior to the matrix-variate generalized
normal prior to learn pairwise relations among tasks. Different from that by Zhang
et al. (2010c), the horseshoe prior is adopted in the works by Hernández-Lobato
and Hernández-Lobato (2013) and Hernández-Lobato et al. (2015) to conduct
multi-task feature selection. The difference between the work of Hernández-Lobato
and Hernández-Lobato (2013) and that of Hernández-Lobato et al. (2015) is that the former
generalizes the horseshoe prior to learn the feature covariance, while the latter directly
uses the horseshoe prior.
where $\tilde{\mathbf{x}}^i_{A,j}$ and $\tilde{\mathbf{x}}^i_{B,j}$ denote the new hidden features obtained by jointly learning the two tasks.
The matrix $\boldsymbol{\alpha} = \begin{pmatrix} \alpha_{11} & \alpha_{12} \\ \alpha_{21} & \alpha_{22} \end{pmatrix}$ can be viewed as a quantitative measure of the task
relatedness of the two tasks based on the hidden features, making this method more
flexible than only sharing hidden layers among multiple tasks.
Low-Rank Approach
It is intuitive that similar tasks usually have similar model parameters and this
intuition leads to a low-rank $\mathbf{W}$. Under the assumption that the model parameters of
the $m$ tasks share a low-rank subspace, Ando and Zhang (2005) propose a
parametrization of $\mathbf{w}^i$ as $\mathbf{w}^i = \mathbf{u}^i + \Theta^T \mathbf{v}^i$, where $\Theta \in \mathbb{R}^{h \times d}$ ($h < d$) denotes the low-rank
subspace shared by all the tasks and $\mathbf{u}^i$ is a task-specific parameter vector.
Then, by placing an orthonormal constraint on $\Theta$ to remove the redundancy, the
corresponding objective function is formulated as

$\min_{\{\mathbf{u}^i, \mathbf{v}^i\}, \Theta, \mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{u}^i + \Theta^T \mathbf{v}^i)^T \mathbf{x}^i_j + b_i\big) \quad \mathrm{s.t.}\ \Theta\Theta^T = \mathbf{I}.$
Chen et al. (2009) generalize this model by adding a squared Frobenius regular-
ization on W, leading to an extended model with a convex objective function after
some relaxation.
According to optimization theory, using the trace norm of a matrix, denoted by
$\|\mathbf{W}\|_{S(1)}$, as a regularizer can lead to a low-rank matrix and hence the trace norm
regularization (Pong et al., 2010) is widely used in multi-task learning, with the
objective function typically formulated as

$\min_{\mathbf{W},\mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{w}^i)^T \mathbf{x}^i_j + b_i\big) + \lambda \|\mathbf{W}\|_{S(1)}.$ (9.5)
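Solvers for trace-norm-regularized objectives such as (9.5) are commonly built on the proximal operator of the trace norm, i.e., singular value soft-thresholding. A generic sketch, not the specific algorithm of Pong et al. (2010):

```python
import numpy as np

def prox_trace_norm(W, tau):
    """Proximal operator of tau * ||W||_S(1): soft-threshold the
    singular values, the building block of proximal-gradient solvers
    for trace-norm-regularized objectives."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

W = np.diag([3.0, 1.0, 0.2])
W_low = prox_trace_norm(W, 0.5)   # the smallest singular value is zeroed
```

Singular values below the threshold are set to zero, which is exactly how the trace norm drives the parameter matrix toward low rank.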
Han and Zhang (2016) propose a capped trace regularizer, which is defined as
$\sum_{i=1}^{\min(m,d)} \min(\mu_i(\mathbf{W}), \theta)$ with $\theta$ as a predefined hyperparameter, where $\mu_i(\mathbf{W})$
denotes the $i$-th largest singular value of $\mathbf{W}$. Minimizing the capped trace regularizer
only penalizes the small singular values of $\mathbf{W}$, leading to a matrix with a lower rank
than that obtained by the trace norm regularization.
Kang et al. (2011) extend the MTFL method to the multiple-cluster case, in
which the learning model for the tasks in each cluster is the MTFL method, with the
objective function formulated as

$\min_{\mathbf{W},\mathbf{b},\{\mathbf{Q}^i\}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{w}^i)^T \mathbf{x}^i_j + b_i\big) + \lambda \sum_{i=1}^{r} \|\mathbf{W}\mathbf{Q}^i\|_{S(1)}^2 \quad \mathrm{s.t.}\ \mathbf{Q}^i \in \{0,1\}^{m \times m}\ \forall i \in [r],\ \sum_{i=1}^{r} \mathbf{Q}^i = \mathbf{I},$

where the 0/1 diagonal matrix $\mathbf{Q}^i$ is responsible for identifying the $i$-th cluster.
To automatically determine the number of clusters, Han and Zhang (2015a)
propose a regularized objective function as

$\min_{\mathbf{W},\mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{w}^i)^T \mathbf{x}^i_j + b_i\big) + \lambda \sum_{i=1}^{m} \sum_{j>i} \|\mathbf{w}^i - \mathbf{w}^j\|_2,$ (9.7)
$\mathbf{f} \sim \mathcal{N}(\mathbf{0}, \Sigma).$

The entry in $\Sigma$ corresponding to the covariance between $f^i_j$ and $f^p_q$ is
defined as

$\sigma(f^i_j, f^p_q) = \omega_{ip}\, k(\mathbf{x}^i_j, \mathbf{x}^p_q),$

where $k(\cdot, \cdot)$ defines a kernel function and $\omega_{ip}$ denotes the covariance between
tasks $T_i$ and $T_p$. So $\Omega$, whose $(i,p)$-th entry is $\omega_{ip}$, defines the task relations in the
form of task covariances. If a Gaussian likelihood is defined on the labels based on
$\mathbf{f}$, the marginal likelihood, which has a closed form, can be used to learn $\Omega$. To
go beyond the point estimation and reduce the risk of overfitting, Zhang and Yeung
(2010b) propose a multi-task generalized $t$ process by assigning an inverse-Wishart
prior on $\Omega$ and adopting a generalized $t$ likelihood.
Zhang and Yeung (2010a, 2014) propose a multi-task relationship learning
(MTRL) model by assigning a matrix-variate normal distribution on $\mathbf{W}$ as

$\mathbf{W} \sim \mathcal{MN}(\mathbf{0}, \mathbf{I}, \Omega),$

with the corresponding objective function formulated as

$\min_{\mathbf{W},\mathbf{b},\Omega} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{w}^i)^T \mathbf{x}^i_j + b_i\big) + \lambda_1 \|\mathbf{W}\|_F^2 + \lambda_2\, \mathrm{tr}(\mathbf{W} \Omega^{-1} \mathbf{W}^T) \quad \mathrm{s.t.}\ \Omega \succeq 0,\ \mathrm{tr}(\Omega) \le 1,$ (9.9)
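A common way to optimize (9.9) is to alternate between $\mathbf{W}$ and $\Omega$; with $\mathbf{W}$ fixed, the constrained minimizer of the $\mathrm{tr}(\mathbf{W}\Omega^{-1}\mathbf{W}^T)$ term has the known closed form $\Omega = (\mathbf{W}^T\mathbf{W})^{1/2}/\mathrm{tr}\big((\mathbf{W}^T\mathbf{W})^{1/2}\big)$. A sketch of this update:

```python
import numpy as np

def update_task_covariance(W):
    """With W fixed, the minimizer of tr(W Omega^{-1} W^T) subject to
    Omega >= 0 and tr(Omega) <= 1 is
    Omega = (W^T W)^{1/2} / tr((W^T W)^{1/2});
    a sketch of the alternating update used in MTRL-style models."""
    vals, vecs = np.linalg.eigh(W.T @ W)
    sqrt_m = (vecs * np.sqrt(np.maximum(vals, 0.0))) @ vecs.T
    return sqrt_m / np.trace(sqrt_m)

# Columns are task parameter vectors: tasks 0 and 1 are proportional
# (hence strongly related), task 2 is orthogonal to them.
W = np.array([[1.0, 1.0, 0.0],
              [0.5, 0.5, 0.0],
              [0.0, 0.0, 1.0]])
Omega = update_task_covariance(W)
```

The recovered covariance assigns a clearly positive entry to the related task pair and a (numerically) zero entry to the unrelated pair, which is how MTRL learns task relations from data.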
where Ω, the task covariance matrix, encodes the task relations among tasks. The
MTRL method has been extended to multi-task boosting (Zhang and Yeung, 2012)
and multi-label learning (Zhang and Yeung, 2013b), and generalized to learn sparse
task relations in the work by Zhang and Yang (2017a). Zhang and Schneider (2010)
propose a model similar to the MTRL method by placing a prior on $\mathbf{W}$ as $\mathbf{W} \sim
\mathcal{MN}(\mathbf{0}, \Omega_1, \Omega_2)$, and the proposed method assumes that the inverse matrices of
$\Omega_1$ and $\Omega_2$ are sparse. As the prior used in the MTRL method implies that $\mathbf{W}^T\mathbf{W}$
follows a Wishart distribution $\mathcal{W}(\mathbf{0}, \Omega)$, Zhang and Yeung (2013a) generalize the
MTRL method by proposing a new prior to learn high-order task relations as
$(\mathbf{W}^T\mathbf{W})^t \sim \mathcal{W}(\mathbf{0}, \Omega)$, where $t$ is a positive integer. Lee et al. (2016) propose a regularizer
similar to that of the MTRL method by defining a parametric form of $\Omega$ as
$\Omega^{-1} = (\mathbf{I}_m - \mathbf{A})(\mathbf{I}_m - \mathbf{A})^T$, where $\mathbf{A}$ denotes the asymmetric task relations proposed in
Lee et al. (2016).
Different from these methods that focus on global learning models, Zhang (2013)
extends local learning methods such as the k-nearest-neighbor (kNN) classifier to
where $N_k(i,j)$ denotes the set of task and instance indices for the kNNs of $\mathbf{x}^i_j$, $s(\cdot,\cdot)$ denotes
the similarity between instances, $\sigma_{ip}$ defines the similarity of task $T_p$ to $T_i$,
and the learning function of the proposed multi-task kNN classifier is defined as

$f(\mathbf{x}^i_j) = \sum_{(p,q) \in N_k(i,j)} \sigma_{ip}\, s(\mathbf{x}^i_j, \mathbf{x}^p_q)\, y^p_q.$

The regularizer in (9.10) enforces $\Sigma$, the task similarity matrix encoding the task
relations, to be a symmetric matrix.
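The prediction rule above can be sketched directly; here a Gaussian kernel stands in for the generic similarity $s(\cdot,\cdot)$, which the text leaves unspecified:

```python
import numpy as np

def mt_knn_predict(x, task_i, data, sigma, k=3):
    """Sketch of the multi-task kNN rule in the text: neighbors are
    drawn from all tasks and weighted by both the instance similarity
    s(., .) and the learned task similarity sigma[i][p]."""
    def s(u, v):
        return float(np.exp(-np.sum((u - v) ** 2)))  # Gaussian kernel
    neighbors = sorted(data, key=lambda t: -s(x, t[1]))[:k]
    return sum(sigma[task_i][p] * s(x, xq) * yq for p, xq, yq in neighbors)

data = [(0, np.array([0.0]), 1.0),    # (task, instance, label)
        (0, np.array([2.0]), -1.0),
        (1, np.array([0.1]), 1.0)]
sigma = [[1.0, 0.5], [0.5, 1.0]]      # symmetric task similarity matrix
pred = mt_knn_predict(np.array([0.05]), 0, data, sigma, k=2)
```

Neighbors from other tasks still contribute, but discounted by the learned task similarity, which is what distinguishes this rule from running an ordinary kNN per task.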
Multi-level Approach
The multi-level approach assumes that the parameter matrix $\mathbf{W}$ can be decomposed
into $h$ component matrices $\{\mathbf{W}_i\}_{i=1}^h$, that is, $\mathbf{W} = \sum_{i=1}^{h} \mathbf{W}_i$, where $h$, the number
of levels, is equal to or larger than 2. The objective functions of different models
in this approach can be unified as

$\min_{\mathbf{W} \in C_W, \mathbf{b}} \sum_{i=1}^{m} \frac{1}{n_i} \sum_{j=1}^{n_i} l\big(y^i_j, (\mathbf{w}^i)^T \mathbf{x}^i_j + b_i\big) + \sum_{i=1}^{h} g_i(\mathbf{W}_i) \quad \mathrm{s.t.}\ \mathbf{W} = \sum_{i=1}^{h} \mathbf{W}_i,$ (9.11)

where $g_i(\mathbf{W}_i)$ defines the regularizer for the $i$-th component matrix and $C_W$
defines a set of constraints on $\{\mathbf{W}_i\}_{i=1}^h$. According to (9.11), the regularizers of the
different component matrices are decomposable and the regularizers for different
component matrices can differ.
Seven methods in this approach are introduced, that is, those by Jalali et al.
(2010), Chen et al. (2010a, 2011), Gong et al. (2012b), Zweig and Weinshall (2013)
and Han and Zhang (2015a, 2015b), and the corresponding choices of h, {g i (·)}
and C W are shown in Table 9.1. According to Table 9.1, the first four methods
have two component matrices while the last three ones can have two or more
component matrices. The choice of $\{g_i(\cdot)\}$ varies among the different methods. For
example, based on the $\ell_{\infty,1}$ and $\ell_{2,1}$ norms, the $g_1(\cdot)$'s in the works by Jalali et al.
(2010) and Gong et al. (2012b) enforce $\mathbf{W}_1$ to be row-sparse. Different from them,
the $g_1(\cdot)$'s proposed by Chen et al. (2010a, 2011) make $\mathbf{W}_1$ low rank by treating the
trace norm as the regularizer and constraint, respectively. For $\mathbf{W}_2$, the $g_2(\cdot)$'s
proposed in the works by Jalali et al. (2010) and Chen et al. (2010a) enforce it to be
sparse, while in Chen et al. (2011) and Gong et al. (2012b) it is enforced to be
column-sparse to capture outlier tasks. Zweig and Weinshall (2013) assume that
each component matrix is jointly sparse and row-sparse in different proportions
related to the number of levels. In the work by Han and Zhang (2015a), a multi-level
task clustering method clusters all the tasks at each level via a fused-Lasso-style
regularizer, which operates on vectors instead of scalars as in the fused
Lasso. By adopting the same regularizer as that in the work by Han and Zhang
(2015a), Han and Zhang (2015b) aim to learn the hierarchical structure among
tasks based on a sequential constraint $S_W$ defined in Table 9.1.
Table 9.1 Choices of g_i(·) for different methods in the multi-level approach, where
{λ_1, λ_2, λ, η} are regularization parameters, w_i^j denotes the j-th column in W_i, ∅
denotes an empty set and S_W = {W : ‖w_{i−1}^j − w_{i−1}^k‖ ≥ ‖w_i^j − w_i^k‖ ∀i ≥ 2, ∀k > j}

Reference                    h    {g_i(·)}                                                          C_W
Jalali et al. (2010)         2    g_1(W_1) = λ_1 ‖W_1‖_{∞,1};  g_2(W_2) = λ_2 ‖W_2‖_1               ∅
Chen et al. (2010a)          2    g_1(W_1) = 0 if ‖W_1‖_{S(1)} ≤ λ_1, +∞ otherwise;
                                  g_2(W_2) = λ_2 ‖W_2‖_1                                            ∅
Chen et al. (2011)           2    g_1(W_1) = λ_1 ‖W_1‖_{S(1)};  g_2(W_2) = λ_2 ‖W_2^T‖_{2,1}        ∅
Gong et al. (2012b)          2    g_1(W_1) = λ_1 ‖W_1‖_{2,1};  g_2(W_2) = λ_2 ‖W_2^T‖_{2,1}         ∅
Zweig and Weinshall (2013)   ≥2   g_i(W_i) = λ((h−i)/(h−1)) ‖W_i‖_{2,1} + λ((i−1)/(h−1)) ‖W_i‖_1    ∅
Han and Zhang (2015a)        ≥2   g_i(W_i) = (λ/η^{i−1}) Σ_{k>j} ‖w_i^j − w_i^k‖_2                  ∅
Han and Zhang (2015b)        ≥2   g_i(W_i) = (λ/η^{i−1}) Σ_{k>j} ‖w_i^j − w_i^k‖_2                  S_W
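To make the notation in Table 9.1 concrete, the small numpy sketch below evaluates a few of the regularizers (a minimal illustration; the matrix W and the parameter values are made up):

```python
import numpy as np

def linf1_norm(W):
    # l_{inf,1} norm as in Jalali et al. (2010): sum over rows of the
    # largest absolute entry in each row (row-sparsity inducing)
    return np.abs(W).max(axis=1).sum()

def l21_norm(W):
    # l_{2,1} norm as in Gong et al. (2012b): sum of row-wise l2 norms
    return np.linalg.norm(W, axis=1).sum()

def trace_norm(W):
    # trace norm ||W||_{S(1)}: sum of singular values (low-rank inducing)
    return np.linalg.svd(W, compute_uv=False).sum()

def fused_multilevel_reg(W, lam, eta, i):
    # Han and Zhang (2015a)-style level-i term:
    # (lam / eta^{i-1}) * sum_{k > j} ||w^j - w^k||_2 over column pairs
    m = W.shape[1]
    total = sum(np.linalg.norm(W[:, j] - W[:, k])
                for j in range(m) for k in range(j + 1, m))
    return lam / eta ** (i - 1) * total

W = np.array([[1.0, -2.0],
              [0.0, 0.0],
              [3.0, 4.0]])
print(linf1_norm(W))   # 2 + 0 + 4 = 6
print(l21_norm(W))     # sqrt(5) + 0 + 5
print(fused_multilevel_reg(W, 1.0, 2.0, 2))
```
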
There are some models for multi-task unsupervised learning. For example,
Zhang (2015a) proposes two multi-task clustering methods, which extend the
MTFL model (Argyriou et al., 2006) and the MTRL method (Zhang and Yeung,
2010a) to clustering problems by treating labels as unknown cluster indicators to
be learned from data.
Reinforcement learning aims to learn how to take actions to maximize the cu-
mulative reward in an environment. It has proved to be effective in many applica-
tions such as game playing and robotics. Given similar environments in different
reinforcement learning tasks, it has been found that learning multiple reinforce-
ment learning tasks together can have better performance than learning them in-
dividually, which leads to multi-task reinforcement learning.
There are some models for multi-task reinforcement learning. For example,
Wilson et al. (2007) model each reinforcement learning task by a Markov deci-
sion process (MDP), while MDPs in all the tasks are clustered via a hierarchical
Bayesian infinite mixture model. Li et al. (2009c) use a Dirichlet process to cluster
tasks, each of which is learned via a regionalized policy. Lazaric and Ghavamzadeh
(2010) use a Gaussian process temporal-difference value function model for each
task and adopt a hierarchical Bayesian model to relate value functions in differ-
ent tasks. By assuming that value functions in all the tasks share sparse param-
eters, Calandriello et al. (2014) learn all the value functions together by adapting
the multi-task feature selection method with the ℓ2,1 regularization (Obozinski
et al., 2006) and the MTFL method (Argyriou et al., 2006), respectively. Parisotto
et al. (2016) propose an actor-mimic method to learn policy networks for mul-
tiple tasks by combining deep reinforcement learning and model compression
techniques.
When training data in multiple tasks arrive sequentially, multi-task online learn-
ing can handle them, while conventional multi-task models cannot.
There are some models for multi-task online learning. For example, by assum-
ing that different tasks share a common goal, Dekel et al. (2006, 2007) propose the
use of absolute norms as a global loss function, which combines the loss of each
task together, to measure the relations among tasks. Lugosi et al. (2009) enforce
constraints on actions for all the tasks to model task relations. Cavallanti et al.
(2010) propose a perceptron-based multi-task online learning model by measur-
ing task relations based on the geometric structure shared among tasks. Pillonetto
et al. (2010) propose a multi-task Gaussian process with a Bayesian online algo-
rithm to share kernel parameters among tasks. Saha et al. (2011) propose an online
algorithm, which updates model parameters and task covariance together, for the
MTRL method (Zhang and Yeung, 2010a).
10.1 Introduction
all the data points in all the tasks and then learns a linear classifier based on the
transformed features. The learner considered here is similar to that in the work by
Ando and Zhang (2005) but without the task-specific part u_i in (9.4). With the use
of the Rademacher complexity, a generalization bound is derived as

E ≤ Ê + O(1/√(m n_0)),    (10.1)
where E denotes the average of the generalization errors of all the tasks, Ê denotes
the average of the training errors of all the tasks, m denotes the number of tasks and n_0
denotes the average number of training samples over all the tasks. The second term
on the right-hand side of bound (10.1) shows that the capacity of the linear
multi-task learner is upper-bounded by the Frobenius norm of the task-averaged
covariance matrix, assuming that the Frobenius norm of the feature transformation
matrix is no smaller than 1. Different from previous bounds, the derived
generalization bound here is data-dependent, implying that this bound can be es-
timated from training data due to the data-dependent nature of the Rademacher
complexity.
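Since the bound is data-dependent, its complexity term can be estimated from a sample. The sketch below illustrates this for a simple assumed class (linear predictors with a bounded ℓ2 norm, not the exact learner analyzed above), estimating the empirical Rademacher complexity by Monte Carlo over random sign vectors:

```python
import numpy as np

def empirical_rademacher_linear(X, B=1.0, n_draws=2000, seed=0):
    # Monte Carlo estimate of the empirical Rademacher complexity of the class
    # {x -> <w, x> : ||w||_2 <= B} on a sample X (n x d). For this class the
    # supremum over w has the closed form (B / n) * ||sum_i s_i x_i||_2.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    vals = [np.linalg.norm(rng.choice([-1.0, 1.0], size=n) @ X) * B / n
            for _ in range(n_draws)]
    return float(np.mean(vals))

rng = np.random.default_rng(1)
r_small = empirical_rademacher_linear(rng.normal(size=(50, 5)))
r_large = empirical_rademacher_linear(rng.normal(size=(800, 5)))
print(r_small, r_large)  # the estimate shrinks as the sample grows
```
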
In the work by Juba (2006), the Kolmogorov complexity in information theory is
extended to multi-task learning to give uniform bounds that measure the difference
between the empirical loss and the generalization loss of different hypotheses pro-
vided by deterministic learning algorithms on independent samples drawn from
a set of unknown computable distributions over tasks.
In the work by Maurer (2006b), two classes of multi-task algorithms are ana-
lyzed in terms of the generalization bound based on the Rademacher complex-
ity. The first class to be analyzed include graph-regularized multi-task algorithms
with those in the works by Evgeniou and Pontil (2004) and Evgeniou et al. (2005)
as representative ones. In these algorithms, a graph G is used as a priori knowl-
edge to describe similarities between any pair of tasks and, based on this graph, a
regularizer is devised to encode such similarities to enforce similar tasks to have
similar model parameters. Based on the generalization bound for this class of al-
gorithms, their capacities are upper-bounded by tr(G−1 ). The second class of
multi-task algorithms to be analyzed includes the Schatten norm regularization
‖W‖_{S(p)} (1 ≤ p ≤ 4/3), with the trace norm regularization (Pong et al., 2010) as a spe-
cial case. According to the analysis by Maurer (2006b), the capacity of the cor-
responding multi-task learner is upper-bounded by the Schatten-(q/2) norm of the
average data covariance over tasks, where q satisfies 1/p + 1/q = 1.
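As a quick reference for the notation, a Schatten p-norm is simply the ℓp norm of the vector of singular values; a minimal numpy sketch (the matrix is made up):

```python
import numpy as np

def schatten_norm(W, p):
    # Schatten p-norm: the l_p norm of the singular values of W
    s = np.linalg.svd(W, compute_uv=False)
    return float((s ** p).sum() ** (1.0 / p))

W = np.diag([3.0, 4.0])
print(schatten_norm(W, 1.0))  # p = 1: trace norm, 3 + 4 = 7
print(schatten_norm(W, 2.0))  # p = 2: Frobenius norm, 5
```

For a diagonal matrix the singular values are just the absolute diagonal entries, which makes the two printed values easy to check by hand.
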
In the work by Kakade et al. (2012), some matrix regularizers, including the
squared Schatten norm regularization and squared group sparse regularization,
are proved to be strongly convex with respect to the trace norm and ℓ2,1 norm, re-
spectively. Then, based on a widely used inequality in online learning (see Corol-
lary 4 in the work by Kakade et al. (2012)) and such strong convexity of matrix
regularizers, a generalization bound is derived via the Rademacher complexity.
In the work by Crammer and Mansour (2012), a task clustering method is proposed
to iteratively learn the model parameters for tasks in a task cluster and iden-
tify the cluster structure based on the training loss in a k-means style. Moreover,
the lower and upper bounds of its VC dimension are analyzed in order to derive
the generalization bound, which shows that, when the logarithm of the number of
clusters is lower than the number of samples per task and the number of clusters
is much smaller than the total number of tasks, multi-task learning is significantly
better than single-task learning in terms of the order of the complexity in the gen-
eralization bound.
In the work by Maurer et al. (2013), a generalization bound is presented for (9.3)
where a dictionary is shared by all the tasks and coefficients in linear functions are
task-specific. With the use of the Rademacher complexity, a generalization bound
is derived to show that the capacity of multi-task sparse coding presented in (9.3)
is upper-bounded with respect to the sum of both the average trace norm and
spectral norm of data covariances over tasks. This model is extended to the trans-
fer learning setting where the dictionary learned in source tasks will be used to the
target task without learning it again and a similar generalization bound is also de-
rived, showing again that both the average trace norm and spectral norm of data
covariances in the target task affect the capacity of the target learner.
In the work by Pontil and Maurer (2013), the trace norm regularization in multi-
task learning is analyzed. With recent advances on tail bounds for sums of ran-
dom matrices and the Rademacher complexity, a dimension-independent bound
is presented to analyze the generalization bound where the capacity is upper-
bounded by the spectral norm of the average data covariance over tasks. Compared
with the works by Maurer (2006b) and Kakade et al. (2012), which can also
analyze the trace norm regularization, the bound presented by Pontil and Maurer
(2013) is tighter in terms of the orders on both the number of tasks and the num-
ber of data points per task.
In the work by Zhang (2015b), a multi-task extension of algorithmic stability is
proposed; it extends the conventional algorithmic stability in that
the sensitivity of a multi-task learner is tested when a data point is removed from
the training set of each task in turn. In order to accommodate the newly
defined multi-task algorithmic stability, a generalized McDiarmid’s inequality is
proved to allow more than one input argument of a function under investigation
to be changed instead of only one as in the conventional McDiarmid's inequality. Then,
with these new tools, a generalization bound is derived for general multi-task
learning. This general bound is then applied to analyze the task relation learning
approach (e.g., (9.9) with a fixed Ω), the trace norm regularization and the dirty approach
(e.g., Chen et al. (2010a) with a trace norm regularizer instead of a constraint).
In the work by Pentina and Ben-David (2015), the problem of learning the ker-
nel function for support vector machines is studied under the multi-task and life-
long scenarios and some generalization bounds are presented to bound its gen-
eralization performance. The analyses show that, under mild conditions on the
family of kernels used for learning, learning-related tasks simultaneously in multi-
task learning are beneficial over single-task learning. Specifically, when the number
of tasks increases, with an assumption that there exists a kernel function,
which can achieve low approximation error on all the tasks, in the considered fam-
ily of kernel functions, then the overhead for learning such a kernel vanishes and
the corresponding complexity converges to that of the learner, which uses this
good kernel function.
In the work by Maurer et al. (2016), multi-task representation learning, which
learns a common representation for all the tasks, is analyzed and it accommo-
dates the multi-task feature learning (i.e., (9.1)) and (deep) neural networks. With
the use of the Gaussian complexity, which plays a role similar to the Rademacher
complexity, a generalization bound is presented to reveal that the capacity of such
a learner depends on the complexity of the shared feature representation in terms
of the ℓ2 norm. Moreover, a similar bound is presented for the transfer learning
setting. One benefit of using the Gaussian complexity is that it can handle composite
functions, which makes it possible to analyze deep neural networks.
Besides analyzing generalization bounds, there are some other issues that have
been analyzed. For example, in the works by Lounici et al. (2009), Obozinski et al.
(2011) and Kolar et al. (2011), the oracle properties of the group sparsity in multi-
task learning are studied to reveal under which conditions the group Lasso can
identify features that can actually help the prediction of labels. In the works by
Argyriou et al. (2009, 2010), sufficient and necessary conditions are investigated
for the validity of the representer theorem in regularized multi-task methods. In
the work by Solnon et al. (2012), the covariance matrix of the noise among mul-
tiple kernel ridge regressors adopted in multi-task learning is estimated based on
the concept of minimal penalty; in a non-asymptotic setting, this estimator
converges toward the true covariance matrix.
tween tasks to measure the amount of information one task contains about an-
other. The analysis can neatly measure task relatedness and determine the trans-
fer of the “right” amount of information in a Bayesian setting. In a very formal
and precise sense, the analyses suggest that no other reasonable transfer method
can do much better than the proposed Kolmogorov complexity theoretic transfer
method. A practical approximation to the proposed method is devised to transfer
information between tasks.
In the work by Maurer (2009), by assuming that all the tasks are sampled from an
environment as Baxter (2000) did, the Rademacher complexity is used to analyze
transfer learning and the analysis shows that the spectral norm of the average
data covariance upper-bounds the model capacity, which is similar to the con-
clusion made for multi-task learning by Maurer et al. (2013). Moreover, the analy-
sis presented in this work explains the situations under which transfer learning is
preferable to single-task learning. That is, the source tasks should be related to the
target task, the input distribution needs to be high-dimensional and the number
of source tasks should be larger than the data dimension and the number of data
per task.
In the work by Yang et al. (2013), the authors explore a transfer learning setting,
in which tasks are sampled independently with an unknown distribution from a
known family. The analysis studies how many labeled examples are required to
achieve an arbitrary specified expected accuracy by focusing on the asymptotics
in the number of tasks. The analysis can help understand the fundamental benefits
of transfer learning by comparing it with single-task learning. The proposed analysis
method is so general that it can be applied to other learning protocols, such as
the combination of transfer learning and self-verifying active learning. Under this
setting, the authors find that the number of labeled examples required is significantly
smaller than that required for single-task learning.
In the work by Kuzborskij and Orabona (2013), a hypothesis transfer learning
scenario is studied, where the target learner can only access source learners but
not source data directly. Specifically, a theoretical analysis based on the algorith-
mic stability is conducted to analyze a class of hypothesis transfer learning al-
gorithms, that is, regularized least square regression with a biased regularization
to the source learner. Based on the analysis, the relatedness of source and target
tasks is found to accelerate the convergence of the leave-one-out error to the gen-
eralization error, which can inspire the use of the leave-one-out error to find the
optimal target learner, even when the target domain is associated with a small
training set. When the source domain is unrelated to the target domain, the anal-
ysis gives a theoretically principled way to prevent negative transfer such that the
transfer learning method can reduce to the single-task model, an ideal solution in
such a situation.
In the work by Pentina and Lampert (2014), built on the concept of the environment
in the work by Baxter (2000), lifelong learning is analyzed from the PAC-
Bayesian perspective by presenting a PAC-Bayesian generalization bound to of-
fer a unified view on existing transfer learning paradigms such as the transfer of
mize only the target error and consider the same weighting for source and target
errors.
Different from most existing algorithms that first determine the data distribu-
tions in two domains and then make appropriate corrections based on the esti-
mated distributions, in the work by Huang et al. (2006) a nonparametric method
is proposed to estimate the ratio of two data distributions without distribution
estimation. Built on kernel methods, the proposed method matches the two dis-
tributions in two domains via the mean value, which is referred to as the kernel
mean matching method.
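A minimal sketch of the mean-matching idea, under simplifying assumptions (a linear kernel and plain projected gradient descent rather than the quadratic program used in the original method; the data here are made up):

```python
import numpy as np

def kmm_linear(Xs, Xt, n_iter=2000, lr=10.0):
    # find non-negative weights beta so that the weighted source mean matches
    # the target mean, by projected gradient descent on
    #   J(beta) = || Xs^T beta / n_s - mean(Xt) ||_2^2
    ns = Xs.shape[0]
    mu_t = Xt.mean(axis=0)
    beta = np.ones(ns)
    for _ in range(n_iter):
        diff = Xs.T @ beta / ns - mu_t
        beta = np.maximum(beta - lr * (2.0 / ns) * (Xs @ diff), 0.0)
    return beta

rng = np.random.default_rng(0)
Xs = rng.normal(0.0, 1.0, size=(300, 2))   # source sample
Xt = rng.normal(1.0, 1.0, size=(300, 2))   # shifted target sample
beta = kmm_linear(Xs, Xt)
print(Xs.T @ beta / len(Xs), Xt.mean(axis=0))  # the two means should be close
```

The returned weights can then be used, for example, to reweight the source losses when training a target model.
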
In the work by Mansour et al. (2008), a theoretical analysis is presented for the
problem of unsupervised transfer learning with multiple source tasks. For each
source task, the distribution over the data as well as a hypothesis with error at
most ε are given. In order to learn a good learner with a small error for the target
task, combining these hypotheses is a good strategy. First, standard convex com-
binations of the source learners are proved to possibly perform very poorly.
However, the analysis shows that there are theoretical guarantees for combina-
tions weighted by the source distributions. The main result shows that, for any
fixed target learner, there exists a distribution weighted combination that has an
error of at most ε with respect to any mixture of source distributions. This setting
is then generalized from a single target learner to multiple consistent target learn-
ers and the analysis shows that there exists a distribution weighted combination
with an error of at most 3ε.
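The distribution weighted combination has the form h_λ(x) = Σ_i λ_i D_i(x) h_i(x) / Σ_j λ_j D_j(x): each source hypothesis is weighted by its (mixture-weighted) density at x. The sketch below illustrates this rule with made-up Gaussian densities and linear hypotheses for two sources:

```python
import numpy as np

# hypothetical densities and hypotheses for two source domains
def density1(x): return np.exp(-0.5 * (x - 0.0) ** 2) / np.sqrt(2 * np.pi)
def density2(x): return np.exp(-0.5 * (x - 3.0) ** 2) / np.sqrt(2 * np.pi)
def h1(x): return 0.2 * x          # source hypothesis 1
def h2(x): return 0.2 * x + 0.1    # source hypothesis 2

def weighted_combination(x, lam=(0.5, 0.5)):
    # weight each source prediction by its weighted density at x
    num = lam[0] * density1(x) * h1(x) + lam[1] * density2(x) * h2(x)
    den = lam[0] * density1(x) + lam[1] * density2(x)
    return num / den

print(weighted_combination(0.0))   # dominated by source 1 -> close to h1(0) = 0
print(weighted_combination(3.0))   # dominated by source 2 -> close to h2(3) = 0.7
```
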
As a generalization of a previous work (Ben-David et al., 2006), in the work by
Mansour et al. (2009), a novel distance between distributions, discrepancy dis-
tance, is introduced and it is suitable for unsupervised transfer learning problems
with any loss function. Bounds based on the Rademacher complexity are pro-
posed to estimate the discrepancy distance from finite data samples for different
loss functions. Based on this distance, new generalization bounds for unsuper-
vised transfer learning are derived for a wide family of loss functions. Based on
these bounds and the empirical estimation of the discrepancy distance, a series
of novel bounds are presented for large classes of regularized algorithms, includ-
ing support vector machines and kernel ridge regression. These bounds motivate
the proposal of several unsupervised transfer learning algorithms to minimize the
empirical estimation of the discrepancy distance for various loss functions.
In the work by Cortes et al. (2010), an analysis of importance weighting is pre-
sented to learn from finite samples and a series of theoretical and algorithmic re-
sults are given. First, this work shows some simple cases in which importance
weighting can perform badly, which suggests the importance of analyzing this
technique. Then, both upper and lower bounds on the generalization performance
for bounded importance weights are presented. More importantly, learning guar-
antees are given for the more common case where importance weights are unbounded but
their second moment is bounded. The assumption that the second mo-
ment is bounded is related to the Rényi divergence between the data distributions
in the two domains. These bounds are then used to design an alternative reweighting
algorithm.
11.1 Introduction
In this chapter, we study a new type of transfer learning problem where there
is a very large gap between the source and target domains, making most tradi-
tional transfer learning solutions invalid. For example, as shown in Figure 11.1,
the source domain is to classify objects among images collected from the Web,
but the target domain is to predict the poverty level of an area from its satellite
images. This is a problem faced by researchers at Stanford University studying
how to predict the poverty levels of African regions based on satellite images in
order to support UN aid efforts (Jean et al., 2016). These tasks are concep-
tually distant and hence the knowledge learned in the source domain cannot di-
rectly be used in the target domain. This problem is difficult for transfer learning
algorithms because there may not be direct linkage between source and target
domains. However, we human beings are naturally capable of making indirect in-
ference and learning via transitivity (Bryant and Trabasso, 1971). This ability helps
humans connect many concepts and transfer the knowledge between two seem-
ingly unrelated concepts. A typical methodology adopted by human learning is
to introduce a few intermediate concepts as a bridge to connect these concepts.
For example, a student who has solid mathematical knowledge may find it hard
to understand theoretical computer science. However, if the student has taken
some elementary computer science courses, then the elementary computer sci-
ence concepts can act as a bridge between the mathematical knowledge and the-
oretical computer science courses. The elementary computer science concepts
hereby serve as mappings between mathematical theories and deep computer
science concepts, and can be considered as an intermediate domain.
The ability for humans to conduct transitive inference and learning inspires a
novel learning paradigm known as transitive transfer learning (TTL). As illustrated
in Figure 11.2, in TTL, the source and target domains have few common factors,
but they can be connected by one or more intermediate domains through some
shared factors. For example, in the poverty prediction problem described in Fig-
ure 11.1, the knowledge such as the high-level representations of images learned
Figure 11.1 An illustration of a TTL problem. The source and target domains have
distant concepts: the source domain is object recognition while the target domain
is to predict the poverty level from satellite images, which are from Google Maps.
from the object recognition task cannot be transferred to the target task directly,
as satellite images in the target domain are captured from an aerial view. Jean et al.
(2016) introduce an intermediate domain that is the nighttime light intensity in-
formation of cities, and use this information as a bridge to connect the knowledge
on object detection and poverty-level prediction. They transfer the knowledge
from object recognition tasks to help learn a model that predicts nighttime light
intensities from daytime images and then predict the poverty levels based on the
light intensities. The knowledge of object recognition tasks can help identify hills,
rivers, roads and buildings, which are highly relevant to the light intensities of a
city. The light intensities are key factors in estimating the poverty level. In text sen-
timent classification problems, knowledge in a corpus of book reviews can hardly
be transferred to music reviews because the words used in the two domains are
quite different and they follow very different word distributions. Using TTL, Tan
et al. (2015) introduce a set of movie reviews as a bridge. Reviews on movies share
some words with book reviews, and, at the same time, share some words with reviews
on the background music of the movies. The movie reviews can be crawled from
online movie websites. Thus, the movie reviews help build connections between
the book and music domains. This forms a transitive knowledge-transfer structure
to obtain a more versatile sentiment classification system.
In general, the TTL paradigm is important to extend the ability of transfer learn-
ing as it is able to transfer knowledge between domains with a huge distribution
gap. This helps in reusing as much previous knowledge as possible. Overall, tradi-
tional machine learning uses knowledge learned with data from the same domain
and transfer learning borrows knowledge from similar domains, while TTL pushes
the transfer learning boundary even further to allow connection with distant do-
mains.
There are two major research issues in designing the TTL paradigm. The first
one is how to select appropriate intermediate domain data that serve as the con-
necting bridges between distant domains. The second issue is how to transfer
knowledge effectively among transitively connected domains. In this chapter, we
will introduce three different learning algorithms under the TTL paradigm. In par-
ticular, in Section 11.2, we manually select an intermediate domain and trans-
fer the knowledge by using random walk with restarts. In Section 11.3, we select
the intermediate domain data by distribution measurements, such as the Kullback–
Leibler divergence and the A-distance in the work by Blitzer et al. (2007a), and trans-
fer the knowledge via matrix factorization. In Section 11.4, we use deep learning to
simultaneously conduct intermediate domain data selection and knowledge transfer
via deep neural networks.
Figure 11.3 An example of text-to-image TTL with annotated images as the inter-
mediate domain data. Images are from Wikipedia and Flickr.
illustrated in Figure 11.3, where the words “eos,” “400d” and “business” are irrel-
evant to the images about architecture. These irrelevant data might degrade the
accuracy of the cross-domain connection if we use them to build the knowledge
transfer bridge directly. Besides irrelevant noise, thousands of tags are used when
people annotate pictures from one particular category, and the vocabularies in the tags
are very different from those in formal documents. For example, articles from
Wikipedia have different writing styles under different contexts. Hence, only a few
image tags are useful for knowledge transfer, but they may be well hidden. To find
an unbiased and clean channel for knowledge transfer, Tan et al. (2014) propose a
mixed-transfer algorithm, which is able to transfer knowledge across domains ef-
fectively even with noisy co-occurrence data. Their algorithm determines which
source instances and which features are actually helping the knowledge transfer.
The mixed-transfer algorithm models the relationship between the source and
target domains as a joint transition probability graph of mixed instances and fea-
tures, which is illustrated in Figure 11.4. In the graph, we have two types of nodes:
the square nodes indicate the instances (e.g., documents and images) and the cir-
cles represent the features (e.g., words in the documents and texture in the im-
ages). The transition probabilities between two cross-domain features are con-
structed from the co-occurrence data and measured by a cross-domain harmonic
function, which is robust to irrelevant data.
In this graph-based algorithm, the label propagation process is simulated as a
random walk with restarts. The advantage is that we can transfer knowledge with
the help of all the instances and features globally and simultaneously. From the
structure of the graph, we can see that the feature nodes play the role of hubs for
information transmission within a domain and across domains. The label prop-
agation process continues until it converges. During this process, some features
have high probabilities of being visited. These are the features that carry the most
label knowledge and can be automatically detected by the random-walk process.
When the label propagation converges to a fixed point, the weights on instance
nodes indicate the label preference and can be used to build a model for making
predictions.
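The propagation described above can be sketched as a standard random walk with restarts; the tiny graph, restart probability and update rule below are illustrative assumptions, not the exact model of Tan et al. (2014):

```python
import numpy as np

def rwr_label_propagation(P, v0, alpha=0.2, tol=1e-10, max_iter=1000):
    # iterate v(t+1) = (1 - alpha) * P^T v(t) + alpha * v0, where P is a
    # row-stochastic transition matrix over instance and feature nodes and
    # v0 puts the initial (label) mass on the labeled instance nodes
    v = v0.copy()
    for _ in range(max_iter):
        v_new = (1 - alpha) * P.T @ v + alpha * v0
        if np.abs(v_new - v).max() < tol:
            return v_new
        v = v_new
    return v

# tiny made-up graph: nodes 0 and 1 are instances, node 2 is a feature "hub"
P = np.array([[0.0, 0.0, 1.0],
              [0.0, 0.0, 1.0],
              [0.5, 0.5, 0.0]])
v0 = np.array([1.0, 0.0, 0.0])   # restart at the labeled instance 0
v = rwr_label_propagation(P, v0)
print(v)  # instance 1 receives mass through the shared feature hub
```

At the fixed point, the mass on unlabeled instance nodes indicates the label preference, as described in the text.
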
Figure 11.4 An illustration of the joint transition probability graph of mixed in-
stances and features. In the graph, there are two types of nodes with the square
nodes indicating the instances (e.g., documents and images) and the circles rep-
resenting the features (e.g., words in the documents and texture in the images).
The transition probabilities between two cross-domain features are constructed
from the co-occurrence data
where X contains all the labeled data, and y are the labels, L(·) is the loss function
and R(·) indicates the relationship between the classifier and the unlabeled data
given the co-occurrence data O .
where X̃_k^t denotes the set of target instances, each of which has a positive value
for the k-th feature, Φ(x̃_i^t, x̃_j^t) = exp(−‖x̃_i^t − x̃_j^t‖²/(2σ_t²)), and |X̃_k^t| is the cardinality of X̃_k^t.
A larger r_k^{(s,t)} indicates that the k-th feature has higher relevance to the target do-
main. For instance, a set of annotated images that share a tag should be similar to
each other. Otherwise, the tag is irrelevant to these images.
On the set {X̃_k^s, X̃_l^t}, we calculate the similarity between the k-th feature in the
source domain and the l-th feature in the target domain with the correlation coeffi-
cient (Mitra et al., 2002):

s_{k,l}^{(s,t)} = 1 − |cov(f_k^s, f_l^t)| / √(var(f_k^s) × var(f_l^t)),    (11.3)

where f_k^s and f_l^t are the feature vectors from X̃_k^s and X̃_l^t, respectively, var(·) is the
variance of a variable and cov(·, ·) is the covariance between two variables.
Combining these two criteria, we obtain the final similarity a_{k,l}^{(s,t)}:

a_{k,l}^{(s,t)} = γ_k^{(s,t)} × s_{k,l}^{(s,t)}.    (11.4)

Finally, we can construct the feature similarity matrix A^{(s,t)} between two do-
mains, where the (k, l)-th element of A^{(s,t)} is equal to a_{k,l}^{(s,t)}, and then we have
a_{k,l}^{(s,t)} = a_{l,k}^{(t,s)}, that is, A^{(s,t)} is the transpose of A^{(t,s)}.
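A small numpy sketch of the similarity computation in (11.3) and its scaling in (11.4); the feature vectors and the relevance weight γ are made up:

```python
import numpy as np

def feature_similarity(f_s, f_t):
    # the correlation-based quantity of (11.3):
    # s = 1 - |cov(f_s, f_t)| / sqrt(var(f_s) * var(f_t))
    c = np.cov(f_s, f_t)
    return 1.0 - abs(c[0, 1]) / np.sqrt(c[0, 0] * c[1, 1])

f_s = np.array([1.0, 2.0, 3.0, 4.0])   # source feature over co-occurrence data
f_t = np.array([2.0, 4.0, 6.0, 8.0])   # perfectly correlated target feature
s = feature_similarity(f_s, f_t)
print(s)  # -> 0.0 for perfectly correlated features

# (11.4) then scales this by the relevance weight gamma (hypothetical value)
gamma = 0.7
print(gamma * s)
```
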
Graph Construction
For the i-th domain, where i ∈ {s, t}, we have an n_i-by-m_i matrix A^{(i,i)}, with its
(k, l)-th entry given by the value of the l-th feature of the k-th instance. It is clear
that A^{(i,i)} is a matrix with all the entries being non-negative. We also have the cross-
domain feature similarity matrix A^{(s,t)}.
In order to perform label propagation on this mixed graph, we have to con-
struct a joint transition probability graph. In other words, we have to normalize
the weights of the edges so that they are probability values.
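The normalization step can be sketched as a row-stochastic rescaling of a non-negative weight matrix (a minimal illustration; the weight block below is made up):

```python
import numpy as np

def row_normalize(M):
    # turn a non-negative weight matrix into a row-stochastic transition
    # matrix; all-zero rows are left as zero rows
    s = M.sum(axis=1, keepdims=True)
    return np.divide(M, s, out=np.zeros_like(M, dtype=float), where=s > 0)

# hypothetical edge weights between two instances and three features
A = np.array([[2.0, 0.0, 2.0],
              [0.0, 1.0, 3.0]])
P = row_normalize(A)
print(P)  # each non-zero row now sums to 1
```
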
Mixed-Transfer Algorithm
In the mixed-transfer algorithm over the mixed graphs, we perform a random
walk that starts from the nodes corresponding to labeled instances. The 'walker' moves
to neighboring nodes by traversing edges according to the joint transition proba-
bility graph, or stays at the same node with probability α. The corresponding
model is formulated as
and

V^{(i)}(t + 1) = λ_{i,i} Q^{(i,i)} R^{(i)}(t + 1) + Σ_{j=1, j≠i}^{2} λ_{i,j} F^{(i,j)} V^{(j)}(t),  i = 1, 2,    (11.6)

where l_d^{(i)} is the number of labeled instances belonging to the d-th class and D^{(i)}
is an n_i-by-c matrix whose (k, d)-th element d_{k,d}^{(i)} equals 1/l_d^{(i)} if and only if the
k-th instance is labeled and belongs to the d-th class, and equals 0 otherwise.
label information in the source domain is propagated to the intermediate and tar-
get data on the selected common subspaces.
treated as positive and the target data are assigned to be negative. After learning
h, we can estimate the error(h | D_i, D_j) in the A-distance.
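The procedure above can be sketched with the usual proxy: train a domain classifier h and plug its error into d_A = 2(1 − 2·error). The sketch below uses a simple gradient-descent logistic regression as the hypothesis class (an illustrative choice; the data are made up):

```python
import numpy as np

def proxy_a_distance(Xi, Xj, n_iter=500, lr=0.5):
    # label domain i positive and domain j negative, fit a linear logistic
    # classifier h by gradient descent, then use its training error in
    # d_A = 2 * (1 - 2 * error(h | D_i, D_j))
    X = np.vstack([Xi, Xj])
    X = np.hstack([X, np.ones((X.shape[0], 1))])      # bias column
    y = np.r_[np.ones(len(Xi)), np.zeros(len(Xj))]
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-np.clip(X @ w, -30, 30)))
        w -= lr * X.T @ (p - y) / len(y)
    err = np.mean((X @ w > 0) != (y == 1))
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
near = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(0, 1, (200, 2)))
far = proxy_a_distance(rng.normal(0, 1, (200, 2)), rng.normal(4, 1, (200, 2)))
print(near, far)  # well-separated domains yield a larger estimate
```
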
Given a triple t = {S, D, T}, we can extract six features, as described in Table 11.1.
The first three features summarize individual in-domain characteristics and the
other three features capture the pairwise cross-domain distances. These features
together affect the success probability of a transfer learning algorithm. However, it
is impossible to design a universal domain selection criterion, as different problems
may have different preferences (weights) on these features. To model the success
probability of the introduced intermediate domain, the following logistic function
is used:
f(t) = δ(β_0 + Σ_{i=1}^{6} β_i c_i),    (11.9)

where δ(x) = 1/(1 + exp{−x}). We estimate the parameters β = {β_0, ..., β_6} to maximize
the log likelihood defined as

L(β) = Σ_{i=1}^{t} [l^{(i)} log f(t_i) + (1 − l^{(i)}) log(1 − f(t_i))],    (11.10)
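A minimal numpy sketch of evaluating (11.9) and (11.10); the β values, the feature vector c and the label are made up:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def success_probability(c, beta):
    # eq. (11.9): f(t) = sigmoid(beta_0 + sum_i beta_i * c_i), where c holds
    # the six domain/cross-domain features of a triple t = {S, D, T}
    return sigmoid(beta[0] + np.dot(beta[1:], c))

def log_likelihood(C, labels, beta):
    # eq. (11.10): Bernoulli log likelihood over training triples
    f = np.array([success_probability(ci, beta) for ci in C])
    return float(np.sum(labels * np.log(f) + (1 - labels) * np.log(1 - f)))

beta = np.array([0.5, 1.0, -0.5, 0.2, 0.0, 0.3, -0.1])  # beta_0 ... beta_6
c = np.array([0.2, 0.4, 0.1, 0.3, 0.5, 0.2])            # hypothetical features
print(success_probability(c, beta))
print(log_likelihood([c], np.array([1.0]), beta))
```
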
Figure 11.5 An illustration of the CMTF algorithm in the TTL framework. The
algorithm learns two coupled feature representations by feature clustering, and
then propagates the label information from the source domain to the target do-
main through the intermediate domain based on the coupled feature represen-
tation
Tan et al.’s (2015) work. The CMTF algorithm is illustrated in Figure 11.5 and its
objective function is formulated as
L = ‖X_s − F_s A_s G_s^T‖_F + ‖X_I − F_I A_I G_I^T‖_F + ‖X_I − F_I A_I G_I^T‖_F + ‖X_t − F_t A_t G_t^T‖_F
  = ‖X_s − [F̂^1, F̂_s^2][Â^1; Â_s^2] G_s^T‖_F + ‖X_I − [F̂^1, F̂_I^2][Â^1; Â_I^2] G_I^T‖_F
  + ‖X_I − [F̃^1, F̃_I^2][Ã^1; Ã_I^2] G_I^T‖_F + ‖X_t − [F̃^1, F̃_t^2][Ã^1; Ã_t^2] G_t^T‖_F,    (11.12)

where [Â^1; Â_s^2] denotes the vertical concatenation of Â^1 and Â_s^2.
According to (11.12), we can see that the first two terms correspond to the first
feature clustering and label propagation between the source and intermediate do-
mains in Figure 11.5 and the last two terms refer to the second feature clustering
and label propagation between the intermediate and target domains. In (11.12),
it is worth noting that XI is decomposed twice with different decomposition ma-
trices, since XI shares different knowledge with Xs and Xt . At the same time, we
couple these two decomposition processes via the label matrix GI . It is reasonable
as instances in the intermediate domain should have the same labels in different
decomposition processes.
Overall, the CMTF algorithm defines a transitive property among domains. The
label information in the source domain is transferred through F̂1 and Â1 to the
intermediate domain and affects the learning of GI . The knowledge on class labels
encoded in GI is further transferred from the intermediate domain to the target
domain through F̃1 and Ã1 .
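As a rough numeric check of (11.12), the objective can be evaluated as a sum of four Frobenius-norm reconstruction errors, with the two decompositions of $X_I$ coupled by a shared label matrix $G_I$. This is a hedged sketch under illustrative assumptions: the helper names, shapes and factor matrices are invented, and no optimization of the factors is shown.

```python
import numpy as np

def frob(X, F, A, G):
    """Reconstruction error ||X - F A G^T||_F of one tri-factorization."""
    return np.linalg.norm(X - F @ A @ G.T)

def cmtf_objective(Xs, XI, Xt, src, mid1, mid2, tgt):
    """src/mid1/mid2/tgt are (F, A, G) triples; mid1 and mid2 factorize
    the same X_I and must share the label matrix G_I (the coupling)."""
    assert mid1[2] is mid2[2], "the two decompositions of X_I share G_I"
    return (frob(Xs, *src) + frob(XI, *mid1)
            + frob(XI, *mid2) + frob(Xt, *tgt))
```

For exact factors the objective is zero; in CMTF proper, the factors would be optimized alternately rather than evaluated once.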
The goal of TTL is to exploit the unlabeled data in the intermediate domains to
build a bridge between the source and target domains, which are originally distant
from each other, and train an accurate classifier for the target domain by transferring
supervised knowledge from the source domain with the help of the bridge. Note
that not all the data in the intermediate domains are supposed to be similar to the
source domain data, and some of them may be quite different. Therefore, simply
using all the intermediate data to build the bridge may fail to work.
$$J_1(f_e, f_d, \mathbf{v}_S, \mathbf{v}_I) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i \|\hat{\mathbf{x}}_S^i - \mathbf{x}_S^i\|_2^2 + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i \|\hat{\mathbf{x}}_I^i - \mathbf{x}_I^i\|_2^2 + \frac{1}{n_T}\sum_{i=1}^{n_T} \|\hat{\mathbf{x}}_T^i - \mathbf{x}_T^i\|_2^2 + R(\mathbf{v}_S, \mathbf{v}_I), \qquad (11.13)$$
where $\hat{\mathbf{x}}_S^i$, $\hat{\mathbf{x}}_I^i$ and $\hat{\mathbf{x}}_T^i$ are reconstructions of $\mathbf{x}_S^i$, $\mathbf{x}_I^i$ and $\mathbf{x}_T^i$ based on the autoencoder,
and $\mathbf{v}_S = (v_S^1, \cdots, v_S^{n_S})^T$ and $\mathbf{v}_I = (v_I^1, \cdots, v_I^{n_I})^T$ with $v_S^i, v_I^j \in \{0, 1\}$ are selection
indicators for the $i$-th instance in the source domain and the $j$-th instance in the
intermediate domains, respectively. When the value is equal to 1, the corresponding
instance is selected; otherwise it is unselected. $R(\mathbf{v}_S, \mathbf{v}_I)$ is a regularization
function on $\mathbf{v}_S$ and $\mathbf{v}_I$ that avoids the trivial solution of setting all values of
$\mathbf{v}_S$ and $\mathbf{v}_I$ to zero. In SLA, it is defined as
$$R(\mathbf{v}_S, \mathbf{v}_I) = -\frac{\lambda_S}{n_S}\sum_{i=1}^{n_S} v_S^i - \frac{\lambda_I}{n_I}\sum_{i=1}^{n_I} v_I^i.$$
Minimizing this term is equivalent to encouraging the selection of as many instances
as possible from the source and intermediate domains. The two regularization
parameters, $\lambda_S$ and $\lambda_I$, control the importance of this regularization term.
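The weighted reconstruction loss $J_1$ in (11.13) can be sketched numerically as follows. This is a hedged illustration: the function name, argument layout and the synthetic arrays in the usage below are assumptions, not SLA's actual implementation.

```python
import numpy as np

def j1(Xs_hat, Xs, XI_hat, XI, Xt_hat, Xt, vS, vI, lam_s, lam_i):
    """Per-domain averaged reconstruction errors, with 0/1 selection
    indicators masking source and intermediate instances, plus the
    regularizer R(v_S, v_I) that rewards selecting more instances."""
    nS, nI, nT = len(Xs), len(XI), len(Xt)
    loss = (vS * ((Xs_hat - Xs) ** 2).sum(axis=1)).sum() / nS
    loss += (vI * ((XI_hat - XI) ** 2).sum(axis=1)).sum() / nI
    loss += ((Xt_hat - Xt) ** 2).sum() / nT
    reg = -(lam_s / nS) * vS.sum() - (lam_i / nI) * vI.sum()
    return loss + reg
```

With perfect reconstructions and all instances selected, the value reduces to the (negative) regularization reward alone.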
$$J_2(f_c, f_e, f_d) = \frac{1}{n_S}\sum_{i=1}^{n_S} v_S^i\, \ell(y_S^i, f_c(\mathbf{h}_S^i)) + \frac{1}{n_T}\sum_{i=1}^{n_T} \ell(y_T^i, f_c(\mathbf{h}_T^i)) + \frac{1}{n_I}\sum_{i=1}^{n_I} v_I^i\, g(f_c(\mathbf{h}_I^i)), \qquad (11.14)$$
where $\mathbf{v} = \{\mathbf{v}_S, \mathbf{v}_I\}$ and $\Theta$ denotes all the parameters of the functions $f_c(\cdot)$, $f_e(\cdot)$ and
$f_d(\cdot)$.
To solve (11.15), SLA uses the block coordinate descent method, where, in
each iteration, variables in each block are optimized while keeping other variables
fixed. In (11.15), there are two blocks of variables: Θ and v. When the variables in
v are fixed, we can update Θ using the back propagation algorithm where the gra-
dients can be computed easily. Alternatively, when the variables in Θ are fixed, we
can obtain an analytical solution for v as follows,
$$v_S^i = \begin{cases} 1 & \text{if } \ell(y_S^i, f_c(f_e(\mathbf{x}_S^i))) + \|\hat{\mathbf{x}}_S^i - \mathbf{x}_S^i\|_2^2 < \lambda_S \\ 0 & \text{otherwise} \end{cases} \qquad (11.16)$$
$$v_I^i = \begin{cases} 1 & \text{if } \|\hat{\mathbf{x}}_I^i - \mathbf{x}_I^i\|_2^2 + g(f_c(f_e(\mathbf{x}_I^i))) < \lambda_I \\ 0 & \text{otherwise} \end{cases} \qquad (11.17)$$
Based on (11.16), we can see that for data in the source domain, only those with
low reconstruction errors and low training losses will be selected during the op-
timization procedure. Similarly, based on (11.17), it can be found that, for data
in the intermediate domains, only those with low reconstruction errors and high
prediction confidences will be selected.
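The analytic selection step in (11.16) and (11.17) reduces to simple threshold tests once the per-instance losses are known. The sketch below is illustrative: the loss values passed in the usage are made up, and `confidence_penalty` stands in for the term $g(\cdot)$.

```python
def select_source(train_loss, recon_err, lam_s):
    """(11.16): keep a source instance when training loss plus
    reconstruction error falls below lambda_S."""
    return 1 if train_loss + recon_err < lam_s else 0

def select_intermediate(recon_err, confidence_penalty, lam_i):
    """(11.17): keep an intermediate instance when reconstruction error
    plus the confidence penalty g(.) falls below lambda_I."""
    return 1 if recon_err + confidence_penalty < lam_i else 0
```

Raising $\lambda_S$ or $\lambda_I$ admits more instances; lowering them keeps only the cleanest ones.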
Figure 11.6 The network architecture used in SLA (adapted from Tan et al. [2017])
In this way, the model is iteratively refined on the selected "useful" data samples.
The overall algorithm for solving (11.15) is summarized in Algorithm 11.2.
The network architecture corresponding to (11.15) is illustrated in Figure 11.6.
From Figure 11.6, we note that, except for the instance selection component v, the
rest of the architecture in Figure 11.6 can be viewed as a generalization of an au-
toencoder or a convolutional autoencoder by incorporating the side information.
12.1 Introduction
Three key research issues in transfer learning, discussed in Chapter 1, are when
to transfer, how to transfer and what to transfer. Once a source domain is con-
sidered to be helpful for a target domain (when to transfer), a transfer learning
algorithm (how to transfer) can help learn the transferable knowledge across
domains (what to transfer). Usually different transfer learning algorithms are likely
to learn different knowledge, leading to uneven transfer learning effectiveness,
which can be measured by the improvement of the performance over non-transfer
algorithms in the target domain. To obtain good performance in the target
domain, many transfer learning algorithms can be treated as candidate algorithms
to try, including instance-based transfer learning algorithms (Dai et al., 2007b),
model-based transfer learning algorithms (Tommasi et al., 2014) and feature-
based transfer learning algorithms (Pan et al., 2011). It is computationally
expensive and practically impossible to try all the transfer learning algorithms in a
brute-force way. As a trade-off, researchers usually heuristically choose a transfer
learning algorithm, which may lead to a suboptimal performance.
Exploring the whole space of transfer learning algorithms is not the only way to
optimize what to transfer; previous transfer learning experiences can also help.
It has been widely accepted in educational psychology (Luria, 1976; Belmont
et al., 1982) that learning from experience is a good methodology. To improve
transfer learning skills of deciding what to transfer, humans can conduct meta-
cognitive reflection on diverse experiences. Unfortunately, by ignoring previous
transfer learning experiences, all existing transfer learning algorithms learn from
scratch.
With machine learning models getting increasingly complex, the need for
automated machine learning, or AutoML (Yao et al., 2018), has emerged as a strong
trend in machine learning. As machine learning involves many tedious steps that
require much experience from human experts, including sample selection,
feature engineering, algorithm selection, architectural design, model tuning and
evaluation, machine learning practice calls for an end-to-end solution
where many of these steps can be automated. Recognizing that designing
sophisticated architectures and tuning complex parameters currently require AI
experts, AutoML aims to liberate humans from such manual labor in
optimizing a machine learning model, by introducing automation through machine
learning itself. Several research prototypes and solutions have been applied in
real world applications, for example see the works by Kotthoff et al. (2017), Wong
et al. (2018), Liu et al. (2018c), Bello et al. (2017) and Feurer et al. (2015). AutoML
has several advantages compared to traditional manual-based model construc-
tion, including fast deployment in practice, optimized selection of model and a
lower cost. There have been several applications of AutoML, including image and
speech recognition, recommendation systems and predictive analytics.
Similar to AutoML, transfer learning can also be packaged in an end-to-end
process. We refer to such automated transfer learning frameworks collectively as
AutoTL. In this chapter we present a
novel AutoTL framework called Learning to Transfer (L2T), which selects transfer
learning algorithms automatically through experience. This framework was first
proposed by Wei et al. (2018). L2T is a special case of AutoTL, with an aim of iden-
tifying the suitable algorithm and model parameters based on previous transfer
learning experience.
By exploiting previous transfer learning experiences, the L2T framework aims to
improve the transfer performance from a source to a target domain by
determining what and how to transfer between them. To achieve this goal, L2T
consists of two phases. In the first phase, given transfer learning experiences, each
of which consists of three elements, including a pair of source and target domains,
the knowledge transferred between them and the performance improvement, a
reflection function, which functionally maps a pair of domains and the trans-
ferred knowledge to the performance improvement, is learned from all the experi-
ences. During the second phase, for a new pair of domains, the learned reflection
function as an approximation of the performance improvement is maximized to
determine what to transfer between the two domains.
Figure 12.1 An illustration of the L2T framework. The training stage learns a
reflection function $f$, which encodes transfer learning skills, based on $N_e$ transfer
learning experiences $\{E_1, \cdots, E_{N_e}\}$. In the testing stage, for the $(N_e+1)$-th source-
target pair, the learned reflection function $f$ is maximized to learn the knowledge
to be transferred between them, that is, $W^*_{N_e+1}$
heterogeneous label spaces for each pair of domains, that is, $\mathcal{X}_e^s = \mathcal{X}_e^t$ and
$\mathcal{Y}_e^s = \mathcal{Y}_e^t$. $a_e \in \mathcal{A} = \{a_1, \cdots, a_{N_a}\}$ denotes a transfer learning algorithm that has
been conducted between $S_e$ and $T_e$. Here the knowledge transferred by the
algorithm $a_e$ is parameterized as $W_e$. Finally, $l_e = p_e^{st}/p_e^{t}$ denotes the performance
improvement ratio that serves as the label of the corresponding transfer learning
experience, where $p_e^{st}$ is the performance (e.g., classification accuracy) on a test data set
in $T_e$ after transferring $W_e$ from $S_e$ and $p_e^{t}$ is that on the same test data set without
transfer.
In the training stage, as illustrated in Figure 12.1, L2T aims to learn a reflection
function $f$ based on $N_e$ transfer learning experiences $\{E_1, \cdots, E_{N_e}\}$ by approximating
$l_e$ with $f(S_e, T_e, W_e)$. When a new pair of domains $\langle S_{N_e+1}, T_{N_e+1} \rangle$ arrives,
the L2T model maximizes $f$ to learn the knowledge to be transferred, that is,
$W^*_{N_e+1}$, as shown in the testing stage in Figure 12.1.
where $x_e^{tj}$ denotes the $j$-th example in $X_e^t$, $\phi$ maps from the $u$-dimensional latent
space to the RKHS $\mathcal{H}$, and $K(\cdot, \cdot) = \langle \phi(\cdot), \phi(\cdot) \rangle$ denotes the kernel function. Different
kernels $K$ lead to different MMDs and hence different forms of $f$, so learning $f$
amounts to identifying the optimal $K$. Following multi-kernel MMD (Gretton
et al., 2012), $K$ is parameterized as a linear combination of $N_k$ kernels with
non-negative combination coefficients, that is, $K = \sum_{k=1}^{N_k} \beta_k K_k$ with $\beta_k \ge 0$ for all $k$,
and the coefficients $\boldsymbol{\beta} = [\beta_1, \cdots, \beta_{N_k}]^T$ will be learned instead. Then the MMD can be
simplified as
$$\hat{d}_e^2(X_e^s W_e, X_e^t W_e) = \sum_{k=1}^{N_k} \beta_k\, \hat{d}_{e(k)}^2(X_e^s W_e, X_e^t W_e) = \boldsymbol{\beta}^T \hat{\mathbf{d}}_e,$$
where $\hat{\mathbf{d}}_e = \big[\hat{d}_{e(1)}^2, \cdots, \hat{d}_{e(N_k)}^2\big]^T$ with $\hat{d}_{e(k)}^2$ calculated based on the $k$-th kernel $K_k$.
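The multi-kernel MMD can be sketched numerically: each base kernel yields an empirical MMD$^2$ estimate, and the combined value is the inner product $\boldsymbol{\beta}^T \hat{\mathbf{d}}_e$. This is a hedged illustration; the Gaussian bandwidths and the biased (V-statistic) estimator below are assumptions, not the book's exact choices.

```python
import numpy as np

def mmd2(X, Y, kernel):
    """Biased empirical MMD^2 between samples X and Y for one kernel."""
    Kxx = kernel(X, X); Kyy = kernel(Y, Y); Kxy = kernel(X, Y)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def gaussian(sigma):
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    return k

def multi_kernel_mmd2(X, Y, betas, sigmas):
    """beta^T d_hat: per-kernel MMD^2 values combined with weights beta."""
    d_hat = np.array([mmd2(X, Y, gaussian(s)) for s in sigmas])
    return betas @ d_hat
```

Identical samples give a value of zero; well-separated samples give a clearly positive value.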
However, the MMD alone is insufficient to measure the divergence between
domains. A pair of domains with a small MMD may still have little distributional
overlap if the variance of the distance between them is high, so the distance variance
among all pairs of instances across domains is also required to fully characterize
the difference. According to Gretton et al. (2012), (12.1) is the empirical estimation
of $d_e^2(X_e^s W_e, X_e^t W_e) = \mathbb{E}_{x_e^s x_e^{s'} x_e^t x_e^{t'}}\big[h(x_e^s, x_e^{s'}, x_e^t, x_e^{t'})\big]$ for an appropriate function $h$, and the distance variance is defined as
$$\sigma_e^2(X_e^s W_e, X_e^t W_e) = \mathbb{E}_{x_e^s x_e^{s'} x_e^t x_e^{t'}}\Big[\big(h(x_e^s, x_e^{s'}, x_e^t, x_e^{t'}) - \mathbb{E}_{x_e^s x_e^{s'} x_e^t x_e^{t'}} h(x_e^s, x_e^{s'}, x_e^t, x_e^{t'})\big)^2\Big].$$
where $S_e^L = \sum_{j,j'=1}^{n_e^t} \frac{H_{jj'}}{(n_e^t)^2} (x_e^{tj} - x_e^{tj'})(x_e^{tj} - x_e^{tj'})^T$ is the local scatter covariance matrix,
$S_e^N = \sum_{j,j'=1}^{n_e^t} \frac{K(x_e^{tj}, x_e^{tj'}) - H_{jj'}}{(n_e^t)^2} (x_e^{tj} - x_e^{tj'})(x_e^{tj} - x_e^{tj'})^T$ is the non-local scatter covariance
matrix, and $H_{jj'}$ is defined as
$$H_{jj'} = \begin{cases} K(x_e^{tj}, x_e^{tj'}), & \text{if } x_e^{tj} \in N_r(x_e^{tj'}) \text{ and } x_e^{tj'} \in N_r(x_e^{tj}) \\ 0, & \text{otherwise.} \end{cases}$$
It is noted that the calculation of $\tau_e$ depends on the kernels. With $\tau_{e(k)}$ obtained from
the $k$-th kernel $K_k$, $\tau_e$ can be reformulated as $\tau_e = \sum_{k=1}^{N_k} \beta_k \tau_{e(k)} = \boldsymbol{\beta}^T \boldsymbol{\tau}_e$, where
$\boldsymbol{\tau}_e = [\tau_{e(1)}, \cdots, \tau_{e(N_k)}]^T$.
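The local and non-local scatter matrices can be computed directly from the kernel matrix and a mutual nearest-neighbor mask. The sketch below is a hedged illustration: the function name, the similarity-based neighbor rule and the toy data in the usage are assumptions.

```python
import numpy as np

def scatter_matrices(X, K, r):
    """Return (S_L, S_N): H_{jj'} keeps the kernel value only for mutual
    r-nearest neighbors; S_L weights pair differences by H, S_N by K - H."""
    n = len(X)
    # r most similar points per row by kernel value, excluding self
    nbrs = [set(np.argsort(-K[j])[1:r + 1]) for j in range(n)]
    d_dim = X.shape[1]
    SL = np.zeros((d_dim, d_dim))
    SN = np.zeros_like(SL)
    for j in range(n):
        for jp in range(n):
            d = (X[j] - X[jp])[:, None]
            H = K[j, jp] if (jp in nbrs[j] and j in nbrs[jp]) else 0.0
            SL += H / n ** 2 * (d @ d.T)
            SN += (K[j, jp] - H) / n ** 2 * (d @ d.T)
    return SL, SN
```

By construction, $S^L + S^N$ recovers the full kernel-weighted scatter regardless of the neighbor mask.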
where $\gamma_2$ is a regularization parameter. The first and second terms in (12.3) are
computed as
$$(\boldsymbol{\beta}^*)^T \hat{\mathbf{d}}_W = \sum_{k=1}^{N_k} \beta_k^* \Big[ \frac{1}{a^2} \sum_{i,i'=1}^{a} K_k(v_i W, v_{i'} W) + \frac{1}{b^2} \sum_{j,j'=1}^{b} K_k(w_j W, w_{j'} W) - \frac{2}{ab} \sum_{i,j=1}^{a,b} K_k(v_i W, w_j W) \Big],$$
$$(\boldsymbol{\beta}^*)^T \hat{Q}_W \boldsymbol{\beta}^* = \frac{1}{n^2 - 1} \sum_{i,i'=1}^{n} \Big[ \sum_{k=1}^{N_k} \beta_k^* \big( K_k(v_i W, v_{i'} W) + K_k(w_i W, w_{i'} W) - 2 K_k(v_i W, w_{i'} W) \big) - \frac{1}{n^2} \sum_{i,i'=1}^{n} \sum_{k=1}^{N_k} \beta_k^* \big( K_k(v_i W, v_{i'} W) + K_k(w_i W, w_{i'} W) - 2 K_k(v_i W, w_{i'} W) \big) \Big]^2,$$
where the shorthands $v_i = x^s_{(N_e+1)i}$, $v_{i'} = x^s_{(N_e+1)i'}$, $w_j = x^t_{(N_e+1)j}$, $w_{j'} = x^t_{(N_e+1)j'}$,
$a = n^s_{N_e+1}$ and $b = n^t_{N_e+1}$ are used. The third term in (12.3) can be calculated as
$(\boldsymbol{\beta}^*)^T \boldsymbol{\tau}_W = \sum_{k=1}^{N_k} \beta_k^* \frac{\mathrm{tr}(W^T S_k^N W)}{\mathrm{tr}(W^T S_k^L W)}$. The non-convex problem (12.3) can be
optimized via the conjugate gradient method.
manually (Blitzer et al., 2006), dimensionality reduction (Pan et al., 2011; Bak-
tashmotlagh et al., 2013, 2014), collective matrix factorization (Long et al., 2014),
dictionary learning/sparse coding (Raina et al., 2007; Zhang et al., 2016), mani-
fold learning (Gopalan et al., 2011; Gong et al., 2012a) and deep learning (Yosinski
et al., 2014; Long et al., 2015; Tzeng et al., 2015). Different from L2T, all existing
studies focus on transferring from scratch.
Figure 12.2 Illustration of the differences between L2T and other related learning
paradigms: transfer learning trains on task 1 and tests on task 2; multitask
learning trains and tests on the same tasks 1, ..., N; lifelong learning trains on
tasks 1, ..., N and tests on task N+1; and learning to transfer trains on task pairs
(1, 2), ..., (2N−1, 2N) and tests on the new pair (2N+1, 2N+2).
Several successful attempts have been made in AutoML. For instance, Kotthoff
et al. (2017) present a system that automates the search over the learning
algorithms in the Waikato Environment for Knowledge Analysis (WEKA) and
their respective hyperparameter settings to maximize performance.
Wong et al. (2018) apply transfer learning to help improve the AutoML process,
making it more cost-effective to apply the technique. Feurer et al. (2015) present
an AutoML system based on scikit-learn that automatically considers a system's
past performance on similar data sets. The technique is based
on an ensemble of learning systems that are to be optimized. Bello et al. (2017)
present an approach to automate the process of discovering optimization meth-
ods for deep learning architectures and their method uses a reinforcement learn-
ing algorithm to maximize the performance of a model based on a few functional
primitives to update a model. Liu et al. (2018c) present a method for learning the
structure of convolutional neural networks with a sequential-model-based
optimization strategy. This method is shown to be more efficient than the contem-
porary reinforcement-learning-based solutions.
The L2T framework introduced in this chapter is a special case of the AutoTL
framework, which applies AutoML to transfer learning tasks. In particular, L2T
belongs to the model selection module in AutoML, and different AutoML
techniques can be applied here to transfer learning. However, AutoML and AutoTL
also have differences: the former focuses on automating supervised learning
algorithms, whereas the latter (i.e., the L2T framework) focuses on transfer
learning only.
13
Few-Shot Learning
13.1 Introduction
a new concept based on all previous experiences “pretraining.” There are many
phases of cognition ranging from physical observation to mental comprehension
and memory. Take the recognition of fruits as an example. Although
different kinds of fruits, such as wax apple and fruit apple, have distinctive
appearances, flavors and textures, they share something in common: both have
smooth skins and similar shapes. These shared features suggest that knowledge
can be "transferred" from one type of apple to the other. If an
algorithm possesses such a generalization ability based on universal features, a
model can easily be adapted to a novel concept with only a few corrective
examples.
Following this insight, researchers have proposed few-shot learning to mimic
the learning ability of humans. There are many variants of few-shot learning, in-
cluding zero-shot learning, one-shot learning, Bayesian program learning (BPL),
poor resource learning and domain generalization. They all can be understood as
some variants of transfer learning. Thus, in the context of transfer learning, we
will review them one by one.
Compared with the previously introduced transfer learning settings, in few-
shot learning, the target domain is generally assumed to have very limited data,
including both labeled and unlabeled data. In some extreme cases, no data
instances in the target domain are assumed to be available in advance; for exam-
ple, this might be the case in domain generalization problems. In the following, we
introduce some representative state-of-the-art models under the settings of zero-
shot learning (Section 13.2), one-shot learning (Section 13.3), BPL (Section 13.4),
poor resource learning (Section 13.5) and, finally, domain generalization learning
(Section 13.6).
13.2.1 Overview
In the zero-shot learning setting, a learning system handles testing samples
from novel classes that do not appear in the training data set. Compared with
conventional machine learning settings, the critical difference is that new con-
cepts or labels appear in the test samples, and this difference requires a “bridge”
from knowledge of existing classes to that of the novel classes. The main bridge
employed in most zero-shot learning methods is the so-called semantic features.
These features make transfer learning possible.
In particular, semantic features of a certain class are the attributes characteriz-
ing this class. Thus, instead of learning a mapping from X to Y, where X is an
m-dimensional feature space and Y is a label space, we try to learn a function:
X → F where F is a semantic feature space. Apart from that, we need a knowl-
edge base K , which lists all the class labels and their associated semantic features,
which act as a bridge. The knowledge base K has the information about both the
existing classes and novel classes. Thus, after we obtain semantic features of an
example, we match the features against the knowledge base to obtain the most similar
class as the one to which the data sample belongs. In the following, we introduce
some useful terminologies used in the work by Palatucci et al. (2009).
Shen et al. (2006b) present one of the first works in zero-shot learning via a
"bridging classifier". This work won the championship of the 2005 ACM KDD Cup
data-mining competition (Shen et al., 2005) and has subsequently been applied
to several commercial search engine and advertisement systems. We will give a
detailed description of this solution below.
label C jI , which can be estimated from the Web pages in C I . Their relationship is
computed by applying the Bayes rule:
The terms in the last equation can be estimated by the term frequency of words or
phrases in a category. For example,
$$p(C_i^T \mid C_j^I) = \prod_{k=1}^{n} p(w_k \mid C_j^I)^{n_k}. \qquad (13.1)$$
A schematic figure showing how queries are mapped to the target classes
through the intermediate classes is shown in Figure 13.1. In this figure, a query
$q_k$ is mapped to the target class label $C^T$ with a certain probability that is calcu-
lated through the intermediate classifiers from $Q$ to $C^I$, and then from $C^I$ to the
target $C^T$.
Figure 13.1 The schematic graph shows the bridging classifier for query classifi-
cation through intermediate domains (adapted from Shen et al. [2006b])
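The product form of (13.1) is usually evaluated in log space to avoid numerical underflow. The sketch below is a hedged illustration of the idea: the function names, the toy word distributions and class names in the usage are invented, not from the KDD Cup system.

```python
import math

def log_p_target_given_intermediate(word_counts, word_probs):
    """log p(C^T | C^I) up to normalization: sum_k n_k * log p(w_k | C^I)."""
    return sum(n * math.log(word_probs[w]) for w, n in word_counts.items())

def classify(word_counts, class_word_probs):
    """Pick the class whose word distribution gives the highest log score."""
    return max(class_word_probs,
               key=lambda c: log_p_target_given_intermediate(
                   word_counts, class_word_probs[c]))
```

In the full bridging classifier these scores would be chained through the intermediate taxonomy rather than applied to a single level.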
where θ (1) ∈ Rh×d , θ (2) ∈ Rm×h and tanh(·) denotes the hyperbolic tangent func-
tion. They also consider that, if existing and novel classes are mixed together in
the test set, this model may mistakenly classify an image from a novel class to an
existing one. This is a typical issue dealt with in the transfer learning literature:
since the target and source domains have different distributions, the model trained
on the source data cannot be applied to the prediction task on the target data directly. If
the model is given access to some examples from the target domain, it can employ
some domain adaptation techniques to alleviate the distribution difference.
To address the data-shortage issue, Socher et al. (2013a) add a step before the
classification step to detect novel samples that are the samples belonging to un-
seen classes. Then they employ two kinds of strategies to perform the classifica-
tion task for the two groups, respectively, with one strategy handling novelties or
outliers and the other one dealing with normal samples.
Another interesting model in this category is the convex combination of seman-
tic embeddings (ConSE) (Norouzi et al., 2013). The distinction between ConSE
and Socher et al.'s (2013a) model lies in the choice of the objective function. In fact,
ConSE hides the regression process inside a standard classification process, so the mean
squared error is replaced by the classification error. The classifier is trained in
the source domain to estimate the probability of a data point belonging to each of
the classes. In the test phase, the trained classifier is applied to the target data to
output the probability that this data point is drawn from each source class. Next,
the representation of the sample in the semantic feature space is computed by a
convex combination of label encodings corresponding to each source class with
the estimated probabilities as the weights; mathematically, it can be defined as
$$f(x) = \frac{1}{Z} \sum_{t=1}^{T} P(\hat{y}(x,t) \mid x)\, f(\hat{y}(x,t)),$$
where the top $T$ most likely classes are involved, $\hat{y}(x,t)$ denotes the label with the $t$-th
highest probability, and $Z$ is the normalization factor. The intuition behind the
method is to derive the representation from the similarity between the current
sample and different classes. Suppose the appearance of a liger is half akin to a
lion and half akin to a tiger; then $f(\text{liger})$ is approximately $\frac{1}{2} f(\text{lion}) + \frac{1}{2} f(\text{tiger})$.
With the predicted embedding in the semantic space, we can easily find its nearest
neighbors in the semantic knowledge base.
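The ConSE construction can be sketched in a few lines: take the convex combination of the top-$T$ label embeddings weighted by the classifier's probabilities, then look up the nearest neighbor in the semantic knowledge base. This is a hedged sketch; the toy embeddings, probabilities and class names in the usage are assumptions.

```python
import numpy as np

def conse_embed(probs, embeddings, T):
    """Convex combination of the top-T class embeddings, weighted by the
    classifier's probabilities and renormalized by Z = sum of top-T probs."""
    top = np.argsort(probs)[::-1][:T]
    Z = probs[top].sum()
    return (probs[top, None] * embeddings[top]).sum(axis=0) / Z

def nearest_class(f_x, knowledge_base):
    """Nearest neighbor of the predicted embedding in the knowledge base."""
    names = list(knowledge_base)
    dists = [np.linalg.norm(f_x - knowledge_base[n]) for n in names]
    return names[int(np.argmin(dists))]
```

With equal lion/tiger probabilities, the predicted embedding lands halfway between the two, matching the liger intuition above.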
score will be the one that the sample belongs to. The formulation in this setting is
to map from X ×F to S where S is the score space. After predicting the matching
scores for all the classes, we can rank the scores in descending order and select the
most likely one or several to produce the predicted label. The form of the mapping
function can either be simply bilinear (i.e., xT Wf), where W is a d x × d f parameter
matrix to be learned, or nonlinear such as deep neural networks. Another differ-
ence lies in the choice of the loss function.
matching score between the test image and each label; we then need only identify
the nearest neighbor of Wg(x) in the label encoding space.
and the testing sample are projected into the same latent feature space. The mo-
tivation is to enable the model to have some discriminative ability. For instance,
given two objects, although humans may not be able to name them respectively,
humans can easily distinguish whether they are from the same category by com-
paring their key features. As long as the model possesses this discrimination ability,
it can make its judgment by comparison. A typical architecture of the Siamese
neural network is shown in Figure 13.3.
Figure 13.3 The architecture of the Siamese neural network (adapted from
Koch [2015])
The inputs to the Siamese neural network are denoted by x(1) and x(2) respec-
tively and the output is denoted by P(x(1) , x(2) ). Within the twin neural networks,
L layers, which can be any of linear layer, convolutional layer, pooling layer or
other nonlinear layer, are connected in sequence. We use h(i ,l ) , where i ∈ {1, 2}
and l ∈ {1, · · · , L}, to denote the output of the l -th layer in the i -th neural network.
The outputs of the two neural networks, h(i ,L) for i ∈ {1, 2}, are separately trans-
formed into two vectors z(i ) . Finally, we use a metric to measure the distance be-
tween them as the output P(x(1) , x(2) ). In Koch’s (2015) work, the distance metric is
defined as
$$d(z^{(1)}, z^{(2)}) = \sigma\Big( \sum_j \alpha_j \big|z_j^{(1)} - z_j^{(2)}\big| \Big),$$
where $z_j^{(i)}$ is the $j$-th entry in vector $z^{(i)}$. The metric can naturally be used to
approximate the probability that the pair of inputs to the twin networks share the same
label.
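The similarity head above is just a sigmoid over a weighted L1 distance between the two embedding vectors. The sketch below is illustrative; the weights alpha are assumed to have been learned elsewhere (and may be negative, so that larger distances can lower the similarity).

```python
import math

def siamese_similarity(z1, z2, alpha):
    """sigma(sum_j alpha_j * |z1_j - z2_j|), read as the probability that
    the two inputs share a label."""
    s = sum(a * abs(u - v) for a, u, v in zip(alpha, z1, z2))
    return 1.0 / (1.0 + math.exp(-s))
```

Identical embeddings always give exactly 0.5, since the weighted distance is zero before the sigmoid.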
In the training phase, for each pair of inputs (x(1) , x(2) ) from the support set, the
output y is set to 1 if x(1) and x(2) belong to the same class, or to 0 otherwise. The
loss function is then defined as the regularized binary cross-entropy between the
output P(x^(1), x^(2)) and the label y.
Other Variations
The Siamese neural network can be viewed as a hard classification method, as
it assigns the label of the most similar exemplar in the support set to the test-
ing sample without considering less similar ones. As a hard decision may be mis-
led by outliers, its deficiency is inevitable. Since if we only get one exemplar from
each class, which is independent of the rest, we do not have more evidence to
support the judgment. In other words, there is no way to borrow knowledge via
comparison with other exemplars to make the current comparison result more
compelling.
This issue can be addressed if we take two or more shots from the same class or
from other related classes, so that we can exploit more information from the
relevant shots. In this situation, a soft classification is preferred. Vinyals et al. (2016)
propose an algorithm to use the exemplars in the whole library to make a soft de-
cision. Technically, the algorithm estimates the probability of the new observation
belonging to each category.
Given a support set of $n_s$ labeled examples $S = \{(x_i^s, y_i^s)\}_{i=1}^{n_s}$, the goal is to map
from this support set to a classifier $c_S(x)$, which can predict the probability dis-
tribution over all the candidate class labels $y$. A general form of the probability
distribution is defined as
$$P(y \mid x, S) = \sum_{i=1}^{n_s} a(x, x_i^s)\, \mathbb{1}(y_i^s = y) \quad \text{subject to} \quad \sum_{i=1}^{n_s} a(x, x_i^s) = 1,$$
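The attention-based soft classifier above can be sketched with a softmax over cosine similarities to the support set, which is the common instantiation in Vinyals et al. (2016). This is a hedged sketch: the similarity choice, function names and toy data in the usage are assumptions.

```python
import numpy as np

def attention(x, support_x):
    """a(x, x_i^s): softmax over cosine similarities to the support set,
    so the weights sum to one as required."""
    sims = np.array([x @ xs / (np.linalg.norm(x) * np.linalg.norm(xs))
                     for xs in support_x])
    e = np.exp(sims - sims.max())
    return e / e.sum()

def predict_proba(x, support_x, support_y, n_classes):
    """P(y | x, S): soft vote of one-hot support labels under attention."""
    a = attention(x, support_x)
    p = np.zeros(n_classes)
    for w, y in zip(a, support_y):
        p[y] += w
    return p
```

Because the attention weights sum to one, the output is a valid distribution over the candidate labels.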
Figure 13.4 An illustration for the generative process of a character token, where
a character type acts like a template that can be used to generate a group of tokens
(adapted from Lake et al. [2011])
$$P(Z \mid W, \tau) \propto \exp\Big( -\frac{1}{2\sigma_z^2} \sum_{i=1}^{m} \|Z_i - W_i - \tau\|_2^2 \Big).$$
After acquiring the actual starting position of each stroke, we are now able to
generate a character token, or an ink track on an image, according to an adjusted
ink model proposed by Revow et al. (1996). As we know, when a pen is pressed on
a point while writing a character, the ink flows to the surrounding positions. It is
necessary to model this diffusion process, as the diffused ink would otherwise be
mistaken for other strokes. The probability that the color of a
position (x, y) is white is
$$P\big(X_{(x,y)}^{(i)} = 0 \mid S^{(i)}, Z^{(i)}, \pi^{(i)}\big) = 1 - Q\big(X_{(x,y)}^{(i)} \mid S^{(i)}, Z^{(i)}, \pi^{(i)}\big),$$
$$P\big(X_{(x,y)}^{(i)} = 1 \mid S^{(i)}, Z^{(i)}, \pi^{(i)}\big) = 1 - P\big(X_{(x,y)}^{(i)} = 0 \mid S^{(i)}, Z^{(i)}, \pi^{(i)}\big).$$
The form of Q will be defined later and it can be intuitively viewed as a mixture of
random noises and the influence from all the m strokes.
Heuristically, if the track of a stroke is distant from position (x, y), it is unlikely
that X_(x,y) is black due to that stroke. A Gaussian distribution can be used to
express this heuristic, in which, as the distance becomes larger, the probability
drops rapidly. As a single stroke traverses across many pixels of an image, which
results in a high complexity, the ink model discretizes continuous strokes to mul-
tiple beads. Taking a vertical line as an example, we can use multiple points or beads
along it to approximate the stroke, instead of a complete line. In this way, we can
control the number of beads to be sampled along the line. We can define Q as
$$Q\big(X_{(x,y)}^{(i)} \mid S^{(i)}, Z^{(i)}, \pi^{(i)}\big) = \frac{\beta}{R^2} + (1-\beta) \sum_{j=1}^{m} \pi_j^{(i)}\, V\big(X_{(x,y)}^{(i)} \mid S_j^{(i)}, Z_j^{(i)}\big),$$
$$V\big(X_{(x,y)}^{(i)} \mid S_j^{(i)}, Z_j^{(i)}\big) = \frac{1}{B} \sum_{b=1}^{B} \mathcal{N}\big(X_{(x,y)}^{(i)} \mid C_b + Z_j^{(i)}, \sigma_b^2 I\big),$$
where $B$ is the number of beads used to induce the stroke shape and $C_b \in \mathbb{R}^2$ is a bead
coordinate for stroke $S_j^{(i)}$.
Low-Resource Learning
When some data resources are available for a target learning task in the form of
parallel language pairs, it is possible to train a target-domain model. Zoph et al.
(2016) train a parent model from a high-resource domain (i.e., source domain) in
which a large number of French–English language pairs are available, as shown
in Figure 13.6. A portion of parameters in the parent model is used to initial-
ize the parameters in a child model that aims at the translation task in the low-
resource Uzbek–English language pair. The parent model and the child model are
constrained to share the identical architecture that is a two-layer encoder-decoder
model with long short-term memory units. The model uses an attention compo-
nent to look back at the source domain.
Figure 13.6 The architecture for machine translation by showing six blocks of pa-
rameters (adapted from Zoph et al. [2016])
two components and only fine-tune the attention units. With such constraints,
the attention units are expected to capture more general knowledge despite the
fact that the source features may have some noises.
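The parent-child transfer scheme described above amounts to copying parameter blocks from the parent, freezing some of them, and fine-tuning the rest (e.g., the attention component). The pure-Python sketch below is a hedged illustration; the block names and the gradient values in the usage are invented.

```python
def init_child(parent_params, shared_blocks):
    """Initialize a child model by copying shared blocks from the parent."""
    return {name: list(parent_params[name]) for name in shared_blocks}

def fine_tune_step(child_params, grads, frozen_blocks, lr=0.1):
    """One gradient-descent step that skips frozen blocks, which keep
    their transferred values."""
    for name, g in grads.items():
        if name in frozen_blocks:
            continue
        child_params[name] = [p - lr * gi
                              for p, gi in zip(child_params[name], g)]
    return child_params
```

Freezing the encoder-side blocks while updating only the attention block mirrors the constraint described in the text.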
There are some concerns about the chosen language pairs as intermediate domains
for conducting the TTL. For example, French speakers often acquire English
more rapidly than Chinese speakers because English and French share common
characteristics in many aspects. Although there is no solid theory supporting the
choice of language pairs, we can expect that different pairs result in different
effects in the transitive learning process.
$$\min_{\mathbf{w}_{vw},\, \Delta_{s_i},\, \xi,\, \rho} \ \frac{1}{2}\|\mathbf{w}_{vw}\|^2 + \frac{\lambda}{2}\sum_{i=1}^{m}\|\Delta_{s_i}\|^2 + C_1 \sum_{i=1}^{m}\sum_{j=1}^{n_{s_i}} \xi_j^{s_i} + C_2 \sum_{i=1}^{m}\sum_{j=1}^{n_{s_i}} \rho_j^{s_i},$$
where $C_1$, $C_2$ and $\lambda$ are hyperparameters, and $\xi_j^{s_i}$ and $\rho_j^{s_i}$ are slack variables. (13.3)
defines a linear relationship between $\mathbf{w}_{vw}$, $\mathbf{w}_{s_i}$ and $\Delta_{s_i}$; (13.4) corresponds to the
loss incurred across all data sets when using the visual world weights $\mathbf{w}_{vw}$, since
the visual world model is expected to generalize across all data sets; and (13.5)
corresponds to the loss incurred by the private model.
Figure 13.7 The architecture of the multitask autoencoder (adapted from Ghifary
et al. [2015]), where all the domains share the same encoder and have separate
decoders
has a training set $D_{s_i} = \{\mathbf{x}_j^{s_i}\}_{j=1}^{n_{s_i}}$. The encoder and decoders are defined as
$$\mathbf{h}_j^{s_i} = \sigma_{enc}(W \mathbf{x}_j^{s_i}), \qquad \hat{\mathbf{x}}_j^{s_i} = \sigma_{dec}(V^{s_i} \mathbf{h}_j^{s_i}),$$
where $\Theta_{s_i} = \{W, V^{s_i}\}$ contains the shared and individual parameters. The loss func-
tion is defined as
$$J(\Theta_{s_i}) = \sum_{j=1}^{n_{s_i}} l\big(f_{\Theta_{s_i}}(\mathbf{x}_j^{s_i}), \mathbf{x}_j^{s_i}\big) + R(\Theta_{s_i}),$$
where $R(\Theta_{s_i})$ is a regularization term. Ghifary et al. (2015) use the squared $l_2$ norm
regularization, that is, $R(\Theta_{s_i}) = \|W\|_F^2 + \sum_{i=1}^{m} \|V^{s_i}\|_F^2$. SGD is applied to solve the
objective function.
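The multitask autoencoder's per-domain reconstruction loss can be sketched as follows: one shared encoder matrix $W$, a separate decoder $V^{s_i}$ per domain, and a squared-error loss summed over domains. This is a hedged sketch under stated assumptions: tanh is used as an illustrative encoder nonlinearity and the decoder is taken to be linear.

```python
import numpy as np

def mtae_loss(domains, W, decoders):
    """Sum of per-domain squared reconstruction errors with a shared
    encoder W and one decoder matrix V per domain."""
    total = 0.0
    for X, V in zip(domains, decoders):
        H = np.tanh(X @ W.T)   # shared encoder
        X_hat = H @ V.T        # domain-specific linear decoder
        total += ((X_hat - X) ** 2).sum()
    return total

def l2_reg(W, decoders):
    """Squared Frobenius-norm regularizer over shared and private parts."""
    return (W ** 2).sum() + sum((V ** 2).sum() for V in decoders)
```

In training, both terms would be minimized jointly by SGD over $W$ and all $V^{s_i}$.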
14
Lifelong Machine Learning
14.1 Introduction
In the past decades, there have been significant advances in machine learn-
ing. However, there is a missing part in most proposed learning algorithms if we
compare them with how humans learn to solve problems. We can observe that
humans solve problems by continuously learning and improving their capabili-
ties for various tasks in their lifetime. In contrast, most contemporary machine
learning theory and algorithms still only focus on a one-time solution to learning
problems. We can see many examples in text classification, image classification,
image segmentation and so on.
But humans typically learn to solve various problems sequentially, one after another, and continuously. For example, a musician might learn to play many different instruments and study how to compose and perform different music year after year. Because of this ability to learn continuously, a musician who already knows how to play the piano, and how to read and compose music, can learn to play the guitar quickly. We call the paradigm in which learning happens continuously, such that later learning can benefit from previous learning, "lifelong machine learning."
An important reason why lifelong machine learning matters is that a large amount of labeled data from diverse learning tasks becomes available over time. This is largely driven by the prevalence of data collection devices, such as cameras and mobile phones, and Internet of Things technology. Deep learning requires a lot of labeled data to learn complex models, and these growing data volumes make the learning increasingly effective. In addition, over time, the prevalence of highly popular machine learning platforms such as TensorFlow (Abadi et al., 2016a), along with more powerful and cheaper computing hardware, makes developing machine learning systems much easier and more efficient. This is the context that makes lifelong machine learning feasible. In this chapter, we will explain the lifelong machine learning paradigm in detail.
14.2 Lifelong Machine Learning: A Definition
Lifelong machine learning has a long history in machine learning, mainly in the
transfer learning community (Thrun, 1995; Ruvolo and Eaton, 2013; Silver et al.,
2013; Chen and Liu, 2016). We will review some of these main approaches.
A typical lifelong machine learning system (see Figure 14.1) uses a knowledge base KB that stores knowledge learned over time. At time t, the system receives a task T_t from a corresponding domain D_t. A typical lifelong machine learning system first builds a new model for T_t based on the training data from D_t and the knowledge in KB. Then, the system extracts the transferable knowledge from (D_t, T_t) and updates the knowledge base KB. The updated knowledge base KB is used to refine the models trained for the previous t − 1 tasks.
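The retain-transfer loop described above can be sketched as follows. Every component here (the per-task statistics used as "knowledge" and the toy vote-plus-prior model) is an illustrative placeholder, not a specific published system.

```python
# Sketch of the lifelong machine learning loop: a new task arrives, a model
# is built from the task's data and the knowledge base KB, transferable
# knowledge is extracted, and KB is updated.

class LifelongLearner:
    def __init__(self):
        self.kb = {}        # knowledge base KB: knowledge learned over time
        self.models = []    # models for the previous tasks

    def extract_knowledge(self, task_id, data):
        # placeholder: retain per-task label statistics as "knowledge"
        return {"mean_label": sum(y for _, y in data) / len(data)}

    def build_model(self, data):
        # placeholder model: a majority vote biased by the KB's label prior
        prior = sum(k["mean_label"] for k in self.kb.values())
        bias = prior / max(len(self.kb), 1)
        vote = sum(y for _, y in data) / len(data)
        return lambda x: int(0.5 * vote + 0.5 * bias >= 0.5)

    def observe_task(self, task_id, data):
        model = self.build_model(data)                            # use D_t and KB
        self.kb[task_id] = self.extract_knowledge(task_id, data)  # update KB
        self.models.append(model)
        return model

learner = LifelongLearner()
m1 = learner.observe_task("t1", [((0.0,), 1), ((1.0,), 1)])
m2 = learner.observe_task("t2", [((0.0,), 0), ((1.0,), 1)])
```

A full system would also use the updated KB to refine the earlier models, as the text describes; that refinement step is omitted here for brevity.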
There are two essential elements of a successful lifelong machine learning system. First, there needs to be a retention system that stores the previous examples and models, learned on earlier tasks, in the universal knowledge base. Second, there needs to be a selective transfer mechanism that decides which previous domains and tasks to transfer to the current task, which is the domain-knowledge part at the center.
Knowledge retention enables lifelong machine learning from the perspective of how the universal knowledge is represented. Learned knowledge can be stored in various forms. The simplest method of retaining task knowledge is in a functional form such as the training examples (Silver and Mercer, 1996). An advantage of functional knowledge is the accuracy and purity of the knowledge, which allows for effective retention.
A disadvantage of functional knowledge is that it may require a large amount of storage space that must be searched frequently, which is time consuming.
Alternatively, we can retain the models learned previously in some form under-
standable by the current task; for example, the previous knowledge can be a
compressed form that has the same representation as the current task. The advantage of the latter approach is that the compact size of the retained model requires relatively little storage space compared with keeping the previous training examples. In addition, having a model allows for more efficient generalization.
In the past, many knowledge representation forms have been used for knowledge
retention, including neural networks and probability distributions.
Transfer learning enables lifelong machine learning from the perspective of knowledge reuse. The knowledge transfer component in lifelong machine learning pushes the limits of transfer learning in two directions. Instead of leveraging the limited knowledge obtained from selected previous source tasks, lifelong machine learning targets large-scale knowledge transfer from all related source tasks learned over time. Important issues include how to identify related tasks to transfer knowledge from and how to scale knowledge transfer to hundreds of source tasks.
Thrun (1995) defines a support set of previously learned tasks along with their training examples. Any pair of training examples is considered a candidate for the training data set of the invariant function if their outcomes agree on that task; these are the positive examples. When the examples' outcomes disagree, they constitute the negative examples.
Given a set of previously encountered tasks, their training examples can be used
to define an invariant function to be learned via a neural network algorithm. When
learning the new function for the next task, the invariance network can be used
to improve the effectiveness of training by providing additional information on
gradient descent.
In the case of object recognition, for example, there may be many images of
objects such as shoes, hats, and so on to be learned. Having learned to recognize
images of shoes, for example, certain image features can be identified as invariant,
which can in turn be used to recognize hats.
of a computer while also being used to express a negative opinion about a bad
battery, such as “battery drains fast.”
In order to overcome this bias, domain-level knowledge can be added to ensure that only unambiguous words are stored in the knowledge base. The domain-level knowledge can be viewed as how likely a word expresses the same sentiment in different domains. More specifically, $m_{+,w}^{KB}$ and $m_{-,w}^{KB}$ denote the numbers of domains in which word w appears more often in positive and negative examples, respectively. In the knowledge transfer step, $n_{+,w}^{KB}$ and $n_{-,w}^{KB}$ are used to compute the positive and negative word counts combined with the empirical word counts. The computed word counts are then used to estimate the conditional probabilities $P(+|w)$ and $P(-|w)$. The domain-level knowledge $m_{+,w}^{KB}$ and $m_{-,w}^{KB}$ is used to select the words that appear in at least a certain number of different domains.
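The count-merging step described above can be sketched as follows. The function name, smoothing constant and the minimum-domain threshold are illustrative assumptions, not the exact scheme of the cited work.

```python
# Hypothetical sketch of the knowledge-transfer step: knowledge-base word
# counts (n_plus_kb, n_minus_kb) are merged with the empirical counts of
# the current domain, but only for words whose domain-level counts
# (m_plus_kb, m_minus_kb) show they appear in enough past domains.

def merged_sentiment_prob(word, emp_plus, emp_minus,
                          n_plus_kb, n_minus_kb,
                          m_plus_kb, m_minus_kb,
                          min_domains=2, smoothing=1.0):
    """Estimate P(+|w) from empirical counts, adding KB counts only for
    words seen in at least `min_domains` past domains."""
    plus, minus = emp_plus, emp_minus
    if m_plus_kb + m_minus_kb >= min_domains:   # domain-level selection
        plus += n_plus_kb
        minus += n_minus_kb
    return (plus + smoothing) / (plus + minus + 2.0 * smoothing)

# "great" is positive in many past domains, so KB counts reinforce it
p = merged_sentiment_prob("great", emp_plus=3, emp_minus=1,
                          n_plus_kb=40, n_minus_kb=2,
                          m_plus_kb=5, m_minus_kb=0)

# "rare" is seen in only one past domain, so its KB counts are ignored
q = merged_sentiment_prob("rare", emp_plus=3, emp_minus=1,
                          n_plus_kb=40, n_minus_kb=2,
                          m_plus_kb=1, m_minus_kb=0)
```

The estimated probabilities would then feed a naive-Bayes-style classifier in the current domain.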
Besides the classification problem, shared supervised knowledge has also been
used to build topic models in transfer learning (Chen and Liu, 2014a, 2014b; Wang
et al., 2016b). Topic models, such as probabilistic latent semantic analysis (PLSA)
(Hofmann, 1999) and latent Dirichlet allocation (LDA), are statistical models used
to discover topics from a collection of text documents. A topic is defined as a list
of words with the probabilities representing how likely the words belong to that
topic. PLSA assumes the following generative process for word and document co-
occurrences:
$z^{T+1} = (z_1^{T+1}, z_2^{T+1}, \cdots, z_{k_{T+1}}^{T+1})$ for the current domain $d^{T+1} = (d_1^{T+1}, d_2^{T+1}, \cdots, d_{n_{T+1}}^{T+1})$.
Then, for each topic $z_k^{T+1} \in z^{T+1}$, its similar topics from the knowledge base Z will be identified based on the Kullback–Leibler divergence between $z_k^{T+1}$ and any topic in Z. The similar topics are put together to form a topic set $M_k^{T+1}$. A frequent itemset mining algorithm (Han and Kamber, 2000) is then used to find the words that co-occur in many different topics from $M_k^{T+1}$. The intuition is that, if two words appear together many times in different topics, we can be confident that they are related. By limiting the candidates to the similar topics $M_k^{T+1}$, we can increase the chance of successful transfer by eliminating unrelated topics.
All the word pairs learned from the previous process for all the topics are used
to generate a “must-link” set, which is used as the prior knowledge to guide the
topic mining for current domain dT +1 . In LTM, a specific type of topic model,
the generalized Pólya Urn model, is adopted to incorporate this knowledge in its
Gibbs sampling process to encourage such a pair of words to be in the same topic.
Figure 14.2 shows the architecture of LTM.
Figure 14.2 The architecture of the LTM model (adapted from Chen and
Liu [2016])
the algorithm to the problems where very limited data are available in the current task. However, without those data, the learned must-links might be unrelated to the current task and might therefore hurt the performance of knowledge transfer. While in AMC the must-links are learned only from past tasks, the cannot-links are learned together with the topic modeling.
The overall architecture of AMC is presented in Figure 14.3. The design of AMC is similar to that of LTM. However, as AMC does not use any data from the target task to learn must-links, the MustLinkMiner component in Figure 14.3 is different from that in LTM. In AMC, MustLinkMiner uses a multiple minimum supports frequent itemset mining (MS-FIM) algorithm (Liu et al., 1999) to extract must-links between two words. The reason that the traditional single minimum-support frequent itemset mining algorithm does not work for this problem is that generic topics, such as price, quality, customer service and so on, are shared among many topics. This means that the frequency of generic topics is much higher than that of specific ones, which poses a challenge to learning must-links for both generic and specific topics. The MS-FIM algorithm is applied to mine frequent itemsets, each containing a set of terms that have appeared many times in the knowledge base.
Figure 14.3 The architecture of the AMC model (adapted from Chen and
Liu [2016])
Different from LTM, AMC mines both must-links and cannot-links. As the potential cannot-links for a term w can be any words in the vocabulary except the ones that have co-occurred with w in a document before, the candidate set is too large to consider directly without any a priori knowledge. In AMC, the topics from the current task serve as the candidate pool for mining the cannot-links. Formally, given a knowledge base Z that contains all the topics from previous tasks and a topic $z_i^{T+1} \in z^{T+1}$ from the current task, AMC only considers two top terms $w_i$ and $w_j$ from $z_i^{T+1}$ as candidates, and then uses the topics in the knowledge base Z to decide whether a cannot-link should be added between the two terms. To determine the cannot-link relation, AMC examines all topics from Z and labels the term pairs that seldom appear together in the topics of Z as cannot-linked terms. Once AMC has both the must-links and cannot-links, the same knowledge-based topic modeling algorithm used in LTM can be used to learn a better topic model by incorporating this knowledge as a priori knowledge to guide the topic modeling.
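The cannot-link test described above can be illustrated with a small sketch: for the top terms of a current-domain topic, count how often they appear together among the top terms of knowledge-base topics, and label pairs that seldom co-occur as cannot-linked. The topic representation, top-k cutoff and co-occurrence threshold are assumptions for illustration, not AMC's exact criteria.

```python
# Illustrative AMC-style cannot-link mining over toy topics
# (each topic is a dict mapping word -> probability).

def top_terms(topic, k=3):
    return [w for w, _ in sorted(topic.items(), key=lambda t: -t[1])[:k]]

def mine_cannot_links(current_topic, kb_topics, k=3, max_cooccur=1):
    links = set()
    terms = top_terms(current_topic, k)
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            wi, wj = terms[i], terms[j]
            cooccur = sum(1 for t in kb_topics
                          if wi in top_terms(t, k) and wj in top_terms(t, k))
            if cooccur <= max_cooccur:   # seldom together in KB topics
                links.add(frozenset((wi, wj)))
    return links

kb = [{"price": 0.5, "cheap": 0.3, "value": 0.2},
      {"price": 0.4, "cheap": 0.4, "deal": 0.2},
      {"screen": 0.6, "bright": 0.3, "price": 0.1}]
cur = {"price": 0.5, "screen": 0.3, "cheap": 0.2}
links = mine_cannot_links(cur, kb)
```

Here "price" and "cheap" co-occur in several knowledge-base topics, so they are not cannot-linked, while "screen" and "cheap" never appear together and are.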
$$\theta_i \sim M, \quad \forall i, \qquad (14.1)$$
where θ i is the model for the i -th task and M is the high-level hidden model. A
similar idea is used in many multi-task learning methods (Zhang and Yang, 2017b)
to share knowledge among tasks.
In the ELLA framework (Ruvolo and Eaton, 2013), a model dictionary M ∈ Rd ×k
shared by different tasks is used to represent latent model components. The model
parameter θ t for a task t is represented as a linear combination of latent model
components in M. If st ∈ Rk denotes the linear combination weight, θ t can be
represented as
$$\theta^t = \mathbf{M}\mathbf{s}^t. \qquad (14.2)$$
Because M is shared among all tasks, which are learned continuously, after seeing
more training data from different tasks, M should be able to improve over time.
More specifically, define $\{(\mathbf{x}_i^t, y_i^t)\}_{i=1}^{n_t}$ as the training set for task t. The objective function of ELLA is formulated as follows:
$$\min_{\mathbf{M}} \frac{1}{T}\sum_{t=1}^{T}\min_{\mathbf{s}^t}\Big\{\frac{1}{n_t}\sum_{i=1}^{n_t} L\big(f(\mathbf{x}_i^t; \mathbf{M}\mathbf{s}^t), y_i^t\big) + \mu\|\mathbf{s}^t\|_1\Big\},$$
where T is the total number of tasks seen so far and L is the loss function.
However, because this objective function depends on all of the previous train-
ing data and every model st also depends on the shared model components M,
the optimization for this objective function is very expensive to compute as more
tasks arrive. Ruvolo and Eaton (2013) use approximation techniques to ensure the
computational efficiency in the lifelong-learning setting.
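The "sparse combination over a shared dictionary" idea in (14.2) can be illustrated with a generic sparse-coding routine. The ISTA solver below recovers a sparse code s^t such that M s^t approximates a given task parameter vector; it is a toy sketch under that interpretation, not the efficient online approximation of Ruvolo and Eaton (2013).

```python
import numpy as np

def soft_threshold(v, thr):
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def sparse_code(M, theta, mu=0.05, n_iter=500):
    """Minimize 0.5*||theta - M s||^2 + mu*||s||_1 via ISTA."""
    step = 1.0 / np.linalg.norm(M, 2) ** 2   # 1 / Lipschitz constant
    s = np.zeros(M.shape[1])
    for _ in range(n_iter):
        grad = M.T @ (M @ s - theta)
        s = soft_threshold(s - step * grad, step * mu)
    return s

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 4))           # shared latent model components
s_true = np.array([1.0, 0.0, -0.5, 0.0])  # sparse combination weights
theta_t = M @ s_true                       # task-t parameter vector
s_hat = sparse_code(M, theta_t)            # recovered sparse code
theta_hat = M @ s_hat
```

In ELLA proper, M itself is also updated as tasks arrive, so that the dictionary improves over time as the text describes.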
Wang and Pineau (2016) extend ELLA to cover nonlinear cases where the
model components are not limited to linear hypotheses. More specifically, instead
of learning a dictionary of basis vectors as in the work by Kumar and Daumé III
(2012), Wang and Pineau (2016) propose the learning of a more generalized dictio-
nary F = [ f 1 , f 2 , · · · , f T ] that contains a set of basis functions in a functional space
where { f t }Tt=1 can be any hypothesis instead of the linear hypothesis assumed in
the ELLA. The objective function of the proposed method is formulated as
$$\min_{F,\{\gamma^t\}} \sum_{t=1}^{T}\sum_{i=1}^{n_t} L\big(\langle F(\mathbf{x}_i^t), \gamma^t\rangle, y_i^t\big) + \mu\sum_{t=1}^{T}\|\gamma^t\|_1. \qquad (14.3)$$
By relaxing the linear assumption for the model, Wang and Pineau (2016) can han-
dle more complicated learning tasks that expand the scope of the “shared model
components” approach.
This line of research on lifelong machine learning follows the perspective of
multi-task learning. By modeling the different tasks hierarchically, the knowledge
retention step can be easily expressed as the shared high-level hidden model.
However, similar to many hidden models, in this approach, it is hard to under-
stand the learned knowledge stored in the knowledge base K B. In addition, it
might be an oversimplification to assume that a large number of tasks share a set
of base model components for complex lifelong machine learning problems. For
example, learning how to classify documents should be quite different from learn-
ing how to classify images.
its knowledge base keeps growing. For example, when seeing the term "Peking University," it realizes that the phrase refers to a university in China, because of the capitalized "University" in the phrase and because Peking is a name that was used in the past to refer to the city of Beijing.
In the work by Mitchell et al. (2015), the never-ending learning problem L is
defined as a set of learning tasks and a set of coupling constraints among solutions
to these learning tasks. More specifically, a NELL learning task is defined as a tuple
L i = {Ti , P i , E i }, where Ti is the performance task, P i is a pre-defined metric for
task Ti and E i is the training experience. Ti = (Xi , Yi ) defines the problem domain
and model space f i : Xi → Yi . The performance metric P i is used to measure the
performance of each model f i . E i is the training data used to train the model. The
goal of the learning task Ti is to learn an optimal model $f_i^*$ for the i-th learning task given the training data $E_i$ and the predefined metric $P_i$, that is, $f_i^* = \arg\max_{f_i} P_i(f_i, E_i)$. The performance metric of each task, that is, $P_i$, is simply the accuracy of the corresponding model.
In addition to a wide range of learning tasks, another unique component of NELL is how it connects different concepts. These relations are formulated as constraints in the form of "coupling." These constraints are derived, directly or indirectly, from the outputs of distinct learning tasks. The overall architecture of NELL is shown in Figure 14.4.
A distinguishing feature of the NELL system is its knowledge base, which is the
core of NELL. In NELL, the knowledge base includes all the beliefs predicted by
different models with high confidence. Over time, millions of coupling constraints
have been constructed to link all the tasks. The knowledge retention and knowl-
edge transfer processes are engineered to work in sync. NELL uses an expectation-
maximization-style learning paradigm (Dempster et al., 1977) to iteratively per-
form the knowledge retention (E-step) and knowledge transfer (M-step). In the
E-step, the parameters of all models are fixed and the models are used to output
the current best prediction for various tasks.
For example, after determining that "Shanghai" is a "City" and "China" is a "Country," NELL infers that <"Shanghai", "China"> satisfies the relation "CityLocatedInCountry(x,y)." The predictions are called beliefs in NELL. Each prediction comes
with a confidence score that represents how confident the model is about the pre-
diction. If the confidence score of a belief is higher than a certain predetermined
threshold, NELL adds it to the knowledge base as the new knowledge.
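The belief-promotion rule described above can be sketched as follows. The threshold value and belief record format are illustrative, not NELL's actual engineering.

```python
# Minimal sketch: candidate predictions whose confidence exceeds a fixed
# threshold are promoted into the knowledge base as new beliefs.

def promote_beliefs(candidates, knowledge_base, threshold=0.9):
    """candidates: list of (belief, confidence) pairs."""
    for belief, confidence in candidates:
        if confidence >= threshold and belief not in knowledge_base:
            knowledge_base[belief] = confidence
    return knowledge_base

kb = {}
candidates = [
    (("Shanghai", "City"), 0.97),
    (("China", "Country"), 0.95),
    (("CityLocatedInCountry", "Shanghai", "China"), 0.93),
    (("Shanghai", "Person"), 0.12),   # low confidence: not promoted
]
kb = promote_beliefs(candidates, kb)
```

In the full system, coupling constraints between tasks would also be checked before a candidate belief is accepted.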
In addition to adding highly confident model predictions as knowledge com-
ponents, NELL also includes an active learning component to acquire supervised
knowledge from humans. In the M-step, all the beliefs in the knowledge base are used as training data to retrain the models for the various tasks.

[Figure 14.4: the architecture of NELL. A knowledge base of latent variables holding beliefs and candidate beliefs is maintained by a knowledge integrator, which aggregates the outputs of components such as text context classifiers, orthographic classifiers, URL-specific HTML patterns and human advice.]
15
Privacy-Preserving Transfer Learning
15.1 Introduction
Machine learning techniques are increasingly being applied to a wide range of applications such as social networking, banking, supply chain management and health care. With these applications, more and more sensitive information, such as personal medical records and financial transactions, appears in various data sets. This raises a critical issue: how do we protect the private information of users?
Today, modern society increasingly demands solutions that address the privacy
issue. One of the most famous laws is Europe’s General Data Protection Regulation
(GDPR),1 which regulates the protection of private user data and restricts the data
transmission between organizations.
The question of how to guarantee user privacy and data confidentiality has thus
become a serious concern in machine learning. So far, researchers have attempted
to address this concern from several angles (Dwork et al., 2006a, 2006b; Chaudhuri
et al., 2011; Dwork and Roth, 2014; Abadi et al., 2016b; Lee and Kifer, 2018).
Among the different methods, data anonymization is a basic way to protect the
sensitive information in user data. However, data anonymization alone is insuf-
ficient for protecting the user privacy. In fact, by using additional external infor-
mation, an attacker can identify anonymized records. In a well-known case, the
personal health information of Massachusetts governor William Weld was discov-
ered in a supposedly anonymized public database (Sweeney, 2002; Ji et al., 2014).
By merging overlapping records between the health database and a voter registry,
researchers were able to identify the personal health records of the governor.
In the past decades, differential privacy (Dwork et al., 2006b; Dwork, 2008) has been developed as a standard for privacy preservation. To design a differentially private algorithm, carefully designed noise is often added to the original data or to the outputs of analytic algorithms. By injecting random noise, an individual sample cannot significantly affect the output of a differentially private algorithm, which limits the information gained by an adversary.
1 https://eugdpr.org/.
15.2.1 Definition
Differential privacy (Dwork et al., 2006b; Dwork and Roth, 2014) has been
established as a rigorous standard to guarantee the privacy for algorithms that
access private data. Intuitively, given a privacy budget $\epsilon$, an algorithm preserves $\epsilon$-differential privacy if changing one entry in the data set does not change the log-likelihood of any output of the algorithm by more than $\epsilon$ (see Figure 15.1). Formally, it is defined as follows.
$$\min_{f\in\mathcal{H}} J(f, D) = \frac{1}{n}\sum_{j=1}^{n} l(f(\mathbf{x}_j), y_j) + \lambda r(f), \qquad (15.1)$$
Output Perturbation
The output perturbation method (Chaudhuri et al., 2011) is derived from the sensitivity method proposed by Dwork et al. (2006b), which is a general method for generating a privacy-preserving approximation to any function. For the minimizer $\mathbf{w}^* = \arg\min_{\mathbf{w}} J(\mathbf{w}, D)$, it outputs a predictor
$$\mathbf{w}_{priv} = \mathbf{w}^* + \mathbf{b},$$
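The output perturbation step can be sketched as follows. Since the distribution of b is not restated here, the sketch follows the general recipe used by Chaudhuri et al. (2011) of sampling b with density proportional to exp(−β‖b‖), that is, a uniformly random direction scaled by a Gamma-distributed norm; the link between β, the privacy budget and the sensitivity of the learner is omitted and must be set per the theory.

```python
import numpy as np

# Sketch of output perturbation: noise b is added to the non-private
# minimizer w*. Choice of beta from epsilon and sensitivity is omitted.

def output_perturbation(w_star, beta, rng):
    d = w_star.shape[0]
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)        # uniform on the sphere
    norm = rng.gamma(shape=d, scale=1.0 / beta)   # ||b|| ~ Gamma(d, 1/beta)
    return w_star + norm * direction              # w_priv = w* + b

rng = np.random.default_rng(42)
w_star = np.array([0.5, -1.2, 0.3])
w_priv = output_perturbation(w_star, beta=50.0, rng=rng)
```

Larger privacy budgets correspond to larger beta and therefore less noise added to the released model.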
Objective Perturbation
Differently from the output perturbation method, the objective perturbation
method (Chaudhuri et al., 2011) adds a noise term to the objective function. In-
stead of minimizing J , it learns the predictor by solving the following objective
function as
$$\mathbf{w}_{priv} = \arg\min_{\mathbf{w}} J(\mathbf{w}, D) + \frac{1}{n}\mathbf{b}^T\mathbf{w} + \frac{1}{2}\Delta\|\mathbf{w}\|_2^2,$$
where b is sampled according to (15.2) with $\beta = \epsilon'/2$, and $\epsilon'$, $\Delta$ are computed as:
(1) $\epsilon' = \epsilon - \log\left(1 + \frac{2c}{n\lambda} + \frac{c^2}{n^2\lambda^2}\right)$.
where L is the size of $\mathcal{L}_t$ and $\sigma$ is a constant. The gradient $\tilde{\mathbf{g}}_t$ is used to update the model. Abadi et al. (2016b) prove that, with a carefully selected instance sampling rate, $\sigma$ and number of iterations, the gradient perturbation method guarantees $(\epsilon, \delta)$-differential privacy.
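The gradient perturbation step can be sketched as follows: per-example gradients are clipped to a norm bound C, summed over the lot, perturbed with Gaussian noise of standard deviation σC, and averaged. The constants are illustrative and the privacy accounting is omitted.

```python
import numpy as np

# Sketch of a DP-SGD-style noisy gradient for one lot L_t of size L.

def noisy_gradient(per_example_grads, clip_norm, sigma, rng):
    L = len(per_example_grads)
    clipped = []
    for g in per_example_grads:
        scale = max(1.0, np.linalg.norm(g) / clip_norm)  # clip to norm C
        clipped.append(g / scale)
    noise = sigma * clip_norm * rng.standard_normal(per_example_grads[0].shape)
    return (np.sum(clipped, axis=0) + noise) / L          # noisy g~_t

rng = np.random.default_rng(0)
grads = [np.array([3.0, 4.0]), np.array([0.1, -0.2])]     # lot of size L=2
g_tilde = noisy_gradient(grads, clip_norm=1.0, sigma=1.0, rng=rng)
```

The model is then updated with g~_t exactly as in ordinary SGD.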
(1) Each source domain uses its labeled samples to train a differentially private logistic regression model $\mathbf{w}_{priv}^{s_i}$ under parameter $\epsilon$. All the hypotheses $\{\mathbf{w}_{priv}^{s_i}\}_{i=1}^{m}$ are then sent to the target domain.
(2) Each source domain fetches the public data set, and computes the differentially private "importance weight" vector $\mathbf{v}^{s_i}$ with its unlabeled samples and the public data set. Then $\{\mathbf{v}^{s_i}\}_{i=1}^{m}$ are sent to the target.
(3) The target domain fetches the public data set and computes the non-private "importance weight" vector $\mathbf{v}$.
(4) The target domain computes the "hypothesis weight" vector $\mathbf{v}_H \in \mathbb{R}^m$ such that the Kullback–Leibler divergence between $\mathbf{v}$ and the linear combination of $\{\mathbf{v}^{s_i}\}$ weighted by $\mathbf{v}_H$ is minimized.
(5) The target domain constructs an informative Gaussian prior using $\mathbf{v}_H$ and $\{\mathbf{w}_{priv}^{s_i}\}_{i=1}^{m}$ from the source domains.
(6) The target domain trains a Bayesian logistic regression model with limited labeled target data and the informative Gaussian prior by following Marx et al. (2008), and returns the parameters $\mathbf{w}_{priv}$.
[Figure: the framework of multi-source privacy-preserving transfer learning. Importance weights and hypotheses computed from public, unlabeled and labeled data in the source and target domains are combined into hypothesis weights and a Gaussian prior, which yield the final hypothesis in the target domain.]
(1) Partition the source data set into K disjoint sets based on features.
(2) Scale the samples in each subset with its importance and train K differentially private logistic regression models on these subsets with a total privacy budget $\epsilon_s$ to obtain $\{\mathbf{w}_k^s\}_{k=1}^{K}$ based on a variant of the objective perturbation method.
(3) Split the target data set by samples into two parts of equal size, that is, $D_l$ and $D_h$, and partition both of them into K disjoint sets in the same way as the source data set.
(4) In the K subsets of $D_l$, obtain $\{\mathbf{w}_k^l\}_{k=1}^{K}$ by differentially private hypothesis transfer. Here Guo et al. (2018b) use a different method from Wang et al. (2018d): the same method as in step (2) is applied with the regularization term $r_k(\mathbf{w}) = \frac{1}{2}\|\mathbf{w} - \mathbf{w}_k^s\|_2^2$. The whole privacy budget is set to $\epsilon$.
(5) Construct a meta data set $D_f = \{\sigma(\mathbf{x}_{(1)}^T\mathbf{w}_1^l), \ldots, \sigma(\mathbf{x}_{(K)}^T\mathbf{w}_K^l)\}$ by using all $\{\mathbf{x}, y\} \in D_h$, where $\mathbf{x}_{(k)}$ denotes the part of $\mathbf{x}$ in the k-th subset.
(6) Train an $\epsilon$-differentially private logistic regression with privacy budget $\epsilon$ on $D_f$ and obtain the model parameter $\mathbf{w}_h$.
Figure 15.3 The framework of differentially private transfer learning with feature
split and stacking with K = 3
In the work by Guo et al. (2018b), the privacy of both the source domain (by $\epsilon_s$) and the target domain (by $\epsilon$) is guaranteed. In addition, a variant of the objective perturbation method is used so that feature importance can be leveraged to define the noise while keeping the overall privacy budget fixed. As shown in Guo et al.'s (2018b) work, the proposed method obtains better generalization performance, especially when the feature importance is known.
to train the common model, that is, majority-voted ERM and weighted ERM. In the following sections, we introduce these two approaches based on the following assumptions:
Majority-Voted ERM
The majority-voted ERM method works as follows.
Theoretical analyses show that the perturbed output $\mathbf{w}_{priv}$ is $\epsilon$-differentially private.
Weighted ERM
The main problem with the majority-voted ERM approach is its sensitivity to
the decision of a single party. The weighted ERM method is thus proposed to solve
this problem. Specifically, $\alpha(\mathbf{x})$ is defined to be the fraction of positive votes from the m classifiers for a sample $\mathbf{x}$ as
$$\alpha(\mathbf{x}) = \frac{1}{m}\sum_{i=1}^{m}\mathbb{I}[h_i(\mathbf{x}) = 1],$$
where
$$l_\alpha(\cdot) = \alpha(\mathbf{x})\, l(\mathbf{w}^T\mathbf{x}) + (1 - \alpha(\mathbf{x}))\, l(-\mathbf{w}^T\mathbf{x});$$
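The vote fraction α(x) and the weighted loss above can be sketched directly. The logistic loss used here is an illustrative choice of l; the weighted ERM analysis does not depend on this particular loss.

```python
import math

# Sketch of the weighted ERM quantities: alpha(x) is the fraction of the
# m source classifiers voting positive, and l_alpha weights the losses on
# the positive and negative labels accordingly.

def alpha(x, classifiers):
    m = len(classifiers)
    return sum(1 for h in classifiers if h(x) == 1) / m

def logistic_loss(margin):
    return math.log(1.0 + math.exp(-margin))

def weighted_loss(x, w, classifiers):
    """l_alpha = alpha(x) * l(w.x) + (1 - alpha(x)) * l(-w.x)."""
    a = alpha(x, classifiers)
    margin = sum(wi * xi for wi, xi in zip(w, x))
    return a * logistic_loss(margin) + (1.0 - a) * logistic_loss(-margin)

h1 = lambda x: 1
h2 = lambda x: 1
h3 = lambda x: 0
a = alpha((1.0, 2.0), [h1, h2, h3])    # two of three classifiers vote positive
loss = weighted_loss((1.0, 2.0), (0.5, 0.5), [h1, h2, h3])
```

Because α(x) is a soft vote rather than a hard majority, a single party's decision no longer dominates the label, which is the motivation stated above.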
where $l_j^i(\mathbf{w}^i)$ denotes the loss on the j-th data point in the i-th task with the model parameter $\mathbf{w}^i$, $r_P(\mathbf{P})$ performs the knowledge transfer and $r_Q(\mathbf{Q})$ penalizes the model complexity. Here $r_Q(\mathbf{Q})$ is assumed to be decomposable in terms of tasks, that is, $r_Q(\mathbf{Q}) = \sum_{i=1}^{m} r_Q^i(\mathbf{q}^i)$. Thus, each column $\mathbf{q}^i$ in Q can be distributed to a computing node and updated locally. For P, the gradient information
16
Transfer Learning in Computer Vision
16.1 Introduction
Understanding the visual world around us has been a research focus in AI for decades, and significant contributions have been made. AI has already achieved human-level performance in various visual tasks and contexts, such as face recognition, handwritten character recognition, lip reading and so on. In AI's early days, most visual models were developed on the basis of handcrafted features for general visual tasks such as image classification and video classification. Recently, deep neural network models have become a new trend due to their powerful ability to learn hierarchical feature representations.
However, the advances of visual models heavily rely on large-scale labeled data.
As labeled data are difficult to obtain, transfer learning is desired. To transfer the
knowledge from a source domain to a target domain, feature-based methods are
widely adopted. Some models augment target domain features with source do-
main features. Some models learn a mapping from the source domain to the target
domain. Some other models learn a shared dictionary across the two domains. In
deep neural networks, some learned features are highly “transferable” and hence
the features learned from one data set can naturally be generalized and trans-
ferred to another domain and context.
16.2 Overview
Image data are ubiquitous. For example, users share photos on social networks,
traffic cameras monitor road environments, advertisements are displayed with
images on online-shopping sites and so on. Understanding images plays a key
role in various applications, including self-driving cars, video surveillance, rec-
ommender systems and so on. Thanks to large-scale labeled databases and the
advancement of vision models, remarkable progress has been made in under-
standing the visual world. Yet labeled data are scarce in real-world applications. For example, when a traffic camera is set up in a new city, the distribution of the traffic flow is likely to differ from that in other cities. As manually labeling video frames is time consuming and requires substantial human effort, it is necessary to transfer the knowledge from labeled data in other cities or public data sets, where transfer learning models can be applied.
In this section, we review transfer learning models for vision tasks. We first focus
on the image classification task and then discuss other vision tasks such as video
classification, captioning, object detection and so on. For survey papers that ad-
dress visual domain adaptation, readers may refer to the works by Csurka (2017)
and Patel et al. (2015).
Most transfer learning models are general-purpose, and they can be directly
used in visual applications without particular adjustments. For transfer learning
models pertaining to image classification, there are a plethora of approaches, as
shown in Figure 16.1. They are first categorized into shallow models and deep models. Deep models are usually artificial neural networks that learn hierarchical representations, while shallow models do not. For shallow models, there are mainly four transfer approaches, namely feature augmentation-based, feature transformation-based, parameter adaptation-based and dictionary-based approaches. Deep models can be divided into feature-based and model-based approaches, and these two approaches are often used together. In the following, we introduce these two categories of transfer learning models.
The feature augmentation mappings can be defined as $\phi_s(\mathbf{x}) = (\mathbf{x}^T, \mathbf{x}^T, \mathbf{0}^T)^T$ and $\phi_t(\mathbf{x}) = (\mathbf{x}^T, \mathbf{0}^T, \mathbf{x}^T)^T$, where 0 denotes a zero vector of dimension d and $\phi_s$ and $\phi_t$ denote the feature augmentation mappings in the source and target domains, respectively. Knowledge transfer across domains is achieved by considering domain-shared and domain-specific features simultaneously. This method can be extended to the multiple-domain setting.
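The augmentation idea can be sketched in a few lines. The block ordering (shared block first, then source-specific, then target-specific) is a common convention assumed here for illustration.

```python
import numpy as np

# Sketch of feature augmentation: source features map to (x, x, 0) and
# target features to (x, 0, x), so the first block is shared across domains
# and the other two blocks are domain specific.

def phi_s(x):
    z = np.zeros_like(x)
    return np.concatenate([x, x, z])   # shared + source-specific + zeros

def phi_t(x):
    z = np.zeros_like(x)
    return np.concatenate([x, z, x])   # shared + zeros + target-specific

x = np.array([1.0, -2.0])
xs, xt = phi_s(x), phi_t(x)
```

A linear classifier trained on the augmented space can then weight the shared block for patterns common to both domains and the private blocks for domain-specific patterns.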
The feature augmentation-based method has been extended by considering in-
termediate subspaces that connect the source and target domains (Gopalan et al.,
2011, 2014; Gong et al., 2012a). The geodesic flow sampling method (Gopalan
et al., 2011) views the generative subspaces of the source and target domains as
points on a Grassmann manifold, and then samples points along the geodesic to obtain intermediate subspace representations. Original feature representations
from the two domains are projected into these subspaces and they are concate-
nated into high-dimensional feature representations. A discriminative classifier
is constructed on the resulting feature representation. Instead of sampling finite
subspaces, the geodesic flow kernel method (Gong et al., 2012a) defines a kernel
function that integrates an infinite number of subspaces that lie on the geodesic
flow from the source domain to the target one. A more general framework is proposed by Gopalan et al. (2014), which considers feature representations in a reproducing kernel Hilbert space using kernel methods and a low-dimensional manifold representation using Laplacian eigenmaps.
where $c_i$ denotes the i-th supervision constraint. In the work by Saenko et al. (2010), the regularizer is defined as $r(\mathbf{W}) = \mathrm{tr}(\mathbf{W}) - \log\det(\mathbf{W})$, and two types of constraints, namely class-based constraints and correspondence-based constraints, are considered. For the class-based constraints, a random labeled sample is selected from the source domain and the target domain, respectively. The distance between the two samples should be smaller than a threshold if they have the same label; otherwise, the distance should exceed a threshold. Alternatively, correspondence-based constraints can be constructed if a relationship other than the label information of two samples is known. The similarity and dissimilarity constraints help learn a domain-invariant transformation. The constrained optimization problem defined in (16.2) is first converted to an unconstrained problem, which is then solved by an information-theoretic metric learning method.
Later, Kulis et al. (2011) propose a more general formulation where the model
proposed by Saenko et al. (2010) becomes a special case. It learns an asymmetric
nonlinear transformation, which makes the model capable of handling changes
in the feature type and dimension.
Feature transformation methods can work even when the source and target domains have different feature representations, a setting that belongs to heterogeneous transfer learning. Dai et al. (2008) present one of the first such works, known as translated learning, where the training data and test data can come from totally different feature spaces. For example, the source can be text while the target can be text or audio. A main idea of this approach is to obtain a "dictionary" that can act as a translator linking the different feature spaces.
An intuitive idea for translated learning is to translate some or all of the training
data from the source domain as well as the target domain into a common target
feature space, such that learning can be done in this single space. This approach
can be used for applications such as cross-lingual text classification and cross-
domain image understanding. It can also be used to link the knowledge between
text and images, in applications where one can use text to explain the semantics of
images. Compared with the machine translation methods typically used in natural
language understanding, the key difference lies in how different feature spaces are
connected; instead of focusing on the sequential nature of texts to be translated,
in translated learning the target data may be of any order.
Dai et al. (2008) present a solution to the translated learning problem, which
is to make the best use of available data to construct a dictionary or translator.
While the target data alone may not be sufficient in building a good classifier for
the target domain, by leveraging the available labeled data in the source domain,
we can indeed build effective translators, which in turn can enhance the training
data in the target domain. An example is to translate between the text and image
feature spaces using the social tagging data available on the World Wide Web.
The translated learning model assumes that the learning tasks are represented by a common label space C, which is the same in both the source and the target domains. The learning process can be represented using a Markov chain c → f → x, where f represents the features of the data instances x. The source domain data x_s are represented by the features f_s in the source feature space, while the test data in the target domain x_t are represented by the features f_t in the target feature space. Translated learning models the learning in the source space through a Markov chain c → f_s → x_s, which can be connected to another Markov chain c → f_t → x_t in the target space. An important feature of translated learning is to show how to connect these two paths, so that a new chain c → f_s → f_t → x_t can be formed to translate the knowledge from the source space to the target space. In this process, the mapping f_s → f_t acts as a feature-level translator. The algorithm, known as TLRisk, exploits the risk minimization framework in the work by Lafferty and Zhai (2001) to model translated learning.
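The chain composition can be sketched numerically: with the conditional distributions stored as matrices, p(f_t | c) is just a matrix product, and risk minimization under a 0/1 loss with a uniform class prior reduces to an argmax. All probability tables below are hypothetical toy values, not from the original work:

```python
import numpy as np

# Hypothetical conditional probability tables for the chain c -> f_s -> f_t
# (two classes, two source features, two target features; rows sum to 1).
p_fs_given_c = np.array([[0.8, 0.2],    # p(f_s | c = 0)
                         [0.1, 0.9]])   # p(f_s | c = 1)
p_ft_given_fs = np.array([[0.7, 0.3],   # feature-level translator p(f_t | f_s)
                          [0.2, 0.8]])

# Composing the chain: p(f_t | c) = sum over f_s of p(f_t | f_s) p(f_s | c).
p_ft_given_c = p_fs_given_c @ p_ft_given_fs

# Risk-minimization prediction for a target instance observed as feature f_t,
# assuming a uniform class prior and 0/1 loss.
def predict(ft_index):
    return int(np.argmax(p_ft_given_c[:, ft_index]))
```

The translator matrix is what connects the two feature spaces; in practice it would be estimated, for example, from cooccurrence data such as social tags linking text and images.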
We first express the overall objective of translated learning. We can use the risk function R(c, x_t) to measure the risk of classifying x_t into the category c. To predict the label for an instance x_t, we need only find the class label c that minimizes the risk function R(c, x_t), so that the hypothesis h_t can be estimated with

h_t(x_t) = arg min_{c ∈ C} R(c, x_t).    (16.3)

The risk function R(c, x_t) can be formulated as the expected loss when c and x_t are relevant. Since θ_C only depends on c and θ_{X_t} only depends on x_t, we can use p(θ_C | c) to replace p(θ_C | c, x_t), and use p(θ_{X_t} | x_t) to replace p(θ_{X_t} | c, x_t).
Dai et al. (2008) represent the risk function as follows:

R(c, x_t) = ∫_{Θ_C} ∫_{Θ_{X_t}} L(θ_C, θ_{X_t}, r = 1) p(θ_C | c) p(θ_{X_t} | x_t) dθ_{X_t} dθ_C,    (16.4)

where Θ_C and Θ_{X_t} are the model spaces corresponding to the label space C and the target data space X_t, respectively, and L(θ_C, θ_{X_t}, r = 1) represents the loss function when c and x_t are relevant.
The adaptive support vector machine (A-SVM) learns a target model by adding a perturbation function to the source model, that is, f_t(x) = f_s(x) + δf(x), where f_s, f_t and δf denote the source domain model, the target domain model and the delta function, respectively. Further, the delta function is defined as δf(x) = wᵀφ(x), where w denotes the parameter of the delta function and the mapping φ projects a data sample into a high-dimensional space. To estimate the parameter w, the objective of A-SVM is extended from that of standard SVMs as
min_w  ½‖w‖² + C Σ_{i=1}^{n_t} ξ_i,    (16.5)
where ξ_i measures the classification error of the i-th labeled target sample and C controls the trade-off between the two terms. Problem (16.5) learns a target domain model that correctly classifies the labeled samples in the target domain while staying close to the source domain model at the same time.
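The A-SVM idea can be illustrated with a minimal numpy sketch in which φ is simply the identity map, the "source" model is a fixed linear function, and the hinge-loss objective is minimized by subgradient descent; the data, learning rate and the choice of solver are illustrative stand-ins rather than the original formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

# A fixed "source" linear model, standing in for a pretrained classifier.
w_src = np.array([1.0, -0.5])
f_src = lambda X: X @ w_src

# A few labeled target samples whose true boundary differs from the source's.
X_t = rng.normal(size=(40, 2))
y_t = np.sign(X_t @ np.array([1.0, 1.0]) + 1e-9)

# A-SVM idea: f_t(x) = f_s(x) + w . phi(x), with phi the identity map here.
# Learn w by subgradient descent on 1/2 ||w||^2 + C * sum_i hinge_i.
w, C, lr = np.zeros(2), 1.0, 0.01
for _ in range(500):
    margins = y_t * (f_src(X_t) + X_t @ w)
    active = margins < 1.0                       # samples incurring hinge loss
    grad = w - C * (y_t[active][:, None] * X_t[active]).sum(axis=0)
    w -= lr * grad

f_tgt = lambda X: f_src(X) + X @ w               # adapted target model
acc = np.mean(np.sign(f_tgt(X_t)) == y_t)
```

The regularizer ½‖w‖² keeps the learned perturbation small, which is exactly what ties the target model to the source model.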
Domain transfer SVM, proposed by Duan et al. (2009), improves over the A-SVM by simultaneously reducing the domain discrepancy, measured by the maximum mean discrepancy (MMD), and learning a target decision function. Adaptive multiple
kernel learning (Duan et al., 2012c) learns a kernel function based on multiple
base kernels. There are also methods that learn the feature transformation and
classifier parameters jointly (Shi and Sha, 2012; Donahue et al., 2013; Hoffman
et al., 2013).
Dictionary-Based Approach
Dictionary learning represents high-dimensional data as a linear combination
of basic elements. The basic elements are referred to as “atoms” and the atoms
compose a dictionary. Dictionary learning has been successfully applied in vari-
ous vision tasks, such as face recognition, image reconstruction, image de-blurring
where V_s and V_t denote the sparse representations of X_s and X_t over the dictionary K, respectively. Meanwhile, the regularization cost C_2, defined in (16.7), is introduced to ensure that the projection does not lose too much information.
Combining (16.6) and (16.7) and applying algebraic calculations, the overall op-
timization problem is formulated as
where λ is a positive constant, T_0 denotes the sparsity level, ‖·‖_0, the ℓ_0 norm, is defined as the number of nonzero elements in a vector, and W̃, X̃ and Ṽ are defined as

W̃ = [W_s, W_t],   X̃ = diag(X_s, X_t) (the block-diagonal matrix with blocks X_s and X_t),   Ṽ = [V_s, V_t].
This framework can be extended to a kernelized version and it can handle multi-
ple source domains as well.
Figure 16.2 A typical combination of parameter sharing and fine-tuning in a deep network: for a source network x_s → conv1–3 → conv4–5 → fc6 → fc7 → fc8 → y_s, the target network for x_t keeps conv1–3 frozen, fine-tunes conv4–5, and learns the fully connected layers fc6–fc8 on the target data
Model-based transfer learning via parameter sharing and fine-tuning is the most widely adopted approach. This is because parameters in a deep neural network are transferable in the sense that they can suit multiple domains (Donahue et al., 2014; Oquab et al., 2014; Yosinski et al., 2014). This generalization ability of parameters is referred to as "transferability." Two popular model-based transfer learning methods are parameter sharing and fine-tuning. Parameter sharing assumes that parameters are highly transferable, and it directly copies the parameters of the source network to the target network. Fine-tuning assumes that the parameters of the source network are useful, but they need to be trained with the target data to better adapt to the target domain.
Feature-based transfer learning models learn a common feature space that is
shared by both the source and target domains (Long et al., 2015; Ganin et al.,
2016). For deep neural networks, feature-based transfer learning is usually used
in conjunction with model-based transfer learning. A typical example is shown
in Figure 16.2. The first three layers are copied from a source network and this
corresponds to parameter sharing in model-based transfer learning and also fea-
ture sharing in the feature-based transfer learning. The next two layers are initial-
ized with parameters from the source network and they are fine-tuned during the
training process. The last three layers are domain-specific and learned based on
the target data.
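This layer-reuse recipe can be sketched with a tiny two-layer numpy network; the "pretrained" lower-layer weights below are random placeholders standing in for genuinely transferred parameters, and a squared-error head replaces a real classification loss:

```python
import numpy as np

rng = np.random.default_rng(1)

def forward(X, W1, W2):
    H = np.tanh(X @ W1)              # lower layers: shared feature extractor
    return H, H @ W2                 # upper layer: task-specific head

# Stand-in for lower-layer weights transferred from a pretrained source network.
W1_src = rng.normal(scale=0.5, size=(4, 8))

# A small labeled target data set.
X = rng.normal(size=(20, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# Parameter sharing: copy W1 from the source and freeze it; only the newly
# added head W2 receives gradient updates (fine-tuning of the last layer).
W1 = W1_src.copy()
W2 = np.zeros((8, 1))
for _ in range(2000):
    H, out = forward(X, W1, W2)
    residual = out[:, 0] - y                        # squared-error head
    W2 -= 0.1 * H.T @ residual[:, None] / len(X)    # W1 is never touched

_, out = forward(X, W1, W2)
acc = np.mean((out[:, 0] > 0.5) == (y == 1.0))
```

Unfreezing W1 and updating it with a small learning rate would turn this parameter-sharing scheme into full fine-tuning.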
He et al. (2018b) show that, by training from scratch, some vision tasks achieve performance comparable to the fine-tuning approach based on the ImageNet data set. We think this phenomenon occurs under the prerequisite that the target domain has enough training data; when the target domain has little training data, the fine-tuning approach can perform better than the purely supervised learning approach.
16.3 Transfer Learning for Medical Image Analysis
• Small data and expensive labeling: Medical image data are collected through special equipment in very private contexts; hence, medical image data sets are often small, on the order of hundreds of samples. This is much smaller than general image data sets such as ImageNet and CIFAR. Moreover, the labeling of medical images often relies on experienced and well-trained human experts such as doctors and radiologists, making labeling much more expensive.
In the following, we discuss how transfer learning helps medical image analysis (MIA) tasks under different settings.
Figure 16.3 MRI images from the public BRATS data set (Menze et al., 2015; Bakas et al., 2017), shown in 3D and in multiple modalities. The upper row shows three different views of the 3D MRI image in T1, and the lower row shows the images of the T1, T2 and FLAIR modalities.
Despite the differences between general images and medical images, the transferred representations help the learned model achieve comparable or even better performance than human experts in many diagnosis tasks, including retinal diseases (Kermany et al., 2018), pneumonia (Rajpurkar et al., 2017) and skin cancer (Esteva et al., 2017).
Taking the diagnosis of retinal diseases as an example (Kermany et al., 2018), fine-tuning from the ImageNet data set has the following four steps: (1) the Inception network is chosen as the backbone model and randomly initialized at the beginning. (2) The Inception network is first trained on the ImageNet data set with a final classifier layer outputting one of 1,000 classes. (3) After pretraining on the ImageNet data set, the last classifier layer with 1,000 output nodes is replaced by another randomly initialized classifier layer that predicts one of four retinal statuses, while all previous layers remain unchanged. (4) The network is then fine-tuned on the retinal data: the last several fully connected layers are trained while all previous CNN layers are frozen. The whole process is illustrated in Figure 16.4. It turns out that this strategy can achieve a high accuracy of 93.4 percent with limited labeled data.
Figure 16.4 An example of fine-tuning from the ImageNet data set to achieve comparable or better performance than human experts on retinal disease diagnosis (adapted from Kermany et al. [2018])
Two training protocols of transfer learning are studied: (1) fine-tuning: the model is initialized from a pretrained network and then trained with labeled data from the target domain; (2) off-the-shelf: the network pretrained on the source domain is frozen except for the last classifier layer, which is randomly initialized and trained on the target data. In the two studied problems, that is, thoracoabdominal lymph node detection and interstitial lung disease classification, transfer learning making use of the ImageNet data can achieve state-of-the-art performance.
Samala et al. (2016) study transfer learning between two similar medical imaging domains for mass detection in breast cancer. The proposed method develops a computer-aided detection system for masses in digital breast tomosynthesis volumes, using a CNN to transfer knowledge from mammograms. Empirical studies show that transfer learning improves the area under the curve score.
Dou et al. (2018) observe that the domain shift between imaging modalities lies mainly in low-level features (e.g., gray-scale values) rather than high-level features (e.g., geometric structures). Therefore, a domain adaptation module (DAM), denoted M, is introduced to replace the low-level layers for the target domain, and a domain critic module (DCM), denoted D, concatenates multiple high-level features as its input and learns to tell the source domain from the target domain. The DCM and DAM are trained together by optimizing an adversarial loss: M tries to produce target features that D cannot distinguish from source features, while D tries to tell them apart. As shown in Figure 16.5, the DCM and DAM work together to learn how to align high-level features between the source and target domains, and then the high-level layers of the segmentation model can be reused by the target domain.
Figure 16.5 The framework of the unsupervised domain adaptation from MRI
segmentation to CT segmentation (adapted from Dou et al. [2018])
17
Transfer Learning in Natural Language Processing
17.1 Introduction
Transfer learning is often referred to as domain adaptation in natural language
processing (NLP) tasks. Transfer learning plays a significant role in various NLP
tasks, especially when there are limited data for training a model. In this case,
transfer learning can help these tasks by leveraging the knowledge gained from
other related learning tasks.
This chapter gives an overview of transfer learning in NLP in two sections. In the first section, we give a general introduction to how transfer learning can be used in NLP tasks. In the second section, we focus on how transfer learning helps sentiment analysis. The next chapter will be devoted entirely to transfer learning in dialog systems, which is a task in NLP.
(1) Freezing: the neural network trained on a source domain is applied to the target domain without any modification.
(2) Fine-tuning: a neural network is first trained on a source domain. This network is then applied to the target domain with the parameters of some layers fixed, while the parameters of the other layers are learned on the target domain data. An illustration of the fine-tuning method is shown in Figure 17.1.
Figure 17.1 An illustration of the fine-tuning method by training the top layer on
the target domain with other layers fixed
Word embedding models such as word2vec (Mikolov et al., 2013a) and GloVe (Pennington et al., 2014) are widely used, and word representations pretrained from a large source data set are used to initialize the word embedding layer in a target model for many NLP tasks. When the size of the target data set is much smaller than that of the source data set used for the word embeddings, it is observed that freezing the representations outperforms fine-tuning them (Seo et al., 2016); otherwise, the fine-tuning method is better than the freezing method (Kim, 2014).
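The freezing-versus-fine-tuning choice for an embedding layer amounts to whether gradient updates are scattered into the embedding table; a toy sketch, with random stand-ins for pretrained word2vec/GloVe vectors and a hypothetical upstream gradient:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for pretrained word vectors (vocabulary of 5 words, dimension 3).
pretrained = rng.normal(size=(5, 3))

token_ids = np.array([0, 2, 2, 4])               # tokens seen in one batch
grad = rng.normal(size=(len(token_ids), 3))      # upstream gradient per token

# Freezing: the embedding table is never updated.
emb_frozen = pretrained.copy()

# Fine-tuning: scatter-add the scaled negative gradients into exactly the
# rows of the embedding table that were used in the batch (row 2 twice).
emb_tuned = pretrained.copy()
np.add.at(emb_tuned, token_ids, -0.1 * grad)
```

Note that under fine-tuning only the rows for words that actually occur in the target data move away from their pretrained values, which is one reason freezing can be safer when the target data set is small.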
Min et al. (2017) train a BiDAF model (Seo et al., 2016) on a source data set, Stanford question answering (SQuAD) (Rajpurkar et al., 2016), which is a span-supervised question answering (QA) data set, and then adapt it to two other QA data sets, WikiQA and SemEval 2016. Moreover, specific layers trained on a sentence scoring task are applied to a different task, the entailment task.
Devlin et al. (2018) pretrain the proposed bidirectional encoder representations from transformers (BERT) model on two tasks, that is, masked language modeling and next-sentence prediction, and then fine-tune the pretrained BERT model on eleven NLP tasks/data sets, including multi-genre natural language inference, Quora question pairs, question natural language inference, the Stanford sentiment treebank, the corpus of linguistic acceptability, the semantic textual similarity benchmark, the Microsoft research paraphrase corpus, recognizing textual entailment, Winograd natural language inference, the SQuAD data set, the CoNLL 2003 named entity recognition (NER) data set and the situations with adversarial generations data set, achieving state-of-the-art performance.
A simple way to combine the source and target domains is to optimize a weighted sum of their losses:

J = λ J_t + (1 − λ) J_s,    (17.1)

where J_t and J_s are the individual loss functions of each domain, and λ ∈ (0, 1) is a regularization parameter to balance the loss functions of the two domains.
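Optimizing (17.1) for a model shared by both domains can be sketched for a toy linear regression problem; all data below are synthetic and the choice of gradient descent is merely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic shared regression task: abundant source data, scarce target data.
w_true = np.array([1.0, 2.0, -1.0])
Xs = rng.normal(size=(100, 3)); ys = Xs @ w_true + 0.1 * rng.normal(size=100)
Xt = rng.normal(size=(10, 3));  yt = Xt @ w_true + 0.1 * rng.normal(size=10)

lam = 0.7   # weight on the target loss J_t, as in (17.1)

# Gradient descent on J = lam * J_t + (1 - lam) * J_s for a shared model w.
w = np.zeros(3)
for _ in range(300):
    grad_t = 2 * Xt.T @ (Xt @ w - yt) / len(yt)     # gradient of J_t
    grad_s = 2 * Xs.T @ (Xs @ w - ys) / len(ys)     # gradient of J_s
    w -= 0.05 * (lam * grad_t + (1 - lam) * grad_s)
```

With only ten target samples, the source term acts as a data-driven regularizer that stabilizes the estimate of the shared parameters.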
In the rest of this section, we will introduce the use of the MTL approach on
NLP tasks by highlighting different strategies and rationales.
Machine Translation
Models that use MTL for machine translation can be classified into two categories.
The first category trains a unified translation model under the MTL framework, thus simultaneously translating from one source language into several different target languages. The general model can be regarded as a variant of the encoder-decoder framework. Dong et al. (2015) build a recurrent neural network (RNN)-based encoder-decoder model with multiple target tasks, each of
which is for a target language. Different tasks share the same encoder, as shown
in Figure 17.2. Different from Dong et al. (2015), Zoph and Knight (2016) define a
specific encoder for each target language and jointly train them. Malaviya et al.
(2017) build a massive many-to-one neural machine translation system from 1,017
languages to English. Moreover, Johnson et al. (2016b) jointly train encoders and
decoders.
Figure 17.2 An encoder-decoder model in which decoders for different target languages share the same English encoder
Multilingual Tasks
Similar to machine translation, it is often beneficial to use the MTL approach
to jointly train models for various NLP tasks such as POS tagging (Fang and Cohn,
2017), dependency parsing (Duong et al., 2015; Guo et al., 2016b), discourse
segmentation (Braud et al., 2017), sequence tagging (Yang et al., 2016), NER
(Gillick et al., 2016) and document classification (Pappas and Popescu-Belis, 2017).
Relation Extraction
For relation extraction, the information related to different relations or roles can
often be shared. Specifically, the knowledge learned from other types of relations
or even other tasks, could be transferred to the target relation and help to improve
the performance of the relation extractor.
238 Transfer Learning in Natural Language Processing
Figure 17.3 Learning three NLP tasks (machine translation, POS tagging and NER) in a neural network based on the MTL approach, with a shared encoder, shared attention, a shared decoder and task-specific softmax output layers
Question Answering
For the NLP task of QA and the task of reading comprehension, many effective
approaches are based on RNNs that learn a mapping from a text document and
a given question to an answer. Conventional approaches regard a document as
a long sentence and encode it word by word. However, model quality and train-
ing efficiency would decrease once given a relatively long document. Inspired by
studies on how people do reading comprehension by first skimming the docu-
17.2 Transfer Learning in NLP 239
ment, identifying relevant parts and carefully reading these parts, it is beneficial
to jointly learn different parts for the QA and reading comprehension tasks.
Choi et al. (2017) present a framework with two parts for the QA task: a simple and fast model for sentence selection, and a more complex model for answer generation based on the question and the selected sentences. These two parts are learned jointly.
Wang et al. (2018c) jointly train a ranker, which learns to rank retrieved passages, and an answer-extraction reader for open-domain QA, where the model is given a question and can access a large corpus (e.g., Wikipedia) instead of a preselected passage.
Semantic Parsing
A long-standing problem in NLP research is semantic parsing, which aims to map natural language texts to meaning representations. A challenge is the limited data resources, because parsers are often manually designed and the texts are labeled by humans. Thus, for semantic parsing, an MTL approach is often adopted when multiple labeled text data sets are available, to leverage as much shared information as possible.
Guo et al. (2016a) describe a universal framework that can exploit multi-typed
source treebanks to improve the parsing of a target treebank. Specifically, the pro-
posed framework considers two kinds of source treebanks, including multilingual
universal treebanks and monolingual heterogeneous treebanks.
Peng et al. (2017) learn three semantic dependency graph formalisms, including
the DELPH-IN bi-lexical dependencies representation, Enju predicate-argument
structures representation and Prague semantic dependencies representation, in
parallel.
Fan et al. (2017) jointly learn different Alexa-based semantic parsing formalisms
with different levels of parameter sharing. They explore three multitask archi-
tectures for sequence-to-sequence modeling, including one-to-many, one-to-one
and one-to-shared-many.
Zhao and Huang (2017) propose the first end-to-end discourse parser that jointly trains a syntactic parser and a discourse parser, as well as the first syntacto-discourse treebank, built by integrating the Penn Treebank with the RST Treebank.
Representation Learning
For learning the vectorized representations of text, for example, words and sentences, the challenge is to define the objective function. Most existing representation learning models are based on a single task with one loss function, such as predicting the next word (Mikolov et al., 2013b) or sentence (Kiros et al., 2015), or training on a certain task such as entailment (Conneau et al., 2017) or machine translation (McCann et al., 2017). Thus, the performance on these tasks is often limited by the small amount of training data. Rather than learning representations from only one task, learning from multiple tasks can leverage more supervised data from many tasks. Moreover, the use of MTL also benefits from a regularization effect, such as reducing the risk of overfitting, thus making the learned representations universal across tasks.
Jernite et al. (2017) adopt three auxiliary tasks for sentence representation learn-
ing. The first task is to learn to arrange the order of sentences in a passage. The sec-
ond task is to select the next sentence out of five candidates given the first three
sentences of a paragraph. The third task is to recover the conjunction category in sentences.
Hashimoto et al. (2017) introduce a joint many-task (JMT) model to utilize lin-
guistic hierarchies by successively growing the depth of the model to solve in-
creasingly complex tasks, as shown in Figure 17.4. The JMT model can be trained
in an end-to-end manner for POS tagging, chunking, dependency parsing, semantic relatedness and textual entailment. In the JMT model, higher layers have
shortcut connections to lower layers.
Figure 17.4 The JMT model (adapted from Hashimoto et al. [2017])
Chunking
Chunking (Chomsky, 1956), an effective NLP technique in which linguistic structures are grouped into hierarchical components, has been shown to benefit from being jointly trained with low-level tasks such as POS tagging.
Collobert and Weston (2008) first propose a general DNN architecture, which
enables the model to learn multiple NLP tasks such as semantic role labeling
(SRL), NER, POS, chunking and language modeling simultaneously.
Søgaard and Goldberg (2016) show that low-level tasks such as POS tagging and NER can be learned to generate feature representations at the bottom layers of neural networks when used as auxiliary tasks for chunking. Moreover, the authors show how this hierarchical architecture can be used for domain adaptation.
Ruder et al. (2017) define chunking, NER and a simplified version of SRL as main
tasks, and pair them with POS tagging as an auxiliary task. The collection of tasks is then used in an MTL setting.
17.3 Transfer Learning in Sentiment Analysis
Users express opinions and attitudes in natural language texts, for example, as comments about products on e-commerce sites, or opinions about social events as social media messages. Sentiment analysis aims to take these comments as input and produce their polarity, such as positive or negative, as output. In this section, we will introduce how transfer learning techniques are applied to sentiment analysis.
As mentioned earlier, users tend to use natural language texts to express
opinions and attitudes about products or services on social media or review sites.
Thus, it is helpful to build models that take these user comments and correctly
interpret their emotional tendency. Sentiment analysis, which aims to automat-
ically determine the overall sentiment polarity of the text, achieves this objec-
tive by producing positive or negative polarity as output. As the needs for under-
standing user feedback in modern society grow, sentiment analysis has attracted
increasing attention over the past decades (Pang et al., 2002; Hu and Liu, 2004;
Pang and Lee, 2008; Liu, 2012). The characterization of sentiment polarity can be
deployed in practical systems that gauge market reaction and summarize opinion
in various scenarios such as Web pages, discussion boards and blogs. Successful
sentiment analysis can greatly facilitate service-oriented societies.
Supervised learning has been widely used in sentiment analysis using
techniques such as DNNs. These supervised models often require massive labeled
data as the training data to build sentiment classification models for a specific
domain (Wang and Manning, 2012; Socher et al., 2013b; Tang et al., 2015). A
major bottleneck in building a sentiment model is the cost of annotating new corpora for new application domains, as data labeling in these new domains may be time-consuming and expensive. This is a typical cold-start problem.
To address the aforementioned cold-start problem, cross-domain sentiment classification is desirable. Cross-domain sentiment analysis aims to leverage the knowledge from a related source domain that has abundant labeled data to improve the performance in a target domain that has no or few labeled data. Because cross-domain sentiment analysis can speed up the launch of a new service, it has become a tool of choice in many fast-growing industry sectors.
For cross-domain sentiment classification, a main challenge is that features in the source and target domains may be mismatched or have discrepancies in their meanings. This is caused by variations of sentiment expression across domains. For example, "light" has a positive meaning in one domain but a negative one in another. An example is illustrated in Figure 17.5.
Therefore, in cross-domain sentiment analysis, we envision a scenario in which,
once we have obtained a good sentiment classifier in a source domain, we wish to
transfer the knowledge to a target domain with minimal human labeling of data.
For example, we may have a good sentiment model for the movie domain already,
and we wish to transfer the knowledge to a new domain, the electronics domain.
We need to overcome several challenges in cross-domain sentiment classifica-
tion. First, the target domain usually contains sentiment words or phrases that do
not appear or rarely appear in the source domain. For example, in the movie domain, the words engaging and thoughtful are used to express positive sentiment,
whereas insipid and plotless often indicate negative sentiment. However, in the
electronics domain, glossy and responsive are often used to express positive senti-
ment, whereas the words fuzzy and blurry are used to express negative sentiment.
Second, the semantic meaning of a word often differs from one domain to an-
other. For example, lightweight is usually used to express a positive sentiment to-
ward portable electronic devices in the electronics domain because a lightweight
device is easier to carry. However, the same word has a negative sentiment in the
movie domain since movies that do not invoke deep thoughts in viewers are con-
sidered to be lightweight. Therefore, due to this domain discrepancy, a sentiment
classifier trained in a source domain may not work well when directly applied to a
target domain.
In the following sections, we introduce some of the representative techniques
for cross-domain sentiment analysis based on transfer learning.
• Pivot: Blitzer et al. (2006) introduce the concept of the pivot. Pivots are features with two attributes. First, they occur frequently in both domains. Second, they behave in the same way for discriminative learning in both domains, that is, their semantics and polarity are preserved across domains.
• Non-pivot: Blitzer et al. (2006) also propose the concept of a non-pivot as the opposite of a pivot. Non-pivots are features with two characteristics. First, in terms of occurrence, non-pivots are much more frequent in one domain than in the other, and their existence highly depends on the domain. Second, the semantic meanings of non-pivots vary across domains.
Although some sentiment words appear in both a movie review and an electronics review, that is, the pivots like "great" and "awful," many words are totally different, such as "glossy" and "responsive." Likewise, many words that are useful for the movie domain, such as "engaging" and "thoughtful," are not useful for sentiment classification in the electronics domain. The key intuition of structural correspondence learning (SCL) is that even though ("engaging," "thoughtful") and ("glossy," "responsive") are domain specific, if they have high correlations with pivot words such as "great" and low correlations with words like "awful," then they can still be aligned with these pivot words, and thus with each other.
Figure 17.6 An illustrated example for the multiple pivot prediction tasks
Given the labeled data from a source domain and unlabeled data from both
domains, SCL first selects m pivots, which occur frequently in both domains and
have high mutual-information values with sentiment labels. These pivots act as
a bridge between the source and target domains. Then, all other n features are
regarded as the non-pivots. As shown in Figure 17.6, SCL models the correlation
between the pivots and the non-pivots by utilizing m linear pivot predictors to
predict the occurrence of each pivot from both domains and induces a projected
feature space that works well for both domains.
We can denote the weight vector of the i-th pivot predictor by w_i ∈ R^n; positive entries in w_i then indicate non-pivots that are positively correlated with the i-th pivot. All the weight vectors can be arranged into a matrix W = [w_i]_{i=1}^m ∈ R^{n×m}, and Θ ∈ R^{n×k} consists of the top k left singular vectors of W. Here Θ contains the principal predictors for the weight space. Given a feature vector x ∈ R^d, where d = m + n, let DS(x) denote its non-pivot part. SCL applies the projection Φ(x) = DS(x)Θ to obtain k new features and learns a sentiment predictor for the augmented instance ⟨x, DS(x)Θ⟩, where ⟨·, ·⟩ denotes the concatenation operation.
Spectral feature alignment (SFA) aims to align domain-specific words from different domains into meaningful clusters by using the pivots as a bridge. The SFA algorithm transforms the cooccurrence relations between pivots and non-pivots into a bipartite graph and adapts a spectral clustering algorithm (Ng et al., 2002) on the bipartite graph to solve the domain mismatch problem.
Figure 17.7 An example of the bipartite graph between pivots and non-pivots
SFA constructs the bipartite graph as follows. It first chooses l words that have high term frequencies in both domains and low mutual-information values with the domains as pivots, and the remaining m − l words are treated as non-pivots, where m is the total number of words. Let W_P and W_NP denote the vocabularies of the pivots and non-pivots, respectively. SFA leverages the cooccurrence relationship between pivots and non-pivots to construct a bipartite graph G = (V_P ∪ V_NP, E). In G, each vertex in V_P corresponds to a pivot in W_P and each vertex in V_NP corresponds to a non-pivot in W_NP. An edge in E connects a vertex in V_P and a vertex in V_NP. For each edge e_ij ∈ E, there is a non-negative weight r_ij that measures the relation between the pivot w_i ∈ W_P and the non-pivot w_j ∈ W_NP according to their cooccurrence. In this way, the weights form a cooccurrence matrix M ∈ R^{(m−l)×l}. An example of the bipartite graph is shown in Figure 17.7. Finally, the constructed bipartite graph models the intrinsic relationship between pivots and non-pivots.
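Constructing the cooccurrence matrix M from raw documents can be sketched directly; the toy corpora and pivot choices below are illustrative and use simple within-document cooccurrence counts as the weights r_ij:

```python
import numpy as np

# Toy corpora; the pivots are domain-independent sentiment words.
docs = [
    "great movie engaging plot",        # source (movie) domain
    "awful movie plotless story",
    "great phone responsive screen",    # target (electronics) domain
    "awful phone blurry screen",
]
pivots = ["great", "awful"]
vocab = sorted({w for d in docs for w in d.split()})
nonpivots = [w for w in vocab if w not in pivots]

# M[j, i] counts cooccurrences of non-pivot j with pivot i within a document,
# playing the role of the bipartite edge weights r_ij.
M = np.zeros((len(nonpivots), len(pivots)))
for d in docs:
    words = set(d.split())
    for i, p in enumerate(pivots):
        if p in words:
            for j, q in enumerate(nonpivots):
                if q in words:
                    M[j, i] += 1
```

Spectral clustering on the bipartite graph encoded by M is what then groups, for example, "engaging" (movie domain) with "responsive" (electronics domain), since both cooccur with "great."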
Autoencoder-Based Models
Autoencoder-based models aim to align domain-specific features based on a reconstruction criterion and to learn intermediate representations that are shared across domains. In the following, we introduce several representative autoencoder-based models.
An ordinary autoencoder consists of an encoder f and a decoder g, and it is trained by minimizing the reconstruction error Σ_{i=1}^{n} L(x_i, g(f(x_i))), where x_i is the i-th training sample out of the total n training samples.
The denoising autoencoder (DAE) (Vincent et al., 2008) is an alternative to the
ordinary autoencoder. In DAE, each input x is stochastically corrupted to x̃, and its
objective is to reconstruct the input data from their corruptions, that is, to minimize
a denoising reconstruction error L(x, g(f(x̃))), as shown in Figure 17.9. Multiple
DAEs can be stacked into a deep learning architecture, which is called the stacked
DAE (SDA).
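A minimal denoising autoencoder can be sketched in a few lines of NumPy; this is a didactic single-hidden-layer version with masking noise, not the SDA architecture used by Glorot et al. (2011).

```python
import numpy as np

def train_dae(X, hidden, corrupt=0.3, lr=0.1, epochs=200, seed=0):
    """Train a one-hidden-layer denoising autoencoder with masking noise:
    corrupt each input x to x~ by zeroing features, then minimize the
    denoising reconstruction error ||x - g(f(x~))||^2."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, hidden)); b1 = np.zeros(hidden)
    W2 = rng.normal(0.0, 0.1, (hidden, d)); b2 = np.zeros(d)
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        Xc = X * (rng.random(X.shape) > corrupt)  # masking corruption x~
        H = sig(Xc @ W1 + b1)                     # encoder f(x~)
        R = H @ W2 + b2                           # linear decoder g(f(x~))
        err = R - X                               # compare with the *clean* x
        dH = (err @ W2.T) * H * (1.0 - H)         # backprop through encoder
        W2 -= lr * (H.T @ err) / n; b2 -= lr * err.mean(0)
        W1 -= lr * (Xc.T @ dH) / n; b1 -= lr * dH.mean(0)
    recon = sig(X @ W1 + b1) @ W2 + b2
    return W1, b1, np.mean((recon - X) ** 2)
```

The encoder weights W1 learned this way give the feature representation f(x) on which a downstream classifier can be trained, as in the two-step procedure described next.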
248 Transfer Learning in Natural Language Processing
Glorot et al. (2011) successfully adapt the SDA to learn general feature representations
for cross-domain sentiment classification. Based on unlabeled data
from both domains and the label information in the source domain, the proposed
method tackles the cross-domain sentiment classification problem with a two-step
procedure. First, an SDA is trained to reconstruct the input on the union of the
source and target data. Then a linear classifier such as a support vector machine
is trained on the resulting feature representations f(x) of the labeled source data.
The SDA is able to disentangle the hidden factors that explain the variations in the
input data and to automatically group features according to their relatedness to
these factors.
Bi-transferring Autoencoder
Figure 17.10 The framework of the BTAE (adapted from Zhou et al. [2016])
Specifically, the encoder f_c maps an input example x from either domain
into a latent feature representation z = f_c(x). The decoders g_s and g_t map the
latent representation back to the source and target domains as g_s(z) and g_t(z),
respectively.
The two domains can be generally reconstructed from each other. The objective
function of the BTAE is formulated as

min_{f_c, g_s, g_t, B_s, B_t} ‖X_s − g_s(f_c(X_s))‖²₂ + ‖g_t(f_c(X_s)) − X_t B_t‖²₂ + ‖X_t − g_t(f_c(X_t))‖²₂ + ‖g_s(f_c(X_t)) − X_s B_s‖²₂ + γ(‖B_s‖²_F + ‖B_t‖²_F).    (17.6)
The first term of (17.6) minimizes the reconstruction error in the source domain. The second
term minimizes the reconstruction error of the target domain data
based on the source domain data with the help of a linear transformation matrix
B_t. The third and fourth terms are defined similarly. After solving (17.6), the transferred
source domain data g_t(f_c(X_s)) have a distribution similar to that of the target
domain and hence can be used to train a sentiment classifier for the target
domain.
Embedding-Based Models
Embedding-based models (Bollegala et al., 2015) focus on learning domain-specific
word representations that accurately capture the domain-specific aspects
of the semantic meanings of words. This type of method can address both the
feature mismatch and the semantic variation problems. Bollegala et al. (2015) propose
a cross-domain word representation learning (CDWRL) method for cross-domain
sentiment analysis. The goal of CDWRL is to predict the surrounding non-pivots
of every pivot such that the semantic meaning and orientation of the non-pivots
are captured. There are therefore two requirements for learning the cross-domain
word embeddings. First, for both domains, pivots must accurately predict
their cooccurring non-pivots. Second, the word representations learned for pivots
must be similar in the two domains. Thus, the objective function for CDWRL is
formulated as
min L(C_s, W_s) + L(C_t, W_t) + λR(C_s, C_t),

where s and t denote the source and the target domain, respectively, and C and W
denote the pivots and non-pivots, respectively. L(C_s, W_s) is defined as a rank-based
predictive hinge loss:

L(C_s, W_s) = Σ_{d ∈ D_s} Σ_{(c_s, w_s) ∈ d} Σ_{w_s* ∼ p(w_s)} max(0, 1 − c_s^T w_s + c_s^T w_s*),

and R(C_s, C_t) encourages pivot representations to agree across the two domains:

R(C_s, C_t) = (1/2) Σ_{i=1}^{K} ‖c_s^{(i)} − c_t^{(i)}‖².
Through learning the embeddings for the pivots and non-pivots, their semantic
relation and sentiment orientation can be captured for domain adaptation.
17.3 Transfer Learning in Sentiment Analysis 251
φ = min_G max_D E_{x∼P_{X_real}}[log D(x)] + E_{z∼P_{Z_G}}[log(1 − D(G(z)))].    (17.7)

Figure 17.11 The architecture of the AMN model: two memory networks, MN-sentiment (objective: predict labels) and MN-domain (objective: predict domains through a gradient reversal layer, GRL), sharing a query vector qw
users why a model picks certain words as pivots and non-pivots, to provide
more confidence that the system is making a correct decision.
Another limitation of previous cross-domain sentiment classification works is
that, for the most part, pivots and non-pivots are hand-picked by humans. It would
be desirable to learn the pivots automatically from the two domains. Li et al. (2017b)
propose the adversarial memory network (AMN), shown in Figure 17.11, to
automatically capture pivots in a cross-domain setting. They also introduce a novel
word attention mechanism into the domain adversarial learning framework to
allow for interpretability: the attention mechanism of a memory network can
automatically visualize, based on attention scores, which words are more likely to
be pivots and contribute more to the domain-invariant representations, thereby
interpreting what to transfer.
Several illustrative examples of the learned result are shown in Figure 17.12 to
visualize the attentions of the AMN model. We can see that the words with high
attention weights such as great, good, best, beautiful, fantastic, gorgeous, terrible,
disappointed, disappointment and poor are pivots. These are indeed the words
chosen by humans.
Despite its superior experimental performance, AMN is limited to word-level
attention and ignores the hierarchical structure of documents. In practice, we
wish to accurately capture pivots in long documents, which often follow a
hierarchical structure. Besides, AMN cannot automatically capture and exploit
the relationship between non-pivots and pivots, which may result in degraded
performance when the source and target domains have only few overlapping
pivots.
To simultaneously harness the collective power of pivots and non-pivots, and
to interpret what to transfer, Li et al. (2017b) introduce the hierarchical attention
transfer network (HATN) for cross-domain sentiment classification. HATN jointly
trains two hierarchical attention networks, named the P-net and the NP-net.
The first part is the P-net, which aims to capture the pivots. To achieve this goal, the
labeled data X_s in the source domain are fed into the P-net for sentiment classification
and, at the same time, all the data X_s and X_t from both domains are fed into
the P-net for domain classification based on adversarial learning, so that the domain
classifier cannot discriminate between the representations from the source and
target domains. In this way, the representations from the P-net are guaranteed to be
both domain-shared and useful for sentiment classification, and the P-net can thus
identify the pivot features with the attention mechanism.
The second part is NP-net, which aims to align the non-pivots. To reach this
goal, the transformed labeled data g (Xs ) in the source domain S generated by hid-
ing all the pivots identified by the P-net are fed into the NP-net for sentiment clas-
sification. At the same time, all transformed data g (Xs ) and g (Xt ) in both domains
S and T generated in the same way are fed to NP-net for +(positive)/-(negative)
pivot predictions.
The P-Net and the NP-net work together to predict whether an original sample
x contains positive or negative pivots based on the transformed sample g (x). The
transformed sample g (x) has two labels, a label z + indicating whether x contains
at least one positive pivot and a label z − indicating whether x contains at least one
negative pivot. The intuition behind it is that positive non-pivots tend to cooccur
with positive pivots and negative non-pivots tend to cooccur with negative pivots.
In this way, the NP-net can discover domain-specific features with the pivots as
a bridge and capture the non-pivots that are expected to correlate closely to the
pivots with the attention mechanism.
The NP-net needs positive and negative pivots as a bridge across domains. Since
the P-net possesses the ability of automatically capturing the pivots with atten-
tions, the training procedure of HATN consists of two stages.
In the first stage, the P-net is trained to identify the pivots. In the second stage, the
transformed labeled data g(X_s) can be used for sentiment classification and pivot
predictions simultaneously, but the transformed unlabeled data g(X_t) fed to the NP-net
can only be used for the +(positive)/-(negative) pivot predictions.
from the P-net, the NP-net aims to pay higher word attention to the non-pivots
in the two domains, such as the source non-pivots readable and insipid in the books
domain and the target non-pivots pixelated, fuzzy and distorted in the electronics
domain. The sentences that contain these non-pivots also receive higher sentence
attention in the NP-net.
Figure 17.14 The architecture of the AE-SCL model (adapted from Ziser and Reichart [2017])
Specifically, the feature set is denoted by f, the subset of pivots by f_p ⊆ {1, . . . , |f|},
and the subset of non-pivots by f_np ⊆ {1, . . . , |f|} such that f_p ∪ f_np = f and f_p ∩
f_np = ∅. Besides, the representations of the pivots and the non-pivots of an input x are
denoted by x_p and x_np, respectively.
The AE-SCL aims to induce a robust and compact feature representation
by learning a nonlinear prediction function from x_np to x_p. The prediction
function is based on the autoencoder framework. The AE-SCL first encodes x_np
into an intermediate representation h_{w_h}(x_np) = σ(w_h x_np)
and then predicts the occurrence of the pivots with a decoder function
o = r_{w_r}(h_{w_h}(x_np)) = σ(w_r h_{w_h}(x_np)), which reflects the probability that
each pivot appears in the input. Hence, the cross-entropy loss function is naturally used.
Two observations are in order. First, pivots with similar semantic meanings often
have similar word embeddings. Second, the pivots that occur in a given input are far
fewer than those that do not. Thus, Ziser and Reichart (2017) propose
AE-SCL with similarity regularization, in which pretrained word embeddings
of the pivots are injected into the AE-SCL model to improve generalization
across examples with semantically similar pivots. Moreover, Ziser and Reichart
(2018) propose pivot-based language modeling, which incorporates pivots into a
language model.
18
Transfer Learning in Dialogue Systems
18.1 Introduction
sible for detecting speech-acts, slots and the corresponding slot values. It takes a
user utterance as the input and identifies the speech-acts and slot values that appear
in the user utterance. The dialogue state tracker (Wang and Lemon, 2013; Henderson
et al., 2014; Zilka and Jurcicek, 2015; Lee and Kim, 2016; Sun et al., 2016)
is responsible for inferring and maintaining dialogue states. It takes a parsed user
utterance (including speech-acts, slots and slot values) as the input and keeps
track of the current dialogue state based on the user utterance and the previous dialogue
state. The DPL module (Williams, 2008a, 2008b; Lefèvre et al., 2009; Young
et al., 2010) takes the current dialogue state as the input and decides the next action
to take. The natural language generation module (Wen et al., 2015a) converts the
system action back to a text response. Among these four modules, the DPL module
is the core component.
According to their training approaches, task-oriented dialogue systems can be
categorized into modular dialogue systems and end-to-end dialogue systems. The
components in a modular dialogue system can be built separately with different
objective functions, while an end-to-end dialogue system is trained jointly with
a single objective function. Modular dialogue systems have low inter-module
dependency, so different components can be plugged in freely. Hence, components
can be trained or handcrafted independently, and the data needed to train
each component can be obtained easily. However, without a consistent objective
function, the whole system assembled from different components might not work
well. End-to-end dialogue systems have a single objective function. When trained
jointly, the whole system can learn together and achieve better performance, and it
does not require intermediate annotations, which reduces human effort. However,
since all components are learned jointly, we cannot switch components without
retraining. Moreover, training an end-to-end dialogue system is more
difficult.
The rest of this chapter is organized as follows. We first introduce the
problem formulation, then introduce each of the four modules one by one
from Section 18.3 to Section 18.6, and finally introduce end-to-end dialogue
systems in Section 18.7.

18.2 Problem Formulation
In the following four sections, we will introduce how transfer learning can help
the learning of the four modules.
(1) Speech-act classification: The input is a user utterance X n and the output is
the speech-act a n of the user utterance. The speech-act classification can be
viewed as a multi-label classification problem.
(2) Slot-filling: The problem is to find all possible slot value pairs sn = {s 1 = v 1 , s 2 =
v 2 , · · · } in the user utterance X n . The slot-filling task can be viewed as a se-
quential classification problem based on the word sequence of a sentence.
For an input sentence “who played Zeus in the 2010 action movie Titans,” the
expected output is a semantic tag for each word in the input sentence like
“who:{} played:{} zeus:{character=zeus} in:{} the:{} 2010:{year=2010}
action:{genre=action} movie:{type=movie} Titans:{name=Titans}.”
domain classifier to each instance X s in the source domain. If the predicted prob-
ability that the source instance X s belongs to a target class a is above a thresh-
old, that is, f at (X s ) > ρ, then X s is transferred to the target domain as an instance
of target domain class a. After some instances are transferred, the target-domain
classifier can be retrained with the transferred source domain instances and the
target domain instances.
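The thresholded instance-transfer loop above can be sketched as follows; the centroid classifier is only a stand-in for whatever target-domain classifier f_a^t is actually used, and the threshold rho plays the role of ρ in the text.

```python
import numpy as np

def centroid_classifier(X, y):
    """A tiny stand-in classifier: class centroids, with a softmax over
    negative squared distances serving as class probabilities."""
    classes = np.unique(y)
    C = np.stack([X[y == c].mean(axis=0) for c in classes])
    def predict_proba(Xq):
        d = ((Xq[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
        e = np.exp(-d)
        return classes, e / e.sum(axis=1, keepdims=True)
    return predict_proba

def instance_transfer(Xs, Xt, yt, rho=0.8):
    """Score each source instance with the target-domain classifier and
    transfer those whose predicted class probability exceeds rho, then
    retrain on the union of target data and transferred instances."""
    clf = centroid_classifier(Xt, yt)
    classes, P = clf(Xs)
    keep = P.max(axis=1) > rho           # confident source instances only
    pseudo = classes[P.argmax(axis=1)]   # their predicted target classes
    X_aug = np.vstack([Xt, Xs[keep]])
    y_aug = np.concatenate([yt, pseudo[keep]])
    return centroid_classifier(X_aug, y_aug), keep
```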
where φ(x_i) denotes the word embedding of word x_i, σ is an activation function,
W_ih is a 3d × h matrix and W_ho is a d × d matrix. The parameters W_ih and W_ho are
shared for all labels, making this a transfer learning method.
Jeong and Lee (2009) propose dividing the model parameters into domain-
dependent and domain-independent parameters, where the domain-independent
parameters are shared across domains to transfer knowledge. A conditional random
field is used as the base model for the slot-filling problem. The predicted slot set
is given by

s∗ = argmax_s P(s | X),
domain data, the pretrained model is fine-tuned on a few instances in the target
domain. However, the source and target domains have to use the same kind of
model. Instance transfer can work without modifying the structure of the classifier
and is easy to train. However, the source and the target models need to
be trained multiple times, which is time-consuming. Parameter transfer can
transfer common model parameters from a source domain to a target domain.
However, the model parameters have to be partitioned into shared parameters
and domain-dependent parameters and, as a result, some classifiers cannot be
used.
ods under this setting are based on the Q-learning framework, and we categorize
these works into three categories: transferred linear models, transferred
Gaussian processes and transferred Bayesian committee machines.
Before presenting these three approaches, we first introduce some notations.
In a modular dialogue system, without loss of generality, we formulate the DPL
as an MDP, since a POMDP policy can be represented by an MDP over belief
states. The MDP is defined as {H, Y, P, R, γ}, where H denotes the dialogue state
space, Y denotes the space of agent replies, P is the state transition probability function, R
is the reward function and γ ∈ [0, 1] is the discount factor. At time step n, H̃_n
denotes the dialogue state, Ỹ_n denotes the agent reply and r_n denotes the reward.
We assume the spoken language understanding module and the dialogue state
tracker have provided the current state H̃n at time step n, so we can observe H̃n , Ỹn
and r n . The goal is to find an optimal policy that can maximize the cumulative
return, defined as G_n = Σ_{k=0}^{∞} γ^k r_{n+k}.
k((H̃, Ỹ), (H̃′, Ỹ′)) = k_H̃(H̃, H̃′) k_Ỹ(Ỹ, Ỹ′).
Given training state-action sequences B = [( H̃0 , Ỹ0 ), · · · , ( H̃n , Ỹn )]T and the cor-
responding immediate rewards r = [r 0 , · · · , r n ]T , the Q-function Q π ( H̃ , Ỹ ) for any
state-action pair ( H̃ , Ỹ ) is given by
where m = [m( H̃0 , Ỹ0 ), · · · , m( H̃n , Ỹn )]T , K is the kernel matrix, H is the band ma-
trix with diagonal [1, −γ], k( H̃ , Ỹ ) = [k(( H̃0 , Ỹ0 ), ( H̃ , Ỹ )), · · · , k(( H̃n , Ỹn ), ( H̃ , Ỹ ))]T
and σ2 is the variance for the noise.
To transfer a Gaussian process policy, there are basically two approaches.
(1) Transferring the mean function Q̄( H̃ , Ỹ ). In the works by Gašić et al. (2015a,
2015b, 2015c), source data are used to build a good Q-function Q̄( H̃ , Ỹ ) for
the target domain. In the works by Gašić et al. (2013, 2014) and Casanueva
et al. (2015), the mean function in the source domain is used as a prior on that
of the target domain.
(2) Transferring the covariance function k(( H̃ , Ỹ ), ( H̃ , Ỹ )). In the works by Gašić
et al. (2013, 2014, 2015a, 2015b, 2015c) and Casanueva et al. (2015), the kernel
function on state-action pairs is defined from different domains.
where s denotes a source slot. The kernel function between a source speech-act
a^s and a target speech-act a is defined as

k_A(a^s, a) = δ_{a^s}(a) if a ∈ A^s, and k_A(a^s, a) = 0 if a ∉ A^s,
266 Transfer Learning in Dialogue Systems
where A s and A t are the collections of speech-acts in the source and the target
domains, respectively, and δa s (a) is the kernel function defined in the source do-
main.
Gašić et al. (2013) define the cross-domain kernel function based on the common
slots in the source and the target domains. For slots that appear only in the target
domain, the most similar source slots are used to calculate the kernel function. Specifically,
the kernel function is defined as

k_H̃(H̃^s, H̃) = Σ_{s^s ∈ S} ⟨H̃^s_{s^s}, H̃_{s^s}⟩ + Σ_{s^t ∉ S} ⟨H̃^s_{l(s^t)}, H̃_{s^t}⟩,

where s^s denotes a slot in the source domain S, s^t denotes a slot in the target
domain T and the function l : T → S finds the source slot l(s^t) that is most similar
to the target slot s^t. The kernel function for actions is defined as

k_a(a^s, a) = δ_{a^s}(a) if a ∈ A^s, and k_a(a^s, a) = δ_{a^s}(L(a)) if a ∉ A^s,
where function L : A t → A s maps an action that does not exist in the source do-
main to a replaced action in the source domain and δa s (a) is the kernel function
defined in the source domain.
By assuming that the source and target domains are from different users with
the same set of slots, Casanueva et al. (2015) propose to use additional features to
determine the kernel function as
where ls is an acoustic feature vector for the state-action pair in the source do-
main and similarly l is for the target domain. Hence, the kernel depends on some
external features to help calculate the cross-domain similarity.
Gašić et al. (2015a) propose building a distributed policy for each node in a
knowledge graph. A dialogue policy is decomposed into a set of topic-specific
policies that are distributed across the class nodes in the graph. The root node
in the knowledge graph is general for all its children nodes and so this policy can
work for all sub-domains. The proposed method matches only the common slots
with

k_H̃(H̃^s, H̃) = Σ_{s ∈ S∪T} ⟨H̃^s_s, H̃_s⟩  and  k_A(a^s, a) = δ_{a^s}(a) if a ∈ A^s, and 0 if a ∉ A^s,
where δ_{a^s}(a) is the kernel function defined in the source domain. If there are no
common slots, the non-matching slots are treated as abstract slots and renamed
to names such as "slot-1" and "slot-2." The abstract slots in the
source and target domains are then matched one by one in order.
Transferred Gaussian processes for Q-learning do not assume a completely identical
feature space in the two domains, but they still assume that there are common
slots between the two domains. Moreover, transferred Gaussian processes
are computationally expensive, so they cannot support large training data sets.
18.5 Transfer Learning in DPL 267
In the Bayesian committee machine, M Q-function estimates from different domains are combined as

Q̄(H̃, Ỹ) = Σ_Q(H̃, Ỹ) Σ_{i=1}^{M} Σ_Q^i(H̃, Ỹ)^{-1} Q̄_i(H̃, Ỹ),

Σ_Q(H̃, Ỹ)^{-1} = −(M − 1) k((H̃, Ỹ), (H̃, Ỹ))^{-1} + Σ_{i=1}^{M} Σ_Q^i(H̃, Ỹ)^{-1}.
The Bayesian committee machine does not assume the existence of common
slots in the source and target domains. Instead, an entropy-based cross-domain
kernel function is defined to estimate the data similarity between different do-
mains. However, each committee is a Gaussian process model, which is still com-
putationally expensive, and thus it cannot support large data sets.
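The combination rule can be sketched numerically for a single state-action pair, treating each committee member's Q-estimate as a scalar Gaussian with its own mean and variance; prior_var plays the role of the prior covariance k((H̃, Ỹ), (H̃, Ỹ)), and the function name is illustrative.

```python
import numpy as np

def bcm_combine(means, variances, prior_var):
    """Combine M Gaussian Q-estimates for one state-action pair with the
    Bayesian committee machine rule: the combined precision subtracts the
    (M - 1) times over-counted prior precision, and the combined mean is
    the precision-weighted sum rescaled by the combined variance."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    M = len(means)
    inv_total = -(M - 1) / prior_var + np.sum(1.0 / variances)
    var = 1.0 / inv_total
    mean = var * np.sum(means / variances)
    return mean, var
```

With M = 1 the prior correction vanishes and the rule returns the single model's estimate unchanged.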
(1) Model fine-tuning: A model is first trained on a source domain and then it is
retrained with the target domain data.
(2) Curriculum learning: In each training epoch, the training instances are sorted
so that general instances are fed to the model first, followed by the specific
target domain data.
(3) Instance synthesis: Synthetic target domain sentences are built from delexi-
calized source domain sentences by substituting slot values.
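The instance-synthesis strategy amounts to filling delexicalized templates with slot values; a minimal sketch, assuming `<slot>` placeholders as the delexicalization convention (the convention and function names are illustrative, not from the cited work).

```python
import itertools

def synthesize(delex_sentences, slot_values):
    """Build synthetic sentences by substituting every combination of
    slot values into delexicalized templates such as 'i want <food>'."""
    out = []
    for sent in delex_sentences:
        # slots actually present in this template, in insertion order
        slots = [s for s in slot_values if "<" + s + ">" in sent]
        for combo in itertools.product(*(slot_values[s] for s in slots)):
            filled = sent
            for s, v in zip(slots, combo):
                filled = filled.replace("<" + s + ">", v)
            out.append(filled)
    return out
```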
domain data. The data-sorting strategy and the model fine-tuning strategy differ
in the order in which the source and the target domain data are used. With the
data-sorting strategy, at each training epoch the model uses the source domain
data first and then the target domain data. With the model fine-tuning strategy, the
model is fully optimized on the source domain data and then adapted to the target
domain.
then decide the next action, there is no unique definition of a ground-truth dialogue
state in an end-to-end dialogue system. Instead, at each time step, the dialogue
system directly takes the current question as the input and generates an output
sentence based on an internal state. The internal state is updated at each time step
and, in some sense, represents an abstract dialogue state. Hence, the whole
end-to-end dialogue system can be viewed as a policy function whose input is the
dialogue history together with the current question and whose output is the answer
of the system. Unlike modular dialogue systems, the action space of an end-to-end
dialogue system is the space of all possible sentences.
Based on the type of knowledge being transferred, we categorize relevant
works into the following two categories.
(1) Complete parameter fine-tuning methods: they first pretrain an end-to-end
dialogue model on the source domain with sufficient training data and
then fine-tune all parameters in the target domain with a small amount of
training data.
(2) Partial parameter sharing methods: they share only a part of the model parameters
across domains, in contrast to the model fine-tuning methods where
all parameters are transferred.
transferred models can indeed capture the personal responding styles of the five
volunteers.
Yang et al. (2017) transfer a pretrained long short-term memory network-based
encoder-decoder dialogue model to a target domain with dual learning. They
initialize a post agent and a response agent separately, treat the post agent
as the primal task and the response agent as the dual task, and perform dual
learning. The primal and dual tasks form a closed loop and generate informative
feedback to train the dialogue system even with only a small amount of training
data in the target domain. In the adaptation process, the post agent first generates
an intermediate response and then the response agent generates a post based on
the intermediate response. This dual process can monitor the quality of the generated
responses and improve the post agent and the response agent simultaneously.
Joshi et al. (2017) aim to share parameters in a memory network among multiple
users with different profiles. For each profile, the per-response accuracy is
used as the evaluation metric. In the experiments, the proposed multi-profile transfer
learning model outperforms baselines trained with data from only one user.
tem first learns common dialogue knowledge from the source domain and then
adapts this knowledge to the target user.
An example of the coffee-ordering dialogue is shown in Figure 18.2. X denotes
the users' utterances and Y denotes the replies of the agent. In this example,
given the dialogue context H2u = {X 1 , Y1 , X 2 } and the candidate reply set
{Yc1 , Yc2 , Yc3 }, the dialogue policy should decide which reply is more appropriate.
Formally, the inputs to this problem include abundant dialogue data
{{X_n^{u_s}, Y_n^{u_s}}_{n=0}^{T}} of source customers {u_s} and a few dialogue data
{{X_n^{u_t}, Y_n^{u_t}}_{n=0}^{T}} of the target customer u_t. The expected output is a
policy π_{u_t} for the target user.
A personalized coffee ordering dialogue can be formulated as a reinforcement
learning problem, and the flowchart is illustrated in Figure 18.3. In each turn of
the dialogue, based on the question Y asked by the system and the answer X of
the user, the dialogue belief state transits from one state to another. By asking a
personalized question Y p , the system can make the whole dialogue significantly
shorter. For example, if the system knows that a user always orders a cup of cold
mocha delivered to his home, the system can ask the personalized question "Cold
mocha delivered to No. 1199 Mingsheng Road?," and the user will most likely say
yes, making the dialogue shorter.
Due to the difference in user preferences, directly fine-tuning the whole dia-
logue model might lead to the negative transfer. In a reinforcement learning dia-
logue policy, Mo et al. (2018) propose a personalized Q-function, which consists
of a general part Q g and a personal part Q p as
Q^{π_u}(H_n^u, Y_n^u) = Q_g(H_n^u, Y_n^u; w) + Q_p(H_n^u, Y_n^u; p^u, w_p)
≈ E_{π_u}[Σ_{k=0}^{∞} γ^k r_{t+k+1}^{u,g} | H_n^u, Y_n^u] + E_{π_u}[Σ_{k=0}^{∞} γ^k r_{t+k+1}^{u,p} | H_n^u, Y_n^u],    (18.1)
where r_t^{u,g} and r_t^{u,p} denote the general and personal rewards for user u at time t,
respectively, the general Q-function Q g (Hnu , Ynu ; w) captures the expected reward
related to the general dialogue policy for all users, w is the set of parameters for
18.7 Transfer Learning in End-to-End Dialogue Systems 273
Figure 18.3 The flowchart of the PETAL system on the coffee-ordering task
the general Q-function and contains a large number of parameters such that it
requires a lot of training data, and the personal Q-function Q p (Hnu , Ynu ; pu , w p )
captures the expected reward related to the preference of each user.
The general part accounts for the general dialogue policy, and it is pretrained in
the source domain and transferred to the target domain, while the personal part is
for personal preferences of each user and is learned with the target domain data
only. M, w and w p are shared across different users, and they could be trained
on source domains and then transferred to the target domain. These parameters
contain the common dialogue knowledge, which is independent of users’ prefer-
ences. Moreover, pu , which is user-specific, captures the preferences of different
users.
The detailed PETAL algorithm is shown in Algorithm 18.1. The PETAL algorithm
trains a model for each user in the source domain. M, w and w p are shared by all
users and there is a separate pu for each user in the source domain. The PETAL
algorithm transfers M, w and w p to the target domain by using them to initialize
the corresponding variables in the target domain, and then it trains them as well
as pu for each target user with limited training data. Since the source and target
users might have different preferences, pu learned in the source domain is not
very useful in the target domain. The personal preference of each target user will
be learned separately in each pu . Without modeling pu for each user, different
preferences of the source and target users might interfere with each other and
thus cause the negative transfer.
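A toy linear version of the decomposition in (18.1) makes the parameter roles concrete; w and w_p would be transferred from the source domain, while each target user starts with a fresh p_u. The memory parameters M from the text are omitted, and all names are illustrative, not from the PETAL implementation.

```python
import numpy as np

def make_personalized_q(d, w, w_p):
    """Linear sketch of the personalized Q-function: a general part
    Q_g(.; w) shared by all users plus a personal part whose contribution
    is driven by a per-user preference vector p_u."""
    def new_user():
        return np.zeros(d)                # fresh p_u for a new target user
    def q(features, p_u):
        q_g = w @ features                # general part, transferred as-is
        q_p = (w_p * p_u) @ features      # personal part, user-specific
        return q_g + q_p
    return q, new_user
```

For a brand-new target user p_u is all zeros, so the transferred general policy alone drives the dialogue until personal preferences are learned.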
sentences might lead to negative transfer. For example, the transferred policy
might make a wrong suggestion to the target user according to the preferences
of the source users. Different from traditional methods that transfer entire sentences,
the proposed model can transfer fine-grained phrases between a group of users.
Mo et al. (2017) propose a personalized decoder, which consists of a general de-
coder to generate general patterns and a personalized decoder to generate per-
sonal preference words. A personal word gating mechanism is introduced to select
the appropriate decoder for each generated word, making the personalized de-
coder switch between generating general pattern words and personalized phrases.
For the example in Figure 18.5, the personal control gate selects the personal de-
coder to generate the personal phrase “hot latte”, and uses the common decoder
to generate common phrases “Still” and “?.” The personalized decoder can gener-
ate different personal phrases for different users in a sentence, while the knowl-
edge for the shared phrases is shared among all users.
Basic decoder: The hidden state for the t-th word in the n-th turn is defined as

ω(h_{n,t}^{u,d}, ŷ_{n,t−1}^u) = H_0 h_{n,t}^{u,d} + E_0 ŷ_{n,t−1}^u + b_o    (18.2)
g(h_{n,t}^{u,d}, ŷ_{n,t−1}^u, y) = o_y^T ω(h_{n,t}^{u,d}, ŷ_{n,t−1}^u)    (18.3)
p(ŷ_{n,t}^u = y) = exp(g(h_{n,t}^{u,d}, ŷ_{n,t−1}^u, y)) / Σ_{∀y′} exp(g(h_{n,t}^{u,d}, ŷ_{n,t−1}^u, y′))    (18.4)

where o_y is the output embedding for word y, and H_0, E_0 and b_o are parameters.
Personalized dialogue decoder for phrase-level transfer learning: The proposed
personalized decoder is illustrated in Figure 18.5. While sentence-level transfer
moves entire sentences, the proposed personalized decoder operates at the
phrase level and transfers a shared fraction of the sentences to the target domain,
where a phrase is a short sequence of words carrying a coherent meaning.
To achieve maximum knowledge transfer and to avoid the negative transfer caused by
differences in user preferences, the proposed personalized decoder has a shared
component and a personalized component. To learn to switch between the shared
and personal components at the phrase level, Mo et al. (2017) introduce a
personal control gate ô_{n,t}^u, which is learned from the training data, for each word.
Given the embedding vector h_n^{u,c} for the n-th response and the initial hidden
state h_{n,0}^{u,d} for the predicted word ŷ_{n,0}^u, the initial states are computed as

h_{n,0}^{u,g} = h_{n,0}^{u,d},  h_{n,0}^u = h_{n,0}^{u,d},  ô_{n,0}^u = 0,  ŷ_{n,0}^u = 0,  ŷ_{n,0}^{u,g} = 0,

where h_{n,t}^{u,g} is the hidden state of the shared component, ŷ_{n,t}^{u,g} records the last
word generated by the shared component, h_{n,t}^u is the hidden state of the personal
component and h_{n,t}^{u,d} is the hidden state for generating the word ŷ_{n,t}^u.
The shared component adopts the gated recurrent unit model to capture the
long-term dependency and is shared by all users. Specifically, at each time step t ,
the shared component is defined as
z_{n,t}^u = σ(W_z^g h_{n,t−1}^{u,g} + U_z^g ŷ_{n,t−1}^{u,g} + V_z^g h_n^{u,c} + b_z^g)    (18.5)
r_{n,t}^u = σ(W_r^g h_{n,t−1}^{u,g} + U_r^g ŷ_{n,t−1}^{u,g} + V_r^g h_n^{u,c} + b_r^g)    (18.6)
h̃_{n,t}^{u,g} = σ(W_h^g (r_{n,t}^u ⊙ h_{n,t−1}^{u,g}) + U_h^g ŷ_{n,t−1}^{u,g} + V_h^g h_n^{u,c} + b_h^g)    (18.7)
ĥ_{n,t}^{u,g} = z_{n,t}^u ⊙ h_{n,t−1}^{u,g} + (1 − z_{n,t}^u) ⊙ h̃_{n,t}^{u,g},    (18.8)
where ⊙ denotes the element-wise product between vectors or matrices, σ(·) is
the sigmoid function, z_{n,t}^u is the update gate, r_{n,t}^u is the forget gate and ĥ_{n,t}^{u,g} is the
tentative updated hidden state. If the t-th word is a shared word (i.e., ô_{n,t}^u = 0),
then the model updates the shared hidden state and the last general word as usual, and
otherwise h_{n,t}^{u,g} and ŷ_t^{u,g} remain unchanged. Thus, h_{n,t}^{u,g} and ŷ_t^{u,g} are updated as

h_{n,t}^{u,g} = (1 − ô_{n,t}^u) ⊙ ĥ_{n,t}^{u,g} + ô_{n,t}^u ⊙ h_{n,t−1}^{u,g}    (18.9)
ŷ_t^{u,g} = (1 − ô_{n,t}^u) ⊙ ŷ_{t−1}^u + ô_{n,t}^u ⊙ ŷ_{t−1}^{u,g}.    (18.10)
' g u,g g
u σ(Wo hn,t −1 + Uo ŷun,t −1 + bo ) if ô n,tu
−1 = 0
p(ô n,t = 1) = u u u u u u . (18.13)
σ(Wo hn,t −1 + Uo ŷn,t −1 + bo ) if ô n,t −1 = 1
1 In training process, the ground-truth o u is used as a label to train the prediction function for ô u .
n,t n,t
[Figure: the personalized decoder integrated with a high-level RNN for dialogue state tracking, a low-level RNN for spoken language understanding and a dialogue policy; a predicted gate sequence (e.g., 0 1 1 0) controls the decoder.]
ô_{n,t}^u decides whether to use the personal component to generate the next word. h_{n,t}^{u,d} is defined as

h_{n,t}^{u,d} = (1 − ô_{n,t}^u) ⊙ h_{n,t}^{u,g} + ô_{n,t}^u ⊙ h_{n,t}^u,   (18.14)

where h_{n,t}^{u,d} is the hidden vector that directly generates the next word ŷ_{n,t}^u and the probability of generating the next word y_{n,t}^u is defined by the generation process in (18.2)–(18.4).
The decoding procedure is as follows:

(1) Initialize h_{n,0}^{u,g}, h_{n,0}^u, ô_{n,0}^u, ŷ_{n,0}^u and ŷ_{n,0}^{u,g} based on h_{n,0}^{u,d} and h_n^{u,c}; ô_{n,0}^u is initialized to 0 and ŷ_{n,0}^u is initialized to the zero vector 0.
(2) Compute the personal control gate ô_{n,t}^u based on h_n^{u,c}, ô_{n,t−1}^u, h_{n,t−1}^{u,g}, h_{n,t−1}^u and ŷ_{t−1}^u with (18.13).
(3) Compute h_{n,t}^{u,g}, h_{n,t}^u and the output hidden state h_{n,t}^{u,d} based on the personal control gate ô_{n,t}^u.
(4) Generate ŷ_{n,t}^u based on the output hidden state h_{n,t}^{u,d} with (18.14).
(5) Repeat step 2 to step 4 until the ending symbol is generated.
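The switching logic of (18.9), (18.10) and (18.14), wrapped around a GRU-style shared component, can be sketched as follows. This is a toy illustration with randomly initialized weights; the hard-coded `o_gate` stands in for the learned gate prediction of (18.13), and it is not Mo et al.'s actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # toy hidden/embedding size

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_shared(h_prev, y_prev, h_c, P):
    """Shared component, (18.5)-(18.8); P holds weights shared by all users."""
    z = sigmoid(P["Wz"] @ h_prev + P["Uz"] @ y_prev + P["Vz"] @ h_c + P["bz"])
    r = sigmoid(P["Wr"] @ h_prev + P["Ur"] @ y_prev + P["Vr"] @ h_c + P["br"])
    h_tilde = sigmoid(P["Wh"] @ (r * h_prev) + P["Uh"] @ y_prev + P["Vh"] @ h_c + P["bh"])
    return z * h_prev + (1.0 - z) * h_tilde  # candidate state, (18.8)

def decode_step(o_gate, h_g_prev, h_u, hat_h_g, y_prev, y_g_prev):
    """Gated switching: (18.9), (18.10) and (18.14); o_gate is 0 or 1."""
    h_g = (1 - o_gate) * hat_h_g + o_gate * h_g_prev  # (18.9): freeze if personal
    y_g = (1 - o_gate) * y_prev + o_gate * y_g_prev   # (18.10): last general word
    h_d = (1 - o_gate) * h_g + o_gate * h_u           # (18.14): output hidden state
    return h_g, y_g, h_d

# toy shared weights and states
P = {k: 0.1 * rng.standard_normal((d, d)) for k in
     ["Wz", "Uz", "Vz", "Wr", "Ur", "Vr", "Wh", "Uh", "Vh"]}
P.update({k: np.zeros(d) for k in ["bz", "br", "bh"]})

h_c = rng.standard_normal(d)   # response embedding h^{u,c}_n
h_g = np.zeros(d)              # shared hidden state
h_u = rng.standard_normal(d)   # personal hidden state (updated by the personal GRU)
y_prev = np.zeros(d)           # embedding of the last generated word
y_g_prev = np.zeros(d)

for t in range(3):
    hat_h_g = gru_shared(h_g, y_g_prev, h_c, P)
    o_gate = 0 if t < 2 else 1  # stand-in for the learned gate of (18.13)
    h_g, y_g_prev, h_d = decode_step(o_gate, h_g, h_u, hat_h_g, y_prev, y_g_prev)
    # h_d would then feed the word-generation process (18.2)-(18.4)
```

When `o_gate` is 1, (18.14) reduces to emitting the personal hidden state unchanged, which is exactly the "personal phrase" branch of the decoder.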
The shared and personal components can be trained together with supervised
learning and reinforcement learning.
The personalized decoder is capable of transferring dialogue knowledge, and it can easily be combined with many models, including the Seq2seq model (Sutskever et al., 2014) and the HRED model (Serban et al., 2015). The architecture of the personalized HRED model, obtained by combining HRED with the personalized decoder, is shown in Figure 18.6.
19
Transfer Learning in Recommender Systems
19.1 Introduction
Research on recommendation systems with transfer learning started with an algorithm known as codebook transfer (CBT). Subsequently, many research works have appeared that apply transfer learning to recommender systems, spanning instance-based, feature-based and model-based transfer learning frameworks.
The organization of this chapter is as follows. We first discuss some represen-
tative transfer learning methods for recommendation in Section 19.2. We then
present two transfer learning applications in news recommendation and VIP rec-
ommendation in Section 19.3 and Section 19.4, respectively.
where T is the target data, S is the source data, Θ is the parameter to be learned, and K_I, K_F and K_M denote the knowledge in instance-based transfer learning, feature-based transfer learning and model-based transfer learning, respectively. We can see that the framework in (19.1) contains a loss function, two regularization terms and a constraint.

In this framework, l(·) is the loss function, r(·) is a regularization function and c(·) is a constraint. We wish the loss to be as small as possible, provided that the model size is also under control so as to ensure the generalization ability of learning. Different transfer learning methods may instantiate different parts of this problem statement. Examples are l(·) and r(·) in the works by Pan et al. (2015b, 2016b) and Hu et al. (2019), and l(·) and c(·) in the works by Pan et al. (2012, 2017).
19.2 What to Transfer in Recommendation
Pan et al. (2015b) study two types of one-class feedback that have different un-
certainties such as transactions and examinations for preference learning and
item ranking. Specifically, they propose learning the confidence of each exam-
ination instance and then transferring the confidence-weighted examination in-
stances to the target preference learning task with transaction records by an adap-
tive Bayesian personalized ranking (ABPR) algorithm.
Pan et al. (2016b) study labeled feedback such as numerical ratings and unla-
beled feedback such as examinations, and design a self transfer learning (sTL) al-
gorithm, which iteratively identifies some likely-to-prefer examination instances
and transfers them to improve the target rating prediction task.
Hu et al. (2019) develop a deep learning model called transfer meets hybrid
(TMH) to selectively transfer the interacted source item instances by the corre-
sponding target user via an attentive weighting scheme, as well as to exploit the
unstructured text information of the target (user, item) pair by a memory network
in a hybrid manner.
Pan et al. (2012) take users’ uncertain actions, that is, feedback in the form of
rating intervals, as source rating instances, and integrate them into the target five-
star rating matrix factorization task for preference learning via a transfer by inte-
grative factorization (TIF) method.
Pan et al. (2017) propose a transfer to rank (ToR) method to first exploit the
union of the target explicit feedback and source implicit feedback to obtain a can-
didate list of item instances of users’ potential interest, and then transfer them
to the target matrix factorization task with explicit feedback only to re-rank the
candidate list.
Observe that instance-based transfer learning methods have been applied to
different recommendation problems in terms of the input such as ratings, trans-
actions, examinations and installation, and the output including rating predic-
tion and item ranking. The transferred knowledge of instances can be in different
forms, including examination instances (Pan et al., 2015b, 2016b), installation in-
stances (Hu et al., 2019), rating instances (Pan et al., 2012) and candidate item in-
stances (Pan et al., 2017). Furthermore, the transfer learning algorithms can also
be formulated in different styles such as adaptive styles (Pan et al., 2015b), iter-
ative styles (Pan et al., 2016b), integrative styles (Pan et al., 2012; Hu et al., 2019)
and two-stage styles (Pan et al., 2017).
which includes a loss function l (·) and two regularization terms r (·).
Singh and Gordon (2008) study users’ rating behaviors and items’ attributes in a
single framework, and bridge two different domains via sharing the knowledge of
items’ latent features. More specifically, they design a collective matrix factoriza-
tion (CMF) model that jointly factorizes a rating matrix in terms of users and items
and a data matrix about items, and share the collectively learned item-specific la-
tent feature matrix to achieve bidirectional knowledge transfer. Besides, Shi et al.
(2013b) propose a joint matrix factorization (JMF) method for a rating matrix and
an item similarity matrix defined on the contextual information.
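The idea behind CMF — jointly factorizing a rating matrix and an item-attribute matrix with a shared item-factor matrix — can be sketched with plain gradient descent on synthetic dense data. This is a minimal illustration, not the actual CMF implementation; all sizes, step sizes and trade-off weights are made up.

```python
import numpy as np

rng = np.random.default_rng(1)
n_users, n_items, n_attrs, k = 20, 15, 6, 3

# synthetic data sharing one item-factor matrix V_true across both matrices
V_true = rng.standard_normal((n_items, k))
R = rng.standard_normal((n_users, k)) @ V_true.T  # user-item rating matrix
A = V_true @ rng.standard_normal((k, n_attrs))    # item-attribute matrix

U = 0.1 * rng.standard_normal((n_users, k))
V = 0.1 * rng.standard_normal((n_items, k))
W = 0.1 * rng.standard_normal((n_attrs, k))
lr, lam, alpha = 0.01, 0.01, 0.5  # step size, L2 weight, loss trade-off

def loss():
    return (alpha * np.sum((R - U @ V.T) ** 2)
            + (1 - alpha) * np.sum((A - V @ W.T) ** 2)
            + lam * (np.sum(U ** 2) + np.sum(V ** 2) + np.sum(W ** 2)))

start = loss()
for _ in range(200):
    Er = R - U @ V.T  # rating residual
    Ea = A - V @ W.T  # attribute residual
    U += lr * (alpha * Er @ V - lam * U)
    # V receives gradients from BOTH factorizations: the bridge for transfer
    V += lr * (alpha * Er.T @ U + (1 - alpha) * Ea @ W - lam * V)
    W += lr * ((1 - alpha) * Ea.T @ V - lam * W)
final = loss()
```

The shared factor `V` is the only coupling between the two losses, so improving the attribute reconstruction regularizes the rating factorization and vice versa — the "bidirectional" transfer described above.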
Pan et al. (2010b) aim to leverage both user-side and item-side source exami-
nation information to improve the target rating prediction problem. In particular,
they design a coordinate system transfer (CST) algorithm, where the coordinate
systems are actually the user-specific and item-specific latent features learned
from the user-side and item-side source examination data, respectively. Such la-
tent features are then transferred to the target rating prediction task via two biased
regularization terms. We can see that the CST method is a two-stage approach, in-
cluding the coordinate system construction and transfer learning.
Pan and Yang (2013) turn to exploit the front-side binary feedback such as users’
likes and dislikes to assist the target rating prediction task. In order to acquire the
rich knowledge for sharing, they design a transfer by collective factorization (TCF)
approach, which models the data-independent knowledge via two shared latent
matrices and the data-dependent effect via two non-shared matrices simultane-
ously for the target numerical ratings and the source like/dislike binary ratings.
In TCF, the two shared latent feature matrices, that is, the user-specific latent fea-
ture matrix and the item-specific latent feature matrix, are designed to bridge two
heterogeneous data in a collective manner.
Pan et al. (2016a) study two types of one-class feedback, such as purchases and browses,
via a transfer via joint similarity learning (TJSL) method from the perspective of
joint similarity learning by sharing the item-specific latent feature matrix. In par-
ticular, the TJSL method proposes to learn a similarity between a candidate item
and a purchased item and also a similarity between a candidate item and a browsed
item. Such joint similarity learning via sharing item-specific latent features is em-
pirically shown to achieve better performance in terms of item ranking.
Hu et al. (2018) develop a deep transfer learning model for shared user cross-
domain recommendation. They propose a collaborative cross network (CoNet),
which can transfer source domain knowledge in a way of deep representations.
The knowledge transfer happens in both directions, from the source to target do-
main and from the target to source domain. The idea is to feed the representations
from the source network into the hidden layers in the target network. This makes
the preference learning in the target domain easier even in the sparse data case
since the target network only needs to learn an incremental “residual” representation with reference to the source representations.
sparsity problem in the target domain. The transferred knowledge in the code-
book can also be shared in a collective manner as done in the rating-matrix gener-
ative model (RMGM) (Li et al., 2009b). Besides, Gao et al. (2013) propose a cluster-
level latent factor model (CLFM) with two types of codebooks, one for the shared
common rating pattern and another for the domain-specific rating pattern.
Pan et al. (2015a) propose a compressed knowledge transfer via factorization
machine (CKT-FM) to integrate explicit and implicit feedback. First, CKT-FM
mines the compressed knowledge of users and items via a clustering method,
where the extracted knowledge of membership information for both users and items is assumed to be stable across the two types of feedback. Second, CKT-FM
transfers the mined compressed knowledge to the target rating prediction task via
a generic feature engineering based factorization approach, that is, factorization
machine.
Kanagawa et al. (2019) apply a recent deep learning method called domain sep-
aration networks (DSNs) (Bousmalis et al., 2016) to a content-based cross-domain
recommendation problem with two preference data, where a common encoder
and a common decoder are shared between the source and target preference pre-
diction tasks, while two private encoders are also kept for the source data and the
target data.
We can see that the main idea of most model-based transfer learning in
recommendation is to share or transfer high-level rating behaviors such as clus-
ters or memberships of users and items, which are assumed to be relatively sta-
ble and consistent across explicit and implicit feedback. The transferred knowl-
edge is particularly useful when the target domain is extremely sparse in terms of
ratings.
Finally, we summarize the aforementioned transfer learning methods in rec-
ommendation in Table 19.1. We can see that most transfer learning models are
designed to transfer knowledge from the front-side source information such as
examinations, uncertain ratings and binary ratings.
2007; Liu et al., 2010a) are not applicable, because they rely on users’ historical
reading behaviors and news articles’ content information that are not available in
the DCSR problem.
The DCSR problem can be solved from the perspective of transfer learning.
Although there are no users’ behaviors about the cold-start users and cold-start
items in the news domain, there may be some other related domains with users’
behaviors. Specifically, we leverage some knowledge from a related domain, that
is, the app domain, where the users’ app-installation behaviors are available. Most
cold-start users in the news domain have already installed some apps, and this in-
formation may be helpful in determining their preferences in news articles. In
particular, we assume that users with similar app-installation behaviors are likely
to have similar interests in news articles. With this assumption, the neighborhood
in the APP domain can be used as the knowledge to be transferred to the target
domain of news articles.
19.3.2 Challenges
The main difficulty of the DCSR problem is the lack of previous preference data
for new users and new items. We face the new user cold-start challenge, where the
target users to whom we will provide recommendations have not read any items
before. We also face the new item cold-start challenge, where the target items that
we will recommend to the target users are totally new for all users. Under such
challenges, most existing recommendation algorithms are not applicable.
To address the two challenges in the DCSR problem, we make a preference assumption across the APP domain and the news domain that the neighborhood structures in the two domains are similar. A neighborhood-based transfer learning (NTL)
method is introduced to transfer the knowledge of the neighborhood from the APP
domain to the news domain, which can address the new user cold-start challenge.
For the new item cold-start challenge, a category-level preference is designed to
replace the traditional item-level preference because the latter is not available for
the new items in the DCSR problem. With these two techniques addressing the
two challenges, some well-studied neighborhood-based recommendation meth-
ods are applicable to the DCSR problem.
Figure 19.1 An illustration of the NTL method for the DCSR problem
where G_{u·} is a row vector w.r.t. user u in the user-genre matrix G. Once we have calculated the cosine similarity, for each cold-start user u, we first remove users with small similarity values (e.g., s_{u,u'} < 0.1), and then take the most similar users to construct the neighborhood N_u.

For the item-level preference r̂_{u',i} in (19.5), we are unable to obtain such a score directly because item i is new to all users, including the warm-start users and the target cold-start user u. We propose approximating the item-level preference by a category-level preference, where c(i) can be the level-1 or level-2 category. We can thus have two types of category-level preferences, where N_{u',c_1(i)} and N_{u',c_2(i)} denote the number of items (by user u') belonging to the level-1 category c_1(i) and the level-2 category c_2(i), respectively.
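A minimal sketch of this neighborhood construction and the category-level preference is below. The matrix `G`, the 0.1 threshold, the top-k cut-off and the category labels are all toy assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, n_genres = 8, 5
G = (rng.random((n_users, n_genres)) < 0.5).astype(float)  # user-genre (app) matrix
G[0] = np.array([1.0, 0.0, 1.0, 0.0, 1.0])  # the cold-start user's app genres
G[1] = G[0].copy()                          # user 1: an obvious perfect neighbour

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom > 0 else 0.0

def neighborhood(u, G, thresh=0.1, k=3):
    """Warm-start users most similar to user u in app-installation behaviour."""
    sims = [(w, cosine(G[u], G[w])) for w in range(len(G)) if w != u]
    sims = [(w, s) for w, s in sims if s >= thresh]  # drop weakly similar users
    return sorted(sims, key=lambda t: -t[1])[:k]

def category_pref(w, cat, read_cats):
    """Fraction of user w's reads falling in category `cat` (hypothetical logs)."""
    cats = read_cats[w]
    return cats.count(cat) / len(cats) if cats else 0.0

read_cats = {w: list(rng.choice(["sports", "tech", "finance"], size=10))
             for w in range(n_users)}
u = 0
nbrs = neighborhood(u, G)
# neighbourhood-weighted, category-level preference for a "tech" article
num = sum(s * category_pref(w, "tech", read_cats) for w, s in nbrs)
den = sum(s for _, s in nbrs)
score = num / den
```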
which will be used for the preference prediction. Specifically, the neighborhood N_u helps address the new user cold-start challenge and the category-level preference N_{u',c_1(i)} or N_{u',c_2(i)} addresses the new item cold-start challenge.
where far transfer represents knowledge transfer across the two heterogeneous social networks of instant messenger and microblog, and near transfer represents that within the target social network of microblog. The goal is to predict the missing values in R, and thus we can rank and recommend VIPs for each user. An illustration is shown in Figure 19.2.
19.4.2 Challenges
VIP recommendation is basically a one-class collaborative filtering problem (Pan
et al., 2008a), since the rating of user u on VIP i is either “1” or unknown (missing
value). Hence, most memory-based collaborative filtering methods for rating pre-
diction cannot be used directly, which we will explain later. We observe two very
fundamental challenges in VIP recommendation.
(1) The first challenge is scalability, since it is extremely time consuming to estimate the similarity between every pair of users when there are millions of users.
(2) The second challenge is sparsity, since the observed following relations in R are very few, and thus the estimated similarities between users may not be accurate.
We can see that these two challenges are rooted in the “similarity” in memory-based collaborative filtering methods, for example, Resnick's rule (Resnick et al., 1994). As far as we know, distributed algorithms can address the first challenge and most transfer learning works focus on addressing the second challenge. However, few works address these two challenges in a single framework. In the following section, we will introduce a solution that achieves this.
where N_u is the set of nearest neighboring users of user u according to the PCC. Finally, according to Resnick et al. (1994), we can predict the rating of user u on item i as

r̂_{ui} = r̄_{u·} + Σ_{w∈N_u} y_{wi} s_{uw} (r_{wi} − m_{w·}),   (19.13)

where r̄_{u·} = Σ_i y_{ui} r_{ui} / Σ_i y_{ui} is the average rating of user u on all items rated by user u. (19.13) can equivalently be reformulated as

r̂_{ui} = r̄_{u·} − Σ_{w∈N_u} y_{wi} s_{uw} m_{w·} + Σ_{w∈N_u} y_{wi} s_{uw} r_{wi},   (19.14)

where the first term represents the global average rating of user u and the second term represents the aggregation of the local average ratings of its nearest neighbors. For one-class collaborative filtering for VIP recommendation, we have r̄_{u·} = 1 and m_{w·} = 1, thus such average ratings do not contain any discriminative information and can be safely discarded. Finally, we obtain a simplified prediction rule as

r̂_{ui} = Σ_{w∈N_u} y_{wi} s_{uw} r_{wi},   (19.15)

which means that the rating of user u on item i can be estimated from the preferences of the nearest neighbors of user u on item i via a weighted aggregation.
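The simplified rule (19.15) is just a similarity-weighted sum over a user's nearest neighbours. A toy sketch, with random similarities and a binary rating matrix standing in for real data, is:

```python
import numpy as np

rng = np.random.default_rng(3)
n_users, n_items = 6, 4
Y = (rng.random((n_users, n_items)) < 0.6).astype(float)  # y_wi = 1 iff w rated i
R = Y.copy()                        # one-class feedback: every observed rating is 1
S = rng.random((n_users, n_users))  # toy user-user similarities s_uw
np.fill_diagonal(S, 0.0)

def predict(u, i, S, Y, R, k=3):
    """(19.15): sum of y_wi * s_uw * r_wi over u's k most similar users."""
    nbrs = np.argsort(-S[u])[:k]  # k nearest neighbours of u by similarity
    return float(sum(Y[w, i] * S[u, w] * R[w, i] for w in nbrs))

score = predict(0, 2, S, Y, R)
```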
where Ñ_u denotes the set of user u's friends in the social network of instant messenger and x_{uw} denotes the relationship of user u and his/her friend w. To consider each friend equally, we set x_{uw} = 1 in (19.16) and then we have

r̂_{ui} = Σ_{w∈Ñ_u} y_{wi} r_{wi}.   (19.17)

For the one-class collaborative filtering problem in microblog, we can further replace the term y_{wi} r_{wi} in (19.17) with

f_{wi} = 1 if user w has followed VIP i, and f_{wi} = 0 otherwise,

to get

r̂_{ui} = Σ_{w∈Ñ_u} f_{wi},   (19.18)

which implies that, if user u has |Ñ_u| friends in the social network of instant messenger and Σ_{w∈Ñ_u} f_{wi} of them have followed VIP i in the social network of microblog, then the preference of user u on VIP i is equal to Σ_{w∈Ñ_u} f_{wi}. We can see that the two heterogeneous social networks of microblog (the following relations f_{wi}) and instant messenger (the friendship relations Ñ_u) are integrated together in an intuitive way, as shown in (19.18). The knowledge of the social relations, Ñ_u, of instant messenger is naturally embedded in the prediction method.

According to (19.18), the predicted score r̂_{ui} must be an integer since f_{wi} is either 1 or 0, so user u may have the same score on several different VIPs, in which case we cannot distinguish their ranking positions. To address this problem, we further introduce a popularity score 0 ≤ p_i ≤ 1 for each VIP i, i = 1, . . . , m, to obtain the prediction rule

r̂_{ui} = p_i + Σ_{w∈Ñ_u} f_{wi}.   (19.19)
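The final rule (19.19) needs only counting over sets, with no similarity computation at all. A sketch on hypothetical toy data:

```python
# friends in the instant-messenger network (hypothetical toy data)
friends = {0: {1, 2, 3}, 1: {0, 2}}
# follows[w][i] = 1 if user w follows VIP i in the microblog network
follows = {1: {7: 1}, 2: {7: 1, 9: 1}, 3: {9: 1}}
popularity = {7: 0.9, 9: 0.4}  # tie-breaking popularity score p_i in [0, 1]

def sort_score(u, vip):
    """(19.19): popularity plus the number of u's IM friends following the VIP."""
    count = sum(follows.get(w, {}).get(vip, 0) for w in friends.get(u, set()))
    return popularity.get(vip, 0.0) + count

scores = {v: sort_score(0, v) for v in popularity}
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here both VIPs are followed by two of user 0's friends, and the popularity term breaks the tie, which is exactly the role p_i plays above.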
The SORT method transfers the friendship relations, Ñ_u, from the source social network of instant messenger to the target VIP recommendation task in microblog. We can see that the procedures of similarity calculation and neighbor search in Resnick's rule (Resnick et al., 1994) are avoided.
20
Transfer Learning in Bioinformatics
20.1 Introduction
With fast-growing biological technology comes rapid growth in the amount of biological data. These data are made available at increasingly lower cost by bio-sensor technologies, and, as a result, in the next few years we expect to witness a dramatic increase in the application of personal genomics and personalized medicine.
In response to the data sparsity problem, various novel machine learning meth-
ods have been developed. Among them, transfer learning and multitask learning
are good solutions. In the following sections, we introduce how to use transfer
learning and multitask learning to solve bioinformatics problems.
various types of data. Besides the automatic prediction based on statistical mod-
els, mixed initiative approaches such as visualization have also been developed to
tackle the large-scale data and complex modeling problems in systems biology.
Biomedical text mining refers to using information retrieval techniques to ex-
tract information on genes, proteins and their functional relationships from scien-
tific literature (Krallinger and Valencia, 2005). Today we face a vast amount of bi-
ological findings that are published as articles, journals, blogs, books and confer-
ence proceedings. For example, PubMed and MEDLINE provide much up-to-date
information for biological researchers. If we follow a traditional way of acquiring
the information, a researcher has to read through this huge volume of informa-
tion to discover potential findings in his/her field. With text mining technologies,
new findings that are published in text can be automatically detected and then
presented to researchers.
Biomedical image mining is an important problem in many applications. The
manual classification of images is time-consuming, repetitive and unreliable.
Given a set of training images classified into a number of classes, the goal of an
automatic image classification method is to train a model to accurately predict
the category of new images. A typical example is breast cancer identification from
medical imaging data through computer-aided detection in screening mammog-
raphy. An important issue for such models is how to reduce the false positive rates
of the classification.
Transfer learning is important to solving the data sparsity problem for each of
the aforementioned problems in bioinformatics. In the following sections, we will
survey recent transfer learning studies in these areas.
We introduce some notations that are used in this chapter. The data of the source domain D_s is composed of data instances x_i^s and their corresponding labels y_i^s; thus, the source domain data (X_s, y_s) is denoted by {(x_i^s, y_i^s)}_{i=1}^{n_s}. Similarly, the data of the target domain D_t is composed of data instances x_i^t and their corresponding labels y_i^t; thus, the target domain data (X_t, y_t) is denoted by {(x_i^t, y_i^t)}_{i=1}^{n_t}. The functions f_s(·) and f_t(·) denote the predictive functions in the source domain D_s and the target domain D_t, respectively. In multitask learning, the data (X_i, y_i) of task i for i = 1, . . . , m can be represented by {(x_j^i, y_j^i)}_{j=1}^{n_i}, where m is the total number of tasks.
the sequence data can be from different domains. By learning these tasks together,
the data sparsity problem can be alleviated.
In multitask learning, the regularization approach is often exploited. Under a
regularization framework, the objective function of this approach consists of two
terms, including an empirical loss on the training data of all tasks and a regular-
ization term that encodes the relationships among tasks.
As a pioneering work, Evgeniou and Pontil (2004) propose a multitask extension of the support vector machine (SVM), which minimizes the following objective function:

ξ({w_t}) = Σ_{t=1}^{m} Σ_{i=1}^{n_t} l(y_i^t, w_t^T x_i^t) + λ_1 Σ_{t=1}^{m} ||w_t||_2^2 + λ_2 Σ_{t=1}^{m} ||w_t − (1/m) Σ_{t'=1}^{m} w_{t'}||_2^2.   (20.1)
The first and second terms in (20.1) denote the empirical error and the squared ℓ2 norm regularization of the parameter vectors, respectively, which are the same as those of single-task SVMs. The difference between single-task SVMs and multitask SVMs lies in the third term of (20.1), which is designed to penalize a large deviation between each parameter vector and the mean parameter vector over all tasks. This penalty term enforces the parameter vectors of all tasks to be similar to each other.
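Objective (20.1) is straightforward to evaluate directly. The sketch below uses the hinge loss and toy random data to show that a large λ2 favors parameter vectors that coincide with their task mean; all names, sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, d, n = 3, 5, 20                        # tasks, features, samples per task
W = rng.standard_normal((m, d))           # one parameter vector w_t per task
X = rng.standard_normal((m, n, d))
y = np.sign(rng.standard_normal((m, n)))  # toy +/-1 labels

def objective(W, X, y, lam1=0.1, lam2=1.0):
    """(20.1) with the hinge loss: empirical error + per-task L2 penalty
    + deviation of each w_t from the mean parameter vector of all tasks."""
    m = len(W)
    hinge = sum(np.maximum(0.0, 1.0 - y[t] * (X[t] @ W[t])).sum() for t in range(m))
    per_task = lam1 * np.sum(W ** 2)
    w_bar = W.mean(axis=0)                 # (1/m) * sum_t' w_t'
    deviation = lam2 * np.sum((W - w_bar) ** 2)
    return float(hinge + per_task + deviation)

# with a huge lam2, identical task vectors beat random (spread-out) ones
shared = np.tile(W.mean(axis=0), (m, 1))
val_shared = objective(shared, X, y, lam2=1e6)
val_spread = objective(W, X, y, lam2=1e6)
```

In the limit λ2 → ∞ the third term forces all w_t to a single shared vector, recovering a pooled single-task SVM, which matches the discussion above.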
One of the earliest works in multitask sequence classification is the one by Wid-
mer et al. (2010a), which proposes two regularization-based multitask learning
methods to predict the splice sites across different organisms. In order to lever-
age information from related organisms, Widmer et al. (2010a) suggest two princi-
pal approaches to incorporate relations across organisms. The proposed methods
modify the regularization term in the work by Evgeniou and Pontil (2004). How-
ever, different from Evgeniou and Pontil (2004), the relation among the organisms
in the work by Widmer et al. (2010a) is defined by a tree or a graph implied by their
taxonomy or phylogeny. The first approach trains the models in a top-down manner, where a model is learned for each node in the hierarchy over the data set of the corresponding task and the parent node provides a priori information. Its objective function is formulated as

ξ({w_t}) = Σ_{t=1}^{m} Σ_{i=1}^{n_t} l(y_i^t, w_t^T x_i^t) + λ_1 Σ_{t=1}^{m} ||w_t − w_{parent(t)}||_2^2.   (20.2)
In biology, an organism and its ancestors should be similar, due to inheritance in evolution. This information leads to the second approach in Widmer et al. (2010a), with the objective function formulated as

ξ({w_t}) = Σ_{t=1}^{m} Σ_{i=1}^{n_t} l(y_i^t, w_t^T x_i^t) + λ_1 Σ_{t=1}^{m} Σ_{t'=1}^{m} γ_{tt'} ||w_t − w_{t'}||_2^2,   (20.3)
where γ_{tt'} measures the relation between tasks t and t' derived from the taxonomy. An alternative formulation is

ξ({w_t}) = Σ_{t=1}^{m} Σ_{i=1}^{n_t} l(y_i^t, w_t^T x_i^t) + λ_1 Σ_{t=1}^{m} ||u_t||_2^2 + λ_2 Σ_{t=1}^{m} Σ_{r_t=1}^{R} ||v_{r_t}||_2^2,   (20.4)

where u_t is the parameter vector of a leaf node and v_{r_t} denotes the parameter vector of its corresponding ancestor, an internal node in the hierarchy structure.
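The hierarchy-based penalties in (20.2) and (20.3) can be computed directly once the taxonomy is encoded. Below is a toy sketch where the parent map and the closeness weights γ are invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 4, 6
W = rng.standard_normal((m, d))  # one parameter vector per task/organism

# hypothetical taxonomy: task 0 is the root; 1 and 2 are its children; 3 under 1
parent = {1: 0, 2: 0, 3: 1}

def topdown_penalty(W, parent, lam1=0.5):
    """(20.2): pull every task toward its parent node in the hierarchy."""
    return lam1 * sum(float(np.sum((W[t] - W[p]) ** 2)) for t, p in parent.items())

def graph_penalty(W, gamma, lam1=0.5):
    """(20.3): pairwise pull, weighted by taxonomic closeness gamma[t, t']."""
    m = len(W)
    return lam1 * sum(gamma[t, s] * float(np.sum((W[t] - W[s]) ** 2))
                      for t in range(m) for s in range(m))

# toy closeness: decays with the distance between task indices
gamma = np.exp(-np.abs(np.subtract.outer(np.arange(m), np.arange(m))))
p_tree = topdown_penalty(W, parent)
p_graph = graph_penalty(W, gamma)
```

Both penalties vanish exactly when all task vectors coincide, which is the "all organisms share one model" extreme of the regularizers above.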
Similarly, Schweikert et al. (2008) consider a number of domain transfer learning methods for splice site recognition across several organisms, using a model of a well-analyzed source domain with its associated data to obtain or refine a model for a less analyzed target domain. In the work by Schweikert et al. (2008), Caenorhabditis elegans is the source domain, while Caenorhabditis remanei, Pristionchus pacificus, Drosophila melanogaster and Arabidopsis thaliana are treated as target domains. A range of domain transfer learning methods is compared. The experiments in the work by Schweikert et al. (2008) verify that the differences between the classification functions for recognizing splice sites in these organisms increase with evolutionary distance.
Jacob and Vert (2007) design an algorithm to learn peptide-MHC-I binding models for many alleles simultaneously by sharing binding information across alleles. The sharing of information is controlled by a user-defined measure of similarity between alleles, where the similarity can be defined in terms of supertypes or more directly by comparing key residues that are known to play a role in peptide-MHC binding. The pair of an allele a and a peptide candidate p is represented as a feature vector. Then, based on the kernel trick, Jacob and Vert (2007) define a kernel between such pairs as the product of an allele kernel and a peptide kernel, where, for the peptide kernel K_pep, any kernel between the peptide representations can be used and, for the allele kernel K_all, the authors exploit several ways to model the relationships across alleles, including the multitask kernel and the supertype kernel.
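The product form of such a pairwise kernel is easy to sketch. Below, the allele kernel is the simple "multitask" kernel (a shared term plus a Dirac term) and the peptide kernel is linear; both are toy stand-ins for the kernels actually used, and the allele names and feature vectors are hypothetical.

```python
import numpy as np

def linear_kernel(x, y):
    """Toy peptide kernel K_pep: a plain inner product."""
    return float(np.dot(x, y))

def multitask_kernel(a, a2):
    """Toy allele kernel K_all: uniform sharing term plus a Dirac term."""
    return 1.0 + (1.0 if a == a2 else 0.0)

def pair_kernel(allele, pep, allele2, pep2):
    """K((a, p), (a', p')) = K_all(a, a') * K_pep(p, p') -- product form."""
    return multitask_kernel(allele, allele2) * linear_kernel(pep, pep2)

p1 = np.array([1.0, 0.0, 2.0])  # toy peptide feature vectors
p2 = np.array([0.5, 1.0, 1.0])

same = pair_kernel("HLA-A*0201", p1, "HLA-A*0201", p2)
diff = pair_kernel("HLA-A*0201", p1, "HLA-B*0702", p2)
```

The Dirac term doubles the kernel value for same-allele pairs, so each allele keeps some allele-specific signal while still borrowing strength from all the others.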
Jacob et al. (2008) propose a regularized multitask method for MHC class I binding prediction that groups similar tasks into clusters. To achieve this goal, the regularization term combines three penalties: Ω_mean(W), which measures the average of the weight vectors, Ω_between(W), a measure of between-cluster variance, and Ω_within(W), a measure of within-cluster variance.
Following Jacob and Vert (2007) and Jacob et al. (2008), Widmer et al. (2010b)
propose to improve the predictive power of a multitask kernel method for the
MHC class I binding prediction by developing an advanced kernel based on Jacob
and Vert (2007). In addition, Widmer et al. (2010c) investigate multitask learning
scenarios where a latent structural relation across tasks exists and apply the pro-
posed method for the splice site recognition as well as the MHC class I binding
prediction. More specifically, they model the relatedness between tasks based on
meta-tasks such that the information is transferred between two tasks t and t
according to the number of meta-tasks co-occurred in tasks t and t .
As mentioned in Section 20.2, the protein subcellular localization prediction
based on protein sequences can be categorized into biological sequence analysis.
Xu et al. (2011) compare a multitask learning method under SVMs (implementation 1) with a common feature representation-based approach (implementation 2), which is based on the works of Argyriou et al. (2006, 2008), for the protein subcellular location prediction problem. To answer the question “can multitask learning generate more accurate classifiers than single-task learning?,” Xu et al. (2011)
conduct several experiments on different organisms by comparing the test accu-
racy among the proposed multitask learning methods and baselines. Through ex-
perimental results, we can see that multitask learning techniques can generally
help improve the prediction performance for protein subcellular localization in
comparison with supervised single-task learning techniques and that the related-
ness of tasks may affect the performance of multitask learning techniques.
Liu et al. (2010b) propose a cross-platform model based on a multitask linear
regression model for the siRNA efficacy prediction. Given a vectorized represen-
tation of siRNAs, a linear ridge regression model is applied to predict the novel
siRNA efficacy from a set of siRNAs with known efficacy. It is shown that, in the
siRNA efficacy prediction, there exists certain efficacy distribution diversity across
siRNAs binding to different mRNAs and that common properties across different
siRNAs have some influence on the potent siRNA design.
where B is an m × P matrix, m is the number of SNPs and the j-th row β_j corresponds to the j-th SNP. Here the ℓ1,2 regularizer is used to select features for all the tasks.
Xu et al. (2010) also attempt to solve this problem by borrowing the idea of the collective matrix factorization (CMF) method (Singh and Gordon, 2008). The pro-
posed method uses the similarities of proteins between two interaction networks
and shows that, when the source matrix is sufficiently dense and similar to the
target network, transfer learning is effective for predicting protein–protein inter-
actions in a sparse network. Consider a similarity matrix S ∈ Rm×n as the corre-
spondence between networks G and P . The rows and columns of S correspond to
proteins in networks G and P , respectively, and each element S i j of S represents
the similarity between node i in network G and node j in network P . The objective
function of the proposed method is formulated as
min_{T, P_d} Σ_{d=0}^{D} ||X_0 T P_d − Y_d||_F^2 + λ ||T||_1   s.t. ||P_d||_F = 1, ∀d ∈ {1, . . . , D},

where T_d = T P_d means that the phenotype responses under different experimental conditions lie in the same low-dimensional space T. Hence, the first term in this objective function enforces the fit between the gene expression and the phenotypic signature under each condition, while the second term enforces sparsity on T.
Bickel et al. (2008) study the problem of predicting the HIV therapy outcomes
of different drug combinations based on observed genetic properties of patients,
where each task corresponds to a particular drug combination. The authors pro-
pose to jointly train models for different drug combinations by pooling data to-
gether for all tasks and use the weights to adapt the data for each particular task.
The goal is to learn a hypothesis f t : x → y for each task t by minimizing the
loss function with respect to p(x, y|t ), where x describes the genotype of the virus
that a patient carries as well as the patient’s treatment history and y denotes the
class label indicating whether the therapy is successful or not. Simply pooling the
available data for all tasks will generate a set of training samples D = {(x_{ij}, y_{ij}, i)}.
The proposed method is to create a task-specific weight function r t (x, y) for each
sample.
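As a rough illustration of this weighting idea (a sketch, not the authors' exact estimator), r_t(x) can be approximated with a discriminative classifier that separates task-t samples from the pooled data; the function names below are ours.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_task_discriminator(X_pool, task_ids, t, lr=0.1, epochs=500):
    """Logistic regression separating task-t samples (label 1) from the
    rest of the pool (label 0), trained by plain gradient descent."""
    y = (task_ids == t).astype(float)
    w, b = np.zeros(X_pool.shape[1]), 0.0
    for _ in range(epochs):
        p = sigmoid(X_pool @ w + b)
        w -= lr * X_pool.T @ (p - y) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

def task_weights(X_pool, task_ids, t):
    """r_t(x) ~ p(t|x) / p(t): up-weights pooled samples that look as if
    they were drawn from task t's distribution."""
    w, b = fit_task_discriminator(X_pool, task_ids, t)
    return sigmoid(X_pool @ w + b) / np.mean(task_ids == t)
```

Samples whose drug combination behaves like task t receive weights above 1, so a weighted learner for task t can exploit the whole pool.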
Figure 20.1 An illustration of the DNP algorithm: (a) The selected features and
the corresponding subnetwork; (b) the selection of a single feature; (c) gradients
with lower variance via multiple dropouts
Liu et al. (2017) introduce a deep neural network model tailored for the HDLSS
data, which is named deep neural pursuit (DNP). DNP selects a subset of features
from a very long sequence of genes (approximately 200,000 long) with very small
sample sizes. To alleviate the problem of overfitting, DNP takes the average over
multiple dropouts to calculate gradients with low variance. By using a deep neural
network, DNP enjoys the advantages of high nonlinearity, robustness to high dimensionality and the capability of learning from a small number of samples.
This allows it to maintain the stability in feature selection in an end-to-end style
of model training.
For a feedforward neural network, we can select a specific input feature if at least one of the connections associated with that feature has a nonzero weight. To achieve this goal, we place an ℓ_{p,1} norm constraint on the input weights W_F, that is, on ‖W_F‖_{p,1}. We use W_F^j to denote the weights associated with the j-th input node in W_F. We can define the ℓ_{p,1} norm of the input weights as ‖W_F‖_{p,1} = Σ_j ‖W_F^j‖_p, where ‖·‖_p is the ℓ_p norm of a vector. One effect of the ℓ_{p,1} norm is to enforce group sparsity (Evgeniou and Pontil, 2007), and here we assume that the weights in W_F^j form a group. A general form of the objective function for training the feedforward network is formulated as
$$\min_{\mathbf{W}} \sum_{i=1}^{n} \ell(y_i, f(\mathbf{x}_i | \mathbf{W})) \quad \text{s.t.} \quad \|\mathbf{W}_F\|_{p,1} \le \lambda. \tag{20.12}$$
Without loss of generality, we only consider the binary classification problem and
use the logistic loss in (20.12).
The whole process of the feature selection in the DNP consists of training a
deep neural network. We graphically illustrate DNP’s greedy feature selection in
Figure 20.1 and detail the learning process in Algorithm 20.1.
In DNP, we maintain two sets, that is, a selected set S and a candidate set C, with S ∪ C = F. Initially, S starts from the bias to avoid the case that all rectified linear hidden units are inactive. Except for the weights corresponding to the bias, all weights in the neural network are initialized to zero. Given the selected set S, the input weights W_F comprise the selected input weights W_S, which are the input weights associated with features in S, and the candidate weights W_C. We update the whole neural network until convergence while fixing all candidate weights W_C to zero (i.e., steps 4 and 5 of Algorithm 20.1). In Figure 20.1(a), we plot S and C with solid circles and dotted circles, respectively. All dotted connections are fixed at zero. Then, the gradient G_F is employed to select one feature, say the j-th one from C (step 7). After that, W_F is updated by initializing the newly selected input weights W_F^j with the Xavier initializer (Glorot and Bengio, 2010) and reusing the earlier weights W_S (step 9). S and C are updated by adding and removing j, respectively (step 10).
One question is how to select features using G_F. Without loss of generality, we assume that all features are normalized. The gradient's magnitude implies how much the objective function may decrease by updating the corresponding weight (Perkins et al., 2003). Similarly, the norm of a group of gradients indicates how much the loss may decrease by updating this group of weights together. According to Tewari et al. (2011), there exists an equivalence between minimizing the ℓ_{p,1} norm in (20.12) and greedily selecting the feature with the maximum ℓ_q norm of gradients, where q satisfies 1/p + 1/q = 1. We assume that the larger ‖G_F^j‖_q is, the more the j-th feature contributes to minimizing (20.12). Consequently, we select the feature with the maximum ‖G_F^j‖_q. Throughout the experiments, we choose p = q = 2, since empirical comparisons among different settings of p show only limited differences. On the other hand, DNP can satisfy the norm constraint, that is, ‖W_F‖_{p,1} ≤ λ, by early stopping at the k-th iteration. We illustrate the selection of a single feature in Figure 20.1(b).
Due to the small sample size, the backpropagated gradients in DNP have es-
pecially high variance. This makes selecting the features according to gradients
misleading. As shown in Figure 20.1(c), DNP utilizes the multiple-dropouts technique to avoid high-variance gradients. As a regularizer, dropout (Srivastava et al., 2014) randomly drops neurons and features during forward training and backpropagation. Therefore, the gradients G are calculated on the subnetwork composed of the remaining neurons.
Multiple dropouts in DNP can improve the quality of the features selected. First,
according to step 6 of Algorithm 20.1, DNP randomly drops neurons multiple times, computes G_F^C based on the remaining neurons and connections, and averages the multiple values of G_F^C. This multiple-dropouts technique obtains averaged gradients with low variance.
More importantly, multiple dropouts empower DNP with stable feature selection. Stability, as a vital criterion for feature selection, indicates that identical
features should be consistently selected even using slightly changed training data
sets (Kalousis et al., 2007). Multiple dropouts combine selected features over many
random subnetworks to make the DNP method more stable and powerful.
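To make the greedy step concrete, here is a minimal sketch of one DNP-style selection, assuming a one-hidden-layer ReLU network with a logistic output; the function name and network shapes are illustrative, not the book's implementation.

```python
import numpy as np

def dnp_select_next_feature(X, y, W_in, W_out, selected, n_dropout=10, p_drop=0.5, seed=0):
    """One greedy DNP-style step (simplified sketch): average the
    input-weight gradients over several random dropouts of the hidden
    layer, then pick the candidate feature whose gradient group has the
    largest l2 norm. W_in: (d, h) input weights, W_out: (h,) output
    weights; `selected` is the set S of already-chosen feature indices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    grads = np.zeros_like(W_in)
    for _ in range(n_dropout):
        mask = (rng.random(W_in.shape[1]) > p_drop).astype(float)
        H = np.maximum(X @ W_in, 0.0) * mask        # ReLU hidden layer with dropout
        p = 1.0 / (1.0 + np.exp(-(H @ W_out)))      # logistic output
        dH = np.outer(p - y, W_out) * mask * (X @ W_in > 0)
        grads += X.T @ dH / n                       # gradient w.r.t. input weights
    grads /= n_dropout
    candidates = [j for j in range(d) if j not in selected]
    norms = {j: np.linalg.norm(grads[j]) for j in candidates}
    return max(norms, key=norms.get)
```

In a full DNP loop this step would alternate with retraining the subnetwork over S until the feature budget or the early-stopping criterion is reached.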
et al. (2017b) train a deep learning model using thousands of non-MPs with solved
structures. The non-MPs serve as the source data, while the MPs as the target data.
The transfer learning model works well for MP contact prediction, improving the prediction accuracy by a large margin. The authors went on to study why
transfer learning worked well in MP prediction. They found that the underlying
contact occurrence patterns in both MPs and non-MPs are similar, implying that
the structure of the problem space is similar.
A data set is composed of enzyme–ligand interaction data, G-protein-coupled
receptors (GPCRs)–ligand interaction data and ion channel–ligand interaction
data. Another released ligand interaction data set contains four subsets for en-
zymes, ion channels, GPCRs, and nuclear receptors (Kashima et al., 2009), respec-
tively.
21
Transfer Learning in Activity Recognition
21.1 Introduction
Human behavior recognition from sensor observations is an important topic
in both AI and mobile computing. It is also a difficult task as the sensor and be-
havior data are usually noisy and limited. In this chapter, we review two major
problems in human behavior recognition, including location estimation and ac-
tivity recognition. Solving these two problems helps answer typical questions in
human behavior recognition, such as where a user is, what the user is doing and whether the user will be interested in doing something somewhere. In prior attempts to solve these problems, we find that in practice the biggest challenge
comes from the data sparsity. Such data sparsity can be because we have lim-
ited labeled data for new contexts in localization, or limited sensor data for users
and activities in activity recognition. In order to address these challenges, transfer
learning, which can effectively incorporate domain-dependent auxiliary data in
the training process and thus greatly relieve the data sparsity problem, becomes
a viable approach. In the remainder of this chapter, we introduce research
works on using transfer learning for wireless localization and sensor-based activ-
ity recognition.
where Y is the set of possible locations in the environment. Generally, the mobile
device receives different signal strength vectors at different locations. As a result,
given a signal-to-location mapping function, we can predict the user’s location
with her current signal strength vector. Such a signal-to-location mapping function is also referred to as a localization model, which is capable of transforming a signal vector to a location. In the offline training stage, given sufficient labeled data
{(xi , y i )}, we learn a mapping function f : Rd1 → Y . In the online testing stage, we
use f to predict the location for a new signal vector x (Pan et al., 2007a).
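As a concrete toy instantiation of such a mapping f, a nearest-neighbor fingerprinting classifier can serve as the localization model (the names below are ours, not from the chapter):

```python
import numpy as np
from collections import Counter

def knn_localize(x, fingerprints, locations, k=3):
    """Predict a location label for RSS vector x by majority vote among
    the k training fingerprints closest in Euclidean distance: a minimal
    instantiation of the offline/online fingerprinting pipeline."""
    d = np.linalg.norm(fingerprints - x, axis=1)
    nearest = np.argsort(d)[:k]
    votes = Counter(locations[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Here the offline stage reduces to storing (fingerprint, location) pairs and the online stage to the distance computation and vote.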
(Figure: histograms of received signal strength (RSS, in dBm) versus frequency: (a) RSS at device A; (b) RSS at device B; (c) RSS at time A; (d) RSS at time B.)
The variation of RSS values across devices and over time motivates us to carefully consider the signal variation problem, where only few or sparse data are available in the new context.
As we cannot afford to collect a large amount of labeled data every time the context changes, we face the data sparsity challenge in learning a localization model for a new mobile device, a new time period or a new space.
Traditional learning algorithms may just ignore the signal data difference between
different contexts, and use the existing data in another context to train a model.
In general, such a simple strategy by overlooking the difference may greatly de-
teriorate the localization performance. This motivates us to take the difference
of these data into account and carefully design transfer learning algorithms. In
the following section, we review transfer learning algorithms in the application of
wireless localization according to different transfer strategies, including feature-
based, instance-based and model-based transfer learning.
collected a large amount of labeled data D_s = {(x_i^{(s)}, y_i^{(s)}) | i = 1, ..., n_s}. On the target device, we may collect a small amount of labeled data D_t = {(x_i^{(t)}, y_i^{(t)}) | i = 1, ..., n_t}. Finally, we also have a test data set from the target device D_{tst} = {(x_{tst}^{(i)}, y_{tst}^{(i)}) | i = 1, ..., n_{tst}}. This setting is illustrated in Figure 21.3, where a matrix in the figure denotes a 2D location space and a tick indicates labeled data collected in that location.
Target device with data (i.e., D_t ≠ ∅): MeanShift (Haeberlen et al., 2004) treats the signal variation as a Gaussian mean-value shift, and uses a linear model

$$x_{t,j} = c_1 \cdot x_{s,j} + c_2, \tag{21.1}$$
to fit the RSS value x_{t,j} of the j-th AP on a target device based on the RSS value x_{s,j} on a source device. Here, c_1 and c_2 are model parameters to be estimated by the least squares fit. Once c_1 and c_2 are learned, we can transform all the data from the source device {x_i^{(s)} | i = 1, ..., n_s} into the target device's signal space. Finally, we can have much
more data for the target device and thus be able to train an accurate classifier for
the localization.
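The calibration in (21.1) is an ordinary least-squares fit of c_1 and c_2 from paired readings of the same APs on the two devices; a minimal sketch (function names are ours):

```python
import numpy as np

def fit_meanshift(rss_source, rss_target):
    """Least-squares fit of the linear device-calibration model
    x_t = c1 * x_s + c2 from paired RSS readings observed on the
    source and target devices."""
    A = np.column_stack([rss_source, np.ones_like(rss_source)])
    (c1, c2), *_ = np.linalg.lstsq(A, rss_target, rcond=None)
    return c1, c2

def transform(rss_source, c1, c2):
    """Map source-device readings into the target device's signal space."""
    return c1 * rss_source + c2
```

All source-device fingerprints can then be mapped through `transform` to augment the target device's training set.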
Similar to MeanShift, ModelTree (Yin et al., 2005) also applies a regression anal-
ysis to learn the temporal predictive relationship between the RSS values received
by sparsely located reference points and that received by the mobile device. Then
it uses the newly observed RSS values at the device and the reference points for
the localization with some decision tree algorithm.
Target device without data (i.e., D_t = ∅): Kjaergaard and Munk (2008) propose a hyperbolic location fingerprinting (HLF) method to address the device signal variation problem. The intuition is that each single RSS reading from a certain AP is vulnerable to the device heterogeneity, but the relative value between the RSS readings from two certain APs may be more stable. Therefore, the HLF method turns the
absolute RSS values into ratios among different APs and uses them as the new
where the expectation is taken over each dimension of x to account for the ran-
domness in the RSS from each AP. Finally, given g (·), we can build a function
f : Rd2 → Y to locate the heterogeneous devices. Because pairwise RSS values can
be shown to be insufficient to discriminate different locations, the HOP model
designs some higher-order features as follows.
We then learn a set of hs such that they are representative for the data by solving
the following problem
$$\max_{\mathbf{h} \in \{0,1\}^{d_2}} \sum_{k=1}^{n_L} \log P(\mathbf{x}^{(k)}; \mathbf{h}), \tag{21.4}$$

where h = [h_1, ..., h_{d_2}] is a vector of d_2 HOP features and P(x; h) is the data likelihood to be defined later based on h. Directly learning h leads to optimizing an
excessive number of parameters. We can reduce the number of parameters by
rewriting (21.3) to an equivalent form as
$$\sum_{(k_1,k_2)} c_{k_1,k_2} (x_{k_1} - x_{k_2}) + b = \sum_{i=1}^{d_1} \alpha_i x_i + b, \tag{21.5}$$

where $\alpha_i = \sum_{(k_1,k_2)} \left[ c_{k_1,k_2} \delta(k_1 = i) - c_{k_1,k_2} \delta(k_2 = i) \right]$. Zheng et al. (2016) prove

$$\sum_{i=1}^{d_1} \alpha_i = 0, \tag{21.6}$$

which means that a HOP feature defined in (21.3) corresponds to a special feature transformation function with a constraint as

$$h = \delta\left( \sum_{i=1}^{d_1} \alpha_i x_i + b > 0 \right) \quad \text{s.t.} \quad \sum_{i=1}^{d_1} \alpha_i = 0. \tag{21.7}$$
As a result, to learn HOP features, we only need to focus on learning linear weights
for each individual RSS value x i subject to a zero-sum constraint. A careful deriva-
tion shows that (21.4) can be learned through a constrained restricted Boltzmann
machine (RBM) as
$$P(\mathbf{x}; \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{x}, \mathbf{h})}, \tag{21.8}$$

where $E(\mathbf{x}, \mathbf{h}) = \sum_{i=1}^{d_1} \frac{(x_i - a_i)^2}{2\pi_i^2} - \sum_{j=1}^{d_2} b_j h_j - \sum_{i,j} \frac{x_i}{\pi_i} h_j w_{ij}$ is an energy function and $Z = \sum_{\mathbf{x}, \mathbf{h}} e^{-E(\mathbf{x}, \mathbf{h})}$ is a partition function. The first term of $E(\mathbf{x}, \mathbf{h})$ models a Gaussian
distribution over each $x_i$, where $a_i$ and $\pi_i$ are the mean and standard deviation. The second term models the bias $b_j$ for each $h_j$. The third term models the linear mapping between x and $h_j$. In an RBM, each $h_j$ can be seen as $h_j = \delta\left( \sum_{i=1}^{d_1} \frac{x_i}{\pi_i} w_{i,j} + b_j > 0 \right)$, and it is sampled by a conditional probability (Krizhevsky and Hinton, 2009) as

$$P(h_j = 1 | \mathbf{x}) = \sigma\left( \sum_{i=1}^{d_1} \frac{x_i}{\pi_i} w_{i,j} + b_j \right), \tag{21.9}$$

where $\sigma(r) = \frac{1}{1 + e^{-r}}$ is the sigmoid function. To take the zero-sum constraint into account, we compare (21.9) with (21.7), and set $\alpha_i = \frac{1}{\pi_i} w_{ij}$. Finally, the objective function is formulated as

$$\min \; -\frac{1}{n_L} \sum_{k=1}^{n_L} \log P(\mathbf{x}^{(k)}) \quad \text{s.t.} \quad \sum_{i=1}^{d_1} \frac{1}{\pi_i} w_{ij} = 0, \; \forall j. \tag{21.10}$$
These HOP features can be learned together with the localization classifier.
In addition to this feature-based transfer learning methods, there are some
feature-based transfer learning methods for cross-space localization (Wang et al.,
2010) and cross-device and cross-time localization (Zhang et al., 2013).
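To make the zero-sum constraint concrete, the sketch below computes binary HOP features h_j = δ(Σ_i (x_i/π_i) w_{ij} + b_j > 0) and enforces Σ_i w_{ij}/π_i = 0 for every column by a simple projection; this illustrates the constraint only and is not the constrained-RBM training procedure of the chapter.

```python
import numpy as np

def project_zero_sum(W, pi):
    """Enforce sum_i (1/pi_i) * w_ij = 0 for every column j by
    subtracting each column's weighted mean (a simple projection)."""
    alpha = W / pi[:, None]                      # alpha_ij = w_ij / pi_i
    alpha -= alpha.mean(axis=0, keepdims=True)   # zero-sum columns
    return alpha * pi[:, None]

def hop_features(x, W, b, pi):
    """Binary HOP features h_j = 1[ sum_i (x_i/pi_i) * w_ij + b_j > 0 ]."""
    return ((x / pi) @ W + b > 0).astype(int)
```

During learning, such a projection could be applied after each gradient step as a crude stand-in for the constrained optimization in (21.10).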
model is formulated as

$$( f^{(s)*}, f^{(t)*} ) = \arg\min_{f^{(s)}, f^{(t)}} \; \frac{\mu}{n_s} \sum_{i=1}^{n_s} V(\mathbf{x}_i^{(s)}, y_i^{(s)}, f^{(s)}) + \gamma_A \| f^{(s)} \|_{H_K}^2 + \gamma_I^{(1)} \| f^{(s)} \|_{I_1}^2$$
$$\qquad + \frac{1}{n_s} \sum_{i=1}^{n_s} V(\mathbf{x}_i^{(t)}, y_i^{(t)}, f^{(t)}) + \gamma_A \| f^{(t)} \|_{H_K}^2 + \gamma_I^{(2)} \| f^{(t)} \|_{I_2}^2$$
$$\qquad + \gamma_I \sum_{i} \left[ f^{(s)}(\mathbf{x}_i^{(s)}) - f^{(t)}(\mathbf{x}_i^{(t)}) \right]^2. \tag{21.11}$$
The first term in (21.11) is to minimize the localization loss at the source time
period, and the second and third terms are for the manifold regularization. The
next three terms are defined similarly for the target time period. The last term in
(21.11) is to enforce the location predictions on the reference points, which can receive real-time RSS values, to be consistent between the two localization classifiers f^{(s)} and f^{(t)}. In this way, the training instances in different time periods can be leveraged
together.
Xu et al. (2017) propose a metric transfer learning framework (MTLF). Many
previous studies use the Euclidean distance to measure the dissimilarity between
instances from two different domains. However, the Euclidean distance may be
suboptimal in some real world applications. In MTLF, instance weights are learned
and exploited to bridge the distributions of different domains, while the Maha-
lanobis distance is learned simultaneously to maximize the interclass distances
and minimize the intraclass distances for the target domain. In addition to these
instance-based transfer learning methods for the cross-time localization, there is
an instance-based transfer learning method for the cross-space localization (Pan
et al., 2008c).
This work on instance-based transfer learning focuses on handling localiza-
tion with wireless data. There exist some work that uses instance-based transfer
learning for the localization based on the image data. For example, in the work by
Lu et al. (2016), the localization system considers two kinds of inputs, including
red green blue (RGB) images that are obtained under a normal light condition
and thermal images that are obtained under an emergency power outage condi-
tion. As thermal images are not obtained as easily as color images, an active trans-
fer learning method is proposed to treat RGB images as the source domain and
thermal images as the target domain. On the one hand, it uses an adaptive multi-
kernel learning framework to train the model with both labeled RGB images and
thermal images. On the other hand, it also tries to carefully choose thermal images
to be labeled by the human expert so as to maximize the performance gain.
$$\mathbf{w}_t = \mathbf{w}_0 + \mathbf{v}_t, \quad \forall t = 1, \ldots, T,$$
where w0 is shared by all the tasks and vt is specific to task t . We are interested in
finding appropriate feature mappings ϕt which can map the raw signal data to a
k-dimensional latent feature space where the learned hypotheses across tasks are
similar, that is, vt is “small.”
The objective function of the LatentMTL model is formulated as
$$\min_{\mathbf{w}_0, \mathbf{v}_t, \xi_{it}, \xi_{it}^*, b, \phi_t} \underbrace{\sum_{t=1}^{T} \sum_{i=1}^{n_t} \pi_t (\xi_{it} + \xi_{it}^*)}_{\text{loss}} + \underbrace{\frac{\lambda_1}{T} \sum_{t=1}^{T} \| \mathbf{v}_t \|^2}_{\text{knowledge share}} + \underbrace{\lambda_2 \| \mathbf{w}_0 \|^2 + \frac{\lambda_3}{T} \sum_{t=1}^{T} \Omega(\phi_t)}_{\text{regularization}}$$
Figure 21.5 The TrHMM model to adapt a localization model from time 0 to time
t
• In the loss term, ξ_{it} and ξ*_{it} are slack variables measuring the errors, and π_t is the weight parameter for each task t.
• In the knowledge-share term, minimizing ‖v_t‖² regularizes the dissimilarities among the task hypotheses in the latent feature space φ_t(x).
• In the regularization term, minimizing ‖w_0‖² corresponds to maximizing the margin of the learned models to provide the generalization ability. Generally, the regularization parameter λ_1 is set to be larger than λ_2 to force the task hypotheses to be similar. Ω(φ_t) penalizes the complexity of the mapping function φ_t. To make the problem tractable, we consider φ_t ∈ R^{k×d} as a linear transformation by letting φ_t(x) = φ_t x. We use the squared Frobenius norm for Ω(φ_t), that is, Ω(φ_t) = ‖φ_t‖_F².
• The constraints follow the routine of the standard ε-SVR (Scholkopf and Smola, 2001) with b as the bias term and ε as the tolerance parameter.
Sequential model: The TrHMM model (Zheng et al., 2008b) exploits the trajecto-
ries (i.e., the blue lines in Figure 21.4) by using the hidden Markov model (HMM)
and then for transfer learning, it allows HMM parameters to be shared and up-
dated carefully across different time periods.
For an HMM θ = (λ, A, π), the radio map λ = {P(x_i | y_i)} models the signal distribution, for example, the Gaussian distribution $P(\mathbf{x}|y) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} e^{-\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})}$ at each location, where μ is the mean and Σ is the covariance matrix. By assuming
the independence among the APs according to Ladd et al. (2002), we can simplify
Σ as a diagonal matrix. The transition matrix A = {P (y i +1 |y i )} encodes the prob-
ability for a user to move from one location to another. The location prior π is
set to be a uniform distribution over all the locations as a user can start from any
location.
• It learns the signal correlation model α from the source time period 0. Specifically, TrHMM uses a multiple linear regression model on the data at time 0 over all the location grids and derives the regression coefficients α^k = {α^k_{ij}}, which encode the signal correlations between the reference locations {l_c} and a non-reference location k. That is,

$$s_j^k = \sum_{i=1}^{n} \alpha_{ij}^k r_{ij},$$

where $s_j^k$ is the RSS at location k from the j-th AP, $\alpha_{ij}^k$ ($1 \le i \le n$) are the regression weights for the j-th AP signal at location k and $r_{ij}$ ($1 \le i \le n$) is the RSS at the i-th reference point from the j-th AP.
• It applies the signal correlation model α to the target time period t and re-estimates the radio map using the up-to-date signal data. Specifically, it uses the αs to update the signal strengths at the non-reference locations with the signal strengths newly collected at the reference point locations {l_c}. As there
newly collected signal strengths on the reference point locations {l c }. As there
is a possible shift for the regression parameters over time, a trade-off constraint
is added to derive the new λt as
r eg
μ0 + (1 − β)μ
μ t = βμ μt
# T $ # $
r eg r eg r eg T
Σ t = β Σ0 + μ t − μ 0 μ t − μ 0 + (1 − β) Σt + μ t − μ t μt − μt ,
r eg r eg r eg
where we balance the regressed radio map λt = (μ μ t , Σt ) and the base ra-
dio map λ0 = (μ μ0 , Σ0 ) by introducing a parameter β ∈ [0, 1].
• It updates the model, whose location prior π and transition matrix A are shared from the source time 0, by using the trace data T_t. Specifically, we first train an HMM θ_0 = (λ_0, A_0, π_0) at time 0 as the base model. Then, in another time period t, we improve λ_0 by applying the regression analysis and obtain a new HMM θ_t = (λ_t, A_0, π_0).
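The moment-matching radio map update above can be sketched directly; the helper below blends the base and regressed maps for a single location with the trade-off parameter β (the function name is ours).

```python
import numpy as np

def update_radio_map(mu0, sigma0, mu_reg, sigma_reg, beta):
    """Blend the base radio map (mu0, sigma0) from time 0 with the
    regressed radio map (mu_reg, sigma_reg) at time t using the
    trade-off parameter beta in [0, 1]."""
    mu_t = beta * mu0 + (1 - beta) * mu_reg
    d0 = (mu_t - mu0)[:, None]
    dr = (mu_t - mu_reg)[:, None]
    # Each branch adds the mean-shift outer product to keep the blended
    # covariance consistent with the blended mean.
    sigma_t = beta * (sigma0 + d0 @ d0.T) + (1 - beta) * (sigma_reg + dr @ dr.T)
    return mu_t, sigma_t
```

With β = 1 the update trusts the base map entirely; with β = 0 it trusts the regressed map.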
With such severe labeled data sparsity problem, it becomes natural to consider
transfer learning in real world activity scenarios. In the context of transfer learn-
ing, we would hope that: (1) the source and target domains can have different fea-
ture spaces, for example, using sensor readings collected from smartphones (the
source domain) to help recognize activities from smart watch sensor readings (the
target domain); (2) the source and target domains can have different probability
distributions, for example, using the sensor readings collected on one person to
help recognize activities for another person; and (3) the source and target domains
can have different label spaces, for example, using sensor readings collected for
walking and running to help recognize activities for swimming.
Different transfer learning approaches have been proposed in recent years to
tackle different aspects of the aforementioned transfer learning settings. For trans-
ferring between different feature spaces, Khan et al. (2018) propose a transductive
transfer learning model based on CNNs. The proposed CNN model minimizes the layerwise Kullback–Leibler divergence between the source and target domains.
For transferring between different persons, Deng et al. (2014) propose a cross-
person activity recognition approach that uses a reduced kernel extreme learn-
ing machine to realize the initial activity recognition model. Deng et al. (2014)
also propose an online learning algorithm to use highly confident recognition re-
sults to adapt the online model. For transferring between different label spaces,
Wang et al. (2018b) propose a stratified transfer learning framework to first ob-
tain pseudo labels on the target domain and then transform labels from both the
source and target domains into a subspace.
In the remainder of this section, we will discuss some transfer learning approaches in detail. In particular, we will discuss how to relax the same-feature-space requirement and the same-label-space requirement, that is, activity recog-
nition from different sets of sensors and activity recognition between different sets
of activities.
where c is an activity label. Since the activity-label space may be large, for simplicity, we approximate the value of p(y_t | x_t) by the mode, that is, the most likely label under p(c | x_t), denoted by ĉ. In other words, ĉ = arg max_c p(c | x_t).
We assume that the two label spaces are different but related. Therefore, the
joint distribution p(ys , yt ) should have high mutual information in general and
that p(yt |ĉ) should also be high.
From this equation, the introduced transfer learning framework takes two steps.
In the first step, we will estimate p(ĉ|xt ) where ĉ is labeled according to the source
domain label space. Briefly speaking, we aim to use the source domain label space
to explain the target domain sequences xt first. Since the two domains have dif-
ferent feature spaces, in the first step we need to transfer across different feature
spaces. Next, we estimate p(yt |ĉ) where yt is defined on the target domain label
space and ĉ is defined on the source domain label space, hence, in the second
step, we need to transfer across different label spaces.
We can extract two kinds of information from sensor readings. The first is that, given a sequence of sensor readings, we can estimate the generative distribution
from which such a sensor reading is generated. Since we only care about the rel-
ative distance between two distributions of sensor readings instead of describing
these distributions accurately, we simply plot the frequency of each sensor value,
which will be discretized if it is continuous, and then smooth the discretized prob-
ability distribution. Since we have quite different feature spaces, we first normal-
ize all our sensor readings into the range of [0,1].
In particular, suppose that we have a training set in the source domain {x_i, y_i}, where x_i is a sensor reading sequence and y_i is the corresponding label. For each activity y_i, we can select all sequences of sensor readings x that have y_i as the label. Next, we can count the occurrences of sensor values x_{ij}, and then estimate the probability distribution for each of the sensors in the sensor reading sequence x_i. An intuitive explanation of the aforementioned method is that we try to link each gener-
ative distribution of different sensors to a target activity.
Following a similar approach, we can also estimate the probability distribution
for each sensor reading sequence in the target domain. Now that for each sensor
reading sequence, we have an estimated distribution Q and we wish to find a close
distribution P in the source domain. KL divergence is asymmetric, that is, D_KL(P ∥ Q) ≠ D_KL(Q ∥ P). Therefore, instead of calculating D_KL(P ∥ Q), we use D_KL(P ∥ Q) + D_KL(Q ∥ P), which is a symmetric measurement, to measure the distance between the two distributions generating sensor readings.
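A small sketch of this distribution-matching step, assuming readings have already been normalized into [0, 1]; the bin count and smoothing constant are illustrative choices.

```python
import numpy as np

def histogram_dist(readings, bins=20, eps=1e-6):
    """Discretize normalized sensor readings in [0, 1] into a smoothed
    probability distribution."""
    counts, _ = np.histogram(readings, bins=bins, range=(0.0, 1.0))
    p = counts.astype(float) + eps      # additive smoothing avoids log(0)
    return p / p.sum()

def symmetric_kl(p, q):
    """D_KL(P||Q) + D_KL(Q||P): the symmetric divergence used to match
    target-domain reading distributions to source-domain ones."""
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))
```

Each target-domain sequence can then be compared against every source-domain distribution and the closest (and, per the discussion below of negatively correlated sensors, also the farthest) candidates recorded.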
Two issues need to be addressed for the selection of candidate labels based
on the relative entropy measurements. The first issue is that, although D_KL(P ∥ Q) + D_KL(Q ∥ P) equals zero if and only if the two distributions P and Q are identical, a very large divergence value does not necessarily mean that the two sensors are uncorrelated. Consider two accelerometers where the
directions of accelerations are different. In this case, whenever the first accelerom-
eter senses a high value, the second accelerometer will sense a low value. There-
fore, we need to consider distribution pairs at both high divergence and low di-
vergence values. The second issue we consider is the different sampling rates of
different sensors when plotting their signal values versus time. Different kinds of
sensors have very different sampling rates and the accuracy of distributions esti-
mated can vary a lot. When calculating the correlation between different sensors,
another important step is to use a distance metric that can take different sampling
rates into account. Now given two series of sensor readings of only one dimension,
Q and C of length n and m, we wish to align two sequences based on DTW (Keogh
and Pazzani, 2000).
The idea of DTW is simple. We could construct an n-by-m matrix where the
(i , j )-th element contains the distance d (q i , c j ) between the two points q i and
c j , which is measured as the absolute value of difference of q i and c j , that is,
d (q i , c j ) = |q i − c j |. Since the (i , j )-th element corresponds to the alignment be-
tween q i and c j , the objective is to find a warping path W that is a contiguous set
Algorithm 21.1 Projecting the labels in the source domain to the unlabeled sensor
readings in the target domain
Input: Source domain activity labels L_s, source domain data D_s = {(x_s, y_s)} = {(x_i, y_i) | y_i ∈ L_s}, target domain data D_t = {x_t}
Output: Pseudo-labeled target domain data D̂_t = {(x_t, ĉ)}
begin
1: Normalize each sensor reading sequence in S and T .
2: For each pair of sensor reading and activity in (xs , ys ) ∈ S, estimate its proba-
bility distribution p( f s |y s ).
3: For each unlabeled sequence in the target domain xt , estimate the distribution
of its feature values: P ( f t ).
4: Calculate the relative entropy between distributions in T and all the distribu-
tions in S. Take the top-K similar and the bottom-K similar distributions out
and record their labels as candidates.
5: Calculate the DTW score between this sensor reading sequence xt and all the
labeled sensor reading sequences (xs , ys ) in the source domain. Take the top-K
highest and the bottom-K lowest similar sensor readings out and record their
labels as candidates.
6: Label an unlabeled sequence xt with the label that appeared maximum times
in the candidate label set.
end
of elements that define the mapping between Q and C. Thus, the element at position k of the warping path W is defined as w_k = (i, j)_k. This warping path can be found via dynamic programming with a quadratic time complexity.
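The quadratic-time dynamic program for DTW described above can be written down directly; a minimal sketch:

```python
import numpy as np

def dtw_distance(q, c):
    """Dynamic time warping distance between 1-D series q (length n) and
    c (length m), with |q_i - c_j| as the local cost, computed by the
    quadratic-time dynamic program."""
    n, m = len(q), len(c)
    R = np.full((n + 1, m + 1), np.inf)
    R[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] - c[j - 1])
            # Extend the cheapest of the three admissible predecessors.
            R[i, j] = d + min(R[i - 1, j], R[i, j - 1], R[i - 1, j - 1])
    return R[n, m]
```

Because the warp can stretch one series against the other, repeated samples caused by differing sampling rates are absorbed at no cost.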
Algorithm 21.1 shows the step for projecting the labels in the source domain to
the unlabeled sensor readings in the target domain. Notice that in Algorithm 21.1,
we introduce a parameter K , which is used to control the number of candidate
label sequences in the source domain.
From this formulation, we can see that such a problem can be reduced to an estimation problem.
Algorithm 21.2 Projecting target domain sequences with source domain labels to
target domain sequences with target domain labels
Input: Pseudo-labeled target domain data D̂_t = {(x_t, ĉ)}
Output: Labeled target domain data: Dt∗ = {(xt , yt )}
begin
1: For each pseudo-labeled target domain instance d_t, calculate its minimum loss value R(i, j) based on the recurrence relation R(i, j) = min_{k∈L_t} {R(i−1, k) + NGD(ĉ_i, j) + NGD(k, j)}, where NGD denotes the Google similarity distance metric.
2: Relabel d t using the labels in the target domain label space, thereby creating a
new sequence d t∗ .
end
We briefly explain the nature of this recursive relation. In order to minimize the loss up to time slice i, we need to consider the minimum loss up to time slice i−1. To do that, we need to enumerate all possible R(i−1, k), where k ∈ L_t is the label we assigned to time slice i−1. Next, we need to minimize the distance between the original "pseudo-label" ĉ_i and the new label j ∈ L_t. Furthermore, NGD(k, j) is also considered in the recursive function to minimize the distance between successive slices y_i^t and y_{i−1}^t. This recurrence relation can be solved via dynamic programming. Here we use the Google similarity distance (Cilibrasi and Vitányi, 2007) as the distance NGD to approximate the information distance between two entities.
The Google similarity distance between two terms x and y is defined as

$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}},$$

where f(x) is the number of pages containing x, f(x, y) is the number of pages containing both x and y, and N is the total number of indexed pages.
Algorithm 21.2 further explains the procedure we use to bridge the gap between
different labels.
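The recurrence in Algorithm 21.2 can be implemented as a Viterbi-style dynamic program; the sketch below accepts any NGD-like distance as a callable, so any toy distance can stand in for the real Google-frequency estimate.

```python
def relabel_sequence(pseudo_labels, target_labels, ngd):
    """Dynamic program for R(i, j) = min_k [ R(i-1, k) + NGD(c_hat_i, j)
    + NGD(k, j) ]: choose target-domain labels close to the pseudo-labels
    while penalizing jumps between successive time slices. `ngd` is a
    callable returning a distance between two label strings."""
    L = list(target_labels)
    n = len(pseudo_labels)
    R = [{j: ngd(pseudo_labels[0], j) for j in L}]
    back = []
    for i in range(1, n):
        row, ptr = {}, {}
        for j in L:
            best_k = min(L, key=lambda k: R[-1][k] + ngd(k, j))
            row[j] = R[-1][best_k] + ngd(pseudo_labels[i], j) + ngd(best_k, j)
            ptr[j] = best_k
        R.append(row)
        back.append(ptr)
    # Trace back the minimizing label sequence.
    j = min(L, key=lambda j: R[-1][j])
    seq = [j]
    for ptr in reversed(back):
        j = ptr[j]
        seq.append(j)
    return list(reversed(seq))
```

With a toy distance that is 0 for labels sharing a stem and 0.5 otherwise, pseudo-labels ['walk', 'run', 'run'] over target labels ['walking', 'running'] map to ['walking', 'running', 'running'].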
After these two steps, we have a label y_i^t ∈ L_t for each unlabeled sensor reading in the target domain, and then we can apply any machine learning algo-
rithm used for activity recognition such as hidden Markov models (Patterson et al.,
2005) or conditional random fields (Vail et al., 2007) to train an activity recognition
classifier in the target domain.
22
Transfer Learning in Urban Computing
22.1 Introduction
Nowadays, cell phones, vehicles and infrastructures (e.g., traffic cameras and
air-quality monitoring stations) continuously generate a huge amount of data re-
lated to our cities in heterogeneous formats such as GPS points, online posts, road
conditions and weather conditions. This opens a new door for us to know about
the dynamics of our city from different perspectives and facilitates various urban
computing applications for traffic monitoring, society security, urban planning,
health care and so on. The current solutions can help streamline citywide plan-
ning and decision-making in the following ways:
Fine-grained data inference: In many urban monitoring tasks, the obtained data
cannot cover the whole city area. A representative example is air-quality monitor-
ing where the stations for air-quality sensing are only sparsely located in a city.
Then, how to infer a more fine-grained data distribution based on the sparsely
collected data becomes an important research issue.
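As a simple baseline for this kind of spatial inference (not a method from this chapter), inverse-distance weighting estimates the value at an unmonitored location from the sparse stations; a minimal sketch:

```python
def idw_estimate(stations, query, power=2):
    """Inverse-distance-weighted estimate at `query` from sparse stations.

    stations : list of ((x, y), value) pairs from monitoring sites
    query    : (x, y) location without a sensor
    """
    num = den = 0.0
    for (x, y), value in stations:
        d2 = (x - query[0]) ** 2 + (y - query[1]) ** 2
        if d2 == 0:                      # query coincides with a station
            return value
        w = 1.0 / d2 ** (power / 2)      # closer stations weigh more
        num += w * value
        den += w
    return num / den
```

A query point midway between two stations simply averages their readings; real systems replace this geometric prior with learned features.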
Future phenomenon prediction: Another important and hot research area is the
urban-event prediction problem, such as air quality and traffic prediction. Note
that, in traditional statistics, such a problem can often be modeled as a time-series
modeling problem and be solved via statistical models such as the autoregressive
integrated moving average. However, in complex urban computing applications,
in order to obtain a more accurate prediction, usually a more complicated ma-
chine learning model is constructed to take heterogeneous data sources into con-
sideration. For example, in the air-quality prediction, many conditions such as
road maps, weather conditions and traffic conditions can all be used to make the
prediction.
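For intuition, the classical univariate baseline mentioned above can be sketched with an AR(1) model fit by least squares; this toy version ignores the heterogeneous data sources entirely and is illustrative only:

```python
def ar1_fit(series):
    """Least-squares fit of x[t] = a * x[t-1] + b on a 1-D series."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    b = my - a * mx
    return a, b

def ar1_forecast(series, steps, a, b):
    """Roll the fitted AR(1) model forward `steps` time slices."""
    out, last = [], series[-1]
    for _ in range(steps):
        last = a * last + b
        out.append(last)
    return out
```

On a noiseless series generated by x[t] = 0.5·x[t−1] + 1 the fit recovers the coefficients exactly, which is what makes such models attractive when one data source suffices and inadequate when it does not.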
Event detection: Detecting abnormal events is very important in urban comput-
ing applications. For example, under destructive weather conditions such as ty-
phoons and hurricanes, it is critical to detect road obstacles, such as fallen
trees and ponding water, in real time, so that city authorities can restore road
transportation promptly and reduce losses.
Facility deployment: Finding appropriate sites for deploying a new facility such
22.2 “What to Transfer” in Urban Computing 325
fer the information of the missing modality by transferring knowledge from other
modalities, then we may improve the performance of the target application.
Cross-region and cross-city transfer: The auxiliary knowledge source for building
a new application in a city or region is the experience from other cities/regions
where the same (or a similar) application has already been built. In such
scenarios, we also refer to the source city/region as the data-rich one and
the target city/region as the data-scarce one. While the basic idea of
cross-city/region transfer is intuitive, in practice it faces plenty of
difficulties. For example, different cities have distinct development levels, which
often renders a direct transfer useless and may even lead to “negative transfer.”
Cross-application transfer: For a new smart-city application to be developed,
another important knowledge source for transfer learning is an existing, related
application that has already collected a lot of data. For example, suppose
we want to launch a new ridesharing business in a city, but we do not have any data
about the behaviors of ridesharing cars. Then, to implement an urban comput-
ing application related to ridesharing, such as demand-supply prediction, it is
possible to leverage existing data from taxi-related applications.
It is worth noting that these knowledge sources can be combined in a
transfer learning application in urban computing. For example, suppose that we
want to build a reliable and useful connection between different data modalities
within a city for a target application. We can first learn the connection in a source
city where all the data modalities are adequate, and then transfer it to a target city
where some data modalities and the target application data are lacking.
edge needs to be extracted between the two domains. While this part is usually
application specific and no single method is always effective, here we give
some guidelines and suggestions.
enterprise wants to start its business in a new city. Traditionally, operators ad-
dress this problem via questionnaire surveys to understand the needs of citizens,
and detailed investigations to learn the characteristics of candidate locations,
based on which locations for new stores are selected. Obviously, these traditional
methods are too time-consuming, especially as cities now develop rapidly. On
the other hand, if we adopt traditional machine learning technologies to solve
this problem, we face the cold-start issue: there is not enough data about the
chain stores in the target city.
In this situation, we turn to transfer learning for a solution. For example, Guo
et al. (2018a) propose a CityTransfer model to conduct the intercity knowledge
transfer between cities and the intracity knowledge transfer between enterprises.
In the following, we will introduce their problem settings and the CityTransfer
model in detail.
mates r_ij, which denotes the rating of a store of the enterprise h_i in grid g_j, by the
number of reviews in g_j related to h_i.
Table 22.1 Urban characteristics and chain hotel enterprise data (Guo et al., 2018a)

Source                 Beijing      Shanghai
7 Days Inn hotels      160          46
Home Inn hotels        179          156
Hanting Inn hotels     123          147
7 Days Inn reviews     31,215       8,610
Home Inn reviews       35,310       45,146
Hanting Inn reviews    18,195       22,875
POIs                   348,863      444,703
Check-ins              21,222,070   16,928,489
House prices           55,030       50,224
where y_2^s and y_2^t are parameters. The parameters in the autoencoder are estimated
by minimizing the reconstruction error

O_1 = Σ_{i=1}^{m_1} ‖f̂_i^s − f_i^s‖_2^2 + Σ_{i=1}^{m_2} ‖f̂_i^t − f_i^t‖_2^2 .   (22.5)
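The objective in (22.5) is just the summed squared reconstruction error over source-city and target-city instances. A schematic helper, where `reconstruct` stands in for the shared autoencoder's encode-then-decode pass (a hypothetical placeholder, not the CityTransfer code):

```python
def reconstruction_loss(source_feats, target_feats, reconstruct):
    """Summed squared error of Eq. (22.5): ||f_hat - f||^2 over both cities.

    `reconstruct` is a stand-in for the shared autoencoder's
    encode-then-decode pass (hypothetical here).
    """
    def sq_err(f):
        f_hat = reconstruct(f)
        return sum((a - b) ** 2 for a, b in zip(f_hat, f))

    return (sum(sq_err(f) for f in source_feats)
            + sum(sq_err(f) for f in target_feats))
```

A perfect reconstruction drives the loss to zero; in training, this term is minimized jointly with the rating objectives below.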
r̂_ij^s = b_i + e_j^s + u_i^T v_j^s .   (22.7)

Similarly, the rating for enterprise h_i in grid j of the target city is estimated as

r̂_ij^t = b_i + e_j^t + u_i^T v_j^t .   (22.8)
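Equations (22.7) and (22.8) have the standard form of biased matrix factorization: an enterprise bias b_i, a grid bias e_j, and the inner product of latent factors. A one-line sketch:

```python
def predicted_rating(b_i, e_j, u_i, v_j):
    """Rating of enterprise i in grid j: bias terms plus latent dot product."""
    return b_i + e_j + sum(u * v for u, v in zip(u_i, v_j))
```

In CityTransfer the enterprise factors u_i are shared across cities, while the grid factors v_j are city-specific, which is what carries knowledge between the source and target cities.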
quality in a region is impacted by many factors, for example, POIs, traffic, pro-
duction factories and so on. Thus, the level of pollution largely varies with the
location. In a pollution forecasting system, we wish to leverage these multimodal
data to estimate the fine-grained air quality in a city in advance.
More specifically, our task is to classify the air quality into classes such as good,
moderate and unhealthy; that is, we formulate air-quality prediction as a classifi-
cation problem: given the multimodal data in each region during a specific period,
we wish to assign the corresponding air quality to a class. We note that traditional
classification models cannot address this problem ideally for two reasons. First,
the air-quality labels are very scarce because many cities have only a few air-quality
monitoring stations, resulting in a label scarcity issue. Second, there is a data in-
sufficiency issue because multimodal data about some important impacting fac-
tors may be missing in some regions or some periods, or even entirely missing; for
example, the meteorology data in Shanghai may be missing for some hours. Like-
wise, the taxi trajectory data may be unavailable for some regions and some
periods in certain areas of Shanghai.
Therefore, we consider whether it is feasible to transfer knowledge from one city
to another to help predict the air quality. In this section, we explain one such
solution, the FLORAL model (Wei et al., 2016b), which transfers multimodal
data between cities.
parties. Thus, data are increasingly difficult to obtain in many domains. Further-
more, as more areas of our society move to digitalization and datalization, pre-
dictive modeling is in increasing demand. In the “long tail” of application fields,
where the available data dwindle along the tail, only the head areas are bene-
fiting from machine learning and AI. If we cannot provide the “have-nots” with
the benefits of the “haves,” our society will become more polarized.
Transfer learning can be a technical solution to this “small-data challenge.” If
we can take models from data-rich areas and transfer them to data-poor ar-
eas, we can potentially enable the data-poor areas to progress faster toward
an information- and knowledge-based society. Indeed, through many of the ap-
plication examples given in this book, we have seen that transfer learning can ef-
fectively alleviate the small-data problem.
One direction to explore in the future is lifelong machine learning and auto-
mated transfer learning. One source of human intelligence lies in our ability
to quickly and effortlessly adapt to new tasks and new environments. In fact,
humans can not only transfer knowledge to a new domain, but also learn how
to transfer automatically when given new tasks and environments. This is a
wonderful puzzle of nature that cannot be solved by computational means
alone. Neuroscience and experimental neurology can potentially shed light on
the nature of such abilities, and we hope that AI in general, and transfer
learning in particular, can benefit from such insights.
As we witness one of the fundamental AI revolutions in human history,
transfer learning distinguishes itself as a deep research area that inspires new
ideas about the nature of intelligence. In answering Turing’s question
“Can Machines Think?,” we hope to begin to shed light on the question by giving
answers to “How Can Machines Think in New Environments and for New Tasks?”
References
1000 Genomes Project Consortium. 2015. A global reference for human genetic variation.
Nature, 526(7571), 68–74.
Aas, Kjersti. 2001. Microarray Data Mining: A Survey. Tech. report, Norwegian Computing
Center.
Abadi, Martín, Barham, Paul, Chen, Jianmin, et al. 2016a. TensorFlow: A system for large-
scale machine learning. Pages 265–283 of: Keeton, Kimberly, and Roscoe, Timothy
(eds.), Proceedings of the 12th USENIX Symposium on Operating Systems Design and
Implementation.
Abadi, Martín, Chu, Andy, Goodfellow, Ian J., et al. 2016b. Deep learning with differential
privacy. Pages 308–318 of: Proceedings of ACM Conference on Computer and Commu-
nications Security.
Abu-El-Haija, Sami, Kothari, Nisarg, Lee, Joonseok, et al. 2016. YouTube-8M: A large-scale
video classification benchmark. arXiv preprint, arXiv:1609.08675.
Acharya, Ayan, Mooney, Raymond J., and Ghosh, Joydeep. 2014. Active multitask learning
using both latent and supervised shared topics. Pages 190–198 of: Proceedings of the
2014 SIAM International Conference on Data Mining.
Amaldi, Edoardo, and Kann, Viggo. 1998. On the approximability of minimizing nonzero
variables or unsatisfied relations in linear systems. Theoretical Computer Science,
209(1), 237–260.
Ando, Rie Kubota, and Zhang, Tong. 2005. A framework for learning predictive structures
from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6,
1817–1853.
Antony, Joseph, McGuinness, Kevin, O’Connor, Noel E., and Moran, Kieran. 2016. Quanti-
fying radiographic knee osteoarthritis severity using deep convolutional neural net-
works. Pages 1195–1200 of: 23rd International Conference on Pattern Recognition.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. 2006. Multi-task feature
learning. Pages 41–48 of: Advances in Neural Information Processing Systems.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. 2008. Convex multi-
task feature learning. Machine Learning, 73(3), 243–272.
Argyriou, Andreas, Micchelli, Charles A., and Pontil, Massimiliano. 2009. When is there a
representer theorem? Vector versus matrix regularizers. Journal of Machine Learning
Research, 10, 2507–2529.
Argyriou, Andreas, Micchelli, Charles A., and Pontil, Massimiliano. 2010. On spectral learn-
ing. Journal of Machine Learning Research, 11, 935–953.
Arık, Sercan Ö., Chrzanowski, Mike, Coates, Adam, et al. 2017. Deep voice: Real-time neural
text-to-speech. Pages 195–204 of: Proceedings of International Conference on Machine
Learning.
Arjovsky, Martín, and Bottou, Léon. 2017. Towards principled methods for training genera-
tive adversarial networks. CoRR, abs/1701.04862.
Arjovsky, Martín, Chintala, Soumith, and Bottou, Léon. 2017. Wasserstein generative adver-
sarial networks. Pages 214–223 of: Proceedings of the 34th International Conference on
Machine Learning.
Ashley, Kevin D. 1991. Reasoning with cases and hypotheticals in HYPO. International Jour-
nal of Man-Machine Studies, 34(6), 753–796.
Augenstein, Isabelle, and Søgaard, Anders. 2017. Multi-task learning of keyphrase bound-
ary classification. Pages 341–346 of: Proceedings of the 55th Annual Meeting of the
Association for Computational Linguistics.
Aytar, Yusuf, and Zisserman, Andrew. 2011. Tabula rasa: Model transfer for object category
detection. Pages 2252–2259 of: Proceedings of IEEE International Conference on Com-
puter Vision.
Azar, Mohammad Gheshlaghi, Lazaric, Alessandro, and Brunskill, Emma. 2013. Sequential
transfer in multi-armed bandit with finite set of models. Pages 2220–2228 of: Advances
in Neural Information Processing Systems.
Bächlin, Marc, Roggen, Daniel, Tröster, Gerhard, et al. 2009. Potentials of enhanced context
awareness in wearable assistants for Parkinson’s disease patients with the freezing of
gait syndrome. Pages 123–130 of: Proceedings of the 13th IEEE International Sympo-
sium on Wearable Computers.
Bahdanau, Dzmitry, Cho, Kyunghyun, and Bengio, Yoshua. 2014. Neural machine transla-
tion by jointly learning to align and translate. CoRR, abs/1409.0473.
Bakas, Spyridon, Akbari, Hamed, Sotiras, Aristeidis, et al. 2017. Advancing the cancer
genome atlas glioma MRI collections with expert segmentation labels and radiomic
features. Scientific Data, 4, 170117.
Bakker, Bart, and Heskes, Tom. 2003. Task clustering and gating for Bayesian multitask
learning. Journal of Machine Learning Research, 4, 83–99.
Baktashmotlagh, Mahsa, Harandi, Mehrtash T., Lovell, Brian C., and Salzmann, Math-
ieu. 2013. Unsupervised domain adaptation by domain invariant projection.
Pages 769–776 of: Proceedings of IEEE International Conference on Computer Vision.
Baktashmotlagh, Mahsa, Harandi, Mehrtash T., Lovell, Brian C., and Salzmann, Mathieu.
2014. Domain adaptation on the statistical manifold. Pages 2481–2488 of: Proceedings
of IEEE Conference on Computer Vision and Pattern Recognition.
Balcan, Maria-Florina, Blum, Avrim, and Vempala, Santosh. 2015. Efficient representations
for lifelong learning and autoencoding. Pages 191–210 of: Proceedings of the 28th Con-
ference on Learning Theory.
Balikas, Georgios, Moura, Simon, and Amini, Massih-Reza. 2017. Multitask learning for
fine-grained Twitter sentiment analysis. Pages 1005–1008 of: Proceedings of the 40th
International ACM SIGIR Conference on Research and Development in Information
Retrieval.
Bao, Ling, and Intille, Stephen S. 2004. Activity recognition from user-annotated accelera-
tion data. Pages 1–17 of: Proceedings of the Second International Conference on Perva-
sive Computing.
Barreto, André, Dabney, Will, Munos, Rémi, et al. 2017. Successor features for transfer in re-
inforcement learning. Pages 4058–4068 of: Advances in Neural Information Processing
Systems.
Bartlett, Peter L., and Mendelson, Shahar. 2002. Rademacher and Gaussian complexities:
Risk bounds and structural results. Journal of Machine Learning Research, 3, 463–482.
Barzilai, Aviad, and Crammer, Koby. 2015. Convex multi-task learning by clustering.
Pages 65–73 of: Proceedings of the 18th International Conference on Artificial Intelli-
gence and Statistics.
Bassily, Raef, Smith, Adam D., and Thakurta, Abhradeep. 2014. Private empirical risk min-
imization: Efficient algorithms and tight error bounds. Pages 464–473 of: Proceedings
of IEEE Annual Symposium on Foundations of Computer Science.
Baxter, Jonathan. 2000. A model of inductive bias learning. Journal of Artifical Intelligence
Research, 12, 149–198.
Bay, Herbert, Ess, Andreas, Tuytelaars, Tinne, and Van Gool, Luc. 2008. Speeded-up robust
features (SURF). Computer Vision and Image Understanding, 110(3), 346–359.
Bello, Irwan, Zoph, Barret, Vasudevan, Vijay, and Le, Quoc V. 2017. Neural optimizer search
with reinforcement learning. Pages 459–468 of: Proceedings of the 34th International
Conference on Machine Learning.
Belmont, John M., Butterfield, Earl C., and Ferretti, Ralph P. 1982. To secure transfer of train-
ing instruct self-management skills. Pages 147–154 of: Detterman, Douglas K., and
Sternberg, Robert J. (eds.), How and How Much Can Intelligence Be Increased. Ablex
Publishing Corporation.
Ben-David, Shai, and Borbely, Reba Schuller. 2008. A notion of task relatedness yielding
provable multiple-task learning guarantees. Machine Learning, 73(3), 273–287.
Ben-David, Shai, and Schuller, Reba. 2003. Exploiting task relatedness for multiple task
learning. Pages 567–580 of: Proceedings of the 16th Annual Conference on Computa-
tional Learning Theory.
Ben-David, Shai, Gehrke, Johannes, and Schuller, Reba. 2002. A theoretical framework for
learning from a pool of disparate data sources. Pages 443–449 of: Proceedings of the 8th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Ben-David, Shai, Blitzer, John, Crammer, Koby, and Pereira, Fernando. 2006. Analysis of
representations for domain adaptation. Pages 137–144 of: Advances in Neural Infor-
mation Processing Systems.
Ben-David, Shai, Blitzer, John, Crammer, Koby, et al. 2010. A theory of learning from different
domains. Machine Learning, 79(1–2), 151–175.
Bengio, Yoshua. 2009. Learning deep architectures for AI. Foundations and Trends in Ma-
chine Learning, 2(1), 1–127.
Bengio, Yoshua. 2012. Deep learning of representations for unsupervised and transfer
learning. Pages 17–36 of: Proceedings of ICML Workshop on Unsupervised and Transfer
Learning.
Bengio, Yoshua, Lamblin, Pascal, Popovici, Dan, and Larochelle, Hugo. 2007. Greedy layer-
wise training of deep networks. Pages 153–160 of: Advances in Neural Information Pro-
cessing Systems.
Bengio, Yoshua, Courville, Aaron, and Vincent, Pascal. 2013. Representation learning: A re-
view and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intel-
ligence, 35(8), 1798–1828.
Bi, Jinbo, Xiong, Tao, Yu, Shipeng, Dundar, Murat, and Rao, R. Bharat. 2008. An improved
multi-task learning approach with applications in medical diagnosis. Pages 117–132
of: Proceedings of European Conference on Machine Learning and Practice of Knowl-
edge Discovery in Databases.
Bickel, Steffen, Brückner, Michael, and Scheffer, Tobias. 2007. Discriminative learning for
differing training and test distributions. Pages 81–88 of: Proceedings of the 24th Inter-
national Conference on Machine Learning.
Bickel, Steffen, Bogojeska, Jasmina, Lengauer, Thomas, and Scheffer, Tobias. 2008. Multi-
task learning for HIV therapy screening. Pages 56–63 of: Proceedings of the Twenty-
Fifth International Conference on Machine Learning.
Biermann, Alan W., and Long, Philip M. 1996. The composition of messages in speech-
graphics interactive systems. Pages 97–100 of: Proceedings of the 1996 International
Symposium on Spoken Dialogue.
Blitzer, John, McDonald, Ryan, and Pereira, Fernando. 2006. Domain adaptation with struc-
tural correspondence learning. Pages 120–128 of: Proceedings of the 2006 Conference
on Empirical Methods in Natural Language Processing.
Blitzer, John, Crammer, Koby, Kulesza, Alex, Pereira, Fernando, and Wortman, Jennifer.
2007a. Learning bounds for domain adaptation. Pages 129–136 of: Advances in Neu-
ral Information Processing Systems.
Blitzer, John, Dredze, Mark, and Pereira, Fernando. 2007b. Biographies, bollywood, boom-
boxes and blenders: Domain adaptation for sentiment classification. Pages 440–447
of: Proceedings of the 45th Annual Meeting of the Association for Computational
Linguistics.
Blum, Avrim, and Mitchell, Tom M. 1998. Combining labeled and unlabeled data with co-
training. Pages 92–100 of: Bartlett, Peter L., and Mansour, Yishay (eds.), Proceedings of
the Eleventh Annual Conference on Computational Learning Theory.
Bollegala, Danushka, Maehara, Takanori, and Kawarabayashi, Ken-ichi. 2015. Unsuper-
vised cross-domain word representation learning. Pages 730–740 of: Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics.
Bonilla, Edwin V., Chai, Kian Ming Adam, and Williams, Christopher K. I. 2007. Multi-task
Gaussian process prediction. Pages 153–160 of: Advances in Neural Information Pro-
cessing Systems 20.
Bou-Ammar, Haitham, Tuyls, Karl, Taylor, Matthew E., Driessens, Kurt, and Weiss, Gerhard.
2012. Reinforcement learning transfer via sparse coding. Pages 383–390 of: Proceedings
of International Conference on Autonomous Agents and Multiagent Systems.
Bou-Ammar, Haitham, Eaton, Eric, Ruvolo, Paul, and Taylor, Matthew E. 2014. Online
multi-task learning for policy gradient methods. Pages 1206–1214 of: Proceedings of
the 31th International Conference on Machine Learning.
Bou-Ammar, Haitham, Eaton, Eric, Ruvolo, Paul, and Taylor, Matthew E. 2015. Unsuper-
vised cross-domain transfer in policy gradient reinforcement learning via manifold
alignment. Pages 2504–2510 of: Proceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence.
Bousmalis, Konstantinos, Trigeorgis, George, Silberman, Nathan, Krishnan, Dilip, and Er-
han, Dumitru. 2016. Domain separation networks. Pages 343–351 of: Advances in Neu-
ral Information Processing Systems.
Bousquet, Olivier, and Elisseeff, André. 2002. Stability and generalization. Journal of Ma-
chine Learning Research, 2, 499–526.
Braud, Chloé, Lacroix, Ophélie, and Søgaard, Anders. 2017. Cross-lingual and cross-domain
discourse segmentation of entire documents. Pages 237–243 of: Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics.
Bromley, Jane, Guyon, Isabelle, LeCun, Yann, Säckinger, Eduard, and Shah, Roopak. 1993.
Signature verification using a Siamese time delay neural network. Pages 737–744 of:
Advances in Neural Information Processing Systems.
Brosch, Tom, and Tam, Roger C. 2013. Manifold learning of brain MRIs by deep learning.
Pages 633–640 of: Proceedings of the 16th International Conference on Medical Image
Computing and Computer-Assisted Intervention.
Brunskill, Emma, and Li, Lihong. 2013. Sample complexity of multi-task reinforcement
learning. In: Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial
Intelligence.
Bruzzone, Lorenzo, and Marconcini, Mattia. 2010. Domain adaptation problems: A DASVM
classification technique and a circular validation strategy. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence, 32(5), 770–787.
Bryant, Peter E., and Trabasso, Thomas. 1971. Transitive inferences and memory in young
children. Nature, 232, 456–458.
Bulling, Andreas, and Roggen, Daniel. 2011. Recognition of visual memory recall processes
using eye movement analysis. Pages 455–464 of: Proceedings of the 13th International
Conference on Ubiquitous Computing.
Bulling, Andreas, Ward, Jamie A., Gellersen, Hans, and Tröster, Gerhard. 2008. Ro-
bust recognition of reading activity in transit using wearable electrooculography.
Pages 19–37 of: Proceedings of the 6th International Conference on Pervasive
Computing.
Bulling, Andreas, Blanke, Ulf, and Schiele, Bernt. 2014. A tutorial on human activity recog-
nition using body-worn inertial sensors. ACM Computing Surveys, 46(3), 33:1–33:33.
Calandriello, Daniele, Lazaric, Alessandro, and Restelli, Marcello. 2014. Sparse multi-task
reinforcement learning. Pages 819–827 of: Advances in Neural Information Processing
Systems.
Cao, Qiong, Ying, Yiming, and Li, Peng. 2013. Similarity metric learning for face recognition.
Pages 2408–2415 of: Proceedings of IEEE International Conference on Computer Vision.
Cao, Zhangjie, Long, Mingsheng, Wang, Jianmin, and Jordan, Michael I. 2017. Partial trans-
fer learning with selective adversarial networks. CoRR, abs/1707.07901.
Carbonell, Jaime G. 1981. A computational model of analogical problem solving.
Pages 147–152 of: Proceedings of the 7th International Joint Conference on Artificial
Intelligence.
Carbonell, Jaime G., Etzioni, Oren, Gil, Yolanda, et al. 1991. PRODIGY: An integrated archi-
tecture for planning and learning. SIGART Bulletin, 2(4), 51–55.
Carlson, Andrew, Betteridge, Justin, Kisiel, Bryan, et al. 2010. Toward an architecture for
never-ending language learning. In: Proceedings of the 24th AAAI Conference on Artifi-
cial Intelligence.
Caruana, Rich. 1997. Multitask learning. Machine Learning, 28(1), 41–75.
Casanueva, Inigo, Hain, Thomas, Christensen, Heidi, Marxer, Ricard, and Green, Phil. 2015.
Knowledge transfer between speakers for personalised dialogue management. Pages
12–21 of: Proceedings of the 16th Annual Meeting of the Special Interest Group on Dis-
course and Dialogue.
Castrejon, Lluis, Aytar, Yusuf, Vondrick, Carl, Pirsiavash, Hamed, and Torralba, Anto-
nio. 2016. Learning aligned cross-modal representations from weakly aligned data.
Pages 2940–2949 of: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition.
Cavallanti, Giovanni, Cesa-Bianchi, Nicolò, and Gentile, Claudio. 2010. Linear algo-
rithms for online multitask classification. Journal of Machine Learning Research, 11,
2901–2934.
Chaudhuri, Kamalika, Monteleoni, Claire, and Sarwate, Anand D. 2011. Differentially
private empirical risk minimization. Journal of Machine Learning Research, 12,
1069–1109.
Chavarriaga, Ricardo, Sagha, Hesam, Calatroni, Alberto, et al. 2013. The opportunity chal-
lenge: A benchmark database for on-body sensor-based activity recognition. Pattern
Recognition Letters, 34(15), 2033–2042.
Chen, Austin H., and Huang, Zone-Wei. 2010. A new multi-task learning technique to pre-
dict classification of leukemia and prostate cancer. Pages 11–20 of: Proceedings of the
Second International Conference on Medical Biometrics.
Chen, Jianhui, Tang, Lei, Liu, Jun, and Ye, Jieping. 2009. A convex formulation for learning
shared structures from multiple tasks. Pages 137–144 of: Proceedings of the 26th Inter-
national Conference on Machine Learning.
Chen, Jianhui, Liu, Ji, and Ye, Jieping. 2010a. Learning incoherent sparse and low-rank pat-
terns from multiple tasks. Pages 1179–1188 of: Proceedings of the 16th ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining.
Chen, Jianhui, Zhou, Jiayu, and Ye, Jieping. 2011. Integrating low-rank and group-sparse
structures for robust multi-task learning. Pages 42–50 of: Proceedings of the 17th ACM
SIGKDD International Conference on Knowledge Discovery and Data Mining.
Chen, Minmin, Xu, Zhixiang, Sha, Fei, and Weinberger, Kilian Q. 2012a. Marginalized de-
noising autoencoders for domain adaptation. Pages 767–774 of: Proceedings of the 29th
International Conference on Machine Learning.
Chen, Minmin, Xu, Zhixiang, Weinberger, Kilian Q., and Sha, Fei. 2012b. Marginalized
stacked denoising autoencoders. In: Proceedings of the Learning Workshop.
Chen, Wei-Yu, Hsu, Tzu-Ming Harry, Tsai, Yao-Hung Hubert, Wang, Yu-Chiang Frank, and
Chen, Ming-Syan. 2016a. Transfer neural trees for heterogeneous domain adaptation.
Pages 399–414 of: Proceedings of European Conference on Computer Vision.
Chen, Xi, Duan, Yan, Houthooft, Rein, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter.
2016b. InfoGAN: Interpretable representation learning by information maximizing
generative adversarial nets. Pages 2172–2180 of: Advances in Neural Information Pro-
cessing Systems.
Chen, Yuqiang, Jin, Ou, Xue, Gui-Rong, Chen, Jia, and Yang, Qiang. 2010b. Visual contextual
advertising: Bringing textual advertisements to images. In: Proceedings of 24th AAAI
Conference on Artificial Intelligence.
Chen, Zhiyuan, and Liu, Bing. 2016. Lifelong Machine Learning. Morgan & Claypool.
Chen, Zhiyuan, Ma, Nianzu, and Liu, Bing. 2015. Lifelong learning for sentiment classifica-
tion. Pages 750–756 of: Proceedings of the 53rd Annual Meeting of the Association for
Computational Linguistics.
Cheng, Heng-Tze, Koc, Levent, Harmsen, Jeremiah, et al. 2016. Wide & deep learning for
recommender systems. Pages 7–10 of: Proceedings of the 1st Workshop on Deep Learn-
ing for Recommender Systems.
Choi, Eunsol, Hewlett, Daniel, Uszkoreit, Jakob, et al. 2017. Coarse-to-fine question answer-
ing for long documents. Pages 209–220 of: Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics.
Chomsky, Noam. 1956. Three models for the description of language. IRE Transactions on
Information Theory, 2(3), 113–124.
Cilibrasi, Rudi, and Vitányi, Paul M. B. 2007. The Google similarity distance. IEEE Transac-
tions on Knowledge and Data Engineering, 19(3), 370–383.
Collobert, Ronan, and Weston, Jason. 2008. A unified architecture for natural language pro-
cessing: Deep neural networks with multitask learning. Pages 160–167 of: Proceedings
of the 25th International Conference on Machine Learning.
Conneau, Alexis, Kiela, Douwe, Schwenk, Holger, Barrault, Loïc, and Bordes, Antoine. 2017.
Supervised learning of universal sentence representations from natural language in-
ference data. Pages 670–680 of: Proceedings of the 2017 Conference on Empirical Meth-
ods in Natural Language Processing.
Cortes, Corinna, Mansour, Yishay, and Mohri, Mehryar. 2010. Learning bounds for impor-
tance weighting. Pages 442–450 of: Advances in Neural Information Processing Systems.
Cortes, Corinna, Mohri, Mehryar, and Medina, Andres Muñoz. 2015. Adaptation algorithm
and theory based on generalized discrepancy. Pages 169–178 of: Proceedings of the 21th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Covington, Paul, Adams, Jay, and Sargin, Emre. 2016. Deep neural networks for YouTube
recommendations. Pages 191–198 of: Proceedings of the 10th ACM Conference on Rec-
ommender Systems.
Crammer, Koby, and Mansour, Yishay. 2012. Learning multiple tasks using shared hypothe-
ses. Pages 1484–1492 of: Advances in Neural Information Processing Systems.
Cree, Viviene, and Macaulay, Cathlin. 2000. Transfer of Learning in Professional and Vocational
Education. Routledge.
Csurka, Gabriela. 2017. Domain adaptation for visual applications: A comprehensive sur-
vey. CoRR, abs/1702.05374.
da Silva, Bruno Castro, Konidaris, George, and Barto, Andrew G. 2012. Learning parameter-
ized skills. In: Proceedings of the 29th International Conference on Machine Learning.
Dahlmeier, Daniel, and Ng, Hwee Tou. 2010. Domain adaptation for semantic role labeling
in the biomedical domain. Bioinformatics, 26(8), 1098–1104.
Dai, Wenyuan, Xue, Gui-Rong, Yang, Qiang, and Yu, Yong. 2007a. Transferring naive Bayes
classifiers for text classification. Pages 540–545 of: Proceedings of the Twenty-Second
AAAI Conference on Artificial Intelligence.
Dai, Wenyuan, Yang, Qiang, Xue, Gui-Rong, and Yu, Yong. 2007b. Boosting for transfer
learning. Pages 193–200 of: Proceedings of the 24th International Conference on Ma-
chine Learning.
Dai, Wenyuan, Chen, Yuqiang, Xue, Gui-Rong, Yang, Qiang, and Yu, Yong. 2008. Translated
learning: Transfer learning across different feature spaces. Pages 353–360 of: Advances
in Neural Information Processing Systems.
Das, Abhinandan S., Datar, Mayur, Garg, Ashutosh, and Rajaram, Shyam. 2007. Google
news personalization: Scalable online collaborative filtering. Pages 271–280 of: Pro-
ceedings of the 16th International Conference on World Wide Web.
Daumé III, Hal. 2007. Frustratingly easy domain adaptation. Pages 256–263 of: Proceedings
of the 45th Annual Meeting of the Association for Computational Linguistics.
Davis, Jesse, and Domingos, Pedro. 2009. Deep transfer via second-order Markov logic.
Pages 217–224 of: Proceedings of the 26th International Conference on Machine
Learning.
Dekel, Ofer, Long, Philip M., and Singer, Yoram. 2006. Online multitask learning.
Pages 453–467 of: Proceedings of the 19th Annual Conference on Learning Theory.
Dekel, Ofer, Long, Philip M., and Singer, Yoram. 2007. Online learning of multiple tasks with
a shared loss. Journal of Machine Learning Research, 8, 2233–2264.
Dempster, A. P., Laird, N. M., and Rubin, D. B. 1977. Maximum likelihood from incomplete
data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.
Deng, Wan-Yu, Zheng, Qing-Hua, and Wang, Zhong-Min. 2014. Cross-person activity
recognition using reduced kernel extreme learning machine. Neural Networks, 53, 1–7.
Denton, Emily L., Chintala, Soumith, Fergus, Rob, et al. 2015. Deep generative image mod-
els using a Laplacian pyramid of adversarial networks. Pages 1486–1494 of: Advances
in Neural Information Processing Systems.
Devin, Coline, Gupta, Abhishek, Darrell, Trevor, Abbeel, Pieter, and Levine, Sergey. 2017.
Learning modular neural network policies for multi-task and multi-robot transfer.
Pages 2169–2176 of: Proceedings of IEEE International Conference on Robotics and
Automation.
References 343
Devlin, Jacob, Chang, Ming-Wei, Lee, Kenton, and Toutanova, Kristina. 2018. BERT:
Pre-training of deep bidirectional transformers for language understanding. CoRR,
abs/1810.04805.
Dietterich, Thomas G., Lathrop, Richard H., and Lozano-Pérez, Tomás. 1997. Solving
the multiple instance problem with axis-parallel rectangles. Artificial Intelligence,
89(1–2), 31–71.
Donahue, Jeff, Hoffman, Judy, Rodner, Erik, Saenko, Kate, and Darrell, Trevor. 2013. Semi-
supervised domain adaptation with instance constraints. Pages 668–675 of: Proceed-
ings of IEEE Conference on Computer Vision and Pattern Recognition.
Donahue, Jeff, Jia, Yangqing, Vinyals, Oriol, et al. 2014. DeCAF: A deep convolutional acti-
vation feature for generic visual recognition. Pages 647–655 of: Proceedings of the 31st
International Conference on Machine Learning.
Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. 2016. Adversarial feature learning.
CoRR, abs/1605.09782.
Dong, Daxiang, Wu, Hua, He, Wei, Yu, Dianhai, and Wang, Haifeng. 2015. Multi-task learn-
ing for multiple language translation. Pages 1723–1732 of: Proceedings of the 53rd An-
nual Meeting of the Association for Computational Linguistics and the 7th Interna-
tional Joint Conference on Natural Language Processing.
Dönnes, Pierre, and Elofsson, Arne. 2002. Prediction of MHC Class I binding peptides, using
SVMHC. BMC Bioinformatics, 3, 25.
Dou, Qi, Ouyang, Cheng, Chen, Cheng, Chen, Hao, and Heng, Pheng-Ann. 2018. Unsuper-
vised cross-modality domain adaptation of ConvNets for biomedical image segmen-
tations with adversarial loss. Pages 691–697 of: Proceedings of the Twenty-Seventh In-
ternational Joint Conference on Artificial Intelligence.
Drummond, Chris. 2002. Accelerating reinforcement learning by composing solutions
of automatically identified subtasks. Journal of Artificial Intelligence Research, 16,
59–104.
Duan, Lixin, Tsang, Ivor W., Xu, Dong, and Maybank, Stephen J. 2009. Domain transfer SVM
for video concept detection. Pages 1375–1381 of: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition.
Duan, Lixin, Tsang, Ivor W., and Xu, Dong. 2012a. Domain transfer multiple kernel learning.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(3), 465–479.
Duan, Lixin, Xu, Dong, and Tsang, Ivor W. 2012b. Learning with augmented features for het-
erogeneous domain adaptation. Pages 711–718 of: Proceedings of International Con-
ference on Machine Learning.
Duan, Lixin, Xu, Dong, Tsang, Ivor Wai-Hung, and Luo, Jiebo. 2012c. Visual event recogni-
tion in videos by learning from web data. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(9), 1667–1680.
Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, et al. 2016. Adversarially learned infer-
ence. CoRR, abs/1606.00704.
Duong, Long, Cohn, Trevor, Bird, Steven, and Cook, Paul. 2015. Low resource dependency
parsing: Cross-lingual parameter sharing in a neural network parser. Pages 845–850 of:
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguis-
tics and the 7th International Joint Conference on Natural Language Processing.
Dwork, Cynthia. 2008. Differential privacy: A survey of results. Pages 1–19 of: Proceedings of
the 5th Annual Conference on Theory and Applications of Models of Computation.
Dwork, Cynthia, and Roth, Aaron. 2014. The algorithmic foundations of differential privacy.
Foundations and Trends in Theoretical Computer Science, 9(3–4), 211–407.
Dwork, Cynthia, Kenthapadi, Krishnaram, McSherry, Frank, Mironov, Ilya, and Naor, Moni.
2006a. Our data, ourselves: Privacy via distributed noise generation. Pages 486–503 of:
Proceedings of the 25th Annual International Conference on the Theory and Applica-
tions of Cryptographic Techniques.
Dwork, Cynthia, McSherry, Frank, Nissim, Kobbi, and Smith, Adam D. 2006b. Calibrating
noise to sensitivity in private data analysis. Pages 265–284 of: Proceedings of the 3rd
Theory of Cryptography Conference.
Ellis, Henry Carlton. 1965. The Transfer of Learning. MacMillan.
Elman, Jeffrey L. 1993. Learning and development in neural networks: The importance of
starting small. Cognition, 48(1), 71–99.
Emekçi, Fatih, Sahin, Ozgur D., Agrawal, Divyakant, and El Abbadi, Amr. 2007. Privacy pre-
serving decision tree learning over multiple parties. Data and Knowledge Engineering,
63(2), 348–361.
Esteva, Andre, Kuprel, Brett, Novoa, Roberto A., et al. 2017. Dermatologist-level classifica-
tion of skin cancer with deep neural networks. Nature, 542(7639), 115–118.
Argyriou, Andreas, Evgeniou, Theodoros, and Pontil, Massimiliano. 2007. Multi-task feature
learning. Pages 41–48 of: Advances in Neural Information Processing Systems 19.
Evgeniou, Theodoros, and Pontil, Massimiliano. 2004. Regularized multi-task learning.
Pages 109–117 of: Proceedings of the 10th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Evgeniou, Theodoros, Micchelli, Charles A., and Pontil, Massimiliano. 2005. Learning mul-
tiple tasks with kernel methods. Journal of Machine Learning Research, 6, 615–637.
Fan, Jianqing, and Li, Runze. 2006. Statistical challenges with high dimensionality: Feature
selection in knowledge discovery. CoRR, abs/math/0602133.
Fan, Xing, Monti, Emilio, Mathias, Lambert, and Dreyer, Markus. 2017. Transfer learning
for neural semantic parsing. Pages 48–56 of: Proceedings of the 2nd Workshop on Rep-
resentation Learning for NLP.
Fang, Meng, and Cohn, Trevor. 2017. Model transfer for tagging low-resource languages
using a bilingual dictionary. Pages 587–593 of: Proceedings of the 55th Annual Meeting
of the Association for Computational Linguistics.
Fang, Meng, and Tao, Dacheng. 2015. Active multi-task learning via bandits. Pages 505–513
of: Proceedings of the 2015 SIAM International Conference on Data Mining.
Fang, Meng, Yin, Jie, and Zhu, Xingquan. 2013. Transfer learning across networks for col-
lective classification. Pages 161–170 of: Proceedings of IEEE International Conference
on Data Mining.
Fang, Meng, Yin, Jie, Zhu, Xingquan, and Zhang, Chengqi. 2015. TrGraph: Cross-network
transfer learning via common signature subgraphs. IEEE Transactions on Knowledge
and Data Engineering, 27(9), 2536–2549.
Ferguson, Kimberly, and Mahadevan, Sridhar. 2006. Proto-transfer learning in Markov de-
cision processes using spectral methods. Proceedings of ICML Workshop on Transfer
Learning.
Ferns, Norm, Panangaden, Prakash, and Precup, Doina. 2004. Metrics for finite Markov de-
cision processes. Pages 162–169 of: Proceedings of the 20th Conference in Uncertainty
in Artificial Intelligence.
Feurer, Matthias, Klein, Aaron, Eggensperger, Katharina, et al. 2015. Efficient and robust
automated machine learning. Pages 2962–2970 of: Advances in Neural Information
Processing Systems 28.
Firat, Orhan, Sankaran, Baskaran, Al-Onaizan, Yaser, et al. 2016. Zero-resource translation
with multi-lingual neural machine translation. Pages 268–277 of: Proceedings of the
2016 Conference on Empirical Methods in Natural Language Processing.
Fong, Pui Kuen, and Weber-Jahnke, Jens H. 2012. Privacy preserving decision tree learn-
ing using unrealized data sets. IEEE Transactions on Knowledge and Data Engineering,
24(2), 353–364.
Forbus, Kenneth D., Gentner, Dedre, Markman, Arthur B., and Ferguson, Ronald W. 1998.
Analogy just looks like high level perception: Why a domain-general approach to ana-
logical mapping is right. Journal of Experimental and Theoretical Artificial Intelligence,
10(2), 231–257.
Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. 2001. The Elements of Statistical
Learning. Springer.
Frome, Andrea, Corrado, Gregory S., Shlens, Jonathon, et al. 2013. DeViSE: A deep visual-
semantic embedding model. Pages 2121–2129 of: Advances in Neural Information
Processing Systems.
Ganguly, Soumyajit, and Pudi, Vikram. 2017. Paper2vec: Combining graph and text infor-
mation for scientific paper representation. Pages 383–395 of: Proceedings of European
Conference on Information Retrieval.
Ganin, Yaroslav, and Lempitsky, Victor. 2015. Unsupervised domain adaptation by back-
propagation. Pages 1180–1189 of: Proceedings of the 32nd International Conference on
Machine Learning.
Ganin, Yaroslav, Ustinova, Evgeniya, Ajakan, Hana, et al. 2016. Domain-adversarial training
of neural networks. Journal of Machine Learning Research, 17(59), 1–35.
Gao, Chen, Chen, Xiangning, Feng, Fuli, et al. 2019. Cross-domain recommendation with-
out sharing user-relevant data. Pages 491–502 of: Proceedings of the 2019 World Wide
Web Conference on World Wide Web.
Gao, Sheng, Luo, Hao, Chen, Da, et al. 2013. Cross-domain recommendation via cluster-
level latent factor model. Pages 161–176 of: Proceedings of the European Conference on
Machine Learning and Practice of Knowledge Discovery in Databases.
Gašić, M., Kim, Dongho, Tsiakoulis, Pirros, and Young, Steve. 2015a. Distributed dialogue
policies for multi-domain statistical dialogue management. Pages 5371–5375 of: Pro-
ceedings of IEEE International Conference on Acoustics, Speech and Signal Processing.
Gašić, M., Mrkšić, N., Barahona, L. Rojas, et al. 2015b. Multi-agent learning in multi-domain
spoken dialogue systems. In: Proceedings of NIPS workshop on Spoken Language Un-
derstanding and Interaction.
Gašić, Milica, Breslin, Catherine, Henderson, Matthew, et al. 2013. POMDP-based dialogue
manager adaptation to extended domains. In: Proceedings of the 14th Annual Meeting
of the Special Interest Group on Discourse and Dialogue.
Gašić, Milica, Kim, Dongho, Tsiakoulis, Pirros, et al. 2014. Incremental on-line adaptation
of POMDP-based dialogue managers to extended domains. Pages 140–144 of: Pro-
ceedings of the 15th Annual Conference of the International Speech Communication
Association.
Gašić, Milica, Mrkšić, Nikola, Su, Pei-hao, et al. 2015c. Policy committee for adaptation in
multi-domain spoken dialogue systems. Pages 806–812 of: Proceedings of 2015 IEEE
Workshop on Automatic Speech Recognition and Understanding.
Gatys, Leon A., Ecker, Alexander S., and Bethge, Matthias. 2016. Image style transfer using
convolutional neural networks. Pages 2414–2423 of: Proceedings of IEEE Conference on
Computer Vision and Pattern Recognition.
Genevay, Aude, and Laroche, Romain. 2016. Transfer learning for user adaptation in spoken
dialogue systems. Pages 975–983 of: Proceedings of the 2016 International Conference
on Autonomous Agents and Multiagent Systems.
Germain, Pascal, Habrard, Amaury, Laviolette, François, and Morvant, Emilie. 2013. A
PAC-Bayesian approach for domain adaptation with specialization to linear classi-
fiers. Pages 738–746 of: Proceedings of the 30th International Conference on Machine
Learning.
Getoor, Lise, and Taskar, Ben. 2007. Introduction to Statistical Relational Learning. MIT
Press.
Ghifary, Muhammad, Bastiaan Kleijn, W., Zhang, Mengjie, and Balduzzi, David.
2015. Domain generalization for object recognition with multi-task autoencoders.
Pages 2551–2559 of: Proceedings of the IEEE International Conference on Computer
Vision.
Ghifary, Muhammad, Kleijn, W. Bastiaan, Zhang, Mengjie, Balduzzi, David, and Li, Wen.
2016. Deep reconstruction-classification networks for unsupervised domain adapta-
tion. Pages 597–613 of: Proceedings of European Conference on Computer Vision.
Gillick, Dan, Brunk, Cliff, Vinyals, Oriol, and Subramanya, Amarnag. 2016. Multilingual lan-
guage processing from bytes. Pages 1296–1306 of: Proceedings of the 2016 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies.
Giorgi, John M., and Bader, Gary. 2018. Transfer learning for biomedical named entity
recognition with neural networks. Bioinformatics, 34(23), 4087–4094.
Girshick, Ross, Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. 2014. Rich feature hier-
archies for accurate object detection and semantic segmentation. Pages 580–587 of:
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Glorot, Xavier, and Bengio, Yoshua. 2010. Understanding the difficulty of training deep
feedforward neural networks. Pages 249–256 of: Proceedings of International Confer-
ence on Artificial Intelligence and Statistics.
Glorot, Xavier, Bordes, Antoine, and Bengio, Yoshua. 2011. Domain adaptation for large-
scale sentiment classification: A deep learning approach. Pages 513–520 of: Proceed-
ings of the 28th International Conference on Machine Learning.
Gong, Boqing, Shi, Yuan, Sha, Fei, and Grauman, Kristen. 2012a. Geodesic flow kernel for
unsupervised domain adaptation. Pages 2066–2073 of: Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition.
Gong, Pinghua, Ye, Jieping, and Zhang, Changshui. 2012b. Robust multi-task feature learn-
ing. Pages 895–903 of: Proceedings of the 18th ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining.
Gong, Pinghua, Ye, Jieping, and Zhang, Changshui. 2013. Multi-stage multi-task feature
learning. Journal of Machine Learning Research, 14, 2979–3010.
Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, et al. 2014. Generative adversarial
nets. Pages 2672–2680 of: Advances in Neural Information Processing Systems.
Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. 2011. Domain adaptation for
object recognition: An unsupervised approach. Pages 999–1006 of: Proceedings of IEEE
International Conference on Computer Vision.
Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. 2014. Unsupervised adaptation
across domain shifts by generating intermediate data representations. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 36(11), 2288–2302.
Görnitz, Nico, Widmer, Christian, Zeller, Georg, et al. 2011. Hierarchical multitask struc-
tured output learning for large-scale sequence segmentation. Pages 2690–2698 of: Ad-
vances in Neural Information Processing Systems.
Gouws, Stephan, Bengio, Yoshua, and Corrado, Greg. 2015. BilBOWA: Fast bilingual dis-
tributed representations without word alignments. Pages 748–756 of: Proceedings of
the 32nd International Conference on Machine Learning.
Gretton, Arthur, Bousquet, Olivier, Smola, Alex, and Schölkopf, Bernhard. 2005. Measur-
ing statistical dependence with Hilbert-Schmidt norms. Pages 63–77 of: Proceedings of
International Conference on Algorithmic Learning Theory.
Gretton, Arthur, Borgwardt, Karsten M., Rasch, Malte, Schölkopf, Bernhard, and Smola,
Alex J. 2007. A kernel method for the two-sample-problem. Pages 513–520 of: Advances
in Neural Information Processing Systems.
Gretton, Arthur, Sejdinovic, Dino, Strathmann, Heiko, et al. 2012. Optimal kernel choice
for large-scale two-sample tests. Pages 1214–1222 of: Advances in Neural Information
Processing Systems.
Guo, Bin, Li, Jing, Zheng, Vincent W., Wang, Zhu, and Yu, Zhiwen. 2018a. CityTransfer:
Transferring inter- and intra-city knowledge for chain store site recommendation
based on multi-source urban data. Pages 135:1–135:23 of: Proceedings of the 2018 ACM
International Joint Conference on Pervasive and Ubiquitous Computing.
Guo, Jiang, Che, Wanxiang, Wang, Haifeng, and Liu, Ting. 2016a. Exploiting multi-typed
treebanks for parsing with deep multi-task learning. CoRR, abs/1606.01161.
Guo, Jiang, Che, Wanxiang, Wang, Haifeng, and Liu, Ting. 2016b. A universal framework for
inductive transfer parsing across multi-typed treebanks. Pages 12–22 of: Proceedings
of the 26th International Conference on Computational Linguistics.
Guo, Xiawei, Yao, Quanming, Tu, Wei-Wei, et al. 2018b. Privacy-preserving transfer learning
for knowledge sharing. CoRR, abs/1811.09491.
Guo, Zhenyu, and Wang, Z. Jane. 2013. Cross-domain object recognition via input-output
kernel analysis. IEEE Transactions on Image Processing, 22(8), 3108–3119.
Gupta, Sunil Kumar, Phung, Dinh, Adams, Brett, Tran, Truyen, and Venkatesh, Svetha. 2010.
Nonnegative shared subspace learning and its application to social media retrieval.
Pages 1169–1178 of: Proceedings of the 16th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Haeberlen, Andreas, Flannery, Eliot, Ladd, Andrew M., et al. 2004. Practical robust localiza-
tion over large-scale 802.11 wireless networks. Pages 70–84 of: Proceedings of the 10th
Annual International Conference on Mobile Computing and Networking.
Ham, Ji Hun, Lee, Daniel D., and Saul, Lawrence K. 2003. Learning high dimensional cor-
respondences from low dimensional manifolds. Proceedings of ICML Workshop on the
Continuum from Labeled to Unlabeled Data in Machine Learning and Data Mining.
Hamm, Jihun, Cao, Yingjun, and Belkin, Mikhail. 2016. Learning privately from multiparty
data. Pages 555–563 of: Proceedings of the 33rd International Conference on Machine
Learning.
Hammerla, Nils Y., Halloran, Shane, and Plötz, Thomas. 2016. Deep, convolutional, and
recurrent models for human activity recognition using wearables. Pages 1533–1540 of:
Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence.
Han, Jiawei, and Kamber, Micheline. 2000. Data Mining: Concepts and Techniques. Morgan
Kaufmann.
Han, Lei, and Zhang, Yu. 2015a. Learning multi-level task groups in multi-task learning.
Pages 2638–2644 of: Proceedings of the 29th AAAI Conference on Artificial Intelligence.
Han, Lei, and Zhang, Yu. 2015b. Learning tree structure in multi-task learning. Proceedings
of the 21st ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
Han, Lei, and Zhang, Yu. 2016. Multi-stage multi-task learning with reduced rank.
Pages 1638–1644 of: Proceedings of the 30th AAAI Conference on Artificial Intelligence.
Han, Lei, Zhang, Yu, Song, Guojie, and Xie, Kunqing. 2014. Encoding tree sparsity in multi-
task learning: A probabilistic framework. Pages 1854–1860 of: Proceedings of the 28th
AAAI Conference on Artificial Intelligence.
Harris, Zellig S. 1954. Distributional structure. Word, 10(2–3), 146–162.
Hashimoto, Kazuma, Tsuruoka, Yoshimasa, Socher, Richard, et al. 2017. A joint many-task
model: Growing a neural network for multiple NLP tasks. Pages 1923–1933 of: Proceed-
ings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Hausknecht, Matthew J., and Stone, Peter. 2015. Deep recurrent Q-learning for partially
observable MDPs. CoRR, abs/1507.06527.
He, Jia, Liu, Rui, Zhuang, Fuzhen, et al. 2018a. A general cross-domain recommendation
framework via Bayesian neural network. Pages 1001–1006 of: Proceedings of the 2018
IEEE International Conference on Data Mining.
He, Jingrui, and Lawrence, Rick. 2011. A graph-based framework for multi-task multi-view
learning. Pages 25–32 of: Proceedings of the 28th International Conference on Machine
Learning.
He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. 2016. Identity mappings in
deep residual networks. Pages 630–645 of: European Conference on Computer Vision.
He, Kaiming, Girshick, Ross B., and Dollár, Piotr. 2018b. Rethinking ImageNet pre-training.
CoRR, abs/1811.08883.
He, Yulan, and Young, Steve. 2006. Spoken language understanding using the hidden vector
state model. Speech Communication, 48(3), 262–275.
Henderson, Matthew, Gašić, Milica, Thomson, Blaise, et al. 2012. Discriminative spoken
language understanding using word confusion networks. Pages 176–181 of: Proceed-
ings of IEEE Spoken Language Technology Workshop.
Henderson, Matthew, Thomson, Blaise, and Young, Steve. 2014. Word-based dialog state
tracking with recurrent neural networks. Pages 292–299 of: Proceedings of the 15th An-
nual Meeting of the Special Interest Group on Discourse and Dialogue.
Hengst, Bernhard. 2002. Discovering hierarchy in reinforcement learning with HEXQ.
Pages 243–250 of: Proceedings of the Nineteenth International Conference on Machine
Learning.
Hernández-Lobato, Daniel, and Hernández-Lobato, José Miguel. 2013. Learning feature se-
lection dependencies in multi-task learning. Pages 746–754 of: Advances in Neural In-
formation Processing Systems.
Hernández-Lobato, Daniel, Hernández-Lobato, José Miguel, and Ghahramani, Zoubin.
2015. A probabilistic model for dirty multi-task feature selection. Pages 1073–1082 of:
Proceedings of the 32nd International Conference on Machine Learning.
Hinrichs, Thomas R., and Forbus, Kenneth D. 2011. Transfer learning through analogy in
games. AI Magazine, 32(1), 70–83.
Hoffman, Judy, Rodner, Erik, Donahue, Jeff, Saenko, Kate, and Darrell, Trevor. 2013. Effi-
cient learning of domain-invariant image representations. CoRR, abs/1301.3224.
Hoffman, Judy, Guadarrama, Sergio, Tzeng, Eric S., et al. 2014. LSDA: Large scale detection
through adaptation. Pages 3536–3544 of: Advances in Neural Information Processing
Systems.
Hofmann, Thomas. 1999. Probabilistic latent semantic analysis. Pages 289–296 of: Proceed-
ings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
Holyoak, Keith J., and Thagard, Paul. 1989. Analogical mapping by constraint satisfaction.
Cognitive Science, 13(3), 295–355.
Hu, Guangneng, Zhang, Yu, and Yang, Qiang. 2018. CoNet: Collaborative cross networks for
cross-domain recommendation. Pages 667–676 of: Proceedings of the 27th ACM Inter-
national Conference on Information and Knowledge Management.
Hu, Guangneng, Zhang, Yu, and Yang, Qiang. 2019. Transfer meets hybrid: A synthetic
approach for cross-domain collaborative filtering with text. Pages 2822–2829 of: Pro-
ceedings of the Web Conference.
Hu, Minqing, and Liu, Bing. 2004. Mining and summarizing customer reviews. Pages 168–
177 of: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining.
Huang, Jiayuan, Smola, Alexander J., Gretton, Arthur, Borgwardt, Karsten M., and
Schölkopf, Bernhard. 2006. Correcting sample selection bias by unlabeled data.
Pages 601–608 of: Advances in Neural Information Processing Systems.
Huang, Jui-Ting, Li, Jinyu, Yu, Dong, Deng, Li, and Gong, Yifan. 2013. Cross-language
knowledge transfer using multilingual deep neural network with shared hidden lay-
ers. Pages 7304–7308 of: Proceedings of the IEEE International Conference on Acoustics,
Speech and Signal Processing.
Huber, Peter J. 1964. Robust estimation of a location parameter. The Annals of Mathemati-
cal Statistics, 35(1), 73–101.
Isola, Phillip, Zhu, Jun-Yan, Zhou, Tinghui, and Efros, Alexei A. 2017. Image-to-image trans-
lation with conditional adversarial networks. Pages 1125–1134 of: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition.
Jacob, Laurent, and Vert, Jean-Philippe. 2007. Efficient peptide-MHC-I binding prediction
for alleles with few known binders. Bioinformatics, 24(3), 358–366.
Jacob, Laurent, Bach, Francis R., and Vert, Jean-Philippe. 2008. Clustered multi-task learn-
ing: A convex formulation. Pages 745–752 of: Advances in Neural Information Process-
ing Systems.
Jagannathan, Geetha, Pillaipakkamnatt, Krishnan, and Wright, Rebecca N. 2012. A practi-
cal differentially private random decision tree classifier. Transactions on Data Privacy,
5(1), 273–295.
Jalali, Ali, Ravikumar, Pradeep, Sanghavi, Sujay, and Ruan, Chao. 2010. A dirty model for
multi-task learning. Pages 964–972 of: Advances in Neural Information Processing
Systems 23.
Jean, Neal, Burke, Marshall, Xie, Michael, et al. 2016. Combining satellite imagery and ma-
chine learning to predict poverty. Science, 353(6301), 790–794.
Jeffreys, Harold. 1946. An invariant form for the prior probability in estimation prob-
lems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical
Sciences, 186(1007), 453–461.
Jeong, Minwoo, and Lee, Gary Geunbae. 2009. Multi-domain spoken language understand-
ing with transfer learning. Speech Communication, 51(5), 412–424.
Jernite, Yacine, Bowman, Samuel R., and Sontag, David. 2017. Discourse-based objectives
for fast unsupervised sentence representation learning. CoRR, abs/1705.00557.
Ji, Zhanglong, Jiang, Xiaoqian, Wang, Shuang, Xiong, Li, and Ohno-Machado, Lucila. 2014.
Differentially private distributed logistic regression using private and public data.
BMC Medical Genomics, 7(1), S14.
Jia, Yangqing, Salzmann, Mathieu, and Darrell, Trevor. 2010. Factorized latent spaces with
structured sparsity. Pages 982–990 of: Advances in Neural Information Processing
Systems.
Jiang, Jing. 2009. Multi-task transfer learning for weakly-supervised relation extraction.
Pages 1012–1020 of: Proceedings of the 47th Annual Meeting of the Association for Com-
putational Linguistics and the 4th International Joint Conference on Natural Language
Processing of the AFNLP.
Jiang, Jing, and Zhai, Chengxiang. 2007. Instance weighting for domain adaptation in NLP.
Pages 264–271 of: Proceedings of the 45th Annual Meeting of the Association for
Computational Linguistics.
Jiang, Wei, Zavesky, Eric, Chang, Shih-Fu, and Loui, Alex. 2008. Cross-domain learning
methods for high-level visual concept classification. Pages 161–164 of: Proceedings of
the 15th IEEE International Conference on Image Processing.
Jie, Luo, Tommasi, Tatiana, and Caputo, Barbara. 2011. Multiclass transfer learning from
unconstrained priors. Pages 1863–1870 of: Proceedings of IEEE International Confer-
ence on Computer Vision.
Joachims, Thorsten. 1999. Transductive inference for text classification using support vec-
tor machines. Pages 200–209 of: Proceedings of the Sixteenth International Conference
on Machine Learning.
Johnson, Justin, Alahi, Alexandre, and Fei-Fei, Li. 2016a. Perceptual losses for real-time style
transfer and super-resolution. Pages 694–711 of: Proceedings of European Conference
on Computer Vision.
Johnson, Melvin, Schuster, Mike, Le, Quoc V., et al. 2016b. Google’s multilingual neural ma-
chine translation system: Enabling zero-shot translation. CoRR, abs/1611.04558.
Joshi, Chaitanya K., Mi, Fei, and Faltings, Boi. 2017. Personalization in goal-oriented dialog.
CoRR, abs/1706.07503.
Juba, Brendan. 2006. Estimating relatedness via data compression. Pages 441–448 of: Pro-
ceedings of the 23rd International Conference on Machine Learning.
Kakade, Sham M., Shalev-Shwartz, Shai, and Tewari, Ambuj. 2012. Regularization tech-
niques for learning with matrices. Journal of Machine Learning Research, 13,
1865–1890.
Kalousis, Alexandros, Prados, Julien, and Hilario, Melanie. 2007. Stability of feature se-
lection algorithms: A study on high-dimensional spaces. Knowledge and Information
Systems, 12(1), 95–116.
Kanagawa, Heishiro, Kobayashi, Hayato, Shimizu, Nobuyuki, Tagami, Yukihiro, and Suzuki,
Taiji. 2019. Cross-domain recommendation via deep domain adaptation. Pages 20–29
of: Proceedings of the 41st European Conference on Information Retrieval.
Kanamori, Takafumi, Hido, Shohei, and Sugiyama, Masashi. 2009. A least-squares ap-
proach to direct importance estimation. Journal of Machine Learning Research, 10,
1391–1445.
Kang, Zhuoliang, Grauman, Kristen, and Sha, Fei. 2011. Learning with whom to share in
multi-task feature learning. Pages 521–528 of: Proceedings of the 28th International
Conference on Machine Learning.
Karpathy, Andrej, Toderici, George, Shetty, Sanketh, et al. 2014. Large-scale video classifica-
tion with convolutional neural networks. Pages 1725–1732 of: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.
Kashima, Hisashi, Yamanishi, Yoshihiro, Kato, Tsuyoshi, Sugiyama, Masashi, and Tsuda,
Koji. 2009. Simultaneous inference of biological networks of multiple species from
genome-wide data and evolutionary information: A semi-supervised approach. Bioin-
formatics, 25(22), 2962–2968.
Katiyar, Arzoo, and Cardie, Claire. 2017. Going out on a limb: Joint extraction of entity men-
tions and relations without dependency trees. Pages 917–928 of: Proceedings of the
55th Annual Meeting of the Association for Computational Linguistics.
Kato, Tsuyoshi, Kashima, Hisashi, Sugiyama, Masashi, and Asai, Kiyoshi. 2007. Multi-task
learning via conic programming. Pages 737–744 of: Advances in Neural Information
Processing Systems.
Kato, Tsuyoshi, Kashima, Hisashi, Sugiyama, Masashi, and Asai, Kiyoshi. 2010a. Conic
programming for multitask learning. IEEE Transactions on Knowledge and Data
Engineering, 22(7), 957–968.
Kato, Tsuyoshi, Okada, Kinya, Kashima, Hisashi, and Sugiyama, Masashi. 2010b. A trans-
fer learning approach and selective integration of multiple types of assays for biologi-
cal network inference. International Journal of Knowledge Discovery in Bioinformatics,
1(1), 66–80.
Keogh, Eamonn J., and Pazzani, Michael J. 2000. Scaling up dynamic time warping for
datamining applications. Pages 285–289 of: Proceedings of the Sixth ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining.
Kermany, Daniel S., Goldbaum, Michael, Cai, Wenjia, et al. 2018. Identifying medi-
cal diagnoses and treatable diseases by image-based deep learning. Cell, 172(5),
1122–1131.
Khan, Md Abdullah Al Hafiz, Roy, Nirmalya, and Misra, Archan. 2018. Scaling human activ-
ity recognition via deep learning-based domain adaptation. Pages 1–9 of: Proceedings
of IEEE International Conference on Pervasive Computing and Communications.
Khosla, Aditya, Zhou, Tinghui, Malisiewicz, Tomasz, Efros, Alexei A., and Torralba, Antonio.
2012. Undoing the damage of dataset bias. Pages 158–171 of: Proceedings of European
Conference on Computer Vision.
Kim, Edward, Corte-Real, Miguel, and Baloch, Zubair. 2016. A deep semantic mobile ap-
plication for thyroid cytopathology. Proceedings of Medical Imaging 2016: PACS and
Imaging Informatics: Next Generation and Innovations.
Kim, Taeksoo, Cha, Moonsu, Kim, Hyunsoo, Lee, Jung Kwon, and Kim, Jiwon. 2017.
Learning to discover cross-domain relations with generative adversarial networks.
Pages 1857–1865 of: Proceedings of International Conference on Machine Learning.
Kim, Yoon. 2014. Convolutional neural networks for sentence classification. Pages
1746–1751 of: Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing.
Kiros, Ryan, Zhu, Yukun, Salakhutdinov, Ruslan R., et al. 2015. Skip-thought vectors. Pages
3294–3302 of: Advances in Neural Information Processing Systems.
Kjaergaard, Mikkel Baun, and Munk, Carsten Valdemar. 2008. Hyperbolic location finger-
printing: A calibration-free solution for handling differences in signal strength. Pages
110–116 of: Proceedings of the Sixth IEEE International Conference on Pervasive Com-
puting and Communications.
Kober, Jens, Öztop, Erhan, and Peters, Jan. 2011. Reinforcement learning to adjust robot
movements to new situations. Pages 2650–2655 of: Proceedings of the 22nd Interna-
tional Joint Conference on Artificial Intelligence.
Koch, Gregory. 2015. Siamese Neural Networks for One-Shot Image Recognition. M.Phil. the-
sis, University of Toronto.
Kolar, Mladen, Lafferty, John D., and Wasserman, Larry A. 2011. Union support recovery in
multi-task learning. Journal of Machine Learning Research, 12, 2415–2435.
Koller, Daphne, and Friedman, Nir. 2009. Probabilistic Graphical Models: Principles and
Techniques. MIT Press.
Kolodner, Janet. 1993. Case-Based Reasoning. Morgan Kaufmann.
Konidaris, George, and Barto, Andrew G. 2007. Building portable options: skill transfer in
reinforcement learning. Pages 895–900 of: Proceedings of the 20th International Joint
Conference on Artificial Intelligence.
Kotthoff, Lars, Thornton, Chris, Hoos, Holger H., Hutter, Frank, and Leyton-Brown, Kevin.
2017. Auto-WEKA 2.0: Automatic model selection and hyperparameter optimization
in WEKA. Journal of Machine Learning Research, 18, 25:1–25:5.
Krallinger, Martin, and Valencia, Alfonso. 2005. Text-mining and information-retrieval ser-
vices for molecular biology. Genome Biology, 6, 224.
352 References
Krishnan, P., Krishnakumar, A. S., Ju, Wen-Hua, Mallows, Colin, and Ganu, Sachin. 2004.
A system for LEASE: Location estimation assisted by stationery emitters for indoor
RF wireless networks. In: Proceedings of IEEE International Conference on Computer
Communications.
Krizhevsky, Alex, and Hinton, Geoffrey. 2009. Learning Multiple Layers of Features from Tiny
Images. Computer Science Department, University of Toronto, Technical Report.
Kulis, Brian, Saenko, Kate, and Darrell, Trevor. 2011. What you saw is not what you get: Do-
main adaptation using asymmetric kernel transforms. Pages 1785–1792 of: Proceed-
ings of the IEEE Conference on Computer Vision and Pattern Recognition.
Kullback, S., and Leibler, R. A. 1951. On information and sufficiency. Annals of Mathemati-
cal Statistics, 22(1), 79–86.
Kumar, Abhishek, and Daumé III, Hal. 2012. Learning task grouping and overlap in multi-
task learning. Proceedings of the 29th International Conference on Machine Learning.
Kumaraswamy, Raksha, Odom, Phillip, Kersting, Kristian, Leake, David, and Natarajan, Sri-
raam. 2015. Transfer learning via relational type matching. Pages 811–816 of: Proceed-
ings of IEEE International Conference on Data Mining.
Kuzborskij, Ilja, and Orabona, Francesco. 2013. Stability and hypothesis transfer learn-
ing. Pages 942–950 of: Proceedings of the 30th International Conference on Machine
Learning.
Ladd, Andrew M., Bekris, Kostas E., Rudys, Algis, et al. 2002. Robotics-based location sens-
ing using wireless Ethernet. Pages 227–238 of: Proceedings of the 8th Annual Interna-
tional Conference on Mobile Computing and Networking.
Lafferty, John D., and Zhai, ChengXiang. 2001. Document language models, query mod-
els, and risk minimization for information retrieval. Pages 111–119 of: Croft, W. Bruce,
Harper, David J., Kraft, Donald H., and Zobel, Justin (eds.), SIGIR 2001: Proceedings of
the 24th Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval.
Lai, Tze Leung, and Robbins, Herbert. 1985. Asymptotically efficient adaptive allocation
rules. Advances in Applied Mathematics, 6(1), 4–22.
Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. 2015. Human-level
concept learning through probabilistic program induction. Science, 350(6266), 1332–1338.
Lake, Brenden, Salakhutdinov, Ruslan, Gross, Jason, and Tenenbaum, Joshua. 2011. One
shot learning of simple visual concepts. Pages 2568–2573 of: Proceedings of the An-
nual Meeting of the Cognitive Science Society.
Lake, Brenden M., Salakhutdinov, Ruslan, and Tenenbaum, Joshua B. 2013. One-shot learn-
ing by inverting a compositional causal process. Pages 2526–2534 of: Advances in Neu-
ral Information Processing Systems.
Laroche, Romain, and Barlier, Merwan. 2017. Transfer reinforcement learning with shared
dynamics. Pages 2147–2153 of: Proceedings of the Thirty-First AAAI Conference on
Artificial Intelligence.
Larrañaga, Pedro, Calvo, Borja, Santana, Roberto, et al. 2006. Machine learning in bioinfor-
matics. Briefings in Bioinformatics, 7(1), 86–112.
Lawrence, Neil D., and Platt, John C. 2004. Learning to learn with the informative vec-
tor machine. Proceedings of the Twenty-First International Conference on Machine
Learning.
Lazaric, Alessandro. 2008. Knowledge Transfer in Reinforcement Learning. Ph.D. thesis,
Politecnico di Milano.
Lazaric, Alessandro. 2012. Transfer in reinforcement learning: A framework and a survey.
Pages 143–173 of: Wiering, Marco, and van Otterlo, Martijn (eds.), Reinforcement Learn-
ing: State-of-the-Art.
Li, Lihong, Chu, Wei, Langford, John, and Schapire, Robert E. 2010. A contextual-bandit ap-
proach to personalized news article recommendation. Pages 661–670 of: Proceedings
of the 19th International Conference on World Wide Web.
Li, Qi, and Ji, Heng. 2014. Incremental joint extraction of entity mentions and relations.
Pages 402–412 of: Proceedings of the 52nd Annual Meeting of the Association for Com-
putational Linguistics.
Li, Sijin, Liu, Zhi-Qiang, and Chan, Antoni B. 2015. Heterogeneous multi-task learning for
human pose estimation with deep convolutional neural network. International Jour-
nal of Computer Vision, 113(1), 19–36.
Li, Wen, Duan, Lixin, Xu, Dong, and Tsang, Ivor W. 2014. Learning with augmented features
for supervised and semi-supervised heterogeneous domain adaptation. IEEE Transac-
tions on Pattern Analysis and Machine Intelligence, 36(6), 1134–1148.
Li, Zheng, Zhang, Yu, Wei, Ying, Wu, Yuxiang, and Yang, Qiang. 2017b. End-to-end adver-
sarial memory network for cross-domain sentiment classification. Pages 2237–2243 of:
Proceedings of the International Joint Conference on Artificial Intelligence.
Liao, Renjie, Schwing, Alexander G., Zemel, Richard S., and Urtasun, Raquel. 2016. Learning
deep parsimonious representations. Pages 5076–5084 of: Advances in Neural Informa-
tion Processing Systems.
Liao, Xuejun, Xue, Ya, and Carin, Lawrence. 2005. Logistic regression with an auxiliary data
source. Pages 505–512 of: Proceedings of the 22nd International Conference on Machine
Learning.
Ling, Xiao, Xue, Gui-Rong, Dai, Wenyuan, et al. 2008. Can Chinese web pages be classified
with English data source? Pages 969–978 of: Proceedings of the 17th International Con-
ference on World Wide Web.
Liu, Bing. 2012. Sentiment analysis and opinion mining. Synthesis Lectures on Human Lan-
guage Technologies, 5(1), 1–167.
Liu, Bing, Hsu, Wynne, and Ma, Yiming. 1999. Mining association rules with multiple min-
imum supports. Pages 337–341 of: Proceedings of the Fifth ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining.
Liu, Bo, Wei, Ying, Zhang, Yu, and Yang, Qiang. 2017. Deep neural networks for high dimen-
sion, low sample size data. Pages 2287–2293 of: Sierra, Carles (ed.), Proceedings of the
Twenty-Sixth International Joint Conference on Artificial Intelligence.
Liu, Bo, Wei, Ying, Zhang, Yu, Yan, Zhixian, and Yang, Qiang. 2018. Transferable contex-
tual bandit for cross-domain recommendation. Pages 3619–3626 of: Proceedings of the
Thirty-Second AAAI Conference on Artificial Intelligence.
Liu, Chenxi, Zoph, Barret, Neumann, Maxim, et al. 2018c. Progressive neural architecture
search. Pages 19–35 of: Proceedings of 15th European Conference on Computer Vision.
Liu, Dong, Hua, Xian-Sheng, Yang, Linjun, Wang, Meng, and Zhang, Hong-Jiang. 2009a. Tag
ranking. Pages 351–360 of: Proceedings of the 18th International Conference on World
Wide Web.
Liu, Han, Palatucci, Mark, and Zhang, Jian. 2009b. Blockwise coordinate descent proce-
dures for the multi-task lasso, with applications to neural semantic basis discov-
ery. Pages 649–656 of: Proceedings of the 26th International Conference on Machine
Learning.
Liu, Jiahui, Dolan, Peter, and Pedersen, Elin Rønby. 2010a. Personalized news recommen-
dation based on click behavior. Pages 31–40 of: Proceedings of the 15th International
Conference on Intelligent User Interfaces.
Liu, Qi, Xu, Qian, Zheng, Vincent W., et al. 2010b. Multi-task learning for cross-platform
siRNA efficacy prediction: an in-silico study. BMC Bioinformatics, 11, 181.
Liu, Qiuhua, Liao, Xuejun, and Carin, Lawrence. 2007. Semi-supervised multitask learning.
Pages 937–944 of: Advances in Neural Information Processing Systems.
Liu, Qiuhua, Liao, Xuejun, Li, Hui, Stack, Jason R., and Carin, Lawrence. 2009c. Semisuper-
vised multitask learning. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 31(6), 1074–1086.
Liu, Wu, Mei, Tao, Zhang, Yongdong, Che, Cherry, and Luo, Jiebo. 2015a. Multi-task deep
visual-semantic embedding for video thumbnail selection. Pages 3707–3715 of: Pro-
ceedings of IEEE Conference on Computer Vision and Pattern Recognition.
Liu, Xiaodong, Gao, Jianfeng, et al. 2015b. Representation learning using multi-task deep
neural networks for semantic classification and information retrieval. Pages 912–921
of: Proceedings of the 2015 Conference of the North American Chapter of the Association
for Computational Linguistics: Human Language Technologies.
Long, Mingsheng, Wang, Jianmin, Ding, Guiguang, Shen, Dou, and Yang, Qiang. 2014.
Transfer learning with graph co-regularization. IEEE Transactions on Knowledge and
Data Engineering, 26(7), 1805–1818.
Long, Mingsheng, Cao, Yue, Wang, Jianmin, and Jordan, Michael I. 2015. Learning transfer-
able features with deep adaptation networks. Pages 97–105 of: Proceedings of the 32nd
International Conference on Machine Learning.
Long, Mingsheng, Zhu, Han, Wang, Jianmin, and Jordan, Michael I. 2017. Deep transfer
learning with joint adaptation networks. Pages 2208–2217 of: Proceedings of Interna-
tional Conference on Machine Learning.
Lounici, Karim, Pontil, Massimiliano, Tsybakov, Alexandre B., and van de Geer, Sara A. 2009.
Taking advantage of sparsity in multi-task learning. Proceedings of the 22nd Conference
on Learning Theory.
Lozano, Aurelie C., and Swirszcz, Grzegorz. 2012. Multi-level lasso for sparse multi-task re-
gression. Proceedings of the 29th International Conference on Machine Learning.
Lu, Guoyu, Yan, Yan, Ren, Li, et al. 2016. Where am I in the dark: Exploring active transfer
learning on the use of indoor localization based on thermal imaging. Neurocomputing,
173, 83–92.
Lugosi, Gábor, Papaspiliopoulos, Omiros, and Stoltz, Gilles. 2009. Online multi-task learn-
ing with hard constraints. Proceedings of the 22nd Conference on Learning Theory.
Luo, Bingfeng, Feng, Yansong, Xu, Jianbo, Zhang, Xiang, and Zhao, Dongyan. 2017. Learn-
ing to predict charges for criminal cases with legal basis. Pages 2727–2736 of: Proceed-
ings of the 2017 Conference on Empirical Methods in Natural Language Processing.
Luong, Minh-Thang, Le, Quoc V., Sutskever, Ilya, Vinyals, Oriol, and Kaiser, Lukasz. 2016.
Multi-task sequence to sequence learning. Proceedings of the 4th International Con-
ference on Learning Representations.
Luria, Aleksandr R. 1976. Cognitive Development: Its Cultural and Social Foundations. Har-
vard University Press.
Ma, Zhigang, Yang, Yi, Nie, Feiping, et al. 2014. Harnessing lab knowledge for real-world
action recognition. International Journal of Computer Vision, 109(1–2), 60–73.
Mahadevan, Sridhar, and Maggioni, Mauro. 2007. Proto-value functions: A Laplacian
framework for learning representation and control in Markov decision processes. Jour-
nal of Machine Learning Research, 8, 2169–2231.
Mahajan, Dhruv, Girshick, Ross B., Ramanathan, Vignesh, et al. 2018. Exploring the limits
of weakly supervised pretraining. CoRR, abs/1805.00932.
Mahmud, M. M., and Ray, Sylvian R. 2007. Transfer learning using Kolmogorov complexity:
Basic theory and empirical evaluations. Pages 985–992 of: Advances in Neural Infor-
mation Processing Systems.
Mairesse, François, and Walker, Marilyn A. 2008. Trainable generation of big-five person-
ality styles through data-driven parameter estimation. Pages 165–173 of: Proceedings
of the 46th Annual Meeting of the Association for Computational Linguistics.
Mairesse, François, and Walker, Marilyn A. 2011. Controlling user perceptions of linguis-
tic style: Trainable generation of personality traits. Computational Linguistics, 37(3),
455–488.
Mairesse, François, Gašić, Milica, Jurčíček, Filip, et al. 2009. Spoken language un-
derstanding from unaligned data using discriminative classification models.
Pages 4749–4752 of: Proceedings of IEEE International Conference on Acoustics,
Speech and Signal Processing.
Malaviya, Chaitanya, Neubig, Graham, and Littell, Patrick. 2017. Learning language rep-
resentations for typology prediction. Pages 2529–2535 of: Proceedings of the 2017
Conference on Empirical Methods in Natural Language Processing.
Man, Tong, Shen, Huawei, Jin, Xiaolong, and Cheng, Xueqi. 2017. Cross-domain recom-
mendation: An embedding and mapping approach. Pages 2464–2470 of: Proceedings
of the 26th International Joint Conference on Artificial Intelligence.
Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. 2008. Domain adaptation
with multiple sources. Pages 1041–1048 of: Advances in Neural Information Process-
ing Systems.
Mansour, Yishay, Mohri, Mehryar, and Rostamizadeh, Afshin. 2009. Domain adaptation:
Learning bounds and algorithms. Proceedings of the 22nd Conference on Learning
Theory.
Mao, Xiangbo, Lin, Binbin, Cai, Deng, He, Xiaofei, and Pei, Jian. 2013. Parallel field align-
ment for cross media retrieval. Pages 897–906 of: Proceedings of the 21st ACM Interna-
tional Conference on Multimedia.
Marx, Zvika, Rosenstein, Michael T., Dietterich, Thomas G., and Kaelbling, Leslie Pack.
2008. Two algorithms for transfer learning. Inductive Transfer: 10 Years Later.
Maurer, Andreas. 2005. Algorithmic stability and meta-learning. Journal of Machine Learn-
ing Research, 6, 967–994.
Maurer, Andreas. 2006a. Bounds for linear multi-task learning. Journal of Machine Learning
Research, 7, 117–139.
Maurer, Andreas. 2006b. The Rademacher complexity of linear transformation classes.
Pages 65–78 of: Proceedings of the 19th Annual Conference on Learning Theory.
Maurer, Andreas. 2009. Transfer bounds for linear feature learning. Machine Learning,
75(3), 327–350.
Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. 2013. Sparse
coding for multitask and transfer learning. Pages 343–351 of: Proceedings of the 30th
International Conference on Machine Learning.
Maurer, Andreas, Pontil, Massimiliano, and Romera-Paredes, Bernardino. 2016. The benefit
of multitask representation learning. Journal of Machine Learning Research, 17, 1–32.
McAllester, David A. 1999. Some PAC-Bayesian theorems. Machine Learning, 37(3),
355–363.
McCann, Bryan, Bradbury, James, Xiong, Caiming, and Socher, Richard. 2017. Learned in
translation: Contextualized word vectors. Pages 6297–6308 of: Advances in Neural In-
formation Processing Systems.
McGovern, Amy, and Barto, Andrew G. 2001. Automatic discovery of subgoals in reinforce-
ment learning using diverse density. Pages 361–368 of: Proceedings of the Eighteenth
International Conference on Machine Learning.
McNamara, Daniel, and Balcan, Maria-Florina. 2017. Risk bounds for transferring repre-
sentations with and without fine-tuning. Pages 2373–2381 of: Proceedings of the 34th
International Conference on Machine Learning.
Menze, Bjoern H., Jakab, András, Bauer, Stefan, et al. 2015. The multimodal brain tumor im-
age segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging, 34(10),
1993–2024.
Mihalkova, Lilyana, and Mooney, Raymond J. 2008. Transfer learning by mapping with min-
imal target data. In: Proceedings of the AAAI-08 Workshop on Transfer Learning for
Complex Tasks.
Mihalkova, Lilyana, Huynh, Tuyen N., and Mooney, Raymond J. 2007. Mapping and revis-
ing Markov logic networks for transfer learning. Pages 608–614 of: Proceedings of the
Twenty-Second AAAI Conference on Artificial Intelligence.
Mikolov, Tomas, Chen, Kai, Corrado, Greg, and Dean, Jeffrey. 2013a. Efficient estimation of
word representations in vector space. CoRR, abs/1301.3781.
Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. 2013b.
Distributed representations of words and phrases and their compositionality.
Pages 3111–3119 of: Advances in Neural Information Processing Systems.
Min, Sewon, Seo, Minjoon, and Hajishirzi, Hannaneh. 2017. Question answering through
transfer learning from large fine-grained supervision data. Pages 510–517 of: Proceed-
ings of the 55th Annual Meeting of the Association for Computational Linguistics.
Misra, Ishan, Shrivastava, Abhinav, Gupta, Abhinav, and Hebert, Martial. 2016. Cross-stitch
networks for multi-task learning. Pages 3994–4003 of: Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition.
Mitchell, T., Cohen, W., Hruschka, E., et al. 2015. Never-ending learning. Pages 2302–2310
of: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence.
Mitra, Pabitra, Murthy, C. A., and Pal, Sankar K. 2002. Unsupervised feature selection us-
ing feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(3), 301–312.
Mitzlaff, Folke, Atzmüller, Martin, Hotho, Andreas, and Stumme, Gerd. 2014. The social dis-
tributional hypothesis: A pragmatic proxy for homophily in online social networks.
Social Network Analysis and Mining, 4(1), 216.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, et al. 2013. Playing Atari with deep
reinforcement learning. CoRR, abs/1312.5602.
Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, et al. 2015. Human-level control
through deep reinforcement learning. Nature, 518(7540), 529–533.
Mo, Kaixiang, Zhang, Yu, Yang, Qiang, and Fung, Pascale. 2017. Fine grained knowledge
transfer for personalized task-oriented dialogue systems. CoRR, abs/1711.04079.
Mo, Kaixiang, Zhang, Yu, Li, Shuangyin, Li, Jiajun, and Yang, Qiang. 2018. Personalizing a
dialogue system with transfer reinforcement learning. Pages 5317–5324 of: Proceedings
of the Thirty-Second AAAI Conference on Artificial Intelligence.
Moore, Andrew W. 1991. Variable resolution dynamic programming: Efficiently learning
action maps in multivariate real-valued state-spaces. Pages 333–337 of: Proceedings
of the Eighth International Conference on Machine Learning.
Mou, Lili, Meng, Zhao, Yan, Rui, et al. 2016. How transferable are neural networks in NLP
applications? Pages 479–489 of: Proceedings of the 2016 Conference on Empirical Meth-
ods in Natural Language Processing.
Mrkšić, Nikola, Ó Séaghdha, Diarmuid, Thomson, Blaise, et al. 2015. Multi-domain dialog
state tracking using recurrent neural networks. Pages 794–799 of: Proceedings of the
53rd Annual Meeting of the Association for Computational Linguistics.
Nassar, Marcel, Abdallah, Rami, Zeineddine, Hady Ali, Yaacoub, Elias, and Dawy, Zaher.
2008. A new multitask learning method for multiorganism gene network estimation.
Pages 2287–2291 of: Proceedings of IEEE International Symposium on Information
Theory.
Ng, Andrew Y., Jordan, Michael I., and Weiss, Yair. 2002. On spectral clustering: Analysis and
an algorithm. Pages 849–856 of: Advances in Neural Information Processing Systems.
Nguyen, Hien Van, Ho, Huy Tho, Patel, Vishal M., and Chellappa, Rama. 2015. DASH-N:
Joint hierarchical domain adaptation and feature learning. IEEE Transactions on Image
Processing, 24(12), 5479–5491.
Nguyen, Khanh, Daumé III, Hal, and Boyd-Graber, Jordan L. 2017. Reinforcement
learning for bandit neural machine translation with simulated human feedback.
Pages 1464–1474 of: Proceedings of the 2017 Conference on Empirical Methods in Nat-
ural Language Processing.
Ni, Jie, Qiu, Qiang, and Chellappa, Rama. 2013. Subspace interpolation via dictionary learn-
ing for unsupervised domain adaptation. Pages 692–699 of: Proceedings of IEEE Con-
ference on Computer Vision and Pattern Recognition.
Ni, Lionel M., Liu, Yunhao, Lau, Yiu Cho, and Patil, Abhishek P. 2003. LANDMARC: Indoor
location sensing using active RFID. Pages 407–415 of: Proceedings of IEEE International
Conference on Pervasive Computing and Communications.
Nickel, Maximilian, Murphy, Kevin, Tresp, Volker, and Gabrilovich, Evgeniy. 2016. A review
of relational machine learning for knowledge graphs. Proceedings of the IEEE, 104(1),
11–33.
Niehues, Jan, and Cho, Eunah. 2017. Exploiting linguistic resources for neural machine
translation using multi-task learning. Pages 80–89 of: Proceedings of the Second Con-
ference on Machine Translation.
Norouzi, Mohammad, Mikolov, Tomas, Bengio, Samy, et al. 2013. Zero-shot learning by
convex combination of semantic embeddings. CoRR, abs/1312.5650.
Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. 2016. f-GAN: Training genera-
tive neural samplers using variational divergence minimization. Pages 271–279 of: Ad-
vances in Neural Information Processing Systems.
Obozinski, Guillaume, Taskar, Ben, and Jordan, Michael. 2006. Multi-task Feature Selection.
Tech. Report, Department of Statistics, University of California, Berkeley.
Obozinski, Guillaume, Taskar, Ben, and Jordan, Michael. 2010. Joint covariate selection and
joint subspace selection for multiple classification problems. Statistics and Comput-
ing, 20(2), 231–252.
Obozinski, Guillaume, Wainwright, Martin J., and Jordan, Michael I. 2011. Support union
recovery in high-dimensional multivariate regression. The Annals of Statistics, 39(1),
1–47.
Olshausen, Bruno A., and Field, David J. 1997. Sparse coding with an overcomplete basis
set: A strategy employed by V1? Vision Research, 37(23), 3311–3325.
Oquab, Maxime, Bottou, Leon, Laptev, Ivan, and Sivic, Josef. 2014. Learning and trans-
ferring mid-level image representations using convolutional neural networks. Pages
1717–1724 of: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition.
Palatucci, Mark, Pomerleau, Dean, Hinton, Geoffrey E., and Mitchell, Tom M. 2009. Zero-
shot learning with semantic output codes. Pages 1410–1418 of: Advances in Neural
Information Processing Systems.
Pan, Jialin. 2010. Feature-based Transfer Learning with Real-world Applications. Ph.D. the-
sis, Hong Kong University of Science and Technology.
Pan, Rong, Zhao, Junhui, Zheng, Vincent Wenchen, et al. 2007a. Domain-constrained semi-
supervised mining of tracking models in sensor networks. Pages 1023–1027 of: Pro-
ceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining.
Pan, Rong, Zhou, Yunhong, Cao, Bin, et al. 2008a. One-class collaborative filtering. Pages
502–511 of: Proceedings of the Eighth IEEE International Conference on Data Mining.
Pan, Sinno J., Kwok, James T., Yang, Qiang, and Pan, Jeffrey J. 2007b. Adaptive localization
in a dynamic WiFi environment through multi-view learning. Pages 1108–1113 of: Pro-
ceedings of the 22nd National Conference on Artificial Intelligence.
Pan, Sinno Jialin. 2014. Transfer Learning. Pages 537–570 of: Data Classification: Algorithms
and Applications. Chapman & Hall/CRC.
Pan, Sinno Jialin, and Yang, Qiang. 2010. A survey on transfer learning. IEEE Transactions
on Knowledge and Data Engineering, 22(10), 1345–1359.
Pan, Sinno Jialin, Kwok, James T., and Yang, Qiang. 2008b. Transfer learning via dimension-
ality reduction. Pages 677–682 of: Proceedings of the 23rd AAAI Conference on Artificial
Intelligence.
Pan, Sinno Jialin, Shen, Dou, Yang, Qiang, and Kwok, James T. 2008c. Transferring localiza-
tion models across space. Pages 1383–1388 of: Proceedings of the 23rd AAAI Conference
on Artificial Intelligence.
Pan, Sinno Jialin, Ni, Xiaochuan, Sun, Jian-Tao, Yang, Qiang, and Chen, Zheng. 2010a.
Cross-domain sentiment classification via spectral feature alignment. Pages 751–760
of: Proceedings of the 19th International Conference on World Wide Web.
Pan, Sinno Jialin, Tsang, Ivor W., Kwok, James T., and Yang, Qiang. 2011. Domain adap-
tation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2),
199–210.
Pan, Weike, and Yang, Qiang. 2013. Transfer learning in heterogeneous collaborative filter-
ing domains. Artificial Intelligence, 197, 39–55.
Pan, Weike, Xiang, Evan W., Liu, Nathan N., and Yang, Qiang. 2010b. Transfer learning
in collaborative filtering for sparsity reduction. Pages 230–235 of: Proceedings of the
Twenty-Fourth AAAI Conference on Artificial Intelligence.
Pan, Weike, Xiang, Evan Wei, and Yang, Qiang. 2012. Transfer learning in collaborative fil-
tering with uncertain ratings. Pages 662–668 of: Proceedings of the Twenty-Sixth AAAI
Conference on Artificial Intelligence.
Pan, Weike, Liu, Zhuode, Ming, Zhong, et al. 2015a. Compressed knowledge transfer via
factorization machine for heterogeneous collaborative recommendation. Knowledge-Based
Systems, 85, 234–244.
Pan, Weike, Zhong, Hao, Xu, Congfu, and Ming, Zhong. 2015b. Adaptive Bayesian person-
alized ranking for heterogeneous implicit feedbacks. Knowledge-Based Systems, 73,
173–180.
Pan, Weike, Liu, Mengsi, and Ming, Zhong. 2016a. Transfer learning for heterogeneous one-
class collaborative filtering. IEEE Intelligent Systems, 31(4), 43–49.
Pan, Weike, Yang, Qiang, Duan, Yuchao, and Ming, Zhong. 2016b. Transfer learning for
semisupervised collaborative recommendation. ACM Transactions on Interactive In-
telligent Systems, 6(2), 10:1–10:21.
Pan, Weike, Yang, Qiang, Duan, Yuchao, Tan, Ben, and Ming, Zhong. 2017. Transfer learning
for behavior ranking. ACM Transactions on Intelligent Systems and Technology, 8(5),
65:1–65:23.
Pang, Bo, and Lee, Lillian. 2008. Opinion mining and sentiment analysis. Foundations and
Trends in Information Retrieval, 2(1–2), 1–135.
Pang, Bo, Lee, Lillian, and Vaithyanathan, Shivakumar. 2002. Thumbs up? Sentiment clas-
sification using machine learning techniques. Pages 79–86 of: Proceedings of the 2002
Conference on Empirical Methods in Natural Language Processing.
Pappas, Nikolaos, and Popescu-Belis, Andrei. 2017. Multilingual hierarchical attention net-
works for document classification. Pages 1015–1025 of: Proceedings of the 8th Interna-
tional Joint Conference on Natural Language Processing.
Parameswaran, Shibin, and Weinberger, Kilian Q. 2010. Large margin multi-task metric
learning. Pages 1867–1875 of: Advances in Neural Information Processing Systems.
Parisotto, Emilio, Ba, Jimmy, and Salakhutdinov, Ruslan. 2016. Actor-mimic: Deep multi-
task and transfer reinforcement learning. Proceedings of the 4th International Confer-
ence on Learning Representations.
Patel, Vishal M., Gopalan, Raghuraman, Li, Ruonan, and Chellappa, Rama. 2015. Visual do-
main adaptation: A survey of recent advances. IEEE Signal Processing Magazine, 32(3),
53–69.
Patterson, Donald J., Fox, Dieter, Kautz, Henry A., and Philipose, Matthai. 2005. Fine-
grained activity recognition by aggregating abstract object usage. Pages 44–51 of: Pro-
ceedings of the Ninth IEEE International Symposium on Wearable Computers.
Pei, Zhongyi, Cao, Zhangjie, Long, Mingsheng, and Wang, Jianmin. 2018. Multi-adversarial
domain adaptation. Pages 3934–3941 of: Proceedings of the Thirty-Second AAAI Con-
ference on Artificial Intelligence.
Peng, Hao, Thomson, Sam, and Smith, Noah A. 2017. Deep multitask learning for semantic
dependency parsing. Pages 2037–2048 of: Proceedings of the 55th Annual Meeting of
the Association for Computational Linguistics.
Pennington, Jeffrey, Socher, Richard, and Manning, Christopher. 2014. Glove: Global vec-
tors for word representation. Pages 1532–1543 of: Proceedings of the 2014 Conference
on Empirical Methods in Natural Language Processing.
Pentina, Anastasia, and Ben-David, Shai. 2015. Multi-task and lifelong learning of kernels.
Pages 194–208 of: Proceedings of the 26th International Conference on Algorithmic
Learning Theory.
Pentina, Anastasia, and Lampert, Christoph H. 2014. A PAC-Bayesian bound for lifelong
learning. Pages 991–999 of: Proceedings of the 31st International Conference on Ma-
chine Learning.
Pentina, Anastasia, and Lampert, Christoph H. 2015. Lifelong learning with non-i.i.d. tasks.
Pages 1540–1548 of: Advances in Neural Information Processing Systems.
Perkins, Simon, Lacker, Kevin, and Theiler, James. 2003. Grafting: Fast, incremental feature
selection by gradient descent in function space. Journal of Machine Learning Research,
3, 1333–1356.
Perrot, Michaël, and Habrard, Amaury. 2015. A theoretical analysis of metric hypothesis
transfer learning. Pages 1708–1717 of: Proceedings of the 32nd International Confer-
ence on Machine Learning.
Phillips, Caitlin. 2006. Knowledge Transfer in Markov Decision Processes. Tech. report,
McGill University.
Pillonetto, Gianluigi, Dinuzzo, Francesco, and Nicolao, Giuseppe De. 2010. Bayesian online
multitask learning of Gaussian processes. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 32(2), 193–205.
Plis, Sergey M., Hjelm, Devon R., Salakhutdinov, Ruslan, and Calhoun, Vince D. 2013. Deep
learning for neuroimaging: A validation study. CoRR, abs/1312.5847.
Pong, Ting Kei, Tseng, Paul, Ji, Shuiwang, and Ye, Jieping. 2010. Trace norm regularization:
Reformulations, algorithms, and multi-task learning. SIAM Journal on Optimization,
20(6), 3465–3489.
Ponomareva, Natalia, and Thelwall, Mike. 2012. Biographies or blenders: Which resource
is best for cross-domain sentiment analysis? Pages 488–499 of: Proceedings of Interna-
tional Conference on Intelligent Text Processing and Computational Linguistics.
Pontil, Massimiliano, and Maurer, Andreas. 2013. Excess risk bounds for multitask learning
with trace norm regularization. Pages 55–76 of: Proceedings of the 26th Annual Confer-
ence on Learning Theory.
Pugh, K. J., and Bergin, D. A. 2006. Motivational influences on transfer. Educational Psy-
chologist, 41, 147–160.
Puniyani, Kriti, Kim, Seyoung, and Xing, Eric P. 2010. Multi-population GWA mapping via
multi-task regularized regression. Bioinformatics, 26, i208–i216.
Qi, Guo-Jun, Aggarwal, Charu, and Huang, Thomas. 2011a. Towards semantic knowledge
propagation from text corpus to web images. Pages 297–306 of: Proceedings of the 20th
International Conference on World Wide Web.
Qi, Guo-Jun, Aggarwal, Charu, Rui, Yong, et al. 2011b. Towards cross-category knowledge
propagation for learning visual concepts. Pages 897–904 of: Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition.
Qi, Yanjun, Tastan, Oznur, Carbonell, Jaime G., and Klein-Seetharaman, Judith. 2010. Semi-
supervised multi-task learning for predicting interactions between HIV-1 and human
proteins. Bioinformatics, 26, i645–i652.
Qiu, Qiang, Patel, Vishal M., Turaga, Pavan, and Chellappa, Rama. 2012. Domain adaptive
dictionary learning. Pages 631–645 of: Proceedings of European Conference on Com-
puter Vision.
Quiñonero-Candela, Joaquin, Sugiyama, Masashi, Schwaighofer, Anton, and Lawrence,
Neil D. 2009. Dataset Shift in Machine Learning. MIT Press.
Radford, Alec, Metz, Luke, and Chintala, Soumith. 2015. Unsupervised representa-
tion learning with deep convolutional generative adversarial networks. CoRR,
abs/1511.06434.
Raina, Rajat, Battle, Alexis, Lee, Honglak, Packer, Benjamin, and Ng, Andrew Y. 2007. Self-
taught learning: Transfer learning from unlabeled data. Pages 759–766 of: Proceedings
of the 24th International Conference on Machine Learning.
Raj, Anant, Namboodiri, Vinay P., and Tuytelaars, Tinne. 2015. Subspace alignment based
domain adaptation for RCNN detector. Pages 166.1–166.11 of: Proceedings of the
British Machine Vision Conference.
Rajpurkar, Pranav, Zhang, Jian, Lopyrev, Konstantin, and Liang, Percy. 2016. SQuAD:
100,000+ questions for machine comprehension of text. Pages 2383–2392 of: Proceed-
ings of the 2016 Conference on Empirical Methods in Natural Language Processing.
Rajpurkar, Pranav, Irvin, Jeremy, Zhu, Kaylie, et al. 2017. CheXNet: Radiologist-level pneu-
monia detection on chest X-rays with deep learning. CoRR, abs/1711.05225.
Ranzato, Marc’Aurelio, Chopra, Sumit, Auli, Michael, and Zaremba, Wojciech. 2015. Se-
quence level training with recurrent neural networks. CoRR, abs/1511.06732.
Recanzone, Gregg H. 2009. Interactions of auditory and visual stimuli in space and time.
Hearing Research, 258(1), 89–99.
Reichart, Roi, Tomanek, Katrin, Hahn, Udo, and Rappoport, Ari. 2008. Multi-task active
learning for linguistic annotations. Pages 861–869 of: Proceedings of the 46th Annual
Meeting of the Association for Computational Linguistics.
Reiss, Attila, and Stricker, Didier. 2012. Introducing a new benchmarked dataset for activ-
ity monitoring. Pages 108–109 of: Proceedings of the 16th International Symposium on
Wearable Computers.
362 References
Ren, Hang, Xu, Weiqun, and Yan, Yonghong. 2014. Markovian discriminative modeling for
cross-domain dialog state tracking. Pages 342–347 of: Proceedings of IEEE Spoken Lan-
guage Technology Workshop.
Resnick, Paul, and Varian, Hal R. 1997. Recommender systems. Communications of the
ACM, 40(3), 56–58.
Resnick, Paul, Iacovou, Neophytos, Suchak, Mitesh, Bergstrom, Peter, and Riedl, John. 1994.
GroupLens: An open architecture for collaborative filtering of netnews. Pages 175–186
of: Proceedings of the 1994 ACM Conference on Computer Supported Cooperative Work.
Revow, Michael, Williams, Christopher K. I., and Hinton, Geoffrey E. 1996. Using generative
models for handwritten digit recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 18(6), 592–606.
Richards, Bradley L., and Mooney, Raymond J. 1992. Learning relations by pathfinding.
Pages 50–55 of: Proceedings of the 10th National Conference on Artificial Intelligence.
Richardson, Matthew, and Domingos, Pedro. 2006. Markov logic networks. Machine Learn-
ing, 62(1–2), 107–136.
Rohrbach, Marcus, Stark, Michael, Szarvas, György, Gurevych, Iryna, and Schiele, Bernt.
2010. What helps where – and why? Semantic relatedness for knowledge transfer.
Pages 910–917 of: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition.
Ruder, Sebastian, Bingel, Joachim, Augenstein, Isabelle, and Søgaard, Anders. 2017.
Sluice networks: Learning what to share between loosely related tasks. CoRR,
abs/1705.08142.
Rusu, Andrei A., Colmenarejo, Sergio Gomez, Gülçehre, Çağlar, et al. 2015. Policy distilla-
tion. CoRR, abs/1511.06295.
Ruvolo, Paul, and Eaton, Eric. 2013. ELLA: An efficient lifelong learning algorithm. Pages
507–515 of: Proceedings of the 30th International Conference on Machine Learning.
Saenko, Kate, Kulis, Brian, Fritz, Mario, and Darrell, Trevor. 2010. Adapting visual category
models to new domains. Pages 213–226 of: Proceedings of European Conference on
Computer Vision.
Saha, Avishek, Rai, Piyush, Daumé III, Hal, and Venkatasubramanian, Suresh. 2011. Online
learning of multiple tasks and their relationships. Pages 643–651 of: Proceedings of the
Fourteenth International Conference on Artificial Intelligence and Statistics.
Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, et al. 2016. Improved techniques
for training GANs. Pages 2234–2242 of: Advances in Neural Information Processing
Systems.
Samala, Ravi K., Chan, Heang-Ping, Hadjiiski, Lubomir, et al. 2016. Mass detection in digital
breast tomosynthesis: Deep convolutional neural network with transfer learning from
mammography. Medical Physics, 43(12), 6654–6666.
Schank, Roger C. 1983. Dynamic Memory – A Theory of Reminding and Learning in
Computers and People. Cambridge University Press.
Schölkopf, Bernhard, and Smola, Alexander J. 2001. Learning with Kernels: Support Vector
Machines, Regularization, Optimization, and Beyond. MIT Press.
Schunk, D. 1965. Learning Theories: An Educational Perspective. Pearson.
Schwaighofer, Anton, Tresp, Volker, and Yu, Kai. 2005. Learning Gaussian process kernels
via hierarchical Bayes. Pages 1209–1216 of: Advances in Neural Information Processing
Systems.
Schweikert, Gabriele Beate, Widmer, Christian, Schölkopf, Bernhard, and Rätsch, Gunnar.
2008. An empirical analysis of domain adaptation algorithms for genomic sequence
analysis. Pages 1433–1440 of: Advances in Neural Information Processing Systems.
Seo, Minjoon, Kembhavi, Aniruddha, Farhadi, Ali, and Hajishirzi, Hannaneh. 2016. Bidirec-
tional attention flow for machine comprehension. CoRR, abs/1611.01603.
Serban, Iulian V., Sordoni, Alessandro, Bengio, Yoshua, Courville, Aaron, and Pineau, Joelle.
2015. Building end-to-end dialogue systems using generative hierarchical neural net-
work models. CoRR, abs/1507.04808.
Serban, Iulian V., Sordoni, Alessandro, Bengio, Yoshua, Courville, Aaron, and Pineau, Joelle.
2016. Building end-to-end dialogue systems using generative hierarchical neural net-
work models. Pages 3776–3784 of: Proceedings of the 30th AAAI Conference on Artificial
Intelligence.
Serban, Iulian Vlad, Sordoni, Alessandro, Lowe, Ryan, et al. 2017. A hierarchical latent vari-
able encoder-decoder model for generating dialogues. Pages 3295–3301 of: Proceed-
ings of the Thirty-First AAAI Conference on Artificial Intelligence.
Sermanet, Pierre, Eigen, David, Zhang, Xiang, et al. 2013. Overfeat: Integrated recognition,
localization and detection using convolutional networks. CoRR, abs/1312.6229.
Sevakula, R. K., Singh, V., Verma, N. K., Kumar, C., and Cui, Y. 2018. Transfer learning for
molecular cancer classification using deep neural networks. IEEE/ACM Transactions
on Computational Biology and Bioinformatics.
Shang, Lifeng, Lu, Zhengdong, and Li, Hang. 2015. Neural responding machine for short-
text conversation. Pages 1577–1586 of: Proceedings of the 53rd Annual Meeting of the
Association for Computational Linguistics.
Shekhar, Shashi, Patel, Vishal M., Nguyen, Hien, and Chellappa, Rama. 2013. Generalized
domain-adaptive dictionaries. Pages 361–368 of: Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition.
Shen, Dou, Pan, Rong, Sun, Jian-Tao, et al. 2005. Q2C@UST: Our winning solution to query
classification in KDDCUP 2005. SIGKDD Explorations, 7(2), 100–110.
Shen, Dou, Pan, Rong, Sun, Jian-Tao, et al. 2006a. Query enrichment for web-query classi-
fication. ACM Transactions on Information Systems, 24(3), 320–352.
Shen, Dou, Sun, Jian-Tao, Yang, Qiang, and Chen, Zheng. 2006b. Building bridges for web
query classification. Pages 131–138 of: Proceedings of the 29th Annual International
ACM SIGIR Conference on Research and Development in Information Retrieval.
Sherstov, Alexander A., and Stone, Peter. 2005. Improving action selection in MDP’s via
knowledge transfer. Pages 1024–1029 of: Proceedings of the Twentieth National Con-
ference on Artificial Intelligence.
Shi, Xiaoxiao, Liu, Qi, Fan, Wei, Yang, Qiang, and Yu, Philip S. 2010a. Predictive modeling
with heterogeneous sources. Pages 814–825 of: Proceedings of the SIAM International
Conference on Data Mining.
Shi, Xiaoxiao, Liu, Qi, Fan, Wei, Yu, Philip S., and Zhu, Ruixin. 2010b. Transfer learning on
heterogenous feature spaces via spectral transformation. Pages 1049–1054 of: Proceed-
ings of the IEEE International Conference on Data Mining.
Shi, Xiaoxiao, Paiement, Jean-François, Grangier, David, and Yu, Philip S. 2012. Learning
from heterogeneous sources via gradient boosting consensus. Pages 224–235 of: Pro-
ceedings of the SIAM International Conference on Data Mining.
Shi, Xiaoxiao, Liu, Qi, Fan, Wei, and Yu, Philip S. 2013a. Transfer across completely differ-
ent feature spaces via spectral embedding. IEEE Transactions on Knowledge and Data
Engineering, 25(4), 906–918.
Shi, Yangyang, Larson, Martha, and Jonker, Catholijn M. 2015. Recurrent neural network
language model adaptation with curriculum learning. Computer Speech and Lan-
guage, 33(1), 136–154.
Shi, Yuan, and Sha, Fei. 2012. Information-theoretical learning of discriminative clusters
for unsupervised domain adaptation. Pages 1275–1282 of: Proceedings of the 29th In-
ternational Conference on Machine Learning.
Shi, Yue, Larson, Martha, and Hanjalic, Alan. 2013b. Mining contextual movie similarity
with matrix factorization for context-aware recommendation. ACM Transactions on
Intelligent Systems and Technology, 4(1), 16:1–16:19.
Shin, Hoo-Chang, Roth, Holger R., Gao, Mingchen, et al. 2016. Deep convolutional neu-
ral networks for computer-aided detection: CNN architectures, dataset characteristics
and transfer learning. IEEE Transactions on Medical Imaging, 35(5), 1285–1298.
Shokri, Reza, and Shmatikov, Vitaly. 2015. Privacy-preserving deep learning. Pages
1310–1321 of: Proceedings of ACM Conference on Computer and Communications
Security.
Shrivastava, Ashish, Pfister, Tomas, Tuzel, Oncel, et al. 2017. Learning from simulated and
unsupervised images through adversarial training. Pages 2242–2251 of: Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition.
Shu, Xiangbo, Qi, Guo-Jun, Tang, Jinhui, and Wang, Jingdong. 2015. Weakly-shared deep
transfer networks for heterogeneous-domain knowledge propagation. Pages 35–44 of:
Proceedings of the 23rd ACM International Conference on Multimedia.
Si, Si, Tao, Dacheng, and Geng, Bo. 2010. Bregman divergence-based regularization for
transfer subspace learning. IEEE Transactions on Knowledge and Data Engineering,
22(7), 929–942.
Silver, Daniel L., and Mercer, Robert E. 1996. The parallel transfer of task knowledge using
dynamic learning rates based on a measure of relatedness. Connection Science Special
Issue: Transfer in Inductive Systems, 8(2), 277–294.
Silver, Daniel L., Yang, Qiang, and Li, Lianghao. 2013. Lifelong machine learning systems:
Beyond learning algorithms. Proceedings of the 2013 AAAI Spring Symposium on Life-
long Machine Learning, AAAI Technical Report, vol. SS-13-05.
Silver, David, Huang, Aja, Maddison, Chris J., et al. 2016. Mastering the game of Go with
deep neural networks and tree search. Nature, 529(7587), 484–489.
Singh, Ajit P., and Gordon, Geoffrey J. 2008. Relational learning via collective matrix factor-
ization. Pages 650–658 of: Proceedings of the 14th ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining.
Singh, Satinder P., Kearns, Michael J., Litman, Diane J., and Walker, Marilyn A. 1999. Rein-
forcement learning for spoken dialogue systems. Pages 956–962 of: Advances in Neural
Information Processing Systems.
Smola, Alexander J., and Schölkopf, Bernhard. 2004. A tutorial on support vector regression.
Statistics and Computing, 14(3), 199–222.
Smola, Alex, Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. 2007a. A Hilbert space
embedding for distributions. Pages 13–31 of: Proceedings of International Conference
on Algorithmic Learning Theory.
Smola, Alexander J., Gretton, Arthur, Song, Le, and Schölkopf, Bernhard. 2007b. A Hilbert
space embedding for distributions. Pages 40–41 of: Proceedings of the 10th Interna-
tional Conference on Discovery Science.
Snel, Matthijs, and Whiteson, Shimon. 2014. Learning potential functions and their repre-
sentations for multi-task reinforcement learning. Autonomous Agents and Multi-Agent
Systems, 28(4), 637–681.
Socher, Richard, Ganjoo, Milind, Manning, Christopher D., and Ng, Andrew Y. 2013a. Zero-
shot learning through cross-modal transfer. Pages 935–943 of: Advances in Neural
Information Processing Systems.
Socher, Richard, Perelygin, Alex, Wu, Jean Y., et al. 2013b. Recursive deep models for se-
mantic compositionality over a sentiment treebank. Pages 1631–1642 of: Proceedings
of the 2013 Conference on Empirical Methods in Natural Language Processing.
Søgaard, Anders, and Goldberg, Yoav. 2016. Deep multi-task learning with low level tasks
supervised at lower layers. Pages 231–235 of: Proceedings of the 54th Annual Meeting
of the Association for Computational Linguistics.
Solnon, Matthieu, Arlot, Sylvain, and Bach, Francis R. 2012. Multi-task regression using
minimal penalties. Journal of Machine Learning Research, 13, 2773–2812.
Song, Jinhua, Gao, Yang, Wang, Hao, and An, Bo. 2016. Measuring the distance between
finite Markov decision processes. Pages 468–476 of: Proceedings of the 2016 Interna-
tional Conference on Autonomous Agents & Multiagent Systems.
Sordoni, Alessandro, Bengio, Yoshua, Vahabi, Hossein, et al. 2015. A hierarchical recurrent
encoder-decoder for generative context-aware query suggestion. Pages 553–562 of:
Proceedings of the 24th ACM International on Conference on Information and Knowl-
edge Management.
Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov,
Ruslan. 2014. Dropout: A simple way to prevent neural networks from overfitting. Jour-
nal of Machine Learning Research, 15(1), 1929–1958.
Sugiyama, Masashi, Nakajima, Shinichi, Kashima, Hisashi, von Bünau, Paul, and Kawan-
abe, Motoaki. 2008. Direct importance estimation with model selection and its appli-
cation to covariate shift adaptation. Pages 1433–1440 of: Advances in Neural Informa-
tion Processing Systems.
Suk, Heung-Il, and Shen, Dinggang. 2013. Deep learning-based feature representation for
AD/MCI classification. Pages 583–590 of: Proceedings of the 16th International Confer-
ence on Medical Image Computing and Computer-Assisted Intervention.
Suk, Heung-Il, Lee, Seong-Whan, and Shen, Dinggang. 2014. Hierarchical feature represen-
tation and multimodal fusion with deep learning for AD/MCI diagnosis. NeuroImage,
101, 569–582.
Sun, Kai, Xie, Qizhe, and Yu, Kai. 2016. Recurrent polynomial network for dialogue state
tracking. Dialogue and Discourse, 7(3), 65–88.
Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. 2014. Sequence to sequence learning with
neural networks. Pages 3104–3112 of: Advances in Neural Information Processing
Systems.
Sutton, Richard S., and Barto, Andrew G. 1998. Reinforcement Learning – An Introduction.
MIT Press.
Sutton, Richard S., Precup, Doina, and Singh, Satinder. 1999. Between MDPs and semi-
MDPs: A framework for temporal abstraction in reinforcement learning. Artificial In-
telligence, 112(1–2), 181–211.
Sweeney, Latanya. 2002. k-Anonymity: A model for protecting privacy. International Jour-
nal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570.
Tai, Lei, Paolo, Giuseppe, and Liu, Ming. 2017. Virtual-to-real deep reinforcement learning:
Continuous control of mobile robots for mapless navigation. Pages 31–36 of: Proceed-
ings of 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems.
Tamada, Yoshinori, Bannai, Hideo, Kanehisa, Minoru, and Miyano, Satoru. 2005. Utilizing
evolutionary information and gene expression data for estimating gene networks with
Bayesian network models. Journal of Bioinformatics and Computational Biology, 3(6),
1295–1313.
Tan, Ben, Zhong, Erheng, Ng, Michael K., and Yang, Qiang. 2014. Mixed-transfer: Transfer
learning over mixed graphs. Pages 208–216 of: Proceedings of the SIAM International
Conference on Data Mining.
Tan, Ben, Song, Yangqiu, Zhong, Erheng, and Yang, Qiang. 2015. Transitive transfer learn-
ing. Pages 1155–1164 of: Proceedings of the 21st ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining.
Tan, Ben, Zhang, Yu, Pan, Sinno Jialin, and Yang, Qiang. 2017. Distant domain transfer
learning. Pages 2604–2610 of: Proceedings of the Thirty-First AAAI Conference on Ar-
tificial Intelligence.
Tang, Duyu, Qin, Bing, Feng, Xiaocheng, and Liu, Ting. 2015. Target-dependent sentiment
classification with long short term memory. CoRR, abs/1512.01100.
Taylor, Matthew E., and Stone, Peter. 2005. Behavior transfer for value-function-based re-
inforcement learning. Pages 53–59 of: Proceedings of the 4th International Joint Con-
ference on Autonomous Agents and Multiagent Systems.
Taylor, Matthew E., and Stone, Peter. 2007. Cross-domain transfer for reinforcement learn-
ing. Pages 879–886 of: Proceedings of the Twenty-Fourth International Conference on
Machine Learning.
Taylor, Matthew E., and Stone, Peter. 2009. Transfer learning for reinforcement learning
domains: A survey. Journal of Machine Learning Research, 10, 1633–1685.
Taylor, Matthew E., Stone, Peter, and Liu, Yaxin. 2005. Value functions for RL-based behav-
ior transfer: A comparative study. Pages 880–885 of: Proceedings of the Twentieth Na-
tional Conference on Artificial Intelligence and the Seventeenth Innovative Applications
of Artificial Intelligence Conference.
Taylor, Matthew E., Whiteson, Shimon, and Stone, Peter. 2007. Transfer via inter-task map-
pings in policy search reinforcement learning. Proceedings of the 6th International
Joint Conference on Autonomous Agents and Multiagent Systems.
Taylor, Matthew E., Jong, Nicholas K., and Stone, Peter. 2008a. Transferring instances for
model-based reinforcement learning. Pages 488–505 of: Proceedings of European Con-
ference on Machine Learning and Practice of Knowledge Discovery in Databases.
Taylor, Matthew E., Kuhlmann, Gregory, and Stone, Peter. 2008b. Autonomous transfer for
reinforcement learning. Pages 283–290 of: Proceedings of the 7th International Joint
Conference on Autonomous Agents and Multiagent Systems.
Tewari, Ambuj, Ravikumar, Pradeep K., and Dhillon, Inderjit S. 2011. Greedy algorithms for
structurally constrained high dimensional problems. Pages 882–890 of: Advances in
Neural Information Processing Systems.
Thomson, Blaise, and Young, Steve. 2010. Bayesian update of dialogue state: A POMDP
framework for spoken dialogue systems. Computer Speech and Language, 24(4),
562–588.
Thorndike, Edward L., and Woodworth, Robert S. 1901. The influence of improvement in
one mental function upon the efficiency of other functions. II. The estimation of mag-
nitudes. Psychological Review, 8(4), 384–395.
Thrun, Sebastian. 1995. Explanation-Based Neural Network Learning: A Lifelong Learning
Approach. Ph.D. thesis, University of Bonn.
Thrun, Sebastian, and O’Sullivan, Joseph. 1996. Discovering structure in multiple learning
tasks: The TC algorithm. Pages 489–497 of: Proceedings of the 13th International Con-
ference on Machine Learning.
Tibshirani, Robert. 1996. Regression shrinkage and selection via the lasso. Journal of the
Royal Statistical Society. Series B (Methodological), 58(1), 267–288.
Toffler, Alvin. 1970. Future Shock. Random House.
Tommasi, Tatiana, Orabona, Francesco, and Caputo, Barbara. 2010. Safety in num-
bers: Learning categories from few examples with multi model knowledge transfer.
Pages 3081–3088 of: Proceedings of IEEE Conference on Computer Vision and Pattern
Recognition.
Tommasi, Tatiana, Orabona, Francesco, and Caputo, Barbara. 2014. Learning categories
from few examples with multi model knowledge transfer. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 36(5), 928–941.
Tompson, Jonathan, Stein, Murphy, Lecun, Yann, and Perlin, Ken. 2014. Real-time contin-
uous pose recovery of human hands using convolutional networks. ACM Transactions
on Graphics, 33(5), 169:1–169:10.
Topin, Nicholay, Haltmeyer, Nicholas, Squire, Shawn, et al. 2015. Portable option discov-
ery for automated learning transfer in object-oriented Markov decision processes.
Pages 3856–3864 of: Proceedings of the Twenty-Fourth International Joint Conference
on Artificial Intelligence.
Toshniwal, Shubham, Tang, Hao, Lu, Liang, and Livescu, Karen. 2017. Multitask learning
with low-level auxiliary tasks for encoder-decoder based speech recognition. Pro-
ceedings of the 18th Annual Conference of the International Speech Communication
Association.
Tsuboi, Yuta, Kashima, Hisashi, Hido, Shohei, Bickel, Steffen, and Sugiyama, Masashi. 2009.
Direct density ratio estimation for large-scale covariate shift adaptation. Journal of In-
formation Processing, 17, 138–155.
Tür, Gökhan. 2005. Model adaptation for spoken language understanding. Pages 41–44
of: Proceedings of IEEE International Conference on Acoustics, Speech, and Signal
Processing.
Tür, Gökhan. 2006. Multitask learning for spoken language understanding. Pages 585–
588 of: Proceedings of IEEE International Conference on Acoustics Speech and Signal
Processing.
Tzeng, Eric, Hoffman, Judy, Zhang, Ning, Saenko, Kate, and Darrell, Trevor. 2014. Deep do-
main confusion: Maximizing for domain invariance. CoRR, abs/1412.3474.
Tzeng, Eric, Hoffman, Judy, Darrell, Trevor, and Saenko, Kate. 2015. Simultaneous deep
transfer across domains and tasks. Pages 4068–4076 of: Proceedings of IEEE Interna-
tional Conference on Computer Vision.
Tzeng, Eric, Hoffman, Judy, Saenko, Kate, and Darrell, Trevor. 2017. Adversarial discrimina-
tive domain adaptation. Pages 2962–2971 of: Proceedings of IEEE Conference on Com-
puter Vision and Pattern Recognition.
Vail, Douglas L., Veloso, Manuela M., and Lafferty, John D. 2007. Conditional random fields
for activity recognition. In: Proceedings of the Sixth International Joint Conference on
Autonomous Agents and Multiagent Systems.
van Haaren, Jan, Kolobov, Andrey, and Davis, Jesse. 2015. TODTLER: Two-order-deep trans-
fer learning. Pages 3007–3015 of: Proceedings of the Twenty-Ninth AAAI Conference on
Artificial Intelligence.
van Kasteren, Tim, Noulas, Athanasios K., Englebienne, Gwenn, and Kröse, Ben J. A. 2008.
Accurate activity recognition in a home setting. Pages 1–9 of: Proceedings of the 10th
International Conference on Ubiquitous Computing.
Vapnik, Vladimir. 1995. The Nature of Statistical Learning Theory. Springer.
Vapnik, Vladimir N. 1998. Statistical Learning Theory. Wiley-Interscience.
Venugopalan, Subhashini, Rohrbach, Marcus, Donahue, Jeffrey, et al. 2015a. Sequence to
sequence – Video to text. Pages 4534–4542 of: Proceedings of the IEEE International
Conference on Computer Vision.
Venugopalan, Subhashini, Xu, Huijuan, Donahue, Jeff, et al. 2015b. Translating videos to
natural language using deep recurrent neural networks. Pages 1494–1504 of: Proceed-
ings of the 2015 Conference of the North American Chapter of the Association for Com-
putational Linguistics: Human Language Technologies.
Wang, Jialei, Kolar, Mladen, and Srebro, Nathan. 2016a. Distributed multi-task learning.
Pages 751–760 of: Proceedings of the 19th International Conference on Artificial Intelli-
gence and Statistics.
Wang, Jindong, Chen, Yiqiang, Hao, Shuji, Peng, Xiaohui, and Hu, Lisha. 2017a. Deep learn-
ing for sensor-based activity recognition: A survey. CoRR, abs/1707.03502.
Wang, Jindong, Chen, Yiqiang, Hu, Lisha, Peng, Xiaohui, and Yu, Philip S. 2018b. Stratified
transfer learning for cross-domain activity recognition. CoRR, abs/1801.00820.
Wang, Sheng, Li, Zhen, Yu, Yizhou, and Xu, Jinbo. 2017b. Folding membrane proteins by
deep transfer learning. CoRR, abs/1708.08407.
Wang, Shenlong, Zhang, Lei, Liang, Yan, and Pan, Quan. 2012. Semi-coupled dictionary
learning with applications to image super-resolution and photo-sketch synthesis.
Pages 2216–2223 of: Proceedings of the IEEE Conference on Computer Vision and Pat-
tern Recognition.
Wang, Shuai, Chen, Zhiyuan, and Liu, Bing. 2016b. Mining aspect-specific opinion using a
holistic lifelong topic model. Pages 167–176 of: Proceedings of the 25th International
Conference on World Wide Web.
Wang, Shuohang, Yu, Mo, Guo, Xiaoxiao, et al. 2018c. R3: Reinforced ranker-reader for
open-domain question answering. Pages 5981–5988 of: Proceedings of the Thirty-
Second AAAI Conference on Artificial Intelligence.
Wang, Sida, and Manning, Christopher D. 2012. Baselines and bigrams: Simple, good senti-
ment and topic classification. Pages 90–94 of: Proceedings of the 50th Annual Meeting
of the Association for Computational Linguistics.
Wang, Wenhui, Yang, Nan, Wei, Furu, Chang, Baobao, and Zhou, Ming. 2017c. Gated self-
matching networks for reading comprehension and question answering. Pages 189–
198 of: Proceedings of the 55th Annual Meeting of the Association for Computational
Linguistics.
Wang, Xin, Bi, Jinbo, Yu, Shipeng, and Sun, Jiangwen. 2014. On multiplicative multitask fea-
ture learning. Pages 2411–2419 of: Advances in Neural Information Processing Systems.
Wang, Xuezhi, and Schneider, Jeff G. 2015. Generalization bounds for transfer learning un-
der model shift. Pages 922–931 of: Proceedings of the Thirty-First Conference on Uncer-
tainty in Artificial Intelligence.
Wang, Yang, Gu, Quanquan, and Brown, Donald E. 2018d. Differentially private hypothesis
transfer learning. Pages 811–826 of: Proceedings of European Conference on Machine
Learning and Knowledge Discovery in Databases.
Wang, Zhuoran, and Lemon, Oliver. 2013. A simple and generic belief tracking mechanism
for the dialog state tracking challenge: On the believability of observed information.
Pages 423–432 of: Proceedings of the 14th Annual Meeting of the Special Interest Group
on Discourse and Dialogue.
Wei, Ying, Zhu, Yin, Leung, Cane Wing-ki, Song, Yangqiu, and Yang, Qiang. 2016a. Instilling
social to physical: Co-regularized heterogeneous transfer learning. Pages 1338–1344
of: Proceedings of the 30th AAAI Conference on Artificial Intelligence.
Wei, Ying, Zheng, Yu, and Yang, Qiang. 2016b. Transfer knowledge between cities. Pages
1905–1914 of: Proceedings of the 22nd ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Wei, Ying, Zhang, Yu, Huang, Junzhou, and Yang, Qiang. 2018. Transfer learning via learning
to transfer. Pages 5072–5081 of: Proceedings of the 35th International Conference on
Machine Learning.
Weinberger, Kilian Q., Sha, Fei, and Saul, Lawrence K. 2004. Learning a kernel matrix
for nonlinear dimensionality reduction. Proceedings of the Twenty-First International
Conference on Machine Learning.
Wen, Tsung-Hsien, Heidel, Aaron, Lee, Hung-yi, Tsao, Yu, and Lee, Lin-Shan. 2013. Recur-
rent neural network based language model personalization by social network crowd-
sourcing. Pages 2703–2707 of: Proceedings of the 14th Annual Conference of the Inter-
national Speech Communication Association.
Wen, Tsung-Hsien, Gašić, Milica, Mrkšić, Nikola, et al. 2015a. Semantically condi-
tioned LSTM-based natural language generation for spoken dialogue systems.
Pages 1711–1721 of: Proceedings of the 2015 Conference on Empirical Methods in Nat-
ural Language Processing.
Wen, Tsung-Hsien, Gašić, Milica, Mrkšić, Nikola, et al. 2015b. Toward multi-domain lan-
guage generation using recurrent neural networks. NIPS Workshop on ML for SLU and
Interaction.
Wen, Tsung-Hsien, Gašić, Milica, Mrkšić, Nikola, et al. 2016. Multi-domain neural network
language generation for spoken dialogue systems. Pages 120–129 of: Proceedings of the
2016 Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies.
Widmer, Christian, Leiva, Jose, Altun, Yasemin, and Rätsch, Gunnar. 2010a. Leveraging se-
quence classification by taxonomy-based multitask learning. Pages 522–534 of: Pro-
ceedings of the 14th Annual International Conference on Research in Computational
Molecular Biology.
Widmer, Christian, Toussaint, Nora C., Altun, Yasemin, Kohlbacher, Oliver, and Rätsch,
Gunnar. 2010b. Novel machine learning methods for MHC Class I binding predic-
tion. Pages 98–109 of: Proceedings of the 5th IAPR International Conference on Pattern
Recognition in Bioinformatics.
Widmer, Christian, Toussaint, Nora C., Altun, Yasemin, and Rätsch, Gunnar. 2010c. In-
ferring latent task structure for multitask learning by multiple kernel learning. BMC
Bioinformatics, 11(Suppl. 8), S5.
Williams, Jason. 2013. Multi-domain learning and generalization in dialog state tracking.
Pages 433–441 of: Proceedings of the 14th Annual Meeting of the Special Interest Group
on Discourse and Dialogue.
Williams, Jason D. 2008a. The best of both worlds: Unifying conventional dialog systems
and POMDPs. Pages 1173–1176 of: Proceedings of the 9th Annual Conference of the In-
ternational Speech Communication Association.
Williams, Jason D. 2008b. Integrating expert knowledge into POMDP optimization for spo-
ken dialog systems. Proceedings of the AAAI Workshop on Advancements in POMDP
Solvers.
Wilson, Aaron, Fern, Alan, Ray, Soumya, and Tadepalli, Prasad. 2007. Multi-task reinforce-
ment learning: A hierarchical Bayesian approach. Pages 1015–1022 of: Proceedings of
the Twenty-Fourth International Conference on Machine Learning.
Winston, Patrick H. 1980. Learning and reasoning by analogy. Communications of the ACM,
23(12), 689–703.
Wong, Catherine, Houlsby, Neil, Lu, Yifeng, and Gesmundo, Andrea. 2018. Transfer learning
with neural AutoML. Pages 8366–8375 of: Advances in Neural Information Processing
Systems 31.
Wood, Erroll, Baltrušaitis, Tadas, Morency, Louis-Philippe, Robinson, Peter, and Bulling,
Andreas. 2016. Learning an appearance-based gaze estimator from one million syn-
thesised images. Pages 131–138 of: Proceedings of the Ninth Biennial ACM Symposium
on Eye Tracking Research and Applications.
Wu, Pengcheng, and Dietterich, Thomas G. 2004. Improving SVM accuracy by training on
auxiliary data sources. Pages 111–117 of: Proceedings of the 21st International Confer-
ence on Machine Learning.
Wu, Shuangzhi, Zhang, Dongdong, Yang, Nan, Li, Mu, and Zhou, Ming. 2017. Sequence-
to-dependency neural machine translation. Pages 698–707 of: Proceedings of the 55th
Annual Meeting of the Association for Computational Linguistics.
Wu, Xinxiao, Wang, Han, Liu, Cuiwei, and Jia, Yunde. 2013. Cross-view action recognition
over heterogeneous feature spaces. Pages 609–616 of: Proceedings of the IEEE Interna-
tional Conference on Computer Vision.
Xie, Liyang, Baytas, Inci M., Lin, Kaixiang, and Zhou, Jiayu. 2017. Privacy-preserving dis-
tributed multi-task learning with asynchronous updates. Pages 1195–1204 of: Proceed-
ings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining.
Xie, Michael, Jean, Neal, Burke, Marshall, Lobell, David, and Ermon, Stefano. 2016.
Transfer learning from deep features for remote sensing and poverty mapping.
Pages 3929–3935 of: Proceedings of the Thirtieth AAAI Conference on Artificial
Intelligence.
Xing, Eric P., Jordan, Michael I., and Karp, Richard M. 2001. Feature selection for high-
dimensional genomic microarray data. Pages 601–608 of: Proceedings of the 8th In-
ternational Conference on Machine Learning.
Xu, Jiaolong, Ramos, Sebastian, Vázquez, David, and López, Antonio M. 2014a. Domain
adaptation of deformable part-based models. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 36(12), 2367–2380.
Xu, Kelvin, Ba, Jimmy, Kiros, Ryan, et al. 2015. Show, attend and tell: Neural image caption
generation with visual attention. Pages 2048–2057 of: Proceedings of the 32nd Interna-
tional Conference on Machine Learning.
Xu, Qian, and Yang, Qiang. 2011. A survey of transfer and multitask learning in bioinfor-
matics. Journal of Computing Science and Engineering, 5(3), 257–268.
Xu, Qian, Xiang, Evan Wei, and Yang, Qiang. 2010. Protein–protein interaction prediction
via collective matrix factorization. Pages 62–67 of: Proceedings of IEEE International
Conference on Bioinformatics and Biomedicine.
Xu, Qian, Pan, Sinno Jialin, Xue, Hannah Hong, and Yang, Qiang. 2011. Multitask learning
for protein subcellular location prediction. IEEE/ACM Transactions on Computational
Biology and Bioinformatics, 8(3), 748–759.
Xu, Yonghui, Pan, Sinno Jialin, Xiong, Hui, Wu, Qingyao, et al. 2017. A unified framework
ric transfer learning. IEEE Transactions on Knowledge and Data Engineering, 29(6),
1158–1171.
Xu, Zheng, Li, Wen, Niu, Li, and Xu, Dong. 2014b. Exploiting low-rank structure from latent
domains for domain generalization. Pages 628–643 of: Proceedings of the 13th Euro-
pean Conference on Computer Vision.
Xu, Zhixiang, Huang, Gao, Weinberger, Kilian Q., and Zheng, Alice X. 2014c. Gradient
boosted feature selection. Pages 522–531 of: Proceedings of the 20th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining.
Xue, Ya, Liao, Xuejun, Carin, Lawrence, and Krishnapuram, Balaji. 2007. Multi-task learning
for classification with Dirichlet process priors. Journal of Machine Learning Research,
8, 35–63.
Yamada, Makoto, Jitkrittum, Wittawat, Sigal, Leonid, Xing, Eric P., and Sugiyama, Masashi.
2014. High-dimensional feature selection by feature-wise kernelized lasso. Neural
Computation, 26(1), 185–207.
Yang, Bishan, and Mitchell, Tom. 2017. A joint sequential and relational model for frame-
semantic parsing. Pages 1247–1256 of: Proceedings of the 2017 Conference on Empirical
Methods in Natural Language Processing.
372 References
Yang, Can, He, Zengyou, Wan, Xiang, et al. 2008. SNPHarvester: A filtering-based approach
for detecting epistatic interactions in genome-wide association studies. Bioinformat-
ics, 25(4), 504–511.
Yang, Jian, Zhang, David, Yang, Jing-Yu, and Niu, Ben. 2007a. Globally maximizing, locally
minimizing: Unsupervised discriminant projection with applications to face and palm
biometrics. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(4),
650–664.
Yang, Jianchao, Wright, John, Huang, Thomas S., and Ma, Yi. 2010. Image super-resolution
via sparse representation. IEEE Transactions on Image Processing, 19(11), 2861–2873.
Yang, Jun, Yan, Rong, and Hauptmann, Alexander G. 2007b. Adapting SVM classifiers to
data with shifted distributions. Pages 69–76 of: Workshops Proceedings of the 7th IEEE
International Conference on Data Mining.
Yang, Jun, Yan, Rong, and Hauptmann, Alexander G. 2007c. Cross-domain video concept
detection using adaptive SVMs. Pages 188–197 of: Proceedings of the 15th ACM Inter-
national Conference on Multimedia.
Yang, Liu, Hanneke, Steve, and Carbonell, Jaime G. 2013. A theory of transfer learning with
applications to active learning. Machine Learning, 90(2), 161–189.
Yang, Min, Zhao, Zhou, Zhao, Wei, et al. 2017. Personalized response generation via domain
adaptation. Pages 1021–1024 of: Proceedings of the 40th International ACM SIGIR Con-
ference on Research and Development in Information Retrieval.
Yang, Qiang, Chen, Yuqiang, Xue, Gui-Rong, Dai, Wenyuan, and Yu, Yong. 2009. Heteroge-
neous transfer learning for image clustering via the social web. Pages 1–9 of: Proceed-
ings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th Interna-
tional Joint Conference on Natural Language Processing of the AFNLP.
Yang, Wen Hui, Dai, Dao Qing, and Yan, Hong. 2011. Finding correlated biclusters from
gene expression data. IEEE Transactions on Knowledge and Data Engineering, 23(4),
568–584.
Yang, Zhilin, Salakhutdinov, Ruslan, and Cohen, William. 2016. Multi-task cross-lingual se-
quence tagging from scratch. CoRR, abs/1603.06270.
Yao, Kaisheng, Zweig, Geoffrey, Hwang, Mei-Yuh, Shi, Yangyang, and Yu, Dong. 2013. Re-
current neural networks for language understanding. Pages 2524–2528 of: Proceedings
of the 14th Annual Conference of the International Speech Communication Association.
Yao, Kaisheng, Peng, Baolin, Zhang, Yu, et al. 2014. Spoken language understanding using
long short-term memory neural networks. Pages 189–194 of: Proceedings of IEEE Spo-
ken Language Technology Workshop.
Yao, Quanming, Wang, Mengshuo, Escalante, Hugo Jair, et al. 2018. Taking human
out of learning applications: A survey on automated machine learning. CoRR,
abs/1810.13306.
Yazdani, Majid, and Henderson, James. 2015. A model of zero-shot learning of spoken lan-
guage understanding. Pages 244–249 of: Proceedings of the 2015 Conference on Empir-
ical Methods in Natural Language Processing.
Ye, Jihang, Cheng, Hong, Zhu, Zhe, and Chen, Minghua. 2013. Predicting positive and neg-
ative links in signed social networks by transfer learning. Pages 1477–1488 of: Proceed-
ings of the 22nd International Conference on World Wide Web.
Yi, Zili, Zhang, Hao, Tan, Ping, and Gong, Minglun. 2017. DualGAN: Unsupervised dual
learning for image-to-image translation. Pages 2849–2857 of: Proceedings of the IEEE
International Conference on Computer Vision.
Yin, Haiyan, and Pan, Sinno Jialin. 2017. Knowledge transfer for deep reinforcement learn-
ing with hierarchical experience replay. Pages 1640–1646 of: Proceedings of the 31st
AAAI Conference on Artificial Intelligence.
Yin, Jie, Yang, Qiang, and Ni, Lionel M. 2005. Adaptive temporal radio maps for indoor loca-
tion estimation. Pages 85–94 of: Proceedings of the 3rd IEEE International Conference
on Pervasive Computing and Communications.
Yosinski, Jason, Clune, Jeff, Bengio, Yoshua, and Lipson, Hod. 2014. How transferable are
features in deep neural networks? Pages 3320–3328 of: Advances in Neural Information
Processing Systems.
Young, Steve, Gašić, Milica, Keizer, Simon, et al. 2010. The hidden information
state model: A practical framework for POMDP-based spoken dialogue management.
Computer Speech and Language, 24(2), 150–174.
Young, Steve, Gašić, Milica, Thomson, Blaise, and Williams, Jason D. 2013. POMDP-
based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5),
1160–1179.
Yu, Lantao, Zhang, Weinan, Wang, Jun, and Yu, Yong. 2017. SeqGAN: Sequence generative
adversarial nets with policy gradient. Pages 2852–2858 of: Proceedings of the 31st
AAAI Conference on Artificial Intelligence.
Yu, Zhou, Wu, Fei, Yang, Yi, et al. 2014. Discriminative coupled dictionary hashing for fast
cross-media retrieval. Pages 395–404 of: Proceedings of the 37th International ACM SI-
GIR Conference on Research and Development in Information Retrieval.
Zadrozny, Bianca. 2004. Learning and evaluating classifiers under sample selection bias.
Proceedings of the 21st International Conference on Machine Learning.
Zhang, Chao, Zhang, Lei, and Ye, Jieping. 2012. Generalization bounds for domain adapta-
tion. Advances in Neural Information Processing Systems.
Zhang, Duo, Mei, Qiaozhu, and Zhai, Chengxiang. 2010a. Cross-lingual latent topic extrac-
tion. Pages 1128–1137 of: Proceedings of the 48th Annual Meeting of the Association for
Computational Linguistics.
Zhang, Jing, Ding, Zewei, Li, Wanqing, and Ogunbona, Philip. 2018. Importance weighted
adversarial nets for partial domain adaptation. Pages 8156–8164 of: Proceedings of the
IEEE Conference on Computer Vision and Pattern Recognition.
Zhang, Jingwei, Springenberg, Jost Tobias, Boedecker, Joschka, and Burgard, Wolfram.
2017a. Deep reinforcement learning with successor features for navigation across sim-
ilar environments. Pages 2371–2378 of: Proceedings of 2017 IEEE/RSJ International
Conference on Intelligent Robots and Systems.
Zhang, Jintao, and Huan, Jun. 2012. Inductive multi-task learning with multiple view data.
Pages 543–551 of: Proceedings of the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining.
Zhang, Kai, Gray, Joe W., and Parvin, Bahram. 2010b. Sparse multitask regression for iden-
tifying common mechanism of response to therapeutic targets. Bioinformatics, 26,
i97–i105.
Zhang, Kai, Zheng, Vincent W., Wang, Qiaojun, et al. 2013. Covariate shift in Hilbert space: A
solution via surrogate kernels. Pages 388–395 of: Proceedings of the 30th International
Conference on Machine Learning.
Zhang, Lei, Zuo, Wangmeng, and Zhang, David. 2016. LSDT: Latent sparse domain trans-
fer learning for visual adaptation. IEEE Transactions on Image Processing, 25(3), 1177–
1191.
Zhang, Tong. 2002. Covering number bounds for certain regularized linear function classes.
Journal of Machine Learning Research, 2, 527–550.
Zhang, Weinan, Liu, Ting, Wang, Yifa, and Zhu, Qingfu. 2017b. Neural personalized re-
sponse generation as domain adaptation. CoRR, abs/1701.02073.
Zhang, Wenlu, Li, Rongjian, Zeng, Tao, et al. 2015a. Deep model based transfer and
multi-task learning for biological image analysis. Pages 1475–1484 of: Proceedings of
the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data
Mining.
Zhang, Wenlu, Li, Rongjian, Zeng, Tao, et al. 2017c. Deep model based transfer and multi-
task learning for biological image analysis. IEEE Transactions on Big Data.
Zhang, Xiao-Lei. 2015a. Convex discriminative multitask clustering. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 37(1), 28–40.
Zhang, Xucong, Sugano, Yusuke, Fritz, Mario, and Bulling, Andreas. 2015b. Appearance-
based gaze estimation in the wild. Pages 4511–4520 of: Proceedings of the IEEE Confer-
ence on Computer Vision and Pattern Recognition.
Zhang, Yi, and Schneider, Jeff G. 2010. Learning multiple tasks with a sparse matrix-normal
penalty. Pages 2550–2558 of: Advances in Neural Information Processing Systems.
Zhang, Yongfeng, Ai, Qingyao, Chen, Xu, and Croft, W. Bruce. 2017d. Joint representation
learning for top-N recommendation with heterogeneous information sources. Pages
1449–1458 of: Proceedings of the 2017 ACM on Conference on Information and Knowl-
edge Management.
Zhang, Yu. 2013. Heterogeneous-neighborhood-based multi-task local learning algo-
rithms. Pages 1896–1904 of: Advances in Neural Information Processing Systems.
Zhang, Yu. 2015b. Multi-task learning and algorithmic stability. Pages 3181–3187 of: Pro-
ceedings of the 29th AAAI Conference on Artificial Intelligence.
Zhang, Yu. 2015c. Parallel multi-task learning. Pages 629–638 of: Proceedings of the IEEE
International Conference on Data Mining.
Zhang, Yu, and Yang, Qiang. 2017a. Learning sparse task relations in multi-task learning.
Pages 2914–2920 of: Proceedings of the 31st AAAI Conference on Artificial Intelligence.
Zhang, Yu, and Yang, Qiang. 2017b. A survey on multi-task learning. CoRR,
abs/1707.08114v2.
Zhang, Yu, and Yeung, Dit-Yan. 2009. Semi-supervised multi-task regression. Pages 617–631
of: Proceedings of European Conference on Machine Learning and Knowledge Discovery
in Databases.
Zhang, Yu, and Yeung, Dit-Yan. 2010a. A convex formulation for learning task relationships
in multi-task learning. Pages 733–742 of: Proceedings of the 26th Conference on Uncer-
tainty in Artificial Intelligence.
Zhang, Yu, and Yeung, Dit-Yan. 2010b. Multi-task learning using generalized t process.
Pages 964–971 of: Proceedings of the 13th International Conference on Artificial Intelli-
gence and Statistics.
Zhang, Yu, and Yeung, Dit-Yan. 2012. Multi-task boosting by exploiting task relationships.
Pages 697–710 of: Proceedings of European Conference on Machine Learning and Prin-
ciples and Practice of Knowledge Discovery in Databases.
Zhang, Yu, and Yeung, Dit-Yan. 2013a. Learning high-order task relationships in multi-task
learning. Pages 1917–1923 of: Proceedings of the 23rd International Joint Conference on
Artificial Intelligence.
Zhang, Yu, and Yeung, Dit-Yan. 2013b. Multilabel relationship learning. ACM Transactions
on Knowledge Discovery from Data, 7(2), article 7.
Zhang, Yu, and Yeung, Dit-Yan. 2014. A regularization approach to learning task relation-
ships in multitask learning. ACM Transactions on Knowledge Discovery from Data, 8(3),
article 12.
Zhang, Yu, Yeung, Dit-Yan, and Xu, Qian. 2010c. Probabilistic multi-task feature selection.
Pages 2559–2567 of: Advances in Neural Information Processing Systems.
Zhang, Zhanpeng, Luo, Ping, Loy, Chen Change, and Tang, Xiaoou. 2014. Facial landmark
detection by deep multi-task learning. Pages 94–108 of: Proceedings of the 13th Euro-
pean Conference on Computer Vision.
Zhao, Junbo Jake, Mathieu, Michaël, and LeCun, Yann. 2016. Energy-based generative ad-
versarial network. CoRR, abs/1609.03126.
Zhao, Kai, and Huang, Liang. 2017. Joint syntacto-discourse parsing and syntacto-
discourse treebank. Pages 2117–2123 of: Proceedings of the 2017 Conference on Em-
pirical Methods in Natural Language Processing.
Zhao, Xiangyu, Zhang, Liang, Ding, Zhuoye, et al. 2018. Deep reinforcement learning for
list-wise recommendations. CoRR, abs/1801.00209.
Zheng, Vincent W., Pan, Sinno J., Yang, Qiang, and Pan, Jeffrey J. 2008a. Transferring multi-
device localization models using latent multi-task learning. Pages 1427–1432 of: Pro-
ceedings of the 23rd AAAI Conference on Artificial Intelligence.
Zheng, Vincent W., Xiang, Evan Wei, Yang, Qiang, and Shen, Dou. 2008b. Transferring
localization models over time. Pages 1421–1426 of: Proceedings of the 23rd AAAI
Conference on Artificial Intelligence.
Zheng, Vincent W., Cao, Hong, Gao, Shenghua, et al. 2016. Cold-start heterogeneous-device
wireless localization. Pages 1429–1435 of: Proceedings of the 30th AAAI Conference on
Artificial Intelligence.
Zheng, Vincent Wenchen, Hu, Derek Hao, and Yang, Qiang. 2009. Cross-domain activity
recognition. Pages 61–70 of: Proceedings of the 11th International Conference on Ubiq-
uitous Computing.
Zhou, Guangyou, Xie, Zhiwen, Huang, Jimmy Xiangji, and He, Tingting. 2016. Bi-
transferring deep neural networks for domain adaptation. Pages 322–332 of: Proceed-
ings of the 54th Annual Meeting of the Association for Computational Linguistics.
Zhou, Joey Tianyi, Pan, Sinno Jialin, Tsang, Ivor W., and Yan, Yan. 2014a. Hybrid heteroge-
neous transfer learning through deep learning. Pages 2213–2219 of: Proceedings of the
28th AAAI Conference on Artificial Intelligence.
Zhou, Joey Tianyi, Tsang, Ivor W., Pan, Sinno Jialin, and Tan, Mingkui. 2014b. Heteroge-
neous domain adaptation for multiple classes. Pages 1095–1103 of: Proceedings of the
17th International Conference on Artificial Intelligence and Statistics.
Zhu, Fan, Shao, Ling, and Yu, Mengyang. 2014. Cross-modality submodular dictionary
learning for information retrieval. Pages 1479–1488 of: Proceedings of the 23rd ACM
International Conference on Information and Knowledge Management.
Zhu, Feng, Wang, Yan, Chen, Chaochao, et al. 2018. A deep framework for cross-domain
and cross-system recommendations. Pages 3711–3717 of: Proceedings of the 27th In-
ternational Joint Conference on Artificial Intelligence.
Zhu, Jun-Yan, Park, Taesung, Isola, Phillip, and Efros, Alexei A. 2017. Unpaired image-to-
image translation using cycle-consistent adversarial networks. Pages 2223–2232 of:
Proceedings of the IEEE International Conference on Computer Vision.
Zhu, Xiaojin. 2005. Semi-supervised Learning Literature Survey. Tech. Report TR 1530,
Computer Sciences, University of Wisconsin-Madison.
Zhu, Yin, Chen, Yuqiang, Lu, Zhongqi, et al. 2011. Heterogeneous transfer learning for im-
age classification. Proceedings of the 25th AAAI Conference on Artificial Intelligence.
Zhuang, Yue Ting, Wang, Yan Fei, Wu, Fei, Zhang, Yin, and Lu, Weiming. 2013. Supervised
coupled dictionary learning with group structures for multi-modal retrieval. Proceed-
ings of the 27th AAAI Conference on Artificial Intelligence.
Zhuo, Hankz Hankui, and Yang, Qiang. 2014. Action-model acquisition for planning via
transfer learning. Artificial Intelligence, 212, 80–103.
Zilka, Lukas, and Jurcicek, Filip. 2015. Incremental LSTM-based dialog state tracker. Pages
757–762 of: Proceedings of IEEE Workshop on Automatic Speech Recognition and Un-
derstanding.
Ziser, Yftah, and Reichart, Roi. 2017. Neural structural correspondence learning for domain
adaptation. Pages 400–410 of: Proceedings of the 21st Conference on Computational
Natural Language Learning.
Ziser, Yftah, and Reichart, Roi. 2018. Pivot based language modeling for improved neural
domain adaptation. Pages 1241–1251 of: Proceedings of the 2018 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Language
Technologies.
Zoph, Barret, and Knight, Kevin. 2016. Multi-source neural translation. Pages 30–34 of: Pro-
ceedings of The 2016 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies.
Zoph, Barret, Yuret, Deniz, May, Jonathan, and Knight, Kevin. 2016. Transfer learning for
low-resource neural machine translation. CoRR, abs/1604.02201.
Zweig, Alon, and Weinshall, Daphna. 2013. Hierarchical regularization cascade for joint
learning. Pages 37–45 of: Proceedings of the 30th International Conference on Machine
Learning.
Index
instance-based transfer learning, 10, 23, 45, 280, 281, 312
Jeffrey's J-divergence, 319
KL divergence, 319
knowledge distillation, 49
Kolmogorov complexity, 141
latent Dirichlet allocation, 200
latent factor analysis, 72
learning by analogy, 17
learning psychology (transfer of learning), 6
learning to transfer, 169
lifelong machine learning, 175, 196, 197
lifelong sentiment classification, 199
localization model, 308
location estimation, 307
low-resource learning, 190, 191
machine learning, 3
machine translation, 236
manifold alignment, 80
Markov decision process, 105
Markov logic networks, 59, 61
Markov networks, 61
Markov random fields, 61
maximum mean discrepancy, 35
medical image analysis, 229
microarray analysis, 293
mixed transfer algorithm, 154
model-based multitask learning, 45
model-based multitask supervised learning, 132
model-based transfer learning, 10, 45, 46, 51, 280, 283, 284, 314
modular dialogue system, 258
multi-armed bandit, 121
multi-armed stochastic bandit, 115
multi-kernel learning, 53
multi-party learning, 215, 217
multitask active learning, 128, 138
multitask autoencoder, 194
multitask learning, 12, 45, 46, 48, 126, 128, 175, 203, 234, 295
multitask multi-view learning, 128, 140
multitask online learning, 128, 139
multitask reinforcement learning, 128, 139
multitask semi-supervised learning, 128, 138
multitask supervised learning, 128
multitask unsupervised learning, 128, 137
naive Bayes classifier, 49
natural language generation, 258, 268
natural language processing, 4, 49, 234
negative transfer, 9
neighborhood-based transfer learning, 286
never-ending language learning, 204
news recommendation, 284, 285
non-pivot, 243
noninductive transfer learning, 24
objective perturbation, 214
one-shot learning, 184
open-domain dialogue system, 257
option-based transfer, 115
output perturbation, 213
PAC-Bayesian theorem, 141
parallel multitask learning, 140
parameter-based transfer learning, 45
peptide analysis, 298
personalized medicine, 293
pivot, 243
policy distillation, 120
poor resource learning, 190
privacy budget, 212
privacy-preserving transfer learning, 215
probabilistic latent semantic analysis, 200
protein–protein interaction prediction, 300
proto-value functions, 117
question answering, 238
Rademacher/Gaussian complexity, 141
recommendation, 279
recommender system, 279
regression, 7
regularized empirical risk minimization, 213
reinforcement learning, 105, 272
relation-based transfer learning, 11, 58
relational adaptive bootstrapping, 60
reproducing kernel Hilbert space, 67
residual neural network, 305
second-order Markov logic, 64
second-order relation-based transfer learning, 60, 62
segmentation, 232
selective learning algorithm, 163
semi-supervised learning, 11
semi-supervised transfer learning, 9
sensor-based activity recognition, 307
sentiment analysis, 242
simulation, 7
simulation in robotics, 7
small-data challenge, vi, 335
social relation based transfer, 288
spoken language understanding, 257, 259
stability, 141
statistical relational learning, 59
style transfer, 93
successor feature functions, 117
supervised transfer learning, 9
support vector machine, 51, 52
systems biology, 294, 299