Introduction
Even though a machine learning model can be of high quality, it can still make mistakes, especially when it is applied to scenarios that differ from its training environment. For example, if a new photo is taken in an outdoor environment with different light intensities and levels of noise, such as shadows, sunlight from different angles and occlusion by passersby, the recognition capability of the system may drop dramatically. This is because the model trained by the machine learning system is applied to a "different" scenario. This drop in performance shows that models can become outdated and need updating when new situations occur. It is this need to update or transfer models from one scenario to another that lends importance to the topic of this book.
The need for transfer learning is not limited to image understanding. Another
example is understanding Twitter text messages by natural language processing
(NLP) techniques. Suppose we wish to classify Twitter messages into different user moods, such as happy or sad, based on their content. When a model is built using one collection of Twitter messages and then applied to new data, the performance may drop quite dramatically, because a different community of people will very likely express their opinions differently. This happens, for example, when we have teenagers in one group and grown-ups in another.
As the previous examples demonstrate, a major challenge in practicing machine learning in many applications is that models do not work well in new task domains. They may fail for one of several reasons: a lack of new training data due to the small data challenge, changes of circumstances, and changes of tasks. For example, in a new situation, high-quality training data may be in short supply, if not impossible to obtain, for model retraining, as in the case of medical diagnosis and medical imaging data. Machine learning models cannot do well without sufficient training data. Obtaining and labeling new data often takes much effort and many resources in a new application domain, which is a major obstacle to realizing AI in the real world. Having well-designed AI systems without the needed training data is like having a sports car without fuel.
This discussion highlights a major roadblock in bringing machine learning to the practical world: it would be impossible to collect large quantities of data in every domain before applying machine learning. Here we summarize some of the reasons to develop a transfer learning methodology:
1) Many applications only have small data: the current success of machine learn-
ing relies on the availability of a large amount of labeled data. However, high-
quality labeled data are often in short supply. Trained on such small data, traditional machine learning methods often overfit and thus cannot generalize well to new scenarios, failing in many such cases.
2) Machine learning models need to be robust: traditional machine learning of-
ten makes the assumption that both the training data and the test data are drawn from the same distribution. However, this assumption is too strong to hold in many real-world applications.
allow one to focus on adapting the new parts for the book-recommendation task, which makes it possible to further exploit the underlying similarities between the data
sets. Then, book domain classification and user preference learning models can
be adapted from those of the movie domain.
Based on transfer learning methodologies, once we obtain a well-developed model in one domain, we can bring this model to benefit other, similar domains. Hence, having an accurate "distance" measure between any two task domains is necessary for developing a sound transfer learning methodology. If the distance be-
tween two domains is large, then we may not wish to apply transfer learning as
the learning might turn out to produce a negative effect. On the other hand, if two
domains are “close by,” transfer learning can be fruitfully applied.
In machine learning, the distance between domains can often be measured in
terms of the features that are used to describe the data. In image analysis, fea-
tures can be pixels or patches in an image pattern, such as the color or shape.
In NLP, features can be words or phrases. Once we know that two domains are close to each other, we can expect AI models to be propagated from well-developed domains to less-developed domains, making the application of AI less data dependent. This is a good sign for successful transfer learning applications.
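To make the notion of a domain distance concrete, the sketch below computes one widely used statistical measure, the maximum mean discrepancy (MMD), between the feature distributions of two domains. This is an illustrative example rather than a method prescribed in this chapter; the synthetic feature matrices, the RBF kernel and its bandwidth are assumptions made only for the sketch.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    # Pairwise RBF kernel values between the rows of a and the rows of b.
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def mmd2(source, target, gamma=1.0):
    # Empirical (biased) estimate of the squared maximum mean discrepancy
    # between the source and target feature distributions.
    k_ss = rbf_kernel(source, source, gamma).mean()
    k_tt = rbf_kernel(target, target, gamma).mean()
    k_st = rbf_kernel(source, target, gamma).mean()
    return k_ss + k_tt - 2.0 * k_st

rng = np.random.default_rng(0)
source_feats = rng.normal(0.0, 1.0, size=(200, 16))  # e.g., source-domain features
target_feats = rng.normal(0.5, 1.0, size=(200, 16))  # e.g., target-domain features

print(mmd2(source_feats, target_feats))  # a small value suggests the domains are "close"
```

A small MMD suggests that the two domains are close enough for transfer to be attempted, while a large value warns that brute-force transfer may hurt performance, a situation discussed later as negative transfer.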
Being able to transfer knowledge from one domain to another allows machine
learning systems to extend their range of applicability beyond their original cre-
ation. This generalization ability helps make AI more accessible and more robust
in many areas where AI talents or resources such as computing power, data and
hardware might be scarce. In a way, transfer learning allows the promotion of AI
as a more inclusive technology that serves everyone.
To give an intuitive example, we can use an analogy to highlight the key insights
behind transfer learning. Consider driving in different countries of the world. In the USA and China, for example, the driver's seat is on the left of the car and cars drive on the right side of the road. In Britain, the driver sits on the right side of the car and drives on the left side of the road. For a traveler who is used to driving in the USA, switching to driving in Britain is particularly hard. Transfer learning, however, tells us to find the invariant, that is, a common feature, in the two driving domains. On closer observation, one can find that no matter where one drives, the driver sits closest to the center of the road. Or, conversely, the driver sits farthest from the side of the road. This fact allows human drivers to smoothly
“transfer” from one country to another. Thus, the insight behind transfer learning
is to find the “invariant” between domains and tasks.
Transfer learning has been studied under different terminologies in AI, such as
knowledge reuse and case-based reasoning (CBR), learning by analogy, domain adaptation, pre-training, fine-tuning, and so on. In the fields of education and learning psychology, transfer of learning is a notion similar to transfer learning in machine learning. In particular, transfer of learning refers to the process in which past experience acquired from previous source tasks can be used to influence future learning and performance in new tasks.
Definition 1.1 (transfer learning) Given a source domain Ds and learning task Ts, a target domain Dt and learning task Tt, transfer learning aims to help improve the learning of the target predictive function ft(·) for the target domain using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt.
A transfer learning process is illustrated in Figure 1.1. The process on the left
corresponds to a traditional machine learning process. The process on the right
corresponds to a transfer learning process. As we can see, transfer learning makes
use of not only the data in the target task domain as input to the learning algorithm, but also elements of the learning process in the source domain, including the training data, models and task description. This figure shows a key concept of transfer learning: it counters the lack of training data in the target domain with knowledge gained from the source domain.
As a domain consists of two components, D = {X, P(X)}, the condition Ds ≠ Dt implies that either Xs ≠ Xt or P(Xs) ≠ P(Xt). Similarly, as a task is defined as a pair of components T = {Y, P(Y|X)}, the condition Ts ≠ Tt implies that either Ys ≠ Yt or P(Ys|Xs) ≠ P(Yt|Xt). When the target domain and the source domain are the same, that is, Ds = Dt, and their learning tasks are the same, that is, Ts = Tt, the learning problem becomes a traditional machine learning problem.
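Stated compactly, the conditions above can be written as follows; this is only a restatement, in standard notation, of the definitions already given:

```latex
\mathcal{D}_s \neq \mathcal{D}_t \iff \mathcal{X}_s \neq \mathcal{X}_t \ \text{or}\ P(X_s) \neq P(X_t),
\qquad
\mathcal{T}_s \neq \mathcal{T}_t \iff \mathcal{Y}_s \neq \mathcal{Y}_t \ \text{or}\ P(Y_s \mid X_s) \neq P(Y_t \mid X_t).
```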
Based on this definition, we can formulate different ways to categorize exist-
ing transfer learning studies into different settings. For instance, based on the ho-
mogeneity of the feature spaces and/or label spaces, we can categorize transfer
learning into two settings: (1) homogeneous transfer learning and (2) heteroge-
neous transfer learning, whose definitions are described as follows (Pan, 2014).1
Besides using the homogeneity of the feature spaces and label spaces, we can
also categorize existing transfer learning studies into the following three settings
by considering whether labeled data and unlabeled data are available in the tar-
get domain: supervised transfer learning, semi-supervised transfer learning and
unsupervised transfer learning. In supervised transfer learning, only a few labeled
data are available in the target domain for training, and we do not use the unla-
beled data for training. For unsupervised transfer learning, there are only unla-
beled data available in the target domain. In semi-supervised transfer learning,
sufficient unlabeled data and a few labeled data are assumed to be available in
the target domain.
To design a transfer learning algorithm, we need to consider the following three
main research issues: (1) when to transfer, (2) what to transfer and (3) how to
transfer.
When to transfer asks in which situations transferring skills should be done.
Likewise, we are interested in knowing in which situations knowledge should not
be transferred. In some situations, when the source domain and the target do-
main are not related to each other, brute-force transfer may be unsuccessful. In
the worst case, it may even hurt the performance of learning in the target domain,
a situation often referred to as negative transfer. Most current studies on transfer learning focus on "what to transfer" and "how to transfer," implicitly assuming that the source domain and the target domain are related to each other. However, how to avoid negative transfer is an important open issue that is attracting more and more attention.
What to transfer determines which part of knowledge can be transferred across
domains or tasks. Some knowledge is specific to individual domains or tasks, and some knowledge may be common to different domains, such that it may help improve performance for the target domain or task. Note that the term
1 In the rest of the book, without explicit specification, the term "transfer learning" denotes homogeneous transfer learning.
learning model using sufficient source data, which could be quite different from
the target data. After the deep model is trained, a few target labeled data are used
to fine-tune part of the parameters of the pretrained deep model, for example, to
fine-tune parameters of several layers while fixing parameters of other layers.
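The following sketch illustrates this fine-tuning recipe in PyTorch. The network architecture, the data and the choice of which layers to freeze are assumptions made only for illustration; in practice the pretrained model would come from training on abundant source data.

```python
import torch
import torch.nn as nn

# Stand-in for a network pretrained on abundant source data (randomly
# initialized here only to keep the sketch self-contained).
pretrained = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 10),               # source task head (10 source classes)
)

# Freeze all pretrained parameters ...
for param in pretrained.parameters():
    param.requires_grad = False

# ... then replace the head for the target task (3 target classes); the new
# layer's parameters are trainable by default.
pretrained[-1] = nn.Linear(32, 3)

optimizer = torch.optim.Adam(
    (p for p in pretrained.parameters() if p.requires_grad), lr=1e-3
)
loss_fn = nn.CrossEntropyLoss()

# A few labeled target-domain examples (synthetic here).
x_target = torch.randn(16, 128)
y_target = torch.randint(0, 3, (16,))

for _ in range(100):                 # fine-tune only the unfrozen layer
    optimizer.zero_grad()
    loss = loss_fn(pretrained(x_target), y_target)
    loss.backward()
    optimizer.step()
```

Freezing more or fewer layers trades off how much source knowledge is retained against how much the model can adapt to the target data.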
Different from the three aforementioned categories of approaches, relation-
based transfer learning approaches assume that some relationships between ob-
jects (i.e., instances) are similar across domains or tasks. Once these common relationships are extracted, they can be used as knowledge for transfer learning. Note that, in this category, data in the source domain and the target domain are not required to be independent and identically distributed, as is assumed in the other three categories.
[Figure: supervised learning and semi-supervised learning on the target domain/task, contrasted in terms of the lack of labeled data and the exploration of unlabeled data.]
to choose the data from which it learns. However, active learning assumes that
there is a budget for the active learner to pose queries in the domain of interest. In
some real-world applications, the budget may be quite limited, which means that the labeled data queried by active learning may not be sufficient to learn an accurate classifier in the domain of interest.
Transfer learning, in contrast, allows the domains, tasks and distributions used
in the training phase and the testing phase to be different. The main idea behind
transfer learning is to borrow labeled data or extract knowledge from some related
domains to help a machine learning algorithm to achieve greater performance in
the domain of interest. Thus, transfer learning can be referred to as a different
strategy for learning models with minimal human supervision, compared to semi-
supervised and active learning.
One of the learning paradigms most closely related to transfer learning is multitask learning. Although both transfer learning and multitask learning aim to generalize commonality across tasks, transfer learning is focused on learning a target task, where one or more source tasks are used as auxiliary information, while multitask learning aims to learn a set of target tasks jointly to improve the generalization performance of each learning task without any source or auxiliary tasks. Because most existing multitask learning methods consider all tasks to be of equal importance, whereas transfer learning only takes the performance of the target task into consideration, some detailed designs of the learning algorithms differ. However, most existing multitask learning algorithms can be adapted to the transfer learning setting.
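Schematically, the contrast can be written with the following illustrative objectives, where L_i denotes the loss of task i and m is the number of tasks; the detailed objectives of particular algorithms differ:

```latex
\text{Multitask learning:}\quad \min_{\theta}\ \sum_{i=1}^{m} \mathcal{L}_i(\theta)
\qquad\qquad
\text{Transfer learning:}\quad \min_{\theta_t}\ \mathcal{L}_t(\theta_t),\ \ \text{with source tasks used only as auxiliary information.}
```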
We summarize the relationships between transfer learning and other machine
learning paradigms in Figure 1.2, and the difference between transfer learning and
multitask learning in Figure 1.3.
[Figure 1.3 (partial): commonality between source and target tasks in transfer learning versus commonality among target tasks in multitask learning.]
often too coarse to serve the purpose well in measuring the distance, in terms of transferability, between two domains or tasks. Second, if the domains have different feature spaces and/or label spaces, one has to first project the data onto the same feature and/or label space, and then apply the statistical distance measures as a follow-up step. Therefore, more research needs to be done on a general notion of distance between two domains or tasks.
diversity, to allow a system to explore new topics as well as cater to users’ recent
choices. Relating to transfer learning, the work shows that the recommendation
strategy in balancing exploration and exploitation can indeed be transferred be-
tween domains.
ally difficult problems such as question answering problems (Devlin et al., 2018).
It has accomplished surprising results, leading in many tasks in the open competition SQuAD 2.0 (Rajpurkar et al., 2016). The source domain consists of an extremely large collection of natural language text, with which BERT trains a model based on bidirectional transformers built on the attention mechanism. The pretrained model makes a variety of language model predictions more accurately than before, and its predictive power increases with increasing amounts of training data in the source domain. Then, the BERT
model is applied to a specific task in a target domain by adding additional small
layers to the source model in such tasks as Next Sentence classification, Ques-
tion Answering and Named Entity Recognition (NER). The transfer learning ap-
proach corresponds to model-based transfer, where most parameters stay the same but a selected few parameters can be adapted with the new data in the target domain.
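As a concrete sketch of this style of model-based transfer, the snippet below loads a pretrained BERT encoder, adds a small classification head for a target task and adapts only part of the network on a tiny batch of target data. The Hugging Face transformers library, the model name and the choice of how many encoder layers to freeze are assumptions made for illustration, not details given in the text.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# A small task-specific classification layer is added on top of the encoder.
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Keep the lower encoder layers fixed; adapt only the upper layers and the head.
for layer in model.bert.encoder.layer[:8]:
    for param in layer.parameters():
        param.requires_grad = False

batch = tokenizer(["an example target-domain sentence"], return_tensors="pt")
labels = torch.tensor([1])

outputs = model(**batch, labels=labels)
outputs.loss.backward()  # gradients reach only the parameters left unfrozen
```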
There have been some surveys on transfer learning in the machine learning literature. Pan and Yang (2010) and Taylor and Stone (2009) give early surveys of
the work on transfer learning, where the former focused on machine learning in
classification and regression areas and the latter on reinforcement learning ap-
proaches. This book aims to give a comprehensive survey that covers both of these areas, as well as more recent advances in transfer learning with deep learning.
transfer learning methods can also be useful when multiple source do-
mains exist.
Chapter 3 covers feature-based transfer learning. Features constitute a major el-
ement of machine learning. They can be straightforward attributes in the
input data, such as pixels in images or words and phrases in a text docu-
ment, or they can be composite features constructed by certain nonlinear transformations of the input features. Together, these features comprise a high-dimensional feature space. Feature-based transfer identifies common subspaces of features between the source and target domains, and allows transfer to happen in these subspaces. This style of transfer learn-
ing is particularly useful when no clear instances can be directly trans-
ferred, but some common “style” of learning can be transferred.
Chapter 4 discusses model-based transfer learning. Model-based transfer is when
parts of a learning model can be transferred to a target domain from a
source domain, where the learning in the target domain can be “fine-
tuned” based on the transferred model. Model-based transfer learning
is particularly useful when one has a fairly complete collection of data
in a source domain, and the model in the source domain can be made
very powerful in terms of coverage. Then learning in a target domain cor-
responds to adapting the general model from the source domain to a spe-
cific model in a target domain on the “edge” of a network of
domains.
Chapter 5 explores relation-based transfer learning. This approach is particularly useful when knowledge is encoded in a knowledge graph or in relational logic form. When some dictionary of translation can be established, and when knowledge exists in the form of encoded rules, this type of transfer learning can be particularly effective.
Chapter 6 presents heterogeneous transfer learning. Sometimes, when we deal
with transfer learning, the target domain may have a completely differ-
ent feature representation from that of the source domain. For example,
we may have collected labeled data about images, but the target task is to
classify text documents. If there is some relationship between the images
and the text documents, transfer learning can still happen at the seman-
tic level, where the semantics of the common knowledge between the
source and the target domains can be extracted as a “bridge” to enable
the knowledge transfer.
Chapter 7 discusses adversarial transfer learning. Machine learning, especially
deep learning, can be designed to generate data and at the same time
classify data. This dual relationship in machine learning can be exploited
to mimic the power of imitation and creation in humans. This learn-
ing process can be modeled as a game between multiple models, and is
called adversarial learning. Adversarial learning can be very useful in em-
powering a transfer learning process, which is the subject of this chapter.
netics domain is full of data of very high dimensionality and low sample
sizes. We give an overview of work in this area.
Chapter 21 presents applications of transfer learning in activity recognition based
on sensors. Activity recognition refers to recognizing people's activities from sensor readings, which can be very useful for assisted living, security and a wide range of other applications. A challenge in this domain is the lack of labeled data, and this challenge is particularly well suited for transfer learning to address.
Chapter 22 discusses applications of transfer learning in urban computing. There
are many machine learning problems to address in urban computing,
ranging from traffic prediction to pollution forecasting. When data have been collected in one city, the model can be transferred to newly considered cities via transfer learning, especially when there is not sufficient high-quality data in those new cities.
Chapter 23 gives a summary of the whole book with an outlook on future work.