1 Introduction

1.1 AI, Machine Learning and Transfer Learning


AI was a vision initiated by Alan Turing when he asked the famous question:
“Can machines think?” This question has motivated generations of researchers to
explore ways to make machines behave intelligently. Throughout recent history, AI has experienced several ups and downs, much of which revolves around the central question of how machines can acquire knowledge from the outside world.
Attempts to make machines think like humans have gone a long way, from
force-feeding rule-like knowledge bases to machine learning from data. Machine
learning has thus grown from an obscure discipline to a major industrial and so-
cietal force in automating decisions that range from online commerce and ad-
vertising to education and health care. Machine learning is becoming a general
enabling technology for the world due to its strong ability to endow machines
with knowledge by letting them learn and adapt through labeled and unlabeled
data. Machine learning produces prediction models from data, thus often requir-
ing well-defined data as “teachers” to help tune statistical models. This ability in
making accurate predictions of future events are based on observations and un-
derstanding of the task domains. The data samples in the training examples are
often “labeled,” which means that observations and outcomes of predictions in
the training data are coupled and correlated. These examples are then used as
“teachers” by a machine learning algorithm to “train” a model that can be applied
to new data.
One can find many illustrative examples of machine learning in the real world.
One example is in the area of face recognition in computer-based image analysis.
Suppose that we have obtained a large pool of photos taken indoors. A machine
learning system can then use these data to train a model that reports whether a
new photo corresponds to a person appearing in the pool. An application of this
model would be a gate security system for a building, where a task would be to
ascertain whether a visitor is an employee in the organization.
Even though a machine learning model can be made to be of high quality, it can
also make mistakes, especially when the model is applied to different scenarios
from its training environments. For example, if a new photo is taken from an out-
door environment with different light intensities and levels of noise such as shad-
ows, sunlight from different angles and occlusion by passersby, the recognition
capability of the system may dramatically drop. This is because the model trained
by the machine learning system is applied to a “different” scenario. This drop in
performance shows that models can become outdated and need updating when new
situations occur. It is this need to update or transfer models from one scenario to
another that lends importance to the topic of the book.
The need for transfer learning is not limited to image understanding. Another
example is understanding Twitter text messages by natural language processing
(NLP) techniques. Suppose we wish to classify Twitter messages into different
user moods, such as happy or sad, based on their content. When a model is built using
a collection of Twitter messages and then applied to new data, the performance
drops quite dramatically as a different community of people will very likely ex-
press their opinions differently. This happens when we have teenagers in one
group and grown-ups in another.
As the previous examples demonstrate, a major challenge in practicing ma-
chine learning in many applications is that models do not work well in new task
domains. They may fail for one of several reasons: a lack of new training data (the small-data challenge), changes of circumstances and changes of tasks. For example, in a new situation, high-quality training data may be in short supply, if not impossible to obtain, for model
retraining, as in the case of medical diagnosis and medical imaging data.
Machine learning models cannot do well without sufficient training data. Obtain-
ing and labeling new data often takes much effort and resources in a new appli-
cation domain, which is a major obstacle in realizing AI in the real world. Having
well-designed AI systems without the needed training data is like having a sports car without fuel.
This discussion highlights a major roadblock in deploying machine learning in the practical world: it would be impossible to collect large quantities of data in
every domain before applying machine learning. Here we summarize some of the
reasons to develop such a transfer learning methodology:

1) Many applications only have small data: the current success of machine learn-
ing relies on the availability of a large amount of labeled data. However, high-
quality labeled data are often in short supply. Traditional machine learning
methods trained on such limited data often cannot generalize well to new scenarios, a phenomenon known as overfitting.
2) Machine learning models need to be robust: traditional machine learning of-
ten makes an assumption that both the training and test data are drawn from
the same distribution. However, this assumption is too strong to hold in many
practical scenarios. In many cases, the distribution varies according to time
and space, and varies among situations, so we may never have access to new
training data to go with the same test distribution. In situations that differ from
the training data, the trained models need adaptation before they can be used.
3) Personalization and specialization are important issues: it is critical and prof-
itable to offer personalized service for every user according to individual tastes
and demands. In many real world applications, we can only collect very little
personal data from an individual user. As a result, traditional machine learn-
ing methods suffer from the cold-start problem when we try to adapt a general
model to a specific situation.
4) User privacy and data security are important issues: often in our applications
we must work with other organizations by leveraging multiple data sets. Often
these data sets have different owners and cannot be revealed to each other
for privacy or security concerns. When building a model together, it would be
desirable for us to extract the “essence” of each data set and adapt them in
building a new model. For example, if we can adapt a general model at the
“edge” of a network of devices, then the data stored on the device need not
be uploaded to enhance the general model; thus, privacy of the edge device
can be ensured.

These objectives for intelligent systems motivated the development of transfer
learning. In a nutshell, transfer learning refers to the machine learning paradigm
in which an algorithm extracts knowledge from one or more application scenar-
ios to help boost the learning performance in a target scenario. Compared to tra-
ditional machine learning, which requires large amounts of well-defined training
data as the input, transfer learning can be understood as a new learning paradigm,
which the rest of the book will cover in detail. Transfer learning is also motivated by the need to solve the so-called data sparsity and cold-start problems in many large-scale
and online applications (e.g., labeled user rating data in online recommendation
systems may be too few to allow these online systems to build a high-quality rec-
ommendation system).
Transfer learning can help promote AI in less-developed application areas, as
well as less technically developed geographical areas, even when not much la-
beled data is available in such areas. For example, suppose we wish to build a book
recommendation system in a new online shopping application. Suppose that the
book domain is so new that we do not have many transactions recorded in this do-
main. If we follow the supervised learning methodology and build a prediction model using the insufficient training data in the new domain, we cannot obtain a credible prediction model of users’ next purchases. However, with transfer
learning, one can look to a related, well-developed but different domain for help,
such as an existing movie recommendation domain. Exploiting transfer learning
techniques, one can find the similarity and differences between the book and the
movie domains. For example, some authors also turn their books into movies, and
movies and books can attract similar user groups. Noticing these similarities can
allow one to focus on adapting the new parts for the book-recommendation task and to further exploit the underlying similarities between the data
sets. Then, book domain classification and user preference learning models can
be adapted from those of the movie domain.
Based on the transfer learning methodologies, once we obtain a well-developed
model in one domain, we can bring this model to benefit other similar domains.
Hence, having an accurate “distance” measure between any task domains is nec-
essary in developing a sound transfer learning methodology. If the distance be-
tween two domains is large, then we may not wish to apply transfer learning as
the learning might turn out to produce a negative effect. On the other hand, if two
domains are “close by,” transfer learning can be fruitfully applied.
In machine learning, the distance between domains can often be measured in
terms of the features that are used to describe the data. In image analysis, fea-
tures can be pixels or patches in an image pattern, such as the color or shape.
In NLP, features can be words or phrases. Once we know that two domains are
close to each other, we can ensure that AI models can be propagated from the
well-developed domains to less-developed domains, making the application of AI
less data dependent. And this can be a good sign for successful transfer learning
applications.
Being able to transfer knowledge from one domain to another allows machine
learning systems to extend their range of applicability beyond their original cre-
ation. This generalization ability helps make AI more accessible and more robust
in many areas where AI talents or resources such as computing power, data and
hardware might be scarce. In a way, transfer learning allows the promotion of AI
as a more inclusive technology that serves everyone.
To give an intuitive example, we can use an analogy to highlight the key insights
behind transfer learning. Consider driving in different countries in the world. In
the USA and China, for example, the driver sits on the left side of the car and drives on the right side of the road. In Britain, the driver sits on the right side of the car and drives on the left side of the road. For a traveler who is used to driving in the USA, it is particularly hard to switch when driving in Britain. Transfer learning, however, tells us to find the invariant that the two driving domains have in common. On closer observation, one can find that, no matter where one drives, the driver sits closest to the center of the road or, equivalently, farthest from the side of the road. This fact allows human drivers to smoothly
“transfer” from one country to another. Thus, the insight behind transfer learning
is to find the “invariant” between domains and tasks.
Transfer learning has been studied under different terminologies in AI, such as
knowledge reuse and case-based reasoning (CBR), learning by analogy, domain adaptation, pre-training,
fine-tuning, and so on. In the fields of education and learning psychology, trans-
fer of learning has a similar notion as transfer learning in machine learning. In
particular, transfer of learning refers to the process in which past experience ac-
quired from previous source tasks can be used to influence future learning and
performance in a target situation (Thorndike and Woodworth, 1901). Trans-
fer of learning in the field of education shares a common goal as transfer learn-
ing in machine learning in that they both address the process of learning in one
context and applying the learning in another. In both areas, the learned knowl-
edge or model is taken to a future target task for use after some adaptation. When
one delves into the literature of education theory and learning psychology (Ellis,
1965; Pugh and Bergin, 2006; Schunk, 1965; Cree and Macaulay, 2000), one can
find that, despite the fact that transfer learning in machine learning aims to en-
dow machines with the ability to adapt and transfer of learning in education tries
to study how humans adapt in education, the processes or algorithms of transfer
are similar.
A final note on the benefit of transfer learning is in simulation technology. Often
in complex tasks, such as robotics and drug design, for example, it is too expensive
to engage real world experiments. In robotics, a mobile robot or an autonomous
vehicle needs to collect sufficient training data. For example, there may be many
ways in which a car is involved in a car crash but to create car crashes is far too ex-
pensive in real life. Instead, researchers often build sophisticated simulators such
that a trained model taught in the simulator environment is applied to the real
world after adaptation via transfer learning. The transfer learning step is needed
to account for many future situations that are not seen in the simulated envir-
onment and adapt the simulated prediction models, such as obstacle avoidance
models in autonomous cars, to unforeseeable future situations.

1.2 Transfer Learning: A Definition


To start with, we define what “domain,” “task” and “transfer learning” mean by following the notations introduced by Pan and Yang (2010). A domain $\mathcal{D}$ consists of two components: a feature space $\mathcal{X}$ and a marginal probability distribution $P_X$, where each input instance $x \in \mathcal{X}$. In general, if two domains are different, then they may have different feature spaces or different marginal probability distributions. Given a specific domain $\mathcal{D} = \{\mathcal{X}, P_X\}$, a task $\mathcal{T}$ consists of two components: a label space $\mathcal{Y}$ and a function $f(\cdot)$, denoted by $\mathcal{T} = \{\mathcal{Y}, f(\cdot)\}$. The function $f(\cdot)$ is a predictive function that can be used to make predictions on unseen instances $\{x^*\}$. From a probabilistic viewpoint, $f(x)$ can be written as $P(y \mid x)$. In classification, labels can be binary, that is, $\mathcal{Y} = \{-1, +1\}$, or discrete values, that is, multiple classes. In regression, labels are continuous values.
For simplicity, we now focus on the case where there is one source domain $\mathcal{D}_s$ and one target domain $\mathcal{D}_t$. The two-domain scenario is by far the most popular setting in the research literature. In particular, we denote by $\mathcal{D}_s = \{(\mathbf{x}_{s_i}, y_{s_i})\}_{i=1}^{n_s}$ the source domain labeled data, where $\mathbf{x}_{s_i} \in \mathcal{X}_s$ is a data instance and $y_{s_i} \in \mathcal{Y}_s$ is the corresponding class label. Similarly, we denote by $\mathcal{D}_t = \{(\mathbf{x}_{t_i}, y_{t_i})\}_{i=1}^{n_t}$ the target domain labeled data, where the input $\mathbf{x}_{t_i}$ is in $\mathcal{X}_t$ and $y_{t_i} \in \mathcal{Y}_t$ is the corresponding
output. In most cases, $0 \le n_t \ll n_s$. Based on these notations, transfer learning can be defined as follows (Pan and Yang, 2010).

Figure 1.1 An illustration of a transfer learning process

Definition 1.1 (transfer learning) Given a source domain $\mathcal{D}_s$ and learning task $\mathcal{T}_s$, and a target domain $\mathcal{D}_t$ and learning task $\mathcal{T}_t$, transfer learning aims to help improve the learning of the target predictive function $f_t(\cdot)$ for the target domain using the knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$, where $\mathcal{D}_s \neq \mathcal{D}_t$ or $\mathcal{T}_s \neq \mathcal{T}_t$.

A transfer learning process is illustrated in Figure 1.1. The process on the left
corresponds to a traditional machine learning process. The process on the right
corresponds to a transfer learning process. As we can see, transfer learning makes
use of not only the data in the target task domain as input to the learning algo-
rithm, but also elements of the learning process in the source domain, including the
training data, models and task description. This figure shows a key concept of
transfer learning: it counters the lack of training data problem in the target do-
main with more knowledge gained from the source domain.
As a domain contains two components, $\mathcal{D} = \{\mathcal{X}, P_X\}$, the condition $\mathcal{D}_s \neq \mathcal{D}_t$ implies that either $\mathcal{X}_s \neq \mathcal{X}_t$ or $P_{X_s} \neq P_{X_t}$. Similarly, as a task is defined as a pair of components $\mathcal{T} = \{\mathcal{Y}, P_{Y|X}\}$, the condition $\mathcal{T}_s \neq \mathcal{T}_t$ implies that either $\mathcal{Y}_s \neq \mathcal{Y}_t$ or $P_{Y_s|X_s} \neq P_{Y_t|X_t}$. When the target domain and the source domain are the same, that is, $\mathcal{D}_s = \mathcal{D}_t$, and their learning tasks are the same, that is, $\mathcal{T}_s = \mathcal{T}_t$, the learning problem becomes a traditional machine learning problem.
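To make these conditions concrete, the following toy sketch, which is illustrative only and not part of the original text (it assumes NumPy and scikit-learn are available, and the data are synthetic), constructs a source task and a target task over the same feature space whose conditional distributions $P_{Y_s|X_s}$ and $P_{Y_t|X_t}$ differ, so a model trained only on the source domain loses accuracy on the target domain:

```python
# Toy illustration (hypothetical example): the source and target domains share
# the same feature space, but the labeling rule, that is, the conditional
# distribution P(y|x), differs between the two domains.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample_domain(boundary_angle_deg, n=1000):
    """Draw inputs and label them by a linear boundary rotated by the given angle."""
    X = rng.normal(size=(n, 2))
    theta = np.deg2rad(boundary_angle_deg)
    w = np.array([np.cos(theta), np.sin(theta)])
    y = (X @ w > 0).astype(int)
    return X, y

Xs, ys = sample_domain(boundary_angle_deg=0)    # source domain/task
Xt, yt = sample_domain(boundary_angle_deg=40)   # target domain/task with a shifted concept

model = LogisticRegression().fit(Xs, ys)        # traditional learning on the source only
print("accuracy on the source domain:", model.score(Xs, ys))   # high
print("accuracy on the target domain:", model.score(Xt, yt))   # noticeably lower
```

The drop in target accuracy is exactly the gap that the transfer learning methods in this book try to close by also using the source knowledge during target learning.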
Based on this definition, we can formulate different ways to categorize exist-
ing transfer learning studies into different settings. For instance, based on the ho-
mogeneity of the feature spaces and/or label spaces, we can categorize transfer
learning into two settings: (1) homogeneous transfer learning and (2) heteroge-
neous transfer learning, whose definitions are described as follows (Pan, 2014).1

Definition 1.2 (homogeneous transfer learning) Given a source domain $\mathcal{D}_s$ and a learning task $\mathcal{T}_s$, and a target domain $\mathcal{D}_t$ and a learning task $\mathcal{T}_t$, homogeneous transfer learning aims to help improve the learning of the target predictive function $f_t(\cdot)$ for $\mathcal{D}_t$ using the knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$, where $\mathcal{X}_s \cap \mathcal{X}_t \neq \emptyset$ and $\mathcal{Y}_s = \mathcal{Y}_t$, but $P_{X_s} \neq P_{X_t}$ or $P_{Y_s|X_s} \neq P_{Y_t|X_t}$.

Definition 1.3 (heterogeneous transfer learning) Given a source domain $\mathcal{D}_s$ and a learning task $\mathcal{T}_s$, and a target domain $\mathcal{D}_t$ and a learning task $\mathcal{T}_t$, heterogeneous transfer learning aims to help improve the learning of the target predictive function $f_t(\cdot)$ for $\mathcal{D}_t$ using the knowledge in $\mathcal{D}_s$ and $\mathcal{T}_s$, where $\mathcal{X}_s \cap \mathcal{X}_t = \emptyset$ or $\mathcal{Y}_s \neq \mathcal{Y}_t$.

Besides using the homogeneity of the feature spaces and label spaces, we can
also categorize existing transfer learning studies into the following three settings
by considering whether labeled data and unlabeled data are available in the tar-
get domain: supervised transfer learning, semi-supervised transfer learning and
unsupervised transfer learning. In supervised transfer learning, only a few labeled
data are available in the target domain for training, and we do not use the unla-
beled data for training. For unsupervised transfer learning, there are only unla-
beled data available in the target domain. In semi-supervised transfer learning,
sufficient unlabeled data and a few labeled data are assumed to be available in
the target domain.
To design a transfer learning algorithm, we need to consider the following three
main research issues: (1) when to transfer, (2) what to transfer and (3) how to
transfer.
When to transfer asks in which situations transfer should be performed.
Likewise, we are interested in knowing in which situations knowledge should not
be transferred. In some situations, when the source domain and the target do-
main are not related to each other, brute-force transfer may be unsuccessful. In
the worst case, it may even hurt the performance of learning in the target domain,
a situation that is often referred to as negative transfer. Most current studies
on transfer learning focus on “what to transfer” and “how to transfer,” by implic-
itly assuming that the source domain and the target domain are related to each
other. However, how to avoid negative transfer is an important open issue that is
attracting more and more attention.
What to transfer determines which part of knowledge can be transferred across
domains or tasks. Some knowledge is specific for individual domains or tasks,
and some knowledge may be common between different domains such that it may help improve performance for the target domain or task. Note that the term “knowledge” is very general. Thus, in practice, it needs to be specified for the specific context.

1 In the rest of the book, without explicit specification, the term “transfer learning” denotes homogeneous transfer learning.
How to transfer specifies the form that a transfer learning method takes. Differ-
ent answers to the question of “how to transfer” give a categorization for transfer
learning algorithms:

(1) instance-based algorithms, where the knowledge transferred corresponds to
the weights attached to source instances;
(2) feature-based algorithms, where the knowledge transferred corresponds to
the subspace spanned by the features in the source and target domains;
(3) model-based algorithms, where the knowledge to be transferred is embedded
in part of the source domain models; and
(4) relation-based algorithms, where the knowledge to be transferred corresponds
to rules specifying the relations between the entities in the source domain.

Each of these types of transfer learning corresponds to an emphasis on which
part of the knowledge is being considered as a vehicle to facilitate the knowl-
edge transfer. Specifically, a common motivation behind instance-based trans-
fer learning approaches is that, although the source domain labeled data can-
not be reused directly due to the domain difference, part of them can be reused
for the target domain after reweighting or resampling. In this way, the source-
domain labeled instances with large weights can be considered as “knowledge” to
be transferred across domains. An implicit assumption behind the instance-based
approaches is that the source domain and the target domain have a lot of overlap-
ping features, which means that the domains share the same or similar support.
However, in many real world applications, only a portion of the feature spaces
from the source and target domains overlap, which means that many features
cannot be directly used as bridges for the knowledge transfer. As a result, some
instance-based methods may fail to work effectively for knowledge transfer.
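As a minimal sketch of instance-based transfer under this overlapping-support assumption (illustrative only; the function names are hypothetical, and it assumes NumPy and scikit-learn), one can estimate importance weights for source instances with a probabilistic classifier that separates source inputs from target inputs, and then train a weighted model on the reweighted source data:

```python
# Instance-based transfer sketch: source instances are reweighted by an estimated
# density ratio p_t(x)/p_s(x), obtained from a classifier that distinguishes
# target inputs from source inputs. Illustrative only.
import numpy as np
from sklearn.linear_model import LogisticRegression

def importance_weights(X_source, X_target_unlabeled):
    """Estimate w(x) proportional to p_t(x)/p_s(x) for each source instance."""
    X = np.vstack([X_source, X_target_unlabeled])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target_unlabeled))])
    domain_clf = LogisticRegression(max_iter=1000).fit(X, d)
    p_target = domain_clf.predict_proba(X_source)[:, 1]      # P(domain = target | x)
    ratio = p_target / np.clip(1.0 - p_target, 1e-6, None)
    ratio *= len(X_source) / len(X_target_unlabeled)          # correct sample-size imbalance
    return ratio

def train_weighted_source_model(X_source, y_source, X_target_unlabeled):
    """Train on reweighted source data so it better matches the target marginal."""
    w = importance_weights(X_source, X_target_unlabeled)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_source, y_source, sample_weight=w)   # weighted instances carry the transferred knowledge
    return clf
```

The weights themselves are the “knowledge” being transferred: source instances that look like target instances count more, and those that do not are effectively discounted.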
Feature-based transfer learning approaches are more promising in this case. A
common idea behind feature-based approaches is to learn a “good” feature rep-
resentation for both the source domain and the target domain such that, by pro-
jecting data onto the new representation, the source domain labeled data can be
reused to train a precise classifier for the target domain. In this way, the knowl-
edge to be transferred across domains can be considered as the learned feature
representation.
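One simple way to realize this idea in code is CORAL-style correlation alignment, which whitens the source features and re-colors them with the target covariance so that a classifier trained on the transformed source data better matches the target. The sketch below is illustrative only and assumes NumPy; it is just one of many feature-based methods, not the specific algorithms discussed later in the book:

```python
# Feature-based transfer sketch: align second-order statistics of the source
# features to those of the target features (a CORAL-style transformation).
import numpy as np

def coral_align(X_source, X_target, eps=1e-6):
    """Map source features so their covariance matches the target covariance."""
    Xs = X_source - X_source.mean(axis=0)
    Xt = X_target - X_target.mean(axis=0)
    Cs = np.cov(Xs, rowvar=False) + eps * np.eye(Xs.shape[1])
    Ct = np.cov(Xt, rowvar=False) + eps * np.eye(Xt.shape[1])

    def mat_pow(C, p):
        # Matrix power of a symmetric positive (semi)definite matrix.
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(np.clip(vals, eps, None) ** p) @ vecs.T

    # Whiten the source covariance, then re-color it with the target covariance.
    Xs_aligned = Xs @ mat_pow(Cs, -0.5) @ mat_pow(Ct, 0.5)
    # Shift to the target mean as a simple first-order correction.
    return Xs_aligned + X_target.mean(axis=0)

# A classifier trained on (coral_align(Xs, Xt), ys) can then be applied to Xt.
```

Here the learned transformation, rather than any individual instance, is the knowledge that is transferred across domains.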
Model-based transfer learning approaches assume the source domain and the
target domain share some parameters or hyperparameters of the learning mod-
els. A motivation of model-based approaches is that a well-trained source model
has captured a lot of useful structure, which is general and can be transferred to
learn a more precise target model. In this way, the knowledge to be transferred is
the domain-invariant structure of the model parameters. The pretraining technique that is now widely used in deep-learning-based transfer learning is indeed a model-based approach. Specifically, the idea of pretraining is to first train a deep
learning model using sufficient source data, which could be quite different from
the target data. After the deep model is trained, a few target labeled data are used
to fine-tune part of the parameters of the pretrained deep model, for example, to
fine-tune parameters of several layers while fixing parameters of other layers.
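A minimal sketch of this pretrain-then-fine-tune recipe is shown below (illustrative only; it assumes PyTorch and torchvision, and the choice of model, frozen layers and target label space is hypothetical, with the weights string following recent torchvision versions):

```python
# Model-based transfer sketch: fine-tune a pretrained network on a small target task.
import torch
import torch.nn as nn
from torchvision import models

num_target_classes = 5                              # assumed size of the target label space
model = models.resnet18(weights="IMAGENET1K_V1")    # source-domain pretrained model

# Freeze all pretrained parameters (the transferred "knowledge").
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head; only these parameters will be fine-tuned.
model.fc = nn.Linear(model.fc.in_features, num_target_classes)

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One gradient step on a batch of labeled target-domain data."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```

Only the new head receives gradient updates in this sketch; unfreezing a few of the top layers is a common variant when more target data are available.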
Different from the three aforementioned categories of approaches, relation-
based transfer learning approaches assume that some relationships between ob-
jects (i.e., instances) are similar across domains or tasks. Once these common re-
lationships are extracted, then they can be used as knowledge for transfer learn-
ing. Note that, in this category, data in the source domain and the target domain are not required to be independent and identically distributed, as is assumed in the other three categories.

1.3 Relationship to Existing Machine Learning Paradigms


Transfer learning and machine learning are closely related. On one hand, the
aim of transfer learning encompasses that of machine learning in that its key in-
gredient is “generalization.” In other words, it explores how to develop general
and robust machine learning models that can apply to not only the training data,
but also unanticipated future data. Therefore, all machine learning models should
have the ability to conduct transfer learning. On the other hand, transfer learn-
ing differs from other branches of machine learning in that transfer learning aims
to generalize commonalities across different tasks or domains, which are “sets”
of instances, while machine learning focuses on generalizing commonalities across
“instances.” This difference makes the design of the learning algorithms quite dif-
ferent.
Specifically, machine learning algorithms such as semi-supervised learning, ac-
tive learning and transfer learning can be used to partially address the labeled
data sparsity issue for a target domain, but they have different assumptions. Semi-
supervised learning aims to address the labeled data sparsity problem in the same
domain by making use of a large amount of unlabeled data to discover an intrinsic
data structure to effectively propagate label information. Common assumptions
behind semi-supervised learning techniques are (1) the underlying intrinsic data
structure is very useful to learn a precise model even without sufficient labeled
data and (2) the training data, including labeled and unlabeled, and the unseen
test data are still represented in the same feature space and drawn from the same
data distribution.
Instead of exploring unlabeled data to train a precise model, active learning,
which is another branch in machine learning for reducing the annotation effort
of supervised learning, tries to design an active learner to pose queries, usually
in the form of unlabeled data instances to be labeled by an oracle (e.g., a human
annotator). The key motivation behind active learning is that a machine learning
algorithm can achieve greater accuracy with fewer training labels if it is allowed
to choose the data from which it learns. However, active learning assumes that there is a budget for the active learner to pose queries in the domain of interest. In some real world applications, the budget may be quite limited, which means that the labeled data queried by active learning may not be sufficient to learn an accurate classifier in the domain of interest.

Figure 1.2 Relationship of transfer learning to other learning paradigms
Transfer learning, in contrast, allows the domains, tasks and distributions used
in the training phase and the testing phase to be different. The main idea behind
transfer learning is to borrow labeled data or extract knowledge from some related
domains to help a machine learning algorithm to achieve greater performance in
the domain of interest. Thus, transfer learning can be referred to as a different
strategy for learning models with minimal human supervision, compared to semi-
supervised and active learning.
One of the learning paradigms most closely related to transfer learning is multitask
learning. Although both transfer learning and multitask learning aim to general-
ize commonality across tasks, transfer learning is focused on learning on a target
task, where some source task(s) is(are) used as auxiliary information, while mul-
titask learning aims to learn a set of target tasks jointly to improve the general-
ization performance of each learning task without any source or auxiliary tasks.
As most existing multitask learning methods consider all tasks to have the same
importance, while transfer learning only takes the performance of the target task
into consideration, some detailed designs of the learning algorithms are differ-
ent. However, most existing multitask learning algorithms can be adapted to the
transfer learning setting.
We summarize the relationships between transfer learning and other machine
learning paradigms in Figure 1.2, and the difference between transfer learning and
multitask learning in Figure 1.3.
Figure 1.3 Relationship between transfer learning and multitask learning

1.4 Fundamental Research Issues in Transfer Learning


As we mentioned earlier, there are three research issues in transfer learning,
namely, “what to transfer,” “how to transfer” and “when to transfer.” As the objec-
tive of transfer learning is to transfer knowledge across different domains, the first
question is to ask what knowledge across domains can be transferred to boost the
generalization performance of the target domain, which is referred to as the “what
to transfer” issue. After identifying the knowledge to be transferred, a follow-up
question is how to encode the knowledge into a learning algorithm to transfer,
which corresponds to the “how to transfer” issue. The “when to transfer” issue is
to ask in which situations transfer learning should be performed or can be per-
formed safely. A fundamental research question behind these three issues is how
to measure the “distance” between any pair of domains or tasks. With the distance
measure between domains or tasks, one can identify what common knowledge
between tasks can be used to reduce distance between domains or tasks, that is,
what to transfer, and figure out how to reduce the distance between domains or
tasks based on the identified common knowledge, that is, how to transfer. More-
over, with the distance measure between domains or tasks, one can logically de-
cide “when to transfer”: if the distance is very large, it is advised not to conduct
transfer learning. Otherwise, it is “safe” to do so.
A subsequent question is thus: what form should such a distance measure take? Traditionally, there are various types of statistical measures for the distance between any two probability distributions. Typical measures among them include the Kullback–Leibler divergence, the A-distance (which measures the do-
main separation) and Maximum Mean Discrepancy (MMD), to name a few. Recall
that a domain contains two components: a feature space and a marginal prob-
ability distribution, and a task also contains two components: a label space and
a conditional probability distribution. Therefore, existing statistical measures for
the distance between probability distributions could be used to measure the dis-
tance between domains or tasks by assuming the source domain (task) and the
target domain (task) share the same feature (label) space. However, there are some
limitations on using statistical distance measures for transfer learning. First, re-
searchers have found that these general distribution-based distance measures are
often too coarse to serve the purpose of measuring the transferability between two domains or tasks. Second, if the domains have different
feature spaces and/or label spaces, one has to first project the data onto the same
feature and/or label space, and then apply the statistical distance measures as a
follow-up step. Therefore, more research needs to be done on a general notion of
distances between two domains or tasks.
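To make one such measure concrete, a common (biased) empirical estimate of the squared MMD between samples $\{x_i^s\}_{i=1}^{n_s}$ and $\{x_j^t\}_{j=1}^{n_t}$ under a kernel $k$ is $\frac{1}{n_s^2}\sum_{i,j} k(x_i^s, x_j^s) + \frac{1}{n_t^2}\sum_{i,j} k(x_i^t, x_j^t) - \frac{2}{n_s n_t}\sum_{i,j} k(x_i^s, x_j^t)$. The sketch below (illustrative only, assuming NumPy; the median-heuristic bandwidth is a common choice rather than one prescribed by the text) computes this estimate with an RBF kernel:

```python
# Sketch: empirical Maximum Mean Discrepancy (MMD) between two domains,
# using an RBF kernel with a median-heuristic bandwidth. Illustrative only.
import numpy as np

def rbf_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows in A and B."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.clip(sq_dists, 0, None))

def mmd_rbf(X_source, X_target):
    """Biased empirical estimate of MMD^2 between the two samples."""
    Z = np.vstack([X_source, X_target])
    pair_d = np.sqrt(np.clip(
        np.sum(Z**2, 1)[:, None] + np.sum(Z**2, 1)[None, :] - 2 * Z @ Z.T, 0, None))
    gamma = 1.0 / (2 * np.median(pair_d[pair_d > 0]) ** 2)   # median heuristic
    k_ss = rbf_kernel(X_source, X_source, gamma).mean()
    k_tt = rbf_kernel(X_target, X_target, gamma).mean()
    k_st = rbf_kernel(X_source, X_target, gamma).mean()
    return k_ss + k_tt - 2 * k_st

# A small MMD suggests the domains are "close" and transfer is more likely to help.
```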

1.5 Applications of Transfer Learning


1.5.1 Image Understanding
Many image understanding tasks, from object recognition to activity recognition, have been studied. Typically, these computer vision tasks require a lot of la-
beled data to train a model, such as using the well-known ImageNet data set.
However, when computer vision situations slightly change, such as changing from
indoors to outdoors and from still cameras to moving cameras, the model needs
to be adapted to account for new situations. Transfer learning is an often used
technique to solve these adaptation problems.
In image analysis, many recent works combine deep learning architectures with
transfer learning. For example, Long et al. (2015) explore a deep learning archi-
tecture in which domain distances are minimized between the source and target
domains. In a paper from Facebook, Mahajan et al. (2018) apply transfer learning to image classification. The approach involves first train-
ing a deep learning model based on a very large image data set. This pretrained
model is then fine-tuned on specific tasks in a target domain, which involves rel-
atively small amounts of labeled data. The model is a deep convolutional network
trained for the task of classification based on hashtags assigned to billions of social
media images, and the target tasks are object recognition or image classification.
Their analysis shows that it is important both to increase the size of the pretraining data set and to select a closely related label space between source and tar-
get tasks. This observation suggests that transfer learning requires the design of
“label-space engineering” approaches to match source and target learning tasks.
Their work also suggests that improvements on target tasks may be obtained by
increasing source model complexity and data set sizes.
Transfer learning also allows image analysis to play an important role in appli-
cations with a large societal impact. In the work by Xie et al. (2016), authors from
Stanford University Earth Sciences apply transfer learning to predict poverty lev-
els on earth based on satellite images. First, they use daytime satellite images to predict nighttime light intensity. The resulting model is then transferred to predicting poverty levels. This results in a very accurate prediction model that requires much less
human labeling effort to build compared to traditional survey-based methods.
1.5.2 Bioinformatics and Bio-imaging


In biology, many experiments are costly and data are scarce. Examples include
bio-imaging when doctors try to use computers to discover potential diseases,
and when software models are used to scan complex DNA and protein sequences
for patterns to point to a particular illness or cure. Transfer learning has been in-
creasingly used to help leverage the knowledge from one domain to another to
address the difficulty that labeled data in biology is costly to obtain. For example,
Xu and Yang (2011) give an early survey of transfer learning and multitask learn-
ing in bioinformatics applications, and Xu et al. (2011) present a transfer learn-
ing process to identify protein cellular structures in a target domain where the
labeled data is in short supply. In biomedical image analysis, a difficult problem is
to collect enough training data to train a model for identifying image patterns that
designate illnesses such as cancer. Such identification requires large amounts of
training data. However, these data are often very expensive to obtain as they re-
quire costly human experts to label. Furthermore, the data for pretrained models
and future models are often from different distributions. These problems inspire
many research works to apply transfer learning to adapt pretrained models to
new tasks. For example, in the work by Shin et al. (2016), a pretrained model based
on ImageNet data is used as the source domain model, which is then transferred
for use in a medical image domain for thoraco-abdominal lymph node detection
and interstitial lung disease classification, with great success.

1.5.3 Recommender Systems and Collaborative Filtering


It is often the case that an online product recommendation system is difficult to
set up due to the cold start problem. This problem can be alleviated if we discover
similarities between domains and adapt a recommendation model from a ma-
ture domain to the new domain. This often saves time and resources, and can make an otherwise infeasible task successful. For example, Li et al. (2009b) and Pan
et al. (2010b) give early accounts of applying transfer learning for online recom-
mendation. In their applications, cross-domain recommendation systems trans-
fer user preference models from an existing domain (say, a book recommendation
domain) to a new domain (say, a movie recommendation domain). The scenario
corresponds to the business case where an online commerce site opens a new line
of business and wishes to quickly deploy a recommendation model for the opera-
tion in the new business line. In doing so, it must overcome the problem of a lack
of transaction data in the new business line. Another line of work is in integrat-
ing reinforcement learning and recommendation systems to allow the items that
are recommended to be both accurate according to past history of a user and po-
tentially diverse to enrich users’ interests. As an example, Liu et al. (2018) present
a bandit algorithm that balances between recommendation accuracy and topic
diversity, to allow a system to explore new topics as well as cater to users’ recent
choices. Relating to transfer learning, the work shows that the recommendation
strategy in balancing exploration and exploitation can indeed be transferred be-
tween domains.
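Returning to the book and movie example above, the following sketch (illustrative only; it assumes NumPy, the toy data and function names are hypothetical, and it is a simplified stand-in for the cited cross-domain methods rather than their actual algorithms) shares user latent factors learned from a dense source-domain rating matrix when factorizing a sparse target-domain matrix:

```python
# Sketch: cross-domain recommendation via shared user latent factors.
import numpy as np

rng = np.random.default_rng(0)

def factorize(R, mask, U=None, k=8, lr=0.005, reg=0.1, epochs=300):
    """Matrix factorization R ~ U @ V.T by gradient descent on observed entries.
    If U is given, it is kept fixed (the knowledge transferred from the source)."""
    n_users, n_items = R.shape
    learn_U = U is None
    if learn_U:
        U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        E = mask * (R - U @ V.T)              # error on observed ratings only
        if learn_U:
            U = U + lr * (E @ V - reg * U)
        V = V + lr * (E.T @ U - reg * V)
    return U, V

# Toy data: a dense, mature movie domain and a sparse, new book domain.
n_users = 6
R_movies = rng.integers(1, 6, size=(n_users, 40)).astype(float)
mask_movies = np.ones_like(R_movies)
R_books = np.zeros((n_users, 30))
observed = rng.random(R_books.shape) < 0.1         # only ~10% of book ratings observed
R_books[observed] = rng.integers(1, 6, size=int(observed.sum()))
mask_books = observed.astype(float)

U, _ = factorize(R_movies, mask_movies)            # learn user factors in the source domain
_, V_books = factorize(R_books, mask_books, U=U)   # reuse them; learn only book item factors
predicted_book_ratings = U @ V_books.T
```

Because the user factors are shared, even users with almost no book ratings receive predictions, which directly addresses the cold-start problem described above.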

1.5.4 Robotics and Autonomous Cars


In designing robotics and autonomous cars, learning from simulations is a par-
ticularly useful approach. These are examples of hardware interactions, where
it is costly to gather labeled data for training reinforcement learning and super-
vised learning models. Taylor and Stone (2007) described how transfer learning
helps by allowing researchers to build a simulated model in a more or less ideal
domain, the source domain, and then learn a policy to deal with the anticipated
events in a target domain. The target domain model can then handle more cases in the real world, including unanticipated and noisy data. When the mod-
els adapt well, much labor and many resources can be saved from retraining the
target domain model. In the work by Tai et al. (2017), a mapless motion planner was designed based on ten-dimensional sparse range findings and trained with an end-to-end deep reinforcement learning algorithm. The learned planner is then transferred to the real world by generalizing via real world samples.

1.5.5 NLP and Text Mining


Text mining is a good application area for transfer learning algorithms. Text mining aims to discover useful structural knowledge from text and apply it to other domains. Among the problems in text mining, text classification aims to label
new text documents with different class tags. A typical text classification prob-
lem is sentiment classification. On the Web, there is an enormous amount of user-generated content at online sites such as forums, blogs, social networks and so on.
It is very important to be able to summarize opinions of consumers on products
and services. Sentiment classification addresses this problem by classifying the re-
views into positive and negative categories. However, on different domains, such
as different types of products, different types of online sites and different sectors
of business, users may express their opinions using different words. As a result,
a sentiment classifier trained on one domain may perform poorly on other do-
mains. In this case, transfer learning can help adapt a well-trained sentiment clas-
sifier across different domains.
Recently, work on pretraining has yielded new insights into the nature of transfer
learning. Devlin et al. (2018) highlight one successful condition for transfer learn-
ing applications: having a sufficient amount of source domain training data. For
example, Google’s NLP system BERT (Bidirectional Encoder Representations from
Transformers) applies transfer learning to a number of NLP tasks, showing that
transfer learning with a powerful pretrained model can solve a variety of tradition-
ally difficult problems such as question answering problems (Devlin et al., 2018).
It has accomplished surprising results by leading in many tasks in the open com-
petition SQuAD 2.0 (Rajpurkar et al., 2016). The source domain consists of an extremely large corpus of natural language text, with which BERT trains a model based on bidirectional Transformers built on the attention mechanism. The pretrained model makes a variety of language-model predictions more accurately than before, and the predictive power increases
with increasing amounts of training data in the source domain. Then, the BERT
model is applied to a specific task in a target domain by adding additional small
layers to the source model in such tasks as Next Sentence classification, Ques-
tion Answering and Named Entity Recognition (NER). The transfer learning approach corresponds to model-based transfer, where most of the pretrained parameters stay the same but a selected few can be adapted with the new data in the target domain.
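A minimal sketch of this kind of model-based transfer in NLP is given below (illustrative only; it assumes PyTorch and the Hugging Face transformers library, and the toy target data are hypothetical, with "bert-base-uncased" being a standard published checkpoint):

```python
# Sketch: fine-tune a pretrained BERT encoder on a small target classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)     # new classification head for the target task

texts = ["great book, loved it", "terrible plot, do not buy"]   # toy target data
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                                # a few fine-tuning steps
    outputs = model(**batch, labels=labels)       # the loss is computed internally
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The pretrained encoder carries the source-domain knowledge; only a small amount of labeled target data is then needed to adapt it to the new task.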

1.6 Historical Notes


Many human learning activities follow the style of transfer learning. We observe
that people often apply the knowledge gained from previous learning tasks to help
learn a new task. For example, a baby can be observed to first learn how to recog-
nize its parents before using this knowledge to help it learn how to recognize other
people.
Transfer learning has deep roots in AI, psychology, educational theory and cog-
nitive science. In AI, there have been many forms of transfer learning. Learning
by analogy is one of the fundamental insights of AI. Humans can draw on past experience to solve current problems very well. In AI, there have been sev-
eral early works on analogical reasoning such as dynamic memory (Schank, 1983).
Using analogy in problem solving, Carbonell (1981) and Winston (1980) pointed
out that analogical reasoning implies that the relationship between entities must
be compared, not just the entities themselves, to allow effective recall of previous
experiences. Forbus et al. (1998) have argued for high-level structural similarity
as a basis of analogical reasoning. Holyoak and Thagard (1989) have developed a
computational theory of analogical reasoning using this strategy, when abstrac-
tion rules that allow the two instances to be mapped to a unified representation
are given as input.
Analogical problem solving is the cornerstone for case-based reasoning (CBR),
where many systems have been developed. For example, HYPO (Ashley, 1991) re-
trieves similar past cases in a legal case base to argue in support of a claim or make
counterarguments. PRODIGY (Carbonell et al., 1991) uses a collection of previ-
ous problem-solving cases as a case base, and retrieves the most similar cases for
adaptation. Most operational systems of analogical reasoning such as CBR sys-
tems (Kolodner, 1993) have relied on an assumption that the past instances and
the new target problem are in the same representational space.
Table 1.1 Notations

$\mathcal{D}$ : A data set
$\mathcal{X}$ : A feature space
$\mathcal{H}$ : A hypothesis space
$P$ : A probability distribution
$\mathbb{E}_P[\cdot]$ : Expectation with respect to distribution $P$
$\mathrm{tr}(\mathbf{A})$ : Trace of matrix $\mathbf{A}$
$\min$ : Minimization
$\max$ : Maximization
$\mathbf{I}_n$ : An $n \times n$ identity matrix
$\mathbf{I}$ : An identity matrix with the size depending on the context
$\mathbf{0}$ : A zero vector or matrix with the size depending on the context
$\mathbf{1}$ : A vector or matrix of all ones with the size depending on the context
$\|\cdot\|_p$ : The $\ell_p$ norm of a vector, where $0 \le p \le \infty$
$\|\cdot\|_1$ : The $\ell_1$ norm of a vector or matrix
$\|\cdot\|_F$ : The Frobenius norm of a matrix
$\|\cdot\|_{S(p)}$ : The Schatten $p$-norm of a matrix
$\mu_i(\cdot)$ : The $i$-th largest eigenvalue or singular value of a matrix
$\mathcal{N}(\mu, \sigma)$ : A univariate or multivariate normal distribution with mean $\mu$ and variance $\sigma$
$\|\mathbf{A}\|_{p,q}$ : The $\ell_{p,q}$ norm of a matrix, that is, $\|\mathbf{A}\|_{p,q} = \big\|\big(\|\mathbf{a}_1\|_p, \ldots, \|\mathbf{a}_n\|_p\big)\big\|_q$, where $\mathbf{a}_i$ is the $i$-th row of $\mathbf{A}$
$\mathbf{A}^{-1}$ : The inverse of a nonsingular matrix $\mathbf{A}$
$\mathbf{A}^{+}$ : The inverse of a nonsingular matrix $\mathbf{A}$ or the pseudo-inverse of a singular matrix

There have been some surveys on transfer learning in machine learning lit-
erature. Pan and Yang (2010) and Taylor and Stone (2009) give early surveys of
the work on transfer learning, where the former focused on machine learning in
classification and regression areas and the latter on reinforcement learning ap-
proaches. This book aims to give a comprehensive survey that cover both these
areas, as well as the more recent advances of transfer learning with deep learning.

1.7 About This Book


This book mainly consists of two parts. The first part introduces the foundations of transfer learning in terms of representative methodologies and theoretical studies. The second part discusses some advanced topics in transfer learning and shows some successful applications of transfer learning. The notations used in
this book are summarized in Table 1.1.
The book is the effort of years of original research and survey of the research
field by many former and current students of Professor Qiang Yang at Hong Kong
University of Science and Technology and several other organizations. In chapter order, the composition of the book is outlined as follows:

Chapter 2 covers instance-based transfer learning. One of the most straightfor-
ward transfer learning methods is to identify instances or samples from
the source domains and assign them weights. Then, these instances with
sufficiently high weights are transferred to the target domain to help train
a better machine learning model. In doing so, it is important to transfer
only those instances that can contribute to the learning in the target do-
main and at the same time avoid “negative transfer.” The instance-based
transfer learning methods can also be useful when multiple source do-
mains exist.
Chapter 3 covers feature-based transfer learning. Features constitute a major el-
ement of machine learning. They can be straightforward attributes in the
input data, such as pixels in images or words and phrases in a text docu-
ment, or they can be composite features obtained by certain nonlinear transformations of input features. Together these features comprise
a high-dimensional feature space. Feature-based transfer is to identify
common subspaces of features between source and target domains, and
allow transfer to happen in these subspaces. This style of transfer learn-
ing is particularly useful when no clear instances can be directly trans-
ferred, but some common “style” of learning can be transferred.
Chapter 4 discusses model-based transfer learning. Model-based transfer is when
parts of a learning model can be transferred to a target domain from a
source domain, where the learning in the target domain can be “fine-
tuned” based on the transferred model. Model-based transfer learning
is particularly useful when one has a fairly complete collection of data
in a source domain, and the model in the source domain can be made
very powerful in terms of coverage. Then learning in a target domain cor-
responds to adapting the general model from the source domain to a spe-
cific model in a target domain on the “edge” of a network of
domains.
Chapter 5 explores relation-based transfer learning. This chapter is particularly
useful when knowledge is coded in terms of a knowledge graph or in rela-
tional logic form. When some dictionary of translation can be instituted,
and when knowledge exists in the form of some encoded rules, this type
of transfer learning can be particularly useful.
Chapter 6 presents heterogeneous transfer learning. Sometimes, when we deal
with transfer learning, the target domain may have a completely differ-
ent feature representation from that of the source domain. For example,
we may have collected labeled data about images, but the target task is to
classify text documents. If there is some relationship between the images
and the text documents, transfer learning can still happen at the seman-
tic level, where the semantics of the common knowledge between the
source and the target domains can be extracted as a “bridge” to enable
the knowledge transfer.
Chapter 7 discusses adversarial transfer learning. Machine learning, especially
deep learning, can be designed to generate data and at the same time
classify data. This dual relationship in machine learning can be exploited
to mimic the power of imitation and creation in humans. This learn-
ing process can be modeled as a game between multiple models, and is
called adversarial learning. Adversarial learning can be very useful in em-
powering a transfer learning process, which is the subject of this chapter.
Chapter 8 discusses the use of transfer learning in reinforcement learning. Re-
inforcement learning allows rewards to be delayed, and introduces the
concept of actions and states in a learning system. Learning a policy in a
reinforcement learning problem requires a huge amount of training data,
which is time consuming to prepare. Transfer learning alleviates this pain
and is promising when the source and target domains and tasks can be
closely aligned.
Chapter 9 discusses multitask learning. So far, transfer learning has been dis-
cussed along a time line: a source domain and a model have been well
prepared before transfer learning can happen to a target domain in a later
time point. Multitask learning aims to learn at the same time point, by allowing several tasks to benefit from knowledge shared with each other. This is the style of learning when a student takes several courses in the same semester and finds that some contents or learning methodology can be shared between the courses.
Chapter 10 discusses transfer learning theory. Learning theory describes the general capability of a learning system, by relating the number of samples with
the generalization error bounds of a particular algorithm. This line of
work generally follows the methodology of probably approximately cor-
rect learning, or PAC learning. When the bound is tight, the error bound
can also be used to design new algorithms. The transfer learning theory,
when properly done, can help give assurances for a learning system’s ca-
pability.
Chapter 11 surveys transitive transfer learning. Transfer learning so far has been
discussed in a source-to-target domain transfer model. When the source and target domains are “far” from each other and there is no direct relation between the two, transfer cannot directly happen between the two do-
mains. Even though this poses difficulty for transfer learning, there are
still opportunities for transfer learning when we can find some interme-
diate domains as “stepping-stones” for knowledge to “hop over” to target
domains. For example, this might happen when we consider a student entering a university taking a calculus class; through several semesters of knowledge transfer, they eventually take some advanced physics or computing classes.
Chapter 12 presents learning to transfer as a way to achieve automated trans-
fer learning. Just as in a typical machine learning system, the engineering
process can be very tedious, as there may be many parameters to tune. As
a result, researchers introduced the concept of automatic machine learn-
ing (AutoML) to automate the parameter tuning process through auto-
matic optimization. Likewise, transfer learning requires many engineer-
ing efforts, and, when sufficient transfer learning experience is gained,
this experience can in turn become the training data for building a
parameter-tuning model for automatic transfer learning (AutoTL).
Chapter 13 presents few-shot learning. Few-shot learning arises when models have been built well enough in a source domain that only a few training examples, or even none, are required in the target domain before a target domain model is well trained.
Chapter 14 discusses lifelong machine learning. When transfer learning is en-
gaged continuously along a time line, the system can draw knowledge
from all previous experience in a lifelong manner. A challenge is to de-
cide how to store the previous knowledge and how to select the previous
experience to reuse when solving the next task in life.
Chapter 15 discusses privacy-preserving transfer learning. When transfer learn-
ing happens between two organizations, we wish to protect the sensi-
tive and private information about users and the confidential data in the
source domain. We wish to do this while transferring the knowledge itself.
Thus, care should be taken not to allow the target domain to reverse engi-
neer the sensitive data when transfer learning is applied. In this chapter,
we discuss how differential privacy is integrated with transfer learning to
protect the user privacy and ensure data confidentiality.
Chapter 16 discusses applications of transfer learning in computer vision, which
is one of the most extensive application fields of transfer learning. We
survey the work in this area, paying special attention to medical imaging
and transfer learning.
Chapter 17 discusses applications of transfer learning in NLP. NLP is one of the
main application areas of transfer learning, which requires special atten-
tion due to the language-specific nature of NLP.
Chapter 18 discusses applications of transfer learning in dialogue systems. We
particularly separated dialogue systems out of the general survey on NLP
in the previous chapter because this is an increasingly important appli-
cation area not only in its own right, but also as a human–computer in-
teraction medium that will grow in the years to come.
Chapter 19 presents applications of transfer learning in recommendation sys-
tems. Recommendation systems are a machine learning technique and, at the same time, an important application area of machine learning.
Transfer learning is particularly important in recommendation systems
because this domain constantly suffers from the so-called “cold start”
problem and data sparsity problem where not enough data and knowl-
edge have been gained in a newly started area. Transfer learning has
proven to be very useful in alleviating these problems.
Chapter 20 discusses applications of transfer learning in bioinformatics and bio-
imaging. Biological data are increasingly accumulated with the advancement
of genetic and biomedical technology. This gives application opportu-
nities to machine learning. However, this is a domain where collecting
high-quality samples is extremely difficult, expensive and time consum-
ing. Thus, transfer learning can be very useful, especially when the ge-
netics domain is full of data of very high dimensionality and low sample
sizes. We give an overview of works in this area.
Chapter 21 presents applications of transfer learning in activity recognition based
on sensors. Activity recognition refers to finding people’s activities from
sensor readings, which can be very useful for assisted living, security and
a wide range of other applications. A challenge in this domain is the lack
of labeled data, and this challenge is particularly well suited for transfer learning
to address.
Chapter 22 discusses applications of transfer learning in urban computing. There
are many machine learning problems to address in urban computing,
ranging from traffic prediction to pollution forecast. When data has been
collected in one city, the model can be transferred to a newly considered
city via transfer learning, especially when there is not sufficient high-
quality data in these new cities.
Chapter 23 gives a summary of the whole book with an outlook for future works.