A Survey On Multi-Task Learning
Abstract—Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey of MTL from the perspectives of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we give a definition of MTL and then classify different MTL algorithms into five categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach and decomposition approach, and discuss the characteristics of each approach. To further improve the performance of learning tasks, MTL can be combined with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, we review online, parallel and distributed MTL models as well as dimensionality reduction and feature hashing to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance and we review representative works in this paper. Finally, we present theoretical analyses and discuss several future directions for MTL.
1 INTRODUCTION

Humans can learn multiple tasks simultaneously, and during this learning process, they can use the knowledge learned in one task to help the learning of another task. For example, according to our experience in learning to play tennis and squash, we find that the skill of playing tennis can help us learn to play squash and vice versa. Inspired by such human learning ability, Multi-Task Learning (MTL) [1], a learning paradigm in machine learning, aims to learn multiple related tasks jointly so that the knowledge contained in one task can be leveraged by other tasks, with the hope of improving the generalization performance of all the tasks at hand.

In its early stages, an important motivation of MTL was to alleviate the data sparsity problem, where each task has only a limited number of labeled data. In the data sparsity problem, the number of labeled data in each task is insufficient to train an accurate learner, while MTL aggregates the labeled data in all the tasks in the spirit of data augmentation to obtain a more accurate learner for each task. From this perspective, MTL can help reuse existing knowledge and reduce the cost of manual labeling for learning tasks. As the era of "big data" arrives in some areas such as computer vision and Natural Language Processing (NLP), it has been found that deep MTL models can achieve better performance than their single-task counterparts. One reason that MTL is effective is that it utilizes more data from different learning tasks than single-task learning does. With more data, MTL can learn more robust and universal representations for multiple tasks and more powerful models, leading to better knowledge sharing among tasks, better performance of each task, and a lower risk of overfitting in each task.

MTL is related to other learning paradigms in machine learning, including transfer learning [2], multi-label learning [3] and multi-output regression. The setting of MTL is similar to that of transfer learning but with significant differences. In MTL, there is no distinction among the different tasks and the objective is to improve the performance of all the tasks. Transfer learning, however, aims to improve the performance of a target task with the help of source tasks, hence the target task plays a more important role than the source tasks. In a word, MTL treats all the tasks equally, while in transfer learning the target task attracts most of the attention. From the perspective of knowledge flow, knowledge in transfer learning flows from the source task(s) to the target task, whereas in multi-task learning knowledge is shared between any pair of tasks, as illustrated in Fig. 1a. Continual learning [4], in which tasks come sequentially, learns tasks one by one, while MTL learns multiple tasks together. In multi-label learning and multi-output regression, each data point is associated with multiple labels, which can be categorical or numeric. If we treat each possible label as a task, multi-label learning and multi-output regression can be viewed in some sense as a special case of multi-task learning in which different tasks always share the same data during both the training and testing phases. On the one hand, this characteristic of multi-label learning and multi-output regression leads to research issues different from those of MTL. For example, the ranking loss, which enforces the scores (e.g., the classification probabilities) of labels associated with a data point to be larger than those of absent labels, can be used for multi-label learning, but it does not fit MTL, where different tasks possess different data. On the other hand, this characteristic of multi-label learning and multi-output regression is invalid in MTL problems. For example, in the MTL problem discussed in Section 2.7, where each task is to predict the disease symptom score of Parkinson's disease for a patient based on 19 bio-medical features, different patients/tasks should not share the bio-medical data. In a word, multi-label learning and multi-output regression are different from multi-task learning, as illustrated in Fig. 1b, and hence we will not survey the literature on multi-label learning and multi-output regression. Moreover, multi-view learning is another learning paradigm in machine learning, where each data point is associated with multiple views, each of which consists of a set of features. Even though different views have different sets of features, all the views are used together to learn for the same task, and hence multi-view learning belongs to single-task learning with multiple sets of features, which is different from MTL as shown in Fig. 1c.

Fig. 1. Illustrations for differences between MTL and other learning paradigms.

Over the past decades, MTL has attracted much attention in the artificial intelligence and machine learning communities. Many MTL models have been devised and many MTL applications in other areas have been explored. Moreover, many analyses have been conducted to study theoretical problems in MTL. This paper serves as a survey on MTL from the perspective of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we first give a definition of MTL and then classify different MTL algorithms into five categories: the feature learning approach, which can be further categorized into the feature transformation and feature selection approaches, the low-rank approach, the task clustering approach, the task relation learning approach, and the decomposition approach. After that, we discuss the combination of MTL with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. To handle a large number of tasks, we review online, parallel and distributed MTL models. For data in a high-dimensional space, feature selection, dimensionality reduction and feature hashing are introduced as vital tools to process them. As a promising learning paradigm, MTL has many applications in various areas and here we briefly review its applications in computer vision, bioinformatics, health informatics, speech, NLP, web, etc. From the perspective of theoretical analyses on MTL, we review relevant works. At last, we discuss several future directions for MTL.¹

¹ For an introduction to MTL without technical details, please refer to [5].

2 MTL MODELS

In order to fully characterize MTL, we first give the definition of MTL.

Definition (Multi-Task Learning). Given m learning tasks {T_i}_{i=1}^m where all the tasks or a subset of them are related, multi-task learning aims to learn the m tasks together to improve the learning of a model for each task T_i by using the knowledge contained in all or some of the other tasks.

Based on the definition of MTL, we focus on supervised learning tasks in this section since most MTL studies fall into this setting; other types of tasks are reviewed in the next section. In the setting of supervised learning tasks, a task T_i is usually accompanied by a training dataset D_i consisting of n_i training samples, i.e., D_i = {(x_j^i, y_j^i)}_{j=1}^{n_i}, where x_j^i ∈ R^{d_i} is the jth training instance in T_i and y_j^i is its label. We denote by X_i the training data matrix for T_i, i.e., X_i = (x_1^i, ..., x_{n_i}^i). When different tasks lie in the same feature space, implying that d_i equals d_j for any i ≠ j, this setting is the homogeneous-feature MTL; otherwise it corresponds to the heterogeneous-feature MTL. Without special explanation, the default MTL setting is the homogeneous-feature MTL. Here we need to distinguish the heterogeneous-feature MTL from the heterogeneous MTL. In [6], the heterogeneous MTL is considered to consist of different types of supervised tasks including classification and regression problems, and here we generalize it to a more general setting where the heterogeneous MTL consists of tasks of different types, including supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, multi-view learning and graphical models. The opposite of the heterogeneous MTL is the homogeneous MTL, which consists of tasks of only one type. In a word, the homogeneous and heterogeneous MTL differ in the type of learning tasks, while the homogeneous-feature MTL differs from the heterogeneous-feature MTL in terms of the original feature representations.
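To make the notation concrete, the following toy sketch (our illustration, not from the survey) builds one training set D_i per task as numpy arrays and checks whether the setting is homogeneous-feature MTL, i.e., whether all d_i are equal; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_mtl_problem(dims=(10, 10, 10), samples=(50, 40, 60)):
    """Build a toy supervised MTL problem: one training set D_i = (X_i, y_i) per task T_i."""
    tasks = []
    for d_i, n_i in zip(dims, samples):
        X_i = rng.normal(size=(n_i, d_i))               # n_i instances x_j^i in R^{d_i}
        w_i = rng.normal(size=d_i)                      # hidden task-specific parameters
        y_i = X_i @ w_i + 0.1 * rng.normal(size=n_i)    # labels y_j^i (regression tasks here)
        tasks.append({"X": X_i, "y": y_i})
    return tasks

def is_homogeneous_feature_mtl(tasks):
    """Homogeneous-feature MTL: every task has the same feature dimensionality d_i."""
    return len({task["X"].shape[1] for task in tasks}) == 1

tasks = make_toy_mtl_problem()
print(is_homogeneous_feature_mtl(tasks))                                    # True: all d_i equal
print(is_homogeneous_feature_mtl(make_toy_mtl_problem(dims=(10, 12, 8))))   # False: heterogeneous-feature MTL
```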
If the activation function of a hidden layer is linear, then the transformation it defines is a linear function, and otherwise it is nonlinear. Compared with multi-layer feedforward neural networks used for single-task learning, the difference in the network architecture lies in the output layer: in single-task learning there is only one output unit, while in MTL there are m of them. In [8], the radial basis function network, which has only one hidden layer, is extended to MTL by greedily determining the structure of the hidden layer. Different from these neural network models, Silver et al. [9] propose a context-sensitive multi-task neural network which has only one output unit shared by different tasks but takes a task-specific context as an additional input.

Different from multi-layer feedforward neural networks, which are connectionist models, the multi-task feature learning (MTFL) method [10] is formulated under the regularization framework with the objective function

\min_{A,U,b} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, (a^i)^T U^T x_j^i + b_i) + \lambda\|A\|_{2,1}^2 \quad \mathrm{s.t.}\ UU^T = I,   (1)

where l(·,·) denotes a loss function such as the hinge loss or the square loss, b = (b_1, ..., b_m)^T is a vector of offsets for all the tasks, U ∈ R^{d×d} is a square transformation matrix, A ∈ R^{d×m} contains the model parameters of all the tasks with its ith column a^i as the model parameters of the ith task after the transformation, the ℓ_{2,1} norm of a matrix A, denoted by ||A||_{2,1}, equals the sum of the ℓ_2 norms of the rows of A, I denotes an identity matrix of appropriate size, and λ is a positive regularization parameter. The first term in the objective function of problem (1) measures the empirical loss on the training sets of all the tasks and the second one enforces A to be row-sparse via the ℓ_{2,1} norm, which is equivalent to selecting features after the transformation, while the constraint enforces U to be orthogonal. Different from the multi-layer feedforward neural network, whose hidden representations may be redundant, the orthogonality of U prevents the MTFL method from learning redundant representations. As proved in [10], problem (1) is equivalent to

\min_{W,D,b} L(W,b) + \lambda\,\mathrm{tr}(W^T D^{-1} W) \quad \mathrm{s.t.}\ D \succeq 0,\ \mathrm{tr}(D) \le 1,   (2)

where L(W,b) = \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, (w^i)^T x_j^i + b_i) denotes the total training loss, tr(·) denotes the trace of a square matrix, w^i = U a^i is the model parameter vector for T_i, W = (w^1, ..., w^m), 0 denotes a zero vector or matrix of appropriate size, M^{-1} for any square matrix M denotes its inverse when it is nonsingular or otherwise its pseudo-inverse, and B ⪰ C means that B − C is positive semidefinite. Based on this formulation, we can see that the MTFL method learns a feature covariance D for all the tasks, which will be interpreted from a probabilistic perspective in Section 2.8. Given D, the learning of different tasks can be decoupled, which facilitates parallel computing. Given W, D has an analytical solution D = (WW^T)^{1/2}/tr((WW^T)^{1/2}), and by plugging this solution into problem (2), we can see that the regularizer on W is the squared trace norm. Then Argyriou et al. [11] extend problem (2) to a general formulation where the second term in the objective function becomes tr(W^T f(D) W) with f(D) operating on the spectrum of D, and discuss the condition on f(·) that makes the whole problem convex.

Similar to the MTFL method, the multi-task sparse coding method [12] learns a linear transformation on features, with the objective function formulated as

\min_{A,U,b} L(UA, b) \quad \mathrm{s.t.}\ \|a^i\|_1 \le \lambda\ \forall i \in [m],\ \|u^j\|_2 \le 1\ \forall j \in [D],   (3)

where a^i, the ith column of A, contains the model parameters of the ith task, u^j is the jth column of U, [c] for an integer c denotes the set of integers from 1 to c, ||·||_1 denotes the ℓ_1 norm of a vector or matrix and equals the sum of the absolute values of its entries, and ||·||_2 denotes the ℓ_2 norm of a vector. Here the transformation U ∈ R^{d×D} is also called the dictionary in sparse coding and is shared by all the tasks. Compared with the MTFL method, where U in problem (1) is a d × d orthogonal matrix, U in problem (3) is overcomplete, which implies that D is larger than d, with each column having a bounded ℓ_2 norm. Another difference is that in problem (1) A is enforced to be row-sparse, while in problem (3) it is only sparse via the first constraint. With a similar idea to the multi-task sparse coding method, Zhu et al. [13] propose a multi-task infinite support vector machine via the Indian buffet process; the difference is that in [13] the dictionary is sparse and the model parameters are non-sparse. In [14], the spike and slab prior is used to learn sparse model parameters for multi-output regression problems where transformed features are induced by Gaussian processes and shared by different outputs.

Recently, deep learning has become popular due to its capacity to learn nonlinear features, which facilitates the learning of invariant features for multiple tasks, and hence many deep multi-task models belonging to this approach have been proposed, with each task modeled by a deep neural network. Here we classify deep multi-task models in this approach into three main categories. The first category [15], [16], [17], [18], [19] learns a common feature representation for multiple tasks by sharing the first several layers, in an architecture similar to Fig. 2. However, different from Fig. 2, deep MTL models in this category have a large number of shared layers, which have general structures such as convolutional layers and pooling layers. Building on the first category, the second category uses adversarial learning, which is inspired by generative adversarial networks, to learn a common feature representation for MTL, as done in [20], [21]. Specifically, there are three networks in such adversarial multi-task models, including a feature network N_f, a classification network N_c and a domain network N_d. Based on N_f, N_c minimizes the training loss for all the tasks, while N_d aims to distinguish which task a data instance comes from. The objective function of such models is usually formulated as

\min_{\theta_f,\theta_c}\max_{\theta_d} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} \left[ l(y_j^i, N_c(N_f(x_j^i))) - \lambda\, l_{ce}(d_j^i, N_d(N_f(x_j^i))) \right],

where θ_f, θ_c and θ_d denote the parameters of N_f, N_c and N_d, respectively, l_ce denotes the cross-entropy loss, and d_j^i denotes the task index of x_j^i.
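To illustrate this adversarial objective, here is a minimal PyTorch sketch of a three-network model with a shared feature network, per-task classification heads and a task discriminator trained through a gradient-reversal layer, which is one common way to implement such a min-max objective. All layer sizes and names are illustrative assumptions, not taken from [20], [21].

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialMTLNet(nn.Module):
    def __init__(self, in_dim, feat_dim, num_classes, num_tasks, lam=0.1):
        super().__init__()
        self.lam = lam
        # Feature network N_f shared by all tasks.
        self.feature_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Classification network N_c: one head per task.
        self.class_heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_tasks)])
        # Domain network N_d: predicts which task an instance comes from.
        self.domain_net = nn.Linear(feat_dim, num_tasks)

    def forward(self, x, task_id):
        feat = self.feature_net(x)
        class_logits = self.class_heads[task_id](feat)
        # Gradient reversal: N_d is trained to identify the task, while N_f is pushed to confuse it.
        domain_logits = self.domain_net(GradReverse.apply(feat, self.lam))
        return class_logits, domain_logits

# One illustrative training step on a mini-batch (x, y) drawn from task t.
model = AdversarialMTLNet(in_dim=20, feat_dim=32, num_classes=2, num_tasks=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
x, y, t = torch.randn(16, 20), torch.randint(0, 2, (16,)), 3
class_logits, domain_logits = model(x, t)
task_labels = torch.full((16,), t)              # d_j^i: the task index of each instance
loss = ce(class_logits, y) + ce(domain_logits, task_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```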
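Returning to problem (2) above, its structure suggests a simple alternating scheme: with D fixed, each task reduces to an independent regularized problem, and with W fixed, D has the closed form given earlier. The numpy sketch below assumes a squared loss, no offsets and a small ridge term eps*I for numerical stability; these simplifications are ours and the code is only a sketch, not the implementation of [10].

```python
import numpy as np
from scipy.linalg import sqrtm

def mtfl_alternating(Xs, ys, lam=0.1, eps=1e-6, num_iters=50):
    """Alternating minimization for problem (2): squared loss, no offsets (a simplification)."""
    d = Xs[0].shape[1]
    D = np.eye(d) / d                        # feasible start: tr(D) = 1
    W = np.zeros((d, len(Xs)))
    for _ in range(num_iters):
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        # W-step: with D fixed, each task decouples into a generalized ridge problem.
        for i, (X, y) in enumerate(zip(Xs, ys)):
            n_i = X.shape[0]
            W[:, i] = np.linalg.solve(X.T @ X / n_i + lam * D_inv, X.T @ y / n_i)
        # D-step: closed-form minimizer of tr(W^T D^{-1} W) subject to tr(D) <= 1.
        root = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
        D = root / np.trace(root)
    return W, D

# Tiny usage example with two related regression tasks.
rng = np.random.default_rng(0)
w_shared = rng.normal(size=5)
Xs = [rng.normal(size=(30, 5)) for _ in range(2)]
ys = [X @ w_shared + 0.05 * rng.normal(size=30) for X in Xs]
W, D = mtfl_alternating(Xs, ys)
print(np.round(D, 3))   # learned feature covariance shared by the tasks
```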
are among the first to study the multi-task feature selection (MTFS) problem based on the ℓ_{2,1} norm, with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\|W\|_{2,1}.   (4)

The regularizer on W in problem (4) enforces W to be row-sparse, which in turn helps select important features. In [23], a path-following algorithm is proposed to solve problem (4), and then Liu et al. [24] employ an optimal first-order optimization method to solve it. Compared with problem (1), we can see that problem (4) is similar to the MTFL method without learning the transformation U. Lee et al. [25] propose a weighted ℓ_{2,1} norm for multi-task feature selection where the weights can be learned as well, and problem (4) is extended in [26] to a general case where feature groups can overlap with each other. In order to make problem (4) more robust to outliers, a square-root loss function is investigated.

The non-negative constraint on u_j is to keep the model identifiable. It has been proved in [31] that problem (7) leads to a regularizer \sum_{j=1}^d \sqrt{\|w^j\|_1}, the square root of the ℓ_{1,1} norm regularization. Moreover, Wang et al. [32] extend problem (7) to a general situation where the regularizer becomes \lambda_1\sum_{i=1}^m \|\hat{w}^i\|_p^p + \lambda_2\|u\|_q^q. By utilizing a priori information describing the task relations in a hierarchical structure, Han et al. [33] propose a multi-component product-based decomposition for w_{ij}, where the number of components in the decomposition can be arbitrary instead of only 2 as in [31], [32]. Similar to [31], Jebara [34] proposes to learn a binary indicator vector to do multi-task feature selection based on the maximum entropy discrimination formalism.

Similar to [33], where a priori information is given to describe task relations in a hierarchical/tree structure, Kim and Xing [35] utilize the given tree structure to design a regularizer on W as f(W) = \sum_{i=1}^d\sum_{v\in V} \lambda_v \|w_{i,G_v}\|_2, where V denotes the set of nodes in the given tree structure, G_v
denotes the set of leaf nodes (i.e., tasks) in the sub-tree rooted at node v, and w_{i,G_v} denotes the subvector of the ith row of W indexed by G_v. This regularizer not only enforces each row of W to be sparse, as the ℓ_{2,1} norm does in problem (4), but also induces sparsity in subsets of each row of W based on the tree structure.

Different from conventional multi-task feature selection methods, which assume that different tasks share a set of original features, Zhou et al. [36] consider a different scenario where useful features in different tasks have no overlap. In order to achieve this, an exclusive Lasso model is proposed with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\|W\|_{1,2}^2,

where the regularizer is the squared ℓ_{1,2} norm on W.

Another way to select common features for MTL is to use sparse priors to design probabilistic or Bayesian models. For ℓ_{p,1}-regularized multi-task feature selection, Zhang et al. [37] propose a probabilistic interpretation where the ℓ_{p,1} regularizer corresponds to a generalized normal prior: w_i^j ~ GN(·|0, ρ_j, p), where · denotes a (random) variable that we do not want to introduce explicitly. Based on this interpretation, Zhang et al. [37] further propose a probabilistic framework for multi-task feature selection, in which task relations and outlier tasks can be identified, based on the matrix-variate generalized normal prior.

In [38], a generalized horseshoe prior is proposed to do feature selection for MTL as

P(w_i) = \int \prod_{j=1}^d N\!\left(w_i^j \,\Big|\, 0, \frac{u_i^j}{v_i^j}\right) N(u_i | 0, \rho^2 C)\, N(v_i | 0, \gamma^2 C)\, du_i\, dv_i,

where N(·|μ, σ) denotes a univariate or multivariate normal distribution with μ as the mean and σ as the variance or covariance matrix, u_i^j and v_i^j are the jth entries of u_i and v_i, respectively, and ρ, γ are hyperparameters. Here C, which is shared by all the tasks, denotes the feature correlation matrix to be learned from data, and it encodes an assumption that different tasks share identical feature correlations. When C becomes an identity matrix, which means that the features are independent, this prior degenerates to the horseshoe prior.

Hernandez-Lobato et al. [39] propose a probabilistic model based on the horseshoe prior as

P(w_i^j) = \left[p(w_i^j)^{h_i^j}\,\delta_0^{1-h_i^j}\right]^{z_j} \left[p(w_i^j)^{t_i^j}\,\delta_0^{1-t_i^j}\right]^{v_i(1-z_j)} \left[p(w_i^j)^{g_j}\,\delta_0^{1-g_j}\right]^{(1-v_i)(1-z_j)},   (8)

where δ_0 denotes the point mass at zero and the binary variables z_j, v_i, h_i^j, t_i^j and g_j encode different situations of features and tasks. So this model can also handle outlier tasks, but in a different way from [37].

2.1.3 Comparison Between Two Sub-Categories
The two sub-categories have different characteristics: the feature transformation approach learns a transformation of the original features as the new representation, while the feature selection approach selects a subset of the original features as the new representation for all the tasks. Based on the characteristics of those two approaches, the feature selection approach can be viewed as a special case of the feature transformation approach in which the transformation matrix is a diagonal 0/1 matrix whose diagonal entries with value 1 correspond to the selected features. By selecting a subset of the original features as the new representation, the feature selection approach has better interpretability.

2.2 Low-Rank Approach
The relatedness among multiple tasks can imply that W is of low rank, leading to the low-rank approach. For example, if the ith, jth and kth tasks are related in that the model parameters w_i of the ith task are a linear combination of those of the other two tasks, then it is easy to show that the rank of W is at most m − 1 and hence W is of low rank. From this perspective, the more related the tasks are, the lower the rank of W is.

Ando and Zhang [40] assume that the model parameters of different tasks share a low-rank subspace in part; specifically, w_i takes the form

w_i = u_i + Q^T v_i.   (9)

Here Q ∈ R^{h×d}, with h < d, is the low-rank subspace shared by the multiple tasks. Then we can write this in matrix form as W = U + Q^T V. Based on this form of W, the objective function proposed in [40] is formulated as

\min_{U,V,Q,b} L(U + Q^T V, b) + \lambda\|U\|_F^2 \quad \mathrm{s.t.}\ QQ^T = I,   (10)

where ||·||_F denotes the Frobenius norm. The orthonormal constraint on Q in problem (10) makes the subspace non-redundant. When λ is large enough, the optimal U can become a zero matrix, and hence problem (10) is very similar to problem (1) except that there is no regularization on V in problem (10) and that Q has a smaller number of rows than columns. Chen et al. [41] generalize problem (10) as

\min_{U,V,Q,b} L(W, b) + \lambda_1\|U\|_F^2 + \lambda_2\|W\|_F^2 \quad \mathrm{s.t.}\ W = U + Q^T V,\ QQ^T = I.   (11)
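As a quick numerical check of the rank argument at the beginning of Section 2.2 (a toy example of ours, not from the survey): if one task's parameter vector is a linear combination of two others, the parameter matrix W loses rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5
W = rng.normal(size=(d, m))
# Make the 5th task's parameters a linear combination of tasks 1 and 2.
W[:, 4] = 0.7 * W[:, 0] - 1.3 * W[:, 1]
print(np.linalg.matrix_rank(W))   # prints 4, i.e., at most m - 1, so W is of low rank
```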
than that of the non-convex problem (11). Compared with the alternative objective function (2) in the MTFL method, problem (12) has a similar formulation in which M models the feature covariance for all the tasks. Problem (10) is extended in [42] to a general case where the different w_i's lie in a manifold instead of a subspace. Moreover, in [43], a latent variable model is proposed for W with the same decomposition as Eq. (9), and it can provide a framework for MTL by modeling more cases than problem (10), such as task clustering, sharing a sparse representation, duplicate tasks and evolving tasks.

It is well known that using the trace norm as a regularizer can make a matrix have low rank, and hence this regularization is suitable for MTL. Specifically, an objective function with trace norm regularization is proposed in [44] as

\min_{W,b} L(W,b) + \lambda\|W\|_{S(1)},   (13)

where μ_i(W) denotes the ith smallest singular value of W and \|W\|_{S(1)} = \sum_{i=1}^{\min(m,d)} \mu_i(W) denotes the trace norm of the matrix W. Based on the trace norm, Han and Zhang [45] propose a capped trace regularizer with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\sum_{i=1}^{\min(m,d)} \min(\mu_i(W), \theta).   (14)

With the use of the threshold θ, the capped trace regularizer only penalizes small singular values of W, which is related to the determination of the rank of W. When θ is large enough, the capped trace regularizer becomes the trace norm and hence problem (14) reduces to problem (13). Moreover, a spectral k-support norm is proposed in [46] as an improvement over the trace norm regularization.

The trace norm regularization has been extended to regularize model parameters in deep multi-task models. Specifically, the weights in the last several fully connected layers of deep multi-task neural networks can be viewed as the parameters of the learners of all the tasks. In this view, the weights connecting two consecutive layers for one task can be organized in a matrix, and hence the weights of all the tasks can form a tensor. Based on such tensor representations, several tensor trace norms, which are based on the trace norm, are used in [47] as regularizers to identify the low-rank structure of the parameter tensor.

2.3 Task Clustering Approach
The task clustering approach assumes that different tasks form several clusters, each of which consists of similar tasks. As indicated by its name, this approach has a close connection with clustering algorithms, and it can be viewed as an extension of clustering algorithms to the task level, while conventional clustering algorithms operate on the data level.

Thrun and Sullivan [48] propose the first task clustering algorithm by using a weighted nearest neighbor classifier for each task, where the initial weights defining the weighted euclidean distance are learned by minimizing pairwise within-class distances and maximizing pairwise between-class distances simultaneously within each task. Then a task transfer matrix A is defined with its (i,j)th entry a_{ij} recording the generalization accuracy obtained for task T_i by using task T_j's distance metric. Based on A, the m tasks can be grouped into r clusters {C_i}_{i=1}^r by maximizing \sum_{t=1}^r \frac{1}{|C_t|}\sum_{i,j\in C_t} a_{ij}, where |·| denotes the cardinality of a set. After obtaining the cluster structure among all the tasks, the training data of the tasks in a cluster are pooled together to learn the final weighted nearest neighbor classifier. This approach has been extended to an iterative learning process [49] in a similar way to k-means clustering.

Bakker and Heskes [50] propose a multi-task Bayesian neural network model with a network structure similar to Fig. 2, where the input-to-hidden weights are shared by all the tasks but the hidden-to-output weights are task-specific. By defining w_i as the vector of hidden-to-output weights for task T_i, the multi-task Bayesian neural network assigns a mixture of Gaussians prior to it: w_i ~ \sum_{j=1}^r \pi_j N(\cdot|m_j, \Sigma_j), where π_j, m_j and Σ_j specify the mixing proportion, the mean and the covariance of the jth cluster. Tasks in a cluster share a Gaussian distribution. When r equals 1, this model degenerates to a case where the model parameters of different tasks share a single prior, which is similar to several Bayesian MTL models such as [51], [52], [53] that are based on Gaussian processes and t processes.

Xue et al. [54] deploy the Dirichlet process to do clustering on the task level. Specifically, it defines the prior on w_i as

w_i \sim G,\quad G \sim DP(\alpha, G_0)\ \ \forall i \in [m],

where DP(α, G_0) denotes a Dirichlet process with α as a positive scaling parameter and G_0 a base distribution. To see the clustering effect, by integrating out G, the conditional distribution of w_i, given the model parameters of the other tasks W_{-i} = {..., w_{i-1}, w_{i+1}, ...}, is

P(w_i | W_{-i}, \alpha, G_0) = \frac{\alpha}{m-1+\alpha} G_0 + \frac{1}{m-1+\alpha}\sum_{j=1, j\ne i}^m \delta_{w_j},

where δ_{w_j} denotes the distribution concentrated at the single point w_j. So w_i can be equal either to some w_j (j ≠ i) with probability 1/(m−1+α), which corresponds to the case that those two tasks lie in the same cluster, or to a new sample from G_0 with probability α/(m−1+α), which is the case that task T_i forms a new task cluster. When α is large, the chance of forming a new task cluster is large, and so α affects the number of task clusters. This model is extended in [55], [56] to a case where different tasks in a task cluster share useful features via a matrix stick-breaking process and a beta-Bernoulli hierarchical prior, respectively, and in [57] to the case where each task is a compressive sensing task. Moreover, a nested Dirichlet process is proposed in [58], [59] to use Dirichlet processes to learn both task clusters and the state structure of an infinite hidden Markov model, which handles sequential data in each task. In [60], w_i is decomposed as w_i = u_i + Q_i^T v_i similarly to Eq. (9), where u_i and Q_i are sampled according to a Dirichlet process.

Different from [50], [54], Jacob et al. [61] aim to learn task clusters under the regularization framework by considering three orthogonal aspects, including a global penalty to measure on average how large the parameters are, a measure of between-cluster variance to quantify the distance among different task clusters, and a measure of within-cluster variance to quantify the compactness of each task cluster.
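The clustering behavior of the Dirichlet process prior of Xue et al. [54] can be simulated directly from the conditional distribution above (the Chinese restaurant process view): each new task joins an existing cluster with probability proportional to its size or starts a new cluster with probability proportional to α. The sketch below simulates only the prior, not the full model, and the sizes are arbitrary; it shows how α controls the expected number of task clusters.

```python
import numpy as np

def sample_task_clusters(num_tasks, alpha, rng):
    """Sample a task-cluster assignment from the Chinese restaurant process with concentration alpha."""
    assignments = [0]                          # the first task starts cluster 0
    for i in range(1, num_tasks):
        counts = np.bincount(assignments)
        # Existing cluster j is chosen w.p. count_j/(i+alpha); a new cluster w.p. alpha/(i+alpha).
        probs = np.append(counts, alpha) / (i + alpha)
        assignments.append(rng.choice(len(probs), p=probs))
    return assignments

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    sizes = [len(set(sample_task_clusters(30, alpha, rng))) for _ in range(200)]
    print(alpha, np.mean(sizes))   # larger alpha -> more task clusters on average
```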
W has rank at most r. From the perspective of modeling, by setting u_i to be a zero vector in Eq. (9), we can see that the decomposition of W in [40] becomes similar to those in [64], [66], which in some sense shows the relation between those two approaches. Moreover, the equivalence between problems (12) and (15), two typical methods in the low-rank and task clustering approaches, has been proved in [68]. The task clustering approach can visualize the learned cluster structure, which is an advantage over the low-rank approach.

2.4 Task Relation Learning Approach
In MTL, tasks are related, and the task relatedness can be quantified via task similarity, task correlation, task covariance and so on. Here we use task relations to cover all such quantitative relatedness.

In earlier studies on MTL, task relations are assumed to be known as a priori information. In [69], [70], each task is assumed to be similar to every other task, and so the model parameters of each task are enforced to approach the average model parameters of all the tasks. In [71], [72], task similarities for each pair of tasks are given, and these studies utilize the task similarities to design regularizers to guide the learning of multiple tasks, based on the principle that the more similar two tasks are, the closer the corresponding model parameters are expected to be. A formulation similar to [71] is proposed in [73] to estimate the means of multiple distributions by learning pairwise task relations, and another similar formulation is proposed in [74] for log-density gradient estimation. Given a tree structure describing the relations among tasks, in [75] the model parameters of a task corresponding to a node in the tree are enforced to be similar to those of its parent node.

However, in most applications, task relations are not available. In this case, learning task relations from data automatically is a good option. Bonilla et al. [76] propose a multi-task Gaussian process (MTGP) by defining a prior on f_j^i, the functional value for x_j^i, as f ~ N(·|0, Σ), where f = (f_1^1, ..., f_{n_m}^m)^T contains the functional values for all the training data. Σ, the covariance matrix, defines the covariance between f_j^i and f_q^p as σ(f_j^i, f_q^p) = ω_{ip} k(x_j^i, x_q^p), where k(·,·) denotes a kernel function and ω_{ip} describes the covariance between tasks T_i and T_p. In order to keep Σ positive definite, the matrix Ω containing ω_{ip} as its (i,p)th entry is also required to be positive definite, which makes Ω the task covariance describing the similarities between tasks. Then, based on the Gaussian likelihood for the labels given f, the analytical marginal likelihood obtained by integrating out f can be used to learn Ω from data. In [77], the learning curve and generalization bound of the MTGP are studied. Since Ω in the MTGP has a point estimate, which may lead to overfitting, Zhang and Yeung [78] propose, based on a weight-space view of the MTGP, a multi-task generalized t process by placing an inverse-Wishart prior on Ω as Ω ~ IW(·|ν, Ψ), where ν denotes the degrees of freedom and Ψ is the base covariance for generating Ω. Since Ψ models the covariance between pairs of tasks, it can be determined based on the maximum mean discrepancy (MMD).

Different from [76], [78], which are Bayesian models, Zhang and Yeung [79], [80] propose a regularized multi-task model called multi-task relationship learning (MTRL) by placing a matrix-variate normal prior on W as

W \sim MN(\cdot|0, I, \Omega),   (20)

where MN(·|M, A, B) denotes a matrix-variate normal distribution with M as the mean, A the row covariance, and B the column covariance. Based on this prior as well as some likelihood function, the objective function for a modified maximum a posteriori solution is formulated as

\min_{W,b,\Omega} L(W,b) + \lambda_1\|W\|_F^2 + \lambda_2\,\mathrm{tr}(W\Omega^{-1}W^T) \quad \mathrm{s.t.}\ \Omega \succeq 0,\ \mathrm{tr}(\Omega) \le 1,   (21)

where the second term in the objective function penalizes the complexity of W, the last term is due to the matrix-variate normal prior, and the constraints control the complexity of the positive definite covariance matrix Ω. It has been proved in [79], [80] that problem (21) is jointly convex with respect to W, b and Ω. Problem (21) has been extended to multi-task boosting [81] and to multi-label learning [82] by learning label correlations. Problem (21) can also be interpreted from the perspective of reproducing kernel Hilbert spaces for vector-valued functions [83], [84], [85], [86]. Moreover, problem (21) is extended to learn sparse task relations in [87] via ℓ_1 regularization on Ω when the number of tasks is large. A model similar to problem (21) is proposed in [88] via a matrix-variate normal prior on W, W ~ MN(·|0, Ω_1, Ω_2), where Ω_1^{-1} and Ω_2^{-1} are assumed to be sparse. The MTRL model is extended in [89] to use the symmetric matrix-variate generalized hyperbolic distribution to learn a block-sparse structure in W, and in [90] to use the matrix generalized inverse Gaussian prior to learn low-rank Ω_1 and Ω_2. Moreover, the MTRL model is generalized to the multi-task feature selection problem [37] by learning task relations via the matrix-variate generalized normal distribution. Since the prior defined in Eq. (20) implies that W^T W follows a Wishart distribution W(·|0, Ω), Zhang and Yeung [91] generalize it as

(W^T W)^t \sim W(\cdot|0, \Omega),   (22)

where t is a positive integer to model high-order task relationships. Eq. (22) can induce a new prior on W, which is a generalization of the matrix-variate normal distribution, and based on this new prior a regularized method is devised in [91] to learn high-order task relations. The MTRL model has also been extended to multi-output regression [90], [92], [93], [94] by modeling the structure contained in the noise via matrix-variate priors. For deep neural networks, the MTRL method has been extended in [95] by placing a tensor-variate normal distribution as a prior on the parameter tensor of the fully connected layers.

Different from the aforementioned methods, which investigate the use of global learning models in MTL, Zhang [96] aims to learn task relations in local learning methods such as the k-nearest-neighbor (kNN) classifier by defining the learning function as a weighted voting of neighbors:

f(x_j^i) = \sum_{(p,q)\in N_k(i,j)} \sigma_{ip}\, s(x_j^i, x_q^p)\, y_q^p,   (23)

where N_k(i,j) denotes the set of task indices and instance indices of the k nearest neighbors of x_j^i, i.e., (p,q) ∈ N_k(i,j) means that x_q^p is one of the k nearest neighbors of x_j^i,
s(x_j^i, x_q^p) defines the similarity between x_j^i and x_q^p, and σ_{ip} represents the contribution of task T_p to T_i when T_p has data points that are neighbors of a data point in T_i. σ_{ip} can be viewed as the similarity from T_p to T_i. When σ_{ip} = 1 for all i and p, Eq. (23) reduces to the decision function of the kNN classifier for all the tasks. Then the objective function to learn Σ, the m × m matrix with σ_{ip} as its (i,p)th entry, can be formulated as

\min_{\Sigma} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, f(x_j^i)) + \frac{\lambda_1}{4}\|\Sigma - \Sigma^T\|_F^2 + \frac{\lambda_2}{2}\|\Sigma\|_F^2 \quad \mathrm{s.t.}\ \sigma_{ii} \ge 0\ \forall i \in [m],\ -\sigma_{ii} \le \sigma_{ij} \le \sigma_{ii}\ \forall i \ne j.   (24)

The first regularizer in problem (24) enforces Σ to be nearly symmetric and the second one penalizes the complexity of Σ. The constraints in problem (24) guarantee that the similarity from one task to itself is positive and also the largest. Similarly, a multi-task kernel regression is proposed in [96] for regression tasks.

While the task relations learned by the aforementioned methods are symmetric (except [96]), Lee et al. [97] focus on learning asymmetric task relations. Since different tasks are assumed to be related, w_i can lie in the space spanned by the columns of W, i.e., w_i ≈ W a_i, and hence we have W ≈ WA. Here the matrix A can be viewed as encoding asymmetric task relations between pairs of tasks. By assuming that A is sparse, the objective function is formulated as

\min_{W,b,A} \sum_{i=1}^m (1 + \lambda_1\|\hat{a}^i\|_1) \sum_{j=1}^{n_i} l(y_j^i, (w_i)^T x_j^i + b_i) + \lambda_2\|W - WA\|_F^2 \quad \mathrm{s.t.}\ a_{ij} \ge 0\ \forall i,j \in [m],   (25)

where â^i denotes the ith row of A with a_{ii} deleted. The term in front of the training loss of each task, i.e., 1 + λ_1||â^i||_1, not only enforces A to be sparse but also allows asymmetric information sharing from easier tasks to more difficult ones. The regularizer in problem (25) makes W approach WA, with the closeness depending on λ_2. To see the connection between problems (25) and (21), we rewrite the regularizer in problem (25) as ||W − WA||_F^2 = tr(W(I − A)(I − A)^T W^T). Based on this reformulation, the regularizer in problem (25) is a special case of that in problem (21) obtained by setting Ω^{-1} = (I − A)(I − A)^T. Though A is asymmetric, from the perspective of the regularizer, the task relations here are symmetric and act as a task precision matrix of a restrictive form.

2.5 Decomposition Approach
The decomposition approach assumes that the parameter matrix W can be decomposed into two or more component matrices {W_k}_{k=1}^h with h ≥ 2, i.e., W = \sum_{k=1}^h W_k. The objective functions of most methods in this approach can be unified as

\min_{\{W_i\}\in C_W,\, b} L\left(\sum_{k=1}^h W_k, b\right) + \sum_{k=1}^h g_k(W_k),   (26)

where the regularizer is decomposable with respect to the W_k's and C_W denotes a set of constraints on the component matrices.

To help understand problem (26), we introduce several instantiations as follows.

In [98], where h equals 2 and C_W = ∅ is an empty set, g_1(·) and g_2(·) are defined as

g_1(W_1) = \lambda_1\|W_1\|_{\infty,1},\quad g_2(W_2) = \lambda_2\|W_2\|_1,

where λ_1 and λ_2 are positive regularization parameters. Similar to problem (5), each row of W_1 is likely to be a zero row and hence g_1(W_1) can help select important features. Due to the ℓ_1 norm regularization, g_2(W_2) makes W_2 sparse. Because of the characteristics of the two regularizers, the parameter matrix W can eliminate unimportant features for all the tasks when the corresponding rows in both W_1 and W_2 are sparse. Moreover, W_2 can identify features for tasks which have their own useful features that may be outliers for other tasks. Hence this model can be viewed as a 'robust' version of problem (5).

With two component matrices, Chen et al. [99] define

g_2(W_2) = \lambda_2\|W_2\|_1,\quad C_W = \{W_1\,|\,\|W_1\|_{S(1)} \le \lambda_1\},   (27)

where g_1(W_1) = 0. Similar to problem (13), C_W makes W_1 low-rank. With the sparse regularizer g_2(W_2), W_2 makes the entire model matrix W more robust to outlier tasks in a way similar to the previous model. When λ_2 is large enough, W_2 becomes a zero matrix and then problem (27) acts similarly to problem (13).

The g_i(·)'s in [100], where C_W = ∅, are defined as

g_1(W_1) = \lambda_1\|W_1\|_{S(1)},\quad g_2(W_2) = \lambda_2\|W_2^T\|_{2,1}.   (28)

Different from the above two models, which assume that W_2 is sparse, here g_2(W_2) enforces W_2 to be column-sparse. For related tasks, their columns in W_1 are correlated via the trace norm regularization and the corresponding columns in W_2 are zero. For outlier tasks, which are unrelated to the other tasks, the corresponding columns in W_2 can take arbitrary values, and hence the model parameters in W for them have no low-rank structure even though those in W_1 may have.

In [101], these functions are defined as

g_1(W_1) = \lambda_1\|W_1\|_{2,1},\quad g_2(W_2) = \lambda_2\|W_2^T\|_{2,1},\quad C_W = \emptyset.   (29)

Similar to problem (4), g_1(W_1) makes W_1 row-sparse. Here g_2(W_2) is identical to that in [100] and makes W_2 column-sparse. Hence W_1 helps select useful features, while the non-zero columns in W_2 capture outlier tasks.

With h = 2, Zhong and Kwok [102] define

g_1(W_1) = \lambda_1 c(W_1) + \lambda_2\|W_1\|_F^2,\quad g_2(W_2) = \lambda_3\|W_2\|_F^2,\quad C_W = \emptyset,

where c(U) = \sum_{i=1}^d\sum_{k>j} |u_{ij} - u_{ik}| with u_{ij} the (i,j)th entry of a matrix U. Due to the sparse nature of the ℓ_1 norm, c(W_1) enforces corresponding entries in different columns of W_1 to become identical, which is equivalent to clustering tasks in terms of individual model parameters. The squared Frobenius norm regularizations in g_1(W_1) and g_2(W_2) penalize the complexities of W_1 and W_2. The use of W_2 improves the model flexibility when not all the tasks exhibit a clear cluster structure.
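To make such two-component models concrete, the following numpy sketch (our illustration, with a squared loss and a fixed step size as assumptions) performs proximal gradient steps for a model in the spirit of problem (28): a trace-norm penalty on W_1 handled by singular value soft-thresholding and a column-wise group-sparse penalty on W_2 handled by column-wise soft-thresholding. Because the smooth loss depends only on W_1 + W_2 and the regularizers are separable, each component gets the same loss gradient followed by its own proximal operator.

```python
import numpy as np

def prox_trace_norm(M, tau):
    """Singular value soft-thresholding: proximal operator of tau * trace norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_column_group(M, tau):
    """Column-wise group soft-thresholding: proximal operator of tau * sum_i ||column_i||_2."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def proximal_step(W1, W2, Xs, ys, lam1, lam2, step):
    """One proximal gradient step on a squared multi-task loss for W = W1 + W2."""
    grad = np.zeros_like(W1)
    for i, (X, y) in enumerate(zip(Xs, ys)):
        w_i = W1[:, i] + W2[:, i]
        grad[:, i] = X.T @ (X @ w_i - y) / X.shape[0]      # gradient of the per-task loss
    W1 = prox_trace_norm(W1 - step * grad, step * lam1)     # low-rank component
    W2 = prox_column_group(W2 - step * grad, step * lam2)   # column-sparse (outlier-task) component
    return W1, W2

rng = np.random.default_rng(0)
Xs = [rng.normal(size=(40, 6)) for _ in range(4)]
ys = [X @ rng.normal(size=6) for X in Xs]
W1, W2 = np.zeros((6, 4)), np.zeros((6, 4))
for _ in range(100):
    W1, W2 = proximal_step(W1, W2, Xs, ys, lam1=0.1, lam2=0.1, step=0.05)
```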
Different from the aforementioned methods, which have only two component matrices, an arbitrary number of component matrices is considered in [103] with

g_k(W_k) = \lambda\left[(h - k)\|W_k\|_{2,1} + (k - 1)\|W_k\|_1\right]/(h - 1),   (30)

where C_W = ∅. According to Eq. (30), W_k is assumed to be both sparse and row-sparse for all k ∈ [h]. Based on the different weights on the two terms of the regularizer of W_k, we can see that as k increases, W_k is more likely to be sparse than row-sparse. Even though each W_k is sparse or row-sparse, the entire parameter matrix W can be non-sparse, and hence this model can discover the latent sparse structure among tasks.

In the above methods, different component matrices have no direct connection. When there is a dependency among component matrices, problem (26) can model more complex structures among tasks. For example, Han and Zhang [104] define

g_k(W_k) = \lambda\sum_{i>j}\|w_k^i - w_k^j\|_2/h^{k-1}\ \ \forall k \in [h],\qquad C_W = \{\{W_k\}\,|\,|w_{k-1}^i - w_{k-1}^j| \ge |w_k^i - w_k^j|\ \forall k \ge 2,\ \forall i > j\},

where w_k^i denotes the ith column of W_k. Note that the constraint set C_W relates the component matrices, and the regularizer g_k(W_k) gives each pair w_k^i and w_k^j a chance to become identical. Once this happens for some i, j, k, then based on the constraint set C_W, w_{k'}^i and w_{k'}^j will always have the same value for k' ≥ k. This corresponds to sharing all the ancestor nodes for two internal nodes in a tree, and hence this method can learn a hierarchical structure to characterize task relations. When the constraints are removed, this method reduces to the multi-level task clustering method [63], which is a generalization of problem (16).

Another way to relate different component matrices is to use a non-decomposable regularizer, as [105] does, which is slightly different from problem (26) in terms of the regularizer. Specifically, given m tasks, there are 2^m − 1 possible non-empty task clusters. All the task clusters can be organized in a tree, where the root node is a dummy node, nodes in the second level represent groups with a single task, and the parent-child relations are 'subset of' relations. In total, there are h ≤ 2^m component matrices, each of which corresponds to a node in the tree, and hence an index a is used to denote both a level and the corresponding node in the tree. The objective function is formulated as

\min_{\{W_i\},b} L\left(\sum_{k=1}^h W_k, b\right) + \left(\sum_{v\in V}\lambda_v\left(\sum_{a\in D(v)} r(W_a)^p\right)^{1/p}\right)^2 \quad \mathrm{s.t.}\ w_a^i = 0\ \forall i \notin t(a),   (31)

where p takes a value between 1 and 2, D(v) denotes the set of all the descendants of node v, t(a) denotes the set of tasks contained in node a, w_a^i denotes the ith column of W_a, and r(W_a) reflects the relations among the tasks in node a based on W_a. The regularizer in problem (31) is used to prune the subtree rooted at each node v based on the ℓ_p norm. The constraint in problem (31) implies that for tasks not contained in a node a, the corresponding columns in W_a are zero. In [105], r(W_a) adopts the regularizer proposed in [69], which enforces the parameters of all the tasks to approach their average.

Different from deep MTL models, which are deep in terms of layers of feature representations, the decomposition approach can be viewed as a 'deep' approach in terms of model parameters, while most of the previous approaches are just shallow ones, and this gives the decomposition approach a more powerful capacity. Moreover, the decomposition approach reduces to other approaches, such as the feature learning, low-rank and task clustering approaches, when there is only one component matrix, and hence it can be considered as an improved version of those approaches.

2.6 Comparisons Among Different Approaches
Based on the above introduction, we can see that different approaches exhibit their own characteristics. Specifically, the feature learning approach can learn common features that are generic and invariant to all the tasks at hand and even to new tasks. When there exist outlier tasks which are unrelated to the other tasks, the learned features can be influenced by the outlier tasks significantly, and they may cause performance deterioration. By assuming that the parameter matrix is low-rank, the low-rank approach can explicitly learn the subspace of the parameter matrix or implicitly achieve that via some convex or non-convex regularizer. This approach is powerful, but it seems applicable only to linear models, making nonlinear extensions non-trivial to devise. The task clustering approach performs clustering on the task level in terms of model parameters, and it can identify task clusters each of which consists of similar tasks. A major limitation of the task clustering approach is that it can capture positive correlations among tasks in the same cluster but ignores negative correlations among tasks in different clusters. Moreover, even though some methods in this category can automatically determine the number of clusters, most of them still need a model selection method such as cross validation to determine it, which may bring additional computational costs. The task relation learning approach can learn model parameters and pairwise task relations simultaneously. The learned task relations can give us insights into the relations between tasks and hence improve interpretability. The decomposition approach can be viewed as an extension of other parameter-based approaches by equipping them with multi-level parameters, and hence it can model more complex task structures, e.g., tree structures. The number of components in the decomposition approach is important to the performance and needs to be carefully determined.

2.7 Benchmark Datasets and Performance Comparison
In this section, we introduce some benchmark datasets for MTL and compare the performance of different MTL models on them.

Some benchmark datasets for MTL are listed as follows.

School dataset [50]: This dataset is to estimate the examination scores of 15,362 students from 139 secondary schools in London from 1985 to 1987, where each
school is treated as a task. The input consists of four school-specific and three student-specific attributes.

SARCOS dataset:² This dataset studies a multi-output problem of learning the inverse dynamics of 7 SARCOS anthropomorphic robot arms, each of which corresponds to a task, based on 21 features, including seven joint positions, seven joint velocities and seven joint accelerations. This dataset contains 48,933 data points.

Computer Survey dataset [11]: This dataset is taken from a survey of 180 persons/tasks who rated the likelihood of purchasing one of 20 different personal computers, resulting in 36,000 data points in all the tasks. The features contain 13 different computer characteristics (e.g., price, CPU and RAM), while the output is an integer rating on the scale 0-10.

Parkinson dataset [105]: This dataset is to predict the disease symptom score of Parkinson's disease for patients at different times using 19 bio-medical features. This dataset has 5,875 data points for 42 patients, each of whom is treated as a task.

Sentiment dataset:³ This dataset is to classify reviews of four products/tasks, i.e., books, DVDs, electronics and kitchen appliances, from Amazon into two classes: positive and negative reviews. For each task, there are 1,000 positive and 1,000 negative reviews, respectively.

MHC-I dataset [61]: This dataset contains the binding affinities of 15,236 peptides with 35 MHC-I molecules. Each MHC-I molecule is considered as a task, and the goal is to predict whether a peptide binds a molecule.

Landmine dataset [54]: This dataset consists of 9-dimensional data points, whose features are extracted from radar images, from 29 landmine fields/tasks. Each task is to classify a data point into two classes (landmine or clutter). There are 14,820 data points in total.

Office-Caltech dataset [106]: This dataset contains data from 10 common categories shared by the Caltech-256 dataset and the Office dataset, the latter consisting of images collected from three distinct domains/tasks: Amazon, Webcam and DSLR, making this dataset contain 4 tasks. There are 2,533 images in all the tasks.

Office-Home dataset:⁴ This dataset consists of images from 4 different domains/tasks: artistic images, clip art, product images and real-world images. Each task contains images of 65 object categories collected in office and home settings. In total, there are about 15,500 images in all the tasks.

ImageCLEF dataset:⁵ This dataset contains 12 common categories shared by four tasks: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012 and Bing. There are about 2,400 images in all the tasks.

2. http://www.gaussianprocess.org/gpml/data/
3. http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
4. http://hemanthdv.org/OfficeHome-Dataset
5. http://imageclef.org/2014/adaptation

TABLE 1
The Performance Comparison of Representative MTL Models in the Five Approaches on Benchmark Datasets in Terms of Some Evaluation Metric
nMSE stands for 'normalized mean squared error', RMSE stands for 'root mean squared error', and AUC stands for 'Area Under Curve'. ↑ after an evaluation metric indicates that a larger value means better performance, and ↓ indicates the opposite case.

Among the above benchmark datasets, the first four consist of regression tasks while the others are classification tasks, where each task in the Sentiment, MHC-I and Landmine datasets is a binary classification problem and each task in the other three image datasets is a multi-class classification problem. In order to compare different MTL approaches on these benchmark datasets, we select some representative MTL methods from each of the five approaches introduced in the previous sections and list in Table 1 their performance as reported in the MTL literature. We also include, for comparison, the performance of Single-Task Learning (STL), which trains a learning model for each task separately. It is easy to see that MTL models perform better than their STL counterparts in most cases, which verifies the effectiveness of MTL. Usually, different datasets have their own characteristics, making them more suitable for some MTL approach. For example, according to the studies in [50], [71], [102], different tasks in the School dataset are found to be very similar to each other. According to [54], the Landmine dataset can have two task clusters, where the first cluster, consisting of the first 15 tasks, corresponds to regions that are relatively highly foliated, and the remaining tasks belong to another cluster with regions that are bare earth or desert. According to [61], it is well known in the vaccine design community that some molecules/tasks in the MHC-I dataset can be grouped into empirically defined supertypes known to have similar binding behaviors. For those three datasets, according to Table 1 we can see that the task clustering, task relation learning and decomposition approaches have better performance since they can identify the cluster structure contained in the data in a flat or hierarchical way. The other datasets do not have such an obvious structure among tasks, but some MTL models can learn task correlations, which can bring more insights for model design and the interpretation of experimental results. For example,
For example, the task correlations in the SARCOS and Sentiment datasets are shown in Tables 2 and 3 of [79], and the task similarities in the Office-Caltech dataset are shown in Fig. 3b of [95]. Moreover, for image datasets (i.e., Office-Caltech, Office-Home and ImageCLEF), deep MTL models (e.g., [67], [95]) achieve better performance than shallow models since they can learn powerful feature representations, while the remaining datasets come from diverse areas in which shallow models already perform well.

2.8 Another Taxonomy for Regularized MTL Methods
Regularized methods form a main methodology for MTL. Here we classify many regularized MTL algorithms into two main categories: learning with feature covariance and learning with task relations. The former can be viewed as a representative formulation in feature-based MTL, while the latter is for parameter-based MTL.
Objective functions in the first category can be unified as

$$\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Theta}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda}{2}\operatorname{tr}\!\left(\mathbf{W}^{T}\boldsymbol{\Theta}^{-1}\mathbf{W}\right)+f(\boldsymbol{\Theta}),\qquad(32)$$

where $f(\cdot)$ denotes a regularizer or constraint on $\boldsymbol{\Theta}$. From the perspective of probabilistic modeling, the regularizer $\frac{\lambda}{2}\operatorname{tr}(\mathbf{W}^{T}\boldsymbol{\Theta}^{-1}\mathbf{W})$ corresponds to a matrix-variate normal distribution on $\mathbf{W}$ as $\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\frac{1}{\lambda}\boldsymbol{\Theta}\otimes\mathbf{I})$. Based on this probabilistic prior, $\boldsymbol{\Theta}$ models the covariance between the features since $\frac{1}{\lambda}\boldsymbol{\Theta}$ is the row covariance matrix, where each row in $\mathbf{W}$ corresponds to a feature and different tasks share the feature covariance. All the models in this category differ in the choice of the function $f(\cdot)$ on $\boldsymbol{\Theta}$. For example, methods in [10], [40], [41] use $f(\cdot)$ to restrict the trace of $\boldsymbol{\Theta}$ as shown in problems (2) and (12). Moreover, multi-task feature selection methods based on the $\ell_{2,1}$ norm such as [23], [24], [25] can be reformulated as instances of problem (32).
Different from the first category, methods in the second category have a unified objective function as

$$\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Sigma}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda}{2}\operatorname{tr}\!\left(\mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{T}\right)+g(\boldsymbol{\Sigma}),\qquad(33)$$

where $g(\cdot)$ denotes a regularizer or constraint on $\boldsymbol{\Sigma}$. The regularizer $\frac{\lambda}{2}\operatorname{tr}(\mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{T})$ corresponds to a matrix-variate normal prior on $\mathbf{W}$ as $\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\mathbf{I}\otimes\frac{1}{\lambda}\boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ models the task relations since $\frac{1}{\lambda}\boldsymbol{\Sigma}$ is the column covariance and each column in $\mathbf{W}$ corresponds to a task. From this perspective, the two regularizers for $\mathbf{W}$ in problems (32) and (33) have different meanings even though the formulations look similar. All the methods in this category use different functions $g(\cdot)$ to learn $\boldsymbol{\Sigma}$ with different functionalities. For example, the methods in [69], [70], [71], [72], which utilize a priori information on task relations, directly learn $\mathbf{W}$ and $\mathbf{b}$ by defining $g(\boldsymbol{\Sigma})=0$. Some task clustering methods [61], [65] identify task clusters by assuming that $\boldsymbol{\Sigma}$ has a block structure. Several task relation learning methods including [79], [80], [87], [97], [107] directly learn $\boldsymbol{\Sigma}$ as a covariance matrix by constraining its trace or sparsity in $g(\boldsymbol{\Sigma})$. The trace norm regularization [44] can be formulated as an instance of problem (33).
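To make the second category concrete, the following is a minimal NumPy sketch of a BCD-style solver for a problem-(33)-like objective with squared losses: it alternates between gradient updates of $\mathbf{W}$ with $\boldsymbol{\Sigma}$ fixed and a closed-form update of $\boldsymbol{\Sigma}$ under a unit-trace constraint, the update used by MTRL-style methods [79], [80]. The function name, the synthetic per-task data interface and the step sizes are illustrative assumptions rather than the exact algorithms of the cited works; problem (32) can be handled analogously with the roles of the rows and columns of $\mathbf{W}$ swapped.

```python
import numpy as np

def task_relation_mtl(Xs, ys, lam=1.0, n_outer=20, n_inner=50, lr=1e-2, eps=1e-6):
    """Alternating (BCD-style) optimization of a problem-(33)-like objective:
        sum_t ||X_t w_t - y_t||^2 / n_t + (lam / 2) * tr(W Sigma^{-1} W^T),
    where column t of W holds the parameters of task t and Sigma is the
    m x m task covariance to be learned."""
    m, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((d, m))
    Sigma = np.eye(m) / m                      # start from uncorrelated tasks

    for _ in range(n_outer):
        # Block 1: update W with Sigma fixed by gradient descent on the smooth objective.
        Sigma_inv = np.linalg.inv(Sigma + eps * np.eye(m))
        for _ in range(n_inner):
            grad = np.zeros_like(W)
            for t in range(m):
                resid = Xs[t] @ W[:, t] - ys[t]
                grad[:, t] = 2.0 * Xs[t].T @ resid / Xs[t].shape[0]
            grad += lam * W @ Sigma_inv        # gradient of (lam/2) tr(W Sigma^{-1} W^T)
            W -= lr * grad
        # Block 2: update Sigma with W fixed; under tr(Sigma) = 1 the minimizer is
        # Sigma = (W^T W)^{1/2} / tr((W^T W)^{1/2}), so tasks with aligned parameter
        # vectors receive a large learned covariance.
        G = W.T @ W + eps * np.eye(m)
        vals, vecs = np.linalg.eigh(G)
        root = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
        Sigma = root / np.trace(root)
    return W, Sigma
```

In this sketch the $\boldsymbol{\Sigma}$-update makes tasks whose columns of $\mathbf{W}$ point in similar directions positively related, which matches the interpretation of $\boldsymbol{\Sigma}$ as a task covariance given above.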
Even though this taxonomy cannot cover all the regularized MTL methods, it can bring insights to understand regularized MTL methods better and help devise more MTL models. For example, a learning framework is proposed in [108] to learn a suitable multi-task model for a given multi-task problem under problem (33) by utilizing $\boldsymbol{\Sigma}$ to represent the corresponding multi-task model.

2.9 Other Settings in MTL
Instead of assuming that different tasks share an identical feature representation, Zhang and Yeung [109] consider a multi-database face recognition problem where face recognition in each database is treated as a task. Since different face databases have different image sizes, the tasks naturally do not lie in the same feature space in this application, leading to a heterogeneous-feature MTL problem. To tackle this problem, a multi-task discriminant analysis (MTDA) is proposed in [109] by first projecting the data of different tasks into a common subspace and then learning a common projection in this subspace to discriminate the different classes in different tasks. In [110], a latent probit model is proposed to generate the data of different tasks in different feature spaces via sparse transformations on a shared latent space and then to generate labels based on this latent space.
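The key structural idea here, per-task projections into a shared subspace followed by a common transformation, can be sketched as follows. This is only an illustration of the heterogeneous-feature setting, not the actual MTDA [109] or latent probit [110] learning procedures; every name and dimension is an assumption.

```python
import numpy as np

def heterogeneous_mtl_forward(xs, P_list, U):
    """Map task-specific inputs of different dimensionalities into a shared
    k-dimensional subspace via per-task projections P_t (d_t x k), then apply a
    common transformation U (k x c) shared by all tasks. How P_t and U are
    learned (e.g., with discriminant-analysis-style objectives) is method-specific
    and omitted in this sketch."""
    return [x @ P @ U for x, P in zip(xs, P_list)]

# Example: two "face databases" with different (flattened) image sizes.
rng = np.random.default_rng(0)
k, c = 16, 8                                          # shared subspace size, number of classes
P_list = [rng.normal(size=(1024, k)), rng.normal(size=(4096, k))]
U = rng.normal(size=(k, c))
xs = [rng.normal(size=(5, 1024)), rng.normal(size=(3, 4096))]
scores = heterogeneous_mtl_forward(xs, P_list, U)     # per-task class scores
```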
In many MTL classification problems, each task is explicitly or implicitly assumed to be a binary classification problem, as each column in the parameter matrix $\mathbf{W}$ contains the model parameters for the corresponding task. It is not difficult to see that many methods in the feature learning approach, low-rank approach and decomposition approach can be directly extended to a general setting where each classification task can be a multi-class classification problem and, correspondingly, multiple columns in $\mathbf{W}$ contain the model parameters of a multi-class classification task. Such a direct extension is applicable since those methods only rely on the entire $\mathbf{W}$ or its rows, but not its columns, as a medium to share knowledge among tasks. However, to the best of our knowledge, there is no theoretical or empirical study investigating such a direct extension. For most methods in the task clustering and task relation learning approaches, the direct extension does not work since, with multiple columns in $\mathbf{W}$ corresponding to one task, we do not know which one(s) can be used to represent this task. Therefore, the direct extension may not be the best solution to the general setting. In the following, we introduce four main approaches other than the direct extension to tackle the general setting in MTL where each classification task can be a multi-class classification problem. The first method is to transform the multi-class classification problem in each task into a binary classification problem. For example, multi-task metric learning [70], [111] can do that by treating a pair of data points from the same class as positive and a pair from different classes as negative. The second recipe is to utilize the characteristics of learners. For example, linear discriminant analysis can handle binary and multi-class classification problems in a unified formulation and hence MTDA [109] can naturally handle them without changing the formulation. The third approach is to directly learn label correspondence among different tasks. In [112], two learning tasks, which share the training data, aim to maximize the mutual information to identify the correspondence between labels in different tasks. By assuming that all the tasks share the same label space, the last approach, including [47], [67], [95], organizes the model parameters of all the tasks in a tensor where the model parameters of each task form a slice.
Then the parameter tensor can be regularized by tensor trace norms [47] and a tensor-variate normal prior [95], or factorized as a product of several low-rank matrices or tensors [67].
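As a small illustration of this tensor view, the sketch below stacks the per-task multi-class parameter matrices into a $d \times C \times m$ tensor built from a rank-$R$ CP-style factorization, so that all tasks share the feature and class factors and differ only in their task factors. This is a generic low-rank construction for intuition, not the specific factorizations or tensor norms of [47], [67], [95]; every name and size is illustrative.

```python
import numpy as np

def cp_parameter_tensor(A, B, C):
    """Build a d x C x m parameter tensor from a rank-R CP-style factorization.
    A: (d, R) feature factors, B: (C, R) class factors, C: (m, R) task factors.
    Slice W[:, :, t] holds the multi-class parameters of task t, so knowledge is
    shared through the common feature and class factors."""
    return np.einsum('dr,cr,mr->dcm', A, B, C)

# Example: 3 tasks, 5 classes, 20 features, rank 4 (all sizes illustrative).
rng = np.random.default_rng(0)
d, n_cls, m, R = 20, 5, 3, 4
W = cp_parameter_tensor(rng.normal(size=(d, R)),
                        rng.normal(size=(n_cls, R)),
                        rng.normal(size=(m, R)))
logits_task0 = rng.normal(size=(10, d)) @ W[:, :, 0]   # class scores for task 0
```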
Most MTL methods assume that the training data in each task are stored in a data matrix. In some cases, the training data in each task exhibit a multi-modal structure and hence are represented in a tensor instead of a matrix. Multilinear multi-task methods proposed in [113], [114] can handle this situation by employing tensor trace norms, a generalization of the trace norm, to perform the regularization.

2.10 Optimization Techniques in MTL
Optimization techniques used in MTL can be categorized into three main classes as follows.
Gradient descent method and its variants: The gradient descent method can be used to optimize the smooth unconstrained objective functions possessed by many MTL models. If the unconstrained objective function is non-smooth, the subgradient can be used instead and then the gradient descent method can also be applied. When there are constraints in the objective function of MTL models [44], [64], the projected gradient descent method can be used to project the updated solution in each step onto the space defined by the constraints. For deep MTL models, stochastic gradient descent methods can be used. Moreover, GradNorm [117] is devised to normalize gradients and learn dynamic loss weights that balance the learning of multiple tasks, [116] proposes the gradient surgery to avoid interference between task gradients, and [115] studies MTL from the perspective of multi-objective optimization.
Block Coordinate Descent (BCD) method: The parameters in many MTL models can be divided into several blocks. For example, the parameters in the learning functions of all the tasks form one block and the parameters representing task relations form another block. Directly optimizing the objective function of such an MTL model with respect to the parameters in all blocks together is not easy. The BCD method, which is also known as the alternating method, is widely used in the MTL literature, e.g., [10], [12], [31], [40], [41], [61], [62], [65], [66], [79], [80], [96], [97], to alternately optimize each block of parameters while fixing the parameters in the other blocks. Hence, each step of the BCD method solves several subproblems, each of which optimizes with respect to one block of parameters. Compared with the original objective function, each subproblem is easier to solve and so the BCD method can help reduce the optimization complexity.
Proximal method [118]: For a nonsmooth objective function in an MTL model, which is the sum of a smooth and a nonsmooth function, the proximal method is frequently used (e.g., [24], [27], [63], [99], [100], [101], [102], [103], [104], [119], [120], [121]) to construct a proximal problem by replacing the smooth function with a quadratic function that may be constructed based on its Taylor series in various ways; the resulting proximal problem is usually easier to solve than the original problem (a minimal sketch of one such proximal step is given below). The proximal method can accelerate the convergence rate of the optimization process or facilitate the design of distributed optimization algorithms.
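As a concrete illustration of the proximal idea mentioned above, the sketch below performs one proximal gradient step for a multi-task objective whose nonsmooth part is the $\ell_{2,1}$ regularizer used in multi-task feature selection (e.g., [23], [24], [25]): a gradient step on the smooth loss followed by the proximal operator of the $\ell_{2,1}$ norm, which shrinks whole rows (features) of $\mathbf{W}$ toward zero. The squared loss, data interface and step size are illustrative assumptions, not the exact algorithms of the cited works.

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W toward zero.
    Rows correspond to features shared across tasks, so a row shrunk to zero
    removes that feature from every task simultaneously."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W

def proximal_gradient_step(W, Xs, ys, lam, lr):
    """One proximal gradient step for
        sum_t ||X_t w_t - y_t||^2 / n_t + lam * ||W||_{2,1},
    where column t of W holds the parameters of task t."""
    grad = np.zeros_like(W)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        grad[:, t] = 2.0 * X.T @ (X @ W[:, t] - y) / X.shape[0]
    return prox_l21(W - lr * grad, lr * lam)
```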
3 MTL WITH OTHER LEARNING PARADIGMS
In the previous section, we reviewed different MTL approaches for supervised learning tasks. In this section, we overview works that combine MTL with other learning paradigms in machine learning, including unsupervised learning such as clustering, semi-supervised learning, active learning, reinforcement learning, multi-view learning and graphical models, either to further improve the performance of supervised MTL via additional information such as unlabeled data, or to use MTL to help improve the performance of other learning paradigms.
In most applications, labeled data are expensive to collect but unlabeled data are abundant. So in some MTL applications, the training dataset of each task consists of both labeled and unlabeled data, and we hope to exploit the useful information contained in the unlabeled data to further improve the performance of the supervised learning tasks. In machine learning, semi-supervised learning and active learning both utilize unlabeled data, but in different ways. Semi-supervised learning aims to exploit the geometrical information contained in the unlabeled data, while active learning selects representative unlabeled data to query an oracle with the hope of keeping the labeling cost as low as possible. Hence semi-supervised learning and active learning can be combined with MTL, leading to three new learning paradigms: semi-supervised multi-task learning [122], [123], [124], multi-task active learning [125], [126], [127] and semi-supervised multi-task active learning [128]. Specifically, a semi-supervised multi-task classification model is proposed in [122], [123] to use random walks to exploit the unlabeled data in each task and then cluster multiple tasks via a relaxed Dirichlet process. In [124], a semi-supervised multi-task Gaussian process for regression tasks, where different tasks are related via the hyperprior on the kernel parameters of the Gaussian processes of all the tasks, is proposed to incorporate unlabeled data into the design of the kernel function of each task to achieve smoothness in the corresponding functional spaces. Different from these semi-supervised multi-task methods, multi-task active learning adaptively selects informative unlabeled data for multi-task learners, and hence the selection criterion is the core research issue. Reichart et al. [125] argue that the data instances to be selected should be as informative as possible for a set of tasks instead of only one task and hence propose two protocols for multi-task active learning. In [126], the expected error reduction is used as a criterion where each task is modeled by a supervised latent Dirichlet allocation model. Inspired by multi-armed bandits, which balance the trade-off between exploitation and exploration, a selection strategy is proposed in [127] to consider both the risk of a multi-task learner based on the trace norm regularization and the corresponding confidence bound. In [129], the MTRL method (i.e., problem (21)) is extended to the interactive setting where a human expert is queried about partial orderings of pairwise task covariances based on an inconsistency criterion. In [130], a proposed generalization bound is used to select a subset from multiple unlabeled tasks to acquire labels to improve the
generalization performance of all the tasks. For semi-supervised multi-task active learning, Li et al. [128] propose a model that uses the Fisher information as a criterion to select unlabeled data whose labels are to be acquired, with the semi-supervised multi-task classification model [122], [123] as the classifier for each task.
MTL achieves performance improvements not only in supervised learning tasks but also in unsupervised learning tasks such as clustering. In [131], a multi-task Bregman clustering method is proposed based on single-task Bregman clustering by using the earth mover distance to minimize the distances between any pair of tasks in terms of cluster centers, and then in [132], [133], an improved version of [131] and its kernel extension are proposed to avoid the negative effect caused by the regularizer in [131] by choosing the better one between single-task and multi-task Bregman clustering. In [134], a multi-task kernel k-means method is proposed by learning the kernel matrix via both the MMD between any pair of tasks and a Laplacian regularization that helps identify a smooth kernel space. In [135], two proposed multi-task clustering methods are extensions of the MTFL and MTRL methods obtained by treating labels as cluster indicators to be learned. In [136], the principle of MTL is incorporated into subspace clustering by capturing correlations between data instances. In [137], a multi-task clustering method belonging to instance-based MTL is proposed to share data instances among different tasks. In [138], a multi-task spectral clustering algorithm, which can handle the out-of-sample issue via a linear function to learn the cluster assignment, is proposed to achieve feature selection among tasks via the $\ell_{2,1}$ regularization [139]. [140] proposes to identify the task cluster structure and learn task relations together.
Reinforcement Learning (RL) is a promising area in machine learning and has shown superior performance in many applications such as game playing (e.g., Atari and Go) and robotics. MTL can help boost the performance of reinforcement learning, leading to Multi-task Reinforcement Learning (MRL). Some works [141], [142], [143], [144], [145], [146], [147], [148], [149], [150] adapt the ideas introduced in Section 2 to MRL. Specifically, in [141], where a task solves a sequence of Markov Decision Processes (MDPs), a hierarchical Bayesian infinite mixture model is used to model the distribution over MDPs and, for each new MDP, previously learned distributions are used as an informative prior. In [142], a regionalized policy representation is introduced to characterize the behavior of an agent in each task, and a Dirichlet process is placed over regionalized policy representations across multiple tasks to cluster tasks. In [143], a Gaussian process temporal-difference value function model is used for each task and a hierarchical Bayesian approach is used to model the distribution over value functions in different tasks. Calandriello et al. [144] assume that the parameter vectors of the value functions in different tasks are jointly sparse and then extend the MTFS method with the $\ell_{2,1}$ regularization as well as the MTFL method to learn the value functions of multiple tasks together. In [145], a model associating each subtask with a modular subpolicy is proposed to learn from policy sketches, which annotate tasks with sequences of named subtasks and provide information about high-level structural relationships among tasks. In [146], a multi-task contextual bandit is introduced to leverage or learn similarities in contexts among arms to improve the prediction of rewards from contexts. In [147], a multi-task linearly solvable MDP, whose task basis matrix contains a library of component tasks shared by all the tasks, is proposed to maintain a parallel distributed representation of tasks, each of which enables an agent to draw on macro actions simultaneously. In [148], a multi-task deep RL model based on attention can automatically group tasks into sub-networks at a state-level granularity. In [149], a sharing-experience framework is introduced to use task-specific rewards to identify similar parts, defined as shared regions, which can guide the experience sharing of task policies. In [150], multi-task soft option learning, a hierarchical framework based on planning as inference, is regularized by a shared prior to avoid training instabilities and to allow the fine-tuning of options for new tasks without forgetting learned policies. The ideas of compression and distillation have been incorporated into MRL in [151], [152], [153], [154]. For example, in [151], the proposed Actor-Mimic method combines deep reinforcement learning and model compression techniques to train a policy network which can learn to act for multiple tasks. In [152], a policy distillation method is proposed to not only train an efficient network to learn the policy of an agent but also consolidate multiple task-specific policies into a single policy. In [153], the problem of multi-task multi-agent reinforcement learning under partial observability is addressed by distilling decentralized single-task policies into a unified policy across multiple tasks. In [154], each task has its own policy, which is constrained to be close to a shared policy that is trained by distillation. Some works [155], [156], [157], [158], [159], [160] in MRL focus on online and distributed settings. Specifically, in [155], a distributed MRL framework is devised to model MRL as an instance of general consensus, and an efficient decentralized solver is developed. In [156], [157], multiple goal-directed tasks are learned in an online setup without the need for expert supervision by actively sampling harder tasks. In [158], a distributed agent is developed to not only use resources more efficiently in single-machine training but also scale to thousands of machines without sacrificing data efficiency or resource utilization. [159] formulates MRL from the perspective of variational inference and proposes a novel distributed solver with quadratic convergence guarantees. In [160], an online learning algorithm is proposed to dynamically combine different auxiliary tasks which provide gradient directions to speed up the training of the main reinforcement learning task. Some works study the theoretical foundations of MRL. For example, in [161], sharing representations among tasks is analyzed with theoretical guarantees to highlight conditions under which representations should be shared, and finite-time bounds of approximated value-iteration are extended to the multi-task setting. Moreover, there are some works that design novel MRL methods. For example, in [162], an MRL framework is proposed to train an agent to employ hierarchical policies that decide when to use a previously learned policy and when to learn a new skill, with a temporal grammar that helps the agent learn complex temporal dependencies. [163] studies the problem of parallel learning of multiple sequential-decision tasks and proposes to automatically adapt the contribution of each task to the updates of the agent to make all tasks have a similar impact on the learning dynamics.
TABLE 2
The Classification of Works About MTL Applications in Different Areas According to Different MTL Approaches
In [186], a distributed algorithm based on the debiased Lasso, where each machine learns a task, is proposed to learn jointly sparse features in a high-dimensional space. In [187], the MTRL method (i.e., problem (21)) is extended to the distributed setting based on a stochastic dual coordinate ascent method. In [188], to protect the privacy of data, a privacy-preserving distributed MTL method is proposed based on a privacy-preserving proximal gradient algorithm with asynchronous updates. In [189], federated multi-task learning is proposed as an extension of distributed multi-task learning that considers both stragglers and fault tolerance. In [190], a distributed multi-task algorithm is proposed under the online MTL setting.
For high-dimensional data in MTL, we can use multi-task feature selection methods to reduce the dimension or extend single-task dimension reduction techniques to the multi-task setting as done in [109]. Another option is to use feature hashing: in [191], multiple hashing functions are proposed to accelerate the joint learning of multiple tasks.
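The sketch below illustrates the general hashing-trick idea behind such approaches: (task, feature) pairs are hashed into one shared low-dimensional index space with a sign hash to reduce bias, so a single weight vector can serve all tasks without storing a full-dimensional parameter vector per task. It is a generic illustration in the spirit of [191], not the exact construction proposed there; the function name, bucket count and use of Python's built-in hash are assumptions.

```python
import numpy as np

def hashed_task_features(feature_ids, values, task, n_buckets=2**18, seed=0):
    """Hash (task, feature) pairs into a shared n_buckets-dimensional vector.
    Collisions are tolerated; the signed hash makes their contributions
    roughly cancel in expectation."""
    x = np.zeros(n_buckets)
    for f, v in zip(feature_ids, values):
        idx = hash((seed, task, f)) % n_buckets            # shared bucket index
        sign = 1.0 if hash((seed + 1, task, f)) % 2 == 0 else -1.0
        x[idx] += sign * v
    return x

# The same raw features are hashed differently for each task, while one shared
# weight vector of length n_buckets is learned jointly over all tasks.
x_task0 = hashed_task_features([3, 17, 42], [1.0, 0.5, 2.0], task=0)
x_task1 = hashed_task_features([3, 17, 42], [1.0, 0.5, 2.0], task=1)
```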
5 APPLICATIONS
MTL has many applications in various areas including computer vision, bioinformatics, health informatics, speech, NLP, web, and so on. In Table 2, we categorize different MTL problems in each application area according to the MTL approaches they use, where the classification of MTL approaches has already been introduced in Section 2. In the last column of Table 2, we list some problems in various application areas which are different from those in the other columns.
For the application problems listed in Table 2, application-dependent MTL models have been proposed to solve them.6 Though these models are different from each other, there are some characteristics in the respective areas. For example, in computer vision, deep MTL models, most of which belong to the feature transformation approach, exhibit good performance, making this approach popular in computer vision. In bioinformatics and health informatics, the interpretability of learning models is in some sense more important. Therefore, the feature selection and task relation learning approaches are widely used in this area, as the former approach can identify useful features and the latter can quantitatively show task relations. In speech and NLP, the data exhibit a sequential structure, which makes recurrent-neural-network-based deep MTL models in the feature transformation approach play an important role. As the data in web applications are of a large scale, this area favors simple models such as linear models or their ensembles based on boosting. Among all the MTL approaches, the feature transformation, feature selection and task relation learning approaches are the most widely used in different application areas according to Table 2.
6. For details of those models, refer to an arXiv version [248] of this paper.
When encountering a new application problem which can be modeled as an MTL problem, we need to judge whether the tasks in this problem are related in terms of either low-level features or high-level concepts. If so, by treating Table 2 as a look-up table, we can identify a problem in Table 2 similar to the new problem and then adapt the corresponding MTL model to solve the new problem. Otherwise, we can try popular MTL approaches in the respective area.

6 THEORETICAL ANALYSES
As well as designing MTL models and exploiting MTL applications, there are some works that study theoretical aspects of MTL, and here we review them.
The generalization bound, which upper-bounds the generalization loss in terms of the training loss, model complexity and confidence, is core in learning theory since it can identify learnability and induce the sample complexity. There are several works [40], [49], [249], [250], [251], [252], [253], [254], [255], [256], [257], [258], [259], [260] that study the generalization bound of different MTL models. In Table 3, we compare those works in terms of the analyzed MTL model, analysis tool and the convergence rate of the bound.
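As a schematic template of such a bound (for intuition only, not any specific result in Table 3), with probability at least $1-\delta$ one typically obtains, uniformly over the hypotheses $h_1,\dots,h_m$ of the $m$ tasks, something of the form

$$\frac{1}{m}\sum_{t=1}^{m} L_t(h_t)\;\le\;\frac{1}{m}\sum_{t=1}^{m}\hat{L}_t(h_t)\;+\;\operatorname{comp}(\mathcal{H},m,n_0)\;+\;\sqrt{\frac{\ln(1/\delta)}{2\,m\,n_0}},$$

where $L_t$ and $\hat{L}_t$ denote the expected and empirical losses of task $t$, $n_0$ is the average number of training points per task as in Table 3, and $\operatorname{comp}(\cdot)$ is a model-complexity term whose form depends on the analysis tool (e.g., covering numbers or Rademacher/Gaussian complexities). The works compared in Table 3 differ precisely in this complexity term and in how the resulting bound converges as $m$ and $n_0$ grow.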
TABLE 3
Comparison of Generalization Bounds Derived in Different Works in Terms of the
Analyzed MTL Model, Analysis Tool, and the Convergence Rate of the Bound
m denotes the number of tasks and n0 denotes the average number of data points per task.
Lastly, existing studies mainly focus on supervised learning tasks, and only a few are on other tasks such as unsupervised learning, semi-supervised learning, active learning, multi-view learning and reinforcement learning tasks. It is natural to adapt or extend the five approaches introduced in Section 2 to those non-supervised learning tasks. We think that such adaptation and extension require more effort to design appropriate models. Moreover, it is worth trying to apply MTL to other areas in artificial intelligence, such as logic and planning, to broaden its application scope.

ACKNOWLEDGMENTS
This work was supported by the NSFC under Grant 62076118.

REFERENCES
[1] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, pp. 41–75, 1997.
[2] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan, Transfer Learning. New York, NY, USA: Cambridge Univ. Press, 2020.
[3] M.-L. Zhang and Z.-H. Zhou, "A review on multi-label learning algorithms," IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014.
[4] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Netw., vol. 113, pp. 54–71, 2019.
[5] Y. Zhang and Q. Yang, "An overview of multi-task learning," Nat. Sci. Rev., vol. 5, pp. 30–43, 2018.
[6] X. Yang, S. Kim, and E. P. Xing, "Heterogeneous multitask learning with joint sparsity constraints," in Proc. Int. Conf. Neural Inf. Process. Syst., 2009, pp. 2151–2159.
[7] S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer, "Multi-task learning for HIV therapy screening," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 56–63.
[8] X. Liao and L. Carin, "Radial basis function network for multi-task learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2005, pp. 792–802.
[9] D. L. Silver, R. Poirier, and D. Currie, "Inductive transfer with context-sensitive neural networks," Mach. Learn., vol. 73, 2008, Art. no. 313.
[10] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex multi-task feature learning," Mach. Learn., vol. 73, pp. 243–272, 2008.
[11] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, "A spectral regularization framework for multi-task structure learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2007, Art. no. 1296.
[12] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 343–351.
[13] J. Zhu, N. Chen, and E. P. Xing, "Infinite latent SVM for classification and multi-task learning," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2011, pp. 1620–1628.
[14] M. K. Titsias and M. Lázaro-Gredilla, "Spike and slab variational inference for multi-task and multiple kernel learning," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2011, pp. 2339–2347.
[15] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 94–108.
[16] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3707–3715.
[17] W. Zhang et al., "Deep model based transfer and multi-task learning for biological image analysis," IEEE Trans. Big Data, vol. 6, no. 2, pp. 322–333, Jun. 2015.
[18] N. Mrksic et al., "Multi-domain dialog state tracking using recurrent neural networks," in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics, 2015, pp. 794–799.
[19] S. Li, Z. Liu, and A. B. Chan, "Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 488–495.
[20] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Proc. Interspeech, 2016, pp. 2369–2372.
[21] P. Liu, X. Qiu, and X. Huang, "Adversarial multi-task learning for text classification," in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1–10.
[22] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, "Cross-stitch networks for multi-task learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3994–4003.
[23] G. Obozinski, B. Taskar, and M. Jordan, "Multi-task feature selection," Statistics Dept., Univ. California, Berkeley, CA, USA, Tech. Rep., 2006.
[24] J. Liu, S. Ji, and J. Ye, "Multi-task feature learning via efficient l2,1-norm minimization," in Proc. 25th Conf. Uncertainty Artif. Intell., 2009, pp. 339–348.
[25] S. Lee, J. Zhu, and E. P. Xing, "Adaptive multi-task lasso: With application to eQTL detection," in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 1306–1314.
[26] N. S. Rao, C. R. Cox, R. D. Nowak, and T. T. Rogers, "Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2202–2210.
[27] P. Gong, J. Zhou, W. Fan, and J. Ye, "Efficient multi-task feature learning with calibration," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2014, pp. 761–770.
[28] J. Wang and J. Ye, "Safe screening for multi-task feature learning with multiple data matrices," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1747–1756.
[29] H. Liu, M. Palatucci, and J. Zhang, "Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 649–656.
[30] P. Gong, J. Ye, and C. Zhang, "Multi-stage multi-task feature learning," J. Mach. Learn. Res., vol. 14, pp. 2979–3010, 2013.
[31] A. C. Lozano and G. Swirszcz, "Multi-level lasso for sparse multi-task regression," in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 595–602.
[32] X. Wang, J. Bi, S. Yu, and J. Sun, "On multiplicative multitask feature learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2411–2429.
[33] L. Han, Y. Zhang, G. Song, and K. Xie, "Encoding tree sparsity in multi-task learning: A probabilistic framework," in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 1854–1860.
[34] T. Jebara, "Multi-task feature and kernel selection for SVMs," in Proc. 21st Int. Conf. Mach. Learn., 2004, Art. no. 55.
[35] S. Kim and E. P. Xing, "Tree-guided group lasso for multi-task regression with structured sparsity," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 543–550.
[36] Y. Zhou, R. Jin, and S. C. H. Hoi, "Exclusive lasso for multi-task feature selection," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 988–995.
[37] Y. Zhang, D.-Y. Yeung, and Q. Xu, "Probabilistic multi-task feature selection," in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 2559–2567.
[38] D. Hernandez-Lobato and J. M. Hernandez-Lobato, "Learning feature selection dependencies in multi-task learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 746–754.
[39] D. Hernandez-Lobato, J. M. Hernandez-Lobato, and Z. Ghahramani, "A probabilistic model for dirty multi-task feature selection," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1073–1082.
[40] R. K. Ando and T. Zhang, "A framework for learning predictive structures from multiple tasks and unlabeled data," J. Mach. Learn. Res., vol. 6, pp. 1817–1853, 2005.
[41] J. Chen, L. Tang, J. Liu, and J. Ye, "A convex formulation for learning shared structures from multiple tasks," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 137–144.
[42] A. Agarwal, H. Daume III, and S. Gerber, "Learning multiple tasks using manifold regularization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 46–54.
[43] J. Zhang, Z. Ghahramani, and Y. Yang, "Learning multiple related tasks using latent independent component analysis," in Proc. Int. Conf. Neural Inf. Process. Syst., 2005, pp. 1585–1592.
[44] T. K. Pong, P. Tseng, S. Ji, and J. Ye, "Trace norm regularization: Reformulations, algorithms, and multi-task learning," SIAM J. Optim., vol. 20, pp. 3465–3489, 2010.
[45] L. Han and Y. Zhang, "Multi-stage multi-task learning with reduced rank," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 1638–1644.
[46] A. M. McDonald, M. Pontil, and D. Stamos, “Spectral k-support [72] T. Kato, H. Kashima, M. Sugiyama, and K. Asai, “Multi-task
norm regularization,” in Proc. 27th Int. Conf. Neural Inf. Process. learning via conic programming,” in Proc. Int. Conf. Neural Inf.
Syst., 2014, 3644–3652. Process. Syst., 2007, pp. 737–744.
[47] Y. Yang and T. M. Hospedales, “Trace norm regularised deep [73] S. Feldman, M. R. Gupta, and B. A. Frigyik, “Revisiting Stein’s
multi-task learning,” in Proc. 5th Int. Conf. Learn. Representation, paradox: Multi-task averaging,” J. Mach. Learn. Res., vol. 15,
Workshop Track, 2017. pp. 3621–366 , 2014.
[48] S. Thrun and J. O’Sullivan, “Discovering structure in multiple [74] I. Yamane, H. Sasaki, and M. Sugiyama, “Regularized multitask
learning tasks: The TC algorithm,” in Proc. 13th Int. Conf. Mach. learning for multidimensional log-density gradient estimation,”
Learn., 1996, pp. 489–497. Neural Comput., vol. 28, no. 7, pp. 1388–1410, Jul. 2016.
[49] K. Crammer and Y. Mansour, “Learning multiple tasks using [75] N. G€ ornitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and
shared hypotheses,” in Proc. 25th Int. Conf. Neural Inf. Process. G. R€atsch, “Hierarchical multitask structured output learning for
Syst., 2012, pp. 1475–1483. large-scale sequence segmentation,” in Proc. 24th Int. Conf. Neural
[50] B. Bakker and T. Heskes, “Task clustering and gating for Bayes- Inf. Process. Syst., 2011, pp. 2690–2698.
ian multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–99 [76] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, “Multi-task
2003. Gaussian process prediction,” in Proc. Int. Neural Inf. Process.
[51] K. Yu, V. Tresp, and A. Schwaighofer, “Learning Gaussian pro- Syst., 2007, 153–160.
cesses from multiple tasks,” in Proc. Int. Conf. Mach. Learn., 2005, [77] K. M. A. Chai, “Generalization errors and learning curves for
pp. 1012–1019 regression with multi-task Gaussian processes,” in Proc. 22nd Int.
[52] S. Yu, V. Tresp, and K. Yu, “Robust multi-task learning with Conf. Neural Inf. Process. Syst., 2009, pp. 279–287.
t-processes,” in Proc. 24th Int. Conf. Mach. Learn., 2007, [78] Y. Zhang and D.-Y. Yeung, “Multi-task learning using general-
pp. 1103–1110. ized t process,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010,
[53] W. Lian, R. Henao, V. Rao, J. E. Lucas, and L. Carin, “A multitask pp. 964–971.
point process predictive model,” in Proc. Int. Conf. Mach. Learn., [79] Y. Zhang and D.-Y. Yeung, “A convex formulation for learning
2015, pp. 2030–2038. task relationships in multi-task learning,” in Proc. 26th Conf.
[54] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task Uncertainty Artif. Intell., 2010, pp. 733–742.
learning for classification with Dirichlet process priors,” J. Mach. [80] Y. Zhang and D.-Y. Yeung, “A regularization approach to learn-
Learn. Res., vol. 8, pp. 35–63, 2007. ing task relationships in multitask learning,” ACM Trans. Knowl.
[55] Y. Xue, D. B. Dunson, and L. Carin, “The matrix stick-breaking Discov. Data, vol. 8, pp. 1–31, 2014.
process for flexible multi-task learning,” in Proc. 24th Int. Conf. [81] Y. Zhang and D.-Y. Yeung, “Multi-task boosting by exploiting
Mach. Learn., 2007, pp. 1063–1070. task relationships,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl.
[56] H. Li, X. Liao, and L. Carin, “Nonparametric Bayesian feature Discov. Databases, 2012, pp. 697–710.
selection for multi-task learning,” in Proc. IEEE Int. Conf. Acoust., [82] Y. Zhang and D.-Y. Yeung, “Multilabel relationship learning,”
Speech Signal Process., 2011, pp. 2236–2239. ACM Trans. Knowl. Discov. Data, vol. 7, pp. 1–30, 2013.
[57] Y. Qi, D. Liu, D. B. Dunson, and L. Carin, “Multi-task compres- [83] F. Dinuzzo, C. S. Ong, P. V. Gehler, and G. Pillonetto, “Learning
sive sensing with Dirichlet process priors,” in Proc. 25th Int. Conf. output kernels with block coordinate descent,” in Proc. 28th Int.
Mach. Learn., 2008, 768–775. Conf. Mach. Learn., 2011, pp. 49–56.
[58] K. Ni, L. Carin, and D. B. Dunson, “Multi-task learning for [84] C. Ciliberto, Y. Mroueh, T. A. Poggio, and L. Rosasco, “Convex
sequential data via iHMMs and the nested Dirichlet process,” in learning of multiple tasks and their structure,” in Proc. 32nd Int.
Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 689–696. Conf. Mach. Learn., 2015, pp. 1548–1557.
[59] K. Ni, J. W. Paisley, L. Carin, and D. B. Dunson, “Multi-task [85] C. Ciliberto, L. Rosasco, and S. Villa, “Learning multiple visual
learning for analyzing and sorting large databases of sequential tasks while discovering their structure,” in Proc. IEEE Conf. Com-
data,” IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3918–3931, put. Vis. Pattern Recognit., 2015, pp. 131–139.
Aug. 2008. [86] P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele, “Efficient out-
[60] A. Passos, P. Rai, J. Wainer, and H. Daum e III, “Flexible model- put kernel learning for multiple tasks,” in Proc. Int. Conf. Neural
ing of latent task structures in multitask learning,” in Proc. 29th Process. Syst., 2015, pp. 1189–1197.
Int. Conf. Mach. Learn., 2012, pp. 1283–1290. [87] Y. Zhang and Q. Yang, “Learning sparse task relations in multi-
[61] L. Jacob, F. R. Bach, and J.-P. Vert, “Clustered multi-task learn- task learning,” in Proc. 31st AAAI Conf. Artif. Intell., 2017,
ing: A convex formulation,” in Proc. 18th Int. Conf. Artif. Intell. pp. 2914–2920.
Statist., 2008, pp. 745–752. [88] Y. Zhang and J. G. Schneider, “Learning multiple tasks with a
[62] Z. Kang, K. Grauman, and F. Sha, “Learning with whom to share sparse matrix-normal penalty,” in Proc. 23rd Int. Conf. Neural Inf.
in multi-task feature learning,” in Proc. 28th Int. Conf. Mach. Process. Syst., 2010, pp. 2550–2558.
Learn., 2011, pp. 521–528. [89] C. Archambeau, S. Guo, and O. Zoeter, “Sparse Bayesian multi-
[63] L. Han and Y. Zhang, “Learning multi-level task groups in multi- task learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2011,
task learning,” in Proc. 29th AAAI Conf. Artif. Intell., 2015, pp. 1755–1763.
pp. 2638–2644. [90] M. Yang, Y. Li, and Z. Zhang, “Multi-task learning with Gauss-
[64] A. Barzilai and K. Crammer, “Convex multi-task learning by ian matrix generalized inverse Gaussian model,” in Proc. Int.
clustering,” in Proc. Artif. Intell. Statist., 2015. Conf. Mach. Learn., 2013, pp. 423–431.
[65] Q. Zhou and Q. Zhao, “Flexible clustered multi-task learning by [91] Y. Zhang and D.-Y. Yeung, “Learning high-order task relation-
learning representative tasks,” IEEE Trans. Pattern Anal. Mach. ships in multi-task learning,” in Proc. 23rd Int. Joint Conf. Artif.
Intell., vol. 38, no. 2, pp. 266–278, Feb. 2016. Intell., 2013, pp. 1917–1923.
[66] A. Kumar and H. Daume III, “Learning task grouping and overlap [92] P. Rai, A. Kumar, and H. Daum e III, “Simultaneously
in multi-task learning,” in Proc. 29th Int. Conf. Mach. Learn., 2012. leveraging output and task structures for multiple-output
[67] Y. Yang and T. M. Hospedales, “Deep multi-task representation regression,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst.,
learning: A tensor factorisation approach,” in Proc. Int. Conf. 2012, pp. 3185–3193.
Mach. Representations, 2017. [93] B. Rakitsch, C. Lippert, K. M. Borgwardt, and O. Stegle, “It is all
[68] J. Zhou, J. Chen, and J. Ye, “Clustered multi-task learning via in the noise: Efficient multi-task Gaussian process inference with
alternating structure optimization,” in Proc. Int. Conf. Neural Inf. structured residuals,” in Proc. 26th Int. Conf. Neural Inf. Process.
Process. Syst., 2011, Art. no. 702. Syst., 2013, pp. 1466–1474.
[69] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in [94] A. R. Gonçalves, F. J. V. Zuben, and A. Banerjee, “Multi-task
Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, sparse structure learning with Gaussian copula models,” J. Mach.
2004, pp. 109–117. Learn. Res., vol. 7, pp. 1–30, 2016.
[70] S. Parameswaran and K. Q. Weinberger, “Large margin multi- [95] M. Long, Z. Cao, J. Wang, and P. S. Yu, “Learning multiple tasks
task metric learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., with multilinear relationship networks,” in Proc. 31st Int. Conf.
2010, pp. 1867–1875. Neural Inf. Process. Syst., 2017, pp. 1593–1602.
[71] T. Evgeniou, C. A. Micchelli, and M. Pontil, “Learning multiple [96] Y. Zhang, “Heterogeneous-neighborhood-based multi-task local
tasks with kernel methods,” J. Mach. Learn. Res., vol. 6, pp. 615–637, learning algorithms,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
2005. 2013, pp. 1896–1904.
[97] G. Lee, E. Yang, and S. J. Hwang, “Asymmetric multi-task learn- [122] Q. Liu, X. Liao, and L. Carin, “Semi-supervised multitask
ing based on task relatedness and loss,” in Proc. Int. Conf. Mach. learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2007,
Learn., 2016, pp. 230–238. pp. 937–944.
[98] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan, “A dirty model [123] Q. Liu, X. Liao, H. Li, J. R. Stack, and L. Carin, “Semisupervised
for multi-task learning,” in Proc. 23rd Int. Conf. Neural Inf. Process. multitask learning,” IEEE Trans. Pattern Anal. Mach. Intell.,
Syst., 2010, pp. 964–972. vol. 31, no. 6, pp. 1074–1086, Jun. 2009.
[99] J. Chen, J. Liu, and J. Ye, “Learning incoherent sparse and low- [124] Y. Zhang and D. Yeung, “Semi-supervised multi-task
rank patterns from multiple tasks,” ACM Trans. Knowl. Discov. regression,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov.
Data, vol. 5, 2010, Art. no. 22. Databases, 2009, pp. 617–631.
[100] J. Chen, J. Zhou, and J. Ye, “Integrating low-rank and group- [125] R. Reichart, K. Tomanek, U. Hahn, and A. Rappoport, “Multi-
sparse structures for robust multi-task learning,” in Proc. 17th task active learning for linguistic annotations,” in Proc. 46th
ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2011, Annu. Meeting Assoc. Comput. Linguistics, 2008, pp. 861–869.
pp. 42–50. [126] A. Acharya, R. J. Mooney, and J. Ghosh, “Active multitask learn-
[101] P. Gong, J. Ye, and C. Zhang, “Robust multi-task feature ing using both latent and supervised shared topics,” in Proc.
learning,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discov. SIAM Int. Conf. Data Mining, 2014, pp. 190–198.
Data Mining, 2012, pp. 895–903. [127] M. Fang and D. Tao, “Active multi-task learning via bandits,” in
[102] W. Zhong and J. T. Kwok, “Convex multitask learning with flexi- Proc. SIAM Int. Conf. Data Mining Soc. Ind. Appl. Math., 2015,
ble task clusters,” in Proc. 18th Int. Conf. Artif. Intell. Statist., 2012, pp. 505–513.
pp. 65–73. [128] H. Li, X. Liao, and L. Carin, “Active learning for semi-supervised
[103] A. Zweig and D. Weinshall, “Hierarchical regularization cascade multi-task learning,” in Proc. IEEE Int. Conf. Acoust. Speech Signal
for joint learning,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 37–45. Process., 2009, pp. 1637–1640.
[104] L. Han and Y. Zhang, “Learning tree structure in multi-task [129] K. Lin and J. Zhou, “Interactive multi-task relationship learning,”
learning,” in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. in Proc. EEE 16th Int. Conf. Data Mining, 2016, pp. 241–250.
Data Mining, 2015, pp. 397–406. [130] A. Pentina and C. H. Lampert, “Multi-task learning with labeled
[105] P. Jawanpuria and J. S. Nath, “A convex feature learning formu- and unlabeled tasks,” in Proc. Int. Conf. Mach. Learn., 2017,
lation for latent task structure discovery,” in Proc. 29th Int. Conf. pp. 2807–2816.
Mach. Learn., 2012, pp. 1531–1538. [131] J. Zhang and C. Zhang, “Multitask Bregman clustering,” in Proc.
[106] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel AAAI Conf. Artif. Intell., 2010, pp. 655–660.
for unsupervised domain adaptation,” in Proc. IEEE Conf. Com- [132] X. Zhang and X. Zhang, “Smart multi-task Bregman clustering
put. Vis. Pattern Recognit., 2012, pp. 2066–2073. and multi-task kernel clustering,” in Proc. AAAI Conf. Artif.
[107] M. Solnon, S. Arlot, and F. R. Bach, “Multi-task regression using Intell., 2013, pp. 1034–1040.
minimal penalties,” J. Mach. Learn. Res., vol. 13, pp. 2773–2812, 2012. [133] X. Zhang, X. Zhang, and H. Liu, “Smart multitask Bregman clus-
[108] Y. Zhang, Y. Wei, and Q. Yang, “Learning to multitask,” in Proc. tering and multitask kernel clustering,” ACM Trans. Knowl. Dis-
32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 5776–5787. cov. Data, vol. 10, 2015, Art. no. 8.
[109] Y. Zhang and D.-Y. Yeung, “Multi-task learning in heteroge- [134] Q. Gu, Z. Li, and J. Han, “Learning a kernel for multi-task
neous feature spaces,” in Proc. 25th AAAI Conf. Artif. Intell., 2011, clustering,” in Proc. AAAI Conf. Artif. Intell., 2011, pp. 368–373.
pp. 574–579. [135] X. Zhang, “Convex discriminative multitask clustering,” IEEE
[110] S. Han, X. Liao, and L. Carin, “Cross-domain multitask learning Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 28–40, Jan.
with latent probit models,” in Proc. 29th Int. Conf. Mach. Learn., 2015.
2012, pp. 363–370. [136] Y. Wang, D. P. Wipf, Q. Ling, W. Chen, and I. J. Wassell, “Multi-
[111] P. Yang, K. Huang, and C. Liu, “Geometry preserving multi-task task learning for subspace segmentation,” in Proc. Int. Conf.
metric learning,” Mach. Learn., vol. 92, pp. 133–175, 2013. Mach. Learn., 2015, pp. 1209–1217.
[112] N. Quadrianto, A. J. Smola, T. S. Caetano, S. V. N. Vishwanathan, [137] X. Zhang, X. Zhang, and H. Liu, “Self-adapted multi-task
and J. Petterson, “Multitask learning without label corre- clustering,” in Proc. Int. Joint Conf. Artif. Intell., 2016,
spondences,” in Proc. 23rd Int. Conf. Neural Inf. Process. Syst., pp. 2357–2363.
2010, pp. 1957–1965. [138] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, “Multitask spec-
[113] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pon- tral clustering by exploring intertask correlation,” IEEE Trans.
til, “Multilinear multitask learning,” in Proc. 30th Int. Conf. Mach. Cybern., vol. 45, no. 5, pp. 1083–1094, May 2015.
Learn., 2013, pp. 1444–1452. [139] X. Zhu, X. Li, S. Zhang, C. Ju, and X. Wu, “Robust joint graph
[114] K. Wimalawarne, M. Sugiyama, and R. Tomioka, “Multitask sparse coding for unsupervised spectral feature selection,” IEEE
learning meets tensor factorization: Task imputation via convex Trans. Neural Netw. Learn. Syst., vol. 28, no. 6, pp. 1263–1275,
optimization,” in Proc. 27th Int. Conf. Neural Inf. Process. Syst., Jun. 2017.
2014, pp. 2825–2833. [140] X. Zhang, X. Zhang, H. Liu, and J. Luo, “Multi-task clustering
[115] O. Sener and V. Koltun, “Multi-task learning as multi-objective with model relation learning,” in Proc. Int. Joint Conf. Artif. Intell.,
optimization,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, 2018, pp. 3132–3140.
pp. 525–536. [141] A. Wilson, A. Fern, S. Ray, and P. Tadepalli, “Multi-task rein-
[116] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, forcement learning: A hierarchical Bayesian approach,” in Proc.
“Gradient surgery for multi-task learning,” in Proc. Int. Conf. Int. Conf. Mach. Learn., 2007, pp. 1015–1022.
Neural Inf. Process. Syst., 2020. [142] H. Li, X. Liao, and L. Carin, “Multi-task reinforcement learning
[117] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich, in partially observable stochastic environments,” J. Mach. Learn.
“Gradnorm: Gradient normalization for adaptive loss balancing Res., vol. 10, pp. 1131–1186, 2009.
in deep multitask networks,” in Proc. Int. Conf. Mach. Learn., [143] A. Lazaric and M. Ghavamzadeh, “Bayesian multi-task rein-
2018, pp. 794–803. forcement learning,” in Proc. Int. Conf. Mach. Learn., 2010,
[118] N. Parikh and S. P. Boyd, “Proximal algorithms,” Founds. Trends pp. 599–606.
Optim., vol. 1, no. 3, pp. 127–239, 2014. [144] D. Calandriello, A. Lazaric, and M. Restelli, “Sparse multi-task
[119] L. Zhao, Q. Sun, J. Ye, F. Chen, C. Lu, and N. Ramakrishnan, reinforcement learning,” in Proc. Int. conf. Neural Inf. Process.
“Multi-task learning for spatio-temporal event forecasting,” in Syst., 2014, pp. 5–20.
Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, [145] J. Andreas, D. Klein, and S. Levine, “Modular multitask rein-
2015, pp. 1503–1512. forcement learning with policy sketches,” in Proc. Int. Conf.
[120] Y. Li, J. Wang, J. Ye, and C. K. Reddy, “A multi-task learning for- Mach. Learn., 2017, pp. 166–175.
mulation for survival analysis,” in Proc. 22nd ACM SIGKDD Int. [146] A. A. Deshmukh, U. € Dogan, and C. Scott, “Multi-task learning
Conf. Knowl. Discov. Data Mining, 2016, pp. 1715–1724. for contextual bandits,” in Proc. Int. Conf. Neural Inf. Process.
[121] L. Zhao, Q. Sun, J. Ye, F. Chen, C. Lu, and N. Ramakrishnan, Syst., 2017, pp. 4851–4859.
“Feature constrained multi-task learning models for spatiotem- [147] A. M. Saxe, A. C. Earle, and B. Rosman, “Hierarchy through com-
poral event forecasting,” IEEE Trans. Knowl. Data Eng., vol. 29, position with multitask LMDPs,” in Proc. Int. Conf. Mach. Learn.,
no. 5, pp. 1059–1072, May 2017. 2017, pp. 3017–3026.
[148] T. Br€am, G. Brunner, O. Richter, and R. Wattenhofer, “Attentive [174] K. Lin, J. Xu, I. M. Baytas, S. Ji, and J. Zhou, “Multi-task feature
multi-task deep reinforcement learning,” in Proc. Joint Eur. Conf. interaction learning,” in Proc. ACM SIGKDD Int. Conf. Knowl.
Mach. Learn. Knowl. Discov. Databases, 2019, 134–149. Discov. Data Mining, 2016, pp. 1735–1744.
[149] T. Vuong et al., “Sharing experience in multitask reinforcement [175] Y. Zhang, “Parallel multi-task learning,” in Proc. IEEE Int. Conf.
learning,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 3642–3648. Data Mining, 2015, pp. 629–638.
[150] M. Igl et al., “Multitask soft option learning,” in Proc. Conf. Uncer- [176] O. Dekel, P. M. Long, and Y. Singer, “Online multitask learning,”
tainty Artif. Intell., 2020, pp. 969–978. in Proc. Int. Conf. Comput. Learn. Theory, 2006, pp. 453–467.
[151] E. Parisotto, J. Ba, and R. Salakhutdinov, “Actor-mimic: Deep [177] O. Dekel, P. M. Long, and Y. Singer, “Online learning of mul-
multitask and transfer reinforcement learning,” in Proc. Int. Conf. tiple tasks with a shared loss,” J. Mach. Learn. Res., vol. 8,
Learn. Representations, 2016. pp. 2233–2264, 2007.
[152] A. A. Rusu et al., “Policy distillation,” in Proc. Int. Conf. Learn. [178] G. Lugosi, O. Papaspiliopoulos, and G. Stoltz, “Online multi-task
Representations, 2016. learning with hard constraints,” in Proc. 22nd Conf. Learn. Theory,
[153] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep 2009.
decentralized multi-task multi-agent reinforcement learning [179] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile, “Linear algo-
under partial observability,” in Proc. Int. Conf. Mach. Learn., 2017, rithms for online multitask classification,” J. Mach. Learn. Res.,
pp. 2681–2690. vol. 11, pp. 2901–2934, 2010.
[154] Y. W. Teh et al., “Distral: Robust multitask reinforcement [180] G. Pillonetto, F. Dinuzzo, and G. D. Nicolao, “Bayesian online
learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, multitask learning of Gaussian processes,” IEEE Trans. Pattern
pp. 4499–4509. Anal. Mach. Intell., vol. 32, no. 2, pp. 193–205, Feb. 2010.
[155] S. E. Bsat, H. Bou-Ammar, and M. E. Taylor, “Scalable multitask [181] A. Saha, P. Rai, H. Daume, III, and S. Venkatasubramanian,
policy gradient reinforcement learning,” in Proc. AAAI Conf. “Online learning of multiple tasks and their relationships,” in
Artif. Intell., 2017, pp. 1847–1853. Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 643–651.
[156] S. Sharma and B. Ravindran, “Online multi-task learning using [182] K. Murugesan, H. Liu, J. G. Carbonell, and Y. Yang, “Adaptive
active sampling,” in Proc. Int. Conf. Learn. Representations Work- smoothed online multi-task learning,” in Proc. Int. Conf. Neural
shop, 2017. Inf. Process. Syst., 2016, pp. 4303–4311.
[157] S. Sharma, A. K. Jha, P. Hegde, and B. Ravindran, “Learning to [183] P. Yang, P. Zhao, and X. Gao, “Robust online multi-task learning
multi-task by active sampling,” in Proc. Int. Conf. Learn. Represen- with correlative and personalized structures,” IEEE Trans. Knowl.
tations, 2018. Data Eng., vol. 29, no. 11, pp. 2510–2521, Nov. 2017.
[158] L. Espeholt et al., “IMPALA: Scalable distributed deep-RL with [184] S. Hao, P. Zhao, Y. Liu, S. C. H. Hoi, and C. Miao, “Online multi-
importance weighted actor-learner architectures,” in Proc. Int. task relative similarity learning,” in Proc. Int. Joint Conf. Artif.
Conf. Mach. Learn., 2018, pp. 1406–1415. Intell., 2017, pp. 1823–1829.
[159] R. Tutunov, D. Kim, and H. Bou-Ammar, “Distributed multitask [185] P. Yang, P. Zhao, J. Zhou, and X. Gao, “Confidence weighted
reinforcement learning with quadratic convergence,” in Proc. Int. multitask learning,” in Proc. AAAI Conf. Artif. Intell., 2019,
Conf. Neural Inf. Process. Syst., 2018, pp. 8921–8930.
[160] X. Lin, H. S. Baweja, G. Kantor, and D. Held, “Adaptive auxiliary task weighting for reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 4773–4784.
[161] C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “Sharing knowledge in multi-task deep reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–18.
[162] T. Shu, C. Xiong, and R. Socher, “Hierarchical and interpretable skill acquisition in multi-task reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2018.
[163] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multi-task deep reinforcement learning with PopArt,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 3796–3803.
[164] Z. Guo et al., “Bootstrap latent-predictive representations for multitask reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3875–3886.
[165] J. He and R. Lawrence, “A graph-based framework for multi-task multi-view learning,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 25–32.
[166] J. Zhang and J. Huan, “Inductive multi-task learning with multiple view data,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2012, pp. 543–551.
[167] X. Zhang, X. Zhang, and H. Liu, “Multi-task multi-view clustering for non-negative data,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 4055–4061.
[168] X. Zhang, X. Zhang, H. Liu, and X. Liu, “Multi-task multi-view clustering,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 12, pp. 3324–3338, Dec. 2016.
[169] X. Zhu, X. Li, and S. Zhang, “Block-row sparse multiview multi-label learning for image classification,” IEEE Trans. Cybern., vol. 46, no. 2, pp. 450–461, Feb. 2016.
[170] L. Zheng, Y. Cheng, and J. He, “Deep multimodality model for multi-task multi-view learning,” in Proc. SIAM Int. Conf. Data Mining, 2019, pp. 10–16.
[171] A. Niculescu-Mizil and R. Caruana, “Inductive transfer for Bayesian network structure learning,” in Proc. Mach. Learn., 2012, pp. 167–180.
[172] J. Honorio and D. Samaras, “Multi-task learning of Gaussian graphical models,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 447–454.
[173] D. Oyen and T. Lane, “Leveraging domain knowledge in multi-task Bayesian network structure learning,” in Proc. AAAI Conf. Artif. Intell., 2012, pp. 1091–1097.
pp. 5636–5643.
[186] J. Wang, M. Kolar, and N. Srebro, “Distributed multi-task learning,” in Proc. Int. Conf. Artif. Intell. Statist., 2016, pp. 751–760.
[187] S. Liu, S. J. Pan, and Q. Ho, “Distributed multi-task relationship learning,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017, pp. 937–946.
[188] L. Xie, I. M. Baytas, K. Lin, and J. Zhou, “Privacy-preserving distributed multi-task learning with asynchronous updates,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017, pp. 1195–1204.
[189] V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4427–437.
[190] C. Zhang et al., “Distributed multi-task classification: A decentralized online learning approach,” Mach. Learn., vol. 107, pp. 727–747, 2018.
[191] K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proc. Annu. Int. Conf. Mach. Learn., 2009, pp. 1113–1120.
[192] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2042–2049.
[193] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via structured multi-task sparse learning,” Int. J. Comput. Vis., vol. 101, pp. 367–383, 2013.
[194] Q. Xu, S. J. Pan, H. H. Xue, and Q. Yang, “Multitask learning for protein subcellular location prediction,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 3, pp. 748–759, May/Jun. 2011.
[195] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 4460–4464.
[196] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, “Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning,” in Proc. InterSpeech, 2015, pp. 854–858.
[197] J. Bai et al., “Multi-task learning for learning to rank in web search,” in Proc. Int. ACM Conf. Inf. Knowl. Manage., 2009, pp. 1549–1552.
[198] J. Ghosn and Y. Bengio, “Multi-task learning for stock selection,” in Proc. Int. Conf. Neural Inf. Process. Syst., 1996, pp. 946–952.
[199] C. Yuan, W. Hu, G. Tian, S. Yang, and H. Wang, “Multi-task sparse learning with Beta process prior for action recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2013, pp. 423–429.
[200] Y. Qi, O. Tastan, J. G. Carbonell, J. Klein-Seetharaman, and J. Weston, “Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins,” Bioinformatics, vol. 26, pp. i645–i652, 2010.
[201] P. Bell and S. Renals, “Regularization of context-dependent deep neural networks with context-independent multi-task training,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 4290–4294.
[202] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Proc. InterSpeech, 2015, pp. 3274–3278.
[203] V. W. Zheng, S. J. Pan, Q. Yang, and J. J. Pan, “Transferring multi-device localization models using latent multi-task learning,” in Proc. AAAI Conf. Artif. Intell., 2008, pp. 1427–1432.
[204] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 160–167.
[205] M. Lapin, B. Schiele, and M. Hein, “Scalable multitask representation learning for scene classification,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1343–1441.
[206] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia, “Multi-task CNN model for attribute prediction,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1949–1959, Nov. 2015.
[207] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” in Proc. Int. Conf. Learn. Representations, 2016.
[208] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim, “Rotating your face using multi-task deep neural network,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 676–684.
[209] X. Chu, W. Ouyang, W. Yang, and X. Wang, “Multi-task recurrent neural network for immediacy prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 142–149.
[210] X. Wang, C. Zhang, and Z. Zhang, “Boosted multi-task learning for face verification with applications to web image and video search,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 142–149.
[211] X. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3493–3500.
[212] Q. Liu, Q. Xu, V. W. Zheng, H. Xue, Z. Cao, and Q. Yang, “Multi-task learning for cross-platform siRNA efficacy prediction: An in-silico study,” BMC Bioinf., vol. 10, 2010, Art. no. 181.
[213] A. Ahmed, M. Aly, A. Das, A. J. Smola, and T. Anastasakos, “Web-scale multi-task feature selection for behavioral targeting,” in Proc. ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 1737–1741.
[214] H. Wang et al., “Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 557–562.
[215] K. Puniyani, S. Kim, and E. P. Xing, “Multi-population GWA mapping via multi-task regularized regression,” Bioinformatics, vol. 26, pp. i208–i216, 2010.
[216] J. Zhou, L. Yuan, J. Liu, and J. Ye, “A multi-task learning formulation for predicting disease progression,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2011, pp. 814–822.
[217] J. Wan et al., “Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer’s disease,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 940–947.
[218] D. He, D. Kuhn, and L. Parida, “Novel applications of multitask learning and multiple output regression to multiple genetic trait prediction,” Bioinformatics, vol. 32, no. 12, pp. 37–43, 2016.
[219] K. Zhang, J. W. Gray, and B. Parvin, “Sparse multitask regression for identifying common mechanism of response to therapeutic targets,” Bioinformatics, vol. 26, pp. 97–105, 2010.
[220] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multi-task low-rank affinity pursuit for image segmentation,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 2439–2446.
[221] J. Xu, P. Tan, L. Luo, and J. Zhou, “GSpartan: A geospatio-temporal multi-task learning framework for multi-location prediction,” in Proc. SIAM Int. Conf. Data Mining, 2016, pp. 657–665.
[222] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multi-task sparsity pursuit,” IEEE Trans. Image Process., vol. 21, no. 3, pp. 1327–1338, Mar. 2012.
[223] H. Wang et al., “High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer’s disease progression prediction,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1277–1285.
[224] Q. An, C. Wang, I. Shterev, E. Wang, L. Carin, and D. B. Dunson, “Hierarchical kernel stick-breaking process for multi-task image analysis,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 17–24.
[225] M. Alamgir, M. Grosse-Wentrup, and Y. Altun, “Multitask learning for brain-computer interfaces,” in Proc. Mach. Learn. Res., 2010, pp. 17–24.
[226] Y. Zhang and D.-Y. Yeung, “Multi-task warped Gaussian process for personalized age estimation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2622–2629.
[227] J. Xu, J. Zhou, and P. Tan, “FORMULA: FactORized MUlti-task LeArning for task discovery in personalized medical models,” in Proc. SIAM Int. Conf. Data Mining, 2015, pp. 496–504.
[228] T. R. Almaev, B. Martínez, and M. F. Valstar, “Learning to transfer: Transferring latent task structures and its application to person-specific facial action unit detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3774–3782.
[229] A. Liu, Y. Su, W. Nie, and M. S. Kankanhalli, “Hierarchical clustering multi-task learning for joint human action grouping and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102–114, Jan. 2017.
[230] F. Wu and Y. Huang, “Collaborative multi-domain sentiment classification,” in Proc. IEEE Int. Conf. Data Mining, 2015, pp. 459–468.
[231] O. Chapelle, P. K. Shivaswamy, S. Vadrevu, K. Q. Weinberger, Y. Zhang, and B. L. Tseng, “Multi-task learning for boosting with application to web search ranking,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2010, pp. 1189–1198.
[232] C. Widmer, J. Leiva, Y. Altun, and G. Rätsch, “Leveraging sequence classification by taxonomy-based multitask learning,” in Proc. Annu. Int. Conf. Res. Comput. Mol. Biol., 2010, pp. 522–534.
[233] Y. Zhang, B. Cao, and D.-Y. Yeung, “Multi-domain collaborative filtering,” in Proc. Conf. Uncertainty Artif. Intell., 2010, pp. 725–732.
[234] K. M. A. Chai, C. K. I. Williams, S. Klanke, and S. Vijayakumar, “Multi-task Gaussian process learning of robot inverse dynamics,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2008, pp. 265–272.
[235] D.-Y. Yeung and Y. Zhang, “Learning inverse dynamics by Gaussian process regression under the multi-task learning framework,” in The Path to Autonomous Robots. Boston, MA, USA: Springer, 2009, pp. 131–142.
[236] C. Widmer, N. C. Toussaint, Y. Altun, and G. Rätsch, “Inferring latent task structure for multitask learning by multiple kernel learning,” BMC Bioinf., vol. 11, 2010, Art. no. S5.
[237] A. Ahmed, A. Das, and A. J. Smola, “Scalable hierarchical multi-task learning algorithms for conversion optimization in display advertising,” in Proc. ACM Int. Conf. Web Search Data Mining, 2014, pp. 153–162.
[238] J. Zheng and L. M. Ni, “Time-dependent trajectory regression on road networks via multi-task learning,” in Proc. AAAI Conf. Artif. Intell., 2013, pp. 1048–1055.
[239] X. Lu, Y. Wang, X. Zhou, Z. Zhang, and Z. Ling, “Traffic sign recognition via multi-modal tree-structure embedded multi-task learning,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 4, pp. 960–972, Apr. 2017.
[240] F. Mordelet and J. Vert, “ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples,” BMC Bioinf., vol. 12, 2011, Art. no. 389.
[241] J. Xu, P. Tan, J. Zhou, and L. Luo, “Online multi-task learning framework for ensemble forecasting,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 6, pp. 1268–1280, Jun. 2017.
[242] M. Kshirsagar, J. G. Carbonell, and J. Klein-Seetharaman, “Multitask learning for host-pathogen protein interactions,” Bioinformatics, vol. 29, pp. 217–226, 2013.
[243] L. Han, L. Li, F. Wen, L. Zhong, T. Zhang, and X. Wan, “Graph-guided multi-task sparse learning model: A method for identifying antigenic variants of influenza A(H3N2) virus,” Bioinformatics, vol. 35, pp. 77–87, 2019.
[244] Z. Hong, X. Mei, D. V. Prokhorov, and D. Tao, “Tracking via robust multi-task multi-view joint sparse representation,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 649–656.
[245] Y. Yan, E. Ricci, S. Ramanathan, O. Lanz, and N. Sebe, “No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1177–1184.
[246] M. Kshirsagar, K. Murugesan, J. G. Carbonell, and J. Klein-Seetharaman, “Multitask matrix completion for learning protein interactions across diseases,” J. Comput. Biol., vol. 24, pp. 501–514, 2017.
[247] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao, “Multi-task learning with low rank attribute embedding for person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1167–1181, May 2018.
[248] Y. Zhang and Q. Yang, “A survey on multi-task learning,” 2017, arXiv:1707.08114.
[249] J. Baxter, “Learning internal representations,” in Proc. Int. Conf. Comput. Learn. Theory, 1995, pp. 311–320.
[250] J. Baxter, “A model of inductive bias learning,” J. Artif. Intell. Res., vol. 12, pp. 149–198, 2000.
[251] A. Maurer, “Bounds for linear multi-task learning,” J. Mach. Learn. Res., vol. 7, pp. 117–139, 2006.
[252] A. Maurer, “The Rademacher complexity of linear transformation classes,” in Proc. Int. Conf. Comput. Learn. Theory, 2006, pp. 65–78.
[253] B. Juba, “Estimating relatedness via data compression,” in Proc. Int. Conf. Mach. Learn., 2006, pp. 441–448.
[254] S. Ben-David and R. S. Borbely, “A notion of task relatedness yielding provable multiple-task learning guarantees,” Mach. Learn., vol. 73, pp. 273–287, 2008.
[255] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Regularization techniques for learning with matrices,” J. Mach. Learn. Res., vol. 13, pp. 1865–1890, 2012.
[256] M. Pontil and A. Maurer, “Excess risk bounds for multitask learning with trace norm regularization,” in Proc. Annu. Conf. Learn. Theory, 2013, pp. 55–76.
[257] A. Pentina and S. Ben-David, “Multi-task and lifelong learning of kernels,” in Proc. Int. Conf. Algorithmic Learn. Theory, 2015, pp. 194–208.
[258] Y. Zhang, “Multi-task learning and algorithmic stability,” in Proc. AAAI Conf. Artif. Intell., 2015, pp. 3181–3187.
[259] A. Maurer, M. Pontil, and B. Romera-Paredes, “The benefit of multitask representation learning,” J. Mach. Learn. Res., vol. 17, pp. 1–32, 2016.
[260] N. Yousefi, Y. Lei, M. Kloft, M. Mollaghasemi, and G. Anagnostopoulos, “Local Rademacher complexity-based learning guarantees for multi-task learning,” J. Mach. Learn. Res., vol. 19, pp. 1–47, 2018.
[261] A. Argyriou, C. A. Micchelli, and M. Pontil, “When is there a representer theorem? Vector versus matrix regularizers,” J. Mach. Learn. Res., vol. 10, pp. 2507–2529, 2009.
[262] A. Argyriou, C. A. Micchelli, and M. Pontil, “On spectral learning,” J. Mach. Learn. Res., vol. 11, pp. 935–953, 2010.
[263] K. Lounici, M. Pontil, A. B. Tsybakov, and S. A. van de Geer, “Taking advantage of sparsity in multi-task learning,” in Proc. Annu. Conf. Learn. Theory, 2009.
[264] G. Obozinski, M. J. Wainwright, and M. I. Jordan, “Support union recovery in high-dimensional multivariate regression,” Ann. Statist., vol. 39, pp. 1–47, 2011.
[265] M. Kolar, J. D. Lafferty, and L. A. Wasserman, “Union support recovery in multi-task learning,” J. Mach. Learn. Res., vol. 12, pp. 2415–2435, 2011.

Yu Zhang (Member, IEEE) is currently an associate professor at the Department of Computer Science and Engineering, Southern University of Science and Technology. He has authored or coauthored the book Transfer Learning and about 70 papers in top-tier conferences and journals. His research interests include artificial intelligence and machine learning, especially multi-task learning, transfer learning, dimensionality reduction, metric learning, and semi-supervised learning. He is a reviewer for various journals and serves as an area chair or senior program committee member for several top-tier conferences. He was the recipient of the best paper awards at UAI 2010 and PAKDD 2019 and the Best Student Paper Award at WI 2013.

Qiang Yang (Fellow, IEEE) is currently the chief artificial intelligence officer at WeBank and the chair professor at the CSE Department, Hong Kong University of Science and Technology. He has authored or coauthored five books, including Transfer Learning and Federated Learning. His research interests include transfer learning and federated learning. He is the founding EiC of two journals: IEEE Transactions on Big Data and ACM Transactions on Intelligent Systems and Technology. He is the conference chair of AAAI-21, the president of the Hong Kong Society of Artificial Intelligence and Robotics and of the Investment Technology League, and a former president of IJCAI from 2017 to 2019. He is also a fellow of the AAAI, ACM, and AAAS.