A Survey On Multi-Task Learning
Abstract—Multi-Task Learning (MTL) is a learning paradigm in machine learning whose aim is to leverage useful information contained in multiple related tasks to help improve the generalization performance of all the tasks. In this paper, we give a survey of MTL from the perspectives of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we give a definition of MTL and then classify different MTL algorithms into five categories, including the feature learning approach, low-rank approach, task clustering approach, task relation learning approach and decomposition approach, and discuss the characteristics of each approach. To further improve the performance of learning tasks, MTL can be combined with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. When the number of tasks is large or the data dimensionality is high, we review online, parallel and distributed MTL models as well as dimensionality reduction and feature hashing to reveal their computational and storage advantages. Many real-world applications use MTL to boost their performance and we review representative works in this paper. Finally, we present theoretical analyses and discuss several future directions for MTL.
1 INTRODUCTION

Humans can learn multiple tasks simultaneously, and during this learning process, they can use the knowledge learned in one task to help the learning of another task. For example, according to our experience in learning to play tennis and squash, we find that the skill of playing tennis can help us learn to play squash and vice versa. Inspired by such human learning ability, Multi-Task Learning (MTL) [1], a learning paradigm in machine learning, aims to learn multiple related tasks jointly so that the knowledge contained in one task can be leveraged by other tasks, with the hope of improving the generalization performance of all the tasks at hand.

In its early stages, an important motivation of MTL was to alleviate the data sparsity problem, where each task has only a limited number of labeled data. In the data sparsity problem, the number of labeled data in each task is insufficient to train an accurate learner, while MTL aggregates the labeled data in all the tasks in the spirit of data augmentation to obtain a more accurate learner for each task. From this perspective, MTL can help reuse existing knowledge and reduce the cost of manual labeling for learning tasks. As the era of "big data" arrives in some areas such as computer vision and Natural Language Processing (NLP), it has been found that deep MTL models can achieve better performance than their single-task counterparts. One reason that MTL is effective is that it utilizes more data from different learning tasks than single-task learning does. With more data, MTL can learn more robust and universal representations for multiple tasks and more powerful models, leading to better knowledge sharing among tasks, better performance of each task, and a lower risk of overfitting in each task.

MTL is related to other learning paradigms in machine learning, including transfer learning [2], multi-label learning [3] and multi-output regression. The setting of MTL is similar to that of transfer learning but with significant differences. In MTL, there is no distinction among the different tasks and the objective is to improve the performance of all the tasks. Transfer learning, however, aims to improve the performance of a target task with the help of source tasks, hence the target task plays a more important role than the source tasks. In a word, MTL treats all the tasks equally, while in transfer learning the target task attracts most of the attention. From the perspective of knowledge flow, knowledge in transfer learning flows from the source task(s) to the target task, whereas in multi-task learning knowledge is shared between any pair of tasks, as illustrated in Fig. 1a. Continual learning [4], in which tasks come sequentially, learns tasks one by one, while MTL learns multiple tasks together. In multi-label learning and multi-output regression, each data point is associated with multiple labels, which can be categorical or numeric. If we treat each possible label as a task, multi-label learning and multi-output regression can be viewed in some sense as a special case of multi-task learning in which different tasks always share the same data during both the training and testing phases. On the one hand, this characteristic of multi-label learning and multi-output regression leads to research issues different from those of MTL. For example, the ranking loss, which enforces the scores (e.g., the classification probabilities) of labels associated with a data point to be larger than those of absent labels, can be used for multi-label learning, but it does not fit MTL, where different tasks possess different data. On the other hand, this characteristic of multi-label learning and multi-output regression is invalid in MTL problems. For example, in the MTL problem discussed in Section 2.7, where each task is to predict the disease symptom score of Parkinson's disease for a patient based on 19 bio-medical features, different patients/tasks should not share the bio-medical data. In a word, multi-label learning and multi-output regression are different from multi-task learning, as illustrated in Fig. 1b, and hence we will not survey the literature on multi-label learning and multi-output regression. Moreover, multi-view learning is another learning paradigm in machine learning, where each data point is associated with multiple views, each of which consists of a set of features. Even though different views have different sets of features, all the views are used together to learn for the same task, and hence multi-view learning belongs to single-task learning with multiple sets of features, which is different from MTL as shown in Fig. 1c.

Fig. 1. Illustrations for differences between MTL and other learning paradigms.

Over the past decades, MTL has attracted much attention in the artificial intelligence and machine learning communities. Many MTL models have been devised and many MTL applications in other areas have been explored. Moreover, many analyses have been conducted to study theoretical problems in MTL. This paper serves as a survey on MTL from the perspective of algorithmic modeling, applications and theoretical analyses. For algorithmic modeling, we first give a definition of MTL and then classify different MTL algorithms into five categories: the feature learning approach, which can be further categorized into the feature transformation and feature selection approaches, the low-rank approach, the task clustering approach, the task relation learning approach, and the decomposition approach. After that, we discuss the combination of MTL with other learning paradigms, including semi-supervised learning, active learning, unsupervised learning, reinforcement learning, multi-view learning and graphical models. To handle a large number of tasks, we review online, parallel and distributed MTL models. For data in a high-dimensional space, feature selection, dimensionality reduction and feature hashing are introduced as vital tools to process them. As a promising learning paradigm, MTL has many applications in various areas and here we briefly review its applications in computer vision, bioinformatics, health informatics, speech, NLP, web, etc. From the perspective of theoretical analyses on MTL, we review relevant works. At last, we discuss several future directions for MTL.¹

¹ For an introduction to MTL without technical details, please refer to [5].

2 MTL MODELS

In order to fully characterize MTL, we first give the definition of MTL.

Definition (Multi-Task Learning). Given m learning tasks {T_i}_{i=1}^m where all the tasks or a subset of them are related, multi-task learning aims to learn the m tasks together to improve the learning of a model for each task T_i by using the knowledge contained in all or some of the other tasks.

Based on the definition of MTL, we focus on supervised learning tasks in this section since most MTL studies fall into this setting; other types of tasks are reviewed in the next section. In the setting of supervised learning tasks, a task T_i is usually accompanied by a training dataset D_i consisting of n_i training samples, i.e., D_i = {(x_j^i, y_j^i)}_{j=1}^{n_i}, where x_j^i ∈ R^{d_i} is the jth training instance in T_i and y_j^i is its label. We denote by X_i the training data matrix for T_i, i.e., X_i = (x_1^i, ..., x_{n_i}^i). When different tasks lie in the same feature space, implying that d_i equals d_j for any i ≠ j, this setting is the homogeneous-feature MTL; otherwise it corresponds to the heterogeneous-feature MTL. Without special explanation, the default MTL setting is the homogeneous-feature MTL. Here we need to distinguish the heterogeneous-feature MTL from the heterogeneous MTL. In [6], the heterogeneous MTL is considered to consist of different types of supervised tasks including classification and regression problems, and here we generalize it to a more general setting where the heterogeneous MTL consists of tasks of different types, including supervised learning, unsupervised learning, semi-supervised learning, reinforcement learning, multi-view learning and graphical models. The opposite of the heterogeneous MTL is the homogeneous MTL, which consists of tasks of only one type. In a word, the homogeneous and heterogeneous MTL differ in the type of learning tasks, while the homogeneous-feature MTL differs from the heterogeneous-feature MTL in terms of the original feature representations.
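To make the notation concrete, the following toy sketch (our illustration, not from the survey) builds one training set D_i per task as numpy arrays and checks whether the setting is homogeneous-feature MTL, i.e., whether all d_i are equal; all sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_toy_mtl_problem(dims=(10, 10, 10), samples=(50, 40, 60)):
    """Build a toy supervised MTL problem: one training set D_i = (X_i, y_i) per task T_i."""
    tasks = []
    for d_i, n_i in zip(dims, samples):
        X_i = rng.normal(size=(n_i, d_i))               # n_i instances x_j^i in R^{d_i}
        w_i = rng.normal(size=d_i)                      # hidden task-specific parameters
        y_i = X_i @ w_i + 0.1 * rng.normal(size=n_i)    # labels y_j^i (regression tasks here)
        tasks.append({"X": X_i, "y": y_i})
    return tasks

def is_homogeneous_feature_mtl(tasks):
    """Homogeneous-feature MTL: every task has the same feature dimensionality d_i."""
    return len({task["X"].shape[1] for task in tasks}) == 1

tasks = make_toy_mtl_problem()
print(is_homogeneous_feature_mtl(tasks))                                    # True: all d_i equal
print(is_homogeneous_feature_mtl(make_toy_mtl_problem(dims=(10, 12, 8))))   # False: heterogeneous-feature MTL
```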
If the activation function of a hidden layer is linear, then the transformation it defines is a linear function, and otherwise it is nonlinear. Compared with multi-layer feedforward neural networks used for single-task learning, the difference in the network architecture lies in the output layer: in single-task learning there is only one output unit, while in MTL there are m of them. In [8], the radial basis function network, which has only one hidden layer, is extended to MTL by greedily determining the structure of the hidden layer. Different from these neural network models, Silver et al. [9] propose a context-sensitive multi-task neural network which has only one output unit shared by different tasks but takes a task-specific context as an additional input.

Different from multi-layer feedforward neural networks, which are connectionist models, the multi-task feature learning (MTFL) method [10] is formulated under the regularization framework with the objective function

\min_{A,U,b} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, (a^i)^T U^T x_j^i + b_i) + \lambda\|A\|_{2,1}^2 \quad \mathrm{s.t.}\ UU^T = I,   (1)

where l(·,·) denotes a loss function such as the hinge loss or the square loss, b = (b_1, ..., b_m)^T is a vector of offsets for all the tasks, U ∈ R^{d×d} is a square transformation matrix, A ∈ R^{d×m} contains the model parameters of all the tasks with its ith column a^i as the model parameters of the ith task after the transformation, the ℓ_{2,1} norm of a matrix A, denoted by ||A||_{2,1}, equals the sum of the ℓ_2 norms of the rows of A, I denotes an identity matrix of appropriate size, and λ is a positive regularization parameter. The first term in the objective function of problem (1) measures the empirical loss on the training sets of all the tasks and the second one enforces A to be row-sparse via the ℓ_{2,1} norm, which is equivalent to selecting features after the transformation, while the constraint enforces U to be orthogonal. Different from the multi-layer feedforward neural network, whose hidden representations may be redundant, the orthogonality of U prevents the MTFL method from learning redundant representations. As proved in [10], problem (1) is equivalent to

\min_{W,D,b} L(W,b) + \lambda\,\mathrm{tr}(W^T D^{-1} W) \quad \mathrm{s.t.}\ D \succeq 0,\ \mathrm{tr}(D) \le 1,   (2)

where L(W,b) = \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, (w^i)^T x_j^i + b_i) denotes the total training loss, tr(·) denotes the trace of a square matrix, w^i = U a^i is the model parameter vector for T_i, W = (w^1, ..., w^m), 0 denotes a zero vector or matrix of appropriate size, M^{-1} for any square matrix M denotes its inverse when it is nonsingular or otherwise its pseudo-inverse, and B ⪰ C means that B − C is positive semidefinite. Based on this formulation, we can see that the MTFL method learns a feature covariance D for all the tasks, which will be interpreted from a probabilistic perspective in Section 2.8. Given D, the learning of different tasks can be decoupled, which facilitates parallel computing. Given W, D has an analytical solution D = (WW^T)^{1/2}/tr((WW^T)^{1/2}), and by plugging this solution into problem (2), we can see that the regularizer on W is the squared trace norm. Then Argyriou et al. [11] extend problem (2) to a general formulation where the second term in the objective function becomes tr(W^T f(D) W) with f(D) operating on the spectrum of D, and discuss the condition on f(·) that makes the whole problem convex.

Similar to the MTFL method, the multi-task sparse coding method [12] learns a linear transformation on features, with the objective function formulated as

\min_{A,U,b} L(UA, b) \quad \mathrm{s.t.}\ \|a^i\|_1 \le \lambda\ \forall i \in [m],\ \|u^j\|_2 \le 1\ \forall j \in [D],   (3)

where a^i, the ith column of A, contains the model parameters of the ith task, u^j is the jth column of U, [c] for an integer c denotes the set of integers from 1 to c, ||·||_1 denotes the ℓ_1 norm of a vector or matrix and equals the sum of the absolute values of its entries, and ||·||_2 denotes the ℓ_2 norm of a vector. Here the transformation U ∈ R^{d×D} is also called the dictionary in sparse coding and is shared by all the tasks. Compared with the MTFL method, where U in problem (1) is a d × d orthogonal matrix, U in problem (3) is overcomplete, which implies that D is larger than d, with each column having a bounded ℓ_2 norm. Another difference is that in problem (1) A is enforced to be row-sparse, while in problem (3) it is only sparse via the first constraint. With a similar idea to the multi-task sparse coding method, Zhu et al. [13] propose a multi-task infinite support vector machine via the Indian buffet process; the difference is that in [13] the dictionary is sparse and the model parameters are non-sparse. In [14], the spike and slab prior is used to learn sparse model parameters for multi-output regression problems where transformed features are induced by Gaussian processes and shared by different outputs.

Recently, deep learning has become popular due to its capacity to learn nonlinear features, which facilitates the learning of invariant features for multiple tasks, and hence many deep multi-task models belonging to this approach have been proposed, with each task modeled by a deep neural network. Here we classify deep multi-task models in this approach into three main categories. The first category [15], [16], [17], [18], [19] learns a common feature representation for multiple tasks by sharing the first several layers, in an architecture similar to Fig. 2. However, different from Fig. 2, deep MTL models in this category have a large number of shared layers, which have general structures such as convolutional layers and pooling layers. Building on the first category, the second category uses adversarial learning, which is inspired by generative adversarial networks, to learn a common feature representation for MTL, as done in [20], [21]. Specifically, there are three networks in such adversarial multi-task models, including a feature network N_f, a classification network N_c and a domain network N_d. Based on N_f, N_c minimizes the training loss for all the tasks, while N_d aims to distinguish which task a data instance comes from. The objective function of such models is usually formulated as

\min_{\theta_f,\theta_c}\max_{\theta_d} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} \left[ l(y_j^i, N_c(N_f(x_j^i))) - \lambda\, l_{ce}(d_j^i, N_d(N_f(x_j^i))) \right],

where θ_f, θ_c and θ_d denote the parameters of N_f, N_c and N_d, respectively, l_ce denotes the cross-entropy loss, and d_j^i denotes the task index of x_j^i.
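To illustrate this adversarial objective, here is a minimal PyTorch sketch of a three-network model with a shared feature network, per-task classification heads and a task discriminator trained through a gradient-reversal layer, which is one common way to implement such a min-max objective. All layer sizes and names are illustrative assumptions, not taken from [20], [21].

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

class AdversarialMTLNet(nn.Module):
    def __init__(self, in_dim, feat_dim, num_classes, num_tasks, lam=0.1):
        super().__init__()
        self.lam = lam
        # Feature network N_f shared by all tasks.
        self.feature_net = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
        # Classification network N_c: one head per task.
        self.class_heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in range(num_tasks)])
        # Domain network N_d: predicts which task an instance comes from.
        self.domain_net = nn.Linear(feat_dim, num_tasks)

    def forward(self, x, task_id):
        feat = self.feature_net(x)
        class_logits = self.class_heads[task_id](feat)
        # Gradient reversal: N_d is trained to identify the task, while N_f is pushed to confuse it.
        domain_logits = self.domain_net(GradReverse.apply(feat, self.lam))
        return class_logits, domain_logits

# One illustrative training step on a mini-batch (x, y) drawn from task t.
model = AdversarialMTLNet(in_dim=20, feat_dim=32, num_classes=2, num_tasks=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
x, y, t = torch.randn(16, 20), torch.randint(0, 2, (16,)), 3
class_logits, domain_logits = model(x, t)
task_labels = torch.full((16,), t)              # d_j^i: the task index of each instance
loss = ce(class_logits, y) + ce(domain_logits, task_labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```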
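Returning to problem (2) above, its structure suggests a simple alternating scheme: with D fixed, each task reduces to an independent regularized problem, and with W fixed, D has the closed form given earlier. The numpy sketch below assumes a squared loss, no offsets and a small ridge term eps*I for numerical stability; these simplifications are ours and the code is only a sketch, not the implementation of [10].

```python
import numpy as np
from scipy.linalg import sqrtm

def mtfl_alternating(Xs, ys, lam=0.1, eps=1e-6, num_iters=50):
    """Alternating minimization for problem (2): squared loss, no offsets (a simplification)."""
    d = Xs[0].shape[1]
    D = np.eye(d) / d                        # feasible start: tr(D) = 1
    W = np.zeros((d, len(Xs)))
    for _ in range(num_iters):
        D_inv = np.linalg.inv(D + eps * np.eye(d))
        # W-step: with D fixed, each task decouples into a generalized ridge problem.
        for i, (X, y) in enumerate(zip(Xs, ys)):
            n_i = X.shape[0]
            W[:, i] = np.linalg.solve(X.T @ X / n_i + lam * D_inv, X.T @ y / n_i)
        # D-step: closed-form minimizer of tr(W^T D^{-1} W) subject to tr(D) <= 1.
        root = np.real(sqrtm(W @ W.T + eps * np.eye(d)))
        D = root / np.trace(root)
    return W, D

# Tiny usage example with two related regression tasks.
rng = np.random.default_rng(0)
w_shared = rng.normal(size=5)
Xs = [rng.normal(size=(30, 5)) for _ in range(2)]
ys = [X @ w_shared + 0.05 * rng.normal(size=30) for X in Xs]
W, D = mtfl_alternating(Xs, ys)
print(np.round(D, 3))   # learned feature covariance shared by the tasks
```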
are among the first to study the multi-task feature selection (MTFS) problem based on the ℓ_{2,1} norm, with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\|W\|_{2,1}.   (4)

The regularizer on W in problem (4) enforces W to be row-sparse, which in turn helps select important features. In [23], a path-following algorithm is proposed to solve problem (4), and then Liu et al. [24] employ an optimal first-order optimization method to solve it. Compared with problem (1), we can see that problem (4) is similar to the MTFL method without learning the transformation U. Lee et al. [25] propose a weighted ℓ_{2,1} norm for multi-task feature selection where the weights can be learned as well, and problem (4) is extended in [26] to a general case where feature groups can overlap with each other. In order to make problem (4) more robust to outliers, a square-root loss function is investigated.

The non-negative constraint on u_j is to keep the model identifiable. It has been proved in [31] that problem (7) leads to a regularizer \sum_{j=1}^d \sqrt{\|w^j\|_1}, the square root of the ℓ_{1,1} norm regularization. Moreover, Wang et al. [32] extend problem (7) to a general situation where the regularizer becomes \lambda_1\sum_{i=1}^m \|\hat{w}^i\|_p^p + \lambda_2\|u\|_q^q. By utilizing a priori information describing the task relations in a hierarchical structure, Han et al. [33] propose a multi-component product-based decomposition for w_{ij}, where the number of components in the decomposition can be arbitrary instead of only 2 as in [31], [32]. Similar to [31], Jebara [34] proposes to learn a binary indicator vector to do multi-task feature selection based on the maximum entropy discrimination formalism.

Similar to [33], where a priori information is given to describe task relations in a hierarchical/tree structure, Kim and Xing [35] utilize the given tree structure to design a regularizer on W as f(W) = \sum_{i=1}^d\sum_{v\in V} \lambda_v \|w_{i,G_v}\|_2, where V denotes the set of nodes in the given tree structure, G_v
denotes the set of leaf nodes (i.e., tasks) in the sub-tree rooted at node v, and w_{i,G_v} denotes the subvector of the ith row of W indexed by G_v. This regularizer not only enforces each row of W to be sparse, as the ℓ_{2,1} norm does in problem (4), but also induces sparsity in subsets of each row of W based on the tree structure.

Different from conventional multi-task feature selection methods, which assume that different tasks share a set of original features, Zhou et al. [36] consider a different scenario where useful features in different tasks have no overlap. In order to achieve this, an exclusive Lasso model is proposed with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\|W\|_{1,2}^2,

where the regularizer is the squared ℓ_{1,2} norm on W.

Another way to select common features for MTL is to use sparse priors to design probabilistic or Bayesian models. For ℓ_{p,1}-regularized multi-task feature selection, Zhang et al. [37] propose a probabilistic interpretation where the ℓ_{p,1} regularizer corresponds to a generalized normal prior: w_i^j ~ GN(·|0, ρ_j, p), where · denotes a (random) variable that we do not want to introduce explicitly. Based on this interpretation, Zhang et al. [37] further propose a probabilistic framework for multi-task feature selection, in which task relations and outlier tasks can be identified, based on the matrix-variate generalized normal prior.

In [38], a generalized horseshoe prior is proposed to do feature selection for MTL as

P(w_i) = \int \prod_{j=1}^d N\!\left(w_i^j \,\Big|\, 0, \frac{u_i^j}{v_i^j}\right) N(u_i | 0, \rho^2 C)\, N(v_i | 0, \gamma^2 C)\, du_i\, dv_i,

where N(·|μ, σ) denotes a univariate or multivariate normal distribution with μ as the mean and σ as the variance or covariance matrix, u_i^j and v_i^j are the jth entries of u_i and v_i, respectively, and ρ, γ are hyperparameters. Here C, which is shared by all the tasks, denotes the feature correlation matrix to be learned from data, and it encodes an assumption that different tasks share identical feature correlations. When C becomes an identity matrix, which means that the features are independent, this prior degenerates to the horseshoe prior.

Hernandez-Lobato et al. [39] propose a probabilistic model based on the horseshoe prior as

P(w_i^j) = \left[p(w_i^j)^{h_i^j}\,\delta_0^{1-h_i^j}\right]^{z_j} \left[p(w_i^j)^{t_i^j}\,\delta_0^{1-t_i^j}\right]^{v_i(1-z_j)} \left[p(w_i^j)^{g_j}\,\delta_0^{1-g_j}\right]^{(1-v_i)(1-z_j)},   (8)

where δ_0 denotes the point mass at zero and the binary variables z_j, v_i, h_i^j, t_i^j and g_j encode different situations of features and tasks. So this model can also handle outlier tasks, but in a different way from [37].

2.1.3 Comparison Between Two Sub-Categories
The two sub-categories have different characteristics: the feature transformation approach learns a transformation of the original features as the new representation, while the feature selection approach selects a subset of the original features as the new representation for all the tasks. Based on the characteristics of those two approaches, the feature selection approach can be viewed as a special case of the feature transformation approach in which the transformation matrix is a diagonal 0/1 matrix whose diagonal entries with value 1 correspond to the selected features. By selecting a subset of the original features as the new representation, the feature selection approach has better interpretability.

2.2 Low-Rank Approach
The relatedness among multiple tasks can imply that W is of low rank, leading to the low-rank approach. For example, if the ith, jth and kth tasks are related in that the model parameters w_i of the ith task are a linear combination of those of the other two tasks, then it is easy to show that the rank of W is at most m − 1 and hence W is of low rank. From this perspective, the more related the tasks are, the lower the rank of W is.

Ando and Zhang [40] assume that the model parameters of different tasks share a low-rank subspace in part; specifically, w_i takes the form

w_i = u_i + Q^T v_i.   (9)

Here Q ∈ R^{h×d}, with h < d, is the low-rank subspace shared by the multiple tasks. Then we can write this in matrix form as W = U + Q^T V. Based on this form of W, the objective function proposed in [40] is formulated as

\min_{U,V,Q,b} L(U + Q^T V, b) + \lambda\|U\|_F^2 \quad \mathrm{s.t.}\ QQ^T = I,   (10)

where ||·||_F denotes the Frobenius norm. The orthonormal constraint on Q in problem (10) makes the subspace non-redundant. When λ is large enough, the optimal U can become a zero matrix, and hence problem (10) is very similar to problem (1) except that there is no regularization on V in problem (10) and that Q has a smaller number of rows than columns. Chen et al. [41] generalize problem (10) as

\min_{U,V,Q,b} L(W, b) + \lambda_1\|U\|_F^2 + \lambda_2\|W\|_F^2 \quad \mathrm{s.t.}\ W = U + Q^T V,\ QQ^T = I.   (11)
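As a quick numerical check of the rank argument at the beginning of Section 2.2 (a toy example of ours, not from the survey): if one task's parameter vector is a linear combination of two others, the parameter matrix W loses rank.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 8, 5
W = rng.normal(size=(d, m))
# Make the 5th task's parameters a linear combination of tasks 1 and 2.
W[:, 4] = 0.7 * W[:, 0] - 1.3 * W[:, 1]
print(np.linalg.matrix_rank(W))   # prints 4, i.e., at most m - 1, so W is of low rank
```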
than that of the non-convex problem (11). Compared with the alternative objective function (2) in the MTFL method, problem (12) has a similar formulation in which M models the feature covariance for all the tasks. Problem (10) is extended in [42] to a general case where the different w_i's lie in a manifold instead of a subspace. Moreover, in [43], a latent variable model is proposed for W with the same decomposition as Eq. (9), and it can provide a framework for MTL by modeling more cases than problem (10), such as task clustering, sharing a sparse representation, duplicate tasks and evolving tasks.

It is well known that using the trace norm as a regularizer can make a matrix have low rank, and hence this regularization is suitable for MTL. Specifically, an objective function with trace norm regularization is proposed in [44] as

\min_{W,b} L(W,b) + \lambda\|W\|_{S(1)},   (13)

where μ_i(W) denotes the ith smallest singular value of W and \|W\|_{S(1)} = \sum_{i=1}^{\min(m,d)} \mu_i(W) denotes the trace norm of the matrix W. Based on the trace norm, Han and Zhang [45] propose a capped trace regularizer with the objective function formulated as

\min_{W,b} L(W,b) + \lambda\sum_{i=1}^{\min(m,d)} \min(\mu_i(W), \theta).   (14)

With the use of the threshold θ, the capped trace regularizer only penalizes small singular values of W, which is related to the determination of the rank of W. When θ is large enough, the capped trace regularizer becomes the trace norm and hence problem (14) reduces to problem (13). Moreover, a spectral k-support norm is proposed in [46] as an improvement over the trace norm regularization.

The trace norm regularization has been extended to regularize model parameters in deep multi-task models. Specifically, the weights in the last several fully connected layers of deep multi-task neural networks can be viewed as the parameters of the learners of all the tasks. In this view, the weights connecting two consecutive layers for one task can be organized in a matrix, and hence the weights of all the tasks can form a tensor. Based on such tensor representations, several tensor trace norms, which are based on the trace norm, are used in [47] as regularizers to identify the low-rank structure of the parameter tensor.

2.3 Task Clustering Approach
The task clustering approach assumes that different tasks form several clusters, each of which consists of similar tasks. As indicated by its name, this approach has a close connection with clustering algorithms, and it can be viewed as an extension of clustering algorithms to the task level, while conventional clustering algorithms operate on the data level.

Thrun and Sullivan [48] propose the first task clustering algorithm by using a weighted nearest neighbor classifier for each task, where the initial weights defining the weighted euclidean distance are learned by minimizing pairwise within-class distances and maximizing pairwise between-class distances simultaneously within each task. Then a task transfer matrix A is defined with its (i,j)th entry a_{ij} recording the generalization accuracy obtained for task T_i by using task T_j's distance metric. Based on A, the m tasks can be grouped into r clusters {C_i}_{i=1}^r by maximizing \sum_{t=1}^r \frac{1}{|C_t|}\sum_{i,j\in C_t} a_{ij}, where |·| denotes the cardinality of a set. After obtaining the cluster structure among all the tasks, the training data of the tasks in a cluster are pooled together to learn the final weighted nearest neighbor classifier. This approach has been extended to an iterative learning process [49] in a similar way to k-means clustering.

Bakker and Heskes [50] propose a multi-task Bayesian neural network model with a network structure similar to Fig. 2, where the input-to-hidden weights are shared by all the tasks but the hidden-to-output weights are task-specific. By defining w_i as the vector of hidden-to-output weights for task T_i, the multi-task Bayesian neural network assigns a mixture of Gaussians prior to it: w_i ~ \sum_{j=1}^r \pi_j N(\cdot|m_j, \Sigma_j), where π_j, m_j and Σ_j specify the mixing proportion, the mean and the covariance of the jth cluster. Tasks in a cluster share a Gaussian distribution. When r equals 1, this model degenerates to a case where the model parameters of different tasks share a single prior, which is similar to several Bayesian MTL models such as [51], [52], [53] that are based on Gaussian processes and t processes.

Xue et al. [54] deploy the Dirichlet process to do clustering on the task level. Specifically, it defines the prior on w_i as

w_i \sim G,\quad G \sim DP(\alpha, G_0)\ \ \forall i \in [m],

where DP(α, G_0) denotes a Dirichlet process with α as a positive scaling parameter and G_0 a base distribution. To see the clustering effect, by integrating out G, the conditional distribution of w_i, given the model parameters of the other tasks W_{-i} = {..., w_{i-1}, w_{i+1}, ...}, is

P(w_i | W_{-i}, \alpha, G_0) = \frac{\alpha}{m-1+\alpha} G_0 + \frac{1}{m-1+\alpha}\sum_{j=1, j\ne i}^m \delta_{w_j},

where δ_{w_j} denotes the distribution concentrated at the single point w_j. So w_i can be equal either to some w_j (j ≠ i) with probability 1/(m−1+α), which corresponds to the case that those two tasks lie in the same cluster, or to a new sample from G_0 with probability α/(m−1+α), which is the case that task T_i forms a new task cluster. When α is large, the chance of forming a new task cluster is large, and so α affects the number of task clusters. This model is extended in [55], [56] to a case where different tasks in a task cluster share useful features via a matrix stick-breaking process and a beta-Bernoulli hierarchical prior, respectively, and in [57] to the case where each task is a compressive sensing task. Moreover, a nested Dirichlet process is proposed in [58], [59] to use Dirichlet processes to learn both task clusters and the state structure of an infinite hidden Markov model, which handles sequential data in each task. In [60], w_i is decomposed as w_i = u_i + Q_i^T v_i similarly to Eq. (9), where u_i and Q_i are sampled according to a Dirichlet process.

Different from [50], [54], Jacob et al. [61] aim to learn task clusters under the regularization framework by considering three orthogonal aspects, including a global penalty to measure on average how large the parameters are, a measure of between-cluster variance to quantify the distance among different task clusters, and a measure of within-cluster variance to quantify the compactness of each task cluster.
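The clustering behavior of the Dirichlet process prior of Xue et al. [54] can be simulated directly from the conditional distribution above (the Chinese restaurant process view): each new task joins an existing cluster with probability proportional to its size or starts a new cluster with probability proportional to α. The sketch below simulates only the prior, not the full model, and the sizes are arbitrary; it shows how α controls the expected number of task clusters.

```python
import numpy as np

def sample_task_clusters(num_tasks, alpha, rng):
    """Sample a task-cluster assignment from the Chinese restaurant process with concentration alpha."""
    assignments = [0]                          # the first task starts cluster 0
    for i in range(1, num_tasks):
        counts = np.bincount(assignments)
        # Existing cluster j is chosen w.p. count_j/(i+alpha); a new cluster w.p. alpha/(i+alpha).
        probs = np.append(counts, alpha) / (i + alpha)
        assignments.append(rng.choice(len(probs), p=probs))
    return assignments

rng = np.random.default_rng(0)
for alpha in (0.1, 1.0, 10.0):
    sizes = [len(set(sample_task_clusters(30, alpha, rng))) for _ in range(200)]
    print(alpha, np.mean(sizes))   # larger alpha -> more task clusters on average
```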
W has rank at most r. From the perspective of modeling, by setting u_i to be a zero vector in Eq. (9), we can see that the decomposition of W in [40] becomes similar to those in [64], [66], which in some sense shows the relation between those two approaches. Moreover, the equivalence between problems (12) and (15), two typical methods in the low-rank and task clustering approaches, has been proved in [68]. The task clustering approach can visualize the learned cluster structure, which is an advantage over the low-rank approach.

2.4 Task Relation Learning Approach
In MTL, tasks are related, and the task relatedness can be quantified via task similarity, task correlation, task covariance and so on. Here we use task relations to cover all such quantitative relatedness.

In earlier studies on MTL, task relations are assumed to be known as a priori information. In [69], [70], each task is assumed to be similar to every other task, and so the model parameters of each task are enforced to approach the average model parameters of all the tasks. In [71], [72], task similarities for each pair of tasks are given, and these studies utilize the task similarities to design regularizers to guide the learning of multiple tasks, based on the principle that the more similar two tasks are, the closer the corresponding model parameters are expected to be. A formulation similar to [71] is proposed in [73] to estimate the means of multiple distributions by learning pairwise task relations, and another similar formulation is proposed in [74] for log-density gradient estimation. Given a tree structure describing the relations among tasks, in [75] the model parameters of a task corresponding to a node in the tree are enforced to be similar to those of its parent node.

However, in most applications, task relations are not available. In this case, learning task relations from data automatically is a good option. Bonilla et al. [76] propose a multi-task Gaussian process (MTGP) by defining a prior on f_j^i, the functional value for x_j^i, as f ~ N(·|0, Σ), where f = (f_1^1, ..., f_{n_m}^m)^T contains the functional values for all the training data. Σ, the covariance matrix, defines the covariance between f_j^i and f_q^p as σ(f_j^i, f_q^p) = ω_{ip} k(x_j^i, x_q^p), where k(·,·) denotes a kernel function and ω_{ip} describes the covariance between tasks T_i and T_p. In order to keep Σ positive definite, the matrix Ω containing ω_{ip} as its (i,p)th entry is also required to be positive definite, which makes Ω the task covariance describing the similarities between tasks. Then, based on the Gaussian likelihood for the labels given f, the analytical marginal likelihood obtained by integrating out f can be used to learn Ω from data. In [77], the learning curve and generalization bound of the MTGP are studied. Since Ω in the MTGP has a point estimate, which may lead to overfitting, Zhang and Yeung [78] propose, based on a weight-space view of the MTGP, a multi-task generalized t process by placing an inverse-Wishart prior on Ω as Ω ~ IW(·|ν, Ψ), where ν denotes the degrees of freedom and Ψ is the base covariance for generating Ω. Since Ψ models the covariance between pairs of tasks, it can be determined based on the maximum mean discrepancy (MMD).

Different from [76], [78], which are Bayesian models, Zhang and Yeung [79], [80] propose a regularized multi-task model called multi-task relationship learning (MTRL) by placing a matrix-variate normal prior on W as

W \sim MN(\cdot|0, I, \Omega),   (20)

where MN(·|M, A, B) denotes a matrix-variate normal distribution with M as the mean, A the row covariance, and B the column covariance. Based on this prior as well as some likelihood function, the objective function for a modified maximum a posteriori solution is formulated as

\min_{W,b,\Omega} L(W,b) + \lambda_1\|W\|_F^2 + \lambda_2\,\mathrm{tr}(W\Omega^{-1}W^T) \quad \mathrm{s.t.}\ \Omega \succeq 0,\ \mathrm{tr}(\Omega) \le 1,   (21)

where the second term in the objective function penalizes the complexity of W, the last term is due to the matrix-variate normal prior, and the constraints control the complexity of the positive definite covariance matrix Ω. It has been proved in [79], [80] that problem (21) is jointly convex with respect to W, b and Ω. Problem (21) has been extended to multi-task boosting [81] and to multi-label learning [82] by learning label correlations. Problem (21) can also be interpreted from the perspective of reproducing kernel Hilbert spaces for vector-valued functions [83], [84], [85], [86]. Moreover, problem (21) is extended to learn sparse task relations in [87] via ℓ_1 regularization on Ω when the number of tasks is large. A model similar to problem (21) is proposed in [88] via a matrix-variate normal prior on W, W ~ MN(·|0, Ω_1, Ω_2), where Ω_1^{-1} and Ω_2^{-1} are assumed to be sparse. The MTRL model is extended in [89] to use the symmetric matrix-variate generalized hyperbolic distribution to learn a block-sparse structure in W, and in [90] to use the matrix generalized inverse Gaussian prior to learn low-rank Ω_1 and Ω_2. Moreover, the MTRL model is generalized to the multi-task feature selection problem [37] by learning task relations via the matrix-variate generalized normal distribution. Since the prior defined in Eq. (20) implies that W^T W follows a Wishart distribution W(·|0, Ω), Zhang and Yeung [91] generalize it as

(W^T W)^t \sim W(\cdot|0, \Omega),   (22)

where t is a positive integer to model high-order task relationships. Eq. (22) can induce a new prior on W, which is a generalization of the matrix-variate normal distribution, and based on this new prior a regularized method is devised in [91] to learn high-order task relations. The MTRL model has also been extended to multi-output regression [90], [92], [93], [94] by modeling the structure contained in the noise via matrix-variate priors. For deep neural networks, the MTRL method has been extended in [95] by placing a tensor-variate normal distribution as a prior on the parameter tensor of the fully connected layers.

Different from the aforementioned methods, which investigate the use of global learning models in MTL, Zhang [96] aims to learn task relations in local learning methods such as the k-nearest-neighbor (kNN) classifier by defining the learning function as a weighted voting of neighbors:

f(x_j^i) = \sum_{(p,q)\in N_k(i,j)} \sigma_{ip}\, s(x_j^i, x_q^p)\, y_q^p,   (23)

where N_k(i,j) denotes the set of task indices and instance indices of the k nearest neighbors of x_j^i, i.e., (p,q) ∈ N_k(i,j) means that x_q^p is one of the k nearest neighbors of x_j^i,
s(x_j^i, x_q^p) defines the similarity between x_j^i and x_q^p, and σ_{ip} represents the contribution of task T_p to T_i when T_p has data points that are neighbors of a data point in T_i. σ_{ip} can be viewed as the similarity from T_p to T_i. When σ_{ip} = 1 for all i and p, Eq. (23) reduces to the decision function of the kNN classifier for all the tasks. Then the objective function to learn Σ, the m × m matrix with σ_{ip} as its (i,p)th entry, can be formulated as

\min_{\Sigma} \sum_{i=1}^m \frac{1}{n_i}\sum_{j=1}^{n_i} l(y_j^i, f(x_j^i)) + \frac{\lambda_1}{4}\|\Sigma - \Sigma^T\|_F^2 + \frac{\lambda_2}{2}\|\Sigma\|_F^2 \quad \mathrm{s.t.}\ \sigma_{ii} \ge 0\ \forall i \in [m],\ -\sigma_{ii} \le \sigma_{ij} \le \sigma_{ii}\ \forall i \ne j.   (24)

The first regularizer in problem (24) enforces Σ to be nearly symmetric and the second one penalizes the complexity of Σ. The constraints in problem (24) guarantee that the similarity from one task to itself is positive and also the largest. Similarly, a multi-task kernel regression is proposed in [96] for regression tasks.

While the task relations learned by the aforementioned methods are symmetric (except [96]), Lee et al. [97] focus on learning asymmetric task relations. Since different tasks are assumed to be related, w_i can lie in the space spanned by the columns of W, i.e., w_i ≈ W a_i, and hence we have W ≈ WA. Here the matrix A can be viewed as encoding asymmetric task relations between pairs of tasks. By assuming that A is sparse, the objective function is formulated as

\min_{W,b,A} \sum_{i=1}^m (1 + \lambda_1\|\hat{a}^i\|_1) \sum_{j=1}^{n_i} l(y_j^i, (w_i)^T x_j^i + b_i) + \lambda_2\|W - WA\|_F^2 \quad \mathrm{s.t.}\ a_{ij} \ge 0\ \forall i,j \in [m],   (25)

where â^i denotes the ith row of A with a_{ii} deleted. The term in front of the training loss of each task, i.e., 1 + λ_1||â^i||_1, not only enforces A to be sparse but also allows asymmetric information sharing from easier tasks to more difficult ones. The regularizer in problem (25) makes W approach WA, with the closeness depending on λ_2. To see the connection between problems (25) and (21), we rewrite the regularizer in problem (25) as ||W − WA||_F^2 = tr(W(I − A)(I − A)^T W^T). Based on this reformulation, the regularizer in problem (25) is a special case of that in problem (21) obtained by setting Ω^{-1} = (I − A)(I − A)^T. Though A is asymmetric, from the perspective of the regularizer, the task relations here are symmetric and act as a task precision matrix of a restrictive form.

2.5 Decomposition Approach
The decomposition approach assumes that the parameter matrix W can be decomposed into two or more component matrices {W_k}_{k=1}^h with h ≥ 2, i.e., W = \sum_{k=1}^h W_k. The objective functions of most methods in this approach can be unified as

\min_{\{W_i\}\in C_W,\, b} L\left(\sum_{k=1}^h W_k, b\right) + \sum_{k=1}^h g_k(W_k),   (26)

where the regularizer is decomposable with respect to the W_k's and C_W denotes a set of constraints on the component matrices.

To help understand problem (26), we introduce several instantiations as follows.

In [98], where h equals 2 and C_W = ∅ is an empty set, g_1(·) and g_2(·) are defined as

g_1(W_1) = \lambda_1\|W_1\|_{\infty,1},\quad g_2(W_2) = \lambda_2\|W_2\|_1,

where λ_1 and λ_2 are positive regularization parameters. Similar to problem (5), each row of W_1 is likely to be a zero row and hence g_1(W_1) can help select important features. Due to the ℓ_1 norm regularization, g_2(W_2) makes W_2 sparse. Because of the characteristics of the two regularizers, the parameter matrix W can eliminate unimportant features for all the tasks when the corresponding rows in both W_1 and W_2 are sparse. Moreover, W_2 can identify features for tasks which have their own useful features that may be outliers for other tasks. Hence this model can be viewed as a 'robust' version of problem (5).

With two component matrices, Chen et al. [99] define

g_2(W_2) = \lambda_2\|W_2\|_1,\quad C_W = \{W_1\,|\,\|W_1\|_{S(1)} \le \lambda_1\},   (27)

where g_1(W_1) = 0. Similar to problem (13), C_W makes W_1 low-rank. With the sparse regularizer g_2(W_2), W_2 makes the entire model matrix W more robust to outlier tasks in a way similar to the previous model. When λ_2 is large enough, W_2 becomes a zero matrix and then problem (27) acts similarly to problem (13).

The g_i(·)'s in [100], where C_W = ∅, are defined as

g_1(W_1) = \lambda_1\|W_1\|_{S(1)},\quad g_2(W_2) = \lambda_2\|W_2^T\|_{2,1}.   (28)

Different from the above two models, which assume that W_2 is sparse, here g_2(W_2) enforces W_2 to be column-sparse. For related tasks, their columns in W_1 are correlated via the trace norm regularization and the corresponding columns in W_2 are zero. For outlier tasks, which are unrelated to the other tasks, the corresponding columns in W_2 can take arbitrary values, and hence the model parameters in W for them have no low-rank structure even though those in W_1 may have.

In [101], these functions are defined as

g_1(W_1) = \lambda_1\|W_1\|_{2,1},\quad g_2(W_2) = \lambda_2\|W_2^T\|_{2,1},\quad C_W = \emptyset.   (29)

Similar to problem (4), g_1(W_1) makes W_1 row-sparse. Here g_2(W_2) is identical to that in [100] and makes W_2 column-sparse. Hence W_1 helps select useful features, while the non-zero columns in W_2 capture outlier tasks.

With h = 2, Zhong and Kwok [102] define

g_1(W_1) = \lambda_1 c(W_1) + \lambda_2\|W_1\|_F^2,\quad g_2(W_2) = \lambda_3\|W_2\|_F^2,\quad C_W = \emptyset,

where c(U) = \sum_{i=1}^d\sum_{k>j} |u_{ij} - u_{ik}| with u_{ij} the (i,j)th entry of a matrix U. Due to the sparse nature of the ℓ_1 norm, c(W_1) enforces corresponding entries in different columns of W_1 to become identical, which is equivalent to clustering tasks in terms of individual model parameters. The squared Frobenius norm regularizations in g_1(W_1) and g_2(W_2) penalize the complexities of W_1 and W_2. The use of W_2 improves the model flexibility when not all the tasks exhibit a clear cluster structure.
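To make such two-component models concrete, the following numpy sketch (our illustration, with a squared loss and a fixed step size as assumptions) performs proximal gradient steps for a model in the spirit of problem (28): a trace-norm penalty on W_1 handled by singular value soft-thresholding and a column-wise group-sparse penalty on W_2 handled by column-wise soft-thresholding. Because the smooth loss depends only on W_1 + W_2 and the regularizers are separable, each component gets the same loss gradient followed by its own proximal operator.

```python
import numpy as np

def prox_trace_norm(M, tau):
    """Singular value soft-thresholding: proximal operator of tau * trace norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

def prox_column_group(M, tau):
    """Column-wise group soft-thresholding: proximal operator of tau * sum_i ||column_i||_2."""
    norms = np.linalg.norm(M, axis=0, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return M * scale

def proximal_step(W1, W2, Xs, ys, lam1, lam2, step):
    """One proximal gradient step on a squared multi-task loss for W = W1 + W2."""
    grad = np.zeros_like(W1)
    for i, (X, y) in enumerate(zip(Xs, ys)):
        w_i = W1[:, i] + W2[:, i]
        grad[:, i] = X.T @ (X @ w_i - y) / X.shape[0]      # gradient of the per-task loss
    W1 = prox_trace_norm(W1 - step * grad, step * lam1)     # low-rank component
    W2 = prox_column_group(W2 - step * grad, step * lam2)   # column-sparse (outlier-task) component
    return W1, W2

rng = np.random.default_rng(0)
Xs = [rng.normal(size=(40, 6)) for _ in range(4)]
ys = [X @ rng.normal(size=6) for X in Xs]
W1, W2 = np.zeros((6, 4)), np.zeros((6, 4))
for _ in range(100):
    W1, W2 = proximal_step(W1, W2, Xs, ys, lam1=0.1, lam2=0.1, step=0.05)
```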
Different from the aforementioned methods, which have only two component matrices, an arbitrary number of component matrices is considered in [103] with

g_k(W_k) = \lambda\left[(h - k)\|W_k\|_{2,1} + (k - 1)\|W_k\|_1\right]/(h - 1),   (30)

where C_W = ∅. According to Eq. (30), W_k is assumed to be both sparse and row-sparse for all k ∈ [h]. Based on the different weights on the two terms of the regularizer of W_k, we can see that as k increases, W_k is more likely to be sparse than row-sparse. Even though each W_k is sparse or row-sparse, the entire parameter matrix W can be non-sparse, and hence this model can discover the latent sparse structure among tasks.

In the above methods, different component matrices have no direct connection. When there is a dependency among component matrices, problem (26) can model more complex structures among tasks. For example, Han and Zhang [104] define

g_k(W_k) = \lambda\sum_{i>j}\|w_k^i - w_k^j\|_2/h^{k-1}\ \ \forall k \in [h],\qquad C_W = \{\{W_k\}\,|\,|w_{k-1}^i - w_{k-1}^j| \ge |w_k^i - w_k^j|\ \forall k \ge 2,\ \forall i > j\},

where w_k^i denotes the ith column of W_k. Note that the constraint set C_W relates the component matrices, and the regularizer g_k(W_k) gives each pair w_k^i and w_k^j a chance to become identical. Once this happens for some i, j, k, then based on the constraint set C_W, w_{k'}^i and w_{k'}^j will always have the same value for k' ≥ k. This corresponds to sharing all the ancestor nodes for two internal nodes in a tree, and hence this method can learn a hierarchical structure to characterize task relations. When the constraints are removed, this method reduces to the multi-level task clustering method [63], which is a generalization of problem (16).

Another way to relate different component matrices is to use a non-decomposable regularizer, as [105] does, which is slightly different from problem (26) in terms of the regularizer. Specifically, given m tasks, there are 2^m − 1 possible non-empty task clusters. All the task clusters can be organized in a tree, where the root node is a dummy node, nodes in the second level represent groups with a single task, and the parent-child relations are 'subset of' relations. In total, there are h ≤ 2^m component matrices, each of which corresponds to a node in the tree, and hence an index a is used to denote both a level and the corresponding node in the tree. The objective function is formulated as

\min_{\{W_i\},b} L\left(\sum_{k=1}^h W_k, b\right) + \left(\sum_{v\in V}\lambda_v\left(\sum_{a\in D(v)} r(W_a)^p\right)^{1/p}\right)^2 \quad \mathrm{s.t.}\ w_a^i = 0\ \forall i \notin t(a),   (31)

where p takes a value between 1 and 2, D(v) denotes the set of all the descendants of node v, t(a) denotes the set of tasks contained in node a, w_a^i denotes the ith column of W_a, and r(W_a) reflects the relations among the tasks in node a based on W_a. The regularizer in problem (31) is used to prune the subtree rooted at each node v based on the ℓ_p norm. The constraint in problem (31) implies that for tasks not contained in a node a, the corresponding columns in W_a are zero. In [105], r(W_a) adopts the regularizer proposed in [69], which enforces the parameters of all the tasks to approach their average.

Different from deep MTL models, which are deep in terms of layers of feature representations, the decomposition approach can be viewed as a 'deep' approach in terms of model parameters, while most of the previous approaches are just shallow ones, and this gives the decomposition approach a more powerful capacity. Moreover, the decomposition approach reduces to other approaches, such as the feature learning, low-rank and task clustering approaches, when there is only one component matrix, and hence it can be considered as an improved version of those approaches.

2.6 Comparisons Among Different Approaches
Based on the above introduction, we can see that different approaches exhibit their own characteristics. Specifically, the feature learning approach can learn common features that are generic and invariant to all the tasks at hand and even to new tasks. When there exist outlier tasks which are unrelated to the other tasks, the learned features can be influenced by the outlier tasks significantly, and they may cause performance deterioration. By assuming that the parameter matrix is low-rank, the low-rank approach can explicitly learn the subspace of the parameter matrix or implicitly achieve that via some convex or non-convex regularizer. This approach is powerful, but it seems applicable only to linear models, making nonlinear extensions non-trivial to devise. The task clustering approach performs clustering on the task level in terms of model parameters, and it can identify task clusters each of which consists of similar tasks. A major limitation of the task clustering approach is that it can capture positive correlations among tasks in the same cluster but ignores negative correlations among tasks in different clusters. Moreover, even though some methods in this category can automatically determine the number of clusters, most of them still need a model selection method such as cross validation to determine it, which may bring additional computational costs. The task relation learning approach can learn model parameters and pairwise task relations simultaneously. The learned task relations can give us insights into the relations between tasks and hence improve interpretability. The decomposition approach can be viewed as an extension of other parameter-based approaches by equipping them with multi-level parameters, and hence it can model more complex task structures, e.g., tree structures. The number of components in the decomposition approach is important to the performance and needs to be carefully determined.

2.7 Benchmark Datasets and Performance Comparison
In this section, we introduce some benchmark datasets for MTL and compare the performance of different MTL models on them.

Some benchmark datasets for MTL are listed as follows.

School dataset [50]: This dataset is to estimate the examination scores of 15,362 students from 139 secondary schools in London from 1985 to 1987, where each
school is treated as a task. The input consists of four school-specific and three student-specific attributes.

SARCOS dataset:² This dataset studies a multi-output problem of learning the inverse dynamics of 7 SARCOS anthropomorphic robot arms, each of which corresponds to a task, based on 21 features, including seven joint positions, seven joint velocities and seven joint accelerations. This dataset contains 48,933 data points.

Computer Survey dataset [11]: This dataset is taken from a survey of 180 persons/tasks who rated the likelihood of purchasing one of 20 different personal computers, resulting in 36,000 data points in all the tasks. The features contain 13 different computer characteristics (e.g., price, CPU and RAM), while the output is an integer rating on the scale 0-10.

Parkinson dataset [105]: This dataset is to predict the disease symptom score of Parkinson's disease for patients at different times using 19 bio-medical features. This dataset has 5,875 data points for 42 patients, each of whom is treated as a task.

Sentiment dataset:³ This dataset is to classify reviews of four products/tasks, i.e., books, DVDs, electronics and kitchen appliances, from Amazon into two classes: positive and negative reviews. For each task, there are 1,000 positive and 1,000 negative reviews, respectively.

MHC-I dataset [61]: This dataset contains the binding affinities of 15,236 peptides with 35 MHC-I molecules. Each MHC-I molecule is considered as a task, and the goal is to predict whether a peptide binds a molecule.

Landmine dataset [54]: This dataset consists of 9-dimensional data points, whose features are extracted from radar images, from 29 landmine fields/tasks. Each task is to classify a data point into two classes (landmine or clutter). There are 14,820 data points in total.

Office-Caltech dataset [106]: This dataset contains data from 10 common categories shared by the Caltech-256 dataset and the Office dataset, the latter consisting of images collected from three distinct domains/tasks: Amazon, Webcam and DSLR, making this dataset contain 4 tasks. There are 2,533 images in all the tasks.

Office-Home dataset:⁴ This dataset consists of images from 4 different domains/tasks: artistic images, clip art, product images and real-world images. Each task contains images of 65 object categories collected in office and home settings. In total, there are about 15,500 images in all the tasks.

ImageCLEF dataset:⁵ This dataset contains 12 common categories shared by four tasks: Caltech-256, ImageNet ILSVRC 2012, Pascal VOC 2012 and Bing. There are about 2,400 images in all the tasks.

2. http://www.gaussianprocess.org/gpml/data/
3. http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
4. http://hemanthdv.org/OfficeHome-Dataset
5. http://imageclef.org/2014/adaptation

TABLE 1
The Performance Comparison of Representative MTL Models in the Five Approaches on Benchmark Datasets in Terms of Some Evaluation Metric
nMSE stands for 'normalized mean squared error', RMSE stands for 'root mean squared error', and AUC stands for 'Area Under Curve'. ↑ after an evaluation metric indicates that a larger value means better performance, and ↓ indicates the opposite case.

Among the above benchmark datasets, the first four consist of regression tasks while the others are classification tasks, where each task in the Sentiment, MHC-I and Landmine datasets is a binary classification problem and each task in the other three image datasets is a multi-class classification problem. In order to compare different MTL approaches on these benchmark datasets, we select some representative MTL methods from each of the five approaches introduced in the previous sections and list in Table 1 their performance as reported in the MTL literature. We also include, for comparison, the performance of Single-Task Learning (STL), which trains a learning model for each task separately. It is easy to see that MTL models perform better than their STL counterparts in most cases, which verifies the effectiveness of MTL. Usually, different datasets have their own characteristics, making them more suitable for some MTL approach. For example, according to the studies in [50], [71], [102], different tasks in the School dataset are found to be very similar to each other. According to [54], the Landmine dataset can have two task clusters, where the first cluster, consisting of the first 15 tasks, corresponds to regions that are relatively highly foliated, and the remaining tasks belong to another cluster with regions that are bare earth or desert. According to [61], it is well known in the vaccine design community that some molecules/tasks in the MHC-I dataset can be grouped into empirically defined supertypes known to have similar binding behaviors. For those three datasets, according to Table 1 we can see that the task clustering, task relation learning and decomposition approaches have better performance since they can identify the cluster structure contained in the data in a flat or hierarchical way. The other datasets do not have such an obvious structure among tasks, but some MTL models can learn task correlations, which can bring more insights for model design and the interpretation of experimental results. For example,
For example, the task correlations in the SARCOS and Sentiment datasets are shown in Tables 2 and 3 of [79], and the task similarities in the Office-Caltech dataset are shown in Fig. 3b of [95]. Moreover, for image datasets (i.e., Office-Caltech, Office-Home and ImageCLEF), deep MTL models (e.g., [67], [95]) achieve better performance than shallow models since they can learn powerful feature representations, while the remaining datasets come from diverse areas in which shallow models already perform well.

2.8 Another Taxonomy for Regularized MTL Methods
Regularized methods form a main methodology for MTL. Here we classify many regularized MTL algorithms into two main categories: learning with feature covariance and learning with task relations. The former can be viewed as a representative formulation in feature-based MTL, while the latter is for parameter-based MTL.
Objective functions in the first category can be unified as

$$\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Theta}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda}{2}\operatorname{tr}\!\left(\mathbf{W}^{T}\boldsymbol{\Theta}^{-1}\mathbf{W}\right)+f(\boldsymbol{\Theta}),\qquad(32)$$

where $f(\cdot)$ denotes a regularizer or constraint on $\boldsymbol{\Theta}$. From the perspective of probabilistic modeling, the regularizer $\frac{\lambda}{2}\operatorname{tr}(\mathbf{W}^{T}\boldsymbol{\Theta}^{-1}\mathbf{W})$ corresponds to a matrix-variate normal distribution on $\mathbf{W}$ as $\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\frac{1}{\lambda}\boldsymbol{\Theta}\otimes\mathbf{I})$. Based on this probabilistic prior, $\boldsymbol{\Theta}$ models the covariance between the features since $\frac{1}{\lambda}\boldsymbol{\Theta}$ is the row covariance matrix, where each row in $\mathbf{W}$ corresponds to a feature and different tasks share the feature covariance. All the models in this category differ in the choice of the function $f(\cdot)$ on $\boldsymbol{\Theta}$. For example, methods in [10], [40], [41] use $f(\cdot)$ to restrict the trace of $\boldsymbol{\Theta}$ as shown in problems (2) and (12). Moreover, multi-task feature selection methods based on the $\ell_{2,1}$ norm such as [23], [24], [25] can be reformulated as instances of problem (32).
Different from the first category, methods in the second category have a unified objective function as

$$\min_{\mathbf{W},\mathbf{b},\boldsymbol{\Sigma}}\ L(\mathbf{W},\mathbf{b})+\frac{\lambda}{2}\operatorname{tr}\!\left(\mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{T}\right)+g(\boldsymbol{\Sigma}),\qquad(33)$$

where $g(\cdot)$ denotes a regularizer or constraint on $\boldsymbol{\Sigma}$. The regularizer $\frac{\lambda}{2}\operatorname{tr}(\mathbf{W}\boldsymbol{\Sigma}^{-1}\mathbf{W}^{T})$ corresponds to a matrix-variate normal prior on $\mathbf{W}$ as $\mathbf{W}\sim\mathcal{MN}(\mathbf{0},\mathbf{I}\otimes\frac{1}{\lambda}\boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ models the task relations since $\frac{1}{\lambda}\boldsymbol{\Sigma}$ is the column covariance and each column in $\mathbf{W}$ corresponds to a task. From this perspective, the two regularizers for $\mathbf{W}$ in problems (32) and (33) have different meanings even though the formulations look similar. All the methods in this category use different functions $g(\cdot)$ to learn $\boldsymbol{\Sigma}$ with different functionalities. For example, the methods in [69], [70], [71], [72], which utilize a priori information on task relations, directly learn $\mathbf{W}$ and $\mathbf{b}$ by defining $g(\boldsymbol{\Sigma})=0$. Some task clustering methods [61], [65] identify task clusters by assuming that $\boldsymbol{\Sigma}$ has a block structure. Several task relation learning methods including [79], [80], [87], [97], [107] directly learn $\boldsymbol{\Sigma}$ as a covariance matrix by constraining its trace or sparsity in $g(\boldsymbol{\Sigma})$. The trace norm regularization [44] can be formulated as an instance of problem (33).
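To make the second category concrete, the following is a minimal NumPy sketch of a BCD-style solver for a problem-(33)-like objective with squared losses: it alternates between gradient updates of $\mathbf{W}$ with $\boldsymbol{\Sigma}$ fixed and a closed-form update of $\boldsymbol{\Sigma}$ under a unit-trace constraint, the update used by MTRL-style methods [79], [80]. The function name, the synthetic per-task data interface and the step sizes are illustrative assumptions rather than the exact algorithms of the cited works; problem (32) can be handled analogously with the roles of the rows and columns of $\mathbf{W}$ swapped.

```python
import numpy as np

def task_relation_mtl(Xs, ys, lam=1.0, n_outer=20, n_inner=50, lr=1e-2, eps=1e-6):
    """Alternating (BCD-style) optimization of a problem-(33)-like objective:
        sum_t ||X_t w_t - y_t||^2 / n_t + (lam / 2) * tr(W Sigma^{-1} W^T),
    where column t of W holds the parameters of task t and Sigma is the
    m x m task covariance to be learned."""
    m, d = len(Xs), Xs[0].shape[1]
    W = np.zeros((d, m))
    Sigma = np.eye(m) / m                      # start from uncorrelated tasks

    for _ in range(n_outer):
        # Block 1: update W with Sigma fixed by gradient descent on the smooth objective.
        Sigma_inv = np.linalg.inv(Sigma + eps * np.eye(m))
        for _ in range(n_inner):
            grad = np.zeros_like(W)
            for t in range(m):
                resid = Xs[t] @ W[:, t] - ys[t]
                grad[:, t] = 2.0 * Xs[t].T @ resid / Xs[t].shape[0]
            grad += lam * W @ Sigma_inv        # gradient of (lam/2) tr(W Sigma^{-1} W^T)
            W -= lr * grad
        # Block 2: update Sigma with W fixed; under tr(Sigma) = 1 the minimizer is
        # Sigma = (W^T W)^{1/2} / tr((W^T W)^{1/2}), so tasks with aligned parameter
        # vectors receive a large learned covariance.
        G = W.T @ W + eps * np.eye(m)
        vals, vecs = np.linalg.eigh(G)
        root = (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
        Sigma = root / np.trace(root)
    return W, Sigma
```

In this sketch the $\boldsymbol{\Sigma}$-update makes tasks whose columns of $\mathbf{W}$ point in similar directions positively related, which matches the interpretation of $\boldsymbol{\Sigma}$ as a task covariance given above.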
Even though this taxonomy cannot cover all the regularized MTL methods, it can bring insights to understand regularized MTL methods better and help devise more MTL models. For example, a learning framework is proposed in [108] to learn a suitable multi-task model for a given multi-task problem under problem (33) by utilizing $\boldsymbol{\Sigma}$ to represent the corresponding multi-task model.

2.9 Other Settings in MTL
Instead of assuming that different tasks share an identical feature representation, Zhang and Yeung [109] consider a multi-database face recognition problem where face recognition in each database is treated as a task. Since different face databases have different image sizes, the tasks naturally do not lie in the same feature space in this application, leading to a heterogeneous-feature MTL problem. To tackle this problem, a multi-task discriminant analysis (MTDA) is proposed in [109] by first projecting the data of different tasks into a common subspace and then learning a common projection in this subspace to discriminate the different classes in different tasks. In [110], a latent probit model is proposed to generate the data of different tasks in different feature spaces via sparse transformations on a shared latent space and then to generate labels based on this latent space.
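The key structural idea here, per-task projections into a shared subspace followed by a common transformation, can be sketched as follows. This is only an illustration of the heterogeneous-feature setting, not the actual MTDA [109] or latent probit [110] learning procedures; every name and dimension is an assumption.

```python
import numpy as np

def heterogeneous_mtl_forward(xs, P_list, U):
    """Map task-specific inputs of different dimensionalities into a shared
    k-dimensional subspace via per-task projections P_t (d_t x k), then apply a
    common transformation U (k x c) shared by all tasks. How P_t and U are
    learned (e.g., with discriminant-analysis-style objectives) is method-specific
    and omitted in this sketch."""
    return [x @ P @ U for x, P in zip(xs, P_list)]

# Example: two "face databases" with different (flattened) image sizes.
rng = np.random.default_rng(0)
k, c = 16, 8                                          # shared subspace size, number of classes
P_list = [rng.normal(size=(1024, k)), rng.normal(size=(4096, k))]
U = rng.normal(size=(k, c))
xs = [rng.normal(size=(5, 1024)), rng.normal(size=(3, 4096))]
scores = heterogeneous_mtl_forward(xs, P_list, U)     # per-task class scores
```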
In many MTL classification problems, each task is explicitly or implicitly assumed to be a binary classification problem, as each column in the parameter matrix $\mathbf{W}$ contains the model parameters for the corresponding task. It is not difficult to see that many methods in the feature learning approach, low-rank approach and decomposition approach can be directly extended to a general setting where each classification task can be a multi-class classification problem and, correspondingly, multiple columns in $\mathbf{W}$ contain the model parameters of a multi-class classification task. Such a direct extension is applicable since those methods only rely on the entire $\mathbf{W}$ or its rows, but not its columns, as a medium to share knowledge among tasks. However, to the best of our knowledge, there is no theoretical or empirical study investigating such a direct extension. For most methods in the task clustering and task relation learning approaches, the direct extension does not work since, with multiple columns in $\mathbf{W}$ corresponding to one task, we do not know which one(s) can be used to represent this task. Therefore, the direct extension may not be the best solution to the general setting. In the following, we introduce four main approaches other than the direct extension to tackle the general setting in MTL where each classification task can be a multi-class classification problem. The first method is to transform the multi-class classification problem in each task into a binary classification problem. For example, multi-task metric learning [70], [111] can do that by treating a pair of data points from the same class as positive and a pair from different classes as negative. The second recipe is to utilize the characteristics of learners. For example, linear discriminant analysis can handle binary and multi-class classification problems in a unified formulation and hence MTDA [109] can naturally handle them without changing the formulation. The third approach is to directly learn label correspondence among different tasks. In [112], two learning tasks, which share the training data, aim to maximize the mutual information to identify the correspondence between labels in different tasks. By assuming that all the tasks share the same label space, the last approach, including [47], [67], [95], organizes the model parameters of all the tasks in a tensor where the model parameters of each task form a slice.
Then the parameter tensor can be regularized by tensor trace norms [47] and a tensor-variate normal prior [95], or factorized as a product of several low-rank matrices or tensors [67].
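As a small illustration of this tensor view, the sketch below stacks the per-task multi-class parameter matrices into a $d \times C \times m$ tensor built from a rank-$R$ CP-style factorization, so that all tasks share the feature and class factors and differ only in their task factors. This is a generic low-rank construction for intuition, not the specific factorizations or tensor norms of [47], [67], [95]; every name and size is illustrative.

```python
import numpy as np

def cp_parameter_tensor(A, B, C):
    """Build a d x C x m parameter tensor from a rank-R CP-style factorization.
    A: (d, R) feature factors, B: (C, R) class factors, C: (m, R) task factors.
    Slice W[:, :, t] holds the multi-class parameters of task t, so knowledge is
    shared through the common feature and class factors."""
    return np.einsum('dr,cr,mr->dcm', A, B, C)

# Example: 3 tasks, 5 classes, 20 features, rank 4 (all sizes illustrative).
rng = np.random.default_rng(0)
d, n_cls, m, R = 20, 5, 3, 4
W = cp_parameter_tensor(rng.normal(size=(d, R)),
                        rng.normal(size=(n_cls, R)),
                        rng.normal(size=(m, R)))
logits_task0 = rng.normal(size=(10, d)) @ W[:, :, 0]   # class scores for task 0
```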
Most MTL methods assume that the training data in each task are stored in a data matrix. In some cases, the training data in each task exhibit a multi-modal structure and hence are represented in a tensor instead of a matrix. Multilinear multi-task methods proposed in [113], [114] can handle this situation by employing tensor trace norms, a generalization of the trace norm, to perform the regularization.

2.10 Optimization Techniques in MTL
Optimization techniques used in MTL can be categorized into three main classes as follows.
Gradient descent method and its variants: The gradient descent method can be used to optimize the smooth unconstrained objective functions possessed by many MTL models. If the unconstrained objective function is non-smooth, the subgradient can be used instead and then the gradient descent method can also be applied. When there are constraints in the objective function of MTL models [44], [64], the projected gradient descent method can be used to project the updated solution in each step onto the space defined by the constraints. For deep MTL models, stochastic gradient descent methods can be used. Moreover, GradNorm [117] is devised to normalize gradients and learn dynamic loss weights that balance the learning of multiple tasks, [116] proposes the gradient surgery to avoid interference between task gradients, and [115] studies MTL from the perspective of multi-objective optimization.
Block Coordinate Descent (BCD) method: The parameters in many MTL models can be divided into several blocks. For example, the parameters in the learning functions of all the tasks form one block and the parameters representing task relations form another block. Directly optimizing the objective function of such an MTL model with respect to the parameters in all blocks together is not easy. The BCD method, which is also known as the alternating method, is widely used in the MTL literature, e.g., [10], [12], [31], [40], [41], [61], [62], [65], [66], [79], [80], [96], [97], to alternately optimize each block of parameters while fixing the parameters in the other blocks. Hence, each step of the BCD method solves several subproblems, each of which optimizes with respect to one block of parameters. Compared with the original objective function, each subproblem is easier to solve and so the BCD method can help reduce the optimization complexity.
Proximal method [118]: For a nonsmooth objective function in an MTL model, which is the sum of a smooth and a nonsmooth function, the proximal method is frequently used (e.g., [24], [27], [63], [99], [100], [101], [102], [103], [104], [119], [120], [121]) to construct a proximal problem by replacing the smooth function with a quadratic function that may be constructed based on its Taylor series in various ways; the resulting proximal problem is usually easier to solve than the original problem (a minimal sketch of one such proximal step is given below). The proximal method can accelerate the convergence rate of the optimization process or facilitate the design of distributed optimization algorithms.
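As a concrete illustration of the proximal idea mentioned above, the sketch below performs one proximal gradient step for a multi-task objective whose nonsmooth part is the $\ell_{2,1}$ regularizer used in multi-task feature selection (e.g., [23], [24], [25]): a gradient step on the smooth loss followed by the proximal operator of the $\ell_{2,1}$ norm, which shrinks whole rows (features) of $\mathbf{W}$ toward zero. The squared loss, data interface and step size are illustrative assumptions, not the exact algorithms of the cited works.

```python
import numpy as np

def prox_l21(W, tau):
    """Proximal operator of tau * ||W||_{2,1}: shrink each row of W toward zero.
    Rows correspond to features shared across tasks, so a row shrunk to zero
    removes that feature from every task simultaneously."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - tau / np.maximum(norms, 1e-12))
    return scale * W

def proximal_gradient_step(W, Xs, ys, lam, lr):
    """One proximal gradient step for
        sum_t ||X_t w_t - y_t||^2 / n_t + lam * ||W||_{2,1},
    where column t of W holds the parameters of task t."""
    grad = np.zeros_like(W)
    for t, (X, y) in enumerate(zip(Xs, ys)):
        grad[:, t] = 2.0 * X.T @ (X @ W[:, t] - y) / X.shape[0]
    return prox_l21(W - lr * grad, lr * lam)
```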
3 MTL WITH OTHER LEARNING PARADIGMS
In the previous section, we reviewed different MTL approaches for supervised learning tasks. In this section, we overview works that combine MTL with other learning paradigms in machine learning, including unsupervised learning such as clustering, semi-supervised learning, active learning, reinforcement learning, multi-view learning and graphical models, either to further improve the performance of supervised MTL via additional information such as unlabeled data, or to use MTL to help improve the performance of other learning paradigms.
In most applications, labeled data are expensive to collect but unlabeled data are abundant. So in some MTL applications, the training dataset of each task consists of both labeled and unlabeled data, and we hope to exploit the useful information contained in the unlabeled data to further improve the performance of the supervised learning tasks. In machine learning, semi-supervised learning and active learning both utilize unlabeled data, but in different ways. Semi-supervised learning aims to exploit the geometrical information contained in the unlabeled data, while active learning selects representative unlabeled data to query an oracle with the hope of keeping the labeling cost as low as possible. Hence semi-supervised learning and active learning can be combined with MTL, leading to three new learning paradigms: semi-supervised multi-task learning [122], [123], [124], multi-task active learning [125], [126], [127] and semi-supervised multi-task active learning [128]. Specifically, a semi-supervised multi-task classification model is proposed in [122], [123] to use random walks to exploit the unlabeled data in each task and then cluster multiple tasks via a relaxed Dirichlet process. In [124], a semi-supervised multi-task Gaussian process for regression tasks, where different tasks are related via the hyperprior on the kernel parameters of the Gaussian processes of all the tasks, is proposed to incorporate unlabeled data into the design of the kernel function of each task to achieve smoothness in the corresponding functional spaces. Different from these semi-supervised multi-task methods, multi-task active learning adaptively selects informative unlabeled data for multi-task learners, and hence the selection criterion is the core research issue. Reichart et al. [125] argue that the data instances to be selected should be as informative as possible for a set of tasks instead of only one task and hence propose two protocols for multi-task active learning. In [126], the expected error reduction is used as a criterion where each task is modeled by a supervised latent Dirichlet allocation model. Inspired by multi-armed bandits, which balance the trade-off between exploitation and exploration, a selection strategy is proposed in [127] to consider both the risk of a multi-task learner based on the trace norm regularization and the corresponding confidence bound. In [129], the MTRL method (i.e., problem (21)) is extended to the interactive setting where a human expert is queried about partial orderings of pairwise task covariances based on an inconsistency criterion. In [130], a proposed generalization bound is used to select a subset from multiple unlabeled tasks to acquire labels to improve the
generalization performance of all the tasks. For semi-supervised multi-task active learning, Li et al. [128] propose a model that uses the Fisher information as a criterion to select unlabeled data whose labels are to be acquired, with the semi-supervised multi-task classification model [122], [123] as the classifier for each task.
MTL achieves performance improvements not only in supervised learning tasks but also in unsupervised learning tasks such as clustering. In [131], a multi-task Bregman clustering method is proposed based on single-task Bregman clustering by using the earth mover distance to minimize the distances between any pair of tasks in terms of cluster centers, and then in [132], [133], an improved version of [131] and its kernel extension are proposed to avoid the negative effect caused by the regularizer in [131] by choosing the better one between single-task and multi-task Bregman clustering. In [134], a multi-task kernel k-means method is proposed by learning the kernel matrix via both the MMD between any pair of tasks and a Laplacian regularization that helps identify a smooth kernel space. In [135], two proposed multi-task clustering methods are extensions of the MTFL and MTRL methods obtained by treating labels as cluster indicators to be learned. In [136], the principle of MTL is incorporated into subspace clustering by capturing correlations between data instances. In [137], a multi-task clustering method belonging to instance-based MTL is proposed to share data instances among different tasks. In [138], a multi-task spectral clustering algorithm, which can handle the out-of-sample issue via a linear function to learn the cluster assignment, is proposed to achieve feature selection among tasks via the $\ell_{2,1}$ regularization [139]. [140] proposes to identify the task cluster structure and learn task relations together.
Reinforcement Learning (RL) is a promising area in machine learning and has shown superior performance in many applications such as game playing (e.g., Atari and Go) and robotics. MTL can help boost the performance of reinforcement learning, leading to Multi-task Reinforcement Learning (MRL). Some works [141], [142], [143], [144], [145], [146], [147], [148], [149], [150] adapt the ideas introduced in Section 2 to MRL. Specifically, in [141], where a task solves a sequence of Markov Decision Processes (MDPs), a hierarchical Bayesian infinite mixture model is used to model the distribution over MDPs and, for each new MDP, previously learned distributions are used as an informative prior. In [142], a regionalized policy representation is introduced to characterize the behavior of an agent in each task, and a Dirichlet process is placed over regionalized policy representations across multiple tasks to cluster tasks. In [143], a Gaussian process temporal-difference value function model is used for each task and a hierarchical Bayesian approach is used to model the distribution over value functions in different tasks. Calandriello et al. [144] assume that the parameter vectors of the value functions in different tasks are jointly sparse and then extend the MTFS method with the $\ell_{2,1}$ regularization as well as the MTFL method to learn the value functions of multiple tasks together. In [145], a model associating each subtask with a modular subpolicy is proposed to learn from policy sketches, which annotate tasks with sequences of named subtasks and provide information about high-level structural relationships among tasks. In [146], a multi-task contextual bandit is introduced to leverage or learn similarities in contexts among arms to improve the prediction of rewards from contexts. In [147], a multi-task linearly solvable MDP, whose task basis matrix contains a library of component tasks shared by all the tasks, is proposed to maintain a parallel distributed representation of tasks, each of which enables an agent to draw on macro actions simultaneously. In [148], a multi-task deep RL model based on attention can automatically group tasks into sub-networks at a state-level granularity. In [149], a sharing-experience framework is introduced to use task-specific rewards to identify similar parts, defined as shared regions, which can guide the experience sharing of task policies. In [150], multi-task soft option learning, a hierarchical framework based on planning as inference, is regularized by a shared prior to avoid training instabilities and to allow the fine-tuning of options for new tasks without forgetting learned policies. The ideas of compression and distillation have been incorporated into MRL in [151], [152], [153], [154]. For example, in [151], the proposed Actor-Mimic method combines deep reinforcement learning and model compression techniques to train a policy network which can learn to act for multiple tasks. In [152], a policy distillation method is proposed to not only train an efficient network to learn the policy of an agent but also consolidate multiple task-specific policies into a single policy. In [153], the problem of multi-task multi-agent reinforcement learning under partial observability is addressed by distilling decentralized single-task policies into a unified policy across multiple tasks. In [154], each task has its own policy, which is constrained to be close to a shared policy that is trained by distillation. Some works [155], [156], [157], [158], [159], [160] in MRL focus on online and distributed settings. Specifically, in [155], a distributed MRL framework is devised to model MRL as an instance of general consensus, and an efficient decentralized solver is developed. In [156], [157], multiple goal-directed tasks are learned in an online setup without the need for expert supervision by actively sampling harder tasks. In [158], a distributed agent is developed to not only use resources more efficiently in single-machine training but also scale to thousands of machines without sacrificing data efficiency or resource utilization. [159] formulates MRL from the perspective of variational inference and proposes a novel distributed solver with quadratic convergence guarantees. In [160], an online learning algorithm is proposed to dynamically combine different auxiliary tasks which provide gradient directions to speed up the training of the main reinforcement learning task. Some works study the theoretical foundations of MRL. For example, in [161], sharing representations among tasks is analyzed with theoretical guarantees to highlight conditions under which representations should be shared, and finite-time bounds of approximated value-iteration are extended to the multi-task setting. Moreover, there are some works that design novel MRL methods. For example, in [162], an MRL framework is proposed to train an agent to employ hierarchical policies that decide when to use a previously learned policy and when to learn a new skill, with a temporal grammar that helps the agent learn complex temporal dependencies. [163] studies the problem of parallel learning of multiple sequential-decision tasks and proposes to automatically adapt the contribution of each task to the updates of the agent to make all tasks have a similar impact on the learning dynamics.
TABLE 2
The Classification of Works About MTL Applications in Different Areas According to Different MTL Approaches
In [186], a distributed algorithm based on the debiased Lasso, where each machine learns a task, is proposed to learn jointly sparse features in a high-dimensional space. In [187], the MTRL method (i.e., problem (21)) is extended to the distributed setting based on a stochastic dual coordinate ascent method. In [188], to protect the privacy of data, a privacy-preserving distributed MTL method is proposed based on a privacy-preserving proximal gradient algorithm with asynchronous updates. In [189], federated multi-task learning is proposed as an extension of distributed multi-task learning that considers both stragglers and fault tolerance. In [190], a distributed multi-task algorithm is proposed under the online MTL setting.
For high-dimensional data in MTL, we can use multi-task feature selection methods to reduce the dimension or extend single-task dimension reduction techniques to the multi-task setting as done in [109]. Another option is to use feature hashing: in [191], multiple hashing functions are proposed to accelerate the joint learning of multiple tasks.
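The sketch below illustrates the general hashing-trick idea behind such approaches: (task, feature) pairs are hashed into one shared low-dimensional index space with a sign hash to reduce bias, so a single weight vector can serve all tasks without storing a full-dimensional parameter vector per task. It is a generic illustration in the spirit of [191], not the exact construction proposed there; the function name, bucket count and use of Python's built-in hash are assumptions.

```python
import numpy as np

def hashed_task_features(feature_ids, values, task, n_buckets=2**18, seed=0):
    """Hash (task, feature) pairs into a shared n_buckets-dimensional vector.
    Collisions are tolerated; the signed hash makes their contributions
    roughly cancel in expectation."""
    x = np.zeros(n_buckets)
    for f, v in zip(feature_ids, values):
        idx = hash((seed, task, f)) % n_buckets            # shared bucket index
        sign = 1.0 if hash((seed + 1, task, f)) % 2 == 0 else -1.0
        x[idx] += sign * v
    return x

# The same raw features are hashed differently for each task, while one shared
# weight vector of length n_buckets is learned jointly over all tasks.
x_task0 = hashed_task_features([3, 17, 42], [1.0, 0.5, 2.0], task=0)
x_task1 = hashed_task_features([3, 17, 42], [1.0, 0.5, 2.0], task=1)
```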
5 APPLICATIONS
MTL has many applications in various areas including computer vision, bioinformatics, health informatics, speech, NLP, web, and so on. In Table 2, we categorize different MTL problems in each application area according to the MTL approaches they use, where the classification of MTL approaches has already been introduced in Section 2. In the last column of Table 2, we list some problems in various application areas which are different from those in the other columns.
For the application problems listed in Table 2, application-dependent MTL models have been proposed to solve them.6 Though these models are different from each other, there are some characteristics in the respective areas. For example, in computer vision, deep MTL models, most of which belong to the feature transformation approach, exhibit good performance, making this approach popular in computer vision. In bioinformatics and health informatics, the interpretability of learning models is in some sense more important. Therefore, the feature selection and task relation learning approaches are widely used in this area, as the former approach can identify useful features and the latter can quantitatively show task relations. In speech and NLP, the data exhibit a sequential structure, which makes recurrent-neural-network-based deep MTL models in the feature transformation approach play an important role. As the data in web applications are of a large scale, this area favors simple models such as linear models or their ensembles based on boosting. Among all the MTL approaches, the feature transformation, feature selection and task relation learning approaches are the most widely used in different application areas according to Table 2.
6. For details of those models, refer to an arXiv version [248] of this paper.
When encountering a new application problem which can be modeled as an MTL problem, we need to judge whether the tasks in this problem are related in terms of either low-level features or high-level concepts. If so, by treating Table 2 as a look-up table, we can identify a problem in Table 2 similar to the new problem and then adapt the corresponding MTL model to solve the new problem. Otherwise, we can try popular MTL approaches in the respective area.

6 THEORETICAL ANALYSES
As well as designing MTL models and exploiting MTL applications, there are some works that study theoretical aspects of MTL, and here we review them.
The generalization bound, which upper-bounds the generalization loss in terms of the training loss, model complexity and confidence, is core in learning theory since it can identify learnability and induce the sample complexity. There are several works [40], [49], [249], [250], [251], [252], [253], [254], [255], [256], [257], [258], [259], [260] that study the generalization bound of different MTL models. In Table 3, we compare those works in terms of the analyzed MTL model, analysis tool and the convergence rate of the bound.
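As a schematic template of such a bound (for intuition only, not any specific result in Table 3), with probability at least $1-\delta$ one typically obtains, uniformly over the hypotheses $h_1,\dots,h_m$ of the $m$ tasks, something of the form

$$\frac{1}{m}\sum_{t=1}^{m} L_t(h_t)\;\le\;\frac{1}{m}\sum_{t=1}^{m}\hat{L}_t(h_t)\;+\;\operatorname{comp}(\mathcal{H},m,n_0)\;+\;\sqrt{\frac{\ln(1/\delta)}{2\,m\,n_0}},$$

where $L_t$ and $\hat{L}_t$ denote the expected and empirical losses of task $t$, $n_0$ is the average number of training points per task as in Table 3, and $\operatorname{comp}(\cdot)$ is a model-complexity term whose form depends on the analysis tool (e.g., covering numbers or Rademacher/Gaussian complexities). The works compared in Table 3 differ precisely in this complexity term and in how the resulting bound converges as $m$ and $n_0$ grow.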
TABLE 3
Comparison of Generalization Bounds Derived in Different Works in Terms of the
Analyzed MTL Model, Analysis Tool, and the Convergence Rate of the Bound
m denotes the number of tasks and n0 denotes the average number of data points per task.
Lastly, existing studies mainly focus on supervised learning tasks, and only a few are on other tasks such as unsupervised learning, semi-supervised learning, active learning, multi-view learning and reinforcement learning tasks. It is natural to adapt or extend the five approaches introduced in Section 2 to those non-supervised learning tasks. We think that such adaptation and extension require more effort to design appropriate models. Moreover, it is worth trying to apply MTL to other areas in artificial intelligence, such as logic and planning, to broaden its application scope.

ACKNOWLEDGMENTS
This work was supported by the NSFC under Grant 62076118.

REFERENCES
[1] R. Caruana, "Multitask learning," Mach. Learn., vol. 28, pp. 41–75, 1997.
[2] Q. Yang, Y. Zhang, W. Dai, and S. J. Pan, Transfer Learning. New York, NY, USA: Cambridge Univ. Press, 2020.
[3] M.-L. Zhang and Z.-H. Zhou, "A review on multi-label learning algorithms," IEEE Trans. Knowl. Data Eng., vol. 26, no. 8, pp. 1819–1837, Aug. 2014.
[4] G. I. Parisi, R. Kemker, J. L. Part, C. Kanan, and S. Wermter, "Continual lifelong learning with neural networks: A review," Neural Netw., vol. 113, pp. 54–71, 2019.
[5] Y. Zhang and Q. Yang, "An overview of multi-task learning," Nat. Sci. Rev., vol. 5, pp. 30–43, 2018.
[6] X. Yang, S. Kim, and E. P. Xing, "Heterogeneous multitask learning with joint sparsity constraints," in Proc. Int. Conf. Neural Inf. Process. Syst., 2009, pp. 2151–2159.
[7] S. Bickel, J. Bogojeska, T. Lengauer, and T. Scheffer, "Multi-task learning for HIV therapy screening," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 56–63.
[8] X. Liao and L. Carin, "Radial basis function network for multi-task learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2005, pp. 792–802.
[9] D. L. Silver, R. Poirier, and D. Currie, "Inductive transfer with context-sensitive neural networks," Mach. Learn., vol. 73, 2008, Art. no. 313.
[10] A. Argyriou, T. Evgeniou, and M. Pontil, "Convex multi-task feature learning," Mach. Learn., vol. 73, pp. 243–272, 2008.
[11] A. Argyriou, C. A. Micchelli, M. Pontil, and Y. Ying, "A spectral regularization framework for multi-task structure learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2007, Art. no. 1296.
[12] A. Maurer, M. Pontil, and B. Romera-Paredes, "Sparse coding for multitask and transfer learning," in Proc. 30th Int. Conf. Mach. Learn., 2013, pp. 343–351.
[13] J. Zhu, N. Chen, and E. P. Xing, "Infinite latent SVM for classification and multi-task learning," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2011, pp. 1620–1628.
[14] M. K. Titsias and M. Lázaro-Gredilla, "Spike and slab variational inference for multi-task and multiple kernel learning," in Proc. 24th Int. Conf. Neural Inf. Process. Syst., 2011, pp. 2339–2347.
[15] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, "Facial landmark detection by deep multi-task learning," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 94–108.
[16] W. Liu, T. Mei, Y. Zhang, C. Che, and J. Luo, "Multi-task deep visual-semantic embedding for video thumbnail selection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 3707–3715.
[17] W. Zhang et al., "Deep model based transfer and multi-task learning for biological image analysis," IEEE Trans. Big Data, vol. 6, no. 2, pp. 322–333, Jun. 2015.
[18] N. Mrksic et al., "Multi-domain dialog state tracking using recurrent neural networks," in Proc. 53rd Annu. Meeting Assoc. Comput. Linguistics, 2015, pp. 794–799.
[19] S. Li, Z. Liu, and A. B. Chan, "Heterogeneous multi-task learning for human pose estimation with deep convolutional neural network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, 2015, pp. 488–495.
[20] Y. Shinohara, "Adversarial multi-task learning of deep neural networks for robust speech recognition," in Proc. Interspeech, 2016, pp. 2369–2372.
[21] P. Liu, X. Qiu, and X. Huang, "Adversarial multi-task learning for text classification," in Proc. 55th Annu. Meeting Assoc. Comput. Linguistics, 2017, pp. 1–10.
[22] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, "Cross-stitch networks for multi-task learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3994–4003.
[23] G. Obozinski, B. Taskar, and M. Jordan, "Multi-task feature selection," Statistics Dept., Univ. California, Berkeley, CA, USA, Tech. Rep., 2006.
[24] J. Liu, S. Ji, and J. Ye, "Multi-task feature learning via efficient l2,1-norm minimization," in Proc. 25th Conf. Uncertainty Artif. Intell., 2009, pp. 339–348.
[25] S. Lee, J. Zhu, and E. P. Xing, "Adaptive multi-task lasso: With application to eQTL detection," in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 1306–1314.
[26] N. S. Rao, C. R. Cox, R. D. Nowak, and T. T. Rogers, "Sparse overlapping sets lasso for multitask learning and its application to fMRI analysis," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 2202–2210.
[27] P. Gong, J. Zhou, W. Fan, and J. Ye, "Efficient multi-task feature learning with calibration," in Proc. 20th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2014, pp. 761–770.
[28] J. Wang and J. Ye, "Safe screening for multi-task feature learning with multiple data matrices," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1747–1756.
[29] H. Liu, M. Palatucci, and J. Zhang, "Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 649–656.
[30] P. Gong, J. Ye, and C. Zhang, "Multi-stage multi-task feature learning," J. Mach. Learn. Res., vol. 14, pp. 2979–3010, 2013.
[31] A. C. Lozano and G. Swirszcz, "Multi-level lasso for sparse multi-task regression," in Proc. 29th Int. Conf. Mach. Learn., 2012, pp. 595–602.
[32] X. Wang, J. Bi, S. Yu, and J. Sun, "On multiplicative multitask feature learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2014, pp. 2411–2429.
[33] L. Han, Y. Zhang, G. Song, and K. Xie, "Encoding tree sparsity in multi-task learning: A probabilistic framework," in Proc. 28th AAAI Conf. Artif. Intell., 2014, pp. 1854–1860.
[34] T. Jebara, "Multi-task feature and kernel selection for SVMs," in Proc. 21st Int. Conf. Mach. Learn., 2004, Art. no. 55.
[35] S. Kim and E. P. Xing, "Tree-guided group lasso for multi-task regression with structured sparsity," in Proc. 27th Int. Conf. Mach. Learn., 2010, pp. 543–550.
[36] Y. Zhou, R. Jin, and S. C. H. Hoi, "Exclusive lasso for multi-task feature selection," in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010, pp. 988–995.
[37] Y. Zhang, D.-Y. Yeung, and Q. Xu, "Probabilistic multi-task feature selection," in Proc. Int. Conf. Neural Inf. Process. Syst., 2010, pp. 2559–2567.
[38] D. Hernandez-Lobato and J. M. Hernandez-Lobato, "Learning feature selection dependencies in multi-task learning," in Proc. Int. Conf. Neural Inf. Process. Syst., 2013, pp. 746–754.
[39] D. Hernandez-Lobato, J. M. Hernandez-Lobato, and Z. Ghahramani, "A probabilistic model for dirty multi-task feature selection," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1073–1082.
[40] R. K. Ando and T. Zhang, "A framework for learning predictive structures from multiple tasks and unlabeled data," J. Mach. Learn. Res., vol. 6, pp. 1817–1853, 2005.
[41] J. Chen, L. Tang, J. Liu, and J. Ye, "A convex formulation for learning shared structures from multiple tasks," in Proc. 26th Int. Conf. Mach. Learn., 2009, pp. 137–144.
[42] A. Agarwal, H. Daume III, and S. Gerber, "Learning multiple tasks using manifold regularization," in Proc. Adv. Neural Inf. Process. Syst., 2010, pp. 46–54.
[43] J. Zhang, Z. Ghahramani, and Y. Yang, "Learning multiple related tasks using latent independent component analysis," in Proc. Int. Conf. Neural Inf. Process. Syst., 2005, pp. 1585–1592.
[44] T. K. Pong, P. Tseng, S. Ji, and J. Ye, "Trace norm regularization: Reformulations, algorithms, and multi-task learning," SIAM J. Optim., vol. 20, pp. 3465–3489, 2010.
[45] L. Han and Y. Zhang, "Multi-stage multi-task learning with reduced rank," in Proc. 30th AAAI Conf. Artif. Intell., 2016, pp. 1638–1644.
[46] A. M. McDonald, M. Pontil, and D. Stamos, “Spectral k-support [72] T. Kato, H. Kashima, M. Sugiyama, and K. Asai, “Multi-task
norm regularization,” in Proc. 27th Int. Conf. Neural Inf. Process. learning via conic programming,” in Proc. Int. Conf. Neural Inf.
Syst., 2014, 3644–3652. Process. Syst., 2007, pp. 737–744.
[47] Y. Yang and T. M. Hospedales, “Trace norm regularised deep [73] S. Feldman, M. R. Gupta, and B. A. Frigyik, “Revisiting Stein’s
multi-task learning,” in Proc. 5th Int. Conf. Learn. Representation, paradox: Multi-task averaging,” J. Mach. Learn. Res., vol. 15,
Workshop Track, 2017. pp. 3621–366 , 2014.
[48] S. Thrun and J. O’Sullivan, “Discovering structure in multiple [74] I. Yamane, H. Sasaki, and M. Sugiyama, “Regularized multitask
learning tasks: The TC algorithm,” in Proc. 13th Int. Conf. Mach. learning for multidimensional log-density gradient estimation,”
Learn., 1996, pp. 489–497. Neural Comput., vol. 28, no. 7, pp. 1388–1410, Jul. 2016.
[49] K. Crammer and Y. Mansour, “Learning multiple tasks using [75] N. G€ ornitz, C. Widmer, G. Zeller, A. Kahles, S. Sonnenburg, and
shared hypotheses,” in Proc. 25th Int. Conf. Neural Inf. Process. G. R€atsch, “Hierarchical multitask structured output learning for
Syst., 2012, pp. 1475–1483. large-scale sequence segmentation,” in Proc. 24th Int. Conf. Neural
[50] B. Bakker and T. Heskes, “Task clustering and gating for Bayes- Inf. Process. Syst., 2011, pp. 2690–2698.
ian multitask learning,” J. Mach. Learn. Res., vol. 4, pp. 83–99 [76] E. V. Bonilla, K. M. A. Chai, and C. K. I. Williams, “Multi-task
2003. Gaussian process prediction,” in Proc. Int. Neural Inf. Process.
[51] K. Yu, V. Tresp, and A. Schwaighofer, “Learning Gaussian pro- Syst., 2007, 153–160.
cesses from multiple tasks,” in Proc. Int. Conf. Mach. Learn., 2005, [77] K. M. A. Chai, “Generalization errors and learning curves for
pp. 1012–1019 regression with multi-task Gaussian processes,” in Proc. 22nd Int.
[52] S. Yu, V. Tresp, and K. Yu, “Robust multi-task learning with Conf. Neural Inf. Process. Syst., 2009, pp. 279–287.
t-processes,” in Proc. 24th Int. Conf. Mach. Learn., 2007, [78] Y. Zhang and D.-Y. Yeung, “Multi-task learning using general-
pp. 1103–1110. ized t process,” in Proc. 13th Int. Conf. Artif. Intell. Statist., 2010,
[53] W. Lian, R. Henao, V. Rao, J. E. Lucas, and L. Carin, “A multitask pp. 964–971.
point process predictive model,” in Proc. Int. Conf. Mach. Learn., [79] Y. Zhang and D.-Y. Yeung, “A convex formulation for learning
2015, pp. 2030–2038. task relationships in multi-task learning,” in Proc. 26th Conf.
[54] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task Uncertainty Artif. Intell., 2010, pp. 733–742.
learning for classification with Dirichlet process priors,” J. Mach. [80] Y. Zhang and D.-Y. Yeung, “A regularization approach to learn-
Learn. Res., vol. 8, pp. 35–63, 2007. ing task relationships in multitask learning,” ACM Trans. Knowl.
[55] Y. Xue, D. B. Dunson, and L. Carin, “The matrix stick-breaking Discov. Data, vol. 8, pp. 1–31, 2014.
process for flexible multi-task learning,” in Proc. 24th Int. Conf. [81] Y. Zhang and D.-Y. Yeung, “Multi-task boosting by exploiting
Mach. Learn., 2007, pp. 1063–1070. task relationships,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl.
[56] H. Li, X. Liao, and L. Carin, “Nonparametric Bayesian feature Discov. Databases, 2012, pp. 697–710.
selection for multi-task learning,” in Proc. IEEE Int. Conf. Acoust., [82] Y. Zhang and D.-Y. Yeung, “Multilabel relationship learning,”
Speech Signal Process., 2011, pp. 2236–2239. ACM Trans. Knowl. Discov. Data, vol. 7, pp. 1–30, 2013.
[57] Y. Qi, D. Liu, D. B. Dunson, and L. Carin, “Multi-task compres- [83] F. Dinuzzo, C. S. Ong, P. V. Gehler, and G. Pillonetto, “Learning
sive sensing with Dirichlet process priors,” in Proc. 25th Int. Conf. output kernels with block coordinate descent,” in Proc. 28th Int.
Mach. Learn., 2008, 768–775. Conf. Mach. Learn., 2011, pp. 49–56.
[58] K. Ni, L. Carin, and D. B. Dunson, “Multi-task learning for [84] C. Ciliberto, Y. Mroueh, T. A. Poggio, and L. Rosasco, “Convex
sequential data via iHMMs and the nested Dirichlet process,” in learning of multiple tasks and their structure,” in Proc. 32nd Int.
Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 689–696. Conf. Mach. Learn., 2015, pp. 1548–1557.
[59] K. Ni, J. W. Paisley, L. Carin, and D. B. Dunson, “Multi-task [85] C. Ciliberto, L. Rosasco, and S. Villa, “Learning multiple visual
learning for analyzing and sorting large databases of sequential tasks while discovering their structure,” in Proc. IEEE Conf. Com-
data,” IEEE Trans. Signal Process., vol. 56, no. 8, pp. 3918–3931, put. Vis. Pattern Recognit., 2015, pp. 131–139.
Aug. 2008. [86] P. Jawanpuria, M. Lapin, M. Hein, and B. Schiele, “Efficient out-
[60] A. Passos, P. Rai, J. Wainer, and H. Daum e III, “Flexible model- put kernel learning for multiple tasks,” in Proc. Int. Conf. Neural
ing of latent task structures in multitask learning,” in Proc. 29th Process. Syst., 2015, pp. 1189–1197.
Int. Conf. Mach. Learn., 2012, pp. 1283–1290. [87] Y. Zhang and Q. Yang, “Learning sparse task relations in multi-
[61] L. Jacob, F. R. Bach, and J.-P. Vert, “Clustered multi-task learn- task learning,” in Proc. 31st AAAI Conf. Artif. Intell., 2017,
ing: A convex formulation,” in Proc. 18th Int. Conf. Artif. Intell. pp. 2914–2920.
Statist., 2008, pp. 745–752. [88] Y. Zhang and J. G. Schneider, “Learning multiple tasks with a
[62] Z. Kang, K. Grauman, and F. Sha, “Learning with whom to share sparse matrix-normal penalty,” in Proc. 23rd Int. Conf. Neural Inf.
in multi-task feature learning,” in Proc. 28th Int. Conf. Mach. Process. Syst., 2010, pp. 2550–2558.
Learn., 2011, pp. 521–528. [89] C. Archambeau, S. Guo, and O. Zoeter, “Sparse Bayesian multi-
[63] L. Han and Y. Zhang, “Learning multi-level task groups in multi- task learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2011,
task learning,” in Proc. 29th AAAI Conf. Artif. Intell., 2015, pp. 1755–1763.
pp. 2638–2644. [90] M. Yang, Y. Li, and Z. Zhang, “Multi-task learning with Gauss-
[64] A. Barzilai and K. Crammer, “Convex multi-task learning by ian matrix generalized inverse Gaussian model,” in Proc. Int.
clustering,” in Proc. Artif. Intell. Statist., 2015. Conf. Mach. Learn., 2013, pp. 423–431.
[65] Q. Zhou and Q. Zhao, “Flexible clustered multi-task learning by [91] Y. Zhang and D.-Y. Yeung, “Learning high-order task relation-
learning representative tasks,” IEEE Trans. Pattern Anal. Mach. ships in multi-task learning,” in Proc. 23rd Int. Joint Conf. Artif.
Intell., vol. 38, no. 2, pp. 266–278, Feb. 2016. Intell., 2013, pp. 1917–1923.
[66] A. Kumar and H. Daume III, “Learning task grouping and overlap [92] P. Rai, A. Kumar, and H. Daum e III, “Simultaneously
in multi-task learning,” in Proc. 29th Int. Conf. Mach. Learn., 2012. leveraging output and task structures for multiple-output
[67] Y. Yang and T. M. Hospedales, “Deep multi-task representation regression,” in Proc. 25th Int. Conf. Neural Inf. Process. Syst.,
learning: A tensor factorisation approach,” in Proc. Int. Conf. 2012, pp. 3185–3193.
Mach. Representations, 2017. [93] B. Rakitsch, C. Lippert, K. M. Borgwardt, and O. Stegle, “It is all
[68] J. Zhou, J. Chen, and J. Ye, “Clustered multi-task learning via in the noise: Efficient multi-task Gaussian process inference with
alternating structure optimization,” in Proc. Int. Conf. Neural Inf. structured residuals,” in Proc. 26th Int. Conf. Neural Inf. Process.
Process. Syst., 2011, Art. no. 702. Syst., 2013, pp. 1466–1474.
[69] T. Evgeniou and M. Pontil, “Regularized multi-task learning,” in [94] A. R. Gonçalves, F. J. V. Zuben, and A. Banerjee, “Multi-task
Proc. 10th ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, sparse structure learning with Gaussian copula models,” J. Mach.
2004, pp. 109–117. Learn. Res., vol. 7, pp. 1–30, 2016.
[70] S. Parameswaran and K. Q. Weinberger, “Large margin multi- [95] M. Long, Z. Cao, J. Wang, and P. S. Yu, “Learning multiple tasks
task metric learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., with multilinear relationship networks,” in Proc. 31st Int. Conf.
2010, pp. 1867–1875. Neural Inf. Process. Syst., 2017, pp. 1593–1602.
[71] T. Evgeniou, C. A. Micchelli, and M. Pontil, “Learning multiple [96] Y. Zhang, “Heterogeneous-neighborhood-based multi-task local
tasks with kernel methods,” J. Mach. Learn. Res., vol. 6, pp. 615–637, learning algorithms,” in Proc. Int. Conf. Neural Inf. Process. Syst.,
2005. 2013, pp. 1896–1904.
[97] G. Lee, E. Yang, and S. J. Hwang, “Asymmetric multi-task learn- [122] Q. Liu, X. Liao, and L. Carin, “Semi-supervised multitask
ing based on task relatedness and loss,” in Proc. Int. Conf. Mach. learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2007,
Learn., 2016, pp. 230–238. pp. 937–944.
[98] A. Jalali, P. Ravikumar, S. Sanghavi, and C. Ruan, “A dirty model [123] Q. Liu, X. Liao, H. Li, J. R. Stack, and L. Carin, “Semisupervised
for multi-task learning,” in Proc. 23rd Int. Conf. Neural Inf. Process. multitask learning,” IEEE Trans. Pattern Anal. Mach. Intell.,
Syst., 2010, pp. 964–972. vol. 31, no. 6, pp. 1074–1086, Jun. 2009.
[99] J. Chen, J. Liu, and J. Ye, “Learning incoherent sparse and low- [124] Y. Zhang and D. Yeung, “Semi-supervised multi-task
rank patterns from multiple tasks,” ACM Trans. Knowl. Discov. regression,” in Proc. Joint Eur. Conf. Mach. Learn. Knowl. Discov.
Data, vol. 5, 2010, Art. no. 22. Databases, 2009, pp. 617–631.
[100] J. Chen, J. Zhou, and J. Ye, “Integrating low-rank and group- [125] R. Reichart, K. Tomanek, U. Hahn, and A. Rappoport, “Multi-
sparse structures for robust multi-task learning,” in Proc. 17th task active learning for linguistic annotations,” in Proc. 46th
ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2011, Annu. Meeting Assoc. Comput. Linguistics, 2008, pp. 861–869.
pp. 42–50. [126] A. Acharya, R. J. Mooney, and J. Ghosh, “Active multitask learn-
[101] P. Gong, J. Ye, and C. Zhang, “Robust multi-task feature ing using both latent and supervised shared topics,” in Proc.
learning,” in Proc. 18th ACM SIGKDD Int. Conf. Knowl. Discov. SIAM Int. Conf. Data Mining, 2014, pp. 190–198.
Data Mining, 2012, pp. 895–903. [127] M. Fang and D. Tao, “Active multi-task learning via bandits,” in
[102] W. Zhong and J. T. Kwok, “Convex multitask learning with flexi- Proc. SIAM Int. Conf. Data Mining Soc. Ind. Appl. Math., 2015,
ble task clusters,” in Proc. 18th Int. Conf. Artif. Intell. Statist., 2012, pp. 505–513.
pp. 65–73. [128] H. Li, X. Liao, and L. Carin, “Active learning for semi-supervised
[103] A. Zweig and D. Weinshall, “Hierarchical regularization cascade multi-task learning,” in Proc. IEEE Int. Conf. Acoust. Speech Signal
for joint learning,” in Proc. Int. Conf. Mach. Learn., 2013, pp. 37–45. Process., 2009, pp. 1637–1640.
[104] L. Han and Y. Zhang, “Learning tree structure in multi-task [129] K. Lin and J. Zhou, “Interactive multi-task relationship learning,”
learning,” in Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. in Proc. EEE 16th Int. Conf. Data Mining, 2016, pp. 241–250.
Data Mining, 2015, pp. 397–406. [130] A. Pentina and C. H. Lampert, “Multi-task learning with labeled
[105] P. Jawanpuria and J. S. Nath, “A convex feature learning formu- and unlabeled tasks,” in Proc. Int. Conf. Mach. Learn., 2017,
lation for latent task structure discovery,” in Proc. 29th Int. Conf. pp. 2807–2816.
Mach. Learn., 2012, pp. 1531–1538. [131] J. Zhang and C. Zhang, “Multitask Bregman clustering,” in Proc.
[106] B. Gong, Y. Shi, F. Sha, and K. Grauman, “Geodesic flow kernel AAAI Conf. Artif. Intell., 2010, pp. 655–660.
for unsupervised domain adaptation,” in Proc. IEEE Conf. Com- [132] X. Zhang and X. Zhang, “Smart multi-task Bregman clustering
put. Vis. Pattern Recognit., 2012, pp. 2066–2073. and multi-task kernel clustering,” in Proc. AAAI Conf. Artif.
[107] M. Solnon, S. Arlot, and F. R. Bach, “Multi-task regression using Intell., 2013, pp. 1034–1040.
minimal penalties,” J. Mach. Learn. Res., vol. 13, pp. 2773–2812, 2012. [133] X. Zhang, X. Zhang, and H. Liu, “Smart multitask Bregman clus-
[108] Y. Zhang, Y. Wei, and Q. Yang, “Learning to multitask,” in Proc. tering and multitask kernel clustering,” ACM Trans. Knowl. Dis-
32nd Int. Conf. Neural Inf. Process. Syst., 2018, pp. 5776–5787. cov. Data, vol. 10, 2015, Art. no. 8.
[109] Y. Zhang and D.-Y. Yeung, “Multi-task learning in heteroge- [134] Q. Gu, Z. Li, and J. Han, “Learning a kernel for multi-task
neous feature spaces,” in Proc. 25th AAAI Conf. Artif. Intell., 2011, clustering,” in Proc. AAAI Conf. Artif. Intell., 2011, pp. 368–373.
pp. 574–579. [135] X. Zhang, “Convex discriminative multitask clustering,” IEEE
[110] S. Han, X. Liao, and L. Carin, “Cross-domain multitask learning Trans. Pattern Anal. Mach. Intell., vol. 37, no. 1, pp. 28–40, Jan.
with latent probit models,” in Proc. 29th Int. Conf. Mach. Learn., 2015.
2012, pp. 363–370. [136] Y. Wang, D. P. Wipf, Q. Ling, W. Chen, and I. J. Wassell, “Multi-
[111] P. Yang, K. Huang, and C. Liu, “Geometry preserving multi-task task learning for subspace segmentation,” in Proc. Int. Conf.
metric learning,” Mach. Learn., vol. 92, pp. 133–175, 2013. Mach. Learn., 2015, pp. 1209–1217.
[112] N. Quadrianto, A. J. Smola, T. S. Caetano, S. V. N. Vishwanathan, [137] X. Zhang, X. Zhang, and H. Liu, “Self-adapted multi-task
and J. Petterson, “Multitask learning without label corre- clustering,” in Proc. Int. Joint Conf. Artif. Intell., 2016,
spondences,” in Proc. 23rd Int. Conf. Neural Inf. Process. Syst., pp. 2357–2363.
2010, pp. 1957–1965. [138] Y. Yang, Z. Ma, Y. Yang, F. Nie, and H. T. Shen, “Multitask spec-
[113] B. Romera-Paredes, H. Aung, N. Bianchi-Berthouze, and M. Pon- tral clustering by exploring intertask correlation,” IEEE Trans.
til, “Multilinear multitask learning,” in Proc. 30th Int. Conf. Mach. Cybern., vol. 45, no. 5, pp. 1083–1094, May 2015.
Learn., 2013, pp. 1444–1452. [139] X. Zhu, X. Li, S. Zhang, C. Ju, and X. Wu, “Robust joint graph
[114] K. Wimalawarne, M. Sugiyama, and R. Tomioka, “Multitask sparse coding for unsupervised spectral feature selection,” IEEE
learning meets tensor factorization: Task imputation via convex Trans. Neural Netw. Learn. Syst., vol. 28, no. 6, pp. 1263–1275,
optimization,” in Proc. 27th Int. Conf. Neural Inf. Process. Syst., Jun. 2017.
2014, pp. 2825–2833. [140] X. Zhang, X. Zhang, H. Liu, and J. Luo, “Multi-task clustering
[115] O. Sener and V. Koltun, “Multi-task learning as multi-objective with model relation learning,” in Proc. Int. Joint Conf. Artif. Intell.,
optimization,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2018, 2018, pp. 3132–3140.
pp. 525–536. [141] A. Wilson, A. Fern, S. Ray, and P. Tadepalli, “Multi-task rein-
[116] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn, forcement learning: A hierarchical Bayesian approach,” in Proc.
“Gradient surgery for multi-task learning,” in Proc. Int. Conf. Int. Conf. Mach. Learn., 2007, pp. 1015–1022.
Neural Inf. Process. Syst., 2020. [142] H. Li, X. Liao, and L. Carin, “Multi-task reinforcement learning
[117] Z. Chen, V. Badrinarayanan, C. Lee, and A. Rabinovich, in partially observable stochastic environments,” J. Mach. Learn.
“Gradnorm: Gradient normalization for adaptive loss balancing Res., vol. 10, pp. 1131–1186, 2009.
in deep multitask networks,” in Proc. Int. Conf. Mach. Learn., [143] A. Lazaric and M. Ghavamzadeh, “Bayesian multi-task rein-
2018, pp. 794–803. forcement learning,” in Proc. Int. Conf. Mach. Learn., 2010,
[118] N. Parikh and S. P. Boyd, “Proximal algorithms,” Founds. Trends pp. 599–606.
Optim., vol. 1, no. 3, pp. 127–239, 2014. [144] D. Calandriello, A. Lazaric, and M. Restelli, “Sparse multi-task
[119] L. Zhao, Q. Sun, J. Ye, F. Chen, C. Lu, and N. Ramakrishnan, reinforcement learning,” in Proc. Int. conf. Neural Inf. Process.
“Multi-task learning for spatio-temporal event forecasting,” in Syst., 2014, pp. 5–20.
Proc. 21st ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, [145] J. Andreas, D. Klein, and S. Levine, “Modular multitask rein-
2015, pp. 1503–1512. forcement learning with policy sketches,” in Proc. Int. Conf.
[120] Y. Li, J. Wang, J. Ye, and C. K. Reddy, “A multi-task learning for- Mach. Learn., 2017, pp. 166–175.
mulation for survival analysis,” in Proc. 22nd ACM SIGKDD Int. [146] A. A. Deshmukh, U. € Dogan, and C. Scott, “Multi-task learning
Conf. Knowl. Discov. Data Mining, 2016, pp. 1715–1724. for contextual bandits,” in Proc. Int. Conf. Neural Inf. Process.
[121] L. Zhao, Q. Sun, J. Ye, F. Chen, C. Lu, and N. Ramakrishnan, Syst., 2017, pp. 4851–4859.
“Feature constrained multi-task learning models for spatiotem- [147] A. M. Saxe, A. C. Earle, and B. Rosman, “Hierarchy through com-
poral event forecasting,” IEEE Trans. Knowl. Data Eng., vol. 29, position with multitask LMDPs,” in Proc. Int. Conf. Mach. Learn.,
no. 5, pp. 1059–1072, May 2017. 2017, pp. 3017–3026.
[148] T. Br€am, G. Brunner, O. Richter, and R. Wattenhofer, “Attentive [174] K. Lin, J. Xu, I. M. Baytas, S. Ji, and J. Zhou, “Multi-task feature
multi-task deep reinforcement learning,” in Proc. Joint Eur. Conf. interaction learning,” in Proc. ACM SIGKDD Int. Conf. Knowl.
Mach. Learn. Knowl. Discov. Databases, 2019, 134–149. Discov. Data Mining, 2016, pp. 1735–1744.
[149] T. Vuong et al., “Sharing experience in multitask reinforcement [175] Y. Zhang, “Parallel multi-task learning,” in Proc. IEEE Int. Conf.
learning,” in Proc. Int. Joint Conf. Artif. Intell., 2019, pp. 3642–3648. Data Mining, 2015, pp. 629–638.
[150] M. Igl et al., “Multitask soft option learning,” in Proc. Conf. Uncer- [176] O. Dekel, P. M. Long, and Y. Singer, “Online multitask learning,”
tainty Artif. Intell., 2020, pp. 969–978. in Proc. Int. Conf. Comput. Learn. Theory, 2006, pp. 453–467.
[151] E. Parisotto, J. Ba, and R. Salakhutdinov, “Actor-mimic: Deep [177] O. Dekel, P. M. Long, and Y. Singer, “Online learning of mul-
multitask and transfer reinforcement learning,” in Proc. Int. Conf. tiple tasks with a shared loss,” J. Mach. Learn. Res., vol. 8,
Learn. Representations, 2016. pp. 2233–2264, 2007.
[152] A. A. Rusu et al., “Policy distillation,” in Proc. Int. Conf. Learn. [178] G. Lugosi, O. Papaspiliopoulos, and G. Stoltz, “Online multi-task
Representations, 2016. learning with hard constraints,” in Proc. 22nd Conf. Learn. Theory,
[153] S. Omidshafiei, J. Pazis, C. Amato, J. P. How, and J. Vian, “Deep 2009.
decentralized multi-task multi-agent reinforcement learning [179] G. Cavallanti, N. Cesa-Bianchi, and C. Gentile, “Linear algo-
under partial observability,” in Proc. Int. Conf. Mach. Learn., 2017, rithms for online multitask classification,” J. Mach. Learn. Res.,
pp. 2681–2690. vol. 11, pp. 2901–2934, 2010.
[154] Y. W. Teh et al., “Distral: Robust multitask reinforcement [180] G. Pillonetto, F. Dinuzzo, and G. D. Nicolao, “Bayesian online
learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, multitask learning of Gaussian processes,” IEEE Trans. Pattern
pp. 4499–4509. Anal. Mach. Intell., vol. 32, no. 2, pp. 193–205, Feb. 2010.
[155] S. E. Bsat, H. Bou-Ammar, and M. E. Taylor, “Scalable multitask [181] A. Saha, P. Rai, H. Daume, III, and S. Venkatasubramanian,
policy gradient reinforcement learning,” in Proc. AAAI Conf. “Online learning of multiple tasks and their relationships,” in
Artif. Intell., 2017, pp. 1847–1853. Proc. Int. Conf. Artif. Intell. Statist., 2011, pp. 643–651.
[156] S. Sharma and B. Ravindran, “Online multi-task learning using [182] K. Murugesan, H. Liu, J. G. Carbonell, and Y. Yang, “Adaptive
active sampling,” in Proc. Int. Conf. Learn. Representations Work- smoothed online multi-task learning,” in Proc. Int. Conf. Neural
shop, 2017. Inf. Process. Syst., 2016, pp. 4303–4311.
[157] S. Sharma, A. K. Jha, P. Hegde, and B. Ravindran, “Learning to [183] P. Yang, P. Zhao, and X. Gao, “Robust online multi-task learning
multi-task by active sampling,” in Proc. Int. Conf. Learn. Represen- with correlative and personalized structures,” IEEE Trans. Knowl.
tations, 2018. Data Eng., vol. 29, no. 11, pp. 2510–2521, Nov. 2017.
[158] L. Espeholt et al., “IMPALA: Scalable distributed deep-RL with [184] S. Hao, P. Zhao, Y. Liu, S. C. H. Hoi, and C. Miao, “Online multi-
importance weighted actor-learner architectures,” in Proc. Int. task relative similarity learning,” in Proc. Int. Joint Conf. Artif.
Conf. Mach. Learn., 2018, pp. 1406–1415. Intell., 2017, pp. 1823–1829.
[159] R. Tutunov, D. Kim, and H. Bou-Ammar, “Distributed multitask [185] P. Yang, P. Zhao, J. Zhou, and X. Gao, “Confidence weighted
reinforcement learning with quadratic convergence,” in Proc. Int. multitask learning,” in Proc. AAAI Conf. Artif. Intell., 2019,
Conf. Neural Inf. Process. Syst., 2018, pp. 8921–8930.
[160] X. Lin, H. S. Baweja, G. Kantor, and D. Held, “Adaptive auxiliary task weighting for reinforcement learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019, pp. 4773–4784.
[161] C. D’Eramo, D. Tateo, A. Bonarini, M. Restelli, and J. Peters, “Sharing knowledge in multi-task deep reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2020, pp. 1–18.
[162] T. Shu, C. Xiong, and R. Socher, “Hierarchical and interpretable skill acquisition in multi-task reinforcement learning,” in Proc. Int. Conf. Learn. Representations, 2018.
[163] M. Hessel, H. Soyer, L. Espeholt, W. Czarnecki, S. Schmitt, and H. van Hasselt, “Multi-task deep reinforcement learning with PopArt,” in Proc. AAAI Conf. Artif. Intell., 2019, pp. 3796–3803.
[164] Z. Guo et al., “Bootstrap latent-predictive representations for multitask reinforcement learning,” in Proc. Int. Conf. Mach. Learn., 2020, pp. 3875–3886.
[165] J. He and R. Lawrence, “A graph-based framework for multi-task multi-view learning,” in Proc. Int. Conf. Mach. Learn., 2011, pp. 25–32.
[166] J. Zhang and J. Huan, “Inductive multi-task learning with multiple view data,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2012, pp. 543–551.
[167] X. Zhang, X. Zhang, and H. Liu, “Multi-task multi-view clustering for non-negative data,” in Proc. Int. Joint Conf. Artif. Intell., 2015, pp. 4055–4061.
[168] X. Zhang, X. Zhang, H. Liu, and X. Liu, “Multi-task multi-view clustering,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 12, pp. 3324–3338, Dec. 2016.
[169] X. Zhu, X. Li, and S. Zhang, “Block-row sparse multiview multi-label learning for image classification,” IEEE Trans. Cybern., vol. 46, no. 2, pp. 450–461, Feb. 2016.
[170] L. Zheng, Y. Cheng, and J. He, “Deep multimodality model for multi-task multi-view learning,” in Proc. SIAM Int. Conf. Data Mining, 2019, pp. 10–16.
[171] A. Niculescu-Mizil and R. Caruana, “Inductive transfer for Bayesian network structure learning,” in Proc. Mach. Learn., 2012, pp. 167–180.
[172] J. Honorio and D. Samaras, “Multi-task learning of Gaussian graphical models,” in Proc. Int. Conf. Mach. Learn., 2010, pp. 447–454.
[173] D. Oyen and T. Lane, “Leveraging domain knowledge in multi-task Bayesian network structure learning,” in Proc. AAAI Conf. Artif. Intell., 2012, pp. 1091–1097.
pp. 5636–5643.
[186] J. Wang, M. Kolar, and N. Srebro, “Distributed multi-task learning,” in Proc. Int. Conf. Artif. Intell. Statist., 2016, pp. 751–760.
[187] S. Liu, S. J. Pan, and Q. Ho, “Distributed multi-task relationship learning,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017, pp. 937–946.
[188] L. Xie, I. M. Baytas, K. Lin, and J. Zhou, “Privacy-preserving distributed multi-task learning with asynchronous updates,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2017, pp. 1195–1204.
[189] V. Smith, C. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2017, pp. 4427–437.
[190] C. Zhang et al., “Distributed multi-task classification: A decentralized online learning approach,” Mach. Learn., vol. 107, pp. 727–747, 2018.
[191] K. Q. Weinberger, A. Dasgupta, J. Langford, A. J. Smola, and J. Attenberg, “Feature hashing for large scale multitask learning,” in Proc. Annu. Int. Conf. Mach. Learn., 2009, pp. 1113–1120.
[192] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via multi-task sparse learning,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2012, pp. 2042–2049.
[193] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja, “Robust visual tracking via structured multi-task sparse learning,” Int. J. Comput. Vis., vol. 101, pp. 367–383, 2013.
[194] Q. Xu, S. J. Pan, H. H. Xue, and Q. Yang, “Multitask learning for protein subcellular location prediction,” IEEE/ACM Trans. Comput. Biol. Bioinf., vol. 8, no. 3, pp. 748–759, May/Jun. 2011.
[195] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 4460–4464.
[196] Q. Hu, Z. Wu, K. Richmond, J. Yamagishi, Y. Stylianou, and R. Maia, “Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning,” in Proc. InterSpeech, 2015, pp. 854–858.
[197] J. Bai et al., “Multi-task learning for learning to rank in web search,” in Proc. Int. ACM Conf. Inf. Knowl. Manage., 2009, pp. 1549–1552.
[198] J. Ghosn and Y. Bengio, “Multi-task learning for stock selection,” in Proc. Int. Conf. Neural Inf. Process. Syst., 1996, pp. 946–952.
[199] C. Yuan, W. Hu, G. Tian, S. Yang, and H. Wang, “Multi-task sparse learning with Beta process prior for action recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2013, pp. 423–429.
[200] Y. Qi, O. Tastan, J. G. Carbonell, J. Klein-Seetharaman, and J. Weston, “Semi-supervised multi-task learning for predicting interactions between HIV-1 and human proteins,” Bioinformatics, vol. 26, pp. i645–i652, 2010.
[201] P. Bell and S. Renals, “Regularization of context-dependent deep neural networks with context-independent multi-task training,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process., 2015, pp. 4290–4294.
[202] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Proc. InterSpeech, 2015, pp. 3274–3278.
[203] V. W. Zheng, S. J. Pan, Q. Yang, and J. J. Pan, “Transferring multi-device localization models using latent multi-task learning,” in Proc. AAAI Conf. Artif. Intell., 2008, pp. 1427–1432.
[204] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 160–167.
[205] M. Lapin, B. Schiele, and M. Hein, “Scalable multitask representation learning for scene classification,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2014, pp. 1343–1441.
[206] A. H. Abdulnabi, G. Wang, J. Lu, and K. Jia, “Multi-task CNN model for attribute prediction,” IEEE Trans. Multimedia, vol. 17, no. 11, pp. 1949–1959, Nov. 2015.
[207] M. Luong, Q. V. Le, I. Sutskever, O. Vinyals, and L. Kaiser, “Multi-task sequence to sequence learning,” in Proc. Int. Conf. Learn. Representations, 2016.
[208] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim, “Rotating your face using multi-task deep neural network,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2015, pp. 676–684.
[209] X. Chu, W. Ouyang, W. Yang, and X. Wang, “Multi-task recurrent neural network for immediacy prediction,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 142–149.
[210] X. Wang, C. Zhang, and Z. Zhang, “Boosted multi-task learning for face verification with applications to web image and video search,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2009, pp. 142–149.
[211] X. Yuan and S. Yan, “Visual classification with multi-task joint sparse representation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3493–3500.
[212] Q. Liu, Q. Xu, V. W. Zheng, H. Xue, Z. Cao, and Q. Yang, “Multi-task learning for cross-platform siRNA efficacy prediction: An in-silico study,” BMC Bioinf., vol. 10, 2010, Art. no. 181.
[213] A. Ahmed, M. Aly, A. Das, A. J. Smola, and T. Anastasakos, “Web-scale multi-task feature selection for behavioral targeting,” in Proc. ACM Int. Conf. Inf. Knowl. Manage., 2012, pp. 1737–1741.
[214] H. Wang et al., “Sparse multi-task regression and feature selection to identify brain imaging predictors for memory performance,” in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 557–562.
[215] K. Puniyani, S. Kim, and E. P. Xing, “Multi-population GWA mapping via multi-task regularized regression,” Bioinformatics, vol. 26, pp. i208–i216, 2010.
[216] J. Zhou, L. Yuan, J. Liu, and J. Ye, “A multi-task learning formulation for predicting disease progression,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2011, pp. 814–822.
[217] J. Wan et al., “Sparse Bayesian multi-task learning for predicting cognitive outcomes from neuroimaging measures in Alzheimer’s disease,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2012, pp. 940–947.
[218] D. He, D. Kuhn, and L. Parida, “Novel applications of multitask learning and multiple output regression to multiple genetic trait prediction,” Bioinformatics, vol. 32, no. 12, pp. 37–43, 2016.
[219] K. Zhang, J. W. Gray, and B. Parvin, “Sparse multitask regression for identifying common mechanism of response to therapeutic targets,” Bioinformatics, vol. 26, pp. 97–105, 2010.
[220] B. Cheng, G. Liu, J. Wang, Z. Huang, and S. Yan, “Multi-task low-rank affinity pursuit for image segmentation,” in Proc. Int. Conf. Comput. Vis., 2011, pp. 2439–2446.
[221] J. Xu, P. Tan, L. Luo, and J. Zhou, “GSpartan: A geospatio-temporal multi-task learning framework for multi-location prediction,” in Proc. SIAM Int. Conf. Data Mining, 2016, pp. 657–665.
[222] C. Lang, G. Liu, J. Yu, and S. Yan, “Saliency detection by multi-task sparsity pursuit,” IEEE Trans. Image Process., vol. 21, no. 3, pp. 1327–1338, Mar. 2012.
[223] H. Wang et al., “High-order multi-task feature learning to identify longitudinal phenotypic markers for Alzheimer’s disease progression prediction,” in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1277–1285.
[224] Q. An, C. Wang, I. Shterev, E. Wang, L. Carin, and D. B. Dunson, “Hierarchical kernel stick-breaking process for multi-task image analysis,” in Proc. Int. Conf. Mach. Learn., 2008, pp. 17–24.
[225] M. Alamgir, M. Grosse-Wentrup, and Y. Altun, “Multitask learning for brain-computer interfaces,” in Proc. Mach. Learn. Res., 2010, pp. 17–24.
[226] Y. Zhang and D.-Y. Yeung, “Multi-task warped Gaussian process for personalized age estimation,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 2622–2629.
[227] J. Xu, J. Zhou, and P. Tan, “FORMULA: FactORized MUlti-task LeArning for task discovery in personalized medical models,” in Proc. SIAM Int. Conf. Data Mining, 2015, pp. 496–504.
[228] T. R. Almaev, B. Martínez, and M. F. Valstar, “Learning to transfer: Transferring latent task structures and its application to person-specific facial action unit detection,” in Proc. IEEE Int. Conf. Comput. Vis., 2015, pp. 3774–3782.
[229] A. Liu, Y. Su, W. Nie, and M. S. Kankanhalli, “Hierarchical clustering multi-task learning for joint human action grouping and recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 1, pp. 102–114, Jan. 2017.
[230] F. Wu and Y. Huang, “Collaborative multi-domain sentiment classification,” in Proc. IEEE Int. Conf. Data Mining, 2015, pp. 459–468.
[231] O. Chapelle, P. K. Shivaswamy, S. Vadrevu, K. Q. Weinberger, Y. Zhang, and B. L. Tseng, “Multi-task learning for boosting with application to web search ranking,” in Proc. ACM SIGKDD Int. Conf. Knowl. Discov. Data Mining, 2010, pp. 1189–1198.
[232] C. Widmer, J. Leiva, Y. Altun, and G. Rätsch, “Leveraging sequence classification by taxonomy-based multitask learning,” in Proc. Annu. Int. Conf. Res. Comput. Mol. Biol., 2010, pp. 522–534.
[233] Y. Zhang, B. Cao, and D.-Y. Yeung, “Multi-domain collaborative filtering,” in Proc. Conf. Uncertainty Artif. Intell., 2010, pp. 725–732.
[234] K. M. A. Chai, C. K. I. Williams, S. Klanke, and S. Vijayakumar, “Multi-task Gaussian process learning of robot inverse dynamics,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2008, pp. 265–272.
[235] D.-Y. Yeung and Y. Zhang, “Learning inverse dynamics by Gaussian process regression under the multi-task learning framework,” in The Path to Autonomous Robots. Boston, MA, USA: Springer, 2009, pp. 131–142.
[236] C. Widmer, N. C. Toussaint, Y. Altun, and G. Rätsch, “Inferring latent task structure for multitask learning by multiple kernel learning,” BMC Bioinf., vol. 11, 2010, Art. no. S5.
[237] A. Ahmed, A. Das, and A. J. Smola, “Scalable hierarchical multi-task learning algorithms for conversion optimization in display advertising,” in Proc. ACM Int. Conf. Web Search Data Mining, 2014, pp. 153–162.
[238] J. Zheng and L. M. Ni, “Time-dependent trajectory regression on road networks via multi-task learning,” in Proc. AAAI Conf. Artif. Intell., 2013, pp. 1048–1055.
[239] X. Lu, Y. Wang, X. Zhou, Z. Zhang, and Z. Ling, “Traffic sign recognition via multi-modal tree-structure embedded multi-task learning,” IEEE Trans. Intell. Transp. Syst., vol. 18, no. 4, pp. 960–972, Apr. 2017.
[240] F. Mordelet and J. Vert, “ProDiGe: Prioritization of disease genes with multitask machine learning from positive and unlabeled examples,” BMC Bioinf., vol. 12, 2011, Art. no. 389.
[241] J. Xu, P. Tan, J. Zhou, and L. Luo, “Online multi-task learning framework for ensemble forecasting,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 6, pp. 1268–1280, Jun. 2017.
[242] M. Kshirsagar, J. G. Carbonell, and J. Klein-Seetharaman, “Multitask learning for host-pathogen protein interactions,” Bioinformatics, vol. 29, pp. 217–226, 2013.
[243] L. Han, L. Li, F. Wen, L. Zhong, T. Zhang, and X. Wan, “Graph-guided multi-task sparse learning model: A method for identifying antigenic variants of influenza A(H3N2) virus,” Bioinformatics, vol. 35, pp. 77–87, 2019.
[244] Z. Hong, X. Mei, D. V. Prokhorov, and D. Tao, “Tracking via robust multi-task multi-view joint sparse representation,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 649–656.
[245] Y. Yan, E. Ricci, S. Ramanathan, O. Lanz, and N. Sebe, “No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion,” in Proc. IEEE Int. Conf. Comput. Vis., 2013, pp. 1177–1184.
[246] M. Kshirsagar, K. Murugesan, J. G. Carbonell, and J. Klein-Seetharaman, “Multitask matrix completion for learning protein interactions across diseases,” J. Comput. Biol., vol. 24, pp. 501–514, 2017.
[247] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao, “Multi-task learning with low rank attribute embedding for person re-identification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 5, pp. 1167–1181, May 2018.
[248] Y. Zhang and Q. Yang, “A survey on multi-task learning,” 2017, arXiv:1707.08114.
[249] J. Baxter, “Learning internal representations,” in Proc. Int. Conf. Comput. Learn. Theory, 1995, pp. 311–320.
[250] J. Baxter, “A model of inductive bias learning,” J. Artif. Intell. Res., vol. 12, pp. 149–198, 2000.
[251] A. Maurer, “Bounds for linear multi-task learning,” J. Mach. Learn. Res., vol. 7, pp. 117–139, 2006.
[252] A. Maurer, “The Rademacher complexity of linear transformation classes,” in Proc. Int. Conf. Comput. Learn. Theory, 2006, pp. 65–78.
[253] B. Juba, “Estimating relatedness via data compression,” in Proc. Int. Conf. Mach. Learn., 2006, pp. 441–448.
[254] S. Ben-David and R. S. Borbely, “A notion of task relatedness yielding provable multiple-task learning guarantees,” Mach. Learn., vol. 73, pp. 273–287, 2008.
[255] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari, “Regularization techniques for learning with matrices,” J. Mach. Learn. Res., vol. 13, pp. 1865–1890, 2012.
[256] M. Pontil and A. Maurer, “Excess risk bounds for multitask learning with trace norm regularization,” in Proc. Annu. Conf. Learn. Theory, 2013, pp. 55–76.
[257] A. Pentina and S. Ben-David, “Multi-task and lifelong learning of kernels,” in Proc. Int. Conf. Algorithmic Learn. Theory, 2015, pp. 194–208.
[258] Y. Zhang, “Multi-task learning and algorithmic stability,” in Proc. AAAI Conf. Artif. Intell., 2015, pp. 3181–3187.
[259] A. Maurer, M. Pontil, and B. Romera-Paredes, “The benefit of multitask representation learning,” J. Mach. Learn. Res., vol. 17, pp. 1–32, 2016.
[260] N. Yousefi, Y. Lei, M. Kloft, M. Mollaghasemi, and G. Anagnostopoulos, “Local Rademacher complexity-based learning guarantees for multi-task learning,” J. Mach. Learn. Res., vol. 19, pp. 1–47, 2018.
[261] A. Argyriou, C. A. Micchelli, and M. Pontil, “When is there a representer theorem? Vector versus matrix regularizers,” J. Mach. Learn. Res., vol. 10, pp. 2507–2529, 2009.
[262] A. Argyriou, C. A. Micchelli, and M. Pontil, “On spectral learning,” J. Mach. Learn. Res., vol. 11, pp. 935–953, 2010.
[263] K. Lounici, M. Pontil, A. B. Tsybakov, and S. A. van de Geer, “Taking advantage of sparsity in multi-task learning,” in Proc. Annu. Conf. Learn. Theory, 2009.
[264] G. Obozinski, M. J. Wainwright, and M. I. Jordan, “Support union recovery in high-dimensional multivariate regression,” Ann. Statist., vol. 39, pp. 1–47, 2011.
[265] M. Kolar, J. D. Lafferty, and L. A. Wasserman, “Union support recovery in multi-task learning,” J. Mach. Learn. Res., vol. 12, pp. 2415–2435, 2011.

Yu Zhang (Member, IEEE) is currently an associate professor at the Department of Computer Science and Engineering, Southern University of Science and Technology. He has authored or coauthored the book Transfer Learning and about 70 papers in top-tier conferences and journals. His research interests include artificial intelligence and machine learning, especially multi-task learning, transfer learning, dimensionality reduction, metric learning, and semi-supervised learning. He is a reviewer for various journals and serves as an area chair or senior program committee member for several top-tier conferences. He was the recipient of the best paper awards at UAI 2010 and PAKDD 2019 and the Best Student Paper Award at WI 2013.

Qiang Yang (Fellow, IEEE) is currently the chief artificial intelligence officer at WeBank and the chair professor at the CSE Department, Hong Kong University of Science and Technology. He has authored or coauthored five books, including Transfer Learning and Federated Learning. His research interests include transfer learning and federated learning. He is the founding EiC of two journals: IEEE Transactions on Big Data and ACM Transactions on Intelligent Systems and Technology. He is the conference chair of AAAI-21, the president of the Hong Kong Society of Artificial Intelligence and Robotics and of the Investment Technology League, and a former president of IJCAI from 2017 to 2019. He is also a fellow of the AAAI, ACM, and AAAS.