0% found this document useful (0 votes)
10 views12 pages

2024 - A Survey On Kernel-Based Multi-Task Learning - Neurocomputing

This paper surveys kernel-based multi-task learning (MTL) methods, highlighting their advantages over deep learning approaches, particularly for small to medium datasets. It categorizes these methods into three strategies: feature-based, regularization-based, and combination-based, while also linking them to foundational machine learning concepts. The review includes a discussion on commonly used datasets and real-world applications of kernel-based MTL models.

Uploaded by

yangkunkuo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
10 views12 pages

2024 - A Survey On Kernel-Based Multi-Task Learning - Neurocomputing

This paper surveys kernel-based multi-task learning (MTL) methods, highlighting their advantages over deep learning approaches, particularly for small to medium datasets. It categorizes these methods into three strategies: feature-based, regularization-based, and combination-based, while also linking them to foundational machine learning concepts. The review includes a discussion on commonly used datasets and real-world applications of kernel-based MTL models.

Uploaded by

yangkunkuo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 12

Neurocomputing 577 (2024) 127255

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

A survey on kernel-based multi-task learning


Carlos Ruiz a ,∗, Carlos M. Alaíz a , José R. Dorronsoro a,b
a
Dpto. Ing. Informatica, Universidad Autonoma de Madrid, Calle Francisco Tomas y Valiente, 11, Madrid, 28049, Spain
b
Instituto de Ingenieria del Conocimiento, Calle Francisco Tomas y Valiente, 11, Madrid, 28049, Spain

ARTICLE INFO ABSTRACT

Keywords: Multi-Task Learning (MTL) seeks to leverage the learning process of several tasks by solving them simulta-
Multi-task learning neously to arrive at better models. This advantage is obtained by coupling the tasks together so that paths
Support vector machines to share information among them are created. While Deep learning models have successfully been applied to
Gaussian processes
MTL in different fields, the performance of deep approaches often depends on using large amounts of data
Bias learning
to fit complex models with many parameters, something which may not be always feasible or, simply, they
Learning to learn
may lack some advantages that other approaches have. Kernel methods, such as Support Vector Machines or
Gaussian Processes, offer characteristics such as a better generalization ability or the availability of uncertainty
estimations, that may make them more suitable for small to medium size datasets. As a consequence, kernel-
based MTL methods stand out among these alternative approaches to deep models and there also exists a rich
literature on them. In this paper we review these kernel-based multi-task approaches, group them according
to a taxonomy we propose, link some of them to foundational work on machine learning, and comment on
datasets commonly used in their study and on relevant applications that use them.

1. Introduction such as Support Vector Machines (SVMs) or Gaussian Processes (GPs),


present some characteristics that can be advantageous in learning, such
In Machine Learning (ML), we typically try to minimize some loss as convexity, a better generalization ability or, in the case of GPs,
metric that measures model performance on a single task. In particular, the possibility of giving uncertainty estimates. Moreover, there exists
given a data sample and once we have chosen a concrete model, a substantial recent literature on MTL with kernel methods, which
we use that metric to select its hyperparameters and optimize model explores several approaches to incorporate data from different tasks
coefficients to achieve a minimal sample-dependent error. Multi-Task
to improve the learning process but, to the best of our knowledge, no
Learning (MTL), however, aims at solving simultaneously different but
recent review has considered these advances.
related tasks through some form of task coupling that leverages the
Therefore, the goal of this paper is to present these methods and
overall learning process and results in better final models; we will give
a more precise definition in Section 2. This goal requires selecting techniques in a comprehensive way that also considers their relation
which tasks should be learned together, that is, defining the base with other fundamental issues on learning. To do so, we take as a
MTL problem, and, also, designing algorithms that can benefit from starting point the MTL taxonomy in [6], which may be rather wide
the presence of different tasks. One of the main initial motivations of for our goals, and adapt it to the more specific situation of kernel-
MTL [1,2] was data scarcity, and by combining data from different based MTL, to arrive at a new taxonomy which, we believe, is better
sources this issue could be solved. However, other benefits can be adjusted to our models of interest here and where we will consider
extracted from MTL, such as bias mitigation, domain adaptation or the three basic MTL model approaches: feature-based, regularization-based
avoidance of overfitting. and combination-based strategies.
It can be said that, given the great recent success of Deep Neural The paper is organized as follows. First, in Section 2 we introduce
Networks (DNNs), most recent MTL research uses them; see for exam- MTL from a general point of view, discuss different types of MTL
ple [3,4]. A recent overview of DNN-based MTL, due to S. Ruder, can be
problems and define each one of the three strategies in which our
found in [5]; we also refer the reader to the paper [6] by Y. Zhang and
taxonomy divides kernel based MTL work. Next, in Section 3 we give
Q. Yang for a broader overview of MTL under several ML paradigms
a brief summary of SVMs and GPs, the key kernel approaches to the
which also include online, parallel and distributed approaches for MTL.
MTL models we deal with here. In Section 4, the main one, we give
Even taking this into account, it is also true that kernel methods,

∗ Corresponding author.
E-mail address: [email protected] (C. Ruiz).

https://doi.org/10.1016/j.neucom.2024.127255
Received 28 February 2023; Received in revised form 14 November 2023; Accepted 8 January 2024
Available online 10 January 2024
0925-2312/© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
C. Ruiz et al. Neurocomputing 577 (2024) 127255

a wide survey of MTL with kernel methods, which we will further where 𝓁(⋅, ⋅) is a loss function. Analogously to the standard, single-task
divide according to our taxonomy into three subsections, one for each case, the goal is to find the 𝑇 hypotheses that minimize this expected
of the strategies above. After this, and not only for completeness but risk, that is,
also for its influence in kernel-based learning, we add in Section 5
𝒉∗𝑷 = arg min 𝑅𝑷 (𝒉).
a summary of important theoretical work on ML that has influenced 𝒉∈ 𝑇
several developments in MTL with kernels. Finally, in Section 6 we
However, the distributions 𝑃1 , … , 𝑃𝑇 are unknown and instead we have
describe some of the most commonly used problems in the research
the following MTL set sampled from  × :
literature on kernel-based MTL, offer pointers to other problems where
kernel-based MTL has been successful and, finally, review some recent ⋃
𝑇 ⋃
𝑇
{ }
𝑟
and interesting examples of real world applications where kernel MTL 𝐷𝑛 = 𝐷𝑚 = (𝑥𝑟𝑖 , 𝑦𝑟𝑖 ) ∼ 𝑃𝑟 (𝑥, 𝑦), 𝑖 = 1, … , 𝑚𝑟 , (2)
𝑟
𝑟=1 𝑟=1
models are used.

where 𝑚𝑟 is the number of samples from task 𝑟 and 𝑛 = 𝑇𝑟=1 𝑚𝑟 is the
2. What is multi-task learning? total number of instances. With these definitions, we can express the
MTL empirical risk as
Different definitions have been given for Multi-Task Learning (MTL) 𝑚𝑟

𝑇 ∑𝑇
1 ∑
but, in broad terms, and according to R. Caruana [1], MTL’s main 𝑅̂ 𝐷𝑛 (𝒉) = 𝑅̂ 𝐷𝑚𝑟 (ℎ𝑟 ) = 𝓁(ℎ𝑟 (𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ). (3)
goal is to improve generalization performance through an inductive 𝑟=1
𝑟 𝑚
𝑟=1 𝑟 𝑖=1
transfer mechanism which leverages the domain-specific information Then, instead of directly minimizing the expected risk (1), we try to
contained in the training data of related problems. This is achieved find the hypotheses that minimize the empirical risk (3), i.e.
learning these tasks in parallel while using some kind of common
representation. In this work we will only consider supervised problems, 𝒉∗𝐷 = arg min 𝑅̂ 𝐷𝑛 (𝒉),
𝑛
𝒉∈ 𝑇
and we will omit the discussion about unsupervised ones. Following
the formulation of [2], in supervised problems we have for each task but sometimes also an unweighted alternative minimization principle
𝑟 = 1, … , 𝑇 , an input space 𝑟 and an output space 𝑟 . First, we can is used, where we select the hypotheses as
classify MTL problems into which we may call either ‘‘homogeneous’’ ∑ 𝑚𝑟
𝑇 ∑
or ‘‘heterogeneous’’ problems. In homogeneous problems all tasks are 𝒉∗𝐷 = arg min 𝓁(ℎ𝑟 (𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ).
𝑛
sampled from the same space  ×, and for each task 𝑟 = 1, … , 𝑇 , there 𝒉∈ 𝑇 𝑟=1 𝑖=1
is a different distribution 𝑃𝑟 (𝑥, 𝑦) over this product space. Therefore, Both minimization principles are valid options, the first avoid the
in these problems all tasks have the same number of features and the risk being dominated by the largest tasks, while with the second
same target space. In heterogeneous problems each task may have a one puts more emphasis in the tasks where we have more train-
different input and output space, 𝑟 and 𝑟 , and a different distribution ing data. In any case, we can observe that the minimization can
𝑃𝑟 (𝑥, 𝑦) for each space 𝑟 × 𝑟 ; therefore, a mix of classification and be done independently in each task selecting the hypotheses ℎ𝑟 =
∑𝑚𝑟
regression problems could be considered. Anyway, in this work we will arg minℎ∈ 𝑖=1 𝓁(ℎ𝑟 (𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ); this is the Independent Task Learning
restrict ourselves to homogeneous problems. In particular, depending (ITL) approach. Another approach is the Common Task Learning (CTL)
on the nature of the tasks, we can consider in turn the following types approximation, which ignores task information completely; in other
of homogeneous problems: words, we consider a single common hypothesis ℎ for all tasks and we
∑ ∑𝑚𝑟
select ℎ∗ = arg minℎ∈ 𝑇𝑟=1 𝑚1 𝑖=1 𝓁(ℎ(𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ).
• Direct MTL problems: each task 𝑟 = 1, … , 𝑇 , has been sampled 𝑟
The CTL and ITL approaches are opposite extremes, but the inter-
with a different distribution 𝑃𝑟 (𝑥, 𝑦) over  × ; thus, we have
esting part lies in between, when we want to consider task-specific
different samples (𝑋𝑟 , 𝒚 𝑟 ), where 𝑋𝑟 is the feature matrix and 𝒚 𝑟
hypotheses, but we also want to couple them so learning the tasks
the target vector. Here we can find multi-task (MT) single target
together yields an advantage. According to [1], MTL improves the
regression, where  = R, or MT binary classification, where
performance by different techniques, such as data amplification, rep-
 = {0, 1}. There are also MT multi-target regression problems
resentation learning or transfer learning. In any case, there are multi-
with 𝑃 > 1 targets, i.e.  = R𝑃 , or MT multi-label classification
ple techniques to achieve this coupling between tasks. In [6], Zhang
ones, but we will focus on MT single-target and binary problems,
and Yang propose to divide MTL techniques in five different groups:
which are the most common ones for kernel methods.
feature-learning, low-rank, task-clustering, task relation learning and
• Derived MTL problems: the targets 𝑦𝑟 might be different for
decomposition approaches Here, starting with the work in [6], and
each task but the feature matrix 𝑋 is shared across all tasks,
since we restrict this review to kernel methods, we propose an updated
i.e. (𝑋𝑟 , 𝑦𝑟 ) = (𝑋, 𝑦𝑟 ). The multi-target regression or multi-label
taxonomy with three groups:
classification problems can be seen as members of this category.
• Feature-based approaches: These are techniques that consider a
Multi-target regression problems can be treated independently for each
shared representation for all tasks, and task-specific functions are
individual target and consider each single-target regression subproblem
built over these shared features. This group roughly corresponds
as a different task. Also, multi-label classification can be tackled with
to the feature-learning group of [6]
a one-vs-all strategy, where each binary classification problem can be
More precisely, we consider hypotheses ℎ𝑟 (⋅) = 𝑔𝑟 (𝑓 (⋅)) where
interpreted as an individual task. We will follow these approaches
𝑓 (⋅) is the feature-building function common to all tasks, and 𝑔𝑟
to derive MTL problems from multi-target regression and multi-label
are the task-specific functions. The risk minimization problem for
classification ones.
these approaches is the following one:
Formally, we can express an MTL problem as finding the minimizer
𝑚𝑟
of some empirical risk. Given 𝑇 distributions in  × , ∑𝑇
1 ∑ ∑𝑇
arg min 𝓁(𝑔𝑟 (𝑓 (𝑥𝑟𝑖 )), 𝑦𝑟𝑖 ) + 𝛺(𝑓 ) + 𝛺𝑟 (𝑔𝑟 ), (4)
𝑷 = (𝑃1 , … , 𝑃𝑇 ), 𝑓 ,𝒈 𝑚
𝑟=1 𝑟 𝑖=1 𝑟=1

we consider 𝑇 hypotheses ℎ1 , … , ℎ𝑇 from some hypothesis space , where 𝒈 = (𝑔1 , … , 𝑔𝑇 ), and 𝛺(𝑓 ), 𝛺𝑟 (𝑔𝑟 ) regularize the feature-
one for each task, such that 𝒉 = (ℎ1 , … , ℎ𝑇 ) ∈  𝑇 , and define the MTL building and decision functions, respectively. This is a natural
expected risk as approach, for instance, for Neural Networks, where weight shar-
ing in the hidden layers determines the feature function, and

𝑇 ∑
𝑇
where the task functions are defined by specific output neu-
𝑅𝑷 (𝒉) = 𝑅𝑃𝑟 (ℎ𝑟 ) = 𝓁(ℎ𝑟 (𝑥), 𝑦)𝑑𝑃𝑟 (𝑥, 𝑦), (1)
∫×𝑌 rons [1], but it is also applicable to other methods, see [7–9] for
𝑟=1 𝑟=1

2
C. Ruiz et al. Neurocomputing 577 (2024) 127255

Fig. 1. A tree scheme of a taxonomy of kernel-based multitask models, with representative formulations of their respective minimization problems and some of the most relevant
works in each category.

example, which consider linear feature extractors, that is, linear 3. Machine learning with kernels
models with a feature-based approach.
• Regularization-based approaches: The coupling between tasks ML kernel methods, as defined in [18], consider models that can
is enforced through a regularization term, which leads to the be expressed as 𝑓 (⋅) = ⟨𝑤, 𝜙(⋅)⟩, where 𝜙(⋅) is a transformation of the
following regularized risk minimization problem: original inputs into a Reproducing Kernel Hilbert Space (RKHS), with a
∑ 𝑚𝑟 possibly infinite dimension and where the minimization problem that
1 ∑
𝑇
arg min 𝓁(ℎ𝑟 (𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ) + 𝛺(𝒉), (5) the learning procedure has to solve might be easier. When the training
𝒉 𝑟=1
𝑚 𝑟 𝑖=1
and prediction algorithms can be expressed in terms of dot products in
where 𝛺(𝒉) is a regularization function that does not decouple the RKHS, we can work implicitly with the transformation 𝜙 by using
∑𝑇
across tasks, i.e. 𝛺(𝒉) ≠ 𝑟=1 𝛺𝑟 (ℎ𝑟 ) Here we include most ̂ = ⟨𝜙(𝑥), 𝜙(𝑥)⟩.
a kernel function such that 𝑘(𝑥, 𝑥) ̂
approaches that are considered in one of the low-rank, task- In this section we will briefly review Support Vector Machines
clustering, task relation learning and decomposition groups of [6] (SVMs) and Gaussian Processes (GPs), two of the most relevant kernel
Some relevant proposals that follow this strategy are [10,11], methods and which have received a greater attention in the MTL
which use linear models, and [12], where kernel methods are literature.
applied.
• Combination-based approaches: Now each task model is the
combination of a component 𝑔 common to all tasks and a task- 3.1. Support vector machines
specific one 𝑔𝑟 . In other words, the hypotheses are now ℎ𝑟 (⋅) =
𝑔(⋅) + 𝑔𝑟 (⋅), and the corresponding risk minimization problem can The theory developed by Vapnik [19] shows that the generalization
then be expressed as ability of a learning model is related to the capacity of the space of
𝑚𝑟 functions where such model is selected and, motivated by this, Vapnik
∑𝑇
1 ∑ ∑𝑇
arg min 𝓁(𝑔(𝑥𝑟𝑖 ) + 𝑔𝑟 (𝑥𝑟𝑖 ), 𝑦𝑟𝑖 ) + 𝛺(𝑔) + 𝛺𝑟 (𝑔𝑟 ). (6) introduced SVMs. In the classification setting, SVMs consider maximum
𝑔,𝒈 𝑚
𝑟=1 𝑟 𝑖=1 𝑟=1 margin hyperplanes instead of general hyperplanes, hence with less
This approach was first proposed in [13] for linear SVMs, and capacity, and the hinge loss is used to obtain such hyperplanes. The
later extended for other SVM variants; see for example [14,15]. 𝜖-insensitive loss is used for SVM regression and only predictions
In [16] a kernelized version was proposed, and in [17] this with an error larger than 𝜖 are penalized. In this work we will use
combination approach was implemented with neural networks. the formulation presented in [20] that unifies the classification and
regression primal problems into the following single primal one:
In Fig. 1 we present a tree scheme with these three categories, their
corresponding minimization problems that we have just described, and ∑
𝑛
1
min 𝐶 𝜉𝑖 + ‖𝑤‖2
some of the most relevant works of Multi-Task Learning with kernel 𝑤,𝑏,𝝃 2 (7)
𝑖=1
methods. These will be further explored in Section 4.
s.t. 𝑦𝑖 (⟨𝑤, 𝜙(𝑥𝑖 )⟩ + 𝑏) ≥ 𝑝𝑖 − 𝜉𝑖 , 𝜉𝑖 ≥ 0;
We will follow this taxonomy when discussing concrete MTL meth-
ods in Section 4 but, before that, we briefly review next support vector here 𝜉𝑖 are slack variables, which allow for misclassified points or
machines and Gaussian processes, the most relevant kernel based ML regression errors, and 𝐶 is a hyperparameter that regulates the trade-
methods. off between the error of the model and its complexity, represented by

3
C. Ruiz et al. Neurocomputing 577 (2024) 127255

the norm ‖𝑤‖. The standard approach to solve (7) is through strong 4.1. Feature-based approaches
duality, solving instead the simpler dual problem:
As explained before, the feature-based approaches rely on finding
1 ⊺
min 𝜶 𝑄𝜶 − 𝜶 ⊺ 𝒑 a shared representation that is useful for all tasks. The corresponding
𝜶 2
∑𝑛 (8) models for each task 𝑟 = 1, … , 𝑇 can be defined as ℎ𝑟 (⋅) = 𝑔𝑟 ◦𝑓 (⋅),
s.t. 𝑦𝑖 𝛼𝑖 = 0, 0 ≤ 𝛼𝑖 ≤ 𝐶, where 𝑓 is the shared feature-building function and 𝑔𝑟 are task-specific
𝑖=1 functions. This is a standard technique for doing MTL with NNs [1];
with 𝒑 = (𝑝1 , … , 𝑝𝑛 )⊺ and 𝜶 = (𝛼1 , … , 𝛼𝑛 )⊺ the vector of dual variables; however, it is more difficult to find examples for kernel methods.
𝑄 is the so-called kernel matrix, defined as 𝑄𝑖𝑗 = 𝑦𝑖 𝑦𝑗 𝑘(𝑥𝑖 , 𝑥𝑗 ). This dual One way to define a feature-based approach with kernels is to
problem can be efficiently solved using the SMO algorithm [21] and the consider the function 𝑓 as linear combinations of fixed features defined
optimal primal 𝑤∗ and 𝑏∗ are derived from the optimal dual solution by the implicit kernel transformation 𝜙, such as for instance,
⟨ ⟩ ⟨ ⟩
𝜶 ∗ through the KKT conditions (see for instance [18] for more details). 𝑓 (𝑥𝑟𝑖 ) = (𝑓1 (𝑥𝑟𝑖 ), … , 𝑓𝐿 (𝑥𝑟𝑖 )) = ( 𝑢1 , 𝜙(𝑥𝑟𝑖 ) , … , 𝑢𝐿 , 𝜙(𝑥𝑟𝑖 ) )

for some 𝐿 ∈ N>0 , and then to define the task models by using linear
3.2. Gaussian processes functions 𝑔𝑟 acting on the features given by 𝑓 ; that is, for 𝑟 = 1, … , 𝑇 ,
we define the models
Gaussian Processes (GPs; see for instance [22]) model the relation- ⟨ ⟩
ℎ𝑟 (𝑥𝑟𝑖 ) = 𝑎𝑟 , 𝑓 (𝑥𝑟𝑖 ) ,
ship between input variables 𝑥𝑖 and outputs 𝑓 (𝑥𝑖 ) of 𝑛 samples by
assuming that the outputs follow a multivariate Gaussian distribution, where 𝑎𝑟 ∈ R𝐿 are the parameters of the linear functions 𝑔𝑟 . In the
that is linear case the models can be expressed as
( ) ⟨ ⟩ ⟨ ⟩ ⟨ ⟩
𝑓 (𝑥1 ), … , 𝑓 (𝑥𝑛 ) = 𝑁 (𝒎, 𝐾) ℎ𝑟 (𝑥𝑟𝑖 ) = 𝑎𝑟 , 𝑈 ⊺ 𝑥𝑟𝑖 = 𝑈 𝑎𝑟 , 𝑥𝑟𝑖 = 𝑤𝑟 , 𝑥𝑟𝑖 ;

hence, the parameter matrix 𝑊 = (𝑤1 , … , 𝑤𝑇 ) is defined as


with 𝒎𝑖 = 𝑚(𝑥𝑖 ) and 𝐾 a positive definite matrix. The features 𝑥 and
( )
targets 𝑦 are related as 𝑦(𝑥) = 𝑓 (𝑥) + 𝜖 with 𝜖 ∼ 𝑁 0, 𝜎 2 ; this leads to 𝑊 = 𝑈 𝐴 ,
𝑑×𝑇 𝑑×𝐿 𝐿×𝑇
a covariance matrix
with 𝑈 = (𝑢1 … 𝑢𝐿 ) and 𝐴 = (𝑎1 … 𝑎𝑇 ) being 𝑑 × 𝐿 and 𝐿 × 𝑇 matrices,
𝑖𝑗 = 𝑘(𝑥𝑖 , 𝑥𝑗 ) + 𝜎 2 𝛿𝑥𝑖 ,𝑥𝑗 and 𝑑 the dimension of the original data.
This idea, which we will call MTL feature learning, is proposed
where the function 𝑘 must ensure that their corresponding matrix 𝐾𝑖𝑗 = by Argyriou et al. in [7], where they consider 𝐿 = 𝑑 features. Some
𝑘(𝑥𝑖 , 𝑥𝑗 ) is positive semidefinite and symmetric or, in other words, 𝑘(⋅, ⋅) restrictions are imposed to enforce that the features represented by
must be a kernel function. For simplicity, the outputs 𝑓 (𝑥𝑖 ) are assumed 𝑢1 , … , 𝑢𝑑 capture different information, and only a subset of them is
to be centered at zero and, hence, the mean function 𝑚(𝑥𝑖 ) is assumed necessary for each task. In more detail, the minimization problem in
to be zero, so the targets 𝑦𝑖 are also centered at zero. Then, given the the linear case is
observations (𝑋, 𝒚) and an unknown feature matrix 𝑋̂ for which we 𝑚𝑟

𝑇 ∑
⟨ ⟩
want to get predictions 𝒇̂ , we have min 𝓁(𝑦𝑟𝑖 , 𝑎𝑟 , 𝑈 ⊺ 𝑥𝑟𝑖 ) + 𝜆 ‖𝐴‖22,1 s.t. 𝑈 ⊺ 𝑈 = 𝐼, (10)
𝑈 ∈R𝑑×𝑑
[ ] ( [ ]) 𝐴∈R𝑑×𝑇
𝑟=1 𝑖=1
𝒚 𝐾(𝑋, 𝑋) + 𝛿 2 𝐼𝑛 𝐾(𝑋, 𝑋) ̂
∼ 𝑁 0, . (9)
𝒇̂ 𝐾(𝑋,̂ 𝑋) ̂ 𝑋)
𝐾(𝑋, ̂ where the 𝐿2,1 regularizer is used to impose row-sparsity across tasks,
i.e. forcing some rows of 𝐴 to be zero, which has the goal of using in all
Since all distributions involved are normal, it follows that the condi-
tasks the same subset of the features represented by the columns of 𝑈 ;
tional distribution 𝒇̂ |𝑋,̂ 𝑋, 𝒚 also has a Gaussian distribution. Analo-
also, the matrix 𝑈 is restricted to be orthonormal, so that these columns
gously, if 𝒚̂ are the predictions to which noise has been added, it can do not contain overlapping information. As shown in [23], (10) is
be shown that the conditional distribution 𝒚| ̂ 𝑋, 𝒚 is also Gaussian.
̂ 𝑋, equivalent to
To get the noiseless predictions 𝒚 ∗̂ , using the distribution of 𝒇̂ , or 𝑚𝑟
𝒇 ∑
𝑇 ∑
⟨ ⟩ ∑
𝑇
⟨ ⟩
the noisy ones 𝒚 ∗𝒚̂ , using the distribution of 𝒚, ̂ for a new, unknown 𝑥, ̂ min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝑥𝑟𝑖 ) + 𝜆 𝑤𝑟 , 𝐷−1 𝑤𝑟
we can apply the formulae 𝑊 ∈R𝑑×𝑇 ,𝐷∈R𝑑×𝑑 𝑟=1 𝑖=1 𝑟=1 (11)
[ ( )] [ ] s.t. 𝐷 ⪰ 0, tr (𝐷) ≤ 1.
𝒚 ∗̂ = arg min E𝒇̂ 𝓁 𝒇̂ , 𝒚 𝒇̂ or 𝒚 ∗𝒚̂ = arg min E𝒚̂ 𝓁(𝒚,
̂ 𝒚 𝒚̂ ) ,
𝒇 𝒚 𝒇̂ 𝒚 𝒚̂ To obtain an optimal solution (𝑊 ∗ , 𝐷∗ ), the authors propose an iterated
two-step procedure. In particular, the optimization with respect to 𝑊
respectively, and where 𝓁(⋅, ⋅) is a loss function. If we select 𝓁(𝑧, 𝑧)
̂ =
decouples in each task and the standard Representer Theorem [24] can
̂ 2 , the minimizer is the mean of the distribution. Since both 𝑦̂
‖𝑧 − 𝑧‖
be used to solve it, (while the one with respect to 𝐷 has a closed solution
and 𝑓̂ have Gaussian distributions with the same mean, the result is 1 1)
𝐷∗ = (𝑊 ⊺ 𝑊 ) 2 ∕ tr (𝑊 ⊺ 𝑊 ) 2 .

𝒚 ∗̂ = 𝒚 ∗𝒚̂ = 𝒎(𝑥)
̂ = 𝒌𝑥̂ (𝐾 + 𝜎 2 𝐼𝑛 )−1 𝒚, Moreover, the authors ( of [23]) note that the regularizer in (11)
𝒇
( ) can be expressed as tr 𝑊 ⊺ 𝐷−1 𝑊 and, by plugging 𝐷∗ , we get the

where 𝒌𝑥̂ = 𝑘(𝑥1 , 𝑥), ̂ and 𝐾 is the kernel matrix of the
̂ … , 𝑘(𝑥𝑛 , 𝑥) squared-trace norm regularizer for 𝑊 :
training patterns. Observe that this is the result also obtained by 𝑚𝑟

𝑇 ∑
⟨ ⟩
kernel ridge regression [22]; however, with GPs we not only get point min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝑥𝑟𝑖 ) + 𝜆 ‖𝑊 ‖2tr . (12)
predictions, but also their full distribution. 𝑊 ∈R𝑑×𝑇 𝑟=1 𝑖=1
( 1)
Here ‖𝑊 ‖tr = tr (𝑊 ⊺ 𝑊 ) 2 denotes the trace norm which coincides
4. Multi-task learning with kernels ∑min (𝑑,𝑇 )
with ‖𝑊 ‖tr = 𝑖=1
𝜆𝑖 , where 𝜆𝑖 are the singular values of 𝑊 .
Therefore, by penalizing the trace norm, we favor low-rank solutions of
In this section we will review some of the most relevant papers on 𝑊 . That is, the initial problem (10) is equivalent to a problem where
MTL with kernel methods. Moreover, as mentioned in Section 2, we the trace norm regularization of the matrix 𝑊 is used. In [25] this
divide them in three groups: feature-based, regularization-based and regularization is extended to any spectral function, that is, a function 𝑓
∑ (𝑑,𝑇 )
combination-based methods. applied over the eigenvalues of a matrix, i.e., ‖𝑊 ‖𝑓tr = min
𝑖=1
𝑓 (𝜆𝑖 ).

4
C. Ruiz et al. Neurocomputing 577 (2024) 127255

Argyriou and his coworkers have also given the kernel extension of are 𝑇 × 𝑇 binary, diagonal matrices with the 𝑟th element of the diag-
these results. In [23], they propose an MTL feature learning approach onal indicating whether task 𝑟 corresponds to cluster 𝜅. The problem
of (10) with kernels in terms of the problem considered in [28] is
𝑚𝑟
∑ 𝑚𝑟
𝑇 ∑
⟨ ⟩ ∑
𝑇 ∑
⟨ ⟩ ∑
𝐶
𝓁(𝑦𝑟𝑖 , 𝑎𝑟 , 𝑈 ⊺ 𝜙(𝑥𝑟𝑖 ) ) + 𝜆 ‖𝐴‖22,1 s.t. 𝑈 ⊺ 𝑈 = 𝐼𝑑 . min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝑥𝑟𝑖 ) + 𝜆 ‖𝑊 𝑄𝜅 ‖2
min (13)
𝑊 ,𝑄1 ,…,𝑄𝐶 ‖ ‖∗
𝑈 ∈ 𝑑 ,𝐴∈R𝑑×𝑇 𝑟=1 𝑖=1 𝑟=1 𝑖=1 𝜅=1
(19)
Here 𝜙 is the implicit transformation into the kernel space , and 𝑈 is a ∑
𝐶
s.t. 𝑄𝜅 = 𝐼 with 𝑄𝜅𝑡 ∈ {0, 1} ,
collection of 𝑑 elements of . For solving this, they use the formulation 𝜅=1
of (12), i.e.
and an iterated two-step optimization separately in 𝑊 and the matrices
𝑚𝑟

𝑇 ∑
⟨ ⟩ 𝑄1 , … , 𝑄𝐶 is proposed to solve it.
min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝜙(𝑥𝑟𝑖 )) + 𝜆 ‖𝑊 ‖2tr , (14) Following a different path, a general result for regularization-based
𝑊 ∈ 𝑇 𝑟=1 𝑖=1
MTL with kernels was given by Evgeniou et al. in [12], where a general
and extend the Representer Theorem for any spectral regularization. MTL formulation with kernel methods was presented. Consider 𝒘 =
Another similar approach for feature learning is the one described ⊺ ⊺
vec (𝑊 ) the vectorization of matrix 𝑊 , i.e. the vector (𝑊1 , … , 𝑊𝑇 )⊺
in [26], where a sparse-coding method [27] is applied in MTL, and they where 𝑊1 , … , 𝑊𝑇 are the columns of 𝑊 . Given a positive definite 𝑇 ×𝑇
consider the following optimization problem: matrix 𝐸, in [12] they show that the solutions of
𝑚𝑟

𝑇 ∑
⟨ ⟩ ∑ 𝑚𝑟
𝑇 ∑
⟨ ⟩
min 𝓁(𝑦𝑟𝑖 , 𝑈 𝑎𝑟 , 𝜙(𝑥𝑟𝑖 ) ) + 𝜆 ‖𝑈 ‖2,∞ + 𝜇 ‖𝐴‖1,∞ . (15) min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝑥𝑟𝑖 ) + 𝜇𝒘⊺ (𝐸 ⊗ 𝐼)𝒘, (20)
𝑈 ∈(R𝐿 ,),𝐴∈R𝐿×𝑇 𝑟=1 𝑖=1 𝒘
𝑟=1 𝑖=1
Here, 𝑈 is a linear map 𝑈 ∶ R𝐿 → , which is called in [26] an 𝐿- where ⊗ is the Kronecker product between matrices, can be obtained
dimensional dictionary; in the linear case, where  = R𝑑 , the set 𝑈 is by solving
a 𝑑 × 𝐿 matrix. Anyway, it can be seen that an equivalent formulation 𝑚𝑟

𝑇 ∑
⟨ ⟩
is written as min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝐵𝑟 𝑥𝑟𝑖 ) + 𝜇𝒘⊺ 𝒘, (21)
𝒘
𝑚𝑟

𝑇 ∑
⟨ ⟩ 𝑟=1 𝑖=1
min 𝓁(𝑦𝑟𝑖 , 𝑎𝑟 , 𝑈 ⊺ 𝜙(𝑥𝑟𝑖 ) ) + 𝜆 ‖𝑈 ‖2,∞ + 𝜇 ‖𝐴‖1,∞ , (16)
𝑈 ∈ 𝐿 ,𝐴∈R𝐿×𝑇 𝑟=1 𝑖=1 with 𝐵𝑟 being the columns of a matrix 𝐵 such that 𝐸 = (𝐵 ⊺ 𝐵)−1 . Then,
Evgeniou et al. also prove that the solutions of problem (20) have the
which has similarities with the MTL feature learning method in prob- form
lem (13). However, some differences must be pointed out: in the linear 𝑚𝑟

𝑇 ∑
case, the matrix 𝑈 in (10) is an orthogonal square matrix, while in (15) 𝒘= 𝛼𝑖𝑟 𝐵𝑟 𝑥𝑟𝑖 .
it is overcomplete, with 𝐿 columns and 𝐿 > 𝑑; also, the regularization 𝑟=1 𝑖=1
used for 𝐴 is the 𝓁2,1 norm in (13), and the 𝓁1,∞ in (16). If we consider the transformations 𝜓(𝑥𝑟𝑖 ) = 𝐵𝑟 𝑥𝑟𝑖 , the models of prob-
A problem very similar to (15) is presented in [9], where the idea is lem (21) are of the form ℎ𝑟 (⋅) = ⟨𝑤𝑟 , 𝜓𝑟 (⋅)⟩, and the reproducing kernel
the same but the regularizers are the 𝐿2,2 (Frobenius) norm for 𝑈 and of such transformations is
the 𝐿1,1 norm for 𝐴, that is, ⟨ ⟩ ⟨ ⟩ ⟨ ⟩
̂ 𝑟 , 𝑥𝑠 ) = 𝜓𝑟 (𝑥𝑟 ), 𝜓𝑠 (𝑥𝑠 ) = 𝐵𝑟 𝑥𝑟 , 𝐵𝑠 𝑥𝑠 = (𝐸 −1 )𝑟𝑠 𝑥𝑟 , 𝑥𝑠 .
𝑘(𝑥 (22)
𝑚𝑟 𝑖 𝑗 𝑖 𝑗 𝑖 𝑖 𝑖 𝑗

𝑇 ∑
⟨ ⟩
min 𝓁(𝑦𝑟𝑖 , 𝑎𝑟 , 𝑈 ⊺ 𝑥𝑟𝑖 ) + 𝜆 ‖𝑈 ‖2,2 + 𝜇 ‖𝐴‖1,1 . (17) Here, in the last equality, observe that the kernel divides task interde-
𝑈 ∈ 𝐿 ,𝐴∈R𝐿×𝑇 𝑟=1 𝑖=1
pendency, encapsulated −1
⟨ in (𝐸 ⟩ )𝑟𝑠 , and feature similarity, as expressed
Here, we can interpret this model as a linear sparse combination, 𝑟 𝑠
in the inner product 𝑥𝑖 , 𝑥𝑗 .
encoded in 𝐴, of some features encoded in 𝑈 . The intuition of splitting the inter-task and feature similarities, as
In general, the problems defined in (16) and (17) are related to done in the kernel (22), can also be introduced using vector-valued
sparse coding because the matrix 𝑈 defines a shared set of features, RKHSs [29,30], whose elements are functions 𝑓 ∶  →  and  is
which are called the codes, that are combined differently for each task, a vector space such as, for example,  = R𝑇 . In this situation, the
and this combination is enforced to be sparse by the 𝓁1,∞ or 𝓁1,1 norm kernels are matrix-valued maps 𝐾 ∶  ×  → R𝑇 ×𝑇 ; this has been later
regularizers of 𝐴. However, they have been fully developed only for extended to kernels in Banach spaces [31]. It is particularly interesting
the linear case, and although the authors point out that they could be the case of separable kernels [30], where these matrix-valued kernels
extended to kernel models, to the best of our knowledge, this is still a 𝐾(𝑥, 𝑥′ ) are defined as the Kronecker product of a kernel 𝑘𝜏 (𝑟, 𝑠) for task
pending task. dependencies, called the output kernel, and a kernel between features
𝑘(𝑥, 𝑥′ ), that is,
4.2. Regularization-based approaches
(𝐾(𝑥, 𝑥′ ))𝑟𝑠 = 𝑘(𝑥, 𝑥′ )𝑘𝜏 (𝑟, 𝑠). (23)
Instead of trying to find a good shared feature space, the The kernel given in (22) can be seen as a particular case of a separable
regularization-based approaches enforce task coupling through regu- kernel, where the inter-task kernel is defined by a matrix, i.e. 𝑘𝜏 (𝑟, 𝑠) =
larization. In the case of kernel methods, we consider models ℎ𝑟 (𝑥𝑟𝑖 ) = (𝐸 −1 )𝑟𝑠 , and the inter-feature kernel is linear, i.e., the dot product. It
⟨ ⟩
𝑤𝑟 , 𝜙(𝑥𝑟𝑖 ) , where, again, 𝜙 is the transformation associated to a kernel is common to consider an inter-task kernel defined by a matrix 𝐸 and
function. Since we have one parameter vector 𝑤𝑟 for each task, we can a non-linear kernel, e.g. a Gaussian kernel. In that case, problem (20)
consider the matrix 𝑊 whose columns are 𝑤𝑟 for 𝑟 = 1, … , 𝑇 ; then, the becomes:
regularized risk to be minimized can be written as 𝑚𝑟

𝑇 ∑
⟨ ⟩
𝑚𝑟 min 𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝜙(𝑥𝑟𝑖 ) ) + 𝜇𝒘⊺ (𝐸 ⊗ 𝐼)𝒘, (24)

𝑇 ∑
⟨ ⟩ 𝒘
𝓁(𝑦𝑟𝑖 , 𝑤𝑟 , 𝑥𝑟𝑖 ) + 𝜇𝛺(𝑊 ). (18) 𝑟=1 𝑖=1
𝑟=1 𝑖=1 and using the definitions from [30], the result of (22) can be extended,
First, observe that the MTL feature learning problem in (14) can be and the kernel for the corresponding dual problem is
also interpreted as a regularization-based problem. Starting with this ̂ 𝑟 , 𝑥𝑠 ) = (𝐸 −1 )𝑟𝑠 𝑘(𝑥𝑟 , 𝑥𝑠 ),
𝑘(𝑥𝑖 𝑗 𝑖 𝑗 (25)
formulation, a clusterized extension is proposed in [28], where a trace
norm regularizer of the matrices 𝑊𝜅 = 𝑊 𝑄𝜅 is proposed. Here 𝑄𝜅 for any kernel function 𝑘 such as the Gaussian kernel.

5
C. Ruiz et al. Neurocomputing 577 (2024) 127255

A particular example for the matrix 𝐸, presented in [12], is the where  (𝑀, 𝐴 ⊗ 𝐵) denotes the matrix-variate normal distribution
Graph Laplacian (GL) regularization 𝐸 = 𝐿, i.e., 𝐸 is taken as the with mean 𝑀, row covariance matrix 𝐴 and column covariance matrix
Laplacian matrix 𝐿 of a graph where the tasks are the nodes and the 𝐵. Thus, in this case the feature covariance matrix is the identity ma-
edge weights measure the degree of relatedness between tasks. In this trix, while the task-covariance matrix is given by the 𝛺 in (27). With the
case, the regularization in (24) can be expressed as distribution (28), Zhang et al. showed that the problem of selecting the
maximum a posteriori estimation of 𝑊 and the maximum likelihood

𝑇 ∑
𝑇 ∑
𝑇 ∑
𝑇
𝒘⊺ (𝐿 ⊗ 𝐼)𝒘 = (𝐿)𝑟𝑠 ⟨𝑤𝑟 , 𝑤𝑠 ⟩ = (𝐴)𝑟𝑠 ‖ ‖
‖ 𝑤𝑟 − 𝑤𝑠 ‖ ,
estimations of both 𝛺 and the biases 𝑏𝑟 , 𝑟 = 1, … , 𝑇 , is a regularized
𝑟=1 𝑠=1 𝑟=1 𝑠=1 minimization problem; however, when relaxing the restrictions on 𝛺,
where 𝐴 is the adjacency matrix of the graph. In [32] a Bayesian in- it can be expressed as a convex problem. Taking a Bayesian approach,
terpretation is given, which leads to a method for solving this problem in [42] a horseshoe prior is used instead to learn feature covariance,
also for kernels that are not associated to a Hilbert space. Moreover, and in [43] this prior is also used to identify outlier tasks.
Finally, another perspective is given by those methods consider-
some proposals also learn the inter-task matrix 𝐸 jointly with the task ∑𝐿
parameters [33–37]. In [35], for instance, Argyriou et al. consider the ing
⟨ task models ⟩ that combine multiple views, i.e., 𝑔𝑟 (𝑥𝑟𝑖 ) = 𝑘=1
GL regularization, assuming that task structure is represented by a 𝑤𝑟𝑘 , 𝜙𝑘 (𝑥𝑟𝑖 ) , with 𝐿 weights 𝑤𝑟𝑘 per task. This is also known as the
graph, and the goal is to jointly learn task parameters and the Laplacian MTL multiple kernel learning approach. For instance, the following
matrix by solving regularized problem is proposed in [44,45]:
𝑚𝑟
( ) (𝑇 )1∕𝑝
arg min ̂ + 𝜈𝜶 ⊺ 𝒚 + tr (𝐿 + 𝜆𝐼)−1
𝜶 ⊺ 𝐾𝜶 ∑ 𝑇 ∑ ∑
𝐿
⟨ ⟩ ∑
𝐿 ∑
min 𝑟
𝓁 𝑦𝑖 , 𝑟
𝑤𝑟𝑘 , 𝜙𝑘 (𝑥𝑖 ) +𝜇 ‖ ‖𝑝
, (29)
𝜶,𝐿
𝒘1 ,…,𝒘𝐿 ‖𝑤𝑟𝑘 ‖2
(26) 𝑟=1 𝑖=1 𝑘=1 𝑘=1 𝑟=1
1
s.t. 0 ⪯ 𝐿, (𝐿 + 𝜆𝐼)off ≤ 0, (𝐿 + 𝜆𝐼)−1 𝟏𝑛 = 𝟏 .
𝜆 𝑛 where for each task there is a combination of 𝐿 different views given by
This is the dual problem of an SVM where 𝐾̂ is the kernel matrix the transformations 𝜙𝑘 . Task coupling is enforced through a regulariza-
defined using the kernel function of (25) and replacing the matrix 𝐸 tion which employs an 𝐿1 norm across kernels and an 𝐿𝑝 norm across
in (24) by 𝐿 + 𝜆𝐼, that is, the Laplacian matrix with additional noise tasks, enforcing thus sparsity across kernels and non-sparse weight
for better stability. The authors follow an iterated two-step optimization matrices across tasks. In [44] the authors prove that problem (29) is
where the first step learns task parameters and the second learns task equivalent to
relations. A regularization-based approach can also be found in the ( ) 𝐿 ∑𝑇 ‖
∑𝑇 ∑ 𝑚𝑟
∑𝐿
⟨ ⟩ ∑ ‖2
𝑟=1 ‖𝑤𝑟𝑘 ‖2
work of [38], where it is applied to a transfer learning scenario, that min min 𝓁 𝑦𝑟𝑖 , 𝑤𝑟𝑘 , 𝜙𝑘 (𝑥𝑟𝑖 ) +𝜇 .
is, where there is only one task of interest, and the rest, source tasks,
𝒘
𝛽1 ,…,𝛽𝐿 1 ,…,𝒘 𝐿
𝑟=1 𝑖=1 𝑘=1 𝑘=1
𝛽𝑘
act as leverage in the training process. (30)
Turning our attention to Gaussian processes (GPs), it could be said
that most MTL methods based on GPs can be classified as This problem is extended in [46] using task-specific parameters 𝛽𝑟𝑘 as
regularization-based strategies. For instance, in [39] a joint distribution 𝑚𝑟
( )
∑𝑇 ∑ ∑
𝐿
⟨ ⟩
for the latent variables of all tasks is defined as 𝒇 = (𝒇 1 , … , 𝒇 𝑇 )⊺ , that min min 𝓁 𝑦𝑟𝑖 , 𝑤𝑟𝑘 , 𝜙𝑘 (𝑥𝑟𝑖 )
𝜷 1 ,…,𝜷 𝑇 𝒘1 ,…,𝒘𝐿
is, 𝑟=1 𝑖=1 𝑘=1
(31)
( ) 𝐿 ‖
𝑇 ∑
∑ ‖2
𝜈 ∑
𝑇
𝑃 (𝒇 |𝑋, 𝜃) = 𝑁 𝟎, 𝐾𝜃 , ‖𝑤𝑟𝑘 ‖2
+𝜇 + {𝛩}𝑟𝑠 ⟨𝜷 𝑟 , 𝜷 𝑠 ⟩ ,
𝑟=1 𝑘=1
𝛽𝑟𝑘 2 𝑟,𝑠=1
where 𝐾𝜃 is the covariance or kernel matrix constructed by evalu-
ating the kernel function, parametrized with 𝜃, at the points 𝑋 = )⊺
where 𝜷 𝑟 = (𝛽𝑟1 , … , 𝛽𝑟𝐿 and 𝛩 is a positive definite matrix that
(𝑋1 , … , 𝑋𝑇 )⊺ : encodes task relationships through the regularization. In particular,
with fixed 𝛩 = 𝐼𝑇 , it results in an independent regularizer for each task
⎡𝐾𝜃 (𝑋1 , 𝑋1 ) … 𝐾𝜃 (𝑋1 , 𝑋𝑇 )⎤ ‖𝜷 𝑟 ‖2 . Moreover, the authors in [44] proposed including the matrix 𝛩
‖ ‖
𝐾𝜃 = ⎢ ⋮ ⋱ ⋮ ⎥.
as a learnable parameter. The corresponding dual problem for the full
⎢ ⎥
⎣𝐾𝜃 (𝑋1 , 𝑋1 ) … 𝐾𝜃 (𝑋1 , 𝑋𝑇 )⎦ optimization problem is
(𝑚 𝑚 )
This model is trained by maximizing the log-likelihood ∑𝑇 ∑ 𝑟 ∑ 𝑠 ∑
𝐿 𝑚𝑟

min min min 𝛼𝑖𝑟 𝛼𝑗𝑠 𝑦𝑟𝑖 𝑦𝑠𝑗 𝛽𝑙𝑟 𝑘𝑙 (𝑥𝑟𝑖 , 𝑥𝑟𝑗 ) − 𝛼𝑖𝑟

𝑛
𝛩 𝜷 1 ,…,𝜷 𝑇 𝜶 1 ,…,𝜶 𝑇
𝑃 (𝒇 , 𝒚|𝑋, 𝜃) = 𝑝(𝒇 |𝑿, 𝜃) 𝑝(𝑦𝑛 |𝑓𝑛 ), 𝑟 𝑖=1 𝑗=1 𝑙=1 𝑖=1
(32)
𝜈 ∑
𝑖=1 𝑇
+ {𝛩}𝑟𝑠 ⟨𝜷 𝑟 , 𝜷 𝑠 ⟩ ,
which leads to a regularized minimization problem. Notice that here 2 𝑟,𝑠=1
task coupling is made by sharing a covariance function in the joint
distribution, but this cannot detect possible tasks relations. To include A similar result is also presented in [47] where they directly ex-
̃
task information, task descriptors 𝑡 ∈ R𝑑 are used in [40] to define press a dual problem considering the linear combination of kernels
∑𝐿 𝑟 𝑟 𝑟
the kernels as a product of inter-feature and inter-task kernels, as 𝑙=1 𝛽𝑙 𝑘𝑙 (𝑥𝑖 , 𝑥𝑗 ), and they reach a dual problem resemblant to (32).
𝑘𝜃 (𝑥, 𝑥′ )𝑘𝜏 (𝑡, 𝑡′ ). However, the task descriptors are not always available, Related to these multi-view works, a cross-view approach is followed
so in [41] a different formulation is proposed that also learns inter-task in [48], where kernels between different transformations are defined
relations. This is done by splitting the kernel matrix between inter-task as
and inter-feature covariance matrices as ̂ = ⟨𝜙𝑙 (𝑥), 𝜙𝑘 (𝑥)⟩
𝑘𝑙𝑘 (𝑥, 𝑥) ̂ ,
( )
𝑃 (𝒇 |𝑋, 𝜃, 𝛺) ∼ 𝑁 𝟎, 𝛺 ⊗ 𝐾𝜃 ; (27) note that 𝜙𝑙 and 𝜙𝑘 are different, so we can see these as cross kernel
here 𝛺 is the 𝑇 × 𝑇 inter-task covariance matrix, 𝐾𝜃 the feature functions; for example, a cross kernel function between a Màtern kernel
covariance matrix and 𝐴 ⊗ 𝐵 denotes the Kronecker product of two and a squared exponential one. With these cross kernels, they define the
following SVM-based multi-view multi-task problem.
matrices. Following this idea, models of the form 𝑓𝑟 (⋅) = ⟨𝑤𝑟 , ⋅⟩ + 𝑏𝑟 are
considered in [33] and the prior on matrix 𝑊 = (𝑤1 … , 𝑤𝑇 ) is defined
4.3. Combination-based approaches
as
( 𝑇 )
∏ ( ) ( ) In this scenario the overall model is a combination of a common part
𝑊 ∼ 𝑁 𝟎𝑑 , 𝜎𝑟2 𝐼𝑑  0𝑑×𝑚 , 𝐼𝑑 ⊗ 𝛺 , (28)
𝑟=1
shared by all task and of task-specific components; both are learned

6
C. Ruiz et al. Neurocomputing 577 (2024) 127255

simultaneously with a goal of leveraging the common and specific where task-specific 𝜆𝑟 hyperparameters are considered instead of the
information to achieve better results. single 𝜆 of (36). Moreover, this was combined in [63] with a GL
The first such proposal is [13] and uses linear SVMs as the base regularization for the task-specific parts.
model. The goal is to find a decision function for each task, defined by Taking a different perspective, it is shown in [13] that the linear
a vector 𝑤𝑟 = 𝑤 + 𝑣𝑟 and a bias 𝑏𝑟 ; here 𝑤 is common to all tasks and problem (33) is equivalent to
𝑣𝑟 is task-specific. In this approach, the multi-task linear SVM primal 𝑚𝑟 ‖∑ ‖2 𝑇 ‖ ‖2

𝑇 ∑
1‖
𝑇
‖ ∑ ‖ ∑𝑇

problem is min 𝐶 𝜉𝑖𝑟 + ‖ 𝑤𝑟 ‖ + 1 ‖𝑤𝑟 − 𝑤 ‖
‖ ‖ ‖ 𝑠‖
𝒘,𝒃,𝜉 2 ‖ ‖ 2 ‖ ‖
∑ 𝑚𝑟
𝑇 ∑
𝜇 1 ∑ ‖ ‖2
𝑇 𝑟=1 𝑖=1
⟨ ⟩ ‖ 𝑟=1 ‖ 𝑟=1 ‖ 𝑠=1 ‖ (37)
min 𝐶 𝜉𝑖𝑟 + ‖𝑤‖2 + 𝑣
2 𝑟=1 ‖ 𝑟 ‖
𝑟 𝑟 𝑟 𝑟
𝑤,𝒗,𝒃,𝜉 2 s.t. 𝑦𝑖 ( 𝑤 + 𝑣𝑟 , 𝑥𝑖 + 𝑏) ≥ 𝑝𝑖 − 𝜉𝑖 ,
𝑟=1 𝑖=1
⟨ ⟩ (33) 𝜉𝑖𝑟 ≥ 0; 𝑖 = 1, … , 𝑚𝑟 , 𝑟 = 1, … , 𝑇 ,
s.t. 𝑦𝑟𝑖 ( 𝑤 + 𝑣𝑟 , 𝑥𝑟𝑖 + 𝑏) ≥ 𝑝𝑟𝑖 − 𝜉𝑖𝑟 ,
𝜉𝑖𝑟 ≥ 0; 𝑖 = 1, … , 𝑚𝑟 , 𝑟 = 1, … , 𝑇 . and this result holds whenever the transformations for the common and
task-specific parts are the same. This means that, in these cases, the
Observe that, instead of imposing some restrictions, such as low-rank or combination approach is equivalent to regularizing the mean model
inter-task regularization, the aim here is to impose a direct coupling by and the task-specific deviations from that mean. An intuition related
using a model 𝑤 that is common to all tasks. In [16,49] this formulation to this result is followed in collaborative online algorithms for MTL.
is extended to the kernel case, using different transformations: 𝜙(⋅) for In particular, an online method is proposed in [64] where the global
the common part and 𝜙𝑟 (⋅) for each of the specific parts. The multi-task model is updated with every instance and the task-specific ones only
kernel SVM problem is now the direct extension of (33): with their own instances, and the distance between the global and
∑ 𝑚𝑟
𝑇 ∑ specific models is penalized. This idea was later extended to kernel
1 ∑ ‖ ‖2
𝑇
𝜇
min 𝐶 𝜉𝑖𝑟 + ‖𝑤‖2 + 𝑣 models in [65], where for each instance 𝑥𝑟𝑖 a new step 𝑡 is considered,
𝑤,𝒗,𝒃,𝜉
𝑟=1 𝑖=1
2 2 𝑟=1 ‖ 𝑟 ‖
⟨ ⟩ ⟨ ⟩ (34) and the following two problems are minimized. First, the global weight
s.t. 𝑦𝑟𝑖 ( 𝑤, 𝜙(𝑥𝑟𝑖 ) + 𝑣𝑟 , 𝜙𝑟 (𝑥𝑟𝑖 ) + 𝑏𝑟 ) ≥ 𝑝𝑟𝑖 − 𝜉𝑖𝑟 , at 𝑤𝑡+1 is selected as
𝜉𝑖𝑟 ≥ 0; 𝑖 = 1, … , 𝑚𝑟 , 𝑟 = 1, … , 𝑇 . ( ⟨ ⟩)
𝑤𝑡+1 = min 𝓁 𝑦𝑟𝑖 , 𝑤, 𝜙(𝑥𝑟𝑖 ) + 𝜇 ‖𝑤‖2 + 𝜂 ‖ 𝑡 ‖2
‖𝑤 − 𝑤 ‖ ,
𝑤
Note that here task-specific biases 𝑏𝑟 have also been added, which leads
to a dual problem with multiple equality constraints, and in [50] a Gen- where 𝜇 and 𝜂 are hyperparameters; then, the task-specific weights 𝑣𝑡+1 𝑟
eralized SMO algorithm is developed to solve it. Furthermore, in [16, are selected as
( ⟨ ⟩) ‖ ‖2
49] a connection is drawn with the SVM+ model, which embodies the 𝑣𝑡+1 = min 𝓁 𝑦𝑟𝑖 , 𝑣𝑟 , 𝜙(𝑥𝑟𝑖 ) + 𝜇𝑟 ‖ ‖2 ‖ 𝑡 ‖2
‖𝑣𝑟 ‖ + 𝜂𝑟 ‖𝑣𝑟 − 𝑣𝑟 ‖ + 𝜈𝑟 ‖𝑣 − 𝑤𝑡+1 ‖ ,
Learning Using Privileged Information (LUPI) paradigm [51,52] which
𝑟 𝑣𝑟 ‖ 𝑟 ‖
we will briefly discuss in Section 5. The kernel space corresponding to where the final term is the one coupling together the task-specific
the common transformation 𝜙(⋅) is compared with the decision space models and the global model. Also, this online approach was extended
and the specific spaces corresponding to the transformations 𝜙𝑟 (⋅) with for multi-task ranking methods in [66]. Finally, the work of [67] also
the correction space; we will explain more about these spaces and adopts a combination-based approach using GPs.
SVM+ in Section 5.2.
In general, this combination-based approach can be expressed as 5. A deeper look into the foundations of MTL
𝑚𝑟

𝑇 ∑
⟨ ⟩ ⟨ ⟩ 𝜇 1 ∑
𝑇
min 𝐶 𝓁(𝑦𝑟𝑖 , 𝑤, 𝜙(𝑥𝑟𝑖 ) + 𝑣𝑟 , 𝜙𝑟 (𝑥𝑟𝑖 ) ) + ‖𝑤‖2 + ‖𝑣𝑟 ‖2 , (35) Up to now, we have assumed that the hypothesis space where we
𝑤,𝒗 2 2 ‖ ‖
𝑟=1 𝑖=1 𝑟=1 choose our learners is fixed, but a more general point of view would
be to learn not only the final models but also the best hypothesis space
and it has been extended to multiple SVM variants. For instance,
for the problem at hand; this is the goal of the Learning to Learn (LTL)
in [14] it was applied for LS-SVMs, in [53] for one-class SVM, in [54]
theory. Following a different point of view, there is a way of improving
for ranking SVM, in [55] for the twin SVM, in [56] for the LS-twin
SVM-based MTL through the paradigm of Learning Using Privileged
SVM, in [57] for the asymmetric LS-SVM, in [58] for SVMs using the
Information of Vapnik. We briefly describe both approaches next.
pinball loss, and in [59] more approaches for combination-based MTL
in twin SVMs are described. Furthermore, an active learning technique
5.1. Learning to learn
is applied in [60] over each task with this combination-based approach.
Also inspired by the work of [13], the authors in [61] present and
As discussed previously, in MTL we have an input space  and
SVM-based approach for fair classification.
an output space , with associated distributions 𝑃1 (𝑥, 𝑦), … , 𝑃𝑇 (𝑥, 𝑦)
In all these variants, the parameter 𝜇 of (35) determines the influ-
defined over  × , one for each task. Given a hypothesis space ,
ence of the common and task-specific parts, and, hence, the degree of
we want to select the hypotheses ℎ1 , … , ℎ𝑇 ∈  that minimize the
relatedness between the models. In particular, large values of 𝜇 in the
expected risk (1); however, since the 𝑃𝑟 (𝑥, 𝑦) distributions are unknown,
regularization force the common parameter 𝑤 to vanish, resulting in
we select instead the hypotheses that minimize the empirical risk (3).
task-independent models. Also, small 𝐶 and 𝜇 may make the specific
In the Learning to Learn (LTL) paradigm, proposed by Baxter in [2],
parts 𝑣𝑟 to disappear, which leads to a common model for all tasks. For
the scope is more general and the goal is to also learn the optimal
a better interpretability, a convex combination-based SVM is proposed
hypothesis space from which we can pick the best hypothesis for each
in [62], in which the primal problem is
task. That is, instead of having a fixed number of tasks 𝑇 and associated
𝑚𝑟

𝑇 ∑
1 ∑ ‖ ‖2 1
𝑇 distributions 𝑃1 (𝑥, 𝑦), … , 𝑃𝑇 (𝑥, 𝑦), Baxter considers an environment of
min 𝐶 𝜉𝑖𝑟 + 𝑣 + ‖𝑤‖2
𝑤,𝒗,𝒃,𝜉
𝑟=1 𝑖=1
2 𝑟=1 ‖ 𝑟 ‖ 2 tasks (, 𝑄) where  is a set of distributions 𝑃 defined over  × , and
⟨ ⟩ ⟨ ⟩ (36) we can sample from  according to a distribution 𝑄. In this scenario
s.t. 𝑦𝑟𝑖 (𝜆 𝑤, 𝜙(𝑥𝑟𝑖 ) + (1 − 𝜆) 𝑣𝑟 , 𝜙𝑟 (𝑥𝑟𝑖 ) + 𝑏𝑟 ) ≥ 𝑝𝑟𝑖 − 𝜉𝑖𝑟 , we do not have a fixed number of tasks; thus, the goal is to learn the
𝜉𝑖𝑟 ≥ 0; 𝑖 = 1, … , 𝑚𝑟 , 𝑟 = 1, … , 𝑇 , optimal hypothesis space  ∗ from a family H of such spaces. In a more
precise formulation, the LTL expected risk is defined as
with 𝜆 ∈ [0, 1]. Here, the hyperparameter 𝜇 is replaced by 𝜆, which
results in an easier interpretation, since when 𝜆 = 1 we have just a [ ]
common model, while a 𝜆 = 0 results in completely independent task 𝑅𝑄 () = inf 𝑅𝑃 (ℎ)𝑑𝑄(𝑃 ) = inf 𝓁(ℎ(𝑥), 𝑦)𝑑𝑃 (𝑥, 𝑦) 𝑑𝑄(𝑃 ),
∫ ℎ∈ ∫ ℎ∈ ∫×𝑌
models; intermediate values correspond to actual multi-task combina-
tions. This approach was later extended in [15] for the L2 and LS-SVMs, (38)

7
C. Ruiz et al. Neurocomputing 577 (2024) 127255

and we want to select  ∗ = min∈H 𝑅𝑄 (). In practice, the empirical generated according to an unknown distribution 𝑃 (𝑥, 𝑥⋄ , 𝑦), the goal is
sample for LTL is essentially the same one that we have described for to find the hypothesis ℎ(𝑥) from a set of hypotheses  that minimizes
MTL, i.e., we sample 𝑇 distributions 𝑃1 , … , 𝑃𝑇 from  and with them some expected risk
we obtain an empirical sample (2). Also, the empirical risk 𝑅̂ 𝑫 ()
coincides with the one given in (3) for MTL. 𝑅𝑃 = 𝓁 (ℎ (𝑥, 𝛼) , 𝑦) 𝑑𝑃 (𝑥, 𝑦).

In this setting, in [2] a bound is given for the LTL expected risk
Notice that the goal is the same one that in the standard Machine Learn-
𝑅𝑄 () in terms of the empirical risk 𝑅̂ 𝑫 () and a notion of capacity of
ing paradigm; however, with the LUPI approach we have additional
the task environment. Also, in the particular case of MTL with a fixed
information, available only during the training stage. This additional
number of tasks, an analogous bound is defined for the MTL expected information is encoded in the elements 𝑥⋄ of a space  ⋄ , which is
risk 𝑅𝑄 (𝒉) for any sequence of hypothesis 𝒉 = (ℎ1 , … , ℎ𝑇 ). To obtain different from . Then, given a pair (𝑥𝑖 , 𝑦𝑖 ), the goal of the Teacher is to
these results, Baxter uses the fact that the hypotheses for all tasks are provide information 𝑥⋄𝑖 ∈  ⋄ according to some probability 𝑃 (𝑥⋄ | 𝑥 ).
selected from the same space , and the bounds depend on definitions That is, the ‘‘intelligence’’ of the Teacher is defined by the choice of the
of the capacity of the family H of hypothesis spaces that is being space  ⋄ and the conditional probability 𝑃 (𝑥⋄ | 𝑥 ). To better under-
considered; these bounds are extensions of the VC-dimension [19]. To stand this paradigm consider the following example given in [52]. The
give a concrete example, we can consider hard sharing MTL neural goal is to find a decision rule that classifies biopsy images into cancer
networks [1], where the hidden layers are shared across all tasks or non-cancer ones. Here,  is the space of images, i.e., matrices of
and task-specific output neurons are defined. In this case, the shared pixels such as, for example, [0, 1]64×64 ; the label space is  = {0, 1}. An
hidden layers perform bias learning, that is, the learning of the optimal Intelligent Teacher might provide a student of medicine with commen-
hypothesis space  from which the output layers will select the final taries about the images such as, for instance, there is an area of unusual
optimal hypotheses. Analogously, when considering kernel methods, concentration of cells of Type A or there is an aggressive proliferation of
the algorithms that implement kernel learning strategies such as those cells of Type B. These commentaries are the elements 𝑥⋄ of a certain
in [44,47,48] also embody an LTL strategy. Following this idea, in [68] space  ⋄ and the Teacher also chooses the probability 𝑃 (𝑥⋄ | 𝑥 ), which
the analysis of Baxter is extended by Ben David and his coworkers to the determines when to express this additional information.
particular case of kernel methods performing kernel learning. Similarly, It is shown in [51] that making a smart use of this additional
in [69,70] multi-task GPs are considered where all tasks share the information can improve the convergence rate to the minimum of the
same kernel width, which is learnable, and they extend to this case expected risk. In particular, this strategy is implemented in the SVM+
the bounds of Baxter for the MTL expected risk 𝑅𝑄 (𝒉). algorithm [51,52] which solves the problem
With a different perspective, in [71] Ben David and his coworkers ∑
𝑛
([ ⟨ ⋄ ⋄ ∗ ⟩ ] ) ∑
𝑛
consider a multi-task scenario where there is one target task and the min
⋄ ⋄
𝐶 𝑦𝑖 ( 𝑤 , 𝜙 (𝑥𝑖 ) + 𝑏⋄ ) + 𝜁𝑖 + 𝐶̂ 𝜁𝑖
𝑤,𝑤 ,𝑏,𝑏 ,𝜁𝑖
rest of tasks are used to leverage the learning of the target one. In 𝑖=1 𝑖=1
1 𝜇
particular, a notion of task relatedness is developed where, essentially, + ⟨𝑤, 𝑤⟩ + ⟨𝑤⋄ , 𝑤⋄ ⟩
two tasks are related if their corresponding distributions are too. More 2 2 [ ⟨ ⟩ ] (39)
formally, the starting point for the theory developed in [72] is to s.t. 𝑦𝑖 (⟨𝑤, 𝜙(𝑥𝑖 )⟩ + 𝑏) ≥ 1 − 𝑦𝑖 ( 𝑤⋄ , 𝜙⋄ (𝑥⋄𝑖 ) + 𝑏⋄ ) + 𝜁𝑖 ,
⟨ ⟩
consider a set  of transformations 𝑓 ∶  →  that somehow 𝑦𝑖 ( 𝑤⋄ , 𝜙⋄ (𝑥⋄𝑖 ) + 𝑏⋄ ) + 𝜁𝑖 ≥ 0,
connect the distributions of different tasks. We say that a set of tasks 𝜁𝑖 ≥ 0.
with distributions 𝑃1 , … , 𝑃𝑇 are  -related if there exists a probability
As pointed out in [16], some similarities can be observed between (39)
distribution 𝑃 over  × {0, 1} such that for each task there exists some
and the MTL problem (34), since the final model that we obtain is also
𝑓𝑖 ∈  that verifies 𝑃𝑖 = 𝑓𝑖 [𝑃 ]. Using this idea of related tasks, the
a combination of two parts, each built on a possibly different space.
bounds given by Baxter can be tightened.
Observe that the model of interest for prediction here is the one defined
The works discussed up to this point use the VC-dimension, and
by the parameters 𝑤, 𝑏, while the parameters 𝑤⋄ , 𝑏⋄ are used to better
their corresponding extensions to the MTL framework, to bound the
model the slack values. With this interpretation, the spaces defined by
differences between empirical and expected risks. However, in [10]
the transformations 𝜙 and 𝜙⋄ are the decision and the correction spaces,
the authors rely on another notion of complexity, the Rademacher respectively.
Complexity [73], to establish the Multi-Task bounds. Other theoretical
works, such as [74–76], give bounds for the linear feature extractor 6. Problems and applications
methods for MTL that we considered in Section 2. In the more general
case of LTL, some improved bounds are found for specific cases such Although finding research MTL problems is not a trivial task, there
as MTL models which use the trace norm regularization [26]. Also, exist some datasets that are often considered for supervised MTL. These
in [77] bounds are given for a wide class of MTL models based on problems can be classified among several categories and some are
feature learning in both MTL and LTL settings, where these bounds are direct MTL problems, where there are different feature matrices 𝑋𝑟
not dependent on the data dimensions, as it is the case of other bounds and target values 𝑦𝑟 for each task, sampled from possibly different
for linear models, and they are derived using an approach based on distributions 𝑃𝑟 (𝑥, 𝑦). Others, however, are derived from either multi-
empirical process theory [78], instead of the generalized VC-dimension. target regression problems, where the patterns 𝑋𝑟 are shared across
tasks, or multi-class classification problems that can be solved with a
5.2. Learning using privileged information one-vs-all strategy for each class, and where each binary decision is
considered a distinct task.
Focusing now in SVM-based MTL, an approach to interpret its In Table 1 we present the characteristics of some of the datasets
advantages is the one derived by Vapnik and his coworkers in [52,79] that can be found more often in the research literature of kernel based
from the Learning Using Privileged Information (LUPI) paradigm, based MTL, such as number of samples 𝒏, dimension 𝒅 and number of tasks
on the observation that humans typically learn under the supervision 𝑻 . We also add a column with their category according to our previous
of an Intelligent Teacher; the additional knowledge provided by the discussion. In particular, we use ‘‘MT reg’’, and ‘‘MT clas’’ for direct MT
Teacher is the privileged information. Formally, given a set of i.i.d. regression and classification problems, and for the derived MT datasets
triplets we write ‘‘multi-reg’’ for multiple target regression and ‘‘multi-clas’’ for
multi-class classification, when they are tackled as MT problems. It is
{ }
𝐷 = (𝑥1 , 𝑥⋄1 , 𝑦1 ), … , (𝑥𝑛 , 𝑥⋄𝑛 , 𝑦𝑛 ) , 𝑥 ∈ , 𝑥⋄ ∈  ⋄ , 𝑦 ∈ , worth noting that not all MTL strategies are suitable for every problem.

8
C. Ruiz et al. Neurocomputing 577 (2024) 127255

Table 1 Table 2
Characteristics of MT datasets. The columns correspond to the number of instances 𝑛, Results obtained over three of the most used MTL problems by different previous works.
the dimension 𝑑, the number of tasks 𝑇 , the origin, and the references where they The baseline is a single SVM for all tasks.
have been used. Name School Computer Sarcos
Name 𝑛 𝑑 𝑇 Origin References (Exp. Variance) (RMSE) (RMSE)
School 15 362 27 139 MT reg [12,13,23,25] Baseline 23.50 - -
[8,9,33,40,41] [13] 34.37 - -
Computer 3600 13 180 MT clas [8,9,23,25] [12] 34.37 - -
Sarcos 44 484 21 7 multi-reg [33,34,44,45] [25] 26.70 1.93 -
Landmine 14 820 10 29 MT clas [45,80] [23] 26.70 1.90 0.36a
Dermatology 366 33 6 multi-clas [23,81] [40] 31.57 - -
Sentiment 2000 473 856 4 MT clas [33,34] [41] 29.20 - 0.41b
MNIST 70 000 400 4 multi-clas [9,28] [33] 29.90 - 0.14
USPS 9298 256 10 multi-clas [9,28] [8] - 1.10 -
Parkinson 5875 19 42 MT reg [45] [9] - 1.76 -
MHC-I 32 302 184 47 MT clas [45]
a
According to the results of [33].
b
According to the results of [33].

For example, combination-based methods cannot be applied in the multi-class classification case; consider for instance the MNIST dataset. Here the common part would try to learn information shared across all tasks, but images of any given digit would be the positive class in one task and part of the negative one in the others; thus, the common part would receive contradictory information. On the other hand, feature-based methods are well suited for this kind of problem.

As representative examples of the problems reviewed, we will concentrate here on describing three MTL problems, school, computer and sarcos, which we have found to have been considered in at least four different papers. In more detail:

• The school problem considers the prediction of the scores of students from different schools. There are 15 362 students from 139 different schools, and each student is described by 27 attributes. The predictions in each school define a task and, thus, there are 139 tasks.
• The computer dataset has as target the likelihood of purchasing 20 different computer models described by 13 binary attributes, as gathered from a survey of 180 people. Thus, there are 180 tasks with 20 examples each.
• The sarcos problem is a multi-target regression problem, where the goal is to predict the torque forces corresponding to the seven joints of a robot arm based on 21 input features: the position, velocity and acceleration of each joint. The dataset contains 44 484 data points and the prediction of the inverse dynamics of each joint is considered as a task.

In Table 2 we present the scores obtained by different approaches on these problems, taken from the results reported in the corresponding papers. For illustrative purposes, we now give a short discussion of this comparison, pointing out some caveats to be taken into account. First, the results of [44,45] are not comparable, since they provide score values very different from those reported in the other papers, so we omit them. Also, the scores shown for sarcos corresponding to the models presented in [23,41] have been taken from [33], which achieves the best results on this dataset, so we refer to the discussion in [33] for further analysis. In the computer dataset, the best score is reported by [8]. One possible explanation, as described by the authors, is that the relation among tasks in this problem is not linear, which is only well captured by their proposal. Observe that in [25] or [23] the trace norm regularization of problem (12) forces the task parameters to share the same linear subspace.

Finally, in the school dataset, possibly the one most often used in the literature, we find that the explained variance scores range from 23.5 for the CTL SVM to the best score, 34.3, reported for the combination-based method of [13]. We remark that this is a problem where some tasks have very few data; in fact, the average task size is ∼ 100, so learning individual task models using essentially only their corresponding data is difficult. However, the common part of the combination model directly uses all the data, so it may give better results (a minimal illustration of this kind of coupling is sketched at the end of this section). Also, other methods try to learn the task relations, which leads to more complex models with more learnable parameters that might not be too successful with small tasks.

We finish this section by considering some real-life applications. We observe that the use of Neural Network-based MTL techniques is widespread, and we can find examples in the biomedical field [82], natural language understanding [83], image detection and segmentation [84] or genetic analysis [85], among many others. However, the application of multi-task kernel methods to these problems is more complex, first because of the computational cost that they entail, but also because it is more difficult for them to exploit specific structures in data such as images or time-ordered values.

Anyway, we can find some real application examples of interest in other fields. For instance, multi-task SVMs were applied to COVID-19 diagnosis [86], and multi-task learning based on LS-SVMs is applied to finance in [87]. Another example is the output kernel learning approach of [36], which is successfully used in [88] for electricity demand forecasting by grouping energy loads from different meters. Similarly, in [89,90] load energy is predicted from different meters, using an approach based on MTL multiple kernel learning. Also, in [91], combination-based MTL SVMs are applied to the forecasting of pollution, in particular of PM2.5 particulate pollution levels, and the tasks are the predictions at different stations. In [92] combination-based MTL SVMs are applied to predicting solar and wind energy production. In the solar case, defining month-based and time-based tasks yields good results, beating both a single control model and task-specific ones; on the other hand, models for wind energy forecasting based on tasks defined by wind angles or velocities are competitive, but control or individual task models still have an edge.

An interesting example in another application area is [93], where a survival learning method is developed under a multi-task approach: the binary classification in each interval is interpreted as a different task, and the parameters are coupled through regularization. Furthermore, a multiple kernel based MTL approach is used in [94] for predicting mood or stress levels, in [95] for personalized pain recognition from functional near-infrared spectroscopy brain signals, and in [96] for the discrimination of early and late-stage tumors. In any case, and to the best of our knowledge, the number of applications of MTL kernel methods in real world problems seems to be smaller than is the case with NN-based methods, particularly given the current pre-eminence of big data problems. However, it is our belief that, for small to medium size datasets, kernel-based MTL methods offer a rich alternative to obtain very competitive models, just as is the case, for instance, with standard SVMs, which often are hard to beat on such datasets.
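To make the combination-based coupling discussed in this section more concrete, the following minimal sketch, which is ours and only illustrates the general idea behind formulations in the spirit of [12,13] rather than their exact parametrization, trains a single kernel SVR over (x, task) pairs with a kernel that adds a term shared by all tasks to a task-specific one; the coupling constant nu, the RBF base kernel and the toy data are illustrative choices, not those of the cited papers.

```python
# Sketch of combination-based MTL with a single kernel machine:
# K((x, s), (z, t)) = (nu + 1[s == t]) * k(x, z), where the first term acts as a
# common part trained on all the data and the second as a task-specific part.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

def multitask_kernel(X1, t1, X2, t2, nu=1.0, gamma=0.1):
    """Gram matrix of the combination-style multi-task kernel."""
    base = rbf_kernel(X1, X2, gamma=gamma)        # shared by all tasks
    same_task = (t1[:, None] == t2[None, :])      # task-specific coupling
    return (nu + same_task.astype(float)) * base

# Toy data: 3 tasks, 60 points, 5 features, with a task-dependent shift.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
tasks = rng.integers(0, 3, size=60)
y = X @ rng.normal(size=5) + 0.1 * tasks

K_train = multitask_kernel(X, tasks, X, tasks)
model = SVR(kernel="precomputed", C=10.0).fit(K_train, y)

X_new, t_new = rng.normal(size=(5, 5)), rng.integers(0, 3, size=5)
K_new = multitask_kernel(X_new, t_new, X, tasks)
print(model.predict(K_new))
```

Large values of nu essentially pool all tasks into a single model, while nu close to 0 approaches independent per-task models; tuning this trade-off is what lets the common part help tasks with very few data, as in the school problem discussed above.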
7. Conclusions

It is clear that Deep Neural Networks are currently the research and application area of Machine Learning receiving the widest attention; as such, it is no surprise that DNNs also possibly dominate current research and applications in Multi-Task Learning. However, other ML models can be competitive and sometimes even better in particular problems, as may be the case, for instance, with kernel methods on small to medium size problems and, accordingly, there is also a substantial literature on kernel-based MTL methods.

Starting with that observation, in this paper we have given a general overview of MTL with kernel methods. First, we have adopted in Section 2 a fairly general point of view to introduce MTL and discuss different MTL problem types, which we have followed by the proposal of a three-category taxonomy to organize our review of kernel MTL. This review has been done in Section 4, where we have considered a large selection of MTL papers having SVMs or GPs as their underlying ML tool; while we have divided this review into three different subsections according to our taxonomy, we have also discussed, when possible, connections between them.

It is well known that the contributions to a general theory of learning by V. Vapnik and others are the foundations on which methods such as SVMs are based and, to some extent, the same could be said of kernel-based MTL. Because of this, and for a deeper understanding of the similar mechanisms that underlie MTL kernel methods, we also present in Section 5.1 a descriptive summary of the Learning to Learn theory of Baxter, Ben-David and others, without delving too deeply into its mathematical framework, which we follow by a description in Section 5.2 of the Learning Under Privileged Information, or LUPI, paradigm of Vapnik.

One consequence of the wide application of MTL methods to many different areas is the relative lack of a collection of standard MTL problems shared by the research community, and this is also the case in kernel MTL. Even with this caveat in mind, we have selected a few MTL problems that have received more sustained attention from the kernel MTL community, and report some scores for them that appear in the literature. This is complemented in Section 6 with a discussion of several examples of the application of these methods to real world problems.

As a final reflection, in this review we have striven to collect and present a substantial and representative part of the rich literature on kernel-based MTL methods, where we have tried to put an emphasis on recent papers. As said above, it is possibly true that DNN-based methods lead the current research on MTL. However, it is also true that, at least in our experience, kernel methods such as SVMs or GPs often have an advantage on supervised ML problems of small to medium size, and we believe that the same can be true for kernel MTL methods. We hope that this review helps such methods gain visibility, and retain and even increase the attention they deserve from both theoretical and practical points of view.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

Acknowledgments

The authors gratefully acknowledge financial support from the European Regional Development Fund and the Spanish State Research Agency of the Ministry of Economy, Industry, and Competitiveness under the project PID2019-106827GB-I00. They also thank the support of the UAM–ADIC Chair for Data Science and Machine Learning.

References

[1] R. Caruana, Multitask learning, Mach. Learn. 28 (1) (1997) 41–75.
[2] J. Baxter, A model of inductive bias learning, J. Artificial Intelligence Res. 12 (2000) 149–198.
[3] I. Misra, A. Shrivastava, A. Gupta, M. Hebert, Cross-stitch networks for multi-task learning, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, IEEE Computer Society, 2016, pp. 3994–4003.
[4] S. Ruder, J. Bingel, I. Augenstein, A. Søgaard, Sluice networks: Learning what to share between loosely related tasks, 2017, CoRR abs/1705.08142.
[5] S. Ruder, An overview of multi-task learning in deep neural networks, 2017, CoRR abs/1706.05098.
[6] Y. Zhang, Q. Yang, A survey on multi-task learning, IEEE Trans. Knowl. Data Eng. 34 (12) (2022) 5586–5609.
[7] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: B. Schölkopf, J.C. Platt, T. Hofmann (Eds.), Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, MIT Press, 2006, pp. 41–48.
[8] A. Agarwal, H. Daumé, S. Gerber, Learning multiple tasks using manifold regularization, in: J.D. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, A. Culotta (Eds.), Advances in Neural Information Processing Systems 23: 24th Annual Conference on Neural Information Processing Systems 2010, Proceedings of a Meeting Held 6-9 December 2010, Vancouver, British Columbia, Canada, Curran Associates, Inc., 2010, pp. 46–54.
[9] A. Kumar, H. Daumé, Learning task grouping and overlap in multi-task learning, in: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, Omni Press, 2012.
[10] R.K. Ando, T. Zhang, A framework for learning predictive structures from multiple tasks and unlabeled data, J. Mach. Learn. Res. 6 (2005) 1817–1853.
[11] J. Chen, L. Tang, J. Liu, J. Ye, A convex formulation for learning shared structures from multiple tasks, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, in: ACM International Conference Proceeding Series, vol. 382, ACM, 2009, pp. 137–144.
[12] T. Evgeniou, C.A. Micchelli, M. Pontil, Learning multiple tasks with kernel methods, J. Mach. Learn. Res. 6 (2005) 615–637.
[13] T. Evgeniou, M. Pontil, Regularized multi-task learning, in: W. Kim, R. Kohavi, J. Gehrke, W. DuMouchel (Eds.), Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, Washington, USA, August 22-25, 2004, ACM, 2004, pp. 109–117.
[14] S. Xu, X. An, X. Qiao, L. Zhu, Multi-task least-squares support vector machines, Multimedia Tools Appl. 71 (2) (2014) 699–715.
[15] C. Ruiz, C.M. Alaíz, J.R. Dorronsoro, Convex formulation for multi-task L1-, L2-, and LS-SVMs, Neurocomputing 456 (2021) 599–608.
[16] F. Cai, V. Cherkassky, SVM+ regression and multi-task learning, in: International Joint Conference on Neural Networks, IJCNN 2009, Atlanta, Georgia, USA, 14-19 June 2009, IEEE Computer Society, 2009, pp. 418–424.
[17] C. Ruiz, C.M. Alaíz, J.R. Dorronsoro, Convex multi-task learning with neural networks, in: Hybrid Artificial Intelligent Systems - 17th International Conference, HAIS 2022, Salamanca, Spain, September 5-7, 2022, Proceedings, in: Lecture Notes in Computer Science, vol. 13469, Springer, 2022, pp. 223–235.
[18] B. Schölkopf, A.J. Smola, Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, in: Adaptive Computation and Machine Learning Series, MIT Press, 2002.
[19] V. Vapnik, The Nature of Statistical Learning Theory, in: Statistics for Engineering and Information Science, Springer, 2000.
[20] C. Lin, On the convergence of the decomposition method for support vector machines, IEEE Trans. Neural Netw. 12 (6) (2001) 1288–1298.
[21] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy, Improvements to Platt's SMO algorithm for SVM classifier design, Neural Comput. 13 (3) (2001) 637–649.
[22] M. Kanagawa, P. Hennig, D. Sejdinovic, B.K. Sriperumbudur, Gaussian processes and kernel methods: A review on connections and equivalences, 2018, CoRR abs/1807.02582.
[23] A. Argyriou, T. Evgeniou, M. Pontil, Convex multi-task feature learning, Mach. Learn. 73 (3) (2008) 243–272.
[24] B. Schölkopf, R. Herbrich, A.J. Smola, A generalized representer theorem, in: Computational Learning Theory, 14th Annual Conference on Computational Learning Theory, COLT 2001 and 5th European Conference on Computational Learning Theory, EuroCOLT 2001, Amsterdam, The Netherlands, July 16-19, 2001, Proceedings, in: Lecture Notes in Computer Science, vol. 2111, Springer, 2001, pp. 416–426.
[25] A. Argyriou, C.A. Micchelli, M. Pontil, Y. Ying, A spectral regularization framework for multi-task structure learning, in: J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, Curran Associates, Inc., 2007, pp. 25–32.

[26] A. Maurer, M. Pontil, B. Romera-Paredes, Sparse coding for multitask and transfer learning, in: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, in: JMLR Workshop and Conference Proceedings, vol. 28, JMLR.org, 2013, pp. 343–351.
[27] A. Maurer, M. Pontil, K-dimensional coding schemes in Hilbert spaces, IEEE Trans. Inf. Theory 56 (11) (2010) 5839–5846.
[28] Z. Kang, K. Grauman, F. Sha, Learning with whom to share in multi-task feature learning, in: L. Getoor, T. Scheffer (Eds.), Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, Omni Press, 2011, pp. 521–528.
[29] A. Caponnetto, C.A. Micchelli, M. Pontil, Y. Ying, Universal multi-task kernels, J. Mach. Learn. Res. 9 (2008) 1615–1646, http://dx.doi.org/10.5555/1390681.1442785.
[30] M.A. Álvarez, L. Rosasco, N.D. Lawrence, Kernels for vector-valued functions: A review, Found. Trends Mach. Learn. 4 (3) (2012) 195–266.
[31] R. Lin, G. Song, H. Zhang, Multi-task learning in vector-valued reproducing kernel Banach spaces with the l1 norm, J. Complexity 63 (2021) 101514.
[32] A. Scampicchio, M. Bisiacco, G. Pillonetto, Kernel-based learning of orthogonal functions, Neurocomputing 545 (2023) 126237.
[33] Y. Zhang, D. Yeung, A convex formulation for learning task relationships in multi-task learning, in: P. Grünwald, P. Spirtes (Eds.), UAI 2010, Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, Catalina Island, CA, USA, July 8-11, 2010, AUAI Press, 2010, pp. 733–742.
[34] Y. Zhang, D. Yeung, A regularization approach to learning task relationships in multitask learning, ACM Trans. Knowl. Discov. Data 8 (3) (2013) 12:1–12:31.
[35] A. Argyriou, S. Clémençon, R. Zhang, Learning the graph of relations among multiple tasks, 2013, Le Centre pour la Communication Scientifique Directe - HAL - Diderot.
[36] F. Dinuzzo, Learning output kernels for multi-task problems, Neurocomputing 118 (2013) 119–126.
[37] P. Jawanpuria, M. Lapin, M. Hein, B. Schiele, Efficient output kernel learning for multiple tasks, in: C. Cortes, N.D. Lawrence, D.D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, 2015, pp. 1189–1197.
[38] Y. Zhang, S. Ying, Z. Wen, Multitask transfer learning with kernel representation, Neural Comput. Appl. 34 (15) (2022) 12709–12721.
[39] N.D. Lawrence, J.C. Platt, Learning to learn with the informative vector machine, in: C.E. Brodley (Ed.), Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.
[40] E.V. Bonilla, F.V. Agakov, C.K.I. Williams, Kernel multi-task learning using task-specific features, in: M. Meila, X. Shen (Eds.), Proceedings of the Eleventh International Conference on Artificial Intelligence and Statistics, AISTATS 2007, San Juan, Puerto Rico, March 21-24, 2007, in: JMLR Proceedings, vol. 2, JMLR.org, 2007, pp. 43–50.
[41] E.V. Bonilla, K.M.A. Chai, C.K.I. Williams, Multi-task Gaussian process prediction, in: J.C. Platt, D. Koller, Y. Singer, S.T. Roweis (Eds.), Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007, Curran Associates, Inc., 2007, pp. 153–160.
[42] D. Hernández-Lobato, J.M. Hernández-Lobato, Learning feature selection dependencies in multi-task learning, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Proceedings of a Meeting Held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 746–754.
[43] D. Hernández-Lobato, J.M. Hernández-Lobato, Z. Ghahramani, A probabilistic model for dirty multi-task feature selection, in: Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, in: JMLR Workshop and Conference Proceedings, vol. 37, JMLR.org, 2015, pp. 1073–1082.
[44] P. Jawanpuria, J.S. Nath, Multi-task multiple kernel learning, in: Proceedings of the Eleventh SIAM International Conference on Data Mining, SDM 2011, April 28-30, 2011, Mesa, Arizona, USA, SIAM/Omni Press, 2011, pp. 828–838.
[45] P. Jawanpuria, J.S. Nath, A convex feature learning formulation for latent task structure discovery, in: Proceedings of the 29th International Conference on Machine Learning, ICML 2012, Edinburgh, Scotland, UK, June 26 - July 1, 2012, Omni Press, 2012.
[46] K. Murugesan, J.G. Carbonell, Multi-task multiple kernel relationship learning, in: N.V. Chawla, W. Wang (Eds.), Proceedings of the 2017 SIAM International Conference on Data Mining, Houston, Texas, USA, April 27-29, 2017, SIAM, 2017, pp. 687–695.
[47] M. Kandemir, A. Vetek, M. Gönen, A. Klami, S. Kaski, Multi-task and multi-view learning of user state, Neurocomputing 139 (2014) 97–106.
[48] E. Marcelli, R.D. Leone, Multi-kernel covariance terms in multi-output support vector machines, in: Machine Learning, Optimization, and Data Science - 6th International Conference, LOD 2020, Siena, Italy, July 19-23, 2020, Revised Selected Papers, Part II, in: Lecture Notes in Computer Science, vol. 12566, Springer, 2020, pp. 1–11.
[49] L. Liang, V. Cherkassky, Connection between SVM+ and multi-task learning, in: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, Part of the IEEE World Congress on Computational Intelligence, WCCI 2008, Hong Kong, China, June 1-6, 2008, IEEE, 2008, pp. 2048–2054.
[50] F. Cai, V. Cherkassky, Generalized SMO algorithm for SVM-based multitask learning, IEEE Trans. Neural Netw. Learn. Syst. 23 (6) (2012) 997–1003.
[51] V. Vapnik, A. Vashist, A new learning paradigm: Learning using privileged information, Neural Netw. 22 (5–6) (2009) 544–557.
[52] V. Vapnik, R. Izmailov, Learning using privileged information: Similarity control and knowledge transfer, J. Mach. Learn. Res. 16 (2015) 2023–2049.
[53] X. He, G. Mourot, D. Maquin, J. Ragot, P. Beauseroy, A. Smolarz, E. Grall-Maës, Multi-task learning with one-class SVM, Neurocomputing 133 (2014) 416–426.
[54] X. Liang, L. Zhu, D. Huang, Multi-task ranking SVM for image cosegmentation, Neurocomputing 247 (2017) 126–136.
[55] B. Mei, Y. Xu, Multi-task 𝜈-twin support vector machines, Neural Comput. Appl. 32 (15) (2020) 11329–11342.
[56] B. Mei, Y. Xu, Multi-task least squares twin support vector machine for classification, Neurocomputing 338 (2019) 26–33.
[57] L. Lu, Q. Lin, H. Pei, P. Zhong, The aLS-SVM based multi-task learning classifiers, Appl. Intell. 48 (8) (2018) 2393–2407.
[58] Y. Zhang, J. Yu, X. Dong, P. Zhong, Multi-task support vector machine with pinball loss, Eng. Appl. Artif. Intell. 106 (2021) 104458.
[59] Z. Liu, Y. Xu, Multi-task nonparallel support vector machine for classification, Appl. Soft Comput. 124 (2022) 109051.
[60] Y. Xiao, Z. Chang, B. Liu, An efficient active learning method for multi-task learning, Knowl.-Based Syst. 190 (2020) 105137.
[61] L. Oneto, M. Donini, A. Elders, M. Pontil, Taking advantage of multitask learning for fair classification, in: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, AIES 2019, Honolulu, HI, USA, January 27-28, 2019, ACM, 2019, pp. 227–237.
[62] C. Ruiz, C.M. Alaíz, J.R. Dorronsoro, A convex formulation of SVM-based multi-task learning, in: HAIS 2019, in: Lecture Notes in Computer Science, vol. 11734, Springer, 2019, pp. 404–415.
[63] C. Ruiz, C.M. Alaíz, J.R. Dorronsoro, Convex graph Laplacian multi-task learning SVM, in: Artificial Neural Networks and Machine Learning - ICANN 2020 - 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 15-18, 2020, Proceedings, Part II, in: Lecture Notes in Computer Science, vol. 12397, Springer, 2020, pp. 142–154.
[64] G. Li, S.C.H. Hoi, K. Chang, W. Liu, R.C. Jain, Collaborative online multitask learning, IEEE Trans. Knowl. Data Eng. 26 (8) (2014) 1866–1876.
[65] A. Aravindh, S.S. Shiju, S. Sumitra, Kernel collaborative online algorithms for multi-task learning, Ann. Math. Artif. Intell. 86 (4) (2019) 269–286.
[66] G. Li, P. Zhao, T. Mei, P. Yang, Y. Shen, K. Chang, S.C.H. Hoi, Collaborative online ranking algorithms for multitask learning, Knowl. Inf. Syst. 62 (6) (2020) 2327–2348.
[67] A. Leroy, P. Latouche, B. Guedj, S. Gey, MAGMA: inference and prediction using multi-task Gaussian processes with common mean, Mach. Learn. 111 (5) (2022) 1821–1849.
[68] A. Pentina, S. Ben-David, Multi-task and lifelong learning of kernels, in: K. Chaudhuri, C. Gentile, S. Zilles (Eds.), Algorithmic Learning Theory - 26th International Conference, ALT 2015, Banff, AB, Canada, October 4-6, 2015, Proceedings, in: Lecture Notes in Computer Science, vol. 9355, Springer, 2015, pp. 194–208.
[69] Y. Xu, X. Li, D. Chen, H. Li, Learning rates of regularized regression with multiple Gaussian kernels for multi-task learning, IEEE Trans. Neural Netw. Learn. Syst. 29 (11) (2018) 5408–5418.
[70] J. Gui, H. Zhang, Learning rates for multi-task regularization networks, Neurocomputing 466 (2021) 243–251.
[71] S. Ben-David, R. Schuller, Exploiting task relatedness for multiple task learning, in: Computational Learning Theory and Kernel Machines, 16th Annual Conference on Computational Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003, Proceedings, in: Lecture Notes in Computer Science, vol. 2777, Springer, 2003, pp. 567–580.
[72] S. Ben-David, R.S. Borbely, A notion of task relatedness yielding provable multiple-task learning guarantees, Mach. Learn. 73 (3) (2008) 273–287.
[73] P.L. Bartlett, S. Mendelson, Rademacher and Gaussian complexities: Risk bounds and structural results, J. Mach. Learn. Res. 3 (2002) 463–482.
[74] G. Cavallanti, N. Cesa-Bianchi, C. Gentile, Linear algorithms for online multitask classification, J. Mach. Learn. Res. 11 (2010) 2901–2934.
[75] A. Maurer, Bounds for linear multi-task learning, J. Mach. Learn. Res. 7 (2006) 117–139.
[76] A. Maurer, The Rademacher complexity of linear transformation classes, in: Learning Theory, 19th Annual Conference on Learning Theory, COLT 2006, Pittsburgh, PA, USA, June 22-25, 2006, Proceedings, in: Lecture Notes in Computer Science, vol. 4005, Springer, 2006, pp. 65–78.
[77] A. Maurer, M. Pontil, B. Romera-Paredes, The benefit of multitask representation learning, J. Mach. Learn. Res. 17 (2016) 81:1–81:32.
[78] A.W. van der Vaart, J.A. Wellner, Weak Convergence, Springer New York, New York, NY, 1996.
[79] V. Vapnik, The Nature of Statistical Learning Theory, Springer Science & Business Media, 2013.

[80] T. Jebara, Multitask sparsity via maximum entropy discrimination, J. Mach. Learn. Res. 12 (2011) 75–110.
[81] T. Jebara, Multi-task feature and kernel selection for SVMs, in: Machine Learning, Proceedings of the Twenty-First International Conference (ICML 2004), Banff, Alberta, Canada, July 4-8, 2004, in: ACM International Conference Proceeding Series, vol. 69, ACM, 2004.
[82] X. Wang, Y. Zhang, X. Ren, Y. Zhang, M. Zitnik, J. Shang, C.P. Langlotz, J. Han, Cross-type biomedical named entity recognition with deep multi-task learning, Bioinformatics 35 (10) (2019) 1745–1752.
[83] K. Clark, M. Luong, U. Khandelwal, C.D. Manning, Q.V. Le, BAM! Born-again multi-task networks for natural language understanding, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 5931–5937.
[84] H.H. Nguyen, F. Fang, J. Yamagishi, I. Echizen, Multi-task learning for detecting and segmenting manipulated facial images and videos, in: 10th IEEE International Conference on Biometrics Theory, Applications and Systems, BTAS 2019, Tampa, FL, USA, September 23-26, 2019, IEEE, 2019, pp. 1–8.
[85] Y. Hu, M. Li, Q. Lu, H. Weng, J. Wang, S.M. Zekavat, Z. Yu, B. Li, J. Gu, S. Muchnik, Y. Shi, B.W. Kunkle, S. Mukherjee, P. Natarajan, A. Naj, A. Kuzma, Y. Zhao, P.K. Crane, H. Lu, H. Zhao, A statistical framework for cross-tissue transcriptome-wide association analysis, Nature Genet. 51 (3) (2019) 568–576.
[86] R. Hu, J. Gan, X. Zhu, T. Liu, X. Shi, Multi-task multi-modality SVM for early COVID-19 diagnosis using chest CT data, Inf. Process. Manag. 59 (1) (2022) 102782.
[87] H. Zhang, Q. Wu, F. Li, Application of online multitask learning based on least squares support vector regression in the financial market, Appl. Soft Comput. 121 (2022) 108754.
[88] J. Fiot, F. Dinuzzo, Electricity demand forecasting by multi-task learning, IEEE Trans. Smart Grid 9 (2) (2018) 544–551.
[89] D. Wu, B. Wang, D. Precup, B. Boulet, Boosting based multiple kernel learning and transfer regression for electricity load forecasting, in: Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2017, Skopje, Macedonia, September 18-22, 2017, Proceedings, Part III, in: Lecture Notes in Computer Science, vol. 10536, Springer, 2017, pp. 39–51.
[90] D. Wu, B. Wang, D. Precup, B. Boulet, Multiple kernel learning-based transfer regression for electric load forecasting, IEEE Trans. Smart Grid 11 (2) (2020) 1183–1192.
[91] Y. Zhou, F.-J. Chang, L.-C. Chang, I.-F. Kao, Y.-S. Wang, C.-C. Kang, Multi-output support vector machine for regional multi-step-ahead PM2.5 forecasting, Sci. Total Environ. 651 (2019) 230–240.
[92] C. Ruiz, C.M. Alaíz, J.R. Dorronsoro, Multitask support vector regression for solar and wind energy prediction, Energies 13 (23) (2020).
[93] Z. Meng, J. Xu, Z. Li, Y. Wang, F. Chen, Z. Wang, A multi-task kernel learning algorithm for survival analysis, in: Advances in Knowledge Discovery and Data Mining - 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11-14, 2021, Proceedings, Part III, in: Lecture Notes in Computer Science, vol. 12714, Springer, 2021, pp. 298–311.
[94] S. Taylor, N. Jaques, E. Nosakhare, A. Sano, R.W. Picard, Personalized multitask learning for predicting tomorrow's mood, stress, and health, IEEE Trans. Affect. Comput. 11 (2) (2020) 200–213.
[95] D.L. Martinez, K. Peng, S.C. Steele, A.J. Lee, D. Borsook, R.W. Picard, Multi-task multiple kernel machines for personalized pain recognition from functional near-infrared spectroscopy brain signals, in: 24th International Conference on Pattern Recognition, ICPR 2018, Beijing, China, August 20-24, 2018, IEEE Computer Society, 2018, pp. 2320–2325.
[96] A. Rahimi, M. Gönen, Efficient multitask multiple kernel learning with application to cancer research, IEEE Trans. Cybern. 52 (9) (2022) 8716–8728.
