0% found this document useful (0 votes)
22 views20 pages

2022 - Multi-Task Learning For Dense Prediction Tasks - A Survey - Vandenhende Et Al - IEEE Transactions On Pattern Analysis and Machine Intelligence

This survey explores multi-task learning (MTL) techniques for dense prediction tasks in computer vision, highlighting the advantages of joint learning through shared representations. It categorizes deep MTL architectures into encoder- and decoder-focused models, discusses various optimization methods, and provides an extensive experimental evaluation of different approaches. The paper aims to unify fragmented literature on MTL and offers insights into the performance of various methods across dense prediction benchmarks.

Uploaded by

yangkunkuo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views20 pages

2022 - Multi-Task Learning For Dense Prediction Tasks - A Survey - Vandenhende Et Al - IEEE Transactions On Pattern Analysis and Machine Intelligence

This survey explores multi-task learning (MTL) techniques for dense prediction tasks in computer vision, highlighting the advantages of joint learning through shared representations. It categorizes deep MTL architectures into encoder- and decoder-focused models, discusses various optimization methods, and provides an extensive experimental evaluation of different approaches. The paper aims to unify fragmented literature on MTL and offers insights into the performance of various methods across dense prediction benchmarks.

Uploaded by

yangkunkuo
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 20

1

Multi-Task Learning for Dense Prediction Tasks:


A Survey
Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke,
Marc Proesmans, Dengxin Dai and Luc Van Gool

Abstract—With the advent of deep learning, many dense prediction tasks, i.e. tasks that produce pixel-level predictions, have seen
significant performance improvements. The typical approach is to learn these tasks in isolation, that is, a separate neural network is
trained for each individual task. Yet, recent multi-task learning (MTL) techniques have shown promising results w.r.t. performance,
computations and/or memory footprint, by jointly tackling multiple tasks through a learned shared representation. In this survey, we
provide a well-rounded view on state-of-the-art deep learning approaches for MTL in computer vision, explicitly emphasizing on dense
prediction tasks. Our contributions concern the following. First, we consider MTL from a network architecture point-of-view. We include
an extensive overview and discuss the advantages/disadvantages of recent popular MTL models. Second, we examine various
optimization methods to tackle the joint learning of multiple tasks. We summarize the qualitative elements of these works and explore
their commonalities and differences. Finally, we provide an extensive experimental evaluation across a variety of dense prediction
benchmarks to examine the pros and cons of the different methods, including both architectural and optimization based strategies.

Index Terms—Multi-Task Learning, Dense Prediction Tasks, Pixel-Level Tasks, Optimization, Convolutional Neural Networks.

1 I NTRODUCTION

O VER the last decade, neural networks have shown


impressive results for a multitude of tasks, such as
semantic segmentation [1], instance segmentation [2] and
input can infer all desired task outputs.
Multi-Task Learning (MTL) [29] aims to improve such
generalization by leveraging domain-specific information
monocular depth estimation [3]. Traditionally, these tasks contained in the training signals of related tasks. In the deep
are tackled in isolation, i.e. a separate neural network is learning era, MTL translates to designing networks capable
trained for each task. Yet, many real-world problems are of learning shared representations from multi-task supervi-
inherently multi-modal. For example, an autonomous car sory signals. Compared to the single-task case, where each
should be able to segment the lane markings, detect all individual task is solved separately by its own network,
instances in the scene, estimate their distance and trajectory, such multi-task networks bring several advantages to the
etc., in order to safely navigate itself in its surroundings. table. First, due to their inherent layer sharing, the resulting
Similarly, an intelligent advertisement system should be memory footprint is substantially reduced. Second, as they
able to detect the presence of people in its viewpoint, explicitly avoid to repeatedly calculate the features in the
understand their gender and age group, analyze their ap- shared layers, once for every task, they show increased
pearance, track where they are looking at, etc., in order to inference speeds. Most importantly, they have the potential
provide personalized content. At the same time, humans for improved performance if the associated tasks share
are remarkably good at solving many tasks concurrently. complementary information, or act as a regularizer for one
Biological data processing appears to follow a multi-tasking another.
strategy too: instead of separating tasks and tackling them Scope. In this survey, we study deep learning approaches
in isolation, different processes seem to share the same early for MTL in computer vision. We refer the interested reader
processing layers in the brain (see V1 in macaques [4]). The to [30] for an overview of MTL in other application domains,
aforementioned observations have motivated researchers to such as natural language processing [31], speech recog-
develop generalized deep learning models that given an nition [32], bioinformatics [33], etc. Most importantly, we
emphasize on solving multiple pixel-level or dense prediction
tasks, rather than multiple image-level classification tasks, a
• Simon Vandenhende, Wouter Van Gansbeke and Marc
Proesmans are with the Center for Processing Speech and case that has been mostly under-explored in MTL. Tackling
Images, Department Electrical Engineering, KU Leuven. E-mail: multiple dense prediction tasks differs in several aspects
{simon.vandenhende,wouter.vangansbeke,marc.proesmans}@kuleuven.be from solving multiple classification tasks. First, as jointly
• Stamatios Georgoulis and Dengxin Dai are with the Computer
learning multiple dense prediction tasks is governed by the
Vision Lab, Department Electrical Engineering, ETH Zurich. E-mail: use of different loss functions, unlike classification tasks
{georgous,daid}@ee.ethz.ch that mostly use cross-entropy losses, additional consider-
ation is required to avoid a scenario where some tasks
• Luc Van Gool is with both the Center for Processing Speech and
Images, KU Leuven and the Computer Vision Lab, ETH Zurich. E-mail: overwhelm the others during training. Second, opposed
[email protected] to image-level classification tasks, dense prediction tasks
can not be directly predicted from a shared global image
Manuscript received September X, 2020 representation [34], which renders the network design more
2

Multi-Task Learning Methods

Deep Multi-Task Optimization Strategy


Architectures (Sec. 2) Methods (Sec. 3)

Encoder-Focused Decoder-Focused Other Task Balancing Other


(Sec. 2.2) (Sec. 2.3) (Sec. 2.4) (Sec. 3.1) (Sec. 3.2)

MTL Baseline PAD-Net [13] ASTMT [18] Fixed Adversarial [18], [24], [25]
Cross-Stitch Networks [5] PAP-Net [14] Uncertainty [19] Modulation [26]
Sluice Networks [6] JTRL [15] GradNorm [20] Heuristics [27], [28]
NDDR-CNN [7] MTI-Net [16] DWA [8]
MTAN [8] PSD [17] DTP [21]
Branched MTL [9], [10], [11], [12] Multi-Objective Optim. [22], [23]

Fig. 1: A taxonomy of deep learning approaches for jointly solving multiple dense prediction tasks.

difficult. Third, pixel-level tasks in scene understanding ancing the influence of the tasks when updating the net-
often have similar characteristics [14], and these similarities work’s weights. We consider the majority of task balancing,
can potentially be used to boost the performance under a adversarial and modulation techniques. In Section 4, we
MTL setup. A popular example is semantic segmentation provide an extensive experimental evaluation across differ-
and depth estimation [13]. ent datasets both within the scope of each group of methods
Motivation. The abundant literature on MTL is rather frag- (e.g. encoder-focused approaches) as well as across groups
mented. For example, we identify two main groups of works of methods (e.g. encoder- vs decoder-focused approaches).
on deep multi-task architectures in Section 2 that have been Section 5 discusses the relations of MTL with other fields.
considered largely independent from each other. Moreover, Section 6 concludes the paper.
there is limited agreement on the used evaluation metrics Figure 1 shows a structured overview of the paper. Upon
and benchmarks. This paper aims to provide a more unified publication, our code will be made publicly available to ease
view on the topic. Additionally, we provide a comprehen- the adoption of the reviewed MTL techniques.
sive experimental study where different groups of works are
evaluated in an apples-to-apples comparison. 2 D EEP M ULTI -TASK A RCHITECTURES
Related work. MTL has been the subject of several sur- In this section, we review deep multi-task architectures used
veys [29], [30], [35], [36]. In [29], Caruana showed that in computer vision. First, we give a brief historical overview
MTL can be beneficial as it allows for the acquisition of of MTL approaches, before introducing a novel taxonomy to
inductive bias through the inclusion of related additional categorize different methods. Second, we discuss network
tasks into the training pipeline. The author showcased the designs from different groups of works, and analyze their
use of MTL in artificial neural networks, decision trees and advantages and disadvantages. An experimental compar-
k-nearest neighbors methods, but this study is placed in the ison is also provided later in Section 4. Note that, as a
very early days of neural networks, rendering it outdated detailed presentation of each architecture is beyond the
in the deep learning era. Ruder [35] gave an overview scope of this survey, in each case we refer the reader to the
of recent MTL techniques (e.g. [5], [6], [9], [19]) applied corresponding paper for further details that complement the
in deep neural networks. In the same vein, Zhang and following descriptions.
Yang [30] provided a survey that includes feature learning,
low-rank, task clustering, task relation learning, and decom- 2.1 Historical Overview and Taxonomy
position approaches for MTL. Yet, both works are literature 2.1.1 Non-Deep Learning Methods
review studies without an empirical evaluation or compar- Before the deep learning era, MTL works tried to model
ison of the presented techniques. Finally, Gong et al. [36] the common information among tasks in the hope that
benchmarked several optimization techniques (e.g. [8], [19]) a joint task learning could result in better generalization
across three MTL datasets. Still, the scope of this study is performance. To achieve this, they placed assumptions on
rather limited, and explicitly focuses on the optimization the task parameter space, such as: task parameters should
aspect. Most importantly, all prior studies provide a general lie close to each other w.r.t. some distance metric [37], [38],
overview on MTL without giving specific attention to dense [39], [40], share a common probabilistic prior [41], [42], [43],
prediction tasks that are of utmost importance in computer [44], [45], or reside in a low dimensional subspace [46],
vision. [47], [48] or manifold [49]. These assumptions work well
Paper overview. In the following sections, we provide a when all tasks are related [37], [46], [50], [51], but can lead
well-rounded view on state-of-the-art MTL techniques that to performance degradation if information sharing happens
fall within the defined scope. Section 2 considers different between unrelated tasks. The latter is a known problem in
deep multi-task architectures, categorizing them into two MTL, referred to as negative transfer. To mitigate this prob-
main groups: encoder- and decoder-focused approaches. lem, some of these works opted to cluster tasks into groups
Section 3 surveys various optimization techniques for bal- based on prior beliefs about their similarity or relatedness.
3

Task A Task B Task C Task A Task B Task C


Task A Task B Task C Task A Task B Task C

Task
specific

Task A Task B Task C

Shared

Cross-talk
Shared Encoder Shared Encoder
(Soft/Hard) (Soft/Hard)

(a) Hard parameter sharing (b) Soft parameter sharing (a) Encoder-focused model (b) Decoder-focused model

Fig. 2: Historically multi-task learning using deep neu- Fig. 3: In this work we discriminate between encoder-
ral networks has been subdivided into soft- and hard- and decoder-focused models depending on where the task
parameter sharing schemes. interactions take place.

2.1.2 Soft and Hard Parameter Sharing in Deep Learning 2.1.3 Distilling Task Predictions in Deep Learning
In the context of deep learning, MTL is performed by All works presented in Section 2.1.2 follow a common
learning shared representations from multi-task supervisory pattern: they directly predict all task outputs from the same
signals. Historically, deep multi-task architectures were clas- input in one processing cycle. In contrast, a few recent
sified into hard or soft parameter sharing techniques. In hard works first employed a multi-task network to make initial
parameter sharing, the parameter set is divided into shared task predictions, and then leveraged features from these
and task-specific parameters (see Figure 2a). MTL models initial predictions to further improve each task output – in
using hard parameter sharing typically consist of a shared a one-off or recursive manner. PAD-Net [13] proposed to
encoder that branches out into task-specific heads [19], [20], distill information from the initial task predictions of other
[22], [52], [53]. In soft parameter sharing, each task is assigned tasks, by means of spatial attention, before adding it as
its own set of parameters and a feature sharing mechanism a residual to the task of interest. JTRL [15] opted for se-
handles the cross-task talk (see Figure 2b). We summarize quentially predicting each task, with the intention to utilize
representative works for both groups of works below. information from the past predictions of one task to refine
Hard Parameter Sharing. UberNet [54] was the first hard- the features of another task at each iteration. PAP-Net [14]
parameter sharing model to jointly tackle a large number of extended upon this idea, and used a recursive procedure
low-, mid-, and high-level vision tasks. The model featured to propagate similar cross-task and task-specific patterns
a multi-head design across different network layers and found in the initial task predictions. To do so, they operated
scales. Still, the most characteristic hard parameter shar- on the affinity matrices of the initial predictions, and not
ing design consists of a shared encoder that branches out on the features themselves, as was the case before [13], [15].
into task-specific decoding heads [19], [20], [22], [52], [53]. Zhou et al. [17] refined the use of pixel affinities to distill
Multilinear relationship networks [55] extended this design the information by separating inter- and intra-task patterns
by placing tensor normal priors on the parameter set of from each other. MTI-Net [16] adopted a multi-scale multi-
the fully connected layers. In these works the branching modal distillation procedure to explicitly model the unique
points in the network are determined ad hoc, which can task interactions that happen at each individual scale.
lead to suboptimal task groupings. To alleviate this issue,
several recent works [9], [10], [11], [12] proposed efficient 2.1.4 A New Taxonomy of MTL Approaches
design procedures that automatically decide where to share As explained in Section 2.1.2, multi-task networks have his-
or branch within the network. Similarly, stochastic filter torically been classified into soft or hard parameter sharing
groups [56] re-purposed the convolution kernels in each techniques. However, several recent works took inspiration
layer to support shared or task-specific behaviour. from both groups of works to jointly solve multiple pixel-
Soft Parameter Sharing. Cross-stitch networks [5] intro- level tasks. As a consequence, it is debatable whether the
duced soft-parameter sharing in deep MTL architectures. soft vs hard parameter sharing paradigm should still be
The model uses a linear combination of the activations in used as the main framework for classifying MTL architec-
every layer of the task-specific networks as a means for soft tures. In this survey, we propose an alternative taxonomy
feature fusion. Sluice networks [6] extended this idea by that discriminates between different architectures on the
allowing to learn the selective sharing of layers, subspaces basis of where the task interactions take place, i.e. locations
and skip connections. NDDR-CNN [7] also incorporated in the network where information or features are exchanged
dimensionality reduction techniques into the feature fusion or shared between tasks. The impetus for this framework
layers. Differently, MTAN [8] used an attention mechanism was given in Section 2.1.3. Based on the proposed criterion,
to share a general feature pool amongst the task-specific net- we distinguish between two types of models: encoder-focused
works. A concern with soft parameter sharing approaches and decoder-focused architectures. The encoder-focused ar-
is scalability, as the size of the multi-task network tends to chitectures (see Figure 3a) only share information in the
grow linearly with the number of tasks. encoder, using either hard- or soft-parameter sharing, before
4

is not clear where the cross-stitch units should be inserted


Shared Encoder
Task A
in order to maximize their effectiveness. Sluice networks [6]
extended this work by also supporting the selective sharing

Attention Module

Attention Module

Attention Module
Task A of subspaces and skip connections.
Task B

2.2.2 Neural Discriminative Dimensionality Reduction

Attention Module

Attention Module

Attention Module
Task B

Cross-Stitch Unit / NDDR-Layer


Neural Discriminative Dimensionality Reduction CNNs
(NDDR-CNNs) [7] used a similar architecture with cross-
Fig. 4: The architecture of Fig. 5: The architecture of stitch networks (see Figure 4). However, instead of utilizing
cross-stitch networks [5] and MTAN [8]. Task-specific at- a linear combination to fuse the activations from all single-
NDDR-CNNs [7]. The activa- tention modules select and task networks, a dimensionality reduction mechanism is
tions from all single-task net- refine features from several employed. First, features with the same spatial resolution
works are fused across several layers of a shared encoder. in the single-task networks are concatenated channel-wise.
encoding layers. Different fea- Second, the number of channels is reduced by processing
ture fusion mechanisms were the features with a 1 by 1 convolutional layer, before feeding
used in each case. the result to the next layer. The convolutional layer allows
to fuse activations across all channels. Differently, cross-
stitch networks only allow to fuse activations from channels
decoding each task with an independent task-specific head. that share the same index. The NDDR-CNN behaves as a
Differently, the decoder-focused architectures (see Figure 3b) cross-stitch network when the non-diagonal elements in the
also exchange information during the decoding stage. Fig- weight matrix of the convolutional layer are zero.
ure 1 gives an overview of the proposed taxonomy, listing Due to their similarity with cross-stitch networks,
representative works in each case. NDDR-CNNs are prone to the same problems. First, there
is a scalability concern when dealing with a large number
2.2 Encoder-focused Architectures of tasks. Second, NDDR-CNNs involve additional design
choices, since we need to decide where to include the NDDR
Encoder-focused architectures (see Figure 3a) share the task layers. Finally, both cross-stitch networks and NDDR-CNNs
features in the encoding stage, before they process them only allow to use limited local information (i.e. small re-
with a set of independent task-specific heads. A number ceptive field) when fusing the activations from the different
of works [19], [20], [22], [52], [53] followed an ad hoc single-task networks. We hypothesize that this is subopti-
strategy by sharing an off-the-shelf backbone network in mal because the use of sufficient context is very important
combination with small task-specific heads (see Figure 2a). during encoding – as already shown for the tasks of image
This model relies on the encoder (i.e. backbone network) classification [57] and semantic segmentation [58], [59], [60].
to learn a generic representation of the scene. The features This is backed up by certain decoder-focused architectures
from the encoder are then used by the task-specific heads to in Section 2.3 that overcome the limited receptive field by
get the predictions for every task. While this simple model predicting the tasks at multiple scales and by sharing the
shares the full encoder amongst all tasks, recent works have features repeatedly at every scale.
considered where and how the feature sharing should happen
in the encoder. We discuss such sharing strategies in the 2.2.3 Multi-Task Attention Networks
following sections. Multi-Task Attention Networks (MTAN) [8] used a shared
backbone network in conjunction with task-specific atten-
2.2.1 Cross-Stitch Networks tion modules in the encoder (see Figure 5). The shared back-
Cross-stitch networks [5] shared the activations amongst bone extracts a general pool of features. Then, each task-
all single-task networks in the encoder. Assume we are specific attention module selects features from the general
given two activation maps xA , xB at a particular layer, that pool by applying a soft attention mask. The attention mech-
belong to tasks A and B respectively. A learnable linear anism is implemented using regular convolutional layers
combination of these activation maps is applied, before and a sigmoid non-linearity. Since the attention modules
feeding the transformed result x̃A , x̃B to the next layer in the are small compared to the backbone network, the MTAN
single-task networks. The transformation is parameterized model does not suffer as severely from the scalability issues
by learnable weights α, and can be expressed as that are typically associated with cross-stitch networks and
     NDDR-CNNs. However, similar to the fusion mechanism in
x̃A αAA αAB xA
= . (1) the latter works, the MTAN model can only use limited local
x̃B αBA αBB xB
information to produce the attention mask.
As illustrated in Figure 4, this procedure is repeated at
multiple locations in the encoder. By learning the weights 2.2.4 Branched Multi-Task Learning Networks
α, the network can decide the degree to which the features The models presented in Sections 2.2.1-2.2.3 softly shared
are shared between tasks. In practice, we are required to the features amongst tasks during the encoding stage. Dif-
pre-train the single-task networks, before stitching them ferently, branched multi-task networks followed a hard-
together, in order to maximize the performance. A disadvan- parameter sharing scheme. Before presenting these meth-
tage of cross-stitch networks is that the size of the network ods, consider the following observation: deep neural net-
increases linearly with the number of tasks. Furthermore, it works tend to learn hierarchical image representations [61].
5

The early layers tend to focus on more general low-level


image features, such as edges, corners, etc., while the deeper
layers tend to extract high-level information that is more
task-specific. Motivated by this observation, branched MTL
networks opted to learn similar hierarchical encoding struc-
tures [9], [10], [11], [12]. These ramified networks typically
start with a number of shared layers, after which different Multi-Modal
Backbone Distillation
(groups of) tasks branch out into their own sequence of
layers. In doing so, the different branches gradually become
more task-specific as we move to the deeper layers. This
behaviour aligns well with the hierarchical representations
learned by deep neural nets. However, as the number of
possible network configurations is combinatorially large,
deciding what layers to share and where to branch out Initial Task
becomes cumbersome. Several works have tried to automate Predictions
the procedure of hierarchically clustering the tasks to form
Fig. 6: The architecture in PAD-Net [13]. Features extracted
branched MTL networks given a specific computational
by a backbone network are passed to task-specific heads
budget (e.g. number of parameters, FLOPS). We provide a
to make initial task predictions. The task features from the
summary of existing works below.
different heads are then combined through a distillation unit
Fully-Adaptive Feature Sharing (FAFS) [9] starts from to make the final predictions. Note that, auxiliary tasks can
a network where tasks initially share all layers, and dynam- be used in this framework, i.e. tasks for which only the
ically grows the model in a greedy layer-by-layer fashion initial predictions are generated, but not the final ones.
during training. The task groupings are optimized to sep-
arate dissimilar tasks from each other, while minimizing
network complexity. The task relatedness is based on the 2.3 Decoder-Focused Architectures
probability of concurrently ’simple’ or ’difficult’ examples The encoder-focused architectures in Section 2.2 follow a
across tasks. This strategy assumes that it is preferable common pattern: they directly predict all task outputs from
to solve two tasks in an isolated manner (i.e. different the same input in one processing cycle (i.e. all predictions
branches) when the majority of examples are ’simple’ for are generated once, in parallel or sequentially, and are not
one task, but ’difficult’ for the other. refined afterwards). By doing so, they fail to capture com-
Similar to FAFS, Vandenhende et al. [10] rely on pre- monalities and differences among tasks, that are likely fruit-
computed task relatedness scores to decide the grouping of ful for one another (e.g. depth discontinuities are usually
tasks. In contrast to FAFS, they measure the task relatedness aligned with semantic edges). Arguably, this might be the
based on feature affinity scores, rather than sample diffi- reason for the moderate only performance improvements
culty. The main assumption is that two tasks are strongly achieved by the encoder-focused approaches to MTL (see
related, if their single-task models rely on a similar set of Section 4.3.1). To alleviate this issue, a few recent works
features. An efficient method [62] is used to quantify this first employed a multi-task network to make initial task
property. An advantage over FAFS is that the task groupings predictions, and then leveraged features from these initial
can be determined offline for the whole network, and not predictions in order to further improve each task output –
online in a greedy layer-by-layer fashion [10]. This strategy in an one-off or recursive manner. As these MTL approaches
promotes task groupings that are optimal in a global, rather also share or exchange information during the decoding
than local, sense. Yet, a disadvantage is that the calculation stage, we refer to them as decoder-focused architectures (see
of the task affinity scores requires a set of single-task net- Figure 3b).
works to be pretrained first.
Different from the previous works, Branched Multi-Task 2.3.1 PAD-Net
Architecture Search (BMTAS) [11] and Learning To Branch PAD-Net [13] was one of the first decoder-focused architec-
(LTB) [12] have directly optimized the network topology tures. The model itself is visualized in Figure 6. As can be
without relying on pre-computed task relatedness scores. seen, the input image is first processed by an off-the-shelf
More specifically, they rely on a tree-structured network backbone network. The backbone features are further pro-
design space where the branching points are casted as a cessed by a set of task-specific heads that produce an initial
Gumbel softmax operation. This strategy has the advan- prediction for every task. These initial task predictions add
tage over [9], [10] that the task groupings can be directly deep supervision to the network, but they can also be used
optimized end-to-end for the tasks under consideration. to exchange information between tasks, as will be explained
Moreover, both methods can easily be applied to any set next. The task features in the last layer of the task-specific
of tasks, including both image classification and per-pixel heads contain a per-task feature representation of the scene.
prediction tasks. Similar to [9], [10], a compact network PAD-Net proposed to re-combine them via a multi-modal
topology can be obtained by including a resource-aware distillation unit, whose role is to extract cross-task informa-
loss term. In this case, the computational budget is jointly tion, before producing the final task predictions.
optimized with the multi-task learning objective in an end- PAD-Net performs the multi-modal distillation by
to-end fashion. means of a spatial attention mechanism. Particularly, the
6

Per-Task Pixel
Affinty
CxHxW
HW x HW
Backbone

Adaptive Combination
Diffusion

Per-Task Pixel
Affinity
Backbone p

Diffusion

Step t - 1 Step t Step t + 1


Initial Task
Multi-Modal Distillation
Predictions Fig. 8: The architecture in Joint Task-Recursive Learn-
ing [15]. The features of two tasks are progressively refined
Fig. 7: The architecture in PAP-Net [14]. Features extracted in an intertwined manner based on past states.
by a backbone network are passed to task-specific heads
to make initial task predictions. The task features from the
different heads are used to calculate a per-task pixel affinity mechanism, as done in PAD-Net, might be a suboptimal
matrix. The affinity matrices are adaptively combined and choice. As the optimization still happens at a different
diffused back into the task features space to spread the cross- space, i.e. the task label space, there is no guarantee that
task correlation information across the image. The refined the model will learn the desired task relationships. Instead,
features are used to make the final task predictions. they statistically observed that pixel affinities tend to align
well with common local structures on the task label space.
Motivated by this observation, they proposed to leverage
output features Fko for task k are calculated as pixel affinities in order to perform multi-modal distillation.
X To achieve this, the backbone features are first processed
Fko = Fki + σ Wk,l Fli Fli ,

(2) by a set of task-specific heads to get an initial prediction
l6=k for every task. Second, a per-task pixel affinity matrix MTj
where σ Wk,l Fli returns a spatial attention mask that is ap- is calculated by estimating pixel-wise correlations upon the


plied to the initial task features Fli from task l. The attention task features coming from each head. Third, a cross-task
mask itself is found by applying a convolutional layer Wk,l information matrix M̂Tj for every task Tj is learned by
to extract local information from the initial task features. adaptively combining the affinity matrices MTi for tasks Ti
T
Equation 2 assumes that the task interactions are location with learnable weights αi j
dependent, i.e. tasks are not in a constant relationship across X Tj
the entire image. This can be understood from a simple ex- M̂Tj = αi · MTi . (3)
Ti
ample. Consider two dense prediction tasks, e.g. monocular
depth prediction and semantic segmentation. Depth discon- Finally, the task features coming from each head j are
tinuities and semantic boundaries often coincide. However, refined using the cross-task information matrix M̂Tj . In
when we segment a flat object, e.g. a magazine, from a flat particular, the cross-task information matrix is diffused into
surface, e.g. a table, we will still find a semantic boundary the task features space to spread the correlation information
where the depth map is rather continuous. In this particular across the image. This effectively weakens or strengthens
case, the depth features provide no additional information the pixel correlations for task Tj , based on the pixel affinities
for the localization of the semantic boundaries. The use from other tasks Ti . The refined features are used to make
of spatial attention explicitly allows the network to select the final predictions for every task.
information from other tasks at locations where its useful. All previously discussed methods only use limited local
The encoder-focused approaches in Section 2.2 shared information when fusing features from different tasks. For
features amongst tasks using the intermediate representa- example, cross-stitch networks and NDDR-CNNs combine
tions in the encoder. Differently, PAD-Net models the task the features in a channel-wise fashion, while PAD-Net only
interactions by applying a spatial attention layer to the uses the information from within a 3 by 3 pixels window
features in the task-specific heads. In contrast to the interme- to construct the spatial attention mask. Differently, PAP-
diate feature representations in the encoder, the task features Net also models the non-local relationships through pixel
used by PAD-Net are already disentangled according to the affinities measured across the entire image. Zhou et al. [17]
output task. We hypothesize that this makes it easier for extended this idea to specifically mine and propagate both
other tasks to distill the relevant information. This multi- inter- and intra- task patterns.
step decoding strategy from PAD-Net is applied and refined
in other decoder-focused approaches. 2.3.3 Joint Task-Recursive Learning
Joint Task-Recursive Learning (JTRL) [15] recursively pre-
2.3.2 Pattern-Affinitive Propagation Networks dicts two tasks at increasingly higher scales in order to grad-
Pattern-Affinitive Propagation Networks (PAP-Net) [14] ually refine the results based on past states. The architecture
used an architecture similar to PAD-Net (see Figure 7), but is illustrated in Figure 8. Similarly to PAD-Net and PAP-Net,
the multi-modal distillation in this work is performed in a multi-modal distillation mechanism is used to combine
a different manner. The authors argue that directly work- information from earlier task predictions, through which
ing on the task features space via the spatial attention later predictions are refined. Differently, the JTRL model
7

predicts two tasks sequentially, rather than in parallel, and

Scale
1/32
Distillation
in an intertwined manner. The main disadvantage of this Scale 1/32

approach is that it is not straightforward, or even possible, Feature


Propagation Module
to extent this model to more than two tasks given the

Scale
1/16
Distillation
intertwined manner at which task predictions are refined. Initial Task Predictions
Scale 1/16 Feature
Aggregation
Backbone Feature
2.3.4 Multi-Scale Task Interaction Networks Propagation Module
Feature

Scale
1/8
Distillation
In the decoder-focused architectures presented so far, the Initial Task Predictions
Scale 1/8
Aggregation

multi-modal distillation was performed at a fixed scale, i.e. Feature


the features of the backbone’s last layer. This rests on the Propagation Module

Scale
assumption that all relevant task interactions can solely be

1/4
Distillation
Initial Task Predictions Scale 1/4
modeled through a single filter operation with a specific
receptive field. However, Multi-Scale Task Interaction Net- Multi-Scale
Multi-Modal
works (MTI-Net) [16] showed that this is a rather strict as- Distillation

sumption. In fact, tasks can influence each other differently


at different receptive fields. Fig. 9: The architecture in Multi-Scale Task Interaction Net-
To account for this restriction, MTI-Net explicitly took works [16]. Starting from a backbone that extracts multi-
into account task interactions at multiple scales. Its architec- scale features, initial task predictions are made at each scale.
ture is illustrated in Figure 9. First, an off-the-shelf backbone These task features are then distilled separately at every
network extracts a multi-scale feature representation from scale, allowing the model to capture unique task interactions
the input image. From the multi-scale feature representation at multiple scales, i.e. receptive fields. After distillation, the
an initial prediction for every task is made at each scale. The distilled task features from all scales are aggregated to make
task predictions at a particular scale are found by applying the final task predictions. To boost performance, a feature
a task-specific head to the backbone features extracted at propagation module is included to pass information from
that scale. Similarly to PAD-Net, the features in the last lower resolution task features to higher ones.
layer of the task-specific heads are combined and refined to
make the final predictions. Differently, in MTI-Net the per-
task feature representations can be distilled at each scale
separately. This allows to have multiple task interactions,
each modeled within a specific receptive field. The distilled connectivity of a network’s function blocks through routing.
multi-scale features are upsampled to the highest scale and Piggyback [66] showed how to adapt a single, fixed neural
concatenated, resulting in a final feature representation for network to a multi-task network by learning binary masks.
every task. The final task predictions are found by decoding Huang et al. [67] introduced a method rooted in Neural
these final feature representations in a task-specific manner Architecture Search (NAS) for the automated construction
again. The performance was further improved by also prop- of a tree-based multi-attribute learning network. Stochas-
agating information from the lower-resolution task features tic filter groups [56] re-purposed the convolution kernels
to the higher-resolution ones using a Feature Propagation in each layer of the network to support shared or task-
Module. specific behaviour. In a similar vein, feature partitioning [68]
The experimental evaluation in [16] shows that distilling presented partitioning strategies to assign the convolution
task information at multiple scales increases the multi- kernels in each layer of the network into different tasks. In
tasking performance compared to PAD-Net where such general, these works have a different scope within MTL,
information is only distilled at a single scale. Furthermore, e.g. automate the network architecture design. Moreover,
since MTI-Net distills the features at multiple scales, i.e. they mostly focus on solving multiple (binary) classification
using different pixel dilations, it overcomes the issue of tasks, rather than multiple dense predictions tasks. As a
using only limited local information to fuse the features, result, they fall outside the scope of this survey, with one
which was already shown to be beneficial in PAP-Net. notable exception that is discussed next.

Attentive Single-Tasking of Multiple Tasks (ASTMT) [18]


2.4 Other Approaches proposed to take a ’single-tasking’ route for the MTL prob-
A number of approaches that fall outside the aforemen- lem. That is, within a multi-tasking framework they per-
tioned categories have been proposed in the literature. For formed separate forward passes, one for each task, that ac-
example, multilinear relationship networks [55] used tensor tivate shared responses among all tasks, plus some residual
normal priors to the parameter set of the task-specific heads responses that are task-specific. Furthermore, to suppress
to allow interactions in the decoding stage. Different from the negative transfer issue they applied adversarial training
the standard parallel ordering scheme, where layers are on the gradients level that enforces them to be statistically
aligned and shared (e.g. [5], [7]), soft layer ordering [63] indistinguishable across tasks. An advantage of this ap-
proposed a flexible sharing scheme across tasks and net- proach is that shared and task-specific information within
work depths. Yang et al. [64] generalized matrix factorisa- the network can be naturally disentangled. On the negative
tion approaches to MTL in order to learn cross-task shar- side, however, the tasks can not be predicted altogether, but
ing structures in every layer of the network. Routing net- only one after the other, which significantly increases the
works [65] proposed a principled approach to determine the inference speed and somehow defies the purpose of MTL.
8

3 O PTIMIZATION IN MTL on the network weight update is smaller when the task’s
In the previous section, we discussed about the construction homoscedastic uncertainty is high. This is advantageous
of network architectures that are able to learn multiple tasks when dealing with noisy annotations since the task-specific
concurrently. Still, a significant challenge in MTL stems from weights will be lowered automatically for such tasks.
the optimization procedure itself. In particular, we need to
carefully balance the joint learning of all tasks to avoid a 3.1.2 Gradient Normalization
scenario where one or more tasks have a dominant influence Gradient normalization (GradNorm) [20] proposed to con-
in the network weights. In this section, we discuss several trol the training of multi-task networks by stimulating the
methods that have considered this task balancing problem. task-specific gradients to be of similar magnitude. By doing
so, the network is encouraged to learn all tasks at an equal
pace. Before presenting this approach, we introduce the
3.1 Task Balancing Approaches
necessary notations in the following paragraph.
Without loss of generality, the optimization objective in a We define the L2 norm of the gradient for the weighted
MTL problem, assuming task-specific weights wi and task- single-task loss wi (t) · Li (t) at step t w.r.t. the weights W ,
specific loss functions Li , can be formulated as as GWi (t). We additionally define the following quantities,
X
LM T L = wi · L i . (4) • the mean task gradient ḠW averaged across all task
W W
i
 WGi w.r.t the weights W at step t: Ḡ (t) =
gradients
When using stochastic gradient descent to minimize the Etask Gi (t) ;
objective from Equation 4, which is the standard approach • the inverse training rate L̃i of task i at step t: L̃i (t) =
in the deep learning era, the network weights in the shared Li (t) /Li (0);
layers Wsh are updated by the following rule • the relative inverse training
h i rate of task i at step t:
X ∂Li ri (t) = L̃i (t) /Etask L̃i (t) .
Wsh = Wsh − γ wi . (5)
i
∂Wsh GradNorm aims to balance two properties during the
training of a multi-task network. First, balancing the gra-
From Equation 5 we can draw the following conclusions.
dient magnitudes GW i . To achieve this, the mean gradient
First, the network weight update can be suboptimal when
ḠW is considered as a common basis from which the rel-
the task gradients conflict, or dominated by one task when
ative gradient sizes across tasks can be measured. Second,
its gradient magnitude is much higher w.r.t. the other tasks.
balancing the pace at which different tasks are learned. The
This motivated researchers [8], [19], [20], [21] to balance the
relative inverse training rate ri (t) is used to this end. When
gradient magnitudes by setting the task-specific weights wi
the relative inverse training rate ri (t) increases, the gradient
in the loss. To this end, other works [22], [26], [69] have also
magnitude GW i (t) for task i should increase as well to
considered the influence of the direction of the task gradi-
stimulate the task to train more quickly. GradNorm tackles
ents. Second, each task’s influence on the network weight
both objectives by minimizing the following loss
update can be controlled, either indirectly by adapting the
task-specific weights wi in the loss, or directly by operating GW
i (t) − Ḡ
W
(t) · ri (t) . (7)
∂Li
on the task-specific gradients ∂W sh
. A number of methods
that tried to address these problems are discussed next. Remember that, the gradient magnitude GW i (t) for task i
depends on the weighted single-task loss wi (t) · Li (t). As
3.1.1 Uncertainty Weighting a result, the objective in Equation 7 can be minimized by
Kendall et al. [19] used the homoscedastic uncertainty to adjusting the task-specific weights wi . In practice, during
balance the single-task losses. The homoscedastic uncer- training these task-specific weights are updated in every
tainty or task-dependent uncertainty is not an output of the iteration using backpropagation. After every update, the
model, but a quantity that remains constant for different task-specific weights wi (t) are re-normalized in order to
input examples of the same task. The optimization pro- decouple the learning rate from the task-specific weights.
cedure is carried out to maximise a Gaussian likelihood Note that, calculating the gradient magnitude GW i (t)
objective that accounts for the homoscedastic uncertainty. requires a backward pass through the task-specific layers
In particular, they optimize the model weights W and the of every task i. However, savings on computation time can
noise parameters σ1 , σ2 to minimize the following objective be achieved by considering the task gradient magnitudes
only w.r.t. the weights in the last shared layer.
1 1
L (W, σ1 , σ2 ) = L1 (W ) + 2 L2 (W ) + log σ1 σ2 . (6) Different from uncertainty weighting, GradNorm does
2σ12 2σ2 not take into account the task-dependent uncertainty to re-
The loss functions L1 , L2 belong to the first and second weight the task-specific losses. Rather, GradNorm tries to
task respectively. By minimizing the loss L w.r.t. the noise balance the pace at which tasks are learned, while avoiding
parameters σ1 , σ2 , one can essentially balance the task- gradients of different magnitude.
specific losses during training. The optimization objective in
Equation 6 can easily be extended to account for more than 3.1.3 Dynamic Weight Averaging
two tasks too. The noise parameters are updated through Similarly to GradNorm, Liu et al. [8] proposed a technique,
standard backpropagation during training. termed Dynamic Weight Averaging (DWA), to balance the
Note that, increasing the noise parameter σi reduces pace at which tasks are learned. Differently, DWA only
the weight for task i. Consequently, the effect of task i requires access to the task-specific loss values. This avoids
9

having to perform separate backward passes during train- on a regression task. Depending on the threshold’s value,
ing in order to obtain the task-specific gradients. In DWA, the task-specific weight will be higher or lower during
the task-specific weight wi for task i at step t is set as training. We conclude that the choice of the KPIs in DTP is
not determined in a straightforward manner. Furthermore,
N exp (ri (t − 1) /T ) Ln (t − 1)
wi (t) = P , rn (t − 1) = , similar to DWA, DTP requires to balance the overall magni-
n exp (rn (t − 1) /T ) Ln (t − 2) tude of the loss values beforehand. After all, Equation 9 does
(8)
not take into account the loss magnitudes to calculate the
with N being the number of tasks. The scalars rn (·) estimate
task-specific weights. As a result, DTP still involves manual
the relative descending rate of the task-specific loss values
tuning to set the task-specific weights.
Ln . The temperature T controls the softness of the task
weighting in the softmax operator. When the loss of a task 3.1.5 MTL as Multi-Objective Optimization
decreases at a slower rate compared to other tasks, the task-
A global optimum for the multi-task optimization objective
specific weight in the loss is increased.
in Equation 4 is hard to find. Due to the complex nature
Note that, the task-specific weights wi are solely based
of this problem, a certain choice that improves the perfor-
on the rate at which the task-specific losses change. Such
mance for one task could lead to performance degradation
a strategy requires to balance the overall loss magnitudes
for another task. The task balancing methods discussed
beforehand, else some tasks could still overwhelm the others
beforehand try to tackle this problem by setting the task-
during training. GradNorm avoids this problem by balanc-
specific weights in the loss according to some heuristic.
ing both the training rates and the gradient magnitudes
Differently, Sener and Koltun [22] view MTL as a multi-
through a single objective (see Equation 7).
objective optimization problem, with the overall goal of
finding a Pareto optimal solution among all tasks.
3.1.4 Dynamic Task Prioritization
In MTL, a Pareto optimal solution is found when the
The task balancing techniques in Sections 3.1.1-3.1.3 opted to following condition is satisfied: the loss for any task can be
optimize the task-specific weights wi as part of a Gaussian decreased without increasing the loss on any of the other
likelihood objective [19], or in order to balance the pace at tasks. A multiple gradient descent algorithm (MGDA) [71]
which the different tasks are learned [8], [20]. In contrast, was proposed in [22] to find a Pareto stationary point. In
Dynamic Task Prioritization (DTP) [21] opted to prioritize particular, the shared network weights are updated by find-
the learning of ’difficult’ tasks by assigning them a higher ing a common direction among the task-specific gradients.
task-specific weight. The motivation is that the network As long as there is a common direction along which the
should spend more effort to learn the ’difficult’ tasks. Note task-specific losses can be decreased, we have not reached
that, this is opposed to uncertainty weighting, where a a Pareto optimal point yet. An advantage of this approach
higher weight is assigned to the ’easy’ tasks. We hypothesize is that since the shared network weights are only updated
that the two techniques do not necessarily conflict, but along common directions of the task-specific gradients, con-
uncertainty weighting seems better suited when tasks have flicting gradients are avoided in the weight update step.
noisy labeled data, while DTP makes more sense when we Lin et al. [23] observed that MGDA only finds one
have access to clean ground-truth annotations. out of many Pareto optimal solutions. Moreover, it is not
To measure the task difficulty, one could consider the guaranteed that the obtained solution will satisfy the users’
progress on every task using the loss ratio L̃i (t) defined needs. To address this problem, they generalized MGDA to
by GradNorm. However, since the loss ratio depends on generate a set of well-representative Pareto solutions from
the initial loss Li (0), its value can be rather noisy and which a preferred solution can be selected. So far, however,
initialization dependent. Furthermore, measuring the task the method was only applied to small-scale datasets (e.g.
progress using the loss ratio might not accurately reflect the Multi-MNIST).
progress on a task in terms of qualitative results. Therefore,
DTP proposes the use of key performance indicators (KPIs) 3.1.6 Discussion
to quantify the difficulty of every task. In particular, a KPI In Section 3.1, we described several methods for balanc-
κi is selected for every task i, with 0 < κi < 1. The KPIs ing the influence of each task when training a multi-task
are picked to have an intuitive meaning, e.g. accuracy for network. Table 1 provides a qualitative comparison of the
classification tasks. For regression tasks, the prediction error described methods. We summarize some conclusions below.
can be thresholded to obtain a KPI that lies between 0 and (1) We find discrepancies between these methods, e.g. un-
1. Further, we define a task-level focusing parameter γi ≥ 0 certainty weighting assigns a higher weight to the ’easy’
that allows to adjust the weight at which easy or hard tasks tasks, while DTP advocates the opposite. The latter can
are down-weighted. DTP sets the task-specific weight wi for be attributed to the experimental evaluation of the differ-
task i at step t as ent task balancing strategies that was often done in the
γ
wi (t) = − (1 − κi (t)) i log κi (t) . (9) literature using different datasets or task dictionaries. We
suspect that an appropriate task balancing strategy should
Note that, Equation 9 employs a focal loss expression [70] to be decided for each case individually. (2) We also find
down-weight the task-specific weights for the ’easy’ tasks. commonalities between the described methods, e.g. uncer-
In particular, as the value for the KPI κi increases, the weight tainty weighting, GradNorm and MGDA opted to balance
wi for task i is being reduced. the loss magnitudes as part of their learning strategy. In
DTP requires to carefully select the KPIs. For example, Section 4.4, we provide extensive ablation studies under
consider choosing a threshold to measure the performance more common datasets or task dictionaries to verify what
10

TABLE 1: A qualitative comparison between task balancing techniques. First, we consider whether a method balances the
loss magnitudes (Balance Magnitudes) and/or the pace at which tasks are learned (Balance Learning). Second, we show
what tasks are prioritized during the training stage (Prioritize). Third, we show whether the method requires access to
the task-specific gradients (Grads Required). Fourth, we consider whether the tasks gradients are enforced to be non-
conflicting (Non-Competing Grads). Finally, we consider whether the proposed method requires additional tuning, e.g.
manually choosing the weights, KPIs, etc. (No Extra Tuning). For a detailed description of each technique see Section 3.

Method Balance Balance Prioritize Grads Non-Competing No Extra Motivation


Magnitudes Learning Required Grads Tuning
Uncertainty [19] X Low Noise X Homoscedastic uncertainty
GradNorm [20] X X X X Balance learning and magnitudes
DWA [8] X Balance learning
DTP [21] Difficult Prioritize difficult tasks
MGDA [22] X X X X X Pareto Optimum

task balancing strategies are most useful to improve the evaluation criteria and training setup, so the reader can
multi-tasking performance, and under which circumstances. easily give interpretation to the obtained results. Section 4.2
(3) A number of works (e.g. DWA, DTP) still require careful presents a general overview of the results, allowing us to
manual tuning of the initial hyperparameters, which can identify several overall trends. Section 4.3 compares the
limit their applicability when dealing with a larger number MTL architectures in more detail, while the task balancing
of tasks. strategies are considered in Section 4.4. We refer the reader
to the supplementary materials for qualitative results.
3.2 Other Approaches
The task balancing works in Section 3.1 can be plugged
into most existing multi-task architectures to regulate the 4.1 Experimental Setup
task learning. Another group of works also tried to regulate
4.1.1 Datasets
the training of multi-task networks, albeit for more specific
setups. We touch upon several of these approaches here. Our experiments were conducted on two popular dense-
Note that, some of these concepts can be combined with labeling benchmarks, i.e. NYUD-v2 [72] and PASCAL [73].
task balancing strategies too. We selected the datasets to provide us with a diverse pair
Zhao et al. [26] empirically found that tasks with gra- of settings, allowing us to scrutinize the advantages and
dients pointing in the opposite direction can cause the disadvantages of the methods under consideration. We ad-
destructive interference of the gradient. This observation is ditionally took into account what datasets were used in the
related to the update rule in Equation 5. They proposed to original works. Both datasets are described in more detail
add a modulation unit to the network in order to alleviate below.
the competing gradients issue during training. The PASCAL dataset [73] is a popular benchmark for
Liu et al. [24] considered a specific multi-task architec- dense prediction tasks. We use the split from PASCAL-
ture where the feature space is split into a shared and a task- Context [74] which has annotations for semantic segmen-
specific part. They argue that the shared features should tation, human part segmentation and semantic edge detec-
contain more common information, and no information that tion. Additionally, we consider the tasks of surface normals
is specific to a particular task only. The network weights prediction and saliency detection. The annotations were dis-
are regularized by enforcing this prior. More specifically, an tilled by [18] using pre-trained state-of-the-art models [75],
adversarial approach is used to avoid task-specific features [76]. The optimal dataset F-measure (odsF) [77] is used to
from creeping into the shared representation. Similarly, [18], evaluate the edge detection task. The semantic segmen-
[25] added an adversarial loss to the single-task gradients in tation, saliency estimation and human part segmentation
order to make them statistically indistinguishable from each tasks are evaluated using mean intersection over union
other in the shared parts of the network. (mIoU). We use the mean error (mErr) in the predicted angles
Finally, some works relied on heuristics to balance the to evaluate the surface normals.
tasks. Sanh et al. [27] trained the network by randomly The NYUD-v2 dataset [72] considers indoor scene un-
sampling a single task for the weight update during every derstanding. The dataset contains 795 train and 654 test
iteration. The sampling probabilities were set proportionally images annotated for semantic segmentation and monocular
to the available amount of training data for every task. Raffel depth estimation. Other works have also considered surface
et al. [28] used temperature scaling to balance the tasks. So normal prediction [13], [14], [18] and semantic edge detec-
far, however, both procedures were used in the context of tion [13], [18] on the NYUD-v2 dataset. The annotations for
natural language processing. these tasks can be directly derived from the semantic and
depth ground truth. In this work we focus on the semantic
4 E XPERIMENTS segmentation and depth estimation tasks. We use the mean
This section provides an extensive comparison of the previ- intersection over union (mIoU) and root mean square error
ously discussed methods. First, we describe the experimen- (rmse) to evaluate the semantic segmentation and depth
tal setup in Section 4.1. We cover the used datasets, methods, estimation task respectively.
11

Table 2 gives an overview of the used datasets. We PAD-Net, but operates on the pixel affinities to perform
marked tasks for which the annotations were obtained the multi-modal distillation through a recursive diffusion
through distillation with an asterik. process.
Based on these observations, we organize the experi-
4.1.2 Evaluation Criterion ments as follows. On NYUD-v2, we consider PAD-Net, PAP-
In addition to reporting the performance on every individ- Net and JTRL in combination with a ResNet-50 backbone.
ual task, we include a single-number performance metric for This facilitates an apples-to-apples comparison between the
the multi-task models. Following prior work [10], [11], [16], encoder- and decoder-focused approaches that operate on
[18], we define the multi-task performance of model m as top of a single-scale feature extractor. We draw a separate
the average per-task drop in performance w.r.t. the single- comparison between MTI-Net and PAD-Net on NYUD-v2
task baseline b: using a multi-scale HRNet-18 backbone [79]. Finally, we
repeat the comparison between PAD-Net and MTI-Net on
T
1X l PASCAL using the multi-scale HRNet-18 backbone to verify
∆m = (−1) i (Mm,i − Mb,i ) /Mb,i , (10)
T i=1 how well the decoder-focused approaches handle the larger
and more diverse task dictionary. In addition to the MTL ar-
where li = 1 if a lower value means better performance chitectures, we also compare the task balancing techniques
for metric Mi of task i, and 0 otherwise. The single-task from Section 3.1. We analyze the use of fixed weights,
performance is measured for a fully-converged model that uncertainty weighting [19], GradNorm [20], DWA [8] and
uses the same backbone network only to perform that task. MGDA [22]. We did not include DTP [21], as this technique
To achieve a fair comparison, all results were obtained after requires to define additional key performance indicators
performing a grid search on the hyperparameters. This en- (see Section 3.1.4). The task balancing methods are evaluated
sures that every model is trained with comparable amounts in combination with the MTL baseline models that use a
of finetuning. We refer to Section 4.1.4 for more details on single-scale ResNet backbone. We do not consider how the
our training setup. task balancing techniques interact with other MTL architec-
The MTL performance metric does not account for the tures. This choice stems from the following observations.
variance when different hyperparameters are used. To ad- First, GradNorm and MGDA were specifically designed
dress this, we analyze the influence of the used hyperpa- with the vanilla hard parameter sharing model in mind (i.e.
rameters on NYUD-v2 with performance profiles. Finally, MTL baseline). Second, uncertainty weighting and DWA re-
in addition to a performance evaluation, we also include weigh the tasks based on the task-specific losses. Since the
the model resources, i.e. number of parameters and FLOPS, loss values depend more on the used loss functions, and less
when comparing the multi-task architectures. on the used architectures, we expect these methods to result
in similar task weights when plugging in a different model.
4.1.3 Compared Methods
Table 3 summarizes the models and task balancing strate- 4.1.4 Training Setup
gies used in our experiments. We consider the following We reuse the loss functions and augmentation strategy
encoder-focused architectures from Section 2.2 on NYUD- from [16]. The MTL models are trained with fixed loss
v2 and PASCAL: MTL baseline with shared encoder and weights obtained from [18], that optimized them through
task-specific decoders, cross-stitch networks [5], NDDR- a grid search. All experiments are performed using pre-
CNN [7] and MTAN [8]. We do not include branched MTL trained ImageNet weights. The optimizer, learning rate and
networks, since this collection of works is mainly situated batch size are optimized with a grid search procedure to
in the domain of Neural Architecture Search, and focuses ensure a fair comparison across all compared approaches.
on finding a MTL solution that fits a specific computational More specifically, we test batches of size 6 and 12, and
budget constraint. We refer to the corresponding papers [9], Adam (LR={1e-4, 5e-4}) vs stochastic gradient descent with
[10], [11], [12] for a concrete experimental analysis on this momentum 0.9 (LR={1e-3,5e-3,1e-2,5e-2}). This accounts for
subject. All compared models use a ResNet [78] encoder 12 hyperparameter settings in total (see overview in Table 4).
with dilated convolutions [76]. We use the ResNet-50 vari- A poly learning rate scheduler is used. The number of total
ant for our experiments on NYUD-v2. On PASCAL, we epochs is set to 60 for PASCAL and 100 for NYUD-v2. We
use the shallower ResNet-18 model due to GPU memory include weight decay regularization of 1e-4. Any remaining
constraints. The task-specific heads use an Atrous Spatial hyperparameters are set in accordance with the original
Pyramid Pooling (ASPP) [76] module. works.
Furthermore, we cover the following decoder-focused
approaches from Section 2.3: JTRL [15], PAP-Net [14], PAD- 4.2 Overview
Net [13] and MTI-Net [16]. Note that a direct comparison Table 5 provides an overview of the results on NYUD-v2
between all models is not straightforward. There are sev- and PASCAL. The MTL architectures are shown in Tables 5a
eral reasons behind this. First, MTI-Net operates on top and 5b. A direct comparison can be made between architec-
of a multi-scale feature representation of the input image, tures that rely on the same backbone. The task balancing
which assumes a multi-scale backbone, unlike the other strategies are analyzed in Tables 5c and 5d. We identify
approaches that were originally designed with a single- several trends from the results.
scale backbone network in mind. Second, JTRL was strictly Single-Task vs Multi-Task. We compare the encoder- and
designed for a pair of tasks, without any obvious extension decoder-focused MTL models against their single-task coun-
to the MTL setting. Finally, PAP-Net behaves similarly to terparts on NYUD-v2 and PASCAL in Tables 5a- 5b. MTL
12

TABLE 2: MTL benchmarks used in the experiments section. Distilled task labels are marked with *.

Dataset # Train # Test Segmentation Depth Human Parts Normals Saliency Edges
PASCAL [73] 10,581 1,449 X X X* X* X
NYUD-v2 [72] 795 654 X X

TABLE 3: Overview of our experiments on NYUD-v2 and PASCAL. We indicate the used backbones and optimization
strategies for every model.

Model Datasets Backbone Optimization


NYUD-v2 PASCAL NYUD-v2 PASCAL
MTL Baseline X X ResNet-50/HRNet-18 ResNet-18/HRNet-18 Uniform, Fixed (Grid Search), Uncertainty,
GradNorm, DWA, MGDA
Cross-Stitch X X ResNet-50 ResNet-18 Fixed (Grid Search)
NDDR-CNN X X ResNet-50 ResNet-18 Fixed (Grid Search)
MTAN X X ResNet-50 ResNet-18 Fixed (Grid Search)
JTRL X ResNet-50 - Fixed (Grid Search)
PAP-Net X ResNet-50 - Fixed (Grid Search)
PAD-Net X X ResNet-50/HRNet-18 HRNet-18 Fixed (Grid Search)
MTI-Net X X HRNet-18 HRNet-18 Fixed (Grid Search)

TABLE 4: Overview of the used hyperparameter settings in single-task equivalents on PASCAL (see Table 5b). For ex-
our grid search procedure. ample, the improvements reported by the encoder-focused
approaches are usually limited to a few isolated tasks, while
Optimizer Batch size Learning rate
a decline in performance is observed for the other tasks.
SGD (momentum = 0.9) {6, 12} {1e-3, 5e-3, 1e-2, 5e-2} Jointly tackling a large and diverse task dictionary proves
Adam {6, 12} {1e-4, 5e-4}
challenging, as also noted by [10], [18].
Architecture vs Optimization. The effect of designing a
better MTL architecture is compared against the use of a
can offer several advantages relative to single-task learning,
better task balancing strategy (see Tables 5a-5b vs 5c-5d).
that is, smaller memory footprint, reduced number of calcu-
We find that the use of a better MTL architecture is usually
lations, and improved performance. However, few models
more helpful to improve the performance in MTL. Similar
are able to deliver on this potential to its full extent. For
observations were made by prior works [8], [18].
example, JTRL improves the performance on the segmenta-
tion and depth estimation tasks on NYUD-v2, but requires Encoder- vs Decoder-Focused Models. We compare the
more resources. Differently, the processing happens more encoder-focused models against the decoder-focused mod-
efficiently when using the MTL baseline on PASCAL, but the els on NYUD-v2 and PASCAL in Tables 5a- 5b. First, we find
performance declines too. MTI-Net constitutes an exemp- that the decoder-focused architectures generally outperform
tion from this rule. In particular, the performance increases the encoder-focused ones in terms of multi-task perfor-
on all tasks, except for normals, while the computational mance. We argue that each architecture paradigm serves dif-
overhead is limited. Note, in this specific case, the relative ferent purposes. The encoder-focused architectures aim to
increase in parameters and FLOPS can be attributed to the learn richer feature representations of the image by sharing
use of a shallow backbone network. information during the encoding process. Differently, the
Influence of Task Dictionary. We study the influence of decoder-focused ones focus on improving dense prediction
the task dictionary (i.e. size and diversity) by comparing tasks by repeatedly refining the predictions through cross-
the results on NYUD-v2 against PASCAL (see Table 5a vs task interactions. Since the interactions take place near the
Table 5b). On NYUD-v2, we consider the tasks of semantic output of the network, they allow for a better alignment
segmentation and depth estimation. This pair of tasks is of common cross-task patterns, which in turn, greatly boost
strongly related [13], [14], [17], since both semantic segmen- the performance. Based on their complementary behavior,
tation and depth estimation reveal similar characteristics we hope to see both paradigms integrated in future works.
about a scene, such as the layout and object shapes or Second, we focus on the encoder- and decoder-focused
boundaries. Differently, PASCAL includes a larger and more models that use an identical ResNet-50 backbone on NYUD-
diverse task dictionary. v2 in Table 5a. The decoder-focused models report higher
On NYUD-v2, MTL proves an effective strategy for performance, but consume a large number of FLOPS. The
jointly tackling segmentation and depth estimation. In par- latter is due to repeatedly predicting the task outputs at high
ticular, most MTL models outperform the set of single-task resolution scales. On the other hand, except for JTRL, the
networks (see Table 5a). Similar results have been reported decoder-focused models have a smaller memory footprint
for other pairs of well-correlated tasks, e.g. depth and flow compared to the encoder-focused ones. We argue that the
estimation [80], detection and classification [81], [82], detec- decoder-focused approaches parameterize the task interac-
tion and segmentation [2], [83]. tions more efficiently. This can be understood as follows.
Differently, most existing models fail to outperform their The task features before the final layer are disentangled
13

TABLE 5: Comparison of deep architectures and optimization strategies for MTL on NYUD-v2 and PASCAL. Models that
use the same backbone can be put in direct comparison with each other. For more details about the experimental setup
visit Section 4.1.

(a) Comparison of deep MTL architectures for MTL on the NYUD-v2 validation set.

Backbone Model FLOPS (G) Params (M) Seg. (IoU) ↑ Depth (rmse) ↓ ∆MTL (%) ↑
Single ResNet-50 192 80 43.9 0.585 + 0.00
Task HRNet-18 11 8 35.3 0.648 + 0.00
MTL Baseline 133 56 44.4 0.587 + 0.41
Encoder MTAN 197 72 45.0 0.584 + 1.32
ResNet-50
Focused Cross-Stitch 192 80 44.2 0.570 + 1.61
MTL NDDR-CNN 207 102 44.2 0.573 + 1.38
HRNet-18 MTL Baseline 6 4 33.9 0.636 - 1.09
JTRL 660 295 46.4 0.501 + 10.02
Decoder ResNet-50 PAP-Net 4800 52 50.4 0.530 + 12.10
Focused PAD-Net (Single-Scale) 256 52 50.2 0.582 + 7.43
MTL PAD-Net (Multi-Scale) 82 12 36.0 0.630 + 2.38
HRNet-18
MTI-Net 16 27 38.6 0.593 + 8.95

(b) Comparison of deep MTL architectures for MTL on the PASCAL validation set.

Backbone Model FLOPS Params Seg. H. Parts Norm. Sal. Edge ∆MTL
(G) (M) (IoU) ↑ (IoU) ↑ (mErr) ↓ (IoU) ↑ (odsF) ↑ (%) ↑
Single ResNet-18 167 80 66.2 59.9 13.9 66.3 68.8 + 0.00
Task HRNet-18 24 20 59.5 61.4 14.0 67.3 72.6 + 0.00
MTL Baseline 71 35 63.8 58.6 14.9 65.1 69.2 - 2.86
Encoder MTAN 80 37 63.7 58.9 14.8 65.4 69.6 - 2.39
ResNet-18
Focused Cross-Stitch 167 80 66.1 60.6 13.9 66.8 69.9 + 0.60
MTL NDDR-CNN 187 88 65.4 60.5 13.9 66.8 69.8 + 0.39
HRNet-18 MTL Baseline 7 4 54.3 59.3 14.8 65.5 71.7 - 4.38
Decoder ResNet-18 PAD-Net 36 32 63.2 59.3 15.2 64.3 60.2 - 5.62
Focused PAD-Net 212 29 53.6 59.6 15.3 65.8 72.5 - 4.41
MTL HRNet-18
MTI-Net 15 24 64.3 62.1 14.8 68.0 73.4 + 1.13

(c) Comparison of optimization strategies for (d) Comparison of optimization strategies for MTL on the PASCAL validation
MTL on the NYUD-v2 validation set. The MTL set. The MTL baseline model with ResNet-18 backbone is used.
baseline model with ResNet-50 backbone is used.
Method Seg. H. Parts Norm. Sal. Edge ∆MTL
Method Seg. Depth ∆MTL (IoU) ↑ (IoU) ↑ (mErr) ↓ (IoU) ↑ (odsF) ↑ (%) ↑
(IoU) ↑ (rmse) ↓ (%)↑
Single-Task 66.2 59.9 13.9 66.3 68.8 + 0.00
Single-Task 43.9 0.585 + 0.00
Uniform 65.5 59.5 15.8 64.1 67.9 - 3.99
Fixed (Grid S.) 44.4 0.587 + 0.41 Fixed (Grid S.) 63.8 58.6 14.9 65.1 69.2 - 2.86
Uncertainty 44.0 0.590 - 0.23 Uncertainty 65.4 59.2 16.5 65.6 68.6 - 4.60
DWA 44.1 0.591 - 0.28 DWA 63.4 58.9 14.9 65.1 69.1 - 2.94
GradNorm 44.2 0.581 + 1.45 GradNorm 64.7 59.0 15.4 64.5 67.0 - 3.97
MGDA 43.2 0.576 + 0.02 MGDA 64.9 57.9 15.6 62.5 61.4 -6.81

according to the structure of the output task. This allows IoU), and a small decline in performance on the depth
to distill the relevant cross-task patterns with a small num- estimation task (+0.002 rmse). Other encoder-focused archi-
ber of filter operations. This situation is different for the tectures further improve over the MTL baseline in terms of
encoder-focused approaches, where tasks share information multi-task performance, but require more parameters and
in the intermediate layers of the encoder. FLOPS. Note that the observed performance gains are of
rather small magnitude. We conclude that sharing a ResNet
4.3 Architectures backbone in combination with strong task-specific head
units proves a strong baseline for solving a pair of well-
We study the MTL architectures in more detail. Sec-
correlated dense prediction tasks, like semantic segmenta-
tion 4.3.1 compares the encoder-focused architectures, while
tion and depth estimation.
the decoder-focused ones are discussed in Section 4.3.2.
Furthermore, the cross-stitch network outperforms the
4.3.1 Encoder-Focused Architectures NDDR-CNN model both in terms of performance and com-
NYUD-v2. We analyze the encoder-focused approaches on putational efficiency. Both models follow a similar design,
NYUD-v2 in Table 5a. The MTL baseline performs on par where features from the single-task networks are fused
with the set of single-task networks, while it reduces the across several encoding layers (see Section 2.2). The dif-
number of parameters and FLOPS. In particular, we observe ference lies in the employed feature fusion mechanism:
an increase in performance on the segmentation task (+0.5 NDDR-CNNs employ non-linear dimensionality reduction
14

Fig. 10: Performance profile of MTL methods for the semantic segmentation and depth estimation tasks on NYUD-v2. We
show the results obtained with different hyperparameter settings in the same color. Bottom-right is better.

(a) Performance profile of the encoder- (b) Performance profile of the decoder- (c) Performance profile of the task bal-
focused models. focused models. ancing techniques.

to fuse the features, while cross-stitch networks opt for a First, the performance gains reported by the encoder-
simple linear combination of the feature channels. Given focused models are strongly dependent on how well the
the more sophisticated feature fusion scheme employed by single-task models were optimized. In particular, the MTL
NDDR-CNNs, one would expect the NDDR-CNN to obtain performance gains are significantly enlarged when using a
higher performance. Yet, the opposite is observed in this suboptimal set of hyperparameters vs the optimal ones to
experiment. We conclude that the feature fusion scheme train the single-task models. We emphasize the importance
employed by cross-stitch networks and NDDR-CNNs could of carefully training the single-task baselines in MTL.
benefit from further investigation. Second, the encoder-focused models seem more robust
Finally, no encoder-focused model is significantly out- to the used hyperparameters compared to the single-task
performing its competitors on the NYUD-v2 benchmark. models. More specifically, when using a less carefully tuned
Therefore, in this case, Multi-Task Attention Networks seem set of hyperparameters, the performance of the single-task
the most favorable choice given their efficient design and models drops faster compared to the encoder-focused MTL
good performance. models. We will come to a similar conclusion when re-
PASCAL. We study again the encoder-focused approaches peating this experiment for the optimization strategies. We
on the PASCAL dataset in Table 5b. In contrast to the results conclude that studying hyperparameter robustness under
on NYUD-v2, both the MTL baseline and MTAN model a single- vs multi-task scenario could be an interesting
report lower performance when compared against the set of direction for future work.
single-task models (−2.86% and −2.39%). There are several Finally, the performance of cross-stitch networks and
possible explanations for this. First, we consider a larger task NDDR-CNNs is less hyperparameter dependent compared
dictionary. It has been shown [18], [54] that a diverse task to the MTL baseline and MTAN. The latter pair of models
dictionary is more prone to task interference. Second, the shows much larger performance differences across different
PASCAL dataset contains more labeled examples compared hyperparameter settings.
to NYUD-v2. Prior work [36] observed that the multi-task Discussion. We compared several encoder-focused archi-
performance gains can be lower when more annotations tectures on NYUD-v2 and PASCAL. For specific pairs of
are available. Still both models are a useful strategy to tasks, e.g. depth estimation and semantic segmentation, we
reduce the required amount of resources. Depending on can boost the overall performance through encoder-focused
the performance requirements, the MTAN model is favored MTL models. However, when considering a large or diverse
over the MTL baseline or vice versa. task dictionary, the performance improvements are limited
Furthermore, we observe that cross-stitch networks and to a few isolated tasks. In the latter case, MTL still provides
NDDR-CNNs can not handle the larger, more diverse task a useful strategy to reduce the required amount of compu-
dictionary: the performance improvements are negligible tational resources. Notably, there was no encoder-focused
when using these models, while the number of parame- model that consistently outperformed the other architec-
ters and FLOPS increases. Differently, the MTL baseline tures. Instead, an appropriate MTL model should rather be
and MTAN model are able to strike a better compromise decided on a per case basis, also taking into account the
between the multi-task performance and the computation amount of available computational resources. For example,
resources that are required. when performance is crucial, it is advised to use a cross-
Hyperparameters. We evaluate the performance of the stitch network, while if the available resources are limited,
encoder-focused models when trained with different sets the MTAN model provides a more viable alternative.
of hyperparameters (see Table 4). Figure 10a shows a per-
formance profile for the semantic segmentation and depth 4.3.2 Decoder-Focused Architectures
estimation tasks on NYUD-v2. Experiments that use the NYUD-v2. The results for the decoder-focused models
same model, but a different set of hyperparameters, are on NYUD-v2 can be found in Table 5a. All single-scale
displayed in identical color. We make several observations decoder-focused architectures report significant gains over
regarding the influence of the employed hyperparameters. the single-task networks on NYUD-v2. PAP-Net achieves
15

the highest multi-task performance (+12.10%), but con- Conclusion. We compared several decoder-focused archi-
sumes a large number of FLOPS. This is due to the use of tectures across two dense labeling datasets. On NYUD-v2,
task-affinity matrices, which require to calculate the feature the performance on both semantic segmentation and depth
correlation between every pair of pixels in the image. A estimation can be significantly improved by employing one
similar improvement is seen for JTRL (+10.02%). JTRL re- of the decoder-focused models. However, both PAP-Net and
cursively predicts the two tasks at increasingly higher scales. JTRL incur a large increase in the number of FLOPS. MTI-
Therefore, a large number of filter operations is performed Net and PAD-Net provide more viable alternatives when
on high resolution feature maps, leading to an increase in we wish to limit the number of FLOPS. On PASCAL, a
the computational resources. Opposed to JTRL and PAP- multi-scale approach, like MTI-Net, seems better fitted for
Net, PAD-Net does not come with a significant increase increasing the multi-scale performance while retaining the
in the number of computations. Yet, we still observe a computational resources low. We conclude that the decoder-
relatively large improvement over the single-task networks focused architectures obtain promising results on the MTL
(+7.43%). problem.
Next, we consider the decoder-focused architectures that
used a multi-scale backbone, i.e. HRNet-18. Again, PAD-
4.4 Task Balancing
Net outperforms the single-task networks (+2.38%), but
MTI-Net further improves the performance (+8.95%). We We revisit the task balancing strategies from Section 3.1.
conclude that it is beneficial to distill task information at NYUD-v2. Table 5c shows the results when training the
a multitude of scales, instead of a single scale. MTI-Net MTL baseline with a ResNet-50 backbone using different
consumes slightly more parameters and FLOPS compared task balancing strategies. On NYUD-v2, optimizing the loss
to the single-task networks. This is due to the use of a rather weights with a grid search procedure resulted in a uniform
shallow backbone, i.e. HRNet-18, and a small number of loss weighting scheme. Therefore, in this case, the use of
tasks. As a consequence, the overhead introduced by adding fixed uniform weights and the use of fixed weights from a
the additional layers in MTI-Net is relatively large compared grid search overlap.
to the resources used by the backbone. PAD-Net sees a large The MTL baseline with fixed weights improves over the
increase in the number of FLOPS compared to MTI-Net. This single-task networks (+0.41% for ∆M T L ). GradNorm can
is due to the fact that PAD-Net performs the multi-modal further boost the performance by adjusting the task-specific
distillation at a single higher scale (1/4) with 4 · C channels, weights in the loss during training (+1.45%). We conclude
C being the number of backbone channels at a single scale. that uniform weights are suboptimal for training a multi-
Instead, MTI-Net performs most of the computations at task model when the tasks employ different loss functions.
smaller scales (1/32, 1/16, 1/8), while operating on only Similar to GradNorm, DWA tries to balance the pace
C channels at the higher scale (1/4). at which tasks are learned, but does not equalize the gra-
PASCAL. We analyze the decoder-focused models on PAS- dient magnitudes. From Table 5c, we conclude that the
CAL in Table 5b. We see that PAD-Net can not handle the latter is important as well (−0.28% with DWA vs +1.45%
larger task dictionary. The low performance on the semantic with GradNorm). Uncertainty weighting results in reduced
edge prediction task can be attributed to not using skip performance compared to GradNorm (−0.23% vs 1.45%).
connections in the implemented model. Differently, MTI- Uncertainty weighting assigns a smaller weight to noisy
Net improves the performance on all tasks, except normals, or difficult tasks. Since the annotations on NYUD-v2 were
while requiring fewer flops when compared against the not distilled, the noise levels are rather small. When we
single-task networks. The consistent findings in the larger have access to clean ground-truth annotations, it seems
task dictionary backup our hypothesis about the importance better to balance the learning of tasks rather than lower
of performing the multi-modal distillation at a multitude the weights for the difficult tasks. Further, MGDA reports
of scales (see multi-scale decoder-focused approaches on lower performance compared to GradNorm (+0.02% vs
NYUD-v2). +1.45%). MGDA only updates the weights along directions
Hyperparameters. Figure 10b shows the performance pro- that are common among all task-specific gradients. The
file for PAD-Net and MTI-Net on NYUD-v2. We use the results suggest that it is better to allow some competition
same hyperparameters as before (see Table 4). The obtained between the task-specific gradients in the shared layers, as
MTL solutions outperform training dedicated single-task this could help to avoid local minima.
models. Furthermore, both PAD-Net and MTI-Net are ro- Finally, we conclude that the use of fixed loss weights op-
bust against hyperparameter changes. Even when trained timized with a grid search still outperforms several existing
with suboptimal hyperparameters, both models still outper- task balancing methods. In particular, the solution obtained
form their single-task counterparts. Finally, we do not ob- with fixed uniform weights outperforms the models trained
serve a significant difference in hyperparameter robustness with uncertainty weighting, MGDA and DWA on both the
between the single-task and multi-task models. When com- semantic segmentation and depth estimation tasks.
pared against the ResNet-50 single-task models, the HRNet- PASCAL. The task balancing techniques are compared on
18 model seems to require less hyperparameter tuning to PASCAL in Table 5d. We find that a grid-search on the
tackle the semantic segmentation and depth estimation tasks weight space works better than the use of automated task
on NYUD-v2. We observe that the HRNet-18 model uses balancing procedures from Section 3. A similar observation
fewer parameters compared to the ResNet-50 model. This was made by [18]. We hypothesize that this is due to the
could explain why it is easier to train the single-task HRNet- imbalance between the optimal parameters, e.g. the weight
18 models. for the edge detection loss is 100 times higher than for the
16

semantic segmentation loss. Uncertainty weighting reports becomes intractable when tackling large number of tasks.
the highest performance losses on the distilled tasks, i.e. In this case, we can fallback to existing task balancing tech-
normals and saliency. This is because uncertainty weighting niques to set the task weights. Furthermore, when dealing
assigns a smaller weight to tasks with higher homoscedastic with noisy annotations, uncertainty weighting can help to
uncertainty. Differently, MGDA fails to properly learn the automatically readjust the weights on the noisy tasks.
edge detection task. This is another indication (cf. NYUD-
v2 results) that avoiding competing gradients by only back
propagating along a common direction in the shared layers 5 R ELATED D OMAINS
does not necessarily improve performance. This is partic- So far, we focused on the application of MTL to jointly solve
ularly the case for the edge detection task on PASCAL multiple vision tasks under a fully-supervised setting. In
because of two reasons. First, the edge detection loss has this section, we consider the MTL setup from a more gen-
a much smaller magnitude compared to the other tasks. eral point-of-view, and analyze its connection with several
Second, when the edge annotations are converted into seg- related domains. The latter could potentially be combined
ments they provide an over-segmentation of the image. As with the MTL setup and improve it, or vice versa.
a consequence, the loss gradient of the edge detection task
often conflicts with other tasks since they have a smoother 5.1 Multi-Domain Learning
gradient. Because of this, MGDA prefers to mask out the
gradient from the edge detection task by assigning it a The methods considered so far were applied to solve mul-
smaller weight. Finally, GradNorm does not report higher tiple tasks on the same visual domain. However, there is
performance when compared against uniform weighting. a growing interest in learning representations that perform
We hypothesize that this is due to the re-normalization of well for many visual domains simultaneously. For example,
the loss weights during the optimization. The latter does Bilen and Vedaldi [84] learned a compact multi-domain rep-
not work well when the optimal loss magnitudes are very resentation using domain-specific scaling parameters. This
imbalanced. idea was later extended to the use of residual adapters [85],
Hyperparameters. We use the performance profile from Fig- [86]. These works only explored multi-domain learning for
ure 10c to analyze the hyperparameter sensitivity of the task various classification tasks. Future research should address
balancing methods on NYUD-v2. The used hyperparame- the problem when considering multiple dense prediction
ters are the same as defined before (see Table 4). GradNorm tasks too.
is the only technique that outperforms training a separate
model for each task. However, when less effort is spent to 5.2 Transfer Learning
tune the hyperparameters of the single-task models, all task Transfer learning [87] makes use of the knowledge obtained
balancing techniques result in improved performance. This when solving one task, and applies it to tackle another task.
observation explains the increased performance relative to Different from MTL, transfer learning does not consider
the single-task case reported by prior works [19], [20], [22]. solving all tasks concurrently. An important question in
Again, we stress the importance of carefully training the both transfer learning and MTL is whether visual tasks
single-task baselines. have relationships, i.e. they are correlated. Ma et al. [88]
Second, as before, the MTL models seem more robust modeled the task relationships in a MTL network through
to hyperparameter changes compared to the single-task a Multi-gate Mixture-of-Experts model. Zamir et al. [89]
networks. In particular, the performance drops slower when provided a taxonomy for task transfer learning to quantify
using less optimal hyperparameter settings in the MTL case. the task relationships. Similarly, Dwivedi and Roig [62]
Finally, fixed loss weighting and uncertainty weighting used Representation Similarity Analysis to obtain a measure
are most robust against hyperparameter changes. These of task affinity, by computing correlations between mod-
techniques report high performance for a large number of els pretrained on different tasks. Vandenhende et al. [10]
hyperparameter settings. For example, out of the top-20% then used these task relationships for the construction of a
best models, 40% of those were trained with uncertainty branched MTL network. Standley et al. [90] also relied on
weighting. task relationships to define what tasks should be learned
Discussion. We evaluated the task balancing strategies from together in a MTL setup.
Section 3.1 under different settings. The methods were
compared to selecting the loss weights by a grid-search
procedure. Surprisingly, in our case, we found that grid- 5.3 Neural Architecture Search
search is competitive or better compared to existing task The experimental results in Section 4 showed that the
balancing techniques. Also, a number of techniques per- success of MTL strongly depends on the use of a proper
formed worse than anticipated. Gong et al. [36] obtained network architecture. Typically, such architectures are hand-
similar results with ours, albeit only for a few loss balancing crafted by human experts. However, given the size and com-
strategies. Also, Maninis et al. [18] found that performing a plexity of the problem, this manual architecture exploration
grid search for the weights could outperform a state-of-the- likely exceeds human design abilities. To automate the con-
art loss balancing scheme. Based on these works and our struction of the network architecture, Neural Architecture
own findings, we argue that the optimization in MTL could Search (NAS) [91] has been proposed in the literature.
benefit from further investigation (see also Section 3.1.6). Yet, most existing NAS works are limited to task-specific
Still existing task balancing techniques can be useful to models [92], [93], [94], [95], [96]. This is to be expected as
train MTL networks. A grid search on the weight space using NAS for MTL assumes that layer sharing has to be
17

jointly optimized with the layers types, their connectivity, etc., all affect the final outcome. As a result, an appropri-
etc., rendering the problem considerably expensive. ate architecture and optimization strategy should better be
To alleviate the heavy computational burden associated selected on a per case basis. Although we provided concrete
with NAS, several works have proposed to start from a observations as to why some methods work better for
predefined backbone network for which a cross-task layer specific setups, MTL could generally benefit from a deeper
sharing scheme is automatically determined. For example, theoretical understanding to maximize the expected gains
Liang et al. [97] implemented an evolutionary architecture in every case. For example, these gains seem dependent on
search for MTL, while others explored alternatives like a plurality of factors, e.g. amount of data, task relationships,
branched MTL networks [9], [10], [11], [12], routing [65], noise, etc. Future work should try to isolate and analyze the
stochastic filter grouping [56], and feature partitioning [68]. influence of these different elements.
So far, NAS for MTL focused on how to share features across Second, when it comes to tackling multiple dense pre-
tasks in the encoder. We hypothesize that NAS could also be diction tasks using a single MTL model, decoder-focused
applied for the discovery of decoder-focused MTL models. architectures currently offer more advantages in terms of
multi-task performance, and with limited computational
5.4 Other overhead compared to the encoder-focused ones. As ex-
plained, this is due to an alignment of common cross-task
MTL has been applied to other problems too. This includes
patterns that decoder-focused architectures promote, which
various domains, such as language [24], [98], [99], au-
naturally fits well with dense prediction tasks. Encoder-
dio [100], video [101], [102], and robotics [103], [104], as well
focused architectures still offer certain advantages within
as with different learning paradigms, such as reinforcement
the dense prediction task setting, but their inherent layer
learning [105], [106], self-supervised learning [107], semi-
sharing seems better fitted to tackle multiple classification
supervised learning [108], [109] and active learning [110],
tasks.
[111]. Surprisingly, in the deep learning era, very few works
Finally, we analyzed multiple task balancing strategies,
have considered MTL under the semi-supervised or active
and isolated the elements that are most effective for balanc-
learning setting. Nonetheless, we believe that these are
ing task learning, e.g. down-weighing noisy tasks, balancing
interesting directions for future research.
task gradients, etc. Yet, many optimization aspects still
For example, a major limitation of the fully-supervised
remain poorly understood. For example, opposed to recent
MTL setting that we consider here is the requirement for all
works, our analysis indicates that avoiding gradient com-
samples to be annotated for every task. Prior work [112]
petition between tasks can hurt performance. Furthermore,
showed that the standard update rule from Equation 5
our study revealed that some task balancing strategies still
gives suboptimal results if we do not take precautions when
suffer from shortcomings, and highlighted several discrep-
annotations are missing. To alleviate this problem, Kim
ancies between existing methods. We hope that this work
et al. [113] proposed to use an alternate learning scheme
stimulates further research efforts into this problem.
by updating the network weights for a single task at a
Acknowledgment. The authors would like to acknowledge
time. A knowledge distillation term [114] is included, in
support by Toyota via the TRACE project and MACCHINA
order to avoid loosing relevant information about the other
(KU Leuven, C14/18/065). This work is also sponsored by
tasks. Differently, Nekrasov et al. [112] proposed to use
the Flemish Government under the Flemish AI programme.
the predictions from an expert model as synthetic ground-
Finally, the authors would like to thank Shikun Liu, Wanli
truth when annotations are missing. Although, these early
Ouyang and the anonymous reviewers for useful feedback.
attempts have shown encouraging results, we believe that
this problem could benefit from further investigation.
Finally, multi-task learning was recently shown to im-
prove robustness. For example, in [115] a multi-task learn-
ing strategy showed robustness against adversarial attacks,
while [116] found that applying cross-task consistency in
MTL improves generalization, and allows for domain shift
detection.
Simon Vandenhende received his MSE degree
6 C ONCLUSION in Electrical Engineering from the KU Leuven,
Belgium, in 2018. He is currently studying to-
In this paper, we reviewed recent methods for MTL within wards a PhD degree at the Center for Process-
the scope of deep neural networks. First, we presented an ing Speech and Images at KU Leuven. His re-
extensive overview of both architectural and optimization search focuses on multi-task learning and self-
based strategies for MTL. For each method, we described supervised learning. He won a best paper award
at MVA’19. He is co-organizing the first Com-
its key aspects, discussed the commonalities and differences mands For Autonomous Vehicles workshop at
with related works, and presented the possible advantages ECCV’20.
or disadvantages. Finally, we conducted an extensive ex-
perimental analysis of the described methods that led us to
several key findings. We summarize some of our conclu-
sions below, and present some possibilities for future work.
First, the performance of MTL strongly varies depending
on the task dictionary. Its size, task types, label sources,
18

Stamatios Georgoulis is a post-doctoral re- Luc Van Gool is a full professor for Computer
searcher at the Computer Vision Lab of ETH Vision at ETH Zurich and the KU Leuven. He
Zurich. His current research interests include leads research and teaches at both places. He
multi-task learning, unsupervised learning, and has authored over 300 papers. Luc Van Gool
image generation. Before coming to Zurich, he has been a program committee member of sev-
was a doctoral student at the PSI group of KU eral, major computer vision conferences (e.g.
Leuven, where he received his PhD under the Program Chair ICCV’05, Beijing, General Chair
supervision of Prof. Van Gool and Prof. Tuyte- of ICCV’11, Barcelona, and of ECCV’14, Zurich).
laars, focusing on extracting surface characteris- His main interests include 3D reconstruction and
tics and lighting from images. Further back, he modeling, object recognition, and autonomous
received his diploma in Electrical and Computer driving. He received several Best Paper awards
Engineering from the Aristotle University of Thessaloniki and partici- (eg. David Marr Prize ’98, Best Paper CVPR’07). He received the
pated in Microsoft’s Imagine Cup Software Design Competition. He reg- Koenderink Award in 2016 and ‘Distinguished Researcher’ nomination
ularly serves as a reviewer in Computer Vision and Machine Learning by the IEEE Computer Society in 2017. In 2015 he also received they
conferences (CVPR, NeurIPS, ICCV, ECCV, etc.) with distinctions. 5-yearly Excellence Prize by the Flemish Fund for Scientific Research.
He was the holder of an ERC Advanced Grant (VarCity). Currently, he
leads computer vision research for autonomous driving in the context of
the Toyota TRACE labs in Leuven and at ETH, and has an extensive col-
laboration with Huawei on the issue of image and video enhancement.

Wouter Van Gansbeke obtained his MSE de-


gree in Electrical Engineering, with a focus on R EFERENCES
Information Technology, from the KU Leuven,
Belgium, in 2018. He is currently pursuing a [1] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional net-
PhD degree at the Center for Processing Speech works for semantic segmentation,” in CVPR, 2015.
and Images at KU Leuven. His main research [2] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in
interests are self-supervised learning and multi- ICCV, 2017.
task learning. His focus is leveraging visual sim- [3] D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from
ilarities, temporal information and multiple tasks a single image using a multi-scale deep network,” in NIPS, 2014.
to learn rich representations with limited annota- [4] M. Gur and D. M. Snodderly, “Direction selectivity in v1 of alert
tions. monkeys: evidence for parallel pathways for motion processing,”
The Journal of physiology, 2007.
[5] I. Misra, A. Shrivastava, A. Gupta, and M. Hebert, “Cross-stitch
networks for multi-task learning,” in CVPR, 2016.
[6] S. Ruder, J. Bingel, I. Augenstein, and A. Søgaard, “Latent multi-
task architecture learning,” in AAAI, 2019.
[7] Y. Gao, J. Ma, M. Zhao, W. Liu, and A. L. Yuille, “Nddr-cnn: Lay-
Marc Proesmans is a research associate at
erwise feature fusing in multi-task cnns by neural discriminative
the Center for Processing Speech and Images
dimensionality reduction,” in CVPR, 2019.
at KU Leuven, where he leads the TRACE
[8] S. Liu, E. Johns, and A. J. Davison, “End-to-end multi-task
team. He has an extensive research experience
learning with attention,” in CVPR, 2019.
on various aspects in image processing e.g.
[9] Y. Lu, A. Kumar, S. Zhai, Y. Cheng, T. Javidi, and R. Feris, “Fully-
early vision processes, 3D reconstruction, op-
adaptive feature sharing in multi-task networks with applica-
tical flow, stereo, structure from motion, recog-
tions in person attribute classification,” in CVPR, 2017.
nition. He has been involved in several Euro-
[10] S. Vandenhende, S. Georgoulis, B. De Brabandere, and
pean research projects, such as VANGUARD,
L. Van Gool, “Branched multi-task networks: Deciding what
IMPROOFS, MESH, MURALE, EUROPA, ROV-
layers to share,” in BMVC, 2020.
INA. From 2000-2011, he has been CTO for a
[11] D. Bruggemann, M. Kanakis, S. Georgoulis, and L. V. Gool,
KU Leuven spin-off company specialized in special effects for the movie
“Automated search for resource-efficient branched multi-task
and game industry. He won several prizes, Golden Egg, Tech-Art, a
networks,” 2020.
European IST, tech. Oscar nomination, European seal of Excellence.
[12] P. Guo, C.-Y. Lee, and D. Ulbricht, “Learning to branch for multi-
From 2015 he is leading the research on automotive driving as Research
task learning,” in ICML, 2020.
expert, and co-founded TRACE vzw to drive the technology transfer
[13] D. Xu, W. Ouyang, X. Wang, and N. Sebe, “Pad-net: Multi-
from academia to the real R&D environment for autonomous driving in
tasks guided prediction-and-distillation network for simultane-
cooperation with Toyota.
ous depth estimation and scene parsing,” in CVPR, 2018.
[14] Z. Zhang, Z. Cui, C. Xu, Y. Yan, N. Sebe, and J. Yang, “Pattern-
affinitive propagation across depth, surface normal and semantic
segmentation,” in CVPR, 2019.
[15] Z. Zhang, Z. Cui, C. Xu, Z. Jie, X. Li, and J. Yang, “Joint task-
recursive learning for semantic segmentation and depth estima-
Dengxin Dai is a Senior Scientist working with tion,” in ECCV, 2018.
the Computer Vision Lab at ETH Zurich. In 2016, [16] S. Vandenhende, S. Georgoulis, and L. Van Gool, “Mti-net: Multi-
he obtained his PhD in Computer Vision at ETH scale task interaction networks for multi-task learning,” in ECCV,
Zurich. Since then he is the Team Leader of 2020.
TRACE-Zurich, working on Autonomous Driv- [17] L. Zhou, Z. Cui, C. Xu, Z. Zhang, C. Wang, T. Zhang, and J. Yang,
ing within the R&D project ”TRACE: Toyota Re- “Pattern-structure diffusion for multi-task learning,” in CVPR,
search on Automated Cars in Europe”. His re- 2020.
search interests lie in autonomous driving, ro- [18] K.-K. Maninis, I. Radosavovic, and I. Kokkinos, “Attentive single-
bust perception in adverse weather and illumi- tasking of multiple tasks,” in CVPR, 2019.
nation conditions, automotive sensors and com- [19] A. Kendall, Y. Gal, and R. Cipolla, “Multi-task learning using
puter vision under limited supervision. He has uncertainty to weigh losses for scene geometry and semantics,”
organized a CVPR Workshop series (’19, ’20) on Vision for All Seasons: in CVPR, 2018.
Bad Weather and Nighttime, and has organized an ICCV’19 workshop [20] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich, “Grad-
on Autonomous Driving. He has been a program committee member norm: Gradient normalization for adaptive loss balancing in deep
of several major computer vision conferences and received multiple multitask networks,” in ICML, 2018.
outstanding reviewer awards. He is guest editor for the IJCV special [21] M. Guo, A. Haque, D.-A. Huang, S. Yeung, and L. Fei-Fei,
issue Vision for All Seasons, area chair for WACV’20 and CVPR’21. “Dynamic task prioritization for multitask learning,” in ECCV,
2018.
19

[22] O. Sener and V. Koltun, “Multi-task learning as multi-objective [52] D. Neven, B. De Brabandere, S. Georgoulis, M. Proesmans, and
optimization,” in NIPS, 2018. L. Van Gool, “Fast scene understanding for autonomous driv-
[23] X. Lin, H.-L. Zhen, Z. Li, Q.-F. Zhang, and S. Kwong, “Pareto ing,” in IV Workshops, 2017.
multi-task learning,” in NIPS, 2019. [53] M. Teichmann, M. Weber, M. Zoellner, R. Cipolla, and R. Urtasun,
[24] P. Liu, X. Qiu, and X. Huang, “Adversarial multi-task learning “Multinet: Real-time joint semantic reasoning for autonomous
for text classification,” in ACL, 2017. driving,” in IV, 2018.
[25] A. Sinha, Z. Chen, V. Badrinarayanan, and A. Rabinovich, “Gra- [54] I. Kokkinos, “Ubernet: Training a universal convolutional neural
dient adversarial training of neural networks,” arXiv preprint network for low-, mid-, and high-level vision using diverse
arXiv:1806.08028, 2018. datasets and limited memory,” in CVPR, 2017.
[26] X. Zhao, H. Li, X. Shen, X. Liang, and Y. Wu, “A modulation mod- [55] M. Long, Z. Cao, J. Wang, and S. Y. Philip, “Learning multiple
ule for multi-task learning with applications in image retrieval,” tasks with multilinear relationship networks,” in NIPS, 2017.
in ECCV, 2018. [56] F. J. Bragman, R. Tanno, S. Ourselin, D. C. Alexander, and M. J.
[27] V. Sanh, T. Wolf, and S. Ruder, “A hierarchical multi-task ap- Cardoso, “Stochastic filter groups for multi-task cnns: Learning
proach for learning embeddings from semantic tasks,” in AAAI, specialist and generalist convolution kernels,” in ICCV, 2019.
2019. [57] K. He, X. Zhang, S. Ren, and J. Sun, “Spatial pyramid pooling
[28] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, in deep convolutional networks for visual recognition,” TPAMI,
Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer 2015.
learning with a unified text-to-text transformer,” arXiv preprint [58] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L.
arXiv:1910.10683, 2019. Yuille, “Deeplab: Semantic image segmentation with deep con-
[29] R. Caruana, “Multitask learning,” Machine learning, 1997. volutional nets, atrous convolution, and fully connected crfs,”
[30] Y. Zhang and Q. Yang, “A survey on multi-task learning,” arXiv TPAMI, 2017.
preprint arXiv:1707.08114, 2017. [59] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
[31] A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman, network,” in CVPR, 2017.
“Glue: A multi-task benchmark and analysis platform for natural [60] Y. Yuan and J. Wang, “Ocnet: Object context network for scene
language understanding,” in Workshop EMNLP, 2018. parsing,” arXiv preprint arXiv:1809.00916, 2018.
[32] L. Deng, G. Hinton, and B. Kingsbury, “New types of deep neural [61] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable
network learning for speech recognition and related applications: are features in deep neural networks?” in NIPS, 2014.
An overview,” in 2013 IEEE international conference on acoustics, [62] K. Dwivedi and G. Roig, “Representation similarity analysis for
speech and signal processing, 2013. efficient task taxonomy & transfer learning,” in CVPR, 2019.
[33] C. Widmer, J. Leiva, Y. Altun, and G. Rätsch, “Leveraging se- [63] E. Meyerson and R. Miikkulainen, “Beyond shared hierarchies:
quence classification by taxonomy-based multitask learning,” Deep multitask learning through soft layer ordering,” in ICLR,
in Annual International Conference on Research in Computational 2018.
Molecular Biology, 2010. [64] Y. Yang and T. Hospedales, “Deep multi-task representa-
[34] S. Bell, Y. Liu, S. Alsheikh, Y. Tang, E. Pizzi, M. Henning, K. Singh, tion learning: A tensor factorisation approach,” arXiv preprint
O. Parkhi, and F. Borisyuk, “Groknet: Unified computer vision arXiv:1605.06391, 2016.
model trunk and embeddings for commerce,” in SIGKDD, 2020. [65] C. Rosenbaum, T. Klinger, and M. Riemer, “Routing networks:
[35] S. Ruder, “An overview of multi-task learning in deep neural Adaptive selection of non-linear functions for multi-task learn-
networks,” arXiv preprint arXiv:1706.05098, 2017. ing,” in ICLR, 2018.
[36] T. Gong, T. Lee, C. Stephenson, V. Renduchintala, S. Padhy, [66] A. Mallya, D. Davis, and S. Lazebnik, “Piggyback: Adapting a
A. Ndirango, G. Keskin, and O. H. Elibol, “A comparison of single network to multiple tasks by learning to mask weights,”
loss weighting strategies for multi task learning in deep neural in ECCV, 2018.
networks,” IEEE Access, 2019. [67] S. Huang, X. Li, Z. Cheng, A. Hauptmann et al., “Gnas: A greedy
[37] T. Evgeniou and M. Pontil, “Regularized multi–task learning,” in neural architecture search method for multi-attribute learning,”
KDD, 2004. in ACMMM, 2018.
[68] A. Newell, L. Jiang, C. Wang, L.-J. Li, and J. Deng, “Feature
[38] Y. Xue, X. Liao, L. Carin, and B. Krishnapuram, “Multi-task
partitioning for efficient multi-task architectures,” arXiv preprint
learning for classification with dirichlet process priors,” JMLR,
arXiv:1908.04339, 2019.
2007.
[69] M. Suteu and Y. Guo, “Regularizing deep multi-task networks us-
[39] L. Jacob, J.-p. Vert, and F. R. Bach, “Clustered multi-task learning:
ing orthogonal gradients,” arXiv preprint arXiv:1912.06844, 2019.
A convex formulation,” in NIPS, 2009.
[70] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss
[40] J. Zhou, J. Chen, and J. Ye, “Clustered multi-task learning via
for dense object detection,” in ICCV, 2017.
alternating structure optimization,” in NIPS, 2011.
[71] J.-A. Désidéri, “Multiple-gradient descent algorithm (mgda) for
[41] B. Bakker and T. Heskes, “Task clustering and gating for bayesian multiobjective optimization,” Comptes Rendus Mathematique, 2012.
multitask learning,” JMLR, 2003.
[72] N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor seg-
[42] K. Yu, V. Tresp, and A. Schwaighofer, “Learning gaussian pro- mentation and support inference from rgbd images,” in ECCV,
cesses from multiple tasks,” in ICML, 2005. 2012.
[43] S.-I. Lee, V. Chatalbashev, D. Vickrey, and D. Koller, “Learning [73] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and
a meta-level prior for feature relevance from multiple related A. Zisserman, “The pascal visual object classes (voc) challenge,”
tasks,” in ICML, 2007. IJCV, 2010.
[44] H. Daumé III, “Bayesian multitask learning with latent hierar- [74] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille,
chies,” arXiv preprint arXiv:0907.0783, 2009. “Detect what you can: Detecting and representing objects using
[45] A. Kumar and H. Daume III, “Learning task grouping and holistic models and body parts,” in CVPR, 2014, pp. 1971–1978.
overlap in multi-task learning,” in ICML, 2012. [75] A. Bansal, X. Chen, B. Russell, A. Gupta, and D. Ramanan,
[46] A. Argyriou, T. Evgeniou, and M. Pontil, “Convex multi-task “Pixelnet: Representation of the pixels, by the pixels, and for the
feature learning,” Machine learning, 2008. pixels,” arXiv preprint arXiv:1702.06506, 2017.
[47] J. Liu, S. Ji, and J. Ye, “Multi-task feature learning via efficient [76] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam,
l2, 1-norm minimization,” in Uncertainty in Artificial Intelligence, “Encoder-decoder with atrous separable convolution for seman-
2009. tic image segmentation,” in ECCV, 2018.
[48] A. Jalali, S. Sanghavi, C. Ruan, and P. K. Ravikumar, “A dirty [77] D. R. Martin, C. C. Fowlkes, and J. Malik, “Learning to detect nat-
model for multi-task learning,” in NIPS, 2010. ural image boundaries using local brightness, color, and texture
[49] A. Agarwal, S. Gerber, and H. Daume, “Learning multiple tasks cues,” TPAMI, 2004.
using manifold regularization,” in NIPS, 2010. [78] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for
[50] R. K. Ando and T. Zhang, “A framework for learning predictive image recognition,” in CVPR, 2016.
structures from multiple tasks and unlabeled data,” JMLR, 2005. [79] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution
[51] P. Rai and H. Daumé III, “Infinite predictor subspace models for representation learning for human pose estimation,” in CVPR,
multitask learning,” in AISTATS, 2010. 2019.
20

[80] Y. Zou, Z. Luo, and J.-B. Huang, “Df-net: Unsupervised joint [111] R. Reichart, K. Tomanek, U. Hahn, and A. Rappoport, “Multi-task
learning of depth and flow using cross-task consistency,” in active learning for linguistic annotations,” in ACL, 2008.
ECCV, 2018. [112] V. Nekrasov, T. Dharmasiri, A. Spek, T. Drummond, C. Shen,
[81] R. Girshick, “Fast r-cnn,” in ICCV, 2015. and I. Reid, “Real-time joint semantic segmentation and depth
[82] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real- estimation using asymmetric annotations,” in ICRA, 2019.
time object detection with region proposal networks,” in NIPS, [113] D.-J. Kim, J. Choi, T.-H. Oh, Y. Yoon, and I. S. Kweon, “Dis-
2015. joint multi-task learning between heterogeneous human-centric
[83] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, “Blitznet: A tasks,” in WACV. IEEE, 2018.
real-time deep network for scene understanding,” in ICCV, 2017. [114] G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in
[84] H. Bilen and A. Vedaldi, “Universal representations: The missing a neural network,” arXiv preprint arXiv:1503.02531, 2015.
link between faces, text, planktons, and cat breeds,” arXiv preprint [115] C. Mao, A. Gupta, V. Nitin, B. Ray, S. Song, J. Yang, and C. Von-
arXiv:1701.07275, 2017. drick, “Multitask learning strengthens adversarial robustness,”
[85] S.-A. Rebuffi, H. Bilen, and A. Vedaldi, “Learning multiple visual 2020.
domains with residual adapters,” in NIPS, 2017. [116] A. R. Zamir, A. Sax, N. Cheerla, R. Suri, Z. Cao, J. Malik, and
[86] ——, “Efficient parametrization of multi-domain deep neural L. J. Guibas, “Robust learning through cross-task consistency,” in
networks,” in CVPR, 2018. CVPR, 2020.
[87] S. J. Pan, Q. Yang et al., “A survey on transfer learning,” TKDE,
2010.
[88] J. Ma, Z. Zhao, X. Yi, J. Chen, L. Hong, and E. H. Chi, “Modeling
task relationships in multi-task learning with multi-gate mixture-
of-experts,” in KDD, 2018.
[89] A. R. Zamir, A. Sax, W. Shen, L. J. Guibas, J. Malik, and
S. Savarese, “Taskonomy: Disentangling task transfer learning,”
in CVPR, 2018.
[90] T. Standley, A. R. Zamir, D. Chen, L. Guibas, J. Malik, and
S. Savarese, “Which tasks should be learned together in multi-
task learning?” ICML, 2020.
[91] T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search:
A survey.” JMLR, 2019.
[92] B. Zoph and Q. V. Le, “Neural architecture search with reinforce-
ment learning,” in ICLR, 2017.
[93] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-
Fei, A. Yuille, J. Huang, and K. Murphy, “Progressive neural
architecture search,” in ECCV, 2018.
[94] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean, “Efficient
neural architecture search via parameter sharing,” in ICML, 2018.
[95] H. Liu, K. Simonyan, and Y. Yang, “Darts: Differentiable architec-
ture search,” ICLR, 2018.
[96] E. Real, A. Aggarwal, Y. Huang, and Q. V. Le, “Regularized
evolution for image classifier architecture search,” in AAAI, 2019.
[97] J. Liang, E. Meyerson, and R. Miikkulainen, “Evolutionary archi-
tecture search for deep multitask networks,” in GECCO, 2018.
[98] D. Dong, H. Wu, W. He, D. Yu, and H. Wang, “Multi-task learning
for multiple language translation,” in ACL, 2015.
[99] B. McCann, N. S. Keskar, C. Xiong, and R. Socher, “The natural
language decathlon: Multitask learning as question answering,”
arXiv preprint arXiv:1806.08730, 2018.
[100] A. Balajee Vasudevan, D. Dai, and L. Van Gool, “Semantic object
prediction and spatial sound prediction with binaural sounds,”
in ECCV, 2020.
[101] A. Diba, M. Fayyaz, V. Sharma, M. Paluri, J. Gall, R. Stiefelhagen,
and L. Van Gool, “Holistic large scale video understanding,”
arXiv preprint arXiv:1904.11451, 2019.
[102] R. Pasunuru and M. Bansal, “Multi-task video captioning with
video and entailment generation,” in ACL, 2017.
[103] M. Wulfmeier, A. Abdolmaleki, R. Hafner, J. T. Springenberg,
M. Neunert, T. Hertweck, T. Lampe, N. Siegel, N. Heess, and
M. Riedmiller, “Regularized hierarchical policies for composi-
tional transfer in robotics,” arXiv preprint arXiv:1906.11228, 2019.
[104] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Ried-
miller, “Learning an embedding space for transferable robot
skills,” in ICLR, 2018.
[105] L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward,
Y. Doron, V. Firoiu, T. Harley, I. Dunning et al., “Impala: Scal-
able distributed deep-rl with importance weighted actor-learner
architectures,” in ICML, 2018.
[106] A. Wilson, A. Fern, S. Ray, and P. Tadepalli, “Multi-task reinforce-
ment learning: a hierarchical bayesian approach,” in ICML, 2007.
[107] C. Doersch and A. Zisserman, “Multi-task self-supervised visual
learning,” in ICCV, 2017.
[108] Q. Liu, X. Liao, and L. Carin, “Semi-supervised multitask learn-
ing,” in NIPS, 2008.
[109] Y. Zhang and D.-Y. Yeung, “Semi-supervised multi-task regres-
sion,” in KDD, 2009.
[110] A. Acharya, R. J. Mooney, and J. Ghosh, “Active multitask learn-
ing using both latent and supervised shared topics,” in ICDM,
2014.

You might also like