
2021 - Task Switching Network For Multi-Task Learning - Sun Et Al

The document presents Task Switching Networks (TSNs), a novel architecture for multi-task learning that utilizes a single encoder-decoder structure to perform multiple tasks efficiently by switching between them. TSNs maintain a constant number of parameters regardless of the number of tasks, enhancing scalability and reducing complexity compared to existing methods. Experimental results demonstrate that TSNs achieve state-of-the-art performance on challenging benchmarks while facilitating constructive task interactions through learned task embeddings.


Task Switching Network for Multi-task Learning

Guolei Sun¹, Thomas Probst¹, Danda Pani Paudel¹, Nikola Popovic¹, Menelaos Kanakis¹, Jagruti Patel¹, Dengxin Dai¹,², Luc Van Gool¹
¹ Computer Vision Laboratory, ETH Zurich, Switzerland
² MPI for Informatics, Germany
[email protected]

Abstract
We introduce Task Switching Networks (TSNs), a task-conditioned architecture with a single unified encoder/decoder for efficient multi-task learning. Multiple tasks are performed by switching between them, performing one task at a time. TSNs have a constant number of parameters irrespective of the number of tasks. This scalable yet conceptually simple approach circumvents the overhead and intricacy of task-specific network components in existing works. In fact, we demonstrate for the first time that multi-tasking can be performed with a single task-conditioned decoder. We achieve this by learning task-specific conditioning parameters through a jointly trained task embedding network, encouraging constructive interaction between tasks. Experiments validate the effectiveness of our approach, achieving state-of-the-art results on two challenging multi-task benchmarks, PASCAL-Context and NYUD. Our analysis of the learned task embeddings further indicates a connection to task relationships studied in the recent literature.

[Figure 1 panels: (a) Single-task, (b) Multi-task, (c) TC Multi-task, (d) Our TSNs]
Figure 1: Solutions for multi-task learning. (a) Every task is solved by training an individual network, i.e., using an independent encoder-decoder pair for each task. (b) General multi-task solutions are built on sharing the encoder and maintaining separate decoders for each task. (c) Task-conditional (TC) multi-task solutions [15, 24] are built on sharing partial parameters of the encoder (task-specific modules also exist), and using separate decoders for each task. (d) In the proposed Task Switching Networks (TSNs), all parameters of a single encoder-decoder pair are shared, and a small task embedding network C facilitates switching between different tasks. Best viewed in color.
1. Introduction
The very concept of computer vision is to automatically perform the tasks that a human visual system can do. Even artificial neural networks (ANNs) were designed with inspiration from the biological nervous system, such as the human brain. As opposed to the most successful ANNs, the brain and its visual cortex can perform multiple tasks, such as object, parts, and boundary detection or depth and orientation prediction, without any difficulty. Being able to perform a multitude of such tasks has allowed humans to efficiently conduct complex activities. In the same spirit, real-world applications like autonomous driving, healthcare, agriculture, and manufacturing cannot be addressed by merely seeking perfection in solving individual tasks. It goes without saying that a system capable of performing multiple tasks not only has the potential of being efficient in memory usage, computation, and learning speed, but it may also benefit from complementary tasks.
To tackle multi-task learning (MTL), different solutions have been proposed. Encoder-based methods [18, 27, 23] focus on the encoder, by enhancing the representation capability of architectures so that both shared and task-specific information can be encoded, while decoder-based approaches [47, 44] explore techniques mainly on the decoder part, to better refine the encoder features for specific tasks. Optimization-based methods [6, 17, 37] explicitly target task interference or negative transfer from an optimization perspective, by re-weighting the loss or re-ordering the task learning. In general, these methods follow the structure in Fig. 1 (b). Recently, another direction for MTL emerged, termed task-conditional (TC) multi-tasking [24, 15], shown in Fig. 1 (c). These methods perform a separate pass within the MTL model and activate a set of task-specific modules for each task. The task-specific modules

are used to adapt the network for corresponding tasks. As this setting has many practical use cases [15, 24], our proposed TSNs also follow it and execute one task each time.
Although promising results have been achieved, the existing methods [44, 24, 15] do not scale well with the number of tasks since they require a large number of task-specific parameters (modules). This can be seen in Fig. 1 (b) and (c), where task-specific decoders or modules (in the encoder) scale with the number of tasks. Additionally, even though task-specific modules minimize adverse interactions amongst tasks, they also minimize positive interactions, i.e., inductive bias [5]. Motivated by these observations, we propose Task Switching Networks (TSNs). TSNs share all parameters among all tasks and do not require any task-specific modules (parameters). Hence, our network is simple and has constant size independent of the number of tasks, while still enabling task interactions. We argue that our motivation is also consistent with the widely accepted view in neurobiology, that the visual cortex does not have separate modules for different tasks [20, 26].
More specifically, our task-switching networks solve multiple tasks by switching between them, performing one task at a time, and follow a task-conditional single-encoder-single-decoder architecture. As shown in Fig. 1 (d), the task switching is accomplished by employing a small network to learn task-specific embeddings from task encodings, and the behaviour of the decoder is adapted by conditioning it on those embeddings. In practice, we condition only the decoder (U-Net), in the hope that the encoder learns the concept of ‘thought space’ [36] as it is forced to be task-agnostic. This way, the encoder features can also be reused to efficiently perform multiple tasks in series, which is not possible in encoder-based conditioning [24, 15].
Interestingly, the task embedding network offers some insight into the relationships between tasks. During training, a latent embedding for each task is learned together with its corresponding mapping to the conditioning parameters in each decoder layer. Though there are still many open questions in the study of task relationships and multi-task learning [40], we observe that the structure of our task embeddings resembles the task relationships reported in [48].
To summarize, in this work we study multi-task networks without task-specific parameters, and investigate the behaviour with regards to efficiency, optimization, and accuracy. We state our key contributions as follows.

• We introduce Task-Switching Networks, an efficient yet simple architecture for multi-task learning.
• We demonstrate that conditioning a single shared decoder can outperform multi-decoder methods even on heterogeneous tasks such as segmentation and regression.
• We adopt a small embedding network to learn task conditioning, facilitating optimization and offering insights into relationships between tasks.

2. Related Works
Multi-task learning (MTL). MTL is concerned with learning multiple tasks simultaneously, while exerting shared influence on model parameters. The potential benefits are manifold, and include speed-up of training or inference, higher accuracy, better representations, as well as fewer parameters or higher efficiency. A comprehensive survey on architectures, optimization and other aspects of MTL can be found in [7]. Many MTL methods perform multiple tasks by a single forward pass, using shared trunk [18, 3, 23, 43, 22, 8], cross talk [27], or prediction distillation [47, 50, 51, 44] architectures. A recent work of MTI-Net [44] follows this direction and proposes to utilize the task interactions between multi-scale features. Another stream of MTL methods is based on task-conditional networks [15, 24], which perform a separate forward pass and activate some task-specific modules as well as shared modules, for each task. As mentioned in [15], this setting is useful for many real-world setups. Hence, we follow this direction and propose TSNs. In stark contrast to conditioning the shared encoder, as done in [24, 15, 31, 2, 52, 41], we instead learn an unconditioned encoder, and condition only one single unified decoder for all tasks. To the best of our knowledge, our network is the first MTL method that does not require task-specific branching to multiple decoders, but rather shares all network parameters and a single conditional unified decoder for all tasks.
Conditioning strategy. In the context of MTL, [41] propose a heuristic masking of features to induce partially shared subnetworks for each task. On the other hand, features can be modulated by introducing task-specific projections [52], residual adapters [31, 32], attention mechanisms [24] or parametrized convolutions [15], while the original backbone network is shared among all tasks. Motivated by the successful application of adaptive normalization strategies in the context of domain adaptation [21], image generation [4, 16], style transfer [13], and super-resolution [46], we explore task-conditioned affine projections after instance normalization (IN) [42] for MTL. Inspired by the style-based generator of [16], the affine parameters are generated from a latent vector representing a desired task. Note that a concurrent work of [30] proposes a new task of CompositeTasking by fusing tasks spatially (pixel-wise), based on task-conditioned BatchNorm (BN) [14].
Task relationships and embedding. Knowledge on the relationships between tasks is crucial for many aspects of machine learning, including multi-task, transfer, or meta-learning. Including the seminal study by Zamir et al. [48], many recent works shed light on such task relationships for

transfer learning and multi-tasking [10, 38, 39, 9, 49] by means of computation. Based on the assumption that tasks can be meaningfully taxonomized, this structure can be represented using task embeddings in a high-dimensional space. Nevertheless, such embeddings are mostly explored in meta-learning literature [1, 35, 19], and very little from the multi-task and transfer learning perspective [49]. Note that the task embeddings [1, 35, 19] of meta-learning are indeed inspirational, but using the mechanisms of meta-learning in the context of MTL is not straightforward, because they are based on fundamentally different assumptions. We believe it is worth studying the connection of task relationships, embeddings, and multi-task learning in more detail [7], and we see our work as a step in this direction.

3. Problem Formulation
We begin by formally introducing the multi-task learning problem from the perspective of network architecture. In our formulation we consider sequential multi-tasking, i.e. solving one task per forward pass, as done in recent MTL techniques [24, 15]. First, we give a formal definition of multi-task networks and data sets.

Definition 3.1 (Multi-task network) Given a set of tasks $\mathcal{T}$ to be performed on images $X = [0, 1]^{h \times w \times 3}$. For the sake of simplicity, let the output type $Y = [0, 1]^{h \times w \times c}$ be equal for all tasks. We define $f_\theta : X \times \mathcal{T} \to Y$ to be a multi-tasking network with parameters $\theta$, that performs one given task $\tau$ at a time on a given image. Further, let $\mathcal{F}_\theta$ be the set of all multi-tasking networks with parameter set $\theta$.

Definition 3.2 (Multi-task data set) Let us denote a multi-task data set as $D_\tau = \{(I_n, y_n^\tau)\}_{n=1}^{N}$, with tasks $\tau \in \mathcal{T}$, and $y_n^\tau$ as the ground-truth associated to task $\tau$ for image $I_n$.

Now we can define our goal for MTL as finding a multi-task network with a small set of parameters, that is able to solve all tasks with an accuracy that is close to or better than a single-task baseline. We measure the achievement of this goal by means of the relative performance drop.

Definition 3.3 (Relative performance drop) Given a metric $m_\tau$ to evaluate the performance for task $\tau$, we define the relative performance drop from a baseline $m_\tau^b$ as
$$\Delta_\tau(m_\tau, m_\tau^b) = s_\tau \frac{m_\tau - m_\tau^b}{m_\tau^b},$$
where $s_\tau \in \{-1, 1\}$ indicates if higher values are better, or vice versa.

We can now formalize our problem statement as follows.

Problem 3.4 (Parameter-efficient MTL) Given a validation set of labelled images $D_{\mathcal{T}}$ for tasks set $\mathcal{T}$, with corresponding loss functions $\ell_\tau$, and expected validation losses $\bar{\ell}_\tau^S$ of single-task baselines, we wish to find a multi-tasking network with the least number of parameters, while bounding the relative performance drop by $\Delta_\tau$, as follows.
$$\min_{f \in \mathcal{F}_\theta,\, \theta} |\theta|, \quad \text{s.t.} \quad \mathbb{E}_{(I, y_\tau) \sim D_\tau} \left[ \ell_\tau(f(I, \tau), y_\tau) \right] \le (1 + \Delta_\tau)\, \bar{\ell}_\tau^S, \quad \forall \tau \in \mathcal{T}. \tag{1}$$

While this is fundamentally a problem of architecture search that could be solved using neural architecture search [11] or combinatorial optimization [40], we instead develop our solution from an analysis of shared parameters. With slight abuse of notation, we aim to learn a function $f(I_n, \theta_s, \theta_\tau, \tau) = y_n^\tau$, for all $n \in [1, N]$ and $\tau \in \mathcal{T}$. Here, $\theta_s$ denotes the shared parameters across all tasks, while $\theta_\tau$ represents the task-specific parameters. Let $\theta$ denote the total parameters for $T$ ($|\mathcal{T}|$) tasks, given by
$$\theta = \theta_s \cup \bigcup_{\tau \in \mathcal{T}} \theta_\tau. \tag{2}$$
In addition to learning task-specific conditioning parameters, existing MTL methods [31, 2, 15, 24] also learn one or more task-specific convolution layer(s) after branching out to their respective output head. This is also necessary to cater for the output types of different tasks. For the simplicity of analysis, we assume that the number of parameters for all tasks is a constant, and we have the condition,
$$|\theta_s \cup \theta_\tau| \approx c, \quad \forall \tau \in \mathcal{T}. \tag{3}$$
Combining Eq. 2 and Eq. 3, we get
$$|\theta| = Tc - (T - 1)|\theta_s|. \tag{4}$$
From Eq. 4, it is clear that, for a fixed number of tasks $T$, the total number of parameters of an MTL approach decreases linearly as the number of shared parameters $|\theta_s|$ grows. In the extreme case of the single-task setting, where each task is solved by an individual network, without sharing any parameters across tasks, the total parameters are given by $|\theta| = Tc$. Hence, existing approaches seek to increase the number of shared parameters $\theta_s$, while reducing task interference [18, 52, 24] and maintaining the performance for all tasks. In this paper, we pursue a challenging goal: sharing as many parts of the network as possible across all tasks, up to the point that all parameters are shared, and the network becomes independent of the number of tasks $T$,
$$\theta_\tau = \emptyset, \ \forall \tau \in \mathcal{T} \ \Rightarrow \ c = |\theta_s| = |\theta|. \tag{5}$$
Next, we explain our solution to achieve this goal.

4. Method
In this section, we first present the proposed task switching networks in detail. We further explain task embedding learning using our network.
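
As a quick numerical check of the counting argument in Eqs. 2 to 5 of Section 3, the following minimal Python sketch evaluates the total parameter count for a few sharing levels. The function names and the budget values are illustrative assumptions, not figures from the experiments.

```python
def total_parameters(num_tasks: int, per_task_budget: int, shared: int) -> int:
    """Closed form of Eq. 4: |theta| = T*c - (T-1)*|theta_s|."""
    if not 0 <= shared <= per_task_budget:
        raise ValueError("shared parameters must lie between 0 and c")
    return num_tasks * per_task_budget - (num_tasks - 1) * shared


def total_parameters_by_union(num_tasks: int, per_task_budget: int, shared: int) -> int:
    """Direct count from Eq. 2 under Eq. 3: the shared part once, plus each
    task's disjoint task-specific remainder of size c - |theta_s|."""
    task_specific = per_task_budget - shared
    return shared + num_tasks * task_specific


# Hypothetical budget (assumption): c = 10 units per task, T = 5 tasks.
T, c = 5, 10
for shared in (0, 6, 10):
    closed = total_parameters(T, c, shared)
    # Both ways of counting must agree.
    assert closed == total_parameters_by_union(T, c, shared)
    print(f"|theta_s| = {shared:2d} -> |theta| = {closed}")
```

With nothing shared the count is T·c, which is the single-task case; with everything shared it equals c regardless of T, which is the regime of Eq. 5.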

Figure 2: Task Switching Network overview. Our network performs multi-tasking by switching between tasks using a conditional decoder. Following U-Net [33], our encoder takes the image $I_n$ and extracts features $F_i$ at different layers. As second input, our network takes a task encoding vector $v_\tau$, selecting task $\tau$ to be performed. A small task embedding network $C$ maps each task to a latent embedding $l_\tau$, that conditions the decoder layers along the blue paths, using module $A$ [16]. The output is computed by conditioning and assembling encoder features $F_i$ and decoder features $O_i$ in a bottom up fashion.
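
The conditioning path in Fig. 2 can be illustrated with a minimal pure-Python sketch. It assumes, as described in Section 2, that the decoder is modulated by a task-conditioned affine projection applied after instance normalization. The convolution inside module A and the embedding network C are left out, and all names, shapes, and weight values below are hypothetical.

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one channel (flat list of activations) to zero mean and
    approximately unit variance, using that channel's own statistics."""
    mean = sum(channel) / len(channel)
    var = sum((v - mean) ** 2 for v in channel) / len(channel)
    return [(v - mean) / math.sqrt(var + eps) for v in channel]


def project(latent, weight):
    """Row vector times matrix: latent (d,) @ weight (d x c) -> (c,)."""
    return [sum(x * row[j] for x, row in zip(latent, weight))
            for j in range(len(weight[0]))]


def condition(features, latent, w_gamma, w_beta):
    """Scale and shift each normalized channel with coefficients generated
    from the task latent vector (one gamma and one beta per channel)."""
    gamma = project(latent, w_gamma)
    beta = project(latent, w_beta)
    return [[g * v + b for v in instance_norm(ch)]
            for ch, g, b in zip(features, gamma, beta)]


# Hypothetical shared features: 2 channels, 4 spatial positions each.
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
# Hypothetical projection weights for a 2-dimensional latent space.
w_gamma = [[2.0, 3.0], [1.0, 1.0]]
w_beta = [[0.5, -0.5], [1.0, 1.0]]

# Two tasks differ only in the latent vector fed to the shared weights.
out_task_a = condition(features, [1.0, 0.0], w_gamma, w_beta)
out_task_b = condition(features, [0.0, 1.0], w_gamma, w_beta)
```

The same features yield different outputs for the two latent vectors: only the latent input changes between tasks, while every weight is shared.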

4.1. Task Switching Network
As laid out in the introduction, we design our Task Switching Networks on the premise that all network parameters should be shared, in order to provide an efficient solution to Problem 3.4. However, as illustrated in Fig. 1, parameter sharing in MTL techniques in the literature is limited to the encoder and parts of the decoder, and omits the potential of sharing the complete decoder. Moreover, state-of-the-art methods [24, 15] switch tasks by activating task-specific modules. To avoid such additional parameters for each task, we introduce task switching, by taking a task condition as an additional input to the network. To this end, we associate each task $\tau$ with a task-condition $v_\tau \in \mathbb{R}^d$.
The input to our model therefore is a pair of image and task-encoding vector, i.e., $(I_n, v_\tau)$, which represents conducting task $\tau$ on image $I_n$. Our backbone encoder takes the image $I_n$ and extracts features $F_i$ at different layers. A small task embedding network $C$ with a few fully connected layers maps each task to a latent embedding $l_\tau$, that is used to condition the decoder layers using a module similar to StyleGAN [16]. The output is then computed by conditioning and assembling encoder features $F_i$ and decoder features $O_i$ bottom-up along the feature pyramid.
In our discussion, the dense prediction tasks (i.e., edge detection, and semantic segmentation) are considered if not specifically stated, following [24, 15]. In the following, we describe the architectural details of our network.

4.1.1 Network Architecture
As shown in Fig. 2, our network is based on a simple U-Net architecture [33]. The encoder is a ResNet-based [12] backbone pre-trained on ImageNet [34], following existing MTL approaches [24, 15]. Let layer 1 denote the first convolution layer with kernel size of (7 × 7), and layer 2 to layer 5 represent conv2_x to conv5_x (notation in [12]) of the backbone, respectively. The outputs of layer 1 to layer 5 are $F_1$ to $F_5$. From layer $j$ to layer $j + 1$, the spatial resolution of the encoder feature maps is reduced by half.
For the decoder, we follow a similar structure as U-Net [33], collecting features from layer 5 to layer 1. Specifically, at layer $j$ ($j \le 4$), the corresponding feature maps $F_j$ from the encoder first pass through a conditional convolution module $A$, are then concatenated with the (upsampled) features from layer $j + 1$, and finally pass through another instance of $A$. For the highest layer ($j = 5$), the feature maps go through module $A$ only once since there is no higher layer. As shown in Fig. 2, module $A$ transforms input features to new features based on the embedding vector $l_\tau$, representing the specific task. Let $O_j$ be the output of the decoder from layer $j$, which is given by
$$O_j = \begin{cases} A([U(O_{j+1}), A(F_j, l_\tau)], l_\tau), & \text{for } j \le 4, \\ A(F_j, l_\tau), & \text{for } j = 5, \end{cases} \tag{6}$$
where $[\cdot, \cdot]$ denotes the concatenation of two feature tensors along the channel dimension and $U(\cdot)$ is an upsampling operation, which is omitted in Fig. 2 for simplicity.
The output feature $O_1$ from decoder layer 1 has the same resolution as the original image and is passed to a convolution layer to make predictions for different tasks. As discussed previously, different tasks can either share a common convolution layer (i.e. a single head shared by all tasks), or have separate convolution layers (different heads for different tasks). In the interest of avoiding task-specific parameters, we opt for the choice that a single head is used by all tasks. To this end, we simply choose the number of output channels as the largest number of channels needed for different tasks. Take PASCAL-Context [28] (the most

popular benchmark in MTL) as an example, the number of output channels needed among edge detection, parts segmentation, semantic segmentation, normals, and saliency detection are 1, 7, 21, 3, and 1, respectively. So we choose 21 output channels, which fits semantic segmentation. For other tasks, we simply conduct adaptive average pooling along the channels to obtain the predictions matching the corresponding tasks. The elegance of sharing a head across tasks is that exactly a single and neat network is used to solve all tasks. Our experiments in fact support this approach, since we found that sharing one head performs competitively to using separate heads for different tasks.
In the following, we describe the two key components of task switching networks, that facilitate the conditioning.
Conditional Convolution Module. The goal of this module (block $A$ in Fig. 2) is to adjust feature representations from the encoder, which are shared by all tasks, to new features that serve the desired task. As mentioned above, to conduct task $\tau$, the corresponding task-condition vector $v_\tau$ is transformed by the embedding network $C$ to obtain the task-specific latent vector $l_\tau$, which is then passed to module $A$, inspired by [16]. Let $x \in \mathbb{R}^{1 \times c_1 \times h \times w}$ denote the input feature to module $A$, where $c_1$, $h$ and $w$ represent number of channels, height and width of the feature map, respectively. Module $A$ then works as follows. First, $x$ is processed by a convolution layer $\hat{x} = x * W$ with filter weights $W$, generating $\hat{x} \in \mathbb{R}^{1 \times c_2 \times h \times w}$. At the same time, $l_\tau$ is transformed by two fully connected layers with weight matrices $W_\gamma \in \mathbb{R}^{d \times c_2}$ and $W_\beta \in \mathbb{R}^{d \times c_2}$, to form the normalization coefficients $\gamma \in \mathbb{R}^{1 \times c_2}$ and $\beta \in \mathbb{R}^{1 \times c_2}$, for the subsequent AdaIN. For feature $\hat{x}$, AdaIN performs the normalization following,
$$\mathrm{AdaIN}(\hat{x}, \beta, \gamma) = \gamma \frac{(\hat{x} - \mu)}{\sqrt{\sigma^2}} + \beta, \tag{7}$$
where $\mu$ and $\sigma^2$ are the mean and variance of $\hat{x}$, which are statistics computed according to instance normalization [42]. In summary, module $A$ performs the operation
$$A(x, l_\tau) = l_\tau W_\gamma \frac{(x * W - \mu)}{\sqrt{\sigma^2}} + l_\tau W_\beta. \tag{8}$$
Task embedding network. Recall that each task is associated with a unique task-condition vector $v_\tau \in \mathbb{R}^d$, and the TSNs switch between tasks by feeding different $v_\tau$ to the task embedding network $C$, shown in the left of Fig. 2. The embedding network $C : \mathbb{R}^d \to \mathbb{R}^d$ learns to embed the task $\tau$ in a latent space $l_\tau = C(v_\tau)$, from which the AdaIN coefficients of Eq. 7 are generated for each module $A$. In principle, there are many choices for the initialization of these vectors. Specifically, we investigate embedding dimensions $d$, with orthogonal $v_\tau$ (binary vector) given by
$$v_{\tau_1}^\top v_{\tau_2} = \begin{cases} \frac{d}{T}, & \text{if } \tau_1 = \tau_2, \\ 0, & \text{otherwise}, \end{cases} \qquad v_{\tau_1}, v_{\tau_2} \in \mathbb{R}^d, \tag{9}$$
and Gaussian random vectors $v_\tau \sim N(0_d, \mathrm{diag}(1_d))$ [16]. The results are reported in §5.1.

Table 1: Task switching performance. Our TSNs perform competitively with single tasking and multi-tasking baselines, with substantially smaller model sizes. Optimal performance is observed when all parameters are shared via our task embedding module (INs+TE).
Method          Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑  ∆m %↓  # params
Single-task     71.3   64.3     55.3    16.3      62.9  -      88.7M
Multi-decoder   72.2   55.4     55.5    16.8      59.1  4.32   43.9M
Ours (BNs)      71.6   55.9     54.1    16.7      60.0  4.38   17.7M
Ours (INs)      70.7   62.8     54.6    16.8      63.1  1.43   17.7M
Ours (INs+TE)   70.6   64.2     55.0    16.3      63.3  0.30   18.3M

5. Experiments
Overview. Following existing works [24, 15], we focus our MTL experiments on dense prediction tasks. In particular, we use the PASCAL-Context [28] dataset, which contains a total of 10,103 images, for the five tasks of edge detection (Edge), semantic segmentation (SemSeg), human parts segmentation (Parts), surface normals (Normals), and saliency detection (Sal). We further evaluate and compare our approach on the NYUD dataset [37], which is comprised of 1,449 images of indoor scenes and comes with annotations for the four tasks of edge detection, semantic segmentation, surface normals, and depth estimation (Depth).
Evaluation metric. We use standard evaluation metrics, following [24, 15, 45]. Specifically, to evaluate the predictive performance for each task, we use the optimal dataset F-measure (odsF) [25] for edge detection, mean intersection over union (mIoU) for semantic segmentation, human parts segmentation, and saliency, mean error (Error) for surface normals, and root mean square error (RMSE) for depth. In order to compare to a multi-task approach $m$, we average the relative performance drop (see Definition 3.3), with respect to the single-task baseline $b$ over all tasks: $\Delta_m = \frac{1}{T} \sum_{\tau \in \mathcal{T}} \Delta_\tau(p_{m,\tau}, p_{b,\tau})$, where $p_{m,\tau}$ and $p_{b,\tau}$ are the metrics for task $\tau$ for the multi-task method $m$ and for single-task baseline $b$, respectively.
Network configuration. We employ the ResNet-18 backbone with the architecture introduced in §4.1 for all of our experiments, unless stated otherwise. The task embedding network $C$ contains 8 fully connected layers of width $d$. Our method is implemented in PyTorch [29] and experiments were conducted on NVIDIA GPUs.

5.1. Ablation study
Study on module sharing. We compare our method with various baselines in Table 1. All approaches use the same network architecture and are trained with the same hyper-parameters, to ensure fair comparisons. The details of all

considered baselines are as follows. Single-task means that each task is trained with an individual network, shown in Fig. 1 (a). Multi-decoder represents a simple multi-task solution where the encoder is shared but decoders are task-specific. We further compare to our architecture without the task embedding (TE) network, by using task-specific batch (BNs) and instance (INs) normalizations. We see that the Multi-decoder model, sharing a common encoder but using different decoders, does not perform well, which is consistent with MTL literature [15]. Moreover, it has a large number of parameters (43.9 million). Task-specific BNs on the other hand performs only slightly worse than Multi-decoder, with a substantially smaller model size. Interestingly, Task-specific INs performs much better than Task-specific BNs. The results for Task-specific BNs and INs show that simply adapting features to different tasks by affine transformation in the decoder is able to give reasonable performance for multi-task learning. Our method, with the task embedding network to jointly learn the (affine transformation) coefficients for AdaIN, outperforms task-specific INs by 1.13% in terms of average performance drop $\Delta_m$. It demonstrates that learning the normalization coefficients jointly through the task embedding is better than learning them separately for each task. We also observed that during training, our method converges much faster than Task-specific INs and BNs. Moreover, our method only increases the size of the model by a small margin, because the task embedding network $C$ in our model is very small.

[Figure 3 panels: (a) Zamir et al. [48], (b) Ours, (c) Dwivedi et al. [10], (d) Song et al. [39]]
Figure 3: Task embedding relationships. We analyse the similarity of our task embeddings after training our network with 20 tasks on a small subset of the Taskonomy dataset [48]. The hierarchical clustering of task affinities from our learned embeddings (b), reveals an interesting similarity to the relationships found by the compared methods (a,c,d).

Task embedding network. We study the impact of two different choices for the task-condition vector $v_\tau$, as described in §4. The results are shown in Table 2. For orthogonal encodings, we observe that the performance of our method is robust towards the embedding dimensionality $d$, while performing best at $d = 100$. Gaussian encodings perform as well as their orthogonal counterparts for dimensionality below 100, and tend to be slightly worse above. We conjecture that under Gaussian encoding, the distance between task-condition vectors for two tasks is random (close or far), which is not desirable. However, this study demonstrates that our conditioning is robust towards these hyper-parameters. In our experiments, we choose orthogonal encoding with a dimension of 100 for the PASCAL-Context dataset (5 tasks). For the NYUD dataset (4 tasks), we use a dimension of 120 (divisible by 4).

Table 2: Impact of task embedding strategy. The designed task embedding is robust against different choices for the task encoding $v_\tau$, as well as for the dimensionality $d$ of the embedding network $C$.
Type        d    Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑  ∆m %↓
Orthogonal  50   70.8   63.6     55.2    16.3      63.4  0.32
Orthogonal  100  70.6   64.2     55.0    16.3      63.3  0.30
Orthogonal  150  70.5   64.3     54.9    16.3      63.2  0.38
Orthogonal  250  70.5   63.8     54.8    16.4      63.1  0.75
Gaussian    50   70.8   64.1     55.1    16.3      63.0  0.30
Gaussian    100  70.3   63.2     54.4    16.5      63.1  1.22
Gaussian    150  70.7   63.6     54.8    16.3      63.4  0.44

More network architecture. Following [24, 15], we study the robustness of our method against more network architectures (ResNet-34 and ResNet-101). The results are shown in Table 3. As expected, the absolute performance on all tasks improves when using larger networks. Furthermore, our method performs closely to the corresponding single-task baselines for different backbones. Note that single-task baselines have 5 times more parameters than ours. The fact that our approach achieves similarly low average performance drop ($\Delta_m$ %) across various networks, demonstrates its robustness and effectiveness in reducing negative interference between tasks.

Table 3: Impact of network architecture. The designed task embedding is robust against various backbones.
Backbone    Method       Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑  ∆m %↓
ResNet-18   Single-task  71.3   64.3     55.5    16.3      62.9  -
ResNet-18   Ours         70.6   64.2     55.0    16.3      63.3  0.30
ResNet-34   Single-task  72.7   68.6     58.7    16.0      64.4  -
ResNet-34   Ours         71.8   67.6     58.0    16.1      64.3  0.99
ResNet-101  Single-task  74.2   70.7     62.1    15.8      65.0  -
ResNet-101  Ours         73.3   70.9     61.0    15.9      64.5  0.93

5.2. Comparison to state-of-the-arts
The state-of-the-art comparisons for PASCAL-Context are shown in Table 4. We compare our method to

Table 4: Comparison with state-of-the-art. Our TSNs
outperform different multi-decoder methods on PASCAL-
Context, with only a single decoder and substantially fewer
parameters.
Method Edge↑ SemSeg↑ Parts↑ Normals↓ Sal↑ ∆m %↓ # params
Single-task 71.3 64.3 55.5 16.3 62.9 - 88.7M
Series RA [31] 72.0 55.1 54.6 17.0 58.7 5.21 51.7M
Parallel RA [32] 72.1 55.9 55.0 17.0 58.6 4.81 50.8M
RCM [15] 72.3 56.6 55.8 16.7 59.3 3.62 51.7M
Ours 70.6 64.2 55.0 16.3 63.3 0.30 18.3M

Task-conditional (TC) multi-task methods: Series Residual Adapter (Series RA) [31], Parallel Residual Adapter (Parallel RA) [32], and RCM [15], since these approaches follow the same direction of MTL as mentioned previously. For a fair comparison, both Series RA and Parallel RA are implemented in our setting. The performance of RCM is obtained by adapting the official implementation to our setting (U-Net architecture). We observe that our method achieves the best performance among those existing methods in terms of average performance drop, with respect to our single-task baseline. We report the number of parameters for each method, and show that our method uses the fewest parameters among the compared methods. Specifically, our method outperforms RCM by 3.32% in average performance drop (0.30 vs. 3.62) and only uses 18.3M parameters, compared to 51.7M for RCM.

In fact, our main motivation in §3 is driven by efficient parameter utilization. In Fig. 4, we can see how the number of parameters |θm| of each method scales with the number of tasks T. By design, our TSNs have a constant number of parameters irrespective of T. On the other hand, other methods (RCM, Multi-decoder, etc.) scale linearly with T. For instance, when T = 9, our method still has 18.3M parameters, whereas RCM needs 84.0M and Single-task needs 159.6M, which makes these methods impractical in cases where many tasks are required and resources are limited.

Figure 4: Model parameters scaling. While the number of parameters in TSNs is independent of T, it scales in a linear fashion for the compared MTL methods.

Figure 5: Task embedding similarity, shown for (a) PASCAL-Context (SemSeg, Parts, Normals, Edge, Sal) and (b) NYUD (SemSeg, Edge, Normals, Depth). We observe that similar tasks, such as body parts and semantic segmentation in PASCAL-Context (a), or depth and normals in NYUD (b), cluster together in the learned embedding space, while 3D tasks are separated from 2D tasks.
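The two numerical claims of Section 5.2 can be sanity-checked from the reported values alone. The sketch below is not the authors' evaluation code. It assumes that ∆m is the mean relative drop with respect to the single-task baseline, with the sign flipped for metrics where lower is better (Normals), and that the parameter counts in Table 4 correspond to T = 5. The names `average_drop` and `linear_params` are illustrative.

```python
# Values copied from Table 4 (PASCAL-Context).
TASKS = ["Edge", "SemSeg", "Parts", "Normals", "Sal"]
LOWER_IS_BETTER = {"Normals"}  # Normals is an error metric

TABLE4 = {
    "Single-task": [71.3, 64.3, 55.5, 16.3, 62.9],
    "Series RA":   [72.0, 55.1, 54.6, 17.0, 58.7],
    "Parallel RA": [72.1, 55.9, 55.0, 17.0, 58.6],
    "RCM":         [72.3, 56.6, 55.8, 16.7, 59.3],
    "Ours":        [70.6, 64.2, 55.0, 16.3, 63.3],
}


def average_drop(method, baseline="Single-task"):
    """Mean relative drop in percent w.r.t. the baseline (assumed definition).

    Positive values mean the method is worse than the baseline on average.
    """
    drops = []
    for task, m, b in zip(TASKS, TABLE4[method], TABLE4[baseline]):
        rel = (m - b) / b          # relative change of the raw metric
        if task not in LOWER_IS_BETTER:
            rel = -rel             # for accuracy-like metrics, a decrease is a drop
        drops.append(rel)
    return 100.0 * sum(drops) / len(drops)


def linear_params(t, p_at_5, p_at_9):
    """Parameter count (in millions) at T = t, assuming a straight line
    through the two reported points at T = 5 and T = 9."""
    slope = (p_at_9 - p_at_5) / (9 - 5)
    return p_at_5 + slope * (t - 5)


if __name__ == "__main__":
    for name in ("Series RA", "Parallel RA", "RCM", "Ours"):
        print(f"{name:12s} recomputed drop: {average_drop(name):.2f}%")
    # RCM: 51.7M at T = 5 and 84.0M at T = 9; TSNs stay at 18.3M for any T.
    print(f"RCM at T = 20 (extrapolated): {linear_params(20, 51.7, 84.0):.1f}M")
```

Because Table 4 lists rounded metrics, the recomputed drops only approximate the ∆m column (about 0.28 for Ours against the reported 0.30). Under the linear assumption, RCM grows by roughly 8.1M parameters per added task, while the TSN count does not change.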
We further validate our method on the NYUD dataset. The results are shown in Table 5. Similarly, we observe that our method outperforms existing approaches by clear margins, which further demonstrates the general effectiveness of the proposed task-switching network. The qualitative comparisons are shown in Fig. 6.

Table 5: Comparison with state-of-the-art. On the four tasks of the NYUD dataset, our TSNs outperform the compared methods, with superior parameter efficiency.

Method            Edge↑   SemSeg↑   Normals↓   Depth↓   ∆m %↓
Single-task       67.7    26.6      26.2       74.0     -
Series RA [31]    68.5    18.9      29.0       84.1     10.39
Parallel RA [32]  68.5    23.1      29.0       84.2     7.25
RCM [15]          68.7    23.2      28.4       82.1     6.14
Ours              67.9    25.9      26.1       72.7     0.03

5.3. Task relationships

Learning a task embedding network jointly with the MTL objective naturally raises the question of whether the learned task embeddings carry meaningful information about task relationships. To analyse this question empirically, we compute the task affinity between two tasks as follows,

$$A(\tau_i, \tau_j) = 1 - \frac{l_{\tau_i}^{\top} l_{\tau_j}}{\lVert l_{\tau_i} \rVert \, \lVert l_{\tau_j} \rVert}. \tag{10}$$

We visualize our found affinities for the PASCAL-Context and NYUD datasets in Fig. 5, together with an illustration of the hierarchical clustering. We make two interesting observations. First, there seems to be a clear distinction between 2D and 3D tasks in the embedding, with depth and normals being close in NYUD, as well as normals getting separated from segmentation tasks in both datasets. Second, in PASCAL, the cluster hierarchy appears to be correlated

Figure 6: Qualitative results. We compare our model with the baseline (Task-specific INs) visually. Task interference is observed in the baseline, where detected edges can appear in the saliency predictions. Our method resolves this and outperforms the baseline in high-level tasks such as semantic segmentation, parts, and saliency detection. Best viewed with zooming.
with “semanticness”, connecting first parts and semantic segmentation, then saliency, edges, and finally normals.

We further investigate the task embeddings on the 20 tasks of the Taskonomy [48] dataset, which is intended for finding task relationships. In Fig. 3, we compare our found task relationships with the ones established by Zamir et al. [48], as well as with two recent methods [39, 10]. Interestingly, there appears to be a striking similarity between the found “taskonomies”. Although not perfect, we roughly observe the trend of 2D and 3D tasks clustering together, as well as a separation of low-level (e.g. denoising, inpainting) from high-level (semantic segmentation, scene classification) tasks. Note that our method establishes task relationships much more efficiently than the compared methods [48, 39, 10]. Specifically, these approaches need separate models trained for individual tasks (i.e., 20 separate models for 20 tasks). The method proposed in Taskonomy [48] then performs transfer learning between different tasks to find the task similarities, while both RSA [10] and DEPARA [39] conduct pairwise comparisons among deep features extracted from a certain number of images. In contrast, our approach uses only a single unified model and obtains the task similarities by simply computing the affinities between task embeddings.

We hypothesize that our embeddings implicitly transfer knowledge between tasks in the embedding space, leading to the results of Tables 1 and 4. If two tasks require similar features, it is favorable for them to share certain patterns in the conditioning, and therefore to lie closer together in the embedding space. From the experimental results, we can see that this behaviour is further encouraged by the limited capacity of the embedding network.

5.4. Discussion

Test-time parameters. In Table 4 and Table 5, we report the number of parameters of our TSNs. To maximize memory and computational efficiency, however, we can convert our task-switching network into a task-conditioned network by computing the AdaIN parameters once and storing them with the model. In that case, the number of parameters drops from 18.3M to 17.7M, corresponding to the size of our IN baseline in Table 1. From this perspective, our task embedding can be interpreted as an additional inductive bias for MTL.

Architecture. We chose the U-Net architecture for simplicity, together with the ResNet-18 backbone, to demonstrate the idea of task switching networks and its behaviour and performance with regard to recent MTL methods. Applying the TSN principle to other more powerful or more efficient architectures, backbones, decoders, or conditioning strategies is left for future work.

6. Conclusion

In this paper, we introduce the first approach for multi-task learning that uses only a single encoder and decoder architecture. By design, our Task Switching Networks offer a substantial advantage in terms of simplicity and parameter efficiency. This is achieved by sharing the complete set of network parameters among all tasks and using a conditioning network to learn task-specific latent vectors (embeddings), which then adapt the decoder for the corresponding tasks. As demonstrated in our experiments, our proposed task switching strategy improves MTL performance by learning the task embeddings jointly with all tasks, and offers a new perspective on multi-task learning through the lens of task embeddings. Our experiments further validate the utility and efficiency of the proposed framework, which outperforms state-of-the-art multi-decoder methods on standard benchmark datasets with far fewer parameters, under fair comparisons. We also show interesting findings on task relationships using the learnt task embeddings. To conclude, we believe that further investigation into the concept of task embeddings for multi-task learning will be an interesting topic for future work.

Acknowledgements

This work was partly supported by Specta AI.
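As an illustration of the task affinity in Eq. (10) of Section 5.3, the following minimal sketch computes A for a set of embeddings and finds the closest pair of tasks, which would be the first merge in an agglomerative clustering. The embedding values are made up for illustration and are not learned values from the paper. The helper names `task_affinity` and `closest_pair` are hypothetical.

```python
import math


def task_affinity(l_i, l_j):
    """Eq. (10): one minus the cosine similarity of two task embeddings."""
    dot = sum(a * b for a, b in zip(l_i, l_j))
    norm_i = math.sqrt(sum(a * a for a in l_i))
    norm_j = math.sqrt(sum(b * b for b in l_j))
    return 1.0 - dot / (norm_i * norm_j)


def closest_pair(embeddings):
    """Return (A, task_a, task_b) for the pair of distinct tasks with the smallest A."""
    names = list(embeddings)
    pairs = [
        (task_affinity(embeddings[a], embeddings[b]), a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
    ]
    return min(pairs)


# Toy 3-D embeddings, for illustration only.
toy = {
    "SemSeg":  [1.0, 0.9, 0.0],
    "Parts":   [0.9, 1.0, 0.1],
    "Normals": [0.0, 0.1, 1.0],
}

if __name__ == "__main__":
    for a in toy:
        print(a, [round(task_affinity(toy[a], toy[b]), 3) for b in toy])
    print("closest pair:", closest_pair(toy)[1:])
```

As Eq. (10) is written, A is close to 0 for embeddings that point in the same direction and grows as they diverge, so tasks that cluster together are those with small A. With the toy values above, SemSeg and Parts form the closest pair and Normals sits apart.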

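The test-time conversion described in Section 5.4 (computing the AdaIN parameters once and storing them with the model) can be sketched as follows. This is a toy stand-in and not the authors' architecture. The single linear layer `embedding_net`, the one-hot task codes, the channel count, and the random weights are all assumptions made for illustration.

```python
import math
import random


def embedding_net(task_vec, weights):
    """Hypothetical conditioning network: one linear layer that maps a task
    code to per-channel AdaIN coefficients (gamma, beta)."""
    out = [sum(w * t for w, t in zip(row, task_vec)) for row in weights]
    half = len(out) // 2
    return out[:half], out[half:]


def adain(channels, gamma, beta, eps=1e-5):
    """Instance-normalize each channel, then apply the task-specific affine map."""
    result = []
    for x, g, b in zip(channels, gamma, beta):
        mean = sum(x) / len(x)
        var = sum((v - mean) ** 2 for v in x) / len(x)
        result.append([g * (v - mean) / math.sqrt(var + eps) + b for v in x])
    return result


random.seed(0)
NUM_CHANNELS, NUM_TASKS = 4, 3
weights = [[random.gauss(0, 1) for _ in range(NUM_TASKS)]
           for _ in range(2 * NUM_CHANNELS)]
task_vectors = {t: [1.0 if i == t else 0.0 for i in range(NUM_TASKS)]
                for t in range(NUM_TASKS)}

# Task-conditioned variant: evaluate the conditioning network once per task
# and keep only the resulting lookup table at test time.
cached = {t: embedding_net(v, weights) for t, v in task_vectors.items()}

features = [[random.random() for _ in range(6)] for _ in range(NUM_CHANNELS)]
on_the_fly = adain(features, *embedding_net(task_vectors[1], weights))
from_cache = adain(features, *cached[1])

if __name__ == "__main__":
    print("identical outputs:", on_the_fly == from_cache)
```

Once the table `cached` is stored, the conditioning network is no longer needed at inference. In the paper's numbers, this is the step that accounts for the reduction from 18.3M to 17.7M parameters.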
References

[1] A. Achille, Michael Lam, Rahul Tewari, A. Ravichandran, Subhransu Maji, Charless C. Fowlkes, Stefano Soatto, and P. Perona. Task2vec: Task embedding for meta-learning. ICCV, 2019.
[2] Hakan Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv, 2017.
[3] Felix J. S. Bragman, Ryutaro Tanno, Sébastien Ourselin, D. Alexander, and M. Cardoso. Stochastic filter groups for multi-task cnns: Learning specialist and generalist convolution kernels. ICCV, 2019.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv, 2019.
[5] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
[6] Z. Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv, 2018.
[7] M. Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv, 2020.
[8] C. Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. ICCV, 2017.
[9] Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, and Gemma Roig. Duality diagram similarity: a generic framework for initialization selection in task transfer learning. arXiv, 2020.
[10] Kshitij Dwivedi and Gemma Roig. Representation similarity analysis for efficient task taxonomy & transfer learning. CVPR, 2019.
[11] Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, and Wei Liu. Mtl-nas: Task-agnostic neural architecture search towards general-purpose multi-task learning. 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11540–11549, 2020.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
[13] X. Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. ICCV, 2017.
[14] S. Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
[15] Menelaos Kanakis, David Bruggemann, Suman Saha, Stamatios Georgoulis, Anton Obukhov, and Luc Van Gool. Reparameterizing convolutions for incremental multi-task learning without task interference. ECCV, 2020.
[16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. CVPR, 2019.
[17] Alex Kendall, Yarin Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR, 2018.
[18] Iasonas Kokkinos. Ubernet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CVPR, 2017.
[19] L. Lan, Zhenguo Li, X. Guan, and P. Wang. Meta reinforcement learning with task embedding and shared policy. arXiv, 2019.
[20] Wu Li, V. Piëch, and C. Gilbert. Perceptual learning and top-down influences in primary visual cortex. Nature Neuroscience, 7:651–657, 2004.
[21] Yanghao Li, Naiyan Wang, J. Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognit., 80:109–117, 2018.
[22] Shikun Liu, Edward Johns, and A. Davison. End-to-end multi-task learning with attention. CVPR, 2019.
[23] Y. Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, T. Javidi, and R. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. CVPR, 2017.
[24] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. CVPR, 2019.
[25] David R Martin, Charless C Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. TPAMI, 26(5):530–549, 2004.
[26] Justin N. J. McManus, W. Li, and C. Gilbert. Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108:9739–9746, 2011.
[27] I. Misra, Abhinav Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. CVPR, 2016.
[28] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. CVPR, 2014.
[29] Adam Paszke, S. Gross, Francisco Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, E. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, B. Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. arXiv, abs/1912.01703, 2019.
[30] Nikola Popovic, Danda Pani Paudel, Thomas Probst, Guolei Sun, and Luc Van Gool. Compositetasking: Understanding images by spatial composition of tasks. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6870–6880, 2021.
[31] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NeurIPS, 2017.
[32] Sylvestre-Alvise Rebuffi, Hakan Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. CVPR, 2018.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.

[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[35] Andrei A. Rusu, D. Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv, 2019.
[36] Holger Schwenk and M. Douze. Learning joint multilingual sentence representations with neural machine translation. Rep4NLP@ACL, 2017.
[37] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. ECCV, 2012.
[38] J. Song, Yixin Chen, X. Wang, Chengchao Shen, and Mingli Song. Deep model transferability from attribution maps. NeurIPS, 2019.
[39] J. Song, Yixin Chen, Jingwen Ye, X. Wang, Chengchao Shen, Feng Mao, and Mingli Song. Depara: Deep attribution graph for deep knowledge transferability. CVPR, 2020.
[40] Trevor Scott Standley, A. Zamir, Dawn Chen, L. Guibas, Jitendra Malik, and S. Savarese. Which tasks should be learned together in multi-task learning? ICML, 2020.
[41] Gjorgji Strezoski, Nanne van Noord, and M. Worring. Many task learning with task routing. ICCV, 2019.
[42] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv, 2016.
[43] Simon Vandenhende, Bert De Brabandere, and L. Gool. Branched multi-task networks: Deciding what layers to share. BMVC, 2019.
[44] Simon Vandenhende, S. Georgoulis, and L. Gool. Mti-net: Multi-scale task interaction networks for multi-task learning. ECCV, 2020.
[45] Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. Mti-net: Multi-scale task interaction networks for multi-task learning. ECCV, 2020.
[46] Xintao Wang, K. Yu, C. Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. CVPR, 2018.
[47] D. Xu, Wanli Ouyang, X. Wang, and N. Sebe. Pad-net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. CVPR, 2018.
[48] A. Zamir, Alexander Sax, William Bokui Shen, L. Guibas, Jitendra Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. CVPR, 2018.
[49] Y. Zhang, Y. Wei, and Qiang Yang. Learning to multitask. arXiv, 2018.
[50] Z. Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. Joint task-recursive learning for semantic segmentation and depth estimation. ECCV, 2018.
[51] Z. Zhang, Zhen Cui, Chunyan Xu, Yan Yan, N. Sebe, and J. Yang. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. CVPR, 2019.
[52] Xiangyun Zhao, Haoxiang Li, Xiaohui Shen, Xiaodan Liang, and Ying Wu. A modulation module for multi-task learning with applications in image retrieval. ECCV, 2018.
