2021 - Task Switching Network For Multi-Task Learning - Sun Et Al
Guolei Sun¹, Thomas Probst¹, Danda Pani Paudel¹, Nikola Popovic¹, Menelaos Kanakis¹, Jagruti Patel¹, Dengxin Dai¹,², Luc Van Gool¹
¹Computer Vision Laboratory, ETH Zurich, Switzerland
²MPI for Informatics, Germany
[email protected]
Abstract

We introduce Task Switching Networks (TSNs), a task-conditioned architecture with a single unified encoder/decoder for efficient multi-task learning. Multiple tasks are performed by switching between them, performing one task at a time. TSNs have a constant number of parameters irrespective of the number of tasks. This scalable yet conceptually simple approach circumvents the overhead and intricacy of task-specific network components in existing works. In fact, we demonstrate for the first time that multi-tasking can be performed with a single task-conditioned decoder. We achieve this by learning task-specific conditioning parameters through a jointly trained task embedding network, encouraging constructive interaction between tasks. Experiments validate the effectiveness of our approach, achieving state-of-the-art results on two challenging multi-task benchmarks, PASCAL-Context and NYUD. Our analysis of the learned task embeddings further indicates a connection to task relationships studied in the recent literature.

Figure 1: Solutions for multi-task learning. (a) Single-task: every task is solved by training an individual network, i.e., using an independent encoder-decoder pair for each task. (b) Multi-task: general multi-task solutions are built on sharing the encoder and maintaining separate decoders for each task. (c) Task-conditional (TC) multi-task solutions [15, 24] are built on sharing partial parameters of the encoder (task-specific modules also exist), and using separate decoders for each task. (d) In the proposed Task Switching Networks (TSNs), all parameters of a single encoder-decoder pair are shared, and a small task embedding network C facilitates switching between different tasks. Best viewed in color.
1. Introduction
The very concept of computer vision is to automatically perform the tasks that a human visual system can do. Artificial neural networks (ANNs) were likewise designed with inspiration from biological nervous systems such as the human brain. As opposed to the most successful ANNs, the brain and its visual cortex can perform multiple tasks – such as object, parts, and boundary detection, or depth and orientation prediction – without any difficulty. Being able to perform a multitude of such tasks has allowed humans to efficiently conduct complex activities. In the same spirit, real-world applications like autonomous driving, healthcare, agriculture, and manufacturing cannot be addressed by merely perfecting solutions to individual tasks. It goes without saying that a system capable of performing multiple tasks is not only potentially more efficient in memory usage, computation, and learning speed, but may also benefit from complementary tasks.

To tackle multi-task learning (MTL), different solutions have been proposed. Encoder-based methods [18, 27, 23] focus on the encoder, enhancing the representation capability of architectures so that both shared and task-specific information can be encoded, while decoder-based approaches [47, 44] explore techniques mainly on the decoder side, to better refine the encoder features for specific tasks. Optimization-based methods [6, 17, 37] explicitly target task interference, or the negative transfer issue, from an optimization perspective, by re-weighting the losses or re-ordering the task learning. In general, these methods follow the structure of Fig. 1 (b). Recently, another direction for MTL has emerged, termed task-conditional (TC) multi-tasking [24, 15], shown in Fig. 1 (c). These methods perform a separate pass within the MTL model and activate a set of task-specific modules for each task.
The task-specific modules are used to adapt the network for the corresponding tasks. As this setting has many practical use cases [15, 24], our proposed TSNs also follow it and execute one task at a time.

Despite the promising results achieved, the existing methods [44, 24, 15] do not scale well with the number of tasks, since they require a large number of task-specific parameters (modules). This can be seen in Fig. 1 (b) and (c), where task-specific decoders or modules (in the encoder) scale with the number of tasks. Additionally, even though task-specific modules minimize adverse interactions amongst tasks, they also minimize positive interactions, i.e., inductive bias [5]. Motivated by these observations, we propose Task Switching Networks (TSNs). TSNs share all parameters among all tasks and do not require any task-specific modules (parameters). Hence, our network is simple and has constant size independent of the number of tasks, while still enabling task interactions. We argue that our motivation is also consistent with the widely accepted view in neurobiology that the visual cortex does not have separate modules for different tasks [20, 26].

More specifically, our task-switching networks solve multiple tasks by switching between them – performing one task at a time – and follow a task-conditional single-encoder-single-decoder architecture. As shown in Fig. 1 (d), the task switching is accomplished by employing a small network to learn task-specific embeddings from task encodings; the behaviour of the decoder is adapted by conditioning it on those embeddings. In practice, we condition only the decoder (U-Net), in the hope that the encoder learns the concept of a 'thought space' [36], as it is forced to be task-agnostic. This way, the encoder features can also be reused to efficiently perform multiple tasks in series, which is not possible with encoder-based conditioning [24, 15].

Interestingly, the task embedding network offers some insight into the relationships between tasks. During training, a latent embedding for each task is learned together with its corresponding mapping to the conditioning parameters in each decoder layer. Though there are still many open questions in the study of task relationships and multi-task learning [40], we observe that the structure of our task embeddings resembles the task relationships reported in [48].

To summarize, in this work we study multi-task networks without task-specific parameters, and investigate their behaviour with regard to efficiency, optimization, and accuracy. We state our key contributions as follows.

• We introduce Task-Switching Networks, an efficient yet simple architecture for multi-task learning.

• We demonstrate that conditioning a single shared decoder can outperform multi-decoder methods even on heterogeneous tasks such as segmentation and regression.

• We adopt a small embedding network to learn task conditioning, facilitating optimization and offering insights into relationships between tasks.

2. Related Works

Multi-task learning (MTL). MTL is concerned with learning multiple tasks simultaneously, while exerting shared influence on model parameters. The potential benefits are manifold, and include speed-up of training or inference, higher accuracy, better representations, as well as fewer parameters or higher efficiency. A comprehensive survey on architectures, optimization, and other aspects of MTL can be found in [7]. Many MTL methods perform multiple tasks in a single forward pass, using shared trunk [18, 3, 23, 43, 22, 8], cross talk [27], or prediction distillation [47, 50, 51, 44] architectures. A recent work, MTI-Net [44], follows this direction and proposes to utilize the task interactions between multi-scale features. Another stream of MTL methods is based on task-conditional networks [15, 24], which perform a separate forward pass and activate some task-specific modules, as well as shared modules, for each task. As mentioned in [15], this setting is useful for many real-world setups. Hence, we follow this direction and propose TSNs. In stark contrast to conditioning the shared encoder, as done in [24, 15, 31, 2, 52, 41], we instead learn an unconditioned encoder, and condition only one single unified decoder for all tasks. To the best of our knowledge, our network is the first MTL method that does not require task-specific branching into multiple decoders, but rather shares all network parameters, with a single conditional unified decoder for all tasks.

Conditioning strategy. In the context of MTL, [41] proposes a heuristic masking of features to induce partially shared subnetworks for each task. On the other hand, features can be modulated by introducing task-specific projections [52], residual adapters [31, 32], attention mechanisms [24], or parametrized convolutions [15], while the original backbone network is shared among all tasks. Motivated by the successful application of adaptive normalization strategies in the context of domain adaptation [21], image generation [4, 16], style transfer [13], and super-resolution [46], we explore task-conditioned affine projections after instance normalization (IN) [42] for MTL. Inspired by the style-based generator of [16], the affine parameters are generated from a latent vector representing a desired task. Note that a concurrent work [30] proposes the new task of CompositeTasking by fusing tasks spatially (pixel-wise), based on task-conditioned BatchNorm (BN) [14].

Task relationships and embedding. Knowledge of the relationships between tasks is crucial for many aspects of machine learning, including multi-task, transfer, or meta-learning.
Including the seminal study by Zamir et al. [48], many recent works shed light on such task relationships for transfer learning and multi-tasking [10, 38, 39, 9, 49] by means of computation. Based on the assumption that tasks can be meaningfully taxonomized, this structure can be represented using task embeddings in a high-dimensional space. Nevertheless, such embeddings are mostly explored in the meta-learning literature [1, 35, 19], and very little from

[...] network with the least number of parameters, while bounding the relative performance drop by Δτ, as follows:

$$\min_{f \in \mathcal{F}_\theta,\,\theta} |\theta|, \quad \text{s.t.} \quad \mathop{\mathbb{E}}_{(I, y_\tau) \sim \mathcal{D}_\tau}\big[\ell_\tau(f(I, \tau), y_\tau)\big] \le (1 + \Delta_\tau)\,\bar{\ell}^{S}_{\tau}, \tag{1}$$
Figure 2: Task Switching Network overview. Our network performs multi-tasking by switching between tasks using a conditional decoder. Following U-Net [33], our encoder takes the image In and extracts features Fi at different layers. As a second input, our network takes a task encoding vector vτ, selecting the task τ to be performed. A small task embedding network C maps each task to a latent embedding lτ, which conditions the decoder layers along the blue paths, using module A [16]. The output is computed by conditioning and assembling encoder features Fi and decoder features Oi in a bottom-up fashion.
4.1. Task Switching Network

As laid out in the introduction, we design our Task Switching Networks on the premise that all network parameters should be shared, in order to provide an efficient solution to Problem 3.4. However, as illustrated in Fig. 1, parameter sharing in the MTL techniques in the literature is limited to the encoder and parts of the decoder, omitting the potential of sharing the complete decoder. Moreover, state-of-the-art methods [24, 15] switch tasks by activating task-specific modules. To avoid such additional parameters for each task, we introduce task switching, taking a task condition as an additional input to the network. To this end, we associate each task τ with a task-condition vector vτ ∈ R^d.

The input to our model is therefore a pair of an image and a task-encoding vector, i.e., (In, vτ), which represents conducting task τ on image In. Our backbone encoder takes the image In and extracts features Fi at different layers. A small task embedding network C with a few fully connected layers maps each task to a latent embedding lτ, which is used to condition the decoder layers using a module similar to StyleGAN [16]. The output is then computed by conditioning and assembling encoder features Fi and decoder features Oi bottom-up along the feature pyramid.

In our discussion, the dense prediction tasks (i.e., edge detection and semantic segmentation) are considered if not specifically stated, following [24, 15]. In the following, we describe the architectural details of our network.

4.1.1 Network Architecture

As shown in Fig. 2, our network is based on a simple U-Net architecture [33]. The encoder is a ResNet-based [12] backbone pre-trained on ImageNet [34], following existing MTL approaches [24, 15]. Let layer 1 denote the first convolution layer with kernel size (7 × 7), and layers 2 to 5 represent conv2_x to conv5_x (notation from [12]) of the backbone, respectively. The outputs of layers 1 to 5 are F1 to F5. From layer j to layer j+1, the spatial resolution of the encoder feature maps is reduced by half.

For the decoder, we follow a similar structure to U-Net [33], collecting features from layer 5 down to layer 1. Specifically, at layer j (j ≤ 4), the corresponding feature maps Fj from the encoder first pass through a conditional convolution module A, are then concatenated with the (upsampled) features from layer j+1, and finally pass through another instance of A. For the highest layer (j = 5), the feature maps go through module A only once, since there is no higher layer. As shown in Fig. 2, module A transforms input features into new features based on the embedding vector lτ, representing the specific task. Let Oj be the output of the decoder at layer j, which is given by

$$O_j = \begin{cases} A\big([\,\mathcal{U}(O_{j+1}),\, A(F_j, l_\tau)\,],\; l_\tau\big), & \text{for } j \le 4,\\ A(F_j, l_\tau), & \text{for } j = 5, \end{cases} \tag{6}$$

where [·, ·] denotes the concatenation of two feature tensors along the channel dimension and U(·) is an upsampling operation, which is omitted in Fig. 2 for simplicity.
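For illustration, the following PyTorch sketch mirrors the recursion of Eq. (6). The container names (A_pre, A_post) and the bilinear upsampling are our assumptions; the paper fixes only the structure of the computation.

```python
import torch
import torch.nn.functional as F

def decoder_forward(feats, l_tau, A_pre, A_post):
    """Assemble decoder outputs O_5..O_1 as in Eq. (6).
    feats: dict {j: F_j} of encoder features, j = 1..5.
    A_pre / A_post: dicts of conditional convolution modules A (sketched below)."""
    O = A_pre[5](feats[5], l_tau)                 # j = 5: a single pass through A
    for j in range(4, 0, -1):                     # j = 4, 3, 2, 1
        up = F.interpolate(O, scale_factor=2, mode="bilinear", align_corners=False)
        cat = torch.cat([up, A_pre[j](feats[j], l_tau)], dim=1)  # [U(O_{j+1}), A(F_j, l_tau)]
        O = A_post[j](cat, l_tau)                 # second instance of A
    return O
```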
The output feature O1 from decoder layer 1 has the same resolution as the original image and is passed to a convolution layer to make predictions for the different tasks. As discussed previously, different tasks can either share a common convolution layer (i.e., a single head shared by all tasks), or have separate convolution layers (different heads for different tasks). In the interest of avoiding task-specific parameters, we opt for a single head used by all tasks. To this end, we simply choose the number of output channels as the largest number of channels needed across the different tasks.
Take PASCAL-Context [28] (the most popular benchmark in MTL) as an example: the numbers of output channels needed for edge detection, parts segmentation, semantic segmentation, normals, and saliency detection are 1, 7, 21, 3, and 1, respectively. So we choose 21 output channels, which fits semantic segmentation. For the other tasks, we simply conduct adaptive average pooling along the channels to obtain predictions matching the corresponding tasks. The elegance of sharing a head across tasks is that exactly one single, neat network is used to solve all tasks. Our experiments in fact support this approach, since we found that sharing one head performs competitively with using separate heads for different tasks, as the sketch below illustrates.
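A minimal sketch of this shared head, assuming a decoder output width of 64 (not specified above) and our reading of "adaptive average pooling along the channels":

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shared_head_predict(o1, head, c_task):
    """Single shared prediction head (sketch). `head` outputs the max channel
    count over all tasks (21 on PASCAL-Context); tasks needing fewer channels
    obtain predictions by adaptive average pooling along the channel axis."""
    y = head(o1)                                          # (N, 21, H, W)
    if c_task == y.shape[1]:
        return y
    n, c, h, w = y.shape
    y = y.permute(0, 2, 3, 1).reshape(n * h * w, 1, c)    # channels -> pooling axis
    y = F.adaptive_avg_pool1d(y, c_task)                  # pool 21 -> c_task
    return y.reshape(n, h, w, c_task).permute(0, 3, 1, 2)

head = nn.Conv2d(64, 21, kernel_size=1)  # decoder width 64 is an assumption
```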
In the following, we describe the two key components of task switching networks that facilitate the conditioning.

Conditional Convolution Module. The goal of this module (block A in Fig. 2) is to adjust feature representations from the encoder – which are shared by all tasks – into new features that serve the desired task. As mentioned above, to conduct task τ, the corresponding task-condition vector vτ is transformed by the embedding network C to obtain the task-specific latent vector lτ, which is then passed to module A, inspired by [16]. Let x ∈ R^{1×c1×h×w} denote the input feature to module A, where c1, h, and w represent the number of channels, height, and width of the feature map, respectively. Module A then works as follows. First, x is processed by a convolution layer x̂ = x ∗ W with filter weights W, generating x̂ ∈ R^{1×c2×h×w}. At the same time, lτ is transformed by two fully connected layers with weight matrices Wγ ∈ R^{d×c2} and Wβ ∈ R^{d×c2}, to form the normalization coefficients γ ∈ R^{1×c2} and β ∈ R^{1×c2} for the subsequent AdaIN. For the feature x̂, AdaIN performs the normalization

$$\mathrm{AdaIN}(\hat{x}, \beta, \gamma) = \gamma\,\frac{\hat{x} - \mu}{\sqrt{\sigma^2}} + \beta, \tag{7}$$

where µ and σ² are the mean and variance of x̂, computed as statistics according to instance normalization [42]. In summary, module A performs the operation

$$A(x, l_\tau) = l_\tau W_\gamma\,\frac{x * W - \mu}{\sqrt{\sigma^2}} + l_\tau W_\beta. \tag{8}$$
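To make Eqs. (7)-(8) concrete, here is a minimal PyTorch sketch of module A. The 3×3 kernel size is our assumption (the kernel of A is not specified above); the structure otherwise follows the equations directly.

```python
import torch
import torch.nn as nn

class ModuleA(nn.Module):
    """Sketch of the conditional convolution module A (Eqs. 7-8)."""
    def __init__(self, c1, c2, d):
        super().__init__()
        self.conv = nn.Conv2d(c1, c2, kernel_size=3, padding=1)  # x_hat = x * W
        self.norm = nn.InstanceNorm2d(c2, affine=False)          # (x_hat - mu) / sqrt(var)
        self.W_gamma = nn.Linear(d, c2, bias=False)              # gamma = l_tau W_gamma
        self.W_beta = nn.Linear(d, c2, bias=False)               # beta  = l_tau W_beta

    def forward(self, x, l_tau):
        x_hat = self.norm(self.conv(x))
        gamma = self.W_gamma(l_tau).view(1, -1, 1, 1)
        beta = self.W_beta(l_tau).view(1, -1, 1, 1)
        return gamma * x_hat + beta                              # Eq. (8)
```

Note that Eq. (8) scales the normalized feature by lτWγ directly, rather than by the (1 + γ) parametrization sometimes used with StyleGAN-like conditioning.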
Task embedding network. Recall that each task is associated with a unique task-condition vector vτ ∈ R^d, and the TSNs switch between tasks by feeding different vτ to the task embedding network C, shown on the left of Fig. 2. The embedding network C : R^d → R^d learns to embed the task τ in a latent space, lτ = C(vτ), from which the AdaIN coefficients of Eq. 7 are generated for each module A. In principle, there are many choices for the initialization of these vectors. Specifically, we investigate embedding dimensions d, with orthogonal vτ (binary vectors) given by

$$v_{\tau_1}^{\top} v_{\tau_2} = \begin{cases} d, & \text{if } \tau_1 = \tau_2,\\ 0, & \text{otherwise}, \end{cases} \qquad v_{\tau_1}, v_{\tau_2} \in \mathbb{R}^d, \tag{9}$$

and Gaussian random vectors vτ ∼ N(0d, diag(1d)) [16]. The results are reported in §5.1.
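As an illustration, one construction satisfying Eq. (9) takes rows of a Hadamard matrix (entries ±1, self dot product d, pairwise dot products 0). Both this choice and the ReLU activations below are our assumptions; SciPy only builds power-of-two Hadamard orders, so we use d = 128 here, whereas the experiments use d = 100.

```python
import torch
import torch.nn as nn
from scipy.linalg import hadamard

# Orthogonal ±1 task encodings satisfying Eq. (9).
d, num_tasks = 128, 5
task_codes = torch.tensor(hadamard(d)[:num_tasks], dtype=torch.float32)

# Task embedding network C: R^d -> R^d with 8 fully connected layers of width d
# (see §5); the nonlinearity between layers is our assumption.
layers = []
for i in range(8):
    layers.append(nn.Linear(d, d))
    if i < 7:
        layers.append(nn.ReLU(inplace=True))
C = nn.Sequential(*layers)

l_tau = C(task_codes[2])  # latent embedding l_tau for the task with index 2
```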
5. Experiments

Overview. Following existing works [24, 15], we focus our MTL experiments on dense prediction tasks. In particular, we use the PASCAL-Context [28] dataset, which contains a total of 10,103 images, for the five tasks of edge detection (Edge), semantic segmentation (SemSeg), human parts segmentation (Parts), surface normals (Normals), and saliency detection (Sal). We further evaluate and compare our approach on the NYUD dataset [37], which is comprised of 1,449 images of indoor scenes and comes with annotations for the four tasks of edge detection, semantic segmentation, surface normals, and depth estimation (Depth).

Evaluation metric. We use standard evaluation metrics, following [24, 15, 45]. Specifically, to evaluate the predictive performance for each task, we use the optimal dataset F-measure (odsF) [25] for edge detection, mean intersection over union (mIoU) for semantic segmentation, human parts segmentation, and saliency, mean error (Error) for surface normals, and root mean square error (RMSE) for depth. In order to compare to a multi-task approach m, we average the relative performance drop (see Definition 3.3) with respect to the single-task baseline b over all tasks: $\Delta_m = \frac{1}{T}\sum_{\tau \in \mathcal{T}} \Delta_\tau(p_{m,\tau}, p_{b,\tau})$, where pm,τ and pb,τ are the metrics for task τ of the multi-task method m and the single-task baseline b, respectively.
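Since Definition 3.3 is not reproduced in this excerpt, the following sketch uses the common convention from [24], where each ∆τ is a signed relative difference whose sign is flipped for lower-is-better metrics; all names here are our own.

```python
def delta_m(metrics_m, metrics_b, lower_is_better):
    """Average relative performance drop (in %) of multi-task model m
    against the single-task baselines b, over all tasks."""
    drops = []
    for task, p_b in metrics_b.items():
        sign = 1.0 if lower_is_better[task] else -1.0
        drops.append(sign * (metrics_m[task] - p_b) / p_b)
    return 100.0 * sum(drops) / len(drops)

# Example with the Ours (INs+TE) row of Table 1; the rounded table entries
# give ~0.21 rather than exactly the reported 0.30.
print(delta_m(
    {"Edge": 70.6, "SemSeg": 64.2, "Parts": 55.0, "Normals": 16.3, "Sal": 63.3},
    {"Edge": 71.3, "SemSeg": 64.3, "Parts": 55.3, "Normals": 16.3, "Sal": 62.9},
    {"Edge": False, "SemSeg": False, "Parts": False, "Normals": True, "Sal": False},
))
```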
Network configuration. We employ the ResNet-18 backbone with the architecture introduced in §4.1 for all of our experiments, unless stated otherwise. The task embedding network C contains 8 fully connected layers of width d. Our method is implemented in PyTorch [29] and experiments were conducted on NVIDIA GPUs.

5.1. Ablation study

Study on module sharing. We compare our method with various baselines in Table 1. All approaches use the same network architecture and are trained with the same hyper-parameters, to ensure fair comparisons. The details of all considered baselines are as follows. Single-task means that each task is trained with an individual network, as shown in Fig. 1 (a). Multi-decoder represents a simple multi-task solution where the encoder is shared but the decoders are task-specific. We further compare to our architecture without the task embedding (TE) network, by using task-specific batch (BNs) and instance (INs) normalizations. We see that the Multi-decoder model, sharing a common encoder but using different decoders, does not perform well, which is consistent with the MTL literature [15]. Moreover, it has a large number of parameters (43.9 million). Task-specific BNs, on the other hand, performs only slightly worse than Multi-decoder, with a substantially smaller model size. Interestingly, Task-specific INs performs much better than Task-specific BNs. The results for Task-specific BNs and INs show that simply adapting features to different tasks by affine transformations in the decoder is able to give reasonable performance for multi-task learning. Our method, with the task embedding network to jointly learn the (affine transformation) coefficients for AdaIN, outperforms Task-specific INs by 1.13% in terms of the average performance drop ∆m. This demonstrates that learning the normalization coefficients jointly through the task embedding is better than learning them separately for each task. We also observed that during training, our method converges much faster than Task-specific INs and BNs. Moreover, our method only increases the size of the model by a small margin, because the task embedding network C in our model is very small.

Table 1: Task switching performance. Our TSNs perform competitively with single-tasking and multi-tasking baselines, with substantially smaller model sizes. Optimal performance is observed when all parameters are shared via our task embedding module (INs+TE).

Method          Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑   ∆m%↓   # params
Single-task     71.3   64.3     55.3    16.3      62.9   -      88.7M
Multi-decoder   72.2   55.4     55.5    16.8      59.1   4.32   43.9M
Ours (BNs)      71.6   55.9     54.1    16.7      60.0   4.38   17.7M
Ours (INs)      70.7   62.8     54.6    16.8      63.1   1.43   17.7M
Ours (INs+TE)   70.6   64.2     55.0    16.3      63.3   0.30   18.3M

Task embedding network. We study the impact of the two different choices for the task-condition vector vτ, as described in §4. The results are shown in Table 2. For orthogonal encodings, we observe that the performance of our method is robust towards the embedding dimensionality d, while performing best at d = 100. Gaussian encodings perform equally well as the orthogonal counterpart for dimensionality below 100, and tend to be slightly worse above. We conjecture that under Gaussian encoding, the distance between the task-condition vectors of two tasks is random (close or far), which is not desirable. However, this study demonstrates that our conditioning is robust towards these hyper-parameters. In our experiments, we choose orthogonal encoding with a dimension of 100 for the PASCAL-Context dataset (5 tasks). For the NYUD dataset (4 tasks), we use a dimension of 120 (divisible by 4).

Table 2: Impact of task embedding strategy. The designed task embedding is robust against different choices for the task encoding vτ, as well as for the dimensionality d of the embedding network C.

Type        d    Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑   ∆m%↓
Orthogonal  50   70.8   63.6     55.2    16.3      63.4   0.32
Orthogonal  100  70.6   64.2     55.0    16.3      63.3   0.30
Orthogonal  150  70.5   64.3     54.9    16.3      63.2   0.38
Orthogonal  250  70.5   63.8     54.8    16.4      63.1   0.75
Gaussian    50   70.8   64.1     55.1    16.3      63.0   0.30
Gaussian    100  70.3   63.2     54.4    16.5      63.1   1.22
Gaussian    150  70.7   63.6     54.8    16.3      63.4   0.44

More network architectures. Following [24, 15], we study the robustness of our method against more network architectures (ResNet-34 and ResNet-101). The results are shown in Table 3. As expected, the absolute performance on all tasks improves when using larger networks. Furthermore, our method performs closely to the corresponding single-task baselines for the different backbones. Note that the single-task baselines have 5 times more parameters than ours. The fact that our approach achieves a similarly low average performance drop (∆m%) across various networks demonstrates its robustness and effectiveness in reducing negative interference between tasks.

Table 3: Impact of network architecture. The designed task embedding is robust against various backbones.

Backbone    Method       Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑   ∆m%↓
ResNet-18   Single-task  71.3   64.3     55.5    16.3      62.9   -
ResNet-18   Ours         70.6   64.2     55.0    16.3      63.3   0.30
ResNet-34   Single-task  72.7   68.6     58.7    16.0      64.4   -
ResNet-34   Ours         71.8   67.6     58.0    16.1      64.3   0.99
ResNet-101  Single-task  74.2   70.7     62.1    15.8      65.0   -
ResNet-101  Ours         73.3   70.9     61.0    15.9      64.5   0.93

5.2. Comparison to state-of-the-arts
The state-of-the-art comparisons for PASCAL-Context are shown in Table 4. We compare our method to [...] task baseline. We report the number of parameters for each method, and show that our method uses the fewest parameters among the compared methods. Specifically, our method outperforms RCM by 3.32% and only uses 18.3M parameters, compared to 51.7M for RCM.

Table 4: Comparison with state-of-the-art. Our TSNs outperform different multi-decoder methods on PASCAL-Context, with only a single decoder and substantially fewer parameters.

Method            Edge↑  SemSeg↑  Parts↑  Normals↓  Sal↑   ∆m%↓   # params
Single-task       71.3   64.3     55.5    16.3      62.9   -      88.7M
Series RA [31]    72.0   55.1     54.6    17.0      58.7   5.21   51.7M
Parallel RA [32]  72.1   55.9     55.0    17.0      58.6   4.81   50.8M
RCM [15]          72.3   56.6     55.8    16.7      59.3   3.62   51.7M
Ours              70.6   64.2     55.0    16.3      63.3   0.30   18.3M

[Figure: task affinity matrices, with axes Edge, SemSeg, Sal, Parts, Normals for (a) PASCAL-Context, and SemSeg, Edge, Depth, Normals for NYUD.]
Figure 6: Qualitative results. We compare our model ("Ours") with the baseline (Task-specific INs) visually. Task interference is observed in the baseline, where detected edges can appear in the saliency predictions. Our method resolves this and outperforms the baseline on high-level tasks such as semantic segmentation, parts, and saliency detection. Best viewed with zooming.
[...] with "semanticness" – connecting first parts and semantic segmentation, then saliency, edges, and finally normals.

We further investigate the task embeddings on the 20 tasks of the Taskonomy [48] dataset, which is intended for finding task relationships. In Fig. 3, we compare our found task relationships with the ones established by Zamir et al., as well as with two recent methods [39, 10]. Interestingly, there appears to be a striking similarity between the found "taskonomies". Although not perfect, we roughly observe the trend of 2D and 3D tasks clustering together, as well as a separation of low-level (e.g., denoising, inpainting) from high-level (semantic segmentation, scene classification) tasks.

Figure 3: Task embedding relationships. We analyse the similarity of our task embeddings after training our network with 20 tasks on a small subset of the Taskonomy dataset [48]. The hierarchical clustering of task affinities from our learned embeddings (b) reveals an interesting similarity to the relationships found by the compared methods: (a) Zamir et al. [48], (c) Dwivedi et al. [10], (d) Song et al. [39].

Note that our method establishes task relationships much more efficiently than the compared methods [48, 39, 10]. Specifically, these approaches need separate models trained for the individual tasks (i.e., 20 separate models for 20 tasks). The method proposed in Taskonomy [48] then performs transfer learning between different tasks to find the task similarities, while both RSA [10] and DEPARA [39] conduct pairwise comparisons among deep features extracted from a certain number of images. In contrast, our approach only uses a single unified model and obtains the task similarities by simply computing the affinities between task embeddings.
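The affinity computation just mentioned admits a very short implementation. The cosine similarity and the average-linkage clustering below are our assumptions; the text only states that affinities between task embeddings are computed.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

def task_affinities(embeddings):
    """embeddings: (num_tasks, d) array of learned l_tau vectors."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = E @ E.T                                     # cosine similarities
    condensed = 1.0 - affinity[np.triu_indices(len(E), k=1)]
    return affinity, linkage(condensed, method="average")  # dendrogram as in Fig. 3 (b)
```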
We hypothesize that our embeddings implicitly transfer knowledge between tasks in the embedding space, in order to provide the impressive results of Tables 1 and 4. If two tasks require similar features, it is favorable for them to share certain patterns in the conditioning, and therefore to be localized closer together in the embedding space. From the experimental results, we can see that this behaviour is further encouraged by the limited capacity of the embedding network.

5.4. Discussion

Test-time parameters. In Table 4 and Table 5, we report the number of parameters of our TSNs. When it comes to maximizing memory and computational efficiency, we can however convert our task-switching network into a task-conditioned network, by computing the AdaIN parameters once and storing them with the model. In that case, the number of parameters drops from 18.3M to 17.7M, corresponding to the size of our IN baseline in Table 1. From this perspective, our task embedding can be interpreted as an additional inductive bias for MTL.
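A sketch of this conversion, reusing the hypothetical ModuleA fields from the §4.1 sketch: run C once per task, cache every (γ, β) pair, and drop C and the affine layers at deployment. All names here are our own.

```python
import torch

@torch.no_grad()
def bake_adain_params(all_modules_A, C, task_codes):
    """Precompute per-task AdaIN coefficients so the task-switching network
    behaves as a task-conditioned one at test time."""
    cache = {}
    for t, v in enumerate(task_codes):
        l_tau = C(v)
        cache[t] = [(m.W_gamma(l_tau), m.W_beta(l_tau)) for m in all_modules_A]
    return cache
```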
Architecture. We chose the U-Net architecture for simplicity, together with the ResNet-18 backbone, to demonstrate the idea of task switching networks and its behaviour and performance with regard to recent MTL methods. The application of the TSN principle using other, more powerful or more efficient architectures, backbones, decoders, or conditioning strategies is left for future work.

6. Conclusion

In this paper, we introduce the first approach for multi-task learning that uses only a single encoder and decoder architecture. By design, our Task Switching Networks offer a substantial advantage in terms of simplicity and parameter efficiency. This is achieved by sharing the complete set of network parameters among all tasks and using a conditioning network to learn task-specific latent vectors (embeddings), which then adapt the decoder for the corresponding tasks. As demonstrated in our experiments, our proposed task switching strategy improves MTL performance by learning the task embeddings jointly with all tasks, and offers a new perspective on multi-task learning through the lens of task embeddings. Our experiments further validate the utility and efficiency of the proposed framework, which outperforms state-of-the-art multi-decoder methods on standard benchmark datasets with far fewer parameters, under fair comparisons. We also show interesting findings on task relationships using the learnt task embeddings. To conclude, we believe that further investigation into the concept of task embeddings for multi-task learning will be an interesting topic for future work.

Acknowledgements

This work was partly supported by Specta AI.
References

[1] A. Achille, Michael Lam, Rahul Tewari, A. Ravichandran, Subhransu Maji, Charless C. Fowlkes, Stefano Soatto, and P. Perona. Task2Vec: Task embedding for meta-learning. ICCV, 2019.
[2] Hakan Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv, 2017.
[3] Felix J. S. Bragman, Ryutaro Tanno, Sébastien Ourselin, D. Alexander, and M. Cardoso. Stochastic filter groups for multi-task CNNs: Learning specialist and generalist convolution kernels. ICCV, 2019.
[4] A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv, 2019.
[5] Rich Caruana. Multitask learning. Machine Learning, 28(1):41–75, 1997.
[6] Z. Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. arXiv, 2018.
[7] M. Crawshaw. Multi-task learning with deep neural networks: A survey. arXiv, 2020.
[8] C. Doersch and Andrew Zisserman. Multi-task self-supervised visual learning. ICCV, 2017.
[9] Kshitij Dwivedi, Jiahui Huang, Radoslaw Martin Cichy, and Gemma Roig. Duality diagram similarity: A generic framework for initialization selection in task transfer learning. arXiv, 2020.
[10] Kshitij Dwivedi and Gemma Roig. Representation similarity analysis for efficient task taxonomy & transfer learning. CVPR, 2019.
[11] Yuan Gao, Haoping Bai, Zequn Jie, Jiayi Ma, Kui Jia, and Wei Liu. MTL-NAS: Task-agnostic neural architecture search towards general-purpose multi-task learning. CVPR, 2020.
[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
[13] X. Huang and Serge J. Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. ICCV, 2017.
[14] S. Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv, 2015.
[15] Menelaos Kanakis, David Bruggemann, Suman Saha, Stamatios Georgoulis, Anton Obukhov, and Luc Van Gool. Reparameterizing convolutions for incremental multi-task learning without task interference. ECCV, 2020.
[16] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. CVPR, 2019.
[17] Alex Kendall, Yarin Gal, and R. Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. CVPR, 2018.
[18] Iasonas Kokkinos. UberNet: Training a universal convolutional neural network for low-, mid-, and high-level vision using diverse datasets and limited memory. CVPR, 2017.
[19] L. Lan, Zhenguo Li, X. Guan, and P. Wang. Meta reinforcement learning with task embedding and shared policy. arXiv, 2019.
[20] Wu Li, V. Piëch, and C. Gilbert. Perceptual learning and top-down influences in primary visual cortex. Nature Neuroscience, 7:651–657, 2004.
[21] Yanghao Li, Naiyan Wang, J. Shi, Xiaodi Hou, and Jiaying Liu. Adaptive batch normalization for practical domain adaptation. Pattern Recognition, 80:109–117, 2018.
[22] Shikun Liu, Edward Johns, and A. Davison. End-to-end multi-task learning with attention. CVPR, 2019.
[23] Y. Lu, Abhishek Kumar, Shuangfei Zhai, Yu Cheng, T. Javidi, and R. Feris. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. CVPR, 2017.
[24] Kevis-Kokitsi Maninis, Ilija Radosavovic, and Iasonas Kokkinos. Attentive single-tasking of multiple tasks. CVPR, 2019.
[25] David R. Martin, Charless C. Fowlkes, and Jitendra Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. TPAMI, 26(5):530–549, 2004.
[26] Justin N. J. McManus, W. Li, and C. Gilbert. Adaptive shape processing in primary visual cortex. Proceedings of the National Academy of Sciences, 108:9739–9746, 2011.
[27] I. Misra, Abhinav Shrivastava, A. Gupta, and M. Hebert. Cross-stitch networks for multi-task learning. CVPR, 2016.
[28] Roozbeh Mottaghi, Xianjie Chen, Xiaobai Liu, Nam-Gyu Cho, Seong-Whan Lee, Sanja Fidler, Raquel Urtasun, and Alan Yuille. The role of context for object detection and semantic segmentation in the wild. CVPR, 2014.
[29] Adam Paszke, S. Gross, Francisco Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, Alban Desmaison, Andreas Köpf, E. Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, B. Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. arXiv, abs/1912.01703, 2019.
[30] Nikola Popovic, Danda Pani Paudel, Thomas Probst, Guolei Sun, and Luc Van Gool. CompositeTasking: Understanding images by spatial composition of tasks. CVPR, 2021.
[31] Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. Learning multiple visual domains with residual adapters. NeurIPS, 2017.
[32] Sylvestre-Alvise Rebuffi, Hakan Bilen, and A. Vedaldi. Efficient parametrization of multi-domain deep neural networks. CVPR, 2018.
[33] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. MICCAI, 2015.
[34] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211–252, 2015.
[35] Andrei A. Rusu, D. Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv, 2019.
[36] Holger Schwenk and M. Douze. Learning joint multilingual sentence representations with neural machine translation. Rep4NLP@ACL, 2017.
[37] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. ECCV, 2012.
[38] J. Song, Yixin Chen, X. Wang, Chengchao Shen, and Mingli Song. Deep model transferability from attribution maps. NeurIPS, 2019.
[39] J. Song, Yixin Chen, Jingwen Ye, X. Wang, Chengchao Shen, Feng Mao, and Mingli Song. DEPARA: Deep attribution graph for deep knowledge transferability. CVPR, 2020.
[40] Trevor Scott Standley, A. Zamir, Dawn Chen, L. Guibas, Jitendra Malik, and S. Savarese. Which tasks should be learned together in multi-task learning? ICML, 2020.
[41] Gjorgji Strezoski, Nanne van Noord, and M. Worring. Many task learning with task routing. ICCV, 2019.
[42] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv, 2016.
[43] Simon Vandenhende, Bert De Brabandere, and Luc Van Gool. Branched multi-task networks: Deciding what layers to share. BMVC, 2019.
[44] Simon Vandenhende, S. Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. ECCV, 2020.
[45] Simon Vandenhende, Stamatios Georgoulis, and Luc Van Gool. MTI-Net: Multi-scale task interaction networks for multi-task learning. ECCV, 2020.
[46] Xintao Wang, K. Yu, C. Dong, and Chen Change Loy. Recovering realistic texture in image super-resolution by deep spatial feature transform. CVPR, 2018.
[47] D. Xu, Wanli Ouyang, X. Wang, and N. Sebe. PAD-Net: Multi-tasks guided prediction-and-distillation network for simultaneous depth estimation and scene parsing. CVPR, 2018.
[48] A. Zamir, Alexander Sax, William Bokui Shen, L. Guibas, Jitendra Malik, and S. Savarese. Taskonomy: Disentangling task transfer learning. CVPR, 2018.
[49] Y. Zhang, Y. Wei, and Qiang Yang. Learning to multitask. arXiv, 2018.
[50] Z. Zhang, Zhen Cui, Chunyan Xu, Zequn Jie, Xiang Li, and Jian Yang. Joint task-recursive learning for semantic segmentation and depth estimation. ECCV, 2018.
[51] Z. Zhang, Zhen Cui, Chunyan Xu, Yan Yan, N. Sebe, and J. Yang. Pattern-affinitive propagation across depth, surface normal and semantic segmentation. CVPR, 2019.
[52] Xiangyun Zhao, Haoxiang Li, Xiaohui Shen, Xiaodan Liang, and Ying Wu. A modulation module for multi-task learning with applications in image retrieval. ECCV, 2018.