Learning Representation for Multitask Learning

through Self-Supervised Auxiliary Learning

Seokwon Shin1,2 , Hyungrok Do3 , and Youngdoo Son1,2,∗

1 Department of Industrial and Systems Engineering, Dongguk University-Seoul, 30,
Pildong-ro 1-gil, Jung-gu, Seoul, 04620, Republic of Korea
[email protected], [email protected]
2 Data Science Laboratory (DSLAB), Dongguk University-Seoul, 30, Pildong-ro 1-gil,
Jung-gu, Seoul, 04620, Republic of Korea
3 Department of Population Health, NYU School of Medicine, New York, NY, 10016,
United States
[email protected]
* corresponding author

Abstract. Multi-task learning is a popular machine learning approach
that enables simultaneous learning of multiple related tasks, improving
algorithmic efficiency and effectiveness. In the hard parameter sharing
approach, an encoder shared across multiple tasks generates data
representations that are passed to task-specific predictors. Therefore, it is
crucial to have a shared encoder that provides decent representations for
each and every task. However, despite recent advances in multi-task learning,
the question of how to improve the quality of representations generated
by the shared encoder remains open. To address this gap, we propose
a novel approach called Dummy Gradient norm Regularization (DGR)
that aims to improve the universality of the representations generated
by the shared encoder. Specifically, the method decreases the norm of
the gradient of the loss function with respect to dummy task-specific
predictors to improve the universality of the shared encoder’s represen-
tations. Through experiments on multiple multi-task learning benchmark
datasets, we demonstrate that DGR effectively improves the quality of
the shared representations, leading to better multi-task prediction per-
formances. Applied to various classifiers, the shared representations gen-
erated by DGR also show superior performance compared to existing
multi-task learning methods. Moreover, our approach is computationally
efficient owing to its simplicity, which also allows us to seamlessly
integrate DGR with existing multi-task learning algorithms.

Keywords: Multi-task learning · Universality · Regularization

1 Introduction
Multi-task learning (MTL) is a machine learning approach that involves training
a single model to handle multiple tasks simultaneously by sharing model param-
eters across the tasks, leading to better efficiency and a less complex model
2 S. Shin et al.

compared to having completely separate models for each task. MTL can poten-
tially improve the quality of learned representations, which can benefit individ-
ual tasks. MTL is particularly useful in real-world applications where multiple
tasks need to be performed simultaneously with limited resources. Represen-
tative examples include autonomous driving perception [6, 9, 22], defect detec-
tion [13, 28, 44, 47], and pre-training techniques [1, 2, 5, 56].
However, learning multiple tasks simultaneously can be a challenging prob-
lem, and optimizing the average loss over all tasks in MTL may not always result
in satisfactory generalization performance [4, 54]. Moreover, sharing the repre-
sentation in MTL can lead to a challenge where some tasks are learned well,
while others are overlooked due to differences in the loss scales and gradient
magnitudes of various tasks, as well as the interference among them. As a result,
some tasks may dominate the training process, leading to poorer performance
on other tasks [15, 43].
Various methods have been proposed to address these challenges and im-
prove the generalization performance of MTL models. These methods include
manipulating gradients to prevent their conflicts [7, 8, 12, 27, 30, 32, 52], properly
weighting loss functions [19, 23, 32, 33, 35, 36], and finding Pareto optimal solu-
tions for MTL as a multi-objective optimization problem [12, 29, 41, 45]. These
approaches have contributed to our understanding of the challenges in MTL and
provided valuable insight to improve its performance.
Despite all these recent advances in MTL, the question of how to improve the
universality of the representations generated by the shared encoder has not been
explored. In this paper, we present a novel method for improving the universality
of the shared encoder as a form of regularization. To begin, we define universal-
ity as the inverse of the difference between the loss of an optimal task-specific
predictor and that of an untrained task-specific predictor, which is arbitrarily
chosen. The underlying idea behind this definition is that the universal repre-
sentations generated by the encoder would allow any task predictor, whether
optimal or arbitrary, to perform with equal efficacy. Moreover, we show that the
universality is inversely proportional to the Frobenius norm of the gradient of
the loss function with respect to the arbitrarily chosen predictors. This allows us
to increase the universality just by adding a simple dummy gradient norm to the
learning objective function. Owing to its simplicity, we integrate our approach
with the existing MTL approaches and demonstrate that our approach boosts
the baseline performances in most combinations.
In summary, the contributions of our work are:

• defining the universality of the shared encoder for hard parameter sharing
MTL framework as the inverse of the difference between the loss value of an
optimal task-specific predictor and that of an arbitrary predictor;
• showing that the universality is inversely proportional to the Frobenius norm
of the gradient of the loss function with respect to an arbitrary predictor;
• proposing Dummy Gradient norm Regularization (DGR), a novel regular-
ization method to improve the universality of the shared encoder;
Multitask Learning through Self-Supervised Auxiliary Learning 3

• demonstrating, through a series of experiments on various MTL benchmark
datasets, that DGR is easily integrated with existing MTL methods and
improves their performance;
• demonstrating that DGR yields decent-quality representations of
input data.

2 Related Work

The main challenge in hard parameter sharing MTL is negative transfer, which
refers to a situation in which the performance of a model on one task is adversely
affected by another task. This can happen when the knowledge or representations
learned on one task are not compatible with those of another task, leading to a
conflict or mismatch. Previous studies have tackled the negative transfer problem
in several different ways: (1) loss weighting, (2) gradient manipulation, and (3)
multi-objective optimization.
Loss weighting assigns a different weight to the loss function of each task to
minimize the negative transfer caused by the large variance in loss or gradient
magnitudes per task [19, 23, 32, 33, 35, 36]. Methods in this category differ in how
to determine the weight for each task. For example, [23] determined the weights
based on the uncertainty for each task, and [36] used previous loss improvements
to adjust the weights during training. Recently, [35] determined weights using a
meta-learning scheme similar to MAML [16] and showed excellent performance
improvement.
Moreover, there exists a line of work that formulates MTL as a multi-objective
optimization problem [12, 29, 41, 45]. It is also known that the multi-objective
optimization-based approaches mitigate the negative transfer problem. In addi-
tion, there exists an integrated approach that involves both loss weighting and
gradient manipulation [32].
Meanwhile, previous studies [3, 25, 26, 42] argued that the success of deep
neural networks for computer vision is related to the quality of representation.
In the case of MTL, how well the shared representations suit all tasks is one
of the key components for achieving good performance across all tasks. However,
to the best of our knowledge, the universality of representations generated by a
shared encoder in MTL has yet to be explored; even the concept of universality
for the MTL problem has not been considered.

3 Proposed Method

This section introduces the dummy gradient norm regularization to improve
the universality of the shared representation in the hard parameter sharing MTL
setting. First, we introduce the general problem setting for MTL. Then, we describe
how DGR can improve the universality of the shared encoder. Figure 1 gives a
schematic overview of the proposed method.

Fig. 1: A schematic overview of the proposed DGR, which consists of a shared encoder,
task-specific predictors, and task-specific dummy predictors. During the forward pass,
task-specific predictors produce the actual prediction for each task, while the backward
pass minimizes the sum of task-specific losses and encourages the universality of the
shared encoder using dummy predictors. The black and red solid lines represent the
forward pass during the training and inference phases, respectively, while the blue
dashed line represents the direction of backpropagation for training.

3.1 MTL Problem Definition

Suppose we are given K different prediction tasks T = {T1 , · · · , TK } with train-
ing dataset D = {(xi , y1i , · · · , yKi ) ∈ X × Y1 × · · · × YK : i = 1, · · · , n},
where xi is the input vector and yki is the true label for prediction task Tk , for
k = 1, · · · , K. The goal of MTL is to learn a model that achieves high average
performance on all K prediction tasks [52]. We consider the hard parameter
sharing MTL setting [30, 32, 35, 41, 52] where the prediction model f consists of
a shared encoder φ and K task-specific predictors Ψ = {ψk : k = 1, · · · , K}
so that ψk ◦ φ : X → Yk makes the predictions for task Tk . We consider φ
and ψk to be parameterized by θE and θTk , respectively, for each k, and let
Θ = {θE , θT1 , · · · , θTK } represent the set of all model parameters. Therefore, the
MTL problem we consider is formulated as follows:

\[
\min_{\Theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{k} \circ \varphi)(\mathbf{x}_{i})\big), \tag{1}
\]

where Lk : Yk × Yk → R+ is a loss function for Tk .

3.2 Universality of Shared Encoder

There have been some attempts to measure the quality of representation in var-
ious fields [3, 21, 49–51, 53]. They have investigated reconstruction errors, evalu-
ated performance in downstream tasks, or evaluated transferability to new tasks.
However, these approaches require the intervention of downstream task predic-
tors, which can potentially be computationally demanding and time consuming.
Furthermore, such methods do not provide quality scores that can be directly
incorporated into the learning objective function. Therefore, we propose a way
to quantify the universality of the shared encoder. We define universality as
the inverse of the difference between the loss value of an optimal task-specific
predictor and that of an arbitrarily chosen task-specific predictor.
We first define optimal task-specific predictors.
Definition 1 (Optimal Task-Specific Predictors). Given a shared encoder
parameterized by θE, we denote the optimal task-specific predictor for task Tk
by ψ_{θ_{T_k}^*}, where

\[
\theta_{\mathcal{T}_{k}}^{*} \mid \theta_{\mathcal{E}} = \operatorname*{arg\,min}_{\theta_{\mathcal{T}_{k}}} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big). \tag{2}
\]

Note that θ_{T_k}^* depends on θE, which implies that the optimal task-specific predictor
varies as θE is perturbed.
Apparently, the optimal task-specific predictors depend on the shared en-
coder, meaning that the difficulty of Problem (2) varies as θE changes. That is,
if the shared encoder generates poor representation, it might be hard to learn a
good task-specific predictor. Moreover, if the shared encoder is biased towards
certain (often simpler or easier) tasks, it may be difficult to learn the other tasks
effectively. Thus, the representations generated by the shared encoder need to
be sufficiently informative to help each predictor learn the task-specific features
necessary for good performance.
Therefore, we want the shared encoder to provide sufficiently good repre-
sentations for all prediction tasks. We quantify the universality of the shared
encoder based on an arbitrarily chosen task-specific predictor, ψ_{θ_{T_k}^Δ}.

Definition 2 (Universality of Shared Encoder). Given an arbitrary task-specific
predictor ψ_{θ_{T_k}^Δ}, we define the quality of representations provided by the
shared encoder as follows:

\[
\mathcal{U}(\theta_{\mathcal{E}} \mid \theta_{\mathcal{T}_{k}}^{\Delta}) = \Big[ \sum_{i=1}^{n} \min_{\boldsymbol{\sigma}_{k}} \mathcal{L}_{k}\big(\sigma_{ki}\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) - \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{*}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \Big]^{-1}, \tag{3}
\]

where σ_k is a class permutation for task k.
The intuition behind this definition is that if the shared encoder produces uni-
versal representations of the input data, then an arbitrarily chosen non-trained
predictor can achieve comparable performance to the optimal predictor. In other
words, the universal representations should be remarkably informative and ex-
pressive, so that they can compensate for the lack of task-specific tuning of
an arbitrarily initialized non-trained predictor. If the learned representation is
universal for all tasks, it could greatly help each task-specific predictor achieve
decent performance for its prediction task.
Our definition for universality shares a similarity with Sharpness-Aware Min-
imization (SAM) [17] in that they both minimize the discrepancy between the
losses incurred by the parameters and alternative ones. However, SAM aims to
explore flat minima by minimizing the difference between the losses with the
given parameters and their vicinity, while our work finds universal representa-
tions by minimizing the difference between the losses with the optimal parame-
ters and arbitrary ones.
Although (3) is a reasonable quantification of the universality of the shared
encoder, including U directly in the learning objective is computationally
infeasible as it requires knowing the optimal task-specific predictors ψ_{θ_{T_k}^*}. The
following theorem, which assumes the convexity of the loss function, allows us
to have an indirect but feasible way of improving the universality of the shared
encoder. The assumption of convexity has also been utilized in other multitask
learning studies [30, 52] and has led to successful performance improvements.

Theorem 1 (Inverse Proportionality). Given a task-specific predictor ψ_{θ_{T_k}^Δ},
if Lk is convex with respect to θ_{T_k}^Δ, the universality of the shared encoder is
inversely proportional to the Frobenius norm of the gradient of the loss function
with respect to θ_{T_k}^Δ, that is,

\[
\mathcal{U}(\theta_{\mathcal{E}} \mid \theta_{\mathcal{T}_{k}}^{\Delta}) \propto \left\| \nabla_{\theta_{\mathcal{T}_{k}}^{\Delta}} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \right\|_{F}^{-1}. \tag{4}
\]
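
One standard way to see why a small dummy gradient norm goes together with high universality is the first-order convexity inequality followed by Cauchy-Schwarz. The sketch below is an assumption about how such a bound can arise, not a reproduction of the authors' proof. The shorthand F, θ^Δ, and θ^* is introduced here for illustration, and the first step assumes the identity is among the admissible class permutations.

```latex
% Sketch only: F, \theta^{\Delta}, \theta^{*} abbreviate the summed task-k loss
% and the dummy / optimal predictor parameters for a fixed encoder \theta_{\mathcal{E}}.
\begin{align*}
F(\theta) &:= \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki},
   (\psi_{\theta} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big), \\
\mathcal{U}(\theta_{\mathcal{E}} \mid \theta^{\Delta})^{-1}
  &\le F(\theta^{\Delta}) - F(\theta^{*})
  && \text{(the minimum over } \boldsymbol{\sigma}_{k} \text{ is at most the identity)} \\
  &\le \big\langle \nabla F(\theta^{\Delta}),\, \theta^{\Delta} - \theta^{*} \big\rangle
  && \text{(convexity of } F\text{)} \\
  &\le \big\| \nabla F(\theta^{\Delta}) \big\|_{F}\, \big\| \theta^{\Delta} - \theta^{*} \big\|_{F}
  && \text{(Cauchy--Schwarz)}.
\end{align*}
```

When the loss gap on the left is positive, inverting the chain gives a lower bound on U that grows as the gradient norm shrinks, which is consistent with the relationship stated in (4).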

Theorem 1 allows us to improve the universality of the shared encoder by
minimizing the Frobenius norm of the gradient of the loss with respect to the
parameters of an arbitrary task-specific predictor. Therefore, we formulate our
MTL problem as follows:

\[
\min_{\Theta} \sum_{k=1}^{K} \left[ \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) + \lambda \left\| \nabla_{\theta_{\mathcal{T}_{k}}^{\Delta}} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \right\|_{F} \right], \tag{5}
\]

where λ > 0 is the hyperparameter that controls the tradeoff between the
loss and the penalty term. Problem (5) can be optimized with a typical minibatch
stochastic gradient method provided that Lk is smoothly differentiable. The
optimization steps to solve Problem (5) are summarized in Algorithm 1.
Algorithm 1 involves the computation of a Hessian matrix of the loss func-
tion with respect to the parameters of the dummy predictor, which potentially
requires substantial time and space complexity, depending on the number of pa-
rameters of the task-specific predictor. Therefore, for computational and memory
efficiency, we adopt the finite difference approximation, which is widely used to
approximate Hessian matrices [16, 31]. The details of the finite difference
approximation are given in the Appendix.
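
To illustrate the idea, the following sketch shows one common finite-difference scheme for differentiating a gradient-norm penalty without forming a Hessian: perturb the dummy parameters along the normalized dummy gradient and difference the encoder gradients. The exact scheme used by the authors is in their Appendix and may differ. The toy scalar encoder `e`, the linear dummy head `d`, and all function names here are hypothetical.

```python
import math

def grad_enc(e, d, x, y):
    """dL/de for the toy loss L = 0.5 * (d[0] * e * x + d[1] - y) ** 2."""
    r = d[0] * e * x + d[1] - y
    return r * d[0] * x

def grad_dummy(e, d, x, y):
    """Gradient of the same loss with respect to the dummy parameters d."""
    z = e * x
    r = d[0] * z + d[1] - y
    return [r * z, r]

def dummy_grad_norm(e, d, x, y):
    """The penalty term: norm of the dummy-predictor gradient."""
    return math.hypot(*grad_dummy(e, d, x, y))

def fd_penalty_grad(e, d, x, y, eps=1e-5):
    """Finite-difference estimate of d/de ||grad_d L|| (no Hessian formed)."""
    g = grad_dummy(e, d, x, y)
    norm = math.hypot(*g)
    v = [gi / norm for gi in g]                     # unit direction g / ||g||
    d_shift = [di + eps * vi for di, vi in zip(d, v)]
    return (grad_enc(e, d_shift, x, y) - grad_enc(e, d, x, y)) / eps

# Compare against a direct central difference of the penalty in e.
e, d, x, y = 0.8, [0.3, -0.1], 1.5, 2.0
h = 1e-6
reference = (dummy_grad_norm(e + h, d, x, y) - dummy_grad_norm(e - h, d, x, y)) / (2 * h)
estimate = fd_penalty_grad(e, d, x, y)
```

The estimate needs only two evaluations of the encoder gradient, which is the reason such approximations are attractive when the predictor has many parameters.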

Algorithm 1 DGR for MTL

Input: shared encoder φ_{θ_E}, task-specific predictors {ψ_{θ_{T_k}}}, task-specific dummy
       predictors {ψ_{θ_{T_k}^Δ}}, dataset D, and hyperparameter λ
Output: multitask prediction model (φ_{θ_E}, ψ_{θ_{T_k}})
 1: t ← 0, (θ_E^(0), θ_T^(0)) ← Initialize(θ_E, θ_T)
 2: while not converged do
 3:   B = {(x_i, y_ki)} ← MiniBatchSampler(D, b)
 4:   Feedforward encoder: z_i = φ_{θ_E^(t)}(x_i)
 5:   for k = 1 to K do
 6:     Feedforward predictor: ŷ_ki = ψ_{θ_{T_k}^(t)}(z_i)
 7:     Feedforward dummy predictor: ŷ_ki^Δ = ψ_{θ_{T_k}^Δ}(z_i)
 8:     Compute objective value for the minibatch B:
          ℓ_k(B) = Σ_{i=1}^{b} [ L_k(y_ki, ŷ_ki) + λ ‖∇_{θ_{T_k}^Δ} L_k(y_ki, ŷ_ki^Δ)‖_F ]
 9:     Update θ_{T_k} by stochastic gradient descent: θ_{T_k}^(t+1) ← θ_{T_k}^(t) − η ∇_{θ_{T_k}} ℓ_k(B)
10:   end for
11:   Update θ_E by stochastic gradient descent: θ_E^(t+1) ← θ_E^(t) − η ∇_{θ_E} Σ_{k=1}^{K} ℓ_k(B)
12:   t ← t + 1
13: end while
14: (θ_E, θ_T) ← (θ_E^(t), θ_T^(t))

3.3 Choosing Task-Specific Dummy Predictor

As given in (4), U can be improved by decreasing the norm of the gradient of
the loss function Lk given an arbitrary predictor ψ_{θ_{T_k}^Δ}. In our implementation,
we added a dummy predictor ψ_{θ_{T_k}^Δ} that has exactly the same architecture as
the task-specific predictor ψ_{θ_{T_k}} that is actually used to make predictions. We
then initialize the dummy parameters θ_{T_k}^Δ randomly. The choice is arbitrary, but
empirically we have confirmed that it helps improve performance.
We fixed randomly initialized parameters for the dummy predictor during
training for learning stability. However, there is a risk that the encoder may
learn to adapt specifically to that fixed predictor. To prevent such adaptation,
we utilized multiple dummy predictors instead of a single one. In our implemen-
tation, we employed three distinct, fixed dummy predictors. In terms of employ-
ing additional decoders, our implementation draws parallels with Pseudo-Task
Augmentation (PTA) [39], which also employed multiple decoders across var-
ious tasks. However, while PTA aimed to train a shared encoder to solve the
same tasks in multiple ways, our focus lies in enhancing the shared encoder’s
universality through gradient regularization with multiple fixed decoders.
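
The training procedure, with several fixed dummy predictors per task as described above, can be sketched on a toy problem. This is a minimal illustration under strong assumptions, not the authors' implementation: the shared encoder is a single scalar, each predictor is a linear head `(w, b)`, the full dataset is used as one batch, and numeric central differences stand in for backpropagation. All names and values are hypothetical.

```python
import math

def predict(enc, head, x):
    """Shared scalar encoder z = enc * x followed by a linear head (w, b)."""
    return head[0] * (enc * x) + head[1]

def task_loss(enc, head, batch):
    return sum(0.5 * (predict(enc, head, x) - y) ** 2 for x, y in batch)

def dummy_grad_norm(enc, dummy, batch):
    """Norm of the summed-loss gradient w.r.t. a fixed dummy head (w, b)."""
    gw = sum((predict(enc, dummy, x) - y) * enc * x for x, y in batch)
    gb = sum(predict(enc, dummy, x) - y for x, y in batch)
    return math.hypot(gw, gb)

def objective(params, dummies, batches, lam):
    """params = [enc, w_1, b_1, ..., w_K, b_K]; dummies[k] lists fixed heads."""
    enc, total = params[0], 0.0
    for k, batch in enumerate(batches):
        head = (params[1 + 2 * k], params[2 + 2 * k])
        total += task_loss(enc, head, batch)
        total += lam * sum(dummy_grad_norm(enc, d, batch) for d in dummies[k])
    return total

def numeric_grad(f, params, h=1e-6):
    """Central-difference gradient; a stand-in for automatic differentiation."""
    grad = []
    for j in range(len(params)):
        up = params[:j] + [params[j] + h] + params[j + 1:]
        down = params[:j] + [params[j] - h] + params[j + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad

def train(params, dummies, batches, lam=0.1, lr=0.005, steps=200):
    """Gradient descent on encoder and real heads; dummy heads never change."""
    f = lambda p: objective(p, dummies, batches, lam)
    for _ in range(steps):
        g = numeric_grad(f, params)
        params = [p - lr * gi for p, gi in zip(params, g)]
    return params

xs = [-1.0, 0.0, 1.0, 2.0]
batches = [[(x, 2.0 * x) for x in xs], [(x, 1.0 - x) for x in xs]]   # two toy tasks
dummies = [[(0.7, -0.2), (-0.5, 0.4), (1.1, 0.1)],                   # three fixed
           [(-0.4, 0.3), (0.9, -0.6), (0.2, 0.8)]]                   # heads per task
init = [0.5, 1.0, 0.0, 1.0, 0.0]
before = objective(init, dummies, batches, lam=0.1)
trained = train(init, dummies, batches)
after = objective(trained, dummies, batches, lam=0.1)
```

Because the penalty depends only on the encoder and the fixed dummy heads, its gradient with respect to the real heads is zero, so the real predictors are trained by their task losses alone while the encoder receives both signals.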
Additionally, in terms of utilizing auxiliary tasks, our implementation is
similar to previous studies on auxiliary learning [14, 34, 40, 46], which aimed to learn
useful representations for specific primary tasks. However, while these studies
have centered on determining or creating beneficial auxiliary tasks to be
learned alongside primary tasks, DGR focuses on regularizing the gradient norm
obtained by dummy predictors with arbitrarily chosen and fixed parameters.

4 Experiments

We performed a series of experiments to demonstrate the utility and efficiency
of the proposed DGR method on several multi-task benchmark datasets. We
first apply our method to image classification on UTKFace [55], formulated
as a multi-task learning problem. Then, we further investigate the quality of
representations produced by the shared encoder learned with our approach.
Subsequently, we tackle multi-task scene understanding problems on the NYUv2 [48]
and CityScapes [10] datasets.

4.1 Competitive Methods

In our experiments, we compared our method with two baseline methods, single-
task models (Single) and a vanilla multi-task model (Vanilla), and existing
MTL methods. We considered loss weighting approaches that have demonstrated
decent performance in previous studies, such as Uncertainty weighting [23],
Dynamic Weight Averaging (DWA) [36], and Auto-λ [35]. Furthermore, we
compared the proposed method with four other methods, including gradient con-
flict and multi-objective optimization: PCGrad [52], CAGrad [30], IMTL [32],
and Nash-MTL [41]. We further integrated our method with each competitive
MTL method (presented as +Ours) to investigate how our DGR can improve
performance from the existing approaches.

4.2 Multi-task Image Classification on UTKFace

We evaluated DGR on the UTKFace dataset with three image classification
tasks: age prediction, gender classification, and ethnicity classification. To
facilitate a more straightforward interpretation of the experiment results, we
cast age prediction as a classification task by discretizing ages into seven
classes: 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, and 60+.
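
The age discretization described above can be written as a one-line binning rule. The function name is hypothetical and the snippet only illustrates the seven-class mapping.

```python
def age_to_class(age: int) -> int:
    """Map an age in years to one of seven bins: 0-9, 10-19, ..., 50-59, 60+."""
    if age < 0:
        raise ValueError("age must be non-negative")
    return min(age // 10, 6)   # every age of 60 or more falls into the last bin

labels = [age_to_class(a) for a in (3, 19, 20, 45, 59, 60, 88)]
```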
Experiment Settings. In our experiments on the UTKFace dataset, we used
the standard hard parameter sharing structure known as Split, the same as
in [35]. The Split structure is a general structure for hard parameter sharing
settings, where each task-specific predictor makes predictions using only repre-
sentations obtained through the shared encoder’s parameters.
For all methods, we used ResNet50 [20] as the backbone architecture for the
shared encoder and two fully connected layers as each task-specific predictor.
For optimization, we used the Adam [24] optimizer for all methods and performed
a grid search on λ ∈ {10^-5, 10^-6, 10^-7} to determine the optimal value of the
hyperparameter for the proposed method.
Table 1: Test accuracies averaged over three random trials on the UTKFace dataset.
Boldface and underline denote the best and second-best, respectively.

Method         Age      Gender   Ethnicity   ∆MTL ↑
Single         0.5233   0.8964   0.7728      -
Vanilla        0.5391   0.9153   0.7948      +2.66%
  + Ours       0.5605   0.9019   0.7942      +3.50%
DWA            0.5370   0.9148   0.7989      +2.68%
  + Ours       0.5336   0.9135   0.8056      +2.71%
Uncertainty    0.5385   0.9137   0.7961      +2.62%
  + Ours       0.5455   0.9118   0.7988      +3.11%
Auto-λ         0.5405   0.9171   0.7965      +2.89%
  + Ours       0.5503   0.9163   0.8014      +3.69%
PCGrad         0.5417   0.9134   0.7960      +2.80%
  + Ours       0.5431   0.8999   0.7986      +2.50%
CAGrad         0.5390   0.9239   0.7994      +3.17%
  + Ours       0.5510   0.9169   0.8094      +4.11%
IMTL           0.5250   0.9173   0.8039      +2.23%
  + Ours       0.5430   0.9020   0.8051      +2.86%
Nash-MTL       0.5485   0.9172   0.8021      +3.64%
  + Ours       0.5545   0.9063   0.8055      +3.77%

Evaluation Metrics. We reported the test accuracy on every classification task
for the UTKFace dataset. All performances were averaged over three independent
trials. Furthermore, we reported the relative improvement in performance
∆MTL compared to the single-task model in the following experiments [38]. ∆MTL
is defined as follows:

\[
\Delta_{\text{MTL}} = \frac{1}{K} \sum_{k=1}^{K} (-1)^{\varsigma_{k}} \frac{M_{m,k} - M_{b,k}}{M_{b,k}}, \tag{6}
\]

where ς_k is an indicator that equals 1 if a lower value of the performance metric
M_k for T_k is better and 0 otherwise, and m and b denote the multi-task model and
the single-task baseline model, respectively.
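
As a quick check of Equation (6), the sketch below recomputes ∆MTL for the Vanilla rows from the Single and Vanilla values reported in Tables 1 and 2. The function name and argument layout are illustrative choices, not part of the original work.

```python
def delta_mtl(multi, single, lower_is_better):
    """Average relative improvement over the single-task baseline (Eq. 6)."""
    total = 0.0
    for m, b, lower in zip(multi, single, lower_is_better):
        sign = -1.0 if lower else 1.0          # (-1) ** varsigma_k
        total += sign * (m - b) / b
    return total / len(multi)

# UTKFace accuracies (age, gender, ethnicity): Vanilla vs. Single, Table 1.
utk = delta_mtl([0.5391, 0.9153, 0.7948], [0.5233, 0.8964, 0.7728], [False] * 3)

# NYUv2, Split architecture (normal mDist., semantic mIoU, depth aErr.), Table 2.
nyu = delta_mtl([24.19, 46.02, 41.36], [22.40, 43.37, 52.24], [True, False, True])

print(f"{100 * utk:+.2f}%  {100 * nyu:+.2f}%")
```

Both values agree with the reported +2.66% and +6.32% after rounding, which also confirms the sign convention: metrics where lower is better contribute positively when the multi-task model reduces them.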
Results. Table 1 presents the image classification performance of the proposed
method, baselines, and other MTL methods on the UTKFace dataset. We also
report the performance of the baselines and other MTL methods combined
with our method. Across all tasks, the MTL methods, including DGR,
outperformed the single-task model. Notably, both when employed independently
and when integrated with the other MTL methods, DGR improved performance in most
cases on two of the three tasks, namely the challenging age prediction and
ethnicity classification tasks, which involve a larger number of classes than
the gender classification task. Moreover, when combined with DGR, ∆MTL increased
for all MTL methods except PCGrad. This result emphasizes the efficacy of DGR in
enhancing the universality of representations, which helps in obtaining
suitable representations for diverse and intricate tasks.
We also conducted comparative experiments with DGR against SAM [17]
and PTA [39], which share some similarities with our work. We compared the
average ∆MTL of each method when used both independently and in combination
with the three MTL methods that demonstrated the highest performance in
Table 1: Auto-λ, CAGrad, and Nash-MTL. As shown in Figure 2, although
DGR, SAM, and PTA all showed consistent performance improvement when
integrated into the MTL methods, the proposed DGR showed the largest
improvement. An analysis of the computational cost is reported in the Appendix.

Fig. 2: Comparison of average ∆MTL over three trials on the UTKFace dataset
upon integrating SAM, PTA, and Ours into each method. The bars colored in gray
indicate the results obtained using each method independently.

4.3 Quality of Shared Representation

According to our theoretical finding, a universal representation should enable
any classifier to perform well. This includes relatively simpler classifiers built
with the representations. We conducted an experiment to determine whether
the shared representation learned through DGR, which was aimed at achieving
a universal shared representation, exhibited greater universality compared to the
shared representations learned through other methods.
In particular, we conducted a three-step experiment to assess the quality of
shared representations on the UTKFace dataset. In the first step, we trained a
model using each MTL method. Then, we froze the trained shared encoder and
substituted the original classifier with a simpler alternative, such as a decision
tree (DT), a support vector machine (SVM), or a k-nearest neighbor (kNN) classifier. Finally,
we evaluated the performance of these classifiers. This experiment is similar to
previous evaluations of representation learning [3, 21], with the only variation
being the algorithm used for the task-specific predictor.
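
The three-step protocol above can be sketched with a tiny kNN probe on top of a frozen encoder. Everything here is a hypothetical stand-in: `frozen_encoder` replaces a trained shared encoder with a fixed linear map, and the toy points replace UTKFace features.

```python
import math
from collections import Counter

def frozen_encoder(x):
    """Hypothetical stand-in for a trained, frozen shared encoder."""
    return (x[0] + x[1], x[0] - x[1])

def knn_predict(train_feats, train_labels, query, k=3):
    """Majority vote among the k nearest training representations."""
    order = sorted(range(len(train_feats)),
                   key=lambda i: math.dist(train_feats[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

def probe_accuracy(train_x, train_y, test_x, test_y, k=3):
    """Fit a kNN probe on frozen representations and report test accuracy."""
    feats = [frozen_encoder(x) for x in train_x]      # encoder is never updated
    hits = sum(knn_predict(feats, train_y, frozen_encoder(x), k) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)

train_x = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0), (5.0, 5.1), (4.8, 5.2), (5.1, 4.9)]
train_y = [0, 0, 0, 1, 1, 1]
test_x = [(0.1, 0.1), (4.9, 5.0)]
test_y = [0, 1]
acc = probe_accuracy(train_x, train_y, test_x, test_y)
```

Only the probe changes between runs; the encoder output is treated as a fixed feature extractor, so differences in probe accuracy reflect the representation rather than the predictor.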
As shown in Figure 3a, the shared representations learned through DGR
consistently demonstrated superior performance with the simple task-specific
classifier across all tasks. In particular, the performance advantage was more
pronounced in the age prediction task, which has the largest number of classes.
The proposed method was also less sensitive to the choice of classifiers.
Furthermore, we visualized the shared representations with t-SNE to
verify the effect of DGR on their universality. Figure 3b shows the visualization
results of the shared representations produced by Vanilla and DGR. Similar to the
experiment with simple classifiers, notable improvements were observed in both
ethnicity and age prediction tasks. Specifically, in the ethnicity classification
task, the representations generated by DGR were more clustered by their classes
compared to those of Vanilla. Notably, for the age classification, the
representations generated by DGR were not only clustered by classes but also aligned
by age order, whereas those generated by Vanilla were rather dispersed.

Fig. 3: (a) Average test performance over 3 trials on the UTKFace dataset, where the
learned representation using MTL was fixed, and relatively simple classifiers were used
to replace the original classifier. (b) The t-SNE visualization results of shared
representations generated by the Vanilla (left) and DGR (right). Each row corresponds to
a specific task.

4.4 Visual Scene Understanding on NYUv2 and CityScapes

We evaluated DGR on scene understanding tasks with the two standard datasets,
NYUv2 [48] and CityScapes [10]. In the experiments on the NYUv2 dataset,
we trained on three tasks: surface normal prediction, 13-class semantic segmen-
tation, and depth prediction. In the experiments on the CityScapes dataset,
we also trained on three tasks: 19-class semantic segmentation, 10-class part seg-
mentation [18], and disparity estimation, with the same setting as in [23]. The
experiments on these datasets are common for verification in MTL studies.
Experiment Settings. For the experiments on scene understanding tasks, we
used the state-of-the-art multi-task architecture MTAN [36], which has been
primarily applied to the same experiments in previous studies, along with the
standard Split structure used in the image classification experiments. The MTAN
structure differs from Split by incorporating task-specific attention to
obtain shared representations. We used ResNet-50 as the backbone architecture
for all methods. Furthermore, we used the SGD optimizer for all methods
except Nash-MTL, where we used Adam [24] to achieve stable convergence of the
algorithm [41]. We performed a grid search to determine the optimal value of λ,
the hyperparameter of the proposed method, in the same way as in the image
classification experiments.

Table 2: Average test performance over three trials on the NYUv2 dataset with
MTL methods in Split (left) and MTAN (right) multi-task architectures. The bold and
underline denote the best and second best, respectively. The arrow next to metrics
indicates whether the metric is higher or lower the better.

               Split                                                   MTAN
Method         Normal        Semantic Seg. Depth         ∆MTL ↑        Normal        Semantic Seg. Depth         ∆MTL ↑
               (mDist.) ↓    (mIoU) ↑      (aErr.) ↓                   (mDist.) ↓    (mIoU) ↑      (aErr.) ↓
Single         22.40         43.37         52.24         -             22.40         43.37         52.24         -
Vanilla        24.19         46.02         41.36         +6.32%        24.05         46.05         41.03         +6.55%
  + Ours       23.98         47.11         40.11         +8.26%        24.04         46.24         39.98         +7.51%
DWA            24.20         45.84         41.58         +6.02%        24.06         46.47         41.48         +6.71%
  + Ours       24.18         46.43         41.08         +6.82%        24.10         46.81         41.37         +7.24%
Uncertainty    24.24         46.46         41.04         +6.78%        23.97         46.96         41.02         +7.57%
  + Ours       24.34         46.52         40.80         +6.83%        23.89         46.23         40.00         +7.28%
Auto-λ         23.60         48.20         39.83         +9.85%        23.43         48.10         39.21         +10.02%
  + Ours       23.60         48.04         39.45         +9.96%        23.32         48.19         39.03         +10.50%
PCGrad         24.55         45.41         42.32         +4.70%        24.07         46.71         39.79         +5.41%
  + Ours       24.03         46.71         41.20         +7.19%        23.95         46.86         39.73         +7.42%
CAGrad         23.39         46.79         40.55         +8.61%        23.23         47.48         39.92         +9.38%
  + Ours       23.28         47.48         39.81         +9.78%        22.98         48.49         38.94         +11.56%
IMTL           22.45         50.11         38.69         +13.75%       22.53         49.84         38.43         +13.43%
  + Ours       22.37         49.62         37.70         +14.13%       22.39         50.04         38.18         +14.42%
Nash-MTL       23.60         48.20         39.83         +9.85%        23.43         48.10         39.84         +10.02%
  + Ours       22.54         48.25         39.53         +11.65%       22.66         48.38         39.59         +11.75%
Evaluation Metrics. We followed the standard evaluation protocol used in
previous studies [30,32,35,41,52]. For the experiments on the NYUv2 dataset,
we evaluated the normal, semantic segmentation, and depth tasks via mean angle
distance (mDist.), mean intersection over union (mIoU), and absolute error
(aErr.), respectively. Similarly, for the experiments on the CityScapes dataset,
we evaluated the two segmentation tasks, semantic and part, through mean
intersection over union (mIoU) and the disparity estimation task via absolute
error (aErr.).
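The ∆MTL column in the result tables can be recomputed from the per-task metrics. The following is a minimal sketch, assuming that ∆MTL is the average relative improvement over the single-task baseline, with the sign flipped for metrics where lower is better. The function name `delta_mtl` and its arguments are illustrative and are not taken from the authors' code. Under this assumption, the Vanilla row of the Split architecture on NYUv2 evaluates to about +6.32%, which agrees with the reported value.

```python
from typing import Sequence


def delta_mtl(
    multi_task: Sequence[float],
    single_task: Sequence[float],
    lower_is_better: Sequence[bool],
) -> float:
    """Average relative improvement (in percent) over single-task baselines.

    Assumed definition: mean over tasks of sign * (M_multi - M_single) / M_single,
    where sign is -1 for lower-is-better metrics and +1 otherwise.
    """
    n = len(multi_task)
    if n == 0 or not (n == len(single_task) == len(lower_is_better)):
        raise ValueError("all inputs must be non-empty with one entry per task")
    total = 0.0
    for m, b, lower in zip(multi_task, single_task, lower_is_better):
        sign = -1.0 if lower else 1.0
        total += sign * (m - b) / b
    return 100.0 * total / n


# NYUv2, Split architecture, task order:
# normal (mDist., lower is better), semantic seg. (mIoU, higher), depth (aErr., lower)
single = [22.40, 43.37, 52.24]
vanilla = [24.19, 46.02, 41.36]
flags = [True, False, True]

print(f"{delta_mtl(vanilla, single, flags):+.2f}%")  # prints "+6.32%"
```

Because the normal estimation task degrades relative to its single-task baseline while depth improves strongly, the aggregate is dominated by the depth term; the per-task columns should therefore be read alongside ∆MTL.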
Results. Tables 2 and 3 show the performance in the scene understanding
experiments on the NYUv2 and the CityScapes datasets, respectively.
On the NYUv2 dataset, DGR showed similar or better performance than
common benchmarks in recent MTL studies such as DWA, Uncertainty, and
PCGrad. Notably, these results were achieved without loss balancing or
gradient manipulation methods. Moreover, combining DGR with other methods
resulted in performance improvements, and the best performance was achieved
when IMTL was combined with the proposed method.
On the CityScapes dataset, DGR improved the performance of almost all
MTL methods, except for DWA. Combining CAGrad with DGR led to the
best performance.
These results confirmed that DGR improves performance in the scene
understanding tasks in most cases, even though the degree of improvement
varies with several factors, including datasets, model structures, tasks, and
base methods.

Table 3: Average test performance over three trials on the CityScapes dataset with
MTL methods in Split (left) and MTAN (right) multi-task architectures. Bold and
underline denote the best and second-best, respectively. The arrow next to each metric
indicates whether higher or lower values are better.

             Split                                          MTAN
Method       Semantic Seg.  Part Seg.  Disparity  ∆MTL ↑    Semantic Seg.  Part Seg.  Disparity  ∆MTL ↑
             (mIoU) ↑       (mIoU) ↑   (aErr.) ↓            (mIoU) ↑       (mIoU) ↑   (aErr.) ↓
Single       56.20          52.74      0.84       -         56.20          52.74      0.84       -
Vanilla      56.29          52.31      0.75       +3.35%    56.64          52.52      0.74       +4.09%
+ Ours       56.27          52.45      0.74       +3.83%    57.13          51.88      0.73       +4.37%
DWA          55.63          52.22      0.72       +4.10%    55.88          52.51      0.72       +4.43%
+ Ours       55.84          52.39      0.72       +4.33%    55.65          52.54      0.72       +4.31%
Uncertainty  56.77          54.59      0.77       +4.29%    56.66          55.03      0.76       +4.89%
+ Ours       56.76          55.46      0.78       +4.43%    56.59          55.10      0.75       +5.29%
Auto-λ       55.95          52.94      0.70       +5.53%    55.97          52.67      0.69       +5.77%
+ Ours       57.22          53.55      0.71       +6.28%    56.76          53.29      0.69       +6.63%
PCGrad       55.40          52.53      0.72       +4.15%    55.53          52.83      0.72       +4.42%
+ Ours       56.04          52.53      0.72       +4.53%    56.61          52.75      0.72       +5.01%
CAGrad       56.55          55.05      0.71       +6.83%    56.36          55.32      0.72       +6.49%
+ Ours       57.05          55.16      0.71       +7.19%    56.66          56.07      0.72       +7.14%
IMTL         58.33          55.43      0.76       +6.14%    58.67          55.77      0.76       +6.55%
+ Ours       58.46          55.33      0.75       +6.55%    58.12          55.73      0.75       +6.60%
Nash-MTL     57.65          55.59      0.75       +6.23%    57.64          55.09      0.74       +6.31%
+ Ours       57.78          55.04      0.74       +6.36%    57.77          55.60      0.74       +6.71%

4.5 Sensitivity Analysis

We performed additional sensitivity analyses of the proposed method under
various conditions. First, we verified the performance of DGR by varying the
numbers of dummy decoders and primary tasks on visual scene understanding
(NYUv2 and Cityscapes) as well as single-task image classification
(ImageNet [11]). Second, we evaluated DGR with the Swin transformer [37]
backbone architecture, which is more capable than the backbone we previously
used. Finally, DGR was applied to the Pascal dataset to verify its robustness
with a larger number of tasks.
Results. As illustrated in Table 4(a), the performance improvement of DGR
over Vanilla becomes more prominent as the number of tasks increases.
Moreover, we observed a slight improvement even when DGR was employed for
single-task image classification (ImageNet). Furthermore, Table 4(b)
demonstrates that the Swin backbone consistently improved performance compared to
ResNet50 across all tasks in both single- and multi-task scenarios. DGR further
enhanced performance, indicating its effectiveness with highly capable backbone
architectures. In the case of the Pascal dataset, shown in Table 4(c), although
MTL had lower performance compared to single-task learning, DGR achieved
performance improvement in MTL scenarios.

Table 4: (a) Experimental results with varying numbers of dummy decoders (d) and
primary tasks (numbers in parentheses indicate the number of primary tasks). The
final row shows the experimental result when DGR is utilized for single-task learning.
For MTL, ∆MTL is reported, while in single-task learning, top-1 accuracy is reported.
(b) Experimental results on the NYUv2 and the CityScapes datasets when DGR is
utilized with the Swin backbone. (c) Experimental results on the Pascal dataset.

Dataset         Vanilla   d = 1     d = 3     d = 5
NYUv2 (1)       +0.00%    +0.05%    +0.07%    -0.25%
NYUv2 (2)       +6.76%    +6.84%    +6.93%    +6.73%
NYUv2 (3)       +6.32%    +7.13%    +8.05%    +7.96%
Cityscapes (1)  +0.00%    -0.04%    +0.02%    +0.03%
Cityscapes (2)  +3.24%    +3.30%    +3.33%    +3.28%
Cityscapes (3)  +3.37%    +3.72%    +3.83%    +3.66%
ImageNet        76.13%    76.30%    76.50%    75.98%
(a)

NYUv2     Normal      Sem Seg.    Depth      ∆MTL ↑
          (mDist.) ↓  (mIoU) ↑    (aErr.) ↓
Single    22.03       53.81       38.87      -
Vanilla   23.12       54.87       35.09      +6.74%
+ Ours    22.78       54.98       34.98      +8.77%

Cityscapes  Sem Seg.  Part Seg.  Disparity  ∆MTL ↑
            (mIoU) ↑  (mIoU) ↑   (aErr.) ↓
Single      56.08     53.33      0.74       -
Vanilla     56.02     54.18      0.69       +2.79%
+ Ours      57.13     54.37      0.68       +3.98%
(b)

Method   Seg.     H. Parts  Norm.     Sal.     Edge      ∆MTL ↑
         (IoU) ↑  (IoU) ↑   (mErr) ↓  (IoU) ↑  (odsF) ↑
Vanilla  63.8     58.6      14.9      65.1     69.2      -2.86%
+DGR     64.2     58.7      14.8      65.2     69.5      -2.42%
MTAN     63.7     58.9      14.8      65.4     69.6      -2.39%
+DGR     63.8     59.0      14.7      65.5     69.6      -2.19%
(c)

These findings demonstrate that DGR performs robustly under various con-
ditions and can achieve performance improvements when combined with more
advanced backbone architectures, such as the Swin transformer.

5 Conclusion

We present a novel approach, Dummy Gradient norm Regularization (DGR),
to improve the universality of a shared encoder in MTL. Through experiments
on multiple benchmark datasets, we have demonstrated that DGR effectively
improves the universality of a shared encoder, resulting in better multi-task
prediction performance. Owing to its simplicity, our approach requires
relatively little computation time and integrates seamlessly with existing
MTL algorithms. Overall, our study contributes to the advancement of MTL by
addressing the important question of how to improve the universality of a
shared encoder and by providing a practical and efficient method to achieve
this goal.

Acknowledgements

This research was supported by the National Research Foundation of Korea
(NRF) grant funded by the Ministry of Science and ICT (MSIT) of Korea (No.
RS-2023-00208412).

References

1. Anderson, C., Farrell, R.: Improving fractal pre-training. In: Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1300–
1309 (2022)
2. Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A.: Multimae: Multi-modal multi-
task masked autoencoders. In: European Conference on Computer Vision. pp. 348–
367. Springer (2022)
3. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and
new perspectives. IEEE transactions on pattern analysis and machine intelligence
35(8), 1798–1828 (2013)
4. Caruana, R.: Multitask learning. Springer (1998)
5. Chen, T., Frankle, J., Chang, S., Liu, S., Zhang, Y., Carbin, M., Wang, Z.: The
lottery tickets hypothesis for supervised and self-supervised pre-training in com-
puter vision models. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 16306–16316 (2021)
6. Chen, Y., Zhao, D., Lv, L., Zhang, Q.: Multi-task learning for dangerous object
detection in autonomous driving. Information Sciences 432, 559–571 (2018)
7. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: Gradient nor-
malization for adaptive loss balancing in deep multitask networks. In: International
conference on machine learning. pp. 794–803. PMLR (2018)
8. Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., Anguelov,
D.: Just pick a sign: Optimizing deep multitask models with gradient sign dropout.
Advances in Neural Information Processing Systems 33, 2039–2050 (2020)
9. Chowdhuri, S., Pankaj, T., Zipser, K.: Multinet: Multi-modal multi-task learn-
ing for autonomous driving. In: 2019 IEEE Winter Conference on Applications of
Computer Vision (WACV). pp. 1496–1504. IEEE (2019)
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3213–3223 (2016)
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A
large-scale hierarchical image database. In: 2009 IEEE conference on computer
vision and pattern recognition. pp. 248–255. IEEE (2009)
12. Désidéri, J.A.: Multiple-gradient descent algorithm (mgda) for multiobjective op-
timization. Comptes Rendus Mathematique 350(5-6), 313–318 (2012)
13. Dong, X., Taylor, C.J., Cootes, T.F.: Defect classification and detection using
a multitask deep one-class cnn. IEEE Transactions on Automation Science and
Engineering 19(3), 1719–1730 (2021)
14. Du, Y., Czarnecki, W.M., Jayakumar, S.M., Farajtabar, M., Pascanu, R., Lakshmi-
narayanan, B.: Adapting auxiliary losses using gradient similarity. arXiv preprint
arXiv:1812.02224 (2018)
15. Evgeniou, T., Pontil, M.: Regularized multi–task learning. In: Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data
mining. pp. 109–117 (2004)
16. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation
of deep networks. In: International conference on machine learning. pp. 1126–1135.
PMLR (2017)
17. Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization
for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020)
18. de Geus, D., Meletis, P., Lu, C., Wen, X., Dubbelman, G.: Part-aware panoptic
segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 5485–5494 (2021)
19. Guo, M., Haque, A., Huang, D.A., Yeung, S., Fei-Fei, L.: Dynamic task prior-
itization for multitask learning. In: Proceedings of the European conference on
computer vision (ECCV). pp. 270–287 (2018)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
21. Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P.,
Trischler, A., Bengio, Y.: Learning deep representations by mutual information
estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
22. Ishihara, K., Kanervisto, A., Miura, J., Hautamaki, V.: Multi-task learning with
attention for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 2902–2911 (2021)
23. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 7482–7491 (2018)
24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
25. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N.:
Big transfer (bit): General visual representation learning. In: Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Pro-
ceedings, Part V 16. pp. 491–507. Springer (2020)
26. Kolesnikov, A., Zhai, X., Beyer, L.: Revisiting self-supervised visual representation
learning. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. pp. 1920–1929 (2019)
27. Lee, S., Son, Y.: Multitask learning with single gradient step update for task bal-
ancing. Neurocomputing 467, 442–453 (2022)
28. Li, Y., Li, J.: An end-to-end defect detection method for mobile phone light guide
plate via multitask learning. IEEE Transactions on Instrumentation and Measure-
ment 70, 1–13 (2021)
29. Lin, X., Zhen, H.L., Li, Z., Zhang, Q.F., Kwong, S.: Pareto multi-task learning.
Advances in neural information processing systems 32 (2019)
30. Liu, B., Liu, X., Jin, X., Stone, P., Liu, Q.: Conflict-averse gradient descent
for multi-task learning. Advances in Neural Information Processing Systems 34,
18878–18890 (2021)
31. Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. arXiv
preprint arXiv:1806.09055 (2018)
32. Liu, L., Li, Y., Kuang, Z., Xue, J., Chen, Y., Yang, W., Liao, Q., Zhang, W.:
Towards impartial multi-task learning. In: International Conference on Learning
Representations (2021)
33. Liu, S., Liang, Y., Gitter, A.: Loss-balanced task weighting to reduce negative
transfer in multi-task learning. In: Proceedings of the AAAI conference on artificial
intelligence. vol. 33, pp. 9977–9978 (2019)
34. Liu, S., Davison, A., Johns, E.: Self-supervised generalisation with meta auxiliary
learning. Advances in Neural Information Processing Systems 32 (2019)
35. Liu, S., James, S., Davison, A.J., Johns, E.: Auto-lambda: Disentangling dynamic
task relationships. arXiv preprint arXiv:2202.03091 (2022)
36. Liu, S., Johns, E., Davison, A.J.: End-to-end multi-task learning with attention. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recog-
nition. pp. 1871–1880 (2019)
37. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF international conference on computer vision. pp. 10012–10022
(2021)
38. Maninis, K.K., Radosavovic, I., Kokkinos, I.: Attentive single-tasking of multiple
tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 1851–1860 (2019)
39. Meyerson, E., Miikkulainen, R.: Pseudo-task augmentation: From deep multitask
learning to intratask sharing—and back. In: International Conference on Machine
Learning. pp. 3511–3520. PMLR (2018)
40. Navon, A., Achituve, I., Maron, H., Chechik, G., Fetaya, E.: Auxiliary learning by
implicit differentiation. arXiv preprint arXiv:2007.02693 (2020)
41. Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., Fe-
taya, E.: Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017
(2022)
42. Rebuffi, S.A., Kolesnikov, A., Sperl, G., Lampert, C.H.: icarl: Incremental classifier
and representation learning. In: Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition. pp. 2001–2010 (2017)
43. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv
preprint arXiv:1706.05098 (2017)
44. Sampath, V., Maurtua, I., Martín, J.J.A., Rivera, A., Molina, J., Gutierrez, A.:
Attention guided multi-task learning for surface defect identification. IEEE Trans-
actions on Industrial Informatics (2023)
45. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. Ad-
vances in neural information processing systems 31 (2018)
46. Shamsian, A., Navon, A., Glazer, N., Kawaguchi, K., Chechik, G., Fetaya, E.:
Auxiliary learning as an asymmetric bargaining game. In: International Conference
on Machine Learning. pp. 30689–30705. PMLR (2023)
47. Shao, L., Zhang, E., Ma, Q., Li, M.: Pixel-wise semisupervised fabric defect de-
tection method combined with multitask mean teacher. IEEE Transactions on
Instrumentation and Measurement 71, 1–11 (2022)
48. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from rgbd images. ECCV (5) 7576, 746–760 (2012)
49. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain
adaptation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 7167–7176 (2017)
50. Wang, X., Li, L., Ye, W., Long, M., Wang, J.: Transferable attention for domain
adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence.
vol. 33, pp. 5345–5352 (2019)
51. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-
parametric instance discrimination. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 3733–3742 (2018)
52. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery
for multi-task learning. Advances in Neural Information Processing Systems 33,
5824–5836 (2020)
53. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM
64(3), 107–115 (2021)
54. Zhang, Y., Yang, Q.: An overview of multi-task learning. National Science Review
5(1), 30–43 (2018)
55. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversar-
ial autoencoder. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 5810–5818 (2017)
56. Zhang, Z., Wang, S., Xu, Y., Fang, Y., Yu, W., Liu, Y., Zhao, H., Zhu, C., Zeng,
M.: Task compass: Scaling multi-task pre-training with task prefix. arXiv preprint
arXiv:2210.06277 (2022)