Learning Representation For Multitask Learning Through Self-Supervised Auxiliary Learning
1 Introduction
Multi-task learning (MTL) is a machine learning approach that involves training
a single model to handle multiple tasks simultaneously by sharing model param-
eters across the tasks, leading to better efficiency and a less complex model
2 S. Shin et al.
compared to having completely separate models for each task. MTL can poten-
tially improve the quality of learned representations, which can benefit individ-
ual tasks. MTL is particularly useful in real-world applications where multiple
tasks need to be performed simultaneously with limited resources. Represen-
tative examples include autonomous driving perception [6, 9, 22], defect detec-
tion [13, 28, 44, 47], and pre-training techniques [1, 2, 5, 56].
However, learning multiple tasks simultaneously can be a challenging prob-
lem, and optimizing the average loss over all tasks in MTL may not always result
in satisfactory generalization performance [4, 54]. Moreover, sharing the repre-
sentation in MTL can lead to a challenge where some tasks are learned well,
while others are overlooked due to differences in the loss scales and gradient
magnitudes of various tasks, as well as the interference among them. As a result,
some tasks may dominate the training process, leading to poorer performance
on other tasks [15, 43].
Various methods have been proposed to address these challenges and im-
prove the generalization performance of MTL models. These methods include
manipulating gradients to prevent their conflicts [7, 8, 12, 27, 30, 32, 52], properly
weighting loss functions [19, 23, 32, 33, 35, 36], and finding Pareto optimal solu-
tions for MTL as a multi-objective optimization problem [12, 29, 41, 45]. These
approaches have contributed to our understanding of the challenges in MTL and
provided valuable insight to improve its performance.
Despite all these recent advances in MTL, the question of how to improve the
universality of the representations generated by the shared encoder has not been
explored. In this paper, we present a novel method for improving the universality
of the shared encoder as a form of regularization. To begin, we define universal-
ity as the inverse of the difference between the loss of an optimal task-specific
predictor and that of an untrained task-specific predictor, which is arbitrarily
chosen. The underlying idea behind this definition is that the universal repre-
sentations generated by the encoder would allow any task predictor, whether
optimal or arbitrary, to perform with equal efficacy. Moreover, we show that the
universality is inversely proportional to the Frobenius norm of the gradient of
the loss function with respect to the arbitrarily chosen predictors. This allows us
to increase the universality just by adding a simple dummy gradient norm to the
learning objective function. Owing to its simplicity, we integrate our approach
with the existing MTL approaches and demonstrate that our approach boosts
the baseline performances in most combinations.
In summary, the contributions of our work are:
• defining the universality of the shared encoder for hard parameter sharing
MTL framework as the inverse of the difference between the loss value of an
optimal task-specific predictor and that of an arbitrary predictor;
• showing that the universality is inversely proportional to the Frobenius norm
of the gradient of the loss function with respect to an arbitrary predictor;
• proposing Dummy Gradient norm Regularization (DGR), a novel regular-
ization method to improve the universality of the shared encoder;
• demonstrating, through a series of experiments on various MTL benchmark
datasets, that DGR integrates easily with existing MTL methods and improves
their performance;
• demonstrating that DGR yields good-quality representations of the
input data.
2 Related Work
The main challenge in hard parameter sharing MTL is negative transfer, which
refers to a situation in which the performance of a model on one task is adversely
affected by another task. This can happen when the knowledge or representations
learned on one task are not compatible with that of another task, leading to a
conflict or mismatch. The previous studies tackled the negative transfer problem
in several different ways: (1) loss weighting, (2) gradient manipulation, and (3)
multi-objective optimization.
Loss weighting assigns a different weight to the loss function of each task to
minimize the negative transfer caused by the large variance in loss or gradient
magnitudes across tasks [19, 23, 32, 33, 35, 36]. Methods in this category differ in how
they determine the weight for each task. For example, [23] determined the weights
based on the uncertainty for each task, and [36] used previous loss improvements
to adjust the weights during training. Recently, [35] determined weights using a
meta-learning scheme similar to MAML [16] and showed excellent performance
improvement.
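As a concrete instance of loss weighting, the following minimal sketch follows the Dynamic Weight Average rule of [36], in which each task weight is a softmax over the ratio of its two most recent losses. The function name `dwa_weights` and the default temperature are illustrative assumptions, not part of this paper.

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Weight each task by how slowly its loss has been decreasing."""
    num_tasks = len(prev_losses)
    # r_k = L_k(t-1) / L_k(t-2); a larger ratio means slower improvement.
    ratios = [p / pp for p, pp in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    # Softmax over the ratios, rescaled so the weights sum to the task count.
    return [num_tasks * e / total for e in exps]

# Task 0 improved less (1.0 -> 0.9) than task 1 (1.0 -> 0.5),
# so task 0 receives the larger weight.
weights = dwa_weights([0.9, 0.5], [1.0, 1.0])
```

When all tasks improve at the same rate, every weight equals 1 and the rule reduces to the unweighted sum of losses.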
Moreover, there exists a line of work that formulates MTL as a multi-objective
optimization problem [12, 29, 41, 45]. It is also known that the multi-objective
optimization-based approaches mitigate the negative transfer problem. In addi-
tion, there exists an integrated approach that involves both loss weighting and
gradient manipulation [32].
Meanwhile, previous studies [3, 25, 26, 42] argued that the success of deep
neural networks for computer vision is related to the quality of representation.
In MTL, how well the shared representations suit all tasks is a key factor in
achieving good performance across all tasks. However, to the best of our
knowledge, the universality of representations generated by a shared encoder in
MTL has yet to be explored; even the concept of universality for the MTL
problem has not been considered.
3 Proposed Method
Fig. 1: A schematic overview of the proposed DGR, which consists of a shared encoder,
task-specific predictors, and task-specific dummy predictors. During the forward pass,
task-specific predictors produce the actual prediction for each task, while the backward
pass minimizes the sum of task-specific losses and encourages the universality of the
shared encoder using dummy predictors. The black and red solid lines represent the
forward pass during the training and inference phases, respectively, while the blue
dashed line represents the direction of backpropagation for training.
\begin{equation}\label{eq:1}
\min_{\Theta} \sum_{i=1}^{n} \sum_{k=1}^{K} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{k} \circ \varphi)(\mathbf{x}_{i})\big), \tag{1}
\end{equation}
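To make Problem (1) concrete, the sketch below evaluates the summed multi-task loss for a toy hard-parameter-sharing model. The scalar linear encoder, the linear heads, and the squared loss are illustrative assumptions chosen only to keep the example self-contained; they are not the architectures used in the experiments.

```python
def encoder(x, theta_e):
    # Shared encoder phi: a hypothetical 1-D linear map.
    return theta_e * x

def predictor(z, theta_t):
    # Task-specific predictor psi_k: a hypothetical linear head.
    return theta_t * z

def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

def mtl_objective(xs, ys_per_task, theta_e, thetas_t):
    """Sum of task losses over all samples i and tasks k, as in Problem (1)."""
    total = 0.0
    for k, theta_t in enumerate(thetas_t):
        for x, y in zip(xs, ys_per_task[k]):
            total += squared_loss(y, predictor(encoder(x, theta_e), theta_t))
    return total

xs = [1.0, 2.0, 3.0]
ys_per_task = [[2.0, 4.0, 6.0], [-1.0, -2.0, -3.0]]  # two toy regression tasks
# Both heads fit their task exactly, so the objective is 0.0.
print(mtl_objective(xs, ys_per_task, theta_e=1.0, thetas_t=[2.0, -1.0]))
```

Every task reuses the same `theta_e`, which is what makes the encoder shared; only `thetas_t` is task-specific.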
Furthermore, such methods do not provide quality scores that can be directly
incorporated into the learning objective function. Therefore, we propose a way
to quantify the universality of the shared encoder. We define universality as
the inverse of the difference between the loss value of an optimal task-specific
predictor and that of an arbitrarily chosen task-specific predictor.
We first define optimal task-specific predictors.
Definition 1 (Optimal Task-Specific Predictors). Given a shared encoder
parameterized by θE, we denote by ψ_{θ*_{T_k}} the optimal task-specific predictor for
task T_k, where
\begin{equation}\label{eqn:optimal-predictor}
\theta_{\mathcal{T}_{k}}^{*} \,|\, \theta_{\mathcal{E}} = \underset{\theta_{\mathcal{T}_{k}}}{\arg\min} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big). \tag{2}
\end{equation}
Note that θ*_{T_k} depends on θE, which implies that the optimal task-specific predictor
varies as θE is perturbed.
Apparently, the optimal task-specific predictors depend on the shared en-
coder, meaning that the difficulty of Problem (2) varies as θE changes. That is,
if the shared encoder generates poor representation, it might be hard to learn a
good task-specific predictor. Moreover, if the shared encoder is biased towards
certain (often simpler or easier) tasks, it may be difficult to learn the other tasks
effectively. Thus, the representations generated by the shared encoder need to
be sufficiently informative to help each predictor learn the task-specific features
necessary for good performance.
Therefore, we want the shared encoder to provide sufficiently good repre-
sentations for all prediction tasks. We quantify the universality of the shared
encoder based on an arbitrarily chosen task-specific predictor, ψ_{θ^Δ_{T_k}}:
\begin{equation}\label{eqn:QoE}
\mathcal{U}(\theta_{\mathcal{E}} \,|\, \theta_{\mathcal{T}_{k}}^{\Delta}) = \Big[ \sum_{i=1}^{n} \min_{\boldsymbol{\sigma}_{k}} \mathcal{L}_{k}\big(\sigma_{ki} \mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) - \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{*}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \Big]^{-1} \tag{3}
\end{equation}
where σ_k is a class permutation for task k.
The intuition behind this definition is that if the shared encoder produces uni-
versal representations of the input data, then an arbitrarily chosen non-trained
predictor can achieve comparable performance to the optimal predictor. In other
words, the universal representations should be remarkably informative and ex-
pressive, so that they can compensate for the lack of task-specific tuning of
an arbitrarily initialized non-trained predictor. If the learned representation is
universal for all tasks, it could greatly help each task-specific predictor achieve
decent performance for its prediction task.
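The definition can be illustrated on a toy regression task, where the optimal head has a closed form and the class permutation in (3) does not apply. This is a minimal sketch under those assumptions; the scalar linear encoder and head, and all function names, are hypothetical.

```python
def summed_loss(xs, ys, theta_e, theta_t):
    # Summed squared loss of a linear head on a linear encoder.
    return sum((y - theta_t * theta_e * x) ** 2 for x, y in zip(xs, ys))

def optimal_head(xs, ys, theta_e):
    # Closed-form least-squares head for a fixed encoder; it changes with theta_e.
    zs = [theta_e * x for x in xs]
    return sum(z * y for z, y in zip(zs, ys)) / sum(z * z for z in zs)

def universality(xs, ys, theta_e, theta_dummy):
    # Inverse of the gap between the dummy head's loss and the optimal loss.
    # The gap is zero (and the inverse undefined) if the dummy head is optimal.
    best = summed_loss(xs, ys, theta_e, optimal_head(xs, ys, theta_e))
    gap = summed_loss(xs, ys, theta_e, theta_dummy) - best
    return 1.0 / gap

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]
# With theta_e = 1 the optimal head is about 2.04, far from the dummy head 0.5.
u_far = universality(xs, ys, theta_e=1.0, theta_dummy=0.5)
# With theta_e = 4 the optimal head is about 0.51, so the same dummy head is
# nearly optimal and the encoder scores as far more universal.
u_near = universality(xs, ys, theta_e=4.0, theta_dummy=0.5)
```

The same fixed dummy head yields very different scores depending only on the encoder, which is the quantity the definition is meant to capture.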
Our definition for universality shares a similarity with Sharpness-Aware Min-
imization (SAM) [17] in that they both minimize the discrepancy between the
losses incurred by the parameters and alternative ones. However, SAM aims to
explore flat minima by minimizing the difference between the losses with the
given parameters and their vicinity, while our work finds universal representa-
tions by minimizing the difference between the losses with the optimal parame-
ters and arbitrary ones.
Although (3) is a reasonable quantification of the universality of the shared
encoder, including U directly in the learning objective is computationally
infeasible, as it requires knowing the optimal task-specific predictors ψ_{θ*_{T_k}}. The
following theorem, which assumes the convexity of the loss function, gives us
an indirect but feasible way of improving the universality of the shared
encoder. The assumption of convexity has also been utilized in other multitask
learning studies [30, 52] and has led to successful performance improvements.
\begin{equation}\label{eqn:InvProp}
\mathcal{U}(\theta_{\mathcal{E}} \,|\, \theta_{\mathcal{T}_{k}}^{\Delta}) \propto \left\| \nabla_{\theta_{\mathcal{T}_{k}}^{\Delta}} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \right\|_{F}^{-1}. \tag{4}
\end{equation}
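For intuition about why convexity connects the loss gap in (3) to the gradient norm in (4), the standard first-order argument can be sketched as follows. This is offered as an illustration and is not claimed to be the proof given by the authors. The shorthand J_k for the summed loss as a function of the predictor parameters with θE held fixed is introduced here, J_k is assumed convex in those parameters, and the class permutation of (3) is ignored.

```latex
\begin{align*}
J_k(\theta) &:= \sum_{i=1}^{n} \mathcal{L}_k\big(\mathbf{y}_{ki},
  (\psi_{\theta} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_i)\big), \\
J_k(\theta^{\Delta}_{\mathcal{T}_k}) - J_k(\theta^{*}_{\mathcal{T}_k})
  &\le \big\langle \nabla J_k(\theta^{\Delta}_{\mathcal{T}_k}),\,
       \theta^{\Delta}_{\mathcal{T}_k} - \theta^{*}_{\mathcal{T}_k} \big\rangle
  && \text{(convexity of } J_k\text{)} \\
  &\le \big\| \nabla J_k(\theta^{\Delta}_{\mathcal{T}_k}) \big\|_F \,
       \big\| \theta^{\Delta}_{\mathcal{T}_k} - \theta^{*}_{\mathcal{T}_k} \big\|_F
  && \text{(Cauchy--Schwarz)}.
\end{align*}
```

Under these assumptions the left-hand side is the reciprocal of U, so shrinking the dummy gradient norm shrinks an upper bound on the loss gap and thereby raises a lower bound on the universality.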
\begin{equation}\label{eqn:learning_objective}
\min_{\Theta} \sum_{k=1}^{K} \left[ \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) + \lambda \left\| \nabla_{\theta_{\mathcal{T}_{k}}^{\Delta}} \sum_{i=1}^{n} \mathcal{L}_{k}\big(\mathbf{y}_{ki}, (\psi_{\theta_{\mathcal{T}_{k}}^{\Delta}} \circ \varphi_{\theta_{\mathcal{E}}})(\mathbf{x}_{i})\big) \right\|_{F} \right], \tag{5}
\end{equation}
where λ > 0 is the hyperparameter that controls the tradeoff between the
loss and the penalty term. Problem (5) can be optimized with a typical minibatch
stochastic gradient method provided that Lk is smoothly differentiable. The
optimization steps to solve Problem (5) are summarized in Algorithm 1.
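The structure of Problem (5) for a single task can be sketched with a toy model in which the dummy gradient is available in closed form. The scalar linear encoder and heads, the squared loss, and the function names are assumptions made for illustration only.

```python
def task_loss(xs, ys, theta_e, theta_t):
    # Summed squared loss of the trained head on the shared encoder.
    return sum((theta_t * theta_e * x - y) ** 2 for x, y in zip(xs, ys))

def dummy_grad(xs, ys, theta_e, theta_dummy):
    # Analytic gradient of the summed loss w.r.t. the dummy head parameter.
    return sum(2.0 * (theta_dummy * theta_e * x - y) * theta_e * x
               for x, y in zip(xs, ys))

def dgr_objective(xs, ys, theta_e, theta_t, theta_dummy, lam):
    # Task loss plus lambda times the dummy gradient norm; for a scalar
    # parameter the Frobenius norm reduces to the absolute value.
    penalty = abs(dummy_grad(xs, ys, theta_e, theta_dummy))
    return task_loss(xs, ys, theta_e, theta_t) + lam * penalty

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
# Both settings fit the task exactly, but only the second makes the fixed
# dummy head (theta_dummy = 1) optimal, so only it avoids the penalty.
penalized = dgr_objective(xs, ys, theta_e=1.0, theta_t=2.0, theta_dummy=1.0, lam=0.1)
unpenalized = dgr_objective(xs, ys, theta_e=2.0, theta_t=1.0, theta_dummy=1.0, lam=0.1)
```

The two parameter settings have identical task loss, so the penalty alone steers the encoder toward representations on which the untrained head already performs well.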
Algorithm 1 involves the computation of a Hessian matrix of the loss function
with respect to the parameters of the dummy predictor, which can be costly in
both time and memory, depending on the number of parameters of the
task-specific predictor. Therefore, for computational and memory
efficiency, we adopt the finite difference approximation, which is widely used to
approximate Hessian matrices [16, 31]. The details of the finite difference
approximation are given in the Appendix.
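Since the Appendix is not reproduced here, the following is only a sketch of one common finite-difference scheme, in the style of [31]: the derivative of the gradient-norm penalty with respect to the encoder is approximated by a central difference of first-order encoder gradients, evaluated at dummy parameters shifted along the normalized dummy gradient. The toy scalar model and all names are assumptions.

```python
def grad_wrt_encoder(xs, ys, theta_e, w):
    # dL/d(theta_e) for L = sum_i (w * theta_e * x_i - y_i)^2.
    return sum(2.0 * (w * theta_e * x - y) * w * x for x, y in zip(xs, ys))

def grad_wrt_dummy(xs, ys, theta_e, w):
    # dL/dw for the same loss.
    return sum(2.0 * (w * theta_e * x - y) * theta_e * x for x, y in zip(xs, ys))

def penalty_grad_fd(xs, ys, theta_e, w, eps=1e-3):
    # d|g|/d(theta_e) equals the mixed second derivative times v = g/|g|.
    # It is approximated without forming any second derivative explicitly.
    g = grad_wrt_dummy(xs, ys, theta_e, w)
    v = g / abs(g)  # undefined when g == 0
    plus = grad_wrt_encoder(xs, ys, theta_e, w + eps * v)
    minus = grad_wrt_encoder(xs, ys, theta_e, w - eps * v)
    return (plus - minus) / (2.0 * eps)

def penalty_grad_exact(xs, ys, theta_e, w):
    # Exact derivative of |g| for this toy loss, used only as a check.
    g = grad_wrt_dummy(xs, ys, theta_e, w)
    dg = sum(4.0 * w * theta_e * x * x - 2.0 * y * x for x, y in zip(xs, ys))
    return dg if g > 0 else -dg

xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
approx = penalty_grad_fd(xs, ys, theta_e=1.5, w=0.7)
exact = penalty_grad_exact(xs, ys, theta_e=1.5, w=0.7)
```

Each evaluation needs only two extra first-order gradient computations, which is the source of the time and memory savings over forming the Hessian.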
9: Update θ_{T_k} by stochastic gradient descent: θ_{T_k}^{(t+1)} ← θ_{T_k}^{(t)} − η ∇_{θ_{T_k}} ℓ_k(B).
10: end for
11: Update θ_E by stochastic gradient descent: θ_E^{(t+1)} ← θ_E^{(t)} − η ∇_{θ_E} Σ_{k=1}^{K} ℓ_k(B).
12: t ← t + 1.
13: end while
14: (θ_E, θ_T) ← (θ_E^{(t)}, θ_T^{(t)}).
we added a dummy predictor ψ_{θ^Δ_{T_k}} that has exactly the same architecture as
the task-specific predictor ψ_{θ_{T_k}} that is actually used to make predictions. We then
initialized the dummy parameters θ^Δ_{T_k} randomly. The choice is arbitrary, but
empirically we have confirmed that it helps improve performance.
We fixed randomly initialized parameters for the dummy predictor during
training for learning stability. However, there is a risk that the encoder may
learn to adapt specifically to that fixed predictor. To prevent such adaptation,
we utilized multiple dummy predictors instead of a single one. In our implemen-
tation, we employed three distinct, fixed dummy predictors. In terms of employ-
ing additional decoders, our implementation draws parallels with Pseudo-Task
Augmentation (PTA) [39], which also employed multiple decoders across var-
ious tasks. However, while PTA aimed to train a shared encoder to solve the
same tasks in multiple ways, our focus lies in enhancing the shared encoder’s
universality through gradient regularization with multiple fixed decoders.
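The use of several fixed dummy predictors can be sketched as follows. The dummy parameters are drawn once from a seeded generator and never updated, and the penalty is averaged over them. The scalar toy model, the uniform initialization range, and the choice of averaging rather than summing are all assumptions made for illustration.

```python
import random

def make_dummy_heads(num_heads=3, seed=0):
    # Draw the dummy parameters once; they stay fixed during training.
    rng = random.Random(seed)
    return tuple(rng.uniform(-1.0, 1.0) for _ in range(num_heads))

def dummy_grad_norm(xs, ys, theta_e, w):
    # |dL/dw| for L = sum_i (w * theta_e * x_i - y_i)^2 with a scalar head.
    return abs(sum(2.0 * (w * theta_e * x - y) * theta_e * x
                   for x, y in zip(xs, ys)))

def multi_dummy_penalty(xs, ys, theta_e, dummy_heads):
    # Average the gradient-norm penalty over all fixed dummy heads.
    norms = [dummy_grad_norm(xs, ys, theta_e, w) for w in dummy_heads]
    return sum(norms) / len(norms)

DUMMY_HEADS = make_dummy_heads()
penalty = multi_dummy_penalty([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], 1.0, DUMMY_HEADS)
```

Because no single encoder setting can make several different fixed heads optimal at once in this toy model, the averaged penalty discourages adapting to any one of them.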
Additionally, in terms of utilizing auxiliary tasks, our implementation is sim-
ilar to previous studies on auxiliary learning [14,34,40,46], which aimed to learn
useful representations for specific primary tasks. However, while these
studies have centered on determining or creating beneficial auxiliary tasks to be
learned alongside primary tasks, DGR focuses on regularizing the gradient norm
obtained from dummy predictors with arbitrarily chosen and fixed parameters.
4 Experiments
In our experiments, we compared our method with two baseline methods, single-
task models (Single) and a vanilla multi-task model (Vanilla), and existing
MTL methods. We considered loss weighting approaches that have demonstrated
decent performance in previous studies, such as Uncertainty weighting [23],
Dynamic Weight Average (DWA) [36], and Auto-λ [35]. Furthermore, we
compared the proposed method with four other methods based on gradient
manipulation and multi-objective optimization: PCGrad [52], CAGrad [30], IMTL [32],
and Nash-MTL [41]. We further integrated our method with each competitive
MTL method (presented as +Ours) to investigate how our DGR can improve
performance from the existing approaches.
Table 1: Test accuracies averaged over three random trials on the UTKFace dataset.
Boldface and underline denote the best and second-best, respectively.
is defined as follows:
\begin{equation}\label{eq:12}
\Delta_{\text{MTL}} = \frac{1}{K} \sum_{k=1}^{K} (-1)^{\varsigma_{k}} (M_{m,k} - M_{b,k}) / M_{b,k} \tag{6}
\end{equation}
where ς_k is an indicator that takes the value 1 if a lower value of the performance
metric M_k for task T_k is better and 0 otherwise, and m and b denote the multi-task
model and the single-task baseline model, respectively.
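As a worked check of (6), the sketch below recomputes ∆MTL for the Vanilla model with the Split architecture from the NYUv2 numbers reported in Table 2. The function name `delta_mtl` is illustrative.

```python
def delta_mtl(multi_task, single_task, lower_is_better):
    """Average relative gain over the single-task baseline, following (6)."""
    terms = []
    for m, b, lower in zip(multi_task, single_task, lower_is_better):
        sign = -1.0 if lower else 1.0  # (-1)^{varsigma_k}
        terms.append(sign * (m - b) / b)
    return sum(terms) / len(terms)

# NYUv2, Split: normal (mDist., lower is better), semantic segmentation
# (mIoU, higher is better), depth (aErr., lower is better).
single = [22.40, 43.37, 52.24]
vanilla = [24.19, 46.02, 41.36]
gain = delta_mtl(vanilla, single, lower_is_better=[True, False, True])
print(f"{100 * gain:+.2f}%")  # +6.32%, matching the Vanilla row of Table 2
```

The sign flip for lower-is-better metrics ensures that an improvement on any task contributes positively to the average.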
Results. Table 1 presents the image classification performance of the proposed
method, baselines, and other MTL methods on UTKFace dataset. In addition,
we also reported the performance of baselines and other MTL methods combined
with our method. Across all tasks, the MTL methods, including DGR, outper-
formed the single-task model. Notably, both when independently employed and
integrated with the other MTL methods, DGR improved performance in two of
the three tasks, specifically the challenging age prediction and ethnicity
classification tasks, which involve a larger number of classes than the gender
classification task. Moreover, when combined with DGR, ∆MTL increased for almost
all MTL methods, the exception being PCGrad. This result emphasizes the efficacy
of DGR in enhancing the universality of representations, which helps obtain
representations suited to diverse and intricate tasks.
We also conducted comparative experiments with DGR against SAM [17]
and PTA [39], which share some similarities with our work. We compared the
average ∆MTL of each method when used both independently and in combination
with the three MTL methods that demonstrated the highest performance in
Table 1: Auto-λ, CAGrad, and Nash-MTL. As shown in Figure 2, although
DGR, SAM, and PTA all showed consistent performance improvement when
integrated into the MTL methods, the proposed DGR showed the largest
improvement. The analysis of computational cost is reported in the Appendix.
Fig. 2: Comparison of average ∆MTL over three trials on the UTKFace dataset
when SAM, PTA, and Ours are integrated into each MTL method. The bars colored
in gray indicate the results obtained using each method independently.
Fig. 3: (a) Average test performance over 3 trials on the UTKFace dataset, where the
learned representation using MTL was fixed, and relatively simple classifiers were used
to replace the original classifier. (b) The t-SNE visualization results of shared repre-
sentations generated by the Vanilla (left) and DGR (right). Each row corresponds to
a specific task.
tations generated by DGR were not only clustered by classes but also aligned
by age order, whereas those generated by the Vanilla were rather dispersed.
We evaluated DGR on scene understanding tasks with the two standard datasets,
NYUv2 [48] and CityScapes [10]. In the experiments on the NYUv2 dataset,
we trained on three tasks: surface normal prediction, 13-class semantic
segmentation, and depth prediction. In the experiments on the CityScapes dataset,
we also trained on three tasks: 19-class semantic segmentation, 10-class part
segmentation [18], and disparity estimation, with the same setting as in [23].
Both datasets are commonly used for evaluation in MTL studies.
Experiment Settings. For the experiments on scene understanding tasks, we
used the state-of-the-art multi-task architecture MTAN [36] that has been pri-
marily applied to the same experiments in previous studies, along with the
standard structure Split used in image classification experiments. The MTAN
structure differs from the Split by incorporating task-specific attention to ob-
tain shared representations. We used ResNet-50 as the backbone architecture
for all methods. Furthermore, we used the SGD optimizer for all methods ex-
cept Nash-MTL, where we used Adam [24] to achieve stable convergence of the
Table 2: Average test performance over three trials on the NYUv2 dataset with
MTL methods in Split (left) and MTAN (right) multi-task architectures. Bold and
underline denote the best and second-best, respectively. The arrow next to each
metric indicates whether higher (↑) or lower (↓) is better.
             Split                                           MTAN
Method       Normal      Semantic Seg.  Depth      ∆MTL ↑    Normal      Semantic Seg.  Depth      ∆MTL ↑
             (mDist.) ↓  (mIoU) ↑       (aErr.) ↓            (mDist.) ↓  (mIoU) ↑       (aErr.) ↓
Single       22.40       43.37          52.24      -         22.40       43.37          52.24      -
Vanilla      24.19       46.02          41.36      +6.32%    24.05       46.05          41.03      +6.55%
+ Ours       23.98       47.11          40.11      +8.26%    24.04       46.24          39.98      +7.51%
DWA          24.20       45.84          41.58      +6.02%    24.06       46.47          41.48      +6.71%
+ Ours       24.18       46.43          41.08      +6.82%    24.10       46.81          41.37      +7.24%
Uncertainty  24.24       46.46          41.04      +6.78%    23.97       46.96          41.02      +7.57%
+ Ours       24.34       46.52          40.80      +6.83%    23.89       46.23          40.00      +7.28%
Auto-λ       23.60       48.20          39.83      +9.85%    23.43       48.10          39.21      +10.02%
+ Ours       23.60       48.04          39.45      +9.96%    23.32       48.19          39.03      +10.50%
PCGrad       24.55       45.41          42.32      +4.70%    24.07       46.71          39.79      +5.41%
+ Ours       24.03       46.71          41.20      +7.19%    23.95       46.86          39.73      +7.42%
CAGrad       23.39       46.79          40.55      +8.61%    23.23       47.48          39.92      +9.38%
+ Ours       23.28       47.48          39.81      +9.78%    22.98       48.49          38.94      +11.56%
IMTL         22.45       50.11          38.69      +13.75%   22.53       49.84          38.43      +13.43%
+ Ours       22.37       49.62          37.70      +14.13%   22.39       50.04          38.18      +14.42%
Nash-MTL     23.60       48.20          39.83      +9.85%    23.43       48.10          39.84      +10.02%
+ Ours       22.54       48.25          39.53      +11.65%   22.66       48.38          39.59      +11.75%
Table 3: Average test performance over three trials on the CityScapes dataset with
MTL methods in Split (left) and MTAN (right) multi-task architectures. Bold and
underline denote the best and second-best, respectively. The arrow next to each
metric indicates whether higher (↑) or lower (↓) is better.
             Split                                          MTAN
Method       Semantic Seg.  Part Seg.  Disparity  ∆MTL ↑    Semantic Seg.  Part Seg.  Disparity  ∆MTL ↑
             (mIoU) ↑       (mIoU) ↑   (aErr.) ↓            (mIoU) ↑       (mIoU) ↑   (aErr.) ↓
Single       56.20          52.74      0.84       -         56.20          52.74      0.84       -
Vanilla      56.29          52.31      0.75       +3.35%    56.64          52.52      0.74       +4.09%
+ Ours       56.27          52.45      0.74       +3.83%    57.13          51.88      0.73       +4.37%
DWA          55.63          52.22      0.72       +4.10%    55.88          52.51      0.72       +4.43%
+ Ours       55.84          52.39      0.72       +4.33%    55.65          52.54      0.72       +4.31%
Uncertainty  56.77          54.59      0.77       +4.29%    56.66          55.03      0.76       +4.89%
+ Ours       56.76          55.46      0.78       +4.43%    56.59          55.10      0.75       +5.29%
Auto-λ       55.95          52.94      0.70       +5.53%    55.97          52.67      0.69       +5.77%
+ Ours       57.22          53.55      0.71       +6.28%    56.76          53.29      0.69       +6.63%
PCGrad       55.40          52.53      0.72       +4.15%    55.53          52.83      0.72       +4.42%
+ Ours       56.04          52.53      0.72       +4.53%    56.61          52.75      0.72       +5.01%
CAGrad       56.55          55.05      0.71       +6.83%    56.36          55.32      0.72       +6.49%
+ Ours       57.05          55.16      0.71       +7.19%    56.66          56.07      0.72       +7.14%
IMTL         58.33          55.43      0.76       +6.14%    58.67          55.77      0.76       +6.55%
+ Ours       58.46          55.33      0.75       +6.55%    58.12          55.73      0.75       +6.60%
Nash-MTL     57.65          55.59      0.75       +6.23%    57.64          55.09      0.74       +6.31%
+ Ours       57.78          55.04      0.74       +6.36%    57.77          55.60      0.74       +6.71%
vary by several factors, including datasets, model structures, tasks, and base
methods.
Table 4: (a) Experimental results with varying numbers of dummy decoders (d) and
primary tasks (numbers in parentheses indicate the number of primary tasks). The
final row shows the experimental result when DGR is utilized for single-task learning.
For MTL, ∆MTL is reported, while for single-task learning, top-1 accuracy is reported.
(b) Experimental results on the NYUv2 and the CityScapes datasets when DGR is
used with the Swin backbone. (c) Experimental results on the Pascal dataset.
These findings demonstrate that DGR performs robustly under various con-
ditions and can achieve performance improvements when combined with more
advanced backbone architectures, such as the Swin transformer.
5 Conclusion
Acknowledgements
References
1. Anderson, C., Farrell, R.: Improving fractal pre-training. In: Proceedings of the
IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1300–
1309 (2022)
2. Bachmann, R., Mizrahi, D., Atanov, A., Zamir, A.: Multimae: Multi-modal multi-
task masked autoencoders. In: European Conference on Computer Vision. pp. 348–
367. Springer (2022)
3. Bengio, Y., Courville, A., Vincent, P.: Representation learning: A review and
new perspectives. IEEE transactions on pattern analysis and machine intelligence
35(8), 1798–1828 (2013)
4. Caruana, R.: Multitask learning. Springer (1998)
5. Chen, T., Frankle, J., Chang, S., Liu, S., Zhang, Y., Carbin, M., Wang, Z.: The
lottery tickets hypothesis for supervised and self-supervised pre-training in com-
puter vision models. In: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition. pp. 16306–16316 (2021)
6. Chen, Y., Zhao, D., Lv, L., Zhang, Q.: Multi-task learning for dangerous object
detection in autonomous driving. Information Sciences 432, 559–571 (2018)
7. Chen, Z., Badrinarayanan, V., Lee, C.Y., Rabinovich, A.: Gradnorm: Gradient nor-
malization for adaptive loss balancing in deep multitask networks. In: International
conference on machine learning. pp. 794–803. PMLR (2018)
8. Chen, Z., Ngiam, J., Huang, Y., Luong, T., Kretzschmar, H., Chai, Y., Anguelov,
D.: Just pick a sign: Optimizing deep multitask models with gradient sign dropout.
Advances in Neural Information Processing Systems 33, 2039–2050 (2020)
9. Chowdhuri, S., Pankaj, T., Zipser, K.: Multinet: Multi-modal multi-task learn-
ing for autonomous driving. In: 2019 IEEE Winter Conference on Applications of
Computer Vision (WACV). pp. 1496–1504. IEEE (2019)
10. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 3213–3223 (2016)
11. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-
scale hierarchical image database. In: 2009 IEEE conference on computer vision
and pattern recognition. pp. 248–255. IEEE (2009)
12. Désidéri, J.A.: Multiple-gradient descent algorithm (mgda) for multiobjective op-
timization. Comptes Rendus Mathematique 350(5-6), 313–318 (2012)
13. Dong, X., Taylor, C.J., Cootes, T.F.: Defect classification and detection using
a multitask deep one-class cnn. IEEE Transactions on Automation Science and
Engineering 19(3), 1719–1730 (2021)
14. Du, Y., Czarnecki, W.M., Jayakumar, S.M., Farajtabar, M., Pascanu, R., Lakshmi-
narayanan, B.: Adapting auxiliary losses using gradient similarity. arXiv preprint
arXiv:1812.02224 (2018)
15. Evgeniou, T., Pontil, M.: Regularized multi–task learning. In: Proceedings of the
tenth ACM SIGKDD international conference on Knowledge discovery and data
mining. pp. 109–117 (2004)
16. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation
of deep networks. In: International conference on machine learning. pp. 1126–1135.
PMLR (2017)
17. Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B.: Sharpness-aware minimization
for efficiently improving generalization. arXiv preprint arXiv:2010.01412 (2020)
18. de Geus, D., Meletis, P., Lu, C., Wen, X., Dubbelman, G.: Part-aware panoptic
segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 5485–5494 (2021)
19. Guo, M., Haque, A., Huang, D.A., Yeung, S., Fei-Fei, L.: Dynamic task prior-
itization for multitask learning. In: Proceedings of the European conference on
computer vision (ECCV). pp. 270–287 (2018)
20. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
21. Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P.,
Trischler, A., Bengio, Y.: Learning deep representations by mutual information
estimation and maximization. arXiv preprint arXiv:1808.06670 (2018)
22. Ishihara, K., Kanervisto, A., Miura, J., Hautamaki, V.: Multi-task learning with
attention for end-to-end autonomous driving. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 2902–2911 (2021)
23. Kendall, A., Gal, Y., Cipolla, R.: Multi-task learning using uncertainty to weigh
losses for scene geometry and semantics. In: Proceedings of the IEEE conference
on computer vision and pattern recognition. pp. 7482–7491 (2018)
24. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980 (2014)
25. Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., Houlsby, N.:
Big transfer (bit): General visual representation learning. In: Computer Vision–
ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Pro-
ceedings, Part V 16. pp. 491–507. Springer (2020)
26. Kolesnikov, A., Zhai, X., Beyer, L.: Revisiting self-supervised visual representation
learning. In: Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition. pp. 1920–1929 (2019)
27. Lee, S., Son, Y.: Multitask learning with single gradient step update for task bal-
ancing. Neurocomputing 467, 442–453 (2022)
28. Li, Y., Li, J.: An end-to-end defect detection method for mobile phone light guide
plate via multitask learning. IEEE Transactions on Instrumentation and Measure-
ment 70, 1–13 (2021)
29. Lin, X., Zhen, H.L., Li, Z., Zhang, Q.F., Kwong, S.: Pareto multi-task learning.
Advances in neural information processing systems 32 (2019)
30. Liu, B., Liu, X., Jin, X., Stone, P., Liu, Q.: Conflict-averse gradient descent
for multi-task learning. Advances in Neural Information Processing Systems 34,
18878–18890 (2021)
31. Liu, H., Simonyan, K., Yang, Y.: Darts: Differentiable architecture search. arXiv
preprint arXiv:1806.09055 (2018)
32. Liu, L., Li, Y., Kuang, Z., Xue, J., Chen, Y., Yang, W., Liao, Q., Zhang, W.:
Towards impartial multi-task learning. In: International Conference on Learning
Representations (2021)
33. Liu, S., Liang, Y., Gitter, A.: Loss-balanced task weighting to reduce negative
transfer in multi-task learning. In: Proceedings of the AAAI conference on artificial
intelligence. vol. 33, pp. 9977–9978 (2019)
34. Liu, S., Davison, A., Johns, E.: Self-supervised generalisation with meta auxiliary
learning. Advances in Neural Information Processing Systems 32 (2019)
35. Liu, S., James, S., Davison, A.J., Johns, E.: Auto-lambda: Disentangling dynamic
task relationships. arXiv preprint arXiv:2202.03091 (2022)
36. Liu, S., Johns, E., Davison, A.J.: End-to-end multi-task learning with attention. In:
Proceedings of the IEEE/CVF conference on computer vision and pattern recog-
nition. pp. 1871–1880 (2019)
37. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Proceedings
of the IEEE/CVF international conference on computer vision. pp. 10012–10022
(2021)
38. Maninis, K.K., Radosavovic, I., Kokkinos, I.: Attentive single-tasking of multiple
tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition. pp. 1851–1860 (2019)
39. Meyerson, E., Miikkulainen, R.: Pseudo-task augmentation: From deep multitask
learning to intratask sharing—and back. In: International Conference on Machine
Learning. pp. 3511–3520. PMLR (2018)
40. Navon, A., Achituve, I., Maron, H., Chechik, G., Fetaya, E.: Auxiliary learning by
implicit differentiation. arXiv preprint arXiv:2007.02693 (2020)
41. Navon, A., Shamsian, A., Achituve, I., Maron, H., Kawaguchi, K., Chechik, G., Fe-
taya, E.: Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017
(2022)
42. Rebuffi, S.A., Kolesnikov, A., Sperl, G., Lampert, C.H.: icarl: Incremental classifier
and representation learning. In: Proceedings of the IEEE conference on Computer
Vision and Pattern Recognition. pp. 2001–2010 (2017)
43. Ruder, S.: An overview of multi-task learning in deep neural networks. arXiv
preprint arXiv:1706.05098 (2017)
44. Sampath, V., Maurtua, I., Martín, J.J.A., Rivera, A., Molina, J., Gutierrez, A.:
Attention guided multi-task learning for surface defect identification. IEEE Trans-
actions on Industrial Informatics (2023)
45. Sener, O., Koltun, V.: Multi-task learning as multi-objective optimization. Ad-
vances in neural information processing systems 31 (2018)
46. Shamsian, A., Navon, A., Glazer, N., Kawaguchi, K., Chechik, G., Fetaya, E.:
Auxiliary learning as an asymmetric bargaining game. In: International Conference
on Machine Learning. pp. 30689–30705. PMLR (2023)
47. Shao, L., Zhang, E., Ma, Q., Li, M.: Pixel-wise semisupervised fabric defect de-
tection method combined with multitask mean teacher. IEEE Transactions on
Instrumentation and Measurement 71, 1–11 (2022)
48. Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support
inference from rgbd images. ECCV (5) 7576, 746–760 (2012)
49. Tzeng, E., Hoffman, J., Saenko, K., Darrell, T.: Adversarial discriminative domain
adaptation. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. pp. 7167–7176 (2017)
50. Wang, X., Li, L., Ye, W., Long, M., Wang, J.: Transferable attention for domain
adaptation. In: Proceedings of the AAAI Conference on Artificial Intelligence.
vol. 33, pp. 5345–5352 (2019)
51. Wu, Z., Xiong, Y., Yu, S.X., Lin, D.: Unsupervised feature learning via non-
parametric instance discrimination. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 3733–3742 (2018)
52. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., Finn, C.: Gradient surgery
for multi-task learning. Advances in Neural Information Processing Systems 33,
5824–5836 (2020)
53. Zhang, C., Bengio, S., Hardt, M., Recht, B., Vinyals, O.: Understanding deep
learning (still) requires rethinking generalization. Communications of the ACM
64(3), 107–115 (2021)
54. Zhang, Y., Yang, Q.: An overview of multi-task learning. National Science Review
5(1), 30–43 (2018)
55. Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversar-
ial autoencoder. In: Proceedings of the IEEE conference on computer vision and
pattern recognition. pp. 5810–5818 (2017)
56. Zhang, Z., Wang, S., Xu, Y., Fang, Y., Yu, W., Liu, Y., Zhao, H., Zhu, C., Zeng,
M.: Task compass: Scaling multi-task pre-training with task prefix. arXiv preprint
arXiv:2210.06277 (2022)