Align and Distill: Unifying and Improving Domain Adaptive Object Detection
Justin Kay∗1 , Timm Haucke1 , Suzanne Stathatos2 , Siqi Deng†3 , Erik Young4 ,
Pietro Perona2,3 , Sara Beery‡1 , and Grant Van Horn‡5
1 MIT   2 Caltech   3 AWS   4 Skagit Fisheries Enhancement Group   5 UMass Amherst
1 Introduction
The challenge of DAOD. Modern object detector performance, though excel-
lent across many benchmarks [3, 36, 46, 49, 52, 53], often severely degrades when
test data exhibits a distribution shift with respect to training data [41]. For
instance, detectors do not generalize well when deployed in new environments
in environmental monitoring applications [30, 53]. Similarly, models in medical
applications perform poorly when deployed in different hospitals or on different
hardware than they were trained [19, 55]. Unfortunately, in real-world applica-
tions it is often difficult, expensive, or time-consuming to collect the additional
annotations needed to address such distribution shifts in a supervised manner.
∗ Correspondence to: [email protected]   † Work done outside AWS   ‡ Equal contribution
∗∗ github.com/justinkay/aldi   †† github.com/visipedia/caltech-fish-counting
[Fig. 1 plot: Target AP50 on Cityscapes → Foggy Cityscapes for prior art (UMT, SADA, PT, MIC, AT), prior art reimplemented with ALDI, and ALDI++ (including a +ViT variant), with source-only and oracle reference lines.]
Fig. 1: ALDI achieves state-of-the-art performance in domain adaptive ob-
ject detection (DAOD) and provides a unified framework for fair compari-
son. We show: (1) Inconsistent implementation practices give the appearance of steady
progress in DAOD (left bars [8, 10, 12, 26, 34]); reimplementation and fair comparison
with ALDI shows less difference between methods than previously reported (middle
bars); (2) A fairly constructed source-only model (blue line) outperforms many existing
DAOD methods, indicating less progress has been made than previously reported; and a
proper oracle (orange line) outperforms all existing methods, in contrast to previously-
published results; and (3) Our proposed method ALDI++ (green bars) achieves state-
of-the-art performance on DAOD benchmarks such as Cityscapes → Foggy Cityscapes
and is complementary to ongoing advances in object detection like ViTDet [33].
2 Related Work
Two methodological themes have dominated recent DAOD research: feature
alignment and self-training/self-distillation. We first give an overview of these
themes and previous efforts to combine them, and then use commonalities to
motivate our unified framework, Align and Distill, in Sec. 3.
Feature alignment in DAOD. Feature alignment methods aim to make target-
domain data “look like” source-domain data, reducing the magnitude of the dis-
tribution shift. The most common approach utilizes an adversarial learning ob-
jective to align the feature spaces of source and target data [9, 10, 16, 57]. Faster
R-CNN in the Wild [9] utilizes adversarial networks at the image and instance
level. SADA [10] extends this to multiple adversarial networks at different feature
levels. Other approaches propose mining for discriminative regions [57], weight-
ing local and global features differently [47], incorporating uncertainty [40], and
using attention networks [51]. Alignment at the pixel level has also been proposed
using image-to-image translation techniques to modify input images directly [12].
Self-training/self-distillation in DAOD. Self-training methods use a “teacher”
model to predict pseudo-labels on target-domain data that are then used as
training targets for a “student” model. Self-training can be seen as a type of self-
distillation [6, 43], which is a special case of knowledge distillation [7, 25] where
the teacher and student models share the same architecture. Most recent self-
training approaches in DAOD are based on the Mean Teacher [50] framework,
in which the teacher model is updated as an exponential moving average (EMA)
of the student model’s parameters. Extensions to Mean Teacher for DAOD in-
clude: MTOR, which utilizes graph structure to enforce student-teacher feature
consistency [4], Probabilistic Teacher (PT), which uses probabilistic localization
prediction and soft distillation losses [8], and Contrastive Mean Teacher (CMT),
which uses MoCo [21] to enforce student-teacher feature consistency [5].
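For concreteness, the EMA teacher update at the core of Mean Teacher-style methods can be sketched as follows. This is a minimal PyTorch sketch; the decay value and function name are illustrative rather than the exact implementation used by any of the cited methods.

import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.999):
    """Update teacher parameters as an exponential moving average of student parameters."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        # teacher <- decay * teacher + (1 - decay) * student
        p_t.mul_(decay).add_(p_s.detach(), alpha=1.0 - decay)
    # Buffers (e.g. BatchNorm running statistics) are typically copied directly.
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)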
Combining feature alignment and self-training. Several approaches utilize
both feature alignment and self-training/self-distillation, motivating our unified
framework. Unbiased Mean Teacher (UMT) [12] uses mean teacher in combina-
tion with image-to-image translation to align source and target data at the pixel
level. Adaptive Teacher (AT) [34] uses mean teacher with an image-level dis-
criminator network. Masked Image Consistency (MIC) [26] uses mean teacher,
SADA, and a masking augmentation to enforce teacher-student consistency. Be-
cause these methods were implemented in different codebases using different
training recipes and hyperparameter settings, it is unclear which contributions are
most effective and to what extent feature alignment and self-training are com-
plementary. We address these issues by reimplementing these approaches in the
ALDI framework and performing fair comparisons and ablation studies in Sec. 6.
Table 1: ALDI unifies and extends existing work. We show settings to reproduce
five prior works and our method ALDI++. See Fig. 2 and Sec. 3 for more details. Burn-
in: fixed duration (Fixed), our approach (Ours, Sec. 4). Augs. Tsrc, Ttgt: random flip
(F), multi-scale (M), crop & pad (CP), color jitter (J), cutout [13] (C), MIC [26]; ½:
augs used on half the images in the batch. Btgt/B: target-domain portion of a minibatch
of size B. Postprocess: processing of teacher preds before distillation: sigmoid/softmax
(Sharpen), sum class preds for pseudo-objectness (Sum), conf. thresholding (Thresh),
NMS. Ldistill: distillation losses: hard pseudo-labels (Hard), continuous targets (Soft).
Lalign: feature alignment losses: image-level adversarial (Img), instance-level adversarial
(Inst), image-to-image translation (Img2Img). †: settings used in the ALDI implementation
(last column) but not in the original implementation (second-to-last column). Source-only
and oracle results sourced from [34].
Fig. 2: (Left) The ALDI student-teacher framework for DAOD. Each training
step (moving left to right and bottom to top): (1) Sample Bsrc labeled source images
xsrc ; transform by t ∼ Tsrc ; pass to student; compute supervised loss Lsup using
ground-truth labels ysrc . (2) Sample Btgt unlabeled target images xtgt ; transform by
t ∼ Ttgt ; pass to student to get preds ptgt . Compute alignment objectives Lalign using
xsrc and xtgt . (3) Pass same unlabeled target data xtgt , weakly transformed, to teacher;
postprocess to obtain teacher preds p̂tgt . Compute distillation loss Ldistill between
teacher and student preds. Use stop gradient (SG) on teacher model; update teacher
to the EMA of student’s weights. (Middle, Right) ALDI++ (Sec. 4) introduces
two new methods to achieve state-of-the-art performance: (Middle) A robust
burn-in strategy utilizing strong augmentations and EMA, and (Right) Multi-task soft
distillation losses to train the student using teacher outputs at all detector stages. σ:
sigmoid or softmax for binary cross-entropy and cross entropy losses, respectively.
3. Feature alignment. xsrc and xtgt are “aligned” via an alignment objective
Lalign that enforces invariance across domains either at the image or feature level.
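Putting the components together, one ALDI training step (as described in Fig. 2) can be sketched as follows. This is a schematic sketch only, not the released implementation: weak_aug, strong_aug, postprocess, align_loss, distill_loss, and the student.supervised_loss / student.features methods are placeholders.

import torch

def aldi_training_step(student, teacher, x_src, y_src, x_tgt, optimizer,
                       weak_aug, strong_aug, postprocess,
                       align_loss, distill_loss, ema_decay=0.999):
    # (1) Supervised loss on transformed source images with ground-truth labels.
    loss_sup = student.supervised_loss(strong_aug(x_src), y_src)

    # (2) Student predictions on transformed target images, plus an alignment
    #     objective computed between source and target features.
    p_tgt = student(strong_aug(x_tgt))
    loss_align = align_loss(student.features(x_src), student.features(x_tgt))

    # (3) Teacher pseudo-targets on the same target images, weakly transformed;
    #     no gradients flow through the teacher (stop-gradient).
    with torch.no_grad():
        p_hat_tgt = postprocess(teacher(weak_aug(x_tgt)))
    loss_distill = distill_loss(p_tgt, p_hat_tgt)

    loss = loss_sup + loss_align + loss_distill
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher tracks the student via EMA (see the ema_update helper sketched earlier).
    ema_update(teacher, student, decay=ema_decay)
    return loss.item()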
Unification of prior work. We demonstrate the generality of our framework
by reimplementing five recently-proposed methods on top of ALDI for fair com-
parison: UMT [12], SADA [10], PT [8], MIC [26], and AT [34]. In Tab. 1 we
enumerate the settings required to reproduce each method.
p^{rpn,obj}_{tgt} = \theta^{rpn,obj}_{stu}(A, x^{t}_{tgt})   (1)
\hat{p}^{rpn,obj}_{tgt} = \theta^{rpn,obj}_{tch}(A, x^{\hat{t}}_{tgt})   (2)
p^{roih,cls}_{tgt} = \theta^{roih,cls}_{stu}(p^{rpn}_{tgt}, x^{t}_{tgt})   (3)
\hat{p}^{roih,cls}_{tgt} = \theta^{roih,cls}_{tch}(p^{rpn}_{tgt}, x^{\hat{t}}_{tgt})   (4)
At each iteration, student distillation losses Ldistill are computed as:
L^{rpn}_{distill} = \lambda_{0} L_{rpn}(p^{rpn}_{tgt}, \hat{p}^{rpn}_{tgt}) + \lambda_{1} L_{obj}(p^{obj}_{tgt}, \hat{p}^{obj}_{tgt})   (5)
L^{roih}_{distill} = \lambda_{2} L_{roih}(p^{roih}_{tgt}, \hat{p}^{roih}_{tgt}) + \lambda_{3} L_{cls}(p^{cls}_{tgt}, \hat{p}^{cls}_{tgt})   (6)
where Lrpn and Lroih are smooth L1 losses, Lobj and Lcls are cross-entropy
losses, and λ0...3 = 1 by default. See Fig. 2 for a visual depiction. We
include more implementation details in the supplemental material.
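For intuition, the multi-task soft distillation losses in Eqs. (5)–(6) can be sketched in PyTorch as below. The tensor names mirror the notation above; the dictionary layout, tensor shapes, and class-agnostic box parameterization are simplifying assumptions rather than the released implementation.

import torch
import torch.nn.functional as F

def soft_distill_losses(p, p_hat, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """p / p_hat: dicts of student / teacher predictions for the same proposals.

    p["rpn"], p["roih"]: box regression outputs, shape (N, 4).
    p["obj"]: RPN objectness logits, shape (N,); p["cls"]: ROI-head class logits, shape (N, C).
    """
    l0, l1, l2, l3 = lambdas
    # RPN: smooth L1 on box deltas, binary cross-entropy against sigmoided teacher objectness (Eq. 5).
    loss_rpn = l0 * F.smooth_l1_loss(p["rpn"], p_hat["rpn"]) \
             + l1 * F.binary_cross_entropy_with_logits(p["obj"], torch.sigmoid(p_hat["obj"]))
    # ROI heads: smooth L1 on box deltas, cross-entropy against softmaxed teacher class scores (Eq. 6).
    loss_roih = l2 * F.smooth_l1_loss(p["roih"], p_hat["roih"]) \
              + l3 * F.cross_entropy(p["cls"], F.softmax(p_hat["cls"], dim=-1))
    return loss_rpn + loss_roih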
One prior DAOD work, PT [8], has also used soft distillation losses; however,
we note two shortcomings that our method addresses: (1) PT requires a custom
“Probabilistic R-CNN” architecture for distillation, while our approach is general
and can work with any two-stage detector, and (2) PT uses p̂cls as an indirect
proxy for distilling pobj , while our approach is able to distill each task directly.
supervised training set to train oracle methods. We keep the original supervised
Kenai training set from CFC (132k annotations in 70k images) and the original
Channel test set (42k annotations in 13k images). We note this is substantially
larger than existing DAOD benchmarks (CS contains 32k instances in 3.5k im-
ages, and Sim10k contains 58k instances in 10k images). See the supplemental
material for more dataset statistics. We make the dataset public.
6 Experiments
In this section we propose an updated benchmarking protocol for DAOD (Sec. 6.1)
that allows us to fairly analyze the performance of ALDI++ compared to prior
work (Sec. 6.2) and conduct extensive ablation studies (Sec. 6.3).
Datasets. We perform experiments on Cityscapes → Foggy Cityscapes, Sim10k
→ Cityscapes, and CFC Kenai → Channel. In addition to being consistent with
prior work, these datasets represent three common adaptation scenarios captur-
ing a range of real-world challenges: weather adaptation, Sim2Real, and environ-
mental adaptation, respectively. We note that there have been inconsistencies in
prior work in terms of which ground truth labels for Cityscapes are used. We
use the Detectron2 version.
Metrics. For all experiments we report the PascalVOC metric of mean Average
Precision with IoU ≥ 0.5 (“AP50”) [15]. This is consistent with prior work on
Cityscapes, Foggy Cityscapes, Sim10k, and CFC.
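As a reference for how AP50 can be computed in practice, here is a minimal sketch using pycocotools. The file names are placeholders, and note that COCO-style AP at IoU = 0.5 uses 101-point interpolation and so is only an approximation of the PascalVOC protocol.

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Hypothetical file names; any COCO-format ground truth and detection files will do.
coco_gt = COCO("target_test_annotations.json")
coco_dt = coco_gt.loadRes("detections.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()
ap50 = evaluator.stats[1]  # AP averaged over classes at IoU = 0.50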
[Fig. 4: diagram of source-only and oracle training configurations (with optional EMA) and scatter plot of Source AP50 (CS) vs. target-domain AP50 for source-only and oracle models.]
Fig. 4: Revisiting source-only and oracle models in DAOD. We argue that in
order to provide a fair measure of domain adaptation performance in DAOD, source-
only and oracle models must utilize the same non-adaptive architectural and train-
ing components as methods being studied. In the case of Align and Distill -based ap-
proaches, this means source-only and oracle models must have access to the same set of
source augmentations and EMA as DAOD methods. We see that these upgrades signif-
icantly improve source-only performance on target-domain data (+7.2 AP50 on Foggy
Cityscapes), even though the source-only model has never seen any target-domain data,
and these upgrades also improve oracle performance. Overall, these results set more
challenging and realistic performance targets for DAOD methods.
model selection, and all self-training runs are initialized with the same burned-
in checkpoint for fair comparison. See Tab. 1 for additional definitions.
θstu , θtch Network initialization (burn-in). In Fig. 6a we analyze the effects
of our proposed burn-in strategy (see Sec. 4). We measure performance in terms
of target-domain AP50 as well as convergence time, defined as the training time
at which the model first exceeds 95% of its final target-domain performance. We
compare our approach with: (1) No dataset-specific burn-in, i.e. starting with
COCO weights, and (2) The approach used by past work—using a fixed burn-
in duration, e.g. 10k iterations. We find that our method results in significant
improvements in both training speed and accuracy, leading to upwards of 10%
improvements in AP50 and reducing training time by a factor of 10 compared
to training without burn-in.
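The convergence-time measurement described above can be made precise with a small helper. This is a sketch; eval_times and eval_scores are assumed to be values logged during training, not part of the released code.

def convergence_time(eval_times, eval_scores, frac=0.95):
    """Return the earliest training time at which target-domain AP50 first
    exceeds `frac` of its final value; None if it never does."""
    threshold = frac * eval_scores[-1]
    for t, score in zip(eval_times, eval_scores):
        if score >= threshold:
            return t
    return None

# Example with evaluations logged every 10 V100-hours:
# convergence_time([10, 20, 30, 40], [40.1, 55.0, 62.0, 63.5])  ->  30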
Tsrc Source augmentations. In Fig. 6b we ablate the set of source-domain
data augmentations. We compare using weak augmentations (random flipping
and multi-scale training), strong augmentations (color jitter and random eras-
ing), and a combination of weak and strong, noting that prior works differ in this
regard but do not typically report the settings used (see Tab. 1). We find that
using strong source augmentations on the entire source-domain training batch
outperforms weak augmentations and a combination of both.
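As an illustration, weak and strong source pipelines of the kind described above might look as follows with torchvision. This is a sketch covering images only (in detection, flips and resizes must also be applied to the boxes), and the scale choices, jitter magnitudes, and erasing parameters are assumptions rather than our experimental settings.

import torchvision.transforms as T

# Weak: random flip + multi-scale resizing (scale choices are illustrative).
weak_src = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomChoice([T.Resize(s) for s in (640, 720, 800)]),
])

# Strong: weak augs + color jitter + random erasing (cutout-style).
strong_src = T.Compose([
    weak_src,
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.ToTensor(),
    T.RandomErasing(p=0.7, scale=(0.05, 0.2)),
])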
Ttgt Target augmentations. In Tab. 3a we investigate the use of different
augmentations for target-domain inputs to the student model. (We note that
weak augmentations are always used for target-domain inputs to the teacher in
accordance with prior work). We see that stronger augmentations consistently
improve performance, with best performance coming from the recently-proposed
MIC augmentation [26].
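For intuition, a patch-masking augmentation in the spirit of MIC [26] can be sketched as below. This is an approximation; the patch size and mask ratio are illustrative, not MIC's exact settings.

import torch

def random_patch_mask(img: torch.Tensor, patch_size: int = 32, mask_ratio: float = 0.5):
    """Zero out a random subset of non-overlapping patches of a CxHxW image tensor."""
    _, h, w = img.shape
    gh, gw = h // patch_size, w // patch_size
    keep = (torch.rand(gh, gw) > mask_ratio).float()   # 1 = keep patch, 0 = mask patch
    mask = keep.repeat_interleave(patch_size, 0).repeat_interleave(patch_size, 1)
    out = img.clone()
    out[:, :gh * patch_size, :gw * patch_size] *= mask
    return out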
[Fig. 6 plots: (a) θstu, θtch Burn-in: Target AP50 (FCS) vs. training time (V100-hours) for no burn-in, fixed-duration burn-in, and our approach, shown after burn-in and after self-training; (b) Tsrc Source augmentations: weak, strong, both; (c) Btgt/B Batch composition: Target AP50 (FCS) vs. target ratio Btgt/B.]
Fig. 6: Ablating the components of ALDI++. (a) Our proposed burn-in strategy
(Sec. 4) improves AP50F CS by +4.7 and reduces training time by 10x compared to
no burn-in. (b) Strong source-data augmentations during self-training lead to better
performance. (c) An equal ratio of source and target data during self-training leads to
best performance.
References
1. Anonymous: Exponential moving average of weights in deep learning: Dynamics
and benefits. Submitted to Transactions on Machine Learning Research (2023),
https://openreview.net/forum?id=2M9CUnYnBA, under review
2. Arpit, D., Wang, H., Zhou, Y., Xiong, C.: Ensemble of averages: Improving model
selection and boosting performance in domain generalization. Advances in Neural
Information Processing Systems 35, 8265–8277 (2022)
3. Bondi, E., Fang, F., Hamilton, M., Kar, D., Dmello, D., Choi, J., Hannaford,
R., Iyer, A., Joppa, L., Tambe, M., et al.: Spot poachers in action: Augmenting
conservation drones with automatic detection in near real time. In: Proceedings of
the AAAI Conference on Artificial Intelligence. vol. 32 (2018)
4. Cai, Q., Pan, Y., Ngo, C.W., Tian, X., Duan, L., Yao, T.: Exploring object relation
in mean teacher for cross-domain detection. In: Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. pp. 11457–11466 (2019)
5. Cao, S., Joshi, D., Gui, L.Y., Wang, Y.X.: Contrastive mean teacher for domain
adaptive object detectors. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition. pp. 23839–23848 (2023)
6. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin,
A.: Emerging properties in self-supervised vision transformers. In: Proceedings of
the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021)
7. Chen, G., Choi, W., Yu, X., Han, T., Chandraker, M.: Learning efficient object
detection models with knowledge distillation. Advances in neural information pro-
cessing systems 30 (2017)
8. Chen, M., Chen, W., Yang, S., Song, J., Wang, X., Zhang, L., Yan, Y., Qi, D.,
Zhuang, Y., Xie, D., et al.: Learning domain adaptive object detection with prob-
abilistic teacher. arXiv preprint arXiv:2206.06293 (2022)
9. Chen, Y., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Domain adaptive faster
r-cnn for object detection in the wild. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 3339–3348 (2018)
10. Chen, Y., Wang, H., Li, W., Sakaridis, C., Dai, D., Van Gool, L.: Scale-aware
domain adaptive faster r-cnn. International Journal of Computer Vision 129(7),
2223–2243 (2021)
11. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R.,
Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene
understanding. In: Proceedings of the IEEE conference on computer vision and
pattern recognition (2016)
30. Kay, J., Kulits, P., Stathatos, S., Deng, S., Young, E., Beery, S., Van Horn, G.,
Perona, P.: The caltech fish counting dataset: A benchmark for multiple-object
tracking and counting. In: European Conference on Computer Vision. pp. 290–
311. Springer (2022)
31. Kay, J., Stathatos, S., Deng, S., Young, E., Perona, P., Beery, S., Van Horn, G.:
Unsupervised domain adaptation in the real world: A case study in sonar video. In:
NeurIPS 2023 Computational Sustainability: Promises and Pitfalls from Theory to
Deployment (2023)
32. Koh, P.W., Sagawa, S., Marklund, H., Xie, S.M., Zhang, M., Balsubramani, A.,
Hu, W., Yasunaga, M., Phillips, R.L., Gao, I., et al.: Wilds: A benchmark of in-
the-wild distribution shifts. In: International Conference on Machine Learning. pp.
5637–5664. PMLR (2021)
33. Li, Y., Mao, H., Girshick, R., He, K.: Exploring plain vision transformer backbones
for object detection. In: European Conference on Computer Vision. pp. 280–296.
Springer (2022)
34. Li, Y.J., Dai, X., Ma, C.Y., Liu, Y.C., Chen, K., Wu, B., He, Z., Kitani, K.,
Vajda, P.: Cross-domain adaptive teacher for object detection. In: Proceedings
of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp.
7581–7590 (2022)
35. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature
pyramid networks for object detection. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. pp. 2117–2125 (2017)
36. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P.,
Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference
on computer vision. pp. 740–755. Springer (2014)
37. Liu, Y.C., Ma, C.Y., He, Z., Kuo, C.W., Chen, K., Zhang, P., Wu, B., Kira, Z.,
Vajda, P.: Unbiased teacher for semi-supervised object detection. arXiv preprint
arXiv:2102.09480 (2021)
38. Musgrave, K., Belongie, S., Lim, S.N.: Unsupervised domain adaptation: A reality
check. arXiv preprint arXiv:2111.15672 (2021)
39. Musgrave, K., Belongie, S., Lim, S.N.: Benchmarking validation methods for un-
supervised domain adaptation. arXiv preprint arXiv:2208.07360 (2022)
40. Nguyen, D.K., Tseng, W.L., Shuai, H.H.: Domain-adaptive object detection via
uncertainty-aware distribution alignment. In: Proceedings of the 28th ACM inter-
national conference on multimedia. pp. 2499–2507 (2020)
41. Oza, P., Sindagi, V.A., Sharmini, V.V., Patel, V.M.: Unsupervised domain adap-
tation of object detectors: A survey. IEEE Transactions on Pattern Analysis and
Machine Intelligence (2023)
42. Parmar, G., Zhang, R., Zhu, J.Y.: On aliased resizing and surprising subtleties in
gan evaluation. In: Proceedings of the IEEE/CVF Conference on Computer Vision
and Pattern Recognition. pp. 11410–11420 (2022)
43. Pham, M., Cho, M., Joshi, A., Hegde, C.: Revisiting self-distillation. arXiv preprint
arXiv:2206.08491 (2022)
44. Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object de-
tection with region proposal networks. Advances in neural information processing
systems 28 (2015)
45. Reuther, A., Kepner, J., Byun, C., Samsi, S., Arcand, W., Bestor, D., Bergeron,
B., Gadepally, V., Houle, M., Hubbell, M., Jones, M., Klein, A., Milechin, L.,
Mullen, J., Prout, A., Rosa, A., Yee, C., Michaleas, P.: Interactive supercomput-
ing on 40,000 cores for machine learning and data analysis. In: 2018 IEEE High
Performance extreme Computing Conference (HPEC). pp. 1–6. IEEE (2018)
46. Rodriguez, M., Laptev, I., Sivic, J., Audibert, J.Y.: Density-aware person detection
and tracking in crowds. In: 2011 International Conference on Computer Vision. pp.
2423–2430. IEEE (2011)
47. Saito, K., Ushiku, Y., Harada, T., Saenko, K.: Strong-weak distribution alignment
for adaptive object detection. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 6956–6965 (2019)
48. Sakaridis, C., Dai, D., Van Gool, L.: Semantic foggy scene understanding with
synthetic data. International Journal of Computer Vision 126, 973–992 (2018)
49. Schneider, S., Taylor, G.W., Kremer, S.: Deep learning object detection methods
for ecological camera trap data. In: 2018 15th Conference on computer and robot
vision (CRV). pp. 321–328. IEEE (2018)
50. Tarvainen, A., Valpola, H.: Mean teachers are better role models: Weight-averaged
consistency targets improve semi-supervised deep learning results. Advances in
neural information processing systems 30 (2017)
51. Vs, V., Gupta, V., Oza, P., Sindagi, V.A., Patel, V.M.: Mega-cda: Memory guided
attention for category-aware unsupervised domain adaptive object detection. In:
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recog-
nition. pp. 4516–4526 (2021)
52. Weinstein, B.G., Gardner, L., Saccomanno, V., Steinkraus, A., Ortega, A., Brush,
K., Yenni, G., McKellar, A.E., Converse, R., Lippitt, C., et al.: A general deep
learning model for bird detection in high resolution airborne imagery. bioRxiv
(2021)
53. Weinstein, B.G., Graves, S.J., Marconi, S., Singh, A., Zare, A., Stewart, D.,
Bohlman, S.A., White, E.P.: A benchmark dataset for canopy crown detection and
delineation in co-registered airborne rgb, lidar and hyperspectral imagery from
the national ecological observation network. PLoS computational biology 17(7),
e1009180 (2021)
54. Wu, Y., Kirillov, A., Massa, F., Lo, W.Y., Girshick, R.: Detectron2. https://
github.com/facebookresearch/detectron2 (2019)
55. Xue, Z., Yang, F., Rajaraman, S., Zamzmi, G., Antani, S.: Cross dataset analysis
of domain shift in cxr lung region detection. Diagnostics 13(6), 1068 (2023)
56. Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation
using cycle-consistent adversarial networks. In: Proceedings of the IEEE interna-
tional conference on computer vision. pp. 2223–2232 (2017)
57. Zhu, X., Pang, J., Yang, C., Shi, J., Lin, D.: Adapting object detectors via se-
lective cross-domain alignment. In: Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition. pp. 687–696 (2019)
Align and Distill: Supplemental Material
Justin Kay∗1 , Timm Haucke1 , Suzanne Stathatos2 , Siqi Deng†3 , Erik Young4 ,
Pietro Perona2,3 , Sara Beery‡1 , and Grant Van Horn‡5
1 MIT   2 Caltech   3 AWS   4 Skagit Fisheries Enhancement Group   5 UMass Amherst
1 Additional Experiments
(a) Method                    AP50 (FCS)
    Baseline                  51.9
    SADA                      54.2
    Image-level (ours)        55.8
    Instance-level (ours)     54.3
    Image + Instance (ours)   54.9

(b) Method                    AP50 (CS)
    Baseline                  70.8
    Image-level (ours)        71.8
    Instance-level (ours)     73.3
    Image + Instance (ours)   71.5

(c) Method                    AP50 (Channel)
    Baseline                  65.8
    Image-level (ours)        65.2
    Instance-level (ours)     66.0
    Image + Instance (ours)   66.9
We investigate the overlap of source and target data in the feature space of
different methods. For each method, we pool the highest-level feature maps of
the backbone, either globally (“image-level”) or per instance (“instance-level”).
We then embed the pooled feature vectors in 2D space using PCA for visual
[Fig. 1 panels (feature embeddings): Cityscapes / Foggy CS, Sim10k / Cityscapes, Kenai S1–S2 / Channel S1–S3; per-panel statistics: v = 0.91, dF = 6.87; v = 0.87, dF = 0.46; v = 0.89, dF = 0.13; v = 0.77, dF = 4.08.]
Fig. 1: Embedding of pooled features from the final backbone layer in 2D space using
PCA for four different methods. The ratio of variance explained by the first two PCA
components is given by v and a dissimilarity score between source and target features is
given by dF . dF is lower than the baseline for all alignment methods and does roughly
match the overall visual trend in feature overlap. In all cases, the simple MeanTeacher
model significantly reduces the distance between source and target data even though
there is no explicit alignment criterion, even resulting in a smaller dF than adversarial
alignment methods for CS → FCS & CFC Kenai → Channel.
inspection (see Fig. 1). We also compute a dissimilarity score based on FID [24],
by fitting Gaussians to the source and target features and then computing the
Fréchet distance between them.
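The dissimilarity score dF described above can be computed in a few lines of numpy/scipy. This is a sketch; the feature arrays are assumed to be N x D matrices of pooled backbone features.

import numpy as np
from scipy import linalg

def frechet_distance(feats_src: np.ndarray, feats_tgt: np.ndarray) -> float:
    """Fréchet distance between Gaussians fit to source and target features (N x D)."""
    mu_s, mu_t = feats_src.mean(0), feats_tgt.mean(0)
    cov_s = np.cov(feats_src, rowvar=False)
    cov_t = np.cov(feats_tgt, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_s @ cov_t, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu_s - mu_t) ** 2) + np.trace(cov_s + cov_t - 2.0 * covmean))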
Table 2: ViT backbone results for (a) Sim10k → Cityscapes and (b) CFC Kenai →
Channel.

Method                              AP50 (FCS)
Baseline                            51.9
No update (vanilla self-training)   52.9
Student is teacher                  53.8
EMA (mean teacher)                  63.5
2 Implementation Details
2.1 Re-implementations of Other Methods
Here we include additional details regarding our re-implementations of prior work
on top of the ALDI framework. We visualize our implementations in Fig. 3.
Adaptive Teacher [34] Adaptive Teacher (AT) uses the default settings from
the base configuration in Table 2 of the main paper, plus an image-level align-
ment network. For fair reproduction, we used the authors’ alignment network
implementation instead of our own for all AT experiments.
Fig. 2: Effects of fair and modernized comparison between MIC and AT. Here
we show an example of why fair and modern comparisons are necessary for making
principled progress in DAOD. Moving left to right: (1) Published results report a
difference of 3.3 AP50 on Cityscapes → Foggy Cityscapes between the two methods;
(2) However, the authors used different ground-truth test labels, and when this is corrected
we see that the originally-published MIC model actually outperforms the originally-
published AT model; (3) The authors also used different object detection libraries
(Detectron2 for AT and maskrcnn-benchmark for MIC); when we re-implement them
on top of ALDI (still using the VGG-16 backbones proposed in the original papers),
we see that AT significantly outperforms MIC, but (4) These performance differences
are less pronounced when using a modern backbone, indicating that for practical use
there is less difference between these two methods than previously reported.
[Fig. 3: Schematic diagrams of our re-implementations of SADA, PT, UMT, MIC, AT, and ALDI++ in the ALDI framework, showing each method's student/teacher setup (EMA, stop-gradient), augmentations (flip, scale, jitter, crop & pad, cutout, MIC), alignment modules (SADA, image-level, Img2Img), teacher postprocessing (thresholding, NMS, softmax/sum), hard vs. soft distillation, and source/target batch sizes.]
2. A set of anchor boxes that represent the initial candidates for detection.
1. Proposals from the RPN. In training, these are sampled at a desired fore-
ground/background ratio, similar to the procedure used for computing the
loss in the RPN. Note, however, that these will be different proposals than
those used to compute RPN loss. In the Detectron2 defaults, 512 RPN pro-
posals are sampled as inputs to the ROI heads at a foreground ratio of 0.25.
2. Cropped backbone features, extracted using a procedure such as ROIAlign [22].
These are the features in the backbone feature map that are “inside” each
proposal.
1. A multi-class classification.
2. Regression targets for the final bounding box, representing adjustments to
the box to more closely enclose any foreground objects.
Computing the loss. Predicted boxes are matched with ground truth boxes
again based on intersection-over-union in order to compute the loss. By default
we compute a cross-entropy loss for (1) and a smooth L1 loss for (2). (2) is again
only computed for foreground predictions.
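A schematic version of this loss computation follows. It is a sketch under simplifying assumptions: the inputs are already matched per sampled proposal, box deltas are class-agnostic (N, 4) rather than per-class, and bg_class denotes the background class index.

import torch
import torch.nn.functional as F

def roi_heads_loss(cls_logits, box_deltas, gt_classes, gt_deltas, bg_class: int):
    """cls_logits: (N, C+1); box_deltas: (N, 4); gt_classes: (N,); gt_deltas: (N, 4)."""
    # (1) Multi-class classification loss over all sampled proposals.
    loss_cls = F.cross_entropy(cls_logits, gt_classes)
    # (2) Box regression loss, computed only for foreground proposals.
    fg = gt_classes != bg_class
    if fg.any():
        loss_box = F.smooth_l1_loss(box_deltas[fg], gt_deltas[fg])
    else:
        loss_box = box_deltas.sum() * 0.0  # keep the graph connected when there is no foreground
    return loss_cls + loss_box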
(1): ReLU()
(2): AdaptiveAvgPool2d(output_size=1)
(3): Flatten(start_dim=1, end_dim=-1)
(4): Linear(in_features=256,
out_features=1,
bias=True)
)
)
FCDiscriminator(
(model): Sequential(
(0): Flatten(start_dim=1, end_dim=-1)
(1): Linear(in_features=1024,
out_features=1024,
bias=True)
(2): ReLU()
(3): Linear(in_features=1024,
out_features=1,
bias=True)
)
)
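For reference, a module matching the FCDiscriminator printout above could be defined along these lines. This is a reconstruction from the printout only; the input feature size is taken from the printed layers, and the gradient-reversal wiring and adversarial loss are omitted.

import torch.nn as nn

class FCDiscriminator(nn.Module):
    """Fully-connected domain discriminator matching the printout above."""
    def __init__(self, in_features: int = 1024, hidden: int = 1024):
        super().__init__()
        self.model = nn.Sequential(
            nn.Flatten(start_dim=1),
            nn.Linear(in_features, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),   # single logit: source vs. target
        )

    def forward(self, x):
        return self.model(x)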
3 Experiment Details
Fig. 4: Exemplary results of our CycleGAN models. Source and target are the
original images. Target-like and source-like are images translated by CycleGAN. Since
FCS is derived from CS, CS → FCS is the only case in which we have paired images
and can therefore show the translation from source into target-like and from target
into source-like for the same example.
Fréchet inception distance (FID) [24] between the source & source-like and the
target & target-like images in the training dataset. For FID computation, we use
the clean-fid implementation proposed in [42]. We compute FID on the training
datasets as UMT only uses translated images thereof, which is why we are only
interested in the best fit on the training data. We follow [12] by generating the
source-like and target-like datasets ahead of time using the selected model, before
training the main domain adaptation method. We note that tuning CycleGAN's
hyperparameters or using other image-to-image translation methods could possibly
improve UMT's performance; however, for fair reproduction we use the defaults.
We show some exemplary results of our CycleGAN models used to train UMT [12]
in Fig. 4.
Like other DAOD benchmarks, CFC-DAOD consists of data from two domains,
source and target.
Table 4: DAOD codebase comparison.

Method        Faster R-CNN Implementation   Codebase LOC
UMT [12]      faster-rcnn.pytorch           19k
SADA [10]     maskrcnn-benchmark            7k
PT [8]        Detectron2 v0.5               3.4k
MIC [26]      maskrcnn-benchmark            20k
AT [34]       Detectron2 v0.3               4k
ALDI (Ours)   Detectron2 ∼v0.7              1.5k
We designed the ALDI codebase to be lightweight and extensible. For this reason,
we build on top of a recent version of Detectron2 [54]. The last tagged release
of Detectron2 was v0.6 in November 2021; however, there have been a number of
upgrades since then leading to state-of-the-art performance. Thus, we use a fixed
version that we call v0.7ish, based on an unofficial pull request for v0.7
(commit 7755101, dated August 30, 2023). We include this version of Detectron2
as a pip-installable submodule in the ALDI codebase for now, noting that once
the official version is released it will no longer need to be a submodule (i.e. it
will be able to be directly installed through pip without cloning any code).
Our codebase makes no modifications to the underlying Detectron2 code,
making it a lightweight standalone framework. This is in contrast to existing
DAOD codebases (see Tab. 4) that often duplicate and modify the underlying
framework as part of their implementation. By building on top of Detectron2
rather than within it, our codebase is up to 13x smaller than other DAOD
codebases while providing more functionality. We note that in Tab. 4, other
codebases implement a single method while ours supports all methods studied.
5.2 Speedups
– Converting tensors back and forth between torch, numpy, and PIL during
augmentation. We addressed this by reimplementing transforms as needed so
that everything stays in torch.
– Using the random hue transform from torchvision. We found minimal changes
in performance from disabling this component of the ColorJitter transform.
– Using separate dataloaders for weakly and strongly augmented imagery. We
instead use a single dataloader per domain, with a hook to retrieve weakly
augmented imagery before strong augmentations are performed.
We reimplemented the dataloaders and augmentation strategies used by AT,
MIC, and others to be more efficient, leading to a 5x speedup in training time
per image compared to AT.
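The single-dataloader scheme described above can be approximated with a dataset wrapper like the following. This is a simplified sketch, not the ALDI hook-based implementation; base_dataset is assumed to yield image tensors and weak/strong are transform callables.

from torch.utils.data import Dataset

class WeakStrongDataset(Dataset):
    """Return weakly and strongly augmented views of each image from one dataloader pass.

    The strong view is built on top of the weak view, so the weakly augmented
    image (used as teacher input) is obtained without a second dataloader.
    """
    def __init__(self, base_dataset, weak, strong):
        self.base, self.weak, self.strong = base_dataset, weak, strong

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img = self.base[idx]
        weak_img = self.weak(img)
        return weak_img, self.strong(weak_img)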