

CrossMatch: Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training

Yifang Yin1, Wenmiao Hu2,4, Zhenguang Liu3*, Guanfeng Wang4*, Shili Xiang1, Roger Zimmermann2
1 Institute for Infocomm Research, A*STAR   2 National University of Singapore
3 Zhejiang Gongshang University   4 Grabtaxi Holdings Pte. Ltd.
{yin yifang, sxiang}@i2r.a-star.edu.sg, [email protected], [email protected], [email protected], [email protected]

* The corresponding authors.

Abstract

Source-free domain adaptive semantic segmentation has gained increasing attention recently. It eases the requirement of full access to the source domain by transferring knowledge only from a well-trained source model. However, reducing the uncertainty of the target pseudo labels becomes inevitably more challenging without the supervision of the labeled source data. In this work, we propose a novel asymmetric two-stream architecture that learns more robustly from noisy pseudo labels. Our approach simultaneously conducts dual-head pseudo label denoising and cross-modal consistency regularization. Towards the former, we introduce a multimodal auxiliary network during training (and discard it during inference), which effectively enhances the pseudo labels' correctness by leveraging the guidance from the depth information. Towards the latter, we enforce a new cross-modal pixel-wise consistency between the predictions of the two streams, encouraging our model to behave smoothly for both modality variance and image perturbations. It serves as an effective regularization to further reduce the impact of the inaccurate pseudo labels in source-free unsupervised domain adaptation. Experiments on GTA5 → Cityscapes and SYNTHIA → Cityscapes benchmarks demonstrate the superiority of our proposed method, obtaining the new state-of-the-art mIoU of 57.7% and 57.5%, respectively.

Figure 1. Comparison of our proposed framework with existing depth-aware semantic segmentation models. (a) Prior art mostly adopts a multitask learning framework by adding depth estimation as an auxiliary task. (b) We introduce a multimodal auxiliary network that takes depth modality as an additional input for effective pseudo label denoising and consistency regularization.

1. Introduction

Semantic segmentation predicts pixel-level category labels for given scenes. Although deep neural networks have been widely adopted, attaining state-of-the-art performance relies mainly on the assumption that the training and testing data follow the same distribution [62, 32, 33]. This assumption is impractical as target scenarios often exhibit a distribution shift, e.g., street scenes collected under a cross-city [11] or cross-weather [44] environment. Unsupervised domain adaptation (UDA) techniques have been proposed to address the domain shift problem, which aim at transferring the knowledge learned from a labeled source domain to an unlabeled target domain [48, 50, 69, 67]. However, one major limitation of such UDA approaches lies in the requirement for full access to the source dataset. In practice, the source data may be restricted from being shared due to proprietary, privacy, or profit related concerns [26].

To cope with data sharing restrictions, recent efforts have investigated source-free domain adaptation, which transfers knowledge from a well-trained source model (rather than from the source data itself) to an unlabeled target domain [39, 31]. Early solutions introduce a generator to estimate the source domain based on the pre-trained source model [31], which can be used to generate fake source samples for supervision as in typical UDA.
However, due to the lack of supervision from the real source domain, advanced techniques designed for typical UDA, such as depth-aware semantic segmentation and pseudo label denoising methods, may work less satisfactorily in a source-free setting.

With the above insights, we propose a novel two-stream segmentation network for source-free UDA. As shown in Figure 1 (a), existing depth-aware semantic segmentation for typical UDA mainly adopts a multitask learning framework where depth estimation is modeled as an auxiliary task [51, 53]. However, we observe through experiments that the regularization induced by the auxiliary task is quite limited for source-free UDA due to the lack of ground-truth semantic labels. It cannot effectively prevent the main segmentation network from overfitting to the incorrect over-confident pseudo labels of the target images. To solve this problem, we alternatively propose a multimodal auxiliary network, as shown in Figure 1 (b), which takes the depth information and the intermediate representations generated by the main stream image encoder as the input. We train both the main and the auxiliary streams on the segmentation task via self-training, and formulate an explicit cross-modal consistency loss between the output of the two streams for effective regularization. The benefits of our proposed segmentation network are threefold:

First, our inference-stage model consists of the main stream only, which is a unimodal model that infers from RGB images the same way as existing models. Second, the asymmetric design of our neural network introduces modality variance in addition to the typical input perturbations produced by data augmentation, dropouts, etc. On one hand, the auxiliary network better rectifies the pseudo labels with multimodal knowledge expansion [61]. On the other hand, the cross-modal consistency effectively transfers the knowledge learned from the multimodal auxiliary network to the unimodal main network. Third, our proposed framework has better feasibility compared to existing depth-aware UDA as ours only requires the depth information in the target domain. Without annotation cost, the depth information can be easily learned from video sequences or stereo images based on self-supervised depth estimation models [17, 71, 53]. Here we summarize our contributions as follows:

• We propose a novel source-free UDA framework by introducing a multimodal auxiliary network. It models the correlations between depth and semantics, and can be discarded completely at inference time.

• We enforce a cross-modal consistency between the predictions of the main and auxiliary streams with dual-head pseudo label denoising, to reduce the impact of inaccurate pseudo labels in source-free UDA.

• Our proposed method outperforms the prior art by a significant margin, obtaining an mIoU of 57.7% and 57.5% on the Cityscapes dataset when adapting from the GTA5 and SYNTHIA benchmarks, respectively.

2. Related Work

Unsupervised domain adaptation. Unsupervised domain adaptation (UDA) aims to improve a model's performance on an unlabeled target domain by leveraging the features extracted from a labeled source domain [62]. Early works adopted adversarial training [18] to reduce the distribution mismatch between different domains [36, 15, 48, 50]. Efforts have been made on aligning the distributions at either the image level [21, 57], the intermediate feature level [11, 10] or the output level [48, 50]. Some recent attempts align the distributions in a class-wise manner in order to obtain a fine-grained feature alignment [36, 15]. However, these methods rely on cumbersome adversarial training that requires access to the source data.

UDA via self-training. Pseudo label refinement under a self-training framework has achieved competitive results in the field of UDA for semantic segmentation [30, 68, 70, 23]. Early methods selected highly confident predictions as pseudo labels based on a confidence threshold [73, 72]. To improve the robustness of the pseudo labels, efforts have been made on prediction ensembling [6, 63], pseudo label denoising [37, 28, 45, 67], training sample re-weighting [69], augmentation consistency [1, 38], leveraging high-resolution images [24], and pixel-level contrastive learning [58]. However, these approaches also rely on the source-target co-existence to retain task-specific source knowledge with self-training.

Source-free UDA. Kundu et al. [26] focused on source model generalization and developed a multi-head framework trained by extending the source data with diverse data augmentations. Teja and Fleuret [39] focused on target domain adaptation and proposed to reduce the prediction uncertainty by feature corruption with entropy regularization. Liu et al. [31] leveraged a generator to estimate the source data distribution, based on which fake samples were synthesized for training. Qiu et al. [40] proposed to generate per-class prototypes based on a source prototype generator, which is used to align the pseudo-labeled target data based on contrastive learning. To the best of our knowledge, the prior approaches [64, 66] all focused on unimodal models. Inspired by existing work on cross-modal modeling between image features and acoustic clues [65], edge maps [34], or LiDAR points [25] in different applications, we develop a new cross-modal pseudo label denoising network for depth-aware source-free UDA.

Depth-aware UDA. Motivated by multitask learning, depth estimation has been adopted as an auxiliary task to improve UDA for semantic segmentation [49, 9, 43, 3, 22].
The labels for depth estimation are mostly derived by self-supervised models using stereo pairs [16, 17] or video sequences [71]. The correlations between depth and semantics are next modeled by attention-based feature fusion [51, 53]. The depth distribution in different categories can be utilized to further reduce the domain gap [56]. However, these methods rely on the access to the source domain and assume the source and target images are available in stereo pairs or video sequences.

Figure 2. Illustration of our proposed two-stream segmentation network for source-free UDA.

3. Problem Formulation

Efforts on source-free domain adaptive semantic segmentation can be divided into 1) vendor-side domain generalization, and 2) client-side domain adaptation [26]. The vendor and the client have access to the labeled source and the unlabeled target datasets, respectively. The goal of the vendor is to train a source model with good generalization ability to unseen domains [27]. This trained source model is next passed to the client to be adapted to the unlabeled target domain via self-training [39, 31].

In this work, we propose to improve client-side domain adaptation by leveraging depth information as the auxiliary modality. Let X = \{(x_i, d_i)\}_{i=1}^{n} denote the target dataset, where (x_i, d_i) represent the RGB and the depth modality of the i-th sample, respectively. Our goal is to adapt a unimodal source model h_s(x) to a unimodal target model h_t(x) more robustly via a multimodal auxiliary network.

To achieve this goal, we present a novel two-stream neural network with a main stream and an auxiliary stream that perform semantic segmentation based on RGB and RGB-D modalities, respectively. Facilitated by the depth modality, pseudo labels obtained from the source model can be better rectified, leading to improved source-free UDA performance. Moreover, the auxiliary stream is only required during training, and will be discarded at inference time. Thus, our inference-stage model shares the same network architecture (e.g., DeepLabv2 [4]) but obtains improved segmentation results compared to the prior art.

4. Approach

We follow the pseudo-label based self-training strategies to train our source-free UDA model [26]. Target samples are passed through the source model to generate a set of pseudo labels that are used to supervise the network. One main challenge in a self-training framework is reducing the uncertainty of the pseudo labels for the target images. To tackle this challenge, we propose to denoise the offline target pseudo labels with online cross-modal consistency training. Next, we introduce the technical details of our proposed framework.

4.1. Two-stream Segmentation Network

The overall architecture of our proposed asymmetric two-stream segmentation network is shown in Figure 2. The main stream is unimodal, which takes RGB images as the only input, and can be implemented by any of the existing segmentation models such as DeepLabv2. The auxiliary stream is multimodal, which ingests depth and the intermediate features generated by the main stream image encoder to exploit the correlations between the depth and semantic information. To achieve this, we build upon the Separation-and-Aggregation Gate (SA-Gate) [8] and present a single-sided SA-Gate, termed SSA-Gate, which is placed after each of the encoder blocks. Formally, let F^{in}_{img} and F^{in}_{aux} denote the input features of the SSA-Gate from the main and auxiliary streams, respectively. SSA-Gate first recalibrates the input features with the help from the other modality by

F^{rec}_{img} = F^{in}_{img} + \text{Attn}_a(F^{in}_{img} \| F^{in}_{aux}) \circledast F^{in}_{aux},
F^{rec}_{aux} = F^{in}_{aux} + \text{Attn}_i(F^{in}_{img} \| F^{in}_{aux}) \circledast F^{in}_{img},   (1)

where F^{in}_{img} \| F^{in}_{aux} is the concatenation of the input features along the channel dimension. \text{Attn}_a and \text{Attn}_i compute the channel-wise attention for F^{in}_{aux} and F^{in}_{img}, respectively, and \circledast denotes the channel-wise multiplication.
Next, SSA-Gate merges the features from the two streams based on the spatial-wise gates proposed in [8]. Let F_{mrg} denote the merged feature; SSA-Gate updates the feature of the auxiliary stream as F^{out}_{aux} = 0.5 \cdot (F^{in}_{aux} + F_{mrg}) and keeps the feature in the main stream unchanged. With known camera parameters, we follow prior work [5, 8] and extract the HHA representation, which encodes the depth image with three channels of horizontal disparity, height above ground, and the angle of the pixel's local surface normal, as the input of our target network [19]. According to previous studies [5, 8], the HHA representation is more effective for semantic segmentation tasks. Alternatively, the 1-channel disparity maps can be directly used as the input to our framework if the camera parameters are not available.
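As a concrete illustration of the recalibration in Eq. 1 and the gated merge described above, the following is a minimal PyTorch sketch of an SSA-Gate-style block. The module and variable names are ours, and the channel-attention and spatial-gate sub-networks are simplified stand-ins for the corresponding SA-Gate components of [8], not the exact layers used in the paper.

```python
import torch
import torch.nn as nn


class SSAGateSketch(nn.Module):
    """Minimal sketch of a single-sided SA-Gate-style block (names are illustrative).

    It recalibrates each stream with channel attention computed from the concatenated
    features (Eq. 1), merges the two streams with a spatial gate, and updates only the
    auxiliary stream, leaving the main stream unchanged.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Channel-wise attention for each direction, computed from the concatenation.
        self.attn_a = nn.Sequential(  # attends to the auxiliary (depth) features
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.attn_i = nn.Sequential(  # attends to the image features
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        # Spatial gate producing per-pixel mixing weights for the merge step.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(2 * channels, 2, kernel_size=3, padding=1), nn.Softmax(dim=1))

    def forward(self, f_img: torch.Tensor, f_aux: torch.Tensor):
        cat = torch.cat([f_img, f_aux], dim=1)
        # Eq. 1: recalibrate each stream with channel attention over the other modality.
        f_img_rec = f_img + self.attn_a(cat) * f_aux
        f_aux_rec = f_aux + self.attn_i(cat) * f_img
        # Spatial-wise merge of the two recalibrated streams.
        gate = self.spatial_gate(torch.cat([f_img_rec, f_aux_rec], dim=1))
        f_mrg = gate[:, 0:1] * f_img_rec + gate[:, 1:2] * f_aux_rec
        # Only the auxiliary stream is updated; the main stream passes through unchanged.
        f_aux_out = 0.5 * (f_aux + f_mrg)
        return f_img, f_aux_out
```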
4.2. Dual-head Pseudo Label Denoising with Cross-modal Consistency Regularization

Given a target sample (x, d), we use f_{img}(x) and f_{aux}(x, d) to denote the features extracted by the main and auxiliary streams as shown in Figure 2. The extracted features are next passed to the respective classifiers g_{img} and g_{aux} to obtain the predictions p_{img} and p_{aux}. A mean-teacher model [47] is maintained whose parameters are updated as the exponential moving average of the parameters of the target network. This is used to generate more reliable online pseudo labels, denoted as \tilde{p}_{img} and \tilde{p}_{aux}. Offline pseudo labels are generated using the source model based on RGB images only, i.e., p_s = h_s(x). Next, we will introduce how to formulate the objectives to optimize our proposed framework.
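The mean-teacher update itself is a simple exponential moving average of the target network's parameters [47]. Below is a minimal sketch; the momentum value is an illustrative assumption rather than a setting reported in the paper.

```python
import copy
import torch


@torch.no_grad()
def update_mean_teacher(student: torch.nn.Module, teacher: torch.nn.Module,
                        momentum: float = 0.999):
    """EMA update: teacher <- momentum * teacher + (1 - momentum) * student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s.detach(), alpha=1.0 - momentum)
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.copy_(b_s)


# Typical setup: the teacher starts as a copy of the student and is never trained directly.
# teacher = copy.deepcopy(student)
# for p in teacher.parameters():
#     p.requires_grad_(False)
```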

4.2.1 Cross-modal Consistency Training

Consistency regularization is a popular and essential technique in semi-supervised learning [60, 46]. Based on the model smoothness assumption, model predictions should be constrained to be invariant to small perturbations of either inputs or model hidden states [38], which can be introduced by data augmentation, dropouts, etc. To prevent the target model from overfitting to the noisy pseudo labels, we present a new cross-modal consistency regularization loss that works effectively with pseudo labeling in source-free UDA. The predictions for pixels with low-confidence pseudo labels tend to be more sensitive to input perturbations [69]. Thus, the impact of the noise in pseudo labels can be significantly reduced by enforcing a consistency regularization between the predictions of the two streams.

Given an unlabeled target image x, we pass it through the source model to generate the soft pseudo labels p_s^{(i,k)}. The hard pseudo labels \hat{y}^{(i,k)} are computed as

\hat{y}^{(i,k)} = \begin{cases} 1, & \text{if } k = \arg\max_{k'} p_s^{(i,k')} \\ 0, & \text{otherwise} \end{cases}   (2)

where p_s^{(i,k)} represents the softmax probability of pixel x^{(i)} belonging to the k-th class. Thereafter, the classification loss can be computed based on \hat{y}^{(i,k)} as

\ell_{cla} = \ell_{ce}(\hat{y}, p_{img}) + \ell_{ce}(\hat{y}, p_{aux})   (3)

where \ell_{ce}(\hat{y}, p) = -\sum_{i=1}^{H \times W} \sum_{k=1}^{K} \hat{y}^{(i,k)} \log p^{(i,k)} is the cross-entropy loss. p_{img} and p_{aux} are the predicted outputs of the main and auxiliary streams, respectively. In addition to the pseudo labeling, we introduce a cross-modal consistency loss to regularize the output between the two streams. The goal is to reduce the impact of inaccurate pseudo labels, and this consistency loss is formulated as

\ell_{reg} = D_{kl}(\tilde{p}_{aux} \| p_{img}) + D_{kl}(\tilde{p}_{img} \| p_{aux})   (4)

where \tilde{p}_{img} and \tilde{p}_{aux} are the predicted outputs of the mean-teacher model, and D_{kl}(\tilde{p}_{aux} \| p_{img}) = -\sum_{i=1}^{H \times W} \tilde{p}_{aux}^{(i)} \log (p_{img}^{(i)} / \tilde{p}_{aux}^{(i)}) is the Kullback-Leibler (KL) divergence. We perturb the input based on strong and weak augmentations, and feed them to the target network and its mean-teacher model, respectively. Since \tilde{p}_{img} and \tilde{p}_{aux} are generated based on weak augmented views, they are more reliable. They thus can be used as online soft pseudo labels to regularize the predictions p_{img} and p_{aux} inferred over the strong augmented views.

In addition to data augmentations, recall that p_{img} = g_{img}(f_{img}(x)) and p_{aux} = g_{aux}(f_{aux}(x, d)) also predict based on different input modalities. Therefore, our proposed regularization loss enforces that the target network gives consistent predictions not only for small perturbations but also over cross-modal views.
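A compact sketch of how the offline hard pseudo labels (Eq. 2), the dual-head classification loss (Eq. 3), and the cross-modal consistency term (Eq. 4) could be written in PyTorch is given below. The tensor layouts, the reduction scheme, and the use of raw logits for the student heads are our assumptions, not details fixed by the paper.

```python
import torch
import torch.nn.functional as F


def hard_pseudo_labels(p_s: torch.Tensor) -> torch.Tensor:
    """Eq. 2: hard labels from the source model's soft predictions p_s of shape (B, K, H, W)."""
    return p_s.argmax(dim=1)  # (B, H, W) class indices, the index form of the one-hot y_hat


def classification_loss(y_hat, img_logits, aux_logits):
    """Eq. 3: cross-entropy of both heads against the same offline pseudo labels."""
    return F.cross_entropy(img_logits, y_hat) + F.cross_entropy(aux_logits, y_hat)


def cross_modal_consistency(img_logits, aux_logits, p_img_teacher, p_aux_teacher):
    """Eq. 4: D_kl(p_tilde_aux || p_img) + D_kl(p_tilde_img || p_aux).

    The teacher probabilities come from weakly augmented views of (x, d); the student
    logits come from strongly augmented views of the same sample.
    """
    kl_1 = F.kl_div(F.log_softmax(img_logits, dim=1), p_aux_teacher.detach(),
                    reduction="batchmean")
    kl_2 = F.kl_div(F.log_softmax(aux_logits, dim=1), p_img_teacher.detach(),
                    reduction="batchmean")
    return kl_1 + kl_2
```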
4.2.2 Dual-head Pseudo Label Denoising

Though the pseudo labels p_s generated by the source model can be directly used to train the target network, rectifying p_s from a parallel aspect to consistency training will gain additional benefits. To this end, we adapt a recent state-of-the-art prototypical pseudo label denoising method [67] to our framework. This approach fixes p_s and rectifies p_s based on class-wise dynamic weights \omega as

\hat{p}_s^{(i,k)} = \frac{\exp(\omega^{(i,k)} \cdot p_s^{(i,k)})}{\sum_{k'=1}^{K} \exp(\omega^{(i,k')} \cdot p_s^{(i,k')})}   (5)

where p_s^{(i,k)} and \hat{p}_s^{(i,k)} represent the softmax probability of pixel x^{(i)} belonging to the k-th class before and after denoising. We perform prototypical pseudo label denoising for the main and the auxiliary streams separately. Take the main stream as an example, let f_{img}(x)^{(i)} represent the feature at pixel i. The weights \omega_{img} are updated in each training epoch based on the feature distance to the class prototypes by

\omega_{img}^{(i,k)} = \frac{\exp(-\| \tilde{f}_{img}(x)^{(i)} - \eta^{(k)}_{img} \| / \tau)}{\sum_{k'=1}^{K} \exp(-\| \tilde{f}_{img}(x)^{(i)} - \eta^{(k')}_{img} \| / \tau)}   (6)

where \eta^{(k)}_{img} is the prototype (i.e., the feature centroid) of class k in the main stream. We use \tilde{f}_{img} (i.e., the image encoder in the mean-teacher model) instead of f_{img}, as we desire a more reliable feature estimation for the input sample. \tau is the softmax temperature empirically set to 1. Similarly, we maintain class prototypes \eta^{(k)}_{aux} for the auxiliary stream, compute \omega_{aux} based on \tilde{f}_{aux}(x, d) and \eta^{(k)}_{aux}, and correct p_s based on \omega_{aux} using Eq. 5. The classification loss can then be computed based on the rectified pseudo labels \hat{y}_{img} and \hat{y}_{aux}, which are more accurate than \hat{y}.
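The prototypical rectification of Eqs. 5 and 6 can be sketched as follows. The flattened (pixels × channels) layout is our assumption, and the prototype bookkeeping follows [67] rather than this simplified illustration.

```python
import torch
import torch.nn.functional as F


def denoising_weights(feats: torch.Tensor, prototypes: torch.Tensor, tau: float = 1.0):
    """Eq. 6: per-pixel, per-class weights from the distance to the class prototypes.

    feats:      (N, C) mean-teacher features, one row per pixel (f_tilde)
    prototypes: (K, C) feature centroid eta of each class
    returns:    (N, K) softmax over negative distances with temperature tau
    """
    dist = torch.cdist(feats, prototypes)      # (N, K) Euclidean distances
    return F.softmax(-dist / tau, dim=1)


def rectify_pseudo_labels(p_s: torch.Tensor, omega: torch.Tensor):
    """Eq. 5: re-weight the fixed source probabilities p_s (N, K) and renormalise."""
    return F.softmax(omega * p_s, dim=1)


# The same two steps are run twice per image: once with the main-stream features and
# prototypes (giving omega_img) and once with the auxiliary-stream ones (omega_aux),
# producing the rectified label maps y_hat_img and y_hat_aux used by the classification loss.
```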
4.2.3 Optimization

We perform two rounds of self-training to optimize our proposed two-stream segmentation network. In both stages, we formulate the overall loss as a linear combination of the classification loss and the regularization loss

\ell^{stg} = \ell^{stg}_{cla} + \gamma \ell_{reg}   (7)

where the superscript stg ∈ {1, 2} distinguishes the loss computed in stage 1 or stage 2. \gamma is a balancing coefficient that controls the weight of the regularization loss. We empirically set \gamma = 1 in our experiments. We train the same two-stream segmentation model with the same cross-modal consistency loss as the regularization for self-training. The only difference between the two stages is how we compute the hard pseudo labels and the classification loss.

Stage one. The source model extracts the pseudo labels for the target images in the first stage. As the source model was trained on the labeled source data, the uncertainty in the pseudo labels for target images is high. Thus, applying pseudo label denoising techniques is beneficial, based on which a more robust classification loss can be computed. In our implementation, we compute the symmetric cross-entropy (SCE) [54] based on \hat{y}_{img} and \hat{y}_{aux} as

\ell^{1}_{cla} = \ell_{sce}(\hat{y}_{img}, p_{img}) + \ell_{sce}(\hat{y}_{aux}, p_{aux})   (8)

where p_{img} and p_{aux} are the predicted outputs of the main and auxiliary streams, \hat{y}_{img} and \hat{y}_{aux} are the hard pseudo labels denoised by \omega_{img} and \omega_{aux}, and \ell_{sce}(\hat{y}, p) = \alpha \ell_{ce}(p, \hat{y}) + \beta \ell_{ce}(\hat{y}, p). Following previous work [67], we set the balancing coefficients \alpha and \beta to 0.1 and 1.

Stage two. The pseudo labels for the target images are extracted by our learned target model in the first stage, which are derived from the fusion of the two streams: \hat{y} = \frac{1}{2}(p_{img} + p_{aux}). No advanced denoising methods are required in this stage as the quality of the pseudo labels is already relatively high. We compute the classification loss using Eq. 3 as \ell^{2}_{cla} = \ell_{ce}(\hat{y}, p_{img}) + \ell_{ce}(\hat{y}, p_{aux}). This stage is usually referred to as self-distillation, which has been successfully applied to typical UDA to boost a model's performance [67, 26]. Here we show that with our proposed cross-modal consistency training, one or more rounds of self-distillation can also bring substantial performance gain to source-free UDA.

4.3. Test-time Inference

Considering that the depth information may not always be available during test-time inference, we discard the multimodal auxiliary network and keep only the main stream as our inference-stage model. The reasons behind this are twofold. First, it improves the feasibility of our model as the main stream takes the RGB image as the only input. Second, we observe that the multimodal auxiliary stream only marginally outperforms the main stream after the model converges. Therefore, the accuracy loss as a trade-off for model feasibility is relatively slim. Formally, given a test image x, we compute its pixel-level semantic labels as p_{img} = g_{img}(f_{img}(x)).
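The two-stage objective of Eq. 7, with the symmetric cross-entropy of Eq. 8 in stage one and the fused pseudo labels in stage two, could be assembled as in the sketch below. The one-hot encoding, the clamping constant, and the hard form of the fused stage-two labels are our assumptions.

```python
import torch
import torch.nn.functional as F


def symmetric_cross_entropy(logits, target_onehot, alpha=0.1, beta=1.0, eps=1e-4):
    """Eq. 8 building block: l_sce(y_hat, p) = alpha * l_ce(p, y_hat) + beta * l_ce(y_hat, p).

    logits:        (B, K, H, W) raw scores of one segmentation head
    target_onehot: (B, K, H, W) one-hot rectified pseudo labels (y_hat_img or y_hat_aux)
    """
    log_p = F.log_softmax(logits, dim=1)
    ce = -(target_onehot * log_p).sum(dim=1).mean()                                # l_ce(y_hat, p)
    rce = -(log_p.exp() * target_onehot.clamp(min=eps).log()).sum(dim=1).mean()    # l_ce(p, y_hat)
    return alpha * rce + beta * ce


def stage_loss(stage, img_logits, aux_logits, reg, y_img=None, y_aux=None,
               y_fused=None, gamma=1.0):
    """Eq. 7: l_stg = l_cla^stg + gamma * l_reg, with l_cla depending on the stage."""
    if stage == 1:
        # Stage one: SCE against the pseudo labels denoised per stream (Eq. 8).
        cla = (symmetric_cross_entropy(img_logits, y_img)
               + symmetric_cross_entropy(aux_logits, y_aux))
    else:
        # Stage two: plain CE against labels fused from the stage-one model, here taken
        # as the argmax of 0.5 * (p_img + p_aux) (an assumption on the exact form).
        cla = F.cross_entropy(img_logits, y_fused) + F.cross_entropy(aux_logits, y_fused)
    return cla + gamma * reg
```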
5. Experiments

5.1. Experimental Settings

Dataset. We evaluate our proposed method by adapting from the game scenes GTA5 [41] and SYNTHIA [42] to the real scenes Cityscapes [12]. The Cityscapes dataset contains 2,975 training and 500 validation images with a resolution of 2048 × 1024. For depth, we use the disparity maps provided by the official Cityscapes dataset by default. In the ablation study, we also evaluate our method with self-supervised stereoscopic depth [44, 53] and monocular depth [55], which were trained on the stereo images and video sequences in the Cityscapes training set, respectively.

Evaluation metric. We report the Intersection over Union (IoU) on the 19 common categories shared by GTA5 and Cityscapes and the 16 common categories shared by SYNTHIA and Cityscapes. Following previous studies, we also report the results on 13 of the 16 common categories shared by the SYNTHIA and Cityscapes datasets.

Implementation details. For the source-only model, we adopt the pre-trained models on GTA5 and SYNTHIA provided by Kundu et al. [26]. Both the source model and our target model use DeepLabv2 [4] for segmentation with ResNet-101 [20] as the backbone. We insert four SSA-Gates, one after each of the four encoder blocks in ResNet-101. We train our model using the SGD solver with a momentum of 0.9 and weight decay of 2 × 10^{-4}. We use a mini-batch size of 4 and an initial learning rate of 6 × 10^{-4}. Following [67], we set the parameters for the prototypical pseudo label denoising \alpha, \beta, and \tau to 0.1, 1, and 1, respectively. We conduct an ablation study on the balancing coefficient \gamma in Eq. 7 and set \gamma = 1 in the rest of the experiments. For consistency regularization, we employ random crop as the weak augmentation and apply RandAugment [13] and Cutout [14] in addition to random crop as the strong augmentation. As the class prototypes are required for pseudo label denoising, we first train our target model on the pseudo labels generated by the source model before denoising as a warm-up. Next, we initialize the class prototypes with the learned warm-up model and continue optimizing it based on Eq. 7 for 60 epochs. In the warm-up stage, we choose the top 33% of the most confident predictions per class over the entire training set to select balanced and reliable hard pseudo labels [30, 26].

Table 1. Per-class IoU (%) and mIoU (%) comparison of GTA5 → Cityscapes adaptation. The best score for each column is highlighted.
Method SF road sidewalk building wall fence pole light sign vege. terrain sky person rider car truck bus train motor bike mIoU
FADA [52] ✗ 91.0 50.6 86.0 43.4 29.8 36.8 43.4 25.0 86.8 38.3 87.4 64.0 38.0 85.2 31.6 46.1 6.5 25.4 37.1 50.1
CAG-UDA [68] ✗ 90.4 51.6 83.8 34.2 27.8 38.4 25.3 48.4 85.4 38.2 78.1 58.6 34.6 84.7 21.9 42.7 41.1 29.3 37.2 50.2
Seg-Uncertainty [69] ✗ 90.4 31.2 85.1 36.9 25.6 37.5 48.8 48.5 85.3 34.8 81.1 64.4 36.8 86.3 34.9 52.2 1.7 29.0 44.6 50.3
IAST [37] ✗ 94.1 58.8 85.4 39.7 29.2 25.1 43.1 34.2 84.8 34.6 88.7 62.7 30.3 87.6 42.3 50.3 24.7 35.2 40.2 52.2
CorDA [53] ✗ 94.7 63.1 87.6 30.7 40.6 40.2 47.8 51.6 87.6 47.0 89.7 66.7 35.9 90.2 48.9 57.5 0.0 39.8 56.0 56.6
ProDA [67] ✗ 87.8 56.0 79.7 46.3 44.8 45.6 53.5 53.5 88.6 45.2 82.1 70.7 39.2 88.8 45.5 59.4 1.0 48.9 56.4 57.5
EHTDI [29] ✗ 95.4 68.8 88.1 37.1 41.4 42.5 45.7 60.4 87.3 42.6 86.8 67.4 38.6 90.5 66.7 61.4 0.3 39.4 56.1 58.8
BiSMAP [35] ✗ 89.2 54.9 84.4 44.1 39.3 41.6 53.9 53.5 88.4 45.1 82.3 69.4 41.8 90.4 56.4 68.8 51.2 47.8 60.4 61.2
SFDA [31] ✓ 84.2 39.2 82.7 27.5 22.1 25.9 31.1 21.9 82.4 30.5 85.3 58.7 22.1 80.0 33.1 31.5 3.6 27.8 30.6 43.2
URMA [39] ✓ 92.3 55.2 81.6 30.8 18.8 37.1 17.7 12.1 84.2 35.9 83.8 57.7 24.1 81.7 27.5 44.3 6.9 24.1 40.4 45.1
LD [66] ✓ 91.6 53.2 80.6 36.6 14.2 26.4 31.6 22.7 83.1 42.1 79.3 57.3 26.6 82.1 41.0 50.1 0.3 25.9 19.5 45.5
SRDA [2] ✓ 90.5 47.1 82.8 32.8 28.0 29.9 35.9 34.8 83.3 39.7 76.1 57.3 23.6 79.5 30.7 40.2 0.0 26.6 30.9 45.8
SFUDA [64] ✓ 95.2 40.6 85.2 30.6 26.1 35.8 34.7 32.8 85.3 41.7 79.5 61.0 28.2 86.5 41.2 45.3 15.6 33.1 40.0 49.4
GtA w/o cPAE [26] ✓ 90.9 48.6 85.5 35.3 31.7 36.9 34.7 34.8 86.2 47.8 88.5 61.7 32.6 85.9 46.9 50.4 0.0 38.9 52.4 51.6
GtA w/ cPAE [26] ✓ 91.7 53.4 86.1 37.6 32.1 37.4 38.2 35.6 86.7 48.5 89.9 62.6 34.3 87.2 51.0 50.8 4.2 42.7 53.9 53.4
Ours ✓ 93.0 60.4 87.2 46.4 41.4 38.0 45.1 51.5 87.5 48.6 83.7 63.2 31.8 88.6 49.5 60.3 0.0 47.1 47.8 56.4
Ours w/ distillation ✓ 94.5 65.5 87.4 45.7 42.6 42.3 46.7 54.5 88.3 48.0 84.7 66.0 33.4 89.9 53.5 56.8 0.0 46.9 49.4 57.7
Ours (mono) ✓ 95.0 67.0 87.4 44.0 42.2 40.7 47.5 50.8 87.1 51.0 77.5 67.7 29.9 88.5 42.0 57.4 0.0 45.3 42.5 56.0
Ours (stereo) ✓ 95.1 67.8 87.7 51.3 41.5 36.3 47.4 51.3 87.8 47.8 87.3 67.0 34.2 87.5 41.0 51.8 0.0 42.6 46.4 56.4
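Table 1 above (and Table 2 below) report per-class IoU and its mean over the shared categories, following the evaluation protocol in Sec. 5.1. For reference, a small self-contained sketch of this metric is given here; the ignore index of 255 follows common Cityscapes practice and is an assumption rather than a detail stated in the paper.

```python
import numpy as np


def per_class_iou(pred, gt, num_classes=19, ignore_index=255):
    """Per-class IoU and mIoU from integer label maps of identical shape.

    Pixels whose ground truth equals ignore_index are excluded; classes that never
    appear in either map come out as NaN and are skipped by the mean.
    """
    mask = gt != ignore_index
    hist = np.bincount(num_classes * gt[mask].astype(int) + pred[mask].astype(int),
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)
    inter = np.diag(hist)
    union = hist.sum(axis=0) + hist.sum(axis=1) - inter
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = inter / union
    return iou, float(np.nanmean(iou))
```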

5.2. Comparisons with State-of-the-Art Methods

We compare our proposed method with the prior art in Tables 1 and 2. The column SF indicates if the comparison method is source-free or not. As shown, our method outperforms the existing source-free methods by a large margin, achieving a state-of-the-art mIoU of 57.7% (56.4% without self-distillation) on GTA5 → Cityscapes and 57.5% (55.6% without self-distillation) on SYNTHIA → Cityscapes. We achieve the best score on 15 out of 19 common categories shared by GTA5 and Cityscapes, and on 12 out of 16 common categories shared by SYNTHIA and Cityscapes. The experimental results indicate the effectiveness of our proposed pseudo label denoising with cross-modal consistency training. As we are exploring a new direction that has not been studied in previous source-free methods, our solution is orthogonal to existing techniques such as source domain estimation [31] and the conditional Prior-enforcing AutoEncoder (cPAE) [26]. Such techniques can be combined with our proposed method for further performance gains.

Next, we compare our method to the non-source-free prior art. Starting with a well-trained source model (44.0% or 41.0% mIoU on GTA5 or SYNTHIA → Cityscapes), our method obtains competitive or even better results compared to most of the existing non-source-free UDA methods. It is worth noting that our method can be easily integrated with non-source-free UDA methods. A naive implementation is to start with an adapted model instead of the source model to generate pseudo labels for target images in stage one self-training.

5.3. Ablation Study and Discussion

Impact of the source for depth information. Our proposed method is agnostic to the acquisition of the depth information. To evaluate, we replace the depth information provided by the official Cityscapes dataset¹ by 1) the self-supervised stereoscopic depth [44] used in CorDA [53], and 2) the self-supervised monocular depth learned by the ManyDepth model [55], denoted as Ours (stereo) and Ours (mono), respectively. For the monocular depth, we directly use the 1-channel disparity map as the input, while for the stereo depth, we use the 3-channel HHA representation derived from the depth information with camera parameters as the input (see Figure 3 for visualized examples). Generally speaking, stereo depth is more accurate but its acquisition requires more expensive stereo cameras. Monocular depth can be estimated based on video sequences recorded by regular cameras. However, it is less accurate and it requires significantly more storage to manage the video sequences. We show that our proposed method is effective with different sources of depth information. In real-world scenarios, users should choose based on their own requirements and available devices.

¹ The depth provided in the official Cityscapes dataset is not the ground truth but also estimated based on stereo images.
Table 2. Per-class IoU (%) and mIoU (%) comparison of SYNTHIA → Cityscapes adaptation. The best score for each column is highlighted. mIoU and mIoU* denote the averaged scores across 16 and 13 categories, respectively.
Method SF road sidewalk building wall* fence* pole* light sign vege. sky person rider car bus motor bike mIoU mIoU*
CAG-UDA [68] ✗ 84.7 40.8 81.7 7.8 0.0 35.1 13.3 22.7 84.5 77.6 64.2 27.8 80.9 19.7 22.7 48.3 44.5 51.5
FADA [52] ✗ 84.5 40.1 83.1 4.8 0.0 34.3 20.1 27.2 84.8 84.0 53.5 22.6 85.4 43.7 26.8 27.8 45.2 52.5
Seg-Uncertainty [69] ✗ 87.6 41.9 83.1 14.7 1.7 36.2 31.3 19.9 81.6 80.6 63.0 21.8 86.2 40.7 23.6 53.1 47.9 54.9
IAST [37] ✗ 81.9 41.5 83.3 17.7 4.6 32.3 30.9 28.8 83.4 85.0 65.5 30.8 86.5 38.2 33.1 52.7 49.8 57.0
CorDA [53] ✗ 93.3 61.6 85.3 19.6 5.1 37.8 36.6 42.8 84.9 90.4 69.7 41.8 85.6 38.4 32.6 53.9 55.0 62.8
ProDA [67] ✗ 87.8 45.7 84.6 37.1 0.6 44.0 54.6 37.0 88.1 84.4 74.2 24.3 88.2 51.1 40.5 45.6 55.5 62.0
EHTDI [29] ✗ 93.0 69.8 84.0 36.6 9.1 39.7 42.2 43.8 88.2 88.1 68.3 29.0 85.5 54.1 37.1 56.3 57.8 64.6
BiSMAP [35] ✗ 81.9 39.8 84.2 - - - 41.7 46.1 83.4 88.7 69.2 39.3 80.7 51.0 51.2 58.8 - 62.8
SFDA [31] ✓ 81.9 44.9 81.7 4.0 0.5 26.2 3.3 10.7 86.3 89.4 37.9 13.4 80.6 25.6 9.6 31.3 39.2 45.9
URMA [39] ✓ 59.3 24.6 77.0 14.0 1.8 31.5 18.3 32.0 83.1 80.4 46.3 17.8 76.7 17.0 18.5 34.6 39.6 45.0
LD [66] ✓ 77.1 33.4 79.4 5.8 0.5 23.7 5.2 13.0 81.8 78.3 56.1 21.6 80.3 49.6 28.0 48.1 42.6 50.1
SFUDA [64] ✓ 90.9 45.5 80.8 3.6 0.5 28.6 8.5 26.1 83.4 83.6 55.2 25.0 79.5 32.8 20.2 43.9 44.2 51.9
GtA w/o cPAE [26] ✓ 89.0 44.6 80.1 7.8 0.7 34.4 22.0 22.9 82.0 86.5 65.4 33.2 84.8 45.8 38.4 31.7 48.1 55.5
GtA w/ cPAE [26] ✓ 90.5 50.0 81.6 13.3 2.8 34.7 25.7 33.1 83.8 89.2 66.0 34.9 85.3 53.4 46.1 46.6 52.0 60.1
Ours ✓ 91.5 55.5 85.4 34.4 8.3 40.8 40.0 44.4 86.6 84.3 62.4 22.0 88.3 60.0 40.6 45.6 55.6 62.1
Ours w/ distillation ✓ 91.5 56.3 85.9 37.9 9.2 42.1 42.6 47.6 87.2 86.1 64.5 23.3 89.3 64.5 45.0 47.7 57.5 64.0
Ours (mono) ✓ 91.2 56.6 85.0 36.5 6.8 41.6 45.5 18.8 86.5 86.2 66.4 26.7 88.7 58.2 44.3 48.0 55.4 61.7
Ours (stereo) ✓ 91.6 56.4 85.7 29.3 7.8 41.2 42.0 37.6 86.8 85.9 65.2 27.3 88.4 59.5 44.4 47.8 56.0 63.0

Figure 3. Visualization of the depth and the HHA representation obtained by different methods. (Panels: Mono depth, Stereo depth, Official depth; Original image, Stereo HHA, Official HHA.)

Table 3. Comparison of different utilization strategies of the depth information for source-free UDA on GTA5 → Cityscapes. * indicates we made minimum modifications to make the method compatible with source-free settings.
Method BG MC RIV RIG DS mIoU gain
Source only [26] 55.3 19.4 28.7 62.9 53.7 44.0 -
DADA* [51] 61.5 26.9 36.1 72.1 55.8 50.1 +6.1
CorDA* [53] 60.5 27.3 39.0 73.8 55.6 50.5 +6.5
MKE* [61] 62.2 27.8 40.4 70.9 57.8 51.5 +7.5
Ours 65.8 31.7 44.9 76.7 65.4 56.4 +12.4
² Background (BG) - building, wall, fence, vegetation, terrain, sky; Minority Class (MC) - rider, train, motorcycle, bicycle; Road Infrastructure Vertical (RIV) - pole, traffic light, traffic sign; Road Infrastructure Ground (RIG) - road, sidewalk; and Dynamic Stuff (DS) - person, car, truck, bus.

Table 4. Model justification of our proposed framework on GTA5 → Cityscapes. The auxiliary modality column indicates if depth modality is used during training or not.
Stage 1 components (auxiliary modality / self-training / consistency regularization / pseudo label denoising), mIoU, gain:
source model: 44.0 (-)
self-training: 50.5 (+6.5)
self-training + consistency regularization: 51.2 (+7.2)
self-training + pseudo label denoising: 52.7 (+8.7)
self-training + consistency regularization + pseudo label denoising: 55.1 (+11.1)
auxiliary modality + self-training: 50.9 (+6.9)
auxiliary modality + self-training + consistency regularization: 51.6 (+7.6)
auxiliary modality + self-training + pseudo label denoising: 54.2 (+10.2)
auxiliary modality + self-training + consistency regularization + pseudo label denoising: 56.4 (+12.4)
Stage 2 components (auxiliary modality / self-distillation / initialization), mIoU, gain:
auxiliary modality + self-distillation + stage 1 initialization: 57.6 (+13.6)
auxiliary modality + self-distillation + self-supervised initialization: 57.7 (+13.7)

Utilization strategies on the depth information. Existing depth-aware domain adaptive semantic segmentation methods mostly follow a multitask learning framework where depth estimation is modeled as the auxiliary task [51, 53]. We modified two depth-aware UDA methods to make them applicable in a source-free setting by calculating the classification loss based on the pseudo labeled target images only. The results are reported in Table 3². As shown, without the supervision of the labeled source data, the regularization induced by the auxiliary task is quite limited. Moreover, we compare our approach to a Multimodal Knowledge Expansion (MKE) method [61] that transfers knowledge from a unimodal teacher network to a multimodal student network. However, as this method did not address the domain shift issue between the source model and the target images, it performs less effectively than our proposed approach. Furthermore, the inference-stage model in MKE is multimodal, while ours is unimodal with better feasibility.
Figure 4. Qualitative results of source-free semantic segmentation on the Cityscapes dataset. From left to right: input, output of the source model, output of the GtA model with cPAE [26], output of our proposed model without self-distillation, ground-truth segmentation mask.

Effectiveness of cross-modal pseudo label denoising. Our proposed framework consists of two major components, namely the multimodal auxiliary network and cross-modal consistency training. As shown in Table 4, we start with a source model that obtains an mIoU of 44.0% on GTA5 → Cityscapes. By training the network without our proposed consistency regularization, it achieves an mIoU of 50.9% and 54.2%, respectively, based on the supervision of the classification loss only before and after the pseudo label denoising. By combining our proposed consistency regularization with pseudo label denoising, we obtain a new state-of-the-art mIoU of 56.4%, outperforming the source model significantly by 12.4%. To evaluate the benefits introduced by the depth modality, we replaced our multimodal auxiliary network with a unimodal network with the same architecture as the main stream. The mIoU decreases in all cases by using RGB as the only input. Next, we evaluate our cross-modal consistency training in self-distillation. We initialize our model either with the weights of the learned model in stage one (i.e., stage 1 initialization) or with SimCLRv2 [7] pretrained weights (i.e., self-supervised initialization). In both cases, we observe a performance gain of around 1.3% over the stage one model. The qualitative evaluation of our method is illustrated in Figure 4.

Table 5. Impact of the source model on GTA5 → Cityscapes.
source model | source training | target model | target adaptation | mIoU
DeepLabv2 | data aug. | - | - | 38.6
DeepLabv2 [26] | multi-head | - | - | 44.0
DeepLabv2 | multi-head | SegFormer | self-training | 51.3
DeepLabv2 | multi-head | DeepLabv2 | self-training | 50.5
DeepLabv2 | multi-head | DeepLabv2* | our proposed | 56.4
SegFormer [59] | data aug. | - | - | 43.2
SegFormer | data aug. | SegFormer | self-training | 50.5
SegFormer | data aug. | DeepLabv2 | self-training | 49.4
SegFormer | data aug. | DeepLabv2* | our proposed | 55.5
GtA w/ cPAE | SF adapted | - | - | 53.4
GtA w/ cPAE | SF adapted | DeepLabv2* | our proposed | 57.3
ProDA [67] | non-SF adapted | - | - | 57.5
ProDA | non-SF adapted | DeepLabv2* | our proposed | 59.5

Impact of the source model. The majority of the source-free UDA methods are built upon DeepLab models. Here we evaluate a Transformer-based model, namely SegFormer [59], as the source and target models in a source-free UDA setting. As Table 5 shows, SegFormer has better generalization ability than DeepLabv2. With data augmentation only, a source SegFormer model obtains an mIoU of 43.2%, outperforming a source DeepLabv2 model by 4.6%. Moreover, when being adopted as the target model, SegFormer achieves an mIoU of 51.3% and 50.5%, respectively. It outperforms the corresponding DeepLabv2 by 0.8% and 1.1% when being adapted from the same source model. To verify that our method is orthogonal to previous work, we also start with a source-free model (i.e., GtA w/ cPAE [26]) and a non-source-free model (i.e., ProDA [67]), and apply our method on top of them. As can be seen, the mIoU has been further improved by 3.9% and 2%, respectively.

Table 6. The effect of the balancing coefficient γ.
γ | 0.5 | 1 | 2 | 5 | 10
mIoU | 56.0 | 56.4 | 56.2 | 56.7 | 55.7

Table 7. The mIoU obtained by the multimodal auxiliary network with varying numbers of SSA-Gates.
SSA-Gate no. | 1 | 2 | 3 | 4
mIoU | 43.4 | 49.2 | 53.1 | 56.6

Parameter sensitivity analysis. Finally, we study the impact of the balancing coefficient γ in Eq. 7 on the self-training in stage one. We set γ to different values, conduct experiments on GTA5 → Cityscapes, and report the results in Table 6. The experimental results show that our proposed method is not sensitive to the balancing factor γ. In our previous experiments, we empirically set γ = 1; the mIoU can be slightly improved by setting γ = 5. We obtain a state-of-the-art mIoU of 55.7% ∼ 56.7% for γ ∈ [0.5, 10], which underscores the robustness of our proposed cross-modal consistency training technique. Table 7 shows the mIoU obtained by the multimodal auxiliary network with varying numbers of SSA-Gates. The mIoU decreases significantly to 43.4% with only one SSA-Gate, which indicates that predicting the semantic labels from depth alone is challenging without sufficient information exchange with RGB images.

6. Conclusions

We propose to enhance source-free domain adaptive semantic segmentation via cross-modal consistency training. To achieve this goal, we introduce a multimodal auxiliary network to leverage the guidance from the depth modality during training. A cross-modal consistency loss is formulated between the output of the main and the auxiliary networks, which serves as an effective regularization for source-free UDA. Our proposed approach not only outperforms the source-free prior art by a large margin, but also reduces the gap between source-free and non-source-free UDA methods in semantic segmentation.
References

[1] Nikita Araslanov and Stefan Roth. Self-supervised augmentation consistency for adapting semantic segmentation. In CVPR, pages 15384–15394, 2021.
[2] Mathilde Bateson, Hoel Kervadec, Jose Dolz, Hervé Lombaert, and Ismail Ben Ayed. Source-relaxed domain adaptation for image segmentation. In MICCAI, pages 490–499, 2020.
[3] Adriano Cardace, Luca De Luigi, Pierluigi Zama Ramirez, Samuele Salti, and Luigi Di Stefano. Plugging self-supervised monocular depth into unsupervised domain adaptation for semantic segmentation. In WACV, pages 1129–1139, 2022.
[4] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[5] Lin-Zhuo Chen, Zheng Lin, Ziqin Wang, Yong-Liang Yang, and Ming-Ming Cheng. Spatial information guided convolution for real-time RGBD semantic segmentation. IEEE Transactions on Image Processing, 30:2313–2324, 2021.
[6] Minghao Chen, Hongyang Xue, and Deng Cai. Domain adaptation for semantic segmentation with maximum squares loss. In ICCV, pages 2090–2099, 2019.
[7] Ting Chen, Simon Kornblith, Kevin Swersky, Mohammad Norouzi, and Geoffrey E Hinton. Big self-supervised models are strong semi-supervised learners. NeurIPS, 33:22243–22255, 2020.
[8] Xiaokang Chen, Kwan-Yee Lin, Jingbo Wang, Wayne Wu, Chen Qian, Hongsheng Li, and Gang Zeng. Bi-directional cross-modality feature propagation with separation-and-aggregation gate for RGB-D semantic segmentation. In ECCV, pages 561–577, 2020.
[9] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool. Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach. In CVPR, pages 1841–1850, 2019.
[10] Yuhua Chen, Wen Li, and Luc Van Gool. Road: Reality oriented adaptation for semantic segmentation of urban scenes. In CVPR, pages 7892–7901, 2018.
[11] Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, and Min Sun. No more discrimination: Cross city adaptation of road scene segmenters. In ICCV, pages 1992–2001, 2017.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, pages 3213–3223, 2016.
[13] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. Randaugment: Practical automated data augmentation with a reduced search space. In CVPR Workshops, pages 702–703, 2020.
[14] Terrance DeVries and Graham W Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
[15] Liang Du, Jingang Tan, Hongye Yang, Jianfeng Feng, Xiangyang Xue, Qibao Zheng, Xiaoqing Ye, and Xiaolin Zhang. SSF-DAN: Separated semantic feature based domain adaptation network for semantic segmentation. In ICCV, pages 982–991, 2019.
[16] Ravi Garg, Vijay Kumar Bg, Gustavo Carneiro, and Ian Reid. Unsupervised cnn for single view depth estimation: Geometry to the rescue. In ECCV, pages 740–756. Springer, 2016.
[17] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, pages 270–279, 2017.
[18] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
[19] Saurabh Gupta, Ross Girshick, Pablo Arbeláez, and Jitendra Malik. Learning rich features from RGB-D images for object detection and segmentation. In ECCV, pages 345–360, 2014.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.
[21] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. Cycada: Cycle-consistent adversarial domain adaptation. In ICML, pages 1989–1998, 2018.
[22] Lukas Hoyer, Dengxin Dai, Yuhua Chen, Adrian Koring, Suman Saha, and Luc Van Gool. Three ways to improve semantic segmentation with self-supervised depth estimation. In CVPR, pages 11130–11140, 2021.
[23] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. Daformer: Improving network architectures and training strategies for domain-adaptive semantic segmentation. In CVPR, pages 9924–9935, 2022.
[24] Lukas Hoyer, Dengxin Dai, and Luc Van Gool. HRDA: Context-aware high-resolution domain-adaptive semantic segmentation. In ECCV, 2022.
[25] Maximilian Jaritz, Tuan-Hung Vu, Raoul de Charette, Emilie Wirbel, and Patrick Pérez. xMUDA: Cross-modal unsupervised domain adaptation for 3d semantic segmentation. In CVPR, pages 12605–12614, 2020.
[26] Jogendra Nath Kundu, Akshay Kulkarni, Amit Singh, Varun Jampani, and R Venkatesh Babu. Generalize then adapt: Source-free domain adaptive semantic segmentation. In ICCV, pages 7046–7056, 2021.
[27] Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M Hospedales. Deeper, broader and artier domain generalization. In ICCV, pages 5542–5550, 2017.
[28] Guangrui Li, Guoliang Kang, Wu Liu, Yunchao Wei, and Yi Yang. Content-consistent matching for domain adaptive semantic segmentation. In ECCV, pages 440–456, 2020.
[29] Junjie Li, Zilei Wang, Yuan Gao, and Xiaoming Hu. Exploring high-quality target domain information for unsupervised domain adaptive semantic segmentation. In ACM Multimedia, pages 5237–5245, 2022.
[30] Yunsheng Li, Lu Yuan, and Nuno Vasconcelos. Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, pages 6936–6945, 2019.
[31] Yuang Liu, Wei Zhang, and Jun Wang. Source-free domain adaptation for semantic segmentation. In CVPR, pages 1215–1224, 2021.
[32] Zhenguang Liu, Haoming Chen, Runyang Feng, Shuang Wu, Shouling Ji, Bailin Yang, and Xun Wang. Deep dual consecutive network for human pose estimation. In CVPR, pages 525–534, 2021.
[33] Zhenguang Liu, Shuang Wu, Shuyuan Jin, Qi Liu, Shijian Lu, Roger Zimmermann, and Li Cheng. Towards natural and accurate future motion prediction of humans and animals. In CVPR, pages 10004–10012, 2019.
[34] Adrian Lopez-Rodriguez and Krystian Mikolajczyk. Desc: Domain adaptation for depth estimation via semantic consistency. International Journal of Computer Vision, 131(3):752–771, 2023.
[35] Yulei Lu, Yawei Luo, Li Zhang, Zheyang Li, Yi Yang, and Jun Xiao. Bidirectional self-training with multiple anisotropic prototypes for domain adaptive semantic segmentation. In ACM Multimedia, pages 1405–1415, 2022.
[36] Yawei Luo, Liang Zheng, Tao Guan, Junqing Yu, and Yi Yang. Taking a closer look at domain shift: Category-level adversaries for semantics consistent domain adaptation. In ICCV, pages 2507–2516, 2019.
[37] Ke Mei, Chuang Zhu, Jiaqi Zou, and Shanghang Zhang. Instance adaptive self-training for unsupervised domain adaptation. In ECCV, 2020.
[38] Luke Melas-Kyriazi and Arjun K Manrai. Pixmatch: Unsupervised domain adaptation via pixelwise consistency training. In CVPR, pages 12435–12445, 2021.
[39] S Prabhu Teja and François Fleuret. Uncertainty reduction for model adaptation in semantic segmentation. In CVPR, pages 9613–9623, 2021.
[40] Zhen Qiu, Yifan Zhang, Hongbin Lin, Shuaicheng Niu, Yanxia Liu, Qing Du, and Mingkui Tan. Source-free domain adaptation via avatar prototype generation and adaptation. In IJCAI, 2021.
[41] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In ECCV, pages 102–118, 2016.
[42] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In CVPR, pages 3234–3243, 2016.
[43] Suman Saha, Anton Obukhov, Danda Pani Paudel, Menelaos Kanakis, Yuhua Chen, Stamatios Georgoulis, and Luc Van Gool. Learning to relate depth and semantics for unsupervised domain adaptation. In CVPR, pages 8197–8207, 2021.
[44] Christos Sakaridis, Dengxin Dai, Simon Hecker, and Luc Van Gool. Model adaptation with synthetic and real data for semantic dense foggy scene understanding. In ECCV, pages 687–704, 2018.
[45] Inkyu Shin, Sanghyun Woo, Fei Pan, and In So Kweon. Two-phase pseudo label densification for self-training based domain adaptation. In ECCV, pages 532–548, 2020.
[46] Kihyuk Sohn, David Berthelot, Nicholas Carlini, Zizhao Zhang, Han Zhang, Colin A Raffel, Ekin Dogus Cubuk, Alexey Kurakin, and Chun-Liang Li. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. NeurIPS, 33:596–608, 2020.
[47] Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NeurIPS, 30, 2017.
[48] Yi-Hsuan Tsai, Wei-Chih Hung, Samuel Schulter, Kihyuk Sohn, Ming-Hsuan Yang, and Manmohan Chandraker. Learning to adapt structured output space for semantic segmentation. In CVPR, pages 7472–7481, 2018.
[49] Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey. PAMI, 2021.
[50] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Advent: Adversarial entropy minimization for domain adaptation in semantic segmentation. In CVPR, pages 2517–2526, 2019.
[51] Tuan-Hung Vu, Himalaya Jain, Maxime Bucher, Matthieu Cord, and Patrick Pérez. Dada: Depth-aware domain adaptation in semantic segmentation. In ICCV, pages 7364–7373, 2019.
[52] Haoran Wang, Tong Shen, Wei Zhang, Ling-Yu Duan, and Tao Mei. Classes matter: A fine-grained adversarial approach to cross-domain semantic segmentation. In ECCV, pages 642–659, 2020.
[53] Qin Wang, Dengxin Dai, Lukas Hoyer, Luc Van Gool, and Olga Fink. Domain adaptive semantic segmentation with self-supervised depth estimation. In ICCV, pages 8515–8525, 2021.
[54] Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey. Symmetric cross entropy for robust learning with noisy labels. In ICCV, pages 322–330, 2019.
[55] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel Brostow, and Michael Firman. The temporal opportunist: Self-supervised multi-frame monocular depth. In CVPR, 2021.
[56] Quanliang Wu and Huajun Liu. Unsupervised domain adaptation for semantic segmentation using depth distribution. In Advances in Neural Information Processing Systems.
[57] Zuxuan Wu, Xin Wang, Joseph E Gonzalez, Tom Goldstein, and Larry S Davis. ACE: Adapting to changing environments for semantic segmentation. In ICCV, pages 2121–2130, 2019.
[58] Binhui Xie, Shuang Li, Mingjia Li, Chi Harold Liu, Gao Huang, and Guoren Wang. SePiCo: Semantic-guided pixel contrast for domain adaptive semantic segmentation. arXiv preprint arXiv:2204.08808, 2022.
[59] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. NeurIPS, 34:12077–12090, 2021.
[60] Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, and Quoc Le. Unsupervised data augmentation for consistency training. NeurIPS, 33:6256–6268, 2020.
[61] Zihui Xue, Sucheng Ren, Zhengqi Gao, and Hang Zhao. Multimodal knowledge expansion. In ICCV, pages 854–863, 2021.
[62] Jihan Yang, Ruijia Xu, Ruiyu Li, Xiaojuan Qi, Xiaoyong Shen, Guanbin Li, and Liang Lin. An adversarial perturbation oriented domain adaptation approach for semantic segmentation. In AAAI, pages 12613–12620, 2020.
[63] Yanchao Yang and Stefano Soatto. FDA: Fourier domain adaptation for semantic segmentation. In CVPR, pages 4085–4095, 2020.
[64] Mucong Ye, Jing Zhang, Jinpeng Ouyang, and Ding Yuan. Source data-free unsupervised domain adaptation for semantic segmentation. In ACM Multimedia, pages 2233–2242, 2021.
[65] Yifang Yin, Harsh Shrivastava, Ying Zhang, Zhenguang Liu, Rajiv Ratn Shah, and Roger Zimmermann. Enhanced audio tagging via multi- to single-modal teacher-student mutual learning. In AAAI, volume 35, pages 10709–10717, 2021.
[66] Fuming You, Jingjing Li, Lei Zhu, Zhi Chen, and Zi Huang. Domain adaptive semantic segmentation without source data. In ACM Multimedia, pages 3293–3302, 2021.
[67] Pan Zhang, Bo Zhang, Ting Zhang, Dong Chen, Yong Wang, and Fang Wen. Prototypical pseudo label denoising and target structure learning for domain adaptive semantic segmentation. In CVPR, pages 12414–12424, 2021.
[68] Qiming Zhang, Jing Zhang, Wei Liu, and Dacheng Tao. Category anchor-guided unsupervised domain adaptation for semantic segmentation. NeurIPS, 32, 2019.
[69] Zhedong Zheng and Yi Yang. Rectifying pseudo label learning via uncertainty estimation for domain adaptive semantic segmentation. International Journal of Computer Vision, pages 1106–1120, 2021.
[70] Qianyu Zhou, Zhengyang Feng, Qiqi Gu, Jiangmiao Pang, Guangliang Cheng, Xuequan Lu, Jianping Shi, and Lizhuang Ma. Context-aware mixup for domain adaptive semantic segmentation. IEEE Transactions on Circuits and Systems for Video Technology, 2022.
[71] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, pages 1851–1858, 2017.
[72] Yang Zou, Zhiding Yu, BVK Kumar, and Jinsong Wang. Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pages 289–305, 2018.
[73] Yang Zou, Zhiding Yu, Xiaofeng Liu, BVK Kumar, and Jinsong Wang. Confidence regularized self-training. In ICCV, pages 5982–5991, 2019.
