
Texture Learning Domain Randomization for Domain Generalized Segmentation

Sunghwan Kim Dae-hwan Kim Hoseong Kim*


Agency for Defense Development (ADD)
{ssshwan, dhkim7, hoseongkim}@add.re.kr

Abstract

Deep Neural Networks (DNNs)-based semantic segmentation models trained on a source domain often struggle to generalize to unseen target domains, i.e., a domain gap problem. Texture often contributes to the domain gap, making DNNs vulnerable to domain shift because they are prone to be texture-biased. Existing Domain Generalized Semantic Segmentation (DGSS) methods have alleviated the domain gap problem by guiding models to prioritize shape over texture. On the other hand, shape and texture are two prominent and complementary cues in semantic segmentation. This paper argues that leveraging texture is crucial for improving performance in DGSS. Specifically, we propose a novel framework, coined Texture Learning Domain Randomization (TLDR). TLDR includes two novel losses to effectively enhance texture learning in DGSS: (1) a texture regularization loss to prevent overfitting to source domain textures by using texture features from an ImageNet pre-trained model and (2) a texture generalization loss that utilizes random style images to learn diverse texture representations in a self-supervised manner. Extensive experimental results demonstrate the superiority of the proposed TLDR; e.g., TLDR achieves 46.5 mIoU on GTA→Cityscapes using ResNet-50, which improves the prior state-of-the-art method by 1.9 mIoU. The source code is available at https://github.com/ssssshwan/TLDR.

*Corresponding author.

Figure 1. Segmentation results for an image from an unseen domain (i.e., Cityscapes [12]), using models trained on GTA [51] with DGSS methods: (a) Normalization & Whitening, (b) Domain Randomization, and (c) TLDR (ours). (a-b) Existing methods [10, 64] have difficulty in distinguishing between road, sidewalk, and terrain, which have similar shapes and contexts, as the texture is not sufficiently considered during the training process. (c) Our Texture Learning Domain Randomization (TLDR) distinguishes the classes effectively as we utilize the texture as prediction cues.

1. Introduction

Semantic segmentation is an essential task in computer vision with many real-world applications, such as autonomous vehicles, augmented reality, and medical imaging. Deep Neural Networks (DNNs)-based semantic segmentation models work well when the data distributions are consistent between a source domain and target domains [1, 5–7, 24, 37, 60]. However, the performance of the models tends to degrade significantly in practical settings that involve unseen out-of-distribution scenarios, also known as a domain gap problem. Many domain adaptation and generalization methods have been proposed to solve the domain gap problem [10, 27, 31, 46–48, 59, 64, 65]. Domain adaptation assumes accessibility of target domain images, differing from domain generalization. This paper addresses Domain Generalized Semantic Segmentation (DGSS), which aims to train models that can generalize to diverse unseen domains by training on a single source domain.

Existing DGSS methods have attempted to address the domain gap problem by guiding models to focus on shape rather than texture. Given that texture often varies across different domains (e.g., synthetic/real and sunny/rainy/foggy), DNNs are susceptible to domain shift because they tend to be texture-biased [21, 43]. Accordingly, there are two main approaches for the DGSS methods. The first approach is Normalization and Whitening (NW), which involves normalizing and whitening the features [10, 46–48]. It is possible to remove domain-specific texture features and learn domain-invariant shape features with NW (see Figure 2a). The second approach is Domain Randomization (DR), which trains by transforming source images into randomly stylized images [27, 31, 49, 59, 64–66]. The model learns domain-invariant shape features because texture cues are mostly replaced by random styles (see Figure 2b) [42, 56].

Figure 2. Reconstructed source images from the feature maps of (a) normalization and whitening and (b) domain randomization. Texture features are often omitted in existing DGSS methods.

Figure 3. The t-SNE [57] plots for the road, sidewalk, and terrain classes from Cityscapes [12] that have similar shapes: (a) shape features and (b) texture features. While the shape features (Canny edge [3]) are entangled in (a), the texture features (Gram-matrix) of these classes are clearly separated in (b). The plots are based on an ImageNet pre-trained model.

Figure 4. Visualization of texture for the road, sidewalk, and terrain classes from the GTA [51], Cityscapes [12], BDD [63], and Mapillary [44] datasets. For each class, there are commonalities in texture across the datasets.

While the existing methods are effective at making the models focus on shape features, they need to give more consideration to texture features. In addition to utilizing shape features like edges and structures, DNNs also use texture features such as patterns, color variations, and histograms as important cues for prediction [20]. Particularly in semantic segmentation, texture plays a crucial role in accurately maintaining the boundaries between objects [29, 68].

Figure 1 demonstrates the results of predicting an unseen image from Cityscapes [12] using DGSS methods trained on GTA [51]. One can see that the models trained with NW and DR have difficulty distinguishing between the road, sidewalk, and terrain classes, which have similar shapes and contexts. In order to determine these classes accurately, it is necessary to use texture cues. This assertion is further emphasized through t-SNE [57] plots of the classes: the shape features are entangled in Figure 3a, whereas the texture features are clearly separated in Figure 3b. Meanwhile, some textures remain relatively unchanged across domains in DGSS, as shown in Figure 4. Based on these observations, we suggest utilizing texture as valuable cues for prediction.

We propose Texture Learning Domain Randomization (TLDR), which enables DGSS models to learn both shape and texture. Since only source domain images are available in DGSS, texture must be learned from them. To accurately capture the texture, we leverage the source domain images without any modification, which we refer to as the original source images. The stylized source images from DR are more focused on learning shape features, and the original source images are more focused on learning texture features (Section 4.1). To further improve texture learning in DGSS, we propose a texture regularization loss and a texture generalization loss. While there are commonalities in texture across different domains, there are also clear texture differences. Thus, if the model overfits source domain textures, it will result in a performance drop. To mitigate this problem, we propose the texture regularization loss to prevent overfitting to source domain textures by using texture features from an ImageNet pre-trained model (Section 4.2). Since source domain textures alone may not be sufficient for learning general texture representations, we propose the texture generalization loss that utilizes random style images to learn diverse texture representations in a self-supervised manner (Section 4.3).

Our contribution can be summarized into three aspects. First, to the best of our knowledge, we are the first to approach DGSS from both the shape and texture perspectives. We argue that leveraging texture is essential in distinguishing between classes with similar shapes despite the domain gap. Second, to enhance texture learning in DGSS, we introduce two novel losses: the texture regularization loss and the texture generalization loss. Third, extensive experiments over multiple DGSS tasks show that our proposed TLDR achieves state-of-the-art performance. Our method attains 46.5 mIoU on GTA→Cityscapes using ResNet-50, surpassing the prior state-of-the-art method by 1.9 mIoU.
Figure 5. (a) Visualization of using the Style Transfer Module (STM) to transform an original source image into stylized source images with random style images. (b) Illustration of the process of using the Texture Extraction Operator (TEO) to extract only texture features, i.e., a Gram-matrix, from a feature map.

2. Related Work

Domain Generalization (DG). DG aims to learn DNNs that perform well on multiple unseen domains [40]. DG has primarily been studied in image classification. A number of studies have been proposed to address DG, including adversarial learning [30, 34, 55], data augmentation [50, 58, 61, 67], meta-learning [2, 14, 36], ensemble learning [62, 67], and self-supervised learning [4, 32].

Recent studies have attempted to train domain generalized models by preserving ImageNet pre-trained feature representations as much as possible. Chen et al. defined DG as a life-long learning problem [36] and tried to utilize ImageNet pre-trained weights to prevent catastrophic forgetting [9]. Contrastive learning [8] and attentional pooling [8, 41] were introduced to enhance the capturing of semantic knowledge in ImageNet features. In this paper, we regularize texture representations of DNNs with the ImageNet features. To the best of our knowledge, this is the first attempt to extract a specific semantic concept (i.e., texture) from the ImageNet features for regularization in DG.

Domain Generalized Semantic Segmentation (DGSS). DGSS is in its early stages and has yet to receive much research attention. Existing DGSS methods have tried to alleviate a domain gap problem through two main approaches: Normalization and Whitening (NW) and Domain Randomization (DR). NW trains by normalizing the mean and standard deviation of source features and whitening the covariance of source features. This process eliminates domain-specific features, allowing the model to learn domain-invariant features. Pan et al. introduced instance normalization to remove domain-specific features [46]. Pan et al. proposed switchable whitening to decorrelate features [47], and Choi et al. proposed Instance Selective Whitening (ISW) to enhance the ability to whiten domain-specific features [10]. Peng et al. tried to normalize and whiten features in a category-wise manner [48].

DR trains by transforming source images into randomly stylized images. The model is guided to capture domain-invariant shape features since texture cues are substituted with random styles. Yue et al. presented a method for considering consistency among multiple stylized images [64]. Peng et al. distinguished between global and local texture in the randomization process [49]. Huang et al. proposed DR in frequency space [27]. Adversarial learning [66] and self-supervised learning [59] have been used as attempts to make style learnable rather than using random style images. It has been shown that utilizing content and style of ImageNet [31] and ImageNet pre-trained features [65] can aid in learning generalized features in DR.

Existing DR methods have yet to comprehensively address DGSS from both shape and texture perspectives. Several studies [31, 64] have stated that the success of DR is attributed to its ability to learn various styles, resulting in improved performance in generalized domains. However, from a different perspective, we consider that the effectiveness of DR is due to the model becoming more focused on shape, as discussed in recent DG methods [42, 56]. Therefore, we assume that the model needs to learn texture for further performance improvement in DR. We propose a novel approach for learning texture in DGSS without overfitting to source domain textures.

3. Preliminaries

3.1. Domain Randomization with Style Transfer

We adopt DR as a baseline DGSS method. We use a neural style transfer method at each epoch to transform each original source image into a different random style. If edge information is lost during the style transfer process, it may cause a mismatch with the semantic label, leading to a decrease in performance. We therefore utilize photoWCT [33], which is known as an edge-preserving style transfer method. Random style images are sampled from the ImageNet [13] validation set. Figure 5a is a visualization of using the Style Transfer Module (STM) to transform an original source image into stylized source images with random style images.
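To make the randomization step concrete, the following is a minimal sketch of how one training sample could be assembled under this scheme. It is our own illustration, not the released code: TLDR uses the edge-preserving photoWCT [33] as the STM, whereas the stand-in below only matches per-channel mean and standard deviation so that the sketch stays self-contained.

```python
import random
import torch

def naive_style_transfer(content, style, eps=1e-6):
    """Stand-in for the Style Transfer Module (STM). TLDR uses the
    edge-preserving photoWCT [33]; here we only match per-channel mean
    and std (an AdaIN-like color transfer) for illustration.
    content, style: (3, H, W) tensors with values in [0, 1]."""
    c_mean = content.mean(dim=(1, 2), keepdim=True)
    c_std = content.std(dim=(1, 2), keepdim=True)
    s_mean = style.mean(dim=(1, 2), keepdim=True)
    s_std = style.std(dim=(1, 2), keepdim=True)
    return ((content - c_mean) / (c_std + eps)) * s_std + s_mean

def build_dr_sample(x_s, y, style_pool):
    """Assemble one Domain Randomization training sample: the original
    source image x_s, a randomly stylized copy x_sr, the random style
    image x_r, and the shared semantic label y."""
    x_r = random.choice(style_pool)                     # e.g., an ImageNet val image tensor
    x_sr = naive_style_transfer(x_s, x_r).clamp(0.0, 1.0)
    # Stylization is assumed to preserve layout/edges, so y is shared.
    return {"x_s": x_s, "x_sr": x_sr, "x_r": x_r, "y": y}
```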
Figure 6. An overview of our proposed Texture Learning Domain Randomization (TLDR) framework. The stylized source image x^{sr} (green) is obtained by stylizing the original source image x^s (blue) with the random style image x^r (purple). The stylized task loss L_styl focuses on learning shape from x^{sr}, and the original task loss L_orig focuses on learning texture from x^s. The texture regularization loss L_TR enforces the consistency between the Gram-matrices of the ImageNet model f_I and the task model f_T for x^s. The texture generalization loss L_TG enforces the consistency between the Gram-matrices of the task model f_T for x^{sr} and x^r. Random Style Masking (RSM) selects only the random style features when applying the texture generalization loss.

3.2. Texture Extraction with Gram-matrix

Texture is a regional descriptor that can offer measurements for both local structural (e.g., pattern) and global statistical (e.g., overall distribution of colors) properties of an image [23]. It has been shown that texture can be represented by pairwise correlations between features extracted by DNNs, also known as a Gram-matrix [18, 19, 35].

We use the Gram-matrix to extract only texture features from a feature map. The Gram-matrix G ∈ R^{C×C} for the vectorized matrix F ∈ R^{C×HW} of the feature map is defined as Equation 1. C, H, and W denote the channel, height, and width of the feature map, respectively.

    G_{i,j} = F_i \cdot F_j,    (1)

where \cdot represents a dot product, and F_i and F_j are the i-th and j-th row vectors of F, respectively. G_{i,j} is the entry at the i-th row and j-th column of G. Each entry of the Gram-matrix indicates a pairwise correlation between the features, corresponding to a texture feature. In this paper, the operator extracting texture features is called the Texture Extraction Operator (TEO). Figure 5b illustrates the process of TEO.

4. Approach

This section describes our proposed Texture Learning Domain Randomization (TLDR) framework. TLDR learns texture features in addition to learning shape features through domain randomization. TLDR consists of four losses: a stylized task loss, an original task loss, a texture regularization loss, and a texture generalization loss. The stylized task loss focuses on learning shape, and the original task loss focuses on learning texture. The texture regularization loss and the texture generalization loss prevent overfitting to source domain textures. Figure 6 is an overview of TLDR. The frozen ImageNet pre-trained encoder is denoted as f_I, i.e., the ImageNet model. The training encoder is denoted as f_T, i.e., the task model. The training semantic segmentation decoder is denoted as h_T.

4.1. Task Losses

Stylized task loss. We denote an original source image x^s and its semantic label y. x^{sr} is a stylized source image obtained by stylizing an original source image x^s with a random style image x^r. The prediction result for x^{sr} of the model is p^{sr} = h_T(f_T(x^{sr})). Then the stylized task loss L_styl is given by Equation 2.

    \mathcal{L}_{styl} = \mathrm{CE}(p^{sr}, y),    (2)

where CE(·) represents the categorical cross-entropy loss. The stylized task loss encourages the model to focus on shape features during training since the texture cues are mostly replaced by random styles [42, 56].

Original task loss. The model struggles to learn texture from the stylized source images as the texture cues are mostly substituted with random styles. To accurately capture the source domain textures, the model is trained on the original source images. The prediction result for x^s of the model is p^s = h_T(f_T(x^s)). The original task loss L_orig is given by Equation 3.

    \mathcal{L}_{orig} = \mathrm{CE}(p^{s}, y).    (3)

DNNs tend to prioritize texture cues without restrictions [21, 43]. The original task loss guides the model to concentrate on texture features during training.
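As a rough PyTorch-style sketch (ours, not the authors' released implementation), the Texture Extraction Operator of Equation 1 and the two task losses of Equations 2 and 3 can be written as follows; `task_model` and `decoder` stand in for f_T and h_T, and the decoder is assumed to output per-pixel class logits at the label resolution.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    """Texture Extraction Operator (TEO), Eq. (1): pairwise channel
    correlations of a feature map.  feat: (B, C, H, W) -> (B, C, C)."""
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)            # vectorized feature map F in R^{C x HW}
    return torch.bmm(f, f.transpose(1, 2))   # G_{i,j} = F_i . F_j

def task_losses(task_model, decoder, x_s, x_sr, y):
    """Stylized task loss (Eq. 2) and original task loss (Eq. 3)."""
    p_sr = decoder(task_model(x_sr))         # prediction for the stylized source image
    p_s = decoder(task_model(x_s))           # prediction for the original source image
    loss_styl = F.cross_entropy(p_sr, y)     # pushes the model toward shape cues
    loss_orig = F.cross_entropy(p_s, y)      # pushes the model toward texture cues
    return loss_orig, loss_styl
```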
4.2. Texture Regularization Loss

If the model is trained on the original source images without regularization, it will overfit the source domain textures [21, 43]. We regularize texture representations of the task model using the ImageNet model, which encodes diverse feature representations [8]. However, it is important to note that ImageNet features include not only texture features, but also other semantic features such as shape features. We therefore assume that regularizing the entire feature may interfere with the task model learning texture. To address this issue, we propose to apply TEO to extract only texture features from the ImageNet features and regularize the task model with the extracted texture features (see Table 5 for ablation).

Let F_I^{l,s} and F_T^{l,s} denote the vectorized feature maps in layer l of the ImageNet model f_I and the task model f_T for x^s, respectively. The Gram-matrices G_I^{l,s} and G_T^{l,s} from F_I^{l,s} and F_T^{l,s} are the texture features of the original source image as seen by f_I and f_T, respectively. The contribution of the l-th layer to the texture regularization loss is ‖G_I^{l,s} − G_T^{l,s}‖_2. The total texture regularization loss is given by Equation 4.

    \mathcal{L}_{TR} = \sum_{l=1}^{L} \frac{u_l}{C_l H_l W_l} \lVert G_I^{l,s} - G_T^{l,s} \rVert_2,    (4)

where L is the number of feature map layers, and u_l is a weighting factor for the contribution of the l-th layer to L_TR. C_l, H_l, and W_l denote the channel, height, and width of the l-th layer feature map, respectively. We reduce the value of u_l as l increases, considering that fewer texture features are encoded in feature maps as layers become deeper [28].

Figure 7. Random Style Masking (RSM) masks only the entries corresponding to the random style features (purple) in the Gram-matrices. The difference between G_T^{l,sr} and G_T^{l,s} is calculated as D^l. The entries with values greater than a certain threshold τ in D^l are masked as M^l.

4.3. Texture Generalization Loss

We supplement texture learning from the random style images for more diverse texture representations. Since the random style images are unlabeled, the texture representations should be learned self-supervised. Note that the random style image x^r and the stylized source image x^{sr} share some texture features. To encourage learning of diverse texture representations, we induce the texture features of x^{sr} to become as similar to those of x^r as possible while preserving source texture features.

Let F_T^{l,r} and F_T^{l,sr} denote the vectorized feature maps in layer l of f_T for the random style image x^r and the stylized source image x^{sr}, respectively. G_T^{l,r} and G_T^{l,sr} are the corresponding Gram-matrices. Our goal is basically to set an objective that makes G_T^{l,r} and G_T^{l,sr} consistent. However, G_T^{l,sr} includes both random style features and remaining source texture features. Applying the constraints to the entire G_T^{l,sr} also imposes the objective on the source texture features in the stylized source image. To select only the random style features when enforcing the consistency between G_T^{l,r} and G_T^{l,sr}, we propose Random Style Masking (RSM), inspired by ISW [10]. Figure 7 is an illustration of RSM. We assume that the entries corresponding to the random style features are activated in G_T^{l,sr} but deactivated in G_T^{l,s}. Considering D^l = G_T^{l,sr} − G_T^{l,s}, the entries corresponding to the random style features are expected to be larger than a certain threshold τ. We denote the mask for the entries representing the random style features as M^l (see Equation 5).

    M_{i,j}^{l} = \begin{cases} 1, & \text{if } D_{i,j}^{l} > \tau \\ 0, & \text{otherwise} \end{cases}    (5)

where i and j are the row and column indices in each matrix, respectively. The threshold τ is determined empirically (see Table 6 for ablation). We only apply the objective to the random style features selected by RSM. The contribution of the l-th layer to the texture generalization loss is ‖(G_T^{l,r} − G_T^{l,sr}) ⊙ M^l‖_2, where ⊙ represents an element-wise product. The total texture generalization loss is given by Equation 6.

    \mathcal{L}_{TG} = \sum_{l=1}^{L} \frac{v_l}{C_l H_l W_l} \lVert (G_T^{l,r} - G_T^{l,sr}) \odot M^{l} \rVert_2,    (6)

where v_l is a weighting factor for the contribution of the l-th layer to L_TG. We also reduce the value of v_l as l increases, for the same reason as u_l in Equation 4.

4.4. Full Objective

As the training progresses, the task losses decrease while the texture regularization loss and the texture generalization loss remain relatively constant. We add a Linear Decay Factor (LDF) to the texture regularization and generalization losses to balance their scale with the task losses (see Table 6 for ablation). The LDF at iteration t is set to w(t) = 1 − t/t_total, where t_total denotes the total number of iterations. Our full objective is given by Equation 7.

    \mathcal{L}_{total} = \alpha_{orig} \mathcal{L}_{orig} + \alpha_{styl} \mathcal{L}_{styl} + w(t) \mathcal{L}_{TR} + w(t) \mathcal{L}_{TG},    (7)

where α_orig and α_styl are the weights for the original task loss and the stylized task loss, respectively.
5. Experiments

5.1. Implementation Details

Datasets. As synthetic datasets, GTA [51] consists of 24,966 images with a resolution of 1914×1052. It has 12,403, 6,382, and 6,181 images for the training, validation, and test sets. SYNTHIA [52] contains 9,400 images with a resolution of 1280×760. It has 6,580 and 2,820 images for the training and validation sets, respectively.

As real-world datasets, Cityscapes [12] consists of 2,975 training images and 500 validation images with a resolution of 2048×1024. BDD [63] contains 7,000 training images and 1,000 validation images with a resolution of 1280×720. Mapillary [44] involves 18,000 training images and 2,000 validation images with diverse resolutions. For brevity, we denote GTA, SYNTHIA, Cityscapes, BDD, and Mapillary as G, S, C, B, and M, respectively.

Network architecture. We conduct experiments using ResNet [25] as the encoder architecture and DeepLabV3+ [7] as the semantic segmentation decoder architecture. In all experiments, encoders are initialized with an ImageNet [13] pre-trained model.

Training. We adopt an AdamW [39] optimizer. The initial learning rate is set to 3×10^{-5} for the encoder and 3×10^{-4} for the decoder, with 40k training iterations and a batch size of 4. The weight decay is set to 0.01, with a linear warmup [22] over t_warm = 1k iterations, followed by a linear decay. We use random scaling in the range [0.5, 2.0] and random cropping with a size of 768×768. We apply additional data augmentation techniques, including random flipping and color jittering. We set the texture regularization parameters as u_l = 5×10^{-l-2} and the texture generalization parameters as v_l = 5×10^{-l-2}. The original task loss and the stylized task loss weights are set to α_orig = 0.5 and α_styl = 0.5, respectively. We set the RSM threshold to τ = 0.1.
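As a rough sketch of the schedule described above (not the released training script; the parameter-group split and scheduler wiring are our assumptions for illustration):

```python
import torch

def build_optimizer_and_scheduler(encoder, decoder, total_iters=40_000,
                                  warmup_iters=1_000):
    """AdamW with per-module learning rates, linear warmup, then linear decay."""
    optimizer = torch.optim.AdamW(
        [{"params": encoder.parameters(), "lr": 3e-5},
         {"params": decoder.parameters(), "lr": 3e-4}],
        weight_decay=0.01,
    )

    def lr_lambda(step):
        if step < warmup_iters:                     # linear warmup [22]
            return (step + 1) / warmup_iters
        # linear decay to zero over the remaining iterations
        return max(0.0, 1.0 - (step - warmup_iters) / (total_iters - warmup_iters))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```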
Method           Encoder      C      B      M      S
DRPC [64]        ResNet-50    37.42  32.14  34.12  -
RobustNet [10]   ResNet-50    36.58  35.20  40.33  28.30
SAN-SAW [48]     ResNet-50    39.75  37.34  41.86  30.79
SiamDoGe [59]    ResNet-50    42.96  37.54  40.64  28.34
WildNet [31]     ResNet-50    44.62  38.42  46.09  31.34
SHADE [65]       ResNet-50    44.65  39.28  43.34  -
TLDR (ours)      ResNet-50    46.51  42.58  46.18  36.30
DRPC [64]        ResNet-101   42.53  38.72  38.05  29.67
GTR [49]         ResNet-101   43.70  39.60  39.10  29.32
FSDR [27]        ResNet-101   44.80  41.20  43.40  -
SAN-SAW [48]     ResNet-101   45.33  41.18  40.77  31.84
WildNet [31]     ResNet-101   45.79  41.73  47.08  32.51
SHADE [65]       ResNet-101   46.66  43.66  45.50  -
TLDR (ours)      ResNet-101   47.58  44.88  48.80  39.35

Table 1. Comparison of mIoU (%; higher is better) between DGSS methods trained on GTA and evaluated on C, B, M, and S. The best and second best results are highlighted and underlined, respectively. Our method is marked in gray.

Method           Encoder      C      B      M      G
DRPC [64]        ResNet-50    35.65  31.53  32.74  28.75
SAN-SAW [48]     ResNet-50    38.92  35.24  34.52  29.16
TLDR (ours)      ResNet-50    41.88  34.35  36.79  35.90
DRPC [64]        ResNet-101   37.58  34.34  34.12  29.24
GTR [49]         ResNet-101   39.70  35.30  36.40  28.71
FSDR [27]        ResNet-101   40.80  39.60  37.40  -
SAN-SAW [48]     ResNet-101   40.87  35.98  37.26  30.79
TLDR (ours)      ResNet-101   42.60  35.46  37.46  37.77

Table 2. Comparison of mIoU (%; higher is better) between DGSS methods trained on SYNTHIA and evaluated on C, B, M, and G. The best and second best results are highlighted and underlined, respectively. Our method is marked in gray.

5.2. Comparison with DGSS methods

To measure generalization capacity in unseen domains, we train on a single source domain and evaluate on multiple unseen domains. We conduct experiments on two settings: (1) G→{C, B, M, S} and (2) S→{C, B, M, G}. We repeat each benchmark three times, each time with a different random seed, and report the average results. We evaluate our method using ResNet-50 and ResNet-101 encoders. We use mean Intersection over Union (mIoU) [16] as the evaluation metric. The best and second best results are highlighted and underlined in tables, respectively.

Tables 1 and 2 show the generalization performance of models trained on GTA [51] and SYNTHIA [52], respectively. Our TLDR generally outperforms other DGSS methods in most benchmarks. In particular, we improve the G→C benchmark by +1.9 mIoU and +0.9 mIoU on ResNet-50 and ResNet-101 encoders, respectively.

5.3. Texture Awareness of Model

We design experiments to verify whether the performance improvement of TLDR is actually due to being aware of texture. In the experiments, we compare TLDR to plain Domain Randomization (DR), which is only trained with the stylized task loss. We train on GTA using a ResNet-101 encoder. For convenience, we refer to the models trained with DR and TLDR as the DR model and the TLDR model, respectively. The models are evaluated on Cityscapes in the experiments if no specific dataset is mentioned.

Class activation map. To validate whether the models utilize texture cues for prediction, we generate class activation maps and analyze the contribution of texture features to the predictions. We apply Grad-CAM [54] to road, sidewalk, and terrain on the DR and TLDR models. Figure 8 displays the attention heat maps generated for each class, highlighting the regions the models focus on during prediction. Our analysis reveals that the DR model tends to have high activation in edge regions, while the TLDR model tends to have activation throughout broader areas. One can infer that these broader areas include texture cues, which the TLDR model uses as valuable prediction cues.

Figure 8. Visualization of class activation maps for the road, sidewalk, and terrain classes using DR and our TLDR. Only the classes are displayed as opaque for better visualization in the unseen images. The TLDR model tends to have activation throughout broader areas than the DR model, suggesting it can rely on more texture cues when making predictions.

Confusion matrix. We compare confusion matrices between the two models. Figure 9 shows the confusion matrices of the DR model and the TLDR model for shape-similar classes. The TLDR model has lower false positive rates in classifying sidewalk and terrain as road, with rates of 19.6% and 6.9%, respectively, compared to the DR model, which has rates of 27.6% and 11.6% (see Figure 9a). Also, there is an increase in the accuracy of each class. There are also clear reductions in confusion between building, wall & fence and car, truck & bus (see Figures 9b and 9c).

Figure 9. Comparison of the confusion matrices between DR and our TLDR for classes with similar shapes: (a) road, sidewalk, terrain; (b) building, wall, fence; (c) car, truck, bus. The TLDR model tends to have less confusion and higher accuracy than the DR model.

Qualitative results. Figure 10 shows the qualitative results of the DR and TLDR models in various domains. The TLDR model provides better prediction results for shape-similar classes than the DR model (see white boxes).

Figure 10. Qualitative results of DR and our TLDR on unseen images. The unseen images are from Cityscapes, BDD, Mapillary, and SYNTHIA in row order. The TLDR model provides better prediction results for shape-similar classes than the DR model, as seen in the white boxes.

Dimensionality analysis. We investigate whether the TLDR model encodes additional texture information compared to the DR model. We use the method proposed by Islam et al. to quantify the dimensions corresponding to texture and shape in latent representations [28]. The method estimates the dimensionality of a semantic concept by computing mutual information between feature representations of input pairs that exhibit the same semantic concept. A detailed explanation of the experiment is in Appendix A. Table 3 shows the estimated dimensions (%) for texture and shape in each layer of the DR and TLDR models. At every layer, it is apparent that the TLDR model encodes more texture information than the DR model.

Layer   DR Texture   DR Shape   TLDR Texture   TLDR Shape
l=1     66.5%        29.4%      66.7%          29.2%
l=2     56.6%        39.8%      57.3%          39.1%
l=3     32.0%        63.5%      33.8%          61.9%
l=4     19.1%        54.3%      20.1%          52.5%

Table 3. Layer-wise dimensionality of texture and shape in the latent representations of DR and our TLDR, evaluated in percentages (%). The TLDR model stores more texture information than the DR model across all the layers.

The experimental results demonstrate that the TLDR model can effectively distinguish between classes with similar shapes by encoding and using texture as additional discriminative cues.
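A simple way to reproduce the kind of confusion analysis shown in Figure 9 is sketched below. This is our own utility, not from the paper's code; it accumulates a confusion matrix over paired prediction and ground-truth label maps and row-normalizes it.

```python
import numpy as np

def confusion_matrix(preds, labels, num_classes, ignore_index=255):
    """Row-normalized confusion matrix (%) from paired prediction and
    ground-truth label maps (each an integer array of train IDs)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for pred, label in zip(preds, labels):
        valid = label != ignore_index
        idx = label[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
        cm += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return 100.0 * cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

# e.g., inspect road/sidewalk/terrain confusion (Cityscapes train IDs 0, 1, 9):
# print(confusion_matrix(all_preds, all_labels, 19)[np.ix_([0, 1, 9], [0, 1, 9])])
```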
5.4. Ablation Study

In the ablation experiments, we use ResNet-101 as the encoder, train on GTA, and evaluate on C, B, M, and S. The best results are highlighted in the tables.

Loss components. We investigate how each loss component contributes to overall performance. Table 4 shows the mIoU performance change with respect to the ablation of loss components. There is a large performance improvement when the stylized task loss and the original task loss are used simultaneously (cf. row 3). This appears to be the effect of using shape and texture cues as complementary. There is an additional performance improvement when the texture regularization loss and the texture generalization loss are added separately (cf. rows 4 and 5). The overall performance is the best when both losses are used (cf. row 6).

     L_orig  L_styl  L_TR  L_TG   C      B      M      S
1    ✓       -       -     -      36.12  31.38  30.89  37.43
2    -       ✓       -     -      39.48  39.21  43.45  31.96
3    ✓       ✓       -     -      43.62  43.94  44.06  38.75
4    ✓       ✓       ✓     -      46.85  46.07  46.52  38.64
5    ✓       ✓       -     ✓      46.36  44.01  47.49  38.10
6    ✓       ✓       ✓     ✓      47.58  44.88  48.80  39.35

Table 4. Ablation experiment on each loss in TLDR. The model is trained on GTA using ResNet-101. The best results are highlighted. The default setting is marked in gray.

Texture Extraction Operator (TEO). We use TEO to extract only texture features from feature maps. We conduct experiments to verify whether TEO leads to performance improvement in the texture regularization loss and the texture generalization loss. In the absence of TEO, L_TR and L_TG refer to the calculation of direct consistency on the feature maps without any alterations. Table 5 shows the results of the TEO ablation experiments. The results show that the performance is improved in the presence of TEO (cf. rows 2, 4, and 6) in both L_TR and L_TG compared to the absence of TEO (cf. rows 1, 3, and 5).

     L_TR  L_TG  TEO    C      B      M      S
1    ✓     -     -      44.95  44.09  44.24  39.07
2    ✓     -     ✓      46.85  46.07  46.52  38.64
3    -     ✓     -      44.25  43.53  46.90  38.96
4    -     ✓     ✓      46.36  44.01  47.49  38.10
5    ✓     ✓     -      45.70  44.13  46.20  38.57
6    ✓     ✓     ✓      47.58  44.88  48.80  39.35

Table 5. Ablation experiments on TEO. The model is trained on GTA using ResNet-101. The best results are highlighted. The default setting is marked in gray.

Random Style Masking (RSM). We conduct an experiment to validate the effectiveness of RSM. As shown in Table 6, the model with RSM (cf. row 3) performs better than the model without RSM (cf. row 1). Also, when the threshold of RSM is decreased to τ = 0.01 (cf. row 2), the performance is higher than without RSM (cf. row 1) but lower than the default setting τ = 0.1 (cf. row 3). The experiment suggests that maintaining an appropriate RSM threshold can aid in learning random styles.

Linear Decay Factor (LDF). When analyzing the texture regularization and generalization losses, we observe that the losses do not naturally decrease as much as the task losses (see Appendix B). The texture regularization loss remains relatively constant, likely due to the frozen ImageNet model. The texture generalization loss is also relatively stable, likely due to the different combinations of source images and random style images used in each epoch. We design LDF to match the scale of the losses. As shown in Table 6, the model with LDF (cf. row 5) performs better than the model without LDF (cf. row 4).

     Case                 C      B      M      S
1    w/o RSM              46.57  44.89  47.17  38.87
2    w/ RSM (τ = 0.01)    46.92  44.01  47.75  38.61
3    w/ RSM (τ = 0.1)     47.58  44.88  48.80  39.35
4    w/o LDF              46.77  44.84  46.87  39.72
5    w/ LDF               47.58  44.88  48.80  39.35

Table 6. Ablation experiments on the design choices of TLDR. The model is trained on GTA using ResNet-101. The best results are highlighted. The default setting is marked in gray.

Random style dataset. We analyze how performance changes when training with various random style datasets. Table 7 shows the performance when using the WikiArt [45] dataset, the Describable Textures Dataset (DTD) [11], and the ImageNet [13] validation set as the random style dataset. The overall performance is the best when using the ImageNet validation set as the random style dataset (cf. row 3). Performance remains consistent even when using the other two datasets (cf. rows 1 and 2). The performance improvement when using ImageNet is probably because the ImageNet pre-trained model is better at extracting the textures present in ImageNet.

     Random Style         C      B      M      S
1    WikiArt [45]         45.09  44.58  45.16  39.52
2    DTD [11]             45.65  45.10  45.29  39.78
3    ImageNet val [13]    47.58  44.88  48.80  39.35

Table 7. Ablation experiment on different random style datasets. The model is trained on GTA using ResNet-101. The best results are highlighted. The default setting is marked in gray.
6. Conclusion

Texture often contributes to the domain gap in DGSS. Existing DGSS methods have attempted to eliminate or randomize texture features. However, this paper argues that texture remains a supplementary prediction cue to shape despite the domain gap. Accordingly, we proposed TLDR to learn texture features without overfitting to source domain textures in DGSS. TLDR includes novel texture regularization and generalization losses, using a Gram-matrix as a key component. We conducted a diverse set of experiments to demonstrate that TLDR effectively differentiates shape-similar classes by leveraging texture as a prediction cue. The experiments on multiple DGSS tasks show that our TLDR achieves state-of-the-art performance.

Limitation. During the experiments, we found that texture differences between domains vary class-wise. Learning more source domain textures for classes with small texture differences may help with generalization, but it may be advantageous for classes with high texture differences to learn less of them. In our method, there is no class-wise prescription for different texture differences. We leave it as an interesting direction for future work.

Acknowledgements

We sincerely thank Chanyong Lee and Eunjin Koh for their constructive discussions and support. We also appreciate Chaehyeon Lim, Sihyun Yu, Seokhyun Moon, and Youngju Yoo for providing insightful feedback. This work was supported by the Agency for Defense Development (ADD) grant funded by the Korea government (912855101).

References

[1] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 39(12):2481–2495, 2017.
[2] Yogesh Balaji, Swami Sankaranarayanan, and Rama Chellappa. Metareg: Towards domain generalization using meta-regularization. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
[3] John Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), (6):679–698, 1986.
[4] Fabio M Carlucci, Antonio D'Innocente, Silvia Bucci, Barbara Caputo, and Tatiana Tommasi. Domain generalization by solving jigsaw puzzles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2229–2238, 2019.
[5] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(4):834–848, 2017.
[6] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[7] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 801–818, 2018.
[8] Wuyang Chen, Zhiding Yu, Shalini De Mello, Sifei Liu, Jose M Alvarez, Zhangyang Wang, and Anima Anandkumar. Contrastive syn-to-real generalization. arXiv preprint arXiv:2104.02290, 2021.
[9] Wuyang Chen, Zhiding Yu, Zhangyang Wang, and Animashree Anandkumar. Automated synthetic-to-real generalization. In Proceedings of the International Conference on Machine Learning (ICML), pages 1746–1756, 2020.
[10] Sungha Choi, Sanghun Jung, Huiwon Yun, Joanne T Kim, Seungryong Kim, and Jaegul Choo. Robustnet: Improving domain generalization in urban-scene segmentation via instance selective whitening. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11580–11590, 2021.
[11] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3606–3613, 2014.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[13] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[14] Qi Dou, Daniel Coelho de Castro, Konstantinos Kamnitsas, and Ben Glocker. Domain generalization via model-agnostic learning of semantic features. Advances in Neural Information Processing Systems (NeurIPS), 32, 2019.
[15] Patrick Esser, Robin Rombach, and Bjorn Ommer. A disentangling invertible interpretation network for explaining latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9223–9232, 2020.
[16] Mark Everingham, SM Eslami, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision (IJCV), 111(1):98–136, 2015.
[17] David V Foster and Peter Grassberger. Lower bounds on mutual information. Physical Review E, 83(1):010101, 2011.
[18] Leon Gatys, Alexander S Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 28, 2015.
[19] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. Image style transfer using convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2414–2423, 2016.
[20] Yunhao Ge, Yao Xiao, Zhi Xu, Xingrui Wang, and Laurent Itti. Contributions of Shape, Texture, and Color in Visual Recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–386, 2022.
[21] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
[22] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017.
[23] Robert M Haralick. Statistical and structural approaches to texture. Proceedings of the IEEE, 67(5):786–804, 1979.
[24] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2961–2969, 2017.
[25] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[26] Xiaowei Hu, Chi-Wing Fu, Lei Zhu, and Pheng-Ann Heng. Depth-attentional features for single-image rain removal. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8022–8031, 2019.
[27] Jiaxing Huang, Dayan Guan, Aoran Xiao, and Shijian Lu. Fsdr: Frequency space domain randomization for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6891–6902, 2021.
[28] Md Amirul Islam, Matthew Kowal, Patrick Esser, Sen Jia, Bjorn Ommer, Konstantinos G Derpanis, and Neil Bruce. Shape or texture: Understanding discriminative features in cnns. arXiv preprint arXiv:2101.11604, 2021.
[29] Deyi Ji, Haoran Wang, Mingyuan Tao, Jianqiang Huang, Xian-Sheng Hua, and Hongtao Lu. Structural and Statistical Texture Knowledge Distillation for Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16876–16885, 2022.
[30] Yunpei Jia, Jie Zhang, Shiguang Shan, and Xilin Chen. Single-side domain generalization for face anti-spoofing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8484–8493, 2020.
[31] Suhyeon Lee, Hongje Seong, Seongwon Lee, and Euntai Kim. WildNet: Learning Domain Generalized Semantic Segmentation from the Wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9936–9946, 2022.
[32] Lei Li, Ke Gao, Juan Cao, Ziyao Huang, Yepeng Weng, Xiaoyue Mi, Zhengze Yu, Xiaoya Li, and Boyang Xia. Progressive domain expansion network for single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 224–233, 2021.
[33] Yijun Li, Ming-Yu Liu, Xueting Li, Ming-Hsuan Yang, and Jan Kautz. A closed-form solution to photorealistic image stylization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 453–468, 2018.
[34] Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 624–639, 2018.
[35] Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. Demystifying neural style transfer. arXiv preprint arXiv:1701.01036, 2017.
[36] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 40(12):2935–2947, 2017.
[37] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[38] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[39] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[40] Krikamol Muandet, David Balduzzi, and Bernhard Schölkopf. Domain generalization via invariant feature representation. In Proceedings of the International Conference on Machine Learning (ICML), pages 10–18, 2013.
[41] Gilhyun Nam, Gyeongjae Choi, and Kyungmin Lee. GCISG: Guided Causal Invariant Learning for Improved Syn-to-Real Generalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 656–672, 2022.
[42] Hyeonseob Nam, HyunJae Lee, Jongchan Park, Wonjun Yoon, and Donggeun Yoo. Reducing domain gap by reducing style bias. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8690–8699, 2021.
[43] Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems (NeurIPS), 34:23296–23308, 2021.
[44] Gerhard Neuhold, Tobias Ollmann, Samuel Rota Bulo, and Peter Kontschieder. The mapillary vistas dataset for semantic understanding of street scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4990–4999, 2017.
[45] Kiri Nichol. Painter by numbers, wikiart. Kiri Nichol, 2016.
[46] Xingang Pan, Ping Luo, Jianping Shi, and Xiaoou Tang. Two at once: Enhancing learning and generalization capacities via ibn-net. In Proceedings of the European Conference on Computer Vision (ECCV), pages 464–479, 2018.
[47] Xingang Pan, Xiaohang Zhan, Jianping Shi, Xiaoou Tang, and Ping Luo. Switchable whitening for deep representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1863–1871, 2019.
[48] Duo Peng, Yinjie Lei, Munawar Hayat, Yulan Guo, and Wen Li. Semantic-aware domain generalized segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2594–2605, 2022.
[49] Duo Peng, Yinjie Lei, Lingqiao Liu, Pingping Zhang, and Jun Liu. Global and local texture randomization for synthetic-to-real semantic segmentation. IEEE Transactions on Image Processing, 30:6594–6608, 2021.
[50] Fengchun Qiao, Long Zhao, and Xi Peng. Learning to learn single domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12556–12565, 2020.
[51] Stephan R Richter, Vibhav Vineet, Stefan Roth, and Vladlen Koltun. Playing for data: Ground truth from computer games. In Proceedings of the European Conference on Computer Vision (ECCV), pages 102–118, 2016.
[52] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3234–3243, 2016.
[53] Christos Sakaridis, Dengxin Dai, and Luc Van Gool. Semantic foggy scene understanding with synthetic data. International Journal of Computer Vision (IJCV), 126:973–992, 2018.
[54] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 618–626, 2017.
[55] Rui Shao, Xiangyuan Lan, Jiawei Li, and Pong C Yuen. Multi-adversarial discriminative deep domain generalization for face presentation attack detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10023–10031, 2019.
[56] Nathan Somavarapu, Chih-Yao Ma, and Zsolt Kira. Frustratingly simple domain generalization via image stylization. arXiv preprint arXiv:2006.11207, 2020.
[57] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research (JMLR), 9(11), 2008.
[58] Riccardo Volpi, Hongseok Namkoong, Ozan Sener, John C Duchi, Vittorio Murino, and Silvio Savarese. Generalizing to unseen domains via adversarial data augmentation. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
[59] Zhenyao Wu, Xinyi Wu, Xiaoping Zhang, Lili Ju, and Song Wang. SiamDoGe: Domain Generalizable Semantic Segmentation Using Siamese Network. In Proceedings of the European Conference on Computer Vision (ECCV), pages 603–620, 2022.
[60] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. SegFormer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems (NeurIPS), 34:12077–12090, 2021.
[61] Qinwei Xu, Ruipeng Zhang, Ya Zhang, Yanfeng Wang, and Qi Tian. A fourier-based framework for domain generalization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14383–14392, 2021.
[62] Zheng Xu, Wen Li, Li Niu, and Dong Xu. Exploiting low-rank structure from latent domains for domain generalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 628–643, 2014.
[63] Fisher Yu, Haofeng Chen, Xin Wang, Wenqi Xian, Yingying Chen, Fangchen Liu, Vashisht Madhavan, and Trevor Darrell. Bdd100k: A diverse driving dataset for heterogeneous multitask learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2636–2645, 2020.
[64] Xiangyu Yue, Yang Zhang, Sicheng Zhao, Alberto Sangiovanni-Vincentelli, Kurt Keutzer, and Boqing Gong. Domain randomization and pyramid consistency: Simulation-to-real generalization without accessing target domain data. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2100–2110, 2019.
[65] Yuyang Zhao, Zhun Zhong, Na Zhao, Nicu Sebe, and Gim Hee Lee. Style-hallucinated dual consistency learning for domain generalized semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 535–552, 2022.
[66] Zhun Zhong, Yuyang Zhao, Gim Hee Lee, and Nicu Sebe. Adversarial style augmentation for domain generalized urban-scene segmentation. arXiv preprint arXiv:2207.04892, 2022.
[67] Kaiyang Zhou, Yongxin Yang, Yu Qiao, and Tao Xiang. Domain generalization with mixstyle. arXiv preprint arXiv:2104.02008, 2021.
[68] Lanyun Zhu, Deyi Ji, Shiping Zhu, Weihao Gan, Wei Wu, and Junjie Yan. Learning statistical texture for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 12537–12546, 2021.
Appendix
In the Appendix, we provide more details and additional experimental results of our proposed Texture Learning Domain Randomization (TLDR). The sections are organized as follows:
• A: Details of Dimension Estimation
• B: Loss Graph
• C: Theoretical analysis on LTR and LTG
• D: Details of t-SNE Visualizations
• E: Experiment on Class Uniform Sampling
• F: Hyperparameter Analysis
• G: Experiment on Multi-source Setting
• H: Pseudocode and Source Code Implementation
• I: More Qualitative Results

A. Details of Dimension Estimation


Figure S1. Visualization of image pairs with a certain semantic concept: (a) texture pair and (b) shape pair. The image pairs are generated
by using the Style Transfer Module (STM).

In Section 5.3, we conduct an experiment to verify whether our TLDR enhances the ability to encode texture informa-
tion in latent representations. Esser et al. proposed a method for estimating the dimensions of semantic concepts in latent
representations [15], and Islam et al. utilized the method on texture and shape [28].
Let I_a and I_b be a pair of images that are similar in terms of a certain semantic concept, as shown in Figure S1. The latent representations z_i^a and z_i^b correspond to the images I_a and I_b, respectively, and are estimated by the task model at the i-th layer. The method hypothesizes that high mutual information between two latent representations indicates that the model effectively encodes the corresponding semantic concept. It is known that the mutual information between latent representations obeys the lower bound of Equation S1 [17].

    \mathrm{MI}(z_i^a, z_i^b) \ge -\frac{1}{2} \log\big(1 - \mathrm{corr}(z_i^a, z_i^b)\big),    (S1)
where corr(·) represents correlation, and MI(·) represents mutual information. We calculate the mutual information by
assuming it satisfies the inequality with a tight condition. We use mutual information to measure the scores of encoding
texture and shape. Lastly, the percentage of semantic concepts in latent representations is calculated by taking a softmax
function over each of the three scores, the texture score, the shape score, and a fixed baseline score.
In the experiment, we create image pairs using the Style Transfer Module (STM). A texture pair consists of two stylized images that share the same style but are generated from two different content images using the STM. Conversely, a shape pair
comprises two stylized images derived from a single content image but with distinct styles, again using the STM. The
dimensionality is then calculated for the texture and shape pairs that share one common image (e.g., Ia in Figure S1). We
conduct the experiment on 200 sets of shape-texture image pairs and estimate the dimensions by taking the average.
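
A minimal sketch of this estimation procedure is given below. The per-dimension correlation, the averaging over dimensions, and the value of the fixed baseline score are our assumptions for illustration; the exact protocol follows [15, 28].

import numpy as np

def mi_lower_bound(z_a, z_b, eps=1e-6):
    # z_a, z_b: (num_pairs, dim) latent representations of the paired images.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + eps)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + eps)
    corr = (z_a * z_b).mean(0)               # per-dimension correlation (assumption)
    corr = np.clip(corr, 0.0, 1.0 - eps)     # keep the logarithm well-defined
    return -0.5 * np.log(1.0 - corr)         # lower bound of Equation (S1), assumed tight

def concept_percentages(z_tex_a, z_tex_b, z_shp_a, z_shp_b, baseline=1.0):
    scores = np.array([
        mi_lower_bound(z_tex_a, z_tex_b).mean(),   # texture score
        mi_lower_bound(z_shp_a, z_shp_b).mean(),   # shape score
        baseline,                                  # fixed baseline score (assumed value)
    ])
    return np.exp(scores) / np.exp(scores).sum()   # softmax -> [texture, shape, residual]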
B. Loss Graph

Figure S2. Graphs of the original task loss, the stylized task loss, the texture regularization loss, and the texture generalization loss (a) without LDF and (b) with LDF.

In Section 4.4, we introduce the Linear Decay Factor (LDF), motivated by the observation that the texture regularization and generalization losses remain relatively constant compared to the task losses. In Figure S2a, the original task loss and the stylized task loss continue to decrease with each iteration, whereas the texture regularization loss and the texture generalization loss stay nearly constant. Figure S2b shows that applying the LDF aligns the scales of the losses. Higher-order schedules such as cosine annealing [38] could also be applied and may further improve performance.
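
As an illustration, the two decay schedules discussed above could be sketched as follows. The exact form of the LDF is defined in the main paper and the source code, so both functions below are assumptions.

import math

def linear_decay(cur_iter, max_iter):
    # Linearly decays from 1 at the start of training to 0 at the end (assumed LDF form).
    return 1.0 - cur_iter / max_iter

def cosine_decay(cur_iter, max_iter):
    # Cosine annealing in the spirit of [38]: a higher-order decay from 1 to 0.
    return 0.5 * (1.0 + math.cos(math.pi * cur_iter / max_iter))

# The factor would scale the texture losses so that their magnitude follows the
# decreasing task losses, e.g., loss = L_task + linear_decay(t, T) * (L_tr + L_tg).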

C. Theoretical Analysis on L_TR and L_TG


It is known that texture can be represented by the low-level statistics of image features. Meanwhile, [35] theoretically showed that matching Gram matrices in the squared ℓ2 (Frobenius) norm is equivalent to minimizing the Maximum Mean Discrepancy (MMD) with a second-order polynomial kernel, which amounts to aligning the low-level statistics of the features. Thus, L_TR and L_TG are designed to compare only the low-level statistics (i.e., texture) through Gram matrices, while excluding the high-level statistics present in the full features.
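
Concretely, the relation from [35] can be written as follows. The notation below (feature matrices with $M = HW$ columns and unnormalized Gram matrices) is ours and is intended only as a sketch of the argument. Let $F^a, F^b \in \mathbb{R}^{C \times M}$ be two feature maps with columns $x_1, \dots, x_M$ and $y_1, \dots, y_M$, and let $G^a = F^a (F^a)^\top$ and $G^b = F^b (F^b)^\top$. Then
$$\| G^a - G^b \|_F^2 = \sum_{i,j} (x_i^\top x_j)^2 + \sum_{i,j} (y_i^\top y_j)^2 - 2 \sum_{i,j} (x_i^\top y_j)^2 = M^2\, \mathrm{MMD}^2\big[\{x_i\}, \{y_j\}\big],$$
where $\mathrm{MMD}^2$ uses the biased estimator with the second-order polynomial kernel $k(x, y) = (x^\top y)^2$. Minimizing the Gram-matrix difference therefore matches the second-order (low-level) statistics of the per-location feature vectors, without constraining their spatial arrangement.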

D. Details of t-SNE Visualizations


In Figure 3, we demonstrate the significance of utilizing texture by presenting t-SNE [57] plots of the shape and texture features for the road, sidewalk, and terrain classes. From the Cityscapes [12] dataset, we select 500 random instances per class, each containing more than 5k pixels. For feature extraction, we utilize feature maps from the final layer of the SegFormer-B5 [60] model pre-trained on ImageNet. The shape features are derived using Canny edge detection [3], while the texture features are extracted via the Gram matrix.
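
A minimal sketch of this feature extraction is given below. The Canny thresholds, the instance cropping, and the t-SNE settings are assumptions; feat denotes the final-layer feature map produced by the SegFormer-B5 encoder.

import cv2
import numpy as np
from sklearn.manifold import TSNE

def shape_feature(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)            # Canny edge map (thresholds are assumptions)
    return edges.astype(np.float32).flatten()

def texture_feature(feat):
    # feat: final-layer feature map of shape (C, H, W) as a torch tensor
    C, H, W = feat.shape
    F = feat.reshape(C, H * W)
    gram = (F @ F.t()) / (C * H * W)             # Gram matrix as the texture descriptor
    return gram.flatten().cpu().numpy()

# shape_feats / tex_feats: lists collected over the 500 sampled instances per class
# shape_2d = TSNE(n_components=2).fit_transform(np.stack(shape_feats))
# tex_2d   = TSNE(n_components=2).fit_transform(np.stack(tex_feats))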

E. Experiment on Class Uniform Sampling

Case C B M S
1 w/ CUS 48.63 45.49 50.06 38.45
2 w/o CUS 47.58 44.88 48.80 39.35

Table S1. Experiment on Class Uniform Sampling (CUS) on TLDR. The model is trained on GTA [51] using ResNet-101 as the encoder
and evaluated on Cityscapes [12], BDD [63], Mapillary [44], and SYNTHIA [52]. The default setting is marked in gray .

Some existing DGSS methods [10, 31, 65] use the Class Uniform Sampling (CUS) technique [26] to alleviate the class imbalance problem. For a fair comparison with methods that do not use CUS, our default setting is trained without it. Table S1 reports the performance of TLDR with and without CUS. The model is trained on GTA [51] using ResNet-101 as the encoder. One can see that our TLDR achieves better results with CUS (cf. row 1) than without CUS (cf. row 2).
F. Hyperparameter Analysis
In the hyperparameter analysis, we use ResNet-101 as the encoder, train on GTA [51], and evaluate on Cityscapes [12], BDD [63], Mapillary [44], and SYNTHIA [52]. The best results are highlighted.

α_orig α_styl C B M S
1 0.1 0.9 46.58 43.98 49.09 38.04
2 0.3 0.7 47.41 43.71 47.09 38.57
3 0.5 0.5 47.58 44.88 48.80 39.35
4 0.7 0.3 47.05 44.11 47.01 38.57
5 0.9 0.1 44.25 43.11 40.33 39.97

Table S2. Hyperparameter analysis on the original and stylized task loss weights. The model is trained on GTA using ResNet-101 as the
encoder and evaluated on Cityscapes, BDD, Mapillary, and SYNTHIA. The default setting is marked in gray .

Task loss weights. We analyze how performance changes as the task loss weights α_orig and α_styl vary. Table S2 shows the results. The best performance is achieved when α_orig and α_styl are both 0.5 (cf. row 3), presumably because this setting balances the objectives for texture and shape. Additionally, setting α_orig to 0.1 still yields relatively good performance (cf. row 1), whereas setting α_styl to 0.1 significantly degrades performance (cf. row 5). We attribute this to shape providing the primary prediction cue, with texture serving as a complementary cue.

Case C B M S
1 u_l = 5 × 10^{-l-3} 47.32 44.60 44.99 39.52
2 u_l = 5 × 10^{-l-2} 47.58 44.88 48.80 39.35
3 u_l = 5 × 10^{-l-1} 46.41 44.49 44.60 39.33
4 v_l = 5 × 10^{-l-3} 46.56 44.66 45.92 40.15
5 v_l = 5 × 10^{-l-2} 47.58 44.88 48.80 39.35
6 v_l = 5 × 10^{-l-1} 45.35 43.17 45.33 40.01

Table S3. Hyperparameter analysis on the texture regularization and generalization parameters. The model is trained on GTA using
ResNet-101 as the encoder and evaluated on Cityscapes, BDD, Mapillary, and SYNTHIA. The default setting is marked in gray .

Weighting factors. To examine the effects of the weighting factors u_l and v_l on the texture regularization and generalization losses, we conduct ablation experiments in which we vary their scale. As shown in Table S3, we scale the weighting factors down by a factor of 10 (cf. rows 1 and 4) and up by a factor of 10 (cf. rows 3 and 6) relative to the default setting (cf. rows 2 and 5). The results indicate that the default setting achieves the best performance.
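
For reference, the default per-layer weighting factors from Table S3 could be written as follows. Whether l is 0- or 1-indexed and the number of layers L are assumptions here.

# Default per-layer weighting factors (cf. Table S3); indexing and L are assumed.
L = 4  # assumed number of encoder layers used for the texture losses
u = [5 * 10 ** (-l - 2) for l in range(1, L + 1)]  # texture regularization weights u_l
v = [5 * 10 ** (-l - 2) for l in range(1, L + 1)]  # texture generalization weights v_l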

G. Experiment on Multi-source Setting

Train on G+S
Methods C B M
RobustNet [10] 37.69 34.09 38.49
SHADE [65] 47.43 40.30 47.60
TLDR (ours) 48.83 42.58 47.80

Table S4. Experimental results on a multi-source setting. The model is trained on GTA and SYNTHIA using ResNet-50 as the encoder
and evaluated on Cityscapes, BDD, and Mapillary. Our method is marked in gray .

We compare our method against other DGSS methods [10, 65] in a multi-source setting, training on GTA [51] and SYNTHIA [52] and evaluating on Cityscapes [12], BDD [63], and Mapillary [44]. As shown in Table S4, our method consistently achieves superior performance across all benchmarks.
H. Pseudocode and Source Code Implementation
Algorithm 1 provides PyTorch-style pseudocode for the proposed TLDR. The pseudocode covers the computation of the original task loss, the stylized task loss, the texture regularization loss, and the texture generalization loss for one training iteration. For further implementation details of TLDR, please refer to the source code, which is provided at https://github.com/ssssshwan/TLDR. A detailed description of the source code is given in the accompanying README.md file.
Algorithm 1: PyTorch-style pseudocode for one training iteration of TLDR.

# STM          : Style Transfer Module
# n, CE        : L2 matrix norm, Cross-Entropy loss
# f_T, f_I     : task / ImageNet-pretrained encoders returning the list of L feature maps
# h_T          : semantic segmentation decoder
# L, threshold : total number of layers, Random Style Masking threshold
# g            : Gram-matrix from a feature map

def g(F):
    B, C, H, W = F.size()
    F = F.view(B, C, H * W)
    G = torch.bmm(F, F.transpose(1, 2))
    return G.div(C * H * W)

x_s, y = source_loader()                       # source image and label
x_r = style_loader()                           # random style image
x_sr = STM(x_s, x_r)                           # stylized source image

p_s, p_sr = h_T(f_T(x_s)), h_T(f_T(x_sr))      # inference on x_s and x_sr
L_orig, L_styl = CE(p_s, y), CE(p_sr, y)       # task losses

L_tr, L_tg = 0, 0
for l in range(L):                             # iterate over the L layers
    # texture regularization loss in layer l
    G_I_s, G_T_s = g(f_I(x_s)[l]).detach(), g(f_T(x_s)[l])
    L_tr += n(G_I_s - G_T_s)

    # texture generalization loss in layer l with Random Style Masking
    G_T_r, G_T_sr = g(f_T(x_r)[l]), g(f_T(x_sr)[l])
    diff = G_T_sr - G_T_s
    mask = diff > threshold                    # Random Style Masking
    L_tg += n((G_T_r - G_T_sr) * mask)
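
The pseudocode above omits how the four losses are combined into the final objective. A minimal sketch is given below, assuming the task losses are weighted by α_orig and α_styl (cf. Table S2), the per-layer factors u_l and v_l (cf. Table S3) are applied inside the loop, and the LDF linearly decays the texture terms; the exact combination is defined in the source code, so this is only an illustration.

# Minimal sketch of combining the losses (assumptions noted above, not the exact TLDR objective).
def total_loss(L_orig, L_styl, L_tr, L_tg, cur_iter, max_iter,
               alpha_orig=0.5, alpha_styl=0.5):
    ldf = 1.0 - cur_iter / max_iter            # assumed Linear Decay Factor
    return alpha_orig * L_orig + alpha_styl * L_styl + ldf * (L_tr + L_tg)

loss = total_loss(L_orig, L_styl, L_tr, L_tg, cur_iter, max_iter)
loss.backward()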
I. More Qualitative Results
This section compares qualitative results of existing DGSS methods [10, 64] and our proposed TLDR using the ResNet-50 [25] encoder. We obtain qualitative results in two settings: first, Cityscapes [12] to RainCityscapes [26] (Figure S3) and Foggy Cityscapes [53] (Figure S4); second, GTA [51] to Cityscapes [12] (Figure S5), BDD [63] (Figure S6), Mapillary [44] (Figure S7), and SYNTHIA [52] (Figure S8). Our TLDR demonstrates superior results compared to the existing DGSS methods across these various domains.

Each figure shows, from left to right, the unseen input image, DRPC, RobustNet, TLDR (ours), and the ground truth.

Figure S3. Qualitative results of DGSS methods [10, 64] and our TLDR on Cityscapes→RainCityscapes.

Figure S4. Qualitative results of DGSS methods [10, 64] and our TLDR on Cityscapes→Foggy Cityscapes.

Figure S5. Qualitative results of DGSS methods [10, 64] and our TLDR on GTA→Cityscapes.

Figure S6. Qualitative results of DGSS methods [10, 64] and our TLDR on GTA→BDD.

Figure S7. Qualitative results of DGSS methods [10, 64] and our TLDR on GTA→Mapillary.

Figure S8. Qualitative results of DGSS methods [10, 64] and our TLDR on GTA→SYNTHIA.
