SDPT Semantic-Aware Dimension-Pooling Transformer For Image Segmentation
Abstract— Image segmentation plays a critical role in autonomous driving by providing vehicles with a detailed and accurate understanding of their surroundings. Transformers have recently shown encouraging results in image segmentation. However, it is challenging for transformer-based models to strike a good balance between performance and efficiency. The computational complexity of transformer-based models is quadratic in the number of inputs, which severely hinders their application in dense prediction tasks. In this paper, we present the semantic-aware dimension-pooling transformer (SDPT) to mitigate the conflict between accuracy and efficiency. The proposed model comprises an efficient transformer encoder for generating hierarchical features and a semantic-balanced decoder for predicting semantic masks. In the encoder, a dimension-pooling mechanism is used in the multi-head self-attention (MHSA) to reduce the computational cost, and a parallel depth-wise convolution is used to capture local semantics. We further apply this dimension-pooling attention (DPA) to the decoder as a refinement module to integrate multi-level features. With such a simple yet powerful encoder-decoder framework, we empirically demonstrate that the proposed SDPT achieves excellent performance and efficiency on various popular benchmarks, including ADE20K, Cityscapes, and COCO-Stuff. For example, our SDPT achieves 48.6% mIOU on the ADE20K dataset, which outperforms the current methods with fewer computational costs. The codes can be found at https://github.com/HuCaoFighting/SDPT.
Index Terms— Image segmentation, vision transformer, dimension-pooling attention, semantic-balanced decoder, scene understanding.
I. INTRODUCTION
Image segmentation is a fundamental task in computer vision that involves partitioning an image into multiple regions or segments. Each segment typically represents a meaningful part of the image, such as objects, boundaries, or regions with similar properties. Early approaches often relied on simple methods such as thresholding, edge detection, and region growing [2]. These techniques were limited in their ability to handle complex images with varying lighting conditions, textures, and object orientations. Subsequently, region-based segmentation algorithms gained popularity. These methods divide an image into regions based on similarities in color, intensity, texture, or other low-level features. Region growing [3], watershed transformation [4], and mean-shift clustering [5] are examples of region-based segmentation techniques. While effective for certain types of images, region-based methods often struggle with handling noise, occlusions, and overlapping objects. Edge-based segmentation techniques focus on detecting boundaries or edges between different image regions. Edge detection algorithms, such as the Canny edge detector [6], Sobel operator [7], and Prewitt operator [8], identify abrupt changes in pixel intensity, which often correspond to object boundaries. While edge-based methods are sensitive to noise and may produce fragmented segmentations, they are useful for tasks like object detection and contour extraction. In recent years, the field of image segmentation has been revolutionized by the widespread adoption of machine learning techniques, particularly deep learning. Convolutional neural networks (CNNs) have emerged as powerful tools for learning feature representations directly from image data, enabling end-to-end segmentation pipelines [9]. Architectures like U-Net [10], FCN (Fully Convolutional Network) [11], and Mask R-CNN [12] have demonstrated state-of-the-art (SOTA) performance in tasks such as semantic segmentation and instance segmentation. As shown in Fig. 1, semantic segmentation assigns a class label to each pixel in an image, effectively partitioning the image into semantically meaningful regions. Instance segmentation, on the other hand, goes a step further by not only identifying object categories but also distinguishing individual object instances within the same category. These advanced segmentation tasks have numerous applications in autonomous driving [13], [14], medical imaging [15], video surveillance [16], remote sensing [17], augmented reality [18], robotic perception [19], aerial semantic segmentation [20], [21], and more.
Image segmentation is an important technology in the field of autonomous driving, enabling vehicles to perceive and understand their surroundings with great precision [22], [23].
Manuscript received 27 May 2023; revised 6 May 2024; accepted 28 May 2024. Date of publication 3 July 2024; date of current version 1 November 2024. This work was supported in part by the National Natural Science Foundation of China under Grant 62372329, in part by Shanghai Scientific Innovation Foundation under Grant 23DZ1203400, in part by Shanghai Rising Star Program under Grant 21QC1400900, in part by Tongji–Qomolo Autonomous Driving Commercial Vehicle Joint Laboratory Project, and in part by Xiaomi Young Talents Program. The Associate Editor for this article was V. Chamola. (Corresponding author: Guang Chen.)
Hu Cao and Alois Knoll are with the Chair of Robotics, Artificial Intelligence and Real-Time Systems, Technical University of Munich, 80333 Munich, Germany.
Guang Chen is with the Department of Computer Science and Technology, Tongji University, Shanghai 200070, China (e-mail: guangchen@tongji.edu.cn).
Hengshuang Zhao is with the Department of Computer Science, The University of Hong Kong, Hong Kong.
Dongsheng Jiang, Xiaopeng Zhang, and Qi Tian are with Huawei Technologies, Shanghai 200122, China.
Digital Object Identifier 10.1109/TITS.2024.3417813
© 2024 The Authors. This work is licensed under a Creative Commons Attribution 4.0 License.
For more information, see https://creativecommons.org/licenses/by/4.0/
CAO et al.: SEMANTIC-AWARE DIMENSION-POOLING TRANSFORMER FOR IMAGE SEGMENTATION 15935
Fig. 1. An overview of image segmentation reveals its diverse types, such as semantic segmentation and instance segmentation. These techniques are integral
to a wide range of applications, including autonomous driving, medical imaging, video surveillance, remote sensing, augmented reality, and beyond. Images
are selected from the Cityscapes [1] dataset.
This fine-grained segmentation allows self-driving cars to gain a detailed understanding of their environment, effectively distinguishing key elements such as roads, pedestrians, traffic signs, and vehicles [24]. The accurate identification and differentiation provided by image segmentation enable self-driving cars to make informed decisions, navigate through complex situations, and ultimately improve road safety. With its ability to provide high-resolution environmental understanding, image segmentation plays a key role in the development and implementation of self-driving cars, promising a future where transportation is more efficient, reliable, and safe [25].
CNNs with intrinsic inductive bias serve as the prevalent backbone in image segmentation [11], [26], [27], [28]. However, while CNNs are good at modeling local visual features (e.g., edges and corners), they are not well suited to modeling long-range dependencies. Transformers, the de-facto dominant model in natural language processing (NLP) studies, have been introduced to vision tasks [29], [30], [31]. The key idea behind the transformer is its strong ability to model long-range dependencies through the self-attention mechanism [32]. Benefiting from the transformer's powerful representation learning capabilities, researchers have achieved excellent results in image classification [29], [30], [33], object detection [34], [35], [36], and image segmentation [15], [37], [38], [39].
In segmentation tasks, the performance gains achieved by transformer-based models compared to CNN-based approaches are mainly attributed to their powerful backbone networks as encoders. The main weakness of transformer-based encoders is that their self-attention mechanism has higher time and memory costs than convolutional operations. The complexity of the transformer-based model is $O(w^2 h^2)$ for $w \times h$ inputs, which severely limits its application in dense prediction tasks (e.g., semantic segmentation). To alleviate this limitation, Swin Transformer [31] adopts window-based self-attention to reduce the computation cost. Restricting attention computation to local windows is efficient, but it compromises the transformer's ability to model long-range dependencies. PVT [35], SegFormer [37], and PVTv2 [36] down-sample the spatial structure of key and value to reduce the computational complexity. Moreover, P2T [40] uses pyramid pooling to reduce the computational cost while capturing powerful contextual features. However, the spatial reduction strategy uses down-sampling to improve the computational efficiency at the cost of the spatial structure of the feature map. In this work, we introduce a dimension-pooling mechanism to improve the efficiency of self-attention while preserving the feature map's spatial structure. Based on the dimension-pooling attention (DPA), we design a novel hierarchical transformer encoder to generate multi-scale features.
For segmentation tasks, an effective decoder is important to capture the high-level semantics. Four representative designs of decoder structures are: (i) the output of the encoder is directly fed into a heavy decoder, such as ASPP [27], PSP [41], and DANet [42]; (ii) a symmetric decoder is used to up-sample the features from the encoder, such as U-Net [10] and V-Net [43]; (iii) a simple MLP-based decoder, such as SegFormer [37]; and (iv) a transformer-based decoder is used to model global context, such as Segmenter [44] and MaskFormer [45]. Despite the excellent performance achieved by these methods, two key challenges remain when aggregating multi-level features of the encoder; namely, how to maintain semantic consistency within the same level of features and how to bridge the context across different levels of features. Deep high-level features contain more abstract semantic information, while shallow low-level features provide more content descriptions [46]. Previous work, such as FPN [47], PANet [48], and SETR [38], utilized lateral connections for feature interaction. These methods demonstrate that high-level features and low-level features are complementary, but they focus more on adjacent level features and less on other level features. Inspired
15936 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 25, NO. 11, NOVEMBER 2024
by [49], we use the balanced semantic features to strengthen the multi-level features. The DPA is deployed to refine the balanced semantic features to be more discriminative. Each level of features can then obtain equal information from the other levels of features, balancing the information flow. Combined with our hierarchical transformer encoder, a simple yet effective encoder-decoder framework is established.
Our segmentation model consists of a novel hierarchical transformer encoder and a semantic-balanced decoder, named SDPT. The proposed SDPT achieves the best trade-off between segmentation performance and efficiency compared to the previous transformer-based methods.
Our main contributions can be summarized as follows:
• In order to generate multi-scale features and improve efficiency, a novel hierarchical transformer encoder with dimension-pooling attention (DPA) is implemented.
• A semantic-balanced decoder is introduced to strengthen the multi-level features. In the decoder, DPA is further used as a refinement module to make the balanced features more discriminative.
• Our method outperforms current segmentation models on three publicly available semantic segmentation datasets (including ADE20K [50], Cityscapes [1], and COCO-Stuff [51]) with lower computational complexity.
II. RELATED WORK
A. Traditional Segmentation Methods
Classic approaches such as thresholding, edge detection, region-based segmentation, and clustering have laid the groundwork for subsequent research in computer vision. Early methods like global thresholding [52] and the Sobel operator [7] paved the way for more sophisticated techniques, including adaptive thresholding [53], Canny edge detection [6], region-growing [3], and clustering [14] algorithms. These methods, though simple in concept, have demonstrated effectiveness in segmenting images with well-defined features and distinct boundaries. Moreover, research efforts have extended to hybrid approaches that combine multiple segmentation techniques to achieve enhanced performance and robustness. While traditional segmentation methods excel in certain scenarios, their efficacy is often challenged by complex image structures, noise, and variability in lighting conditions.
B. CNN-Based Segmentation Methods
CNN-based segmentation methods have emerged as SOTA techniques for image segmentation tasks. Leveraging the power of deep learning, CNNs have revolutionized the field by learning hierarchical representations directly from raw image data [26]. Seminal works such as the FCN [11], U-Net [10], DeepLab series [27], [28], Strip pooling [54], ECANet [55], and Mask RCNN [12] have demonstrated remarkable performance in semantic segmentation, omnidirectional segmentation, and instance segmentation. These architectures leverage convolutional layers to capture spatial dependencies and learn feature representations at multiple scales, enabling accurate and efficient segmentation of complex scenes. Moreover, advancements in CNN architectures, such as the integration of skip connections [56], atrous convolutions [27], and attention mechanisms [20], [54], [55], [57], have further improved segmentation performance and robustness. In [57], the authors introduce a criss-cross attention module aimed at gathering contextual information from full-image dependencies in a more efficient and effective manner. The concept of strip pooling, proposed in [54], focuses on capturing long-range dependencies while retaining attention to local details. Additionally, the Horizontal Segment Attention (HSA) module, developed in ECANet [55], is designed to facilitate omnidirectional semantic segmentation. Recently, SegNeXt [58] was proposed based on a multi-scale convolutional attention (MSCA) module. Despite their success, CNN-based segmentation methods still face challenges related to data scarcity, overfitting, and generalization to diverse image domains.
C. Transformer-Based Segmentation Methods
ViT [29] is the first work to demonstrate that transformer-based methods can achieve comparable performance in vision tasks. DeiT [30] was proposed to facilitate training by bringing the idea of distilling knowledge from CNNs to Transformers. However, both ViT and DeiT are columnar architectures that maintain the same spatial scale across layers. Following the hierarchical structure of CNNs, multi-scale transformer models such as PVT [35], Swin [31], PVTv2 [36], Shunted Transformer [59], and P2T [40] were proposed to perform dense prediction tasks. Various CNN-based approaches [60], [61], [62] have been explored to address the real-time segmentation problem. Previously, researchers have used pre-trained ViTs as encoders to improve segmentation performance [38], [44], [63]. Compared to CNN-based methods, the computational complexity of transformer-based approaches grows quadratically with the number of input tokens. Swin Transformer [31] addresses this challenge by introducing window-based self-attention, reducing computational costs. However, confining attention computation to local windows, although efficient, compromises the model's ability to capture long-range dependencies. Techniques employed by PVT [35], PVTv2 [36], and SegFormer [37] involve downsampling the spatial structure of key and value to mitigate computational complexity. Additionally, P2T [40] employs pyramid pooling to lower computational costs while retaining powerful contextual features. However, spatial reduction strategies using downsampling enhance computational efficiency at the expense of spatial structure in the feature map. Recently, TopFormer [64], PIDNet [65], SeaFormer [66], AFFormer [67], and SCTNet [68] were developed for real-time semantic segmentation. SegViT [39] and VWFormer [69] were proposed to achieve excellent segmentation performance. Moreover, the authors of CMX [70] presented a multimodal fusion segmentation method based on the backbone of SegFormer [37]. In this work, we introduce a dimension-pooling attention (DPA) to improve the efficiency of the transformer-based encoder. We further demonstrate that applying our DPA to the encoder and decoder can achieve outstanding performance-efficiency trade-offs.
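To give a rough sense of the quadratic growth discussed above, the following sketch evaluates the two cost expressions that Section III states for standard MHSA (Eq. (2)) and for DPA (Eq. (6)). The function names, the channel width, and the token-grid sizes are illustrative assumptions, not settings taken from the paper.

```python
def mhsa_cost(h: int, w: int, c: int) -> int:
    """Cost of standard MHSA per Eq. (2): 4hwC^2 + 2(hw)^2 C."""
    n = h * w  # number of patch tokens
    return 4 * n * c ** 2 + 2 * n ** 2 * c


def dpa_cost(h: int, w: int, c: int) -> int:
    """Cost of DPA per Eq. (6): 2(h + w + hw)C^2 + 2(hw)(h + w)C."""
    n = h * w  # queries keep all hw tokens; K and V are pooled to h + w tokens
    return 2 * (h + w + n) * c ** 2 + 2 * n * (h + w) * c


if __name__ == "__main__":
    channels = 32  # hypothetical channel width, chosen only for illustration
    for side in (32, 64, 128):  # hypothetical square token grids
        full = mhsa_cost(side, side, channels)
        pooled = dpa_cost(side, side, channels)
        print(f"{side}x{side} tokens: MHSA/DPA cost ratio = {full / pooled:.1f}")
```

Under these expressions the ratio widens as the token grid grows, because the attention term scales with (hw)^2 for MHSA but with hw(h + w) for DPA.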
Fig. 2. Left: the overall architecture of our hierarchical transformer encoder. The input is passed through a hierarchical transformer encoder to generate
multi-scale features. This encoder is structured into four stages. Within each stage, a convolutional stem and overlapped patch merging layer down-sample
features, while a transformer block conducts representation learning. Right: details of the proposed transformer block. It consists of three main components:
layer normalization, dimension-pooling attention (DPA), and a feed-forward network (FFN). The DPA efficiently captures long-range relationships between
input tokens, while the expanded FFN is employed to learn wider representations.
III. METHOD
In this section, we introduce the framework of the proposed
SDPT. Following the popular encoder-decoder architecture,
the SDPT consists of an efficient transformer encoder and a
semantic-balanced decoder. The encoder aims to extract multi-
scale features, and the decoder aggregates these multi-level
features to perform semantic mask prediction.
A. Encoder
The encoder comprises four stages for generating multi-
scale features, as shown in Fig. 2. We use a convolutional stem
and overlapped patch merging layers to down-sample features
and a transformer block to perform representation learning.
In the following, we elaborate on each module in detail.
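As a minimal sketch of the four-stage layout, the snippet below computes the feature-map size of each stage using the rule given later in this section (F_i has size W/2^(i+1) × H/2^(i+1) × C_i). The function name and the channel widths are hypothetical placeholders, not the values from Table I, and integer division is an assumption for inputs that are not multiples of 32.

```python
def stage_shapes(width: int, height: int, channels: tuple) -> list:
    """Return (W_i, H_i, C_i) for stages i = 1..len(channels).

    Stage i down-samples the input by a factor of 2 ** (i + 1),
    i.e. strides of 4, 8, 16, and 32 for a four-stage encoder.
    """
    shapes = []
    for i, c in enumerate(channels, start=1):
        stride = 2 ** (i + 1)
        shapes.append((width // stride, height // stride, c))
    return shapes


if __name__ == "__main__":
    # Hypothetical channel widths, used only to make the example concrete.
    for stage, shape in enumerate(stage_shapes(2048, 1024, (32, 64, 160, 256)), 1):
        print(f"F{stage}: {shape}")
```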
1) Transformer Block: As shown in Fig. 2 (right), the proposed transformer block contains a layer norm, a dimension-pooling attention (DPA), and a feed-forward network (FFN). The DPA is used to efficiently model long-range relationships between the input tokens, and the expanded FFN is utilized to learn wider representations.
A vanilla transformer [29] builds long-range dependencies through multi-head self-attention (MHSA), which can be formulated as follows:
$$\mathrm{Attention} = \mathrm{SoftMax}\left(\frac{Q K^{T}}{\sqrt{D_h}}\right) V \tag{1}$$
where $Q$, $K$, and $V$ are query, key, and value tensors, respectively. $D_h$ denotes the head dimension. The computational complexity of the original MHSA module on an image of $h \times w$ patch tokens is:
$$\Omega(\mathrm{MHSA}) = 4hwC^2 + 2(hw)^2 C \tag{2}$$
Low computational complexity is crucial for the semantic segmentation task. However, the cost of MHSA is quadratic in the number of patch tokens ($O(h^2 w^2)$), which severely limits its application in semantic segmentation. The Swin Transformer [31] restricts self-attention within the local window, and PVT [35], PVTv2 [36], and SegFormer [37] down-sample the spatial structure of $K$ and $V$ to improve the efficiency of MHSA. Unlike the previous works, we use a dimension-pooling mechanism to reduce the computational cost of MHSA. To shorten the input sequence while preserving spatial structure, we employ global average pooling to shrink the input tokens of $K$ and $V$ from the size of $HW$ to the size of $H + W$, as shown in Fig. 3.
Fig. 3. The structure of dimension-pooling attention (DPA). To improve the efficiency of the attention operation, the input tokens of K and V are pooled from the size of HW to the size of H + W. A parallel depth-wise convolution is further deployed on the query tensors Q to provide high-frequency local information.
2) Dimension-Pooling Attention (DPA): The inputs are encoded along the horizontal and lateral directions by two spatial extents with pooling kernels $(H, 1)$ and $(1, W)$. The encoded horizontal average tensor $H_n$ and lateral average tensor $W_n$ are then concatenated together for the attention operation. The whole calculation process can be expressed as follows:
$$H_n = \frac{1}{W} \sum_{0 \le i < W} x_n(h, i), \quad n \in (K, V)$$
$$W_n = \frac{1}{H} \sum_{0 \le j < H} x_n(j, w), \quad n \in (K, V)$$
$$P_n = \mathrm{Concat}(H_n, W_n), \quad n \in (K, V) \tag{3}$$
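The pooling in Eq. (3) can be illustrated with a small dependency-free sketch. It assumes a feature map stored as a nested list of shape [H][W][C]; the function name `dimension_pool` is hypothetical, and the code is only meant to show how HW tokens collapse to H + W tokens, not to reproduce the authors' implementation.

```python
def dimension_pool(x):
    """Pool an [H][W][C] feature map into H + W tokens, following Eq. (3).

    The first H tokens are per-row averages over the width (H_n);
    the next W tokens are per-column averages over the height (W_n).
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    # H_n: for each row h, average over all W positions in that row.
    row_tokens = [
        [sum(x[h][i][c] for i in range(width)) / width for c in range(channels)]
        for h in range(height)
    ]
    # W_n: for each column w, average over all H positions in that column.
    col_tokens = [
        [sum(x[j][w][c] for j in range(height)) / height for c in range(channels)]
        for w in range(width)
    ]
    # P_n = Concat(H_n, W_n): a sequence of H + W tokens instead of H * W.
    return row_tokens + col_tokens


if __name__ == "__main__":
    feature_map = [[[1.0], [2.0], [3.0]],
                   [[4.0], [5.0], [6.0]]]  # H = 2, W = 3, C = 1
    pooled = dimension_pool(feature_map)
    print(len(pooled), "tokens instead of", 2 * 3)
```

In this toy case the key/value sequence shrinks from 6 tokens to 5; for realistic feature maps the gap between HW and H + W is far larger.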
Fig. 4. Left: the overall architecture of our SDPT. The features extracted from stages 2, 3, and 4 are fed into the decoder head for semantic mask prediction.
Right: details of our decoder. The proposed DPA is further utilized in the decoder to refine the features.
In Eq. (3), $x_n$ and $P_n$ are the input features and the concatenated pooled tensors, respectively. Through this transformation, the global context can be captured along each spatial direction while improving computational efficiency. Finally, the DPA is computed as follows:
$$\mathrm{Attention}(Q, P_K, P_V) = \mathrm{SoftMax}\left(\frac{Q P_K^{T}}{\sqrt{D_h}}\right) P_V \tag{4}$$
Furthermore, we use a parallel depth-wise convolution on the query tensors $Q$ to provide high-frequency local information to compensate for the information loss caused by pooling operations. The final output of the attention module, $x_{att}$, is as follows:
$$x_{att} = \mathrm{MLP}(\mathrm{Mul}(\mathrm{Attention}(Q, P_K, P_V), \mathrm{DWConv}(Q))) \tag{5}$$
The computational complexity of our DPA is expressed as follows (only for the attention operation):
$$\Omega(\mathrm{DPA}) = 2(h + w + hw)C^2 + 2(hw)(h + w)C \tag{6}$$
3) FFN: Similar to [37] and [40], we add a depth-wise convolution with a kernel size of $3 \times 3$ and a padding size of 1 between the first MLP layer and the GELU non-linear activation in the feed-forward network (FFN). The computation can be defined using the following formula:
$$x_1 = \mathrm{MLP}_1(x_{att})$$
$$x_2 = \mathrm{DWConv}(x_1) + x_1$$
$$x_{out} = \mathrm{MLP}_2(\mathrm{GELU}(x_2)) + x_{att} \tag{7}$$
where $x_{att}$ is the feature from the DPA module, $x_1$ is the output of the first MLP layer, $x_2$ is the value of $x_1$ after the convolution operation and residual connection, and $x_{out}$ is the output of the second MLP layer after the residual connection.
4) Convolutional Stem: In the first stage, the convolutional stem is used to transform an input of size $H \times W \times 3$ into patch tokens of size $4 \times 4$. Unlike [29], [31], and [35], which directly use a convolutional layer with a large kernel size to split the input into non-overlapped patch tokens, we employ four consecutive convolutional blocks with a kernel size of $3 \times 3$ to form a convolutional stem. The reason for this design is that an early convolutional stem makes transformers more robust and improves their performance, which has been demonstrated in [71]. Moreover, compared to the convolutional layer with a large kernel size used in SegFormer [37], sequential convolution with a small kernel size can reduce the parameters without compromising the receptive field. Each convolutional block is made up of a convolution with a kernel size of $3 \times 3$, a BatchNorm (BN) layer, and the GELU non-linearity in between each convolutional layer.
5) Overlapped Patch Merging: In the next three stages, we use a one-layer $3 \times 3$ convolution with a stride of 2 and a padding of 1 to perform overlapped patch merging and produce CNN-like multi-scale feature maps. Take the second stage as an example. The feature maps are shrunk from $F_1$ ($\frac{W}{4} \times \frac{H}{4} \times C_1$) to $F_2$ ($\frac{W}{8} \times \frac{H}{8} \times C_2$) while preserving the local continuity, and the remaining multi-scale feature maps are generated in the same manner.
6) Model Details: Following previous backbone designs [31], [35], [36], [37], [56], we built our encoder in four stages. As the network deepens, the resolution of the feature maps decreases, whereas their channel dimension increases. Hence, the encoder generates four feature maps: $F_1$, $F_2$, $F_3$, and $F_4$. $F_i$ has a dimension of $\frac{W}{2^{i+1}} \times \frac{H}{2^{i+1}} \times C_i$, where $i \in \{1, 2, 3, 4\}$. In total, we devised three encoder models with different sizes, named SDPT-Tiny, SDPT-Small, and SDPT-Base. The detailed network settings are listed in Table I.
B. Decoder
In segmentation models, a decoder is deployed on the encoder to capture high-level semantics. Current methods usually utilize convolution, MLP, or attention-based modules as decoders to integrate multi-level features, such as SETR [38], SegFormer [37], MaskFormer [45], and SegNeXt [58]. Different from these methods, we introduce a decoder to strengthen multi-level features using the same balanced semantic features. The detailed structure is depicted in Fig. 4. The features from Stage 1 consume more computational resources but bring little performance improvement due to too much low-level information and higher resolution. Therefore, we aggregate the features from the last three stages of the encoder. First,
TABLE I
DETAILED SETTINGS OF DIFFERENT SIZES OF THE PROPOSED SDPT. C DENOTES THE OUTPUT CHANNEL NUMBER, E REPRESENTS THE EXPANSION RATIO IN FFN, AND L INDICATES THE NUMBER OF TRANSFORMER BLOCKS. 'PARAMETERS' ARE CALCULATED ON THE ADE20K DATASET [50]
multi-level features $F_i$ from the encoder are fed into an MLP layer to unify the channel dimension. Then, these unified features are up-sampled to $\frac{H}{8} \times \frac{W}{8}$ and integrated together. The balanced semantic features $F$ are obtained by the averaging operation. We further utilize our dimension-pooling attention (DPA) to refine the balanced semantic features to be more discriminative. The refined features are then used to strengthen the original features through residual connections. In this manner, each level of features gets equal information from the others. The computation can be defined as follows:
$$F_i' = \mathrm{MLP}(C_i, C)(F_i), \quad i \in (2, 3, 4)$$
$$F_i' = \mathrm{Upsample}\left(\frac{W}{8} \times \frac{H}{8}\right)(F_i'), \quad i \in (2, 3, 4)$$
$$F = \frac{1}{N} \sum_{i=2}^{4} F_i', \quad i \in (2, 3, 4)$$
$$F = \mathrm{DPA}(F)$$
$$Y_i = F_i' + F, \quad i \in (2, 3, 4) \tag{8}$$
where $\mathrm{MLP}(C_{in}, C_{out})$ denotes an MLP layer with $C_{in}$ and $C_{out}$ as input and output vector dimensions, respectively. $N$ represents the number of multi-level features. By fusing these balanced semantic features, the final output of the decoder is expressed as follows:
$$Y = \mathrm{MLP}(3C, C)(\mathrm{Concat}(Y_i)), \quad i \in (2, 3, 4)$$
$$M = \mathrm{MLP}(C, N_{cls})(Y) \tag{9}$$
where $N_{cls}$ and $M$ denote the number of categories and the predicted semantic mask, respectively.
IV. EXPERIMENTAL SETTINGS
A. Datasets
Following previous methods, we pre-train each encoder variant using the ImageNet-1K dataset [72]. ImageNet-1K is a popular dataset with 1000 categories for image classification. For semantic segmentation, we use three publicly available datasets to evaluate our SDPT, including ADE20K [50], Cityscapes [1], and COCO-Stuff [51]. ADE20K is a challenging scene-parsing dataset covering 150 semantic classes. In this dataset, there are 20210 images for training, 2000 images for validation, and 3352 images for testing. Cityscapes is a driving dataset that contains 5000 high-resolution images in 19 categories. It consists of 2975 images in the training set, 500 images in the validation set, and 1525 images in the test set. The COCO-Stuff dataset includes 164k images with 172 semantic categories; 118k of these images are utilized for training, 5k for validation, 20k for test development, and 20k for test challenge.
B. Implementation Details
We implement our models based on the PyTorch [73], Timm [74], and MMSegmentation [75] libraries. All encoder variants are pre-trained on the ImageNet-1K dataset [72], and the decoder is randomly initialized. We train our models on a node with 8 Tesla V100 GPUs. For pre-training, we adopt the same training hyperparameters (e.g., data augmentation, learning rate, and regularization) used in DeiT [30]. For semantic segmentation experiments, we use random scaling (0.5–2.0), random horizontal flipping, and random cropping as data augmentation methods. AdamW [76] is used as the default optimizer. The batch size is set to 16 for ADE20K and COCO-Stuff, and 8 for Cityscapes. The learning rate is initialized as 0.00006 and the poly learning rate decay policy is applied during training. We train the models for 160K iterations on ADE20K and Cityscapes and 80K iterations on COCO-Stuff. We report the mean Intersection over Union (mIOU) to
TABLE II
TRAINING DETAILS FOR THE THREE DATASETS
Fig. 5. The visualization showcases prediction segmentation results using the Cityscapes dataset [1]. The original images are displayed in the first row, while
ground truths (GT) are depicted in the second row. The predicted results are presented in the last row.
TABLE VI results by using our trained model. The results show that our
C OMPARISON W ITH SOTA M ETHODS ON THE C ITYSCAPES DATASET [1]. method can generate excellent instance-predictions.
‘FLOP S (G)’ I S T ESTED U NDER THE I NPUT S IZE OF 2048 × 1024.
† D ENOTES M ODELS P RE -T RAINED ON I MAGE N ET-22K
6) Results on Mask Predictions: In Fig. 5, we visualize the
competitive segmentation results selected from the Cityscapes
dataset [1]. The original images are showcased in the first row,
followed by the ground truths (GT) depicted in the second row.
The predicted results using the proposed model are presented
in the last row. It is clear that the proposed SDPT can produce
satisfactory segmentation results by single-scale inference.
C. Ablation Study
We conduct ablation experiments on the ADE20K
dataset [50]. SDPT-Tiny is used as the baseline model, and
‘FLOPs (G)’ is tested under the input size of 512 × 512.
SDPT-Tiny, SDPT-Small, and SDPT-Base achieve real-time performance with 63.8 FPS, 46.8 FPS, and 32.2 FPS on the ADE20K dataset, respectively. The results demonstrate that the proposed method can strike a balance between segmentation performance and efficiency.
4) Comparison With CNN-Based Methods: We compare our SDPT with SOTA CNN-based methods such as FCN [11], EncNet [83], PSPNet [41], CCNet [57], DeepLabV3+ [28], OCRNet [84], and SegNeXt [58] on the ADE20K dataset [50]. The results are summarized in Table VII. Our SDPT surpasses the popular OCRNet (HRNet) [84] and SegNeXt [58]. In addition, our method is significantly faster (FPS) than the majority of CNN-based competitors.
5) Instance Segmentation: To further demonstrate the effectiveness of our SDPT backbone, we conducted experiments on the MS COCO dataset [85] for the instance segmentation task. We integrated the proposed SDPT backbone into the Mask R-CNN [12] framework to conduct the experiments. The corresponding results are summarized in Table VIII. Our SDPT outperforms other methods including ViL [86], PVT [35], PVTv2 [36], Swin [31], Twins [87], and P2T [40]. As presented in Fig. 6, we visualize the instance segmentation results.
1) Effect of Our DPA: We investigate the effect of introducing dimension-pooling attention (DPA) by replacing the attention module with different attention mechanisms in the encoder. The results are summarized in Table IX. Our DPA outperforms the linear spatial reduction attention (linear SRA) used in PVTv2 [36] and the convolutional spatial reduction attention (SRA) used in SegFormer [37]. Compared with the original self-attention [32], the proposed DPA achieves comparable performance with significantly lower computational complexity.
2) Importance of Convolutional Stem in Early Stage: To explore the influence of convolutions in the early stages, we compare our convolutional stem with the non-overlapping patch embedding in ViT [29] and the overlapping patch embedding in SegFormer [37]. As shown in Table X, an early convolutional stem helps transformers learn better, which is consistent with [71]. This design brings a significant performance improvement.
3) Effect of Each Component in DPA: We investigate the effect of each component in dimension-pooling attention (DPA) by replacing the attention module with different configurations in the encoder. The results are summarized in Table XI. Using only the convolution branch or only the attention branch yields relatively poor performance, while combining attention with lightweight depth-wise convolution (DWConv) leads to performance gains. This is because DWConv provides local detail information, which facilitates image classification and downstream tasks (semantic segmentation).
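The complexity gap between the original self-attention and an attention with pooled keys and values can be illustrated with a back-of-the-envelope count. The sketch below is not the paper's implementation: it assumes, purely for illustration, that pooling along each spatial dimension leaves H + W key/value tokens, and it counts only the multiply-accumulates of the score and aggregation products (projections, softmax, and the convolution branch are ignored). The function names, the 128 × 128 feature map, and the channel width of 64 are hypothetical.

```python
def attention_macs(num_queries: int, num_keys: int, dim: int) -> int:
    """Multiply-accumulates for Q @ K^T plus the attention-weighted sum of V."""
    return 2 * num_queries * num_keys * dim


def compare_costs(height: int, width: int, dim: int) -> dict:
    """Compare full self-attention with attention over pooled keys/values."""
    num_tokens = height * width
    # Full self-attention: every token attends to every other token.
    full = attention_macs(num_tokens, num_tokens, dim)
    # Assumption (illustrative only): pooling along each spatial dimension
    # leaves height + width key/value tokens.
    pooled = attention_macs(num_tokens, height + width, dim)
    return {"full": full, "pooled": pooled, "ratio": full / pooled}


# Hypothetical setting: a 128 x 128 feature map with 64 channels.
costs = compare_costs(height=128, width=128, dim=64)
print(costs)
```

Under this assumption the reduction factor equals HW / (H + W), so the saving grows with the spatial resolution, which is why such designs matter most for dense prediction.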
15942 IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 25, NO. 11, NOVEMBER 2024
TABLE VII
Comparison with SOTA real-time methods and CNN-based methods on the ADE20K dataset [50]. ‘FLOPs (G)’ is tested under the input size of 512 × 512.
TABLE VIII
Comparison with instance segmentation results of other SOTA methods on the MS COCO dataset [85]. ‘FLOPs (G)’ is tested under the input size of 800 × 1280.
TABLE IX
The performance of different attention mechanisms in the encoder. We report top-1 accuracy and mIoU on the ImageNet-1K [72] and ADE20K datasets [50], respectively. ‘FLOPs (G)’ is tested under the input size of 224 × 224 and 512 × 512, respectively.
Compared to fusing with the add operation, multiplying features from convolution and attention branches performs better.
4) Influence of Decoder Structure: We compare different decoder structures for semantic segmentation. Specifically, we configured our SDPT-Tiny backbone with a pure MLP-based decoder [37], ASPP [27], a lightweight Hamburger decoder [58], and our decoder variants ((a) and (b), see Fig. 7a and Fig. 7b) for segmentation experiments on the ADE20K dataset [50]. The results are listed in Table XII. It can be seen that SDPT (b) achieves the best performance compared to other decoder structures.
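The two fusion variants compared in the DPA component ablation, (a) add and (b) multiply, amount to different element-wise combinations of the two branch outputs. The following minimal sketch uses flat Python lists in place of feature maps; the function name `fuse_branches` and the toy values are hypothetical, and it does not reproduce the actual attention or DWConv branches.

```python
from typing import List


def fuse_branches(attn: List[float], conv: List[float], mode: str = "multiply") -> List[float]:
    """Element-wise fusion of attention-branch and convolution-branch outputs."""
    if len(attn) != len(conv):
        raise ValueError("branch outputs must have the same length")
    if mode == "add":
        # Variant (a): the branches contribute independently.
        return [a + c for a, c in zip(attn, conv)]
    if mode == "multiply":
        # Variant (b): one branch modulates (gates) the other.
        return [a * c for a, c in zip(attn, conv)]
    raise ValueError(f"unknown fusion mode: {mode}")


# Toy branch outputs standing in for flattened feature maps.
attn_out = [0.5, 1.0, 2.0]
conv_out = [2.0, 2.0, 0.5]

print(fuse_branches(attn_out, conv_out, mode="add"))       # [2.5, 3.0, 2.5]
print(fuse_branches(attn_out, conv_out, mode="multiply"))  # [1.0, 2.0, 1.0]
```

With multiplication, a small response in either branch suppresses the fused value, whereas addition lets each branch pass through regardless of the other; the ablation reports that the multiplicative variant performs better.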
CAO et al.: SEMANTIC-AWARE DIMENSION-POOLING TRANSFORMER FOR IMAGE SEGMENTATION 15943
TABLE X
The performance of different patch embedding methods on the ADE20K dataset [50].
TABLE XI
Ablation studies on each component of DPA on the ADE20K dataset [50]. (a) and (b) denote that Attn and Conv are fused by using the add and multiply operations, respectively.
TABLE XIII
The performance of different refinement modules on the ADE20K dataset [50].
Fig. 6. The visualization of instance segmentation results. The results are generated by using the proposed model on the MS COCO dataset [85].
5) DPA for Decoder: We evaluate the impact of the different refinement modules in the decoder. As shown in Table XIII, our DPA can significantly reduce the computational cost
VI. CONCLUSION
In this paper, we present a simple yet effective encoder-decoder architecture based on dimension-pooling transformers, named SDPT. An efficient transformer encoder is designed to generate multi-scale features, and a semantic-balanced decoder is introduced to integrate multi-level features for predicting semantic masks. The highlight is that our SDPT performs better than current methods with less computational complexity, leading to a better trade-off between accuracy and speed. The experimental results on the ImageNet-1k and MS COCO datasets show that using the proposed SDPT as a backbone can achieve excellent performance. Extensive experiments on the ADE20K, Cityscapes, and COCO-Stuff datasets have demonstrated the effectiveness of our SDPT. The limitation is that, despite having only 3.6 million parameters, it is unclear whether our SDPT-Tiny will work well on chip-based edge devices. Furthermore, it would also be interesting to study how our approach extends to large-scale models and other vision tasks.
REFERENCES
[1] M. Cordts et al., “The cityscapes dataset for semantic urban scene understanding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 3213–3223.
[2] G. Phonsa and K. Manu, “A survey: Image segmentation techniques,” in Harmony Search and Nature Inspired Optimization Algorithms: Theory and Applications. Singapore: Springer, 2019, pp. 1123–1140.
[3] S. A. Hojjatoleslami and J. Kittler, “Region growing: A new approach,” IEEE Trans. Image Process., vol. 7, no. 7, pp. 1079–1084, Jul. 1998.
[4] S. Beucher, “The watershed transformation applied to image segmentation,” Scanning Microsc., vol. 1992, no. 6, p. 28, 1992.
[5] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 603–619, May 2002.
[6] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[7] N. Kanopoulos, N. Vasanthavada, and R. L. Baker, “Design of an image edge detection filter using the Sobel operator,” IEEE J. Solid-State Circuits, vol. 23, no. 2, pp. 358–367, Apr. 1988.
[8] J. M. Prewitt, “Object enhancement and extraction,” Picture Process. Psychopictorics, vol. 10, no. 1, pp. 15–19, 1970.
[9] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 7, pp. 3523–3542, Jul. 2022.
[10] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. MICCAI, 2015, pp. 234–241.
[11] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2015, pp. 3431–3440.
[12] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask R-CNN,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[13] P. Wang et al., “Understanding convolution for semantic segmentation,” in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1451–1460.
[14] N. Sahu, V. Chamola, and R. R. Rajkumar, “A clustering and image processing approach to unsupervised real-time road segmentation for autonomous vehicles,” in Proc. IEEE Globecom Workshops, Dec. 2022, pp. 160–165.
[15] H. Cao et al., “Swin-UNet: UNet-like pure transformer for medical image segmentation,” in Proc. ECCVW, 2022, pp. 205–218.
[16] M. Gruosso, N. Capece, and U. Erra, “Human segmentation in surveillance video with deep learning,” Multimedia Tools Appl., vol. 80, no. 1, pp. 1175–1199, Jan. 2021.
[17] X. Yuan, J. Shi, and L. Gu, “A review of deep learning methods for semantic segmentation of remote sensing imagery,” Exp. Syst. Appl., vol. 169, May 2021, Art. no. 114417.
[18] L. Tanzi, P. Piazzolla, F. Porpiglia, and E. Vezzetti, “Real-time deep learning semantic segmentation during intra-operative surgery for 3D augmented reality assistance,” Int. J. Comput. Assist. Radiol. Surg., vol. 16, no. 9, pp. 1435–1445, Sep. 2021.
[19] S. Ainetter and F. Fraundorfer, “End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2021, pp. 13452–13458.
[20] T. Anand, S. Sinha, M. Mandal, V. Chamola, and F. R. Yu, “AgriSegNet: Deep aerial semantic segmentation framework for IoT-assisted precision agriculture,” IEEE Sensors J., vol. 21, no. 16, pp. 17581–17590, Aug. 2021.
[21] A. S. Chakravarthy, S. Sinha, P. Narang, M. Mandal, V. Chamola, and F. R. Yu, “DroneSegNet: Robust aerial semantic segmentation for UAV-based IoT applications,” IEEE Trans. Veh. Technol., vol. 71, no. 4, pp. 4277–4286, Apr. 2022.
[22] W. Zhou, J. S. Berrio, S. Worrall, and E. Nebot, “Automated evaluation of semantic segmentation robustness for autonomous driving,” IEEE Trans. Intell. Transp. Syst., vol. 21, no. 5, pp. 1951–1963, May 2020.
[23] G. Chen, H. Cao, J. Conradt, H. Tang, F. Rohrbein, and A. Knoll, “Event-based neuromorphic vision for autonomous driving: A paradigm shift for bio-inspired visual sensing and perception,” IEEE Signal Process. Mag., vol. 37, no. 4, pp. 34–49, Jul. 2020.
[24] D. Feng et al., “Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges,” IEEE Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341–1360, Mar. 2021.
[25] K. Muhammad et al., “Vision-based semantic segmentation in scene understanding for autonomous driving: Recent achievements, challenges, and outlooks,” IEEE Trans. Intell. Transp. Syst., vol. 23, no. 12, pp. 22694–22715, Dec. 2022.
[26] J. Liao, L. Cao, W. Li, Y. Ou, C. Duan, and H. Cao, “Fully-supervised semantic segmentation networks: Exploring the relationship between the segmentation networks learning ability and the number of convolutional layers,” in Proc. 7th Int. Conf. Comput. Commun. (ICCC), Dec. 2021, pp. 1685–1692.
[27] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 834–848, Apr. 2018.
[28] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoder–decoder with atrous separable convolution for semantic image segmentation,” in Proc. ECCV, 2018, pp. 801–818.
[29] A. Dosovitskiy et al., “An image is worth 16×16 words: Transformers for image recognition at scale,” in Proc. ICLR, 2021, pp. 1–22.
[30] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jegou, “Training data-efficient image transformers distillation through attention,” in Proc. ICML, 2021, pp. 10347–10357.
[31] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 9992–10002.
[32] A. Vaswani et al., “Attention is all you need,” in Proc. NIPS, 2017, pp. 1–11.
[33] L. Yuan et al., “Tokens-to-token ViT: Training vision transformers from scratch on ImageNet,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 538–547.
[34] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. ECCV, 2020, pp. 213–229.
[35] W. Wang et al., “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 548–558.
[36] W. Wang et al., “PVTv2: Improved baselines with pyramid vision transformer,” Comput. Vis. Media, vol. 8, no. 3, pp. 1–10, 2022.
[37] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” in Proc. NIPS, 2021, pp. 12077–12090.
[38] S. Zheng et al., “Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 6877–6886.
[39] B. Zhang et al., “SegViT: Semantic segmentation with plain vision transformers,” in Proc. NIPS, 2022, pp. 1–12.
[40] Y.-H. Wu, Y. Liu, X. Zhan, and M.-M. Cheng, “P2T: Pyramid pooling transformer for scene understanding,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 99, pp. 1–12, 2022.
[41] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 6230–6239.
[42] J. Fu et al., “Dual attention network for scene segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 3141–3149.
[43] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully convolutional neural networks for volumetric medical image segmentation,” in Proc. 4th Int. Conf. 3D Vis. (3DV), Oct. 2016, pp. 565–571.
[44] R. Strudel, R. Garcia, I. Laptev, and C. Schmid, “Segmenter: Transformer for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 7242–7252.
[45] B. Cheng, A. G. Schwing, and A. Kirillov, “Per-pixel classification is not all you need for semantic segmentation,” in Proc. NIPS, 2021, pp. 17864–17875.
[46] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in Proc. ECCV, 2014, pp. 818–833.
[47] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 936–944.
[48] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path aggregation network for instance segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 8759–8768.
[49] J. Pang, K. Chen, J. Shi, H. Feng, W. Ouyang, and D. Lin, “Libra R-CNN: Towards balanced learning for object detection,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 821–830.
[50] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, “Scene parsing through ADE20K dataset,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 5122–5130.
[51] H. Caesar, J. Uijlings, and V. Ferrari, “COCO-stuff: Thing and stuff classes in context,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 1209–1218.
[52] S. U. Lee, S. Y. Chung, and R. H. Park, “A comparative performance study of several global thresholding techniques for segmentation,” Comput. Vis., Graph., Image Process., vol. 52, no. 2, pp. 171–190, 1990.
[53] D. Bradley and G. Roth, “Adaptive thresholding using the integral image,” J. Graph. Tools, vol. 12, no. 2, pp. 13–21, 2007.
[54] Q. Hou, L. Zhang, M.-M. Cheng, and J. Feng, “Strip pooling: Rethinking spatial pooling for scene parsing,” in Proc. CVPR, 2020, pp. 4003–4012.
[55] K. Yang, J. Zhang, S. Reiß, X. Hu, and R. Stiefelhagen, “Capturing omni-range context for omnidirectional segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 1376–1386.
[56] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 770–778.
[57] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “CCNet: Criss-cross attention for semantic segmentation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2019, pp. 603–612.
[58] M.-H. Guo, C.-Z. Lu, Q. Hou, Z. Liu, M.-M. Cheng, and S.-M. Hu, “SegNeXt: Rethinking convolutional attention design for semantic segmentation,” in Proc. NIPS, 2022, pp. 1140–1156.
[59] S. Ren, D. Zhou, S. He, J. Feng, and X. Wang, “Shunted self-attention via multi-scale token aggregation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 10843–10852.
[60] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “ICNet for real-time semantic segmentation on high-resolution images,” in Proc. ECCV, 2018, pp. 405–420.
[61] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “BiSeNet: Bilateral segmentation network for real-time semantic segmentation,” in Proc. ECCV, 2018, pp. 325–341.
[62] M. Fan et al., “Rethinking BiSeNet for real-time semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2021, pp. 9711–9720.
[63] R. Ranftl, A. Bochkovskiy, and V. Koltun, “Vision transformers for dense prediction,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 12159–12168.
[64] W. Zhang et al., “TopFormer: Token pyramid transformer for mobile semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12073–12083.
[65] J. Xu, Z. Xiong, and S. P. Bhattacharyya, “PIDNet: A real-time semantic segmentation network inspired by PID controllers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2023, pp. 19529–19539.
[66] Q. Wan, Z. Huang, J. Lu, G. Yu, and L. Zhang, “Seaformer: Squeeze-enhanced axial transformer for mobile semantic segmentation,” in Proc. ICLR, 2023, pp. 1–34.
[67] D. Bo, W. Pichao, and F. Wang, “Head-free lightweight semantic segmentation with linear transformer,” in Proc. AAAI, 2023, pp. 516–524.
[68] Z. Xu, D. Wu, C. Yu, X. Chu, N. Sang, and C. Gao, “SCTNet: Single-branch CNN with transformer semantic information for real-time segmentation,” in Proc. AAAI, 2024, pp. 6378–6386.
[69] H. Yan, M. Wu, and C. Zhang, “Multi-scale representations by varying window attention for semantic segmentation,” in Proc. ICLR, 2024, pp. 1–17.
[70] J. Zhang, H. Liu, K. Yang, X. Hu, R. Liu, and R. Stiefelhagen, “CMX: Cross-modal fusion for RGB-X semantic segmentation with transformers,” IEEE Trans. Intell. Transp. Syst., vol. 24, no. 12, pp. 14679–14694, Dec. 2023.
[71] T. Xiao, M. Singh, E. Mintun, T. Darrell, P. Dollár, and R. B. Girshick, “Early convolutions help transformers see better,” in Proc. NIPS, 2021, pp. 30392–30400.
[72] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248–255.
[73] A. Paszke et al., “PyTorch: An imperative style, high-performance deep learning library,” in Proc. NIPS, 2019, pp. 1–13.
[74] R. Wightman, “PyTorch image models,” GitHub, GitHub Repository, 2019, doi: 10.5281/zenodo.4414861. [Online]. Available: https://github.com/rwightman/pytorch-image-models
[75] MMSegmentation Contributors. (2020). MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark. [Online]. Available: https://github.com/open-mmlab/mmsegmentation
[76] I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.
[77] Z. Liu, H. Mao, C.-Y. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A ConvNet for the 2020s,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 11966–11976.
[78] Y. Yuan et al., “HRFormer: High-resolution vision transformer for dense predict,” in Proc. NIPS, 2021, pp. 7281–7293.
[79] B. Shi et al., “A transformer-based decoder for semantic segmentation with multi-level context mining,” in Proc. ECCV, 2022, pp. 624–639.
[80] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, “Masked-attention mask transformer for universal image segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 1280–1289.
[81] J. Gu et al., “Multi-scale high-resolution vision transformer for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2022, pp. 12084–12093.
[82] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “BiSeNet v2: Bilateral network with guided aggregation for real-time semantic segmentation,” Int. J. Comput. Vis., vol. 129, no. 11, pp. 3051–3068, Nov. 2021.
[83] H. Zhang et al., “Context encoding for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7151–7160.
[84] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for semantic segmentation,” in Proc. ECCV, 2020, pp. 173–190.
[85] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. ECCV, 2014, pp. 740–755.
[86] P. Zhang et al., “Multi-scale vision longformer: A new vision transformer for high-resolution image encoding,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), Oct. 2021, pp. 2978–2988.
[87] X. Chu et al., “Twins: Revisiting the design of spatial attention in vision transformers,” in Proc. NIPS, 2021, pp. 9355–9366.
[88] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 7794–7803.

Hu Cao received the Ph.D. degree in computer engineering from the Technical University of Munich (TUM) in 2023. He is currently a Research Associate with the Chair of Robotics, Artificial Intelligence, and Real-Time Systems, TUM, where he has also been a Research Assistant, since October 2019. During his studies, he stayed abroad with ETH Zürich and The University of Hong Kong (HKU), where he was involved in developing algorithms for dense prediction (classification, detection, and segmentation), autonomous driving, and robotic grasping. His current research interests include robotics, machine learning, computer vision, event-based vision, and embodied AI.

Guang Chen received the B.S. and M.Eng. degrees in mechanical engineering from Hunan University, China, and the Ph.D. degree from the Faculty of Informatics, Technical University of Munich, Germany, in 2016. He is a Professor with Tongji University and a Senior Research Associate (guest) with the Technical University of Munich. He is leading the Robotics and Embodied Artificial Intelligence Laboratory, Tongji University. He was a Research Scientist with Fortiss GmbH, a research institute of the Technical University of Munich, from 2012 to 2016; and a Senior Researcher with the Chair of Robotics, Artificial Intelligence, and Real-Time Systems, Technical University of Munich, from 2016 to 2018. His research interests include 3-D vision, embodied artificial intelligence, intelligent robotics, and autonomous driving. He was awarded the Tongji Hundred Talent Research Professor 2018, the Shanghai Rising Star 2021, the Shanghai S&T 35U35 2021, and the National Distinguished Young Talents 2023. He is the Program Chair of IEEE MFI 2022. He serves as an associate editor for several international journals.

Hengshuang Zhao (Member, IEEE) received the Ph.D. degree from the Department of Computer Science and Engineering, The Chinese University of Hong Kong, supervised by Prof. Jiaya Jia. He is currently an Assistant Professor with the Department of Computer Science, The University of Hong Kong. Before that, he was a Post-Doctoral Researcher with the Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT, working with Prof. Antonio Torralba; and the Torr Vision Group, Department of Engineering Science, University of Oxford, working with Prof. Philip Torr. His general research interests cover the broad areas of computer vision and machine learning.
Dongsheng Jiang received the Ph.D. degree in biomedical engineering from Fudan University. He is currently a Senior Researcher with Huawei Inc. His research interests include computer vision and medical image analysis.

Xiaopeng Zhang received the Ph.D. degree from Shanghai Jiao Tong University, Shanghai, China, in 2017. He is currently a Senior Researcher with CLOUD&AI, Huawei Technologies. He was a Research Fellow with the Department of ECE, National University of Singapore, from 2017 to 2019; and a Visiting Researcher with the Department of Computer Science, The University of Texas at San Antonio, San Antonio, TX, USA, from 2015 to 2016. His current research interests include object recognition, weakly supervised detection, and self-supervised learning.

Qi Tian (Fellow, IEEE) received the Ph.D. degree in electronics and communication engineering from the University of Illinois at Urbana–Champaign (UIUC) in 2002. He is currently the Chief Scientist in artificial intelligence with Huawei Cloud & AI. He was the Chief Scientist in computer vision with the Huawei Noah’s Ark Laboratory, from 2018 to 2020. Before joining Huawei, he was a Full Professor with the Department of Computer Science, The University of Texas at San Antonio (UTSA), from 2002 to 2019. He was listed in the Top 10 of the 2016 Most Influential Scholars in Multimedia by Aminer.org. He is an academician of the International Eurasian Academy of Sciences (IEAS) in 2021. He received the 2017 UTSA President Distinguished Award for Research Achievement, the 2016 UTSA Innovation Award in the first category, the 2014 Research Achievement Awards from the College of Science at UTSA, and the 2010 Google Faculty Research Award. He has served as a Founding Member for ICMR (2009–2014) and ACM MM (2009–2012); an International Steering Committee Member for ACM MIR (2006–2010), ACM ICIMCS (2013), ICME (2006 and 2009), PCM (2012), and IEEE International Symposium on Multimedia (2011); and the Chair for ACM Multimedia 2015. He is an Associate Editor of IEEE TRANSACTIONS ON MULTIMEDIA, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, ACM Transactions on Multimedia Computing, Communications, and Applications, Multimedia Systems, and Journal of Machine Vision and Applications.

Alois Knoll (Fellow, IEEE) received the diploma (M.Sc.) degree in electrical/communications engineering from the University of Stuttgart, Germany, in 1985, and the Ph.D. degree (summa cum laude) in computer science from the Technical University of Berlin (TU Berlin), Germany, in 1988. He was on the Faculty of the Computer Science Department, TU Berlin, until 1993. He joined the University of Bielefeld as a Full Professor and the Director of the Research Group of Technical Informatics in 2001. Since 2001, he has been a Professor with the Department of Informatics, TU München (TUM). He was also on the board of directors of the Central Institute of Medical Technology, TUM (IMETUM). From 2004 to 2006, he was the Executive Director of the Institute of Computer Science, TUM. His research interests include cognitive, medical, and sensor-based robotics; multi-agent systems; data fusion; adaptive systems; multimedia information retrieval; model-driven development of embedded systems, with applications to automotive software and electric transportation; and simulation systems for robotics and traffic.