The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

Rolling-Unet: Revitalizing MLP’s Ability to Efficiently Extract Long-Distance Dependencies for Medical Image Segmentation

Yutong Liu, Haijiang Zhu*, Mengting Liu, Huaiyuan Yu, Zihan Chen, Jie Gao
Beijing University of Chemical Technology, China

Abstract

Medical image segmentation methods based on deep learning networks are mainly divided into CNN-based and Transformer-based approaches. However, CNNs struggle to capture long-distance dependencies, while Transformers suffer from high computational complexity and poor local feature learning. To efficiently extract and fuse local features and long-range dependencies, this paper proposes Rolling-Unet, a CNN model combined with MLP. Specifically, we propose the core R-MLP module, which is responsible for learning the long-distance dependency in a single direction of the whole image. By controlling and combining R-MLP modules in different directions, OR-MLP and DOR-MLP modules are formed to capture long-distance dependencies in multiple directions. Further, the Lo2 block is proposed to encode both local context information and long-distance dependencies without excessive computational burden. The Lo2 block has the same parameter size and computational complexity as a 3×3 convolution. The experimental results on four public datasets show that Rolling-Unet achieves superior performance compared to the state-of-the-art methods.

Introduction

With the rapid development of computer technology and artificial intelligence, the powerful modeling ability of the Convolutional Neural Network (CNN) has been widely studied. Deep learning-based segmentation algorithms have also been introduced into medical imaging. U-Net (Ronneberger, Fischer, and Brox 2015) is one of the most famous network architectures in the field of medical image segmentation, and it is a fully convolutional segmentation network. U-Net’s encoder and decoder are symmetrical, forming a U-shaped structure, and feature maps from different stages are fused through skip connections. U-Net can adapt to small training sets and output more accurate segmentation results. This advantage has made U-Net a huge success and led to its wide use. Following this technical route, models such as UNet++ (Zhou et al. 2018), Att-UNet (Oktay et al. 2018), 3D U-Net (Çiçek et al. 2016) and V-Net (Milletari, Navab, and Ahmadi 2016) have been developed for image and volume segmentation of various medical imaging modalities. Although these methods perform well, due to the inherent locality of convolution operations, pure CNN architectures struggle to learn clear global and remote semantic information (Chen et al. 2021).

To overcome the limitations of CNN, inspired by the great success of Transformer in the natural language processing (NLP) domain, researchers have tried to introduce Transformer into the vision domain (Carion et al. 2020). Vision Transformer (ViT) (Dosovitskiy et al. 2020) is completely based on the multi-head self-attention mechanism, which enables the network to capture remote dependencies and encode shape representations. However, it requires a large amount of training data to achieve good performance. Moreover, it has high computational complexity, which prevents the network from supporting high-resolution input (Azad et al. 2022). Swin Transformer (Liu et al. 2021) reduces the computation, but at the cost of no information interaction between its windows, resulting in a smaller receptive field. Compared with CNN models, pure Transformer models also perform poorly in capturing local representations (Chen et al. 2021). In view of the characteristics of CNN and Transformer, some methods attempt to combine CNN and Transformer (Chen et al. 2021; Valanarasu et al. 2021; Wang et al. 2022) to further enhance the network’s ability. But these methods still cannot balance performance and computational cost well.

The multilayer perceptron (MLP), or fully connected (FC) network, is the earliest type of neural network, which consists of multiple linear layers and nonlinear activations stacked together (Rosenblatt 1957). Theoretically, MLP is a universal approximator (Pinkus 1999). However, MLP is computationally expensive and is prone to overfitting when data is insufficient. Moreover, input flattening limits the input resolution. Due to the limitations of hardware and available datasets at that time, the development of MLP was not smooth. In 2021, MLP-Mixer (Tolstikhin et al. 2021) revived the vitality of MLP. It mainly consists of two modules: Token-Mixing MLP and Channel-Mixing MLP, which achieve competitive performance without convolution and attention. MLP has a small inductive bias, and on large datasets, pure MLP architectures can better extract global semantic information. But this also makes it perform poorly on small datasets. To achieve better performance, local bias was introduced (Hou et al. 2022; Tang et al. 2022; Yu et al. 2022; Lian et al. 2021). But these works lost sight of the global aspect.

* Corresponding author.
Copyright © 2024, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.


Figure 1: The overview of the proposed Rolling-Unet.

How to capture and fuse local features and long-distance dependencies more effectively is the key to achieving accurate medical image segmentation. In this paper, we rethink this topic: besides combining CNN and Transformer, are there any other methods that can have both local information and long-distance dependencies? The answer is yes. By combining CNN and MLP, this paper proposes a medical image segmentation network named Rolling-Unet. Its core is the flexible Rolling-MLP (R-MLP) module, which can capture linear long-distance dependency in a single direction of the whole image. By concatenating two perpendicular R-MLP modules, we form the Orthogonal Rolling-MLP (OR-MLP) module, which can capture remote dependencies in multiple directions. We adopt the U-shaped framework of U-Net, including the encoder-decoder structure, bottleneck layer and skip connections, to preserve the fine spatial details. In the 4th layer of the encoder-decoder and the bottleneck layer, we replace the original convolution block with a Feature Incentive block and a Long-Local (Lo2) block. The Feature Incentive block encodes features and controls the dimension and shape of the feature output. The Lo2 block consists of a Double Orthogonal Rolling-MLP (DOR-MLP, which contains two complementary OR-MLP modules) module and a Depthwise Separable Convolution (DSC) module, which together capture both local context information and long-distance dependencies of the image. Extensive experiments show that our method outperforms the existing best methods. The main contributions of this work are:

• 1) We proposed a new approach to capture long-distance dependency, and constructed the R-MLP module.
• 2) Based on 1, we constructed the OR-MLP and DOR-MLP modules, which can obtain remote dependencies in more directions.
• 3) Based on 2, we proposed the Lo2 block. It simultaneously extracts the local context information and long-distance dependencies, without increasing the computational burden. The Lo2 block has the same level of parameters and computation as a 3×3 convolution.
• 4) Based on 3, we constructed Rolling-Unet networks with different parameter scales. On four datasets, all scales of Rolling-Unet surpassed the existing methods, fully verifying the efficiency of our method.

Related Work

CNN and Transformer for Medical Image Segmentation

Inspired by U-Net, UNet++ (Zhou et al. 2018) incorporated a set of dense skip connections in the model to alleviate the semantic gap of feature fusion. Several subsequent works leveraged techniques such as attention mechanism, image
pyramid, and residual architecture (Oktay et al. 2018; Jha et al. 2020; Jha et al. 2019) to further enhance the performance of CNN-based models. DconnNet (Yang and Farsiu 2023) is a state-of-the-art CNN-based model that exploits directional features extracted from a shared latent space to enrich the overall data representation. In the medical image domain, pure Transformer-based segmentation paradigms have also emerged, such as MISSFormer (Huang et al. 2021), DAE-Former (Azad et al. 2022) and Swin-Unet (Cao et al. 2023). Swin-Unet is the first pure Transformer-based U-shaped architecture that adopts Swin Transformer to boost feature representation. Given the respective drawbacks of CNN and Transformer, various works that integrate both paradigms have been proposed. MedT (Valanarasu et al. 2021) devised a gated axial attention model that tackles the issue of limited data samples in medical imaging. UCTransNet (Wang et al. 2022) introduced a Transformer-based module to substitute the skip connections in U-Net. Although these works all embrace the strategy of blending global and local features to augment the model capability, they still fall short of satisfying the demand for accurate segmentation of medical images.

MLP Paradigm for Image Tasks

MLP-Mixer (Tolstikhin et al. 2021) pioneered deep MLP networks for vision. Owing to its inferior performance on small datasets, later works endeavored to incorporate local priors in MLP. Vision Permutator (ViP) (Hou et al. 2022) encodes the feature representation with linear projections along both height and width dimensions. Sparse MLP (Tang et al. 2022) follows a similar strategy, except that it directly maps along the image height and width. However, this design lacks flexibility, as its parameter and computation overheads are tied to the image size, which limits the size of the input image. S2MLP (Yu et al. 2022) devised a spatial shift module, which aligns disparate token features to the same channel. AS-MLP (Lian et al. 2021) employs two parallel branches for horizontal and vertical shifts. Nevertheless, these works merely possess local receptive fields, forsaking the original motivation of pure MLP models to capture global features. In the medical image domain, as far as we know, there are few segmentation models based on MLP. UNeXt (Valanarasu and Patel 2022) introduced a lightweight model, which adopts an axial shift module, but still can only capture short-range linear receptive fields. PHNet (Lin et al. 2023) is a 3D segmentation network that proposes a multi-layer permutation perceptron module, which augments the primal MLP by preserving positional information.

Method

Architecture Overview

Figure 1 illustrates the overall architecture of the proposed Rolling-Unet, which follows the U-Net design. It consists of an encoder-decoder, a bottleneck layer, and skip connections. The encoder-decoder has four stages of downsampling and upsampling, which are performed by max pooling and bilinear interpolation, respectively. The first three layers of the encoder-decoder contain two standard 3×3 convolution blocks each. The fourth layer and the bottleneck layer employ Feature Incentive blocks to handle feature channel compression and expansion, and Lo2 blocks to capture both local context and long-range dependencies of the image. The skip connections fuse the features of the same scale by addition. Each module is described in detail below.

Figure 2: Illustration of the Rolling operation in width direction.

Figure 3: Controlling and combining different R-MLP to obtain long-distance dependencies in multiple directions.

R-MLP Module

Given a feature matrix $X \in \mathbb{R}^{H \times W \times C}$ with spatial resolution $H \times W$ and channel number $C$, where $h_i$ ($i \in [1, H]$) denotes the height index, $w_j$ ($j \in [1, W]$) denotes the width index, and $c_k$ ($k \in [1, C]$) denotes the channel index, we perform a Rolling operation on the feature maps of each channel layer in the feature matrix along the same direction, as shown in Figure 2 (taking the width direction as an example). The Rolling operation consists of two steps: shifting and cropping. First, the feature map with channel index $c_k$
has a shifting step of $k$. Then, taking the feature map with channel index $c_0$ as the reference, we crop the excess parts of the other feature maps and use them to fill the missing parts. Finally, we perform a channel projection with weight sharing at each spatial location index $(h_i, w_j)$ to encode long-distance dependency. In Figure 2, the original feature matrix has only one width $w_j$ feature at a fixed spatial index $(h_i, w_j)$ for all channels. After applying the Rolling operation in the width direction, different channels have different width features. When $C \ge W$, we can encode the width features of the entire image, which can be understood as global, unidirectional, linear receptive fields. When $C < W$, this linear receptive field is non-global. Similarly, R-MLP can also capture long-distance dependency in the height direction.

It is well known that MLP is sensitive to the positional information of the input. R-MLP performs cyclic operations of shifting and cropping the feature maps, making the positional index order on each channel non-fixed. This preliminarily reduces the sensitivity of R-MLP to position. Secondly, by using weight sharing, all channel projections share a set of parameters, which further reduces the sensitivity.

                             BUSI                                GlaS
Method            Params(M)  IoU ↑       F1 ↑        HD95 ↓      IoU ↑       F1 ↑        HD95 ↓
U-Net(2015)       31.04      64.25±1.63  77.55±1.23  7.57±2.44   87.62±0.29  93.35±0.16  0.83±0.18
UNet++(2018)      36.63      65.68±1.66  78.56±1.26  7.72±2.16   87.99±0.52  93.58±0.29  0.81±0.16
Att-UNet(2018)    34.88      65.97±1.91  78.79±1.29  8.36±2.11   87.90±0.47  93.40±0.26  0.82±0.29
MedT(2021)        1.37       52.15±3.47  67.68±3.18  10.23±1.17  -           -           -
UCTransNet(2022)  66.24      67.27±1.04  79.62±0.74  6.19±0.45   87.80±0.16  93.46±0.12  0.78±0.26
UNeXt(2022)       1.47       61.78±1.46  75.52±0.91  8.33±0.42   83.95±1.09  91.22±0.67  1.04±0.10
DconnNet(2023)    25.49      67.16±0.61  79.63±0.61  6.97±2.81   87.22±0.59  93.12±0.36  0.93±0.15
Rolling-Unet(S)   1.78       65.52±2.82  78.43±2.10  6.19±0.62   86.19±0.35  92.51±0.27  1.00±0.08
Rolling-Unet(M)   7.10       66.99±0.61  79.50±0.35  5.76±0.95   86.60±0.82  92.75±0.53  0.90±0.15
Rolling-Unet(L)   28.32      67.81±1.80  80.17±1.19  7.29±2.50   88.02±0.28  93.59±0.17  0.64±0.27

Table 1: Results on the BUSI and GlaS datasets. IoU, F1 and HD95 are in ‘mean±std’ format.

OR-MLP and DOR-MLP

R-MLP can encode the long-range dependency along either the width or height direction. How can we capture the long-distance dependency along other directions? Applying R-MLP first along the width direction and then along the height direction is equivalent to a synchronous shifting operation of the feature map in two orthogonal directions, resulting in a diagonal receptive field. As shown in equation (1), for an input $X$, we first apply R-MLP along one direction, $\mathrm{MLP}_{R1}$, and then concatenate another R-MLP along the perpendicular direction, $\mathrm{MLP}_{R2}$. We use the GELU activation function in between, and then add a residual connection with the input $X$. This forms the Orthogonal Rolling-MLP (OR-MLP) module, as illustrated in Figure 1.

$$\mathrm{MLP}_{OR}(X) = \mathrm{MLP}_{R2}(\mathrm{GELU}(\mathrm{MLP}_{R1}(X))) + X \tag{1}$$

R-MLP is a highly flexible module with great potential. The sign of the shifting step $k$ determines the encoding order. When using R-MLP alone, reversing the order does not affect the linear receptive field extraction. However, when using OR-MLP, the sign of $k$ is crucial. For the width direction, a positive $k$ value represents moving from left to right (LR), and a negative $k$ value represents moving from right to left (RL). For the height direction, a positive $k$ value represents moving from top to bottom (TB), and a negative $k$ value represents moving from bottom to top (BT). As shown in Figure 3, we consider two complementary OR-MLP modules. The first one applies R-MLP along the LR direction first and then sequentially along the TB direction. The second one applies R-MLP along the BT direction first and then sequentially along the LR direction. By parallelizing these two OR-MLPs, we capture the long-range dependencies along four directions: width, height, positive diagonal, and negative diagonal. As shown in equation (2), for an input $X$, we first apply an OR-MLP, $\mathrm{MLP}_{OR}^{1}$, and then parallelize another OR-MLP, $\mathrm{MLP}_{OR}^{2}$. We concatenate their outputs along the channel dimension and apply LayerNorm (LN). Then we use Channel-mixing (CM) (Tolstikhin et al. 2021) to fuse the features and reduce the channels back to $C$. Finally, we add a residual connection with the input $X$. This forms the Double Orthogonal Rolling-MLP (DOR-MLP) module, as depicted in Figure 1.

$$\mathrm{MLP}_{DOR}(X) = \mathrm{CM}(\mathrm{LN}(\mathrm{Concat}[\mathrm{MLP}_{OR}^{1}(X), \mathrm{MLP}_{OR}^{2}(X)])) + X \tag{2}$$

Lo2 Block and Feature Incentive Block

The DOR-MLP module captures the global, linear long-range dependencies along four directions in two-dimensional space, but it lacks the local context information. We argue that better integrating local information and global dependencies is crucial for performance improvement. Depthwise Separable Convolution (DSC) (Chollet 2017) is a natural choice because it has very few parameters and a low computational cost, which makes it compatible with DOR-MLP. The Channel-mixing in MLP-Mixer, the MLP in ViT, and the R-MLP in this paper are all equivalent to the standard 1×1 convolution in CNN, which allows feature interaction between different channels.

The Rolling operation in R-MLP does not involve any parameters or FLOPs, so the parameters of R-MLP are $O(C^2)$ and the FLOPs are $O(HWC^2)$. It can be further derived that the parameters and FLOPs of OR-MLP are $O(2C^2)$ and $O(2HWC^2)$ respectively, and the parameters and FLOPs of DOR-MLP are $O(6C^2)$ and $O(6HWC^2)$ respectively. As depicted in Figure 1, we parallelize DOR-MLP with DSC, then concatenate their outputs along the channel dimension, and finally use Channel-mixing to fuse the features and restore the channels to $C$. This forms the Long-Local (Lo2) block, see equation (3). In DSC, we use a 3×3 convolution kernel. Hence, we can derive that the parameters of the Lo2 block are $O(9C^2)$ and the FLOPs are $O(9HWC^2)$. This is of the same level as a standard 3×3 convolution.

$$\mathrm{Lo2}(X) = \mathrm{CM}(\mathrm{Concat}[\mathrm{MLP}_{DOR}(X), \mathrm{DSC}(X)]) \tag{3}$$

We employ the Feature Incentive block in the 4th layer of the encoder and the bottleneck layer. It is essentially a convolution block that is mainly used to encode the features and change the channel number. Since the subsequent Lo2 block mainly consists of MLP operations, we adopt the GELU activation function and LayerNorm, following a series of prior MLP works. In the 4th layer of the decoder, the Feature Incentive block is composed of a convolution block, the ReLU activation function and BatchNorm, as the subsequent layers perform convolution operations, following common CNN practice.

Experiments

Datasets

We evaluated our method on four datasets with different characteristics, data sizes and image resolutions: the International Skin Imaging Collaboration (ISIC 2018), the Breast UltraSound Images (BUSI), the Gland Segmentation dataset (GlaS) and the CHASEDB1. The ISIC 2018 dataset contains skin images acquired by cameras and the corresponding skin lesion segmentation maps. We only used the training set of the ISIC 2018 dataset, which contains 2594 images. The difficulty of this dataset lies in the fact that the segmentation targets often have blurry boundaries, which is exacerbated as the image size increases. Therefore, we resized the images to two resolutions of 256×256 and 512×512 and conducted experiments separately. The BUSI dataset consists of ultrasound images of normal, benign, and malignant breast cancer and the corresponding segmentation maps. It has similar problems to the ISIC 2018 dataset, but the two differ in lesion types and imaging methods. We used 647 ultrasound images of benign and malignant breast tumors, resized to 256×256. The GlaS dataset contains 165 images, which we resized to 512×512. The CHASEDB1 dataset is a vessel segmentation dataset with 28 images of 999×960 resolution. To preserve the details of the thin vessels, we resized the images to 960×960.

Method            IoU ↑   F1 ↑    HD95 ↓
U-Net(2015)       82.97   90.39   1.79
UNet++(2018)      83.34   90.66   1.56
Att-UNet(2018)    83.31   90.61   1.69
MedT(2021)        81.48   89.49   1.89
UCTransNet(2022)  83.96   91.02   1.66
UNeXt(2022)       82.90   90.38   2.04
DconnNet(2023)    83.86   90.93   2.04
Rolling-Unet(S)   84.15   91.13   1.51
Rolling-Unet(M)   84.16   91.09   1.69
Rolling-Unet(L)   83.74   90.90   1.99

Table 2: Results on the ISIC 2018 dataset (Image size = 256).

Method            IoU ↑   F1 ↑    HD95 ↓
U-Net(2015)       80.98   89.14   3.37
UNet++(2018)      81.40   89.44   3.59
Att-UNet(2018)    81.45   89.44   2.78
MedT(2021)        -       -       -
UCTransNet(2022)  83.14   90.47   2.89
UNeXt(2022)       83.12   90.51   2.32
DconnNet(2023)    83.60   90.78   2.78
Rolling-Unet(S)   84.14   91.11   2.17
Rolling-Unet(M)   83.96   90.94   2.42
Rolling-Unet(L)   83.94   90.99   1.90

Table 3: Results on the ISIC 2018 dataset (Image size = 512).

Method            IoU ↑       F1 ↑        HD95 ↓
U-Net(15)         69.81±0.34  82.22±0.24  1.86±0.15
Att-UNet(18)      69.90±0.41  82.28±0.29  1.80±0.17
UCTransNet(22)    69.13±0.31  81.74±0.22  2.00±0.00
UNeXt(22)         66.81±0.04  80.10±0.03  2.04±0.07
DconnNet(23)      69.90±0.44  82.28±0.31  2.00±0.00
Rolling-Unet(S)   69.40±0.28  81.94±0.19  1.90±0.17
Rolling-Unet(M)   69.55±0.38  82.03±0.27  1.96±0.08
Rolling-Unet(L)   70.40±0.43  82.63±0.30  1.71±0.00

Table 4: Results on the CHASEDB1 dataset. The metrics are in ‘mean±std’ format.

Implementation Details

We implemented Rolling-Unet using PyTorch on an NVIDIA A6000 GPU. For the ISIC 2018, BUSI and GlaS datasets, the batch size was set to 8 and the learning rate was 0.0001 (Valanarasu and Patel 2022). For the CHASEDB1 dataset, the batch size was set to 4 and the learning rate was 0.001 (Tomar et al. 2022). We used the Adam optimizer to train the model, and used a cosine annealing learning rate scheduler with a minimum learning rate of 0.00001. The loss function was a combination of binary cross entropy (BCE) and dice loss. We randomly split each dataset into 80% training and 20% validation subsets. To account for the limited data size of the BUSI, GlaS and CHASEDB1 datasets, we repeated this process three times and reported the average and standard deviation of the results. To evaluate the network’s ability fairly, none of the experiments used pre-trained weights or post-processing methods, and only applied two simple
online data augmentations: random rotation and flipping. We trained for 400 epochs in total.

Comparison with State-of-the-Art Method

We evaluated Rolling-Unet against other state-of-the-art methods, including CNN-based methods: U-Net (Ronneberger, Fischer, and Brox 2015), UNet++ (Zhou et al. 2018), Att-UNet (Oktay et al. 2018), DconnNet (Yang and Farsiu 2023); Transformer-based methods: UCTransNet (Wang et al. 2022), MedT (Valanarasu et al. 2021); and an MLP-based method: UNeXt (Valanarasu and Patel 2022). MedT failed to produce results on the GlaS, ISIC 2018 (Image size = 512) and CHASEDB1 datasets due to memory constraints. Similarly, UNet++ did not yield results on the CHASEDB1 dataset. To fully demonstrate the efficiency of Rolling-Unet, we trained different sizes of Rolling-Unet. When the channel number C = 16 / 32 / 64 in Figure 1, they are named Rolling-Unet (S) / Rolling-Unet (M) / Rolling-Unet (L) respectively. We adopted Intersection over Union (IoU), F1 score and 95% Hausdorff Distance (HD95) as evaluation metrics.

The evaluation results on BUSI and GlaS are presented in Table 1. The results on ISIC 2018 are shown in Table 2 and Table 3. The result on CHASEDB1 is shown in Table 4. Our method outperformed all the other methods on all datasets. Especially on BUSI and ISIC 2018, Rolling-Unet obtained a significant advantage. In these two datasets, many targets have blurry boundaries, which make them difficult to distinguish from the background. Rolling-Unet more effectively extracted remote dependencies to enhance the segmentation performance. The experiment of changing the image size on ISIC 2018 further verified this conclusion. Only Rolling-Unet and UNeXt maintained similar performance when the image size increased, while other methods showed different degrees of decline. For the phenomenon that the metrics of Rolling-Unet (L) are lower than those of Rolling-Unet (S) in ISIC 2018, we have two hypotheses. One is the fluctuation of training, which requires taking the average of multiple results to reduce the impact. Another is that the semantic information of this dataset is relatively simple, and more network parameters are prone to overfitting, thereby reducing performance. Recent lightweight models (Valanarasu and Patel 2022; Ruan et al. 2023; Cheng et al. 2023) also indirectly support this point. In follow-up work, we will explain this phenomenon through more experiments.

Figure 4: Qualitative comparison of Rolling-Unet with other state-of-the-art methods. Panels: (a) Image+GT, (b) Ours, (c) DconnNet, (d) UNeXt, (e) U-Net, (f) Att-UNet, (g) UCTransNet, (h) UNet++. From top to bottom are the BUSI, GlaS and ISIC 2018 datasets. The first column is the original image, with the green contour indicating the Ground Truth. In the visualized segmentation results, purple indicates over-segmentation, and yellow indicates under-segmentation.

Figure 5: Qualitative comparison on the CHASEDB1 dataset. Panels: (a) Image, (b) GT, (c) Ours, (d) DconnNet, (e) UNeXt, (f) U-Net, (g) Att-UNet, (h) UCTransNet.

On GlaS and CHASEDB1, no method achieved a significant advantage, but Rolling-Unet was still the best with a small standard deviation. The images in GlaS have dense,
3824
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)

DSC R- OR- DOR- IoU ↑ F1 ↑ HD95 ↓ Method IoU ↑ F1 ↑ HD95 ↓


MLP MLP MLP
MLP 81.10 89.22 2.70
79.48 88.19 4.05 R-MLP 84.14 91.11 2.17
* 80.62 88.94 2.87
* 81.62 89.50 3.73 Table 6: Ablation experiments on the ISIC 2018 dataset.
* * 82.11 89.85 3.16
* 83.39 90.67 2.43
* * 83.84 90.92 2.16 Method IoU ↑ F1 ↑ HD95 ↓
* 83.46 90.63 2.27 Series 1 83.65 90.82 2.72
* * 84.14 91.11 2.17 Series 2 83.62 90.84 2.05
Parallel 84.14 91.11 2.17
Table 5: Ablation experiments on the ISIC 2018 dataset.
Table 7: Ablation experiments on the ISIC 2018 dataset.
tiny cells and tissues; the segmentation targets and the back-
ground often have similar textures, colors, as well as shapes.
In the CHASEDB1 dataset, the thicker vessels are not diffi- the DSC module, the performance of R-MLP, OR-MLP, and
cult for all methods, and the difficulty of segmentation lies DOR-MLP progressively increases. This demonstrates the
in those thin vessels, as shown in Figure 5. These require effectiveness of the proposed module for capturing long-
more powerful methods to solve. distance dependency, and approves the idea of extracting
The parameter amounts of the models are provided in Ta- long-distance dependencies from multiple directions. When
ble 1. We define models with parameter amount less than combined with the DSC module, the performance can be
2M as primary models, and models with parameter amount further enhanced. Therefore, it is essential to fuse the remote
greater than 20M as secondary models (only Rolling-Unet dependencies and local context information.
(L) is between 2-20M). On the four datasets, our method is To rule out the performance improvement caused by the
the best in both primary and secondary models, proving the increase of parameters and FLOPs, we replaced the R-MLP
efficiency of the method. in Rolling-Unet with a regular MLP. This makes the model
In Figure 4, we visualized the difference map between the lose the ability to capture long-distance dependencies while
segmentation results and the Ground Truth to highlight the keeping the parameters and FLOPs consistent. As shown in
differences. Purple indicates over-segmentation, and yellow Table 6, the performance dropped significantly. This result is
indicates under-segmentation. Due to space limitations, we omitted the results of MedT. In the images of BUSI and ISIC 2018, the segmentation target lacks a clear boundary. In the segmentation results, methods other than Rolling-Unet generated many under-segmented or over-segmented regions. This demonstrates that Rolling-Unet is good at extracting target contours. The targets in GlaS have complex boundaries, and only Rolling-Unet achieved segmentation results close to the ground truth. The visualization results of the CHASEDB1 dataset are shown in Figure 5. Almost all methods can correctly segment the thick vessels, and the subtle difference lies in the thin vessels inside the blue box. Rolling-Unet considers the long-distance dependency features of the image, so it improves the segmentation of the thin vessels.

Ablation Studies

To investigate the impact of various factors on model performance, we performed ablation experiments on the ISIC 2018 dataset (image size = 512). The details are described as follows.

The Lo2 block consists of DOR-MLP and DSC modules in parallel. The former is responsible for capturing long-distance dependencies, and the latter is responsible for extracting local context information. To check whether the combination of DOR-MLP and DSC is optimal, and to explore their respective contributions, we report the experimental results in Table 5. Regardless of the presence or absence of

expected, as the Rolling-Unet without the ability to capture long-range dependencies has a similar network structure to the original U-Net.

Further, we explored the combination of DOR-MLP and DSC. Series 1 means executing DOR-MLP first and then DSC. Series 2 means executing DSC first and then DOR-MLP. Parallel means connecting DSC and DOR-MLP in parallel: the two branches are executed concurrently, and their features are integrated by Channel-mixing at the end. The results are shown in Table 7. There is little difference between Series 1 and Series 2, and Parallel performs best. This indicates that the order of extracting local features and remote dependencies is not important, and that it is best to extract them simultaneously and then fuse them.

Conclusion

In this paper, we propose the Rolling-Unet model, which captures long-range dependencies without increasing the computational cost and outperforms existing methods. It is worth noting that the remote dependencies from multiple directions are not global receptive fields; in a strict sense, they are still a compromise of MLP. However, R-MLP is a very flexible module. Through combination, it can also capture large-scale regional features and even global features. In future work, we will explore this aspect. We will also investigate its potential in three-dimensional medical image segmentation, as well as other image tasks.
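
The Series 1, Series 2, and Parallel arrangements compared in the ablation study can be illustrated with a small sketch. This is not the authors' implementation: the three `toy_*` functions are hypothetical stand-ins on a 1-D signal for DOR-MLP, DSC, and Channel-mixing, chosen only to make the wiring of the three variants concrete.

```python
from typing import Callable, List

Signal = List[float]
Branch = Callable[[Signal], Signal]


def toy_long_range(x: Signal) -> Signal:
    """Hypothetical stand-in for DOR-MLP: mix each position with a distant one."""
    n = len(x)
    return [0.5 * (x[i] + x[(i + n // 2) % n]) for i in range(n)]


def toy_local(x: Signal) -> Signal:
    """Hypothetical stand-in for DSC: 3-tap average over neighbours (edges clamped)."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0 for i in range(n)]


def toy_channel_mix(a: Signal, b: Signal) -> Signal:
    """Hypothetical stand-in for Channel-mixing: equal-weight fusion of two branches."""
    return [0.5 * (u + v) for u, v in zip(a, b)]


def series(first: Branch, second: Branch) -> Branch:
    """Run `first`, then feed its output to `second`."""
    return lambda x: second(first(x))


def parallel(a: Branch, b: Branch) -> Branch:
    """Run both branches on the same input, then fuse the two outputs."""
    return lambda x: toy_channel_mix(a(x), b(x))


series_1 = series(toy_long_range, toy_local)    # long-range first, then local
series_2 = series(toy_local, toy_long_range)    # local first, then long-range
parallel_block = parallel(toy_long_range, toy_local)

x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for name, block in [("Series 1", series_1),
                    ("Series 2", series_2),
                    ("Parallel", parallel_block)]:
    print(name, [round(v, 3) for v in block(x)])
```

In the two series variants the second module only sees features already transformed by the first, whereas in the parallel variant both modules see the same input and their outputs are fused afterwards, which is the arrangement the ablation reports as best.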


Acknowledgements
This research was supported by the National Key R&D Program of China (2022YFF0607503).