Rolling-Unet: Revitalizing MLP's Ability to Capture Long-distance Dependency for Medical Image Segmentation
The Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24)
Abstract

How to capture and fuse local features and long-distance dependencies more effectively is the key to achieving accurate medical image segmentation. In this paper, we rethink this topic: besides combining CNN and Transformer, are there other methods that can capture both local information and long-distance dependencies? The answer is yes. By combining CNN and MLP, this paper proposes a medical image segmentation network named Rolling-Unet. Its core is the flexible Rolling-MLP (R-MLP) module, which can capture linear long-distance dependency in a single direction across the whole image. By concatenating two orthogonal R-MLP modules, we form the Orthogonal Rolling-MLP (OR-MLP) module, which can capture remote dependencies in multiple directions. We adopt the U-shaped framework of U-Net, including the encoder-decoder structure, bottleneck layer, and skip connections, to preserve fine spatial details. In the fourth layer of the encoder-decoder and in the bottleneck layer, we replace the original convolution block with the Feature Incentive block and the Long-Local (Lo2) block. The Feature Incentive block encodes features and controls the dimension and shape of the feature output. The Lo2 block consists of the Double Orthogonal Rolling-MLP (DOR-MLP, which contains two complementary OR-MLPs) module and the Depthwise Separable Convolution (DSC) module, which together capture both the local context information and the long-distance dependencies of the image. Extensive experiments show that our method outperforms the existing best methods. The main contributions of this work are:

• 1) We proposed a new approach to capture long-distance dependency and constructed the R-MLP module.
• 2) Based on 1), we constructed the OR-MLP and DOR-MLP modules, which can obtain remote dependencies in more directions.
• 3) Based on 2), we proposed the Lo2 block. It simultaneously extracts local context information and long-distance dependencies without increasing the computational burden; the Lo2 block has the same level of parameters and computation as a 3×3 convolution.
• 4) Based on 3), we constructed Rolling-Unet networks with different parameter scales. On four datasets, all scales of Rolling-Unet surpassed the existing methods, fully verifying the efficiency of our method.

Related Work

CNN and Transformer for Medical Image Segmentation

Inspired by U-Net, UNet++ (Zhou et al. 2018) incorporated a set of dense skip connections in the model to alleviate the semantic gap of feature fusion. Several subsequent works leveraged techniques such as attention mechanism, image
                               BUSI                                  GlaS
Method             Params(M)   IoU↑        F1↑         HD95↓         IoU↑        F1↑         HD95↓
U-Net (2015)       31.04       64.25±1.63  77.55±1.23  7.57±2.44     87.62±0.29  93.35±0.16  0.83±0.18
UNet++ (2018)      36.63       65.68±1.66  78.56±1.26  7.72±2.16     87.99±0.52  93.58±0.29  0.81±0.16
Att-UNet (2018)    34.88       65.97±1.91  78.79±1.29  8.36±2.11     87.90±0.47  93.40±0.26  0.82±0.29
MedT (2021)        1.37        52.15±3.47  67.68±3.18  10.23±1.17    —           —           —
UCTransNet (2022)  66.24       67.27±1.04  79.62±0.74  6.19±0.45     87.80±0.16  93.46±0.12  0.78±0.26
UNeXt (2022)       1.47        61.78±1.46  75.52±0.91  8.33±0.42     83.95±1.09  91.22±0.67  1.04±0.10
DconnNet (2023)    25.49       67.16±0.61  79.63±0.61  6.97±2.81     87.22±0.59  93.12±0.36  0.93±0.15
Rolling-Unet (S)   1.78        65.52±2.82  78.43±2.10  6.19±0.62     86.19±0.35  92.51±0.27  1.00±0.08
Rolling-Unet (M)   7.10        66.99±0.61  79.50±0.35  5.76±0.95     86.60±0.82  92.75±0.53  0.90±0.15
Rolling-Unet (L)   28.32       67.81±1.80  80.17±1.19  7.29±2.50     88.02±0.28  93.59±0.17  0.64±0.27

Table 1: Results on the BUSI and GlaS datasets. IoU, F1, and HD95 are reported in 'mean±std' format. The best results are in bold.
has a shifting step of k. Then, taking the feature map with channel index c0 as the reference, we crop the excess parts of the other feature maps and fill them into the missing parts. Finally, we perform a channel projection with weight sharing at each spatial location index $(h_i, w_j)$ to encode long-distance dependency. In Figure 2, the original feature matrix has only one width feature $w_j$ at a fixed spatial index $(h_i, w_j)$ for all channels. After applying the Rolling operation in the width direction, different channels hold different width features. When C ≥ W, we can encode the width features of the entire image, which can be understood as a global, unidirectional, linear receptive field. When C < W, this linear receptive field is non-global. Similarly, R-MLP can also capture long-distance dependency in the height direction.
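The Rolling operation is easy to express in code. The following is a minimal PyTorch sketch of an R-MLP-style layer written from the description above, not the authors' implementation: the use of cyclic `torch.roll`, the per-channel shift schedule i·k, and the 1×1 convolution standing in for the weight-shared channel projection are all assumptions.

```python
import torch
import torch.nn as nn

class RMLP(nn.Module):
    """Sketch of an R-MLP-style layer: each channel is cyclically shifted along
    one spatial axis by a channel-dependent offset, then a weight-shared channel
    projection (a 1x1 convolution) mixes channels at every spatial location."""

    def __init__(self, channels: int, k: int = 1, dim: str = "width"):
        super().__init__()
        self.k = k                                # shifting step; its sign sets the direction
        self.axis = 3 if dim == "width" else 2    # (B, C, H, W): W axis or H axis
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # shared projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Roll channel i by i*k positions. Afterwards, the channel dimension at a
        # fixed (h_i, w_j) holds features from up to C distinct spatial positions,
        # so the 1x1 projection encodes a linear long-distance dependency
        # (global along that axis when C >= W).
        shifted = [torch.roll(x[:, i:i + 1], shifts=i * self.k, dims=self.axis)
                   for i in range(x.shape[1])]
        return self.proj(torch.cat(shifted, dim=1))
```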
It is well known that MLP is sensitive to the positional information of the input. R-MLP performs cyclic operations of shifting and cropping the feature maps, making the positional index order on each channel non-fixed. This preliminarily reduces the sensitivity of R-MLP to position. Secondly, by using weight sharing, all channel projections share a single set of parameters, which further reduces this sensitivity.

OR-MLP and DOR-MLP
R-MLP can encode the long-range dependency along either the width or the height direction. How can we capture the long-distance dependency along other directions? Applying R-MLP first along the width direction and then along the height direction is equivalent to a synchronous shifting operation of the feature map in two orthogonal directions, resulting in a diagonal receptive field. As shown in equation (1), for an input X, we first apply R-MLP along one direction, $MLP_R^1$, and then cascade another R-MLP along the perpendicular direction, $MLP_R^2$. We use the GELU activation function in between, and then add a residual connection with the input X. This forms the Orthogonal Rolling-MLP (OR-MLP) module, as illustrated in Figure 1.

$MLP_{OR}(X) = MLP_R^2(\mathrm{GELU}(MLP_R^1(X))) + X$   (1)
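Read as code, equation (1) is a simple composition. The sketch below reuses the hypothetical RMLP class from the previous sketch; making the axis and shift sign of each R-MLP configurable is our own device, anticipating that the sign of k matters for DOR-MLP below.

```python
import torch.nn as nn

class ORMLP(nn.Module):
    """Sketch of OR-MLP (Eq. 1): two orthogonal R-MLPs composed with a GELU in
    between, plus a residual connection. `first` and `second` each name an axis
    and a signed shift step, e.g. ("width", +1) for LR or ("height", -1) for BT."""

    def __init__(self, channels: int, first=("width", 1), second=("height", 1)):
        super().__init__()
        self.r1 = RMLP(channels, k=first[1], dim=first[0])    # MLP_R^1
        self.r2 = RMLP(channels, k=second[1], dim=second[0])  # MLP_R^2
        self.act = nn.GELU()

    def forward(self, x):
        # MLP_OR(X) = MLP_R^2(GELU(MLP_R^1(X))) + X
        return self.r2(self.act(self.r1(x))) + x
```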
R-MLP is a highly flexible module with great potential. The sign of the shifting step k determines the encoding order. When using R-MLP alone, reversing this order does not affect the linear receptive field extraction. However, when using OR-MLP, the sign of k is crucial. For the width direction, a positive k represents moving from left to right (LR), and a negative k represents moving from right to left (RL); for the height direction, a positive k represents moving from top to bottom (TB), and a negative k represents moving from bottom to top (BT). As shown in Figure 3, we consider two complementary OR-MLP modules. The first applies R-MLP along the LR direction and then along the TB direction; the second applies R-MLP along the BT direction and then along the LR direction. By parallelizing these two OR-MLPs, we capture the long-range dependencies along four directions: width, height, positive diagonal, and negative diagonal. As shown in equation (2), for an input X, we apply one OR-MLP, $MLP_{OR}^1$, in parallel with another OR-MLP, $MLP_{OR}^2$. We concatenate their outputs along the channel dimension and apply LayerNorm. We then use Channel-mixing (CM) (Tolstikhin et al. 2021) to fuse the features and reduce the channels back to C. Finally, we add a residual connection with the input X. This forms the Double Orthogonal Rolling-MLP (DOR-MLP) module, as depicted in Figure 1.

$MLP_{DOR}(X) = \mathrm{CM}(\mathrm{LN}(\mathrm{Concat}[MLP_{OR}^1(X), MLP_{OR}^2(X)])) + X$   (2)
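Equation (2) then runs two complementary OR-MLPs in parallel. A sketch under the same assumptions, with the LR/TB and BT/LR branch configuration taken from the description above and Channel-mixing realized as a 1×1 convolution (consistent with the equivalence noted in the next section):

```python
import torch
import torch.nn as nn

class DORMLP(nn.Module):
    """Sketch of DOR-MLP (Eq. 2): two complementary OR-MLPs in parallel, channel
    concatenation, LayerNorm, Channel-mixing back to C channels, and a residual."""

    def __init__(self, channels: int, k: int = 1):
        super().__init__()
        self.or1 = ORMLP(channels, first=("width", k), second=("height", k))   # LR then TB
        self.or2 = ORMLP(channels, first=("height", -k), second=("width", k))  # BT then LR
        self.norm = nn.LayerNorm(2 * channels)
        self.cm = nn.Conv2d(2 * channels, channels, kernel_size=1)  # Channel-mixing

    def forward(self, x):
        y = torch.cat([self.or1(x), self.or2(x)], dim=1)          # (B, 2C, H, W)
        y = self.norm(y.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)  # LN over channels
        return self.cm(y) + x                                     # back to (B, C, H, W)
```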
Lo2 Block and Feature Incentive Block

The DOR-MLP module captures the global, linear long-range dependencies along four directions in two-dimensional space, but it lacks local context information. We argue that better integrating local information and global dependencies is crucial for performance improvement. Depthwise Separable Convolution (DSC) is a natural choice (Chollet 2017), because its very low parameter and computational costs make it compatible with DOR-MLP. It is well established that the Channel-mixing in MLP-Mixer, the MLP in ViT, and the R-MLP in this paper are all equivalent to the standard 1×1 convolution in CNN, which allows feature interaction between different channels.
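For reference, a DSC factorizes a dense convolution into a per-channel spatial convolution and a pointwise channel projection; the minimal sketch below uses the usual 3×3 depthwise kernel, which is an assumption rather than a detail confirmed by this excerpt.

```python
import torch.nn as nn

class DSC(nn.Module):
    """Depthwise separable convolution (Chollet 2017): a depthwise 3x3 convolution
    (groups = in_channels) followed by a pointwise 1x1 convolution. For C input and
    output channels it costs 9C + C^2 weights versus 9C^2 for a dense 3x3 conv."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size=3,
                                   padding=1, groups=in_channels)
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
```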
[Figure 4 panels, left to right: (a) Image+GT, (b) Ours, (c) DconnNet, (d) UNeXt, (e) U-Net, (f) Att-UNet, (g) UCTransNet, (h) UNet++]

Figure 4: Qualitative comparison of Rolling-Unet with other state-of-the-art methods. From top to bottom: the BUSI, GlaS, and ISIC 2018 datasets. The first column is the original image, with the green contour indicating the Ground Truth. In the visualized segmentation results, purple indicates over-segmentation and yellow indicates under-segmentation.

[Companion figure panels, left to right: (a) Image, (b) GT, (c) Ours, (d) DconnNet, (e) UNeXt, (f) U-Net, (g) Att-UNet, (h) UCTransNet]
online data augmentations: random rotation and flipping. We trained for 400 epochs in total.

Comparison with State-of-the-Art Methods

We evaluated Rolling-Unet against other state-of-the-art methods, including CNN-based methods: U-Net (Ronneberger, Fischer, and Brox 2015), UNet++ (Zhou et al. 2018), Att-UNet (Oktay et al. 2018), and DconnNet (Yang and Farsiu 2023); Transformer-based methods: UCTransNet (Wang et al. 2022) and MedT (Valanarasu et al. 2021); and an MLP-based method: UNeXt (Valanarasu and Patel 2022). MedT failed to produce results on the GlaS, ISIC 2018 (image size = 512), and CHASEDB1 datasets due to memory constraints. Similarly, UNet++ did not yield results on the CHASEDB1 dataset. To fully demonstrate the efficiency of Rolling-Unet, we trained it at different sizes: with the channel number C = 16 / 32 / 64 in Figure 1, the models are named Rolling-Unet (S) / Rolling-Unet (M) / Rolling-Unet (L), respectively. We adopted Intersection over Union (IoU), F1 score, and 95% Hausdorff Distance (HD95) as evaluation metrics.
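For concreteness, IoU and F1 reduce to simple overlap counts on binary masks; the NumPy illustration below is ours, not the paper's evaluation code. HD95 is typically computed with an existing implementation such as medpy.metric.binary.hd95.

```python
import numpy as np

def iou_and_f1(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7):
    """IoU and F1 (Dice) for binary segmentation masks of the same shape."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    iou = inter / (union + eps)
    f1 = 2 * inter / (pred.sum() + gt.sum() + eps)
    return iou, f1
```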
The evaluation results on BUSI and GlaS are presented in Table 1. The results on ISIC 2018 are shown in Tables 2 and 3. The results on CHASEDB1 are shown in Table 4. Our method outperformed all the other methods on all datasets. Especially on BUSI and ISIC 2018, Rolling-Unet obtained a significant advantage. In these two datasets, many targets have blurry boundaries, which makes them difficult to distinguish from the background; Rolling-Unet more effectively extracted remote dependencies to enhance segmentation performance. The experiment of changing the image size on ISIC 2018 further verified this conclusion: only Rolling-Unet and UNeXt maintained similar performance when the image size increased, while the other methods declined to different degrees. For the phenomenon that the metrics of the larger Rolling-Unet variants are lower than those of Rolling-Unet (S) on ISIC 2018, we have two hypotheses. One is training fluctuation, whose impact should be reduced by averaging multiple results. The other is that the semantic information of this dataset is relatively simple, so more network parameters are prone to overfitting, thereby reducing performance; recent lightweight models (Valanarasu and Patel 2022; Ruan et al. 2023; Cheng et al. 2023) also indirectly support this point. In follow-up work, we will investigate this phenomenon through more experiments.

On GlaS and CHASEDB1, no method achieved a significant advantage, but Rolling-Unet was still the best, with a small standard deviation. The images in GlaS have dense,
Acknowledgements

This research was supported by the National Key R&D Program of China (2022YFF0607503).

References

Azad, R.; Arimond, R.; Aghdam, E. K.; Kazerouni, A.; and Merhof, D. 2022. DAE-Former: Dual attention-guided efficient transformer for medical image segmentation. arXiv preprint arXiv:2212.13504.

Cao, H.; Wang, Y.; Chen, J.; Jiang, D.; Zhang, X.; Tian, Q.; and Wang, M. 2023. Swin-Unet: Unet-Like Pure Transformer for Medical Image Segmentation. In Karlinsky, L.; Michaeli, T.; and Nishino, K., eds., Computer Vision – ECCV 2022 Workshops, 205–218. Cham: Springer Nature Switzerland. ISBN 978-3-031-25066-8.

Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; and Zagoruyko, S. 2020. End-to-end object detection with transformers. In European Conference on Computer Vision, 213–229. Springer.

Chen, J.; Lu, Y.; Yu, Q.; Luo, X.; Adeli, E.; Wang, Y.; Lu, L.; Yuille, A. L.; and Zhou, Y. 2021. TransUNet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306.

Cheng, J.; Gao, C.; Wang, F.; and Zhu, M. 2023. SegNetr: Rethinking the local-global interactions and skip connections in U-shaped networks. arXiv preprint arXiv:2307.02953.

Chollet, F. 2017. Xception: Deep Learning With Depthwise Separable Convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Çiçek, Ö.; Abdulkadir, A.; Lienkamp, S. S.; Brox, T.; and Ronneberger, O. 2016. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016: 19th International Conference, Athens, Greece, October 17-21, 2016, Proceedings, Part II 19, 424–432. Springer.

Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Hou, Q.; Jiang, Z.; Yuan, L.; Cheng, M.-M.; Yan, S.; and Feng, J. 2022. Vision Permutator: A permutable MLP-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(1): 1328–1334.

Huang, X.; Deng, Z.; Li, D.; and Yuan, X. 2021. MISSFormer: An Effective Medical Image Segmentation Transformer. arXiv preprint arXiv:2109.07162.

Jha, D.; Riegler, M. A.; Johansen, D.; Halvorsen, P.; and Johansen, H. D. 2020. DoubleU-Net: A deep convolutional neural network for medical image segmentation. In 2020 IEEE 33rd International Symposium on Computer-Based Medical Systems (CBMS), 558–564. IEEE.

Jha, D.; Smedsrud, P. H.; Riegler, M. A.; Johansen, D.; De Lange, T.; Halvorsen, P.; and Johansen, H. D. 2019. ResUNet++: An advanced architecture for medical image segmentation. In 2019 IEEE International Symposium on Multimedia (ISM), 225–2255. IEEE.

Lian, D.; Yu, Z.; Sun, X.; and Gao, S. 2021. AS-MLP: An axial shifted MLP architecture for vision. arXiv preprint arXiv:2107.08391.

Lin, Y.; Fang, X.; Zhang, D.; Cheng, K.-T.; and Chen, H. 2023. A Permutable Hybrid Network for Volumetric Medical Image Segmentation. arXiv preprint arXiv:2303.13111.

Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; and Guo, B. 2021. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10012–10022.

Milletari, F.; Navab, N.; and Ahmadi, S.-A. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), 565–571. IEEE.

Oktay, O.; Schlemper, J.; Folgoc, L. L.; Lee, M.; Heinrich, M.; Misawa, K.; Mori, K.; McDonagh, S.; Hammerla, N. Y.; Kainz, B.; et al. 2018. Attention U-Net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999.

Pinkus, A. 1999. Approximation theory of the MLP model in neural networks. Acta Numerica, 8: 143–195.

Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, 234–241. Springer.

Rosenblatt, F. 1957. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory.

Ruan, J.; Xie, M.; Gao, J.; Liu, T.; and Fu, Y. 2023. EGE-UNet: An Efficient Group Enhanced UNet for skin lesion segmentation. arXiv preprint arXiv:2307.08473.

Tang, C.; Zhao, Y.; Wang, G.; Luo, C.; Xie, W.; and Zeng, W. 2022. Sparse MLP for image recognition: Is self-attention really necessary? In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2344–2351.

Tolstikhin, I. O.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. 2021. MLP-Mixer: An all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34: 24261–24272.

Tomar, N. K.; Jha, D.; Riegler, M. A.; Johansen, H. D.; Johansen, D.; Rittscher, J.; Halvorsen, P.; and Ali, S. 2022. FANet: A Feedback Attention Network for Improved Biomedical Image Segmentation. IEEE Transactions on Neural Networks and Learning Systems, 1–14.

Valanarasu, J. M. J.; Oza, P.; Hacihaliloglu, I.; and Patel, V. M. 2021. Medical Transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24, 36–46. Springer.

Valanarasu, J. M. J.; and Patel, V. M. 2022. UNeXt: MLP-Based Rapid Medical Image Segmentation Network. In Wang, L.; Dou, Q.; Fletcher, P. T.; Speidel, S.; and Li, S., eds., Medical Image Computing and Computer Assisted Intervention – MICCAI 2022, 23–33. Cham: Springer Nature Switzerland. ISBN 978-3-031-16443-9.

Wang, H.; Cao, P.; Wang, J.; and Zaiane, O. R. 2022. UCTransNet: Rethinking the skip connections in U-Net from a channel-wise perspective with Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 2441–2449.

Yang, Z.; and Farsiu, S. 2023. Directional Connectivity-Based Segmentation of Medical Images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11525–11535.

Yu, T.; Li, X.; Cai, Y.; Sun, M.; and Li, P. 2022. S2-MLP: Spatial-shift MLP architecture for vision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 297–306.

Zhou, Z.; Rahman Siddiquee, M. M.; Tajbakhsh, N.; and Liang, J. 2018. UNet++: A nested U-Net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4, 3–11. Springer.