
TransUNext: towards a more advanced U-shaped framework for automatic vessel segmentation in the fundus image

Xiang Li1, Mingsi Liu1, and Lixin Duan1

1 Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110, China

arXiv:2411.02724v1 [eess.IV] 5 Nov 2024

Abstract. Purpose: Automatic and accurate segmentation of fundus vessel images has become an essential prerequisite for computer-aided diagnosis of ophthalmic diseases such as diabetes mellitus. High-precision retinal vessel segmentation remains difficult due to the low contrast between the branch ends of retinal vessels and the background, the long and thin vessel spans, and the variable morphology of the optic disc and optic cup in fundus vessel images.
Methods: We propose TransUNext, a more advanced U-shaped architecture for a hybrid Transformer and CNN, which integrates an Efficient Self-attention Mechanism into the encoder and decoder of U-Net to capture both local features and global dependencies with minimal computational overhead. Meanwhile, the Global Multi-Scale Fusion (GMSF) module is introduced to upgrade the skip connections, fuse high-level semantic and low-level detailed information, and eliminate high- and low-level semantic differences. Inspired by ConvNeXt, the TransNeXt Block is designed to optimize the computational complexity of each base block in U-Net and to avoid the information loss caused by dimension compression when information is converted between feature spaces of different dimensions.
Results: We evaluated the proposed method on four public datasets: DRIVE, STARE, CHASE-DB1, and HRF. The AUC (area under the ROC curve) values were 0.9867, 0.9869, 0.9910, and 0.9887, respectively, exceeding other state-of-the-art (SOTA) methods. In addition, evaluation metrics such as CAL (connectivity-area-length) were used in the ablation study to quantify the segmentation of coarse and fine vessels.
Conclusion: The quantitative and qualitative results validate the superior performance of the proposed method for retinal vessel segmentation and demonstrate its robustness in challenging situations, such as lesions or fine vessels in the optic disc region. The source code is available at https://github.com/lixiang007666/TransUNext.

Keywords: Fundus image, vessel segmentation, TransNeXt Block, Global Multi-Scale Fusion

1 Introduction
The essence of automatic segmentation of fundus vessel images is to separate the vascular pixels in an image from the surrounding background pixels. In clinical practice, fundus vessel images are complex and require experienced professionals to segment them manually, which is subjective and inefficient. With the explosion of fundus image data, computer-aided automatic segmentation of vessel networks in fundus images therefore has crucial clinical value [17].
Currently, automatic segmentation methods for fundus vessel images fall into two main categories: unsupervised methods and supervised methods, the latter including machine learning and deep learning strategies.
Among unsupervised methods, Chaudhuri et al. [2] successfully implemented blood vessel segmentation of fundus images using a two-dimensional Gaussian matched filter. Afterward, segmentation methods based on vessel morphology and particular pixel properties appeared. For example, Yang et al. [43] proposed a morphological processing method, which first enhances vessel features and suppresses background information and then applies fuzzy clustering to achieve vessel segmentation; Zhao et al. [44] proposed a segmentation method based on a deformable model, which uses regional information of different vessel types to achieve segmentation; Li et al. [13] optimized the matched filtering method and applied it to the vessel segmentation task. Unsupervised segmentation methods are fast, but the segmentation results are rough and of low accuracy.
Supervised machine learning segmentation methods extract vessel feature information better by training models on manually labeled images. Staal et al. [23] and Soares et al. [22] used a two-dimensional filter
to extract the overall features of the retinal image and then used naive Bayes to
classify the retinal background and blood vessels. Ricci et al. [20] first extracted
the green channel of the fundus image during image preprocessing and then used
SVM to segment according to the difference in vessel width. Fraz et al. [6] pro-
posed combining AdaBoost and Bagging models, integrating the results of com-
plex feature extraction and the results of binary classification models, and using
the supervised method to segment retinal vessel images automatically. Although supervised machine learning methods improve accuracy, the hand-crafted algorithms cannot adapt to the shape, scale, and geometric transformations of blood vessels; accuracy and robustness therefore remain limited when segmenting small vessels and vessel intersections, and it is difficult to provide an objective basis for clinical diagnosis.
With the advent of CNNs, deep-learning-based semantic segmentation methods can accurately predict vessel and non-vessel pixels and provide descriptions of vessel scale, shape, curvature, and other information. Among medical image semantic segmentation methods, U-Net [21] is considered a very successful network, consisting of convolutional encoding and decoding units; it can be trained on only a few samples and still perform segmentation tasks well. Its
derivative works [18,19,37,41] have also achieved advanced retinal blood vessel
segmentation results. In order to further improve the accuracy of retinal blood
vessel segmentation, Wang B et al. [26] proposed a variant of U-Net with dual
encoders to capture richer context features. Wang et al. [30] also proposed an improved end-to-end U-Net framework that uses techniques including the squeeze-and-excitation (SE) module, residual modules, and a recurrent structure, and introduces enhanced super-resolution generative adversarial networks (ESRGAN) [34] and improved data augmentation methods to achieve retinal blood vessel segmentation. Wang B et al. [31] also provided a coarse-to-fine supervision framework for retinal vessel segmentation.
Nevertheless, CNN-based approaches cannot model long-range dependencies due to inherent inductive biases such as locality and translation equivariance. Thus the Transformer [25], which relies purely on attention mechanisms to build global dependencies without any convolutional operations, has emerged as an alternative architecture that provides better performance than CNNs in computer vision (CV) when pre-trained on large-scale datasets. Vision Transformer (ViT) [5] revolutionized the CV field by splitting images into a sequence of tokens and modeling their global relationships with stacked Transformer blocks. Swin Transformer [15] produces hierarchical feature representations within shifted windows at low computational complexity, achieving state-of-the-art performance in various CV tasks. However, medical image datasets are much smaller than the pre-training datasets used in the above work (e.g., ImageNet-21k and JFT-300M). As a result, the Transformer produces unsatisfactory performance in medical image segmentation. Therefore, many hybrid structures combining CNN and Transformer have emerged; they combine the advantages of both and are gradually becoming a compromise solution for medical image segmentation without needing pre-training on large datasets.
We summarize several popular hybrid architectures based on Transformer and
CNN in medical image segmentation. These hybrid architectures add the Trans-
former to a CNN-based backbone model or replace some architecture compo-
nents. For example, UNETR [9] uses an encoder-decoder architecture where the encoder is composed of cascaded pure Transformer blocks and the decoder consists of stacked convolutional layers, see Fig. 1(a). TransBTS [33] and TransUNet [3] introduce a Transformer between the CNN-based encoder and decoder, see Fig. 1(b). CoTr [39] bridges all stages from the encoder to the decoder through the Transformer, not only the adjacent stages, allowing it to exploit global dependencies at multiple scales, see Fig. 1(c). Furthermore, nnFormer [45,1] interweaves Transformer and convolutional blocks into a hybrid model in which convolution encodes precise spatial information while the Transformer captures global context, see Fig. 1(d). As seen from Fig. 1, these architectures implement a serial combination of Transformer and CNN from a macroscopic perspective. However, in these combinations, the convolutional and self-attention mechanisms cannot be applied throughout the network structure, making it challenging to model local and global features efficiently.

Fig. 1. Comparison of several hybrid architectures of Transformer and CNN. (Color figure online)

In this paper, we propose TransUNext, a more advanced U-shaped architecture of a hybrid Transformer for fundus vessel image segmentation, whose pattern is shown in Fig. 1(e) and which combines the advantages of both Fig. 1(c) and Fig. 1(d). It can take full advantage of global dependencies at multiple scales, fuse high-level semantic and low-level detail information, and efficiently aggregate local and global features. In addition, inspired by ConvNeXt [16], we also optimize each base convolutional block in the U-shaped architecture to maintain excellent segmentation capability even at low computational complexity.
We found that TransUNext can further improve the accuracy of vessel segmentation, with the following main contributions:

– The TransNeXt Block is proposed as the base block of the U-shaped architecture. This base block avoids the information loss caused by dimension compression when information is converted between feature spaces of different dimensions and uses self-attention to capture global dependencies. At the same time, it avoids large-scale pre-training of the Transformer and has negligible computational overhead.
– The skip connections of U-Net are upgraded with the Global Multi-Scale Fusion (GMSF) module, which propagates information through all-to-all attention over all tokens at each scale to achieve semantic and spatial global multi-scale fusion, making full use of high-level semantic and low-level detail information.
– The proposed method was evaluated on four public datasets, DRIVE, STARE, CHASE-DB1, and HRF, where the images of HRF are of high resolution. Besides general evaluation metrics such as AUC, we used CAL, which is specific to the retinal vessel segmentation task [7].

The rest of this paper is organized as follows: Section II introduces the proposed method in detail. Section III describes the experimental implementation and illustrates the experimental results. Section IV gives the conclusions of this paper.

2 Proposed method

2.1 Dataset Pre-processing

Many samples in the fundus vessel image datasets have poor contrast and high noise. Therefore, proper preprocessing is crucial for subsequent training. This
paper uses four preprocessing methods, including Grayscale Transformation,
Data Normalization, Contrast Limited Adaptive Histogram Equalization (CLAHE),
and Gamma Correction [7,8,35], to process each original fundus vessel image.
Fig. 2 provides a schematic diagram of the phased processing results of the orig-
inal color retinal images after Grayscale Transformation, CLAHE, and Gamma
Correction. As can be seen from the figure, the preprocessed images have clear
texture, prominent edges, and enhanced detail information.

Fig. 2. The flowchart of data preprocessing.
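As a reference for how these steps fit together, the following is a minimal sketch of the preprocessing chain using OpenCV and NumPy; the function name, the CLAHE clip limit and tile size, and the gamma value are illustrative assumptions rather than the exact settings used in the paper.

```python
import cv2
import numpy as np

def preprocess_fundus(rgb_image: np.ndarray, gamma: float = 1.2) -> np.ndarray:
    """Grayscale transform -> normalization -> CLAHE -> gamma correction."""
    # 1. Grayscale transformation.
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)

    # 2. Data normalization to zero mean / unit variance, then rescale to [0, 255].
    norm = (gray - gray.mean()) / (gray.std() + 1e-8)
    norm = cv2.normalize(norm, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # 3. Contrast Limited Adaptive Histogram Equalization (CLAHE).
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(norm)

    # 4. Gamma correction via a lookup table.
    lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255 for i in range(256)],
                   dtype=np.uint8)
    return cv2.LUT(equalized, lut)
```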

Because of the relatively small amount of data in the fundus vessel image datasets, data augmentation was performed to reduce overfitting. The retinal vessel image in each dataset is a circular region, so randomly rotating the image by a fixed angle can simulate different acquisition environments without changing the structure of the image itself. In addition, 15,000 randomly cropped patches of size 128 × 128 were extracted from each training image of the DRIVE, STARE, CHASE-DB1, and HRF datasets, and the corresponding ground truth was processed identically. The augmented image data are shown in Fig. 3.
Unlike the cropping method in the training phase, in the testing phase, each
image block needs to be re-stitched into a complete image and then binarized
to obtain the segmentation result. All patches must be stitched to restore their
resolution to the level of the original fundus image. However, the time and space
complexity of stitching based on the index is extremely high if random cropping
is used. To avoid this problem, we use overlapping cropping in the testing phase.
The step size was set to 12 based on workstation performance trade-offs.

Fig. 3. Illustration of random cropping. (a) patches from the original image (b) patches
of ground truth.

2.2 Overall Architecture

An overview of the TransUNext architecture is illustrated in Fig. 4(a). Tran-


sUNext follows the design of a U-shaped network encoder and decoder, mainly
consisting of pure convolutional modules and TransNeXt Blocks. While our orig-
inal intention was to build a fully hybrid architecture consisting of Transformer
and ConvNeXt, in our implementation, cascaded pure convolutional modules and
downsampling operations are introduced to gradually extract low-level features
with high resolution and obtain delicate spatial information. Similarly, these
pure convolutional modules are deployed in the same stage of the decoder to
recover the original image dimension by upsampling. Note that we do not apply
the Transformer on the original resolution, as adding the Transformer module
in the very shallow layers of the network does not help in experiments but in-
troduces additional computation. A possible reason is that the shallow layers
of the network focus more on detailed textures, where gathering global context
may not be informative [14].
It is worth noting that TransUNext does not use a simple skip connection be-
tween the encoder and decoder. Instead, Global Multi-Scale Fusion (GMSF),
which propagates information through all-to-all attention in all tokens at each
scale, enables semantic and spatial global multi-scale fusion, which can fully use
high-level semantic and low-level detail information.
Given an input volume $x \in \mathbb{R}^{H \times W \times D}$, where $H$, $W$ and $D$ denote the height, width, and depth, respectively, we first utilize several pure convolution modules to obtain feature maps $f \in \mathbb{R}^{\frac{H}{2^{N_1}} \times \frac{W}{2^{N_1}} \times \frac{D}{2^{N_1}} \times 2^{N_1}C}$, where $N_1$ and $C$ denote the number of modules and the number of base channels, respectively. Afterwards, hybrid modules consisting of Transformer and ConvNeXt are applied to model the hierarchical representation from local and global features. The procedure is repeated $N_2$ times, with $\frac{H}{2^{N_1+N_2}} \times \frac{W}{2^{N_1+N_2}} \times \frac{D}{2^{N_1+N_2}}$ as the output resolution and $2^{N_1+N_2}C$ as the channel number. The architecture of TransUNext is straightforward and adjustable: the number of each module, i.e., $N_1$ and $N_2$, can be chosen according to the medical image segmentation task.


Fig. 4. (a) The architecture of TransUNext; (b) hybrid block consisting of Transformer and ConvNeXt (TransNeXt Block).

2.3 TransNeXt Block


The hybrid modules are deployed in the deep stages of TransUNext, where the
TransNeXt Block, as its heart, achieves aggregation of local and global rep-
resentations by ConvNeXt and Transformer. Fig. 4(b) shows the structure of
the TransNeXt Block from bottom to top, and the essential components include
DepthWise (DW) Conv, LayerNorm, 1×1 Conv, GELU, multi-head self-attention
(MHSA), and three residual connections. We use this structure as the base block
of the U-shaped architecture, which captures local features and avoids the in-
formation loss caused by compressing dimensions when the input is transformed
between feature spaces of different dimensions. In addition, Transformer’s self-
attention mechanism can efficiently capture global dependencies.
Revisiting ConvNeXt. Liu et al. [16] proposed ConvNeXt, a convolutional neural network for the 2020s. Drawing on ideas from Swin Transformer, ResNeXt [38], and MobileNet [10], the authors adjust, at the macro level, the amount of computation in different stages and adopt convolutions with large kernels and grouped convolutions. In the micro design, the ReLU activation function is replaced by GELU, fewer activation functions and normalization layers are used, and LayerNorm (LN) replaces BatchNorm (BN). The network outperforms the Swin-T model in many classification and recognition tasks, achieving the best performance.
Specifically, as shown in Fig. 4(b), the input first passes through the DepthWise Conv with kernel size 7, stride 1, and padding 3, which yields better results with fewer parameters than the standard base convolution block. At the channel level, the size changes from C to 4C and then decreases back to C. The ConvNeXt block adopts this small-large-small dimension pattern to avoid the information loss caused by dimension compression when information is converted between feature spaces of different dimensions.
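To make the small-large-small channel pattern concrete, below is a minimal PyTorch sketch of the convolutional path of the block (7×7 depth-wise convolution, LayerNorm, 1×1 expansion to 4C, GELU, 1×1 projection back to C, plus a residual connection); class and layer names are our own, and the exact placement of normalization and residuals in TransUNext may differ.

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Local-feature path of the TransNeXt Block: C -> 4C -> C with a residual."""
    def __init__(self, dim: int):
        super().__init__()
        # 7x7 depth-wise convolution (stride 1, padding 3) keeps the spatial size.
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, stride=1, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)              # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)     # 1x1 conv expressed as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)     # project back to C

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                  # (N, C, H, W) -> (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                  # back to (N, C, H, W)
        return x + residual
```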
Efficient Self-attention Mechanism. As can be found in Fig. 4(b), the TransNeXt
Block is built based on the core MHSA module. This module allows the model to
infer attention from different representation subspaces jointly. The results from
multiple heads are concatenated and then transformed with a feed-forward net-
work. In this paper, we use four heads; the detailed multi-head dimensions can be found in the code.
As shown in Fig. 5, three 1×1 convolutions are used to project $X$ to query, key, and value embeddings $Q, K, V \in \mathbb{R}^{d \times H \times W}$, where $d$ is the embedding dimension of each head. $Q$, $K$, and $V$ are then flattened and transposed into sequences of size $n \times d$, where $n = HW$. The self-attention mechanism is implemented by the dot
product of vectors and softmax. However, images are highly structured data,
and most pixels in a high-resolution feature map have similar features in the
neighborhood except for the boundary regions. Therefore, pairwise attention
computation in the whole image is unnecessary [32].
So we use an Efficient Self-attention Mechanism corresponding to the Sub-
Sample component in Fig. 5. The main idea is to use two projections for keys
and values, and the processed self-attention is computed as follows:

$$\mathrm{Attention}(Q, K', V') = \mathrm{softmax}\left(\frac{Q K'^{\top}}{\sqrt{d}}\right) V' \qquad (1)$$

where $K, V \in \mathbb{R}^{n \times d}$ are projected into low-dimensional embeddings $K', V' \in \mathbb{R}^{k \times d}$, with $k = hw \ll n$, where $h$ and $w$ are the reduced sizes of the feature map after sub-sampling. The shape of $\mathrm{softmax}\left(\frac{Q K'^{\top}}{\sqrt{d}}\right)$ is $n \times k$, and the shape of $V'$ is $k \times d$.
By doing so, the computational complexity is reduced to $O(nkd)$, instead of the $O(n^2 d)$ complexity of the plain dot-product attention. Notably, the projection to a low-dimensional embedding can be any down-sampling operation, such as strided convolution or average/max pooling. This matters because, for a high-resolution fundus image dataset like HRF, $n$ is much larger than $d$ owing to the high resolution of its feature maps; the sequence length then dominates the self-attention computation and makes it infeasible to apply plain self-attention to high-resolution feature maps. In our implementation, we use a 1×1 convolution followed by a simple interpolation to down-sample the feature map.
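For illustration, here is a minimal PyTorch sketch of the efficient self-attention of Eq. (1): queries keep the full resolution, while keys and values are computed by a 1×1 projection followed by interpolation, so that attention costs $O(nkd)$ rather than $O(n^2 d)$. The four heads match the description above, but the module name, reduction ratio, and interpolation mode are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    """Self-attention with sub-sampled keys/values, following Eq. (1)."""
    def __init__(self, dim: int, heads: int = 4, reduction: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads = heads
        self.scale = (dim // heads) ** -0.5
        self.reduction = reduction
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_k = nn.Conv2d(dim, dim, 1)
        self.to_v = nn.Conv2d(dim, dim, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        q = self.to_q(x)                              # full resolution: n = H*W tokens
        # 1x1 projection followed by interpolation sub-samples keys/values: k = h*w << n tokens.
        k = F.interpolate(self.to_k(x), scale_factor=1 / self.reduction,
                          mode="bilinear", align_corners=False)
        v = F.interpolate(self.to_v(x), scale_factor=1 / self.reduction,
                          mode="bilinear", align_corners=False)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (B, C, h, w) -> (B, heads, d, h*w), with d = C // heads
            return t.flatten(2).reshape(B, self.heads, C // self.heads, -1)

        q, k, v = split_heads(q), split_heads(k), split_heads(v)
        attn = torch.softmax(q.transpose(-2, -1) @ k * self.scale, dim=-1)  # (B, heads, n, k)
        out = attn @ v.transpose(-2, -1)                                    # (B, heads, n, d)
        out = out.transpose(-2, -1).reshape(B, C, H, W)
        return self.proj(out)
```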

2.4 Global Multi-Scale Fusion

Fig. 6 shows the spatially and semantically Global Multi-Scale Fusion (GMSF) module, which integrates information from multi-scale semantic feature maps with a small computational overhead.


Fig. 5. Efficient multi-head self-attention (MHSA). (a) The MHSA used in the Trans-
former encoder. (b) The MHSA used in the Transformer decoder. They share similar
concepts, but (b) takes two inputs, including the high-resolution features from GMSF
of the encoder, and the low-resolution features from the decoder.

In the decoder, TransUNext progressively recovers the resolution through TransNeXt Blocks and upsampling layers and combines the result with the high-resolution feature maps from the encoder through GMSF, in place of the traditional U-Net skip connections, to finally output the segmentation maps.
Global Multi-Scale Fusion plays a crucial role in dense prediction tasks by combining high-level semantic and low-level detail information, and the feature maps output by the TransNeXt Blocks are naturally used as its inputs. First, 2D feature maps from multiple scales, $Map_1, Map_2, \ldots, Map_n$, are accepted; Map Fusion gathers the feature maps from all scales, obtained by concatenating and reshaping the multiple inputs. The result is then sent to the Transformer block for multi-scale semantic fusion. Finally, the fused Multi-Scale Map is chunked and reshaped back into 2D feature maps. In contrast to a plain skip connection, the proposed approach propagates information to all tokens at each scale by all-to-all attention to form a semantic and spatial Global Multi-Scale Fusion; i.e., the "skip connection" now has an attention mechanism that can eliminate high- and low-level semantic differences and improve feature reuse.
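The all-to-all fusion can be sketched as follows: each scale is projected to a common token width, the tokens from every scale are concatenated and passed through a standard Transformer encoder layer, and the fused tokens are split and reshaped back into per-scale 2D maps. The per-scale 1×1 projections, the shared token width, and the use of nn.TransformerEncoderLayer are our assumptions; the original GMSF block in Fig. 6 uses its own LayerNorm/MHSA/MLP arrangement.

```python
import torch
import torch.nn as nn

class GlobalMultiScaleFusion(nn.Module):
    """Fuse feature maps from several scales with all-to-all self-attention."""
    def __init__(self, channels, token_dim=256, heads=4):
        super().__init__()
        # Project every scale to a shared token width so tokens can be concatenated.
        self.in_proj = nn.ModuleList(nn.Conv2d(c, token_dim, 1) for c in channels)
        self.out_proj = nn.ModuleList(nn.Conv2d(token_dim, c, 1) for c in channels)
        self.mixer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=heads,
                                                dim_feedforward=4 * token_dim,
                                                batch_first=True)

    def forward(self, maps):
        # maps: list of (B, C_i, H_i, W_i) feature maps, one per scale.
        shapes, tokens = [], []
        for proj, m in zip(self.in_proj, maps):
            t = proj(m)                                  # (B, D, H_i, W_i)
            shapes.append(t.shape[-2:])
            tokens.append(t.flatten(2).transpose(1, 2))  # (B, H_i*W_i, D)
        fused = self.mixer(torch.cat(tokens, dim=1))     # all-to-all attention over all scales
        outs, start = [], 0
        for proj, (h, w) in zip(self.out_proj, shapes):
            n = h * w
            chunk = fused[:, start:start + n].transpose(1, 2).reshape(fused.size(0), -1, h, w)
            outs.append(proj(chunk))                     # back to the original channel count
            start += n
        return outs
```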

2.5 Evaluation metrics

To evaluate the performance of the proposed method, several metrics are used in
this paper, including Accuracy (Acc), Specificity (SP ), Sensitivity (SE ), Preci-
sion, F1-Score, AUC (area under the ROC curve) and CAL (connectivity-area-
length) [7,27].


Fig. 6. The proposed Global Multi-Scale Fusion (GMSF) module.

Acc: It measures the proportion of correctly classified pixels, a representative


evaluation metric for vessel segmentation tasks.

$$Acc = \frac{TP + TN}{TP + FP + FN + TN} \qquad (2)$$
SP : It refers to the proportion of correctly classified non-vessel pixels to actual
non-vessel pixels.

$$SP = \frac{TN}{TN + FP} \qquad (3)$$
SE : It is also known as recall rate (Recall ), which refers to the proportion of
correctly classified blood vessel pixels to actual blood vessel pixels.

$$SE = \frac{TP}{TP + FN} \qquad (4)$$
Precision: It refers to the proportion of correctly classified vessel pixels among all pixels predicted as vessels.

$$Precision = \frac{TP}{TP + FP} \qquad (5)$$
F1-Score: It evaluates the binary classification model by considering both the model's precision and recall. The F1-score can be seen as the harmonic mean of precision and recall, showing satisfactory results when both SE and Precision are high.

$$F1\text{-}Score = 2 \cdot \frac{Precision \cdot SE}{Precision + SE} = \frac{TP}{TP + \frac{FP + FN}{2}} \qquad (6)$$

In addition, we plot the receiver operating characteristic (ROC) curve, which is generated with the true positive rate as the ordinate and the false positive rate as the abscissa. We also provide the area under the ROC curve (AUC), which considers SE and SP under different thresholds and is suitable for measuring retinal vessel segmentation.
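A minimal sketch of these pixel-wise metrics, computed from a probability map and a binary ground-truth mask, is shown below; scikit-learn's roc_auc_score is used for the AUC, and the 0.5 binarization threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def vessel_metrics(prob: np.ndarray, gt: np.ndarray, threshold: float = 0.5) -> dict:
    """Compute Acc, SP, SE, Precision, F1 and AUC from probability and binary masks."""
    pred = (prob.ravel() >= threshold).astype(np.uint8)
    gt = gt.ravel().astype(np.uint8)
    tp = np.sum((pred == 1) & (gt == 1))
    tn = np.sum((pred == 0) & (gt == 0))
    fp = np.sum((pred == 1) & (gt == 0))
    fn = np.sum((pred == 0) & (gt == 1))
    eps = 1e-8
    precision = tp / (tp + fp + eps)
    se = tp / (tp + fn + eps)                      # sensitivity / recall
    return {
        "Acc": (tp + tn) / (tp + tn + fp + fn + eps),
        "SP": tn / (tn + fp + eps),
        "SE": se,
        "Precision": precision,
        "F1": 2 * precision * se / (precision + se + eps),
        "AUC": roc_auc_score(gt, prob.ravel()),
    }
```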
Also, some metrics have been specially designed for vessel segmentation and widely used in previous works. For example, a set of metrics was proposed by Gegundez et al. [7] to evaluate the connectivity (C), overlapping area (A), and consistency of vessel length (L) of predicted vessels. The overall metric (F) is defined as

$$F(C, A, L) = C \times A \times L. \qquad (7)$$

In this way, the segmentation of coarse and fine vessels can be quantified more equally.

3 Experiments
3.1 Datasets
We evaluated our proposed method on four public datasets of fundus images,
including DRIVE, STARE, CHASE-DB1, and HRF. Fig. 7 shows some typical
cases from the four datasets.
1. DRIVE Dataset: The DRIVE dataset
(https://drive.grand-challenge.org/DRIVE) contains 40 fundus
retinal color images, seven of which are from patients with early diabetic
retinopathy, with a resolution of 565 × 584 and stored in JPEG format. The
original dataset uses 20 images for training and 20 for testing with masks,
and two experts manually annotated the dataset. In this paper, we divide
the dataset into a training set, a validation set, and a test set according to
the ratio of 18:2:20.
2. STARE Dataset: The STARE dataset
(http://cecas.clemson.edu/~ahoover/stare/probing) provides 20 fun-
dus color images with a resolution of 605 × 700. We use 15 of these images
for training and five for testing. The original dataset is not divided into a
validation set like the DRIVE dataset. Thus, we choose 10% of the training
data for validation. The STARE dataset also provides annotated images of
two experts.
3. CHASE-DB1 Dataset: The CHASE-DB1 dataset
(https://blogs.kingston.ac.uk/retinal/chasedb1) contains 28 color reti-
nal images with a resolution of 996 × 960. It was taken from the left and
right eyes of 14 children. We used 21 of these images for training and seven
for testing.
4. HRF Dataset: There are 45 fundus images in the HRF dataset (https://www5.cs.fau.de/fileadmin/research/datasets/fundus-images) with a resolution of 3504 × 2336, drawn from healthy subjects, diabetic retinopathy patients, and glaucoma patients. Thirty-eight of these images are taken as the training set, and the other seven images are taken as the test set.
The detailed division of the four datasets is shown in Table 1.


Fig. 7. Fundus images from four different datasets. (a) Original fundus image (b)
ground truth (c) mask.

Table 1. The division method of datasets for experiment.

Database Training Validation Test


DRIVE 18 images 2 images 20 images
STARE 13 images 2 images 5 images
CHASE-DB1 19 images 2 images 7 images
HRF 34 images 4 images 7 images
Note: Set up division based on experience and datasets requirements.

3.2 Implementation details


We implemented TransUNext under torch-1.9.0-cu11.1-cudnn8 and conducted
experiments with a single GeForce RTX 2080Ti.
In the training process, we set the maximum number of training epochs to 25,
batch size to 8, and initial learning rate to 0.0005. A binary cross-entropy loss
(BCE) is used as the objective function to supervise the model’s training process.
Adam is used as the optimizer for gradient updates. The DRIVE, STARE,
CHASE-DB1, and HRF datasets follow the same data augmentation strategy:
randomly extract image patches with a resolution of 128 × 128 from the prepro-
cessed images. In TransUNext, we empirically set the hyper-parameters [N1 ,N2 ]
to [1,3].
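For reference, the training configuration described above can be written in a few lines of PyTorch; the model here is a trivial stand-in for TransUNext, the batch is dummy data, and the exact BCE variant (with or without logits) is an assumption.

```python
import torch
import torch.nn as nn

# Trivial stand-in: any module mapping (B, 1, 128, 128) -> (B, 1, 128, 128) logits would do.
model = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(16, 1, 3, padding=1))
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)   # initial learning rate 0.0005
criterion = nn.BCEWithLogitsLoss()                          # binary cross-entropy objective

# One optimization step on a dummy batch of eight 128x128 patches with binary vessel masks.
patches = torch.rand(8, 1, 128, 128)
masks = torch.randint(0, 2, (8, 1, 128, 128)).float()
optimizer.zero_grad()
loss = criterion(model(patches), masks)
loss.backward()
optimizer.step()
```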
Unlike the training phase, we perform overlap cropping in the original image
with a fixed step size. Since these patches have overlapping areas (i.e., each pixel
appears multiple times in different patches), we averaged the probability value
of each pixel belonging to retinal blood vessels and set the threshold to obtain
a binarized prediction map. In addition, we use an early stopping mechanism to
prevent the occurrence of overfitting.
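Overlap-crop inference with probability averaging can be sketched as follows; `predict_patch` stands for a call into the trained network, and the patch size, stride, and threshold default to the values mentioned in the text where given and are otherwise assumptions.

```python
import numpy as np

def predict_full_image(image: np.ndarray, predict_patch, patch: int = 128,
                       stride: int = 12, threshold: float = 0.5) -> np.ndarray:
    """Slide a window over the image, average overlapping probabilities, then binarize.

    `predict_patch` is any callable mapping a (patch, patch) array to a
    (patch, patch) array of vessel probabilities. Assumes the image is at
    least patch x patch pixels.
    """
    h, w = image.shape[:2]
    prob_sum = np.zeros((h, w), dtype=np.float64)
    count = np.zeros((h, w), dtype=np.float64)
    ys = list(range(0, h - patch + 1, stride))
    xs = list(range(0, w - patch + 1, stride))
    if ys[-1] != h - patch:                  # make sure the bottom border is covered
        ys.append(h - patch)
    if xs[-1] != w - patch:                  # make sure the right border is covered
        xs.append(w - patch)
    for y in ys:
        for x in xs:
            prob_sum[y:y + patch, x:x + patch] += predict_patch(image[y:y + patch, x:x + patch])
            count[y:y + patch, x:x + patch] += 1
    prob = prob_sum / count                  # every pixel is averaged over its patches
    return (prob >= threshold).astype(np.uint8)
```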

3.3 Results
Ablation study We conducted an ablation study on the DRIVE dataset to
verify each module’s contribution to the entire model’s performance. As can
be seen from Table 2, the AUC, C, A, and L of TransUNext reached 0.9867,


99.85%, 93.58%, and 91.28%, respectively. Compared with the baseline model
in the DRIVE dataset, the performance of the final model is improved by 0.8%,
1.05%, 1.11%, and 1.09%, respectively.
Although the FLOPs and Params of TransUNext are higher than those of Base-
line, the increase in memory consumption and storage space is acceptable, con-
sidering the improvement in segmentation accuracy. Comparing the two rows "Baseline w/CNN and Trans block" and "Baseline + all (our TransUNext)", where the former follows the design of Fig. 1(e), we find that our proposed TransNeXt Block, which optimizes the computational complexity of each base block in U-Net, achieves higher performance with fewer parameters.

Table 2. Ablation study on the DRIVE dataset.

Methods AUC C(%) A(%) L(%) FLOPs/G Params/M


Baseline w/pure CNN block 0.9787 98.8 92.47 90.19 12.97 7.07
Baseline w/CNN and Trans block 0.9818 98.89 93.51 90.89 21.51 9.53
Baseline + GMSF 0.983 99.78 93.57 91.08 21.78 9.74
Baseline + TransNeXt 0.986 99.84 93.48 91.15 18.32 8.34
Baseline + all (our TransUNext) 0.9867 99.85 93.58 91.28 18.51 8.55
Note: For each metric, the bold value indicates that column’s best result. Baseline
is a standard U-Net.

To better illustrate the ablation study, Fig. 8 shows the ROC curves of TransUNext on the DRIVE test set. The visualization also shows that both GMSF and the TransNeXt Block enhance the segmentation performance, with AUC values of 0.9830 and 0.9860, respectively.

Results on four datasets Table 3 and Fig. 9 present our proposed method’s
partial segmentation results on four datasets. Fig. 9(d) and Fig. 9(c) illustrate
typical grayscale segmentation and binarization results on the four datasets. As
seen from the table and figure, the segmentation results of our proposed TransUNext are very close to the ground truth, extracting the main vessels from the background and correctly segmenting the vessel edges. TransUNext is shown to cope with the complexities of long and thin vessel spans and the variable morphology of the optic disc and optic cup in fundus images.

Comparisons with SOTA methods Comparison with SOTA methods on the DRIVE test dataset. We further compare the proposed method with existing state-of-the-art (SOTA) methods to verify its effectiveness. The comparison in Table 4 is based on the DRIVE dataset, and TransUNext is in the leading position in terms of AUC, ACC, SP, and SE.

Fig. 8. ROC curves of ablation study on the DRIVE dataset.

Table 3. Performance of the proposed method tested on four datasets.

Dataset AUC SP SE Precision F1-score Acc


DRIVE 0.9867 0.9937 0.8643 0.9465 0.9035 0.9697
STARE 0.9869 0.9907 0.8401 0.8959 0.8671 0.9737
CHASE-DB1 0.991 0.9966 0.8817 0.9566 0.9176 0.9741
HRF 0.9887 0.991 0.8857 0.9098 0.8976 0.9643


Fig. 9. Segmentation results of the proposed method on four datasets. (a) original
image, (b) pre-processed, (c) binarization, (d) proposed TransUNext and (e) ground
truth.

While AUC and ACC are slightly lower than those of [27] and [40], SP and SE reach the maximum.

Table 4. Comparison with SOTA methods on the DRIVE test dataset.

Methods Pub. DRIVE


AUC(%) ACC(%) SP(%) SE(%)
Maninis et al. [18] MICCAI 2016 98.01 95.41 - 82.8
JL-UNet [41] T-BME 2018 95.56 97.84 98.18 77.92
MS-NFN [37] MICCAI 2018 98.07 95.67 98.19 78.44
Yan et al. [42] J-BHI 2018 97.5 95.38 98.2 76.31
CE-Net [8] TMI 2019 97.79 95.45 - 83.09
Wang et al. [26] MICCAI 2019 97.72 95.67 - 79.4
DUNet [11] KBS 2019 98.02 95.66 98 79.63
CTF-Net [31] ISBI 2020 97.88 95.67 - 78.49
Wang et al. [29] J-BHI 2020 98.23 95.81 98.13 79.91
Li et al. [12] WACV 2020 98.16 95.73 98.38 77.35
CGA-Net [28] ISBI 2021 98.65 96.47 - 83.05
SCS-Net [36] MIA 2021 98.37 96.97 - 82.89
DA-Net [27] MICCAI 2022 99.03 97.07 - 85.57
Xu et al. [40] MICCAI 2022 98.9 97.05 98.28 84.41
TransUNet (Impl) [3] - 97.86 95.32 98.96 78.39
Swin-UNet (Impl) [1] - 97.78 95.4 98.81 80.05
TransUNext (Impl) - 98.67 96.97 99.37 86.43
Note: The best results are marked in bold and second best are marked in underline.
The last three lines of data are calculated after strictly following the code imple-
mentation of the network structure in the papers.

Comparison with SOTA methods on the STARE test dataset. To validate the generalization of TransUNext, Table 5 compares TransUNext with SOTA methods on the STARE dataset, where it also performs at a high level. Although our test results on the STARE dataset are weaker than those of [31], our performance on the DRIVE and CHASE-DB1 datasets is more leading.
Comparison with SOTA methods on the CHASE-DB1 test dataset.
The comparison in Table 6 is based on the CHASE-DB1 dataset, where the SP
and SE of TransUNext reach a maximum and the AUC value is slightly lower
than [40].
Comparison with SOTA methods on the HRF test dataset. Unlike the
above three datasets, the images of HRF have higher resolution, and many SOTA
methods have not been validated on this dataset. Table 7 shows the results of
comparing TransUNext with other methods on the HRF dataset, showing that
TransUNext can accurately segment the high-resolution fundus vascular images,
further demonstrating its generalization capability.
Qualitative visualizations of the proposed TransUNext and other SOTA methods. Fig. 10 compares the visualization results of several SOTA methods on four datasets.

Table 5. Comparison with SOTA methods on the STARE test dataset.

Methods Pub. STARE


AUC(%) ACC(%) SP(%) SE(%)
JL-UNet [41] T-BME 2018 98.01 96.12 98.46 75.81
Yan et al. [42] J-BHI 2018 98.33 96.38 98.57 77.35
DUNet [11] KBS 2019 98.32 96.41 98.78 75.95
CTF-Net [31] ISBI 2020 99.6 98.49 99.34 90.24
Wang et al. [29] J-BHI 2020 98.81 96.73 98.44 81.86
Li et al. [12] WACV 2020 98.81 97.01 98.86 77.15
SCS-Net [36] MIA 2021 98.77 97.36 98.39 82.07
TransUNet (Impl) [3] - 97.74 95.37 98.67 73.98
Swin-UNet (Impl) [1] - 98.32 96.43 98.64 79.57
TransUNext (Impl) - 98.69 97.37 99.07 84.01

Table 6. Comparison with SOTA methods on the CHASE-DB1 test dataset.

Methods Pub. CHASE-DB1


AUC(%) ACC(%) SP(%) SE(%)
Maninis et al. [18] MICCAI 2016 97.46 96.57 - 76.51
JL-UNet [41] T-BME 2018 97.81 96.1 98.09 76.33
MS-NFN [37] MICCAI 2018 98.25 96.37 98.47 75.38
Yan et al. [42] J-BHI 2018 97.76 96.07 98.06 76.41
CE-Net [8] TMI 2019 98.3 96.89 - 81.52
Wang et al. [26] MICCAI 2019 98.12 96.61 - 80.74
DUNet [11] KBS 2019 98.04 96.1 97.52 81.55
CTF-Net [31] ISBI 2020 98.47 96.48 - 79.48
Li et al. [12] WACV 2020 98.51 96.55 98.23 79.7
CGA-Net [28] ISBI 2021 98.12 97.06 - 86.78
SCS-Net [36] MIA 2021 98.67 97.44 - 83.65
DA-Net [27] MICCAI 2022 99.08 97.66 - 87.04
Xu et al. [40] MICCAI 2022 99.19 97.71 98.55 85.43
TransUNet (Impl) [3] - 97.78 95.4 98.8 72.04
Swin-UNet (Impl) [1] - 98.73 97.38 98.44 83.61
TransUNext (Impl) - 99.1 97.41 99.66 88.17

Table 7. Comparison with SOTA methods on the HRF test dataset.

Methods Pub. HRF


AUC(%) ACC(%) SP(%) SE(%)
FullyCRF [19] T-BME 2016 - 91.64 97.09 59.65
DRIS-GP [4] TIP 2019 - 94.25 98.98 66.11
SCS-Net [36] MIA 2021 98.42 96.87 98.23 81.14
Tan et al. [24] TMI 2022 - 95.9 98.86 78.53
TransUNet (Impl) [3] - 97.78 95.77 99.26 65.58
Swin-UNet (Impl) [1] - 97.81 95.83 98.85 80.99
TransUNext (Impl) - 98.87 96.43 99.1 88.57

Our method achieves satisfactory segmentation results, as shown in the locally zoomed-in images. Compared with other improved approaches, such as TransUNet [3] and Swin-UNet [1], our proposed TransUNext detects retinal vessels more correctly and reduces misclassified retinal vessel pixels. In addition, it achieves better recognition of tiny vessels and edge regions.

Discussion In TransUNext, the TransNeXt Block is used to improve the base block of the traditional U-Net, combining the advantages of Transformer and ConvNeXt while avoiding massive Transformer pre-training and keeping the computational overhead low. In addition, the GMSF mechanism replaces the skip connection of U-Net for semantic and spatial global multi-scale fusion, which allows
the overall architecture of TransUNext to follow the design of Fig. 1(e). Extensive
experiments have demonstrated that TransUNext brings significant performance
gains on retinal vessel segmentation tasks. In the future, we will further improve
the hybrid approach of Transformer and ConvNeXt modules, such as the parallel
hybrid architecture design of both, and explore how the Transformer can be pre-
trained end-to-end in medical images to improve the segmentation performance
further.

3.4 Conclusion

We propose a retinal vessel segmentation method, TransUNext, that integrates


an Efficient Self-attention Mechanism into the encoder and decoder of U-Net to
capture both local features and global dependencies with minimal computational
overhead. Unlike other hybrid architectures that embed modular Transformers into CNNs, TransUNext constructs hybrid modules consisting of Transformer and ConvNeXt throughout the model, which continuously aggregate representations from global and local features to fully exploit the strengths of both. Among them, the GMSF module upgrades the skip connection, fuses high-level semantic and low-level detail information, and eliminates high- and low-level semantic differences. In addition, we design the TransNeXt Block as the base block of the U-shaped architecture, which not only captures local features but also avoids the information loss caused by dimension compression when information is converted between feature spaces of different dimensions.

Fig. 10. Comparative results of the SOTA methods and our proposed TransUNext on four datasets (columns: original image, U-Net, TransUNet, Swin-UNet, TransUNext, ground truth; rows: DRIVE, STARE, CHASE-DB1, HRF), where the second, fourth, sixth and eighth rows give the local zoomed-in results of the vessel ends and at the optic disc.

Our TransUNext significantly outperforms other SOTA methods for retinal vessel segmentation on the DRIVE, STARE, CHASE-DB1, and HRF datasets and can potentially be applied to other high-resolution or strip-shaped object and lesion segmentation tasks.
Ethical approval
This article does not contain any studies with human participants or animals
performed by any of the authors.
Competing Interests
The authors declare that they have no conflict of interest.

References

1. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet:
Unet-like pure transformer for medical image segmentation (2021)
2. Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., Goldbaum, M.: Detection of
blood vessels in retinal images using two-dimensional matched filters. IEEE Trans-
actions on Medical Imaging 8(3), 263–269 (1989). https://doi.org/10.1109/42.34715
3. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou,
Y.: Transunet: Transformers make strong encoders for medical image segmentation
(2021)
4. Cherukuri, V., Kumar B.G., V., Bala, R., Monga, V.: Deep retinal image segmen-
tation with regularization under geometric priors. IEEE Transactions on Image
Processing 29, 2552–2567 (2020). https://doi.org/10.1109/TIP.2019.2946078
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale (2020)
6. Fraz, M.M., Remagnino, P., Hoppe, A., Uyyanonvara, B., Rudnicka, A.R., Owen,
C.G., Barman, S.A.: An ensemble classification-based approach applied to retinal
blood vessel segmentation. IEEE Transactions on Biomedical Engineering 59(9),
2538–2548 (2012). https://doi.org/10.1109/TBME.2012.2205687
7. Gegundez-Arias, M.E., Aquino, A., Bravo, J.M., Marin, D.: A function for quality
evaluation of retinal vessel segmentations. IEEE Transactions on Medical Imaging
31(2), 231–239 (2012). https://doi.org/10.1109/TMI.2011.2167982
8. Gu, Z., Cheng, J., Fu, H., Zhou, K., Hao, H., Zhao, Y., Zhang, T., Gao, S., Liu,
J.: Ce-net: Context encoder network for 2d medical image segmentation. IEEE
Transactions on Medical Imaging 38(10), 2281–2292 (2019). https://doi.org/10.1109/TMI.2019.2903562
9. Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B.,
Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In:
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV). pp. 574–584 (January 2022)
10. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications (2017)

11. Jin, Q., Meng, Z., Pham, T.D., Chen, Q., Wei, L., Su, R.: Dunet: A
deformable network for retinal vessel segmentation. Knowledge-Based Sys-
tems 178, 149–162 (2019). https://doi.org/10.1016/j.knosys.2019.04.025
12. Li, L., Verma, M., Nakashima, Y., Nagahara, H., Kawasaki, R.: Iternet: Retinal
image segmentation utilizing structural redundancy in vessel networks. In: Pro-
ceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
(WACV) (March 2020)
13. Li, Q., You, J., Zhang, D.: Vessel segmentation and width estimation in reti-
nal images using multiscale production of matched filter responses. Expert Sys-
tems with Applications 39(9), 7600–7610 (2012). https://doi.org/10.1016/j.eswa.2011.12.046
14. Liu, W., Tian, T., Xu, W., Yang, H., Pan, X., Yan, S., Wang, L.: Phtrans: Paral-
lelly aggregating global and local representations for medical image segmentation.
In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Com-
puting and Computer Assisted Intervention – MICCAI 2022. pp. 235–244. Springer
Nature Switzerland, Cham (2022)
15. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin trans-
former: Hierarchical vision transformer using shifted windows. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012–
10022 (October 2021)
16. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for
the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 11976–11986 (June 2022)
17. Luo, Z., Jia, Y.: The comparison of retinal vessel segmentation methods in fundus
images. Journal of Physics: Conference Series 1574(1), 012160 (Jun 2020). https://doi.org/10.1088/1742-6596/1574/1/012160
18. Maninis, K.K., Pont-Tuset, J., Arbeláez, P., Van Gool, L.: Deep retinal image
understanding. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W.
(eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI
2016. pp. 140–148. Springer International Publishing, Cham (2016)
19. Orlando, J.I., Prokofyeva, E., Blaschko, M.B.: A discriminatively trained fully
connected conditional random field model for blood vessel segmentation in fun-
dus images. IEEE Transactions on Biomedical Engineering 64(1), 16–27 (2017).
https://doi.org/10.1109/TBME.2016.2535311
20. Ricci, E., Perfetti, R.: Retinal blood vessel segmentation using line operators and
support vector classification. IEEE Transactions on Medical Imaging 26(10), 1357–
1365 (2007). https://doi.org/10.1109/TMI.2007.898551
21. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI
2015. pp. 234–241. Springer International Publishing, Cham (2015)
22. Soares, J., Leandro, J., Cesar, R., Jelinek, H., Cree, M.: Retinal vessel segmentation
using the 2-d gabor wavelet and supervised classification. IEEE Transactions on
Medical Imaging 25(9), 1214–1222 (2006). https://doi.org/10.1109/TMI.2006.879967

23. Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., van Ginneken, B.: Ridge-
based vessel segmentation in color images of the retina. IEEE Transactions on Med-
ical Imaging 23(4), 501–509 (2004). https://doi.org/10.1109/TMI.2004.825627
24. Tan, Y., Yang, K.F., Zhao, S.X., Li, Y.J.: Retinal vessel segmentation with skeletal
prior and contrastive loss. IEEE Transactions on Medical Imaging 41(9), 2238–2251
(2022). https://doi.org/10.1109/TMI.2022.3161681
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg,
U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R.
(eds.) Advances in Neural Information Processing Systems. vol. 30. Curran
Associates, Inc. (2017), https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
26. Wang, B., Qiu, S., He, H.: Dual encoding u-net for retinal vessel segmentation.
In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.T.,
Khan, A. (eds.) Medical Image Computing and Computer Assisted Intervention –
MICCAI 2019. pp. 84–92. Springer International Publishing, Cham (2019)
27. Wang, C., Xu, R., Xu, S., Meng, W., Zhang, X.: Da-net: Dual branch transformer
and adaptive strip upsampling for retinal vessels segmentation. In: Wang, L., Dou,
Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Computing and Com-
puter Assisted Intervention – MICCAI 2022. pp. 528–538. Springer Nature Switzer-
land, Cham (2022)
28. Wang, C., Xu, R., Zhang, Y., Xu, S., Zhang, X.: Retinal vessel segmentation
via context guide attention net with joint hard sample mining strategy. In: 2021
IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1319–1323
(2021). https://doi.org/10.1109/ISBI48211.2021.9433813
29. Wang, D., Haytham, A., Pottenburgh, J., Saeedi, O., Tao, Y.: Hard attention
net for automatic retinal vessel segmentation. IEEE Journal of Biomedical and
Health Informatics 24(12), 3384–3396 (2020). https://doi.org/10.1109/JBHI.2020.3002985
30. Wang, J., Li, X., Lv, P., Shi, C.: Serr-u-net: Squeeze-and-excitation resid-
ual and recurrent block-based u-net for automatic vessel segmentation in
retinal image. Computational and Mathematical Methods in Medicine 2021,
5976097 (Aug 2021). https://doi.org/10.1155/2021/5976097
31. Wang, K., Zhang, X., Huang, S., Wang, Q., Chen, F.: Ctf-net: Retinal vessel
segmentation via deep coarse-to-fine supervision network. In: 2020 IEEE 17th
International Symposium on Biomedical Imaging (ISBI). pp. 1237–1241 (2020).
https://doi.org/10.1109/ISBI45749.2020.9098742
32. Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with
linear complexity (2020)
33. Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: Transbts: Multimodal brain
tumor segmentation using transformer. In: de Bruijne, M., Cattin, P.C., Cotin,
S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Comput-
ing and Computer Assisted Intervention – MICCAI 2021. pp. 109–119. Springer
International Publishing, Cham (2021)
34. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: Es-
rgan: Enhanced super-resolution generative adversarial networks. In: Proceedings
of the European Conference on Computer Vision (ECCV) Workshops (September
2018)

35. Wang, Z., Lin, J., Wang, R., Zheng, W.: Data augmentation is more important than
model architectures for retinal vessel segmentation. In: Proceedings of the 2019 In-
ternational Conference on Intelligent Medicine and Health. p. 48–52. ICIMH 2019,
Association for Computing Machinery, New York, NY, USA (2019). https://doi.org/10.1145/3348416.3348425
36. Wu, H., Wang, W., Zhong, J., Lei, B., Wen, Z., Qin, J.: Scs-net:
A scale and context sensitive network for retinal vessel segmentation.
Medical Image Analysis 70, 102025 (2021). https://doi.org/10.1016/j.media.2021.102025
37. Wu, Y., Xia, Y., Song, Y., Zhang, Y., Cai, W.: Multiscale network followed network
model for retinal vessel segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos,
C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Com-
puter Assisted Intervention – MICCAI 2018. pp. 119–126. Springer International
Publishing, Cham (2018)
38. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (July 2017)
39. Xie, Y., Zhang, J., Shen, C., Xia, Y.: Cotr: Efficiently bridging cnn and trans-
former for 3d medical image segmentation. In: de Bruijne, M., Cattin, P.C., Cotin,
S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Comput-
ing and Computer Assisted Intervention – MICCAI 2021. pp. 171–180. Springer
International Publishing, Cham (2021)
40. Xu, R., Zhao, J., Ye, X., Wu, P., Wang, Z., Li, H., Chen, Y.W.: Local-region
and cross-dataset contrastive learning for retinal vessel segmentation. In: Wang,
L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Computing and
Computer Assisted Intervention – MICCAI 2022. pp. 571–581. Springer Nature
Switzerland, Cham (2022)
41. Yan, Z., Yang, X., Cheng, K.T.: Joint segment-level and pixel-wise losses for deep
learning based retinal vessel segmentation. IEEE Transactions on Biomedical Engi-
neering 65(9), 1912–1923 (2018). https://doi.org/10.1109/TBME.2018.2828137
42. Yan, Z., Yang, X., Cheng, K.T.: A three-stage deep learning model for accurate
retinal vessel segmentation. IEEE Journal of Biomedical and Health Informatics
23(4), 1427–1436 (2019). https://doi.org/10.1109/JBHI.2018.2872813
43. Yang, Y., Huang, S., Rao, N.: An automatic hybrid method for retinal blood vessel
extraction. Int. J. Appl. Math. Comput. Sci. 18(3), 399–407 (sep 2008)
44. Zhao, Y., Rada, L., Chen, K., Harding, S.P., Zheng, Y.: Automated vessel segmen-
tation using infinite perimeter active contour model with hybrid region information
with application to retinal images. IEEE Transactions on Medical Imaging 34(9),
1797–1807 (2015). https://doi.org/10.1109/TMI.2015.2409024
45. Zhou, H.Y., Guo, J., Zhang, Y., Yu, L., Wang, L., Yu, Y.: nnformer: Interleaved
transformer for volumetric segmentation (2021)
