TransUNext: Towards a More Advanced U-Shaped Framework for Automatic Vessel Segmentation in the Fundus Image
1 Introduction
The essence of automatic segmentation of fundus vessel images is to separate the vascular pixels from the surrounding background pixels. In clinical applications, fundus vessel images are complex and require experienced professionals to segment them manually, a process that is both subjective and inefficient. With the explosion of fundus image data, computer-aided automatic segmentation of vessel networks in fundus images therefore has crucial clinical value [17].
Currently, automatic segmentation methods for fundus vessel images are divided
into two main categories: one is based on unsupervised methods, and the other
is based on supervised methods, including machine learning and deep learning
strategies.
Among unsupervised methods, Chaudhuri et al. [2] successfully implemented blood vessel segmentation of fundus images using a two-dimensional Gaussian matched filter. Afterward, segmentation methods based on vessel morphology and particular pixel properties appeared. For example, Yang et al. [43] proposed a morphological processing method that first enhances vessel features and suppresses background information and then applies fuzzy clustering to achieve vessel segmentation; Zhao et al. [44] proposed a segmentation method based on a deformable model that uses regional information of different vessel types; Li et al. [13] optimized the matched filtering method and applied it to the vessel segmentation task. Segmentation methods based on unsupervised learning are fast, but their results are rough and of low accuracy.
Supervised machine learning segmentation methods extract vessel feature information more effectively by training a model on manually labeled images. Staal et al. [23] and Soares et al. [22] used two-dimensional filters to extract the overall features of the retinal image and then used a naive Bayes classifier to separate the retinal background and blood vessels. Ricci et al. [20] first extracted the green channel of the fundus image during preprocessing and then used an SVM to segment vessels according to differences in vessel width. Fraz et al. [6] proposed combining AdaBoost and Bagging models, integrating the results of complex feature extraction with those of binary classification models, to segment retinal vessel images automatically in a supervised manner. Although supervised machine learning methods improve accuracy, the algorithms themselves cannot adapt to the shape, scale, and geometric transformations of blood vessels, so they still suffer from low accuracy and low robustness when segmenting small vessels and vessel intersections, making it difficult to provide an objective basis for clinical diagnosis.
With the advent of CNNs, semantic segmentation methods based on deep learning can accurately predict vessel and non-vessel pixels and provide descriptions of vessel scale, shape, curvature, and other information. Among medical image semantic segmentation methods, U-Net [21] is considered a very successful network; it consists of convolutional encoding and decoding units and can be trained with only a few samples to perform segmentation tasks well. Its derivative works [18,19,37,41] have also achieved advanced retinal blood vessel segmentation results. To further improve the accuracy of retinal blood vessel segmentation, Wang B et al. [26] proposed a variant of U-Net with dual encoders to capture richer context features. Li et al. [30] also proposed an improved end-to-end network based on U-Net. This framework uses a squeeze-and-excitation (SE) module, a residual module, and a recurrent structure, and introduces enhanced super-resolution generative adversarial networks (ESRGAN) [34] together with improved data augmentation to achieve retinal blood vessel segmentation. Wang B et al. [31] also provided a coarse-to-fine supervision framework for retinal vessel segmentation.
Nevertheless, CNN-based approaches cannot model long-range dependencies due to inherent inductive biases such as locality and translation equivariance. Thus the Transformer [25], which relies purely on attention mechanisms to build global dependencies without any convolutional operations, has emerged as an alternative architecture that outperforms CNNs in computer vision (CV) when pre-trained on large-scale datasets. Vision Transformer (ViT) [5] revolutionized the CV field by splitting images into a sequence of tokens and modeling their global relationships with stacked Transformer blocks. Swin Transformer [15] produces hierarchical feature representations with shifted windows at low computational complexity, achieving state-of-the-art performance in various CV tasks. However, medical image datasets are much smaller than the pre-training datasets used in the above work (e.g., ImageNet-21k and JFT-300M). As a result, the Transformer alone yields unsatisfactory performance in medical image segmentation. Therefore, many hybrid structures combining CNN and Transformer have emerged; they retain the advantages of both and are gradually becoming a compromise solution for medical image segmentation that does not require pre-training on large datasets.
We summarize several popular hybrid architectures based on Transformer and CNN in medical image segmentation. These hybrid architectures add the Transformer to a CNN-based backbone model or replace some of its components. For example, UNETR [9] uses an encoder-decoder architecture in which the encoder is a cascade of pure Transformer blocks and the decoder is a stack of convolutional layers, see Fig. 1(a). TransBTS [33] and TransUNet [3] insert a Transformer between the CNN-based encoder and decoder, see Fig. 1(b). CoTr [39] bridges all stages from the encoder to the decoder through the Transformer, not only the adjacent stages, allowing it to exploit global dependencies at multiple scales, see Fig. 1(c). Furthermore, nnFormer [45,1] interleaves Transformer and convolutional blocks into a hybrid model in which convolution encodes precise spatial information while the Transformer captures global context information, see Fig. 1(d). As seen from Fig. 1, these architectures combine Transformer and CNN serially from a macroscopic perspective. However, in such combinations, the convolutional and self-attention mechanisms cannot be applied throughout the network structure, making it challenging to model local and global features efficiently.
The rest of this paper is organized as follows: Section 2 introduces the proposed method in detail; Section 3 describes the experimental implementation and illustrates the experimental results; Section 4 gives the conclusions of this paper.
2 Proposed method
Many samples in the fundus vessel image datasets have poor contrast and high noise. Therefore, proper preprocessing is crucial for subsequent training. This paper uses four preprocessing methods, namely Grayscale Transformation, Data Normalization, Contrast Limited Adaptive Histogram Equalization (CLAHE), and Gamma Correction [7,8,35], to process each original fundus vessel image. Fig. 2 shows the intermediate results of processing the original color retinal images with Grayscale Transformation, CLAHE, and Gamma Correction. As can be seen from the figure, the preprocessed images have clearer texture, more prominent edges, and enhanced detail information.
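As a concrete illustration, the following is a minimal sketch of this preprocessing pipeline using OpenCV and NumPy; the CLAHE clip limit, tile size, and gamma value are illustrative assumptions rather than the exact settings used in this work.

```python
import cv2
import numpy as np

def preprocess(rgb_image: np.ndarray, gamma: float = 1.2) -> np.ndarray:
    """Grayscale -> normalization -> CLAHE -> gamma correction (illustrative settings)."""
    # Grayscale transformation
    gray = cv2.cvtColor(rgb_image, cv2.COLOR_RGB2GRAY)

    # Data normalization to zero mean / unit variance, then rescale to [0, 255]
    norm = (gray.astype(np.float32) - gray.mean()) / (gray.std() + 1e-8)
    norm = cv2.normalize(norm, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # Contrast Limited Adaptive Histogram Equalization (CLAHE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    equalized = clahe.apply(norm)

    # Gamma correction via a lookup table
    table = np.array([((i / 255.0) ** (1.0 / gamma)) * 255
                      for i in range(256)]).astype(np.uint8)
    return cv2.LUT(equalized, table)
```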
Because of the relatively small amount of data in the fundus vessel image datasets, data augmentation was performed to reduce the effects of overfitting. The retinal vessel image in the used datasets is a circular region, so rotating the image randomly by a fixed angle can simulate different acquisition environments without changing the structure of the image itself. In addition, 15,000 patches of size 128 × 128 were randomly cropped from each training image of the DRIVE, STARE, CHASE-DB1, and HRF datasets, and the corresponding ground truth was processed identically. The augmented image data are shown in Fig. 3.
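A minimal sketch of the random patch extraction, assuming images and ground-truth labels are NumPy arrays; the function and variable names are illustrative, and only the 128 × 128 patch size follows the text.

```python
import numpy as np

def random_patches(image: np.ndarray, label: np.ndarray,
                   n_patches: int, size: int = 128, seed: int = 0):
    """Extract n_patches random size x size crops, identically from image and ground truth."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    img_patches, lbl_patches = [], []
    for _ in range(n_patches):
        # Sample a valid top-left corner so the crop stays inside the image
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        img_patches.append(image[y:y + size, x:x + size])
        lbl_patches.append(label[y:y + size, x:x + size])
    return np.stack(img_patches), np.stack(lbl_patches)
```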
Unlike the cropping method used in the training phase, in the testing phase each image patch needs to be re-stitched into a complete image and then binarized to obtain the segmentation result. All patches must be stitched together to restore the resolution of the original fundus image. However, if random cropping were used, the time and space complexity of index-based stitching would be extremely high. To avoid this problem, we use overlapping cropping in the testing phase. The step size was set to 12 based on workstation performance trade-offs.
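A minimal sketch of overlapping cropping with a fixed stride and averaging-based re-stitching; the stride of 12 follows the text, while the averaging strategy, the 0.5 binarization threshold, and the omission of border padding are simplifying assumptions.

```python
import numpy as np

def sliding_patches(image: np.ndarray, size: int = 128, stride: int = 12):
    """Overlapping crops with a fixed stride; returns patches and their top-left coordinates."""
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(image[y:y + size, x:x + size])
            coords.append((y, x))
    return np.stack(patches), coords

def stitch(pred_patches, coords, out_shape, size: int = 128) -> np.ndarray:
    """Re-stitch overlapping probability patches by averaging, then binarize."""
    acc = np.zeros(out_shape, dtype=np.float32)
    cnt = np.zeros(out_shape, dtype=np.float32)
    for patch, (y, x) in zip(pred_patches, coords):
        acc[y:y + size, x:x + size] += patch
        cnt[y:y + size, x:x + size] += 1
    prob = acc / np.maximum(cnt, 1)        # average over overlapping predictions
    return (prob > 0.5).astype(np.uint8)   # binarize to the final segmentation
```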
Fig. 3. Illustration of random cropping. (a) patches from the original image (b) patches
of ground truth.
Fig. 4. (a) The architecture of TransUNext; (b) Hybrid block consisting of Transformer and ConvNeXt (TransNeXt Block).
The ConvNeXt block achieves better results with fewer parameters than the standard convolution block. At the channel level, its dimension expands from C to 4C and then contracts back to C. The ConvNeXt block thus adopts a small-large-small form of dimensions to avoid the information loss caused by dimension compression when information is converted between feature spaces of different dimensions.
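For reference, a minimal PyTorch sketch of such a small-large-small (C → 4C → C) block, following the public ConvNeXt design [16]; the 7×7 depthwise kernel and exact layer ordering are assumptions and may differ from the TransNeXt Block used here.

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Depthwise conv -> LayerNorm -> 1x1 conv (C -> 4C) -> GELU -> 1x1 conv (4C -> C) + residual."""
    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise
        self.norm = nn.LayerNorm(dim)            # applied over the channel dimension
        self.pwconv1 = nn.Linear(dim, 4 * dim)   # pointwise (1x1) conv as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # (B, H, W, C) for LayerNorm / Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                # back to (B, C, H, W)
        return shortcut + x                      # residual connection
```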
Efficient Self-attention Mechanism. As shown in Fig. 4(b), the TransNeXt Block is built around the core MHSA module. This module allows the model to jointly attend to information from different representation subspaces. The results from multiple heads are concatenated and then transformed with a feed-forward network. In this paper, we use four heads; more detailed multi-head dimensions can be found in the code.
As shown in Fig. 5, three 1×1 convolutions are used to project X to query, key, and value embeddings Q, K, V ∈ R^{d×H×W}, where d is the embedding dimension in each head. Q, K, and V are then flattened and transposed into sequences of size n × d, where n = HW. The self-attention mechanism is implemented by the dot product of vectors and a softmax. However, images are highly structured data, and most pixels in a high-resolution feature map have features similar to their neighborhood, except in boundary regions. Therefore, pairwise attention computation over the whole image is unnecessary [32].
Therefore, we adopt an Efficient Self-attention Mechanism, corresponding to the Sub-Sample component in Fig. 5. The main idea is to project the keys and values into lower-dimensional embeddings, and the resulting self-attention is computed as

$$\mathrm{Attention}(Q, K', V') = \mathrm{softmax}\left(\frac{Q K'^{\top}}{\sqrt{d}}\right) V' \quad (1)$$

where K, V ∈ R^{n×d} are projected into low-dimensional embeddings K', V' ∈ R^{k×d}, with k = hw ≪ n, and h and w are the reduced height and width of the feature map after sub-sampling. The shape of softmax(QK'^⊤/√d) is n × k, and the shape of V' is k × d.
By doing so, the computational complexity is reduced to O(nkd), instead of the O(n²d) complexity of the plain dot-product attention. Notably, the projection to low-dimensional embeddings can be any down-sampling operation, such as strided convolution or average/max pooling. This matters because, for a high-resolution fundus image dataset such as HRF, n is much larger than d due to the high resolution of its feature maps; the sequence length then dominates the self-attention computation and makes it infeasible to apply self-attention to high-resolution feature maps. In our implementation, we use a 1×1 convolution followed by a simple interpolation to down-sample the feature map.
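A minimal single-head PyTorch sketch of the efficient attention in Eq. (1), in which keys and values are projected by 1×1 convolutions and down-sampled by interpolation before the scaled dot product; the reduction ratio and the single-head simplification are assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EfficientSelfAttention(nn.Module):
    """Single-head sketch of Eq. (1): softmax(Q K'^T / sqrt(d)) V' with sub-sampled K', V'."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.reduction = reduction
        self.q = nn.Conv2d(dim, dim, kernel_size=1)
        self.k = nn.Conv2d(dim, dim, kernel_size=1)
        self.v = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (B, d, H, W)
        b, d, h, w = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)            # (B, n, d), n = H*W
        # Sub-sample keys/values: 1x1 conv, then bilinear interpolation to (h', w')
        hk, wk = h // self.reduction, w // self.reduction
        k = F.interpolate(self.k(x), size=(hk, wk), mode="bilinear", align_corners=False)
        v = F.interpolate(self.v(x), size=(hk, wk), mode="bilinear", align_corners=False)
        k = k.flatten(2).transpose(1, 2)                     # (B, k, d), k = h'*w' << n
        v = v.flatten(2).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(1, 2) / d ** 0.5, dim=-1)  # (B, n, k)
        out = attn @ v                                       # (B, n, d)
        return out.transpose(1, 2).reshape(b, d, h, w)
```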
Fig. 5. Efficient multi-head self-attention (MHSA). (a) The MHSA used in the Trans-
former encoder. (b) The MHSA used in the Transformer decoder. They share similar
concepts, but (b) takes two inputs, including the high-resolution features from GMSF
of the encoder, and the low-resolution features from the decoder.
To evaluate the performance of the proposed method, several metrics are used in this paper, including Accuracy (Acc), Specificity (SP), Sensitivity (SE), Precision, F1-Score, AUC (area under the ROC curve), and CAL (connectivity-area-length) [7,27].
Acc: It refers to the proportion of correctly classified pixels to all pixels.

$$Acc = \frac{TP + TN}{TP + FP + FN + TN} \quad (2)$$
SP: It refers to the proportion of correctly classified non-vessel pixels to actual non-vessel pixels.

$$SP = \frac{TN}{TN + FP} \quad (3)$$
SE: It is also known as recall (Recall) and refers to the proportion of correctly classified vessel pixels to actual vessel pixels.

$$SE = \frac{TP}{TP + FN} \quad (4)$$
Precision: It refers to the proportion of correctly classified vessel pixels to all pixels predicted as vessels.

$$Precision = \frac{TP}{TP + FP} \quad (5)$$
F1-Score: It evaluates a binary classification model by jointly considering precision and recall. The F1-Score can be seen as the harmonic mean of precision and recall and is high only when both SE and Precision are high.

$$F1\text{-}Score = 2 \times \frac{Precision \times SE}{Precision + SE} = \frac{TP}{TP + \frac{FP + FN}{2}} \quad (6)$$
AUC: The area under the ROC curve reflects the trade-off between SE and SP under different thresholds and is suitable for measuring retinal vessel segmentation.
Also, some metrics have been specially designed for vessel segmentation and are widely used in previous works. For example, a set of metrics was proposed by Gegundez-Arias et al. [7] to evaluate the connectivity (C), overlapping area (A), and consistency of vessel length (L) of predicted vessels. The overall metric F is defined as

$$F(C, A, L) = C \times A \times L \quad (7)$$

In this way, the segmentation of coarse and fine vessels can be quantified more equally.
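A minimal sketch computing the pixel-wise metrics of Eqs. (2)–(6) from binary prediction and ground-truth masks; the array names and the absence of a field-of-view mask are illustrative simplifications.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Acc, SP, SE, Precision and F1-Score from binary masks (Eqs. (2)-(6))."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)      # vessel pixels correctly classified
    tn = np.sum(~pred & ~gt)    # background pixels correctly classified
    fp = np.sum(pred & ~gt)     # background pixels predicted as vessel
    fn = np.sum(~pred & gt)     # vessel pixels predicted as background
    acc = (tp + tn) / (tp + fp + fn + tn)
    sp = tn / (tn + fp)
    se = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = tp / (tp + 0.5 * (fp + fn))
    return {"Acc": acc, "SP": sp, "SE": se, "Precision": precision, "F1": f1}
```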
3 Experiments
3.1 Datasets
We evaluated our proposed method on four public datasets of fundus images,
including DRIVE, STARE, CHASE-DB1, and HRF. Fig. 7 shows some typical
cases from the four datasets.
1. DRIVE Dataset: The DRIVE dataset
(https://drive.grand-challenge.org/DRIVE) contains 40 fundus
retinal color images, seven of which are from patients with early diabetic
retinopathy, with a resolution of 565 × 584 and stored in JPEG format. The
original dataset uses 20 images for training and 20 for testing with masks,
and two experts manually annotated the dataset. In this paper, we divide
the dataset into a training set, a validation set, and a test set according to
the ratio of 18:2:20.
2. STARE Dataset: The STARE dataset
(http://cecas.clemson.edu/~ahoover/stare/probing) provides 20 fun-
dus color images with a resolution of 605 × 700. We use 15 of these images
for training and five for testing. The original dataset is not divided into a
validation set like the DRIVE dataset. Thus, we choose 10% of the training
data for validation. The STARE dataset also provides annotated images of
two experts.
3. CHASE-DB1 Dataset: The CHASE-DB1 dataset
(https://blogs.kingston.ac.uk/retinal/chasedb1) contains 28 color reti-
nal images with a resolution of 996 × 960. It was taken from the left and
right eyes of 14 children. We used 21 of these images for training and seven
for testing.
4. HRF Dataset: There are 45 fundus images in the HRF dataset
(https://www5.cs.fau.de/fileadmin/research/datasets/fundus-images)
with a resolution of 3504 × 2336. Thirty-eight images from the groups of
healthy subjects, diabetic retinopathy, and glaucoma patients are taken as
the training set, and the other seven images are taken as the test set.
The detailed division of the four datasets is shown in Table 1.
Fig. 7. Fundus images from four different datasets. (a) Original fundus image (b)
ground truth (c) mask.
3.3 Results
Ablation study We conducted an ablation study on the DRIVE dataset to verify each module's contribution to the performance of the entire model.
To better evaluate the whole ablation study, Fig. 8 shows the ROC curves of TransUNext on the DRIVE test set. The visualization also shows that both GMSF and the TransNeXt Block enhance segmentation performance, with AUC values of 0.9830 and 0.9860, respectively.
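The AUC values above are obtained from the ROC curve; a minimal sketch using scikit-learn, assuming a probability map, a ground-truth mask, and a field-of-view mask as NumPy arrays.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def roc_auc(prob_map: np.ndarray, gt: np.ndarray, fov: np.ndarray) -> float:
    """Compute the AUC over pixels inside the field-of-view mask."""
    scores = prob_map[fov > 0].ravel()                  # predicted vessel probabilities
    labels = (gt[fov > 0] > 0).astype(np.uint8).ravel() # binary ground-truth labels
    fpr, tpr, _ = roc_curve(labels, scores)
    return auc(fpr, tpr)
```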
Results on four datasets Table 3 and Fig. 9 present partial segmentation results of our proposed method on the four datasets. Fig. 9(d) and Fig. 9(c) illustrate typical grayscale segmentation and binarization results. As seen from the table and figure, the segmentation result of our proposed TransUNext is very close to the ground truth: it extracts the main vessels from the background and correctly segments the vessel edges. TransUNext copes with the complexities of long, thin vessel spans and the variable morphology of the optic disc and optic cup in fundus images.
Fig. 9. Segmentation results of the proposed method on four datasets. (a) original
image, (b) pre-processed, (c) binarization, (d) proposed TransUNext and (e) ground
truth.
While AUC and Acc are slightly lower than those of [27] and [40], SP and SE reach the maximum.
4 Conclusion
Fig. 10. Comparative results of the SOTA methods and our proposed TransUNext on
four datasets, where the second, fourth, sixth and eighth rows give the local zoomed-in
results of the vessel ends and at the optic disc.
The TransNeXt Block also avoids the information loss caused by dimension compression when information is converted between feature spaces of different dimensions. Our TransUNext significantly outperforms other SOTA methods for retinal vessel segmentation on the DRIVE, STARE, CHASE-DB1, and HRF datasets and can potentially be applied to other segmentation tasks involving high-resolution images or elongated (strip-like) objects and lesions.
Ethical approval
This article does not contain any studies with human participants or animals
performed by any of the authors.
Competing Interests
The authors declare that they have no conflict of interest.
References
1. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet:
Unet-like pure transformer for medical image segmentation (2021)
2. Chaudhuri, S., Chatterjee, S., Katz, N., Nelson, M., Goldbaum, M.: Detection of
blood vessels in retinal images using two-dimensional matched filters. IEEE Trans-
actions on Medical Imaging 8(3), 263–269 (1989). https://doi.org/10.1109/42.
34715
3. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou,
Y.: Transunet: Transformers make strong encoders for medical image segmentation
(2021)
4. Cherukuri, V., Kumar B.G., V., Bala, R., Monga, V.: Deep retinal image segmen-
tation with regularization under geometric priors. IEEE Transactions on Image
Processing 29, 2552–2567 (2020). https://doi.org/10.1109/TIP.2019.2946078
5. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.:
An image is worth 16x16 words: Transformers for image recognition at scale (2020)
6. Fraz, M.M., Remagnino, P., Hoppe, A., Uyyanonvara, B., Rudnicka, A.R., Owen,
C.G., Barman, S.A.: An ensemble classification-based approach applied to retinal
blood vessel segmentation. IEEE Transactions on Biomedical Engineering 59(9),
2538–2548 (2012). https://doi.org/10.1109/TBME.2012.2205687
7. Gegundez-Arias, M.E., Aquino, A., Bravo, J.M., Marin, D.: A function for quality
evaluation of retinal vessel segmentations. IEEE Transactions on Medical Imaging
31(2), 231–239 (2012). https://doi.org/10.1109/TMI.2011.2167982
8. Gu, Z., Cheng, J., Fu, H., Zhou, K., Hao, H., Zhao, Y., Zhang, T., Gao, S., Liu,
J.: Ce-net: Context encoder network for 2d medical image segmentation. IEEE
Transactions on Medical Imaging 38(10), 2281–2292 (2019). https://doi.org/
10.1109/TMI.2019.2903562
9. Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B.,
Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In:
Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV). pp. 574–584 (January 2022)
10. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., An-
dreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for
mobile vision applications (2017)
11. Jin, Q., Meng, Z., Pham, T.D., Chen, Q., Wei, L., Su, R.: Dunet: A
deformable network for retinal vessel segmentation. Knowledge-Based Sys-
tems 178, 149–162 (2019). https://doi.org/https://doi.org/10.1016/j.
knosys.2019.04.025, https://www.sciencedirect.com/science/article/pii/
S0950705119301984
12. Li, L., Verma, M., Nakashima, Y., Nagahara, H., Kawasaki, R.: Iternet: Retinal
image segmentation utilizing structural redundancy in vessel networks. In: Pro-
ceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
(WACV) (March 2020)
13. Li, Q., You, J., Zhang, D.: Vessel segmentation and width estimation in reti-
nal images using multiscale production of matched filter responses. Expert Sys-
tems with Applications 39(9), 7600–7610 (2012). https://doi.org/https://doi.
org/10.1016/j.eswa.2011.12.046, https://www.sciencedirect.com/science/
article/pii/S0957417411017179
14. Liu, W., Tian, T., Xu, W., Yang, H., Pan, X., Yan, S., Wang, L.: Phtrans: Paral-
lelly aggregating global and local representations for medical image segmentation.
In: Wang, L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Com-
puting and Computer Assisted Intervention – MICCAI 2022. pp. 235–244. Springer
Nature Switzerland, Cham (2022)
15. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin trans-
former: Hierarchical vision transformer using shifted windows. In: Proceedings of
the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 10012–
10022 (October 2021)
16. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for
the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR). pp. 11976–11986 (June 2022)
17. Luo, Z., Jia, Y.: The comparison of retinal vessel segmentation methods in fundus
images. Journal of Physics: Conference Series 1574(1), 012160 (jun 2020). https:
//doi.org/10.1088/1742-6596/1574/1/012160, https://dx.doi.org/10.1088/
1742-6596/1574/1/012160
18. Maninis, K.K., Pont-Tuset, J., Arbeláez, P., Van Gool, L.: Deep retinal image
understanding. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W.
(eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI
2016. pp. 140–148. Springer International Publishing, Cham (2016)
19. Orlando, J.I., Prokofyeva, E., Blaschko, M.B.: A discriminatively trained fully
connected conditional random field model for blood vessel segmentation in fun-
dus images. IEEE Transactions on Biomedical Engineering 64(1), 16–27 (2017).
https://doi.org/10.1109/TBME.2016.2535311
20. Ricci, E., Perfetti, R.: Retinal blood vessel segmentation using line operators and
support vector classification. IEEE Transactions on Medical Imaging 26(10), 1357–
1365 (2007). https://doi.org/10.1109/TMI.2007.898551
21. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F.
(eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI
2015. pp. 234–241. Springer International Publishing, Cham (2015)
22. Soares, J., Leandro, J., Cesar, R., Jelinek, H., Cree, M.: Retinal vessel segmentation
using the 2-d gabor wavelet and supervised classification. IEEE Transactions on
Medical Imaging 25(9), 1214–1222 (2006). https://doi.org/10.1109/TMI.2006.
879967
23. Staal, J., Abramoff, M., Niemeijer, M., Viergever, M., van Ginneken, B.: Ridge-
based vessel segmentation in color images of the retina. IEEE Transactions on Med-
ical Imaging 23(4), 501–509 (2004). https://doi.org/10.1109/TMI.2004.825627
24. Tan, Y., Yang, K.F., Zhao, S.X., Li, Y.J.: Retinal vessel segmentation with skeletal
prior and contrastive loss. IEEE Transactions on Medical Imaging 41(9), 2238–2251
(2022). https://doi.org/10.1109/TMI.2022.3161681
25. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L.u., Polosukhin, I.: Attention is all you need. In: Guyon, I., Luxburg,
U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R.
(eds.) Advances in Neural Information Processing Systems. vol. 30. Curran
Associates, Inc. (2017), https://proceedings.neurips.cc/paper/2017/file/
3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
26. Wang, B., Qiu, S., He, H.: Dual encoding u-net for retinal vessel segmentation.
In: Shen, D., Liu, T., Peters, T.M., Staib, L.H., Essert, C., Zhou, S., Yap, P.T.,
Khan, A. (eds.) Medical Image Computing and Computer Assisted Intervention –
MICCAI 2019. pp. 84–92. Springer International Publishing, Cham (2019)
27. Wang, C., Xu, R., Xu, S., Meng, W., Zhang, X.: Da-net: Dual branch transformer
and adaptive strip upsampling for retinal vessels segmentation. In: Wang, L., Dou,
Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Computing and Com-
puter Assisted Intervention – MICCAI 2022. pp. 528–538. Springer Nature Switzer-
land, Cham (2022)
28. Wang, C., Xu, R., Zhang, Y., Xu, S., Zhang, X.: Retinal vessel segmentation
via context guide attention net with joint hard sample mining strategy. In: 2021
IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1319–1323
(2021). https://doi.org/10.1109/ISBI48211.2021.9433813
29. Wang, D., Haytham, A., Pottenburgh, J., Saeedi, O., Tao, Y.: Hard attention
net for automatic retinal vessel segmentation. IEEE Journal of Biomedical and
Health Informatics 24(12), 3384–3396 (2020). https://doi.org/10.1109/JBHI.
2020.3002985
30. Wang, J., Li, X., Lv, P., Shi, C.: Serr-u-net: Squeeze-and-excitation resid-
ual and recurrent block-based u-net for automatic vessel segmentation in
retinal image. Computational and Mathematical Methods in Medicine 2021,
5976097 (Aug 2021). https://doi.org/10.1155/2021/5976097, https://doi.
org/10.1155/2021/5976097
31. Wang, K., Zhang, X., Huang, S., Wang, Q., Chen, F.: Ctf-net: Retinal vessel
segmentation via deep coarse-to-fine supervision network. In: 2020 IEEE 17th
International Symposium on Biomedical Imaging (ISBI). pp. 1237–1241 (2020).
https://doi.org/10.1109/ISBI45749.2020.9098742
32. Wang, S., Li, B.Z., Khabsa, M., Fang, H., Ma, H.: Linformer: Self-attention with
linear complexity (2020)
33. Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: Transbts: Multimodal brain
tumor segmentation using transformer. In: de Bruijne, M., Cattin, P.C., Cotin,
S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Comput-
ing and Computer Assisted Intervention – MICCAI 2021. pp. 109–119. Springer
International Publishing, Cham (2021)
34. Wang, X., Yu, K., Wu, S., Gu, J., Liu, Y., Dong, C., Qiao, Y., Change Loy, C.: Es-
rgan: Enhanced super-resolution generative adversarial networks. In: Proceedings
of the European Conference on Computer Vision (ECCV) Workshops (September
2018)
35. Wang, Z., Lin, J., Wang, R., Zheng, W.: Data augmentation is more important than
model architectures for retinal vessel segmentation. In: Proceedings of the 2019 In-
ternational Conference on Intelligent Medicine and Health. p. 48–52. ICIMH 2019,
Association for Computing Machinery, New York, NY, USA (2019). https://doi.
org/10.1145/3348416.3348425, https://doi.org/10.1145/3348416.3348425
36. Wu, H., Wang, W., Zhong, J., Lei, B., Wen, Z., Qin, J.: Scs-net:
A scale and context sensitive network for retinal vessel segmentation.
Medical Image Analysis 70, 102025 (2021). https://doi.org/https:
//doi.org/10.1016/j.media.2021.102025, https://www.sciencedirect.
com/science/article/pii/S1361841521000712
37. Wu, Y., Xia, Y., Song, Y., Zhang, Y., Cai, W.: Multiscale network followed network
model for retinal vessel segmentation. In: Frangi, A.F., Schnabel, J.A., Davatzikos,
C., Alberola-López, C., Fichtinger, G. (eds.) Medical Image Computing and Com-
puter Assisted Intervention – MICCAI 2018. pp. 119–126. Springer International
Publishing, Cham (2018)
38. Xie, S., Girshick, R., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations
for deep neural networks. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR) (July 2017)
39. Xie, Y., Zhang, J., Shen, C., Xia, Y.: Cotr: Efficiently bridging cnn and trans-
former for 3d medical image segmentation. In: de Bruijne, M., Cattin, P.C., Cotin,
S., Padoy, N., Speidel, S., Zheng, Y., Essert, C. (eds.) Medical Image Comput-
ing and Computer Assisted Intervention – MICCAI 2021. pp. 171–180. Springer
International Publishing, Cham (2021)
40. Xu, R., Zhao, J., Ye, X., Wu, P., Wang, Z., Li, H., Chen, Y.W.: Local-region
and cross-dataset contrastive learning for retinal vessel segmentation. In: Wang,
L., Dou, Q., Fletcher, P.T., Speidel, S., Li, S. (eds.) Medical Image Computing and
Computer Assisted Intervention – MICCAI 2022. pp. 571–581. Springer Nature
Switzerland, Cham (2022)
41. Yan, Z., Yang, X., Cheng, K.T.: Joint segment-level and pixel-wise losses for deep
learning based retinal vessel segmentation. IEEE Transactions on Biomedical Engi-
neering 65(9), 1912–1923 (2018). https://doi.org/10.1109/TBME.2018.2828137
42. Yan, Z., Yang, X., Cheng, K.T.: A three-stage deep learning model for accurate
retinal vessel segmentation. IEEE Journal of Biomedical and Health Informatics
23(4), 1427–1436 (2019). https://doi.org/10.1109/JBHI.2018.2872813
43. Yang, Y., Huang, S., Rao, N.: An automatic hybrid method for retinal blood vessel
extraction. Int. J. Appl. Math. Comput. Sci. 18(3), 399–407 (sep 2008)
44. Zhao, Y., Rada, L., Chen, K., Harding, S.P., Zheng, Y.: Automated vessel segmen-
tation using infinite perimeter active contour model with hybrid region information
with application to retinal images. IEEE Transactions on Medical Imaging 34(9),
1797–1807 (2015). https://doi.org/10.1109/TMI.2015.2409024
45. Zhou, H.Y., Guo, J., Zhang, Y., Yu, L., Wang, L., Yu, Y.: nnformer: Interleaved
transformer for volumetric segmentation (2021)