Unet 2
Abstract—Retinal vessel segmentation is the foundation of fundus image research and an important and challenging task in medical analysis and diagnosis. Due to the many small capillaries and the intricate vascular distribution in the retina, traditional segmentation methods are time-consuming, error-prone, and rely on the subjective experience of ophthalmologists, which is not feasible for large-scale studies and clinical applications. To achieve better retinal vessel segmentation and improve segmentation accuracy, we propose a vessel segmentation method based on an improved U-Net architecture. First, we introduce an Atrous Spatial Pyramid Pooling (ASPP) module and propose two attention modules, the convolution depthwise squeeze convolution (CDSC) block and the double convolution squeeze convolution residual (DCSCR) block, in the encoding and decoding structures, respectively, to enhance the vessel features. Second, to retain more of the original information and extend the receptive field in the downsampling stage, we employ SoftPool as the pooling method. Next, we add the Squeeze-and-Excitation (SE) attention mechanism to the skip-connection structure to enhance discriminative features and suppress irrelevant or noisy ones, thus improving the representational power of the model. Finally, we reduce the initial five-layer encoding and decoding structure to four layers, which reduces the computational effort of the model and yields more accurate segmentation results than the original U-Net and other improved methods.

The experimental results on the public DRIVE and STARE datasets show that the improved model achieves better segmentation results: the proposed model obtained a MIoU of 82.5% on the DRIVE dataset and a MIoU of 77.1% on the STARE dataset. The proposed model improves the segmentation accuracy of retinal vessel images and will help clinicians to diagnose different retinal diseases.

Keywords: Retinal Blood Vessels; Color Fundus Images; U-Net Architecture; CDSC Block; ASPP; DCSCR Block; SE Attention; SoftPool.

I. INTRODUCTION

The distribution of blood vessels in the retina can be seen clearly in fundus images, and a correct segmentation of the retinal blood vessels can be utilized to aid in the diagnosis and early detection of retinal disorders. Consequently, it is crucial to understand how to accurately segment retinal vessels in fundus images[1]. Medical image segmentation is a dichotomy problem between the segmentation target and the background. An early neural network used for image segmentation is the fully convolutional network (FCN), which replaces the fully connected layers of a convolutional neural network (CNN) with convolutional layers to achieve end-to-end training. However, because the deconvolution restores the output of a deep convolutional layer, many important detail features are lost, which cannot meet the high-precision requirements of medical image segmentation. The U-Net network, based on the FCN, was introduced with good segmentation performance. U-Net is composed of an encoder, a decoder, and skip-connections. The encoder is the downsampling part, composed of convolution layers and max pooling layers. The deeper feature information of the image is extracted by different convolution kernels, and redundant information is removed by the dimension reduction of pooling. The skip-connections transfer the semantic information obtained at each layer of feature extraction to the corresponding decoding end in time, and preserve the image features obtained from the first three layers of the encoder. Each convolution block in U-Net uses two 3×3 convolution layers, each followed by a rectified linear unit (ReLU) to overcome gradient vanishing, and a 2×2 max pooling operation. The pooling stride is set to 2, shrinking the image size by half. To compensate for the features lost by pooling, the number of convolution channels is doubled after each pooling operation. The decoder is the upsampling part, which recovers the shallow position features of the image. The size expansion is achieved by 2×2 deconvolution, while the number of channels is halved. Each upsampling result is concatenated with the correspondingly cropped downsampling feature map, and the multiplied channels allow the network to propagate context information to a higher-resolution layer. The last layer is a 1×1 convolution layer, which outputs the segmentation result[2]. The two most distinctive features of the network are its U-shaped structure and the skip-connections. Because the relationship between pixels is fully considered, the U-Net model has higher segmentation accuracy and strong generalization ability[3].

The U-Net network has shown superior performance in medical image segmentation, and in recent years many improved segmentation methods based on it have been proposed. The authors in [4] proposed MSU-Net, which uses atrous spatial pyramid pooling (ASPP) to extract multi-scale features of blood vessels and improves the segmentation performance of the network. In [5], the proposed network model is based on a fusion of residual blocks, an attention mechanism, and U-Net, which solved the problem of tiny blood vessels, lesions, and optic discs being misclassified as blood vessels. In [6], the authors proposed DAS-U-Net, a parallel atrous convolution network which encourages responsive feature reuse through salient computing and helps to learn the characteristics of thin and thick vessels simultaneously. In [7], a 3AU-Net based on a triple attention mechanism is proposed; the network can suppress noise information and express richer features. In [8], a technique is proposed to process the background vascular texture in a complex way without damaging the blood vessel pixels; the technique achieved high segmentation accuracy of blood vessels in low-contrast areas. In [9], the authors proposed SERR-U-Net, a U-Net automatic vessel segmentation method for retinal images based on squeeze-excitation residuals and recurrent segmentation, which achieved good performance in the segmentation of small blood vessels and vessel branches. The authors in [10] proposed a new type of residual called the BSE residual and introduced a joint loss function to achieve excellent performance on both low- and high-resolution fundus images. In [11], the authors analyzed the limitations of patch-based deep learning segmentation of retinal blood vessels and proposed an effective automatic segmentation method, which improved the accuracy of blood vessel segmentation with good stability. The authors in [12] proposed Res-HSPP U-Net, which extends the depth and width of the network, improving its ability to extract small features and to segment some small blood vessels at the end of the retina and vessels in fine cross sections. All these techniques [13, 14, 15, 16, 17, 18, 19] used U-Net as the backbone network to improve blood vessel segmentation accuracy. However, they still struggle to segment thick and thin blood vessels accurately. For this reason, we propose an improved U-Net network that accurately segments both thick and thin retinal blood vessels. The main contributions of the proposed model include:

(1) We integrated the Atrous Spatial Pyramid Pooling (ASPP) structure into the proposed model to fuse the feature information of different receptive fields and extract more features by using atrous convolutions with different dilation rates.

(2) We modified the convolutional structure in the encoding and decoding parts of the conventional U-Net architecture by proposing the convolution depthwise squeeze convolution (CDSC) and double convolution squeeze convolution residual (DCSCR) blocks to enhance the feature extraction of vessels and improve the segmentation accuracy of the model.

(3) We embedded SoftPool pooling in the proposed model to reduce information loss and increase the receptive field during the pooling process. In addition, the Squeeze-and-Excitation (SE) attention mechanism is utilized in the skip-connection structure to provide additional feature information by learning the weight of each channel.

Fig. 1. U-Net structure

II. EXPERIMENTAL DATASET

In this paper, we used the publicly available DRIVE[20], STARE[21] and CHASE DB1[22] datasets. The DRIVE dataset consists of 40 color fundus images, divided into 20 training and 20 test images, each with a pixel size of 565×584 and with the manual segmentation results of the corresponding ophthalmologist. The STARE dataset includes 20 fundus images; we used the first 10 images for training and the last 10 for testing, each with a size of 700×605 and a corresponding expert manual segmentation result. The CHASE DB1 dataset includes 28 retinal images, each with a size of 999×960; the first 14 images were used for training and the last 14 for testing. To prevent overfitting, we performed random horizontal flip and random vertical flip operations on the original images in all three datasets. Moreover, we cropped the images to 480×480 using a random crop operation, and normalized the mean and standard deviation of the images.

The experiments were performed on a 64-bit Windows 10 OS with the PyTorch 1.11.0 framework, Python 3.8, and an NVIDIA RTX 3090 GPU. We used stochastic gradient descent (SGD) as the optimizer during the training process.

III. PROPOSED MODEL

To solve the problems of detail loss and poor segmentation of fine vessels in retinal vessel segmentation by U-Net, we propose a modified U-Net, as shown in Figure 2. Our model is divided into three parts: the left part is the encoding part, the middle part is the skip-connection structure, and the right part is the decoding part. In the encoding part, the input image first passes through an ASPP structure at the first layer of the model, and then through the CDSC layer to obtain a tensor of size [2,64,480,480], which then enters the second layer of the model. At the second layer, the SoftPool pooling operation is performed first, and then the CDSC layer transforms the input data. The third layer structure is the same as that of the second layer. After the third layer, the output of the model is a [2,256,120,120] tensor, which then enters the fourth layer for SoftPool pooling. The tensor enters the decoding part by passing through the CDSCR module in the fourth layer of the decoder, after which an upsampling operation is performed and the data progresses to the third layer. The upsampling result is concatenated with the result obtained in the third layer of the encoding module after it passes through the CDSC layer and the SE attention module, and the CDSCR layer operation is then carried out. The first, second and third layers of the decoding module perform the same operation with varying transformations, and the upsampling and skip-connection results are concatenated at every layer. Finally, in the last layer of the decoding part, the model applies a 1×1 convolution after the CDSCR layer to obtain the final segmentation result with an output of [2,2,480,480].
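The ASPP, SoftPool, and SE building blocks named above are standard components; a minimal PyTorch sketch of each (our own illustrative implementation with assumed dilation rates and reduction ratio, not the authors' exact code) could look like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Parallel atrous (dilated) 3x3 convolutions with different dilation
    rates capture several receptive fields at once; their outputs are
    concatenated and fused by a 1x1 convolution. The rates here are
    illustrative assumptions."""
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.fuse = nn.Conv2d(len(rates) * out_ch, out_ch, 1)

    def forward(self, x):
        # padding == dilation keeps the spatial size identical in every branch
        return self.fuse(torch.cat([F.relu(b(x)) for b in self.branches], dim=1))

class SoftPool2d(nn.Module):
    """Exponentially weighted pooling: each activation is weighted by e^a
    within its window, so strong responses dominate while weaker ones
    still contribute, preserving more detail than max pooling."""
    def __init__(self, kernel_size=2, stride=2):
        super().__init__()
        self.kernel_size, self.stride = kernel_size, stride

    def forward(self, x):
        w = torch.exp(x)
        # window-wise (sum of w*x) / (sum of w), computed via average pooling
        return (F.avg_pool2d(x * w, self.kernel_size, self.stride)
                / F.avg_pool2d(w, self.kernel_size, self.stride))

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global average pooling squeezes each channel
    to a scalar; two FC layers learn per-channel weights that rescale the
    feature maps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: [b, c]
        w = self.fc(s).view(b, c, 1, 1)   # excitation: per-channel weights
        return x * w                      # reweight the feature maps
```

With these blocks, for example, a [2, 64, 480, 480] encoder tensor passes through SoftPool2d() to [2, 64, 240, 240], matching the halving of spatial size described above.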
TABLE II
The model metric comparison on the STARE dataset

TABLE III
The model metric comparison on the CHASE DB1 dataset
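For reference, the Dice coefficient, per-class IoU, and MIoU reported in these tables are standard segmentation metrics; a minimal NumPy sketch of how such values can be computed from a predicted mask and a ground-truth mask (our own illustration, not the authors' evaluation code) is:

```python
import numpy as np

def dice_coefficient(pred, gt):
    """Dice = 2|P ∩ G| / (|P| + |G|) for binary vessel masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum())

def iou_per_class(pred, gt, num_classes=2):
    """IoU_c = |P_c ∩ G_c| / |P_c ∪ G_c| per class label
    (here class 0 = background, class 1 = vessel)."""
    ious = []
    for c in range(num_classes):
        p, g = pred == c, gt == c
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union)
    return ious

def mean_iou(pred, gt, num_classes=2):
    """MIoU is the mean of the per-class IoUs."""
    return float(np.mean(iou_per_class(pred, gt, num_classes)))
```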
Huang et al., and MC-UNet on all evaluation metrics. Based on Table 1, the conventional U-Net, SA-UNet, Huang et al., MC-UNet, and the proposed model obtain Dice coefficients of 0.816, 0.818, 0.819, 0.819 and 0.822, respectively. In Table 2, the Dice coefficient of the proposed model is 0.758, against 0.752 for the U-Net, and is also higher than those of SA-UNet, Huang et al., and MC-UNet. This shows the efficiency of the proposed model in blood vessel segmentation. In Table 1, the proposed model obtained a correlation coefficient of 0.982, a significant improvement over the conventional U-Net's 0.971. In addition, the IoU values of the original model are improved from (94.7, 69.0) to (95.0, 69.9). In Table 2, the correlation coefficients of U-Net and the proposed model are 0.960 and 0.969, and the IoU metric also improves from (92.2, 61.9) to (92.5, 62.1); the proposed model likewise obtains better correlation coefficient and IoU results than the other improved models, from which it can be concluded that it produces more effective segmentation. Finally, MIoU is improved from 81.9% in the traditional U-Net model to 82.5%, which is 0.6% better than the traditional U-Net, 0.4% better than SA-UNet and Huang et al., and 0.3% better than MC-UNet, indicating that the overall segmentation accuracy of the proposed model is improved. In summary, our proposed model achieves better results than the original U-Net and the rest of the improved models, its segmentation results are closer to the expert manual segmentation map, and our improvement is effective. As for the CHASE DB1 dataset, we can see from Table 3 that the proposed model does not obtain better results on all the indexes; we think the reason is that the generalization performance of the model is not sufficient, so the improved model does not obtain a positive improvement on this dataset.

In the DRIVE dataset, the original image has moderate brightness, darker blood vessels, and more pronounced edges, so all models achieved their best results on this dataset. The original U-Net performs poorly in terms of capillaries and the continuity of the segmentation results, but its segmentation of trunk vessels shows good performance. SA-UNet loses the middle part of the capillaries in Fig. 9 and also has multiple truncations, which is weaker than U-Net; however, it performs better in the segmentation of medium-width vessels, and its overall result is better than that of U-Net. Huang et al. and MC-UNet are close, but they optimize the segmentation performance on the capillaries, which improves their overall scores. Compared with the other models, the segmentation results of the proposed model are obviously finer. For the finer blood vessels, the segmentation accuracy of the remaining models remains low, and their vessel segmentation results are not fine enough, which may lead to misjudgment in the subsequent blood vessel analysis process. The proposed model obtained
Fig. 9. The first to seventh columns are the original image, the expert manual segmentation map, the U-Net segmentation map, the SA-UNet[33] segmentation map, the Huang et al.[34] segmentation map, the MC-UNet[35] segmentation map, and the proposed model segmentation map, in the DRIVE dataset.
Fig. 10. The first to seventh columns are the original image, the expert manual segmentation map, the U-Net segmentation map, the SA-UNet[33] segmentation map, the Huang et al.[34] segmentation map, the MC-UNet[35] segmentation map, and the proposed model segmentation map, in the STARE dataset.
high segmentation accuracy: finer blood vessels can be segmented accurately, and the results are more similar to the expert manual segmentation results. In addition, we improve vessel continuity, and the continuity of our segmentation results is also better than that of the other models. In the trunk vessels, our segmentation results are wider than those of the other models and closer to the ground truth.

In the STARE dataset, the background and blood vessel colors of the original image are close to each other, which makes the segmentation task more difficult, but the image is bright overall, which helps to separate the background from the edges of the blood vessels. The conventional U-Net lost part of the capillaries and the segmentation of the main blood vessels in the lower right corner, as shown in Fig. 10. The scores of both SA-UNet and Huang et al. are improved compared to the U-Net; the segmentation result of Huang et al. in Fig. 9 is better than that of the U-Net in terms of continuity, but is truncated in the same place in Fig. 10. MC-UNet's results are lower, mainly because it suffers from truncation in the segmentation of the capillaries and has the same problem as the previous two models in Fig. 10. Our model improves on these models by capturing more detail in the segmentation of the capillaries and improving continuity. For the problem noted in Fig. 10, our model segments the finer connections, and on the right side of Fig. 10 our model also shows improvement in the continuity of the main blood vessels. Furthermore, our model outperforms other
Fig. 11. The first to seventh columns are the original image, the expert manual segmentation map, the U-Net segmentation map, the SA-UNet[33] segmentation map, the Huang et al.[34] segmentation map, the MC-UNet[35] segmentation map, and the proposed model segmentation map, in the CHASE DB1 dataset.