0% found this document useful (0 votes)

65 views16 pages

2.5D Solutions

This document summarizes recent developments in 2.5D methods for efficient volumetric medical image segmentation. It reviews three categories of 2.5D methods: multi-view fusion, incorporating inter-slice information, and fusing 2D/3D features. The document presents an empirical study comparing these 2.5D methods to 2D and 3D CNNs on three medical image segmentation tasks to evaluate their performance on different tasks. The study aims to provide guidance on which methods work best in practical applications.

Uploaded by

Rmz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

65 views16 pages

2.5D Solutions

Uploaded by

Rmz

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 16

Bridging 2D and 3D Segmentation Networks for Computation-Efficient

Volumetric Medical Image Segmentation: An Empirical Study of 2.5D

Solutions

Yichi Zhanga , Qingcheng Liaoa , Le Dinga , Jicong Zhanga,b,c,d,∗

a School of Biological Science and Medical Engineering, Beihang University, Beijing, China
b Hefei Innovation Research Institute, Beihang University, Hefei, China
c Beijing Advanced Innovation Centre for Biomedical Engineering, Beijing, China
d Beijing Advanced Innovation Centre for Big Data-Based Precision Medicine, Beijing, China
arXiv:2010.06163v2 [eess.IV] 7 Feb 2022

Abstract

Recently, deep convolutional neural networks have achieved great success for medical image segmentation.
However, unlike segmentation of natural images, most medical images such as MRI and CT are volumetric
data. In order to make full use of volumetric information, 3D CNNs are widely used. However, 3D CNNs
suffer from higher inference time and computation cost, which hinders their further clinical applications.
Additionally, with the increased number of parameters, the risk of overfitting is higher, especially for medical
images where data and annotations are expensive to acquire. To issue this problem, many 2.5D segmentation
methods have been proposed to make use of volumetric spatial information with less computation cost.
Despite these works lead to improvements on a variety of segmentation tasks, to the best of our knowledge,
there has not previously been a large-scale empirical comparison of these methods. In this paper, we aim
to present a review of the latest developments of 2.5D methods for volumetric medical image segmentation.
Additionally, to compare the performance and effectiveness of these methods, we provide an empirical study
of these methods on three representative segmentation tasks involving different modalities and targets. Our
experimental results highlight that 3D CNNs may not always be the best choice. Despite all these 2.5D
methods can bring performance gains to 2D baseline, not all the methods hold the benefits on different
datasets. We hope the results and conclusions of our study will prove useful for the community on exploring
and developing efficient volumetric medical image segmentation methods.
Keywords: Medical Image Segmentation, Convolutional Neural Network, 2.5D Segmentation Methods,
Computation-Efficient Learning

∗ Corresponding author at: School of Biological Science and Medical Engineering, Beihang University, Beijing, China.

E-mail addresses: [email protected] (Yichi Zhang), [email protected] (Jicong Zhang)

Preprint submitted to Elsevier February 8, 2022

1. Introduction

Medical image segmentation aims to extract meaningful objects or regions from the image, which is a
fundamental and important step for many clinical applications like computer-assisted diagnosis, treatment
planning and surgery navigation. For example, accurate location and segmentation of abdominal anatomy
from CT images is helpful in cancer diagnosis and treatment [1, 2]. Atrial segmentation plays an important
role in the medical management of patients [3, 4]. However, manual segmentation of medical images is a
labor-intensive and tedious task, and subjects to variations of observers. This has motivated many researches
on automatic segmentation methods. In the past several years, convolutional neural networks (CNNs)1
have been proposed and achieved state-of-the-art performance in many medical image segmentation tasks
[3, 4, 5, 6]. Most of these segmentation methods are based on U-Net [7] or its variants, using encoder-
decoder architecture and incorporating multi-scale features by skip connections, which is proven to be a
strong baseline network for medical image segmentation [8].
With the development of medical imaging technology, 3D medical images such as Computed Tomogra-
phy (CT) and Magnetic Resonance Imaging (MRI) have been widely used in clinical approaches. Figure 1
presents two different examples of volumetric medical image segmentation with corresponding 3D reconstruc-
tion of segmentation results. Automatic segmentation of volumetric medical images has become increasingly
important for biomedical applications [9, 10]. To deal with volumetric input, two main strategies are ap-
plied. The first method is cutting the 3D volume into 2D slices and training 2D CNNs for segmentation
based on intra-slice information. Another strategy is to extend the network structure to volumetric data
by using 3D convolutions and train 3D CNNs for segmentation based on volumetric images [11, 12]. Both
of these two methods have their own advantages and disadvantages. 2D CNNs have lighter computation
and higher inference speed. However, the information between adjacent slices is neglected, which hinders
the improvement of segmentation accuracy. Besides, 2D segmentation results are prone to discontinuity in
3D space [13]. Although 3D CNNs have the perception of volumetric spacial information, there are still
some shortcomings. Due to the increased dimension, 3D CNNs require higher computation cost, resulting
in higher inference latency [14, 15]. Besides, the large number of parameters may result in higher risk of
overfitting, especially when encountering small datasets. Moreover, the GPU requirement of 3D CNNs is
impractically expensive, which hinders their further clinical application.
To bridge the gap between 2D and 3D CNNs, many 2.5D segmentation methods (also known as pseudo
3D methods) are proposed for efficient volumetric medical image segmentation by designing new architecture
or using strategies to fuse volumetric information into 2D CNNs. In this way, the models could enjoy both

1 In order to unify the statement, in this paper, CNNs refer specifically to convolution neural networks for medical image
segmentation.

2
Figure 1: Examples for volumetric medical image segmentation. The image with 2D label denoted by red and
corresponding 3D reconstruction of the label are shown in the figure.

light computation cost and time consumption of 2D CNNs and the ability to capture spacial contextual
information of 3D CNNs. However, these methods are tested on different datasets for different segmentation
tasks. Their performance in other medical image segmentation tasks is unknown. Besides, the comparison
experiments of most 2.5D methods are only compared with 2D and 3D CNNs, the comparison between these
methods have not been verified. Therefore, in practical applications, we do not know which method should
be selected to obtain better performance gains. In addition, although 2.5D CNNs obviously suffer from less
time consumption compared with 3D CNNs, the improvement has not been quantitatively studied. This
has hindered the further research and application of these methods.
In this paper, our contributions are summarized as follows:

• We summarize the latest developments about 2.5D methods for volumetric medical image segmentation
and divide these methods into three categories: multi-view fusion, incorporating inter-slice information
and fusing 2D/3D features.

• We make a large-scale empirical comparison of these 2.5D methods with 2D and 3D CNNs on 3 repre-
sentative public datasets involving different modalities (CT and MRI) and targets (cardiac, prostate
and abdomen) to systematically evaluate the performance of these methods on different medical image
segmentation tasks.

2. Related Work

For segmentation of volumetric medical images, 3D CNNs could capture richer contextual information
and improve the performance over the 2D CNNs, while higher memory requirements and time consumption
are required [16]. To issue this problem, many studies have suggested the usage of 2.5D segmentation
methods that fuse volumetric spatial information into 2D CNNs to improve the accuracy while reducing the

3
Figure 2: The overall workflow of multi-view fusion. The 3D input volume is divided into 2D slices from axial,
sagittal and coronal planes to train the corresponding 2D CNNs. Then the segmentation results from different views
are ensembled to get the final 2.5D segmentation result.

computational cost. In this section, we present an overview of 2.5D segmentation methods based on the
criteria that the method should be general and can be applied to different 3D segmentation tasks. Several
related methods also use 2.5D methods for specific task like blood vessel segmentation [17]. The evaluation
of these tailored methods is beyond the scope of this paper. In the following subsections, we give a review of
2.5D segmentation methods that will be evaluated in the following experiments. We divided these methods
into three categories: multi-view fusion, incorporating inter-slice information and fusing 2D/3D features.

2.1. Multi-view fusion

To incorporate volumetric spatial information into 2D CNNs, a simple and intuitive solution is multi-
view fusion (MVF). By generating slice-by-slice predictions based on 2D CNNs from different image views,
and then fusing the segmentation results by means of majority voting, the final 2.5D segmentation result is
obtained. Generally, the strategy is performed on three orthogonal planes, including the sagittal (S), coronal
(C) and axial (A) planes. For the segmentation of each plane, the information of two axes could be used.
Therefore, by combining the results of sagittal slices, coronal slices and axial slices together, the spatial
information of 3D volumes can be fully utilized, so as to obtain better segmentation results compared with
2D CNNs. The overall workflow of multi-view fusion is shown in Figure 2.

4
Figure 3: Examples of short-axis slices and long axis slices for anisotropic volumetric medical images. From the
figure we can see that long-axis slices are with poor contextual information due to the low resolution on z axis.

Multi-view fusion of three orthogonal planes has been applied in many 3D medical image segmentation
tasks [16, 18, 19, 20, 21, 22]. In these methods, three 2D CNNs are trained to segment from sagittal,
coronal, and axial planes separately. After that, the segmentation results from each plane are fused to get
the final segmentation results. Instead of the most commonly used majority voting for ensemble learning,
[16] proposed Volumetric Fusion Net (VFN), a shallow 3D CNN with much fewer parameters for fusion
of 2D results from different views. Their experimental results showed using VFN could obtain better 3D
segmentation results than majority voting.

2.2. Incorporating inter-slice information

Due to the imaging settings, some modalities of medical images consist of anisotropic voxels, which
means that the spatial resolution of three dimensions are not always the same. As a consequence, for these
anisotropic volumetric images, the image resolution in x and y axis (length and width) is more than ten
times higher than that of the z axis (depth), and x and y axis preserve much higher resolution and richer
information than the z axis [23]. Therefore, neighboring regions along z axis may have abrupt changes around
the same area [24]. Figure 3 presents an example of short-axis and long-axis slices of prostate MR images
with anisotropic voxels. From the figure, we can observe that long-axis slices suffer from poor contextual
information due to the low resolution on z axis. As a result, for anisotropic volumes, training 2D CNNs
from long-axis slices (xz or yz planes) may be inadvisable.
Instead of fusing results from different segmentation views, another strategy is incorporating inter-slice
information into 2D CNNs to explore the spatial correlation. The workflow of this strategy is shown in

5
Figure 4: The overall workflow of incorporating inter-slice information. With the additional input of neighboring
slices, the inter-slice correlation could be leveraged so as to get better performance compared with 2D CNNs.

Figure 4. In this way, the network could utilize not only the information of the slice at hand, but also the
information of its neighboring slices. The inputs of the network are consecutive slices while the output is the
corresponding segmentation results of the middle slice. By incorporating inter-slice information, the spatial
correlation of the volume can be leveraged while avoiding the heavy burden of 3D computing.
The most widely used method draws on the idea of natural image segmentation and introduces sequential
slices as multi-channel input for the segmentation of the middle slice. Using multi-slice input for utilization
of spatial information has been applied in many medical segmentation tasks [25, 26, 27, 28, 29]. Most of these
methods feed 3 slices into segmentation networks: the slice at hand and its adjacent slices [30, 31]. Instead
of inputting continuous slices, Zhao et al. [32] proposed to utilize both dense sampling with adjacent slices
and sparse sampling with dilated slices for 2.5D segmentation. However, directly adding adjacent slices as
multi-channel input may be inefficient. When neighboring slices are mixed together into channel dimension,
the information of input slices are fused in the first convolution layer. This process makes the network
harder to pick up useful information for distinguishing each slice, which may lead to worse performance.
Therefore, some work has focus on designing new architectures for inter-slice information extraction.
The successful application of Recurrent Neural Networks (RNNs) in segmentation tasks [27, 33] provides
potential solutions for refinement of 2D segmentation results. In this way, 2D slices of the 3D volume are
viewed as a time series sequence to distill the inter-slice contexts and features. Among the variants of RNNs,
bi-directional convolutional long short-term memory (BC-LSTM) [34] exhibits better improvement as the
network can integrate information flow from two directions to improve the spatial continuity of the overall
segmentation results, which has been applied in many medical image segmentation tasks [35, 21].
Other than using RNNs, another line of researches focuses on using attention mechanism to utilize inter-

6
2D CNN module
2D feature

Feature Fusion for Final

Segmentation Results
3D Input Volume 3D feature

3D CNN module

Figure 5: The overall workflow of fusing 2D/3D features. Features extracted from 2D and 3D CNNs/modules are
fused to utilize volumetric spatial information.

slice information. Since the information of each slice is spatially correlated with its upper and lower slices
due to the spatial continuity, the information of adjacent slices could be used to guide the segmentation
procedure by highlighting the most salient areas using attention mechanism. Zhang et al. [15] proposed an
inter-slice attention module that uses the information of adjacent slices to generate the attention masks so
as to provide a priori shape regulation for the segmentation. Kuang et al. [36] applied contextual-attention
block to force the model to focus on the border areas using element-wise subtraction between slices. Using
attention mechanism has been an emerging trend for extraction of inter-slice information to obtain further
improvements.

2.3. Fusing 2D/3D features

For computation-efficient volumetric medical image segmentation, some works also focus on the fusing
features extracted from 2D and 3D CNNs to obtain higher efficiency [37, 38, 39, 40]. Figure 5 presents the
workflow of this strategy. Although these methods still use 3D convolutions to extract spatial information,
however, the overall computation cost is reduced compared with training pure 3D CNNs. In this way, this
strategy can be regarded as 2.5D methods to explore the segmentation efficiency. The first method uses
2D results to provide priori shape information for 3D CNNs, then 3D CNNs are used to utilize the spatial
information based on 2D outputs and the original volume. In the end, 2D and 3D results are fused to get
the final segmentation results [37]. Another way uses use both 2D and 3D encoder for the feature extraction.
Then the 2D and 3D features at different scales are combined respectively in the encoding stage [38]. Besides,
instead of directly fusing the features, squeeze-and-excitation (SE) block [41] is applied to achieve better

7
fusion by weighting the output of the two dimensions. In summary, both methods aim at combining 2D and
3D features for efficient extraction of volumetric information. The main difference is the stage and method
to fuse the 2D and 3D features.

3. Experiments

Despite many 2.5D methods are proposed for efficient volumetric medical image segmentation, these
methods are evaluated on different datasets with different settings and backbone structures. The compar-
ison between these methods has not been verified. Therefore, in the following experiments, we intend to
systemically evaluate the performance of these 2.5D methods with the same backbone structure and settings
to compare their performance on three representative segmentation datasets.

3.1. Datasets

In the following experiments, we use 3 representative datasets of Medical Segmentation Decathlon Chal-
lenge [42] for the evaluation of these methods. The first dataset is the Cardiac which includes 20 mono-modal
MR volumes provided by King’s College London, for the segmentation of left atrium. The left atrium ap-
pendage, mitral plane, and portal vein end points were segmented by an expert using an automated tool [43]
followed by manual correction. The second dataset is Spleen which includes 41 CT volumes from Memorial
Sloan Kettering Cancer Center for spleen segmentation. The spleen was semi-automatically segmented using
the Scout application and the contours were manually adjusted by an expert abdominal radiologist. The
third dataset is Prostate which includes 32 Multimodal MR (T2, ADC) volumes provided by Radboud Uni-
versity for segmentation of prostate. Specifically, the Prostate dataset has highly anisotropic voxel spacings
with a typical image shape of 16x320x320. All the datasets are randomly splitted at a rate of 80% for
training and 20% for testing. We chose these three datasets because we want to involve typical modalities
(CT and MRI) and targets (cardiac, prostate and abdomen) in 3D medical image segmentation tasks so as
to obtain more general conclusions.

3.2. Backbone structure and evaluation metrics

We employ U-Net [7] as the network backbone that has four stage convolutional blocks in different
resolutions and the base convolution block has 8 feature maps. For comparison of 3D CNNs, we employ 3D
U-Net [11] on the same experimental environment and settings. All the experiments are implemented using
Pytorch and run in Linux on NVIDIA Tesla V100 GPUs. During training, we use the Adam optimizer for
all experiments with initial learning rate 0.001 and would be dropped by the factor of 0.5 if the training loss
was no more improved in 20 epochs. We use the combination of cross-entropy loss and dice loss as the loss
function for training. For quantitative evaluation of segmentation results, two complementary metrics are
introduced. Dice coefficient (DSC) are used to measure the region mismatch and 95% Hausdorff Distance
8
(95HD) are used to evaluate the boundary errors between the segmentation results and the ground truth.
Higher DSC (closer to 1) indicating more complete overlap and lower 95HD (closer to 0) indicating closer
boundaries represent better segmentation performance. The evaluation using these two metrics can achieve
a more comprehensive comparison of segmentation results.

3.3. Experimental design

For the first experiment, we aim to compare the performance of 2D segmentation results from different
segmentation planes including sagittal (S), coronal (C) and axial (A) planes, and the 2.5D multi-view fusion
results with different fusion methods including majority voting (MV), weighted voting (WV) and Volumetric
Fusion Net (VFN) [16]. For the second experiment, we first compare the performance of different number
of input slices for multi-slice input. Then we compare the performance of different ways to incorporate
inter-slice information including multi-slice input, applying RNNs and using attention mechanism as in [15]
and [36]. Specifically, we select the segmentation plane with best performance according to the former
experimental results to generate 2D slices. After that, for the third experiment, we make comparisons of
different stages including feature fusion at encoder stage and output stage, and different methods including
Add and SE [41] to fuse 2D and 3D features.

4. Results

In this section, we systematically evaluate the performance of these 2.5D methods on three representative
public datasets with the above implementation details.

4.1. Experimental results of multi-view fusion

Table 1 presents the quantitative results of multi-view fusion. The experiments are designed to evaluate
the performance of multi-view fusion on different segmentation tasks. Firstly, we train 2D U-Net from
sagittal, coronal and axial planes, named U-Net(S), U-Net(C) and U-Net(A) respectively. Then 2.5D multi-
view fusion results are obtained by ensembling results from three segmentation views using different fusion
methods. Compared with the best 2D results of different views, multi-view fusion with majority voting can
obtain performance improvement by 3.62% and 0.90% in terms of Dice on Cardiac MRI and Spleen CT
datasets, respectively. Compared with majority voting, using VFN could achieve additional performance
gains by 1.48% and 2.06% on Dice, although we need to train an extra network for fusion of different
outputs, resulting in additional computation cost. However, for anisotropic volumes like Prostate MRI, we
observe that the performance of 2D CNNs for sagittal and coronal slices is poor due to limited intra-slice
information. When taking the unsatisfactory results into account for multi-view fusion, the performance
may be even worse compared with the best 2D results on Dice metric. To issue the problem, we use another
fusion method by giving the short-axis slices (generally with better 2D results) more weight, which we
9
Table 1: Quantitative segmentation results of 2D U-Net from different planes, 2.5D methods with different
fusion strategies and 3D U-Net on the evaluation datasets. The arrows indicate which direction is better.

2D CNNs 2.5D Multi-view Fusion 3D CNNs

Dataset Metrics
U-Net (A) U-Net (S) U-Net (C) MV WV VFN 3D U-Net

Cardiac MRI Dice ↑ 0.8362 0.8630 0.8225 0.8992 0.8996 0.9140 0.9203

95HD ↓ 7.120 3.274 5.268 1.287 1.104 1.021 1.014

Spleen CT Dice ↑ 0.9114 0.8027 0.8858 0.9204 0.9208 0.9410 0.9452

95HD ↓ 1.521 8.710 2.165 1.511 1.508 1.119 1.222

Prostate MRI Dice ↑ 0.7222 0.4322 0.4545 0.6448 0.7490 0.7731 0.7677

95HD ↓ 8.933 23.72 15.31 4.956 4.402 4.460 5.935

named weighted voting (WV). From the results, we can observe that weighted voting outperforms majority
voting, since the additional weight can mitigate the impact of bad results while fusing useful multi-view
information. However, for Cardiac MRI and Spleen CT datasets, the improvement of WV compared with
MV is not obvious.

4.2. Experimental results of incorporating inter-slice information

To evaluate the influence of different number of input slices on incorporating contextual information, we
make comparison experiments of different number of input slices for multi-slice input. We select short-axis
slices with the best 2D performance as the segmentation plane for the following experiments. As shown
in Table 2, generally, using with neighboring slices for multi-slice input can obtain better results compared
with 2D single-slice input. These findings prove that the usage of neighboring slices could improves the
segmentation accuracy. Interestingly, we observe that additional slices did not improve the performance
when the number of input slices reaches three. One possible explanation is that non-adjacent neighboring
slices with less useful information may sometimes mislead the segmentation of middle slice. Besides, more
input slices make the computational cost of the network higher. As a result, using three as the number of
input slices may be the best practice.
Table 3 shows the quantitative results of different methods to incorporate inter-slice information and their
comparison with 2D U-Net and 3D U-Net. In addition to naive multi-slice inputs, we conduct experiments on
three different methods of incorporating inter-slice information: applying RNNs (Bi-CLSTM) for information
extraction, using inter-slice contextual attention for priori shape regulation [15] and border areas [36] .
From the results, we could see that compared with naive multi-slice input, all three methods can obtain

10
Table 2: Quantitative segmentation results of 2D and 2.5D methods with different number of input channels
on the evaluation datasets. The arrows indicate which direction is better.

2D CNNs 2.5D Multi-slice Input 3D CNNs

Dataset Metrics
U-Net (1slice) 3 slices 5 slices 7 slices 3D U-Net

Cardiac MRI Dice ↑ 0.8630 0.8781 0.8642 0.867 0.9203

95HD ↓ 3.274 1.559 2.525 3.837 1.014

Spleen CT Dice ↑ 0.9114 0.9121 0.9082 0.9089 0.9452

95HD ↓ 1.521 1.345 1.581 1.527 1.222

Prostate MRI Dice ↑ 0.7222 0.7460 0.7448 0.7257 0.7677

95HD ↓ 8.933 5.565 4.759 7.225 5.935

Table 3: Quantitative segmentation results of 2.5D methods to incorporate inter-slice information, 2D U-Net
and 3D U-Net. The arrows indicate which direction is better.

2D CNNs 2.5D Incorporate Inter-slice Information 3D CNNs

Dataset Metrics
U-Net Multi-slice RNN Att Shape Att Border 3D U-Net

Cardiac MRI Dice ↑ 0.8630 0.8781 0.8858 0.8821 0.8747 0.9203

95HD ↓ 3.274 1.559 2.155 1.494 1.750 1.014

Spleen CT Dice ↑ 0.9114 0.9121 0.9167 0.9233 0.9155 0.9452

95HD ↓ 1.521 1.345 1.573 1.390 1.794 1.222

Prostate MRI Dice ↑ 0.7222 0.7460 0.7801 0.7804 0.7885 0.7677

95HD ↓ 8.933 5.565 4.914 3.448 3.907 5.935

11
Table 4: Quantitative segmentation results of different stages and methods to fuse 2D and 3D features. The
arrows indicate which direction is better.

Fusion of 2D and 3D features

Dataset Metrics 2D CNNs Encoder Encoder Output Output 3D CNNs

Add SE Add SE

Cardiac MRI Dice ↑ 0.8630 0.8743 0.8732 0.8876 0.8824 0.9203

95HD ↓ 3.274 1.759 1.913 1.604 1.707 1.014

Spleen CT Dice ↑ 0.9114 0.9221 0.9193 0.9197 0.9122 0.9452

95HD ↓ 1.521 1.445 1.463 1.425 1.487 1.222

Prostate MRI Dice ↑ 0.7222 0.7562 0.7693 0.7625 0.7597 0.7677

95HD ↓ 8.933 7.141 6.259 5.567 8.780 5.935

extra performance gains due to more efficient extraction of inter-slice information. However, these methods
have variable performance on different datasets and none of them outperforms the other two on all tasks.
Specifically, we observe that for segmentation of anisotropic volumes like Prostate MRI, 3D CNNs may not
be the best choice since the scarce inter-slice information will be further lost when downsampling on z axis.
As a result, all three methods outperform 3D U-Net on all metrics.

4.3. Experimental results of fusing 2D/3D features

In this subsection, we aim at comparing the performances of different stages and methods to fuse 2D
and 3D features. For feature fusion stages, we select fusing features extracted from 2D and 3D encoders
at encoding stage as in [38] and fusing the output feature maps of 2D and 3D networks as in [37]. For
feature fusion methods, two methods of directly adding corresponding feature maps and using squeeze-
and-excitation (SE) block [41] are compared. Table 4 shows the evaluation results of fusing 2D and 3D
features. From the table we could see that compared with 2D CNNs, all fusion methods can achieve better
performance. Specifically, for Cardiac MRI segmentation, fusing features at output stage could get better
results than that at encoding stage. However, for Spleen CT and Prostate MRI, these methods have variable
performance and no one is superior to the others on all tasks. As a conclusion, fusing 2D and 3D features
could efficiently utilize volumetric spatial information and improve the performance compared with 2D
CNNs. For different fusion methods, fusing at encoding stage could urge the network to extract useful
information with affordable computation cost. For fusing 2D and 3D outputs, the addition of 2D results
could provide useful intra-slice information and accelerate the training procedure of 3D network. Although

12
Table 5: Comparison of model complexity of different segmentation methods.

Methods CNNs Parameters (M) Training Time

2D U-Net 2D 0.487 1.00x

3D U-Net 3D 1.403 4.14x

VFN (shallow 3D U-Net) 3D 0.088 ——

2D U-Net + Att Shape 2.5D 0.487 1.11x

2D U-Net + Att Border 2.5D 0.487 1.08x

2D U-Net + RNN 2.5D 0.535 1.31x

2D/3D Fusion (Encoder) 2.5D 0.543 1.54x

2D/3D Fusion (Output) 2.5D 0.841 2.32x

all these methods can improve the segmentation efficiency, however, the comparison of their performance is
still affected by the characteristics of segmentation tasks.

5. Conclusion and Discussion

In this paper, we aim at exploring 2.5D method for efficient volumetric medical image segmentation
to overcome the lack of volumetric information of 2D CNNs and heavy computation cost of 3D CNNs,
and make a large-scale empirical comparison on three representative medical image segmentation tasks
involving different modalities and targets. Generally, 3D CNNs seem to be the best choice for segmentation
accuracy, however, the unaffordable computational cost remains a hindrance for their further application.
Besides, for anisotropic volumes that x and y axis preserve much higher resolution than the z axis, the
performance advantages of 3D CNNs are not obvious, sometimes even worse than some 2.5D methods. For
these anisotropic images with high discontinuity between slices, directly applying 3D CNNs may involve
irrelevant features and hinder the learning process, which is in line with the conclusions in [15, 24].
For 2.5D segmentation methods, we observe that all these 2.5D methods have the potential to improve
the performance of 2D baseline. Multi-view fusion of 2D results from different views can get segmentation
results comparable to 3D CNNs. However, we still need to train 3 (or 4 using VFN) networks to get
the final segmentation results. In addition, the improvement of this strategy for anisotropic volumes is
not significant. For incorporating inter-slice information, we could obtain performance improvement while
training only one network with slight computational increasement. Compared with naive multi-slice input to
incorporate volumetric information, all RNN-based and attention-based 2.5D methods can further improve
the segmentation performance. However, the performance gains of these methods are not consistent in
13
different datasets. For fusing 2D/3D features, all these fusion methods can improve the segmentation
efficiency compared with using only 2D CNNs, demonstrating the effectiveness of fusing 3D features to
utilize volumetric information. However, the performance of different fusion methods is affected by the
characteristics of different segmentation tasks. Table 5 presents the comparison of model parameters and
average training time on different datasets (ratio to 2D baseline) of different segmentation methods. It can
be observed that all 2.5D methods have fewer parameters and less computation cost compared with 3D
CNNs. We cannot claim that we have completely reproduced all these methods, however, we tried our best
to tune each method so as to ensure a fair comparison. Another limitation is that we only use 2D/3D U-Net
as the backbone structure. However, they may not always be the best choice for all segmentation tasks.
Still, we hope the results and conclusions of our study will prove useful for the community on exploring and
developing efficient volumetric medical image segmentation methods.

Acknowledgement

This work is supported by the National Key Research and Development Program of China under Grant
2016YFF0201002, the University Synergy Innovation Program of Anhui Province under Grant GXXT-2019-
044, and the National Natural Science Foundation of China under Grant 61301005.

References

[1] R. Wolz, C. Chu, K. Misawa, M. Fujiwara, K. Mori, D. Rueckert, Automated abdominal multi-organ segmentation with
subject-specific atlas generation, IEEE transactions on medical imaging 32 (9) (2013) 1723–1730.
[2] J. Ma, Y. Zhang, S. Gu, C. Zhu, C. Ge, Y. Zhang, X. An, C. Wang, Q. Wang, X. Liu, et al., Abdomenct-1k: Is abdominal
organ segmentation a solved problem, IEEE Transactions on Pattern Analysis and Machine Intelligence (2021).
[3] O. Bernard, A. Lalande, C. Zotti, F. Cervenansky, X. Yang, P.-A. Heng, I. Cetin, K. Lekadir, O. Camara, M. A. G.
Ballester, et al., Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: is the
problem solved?, IEEE Transactions on Medical Imaging 37 (11) (2018) 2514–2525.
[4] A. Lalande, Z. Chen, T. Pommier, T. Decourselle, A. Qayyum, M. Salomon, D. Ginhac, Y. Skandarani, A. Boucher,
K. Brahim, et al., Deep learning methods for automatic evaluation of delayed enhancement-mri. the results of the emidec
challenge, arXiv preprint arXiv:2108.04016 (2021).
[5] L. Wang, D. Nie, G. Li, É. Puybareau, J. Dolz, Q. Zhang, F. Wang, J. Xia, Z. Wu, J.-W. Chen, et al., Benchmark on
automatic six-month-old infant brain segmentation algorithms: the iseg-2017 challenge, IEEE transactions on medical
imaging 38 (9) (2019) 2219–2230.
[6] N. Heller, F. Isensee, K. H. Maier-Hein, X. Hou, C. Xie, F. Li, Y. Nan, G. Mu, Z. Lin, M. Han, et al., The state of the
art in kidney and kidney tumor segmentation in contrast-enhanced ct imaging: Results of the kits19 challenge, Medical
Image Analysis 67 (2021) 101821.
[7] O. Ronneberger, P. Fischer, T. Brox, U-net: Convolutional networks for biomedical image segmentation, in: International
Conference on Medical image computing and computer-assisted intervention, Springer, 2015, pp. 234–241.
[8] F. Isensee, P. F. Jaeger, S. A. Kohl, J. Petersen, K. H. Maier-Hein, nnu-net: a self-configuring method for deep learning-
based biomedical image segmentation, Nature Methods (2020) 1–9.

14
[9] M. H. Hesamian, W. Jia, X. He, P. Kennedy, Deep learning techniques for medical image segmentation: achievements and
challenges, Journal of digital imaging 32 (4) (2019) 582–596.
[10] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. N. Chiang, Z. Wu, X. Ding, Embracing imperfect datasets: A review of deep
learning solutions for medical image segmentation, Medical Image Analysis 63 (2020) 101693.
[11] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, O. Ronneberger, 3d u-net: learning dense volumetric segmentation
from sparse annotation, in: International conference on medical image computing and computer-assisted intervention,
Springer, 2016, pp. 424–432.
[12] F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volumetric medical image segmen-
tation, in: 2016 fourth international conference on 3D vision (3DV), IEEE, 2016, pp. 565–571.
[13] Y. Qu, X. Li, Z. Yan, L. Zhao, L. Zhang, C. Liu, S. Xie, K. Li, D. Metaxas, W. Wu, et al., Surgical planning of pelvic
tumor using multi-view cnn with relation-context representation learning, Medical Image Analysis 69 (2021) 101954.
[14] Q. Yu, Y. Xia, L. Xie, E. K. Fishman, A. L. Yuille, Thickened 2d networks for efficient 3d medical image segmentation,
arXiv preprint arXiv:1904.01150 (2019).
[15] Y. Zhang, L. Yuan, Y. Wang, J. Zhang, Sau-net: Efficient 3d spine mri segmentation using inter-slice attention, Proceedings
of Machine Learning Research 121 (2020) 903–913.
[16] Y. Xia, L. Xie, F. Liu, Z. Zhu, E. K. Fishman, A. L. Yuille, Bridging the gap between 2d and 3d organ segmentation with
volumetric fusion net, in: International Conference on Medical Image Computing and Computer-Assisted Intervention,
Springer, 2018, pp. 445–453.
[17] C. Angermann, M. Haltmeier, R. Steiger, S. Pereverzyev, E. Gizewski, Projection-based 2.5 d u-net architecture for fast
volumetric segmentation, in: 2019 13th International conference on Sampling Theory and Applications (SampTA), IEEE,
2019, pp. 1–5.
[18] J. Yun, J. Park, D. Yu, J. Yi, M. Lee, H. J. Park, J.-G. Lee, J. B. Seo, N. Kim, Improvement of fully automated airway
segmentation on volumetric computed tomographic images using a 2.5 dimensional convolutional neural net, Medical
image analysis 51 (2019) 13–20.
[19] G. Wang, W. Li, S. Ourselin, T. Vercauteren, Automatic brain tumor segmentation using cascaded anisotropic convolu-
tional neural networks, in: International MICCAI brainlesion workshop, Springer, 2017, pp. 178–190.
[20] H. Cui, X. Liu, N. Huang, Pulmonary vessel segmentation based on orthogonal fused u-net++ of chest ct images, in:
International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2019, pp. 293–300.
[21] H. Li, J. Li, X. Lin, X. Qian, A model-driven stack-based fully convolutional network for pancreas segmentation, in: 2020
5th International Conference on Communication, Image and Signal Processing (CCISP), IEEE, 2020, pp. 288–293.
[22] Y. Zhang, J. Wu, Y. Liu, Y. Chen, W. Chen, E. X. Wu, C. Li, X. Tang, A deep learning framework for pancreas
segmentation with multi-atlas registration and 3d level-set, Medical Image Analysis 68 (2021) 101884.
[23] Y. Zhang, Cascaded convolutional neural network for automatic myocardial infarction segmentation from delayed-
enhancement cardiac mri, in: International Workshop on Statistical Atlases and Computational Models of the Heart,
Springer, 2020, pp. 328–333.
[24] Y. Ou, Y. Yuan, X. Huang, K. Wong, J. Volpi, J. Z. Wang, S. T. Wong, Lambdaunet: 2.5 d stroke lesion segmentation
of diffusion-weighted mr images, in: International Conference on Medical Image Computing and Computer-Assisted
Intervention, Springer, 2021, pp. 731–741.
[25] J. Duan, G. Bello, J. Schlemper, W. Bai, T. J. W. Dawes, C. A. Biffi, A. De Marvao, G. Doumoud, D. P. Oregan,
D. Rueckert, Automatic 3d bi-ventricular segmentation of cardiac images by a shape-refined multi- task deep learning
approach, IEEE Transactions on Medical Imaging 38 (9) (2019) 2151–2164.
[26] X. Wang, S. Han, Y. Chen, D. Gao, N. Vasconcelos, Volumetric attention for 3d medical image segmentation and detection,
in: International conference on medical image computing and computer-assisted intervention, Springer, 2019, pp. 175–184.

15
[27] Q. Yu, L. Xie, Y. Wang, Y. Zhou, E. K. Fishman, A. L. Yuille, Recurrent saliency transformation network: Incorporating
multi-stage visual cues for small organ segmentation, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2018, pp. 8280–8289.
[28] Y. Li, H. Li, Y. Fan, Acenet: Anatomical context-encoding network for neuroanatomy segmentation, Medical image
analysis 70 (2021) 101991.
[29] H. Zhou, J. Xiao, Z. Fan, D. Ruan, Intracranial vessel wall segmentation for atherosclerotic plaque quantification, in: 2021
IEEE 18th International Symposium on Biomedical Imaging (ISBI), IEEE, 2021, pp. 1416–1419.
[30] S. Liu, X. Yuan, R. Hu, S. Liang, S. Feng, Y. Ai, Y. Zhang, Automatic pancreas segmentation via coarse location and
ensemble learning, IEEE Access 8 (2019) 2906–2914.
[31] L. Li, S. Lian, Z. Luo, S. Li, B. Wang, S. Li, Learning consistency-and discrepancy-context for 2d organ segmentation, in:
International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, 2021, pp. 261–270.
[32] Z. Zhao, Z. Ma, Y. Liu, Z. Zeng, P. K. Chow, Multi-slice dense-sparse learning for efficient liver and tumor segmentation,
in: 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), IEEE,
2021, pp. 3582–3585.
[33] X. Yang, L. Yu, S. Li, H. Wen, D. Luo, C. Bian, J. Qin, D. Ni, P.-A. Heng, Towards automated semantic segmentation in
prenatal volumetric ultrasound, IEEE transactions on medical imaging 38 (1) (2018) 180–193.
[34] J. Chen, L. Yang, Y. Zhang, M. Alber, D. Z. Chen, Combining fully convolutional and recurrent neural networks for 3d
biomedical image segmentation, Advances in neural information processing systems 29 (2016).
[35] Q. Zhu, B. Du, B. Turkbey, P. Choyke, P. Yan, Exploiting interslice correlation for mri prostate image segmentation, from
recursive neural networks aspect, Complexity 2018 (2018).
[36] Z. Kuang, X. Deng, L. Yu, H. Wang, T. Li, S. Wang, ψ-net: Focusing on the border areas of intracerebral hemorrhage on
ct images, Computer Methods and Programs in Biomedicine 194 (2020) 105546.
[37] X. Li, H. Chen, X. Qi, Q. Dou, C.-W. Fu, P.-A. Heng, H-denseunet: hybrid densely connected unet for liver and tumor
segmentation from ct volumes, IEEE transactions on medical imaging 37 (12) (2018) 2663–2674.
[38] Y. Zhou, W. Huang, P. Dong, Y. Xia, S. Wang, D-unet: a dimension-fusion u shape network for chronic stroke lesion
segmentation, IEEE/ACM Transactions on Computational Biology and Bioinformatics (2019) 1–1.
[39] Z. Chen, C. Li, J. He, J. Ye, D. Song, S. Wang, L. Gu, Y. Qiao, A novel hybrid convolutional neural network for accurate
organ segmentation in 3d head and neck ct images, in: International Conference on Medical Image Computing and
Computer-Assisted Intervention, Springer, 2021, pp. 569–578.
[40] H. Mei, W. Lei, R. Gu, S. Ye, Z. Sun, S. Zhang, G. Wang, Automatic segmentation of gross target volume of nasopharynx
cancer using ensemble of multiscale deep neural networks with spatial attention, Neurocomputing 438 (2021) 211–222.
[41] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: Proceedings of the IEEE conference on computer vision and
pattern recognition, 2018, pp. 7132–7141.
[42] A. L. Simpson, M. Antonelli, S. Bakas, M. Bilello, K. Farahani, B. Van Ginneken, A. Kopp-Schneider, B. A. Landman,
G. Litjens, B. Menze, et al., A large annotated medical image dataset for the development and evaluation of segmentation
algorithms, arXiv preprint arXiv:1902.09063 (2019).
[43] O. Ecabert, J. Peters, M. J. Walker, T. Ivanc, C. Lorenz, J. von Berg, J. Lessick, M. Vembar, J. Weese, Segmentation of
the heart and great vessels in ct images using a model-based adaptation framework, Medical image analysis 15 (6) (2011)
863–876.