Huimin Huang, Lanfen Lin, Ruofeng Tong, Hongjie Hu, Qiaowei Zhang, Yutaro Iwamoto, Xianhua Han, Yen-Wei Chen, Jian Wu
2.1. Full-scale Skip Connections

The proposed full-scale skip connections convert the inter-connection between the encoder and decoder as well as the intra-connection between the decoder sub-networks. Both UNet with plain connections and UNet++ with nested and dense connections fall short of exploring sufficient information from full scales, failing to explicitly learn the position and boundary of an organ. To remedy this defect in UNet and UNet++, each decoder layer in UNet 3+ incorporates both smaller- and same-scale feature maps from the encoder and larger-scale feature maps from the decoder, which capture fine-grained details and coarse-grained semantics in full scales.

As an example, Fig. 2 illustrates how to construct the feature map of X_De^3. Similar to the UNet, the feature map from the same-scale encoder layer X_En^3 is directly received in the decoder. In contrast to the UNet, a set of inter encoder-decoder skip connections delivers the low-level detailed information from the smaller-scale encoder layers X_En^1 and X_En^2 by applying non-overlapping max pooling operations, while a chain of intra-decoder skip connections transmits the high-level semantic information from the larger-scale decoder layers X_De^4 and X_De^5 by utilizing bilinear interpolation. With the five same-resolution feature maps in hand, we need to further unify the number of channels as well as reduce the superfluous information; a convolution with 64 filters of size 3 × 3 is a satisfying choice. To seamlessly merge the shallow, exquisite information with the deep semantic information, we further perform a feature aggregation mechanism on the concatenated feature maps from the five scales:

    X_De^i = X_En^i,                                                          i = N
    X_De^i = H([C(D(X_En^k))_{k=1}^{i-1}, C(X_En^i), C(U(X_De^k))_{k=i+1}^{N}]),  i = 1, …, N−1   (1)

where the function C(·) denotes a convolution operation, H(·) realizes the feature aggregation mechanism with a convolution followed by a batch normalization and a ReLU activation function, D(·) and U(·) indicate the down- and up-sampling operations respectively, and [·] represents concatenation.

It is worth mentioning that our proposed UNet 3+ is more efficient, with fewer parameters. In the encoder sub-network, UNet, UNet++ and UNet 3+ share the same structure, where X_En^i has 32 × 2^i channels. As for the decoder, the depth of each feature map in UNet is symmetric to the encoder, and thus X_De^i also has 32 × 2^i channels. The number of parameters in the i-th decoder stage of UNet (P_U-De^i) can be computed as:

    P_U-De^i = D_k × D_k × [d(X_De^{i+1}) × d(X_De^i) + (d(X_De^i) + d(X_En^i) + d(X_De^i)) × d(X_De^i)]   (2)

where D_k is the convolution kernel size and d(·) denotes the depth of the nodes. When it comes to UNet++, it makes use of a dense convolution block along each skip pathway, with X_Me^{i,k} denoting the k-th intermediate node on the i-th skip pathway, so P_U++-De^i can be computed as:

    P_U++-De^i = D_k × D_k × [d(X_De^{i+1}) × d(X_De^i) + (d(X_De^i) + d(X_En^i) + Σ_{k=1}^{5−1−i} d(X_Me^{i,k}) + d(X_De^i)) × d(X_De^i)]   (3)

As can be seen, P_U++-De^i is larger than P_U-De^i. In UNet 3+, by contrast, each decoder feature map is derived from N scales, yielding 64 × N channels, and P_U3+-De^i can be computed as:

    P_U3+-De^i = D_k × D_k × [(Σ_{k=1}^{i} d(X_En^k) + Σ_{k=i+1}^{N} d(X_De^k)) × 64 + d(X_De^i) × d(X_De^i)]   (4)

Owing to the channel reduction, the number of parameters in UNet 3+ is smaller than in UNet and UNet++.
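As a rough, self-contained check of Eqs. (2)–(4), the sketch below tallies only the 3 × 3 convolution weights of one decoder stage. All concrete choices here are assumptions for illustration, not the authors' code: five levels, d(X^i) = 32 × 2^i, UNet++ intermediate nodes of depth d(X^i), X_De^5 identified with X_En^5, and 64 × 5 = 320-channel UNet 3+ decoder maps; biases, batch norm, and any extra resampling convolutions are ignored.

```python
# Illustrative parameter tallies for Eqs. (2)-(4); assumptions, not the paper's code.
D_K = 3                       # convolution kernel size D_k
N = 5                         # number of scales
d = lambda i: 32 * 2 ** i     # assumed channel depth at level i (i = 1..5)

def p_unet(i):
    """Eq. (2): i-th decoder stage of UNet (d(X_En^i) = d(X_De^i) = d(i))."""
    return D_K * D_K * (d(i + 1) * d(i) + (d(i) + d(i) + d(i)) * d(i))

def p_unetpp(i):
    """Eq. (3): UNet++ adds 5-1-i dense intermediate nodes X_Me^{i,k}."""
    mid = (N - 1 - i) * d(i)  # assumed depth d(i) per intermediate node
    return D_K * D_K * (d(i + 1) * d(i) + (d(i) + d(i) + mid + d(i)) * d(i))

def d_de3p(k):
    """Assumed UNet 3+ decoder depth: 64*N channels, except X_De^5 = X_En^5."""
    return d(k) if k == N else 64 * N

def p_unet3p(i):
    """Eq. (4): every incoming scale is reduced to 64 channels, then fused."""
    incoming = sum(d(k) for k in range(1, i + 1)) \
             + sum(d_de3p(k) for k in range(i + 1, N + 1))
    return D_K * D_K * (incoming * 64 + (64 * N) * (64 * N))

totals = {name: sum(f(i) for i in range(1, N))
          for name, f in [("UNet", p_unet), ("UNet++", p_unetpp), ("UNet 3+", p_unet3p)]}
print(totals)  # UNet 3+ decoder total comes out smallest under these assumptions
```

Under these assumptions the summed decoder cost orders as UNet 3+ < UNet ≤ UNet++, mirroring the claim above; the exact figures depend on the assumed depths.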
Fig. 2: Illustration of how to construct the full-scale aggregated feature map of the third decoder layer X_De^3.

Fig. 3: Illustration of the classification-guided module (CGM).
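As a shape-only illustration of the construction in Fig. 2 (assuming a 320 × 320 input, the channel rule of Sec. 2.1, and 64 filters per delivering convolution; this is a sketch, not the authors' code), the five sources of X_De^3 can be tabulated as:

```python
# Shape-only sketch of building X_De^3 (Fig. 2); sizes are illustrative assumptions.
H = W = 320
res = {i: (H // 2 ** (i - 1), W // 2 ** (i - 1)) for i in range(1, 6)}  # level resolution

i = 3          # target decoder level
sources = []
for k in (1, 2):   # smaller-scale encoder maps: non-overlapping max pooling
    sources.append((f"X_En^{k} pooled by {2 ** (i - k)}", res[i], 64))
sources.append((f"X_En^{i} (same scale)", res[i], 64))
for k in (4, 5):   # larger-scale decoder maps: bilinear upsampling
    sources.append((f"X_De^{k} upsampled by {2 ** (k - i)}", res[i], 64))

channels = sum(c for _, _, c in sources)
print(channels)  # 5 scales x 64 filters = 320 channels entering the aggregation conv
```

All five sources land at the same 80 × 80 resolution with 64 channels each, so the aggregation mechanism H(·) operates on a 320-channel concatenation.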
2.2. Full-scale Deep Supervision

In order to learn hierarchical representations from the full-scale aggregated feature maps, full-scale deep supervision is further adopted in UNet 3+. Compared with the deep supervision performed on the generated full-resolution feature map in UNet++, the proposed UNet 3+ yields a side output from each decoder stage, which is supervised by the ground truth. To realize deep supervision, the last layer of each decoder stage is fed into a plain 3 × 3 convolution layer followed by a bilinear up-sampling and a sigmoid function.

To further enhance the boundary of organs, we propose a multi-scale structural similarity index (MS-SSIM) [9] loss function to assign higher weights to the fuzzy boundary. Benefiting from it, UNet 3+ keeps an eye on the fuzzy boundary, as the greater the regional distribution difference, the higher the MS-SSIM value. Two corresponding N × N sized patches are cropped from the segmentation result P and the ground truth mask G, denoted as p = {p_j : j = 1, …, N²} and g = {g_j : j = 1, …, N²}, respectively. The MS-SSIM loss function of p and g is defined as:

    ℓ_ms-ssim = 1 − Π_{m=1}^{M} ((2 μ_p μ_g + C_1) / (μ_p² + μ_g² + C_1))^{β_m} × ((2 σ_pg + C_2) / (σ_p² + σ_g² + C_2))^{γ_m}   (5)

where M is the total number of scales, μ_p, μ_g and σ_p, σ_g are the means and standard deviations of p and g, and σ_pg denotes their covariance. β_m and γ_m define the relative importance of the two components at each scale and are set according to [9]. Two small constants, C_1 = 0.01² and C_2 = 0.03², are added to avoid the unstable circumstance of dividing by zero. In our experiment, we set the number of scales to 5 based on [9].

By combining the focal loss (ℓ_fl) [10], the MS-SSIM loss (ℓ_ms-ssim) and the IoU loss (ℓ_iou) [11], we develop a hybrid loss for segmentation in a three-level hierarchy – pixel-, patch- and map-level – which is able to capture both large-scale and fine structures with clear boundaries. The hybrid segmentation loss (ℓ_seg) is defined as:

    ℓ_seg = ℓ_fl + ℓ_ms-ssim + ℓ_iou   (6)

2.3. Classification-guided Module (CGM)

In most medical image segmentation tasks, the appearance of false positives in a non-organ image is an inevitable circumstance. It is, in all probability, caused by noisy information from the background remaining in shallower layers, leading to the phenomenon of over-segmentation. To achieve more accurate segmentation, we attempt to solve this problem by adding an extra classification task, which is designed for predicting whether the input image has an organ or not.

As depicted in Fig. 3, after passing through a series of operations including dropout, convolution, max pooling and sigmoid, a 2-dimensional tensor is produced from the deepest level X_En^5, each element of which represents the probability of with/without organs. Benefiting from the richest semantic information, the classification result can further guide each segmentation side output in two steps. First, with the help of the argmax function, the 2-dimensional tensor is transferred into a single output of {0, 1}, which denotes with/without organs. Subsequently, we multiply this single classification output with the side segmentation output. Due to the simplicity of the binary classification task, the module effortlessly achieves accurate classification results under the optimization of the binary cross-entropy loss function [12], which realizes the guidance for remedying the drawback of over-segmentation on non-organ images.

3. EXPERIMENTS AND RESULTS

3.1. Datasets and Implementation

The method was validated on two organs: the liver and the spleen. The dataset for liver segmentation is obtained from the ISBI LiTS 2017 Challenge. It contains 131 contrast-enhanced 3D abdominal CT scans, of which 103 and 28 volumes are used for training and testing, respectively. The spleen dataset from the hospital passed the ethics approvals, containing 40 and 9 CT volumes for training and testing. In order to speed up training, the input image had three channels, including the slice to be segmented and the upper and lower slices, and was cropped to 320 × 320. We utilized stochastic gradient descent to optimize our network, and its hyper-parameters were set to the default values. The Dice coefficient was used as the evaluation metric for each case.

3.2. Comparison with UNet and UNet++

In this section, we first compare the proposed UNet 3+ with UNet and UNet++. The loss function used in each method is the focal loss.
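The Dice coefficient used as the evaluation metric can be sketched on binary masks as follows (a minimal illustration with a hypothetical smoothing term, not the authors' implementation):

```python
# Minimal sketch of the Dice coefficient on flattened binary masks
# (illustrative only; eps is an assumed smoothing constant).

def dice(pred, truth, eps=1e-7):
    """Dice = 2|P ∩ G| / (|P| + |G|)."""
    inter = sum(p * g for p, g in zip(pred, truth))
    return (2.0 * inter + eps) / (sum(pred) + sum(truth) + eps)

# Example: 4-pixel masks agreeing on one of two foreground pixels each.
print(round(dice([1, 1, 0, 0], [1, 0, 0, 1]), 4))  # 0.5
```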
Table 1: Comparison of UNet, UNet++, the proposed UNet 3+ without deep supervision (DS) and UNet 3+ on liver and spleen
datasets in terms of Dice metrics. The best results are highlighted in bold. The loss function used in each method is focal loss.
                   |---------- Vgg-16 ----------|   |-------- ResNet-101 --------|
Architecture       Params  Dice_liver  Dice_spleen   Params  Dice_liver  Dice_spleen   Dice_average
UNet 39.39M 0.9206 0.9023 55.90M 0.9387 0.9332 0.9237
UNet++ 47.18M 0.9278 0.9230 63.76M 0.9475 0.9423 0.9352
UNet 3+ w/o DS 26.97M 0.9489 0.9437 43.55M 0.9580 0.9539 0.9511
UNet 3+ 26.97M 0.9550 0.9496 43.55M 0.9601 0.9560 0.9552
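For completeness, the two-step guidance of the CGM (Sec. 2.3) can be sketched as follows. The values are hypothetical and the convention that index 1 of the class-probability pair denotes "organ present" is an assumption for illustration, not taken from the authors' code:

```python
# Toy sketch of CGM guidance: argmax over a 2-class probability pair
# yields a {0,1} gate that multiplies every pixel of a side output.
# Assumption: index 1 = "organ present"; all numbers are hypothetical.

def cgm_gate(class_probs, side_output):
    has_organ = 1 if class_probs[1] > class_probs[0] else 0  # argmax over 2 classes
    return [[has_organ * px for px in row] for row in side_output]

seg = [[0.9, 0.2], [0.4, 0.8]]                 # hypothetical side-output probabilities
print(cgm_gate([0.1, 0.9], seg))               # organ present: output passes through
print(cgm_gate([0.8, 0.2], seg))               # no organ: over-segmented pixels zeroed
```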