Research Article: Dual-Path Attention Compensation U-Net for Stroke Lesion Segmentation


Hindawi

Computational Intelligence and Neuroscience


Volume 2021, Article ID 7552185, 16 pages
https://doi.org/10.1155/2021/7552185

Research Article
Dual-Path Attention Compensation U-Net for Stroke
Lesion Segmentation

Haisheng Hui, Xueying Zhang, Zelin Wu, and Fenlian Li


College of Information and Computer, Taiyuan University of Technology, Taiyuan 030024, China

Correspondence should be addressed to Xueying Zhang; [email protected]

Received 9 July 2021; Accepted 19 August 2021; Published 31 August 2021

Academic Editor: Suresh Manic

Copyright © 2021 Haisheng Hui et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

For the segmentation of stroke lesions, the attention U-Net model based on the self-attention mechanism can suppress irrelevant regions in an input image while highlighting salient features useful for specific tasks. However, when the lesion is small and its contour is blurred, the attention U-Net may generate wrong attention coefficient maps, leading to incorrect segmentation results. To cope with this issue, we propose a dual-path attention compensation U-Net (DPAC-UNet), which consists of a primary path network and an auxiliary path network. Both are attention U-Net models and identical in structure. The primary path network is the core network that performs accurate lesion segmentation and outputs the final segmentation result. The auxiliary path network generates auxiliary attention compensation coefficients and sends them to the primary path network to compensate for and correct possible attention coefficient errors. To realize the compensation mechanism of DPAC-UNet, we propose a weighted binary cross-entropy Tversky (WBCE-Tversky) loss to train the primary path network for accurate segmentation, and another compound loss function, called the tolerance loss, to train the auxiliary path network to generate auxiliary compensation attention coefficient maps with an expanded coverage area for the compensation operation. We conducted segmentation experiments on the 239 MRI scans of the Anatomical Tracings of Lesions After Stroke (ATLAS) dataset to evaluate the performance and effectiveness of our method. The experimental results show that the DSC score of the proposed DPAC-UNet is 6% higher than that of the single-path attention U-Net and also higher than the scores of existing segmentation methods in the related literature. Our method therefore demonstrates strong capability for stroke lesion segmentation.

1. Introduction

Recent global statistics on the incidence of stroke demonstrate that there are up to 10.3 million new cases annually [1]. Stroke has become one of the top three lethal diseases, besides chronic diseases. When a stroke occurs, accurate diagnosis of its severity and timely thrombolytic therapy can effectively improve blood supply in the ischemic area and significantly reduce the risk of disability or even death. Therefore, it is clinically significant to quickly and accurately locate and segment stroke lesions [2]. Since manual segmentation relies on the doctor's professional experience and medical skills, individual subjectivity can reduce segmentation accuracy. Furthermore, manual segmentation of a stroke lesion is time-consuming: it may take a skilled tracer several hours to complete accurate labeling and rechecking of a single large, complex lesion on magnetic resonance imaging (MRI) [3].

This situation changed with the advent of the convolutional neural network (CNN) [4] and its continuously evolving network structures, such as the fully convolutional network (FCN) [5] and SegNet [6], which have achieved success in image segmentation, especially medical image segmentation [7].

However, CNN-based segmentation networks require a large amount of labeled medical data for training, and such data are limited by the high cost of acquisition and accurate labeling [8]. The multilevel U-shaped network (U-Net) [9], a CNN consisting of a contraction path and an expansion path, mitigates the need for huge amounts of labeled data. The U-Net structure and its improved variants, such as the attention U-Net [10], U-Net++ [11], and R2U-Net [12], have been applied successfully in medical segmentation tasks such as skin cancer [13], brain tumor [14], colorectal tumor [15], liver [16], colon histology [17], kidney [18], and vascular boundary [19] segmentation. The U-Net network has thousands of feature channels, and the standard five-level U-Net model in particular has an enormous number of parameters to be trained. During training, the contraction path (encoder) and expansion path (decoder) must repeatedly extract deep-scale features, which in the standard U-Net are abstract, low-resolution features; this increases the training difficulty and makes training unstable and inadequate.

To reduce the training difficulty caused by repeated extraction of deep-scale features and to improve segmentation accuracy, many researchers have employed a two-step method that first locates the lesion and then segments the target area [20, 21]. However, these methods introduce additional positioning operations and cannot achieve end-to-end training. Schlemper et al. introduced a self-attention mechanism and proposed an attention U-Net with an attention gate (AG) [10] to avoid such additional operations. The self-attention mechanism reduces the dependence on external information obtained from additional steps by utilizing the correlation of feature signals from different scales; it captures the internal correlation of features and focuses attention on the target area. The attention U-Net uses the AG to generate a 2D attention coefficient map that suppresses irrelevant regions in an input image while highlighting salient features useful for the task. The AG module can be integrated into the standard U-Net model for end-to-end learning without additional pretraining steps, and compared with the standard U-Net, the number of training parameters increases only slightly through the additional AG computations. The built-in self-attention module thus eliminates additional target location operations and achieves the goals of reducing training difficulty, improving training efficiency, and improving segmentation performance.

However, the self-attention mechanism based on correlation operations has a deficiency. The attention coefficient α that constrains the area of interest is generated from the current-scale feature signal x and the rougher-scale feature signal g derived from x, which poses a potential risk: a small lesion with nondistinct features may cause the current-level feature signal x to learn the lesion feature inadequately. Consequently, the attention area deviates from the lesion area owing to wrong or insufficient attention coefficient learning, leading to incorrect segmentation results.

To solve the problem of the attention area deviating from the lesion area, we propose a dual-path attention compensation U-Net (DPAC-UNet), composed of a primary path network (primary network) and an auxiliary path network (auxiliary network). Both are attention U-Net segmentation models based on the self-attention mechanism and identical in structure. The primary network is the core part of DPAC-UNet; it performs lesion segmentation and outputs the final segmentation result. The auxiliary network generates an auxiliary attention compensation coefficient map that is sent to the primary network to compensate for possible attention coefficient learning errors. The auxiliary network realizes its compensation ability by focusing on a larger area than the actual lesion, which increases the coverage of the attention coefficient map it generates. An attention coefficient map with such a larger attention area is defined as a tolerant attention coefficient map and is used as an auxiliary compensation attention coefficient to correct possible errors in the primary network's attention coefficient map. To study our lesion segmentation network, we use the ATLAS dataset [3], consisting of 239 T1-weighted subacute and chronic stroke MRI scans released in 2018.

The main contributions of this article are summarized as follows:

(1) We proposed a DPAC-UNet that uses the auxiliary network to generate an attention coefficient map with a larger area to compensate for possible defects of the primary network's attention coefficient map.

(2) We proposed the WBCE-Tversky loss and the tolerance loss to train the primary and auxiliary networks of the DPAC-UNet, respectively, and explored the optimal hyperparameter configurations of the two proposed loss functions.

The remainder of this work is organized as follows. In Section 2.1, we describe the network structure of the DPAC-UNet and how the auxiliary network compensates for attention in the primary network. Section 2.2 proposes the two compound loss functions, the WBCE-Tversky loss and the tolerance loss; there, we also conduct experiments on the effect of different hyperparameter values of the loss functions on segmentation performance and list the steps for selecting their optimal hyperparameter configurations. In Section 3, we train the DPAC-UNet with the WBCE-Tversky and tolerance losses under the optimal hyperparameter configurations and present a visualization example to further demonstrate the effectiveness of the DPAC-UNet; we also discuss the time consumption of the primary and auxiliary networks and apply the auxiliary network's compensation mechanism to other segmentation models with self-attention mechanisms.

2. Materials and Methods

2.1. DPAC-UNet. The attention U-Net introduces several attention gates (AG) to generate attention coefficient maps that suppress irrelevant regions in an input image while highlighting salient features, improving segmentation performance without introducing additional positioning operations. However, it sometimes makes mistakes. A small lesion with indistinct features is difficult to distinguish from the surrounding healthy tissues, leading to the current-scale feature signal x of a certain layer not learning the lesion feature well.

As a result, the attention coefficient generated from x and its derived rougher feature g will deviate from the lesion area, and the wrong attention coefficient makes the AG output a wrong feature signal, which affects the segmentation results. If the attention U-Net finds the correct lesion in the AG module, it emphasizes the relevant area and suppresses unrelated areas, improving segmentation performance; conversely, if the lesion location is not found in the AG or is wrong, the effect is diametrically opposite and segmentation performance degrades. To cope with these issues, we propose the DPAC-UNet, using the attention U-Net as the basic segmentation model.

2.1.1. Overview of the Structure. The schematic of DPAC-UNet is presented in Figure 1. We used two identical attention U-Net models as the primary and auxiliary segmentation models, corresponding to the upper and lower halves of Figure 1, respectively. The WBCE-Tversky loss trains the primary network for accurate segmentation. The auxiliary network is trained by the tolerance loss to generate a tolerant auxiliary compensation attention coefficient map that compensates for defects in the attention coefficient map of the primary network. The details of the two loss functions are described in Section 2.2. As presented in Figure 1, the auxiliary network passes the auxiliary compensation attention coefficient to the primary network along the vertical dark red arrow from the AG marked (II) to the AG marked (I), performing the compensation operation. We selected only the second-level AGs of the primary and auxiliary networks for the additive compensation operation, for two reasons. First, the resolution of the attention coefficient maps generated by the two bottom AGs (13 × 11 and 26 × 22) is too low: at these scales, a difference of a single pixel between the attention maps of the two networks is amplified, because the receptive field affected by a single pixel at a deep level is very large, so a compensation operation there would have an outsized impact on the primary network and generate significant attention fluctuations. Second, the first-level AG, which is close to the uppermost layer's output, does not perform the auxiliary attention compensation operation because its feature map is too close to the output and would directly affect the segmentation result. In summary, we selected only the second-level AG to implement the compensation operation, so as to effectively compensate for a defective attention coefficient map of the primary network while not directly affecting the accuracy of the primary network's lesion segmentation.

Figure 2 presents the schematic of the second-level AGs of the primary and auxiliary networks. The AGs of the first, third, and fourth levels, shown in Figure 1, are not involved in the auxiliary attention coefficient compensation operation and are identical in structure to the AG in the literature [10]. The AG marked (II) in the lower half of Figure 1 is the second-level AG of the auxiliary network's attention U-Net, and its detailed structure is shown in Figure 2(a). In Figure 2(a), ① and ② are the inputs of the auxiliary network AG and ④ is the output of the current level for the skip connection (SC), where l is the level number of the current AG (here l = 2) and the feature signals x_i^l and g_i^l correspond to the inputs labeled ① and ②. The feature signals g_i^l ∈ R^{F_g} and x_i^l ∈ R^{F_x} are sent to the AG block to generate the attention coefficient α^l through the additive attention generation operation, determining the area to focus on, where i is the pixel index, F_x is the number of feature channels of the input feature signal x^l at the current level, and F_g is the number of feature channels of the input feature signal g^l at the rougher level. Once the additive attention coefficient map α^l is generated from x_i^l and g_i^l, the feature signal x_i^l is multiplied by α^l, used as the output of the AG, and sent to the decoding path through the SC at the current level. The additive attention coefficient α^l marked ③ is the auxiliary compensation attention coefficient map, sent to the AG marked (I) at the same level and position of the primary network in the upper half of Figure 1. The equations for generating the attention coefficient of the auxiliary network are as follows:

    q_{att}^{l} = W_{\psi}^{T}\left(\sigma_{1}\left(W_{x}^{T} x_{i}^{l} + W_{g}^{T} g_{i}^{l} + b_{g}\right)\right) + b_{\psi},    (1)

    \alpha_{i}^{l} = \sigma_{2}\left(q_{att}^{l}\left(x_{i}^{l}, g_{i}^{l}; \Theta_{att}\right)\right),    (2)

    \left(\alpha_{i}^{l}\right)_{rs} = \operatorname{resample}\left(\alpha_{i}^{l}\right),    (3)

    \hat{x}_{i}^{l} = x_{i}^{l} \cdot \left(\alpha_{i}^{l}\right)_{rs}.    (4)

As presented in Figure 2(a), since the spatial resolutions and feature channel dimensions of g_i^l and x_i^l are inconsistent, an upsampling operation changes the spatial resolution of g_i^l to match that of x_i^l, and the linear transformations W_g ∈ R^{F_g × F_int} and W_x ∈ R^{F_x × F_int} make the numbers of feature channels of the two signals the same, where b_g ∈ R^{F_int} and b_ψ ∈ R denote the biases of the two linear transformations. In (1), σ_1 is the ReLU activation function, and its output is linearly transformed by W_ψ^T ∈ R^{1 × F_int}, forming an attention coefficient matrix with a single feature channel. In (2), the sigmoid activation function σ_2 converts the attention coefficient matrix into a gridded attention coefficient map α_i^l that acts on x_i^l: α_i^l is resampled, and the resampled result is multiplied by x_i^l to generate the AG output feature signal x̂_i^l. Figure 2(b) presents the block diagram of the AG marked (I) in the upper half of Figure 1, in which the auxiliary compensation attention coefficient map compensates the primary network. Its structure and signal operation equations are almost identical to those of the auxiliary network in Figure 2(a); the difference is that, when generating the final attention coefficient map, the auxiliary compensation attention coefficient map generated by the auxiliary network AG (marked ③) is additively fused with the original attention coefficient map that the primary network AG generates from its inputs ① and ②.
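To make (1)–(4) and the additive compensation concrete, the following PyTorch-style sketch shows one way such an AG could be implemented. It is a minimal sketch under our reading of the equations, not the authors' released code: the class and argument names and the bilinear resampling are our choices, and the detach() on the auxiliary map mirrors the statement in Section 3.2 that the compensation does not participate in the primary network's backpropagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionGate(nn.Module):
    """Additive attention gate (AG) implementing (1)-(4), with an optional
    auxiliary compensation map for the primary network's second-level AG."""

    def __init__(self, f_x: int, f_g: int, f_int: int):
        super().__init__()
        self.w_x = nn.Conv2d(f_x, f_int, kernel_size=1, bias=False)  # W_x
        self.w_g = nn.Conv2d(f_g, f_int, kernel_size=1, bias=True)   # W_g, bias b_g
        self.w_psi = nn.Conv2d(f_int, 1, kernel_size=1, bias=True)   # W_psi, bias b_psi

    def forward(self, x, g, aux_alpha=None):
        # Bring the coarser-scale signal g to x's spatial resolution.
        g_up = F.interpolate(g, size=x.shape[2:], mode="bilinear",
                             align_corners=False)
        # Eq. (1): q_att = W_psi^T sigma1(W_x^T x + W_g^T g + b_g) + b_psi.
        q_att = self.w_psi(F.relu(self.w_x(x) + self.w_g(g_up)))
        # Eq. (2): sigma2 (sigmoid) turns q_att into the coefficient map alpha.
        alpha = torch.sigmoid(q_att)
        # Additive compensation (primary network only): fuse the auxiliary
        # network's tolerant map; detach() keeps it out of this path's gradients.
        if aux_alpha is not None:
            alpha = alpha + aux_alpha.detach()
        # Eqs. (3)-(4): resample alpha onto x's grid and gate x with it.
        alpha_rs = F.interpolate(alpha, size=x.shape[2:], mode="bilinear",
                                 align_corners=False)
        return x * alpha_rs, alpha
```

Returning alpha alongside the gated features lets the auxiliary network hand its level-2 coefficient map to the primary network, as in Figure 1.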

Figure 1: Schematic of DPAC-UNet. (Upper half: primary path network; lower half: auxiliary path network. The input image is 208 × 176, and the encoder/decoder feature maps at successive levels have resolutions 104 × 88, 52 × 44, 26 × 22, and 13 × 11. The legend distinguishes the primary path AG (I), the auxiliary path AG (II), normal AGs, 3 × 3 convolution + BN + ReLU, 3 × 3 deconvolution, max pooling, concatenation, and the 1 × 1 convolution + sigmoid output layer.)
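Following the Figure 1 schematic, the two paths could then be wired as in the sketch below. This composition is hypothetical: it assumes each attention U-Net returns its segmentation map together with its second-level attention coefficient map, and the factory function make_attention_unet is our own name.

```python
import torch.nn as nn

class DPACUNet(nn.Module):
    """Dual-path wrapper: two structurally identical attention U-Nets, with the
    auxiliary path's second-level attention map compensating the primary path."""

    def __init__(self, make_attention_unet):
        super().__init__()
        # Both paths share one architecture; they are trained with different
        # losses (WBCE-Tversky for the primary, tolerance loss for the auxiliary).
        self.primary = make_attention_unet()
        self.auxiliary = make_attention_unet()

    def forward(self, img):
        # Auxiliary pass first: it yields its own segmentation plus the
        # tolerant level-2 attention coefficient map (AG marked (II)).
        aux_seg, aux_alpha2 = self.auxiliary(img)
        # The primary pass receives that map and fuses it additively inside
        # its level-2 AG (AG marked (I)); only pri_seg is the final result.
        pri_seg, _ = self.primary(img, aux_alpha2=aux_alpha2)
        return pri_seg, aux_seg
```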


Figure 2: (a) Schematic of the AG structure of the auxiliary network, (b) schematic of the AG structure of the primary network, and (c) the definition of the operation symbols and the dimensional changes of the input and output feature signals. (At the second level, for example, the coarser feature of shape (26, 22, 512) is mapped by a 1 × 1 convolution and upsampling to (52, 44, 256), the current feature is mapped by a 1 × 1 convolution to (52, 44, 256), and the single-channel attention coefficient map of shape (52, 44, 1) is repeated elementwise to (52, 44, 256) before the elementwise product.)

According to (3) and (4), the output feature signal ④ of the primary network AG is then generated. Figure 2(c) presents the definitions of the operation symbols and the dimension changes of the input and output feature signals in Figures 2(a) and 2(b).

2.1.2. Compensation Mechanism of the Auxiliary Network. The traditional single-path self-attention model generates a spatial attention coefficient map in the AG to cover the lesion area of the features, so that more attention is paid to the lesion area and segmentation performance improves. Our proposed method builds an auxiliary network that generates an auxiliary attention coefficient map with a larger coverage area to compensate the segmentation network (the primary network), improving the rate at which its spatial attention coefficient map completely covers the lesion. It should be noted that the attention compensation map does not deviate from the original attention area of the primary network; it is constrained to enlarge the attention area around it. This compensation mechanism is especially effective when the lesion feature is indistinct, the lesion's outline is unclear, or the segmentation model cannot generate the correct region of interest.

A qualitative comparison of using the primary network individually versus combining it with the auxiliary network follows. When DPAC-UNet uses the auxiliary network to compensate the primary network, there are three situations:

Situation 1. (1) Primary network individually: the focus area of the attention coefficient map of the single-path network is only partially correct (Figure 3(a), ①), which leads to reduced segmentation performance. (2) Combined with the auxiliary network: after the auxiliary network compensates the primary network's attention coefficient map with a larger focus area through additive compensation, the compensated attention coefficient map may become correct (Figure 3(a), ②) or remain unchanged (Figure 3(a), ③), which eventually improves or maintains the segmentation performance.

Situation 2. (1) Primary network individually: the focus area of the attention coefficient of the primary network is completely correct, which generates correct segmentation results (Figure 3(b), ①). (2) Combined with the auxiliary network: although the auxiliary network compensates it with a larger attention coefficient map, after the additive compensation operation the value of the originally correct focus area becomes larger while the values of other areas remain smaller than that of the correct area (Figure 3(b), ②). Therefore, the primary network of DPAC-UNet still pays the highest attention to the correct area, and the segmentation performance is unchanged.

Situation 3. (1) Primary network individually: the focus area of the primary network attention coefficient is completely wrong (Figure 3(c), ①), which leads to reduced segmentation performance. (2) Combined with the auxiliary network: the larger auxiliary attention coefficient compensation map generated by the auxiliary network covers a larger area, and the compensated attention coefficient map may remain wrong (Figure 3(c), ②), become partially correct (Figure 3(c), ③), or become completely correct (Figure 3(c), ④). Correspondingly, the segmentation performance remains unchanged, improves to some extent, or improves significantly.

Combining the three situations above, the overall average segmentation performance on the whole dataset is improved. It can also be seen from Figure 3 that the attention coefficient map generated by the auxiliary network does not deviate from the attention coefficient map area generated by the primary network.

2.2. Loss Functions of DPAC-UNet. We propose two different compound loss functions to train the primary and auxiliary networks. First, we propose the WBCE-Tversky loss for the primary network, to generate an attention coefficient map focused on the target area and an accurate segmentation result. Second, we propose the tolerance loss for the auxiliary network, to generate an auxiliary compensation attention coefficient map with a larger coverage area that compensates the primary network. It is called a tolerance loss because it generates an attention coefficient map that covers a larger area without deviating from the lesion area, which means a higher fault tolerance for attention errors.

2.2.1. WBCE-Tversky Loss. The Tversky loss [22], proposed to address data imbalance in medical image segmentation, is introduced as one component of our WBCE-Tversky loss:

    T_{loss}(\alpha, \beta) = 1 - \frac{\sum_{i=1}^{N} p_{1i} \cdot g_{1i}}{\sum_{i=1}^{N} p_{1i} \cdot g_{1i} + \alpha \sum_{i=1}^{N} p_{1i} \cdot g_{0i} + \beta \sum_{i=1}^{N} p_{0i} \cdot g_{1i}},    (5)

where p_{1i} denotes the predicted probability that voxel i is a lesion and p_{0i} the opposite, and g_{1i} denotes the ground-truth indicator that voxel i is a lesion and g_{0i} the opposite. The Tversky loss achieves a trade-off between false positives (FP) and false negatives (FN) through the values of its hyperparameters α and β, where α + β = 1. A higher β implies that the trained model's recall is given greater weight than its precision, so the network pays more attention to FN. Often, the volume of the lesion is significantly smaller than that of healthy tissue; for example, in the 239 MRI scans of the ATLAS dataset, the voxel ratio of lesion to background is about 3 : 1000. This high ratio of nonlesion to lesion voxels makes the segmentation network prone to focusing on the nonlesion area, predicting lesions as nonlesions and increasing the FN in the predicted results. To solve this problem, we increased the hyperparameter β of the Tversky loss; a larger β gives greater weight to recall than to precision by placing more emphasis on FN.
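A minimal PyTorch sketch of the Tversky loss in (5) follows; the function name, the flattening of inputs, and the small eps term added for numerical stability are our own choices rather than details from the paper.

```python
import torch

def tversky_loss(pred: torch.Tensor, target: torch.Tensor,
                 beta: float = 0.8, eps: float = 1e-6) -> torch.Tensor:
    """Tversky loss of (5); pred holds p1 (lesion probabilities), target g1 (0/1)."""
    alpha = 1.0 - beta                 # the paper constrains alpha + beta = 1
    p1 = pred.flatten()
    g1 = target.flatten().float()
    tp = (p1 * g1).sum()               # overlap term
    fp = (p1 * (1.0 - g1)).sum()       # false positives, weighted by alpha
    fn = ((1.0 - p1) * g1).sum()       # false negatives, weighted by beta
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)
```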

Figure 3: Qualitative analysis of the compensation mechanism of the auxiliary network. (Panels (a)–(c) show, for each situation, the primary network used individually (①) and combined with the auxiliary network (②–④); the legend marks the real lesion, the attention coefficient map of the primary network, and the attention coefficient compensation map of the auxiliary network.)

We assume that using a higher β in our compound loss function during training will lead to better generalization and improved performance on the imbalanced dataset, so we use the Tversky loss with a higher β as one part of the WBCE-Tversky loss for training the primary network of DPAC-UNet. Meanwhile, in the tolerance loss we also need a Tversky loss term to constrain the growth of the attention coefficient map and ensure that the larger, more tolerant focus area does not deviate from the lesion area. To compare the segmentation performance of the Tversky loss under different values of the hyperparameter β and to select an appropriate β for the WBCE-Tversky loss and the tolerance loss, we used the Tversky loss to train the basic segmentation model, the attention U-Net. The hyperparameter β of the Tversky loss ranges from 0.5 to 0.95 with an interval of 0.05. We conducted the experiment using sixfold cross-validation, which is often used to train models whose hyperparameters need to be optimized. We split the 239 stroke MRI scans into training, validation, and test sets by sixfold cross-validation according to Figure 4.

Figure 4: Schematic of sixfold cross-validation.
First, in each fold we divided the data into training and test sets at a ratio of about 5 : 1 (199 : 40), ensuring that no MRI scan is repeated across the test sets. Second, we further split the training set of each fold into inner training and validation sets at a ratio of about 4 : 1 (160 : 39); the validation set is used to select the best-performing model trained on the training set. Moreover, we ensured that the training, validation, and test sets of each fold have the same lesion volume distribution, for the accuracy of the experimental results (see the sketch below). The lesion size distribution of fold 1 is presented in Figure 5.

Figure 5: Distribution of lesion volume in the training, test, and validation sets.
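The nested split just described could be scripted as follows; this is a sketch only: the paper additionally stratifies the three sets by lesion volume, which is reduced here to seeded shuffling, and all names are ours.

```python
import random

def sixfold_nested_split(scan_ids, seed=0):
    """Yield (inner_train, validation, test) id lists per fold: ~5:1 outer split
    (199:40) with nonrepeating test sets, then ~4:1 inner split (160:39)."""
    ids = list(scan_ids)
    random.Random(seed).shuffle(ids)        # the paper stratifies by lesion volume
    folds = [ids[i::6] for i in range(6)]   # six nonoverlapping test sets, ~40 each
    for k in range(6):
        test = folds[k]
        train_full = [i for i in ids if i not in set(test)]   # ~199 scans
        n_val = len(train_full) // 5                          # ~4:1 inner split
        yield train_full[n_val:], train_full[:n_val], test

# Example over the 239 ATLAS scans:
splits = list(sixfold_nested_split(range(239)))
```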
Computational Intelligence and Neuroscience 7

The experimental configuration and results of training the attention U-Net with the Tversky loss under different hyperparameter β values are presented in Table 1. We used 10 different β values, performed sixfold cross-validation for each, and computed the average metric scores over the results of all test sets. We used the dice similarity coefficient (DSC), F2 score (F2), precision (PRE), and recall (RE) as the model evaluation metrics: DSC is a widely used metric for evaluating model performance; the F2 score is often used to evaluate models on imbalanced data; PRE quantifies how many positive class predictions actually belong to the positive class; RE quantifies how many of all positive examples in the dataset are predicted as positive.

Table 1: Experimental results when using the Tversky loss with different β values to train the attention U-Net.

Weights              DSC (%)  F2 (%)  PRE (%)  RE (%)
α = 0.50, β = 0.50   49.9     46.4    64.3     45.0
α = 0.45, β = 0.55   50.8     48.6    62.8     47.5
α = 0.40, β = 0.60   51.1     52.5    58.0     51.1
α = 0.35, β = 0.65   50.9     52.1    57.8     53.7
α = 0.30, β = 0.70   51.5     52.6    59.5     54.8
α = 0.25, β = 0.75   52.0     51.5    61.3     52.5
α = 0.20, β = 0.80   52.7     55.4    56.7     58.3
α = 0.15, β = 0.85   50.5     52.5    53.4     55.5
α = 0.10, β = 0.90   50.2     52.7    53.2     56.5
α = 0.05, β = 0.95   51.6     55.0    53.5     59.4

As presented in Table 1, the maximum RE is obtained at the largest β of 0.95, and the maximum PRE at the smallest β of 0.05. The DSC and F2 scores reach their maximum at β = 0.80, where a trade-off between PRE and RE is struck, indicating that, for the imbalanced ATLAS dataset, training with the Tversky loss at β = 0.80 improves segmentation accuracy. We need a loss function that trains the primary network of the DPAC-UNet to segment accurately; selecting the hyperparameter β of the Tversky loss for the basic segmentation model handles the imbalanced dataset and reduces the tendency of lesions to be classified as nonlesion, and as presented in Table 1, β = 0.80 achieves the highest segmentation performance for the attention U-Net on the ATLAS dataset. However, as (5) shows, a small denominator in the Tversky loss causes instability in backpropagation and derivation. To solve this problem, we introduced the WBCE loss [23] as the other part of the WBCE-Tversky loss: on the one hand, it avoids the backpropagation and gradient instability caused by small denominators in the Tversky loss; on the other hand, by giving greater weight to the minority class, it adapts to the imbalance of the dataset and further improves overall segmentation performance. The WBCE loss is also differentiable, which simplifies optimization.
The proposed WBCE-Tversky loss, presented in (8), is a compound of the Tversky loss (β = 0.80) and the WBCE loss, whose respective equations are:

    \text{WBCE}_{loss} = -\frac{1}{N} \sum_{n=1}^{N} \left[ w\, g_{n} \log\left(p_{n}\right) + \left(1 - g_{n}\right) \log\left(1 - p_{n}\right) \right],    (6)

    w = \frac{N}{\text{smooth} + \sum_{n} g_{n}},    (7)

    \text{WBCE-Tversky} = \text{WBCE}_{loss} + T_{loss}(\beta = 0.8).    (8)
The WBCE loss adds a weight w to the standard BCE loss to give lesion pixels more importance, with a higher training weight when the lesion area is small, thereby improving segmentation performance on unbalanced datasets. As presented in (6), the main part of the WBCE loss is the same as the BCE loss [23]; the only difference is that we modified the calculation of the weight w as presented in (7), taking the reciprocal of the proportion of lesion pixels among all pixels as the weight, where N denotes the number of pixels in the entire image to be segmented, Σ_n g_n is the number of lesion pixels, and smooth = 1 prevents division-by-zero errors.
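Combining (6)–(8), a sketch of the compound loss could look as follows, reusing the tversky_loss sketch above; the clamping of probabilities before the logarithms is our addition for numerical safety.

```python
import torch

def wbce_tversky_loss(pred: torch.Tensor, target: torch.Tensor,
                      beta: float = 0.8, smooth: float = 1.0) -> torch.Tensor:
    """WBCE-Tversky loss of (8): weighted BCE of (6)-(7) plus Tversky loss (5)."""
    p = pred.flatten().clamp(1e-7, 1.0 - 1e-7)   # avoid log(0)
    g = target.flatten().float()
    n = g.numel()
    w = n / (smooth + g.sum())                   # eq. (7): inverse lesion fraction
    wbce = -(w * g * torch.log(p) + (1.0 - g) * torch.log(1.0 - p)).mean()
    return wbce + tversky_loss(pred, target, beta=beta)
```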

To test and verify the proposed WBCE-Tversky loss, we conducted a series of comparative experiments using the WBCE loss, the Tversky loss with different hyperparameter β values, and the WBCE-Tversky loss with different β values. The model, datasets, and experimental conditions are the same as in the experiments of Table 1. The experiment configurations and results are presented in Table 2. As can be seen from Table 2, for the same hyperparameter β, the DSC and F2 scores of the WBCE-Tversky loss are better than those of the Tversky loss, and the WBCE-Tversky loss also performs best at β = 0.80. Compared with the WBCE loss alone, the segmentation accuracy improved significantly: the DSC score improved by 6.5% and the F2 score by 12.5%. In summary, on the imbalanced ATLAS dataset, the WBCE-Tversky loss with β = 0.80 trains the attention U-Net to the best segmentation performance; we therefore used the WBCE-Tversky loss with β = 0.80 as the loss function of the DPAC-UNet's primary network for accurate lesion segmentation.

Table 2: Comparing the segmentation performance of the WBCE-Tversky loss under different hyperparameter configurations.

Loss functions   Weights              DSC (%)  F2 (%)  PRE (%)  RE (%)  FPR (%)
WBCE only        None                 46.7     43.1    62.3     41.6    0.08
Tversky only     α = 0.50, β = 0.50   49.9     46.4    64.3     45.0    0.06
WBCE-Tversky     α = 0.50, β = 0.50   51.5     49.5    63.2     49.5    0.10
Tversky only     α = 0.40, β = 0.60   51.1     52.5    58.0     51.1    0.14
WBCE-Tversky     α = 0.40, β = 0.60   52.1     51.5    59.6     52.0    0.10
Tversky only     α = 0.30, β = 0.70   51.5     52.6    59.5     54.8    0.14
WBCE-Tversky     α = 0.30, β = 0.70   51.9     50.4    62.2     50.3    0.10
Tversky only     α = 0.20, β = 0.80   52.7     55.4    56.7     58.3    0.16
WBCE-Tversky     α = 0.20, β = 0.80   53.2     55.6    62.6     56.2    0.12
Tversky only     α = 0.10, β = 0.90   50.2     52.7    53.2     56.5    0.20
WBCE-Tversky     α = 0.10, β = 0.90   51.5     51.6    57.7     53.1    0.14

2.2.2. Tolerance Loss. When the focus area is larger than the actual lesion area, the FP of the model's segmentation result increases. FP and FPR are proportional, implying that we can indirectly measure the tolerant degree of the attention coefficient map through the FPR; we therefore use the FPR value as the indicator of how tolerant the attention coefficient generated by the auxiliary network is. To provide the primary network with a more tolerant auxiliary compensation attention coefficient map with a much larger coverage area, we propose the tolerance loss, which introduces a specificity reducing item combined with the Tversky loss. It is called the tolerance loss because the compound loss function's training goal is an attention coefficient map with high tolerance. The proposed tolerance loss is presented in (11), where S_loss(λ, δ) denotes the specificity reducing item presented in (10). The specificity reducing item is based on adjusting the specificity, which measures the proportion of negatives that are correctly identified, as presented in (9):

    \text{specificity} = \frac{TN}{TN + FP},    (9)

    S_{loss}(\lambda, \delta) = \lambda\left(\frac{\sum_{i=1}^{N} p_{0i} \cdot g_{0i}}{\sum_{i=1}^{N} p_{0i} \cdot g_{0i} + \sum_{i=1}^{N} p_{1i} \cdot g_{0i}} - \delta\right)^{2},    (10)

    \text{Tol}_{loss} = S_{loss}(\lambda, \delta) + T_{loss}^{2}(\beta = 0.8).    (11)
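A sketch composing (9)–(11) is given below, again reusing the tversky_loss sketch; squaring the Tversky term follows our reading of (11) and of the statement in the next paragraph that both the specificity reducing item and the Tversky term are squared.

```python
import torch

def tolerance_loss(pred: torch.Tensor, target: torch.Tensor,
                   lam: float = 4.0, delta: float = 0.7,
                   beta: float = 0.8, eps: float = 1e-6) -> torch.Tensor:
    """Tolerance loss of (11): specificity reducing item (10) + squared Tversky."""
    p1 = pred.flatten()
    g1 = target.flatten().float()
    p0, g0 = 1.0 - p1, 1.0 - g1
    # Soft specificity of (9): TN / (TN + FP) over predicted probabilities.
    tn = (p0 * g0).sum()
    fp = (p1 * g0).sum()
    specificity = tn / (tn + fp + eps)
    # Eq. (10): drive specificity toward delta < 1, i.e., raise the FPR and
    # enlarge the attention coverage area; lam weights the item.
    s_loss = lam * (specificity - delta) ** 2
    # Eq. (11): the Tversky term keeps the enlarged focus area on the lesion.
    return s_loss + tversky_loss(pred, target, beta=beta) ** 2
```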


Generally, the nonlesion voxels in an imbalanced dataset occupy a large part of the total area; on the ATLAS dataset, for example, the specificity of segmentation results reaches as high as 95%. Since FPR = 1 − specificity, the larger the proportion of nonlesions identified as nonlesions, the smaller the FPR, and the less tolerant the auxiliary compensation attention coefficient map. Therefore, we introduce the specificity reducing item to reduce the specificity of the segmentation results, increase the FPR of the auxiliary network's training results, and thus increase the coverage area of its attention coefficient map. As presented in (10) and (11), the hyperparameters λ and δ control the weight of the specificity reducing item in the tolerance loss. We squared both the specificity reducing item and the Tversky term to balance the equation and make backward derivation and backpropagation easier.

In (10), the specificity reducing item is the square of the difference between the specificity expression and δ. Since the training goal of any loss function is to make its value as small as possible, the training goal of (10) is to drive its value to 0, meaning that the specificity approaches the hyperparameter δ; setting a reasonable δ therefore controls the specificity to the desired degree. The smaller the δ, the smaller the specificity obtained by network training, and since FPR = 1 − specificity, the smaller the specificity, the larger the FPR, so the resulting attention coefficient map is more tolerant, with a larger coverage area. We set the hyperparameter δ of our tolerance loss to 0.6, 0.7, 0.8, or 0.9, and the other hyperparameter λ to 1, 2, 3, 4, or 5 to adjust the contribution of the specificity reducing item. The value of β is fixed at 0.8 according to the conclusion of Section 2.2.1. The experiment results are presented in Table 3.

Table 3: FPR values of the tolerance loss under different hyperparameter configurations.

Loss functions           Weights          DSC (%)  F2 (%)  PRE (%)  RE (%)  FPR (%)
Tolerance loss, β = 0.8  λ = 1, δ = 0.9   45.9     55.2    38.5     67.7    0.44
                         λ = 1, δ = 0.8   44.6     55.7    36.7     71.6    0.57
                         λ = 1, δ = 0.7   40.7     51.2    33.1     67.2    0.61
                         λ = 1, δ = 0.6   30.2     44.1    21.0     77.1    1.27
                         λ = 2, δ = 0.9   45.2     55.4    36.3     70.6    0.51
                         λ = 2, δ = 0.8   32.2     45.3    23.0     72.6    1.09
                         λ = 2, δ = 0.7   30.4     44.6    20.6     70.9    1.34
                         λ = 2, δ = 0.6   14.8     26.0    8.9      83.5    4.44
                         λ = 3, δ = 0.9   36.1     48.0    27.4     70.2    0.74
                         λ = 3, δ = 0.8   22.1     35.1    14.1     74.1    2.14
                         λ = 3, δ = 0.7   22.8     36.1    14.8     79.4    2.01
                         λ = 3, δ = 0.6   11.8     18.8    7.6      83.5    4.57
                         λ = 4, δ = 0.9   39.4     50.9    31.4     68.9    0.69
                         λ = 4, δ = 0.8   23.9     37.6    15.6     74.2    1.89
                         λ = 4, δ = 0.7   14.9     25.4    9.2      80.6    4.09
                         λ = 4, δ = 0.6   7.9      11.2    5.7      82.8    4.99
                         λ = 5, δ = 0.9   34.4     47.5    24.9     72.4    0.90
                         λ = 5, δ = 0.8   20.7     33.5    13.3     82.3    2.74
                         λ = 5, δ = 0.7   13.2     24.1    7.8      84.1    5.63
                         λ = 5, δ = 0.6   5.7      11.8    3.1      92.8    18.97

Table 3 compares the FPR values generated by the tolerance loss under different hyperparameters λ and δ. Based on (10), when λ = 5 the tolerance loss gives the greatest weight to the specificity reducing item; increasing λ while keeping δ constant produces a higher FPR. Furthermore, the smaller the δ, the smaller the specificity and the higher the FPR. In Table 3, the largest FPR is obtained at λ = 5, δ = 0.6, where it reaches as high as 18.97%. The Tversky term constrains the spatial position and contour shape of the lesion and restricts the growth of the attention coverage area even at high FPR values, rather than letting the FPR of the results increase arbitrarily.

As visual examples, we export the attention coefficient heatmaps of four MRI slices with lesions of different sizes, segmented by the attention U-Net trained with the tolerance loss under 10 different hyperparameter configurations. The attention coefficient heatmaps are generated by the AG marked (II) in the auxiliary network in Figure 1. Note that in the tolerance loss the hyperparameter β = 0.8 is fixed, because the other two parameters are used to adjust the FPR value. Considering that similar FPR values may arise from a smaller λ with a larger δ or from a larger λ with a smaller δ, we sorted the FPR values in Table 3 and evenly selected 10 hyperparameter configurations according to the different FPR values; the attention coefficient heatmaps for these configurations are presented in Figure 6. As the FPR increases, the coverage area of the attention coefficient map gradually increases, yet owing to the restriction of the Tversky term in the tolerance loss, the growing focus area does not deviate from the lesion area. Therefore, when the tolerance loss is used in the auxiliary network of the DPAC-UNet, the primary network receives a compensation attention coefficient covering the correct region irrespective of the increase in FPR and coverage area.

Figure 6: Attention coefficient heatmaps generated by the attention U-Net with different hyperparameters of the tolerance loss. (Rows: four MRI slices with lesion contours, C0009S0009t01A100, C0009S0008t01A68, C0011S0009t01A82, and C0004S0010t01A132. Columns, in order of increasing FPR: Tversky loss β = 0.8 (FPR = 0.222) and tolerance loss β = 0.8 with (λ = 1, δ = 0.9), (λ = 1, δ = 0.7), (λ = 2, δ = 0.7), (λ = 3, δ = 0.8), (λ = 4, δ = 0.7), (λ = 3, δ = 0.6), (λ = 4, δ = 0.6), (λ = 5, δ = 0.7), and (λ = 5, δ = 0.6), with FPR from 0.438 to 18.97.)

However, for the coverage area of the auxiliary compensation attention coefficient map, larger is not always better, meaning the FPR should not be as high as possible. Moderate values of the hyperparameters λ and δ are needed for the best segmentation performance of DPAC-UNet. Therefore, in Section 3 the optimal λ and δ are selected based on the experimental performance of the DPAC-UNet model.

2.2.3. Hyperparameter Selection. For the auxiliary network to generate a larger yet proper attention coefficient map, it must be trained with the proposed tolerance loss, and only when the hyperparameter configuration of the tolerance loss is selected appropriately can the auxiliary network provide moderate compensation to the attention module of the primary network and improve segmentation performance. The selection of the loss function hyperparameter configurations for the primary and auxiliary networks follows two steps:

Step 1. With 0.05 as the interval, from 0.5 to 0.95, use 10 different β values of the Tversky loss to train the single-path attention U-Net model, and take the β value with the best segmentation performance as the selected β of the proposed WBCE-Tversky loss and tolerance loss.

Step 2. To select appropriate δ and λ values for the tolerance loss, such that the auxiliary network provides appropriate attention coefficient map compensation and the entire DPAC-UNet achieves the best segmentation performance, train the primary network with the WBCE-Tversky loss (fixing the β value selected in the first step), set the tolerance loss δ to 0.6, 0.7, 0.8, or 0.9 and λ to 1, 2, 3, 4, or 5 (a total of 20 parameter pairs) to train the auxiliary network, and take the δ and λ pair with the best segmentation performance as the selected values of the proposed tolerance loss.

When our method is applied to other types of medical segmentation datasets or other segmentation models, the hyperparameter configurations of the loss functions differ and must be redetermined, because the hyperparameter selection needs to account for the imbalance of different datasets and the individual differences between the attention maps generated by different models.

3. Experimental Results and Analysis

3.1. Dataset and Training. The ATLAS dataset, which contains 239 MRI scans and focuses on the subacute and chronic stages of stroke, has a 3D resolution high enough for rotation slicing operations. MNI-152 image registration [24], intensity normalization [25], bias field correction [26], and resampling of the MRI scans to 176 × 208 × 176 through cropping and interpolation have been performed to fit our method. We use sixfold cross-validation so that the test sets cover the entire dataset, and we divide the training set of each fold into inner-loop training and validation sets for best-model selection. Since the distribution of lesion sizes is extremely imbalanced in the dataset, it is necessary to ensure that the training, validation, and test sets have similar lesion size distributions.

We use the deep learning framework PyTorch and conduct our experiments on three NVIDIA Tesla T4 GPUs. We train the models for at most 100 epochs and save the best model, taken at the smallest validation set loss. We used the lookahead optimizer [27] for model training; it improves the stability of the optimization process while providing dynamic adjustment of the learning rate and acceleration of the gradient descent. We set the initial learning rate to 1 × 10⁻⁴.
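A sketch of this training setup is shown below. PyTorch itself does not provide a lookahead optimizer, so the sketch uses the Lookahead wrapper from the third-party torch-optimizer package; the Adam base optimizer and the k and alpha values are our assumptions, and only the 100-epoch budget, best-model selection, and initial learning rate of 1 × 10⁻⁴ come from the paper.

```python
import torch
import torch_optimizer  # pip install torch-optimizer; provides a Lookahead wrapper

def train(model, loss_fn, train_loader, val_loader, epochs=100, device="cuda"):
    """Train at most 100 epochs, keeping the weights with the lowest validation
    loss. A single-output model is assumed here."""
    base = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial LR per paper
    opt = torch_optimizer.Lookahead(base, k=5, alpha=0.5)  # k, alpha assumed
    best_val, best_state = float("inf"), None
    for _ in range(epochs):
        model.train()
        for img, mask in train_loader:
            img, mask = img.to(device), mask.to(device)
            opt.zero_grad()
            loss_fn(model(img), mask).backward()
            opt.step()
        model.eval()
        with torch.no_grad():
            val = sum(loss_fn(model(i.to(device)), m.to(device)).item()
                      for i, m in val_loader) / max(len(val_loader), 1)
        if val < best_val:                                 # keep the best model
            best_val = val
            best_state = {k: v.detach().clone()
                          for k, v in model.state_dict().items()}
    model.load_state_dict(best_state)
    return model
```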

The same experimental conditions and environment as in the experiments of Section 2 are used for reproducing the single-path segmentation models, such as U-Net and attention U-Net; we train these single-path models with the WBCE-Tversky loss for accurate segmentation and use their results for comparison with our DPAC-UNet method.

3.2. Experiment and Results. In Section 2.1, we elaborated the principle of the proposed DPAC network structure and, using the attention U-Net as the basic segmentation model of the primary and auxiliary networks, proposed a specific segmentation model, DPAC-UNet. In Section 2.2, we proposed the WBCE-Tversky loss and the tolerance loss to train the primary and auxiliary networks, respectively, and explored and verified the hyperparameter β of the WBCE-Tversky loss through the experiments of Tables 1 and 2, finding that with β = 0.8 the attention U-Net-based primary network trained with the WBCE-Tversky loss achieves the best segmentation performance.

We also explained in Section 2.2 the relationship between the hyperparameters δ and λ and the coverage area of the auxiliary compensation attention coefficient map: the coverage area is proportional to the FPR value, and the FPR value is proportional to λ and inversely proportional to δ. A suitable pair of λ and δ values must be selected to obtain an auxiliary attention coefficient map with a coverage area that enables the DPAC-UNet to achieve the best segmentation performance. Therefore, based on the experiment results of Table 3, we explored the optimal configuration of λ and δ for training the best DPAC-UNet model: we used the tolerance loss (β = 0.8) configured with different hyperparameters λ and δ to train the auxiliary network of DPAC-UNet and the WBCE-Tversky loss (β = 0.8) to train the primary network.

Table 4 presents the corresponding experiment results of the DPAC-UNet trained with the tolerance loss under different hyperparameters. In Table 4, FPR∗ denotes the FPR of the single-path attention U-Net trained with the tolerance loss under the hyperparameter configurations of Table 3. We sorted FPR∗ in ascending order, identified the corresponding tolerance loss hyperparameter configurations, used the tolerance loss with these sorted configurations to train the auxiliary network of the DPAC-UNet and the WBCE-Tversky loss (β = 0.8) to train the primary network, and thus obtained the results of the different DPAC-UNet configurations from which to select the best one.

Observing the relationship between FPR∗ and the segmentation metrics in Table 4, it is evident that as the coverage area of the attention coefficient map generated by the auxiliary network increases (indicated by FPR∗), the DSC and F2 scores of the DPAC-UNet gradually increase; they reach their maximum at λ = 4 and δ = 0.7, and as FPR∗ increases further, the segmentation performance gradually declines. When the coverage area grows significantly with the FPR∗ value, it negatively affects the primary network: as presented in Figure 6, at λ = 5 and δ = 0.6 the FPR∗ reaches its maximum, and the coverage area of the auxiliary compensation attention occupies a quarter of the brain slice. Such a coverage area is too large to effectively constrain the primary network to focus on the correct lesion area, and the attention coefficient map generated by this configuration even interferes with the primary network, so its DSC and F2 scores are negatively affected, as presented in Table 4. The change of FPR∗ is determined by λ and δ together: FPR∗ is proportional to λ and inversely proportional to δ, so the smallest λ with the largest δ generates the smallest FPR∗, and the largest λ with the smallest δ leads to the largest FPR∗. Figure 7 presents a line chart of the segmentation accuracy changing with FPR∗: as FPR∗ increases, the DSC and F2 scores first increase and then decrease. When FPR∗ is small, the coverage area of the corresponding auxiliary attention compensation coefficient map is also small and cannot compensate the primary network adequately; when FPR∗ is too large, it tends to overcompensate. Only when the hyperparameter values, and hence the corresponding FPR∗, are moderate can the DPAC-UNet achieve the best segmentation performance.

Meanwhile, Table 4 shows that the FPR values generated by the DPAC-UNet's primary network are all small, irrespective of the auxiliary network's loss function and the corresponding FPR∗ value. This is because the compensation by the auxiliary compensation attention coefficient map does not directly affect the segmentation result of the primary network: it is an additive compensation from the auxiliary network to the primary network during training, which does not participate in the gradient operation and backpropagation of the primary network and only partially modifies the coverage area of the primary network's attention coefficient map. The primary network still takes accurate segmentation as its training objective and does not generate FP as high as the auxiliary network's despite the enlarged attention area after compensation.

In summary, when the primary network uses the WBCE-Tversky loss with β = 0.8 and the auxiliary network uses the tolerance loss with β = 0.8, λ = 4, and δ = 0.7, our DPAC-UNet achieves the highest segmentation accuracy.

Table 4: The segmentation performance of the DPAC-UNet using different hyperparameter configurations.

Loss functions           Weights              FPR∗ (%)  DSC (%)  F2 (%)  PRE (%)  RE (%)  FPR (%)
Tolerance loss, β = 0.8  1. λ = 1, δ = 0.9    0.438     54.8     54.1    63.6     55.1    0.111
                         2. λ = 2, δ = 0.9    0.508     53.0     52.4    61.5     53.3    0.101
                         3. λ = 1, δ = 0.8    0.573     55.2     54.7    64.4     55.7    0.120
                         4. λ = 1, δ = 0.7    0.607     54.1     54.0    62.2     55.1    0.117
                         5. λ = 4, δ = 0.9    0.689     55.9     56.6    63.0     57.4    0.124
                         6. λ = 3, δ = 0.9    0.743     55.3     56.0    61.1     58.1    0.173
                         7. λ = 5, δ = 0.9    0.898     53.8     54.1    62.8     55.7    0.142
                         8. λ = 2, δ = 0.8    1.091     54.9     55.4    61.2     56.9    0.140
                         9. λ = 1, δ = 0.6    1.270     55.5     55.6    63.7     57.0    0.126
                         10. λ = 2, δ = 0.7   1.335     55.8     55.8    64.8     57.2    0.133
                         11. λ = 4, δ = 0.8   1.888     53.6     53.0    64.4     53.9    0.111
                         12. λ = 3, δ = 0.7   2.006     56.9     57.7    61.6     59.6    0.149
                         13. λ = 3, δ = 0.8   2.143     56.7     57.3    61.9     59.1    0.157
                         14. λ = 5, δ = 0.8   2.744     56.7     56.0    65.8     56.8    0.103
                         15. λ = 4, δ = 0.7   4.093     59.3     59.8    65.6     59.9    0.106
                         16. λ = 2, δ = 0.6   4.440     58.2     58.6    62.6     60.3    0.151
                         17. λ = 3, δ = 0.6   4.570     57.5     57.5    64.0     58.8    0.137
                         18. λ = 4, δ = 0.6   4.990     56.5     56.9    62.5     61.6    0.153
                         19. λ = 5, δ = 0.7   5.634     56.2     57.5    63.0     59.3    0.132
                         20. λ = 5, δ = 0.6   18.970    52.8     51.5    65.9     52.1    0.196

Figure 7: Segmentation performance of DPAC-UNet with the change in FPR∗. (Line chart of the DSC and F2 scores, ranging roughly from 0.52 to 0.59, over the tolerance loss configurations of Table 4 sorted by FPR∗.)

3.3. Visualization Examples. To show the principle of the DPAC-UNet, we give the attention coefficient heatmaps and segmentation results of using the attention U-Net (primary network) individually and of using the DPAC-UNet with the auxiliary network when segmenting an MRI slice, as presented in Figure 8.

Using the primary network individually, as presented in Figure 8(a), ② is the attention coefficient heatmap generated by the second-level AG of the classic attention U-Net; it can be observed that this attention coefficient map has obvious defects: although the lesion's location is correct, the covered portion of the lesion is too small for accurate segmentation. ③ is the segmentation result; comparing ③ with the ground-truth label in ①, there is a large difference between the segmentation result and the ground truth.

When the DPAC-UNet segments the same slice, as presented in Figure 8(b), ② is the attention coefficient heatmap generated by the primary network at the location marked (I) in Figure 1. This heatmap has obvious defects consistent with ② of Figure 8(a), likewise covering a smaller area than the actual lesion, and it notably also introduces a certain amount of noise. ③ in Figure 8(b) is the auxiliary compensation attention coefficient map generated by the DPAC-UNet's auxiliary network at the location marked (II) in Figure 1; its coverage area is moderately larger than the actual lesion and covers the correct lesion region. After the auxiliary compensation attention coefficient map ③ is added to the primary network's attention coefficient map ②, the new compensated attention coefficient map shown in ④ is obtained. Comparing ④ with ② in Figure 8(b), the insufficient coverage of the attention coefficient in ② has been compensated, and the noise has also been significantly reduced. ⑤ is the final segmentation result of the DPAC-UNet; the segmentation result is significantly improved in terms of both lesion contour and area. Note that when we compare heatmap ② of Figure 8(a), generated by the single-path attention U-Net, with heatmap ② of Figure 8(b), generated by the DPAC-UNet's primary network, the two differ slightly in noise level because they come from two independently trained models, but each shows defects of the same pattern.

Figure 8: Visualization examples of the attention coefficient maps of different methods: (a) single-path primary network individually; (b) DPAC-UNet.

and 8(b) are slightly different in noise level because they are (3) the DPAC-UNet model proposed in this paper,
two independent trained models, but the respective heat- trained by the WBCE-Tversky loss and tolerance loss
map ② has the defects of the same pattern. (β � 0.8, δ � 0.7, λ � 4)
3.4. Comparison of Different Methods. Many lesion segmentation methods have been studied recently using the ATLAS dataset. Zhou et al. proposed a new architecture called dimension-fusion-UNet (D-UNet) [28], which combines 2D and 3D convolution in the encoding stage. Yang et al. proposed CLCI-Net, which uses cross-level fusion and a context inference network [29]. These published segmentation results serve as the comparison baseline for our experiments.

Using the same conditions as the previous experiments, we conducted a comparison of the following models and loss functions (the WBCE-Tversky loss is sketched below):

(1) the U-Net [9] model trained by the WBCE-Tversky loss (β = 0.8)
(2) the attention U-Net [10] trained by the WBCE-Tversky loss (β = 0.8)
(3) the DPAC-UNet model proposed in this paper, trained by the WBCE-Tversky loss and tolerance loss (β = 0.8, δ = 0.7, λ = 4)

Cases (2) and (3) are, respectively, using the primary network individually and combining it with the auxiliary network.
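The exact form of the WBCE-Tversky loss is defined in the paper’s methods section; as a reading aid, the sketch below shows one plausible PyTorch formulation that combines a weighted binary cross-entropy term with the Tversky term of [22]. The positive-class weight, the equal weighting of the two terms, and the convention that β penalizes false negatives (with α = 1 − β on false positives) are our assumptions here.

```python
import torch
import torch.nn.functional as F

def wbce_tversky_loss(logits: torch.Tensor, target: torch.Tensor,
                      beta: float = 0.8, pos_weight: float = 10.0,
                      eps: float = 1e-6) -> torch.Tensor:
    """Compound loss: weighted BCE plus (1 - Tversky index).

    target is the binary lesion mask as a float tensor; beta > 0.5 puts
    more weight on false negatives, favouring lesion recall."""
    # Weighted BCE: up-weight the sparse lesion voxels (weight value assumed).
    wbce = F.binary_cross_entropy_with_logits(
        logits, target, pos_weight=torch.tensor(pos_weight))
    # Soft Tversky index: TP / (TP + (1 - beta) * FP + beta * FN).
    prob = torch.sigmoid(logits)
    tp = (prob * target).sum()
    fp = (prob * (1.0 - target)).sum()
    fn = ((1.0 - prob) * target).sum()
    tversky = (tp + eps) / (tp + (1.0 - beta) * fp + beta * fn + eps)
    return wbce + (1.0 - tversky)  # equal weighting of the terms is assumed
```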
The final comparison results are presented in Table 5; the DPAC-UNet achieved the highest DSC and F2 scores. Comparing the single-path attention U-Net with our DPAC-UNet, that is, going from the primary network alone to the introduction of the auxiliary attention compensation mechanism, the DSC score improved by 6%. By contrast, comparing the classic U-Net with the attention U-Net, that is, going from no attention to the self-attention mechanism, the DSC score improved by only 2.1%. This comparison shows that our DPAC-UNet provides a substantial performance improvement over the single-path self-attention segmentation model. Compared with the methods in the existing literature, its DSC score is 5.7% higher than the D-UNet and 1.1% higher than the CLCI-NET, which suggests that our DPAC-UNet achieves improved segmentation performance over the existing methods.

Table 5: Comparison of segmentation performance of different methods (metrics in %).

Models           Loss functions                                       DSC    F2     PRE    RE
D-UNet           Enhance mixing loss                                  53.5   —      63.3   52.4
CLCI-NET         Dice loss                                            58.1   —      64.9   58.1
U-Net            WBCE-Tversky (β = 0.8)                               51.1   49.2   59.3   48.7
Attention U-Net  WBCE-Tversky (β = 0.8)                               53.2   55.6   62.6   56.2
DPAC-UNet        WBCE-Tversky (β = 0.8), tolerance (δ = 0.7, λ = 4)   59.2   59.0   65.6   59.9

As shown in Figure 9, we present a group of boxplots of the segmentation performance distribution over all 239 MRI scans to evaluate the different models. The 239 segmentation results are generated from the six nonrepeated test sets split by sixfold nested cross-validation. From the boxplots, we can state the following. First, comparing our DPAC-UNet model with the other two models, the overall segmentation accuracy increases significantly, and the minimum values and lower quartiles of the DSC and F2 boxplots also increase significantly. This shows that our method substantially improves the scans on which the other two methods perform poorly. Second, comparing the medians and upper quartiles of the boxplots, we can see that, for scans already segmented well by the other two models, the DPAC-UNet still yields a slight improvement. For scans with distinct lesion characteristics that are easy to segment, the primary network can generate a correct attention coefficient map with high probability; in this case, using the auxiliary network to compensate the primary network does not reduce the segmentation accuracy and may even slightly improve it. Observing the boxplots of the FPR results, it is evident that the FPR values of the three models are consistently small. This shows that although the auxiliary compensation attention coefficient map generated by the DPAC-UNet’s auxiliary network has a high FPR, after compensating it to the primary network, the segmentation result of the primary network maintains a small FPR.
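For reference, the DSC, F2, PRE, RE, and FPR values reported in this section follow the standard voxel-level definitions. The sketch below (our illustration; the paper’s evaluation code is not shown) computes them from binary masks.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8):
    """Voxel-level metrics from binary prediction and ground-truth masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    tn = np.sum(~pred & ~truth)
    pre = tp / (tp + fp + eps)                # precision
    re = tp / (tp + fn + eps)                 # recall
    dsc = 2 * tp / (2 * tp + fp + fn + eps)   # Dice similarity coefficient
    f2 = 5 * pre * re / (4 * pre + re + eps)  # F2 weights recall over precision
    fpr = fp / (fp + tn + eps)                # false positive rate
    return {"DSC": dsc, "F2": f2, "PRE": pre, "RE": re, "FPR": fpr}
```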
3.5. Time Consumption. The parameter amounts and the training and testing computation times for each part of DPAC-UNet are listed in Table 6 to show which parts of the network require the most execution time. Since the primary and auxiliary networks are trained in parallel as a whole, the computation time of each part cannot be measured separately at the same time. Therefore, we compared the computational complexity and time consumption of the primary and auxiliary networks of DPAC-UNet by training them independently.
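The measurement procedure itself is not shown in the paper; a minimal sketch of how the parameter count and average forward-pass time of one independently trained subnetwork could be collected is given below (the model, input size, and run count are placeholders).

```python
import time
import torch

def profile_subnetwork(model: torch.nn.Module, sample: torch.Tensor, runs: int = 10):
    """Return (parameters in millions, mean forward-pass time in seconds)."""
    n_params = sum(p.numel() for p in model.parameters()) / 1e6
    model.eval()
    with torch.no_grad():
        model(sample)  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            model(sample)
        mean_time = (time.perf_counter() - start) / runs
    return n_params, mean_time

# e.g. profile_subnetwork(attention_unet, torch.randn(1, 1, 224, 192)),
# where attention_unet and the slice size are hypothetical placeholders.
```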
The number of trainable parameters in DPAC-UNet is double that of the single-path attention U-Net (primary network or auxiliary network alone). The training time of the DPAC-UNet (5.11 hours on average) is about 1.7 times that of each subnetwork (3.06 hours on average), and the testing time of the DPAC-UNet (17 seconds on average) is likewise about 1.7 times that of each subnetwork (10 seconds on average). Although DPAC-UNet significantly increases the total number of model parameters and the training time after the introduction of the auxiliary network compensation mechanism, the significant improvement in segmentation performance makes up for the increased model complexity.

3.6. DPAC Structure of Other Models. The DPAC structure proposed in this paper, which uses an auxiliary network to compensate the primary network, can be applied to most segmentation models with spatial self-attention. We implemented our method on two other segmentation models with a self-attention mechanism, RA-UNet [30] and AGResU-Net [31], and compared the experimental results of the single-path networks with those of the dual-path networks with auxiliary networks. The experimental results are shown in Table 7. Both single-path segmentation models achieve effectively improved segmentation performance after using the auxiliary network for attention compensation, which shows that our method can be applied to other segmentation networks with the self-attention mechanism. It should be noted that, in accordance with the hyperparameter selection steps in Section 2.2.3, when the dataset or the segmentation model changes, the hyperparameters of the tolerance loss function need to be redetermined. As shown in Table 7, when the δ value for AGResU-Net is 0.6, the DPAC structure achieves its best segmentation performance.
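To illustrate how the DPAC wiring can wrap an arbitrary spatial-attention backbone, the sketch below uses a deliberately simplified toy backbone with a single attention gate. ToyAttentionBackbone, its forward signature, and the clamped additive merge are hypothetical simplifications; real backbones such as attention U-Net, RA-UNet, or AGResU-Net expose several gates, and the paper compensates at a specific AG level.

```python
import torch
import torch.nn as nn

class ToyAttentionBackbone(nn.Module):
    """Stand-in for an attention segmentation network. It produces a spatial
    attention map, optionally adds an external compensation map to it, and
    gates its features with the result."""

    def __init__(self, channels: int = 8):
        super().__init__()
        self.features = nn.Conv2d(1, channels, 3, padding=1)
        self.attention = nn.Conv2d(channels, 1, 1)  # one-channel coefficient map
        self.head = nn.Conv2d(channels, 1, 1)       # segmentation logits

    def forward(self, x, extra_attention=None):
        f = torch.relu(self.features(x))
        alpha = torch.sigmoid(self.attention(f))
        if extra_attention is not None:             # additive compensation
            alpha = (alpha + extra_attention).clamp(max=1.0)
        return self.head(f * alpha), alpha          # gate features, then predict

class DPACWrapper(nn.Module):
    """Dual-path wiring: the auxiliary network's attention map compensates
    the primary network's attention coefficients."""

    def __init__(self, make_backbone=ToyAttentionBackbone):
        super().__init__()
        self.primary = make_backbone()    # would be trained with WBCE-Tversky loss
        self.auxiliary = make_backbone()  # would be trained with tolerance loss

    def forward(self, x):
        _, aux_alpha = self.auxiliary(x)
        logits, _ = self.primary(x, extra_attention=aux_alpha)
        return logits

model = DPACWrapper()
print(model(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 1, 64, 64])
```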

Figure 9: Boxplots of metric results (DSC, F2, PRE, RE, and FPR) for the different models (U-Net, attention U-Net, and DPAC-UNet).

Table 6: Time consumption of DPAC-UNet.

Networks           Parameters (M)   Training (hours)   Testing (seconds)
Primary network    40.4             3.07               10
Auxiliary network  40.4             3.05               10
DPAC-UNet          80.8             5.11               17

Table 7: Experimental results of the DPAC structure based on other models (metrics in %).

No.  Networks     Auxiliary  Loss functions                                       DSC    F2     PRE    RE
1    RA-UNet      Without    WBCE-Tversky (β = 0.8)                               54.1   56.5   63.8   58.1
                  With       WBCE-Tversky (β = 0.8), tolerance (δ = 0.7, λ = 4)   60.3   59.9   67.1   60.0
2    AGResU-Net   Without    WBCE-Tversky (β = 0.8)                               55.2   59.7   61.4   57.5
                  With       WBCE-Tversky (β = 0.8), tolerance (δ = 0.6, λ = 4)   60.5   62.2   66.6   61.1

4. Discussion and Conclusions

In this paper, we proposed the DPAC-UNet, using the classic self-attention model, attention U-Net, as the basic segmentation model. To realize the functions of the DPAC-UNet’s primary and auxiliary networks, we proposed the WBCE-Tversky and tolerance losses as their respective training loss functions. We explored the hyperparameter configuration of the loss functions by applying sixfold cross-validation to the 239 MRI scans of the ATLAS stroke segmentation dataset. We found that the WBCE-Tversky loss achieves the most accurate segmentation for the primary network when β = 0.8, and that the tolerance loss generates a tolerant auxiliary compensation attention coefficient map with a moderate coverage area, compensating for the primary network’s defective attention coefficient map and achieving the best segmentation performance when β = 0.8, λ = 4, and δ = 0.7. The experimental results indicate that the DSC score of the proposed DPAC-UNet with the auxiliary network is 6% higher than that without it. Compared with the methods in the existing literature, the DSC score of the proposed DPAC-UNet is 5.7% higher than the D-UNet and 1.1% higher than the CLCI-NET. These results indicate that the proposed method achieves improved segmentation performance and verify its effectiveness.

It should be noted that although we used the same dataset as D-UNet and CLCI-NET, the version varied: we used the version without defacing, which contains 239 MR images, whereas D-UNet and CLCI-NET used the version with defacing, which contains 229 MR images. Furthermore, considering that the cross-validation dataset splitting methods do not generate the same training, validation, and testing sets, and that the loss functions used also differ, achieving the best segmentation scores does not directly prove that the proposed method is the best; it shows that we have reached a higher level of segmentation performance among current methods.

The purpose and focus of our work are to improve the performance of the single-path attention mechanism segmentation model by using our DPAC method. As shown in Table 6, although our method requires more computing resources and a longer training time, the improvement in segmentation performance balances out the increased model complexity, and the roughly five-hour training time is at a low-to-average level among the latest network models currently used for stroke lesion segmentation. We also showed that when our DPAC structure is applied to other basic segmentation models with a self-attention mechanism, it likewise effectively improves segmentation performance, verifying the method’s versatility. In future work, we plan to use other stroke segmentation datasets to compare the effectiveness of our method across various datasets.

Data Availability

The ATLAS dataset is publicly available at http://fcon_1000.projects.nitrc.org/indi/retro/atlas_download.html.

Conflicts of Interest

The authors declare no conflicts of interest.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under grant no. 62171307, Key Research and Development Project of Shanxi Province under grant no. 201803D31045 (China), Natural Science Foundation of Shanxi Province under grant no. 201801D121138 (China), a research project supported by Shanxi Scholarship Council under grant no. 201925 (China), and Graduate Education Innovation Project of Shanxi Province under grant no. 2018BY051 (China).

References

[1] A. G. Thrift, D. A. Cadilhac, T. Thayabaranathan et al., “Global stroke statistics,” International Journal of Stroke, vol. 9, no. 1, pp. 6–18, 2014.
[2] R. Zhang, L. Zhao, W. Lou et al., “Automatic segmentation of acute ischemic stroke from DWI using 3-D fully convolutional DenseNets,” IEEE Transactions on Medical Imaging, vol. 37, no. 9, pp. 2149–2160, 2018.
[3] S. L. Liew, J. M. Anglin, N. W. Banks et al., “A large, open source dataset of stroke anatomical brain images and manual lesion segmentations,” Scientific Data, vol. 5, no. 1, Article ID 180011, 2018.
[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[5] E. Shelhamer, J. Long, and T. Darrell, “Fully convolutional networks for semantic segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 640–651, 2017.
[6] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: a deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[7] K. Suzuki, “Overview of deep learning in medical imaging,” Radiological Physics and Technology, vol. 10, no. 3, pp. 257–273, 2017.
[8] G. Litjens, T. Kooi, B. E. Bejnordi et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
[9] O. Ronneberger, P. Fischer, and T. Brox, “U-net: convolutional networks for biomedical image segmentation,” in Proceedings of the 18th International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Munich, Germany, October 2015.
[10] J. Schlemper, O. Oktay, M. Schaap et al., “Attention gated networks: learning to leverage salient regions in medical images,” Medical Image Analysis, vol. 53, pp. 197–207, 2019.
[11] Z. Zhou, M. M. R. Siddiquee, N. Tajbakhsh, and J. Liang, “UNet++: a nested U-net architecture for medical image segmentation,” in Proceedings of the 4th International Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 3–11, Granada, Spain, September 2018.
[12] M. Z. Alom, C. Yakopcic, T. M. Taha, and V. K. Asari, “Nuclei segmentation with recurrent residual convolutional neural networks based U-Net (R2U-Net),” in Proceedings of the NAECON 2018-IEEE National Aerospace and Electronics Conference, pp. 228–233, Dayton, OH, USA, July 2018.
[13] B. S. Lin, K. Michael, S. Kalra, and H. R. Tizhoosh, “Skin lesion segmentation: U-Nets versus clustering,” in Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7, Honolulu, HI, USA, November 2017.
[14] M. Noori, A. Bahri, and K. Mohammadi, “Attention-guided version of 2D UNet for automatic brain tumor segmentation,” in Proceedings of the 2019 9th International Conference on Computer and Knowledge Engineering (ICCKE), Mashhad, Iran, October 2019.
[15] Y. J. Huang, Q. Dou, Z. X. Wang et al., “3-D RoI-aware U-net for accurate and efficient colorectal tumor segmentation,” IEEE Transactions on Cybernetics, 2020, Early Access.
[16] P. F. Christ, M. E. A. Elshaer, F. Ettlinger et al., “Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields,” in Proceedings of the 19th International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 415–423, Athens, Greece, October 2016.
[17] K. Sirinukunwattana, J. P. W. Pluim, H. Chen et al., “Gland segmentation in colon histology images: the GlaS challenge contest,” Medical Image Analysis, vol. 35, pp. 489–502, 2017.
[18] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-net: learning dense volumetric segmentation from sparse annotation,” in Proceedings of the 19th International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 424–432, Athens, Greece, October 2016.
[19] J. Merkow, A. Marsden, D. Kriegman, and Z. Tu, “Dense volume-to-volume vascular boundary detection,” in Proceedings of the 19th International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 371–379, Athens, Greece, October 2016.
[20] M. Khened, V. A. Kollerathu, and G. Krishnamurthi, “Fully convolutional multi-scale residual DenseNets for cardiac segmentation and automated cardiac diagnosis using ensemble of classifiers,” Medical Image Analysis, vol. 51, pp. 21–45, 2019.
[21] Y. Li and L. Shen, “Deep learning based multimodal brain tumor diagnosis,” in Proceedings of the 3rd International MICCAI Brainlesion Workshop, pp. 149–158, Quebec City, Canada, September 2017.
[22] S. S. M. Salehi, D. Erdogmus, and A. Gholipour, “Tversky loss function for image segmentation using 3D fully convolutional deep networks,” in Proceedings of the International Workshop on Machine Learning in Medical Imaging, pp. 379–387, Quebec City, Canada, September 2017.
[23] C. H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M. Jorge Cardoso, “Generalised Dice overlap as a deep learning loss function for highly unbalanced segmentations,” in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 240–248, Springer, Berlin, Germany, 2017.
[24] D. L. Collins, P. Neelin, T. M. Peters, and A. C. Evans, “Automatic 3D intersubject registration of MR volumetric data in standardized Talairach space,” Journal of Computer Assisted Tomography, vol. 18, no. 2, pp. 192–205, 1994.
[25] J. G. Sled, A. P. Zijdenbos, and A. C. Evans, “A nonparametric method for automatic correction of intensity nonuniformity in MRI data,” IEEE Transactions on Medical Imaging, vol. 17, no. 1, pp. 87–97, 1998.
[26] N. J. Tustison, B. B. Avants, P. A. Cook et al., “N4ITK: improved N3 bias correction,” IEEE Transactions on Medical Imaging, vol. 29, no. 6, pp. 1310–1320, 2010.
[27] M. Zhang, J. Lucas, J. Ba, and G. E. Hinton, “Lookahead optimizer: k steps forward, 1 step back,” in Proceedings of the 33rd Conference on Neural Information Processing Systems, pp. 9593–9604, Vancouver, Canada, December 2019.
[28] Y. Zhou, W. Huang, P. Dong, Y. Xia, and S. Wang, “D-UNet: a dimension-fusion U shape network for chronic stroke lesion segmentation,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 18, no. 3, pp. 940–950, 2021.
[29] H. Yang, W. Huang, K. Qi et al., “CLCI-net: cross-level fusion and context inference networks for lesion segmentation of chronic stroke,” in Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 266–274, Cham, Germany, October 2019.
[30] Q. Jin, Z. Meng, C. Sun, H. Cui, and R. Su, “RA-UNet: a hybrid deep attention-aware network to extract liver and tumor in CT scans,” Frontiers in Bioengineering and Biotechnology, vol. 8, p. 1471, 2020.
[31] J. Zhang, Z. Jiang, J. Dong, Y. Hou, and B. Liu, “Attention gate resU-Net for automatic MRI brain tumor segmentation,” IEEE Access, vol. 8, pp. 58533–58545, 2020.
