Research Article
Dual-Path Attention Compensation U-Net for Stroke
Lesion Segmentation
Copyright © 2021 Haisheng Hui et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
For the segmentation task of stroke lesions, the attention U-Net model based on the self-attention mechanism can suppress irrelevant regions in an input image while highlighting salient features useful for specific tasks. However, when the lesion is small and its contour is blurred, the attention U-Net may generate wrong attention coefficient maps, leading to incorrect segmentation results. To cope with this issue, we propose a dual-path attention compensation U-Net (DPAC-UNet) network, which consists of a primary path network and an auxiliary path network. Both networks are attention U-Net models and identical in structure. The primary path network is the core network that performs accurate lesion segmentation and outputs the final segmentation result. The auxiliary path network generates auxiliary attention compensation coefficients and sends them to the primary path network to compensate for and correct possible attention coefficient errors. To realize the compensation mechanism of DPAC-UNet, we propose a weighted binary cross-entropy Tversky (WBCE-Tversky) loss to train the primary path network for accurate segmentation, and another compound loss function, called the tolerance loss, to train the auxiliary path network to generate auxiliary compensation attention coefficient maps with an expanded coverage area for the compensation operation. We conducted segmentation experiments using the 239 MRI scans of the anatomical tracings of lesions after stroke (ATLAS) dataset to evaluate the performance and effectiveness of our method. The experimental results show that the DSC score of the proposed DPAC-UNet network is 6% higher than that of the single-path attention U-Net, and also higher than the existing segmentation methods of the related literature. Therefore, our method demonstrates powerful capability for stroke lesion segmentation.
[10], U-Net++ [11], and R2U-Net [12], have been applied successfully in medical segmentation tasks, such as skin cancer [13], brain tumor [14], colorectal tumor [15], liver [16], colon histology [17], kidney [18], and vascular borders [19]. The U-Net network has thousands of feature channels; in particular, the standard five-level U-Net model has an enormous number of parameters to be trained. During training, the contraction path (encoder) and expansion path (decoder) need to repeatedly extract deep-scale features. The deep-scale features of the standard U-Net are abstract, low-resolution features, which increase the training difficulty and make training unstable and inadequate.

To reduce the training difficulty caused by repeated extraction of deep-scale features and to improve segmentation accuracy, many researchers employed a two-step method that first locates the lesion and then segments the target area [20, 21]. However, these methods introduce additional positioning operations and cannot achieve end-to-end training. To avoid such additional operations, Schlemper et al. introduced a self-attention mechanism and proposed an attention U-Net with an attention gate (AG) [10]. The self-attention mechanism reduces the dependence on external information obtained from additional steps by utilizing the correlation of feature signals from different scales. This mechanism captures the internal correlation of features and focuses attention on the target area. The attention U-Net uses the AG to generate a 2D attention coefficient map that suppresses irrelevant regions in an input image while highlighting salient features useful for specific tasks. The AG module can be integrated into the standard U-Net model for end-to-end learning without additional pretraining steps. Compared with the standard U-Net, the number of training parameters increases only slightly with the additional computation of the AG operations. The built-in self-attention module eliminates the need for additional target-location operations, reducing training difficulty, improving training efficiency, and improving model segmentation performance.

However, the self-attention mechanism based on correlation operations has some deficiencies. The attention coefficient α for constraining the area of interest is generated by the current-scale feature signal x and the coarser-scale feature signal g derived from x, which poses a potential risk for segmentation networks using the self-attention mechanism: a small lesion with nondistinct features may cause the current-level feature signal x to learn the lesion feature inadequately. Consequently, the deviation of the attention area from the lesion area, due to wrong or insufficient attention coefficient learning, leads to incorrect segmentation results.

To solve the problem of the attention area deviating from the lesion area, we proposed a dual-path attention compensation U-Net (DPAC-UNet) network, which is composed of a primary path network (primary network) and an auxiliary path network (auxiliary network). Both networks are attention U-Net segmentation models based on the self-attention mechanism, with an identical structure. The primary network is the core part of DPAC-UNet, which performs lesion segmentation and outputs the final segmentation result. The auxiliary network is used to generate an auxiliary attention compensation coefficient map that is sent to the primary network to compensate for possible attention coefficient learning errors. The auxiliary network realizes its compensation ability by focusing on a larger area than the actual lesion area, which increases the coverage of the attention coefficient map generated by the auxiliary network. The attention coefficient map with a larger attention area is defined as a tolerant attention coefficient map, which is used as an auxiliary compensation attention coefficient to compensate for possible errors in the primary network's attention coefficient map. To study our lesion segmentation network, we use the ATLAS dataset [3], consisting of 239 T1-weighted subacute and chronic stroke MRI scans released in 2018.

The main contributions of this article are summarized as follows:

(1) We proposed a DPAC-UNet that uses the auxiliary network to generate an attention coefficient map with a larger area to compensate for the possible defect of the primary network's attention coefficient map.

(2) We proposed the WBCE-Tversky loss and tolerance loss to train the primary and auxiliary networks of the DPAC-UNet, respectively, to realize their effects on the entire network, and we explore the optimal hyperparameter configurations of the two proposed loss functions.

The remainder of this work is organized as follows: In Section 2.1, we describe the network structure of the DPAC-UNet and how the auxiliary network is used to compensate for attention in the primary network. Section 2.2 proposes two compound loss functions, the WBCE-Tversky loss and the tolerance loss. In this section, we also conduct experiments to discuss the effect of different hyperparameter values of the loss functions on the performance of the segmentation task. Finally, the steps to select the optimal hyperparameter configurations of the two proposed loss functions are listed. In Section 3, we train the DPAC-UNet with the WBCE-Tversky and tolerance loss functions under the optimal hyperparameter configurations. In this section, a visualization example is also presented to further demonstrate the effectiveness of the DPAC-UNet network. We also discuss the time consumption of the primary and auxiliary networks of the DPAC-UNet, and we apply the auxiliary network's compensation mechanism to other segmentation models with self-attention mechanisms.

2. Materials and Methods

2.1. DPAC-UNet. The attention U-Net introduces several attention gates (AG) to generate attention coefficient maps that suppress irrelevant regions in an input image while highlighting salient features, improving segmentation performance without introducing additional positioning operations. However, it sometimes makes mistakes. A small lesion with indistinct features is difficult to distinguish from the surrounding healthy tissues, leading to the current
Computational Intelligence and Neuroscience 3
scale feature signal x of a certain layer not learning the lesion feature well. As a result, the attention coefficient generated using x and its derived coarser feature g will deviate from the lesion area. The wrong attention coefficient then causes the AG to output a wrong feature signal, which affects the segmentation results. Thus, if the attention U-Net finds the correct lesion in the AG module, it will emphasize the relevant area and suppress the unrelated area, improving the segmentation performance. Conversely, if the lesion location is not found in the AG or is wrong, it will have diametrically opposite effects and degrade the segmentation performance. To cope with the previously mentioned issues, using the attention U-Net as the basic segmentation model, we propose the DPAC-UNet network.

2.1.1. Overview of the Structure. The schematic of DPAC-UNet is presented in Figure 1. We used two identical attention U-Net models as the primary and auxiliary segmentation networks, which correspond to the upper and lower halves of Figure 1, respectively. The WBCE-Tversky loss function trains the primary network for accurate segmentation. The auxiliary network is trained by the tolerance loss to generate a tolerant auxiliary compensation attention coefficient that compensates for the defect of the attention coefficient map of the primary network. The details of the two loss functions are described in Section 2.2. As presented in Figure 1, the auxiliary network passes the auxiliary compensation attention coefficient to the primary network through the vertical dark red arrow line from the AG marked (II) to the AG marked (I), in order to perform the compensation operation. We selected only the second-level AG of the primary and auxiliary networks for the additive compensation operation. This is because the resolution of the attention coefficient maps generated by the two bottom AGs (13 × 11 and 26 × 22) is too low: at these resolution scales, even a difference of a single pixel produces a large difference between the attention maps of the two networks. When the level is deeper, the receptive field affected by a single pixel is very large. Consequently, a compensation operation at this scale by the auxiliary network would have an outsized impact on the primary network and generate significant attention fluctuations. Furthermore, the first-level AG, which is close to the uppermost layer's output, does not perform the auxiliary attention compensation operation, because the feature map there is too close to the output and would directly affect the segmentation result. In summary, we only selected the second-level AG to implement the compensation operation, in order to effectively compensate for the defective attention coefficient map of the primary network while ensuring that the compensation does not directly affect the accuracy of the primary network's lesion segmentation.

Figure 2 presents the AG schematic of the primary and auxiliary networks at the second level. The AGs of the first, third, and fourth levels, shown in Figure 1, are not involved in the auxiliary attention coefficient compensation operation and are identical in structure to the AG in the literature [10]. The AG marked as (II) in the lower half of Figure 1 is the second-level AG in the auxiliary network's attention U-Net, and its detailed structure is shown in Figure 2(a). In Figure 2(a), ① and ② are the inputs of the auxiliary network AG, and ④ is the output of the current level for the skip connection (SC), where l is the level number of the current AG (in this case l = 2), and the feature signals x_i^l and g_i^l correspond to the inputs labeled ① and ②. The feature signals g_i^l ∈ R^{F_g} and x_i^l ∈ R^{F_x} are sent to the AG block to generate the attention coefficient α^l using the additive attention generation operation, in order to determine the area to focus on, where i is the pixel number, F_x is the number of feature channels of the input feature signal x^l at the current level, and F_g is the number of feature channels of the input feature signal g^l at the coarser level. When the additive attention coefficient map α^l has been generated from x_i^l and g_i^l, the feature signal x_i^l is multiplied by α^l and used as the output of the AG, which is sent to the decoding path through the SC at the current level. The additive attention coefficient α^l marked as ③ is the auxiliary compensation attention coefficient map, which is sent to the AG marked as (I) at the same level and position of the primary network in the upper half of Figure 1. The equations for generating the attention coefficient of the auxiliary network are as follows:

q_att^l = W_ψ^T · σ1(W_x^T · x_i^l + W_g^T · g_i^l + b_g) + b_ψ,  (1)

α_i^l = σ2(q_att^l(x_i^l, g_i^l; Θ_att)),  (2)

(α_i^l)_rs = resample(α_i^l),  (3)

x̂_i^l = x_i^l · (α_i^l)_rs.  (4)

As presented in Figure 2(a), considering the inconsistent spatial resolutions and feature channel dimensions of the features g_i^l and x_i^l, we also need an upsampling operation to change the spatial resolution of the signal g_i^l to make it consistent with x_i^l. Moreover, we need the linear transformations W_g ∈ R^{F_g × F_int} and W_x ∈ R^{F_x × F_int} to make the numbers of feature channels of these two signals the same, where b_g ∈ R^{F_int} and b_ψ ∈ R denote the biases of the two linear transformations. In (1), σ1 is the ReLU activation function, and the output of this activation function is linearly transformed by W_ψ^T ∈ R^{1 × F_int}, which forms an attention coefficient matrix with only one feature channel. In (2), the sigmoid activation function σ2 converts the attention coefficient matrix into a gridded attention coefficient map α_i^l that acts on x_i^l. The map α_i^l is resampled, and the resampled result is multiplied by x_i^l to generate the AG output feature signal x̂_i^l, as in (3) and (4). Figure 2(b) presents the block diagram of the AG marked as (I) in the upper half of Figure 1, where the auxiliary compensation attention coefficient map compensates for the primary network. The structure and equations of the signal operation process are almost identical to those of the auxiliary network, as presented in Figure 2(a). The difference is that, when generating the final additive fused attention coefficient map, the auxiliary compensation attention coefficient map generated by the auxiliary network AG, marked as ③, is additively fused with the original attention coefficient map generated by the primary network AG
[Figure 1: Overall schematic of the DPAC-UNet. The primary network (upper half, AG marked (I)) and the auxiliary network (lower half, AG marked (II)) are identical attention U-Nets. The input image is 208 × 176; encoder/decoder features carry 1, 64, 128, … channels (concatenated at the skip connections as 64 + 64 = 128 and 128 + 128 = 256), with intermediate resolutions 104 × 88 down to 13 × 11, and a 1 × 1 convolution with sigmoid produces the output segmentation map. C: channel; H: height; W: width.]

[Figures 2(a) and 2(b): block diagrams of the auxiliary path AG and the primary path AG, with inputs ① (x_i^l) and ② (g_i^l), the 1 × 1 transformations W_x and W_ψ, the activations σ1 and σ2, the resampled coefficient (α_i^l)_rs, the compensation signal ③, and the output ④.]

Figure 2(c): definitions of the operation symbols and feature dimensions (input → output):
- Current feature, Conv 1 × 1: (52, 44, 256) → (52, 44, 256)
- Attention coefficient, Conv 1 × 1: (52, 44, 128) → (52, 44, 1)
- Element-wise addition: (52, 44, 256) → (52, 44, 256)
- Element-wise repeat to get multi-channel: (52, 44, 1) → (52, 44, 256)
Figure 2: (a) Schematic of the AG structure of the auxiliary network, (b) schematic of the AG structure of the primary network, and (c) the
definition of various operation symbols and dimensional changes of input and output feature signals.
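Equations (1)–(4) describe per-pixel additive gating. As a minimal illustration, the following pure-Python sketch treats one feature channel per pixel and replaces the 1 × 1 convolutions W_x, W_g, and W_ψ with scalar weights; the weight values are arbitrary stand-ins, and the resampling step (3) is omitted because x and g are taken here at the same resolution:

```python
import math

def attention_gate(x, g, w_x=0.9, w_g=1.1, w_psi=2.0, b_g=0.0, b_psi=-1.0):
    """Scalar sketch of the additive attention gate, equations (1)-(4).

    x, g: per-pixel feature values at the current and coarser scale
    (g is assumed already upsampled to x's resolution).
    The scalar weights stand in for the 1x1 convolutions of the model.
    """
    relu = lambda v: max(0.0, v)                  # sigma_1
    sigmoid = lambda v: 1.0 / (1.0 + math.exp(-v))  # sigma_2
    x_hat, alpha = [], []
    for xi, gi in zip(x, g):
        q_att = w_psi * relu(w_x * xi + w_g * gi + b_g) + b_psi  # eq. (1)
        a = sigmoid(q_att)                                        # eq. (2)
        alpha.append(a)
        x_hat.append(xi * a)                                      # eq. (4)
    return x_hat, alpha

x = [0.2, 1.5, 3.0, 0.1]   # current-scale features, one value per pixel
g = [0.1, 1.2, 2.5, 0.0]   # coarser-scale gating features
x_hat, alpha = attention_gate(x, g)
```

Pixels where both the current-scale and coarser-scale features respond strongly receive α close to 1 and pass through; weakly responding pixels are suppressed.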
generated by inputs ① and ②. According to (3) and (4), the output feature signal ④ of the primary network AG is then generated. Figure 2(c) presents the definitions of the various operation symbols and the dimension changes of the input and output feature signals in Figures 2(a) and 2(b).

2.1.2. Compensation Mechanism of the Auxiliary Network. The traditional single-path self-attention model generates a spatial attention coefficient map with the AG to cover the lesion area of the features, so that more attention is paid to the lesion area and the segmentation performance improves. Our proposed method builds an auxiliary network to generate an auxiliary attention coefficient map with a larger coverage area to compensate the segmentation network (the primary network), improving its hit rate of complete lesion coverage by the spatial attention coefficient map. It should be noted that the attention compensation map does not deviate from the original attention area of the primary network but is constrained to enlarge the attention area around it. This compensation mechanism is especially effective when the lesion feature is indistinct, the lesion's outline is unclear, or the segmentation model cannot generate the correct region of interest.

The qualitative analysis and comparison of using the primary network individually or combined with an auxiliary network are stated as follows. When DPAC-UNet uses the auxiliary network to compensate for the primary network, there are three situations:

Situation 1. (1) Using the primary network individually: the focus area of the attention coefficient map of the single-path network is only partially correct (Figure 3(a), ①), which leads to reduced segmentation performance. (2) Combined with an auxiliary network: after the auxiliary network compensates the primary network's attention coefficient map with a larger focus area through additive compensation, the compensated attention coefficient map may be correct (Figure 3(a), ②)

Combined with an auxiliary network: the larger auxiliary attention coefficient compensation map generated by the auxiliary network covers a larger area, and the compensated attention coefficient map may be still wrong (Figure 3(c), ②), partially correct (Figure 3(c), ③), or completely correct (Figure 3(c), ④). Correspondingly, the segmentation performance will remain unchanged, improve to some extent, or improve significantly.

Therefore, combining the previously mentioned three situations, the overall average segmentation performance on the whole dataset will be improved. It can also be seen from Figure 3 that the attention coefficient map generated by the auxiliary network does not deviate from the attention coefficient map area generated by the primary network.

2.2. Loss Functions of DPAC-UNet. We propose two different compound loss functions to train the primary and auxiliary networks. First, we propose the WBCE-Tversky loss for the primary network to generate an attention coefficient map focused on the target area and an accurate segmentation result. Second, we propose the tolerance loss for the auxiliary network to generate an auxiliary compensation attention coefficient map with a larger coverage area to compensate for the primary network. It is called a tolerance loss because it can generate an attention coefficient map that covers a larger area without deviating from the lesion area, which means a higher fault tolerance for attention errors.

2.2.1. WBCE-Tversky Loss. The Tversky loss [22], which was proposed to address data imbalance in medical image segmentation, is introduced as a component of our WBCE-Tversky loss. The Tversky loss is as follows:

T_loss(α, β) = 1 − (Σ_{i=1}^{N} p_{1i} · g_{1i}) / (Σ_{i=1}^{N} p_{1i} · g_{1i} + α Σ_{i=1}^{N} p_{1i} · g_{0i} + β Σ_{i=1}^{N} p_{0i} · g_{1i}).  (5)
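Written out over flattened probability and label vectors, (5) can be sketched in pure Python as follows, where p holds the predicted lesion probabilities p_{1i}, g the binary labels g_{1i}, and p_{0i} = 1 − p_{1i}, g_{0i} = 1 − g_{1i}; the small eps term is an addition of this sketch, not part of (5), to keep the ratio defined:

```python
def tversky_loss(p, g, alpha, beta, eps=1e-7):
    """Tversky loss of equation (5).

    p: predicted lesion probabilities p1_i in [0, 1]
    g: binary ground-truth labels g1_i (1 = lesion)
    alpha weights false positives, beta weights false negatives.
    """
    tp = sum(pi * gi for pi, gi in zip(p, g))            # p1 * g1 terms
    fp = sum(pi * (1 - gi) for pi, gi in zip(p, g))      # p1 * g0 terms
    fn = sum((1 - pi) * gi for pi, gi in zip(p, g))      # p0 * g1 terms
    return 1.0 - tp / (tp + alpha * fp + beta * fn + eps)

p = [0.9, 0.8, 0.1, 0.2]
g = [1,   1,   0,   1]
loss_dice = tversky_loss(p, g, 0.5, 0.5)      # alpha = beta = 0.5: Dice loss
loss_fn_heavy = tversky_loss(p, g, 0.2, 0.8)  # beta = 0.8 punishes misses more
```

With α = β = 0.5 the Tversky loss reduces to the Dice loss; raising β (e.g., β = 0.8) penalizes false negatives more heavily, which suits small lesions in an imbalanced dataset.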
[Figure 3: attention coefficient maps when the primary network is used individually (top row, ①) and when it is combined with the auxiliary network (bottom row, ②–④).]
generalized loss function in training will lead to higher generalization and improved performance on the imbalanced dataset. So, we use the Tversky loss with a higher β as a part of the WBCE-Tversky loss for training the primary network of DPAC-UNet. Meanwhile, in the tolerance loss, we also need a Tversky loss term to constrain the growth of the attention coefficient map, ensuring that the larger and more tolerant focus area does not deviate from the lesion area. To compare the segmentation performance of the Tversky loss under different values of the hyperparameter β and to select the appropriate β for the WBCE-Tversky loss and tolerance loss, we used the Tversky loss to train the basic segmentation model, the attention U-Net. The hyperparameter β of the Tversky loss ranges from 0.5 to 0.95, using 0.05 as the value interval. We conducted an experiment using sixfold cross-validation, which is often used to train a model whose hyperparameters need to be optimized. We split the 239 stroke MRI scans into training, validation, and test sets by sixfold cross-validation according to Figure 4.

First, in each fold, we divided the data into training and test sets using a ratio of about 5 : 1 (199 : 40), and we ensured that the MRI scans of the test sets are not repeated across folds. Second, we further split the training set in each fold into inner training and validation sets using a ratio of about 4 : 1 (160 : 39). The validation set is used to select the best-performing model trained on the training set. Moreover, we also ensured that the training, validation, and test sets of each fold have the same lesion volume distribution, for the accuracy of the experimental results. The lesion size distribution of fold 1 is presented in Figure 5.

The experimental configuration and results of training the attention U-Net using the Tversky loss are presented in Table 1. We used 10 different β values to perform sixfold cross-validation and computed the average metric scores over all test sets' results. We used the dice similarity coefficient (DSC), F2 score (F2), precision (PRE), and recall (RE) as the metrics for model evaluation. DSC is a widely used metric for evaluating the performance of segmentation models; the F2 score is often used to evaluate the performance of models on imbalanced data; PRE quantifies how many positive class predictions actually belong to the positive class; RE quantifies how many of all positive examples in the dataset are predicted as positive. The experimental results of training the attention U-Net with different hyperparameter β values for the Tversky loss are presented in Table 1.

As presented in Table 1, the maximum RE value is obtained when β takes the large value of 0.95, and the maximum PRE value is obtained when β takes the minimum value of 0.50. The DSC and F2 scores reach their maximum when β = 0.80, which simultaneously makes a trade-off between PRE and RE, indicating that, for the imbalanced ATLAS dataset, training a model using the Tversky loss with hyperparameter β = 0.80 improves the segmentation accuracy. We need a loss function that can train the primary network of the DPAC-UNet to achieve accurate segmentation. To improve the segmentation performance, we can handle the imbalanced dataset by selecting the hyperparameter β value of the Tversky loss used to train the basic segmentation model, in order to reduce the tendency of lesions to be classified as nonlesion. As presented in Table 1, using the Tversky loss with hyperparameter β = 0.80 to train the attention U-Net on the ATLAS dataset achieves the highest segmentation performance. However, as presented in (5), if the denominator of the Tversky loss is a small value, it causes instability in backpropagation and derivation. To solve this problem, we introduced the WBCE loss [23] as the other part of the WBCE-Tversky loss. On the one hand, it avoids the backpropagation and gradient calculation instability caused by the Tversky loss with small denominators. On the other hand, the WBCE loss gives greater weight to the minority class, which adapts to the imbalance of the dataset and further improves the overall segmentation performance. The WBCE loss function is differentiable, which simplifies the optimization process. The equation of the proposed WBCE-Tversky loss is presented in (8). The compound loss function is composed of the Tversky loss (β = 0.80) and the WBCE loss, and their respective equations are presented as
Table 1: Experimental results when using the Tversky loss with different β values to train the attention U-Net.

Weights               DSC    F2     PRE    RE    (metrics in %)
α = 0.50, β = 0.50    49.9   46.4   64.3   45.0
α = 0.45, β = 0.55    50.8   48.6   62.8   47.5
α = 0.40, β = 0.60    51.1   52.5   58.0   51.1
α = 0.35, β = 0.65    50.9   52.1   57.8   53.7
α = 0.30, β = 0.70    51.5   52.6   59.5   54.8
α = 0.25, β = 0.75    52.0   51.5   61.3   52.5
α = 0.20, β = 0.80    52.7   55.4   56.7   58.3
α = 0.15, β = 0.85    50.5   52.5   53.4   55.5
α = 0.10, β = 0.90    50.2   52.7   53.2   56.5
α = 0.05, β = 0.95    51.6   55.0   53.5   59.4

To test and verify the proposed WBCE-Tversky loss, we conducted a series of comparative experiments using the WBCE loss, the Tversky loss with different hyperparameter β values, and the WBCE-Tversky loss with different β values. The model, datasets, and experimental conditions are the same as in the experiments corresponding to Table 1. The experimental parameter configuration and results are presented in Table 2.

[Figure 4 shows the sixfold split: in each fold, one of six test folds is held out, and the remaining training folds are further divided by an inner split into training and validation sets.]
Figure 4: Schematic of sixfold cross-validation.

[Figure 5: lesion size distribution of fold 1; vertical axis: volume (10³ voxels).]
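The nested split of Figure 4 (six outer test folds over the 239 scans, each remaining training portion split about 4 : 1 into inner training and validation sets) can be sketched as follows; the lesion-volume stratification used in the paper is simplified here to a plain shuffled split:

```python
import random

def sixfold_splits(n_scans=239, n_folds=6, seed=0):
    """Outer k-fold test split plus an inner ~4:1 train/validation split.

    Returns a list of (train, val, test) index lists. Stratification by
    lesion volume (as in the paper) is omitted for brevity.
    """
    idx = list(range(n_scans))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]  # near-equal, disjoint
    splits = []
    for k in range(n_folds):
        test = folds[k]
        rest = [i for f in range(n_folds) if f != k for i in folds[f]]
        n_val = len(rest) // 5                         # inner ~4:1 split
        splits.append((rest[n_val:], rest[:n_val], test))
    return splits

splits = sixfold_splits()
# every scan appears in exactly one test fold across the six folds
```

With 239 scans this yields test folds of 39–40 scans and inner splits of 160 training and 39 validation scans, matching the 199 : 40 and 160 : 39 ratios described above.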
Table 2: Comparing the segmentation performance of the WBCE-Tversky loss under different hyperparameter configurations.

Loss function    Weights               DSC    F2     PRE    RE     FPR   (metrics in %)
WBCE only        None                  46.7   43.1   62.3   41.6   0.08
Tversky only     α = 0.50, β = 0.50    49.9   46.4   64.3   45.0   0.06
WBCE-Tversky     α = 0.50, β = 0.50    51.5   49.5   63.2   49.5   0.10
Tversky only     α = 0.40, β = 0.60    51.1   52.5   58.0   51.1   0.14
WBCE-Tversky     α = 0.40, β = 0.60    52.1   51.5   59.6   52.0   0.10
Tversky only     α = 0.30, β = 0.70    51.5   52.6   59.5   54.8   0.14
WBCE-Tversky     α = 0.30, β = 0.70    51.9   50.4   62.2   50.3   0.10
Tversky only     α = 0.20, β = 0.80    52.7   55.4   56.7   58.3   0.16
WBCE-Tversky     α = 0.20, β = 0.80    53.2   55.6   62.6   56.2   0.12
Tversky only     α = 0.10, β = 0.90    50.2   52.7   53.2   56.5   0.20
WBCE-Tversky     α = 0.10, β = 0.90    51.5   51.6   57.7   53.1   0.14

specificity = TN / (TN + FP),  (9)

S_loss(λ, δ) = λ(Σ_{i=1}^{N} p_{0i} · g_{0i} / (Σ_{i=1}^{N} p_{0i} · g_{0i} + Σ_{i=1}^{N} p_{1i} · g_{0i}) − δ)².  (10)

Table 3: FPR values of the tolerance loss (β = 0.8) using different hyperparameter configurations.

Weights           DSC    F2     PRE    RE     FPR   (metrics in %)
λ = 1, δ = 0.9    45.9   55.2   38.5   67.7   0.44
λ = 1, δ = 0.8    44.6   55.7   36.7   71.6   0.57
λ = 1, δ = 0.7    40.7   51.2   33.1   67.2   0.61
λ = 1, δ = 0.6    30.2   44.1   21.0   77.1   1.27
λ = 2, δ = 0.9    45.2   55.4   36.3   70.6   0.51
λ = 2, δ = 0.8    32.2   45.3   23.0   72.6   1.09
λ = 2, δ = 0.7    30.4   44.6   20.6   70.9   1.34
λ = 2, δ = 0.6    14.8   26.0   8.9    83.5   4.44
λ = 3, δ = 0.9    36.1   48.0   27.4   70.2   0.74
λ = 3, δ = 0.8    22.1   35.1   14.1   74.1   2.14
λ = 3, δ = 0.7    22.8   36.1   14.8   79.4   2.01
λ = 3, δ = 0.6    11.8   18.8   7.6    83.5   4.57
λ = 4, δ = 0.9    39.4   50.9   31.4   68.9   0.69
λ = 4, δ = 0.8    23.9   37.6   15.6   74.2   1.89
λ = 4, δ = 0.7    14.9   25.4   9.2    80.6   4.09
λ = 4, δ = 0.6    7.9    11.2   5.7    82.8   4.99
λ = 5, δ = 0.9    34.4   47.5   24.9   72.4   0.90
λ = 5, δ = 0.8    20.7   33.5   13.3   82.3   2.74
λ = 5, δ = 0.7    13.2   24.1   7.8    84.1   5.63
λ = 5, δ = 0.6    5.7    11.8   3.1    92.8   18.97
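In (10), the bracketed ratio is a soft version of the specificity in (9), computed from the products p_{0i} · g_{0i} (soft true negatives) and p_{1i} · g_{0i} (soft false positives); the term pulls the specificity toward the target δ with strength λ, so lowering δ or raising λ drives the network to flag more background as attended, raising the FPR, consistent with Table 3. A pure-Python sketch:

```python
def tolerance_term(p, g, lam, delta, eps=1e-7):
    """S_loss(lambda, delta) of equation (10).

    p: predicted lesion probabilities p1_i; g: binary labels g1_i, with
    p0 = 1 - p1 and g0 = 1 - g1 as in the paper's notation. The term is
    lambda * (soft_specificity - delta)^2, so a prediction whose
    specificity matches the target delta is not penalized; eps is an
    addition of this sketch to keep the ratio defined.
    """
    tn = sum((1 - pi) * (1 - gi) for pi, gi in zip(p, g))  # p0 * g0
    fp = sum(pi * (1 - gi) for pi, gi in zip(p, g))        # p1 * g0
    soft_specificity = tn / (tn + fp + eps)                # soft form of eq. (9)
    return lam * (soft_specificity - delta) ** 2

# The single background pixel is predicted mostly negative, so the
# soft specificity here is 0.9:
p = [0.9, 0.9, 0.1]
g = [1,   1,   0]
```

A target δ = 0.9 leaves this prediction unpenalized, while δ = 0.6 penalizes it for being *too* specific, i.e., for attending to too little background.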
[Figure 6 layout: each row is an MRI slice with the lesion contour (C0009S0009t01A100, C0009S0008t01A68, C0011S0009t01A82, C0004S0010t01A132); the columns show the attention heatmaps for the Tversky loss (β = 0.8, FPR = 0.222) and for the tolerance loss with β = 0.8 and (λ, δ) = (1, 0.9), (1, 0.7), (2, 0.7), (3, 0.8), (4, 0.7), (3, 0.6), (4, 0.6), (5, 0.7), (5, 0.6), whose FPR values rise from 0.438 to 18.97.]
Figure 6: Attention coefficient heatmaps generated by the attention U-Net with different hyperparameters of the tolerance loss.
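The (λ, δ) grid spanned by Figure 6 and Table 3 is searched in the two-step procedure of Section 2.2.3: first the Tversky β on the single-path model, then the 20 tolerance-loss pairs on the full DPAC-UNet. A sketch of that selection loop, where train_and_score_beta and train_and_score_pair are hypothetical stand-ins for full training runs returning a validation DSC:

```python
def select_hyperparameters(train_and_score_beta, train_and_score_pair):
    """Two-step grid search over the loss hyperparameters.

    train_and_score_beta(beta)             -> DSC of the single-path
                                              attention U-Net, Tversky(beta)
    train_and_score_pair(beta, lam, delta) -> DSC of the full DPAC-UNet
    Both callables are stand-ins for real (expensive) training runs.
    """
    # Step 1: beta in {0.50, 0.55, ..., 0.95} (10 values, interval 0.05)
    betas = [0.5 + 0.05 * i for i in range(10)]
    best_beta = max(betas, key=train_and_score_beta)

    # Step 2: 20 (lambda, delta) pairs for the tolerance loss
    pairs = [(lam, delta) for lam in (1, 2, 3, 4, 5)
                          for delta in (0.6, 0.7, 0.8, 0.9)]
    best_lam, best_delta = max(
        pairs, key=lambda p: train_and_score_pair(best_beta, p[0], p[1]))
    return best_beta, best_lam, best_delta

# Toy scoring functions peaking at beta = 0.8 and (lambda, delta) = (4, 0.7):
best = select_hyperparameters(
    lambda b: -abs(b - 0.8),
    lambda b, l, d: -abs(l - 4) - abs(d - 0.7))
```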
DPAC-UNet, the primary network gets a compensation different parameter pairs of tolerance loss to train the
attention coefficient with the correct region irrespective of auxiliary network, and take the δ and λ pair with the
the increase of the FPR value and the coverage area. best segmentation performance as the selected values of
However, for the coverage area of the auxiliary compen- proposed tolerance loss.
sation attention coefficient map, the case is not the larger the
When our method is applied to other different types of
better, indicating that FPR is not as high as possible. We
datasets of medical segmentation tasks or different seg-
need to set a moderate value of hyperparameters λ and δ to
mentation models, the hyperparameter configurations of
provide the best segmentation performance for DPAC-
loss functions are different, and the hyperparameter values
UNet. Therefore, in Session 3, the optimal λ and δ hyper-
need to be redetermined. This is because the hyperparameter
parameters will be selected based on the DPAC-UNet model
selection of the loss function needs to consider the imbalance
depending on the experiment performance.
of different datasets and the individual differences of at-
tention maps generated by different models.
2.2.3. Hyperparameter Selection. In order for the auxiliary
network to generate a larger proper attention coefficient 3. Experimental Results and Analysis
map, it needs to be trained by the tolerance loss proposed.
Only when the hyperparameter configuration of the toler- 3.1. Dataset and Training. The ATLAS dataset has a high 3D
ance loss function is selected appropriately, the auxiliary resolution that can meet the requirements of rotation slicing
network can provide moderate compensation to the attention module of the primary network to improve the segmentation performance. The selection of the loss function hyperparameter configuration for the primary and auxiliary networks follows two steps:

Step 1. With 0.05 as the interval, from 0.5 to 0.95, use 10 different β values of the Tversky loss to train the single-path attention U-Net model, and take the β value with the best segmentation performance as the selected β value of the proposed WBCE-Tversky loss and tolerance loss.

Step 2. To select appropriate δ and λ values for the tolerance loss, so that the auxiliary network provides appropriate attention coefficient map compensation and the entire DPAC-UNet achieves the best segmentation performance, we use the WBCE-Tversky loss function (with the β value fixed in the first step) to train the primary network. We set the tolerance loss δ value to 0.6, 0.7, 0.8, or 0.9 and the λ value to 1, 2, 3, 4, or 5; that is, we use a total of 20 hyperparameter configurations of the tolerance loss to train the auxiliary network.

We conducted our experiments on the anatomical tracings of lesions after stroke (ATLAS) dataset, which contains 239 MRI scans and focuses on the subacute and chronic stages of stroke disease. The operations of MNI-152 [24] image registration, intensity normalization [25], bias field correction [26], and changing the resolution of the MRI scans to 176 × 208 × 176 through cropping and interpolation to fit our method have been performed. We use sixfold cross-validation to ensure that the test sets cover the entire dataset. We also divide the training set of each fold into an inner-loop training set and an inner-loop validation set for best-model selection. It should be noted that since the distribution of the number of MRIs with different lesion sizes is extremely imbalanced in the dataset, it is necessary to ensure that the training, validation, and test sets have similar lesion-size distributions.

We use the deep learning framework PyTorch to conduct our experiments on three NVIDIA Tesla T4 GPUs. We train the models for at most 100 epochs and save the best model when the validation set loss is smallest. We used the lookahead optimizer [27] for model training.
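The two-step selection procedure above amounts to a small grid search. The sketch below illustrates it; `train_and_score` is a hypothetical helper (not from the paper) that trains a model with the given loss configuration and returns its validation DSC:

```python
import itertools

# Step 1: candidate beta values for the Tversky loss
# (0.5 to 0.95 with an interval of 0.05 gives 10 values).
betas = [round(0.5 + 0.05 * i, 2) for i in range(10)]

def select_beta(train_and_score):
    # Keep the beta whose trained single-path model scores best on validation.
    return max(betas, key=lambda beta: train_and_score(beta=beta))

# Step 2: with beta fixed, sweep the tolerance-loss hyperparameters
# delta in {0.6, 0.7, 0.8, 0.9} and lambda in {1, ..., 5}: 20 configurations.
deltas = [0.6, 0.7, 0.8, 0.9]
lambdas = [1, 2, 3, 4, 5]
configs = list(itertools.product(lambdas, deltas))

def select_config(train_and_score, beta):
    # Keep the (lambda, delta) pair with the best validation score.
    return max(configs, key=lambda c: train_and_score(beta=beta, lam=c[0], delta=c[1]))
```

Each candidate is evaluated by full training under cross-validation, so the search cost is 10 runs for Step 1 plus 20 runs for Step 2.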
10 Computational Intelligence and Neuroscience
The optimizer improves the stability of the optimization process while considering the dynamic adjustment of the learning rate and the acceleration of the gradient descent. We set the initial learning rate to 1 × 10⁻⁴. The same experiment conditions and environment used in the previous experiments in Section 2 are used for reproducing the single-path segmentation models, such as U-Net and attention U-Net. We applied the WBCE-Tversky loss for accurate segmentation to train these single-path models and use their results for comparison with our DPAC-UNet method.

3.2. Experiment and Results. In Section 2.1, we elaborated on the principle of the proposed DPAC network structure. Using the attention U-Net as the basic segmentation model of the primary and auxiliary networks of the DPAC method, we proposed a specific segmentation model, DPAC-UNet. In Section 2.2, we also proposed the WBCE-Tversky loss and tolerance loss to train the primary and auxiliary networks, respectively. Moreover, we explored and verified the value of hyperparameter β of the WBCE-Tversky loss through the experiments presented in Tables 1 and 2 and found that, when β = 0.8, the primary network based on the attention U-Net achieves the best segmentation performance when trained with the WBCE-Tversky loss.

We also explained the relationship between the values of the hyperparameters δ and λ and the coverage area of the auxiliary compensation attention coefficient map in Section 2.2. The coverage area of the auxiliary attention coefficient map is proportional to the FPR value, and the FPR value is proportional to λ and inversely proportional to δ. We need to select a suitable pair of λ and δ values to obtain an auxiliary attention coefficient map with a suitable coverage area so that the DPAC-UNet achieves the best segmentation performance. Therefore, based on the experiment results presented in Table 3, we explored the optimal hyperparameter configuration of λ and δ to train the best DPAC-UNet model. We used the tolerance loss (β = 0.8) configured with different hyperparameters λ and δ to train the auxiliary network of the DPAC-UNet and the WBCE-Tversky loss (β = 0.8) to train the primary network of the DPAC-UNet.

Table 4 presents the results of the experiments in which the DPAC-UNet was trained with the tolerance loss function under different hyperparameters. In Table 4, FPR∗ denotes the FPR results of the single-path attention U-Net trained with tolerance loss functions under the different hyperparameter configurations from Table 3. We sorted FPR∗ in ascending order and identified the corresponding tolerance loss functions and hyperparameter configurations. We used tolerance loss functions with these sorted configurations to train the auxiliary network of the DPAC-UNet and the WBCE-Tversky loss (β = 0.8) to train the primary network. We then obtained the experiment results of the different configurations of the DPAC-UNet to select the best hyperparameter configuration.

By observing the relationship between FPR∗ and the segmentation metrics presented in Table 4, it is evident that, as the coverage area of the attention coefficient map generated by the auxiliary network increases (indicated by FPR∗), the DSC and F2 scores of the DPAC-UNet gradually increase. When the hyperparameter values are λ = 4 and δ = 0.7, the DSC and F2 scores reach their maximum. As FPR∗ further increases, the segmentation performance gradually declines. When the coverage area increases significantly with the FPR∗ value, it negatively affects the primary network. As presented in Figure 6, when λ = 5 and δ = 0.6, the FPR∗ value reaches its maximum, as does the coverage area of the auxiliary compensation attention, which occupies a quarter of the brain slice. At this point, the coverage area is too large to effectively constrain the primary network to focus on the correct lesion area. The attention coefficient map generated by this hyperparameter configuration even interferes with the primary network, so its DSC and F2 scores are negatively affected, as presented in Table 4. The change of FPR∗ is determined by the hyperparameters λ and δ together: FPR∗ is proportional to λ and inversely proportional to δ. Therefore, the smallest λ and the largest δ generate the smallest FPR∗, and the largest λ and the smallest δ lead to the largest FPR∗. Figure 7 presents a line chart of the segmentation accuracy changing with FPR∗. The chart indicates that, as FPR∗ increases, the DSC and F2 scores first increase and then decrease. When FPR∗ is small, the coverage area of the corresponding auxiliary attention compensation coefficient map is also small, and it cannot compensate for the primary network adequately and effectively. When the FPR∗ value is too large, it tends to overcompensate. Only when the hyperparameter values, and hence the corresponding FPR∗ value, are moderate can the DPAC-UNet achieve the best segmentation performance.

Simultaneously, it can be seen from Table 4 that the FPR values generated by the DPAC-UNet's primary network are all small, irrespective of the auxiliary network's loss function and the corresponding FPR∗ value. This is because the compensation operation of the auxiliary compensation attention coefficient map generated by the auxiliary network does not directly affect the segmentation result of the primary network. It is an additive compensation from the auxiliary network to the primary network during the training process; therefore, it does not participate in the gradient computation and backpropagation of the primary network. However, it partially modifies the size of the coverage area of the primary network's attention coefficient map. The primary network still takes accurate segmentation as its training objective, so it does not generate FP values as high as the auxiliary network's despite the increased attention area after compensation.

In summary, when the primary network uses the WBCE-Tversky loss function with hyperparameter configuration β = 0.8, and the auxiliary network uses the tolerance loss function with hyperparameter configuration β = 0.8, λ = 4, and δ = 0.7, our DPAC-UNet achieves the highest segmentation accuracy.

3.3. Visualization Examples. To show the principle of the DPAC-UNet, we give the attention coefficient heatmaps and segmentation results of using attention U-Net (the primary network individually) and the DPAC-UNet, as shown in Figure 8.
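Referring back to the compensation mechanism analyzed in Section 3.2, the additive, gradient-free combination of the two attention maps can be sketched as follows. This is a minimal illustration, not the paper's implementation: `alpha_primary` and `alpha_aux` stand for the two networks' attention coefficient maps, and clipping the sum back to [0, 1] is an assumption:

```python
import numpy as np

def compensate_attention(alpha_primary, alpha_aux):
    """Additively compensate the primary attention map with the auxiliary one.

    In a PyTorch model the auxiliary map would be detached
    (alpha_aux.detach()) before the addition, so the compensation does not
    participate in the primary network's gradient computation and
    backpropagation, matching the behavior described in Section 3.2.
    """
    return np.clip(alpha_primary + alpha_aux, 0.0, 1.0)
```

Because the auxiliary term is a constant with respect to the primary network's parameters, it only enlarges the attended region during training without changing the primary network's training objective.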
Table 4: The segmentation performance of the DPAC-UNet using different hyperparameter configurations. The auxiliary network is trained with the tolerance loss (β = 0.8); all metrics are in %.

Weights               FPR∗     DSC    F2     PRE    RE     FPR
1.  λ = 1, δ = 0.9    0.438    54.8   54.1   63.6   55.1   0.111
2.  λ = 2, δ = 0.9    0.508    53.0   52.4   61.5   53.3   0.101
3.  λ = 1, δ = 0.8    0.573    55.2   54.7   64.4   55.7   0.120
4.  λ = 1, δ = 0.7    0.607    54.1   54.0   62.2   55.1   0.117
5.  λ = 4, δ = 0.9    0.689    55.9   56.6   63.0   57.4   0.124
6.  λ = 3, δ = 0.9    0.743    55.3   56.0   61.1   58.1   0.173
7.  λ = 5, δ = 0.9    0.898    53.8   54.1   62.8   55.7   0.142
8.  λ = 2, δ = 0.8    1.091    54.9   55.4   61.2   56.9   0.140
9.  λ = 1, δ = 0.6    1.270    55.5   55.6   63.7   57.0   0.126
10. λ = 2, δ = 0.7    1.335    55.8   55.8   64.8   57.2   0.133
11. λ = 4, δ = 0.8    1.888    53.6   53.0   64.4   53.9   0.111
12. λ = 3, δ = 0.7    2.006    56.9   57.7   61.6   59.6   0.149
13. λ = 3, δ = 0.8    2.143    56.7   57.3   61.9   59.1   0.157
14. λ = 5, δ = 0.8    2.744    56.7   56.0   65.8   56.8   0.103
15. λ = 4, δ = 0.7    4.093    59.3   59.8   65.6   59.9   0.106
16. λ = 2, δ = 0.6    4.440    58.2   58.6   62.6   60.3   0.151
17. λ = 3, δ = 0.6    4.570    57.5   57.5   64.0   58.8   0.137
18. λ = 4, δ = 0.6    4.990    56.5   56.9   62.5   61.6   0.153
19. λ = 5, δ = 0.7    5.634    56.2   57.5   63.0   59.3   0.132
20. λ = 5, δ = 0.6    18.970   52.8   51.5   65.9   52.1   0.196
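For reference, the metrics reported in Table 4 follow their standard confusion-matrix definitions (DSC is the Dice similarity coefficient and F2 is the F-beta score with beta = 2); a minimal sketch:

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard overlap metrics from voxel-wise confusion-matrix counts.

    Returns the metrics as fractions in [0, 1]; Table 4 reports them in %.
    """
    dsc = 2 * tp / (2 * tp + fp + fn)      # Dice similarity coefficient
    pre = tp / (tp + fp)                   # precision
    re = tp / (tp + fn)                    # recall (sensitivity)
    f2 = 5 * pre * re / (4 * pre + re)     # F-beta with beta = 2
    fpr = fp / (fp + tn)                   # false-positive rate
    return {"DSC": dsc, "F2": f2, "PRE": pre, "RE": re, "FPR": fpr}
```

Because lesions occupy a tiny fraction of the brain volume, TN dominates the denominator of FPR, which is why the FPR values in Table 4 are so small even when precision and recall vary widely.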
Figure 8: Visualization examples of the attention coefficient maps of different methods: (a) single-path primary network individually (maps ①, ②, ③); (b) DPAC-UNet (maps ①, ④, ⑤).
The heatmaps in Figures 8(a) and 8(b) are slightly different in noise level because they come from two independently trained models, but the respective heatmap ② shows defects of the same pattern.

3.4. Comparison of Different Methods. Many lesion segmentation methods have been studied recently using the ATLAS dataset. Zhou et al. proposed a new architecture called dimension-fusion-UNet (D-UNet) [28], which combines 2D and 3D convolution in the encoding stage. Yang et al. proposed CLCI-Net, which uses cross-level fusion and a context inference network [29]. These existing segmentation results serve as a comparison for our experiments.

Using the same conditions as in the previous experiments, we conducted a comparison experiment with the following models and loss functions:

(1) the U-Net [9] model trained with the WBCE-Tversky loss (β = 0.8)
(2) the attention U-Net [10] trained with the WBCE-Tversky loss (β = 0.8)
(3) the DPAC-UNet model proposed in this paper, trained with the WBCE-Tversky loss and tolerance loss (β = 0.8, δ = 0.7, λ = 4)

Cases (2) and (3) use the primary network individually and combined with the auxiliary network, respectively.

The final comparison results, presented in Table 5, show that the DPAC-UNet achieved the highest DSC and F2 scores. Comparing the single-path attention U-Net with our DPAC-UNet, that is, going from using the primary network individually to introducing the auxiliary attention compensation mechanism, the DSC score improved by 6%. Comparing the classic U-Net with the attention U-Net, that is, going from no attention to the self-attention mechanism, the DSC score improved by only 2.1%. This comparison shows that our DPAC-UNet brings a very significant performance improvement over the single-path self-attention segmentation model. Compared with the methods in the existing literature, it is 5.7% higher than the D-UNet and 1.1% higher than the CLCI-Net. This suggests that our DPAC-UNet achieves better segmentation performance than the existing methods.
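All of the compared models in (1)–(3) are trained with Tversky-based losses. As a hedged sketch, a generic Tversky loss term with a false-negative weight written as β can be expressed as follows; this is an illustrative reading of the hyperparameter, not the authors' exact compound-loss code:

```python
import numpy as np

def tversky_loss(pred, target, beta=0.8, eps=1e-6):
    """Generic Tversky loss: 1 - TP / (TP + beta*FN + (1-beta)*FP).

    pred: predicted foreground probabilities in [0, 1]; target: binary mask.
    With beta = 0.5 this reduces to the Dice loss; beta > 0.5 penalizes
    false negatives more heavily, favoring recall on small lesions.
    """
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    tp = np.sum(pred * target)
    fn = np.sum((1 - pred) * target)
    fp = np.sum(pred * (1 - target))
    return 1 - (tp + eps) / (tp + beta * fn + (1 - beta) * fp + eps)
```

In the paper's WBCE-Tversky loss, a term of this kind is combined with a weighted binary cross-entropy term; the exact weighting is defined in Section 2.2 of the paper.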
As shown in Figure 9, we present a group of boxplots of the segmentation performance distribution over all 239 MRI scans to evaluate the different models. The 239 segmentation results are generated from the six nonrepeated test sets split by sixfold nested cross-validation. From the boxplots, we can state the following. First, comparing our DPAC-UNet model with the other two models, the overall segmentation accuracy increases significantly, and the minimum values and lower quartiles of the DSC and F2 boxplots also increase significantly. This shows that our method markedly improves the scans on which the other two methods perform poorly. Second, comparing the medians and upper quartiles of the boxplots, we can see that, for the scans already segmented well by the other two models, the DPAC-UNet gives a slight further improvement. For scans with distinct lesion characteristics that are easy to segment, the primary network can generate a correct attention coefficient map with high probability; in this case, using the auxiliary network to compensate the primary network does not reduce the segmentation accuracy and may even slightly improve it. Observing the boxplots of the FPR results, it is evident that the FPR values of the three models are consistently small. This shows that, although the auxiliary compensation attention coefficient map generated by the DPAC-UNet's auxiliary network has a high FPR, after it is compensated to the primary network, the segmentation result of the primary network maintains a small FPR.

3.5. Time Consumption. The parameter amounts and the training and testing computation times for each part of the DPAC-UNet are listed in Table 6 to show which parts of the network need more execution time. Since the primary and auxiliary networks are trained in parallel as a whole, the computation time of each part cannot be measured separately at the same time. Therefore, we compared the computational complexity and time consumption of the primary and auxiliary networks of the DPAC-UNet by training them independently.

The number of training parameters of our DPAC-UNet is double that of the single-path attention U-Net (the primary or auxiliary network alone). The training time of the DPAC-UNet (5.11 hours on average) is about 1.7 times that of each subnetwork (3.06 hours on average), and the testing time of the DPAC-UNet (17 s on average) is about 1.7 times that of each subnetwork (10 s on average). Although the DPAC-UNet significantly increases the total number of model parameters and the training time after the introduction of the auxiliary network compensation mechanism, the significant improvement in segmentation performance makes up for the increased model complexity.

3.6. DPAC Structure of Other Models. The DPAC structure proposed in this paper, which uses an auxiliary network to compensate the primary network, can be applied to most segmentation models with spatial self-attention. We implemented our method on two other segmentation models with a self-attention mechanism, RA-UNet [30] and AGResU-Net [31], and compared the experimental results of the single-path networks with those of the dual-path networks with auxiliary networks. The experimental results are shown in Table 7. Both single-path segmentation models effectively improve their segmentation performance after using the auxiliary network for attention compensation. This shows that our method can be applied to other segmentation networks with the self-attention mechanism. It should be noted that, in accordance with the hyperparameter selection steps in Section 2.2.3, when the dataset or segmentation model changes, the hyperparameters of the tolerance loss function need to be redetermined. As shown in Table 7, when the δ value of AGResU-Net is 0.6, the DPAC structure achieves the best segmentation performance.

4. Discussion and Conclusions

In this paper, we proposed the DPAC-UNet using the classic self-attention model, attention U-Net, as the basic segmentation model. To realize the functions of the DPAC-UNet's primary and auxiliary networks, we proposed the WBCE-Tversky and tolerance losses as the training loss functions, respectively. We explored the hyperparameter configuration of the loss functions by applying sixfold cross-validation on the 239 MRI scans of the ATLAS stroke segmentation dataset. We found that the WBCE-Tversky loss achieves the most accurate segmentation performance for the primary network when β = 0.8, and that the tolerance loss generates a tolerant auxiliary compensation attention coefficient map with a moderate coverage area to compensate for the primary network's defective attention coefficient map, achieving the best segmentation performance when β = 0.8, λ = 4, and δ = 0.7. The experiment results indicate that the DSC score of the proposed DPAC-UNet with the auxiliary network is 6% higher than that without the auxiliary network.
Figure 9: Boxplots of metric results for the different models (U-Net, attention U-Net, and DPAC-UNet); the vertical axes show the metric scores.
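The boxplot statistics discussed above (minimum, lower quartile, median, upper quartile) can be computed from the per-scan scores; a minimal sketch, where the input is a hypothetical list of 239 per-scan DSC values:

```python
import numpy as np

def boxplot_stats(scores):
    """Five-number summary underlying a standard boxplot."""
    scores = np.asarray(scores, float)
    q1, med, q3 = np.percentile(scores, [25, 50, 75])
    return {"min": scores.min(), "q1": q1, "median": med,
            "q3": q3, "max": scores.max()}
```

Comparing these summaries across models captures the two observations made above: DPAC-UNet raises the minimum and lower quartile strongly (hard cases) while shifting the median and upper quartile only slightly (easy cases).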
Compared with the methods in the existing literature, the DSC score of the proposed DPAC-UNet is 5.7% higher than that of the D-UNet and 1.1% higher than that of the CLCI-Net. These results indicate that the proposed method achieves improved segmentation performance and verify its effectiveness.

It should be noted that, although we used the same dataset as D-UNet and CLCI-Net, the version varied: we used the version without defacing, which contains 239 MR images, whereas D-UNet and CLCI-Net used the version with defacing, which contains 229 MR images. Furthermore, considering that the cross-validation dataset splitting methods do not generate the same training, validation, and testing sets, and that the loss functions used also differ, achieving the best segmentation performance does not directly prove that the proposed method is the best; it shows that we have reached a higher level of segmentation performance among current methods.

The purpose and focus of our work are to improve the performance of the single-path attention mechanism segmentation model by using our DPAC method. As shown in Table 6, although our method obviously requires more computing resources and a longer training time, the improvement in segmentation performance balances out the increased model complexity. The roughly five-hour training time is at a low-to-average level among the latest network models currently used for stroke lesion segmentation. Moreover, we will implement our DPAC network structure on other basic segmentation models with a self-attention mechanism to verify our method's versatility; we have already shown that applying the DPAC structure to other models based on the self-attention mechanism likewise effectively improves their segmentation performance. In future work, we plan to use other stroke segmentation datasets to compare the effectiveness of our method across various datasets.

Data Availability

The ATLAS dataset is publicly available at http://fcon_1000.projects.nitrc.org/indi/retro/atlas_download.html.