MRI Brain Tumor Segmentation and Uncertainty Estimation Using 3D-UNet Architectures
Laura Mora Ballestar and Veronica Vilaplana
arXiv:2012.15294v1 [eess.IV] 30 Dec 2020
1 Introduction
Brain tumors are categorized into primary tumors, which originate in the brain, and secondary tumors, which have spread from elsewhere and are known as brain metastases.
Among malignant primary tumors, gliomas are the most common in adults, representing 81% of brain tumors [7]. The World Health Organization (WHO) categorizes gliomas into grades I-IV, which can be simplified into two types: (1) "low grade gliomas" (LGG), grades I-II, which are less common and are characterized by low blood concentration and slow growth, and (2) "high grade gliomas" (HGG), grades III-IV, which grow faster and are more aggressive.
This work has been partially supported by the project MALEGRA TEC2016-75976-R financed by the Spanish Ministerio de Economía y Competitividad.
2 Related Work
2.1 Semantic Segmentation
Brain tumor segmentation methods include generative and discriminative ap-
proaches. Generative methods try to incorporate prior knowledge and model
2.2 Uncertainty
3 Method
The biggest complexity in brain tumor segmentation derives from the class imbalance. The tumor regions account for 5-15% of the brain tissue, and each tumor region is an even smaller portion. Fig. 1 provides a graphical representation of the distribution of each tumor class (ET, NCR, ED), excluding healthy tissue. It can be seen that ED is more probable than ET and NCR, and that there is high variability between subjects in the NCR label. Another complexity is the difference between glioma grades: LGG patients are characterized by low blood concentration, which translates into fewer ET voxels and a higher number of voxels in the NCR and NET regions.
Fig. 1: Distribution of each class (ED, ET, NCR). From left to right: (1) number of voxels in all cases, (2) number of voxels for the HGG cases, and (3) number of voxels for the LGG cases.
In this work, we have used two approaches depending on the patch size.
– Binary Distribution: small patches, of size 64³ or smaller, are randomly selected with a 50% probability of being centred on healthy tissue and a 50% probability of being centred on tumor [10].
– Random Tumor Distribution: bigger patches, of size 112³ or 128³, are selected randomly but always centred on the tumor region, as patches of this size will already contain enough healthy tissue and background information.
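The binary distribution strategy above can be sketched as follows. This is a minimal NumPy illustration; the function names and the clamping of the patch to the volume boundary are our own assumptions, not the paper's code.

```python
import numpy as np

def sample_patch_center(labels, patch_size, p_tumor=0.5, rng=None):
    """Binary distribution: with probability p_tumor pick a center on a
    tumor voxel, otherwise on a healthy (non-tumor) voxel."""
    if rng is None:
        rng = np.random.default_rng()
    mask = labels > 0 if rng.random() < p_tumor else labels == 0
    coords = np.argwhere(mask)
    center = coords[rng.integers(len(coords))]
    # clamp the center so the whole patch stays inside the volume
    half = patch_size // 2
    center = np.clip(center, half, np.array(labels.shape) - half)
    return tuple(center)

def extract_patch(volume, center, patch_size):
    """Cut a cubic patch of side patch_size centred on `center`."""
    half = patch_size // 2
    slices = tuple(slice(c - half, c + half) for c in center)
    return volume[slices]
```

The random tumor distribution reduces to the same sampler with `p_tumor=1.0`.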
3.4 Loss
The Dice score coefficient (DSC) is a measure of overlap widely used to assess
segmentation performance when ground truth is available. Proposed in Milletari
et al. [6] as a loss function for binary classification, it can be written as:
$$\mathcal{L}_{dice} = 1 - \frac{2 \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2 + \epsilon} \qquad (1)$$

The generalized Dice loss (GDL) [26] extends it to multiple classes by weighting each class:

$$\mathcal{L}_{GDL} = 1 - 2\, \frac{\sum_{l=1}^{L} w_l \sum_{i=1}^{N} p_{li}\, g_{li}}{\sum_{l=1}^{L} w_l \sum_{i=1}^{N} \left( p_{li} + g_{li} \right) + \epsilon} \qquad (2)$$
where $N$ is the number of voxels, $p_i$ and $g_i$ are the predicted and ground-truth values for voxel $i$, $L$ represents the number of classes, and $w_l$ is the weight given to each class. We use the GDL variant as it is better suited for unbalanced segmentation problems.
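Equation (2) can be sketched in NumPy as follows. This is a minimal illustration: we assume the squared-inverse class weights $w_l = 1/(\sum_i g_{li})^2$ proposed by Sudre et al. [26], since the text above does not state which $w_l$ the authors use.

```python
import numpy as np

def generalized_dice_loss(probs, onehot, eps=1e-6):
    """Generalized Dice loss, Eq. (2). probs and onehot have shape
    (L, N): L classes, N voxels. Class weights follow Sudre et al. [26]:
    w_l = 1 / (sum of ground-truth voxels of class l)^2."""
    w = 1.0 / (onehot.sum(axis=1) ** 2 + eps)          # per-class weights
    intersect = (w * (probs * onehot).sum(axis=1)).sum()
    union = (w * (probs + onehot).sum(axis=1)).sum()
    return 1.0 - 2.0 * intersect / (union + eps)
```

A perfect prediction drives the loss to (almost) zero, while a completely wrong one drives it to one.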
This work proposes three networks, variations of the V-Net [6] and 3D-UNet [27] architectures, for brain tumor segmentation, and creates an ensemble to mitigate the bias of each independent model.
The different models are trained using the ADAM optimizer with an initial learning rate of 1e-4, decreased by a factor of 5 whenever the validation loss has not improved in the past 30 epochs, and regularized with an l2 weight decay of 1e-5. They all use the GDL loss.
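The schedule described above can be sketched in plain Python (our own illustration of the policy; PyTorch's `ReduceLROnPlateau` with `factor=0.2`, `patience=30` implements roughly the same behaviour):

```python
class ReduceLROnPlateau5x:
    """Sketch of the paper's schedule: divide the learning rate by 5
    (multiply by factor=0.2) once validation loss has not improved for
    more than `patience` consecutive epochs."""
    def __init__(self, lr=1e-4, factor=0.2, patience=30):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor   # decrease LR by a factor of 5
                self.bad_epochs = 0
        return self.lr
```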
V-Net The V-Net implementation has been adapted to use four output channels (Non-Tumor, ED, NCR/NET, ET) and uses Instance Normalization [21] instead of Batch Normalization; it normalizes across each channel of each training example rather than across the whole batch. Also, as proposed in [15], we have increased the number of feature maps at the highest resolution to 32, instead of the 16 used in the original implementation. Figure 2 shows the network architecture with an input patch size of 64x64x64.
The network has been trained using a patch size of 96³ and the random tumor distribution strategy (see 3.3). The maximum batch size due to memory constraints is 2.
Fig. 4: 3D-UNet [27] architecture with ResNet blocks at each level, MaxPooling, transposed convolutions and ReLU non-linearity
This network is trained following two different strategies. The first one, 3D-UNet-residual, uses a patch size of 112³ and a batch size of 2 for the whole training, whereas 3D-UNet-residual-multiscale varies the sampling strategy so the network sees both local and global information. For that, the first half of the training uses a patch size of 128³ with a batch size of 1. Then, the patch size is reduced to 112³ and the batch size increased to 2.
3.6 Post-Processing
In order to correct the appearance of false positives in the form of small, separate connected components, this work uses a post-processing step that keeps the two biggest connected components if their size ratio is bigger than some threshold (obtained by analysing the training set). With this process, small connected components that may be false positives are removed, but sufficiently big components are kept, as some subjects may have several tumors.
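A minimal sketch of this post-processing step follows (pure NumPy; the BFS labelling and the example threshold of 0.1 are our own illustration, since the paper derives its threshold from the training set and does not report the value):

```python
import numpy as np
from collections import deque

def connected_components(mask):
    """Label 6-connected components of a binary 3D mask via BFS."""
    labels = np.zeros(mask.shape, dtype=int)
    current = 0
    for start in zip(*np.nonzero(mask)):
        if labels[start]:
            continue
        current += 1
        labels[start] = current
        queue = deque([start])
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in ((1,0,0), (-1,0,0), (0,1,0),
                               (0,-1,0), (0,0,1), (0,0,-1)):
                n = (x + dx, y + dy, z + dz)
                if all(0 <= n[i] < mask.shape[i] for i in range(3)) \
                        and mask[n] and not labels[n]:
                    labels[n] = current
                    queue.append(n)
    return labels, current

def keep_main_components(mask, ratio_threshold=0.1):
    """Keep the largest component; keep the second-largest too when its
    size is at least ratio_threshold of the largest (hypothetical value,
    the paper's threshold comes from the training set)."""
    labels, n = connected_components(mask)
    if n == 0:
        return mask
    sizes = np.bincount(labels.ravel())[1:]
    order = np.argsort(sizes)[::-1]
    keep = [order[0] + 1]
    if n > 1 and sizes[order[1]] / sizes[order[0]] >= ratio_threshold:
        keep.append(order[1] + 1)
    return np.isin(labels, keep)
```

In practice a library routine such as `scipy.ndimage.label` would replace the hand-written BFS.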
Moreover, one of the biggest difficulties of this challenge is to provide an accurate segmentation of the smallest sub-region, ET, which is particularly difficult to segment in LGG patients, as almost 40% of them have no enhancing tumor in the training set. In the evaluation step, BraTS awards a Dice score of 1 if a label is absent in both the ground truth and the prediction. Conversely, a single false positive voxel in a patient with no enhancing tumor in the ground truth results in a Dice score of 0. Therefore, some previous works [15, 16] propose to replace enhancing tumor voxels with necrosis if the total number of enhancing voxels is smaller than some threshold, found independently for each experiment. However, we were not able to find a threshold that improved the overall performance, as it helped for some subjects but worsened the results for others.
3.7 Uncertainty
This year's BraTS includes a third task to evaluate model uncertainty, rewarding methods whose predictions are (a) confident when correct and (b) uncertain when incorrect. In this work, we model the voxel-wise uncertainty of our method at test time, using test-time dropout (TTD) for epistemic uncertainty and test-time data augmentation (TTA) for aleatoric uncertainty.
We compute epistemic uncertainty as proposed by Gal et al. [23], who use dropout as a Bayesian approximation in order to simplify the task. The idea is therefore to use dropout both at training and testing time. The paper suggests repeating the prediction a few hundred times with random dropout. The final prediction is then the average of all estimations, and the uncertainty is modelled by computing the variance of the predictions. In this work, we perform B = 20 iterations and use dropout with a 50% probability of zeroing out a channel. The uncertainty map is estimated with the variance for each sub-region independently.
Let $Y^i = \{y^i_1, y^i_2, \dots, y^i_B\}$ be the vector that represents the $i$-th voxel's predicted labels. The voxel-wise uncertainty map, for each evaluation region, is obtained as the variance:

$$var = \frac{1}{B} \sum_{b=1}^{B} \left( y^i_b - y^i_{mean} \right)^2 \qquad (3)$$
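Equation (3) amounts to the following (a sketch over a stack of stochastic predictions; the rescaling of the variance to BraTS's 0-100 range is our own assumption, using the fact that a Bernoulli variance is at most 0.25):

```python
import numpy as np

def variance_uncertainty(preds):
    """preds: (B, ...) stack of binary predictions for one evaluation
    region, e.g. B = 20 test-time-dropout passes. Returns the mean
    prediction and the voxel-wise variance of Eq. (3), rescaled to the
    0-100 range that BraTS expects."""
    preds = np.asarray(preds, dtype=float)
    mean = preds.mean(axis=0)
    var = ((preds - mean) ** 2).mean(axis=0)   # Eq. (3)
    return mean, 100.0 * var / 0.25
```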
Uncertainty can also be estimated with the entropy, as shown in [19]. However, the entropy provides a global measure instead of a map for each sub-region. In this case, the voxel-wise uncertainty is calculated as:
$$H(Y^i \mid X) \approx -\sum_{m=1}^{M} \hat{p}^i_m \ln(\hat{p}^i_m) \qquad (4)$$
where $\hat{p}^i_m$ is the frequency of the $m$-th unique value in $Y^i$ and $X$ represents the input image.
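Equation (4) can be sketched as follows (a minimal illustration; `num_classes` and the `(B, N)` layout are our own conventions for the sample stack):

```python
import numpy as np

def entropy_uncertainty(label_samples, num_classes=4):
    """label_samples: (B, N) predicted labels for N voxels across B
    stochastic passes. Implements Eq. (4): the entropy of the empirical
    label frequencies, yielding one global voxel-wise uncertainty map."""
    ent = np.zeros(label_samples.shape[1])
    for m in range(num_classes):
        p = (label_samples == m).mean(axis=0)   # frequency of label m
        nz = p > 0                              # 0 * ln(0) is taken as 0
        ent[nz] -= p[nz] * np.log(p[nz])
    return ent
```

A voxel on which every pass agrees gets entropy 0; one split evenly between two labels gets ln 2, the maximum for two outcomes.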
To model aleatoric uncertainty we apply the same augmentation techniques as in the training step, plus random Gaussian noise, in order to add modifications not previously seen by the network. The final prediction and uncertainty maps are computed following the same strategies as for the epistemic uncertainty.
With all that said, we evaluate the model's behaviour with respect to input and model variability by defining the following experiments:
– Aleatoric Uncertainty: model aleatoric uncertainty with (1) TTA-variance,
providing three uncertainty maps (ET, TC, WT) and (2) TTA-entropy, with
one global map.
– Epistemic Uncertainty: model epistemic uncertainty with (1) TTD-variance,
providing three uncertainty maps (ET, TC, WT) and (2) TTD-entropy, with
one global map.
– Hybrid (Aleatoric + Epistemic) Uncertainty: model both aleatoric and epis-
temic uncertainty together with (1) TTD+TTA-variance, providing three
uncertainty maps (ET, TC, WT) and (2) TTD+TTA-entropy, with one
global map.
4 Results
The code¹ has been implemented in PyTorch [24] and trained on the GPI² servers, with 2 Intel(R) Xeon(R) @ 2.40GHz CPUs, 16GB of RAM and a 12GB NVIDIA GPU, using the BraTS 2020 training dataset. We report results on the training, validation and test datasets. All results, prediction and uncertainty maps, are uploaded to CBICA's Image Processing Portal (IPP) for evaluation of the Dice score, Hausdorff distance (95th percentile), sensitivity and specificity for each class. Specific uncertainty evaluation metrics are the ratio of filtered true negatives (FTN) and the ratio of filtered true positives (FTP).
4.1 Segmentation
The principal metrics used to evaluate segmentation performance are the Dice score, an overlap measure for pairwise comparison of a segmentation mask X and ground truth Y:
¹ Github repository: https://github.com/imatge-upc/mri-braintumor-segmentation
² The Image and Video Processing Group (GPI) is a research group of the Signal Theory and Communications Department, Universitat Politècnica de Catalunya.
$$DSC = 2 \cdot \frac{|X \cap Y|}{|X| + |Y|} \qquad (5)$$
and the Hausdorff distance, the maximum distance from a point in one set to its nearest point in the other set, defined as:

$$D_H(X, Y) = \max\left\{ \sup_{x \in X} \inf_{y \in Y} d(x, y),\ \sup_{y \in Y} \inf_{x \in X} d(x, y) \right\} \qquad (6)$$
where sup represents the supremum and inf the infimum. In order to have
more robust results and to avoid issues with noisy segmentation, the evaluation
scheme uses the 95th percentile.
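Both metrics can be sketched for small binary masks as follows (a brute-force illustration; the challenge's official scores come from CBICA's evaluation platform, not from code like this):

```python
import numpy as np

def dice(x, y):
    """Dice score, Eq. (5), between binary masks."""
    inter = np.logical_and(x, y).sum()
    return 2.0 * inter / (x.sum() + y.sum())

def hd95(x, y):
    """95th-percentile symmetric Hausdorff distance, Eq. (6) with the
    supremum replaced by the 95th percentile to suppress outliers.
    Brute-force over foreground voxels: fine for a sketch."""
    px, py = np.argwhere(x), np.argwhere(y)
    d = np.linalg.norm(px[:, None, :] - py[None, :, :], axis=-1)
    d_xy = d.min(axis=1)   # each x-voxel to its nearest y-voxel
    d_yx = d.min(axis=0)   # each y-voxel to its nearest x-voxel
    return max(np.percentile(d_xy, 95), np.percentile(d_yx, 95))
```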
Tables 1 and 2 show Dice and Hausdorff distance (95th percentile) scores for the training and validation sets respectively.
The model used with the test set is the Residual 3D-UNet-multiscale with
post-processing. Table 3 shows the results in the training, validation and test
sets for comparison.
Fig. 5: Training results on patients: 280, 010, 331 and 178 (top-bottom). Image
order: (1) Flair (2) GT (3) Residual 3D-UNet-multiscale (4) Residual 3D-UNet
(5) Basic 3D-UNet (6) V-Net (7) Ensemble mean
4.2 Uncertainty
BraTS requires uploading three uncertainty maps, one for each subregion (WT, TC, ET), together with the prediction map. Values must be normalized between 0 and 100, such that 0 represents the most certain prediction and 100 the most uncertain. The metrics used are the FTP ratio, defined as $FTP = (TP_{100} - TP_T)/TP_{100}$, where $T$ represents the threshold used to filter the most uncertain values. The ratio of filtered true negatives (FTN) is calculated in a similar manner. The integrated score is calculated as follows:
$$score = AUC_1 + (1 - AUC_2) + (1 - AUC_3) \qquad (7)$$
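The FTP ratio can be sketched as follows (a minimal illustration; the voxel-wise thresholding is our reading of the metric definition above):

```python
import numpy as np

def ftp_ratio(pred, gt, uncertainty, threshold):
    """FTP = (TP_100 - TP_T) / TP_100: the fraction of true positives
    lost when voxels with uncertainty above `threshold` are filtered
    out. pred and gt are binary masks; uncertainty is in [0, 100]."""
    tp_100 = np.logical_and(pred, gt).sum()       # TPs with no filtering
    keep = uncertainty <= threshold               # confident voxels only
    tp_t = np.logical_and(pred & keep, gt).sum()  # TPs after filtering
    return (tp_100 - tp_t) / tp_100
```

The FTN ratio follows the same pattern with true negatives, and plugging the areas under the Dice, FTP and FTN curves into Eq. (7) gives the integrated score.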
From this point forward, all experiments are performed on the Residual 3D-UNet-multiscale model, as it is the one with the most balanced results across the different regions. Table 4 shows the results for the epistemic, aleatoric and hybrid uncertainties when computed with entropy or variance. As a general overview, we can see that the AUC-Dice, which is computed by averaging the segmentation results over several thresholds that filter uncertain predictions, improves by 2 to 3 points with respect to the results obtained in the segmentation task (WT: 0.8172, TC: 0.7664, ET: 0.7071). Although the metrics are not the same, this indicates that the model is more certain on the TP and less certain on the FP and FN. Moreover, the AUC-Dice is higher when using entropy as the uncertainty measure.
Our results show that the model is more uncertain in LGG patients, particularly for epistemic uncertainty, meaning the model requires more data to achieve a more confident prediction. If we compare the behaviour of the uncertainty types, we see that (1) aleatoric uncertainty focuses on the region boundaries, with small variations, (2) epistemic uncertainty improves results on the ET region but filters more TP and TN, and (3) the hybrid approach achieves the best Dice-AUC results when using entropy as the uncertainty measure.
Table 4 (excerpt): entropy-based results. Columns: Dice-AUC (WT, TC, ET), FTP ratio (WT, TC, ET), FTN ratio (WT, TC, ET).
Entropy TTA: 0.83 0.78 0.71 | 0.06 0.05 0.04 | 1.1e-3 4.7e-3 6.3e-3
Entropy TTD: 0.82 0.78 0.74 | 0.15 0.13 0.07 | 2.1e-3 8.2e-3 1.22e-2
Entropy Hybrid: 0.83 0.79 0.77 | 0.15 0.12 0.07 | 3.0e-3 1.01e-3 1.39e-2
5 Conclusions
This work proposes a set of models based on two 3D-CNNs specialized in medical imaging, V-Net and 3D-UNet. As each of the trained models performs better in a particular tumor region, we define an ensemble of those models in order to increase performance. Moreover, we analyze the implications of uncertainty estimation on the predicted segmentation, both to understand the reliability of the provided segmentation and identify challenging cases, and as a means of improving the model's accuracy by filtering uncertain voxels that likely correspond to wrong predictions. We use the Residual 3D-UNet-multiscale as our model to participate in the BraTS'20 challenge.
The best results on the validation set are obtained with an ensemble of the proposed models, as it leverages the biases of each model, but they are still far from the current state of the art. These results may be caused by a training strategy whose sampling technique does not reflect the true label distribution, thus producing more false detections. This is most visible in the ET region, as all models predict more tumor voxels of this label, which is greatly penalized when the ground truth does not contain it. In order to improve results, future work should provide a better representation of the labels: not just increasing the patch size, but letting the network see both local and more global information.
Another potential problem is the model's simplicity. Although previous works achieve good results using a 3D-UNet, e.g. [15], adding more complexity to the network may help boost performance. Therefore, a possible line of work would be to extend the proposed models into a cascaded network, where each nested evaluation region (WT, TC and ET) is learnt as a binary problem. Also, LGG subjects usually achieve lower prediction accuracy. In order to improve the results, we could research other post-processing techniques designed specifically to target each glioma grade, as the grades may be differentiated by their sub-region distributions.
For uncertainty estimation, this work evaluates aleatoric, epistemic and hybrid approaches, using entropy as a global measure and variance to evaluate uncertainty on each evaluation region. In the provided results, we have seen that using uncertainty information actually helps improve the accuracy of the network, achieving the best Dice score (AUC, estimated by filtering uncertain voxels) when using the hybrid approach with entropy as the uncertainty measure. Our method achieves a score of 0.93, 0.93 and 0.91 for WT, TC and ET respectively on the test set.
References
1. B. H. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, J. Kirby, et
al.: ”The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)”,
IEEE Transactions on Medical Imaging 34(10), 1993-2024 (2015) https://doi.org/
10.1109/TMI.2014.2377694
2. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J.S. Kirby, et al.: ”Ad-
vancing The Cancer Genome Atlas glioma MRI collections with expert segmen-
tation labels and radiomic features”, Nature Scientific Data, 4:170117 (2017)
https://doi.org/10.1038/sdata.2017.117
3. S. Bakas, M. Reyes, A. Jakab, S. Bauer, M. Rempfler, A. Crimi, et al.: ”Identifying
the Best Machine Learning Algorithms for Brain Tumor Segmentation, Progres-
sion Assessment, and Overall Survival Prediction in the BRATS Challenge”, arXiv
preprint arXiv:1811.02629 (2018)
4. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al.,
”Segmentation Labels and Radiomic Features for the Pre-operative Scans
of the TCGA-GBM collection”, The Cancer Imaging Archive, 2017. DOI:
10.7937/K9/TCIA.2017.KLXWJJ1Q
5. S. Bakas, H. Akbari, A. Sotiras, M. Bilello, M. Rozycki, J. Kirby, et al.,
”Segmentation Labels and Radiomic Features for the Pre-operative Scans
of the TCGA-LGG collection”, The Cancer Imaging Archive, 2017. DOI:
10.7937/K9/TCIA.2017.GJQ7R0EF
6. Milletari, Fausto, Nassir Navab, and Seyed-Ahmad Ahmadi. ”V-net: Fully convo-
lutional neural networks for volumetric medical image segmentation.” 3D Vision
(3DV), 2016 Fourth International Conference on. IEEE, 2016.
7. Morgan, L Lloyd: The epidemiology of glioma in adults: A ”state of the science”
review. Neuro-oncology vol.17 01-2015 https://doi.org/10.1093/neuonc/nou358
8. Armen Der Kiureghian and Ove Ditlevsen: Aleatory or epistemic? does it matter?
Structural safety, 31(2):105–112, 2009.
9. Yarin Gal and Zoubin Ghahramani: Dropout as a bayesian approximation: Repre-
senting model uncertainty in deep learning. arXiv preprint arXiv:1506.02142, 2015
10. Konstantinos Kamnitsas, Christian Ledig, Virginia F.J. Newcombe, Joanna P.
Simpson, Andrew D. Kane, David K. Menon, Daniel Rueckert, Ben Glocker:
Efficient multi-scale 3D CNN with fully connected CRF for accurate brain
MRI Brain Tumor Segmentation and Uncertainty Estimation using 3D-UNet 15
lesion segmentation, Medical Image Analysis, Volume 36, 2017, pages 61-78,
https://doi.org/10.1016/j.media.2016.10.004
11. Casamitjana, A., Puch, S., Aduriz, A., Vilaplana, V., ”3D Convolutional Neural
Networks for Brain Tumor Segmentation: a comparison of multi-resolution archi-
tectures”. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain
Injuries. BrainLes 2016. Lecture Notes in Computer Science, vol 10154. Springer,
2017.
12. Kamnitsas, K., Bai, W., Ferrante, E., McDonagh, S., Sinclair, M., Pawlowski, N:
Ensembles of Multiple Models and Architectures for Robust Brain Tumour Seg-
mentation in International MICCAI Brainlesion Workshop (Quebec, QC), 450–462
arXiv preprint arXiv:1711.01468, 2017
13. Casamitjana, A., Catà, M., Sánchez, I., Combalia, M., Vilaplana, V., ”Cascaded
V-Net Using ROI Masks for Brain Tumor Segmentation”. In: Brainlesion: Glioma,
Multiple Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2017. Lecture
Notes in Computer Science, vol 10670. Springer, 2018.
14. Andriy Myronenko: 3D MRI brain tumor segmentation using autoencoder regularization. arXiv preprint arXiv:1810.11654, 2018
15. Isensee, F., et al.: No new-net. International MICCAI Brainlesion Workshop, pp.
234–244. Springer (2018)
16. Jiang Z., Ding C., Liu M., Tao D. (2020) Two-Stage Cascaded U-Net: 1st Place
Solution to BraTS Challenge 2019 Segmentation Task. In: Crimi A., Bakas S. (eds)
Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Brain-
Les 2019. Lecture Notes in Computer Science, vol 11992. Springer, Cham
17. Zhao YX., Zhang YM., Liu CL. (2020) Bag of Tricks for 3D MRI Brain Tumor
Segmentation. In: Crimi A., Bakas S. (eds) Brainlesion: Glioma, Multiple Sclerosis,
Stroke and Traumatic Brain Injuries. BrainLes 2019. Lecture Notes in Computer
Science, vol 11992. Springer, Cham
18. Natekar Parth, Kori Avinash, Krishnamurthi Ganapathy: Demystifying Brain Tumor Segmentation Networks: Interpretability and Uncertainty Analysis. Frontiers in Computational Neuroscience vol.14, page 6, https://doi.org/10.3389/fncom.2020.00006, 2020
19. Wang G., Li W., Ourselin S. and Vercauteren T. Automatic Brain Tumor
Segmentation Based on Cascaded Convolutional Neural Networks With Un-
certainty Estimation. Frontiers in Computational Neuroscience vol.13 pages 56
https://doi.org/10.3389/fncom.2019.00056, 2019
20. McKinley R., Meier R., Wiest R. (2019) Ensembles of Densely-Connected CNNs
with Label-Uncertainty for Brain Tumor Segmentation. In: Crimi A., Bakas S.,
Kuijf H., Keyvan F., Reyes M., van Walsum T. (eds) Brainlesion: Glioma, Multiple
Sclerosis, Stroke and Traumatic Brain Injuries. BrainLes 2018. Lecture Notes in
Computer Science, vol 11384. Springer
21. Dmitry Ulyanov and Andrea Vedaldi and Victor Lempitsky. Instance Normaliza-
tion: The Missing Ingredient for Fast Stylization. arXiv preprint arXiv:1607.08022,
2016
22. Kamnitsas, K., Ledig, C., Newcombe, V.F., Simpson, J.P., Kane, A.D., Menon,
D.K., Rueckert, D., Glocker, B.: Efficient multi-scale 3D CNN with fully connected
CRF for accurate brain lesion segmentation. Med. Image Anal. 36 (2017) 61–78
23. Yarin Gal and Zoubin Ghahramani: Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning. arXiv preprint arXiv:1506.02142, 2015
24. Paszke, Adam and Gross, Sam and Chintala, Soumith and Chanan, Gregory and
Yang, Edward and DeVito, Zachary and Lin, Zeming and Desmaison, Alban and
Antiga, Luca and Lerer, Adam. Automatic differentiation in PyTorch, NIPS-W 2017
25. Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." MICCAI. Springer, 2015. https://doi.org/10.1007/978-3-319-24574-4_28
26. Sudre, C.H., Li, W., Vercauteren, T., Ourselin, S., Jorge Cardoso, M.: Generalised Dice Overlap as a Deep Learning Loss Function for Highly Unbalanced Segmentations. Lecture Notes in Computer Science, 240-248, Springer International Publishing, 2017. https://doi.org/10.1007/978-3-319-67558-9_28
27. Özgün Çiçek and Ahmed Abdulkadir and Soeren S. Lienkamp and Thomas Brox
and Olaf Ronneberger, ”3D U-Net: Learning Dense Volumetric Segmentation from
Sparse Annotation”, arXiv preprint arXiv:1606.06650, 2016.
28. W. R. Crum and O. Camara and D. L. G. Hill, ”Generalized Overlap Measures
for Evaluation and Validation in Medical Image Analysis”, IEEE Transactions on
Medical Imaging vol. 25, no. 11, pp. 1451–1461, 2006