
A novel tridimensional multimodal brain segmentation approach

GBM6700E - 3D Reconstruction from Medical Images

Zerhouni, Maude

2409344
Sunday, December 15, 2024

Contents

Contents 2

List of Figures 2

1 Introduction 3
1.1 Medical context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Research context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Objectives and methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Basics of deep learning principles and implementation 5


2.1 Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.1 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.1.2 U-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.1.3 V-Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.4 Transformers and Attention mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Tutorial Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 The 3DUV-NetR+ architecture 9


3.1 Implementation specifics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.2 Loss function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.1.3 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2 Obtained results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.2.1 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.2.2 Comparison with other SOTA methods . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Limitations of the article . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Re-implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Literature review 14

5 Conclusion 14

References 15

List of Figures
1 Deep learning architectures for medical image segmentation. . . . . . . . . . . . . . . . . . . . 4
2 Convolutional Neural Network Schematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3 U-NET architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4 V-Net architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
5 Illustration of transformers and attention mechanisms’ effect . . . . . . . . . . . . . . . . . . 8
6 Proposed architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
7 Ablation study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
8 Comparison with SOTA methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

1 Introduction
1.1 Medical context
Brain tumors are a complex and diverse group of highly lethal diseases. In 2024, more than 3,500 patients
in Canada were diagnosed with brain cancer, and alarmingly, 79% of them passed away within the same
year. This high mortality rate is often attributed to late diagnoses and improper treatment plans. Indeed, the large variety of brain tumors, encompassing more than 120 types, makes distinguishing between tumor types particularly challenging, complicating diagnosis and treatment decisions.

The diagnostic process typically begins with a preliminary clinical examination. In some cases, this is
followed by a biopsy, a procedure where a small sample of brain tissue is extracted and analyzed to confirm
the presence and type of a tumor. However, imaging techniques play a crucial role in brain tumor diagnosis.
These techniques include several modalities such as Magnetic Resonance Imaging (MRI), Magnetic Resonance
Angiography (MRA), Magnetic Resonance Spectroscopy (MRS), and Computed Tomography (CT). Among
these, MRI is particularly versatile, offering dozens of specialized modalities tailored to highlight different
features of brain anatomy and pathology. For example, diffusion MRI focuses on water molecule movement,
while T1- and T2-weighted MRI provide structural details.

Despite their potential, these imaging techniques require manual interpretation by experts, which is
time-consuming and demands high levels of expertise, particularly in handling and interpreting the various
modalities. This reliance on manual analysis not only delays diagnosis but also risks missing crucial details.
Automating this process could significantly improve diagnostic efficiency and accuracy. Additionally, relying
on a single imaging modality limits the understanding of the tumor, as each modality offers unique insights
into different aspects of brain structure and function. For instance, MRA emphasizes vascularization, while
MRI provides superior spatial resolution. Combining these two modalities has the potential to offer a more
comprehensive and functional understanding of the tumor, enabling more precise and personalized treatment, hence improving patient outcomes. For this reason, multimodality has gained interest in the field
of medical imaging, particularly for complex conditions such as brain tumors.

1.2 Research context


Medical Image Segmentation (MIS) is a computer vision task that involves dividing a medical image into
multiple segments, where each segment represents a different object or structure of interest in the image. The
goal is to provide a precise and accurate representation of the objects of interest within the image, typically
for the purpose of diagnosis, treatment planning, and quantitative analysis. Machine learning, particularly
deep learning, excels in recognizing patterns in large datasets and can adapt to the intricate and diverse
nature of medical images. Convolutional Neural Networks (CNNs) have emerged as the cornerstone for such
tasks, enabling robust and automated segmentation with high accuracy.

CNNs are specialized machine learning models designed for visual data processing, primarily used for
classification tasks, where the model assigns a single label to an entire image. Although CNNs had been known for decades, their potential was fully demonstrated in 2012 when Krizhevsky et al. trained one of the
largest CNNs of the time on ImageNet, a dataset with over a million labeled images. This network achieved
groundbreaking results in image classification, ushering in a new era for deep learning. Their work leveraged
improved GPU capabilities and techniques to address challenges like overfitting, making CNNs practical
for large-scale tasks. However, in medical imaging, localization is critical—each pixel in an image must
be assigned a label to identify structures like tumors, vessels, or lesions. This requirement, referred to as
semantic segmentation, presented unique challenges, such as the computational cost and the much smaller volume of available biomedical datasets compared to ImageNet.

To address this, the U-Net architecture, introduced in 2015 by Ronneberger et al., became a milestone in
medical image segmentation. U-Net builds upon the concept of fully convolutional networks by incorporating
a symmetric structure: a contracting path (to capture context) and an expanding path (to achieve precise
localization). High-resolution features from the contracting path are combined with upsampled outputs
to produce detailed segmentation maps, even with limited training data. In 2016, the V-Net architecture
introduced by Milletari et al. extended this idea to 3D medical imaging, enabling volumetric segmentation.
This was particularly important for tasks such as analyzing CT or MRI scans in three dimensions, where
spatial relationships across slices are crucial. U-Net and V-Net set the foundation for modern segmentation
tasks, inspiring countless adaptations and improvements. Today, these architectures remain at the core of
biomedical image processing, enabling efficient and accurate segmentation with limited data. By automating
segmentation, they reduce the time and expertise required for manual analysis, paving the way for more
personalized and precise medical treatments. A non-exhaustive list of architectures designed for biomedical
applications is presented in Figure 1.

Figure 1: Deep learning architectures for medical image segmentation.

Deep learning architectures in segmentation incorporate multimodality by combining data from various
sources to enhance analysis. Separate encoders extract specific features from each modality, which are then
merged at different levels within the architecture, either early in the process (early fusion) or after individual
extraction (late fusion). This approach captures complementary details, improves precision, and enhances
the robustness of segmentation by leveraging the strengths of each modality. In the medical field, this method
allows for more comprehensive and reliable analyses.

1.3 Objectives and methodology


Overall, brain tumors are a leading cause of cancer-related deaths, highlighting the need for accurate seg-
mentation methods to aid diagnosis and treatment planning. In this context, combining imaging techniques
such as T1-weighted MRI, T2-weighted MRI, and FLAIR MRI holds great promise for delivering rich and
integrated information. As a result, 3D multi-modal image reconstruction is emerging as a technique that
offers a 3D visualization of the tumor, allowing us to enhance the diagnostic quality and facilitate more
precise assessments of tumor type and progression. However, automatic segmentation of brain tumors from
volumetric medical images poses significant challenges due to variations in tumor shapes, sizes, and intensi-
ties, as well as the non-standard characteristics of multimodal MRI images. Although convolutional neural
networks (CNNs) have shown promise in addressing these challenges, there remains a critical need for effective
integration of multimodal data to enhance segmentation accuracy.

The goal of this project is to evaluate 3DUV-NetR+, a deep learning framework for multimodal brain image analysis proposed by Aboussaleh et al. (February 2024) combining the U-Net, V-Net, and Transformer architectures. To achieve this, the project was structured around three secondary objectives:

1. Developing a solid understanding of deep learning concepts.


2. Analyzing the methodology and results of the article.
3. Contextualizing the article within the broader landscape of 3D multimodal brain segmentation archi-
tectures.

To build foundational knowledge in deep learning, open-access courses on Coursera were utilized, notably
IBM’s Introduction to Deep Learning and Neural Networks with Keras and DeepLearning.AI’s AI for Med-
ical Diagnosis. These courses facilitated both the understanding of underlying concepts and the high-level implementation of code. The latter was more effectively achieved using MONAI tutorials.

With these concepts mastered, the article was meticulously analyzed to assess the promise of its approach,
independently of its results. The implementation was then compared to MONAI tutorials, allowing for a
preliminary evaluation of the method’s relevance and depth.

Finally, a literature review was conducted to position the article within the context of recent advancements
in multimodal brain segmentation. This included comparing methods, results, and complexity across studies
from the past five years.

2 Basics of deep learning principles and implementation


2.1 Concepts
2.1.1 Convolutional Neural Networks
Convolutional Neural Networks (CNNs) refer to a long-established deep learning architecture that works
by transforming an input image into a feature map through a series of convolutional, pooling, and activation
layers. These layers extract and refine features progressively to produce a predicted output.

At a high level, a typical CNN architecture comprises three components, as described in Figure 2: the
input layer, the hidden layers, and the output layer. The input layer receives the input image and passes it
to the hidden layers that extract relevant features or patterns. The output layer provides the predicted class
label probability scores for each potential class.

The hidden layers are critical for the CNN’s performance, and the number of hidden layers and the
number of filters in each layer can be adjusted to optimize it. In Figure 2, we can see four convolutional
layers. At each layer, one or several filters are applied, each one producing a feature map. These filters aim to recognize a certain pattern such as an edge or a corner. Mathematically, this operation corresponds to a convolution, hence the name. Then, pooling is applied to reduce the spatial dimensions of the feature maps produced by the convolutional layer, thereby reducing computational complexity. An example is provided in Figure 2. Finally, a non-linear activation function is applied in order to introduce
non-linearity into the model, allowing the network to model complex patterns. The subsequent feature maps
are passed through successive layers for more refined feature extraction.

Figure 2: Convolutional Neural Network Schematics

Finally, the output from the hidden layers is flattened and passed through traditional fully connected
layers. These layers combine extracted features to make predictions. For instance, in Figure 2, the fully
connected layer outputs probabilities, such as a 0.7 probability that the image represents a zebra. This layer
connects every neuron in one layer to every neuron in the next, synthesizing the learned features into the
final classification result.

Each filter is typically a 3x3 or 5x5 matrix with initially randomly distributed coefficients called weights.
In order to adjust these weights, the model needs to be trained. During training, these weights are adjusted
to minimize a loss function through its gradient. The loss function quantifies the difference between the
predicted and actual outputs, guiding the optimization process. The gradient of the loss function with
respect to each weight is computed, which indicates the direction and magnitude of the change needed for
each weight. If the gradient is negative, the weight is decreased, and conversely, if the gradient is positive,
the weight is increased. This adjustment is done iteratively to reduce the loss and improve the model’s
performance. To achieve optimization, the algorithm uses techniques such as gradient descent, where the
aim is to find the minimum of the loss function. The weights are updated in the direction that reduces the
loss, and this process is repeated until the model converges to an optimal set of weights.
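
As an illustration of these ideas, the following minimal sketch builds and compiles a small CNN classifier with TensorFlow/Keras, the library used later in this project; the input size, layer widths, and number of classes are arbitrary assumptions chosen for demonstration, not values taken from any specific model in this report.

import tensorflow as tf

# Minimal CNN sketch: convolution + pooling blocks followed by fully
# connected layers. Input shape and class count are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128, 128, 3)),
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),   # filters detect local patterns
    tf.keras.layers.MaxPooling2D(pool_size=2),                      # pooling reduces spatial size
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),                                       # flatten before dense layers
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),                 # class probability scores
])

# The loss quantifies the gap between predictions and labels; its gradient
# drives the iterative weight updates described above.
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_images, train_labels, epochs=10)  # training loop (data not shown)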

2.1.2 U-Net
U-Net is a convolutional neural network (CNN) architecture developed specifically for biomedical 2D
image segmentation, as introduced in 1.2. Since its introduction, U-Net has become one of the most popular and widely used architectures in the field due to its ability to learn from relatively few training samples and produce precise segmentations, addressing both the scarcity of medical data and the limitation of single-label classification, whose output is often difficult to interpret. The U-Net architecture is characterized by its distinctive
U-shaped structure as shown in Figure 3, composed of a contracting path known as the encoder and an
expansive path called the decoder, connected by skip connections.

The encoder follows the typical architecture of a convolutional network. It consists of the repeated
application of two 3x3 convolutions followed by a rectified linear unit (ReLU) and a 2x2 max pooling operation
for downsampling, as described in the previous section. The decoder aims at recreating an image with labelled pixels. Every
step of it consists of an upsampling of the feature map followed by a 2x2 convolution (“up-convolution”) that
halves the number of feature channels, a concatenation with the correspondingly cropped feature map from
the contracting path, and two 3x3 convolutions, each followed by a ReLU. The decoder aims to reconstruct the
segmented image from the encoded features. It performs upsampling, which is achieved through transposed convolutions, or deconvolutions.

Figure 3: U-NET architecture

Finally, skip connections bridge the corresponding layers of the encoder
and decoder. They help to combine the low-level spatial features with the high-level abstract features, which
enhances the network’s ability to produce accurate segmentation masks. By skipping some layers and directly
connecting the output of one layer to a layer in the decoder, the network can leverage both the contextual
and detailed information for better segmentation.
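
For concreteness, the sketch below shows how one U-Net level (an encoder block, a decoder block, and the skip connection between them) could be written with tf.keras; the input size and channel counts are illustrative assumptions rather than the values of the original U-Net paper.

import tensorflow as tf
from tensorflow.keras import layers

def encoder_block(x, filters):
    # Two 3x3 convolutions with ReLU, then 2x2 max pooling for downsampling.
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    f = layers.Conv2D(filters, 3, padding="same", activation="relu")(f)
    p = layers.MaxPooling2D(2)(f)
    return f, p  # f feeds the skip connection, p goes deeper into the network

def decoder_block(x, skip, filters):
    # Up-convolution, concatenation with the corresponding encoder features
    # (skip connection), then two 3x3 convolutions with ReLU.
    x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
    x = layers.Concatenate()([x, skip])
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return x

inputs = layers.Input(shape=(128, 128, 1))
s1, p1 = encoder_block(inputs, 32)                       # contracting path (one level shown)
bottleneck = layers.Conv2D(64, 3, padding="same", activation="relu")(p1)
d1 = decoder_block(bottleneck, s1, 32)                   # expanding path with skip connection
outputs = layers.Conv2D(1, 1, activation="sigmoid")(d1)  # per-pixel prediction
unet_level = tf.keras.Model(inputs, outputs)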

As with other CNNs, the model is trained to optimize the weights and minimize the loss function. By replacing the 2D operations with 3D operations in the U-Net architecture, the 3DU-Net architecture is obtained, allowing high-resolution 3D MIS. However, doing so greatly increases the computational cost. Also, the method is highly dependent on the loss function used, which on the one hand gives flexibility to the model but on the other hand makes it less reliable.

2.1.3 V-Net
V-Net is an extension of the U-Net architecture, specifically designed for volumetric MIS. It adapts the
U-shaped structure to handle 3D data more effectively by incorporating 3D convolutional layers, 3D pooling
operations, and convolutional residual units to capture features across multiple spatial scales. Overall, it
is composed of a compression path and a decompression path connected by residual functions as shown in
Figure 4.

The compression path, or encoder, follows the architecture of a typical convolutional network. It consists
of multiple stages where convolutions are performed with volumetric kernels (5×5×5 voxels) and stride 2
for downsampling. This path progressively reduces the spatial resolution of the input while increasing the
depth of the feature maps, which allows the network to capture high-level features. Each stage uses residual
learning, where the input is added back to the output of the convolutional layers within the same stage.
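
As a hedged sketch of one such compression stage (two volumetric 5×5×5 convolutions with a residual addition, followed by a stride-2 convolution for downsampling), the tf.keras code below illustrates the idea; the channel counts and input size are assumptions made for illustration.

import tensorflow as tf
from tensorflow.keras import layers

def vnet_compression_stage(x, filters):
    # Two 5x5x5 volumetric convolutions whose output is added back to the
    # stage input (residual learning), then a stride-2 convolution that
    # halves the spatial resolution instead of max pooling.
    h = layers.Conv3D(filters, 5, padding="same", activation="relu")(x)
    h = layers.Conv3D(filters, 5, padding="same", activation="relu")(h)
    res = layers.Add()([x, h])                                    # residual connection
    down = layers.Conv3D(filters * 2, 2, strides=2, padding="same",
                         activation="relu")(res)                 # downsampling
    return res, down

volume = layers.Input(shape=(64, 64, 64, 16))
stage_features, downsampled = vnet_compression_stage(volume, 16)
stage = tf.keras.Model(volume, [stage_features, downsampled])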

The decompression path, or decoder, aims to recreate the original resolution of the input image. This is
achieved through upsampling using transposed convolutions. Each stage of the decompression path reduces
the number of feature channels while increasing the spatial resolution.

Figure 4: V-Net architecture

The use of residual functions and skip connections, which transfer detailed features from the encoder stages to the corresponding decoder stages, helps in preserving spatial information and improving segmentation accuracy.

Like in other CNN architectures, the model is trained to optimize its weights and minimize the loss
function, ensuring that the segmentation masks produced are as accurate as possible. Unlike its predecessors,
V-Net does not concatenate the entire encoder counterpart with skip connection but uses residual connections
instead, enhancing its efficiency and gradient management. Despite its smaller resolution and rigidity, V-Net
optimizes the segmentation process, making it more suitable for 3D data and operations compared to the 3D
U-Net.

2.1.4 Transformers and Attention mechanisms


Initially used for language tasks such as translation (and popularized by models like ChatGPT), transformers are deep learning architectures that leverage the attention mechanism, allowing the model to understand the entire context before generating the next part, providing a form of "long-term memory." This idea is represented in Figure 5.

Figure 5: Illustration of transformers and attention mechanisms’ effect

The transformer architecture is an attention-based encoder-decoder model that maps input sequences
into continuous representations with the encoder, and the decoder generates outputs step-by-step, guided
by previously generated outputs. Their utilization in MIS is recent and seems promising in improving it
by capturing contextual information over long distances, enhancing global understanding, and optimizing
gradient flow for precise reconstruction. This is especially useful for complex structures like tumors.

The encoder’s role is to meticulously extract features from the input sequence. This is achieved through a
series of layers, each comprising a multi-head attention mechanism followed by a feed-forward neural network.
These layers are further enhanced with normalization and residual connections to ensure stability during
training. The decoder, on the other hand, is tasked with generating the output sequence. It mirrors the
encoder’s structure but includes an additional layer of cross-attention that allows it to focus on relevant parts
of the input sequence as it produces the output. It is important to note that the Transformer architecture
is not set in stone; it can manifest as Encoder-only, Decoder-only, or the classic Encoder-Decoder model.
Each architectural variation is tailored to specific learning objectives and tasks.
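
To make this structure concrete, the following hedged sketch implements a single pre-norm transformer block (Layer Normalization, Multi-Head Self-Attention, and an MLP, each wrapped in a residual connection) with tf.keras; the embedding dimension, number of heads, and MLP width are illustrative assumptions.

import tensorflow as tf
from tensorflow.keras import layers

def transformer_block(x, embed_dim=256, num_heads=8, mlp_dim=512, dropout=0.1):
    # Self-attention sub-layer: layer normalization, multi-head attention,
    # and a residual connection back to the block input.
    h = layers.LayerNormalization()(x)                                      # LN
    h = layers.MultiHeadAttention(num_heads=num_heads,
                                  key_dim=embed_dim // num_heads)(h, h)     # MSA
    x = layers.Add()([x, h])                                                # residual

    # Feed-forward sub-layer: layer normalization, MLP, residual connection.
    h = layers.LayerNormalization()(x)                                      # LN
    h = layers.Dense(mlp_dim, activation="gelu")(h)                         # MLP
    h = layers.Dropout(dropout)(h)
    h = layers.Dense(embed_dim)(h)
    return layers.Add()([x, h])                                             # residual

# Example: a sequence of 64 tokens with a 256-dimensional embedding.
tokens = layers.Input(shape=(64, 256))
block_output = transformer_block(tokens)
model = tf.keras.Model(tokens, block_output)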

2.2 Tutorial Implementation


The second part of understanding the basics of deep learning involved coding. The open-access courses
provided by IBM and DeepLearning.AI included practical exercises that helped in grasping the coding of
basic functions, which significantly contributed to the overall comprehension of the concepts. However, the
MONAI tutorials proved more relevant for understanding the output of a training algorithm, as they are
specifically tailored to medical imaging tasks. This project focused on implementing these tutorials, which
provided a hands-on approach to learning how deep learning can be applied to segmentation tasks.

Three tutorial models were followed, with the initial plan being to apply them using the BraTS2020 dataset
in order to compare the results. Unfortunately, this part had to be aborted due to the lack of resources at
the author’s disposal, including the absence of a graphical interface, which hindered both the training and
its validation. Despite this setback, this phase still offered great insights into the various packages that can
be used to process and read medical image files, such as nibabel, and deep learning libraries like MONAI.
MONAI, a specialized library for medical imaging, provided a comprehensive set of tools, including pre-built
networks like U-Net and V-Net, which are essential for segmenting 3D medical images.
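
As a hedged illustration of this tooling, the snippet below loads a NIfTI volume with nibabel and instantiates a pre-built 3D U-Net from MONAI; the file name is a placeholder, and the channel and stride configuration are assumptions chosen for demonstration rather than settings taken from the tutorials.

import nibabel as nib
from monai.networks.nets import UNet

# Load a 3D MRI volume stored in NIfTI format (placeholder file name).
volume = nib.load("BraTS20_Training_001_flair.nii.gz").get_fdata()
print(volume.shape)  # e.g. (240, 240, 155) for BraTS scans

# Pre-built 3D U-Net from MONAI; channels and strides are illustrative.
model = UNet(
    spatial_dims=3,                     # volumetric segmentation
    in_channels=3,                      # e.g. three MRI modalities stacked as channels
    out_channels=4,                     # background plus three tumor sub-regions
    channels=(16, 32, 64, 128, 256),
    strides=(2, 2, 2, 2),
    num_res_units=2,
)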

3 The 3DUV-NetR+ architecture


To gain a more concrete understanding of how a deep learning architecture for 3D modality brain tumor
segmentation operates, a detailed review of a research article was undertaken. This article evaluates the
performance of a model derived from the classical architectures briefly mentioned in 1.2. The selection criteria
for the article included its recent publication date, its relevance, and the model’s depth and applicability in
light of the concepts introduced in 2.1. The chosen article, published by Aboussaleh et al. in 2024, is titled
”3DUV-NetR+: A 3D Hybrid Semantic Architecture Using Transformers for Brain Tumor Segmentation
with Multi-Modal MR Images.”

3.1 Implementation specifics


3.1.1 Architecture
As its title suggests, the study explores the combination of the 3D U-Net, V-Net, and Transformer architectures, hypothesizing that doing so would combine their benefits and mitigate their limitations. The architecture is
provided in Figure 6.

(a) 3DUV-NetR+ Architecture (b) Transformer block

Figure 6: Proposed architecture

By integrating three 3D MRI modalities, this study aims to explore the potential for improving brain tumor diagnosis and paving the way for more accurate and targeted treatments. Indeed,
each modality provides specific information; here, T1-weighted MRI, T2-weighted MRI, and FLAIR (Fluid-Attenuated Inversion Recovery) MRI are used. T1-weighted MRI highlights anatomical structures and is particularly effective at delineating the boundaries between gray and white matter; T2-weighted MRI is highly sensitive to fluid content, making it useful for detecting edema and other fluid accumulations. Finally, FLAIR MRI
suppresses signals from free fluids, such as cerebrospinal fluid, to better highlight lesions adjacent to fluid
spaces.

One can recognize the (3D) U-Net and V-Net architectures in the top left and top right of the Figure,
respectively, where the outputs of the encoders are utilized and combined. The images are initially input into
the encoders of both architectures in parallel, and the results of each output are preserved, indicated by the
pink vertical lines. These features are then upsampled through convolution operations, dropout (dropout is
a regularization technique that prevents overfitting by randomly setting a fraction of input units to zero at
each update during training), and transformers. It is worth noting that in this figure, the decoders of
the U-Net and V-Net architectures are only used for comparison with the classical architectures presented in
2.1.2 and 2.1.3.
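
Since the article leaves the fusion step partly implicit, the sketch below shows one plausible interpretation: the two encoder outputs are concatenated along the channel axis, regularized with dropout, and upsampled with a transposed 3D convolution. All shapes and layer parameters are assumptions made for illustration, not the authors' exact implementation.

import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical bottleneck feature maps from the two parallel encoders
# (depth, height, width, channels); shapes are illustrative only.
unet_features = layers.Input(shape=(16, 16, 16, 128))
vnet_features = layers.Input(shape=(16, 16, 16, 128))

# One plausible fusion step: concatenate both encoder outputs, apply dropout
# for regularization, then upsample with a transposed 3D convolution.
fused = layers.Concatenate()([unet_features, vnet_features])
fused = layers.Dropout(0.2)(fused)
up = layers.Conv3DTranspose(64, kernel_size=2, strides=2,
                            padding="same", activation="relu")(fused)

fusion_head = tf.keras.Model([unet_features, vnet_features], up)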

The transformers consist only of a decoder, as shown in Figure 6b, where LN, MSA, and MLP stand for Layer Normalization, Multi-Head Self-Attention, and Multi-Layer Perceptron, respectively, as introduced in 2.1.4.

This architecture is implemented using TensorFlow, which, like MONAI, is a comprehensive library with
numerous packages for deep learning including the operations utilized here.

3.1.2 Loss function


The model uses a combined loss function to achieve faster convergence and improved performance, lever-
aging the Adam optimizer with a weight decay of 1 × 10⁻⁴. The combination involves two primary loss
functions: categorical focal loss and Dice loss. The combined function is described in the following equation:

L = -\sum_{i=1}^{N} g_i (1 - p_i)^{\gamma} \log(p_i) + 1 - \frac{2 \sum_{i=1}^{N} p_i g_i}{\sum_{i=1}^{N} p_i^2 + \sum_{i=1}^{N} g_i^2}    (1)

where p_i is the predicted probability for pixel i, g_i is the ground truth label for pixel i, and γ is the focusing parameter.

Categorical focal loss addresses class imbalances in segmentation tasks where certain classes, such as
tumors, are much less prevalent than others. As we go deeper in the network, these classes tend to disappear,
which results in segmentations that no longer make sense. By increasing the weight on these harder-
to-predict classes, the model focuses more on accurately identifying these critical areas, thus balancing the
learning process. Dice loss is particularly suited for both binary and multiclass image segmentation. It aims
to maximize the similarity between the model’s predictions and the ground truth by emphasizing the quality
of the predictions. This loss function is essential for ensuring precise segmentation.
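
A hedged sketch of such a combined focal + Dice loss in TensorFlow is given below; the focusing parameter γ = 2, the smoothing constant, and the AdamW usage in the comment are common defaults assumed here, not values confirmed by the article.

import tensorflow as tf

def combined_focal_dice_loss(y_true, y_pred, gamma=2.0, smooth=1e-6):
    # Sketch of a categorical focal loss plus Dice loss (assumed defaults).
    y_true = tf.cast(y_true, y_pred.dtype)
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)              # avoid log(0)

    # Categorical focal loss: down-weights easy pixels via (1 - p)^gamma.
    focal = -tf.reduce_sum(y_true * tf.pow(1.0 - y_pred, gamma) * tf.math.log(y_pred))

    # Dice loss with squared terms in the denominator.
    intersection = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(tf.square(y_pred)) + tf.reduce_sum(tf.square(y_true))
    dice = 1.0 - (2.0 * intersection + smooth) / (denom + smooth)

    return focal + dice

# Usage sketch:
# model.compile(optimizer=tf.keras.optimizers.AdamW(weight_decay=1e-4),
#               loss=combined_focal_dice_loss)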

3.1.3 Metrics
The quality of the segmentation has been evaluated in terms of four different metrics which are the accu-
racy, the Hausdorff95, the intersection over union and the dice similarity coefficient, respectively expressed
in equations (2.1)–(2.4) below.

Acc = \frac{|G \cap S| + |G^c \cap S^c|}{|G \cup S|}    (2.1)

d_H(G, S) = \max\left( \sup_{g \in G} \inf_{s \in S} d(g, s), \; \sup_{s \in S} \inf_{g \in G} d(g, s) \right)    (2.2)

IoU = \frac{|G \cap S|}{|G \cup S|}    (2.3)

DSC = \frac{2 \times |G \cap S|}{|G| + |S|}    (2.4)

Accuracy measures the proportion of correctly classified pixels by comparing each predicted pixel to its
ground truth counterpart. Hausdorff95 evaluates the boundary alignment between the predicted and ground
truth regions, representing the maximum distance between them. The Dice Similarity Coefficient (DSC) and
Intersection over Union (IoU) quantify the overlap between the predicted and expected regions. The main
difference is that the DSC is less sensitive to small errors compared to IoU.
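
As a hedged illustration, the two overlap-based metrics can be computed from binary masks as follows; the toy masks and the NumPy implementation are assumptions made for demonstration only.

import numpy as np

def overlap_metrics(g, s):
    # Compute IoU and DSC from two binary masks (illustrative sketch).
    g = g.astype(bool)
    s = s.astype(bool)
    intersection = np.logical_and(g, s).sum()
    union = np.logical_or(g, s).sum()
    iou = intersection / union                      # |G intersect S| / |G union S|
    dsc = 2 * intersection / (g.sum() + s.sum())    # 2|G intersect S| / (|G| + |S|)
    return iou, dsc

# Toy example: two slightly different masks.
ground_truth = np.array([1, 1, 1, 0, 0, 0])
prediction = np.array([1, 1, 0, 1, 0, 0])
print(overlap_metrics(ground_truth, prediction))    # (0.5, 0.666...)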

3.2 Obtained results


The architecture was trained using the BraTS2020 dataset, which includes 497 MRI scans—369 for
training and 128 for validation. These scans were collected from numerous institutions across the United
States and underwent manual segmentation by expert radiologists. Four MRI modalities were available, but
the focus was placed on Post-Contrast T1, T2, and T2-FLAIR, as these are the most relevant for brain
tumors. The images were preprocessed and resized from 240 × 240 × 155 to 128 × 128 × 128, concentrating
on the critical voxels around the tumor in each image.
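
A hedged sketch of this preprocessing step is shown below as a simple center crop from 240 × 240 × 155 to 128 × 128 × 128; the actual cropping strategy around the tumor is not detailed in the article, so the centered crop is an assumption made for illustration.

import numpy as np

def center_crop(volume, target=(128, 128, 128)):
    # Center-crop a 3D volume to the target shape (illustrative sketch).
    starts = [(dim - tgt) // 2 for dim, tgt in zip(volume.shape, target)]
    slices = tuple(slice(s, s + tgt) for s, tgt in zip(starts, target))
    return volume[slices]

# BraTS-sized dummy volume: 240 x 240 x 155 voxels.
scan = np.zeros((240, 240, 155), dtype=np.float32)
cropped = center_crop(scan)
print(cropped.shape)  # (128, 128, 128)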

The authors compared their results with other architectures in two distinct ways. First, they conducted
an ablation study, re-running segmentations using only parts of their code—specifically, using just 3DU-Net,
just V-Net, and UV-Net without transformers, as previously mentioned. Second, they compared their findings
with results from other state-of-the-art methods.

3.2.1 Ablation study

(a) DSC and HD95 comparison (b) Visualization of the segmentation

Figure 7: Ablation study

Figure 7a presents the comparison of the 3DUV-NetR+ architecture with the standalone U-Net
and V-Net in terms of DSC and HD95 for the whole tumor (WT), the enhanced tumor (ET) and the tumor
core (TC). WT refers to the entire tumor, including both the central core and surrounding areas affected
by the tumor. ET represents the tumor’s enhanced regions, highlighted after contrast agent administration,
which typically includes the most aggressive and vascular areas. TC, on the other hand, refers to the
central part of the tumor, often the densest and most homogeneous, excluding the peripheral regions. These
categories are critical for assessing the accuracy of tumor segmentation in medical imaging. Figure 7b gives
an overview of the code’s output.

The DSC for the 3DUV-NetR+ architecture is higher than that of all semi-complete architectures across
all tumor regions: whole tumor, tumor core, and enhanced tumor. The Hausdorff distance is also smaller for
the 3DUV-NetR+, indicating improved performance. The segmentation visualizations further support these
findings. The two leftmost images in the first row represent one of the ground truth multimodal scans and
its associated labels from manual segmentation. The remaining images show the labels obtained through
segmentation using only U-Net, only V-Net, both without transformers, and finally, the 3DUV-NetR+. The
3DUV-NetR+ results are the closest to the expected output, especially for the green class, which overextends
in the results from the incomplete architectures.

3.2.2 Comparison with other SOTA methods

Figure 8: Comparison with SOTA methods

The results were quantitatively compared with other proposed methods, referenced in Figure 8.

Once again, the architecture proposed by Aboussaleh et al. shows promising results. However, their
DSC is not always the highest, and their Hausdorff distance is not consistently the smallest. The nnFormer
architecture outperforms in all categories except for the WT DSC and the ET HD95. Wang et al.’s study also
presents competitive results, with better performance in TC DSC, WT HD95, and ET HD95. Additionally,
CU-Net, Lachinov et al., and Pereira et al. achieved better DSC values for both the WT and ET.

3.3 Limitations of the article


The authors claim to combine the favorable features of three deep learning architectures—U-Net, V-Net,
and Transformers—but it seems this goal has not been fully achieved. Specifically, the method of feature
extraction from the encoders, represented by the pink vertical lines in Figure 6, is unclear, making it difficult
for the reader to determine whether skip connections or residual connections are employed. Notably, one
of V-Net’s strengths lies in its use of residual connections instead of concatenations (skip connections) for
upsampling and regenerating each pixel of the input image, resulting in reduced redundancy and improved
efficiency at the expense of resolution. In contrast, U-Net’s simpler skip connections provide higher accuracy
but come at a significant computational cost. Another advantage of U-Net is its straightforward implementa-
tion, but this is not leveraged here, as the authors also incorporate the more complex V-Net architecture and
Transformers. Overall, the article does not clearly demonstrate how the positive features of each architecture
are integrated. One might infer that the higher resolution of U-Net, the robustness of V-Net with respect
to the loss function used, contextual information provided by Transformers, and the optimized operations of
V-Net are combined to achieve superior results compared to using each structure independently.

Furthermore, the article lacks justification for its approach to multimodality. At each training or validation
step, the 3D image modalities appear to be concatenated, but the article would benefit from providing
additional explanations for this choice. One might infer that the way multimodality is managed
could also improve the efficiency of the algorithm.

The article does not explicitly explain why four metrics were selected for evaluation. However, it can be
inferred that these metrics were chosen to facilitate comparison of segmentation results with other architec-
tures that may have employed one or more of them. Nevertheless, only the results for HD95 and DSC were
reported, suggesting that the segmentation performance based on the other two metrics may not have met
the authors’ expectations.

Finally, the computation time remains excessively high. This is unsurprising, given that the architecture
requires twice as many operations in the encoder stage, as the U-Net and V-Net architectures function in
parallel without any simplification compared to their standalone implementations.

3.4 Re-implementation
The author of this project attempted to implement the described architecture using TensorFlow in order
to stay as faithful as possible to the original design. At that time, the author believed that access to the
necessary resources would be granted, but due to the lack of a suitable environment, this could not be tested.
Nonetheless, this experience allowed for a deeper understanding of the deep learning packages offered by
TensorFlow, such as tf.keras for building and training U-Net and V-Net architectures, which are widely
used for medical image segmentation tasks. Despite not being able to run the code, this phase provided
valuable insights into the design and deployment of deep learning models.

4 Literature review
To contextualize Aboussaleh et al.’s study within current research, it is useful to explore other brain
segmentation methods from the literature that have been implemented using the same dataset, BraTS2020.
As previously noted, an initial comparison has already been made, and the comparative list appears to be
relatively comprehensive. No other relevant studies have been identified. The goal of this literature review is
hence to compare architectures that have outperformed some of the results from Aboussaleh et al.’s study,
in order to understand why they performed better and to identify potential research directions in this field.

The architecture proposed by Pereira et al. leverages 3 × 3 kernels to design deeper networks with fewer
weights, enhancing resistance to overfitting. Intensity normalization is applied to standardize data from multi-
site acquisitions, and data augmentation via rotations improves segmentation of gliomas in MRI images.
The 3D transformer model, nnFormer, combines convolutional and self-attention operations to effectively
capture local and global dependencies. It employs Local and Global Volume-based Multi-head Self-attention
mechanisms to model representations robustly and uses skip attention in place of traditional skip connections
for precise segmentation. The cascaded CNN framework with uncertainty estimation, proposed by Wang
et al., uses three 2.5D CNNs to hierarchically segment brain tumor regions. This architecture balances
memory consumption, model complexity, and receptive field through anisotropic convolutions. Test-time
augmentation is employed for uncertainty estimation, aiding in identifying potentially mis-segmented areas.
The CU-Net framework features a successive segmentation approach for brain tumor structures. Between-
network connections are designed to transmit high-resolution features from shallow layers to deeper layers,
enhancing segmentation precision. A loss-weighted sampling scheme addresses class imbalance, improving
model performance in challenging datasets.

The reviewed studies highlight overlapping features such as hierarchical approaches, data augmentation,
and methods to balance complexity and precision, including the use of transformers. These commonali-
ties underscore their effectiveness and point to promising directions for further exploration in brain tumor
segmentation methodologies. Additionally, they emphasize the significance of Aboussaleh et al.’s approach,
which integrates all these common features.

5 Conclusion
The majority of the objectives for this project were achieved, except for the implementation, which was
hindered by limited resources at the time of completion. The understanding of deep learning concepts was
successfully attained, allowing for a comprehensive grasp of the implementation details of a promising method
for 3D multimodal brain segmentation. This also provided the tools to critically evaluate the results.

A brief literature review of studies with better results in the metrics used in the project’s focal article
further justified the relevance of Aboussaleh et al.’s approach. These reviews underscored the benefits of
using transformers and a hierarchical structure with encoder-decoder architectures for brain segmentation.
However, each method remains computationally expensive and would benefit from optimization at this scale.

References
[1] Ilyasse Aboussaleh, Jamal Riffi, Khalid El Fazazy, Adnane Mohamed Mahraz, and Hamid Tairi. 3DUV-NetR+: A 3D hybrid semantic architecture using transformers for brain tumor segmentation with multi-modal MR images. Results in Engineering, 21:101892, 2024.

[2] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012.

[3] Dmitry Lachinov, Evgeny Vasiliev, and Vadim Turlapov. Glioma segmentation with cascaded UNet. In International MICCAI Brainlesion Workshop, pages 189–198. Springer, 2018.

[4] Hongying Liu, Xiongjie Shen, Fanhua Shang, Feihang Ge, and Fei Wang. CU-Net: Cascaded U-Net with loss weighted sampling for brain tumor segmentation. In Multimodal Brain Image Analysis and Mathematical Foundations of Computational Anatomy: 4th International Workshop, MBIA 2019, and 7th International Workshop, MFCA 2019, Held in Conjunction with MICCAI 2019, Shenzhen, China, October 17, 2019, Proceedings, pages 102–111. Springer, 2019.

[5] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pages 565–571. IEEE, 2016.

[6] Project MONAI. 3D segmentation - BraTS tutorial, 2023. Accessed: 2024-12-12.

[7] Project MONAI. 3D segmentation - Swin UNETR BraTS21, 2023. Accessed: 2024-12-12.

[8] Project MONAI. 3D segmentation - UNet with Ignite, 2023. Accessed: 2024-12-12.

[9] Sérgio Pereira, Adriano Pinto, Victor Alves, and Carlos A. Silva. Brain tumor segmentation using convolutional neural networks in MRI images. IEEE Transactions on Medical Imaging, 35(5):1240–1251, 2016.

[10] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III, pages 234–241. Springer, 2015.

[11] Guotai Wang, Wenqi Li, Sébastien Ourselin, and Tom Vercauteren. Automatic brain tumor segmentation based on cascaded convolutional neural networks with uncertainty estimation. Frontiers in Computational Neuroscience, 13:56, 2019.

[12] Hong-Yu Zhou, Jiansen Guo, Yinghao Zhang, Xiaoguang Han, Lequan Yu, Liansheng Wang, and Yizhou Yu. nnFormer: Volumetric medical image segmentation via a 3D transformer. IEEE Transactions on Image Processing, 2023.
