
SRH Hochschule Heidelberg

Master's Thesis

Stroke Detection with Deep Learning

Author:
Vega Arellano, Jaime Paolo

Supervisors:
Prof. Dr. Ing. Chandna, Swati
Prof. Dr. Ing. Osman, Ahmad

Advisors:
PhD student Albert-Weiß, Dominique
Dr. med. Ohlmann-Knafo, Susanne
Prof. Dr. med. Pickuth, Dirk
PhD student Wei, Ziang

A thesis submitted in fulfilment of the requirements for the degree in Big Data and Business Analytics in the

School of Information, Media and Design

September 2022
Declaration of Authorship

I, Jaime Paolo Vega Arellano, declare that this thesis, titled 'Stroke Detection with Deep Learning', and the work presented in it are my own. I confirm that this work submitted for assessment is expressed in my own words and is my own. Any uses made within it of the works of other authors in any form (e.g., ideas, equations, figures, text, tables, programs) are appropriately acknowledged at the point of their use. A list of the references employed is included.

■ This work was done wholly for a research degree at this University.
■ Where any part of this thesis has previously been submitted for a degree or any other qualification at this University or any other institution, this has been clearly stated.
■ Where I have consulted the published work of others, credit is always given to the work concerned.
■ Where I have quoted from the work of others, the source is always given; with the exception of such quotations, this work is entirely my own.
■ I have acknowledged all main sources that supported and helped me during this work.

Signed: Paolo Vega

Date: 21.09.2022

Acknowledgements

First, I would like to thank my whole family: mom, dad, and you, sister. Words cannot describe how thankful I am for having you all in my life, for receiving your invaluable support, love, and inspiration. You're my best supporters, even from a distance.

Many thanks to all my close friends in Mexico. Each of you has contributed uniquely to my life, and I am glad to have such inspiring and amazing people with me, despite the distance and the time difference.

I want to thank Prof. Dr. Swati Chandna, my first supervisor; her support, advice, leading example, and valuable tips in lectures at SRH helped me find the complement I was looking for in my professional profile, and put me on the right track to start my journey in computer vision.

I would like to express my sincere gratitude to my second supervisor, Prof. Dr.-Ing. Ahmad Osman, for giving me a real opportunity to work in computer vision along with his team and for supporting me on my road to mastery and excellence in this field. My respect to you both, professors, because I firmly believe that teaching is the most rewarding endeavor and brings out the best in people.

Special thanks to PhD students Dominique Albert-Weiß and Ziang Wei for their contribution, guidance, and support throughout my whole project; but above all, for the opportunity to learn from you in so many ways.

I must thank the whole AutomatIQ team at Fraunhofer, who helped me discover a spectacular side of Artificial Intelligence that I did not know, and made my life in Saarbrücken easier with a great sense of humor.

Special thanks to Prof. Dr. Dirk Pickuth and Dr. Susanne Ohlmann-Knafo from CaritasKlinikum Saarbrücken, for their invitation to experience at close hand how specialists work with MRI, and the crucial work they perform every day preserving patients' health.

Finally, thank you to all the people who helped me meet my personal and professional goal of living abroad. You have no idea how much you have influenced me to keep thinking bigger.

Abstract

Stroke is a severe condition that causes a high mortality rate worldwide. Medical image analysis is the main method that specialists employ in hospitals to localize and detect strokes. Despite the on-site software at hospitals and the expertise of neurologists and radiologists, it remains a challenging, time-consuming and labor-intensive task. Convolutional Neural Networks are the preferred strategy for analyzing images and contributing to medical diagnoses with considerable reliability in their results. In fact, Deep Learning techniques have evolved at a rapid pace, with new architectures currently assisting the medical sector to make such tasks more efficient when it comes to preserving patients' lives. Despite these enhancements, the inherent complexity of the diseases makes it necessary to continue excelling at this task in an automated fashion. Therefore, the ultimate goal of this thesis is to implement a two-stage architecture to segment brain lesions and then classify the results to distinguish correct and incorrect cases automatically, with a particular focus on maximizing performance on the segmentation stage by training state-of-the-art neural network architectures to detect brain lesions in patients.

Keywords: MRI, Stroke detection, Deep Learning, Image Segmentation, ATLAS, U-Net
Contents
List of Figures vi

List of Tables viii

Abbreviations ix

1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Problem Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.4 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Theoretical Background 6
2.1 Stroke in the Medical Field . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 Medical Images for stroke analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Computer Vision . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Artificial Intelligence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4.1 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.5.2 Feed Forward Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.6.1 Backpropagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.6.2 Optimizers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.6.3 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.4 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.7 Convolutional Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.8 Model Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3 Related Work in Medical Semantic Segmentation 27


3.1 Filter Size and Depth Based Models . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.2 Residual Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3 Encoder-Decoder Based Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.4 Attention Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.5 Transformer Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.6 Object Detection Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.7 Region Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.8 Other Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39


4 Methodology and Implementation 40


4.1 Machine Learning Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Machine Learning Data Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.1 Data Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.2.2 Data Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2.3 Data Preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.2.4 Model Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

5 Results 49

6 Evaluation 55

7 Discussion 58

8 Conclusion 59

9 Future Work 60

Bibliography 61
List of Figures
1.1 Types of strokes [91] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2.1 Axial, sagittal and coronal planes of an MRI [92] . . . . . . . . . . . . . . . . . . 7


2.2 Sequences of MRI [92] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Representation of the relation among AI, ML and DL [41]. . . . . . . . . . . . . 11
2.4 Representation of a biological (a) and an artificial (b) neural network [52] . . . . 12
2.5 Sigmoid function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.6 Tanh function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 ReLu function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.8 Softmax function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Leaky ReLu function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.10 Single layer perceptron Model [55] . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.11 Multi layer perceptron model [55] . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.12 Underfitting and overfitting [93] . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.13 Dropout effect [61] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.14 Example of a CNN diagram [41] . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.15 Confusion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.1 ResU-net [45] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29


3.2 ResU-net++ [46] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3 2D Unet [6] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 3D Unet [3] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 SegNet [9] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.6 X-Net [7] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.7 V-Net [9] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.8 D-UNet [2] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.9 CaraNet [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.10 Attention U-Net [46] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.11 SwinUNETR [56] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.12 YOLO [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.13 SSD [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.14 Retina Net [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.15 Mask R-CNN [8] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

4.1 DevOps [68] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40


4.2 ML Pipeline Level 2 with fully automated processes for continuous training . . . 42
4.3 Histogram that shows the distribution of the patient’s MRI sequences in the
ATLAS dataset. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.4 Stroke centroids in training (left), validation (middle), and testing (right). . . . 44
4.5 Distribution of stroke areas in the training (left) and validation (right) dataset. 44
4.6 Sample of 21 individual MRI images along with their analogous lesion segmentation masks below each. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.7 Data folder structure for training, validation and testing datasets . . . . . . . . . 45


4.8 Data folder structure for training, validation and testing datasets . . . . . . . . . 45
4.9 Contrast adjustment, rotation and resizing applied to both images on the left
(input and mask). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.10 Proposed Attention CNN for experiments. . . . . . . . . . . . . . . . . . . . . . . 47

5.1 Attention U-Net prediction results . . . . . . . . . . . . . . . . . . . . . . . . . . 50


5.2 ResU-Net++ prediction results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.3 U-Net prediction results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.4 ResU-Net prediction results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.5 Tailored Attention U-Net prediction results . . . . . . . . . . . . . . . . . . . . . 54

6.1 Loss for all models in the validation dataset . . . . . . . . . . . . . . . . . . . . . 56


6.2 Dice score for all models in the validation dataset . . . . . . . . . . . . . . . . . 57
6.3 IoU score for all models in the validation dataset . . . . . . . . . . . . . . . . . . 57
List of Tables
4.1 Hyperparameters considered in random search . . . . . . . . . . . . . . . . . . . 48

5.1 Experimental results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.1 Hyperparameters in models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

Abbreviations

Adam Adaptive Moment Estimation


AI Artificial Intelligence
ATLAS Anatomical Tracings of Lesions After Stroke
BN Batch Normalization
BCE Binary Cross Entropy
BGD Batch Gradient Descent
CT Computed Tomography
CV Computer Vision
CNN Convolutional Neural Networks
DevOps Development and operations
DLC Dice Loss Coefficient
DSC Dice Score
DICOM Digital Imaging and Communications in Medicine
DL Deep Learning
DNN Deep Neural Networks
IoU Intersection over Union
ML Machine Learning
MRI Magnetic Resonance Imaging
NIfTI Neuroimaging Informatics Technology Initiative
NN Neural Network
ReLU Rectified Linear Unit
RoI Region of Interest
RGB Red Green Blue
SGD Stochastic Gradient Descent
TE Echo Time
TR Repetition Time
VCS Version Control System

Chapter 1. Introduction
With the accelerating diversity of Computer Vision (CV) applications in various fields, especially the medical setting, CV has made it possible to assist specialists in the diagnosis of certain diseases that would otherwise require a great amount of effort. One of the main advantages of CV is the use of visual recognition tasks, like image segmentation, for analyzing and understanding medical images. Magnetic Resonance Imaging (MRI) is the preferred approach for detecting strokes, a condition ranked among the leading medical conditions worldwide with long-lasting negative effects on patients' lifestyles [1].

Generally, strokes can be classified into two main groups: (i) hemorrhagic, which happens when a blood vessel in the brain ruptures and blood flows around the damaged area, affecting the neighboring region of the brain, and (ii) ischemic, which occurs when a blood vessel is blocked, resulting in a lack of oxygen in the surrounding brain area, causing brain damage and loss of movement in certain parts of the body [1, 12]. While both types of stroke are severe health problems, ischemic stroke ranks as the second deadliest disease in the world and inflicts severe damage on human life [1, 43]. Figure 1.1 illustrates the difference between these two types of stroke.

Figure 1.1: Types of strokes [91]


Diagnosing a disease is not an easy task for specialists, especially if it occurs internally, as there is no direct view to determine its status. Technology is necessary to examine and understand the condition of the internal organs of the patient's body. An example of such technology is MRI, which is used to generate a visual representation of any part of the body [48, 49]. Such images are obtained with the aid of scanners, devices that pass harmless high-frequency radio waves within a powerful magnetic field in which patients lie down. The goal is to detect the response of the body's tissues and produce a series of gray-scale images of the internal organs at different points in time [48, 49].

During the MRI analysis process, neurologists and radiologists identify the locations of blocked blood vessels in the brain that indicate the presence of ischemic strokes in their patients [12]. This task is highly relevant for measuring the proportion of the damaged area and determining the appropriate treatment required to assist patients as soon as possible [1, 12, 16, 24]. Nevertheless, factors such as poor image resolution due to the variety of scanners in hospitals, and the need for human intervention by specialists, affect the timely and accurate interpretation and validation of results, consuming valuable time at the expense of patients' lives.

On the other hand, technology has rapidly evolved to become one of the main tools to assist people in several time-consuming manual tasks. Deep Learning (DL) has proven to be even more efficient than humans in performing critical object-detection and classification tasks in the medical sector [3, 37]. For instance, its contribution to this field has led to incredible results in identifying not only strokes, but also brain tumors and Alzheimer's and Parkinson's diseases, assisting doctors in making more decisive diagnoses [44]. Its success stems largely from the fact that DL follows an approach similar to the human learning process, achieving remarkable accuracy in CV and signal processing tasks.

Among these tasks, MRI analysis using Deep Neural Networks (DNNs) for brain tumor and stroke detection has provided interesting results in recent years, and new Neural Network (NN) architectures have been proposed to keep improving the outcome. Although the contribution of many researchers has led to promising results, the natural variability of strokes, such as their size, location and area, makes it hard for the existing models to generate accurate lesion segmentations for patients worldwide.

1.1 Motivation
Combining knowledge from several fields with technology is key to bringing innovation in solving common problems that have predominated for years. In the medical setting, the result of such a combination is medical image analysis, which accelerates specialists' responses so they can act more efficiently in the clinical diagnosis, prevention, and treatment of diseases to preserve patients' lives, which is among the ultimate goals of medicine. Over the last decade, moreover, the growth of DL for image analysis has made it possible to automate manual tasks in the medical field, and researchers are constantly contributing enhanced models and newly available data, and suggesting key strategies in the training process of NNs to systematically enhance disease detection.

Nevertheless, medical image analysis is complex by nature. Despite the massive number of publications on CNN architectures, most of the solutions encompass new architectures with additional blocks that perform more sophisticated feature extraction to increase accuracy in pixel-level classification. Model optimization during implementation is therefore a necessary step, as it represents an important characteristic of effective and favorable NN training.

Since the release of the original version of the ATLAS dataset [15] in 2016, state-of-the-art image segmentation models have reported a maximum of 53% accuracy (D-UNet [2]). Among the primary factors that contribute to such limited accuracy are low data quality, insufficient data, and the lack of an appropriate model to solve the problem. In particular, not only does the architecture design play a crucial role in the final performance, but so do the hyperparameters, including a proper loss function that successfully helps the model converge. Indeed, several loss functions have been utilized to deal with the class-imbalance problem common in medical images [2], yet only a few recent publications incorporate weighted loss formulas in their experiments. Weights in mathematical formulas matter, given the different contributions of the individual elements to the final calculated value. Therefore, exploring the effect of a weighted, well-known loss function on final performance is the main motivation of this thesis, as this could yield adequate performance for the first stage of the general proposed architecture.

1.2 Problem Statement


Strokes occur due to a variety of conditions, but when they do occur, rapid medical attention is key for the injured person to reduce major lifestyle consequences [14]. When a person experiences atypical symptoms such as sudden speech, visual, or motor issues, dizziness, numbness or tingling in some extremities [42], a visit to the hospital is extremely urgent for diagnosis of the medical condition, as these are frequently stroke symptoms.

In such cases, time is of vital importance for specialists to determine the patient's final diagnosis and start taking action. The rapid and accurate localization of strokes is crucial for specialists to determine the proper diagnosis, define the strategy to rescue most of the affected brain tissue, and establish an appropriate therapy for the patient to recover motor functions [14, 16, 42]. This analysis involves challenging situations, as the process depends entirely on a group of experts who discuss what the main issue is by visualizing the MRI of each patient and localizing the region(s) of the brain where the lesion appeared.

The entire analysis procedure presents the following issues:



1. Subjective analysis. Although a group of specialists usually gathers to observe the information, analyse the patient's images, and agree on a possible diagnosis, this process is highly influenced by the opinions of that group, and it therefore requires the approval of more people before a final decision is stipulated. Furthermore, this situation is completely supervised, and the review of dozens of patients becomes an exhausting and time-consuming task, especially considering that the size, location and shape of a stroke can be completely different in every patient, so each and every person requires individual analysis to evaluate their own situation.

2. Difference in MRI resolutions. Hospitals have their own medical equipment, and radiologists and neurologists use on-site software to visualize anomalies in patients. In fact, hospitals possess medical image visualization tools to have a better judgment of such problems, as this is necessary for an effective diagnosis [12]. Nevertheless, poor quality, dimensions and resolution in mostly 2D gray-scale images complicate the distinction between patients with small lesions and those without [1]; and certainly, the diversity of people with regard to their uniquely distinct organs adds another variable to the complexity of the problem at hand [1].

3. Lack of reliable data for leveraging automatic computer-aided diagnosis. Data protection, with the relevance it has acquired lately, is perhaps one of the most serious topics to address when releasing datasets that include sensitive and personal information. Ensuring that the data is not traceable to an individual's identity is a challenge, principally in the medical domain, where most of the data is stored as images in either DICOM [19] or NIfTI [19] formats, the medical standards for patient information. For this reason, an additional process to remove any information that could lead to the identification of individuals is required, adding further effort to the data generation process. Despite well-known data-related competitions that include publicly available datasets, like lesion segmentation on the ATLAS [15] dataset used in this project, the quality and amount of data have not been sufficient to produce reliable results [13]; and it takes a huge effort for specialists to produce big volumes of data for researching and enhancing DL methods.

Therefore, examining hundreds of sequences per patient in order to effectively indicate the size, position and surrounding area of the affected brain tissue represents a fundamental task in which DL could participate, reducing the manual effort required at hospitals, mitigating workload, and even improving the results of experts [13].

1.3 Research Questions


The previous reasons illustrate why MRI analysis represents a current challenge in medicine and technology. The relevance of localizing strokes in patients by looking at their MRI image sequences is to assist radiologists and neurologists at hospitals in making more efficient assessments of patients' conditions. Thus, recent state-of-the-art image segmentation methods are developed to improve the result of this task. Specifically, the main research question of this thesis related to the segmentation models seeks to identify the most efficient NN architecture providing high accuracy on this learning task, as well as exploring an additional strategy to evaluate performance on current results. The questions are as follows:

1. How well do existing encoder-decoder-based DL architectures perform at localizing brain lesions in stroke patients without the intervention of experts?

2. How accurately does a tailored encoder-decoder-based DL architecture perform stroke detection on the ATLAS [15] dataset?

3. What impact does a weighted BCEDiceLoss, i.e. a different contribution from each individual function, represented by $\alpha \cdot \text{BCE} + \beta \cdot \text{DiceLoss}$, have on accuracy for stroke localization with the current encoder-decoder architectures?

1.4 Contributions
The list below shows the contributions of this thesis:

• Initial development of a two-stage system for image segmentation that consists of (i) an image segmentation model whose aim is to detect cerebral lesions in patients, and (ii) a classification model that evaluates the result of the previous stage to increase reliability in the results.
• The use of the weighted BCEDiceLoss function to train the CNN in the segmentation task. While some publications related to stroke image segmentation utilized BCEDiceLoss in their experiments [7, 59], none of them explored the use of a weighted formula for performance evaluation. In fact, CaraNet [8] considered a similar metric called weighted IoU + BCE, but IoU and DSC differ in their respective formulas. Therefore, the idea is to add the two weighting coefficients as hyperparameters of the loss function, to explore the effect of calculating the final loss value with different contributions from its base functions and assess the impact on the final accuracy, without augmenting the trainable model parameters. A sketch of such a weighted loss is shown below.
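For illustration only, the following is a minimal sketch of a weighted BCEDiceLoss of the form α · BCE + β · DiceLoss, assuming a PyTorch setup with sigmoid-activated predictions and binary lesion masks; the class name and the default values of α and β are hypothetical choices, not the thesis' experimental settings:

```python
import torch
import torch.nn as nn

class WeightedBCEDiceLoss(nn.Module):
    """Weighted combination of BCE and Dice loss: alpha * BCE + beta * DiceLoss."""

    def __init__(self, alpha: float = 0.5, beta: float = 0.5, eps: float = 1e-6):
        super().__init__()
        self.alpha, self.beta, self.eps = alpha, beta, eps
        self.bce = nn.BCELoss()

    def forward(self, pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # pred: sigmoid probabilities in [0, 1]; target: binary ground-truth mask.
        bce = self.bce(pred, target)
        intersection = (pred * target).sum()
        dice = (2.0 * intersection + self.eps) / (pred.sum() + target.sum() + self.eps)
        return self.alpha * bce + self.beta * (1.0 - dice)  # DiceLoss = 1 - Dice score
```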
Chapter 2. Theoretical Background

2.1 Stroke in the Medical Field


Stroke is the second deadliest disease in the world and has always been one of the major causes of damage to human life and health [1]. Generally, there are two main types of stroke: (i) hemorrhagic, which occurs when a cerebral blood vessel ruptures and the blood affects the nearby region, and (ii) ischemic, which happens when a cerebral blood vessel is blocked, preventing blood from flowing and reaching other parts of the brain [11]. The most prevalent cases of stroke are ischemic, and the main concern is that, if they prevail for a longer period of time, they can cause the local death of the tissue where they are located and irreversible harm to the patient's motor functions. Two main regions characterize a stroke: the infarct core and the penumbra. The former refers to tissue that is irreversibly affected and no longer alive, while the latter refers to tissue that can be recovered only if blood is rapidly supplied to it [13]. A quick reaction by neurologists and radiologists is therefore of great importance in identifying the penumbra, assessing the damage and determining the most adequate therapy required to salvage as much of the compromised brain tissue as possible [16].

2.2 Medical Images for stroke analysis


To begin the scrutiny and understanding of the penumbra, specialists rely on biomedical images: X-Ray, MRI and CT are popular types of medical images that provide better internal tissue visualization. Generally, X-Ray imaging is used for fissures and bone displacement, CT for tumor, cancer and stroke detection, and MRI for stroke analysis [16, 50]. Although the type of image depends entirely on the medical equipment at hospitals, CT has remained a favorite choice for analyzing cerebral blood flow, as it generates images of the bones, soft tissues and blood vessels in the body [50]. However, it still presents limitations that MRI overcomes with regard to ischemic stroke detection, as MRI generates an image with sufficient resolution for a better visualization of the brain.

In principle, the functionality of MRI is based on radio frequency and strong magnetic fields. An MRI scanner is responsible for generating a strong magnetic field, through which it passes a radio signal of varying intensity to produce gray-scale images of any part of the body. One of the benefits of MRI is that its wave frequency is harmless to humans, in contrast to how X-ray and CT work. Another advantage of MRI is that it is sensitive to water, which means that it can detect diseases, as in most cases they are detectable due to varied water levels in patients [48, 49]; in the brain, this makes the difference between gray and white matter visible. MRI images can be divided into three main planes: axial, sagittal, and coronal. Figure 2.1 shows an example of each plane of an MRI.

Figure 2.1: Axial, sagittal and coronal planes of an MRI [92]

Different types of MRI sequences can be obtained depending on the period between the radio frequency pulses applied to the tissue (repetition time, TR) and the time span after which the signal is received (echo time, TE). The most common sequences are T1-weighted and T2-weighted images. The former is obtained when both TR and TE are short (500 ms and 14 ms), whereas the latter results when both are longer (4,000 ms and 90 ms). An extra sequence called FLAIR is generated when both TR and TE are much longer (9,000 ms and 114 ms) [62]. Figure 2.2 shows an example of these sequences.

Figure 2.2: Sequences of MRI [92]

Although these types of neuroimages (X-Ray, CT and MRI) are used to facilitate the detection of ischemic stroke, they cannot provide an identifiable penumbra region in cerebral tissue [32]. The final interpretation of results, the determination of the diagnosis, and the required treatment for the patients remains exhaustive human labour. According to the testimony of a neurologist at the hospital in Saarbrücken, a typical analysis proceeds as follows: first, radiologists look at medical images using on-site software and share their observations with supervisors, who then pass them on to the neurologists responsible for the final explanation of results. In spite of the specialized software available for image analysis, there are still challenges that need to be considered, such as (i) poor image resolution due to the variety of scanners at hospitals, (ii) misleading highlighted brain zones caused by medical equipment, and (iii) the demanding labour of specialists for manual stroke localization in patients. Thus, the reliability of the reported results remains a subjective perception, despite the experience of the specialists in charge of performing this job.

In contrast, technology has rapidly evolved to become one of the main tools to assist people in several time-consuming tasks. In the medical setting, for example, its contribution has led to incredible results in identifying not only strokes, but also brain and lung tumors, helping doctors make more accurate diagnoses. The reason behind this benefit relies on CV and signal processing, which are widely used to perform a variety of human-level tasks; in recent years, DL has proven to be even more efficient, with several models detecting and classifying clinical conditions in the medical sector [3].

2.3 Computer Vision


Computer Vision (CV) is a subfield of AI whose main goal is to enable computers to acquire essential information from unstructured data, like images and videos, to provide a better understanding of the visual element and assist humans in performing specific tasks, from object classification in images to object tracking in videos [78, 80]. CV relies on DL models to achieve this goal, and it has several applications in many industries due to the amount of available data in these formats. Its contributions can be perceived in modern technology, for instance, self-driving cars [79]. The following are the four main areas of application for CV [80]:

2.3.1 Classification

Classification is one of the areas in which CNNs have produced prominent results. In simple terms, this task consists of detecting objects in images and providing the label (class) they belong to, like "car", "dog" or "cat" [80]. In fact, multiple competitions since 2012 have enabled researchers to use publicly available datasets, such as ImageNet, MNIST, CIFAR and COCO, to develop more complex architectures with the aim of improving this effort [37]. This application is widely used in medical disease diagnosis, especially for cancer and stroke [44].

2.3.2 Tagging

Tagging adds a set of human-level concepts that describe a particular image based on the objects it contains [86]. An example of this application is found in search engines, as users normally type what they want to search for [86].

2.3.3 Object Detection

Detection is concerned with locating objects within an image by enclosing a bounding box around the target, and it is the cornerstone of various computer vision tasks, like instance segmentation, image captioning and object tracking, to mention a few [85].

2.3.4 Segmentation

Segmentation solves the problem of classifying images at pixel level to form groups or regions in the image (semantic segmentation), segregate individual objects (instance segmentation), or combine both strategies (panoptic segmentation) to highlight the shape of the objects contained in the image [11]. The result of this method is an image containing the pixels where the object of interest is located.

Semantic segmentation classifies every single pixel of the image with a corresponding label that belongs to a group of object categories [11, 80]. The output is an image that indicates the location of the object by highlighting the pixels belonging to that particular label. In the medical domain, knowing the localization of different classes in the image is critical, as it provides the distinction between tissues that need to be analysed for further diagnoses [81]. In pathology, semantic segmentation is used to analyse patients' images and determine the localization of tumors [63].

Similarly, instance segmentation also classifies individual pixels into groups within the image; however, this method can additionally identify unique instances of pixels belonging to one single class [11, 82]. While semantic segmentation can detect the pixels of, for instance, all five people in an image, instance segmentation can identify the pixels of each individual.

Panoptic segmentation merges semantic segmentation and instance segmentation into one unified process for visual recognition tasks [83, 84]. This method differentiates two broad classes: stuff classes and thing classes. While the thing class encloses countable objects (such as people, cars and animals), the stuff class refers to objects such as sky, vegetation, roads and, in some cases, buildings, which are technically countable but not relevant to distinguish in certain tasks [84].

2.4 Artificial Intelligence


Artificial Intelligence (AI) is defined as the science and engineering of building machines that implement a process similar to rational human problem-solving [37, 70]. The idea of this field is to automate complex daily activities such as analyzing images, videos, speech and text, and to contribute to other fields [70]. AI is a wider spectrum that covers machine learning (ML) and deep learning (DL) [37, 41].

2.4.1 Machine Learning

ML is a subset of AI in which machines derive hidden patterns from the data to generate their own knowledge, which allows them to perform and improve a particular task without human intervention [37, 70]. ML has its foundations in statistical methods and algorithms applied to the available data for learning. The machine extracts the main features that represent most of the data to solve a particular task; however, what makes ML appealing and interesting is its capacity to adjust by itself, acquiring more meaningful data characteristics by minimizing the errors in its predictions to improve its learning task [70]. Among the applications of ML are spam detection, video recommendation, image classification and text mining [41].

2.4.2 Deep Learning

Likewise, DL is a subset of ML comprising computational models called artificial neural networks (ANNs) with several layers in their design, as part of their strategy to solve complex problems [37, 70, 88]. The idea behind DL is that the first layers of the architecture extract low-level features, whereas the last layers extract high-level features [41, 70]. The medical field has benefited from DL implementations, especially in medical imaging [63, 87], radiation oncology [87], pathology, cardiology and dermatology [63]. Other factors that have contributed to the accelerated development of deep learning are the appearance of publicly annotated datasets, as well as GPU parallel computing for a quicker training process of DL models [33].

The categorization of DL approaches is commonly as follows:

1) Supervised Learning. This technique is the most frequently utilized, as it requires labeled data indicating which legitimate output corresponds to the input data [41, 70, 88]. The idea is that this ground truth is the baseline for the model to iteratively find an optimal solution that reduces the error (loss function) between its prediction and the correct value. While the simplicity of this approach makes it possible to find an appropriate model when there is sufficient data, a lack of data could negatively impact the model, leading to undesirable results [41]. Examples of this implementation are recurrent neural networks (RNNs), convolutional neural networks (CNNs), deep neural networks (DNNs), long short-term memory networks (LSTMs) and gated recurrent units (GRUs). The focus of this thesis is on CNNs, as this is the DL method used for stroke detection [41].

2) Unsupervised learning. In this approach, there is no labeled data, and the model determines the output by itself according to relationships among the relevant features determined internally by the model [41, 70, 88]. Clustering and dimensionality reduction are examples of this approach; auto-encoders are another example of NN architectures in this category [41].

3) Reinforcement Learning. This approach differs from both supervised and unsupervised learning because the model interacts with the environment in order to receive rewards [41, 70, 88]. The goal is to solve problems based on feedback that maximizes the rewards and to improve the approach during the process [70, 88]. Reinforcement learning is useful when some behaviour produces a long-term reward, or when trying to figure out the best approach, for instance, in video games. Figure 2.3 depicts the relationship among AI, ML and DL.

Figure 2.3: Representation of the relation among AI, ML and DL [41].

2.5 Artificial Neural Networks


It was previously stated that the approach of DL is based on its similarity to the human learning process. In order to understand how it truly works, it is necessary to explain the internal process of the brain and how it learns.

The brain is the most sensitive and complex organ in the human body, and it is where a person's learning process takes place. In it, a huge number of tiny cells called neurons receive external information through ramifications called dendrites; this information is processed in the nucleus to generate a result that travels through the cell's axon via electrical impulses, allowing the signal to reach other neurons [52]. This huge extension of interconnected neurons forms the extensive network of cells in the human brain.

The main function of a single neuron is simply to process an external stimulus, produce a response, and transfer it to other neurons for further processing [33]. However, true learning happens when the neurons are able to modify their responses depending on the input they receive. Although the contribution of one neuron may seem minimal, the impact of this process is outstanding, considering the multiple neurons that are stimulated in parallel during the exchange of information at incredible speed.

Similarly, Artificial Neural Networks (ANNs) try to mimic the marvelous human-level learning process that occurs in a biological neural network, and their design is based on this composition. First, there is a single processing unit, the equivalent of a neuron, receiving some inputs; these inputs are processed by an operation occurring inside the unit to generate a response, which is passed on to neighboring units; finally, all the units are arranged into groups to create a network. The similarity between the biological network and the ANN can be seen in Figure 2.4.

Figure 2.4: Representation of a biological (a) and an artificial (b) neural network [52]

Consequently, ANNs provide a different approach to solving a wide range of complex problems. They have a solid theoretical and statistical foundation, and the method consists of finding a solution based on several examples given to the network. This simple, yet extremely powerful procedure allows the application of solutions in a variety of fields, including data processing, computer vision, and even text analysis. Although there have been significant improvements in understanding how the human brain works, the mathematical representation utilized in ANNs provides sufficient knowledge to rely on this methodology [33].

In order to determine whether a problem can be addressed with an ANN, a series of factors must be evaluated, such as: (i) the availability and sufficiency of data for training the network; (ii) whether the solution to the problem is not straightforward and requires complex modelling; (iii) whether a general solution is needed that can process new and unseen incoming data at a reasonable speed, either as part of the time-response requirements of the solution or because a huge volume of data needs an optimal processing response; and finally, (iv) whether the model is robust enough to tolerate anomalies in the data (error-tolerant design) [33].

Artificial neurons are the cornerstone of ANNs. These neurons are responsible for individual computations and propagate the results through interconnections with other groups of neurons. This single unit includes two types of parameters: inputs, external values directly associated with data or with connections to other layers, and weights, internal values used in the output computation.

The number of inputs depends entirely on the model design, but essentially a neuron has n + 1 inputs, where n represents the number of explicit inputs and one additional element, called the bias, is added. Following the principles of ANN design, each input of the neuron has a corresponding weight value that is calculated during the training process. As the response of the neuron can be positive or negative, the weights can also take any positive or negative value. To obtain the final input value, the calculation consists of adding the products of each input and its corresponding weight, plus the bias term. This value is passed to the activation function to produce the output signal of the neuron, which continues its trajectory towards the output layer. For instance, if a neuron has 5 inputs, the final result will be

$$\text{output} = \text{input}_1 \cdot \text{weight}_1 + \text{input}_2 \cdot \text{weight}_2 + \dots + \text{input}_5 \cdot \text{weight}_5 + 1 \cdot \text{bias}$$

The general formula for calculating this input is:

$$\text{input}_{\text{signal}} = \sum_{i=1}^{d} w_i x_i + \text{bias}$$
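To make the computation concrete, here is a minimal NumPy sketch of this weighted sum for a single neuron; the input and weight values are hypothetical:

```python
import numpy as np

def neuron_input_signal(x: np.ndarray, w: np.ndarray, bias: float) -> float:
    """Computes sum_i(w_i * x_i) + bias, the value fed to the activation function."""
    return float(np.dot(w, x) + bias)

x = np.array([0.5, -1.2, 3.0, 0.0, 2.1])  # five explicit inputs
w = np.array([0.4, 0.7, -0.2, 1.5, 0.3])  # one weight per input
print(neuron_input_signal(x, w, bias=1.0))
```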

2.5.1 Activation Functions

The operation at the core of the neuron is of vital importance for modelling the solution of a complex problem. This function is called the activation function, and it is responsible for transforming the input values into an output response that is transferred to subsequent layers of the architecture [34, 41].

In ANNs, the input of an activation function in one particular neuron is the weighted sum of each input with its corresponding weight, as explained before. The function takes this value and produces a final result. Nowadays, data comes in a wide variety of formats, and ANNs are expected to analyze not only numbers, but also text, images, video, audio, etc. For this reason, choosing a non-linear function helps the model establish more complex relationships that better describe the problem at hand. Although different activation functions can be used in several layers throughout the network design, a common practice is to set one type for the hidden layers and another for the output layer.

Examples of activation functions are the following:

1- Sigmoid: It is usually used for classification in the output layer [34]. The result of this function ranges between 0 and 1 and indicates a probability, where 0 means 0% probability of belonging to class A and 1 represents 100% probability of belonging to class A. The formula, plotted in figure 2.5, is:

$$f(x) = \frac{1}{1 + e^{-x}}$$

Figure 2.5: Sigmoid function

2- Tanh: It is the hyperbolic tangent function, and it is normally used in the hidden layers of the network. Tanh behaves similarly to the Sigmoid function, with the difference of being symmetric around the origin, which means that the output can take either positive or negative values that are transported to subsequent layers [34]. In other words, its values range from -1 to 1. The formula is defined as:

$$f(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$$

Tanh is zero-centered and has gradients that are not restricted to vary in a certain direction. Tanh is shown in figure 2.6.

Figure 2.6: Tanh function

3- ReLU: ReLU stands for Rectified Linear Unit, and it is widely used in the hidden layers of ANNs, as it trains faster than other activation functions [37, 80]. Its values range over [0, ∞), which makes it useful for indicating whether a neuron is activated or not [70]: any negative input deactivates the neuron, whereas positive inputs produce corresponding positive outputs. The formula is defined below, and the plot is shown in figure 2.7.

$$f(x) = \max(0, x)$$

Figure 2.7: ReLu function

4- Softmax: The Softmax function is widely used in the output layer for classification, like the Sigmoid function. The difference is that Softmax can be used for multi-class classification, with output values ranging between 0 and 1 that indicate the probability of every individual class [70]. That is, if the model predicts c classes, the output layer of the network will have c neurons containing the probability of each class, and the probabilities sum to 1. Softmax is shown in figure 2.8.

Figure 2.8: Softmax function

5- Leaky ReLU: This function is similar to ReLU for positive numbers, but there is a slight difference for negative values, for which the function returns small values near zero by multiplying them by a small constant (e.g., 0.01) [70]. Although Leaky ReLU has been used in successful scenarios, there is no guarantee of consistent results [33]. The formula, plotted in figure 2.9, is:

$$\text{LeakyReLU}(x) = \begin{cases} \alpha x, & \text{if } x < 0 \\ x, & \text{if } x \geq 0 \end{cases} \qquad \alpha = 0.01$$

Figure 2.9: Leaky ReLu function

The following considerations are useful when choosing the activation function in an ANN design:

• The Sigmoid function is commonly utilized for binary classification problems [33, 70].
• The Softmax function is commonly used for multi-class classification problems [33, 70].
• The ReLU function is mostly used in hidden layers to avoid vanishing gradients [33, 37, 70, 87].
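As an illustration of the functions above, a minimal NumPy sketch follows; the function names are my own, not from the thesis:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into (0, 1); common in output layers for binary classification.
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centered variant of the sigmoid; values range over (-1, 1).
    return np.tanh(x)

def relu(x):
    # Returns 0 for negative inputs and the input itself for positive ones.
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    # Like ReLU, but keeps a small slope alpha for negative inputs.
    return np.where(x < 0, alpha * x, x)

def softmax(z):
    # Turns a vector of scores into class probabilities that sum to 1.
    e = np.exp(z - np.max(z))  # subtracting the max improves numerical stability
    return e / e.sum()
```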

2.5.2 Feed Forward Networks

The basic layout of an ANN can be described as a nonlinear mathematical function that transforms a set of input variables into a set of output variables [33]. A feed forward network is one in which the inputs at each layer are routed directly to the following layer until they reach the output layer, without any intermediate loops or cycles [52, 70].

The single layer perceptron model and the multi layer perceptron model are two types of architecture for feed forward networks.

The first and simplest neural model, proposed by Rosenblatt (1957), was the single layer perceptron model (SLP), also known as the single layer network [55, 89]. It is based on the McCulloch and Pitts [53, 54] neuron representation and consists of a collection of these neurons arranged together to form an ANN structure [55]. This model is shown in figure 2.10. An SLP contains only an input and an output layer [89]. The network receives m inputs, the output layer contains n neurons, and every single input is connected to each neuron in the output layer. This mapping is known as a fully connected layer design. The final weights are represented as an m × n matrix, where m is the number of input variables and n the number of connections each input has to the output layer. The matrix is utilized to apply a multiplication that calculates the final value, which is passed through an activation function to generate the output result. It is worth noting that the bias, a special case of an additional weight with a constant input value, is part of the inputs of the network [89].
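A minimal sketch of this fully connected mapping as a matrix multiplication, assuming m inputs, n output neurons and a sigmoid output activation (all values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                       # m inputs, n output neurons
W = rng.normal(size=(m, n))       # the m x n weight matrix described above
b = np.zeros(n)                   # one bias per output neuron (constant input of 1)

def slp_forward(x: np.ndarray) -> np.ndarray:
    """Single layer perceptron: activation(x @ W + b)."""
    z = x @ W + b                      # weighted sums for all n output neurons at once
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid activation

print(slp_forward(rng.normal(size=m)))
```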

Figure 2.10: Single layer perceptron model [55]

Although the single layer network provides a solution only to linearly separable problems and is not a suitable design for non-linear ones, this limitation is addressed by the multilayer perceptron model (MLP), which was proposed to enable non-linear solutions for more complex problems [33, 52, 55]. These networks are better known as deep neural networks (DNNs), and their design is based on one input layer, several stacked layers (hidden layers) that define the network's depth, and a final output layer [55, 70].

Figure 2.11: Multi layer perceptron model [55]

2.6 Training Neural Networks


The most relevant aspect of DL is data. Ideally, DL models require good-quality and sufficient data to find hidden patterns that allow them to correctly predict information from unseen data [70]. Therefore, when working with DL models, it is recommended to split the collected data into three main subsets: (i) training, (ii) validation, and (iii) testing [89]. The training dataset is used during the NN training procedure; the only condition is that this dataset must be sufficiently large for the model to generalize the solution. The testing dataset is used to measure the performance of the model. Finally, the validation dataset is used for tuning model performance, when there is enough data for such a split [89]. Some assumptions have to be considered, such as that the data in the three datasets is mutually independent and comes from the same probability distribution. These conditions establish a mathematical relationship among the datasets to determine how well the model performs in general at extracting the key hidden patterns in the data [70]. Lastly, it is important that training data is not present in the testing set; otherwise, the final performance of the model would not be reliable [89].
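As a sketch of this three-way split, assuming scikit-learn is available; the 70/15/15 ratio and the placeholder arrays are illustrative choices, not values prescribed by the text:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(100, 64, 64)    # placeholder: 100 images of 64 x 64 pixels
y = np.random.randint(0, 2, 100)   # placeholder: one binary label per image

# First carve out 30% of the data, then split it evenly into validation and testing.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.50, random_state=42)
# Result: 70% training, 15% validation, 15% testing, all drawn from the same distribution.
```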

2.6.1 Backpropagation

Once the network architecture is defined, training the network is an execution of the backpropagation algorithm. Backpropagation consists of two main stages that occur after the network weights are initialized: (i) the forward pass (from input layer to output layer) and (ii) the backward pass (from output layer to input layer) [72]. Both steps form a single iteration, and backpropagation is an iterative process whose aim is to discover which set of weights minimizes the error (or loss) between the real value and the network prediction [72]. Each complete execution of the backpropagation algorithm, including these two stages, is referred to as an epoch, and the training typically consists of a certain number of epochs [41].

Initializing the network weights is an important step before backpropagation. This step is crucial for the network to converge, because a poor initialization strategy, such as assigning arbitrary values, slows performance during training [25]. Therefore, different techniques are utilized in this regard, like Xavier [73] or He (Kaiming) [74] initialization.

Xavier initialization is a popular method for specifying the values of the network weights by drawing them from a uniform distribution of the form $weight \sim U\left[-\frac{\sqrt{6}}{\sqrt{i+o}}, \frac{\sqrt{6}}{\sqrt{i+o}}\right]$, where $i$ is the number of nodes feeding into the layer and $o$ is the number of output nodes of the layer [35]. Although this method works well in the majority of cases, He et al. [74] claimed that it does not work well with ReLU, and to cope with this problem a new method called He initialization was proposed. This method draws the weights from a Gaussian distribution, with the initial calculation given by $weight \sim G\left(0, \sqrt{2/n}\right)$, where $n$ is the number of inputs to the current node [35, 75].
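A brief sketch of these two strategies, here using PyTorch's built-in initializers (an illustrative choice, since no particular framework is prescribed at this point):

    import torch.nn as nn

    layer = nn.Linear(128, 64)

    # Xavier (Glorot) uniform initialization: U[-sqrt(6)/sqrt(i+o), +sqrt(6)/sqrt(i+o)]
    nn.init.xavier_uniform_(layer.weight)

    # He (Kaiming) initialization, designed for ReLU activations
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

    # The bias is commonly initialized to zero
    nn.init.zeros_(layer.bias)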

Forward pass. After weight initialization has been established in the network, the first step in the forward pass consists of processing the input data from the initial layer of the network to generate an output. In each layer, each neuron computes the weighted sum with its corresponding weights, as previously stated. Then, the neurons apply the specified activation function to generate an output value that is carried over to the following layer, where the same process is repeated until the final output value is obtained [72].

Backward pass. As it is very likely that the first network configuration of weights and biases will not produce the best network prediction, it is necessary to calculate the error (loss value) to know by how much the weights and biases need to be adjusted to reduce it, since the main goal of training is to minimize this error. Gradient descent is the algorithm that approaches the minimum of the loss function by considering its gradient. Gradient descent calculates the gradient and then uses it to determine how much the current weights of a neuron have to change, recalculating the gradient at each step and moving downhill towards the minimum of the loss function. Nevertheless, the process of updating the network weights is complex due to the multiple dependencies of neurons on previous layers. The chain rule from calculus is the approach for this calculation, as a change in the weights of one single neuron in the last layer implies an adjustment of the activation outputs of all the neurons in the previous layers, back to the input layer. Once this propagated adjustment is completed, another forward pass can be executed to reevaluate the loss function with the new set of weights and biases [72]. For this reason, backpropagation only works with activation and loss functions that are differentiable [72].
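A minimal PyTorch training loop illustrating one forward and one backward pass per iteration; the model, data, and hyperparameters here are placeholders, not the configuration used in this thesis:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
    loss_fn = nn.MSELoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    x = torch.randn(16, 10)  # dummy inputs
    y = torch.randn(16, 1)   # dummy targets

    for epoch in range(5):
        optimizer.zero_grad()          # clear gradients from the previous iteration
        prediction = model(x)          # forward pass
        loss = loss_fn(prediction, y)  # loss between prediction and ground truth
        loss.backward()                # backward pass: gradients via the chain rule
        optimizer.step()               # adjust weights and biases using the gradients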

2.6.2 Optimizers

Given that the gradient descent algorithm is computationally expensive, there are three modified versions, called optimizers, whose main difference lies in the amount of data used to calculate the gradient; depending on the amount of data utilized during training, either the accuracy of the model or the training time can be affected. These three versions are: batch gradient descent (BGD), stochastic gradient descent (SGD), and mini-batch gradient descent [76].

BGD updates the network parameters based on the entire training dataset; that is, a single parameter update per epoch, considering that the training process includes more than one epoch. While this approach works efficiently for small datasets because the model converges faster, large datasets require more time to converge, as more computational resources are needed [41, 76].

SGD is a method that estimates the real gradient in each epoch to accelerate training [31]. SGD is more effective than BGD due to its memory usage optimization, as it updates the network parameters with individual samples from the training dataset in every epoch. However, since it updates values more frequently, research has shown that it presents more instability when converging, as the steps towards the optimal solution can be noisy [31, 41, 76].

To address this issue, the mini-batch gradient descent method samples the training dataset into smaller groups called mini-batches and updates the network parameters based on the gradient computed on every mini-batch. The benefits of this calculation are a more stable convergence of the network and a more efficient use of memory and computational resources [41, 70, 76].
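The following NumPy sketch illustrates mini-batch gradient descent on a simple linear regression problem; all values are illustrative:

    import numpy as np

    rng = np.random.default_rng(seed=0)
    X = rng.normal(size=(1000, 3))
    true_w = np.array([1.5, -2.0, 0.5])
    y = X @ true_w + rng.normal(scale=0.1, size=1000)

    w = np.zeros(3)              # parameters to learn
    lr, batch_size = 0.1, 32     # hyperparameters

    for epoch in range(20):
        order = rng.permutation(len(X))                 # shuffle once per epoch
        for start in range(0, len(X), batch_size):
            batch = order[start:start + batch_size]     # one mini-batch
            error = X[batch] @ w - y[batch]
            grad = 2 * X[batch].T @ error / len(batch)  # gradient of the MSE loss
            w -= lr * grad                              # one update per mini-batch

Setting batch_size to the full dataset recovers BGD, while batch_size = 1 recovers SGD, which makes the trade-off between the three variants explicit.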

In a NN configuration, there are two relevant concepts to consider: parameters and hyperpa-
rameters. The former refers to the set of weights and biases that are discovered based on the
data during training, while the latter refers to the values estimated manually before training,
typically as a result of experience, trial and error, or heuristic rules [70, 72].

An essential hyperparameter called the learning rate plays a crucial role in determining the magnitude of the gradient-based adjustment [72]. The learning rate involves an important trade-off, and special attention is needed when setting this hyperparameter: a large value implies a larger adjustment size that causes the gradient to oscillate around the optimal solution without converging, producing fluctuating loss values, whereas a small value implies a smaller adjustment size, slowing down the training process and requiring more epochs to converge [70, 72].

Along with the learning rate, momentum is another hyperparameter that helps a model converge faster. Here, the idea is to add a factor to the calculated gradient that represents the direction and speed at which it should move during training, computed as an exponentially decaying moving average of previous gradients to maintain the gradient direction. The effect of momentum is that it helps avoid getting stuck in local minima, which is a desired behaviour in NN models [41, 70].

Given the relevance of the learning rate in the training process of a NN, researchers have developed optimization algorithms that modify this hyperparameter during training for optimal performance [70].

AdaGrad, for example, considers the history of the gradients to adjust the learning rate, scaling it in proportion to $\sqrt{\sum_{i=1}^{N} g_i^2}$, the accumulated squared gradients, to accelerate convergence. While this approach works well for certain DL models, the algorithm presents some drawbacks in more complex ones [70].

RMSProp is an enhancement of AdaGrad: it calculates an exponentially weighted moving average of the gradients, which improves performance by preventing a long history of gradient values from dominating the latest calculation, thereby maintaining a learning rate large enough to converge quickly. This addresses a problem that occurs in AdaGrad, where the learning rate becomes so small that the network does not converge fast [70].

Adaptive moment estimation (Adam) is an example that has proven to be widely used in DL, as it was originally developed for training NNs [41]. The key concept of this strategy is that the learning rate adjusts itself to improve the convergence speed and the accuracy of the model towards the optimal solution. Such behavior improves memory consumption and offers better management of computational resources. Adam combines the benefits of both momentum and RMSProp [41, 70].
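In frameworks such as PyTorch these optimizers are interchangeable components; a sketch with illustrative default parameter values:

    import torch

    params = [torch.nn.Parameter(torch.randn(5))]

    sgd = torch.optim.SGD(params, lr=0.01)                         # plain SGD
    sgd_momentum = torch.optim.SGD(params, lr=0.01, momentum=0.9)  # SGD with momentum
    rmsprop = torch.optim.RMSprop(params, lr=0.001)                # RMSProp
    adam = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))  # Adam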

2.6.3 Loss Functions

NNs use the gradient descent algorithm to learn a particular objective, as explained before [29]. As the loss function is utilized in the gradient descent algorithm, selecting an appropriate loss function for the learning objective of the network is key. For instance, in the medical domain, problems are related to classification, semantic segmentation, or instance segmentation. There are several well-known and well-researched loss functions that work well in various cases:

1. Binary Cross Entropy (BCE): Cross-entropy is defined as a measure of the difference between two probability distributions for a given random variable or set of events. It is widely used for classification, and in image segmentation problems the classification is performed at pixel level to determine whether a pixel is active or not [29]. A use case of this function mentioned in [29] is when the data has equally distributed classes.

2. Focal Loss: This is a variation of BCE. The main difference lies in a multiplicative factor that reduces the contribution of easy examples and enables the model to focus more on learning difficult examples. It typically works well in highly imbalanced class scenarios, like medical images [29].

3. Dice Loss Coefficient (DLC): The Dice coefficient calculates the similarity between two
images and it can also be used as an error function for training NNs. This function is widely
used in image segmentation problems [29].

In particular, medical images present a critical problem related to the ratio between background and foreground. On the one hand, only a small portion of the image generally corresponds to the patient's cerebral lesion during the whole scanning interval. On the other hand, the remaining area is occupied by the background, which is not of interest for prediction [2]. Thus, the difference between these areas has a significant effect during training for two main reasons: (i) it constitutes a class-imbalance problem, since only a small portion of the image corresponds to the lesions, and (ii) it can distort the accuracy of the model when the accuracy metric is not selected wisely. In order to deal with such problems in this thesis, a loss function that combines Binary Cross Entropy (BCE) and the Dice Loss Coefficient (DLC) is selected. BCE calculates the probability that the predicted output is similar to the ground truth, with a value ranging between 0 and 1, with 1 being the highest probability, whereas DLC measures the similarity between the two images regardless of the class-imbalance pixel problem [2]. Hence, the final loss function is calculated as the sum BCE + DLC.
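A sketch of such a combined loss in PyTorch is shown below; this is one common formulation (BCE plus one minus the Dice coefficient) and not necessarily the exact implementation used later in this thesis:

    import torch
    import torch.nn.functional as F

    def bce_dice_loss(pred, target, smooth=1e-6):
        # Binary cross entropy between predicted probabilities and the ground truth
        bce = F.binary_cross_entropy(pred, target)
        # Dice coefficient measures the overlap between prediction and ground truth
        intersection = (pred * target).sum()
        dice = (2.0 * intersection + smooth) / (pred.sum() + target.sum() + smooth)
        return bce + (1.0 - dice)  # Dice loss is 1 - Dice coefficient

    # Dummy probability maps and binary lesion masks
    pred = torch.rand(4, 1, 224, 224)
    target = (torch.rand(4, 1, 224, 224) > 0.5).float()
    loss = bce_dice_loss(pred, target)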

2.6.4 Regularization

Although training the model could provide accurate results, the main goal of a NN model is to make accurate predictions on unseen data, especially on the testing dataset. The goal is to analyze whether the model generalizes the solution to the problem or not. One way to prove that a model generalizes well is to run predictions on the testing dataset and compare the testing error (generalization error) against the training error. A low training error and a small difference between the training and the testing loss are the two main aspects that indicate good model generalization [70].

Such aspects help identify two main issues that occur while training NNs: underfitting and overfitting [60, 70]. The former occurs when the model is not able to acquire key aspects of the data, either due to a lack of complexity in its design or a failure to find the optimal solution, as it may get stuck in a local minimum. It can also happen that, due to a limited number of epochs, the model does not converge. To put it simply, there is no low error on the training dataset [70]. On the contrary, the latter occurs when there is a lack of generalization and the training error is small in comparison to the testing error. In other words, there is a huge gap between the training error and the test error [70]. Such a scenario occurs when the model is highly complex and has a high number of parameters. A lack of sufficient data may lead to this issue as well, since the model can provide apparently good solutions based on features that are not relevant, typically called noise. In that case, the model explains such noise instead of genuine data patterns, and the gap between the training and the testing error tends to increase. Figure 2.12 shows an example of these two cases.

The capacity of a model contributes to overfitting or underfitting. The capacity of a model is determined by three characteristics: (i) the architecture of the model (how deep it is), (ii) the selection of an appropriate loss function, and (iii) the degree of regularization in the model, according to the training and the loss function [70]. Defining the capacity of a model also depends on the complexity of the learning task. On the one hand, models with higher capacity on non-complex tasks tend to overfit more, as they extract and memorize more complex features than required for the testing dataset. On the other hand, models lacking capacity on complex tasks are likely to underfit, as they are not able to acquire the relevant features in the training dataset to solve the problem [70].

Figure 2.12: Underfitting and overfitting [93]

Regularization techniques also equip the model to avoid such problems during training. Dropout [25, 61], for example, is a regularization method implemented at the model definition level by randomly cutting the connections of some neurons to subsequent layers with a certain probability. As a result, the parameter matrix will have several values set to 0 that disable neurons in hidden layers, preventing the memorization of features [29, 61, 71]. Another regularization method is weight decay [71], which consists of penalizing the loss function with an additional term to reduce the parameter space. Its implementation helps the model to tolerate noisier data [70, 71]. Its formula is $loss = Loss_{function} + \lambda w$, where $\lambda$ represents the proportion of the penalization towards smaller weights, ranging from 0 to 1, and $w$ represents the network weight value.

Figure 2.13: Dropout effect [61]
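Both techniques are typically a one-line addition in practice; a PyTorch sketch with illustrative values:

    import torch
    import torch.nn as nn

    model = nn.Sequential(
        nn.Linear(128, 64),
        nn.ReLU(),
        nn.Dropout(p=0.5),  # randomly disables half of the activations during training
        nn.Linear(64, 1),
    )

    # Weight decay (the L2 penalty term) is passed directly to the optimizer.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)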

Finally, an optimizer is the component responsible for applying the parameter updates derived from the backpropagation algorithm in a NN. There are plenty of optimizers available, and they require the learning rate to update the network's parameters [70].

The training procedure for NNs is the same regardless of their architecture. In the following section, convolutional neural networks are described, as they are the preferred first approach for dealing with images.

2.7 Convolutional Neural Networks


A convolutional neural network (CNN) is a type of neural network architecture aimed at classifying and identifying image data [77, 80]. A classification occurs when a mapping between an input and a label exists; that is, if a dog image and a cat image are the input, the corresponding labels are "dog" and "cat", respectively. This is what differentiates CNNs from other NNs, as CNNs excel at extracting patterns from unstructured data, such as images, making them the preferred method for solving visual recognition tasks, like image segmentation [11].

In particular, CNNs perform an operation called convolution, which is essentially the product between two functions that generates a third one containing more information than the previous two. CNNs define several kernels that are convolved with the image to generate outputs that extract the relevant features from the image, based on the desired learning task. These kernels, present in all convolutional layers, represent the trainable weights that minimize the loss function during network training, providing the optimal solution for the problem [43, 77].

A typical CNN architecture is composed by the following three types of layers:

• Input layers: This layer represents the input image of the network, whose number of channels may vary according to the type of image fed into the model [77]. For example, a gray-scale image has only one channel, whereas a colored image has three channels that correspond to RGB.
• Hidden layers: These layers are the inner part of the network, and their arrangement is commonly formed by three components: (i) convolutional layers, (ii) pooling layers, and (iii) fully connected layers [33].
(i) Convolutional layers. As stated before, in this type of layer a kernel, which in practical terms is a matrix containing the weights, slides over the image horizontally and vertically (convolution) to generate various feature maps. This feature generation is the result of a matrix multiplication between the kernel and the portion of the image enclosed by the kernel. The importance of these kernels is that they are adjusted during backpropagation. Perhaps the key fundamental characteristic of these layers is that all neurons share the same weight values. The concept of shared weights refers to having identical values of weights and biases for a group of neurons in a convolutional layer [37]. Such groups are called planes, and each plane produces a feature map, resulting in several feature maps per layer. As a result, the same filter pattern can be detected in any part of the image, which is especially useful in object detection, and the network has a significantly lower number of final parameters [43, 77].
The hyperparameters in this layer are the kernel size, padding, and stride. Kernel size defines how big the kernel is for the convolution. This parameter is critical to finding the optimal solution for the learning task. Larger kernels cover more image space but extract less information, leading to poor performance despite quicker convergence of the network. Smaller kernels extract relevant features as they are more region-specific, and they allow a deeper layer arrangement. Padding maintains the input image size by adding extra information around the border of the image to cover the areas where the kernel goes beyond the original size of the image during the convolution. Adding zeros (zero-padding) is a widely used practice for padding in CNNs, as there is no significant computational load when zeros are added to the image. Stride controls the number of pixels by which the kernel shifts during the convolution with the image. A larger stride reduces the number of features extracted as well as the output image dimensions, while a smaller stride increases both the number of features extracted and the output image dimensions [77].
(ii) Pooling layers. These layers perform a step called downsampling, which consists of losing some information to reduce the size of the input [43, 77, 80]. This data reduction is key to avoiding overfitting and reducing the computational load for subsequent layers of the model, as there are fewer parameters to train. Average pooling and max pooling are the standard strategies in these layers. The difference lies basically in the mathematical function that each type applies to the extracted features (average or max, as their names suggest) to achieve this data reduction. The hyperparameters in this layer are the kernel size and the stride [43, 77].

(iii) Fully connected layers. This type of layer is utilized to reduce dimensionality in either
hidden layers or the output layer by having connections from all activated neurons in the
previous layer to all neurons in the current one [80]. Essentially, it reduces all dimensions
to one, to either allow further processing or to determine the final number of classes in a
classification problem [43, 77].
• Output layers: This layer takes the form of a fully connected layer, as it is here where the classification occurs. The number of features represents the number of classes in the learning task, and an activation function like sigmoid or softmax is implemented [77].

Due to the previously stated concepts, CNNs reduce the computational processing needed to analyze images, compared to other DL architectures [80].
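A minimal CNN combining the three layer types described above might look as follows in PyTorch; the channel counts are illustrative, assuming a 224 × 224 gray-scale input:

    import torch.nn as nn

    class SimpleCNN(nn.Module):
        def __init__(self, num_classes=2):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1),  # convolution with zero-padding
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),                 # downsampling: 224 -> 112
                nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.MaxPool2d(kernel_size=2, stride=2),                 # downsampling: 112 -> 56
            )
            self.classifier = nn.Linear(32 * 56 * 56, num_classes)     # fully connected output layer

        def forward(self, x):
            x = self.features(x)
            x = x.flatten(1)          # reduce all spatial dimensions to one
            return self.classifier(x)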

Figure 2.14: Example of a CNN diagram [41]

2.8 Model Evaluation Methods


In image segmentation problems, metrics such as the Dice score (DSC) [11] and Intersection over Union (IoU) [11] are two commonly used indicators to report network performance and measure the final results. DSC is widely used in image segmentation tasks, as it calculates twice the overlap area of the prediction and the ground truth, divided by the total number of pixels in both images. IoU, also known as the Jaccard index, calculates the area of intersection between the prediction and the ground truth, divided by the area of their union. The value ranges from 0 to 1 in both metrics, where 1 indicates complete similarity between the images.

$$\mathrm{IoU} = J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

$$\mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|}$$
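Both metrics can be computed directly from binary masks; a NumPy sketch with dummy masks:

    import numpy as np

    def iou(pred, truth):
        intersection = np.logical_and(pred, truth).sum()
        union = np.logical_or(pred, truth).sum()
        return intersection / union

    def dice(pred, truth):
        intersection = np.logical_and(pred, truth).sum()
        return 2.0 * intersection / (pred.sum() + truth.sum())

    # Dummy binary masks for prediction and ground truth
    pred = np.random.rand(224, 224) > 0.5
    truth = np.random.rand(224, 224) > 0.5
    print(iou(pred, truth), dice(pred, truth))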

In addition, typical metrics for measuring performance are Accuracy, Precision, Recall and F1
Score [11]. Such metrics are easier to understand by looking at the following confusion matrix:

Figure 2.15: Confusion Matrix

The matrix is useful to compare the ground truth and the prediction of the model. To simplify the explanation, consider a binary classification problem where the result is only a positive or a negative class. Each quadrant of the matrix represents a specific region: true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). TP is the number of positive cases predicted by the model that are indeed labeled as positive in the dataset. FP is the number of positive cases predicted by the model that are actually negative in the dataset. FN is the number of cases that the model predicted as negative but that are actually positive in the original dataset. Finally, TN is the number of negative cases predicted by the model that are negative in the dataset.

In addition, the formulas for accuracy, precision, recall, and F1 score can be expressed with these terms:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\mathrm{F1\ score} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{TP}{TP + \frac{FP + FN}{2}}$$

Accuracy indicates the proportion of correctly predicted cases divided by the total number of samples. Nevertheless, accuracy is not a reliable metric, especially on datasets that present a class-imbalance problem, like ATLAS [15]. An alternative is precision, which represents the ratio of correctly predicted positive samples of the model with respect to the total number of positive predictions, including incorrect ones. Recall measures the proportion of correctly predicted positive samples with respect to the total number of actual positive cases in the dataset (whether they were predicted as positive or negative). The F1 score is the harmonic mean of precision and recall.
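These four metrics follow directly from the confusion-matrix counts; a small sketch with hypothetical counts:

    def classification_metrics(tp, fp, fn, tn):
        accuracy = (tp + tn) / (tp + tn + fp + fn)
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        return accuracy, precision, recall, f1

    # Example with hypothetical counts
    print(classification_metrics(tp=80, fp=10, fn=20, tn=90))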
Chapter 3. Related Work in Medical Semantic Segmentation
Since the last decade, CNN layers have been the primary approach to extracting features from images, especially when they are arranged together and combined with other types of layers to increase the efficiency of the feature extraction procedure. This layout receives the name of architecture, and CNN architectures have performed remarkably well in classification and image segmentation tasks [7, 12]. A review of different architectures based on their design pattern is provided below.

3.1 Filter Size and Depth Based Models


During the first decade of the 2000s, researchers conducted multiple experiments to improve CNNs' performance. One of the most prominent findings at that time was related to the filter size and the depth of the network. First, they realized that a bigger filter extracts high-level features, while small filters obtain more refined, low-level features [70, 80]. Second, a larger number of hidden layers allows the network to approximate more complex functions, which leads to better network performance [80]. Thus, the filter size and the depth of CNN architectures are correlated with the network's accuracy [80].

Krizhevsky et al. (2012) proposed AlexNet, a CNN architecture formed by five convolutional layers using ReLU as their activation function and three fully connected layers, in which the last one applies a softmax activation function for object classification [25, 38]. The relevance of this model is that it represents a prominent milestone in machine learning history, prompting more researchers to further develop DL CNN architectures for visual recognition tasks [25].

In 2015, Simonyan and Zisserman proposed the Visual Geometry Group Network, better known as VGGNet, an architecture that utilizes stacks of two convolutional layers followed by a max pooling layer, with three fully connected layers at the end of the network, where the last one implements a softmax activation function for classification. The filter sizes it uses are set to 3 × 3. This model proved that the more layers are added to the network, the better its performance is in recognition or classification. Nevertheless, the depth of a network also increases its complexity, leading to a longer training time, as more parameters need to be updated [25, 38].

Increasing the number of parameters also leads to occupying more computational resources for training the network, an issue that was tackled by Szegedy et al. (2015) when they proposed GoogLeNet v1, a model that increased the number of receptive fields by using several convolutional layers at the same level with different filter sizes (1 × 1, 3 × 3, 5 × 5) and a 3 × 3 max pooling layer, where the final results are concatenated to prevent the network from increasing the computational load. The final version is a 22-layer design with fewer parameters (7M) than the previous models, such as AlexNet (60M) and VGGNet (138M) [25, 38].

3.2 Residual Based Models


The goal of these architectures is to avoid the vanishing gradient problem in deep CNN architectures by adding residuals, which are shortcut connections between layers that expand the information shared across layers and prevent layers from receiving invalid information to process as the network goes deeper [41, 80].

In 2015, He et al. proposed an architecture consisting of varying numbers of convolutional layers (18, 34, 50, 101, 110, 152, 164, and 1202) [38]. ResNet50 is one of the most popular models, and its design is based on feeding the input of one convolution with the output of the previous two convolutions. The benefit of this model is that it strengthens and reuses feature propagation along the network, achieving better performance with fewer parameters [38].
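The shortcut-connection idea can be sketched as a small PyTorch module; this is a simplified residual block, not the exact ResNet50 bottleneck design:

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.conv1(x))
            out = self.conv2(out)
            return self.relu(out + x)  # shortcut: the input skips the convolutions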

Zhang et al. proposed ResU-Net [45], a U-Net architecture enhanced by adding residuals in both the encoder and decoder layers of the model to avoid the degradation of performance due to the depth of the network. In addition, it uses skip connections to retain more contextual information and transfer it from the encoder to the decoder block for better results [41]. Although this architecture was originally used for road extraction tasks, it has been used within the medical domain as well, given that its U-Net architecture pattern serves well for image segmentation tasks. Figure 3.1 shows the ResU-net architecture.

Jha et al. proposed an enhanced version of ResU-net called ResU-net++ [46], which adds three main components to the architecture design: (i) a residual block that propagates information to the decoder block and combines it with an attention module in every layer without increasing the network parameters; (ii) a squeeze-and-excitation block, which serves to keep the most relevant features in the model right after the residuals in the encoding part; and (iii) an atrous spatial pyramid pooling block that applies atrous convolutions to enlarge the field of view of the kernel without adding more parameters, extracting multi-scale information more precisely for better performance. This module represents the connection between the encoder and the decoder of the network, and it is used at the very last layer of the network before the final classification part [46]. Figure 3.2 shows the ResU-net++ architecture.

3.3 Encoder-Decoder Based Methods


Figure 3.1: ResU-net [45]

Whereas the previous architectures focus on solving classification problems, due to the kind of output layer they use (i.e., a fully connected layer), there are other fields, especially in medicine, in which the output of the network should also be an image (semantic segmentation). For this reason, different CNN architectures have been proposed, and one of them is known as the encoder-decoder architecture.

The encoder-decoder architecture is a proportional two-stage structure, and its advantage relies on the individual task that each part of the symmetrical network executes [11, 39, 80]. First, the encoder is composed of a set of convolutions in each layer that generate a multi-channel feature map to extract and learn the complex features of the image. It then applies pooling layers to reduce the spatial dimension of the image as it goes deeper into the network. Once the network reaches its bottom, the opposite operation takes place, that is, upsampling convolutions on each layer to recover the original size of the image at the end. Finally, most of these methods transfer information from the encoder to the decoder by using skip connections on all layers.

Ronneberger et al. proposed U-Net [6], a 2D architecture that follows this proportional pattern to perform pixel-level classification; this network was originally designed for biomedical images, and its outstanding results made it a primary option for other computer vision areas as well. U-Net is formed by four layers conforming the downsampling part, each of which includes two convolutional layers followed by a 2 × 2 max pooling layer. The result is used both as a skip connection for the corresponding level in the decoder and as input for the subsequent level of the network. The upsampling part, the decoder, is also formed by four layers, each performing two modules: (i) the concatenation between the skip connection and the result of the transpose convolution, and (ii) two 3 × 3 convolutional layers for feature extraction after the concatenation of step (i). The original network receives an input of 572 × 572 pixels and generates an output of 388 × 388 that labels all pixels of the image. It has obtained prominent segmentation results in medical images, winning a challenge on cell tracking with microscopy images in 2015 [39]. Figure 3.3 shows the U-Net architecture.

Figure 3.3: 2D Unet [6]
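The repeating building block of this design can be sketched in PyTorch as follows; this is a simplified, padded variant, whereas the original U-Net uses unpadded convolutions and crops the skip connections:

    import torch.nn as nn

    def double_conv(in_channels, out_channels):
        # Two 3x3 convolutions with ReLU: the basic U-Net feature-extraction block
        return nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    encoder_level = double_conv(1, 64)        # feature extraction
    downsample = nn.MaxPool2d(kernel_size=2)  # 2x2 max pooling between levels
    upsample = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)  # transpose convolution
    decoder_level = double_conv(128, 64)      # applied after concatenating the skip connection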

One of the main advantages of U-Net is the skip connections between the encoder and the decoder to retain contextual information. Although U-Net has been the foundation of several 2D improvements, Çiçek et al. [3] proposed a 3D U-Net to leverage the spatial information of 3D images such as MRI or CT to segment diseased tissues. The encoder design contains three layers; each layer has a 2-convolution pass where the last convolution doubles the number of features from the previous convolution. It applies batch normalization and ReLU as the activation function, and then a 2 × 2 × 2 max pooling layer reduces the spatial dimension of the image [39]. The decoder also has three layers, each performing a 2-level convolution: one on the combination with the skip connection of the same level in the encoder, and another on the result of the predecessor layer with the same dimension as the skip connection of the next level. Finally, the last layer implements a softmax function to classify pixels. It is worth noting that the loss function they used was the weighted cross entropy, to tackle the issue of the background pixels outnumbering the area of interest [39]. Figure 3.4 shows the 3D U-Net architecture.

Figure 3.4: 3D Unet [3]

Similarly, SegNet [9] is another proportional design, whose encoder is a VGG16 network used in object analysis. The idea of embedding an existing network as part of a new architecture is called transfer learning, and its benefit is that it contributes to solving the problem in a more efficient way [28]. SegNet uses the first 13 layers of VGG16 to generate the feature activation maps. After this stage, the decoder is responsible for generating the final result by implementing upsampling in pairs of convolutions followed by batch normalization and ReLU as the main activation function. The last layer implements the softmax function to classify pixels according to the label of the object they represent, producing the final segmentation mask [39]. Figure 3.5 shows the SegNet architecture.

Figure 3.5: SegNet [9]

One of the main drawbacks of the previous models is the huge number of trainable parameters they need to calculate, an aspect that considerably reduces performance during inference, which is a crucial requirement for deploying models in real-case scenarios. For this reason, K. Qi et al. proposed X-Net [7], a model that enhances the previous models by (i) using depthwise separable convolutions instead of common convolutions, and (ii) implementing a new module at the bottom of the network, called the feature similarity module, to retain long-term contextual information given the differences in size and location of strokes [12].

The difference introduced by the depthwise convolution is that it implements a normal convolution over every channel of the input feature map independently. For example, on an RGB image, three independent convolutions are performed, one on every channel of the input. Furthermore, this model implements a block called "X", formed by three convolutional layers, each of which starts with a depthwise convolution with a 3 × 3 kernel, batch normalization, and ReLU, followed by a convolution with a 1 × 1 kernel, batch normalization, and ReLU. At the end of the block, it utilizes residuals to concatenate the result of the previous step with the result of a single convolution with a 1 × 1 kernel, batch normalization, and ReLU, to generate the input feature map for the following layer. In essence, there are four layers, each formed by an X block that doubles the number of features from the previous layer and a 2 × 2 max pooling layer to reduce the image resolution. An additional X block is implemented before reaching the bottleneck of the network; the bottleneck is where the feature similarity module resides, and it performs a 3 × 3 convolution on the input feature map and then splits it into three different convolutions. The first two results are concatenated and passed through a softmax function, and then combined with the third result to perform a 3 × 3 convolution. The final result is concatenated with the original input of the module to leverage residuals and transfer more information [39]. X-Net achieved better accuracy results in its segmentation tasks in comparison to previous models, as it noticeably reduced the trainable parameters of the model to accelerate the training and inference process with new data [7]. X-Net ranks second in image segmentation on the ATLAS [15] dataset, with a reported DSC of 48.7%. Figure 3.6 shows the X-Net architecture.

Figure 3.6: X-Net [7]
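The parameter saving comes from factorizing a standard convolution into a per-channel and a pointwise step; a PyTorch sketch of the depthwise-separable pattern described above (a simplified version, not the full X-Net block definition):

    import torch.nn as nn

    def depthwise_separable_block(in_channels, out_channels):
        return nn.Sequential(
            # Depthwise: one 3x3 convolution per input channel (groups=in_channels)
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1, groups=in_channels),
            nn.BatchNorm2d(in_channels),
            nn.ReLU(),
            # Pointwise: a 1x1 convolution that mixes the channels
            nn.Conv2d(in_channels, out_channels, kernel_size=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(),
        )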

Milletari et al. proposed the 3D V-Net [9], an architecture similar to the 3D U-Net, with the main differences that it implements residuals in every layer of both encoder and decoder, increases the number of convolutions as the network goes deeper (that is, one, two, and then three convolutions stacked on each other), and doubles the number of features from the previous layer. After the bottleneck of the network, it groups the upsampling convolutions by reducing their number from three back to one; it also leverages skip connections to retain contextual information as in U-Net, and the last layer uses a softmax activation function to classify the pixels in the image. Despite its good results in the automatic segmentation of prostate images [10], this model has been extended to other types of images within the medical sector. Both the input and output dimensions of the original network are 128 × 128 × 64 [39]. Figure 3.7 shows the V-Net architecture.

Figure 3.7: V-Net [9]



Zhou et al. proposed D-UNet [2], a model that harnesses the power of both 2D and 3D convolutions to transfer information between layers in the early encoding stage. D-UNet also uses an enhanced mixing loss function to deal with the class imbalance problem while ensuring the similarity between the predicted image and the ground truth. The network has two branches: one receives 192 × 192 × 4 2D images and includes three layers. The first layer performs a 2-group of 2D convolutions followed by batch normalization; then it applies a 2 × 2 max pooling layer, followed by another group of 2D convolutions and batch normalization. The second branch receives a 192 × 192 × 4 × 1 input, on which it performs a 2-group of 3D convolutions followed by batch normalization. Then it applies a 3D max pooling layer, and the result is also passed through the group of 3D convolutions and batch normalization. Here, the branches are combined by implementing a dimension transform block, responsible for reducing the spatial dimension of the 3D image by utilizing a 1 × 1 × 1 convolution, passing it through 3 × 3 2D convolutions, and combining the final result with the 2D image [39]. D-UNet ranks first in image segmentation on the ATLAS [15] dataset, with a DSC of 53.5%. Figure 3.8 shows the D-UNet architecture.

Figure 3.8: D-UNet [2]

Finally, Lou et al. proposed the Context Axial Reverse Attention Network (CaraNet) [8], which implements several elements in its design: (i) a backbone, (ii) a partial decoder, (iii) a channel-wise feature pyramid, and (iv) axial-reverse attention. The backbone is the pre-trained Res2Net model, which allows the design to leverage the residual network pattern together with the feature extraction obtained from the visual representation learned on the ImageNet dataset. The partial decoder module generates a global map of features that approximately locates the position of medical objects; the channel-wise feature pyramid module extracts multi-scale features from the pre-trained model; and the axial-reverse attention module implements attention more efficiently by reducing two dimensions to one, saving computational resources in comparison to a standard self-attention block. The contribution of this model is to improve the prediction of small lesions in the brain, a task where the majority of existing methods fail [8], [18]. Figure 3.9 shows the CaraNet architecture.

Figure 3.9: CaraNet [8]

3.4 Attention Based Models


The idea of these architectures is to mimic the effect of attention in human sight, which consists of focusing on specific details in an image, ignoring non-relevant elements, and using context-related information to gain more understanding. Similarly, CNNs can focus on specific regions of the image to extract more relevant features that contribute to the final learning task [80].

Oktay et al. proposed a model called Attention U-Net [46] that incorporates a soft-attention gate as part of the U-Net network design to improve the focus on particular features, applied to pancreas image segmentation. The attention gates are added as input to each layer of the decoder part to refine the information coming from the skip connections, while considering the activation features of the same layer. The idea of attention is that the most relevant image regions generate more specific coefficients, without excessively increasing the number of trainable parameters. Figure 3.10 shows the Attention U-Net architecture.

Figure 3.10: Attention U-Net [46]

3.5 Transformer Based Models


In 2022, Hatamizadeh et al. [56] proposed a network called SwinUNETR, a recent architecture with hierarchical encoders based on transformers (a successful architecture consisting of self-attention modules to learn long-term relationships, originally used in Natural Language Processing (NLP) but extended to the medical setting with image segmentation applications [58, 59]). SwinUNETR was also developed to assist the medical sector in brain tumor segmentation, and it is based on the encoder-decoder architecture with skip connections between each layer of both elements. First, the network generates a patch partition to obtain linear elements, similar to the input of NLP problems, and then each partition goes through a self-attention module with a certain number of features that is doubled in subsequent layers. The decoder then processes the linear results, transforms them with convolutional layers, and combines information from the encoder's layers through skip connections. Each layer of the decoder uses residuals to maintain more contextual information. According to its authors [56], this model was tested on brain tumor segmentation, specifically with the BraTS 2021 dataset, where it achieved remarkable performance on the testing dataset.

Figure 3.11: SwinUNETR [56]

3.6 Object Detection Based Models


Another type of model for identifying objects is the one-stage network, so called because it achieves this goal in a single-step process. Examples of this type are YOLO and SSD.

YOLO (v1) was proposed by Redmon et al. in 2016, utilizing a feature pyramid network to make predictions at subsequent levels at once. YOLO first splits the image into grid cells and calculates the probability of any object being present in each cell, generating a bounding box around the object based on this [25]. Its architecture uses 24 convolutional layers with 2 fully connected layers at the end. Despite its remarkable object detection speed, YOLOv1 is not suitable for critical detection systems in real-time applications [38, 39].

Figure 3.12: YOLO [8]



In 2017, Redmon and Farhadi proposed YOLOv2, an enhancement over the previous version that expands the number of categories to 9,000, utilizes batch normalization, uses high-resolution images during the training process, and predicts offsets rather than coordinates. At this point, YOLOv2 was able to outperform models such as R-CNN, ResNet and SSD for critical detection systems. One year later, Redmon and Farhadi proposed YOLOv3, selecting Darknet-53 as its backbone and implementing residual layers as in ResNet, achieving higher detection speed. The versions following YOLOv3 were simply improvements of this version [39].

In 2020, Bochkovskiy et al. proposed YOLOv4, which includes a combination of features to achieve higher accuracy, such as weighted residual connections (WRC), cross-stage partial connections (CSP), cross mini-batch normalization (CmBN), self-adversarial training (SAT), and Mish activation. They also applied tricks including Mosaic data augmentation, DropBlock regularization, and CIoU (Complete-IoU) loss [38].

In the same year, Long et al. (2020) proposed PP-YOLO, an enhanced version of YOLO that replaces the Darknet-53 backbone with a ResNet50-vd, increases the batch size to 196, and adds IoU Loss, Grid Sensitive, and IoU Aware to increase mAP by 1.7% (to 45.2%) [39].

Similarly, Liu et al. (2016) proposed an architecture called SSD (Single-Shot Detector). SSD is formed by two elements: (i) a backbone, a pre-trained CNN like ResNet used for feature extraction, and (ii) an SSD head, a set of convolutional layers for detecting the object class and location. SSD splits the image into grid cells, where each cell is responsible for detecting the objects it contains. Each grid cell can be associated with one or more anchors, whose main task is to be flexible enough for the object's properties, such as shape and location. In this way, it can detect multiple objects with different shapes regardless of their position in the image. Thus, SSD has been used in breast cancer detection, obtaining better performance than other similar algorithms. Fu et al. (2017) and Cai et al. (2016) tried to improve the accuracy of SSD without prominent results. However, the Deconvolutional Single Shot Detector (DSSD) performs better at small-object detection [25], [39].

Figure 3.13: SSD [8]

Lin, Goyal et al. (2017) proposed RetinaNet, given that some of the previous architectures did not tackle the problem caused by class-imbalanced datasets. For this reason, RetinaNet uses focal loss as an alternative to deal with this issue during training. It uses ResNet and a Feature Pyramid Network (FPN) and, according to its authors, outperformed Faster R-CNN with FPN when tested on the COCO dataset [39].

Figure 3.14: Retina Net [8]

3.7 Region Based Models


Another approach with outstanding results in visual recognition tasks is R-CNN (Region-based Convolutional Neural Network), proposed by Girshick et al. (2014). First, R-CNN determines a series of potential object regions in the image, then extracts the features from these regions with convolutional layers, and finally encloses the objects with bounding boxes [25, 39].

In 2015, Girshick proposed another method called Fast R-CNN to address the poor performance of his previous design. In this new architecture, the network both classifies objects and predicts their location with bounding boxes simultaneously. The regions of interest (RoI) are determined with a selective search algorithm; RoI pooling layers then extract the features from them, which are fed into a classifier layer that produces the class labels of the objects and a bounding-box regressor that determines the coordinates of the objects, meaning that these two modules are merged to reduce training and testing time [25, 39].

Ren et al. (2017) proposed Faster R-CNN, in which a NN called the Region Proposal Network (RPN) encapsulates all previous steps in one, with the aim of using it in real-time object detection applications [39]. Nevertheless, the performance was not optimal, and for this reason Dai et al. (2016) proposed a Fully Convolutional Region-based detector (R-FCN) with a score map sensitive to the object position. This design produced an accuracy similar to Faster R-CNN, but with less computational time [39].

He et al. (2017) proposed the Mask R-CNN model, which, along with classification and bounding-box detection, adds a mask detection module for semantic segmentation. It replaces RoI pooling with RoI align, improving accuracy by 40% over previous models [25], [39].

Law and Deng (2018) proposed a method called CornerNet, which implements a new anchor-free model that helps (i) to address class-imbalance issues by using focal loss and (ii) to add more hyperparameters, such as the size, aspect ratio, and number of anchors, to achieve better accuracy. CenterNet (Zhou et al., 2019), FCOS (Tian et al., 2019), and RepPoints (Yang et al., 2019) are methods inspired by CornerNet [39].

Figure 3.15: Mask R-CNN [8]

3.8 Other Models


In 2017, Huang et al. proposed the DenseNet architecture, which interconnects every single layer with all its successor layers in the model. This interconnection, as in Residual Networks (ResNet), strengthens and reuses feature propagation, which diminishes the number of parameters, enabling a more efficient model for feature extraction [38]. Its design consists of a series of 3 × 3 convolutional layers, each performing batch normalization, and 2 × 2 average pooling. The last layer in the network receives the name of transition layer, and it performs a 1 × 1 convolution with batch normalization and a 2 × 2 average pooling layer. The key concept of this model is to reduce the number of parameters by enabling feature reuse [25, 38].
Chapter 4. Methodology and Implementation
In this chapter, the following aspects are considered: first, a general overview of the methodology in ML projects is explained; then, the complete data pipeline of the project is described to ensure reproducible results. In addition, the annotated dataset is covered in more depth given its relevance to this project, as is the medical image exploration process, including all preprocessing steps for MRI. Finally, the design of the proposed CNN architecture is shown.

4.1 Machine Learning Operations


Machine Learning operations (MLOps) is an approach widely adopted in ML projects that
incorporates principles of both development operations (DevOps) and ML pipelines.

Figure 4.1: DevOps [68]

DevOps refers to an agile software development practice whose purpose is to combine the effort of the software development team with that of the business operations team to deliver rapid changes to the software product [64, 65, 66, 67]. While the software development team generally follows a practice that consists of planning, coding, testing, building, and releasing software, the operations team focuses more on configuring, monitoring, and operating the system to request customer feedback and improve the final product. The idea is to perform this joint effort in short periods of time called sprints with a fully automated process, which is based on two main concepts: continuous integration and continuous delivery. Continuous integration is an approach to integrate the independent contributions of the developer team during each sprint into the main component, that is, the place that contains the final version of the code. The idea is to automate this part to increase the team's productivity and efficiency and deliver high-quality results [64, 65]. Continuous delivery is an approach to ensure automated software releases that respond quickly to user feedback with incremental updates; the goal is to minimize cost, time, and deployment risk without impacting the final application [64, 65]. The benefit of both concepts is that software systems can easily scale as a result of the collaboration between software developers and cross-functional teams. In addition, the use of a version control system (VCS) is key to facilitating the incorporation of both methods in the DevOps strategy, as it allows keeping track of all code changes in one single place, maintaining a history of previous version releases [64, 67]. Figure 4.1 shows a typical DevOps diagram.

In contrast, ML pipelines describe the process that ML engineers follow to develop and deploy
ML models. In general, this procedure includes the following stages [64, 68]:

• Data extraction: the data collection process, which commonly includes gathering relevant data from various sources that could help solve a particular problem.
• Data exploration: understand the main features of the data and consider what is needed to increase data quality.
• Data preparation: data cleaning process and additional transformations to the data to
split it into training, validation and testing datasets that feed the model.
• Model training: different techniques to choose an appropriate model and tune all hyper-
parameters required for the best model selection.
• Model evaluation: metric assessment of the prediction model results in the testing dataset
to evaluate performance.
• Model deployment: make the model available for individual consumption or integration
with other systems.
• Model monitoring: keep track of model performance in production environment for re-
training with new data and maintaining updated and high model performance.

ML engineers need to iteratively execute experiments to find the best possible configuration that provides a desirable result with a particular ML model. Therefore, it is necessary to keep track of all configurations, whether they worked or not, to identify other possibilities that might improve results. As software components, ML pipelines should also follow best practices in software development, eliminating code redundancy by designing reusable code. Reproducibility of results in a ML pipeline is also crucial to minimize errors when deploying models to different environments, such as development, preproduction, and production. Thus, reusing pipelines that reproduce results in multiple environments is key for testing, adjusting, and easily deploying ML models. Finally, monitoring models in production for continuous training with new data is vital, as the performance of any model tends to decrease over time, negatively impacting the functionality of the system that uses such models. Having automated processes following DevOps principles is beneficial and recommended in ML pipelines [64, 68, 69].

An ML process can be classified into three different groups, depending on the automation level of the previous steps [68, 69]:

• MLOps level 0. There is a completely manual execution of each stage in the ML pipeline, and a lack of continuous integration practices, as the models do not change regularly, causing a complete lack of performance monitoring. This mainly occurs during the initial steps of building ML models.
• MLOps level 1. Automation is implemented to achieve continuous integration, continuous delivery, and continuous testing. The main advantage of this type of automated pipeline is the ease with which multiple experiments can be executed with minimal changes to the pipeline; including additional models to retrain with new data and comparing their performance results is a relevant task for promoting enhanced models to production.
• MLOps level 2. There is fully automated orchestration of every step of the pipeline, which enables easy experimentation in robust and scalable production systems, to actively monitor model performance and automatically adjust and deploy newer versions if necessary. Figure 4.2 shows the level 2 ML pipeline:

Figure 4.2: ML Pipeline Level 2 with fully automated processes for continuous training

4.2 Machine Learning Data Pipeline


In this section, the main components of the ML pipeline are described, as a part of the MLOps
methodology.

4.2.1 Data Extraction

An annotated open-source dataset, known as the Anatomical Tracings of Lesions After Stroke (ATLAS) Release 2.0, 2021 [15], is used to train the proposed model, as it contains 655 anonymized patients from 33 different cohorts across 20 institutions worldwide, with their corresponding segmented lesion masks. Two datasets are provided: training and testing. While the training cases include both the T1-weighted MRI and a manually segmented mask of the lesion for each patient, the testing cases comprise only 300 subjects with the T1-weighted images, for evaluation purposes. For model training, the testing dataset was split into validation and testing subsets. Therefore, for this thesis there are three main datasets to consider, as indicated previously in section 2.6: training, validation, and testing. Further details are included in section 4.2.3.

It is worth noting that the segmentation masks were manually labeled by specialists and that the images have already gone through a preprocessing procedure that includes intensity normalization, registration to a standardized template, and defacing (removing facial structures), as indicated in the documentation of the dataset [15].

4.2.2 Data Exploration

After the data collection process is done, it is necessary to build up an understanding of the available medical data. First, a data exploration process is needed, given the heterogeneity of the scanners utilized during the data collection process, as each cohort stores patients' data with a different number of MRI slices; in this case, the number of MRI slices ranges from 72 to 512, as shown in Figure 4.3.

Figure 4.3: Histogram showing the distribution of the patients' MRI sequences in the ATLAS dataset.

Second, an approximate stroke location is plotted, based on the segmented regions. To accomplish this task, several steps were executed: (i) resizing each image to 224 × 224 for comparison; (ii) pixel normalization to the range between 0 and 1; and (iii) centroid calculation of the strokes, given the diversity of shapes and locations in the annotated dataset. This calculation used the marching squares algorithm on all contours with pixel values higher than 0.6. Figure 4.4 depicts the final comparison across the training, validation, and testing datasets.
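A sketch of step (iii), using scikit-image's marching squares implementation on a dummy mask; the 0.6 threshold matches the one described above, while the mask itself is a hypothetical placeholder:

    import numpy as np
    from skimage import measure

    mask = np.zeros((224, 224))
    mask[90:130, 100:150] = 1.0  # dummy lesion region

    # find_contours implements marching squares; level is the pixel-value threshold
    contours = measure.find_contours(mask, level=0.6)
    for contour in contours:
        centroid = contour.mean(axis=0)  # (row, col) centroid of the contour points
        print(centroid)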

Third, a stroke area-based analysis is performed to understand the distribution of stroke sizes in the dataset. The area of each contour is calculated, as it represents the cerebral lesion with respect to the whole image. The same threshold value of 0.6 is applied, as in the previous step, to keep only the relevant contours in the image. Figure 4.5 illustrates the resulting histogram.

Figure 4.4: Stroke centroids in training (left), validation (middle), and testing (right).

Figure 4.5: Distribution of stroke areas in the training (left) and validation (right) dataset.

Finally, random samples from the dataset are analyzed. An arbitrary patient was chosen from the training data to illustrate the sequence of MRI images with their corresponding segmentation masks. One important point is that the entire sequence of images of each patient is stored in NIfTI format [19]. Figure 4.6 shows an example of 21 slices with their respective annotations.

Figure 4.6: Sample of individual 21 MRI images along with their analogous lesion segmenta-
tion masks below each.
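For reference, a NIfTI volume and its mask can be loaded with nibabel as in the following sketch; the file names are placeholders, not the actual ATLAS paths:

    # Load one patient's T1-weighted volume and lesion mask with nibabel.
    import nibabel as nib

    t1 = nib.load("sub-0001_T1w.nii.gz").get_fdata()
    mask = nib.load("sub-0001_mask.nii.gz").get_fdata()  # same grid as the T1

    # Assuming the last axis indexes axial slices in this layout,
    # t1[:, :, k] pairs with mask[:, :, k] for slice k.
    print(t1.shape, mask.shape)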

4.2.3 Data Preparation

The ATLAS dataset is split by patients to ensure that all images of a sequence are assigned to a single dataset. In this case, images from the original training dataset belong to the training dataset, whereas the original testing dataset provides the samples for validation and testing. The final proportions are 60%, 30% and 10%, respectively: training consisted of 308 NIfTI files, validation of 163, and testing of 184. In terms of PNG images, the experiments used 30K images for training, 15K for validation, and 5K for testing. Individual folders were created for each dataset, containing two additional folders to store the patient images and the masks. Moreover, images and masks were renamed to facilitate binding them together and to ensure consistency during NN training. The entire root file directory is shown in Figure 4.7.

Figure 4.7: Data folder structure for training, validation and testing datasets
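A generic patient-level split can be sketched as follows. Note that the thesis kept the original ATLAS training set intact and divided only the original testing set; the helper below merely illustrates the principle that all slices of one subject must land in a single partition (names and seed are illustrative):

    import random

    def split_patients(patient_ids, val_frac=0.3, test_frac=0.1, seed=42):
        """Shuffle subject IDs once, then slice them into three partitions."""
        ids = sorted(patient_ids)
        random.Random(seed).shuffle(ids)
        n_val, n_test = int(len(ids) * val_frac), int(len(ids) * test_frac)
        return (ids[n_val + n_test:],        # training
                ids[:n_val],                 # validation
                ids[n_val:n_val + n_test])   # testing

    train_ids, val_ids, test_ids = split_patients(
        [f"sub-{i:04d}" for i in range(655)])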

Each NIfTI file contains the three MRI planes (axial, sagittal and coronal). However, the segmentation masks provided in the dataset correspond to the axial view. Therefore, the axial plane was used to store the individual patient images in PNG format, as it also provides a clearer visual representation of the brain for feature extraction in cerebral lesions. Every patient has two folders that contain the input images and the segmentation masks, respectively. The naming convention for all images within a folder is XXXX.png, where XXXX is the image's position in the sequence. For example, for a patient with 130 slices, the images are named 0000.png through 0129.png. Napari [20] was the main tool used to generate the PNG images, as it allows setting important image properties, for instance (1) brightness equal to 1 and (2) gamma equal to 1, to increase the visibility of the medical images. Normalization was the only transformation applied to the lesion masks, to ensure values of 0 or 1. An example of the directories containing the images is shown in Figure 4.8.

Figure 4.8: Example of the directories containing the patient images and masks
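Napari was the tool actually used for this export; purely as an approximation, the same slice-by-slice PNG naming could be produced programmatically, for example with nibabel and imageio (all names below are assumptions):

    import os
    import imageio.v2 as imageio
    import nibabel as nib
    import numpy as np

    def export_axial_slices(nifti_path, out_dir):
        """Write each axial slice as an 8-bit PNG named 0000.png, 0001.png, ..."""
        os.makedirs(out_dir, exist_ok=True)
        volume = nib.load(nifti_path).get_fdata()
        vmax = volume.max() or 1.0  # avoid dividing by zero on empty masks
        for k in range(volume.shape[-1]):  # assumes the last axis is axial
            slice_u8 = (volume[:, :, k] / vmax * 255).astype(np.uint8)
            imageio.imwrite(os.path.join(out_dir, f"{k:04d}.png"), slice_u8)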

As ML models need large amounts of data for better generalization, data augmentation steps are required. The following transformations are considered in this project: (1) converting to grayscale only, to fit the model input of one channel per image; (2) adjusting gamma and contrast values, to highlight darker regions of the image; (3) random rotation, with an angle between 0° and 180°, to vary the position of the stroke and help the model learn the different regions where strokes occur; and (4) resizing both images to 224 × 224 pixels to fit the model input dimensions, a size decided after several experiments and following best practices to improve speed and reduce memory consumption without affecting performance during training and inference [27]. An example of the final conversion of some images can be seen in Figure 4.9:

Figure 4.9: Contrast adjustment, rotation and resizing applied to both images on the left
(input and mask).
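A minimal sketch of this paired augmentation, assuming torchvision and CHW float tensors in [0, 1]; the gamma and contrast factors below are illustrative values, not the tuned ones:

    import random
    import torchvision.transforms.functional as TF
    from torchvision.transforms import InterpolationMode

    def augment_pair(image, mask):
        # (1)-(2): photometric changes on the input only, never on the mask
        if image.shape[0] == 3:
            image = TF.rgb_to_grayscale(image)        # enforce one channel
        image = TF.adjust_gamma(image, gamma=0.8)     # lift darker regions
        image = TF.adjust_contrast(image, contrast_factor=1.2)
        # (3): one shared random angle keeps image and mask in register
        angle = random.uniform(0.0, 180.0)
        image, mask = TF.rotate(image, angle), TF.rotate(mask, angle)
        # (4): resize both to the model input; nearest keeps the mask binary
        image = TF.resize(image, [224, 224])
        mask = TF.resize(mask, [224, 224],
                         interpolation=InterpolationMode.NEAREST)
        return image, mask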

4.2.4 Model Training

The tailored NN design is based on the symmetrical encoder-decoder architecture with attention blocks that is frequently used in medical segmentation problems. Each encoder stage applies a group of two 2D convolutions with a 3 × 3 kernel, padding 1, and stride 1 to generate the feature maps, normalizes the output with batch normalization, and then applies ReLU as the main activation function. A max pooling layer with a 2 × 2 kernel halves the image size, and, similar to the residual network, the features per layer are 64 at 112 × 112 pixels, 128 at 56 × 56, 256 at 28 × 28, and 512 at 14 × 14. An additional convolution group produces 1024 feature maps at 14 × 14 pixels, which form the input of the network's bottleneck, the baseline of the upsampling operations used in the attention block. The attention block takes two inputs: the first is the skip connection from the encoding layer, and the second is the result of the upsampling process from the lower layer. Each input is passed through a single 2D convolution, the results are added together and activated via the ReLU function, and another convolution then reduces the dimension so the result can be used as a residual with the input coming from the skip connection. After that, another upsampling step takes place, and the result is fed into the attention block of the subsequent decoding layer.

Finally, the last convolution reduces the output filters from 64 to 1, activating the pixels via the sigmoid function to determine the possible stroke location. Figure 4.10 shows the complete architecture of this CNN.
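The attention block just described can be sketched in PyTorch roughly as follows, in the spirit of additive attention gates [47]; channel counts and names are illustrative, and the exact layer ordering of the thesis model may differ:

    import torch
    import torch.nn as nn

    class AttentionGate(nn.Module):
        def __init__(self, skip_ch, gate_ch, inter_ch):
            super().__init__()
            self.theta = nn.Conv2d(skip_ch, inter_ch, kernel_size=1)  # skip path
            self.phi = nn.Conv2d(gate_ch, inter_ch, kernel_size=1)    # gating path
            self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)          # one map
            self.relu = nn.ReLU(inplace=True)
            self.sigmoid = nn.Sigmoid()

        def forward(self, skip, gate):
            # Project both inputs, add, activate, squeeze to one channel,
            # and use the result to re-weight the skip-connection features.
            att = self.sigmoid(self.psi(self.relu(self.theta(skip) + self.phi(gate))))
            return skip * att

    # e.g. gate a 64-channel encoder map with an upsampled decoder signal
    out = AttentionGate(64, 64, 32)(torch.randn(1, 64, 112, 112),
                                    torch.randn(1, 64, 112, 112))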

In the medical sector, class imbalance is a frequent problem that can be addressed from different perspectives: using a sensitive loss function, where the goal is to assign a different cost when examples of distinct classes are misclassified; oversampling, which uses patches from the original images with the intention of augmenting and balancing the minority class; or multi-phase training, where a distinct class distribution is used to retrain the network [14].

In this case, choosing a sensitive loss function such as BCE + Dice loss helps overcome this problem, as it is widely used, with significant results, in similar publications on image segmentation [7, 59]. More specifically, this thesis uses the weighted BCEDiceLoss function in the experiments, as different contributions of the individual terms could produce better results. The formula of the weighted BCEDiceLoss is as follows:

Weighted BCEDiceLoss = α · BCE + β · DiceLoss
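A minimal PyTorch sketch of this loss, assuming raw logits as input and a soft Dice term with an additive smoothing constant (an implementation detail not specified above):

    import torch
    import torch.nn.functional as F

    def weighted_bce_dice_loss(logits, target, alpha=1.0, beta=1.0, smooth=1.0):
        """alpha and beta weight the BCE and Dice contributions, respectively."""
        bce = F.binary_cross_entropy_with_logits(logits, target)
        probs = torch.sigmoid(logits)
        intersection = (probs * target).sum()
        dice = 1 - (2 * intersection + smooth) / (probs.sum() + target.sum() + smooth)
        return alpha * bce + beta * dice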

Four optimizers were compared with respect to training and validation performance as well as training time: SGD, Adam, AdamW and AdaGrad. As the learning rate is the most influential hyperparameter of the network, different scheduling strategies were tried, such as learning rate on plateau, multi-step learning rate and cyclic learning rate, to empirically test combinations of values that could lead to the best possible performance. Dropout and weight decay are the main regularization strategies used, together with data augmentation to increase the number of data samples [30]. Finally, random search was the method implemented to explore different combinations of hyperparameters (epochs, batch size, α, β, learning rate, weight decay, momentum) for model training, as it is the simplest way to explore many configurations in search of the best model. Table 4.1 lists the search ranges; a sketch of the procedure follows the table.

Hyperparameter Value
Epochs 30, 50, 80, 100
Batch size 1, 2, 4, 8, 16, 32
Learning rate 0.01 - 0.00001
Weight Decay 0.01 - 0.00001
Momentum 0.1 - 0.9
α 0.1 - 1
β 0.1 - 1
Table 4.1: Hyperparameters considered in random search
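The random search over these ranges can be sketched as below; train_and_validate is a hypothetical helper that trains one configuration and returns its validation Dice score, and the log-uniform sampling of the rates is a common convention rather than a documented choice of the thesis:

    import random

    def sample_config():
        return {
            "epochs": random.choice([30, 50, 80, 100]),
            "batch_size": random.choice([1, 2, 4, 8, 16, 32]),
            "lr": 10 ** random.uniform(-5, -2),            # 1e-5 .. 1e-2
            "weight_decay": 10 ** random.uniform(-5, -2),  # 1e-5 .. 1e-2
            "momentum": random.uniform(0.1, 0.9),
            "alpha": random.uniform(0.1, 1.0),
            "beta": random.uniform(0.1, 1.0),
        }

    trials = [sample_config() for _ in range(20)]
    best = max(trials, key=lambda cfg: train_and_validate(cfg))  # hypothetical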
Chapter 5. Results
The following table summarizes relevant information to evaluate the performance of each model
on the testing dataset.

Model Parameters Avg Dice Score Avg IoU
Attention U-Net (custom) 57,355,373 11% ±0.04 8% ±0.02
ResU-Net++ 14,481,412 9% ±0.24 6% ±0.18
U-Net 1,511,124 9% ±0.22 6% ±0.16
ResU-Net 13,040,705 9% ±0.18 6% ±0.12
Attention U-Net 1,905,321 7% ±0.25 5% ±0.19
Table 5.1: Experimental results

To show individual performance, the following images illustrate results for five random cases in which the model predicted a segmentation given the input image. Each figure is divided into four columns, depicting the input image, the ground truth, the model prediction, and the Dice score, respectively. The Dice score is shown in the right-most column to quantitatively compare the segmentation mask and the model output.
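For reference, the two reported metrics can be computed per image as in the following sketch, assuming binarized prediction and ground-truth tensors; the epsilon term guarding against empty masks is an assumption:

    import torch

    def dice_score(pred, target, eps=1e-7):
        intersection = (pred * target).sum()
        return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

    def iou_score(pred, target, eps=1e-7):
        intersection = (pred * target).sum()
        union = pred.sum() + target.sum() - intersection
        return (intersection + eps) / (union + eps)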

Figure 5.1: Attention U-Net prediction results

Figure 5.2: ResU-Net++ prediction results

Figure 5.3: U-Net prediction results

Figure 5.4: ResU-Net prediction results

Figure 5.5: Tailored Attention U-Net prediction results


Chapter 6. Evaluation
The experiments considered 50K images: 30K for training (60%), 15K for validation (30%) and 5K for testing (10%). U-Net, ResU-Net, ResU-Net++, Attention U-Net, and the custom-fit version were included in the experiments to evaluate performance, analyze, and benchmark the final results. All models were trained on the Fraunhofer High-Performance Cluster, on a node with 4 NVIDIA TITAN X (Pascal) GPUs, each with 12 GB of memory. MONAI [22] was the open-source Python framework selected for the implementation of the U-Net and Attention U-Net models, as MONAI provides these architectures ready to use. ResU-Net and ResU-Net++ were implemented from the PyTorch versions available in their original publication, also available on GitHub [90], and the tailored model was implemented directly in PyTorch [21]. The hyperparameters tuned during training were: learning rate, number of epochs, optimizer, weight decay, momentum, batch size and number of data samples. The final results of the test cases were recorded and compared to determine the configuration with the highest DSC and IoU on the validation dataset.

Although MRI is inherently 3D data that could feed a 3D model, the results obtained with 3D models did not outperform those of the 2D models, owing to hardware limitations. Therefore, the current approach consisted only of 2D-based models.

Despite using different optimizers in the experiments (SGD, Adam, AdamW and AdaGrad), the final choice was Adam, as it showed better response and stability in training and validation performance than the others. All models used a momentum of 0.9 to achieve higher performance. The following table shows the final list of hyperparameters for each model, corresponding to the best performance during the experiments:

Model Learning Rate Epochs Batch Size Weight Decay
Attention U-Net 1e-3 50 16 1e-4
ResU-Net 1e-3 50 8 1e-5
ResU-Net++ 1e-3 50 8 1e-4
U-Net 1e-3 50 8 1e-5
Tailored Attention Model 1e-4 30 8 1e-5
Table 6.1: Hyperparameters in models
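As an illustration, the winning Attention U-Net row of Table 6.1 would translate into the following PyTorch setup; model is assumed to exist, and mapping the momentum of 0.9 onto Adam's beta1 term is an interpretation, since Adam has no separate momentum argument:

    import torch

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-4)
    # One of the scheduling strategies mentioned earlier, as an example:
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min")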

After several experiments with different proportions of α and β in the weighted BCEDiceLoss, the best performance was obtained with an equal contribution of each individual function (α = β = 1). The loss curves for the models trained with this loss function are shown in Figure 6.1.


Figure 6.1: Loss for all models in the validation dataset

Figure 6.1 shows the loss curves of all models on the validation dataset. The horizontal axis represents the epochs during network training, and the vertical axis the loss value. The dark blue line highlights the loss curve of the Attention U-Net (lowest value 0.51, reached at the 41st epoch), the model with the highest DSC in the experimental results. The orange line shows the loss of the custom model for comparison with the Attention U-Net (lowest value 0.57, reached at the 27th epoch). ResU-Net ended at 0.61 by the 24th epoch, followed by U-Net with 0.62, and ResU-Net++ with 0.70 by the 28th epoch. The lines do not reach 50 epochs, as each ends where the best performance was achieved.

Figure 6.2 illustrates the DSC curves for all models; this is the metric on which the performance of the models is evaluated. The dark blue line shows the Attention U-Net with 52.2%. The tailored model, in orange, ranks second with 45.50%. ResU-Net reached 42.6%, followed by U-Net with 40.8% and ResU-Net++ with 35.5%.

Similarly, Figure 6.3 shows the IoU results for all models. The highlighted blue line corresponds to the Attention U-Net with a score of 37.8%, followed by the custom Attention U-Net with 33.9%, shown in orange. ResU-Net achieved 31.3%, U-Net 29.5%, and ResU-Net++ 24.4%.

Figure 6.2: Dice score for all models in the validation dataset

Figure 6.3: IoU score for all models in the validation dataset
Chapter 7. Discussion
Identifying strokes with encoder-decoder-based DL architectures yields approximate results, given the complexity of the task. Research shows that the encoder-decoder pattern is widely used due to its effective performance in medical segmentation problems [46, 50, 59, 63, 81]. The main reason is that these architectures benefit from skip connections, which preserve relevant features in the decoder part and add stability that helps the network converge. Moreover, attention blocks improve model performance by filtering out non-relevant features. Nonetheless, the more elements added to the model, the more trainable parameters it contains. This is the main characteristic that distinguishes U-Net from other, more complex models: whereas other CNN architectures can improve prediction accuracy by adding components to their design, U-Net remains one of the simplest models, with the lowest number of trainable parameters among the chosen group, which enables accelerated training with fewer computational resources and considerably accurate segmentation performance.

Certainly, implementing a tailored encoder-decoder-based architecture with attention resulted in DL networks with a huge number of parameters compared with smaller models, despite their higher accuracy. Having an optimal model that speeds up training and prediction on new, unseen data is a second high-priority requirement in any field, particularly the medical one.

In addition, although some hyperparameters can be good indicators of accurate performance in rapid test environments, continuous experimentation and monitoring are essential to find the most suitable model for the problem. In the particular case of the weighted BCEDiceLoss with the ATLAS [15] dataset, the contribution of each individual function does not affect the final training loss enough to achieve higher performance, especially when more data is part of the training process. Nevertheless, it remains an alternative in the hyperparameter space when working with relatively smaller datasets.

Finally, a model can show highly accurate results on the training and validation datasets and still fail to generalize when new, unseen data is presented. Anomalies in the data, samples outside the main distribution, or insufficient data for real cases are risk factors that may cause a model to underperform.

Chapter 8. Conclusion
In this thesis, 2D image segmentation models for stroke detection were implemented and their performance compared, completing the first step of the two-stage architecture for automated stroke detection. Firstly, basic data transformation techniques were applied to the medical images and lesions in the ATLAS [15] dataset, such as random rotation, increased gamma and contrast for better visualization, and resizing to 224 × 224 pixels. Secondly, the models are based on the encoder-decoder architecture, which extracts the relevant feature activation maps and then produces a result of the same size as the input, incorporating contextual information from early layers of the model. Thirdly, a wide range of hyperparameters was tested to achieve DSC and IoU accuracy similar to that reported in other publications using comparable symmetrical architectures, including U-Net, V-Net, and enhanced versions of U-Net with attention and residual modules. During the refinement of the proposed models, a weighted BCEDiceLoss function was tested to identify possible accuracy improvements by varying the contributions of each individual function, as high accuracy was observable in rapid model tests. The DSC accuracy achieved by the models depends heavily on a complex design that requires a huge number of trainable parameters.

Chapter 9. Future Work
The implemented CNN model achieves considerable accuracy in segmenting strokes; however, it has over 57.3 million trainable parameters, leading to a slow training process that is difficult to evaluate and tune. Therefore, model optimization that reduces the number of parameters without negatively impacting accuracy is necessary to accelerate this procedure and to continue exploring better alternatives. Furthermore, due to hardware limitations, 3D models were out of the scope of this project, given the low performance produced by the limited data samples feeding the models. 3D model exploration could thus provide robust alternatives with increased performance, as MRI volumes are 3D by nature.

Nevertheless, special focus on the first component of the architecture is necessary, as the image segmentation results are not satisfactory, owing to the data-related issues mentioned earlier in this project. Seeking more high-quality data is desirable for further experimental results. Indeed, further analysis is needed in image segmentation to improve accuracy: there were cases where the stroke was barely visible, covering only a narrow portion of the image, and no prediction was produced. Such cases were not consistently identified by any of the models in this thesis. This scenario reinforces the concerns addressed in [8]. The early diagnosis of hemorrhagic strokes could help specialists react more rapidly to unfortunate events in patients' brains.

Finally, the ATLAS dataset [15] includes metadata with tabular information on the number of strokes, as well as a brief textual description of the brain hemisphere where each stroke occurred. Such information could be used as additional variables to increase accuracy in the learning task in the second stage of the proposed architecture.

Bibliography
[1] S. Zhang, S. Xu, L. Tan, H. Wang, and J. Meng, “Stroke Lesion Detection and Analysis
in MRI Images Based on Deep Learning,” Journal of Healthcare Engineering, vol. 2021, p.
e5524769, Apr. 2021, doi: 10.1155/2021/5524769.

[2] Y. Zhou, W. Huang, P. Dong, Y. Xia, and S. Wang, “D-UNet: A Dimension-Fusion U Shape Network for Chronic Stroke Lesion Segmentation,” IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 18, no. 3, pp. 940–950, May 2021, doi: 10.1109/TCBB.2019.2939522.

[3] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, “3D U-Net:
Learning Dense Volumetric Segmentation from Sparse Annotation,” arXiv:1606.06650 [cs],
Jun. 2016, Accessed: Apr. 24, 2022. [Online]. Available: http://arxiv.org/abs/1606.06650

[4] A. N. Christensen, ”Data Analysis of Medical Images: CT, MRI, Phase Contrast X-ray
and PET”. Kgs. Lyngby: Technical University of Denmark, 2016.

[5] T. E. Nichols et al., “Best practices in data analysis and sharing in neuroimaging using
MRI,” Nat Neurosci, vol. 20, no. 3, Art. no. 3, Mar. 2017, doi: 10.1038/nn.4500.

[6] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical
Image Segmentation.” arXiv, May 18, 2015. Accessed: Jun. 01, 2022. [Online]. Available:
http://arxiv.org/abs/1505.04597

[7] K. Qi et al., “X-Net: Brain Stroke Lesion Segmentation Based on Depthwise Separable
Convolution and Long-range Dependencies,” arXiv:1907.07000 [cs, eess], Dec. 2019, doi:
10.1007/978-3-030-32248-9 28.

[8] A. Lou, S. Guan, H. Ko, and M. H. Loew, “CaraNet: context axial reverse attention network
for segmentation of small medical objects,” in Medical Imaging 2022: Image Processing,
San Diego, United States, Apr. 2022, p. 11. doi: 10.1117/12.2611802.

[9] F. Milletari, N. Navab, and S.-A. Ahmadi, “V-Net: Fully Convolutional Neural Networks
for Volumetric Medical Image Segmentation,” arXiv:1606.04797 [cs], Jun. 2016, Accessed:
Apr. 25, 2022. [Online]. Available: http://arxiv.org/abs/1606.04797

[10] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation,” arXiv:1511.00561 [cs], Oct. 2016, Accessed: Apr. 29, 2022. [Online]. Available: http://arxiv.org/abs/1511.00561

[11] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image Segmentation Using Deep Learning: A Survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 7, pp. 3523–3542, Jul. 2022, doi: 10.1109/TPAMI.2021.3059968.

[12] M. Soltanpour, R. Greiner, P. Boulanger, and B. Buck, “Ischemic Stroke Lesion Prediction
in CT Perfusion Scans Using Multiple Parallel U-Nets Following by a Pixel-Level Classifier,”
in 2019 IEEE 19th International Conference on Bioinformatics and Bioengineering (BIBE),
Oct. 2019, pp. 957–963. doi: 10.1109/BIBE.2019.00179.

[13] W. Wu, Y. Lu, R. Mane, and C. Guan, “Deep Learning for Neuroimaging Segmentation
with a Novel Data Augmentation Strategy,” in 2020 42nd Annual International Conference
of the IEEE Engineering in Medicine Biology Society (EMBC), Jul. 2020, pp. 1516–1519.
doi: 10.1109/EMBC44109.2020.9176537.

[14] A. Clèrigues, S. Valverde, J. Bernal, J. Freixenet, A. Oliver, and X. Lladó, “Acute is-
chemic stroke lesion core segmentation in CT perfusion images using fully convolutional
neural networks,” Computers in Biology and Medicine, vol. 115, p. 103487, Dec. 2019, doi:
10.1016/j.compbiomed.2019.103487.

[15] S.-L. Liew et al., “A large, open source dataset of stroke anatomical brain images and
manual lesion segmentations,” Sci Data, vol. 5, no. 1, Art. no. 1, Feb. 2018, doi: 10.1038/s-
data.2018.11.

[16] W.-D. Heiss, “Ischemic Penumbra: Evidence From Functional Imaging in Man,” J Cereb
Blood Flow Metab, vol. 20, no. 9, pp. 1276–1293, Sep. 2000, doi: 10.1097/00004647-
200009000-00002.

[17] H. J. Audebert and J. B. Fiebach, “Brain Imaging in Acute Ischemic Stroke—MRI or CT?,”
Curr Neurol Neurosci Rep, vol. 15, no. 3, p. 6, Mar. 2015, doi: 10.1007/s11910-015-0526-4.

[18] A. M. Hafiz and G. M. Bhat, “A survey on instance segmentation: state of the art,” Int J
Multimed Info Retr, vol. 9, no. 3, pp. 171–189, Sep. 2020, doi: 10.1007/s13735-020-00195-x.

[19] X. Li, P. S. Morgan, J. Ashburner, J. Smith, and C. Rorden, “The first step for neuroimag-
ing data analysis: DICOM to NIfTI conversion,” Journal of Neuroscience Methods, vol.
264, pp. 47–56, May 2016, doi: 10.1016/j.jneumeth.2016.03.001.

[20] Napari contributors (2019). napari: a multi-dimensional image viewer for Python. doi:10.5281/zenodo.3555620. Retrieved September 13, 2022, from https://napari.org/stable/

[21] The Linux Foundation. (n.d.). Pytorch. PyTorch. Retrieved September 13, 2022, from
https://pytorch.org/

[22] Project MONAI. (n.d.). Monai - Medical Open Network for Artificial Intelligence. Retrieved
September 13, 2022, from https://monai.io/

[23] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Im-
age Recognition.” arXiv, Apr. 10, 2015. Accessed: Jul. 02, 2022. [Online]. Available:
http://arxiv.org/abs/1409.1556

[24] K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image
Recognition.” arXiv, Dec. 10, 2015. Accessed: Jul. 02, 2022. [Online]. Available:
http://arxiv.org/abs/1512.03385

[25] Z. Alom, T. M. Taha, C. Yakopcic, S. Westberg, P. Sidike, and M. S. Nasrin, “The History
Began from AlexNet: A Comprehensive Survey on Deep Learning Approaches,” p. 39.

[26] Y. Zhang, J. Chu, L. Leng, and J. Miao, “Mask-Refined R-CNN: A Network for Refining
Object Details in Instance Segmentation,” Sensors, vol. 20, no. 4, p. 1010, Feb. 2020, doi:
10.3390/s20041010.

[27] H. Talebi and P. Milanfar, “Learning to Resize Images for Computer Vision Tasks,” arXiv,
arXiv:2103.09950, Mar. 2021. doi: 10.48550/arXiv.2103.09950.

[28] F. Zhuang et al., “A Comprehensive Survey on Transfer Learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, Jan. 2021, doi: 10.1109/JPROC.2020.3004555.

[29] S. Jadon, “A survey of loss functions for semantic segmentation,” in 2020 IEEE Conference
on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB),
Oct. 2020, pp. 1–7. doi: 10.1109/CIBCB48159.2020.9277638.

[30] Z. Li, K. Kamnitsas, and B. Glocker, “Analyzing Overfitting under Class Imbalance in
Neural Networks for Image Segmentation,” IEEE Trans. Med. Imaging, vol. 40, no. 3, pp.
1065–1077, Mar. 2021, doi: 10.1109/TMI.2020.3046692.

[31] S. Sun, Z. Cao, H. Zhu, and J. Zhao, “A Survey of Optimization Methods From a Machine
Learning Perspective,” IEEE Trans. Cybern., vol. 50, no. 8, pp. 3668–3681, Aug. 2020, doi:
10.1109/TCYB.2019.2950779.

[32] W.-D. Heiss, “Ischemic Penumbra: Evidence From Functional Imaging in Man,” J Cereb
Blood Flow Metab, vol. 20, no. 9, pp. 1276–1293, Sep. 2000, doi: 10.1097/00004647-
200009000-00002.

[33] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep Learning for Computer Vision: A Brief Review,” Computational Intelligence and Neuroscience, vol. 2018, pp. 1–13, 2018, doi: 10.1155/2018/7068349.

[34] S. Sharma, S. Sharma, and A. Athaiya, “Activation Functions in Neural Networks,” IJEAST, vol. 04, no. 12, pp. 310–316, May 2020, doi: 10.33564/IJEAST.2020.v04i12.054.

[35] S. K. Kumar, “On weight initialization in deep neural networks.” arXiv, May 02, 2017.
Accessed: Jul. 13, 2022. [Online]. Available: http://arxiv.org/abs/1704.08863

[36] Q. Li, M. Yan, and J. Xu, “Optimizing Convolutional Neural Network Performance by Mit-
igating Underfitting and Overfitting,” in 2021 IEEE/ACIS 19th International Conference
on Computer and Information Science (ICIS), Shanghai, China, Jun. 2021, pp. 126–131.
doi: 10.1109/ICIS51600.2021.9516868.

[37] Q. Wu, Y. Liu, Q. Li, S. Jin, and F. Li, “The Application of Deep Learning in Computer
Vision,” p. 6.

[38] J. Chai, H. Zeng, A. Li, and E. W. T. Ngai, “Deep learning in computer vision: A critical
review of emerging techniques and application scenarios,” Machine Learning with Applica-
tions, vol. 6, p. 100134, Dec. 2021, doi: 10.1016/j.mlwa.2021.100134.

[39] X. Liu, L. Song, S. Liu, and Y. Zhang, “A Review of Deep-Learning-Based Medical Im-
age Segmentation Methods,” Sustainability, vol. 13, no. 3, Art. no. 3, Jan. 2021, doi:
10.3390/su13031224.

[40] S. Studer et al., “Towards CRISP-ML(Q): A Machine Learning Process Model with
Quality Assurance Methodology,” MAKE, vol. 3, no. 2, pp. 392–413, Apr. 2021, doi:
10.3390/make3020020.

[41] L. Alzubaidi et al., “Review of deep learning: concepts, CNN architectures, challenges,
applications, future directions,” Journal of Big Data, vol. 8, no. 1, p. 53, Mar. 2021, doi:
10.1186/s40537-021-00444-8.

[42] R. Soto-Cámara, J. J. González-Bernal, J. González-Santos, J. M. Aguilar-Parra, R. Trigueros, and R. López-Liria, “Knowledge on Signs and Risk Factors in Stroke Patients,” JCM, vol. 9, no. 8, p. 2557, Aug. 2020, doi: 10.3390/jcm9082557.

[43] A. T. Tursynova, B. S. Omarov, O. A. Postolache, and M. Z. Sakypbekova, “Convolutional Deep Learning Neural Networks for Stroke Image Recognition: Review,” Journal of Mathematics, Mechanics and Computer Science, vol. 112, no. 4, Art. no. 4, Dec. 2021, doi: 10.26577/JMMCS.2021.v112.i4.09.

[44] N. Burgos, S. Bottani, J. Faouzi, E. Thibeau-Sutre, and O. Colliot, “Deep learning for
brain disorders: from data processing to disease treatment,” Briefings in Bioinformatics,
vol. 22, no. 2, pp. 1560–1576, Mar. 2021, doi: 10.1093/bib/bbaa310.

[45] Z. Zhang, Q. Liu, and Y. Wang, “Road Extraction by Deep Residual U-Net,” IEEE
Geosci. Remote Sensing Lett., vol. 15, no. 5, pp. 749–753, May 2018, doi: 10.1109/L-
GRS.2018.2802944.

[46] D. Jha et al., “ResUNet++: An Advanced Architecture for Medical Image Seg-
mentation.” arXiv, Nov. 16, 2019. Accessed: Aug. 02, 2022. [Online]. Available:
http://arxiv.org/abs/1911.07067

[47] O. Oktay et al., “Attention U-Net: Learning Where to Look for the Pancreas.” arXiv, May
20, 2018. Accessed: Aug. 02, 2022. [Online]. Available: http://arxiv.org/abs/1804.03999

[48] D. Weishaupt, V. D. Köchli, and B. Marincek, ”How does MRI work? an introduction
to the physics and function of magnetic resonance imaging”, 2nd ed. Berlin; New York:
Springer, 2006.

[49] A. Berger, “How does it work?: Magnetic resonance imaging,” BMJ, vol. 324, no. 7328,
pp. 35–35, Jan. 2002, doi: 10.1136/bmj.324.7328.35.

[50] I. Rizwan I Haque and J. Neubert, “Deep learning approaches to biomedical im-
age segmentation,” Informatics in Medicine Unlocked, vol. 18, p. 100297, 2020, doi:
10.1016/j.imu.2020.100297.

[51] B. Liu, “‘Weak AI’ is Likely to Never Become ‘Strong AI’, So What is its Great-
est Value for us?” arXiv, Mar. 28, 2021. Accessed: Aug. 07, 2022. [Online]. Available:
http://arxiv.org/abs/2103.15294

[52] C. M. Bishop, “Neural networks and their applications,” Neural networks, p. 31.

[53] W. S. McCulloch and W. Pitts, “A Logical Calculus of the Ideas Immanent in Nervous Activity,” Bulletin of Mathematical Biophysics, 5:115–133, 1943.

[54] W. Pitts and W. S. McCulloch, “How We Know Universals,” Bulletin of Mathematical Biophysics, 9:127–147, 1947.

[55] D. Andina, A. Vega-Corona, J. I. Seijas, and J. Torres-García, “Neural Networks Historical Review,” in Computational Intelligence, D. Andina and D. T. Pham, Eds. Boston, MA: Springer US, 2007, pp. 39–65. doi: 10.1007/0-387-37452-3_2.

[56] A. Hatamizadeh, V. Nath, Y. Tang, D. Yang, H. Roth, and D. Xu, “Swin UNETR: Swin
Transformers for Semantic Segmentation of Brain Tumors in MRI Images.” arXiv, Jan. 04,
2022. Accessed: Aug. 10, 2022. [Online]. Available: http://arxiv.org/abs/2201.01266

[57] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in Vision: A Survey,” ACM Comput. Surv., p. 3505244, Jan. 2022, doi: 10.1145/3505244.

[58] A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recog-
nition at Scale.” arXiv, Jun. 03, 2021. Accessed: Aug. 11, 2022. [Online]. Available:
http://arxiv.org/abs/2010.11929

[59] A. Hatamizadeh et al., “UNETR: Transformers for 3D Medical Image Segmentation.” arXiv, Oct. 09, 2021. Accessed: Aug. 11, 2022. [Online]. Available: http://arxiv.org/abs/2103.10504

[60] H. Zhang, L. Zhang, and Y. Jiang, “Overfitting and Underfitting Analysis for Deep Learning
Based End-to-end Communication Systems,” in 2019 11th International Conference on
Wireless Communications and Signal Processing (WCSP), Xi’an, China, Oct. 2019, pp.
1–6. doi: 10.1109/WCSP.2019.8927876.

[61] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.

[62] Preston, D. C. (Reviewed 2016, April 7). Magnetic Resonance Imaging (MRI) of the Brain and Spine: Basics. MRI Basics. Retrieved August 12, 2022, from https://case.edu/med/neurology/NR/MRI%20Basics.htm

[63] Esteva, A., Chou, K., Yeung, S., Naik, N., Madani, A., Mottaghi, A., Liu, Y., Topol, E.,
Dean, J., Socher, R. (2021). Deep learning-enabled medical computer vision. Npj Digital
Medicine, 4(1), 5. https://doi.org/10.1038/s41746-020-00376-2

[64] Karamitsos, I., Albarhami, S., Apostolopoulos, C. (2020). Applying DevOps Prac-
tices of Continuous Automation for Machine Learning. Information, 11(7), 363.
https://doi.org/10.3390/info11070363

[65] Ebert, C., Gallardo, G., Hernantes, J., Serrano, N. (2016). DevOps. IEEE Software, 33(3),
94–100. https://doi.org/10.1109/MS.2016.68

[66] Makinen, S., Skogstrom, H., Laaksonen, E., Mikkonen, T. (2021). Who Needs MLOps:
What Data Scientists Seek to Accomplish and How Can MLOps Help? 2021 IEEE/ACM
1st Workshop on AI Engineering - Software Engineering for AI (WAIN), 109–112.
https://doi.org/10.1109/WAIN52551.2021.00024

[67] Lwakatare, L. E., Crnkovic, I., Bosch, J. (2020). DevOps for AI – Chal-
lenges in Development of AI-enabled Applications. 2020 International Confer-
ence on Software, Telecommunications and Computer Networks (SoftCOM), 1–6.
https://doi.org/10.23919/SoftCOM50211.2020.9238323

[68] Google Cloud. 2020. Last updated 2020, January 7, MLOps: Continuous delivery and au-
tomation pipelines in machine learning — Cloud Architecture Center — Google Cloud. [on-
line] Available at: https://cloud.google.com/architecture/mlops-continuous-delivery-and-
automation-pipelines-in-machine-learning, Accessed 25 August 2022.

[69] Garg, S., Pundir, P., Rathee, G., Gupta, P. K., Garg, S., Ahlawat, S.
(2021). On Continuous Integration / Continuous Delivery for Automated Deploy-
ment of Machine Learning Models using MLOps. 2021 IEEE Fourth Interna-
tional Conference on Artificial Intelligence and Knowledge Engineering (AIKE),
25–28. https://doi.org/10.1109/AIKE52691.2021.00010

[70] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, 2016.
http://www.deeplearningbook.org.

[71] Kukačka, J., Golkov, V., Cremers, D. (2017). Regularization for Deep Learning: A Tax-
onomy (arXiv:1710.10686). arXiv. http://arxiv.org/abs/1710.10686

[72] Kelleher, J. D. (2019). Chapter 3, Neural Networks: The Building Blocks of Deep Learning.
In Deep learning (pp. 65–100). essay, The Mit Press.

[73] Glorot, X., Bengio, Y. (n.d.). Understanding the difficulty of training deep feedforward
neural networks. 8.

[74] He, K., Zhang, X., Ren, S., Sun, J. (2015). Delving Deep into Rectifiers: Surpass-
ing Human-Level Performance on ImageNet Classification (arXiv:1502.01852). arXiv.
http://arxiv.org/abs/1502.01852

[75] K. Katanforoosh and D. Kunin, “Initializing neural networks,” deeplearning.ai, 2019. https://www.deeplearning.ai/ai-notes/initialization/

[76] Ruder, S. (2017). An overview of gradient descent optimization algorithms (arXiv:1609.04747). arXiv. http://arxiv.org/abs/1609.04747

[77] Wang, Z. J., Turko, R., Shaikh, O., Park, H., Das, N., Hohman, F., Kahng, M., Polo
Chau, D. H. (2021). CNN Explainer: Learning Convolutional Neural Networks with Inter-
active Visualization. IEEE Transactions on Visualization and Computer Graphics, 27(2),
1396–1406. https://doi.org/10.1109/TVCG.2020.3030418

[78] Ibm.com. 2022. What is Computer Vision? IBM. [online] Available at: https://www.ibm.com/topics/computer-vision [Accessed 29 August 2022].

[79] Huang, T. S. (n.d.). Computer Vision: Evolution and Promise. 5.

[80] Bhatt, D., Patel, C., Talsania, H., Patel, J., Vaghela, R., Pandya, S., Modi, K., Ghayvat, H.
(2021). CNN Variants for Computer Vision: History, Architecture, Application, Challenges
and Future Scope. Electronics, 10(20), 2470. https://doi.org/10.3390/electronics10202470

[81] Pham, D. L., Xu, C., Prince, J. L. (n.d.). A survey of current methods in medical image
segmentation. Image Segmentation, 27.

[82] Hafiz, A. M., Bhat, G. M. (2020). A survey on instance segmentation: State of the art. International Journal of Multimedia Information Retrieval, 9(3), 171–189. https://doi.org/10.1007/s13735-020-00195-x

[83] Kim, D., Woo, S., Lee, J.-Y., Kweon, I. S. (n.d.). Video Panoptic Segmentation. 10.

[84] Milioto, A., Behley, J., McCool, C., Stachniss, C. (2020). LiDAR Panoptic Segmentation
for Autonomous Driving. 2020 IEEE/RSJ International Conference on Intelligent Robots
and Systems (IROS), 8505–8512. https://doi.org/10.1109/IROS45743.2020.9340837

[85] Zou, Z., Shi, Z., Guo, Y., Ye, J. (2019). Object Detection in 20 Years: A Survey
(arXiv:1905.05055). arXiv. http://arxiv.org/abs/1905.05055

[86] Fu, J., Rui, Y. (2017). Advances in deep learning approaches for im-
age tagging. APSIPA Transactions on Signal and Information Processing, 6(1).
https://doi.org/10.1017/ATSIP.2017.12

[87] Castiglioni, I., Rundo, L., Codari, M., Di Leo, G., Salvatore, C., Interlenghi, M., Gal-
livanone, F., Cozzi, A., D’Amico, N. C., Sardanelli, F. (2021). AI applications to
medical images: From machine learning to deep learning. Physica Medica, 83, 9–24.
https://doi.org/10.1016/j.ejmp.2021.02.006

[88] Chinnamgari, S. K. (2019). Chapter 1. In R Machine Learning Projects: Implement supervised, unsupervised, and reinforcement learning techniques using R 3.5 (pp. 8–20). Packt Publishing.

[89] Singh, J., Banerjee, R. (2019). A Study on Single and Multi-layer Perceptron Neural Net-
work. 2019 3rd International Conference on Computing Methodologies and Communication
(ICCMC), 35–40. https://doi.org/10.1109/ICCMC.2019.8819775

[90] Rishikesh. (n.d.). rishikksh20/ResUnet: PyTorch implementation of ResUnet and ResUnet++. GitHub. Retrieved August 11, 2022, from https://github.com/rishikksh20/ResUnet

[91] Alison Scarpa, O. T. R. L. (2020, June 22). What are the different types of stroke? NEO-
FECT Blog. Retrieved August 7, 2022, from https://www.neofect.com/us/blog/what-are-
the-different-types-of-stroke

[92] Preston, D. C. (Reviewed 2016, April 7). Magnetic Resonance Imaging (MRI) of the Brain and Spine: Basics. MRI Basics. Retrieved August 12, 2022, from https://case.edu/med/neurology/NR/MRI%20Basics.htm

[93] Zhou, I. (2021, July 7). Model training with Machine Learning. Landing AI. Retrieved
August 11, 2022, from https://landing.ai/model-training-with-machine-learning/
