
CNN Architecture Optimization Using Bio Inspired Algor - 2022 - Computers in Bio

This document summarizes a research paper that used genetic algorithms and particle swarm optimization to optimize the hyperparameters and architecture of convolutional neural networks (CNNs) for detecting breast cancer in infrared images. The researchers optimized three state-of-the-art CNNs: VGG-16, ResNet-50, and DenseNet-201. Through optimization, they improved the F1-score for VGG-16 from 0.66 to 0.92 and for ResNet-50 from 0.83 to 0.90 when classifying test data. Bio-inspired optimization techniques helped find CNN architectures and hyperparameters that improved breast cancer detection performance compared to previous studies.

Uploaded by

Farhan Maulana

Computers in Biology and Medicine 142 (2022) 105205

Contents lists available at ScienceDirect

Computers in Biology and Medicine


journal homepage: www.elsevier.com/locate/compbiomed

CNN architecture optimization using bio-inspired algorithms for breast cancer detection in infrared images

Caroline Barcelos Gonçalves *, Jefferson R. Souza, Henrique Fernandes
Faculty of Computing, Federal University of Uberlandia, 2121, Joao Naves de Avila Avenue, Uberlandia, 38408-100, MG, Brazil

ARTICLE INFO

Keywords: Breast cancer; Infrared images; CNN; GA; PSO; Bio-inspired optimization

ABSTRACT

The early detection of breast cancer is a vital factor when it comes to improving cure and recovery rates in patients. Among such early detection factors, one finds thermography, an imaging technique that demonstrates good potential as an early detection method. Convolutional neural networks (CNNs) are widely used in image classification tasks, but finding good hyperparameters and architectures for these is not a simple task. In this study, we use two bio-inspired optimization techniques, genetic algorithm and particle swarm optimization, to find good hyperparameters and architectures for the fully connected layers of three state-of-the-art CNNs: VGG-16, ResNet-50 and DenseNet-201. Through use of optimization techniques, we obtained F1-score results above 0.90 for all three networks, an improvement from 0.66 of the F1-score to 0.92 of the F1-score for the VGG-16. Moreover, we were also able to improve the ResNet-50 from 0.83 of the F1-score to 0.90 of the F1-score for the test data, when compared to previously published studies.

1. Introduction

According to Ref. [1], cancer is abnormal cell growth that becomes out of control. Breast cancer is a type of cancer that starts in the breast and mainly affects women, although it can also appear in men [1]. The American Cancer Society in Ref. [2] estimates that in 2021 more than 280,000 new cases of breast cancer will be diagnosed in the USA. They also reported that the incidence of breast cancer had increased 0.5% over recent years [2]. In Brazil, the estimated number of new cases of breast cancer for 2021 is over 66,000 [3]. After skin cancer, breast cancer is the most common cancer in both Brazil and the USA [2,4]. Due to its prevalence, early diagnosis of breast cancer is essential, especially when it comes to improving the chances of patient survival and recovery [5].

There are many imaging techniques used to support the early detection of breast cancer. Among them, mammography is considered the standard exam. One of its disadvantages is that it exposes patients to low doses of X-rays, which are harmful. Moreover, although mammography is the standard imaging technique, it does not demonstrate good sensitivity in breasts with denser tissue, which is common in younger women [3,6]. Therefore, in such cases, other imaging techniques, for example MRI or ultrasound, might also be used to support the diagnosis [6]. Infrared thermography is another imaging technique that can be used to support the early detection of breast cancer, as it has recently shown potential in supporting the diagnosis of this disease [7–9].

Thermography is an imaging technique that measures the temperature emitted by the human body through infrared radiation [7,8]. This imaging technique is useful for the early detection of breast cancer, as tumor areas are hotter than healthy areas due to the higher metabolic activity of the cancerous cells [8,10,11]. Thermography is also a cheap, simple and noninvasive procedure [8].

Convolutional Neural Networks (CNNs) are deep neural networks that have shown great potential in solving computer vision problems (recognition and classification tasks, for example), performing, in some cases, even better than human experts [12–14]. This technique is also used in many medical imaging applications, including breast cancer detection, and more specifically with thermography images [7,8,11,15].

Developing a CNN involves optimizing many parameters, as well as choosing its architecture. Choosing suitable parameters is essential for reaching good results with CNNs. It is not a simple task, as it demands a high level of expertise. In addition, it is common to experiment with multiple parameters in order to find the most appropriate ones, which also makes the process time consuming [12–14].

Population based algorithms, such as genetic algorithms (GA) and

* Corresponding author.
E-mail address: [email protected] (C.B. Gonçalves).

https://doi.org/10.1016/j.compbiomed.2021.105205
Received 1 November 2021; Received in revised form 28 December 2021; Accepted 29 December 2021
Available online 5 January 2022
0010-4825/© 2022 Elsevier Ltd. All rights reserved.

particle swarm optimization (PSO), present good results, especially in problems that have a broad search space [16]. GA and PSO are two bio-inspired meta-heuristic algorithms that can be used in optimization problems [12,13]. In our study, we use algorithms of this type to optimize some of the parameters of our CNNs. We use transfer learning with the following state-of-the-art CNNs: VGG-16, ResNet-50 and DenseNet-201. Each CNN has all of its fully connected layers replaced, with the new architecture chosen by the PSO or the GA. In addition, the optimization algorithms are also used to choose some of the network hyperparameters.

2. Related work

In this section, we review related studies that use machine learning techniques for breast cancer detection in thermography images, as well as studies that apply bio-inspired algorithms, specifically GA and PSO, to the search for adequate CNN architectures and hyperparameters.

The study developed by Torres-Galván et al. in Ref. [15] uses images from three different infrared databases: DMR-IR, Instituto Jalisciense de Cancerología (IJC) and Instituto de Seguridad y Servicios Sociales de los Trabajadores del Estado (ISSSTE), with a total of 311 images, of which 267 were healthy and 44 sick. Transfer learning using ResNet-101 was the chosen technique for the classification. The authors in Ref. [15] applied random data augmentation techniques to the training data, for instance: data rotation on angles between 0° and 359°, reflection on each axis and translation on angles between 0° and 50°. In addition, they augmented the data by a factor of 67. The authors report on two different setups for the experiments: an unbalanced dataset keeping the original data proportion and a balanced dataset, each with different hyperparameters and both using data augmentation. The authors report 84.6% sensitivity using the unbalanced dataset. With the balanced dataset, they obtained even higher results, with 92.3% sensitivity.

The DMR-IR database was proposed by Silva et al. in Ref. [17]. These authors created a database of infrared breast images collected at the Hospital Universitário Antônio Pedro (HUAP) of the Fluminense Federal University, which includes both healthy patients and patients with breast cancer. The authors collected thermography images with two different protocols: static and dynamic. For each patient in the static protocol, approximately 5 poses (images) were made available: frontal, and lateral left and right at two angles, 45° and 90°. For the dynamic images, 20 images were collected over a period of 5 min from the same frontal position, following the authors' own proposed protocol. Moreover, two extra images were collected from lateral positions at 90°. With both protocols, the authors performed a patient acclimation process before collecting the data. In the paper, the authors state that the database possessed 3534 images from 141 patients. However, the database has grown since the publication of the paper. The database also contains mammography images.

Research conducted by Zuluaga et al. in Ref. [7] presents a study with many different state-of-the-art CNNs using a fine-tuning approach (ResNet-50, SeResNet-50, SeResNet-18, SeResNet-34, VGG-16, Inception-V3, Inception-ResNet-V2 and Xception). Moreover, the authors propose a surrogate model based on tree-structured Parzen estimator optimization in order to find better CNN architectures. Finally, they investigate the impact of data augmentation on the classification. In their study, 57 patients from the dynamic protocol of the DMR-IR database were used, of which 19 were healthy and 38 sick. The images were pre-processed in order to extract the region of interest (ROI). In addition, cropping, resizing and normalization were performed across all images. For the data augmentation, the authors used vertical and horizontal flips, rotation on angles between 0° and 45°, 20% zoom and normalized noise. In the optimization flow, the authors defined the following hyperparameters: number of blocks, convolutional layers and filters, type of optimizer, kernel and pooling layer size, dropout rate, batch normalization, number of dense units and top layer type. The optimized model is chosen based on the F1-score results. The best result obtained by the authors using the state-of-the-art CNNs is an F1-score of 0.91 using the SeResNet-18. The CNN obtained from the surrogate model was the best result reported by the authors, with an F1-score of 0.92.

The study of Farooq and Corcoran [18] describes many different healthcare applications of thermography, which include cancer detection (breast cancer, skin cancer and brain cancer) and diabetes detection, for example. These authors also propose a computer aided diagnosis (CAD) system using the dynamic images of the DMR-IR database and the Inception-V3 CNN. The authors used images from 40 patients (18 sick and 22 healthy), which were then divided into 70% for training, 20% for validation and 10% for testing. Their image pre-processing included a sharpening filter and a histogram equalization applied to the training and validation data. The CNN hyperparameters were 5000 epochs, a learning rate of 0.001 and an SGD optimizer. The proposed CAD obtained an accuracy of 80% and an F1-score of 76.89%.

The study by Baffa et al. in Ref. [8] proposed a new CNN for the classification of thermography images from the DMR-IR database. The authors used both static and dynamic images. For the static images, they used 300 images, 174 from healthy patients and 126 from sick ones, whereas for the dynamic images, they used 137 patients, 95 healthy and 42 sick. For both setups, data augmentation was applied to the sick class as a way of balancing the database. These authors also tested four different approaches for dealing with dynamic images, which involve, for example, creating a new image with the mean of all the available images, or creating a new image with the difference between the first and the last image. The CNN proposed by the authors has two convolutional layers (size of 5x5 and 32 outputs), followed by two max pooling layers, also 5x5, a stride of 3 and a fully connected layer. They also tested color and gray scale images, and found that the color images performed better. Their best result for static images was 98% accuracy with color images, and for the dynamic images 95% accuracy, also with color images.

The study by Cabioglu et al. in Ref. [11] used the pre-trained AlexNet to classify 181 static images from the DMR-IR database. The authors used 147 images from healthy individuals and 34 images from individuals with cancer. Their experiments were performed using both an unbalanced and a balanced dataset (using data augmentation techniques for the sick class in order to obtain a more balanced database). The authors also tested two different types of data preprocessing. The first treats the thermography images as gray scale images and replicates the single channel to obtain a 3-channel image. The second involves converting the temperature matrix into RGB images based on the Matlab® Jet Colormap. The authors also tested overwriting different parts of the CNN, for example all the fully connected layers or some of the last convolutional layers, in order to verify how the CNN would perform. The best result obtained by these authors is 94.3% accuracy, which was obtained with Jet Colormap preprocessing, the balanced dataset and the replacement of all the fully connected layers.

The study developed by Mambou et al. in Ref. [19] uses a CNN (Inception-V3) associated with a support vector machine (SVM) to classify dynamic thermography images from the DMR-IR database. The data used is composed of 64 patients, 32 from each class, each patient with 20 images. The SVM is only used when the CNN output probability is between 0.5 and 0.6. The authors preprocessed the images to extract the ROI and used a grayscale for conversion to RGB. Their CNN was trained with a learning rate of 0.0001, 15 epochs, and 4000 steps. The result reported for classification is receiver operating characteristic (ROC) of 1 and precision recall of 1.

In Rosalidar et al. [20], the authors compare different fine-tuned CNNs (ResNet-101, DenseNet-201, MobileNet-V2, and ShuffleNet-V2) for the classification of thermography images from the DMR-IR database into cancer or healthy. Here, both static and dynamic images were


used. The static images were used in the training; however, the testing was done twice, once with other static images and once with the dynamic images. Moreover, data augmentation was applied to the training images. The authors reported that DenseNet-201 was the best network, classifying both test sets with 100% accuracy. Furthermore, MobileNet-V2 also obtained very good results with a less demanding training time, with 100% accuracy for the static testing set and 99.6% accuracy over the dynamic tests.

Another study that also uses the DMR-IR and pre-trained CNNs, developed by Kiymet et al., can be found in Ref. [21]. The authors used 144 patients, of which 88 are healthy and 56 sick. Moreover, they used four different pre-trained CNNs: VGG-16, VGG-19, ResNet-50 and Inception-V3. The images were scaled and converted to RGB as the preprocessing step. The best result in their work was obtained using the ResNet-50, with 88.89% accuracy.

In the study by Chaves et al. [9], 88 DMR-IR static images were used (44 for each class), using five pre-trained CNNs: AlexNet, GoogleNet, ResNet-18, VGG-16, and VGG-19. The authors used all 5 static images from each patient, and applied the following data augmentation techniques to the training images: horizontal and vertical flips, moving the images 30 pixels and resizing with a value between 0.9 and 1.1. The authors also tested different numbers of epochs. The best result from these authors was obtained with VGG-16 and VGG-19, both attaining 77.5% accuracy, the first with 85% sensitivity and the second with 90% sensitivity.

Table 1 presents a summary of the related studies described herein. Next, we will review studies that proposed the use of bio-inspired algorithms, specifically GA and PSO, to find CNN architectures.

The research conducted by Junior et al. in Ref. [12] proposes a particle swarm optimization (PSO) algorithm, named psoCNN, as a tool that quickly defines good CNN architectures. This study describes in detail all the PSO operators, i.e. the particle definition, velocity calculation, position update and fitness evaluation. Their algorithm is used to find the entire CNN architecture, with all the convolutional layers, pooling layers and fully connected layers. The particle definition that the authors use allows for flexibility in the number of layers, and they also propose how to compare particles of different lengths. They use three different units to compose the particle: convolutional, pooling and fully connected layers, each one with its own parameters. Adding the units together generates the particle, which represents a CNN architecture (with all the CNN restrictions). In order to validate their psoCNN, these authors use nine public databases: MNIST, MNIST - RD, MNIST - RB, MNIST - BI, MNIST - RD + BI, Rectangles, Rectangles-I, Convex, and MNIST - Fashion. The CNN model established by the psoCNN performs better than the state-of-the-art models for six of the nine databases tested. Although these did not outperform all the state-of-the-art model results, they were able to find very good results with smaller and simpler networks, which leads to faster convergence, and are comparable to the state-of-the-art models.

Research performed by Sun et al. in Ref. [13] also proposes a bio-inspired technique in order to find good CNN architectures automatically, by using a genetic algorithm called EvoCNN. The paper in Ref. [13] describes many of the genetic algorithm definitions, such as the gene encoding, crossover and mutation operators, selection and fitness evaluation. Each gene is composed of a multitude of unit types. There are three unit types available: the convolutional layer unit (with the convolutional layers and the hyperparameters associated with them), the pooling unit (which also encapsulates the pooling layer parameters), and the fully connected unit. The definition of the gene delivered in Ref. [13] provides a flexible CNN structure. However, with the flexible GA individual comes the complexity of the crossover, for which the authors also propose an algorithm. In order to evaluate the proposed EvoCNN, the authors tested it on nine different databases: Fashion, Rectangle, Rectangle Images (RI), Convex Sets (CS), MNIST Basic (MB), MNIST with Background Images (MBI), MNIST with Random Background (MRB), MNIST with Rotated Digits (MRD) and MNIST with RD plus Background Images (MRDBI), and compared it to many different classifiers used on these datasets. The CNNs obtained by the EvoCNN outperform almost all of the classifiers compared on the stated datasets, with the advantage of much smaller CNNs.

The study by Fidelis et al. in Ref. [22] uses a genetic algorithm to find comprehensible rules in two databases, a dermatology database and a breast cancer database, both from the University of California Irvine (UCI) machine learning repository. In their study, the authors propose an individual encoding which provides a very flexible approach for turning a rule on and off, while also maintaining a simple approach for the crossover and other GA operators. Each individual is composed of many genes, and each gene has three parts: weight, operator and value. The authors consider a weight above 0.3 as being present in the rule and below 0.3 as not present. Therefore, changing only the weight might turn the rule on or off. The dermatology database has five rules and the authors generated rules for each of the rules, with test fitness (specificity*sensitivity) between 1 and 0.78. For the breast cancer dataset, they generated two rules, but with a low test fitness of 0.36 and 0.39. Although their study does not use the GA in the context of CNN optimization, we were inspired by their individual representation and adapted it for our problem, as we will show in the next section.

3. Methodology

3.1. Database

The database used in this study is the DMR-IR, a public database proposed by Silva et al. [17] that contains infrared images from two classes of patients: healthy and with breast cancer, which we will call "sick". The dataset is composed of infrared images from two different

Table 1
Summary of related work.

Reference | Database | Protocol | ML Technique | Best Results
[15] | Combined three databases | Static | ResNet-101 | Sensitivity: 84.6% for the unbalanced set and 92.6% for the balanced set
[7] | DMR-IR | Dynamic | ResNet-50, SeResNet, VGG-16, Inception-V3, Inception-ResNet-V2, Xception and CNN obtained from the optimization method | 0.92 of the F1-score with the optimized model
[18] | DMR-IR | Dynamic | Inception-V3 | F1-score of 76.89%
[8] | DMR-IR | Static and Dynamic | Proposed CNN | 98% accuracy with the static, 95% accuracy with the dynamic
[11] | DMR-IR | Static | AlexNet | 94.3% accuracy
[19] | DMR-IR | Dynamic | Inception-V3 and SVM | ROC of 1
[21] | DMR-IR | Not mentioned | VGG-16, VGG-19, ResNet-50 and Inception-V3 | 88.89% accuracy with ResNet-50
[20] | DMR-IR | Static and Dynamic | ResNet-101, DenseNet-201, MobileNet-V2 and ShuffleNet-V2 | 100% accuracy with DenseNet-201
[9] | DMR-IR | Static | AlexNet, GoogleNet, ResNet-18, VGG-16, and VGG-19 | 77.5% accuracy with both VGG-16 and VGG-19
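The studies in Table 1 report different metrics (accuracy, sensitivity, F1-score), which makes direct comparison difficult. As a reminder of how these metrics relate, the following is a minimal sketch that computes them from the four confusion-matrix counts of a binary "sick" vs. "healthy" classifier. The function name and the example counts are illustrative assumptions and do not come from any of the cited studies.

```python
from typing import Dict


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Compute common binary-classification metrics from confusion-matrix counts.

    tp/fn count sick cases predicted sick/healthy;
    fp/tn count healthy cases predicted sick/healthy.
    """
    total = tp + fp + fn + tn
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # also called recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # F1 is the harmonic mean of precision and sensitivity.
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = classification_metrics(tp=8, fp=4, fn=2, tn=6)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With unbalanced classes, accuracy can stay high while sensitivity is low, which is one reason the F1-score is often preferred in this setting.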


thermography protocols: static and dynamic. In the static protocol, the patient rests for some minutes to complete the acclimatization process. Following this, an image is taken from each of 5 different positions (frontal, lateral left and right with 45° rotation and lateral left and right with 90° rotation), which gives a total of 5 images per patient. For the dynamic protocol, 20 images are taken over a period of 5 min from only the frontal position. Additionally, two extra images are taken from the 90° lateral rotation. All the images are of size 640 x 480 px, and a temperature value is associated with each pixel.

Initially, we decided to focus on the static images. The subset of the static images is composed of 864 healthy images (176 different patients) and 193 images from sick patients (39 patients). There are still some patients with unknown classes, and as such they were not used in the present study.

Although the DMR-IR is widely used in the literature, different studies use different numbers of images, and do not specify the exact patient numbers used. As we also reported in our previous study [23], we decided to use the same number of patients reported by Ref. [11] (without the data augmentation they reported), even though we do not have the specifics of the patients used in that study. Therefore, we selected 34 sick patients and 147 healthy patients. As in Ref. [23], in the present study we use only the frontal images. Hence, we also have 34 sick images and 147 healthy images.

Because we have an unbalanced dataset, we used data augmentation to deal with this imbalance. The data augmentation is performed using rotation at a random angle between −30° and 30°, translation along the X-axis and translation along the Y-axis. All three techniques are randomly applied, thus a single input image is likely to produce different output images. We applied data augmentation to both classes, but with different proportions. For the healthy class, each original image generates one other image with data augmentation applied, whereas for the sick class, each original image generates 7 new images with data augmentation. In total, we have 294 healthy images and 272 sick images.

As also described in Ref. [23], we are using another subset from the database with a balanced number of original images, in which we use 38 patients for each class. In this subset, we are also only considering the frontal images. Therefore, we have 38 images from each class as well.

3.2. Image preprocessing

The image preprocessing used in this study involves three steps: converting the 2D images of the dataset into images with 3 channels, mapping the temperature values to values between 0 and 1, and resizing the image. All the preprocessing was necessary due to the CNNs that we used: to use the CNNs, we must adapt our images to the input images the CNNs expect. For image resizing, we resized the images to 224 x 224 px. For the other two steps, we tested two different approaches. The first combines a min-max conversion with channel replication. The min-max was performed in order to obtain an image with values between 0 and 1. Therefore, for each image we applied the min-max with the min and max of the image itself, as shown in Equation (1).

$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \qquad (1)$$

where $X'$ is the obtained value, $X$ is the pixel temperature value, and $X_{\max}$ and $X_{\min}$ are respectively the maximum and minimum values of the image.

After the min-max conversion, the channel replication is performed by taking the single channel and replicating it to obtain a 3-channel image. Therefore, all 3 channels of the obtained image are equal.

The other approach is what we called the Jet Colormap approach. This approach is inspired by the study in Ref. [11]. Opening the original image in Matlab® and choosing the Jet Colormap generates a 3-channel image with the values in the desired range. Therefore, we processed the images to obtain an image similar to that generated through the colormap process. The processing procedure for this approach involves converting the original temperature image into a Matlab® index image and then converting this image to RGB. By use of this approach, we obtain an image with 3 channels and pixel values in a range between 0 and 1.

3.3. CNNs

CNNs are one of the main techniques used today for pattern recognition in images [7]. They have also been used in breast cancer detection, where they produced good results, as shown in the previous section. A typical CNN is represented in Fig. 1, which is an adaptation of the figure shown in Ref. [24]. Noteworthy here is that CNNs are mainly divided into two parts. The first is the convolutional part, made up of convolutional layers, which are responsible for the feature extraction, and pooling layers, which are responsible for reducing the dimensions of the image [24]. The second part is the fully connected part, which performs the classification using the data extracted by the preliminary part of the network [24].

Transfer learning is a technique that consists of taking a model pre-trained on one database and transferring that knowledge to a new database or domain [21], hence transferring the learned features extracted beforehand. Many pre-trained CNNs are available in the literature, so we decided to choose three that have shown promising results for breast thermography: VGG-16, ResNet-50 and DenseNet-201. We maintain the convolutional part of the CNNs as provided, while only changing the fully connected part, which is the part that performs the classification.

3.3.1. Weight initialization

In [25], Glorot et al. propose a new method for initializing the weights of a deep neural network, known as 'Glorot initialization' or 'Xavier initialization', which is reported to provide faster convergence. The proposed method uses a uniform distribution for the weight initialization and can be defined as:

$$U\left[-\frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}},\ \frac{\sqrt{6}}{\sqrt{n_j + n_{j+1}}}\right] \qquad (2)$$

where $n$ is the layer size.

In this particular study, this method is used to initialize the weights of the fully connected layers of the CNNs.

3.4. Optimization

Due to the manual, trial-based nature of finding a good fully connected layer architecture with a good learning rate, along with the time demanded to perform these experiments, we propose the use of optimization techniques for this process. The optimization techniques chosen are two bio-inspired algorithms: GA and PSO. For both algorithms, we considered the same parameters for optimization, which are:

● learning rate;
● number of fully connected layers;
● number of neurons of each fully connected layer;
● after which layers there is dropout;
● dropout rate of each dropout present in the layer;

Moreover, for both bio-inspired algorithms the ranges of the optimized parameters are the same. The value used to compose the learning rate (LR) is defined as an integer value (lr) in the range [1, 6], which in the conversion to the CNN is used as $10^{-lr}$. The number of neurons in a specific layer is defined as a value in the range [3, 10], called fc, which is converted as $2^{fc}$ for use in the definition of the fully connected CNN layer. Finally, we have a float value in the range [0, 0.6] for the dropout rate of a layer. Furthermore, the number of fully connected


layers is an implicit argument, as is the choice of which fully connected layers are followed by a dropout layer, which means they are not explicit algorithm arguments. Instead, they are consequences of the algorithm individual or particle definition.

These two bio-inspired optimization techniques are used in this study for finding a good architecture for the fully connected layers with a good value for the learning rate. Next, we describe both the GA and PSO approaches.

3.4.1. Genetic algorithm

Algorithm 1 presents the GA we use, and in the following, its underlying approach is described in detail.

Algorithm 1. GA algorithm.

The first important definition of the GA is how the individual is going to be defined. Inspired by Ref. [22], a model which reached flexibility for the GA individual with standard crossover operators, we decided to also use a fixed-length individual with a flag that turns each specific chromosome on or off. Fig. 2 shows how we define our individual. Each individual consists of 12 chromosomes, and each chromosome is composed of two parts: a flag that indicates whether that chromosome is enabled (the 'Is present?' in the figure), and a value for that specific layer or parameter. As we use a fixed individual length, the first chromosome defines the learning rate. From the second chromosome to the 11th, the fully connected layers and the dropouts are defined, where the even chromosomes in this range define the fully connected (FC) layers and the odd chromosomes define the dropout layers. Furthermore, in order for


Fig. 1. CNN example - image adapted from Baldominos [24].

Fig. 2. GA individual definition.
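As a rough illustration of the individual layout in Fig. 2, the sketch below decodes a 12-chromosome individual into a learning rate and a list of layers. The names `Chromosome` and `decode_individual` are hypothetical, and two details are assumptions rather than statements from the text: the learning rate chromosome is always treated as present, and its value x is read as a positive exponent so that the learning rate is 10^-x (the GA value ranges are not given in this section).

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Layer = Tuple[str, Union[int, float]]


@dataclass
class Chromosome:
    present: int   # the 'Is present?' flag: 1 = enabled, 0 = disabled
    value: float   # exponent for LR/FC chromosomes, rate for dropout chromosomes


def decode_individual(individual: List[Chromosome]) -> Tuple[float, List[Layer]]:
    """Turn a fixed-length GA individual into (learning_rate, layers)."""
    if len(individual) != 12:
        raise ValueError("expected 12 chromosomes")

    # Chromosome 1: learning rate. Assumption: value x means 10 ** -x.
    learning_rate = 10.0 ** -individual[0].value

    layers: List[Layer] = []
    # Chromosomes 2..11 (indices 1..10) form five (FC, dropout) pairs:
    # even-numbered chromosomes are FC, odd-numbered ones are dropout.
    for i in range(1, 11, 2):
        fc, dropout = individual[i], individual[i + 1]
        if not fc.present:
            continue  # the dropout that follows a disabled FC is ignored too
        layers.append(("fc", 2 ** int(fc.value)))
        if dropout.present:
            layers.append(("dropout", float(dropout.value)))

    # Chromosome 12: always an FC layer with value 1, i.e. 2 ** 1 = 2 outputs.
    layers.append(("fc", 2 ** int(individual[11].value)))
    return learning_rate, layers
```

With this decoding, the number of FC layers is never passed in explicitly: it is just the count of enabled FC chromosomes plus the fixed output layer.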

the individual to correspond to a valid CNN in our particular problem, the last chromosome is always an FC layer with a value of 1, as 2^1 is the CNN output.

Moreover, a consequence of our representation is that if an FC chromosome has 'Is present?' equal to 0, the following dropout chromosome is simply ignored, as a dropout chromosome applies only to the FC chromosome immediately preceding it. It is also worth noting that in the GA, the number of fully connected layers is implicit: it is the number of FC chromosomes whose 'Is present?' flag is 1.

In order to verify how well the CNN with the GA-defined FC architecture behaves, it is necessary to train the CNN. Therefore, to calculate the fitness of each individual, a CNN is trained, and the fitness of the individual is the F1-score on the validation set. We noted that some individuals had a non-decreasing number of neurons (where the number of neurons in layer i was smaller than the number of neurons in layer i+1). As increasing the number of neurons from one layer to the next in the FC part of a CNN is not common in the literature, we decided to penalize architectures with this behaviour by a factor of 0.7 (empirically chosen). For instance, an individual that maps to FC layers of 256 -> 256 -> 1024 will be penalized, whereas one that maps to 1024 -> 1024 -> 256 will not. Hence, the fitness is defined as:

\[
\text{fitness} =
\begin{cases}
0.7 \times \text{F1-score of validation}, & \text{non-decreasing number of neurons in each layer} \\
\text{F1-score of validation}, & \text{non-increasing number of neurons in each layer}
\end{cases}
\tag{3}
\]

After choosing the individual representation and the fitness calculation, it is also necessary to define the selection, crossover, mutation and reinsertion operators. Because we have a fixed individual length, we decided to use single-point crossover, where two parents generate two children. A position Cp is randomly selected and the first child inherits the chromosomes before Cp from the first parent and those after Cp from the second parent. The second child is the opposite (inheriting the first part from the second parent and the second part from the first parent). Fig. 3 shows an example of a crossover.

Another important GA operator is how the individuals are selected for the crossover. We chose tour selection, as it allows the selective pressure of the GA to be changed by changing the tour size. With this approach we can therefore test different tour sizes to determine which one is best suited to our problem. The tour selection algorithm depends on a value, which we call 'tourValue'. We randomly select 'tourValue' individuals from the entire population, and the one with the best fitness is selected for the crossover. Hence, for each pair of parents, we run the tour selection twice, once for each parent.

The mutation is performed by changing one value of the individual. This value can be from the 'Is present?' part or the actual value part. We randomly select a chromosome to be mutated, and then the part that is going to be mutated. The new value is generated from the specific range of that part of the chromosome. Therefore, if the 'Is present?' part is selected, it is simply flipped from 0 to 1 or vice versa. For the value part, a new value is randomly generated from the range for that position. Fig. 4 shows an example of a mutation of an individual. In the image, the red values are the values that were changed.

Finally, for reinsertion, we use order reinsertion, keeping the best

Fig. 3. GA crossover.
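The penalized fitness of Equation (3), the tour selection and the single-point crossover of Fig. 3 can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the penalty is applied whenever some FC layer is smaller than the next one (one reading of the text and of the 256 -> 256 -> 1024 example), contenders are drawn without replacement, and Cp is restricted so that each child receives at least one chromosome from each parent.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def penalized_fitness(f1_validation: float, fc_sizes: Sequence[int],
                      penalty: float = 0.7) -> float:
    """Equation (3): scale the validation F1-score when the FC sizes grow."""
    # Assumption: penalize if any layer has fewer neurons than its successor.
    grows = any(a < b for a, b in zip(fc_sizes, fc_sizes[1:]))
    return penalty * f1_validation if grows else f1_validation


def tour_selection(population: Sequence[T], fitnesses: Sequence[float],
                   tour_value: int, rng: random.Random) -> T:
    """Pick 'tourValue' random individuals and return the fittest of them."""
    contenders = rng.sample(range(len(population)), tour_value)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]


def single_point_crossover(parent1: List[T], parent2: List[T],
                           rng: random.Random) -> Tuple[List[T], List[T]]:
    """Swap the tails of two fixed-length parents at a random position Cp."""
    cp = rng.randint(1, len(parent1) - 1)  # inclusive bounds
    child1 = parent1[:cp] + parent2[cp:]
    child2 = parent2[:cp] + parent1[cp:]
    return child1, child2
```

Because the individuals have a fixed length and the crossover keeps every chromosome at its position, both children keep the learning rate in the first slot and the output FC layer in the last one.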


individual from one generation to another.

Fig. 4. GA individual mutation.

3.4.2. PSO algorithm

The algorithm for the PSO is described in Algorithm 2; details of its approach are given below.

Algorithm 2. PSO.

In order to use the PSO, many definitions are needed: the particle definition, the particle evaluation, how the difference between two particles is calculated, how the particle velocity is updated and, finally, how the particle position is updated. Although the PSO optimizes the same parameters as the GA, we use a different representation for the particle than that used for the GA individual. Most of the PSO definitions were inspired by the study in Ref. [12].

Our particle is composed of different pieces that we call fragments. Each fragment has two parts: type and value. The type indicates the category of the fragment, which can be LR, FC or dropout. The value stores the actual value for that fragment type: an integer in the range [1, 6] for the learning rate, an integer in the range [3, 10] for the number of neurons in a specific layer, or a float in the range [0, 0.6] for the dropout rate of a layer. The value of a fragment of type LR or FC stores the exponent used to obtain the actual number: for the LR the base is 10, and for the FC the base is 2. Fig. 5 shows the particle definition. In order to obtain a valid CNN from a particle, the particle has some restrictions: the first fragment of the particle is always an LR fragment and the last fragment is always an FC layer with value 1 (so the output of the CNN is always


Fig. 5. PSO particle definition.
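A particle as described in this section can be sketched as a list of (type, value) fragments. The helper names below are hypothetical. Reading the LR value x as 10^-x is an assumption (the text only states that the base is 10, while the reported learning rates are of the form 10^-4), and rejecting a second LR fragment is also an assumption of this sketch.

```python
from typing import List, Tuple, Union

Fragment = Tuple[str, float]          # (type, value)
Layer = Tuple[str, Union[int, float]]


def is_valid(particle: List[Fragment]) -> bool:
    """Check the restrictions stated for a particle (output FC layer excluded)."""
    if not particle or particle[0][0] != "LR":
        return False                  # the first fragment is always an LR fragment
    if not 1 <= particle[0][1] <= 6:
        return False
    previous = "LR"
    for kind, value in particle[1:]:
        if kind == "FC":
            if not 3 <= value <= 10:
                return False
        elif kind == "DROPOUT":
            if previous == "DROPOUT" or not 0.0 <= value <= 0.6:
                return False          # no two successive dropout fragments
        else:
            return False              # assumption: LR appears only once
        previous = kind
    return True


def to_network(particle: List[Fragment]) -> Tuple[float, List[Layer]]:
    """Convert a valid particle into (learning_rate, layers)."""
    learning_rate = 10.0 ** -particle[0][1]   # assumption: negative exponent
    layers: List[Layer] = []
    for kind, value in particle[1:]:
        if kind == "FC":
            layers.append(("fc", 2 ** int(value)))
        else:
            layers.append(("dropout", float(value)))
    # The last FC fragment (value 1, two outputs) is only added at conversion.
    layers.append(("fc", 2 ** 1))
    return learning_rate, layers
```

Keeping the output layer out of the particle itself mirrors the choice described in the text: the update operators can then never remove it.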

binary, as we have two classes), and the particle cannot have two or more successive dropout fragments. For the sake of simplicity, the last fragment is only added when the particle is converted to a CNN for the fitness evaluation and when the best particle is returned, as it is not desirable for this fragment to be removed by the update flow that the particle undergoes within the algorithm.

Similar to the GA approach, the evaluation of a single particle involves training a CNN in which the architecture of the fully connected layers and the learning rate come from the particle. Therefore, a corresponding CNN is trained for each particle, and the fitness of the particle is the same as described for the GA, i.e., the F1-score on the validation set, along with the penalty, as shown in Equation (3).

Another definition of the PSO is how it calculates the difference between two particles. The particle difference is calculated fragment by fragment. The fragment difference is defined as: 0 when the two fragments have the same type; the fragment of the first particle itself if the two types are different (or if the second particle does not have a corresponding fragment for that position, i.e., when the second particle is smaller than the first); -1 when the second particle has a fragment for which the first has no corresponding partner (the case where the second particle has more fragments than the first). Fig. 6 shows two different examples of how we calculate the particle difference.

It is also necessary to define how to calculate the particle velocity. The velocity of each fragment is taken from either the difference pBest - P or the difference gBest - P. The difference is selected based on a random value r and Cg (one of the PSO arguments). When r > Cg, the fragment is selected from pBest - P, whereas when r <= Cg, it is selected from gBest - P. Fig. 7 shows an example of this calculation.

As noted in Ref. [12] for their PSO, there is one special case for the velocity, which also occurs in our approach. This case happens when both differences (pBest - P and gBest - P) have all fragments equal to 0. We use the same procedure as that reported in Ref. [12]: instead of using the differences, we use gBest and pBest themselves, keeping the random selection for each fragment based on Cg.

Furthermore, the particle position update should also be defined. We update the particle position as follows: for a fragment whose velocity is 0, the particle fragment is maintained; for a fragment whose velocity is either an FC layer or a dropout, the velocity fragment is kept; finally, for a fragment whose velocity is -1, the particle fragment is removed. An example of how the position is updated is shown in Fig. 8, in which P1^t is the particle at moment t and P1^(t+1) is the particle at moment t + 1, after being updated with the velocity shown in Fig. 8.

4. Experiments and results

In this section, we provide details on the experiments performed and present the results obtained. For the experiments, we used Python and PyTorch, and Matlab® for some of the preprocessing. In order to take advantage of graphics processing units (GPUs), we opted to run our experiments on Google Colab®. Furthermore, as the fitness calculation of both the GA and PSO is computationally expensive and time demanding, parallelism is used to reduce run time. We only use the parallelism for the fitness calculation, as each fitness is completely independent of the other individuals or particles.

The two bio-inspired optimization techniques are used to optimize the architecture of the CNN based on the validation subset (which is used for the fitness calculation). When the PSO or the GA finishes evolving, we take the architecture represented by the best individual or the best particle and train four new CNNs with that architecture. The difference between these four runs is the number of epochs: we run 30 and 50 epochs, both with and without early stopping enabled. Early stopping is used to avoid overfitting: after 10 epochs without a reduction in the loss on the validation set, we stop the training. Finally, all four CNNs are evaluated on the test data, and we use the F1-score on the test data to decide which are our best networks. Furthermore, all the experiments used the Adam optimizer.

As previously mentioned, we use two subsets of the original static images database, both with frontal images only: one with 294 healthy images and 272 sick images (which we will call group-1) and the other with 38 images from each class (called group-2). Both subsets were divided into 70% of the data for training, 15% for validation and 15% for test.

Both groups (group-1 and group-2) were tested in the optimization using GA and PSO. Therefore, we have four broad setups for the experiments: group-1 with GA, group-1 with PSO, group-2 with GA and group-2 with PSO. Furthermore, each of these setups is executed for all three CNNs used: ResNet-50, VGG-16 and DenseNet-201. Moreover, as both GA and PSO are meta-heuristic algorithms, we run each experiment 5 times and the best run is presented in this section.

The baseline results that we compare with the optimization results are what we call manual experiments. In these experiments, six pre-defined architectures for the fully connected layers were tested with two different learning rates, as described in our previous study [23]. In summary, the six setups are:

● (1) 1 FC;
● (2) 2 FC with 256 neurons as the output of the first layer and input of the second (256 output neurons of the first layer, 2 output neurons of the second layer);
● (3) 2 FC with 512 neurons (512, 2);
● (4) 2 FC with 1024 neurons (1024, 2);
● (5) 3 FC with 4096 and 1024 neurons (4096, 1024, 2);
● (6) 3 FC with 4096 neurons (4096, 4096, 2).

The best results for each of the CNNs with the manual setups are described in Ref. [23], but for the sake of simplicity and comparison, they are also given in Table 2 and Table 3. We only added the best result for the VGG-16 in the manual setup for group-2, which was not present before.

4.1. GA results

GA experiments used an 80% crossover rate. This value determines whether a selected pair of individuals goes through the crossover process: a random number between 0 and 1 is generated; if it is greater than 0.8, the selected pair of individuals is kept as is; otherwise, a single-point crossover is applied and two


Fig. 6. Calculating the difference between two particles.
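The fragment-by-fragment difference illustrated in Fig. 6 can be sketched as below. The function name and the constants `KEEP` and `REMOVE` are hypothetical; the three cases follow the definition given in Section 3.4.2 for the difference "first - second".

```python
from typing import List, Tuple, Union

Fragment = Tuple[str, float]           # (type, value)
Entry = Union[int, Fragment]           # 0, -1, or a fragment

KEEP = 0      # both particles have the same fragment type at this position
REMOVE = -1   # only the second particle has a fragment at this position


def particle_difference(first: List[Fragment],
                        second: List[Fragment]) -> List[Entry]:
    """Difference 'first - second', computed fragment by fragment."""
    difference: List[Entry] = []
    for i in range(max(len(first), len(second))):
        a = first[i] if i < len(first) else None
        b = second[i] if i < len(second) else None
        if a is None:
            difference.append(REMOVE)      # second is longer than first
        elif b is None or a[0] != b[0]:
            difference.append(a)           # types differ, or second is shorter
        else:
            difference.append(KEEP)        # same type, values are not compared
    return difference
```

Note that two fragments of the same type yield 0 even when their values differ, which is why these operations change the number and order of layers but not the stored values.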

Fig. 7. PSO velocity calculation.
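The velocity selection of Fig. 7, including the special case in which both differences are all zeros, might look like the sketch below. It takes the two differences (pBest - P and gBest - P) as precomputed lists. The function name is hypothetical, and the handling of differences with unequal lengths (falling back to the other difference when the selected one has no entry at a position) is an assumption, since the text does not cover that case.

```python
import random
from typing import List, Sequence, Tuple, Union

Fragment = Tuple[str, float]
Entry = Union[int, Fragment]   # 0 = keep, -1 = remove, or a fragment


def fragment_velocity(pbest_diff: Sequence[Entry], gbest_diff: Sequence[Entry],
                      pbest: Sequence[Fragment], gbest: Sequence[Fragment],
                      cg: float, rng: random.Random) -> List[Entry]:
    """Build the velocity by choosing, per position, between the two differences."""
    all_zero = (all(e == 0 for e in pbest_diff)
                and all(e == 0 for e in gbest_diff))
    if all_zero:
        # Special case: use pBest and gBest themselves instead of the differences.
        from_pbest, from_gbest = list(pbest), list(gbest)
    else:
        from_pbest, from_gbest = list(pbest_diff), list(gbest_diff)

    velocity: List[Entry] = []
    for i in range(max(len(from_pbest), len(from_gbest))):
        r = rng.random()
        # r > Cg selects from pBest - P, r <= Cg selects from gBest - P.
        chosen, fallback = ((from_pbest, from_gbest) if r > cg
                            else (from_gbest, from_pbest))
        # Assumption: if the chosen list is too short, use the other one.
        source = chosen if i < len(chosen) else fallback
        velocity.append(source[i])
    return velocity
```

A larger Cg makes the global best particle more influential, which matches the remark in the results that varying Cg gives the best particle more or less relevance.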

Fig. 8. PSO update position.
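The position update of Fig. 8 applies the velocity to the particle one position at a time. The sketch below uses a hypothetical function name and assumes that a position without a velocity entry keeps its fragment; repairing particles that end up violating the restrictions (for example, two successive dropout fragments) is not covered here.

```python
from typing import List, Sequence, Tuple, Union

Fragment = Tuple[str, float]
Entry = Union[int, Fragment]   # 0 = keep, -1 = remove, or a fragment


def update_position(particle: Sequence[Fragment],
                    velocity: Sequence[Entry]) -> List[Fragment]:
    """Return the particle at moment t + 1 given its velocity."""
    updated: List[Fragment] = []
    for i in range(max(len(particle), len(velocity))):
        # Assumption: a missing velocity entry behaves like 0 (keep).
        v = velocity[i] if i < len(velocity) else 0
        if v == -1:
            continue                         # the particle fragment is removed
        if v == 0:
            if i < len(particle):
                updated.append(particle[i])  # the particle fragment is maintained
        else:
            updated.append(v)                # the velocity fragment is kept
    return updated
```

For instance, applying the velocity `[0, ("FC", 5), ("DROPOUT", 0.3), -1]` to a four-fragment particle keeps its LR fragment, replaces the next two fragments and drops the last one.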

children are generated. Other parameters that were not changed during the experiments reported in this section were the tour size (all the experiments used a tour of 2) and the number of iterations (10 for all experiments). Other parameters, such as population size and mutation rate, were varied empirically to find the best ones. For the population size (PS), the tested values were 10 and 20; for the mutation rate (MR), we tested 20%, 30%, 40% and 60%. Therefore, we have 8 possible scenarios to be tested (each PS with each MR).

Table 4 and Table 5 show the best results obtained with our GA for both groups of data previously described. In the tables, 'Best GA Fitness'


Table 2
Baseline results manual experiments - group-1.

CNN           LR      Architecture                             Acc(%)  SE(%)  SP(%)  F1-score
DenseNet-201  0.0001  2 FC layers with 1024 neurons (1024, 2)  88.23   93.75  80.43  0.8824
ResNet-50     0.0001  2 FC layers with 1024 neurons (1024, 2)  81.91   91.67  71.74  0.8381
VGG-16        0.001   2 FC layers with 512 neurons (512, 2)    77.66   72.92  82.61  0.7692

Table 3
Baseline results manual experiments - group-2.

CNN           LR     Architecture                                   Acc(%)  SE(%)  SP(%)  F1-score
DenseNet-201  0.001  3 FC layers with 4096 neurons (4096, 4096, 2)  91.67   100    83.33  0.9230
ResNet-50     0.001  2 FC layers with 512 neurons (512, 2)          83.33   83.33  83.33  0.8333
VGG-16        0.001  2 FC layers with 1024 neurons (1024, 2)        75.0    50.0   100    0.6666

refers to the F1-score on the validation subset of the best individual of the final population, whereas ACC, SE, SP and F1-score refer to the accuracy, sensitivity, specificity and F1-score, respectively, on the test subset, which were obtained after the entire GA flow had finished. NE is the number of epochs, from the four possibilities tested, that gave that result. The learning rate and fully connected layer architecture obtained for group-1 are:

● DenseNet-201: LR 10^-4; FC layer (4096, 256, dropout 0.3401, 2);
● ResNet-50: LR 10^-5; FC layer (1024, 2);
● VGG-16: LR 10^-3; FC layer (1024, 8, 2);

Moreover, the learning rate and fully connected layer architecture obtained for group-2 are:

● DenseNet-201: LR 10^-4; FC layer (2);
● ResNet-50: LR 10^-5; FC layer (128, dropout 0.4267, 32, 2);
● VGG-16: LR 10^-2; FC layer (32, dropout 0.3065, 8, dropout 0.3153, 2);

The GA evolution was evaluated with 30 epochs without early stopping and with 30 epochs with early stopping. For group-1, the GA performed better when the evolution occurred without early stopping; these are the results shown in Table 4. On the other hand, for group-2 the CNNs performed better on the test data when the GA evolution process had early stopping enabled, as shown in Table 5. Another change is the fitness penalization factor (Equation (3)): for group-1 we use 0.6 and for group-2 we use 0.7, both empirically chosen.

By analyzing the results in Table 4, we note that the VGG-16 is the CNN that benefited the most from the GA optimization for group-1. DenseNet-201 obtained equivalent results with the manual setup [23] and the setup with the GA, both with 0.88 of the F1-score. ResNet-50 presented slightly better results with the GA, obtaining 0.8461 of the F1-score compared to 0.8381 with the manual setup. The GA optimization for the VGG-16 improved the result from 0.7692 of the F1-score (with the manual setup) to 0.8571. We also note that all the results were obtained with a 40% mutation rate. Moreover, both DenseNet-201 and VGG-16 required more individuals in order to obtain their best results.

In Table 5, which refers to data group-2, we note that the DenseNet-201 showed the same value for the F1-score using the GA as in the manual experiments. Nevertheless, for VGG-16 and ResNet-50, results improved significantly using the GA optimization. The VGG-16 achieved 0.66 of the F1-score without the GA optimization and 0.92 with it, while ResNet-50 improved from 0.83 of the F1-score to 0.90.

Moreover, in group-2, the best result obtained with the DenseNet-201 with only 10 individuals was with a 60% mutation rate, whereas the best result over all setups was with 20 individuals and 30% of MR. We note that the smaller population required more mutation to generate greater diversity, which led to better results. Nevertheless, with more individuals, a greater MR did not perform as well. A reason for that could be the need for more iterations for convergence. For ResNet-50, the best result was obtained with 20 individuals and 60% of MR. Therefore, the GA with more variability provided better results for the ResNet-50. Unlike DenseNet-201 and ResNet-50, VGG-16, with only 10 individuals, performed better with a smaller MR. The VGG-16 might need more iterations to converge and provide good results in situations where there is more diversity. Even so, the result obtained with only 10 individuals and an MR of 30% is still significant.

As each evaluation of an individual involves training a CNN in order to obtain the fitness, it is computationally expensive and time demanding, which made it impracticable to run these evaluations with a larger population size and more than 20 iterations.

Table 4
Best results GA - group-1.

CNN           PS  MR(%)  Best GA Fitness  NE          ACC(%)  SE(%)  SP(%)  F1-score
DenseNet-201  20  40     0.7407           30 with ES  87.23   93.75  80.43  0.8823
ResNet-50     10  40     0.6341           30          82.97   91.66  73.91  0.8461
VGG-16        20  40     0.7407           30          84.04   93.75  73.91  0.8571

Table 5
Best results GA - group-2.

CNN           PS  MR(%)  Best GA Fitness  NE  ACC(%)  SE(%)  SP(%)  F1-score
DenseNet-201  20  30     0.8571           30  91.66   100    83.33  0.9230
ResNet-50     20  60     0.9230           50  91.66   83.33  100    0.9090
VGG-16        10  30     0.9090           30  91.66   100    83.33  0.9230
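The metrics reported in the tables are tied together by the confusion matrix, and a small worked example makes the relationship concrete. The counts used below are an assumption, not values reported in the text: a group-2 test split of 6 sick and 6 healthy images (roughly 15% of 38 per class), with the sick class taken as positive. They are chosen because they reproduce the DenseNet-201 and VGG-16 rows of Table 5 up to rounding or truncation. The function name is hypothetical.

```python
from typing import Dict


def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Accuracy, sensitivity, specificity and F1-score from confusion counts.

    Assumes every denominator is non-zero (both classes present and at least
    one positive prediction).
    """
    sensitivity = tp / (tp + fn)            # recall on the positive (sick) class
    specificity = tn / (tn + fp)            # recall on the negative (healthy) class
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Assumed counts: all 6 sick images detected, 1 of 6 healthy images misclassified.
metrics = classification_metrics(tp=6, fn=0, tn=5, fp=1)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under these assumed counts the F1-score is 12/13, which is about 0.923, with 100% sensitivity, 83.33% specificity and an accuracy of 11/12. Swapping the single error to a missed sick image (tp=5, fn=1, tn=6, fp=0) gives an F1-score of 10/11, about 0.909, matching the pattern of the ResNet-50 row.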


4.2. PSO results

The PSO arguments analyzed during the experiments were the number of particles, for which we used 10 and 20; the number of iterations, for which we also used 10 and 20; and Cg, for which we tested 0.4, 0.5 and 0.6. In this way, we tested giving the best particle more or less relevance. As mentioned in the GA section, the particle evaluation also involves training a CNN. Hence, it was impracticable to run the PSO with larger swarms and more iterations. Therefore, the experiment with 20 iterations was only run with 10 particles and a Cg of 0.5, whereas all possible combinations of number of particles and Cg were tested for 10 iterations.

The best PSO results obtained for each group are shown in Table 6 and Table 7. The generated networks for group-1 are:

● DenseNet-201: LR 10^-4; FC layer (64, dropout 0.2950, 32, 2);
● ResNet-50: LR 10^-4; FC layer (32, dropout 0.3378, 16, dropout 0.2550, 2);
● VGG-16: LR 10^-4; FC layer (128, 128, 16, dropout 0.5521, 2);

For group-2 we obtained:

● DenseNet-201: LR 10^-4; FC layer (8, dropout 0.4443, 2);
● ResNet-50: LR 10^-5; FC layer (1024, dropout 0.1338, 32, 2);
● VGG-16: LR 10^-3; FC layer (16, dropout 0.2075, 8, dropout 0.1499, 8, 2);

Similar to the GA, we tested the PSO flow with 30 epochs with and without early stopping. Analogous to the GA results presented for the two tested groups, the group-1 results are those without early stopping, whereas the group-2 results come from the PSO flow with 30 epochs and early stopping enabled. Furthermore, these also use the fitness penalization as described for the GA, with 0.6 for group-1 and 0.7 for group-2.

By analyzing the results in Table 6 for group-1, we note that the DenseNet-201 result is slightly inferior to the manual result, obtaining 0.87 of the F1-score, whereas the manual setup obtained 0.88. ResNet-50 obtained a slightly better result with the PSO, 0.84 of the F1-score, compared to the manual result of 0.83. The results with the VGG-16 were improved using the PSO flow, from 0.6666 to 0.8379 of the F1-score for the test data. The behavior of the results is similar to that obtained with the GA, in which the VGG-16 is the CNN that benefited the most from the optimization and the DenseNet-201 is the one for which the optimization was not significant.

In Table 7, with the results from group-2, we note that the PSO performed better with either a greater number of iterations or a greater number of particles: two of the CNNs (VGG-16 and ResNet-50) performed better with 20 particles, and DenseNet-201 performed better with only 10 particles, but with more iterations. The best result for the DenseNet-201 in group-2 was equivalent to both the manual and GA results, with 0.92 of the F1-score. The experiments with the DenseNet-201 and Cg 0.6 presented results below those of the other two Cg values tested, for both swarm sizes (SS). The best ResNet-50 result using the PSO was superior to the manual results: using the optimization, the network achieved 0.8571 of the F1-score in the test set against 0.8333 using the manual setup. This PSO result is below that of the GA, which achieved 0.90 of the F1-score in the test data. Moreover, in contrast to the DenseNet-201 behaviour with the Cg of 0.6, the ResNet-50 achieved its best results with this value for both of the SS tested. Similar to the results of the ResNet-50, the PSO optimization with the VGG-16 outperformed the manual setup (0.8333 against 0.6666 of the F1-score), but was still below the GA result, which obtained 0.92.

4.3. Discussion

The results with GA and PSO optimization improved the majority of the results obtained previously with the manual setup [23]. For the CNN that presented the worst result in the manual setup, i.e., VGG-16, both optimization techniques were able to find architectures and learning rates that significantly improved the results. DenseNet-201 is the only CNN for which the optimization did not find better results. Here, the results obtained were equivalent to the manual ones, except for the PSO in group-1, which is slightly inferior. This occurred mainly because there was more room for improvement in the CNNs that gave poor results in the manual setup, unlike the DenseNet-201, which already had good results. Furthermore, the results obtained with data from group-2 were improved more significantly by the optimization techniques than the results that used data from group-1. Using the optimization techniques, specifically the GA, all the networks showed results above 0.90 of the F1-score using group-2, whereas for group-1 the two optimization techniques only produced a slight improvement to the results.

As previously mentioned, the study developed by Junior et al. in Ref. [12] showed that their proposed PSO was successful in finding relevant CNN architectures, which had fewer parameters and consequently faster convergence in many of the tested databases. These authors used databases that are commonly used to evaluate CNNs (MNIST, MNIST-RD, MNIST-RB, MNIST-BI, MNIST-RD + BI, Rectangles, Rectangles-I, Convex, and MNIST-Fashion). In these databases, even with smaller CNNs, their algorithm found CNNs which provided better results than the state of the art models tested in those databases, in six out of the nine databases tested. Even though Junior et al. used the PSO to find good CNN architecture over the entire network (convolutional layers, dropout layers and fully connected layers), and we only used PSO

Table 6
Best results PSO - group-1.

CNN           Iterations  SS  Cg   Best PSO Fitness  NE          ACC(%)  SE(%)  SP(%)  F1-score
DenseNet-201  10          10  0.4  0.7209            30          86.17   93.75  78.26  0.8737
ResNet-50     10          10  0.5  0.7021            50          81.91   95.83  67.39  0.8440
VGG-16        10          10  0.5  0.7323            30 with ES  81.91   89.58  73.91  0.8349

Table 7
Best results PSO - group-2.

CNN           Iterations  SS  Cg   Best PSO Fitness  NE          ACC(%)  SE(%)  SP(%)  F1-score
DenseNet-201  20          10  0.5  0.7272            30 with ES  91.66   100    83.33  0.9230
ResNet-50     10          20  0.6  0.7058            30 with ES  83.33   100    66.66  0.8571
VGG-16        10          20  0.5  0.8571            50          83.33   83.33  83.33  0.8333


for the fully connected layer part of the network, our results also support their results in a completely different dataset, one with infrared images. Using the PSO, we were able to improve the majority of the results, which were originally obtained by a manual approach.

A GA used to find good CNN architectures was presented in Ref. [13]. In similar fashion to the PSO proposed in Ref. [12], the proposed GA is used to find the entire CNN architecture, whereas we are using it to optimize the fully connected layers. Furthermore, unlike in our study, the authors proposed a variable-length GA individual, whereas we opted for a fixed-length individual, which allows for flexibility, as suggested by Ref. [22]. Nevertheless, Ref. [13] and our study both show that the GA is a suitable technique for the specific problem of finding good CNN architectures and the hyperparameters associated with them. The study in Ref. [13] showed that the proposed GA outperformed all the 22 compared algorithms from the state of the art in nine different tested databases. In this study, using the proposed GA, we were able to find results that outperformed the majority of the results obtained using only the manual tests.

An interesting observation drawn from our results is that the GA performed better than the PSO for the majority of the experiments. Although our particle definition is flexible, as it allows for particles of different sizes, the particle operations do not change the values of the fragments of the particle (whether a learning rate fragment, a dropout fragment or a fully connected layer). They only change the number of layers and the order of those layers. The crossover of the GA also does not change the values, but the mutation does. For the PSO optimization, our problem might require a smoother velocity update. Therefore, a future study would be to test either a different velocity calculation or another particle definition.

In [12,13], finding smaller CNNs is a very important goal because smaller networks converge faster, and consequently demand less time. As we are using the convolutional layers of pre-trained CNNs from the state of the art, the depth of the network is not our main focus in the optimization. However, as we also understand that smaller fully connected layers are more suitable, partly because they demand less time, we use this depth as the tiebreaker in the fitness evaluation. Therefore, when two particles or individuals have the same fitness value, we opt for the one with the smaller number of layers.

Although Zuluaga et al. in Ref. [7] used dynamic infrared images and we are using the static infrared ones (from the same database), both studies showed that using optimization techniques to find CNN architectures for the specific field of breast cancer detection using infrared images brings improvements when compared to finding architectures from scratch or even with only transfer learning approaches. In their work, an algorithm based on a tree Parzen estimator (Bayesian optimization) is proposed to help find a suitable CNN architecture and the entire CNN is chosen with this technique, whereas, as presented before, we use GA and PSO to find the architecture and hyperparameters of the fully connected layers. Even with all the differences, i.e., the protocol of infrared images used, the number of images and the optimization technique, both studies present an F1-score of 0.92 in their best results.

As shown in Table 1, the work of Cabiglu et al. [11] presented an accuracy of 94.3%, while even with improvements from optimization, the accuracy of our models is around 91.6%. Noteworthy here is that both our study and [11] used the static images. Although in group-1 there are the same number of patients as in Ref. [11], after the data augmentation we ended up with a different number of patients (we obtained 294 healthy images and 272 sick ones whereas in Ref. [11] the authors used a total of 147 healthy patients and 135 sick ones). This is because we opted to use data augmentation in both classes, healthy and sick, and Cabiglu et al. augmented only the sick class. We believe that augmenting only one class makes the classification easier, as all the rotated images, for example, will be the sick ones, without the need for any other information concerning the cancer in the breast (hotter areas for instance). Even so, we are working on more experiments with the exact same number of patients for the sake of comparison, in order to understand if our results are going to improve, as we expect, using the same approach as [11] (although not using the exact same patients). Another difference of note is the CNN used: the authors used AlexNet, and we chose VGG-16, ResNet-50 and DenseNet-201.

Due to the high time demand and computational cost of performing our experiments, a future study could be based on the use of a surrogate model in the fitness evaluation of the GA and PSO. This would allow us to evaluate the individuals or particles without the need to train a CNN for each one of them in all the iterations. By using a surrogate model, we might be able to test the optimization techniques with more individuals or particles and with a larger number of iterations, and hence possibly discover even better results.

Even though the database has many other static images available, as each patient has roughly five images taken from different angles, as discussed in Ref. [23], for the majority of the images we do not have the information of in which breast the cancer is located. This becomes a problem in the lateral images, as the network might see an image of a healthy breast from a patient who belongs to the sick class, or even a breast that does in fact have cancer, but where the cancer is not visible in the lateral image. In order to use the lateral images without negatively affecting the classification, we are planning an alternative that uses all the images without losing the information concerning to which patient they belong. An example is an image preprocessing approach that produces a 5D image, where each dimension is an image from one angle; we will therefore be able to tie all the images together and show all of them to the network at once. Furthermore, with more images the computational complexity becomes even more important as time will increase, so this might be viable only after implementing the surrogate and reducing the fitness calculation time. In this approach, we will also need to make slight adjustments to the CNNs, in order to accommodate an input image not with 3 but with 5 channels.

We are also planning to test the proposed methods on two other subsets of the database DMR-IR: a subset with the dynamic images, in order to compare the two types of thermography protocols, and a subset from an ROI extraction, in order to verify whether using only an ROI containing only the area of interest, i.e., the breast of the patient, outperforms the use of the entire images.

5. Conclusions

Thermography, or just infrared imaging, is a technique that is showing good potential in supporting the early detection of breast cancer. In addition, a CAD system based on CNNs is also promising due to the networks' ability to perform image classification. In order to find good results using CNNs, it is essential to find suitable hyperparameters and architecture, which is not an easy task, as it is expensive, demands time and may require a human expert.

In this study, we present two bio-inspired algorithms, GA and PSO, that were used to optimize some hyperparameters and the architecture of fully connected layers from three state of the art CNNs (VGG-16, ResNet-50 and DenseNet-201), in the context of breast cancer detection with static infrared images. We described the individual representation, fitness calculation, selection, crossover and mutation operators for the proposed GA. Moreover, we also detailed the particle definition, velocity calculation, particle update and fitness of the proposed PSO.

We found that the GA outperformed the PSO in our experiments for the database used, but that both proposed optimizations outperformed the majority of the manual experiments [23]. Through the GA optimization, we were able to improve the VGG-16 result from 0.66 to 0.92 of the F1-score in the test subset with 38 static images from each class (called group-2). We were also able to improve the ResNet-50 results from 0.83 to 0.90 of the F1-score, also for the test set in group-2. Furthermore, the proposed optimization method can also be applied to other networks and in other problem domains.

Results presented in this study are encouraging and confirm the potential of thermography to assist in the task of detecting breast cancer


in its early stages. The next steps of our study include the use of different images from DMR-IR, i.e., considering only the ROI that includes the affected regions (breasts), and the use of images from the dynamic protocol to also consider the transient aspect of the physical phenomenon involved. Finally, we will test a surrogate model for our fitness calculation, for both GA and PSO, to allow the evaluation of individuals/particles without the need to train CNNs in each of the iterations, and eventually to find better networks.

CRediT authorship contribution statement

Caroline Barcelos Gonçalves: Conceptualization, Methodology, Software, Writing of first draft. Jefferson R. Souza: Revision, Supervision. Henrique Fernandes: Revision, Supervision.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001, by the Conselho Nacional de Desenvolvimento Científico e Tecnológico - Brazil (CNPq) - Finance Code 407140/2021-2 and by the Fundação de Amparo à Pesquisa de Minas Gerais - Brazil (FAPEMIG).

Appendix A. Supplementary data

Supplementary data to this article can be found online at https://doi.org/10.1016/j.compbiomed.2021.105205.

References

[1] I. American Cancer Society, What is breast cancer?. https://www.cancer.org/cancer/breast-cancer/about/what-is-breast-cancer.html, 2021. (Accessed 12 September 2021).
[2] I. American Cancer Society, How common is breast cancer?. https://www.cancer.org/cancer/breast-cancer/about/how-common-is-breast-cancer.html, 2021. (Accessed 12 September 2021).
[3] I. Instituto Nacional de Câncer, Câncer de mama, 2021. https://www.inca.gov.br/tipos-de-cancer/cancer-de-mama. (Accessed 12 September 2021).
[4] I. Instituto Nacional de Câncer, Controle do Câncer de Mama, 2021. https://www.inca.gov.br/mama. (Accessed 12 September 2021).
[5] I. American Cancer Society, American Cancer Society Recommendations for the Early Detection of Breast Cancer, 2021. https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/american-cancer-society-recommendations-for-the-early-detection-of-breast-cancer.html. (Accessed 12 September 2021).
[6] I. American Cancer Society, Breast Density and Your Mammogram Report, 2021. https://www.cancer.org/cancer/breast-cancer/screening-tests-and-early-detection/mammograms/breast-density-and-your-mammogram-report.html. (Accessed 12 September 2021).
[7] J. Zuluaga-Gomez, Z. A. Masry, K. Benaggoune, S. Meraghni, N. Zerhouni, A CNN-Based Methodology for Breast Cancer Diagnosis Using Thermal Images, arXiv preprint arXiv:1910.13757.
[8] M.d.F.O. Baffa, L.G. Lattari, Convolutional neural networks for static and dynamic breast infrared imaging classification, in: 2018 31st SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), IEEE, 2018, pp. 174–181.
[9] E. Chaves, C.B. Gonçalves, M.K. Albertini, S. Lee, G. Jeon, H.C. Fernandes, Evaluation of transfer learning of pre-trained CNNs applied to breast cancer detection on infrared images, Appl. Opt. 59 (17) (2020) E23–E28.
[10] A.A.A. Figueiredo, J.G. do Nascimento, F.C. Malheiros, L.H. da Silva Ignacio, H.C. Fernandes, G. Guimaraes, Breast tumor localization using skin surface temperatures from a 2D anatomic model without knowledge of the thermophysical properties, Comput. Methods Progr. Biomed. 172 (2019) 65–77, https://doi.org/10.1016/j.cmpb.2019.02.004. ISSN 0169-2607, https://www.sciencedirect.com/science/article/pii/S0169260719300094.
[11] Ç. Cabıoğlu, H. Oğul, Computer-aided breast cancer diagnosis from thermal images using transfer learning, in: International Work-Conference on Bioinformatics and Biomedical Engineering, Springer, 2020, pp. 716–726.
[12] F.E.F. Junior, G.G. Yen, Particle swarm optimization of deep neural networks architectures for image classification, Swarm. Evol. Comput. 49 (2019) 62–74.
[13] Y. Sun, B. Xue, M. Zhang, G. G. Yen, Evolving Deep Convolutional Neural Networks for Image Classification, IEEE Transactions on Evolutionary Computation.
[14] A. Reiling, W. Mitchell, S. Westberg, E. Balster, T. Taha, CNN optimization with a genetic algorithm, in: 2019 IEEE National Aerospace and Electronics Conference (NAECON), IEEE, 2019, pp. 340–344.
[15] J.C. Torres-Galván, E. Guevara, E.S. Kolosovas-Machuca, A. Oceguera-Villanueva, J.L. Flores, F.J. González, Deep convolutional neural networks for classifying breast cancer using infrared thermography, Quant. InfraRed Thermogr. J. (2021) 1–12.
[16] F. Johnson, A. Valderrama, C. Valle, B. Crawford, R. Soto, R. Ñanculef, Automating configuration of convolutional neural network hyperparameters using genetic algorithm, IEEE Access 8 (2020) 156139–156152.
[17] L. Silva, D. Saade, G. Sequeiros, A. Silva, A. Paiva, R. Bravo, A. Conci, A new database for breast research with infrared image, J. Med. Imaging Health Inf. 4 (1) (2014) 92–100.
[18] M.A. Farooq, P. Corcoran, Infrared imaging for human thermography and breast tumor classification using thermal images, in: 2020 31st Irish Signals and Systems Conference (ISSC), IEEE, 2020, pp. 1–6.
[19] S.J. Mambou, P. Maresova, O. Krejcar, A. Selamat, K. Kuca, Breast cancer detection using infrared thermal imaging and a deep learning model, Sensors 18 (9) (2018) 2799.
[20] R. Roslidar, K. Saddami, F. Arnia, M. Syukri, K. Munadi, A study of fine-tuning CNN models based on thermal imaging for breast cancer classification, in: 2019 IEEE International Conference on Cybernetics and Computational Intelligence (CyberneticsCom), IEEE, 2019, pp. 77–81.
[21] S. Kiymet, M.Y. Aslankaya, M. Taskiran, B. Bolat, Breast cancer detection from thermography based on deep neural networks, in: 2019 Innovations in Intelligent Systems and Applications Conference (ASYU), IEEE, 2019, pp. 1–5.
[22] M.V. Fidelis, H.S. Lopes, A.A. Freitas, Discovering comprehensible classification rules with a genetic algorithm, in: Proceedings of the 2000 Congress on Evolutionary Computation. Cec00 (Cat. No. 00th8512), vol. 1, IEEE, 2000, pp. 805–810.
[23] C.B. Gonçalves, J.R. Souza, H. Fernandes, Classification of static infrared images using pre-trained CNN for breast cancer detection, in: 2021 IEEE 34th International Symposium on Computer-Based Medical Systems (CBMS), IEEE, 2021, pp. 101–106.
[24] A. Baldominos, Y. Saez, P. Isasi, Evolutionary convolutional neural networks: an application to handwriting recognition, Neurocomputing 283 (2018) 38–52.
[25] X. Glorot, Y. Bengio, Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.
