
Breast Cancer Classification Using an Ensemble Deep Learning Decision Fusion Technique on Preprocessed Mammograms


Sayma Islam1, Sami Azam2*, Sidratul Montaha3, A.K.M. Rakibul Haque Rafid4, Kayes Uddin Fahim5, Mirjam Jonkman6

1,3,4,5 Health Informatics Research Laboratory (HIRL), Department of Computer Science and Engineering, Daffodil International University, Dhaka-1341, Bangladesh

2,6 Faculty of Science and Technology, Charles Darwin University, Casuarina, 0909, NT, Australia

* Corresponding author: [email protected]

Abstract

Breast cancer is one of the leading causes of death worldwide. The mortality rate can be reduced by timely diagnosis and treatment, and an automated diagnostic system might help to provide a fast and accurate diagnosis. This study aims to classify breast cancer into four classes employing a decision fusion deep learning mechanism, as combining the decisions of multiple models tends to result in better performance than a single model. First, the images are preprocessed and augmented employing several image preprocessing and data augmentation methods. To develop the decision fusion architecture, a number of transfer learning models are trained with the images. Five models are then selected based on the highest test accuracy. Using these five models, four decision fusion architectures are developed, and to each of them a long short-term memory (LSTM) layer is added. Results show that the highest test accuracy, of 98.31%, is achieved when combining the models VGG16, MobileNet and Xception with LSTM. To assess performance stability and potential overfitting issues, k-fold cross validation and single-class performance of the model are presented. To get more insight into the differences between the classes, sixteen geometric features are extracted from the dataset and a clustering approach is presented. The findings and approaches of this study might provide useful insights for automatic mammogram classification.

Keywords:

1 Introduction

Breast cancer is one of the major causes of mortality around the globe, especially for women aged 20 to 59. Between 2012 and 2020, the number of breast cancer patients rose from 1.6 million to 2.2 million [1] [2]. In 2020, 684,996 deaths from breast cancer were reported [2]. Based on a report of the World Health Organization (WHO), the number of estimated breast cancer cases will rise to 19.3 million by 2025 [3]. With early diagnosis, five-year survival rates might improve to 91% [4]. Early identification and treatment can consequently decrease the rate of breast cancer-associated mortality and increase the chances of survival. Mammography is widely used in breast cancer screening as breast cancer has an asymptomatic stage that can be identified using only mammography images [5]. The correct analysis and interpretation of mammograms is a challenging task, even for expert radiologists [6], as they need to observe the dimension, shape, solidity and distortions of masses and calcifications very carefully [7]. The rapid growth in the number of cancer cases has created time pressure for radiologists, as they need to finish the diagnostic procedure in a relatively short period. Analyzing a mammogram requires the careful concentration of a radiologist, and a lack of time might lead to a false interpretation, preventing early diagnosis. An automated system using mammography could therefore play an important role in classifying breast cancer at an early stage [8]. Deep learning, including convolutional neural networks (CNNs), has achieved great success in image detection in recent years and may revolutionize medical image interpretation by outperforming traditional methods [9].

In this research, mammograms are classified and several image preprocessing and deep learning approaches are explored to achieve a promising outcome. The dataset, with a total of 1459 mammograms, comprises four classes: benign calcification, benign mass, malignant calcification and malignant mass. Mammogram interpretation is often very challenging, owing to the structures and features of the abnormalities and the structure of mammary cells, the lack of contrast and poor resolution of some images, poor focus, and an inadequate number of images for training, along with several technical issues such as the spatial resolution or noise [10]. To address these challenges, several image preprocessing and data augmentation techniques are employed to enhance the image quality and increase the number of images before proceeding to the classification. We introduce a deep learning decision level feature fusion technique for classification by combining the decisions of multiple classifiers, as this often outperforms a single classifier. To develop the decision level feature fusion deep learning architectures, seven transfer learning models are first trained with the preprocessed and augmented dataset. Five of these models are selected, based on their accuracy. These five selected models are combined following different criteria to create four decision level feature fusion deep learning architectures, which are trained with the dataset. Finally, an LSTM layer is added to these four architectures and the LSTM-based models are again trained with the dataset. To evaluate the performance of all the models, a number of performance metrics including training accuracy, training loss, validation accuracy, validation loss, test accuracy, test loss, recall, specificity, precision and F1 score are calculated. The VGG16 + MobileNet + Xception with LSTM decision fusion based model outperforms the other models based on these evaluation metrics. For this best model, the test accuracy of the individual classes does not indicate any bias in the performance. Overfitting issues and the robustness of the proposed model are assessed with the accuracy and loss curves and the k-fold cross validation technique [11] with a total of 12 k values. Finally, to illustrate the statistical differences between the four classes of our dataset, sixteen features are derived from the original dataset and a clustering approach is carried out. The features can be divided into three clusters based on the degree of diversity of the feature values.

2 Literature review

A number of studies have been conducted in the field of automated detection of breast cancer using mammograms. Kaur et al. [12] proposed a deep neural network and a Multiclass Support Vector Machine (MSVM) to classify mammograms into three classes: normal, benign and malignant. Preprocessing and feature extraction are carried out by K-means clustering for Speed-Up Robust Features (SURF) selection on a dataset of 322 images. The overall accuracy for the normal, benign and malignant classes was 95%, 94% and 98%, respectively. Lotter et al. [13] introduced a multi-scale CNN trained with a curriculum learning tactic to categorize breast cancer using segmented masks of lesions in mammograms. The features learned with the curriculum tactic are then used to generate a scanning-based network. Using the DDSM dataset, the network achieved the best classification performance, yielding an AUROC of 0.92. Khamparia et al. [14] introduced a transfer learning approach to classify breast cancer, also using the DDSM mammography dataset. Combining a Modified VGG (MVGG), a residual model and MobileNet with a network fusion technique, they achieved an accuracy of 88.3%. In another study [15], breast cancer was classified using mammograms of two datasets: CBIS-DDSM and the mammographic image analysis society digital mammogram database (MIAS). Features were extracted with multiple DCNNs and classification was done using a support vector machine (SVM) with different kernels. Ting et al. [16] improved the performance of a CNN using a feature-wise pre-processing technique. Mammograms were classified into malignant, benign or normal with an accuracy of 90.50% and a specificity of 90.71%. A Multiscale All Convolutional Neural Network (MA-CNN) was proposed by Agnes et al. [17] to classify mammograms into normal, malignant and benign using the mini-MIAS dataset. The performance of the network in terms of accuracy and computation time is improved by combining information using multiscale filters. The network achieved an overall sensitivity of 96% and an AUC of 99%. Khan et al. [18] proposed a Multi-View Feature Fusion (MVFF)-based CADx system for the classification of mammograms using a CNN-based feature fusion technique. They evaluated the system on the CBIS-DDSM and the mini-MIAS datasets. They obtained an area under the ROC curve (AUC) of 93% for the classification of masses and calcifications and 84% for the classification of malignant and benign lesions. The AUC for the classification between normal and abnormal was 93%. In another study [19], the authors suggest using the region of interest (ROI) of a mammogram's detailed coefficients to generate the feature matrix, using a gray level co-occurrence matrix (GLCM). The suggested approach was validated using the two main databases, MIAS and DDSM. The accuracy measures were calculated for normal vs abnormal and benign vs malignant. They obtained test accuracies of 98.0% and 94.2% for the MIAS database and 98.8% and 97.4% for the DDSM database when classifying normal vs abnormal and benign vs malignant, respectively. Six deep learning models were developed by Wang et al. [20] to classify mammograms, resulting in AUCs between 71% and 79%. However, the results of these same methods decreased significantly, to between 44% and 65%, for all six models when applied to three external test data sets. Using deep learning, Jadoon et al. [21] proposed a classification methodology employing a CNN - discrete wavelet transform and a CNN - curvelet transform for mammograms. The models were trained utilizing Softmax and SVM classifiers. The accuracies of the CNN - discrete wavelet and CNN - curvelet transform models were 81.83% and 83.74%, respectively. Limitations of the papers mentioned in this section are summarized in Table 1.

Table 1 Limitations of various papers mentioned in literature review.

No. | Paper | Objective | Approach/model | Limitations
1 | Khan et al. [18] | Classification | CNN, VGGNet, ResNet | (i) lack of image preprocessing; (ii) lack of evaluation of model performance consistency
2 | Beura et al. [19] | Classification | BPNN | (i) no image preprocessing; (ii) lack of evaluation of model performance consistency; (iii) experiment with a single model; (iv) no data augmentation
3 | Wang et al. [20] | Classification | CNN, AlexNet, VGG16, ResNet50 | (i) no image preprocessing; (ii) no data augmentation; (iii) lack of evaluation of model performance consistency
4 | Ting et al. [16] | Classification | CNN | (i) no image preprocessing; (ii) experiment with just one model; (iii) lack of evaluation of model performance consistency
5 | Jadoon et al. [21] | Classification | CNN | (i) no ablation study; (ii) experiment with just one model; (iii) lack of evaluation of model performance consistency
6 | Kaur et al. [12] | Classification | KNN, CNN | (i) experiment with just one model; (ii) lack of evaluation of model performance consistency
7 | Lotter et al. [13] | Classification | CNN, InceptionV3 | (i) no image preprocessing; (ii) lack of evaluation of model performance consistency
8 | Khamparia et al. [14] | Classification | VGG16, MobileNet | (i) lack of image preprocessing; (ii) no data augmentation; (iii) lack of evaluation of model performance consistency
9 | Ragab et al. [15] | Classification | DCNN, AlexNet, GoogleNet, and residual networks (ResNet) | (i) lack of evaluation of model performance consistency
10 | Agnes et al. [17] | Classification | CNN | (i) experiment with just one model; (ii) lack of evaluation of model performance consistency

As can be observed from Table 1, the papers contain several limitations such as lack of image preprocessing and
data augmentation, experimentation with just one model and lack of evaluation of performance consistency of the
models. We attempt to address these limitations in our study by employing image preprocessing, data augmentation,
experimenting with several models, using an ensemble deep learning technique, and assessing the different aspects
of the performance of the proposed architecture.

3 Dataset description

In this study, a total of 1459 mammograms from CBIS-DDSM [22], comprising four classes, are used for all the experiments. The four classes are: benign calcification (BC) with 398 images, benign mass (BM) with 417 images, malignant calcification (MC) with 300 images and malignant mass (MM) with 344 images. All the mammograms have a size of 224 × 224 pixels and are in RGB format. An overview of the dataset is given in Table 2.

Table 2 Number of images per class in CBIS-DDSM dataset

No. | Class | Number of images
1. | Benign Mass | 417
2. | Benign Calcification | 398
3. | Malignant Calcification | 300
4. | Malignant Mass | 344

Inspecting the dataset, we found 17 images in the malignant mass class with no breast tissue. As these images may
degrade the classification performance, they are removed from this class. As a consequence, 327 mammograms
remain in the malignant mass class and the total number of images is 1442. Fig. 1 illustrates example images of the
four classes.

Fig. 1 Images of the four classes of the dataset.

4 Methodology

The dataset used in this research contains several artifacts and noise, including white lines surrounding the images, labels and other artifacts. Moreover, the dense tissue of the breast and the tumorous region appear to have similar intensity levels in the mammograms. It is therefore crucial to preprocess the dataset before training any deep learning model for classification. This research is conducted in seven steps: (i) image preprocessing, (ii) data augmentation, (iii) transfer learning model selection, (iv) developing four decision level fusion models, (v) adding an LSTM layer to the four decision level fusion models, (vi) performance analysis and (vii) analysis of inter-class variance. Fig. 2 illustrates the methodology of this study.

Fig. 2 Overview of the entire methodology utilized in this work

As shown in Fig. 2, the mammograms are first preprocessed by eliminating noise and artifacts and adjusting the brightness-contrast level. For artifact removal and image enhancement, the Non-Local Means (NLM) algorithm, line removal, largest contour detection, the Balance Contrast Enhancement Technique (BCET) and Contrast Limited Adaptive Histogram Equalization (CLAHE) are applied consecutively. From the output, a heatmap is generated to highlight the affected regions. As the dataset contains an inadequate number of images to train a deep learning model effectively, the preprocessed dataset is augmented by employing seven data augmentation techniques. However, before augmentation, the dataset is first split into training, validation and test sets, with a ratio of 70:20:10. The test and validation datasets are set apart and data augmentation is applied only to the training dataset. Seven transfer learning models, VGG16, Xception, ResNet50, ResNet50V2, MobileNet, MobileNetV2 and InceptionV3, are trained with the augmented dataset. Comparing the test accuracy of these seven models, five networks are selected. Applying decision level fusion techniques, these five models are used to generate four decision level deep learning feature fusion models: (a) MobileNet + VGG16, (b) MobileNet + VGG16 + Xception, (c) MobileNet + VGG16 + Xception + ResNet50V2 and (d) MobileNet + VGG16 + Xception + ResNet50V2 + MobileNetV2. These four decision level models are again trained with the augmented dataset and the results are recorded. An LSTM layer is then added to each of these four decision level models, and the resulting models are trained with the augmented dataset. Several performance metrics are evaluated, overfitting issues are assessed, and k-fold cross validation and single-class prediction results are presented. Finally, to investigate the inter-class differences of the four classes, sixteen features are derived from the original dataset and a clustering approach is conducted.

4.1 Image preprocessing

In medical imaging, datasets often contain artifacts, noise and low intensity images, which may degrade the classification performance of a deep learning model. Therefore, to boost the model performance and prevent overfitting issues, the images need to be preprocessed. In this study, several image preprocessing algorithms are used. NLM is applied to remove noise from the original mammograms. The raw images had a thin white border, which is removed using a rectangular shaped binary mask. To remove the remaining artifacts, the largest contour detection method is applied. BCET and CLAHE are applied to the artifact-removed images to enhance the important properties of the images. Finally, a heatmap, which highlights the suspicious region in a different color channel, is generated.

4.1.1 NLM

The Non-Local Means algorithm eliminates noise by computing, for each pixel, a weighted mean over all pixels of the image, where the weights express how similar the neighborhood of each pixel is to the neighborhood of the target pixel. The similarity of the weighted and neighboring pixels is evaluated using the Gaussian-weighted Euclidean distance. The formula is as follows [23]:

NL[u](x) = \frac{1}{C(x)} \int f\big(d(B(x), B(y))\big)\, u(y)\, dy \qquad (1)

where u is the noisy image, d(B(x), B(y)) is the Euclidean distance between image patches centered at x and y respectively, f is a decreasing function and C(x) is a normalizing factor.

The output of NLM is visualized in Fig. 3.

Fig. 3 Output of Non-Local Means Algorithm
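The paper does not specify the NLM implementation or its parameters; as an illustration, a minimal sketch using OpenCV's built-in Non-Local Means filter is given below, where the file names, filter strength h, patch size and search window size are assumptions.

```python
import cv2

# Hypothetical input file; CBIS-DDSM images are assumed to be readable as grayscale.
img = cv2.imread("mammogram.png", cv2.IMREAD_GRAYSCALE)

# OpenCV's Non-Local Means denoising. h controls the filter strength, while
# templateWindowSize and searchWindowSize define the patch B(x) and the search
# neighborhood over which the weighted mean of Eq. (1) is computed.
denoised = cv2.fastNlMeansDenoising(img, None, h=10,
                                    templateWindowSize=7, searchWindowSize=21)
cv2.imwrite("mammogram_nlm.png", denoised)
```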

4.1.2 Line removal

Most mammograms of this dataset are surrounded by a thin white border on the top, bottom, left or right side. This border is removed employing binary masking with a rectangular shaped mask. To create the mask, four parameters need to be defined: the start point, the end point, the border color and the border width (see Algorithm 1).

Algorithm 1. Creating rectangular shaped binary mask


1. START
2. Read input image → p (x, y)
3. Convert to grayscale → g (x, y)
4. Define a null array having input image shape → a (0, 0)
5. Define parameter start point, s
6. Define parameter end point, e
7. Define parameter border color, c
8. Define parameter border width, w
9. Create a rectangular mask using s, e, c and w → m (x, y)
10. Merge m (x, y) with g (x, y)
11. Write output image → o (x, y)
12. END

For the mammograms of this dataset, the border to be removed is not very wide and the width of the binary mask is
set to 5 pixels. The border color is set to 0 (black). The resultant rectangular binary mask is merged with the input
image using the bitwise_AND function. The resulting image is shown in Fig. 4.

Fig. 4 Line removed image
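One possible realization of Algorithm 1 with OpenCV is sketched below; it builds a filled rectangle inset by the 5-pixel border width mentioned above and merges it with the image using bitwise_AND. File names are placeholders and the authors' exact masking logic may differ.

```python
import cv2
import numpy as np

img = cv2.imread("mammogram_nlm.png", cv2.IMREAD_GRAYSCALE)  # placeholder input
h, w = img.shape

# Null array with the input image shape (Algorithm 1, step 4).
mask = np.zeros((h, w), dtype=np.uint8)

# Filled white rectangle inset by the border width, leaving a 5-pixel black frame.
border_width = 5
cv2.rectangle(mask,
              (border_width, border_width),            # start point s
              (w - border_width, h - border_width),     # end point e
              color=255, thickness=-1)

# Merging the mask with the image using bitwise AND removes the thin white border.
line_removed = cv2.bitwise_and(img, img, mask=mask)
cv2.imwrite("mammogram_noline.png", line_removed)
```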

4.1.3 Largest contour detection

Contours can be described as a set of pixels containing similar colors or intensity levels [24]. In mammograms, the
breast is the largest object which can be detected and extracted using the largest contour detection technique.
Locating contours works by detecting white or bright objects in a black background. The best outcomes from
contour detection are obtained from binary images which contain only two intensities, black and white. The largest
contour detection method operates by following Algorithm 2:

Algorithm 2: Largest contour detection


1. Read input image → p (x, y)
2. Convert to grayscale → g (x, y)
3. Convert to binary → b (x, y)
4. Define parameter contour retrieval method, m
5. Define parameter contour approximation technique, t
6. Apply findContours() method
7. Get a list of all contours, l
8. Retrieve largest contour from l using the max function, l
9. Draw the outline of l in b (x, y)
10. Merge b (x, y) with g (x, y)
11. Write output image → o (x, y)
12. END

The image is binarized to obtain a binary mask. The algorithm employs two functions, findContours() and max(), where findContours() detects each contour present in the binary image and stores the coordinates in a list. We have set the parameter values of the contour retrieval method and the contour approximation technique to 'RETR_LIST' and 'CHAIN_APPROX_SIMPLE', respectively. The max() function then identifies the largest object in the list. This largest object is marked by an outline created with the drawContours() method, and a binary mask comprising the shape of the largest object is obtained. This mask is merged with the original image, again with the help of the bitwise_AND function. The output of this approach is shown in Fig. 5.

Fig. 5 Largest contour detection output.
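A minimal OpenCV sketch of Algorithm 2 is shown below. The binarization threshold (Otsu) and the use of contour area as the criterion for the max() function are assumptions; the retrieval and approximation parameters are the ones stated above.

```python
import cv2
import numpy as np

gray = cv2.imread("mammogram_noline.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

# Binarize so the breast appears as a bright object on a black background
# (Otsu's threshold is assumed here).
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Detect all contours and keep the largest one (the breast region).
contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
largest = max(contours, key=cv2.contourArea)

# Draw the largest contour as a filled mask and keep only that region.
mask = np.zeros_like(gray)
cv2.drawContours(mask, [largest], -1, color=255, thickness=-1)
breast_only = cv2.bitwise_and(gray, gray, mask=mask)
cv2.imwrite("mammogram_breast.png", breast_only)
```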

4.1.4 BCET

After removing artifacts by extracting only the breast region, the intensity level is enhanced by applying BCET. This technique adjusts the contrast of the image effectively using a parabolic function derived from the input image, so that the contrast is improved without losing the essential information of the input histogram. The maximum value, the minimum value, and the mean value of the output image are set to 255, 0, and 200, respectively. Fig. 6 presents the output image.

Fig. 6 Resulted image after BCET
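The paper does not give the BCET equations; the sketch below uses the commonly cited parabolic formulation y = a(x - b)^2 + c with the output minimum, maximum and mean of 0, 255 and 200 mentioned above. It should be read as an illustration rather than the authors' exact implementation.

```python
import numpy as np

def bcet(img, L=0.0, H=255.0, E=200.0):
    """Balance Contrast Enhancement Technique via a parabolic mapping
    y = a*(x - b)**2 + c; L, H and E are the desired output minimum,
    maximum and mean (0, 255 and 200 in this study)."""
    x = img.astype(np.float64)
    l, h, e = x.min(), x.max(), x.mean()
    s = np.mean(x ** 2)  # mean square of the input image

    b = (h ** 2 * (E - L) - s * (H - L) + l ** 2 * (H - E)) / \
        (2.0 * (h * (E - L) - e * (H - L) + l * (H - E)))
    a = (H - L) / ((h - l) * (h + l - 2.0 * b))
    c = L - a * (l - b) ** 2

    y = a * (x - b) ** 2 + c
    return np.clip(y, 0, 255).astype(np.uint8)

# enhanced = bcet(breast_only)   # applied to the output of the previous step
```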

4.1.5 CLAHE

CLAHE is an advanced variation of Adaptive Histogram Equalization (AHE), which was developed to increase the quality of medical images with complex structures. The algorithm improves local contrast and adjusts the brightness and contrast level, which improves the interpretability of medical images such as mammograms [25]. To apply CLAHE, two parameters are required: clipLimit and tileGridSize. After testing several values, we set clipLimit to 2.0 and tileGridSize to 8 × 8. Applying CLAHE with these parameter values, the ROI becomes more visible (Fig. 7).

Fig. 7 Image after applying CLAHE
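With OpenCV, applying CLAHE with the parameter values reported above amounts to the following sketch (the input file name is a placeholder):

```python
import cv2

img = cv2.imread("mammogram_bcet.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

# CLAHE with clipLimit = 2.0 and tileGridSize = 8 x 8, as stated above.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
enhanced = clahe.apply(img)
cv2.imwrite("mammogram_clahe.png", enhanced)
```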

4.1.6 Heatmap

Heatmaps visualize features and structures of an image using a color palette and show which sections of a picture require the greatest attention. In this study, a heatmap is generated from the output image of CLAHE. The ROI in the resulting heatmap is highlighted and brighter than the other regions (Fig. 8), which can help to classify the images.

Fig. 8 Applying heatmap enhancement method on CLAHE image
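The paper does not state which color palette is used for the heatmap; the sketch below assumes OpenCV's JET colormap applied to the CLAHE output.

```python
import cv2

clahe_img = cv2.imread("mammogram_clahe.png", cv2.IMREAD_GRAYSCALE)  # placeholder input

# Map intensities to a color palette so that bright (suspicious) regions stand out.
heatmap = cv2.applyColorMap(clahe_img, cv2.COLORMAP_JET)
cv2.imwrite("mammogram_heatmap.png", heatmap)
```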

The output of each preprocessing step is shown in Fig. 9.

Fig. 9 Image enhancement methods utilized in this study

It can be observed from Fig. 9 that after applying the algorithms consecutively, the artifacts of the images
are removed, the intensity is enhanced and the important areas of the image are highlighted in a different
color in the heatmap.

5 Dataset split and augmentation

Before proceeding to data augmentation, the original dataset is first split into a training, a validation and a test set using a ratio of 70:20:10. After splitting, we obtain 1010 images in the training set, 287 images in the validation set and 145 images in the test set. The number of images in the training, validation and test sets is shown in Fig. 9, along with the number of images in each class.

Class | Train | Validation | Test
Benign calcification | 292 | 83 | 42
Benign mass | 279 | 79 | 40
Malignant calcification | 210 | 60 | 30
Malignant mass | 229 | 65 | 33
Total | 1010 | 287 | 145

Fig. 9 Distribution of images in each class across the train, validation and test sets.

Only the training set is augmented. The original dataset has an inadequate number of images for all classes, which might cause overfitting issues and reduced classification accuracy. Data augmentation is a strategy in which the number of images is increased by several techniques; enlarging the training dataset can enhance the performance of deep learning models. In this study, seven geometrical augmentation techniques are applied to the preprocessed training images: vertical flip, horizontal flip, horizontal-vertical flip, rotation by 30°, rotation by 30° with horizontal flip, rotation by -30° and rotation by -30° with horizontal flip. This increases the number of training images by a factor of eight. As a result, 8080 training images are obtained after augmentation.
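A minimal sketch of these seven geometric augmentations, using OpenCV, is given below; whether rotation is applied before or after flipping is not stated in the paper and is assumed here.

```python
import cv2

def augment(img):
    """Return the original image plus the seven geometric augmentations described above."""
    def rotate(image, angle):
        h, w = image.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        return cv2.warpAffine(image, m, (w, h))

    h_flip = cv2.flip(img, 1)       # horizontal flip
    v_flip = cv2.flip(img, 0)       # vertical flip
    hv_flip = cv2.flip(img, -1)     # horizontal-vertical flip
    return [img,
            v_flip, h_flip, hv_flip,
            rotate(img, 30), rotate(h_flip, 30),     # +30 deg, +30 deg with horizontal flip
            rotate(img, -30), rotate(h_flip, -30)]   # -30 deg, -30 deg with horizontal flip
```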

6 Proposed model

Mammograms have complex and hidden characteristics in terms of lesion and global information, and achieving accurate classification is quite challenging. Three experiments are conducted to find the best performing deep learning model: (i) transfer learning model selection, (ii) developing four decision level fusion models and (iii) adding an LSTM layer to the four decision level fusion models to obtain the highest performance. All the models are trained using the same hyperparameters and training strategy.

6.1 Training strategy

All models are trained for 100 epochs with the Adam optimizer, a learning rate of 0.001, and a batch size of 16. Categorical cross-entropy is employed as the loss function.
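In Keras this training strategy corresponds to the sketch below, where model and the training/validation arrays are placeholders for the networks and data described in the following subsections.

```python
from tensorflow.keras.optimizers import Adam

# Adam optimizer, learning rate 0.001, categorical cross-entropy loss.
model.compile(optimizer=Adam(learning_rate=0.001),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# 100 epochs with a batch size of 16.
history = model.fit(train_images, train_labels,
                    validation_data=(val_images, val_labels),
                    epochs=100, batch_size=16)
```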

6.2 Transfer learning model selection

Seven transfer learning models, VGG16, Xception, ResNet50, ResNet50V2, MobileNet, MobileNetV2 and InceptionV3, are trained with the preprocessed dataset and their performance in terms of test accuracy is compared. Transfer learning models are widely used in image classification tasks and several breast cancer classification studies have been conducted using these models. As the highest accuracies for our dataset are achieved with the VGG16, Xception, MobileNet, ResNet50V2 and MobileNetV2 networks, these models are selected for our further experiments. The results of all models can be found in Section 7.
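A typical way to set up one of these transfer learning models in Keras is sketched below; the classification head (Flatten followed by a four-unit softmax layer) is an assumption, as the paper does not describe it in detail.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNet  # any of the seven backbones

def build_transfer_model(num_classes=4):
    # ImageNet-pretrained backbone without its original classification head.
    base = MobileNet(include_top=False, weights='imagenet',
                     input_shape=(224, 224, 3))
    x = layers.Flatten()(base.output)
    outputs = layers.Dense(num_classes, activation='softmax')(x)
    return Model(base.input, outputs)

model = build_transfer_model()
```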

6.3 Developing four decision level fusion models

The deep learning decision fusion method aims to combine the decisions of multiple classifiers to achieve a better performance than any individual classifier, as the performance of a single classifier can vary considerably between models for the same classification task. To address this, the concept of decision level fusion was introduced for image classification problems. In the second approach, four architectures are developed using the five transfer learning networks selected in the first approach by (a) combining the two best models, (b) combining the three best models, (c) combining the four best models, and (d) combining the five best models. The four architectures comprise: (a) MobileNet + VGG16, (b) MobileNet + VGG16 + Xception, (c) MobileNet + VGG16 + Xception + ResNet50V2 and (d) MobileNet + VGG16 + Xception + ResNet50V2 + MobileNetV2. The results of these four architectures can be found in Section 7.

6.4 Adding LSTM to the four decision level fusion models

In our last experiment, an LSTM layer is added to each of the four decision level fusion architectures. These four LSTM decision level fusion models are trained with our dataset and the results are compared to identify the best model. The results of these four architectures can be found in Section 7.

6.5 Proposed model architecture

After analyzing the performance of all the models, it is found that the deep learning decision level fusion model MobileNet + VGG16 + Xception equipped with an LSTM layer results in the highest accuracy. MobileNet has previously been used in breast cancer classification using mammograms [26] [27] [28] due to its reasonably small computational complexity and compact size. The architecture contains a total of 28 layers, combining depth-wise separable and point-wise convolutions. The VGG16 model consists of 16 weighted layers (13 convolutional layers and 3 dense layers) along with 5 max-pooling layers. The network has been used in mammography classification because of its relatively shallow architecture and high detection rate. Xception has 71 layers. Like VGG16 and MobileNet, Xception offers computational efficiency and high accuracy [29]. The decision fusion architecture combining these three models is depicted in Fig. 10.

Fig. 10 Training process of the proposed model.

Input images of size 224×224×3 are fed to the ensemble model and shared among the three transfer learning models MobileNet, VGG16 and Xception. The output dense layers of these models are modified by adding an LSTM layer and an additional dropout of 0.5. The flatten layer of each model is wrapped with a time distributed function to help the LSTM layer analyze the feature maps produced by the three models. Decision level fusion of the classification results of the three models is done using an average layer, which provides the ensemble classification output.
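A Keras sketch of this architecture is given below; the number of LSTM units is not reported in the paper and is assumed to be 64, and the exact layer wiring may differ from the authors' implementation. The resulting model is then compiled and trained with the strategy of Section 6.1.

```python
from tensorflow.keras import layers, Model
from tensorflow.keras.applications import MobileNet, VGG16, Xception

def lstm_branch(base, inputs, lstm_units=64, num_classes=4):
    x = base(inputs)                                  # CNN feature map, e.g. (7, 7, C)
    x = layers.TimeDistributed(layers.Flatten())(x)   # rows of the map become time steps
    x = layers.LSTM(lstm_units)(x)
    x = layers.Dropout(0.5)(x)
    return layers.Dense(num_classes, activation='softmax')(x)

inputs = layers.Input(shape=(224, 224, 3))
bases = [MobileNet(include_top=False, weights='imagenet', input_shape=(224, 224, 3)),
         VGG16(include_top=False, weights='imagenet', input_shape=(224, 224, 3)),
         Xception(include_top=False, weights='imagenet', input_shape=(224, 224, 3))]

# Decision level fusion: average the per-branch class probabilities.
branch_outputs = [lstm_branch(b, inputs) for b in bases]
ensemble_output = layers.Average()(branch_outputs)
ensemble = Model(inputs, ensemble_output)
```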

7 Results

To investigate the performance of all the methods, a number of performance metrics such as training accuracy,
training loss, validation accuracy, validation loss, test accuracy, test loss, recall, specificity, precision and F1 scores
are explored. The results are calculated based on the true positive (TP), true negative (TN), false positive (FP) and
false negative (FN) values of the confusion matrix obtained with the model. The equations for the performance
metrics are as follows [30] [31]:

ACC = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2)

Recall = \frac{TP}{TP + FN} \qquad (3)

Specificity = \frac{TN}{TN + FP} \qquad (4)

Precision = \frac{TP}{TP + FP} \qquad (5)

F1 = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \qquad (6)
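From a predicted confusion matrix these metrics can be computed per class as in the sketch below; how the paper aggregates the per-class values (e.g. by macro-averaging) is not stated and is left open here.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def per_class_metrics(y_true, y_pred, n_classes=4):
    """Recall, specificity, precision and F1 per class, following Eqs. (2)-(6)."""
    cm = confusion_matrix(y_true, y_pred, labels=list(range(n_classes)))
    results = {}
    for c in range(n_classes):
        tp = cm[c, c]
        fn = cm[c, :].sum() - tp
        fp = cm[:, c].sum() - tp
        tn = cm.sum() - tp - fn - fp
        recall = tp / (tp + fn)
        specificity = tn / (tn + fp)
        precision = tp / (tp + fp)
        f1 = 2 * precision * recall / (precision + recall)
        results[c] = {"recall": recall, "specificity": specificity,
                      "precision": precision, "f1": f1}
    return results
```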

7.1 Results of transfer learning models

As mentioned before, seven transfer learning models are trained with the processed dataset. The training accuracy, training loss, validation accuracy, validation loss, test accuracy, and test loss are derived for all these models and the performance is compared. Table 3 shows the outcomes obtained with these networks. Here, training accuracy is denoted as Train_Acc, training loss as Train_Loss, validation accuracy as Val_Acc, validation loss as Val_Loss, test accuracy as Test_Acc, and test loss as Test_Loss.

Table 3 Performance of seven transfer learning models.

Model name Train_Acc Train_Loss Val_Acc Val_Loss Test_Acc Test_Loss

ResNet50 73.63% 0.77 71.97% 3.04 71.74% 2.86

ResNet50V2 98.78% 0.16 93.85% 2.23 93.62% 1.12

InceptionV3 78.99% 0.55 78.11% 0.56 77.70% 0.57

MobileNet 97.66% 0.62 91.98% 0.99 93.98% 1.12

VGG16 99.51% 0.07 95.31% 2.47 93.77% 0.07

Xception 97.84% 0.05 94.44% 0.19 92.63% 0.24

MobileNetV2 98.40% 0.11 91.98% 1.58 90.09% 1.77

It can be seen that VGG16, MobileNet, MobileNetV2, Xception and ResNet50V2 yield the highest test accuracies, of 93.77%, 93.98%, 90.09%, 92.63% and 93.62% respectively. The training and validation accuracies of these models indicate that no overfitting occurs during the training phase, as there is only a minimal difference between training and validation accuracy. Therefore, these five models are chosen for the subsequent experiments.

7.2 Results of deep learning decision level fusion approach

The five transfer learning models VGG16, MobileNet, MobileNetV2, Xception and ResNet50V2, which are selected based on their performance, are used to develop four decision level feature fusion architectures. The performance of these four models is evaluated based on the performance metrics described above. Table 4 shows the results obtained with these four fusion models.

Table 4 Performance of various decision level fusion models based on combinations of the five best performing transfer learning models.

Configuration | Model | Train_Acc | Train_Loss | Val_Acc | Val_Loss | Test_Acc | Test_Loss
1 | MobileNet + VGG16 | 97.66% | 0.62 | 91.98% | 0.99 | 93.98% | 1.12
2 | MobileNet + VGG16 + Xception | 99.56% | 0.07 | 96.54% | 1.52 | 95.82% | 0.57
3 | MobileNet + VGG16 + Xception + ResNet50V2 | 97.96% | 0.05 | 94.61% | 0.18 | 94.50% | 0.19
4 | MobileNet + VGG16 + Xception + ResNet50V2 + MobileNetV2 | 97.84% | 0.05 | 94.44% | 0.19 | 92.63% | 0.24

After combining the models using the decision level feature fusion technique, test accuracies above 92% are achieved for all four configurations, see Table 4. The highest test accuracy, of 95.82%, and the lowest test loss, of 0.57, are achieved with the combination of the three models MobileNet, VGG16 and Xception. The other three architectures, MobileNet + VGG16, MobileNet + VGG16 + Xception + ResNet50V2 and MobileNet + VGG16 + Xception + ResNet50V2 + MobileNetV2, perform moderately well, with test accuracies of 93.98%, 94.50% and 92.63%, respectively. The accuracy of the best two fusion models, MobileNet + VGG16 + Xception and MobileNet + VGG16 + Xception + ResNet50V2, is better than the accuracy of the best transfer learning model, MobileNet, while the accuracy of MobileNet + VGG16 is identical to the accuracy of MobileNet alone. The training accuracy and validation accuracy results do not indicate overfitting issues for any of the models.

7.3 Results of deep learning feature fusion with LSTM models

In our last approach an LSTM layer is added to every fusion model to boost the performance. Table 5 lists the
results of the four deep learning feature fusion with LSTM models in terms of training accuracy, training loss,
validation accuracy, validation loss, test accuracy, test loss, precision, recall, specificity, F1 Score and AUC value.
Here ‘Conf1+LSTM’ denotes that an LSTM layer has been added to configuration 1 of Table 4 and so on. Training
accuracy is denoted as T_Acc, training loss as T_Loss, validation accuracy as V_Acc, validation loss as V_Loss, test
accuracy as Te_Acc, and test loss as Te_Loss.

Table 5 Performance of deep learning feature fusion with LSTM models

Model | T_Acc | T_Loss | V_Acc | V_Loss | Te_Acc | Te_Loss | Precision | Recall | Specificity | F1_score | AUC
Conf1+LSTM | 96.05% | 0.06 | 96.23% | 0.15 | 96.86% | 0.34 | 95.00 | 95.28 | 97.26 | 94.44 | 95.51
Conf2+LSTM | 97.65% | 0.23 | 97.88% | 0.20 | 98.31% | 0.14 | 96.85 | 96.77 | 98.36 | 96.82 | 97.43
Conf3+LSTM | 95.46% | 0.57 | 95.86% | 0.78 | 96.08% | 0.38 | 95.35 | 95.89 | 96.88 | 96.31 | 96.47
Conf4+LSTM | 95.16% | 1.24 | 95.48% | 0.84 | 95.84% | 0.41 | 94.29 | 95.56 | 97.04 | 94.79 | 94.66
Conf5+LSTM | 94.61% | 0.67 | 94.83% | 1.34 | 95.16% | 0.86 | 89.47 | 85.33 | 91.61 | 89.47 | 92.73

It can be seen from Table 5 that adding LSTM to the deep learning feature fusion models improves the performance for all networks; the test accuracy increased by 1-3% after adding LSTM for all models. The Conf2 (MobileNet + VGG16 + Xception) with LSTM model outperforms all the other models, achieving a test accuracy of 98.31%, a test loss of 0.14, an F1 score of 96.82%, a precision of 96.85%, a recall of 96.77%, a specificity of 98.36% and an AUC value of 97.43%. The other models had values in the range of 89-95% for precision, 85-95% for recall, 91-97% for specificity, 92-95% for F1 score and 92-95% for AUC. Validation and training accuracies were also improved, with a small gap between validation and training loss. This indicates that no overfitting occurs for any of the architectures.

The confusion matrix for the MobileNet + VGG16 + Xception with LSTM model, the model with the best performance, is generated to show the true and false classification cases for the distinct classes. Fig. 11 illustrates the confusion matrix for the proposed model.

Fig. 11 Confusion matrix for the best performing model

The row values of a confusion matrix indicate the true labels of the test images whereas the column values denote
the predicted labels. Correct predictions are on the diagonal of the confusion matrix. Based on the confusion matrix
values, it can be seen that the model has the best prediction accuracy for classes BM and MC with no cases of
misclassification. For each of the remaining classes (BC and MM), only one image was misclassified. It can be
concluded that the model offers good classification accuracy across all classes. The test accuracy for the individual
classes is shown in Table 6.

Table 6 Class-based accuracy for the optimal model.

Class | Test accuracy
BC | 97.61%
BM | 100%
MC | 100%
MM | 96.96%

For the individual classes, BC, BM, MC and MM, test accuracies of 97.61%, 100%, 100% and 96.96% are obtained,
respectively (see Table 6). There is no indication of bias of the network as it performs well for all classes.

The accuracy and loss curves of the best model are derived to assess overfitting issues. The accuracy and loss curves
are displayed in Fig. 12.

Fig. 12 Accuracy and loss curves for the best model

Both the accuracy and the loss curves show only a minimal gap between training and validation. This indicates that no overfitting occurred while training the model.

Additionally, k-fold cross validation with k values ranging from 3 to 30 is performed for a further assessment of the robustness of the performance and of overfitting issues. The results for these different folds are shown in Fig. 13.

Fig. 13 K folds cross validation with various k values

Accuracies across all folds are close to the highest accuracy, ranging from 98.11% to 98.18%, see Fig. 13. As there is no significant reduction in performance for any of the folds, it can be concluded that the model is robust in classifying mammograms.
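A minimal sketch of this k-fold evaluation is given below; build_model() is assumed to construct and compile the proposed ensemble of Section 6.5, and a single k value is shown for brevity.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(images, labels, k=5, epochs=100, batch_size=16):
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True,
                                     random_state=42).split(images):
        model = build_model()  # assumed to return a compiled model with an accuracy metric
        model.fit(images[train_idx], labels[train_idx],
                  epochs=epochs, batch_size=batch_size, verbose=0)
        _, acc = model.evaluate(images[test_idx], labels[test_idx], verbose=0)
        accuracies.append(acc)
    return np.mean(accuracies), np.std(accuracies)
```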

8 Analysis of inter-class differences

Our proposed model is robust and can classify the mammograms well, as assessed by different performance metrics. However, to get better insight into the differences between the classes, a number of geometric and textural features, such as perimeter-area ratio (PA ratio), solidity, circularity, equivalent diameter (Equiv Diameter), extent, major axis length, minor axis length, mean, standard deviation, Shannon entropy, gray level co-occurrence matrix (GLCM) entropy, skewness and kurtosis [32], are derived for all the classes. For these features, the mean, median, maximum and minimum values are calculated. Table 7 displays the mean, median, maximum and minimum values of the features for the four classes.
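As an illustration, such features could be derived with scikit-image and SciPy as sketched below; the exact feature definitions used in the paper (e.g. PA ratio as perimeter divided by area, circularity as 4πA/P²) and the availability of a binary breast/lesion mask are assumptions.

```python
import numpy as np
from scipy.stats import skew, kurtosis
from skimage.feature import graycomatrix
from skimage.measure import label, regionprops, shannon_entropy

def geometric_features(gray_img, binary_mask):
    """Sketch of the per-image features listed above (assumed definitions)."""
    region = max(regionprops(label(binary_mask)), key=lambda r: r.area)
    pixels = gray_img[binary_mask > 0].astype(np.float64)

    glcm = graycomatrix(gray_img, distances=[1], angles=[0], levels=256,
                        symmetric=True, normed=True)[:, :, 0, 0]
    glcm_entropy = -np.sum(glcm[glcm > 0] * np.log2(glcm[glcm > 0]))

    return {
        "pa_ratio": region.perimeter / region.area,
        "solidity": region.solidity,
        "circularity": 4 * np.pi * region.area / region.perimeter ** 2,
        "equiv_diameter": region.equivalent_diameter,
        "extent": region.extent,
        "major_axis_length": region.major_axis_length,
        "minor_axis_length": region.minor_axis_length,
        "mean": pixels.mean(),
        "std": pixels.std(),
        "shannon_entropy": shannon_entropy(gray_img),
        "glcm_entropy": glcm_entropy,
        "skewness": skew(pixels),
        "kurtosis": kurtosis(pixels),
    }
```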

Table 7 Mean, median, maximum and minimum values of 13 features for the four classes (each cell gives Mean / Median / Max / Min)

Feature | BC | BM | MC | MM
PA ratio | 0.03 / 0.03 / 0.07 / 0.01 | 0.03 / 0.03 / 0.08 / 0.019 | 0.042 / 0.04 / 0.09 / 0.02 | 0.031 / 0.02 / 0.08 / 0.01
Solidity | 0.41 / 0.30 / 11.43 / 0.21 | 0.44 / 0.30 / 6.71 / 0.23 | 0.347 / 0.28 / 3.94 / 0.22 | 0.43 / 0.40 / 1.34 / 0.20
Circularity | 0.69 / 0.73 / 1.04 / 0.13 | 0.73 / 0.76 / 1.01 / 0.15 | 0.73 / 0.78 / 1.04 / 0.11 | 0.70 / 0.74 / 1.35 / 0.17
Equiv Diameter | 149.61 / 149.6 / 230.06 / 65.32 | 153.00 / 153.05 / 230.93 / 50.45 | 119.58 / 117.15 / 202.97 / 53.53 | 170.78 / 177.54 / 246.03 / 54.47
Extent | 0.02 / 0.02 / 0.95 / 0.009 | 0.02 / 0.019 / 0.56 / 0.008 | 0.02 / 0.02 / 0.62 / 0.009 | 0.02 / 0.02 / 0.07 / 0.008
Major axis length | 251.84 / 251.28 / 427.47 / 108.57 | 250.75 / 244.12 / 438.66 / 38.87 | 216.70 / 212.96 / 404.40 / 109.8 | 250.57 / 244.47 / 437.75 / 83.25
Minor axis length | 146.75 / 142.5 / 273.61 / 4.61 | 151.069 / 141.0 / 298.79 / 11.06 | 134.32 / 131.67 / 233.50 / 8.21 | 141.27 / 132.77 / 295.83 / 21.36
Mean | 37.74 / 34.73 / 104.27 / 5.01 | 47.25 / 44.71 / 123.63 / 6.20 | 24.96 / 21.56 / 78.14 / 4.67 | 53.82 / 50.45 / 178.78 / 2.93
Standard Deviation | 53.86 / 53.35 / 95.72 / 21.9 | 63.16 / 63.9 / 104.02 / 25.57 | 47.80 / 46.85 / 76.68 / 20.16 | 53.87 / 54.95 / 91.46 / 15.70
Shannon Entropy | 3.41 / 3.42 / 5.82 / 0.83 | 3.62 / 3.64 / 6.80 / 0.53 | 2.41 / 2.24 / 5.48 / 0.56 | 4.16 / 4.50 / 6.67 / 0.60
GLCM entropy | 61.99 / 61.80 / 110.31 / 13.82 | 66.26 / 65.77 / 131.62 / 8.49 | 43.20 / 39.98 / 102.28 / 9.61 | 76.29 / 82.06 / 124.79 / 9.71
Skewness | 1.18 / 1.14 / 4.73 / 1.19 | 1.14 / 1.04 / 5.67 / 0.90 | 2.05 / 1.98 / 5.43 / -0.01 | 0.75 / 0.58 / 6.04 / -2.69
Kurtosis | 0.48 / -0.32 / 22.38 / 1.89 | 0.69 / -0.63 / 31.24 / -1.87 | 3.85 / 2.35 / 28.41 / -1.88 | 0.91 / -0.82 / 37.71 / -1.84

From Table 7 it can be observed that there are differences in the features of the classes, although some are larger than others. Hence, the features are clustered according to the degree of dissimilarity between the classes. The features are divided into three clusters: high dissimilarity (c-1), moderate dissimilarity (c-2) and low dissimilarity (c-3). The clustering is based on the mean values of the features for each class. The step-by-step process is given in Algorithm 3, and a minimal Python sketch of the same procedure follows the algorithm.

Algorithm 3: Step-by-step process of feature clustering

1. START
2. Calculate the mean value of each feature for the four classes → m1, m2, m3 and m4
3. FOR feature-1:
4. Find the lowest of m1, m2, m3 and m4 → min1
5. Find the highest of m1, m2, m3 and m4 → max1
6. Calculate the difference between max1 and min1 → md1 = max1 - min1
7. IF md1 > 10:
8. Assign feature-1 to c-1
9. ELSE IF md1 > 1:
10. Assign feature-1 to c-2
11. ELSE:
12. Assign feature-1 to c-3
13. Repeat the process for the rest of the features
14. END
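A minimal Python sketch of Algorithm 3, assuming the per-class mean values of each feature are available in a dictionary, is given below:

```python
def cluster_features(class_means):
    """class_means: dict mapping feature name -> list of the four per-class mean values."""
    clusters = {"c-1": [], "c-2": [], "c-3": []}
    for feature, means in class_means.items():
        diff = max(means) - min(means)       # spread of the four class means
        if diff > 10:
            clusters["c-1"].append(feature)  # high dissimilarity
        elif diff > 1:
            clusters["c-2"].append(feature)  # moderate dissimilarity
        else:
            clusters["c-3"].append(feature)  # low dissimilarity
    return clusters
```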

In cluster 1, the difference between the highest and the lowest class mean of a feature is larger than 10; in cluster 2, the difference is larger than 1; the remaining features are in cluster 3. Table 8 shows the results of feature clustering with respect to the difference between the highest and the lowest mean value.

Table 8 Clustering approach based on mean values.

No. | Feature | Difference (highest - lowest mean) | Cluster
1. | PA ratio | 0.01 | c-3 (low difference)
2. | Solidity | 0.1 | c-3 (low difference)
3. | Circularity | 0.1 | c-3 (low difference)
4. | Equivalent Diameter | 51 | c-1 (high difference)
5. | Extent | 0.006 | c-3 (low difference)
6. | Major axis length | 33 | c-1 (high difference)
7. | Minor axis length | 17 | c-1 (high difference)
8. | Mean | 29 | c-1 (high difference)
9. | Standard Deviation | 46 | c-1 (high difference)
10. | Shannon Entropy | 3 | c-2 (moderate difference)
11. | GLCM entropy | 34 | c-1 (high difference)
12. | Skewness | 1.25 | c-2 (moderate difference)
13. | Kurtosis | 0-4 | c-2 (moderate difference)

For a better understanding, three bar charts are presented in Fig. 13 based on the outcomes of Table 8.

Fig. 13 Bar charts of feature clustering

9 Conclusion
This study presents a multi-class classification approach for breast cancer, using preprocessed and augmented mammography images. A deep learning decision level fusion technique is introduced in this paper, as an improved performance may be achieved by combining the decisions of multiple classifiers. Several transfer learning models are evaluated with the dataset, and the five models with the highest test accuracy are selected and combined into four fusion models. To each of these models an LSTM layer is added, and the models are again trained with the dataset. It is found that the MobileNet + VGG16 + Xception with LSTM decision fusion deep learning architecture outperforms the other models, with a test accuracy of 98.31%. To assess the robustness and performance consistency of the model, single-class accuracy, k-fold cross validation and loss and accuracy curves are explored. To get more insight into the differences between the classes, sixteen geometric features are extracted from the dataset and a clustering approach is presented.

References

1. Torre LA, Bray F, Siegel RL, et al (2015) Global cancer statistics, 2012. CA Cancer J Clin 65:87–108. https://doi.org/10.3322/caac.21262

2. Barbieri RL (2019) Breast. Yen & Jaffe's Reproductive Endocrinology: Physiology, Pathophysiology, and Clinical Management: Eighth Edition 419:248-255.e3. https://doi.org/10.1016/B978-0-323-47912-7.00010-X

3. Ragab DA, Sharkas M, Marshall S, Ren J (2019) Breast cancer detection using deep convolutional neural networks and support vector machines. PeerJ 2019. https://doi.org/10.7717/peerj.6201

4. Labrèche F, Goldberg MS, Hashim D, Weiderpass E (2020) Breast cancer. Occupational Cancers 417–438. https://doi.org/10.1007/978-3-030-30766-0_24

5. Arevalo J, González FA, Ramos-Pollán R, et al (2016) Representation learning for mammography mass lesion classification with convolutional neural networks. Comput Methods Programs Biomed 127:248–257. https://doi.org/10.1016/j.cmpb.2015.12.014

6. Kaur P, Singh G, Kaur P (2019) Intellectual detection and validation of automated mammogram breast cancer images by multi-class SVM using deep learning classification. Inform Med Unlocked 16:100151. https://doi.org/10.1016/j.imu.2019.01.001

7. Rangayyan RM, Banik S, Desautels JEL (2010) Computer-aided detection of architectural distortion in prior mammograms of interval cancer. J Digit Imaging 23:611–631. https://doi.org/10.1007/s10278-009-9257-x

8. Suh YJ, Jung J, Cho BJ (2020) Automated breast cancer detection in digital mammograms of various densities via deep learning. J Pers Med 10:1–11. https://doi.org/10.3390/jpm10040211

9. Ribli D, Horváth A, Unger Z, et al (2018) Detecting and classifying lesions in mammograms with Deep Learning. Sci Rep 8. https://doi.org/10.1038/s41598-018-22437-z

10. Daniel López-Cabrera J, Alberto López Rodriguez L, Pérez-Díaz M (2020) Classification of breast cancer from digital mammography using deep learning. Inteligencia Artificial 23:56–66. https://doi.org/10.4114/intartif.vol23iss65pp56-66

11. Ozsahin I, Sekeroglu B, Musa MS, et al (2020) Review on Diagnosis of COVID-19 from Chest CT Images Using Artificial Intelligence. Comput Math Methods Med 2020. https://doi.org/10.1155/2020/9756518

12. Kaur P, Singh G, Kaur P (2019) Intellectual detection and validation of automated mammogram breast cancer images by multi-class SVM using deep learning classification. Inform Med Unlocked 16:100151. https://doi.org/10.1016/j.imu.2019.01.001

13. Lotter W, Sorensen G, Cox D (2017) A multi-scale CNN and curriculum learning strategy for mammogram classification. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 10553 LNCS:169–177. https://doi.org/10.1007/978-3-319-67558-9_20

14. Khamparia A, Bharati S, Podder P, et al (2021) Diagnosis of breast cancer based on modern mammography using hybrid transfer learning. Multidimens Syst Signal Process 32:747–765. https://doi.org/10.1007/s11045-020-00756-7

15. Ragab DA, Attallah O, Sharkas M, et al (2021) A framework for breast cancer classification using Multi-DCNNs. Comput Biol Med 131:104245. https://doi.org/10.1016/j.compbiomed.2021.104245

16. Ting FF, Tan YJ, Sim KS (2019) Convolutional neural network improvement for breast cancer classification. Expert Syst Appl 120:103–115. https://doi.org/10.1016/j.eswa.2018.11.008

17. Agnes SA, Anitha J, Pandian SIA, Peter JD (2020) Classification of Mammogram Images Using Multiscale all Convolutional Neural Network (MA-CNN). J Med Syst 44. https://doi.org/10.1007/s10916-019-1494-z

18. Nasir Khan H, Shahid AR, Raza B, et al (2019) Multi-View Feature Fusion Based Four Views Model for Mammogram Classification Using Convolutional Neural Network. IEEE Access 7:165724–165733. https://doi.org/10.1109/ACCESS.2019.2953318

19. Beura S, Majhi B, Dash R (2015) Mammogram classification using two dimensional discrete wavelet transform and gray-level co-occurrence matrix for detection of breast cancer. Neurocomputing 154:1–14. https://doi.org/10.1016/j.neucom.2014.12.032

20. Wang X, Liang G, Zhang Y, et al (2020) Inconsistent Performance of Deep Learning Models on Mammogram Classification. Journal of the American College of Radiology 17:796–803. https://doi.org/10.1016/j.jacr.2020.01.006

21. Jadoon MM, Zhang Q, Haq IU, et al (2017) Three-Class Mammogram Classification Based on Descriptive CNN Features. Biomed Res Int 2017. https://doi.org/10.1155/2017/3640901

22. Irmak E (2021) Multi-Classification of Brain Tumor MRI Images Using Deep Convolutional Neural Network with Fully Optimized Framework. Iranian Journal of Science and Technology - Transactions of Electrical Engineering 45:1015–1036. https://doi.org/10.1007/s40998-021-00426-9

23. Buades A, Coll B, Morel J-M (2011) Non-Local Means Denoising. Image Processing On Line 1:208–212. https://doi.org/10.5201/ipol.2011.bcm_nlm

24. Montaha S, Azam S, Rakibul Haque Rafid AKM, et al (2022) A shallow deep learning approach to classify skin cancer using down-scaling method to minimize time and space complexity

25. Montaha S, Azam S, Kalam A, et al (2021) BreastNet18: A High Accuracy Fine-Tuned VGG16 Model Evaluated Using Ablation Study for Diagnosing Breast Cancer from Enhanced Mammography Images

26. Falconi LG, Perez M, Aguilar WG (2019) Transfer Learning in Breast Mammogram Abnormalities Classification with Mobilenet and Nasnet. International Conference on Systems, Signals, and Image Processing 2019-June:109–114. https://doi.org/10.1109/IWSSIP.2019.8787295

27. Xie L, Zhang L, Hu T, et al (2020) Neural networks model based on an automated multi-scale method for mammogram classification. Knowl Based Syst 208:106465. https://doi.org/10.1016/j.knosys.2020.106465

28. Jahangeer GSB, Rajkumar TD (2021) Early detection of breast cancer using hybrid of series network and VGG-16. Multimed Tools Appl 80:7853–7886. https://doi.org/10.1007/s11042-020-09914-2

29. Yadavendra, Chand S (2020) A comparative study of breast cancer tumor classification by classical machine learning methods and deep learning method. Mach Vis Appl 31:1–10. https://doi.org/10.1007/s00138-020-01094-1

30. Montaha S, Azam S, Rafid AKMRH, et al (2022) TimeDistributed-CNN-LSTM: A Hybrid Approach Combining CNN and LSTM to Classify Brain Tumor on 3D MRI Scans Performing Ablation Study. IEEE Access 10:60039–60059. https://doi.org/10.1109/ACCESS.2022.3179577

31. Montaha S, Azam S, Rafid AKMRH, et al (2022) MNet-10: A robust shallow convolutional neural network model performing ablation study on medical images assessing the effectiveness of applying optimal data augmentation technique. Front Med (Lausanne) 9:2346. https://doi.org/10.3389/FMED.2022.924979/BIBTEX

32. Rafid AKMRH, Azam S, Montaha S, et al (2022) An Effective Ensemble Machine Learning Approach to Classify Breast Cancer Based on Feature Selection and Lesion Segmentation Using Preprocessed Mammograms. Biology (Basel) 11:1654. https://doi.org/10.3390/biology11111654
