

Improved Breast Cancer Classification Through Combining Graph Convolutional Network and Convolutional Neural Network

Article in Information Processing and Management · January 2021

DOI: 10.1016/j.ipm.2020.102439


All content following this page was uploaded by Yu-Dong Zhang on 18 November 2020.



Yu-Dong Zhang1,2,#, Suresh Chandra Satapathy3,#, David S Guttery4,*, Juan Manuel Górriz5,*, Shui-Hua Wang2,6,*,

1. School of Informatics, University of Leicester, Leicester, LE1 7RH, UK

2. Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

3. School of Computer Engg, KIIT Deemed to University, Bhubaneswar, India

4. Leicester Cancer Research Center, University of Leicester, Leicester, LE2 7LX, UK

5. Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain

6. School of Architecture Building and Civil engineering, Loughborough University, Loughborough, LE11 3TU, UK

# Yu-Dong Zhang & Suresh Chandra Satapathy contributed equally to this paper, and should be regarded as co-first
authors.
* Correspondence should be addressed to Shui-Hua Wang, Juan Manuel Gorriz, and David S Guttery

Abstract: (Aim) In a pilot study to improve detection of malignant lesions in breast mammograms, we aimed to
develop a new method called BDR-CNN-GCN, combining two advanced neural networks: (i) graph convolutional
network (GCN); and (ii) convolutional neural network (CNN). (Method) We utilised a standard 8-layer CNN, then
integrated two improvement techniques: (i) batch normalization (BN) and (ii) dropout (DO). Finally, we utilized rank-
based stochastic pooling (RSP) to substitute the traditional max pooling. This resulted in BDR-CNN, which is a
combination of CNN, BN, DO, and RSP. This BDR-CNN was hybridized with a two-layer GCN, and yielded our
BDR-CNN-GCN model which was then utilized for analysis of breast mammograms as a 14-way data augmentation
method. (Results) As proof of concept, we ran our BDR-CNN-GCN algorithm 10 times on the breast mini-MIAS
dataset (containing 322 mammographic images), achieving a sensitivity of 96.20±2.90%, a specificity of 96.00±2.31%
and an accuracy of 96.10±1.60%. (Conclusion) Our BDR-CNN-GCN showed improved performance compared to
five proposed neural network models and 15 state-of-the-art breast cancer detection approaches, proving to be an
effective method for data augmentation and improved detection of malignant breast masses.
Keywords: convolutional neural network; graph convolutional network; breast cancer classification; mammogram;
artificial intelligence; deep learning; rank-based stochastic pooling; data augmentation;


1 Introduction

Breast cancer commonly presents as a solid mass that could be considered (by inspection, palpation or
radiologically) as being different from the surrounding tissue [1], with subsequent confirmation using tissue biopsy.
There are several classifications used in breast cancer grading systems, the most common of which is the TNM system.
TNM (T: tumor, N: lymph node, M: metastasis) [2] stratifies breast cancers into five stages. Stage 0 is a pre-cancerous
condition, such as lobular carcinoma in situ (LCIS) or ductal carcinoma in situ (DCIS). Stages 1 and 2 are invasive
tumors that are still confined within the breast or have only extended to nearby sentinel lymph nodes. Stage 3
represents breast cancers that have extended beyond the immediate tumor region and invaded nearby lymph nodes
and muscles. Stage 4 is metastatic cancer, which has spread beyond the breast and neighboring lymph nodes to
distant regions of the body.
Digital mammography is a non-invasive method to detect the earliest stages of breast cancer development [3].
On mammograms, dense breast tissues (DBTs) and breast masses/tumors appear as increased radiological densities
(seen as white), potentially complicating detection of malignant breast masses due to overlay with DBTs. Hence,
mammography is beset by problems with false positives, false negatives, and overdiagnosis.
Recently, artificial intelligence (AI) approaches have been utilized to aid radiologists through quicker and
improved detection of breast cancer using mammography, since AI approaches interrogate mammograms on a pixel
level and have spatially long-range memory [4] (i.e., radiologists may focus on local or isolated regions, whereas AI
analyzes the mammogram globally).
Numerous studies have investigated AI methods for improving mammographic detection of breast cancer.
Techniques used in previous studies include biogeography-based optimization (BBO), wavelet energy entropy (WEE),
cross validation (CV), the k-nearest neighbor (kNN) algorithm, particle swarm optimization (PSO), fractional Fourier
transform (FrFT), support vector machine (SVM), and decision tree (DT). For example, Milosevic, et al. [5] tested
different classifiers, including SVM, Naive Bayes and kNN classifiers. To select the best classifier, five-fold CV and
receiver operating characteristic analysis were carried out. The authors found the best result was achieved by SVM.
Nakamura [6] presented a hybridization of BBO and PSO (termed HBP). In simulation, its sensitivity, specificity,
and accuracy were all above 85%. Gorgel, et al. [7] mixed spherical wavelet and SVM (termed SWSVM), showing
improved accuracy compared to the ordinary discrete wavelet transform, which had an accuracy of 83.3%. Liu [8]
introduced a weighted FrFT (WFrFT), with subsequent kNN used as the classifier. The advantage of using WFrFT
was that the analysis could be carried out on a unified time–frequency spectrum. Yang, et al. [9] combined thin-plate
spline (TS) with maximum intensity projection (MIP) approaches. Using MRI assessment with information of
mammography for diagnosis, their method achieved a sensitivity of 91.9±2.3%, a specificity of 70.0±4.7%, and an
accuracy of 84.8±3.1%. Rao [10] proposed using a Jaya approach for abnormal breast cancer detection, showing Jaya
was more effective than other training algorithms, such as back propagation, momentum back propagation, genetic
algorithm, simulated annealing, and PSO. Wu [11] offered a novel improved BBO (IBBO) approach by implementing
chaotic adaptive real-coded mechanisms to improve traditional biogeography-based optimization. Chen [12] proposed
a new feature, WEE, with linear regression used for classification, and achieved an accuracy of 91.85 ± 2.21%. Liu
and Brown [13] combined SVM with Daubechies (db) wavelet (SVM-db), reporting that their average sensitivity,
average specificity, and average accuracy were all above 82%. Guo [14] used wavelet energy (WE) to detect abnormal
breasts. The sensitivity, specificity, and accuracy of their method were all above 81%, and the authors stated that their
method could improve the accuracy of intelligent breast cancer diagnosis. Ref. [15] proposed a hybrid model of radial
basis function network and DT, termed HRD. Their method was compared with three common algorithms, viz., kNN,
the Naive Bayes algorithm and SVM. The authors found their proposed method yielded a high accuracy. Yu [16]
presented a semi-automatic system to classify mammograms into abnormality and normality. This study employed
DenseNet201 (DN-201) as a transfer learning technique, achieving a highest accuracy of 92.73%. Samala, et al. [17]
simulated a dataset including noisy labels or corrupted data, and compared AlexNet (AN) with GoogleNet (GN) for
breast cancer diagnosis. The balance between memorization and learning of their networks was controlled by varying
the ratio of noisy training samples. Pan [18] used a rank-based stochastic pooling neural network (RSPNN) to detect
diseased breasts. In their paper, the authors compared three activation functions and six pooling techniques. Their best
model achieved an accuracy of 94.0%. In addition to these studies, there are also numerous studies on information
embedding [19-24].
Recently, researchers [17, 18] have focused their attention on mammogram-based breast cancer detection
methods using deep learning (DL) approaches to extract features of individual images automatically, but they do not
learn image-level relationships. Based on this, we propose that DL methods currently available could be improved
using additional feature engineering. There are other publications [25-27] using DL to analyze biopsy images; however,
this was not the focus of our study, which only focuses on mammogram images.
The inspiration is to not only learn the image-level representation automatically, but also the relation-aware
representation (RAR) [28] to more accurately detect abnormal masses using mammography. RAR is a method which
utilizes the relationships between data points obtained from a cohort as a whole to unbiasedly improve decisions when
analyzing each data point individually. For example, RAR will determine the relationships among breast
mammograms from an entire cohort of images to unbiasedly determine abnormal masses in each individual image.
Therefore, the AI system will generate more accurate results if considering the relationships among the input
mammogram images.
A graph is a tool to describe the above “relationship” if we treat each image as a “node”. The graph convolutional
network (GCN) [29] is a recently proposed AI framework that is capable of learning over graph structure and
node features; thus, it can learn RARs of nodes.
The idea of this work is to use a traditional convolutional neural network (CNN) to learn image-level features, and
to use a GCN to learn relation-aware representation features. The hypothesis is that this combination provides
better performance than either network working alone. The aim of this paper is to present a novel CNN and GCN
combination network to make more accurate diagnoses using breast mammograms.
The contributions of this research are as follows: (i) First, we designed a base network (Net-0), and then
added two improvement techniques to obtain an improved AI model, Net-1. (ii) Next, we developed Net-2 by utilizing
rank-based stochastic pooling (RSP) to substitute conventional pooling in Net-1. (iii) Net-0, Net-1 and Net-2 were
combined with our proposed 2-layer GCN, and we obtained Net-3, Net-4, and Net-5, respectively. (iv) Further
experiments showed that Net-5 gives the best results amongst all six proposed networks. In addition, Net-5 was
superior to state-of-the-art approaches.

2 Breast Dataset

2.1 Aim & Dataset

We chose the mini-MIAS [30] dataset, which comprises 322 single-breast mammogram slices, all of which are
1,024 × 1,024 in size. Among the 322 images, 113 are abnormal and 209 are normal. This dataset was chosen because
numerous previous studies have utilized it, ensuring robust comparison between algorithms.
In this dataset abnormal breasts are stratified into six categories, which are presented in Figure 1. The task was
not to predict the 6 abnormal types, but instead combine all 6 abnormal categories as one category termed “abnormal”;
thus, the aim was to detect abnormality in each mammogram image.

Figure 1: Examples of the six mammographic abnormal breast types. (a) Circumscribed mass, a mass where the
contour is clearly defined along at least 75% of its surface; (b) Asymmetry, a spectrum of morphological descriptors
for a unilateral fibroglandular-density finding; (c) Architectural distortion, a region where the breast's normal
appearance is replaced by an abnormal arrangement of tissue strands; (d) Spiculated masses, representing sharp-pointed
barbed tissues; (e) Ill-defined masses, i.e., indistinct; (f) Calcification, small deposits of calcium seen as bright white
specks or dots on the soft tissue background of the breasts.

Note that a circumscribed mass is one where the contour is clearly defined along at least 75% of its surface. The
other 25% may be masked by the adjacent gland. Asymmetry denotes a spectrum of morphological descriptors for a
unilateral fibroglandular-density finding seen on one or more mammographic projections which does not meet the
standard for being a mass. Architectural distortion shows a region where the breast otherwise appears normal, but
shows abnormal arrangements of tissue strands. Spiculated masses (SMs; see Table 10 for a list of abbreviations)
represent sharp-pointed barbed tissues. These spiky tumors have elongated pieces of tissue (spicules) extruding from
the perimeter. SMs occur on the border of the breast, not the middle areas. Ill-defined masses are indistinct masses.
Calcifications (i.e., small deposits of calcium in the breast) are evident in mammograms as bright white specks or
dots on the soft tissue background of the breasts.

2.2 Cost-sensitive Learning

The whole image set $D$ contains 113 abnormal (A) breast images and 209 normal (N) mammogram images (MIs).
Hence, $D = D(A) + D(N)$, where $|D(A)| = 113$ and $|D(N)| = 209$. Dataset $D$ is divided randomly into training
set $B$ and test set $C$. Due to the imbalanced properties of the mini-MIAS dataset, the test set $C$ is composed of
fifty abnormal and fifty normal MIs, i.e., $|C(A)| = |C(N)| = 50$. The training set $B$ is composed of the remaining
63 abnormal and 159 normal MIs:
$$|B(A)| = 63, \quad |B(N)| = 159 \tag{1}$$
where $A$ and $N$ mean abnormal and normal, respectively.
The imbalance of the training set ($|B(N)| \gg |B(A)|$) can be addressed by numerous techniques. One
effective strategy is cost-sensitive learning (CSL). CSL biases the model by giving extra cost $e$ on the minority class
A. The cost matrix $E_{cost}$ is expanded to:

$$E_{cost} = \begin{bmatrix} 0 & e_{12} \\ e_{21} & 0 \end{bmatrix} \tag{2}$$

where $e_{21}$ represents the cost of misclassifying $N$ as $A$, and is set to 1. Assuming $A$ is positive and $N$
negative,
$$e_{21} = \operatorname{cost}(N \mapsto A) \overset{\text{def}}{=} 1 \tag{3}$$
$e_{12}$ represents the cost of misclassifying $A$ as $N$ [31], and its value equals the number of normal
training MIs divided by the number of abnormal training MIs, viz.,
$$e_{12} = \operatorname{cost}(A \mapsto N) = \frac{|B(N)|}{|B(A)|} \tag{4}$$
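
As a minimal sketch of Eqs. (2)-(4), the cost matrix can be built directly from the class counts of the training set in Eq. (1). The function name `cost_matrix` and the index convention (0 for abnormal, 1 for normal) are illustrative assumptions and not part of the original method description.

```python
def cost_matrix(n_normal, n_abnormal):
    """Build the 2x2 cost matrix of Eq. (2).

    Index 0 is the abnormal class A (positive), index 1 the normal class N.
    Entry [i][j] is the cost of predicting class j for a sample of true class i.
    """
    e21 = 1.0                      # cost(N -> A), fixed to 1 as in Eq. (3)
    e12 = n_normal / n_abnormal    # cost(A -> N), class ratio as in Eq. (4)
    return [[0.0, e12],
            [e21, 0.0]]


# Training-set sizes from Eq. (1): |B(A)| = 63, |B(N)| = 159
E_cost = cost_matrix(n_normal=159, n_abnormal=63)
print(round(E_cost[0][1], 3))  # 2.524
```

With these counts, missing an abnormal image costs roughly 2.5 times as much as raising a false alarm on a normal one.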

2.3 Preprocessing

The raw images contain noise and other unwanted content; thus, preprocessing aims to generate the region of
interest (ROI) of the breast itself. Figure 2 shows the seven-step preprocessing workflow. For clarity, let
$D_0 \leftarrow D$, and consider a given original image $d_0(k) \in D_0$, $k = 1, 2, \cdots, 322$.
In step 1, a median filter (MF) was utilized to get rid of additive noise (AN). The filter window is set to $3 \times 3$.
$$d_1(k) = \operatorname{MF}\{d_0(k), [3 \times 3]\} \tag{5}$$

[Figure 2 (flowchart). Left: original mammogram image d0(k) → Step 1 AN reduction → Step 2 MN reduction → Step 3 CLAHE → Step 4 BG removal → Step 5 PEM removal → Step 6 SSBC → Step 7 downsampling → preprocessed image d7(k). Right, detail of Step 2: d1(k) → ln → a → F → A → P → B → F⁻¹ → b → exp → d2(k).]

Figure 2: Preprocessing workflow of the mini-MIAS dataset: The original mammogram image 𝒅𝟎 (𝒌) passes
through seven steps: (1) AN reduction; (2) MN reduction; (3) CLAHE; (4) BG removal; (5) PEM removal; (6) SSBC;
and (7) Downsampling, to obtain the preprocessed image 𝒅𝟕 (𝒌) . MN: multiplicative noise; AN: additive noise;
CLAHE: contrast-limited adaptive histogram equalization; BG: background; PEM: pectoral muscle; SSBC: squared-
sized and breast-centered; a, b, A, and B serve as transient variables

In step 2, homomorphic filtering (HF) was utilized to remove multiplicative noise (MN). The right side of
Figure 2 presents the workflow of HF. The denoised image $d_2(k)$ is
$$d_2(k) = \exp\{F^{-1}[P(F\{\ln[d_1(k)]\})]\} \tag{6}$$
where $d_1(k)$ and $d_2(k)$ represent the input and output image of step 2, respectively. The operations $\exp()$
and $\ln()$ stand for the exponential and logarithmic functions, respectively. $F$ is the discrete Fourier transform
(DFT) operation and $F^{-1}$ the inverse DFT (IDFT) operation. $P$ is the filter function:

$$P(u) = (\eta_H - \eta_L)\left\{1 - \exp\left[-s\left(\frac{D(u)}{D_0}\right)^2\right]\right\} + \eta_L \tag{7}$$

where $s$ is the slope parameter, $u$ is a point in the frequency domain (FD), $D(u)$ the distance from point $u$ to
the origin in the FD, $D_0$ a predefined distance, $\eta_L = 0.5$, and $\eta_H = 2$. Expanding Eq. (6) step by step,
we have
$$\begin{cases} a = \ln[d_1(k)] \\ A = F(a) \\ B = P(A) \\ b = F^{-1}(B) \\ d_2(k) = \exp(b) \end{cases} \tag{8}$$
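
To illustrate the behaviour of the filter function in Eq. (7), the sketch below evaluates the gain at a given distance from the frequency-domain origin. Only $\eta_L = 0.5$ and $\eta_H = 2$ come from the text; the function name `hf_transfer` and the values `s = 1.0` and `d0 = 30.0` are placeholders chosen for illustration.

```python
import math


def hf_transfer(dist, d0, s=1.0, eta_l=0.5, eta_h=2.0):
    """Homomorphic filter gain P(u) of Eq. (7) at distance `dist` from the origin."""
    return (eta_h - eta_l) * (1.0 - math.exp(-s * (dist / d0) ** 2)) + eta_l


low = hf_transfer(0.0, d0=30.0)    # gain at the zero-frequency origin
high = hf_transfer(1e6, d0=30.0)   # gain far away from the origin
print(low, high)  # 0.5 2.0
```

The gain rises monotonically from $\eta_L$ at the origin to $\eta_H$ at high frequencies, so slowly varying components are attenuated while fine detail is amplified.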
In step 3, to equalize the images’ histogram, contrast-limited adaptive histogram equalization (CLAHE) was
carried out on 𝑑2 (𝑘).
𝑑3 (𝑘) = 𝐸𝐶𝐿 [𝑑2 (𝑘)] (9)
where 𝐸𝐶𝐿 represents the CLAHE method.
In step 4, the background (BG) was taken away using region-growing (RG) approach.
𝑑4 (𝑘) = RG[𝑑3 (𝑘)] (10)
In step 5, pectoral muscle (PEM) areas were removed using the thresholding 𝒯 technique.
𝑑5 (𝑘) = 𝒯[𝑑4 (𝑘)] (11)
In step 6, we cropped the longer edge to make each image squared-sized and breast-centered (SSBC).
𝑑6 (𝑘) = SSBC[𝑑5 (𝑘)] (12)
Finally, in step 7, down-sampling (DS) was utilized to resize the image from the previous step, $d_6(k)$, to a
smaller one, $d_7(k)$. Its size is set to $O_7 \times O_7$, with DS reducing redundant information and easing
subsequent classification tasks.
$$d_7(k) = \operatorname{DS}\{d_6(k), [O_7 \times O_7]\} \tag{13}$$
where $O_7$ is the output size at step 7. We tested different sizes in the grid $O_7 \in \{64, 128, 256, 512\}$, and
found the optimal value to be $O_7^* = 256$. The reason is that images that are too small, such as $O_7 = 64$ or
$128$, may not contain sufficient information, whereas images that are too large, such as $O_7 = 512$, result in
overfitting by our classifier.
After completing all seven steps, the ROI 𝑑7 (𝑘) was segmented from the raw mammogram image 𝑑0 (𝑘). All
the images were collated and formed a new dataset 𝐷7 = {𝑑7 (𝑘)}, 𝑘 = 1,2, ⋯ ,322, and 𝐷7 was assigned to replace
the original dataset 𝐷 ← 𝐷7 .
Quality control during preprocessing was mostly performed by algorithm result comparison. Subsequently, trial
and error was used for determining whether preprocessing steps are added or removed, and deciding the optimal
parameters. To do this, we analyzed a small subset from the whole dataset, and checked which combinations of
preprocessing steps can help improve performance.

3 Methodology

In total, we defined six networks in this study:

1 A base network (Net-0, termed the base CNN), which is an 8-layer convolutional neural network (CNN)
consisting of 6 conv layers and 2 fully-connected layers, was first designed.
2 We added two improvement techniques, batch normalization and dropout, yielding the improved AI
model Net-1, termed “BD-CNN”.
3 Net-2, termed BDR-CNN, was obtained by utilizing rank-based stochastic pooling (RSP) to
substitute conventional max pooling in Net-1.
4 Net-0, Net-1 and Net-2 were combined with our proposed 2-layer graph convolutional network (GCN) to
obtain Net-3, Net-4, and Net-5, respectively. The names of Net-3, Net-4, and Net-5 are CNN-GCN,
BD-CNN-GCN, and BDR-CNN-GCN, respectively.
Table 1 details the six proposed networks and Figure 3 shows their relationships.

Table 1: The six proposed networks used in our study

Index   Inheritance    Short Name     Description
Net-0   Base Network   CNN            8-layer CNN (6 conv layers and 2 fully-connected layers)
Net-1   Net-0+BN+DO    BD-CNN         Add BN and dropout to Net-0 (add BN to each conv block, and add DO to each fully-connected block)
Net-2   Net-1+RSP      BDR-CNN        Use RSP to replace MP in Net-1
Net-3   Net-0+GCN      CNN-GCN        Add GCN to Net-0
Net-4   Net-1+GCN      BD-CNN-GCN     Add GCN to Net-1
Net-5   Net-2+GCN      BDR-CNN-GCN    Add GCN to Net-2
(BN: Batch normalization; DO: Dropout; RSP: rank-based stochastic pooling; GCN: graph convolutional network;
CNN: convolutional neural network; MP: max pooling)


Figure 3: Relationship between the six proposed networks. Net-0 is the base network; Adding BN & DO to Net-0
generates Net-1; Replacing MP with RSP generates Net-2; Adding GCN to Net-(0-2) generate Net-(3-5); BN: batch
normalization; DO: dropout; MP: max pooling; RSP: rank-based stochastic pooling; GCN: graph convolutional
network.

3.1 Basics of CNN

Among recent neural network (NN) and deep learning (DL) techniques, the convolutional neural network (CNN) [32]
is particularly suitable for handling two-dimensional images. A CNN comprises conv layers (CLs), pooling layers
(PLs), and fully connected layers (FCLs). CNNs show improved performance against traditional AI methods (e.g.,
SVM, DT, naive Bayesian classifier, etc.), because CNNs learn features from the data during training and therefore
significantly reduce the time needed for feature engineering design, i.e., selecting the most distinguishing
features/biomarkers.
The most important procedure in a CNN is convolution, and thus the most important layer is the CL, which
carries out the 2D convolution of the input and the kernels during the forward pass. The weights of the kernels in
each CL are initialized randomly, and are updated at each iteration of network training according to the loss function.
As a result, the final learnt kernels may detect some types of patterns within the input images.

Figure 4: Illustration of conv layer. Conv-in-Run means the convolution is running, i.e., the kernels are moving
across the input, and therefore several steps will be carried out before a complete convolution is performed.

Figure 4 displays the three steps within a CL: (i) convolution; (ii) stack; (iii) nonlinear activation function
(NLAF). Mathematically, suppose an input matrix $X$ and an output $O$ of the CL, and suppose there exists a set of
kernels $F_j$, $\forall j \in [1, \cdots, J]$; then the convolution output $C(j)$ after step 1 is defined as
$$C(j) = X \otimes F_j, \quad \forall j \in [1, \cdots, J] \tag{14}$$
where $\otimes$ denotes the convolution operation, i.e., the dot product of the filter with each input patch it covers.
Second, all $C(j)$ activation maps are piled to form a new 3D activation map
$$D = \mathcal{S}(C(1), \cdots, C(J)) \tag{15}$$
where $\mathcal{S}$ denotes the pile operation along the channel direction, and $J$ the total number of filters.
Third, the 3D activation map $D$ is fed into the NLAF, which outputs the final activation map
$$O = \operatorname{NLAF}(D) \tag{16}$$
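
The three steps of Eqs. (14)-(16) can be sketched for a single-channel input as follows. This is an illustrative toy, not the authors' implementation: the helper names `conv2d_valid` and `conv_layer`, the stride of 1 without padding, the use of ReLU as the NLAF, and the example input and kernels are all assumptions.

```python
def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` (stride 1, no padding) and sum elementwise products."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]


def conv_layer(x, kernels):
    maps = [conv2d_valid(x, f) for f in kernels]   # step 1: C(j), Eq. (14)
    stacked = maps                                 # step 2: stack along channels, Eq. (15)
    return [[[max(0, val) for val in row] for row in m]  # step 3: ReLU, Eq. (16)
            for m in stacked]


x = [[1, 2, 0],
     [0, 1, 3],
     [4, 0, 1]]
kernels = [[[1, 0], [0, 1]],      # responds to the main diagonal
           [[0, -1], [-1, 0]]]    # negative anti-diagonal, clipped by ReLU
out = conv_layer(x, kernels)
print(out)  # [[[2, 5], [0, 2]], [[0, 0], [0, 0]]]
```

The output has one 2 × 2 map per kernel, so its channel count equals the number of filters.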
The sizes $S$ of three important matrices (input, filters, and output) are assumed as
$$S(x) = \begin{cases} V_I \times Q_I \times H_I & x = X \\ V_K \times Q_K \times H_K & x = F_j, \ \forall j \in [1, \cdots, J] \\ V_O \times Q_O \times H_O & x = O \end{cases} \tag{17}$$

where the three variables $(V, Q, H)$ denote the height, width, and number of channels of the activation map,
respectively. The subscripts $I$, $K$, and $O$ denote input, filter, and output, respectively. There are two equalities.
First, $H_I = H_K$, indicating that the number of input channels $H_I$ equals the number of filter channels $H_K$.
Second, $H_O = J$, indicating that the number of output channels $H_O$ equals the number of filters $J$.
Let $B$ denote the padding and $A$ the stride; together with $H_O = J$, the values of $(V_O, Q_O, H_O)$ can be deduced as
$$V_O = 1 + \lfloor (2 \times B + V_I - V_K) \div A \rfloor \tag{18.a}$$
$$Q_O = 1 + \lfloor (2 \times B + Q_I - Q_K) \div A \rfloor \tag{18.b}$$
where $\lfloor \cdot \rfloor$ means the floor function.
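
The output-size rule of Eq. (18) is easy to check numerically. In the sketch below, the function name `conv_output_size` and the example layer (256 × 256 input, 32 kernels of 3 × 3, padding 1, stride 2) are hypothetical and chosen only to exercise the formula.

```python
def conv_output_size(v_in, q_in, v_k, q_k, n_filters, padding=0, stride=1):
    """Return (V_O, Q_O, H_O) following Eq. (18) and H_O = J."""
    v_out = 1 + (2 * padding + v_in - v_k) // stride  # floor division = floor function
    q_out = 1 + (2 * padding + q_in - q_k) // stride
    return v_out, q_out, n_filters


# Hypothetical layer: 256 x 256 input, 32 kernels of 3 x 3, padding 1, stride 2
print(conv_output_size(256, 256, 3, 3, n_filters=32, padding=1, stride=2))
# (128, 128, 32)
```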
The NLAF $\sigma$ is commonly chosen to be the rectified linear unit (ReLU) function [33]:
$$\sigma_{\text{ReLU}}(d_{ij}) = \operatorname{ReLU}(d_{ij}) = \max(0, d_{ij}) \tag{19}$$
where $d_{ij} \in D$ means an element of the activation map $D$. ReLU is at present the most popular NLAF compared
to the traditional hyperbolic tangent (HT) and sigmoid (SM) functions, which are defined as
$$\sigma_{\text{HT}}(d_{ij}) = \tanh(d_{ij}) = \frac{e^{d_{ij}} - e^{-d_{ij}}}{e^{d_{ij}} + e^{-d_{ij}}} \tag{20.a}$$
$$\sigma_{\text{SM}}(d_{ij}) = \left(1 + e^{-d_{ij}}\right)^{-1} \tag{20.b}$$
The main advantage of ReLU is its improved gradient propagation, i.e., compared to $\sigma_{\text{SM}}$, ReLU suffers
less from the vanishing gradient problem. Compared to $\sigma_{\text{HT}}$, ReLU is one-sided, so it is more
biologically plausible.
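
For reference, the three activation functions of Eqs. (19) and (20) can be written in a few lines. The function names are illustrative; the hyperbolic tangent is spelled out term by term to mirror Eq. (20.a) rather than calling `math.tanh`.

```python
import math


def relu(d):
    """Rectified linear unit, Eq. (19)."""
    return max(0.0, d)


def tanh_act(d):
    """Hyperbolic tangent, Eq. (20.a)."""
    return (math.exp(d) - math.exp(-d)) / (math.exp(d) + math.exp(-d))


def sigmoid(d):
    """Sigmoid, Eq. (20.b)."""
    return 1.0 / (1.0 + math.exp(-d))


for d in (-2.0, 0.0, 2.0):
    print(d, relu(d), round(tanh_act(d), 4), round(sigmoid(d), 4))
```

For positive inputs ReLU passes the value through unchanged, whereas both tanh and sigmoid saturate, which is where their gradients shrink.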

3.2 Improvement 1: Dropout and Batch Normalization

To improve the performance of the fully connected layers (FCLs), a dropout layer composed of dropout neurons
(DONs) is inserted before each FCL. In Ref. [34], the authors introduced DONs by randomly dropping neurons and
setting their associated weights to zero during training. The DONs are chosen at random via a retention probability
variable ($\beta_{rp}$). Mathematically, for a neuron $N(i, j)$,
$$\tilde{s}_{\text{training}}(i, j) = \begin{cases} 0 & N(i, j) \in \text{DON} \\ s(i, j) & N(i, j) \notin \text{DON} \end{cases} \tag{21}$$

where $s(i, j)$ is the weight associated with $N(i, j)$, and $\tilde{s}(i, j)$ is the weight of neuron $N(i, j)$ after
the dropout layer is applied. $\beta_{rp}$ has a default value of 0.5. In the inference phase, the whole network runs
without DONs, and the weights of FCLs associated with DONs are scaled by $\beta_{rp}$:
$$\tilde{s}_{\text{inference}}(i, j) = \beta_{rp} \times s(i, j) \tag{22}$$

Figure 5: A toy example of a 4-layer FCL with and without DON. (a) a toy example with four FCL layers before
DO; (b) neurons frozen after DO; each number in a circle denotes a neuron, and each line linking two neurons denotes
the corresponding weights will be trained. (FCL: fully connected layer; DO: dropout; DON: dropout neuron)

Figure 5(a) displays a toy instance (a four-FCL model) before DO. At the $k$-th layer there are $C(k)$,
$k = 1, \ldots, 4$, neurons; let $C(1) = 10$, $C(2) = 8$, $C(3) = 6$, and $C(4) = 8$. Therefore, this model has
altogether $\sum_{k=1}^{4} C(k) = 32$ neurons.
The size of learnable weights (SLW) between layers $i$ and $j$ is defined as $C(i, j)$,
$(i, j) \in \{(1,2), (2,3), (3,4)\}$. Thus, before dropout, the SLWs are $C(1,2) = 10 \times 8 = 80$,
$C(2,3) = 8 \times 6 = 48$, and $C(3,4) = 6 \times 8 = 48$. Hence, the total SLW before dropout is
$C = \sum_{i,j} C(i, j) = 176$. Note that neither the incoming and outgoing weights nor the biases are taken into
account in the previous calculation.
If DO is then applied with $\beta_{rp} = 0.5$, Figure 5(b) shows the neurons frozen after DO. The SLW connecting
layers $i$ and $j$ is now $C'(i, j)$, and the total SLW is $C' = \sum_{i,j} C'(i, j) = 20 + 12 + 12 = 44$. The
compression ratio of the size of learnable weights (CRSLW), symbolized as $f_{CR}$, is $C'/C = 44/176 = 0.25$,
which equals the square of $\beta_{rp}$:
$$f_{CR} = \frac{C'}{C} = \beta_{rp}^2 \tag{23}$$

where $C'$ and $C$ are the SLW values after and before dropout, respectively.
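
The arithmetic of the toy model in Figure 5 can be reproduced with a short script. The helper name `learnable_weights` is an assumption, and the sketch keeps exactly a fraction $\beta_{rp}$ of the neurons in every layer; in practice the dropped neurons are chosen at random, so the counts below are expected values rather than guaranteed ones.

```python
def learnable_weights(layer_sizes):
    """Total number of weights between consecutive fully connected layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


sizes = [10, 8, 6, 8]                     # C(1)..C(4) of the toy model
beta_rp = 0.5                             # retention probability
kept = [int(n * beta_rp) for n in sizes]  # neurons retained per layer

c_before = learnable_weights(sizes)       # 80 + 48 + 48
c_after = learnable_weights(kept)         # 20 + 12 + 12
print(c_before, c_after, c_after / c_before)  # 176 44 0.25
```

Because each weight connects two neurons that must both survive, the surviving fraction of weights is the square of the retention probability.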
On the other hand, batch normalization (BN) addresses the so-called “internal covariate shift (ICS)” problem
that reduces the performance of deep neural networks. Four abbreviations are defined for understanding BN:
empirical mean (EM), empirical variance (EV), population mean (PM), and population variance (PV). BN is utilized
to normalize the input of an internal layer $X = \{x_i\}$ over each mini-batch to ensure that the normalized output
$V = \{v_i\}$ has a consistent distribution. BN [35] learns a function as below
$$\underbrace{\{x_i, \ i = 1, 2, \cdots, I\}}_{X} \mapsto \underbrace{\{v_i, \ i = 1, 2, \cdots, I\}}_{V} \tag{24}$$

where $I$ is the size of the mini-batch. In the training phase, the EM $\mu$ and EV $\eta$ are deduced by
$$\mu = \frac{1}{I} \sum_{i=1}^{I} x_i \tag{25}$$
$$\eta = \frac{1}{I} \sum_{i=1}^{I} (x_i - \mu)^2 \tag{26}$$
First, the input $x_i \in X$ is normalized to $\grave{x}_i$:
$$\grave{x}_i = \frac{x_i - \mu}{\sqrt{\eta + \Delta}} \tag{27}$$

where $\Delta$ in the denominator of Eq. (27) improves numerical stability. $\Delta = 10^{-5}$ in this study, since
this value is commonly used in many similar publications [36, 37]; it is a small constant. After this step,
$\grave{x}_i$ has zero mean and unit variance. Further, to attain a more expressive AI model [37], a transformation
is carried out as
$$v_i = C_A \times \grave{x}_i + C_B, \quad i = 1, \cdots, I \tag{28}$$
in which $C_A$ and $C_B$ are two parameter vectors learnt during training. Afterwards, the transformed output
$v_i \in V$ is fed to the subsequent layer. The temporary variable $\grave{x}_i$ remains internal to the present
layer.
There is no mini-batch at the test phase; hence, instead of computing the EM $\mu$ and EV $\eta$, we utilize the
PM $\underline{\mu}$ and PV $\underline{\eta}$. The difference between the EM and the PM is that $\underline{\mu}$
is the average of the whole population, while $\mu$ is the average of a collection (i.e., samples) drawn from that
population. The same applies to the difference between EV and PV. Thus, the output $h_i$ at the test phase is:

$$h_i = C_A \times \left( \frac{x_i - \underline{\mu}}{\sqrt{\underline{\eta} + \Delta}} \right) + C_B \tag{29}$$
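
A minimal scalar sketch of Eqs. (25)-(29) is given below. The function names, the default values `c_a = 1.0` and `c_b = 0.0`, and the example batch are assumptions for illustration; how the population statistics are accumulated is not described in this section, so they are simply passed in as arguments.

```python
import math


def batch_norm_train(x, c_a=1.0, c_b=0.0, delta=1e-5):
    """Normalize one mini-batch of scalars following Eqs. (25)-(28)."""
    n = len(x)
    mu = sum(x) / n                                # empirical mean, Eq. (25)
    eta = sum((xi - mu) ** 2 for xi in x) / n      # empirical variance, Eq. (26)
    x_norm = [(xi - mu) / math.sqrt(eta + delta) for xi in x]  # Eq. (27)
    v = [c_a * xn + c_b for xn in x_norm]          # scale and shift, Eq. (28)
    return v, mu, eta


def batch_norm_test(x, pop_mean, pop_var, c_a=1.0, c_b=0.0, delta=1e-5):
    """Apply Eq. (29) using population statistics instead of batch statistics."""
    return [c_a * (xi - pop_mean) / math.sqrt(pop_var + delta) + c_b for xi in x]


batch = [2.0, 4.0, 6.0, 8.0]
v, mu, eta = batch_norm_train(batch)
print(mu, eta)  # 5.0 5.0
```

With the default scale and shift, the normalized batch `v` has a mean of zero and a variance very close to one, as stated after Eq. (27).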

3.3 Improvement 2: Rank-based Stochastic Pooling

The activation maps (AMs) after a conv layer are frequently overly large, viz., their width, length, and number
of channels are too big to cope with, which leads to (1) overfitting during training and (2) huge computational
burdens.
The pooling layer (PL) is a nonlinear downsampling (NLDS) procedure that solves the above problem. Additionally,
a PL provides translation invariance to those AMs. Given a region $\Phi$ of size $2 \times 2$, let the pixels of
$\Phi = \{\varphi_{m,n}\}$, $(m = 1, 2; \ n = 1, 2)$, be
$$\Phi = \begin{bmatrix} \varphi_{1,1} & \varphi_{1,2} \\ \varphi_{2,1} & \varphi_{2,2} \end{bmatrix} \tag{30}$$

L2P calculates the $l_2$-norm pooling [38] of a given region $\Phi$. Let the output value after NLDS be $\lambda$.
The L2P output $\lambda_\Phi^{L2P}$ is defined as $\lambda_\Phi^{L2P} = \sqrt{\sum_{m,n=1}^{2} \varphi_{m,n}^2}$. We
added a constant $1/|\Phi|$, where $|\Phi|$ denotes the size of region $\Phi$; $|\Phi| = 4$ for a $2 \times 2$ NLDS
pooling:

$$\lambda_\Phi^{L2P} = \sqrt{\frac{\sum_{m,n=1}^{2} \varphi_{m,n}^2}{|\Phi|}} \tag{31}$$

Next, we introduce the average pooling (AP) and max pooling (MP). AP computes the mean value in $\Phi$ as
$$\lambda_\Phi^{AP} = \operatorname{average}(\Phi) = \frac{\sum_{m,n=1}^{2} \varphi_{m,n}}{|\Phi|} \tag{32}$$

MP works on $\Phi$ and chooses its maximum value:

$$\lambda_\Phi^{MP} = \max(\Phi) = \max_{m,n=1}^{2} \varphi_{m,n} \tag{33}$$

Rank-based pooling (RP) [39] is another type of pooling method. Three typical algorithms are rank-based average
pooling (RAP), rank-based weighted pooling (RWP), and rank-based stochastic pooling (RSP). All pooling operations
in RP are calculated on the ranks rather than the actual values. First, the $2 \times 2$ region is vectorized, and the
rank matrix (RM) is calculated from the values of every entry $\varphi_k \in \Phi$,
$k \in \{(1,1), (1,2), (2,1), (2,2)\}$. Usually, lower ranks $r_k \in R$ are assigned to higher values $\varphi_k$:
$$\varphi_{k1} < \varphi_{k2} \Rightarrow r_{k1} > r_{k2} \tag{34}$$
For tied values ($\varphi_{k1} = \varphi_{k2}$), a constraint is added to Eq. (34):
$$(\varphi_{k1} = \varphi_{k2}) \wedge (k1 > k2) \Rightarrow r_{k1} > r_{k2} \tag{35}$$
RAP $\lambda_\Phi^{RAP}$ averages the $v$ greatest activations:
$$\lambda_\Phi^{RAP} = \frac{1}{v} \sum_{k} (\varphi_k \mid r_k \le v) \tag{36}$$
$v = 2$ is used in this work. RWP and RSP are calculated on the exponential rank (ER) vector $E = \{e_k\}$, which is
defined as
$$e_k = \alpha \times (1 - \alpha)^{r_k - 1} \tag{37}$$
where $\alpha$ is a hyper-parameter; here $\alpha = 0.5$.
With this setting, Eq. (37) simplifies to $e_k = 0.5 \times 0.5^{r_k - 1} = 0.5^{r_k}$. RWP is defined as the sum
of the products of $\varphi_k$ and $e_k$:
$$\lambda_\Phi^{RWP} = \sum_{k=1}^{|\Phi|} \varphi_k \times e_k \tag{38}$$
Suppose $k^\star$ is an outcome of a discrete random variable $\mathcal{E} \sim E = \{e_1, \ldots, e_{|\Phi|}\}$; then
RSP [18] is defined as
$$\lambda_\Phi^{RSP} = \varphi_{k^\star} \tag{39}$$
Note that all pooling methods (L2P, AP, MP, and RSP) run on each channel of the activation map separately.

Figure 6: A simplistic example of six pooling technologies. L2P: l2-norm pooling; AP: average pooling; MP: max
pooling; RAP: rank-based average pooling; RWP: rank-based weighted pooling; RSP: rank-based stochastic pooling;
RM: rank matrix; ER: exponential rank

Taking Figure 6 as an example, let the region $\Phi$ be the patch at the top-left of the input AM $I$. For ease of understanding, the row-vector format (RVF) is used to represent the matrix, $\Phi \leftarrow \vec{\Phi}$, so $\Phi = I(r = 1, c = 1) = (5, 6.9, 1.1, 4.9)$.
The L2P result is $\lambda_{\Phi(1,1)}^{L2P} = \sqrt{(5^2 + 6.9^2 + 1.1^2 + 4.9^2)/4} = \sqrt{(25 + 47.61 + 1.21 + 24.01)/4} = 4.95$. The AP result is $\lambda_{\Phi(1,1)}^{AP} = \operatorname{average}(\Phi) = (5 + 6.9 + 1.1 + 4.9) \div 4 = 4.47$. The MP result is $\lambda_{\Phi(1,1)}^{MP} = \max(\Phi) = \max(5, 6.9, 1.1, 4.9) = 6.9$.

Similarly, the RM of $\Phi$ is stated in RVF, $R_\Phi \leftarrow vec(R_\Phi)$, thus $R_\Phi = (2, 1, 4, 3)$. The RAP result is $\lambda_\Phi^{RAP} = (5 + 6.9) \div 2 = 5.95$. The row vector ER is $E_\Phi = (0.5^2, 0.5^1, 0.5^4, 0.5^3)$. The RWP result is $\lambda_\Phi^{RWP} = 5 \times 0.5^2 + 6.9 \times 0.5^1 + 1.1 \times 0.5^4 + 4.9 \times 0.5^3 = 5.38$. RSP chooses $k^\star = 2$ randomly, and hence $\lambda_\Phi^{RSP} = \varphi_2 = 6.9$.
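The six pooling rules and the worked example can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: the function names `rank_vector` and `pool_region` are hypothetical, and renormalizing the ER vector so that it sums to 1 before sampling $k^\star$ is an assumption, since the text only states that $k^\star$ is drawn according to $E$.

```python
import numpy as np

def rank_vector(phi):
    """Rank 1 goes to the largest value; ties keep index order (Eqs. 34-35)."""
    phi = np.asarray(phi, dtype=float).ravel()
    order = np.argsort(-phi, kind="stable")       # positions from largest to smallest
    ranks = np.empty(phi.size, dtype=int)
    ranks[order] = np.arange(1, phi.size + 1)
    return ranks

def pool_region(phi, v=2, alpha=0.5, rng=None):
    """Return the six pooling outputs for one vectorized pooling region."""
    phi = np.asarray(phi, dtype=float).ravel()
    r = rank_vector(phi)
    e = alpha * (1.0 - alpha) ** (r - 1)          # exponential rank vector, Eq. (37)
    rng = np.random.default_rng() if rng is None else rng
    # Assumption: E is renormalized to a probability vector before sampling k*.
    k_star = rng.choice(phi.size, p=e / e.sum())
    return {
        "L2P": float(np.sqrt(np.sum(phi ** 2) / phi.size)),  # Eq. (31)
        "AP": float(phi.mean()),                             # Eq. (32)
        "MP": float(phi.max()),                              # Eq. (33)
        "RAP": float(phi[r <= v].mean()),                    # Eq. (36)
        "RWP": float(np.sum(phi * e)),                       # Eq. (38)
        "RSP": float(phi[k_star]),                           # Eq. (39)
    }

region = [5, 6.9, 1.1, 4.9]                       # top-left patch of Figure 6
print(rank_vector(region))                        # rank matrix in row-vector format
print(pool_region(region, rng=np.random.default_rng(0)))
```

The RSP output varies with the random draw, so only the deterministic outputs can be compared with the values quoted in the text.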

3.4 Proposed Net-0, Net-1 and Net-2

The configuration of the proposed Net-0 was determined by trial-and-error. Let the number of CLs be $N_{CL}$ and the number of FCLs be $N_{FCL}$. We found that $N_{CL} = 6$ and $N_{FCL} = 2$ give the best performance.


Figure 7: Block chart of Net-0, Net-1, and Net-2. (a, b, and c) block chart of Net-0, Net-1, and Net-2, respectively;
𝑺𝒋 (𝒋 = 𝟏, … , 𝟗) means the activation map of internal layers; The digits in the format of 𝒂 × 𝒃 × 𝒄 under each block
means its corresponding size. C: Conv layer; MP: max pooling; FCL: fully-connected layer; BN: batch normalization;
DO: dropout; RSP: rank-based stochastic pooling.

The topmost row of Figure 7 displays the AMs of the proposed CNN Net-0. The input size is $S_0 = 256 \times 256 \times 1$, and the AMs of the following six conv blocks (each block contains one conv layer, one ReLU layer, and one pooling layer) are $S_1 = 128 \times 128 \times 16$, $S_2 = 64 \times 64 \times 32$, $S_3 = 32 \times 32 \times 64$, $S_4 = 16 \times 16 \times 128$, $S_5 = 8 \times 8 \times 256$, and $S_6 = 4 \times 4 \times 512$. Then $S_6$ was flattened into one column vector $S_7 = 1 \times 1 \times 8192$ and fed into two fully-connected blocks (FCBs), where the first FCB is composed of an FCL and ReLU, and the second FCB is composed of an FCL and softmax. The AM of the first FCB is $S_8 = 1 \times 1 \times 100$. Finally, the output of the second FCB, which is the output of this CNN model, is $S_9 = 1 \times 1 \times 2$.
In total, the nine AMs $S_k$, $k \in [1,9]$, correspond to the cuboids in the topmost row of Figure 7. The hyperparameters are presented in Table 2. Here ($a$ $b \times b$ /$c$) represents $a$ filters of size $b \times b$ followed by pooling with a stride of $c$. W and B represent weights and biases, respectively.

Table 2: Hyperparameters of Net-0

Layer     Parameters
Conv-1    16 3x3 /2
Conv-2    32 3x3 /2
Conv-3    64 3x3 /2
Conv-4    128 3x3 /2
Conv-5    256 3x3 /2
Conv-6    512 3x3 /2
FCL-1     W: 100x8192, B: 100x1
FCL-2     W: 2x100, B: 2x1
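The activation-map sizes follow directly from Table 2, which the short Python sketch below traces. It assumes that each conv layer preserves the spatial size (i.e., is zero-padded) and that only the stride-2 pooling halves it; the text does not state the padding, and the helper name `trace_net0_shapes` is hypothetical.

```python
def trace_net0_shapes(input_hw=256, filters=(16, 32, 64, 128, 256, 512),
                      pool_stride=2, fc_units=(100, 2)):
    """Return the list of activation-map sizes S0..S9 as (height, width, channels)."""
    shapes = [(input_hw, input_hw, 1)]            # S0: grayscale input
    hw = input_hw
    for f in filters:                             # six conv blocks
        hw //= pool_stride                        # pooling halves height and width
        shapes.append((hw, hw, f))
    flat = hw * hw * filters[-1]                  # flatten S6 into S7
    shapes.append((1, 1, flat))
    for units in fc_units:                        # two fully connected blocks
        shapes.append((1, 1, units))
    return shapes

for k, s in enumerate(trace_net0_shapes()):
    print(f"S{k} = {s[0]} x {s[1]} x {s[2]}")

# Learnable weights of the two FCLs (weights plus biases), per Table 2
fcl1_params = 100 * 8192 + 100
fcl2_params = 2 * 100 + 2
print(fcl1_params, fcl2_params)
```

The flattened length 4 × 4 × 512 = 8192 matches the second dimension of the FCL-1 weight matrix in Table 2.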

Based on Net-0, we embedded BN and DO to obtain the proposed Net-1. Furthermore, we replaced traditional max pooling (MP) with rank-based stochastic pooling (RSP) to obtain the proposed Net-2. The block charts of Net-1 and Net-2 are shown in Figure 7(b-c). The activation maps of each block in Net-1 and Net-2 are the same as those in Net-0.

3.5 Improvement 3: Graph Convolutional Network

To make more precise decisions on breast cancer, we introduced a graph convolutional network (GCN) to help learn the relation-aware representation (RAR). GCN differs from a standard CNN, which works well on Euclidean structures. For non-Euclidean data (such as graphs), GCN generalizes the standard convolution operation to graph convolution [40].
Consider a graph $G = (V, E)$ with $N$ nodes $v_i \in V$, $i = 1, \cdots, N$, and corresponding links $(v_i, v_j) \in E$. An adjacency matrix (ADM) $A \in \mathbb{R}^{N \times N}$ can be defined, which encodes the relationships among $\{v_i\}$. The point of GCN is to express $G$ through a NN model $f(X, A)$, in which $X \in \mathbb{R}^{N \times D}$ and $D$ represents the feature dimension of every node. Here $AX$ represents the sum of the features of all neighboring nodes. Thus, GCN is able to capture the RAR feature [41].
Mathematically, a multi-layer GCN updates all nodes' feature representations via the layer-wise rule:

$$H^{(l+1)} = \sigma(\hat{A} H^{(l)} W^{(l)}) \tag{40}$$

where $\hat{A} \in \mathbb{R}^{N \times N}$ denotes the normalized form of the ADM $A$, $\sigma$ is the ReLU function, and $H^{(l)} \in \mathbb{R}^{N \times d_l}$ stands for the feature representation of the $l$-th layer.
To run the normalization $A \mapsto \hat{A}$, the degree matrix $d^m \in \mathbb{R}^{N \times N}$, which is a diagonal matrix, is first computed as:

$$d^m_{ij} \overset{\text{def}}{=} \begin{cases} \deg(v_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{41}$$

Afterwards, $\hat{A}$ is deduced from the degree matrix $d^m$ and the ADM $A$ [42]. Here $X = H^{(0)}$, so for a two-layer GCN (2L-GCN), we have

$$H^{(1)} = \sigma(\hat{A} X W^{(0)}) \tag{42.a}$$

$$H^{(2)} = \sigma(\hat{A} H^{(1)} W^{(1)}) \tag{42.b}$$

where $W^{(0)} \in \mathbb{R}^{d_0 \times d_C}$ and $W^{(1)} \in \mathbb{R}^{d_C \times d_2}$ are two trainable weight matrices.
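The forward pass of Eqs. (42.a) and (42.b) can be sketched in NumPy as below. This is an illustration under stated assumptions rather than the authors' code: the text only says that $\hat{A}$ is derived from the degree matrix and the ADM, so the symmetric normalization $D^{-1/2}(A + I)D^{-1/2}$ with added self-loops is assumed here, and the random graph, features, and weights are placeholders.

```python
import numpy as np

def normalize_adjacency(A):
    """Assumed normalization: D^{-1/2} (A + I) D^{-1/2} with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    deg = A_tilde.sum(axis=1)                     # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)               # deg >= 1 because of self-loops
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x):
    return np.maximum(x, 0.0)

def gcn_two_layer(X, A, W0, W1):
    """H1 = relu(A_hat X W0), H2 = relu(A_hat H1 W1)."""
    A_hat = normalize_adjacency(A)
    H1 = relu(A_hat @ X @ W0)
    H2 = relu(A_hat @ H1 @ W1)
    return H2

# Placeholder data with N = 128 nodes, D = 100, d_C = 50, d_2 = 100
rng = np.random.default_rng(0)
N, D, d_C, d_2 = 128, 100, 50, 100
X = rng.standard_normal((N, D))
A = (rng.random((N, N)) < 0.05).astype(float)
A = np.maximum(A, A.T)                            # make the graph undirected
np.fill_diagonal(A, 0.0)
W0 = 0.1 * rng.standard_normal((D, d_C))
W1 = 0.1 * rng.standard_normal((d_C, d_2))
H2 = gcn_two_layer(X, A, W0, W1)
print(H2.shape)
```

In training, `W0` and `W1` would be learned by backpropagation; here they are fixed random matrices so that only the shapes and the propagation rule are demonstrated.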

3.6 Proposed Net-3, Net-4, and Net-5

In this breast cancer classification task, the previous Net-0, Net-1, or Net-2 was first used to give image-level representations of breast mammogram images. However, a CNN does not take into account the inter-image dependencies, so their RARs were learnt by GCN.
The GCN was combined with the CNN models (Net-0, Net-1, or Net-2). The $S_8$ in Figure 7 is used as the individual image-level representation $I \in \mathbb{R}^D$, where $D = 100$ in this work. Then, k-means clustering (KMC) is run on the individual image-level representations, and $N$ cluster centroids (CCs) $X \in \mathbb{R}^{N \times D}$ are obtained. The correlation between the CCs reflects the relationships among the images. The ADM $A \in \mathbb{R}^{N \times N}$ is deduced as

$$A_{mn} = \begin{cases} 1 & \text{if } X_m \in KNN(X_n) \lor X_n \in KNN(X_m) \\ 0 & \text{otherwise} \end{cases} \tag{43}$$

where KNN denotes the cosine similarity (CS)-based kNN, whose neighbor number is set to $k_{KNN}$. Figure 8 gives an example, where the three CS-based nearest neighbors of nodes $i$ and $j$ are $KNN(X_i) = (2, 1, j)$ and $KNN(X_j) = (4, 5, 3)$. So, we have $X_j \in KNN(X_i)$ and $X_i \notin KNN(X_j)$. Using the 'or' operation, we conclude $A_{ij} = 1$.

Figure 8: Illustration of a KNN-based ADM. a and b denote the graph and the corresponding ADM generated by
KNN. The ellipses in the top row and leftmost column mean omission, and the ellipses in the matrix mean uncertain
values. ADM: Adjacency matrix; KNN: cosine similarity (CS)-based k-nearest neighbors.
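The construction of the KNN-based ADM in Eq. (43) can be sketched as follows. The function name `knn_adjacency` is hypothetical, a node is excluded from its own neighbor list (an assumption the text does not spell out), and the four 2-D vectors stand in for cluster centroids, which in the paper come from k-means clustering on 100-dimensional features.

```python
import numpy as np

def knn_adjacency(X, k):
    """Symmetric 0/1 adjacency: A[m, n] = 1 if m is among the k cosine-nearest
    neighbors of n, or n is among those of m (the 'or' rule of Eq. 43)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # assumes no zero rows
    S = Xn @ Xn.T                                       # cosine similarity matrix
    np.fill_diagonal(S, -np.inf)                        # assumption: exclude self
    nn = np.argsort(-S, axis=1)[:, :k]                  # k most similar nodes per row
    N = X.shape[0]
    M = np.zeros((N, N), dtype=bool)
    M[np.repeat(np.arange(N), k), nn.ravel()] = True    # directed kNN relation
    return (M | M.T).astype(int)                        # symmetrize with logical 'or'

# Toy centroids: two pairs of nearly parallel vectors
X_toy = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
A_toy = knn_adjacency(X_toy, k=1)
print(A_toy)
```

Because of the 'or' rule, a row of the resulting matrix can contain more than `k` ones: a node is linked both to its own neighbors and to every node that lists it as a neighbor.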

The node features $X$ and the ADM $A$ were passed to a 2L-GCN, and we obtained $H^{(2)} \in \mathbb{R}^{N \times D}$ by setting $d_2 = D = 100$. $H^{(2)}$ was then combined with $I$ via the dot product

$$y = H^{(2)} I \tag{44}$$

followed by a linear projection (LP) with trainable weights $W^{(2)} \in \mathbb{R}^{N \times N_C}$, in which $N_C$ denotes the number of classes:

$$z = y W^{(2)} + W^b \tag{45}$$

where $z \in \mathbb{R}^{N_C}$ and $W^b$ represents the bias. $N_C = 2$ because this is a binary classification problem, i.e., abnormal or normal breast. Hence, we only need to train $(W^{(0)}, W^{(1)}, W^{(2)})$ and the corresponding biases for this 2L-GCN. Figure 9 illustrates the flowchart of Net-3, Net-4, and Net-5, which combine GCN with Net-0, Net-1, and Net-2, respectively.
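The fusion of Eqs. (44) and (45) reduces to two matrix products, as the NumPy sketch below shows. The arrays are random placeholders for the trained quantities, the name `fuse_and_project` is hypothetical, and the final softmax is an assumption made only to turn the scores into class probabilities (Figure 9 mentions a cross-entropy loss but the text gives no further detail).

```python
import numpy as np

def fuse_and_project(H2, I, W2, b):
    """y = H2 I (one score per graph node), then z = y W2 + b (one score per class)."""
    y = H2 @ I                                    # shape (N,)
    z = y @ W2 + b                                # shape (N_C,)
    return z

def softmax(z):
    z = z - z.max()                               # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

rng = np.random.default_rng(0)
N, D, N_C = 128, 100, 2
H2 = rng.standard_normal((N, D))                  # placeholder GCN output
I_feat = rng.standard_normal(D)                   # placeholder image-level feature (S8)
W2 = 0.01 * rng.standard_normal((N, N_C))         # placeholder projection weights
b = np.zeros(N_C)                                 # placeholder bias W^b

z = fuse_and_project(H2, I_feat, W2, b)
probs = softmax(z)
print(z.shape, probs)
```

Each entry of `y` measures how strongly the image feature aligns with the relation-aware representation of one cluster centroid, which is why the projection matrix has $N$ rows.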

Figure 9: Flowchart of Net-3, Net-4, and Net-5. LP = Linear Projection; CE = Cross Entropy; When the internal
box chooses Net-0 (Net-1 or Net-2), the whole picture denotes the flowchart of Net-3 (Net-4 or Net-5); Bottom row
shows the CNN pipeline while the top row shows the GCN pipeline, and CNN features and GCN features are finally
combined.

During the inference stage, the CNN representations are obtained first, and the corresponding GCN representations are then obtained from the trained 2L-GCN and the pre-built graph. By combining CNN and 2L-GCN, every image is described by both its image-level features and its neighbor features [43]. In this work, three hyperparameters were set by trial-and-error: $d_C = 50$, $N = 128$, $k_{KNN} = 7$.

3.7 Improvement 4: Multiple-way Data Augmentation

Medical datasets are often small because of the expense of collecting data and patients' privacy. To solve the small-size dataset problem (SSDP) [44] and lack of generation (LoG), four types of solutions exist, namely data generation (DG), data augmentation (DA), regularization, and ensemble approaches (EA). In this study, we used a 14-way DA technology due to its ease of implementation [45].
For the whole training image set $B$ (see Equation (1)), $|B(A)| = 63$ and $|B(N)| = 159$, so $|B| = 222$. For each training image $b(t) \in B$, $t = 1, \cdots, 222$, we carried out the following seven DA methods with different DA factors $\beta$. Every DA method generates $V$ simulated images. In this study $V = 30$.

(i) Noise injection. Zero-mean Gaussian noise with a variance of 0.01 was added to every training image to produce $V$ simulated noisy images.

$$\vec{b}_1(t) = \mathrm{NO}[b(t)] = [b_1^{NO}(t), \ldots, b_V^{NO}(t)] \tag{46}$$
(ii) Gamma correction (GC). The GC factor $\beta^{GC}$ varied from 0.4 to 1.6 in steps of 0.04, skipping the value of 1.

$$\vec{b}_2(t) = \mathrm{GC}[b(t)] = [b_1^{GC}(t, \beta_1^{GC}), \ldots, b_V^{GC}(t, \beta_V^{GC})] \tag{47}$$

where $\beta_1^{GC} = 0.4$, $\beta_2^{GC} = 0.44$, …, $\beta_{15}^{GC} = 0.96$, $\beta_{16}^{GC} = 1.04$, $\beta_{17}^{GC} = 1.08$, …, $\beta_V^{GC} = 1.6$.
(iii) Rotation. The rotation angle $\beta^{RN}$ ranged from -30° to 30° in steps of 2°, omitting $\beta^{RN} = 0°$.

$$\vec{b}_3(t) = \mathrm{RN}[b(t)] = [b_1^{RN}(t, \beta_1^{RN}), \ldots, b_V^{RN}(t, \beta_V^{RN})] \tag{48}$$

where $\beta_1^{RN} = -30°$, $\beta_2^{RN} = -28°$, …, $\beta_{15}^{RN} = -2°$, $\beta_{16}^{RN} = +2°$, $\beta_{17}^{RN} = +4°$, …, $\beta_V^{RN} = +30°$.
(iv) Horizontal shear (HS) transform. $V$ simulated images were produced by the HS transform

$$\vec{b}_4(t) = \mathrm{HS}[b(t)] = [b_1^{HS}(t, \beta_1^{HS}), \ldots, b_V^{HS}(t, \beta_V^{HS})] \tag{49}$$

where the HS factors $\beta^{HS}$ vary from -0.15 to 0.15 in steps of 0.01, skipping the value $\beta^{HS} = 0$: $\beta_1^{HS} = -0.15$, $\beta_2^{HS} = -0.14$, …, $\beta_{15}^{HS} = -0.01$, $\beta_{16}^{HS} = +0.01$, $\beta_{17}^{HS} = +0.02$, …, $\beta_V^{HS} = +0.15$.
(v) Vertical shear (VS) transform. VS transforms were generated similarly to the HS transforms. In particular, the VS factors are identical to the HS factors: $\beta_m^{VS} = \beta_m^{HS}$, $\forall m \in \{1, 2, \cdots, V\}$.

$$\vec{b}_5(t) = \mathrm{VS}[b(t)] = [b_1^{VS}(t, \beta_1^{VS}), \ldots, b_V^{VS}(t, \beta_V^{VS})] \tag{50}$$
(vi) Scaling. All training images $b(t)$ were shrunk or stretched with a scaling factor $\beta^{SC}$ from 0.7 to 1.3 in steps of 0.02, skipping $\beta^{SC} = 1$.

$$\vec{b}_6(t) = \mathrm{SC}[b(t)] = [b_1^{SC}(t, \beta_1^{SC}), \ldots, b_V^{SC}(t, \beta_V^{SC})] \tag{51}$$

where the $V$ values of $\beta^{SC}$ are given as: $\beta_1^{SC} = 0.7$, $\beta_2^{SC} = 0.72$, …, $\beta_{15}^{SC} = 0.98$, $\beta_{16}^{SC} = 1.02$, $\beta_{17}^{SC} = 1.04$, …, $\beta_V^{SC} = 1.3$.
(vii) Random translation (RT). The training image $b(t)$ was translated $V$ times with a random horizontal shift $\varepsilon^x$ and a random vertical shift $\varepsilon^y$. The values of $\varepsilon^x$ and $\varepsilon^y$ range from $-Y$ to $+Y$ and obey the uniform distribution $\mathcal{U}$.

$$\vec{b}_7(t) = \mathrm{RT}[b(t)] = [b_1^{RT}(t, \varepsilon_1^x, \varepsilon_1^y), \ldots, b_V^{RT}(t, \varepsilon_V^x, \varepsilon_V^y)] \tag{52}$$

where $\varepsilon_m^\theta \sim \mathcal{U}[-Y, Y]$, $\forall m \in [1, V] \wedge \forall \theta \in \{x, y\}$. We set $Y = 25$ in this study.
(viii) Mirror. All the previous DA results were mirrored, which gives

$$\vec{b}_{n+7}(t) = \mathcal{N}(\vec{b}_n(t)), \quad \forall n \in \{1, 2, \cdots, 7\} \tag{53}$$

where $\mathcal{N}$ represents the mirror function. The raw image $b(t)$ was also mirrored as $\mathcal{N}[b(t)]$.
(ix) Concatenation. All the results were concatenated as

$$\underbrace{\vec{b}^{DA}(t)}_{422} = \operatorname{concat}\{\underbrace{b(t)}_{1}, \underbrace{\mathcal{N}[b(t)]}_{1}, \underbrace{\vec{b}_1(t)}_{V}, \cdots, \underbrace{\vec{b}_{14}(t)}_{V}\} \tag{54}$$

The size of $\vec{b}^{DA}(t)$ is $|\vec{b}^{DA}(t)| = V \times 14 + 1 + 1 = 422$ images. Thus, the DA can be regarded as a function $b(t) \mapsto \vec{b}^{DA}(t)$. The enhanced training set is symbolized as

$$B' = \{\vec{b}^{DA}(1), \cdots, \vec{b}^{DA}(|B|)\} \tag{55}$$
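The factor grids and the size of the augmented set can be verified with a few lines of Python. This sketch only enumerates the DA factors and counts images; it does not apply the image transforms themselves, and the helper name `da_factors` is hypothetical.

```python
import numpy as np

def da_factors(lo, hi, step, skip):
    """Evenly spaced factors from lo to hi (inclusive) with the value `skip` removed."""
    n = int(round((hi - lo) / step)) + 1
    vals = np.round(lo + step * np.arange(n), 6)  # rounding removes float drift
    return vals[~np.isclose(vals, skip)]

V = 30
gc = da_factors(0.4, 1.6, 0.04, skip=1.0)         # gamma correction factors
rn = da_factors(-30.0, 30.0, 2.0, skip=0.0)       # rotation angles in degrees
hs = da_factors(-0.15, 0.15, 0.01, skip=0.0)      # horizontal shear factors
vs = hs.copy()                                    # vertical shear uses the same factors
sc = da_factors(0.7, 1.3, 0.02, skip=1.0)         # scaling factors

# 7 DA methods, each mirrored (14 ways), plus the raw image and its mirror
images_per_input = 14 * V + 1 + 1
n_train, n_abnormal, n_normal = 222, 63, 159
print(len(gc), len(rn), len(hs), len(sc))
print(images_per_input, n_train * images_per_input)
print(n_abnormal * images_per_input, n_normal * images_per_input)
```

Each grid contains exactly $V = 30$ values once the identity factor is skipped, which is why all seven base methods contribute the same number of images.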

3.8 Measures

Table 3: Training and test sets

Set           Symbol       Abnormal (A)   Normal (N)   Total
Training      B            63             159          |B| = 222
DA Training   B′           26,586         67,098       |B′| = 93,684
Test          C            50             50           |C| = 100
Total         D = B + C    113            209          |D| = |B| + |C| = 322

Table 3 shows the sizes of our training, DA training, and test sets. The algorithm was run $N_W$ times, where $N_W = 10$ in this study. Let the run index be $w = 1, \cdots, N_W$. The ideal confusion matrix $E^i$ and the realistic confusion matrix $E^r$ over the test set are

$$E^i(w) = \begin{bmatrix} 50 & 0 \\ 0 & 50 \end{bmatrix}, \quad \forall w \in 1, \cdots, N_W \tag{56}$$

$$E^r(w) = \begin{bmatrix} a(w) & b(w) \\ c(w) & d(w) \end{bmatrix}, \quad 0 \le a(w), b(w), c(w), d(w) \le 50, \quad \forall w \in 1, \cdots, N_W \tag{57}$$

where the four variables $\{a(w), b(w), c(w), d(w)\}$ stand for TP, FN, FP, and TN at the $w$-th run, respectively, in which P stands for the positive class (abnormal breast) and N means the negative class (normal breast).
TP means true positive, i.e., an abnormal breast image is correctly classified as abnormal. TN means true negative, i.e., a normal breast image is correctly classified as normal. FN means an abnormal image is wrongly classified as normal, and FP means a normal image is wrongly classified as abnormal.
Four simple measures $\{\nu^1(w), \nu^2(w), \nu^3(w), \nu^4(w)\}$ are defined below, where $\nu^1$ means sensitivity, $\nu^2$ specificity, $\nu^3$ precision, and $\nu^4$ accuracy.

$$\nu^1(w) = \frac{a(w)}{a(w) + b(w)} \tag{58.a}$$

$$\nu^2(w) = \frac{d(w)}{c(w) + d(w)} \tag{58.b}$$

$$\nu^3(w) = \frac{a(w)}{a(w) + c(w)} \tag{58.c}$$

$$\nu^4(w) = \frac{a(w) + d(w)}{a(w) + b(w) + c(w) + d(w)} \tag{58.d}$$

where accuracy is an overall indicator that uses all four variables: TP, TN, FP, and FN. Accuracy lies in the range [0, 1].
The F1 score is $\nu^5(w)$:

$$\nu^5(w) = \frac{2 \times a(w)}{2 \times a(w) + b(w) + c(w)} \tag{59}$$

Moreover, the F1 score can be formulated in terms of precision and sensitivity as $\nu^5(w) = 2 \times [\nu^3(w) \times \nu^1(w)] \div [\nu^3(w) + \nu^1(w)]$.
The Matthews correlation coefficient (MCC) $\nu^6(w)$ can be written as:

$$\nu^6(w) = \frac{d(w) \times a(w) - c(w) \times b(w)}{\sqrt{\eta(w)}} \tag{60}$$

where $\eta(w)$ is an intermediate variable, $\eta(w) = [c(w) + a(w)] \times [a(w) + b(w)] \times [d(w) + c(w)] \times [d(w) + b(w)]$. MCC can be regarded as a correlation coefficient, and the values of +1, 0, and -1 stand for perfect prediction, random guess, and totally wrong prediction:

$$\nu^6(w) = \begin{cases} +1 & \text{perfect prediction} \\ 0 & \text{random guess} \\ -1 & \text{totally wrong prediction} \end{cases} \tag{61}$$

The Fowlkes–Mallows index (FMI) $\nu^7(w)$ is:

$$\nu^7(w) = \sqrt{\frac{a(w)}{a(w) + c(w)} \times \frac{a(w)}{a(w) + b(w)}} \tag{62}$$

Similar to the F1 score, FMI can be formulated as $\nu^7(w) = \sqrt{\nu^3(w) \times \nu^1(w)}$.

After all previous indicators of all $N_W$ runs are computed, the mean and standard deviation (SD) of the $m$-th measure ($\forall m \in [1,7]$) are computed as:

$$\mathrm{Mean}(\nu^m) = \frac{1}{N_W} \times \sum_{w=1}^{N_W} \nu^m(w) \tag{63}$$

$$\mathrm{SD}(\nu^m) = \sqrt{\frac{1}{N_W - 1} \times \sum_{w=1}^{N_W} [\nu^m(w) - \mathrm{Mean}(\nu^m)]^2} \tag{64}$$

The final performances over $N_W$ runs are reported in Mean ± SD format, abbreviated as MSD. In this study, we used seven indicators in total. Some indicators may be more important than others under specific conditions. In general, we believe $\nu^4$ and $\nu^6$ are the most important indicators, since they consider all four variables $\{a(w), b(w), c(w), d(w)\}$ in the confusion matrix. Considering their ranges, we found that $\nu^6$ (MCC) has a larger range than $\nu^4$ (see Figure 10), so we used $\nu^6$ as the main distinguishing factor for selecting the best model.
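The seven indicators and the MSD summary can be computed from a confusion matrix with the standard library alone. The sketch below is illustrative: the function names are hypothetical, and no guard against zero denominators is included. The example confusion matrix (46, 4, 1, 49) is the one implied by a run with 92% sensitivity and 98% specificity on the 50 + 50 test images, and the sensitivity list is taken from the ten runs reported for Net-5.

```python
import math
from statistics import mean, stdev

def indicators(a, b, c, d):
    """a = TP, b = FN, c = FP, d = TN. Returns [nu1, ..., nu7] as fractions."""
    sensitivity = a / (a + b)                     # Eq. (58.a)
    specificity = d / (c + d)                     # Eq. (58.b)
    precision = a / (a + c)                       # Eq. (58.c)
    accuracy = (a + d) / (a + b + c + d)          # Eq. (58.d)
    f1 = 2 * a / (2 * a + b + c)                  # Eq. (59)
    eta = (c + a) * (a + b) * (d + c) * (d + b)
    mcc = (d * a - c * b) / math.sqrt(eta)        # Eq. (60)
    fmi = math.sqrt(precision * sensitivity)      # Eq. (62)
    return [sensitivity, specificity, precision, accuracy, f1, mcc, fmi]

def msd(values):
    """Mean and sample SD (denominator N_W - 1), as in Eqs. (63)-(64)."""
    return mean(values), stdev(values)

run = [round(100 * x, 2) for x in indicators(46, 4, 1, 49)]
print(run)

net5_sensitivity = [92, 96, 96, 92, 94, 96, 98, 98, 100, 100]
m, s = msd(net5_sensitivity)
print(f"{m:.2f} ± {s:.2f}")
```

Note that `statistics.stdev` uses the $N_W - 1$ denominator of Eq. (64), whereas `statistics.pstdev` would divide by $N_W$.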

3.9 Proposed Algorithm

Table 4 itemizes the pseudocode of the proposed algorithm, comprising one input, five phases, and one output. Phase I covers the preprocessing procedures. Phase II explains the detailed steps of constructing the six proposed models. Phase III presents the complete procedure of $N_W$ runs ($N_W = 10$ in this study) over the test set. Phase IV shows the criterion for choosing the best model, and Phase V validates the effectiveness of DA.

Table 4 Pseudocode of our algorithm

Input: Original Image Set 𝐷0 and its ground truth label 𝐺.
Phase I: Preprocessing
AN Reduction: 𝐷0 → 𝐷1 . See Eq. (5).
MN Reduction by HF: 𝐷1 → 𝐷2 , See Eq. (6).
CLAHE: 𝐷2 → 𝐷3 , See Eq. (9).
Background Removal: 𝐷3 → 𝐷4 . See Eq. (10).
PEM Removal: 𝐷4 → 𝐷5 . See Eq. (11).
SSBC: 𝐷5 → 𝐷6 . See Eq. (12).
Downsampling 𝐷6 → 𝐷7 . See Eq. (13). Parameter 𝑂7 is searched in the grid of [64,128,256,512].
𝐷 ← 𝐷7
Phase II: Model Construction
Net-0: Create an 8-layer CNN (six conv layers and two fully connected layers). See Figure 7(a).
Net-1: See Figure 7(b)
Add BN to conv layers of Net-0. See Eqs. (28)(29).
Add DO to fully connected layers of Net-0. See Eqs. (21)(22)
Net-2: Replace MP with RSP. See Eq. (39) and Figure 7(c).
Net-3, Net-4, and Net-5: Combine 2L-GCN with Net-0, Net-1, and Net-2, respectively. See Figure 9.
Phase III: 𝑵𝑾 runs of test
for 𝑤 = 1: 𝑁𝑊 % w is run index
Step III.A Random Split
Randomly split preprocessed set 𝐷 into training set 𝐵(𝑤) and test set 𝐶(𝑤),
|𝐵(𝑤)| = 222, |𝐶(𝑤)| = 100.
Step III.B DA on training set
for 𝑡 = 1: 222
Training set: 𝐵(𝑤) and its ground truth labels 𝐺[𝐵(𝑤)].
𝑏(𝑡, 𝑤) ∈ 𝐵(𝑤): Each training image in w-th run
$b(t, w) \rightarrow \vec{b}^{DA}(t, w)$. See Eq. (54).
end
Enhanced training set: $B'(w) = \{\vec{b}^{DA}(t, w) \mid t = 1, \cdots, 222\}$. See Eq. (55).
Enhanced training set labels: 𝐺[𝐵′(𝑤)].
Step III.C Model Training
for 𝑛 = 0: 5 % n is the model index
Create an initial model 𝑁𝑒𝑡-(𝑛);
Train 𝑁𝑒𝑡-(𝑛) with enhanced training set 𝐵′(𝑤);
Use cost matrix 𝐸𝑐𝑜𝑠𝑡 . See Eq. (2).
𝑒12 = |𝐵(𝑁)|/|𝐵(𝐴)| = 159/63 = 2.52.
Record the trained model 𝑁(𝑛, 𝑤) as
𝑁(𝑛, 𝑤) = trainnetwork{𝑁𝑒𝑡-(𝑛), 𝐵′ (𝑤), 𝐺[𝐵′ (𝑤)]}.
end
Step III.D Test confusion matrix
Test Set: 𝐶(𝑤) and its labels 𝐺[𝐶(𝑤)].
Test prediction 𝑃𝑟𝑒𝑑(𝑛, 𝑤)
𝑃𝑟𝑒𝑑(𝑛, 𝑤) = predict[𝑁(𝑛, 𝑤), 𝐶(𝑤)];
Test performance on model Net-(n) at w-th run 𝐸 𝑟 (𝑛, 𝑤)
$E^r(n, w) = \text{compare}[Pred(n, w), G[C(w)]]$. See Eq. (57).
Step III.E Indicator Evaluation
for 𝑚 = 1: 7 % m is the indicator index
for 𝑛 = 0: 5 % n is model index
Extract {𝑎(𝑛, 𝑤), 𝑏(𝑛, 𝑤), 𝑐(𝑛, 𝑤), 𝑑(𝑛, 𝑤)} from confusion matrix 𝐸 𝑟 (𝑛, 𝑤). See Eq. (57)
Calculate indicator 𝜈 𝑚 (𝑛, 𝑤). See Eq. (58.a)-(62).
end
end
end
Phase IV: Compare and select the best model
Calculate MSD of each network model.
$\nu^m(n) = \sum_{w=1}^{N_W} \nu^m(n, w)$.
Then calculate mean and SD of performances of Net-(n). See Eqs. (63)(64).
Select the best model 𝑛∗ in terms of MCC 𝜈 6 , i.e., set 𝑚 = 6.
𝑛∗ = argmax[𝜈 6 (𝑛)].
Phase V: Validate the effectiveness of DA
Create optimal model without DA 𝑁𝑒𝑡-(𝑛∗ )-𝑁𝐷𝐴
by repeating a modified Phase III, where the modification is at Step III.C: “Train $Net$-$(n)$ with raw training set $B(w)$”
Compare 𝑁𝑒𝑡-(𝑛∗ ) against 𝑁𝑒𝑡-(𝑛∗ )-𝑁𝐷𝐴
Output: The best model
𝑁𝑒𝑡-(𝑛∗ ).
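Two bookkeeping steps of the pseudocode, the random split of Step III.A and the model selection of Phase IV, can be sketched in Python as follows. Drawing exactly 50 test images per class is an inference from the test-set composition in Table 3, not something the pseudocode states; the function name `split_per_class` is hypothetical; and the mean MCC values are the ones the paper reports in its results for Net-0 to Net-5.

```python
import numpy as np

def split_per_class(labels, n_test_per_class=50, rng=None):
    """Randomly pick n_test_per_class images of each class for the test set."""
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    test_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), n_test_per_class, replace=False)
        for c in np.unique(labels)
    ])
    train_idx = np.setdiff1d(np.arange(labels.size), test_idx)
    return train_idx, test_idx

# 113 abnormal (label 1) and 209 normal (label 0) images, 322 in total
labels = np.array([1] * 113 + [0] * 209)
train_idx, test_idx = split_per_class(labels, rng=np.random.default_rng(0))
n_abnormal = int(labels[train_idx].sum())
n_normal = train_idx.size - n_abnormal

# Cost-matrix entry e12 = |B(N)| / |B(A)| used to counter class imbalance
e12 = n_normal / n_abnormal
print(train_idx.size, test_idx.size, n_abnormal, n_normal, round(e12, 2))

# Phase IV: choose the model with the highest mean MCC (reported values, in %)
mean_mcc = {0: 81.25, 1: 84.83, 2: 87.82, 3: 83.66, 4: 88.47, 5: 92.27}
best_n = max(mean_mcc, key=mean_mcc.get)
print(f"best model: Net-{best_n}")
```

A new random generator seed per run index $w$ would give the ten different splits that the algorithm averages over.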

4 Experiments and Results

4.1 Comparison among Net-0, Net-1, and Net-2

Table 5 gives the results of 10 runs using Net-0, Net-1, and Net-2. Net-0 is the base CNN, Net-1 is BD-CNN, and Net-2 is BDR-CNN. Net-0 (the basic 8-layer CNN model consisting of 6 conv layers and 2 fully connected layers) yielded the following seven performances: $\nu^1_{Net0} = 90.20 \pm 2.90$, $\nu^2_{Net0} = 91.00 \pm 1.41$, $\nu^3_{Net0} = 90.95 \pm 1.23$, $\nu^4_{Net0} = 90.60 \pm 1.26$, $\nu^5_{Net0} = 90.54 \pm 1.41$, $\nu^6_{Net0} = 81.25 \pm 2.48$, $\nu^7_{Net0} = 90.56 \pm 1.39$. Note that the definitions of $\nu$ can be found in Section 3.8.

Table 5: Comparison among Net-0, Net-1, and Net-2 (all values are percentages)
Net-0 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 84.00 94.00 93.33 89.00 88.42 78.39 88.54
2 88.00 92.00 91.67 90.00 89.80 80.06 89.81
3 92.00 90.00 90.20 91.00 91.09 82.02 91.09
4 92.00 92.00 92.00 92.00 92.00 84.00 92.00
5 90.00 90.00 90.00 90.00 90.00 80.00 90.00
6 94.00 92.00 92.16 93.00 93.07 86.02 93.07
7 92.00 90.00 90.20 91.00 91.09 82.02 91.09
8 88.00 90.00 89.80 89.00 88.89 78.02 88.89
9 92.00 90.00 90.20 91.00 91.09 82.02 91.09
10 90.00 90.00 90.00 90.00 90.00 80.00 90.00
MSD 90.20± 2.90 91.00± 1.41 90.95± 1.23 90.60± 1.26 90.54± 1.41 81.25± 2.48 90.56± 1.39
Net-1 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 94.00 94.00 94.00 94.00 94.00 88.00 94.00
2 90.00 90.00 90.00 90.00 90.00 80.00 90.00
3 92.00 94.00 93.88 93.00 92.93 86.02 92.93
4 92.00 92.00 92.00 92.00 92.00 84.00 92.00
5 94.00 94.00 94.00 94.00 94.00 88.00 94.00
6 92.00 92.00 92.00 92.00 92.00 84.00 92.00

24
7 94.00 90.00 90.38 92.00 92.16 84.07 92.17
8 90.00 92.00 91.84 91.00 90.91 82.02 90.91
9 96.00 90.00 90.57 93.00 93.20 86.16 93.24
10 92.00 94.00 93.88 93.00 92.93 86.02 92.93
MSD 92.60± 1.90 92.20± 1.75 92.25± 1.60 92.40± 1.26 92.41± 1.28 84.83± 2.54 92.42± 1.28
Net-2 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 94.00 94.00 94.00 94.00 94.00 88.00 94.00
2 92.00 90.00 90.20 91.00 91.09 82.02 91.09
3 96.00 96.00 96.00 96.00 96.00 92.00 96.00
4 92.00 96.00 95.83 94.00 93.88 88.07 93.90
5 94.00 94.00 94.00 94.00 94.00 88.00 94.00
6 96.00 92.00 92.31 94.00 94.12 88.07 94.14
7 92.00 94.00 93.88 93.00 92.93 86.02 92.93
8 96.00 94.00 94.12 95.00 95.05 90.02 95.05
9 92.00 92.00 92.00 92.00 92.00 84.00 92.00
10 96.00 96.00 96.00 96.00 96.00 92.00 96.00
MSD 94.00± 1.89 93.80± 1.99 93.83± 1.90 93.90± 1.60 93.91± 1.59 87.82± 3.19 93.91± 1.59

For Net-1, the performances improved to $\nu^1_{Net1} = 92.60 \pm 1.90$, $\nu^2_{Net1} = 92.20 \pm 1.75$, $\nu^3_{Net1} = 92.25 \pm 1.60$, $\nu^4_{Net1} = 92.40 \pm 1.26$, $\nu^5_{Net1} = 92.41 \pm 1.28$, $\nu^6_{Net1} = 84.83 \pm 2.54$, $\nu^7_{Net1} = 92.42 \pm 1.28$. Comparing the results of the base CNN (Net-0) and BD-CNN (Net-1) shows the effectiveness of dropout and BN.
Additionally, Net-2 yielded performances of $\nu^1_{Net2} = 94.00 \pm 1.89$, $\nu^2_{Net2} = 93.80 \pm 1.99$, $\nu^3_{Net2} = 93.83 \pm 1.90$, $\nu^4_{Net2} = 93.90 \pm 1.60$, $\nu^5_{Net2} = 93.91 \pm 1.59$, $\nu^6_{Net2} = 87.82 \pm 3.19$, $\nu^7_{Net2} = 93.91 \pm 1.59$. Comparing all indicators between BD-CNN (Net-1) and BDR-CNN (Net-2), we can observe that RSP gives a higher mean on all seven indicators than the MP used in Net-1.

4.2 Effect of GCN

Next, we compared the performance with and without GCN. The results of the six models are shown in Table 6. Here, Net-0 (base CNN), Net-1 (BD-CNN), and Net-2 (BDR-CNN) did not use GCN, while Net-3 (CNN-GCN), Net-4 (BD-CNN-GCN), and Net-5 (BDR-CNN-GCN) added GCN to the corresponding base networks (see Table 1). Figure 10 displays the error bars of the six models.

Table 6: Comparison of six network models (all values are percentages)

Model 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7

Net-0 90.20± 2.90 91.00± 1.41 90.95± 1.23 90.60± 1.26 90.54± 1.41 81.25± 2.48 90.56± 1.39
Net-1 92.60± 1.90 92.20± 1.75 92.25± 1.60 92.40± 1.26 92.41± 1.28 84.83± 2.54 92.42± 1.28
Net-2 94.00± 1.89 93.80± 1.99 93.83± 1.90 93.90± 1.60 93.91± 1.59 87.82± 3.19 93.91± 1.59
Net-3 91.60± 2.63 92.00± 2.67 92.03± 2.37 91.80± 1.69 91.78± 1.69 83.66± 3.36 91.80± 1.69
Net-4 94.60± 2.99 93.80± 1.75 93.89± 1.51 94.20± 1.32 94.21± 1.39 88.47± 2.64 94.23± 1.39
Net-5 96.20± 2.90 96.00± 2.31 96.06± 2.14 96.10± 1.60 96.10± 1.61 92.27± 3.17 96.11± 1.60

Figure 10: Error bar of six models

(𝜈1: sensitivity; 𝜈2: specificity; 𝜈3: precision; 𝜈4: accuracy; 𝜈5: F1 score; 𝜈6: Matthews correlation coefficient; 𝜈7: Fowlkes–Mallows index; for Net-0 to Net-5, see Table 1 and Figure 3.)

Comparing CNN-GCN (Net-3) against the base CNN (Net-0), we can see that adding GCN improves all seven indicators. The same holds when comparing BD-CNN-GCN (Net-4) against BD-CNN (Net-1), and BDR-CNN-GCN (Net-5) against BDR-CNN (Net-2). GCN enhances the performance because it can learn the RARs among the test samples. Therefore, classifiers with GCNs provide more precise results than those without GCNs.
Additionally, Table 6 and Figure 10 give the optimal $n^* = 5$, which indicates that Net-5 (BDR-CNN-GCN) attained the best results (including MCC) among all our networks. This was expected, because Net-5, i.e., BDR-CNN-GCN, combines GCN with the best model without GCN (Net-2, viz., BDR-CNN).

4.3 Comparison to State-of-the-art Approaches

Table 7 shows the detailed results of each run of our proposed Net-5 BDR-CNN-GCN model. As shown, $\nu^1_{Net5} = 96.20 \pm 2.90$, $\nu^2_{Net5} = 96.00 \pm 2.31$, $\nu^3_{Net5} = 96.06 \pm 2.14$, $\nu^4_{Net5} = 96.10 \pm 1.60$, $\nu^5_{Net5} = 96.10 \pm 1.61$, $\nu^6_{Net5} = 92.27 \pm 3.17$, $\nu^7_{Net5} = 96.11 \pm 1.60$. As expected, the performance of Net-5 BDR-CNN-GCN was the best among all six proposed network models (see Table 6).

Table 7 Performance of proposed Net-5 (all values are percentages)
Run 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 92.00 98.00 97.87 95.00 94.85 90.16 94.89
2 96.00 96.00 96.00 96.00 96.00 92.00 96.00
3 96.00 96.00 96.00 96.00 96.00 92.00 96.00
4 92.00 98.00 97.87 95.00 94.85 90.16 94.89
5 94.00 92.00 92.16 93.00 93.07 86.02 93.07
6 96.00 98.00 97.96 97.00 96.97 94.02 96.97
7 98.00 96.00 96.08 97.00 97.03 94.02 97.03
8 98.00 96.00 96.08 97.00 97.03 94.02 97.03
9 100.00 98.00 98.04 99.00 99.01 98.02 99.01
10 100.00 92.00 92.59 96.00 96.15 92.30 96.23
MSD 96.20± 2.90 96.00± 2.31 96.06± 2.14 96.10± 1.60 96.10± 1.61 92.27± 3.17 96.11± 1.60

Next, we compared the proposed BDR-CNN-GCN (Net-5) with 15 state-of-the-art methods: SVM [5], HBP [6],
SWSVM [7], WFrFT [8], TS+MIP [9], Jaya [10], IBBO [11], WEE [12], SVM-db [13], WE [14], HRD [15], DN201
[16], AN [17] , GN [17], and RSPNN [18]. The comparative results and plots are shown in Table 8 and Figure 11,
respectively. Overall, we observed that our proposed BDR-CNN-GCN (Net-5) model is superior to all 15 state-of-the-
art approaches.

Table 8: Comparison with state-of-the-art approaches

Approach        𝜈1           𝜈2           𝜈3           𝜈4           𝜈5           𝜈6           𝜈7
SVM [5]         20.4         87.2         n/a          62.0         n/a          n/a          n/a
HBP [6]         87.90        87.20        87.29        87.55        87.59        75.10        87.59
SWSVM [7]       93.3         88.6         n/a          90.1         n/a          n/a          n/a
WFrFT [8]       91.18        91.56        91.53        91.37        91.35        82.74        91.35
TS+MIP [9]      91.9±2.3     70.0±4.7     n/a          84.8±3.1     n/a          n/a          n/a
Jaya [10]       92.26        92.28        92.28        92.27        92.27        84.54        92.27
IBBO [11]       92.54        92.50        92.50        92.52        92.52        85.04        92.52
WEE [12]        92.00        91.70        91.72        91.85        91.86        83.70        91.86
SVM-db [13]     83.10        82.60        82.69        82.85        82.89        65.70        82.89
WE [14]         82.60        81.00        81.30        81.80        81.94        63.61        81.95
HRD [15]        92.20        90.40        90.64        91.30        91.38        82.67        91.40
DN201 [16]      94.58        91.67        n/a          92.73        n/a          n/a          n/a
AN [17]         89.80        90.80        90.82        90.30        90.25        80.70        90.28
GN [17]         91.80        92.60        92.58        92.20        92.16        84.44        92.18
RSPNN [18]      93.40        94.60        94.53        94.00        93.96        88.01        93.97
Net-5 (Ours)    96.20±2.90   96.00±2.31   96.06±2.14   96.10±1.60   96.10±1.61   92.27±3.17   96.11±1.60
(n/a means not available)

Figure 11: Comparison plot (𝝂𝟏: sensitivity; 𝝂𝟐: specificity; 𝝂𝟑: precision; 𝝂𝟒: accuracy; 𝝂𝟓: F1 score; 𝝂𝟔: Matthews correlation coefficient; 𝝂𝟕: Fowlkes–Mallows index; SVM: support vector machine; HBP: Hybridization of Biogeography-based optimization and Particle swarm optimization; SWSVM: spherical wavelet and SVM; WFrFT: weighted-type fractional Fourier transform; TS: thin-plate spline; MIP: maximum intensity projection; IBBO: improved biogeography-based optimization; WEE: wavelet energy entropy; db: Daubechies; WE: wavelet energy; HRD: Hybrid model of Radial basis function network and Decision tree; DN201: DenseNet-201; AN: AlexNet; GN: GoogleNet; RSPNN: rank-based stochastic pooling neural network)

Our proposed Net-5 (BDR-CNN-GCN) gives the best performance compared to all the other 15 approaches for the following three reasons: (i) The traditional SVM [5], HBP [6], SWSVM [7], WFrFT [8], TS+MIP [9], Jaya [10], IBBO [11], WEE [12], SVM-db [13], WE [14], and HRD [15] used a combination of manual feature extraction and simple classification models. Their feature extraction methods are not guaranteed to extract task-specific features efficiently, and their classification models lack the capacity to handle challenging decision-making tasks. (ii) DN201 [16], AN [17], and GN [17] used recent transfer learning techniques; however, they did not fine-tune the transfer learning configuration. (iii) While RSPNN [18] used RSP and the parametric rectified linear unit, this method only focuses on learning individual image-level representations and does not consider the relationships between input images.

4.4 Effectiveness of 14-way DA

We created a modification of Net-5 (BDR-CNN-GCN) by removing DA, and the modified network is called $Net$-$(5)$-$NDA$, where NDA means no DA. The performance of $Net$-$(5)$-$NDA$ is listed in Table 9. Compared to $Net$-$(5)$ using DA, the performance of $Net$-$(5)$-$NDA$ decreased to $\nu^1_{Net5NDA} = 93.40 \pm 2.67$, $\nu^2_{Net5NDA} = 93.00 \pm 1.05$, $\nu^3_{Net5NDA} = 93.04 \pm 0.95$, $\nu^4_{Net5NDA} = 93.20 \pm 1.32$, $\nu^5_{Net5NDA} = 93.20 \pm 1.40$, $\nu^6_{Net5NDA} = 86.44 \pm 2.62$, and $\nu^7_{Net5NDA} = 93.21 \pm 1.40$. Thus, the comparison validates the need for DA in our algorithm. The benefits of using DA are two-fold: (i) this 14-way DA can generate more data from the limited training set, and (ii) our DA can help avoid overfitting.

Table 9: Performance of proposed Net-5-NDA (all values are percentages)

Run 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 92.00 94.00 93.88 93.00 92.93 86.02 92.93
2 92.00 92.00 92.00 92.00 92.00 84.00 92.00
3 92.00 92.00 92.00 92.00 92.00 84.00 92.00
4 88.00 94.00 93.62 91.00 90.72 82.15 90.77
5 92.00 94.00 93.88 93.00 92.93 86.02 92.93
6 96.00 94.00 94.12 95.00 95.05 90.02 95.05
7 96.00 92.00 92.31 94.00 94.12 88.07 94.14
8 96.00 94.00 94.12 95.00 95.05 90.02 95.05
9 94.00 92.00 92.16 93.00 93.07 86.02 93.07

10 96.00 92.00 92.31 94.00 94.12 88.07 94.14
MSD 93.40±2.67 93.00±1.05 93.04±0.95 93.20±1.32 93.20±1.40 86.44±2.62 93.21±1.40

5 Discussion and Conclusions

To improve on previously developed AI methods, our study proposed six network models for abnormal breast detection in mammograms. The experiments showed that the proposed Net-5 model (BDR-CNN-GCN) attains the best results among all six proposed networks and outperforms 15 state-of-the-art methods. BDR-CNN-GCN is a combination of the BDR-CNN model and a 2L-GCN model. Here, BDR-CNN helps extract image-level features, whereas GCN helps extract relation-aware features. Both types of features were learnt during training. The combination of the two networks helps increase the performance of our Net-5 model. Further, we justified the necessity of 14-way DA.
The application of our method in hospitals is promising. The experimental results suggest that our Net-5 system can aid decision making when diagnosing breast cancer using mammograms. Furthermore, our Net-5 (BDR-CNN-GCN) can be improved by integration with classifiers developed by other teams. In addition, our algorithms have the potential to be re-deployed to a new hospital's server. Deploying them as cloud-computing based apps could further improve speed and reduce costs.
A shortcoming of Net-5 (BDR-CNN-GCN) is that, while it achieves high accuracy on mammographic data, it cannot reliably handle heterogeneous data (such as thermograms, patient history, heart rate, etc.). Hence, we propose that our Net-5 model be used to aid radiologists in diagnosing breast cancer using mammograms. Although our dataset is limited in size, the results provide proof-of-concept for the feasibility of our Net-5 model; however, further optimization on much larger datasets is required before our algorithm can be used in any clinical studies. The clinical implication of this research is limited because our study was purely focused on AI method design and algorithm verification. Additional work is ongoing to test our Net-5 model on larger datasets (such as the DDSM and Optimam datasets), with a view to integrating it into large clinical studies.
Future research directions include the following: (1) Enlarge the dataset and test the proposed AI model on breast mammogram images from different sources and with different resolutions. (2) Assess other mechanisms for combining GCN and CNN. (3) Attempt to develop a deeper GCN and analyze whether a GCN with more than two layers improves the classification results. (4) Test other recent DA methods.

Acknowledgement

This paper is partially supported by British Heart Foundation Accelerator Award, UK; Hope Foundation for
Cancer Research, UK (RM60G0680); Royal Society International Exchanges Cost Share Award, UK (RP202G0230);
Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); MINECO/JUNTA/FEDER,

Spain/Regional/Europe (RTI2018-098913-B100, CV2045250, A-TIC-080-UGR18); Guangxi Key Laboratory of
Trusted Software (kx201901); Fundamental Research Funds for the Central Universities (CDLS-2020-03); Key
Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education.

Appendix

Table 10: Abbreviation List

Abbreviation Full Name
2L-GCN Two-layer graph convolutional network
ADM Adjacency matrix
AI Artificial intelligence
AN Additive noise
AP Average pooling
BG Background
BN Batch normalization
CC Cluster centroid
CE Cross entropy
CL Conv layer
CLAHE Contrast-limited adaptive histogram equalization
CNN Convolutional neural network
CRSLW Compression ratio of size of learnable weights
CS Cosine similarity
DBT Dense breast tissue
DCIS Ductal carcinoma in situ
DFT Discrete Fourier transform
DL Deep learning
DT Decision tree
DO Dropout
DON Dropout neuron
EM Empirical mean
ER Exponential rank
EV Empirical variance
FCB Fully connected block
FCL Fully connected layer
FD Frequency domain
FrFT Fractional Fourier transform
GCN Graph convolutional network
HF Homomorphic filtering
KMC k-means clustering
kNN k-nearest neighbor
L2P l2-norm pooling
LCIS Lobular carcinoma in situ
LoG Lack of generation
LP Linear projection
MI Mammogram image
MN Multiplicative noise
MP Max pooling
MSD Mean ± SD
NLAF Nonlinear activation function
NLDS Nonlinear downsampling
NN Neural network
PEM Pectoral muscle
PL Pooling layer
PM Population mean
PV Population variance
RAP Rank-based average pooling
RAR Relation-aware representation
RM Rank matrix
RP Rank-based pooling
ROI Region of interest
RSP Rank-based stochastic pooling
RVF Row-vector format
RWP Rank-based weighted pooling
SD Standard deviation
SLW Size of learnable weight
SM Spiculated mass
SSBC Squared-sized and breast-centered
SSDP Small-size dataset problem
WEE Wavelet energy entropy
