Paper v20
All content following this page was uploaded by Yu-Dong Zhang on 18 November 2020.
Yu-Dong Zhang1,2,#, Suresh Chandra Satapathy3,#, David S Guttery4,*, Juan Manuel Górriz5,*, Shui-Hua Wang2,6,*,
2. Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi
Arabia
5. Department of Signal Theory, Networking and Communications, University of Granada, Granada, Spain
6. School of Architecture Building and Civil engineering, Loughborough University, Loughborough, LE11 3TU, UK
# Yu-Dong Zhang & Suresh Chandra Satapathy contributed equally to this paper, and should be regarded as co-first
authors.
* Correspondence should be addressed to Shui-Hua Wang, Juan Manuel Górriz, and David S Guttery
Abstract: (Aim) In a pilot study to improve detection of malignant lesions in breast mammograms, we aimed to
develop a new method called BDR-CNN-GCN, combining two advanced neural networks: (i) graph convolutional
network (GCN); and (ii) convolutional neural network (CNN). (Method) We utilised a standard 8-layer CNN, then
integrated two improvement techniques: (i) batch normalization (BN) and (ii) dropout (DO). Finally, we utilized rank-
based stochastic pooling (RSP) to substitute the traditional max pooling. This resulted in BDR-CNN, which is a
combination of CNN, BN, DO, and RSP. This BDR-CNN was hybridized with a two-layer GCN, and yielded our
BDR-CNN-GCN model which was then utilized for analysis of breast mammograms as a 14-way data augmentation
method. (Results) As proof of concept, we ran our BDR-CNN-GCN algorithm 10 times on the breast mini-MIAS
dataset (containing 322 mammographic images), achieving a sensitivity of 96.20±2.90%, a specificity of 96.00±2.31%
and an accuracy of 96.10±1.60%. (Conclusion) Our BDR-CNN-GCN showed improved performance compared to
five proposed neural network models and 15 state-of-the-art breast cancer detection approaches, proving to be an
effective method for data augmentation and improved detection of malignant breast masses.
Keywords: convolutional neural network; graph convolutional network; breast cancer classification; mammogram;
artificial intelligence; deep learning; rank-based stochastic pooling; data augmentation;
1 Introduction
Breast cancer commonly presents as a solid mass that could be considered (by inspection, palpation or
radiologically) as being different from the surrounding tissue [1], with subsequent confirmation using tissue biopsy.
There are several classifications used in breast cancer grading systems, the most common of which is the TNM system.
TNM (T-tumor, N-lymph node, and M-metastasis) [2] stratifies breast cancers into five stages: Stage 0 is a pre-
cancerous condition, such as lobular carcinoma in situ (LCIS) and ductal carcinoma in situ (DCIS), stages 1 and 2 are
invasive tumors that are still confined within the breast or have only extended to nearby sentinel lymph nodes, stage
3 represents breast cancers that have extended beyond the immediate tumor region and invaded nearby lymph nodes
and muscles; whereas stage 4 is metastatic cancer which has multiplied beyond the breast and neighboring lymph
nodes to distant regions of the body.
Digital mammography is a non-invasive method to detect the earliest stages of breast cancer development [3].
On mammograms, dense breast tissues (DBTs) and breast masses/tumors appear as increased radiological densities
(seen as white), potentially complicating detection of malignant breast masses due to overlay with DBTs. Hence,
mammography is beset by problems with false positives, false negatives, and overdiagnosis.
Recently, artificial intelligence (AI) approaches have been utilized to aid radiologists through quicker and
improved detection of breast cancer using mammography, since AI approaches interrogate mammograms on a pixel
level and have spatially long-range memory [4] (i.e., radiologists may focus on local or isolated regions, whereas AI
analyzes the mammogram globally).
Numerous studies have investigated AI methods for improving mammographic detection of breast cancer. Techniques
used in previous studies include biogeography-based optimization (BBO), wavelet energy entropy (WEE),
cross validation (CV), k-nearest neighbor (kNN) algorithm, particle swarm optimization (PSO), fractional Fourier
transform (FrFT), support vector machine (SVM), and decision tree (DT). For
example, Milosevic, et al. [5] tested different classifiers, including SVM, Naive Bayes and kNN classifiers. To select
the best classifier, five-fold CV and receiver operating characteristic analysis were carried out. The authors found the
best result was achieved by SVM. Nakamura [6] presented a hybridization of BBO and PSO
(termed HBP). In simulation, the sensitivity, specificity, and accuracy all exceeded 85%.
Gorgel, et al. [7] mixed spherical wavelet and SVM (termed SWSVM), showing improved accuracy compared to
ordinary discrete wavelet transform, which had an accuracy of 83.3%. Liu [8] introduced a weighted FrFT (WFrFT),
with subsequent kNN used as the classifier. The advantage of using WFrFT was to attain the analysis on unified time–
frequency spectrum. Yang, et al. [9] combined thin-plate spline (TS) with maximum intensity projection (MIP)
approaches. Using MRI assessment with information of mammography for diagnosis, their method achieved a
sensitivity of 91.9±2.3%, a specificity of 70.0±4.7%, and an accuracy of 84.8±3.1%. Rao [10] proposed using a Jaya
approach for abnormal breast cancer detection, showing Jaya was more effective than other training algorithms, such
as back propagation, momentum back propagation, genetic algorithm, simulated annealing, and PSO. Wu [11] offered
a novel improved BBO (IBBO) approach by implementing chaotic adaptive real-coded mechanisms to improve
traditional biogeography-based optimization. Chen [12] proposed to use a new feature – WEE where linear regression
was used for classification and achieved an accuracy of 91.85 ± 2.21%. Liu and Brown [13] combined SVM with
Daubechies (db) wavelet (SVM-db), reporting their average sensitivity, average specificity, and average accuracy were
all above 82%. Guo [14] used wavelet energy (WE) to detect abnormal breasts. Their method achieved a
sensitivity, specificity, and accuracy that were all above 81%. Here, the authors mentioned that their method could
improve the accuracy of intelligent breast cancer diagnosis. Ref. [15] proposed a hybrid model of radial basis function
network and DT, termed HRD. Their method was compared with three common algorithms, viz., kNN, Naive Bayes
algorithm and SVM. The authors found their proposed method yielded a high accuracy. Yu [16] presented a
semi-automatic system to classify mammograms as abnormal or normal. This study employed DenseNet201
(DN-201) as a transfer learning technique, achieving a highest accuracy of 92.73%. Samala, et al. [17] simulated a dataset
including noisy labels or corrupted data, and compared AlexNet (AN) with GoogLeNet (GN) for breast cancer
diagnosis. The balance between memorization and learning of their networks was controlled by varying the ratio of
noisy training samples. Pan [18] used a rank-based stochastic pooling neural network (RSPNN) to detect diseased
breasts. In their paper, the authors compared three activation functions and six pooling techniques. Their best model
achieved an accuracy of 94.0%. In addition to these studies, there are also numerous studies on information embedding
[19-24].
Recently, researchers [17, 18] have focused their attention on mammogram-based breast cancer detection
methods using deep learning (DL) approaches to extract features of individual images automatically, but they do not
learn image-level relationships. Based on this, we propose that DL methods currently available could be improved
using additional feature engineering. There are other publications [25-27] using DL to analyze biopsy images; however,
this was not the focus of our study, which only focuses on mammogram images.
The inspiration is to not only learn the image-level representation automatically, but also the relation-aware
representation (RAR) [28] to more accurately detect abnormal masses using mammography. RAR is a method which
utilizes the relationships between data points obtained from a cohort as a whole to unbiasedly improve decisions when
analyzing each data point individually. For example, RAR will determine the relationships among breast
mammograms from an entire cohort of images to unbiasedly determine abnormal masses in each individual image.
Therefore, the AI system will generate more accurate results if considering the relationships among the input
mammogram images.
A graph is a natural tool to describe the above “relationship” if we treat each image as a “node”. The graph convolutional
network (GCN) [29] is a recently proposed AI framework that is capable of learning over graph structure and
node features; thus, it can learn RARs of nodes.
The idea of this work is to use traditional convolutional neural network (CNN) to learn image-level features, and
use GCN to learn relation-aware representation features. The hypothesis is this combination is expected to provide
better performance than any network working alone. The aim of this paper is to present a novel CNN and GCN
combination network to make more accurate diagnoses using breast mammograms.
The contributions of this research are as follows: (i) First, we designed a base network (Net-0), and then
added two improvement techniques to obtain an improved AI model, Net-1. (ii) Next, we developed Net-2 by utilizing
rank-based stochastic pooling (RSP) to substitute conventional pooling in Net-1. (iii) Net-0, Net-1 and Net-2 were
combined with our proposed 2-layer GCN, and we obtained Net-3, Net-4, and Net-5, respectively. (iv) Further
experiments showed that Net-5 gives the best results amongst all six proposed networks. In addition, Net-5 was
superior to state-of-the-art approaches.
2 Breast Dataset
The mini-MIAS [30] dataset was chosen; it comprises 322 single-breast mammogram slices, all of which are
1,024 × 1,024 in size. Among the 322 images, 113 are abnormal and 209 are normal. This dataset was chosen because
numerous previous studies have utilised it, enabling robust comparison between algorithms.
In this dataset abnormal breasts are stratified into six categories, which are presented in Figure 1. The task was
not to predict the 6 abnormal types, but instead combine all 6 abnormal categories as one category termed “abnormal”;
thus, the aim was to detect abnormality in each mammogram image.
appearance, looks like an abnormal arrangement of tissue strands; (d) Spiculated masses, representing sharp-pointed
barbed tissues; (e) Ill-defined masses, i.e., indistinct; (f) Calcification, small deposits of calcium as bright white specks
or dots on the soft tissue background of the breasts.
Note that a circumscribed mass is one whose contour is clearly defined over at least 75% of its surface. The other 25%
may be masked by the adjacent gland. Asymmetry denotes a spectrum of morphological descriptors for a unilateral
fibroglandular-density finding seen on one or more mammographic projections which does not meet the standard for
being a mass. Architectural distortion shows a region where the breast appears normal, but shows abnormal
arrangements of tissue strands. Spiculated masses (SMs – see Table 10 for a list of abbreviations) represent
sharp-pointed barbed tissues. These spiky tumors have elongated pieces of tissue (spicules) extruding from the perimeter. SMs
occur on the border of the breast, not the middle areas. Ill-defined means indistinct masses. Calcifications (i.e. small
deposits of calcium in the breast) are evident in mammograms as bright white specks or dots on the soft tissue
background of the breasts.
The whole image set D contains 113 abnormal (A) breast images and 209 normal (N) mammogram images (MIs).
Hence, 𝐷 = 𝐷(𝐴) + 𝐷(𝑁), where |𝐷(𝐴)| = 113, |𝐷(𝑁)| = 209. Dataset D is randomly divided into training set B and test set
C. Because the mini-MIAS dataset is imbalanced, the test set 𝐶 is composed of fifty abnormal
and fifty normal MIs, i.e., |𝐶(𝐴)| = |𝐶(𝑁)| = 50. The training set 𝐵 is composed of the remaining 63 abnormal and 159 normal
MIs:
|𝐵(𝐴)| = 63, |𝐵(𝑁)| = 159 (1)
where 𝐴 and 𝑁 denote abnormal and normal, respectively.
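The split described above can be sketched as follows. This is a minimal illustration, assuming images are identified by integer indices within each class; the function name `split_indices` and the fixed seed are hypothetical and are not taken from the paper.

```python
import random

def split_indices(n_abnormal=113, n_normal=209, n_test_per_class=50, seed=0):
    """Randomly reserve a balanced test set; the remainder forms the training set."""
    rng = random.Random(seed)          # seed chosen only for reproducibility of this sketch
    abnormal = list(range(n_abnormal))
    normal = list(range(n_normal))
    rng.shuffle(abnormal)
    rng.shuffle(normal)
    test = {"A": abnormal[:n_test_per_class], "N": normal[:n_test_per_class]}
    train = {"A": abnormal[n_test_per_class:], "N": normal[n_test_per_class:]}
    return train, test

train, test = split_indices()
print(len(train["A"]), len(train["N"]), len(test["A"]), len(test["N"]))
```

With the class sizes of mini-MIAS, the training set keeps the imbalance (63 versus 159) while the test set is balanced.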
Numerous techniques exist to handle the imbalanced training set ($|B(N)| \gg |B(A)|$). One
effective strategy is cost-sensitive learning (CSL). CSL biases the model by placing an extra cost $e$ on the minority class
$A$. The cost matrix $E_{cost}$ is defined as
$$E_{cost} = \begin{bmatrix} 0 & e_{12} \\ e_{21} & 0 \end{bmatrix} \tag{2}$$
where $e_{21}$ represents the cost of misclassifying $N$ as $A$, and is set to 1. Taking $A$ as positive and $N$ as
negative,
$$e_{21} = \mathrm{cost}(N \mapsto A) \overset{\mathrm{def}}{=} 1 \tag{3}$$
$e_{12}$ represents the cost of misclassifying $A$ as $N$ [31], and its value equals the number of normal
training MIs divided by the number of abnormal training MIs, viz.,
$$e_{12} = \mathrm{cost}(A \mapsto N) = \frac{|B(N)|}{|B(A)|} \tag{4}$$
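As a worked instance of Eqs. (2)-(4), the sketch below builds the cost matrix from the training-set class counts. The function name `cost_matrix` is hypothetical; only the counts 63 and 159 come from the text.

```python
def cost_matrix(n_abnormal, n_normal):
    """Return [[0, e12], [e21, 0]] as in Eq. (2)."""
    e21 = 1.0                      # cost(N -> A), fixed to 1 (Eq. 3)
    e12 = n_normal / n_abnormal    # cost(A -> N) = |B(N)| / |B(A)| (Eq. 4)
    return [[0.0, e12], [e21, 0.0]]

E_cost = cost_matrix(n_abnormal=63, n_normal=159)
print(E_cost)
```

With these counts, missing an abnormal image costs roughly 2.5 times as much as raising a false alarm on a normal one.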
2.3 Preprocessing
The raw images contain noise and other unwanted content; thus, preprocessing aims to extract the region of
interest (ROI), i.e., the breast itself. Figure 2 shows the seven-step preprocessing workflow. For clarity, let
𝐷0 ← 𝐷, and let 𝑑0 (𝑘) ∈ 𝐷0 , 𝑘 = 1,2, ⋯ ,322 denote a given original image.
In step 1, a median filter (MF) with a 3 × 3 window was utilized to remove additive noise (AN).
𝑑1 (𝑘) = MF{𝑑0 (𝑘), [3 × 3]} (5)
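Step 1 can be illustrated with a small pure-Python 3 × 3 median filter. This is only a sketch: the border handling (edge replication) is an assumption, since the text does not state how image borders were treated, and `median_filter_3x3` is a hypothetical name.

```python
from statistics import median

def median_filter_3x3(img):
    """Replace each pixel by the median of its 3x3 neighbourhood.

    Borders are handled by edge replication (an assumption of this sketch).
    """
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = []
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr = min(max(r + dr, 0), rows - 1)   # clamp row index to the image
                    cc = min(max(c + dc, 0), cols - 1)   # clamp column index to the image
                    window.append(img[rr][cc])
            out[r][c] = median(window)
    return out

# A single bright outlier in an otherwise flat patch is removed.
noisy = [[10, 10, 10],
         [10, 255, 10],
         [10, 10, 10]]
denoised = median_filter_3x3(noisy)
print(denoised)
```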
Figure 2: Preprocessing workflow of the mini-MIAS dataset: The original mammogram image 𝒅𝟎 (𝒌) passes
through seven steps: (1) AN reduction; (2) MN reduction; (3) CLAHE; (4) BG removal; (5) PEM removal; (6) SSBC;
and (7) Downsampling, to obtain the preprocessed image 𝒅𝟕 (𝒌) . MN: multiplicative noise; AN: additive noise;
CLAHE: contrast-limited adaptive histogram equalization; BG: background; PEM: pectoral muscle; SSBC: squared-
sized and breast-centered; a, b, A, and B serve as transient variables
In step 2, homomorphic filtering (HF) was utilized to remove multiplicative noise (MN). The right side of
Figure 2 presents the workflow of HF. The denoised image $d_2(k)$ is
$$d_2(k) = \exp\{F^{-1}[P(F\{\ln[d_1(k)]\})]\} \tag{6}$$
where $d_1(k)$ and $d_2(k)$ represent the input and output images of Step 2, respectively. The operations $\exp(\cdot)$ and $\ln(\cdot)$
stand for the exponential and logarithmic functions, respectively. $F$ is the discrete Fourier transform (DFT) operation and
$F^{-1}$ the inverse DFT (IDFT) operation. $P$ is the filter function:
$$P(u) = (\eta_H - \eta_L)\left\{1 - \exp\left[-s\left(\frac{D(u)}{D_0}\right)^2\right]\right\} + \eta_L \tag{7}$$
where $s$ is the slope parameter, $u$ is a point in the frequency domain (FD), $D(u)$ the distance from point $u$ to the origin in
the FD, $D_0$ a predefined distance, $\eta_L = 0.5$, and $\eta_H = 2$. Expanding Eq. (6), we have
$$\begin{cases} a = \ln[d_1(k)] \\ A = F(a) \\ B = P(A) \\ b = F^{-1}(B) \\ d_2(k) = \exp(b) \end{cases} \tag{8}$$
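To make the behaviour of the filter function in Eq. (7) concrete, the sketch below evaluates it at the frequency origin and far from it. Only the transfer function is shown, not the full HF pipeline; the values of `d0` and `s` are illustrative assumptions, as the text does not report them.

```python
import math

def hf_transfer(dist, d0, s, eta_l=0.5, eta_h=2.0):
    """Evaluate P(u) of Eq. (7), where `dist` is D(u), the distance to the frequency origin."""
    return (eta_h - eta_l) * (1.0 - math.exp(-s * (dist / d0) ** 2)) + eta_l

# d0 = 30 and s = 1 are placeholder values for illustration only.
low = hf_transfer(dist=0.0, d0=30.0, s=1.0)    # gain at the origin
high = hf_transfer(dist=1e6, d0=30.0, s=1.0)   # gain far from the origin
print(low, high)
```

The gain rises from the lower bound (0.5) at the origin towards the upper bound (2) at high frequencies, so slowly varying components of the log-image are attenuated while fine detail is amplified.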
In step 3, to equalize the images’ histogram, contrast-limited adaptive histogram equalization (CLAHE) was
carried out on 𝑑2 (𝑘).
𝑑3 (𝑘) = 𝐸𝐶𝐿 [𝑑2 (𝑘)] (9)
where 𝐸𝐶𝐿 represents CLAHE method.
In step 4, the background (BG) was taken away using region-growing (RG) approach.
𝑑4 (𝑘) = RG[𝑑3 (𝑘)] (10)
In step 5, pectoral muscle (PEM) areas were removed using the thresholding 𝒯 technique.
𝑑5 (𝑘) = 𝒯[𝑑4 (𝑘)] (11)
In step 6, we cropped the longer edge to make each image squared-sized and breast-centered (SSBC).
𝑑6 (𝑘) = SSBC[𝑑5 (𝑘)] (12)
Finally, in step 7, down-sampling (DS) was utilized to resize the image from the previous step, 𝑑6 (𝑘), to a smaller one,
𝑑7 (𝑘), of size 𝑂7 × 𝑂7 . DS reduces redundant information and eases subsequent classification tasks.
𝑑7 (𝑘) = DS{𝑑6 (𝑘), [𝑂7 × 𝑂7 ]} (13)
where 𝑂7 is the output size at Step 7. We tested sizes on the grid 𝑂7 ∈ {64, 128, 256, 512}, and found the
optimal value to be 𝑂7∗ = 256. The reason is that images that are too small (𝑂7 = 64 or 128) may not
contain sufficient information, whereas images that are too large (𝑂7 = 512) result in overfitting by our
classifier.
After completing all seven steps, the ROI 𝑑7 (𝑘) was segmented from the raw mammogram image 𝑑0 (𝑘). All
the images were collated and formed a new dataset 𝐷7 = {𝑑7 (𝑘)}, 𝑘 = 1,2, ⋯ ,322, and 𝐷7 was assigned to replace
the original dataset 𝐷 ← 𝐷7 .
Quality control during preprocessing was mostly performed by algorithm result comparison. Subsequently, trial
and error was used for determining whether preprocessing steps are added or removed, and deciding the optimal
parameters. To do this, we analyzed a small subset from the whole dataset, and checked which combinations of
preprocessing steps can help improve performance.
3 Methodology
model, Net-1 termed “BD-CNN”.
3 Third, Net-2 was proposed and termed BDR-CNN, by utilizing rank-based stochastic pooling (RSP) to
substitute conventional max pooling in Net-1.
4 Net-0, Net-1 and Net-2 were combined with our proposed 2-layer graph convolutional network (GCN) to
obtain Net-3, Net-4, and Net-5, respectively. The names of Net-3, Net-4, and Net-5 are CNN-GCN, BD-
CNN-GCN, and BDR-CNN-GCN, respectively.
Table 1 details the six proposed networks and Figure 3 shows their relationships.
Figure 3: Relationship between the six proposed networks. Net-0 is the base network; adding BN & DO to Net-0
generates Net-1; replacing MP with RSP in Net-1 generates Net-2; adding GCN to Net-(0-2) generates Net-(3-5). BN: batch
normalization; DO: dropout; MP: max pooling; RSP: rank-based stochastic pooling; GCN: graph convolutional
network.
Among recent neural network (NN) and deep learning (DL) techniques, the convolutional neural network (CNN) [32] is
particularly suitable for handling two-dimensional images. A CNN comprises conv layers (CLs), pooling layers (PLs), and
fully connected layers (FCLs). CNNs show improved performance over traditional AI methods (e.g., SVM, DT,
naive Bayesian classifier, etc.), because CNNs learn features from the data during training and therefore significantly
reduce the time needed for feature engineering, i.e., for selecting the most distinguishing features/biomarkers.
The most important procedure in a CNN is convolution, and thus the most important layer is the CL, which
carries out the 2D convolution of the input with the kernels during the forward pass. The kernel weights in
each CL are initialized randomly, and are updated at each training iteration according to the loss function. As a
result, the final learnt kernels may detect certain types of patterns within the input images.
Figure 4: Illustration of conv layer. Conv-in-Run means the convolution is running, i.e., the kernels are moving
across the input, and therefore several steps will be carried out before a complete convolution is performed.
Figure 4 displays the three steps within a CL: (i) convolution; (ii) stacking; (iii) nonlinear activation function
(NLAF). Mathematically, suppose the CL has an input matrix 𝑋 and an output 𝑂, and suppose there exists a set of
kernels 𝐹𝑗 , ∀𝑗 ∈ [1, ⋯ , 𝐽]; then the convolution output 𝐶(𝑗) after step 1 is defined as
𝐶(𝑗) = 𝑋 ⊗ 𝐹𝑗 , ∀𝑗 ∈ [1, ⋯ , 𝐽] (14)
where ⊗ denotes the convolution operation, i.e., the dot product of the filter and the input patch at each filter position.
Second, all 𝐶(𝑗) activation maps are stacked to form a new 3D activation map
𝐷 = 𝒮(𝐶(1), ⋯ , 𝐶(𝐽)) (15)
where 𝒮 denotes the stacking operation along the channel direction, and 𝐽 the total number of filters.
Third, the 3D activation map 𝐷 is fed into the NLAF, which outputs the final activation map
𝑂 = NLAF(𝐷) (16)
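The three steps of Eqs. (14)-(16) can be traced on a tiny example. The sketch below is a simplification: it assumes a single-channel input, no padding, stride 1, and ReLU as the NLAF, and the kernels are hand-picked rather than learnt. The function names are hypothetical.

```python
def conv2d_valid(x, kernel):
    """Slide `kernel` over `x` (no padding, stride 1) and take the dot product at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(x) - kh + 1, len(x[0]) - kw + 1
    return [[sum(x[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

def conv_layer(x, kernels):
    maps = [conv2d_valid(x, k) for k in kernels]         # step 1: C(j) = X (*) F_j
    stacked = maps                                        # step 2: stack along the channel axis
    return [[[max(0.0, v) for v in row] for row in m]    # step 3: ReLU as the NLAF
            for m in stacked]

x = [[1, 2, 0],
     [0, 1, 3],
     [4, 0, 1]]
kernels = [[[1, 0], [0, 1]],       # sums the main diagonal of each 2x2 patch
           [[-1, 0], [0, -1]]]     # its negative, which ReLU clips to zero
out = conv_layer(x, kernels)
print(out)
```

The output has one channel per kernel (J = 2 here), matching the statement that the number of output channels equals the number of filters.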
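The sizes $S$ of three important matrices (input, filters, and output) are assumed as
$$S(x) = \begin{cases} V_I \times Q_I \times H_I & x = X \\ V_K \times Q_K \times H_K & x = F_j,\ \forall j \in [1, \cdots, J] \\ V_O \times Q_O \times H_O & x = O \end{cases} \tag{17}$$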
where the three variables (𝑉, 𝑄, 𝐻) denote the size of height, width, and channels of the activation map, respectively.
The subscripts I, K, and O denote input, filter, and output, respectively. There are two equalities. First, 𝐻𝐼 = 𝐻𝐾 ,
indicating the channel of input 𝐻𝐼 equals the channel of filter 𝐻𝐾 . Second, 𝐻𝑂 = 𝐽, indicating the channel of output
𝐻𝑂 equals the number of filters 𝐽.
Let $B$ denote the padding and $A$ the stride; the values of $(V_O, Q_O)$ can then be deduced as
$$V_O = 1 + f_{fl}[(2 \times B + V_I - V_K) \div A] \tag{18.a}$$
$$Q_O = 1 + f_{fl}[(2 \times B + Q_I - Q_K) \div A] \tag{18.b}$$
where $f_{fl}$ denotes the floor function, and $H_O = J$ as stated above.
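Eq. (18) is easy to check numerically. In the sketch below, the function name `conv_output_size` and the specific padding and stride values in the examples are illustrative assumptions, not settings reported in the text.

```python
def conv_output_size(v_in, v_kernel, padding, stride):
    """Eq. (18): output length along one spatial dimension (floor via integer division)."""
    return 1 + (2 * padding + v_in - v_kernel) // stride

same = conv_output_size(v_in=256, v_kernel=3, padding=1, stride=1)    # size is preserved
halved = conv_output_size(v_in=256, v_kernel=2, padding=0, stride=2)  # size is halved
odd = conv_output_size(v_in=7, v_kernel=3, padding=0, stride=2)       # the floor matters here
print(same, halved, odd)
```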
The NLAF $\sigma$ is commonly chosen to be the rectified linear unit (ReLU) function [33]:
$$\sigma_{ReLU}(d_{ij}) = \mathrm{ReLU}(d_{ij}) = \max(0, d_{ij}) \tag{19}$$
where $d_{ij} \in D$ denotes an element of the activation map $D$. ReLU is at present more popular than the
traditional hyperbolic tangent (HT) and sigmoid (SM) functions, which are defined as
$$\sigma_{HT}(d_{ij}) = \tanh(d_{ij}) = (e^{d_{ij}} - e^{-d_{ij}}) \div (e^{d_{ij}} + e^{-d_{ij}}) \tag{20.a}$$
$$\sigma_{SM}(d_{ij}) = (1 + e^{-d_{ij}})^{-1} \tag{20.b}$$
The main advantage of ReLU is its improved gradient propagation, i.e., compared to $\sigma_{SM}$, ReLU suffers less from
the vanishing gradient problem. Compared to $\sigma_{HT}$, ReLU is one-sided, so it is more biologically plausible.
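The three activation functions of Eqs. (19)-(20) can be written directly from their definitions. The sketch below also evaluates the sigmoid derivative at a large input to illustrate why saturating functions are prone to vanishing gradients; the helper names are hypothetical.

```python
import math

def relu(d):
    return max(0.0, d)                                                       # Eq. (19)

def tanh_act(d):
    return (math.exp(d) - math.exp(-d)) / (math.exp(d) + math.exp(-d))      # Eq. (20.a)

def sigmoid(d):
    return 1.0 / (1.0 + math.exp(-d))                                       # Eq. (20.b)

def sigmoid_grad(d):
    s = sigmoid(d)
    return s * (1.0 - s)          # derivative of the sigmoid

# For a large input the sigmoid gradient is almost zero, whereas ReLU keeps a slope of 1.
print(relu(10.0), sigmoid(10.0), sigmoid_grad(10.0))
```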
To improve the performance of the fully connected layers (FCLs), a dropout layer composed of dropout neurons (DONs)
is inserted before each FCL. In Ref. [34], the authors introduced DONs by dropping neurons and setting their associated
weights to zero during training. DONs are chosen randomly via a retention probability variable ($\beta_{rp}$).
Mathematically, for a neuron $N(i,j)$,
$$\tilde{s}(i,j)\big|_{\text{training}} = \begin{cases} 0 & N(i,j) \in \mathrm{DON} \\ s(i,j) & N(i,j) \notin \mathrm{DON} \end{cases} \tag{21}$$
where $s(i,j)$ denotes the weights of $N(i,j)$, and $\tilde{s}(i,j)$ denotes the weights of neuron $N(i,j)$ after the
dropout layer is applied. $\beta_{rp}$ has a default value of 0.5. In the inference phase, the whole network runs
without DONs, and the weights of FCLs associated with DONs are scaled by $\beta_{rp}$:
$$\tilde{s}(i,j)\big|_{\text{inference}} = \beta_{rp} \times s(i,j) \tag{22}$$
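The training and inference behaviour of dropout can be sketched as below. This is a simplification under stated assumptions: each neuron is represented by a single weight, a neuron is retained with probability `beta_rp` and zeroed otherwise, and the function names are hypothetical.

```python
import random

def dropout_train(weights, beta_rp=0.5, rng=None):
    """Training phase: keep each weight with probability beta_rp, otherwise set it to zero."""
    rng = rng or random.Random()
    return [w if rng.random() < beta_rp else 0.0 for w in weights]

def dropout_inference(weights, beta_rp=0.5):
    """Inference phase (Eq. 22): nothing is dropped; weights are scaled by beta_rp."""
    return [beta_rp * w for w in weights]

weights = [0.8, -1.2, 0.4, 2.0]
train_w = dropout_train(weights, rng=random.Random(0))   # seeded only for repeatability
infer_w = dropout_inference(weights)
print(train_w, infer_w)
```

Scaling at inference keeps the expected input to the next layer the same as during training, where only a fraction `beta_rp` of the neurons is active on average.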
Figure 5: A toy example of a 4-layer FCL with and without DON. (a) a toy example with four FCL layers before
DO; (b) neurons frozen after DO; each number in a circle denotes a neuron, and each line linking two neurons denotes
the corresponding weights will be trained. (FCL: fully connected layer; DO: dropout; DON: dropout neuron)
Figure 5(a) displays a toy CNN instance (a four-FCL model) before DO. The $k$-th layer has $C(k)$, $k = 1, \ldots, 4$
neurons, with $C(1) = 10$, $C(2) = 8$, $C(3) = 6$, $C(4) = 8$. Therefore, this model has altogether $\sum_{k=1}^{4} C(k) = 32$ neurons.
The size of learnable weights (SLW) between layers $i$ and $j$ is defined as $C(i,j)$, $(i,j) \in \{(1,2), (2,3), (3,4)\}$.
Thus, before dropout, the SLWs are $C(1,2) = 10 \times 8 = 80$, $C(2,3) = 8 \times 6 = 48$, $C(3,4) = 6 \times 8 = 48$.
Hence, the total SLW before dropout is $C = \sum_{i,j} C(i,j) = 176$. Note that neither the incoming and outgoing
weights nor the biases are taken into account in this calculation.
If DO is then applied with $\beta_{rp} = 0.5$, Figure 5(b) shows the neurons frozen after DO. The SLW connecting layers
$i$ and $j$ is denoted $C'(i,j)$, and the total SLW is $C' = \sum_{i,j} C'(i,j) = 20 + 12 + 12 = 44$. The compression
ratio of the size of learnable weights (CRSLW), symbolized as $f_{CR}$, is $C'/C = 44/176 = 0.25$, which equals
the square of $\beta_{rp}$:
$$f_{CR} = \frac{C'}{C} = \beta_{rp}^2 \tag{23}$$
where $C'$ and $C$ are the SLW values after and before dropout, respectively.
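The arithmetic of the toy example can be reproduced in a few lines. The sketch assumes exactly half of the neurons in every layer are retained, as in Figure 5(b); `learnable_weights` is a hypothetical helper name.

```python
def learnable_weights(layer_sizes):
    """Count the weights between consecutive fully connected layers (biases excluded)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

sizes = [10, 8, 6, 8]
beta_rp = 0.5
kept = [int(n * beta_rp) for n in sizes]   # neurons retained per layer: [5, 4, 3, 4]

c_before = learnable_weights(sizes)        # 80 + 48 + 48
c_after = learnable_weights(kept)          # 20 + 12 + 12
f_cr = c_after / c_before
print(c_before, c_after, f_cr)
```

Because each weight links two layers that are both thinned by the factor `beta_rp`, the number of weights shrinks by its square.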
Batch normalization (BN), on the other hand, addresses the so-called “internal covariate shift (ICS)” problem
that reduces the performance of deep neural networks. Four abbreviations are defined for understanding BN:
empirical mean (EM), empirical variance (EV), population mean (PM), and population variance (PV). BN is utilized
to normalize the input of an internal layer $X = \{x_i\}$ over each mini-batch to ensure that the normalized output $V = \{v_i\}$ has
a consistent distribution. BN [35] learns the following function
$$\underbrace{\{x_i,\ i = 1, 2, \cdots, I\}}_{X} \mapsto \underbrace{\{v_i,\ i = 1, 2, \cdots, I\}}_{V} \tag{24}$$
where $I$ is the size of the mini-batch. In the training phase, the EM $\mu$ and EV $\eta$ are computed by
$$\mu = \frac{1}{I}\sum_{i=1}^{I} x_i \tag{25}$$
$$\eta = \frac{1}{I}\sum_{i=1}^{I} (x_i - \mu)^2 \tag{26}$$
First, the input $x_i \in X$ is normalized to $\grave{x}_i$:
$$\grave{x}_i = \frac{x_i - \mu}{\sqrt{\eta + \Delta}} \tag{27}$$
where $\Delta$ in the denominator of Eq. (27) improves numerical stability. $\Delta = 10^{-5}$ in this study, since this value
is commonly used in similar publications [36, 37]; it is merely a small constant. After this step, $\grave{x}_i$ has
zero mean and unit variance. Further, to attain a more expressive AI model [37], a transformation is
carried out as
$$v_i = C_A \times \grave{x}_i + C_B,\quad i = 1, \cdots, I \tag{28}$$
in which $C_A$ and $C_B$ are two parameter vectors learnt during training. Afterwards, the
transformed output $v_i \in V$ is fed to the subsequent layer. The temporary variable $\grave{x}_i$ remains internal to the present
layer.
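The training-phase computation (mini-batch mean and variance, normalization, then scale and shift) can be sketched on a single feature. Using scalar `c_a` and `c_b` instead of parameter vectors, and fixing them to 1 and 0, are simplifying assumptions of this sketch.

```python
import math

def batch_norm_train(x, c_a=1.0, c_b=0.0, delta=1e-5):
    """Normalize one feature over a mini-batch, then apply the learnt scale and shift."""
    n = len(x)
    mu = sum(x) / n                                              # empirical mean
    eta = sum((xi - mu) ** 2 for xi in x) / n                    # empirical variance
    x_hat = [(xi - mu) / math.sqrt(eta + delta) for xi in x]     # zero mean, unit variance
    v = [c_a * xh + c_b for xh in x_hat]                         # scale and shift
    return v, mu, eta

batch = [2.0, 4.0, 6.0, 8.0]
v, mu, eta = batch_norm_train(batch)
mean_v = sum(v) / len(v)
var_v = sum((vi - mean_v) ** 2 for vi in v) / len(v)
print(mu, eta, mean_v, var_v)
```

The output variance is very slightly below 1 because of the stabilizing constant `delta` in the denominator.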
There is no mini-batch at the test phase; hence, instead of computing the EM $\mu$ and EV $\eta$, we utilized the PM $\underline{\mu}$
and PV $\underline{\eta}$. The difference between the empirical mean and the population mean is that $\underline{\mu}$ is the average over the whole
population, while $\mu$ is the average over a collection of samples drawn from that population. The same applies to the
difference between EV and PV. Thus, the output $h_i$ at the test phase is:
$$h_i = C_A \times \left(\frac{x_i - \underline{\mu}}{\sqrt{\underline{\eta} + \Delta}}\right) + C_B \tag{29}$$
The activation maps (AMs) after a conv layer are frequently very large, viz., their width, length,
and number of channels are too big to cope with, which leads to (1) overfitting during training and (2) a heavy computational
burden.
The pooling layer (PL) performs nonlinear downsampling (NLDS) to solve the above problem. Additionally,
PL provides translation invariance to the AMs. Given a region $\Phi$ of size $2 \times 2$, let the
pixels of $\Phi = \{\varphi_{m,n}\}$, $(m = 1,2;\ n = 1,2)$ be
$$\Phi = \begin{bmatrix} \varphi_{1,1} & \varphi_{1,2} \\ \varphi_{2,1} & \varphi_{2,2} \end{bmatrix} \tag{30}$$
L2P calculates the $l_2$-norm pooling [38] of a given region $\Phi$. Let the output value after NLDS be $\lambda$. The L2P output
$\lambda_\Phi^{L2P}$ is defined as $\lambda_\Phi^{L2P} = \sqrt{\sum_{m,n=1}^{2} \varphi_{m,n}^2}$. Here, we added a constant $1/|\Phi|$, where $|\Phi|$ denotes the number of elements in region $\Phi$:
$$\lambda_\Phi^{L2P} = \sqrt{\frac{\sum_{m,n=1}^{2} \varphi_{m,n}^2}{|\Phi|}} \tag{31}$$
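Next, we introduce average pooling (AP) and max pooling (MP). AP computes the mean value in $\Phi$ as
$$\lambda_\Phi^{AP} = \mathrm{average}(\Phi) = \frac{\sum_{m,n=1}^{2} \varphi_{m,n}}{|\Phi|} \tag{32}$$
MP selects the maximum value in $\Phi$:
$$\lambda_\Phi^{MP} = \max(\Phi) = \max_{m,n} \varphi_{m,n} \tag{33}$$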
Rank-based pooling (RP) [39] is another type of pooling method. Three typical algorithms are rank-based average
pooling (RAP), rank-based weighted pooling (RWP), and rank-based stochastic pooling (RSP). All pooling operations
in RP are calculated from the ranks rather than the actual values. First, the $2 \times 2$ region is vectorized, and the
rank matrix (RM) is calculated from the value of every entry $\varphi_k \in \Phi$, $k \in \{(1,1), (1,2), (2,1), (2,2)\}$. Usually, lower
ranks $r_k \in R$ are assigned to higher values $\varphi_k$:
$$\varphi_{k1} < \varphi_{k2} \Rightarrow r_{k1} > r_{k2} \tag{34}$$
In the case of tied values ($\varphi_{k1} = \varphi_{k2}$), a constraint is added to Eq. (34):
$$(\varphi_{k1} = \varphi_{k2}) \wedge (k1 > k2) \Rightarrow r_{k1} > r_{k2} \tag{35}$$
RAP $\lambda_\Phi^{RAP}$ averages the $v$ greatest activations:
$$\lambda_\Phi^{RAP} = \frac{1}{v} \sum_k (\varphi_k \mid r_k \le v) \tag{36}$$
$v = 2$ in this work. RWP and RSP are calculated on the exponential rank (ER) vector $E = \{e_k\}$, which is
defined as
$$e_k = \alpha \times (1 - \alpha)^{r_k - 1} \tag{37}$$
where $\alpha$ is a hyper-parameter; here $\alpha = 0.5$.
With this setting, Eq. (37) simplifies to $e_k = 0.5 \times 0.5^{r_k - 1} = 0.5^{r_k}$. RWP is defined as the sum
of the products of $\varphi_k$ and $e_k$:
$$\lambda_\Phi^{RWP} = \sum_{k=1}^{|\Phi|} \varphi_k \times e_k \tag{38}$$
Suppose $k^\star$ is an outcome of a discrete random variable $\mathcal{E} \sim E = \{e_1, \ldots, e_{|\Phi|}\}$; then RSP [18] is
defined as
$$\lambda_\Phi^{RSP} = \varphi_{k^\star} \tag{39}$$
Be mindful that all pooling methods (L2P, AP, MP, and RSP) run on each channel of the activation map separately.
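The six pooling rules can be compared on one illustrative 2 × 2 region written in row-vector form, (5, 6.9, 1.1, 4.9). The sketch below is not the authors' implementation: the function names are hypothetical, and in `rsp` the exponential ranks are normalized before sampling, which is an assumption because the weights do not sum to 1 for a four-element region.

```python
import math
import random

def ranks(phi):
    """Rank 1 goes to the largest value; ties are broken by position (earlier index ranks first)."""
    order = sorted(range(len(phi)), key=lambda k: (-phi[k], k))
    r = [0] * len(phi)
    for rank, k in enumerate(order, start=1):
        r[k] = rank
    return r

def exp_ranks(phi, alpha=0.5):
    return [alpha * (1 - alpha) ** (rk - 1) for rk in ranks(phi)]   # exponential rank vector

def l2_pool(phi):
    return math.sqrt(sum(p * p for p in phi) / len(phi))

def avg_pool(phi):
    return sum(phi) / len(phi)

def max_pool(phi):
    return max(phi)

def rap(phi, v=2):
    return sum(p for p, rk in zip(phi, ranks(phi)) if rk <= v) / v  # mean of the v largest values

def rwp(phi, alpha=0.5):
    return sum(p * e for p, e in zip(phi, exp_ranks(phi, alpha)))

def rsp(phi, alpha=0.5, rng=random):
    # random.choices normalizes the weights internally (assumption of this sketch).
    return rng.choices(phi, weights=exp_ranks(phi, alpha), k=1)[0]

phi = [5, 6.9, 1.1, 4.9]
print(ranks(phi), l2_pool(phi), avg_pool(phi), max_pool(phi), rap(phi), rwp(phi), rsp(phi))
```

Unlike the other five rules, RSP is stochastic: repeated calls return different elements of the region, with larger activations being more likely.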
Figure 6: A simplistic example of six pooling technologies. L2P: l2-norm pooling; AP: average pooling; MP: max
pooling; RAP: rank-based average pooling; RWP: rank-based weighted pooling; RSP: rank-based stochastic pooling;
RM: rank matrix; ER: exponential rank
Taking Figure 6 as an example, let the region $\Phi$ be the patch at the top left of the input AM $I$. For
ease of understanding, the row-vector format (RVF) is employed to represent the matrix, so $\Phi = I(r = 1, c = 1) = (5, 6.9, 1.1, 4.9)$.
The L2P result is $\lambda_{\Phi(1,1)}^{L2P} = \sqrt{(5^2 + 6.9^2 + 1.1^2 + 4.9^2)/4} = \sqrt{(25 + 47.61 + 1.21 + 24.01)/4} = 4.95$. The
pooling result of AP is $\lambda_{\Phi(1,1)}^{AP} = \mathrm{average}(\Phi) = (5 + 6.9 + 1.1 + 4.9) \div 4 = 4.475$. The MP result is $\lambda_{\Phi(1,1)}^{MP} =$
The configuration of the proposed Net-0 was determined by trial and error. The number of CLs is denoted 𝑁𝐶𝐿 , and the number
of FCLs 𝑁𝐹𝐶𝐿 . We found that 𝑁𝐶𝐿 = 6, 𝑁𝐹𝐶𝐿 = 2 gives the best performance.
Figure 7: Block chart of Net-0, Net-1, and Net-2. (a, b, and c) block chart of Net-0, Net-1, and Net-2, respectively;
𝑺𝒋 (𝒋 = 𝟏, … , 𝟗) means the activation map of internal layers; The digits in the format of 𝒂 × 𝒃 × 𝒄 under each block
means its corresponding size. C: Conv layer; MP: max pooling; FCL: fully-connected layer; BN: batch normalization;
DO: dropout; RSP: rank-based stochastic pooling.
The topmost row of Figure 7 displays the AMs of the proposed CNN Net-0, in which the size of the input is 𝑆0 =
256 × 256 × 1, and the AMs of the following six conv blocks (each block contains one conv layer, one ReLU layer,
and one pooling layer) are 𝑆1 = 128 × 128 × 16, 𝑆2 = 64 × 64 × 32, 𝑆3 = 32 × 32 × 64, 𝑆4 = 16 × 16 × 128,
𝑆5 = 8 × 8 × 256, and 𝑆6 = 4 × 4 × 512. Then 𝑆6 is flattened to one column vector 𝑆7 = 1 × 1 × 8192, and
fed into two fully-connected blocks (FCBs), where the first FCB is composed of FCL and ReLU, and the second FCB is
composed of FCL and softmax. The AM of the first FCB is 𝑆8 = 1 × 1 × 100. Finally, the output of this CNN model is
𝑆9 = 1 × 1 × 2.
In total, the nine AMs 𝑆𝑘 , 𝑘 ∈ [1,9] correspond to the cuboids in the topmost row of Figure 7. The
hyper-parameters are presented in Table 2. Here (𝑎 𝑏 × 𝑏 /𝑐) represents 𝑎 filters of size 𝑏 × 𝑏 followed by pooling
with a stride of 𝑐. W and B represent weights and biases, respectively.
Layer     Parameters
Conv-1    16 3x3 /2
Conv-2    32 3x3 /2
Conv-3    64 3x3 /2
Conv-4    128 3x3 /2
Conv-5    256 3x3 /2
Conv-6    512 3x3 /2
FCL-1     W: 100x8192, B: 100x1
FCL-2     W: 2x100, B: 2x1
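The activation-map sizes and the FCL parameter counts implied by these hyper-parameters can be verified with a short script. The sketch assumes each conv layer preserves the spatial size, so that only the stride-2 pooling halves it; the function names are hypothetical.

```python
def net0_shapes(input_size=256, filters=(16, 32, 64, 128, 256, 512), fc_units=(100, 2)):
    """Trace the activation-map sizes S0..S9 of Net-0."""
    shapes = [(input_size, input_size, 1)]       # S0: grayscale input
    size = input_size
    for f in filters:
        size //= 2                               # stride-2 pooling halves height and width
        shapes.append((size, size, f))           # S1..S6
    flat = size * size * filters[-1]
    shapes.append((1, 1, flat))                  # S7: flattened vector
    for u in fc_units:
        shapes.append((1, 1, u))                 # S8, S9
    return shapes

def fc_params(n_in, n_out):
    return n_in * n_out + n_out                  # weights plus biases

shapes = net0_shapes()
fcl1 = fc_params(8192, 100)
fcl2 = fc_params(100, 2)
print(shapes, fcl1, fcl2)
```

The flattened size 4 × 4 × 512 = 8192 matches the input dimension of FCL-1 in the table.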
Based on Net-0, we embed BN and DO to obtain the proposed Net-1. Furthermore, we replace
traditional max pooling (MP) with rank-based stochastic pooling (RSP) to obtain the proposed Net-2. The block
charts of Net-1 and Net-2 are shown in Figure 7(b-c). The sizes of the activation maps of each block in Net-1 and Net-2 are the
same as those in Net-0.
To make more precise and accurate decisions on breast cancer, we brought in the graph convolutional network (GCN)
to help determine the relation-aware representation (RAR). GCN differs from the standard CNN, which works well
on Euclidean structures. For non-Euclidean data (such as graphs), GCN generalizes the standard convolution
operation to graph convolution [40].
Consider a graph 𝐺 = (𝑉, 𝐸) with N nodes 𝑣𝑖 ∈ 𝑉, 𝑖 = 1, ⋯ , 𝑁 and edges (𝑣𝑖 , 𝑣𝑗 ) ∈ 𝐸 . An
adjacency matrix (ADM) 𝐴 ∈ ℝ𝑁×𝑁 can be defined, which encodes the relationships among {𝑣𝑖 }. The point of GCN is
to express 𝐺 through a NN model 𝑓(𝑋, 𝐴) in which 𝑋 ∈ ℝ𝑁×𝐷 is the node feature matrix and D represents the feature dimension of every
node. Here 𝐴𝑋 represents the sum of features of all neighboring nodes. Thus, GCN is able to capture the RAR
feature [41].
Mathematically, a multi-layer GCN updates all nodes’ feature representation via the layer-wise rule:
𝐻 𝑙+1 = 𝜎(𝐴̂𝐻 𝑙 𝑊 𝑙 ) (40)
Where 𝐴̂ ∈ ℝ𝑁×𝑁 denotes the normalized form of ADM 𝐴, and 𝜎 ReLU function. 𝐻 (𝑙) ∈ ℝ𝑁×𝑑𝑙 stands for the
feature representation of l-th layer.
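To run the normalization $A \mapsto \hat{A}$, the degree matrix $d^m \in \mathbb{R}^{N \times N}$, which is a diagonal matrix, is first computed as:
$$d^m_{ij} \overset{\text{def}}{=} \begin{cases} \deg(v_i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \tag{41}$$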
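Afterwards, $\hat{A}$ is deduced from the degree matrix $d^m$ and the ADM $A$ [42]. Here $X = H^{(0)}$, so for a two-layer GCN (2L-GCN), we have
$$H^{(1)} = \sigma\left(\hat{A} X W^{(0)}\right) \tag{42.a}$$
$$H^{(2)} = \sigma\left(\hat{A} H^{(1)} W^{(1)}\right) \tag{42.b}$$
where $W^{(0)} \in \mathbb{R}^{d_0 \times d_C}$ and $W^{(1)} \in \mathbb{R}^{d_C \times d_2}$ are two trainable weight matrices.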
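The two-layer rule can be sketched in a few lines of NumPy. This is an illustrative forward pass only, not the authors' implementation. The text says only that the normalized ADM is deduced from the degree matrix and the ADM [42], so the sketch assumes the symmetric normalization with self-loops that is common for GCNs; the random weights, the toy graph and the function names are likewise placeholders.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops (assumed form of A -> A_hat)."""
    A_tilde = A + np.eye(A.shape[0])          # add self-loops
    deg = A_tilde.sum(axis=1)                 # diagonal of the degree matrix of A + I
    d_inv_sqrt = 1.0 / np.sqrt(deg)           # deg >= 1, so no division by zero
    return A_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def relu(x):
    return np.maximum(x, 0.0)

def two_layer_gcn(X, A, W0, W1):
    """Forward pass of the two-layer GCN: H1 = relu(A_hat X W0), H2 = relu(A_hat H1 W1)."""
    A_hat = normalize_adjacency(A)
    H1 = relu(A_hat @ X @ W0)
    H2 = relu(A_hat @ H1 @ W1)
    return H2

rng = np.random.default_rng(0)
N, D, d_C, d_2 = 128, 100, 50, 100            # sizes reported in this section
X = rng.normal(size=(N, D))                   # placeholder node features
A = (rng.random((N, N)) < 0.05).astype(float) # toy random graph
A = np.maximum(A, A.T)                        # make it undirected
np.fill_diagonal(A, 0.0)
W0 = rng.normal(scale=0.1, size=(D, d_C))     # untrained placeholder weights
W1 = rng.normal(scale=0.1, size=(d_C, d_2))
H2 = two_layer_gcn(X, A, W0, W1)
print(H2.shape)
```

In a trained model the two weight matrices would be learnt by backpropagation; here they only demonstrate that the shapes compose as the equations require.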
In this breast cancer classification task, the previous Net-0, Net-1, or Net-2 was first used to produce image-level representations of the breast mammogram images. However, a CNN does not take inter-image dependencies into account, so the RARs were learnt by the GCN.
The GCN was combined with CNN models (Net-0 or Net-1 or Net-2). The 𝑆8 in Figure 7 is utilized as the
individual image-level representation 𝐼 ∈ ℝ𝐷 where 𝐷 = 100 in this work. Then, k-means clustering (KMC) is
implemented on the individual image-level representations, and N cluster centroids (CCs) 𝑋 ∈ ℝ𝑁×𝐷 are obtained.
The CC correlation displays the relationships of the images. The ADM 𝐴 ∈ ℝ𝑁×𝑁 is deduced as
$$A_{mn} = \begin{cases} 1 & \text{if } X_m \in KNN(X_n) \lor X_n \in KNN(X_m) \\ 0 & \text{otherwise} \end{cases} \tag{43}$$
where KNN denotes the cosine similarity (CS)-based kNN, whose neighbor number is set to $k_{KNN}$. Figure 8 gives an example in which the three CS-based nearest neighbors of nodes $i$ and $j$ are $KNN(X_i) = (2, 1, j)$ and $KNN(X_j) = (4, 5, 3)$. So, we have $X_j \in KNN(X_i)$ but $X_i \notin KNN(X_j)$. Using the 'or' operation, we conclude $A_{ij} = 1$.
Figure 8: Illustration of a KNN-based ADM. a and b denote the graph and the corresponding ADM generated by
KNN. The ellipses in the top row and leftmost column mean omission, and the ellipses in the matrix mean uncertain
values. ADM: Adjacency matrix; KNN: cosine similarity (CS)-based k-nearest neighbors.
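The KNN-based ADM with the 'or' rule can be sketched as follows. This is a minimal NumPy illustration, assuming that a node is never counted as its own neighbor (consistent with the example in Figure 8, but not stated explicitly); the function name `knn_adjacency` and the random centroids are placeholders.

```python
import numpy as np

def knn_adjacency(X, k):
    """Build a symmetric 0/1 ADM from row-wise node features X using cosine-similarity kNN."""
    unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = unit @ unit.T                         # cosine similarity matrix
    np.fill_diagonal(sim, -np.inf)              # assumption: exclude self-matches
    neighbors = np.argsort(-sim, axis=1)[:, :k] # k most similar nodes per row
    directed = np.zeros(sim.shape, dtype=bool)
    rows = np.repeat(np.arange(X.shape[0]), k)
    directed[rows, neighbors.ravel()] = True    # directed[m, n]: X_n in KNN(X_m)
    return (directed | directed.T).astype(int)  # the 'or' rule makes A symmetric

rng = np.random.default_rng(1)
X = rng.normal(size=(128, 100))                 # N = 128 placeholder centroids, D = 100
A = knn_adjacency(X, k=7)
print(A.sum(axis=1).min())                      # each node keeps at least k links
```

Because of the 'or' rule, a node can end up with more than k links: it is connected both to its own k nearest neighbors and to every node that lists it as a neighbor.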
The node features $X$ and ADM $A$ were passed to a 2L-GCN, and we obtained $H^{(2)} \in \mathbb{R}^{N \times D}$ by setting $d_2 = D = 100$. $H^{(2)}$ was then combined with $I$ via the dot product
$$y = H^{(2)} I \tag{44}$$
followed by a linear projection (LP) with trainable weights $W^{(2)} \in \mathbb{R}^{N \times N_C}$, in which $N_C$ denotes the number of classes:
$$z = y W^{(2)} + W^b \tag{45}$$
where $z \in \mathbb{R}^{N_C}$, and $W^b$ represents the bias. $N_C = 2$ because this is a binary classification problem, i.e., abnormal or normal breast. Hence, we only need to train $(W^{(0)}, W^{(1)}, W^{(2)})$ and the cognate biases for this 2L-GCN. Figure 9 illustrates the flowchart of Net-3, Net-4, and Net-5, which combine the GCN with Net-0, Net-1, and Net-2, respectively.
Figure 9: Flowchart of Net-3, Net-4, and Net-5. LP = Linear Projection; CE = Cross Entropy; When the internal
box chooses Net-0 (Net-1 or Net-2), the whole picture denotes the flowchart of Net-3 (Net-4 or Net-5); Bottom row
shows the CNN pipeline while the top row shows the GCN pipeline, and CNN features and GCN features are finally
combined.
During the inference stage, the CNN representation of each image is obtained first, and its corresponding GCN representation is then obtained from the trained 2L-GCN and the pre-built graph. By combining CNN and 2L-GCN, every image is described by both its image-level features and its neighbor features [43]. In this work, three hyperparameters were set by trial-and-error: $d_C = 50$, $N = 128$, $k_{KNN} = 7$.
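The fusion of one CNN feature with the GCN output (the dot product followed by the linear projection) can be sketched as below. All arrays are random placeholders rather than trained values, and the final softmax is an assumption suggested by the cross-entropy (CE) block in Figure 9; the function names are hypothetical.

```python
import numpy as np

def hybrid_logits(I, H2, W2, b):
    """Fuse one image-level feature I with the 2L-GCN output H2, then project to class scores."""
    y = H2 @ I            # shape (N,): response of the image to each graph node
    z = y @ W2 + b        # shape (N_C,): class scores after the linear projection
    return z

def softmax(z):
    e = np.exp(z - z.max())   # subtract the max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(2)
N, D, N_C = 128, 100, 2
I = rng.normal(size=D)                      # placeholder CNN feature of one image
H2 = np.abs(rng.normal(size=(N, D)))        # placeholder GCN output (non-negative after ReLU)
W2 = rng.normal(scale=0.01, size=(N, N_C))  # untrained projection weights
b = np.zeros(N_C)
p = softmax(hybrid_logits(I, H2, W2, b))
print(p.shape)
```

The intermediate vector has one entry per graph node, which is why the projection matrix has N rows rather than D.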
Medical datasets are often small because of the expense of collecting data and the need to protect patients' privacy. Four types of solutions exist to address the small-size dataset problem (SSDP) [44] and lack of generation (LoG): data generation (DG), data augmentation (DA), regularization, and ensemble approaches (EA). In this study, we utilized a 14-way DA technique due to its ease of implementation [45].
The whole training image set $B$ (see Equation (1)) satisfies $|B(A)| = 63$ and $|B(N)| = 159$, so $|B| = 222$. For each training image $b(t) \in B$, $t = 1, \cdots, 222$, we carried out the following seven DA methods with different DA factors $\beta$. Every DA method generates $V$ simulated images; in this study, $V = 30$.
(i) Noise injection. Zero-mean Gaussian noise with a variance of 0.01 was added to all training images to produce $V$ simulated noised images.
$$\overrightarrow{b_1(t)} = \mathrm{NO}[b(t)] = \left[b_1^{NO}(t), \ldots, b_V^{NO}(t)\right] \tag{46}$$
(ii) Gamma correction (GC). The GC factor $\beta^{GC}$ varied from 0.4 to 1.6 in steps of 0.04, skipping the value of 1.
$$\overrightarrow{b_2(t)} = \mathrm{GC}[b(t)] = \left[b_1^{GC}(t, \beta_1^{GC}), \ldots, b_V^{GC}(t, \beta_V^{GC})\right] \tag{47}$$
where $\beta_1^{GC} = 0.4$, $\beta_2^{GC} = 0.44$, …, $\beta_{15}^{GC} = 0.96$, $\beta_{16}^{GC} = 1.04$, $\beta_{17}^{GC} = 1.08$, …, $\beta_V^{GC} = 1.6$.
(iii) Rotation. The rotation angle vector $\beta^{RN}$ ranged from -30° to 30° in steps of 2°, omitting $\beta^{RN} = 0$.
$$\overrightarrow{b_3(t)} = \mathrm{RN}[b(t)] = \left[b_1^{RN}(t, \beta_1^{RN}), \ldots, b_V^{RN}(t, \beta_V^{RN})\right] \tag{48}$$
where $\beta_1^{RN} = -30°$, $\beta_2^{RN} = -28°$, …, $\beta_{15}^{RN} = -2°$, $\beta_{16}^{RN} = +2°$, $\beta_{17}^{RN} = +4°$, …, $\beta_V^{RN} = +30°$.
(iv) Horizontal shear (HS) transform. $V$ simulated images were produced by the HS transform
$$\overrightarrow{b_4(t)} = \mathrm{HS}[b(t)] = \left[b_1^{HS}(t, \beta_1^{HS}), \ldots, b_V^{HS}(t, \beta_V^{HS})\right] \tag{49}$$
where the HS factors $\beta^{HS}$ vary from -0.15 to 0.15 in steps of 0.01, skipping the value of $\beta^{HS} = 0$: $\beta_1^{HS} = -0.15$, $\beta_2^{HS} = -0.14$, …, $\beta_{15}^{HS} = -0.01$, $\beta_{16}^{HS} = +0.01$, $\beta_{17}^{HS} = +0.02$, …, $\beta_V^{HS} = +0.15$.
(v) Vertical shear (VS) transform. VS transforms were generated similarly to the HS transform. In particular, the VS factors are identical to the HS factors: $\beta_m^{VS} = \beta_m^{HS}$, $\forall m \in 1, 2, \cdots, V$.
$$\overrightarrow{b_5(t)} = \mathrm{VS}[b(t)] = \left[b_1^{VS}(t, \beta_1^{VS}), \ldots, b_V^{VS}(t, \beta_V^{VS})\right] \tag{50}$$
(vi) Scaling. All training images $b(t)$ were shrunk or stretched with scaling factor $\beta^{SC}$ from 0.7 to 1.3 in steps of 0.02, skipping $\beta^{SC} = 1$.
$$\overrightarrow{b_6(t)} = \mathrm{SC}[b(t)] = \left[b_1^{SC}(t, \beta_1^{SC}), \ldots, b_V^{SC}(t, \beta_V^{SC})\right] \tag{51}$$
where the $V$ values of $\beta^{SC}$ are given as: $\beta_1^{SC} = 0.7$, $\beta_2^{SC} = 0.72$, …, $\beta_{15}^{SC} = 0.98$, $\beta_{16}^{SC} = 1.02$, $\beta_{17}^{SC} = 1.04$, …, $\beta_V^{SC} = 1.3$.
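(vii) Random translation (RT). The training image $b(t)$ was translated $V$ times with a random horizontal shift $\varepsilon^x$ and a random vertical shift $\varepsilon^y$. The values of $\varepsilon^x$ and $\varepsilon^y$ lie between $-Y$ and $+Y$ and obey the uniform distribution $\mathcal{U}$.
$$\overrightarrow{b_7(t)} = \mathrm{RT}[b(t)] = \left[b_1^{RT}(t, \varepsilon_1^x, \varepsilon_1^y), \ldots, b_V^{RT}(t, \varepsilon_V^x, \varepsilon_V^y)\right] \tag{52}$$
where $\varepsilon_m^{\theta} \sim \mathcal{U}[-Y, Y]$, $\forall m \in [1, V] \wedge \forall \theta \in \{x, y\}$. We set $Y = 25$ in this study.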
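(viii) Mirror. All the previous DA results were mirrored, so that we get
$$\overrightarrow{b_{n+7}(t)} = \mathcal{N}\left(\overrightarrow{b_n(t)}\right), \forall n \in \{1, 2, \cdots, 7\} \tag{53}$$
where $\mathcal{N}$ represents the mirror function. The raw image $b(t)$ was also mirrored as $\mathcal{N}[b(t)]$.
(ix) Concatenation. All the results were concatenated as
$$\underbrace{\overrightarrow{b^{DA}(t)}}_{422} = \mathrm{concat}\Big\{\underbrace{b(t)}_{1}, \underbrace{\mathcal{N}[b(t)]}_{1}, \underbrace{\overrightarrow{b_1(t)}}_{V}, \cdots, \underbrace{\overrightarrow{b_{14}(t)}}_{V}\Big\} \tag{54}$$
That is, $b(t) \mapsto \overrightarrow{b^{DA}(t)}$, and each training image yields $2 + 14V = 422$ images. The enhanced training set is symbolized as
$$B' = \left\{\overrightarrow{b^{DA}(1)}, \cdots, \overrightarrow{b^{DA}(|B|)}\right\} \tag{55}$$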
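The factor grids and image counts of the 14-way DA described above can be checked with a short script. This is a bookkeeping sketch only: it enumerates the DA factors and counts images, and does not apply any transform to image data. The helper name `factor_grid` is hypothetical.

```python
def factor_grid(start, stop, step, skip):
    """Evenly spaced DA factors from start to stop, omitting the identity value."""
    n_steps = round((stop - start) / step)
    values = [round(start + i * step, 10) for i in range(n_steps + 1)]
    return [v for v in values if abs(v - skip) > 1e-9]

V = 30
grids = {
    "gamma": factor_grid(0.4, 1.6, 0.04, skip=1.0),
    "rotation_deg": factor_grid(-30, 30, 2, skip=0),
    "shear": factor_grid(-0.15, 0.15, 0.01, skip=0.0),   # shared by HS and VS
    "scale": factor_grid(0.7, 1.3, 0.02, skip=1.0),
}
for name, grid in grids.items():
    assert len(grid) == V, name      # every grid holds exactly V factors

per_image = 1 + 1 + 14 * V           # raw + mirrored raw + 14 DA outputs of V images
enhanced_total = 222 * per_image     # 222 training images
print(per_image, enhanced_total)
```

Each grid contains 31 evenly spaced values before the identity value is removed, which is how every method arrives at exactly V = 30 factors.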
3.8 Measures
Table 3 shows the sizes of our training, DA training, and test sets. The algorithm performed $N_W$ runs, where $N_W = 10$ in this study. With run index $w = 1, \cdots, N_W$, the ideal confusion matrix $E^i$ and the realistic confusion matrix $E^r$ over the test set are
$$E^i(w) = \begin{bmatrix} 50 & 0 \\ 0 & 50 \end{bmatrix}, \forall w \in 1, \cdots, N_W \tag{56}$$
$$E^r(w) = \begin{bmatrix} a(w) & b(w) \\ c(w) & d(w) \end{bmatrix}, \quad 0 \le a(w), b(w), c(w), d(w) \le 50, \quad \forall w \in 1, \cdots, N_W \tag{57}$$
where the four variables {𝑎(𝑤), 𝑏(𝑤), 𝑐(𝑤), 𝑑(𝑤)} stand for TP, FN, FP, and TN at w-th run, respectively, in which
P stands for positive class (abnormal breast) and N means negative class (normal breast).
TP means the true positive, i.e., the abnormal breast image is correctly classified as abnormal. TN means true
negative, i.e., the normal breast is correctly classified as normal. FN means the abnormal image is wrongly classified
as normal, and FP means the normal image is wrongly classified as abnormal.
Four simple measures $\{\nu^1(w), \nu^2(w), \nu^3(w), \nu^4(w)\}$ are defined below, where $\nu^1$ means sensitivity, $\nu^2$ specificity, $\nu^3$ precision, and $\nu^4$ accuracy.
$$\nu^1(w) = \frac{a(w)}{a(w) + b(w)} \tag{58.a}$$
$$\nu^2(w) = \frac{d(w)}{c(w) + d(w)} \tag{58.b}$$
$$\nu^3(w) = \frac{a(w)}{a(w) + c(w)} \tag{58.c}$$
$$\nu^4(w) = \frac{a(w) + d(w)}{a(w) + b(w) + c(w) + d(w)} \tag{58.d}$$
where accuracy is an overall indicator, which uses all the four variables: TP, TN, FP, and FN. The range of accuracy
is in [0, 1].
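The four simple measures can be computed directly from the confusion-matrix entries. In the sketch below, the counts are back-calculated from the first Net-0 run reported in Table 5 under the 50-images-per-class test split, so they are an inference from the published percentages rather than values stated in the text. The function name is hypothetical.

```python
def basic_measures(a, b, c, d):
    """Sensitivity, specificity, precision, accuracy from TP=a, FN=b, FP=c, TN=d."""
    sensitivity = a / (a + b)
    specificity = d / (c + d)
    precision = a / (a + c)
    accuracy = (a + d) / (a + b + c + d)
    return sensitivity, specificity, precision, accuracy

# Counts back-calculated from run 1 of Net-0 in Table 5 (50 images per class).
a, b, c, d = 42, 8, 3, 47
percentages = [round(100 * v, 2) for v in basic_measures(a, b, c, d)]
print(percentages)   # [84.0, 94.0, 93.33, 89.0]
```

The printed values match the first Net-0 row of Table 5 for the first four indicators.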
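The F1 score $\nu^5(w)$ is
$$\nu^5(w) = \frac{2 \times a(w)}{2 \times a(w) + b(w) + c(w)} \tag{59}$$
The Matthews correlation coefficient (MCC) $\nu^6(w)$ is
$$\nu^6(w) = \frac{a(w) \times d(w) - b(w) \times c(w)}{\sqrt{\eta(w)}} \tag{60}$$
where $\eta(w)$ is a transient variable, $\eta(w) = [c(w) + a(w)] \times [a(w) + b(w)] \times [d(w) + c(w)] \times [d(w) + b(w)]$. MCC can be regarded as a correlation coefficient, and the values of +1, 0, and -1 stand for perfect prediction, random guess, and totally wrong prediction, respectively:
$$\nu^6(w) = \begin{cases} +1 & \text{perfect prediction} \\ 0 & \text{random guess} \\ -1 & \text{totally wrong prediction} \end{cases} \tag{61}$$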
The Fowlkes–Mallows index (FMI) $\nu^7(w)$ is:
$$\nu^7(w) = \sqrt{\frac{a(w)}{a(w) + c(w)} \times \frac{a(w)}{a(w) + b(w)}} \tag{62}$$
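The three composite indicators can be sketched in the same way. MCC is taken here in its standard form, (ad - bc) divided by the square root of the transient variable η, which is consistent with the values reported in Table 5. The counts are again back-calculated from the first Net-0 run, and the sketch does not guard against η = 0. The function name is hypothetical.

```python
import math

def composite_measures(a, b, c, d):
    """F1 score, MCC and FMI from TP=a, FN=b, FP=c, TN=d."""
    f1 = 2 * a / (2 * a + b + c)
    eta = (c + a) * (a + b) * (d + c) * (d + b)   # transient variable eta
    mcc = (a * d - b * c) / math.sqrt(eta)        # undefined when eta == 0
    fmi = math.sqrt(a / (a + c) * a / (a + b))    # geometric mean of precision and sensitivity
    return f1, mcc, fmi

# Counts back-calculated from run 1 of Net-0 in Table 5 (50 images per class).
f1, mcc, fmi = composite_measures(42, 8, 3, 47)
print([round(100 * v, 2) for v in (f1, mcc, fmi)])   # [88.42, 78.39, 88.54]
```

The output reproduces the last three entries of the first Net-0 row in Table 5.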
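$$\mathrm{Mean}(\nu^m) = \frac{1}{N_W} \times \sum_{w=1}^{N_W} \nu^m(w) \tag{63}$$
$$\mathrm{SD}(\nu^m) = \sqrt{\frac{1}{N_W - 1} \times \sum_{w=1}^{N_W} \left[\nu^m(w) - \mathrm{Mean}(\nu^m)\right]^2} \tag{64}$$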
The final performance over the $N_W$ runs is reported in the Mean ± SD format, abbreviated as MSD. In this study, we used seven indicators in total. Some indicators may be more important than others in isolated conditions. In general, we believe $\nu^4$ and $\nu^6$ are the most important indicators, since they consider all four variables $\{a(w), b(w), c(w), d(w)\}$ in the confusion matrix. Considering their ranges, we found that $\nu^6$ (MCC) has a larger range than $\nu^4$ (see Figure 10), so we utilized $\nu^6$ as the main distinguishing factor for selecting the best model.
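As a worked example of the MSD format, the snippet below recomputes the mean and sample SD of the Net-0 sensitivity from the ten per-run values listed in Table 5, using Python's standard `statistics` module. The sample SD (divisor of one less than the number of runs) is what reproduces the published figure.

```python
import statistics

# Sensitivity (%) of Net-0 over the 10 runs, copied from Table 5.
sens = [84.0, 88.0, 92.0, 92.0, 90.0, 94.0, 92.0, 88.0, 92.0, 90.0]

mean = statistics.mean(sens)
sd = statistics.stdev(sens)          # sample SD: divides by (number of runs - 1)
print(f"{mean:.2f} ± {sd:.2f}")      # 90.20 ± 2.90
```

Using the population SD (`statistics.pstdev`) instead would give a smaller value, so the choice of divisor matters when comparing against the tables.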
3.9 Proposed Algorithm
Table 4 itemizes the pseudocode of proposed algorithm, entailing one input, five phases, and one output. Phase I
introduces the procedures of preprocessing. Phase II explains the detailed steps of constructing the six proposed
models. Phase III presents the complete procedures of 𝑁𝑊 runs (𝑁𝑊 = 10 in this study) over the test set. Phase IV
shows the criterion to choose the best model, and Phase V validates the effectiveness of DA.
end
Enhanced training set: $B'(w) = \{\overrightarrow{b(t, w)} \mid t = 1, \cdots, 222\}$. See Eq. (55).
Enhanced training set labels: 𝐺[𝐵′(𝑤)].
Step III.C Model Training
for 𝑛 = 0: 5 % n is the model index
Create an initial model 𝑁𝑒𝑡-(𝑛);
Train 𝑁𝑒𝑡-(𝑛) with enhanced training set 𝐵′(𝑤);
Use cost matrix 𝐸𝑐𝑜𝑠𝑡 . See Eq. (2).
𝑒12 = |𝐵(𝑁)|/|𝐵(𝐴)| = 159/63 = 2.52.
Record the trained model 𝑁(𝑛, 𝑤) as
𝑁(𝑛, 𝑤) = trainnetwork{𝑁𝑒𝑡-(𝑛), 𝐵′ (𝑤), 𝐺[𝐵′ (𝑤)]}.
end
Step III.D Test confusion matrix
Test Set: 𝐶(𝑤) and its labels 𝐺[𝐶(𝑤)].
Test prediction 𝑃𝑟𝑒𝑑(𝑛, 𝑤)
𝑃𝑟𝑒𝑑(𝑛, 𝑤) = predict[𝑁(𝑛, 𝑤), 𝐶(𝑤)];
Test performance on model Net-(n) at w-th run 𝐸 𝑟 (𝑛, 𝑤)
$E^r(n, w) = \mathrm{compare}[Pred(n, w), G[C(w)]]$. See Eq. (56).
Step III.E Indicator Evaluation
for 𝑚 = 1: 7 % m is the indicator index
for 𝑛 = 0: 5 % n is model index
Extract {𝑎(𝑛, 𝑤), 𝑏(𝑛, 𝑤), 𝑐(𝑛, 𝑤), 𝑑(𝑛, 𝑤)} from confusion matrix 𝐸 𝑟 (𝑛, 𝑤). See Eq. (57)
Calculate indicator 𝜈 𝑚 (𝑛, 𝑤). See Eq. (58.a)-(62).
end
end
end
Phase IV: Compare and select the best model
Calculate MSD of each network model.
$\nu^m(n) = \sum_{w=1}^{N_W} \nu^m(n, w)$.
Then calculate mean and SD of performances of Net-(n). See Eqs. (63)(64).
Select the best model 𝑛∗ in terms of MCC 𝜈 6 , i.e., set 𝑚 = 6.
𝑛∗ = argmax[𝜈 6 (𝑛)].
Phase V: Validate the effectiveness of DA
Create optimal model without DA 𝑁𝑒𝑡-(𝑛∗ )-𝑁𝐷𝐴
by repeating a modified Step III, where the modification is at Step III.C: “Train $Net$-$(n)$ with raw training set $B(w)$”
Compare 𝑁𝑒𝑡-(𝑛∗ ) against 𝑁𝑒𝑡-(𝑛∗ )-𝑁𝐷𝐴
Output: The best model
𝑁𝑒𝑡-(𝑛∗ ).
Table 5 gives the results of 10 runs using Net-0, Net-1, and Net-2. Net-0 is the base CNN, Net-1 is BD-CNN, and Net-2 is BDR-CNN. Net-0 (the basic 8-layer CNN model consisting of 6 conv layers and 2 fully connected layers) yielded the following seven performance values: $\nu^1_{Net0} = 90.20 \pm 2.90$, $\nu^2_{Net0} = 91.00 \pm 1.41$, $\nu^3_{Net0} = 90.95 \pm 1.23$, $\nu^4_{Net0} = 90.60 \pm 1.26$, $\nu^5_{Net0} = 90.54 \pm 1.41$, $\nu^6_{Net0} = 81.25 \pm 2.48$, $\nu^7_{Net0} = 90.56 \pm 1.39$. Note that the definitions of $\nu$ can be found in Section 3.8.
Table 5: Comparison among Net-0, Net-1, and Net-2 (all values are percentages)
Net-0 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 84.00 94.00 93.33 89.00 88.42 78.39 88.54
2 88.00 92.00 91.67 90.00 89.80 80.06 89.81
3 92.00 90.00 90.20 91.00 91.09 82.02 91.09
4 92.00 92.00 92.00 92.00 92.00 84.00 92.00
5 90.00 90.00 90.00 90.00 90.00 80.00 90.00
6 94.00 92.00 92.16 93.00 93.07 86.02 93.07
7 92.00 90.00 90.20 91.00 91.09 82.02 91.09
8 88.00 90.00 89.80 89.00 88.89 78.02 88.89
9 92.00 90.00 90.20 91.00 91.09 82.02 91.09
10 90.00 90.00 90.00 90.00 90.00 80.00 90.00
MSD 90.20± 2.90 91.00± 1.41 90.95± 1.23 90.60± 1.26 90.54± 1.41 81.25± 2.48 90.56± 1.39
Net-1 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 94.00 94.00 94.00 94.00 94.00 88.00 94.00
2 90.00 90.00 90.00 90.00 90.00 80.00 90.00
3 92.00 94.00 93.88 93.00 92.93 86.02 92.93
4 92.00 92.00 92.00 92.00 92.00 84.00 92.00
5 94.00 94.00 94.00 94.00 94.00 88.00 94.00
6 92.00 92.00 92.00 92.00 92.00 84.00 92.00
7 94.00 90.00 90.38 92.00 92.16 84.07 92.17
8 90.00 92.00 91.84 91.00 90.91 82.02 90.91
9 96.00 90.00 90.57 93.00 93.20 86.16 93.24
10 92.00 94.00 93.88 93.00 92.93 86.02 92.93
MSD 92.60± 1.90 92.20± 1.75 92.25± 1.60 92.40± 1.26 92.41± 1.28 84.83± 2.54 92.42± 1.28
Net-2 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 94.00 94.00 94.00 94.00 94.00 88.00 94.00
2 92.00 90.00 90.20 91.00 91.09 82.02 91.09
3 96.00 96.00 96.00 96.00 96.00 92.00 96.00
4 92.00 96.00 95.83 94.00 93.88 88.07 93.90
5 94.00 94.00 94.00 94.00 94.00 88.00 94.00
6 96.00 92.00 92.31 94.00 94.12 88.07 94.14
7 92.00 94.00 93.88 93.00 92.93 86.02 92.93
8 96.00 94.00 94.12 95.00 95.05 90.02 95.05
9 92.00 92.00 92.00 92.00 92.00 84.00 92.00
10 96.00 96.00 96.00 96.00 96.00 92.00 96.00
MSD 94.00± 1.89 93.80± 1.99 93.83± 1.90 93.90± 1.60 93.91± 1.59 87.82± 3.19 93.91± 1.59
For Net-1, the performances improved to $\nu^1_{Net1} = 92.60 \pm 1.90$, $\nu^2_{Net1} = 92.20 \pm 1.75$, $\nu^3_{Net1} = 92.25 \pm 1.60$, $\nu^4_{Net1} = 92.40 \pm 1.26$, $\nu^5_{Net1} = 92.41 \pm 1.28$, $\nu^6_{Net1} = 84.83 \pm 2.54$, $\nu^7_{Net1} = 92.42 \pm 1.28$. Comparing the results of the base CNN (Net-0) and BD-CNN (Net-1), we can see the effectiveness of dropout and BN.
Additionally, Net-2 yielded performances of $\nu^1_{Net2} = 94.00 \pm 1.89$, $\nu^2_{Net2} = 93.80 \pm 1.99$, $\nu^3_{Net2} = 93.83 \pm 1.90$, $\nu^4_{Net2} = 93.90 \pm 1.60$, $\nu^5_{Net2} = 93.91 \pm 1.59$, $\nu^6_{Net2} = 87.82 \pm 3.19$, $\nu^7_{Net2} = 93.91 \pm 1.59$. Comparing all indicators between BD-CNN (Net-1) and BDR-CNN (Net-2), we observe that RSP gives a higher mean on every indicator than the MP used in Net-1.
Next, we compared the performance with and without GCN. The results of the six models are shown in Table 6. In this part, Net-0 (base CNN), Net-1 (BD-CNN), and Net-2 (BDR-CNN) did not use GCN, while Net-3 (CNN-GCN), Net-4 (BD-CNN-GCN), and Net-5 (BDR-CNN-GCN) added GCN to the corresponding base networks (see Table 1). Figure 10 displays the SD of the six models.
Model 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
Net-0 90.20± 2.90 91.00± 1.41 90.95± 1.23 90.60± 1.26 90.54± 1.41 81.25± 2.48 90.56± 1.39
Net-1 92.60± 1.90 92.20± 1.75 92.25± 1.60 92.40± 1.26 92.41± 1.28 84.83± 2.54 92.42± 1.28
Net-2 94.00± 1.89 93.80± 1.99 93.83± 1.90 93.90± 1.60 93.91± 1.59 87.82± 3.19 93.91± 1.59
Net-3 91.60± 2.63 92.00± 2.67 92.03± 2.37 91.80± 1.69 91.78± 1.69 83.66± 3.36 91.80± 1.69
Net-4 94.60± 2.99 93.80± 1.75 93.89± 1.51 94.20± 1.32 94.21± 1.39 88.47± 2.64 94.23± 1.39
Net-5 96.20± 2.90 96.00± 2.31 96.06± 2.14 96.10± 1.60 96.10± 1.61 92.27± 3.17 96.11± 1.60
Comparing CNN-GCN (Net-3) against the base CNN (Net-0), we can see that adding GCN improves all seven indicators. The same pattern is observed when comparing BD-CNN-GCN (Net-4) against BD-CNN (Net-1), and BDR-CNN-GCN (Net-5) against BDR-CNN (Net-2). GCN enhances the performance because it can learn the RARs among the test samples. In our experiments, classifiers with GCNs therefore provided better results than those without GCNs.
Additionally, Table 6 and Figure 10 determine the optimal $n^* = 5$, which indicates that Net-5 (BDR-CNN-GCN) attained the best results (including MCC) among all our networks. This was expected, because Net-5, i.e., BDR-CNN-GCN, is the combination of GCN and the best model without GCN (Net-2, viz., BDR-CNN).
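The model-selection rule (pick the model with the highest mean MCC) and the gain contributed by GCN can be illustrated with the mean MCC values from Table 6. This is a small arithmetic sketch on the published means, not a re-run of any experiment.

```python
# Mean MCC (%) of each model over 10 runs, copied from Table 6.
mean_mcc = {0: 81.25, 1: 84.83, 2: 87.82, 3: 83.66, 4: 88.47, 5: 92.27}

best = max(mean_mcc, key=mean_mcc.get)   # argmax over the model index n

# MCC gain in percentage points from adding GCN: Net-(n+3) versus Net-(n).
gcn_gain = {n: round(mean_mcc[n + 3] - mean_mcc[n], 2) for n in range(3)}
print(best, gcn_gain)
```

The gain from GCN grows with the strength of the underlying CNN in these numbers, although the SDs in Table 6 are of comparable size to the gains, so the ordering should be read with that spread in mind.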
Table 7 shows the detailed results of each run of our proposed Net-5 BDR-CNN-GCN model. As is shown, $\nu^1_{Net5} = 96.20 \pm 2.90$, $\nu^2_{Net5} = 96.00 \pm 2.31$, $\nu^3_{Net5} = 96.06 \pm 2.14$, $\nu^4_{Net5} = 96.10 \pm 1.60$, $\nu^5_{Net5} = 96.10 \pm 1.61$, $\nu^6_{Net5} = 92.27 \pm 3.17$, $\nu^7_{Net5} = 96.11 \pm 1.60$. As expected, the performance of Net-5 BDR-CNN-GCN was the best among all six proposed network models (see Table 6).
Table 7 Performance of proposed Net-5 (all values are percentages)
Run 𝜈1 𝜈2 𝜈3 𝜈4 𝜈5 𝜈6 𝜈7
1 92.00 98.00 97.87 95.00 94.85 90.16 94.89
2 96.00 96.00 96.00 96.00 96.00 92.00 96.00
3 96.00 96.00 96.00 96.00 96.00 92.00 96.00
4 92.00 98.00 97.87 95.00 94.85 90.16 94.89
5 94.00 92.00 92.16 93.00 93.07 86.02 93.07
6 96.00 98.00 97.96 97.00 96.97 94.02 96.97
7 98.00 96.00 96.08 97.00 97.03 94.02 97.03
8 98.00 96.00 96.08 97.00 97.03 94.02 97.03
9 100.00 98.00 98.04 99.00 99.01 98.02 99.01
10 100.00 92.00 92.59 96.00 96.15 92.30 96.23
MSD 96.20± 2.90 96.00± 2.31 96.06± 2.14 96.10± 1.60 96.10± 1.61 92.27± 3.17 96.11± 1.60
Next, we compared the proposed BDR-CNN-GCN (Net-5) with 15 state-of-the-art methods: SVM [5], HBP [6],
SWSVM [7], WFrFT [8], TS+MIP [9], Jaya [10], IBBO [11], WEE [12], SVM-db [13], WE [14], HRD [15], DN201
[16], AN [17] , GN [17], and RSPNN [18]. The comparative results and plots are shown in Table 8 and Figure 11,
respectively. Overall, we observed that our proposed BDR-CNN-GCN (Net-5) model is superior to all 15 state-of-the-
art approaches.
DN201 [16] 94.58 91.67 n/a 92.73 n/a n/a n/a
AN [17] 89.80 90.80 90.82 90.30 90.25 80.70 90.28
GN [17] 91.80 92.60 92.58 92.20 92.16 84.44 92.18
RSPNN [18] 93.40 94.60 94.53 94.00 93.96 88.01 93.97
Net-5 (Ours) 96.20±2.90 96.00±2.31 96.06±2.14 96.10±1.60 96.10±1.61 92.27±3.17 96.11±1.60
(n/a means not available)
Figure 11 Comparison plot: (𝝂𝟏 : sensitivity; 𝝂𝟐 : specificity; 𝝂𝟑 : precision; 𝝂𝟒 : accuracy; 𝝂𝟓 : F1 score; 𝝂𝟔 :
Matthews correlation coefficient; 𝝂𝟕 : Fowlkes–Mallows index; SVM: support vector machine; HBP: Hybridization
of Biogeography-based optimization and Particle swarm optimization; SWSVM: spherical wavelet and SVM; WFrFT:
weighted-type fractional Fourier transform; TS: thin-plate spline; MIP: maximum intensity projection; IBBO:
improved biogeography-based optimization; WEE: wavelet energy entropy; db: Daubechies; WE: wavelet energy; HRD:
Hybrid model of Radial basis function network and Decision tree; DN201: DenseNet-201; AN: AlexNet; GN:
GoogleNet; RSPNN: rank-based stochastic pooling neural network)
Our proposed Net-5 (BDR-CNN-GCN) gives the best performance among the compared approaches for three likely reasons: (i) traditional SVM [5], HBP [6], SWSVM [7], WFrFT [8], TS+MIP [9], Jaya [10], IBBO [11], WEE [12], SVM-db [13], WE [14], and HRD [15] used a combination of manual feature extraction and simple classification models. Their feature extraction methods are not guaranteed to extract task-specific features efficiently, and their classification models may lack the capacity needed for challenging decision-making tasks. (ii) DN201 [16], AN [17], and GN [17] used recent transfer learning techniques; however, they did not fine-tune the transfer learning configuration. (iii) While RSPNN [18] used RSP and the parametric rectified linear unit, this method only focuses on learning individual image-level representations and does not consider the relationships between input images.
We created a modification of Net-5 (BDR-CNN-GCN) by removing DA; the modified network is called $Net$-$(5)$-$NDA$, where NDA means no DA. The performance of $Net$-$(5)$-$NDA$ is listed in Table 9. Compared to $Net$-$(5)$ using DA, the performance of $Net$-$(5)$-$NDA$ decreased to $\nu^1_{Net5NDA} = 93.40 \pm 2.67$, $\nu^2_{Net5NDA} = 93.00 \pm 1.05$, $\nu^3_{Net5NDA} = 93.04 \pm 0.95$, $\nu^4_{Net5NDA} = 93.20 \pm 1.32$, $\nu^5_{Net5NDA} = 93.20 \pm 1.40$, $\nu^6_{Net5NDA} = 86.44 \pm 2.62$, and $\nu^7_{Net5NDA} = 93.21 \pm 1.40$. Thus, the comparison validates the need for DA in our algorithm. The benefits of using DA are two-fold: (i) this 14-way DA generates more data from the limited training set, and (ii) it helps to reduce overfitting.
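The size of the drop caused by removing DA can be read off by subtracting the two sets of published means. The snippet below does only that arithmetic on the Net-5 values with and without DA; the indicator names follow Section 3.8.

```python
INDICATORS = ["sensitivity", "specificity", "precision", "accuracy", "F1", "MCC", "FMI"]
with_da = [96.20, 96.00, 96.06, 96.10, 96.10, 92.27, 96.11]      # Net-5 with DA
without_da = [93.40, 93.00, 93.04, 93.20, 93.20, 86.44, 93.21]   # Net-(5)-NDA

# Drop in percentage points for each indicator when DA is removed.
drop = {name: round(a - b, 2) for name, a, b in zip(INDICATORS, with_da, without_da)}
largest = max(drop, key=drop.get)
print(drop)
print(largest)
```

MCC shows the largest drop, which is in line with the earlier observation that MCC spans a wider range than accuracy and therefore separates models more clearly.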
10 96.00 92.00 92.31 94.00 94.12 88.07 94.14
MSD 93.40±2.67 93.00±1.05 93.04±0.95 93.20±1.32 93.20±1.40 86.44±2.62 93.21±1.40
To improve on previously developed AI methods, our study proposed six network models for abnormal breast detection in mammograms. The experiments showed that the proposed Net-5 model (BDR-CNN-GCN) attained the best results among all six proposed networks, and also outperformed 15 state-of-the-art methods. BDR-CNN-GCN is a combination of a BDR-CNN model and a 2L-GCN model. Here, BDR-CNN helps to extract image-level features, whereas GCN helps to extract relation-aware features. Both types of features were learnt during training. The combination of the two networks helps to increase the performance of our Net-5 model. Further, we justified the necessity of the 14-way DA.
The application of our method in hospitals is promising. From the experimental results, it appears that our Net-5 system can aid decision making when diagnosing breast cancer using mammograms. Furthermore, our Net-5 (BDR-CNN-GCN) could be improved by integration with classifiers developed by other teams. In addition, our algorithms have the potential to be re-deployed to a new hospital’s server. Using cloud-computing-based apps could further improve speed and reduce costs.
A shortcoming of Net-5 (BDR-CNN-GCN) is that, while it achieves high accuracy when interrogating mammographic data, it cannot reliably interrogate heterogeneous data (such as thermograms, patient history, heart rate, etc.). Hence, we propose that our Net-5 model be used to aid radiologists in diagnosing breast cancer using mammograms. Given the limited size of our dataset, this study provides only proof-of-concept evidence for the feasibility of our Net-5 model; further optimization on much larger datasets is required before our algorithm can be implemented in any clinical studies. The clinical implication of this research is limited because our study was purely focused on AI method design and algorithm verification. Additional work is currently ongoing to test our Net-5 model on larger datasets (such as the DDSM and Optimam datasets), with a view to integrating it into large clinical studies.
Future research directions include the following: (1) enlarge the dataset and test the proposed AI model on breast mammogram images from different sources and of different resolutions; (2) assess other mechanisms for combining GCN and CNN; (3) develop a deeper GCN and analyze whether a GCN with more than two layers improves the classification results; (4) test other recent DA methods.
Acknowledgement
This paper is partially supported by British Heart Foundation Accelerator Award, UK; Hope Foundation for
Cancer Research, UK (RM60G0680); Royal Society International Exchanges Cost Share Award, UK (RP202G0230);
Medical Research Council Confidence in Concept Award, UK (MC_PC_17171); MINECO/JUNTA/FEDER,
Spain/Regional/Europe (RTI2018-098913-B100, CV2045250, A-TIC-080-UGR18); Guangxi Key Laboratory of
Trusted Software (kx201901); Fundamental Research Funds for the Central Universities (CDLS-2020-03); Key
Laboratory of Child Development and Learning Science (Southeast University), Ministry of Education.
Appendix
FrFT Fractional Fourier transform
GCN Graph convolutional network
HF Homomorphic filtering
KMC k-means clustering
kNN k-nearest neighbor
L2P l2-norm pooling
LCIS Lobular carcinoma in situ
LoG Lack of generation
LP Linear projection
MI Mammogram image
MN Multiplicative noise
MP Max pooling
MSD Mean ± SD
NLAF Nonlinear activation function
NLDS Nonlinear downsampling
NN Neural Network
PEM Pectoral muscle
PL Pooling layer
PM Population mean
PV Population variance
RAP Rank-based average pooling
RAR Relation-aware representation
RM Rank matrix
RP Rank-based pooling
ROI Region of interest
RSP Rank-based stochastic pooling
RVF Row-vector format
RWP Rank-based weighted pooling
SD Standard deviation
SLW Size of learnable weight
SM Spiculated mass
SSBC Squared-sized and breast-centered
SSDP Small-size dataset problem
WEE Wavelet energy entropy
References
[1] A. Ghanbari, P. Rahmatpour, N. Hosseini, and M. Khalili, "Social Determinants of Breast Cancer Screening among Married Women: A Cross-
Sectional Study," Journal of Research in Health Sciences, vol. 20, p. 5, Article ID: e00467, Win, 2020.
[2] H. Peiris, L. Mudduwa, N. Thalagala, K. Jayatilake, U. Ekanayake, and J. Horadugoda, "Nottingham grade; does it influence the survival of
operable breast cancer patients across all TNM stages?," Annals of Oncology, vol. 26, pp. 20-20, Dec, 2015.
[3] J. Demb, L. Abraham, D. L. Miglioretti, B. L. Sprague, E. S. O'Meara, S. Advani, et al., "Screening Mammography Outcomes: Risk of Breast
Cancer and Mortality by Comorbidity Score and Age," JNCI-Journal of the National Cancer Institute, vol. 112, pp. 599-606, Article ID:
[4] Y.-D. Zhang, Z. Dong, S.-H. Wang, X. Yu, X. Yao, Q. Zhou, et al., "Advances in multimodal data fusion in neuroimaging: Overview, challenges,
and novel orientation," Information Fusion, vol. 64, pp. 149-187, 2020/12/01/, 2020.
[5] M. Milosevic, D. Jankovic, and A. Peulic, "Comparative analysis of breast cancer detection in mammograms and thermograms," Biomedical
[6] K. Nakamura, "Abnormal Breast Detection Via Combination of Particle Swarm Optimization and Biogeography-Based Optimization,"
[7] P. Gorgel, A. Sertbas, and O. N. Ucan, "Computer-aided classification of breast masses in mammogram images based on spherical wavelet
transform and support vector machines," Expert Systems, vol. 32, pp. 155-164, Feb, 2015.
[8] G. Liu, "Computer-aided diagnosis of abnormal breasts in mammogram images by weighted-type fractional Fourier transform," Advances in
[9] S. N. Yang, F. J. Li, Y. H. Liao, Y. S. Chen, W. C. Shen, and T. C. Huang, "Identification of Breast Cancer Using Integrated Information from
MRI and Mammography," Plos One, vol. 10, Article ID: e0128404, Jun, 2015.
[10] R. V. Rao, "Abnormal Breast Detection in Mammogram Images by Feed-forward Neural Network trained by Jaya Algorithm," Fundamenta
[11] X. Wu, "Smart detection on abnormal breasts in digital mammography based on contrast-limited adaptive histogram equalization and chaotic
adaptive real-coded biogeography-based optimization," Simulation, vol. 92, pp. 873-885, September 12, 2016, 2016.
[12] Y. Chen, "Wavelet energy entropy and linear regression classifier for detecting abnormal breasts," Multimedia Tools and Applications, vol. 77,
[13] F. Liu and M. Brown, "Breast Cancer Recognition by Support Vector Machine Combined with Daubechies Wavelet Transform and Principal
Component Analysis," Lecture Notes in Computational Vision and Biomechanics, vol. 30, pp. 1921-1930, 2019.
[14] Z.-W. Guo, "Breast cancer detection via wavelet energy and support vector machine," in 27th IEEE International Conference on Robot and
[15] A. Suresh, R. Udendhran, and M. Balamurgan, "Hybridized neural network and decision tree based classifier for prognostic decision making
in breast cancers," Soft Computing, vol. 24, pp. 7947-7953, Jun, 2020.
[16] X. Yu, "Utilization of DenseNet201 for diagnosis of breast abnormality," Machine Vision and Applications, vol. 30, pp. 1135-1144, 2019/10/01,
2019.
[17] R. K. Samala, H. P. Chan, L. M. Hadjiiski, M. A. Helvie, and C. D. Richter, "Generalization error analysis for deep convolutional neural
network with transfer learning in breast cancer diagnosis," Physics in Medicine and Biology, vol. 65, p. 13, Article ID: 105002, May, 2020.
[18] C. Pan, "Abnormal breast identification by nine-layer convolutional neural network with parametric rectified linear unit and rank-based
stochastic pooling," Journal of Computational Science, vol. 27, pp. 57-68, 2018.
[19] Z. Li, Z. Zhang, J. Qin, Z. Zhang, and L. Shao, "Discriminative fisher embedding dictionary learning algorithm for object recognition," IEEE
[20] Z. Zhang, Z. Lai, Z. Huang, W. K. Wong, G.-S. Xie, L. Liu, et al., "Scalable supervised asymmetric hashing with semantic and latent factor
embedding," IEEE Transactions on Image Processing, vol. 28, pp. 4803-4818, 2019.
[21] Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao, "Binary multi-view clustering," IEEE transactions on pattern analysis and machine
[22] J. Wen, Z. Zhang, Z. Zhang, L. Fei, and M. Wang, "Generalized Incomplete Multiview Clustering With Flexible Locality Structure Diffusion,"
[23] Z. Zhang, L. Liu, Y. Luo, Z. Huang, F. Shen, H. T. Shen, et al. (2020). Inductive Structure Consistent Hashing via Flexible Semantic Calibration.
[24] H. T. Shen, X. Zhu, Z. Zhang, S.-H. Wang, Y. Chen, X. Xu, et al., "Heterogeneous data fusion for predicting mild cognitive impairment
[25] K. George, P. Sankaran, and K. P. Joseph, "Computer assisted recognition of breast cancer in biopsy images via fusion of nucleus-guided deep
convolutional features," Computer Methods and Programs in Biomedicine, vol. 194, p. 11, Article ID: 105531, Oct, 2020.
[26] X. Y. Zheng, Z. Yao, Y. N. Huang, Y. Y. Yu, Y. Wang, Y. B. Liu, et al., "Deep learning radiomics can predict axillary lymph node status in
early-stage breast cancer," Nature Communications, vol. 11, p. 9, Article ID: 1236, Mar, 2020.
[27] L. Alzubaidi, O. Al-Shamma, M. A. Fadhel, L. Farhan, J. L. Zhang, and Y. Duan, "Optimizing the Performance of Breast Cancer Classification
by Employing the Same Domain Transfer Learning from Hybrid Deep Convolutional Neural Network Model," Electronics, vol. 9, p. 21,
[28] Y. Li, Y. Luo, and Z. Huang, "Graph-based relation-aware representation learning for clothing matching," in Australasian Database
[29] I. Fukunaga, R. Sawada, T. Shibata, K. Kaitoh, Y. Sakai, and Y. Yamanishi, "Prediction of the Health Effects of Food Peptides and Elucidation
of the Mode-of-action Using Multi-task Graph Convolutional Neural Network," Molecular Informatics, vol. 39, p. 10, Jan, 2020.
[31] R. McBride, K. Wang, Z. Y. Ren, and W. Y. Li, "Cost-Sensitive Learning to Rank," in Thirty-Third AAAI Conference on Artificial
Intelligence / Thirty-First Innovative Applications of Artificial Intelligence Conference / Ninth Aaai Symposium on Educational Advances in
[32] J. M. Górriz, "Artificial intelligence within the interplay between natural and artificial computation: Advances in data science, trends and
[33] D. R. Nayak, D. Das, R. Dash, S. Majhi, and B. Majhi, "Deep extreme learning machine with leaky rectified linear unit for multiclass
classification of pathological brain images," Multimedia Tools and Applications, vol. 79, pp. 15381-15396, Jun, 2020.
[34] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from
Overfitting," Journal of Machine Learning Research, vol. 15, pp. 1929-1958, Jun, 2014.
[35] Y. Furusho and K. Ikeda, "Theoretical analysis of skip connections and batch normalization from generalization and optimization
perspectives," Apsipa Transactions on Signal and Information Processing, vol. 9, p. 7, Article ID: e9, 2020.
[36] J. Hong, "Sensorineural hearing loss identification via nine-layer convolutional neural network with batch normalization and dropout,"
[37] C. Garbin, X. Q. Zhu, and O. Marques, "Dropout vs. batch normalization: an empirical study of their impact to deep learning," Multimedia
[38] M. Rezaei, H. Yang, and C. Meinel, "Deep Neural Network with l2-Norm Unit for Brain Lesions Detection," in International Conference on
[39] Z. L. Shi, Y. D. Ye, and Y. P. Wu, "Rank-based pooling for deep convolutional neural networks," Neural Networks, vol. 83, pp. 21-31, Nov,
2016.
[40] X. Seti, A. Wumaier, T. Yibulayin, D. Paerhati, L. L. Wang, and A. Saimaiti, "Named-Entity Recognition in Sports Field Based on a Character-
Level Graph Convolutional Network," Information, vol. 11, p. 16, Article ID: 30, Jan, 2020.
[41] T. Derr, Y. Ma, W. Q. Fan, X. R. Liu, C. Aggarwal, J. L. Tang, et al., "Epidemic Graph Convolutional Network," in 13th International
Conference on Web Search and Data Mining, Houston, TX, 2020, pp. 160-168.
[42] T. N. Kipf and M. Welling, "Semi-supervised classification with graph convolutional networks," presented at the International Conference on
Learning Representations (ICLR), Palais des Congrès Neptune, Toulon, France, 2017.
[43] J. Shi, R. Wang, Y. Zheng, Z. Jiang, and L. Yu, "Graph Convolutional Networks for Cervical Cell Classification," in Second MICCAI Workshop
[44] J. Lee, J. Kim, and W. Ko, "Day-Ahead Electric Load Forecasting for the Residential Building with a Small-Size Dataset Based on a Self-
Organizing Map and a Stacking Ensemble Learning Method," Applied Sciences-Basel, vol. 9, p. 19, Article ID: 1231, Mar, 2019.
[45] S.-H. Wang. (2020). Covid-19 Classification by FGCNet with Deep Feature Fusion from Graph Convolutional Network and Convolutional