Machine Learning Based Intelligent System For Breast Cancer Prediction (MLISBCP)

Keywords: Breast Cancer; Machine Learning; K-Means SMOTE; Boruta

Abstract: Risks of death from Breast Cancer (BC) have been rising drastically in recent years. The diagnosis of breast cancer is time-consuming due to the limited availability of diagnostic systems such as dynamic MRI, X-rays etc. Early detection and diagnosis of breast cancer significantly impacts life expectancy, as current medical technologies are not advanced enough to treat patients in later stages effectively. Even though researchers have created many expert systems for early detection of BC, such as WNBC, the AR + NN system, AdaBoost ELM etc., most expert systems frequently lack adequate handling of the class imbalance problem, proper data pre-processing, and systematic feature selection. To overcome these limitations, this work proposes an expert system named “Machine Learning Based Intelligent System for Breast Cancer Prediction (MLISBCP)” for better prediction of breast cancer using machine learning analytics. The suggested system utilises the ‘K-Means SMOTE’ oversampling method to handle the class imbalance problem and the ‘Boruta’ feature selection technique to select the most relevant features of the BC dataset. To understand the effectiveness of the proposed model, MLISBCP, its performance is compared with various single-classifier-based models, ensemble models and various models present in the literature in terms of the performance metrics accuracy, precision, recall, F1-score and RoC AUC Score. The results reveal that MLISBCP obtained the highest accuracy of 97.53 % with respect to existing models present in the literature.
1. Introduction

Breast Cancer (BC) is considered to be the second most dangerous cancer in the world with the highest mortality rate. It is a horrific disease for women all over the world that brings both physical and psychological damage. BC can affect both men and women, but is more commonly observed among women ([India 1]). BC affects the cells present in the breast and grows abnormally to the point where it may affect other parts
Abbreviations: ACS, American Cancer Society; AdaBoost ELM, Adaptive Boosting Extreme Learning Machine; Adaboost, Adaptive Boosting; ADASYN, Adaptive
Synthetic; ANN, Artificial Neural Network; AR + NN, Association Rules and Neural Network; AUC, Area Under the Curve; BC, Breast Cancer; BCCD, Breast Cancer
Coimbra Dataset; CART, Classification And Regression Tree; CNN, Convolutional Neural Network; CV, Cross-Validation; DT, Decision Tree; ESBCP, Expert System for
Breast Cancer Prediction; EXTR, extremely randomized trees; FABEE, Firefly Algorithm Based Expert System; FN, False Negatives; FP, False Positives; GB, Gradient
Boost; GSAM, Standard Additive Model (SAM) with Genetic Algorithm; HPBCR, Hybrid Predictor of Breast Cancer Recurrence; ICMR, Indian Council for Medical
Research; K-NN, K-Nearest Neighbors; LightGBM, Light Gradient-Boosting Machine; LR, Logistic Regression; ML, Machine Learning; MLP, Multi-Layer Perceptron;
MRI, Magnetic resonance imaging; NB, Naïve Bayesian; NN, Neural Network; REP Tree, Reduced Error Pruning Tree; RF, Random Forest; ROC, Receiver Operating
Characteristic; SMOTE, Synthetic Minority Oversampling Technique; SVM, Support Vector Machine; TN, True Negatives; TP, True Positives; U.S.A, United States of
America; WAUCE, Weighted Area Under the Receiver Operating Characteristic Curve Ensemble; WBCD, Wisconsin Breast Cancer Database; WHO, World Health
Organization; WNB, Weight Naïve Bayesian; WNBC, Weighted Naive Bayes Classifier; XGBoost, Extreme Gradient Boosting.
* Corresponding author.
E-mail addresses: [email protected] (A.K. Das), [email protected] (S.Kr. Biswas), [email protected] (A. Mandal), [email protected]
(A. Bhattacharya), [email protected] (S. Sanyal).
1 ORCID: 0000-0001-8773-4717.
https://doi.org/10.1016/j.eswa.2023.122673
Received 31 May 2023; Received in revised form 17 November 2023; Accepted 17 November 2023
Available online 25 November 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
A.K. Das et al. Expert Systems With Applications 242 (2024) 122673
of the body like brain, bones, lungs etc. Breast cancer may develop anywhere in the breast, but typically it grows in the lobules or ducts of the breast. The milk-producing glands are known as lobules, while the milk-transporting tubes are known as ducts.

Mammograms are commonly used in biological studies for the early identification of breast cancer. Breast cancer can be identified through a number of signs, including a new growth in the breast or underarm, breast swelling or thickening, and dimpling of the breast tissue (PMC 0000). The five stages (0–4) of breast cancer are usually determined by the size and type of tumour, as well as the amount of tumour cell penetration into the breast tissues (Das et al., 2021). In stage 0 of BC, there are no signs of tumour cells spreading to other parts of the body. In stage 1, cancer cells begin to affect nearby cells or tissues. It has two subcategories, 1A and 1B. In 1A, the tumour size is up to 2 cm, and it is found inside the breast but does not involve any lymph nodes. In stage 1B, lymph nodes contain a small group of cancer cells ranging in size from 0.2 mm to 2 mm (web 0000). Stage 2 is again divided into two parts: 2A and 2B. In 2A, there is no tumour in the breast, but cancer is observed in the axillary lymph nodes. In stage 2B, the tumour size lies between 2 cm and 5 cm (Das et al., 2021). Stage 3 is divided into Stage 3A, Stage 3B, and Stage 3C. In the 3A stage, the tumour size is greater than 5 cm, and it has spread to 1–3 axillary lymph nodes. In stage 3B, the tumour can be of any size and can expand to the chest wall and up to 9 axillary lymph nodes. In stage 3C, the tumour is under the collarbone and has spread to 10 or more lymph nodes but has not spread to other parts of the body. In stage 4, cancer spreads outside the breast to other parts of the body like the skin, lungs, bones, liver, brain, etc.

The number of newly diagnosed male BC patients worldwide increased from 8,500 in 1990 to 23,100 in 2017 (Chen et al., 2020). Male BC is uncommon, accounting for about 0.5–1 % of all BC patients worldwide (Yalaza et al., 2016 Jan; http 0000). As per the data (Ahmad, 2019), the mortality rate among male BC patients is surprisingly high, amounting to 9.09 %, whereas for women it was only 1.87 %. This is due to the lack of information among male BC patients. Almost all studies and clinical trials on BC were primarily focused on women, and the knowledge acquired from them was utilized to treat male breast cancer patients. Given the hormonal differences between male and female patients, the procedure of male BC treatment is in the majority of cases not the best. Since data related to male BC patients is very rare, it is very difficult to study and make proper decisions (Ahmad, 2019).

In the year 2013, around 2,32,340 women were diagnosed with BC in the U.S.A alone, and among them 39,620 women lost their lives due to BC (Akram et al., 2017 Dec). According to the WHO, approximately 1,56,000 cases of BC were reported in India in 2015, with 76,000 women expected to die as a result of the disease (stat, 2016). There were approximately 3,16,120 new cases of BC reported in the United States in 2017, and approximately 40,600 people were expected to die from the disease in 2017 (ACS 0000). In the year 2018, 6,27,000 women died of this fatal disease. According to the American Cancer Society (ACS), the United States has 3.1 million breast cancer survivors. Invasive BC has been found in 2,68,600 women, whereas non-invasive BC has been diagnosed in 62,930 women, according to ACS data released in 2019. Breast cancer was diagnosed among 2.3 million women worldwide in 2020, according to the World Health Organization (WHO), with 6,85,000 deaths worldwide. According to the WHO, it affects 2.10 million women each year. BC is responsible for around 15 percent of female deaths every year (JaikrishnanSVJ and Breakup, 2019). According to the Indian Council for Medical Research (ICMR), 1,50,000 women in India are diagnosed with breast cancer every year, with 70,000 dying as a result. Breast cancer now affects 1 in every 12 women in the United Kingdom between the ages of 1 and 85 (Jing Han et al., 2013). The mentioned data conclusively proves the severity of BC worldwide.

Mammography is a technique to diagnose breast cancer (Mori et al., 2017 Jan). X-rays are employed to determine a woman’s nipple condition. Because the cancer cell appears microscopic when viewed from the outside, it is typically very difficult to identify breast cancer in its early stages. Ultrasound is a well-known technique for the diagnosis of breast cancer in which a sound wave is sent inside the body to examine the condition. Dynamic MRI has been developed to detect breast distortions (Nagashima et al., 2002). Despite the fact that several modalities have been demonstrated, none of them can provide a correct and consistent result. Doctors must read a large volume of imaging data in mammography, which reduces accuracy. This procedure is also time-consuming and prone to human errors. Experienced doctors can usually detect breast cancer with an accuracy of 78 %, whereas machine learning techniques can do so with an accuracy of more than 90 % in most cases (Asri et al., 2016 Jan). This will enable patients to receive necessary treatments when needed. Finding preventative methods is crucial and absolutely vital considering the seriousness of patients’ life-threatening complications. Early diagnosis of breast cancer is essential to support preventive measures, since it allows adequate treatment to be given to avoid complications and lower the breast cancer mortality rate. Therefore, it is necessary to create an intelligent expert system to identify BC based on clinical symptoms in a preliminary phase, preventing BC from being dismissed as a typical fever and allowing for timely diagnosis and treatment. As a result, the medical expert system saves both money and time on pathological diagnosis while also lowering the risk of death. Even though many machine learning expert systems and classifiers have been extensively used to detect breast cancer (BC), it is very difficult to build accurate and efficient classifiers or expert systems for BC detection in medical machine learning research. Each expert system or classifier has its own advantages and disadvantages. Unfortunately, these systems frequently fall short in their ability to choose features systematically and handle the problem of class imbalance effectively. They also rely on single classifier models, which struggle with noisy and unbalanced data. To solve the mentioned limitations, this research paper proposes an expert system called “Machine Learning Based Intelligent System for Breast Cancer Prediction (MLISBCP)” for more accurate breast cancer (BC) prediction in early stages using symptomatic features. It may prove highly important in lowering the risk of breast cancer through early treatment based solely on clinical symptoms, thereby reducing both expenditure and time.

This system is intended to improve breast cancer prediction through the use of machine learning analytics. The proposed MLISBCP system performs pre-processing on the breast cancer data to manage missing values, encode features, handle class imbalance through oversampling, and perform feature selection. To handle the imbalanced dataset, this model utilizes the K-Means SMOTE oversampling method. For dealing with class imbalance, the under-sampling strategy has the drawback of potentially losing a lot of crucial data that could be beneficial for model training; oversampling techniques compensate for this drawback. However, the majority of oversampling algorithms have a propensity to produce noise and over-fitting issues, which reduce the model’s capacity for prediction. K-Means SMOTE, on the other hand, first clusters the minority class samples using K-Means clustering and then interpolates between the cluster centroids and the original samples to create synthetic samples for each cluster. As opposed to previous oversampling strategies, the generated synthetic samples are more representative of the minority class, lowering the danger of noise generation and over-fitting. Boruta is used to accomplish feature selection because it is a potent and adaptable method that can successfully choose a subset of features to maximise the performance of the suggested model. It employs an all-relevant feature selection methodology, capturing all variables that may occasionally be pertinent to the outcome variable. In contrast, the majority of conventional feature selection algorithms use a minimal-optimal approach, relying on a small subset of characteristics that produce the least amount of error with a selected classifier. In addition, it locates every feature that has any relationship, whether strong or
tenuous, to the decision variable. This makes it ideal for biomedical applications, in which it can be interesting to ascertain whether human features are related to a specific medical problem. This system also makes use of many stand-alone and ensemble machine learning classifiers like SVM, K-NN, DT, RF, AdaBoost, etc. This research paper presents a comparative analysis of various stand-alone and ensemble machine learning approaches with the suggested system utilising the breast cancer dataset. 10-fold cross-validation was used to obtain reliable results. As a result, the proposed model can provide a reliable diagnosis system for BC detection. Firstly, the performance of MLISBCP is compared in terms of the metrics accuracy, precision, recall, F-measure and RoC AUC Score with various machine learning classifiers. Finally, existing systems are compared with the proposed MLISBCP model. The experimental results show that the proposed model outperforms other single-classifier-based and ensemble models significantly.

The work is organised into six sections: Section 2 provides a literature survey; Section 3 illustrates the proposed methodology; Section 4 describes the experiment and evaluation method; Section 5 discusses the results; and finally, Section 6 draws a conclusion and suggests future scope for research.

2. Literature Survey

More research is being done today to determine the primary risk factors for breast cancer and to detect and prevent it. Numerous researchers have investigated this topic using machine learning methods. Using the Wisconsin Breast Cancer datasets, Asri et al. (Asri et al., 2016 Jan) used four machine learning methods: SVM, NB, k-NN, and C4.5. They compared the efficiency and effectiveness of those algorithms in terms of accuracy, sensitivity, specificity, and precision to determine which had the best classification accuracy, and concluded that support vector machines give an accuracy of 97.13 % and outperform all other algorithms. Li et al. (Li and Chen, 2018 Oct 18) employed DT, SVM, RF, LR, and NN models to predict the nature of breast cancer from the other attributes. They used two breast cancer datasets, BCCD and WBCD. The highest accuracy they achieved, 96.1 %, was obtained using decision trees. Gupta et al. (Gupta and Garg, 2020 Jan) concentrated on six machine learning algorithms, k-NN, LR, DT, RF, SVM, and the radial basis function kernel, using the WBCD dataset. They used the Adam gradient descent learning method, which had the highest accuracy of all the algorithms, achieving 98.24 % prediction accuracy. Thomas et al. (Thomas et al., 2020) discussed six machine learning algorithms, SVM, NN, LR, RF, NB, and DT, to predict breast cancer and compared their accuracy on the WBCD dataset. In comparison to all other techniques, they found that an artificial neural network (ANN) provides higher prediction accuracy at 97.85 %. Tiwari et al. (Tiwari et al., 2020) implemented various machine learning techniques as well as deep learning algorithms and compared their accuracy. The highest accuracy achieved by the machine learning algorithms was 96.5 %, obtained by the SVM and Random Forest algorithms. For increased accuracy, they used deep learning methods like CNN and ANN, achieving 97.3 % prediction accuracy with CNN and 99.3 % with ANN, respectively. Therefore, they concluded that the deep learning methods are better than the machine learning algorithms. Sengar et al. (Sengar et al., 2020) used two machine learning algorithms for breast cancer prediction, the decision tree classifier and logistic regression, and compared their accuracy to see which one would be better suited for the prediction. They found that decision trees performed better on the Wisconsin diagnostic dataset. Islam et al. (Islam et al., 2020 Sep) conducted a comparative study of five machine learning techniques for breast cancer prediction: SVM, K-NN, RF, ANN, and LR. Each technique’s basic features and working principles were illustrated. ANNs achieved the highest accuracy of 98.57 %, while RF and LR achieved the lowest accuracy of 95.7 %. Naji et al. (Naji et al., 2021 Jan) took the Wisconsin Breast Cancer Diagnostic dataset (WBCD) and applied five main algorithms: SVM, RF, LR, DT, and K-NN. They then measured, compared, and analysed several findings based on the confusion matrix, accuracy, sensitivity, precision, and area under the ROC curve (AUC) in order to determine the machine learning algorithms that are most accurate and reliable and have the best precision. They observed that support vector machines significantly outperformed all other algorithms, with a higher efficiency of 97.2 %. Table 1 provides a concise overview of the entire literature section.

2.1. Research Gap

Hence, three main conclusions can be drawn: the feature selection procedure was not systematic; the BC dataset has not been properly pre-processed; and the class imbalance issue has not been sufficiently addressed. Moreover, some of these systems did not employ cross-validation, which is absolutely necessary to verify the performance for validation. All of the said issues are addressed in this paper. Furthermore, the proposed model MLISBCP improves the accuracy of predicting breast cancer (BC) in early stages.

3. Proposed Methodology

The MLISBCP model is divided into three sections:

• Data collection and data description
• Data Pre-processing
• Classification

Fig. 1 depicts the MLISBCP model’s workflow.

3.1. Data Collection and Data Description

The breast cancer dataset was retrieved from the UCI machine learning repository (Wisconsin, 2018). The breast cancer dataset consists of 699 instances with 11 attributes, including the target attribute (“class”) and the ‘id’ attribute. Here, class is categorised as either “Benign” or “Malignant”. Benign is denoted as “2” and Malignant is denoted as “4”. There are 241 malignant instances (34.50 %) and 458 benign instances (65.50 %), but there were missing values in 16 rows out of the 699 instances. Nine of the ten features are treated as input features, while the remaining one is treated as the output feature. The attributes of the Breast Cancer dataset are shown in Table 2.

3.2. Data Pre-processing

Oftentimes, the basic data on breast cancer is inaccurate, inconsistent, lacking in certain behaviours or patterns, and prone to many mistakes. The raw breast cancer data is transformed into a suitable, understandable format during the data pre-processing stage. Data pre-processing consists of the following phases:

• Missing Value Handling
• Level Encoding
• Oversampling using K-Means SMOTE
• Feature Selection using Boruta

3.2.1. Missing Value Handling

There are 16 rows of the feature Bare Nuclei in the Wisconsin Breast Cancer dataset that have a single attribute value that is missing (or unavailable), indicated by the symbol “?”. To deal with these missing values, various methods are available, including imputation with the mode and the mean. However, missing values are very few in the BC dataset, and replacing them may introduce bias into the dataset. Therefore, all the missing-value rows are removed from the BC dataset.
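This removal step can be sketched in a few lines (a stdlib-only illustration with hypothetical row values; the paper's own implementation is not shown):

```python
# Rows whose Bare Nuclei value is the missing-value marker "?" are dropped
# rather than imputed, since only 16 of the 699 rows are affected.
rows = [
    {"Bare Nuclei": "1", "Class": 2},
    {"Bare Nuclei": "?", "Class": 4},   # missing value: this row is dropped
    {"Bare Nuclei": "10", "Class": 4},
]

clean = [r for r in rows if r["Bare Nuclei"] != "?"]
print(len(clean))  # 2
```

With pandas, the equivalent would be reading “?” as NA (e.g. `na_values="?"` in `read_csv`) and calling `dropna()`.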
Fig. 1. Machine Learning Based Intelligent System for Breast Cancer Prediction (MLISBCP).
Table 2
Visualisation of the distribution of data with attributes of the BC dataset.

SL NO | id        | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0     | 1,000,025 | 5 | 1  | 1  | 1 | 2 | 1  | 3  | 1  | 1 | 2
1     | 1,002,945 | 5 | 4  | 4  | 5 | 7 | 10 | 3  | 2  | 1 | 2
2     | 1,015,425 | 3 | 1  | 1  | 1 | 2 | 2  | 3  | 1  | 1 | 2
3     | 1,016,277 | 6 | 8  | 8  | 1 | 3 | 4  | 3  | 7  | 1 | 2
4     | 1,017,023 | 4 | 1  | 1  | 3 | 2 | 1  | 3  | 1  | 1 | 2
…     | …         | … | …  | …  | … | … | …  | …  | …  | … | …
694   | 776,715   | 3 | 1  | 1  | 1 | 3 | 2  | 1  | 1  | 1 | 2
695   | 841,769   | 2 | 1  | 1  | 1 | 2 | 1  | 1  | 1  | 1 | 2
696   | 888,820   | 5 | 10 | 10 | 3 | 7 | 3  | 8  | 10 | 2 | 4
697   | 897,471   | 4 | 8  | 6  | 4 | 3 | 4  | 10 | 6  | 1 | 4
698   | 897,471   | 4 | 8  | 8  | 5 | 4 | 5  | 10 | 4  | 1 | 4

699 rows x 11 columns
suffer if the unbalanced classes in the BC dataset are not handled correctly. The K-Means SMOTE procedure is used to balance benign and malignant instances in the BC dataset. It is superior to other oversampling methods since it does not produce noise or suffer from the over-fitting issue seen in the majority of sampling methods such as SMOTE, cluster-SMOTE, borderline-SMOTE, and Gaussian SMOTE (Last et al., 2017; Fonseca et al., 2021 Jun 29; Xu et al., 2021 Sep). The K-Means SMOTE method is divided into three steps:

• Clustering
• Filtering
• Oversampling

In the clustering phase, the input is divided into K groups utilising the K-means clustering algorithm. The filtering phase chooses groups for oversampling, keeping those with a high percentage of minority class samples (malignant). It then decides how many synthetic samples to produce, allocating more samples to groups whose minority samples are dispersed more thinly (increasing the count from 239 to 444 malignant instances). In the oversampling phase, SMOTE is used in each chosen group to achieve the desired minority-to-majority instance ratio (Wang et al., 2014 Jul). After the oversampling, there are 444 benign instances and 444 malignant instances in the BC dataset. The flowchart of the K-Means SMOTE method is shown in Fig. 2 for more in-depth knowledge.
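The three phases above can be illustrated with a self-contained toy sketch (all function names and data here are illustrative, not the paper's code; a real system would typically use a library implementation such as `KMeansSMOTE` from imbalanced-learn):

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=10, rng=None):
    # Clustering phase: plain k-means over the minority-class samples.
    rng = rng or random.Random(0)
    centroids = rng.sample(points, k)
    clusters = [list(points)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[j].append(p)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return [c for c in clusters if len(c) >= 2]

def kmeans_smote(minority, k=2, n_new=5, rng=None):
    # Filtering phase is trivial here (every non-degenerate cluster is kept);
    # the real method keeps minority-dense clusters and allocates more
    # synthetic points to the sparser ones.
    rng = rng or random.Random(0)
    clusters = kmeans(minority, k, rng=rng)
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rng.choice(clusters), 2)
        t = rng.random()
        # Oversampling phase: SMOTE-style interpolation inside one cluster.
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic

# Two tight minority clusters in 2-D; four synthetic samples are generated.
minority = [(0.0, 0.0), (0.2, 0.1), (0.1, 0.3),
            (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
new_pts = kmeans_smote(minority, k=2, n_new=4)
print(len(new_pts))  # 4
```

Because interpolation happens only between samples of the same cluster, no synthetic point falls between the two distant minority groups, which is the property that reduces noise relative to plain SMOTE.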
Table 3
Visualisation of the distribution of data after removing the missing data.

SL No | id        | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0     | 1,000,025 | 5 | 1  | 1  | 1 | 2 | 1  | 3  | 1  | 1 | 2
1     | 1,002,945 | 5 | 4  | 4  | 5 | 7 | 10 | 3  | 2  | 1 | 2
2     | 1,015,425 | 3 | 1  | 1  | 1 | 2 | 2  | 3  | 1  | 1 | 2
3     | 1,016,277 | 6 | 8  | 8  | 1 | 3 | 4  | 3  | 7  | 1 | 2
4     | 1,017,023 | 4 | 1  | 1  | 3 | 2 | 1  | 3  | 1  | 1 | 2
…     | …         | … | …  | …  | … | … | …  | …  | …  | … | …
678   | 776,715   | 3 | 1  | 1  | 1 | 3 | 2  | 1  | 1  | 1 | 2
679   | 841,769   | 2 | 1  | 1  | 1 | 2 | 1  | 1  | 1  | 1 | 2
680   | 888,820   | 5 | 10 | 10 | 3 | 7 | 3  | 8  | 10 | 2 | 4
681   | 897,471   | 4 | 8  | 6  | 4 | 3 | 4  | 10 | 6  | 1 | 4
682   | 897,471   | 4 | 8  | 8  | 5 | 4 | 5  | 10 | 4  | 1 | 4

683 rows x 11 columns
Table 4
Visualisation of the distribution of data after encoding.

SL NO | id        | Clump Thickness | Uniformity of Cell Size | Uniformity of Cell Shape | Marginal Adhesion | Single Epithelial Cell Size | Bare Nuclei | Bland Chromatin | Normal Nucleoli | Mitoses | Class
0     | 1,000,025 | 5 | 1  | 1  | 1 | 2 | 0 | 3  | 1  | 1 | 0
1     | 1,002,945 | 5 | 4  | 4  | 5 | 7 | 1 | 3  | 2  | 1 | 0
2     | 1,015,425 | 3 | 1  | 1  | 1 | 2 | 2 | 3  | 1  | 1 | 0
3     | 1,016,277 | 6 | 8  | 8  | 1 | 3 | 4 | 3  | 7  | 1 | 0
4     | 1,017,023 | 4 | 1  | 1  | 3 | 2 | 0 | 3  | 1  | 1 | 0
…     | …         | … | …  | …  | … | … | … | …  | …  | … | …
678   | 776,715   | 3 | 1  | 1  | 1 | 3 | 2 | 1  | 1  | 1 | 0
679   | 841,769   | 2 | 1  | 1  | 1 | 2 | 0 | 1  | 1  | 1 | 0
680   | 888,820   | 5 | 10 | 10 | 3 | 7 | 3 | 8  | 10 | 2 | 1
681   | 897,471   | 4 | 8  | 6  | 4 | 3 | 4 | 10 | 6  | 1 | 1
682   | 897,471   | 4 | 8  | 8  | 5 | 4 | 5 | 10 | 4  | 1 | 1

683 rows x 11 columns
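The encoding visible when comparing Tables 3 and 4 is consistent with label encoding over string-sorted levels, which would explain why Bare Nuclei “10” maps to 1 while “2” stays 2, and why the class labels 2/4 map to 0/1. A hypothetical sketch of that mapping:

```python
# Each column's distinct values (read as strings) are mapped to integer
# codes in sorted order, mimicking a typical label encoder.
def label_encode(values):
    codes = {v: i for i, v in enumerate(sorted(set(values)))}
    return [codes[v] for v in values]

print(label_encode(["2", "2", "4", "2", "4"]))   # [0, 0, 1, 0, 1]
print(label_encode(["1", "10", "2", "4", "1"]))  # [0, 1, 2, 3, 0]
```

Note the lexicographic ordering: as strings, "10" sorts between "1" and "2", which is exactly the pattern the encoded Bare Nuclei column shows.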
3.2.4. Feature Selection using Boruta

Feature selection is a method of removing the redundant and irrelevant features from the original feature set and selecting the subset of the most relevant features. The main aim of feature selection is to improve the model’s performance and decrease time complexity. The proposed model uses the Boruta method (Kursa and Rudnicki, 2010 Sep) to select the most relevant features of the BC dataset. It works with all tree-based classifier techniques. Boruta is an all-relevant feature selection method, i.e., it is able to identify and mark the most important features in a dataset, providing a greater level of detailed understanding of the relationships between features and the target variable (Rudnicki et al., 2015; Ahmadpour et al., 2021 Sep 9). In contrast, many traditional feature selection techniques do not provide this level of detailed information and simply return a list of correlated features without any indication of their relative importance to the target variable (Sumathi and Padmavathi, 2019). In each iteration, rejected variables are removed and the process continues until the method reaches the end of the iterations or all the irrelevant features are dropped (Kursa, 2014 Dec). Fig. 3 shows a complete overview of the Boruta feature selection approach.

3.3. Classification

The proposed MLISBCP model uses the reduced set of features obtained by the Boruta approach to detect breast cancer. This is a strong and flexible technique that successfully selects a subset of features to maximize the model’s performance. The reduced feature set is utilised to train the classifiers. Different classifiers used 683 instances to process the reduced feature set to detect breast cancer. Table 6 shows that the modified models can be used to make intermediate predictions, and the final MLISBCP model learns from the intermediate predictions by selecting the modified model with the highest accuracy (modified RF) among the intermediate modified models. This enhances the model’s performance, consistently outperforming all intermediate models. A flowchart of the proposed model is given in Fig. 4.

4. Experiment and Evaluation Method

The machine learning (ML) algorithms that were employed in the study are described in this section. Some standalone methods are used in this research, such as the Logistic Regression (LR) (Witteveen et al., 2018 Oct; Shravya et al., 2019 Apr 6) classifier, the Decision Tree (DT) (Williams et al., 2015; Quinlan, 1986 Mar) classifier, the Support Vector Machine (SVM) (Zheng et al., 2014 Mar 1; Cortes and Vapnik, 1995 Sep; Yadav et al., 2018), the Naïve Bayes (NB) [6 9] classifier, and the K-Nearest Neighbors (KNN) (Medjahed et al., 2013) classifier. It also employs some ensemble machine learning algorithms like the Random Forest (RF) (Ali et al., 2012 Sep 1; Breiman, 2001 Oct) classifier, AdaBoost (Chen and Xgboost, 2016; Zheng et al., 2020 May; Thongkam et al., 2008 Jan 1), the Extra Tree classifier (Sharma et al., 2022 Dec), LightGBM (Ke et al., 2017), and XGBoost (Inan et al., 2021; buhlmann 0000).

4.1. Performance Evaluation Metrics

4.1.1. Accuracy

The accuracy is calculated as the percentage of correctly identified instances. This is calculated by dividing the number of correct predictions by the total number of instances in the dataset. As a result, the accuracy can be calculated using equation (1).

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (1)

where TP = True Positive; FN = False Negative; FP = False Positive; TN = True Negative.

4.1.2. Precision

It is mainly used to overcome the limitations of accuracy. The accuracy of a positive prediction is determined by its precision. It is determined as the ratio of true positives to the total number of positive predictions. The precision can be determined using equation (2).

Precision = TP / (TP + FP)    (2)

4.1.3. Recall/Sensitivity

It is used to determine the proportion of actual positive values that were identified correctly. It can be calculated as the ratio of true positives to all actual positives. The recall can be determined using equation (3). Sensitivity analysis or recall analysis was used to confirm the system’s reliability and efficiency (Su et al., 2023 Jun 16; Zhang et al., 2023 Jun).

Recall/Sensitivity = TP / (TP + FN)    (3)
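Equations (1)–(3) follow directly from the confusion-matrix counts; a small illustrative helper with hypothetical counts:

```python
def metrics(tp, tn, fp, fn):
    """Eqs. (1)-(3): accuracy, precision and recall from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=90, tn=85, fp=5, fn=10)
print(round(acc, 3), round(prec, 3), round(rec, 3))  # 0.921 0.947 0.9
```

The counts here are invented for illustration; with them, precision penalises the 5 false positives while recall penalises the 10 false negatives.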
Table 5
Confusion matrix.

                | Predicted Negative | Predicted Positive
Actual Negative | TN                 | FP
Actual Positive | FN                 | TP

Python programming is used to implement the suggested expert model, MLISBCP. The proposed model MLISBCP is obtained by first solving the class imbalance problem with K-Means SMOTE, then selecting relevant features with Boruta.

Firstly, the K-Means SMOTE oversampling approach is used to equalize the minority class with the majority class in order to eliminate
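The shadow-feature test at the heart of the Boruta stage can be sketched as follows. This is a toy illustration, not the paper's code: a simple absolute-correlation score stands in for the random-forest importances the real Boruta algorithm uses, and only a single iteration is shown.

```python
import random

def abs_corr(col, y):
    # Toy importance measure: absolute Pearson correlation with the target.
    # A real Boruta run uses random-forest feature importances instead.
    n = len(col)
    mx, my = sum(col) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(col, y))
    sx = sum((a - mx) ** 2 for a in col) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def boruta_step(X, y, importance, rng):
    """One Boruta iteration: build 'shadow' features by permuting each real
    column (destroying any relation to y), then keep the real features whose
    importance beats the best shadow importance."""
    n_features = len(X[0])
    shadow_max = 0.0
    for j in range(n_features):
        col = [row[j] for row in X]
        rng.shuffle(col)
        shadow_max = max(shadow_max, importance(col, y))
    return [j for j in range(n_features)
            if importance([row[j] for row in X], y) > shadow_max]

rng = random.Random(0)
y = [0] * 20 + [1] * 20
# Feature 0 tracks the label exactly; feature 1 is pure noise.
X = [[float(v), rng.random()] for v in y]
kept = boruta_step(X, y, abs_corr, rng)
print(kept)  # feature 0 survives; the noise feature usually does not
```

Over repeated iterations, Boruta confirms features that consistently beat the best shadow and rejects those that consistently lose, which is how it arrives at an all-relevant feature subset rather than a minimal-optimal one.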
Table 10
RoC AUC Score Comparison (in %).

Base model | RoC AUC Score (in %) | Modified model | RoC AUC Score (in %)
NB         | 95.84 | NB       | 96.86
DT         | 94.41 | DT       | 95.50
SVMC       | 94.72 | SVMC     | 97.18
LR         | 94.83 | LR       | 97.19
KNN        | 95.98 | KNN      | 97.53
RF         | 96.08 | RF       | 97.53
EXTR       | 96.39 | EXTR     | 97.64
LightGBM   | 95.86 | LightGBM | 96.96
XGBoost    | 95.98 | XGBoost  | 97.07
AdaBoost   | 94.82 | AdaBoost | 96.39
Proposed MLISBCP Model |       |          | 97.53
used is producing fewer false positives. High levels of precision are crucial for healthcare and medical systems because precision has an effect on the accuracy and efficiency of diagnostic tools as well as treatment decisions. Low precision values can result in increased expenses and possible patient harm, whereas high precision values can help improve patient outcomes and save healthcare costs by minimizing unnecessary testing and treatments. The graphical representation of precision depicted in Fig. 6 gives the comparison visualisation of MLISBCP with the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble Models (RF, EXTR, LightGBM, XGBoost and AdaBoost) for better interpretation.

From Table 8, it can be seen that MLISBCP performs better than the single-classifier-based models (NB, DT, SVMC, LR, and KNN) and Ensemble Models (RF, EXTR, LightGBM, XGBoost and AdaBoost) in terms of recall. Better results in terms of recall show that the suggested MLISBCP has an accurate classification because the number of false negatives is comparatively low. A high recall value is crucial for healthcare and medical purposes from an application standpoint because it directly impacts the accuracy and effectiveness of diagnostic tools and treatment decisions.

The comparison visualisation of the MLISBCP with the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble models (RF, EXTR, LightGBM, XGBoost and AdaBoost) is provided by the graphical depiction of recall as shown in Fig. 7 for a better explanation.

Fig. 7. Graphical representation of recall comparison of MLISBCP system with the base models and modified models.

From Table 9, it can be seen that MLISBCP performs better than the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble models (RF, EXTR, LightGBM, XGBoost and AdaBoost) in terms of F1-Score. Better results in terms of the F1-Score mean that the model or system has achieved a more balanced performance in terms of precision and recall values. A high F1-Score value is crucial for healthcare and medical purposes from an application standpoint because it provides a more comprehensive evaluation of both precision and recall values, and can help ensure that diagnostic tools and treatment decisions are accurate, efficient, and effective.

The comparison visualisation of the MLISBCP with the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble Models (RF, EXTR, LightGBM, XGBoost and AdaBoost) is provided by the graphical depiction of F1-Score as shown in Fig. 8 for a better explanation.

From Table 10, it can be seen that MLISBCP performs better than the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble models (RF, EXTR, LightGBM, XGBoost and AdaBoost) in terms of RoC AUC Score. The ROC AUC value indicates how effective the model is. The higher the AUC, the better the model distinguishes between positive and negative classifications. It is employed to evaluate an examination’s overall diagnostic effectiveness and to contrast the results of multiple diagnostic procedures. It also serves to choose the best cut-off value for assessing whether a disease is present or not.

The comparison visualisation of the MLISBCP with the single-classifier-based models (NB, DT, SVMC, LR and KNN) and Ensemble Models (RF, EXTR, LightGBM, XGBoost and AdaBoost) is provided by the graphical depiction of RoC AUC Score as shown in Fig. 9 for a better explanation.

The MLISBCP performance comparison with the single-classifier-based models (NB, DT, SVMC, LR and KNN) and ensemble models (RF, EXTR, LightGBM, XGBoost and AdaBoost) is presented in Table 11. From Table 11, it can be observed that the proposed model MLISBCP performs better than the single-classifier-based models (SVM (Asri et al., 2016 Jan), LR (Murugan et al., 2017) and SVM (Bayrak et al., 2019)) and
Fig. 6. Graphical representation of precision comparison of MLISBCP system Fig. 8. Graphical representation of F1-Score comparison of MLISBCP system
with base models and modified model. with base models and modified models.
10
A.K. Das et al. Expert Systems With Applications 242 (2024) 122673
WNBC (Kharya and Soni, 2016 Jan); Firefly Algorithm based Expert
System (Alaybeyoglu and Mulayim, 2018 Aug); rough-fuzzy system
(Gilal et al., 2019 May 1), AdaBoostELM (Sharifmoghadam and
Jazayeriy, 2019) and ESBCP (Das et al., 2022); it is observed that the
proposed model MLISBCP greatly outperforms the rest of the models
mentioned. Class imbalance issues are not addressed by any of the
models that were compared. This method has a propensity to produce
noise and when used in conjunction with oversampling, it has a pro
pensity to lead to over-fitting issues that ultimately reduce the model’s
performance.
10-fold cross-validation provides a more accurate estimate of the
variance of the model than other methods. This is because it uses mul
tiple rounds of training and testing, which helps to reduce the impact of
random sampling and provides a more stable estimate of model per
formance. According to the results, the suggested MLISBCP system is
Fig. 9. Graphical representation of RoC AUC Score comparison of MLISBCP more accurate than the previously discussed systems.
system with base models and modified models.
It has been noted that the existing models did not properly identify
features and perform data pre-processing in order to provide more
Table 11 insightful results. According to the results, the suggested MLISBCP sys
Performance comparison of MLISBCP with the single-classifier models and tem is more accurate than the previously discussed systems.
ensemble models. The comparison of the MLISBCP with various models present in
literature is provided by the graphical depiction of accuracy as shown in
Ref. Models Best 10 -FOLD Balance
Accuracy CV dataset Fig. 11 for better explanation.
(in %)
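As an illustration of how the four reported metrics behave, the short sketch below computes precision, recall, F1-score and ROC AUC with scikit-learn. This is not the authors' code: scikit-learn's bundled Wisconsin Diagnostic breast cancer data and a default Random Forest are stand-ins for the paper's dataset and the MLISBCP pipeline.

```python
# Illustrative sketch (not the MLISBCP implementation): computing the four
# evaluation metrics discussed above with scikit-learn.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Bundled Wisconsin Diagnostic BC data stands in for the paper's dataset.
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

clf = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_prob = clf.predict_proba(X_te)[:, 1]  # scores for the positive class

print(f"precision: {precision_score(y_te, y_pred):.3f}")  # penalises false positives
print(f"recall:    {recall_score(y_te, y_pred):.3f}")     # penalises false negatives
print(f"F1-score:  {f1_score(y_te, y_pred):.3f}")         # harmonic mean of the two
print(f"ROC AUC:   {roc_auc_score(y_te, y_prob):.3f}")    # threshold-free separability
```

Note that ROC AUC is computed from the predicted probabilities rather than the hard labels, since it summarises ranking quality across all possible cut-off values.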
Table 12
Performance comparison of MLISBCP with various models present in literature.
(Columns: Ref. | Models | Best Accuracy (in %) | 10-FOLD CV | Balance dataset)
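The class-imbalance handling that the compared models lack can be sketched as follows. This is a simplified, hand-rolled illustration of the K-Means SMOTE idea (cluster first, then interpolate synthetic minority samples only inside minority-dense clusters), not the authors' implementation; in practice a library such as imbalanced-learn would be used, and the Boruta feature-selection step of MLISBCP is omitted here. The dataset and all parameter values are illustrative assumptions.

```python
# Illustrative sketch of the K-Means SMOTE oversampling idea used by MLISBCP.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.neighbors import NearestNeighbors

def kmeans_smote(X, y, minority=1, n_clusters=3, k=2, threshold=0.1, seed=0):
    """Minimal K-Means SMOTE: cluster X, keep clusters whose minority share
    exceeds `threshold`, then synthesise new minority points by interpolating
    between nearest minority neighbours within each kept cluster."""
    rng = np.random.default_rng(seed)
    n_new = int(np.sum(y != minority) - np.sum(y == minority))
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    # clusters where minority points are dense enough for safe interpolation
    good = [c for c in range(n_clusters)
            if np.mean(y[labels == c] == minority) > threshold
            and np.sum((labels == c) & (y == minority)) > k]
    synthetic = []
    per_cluster = int(np.ceil(n_new / len(good)))
    for c in good:
        Xc = X[(labels == c) & (y == minority)]
        nn = NearestNeighbors(n_neighbors=k + 1).fit(Xc)
        for _ in range(per_cluster):
            i = int(rng.integers(len(Xc)))
            # skip the first neighbour (the point itself), pick one of the rest
            neigh = nn.kneighbors(Xc[i:i + 1], return_distance=False)[0][1:]
            j = int(rng.choice(neigh))
            synthetic.append(Xc[i] + rng.random() * (Xc[j] - Xc[i]))
    synthetic = np.asarray(synthetic)[:n_new]
    return (np.vstack([X, synthetic]),
            np.concatenate([y, np.full(len(synthetic), minority)]))

# Imbalanced synthetic stand-in for the BC dataset (~4:1 majority:minority).
X, y = make_classification(n_samples=500, n_features=9, weights=[0.8, 0.2],
                           random_state=0)
X_res, y_res = kmeans_smote(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # classes balanced after resampling
```

Restricting interpolation to minority-dense clusters is what distinguishes K-Means SMOTE from plain SMOTE: it avoids generating synthetic points between distant minority samples that straddle majority regions, which is the noise and over-fitting risk noted above.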
Fig. 11. Performance comparison of the MLISBCP with various models present in literature.

Data Availability

Data will be made available on request.

Acknowledgement

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Data availability statement

All data generated or analysed during this study are included in this article.

References

Abdulrahman, B. F., Hawezi, R. S., MR, S. M., Kareem, S. W., & Ahmed, Z. R. (2022). Comparative evaluation of machine learning algorithms in breast cancer. Qalaai Zanist Journal, 7(1), 878–902. https://doi.org/10.25212/lfu.qzj.7.1.34

ACS. Breast Cancer Facts & Figures 2017–2018.

Ahmad, A. (2019). Breast cancer statistics: Recent trends. In Breast Cancer Metastasis and Drug Resistance: Challenges and Progress (pp. 1–7). https://doi.org/10.1007/978-3-030-20301-6_1

Ahmadpour, H., Bazrafshan, O., Rafiei-Sardooi, E., Zamani, H., & Panagopoulos, T. (2021). Gully erosion susceptibility assessment in the Kondoran watershed using machine learning algorithms and the Boruta feature selection. Sustainability, 13(18), 10110. https://doi.org/10.3390/su131810110

Akram, M., Iqbal, M., Daniyal, M., & Khan, A. U. (2017). Awareness and current knowledge of breast cancer. Biological Research, 50, 1–23. https://doi.org/10.1186/s40659-017-0140-9

Al Helal, M., Chowdhury, A. I., Islam, A., Ahmed, E., Mahmud, M. S., & Hossain, S. (2019). An optimization approach to improve classification performance in cancer and diabetes prediction. In 2019 International Conference on Electrical, Computer and Communication Engineering (ECCE) (pp. 1–5). IEEE. https://doi.org/10.1109/ECACE.2019.8679413

Alaybeyoglu, A., & Mulayim, N. (2018). A design of hybrid expert system for diagnosis of breast cancer and liver disorder. The Eurasia Proceedings of Science Technology Engineering and Mathematics, 19(2), 345–353. http://www.epstem.net/en/pub/issue/38904/455966

Ali, J., Khan, R., Ahmad, N., & Maqsood, I. (2012). Random forests and decision trees. International Journal of Computer Science Issues (IJCSI), 9(5), 272.

Apoorva, V., Yogish, H. K., & Chayadevi, M. L. (2021). Breast cancer prediction using machine learning techniques. In 3rd International Conference on Integrated Intelligent Computing Communication & Security (ICIIC 2021). https://doi.org/10.2991/ahis.k.210913.043

Aroef, C., Rivan, Y., & Rustam, Z. (2020). Comparing random forest and support vector machines for breast cancer classification. TELKOMNIKA (Telecommunication Computing Electronics and Control), 18(2), 815–821. https://doi.org/10.12928/telkomnika.v18i2.14785

Asri, H., Mousannif, H., Al Moatassime, H., & Noel, T. (2016). Using machine learning algorithms for breast cancer risk prediction and diagnosis. Procedia Computer Science, 83, 1064–1069. https://doi.org/10.1016/j.procs.2016.04.224

Assegie, T. A., Tulasi, R. L., & Kumar, N. K. (2021). Breast cancer prediction model with decision tree and adaptive boosting. IAES International Journal of Artificial Intelligence, 10(1), 184. https://doi.org/10.11591/ijai.v10.i1.pp184-190

Bayrak, E. A., Kırcı, P., & Ensari, T. (2019). Comparison of machine learning methods for breast cancer diagnosis. In 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT) (pp. 1–3). IEEE. https://doi.org/10.1109/EBBT.2019.8741990

Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324

Bühlmann, P., & Hothorn, T. Boosting algorithms: Regularization, prediction and model fitting. https://doi.org/10.1214/07-STS242

Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794). https://doi.org/10.1145/2939672.2939785

Chen, Z., Xu, L., Shi, W., Zeng, F., Zhuo, R., Hao, X., & Fan, P. (2020). Trends of female and male breast cancer incidence at the global, regional, and national levels, 1990–2017. Breast Cancer Research and Treatment, 180, 481–490. https://doi.org/10.1007/s10549-020-05561-1

Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273–297. https://doi.org/10.1007/BF00994018

Das, A. K., Biswas, S. K., Bhattacharya, A., & Alam, E. (2021). Introduction to breast cancer and awareness. In 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS) (Vol. 1, pp. 227–232). IEEE. https://doi.org/10.1109/ICACCS51430.2021.9441686

Das, A. K., Biswas, S. K., & Mandal, A. (2022). An Expert System for Breast Cancer Prediction (ESBCP) using decision tree. Indian Journal of Science and Technology, 15(45), 2441–2450. https://doi.org/10.17485/IJST/v15i45.756

Fonseca, J., Douzas, G., & Bacao, F. (2021). Improving imbalanced land cover classification with K-Means SMOTE: Detecting and oversampling distinctive minority spectral signatures. Information, 12(7), 266. https://doi.org/10.3390/info12070266

Gilal, A. R., Abro, A., Hassan, G., & Jaafar, J. (2019). A rough-fuzzy model for early breast cancer detection. Journal of Medical Imaging and Health Informatics, 9(4), 688–696. https://doi.org/10.1166/jmihi.2019.2664

Gupta, P., & Garg, S. (2020). Breast cancer prediction using varying parameters of machine learning models. Procedia Computer Science, 171, 593–601. https://doi.org/10.1016/j.procs.2020.04.064

Haixiang, G., Yijing, L., Shang, J., Mingyun, G., Yuanyue, H., & Bing, G. (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73, 220–239. https://doi.org/10.1016/j.eswa.2016.12.035

http://www.breastcancerindia.net/statistics/stat_global.html [Accessed November, 2016].

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3255438 [Accessed: 12-Feb-2022].

https://www.who.int/news-room/fact-sheets/detail/breast-cancer#:~:text=Roughly%20half%20of%20all%20breast,breast%20cancers%20occur%20in%20men. [Accessed: 12-Feb-2022].

Inan, M. S., Hasan, R., & Alam, F. I. (2021). A hybrid probabilistic ensemble based extreme gradient boosting approach for breast cancer diagnosis. In 2021 IEEE 11th Annual Computing and Communication Workshop and Conference (CCWC) (pp. 1029–1035). IEEE. https://doi.org/10.1109/CCWC51732.2021.9376007

India against cancer (2019). "Breast Cancer", National Institute of Cancer Prevention and Research, viewed 12 November 2019, <http://cancerindia.org.in/breast-cancer/>.

Islam, M., Haque, M., Iqbal, H., Hasan, M., Hasan, M., & Kabir, M. N. (2020). Breast cancer prediction: A comparative study using machine learning techniques. SN Computer Science, 1(5), 1–4. https://doi.org/10.1007/s42979-020-00305-w

Jaikrishnan, S. V. J., Chantarakasemchit, O., & Meesad, P. (2019). A breakup machine learning approach for breast cancer prediction. In 11th International Conference on Information Technology and Electrical Engineering (ICITEE) (pp. 1–6). https://doi.org/10.1109/ICITEED.2019.8929977

Jing Han, S. J., Guo, Q. Q., Wang, T., Wang, Y. X., Zhang, Y. X., Liu, F., … He, Y. (2013). Prognostic significance of interactions between ER alpha and ER beta and lymph node status in breast cancer cases. Asian Pacific Journal of Cancer Prevention, 14(10), 6081–6084. https://doi.org/10.7314/APJCP.2013.14.10.6081

Kabiraj, S., Raihan, M., Alvi, N., Afrin, M., Akter, L., Sohagi, S. A., & Podder, E. (2020). Breast cancer risk prediction using XGBoost and random forest algorithm. In 2020 11th International Conference on Computing, Communication and Networking Technologies (ICCCNT) (pp. 1–4). IEEE. https://doi.org/10.1109/ICCCNT49239.2020.9225451

Karabatak, M., & Ince, M. C. (2009). An expert system for detection of breast cancer based on association rules and neural network. Expert Systems with Applications, 36(2), 3465–3469. https://doi.org/10.1016/j.eswa.2008.02.064

Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T. Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. Advances in Neural Information Processing Systems, 30.

Kharya, S., & Soni, S. (2016). Weighted naive Bayes classifier: A predictive model for breast cancer detection. International Journal of Computer Applications, 133(9), 32–37. https://doi.org/10.5120/ijca2016908023

Khourdifi, Y., & Bahaj, M. (2018). Applying best machine learning algorithms for breast cancer prediction and classification. In 2018 International Conference on Electronics, Control, Optimization and Computer Science (ICECOCS) (pp. 1–5). IEEE. https://doi.org/10.1109/ICECOCS.2018.8610632

Kursa, M. B. (2014). Robustness of Random Forest-based gene selection methods. BMC Bioinformatics, 15, 1–8. https://doi.org/10.1186/1471-2105-15-8

Kursa, M. B., & Rudnicki, W. R. (2010). Feature selection with the Boruta package. Journal of Statistical Software, 36(11), 1–13. https://doi.org/10.18637/jss.v036.i11

Last, F., Douzas, G., & Bacao, F. (2017). Oversampling for imbalanced learning based on K-Means and SMOTE. https://arxiv.org/abs/1711.00837

Li, Y., & Chen, Z. (2018). Performance evaluation of machine learning methods for breast cancer prediction. Applied and Computational Mathematics, 7(4), 212–216. https://doi.org/10.11648/j.acm.20180704.15

Liang, X. W., Jiang, A. P., Li, T., Xue, Y. Y., & Wang, G. T. (2020). LR-SMOTE: An improved unbalanced data set oversampling based on K-means and SVM. Knowledge-Based Systems, 196, 105845. https://doi.org/10.1016/j.knosys.2020.105845

Medjahed, S. A., Saadi, T. A., & Benyettou, A. (2013). Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. International Journal of Computer Applications, 62(1).

Mohebian, M. R., Marateb, H. R., Mansourian, M., Mañanas, M. A., & Mokarian, F. (2017). A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning. Computational and Structural Biotechnology Journal, 15, 75–85. https://doi.org/10.1016/j.csbj.2016.11.004

Mori, M., Akashi-Tanaka, S., Suzuki, S., Daniels, M. I., Watanabe, C., Hirose, M., & Nakamura, S. (2017). Diagnostic accuracy of contrast-enhanced spectral mammography in comparison to conventional full-field digital mammography in a population of women with dense breasts. Breast Cancer, 24, 104–110. https://doi.org/10.1007/s12282-016-0681-8

Murugan, S., Kumar, B. M., & Amudha, S. (2017). Classification and prediction of breast cancer using linear regression, decision tree and random forest. In 2017 International Conference on Current Trends in Computer, Electrical, Electronics and Communication (CTCEEC) (pp. 763–766). IEEE. https://doi.org/10.1109/CTCEEC.2017.8455058

Nagashima, T., Suzuki, M., Yagata, H., Hashimoto, H., Shishikura, T., Imanaka, N., & Miyazaki, M. (2002). Dynamic-enhanced MRI predicts metastatic potential of invasive ductal breast cancer. Breast Cancer, 9(3), 226–230. https://doi.org/10.1007/BF02967594

Naji, M. A., El Filali, S., Aarika, K., Benlahmar, E. H., Abdelouhahid, R. A., & Debauche, O. (2021). Machine learning algorithms for breast cancer prediction and diagnosis. Procedia Computer Science, 191, 487–492. https://doi.org/10.1016/j.procs.2021.07.062

Nguyen, T., Khosravi, A., Creighton, D., & Nahavandi, S. (2015). Classification of healthcare data using genetic fuzzy logic system and wavelets. Expert Systems with Applications, 42(4), 2184–2197. https://doi.org/10.1016/j.eswa.2014.10.027

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106. https://doi.org/10.1007/BF00116251

Rudnicki, W. R., Wrzesień, M., & Paja, W. (2015). All relevant feature selection methods and applications. In Feature Selection for Data and Pattern Recognition (pp. 11–28). https://doi.org/10.1007/978-3-662-45620-0_2

Sengar, P. P., Gaikwad, M. J., & Nagdive, A. S. (2020). Comparative study of machine learning algorithms for breast cancer prediction. In 2020 Third International Conference on Smart Systems and Inventive Technology (ICSSIT) (pp. 796–801). IEEE. https://doi.org/10.1109/ICSSIT48917.2020.9214267

Senthilkumar, B., Zodinpuii, D., Pachuau, L., Chenkual, S., Zohmingthanga, J., Kumar, N. S., & Hmingliana, L. (2022). Ensemble modelling for early breast cancer prediction from diet and lifestyle. IFAC-PapersOnLine, 55(1), 429–435. https://doi.org/10.1016/j.ifacol.2022.04.071

Sharifmoghadam, M., & Jazayeriy, H. (2019). Breast cancer classification using AdaBoost-extreme learning machine. In 2019 5th Iranian Conference on Signal Processing and Intelligent Systems (ICSPIS) (pp. 1–5). IEEE. https://doi.org/10.1109/ICSPIS48872.2019.9066088

Sharma, S., Aggarwal, A., & Choudhury, T. (2018). Breast cancer detection using machine learning algorithms. In 2018 International Conference on Computational Techniques, Electronics and Mechanical Systems (CTEMS) (pp. 114–118). IEEE. https://doi.org/10.1109/CTEMS.2018.8769187

Sharma, D., Kumar, R., & Jain, A. (2022). Breast cancer prediction based on neural networks and extra tree classifier using feature ensemble learning. Measurement: Sensors, 24, 100560. https://doi.org/10.1016/j.measen.2022.100560

Shravya, C., Pravalika, K., & Subhani, S. (2019). Prediction of breast cancer using supervised machine learning techniques. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 8(6), 1106–1110.

Su, W., Chen, S., Zhang, C., & Li, K. W. (2023). A subgroup dominance-based benefit of the doubt method for addressing rank reversals: A case study of the human development index in Europe. European Journal of Operational Research, 307(3), 1299–1317. https://doi.org/10.1016/j.ejor.2022.11.030

Sumathi, C. P., & Padmavathi, M. S. (2019). An experimental approach of applying Boruta and elastic net for variable selection in classifying breast cancer datasets. International Journal of Knowledge Engineering and Data Mining, 6(4), 356–375. https://doi.org/10.1504/IJKEDM.2019.105265

Thomas, T., Pradhan, N., & Dhaka, V. S. (2020). Comparative analysis to predict breast cancer using machine learning algorithms: A survey. In 2020 International Conference on Inventive Computation Technologies (ICICT) (pp. 192–196). IEEE. https://doi.org/10.1109/ICICT48043.2020.9112464

Thongkam, J., Xu, G., Zhang, Y., & Huang, F. (2008). Breast cancer survivability via AdaBoost algorithms. In Proceedings of the Second Australasian Workshop on Health Data and Knowledge Management (Vol. 80, pp. 55–64).

Tiwari, M., Bharuka, R., Shah, P., & Lokare, R. (2020). Breast cancer prediction using deep learning and machine learning techniques. SSRN 3558786. https://doi.org/10.2139/ssrn.3558786

Varma, P. S., Kumar, S., & Reddy, K. S. (2021). Machine learning based breast cancer visualization and classification. In 2021 International Conference on Innovative Trends in Information Technology (ICITIIT) (pp. 1–6). IEEE. https://doi.org/10.1109/ICITIIT51526.2021.9399603

Verma, D., & Mishra, N. (2017). Analysis and prediction of breast cancer and diabetes disease datasets using data mining classification techniques. In 2017 International Conference on Intelligent Sustainable Systems (ICISS) (pp. 533–538). IEEE. https://doi.org/10.1109/ISS1.2017.8389229

Wang, K. J., Makond, B., Chen, K. H., & Wang, K. M. (2014). A hybrid classifier combining SMOTE with PSO to estimate 5-year survivability of breast cancer patients. Applied Soft Computing, 20, 15–24. https://doi.org/10.1016/j.asoc.2013.09.014

Wang, H., Zheng, B., Yoon, S. W., & Ko, H. S. (2018). A support vector machine-based ensemble algorithm for breast cancer diagnosis. European Journal of Operational Research, 267(2), 687–699. https://doi.org/10.1016/j.ejor.2017.12.001

https://www.webmd.com/breast-cancer/stages-grades-breast-cancer [Accessed: 12-Feb-2022].

Williams, K., Idowu, P. A., Balogun, J. A., & Oluwaranti, A. I. (2015). Breast cancer risk prediction using data mining classification techniques. Transactions on Networks and Communications, 3(2), 1. https://doi.org/10.14738/tnc.32.662

Breast Cancer Wisconsin (Original) Data Set. [Online]. https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data. Accessed 25 Aug 2018.

Witteveen, A., Nane, G. F., Vliegen, I. M., Siesling, S., & IJzerman, M. J. (2018). Comparison of logistic regression and Bayesian networks for risk prediction of breast cancer recurrence. Medical Decision Making, 38(7), 822–833. https://doi.org/10.1177/0272989X18790963

Wu, J., & Hicks, C. (2021). Breast cancer type classification using machine learning. Journal of Personalized Medicine, 11(2), 61. https://doi.org/10.3390/jpm11020061

Xu, Z., Shen, D., Nie, T., Kou, Y., Yin, N., & Han, X. (2021). A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data. Information Sciences, 572, 574–589. https://doi.org/10.1016/j.ins.2021.02.056

Yadav, P., Varshney, R., & Gupta, V. K. (2018). Diagnosis of breast cancer using decision tree models and SVM. International Research Journal of Engineering and Technology (IRJET).

Yalaza, M., İnan, A., & Bozer, M. (2016). Male breast cancer. The Journal of Breast Health, 12(1), 1–8. https://doi.org/10.5152/tjbh.2015.2711

Yarabarla, M. S., Ravi, L. K., & Sivasangari, A. (2019). Breast cancer prediction via machine learning. In 2019 3rd International Conference on Trends in Electronics and Informatics (ICOEI) (pp. 121–124). IEEE. https://doi.org/10.1109/ICOEI.2019.8862533

Zhang, Z., & Li, Z. (2022). Evaluation methods for breast cancer prediction in machine learning field. In SHS Web of Conferences (Vol. 144, p. 03010). EDP Sciences. https://doi.org/10.1051/shsconf/202214403010

Zhang, J., & Chen, L. (2019). Clustering-based undersampling with random over sampling examples and support vector machine for imbalanced classification of breast cancer diagnosis. Computer Assisted Surgery, 24(sup2), 62–72. https://doi.org/10.1080/24699322.2019.1649074

Zhang, C., Su, W., Chen, S., Zeng, S., & Liao, H. (2023). A combined weighting based large scale group decision making framework for MOOC group recommendation. Group Decision and Negotiation, 32(3), 537–567. https://doi.org/10.1007/s10726-023-09816-2

Zheng, J., Lin, D., Gao, Z., Wang, S., He, M., & Fan, J. (2020). Deep learning assisted efficient AdaBoost algorithm for breast cancer detection and early diagnosis. IEEE Access, 8, 96946–96954. https://doi.org/10.1109/ACCESS.2020.2993536

Zheng, B., Yoon, S. W., & Lam, S. S. (2014). Breast cancer diagnosis based on feature extraction using a hybrid of K-means and support vector machine algorithms. Expert Systems with Applications, 41(4), 1476–1482. https://doi.org/10.1016/j.eswa.2013.08.044