
A Hybrid Model To Predict The Breast Cancer Using Stacking and Bagging Model

The document discusses a hybrid model using stacking and bagging techniques to predict breast cancer as benign or malignant tumors. It uses the Wisconsin breast cancer dataset to train weak learners like KNN, random forest, decision tree and SVM. Logistic regression is used as the meta learner on the predictions from weak learners. The proposed hybrid ensemble model achieves improved accuracy of breast cancer prediction compared to individual models.

2023 3rd International Conference on Mobile Networks and Wireless Communications (ICMNWC) | 979-8-3503-1702-2/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICMNWC60182.2023.10436010

A Hybrid Model to Predict the Breast Cancer using Stacking and Bagging Model

1st S. Yuvalatha
Department of Computer Science and Business Systems
Bannari Amman Institute of Technology
Sathyamangalam, India

2nd S. Nithyapriya
Department of Artificial Intelligence and Data Science
Bannari Amman Institute of Technology
Sathyamangalam, India

3rd S. Prabhavathy
Department of Electronics and Communication Engineering
Christ the King Engineering College
Coimbatore, India

4th R. Priyadharshini
Department of Computer Science and Engineering
Sri Ranganathar Institute of Engineering
Coimbatore, India

5th S. Savitha
Department of Computer and Science and Engineering
PSR Engineering College
Sivakasi, India

6th S. Kayathri
Department of Computer and Science and Engineering
PSR Engineering College
Sivakasi, India

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY MADRAS. Downloaded on April 05,2024 at 08:20:28 UTC from IEEE Xplore. Restrictions apply.

Abstract: Breast cancer is a malignant tumor that develops in the cells of the breast tissue. It is one of the major causes of death for women globally. In the examination of medical data, breast cancer prediction is a difficult task. To make decisions and accurately distinguish between benign and malignant tumors, physicians and pathologists need automated technologies. In this paper, a hybrid ensemble technique (Bagging and Stacking) is used to classify breast tumors as benign or malignant. In the proposed work, subsets of data are created from the initial Wisconsin (Diagnostic) Data Set by the bootstrapping technique. Each bootstrap dataset is used to train a weak learner. The weak learners are K-Nearest Neighbors (KNN), Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM). Logistic Regression (LR) is used as the Meta Learner. The Meta Learner uses the predictions of the weak learners as its training data. The proposed hybrid ensemble model obtains an accuracy of 98.7%, precision of 98.83%, recall of 98.54%, F1 score of 98.68% and an error of 0.012%.

Keywords: Malignant, Ensemble, Meta Learner, Bagging, Stacking, Bootstrapping

I. INTRODUCTION
Cancer can start in any area of the body and travel via the blood or lymphatic system to other areas. Cancers can arise in different organs, such as the lung, breast, prostate and colon.

The malignancy that arises in the breast cells is called breast cancer. Abnormal cells in the breast that grow and divide uncontrollably give rise to a tumor [1, 2, 3]. Both men and women can develop breast cancer, although women are more likely to do so [2, 3]. The most common symptom of breast cancer is a lump or mass in the breast, although not all lumps are cancerous [2]. Other symptoms are breast pain, swelling, nipple discharge, or changes in the shape or size of the breast [5].

Breast cancer comes in various forms, such as ductal carcinoma, lobular carcinoma, inflammatory breast cancer, and Paget's disease of the breast. The kind and stage of the cancer, the patient's general health, and their preferences all influence the treatment options, which may include surgery, radiation therapy, chemotherapy, hormone therapy, and targeted therapy. Effective breast cancer treatment depends on early detection [4]. Women are encouraged to perform regular breast self-exams and to have mammograms as recommended by their healthcare provider [4,6].

Breast cancer prediction using machine learning involves developing a model that can predict whether a patient is likely to have breast cancer based on certain features or risk factors [5]. The goal of this paper is to improve early detection and increase the accuracy of breast cancer diagnosis.

A. Contributions of this study
• This paper proposes a hybrid ensemble technique for breast cancer prediction.
• Bagging and stacking techniques are combined as the hybrid model in the proposed work.
• After preprocessing, feature extraction and selection are performed.
• Five-fold cross validation is performed on the selected features.
• In each fold of the cross validation, the training data is split into two sections: 4/5 of the data for training the weak learners and 1/5 of the data for the Meta learner.
• Using the bootstrapping algorithm, data subsets are sampled from the 4/5 of the training data, and the weak learners are trained on these subsets.
• The predictions of the weak learners are used as the training data set for the Meta learner (LR).

II. RELATED WORK
Khandaker Mohammad Mohi Uddin et al. [2] use machine learning algorithms such as SVM, DT, RF, LR and a voting classifier to analyze breast cancer tissue. Among these algorithms, the voting classifier gives a good accuracy of 98.77% in predicting breast cancer.

Furkan Atban et al. [3] apply the transfer learning method for deep feature extraction. The best features for breast cancer are selected by SWARM optimization. The SVM and RBF achieve an F-score of 97.75%.

Siddharth Raj Gupta et al. [4] use machine learning models such as RF, SVM, and DT with three-fold cross validation and obtained accuracies of 96.5% and 78.7% for WDBC and WPBC.
Deepti Sharma et al. [5] apply an ensemble classifier to the UCI WDBC dataset, using feature selection and feature scaling to balance the dataset. The Neural Network and Extra Tree (NN-ET) model gives 99.41% accuracy with 10-fold cross validation in predicting the classes.

Ramdas Kapila et al. [8] use the correlation coefficient and ANOVA for feature selection. UMAP, PCA and t-SNE are then used to extract features. After feature extraction, different ML models are trained and their predictions are combined using a voting method.

Vandana Kumari et al. [9] apply transfer learning to identify breast cancer from histological images of the breast. Three different models (deep convolutional neural network, Visual Geometry Group 16, and depthwise separable convolution) are used as base learners. The Invasive Ductal Carcinoma dataset gives an accuracy of 99.42% and the BreakHis dataset gives 99.12% accuracy.

Shtwai Alsubai et al. [10] use a deep neural network and Inception V3. Modified Scalable-Neighborhood Component Analysis is used for feature fusion, and Genetic Hyper-parameter Optimization is used to find the optimized hyperparameters.

Parampreet Kaur et al. [11] propose a stacking ensemble of Deep Neural Network, Gradient Boosting Machine, and Distributed Random Forest models. Bayesian optimization is employed for hyperparameter selection.

Mahesh T R et al. [14] use a majority voting ensemble model on Logistic Regression, Support Vector Machine and CART classifiers. The majority voting ensemble with K-fold cross validation achieves an accuracy of 99.3%.

III. MATERIALS AND METHODS
A. Dataset
This study makes use of the Wisconsin Breast Cancer (Diagnostic) dataset. The repository contains 569 samples with 30 features [2]. The first attribute is the id and the second is the class of the tissue (malignant or benign). The dataset comprises 357 benign and 212 malignant instances, as shown in Fig. 1.

The dataset includes ten important real-valued features for each cell nucleus: radius, compactness, concave points, perimeter, texture, symmetry, area, smoothness, concavity, and fractal dimension.

Fig. 1. Benign and Malignant Instances in Wisconsin Dataset

B. Pre-processing
Data collected from heterogeneous sources is messy and contains a lot of incomplete, noisy and inconsistent information. Working with such data may reduce the accuracy of the learning model. Data preprocessing is the foremost step for converting the raw data into an appropriate format for training the model.

C. Missing value: Cascade Imputation
Cascade imputation is a powerful technique that handles missing values by applying various imputation methods in a sequential manner, as shown in Fig. 2.

Fig. 2. Cascade Imputation

1. Identify the missing data in the dataset.
2. A sequence of imputation methods is used to fill the missing values.
3. The first imputed values are used as input for the next imputation method.
4. Repeat step 3 for the remaining imputation methods in the sequence.
5. Once all the imputation methods have been applied, the final imputed values are obtained by averaging the values obtained from each imputation method.

Table I shows the missing attribute values filled by Cascade Imputation.
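The averaging in step 5 can be illustrated with a minimal Python sketch. It is a simplification, not the authors' implementation: the function names `candidate_fills`, `cascade_value` and `cascade_impute` are hypothetical, each method is computed directly on the observed values (the chaining of step 3 is not modeled), and the toy column is invented for illustration. The only values taken from the paper are the mean, median and mode reported for compactness mean in Table I.

```python
from statistics import mean, median, mode


def candidate_fills(observed):
    """Steps 2-4 (simplified): one candidate fill value per imputation method."""
    return {"mean": mean(observed), "median": median(observed), "mode": mode(observed)}


def cascade_value(candidates):
    """Step 5: average the candidates produced by the individual methods."""
    return sum(candidates.values()) / len(candidates)


def cascade_impute(column):
    """Replace every None entry of `column` with the cascade value."""
    observed = [v for v in column if v is not None]  # step 1: locate missing data
    fill = cascade_value(candidate_fills(observed))
    return [fill if v is None else v for v in column]


# Step 5 checked against the compactness mean row of Table I.
table_row = {"mean": 0.1037, "median": 0.0921, "mode": 0.1147}
print(round(cascade_value(table_row), 4))  # 0.1035

# Hypothetical toy column with one missing entry.
toy_column = [1.0, 2.0, 2.0, None, 7.0]
imputed = cascade_impute(toy_column)
print(imputed)
```

The first print reproduces the Cascading Imputation entry that Table I reports for compactness mean.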

TABLE I. CASCADE IMPUTATION

| Missing value for the attributes | Mean | Median | Mode | Cascading Imputation |
|---|---|---|---|---|
| compactness mean | 0.1037 | 0.0921 | 0.1147 | 0.1035 |
| Texture_se | 1.2174 | 1.1095 | 0.8561 | 1.061 |
| Smoothness_worst | 0.0954 | 0.0978 | 0.4575 | 0.2169 |
| Concavity_se | 0.0312 | 0.0257 | 0.0124 | 0.0213 |

D. Data Standardization
The data values of the dataset differ greatly in scale, and the prediction performance of the model is significantly influenced by these differing dimensions. In this work, Z-score standardization was used to scale the dataset and keep all the variables on the same scale. Z-score standardization is characterized by a mean of 0 and a standard deviation of 1. It is calculated as in Eq. (1).

$$Z = \frac{X - M}{S} \tag{1}$$

where X is the original value, M represents the mean of the dataset and S is the standard deviation of the dataset. After data standardization, the correlation between each feature and the target variable is tested using Pearson's correlation coefficient [6,7].

E. One hot encoding
One hot encoding converts categorical data into a binary vector. The categorical variable to encode is identified and a list of all its possible values is created. For each possible value, a binary column is created. In the dataset, the categorical values are "B" and "M". After one hot encoding, the numerical value for "M" is 0 and for "B" is 1, as shown in Table II.

TABLE II. CATEGORICAL FEATURE TO NUMERICAL FEATURE

| Categorical Data | Binary |
|---|---|
| Malignant (M) | 0 |
| Benign (B) | 1 |

F. Feature Extraction and Selection
Principal Component Analysis (PCA) is applied for feature extraction [2,8]. The mean and standard deviation are computed to capture the feature characteristics, as shown in Eq. (2) and Eq. (3).

$$\mu = \frac{\sum V_p}{S_p} \tag{2}$$

$$SD = \sqrt{\frac{\sum (V_p - \mu)^2}{S_p}} \tag{3}$$

where SD is the standard deviation, V_p denotes each value from the population, µ is the mean of the population, and S_p is the size of the population.

Selecting features is an essential step in identifying the most valuable key features within the dataset. It minimizes unnecessary and redundant features, thereby resulting in an efficient model [13]. The genetic algorithm is used to identify the subset of features that gives better tissue prediction for breast cancer [10]. The genetic algorithm provides a high-quality optimization result by initializing the population to zero and then evaluating the fitness score of the possible solutions over a number of iterations to predict the features. The fittest solutions are selected and a new solution is generated by combining the features of the selected ones. Mutation is then done by flipping some of the bits in the binary vectors. The genetic algorithm iterates until it reaches the maximum specified number of generations.

G. Proposed Ensemble Model
In this paper, to combine the best properties of bagging and stacking, a hybrid ensemble algorithm is applied to predict breast cancer. After data preprocessing, feature extraction is done by principal component analysis. The best features are selected by the genetic algorithm. Five-fold cross validation is performed on the chosen top features of the dataset [14]. In each fold of the cross validation, the training data is split into two sections: 4/5 of the data for training the weak learners and 1/5 of the data for the Meta learner. Using the bootstrapping algorithm, data subsets are sampled from the 4/5 of the training data, and the weak learners are trained on these subsets. The weak learners are K-Nearest Neighbors (KNN), Random Forest (RF), Decision Tree (DT) and Support Vector Machine (SVM). The predictions of the weak learners are used as the training data set for the Meta learner (Logistic Regression). The final output from the meta learner is the class of the tumor, Malignant or Benign. Malignant is represented as 0 and Benign is represented as 1 in the output.

In bagging, each weak learner model is trained on a data subset generated by bootstrapping [6,12]. The predictions made by each weak learner model are aggregated at the end using the stacking technique to get the overall prediction [15]. Bagging is thus the combination of bootstrapping and aggregation. Stacking is an ensemble technique that uses various base classifiers and a Meta Learner. The base models are trained using the training dataset, and predictions are generated using the test dataset. The meta-model is trained using the base models' predictions as features [6,11,15]. With the help of stacking, the model gives better predictions than any of the individual models. Fig. 3 shows the structure of the proposed hybrid ensemble model.

H. Meta Learner
Logistic Regression is employed for predicting a categorical dependent variable based on a specified set of independent variables. The anticipated outcome is a categorical or discrete value, such as Yes or No, 0 or 1, true or false, and so forth. Instead of providing exact values of 0 and 1, it produces probabilistic values within the range of 0 to 1. Rather than fitting a straight regression line, Logistic Regression uses an "S"-shaped logistic function to predict two potential outcomes (0 or 1). The sigmoid function, a mathematical logistic function, is applied to transform predicted values into probabilities [2].

Proposed Algorithm:

1. Read the Wisconsin dataset.
2. Remove unnecessary attributes, outliers and null values from the dataset.
3. The missing attribute values are filled by Cascading imputation.
4. Z-score standardization is applied to normalize the data distribution.
5. Pearson's correlation coefficient is applied to find the correlation between each feature and the target variable.

6. Multicollinear features are removed.
7. The categorical output is changed to binary (Malignant-0 and Benign-1) by the one hot encoding method.
8. After preprocessing, feature extraction is done by principal component analysis.
9. Choose the top key dataset features by applying the genetic algorithm.
10. Five-fold cross validation is performed on the selected features.
11. For five folds, use four folds to train multiple base models (weak learners).
12. Use the base models to make predictions on the validation set.
13. Combine the predictions of the base models for each fold into a single dataset.
14. Train a meta learner on the combined dataset.
15. The Meta learner classifies the instance as Malignant or Benign with high accuracy and recall.
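Steps 4 to 7 of the proposed algorithm can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names `zscore`, `pearson_with_target` and `drop_collinear`, the collinearity threshold of 0.95, and the random toy data are all hypothetical. PCA (step 8) and the genetic algorithm (step 9) are not shown.

```python
import numpy as np


def zscore(X):
    """Step 4, Eq. (1): subtract each column's mean M and divide by its standard deviation S."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def pearson_with_target(X, y):
    """Step 5: Pearson correlation between every feature column and the target."""
    return np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])


def drop_collinear(X, threshold=0.95):
    """Step 6: keep a feature only if its |r| with every already kept feature is below threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return keep


# Hypothetical toy data: the second column is an exact linear copy of the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 2))
X = np.column_stack([base[:, 0], 2.0 * base[:, 0] + 1.0, base[:, 1]])

# Step 7: map the class labels as in Table II (Malignant -> 0, Benign -> 1).
labels = np.array(["M", "B"] * 25)
y = np.where(labels == "M", 0, 1)

Z = zscore(X)
target_corr = pearson_with_target(Z, y)
kept = drop_collinear(Z)
print("columns kept after the collinearity filter:", kept)
```

In the toy data the second column duplicates the first up to scale and offset, so the filter keeps only one of the two.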

Fig. 3. Proposed Hybrid Ensemble Model
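One possible way to assemble such a hybrid of bagging and stacking is sketched below with scikit-learn, which the paper lists among its libraries. The paper does not state which classes were used, so `BaggingClassifier` and `StackingClassifier` are assumptions standing in for the authors' implementation. The copy of the Wisconsin (Diagnostic) data returned by `load_breast_cancer`, the 80/20 held-out split, and all hyperparameters (5 bootstrap replicas, 20 trees, `max_iter=1000`) are likewise illustrative choices, and PCA and genetic feature selection are omitted.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# In scikit-learn's copy of the dataset, 0 = malignant and 1 = benign.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)


def bagged(estimator):
    """Bagging: train copies of `estimator` on bootstrap samples of the training data."""
    return BaggingClassifier(estimator, n_estimators=5, random_state=0)


weak_learners = [
    ("knn", bagged(KNeighborsClassifier())),
    ("rf", bagged(RandomForestClassifier(n_estimators=20, random_state=0))),
    ("dt", bagged(DecisionTreeClassifier(random_state=0))),
    ("svm", bagged(SVC())),
]

# Stacking: with cv=5 the weak learners are fitted on four folds and their
# predictions on the remaining fold become training data for the meta learner.
hybrid = make_pipeline(
    StandardScaler(),  # Z-score standardization
    StackingClassifier(
        estimators=weak_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    ),
)

hybrid.fit(X_train, y_train)
accuracy = hybrid.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

The score printed by this sketch depends on the split and settings chosen here and is not expected to reproduce the figures reported in Table III.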

IV. RESULT AND DISCUSSION

A. Experimental setup
Experiments were carried out on an 11th Generation Intel(R) Core(TM) i5-1155G7 processor. The system is equipped with 8 GB of RAM and runs the Windows 11 operating system. The programming environment for this research is Jupyter Notebook (Anaconda3) version 6.3.0. To implement the algorithms in this study, the following libraries are used: pandas, numpy, matplotlib, and scikit-learn.

B. Metrics for evaluating the proposed algorithm
The performance of each model is assessed through a confusion matrix in terms of True Positive (TP), False Positive (FP), True Negative (TN) and False Negative (FN). Accuracy, Precision, Recall, F1 Score and Error are computed as shown in Eq. (4), Eq. (5), Eq. (6), Eq. (7) and Eq. (8). Table III displays the scores of the proposed hybrid ensemble classifier in comparison to those of the state-of-the-art models. Fig. 4 shows the performance graph of the various classifiers.

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$Precision = \frac{TP}{TP + FP} \tag{5}$$

$$Recall = \frac{TP}{TP + FN} \tag{6}$$

$$F1\ Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} \tag{7}$$

$$Error = \frac{FP + FN}{TP + TN + FP + FN} \tag{8}$$
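The five metrics of Eq. (4) to Eq. (8) can be computed directly from the confusion-matrix counts, as in the following minimal sketch. The function name `classification_metrics` and the counts passed to it are hypothetical and are not taken from the paper's confusion matrix.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute Eq. (4)-(8) from the four confusion-matrix counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)                                # Eq. (5)
    recall = tp / (tp + fn)                                   # Eq. (6)
    return {
        "accuracy": (tp + tn) / total,                        # Eq. (4)
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (7)
        "error": (fp + fn) / total,                           # Eq. (8)
    }


# Hypothetical counts, for illustration only.
scores = classification_metrics(tp=70, tn=40, fp=2, fn=2)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Because Eq. (4) and Eq. (8) share the same denominator and their numerators together cover all four counts, Accuracy and Error always sum to 1. An accuracy of 98.77% therefore corresponds to an error of about 0.012 when both are expressed as fractions.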

TABLE III. COMPARISON WITH THE STATE-OF-THE-ART METHODS

| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1 Score (%) |
|---|---|---|---|---|
| SVM | 98.07 | 98.28 | 97.61 | 97.92 |
| RF | 94.20 | 93.91 | 93.65 | 93.78 |
| KNN | 96.84 | 97.22 | 96.04 | 96.58 |
| DT | 94.20 | 93.99 | 93.56 | 93.77 |
| LR | 98.42 | 98.55 | 98.07 | 98.30 |
| Gradient Boosting (GB) | 95.78 | 95.57 | 95.39 | 95.48 |
| Proposed Hybrid Ensemble | 98.77 | 98.83 | 98.54 | 98.68 |

The Error (%) of the SVM, RF, KNN, DT, LR and GB models is 0.019, 0.058, 0.061, 0.058, 0.015 and 0.042 respectively. The proposed hybrid ensemble model obtained the lowest error rate of 0.012%.

Fig. 4. Performance of various Classifiers

Fig. 5. Confusion matrix of proposed Hybrid model

Fig. 6. ROC Curve of proposed Hybrid model

The predictive performance of the proposed model is measured by the confusion matrix shown in Fig. 5. Fig. 6 represents the ROC curve of the proposed hybrid model. The ROC is a probability curve that indicates the degree to which the model can distinguish between classes.

V. CONCLUSION AND FUTURE SCOPE
Breast cancer stands out as the most common form of cancer globally, claiming precious lives prematurely. However, the timely identification of breast cancer holds the potential to diminish mortality rates and safeguard valuable lives. Machine learning has become increasingly ubiquitous in the medical domain, offering applications across various extensive datasets. In this investigation, diverse machine learning models are implemented to predict breast cancer. By harnessing the variety of multiple classifiers, ensemble methods exhibit superior accuracy and enhanced generalization in contrast to individual classifiers.

In this study, breast cancer is predicted accurately by the hybrid ensemble model. To mitigate overfitting, a 5-fold cross-validation approach is implemented. The proposed hybrid ensemble model achieves the highest accuracy at 98.7%, precision at 98.83%, recall at 98.54%, an F1 score of 98.68%, and a minimal error rate of 0.012%. The experimental findings demonstrate that the proposed ensemble approach surpasses the other methods.

The Wisconsin Breast Cancer dataset has a limited number of attributes. This research can be extended by using deep learning models with breast ultrasound (US) video, mammogram-based image segmentation and a multi-scale attention mechanism. In addition, convolutional neural networks (CNN), BERT, Long Short-Term Memory (LSTM), and optimization techniques can be considered.
REFERENCES
[1] Zhuorong Chen, Xumeng Gong, Chun Cheng, Yinghui Fu, Wanming Wu, Zhihui Luo, "Circ_0001777 Affects Triple-negative Breast Cancer Progression Through the miR-95-3p/AKAP12 Axis", Clinical Breast Cancer, Volume 23, Issue 2, February 2023, Pages 143-154.
[2] Khandaker Mohammad Mohi Uddin, Nitish Biswas, Sarreha Tasmin Rikta, Samrat Kumar Dey, "Machine learning-based diagnosis of breast cancer utilizing feature optimization technique", Computer Methods and Programs in Biomedicine Update, Volume 3, 2023, 100098.
[3] Furkan Atban, Ekin Ekinci, Zeynep Garip, "Traditional machine learning algorithms for breast cancer image classification with optimized deep features", Biomedical Signal Processing and Control, Volume 81, March 2023, 104534.
[4] Siddharth Raj Gupta, "Prediction time of breast cancer tumor recurrence using Machine Learning", Cancer Treatment and Research Communications, Volume 32, 2022, 100602.
[5] Deepti Sharma, Rajneesh Kumar, Anurag Jain, "Breast cancer prediction based on neural networks and extra tree classifier using feature ensemble learning", Measurement: Sensors, Volume 24, December 2022, 100560.
[6] Mana Saleh Al Reshan, Samina Amin, Muhammad Ali Zeb, Adel Sulaiman, Hani Alshahrani, Ahmad Taher Azar, Asadullah Shaikh, "Enhancing Breast Cancer Detection and Classification Using Advanced Multi-Model Features and Ensemble Machine Learning Techniques", MDPI, Life, 2023, https://doi.org/10.3390/life13102093
[7] Rahul Kumar Yadav, Pardeep Singh, Poonam Kashtriya, "Diagnosis of Breast Cancer using Machine Learning Techniques - A Survey", Procedia Computer Science, Volume 218, 2023, Pages 1434-1443, https://doi.org/10.1016/j.procs.2023.01.122
[8] Ramdas Kapila, Sumalatha Saleti, "An efficient ensemble-based Machine Learning for breast cancer detection", Biomedical Signal Processing and Control, Volume 86, Part B, September 2023, 105269.
[9] Vandana Kumari, Rajib Ghosh, "A magnification-independent method for breast cancer classification using transfer learning", Healthcare Analytics, Volume 3, November 2023, 100207.
[10] Shtwai Alsubai, Abdullah Alqahtani, Mohemmed Sha, "Genetic hyperparameter optimization with Modified Scalable-Neighbourhood Component Analysis for breast cancer prognostication", Neural Networks, Volume 162, May 2023, Pages 240-257.
[11] Parampreet Kaur, Ashima Singh, Inderveer Chana, "BSense: A parallel Bayesian hyperparameter optimized Stacked ensemble model for breast cancer survival prediction", Journal of Computational Science, Volume 60, April 2022, 101570.
[12] Varshali Jaiswal, Praneet Saurabh, Umesh Kumar Lilhore, Mayank Pathak, Sarita Simaiya, Surjeet Dalal, "A breast cancer risk predication and classification model with ensemble learning and big data fusion", Decision Analytics Journal, Volume 8, September 2023, https://doi.org/10.1016/j.dajour.2023.100298
[13] Jnanendra Prasad Sarkar, Indrajit Saha, Anasua Sarkar, Ujjwal Maulik, "Machine learning integrated ensemble of feature selection methods followed by survival analysis for predicting breast cancer subtype specific miRNA biomarkers", Computers in Biology and Medicine, Volume 131, April 2021, 104244.
[14] Mahesh T R, Vinoth Kumar V, Dhilip Kumar V, Oana Geman, Martin Margala, Manisha Guduri, "The stratified K-folds cross-validation and class-balancing methods with high-performance ensemble classifiers for breast cancer classification", Healthcare Analytics, Volume 4, December 2023, https://doi.org/10.1016/j.health.2023.100247
[15] Mina Samieinasab, S. Ahmad Torabzadeh, Arman Behnam, Amir Aghsami, Fariborz Jolai, "Meta-health stack: A new approach for breast cancer prediction", Healthcare Analytics, Volume 2, November 2022, https://doi.org/10.1016/j.health.2021.100010
