
2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT-2019), May 17th & 18th, 2019

Efficient Breast Cancer Prediction Using Ensemble Machine Learning Models

Naveen, School of VLSI and Embedded Systems Design, NIT Kurukshetra, [email protected]
Dr. R. K. Sharma, Department of Electronics and Communication, NIT Kurukshetra (India), [email protected]
Dr. Anil Ramachandran Nair, R&D Divisions, Toshiba Software Pvt. Ltd., Bangalore, [email protected]

Abstract— Breast cancer is the second most commonly diagnosed cancer in the world. When the growth of breast tissue is out of control, it is called breast cancer. Breast cancer prediction and prognosis are major challenges for the medical community, and breast cancer is a prominent cause of death among women. Recurrence of cancer is one of the biggest fears of cancer patients and can affect their quality of life. The aim of this research is to predict breast cancer from early cancer features with high accuracy. The breast cancer Coimbra dataset, taken from UCI (University of California Irvine) [1, 5], is used to build the most efficient ensemble machine learning models. The major steps we follow are feature scaling, cross validation, and various ensemble machine learning models with the bagging technique. The decision tree and KNN give the highest accuracy of 100%. The decision tree model gives 100% accuracy when the train-test dataset is split in a ratio of 90:10 and 300 bags of trees are used. KNN gives a maximum accuracy of 100% for k = 1 to 7 in seven loops, with 90% of the data used for training and 10% for testing, where k is the number of nearest neighbors. We also evaluate the predictions with the accuracy, confusion matrix, and classification report. Our aim is to build the most accurate and efficient machine learning model, so that, based on the prediction results, patients can take treatment at an early stage.

Keywords— Breast Cancer, Significant Feature, Ensemble Machine Learning Models, Accuracy, Confusion Matrix, Classification Report.

I. INTRODUCTION

Breast cancer is one of the most frequent cancers in women around the world, and it may occur in men as well. Breast cancer develops from breast tissue. In the United States, one out of eight women develops breast cancer in her lifetime. Studies show that women between the ages of 45 and 59 have a higher chance of developing the disease. Breast cancer is the second most commonly diagnosed cancer; in the US, more than 266,000 cases are diagnosed every year. According to the Global Health Estimates in the WHO 2013 report, more than 508,000 women died in 2011 due to breast cancer. Approximately 50% of breast cancer cases and 58% of deaths occur in developing nations (GLOBOCAN 2008 report) [9]. Breast cancer survival rates differ worldwide: survival reaches 80% or more in North America, Sweden, and Japan, falls to around 60% in developing countries, and is less than 40% in undeveloped countries due to the lack of early prediction [4, 6].

Previous work shows that a variety of research has been done on breast cancer prediction with different machine learning approaches. However, for the Coimbra breast cancer dataset we could find only a few studies, which report roughly 65-85% accuracy with different machine learning models. Here we use the ensemble bagging technique in our machine learning models, which gives a maximum accuracy of 100% with the decision tree and KNN when the data splitting ratio is 90:10.

Breast cancer can be cured with current medical treatments and new innovative techniques [1, 16]. Predicting breast cancer with high accuracy is therefore very useful, since patients can get treatment on time and survive. In this approach, we first apply feature importance techniques to the breast cancer dataset to select the most significant features. Then we perform standard scaling, with zero mean and a standard deviation of one, to bring all features into a common range. After that, we feed the scaled features to various ensemble machine learning models. We use 10-fold cross validation to train the ML models on each fold, which reduces overfitting and makes the models more efficient. We also use the bagging technique in the ML models: bagging creates bags of samples, where a bag may contain repeated samples, each bag is evaluated by a separately trained ML model, voting is performed over the per-bag predictions, and the class with the highest vote is taken as the final prediction. We evaluate the prediction results with the confusion matrix, classification report, and accuracies. Models developed with these techniques are very helpful for the medical community in making the right decisions.

II. DATASET

The breast cancer Coimbra dataset, taken from UCI (University of California Irvine) [1, 17], is used to develop the most efficient ensemble machine learning models. The dataset contains 116 samples in two groups (52 samples from healthy people and 64 samples from breast cancer patients). It consists of nine features and two classes, healthy and patient. The nine features are: Age (years), BMI (kg/m2), Glucose (mg/dL), Leptin (ng/mL), Adiponectin (µg/mL), Resistin (ng/mL), Insulin (µU/mL), HOMA, and MCP-1 (pg/dL). In this approach, we perform feature importance on the breast cancer dataset to select its significant features.

The line plot in Fig. 1 shows the behavior of the Coimbra breast cancer dataset. We can see that the MCP-1 attribute varies strongly, Glucose varies the least, and the remaining features are roughly linear. To reduce this variance across features, we perform feature scaling with the standard scale.
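To make the dataset description concrete, here is a minimal loading sketch, assuming the Coimbra data has been downloaded from the UCI repository as a local CSV; the file name dataR2.csv and the Classification label column (1 = healthy, 2 = patients) follow the UCI distribution but should be verified against the actual download:

```python
import pandas as pd

# Load the Coimbra dataset (assumed to be downloaded from UCI as "dataR2.csv").
data = pd.read_csv("dataR2.csv")

# The "Classification" column holds the class label:
# 1 = healthy controls, 2 = breast cancer patients (per the UCI description).
X = data.drop(columns=["Classification"])
y = data["Classification"]

print(X.shape)              # expected: (116, 9) -> 116 samples, 9 features
print(y.value_counts())     # expected: 52 healthy vs. 64 patients
print(X.columns.tolist())   # the nine clinical features
```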



Fig. 1: Line plot of the breast cancer dataset.

III. PROCEDURE AND METHODS

To classify the breast cancer dataset, a hybrid technique with three steps has been proposed. In the first step we normalize the features with the standard scale to keep them in a common range; in the second step we build the different ensemble machine learning models with cross validation and the bagging technique; and in the third step we evaluate the prediction results with the accuracy, confusion matrix, and classification report, and thus conclude which model is the most efficient for breast cancer prediction. Classification report values lie in the range (0.0, 1.0). The confusion matrix is a heat map of the actual class against the predicted class.

Fig. 2: Procedure of the ensemble ML model approach (loading the breast cancer dataset with classes 52:64, feature importance with the extra tree classifier, normalization stage with standard scaling, cross validation with 10 folds, classification of breast cancer using the ensemble machine learning approach, and output into the healthy group and the patients group).

A. Significant Features

Feature importance improves the model's performance. It is quite important to know the effect of a certain feature on the model's performance, and some features are so unimportant that they can reduce it. Significant features are evaluated with the extra tree classifier algorithm using 250 estimator trees in the forest. This extra tree classifier works on the standard deviation of the data, and the indices of this algorithm sort the features in decreasing order of ranking.

Fig. 3: Significant features, ranked.

From the significant features plot, we see that Glucose is the most significant feature for predicting breast cancer and MCP-1 is the least significant one. The table below shows the importance of the features in decreasing order of ranking.

Feature No.   Feature Name    Contribution to Predict Class (%)
2             Glucose         16.85%
0             Age             15.23%
7             Resistin        13.74%
1             BMI             11.14%
3             Insulin         09.55%
4             HOMA            09.47%
5             Leptin          08.56%
6             Adiponectin     08.55%
8             MCP-1           06.91%

Table 1: Feature significance ranking in decreasing order.
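A minimal sketch of this ranking step, assuming scikit-learn and the X, y frames from the loading snippet above; the 250 estimators follow the text, while the random seed is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

# Rank features with an extra-trees forest of 250 estimators, as described above.
forest = ExtraTreesClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# Sort feature indices by importance in decreasing order of ranking.
importances = forest.feature_importances_
order = np.argsort(importances)[::-1]

for idx in order:
    print(f"{idx}  {X.columns[idx]:<12}  {importances[idx] * 100:.2f}%")
```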
B. Standard Scale or Z-Normalization

Features may differ in scale or units, which makes it difficult for a classifier or regressor to give optimal results. The way to overcome this difficulty is to scale them into one specific range. Here we see that MCP-1 is a highly varying feature, so we normalize the data with standard scaling, also called z-score normalization. For standard scaling, we first subtract the feature's mean from each value, to bring the mean to around zero, and then divide by its standard deviation, to bring the standard deviation close to one:

    Z_i = (X_i - mean(X)) / Stdev(X)        (1)
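Equation (1) corresponds to scikit-learn's StandardScaler; a minimal sketch, again assuming the X frame from the earlier snippets (fitting on the full dataset here is only to illustrate the transform; in the pipeline it would be fit on the training split):

```python
from sklearn.preprocessing import StandardScaler

# Z-normalize every feature: subtract the column mean and divide by the column
# standard deviation, so each scaled column has mean ~0 and std ~1 (eq. 1).
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0).round(3))  # ~0 for every feature
print(X_scaled.std(axis=0).round(3))   # ~1 for every feature
```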
C. Cross Validation (CV)

Cross validation is a re-sampling method used to evaluate machine learning models on a constrained set of data samples. Generally k is chosen as 10 folds. The general strategy followed in cross validation is:
- Shuffle the dataset randomly.

- Split the dataset into k groups (k = 10).
- Every group holds a different subset of the samples.
- Summarize the results over the groups to obtain the model accuracy.

Fig. 4: Working of the 10-fold cross validation.
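One way to express this 10-fold procedure with scikit-learn's KFold and cross_val_score is sketched below, assuming the scaled features from the earlier snippets; wrapping a bagged decision tree is an illustrative choice of model, and the shuffle seed is an assumption:

```python
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 10-fold cross validation: shuffle, split into k = 10 groups, score each fold.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=300)
scores = cross_val_score(model, X_scaled, y, cv=cv)

print(scores.round(3))         # accuracy of each of the 10 folds
print(scores.mean().round(3))  # summarized model accuracy
```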

D. Machine Learning (ML) Approach

Machine learning (ML) techniques use statistics, probability, Boolean logic, and unconventional optimization techniques to build an ensemble machine learning classifier and predict the class based on the highest vote.

1) Decision Tree:
Decision tree algorithms use a hierarchical tree approach in which every node represents a feature, every branch represents a decision, and every leaf represents an outcome (class). In this approach we predict breast cancer with 300 trees and split the dataset 90:10 into train and test sets. With these settings the model gives 100% accuracy.
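A minimal sketch of this bagged decision tree, assuming the scaled data from the earlier snippets; the 300 estimators and the 90:10 split follow the text, while the random seeds are illustrative:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# 90:10 train/test split, as used in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.10, random_state=0)

# Bagging ensemble of 300 decision trees ("300 bags of trees").
bagged_tree = BaggingClassifier(DecisionTreeClassifier(), n_estimators=300,
                                random_state=0)
bagged_tree.fit(X_train, y_train)

y_pred = bagged_tree.predict(X_test)
print("Decision tree (bagged) accuracy:", accuracy_score(y_test, y_pred))
```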
2) Support Vector Machine (SVM):
A support vector machine (SVM) is a discriminative classifier characterized by a separating hyperplane. In two-dimensional space this hyperplane is a line dividing the plane into two sections, with each class lying on either side. SVM gives 83.33% accuracy with a linear kernel.

3) Multilayer Perceptron:
A multilayer perceptron (MLP) has a number of layers; the two terminating layers are the input and output layers, and the intermediate layers are called hidden layers. It uses an activation function in all neurons. Here we use 60 hidden layers, each with 10 neurons and different weights. MLP gives 83.33% accuracy.

    y = f( Σ_i w_i x_i + b )        (2)

where x_i are the inputs of the incoming layer, w_i are the weights of the hidden-layer neurons, and b is the initial weight (bias).

4) K-Nearest Neighbor (KNN):
KNN is very useful for large datasets and does not require an explicit mathematical model. In the worst case, KNN needs more memory to check all the data points. Here we used k = 1 to 7 neighbors in 7 random loops, for a total of 7 x 7 = 49 evaluations. Its maximum accuracy is 100% at k = 5 in loop 3, and its average accuracy over all loops is 89.29%. The distance between points is the Euclidean distance:

    d(a, b) = sqrt( Σ_i (a_i - b_i)^2 )        (3)

where a and b are the coordinates of the two points.

Fig. 5: K-nearest neighbor (KNN) model.
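The k = 1 to 7 sweep over seven random splits can be sketched as below, assuming the scaled data from the earlier snippets; wrapping each KNN in a bagging ensemble follows the paper's general bagging approach, while the seeds are illustrative and the exact accuracies will differ:

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

scores = []
# 7 random 90:10 splits ("loops") and k = 1..7 neighbors in each: 49 evaluations.
for loop in range(7):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_scaled, y, test_size=0.10, random_state=loop)
    for k in range(1, 8):
        knn = BaggingClassifier(KNeighborsClassifier(n_neighbors=k))
        knn.fit(X_tr, y_tr)
        acc = accuracy_score(y_te, knn.predict(X_te))
        scores.append(acc)
        print(f"loop={loop + 1}  k={k}  accuracy={acc:.3f}")

print("maximum accuracy:", max(scores))
print("average accuracy:", sum(scores) / len(scores))
```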

5) Logistic Regression:
Logistic regression is a statistical machine learning model that classifies the data by placing the outcome variable on the extreme-end classes and fitting a logistic curve that separates them. Logistic regression gives 91.67% accuracy in our approach.

    P(y | x) = e^(α + βx) / (1 + e^(α + βx))        (4)

where α and β are the model parameters.

6) Random Forest:
Random forest is an ensemble machine learning algorithm built from decision trees. It builds a set of decision trees on randomly chosen samples, gets a prediction from each tree, and chooses the best prediction result by voting across the trees. In our approach it gives 90.91% accuracy.

Fig. 6: Random forest model.

IV. ALGORITHM

For this approach we use Anaconda Navigator 3 (64-bit) for Python programming. We follow these algorithmic steps (a minimal sketch of the pipeline is given after the list):

1) Import all the modules for feature selection, normalization, data splitting, the ML models, the accuracy score, the confusion matrix, the classification report, and any other required modules.
2) Load the breast cancer dataset.
3) Divide the dataset into features and class.
4) Check the significant features for prediction of the class.
5) Normalize the features into one range with standard scaling.
6) Split the dataset into training and testing sets in a 90:10 ratio.
7) Build the various machine learning models using the bagging technique with 10-fold cross validation and different numbers of estimator trees.
8) Print the accuracy and classification reports of the different models by comparing the true and predicted classes.
9) Plot the confusion matrices of the different models, comparing the true and predicted classes.
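A minimal end-to-end sketch of steps 1) to 9) is given below for the bagged decision tree only; the file name, label column, and seeds are assumptions carried over from the earlier snippets, and the other models from Section III.D can be substituted in the same way:

```python
# 1) Imports: feature selection, normalization, splitting, model, metrics.
import pandas as pd
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix)

# 2) Load the breast cancer dataset, 3) divide it into features and class.
data = pd.read_csv("dataR2.csv")
X = data.drop(columns=["Classification"])
y = data["Classification"]

# 4) Check the significant features.
ranker = ExtraTreesClassifier(n_estimators=250, random_state=0).fit(X, y)
print(dict(zip(X.columns, ranker.feature_importances_.round(3))))

# 5) Normalize the features with standard scaling.
X_scaled = StandardScaler().fit_transform(X)

# 6) 90:10 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.10, random_state=0)

# 7) Bagged model evaluated with 10-fold cross validation on the training data.
model = BaggingClassifier(DecisionTreeClassifier(), n_estimators=300,
                          random_state=0)
print("10-fold CV accuracy:",
      cross_val_score(model, X_train, y_train, cv=10).mean())

# 8) + 9) Fit, then report accuracy, classification report, and confusion matrix.
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
```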
V. EXPERIMENTAL RESULTS

A. Accuracy

Accuracy is inversely proportional to the difference between the true class and the predicted class: when a model predicts the same class as the true class, the model is highly accurate. The accuracies of the various machine learning models are listed below.

Machine Learning Models           Accuracy (without ensemble)   Accuracy (with ensemble)
Decision Tree with 300 Trees      66.67 %                       100.0 %
Support Vector Machine (SVM)      83.33 %                       83.33 %
Multilayer Perceptron (MLP)       66.67 %                       83.33 %
K-Nearest Neighbor - Average      70.58 %                       83.33 %
K-Nearest Neighbor - Maximum      89.90 %                       100.0 %
Logistic Regression               75.00 %                       91.67 %
Random Forest                     75.00 %                       90.91 %

Table 2: Accuracy of the ML models.
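The with/without ensemble comparison of Table 2 can be reproduced in outline with a loop like the one below, assuming the 90:10 split from the decision tree snippet; the hyperparameters are illustrative defaults rather than the exact settings behind the reported numbers, and the MLP is omitted for brevity:

```python
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

base_models = {
    "Decision Tree": DecisionTreeClassifier(),
    "SVM (linear)": SVC(kernel="linear"),
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(),
}

for name, model in base_models.items():
    # Plain ("without ensemble") model on the same 90:10 split.
    plain = model.fit(X_train, y_train)
    plain_acc = accuracy_score(y_test, plain.predict(X_test))

    # Same model wrapped in a bagging ensemble ("with ensemble").
    bagged = BaggingClassifier(model, n_estimators=300).fit(X_train, y_train)
    bagged_acc = accuracy_score(y_test, bagged.predict(X_test))

    print(f"{name:20s}  without: {plain_acc:.2%}  with: {bagged_acc:.2%}")
```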

In this approach the decision tree and KNN give the highest accuracy. KNN gives a maximum accuracy of 100% when we evaluate the first seven neighbors (k = 1 to 7) in seven loops, 7 x 7 = 49 evaluations in total; at k = 5 in loop 3 it gives 100% accuracy, and its average accuracy is 83.33% with a 90:10 train/test split. The accuracy is 89.17% at k = 10. If the training and test datasets are split 80:20, the maximum accuracy of the KNN models is 87.5%.

Fig. 7: Accuracy plot of KNN with k = 1 to 7.

B. Confusion Matrix

A confusion matrix is an outline of the predictions. The numbers of accurate and inaccurate predictions are summarized with count values and broken down by class. The confusion matrix shows the ways in which an ML model is confused when it predicts: it gives us intuition not only about the inaccuracies of the classifier but also about the types of errors and the classes in which they occur.

The confusion matrices of the different ML models are shown in Fig. 8.

Fig. 8: Confusion matrices of the ensemble ML models.

C. Classification Report

The classification report gives information about the precision, recall, F1-score, and sample support of the models.

It therefore helps in identifying problems. All heat maps are in the range (0.0, 1.0). The classification report describes the behaviour of each model. Among all the models, KNN and the decision tree algorithm give the highest precision, recall, F1-score, and support samples, as shown in the table below.

Classification Reports

ML Models             Class       Precision   Recall   F1-Score   Support Samples
Decision Tree         C-1         1.00        1.00     1.00       7
                      C-2         1.00        1.00     1.00       5
                      avg/total   1.00        1.00     1.00       12
SVM                   Healthy     0.71        1.00     0.83       5
                      Patients    1.00        0.71     0.83       7
                      avg/total   0.88        0.83     0.83       12
MLP                   Healthy     0.80        0.80     0.80       7
                      Patients    0.86        0.86     0.86       5
                      avg/total   0.83        0.83     0.83       12
KNN                   Healthy     1.00        1.00     1.00       7
                      Patients    1.00        1.00     1.00       5
                      avg/total   1.00        1.00     1.00       12
Logistic Regression   Healthy     1.00        0.89     0.94       9
                      Patients    0.75        1.00     0.86       3
                      avg/total   0.88        0.94     0.90       12
Random Forest         Healthy     1.00        0.89     0.94       9
                      Patients    0.75        1.00     0.86       3
                      avg/total   0.88        0.94     0.90       12

Table 3: Classification reports of the ensemble ML models.
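A minimal sketch of how the classification reports and confusion-matrix heat maps can be produced, assuming the 90:10 split and the bagged decision tree from the earlier snippets; the bagged k = 5 KNN and the seaborn heat map are illustrative choices, not necessarily the authors' exact plotting code:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix

models = {
    "Decision Tree (bagged)": bagged_tree,  # fitted in the earlier snippet
    "KNN (bagged, k=5)": BaggingClassifier(
        KNeighborsClassifier(n_neighbors=5)).fit(X_train, y_train),
}

for name, clf in models.items():
    y_pred = clf.predict(X_test)

    # Text report: precision, recall, F1-score, and support for each class.
    print(name)
    print(classification_report(y_test, y_pred,
                                target_names=["Healthy", "Patients"]))

    # Confusion matrix rendered as a heat map of true class vs. predicted class.
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=["Healthy", "Patients"],
                yticklabels=["Healthy", "Patients"])
    plt.title(name)
    plt.xlabel("Predicted class")
    plt.ylabel("True class")
    plt.show()
```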
Hassan Khan, “An Evaluation of Machine Learning Classifiers and
Ensembles for Early Stage Prediction of Lung Cancer”, IEEE 2018, 3rd
VI. CONCLUSIONS AND FUTURE WORKS International Conference on Emerging Trends in Engineering, Sciences
and Technology (ICEEST).
Ensemble ML models predict the breast cancer with high [13] Y.K. Anupama, S. Amutha, and D. R. Ramesh Babu, “Survey on data
accuracy compare with without ensemble models. Ensemble mining techniques for diagnosis and prognosis of breast cancer” Int. J.
model improves the system performance with un-biasing. Recent and Innovation Trends in Computing and Communication, vol.5,
pp. 33–37, 2017.
Here we used different six machine learning algorithms such
[14] C. E. DeSantis, J.Ma, A. Sauer Goding, L. A. Newman and A. Jemal,
as decision tree, support vector machine, multilayer “Breast cancer statistics, 2017, racial disparity in mortality by state”,
perceptron, K- nearest neighbors, logistics regression and CA: a cancer journal for clinicians, vol. 67, no 6, pp. 439-448. 2017.
random forest and compare its prediction evaluation with [15] Joana Crisostomo, Paulo Matafome, Daniela Santos-Silva, Ana L.
ensemble and without ensemble techniques. Decision tree and Gomes, Manuel Gomes, Miguel Patrı´cio, Liliana Letra, Ana B.
KNN gives 100% accuracy with ensemble technique. K- Sarmento Ribeiro, Lelita Santos, Raquel Seica, “Hyperresistinemia and
metabolic dysregulation: a risky crosstalk in obese breast cancer ”,
nearest neighbor gives maximum accuracy as 100% when we Springer Science Business Media New York 2016.
evaluate it for first seven (k= 1 to 7) neighbors in seven [16] Abreu, Pedro Henriques and Santos, Miriam Seoane and Abreu, Miguel
random loops, It is total 7X7=49 loops, in k=5 and loop=3 its Henriques and Andrade, Bruno and Silva, Daniel Castro, “Predicting
gives 100% accuracy with 90:10 of train and test split dataset. Breast Cancer Recurrence Using Machine Learning Techniques: A
Systematic Review, ACM Comput. Survey”, Volume 49 Issue 3,
It accuracy is 89.17% when we k=7 neighbors. If training and December 2016.
test dataset are 80:20 then accuracy of KNN models is 87.5%. [17] J. A. Cruz and D. S. Wishart, “Applications of machine learning in
As extension of this work, we can provide this model to cancer prediction and prognosis”, Cancer Informatics, vol. 2, pp. 59–77,
medical community, there doctor or diagnosis people put 2006.

104
