Efficient Breast Cancer Prediction Using Ensemble Machine Learning Models
18th 2019
Abstract— Breast cancer is the second most commonly diagnosed cancer in the world. It occurs when the growth of breast tissue is out of control. Breast cancer prediction and prognosis are major challenges for the medical community, and breast cancer is a prominent cause of death for women. Recurrence of cancer is one of the biggest fears of cancer patients and can affect their quality of life. The aim of this research is to predict breast cancer from early cancer features with high accuracy. The Breast Cancer Coimbra dataset, taken from UCI (University of California, Irvine) [1, 5], is used to build efficient ensemble machine learning models. The major steps we follow are feature scaling, cross validation and various ensemble machine learning models with the bagging technique. Decision tree and KNN give the highest accuracy, 100%. The decision tree model gives 100% accuracy when we split the dataset into train and test sets in a 90:10 ratio and use 300 bags of trees. KNN gives a maximum accuracy of 100% for k = 1 to 7 in seven loops, with 90% of the data used for training and 10% for testing. Here k is the number of nearest neighbors. We also evaluate the predictions by accuracy, confusion matrix and classification report. Our aim is to build the most accurate and efficient machine learning model so that, based on the prediction result, patients can receive treatment at an early stage.

Keywords— Breast Cancer, Significant Feature, Ensemble Machine Learning Models, Accuracy, Confusion Matrix, Classification Report.

I. INTRODUCTION
Breast cancer is one of the most frequent types of cancer in women around the world, and it may also occur in men. Breast cancer originates in breast tissue. In the United States, one out of eight women develops breast cancer in her lifetime. The study shows that women aged between 45 and 59 have a higher chance of developing the cancer. Breast cancer is the second most commonly diagnosed cancer. In the US, more than 266,000 cases of breast cancer are diagnosed every year. According to the WHO Global Health Estimates 2013 report, more than 508,000 women died in 2011 due to breast cancer. Approximately 50% of breast cancer cases and 58% of deaths occur in developing nations (GLOBOCAN 2008 Report) [9]. Breast cancer survival rates differ worldwide, reaching 80% or over in North America, Sweden and Japan, around 60% in developing countries and less than 40% in undeveloped countries due to a lack of early prediction [4, 6].
Previous works show that a variety of research has been carried out on breast cancer prediction with different machine learning approaches. However, for the Coimbra breast cancer dataset we could find only a few studies. On this dataset, different machine learning models give nearly 65-85% accuracy. Here we use the ensemble bagging technique in machine learning models, which gives a maximum accuracy of 100% with decision tree and KNN when the data splitting ratio is 90:10.
Breast cancer can be cured with current medical treatments and new innovative techniques [1, 16]. It is very useful to predict breast cancer early with high accuracy, since patients can then get treatment on time and survive. In this approach, we first apply feature importance techniques to the breast cancer dataset to select the most significant features. Then we perform standard scaling, to zero mean and a standard deviation of one, so that all features lie in a common range. After that we send the scaled features to various ensemble machine learning models. Here, we use 10-fold cross validation to train the ML models on each fold; this reduces overfitting, so the models become more efficient. We also use the bagging technique in the ML models. Bagging creates bags of samples, and each bag may contain identical samples. Each bag is tested by a differently trained ML model, we perform voting over the bag prediction results, and the class with the highest vote is our final prediction. We also evaluate the prediction results with confusion matrix, classification report and accuracy. Models developed through these techniques are very helpful for the medical community in making the right decisions.

II. DATASET
The Breast Cancer Coimbra dataset, taken from UCI (University of California, Irvine) [1, 17], is used to develop efficient ensemble machine learning models. The dataset contains 116 samples in two sets (52 samples from healthy people and 64 samples from breast cancer patients). It consists of nine features and two classes, healthy and patients. The nine features are: Age (years), BMI (kg/m²), Glucose (mg/dL), Leptin (ng/mL), Adiponectin (µg/mL), Resistin (ng/mL), Insulin (µU/mL), HOMA and MCP-1 (pg/dL). In this approach, we performed feature importance on the breast cancer dataset to select the significant features.
The line plot in Fig. 1 shows the behavior of the Coimbra breast cancer dataset. Here we see that the MCP-1 attribute is highly varying, while Glucose varies less. Apart from these, all features are linear data. To reduce this variance between features, we perform feature scaling with the standard scaler.
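The standard scaling step described above (zero mean, unit standard deviation per feature) can be illustrated with a minimal pure-Python sketch. The function name `standard_scale` and the sample values are hypothetical and are not taken from the Coimbra dataset; a library scaler would normally be used instead.

```python
from statistics import fmean, pstdev

def standard_scale(column):
    """Scale one feature column to zero mean and unit standard deviation."""
    mean = fmean(column)
    std = pstdev(column)  # population standard deviation
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(value - mean) / std for value in column]

# Hypothetical values for two features with very different spreads.
glucose = [70.0, 92.0, 91.0, 77.0, 103.0]
mcp1 = [417.1, 468.8, 554.7, 928.2, 773.9]

scaled = {
    "Glucose": standard_scale(glucose),
    "MCP-1": standard_scale(mcp1),
}
for name, values in scaled.items():
    print(name, [round(v, 2) for v in values])
```

After scaling, both columns have mean 0 and standard deviation 1, so a highly varying feature such as MCP-1 no longer dominates distance-based models like KNN.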
• Split the dataset into k groups (k = 10).
• Every group contains different samples.
• Summarize the results of each group to obtain the model accuracy.

1) Decision Tree:
Decision tree algorithms use a hierarchical tree approach, where every node represents a feature, every branch represents a decision and every leaf represents an outcome (class). In this approach we predict breast cancer with 300 trees and the dataset split 90:10 into train and test sets. With this configuration the model gives 100% accuracy.

2) Support Vector Machine (SVM):
A support vector machine (SVM) is a discriminative classifier characterized by a separating hyperplane. In two-dimensional space this hyperplane is a line dividing the plane into two sections, with each class lying on either side. SVM gives 83.33% accuracy with a linear kernel.

3) Multilayer Perceptron:
A multilayer perceptron (MLP) has a number of layers: the two terminating layers are called the input and output layers, and the intermediate layers are called hidden layers. It uses an activation function in all neurons. Here we use 60 hidden layers, and each layer has 10 neurons with different weights. MLP gives 83.33% accuracy.

$f(x) = \sum_{i} w_i x_i + b$    (2)

where w_i are the connection weights and b is the bias.

4) K-Nearest Neighbor (KNN):
KNN is useful for large datasets and does not require mathematical analysis. In the worst case, KNN needs more memory because it must check all the data points. Here we used k = 1 to 7 neighbors in 7 random loops, for a total of 7 x 7 = 49 evaluations. Its maximum accuracy is 100% at k = 5 in loop 3, and its average accuracy over all loops is 89.29%.

$d(x, y) = \sqrt{\sum_{i} (x_i - y_i)^2}$    (3)

Fig.5: K-Nearest Neighbor (KNN) model

5) Logistic Regression:
Logistic regression is a statistical machine learning model that classifies data by considering the outcome variables at the extreme ends of the classes and tries to fit a logistic curve that separates them. Logistic regression gives 91.67% accuracy in our approach.

$P(y \mid x) = \frac{1}{1 + e^{-(\alpha + \beta x)}}$    (4)

• where α and β are the model parameters.

6) Random Forest:
Random forest is an ensemble machine learning algorithm built from decision trees. It builds a set of decision trees on randomly chosen samples, gets a prediction from each tree and chooses the best prediction result by voting among the trees. In our approach, it gives 90.91% prediction accuracy.
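The bagging idea used throughout this section (bags drawn with replacement, one model per bag, majority vote over the bag predictions) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the base learner is a simple nearest-neighbor stand-in rather than a decision tree, the toy data are invented, and all function names are hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) samples with replacement; a bag may repeat samples."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def nearest_neighbor_predict(X_train, y_train, x):
    """Stand-in base learner (assumption): label of the closest training point."""
    best = min(
        range(len(X_train)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)),
    )
    return y_train[best]

def bagging_predict(X, y, x, n_bags=25, seed=0):
    """Fit one base learner per bag and return the majority-vote class."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_bags):
        X_bag, y_bag = bootstrap_sample(X, y, rng)
        votes.append(nearest_neighbor_predict(X_bag, y_bag, x))
    return Counter(votes).most_common(1)[0][0]

# Toy data (invented): two well-separated clusters labelled 1 and 2.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
y = [1, 1, 1, 2, 2, 2]

print(bagging_predict(X, y, [0.1, 0.1]))
print(bagging_predict(X, y, [2.1, 2.0]))
```

An odd number of bags avoids ties in a two-class vote. The paper reports using 300 bags of trees; swapping the stand-in learner for a decision tree would follow the same structure.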
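The 10-fold cross-validation steps listed at the start of this section can be sketched as below. This is an illustrative sketch only: `k_fold_indices` and `cross_validate` are hypothetical helper names, and the `evaluate` callback stands for whatever model training and scoring is applied to each fold.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # first `extra` folds get one more
        folds.append(indices[start:start + size])
        start += size
    return folds

def cross_validate(n_samples, evaluate, k=10, seed=0):
    """Average evaluate(train_idx, test_idx) over the k folds."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        scores.append(evaluate(train_idx, test_idx))
    return sum(scores) / k

# The Coimbra dataset has 116 samples, so 10 folds hold 11 or 12 samples each.
folds = k_fold_indices(116, k=10)
print([len(fold) for fold in folds])  # [12, 12, 12, 12, 12, 12, 11, 11, 11, 11]
```

Each sample appears in exactly one test fold, so every sample is used for both training and testing across the ten rounds.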
IV. ALGORITHM
For this approach we use Anaconda Navigator 3 (64-bit) for Python programming. We follow these algorithmic steps:

in k=10. If the training and test datasets are split 80:20, then the maximum accuracy of the KNN models is 87.5%.
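The KNN evaluation reported in this paper (k = 1 to 7 neighbors over seven random 90:10 splits, 49 evaluations in total) can be sketched as follows. The sketch assumes Euclidean distance and majority voting among the k nearest points, uses invented synthetic data rather than the Coimbra dataset, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def knn_predict(X_train, y_train, x, k):
    """Majority class among the k nearest training points (ties go to the nearer class)."""
    nearest = sorted(range(len(X_train)), key=lambda i: euclidean(X_train[i], x))[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

def knn_sweep(X, y, loops=7, k_values=range(1, 8), test_ratio=0.1, seed=0):
    """Accuracy for every (loop, k) pair, using a fresh random split per loop."""
    rng = random.Random(seed)
    n_test = max(1, round(len(X) * test_ratio))
    results = {}
    for loop in range(1, loops + 1):
        idx = list(range(len(X)))
        rng.shuffle(idx)
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        for k in k_values:
            correct = sum(knn_predict(X_tr, y_tr, X[i], k) == y[i] for i in test_idx)
            results[(loop, k)] = correct / n_test
    return results

# Synthetic, well-separated two-class data (invented for illustration).
data_rng = random.Random(1)
X = [[data_rng.uniform(-0.5, 0.5), data_rng.uniform(-0.5, 0.5)] for _ in range(20)]
X += [[data_rng.uniform(4.5, 5.5), data_rng.uniform(4.5, 5.5)] for _ in range(20)]
y = [0] * 20 + [1] * 20

results = knn_sweep(X, y)
best = max(results, key=results.get)
print(len(results), "evaluations; best (loop, k):", best, "accuracy:", results[best])
print("average accuracy:", sum(results.values()) / len(results))
```

With a 10% test set of about 12 samples, as on the Coimbra data, each misclassified sample changes accuracy by roughly 8 percentage points, so the maximum over 49 evaluations can differ noticeably from the average.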
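The precision, recall, F1-score and support values in a classification report follow directly from the confusion matrix. The sketch below shows the arithmetic with hypothetical helper names. The confusion-matrix counts are an inference chosen to be consistent with the SVM row of Table 3 (supports 5 and 7, accuracy 83.33%); they are not figures reported in the paper.

```python
def classification_report(confusion, labels):
    """confusion[i][j] = samples of actual class i predicted as class j."""
    report = {}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        predicted = sum(row[i] for row in confusion)  # column total
        actual = sum(confusion[i])                    # row total (support)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,
        }
    return report

def accuracy(confusion):
    """Fraction of samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

# Assumed counts (inferred, not reported): rows are actual, columns are predicted.
labels = ["Healthy", "Patients"]
confusion = [
    [5, 0],  # 5 healthy samples, all predicted healthy
    [2, 5],  # 7 patient samples, 2 predicted healthy and 5 predicted patients
]

report = classification_report(confusion, labels)
for label in labels:
    row = report[label]
    print(f"{label:9s} {row['precision']:.2f} {row['recall']:.2f} "
          f"{row['f1']:.2f} {row['support']}")
print(f"accuracy  {accuracy(confusion):.4f}")
```

Under these assumed counts the Healthy row evaluates to 0.71, 1.00, 0.83 and the Patients row to 1.00, 0.71, 0.83, with an accuracy of 10/12 = 0.8333.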
helps in problem identification. All heat maps are in the range (0.0, 1.0). The classification report describes each model's behaviour. Among all models, KNN and decision tree give the highest precision, recall, F1-score and support, as shown in the table below.

Classification Reports
ML Models            Class      Precision  Recall  F1-Score  Support Sample
Decision Tree        C-1        1.00       1.00    1.00       7
                     C-2        1.00       1.00    1.00       5
                     avg/total  1.00       1.00    1.00      12
SVM                  Healthy    0.71       1.00    0.83       5
                     Patients   1.00       0.71    0.83       7
                     avg/total  0.88       0.83    0.83      12
MLP                  Healthy    0.80       0.80    0.80       7
                     Patients   0.86       0.86    0.86       5
                     avg/total  0.83       0.83    0.83      12
KNN                  Healthy    1.00       1.00    1.00       7
                     Patients   1.00       1.00    1.00       5
                     avg/total  1.00       1.00    1.00      12
Logistic Regression  Healthy    1.00       0.89    0.94       9
                     Patients   0.75       1.00    0.86       3
                     avg/total  0.88       0.94    0.90      12
Random Forest        Healthy    1.00       0.89    0.94       9
                     Patients   0.75       1.00    0.86       3
                     avg/total  0.88       0.94    0.90      12
Table.3: Classification Reports of Ensemble ML Models

VI. CONCLUSIONS AND FUTURE WORKS
Ensemble ML models predict breast cancer with higher accuracy than models without ensembling. The ensemble model improves system performance by reducing bias. Here we used six different machine learning algorithms, namely decision tree, support vector machine, multilayer perceptron, K-nearest neighbors, logistic regression and random forest, and compared their prediction performance with and without ensemble techniques. Decision tree and KNN give 100% accuracy with the ensemble technique. K-nearest neighbor gives a maximum accuracy of 100% when we evaluate it for the first seven neighbors (k = 1 to 7) in seven random loops, a total of 7 x 7 = 49 evaluations; at k = 5 and loop 3 it gives 100% accuracy with a 90:10 train-test split. Its accuracy is 89.17% with k = 7 neighbors. If the training and test datasets are split 80:20, the accuracy of the KNN models is 87.5%.
As an extension of this work, we can provide this model to the medical community, where doctors or diagnostic staff enter a patient's diagnostic feature report and predict breast cancer for the patient. Since this model gives highly accurate results, they can make the right decision.

REFERENCES
[1] Patrício, M., Pereira, J., Crisóstomo, J., Matafome, P., Gomes, M., Seica, R., & Caramelo, F., “Using Resistin, glucose, age and BMI to predict the presence of breast cancer”, BMC Cancer, 18(1), 2018.
[2] Mohamed Nemissi, Halima Salah, Hamid Seridi, “Breast cancer diagnosis using an enhanced Extreme Learning Machine based-Neural Network”, 2018 International Conference on Signal, Image, Vision and their Applications (SIVA).
[3] “Predict the presence of breast cancer”, research article, Patricio et al., BMC Cancer (2018).
[4] Kemal Polat, Ümit Şentürk, “A Novel ML Approach to Prediction of Breast Cancer: Combining of mad normalization, KMC based feature weighting and AdaBoost Classifier”, IEEE 2018, ISMSIT Turkey.
[5] Polat, K., “Similarity-based attribute weighting methods via clustering algorithms in the classification of imbalanced medical datasets”, Neural Computing and Applications, 30 (3), 987–1013, 2018.
[6] Jaber Alwidian, Bassam H. Hammo, Nadim Obeid, “WCBA: Weighted classification based on association rules algorithm for breast cancer disease”, Applied Soft Computing, Volume 62, 536-549, 2018.
[7] “Breast Cancer Coimbra dataset from UCI”, (https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Coimbra).
[8] Naresh Khuriwal, Nidhi Mishra, “Breast Cancer Diagnosis Using Adaptive Voting Ensemble Machine Learning Algorithm”, IEEE 2018, IEEMA Engineer Infinite Conference (eTechNxT).
[9] “WHO data about breast cancer”, (https://www.who.int/cancer/detection/breastcancer/en/index1.html).
[10] Noushin Jafarpisheh, Nahid Nafisi, Mohammad Teshnehlab, “Breast Cancer Relapse Prognosis by Classic and Modern Structures of Machine Learning Algorithms”, IEEE 2018, 6th Iranian Joint Congress on Fuzzy and Intelligent Systems (CFIS).
[11] Meriem Amrane, Saliha Oukid, Ikram Gagaoua, Tolga Ensar, “Breast Cancer Classification Using Machine Learning”, IEEE 2018, 2018 Electric Electronics, Computer Science, Biomedical Engineerings' Meeting (EBBT) Turkey.
[12] Muhammad Imran Faisal, Saba Bashir, Zain Sikandar Khan, Farhan Hassan Khan, “An Evaluation of Machine Learning Classifiers and Ensembles for Early Stage Prediction of Lung Cancer”, IEEE 2018, 3rd International Conference on Emerging Trends in Engineering, Sciences and Technology (ICEEST).
[13] Y.K. Anupama, S. Amutha, and D. R. Ramesh Babu, “Survey on data mining techniques for diagnosis and prognosis of breast cancer”, Int. J. Recent and Innovation Trends in Computing and Communication, vol. 5, pp. 33–37, 2017.
[14] C. E. DeSantis, J. Ma, A. Sauer Goding, L. A. Newman and A. Jemal, “Breast cancer statistics, 2017, racial disparity in mortality by state”, CA: a cancer journal for clinicians, vol. 67, no. 6, pp. 439-448, 2017.
[15] Joana Crisostomo, Paulo Matafome, Daniela Santos-Silva, Ana L. Gomes, Manuel Gomes, Miguel Patrício, Liliana Letra, Ana B. Sarmento Ribeiro, Lelita Santos, Raquel Seica, “Hyperresistinemia and metabolic dysregulation: a risky crosstalk in obese breast cancer”, Springer Science Business Media New York 2016.
[16] Abreu, Pedro Henriques and Santos, Miriam Seoane and Abreu, Miguel Henriques and Andrade, Bruno and Silva, Daniel Castro, “Predicting Breast Cancer Recurrence Using Machine Learning Techniques: A Systematic Review”, ACM Comput. Survey, Volume 49 Issue 3, December 2016.
[17] J. A. Cruz and D. S. Wishart, “Applications of machine learning in cancer prediction and prognosis”, Cancer Informatics, vol. 2, pp. 59–77, 2006.