Prediction of Cervical Cancer From Behavior Risk Using Machine Learning Techniques
https://doi.org/10.1007/s42979-021-00551-6
ORIGINAL RESEARCH
Abstract
Cervical cancer is the fourth most common cancer in women. It threatens women’s health all over the world, and its signs are difficult to notice in the early stages. Moreover, the screening process for cervical cancer is sometimes hampered by social and behavioral factors. Only a limited number of studies in gynecology and computer science have addressed cervical cancer identification based on behavior and machine learning. In this research, we propose three machine learning models, Decision Tree, Random Forest, and XGBoost, to predict cervical cancer from behavior and its variables, and we obtained significantly improved outcomes over the current methods, with 93.33% accuracy. Moreover, we show the top features of the dataset according to their feature importance scores to understand their impact on the development of the classification model.
Keywords Cervical cancer · Behavior risk · Machine learning · Correlation · Feature importance
Introduction
SN Computer Science
177 Page 2 of 10 SN Computer Science (2021) 2:177
... risk of developing cervical cancer. HPV is the leading causal factor of cancer of the cervix. Most studies indicate that the disease spreads predominantly among the young, which increases their likelihood of developing cervical cancer [5]. Cervical cancer is highly prevalent in society and is the second most common cancer among women. The most encouraging aspect of this cancer is that it is preventable and treatable in its early stages. Nevertheless, women lack information about cervical cancer screening. Vulnerable and low-income women do not undergo screening for cervical cancer (for example, Pap tests). They lack awareness of these health services, while some ignore the symptoms out of modesty. Hence, they are not adequately screened for cervical cancer. There is therefore a need to raise public awareness of cervical cancer screening to eliminate the threat.
In this paper, we identify cervical cancer from the behavior risk dataset by employing machine learning classifiers. Moreover, we compute feature importance scores from the dataset. Our proposed methods achieve better classification accuracy than the existing methods.
The rest of this paper is arranged as follows: the “Related Work” section describes recent research on diagnosing and predicting cervical cancer using machine learning techniques. The implementation of this work is presented in the “Methodology” section and its subsections. The results are analyzed in the “Result and discussion” section. The “Conclusion” section presents the paper’s conclusion along with future work.
Related Work
Deng et al. [6] analyzed cervical cancer data using XGBoost, SVM, and Random Forest. The dataset was acquired from the "UCI machine learning repository" and includes 32 risk factors and 4 target variables from the clinical history of 858 patients. They used Borderline-SMOTE to handle the class imbalance of the dataset. Lu et al. [7] explored different methodologies for cervical cancer and proposed an efficient auxiliary scheme to predict cervical cancer utilizing a quality arrangement module. They used Decision Tree, Logistic Regression, SVM, K-NN, and Multilayer Perceptron algorithms, and the accuracy was 77.97%, 82.78%, 79.25%, 82.93%, 83.16%, and 77.67%, respectively. Nithya et al. [8] applied machine learning to explore the risk factors of cervical cancer. The dataset had 858 rows and 27 attributes. They used five algorithms, SVM, C5.0, r-part, Random Forest, and K-NN, with tenfold cross-validation, and the accuracy was 97%, 96.9%, 96%, 88%, and 88%, respectively. Parikh et al. [9] developed a cervical cancer detection system utilizing K-NN, which selected 25 features, DT, which selected 17 features, and RF, which selected 11 features. K-NN was the best classifier, with maximum precision and an AUC of 0.82, when compared with the decision tree and random forest classifiers. Tseng et al. [10] used three machine learning techniques, SVM, C5.0, and Extreme Learning Machine (ELM), to discover significant risk factors for forecasting the recurrence-proneness of cervical cancer. The accuracy of SVM, C5.0, and ELM was 68.00%, 96.00%, and 94.00%, respectively. Suman et al. [11] proposed a prediction model to forecast the risk of cervical cancer. The algorithms used were Random Forest, Neural Network, SVM, AdaBoost, Bayes Net, and Decision Tree. The Bayes Net algorithm obtained an error rate, FP rate, TP rate, F1-score, AUC, and MCC of 3.61%, 0.32, 0.96, 0.96, 0.95, and 0.68, respectively.
Methodology
The architecture of the proposed system is illustrated in Fig. 1. The system begins with cervical cancer data collection, followed by data preprocessing, data analysis, dataset splitting, and the remaining steps for classification and evaluation with machine learning. The methodology of our proposed system is divided into the following parts:
• Data collection and preprocessing
• Exploratory data analysis
• Machine learning for classification
The dataset used in this work is the “Cervical Cancer Behavior Risk Data Set”, obtained from the “UCI Machine Learning Repository” [12, 13]. It is a small dataset containing 72 instances (records). It includes 19 attributes, one of which is the class attribute “ca_cervix”. The remaining 18 attributes come from 8 variables. The variables and corresponding attributes are listed in Table 1. The dataset had no missing values to handle, and all attributes, including the class variable, were in int64 format, so no encoding was needed before using them.
In machine learning, feature scaling is the method of bringing all features to a similar scale. Without it, a model tends to give higher weights to larger values and lower weights to smaller values, regardless of the units of the values. Feature scaling therefore brings the significant factors to a similar scale. In this work, we applied min–max feature scaling.
Min–max scaling is a scaling procedure in which values are shifted and rescaled so that they end up in the range 0 to 1 [14]. When the value of X is the lowest in the column, the numerator is 0, and consequently X′ is 0. Conversely, when the value of X is the highest in the column, the numerator equals the denominator, and thus X′ is 1. If the value of X lies between the lowest and the highest, the value of X′ is between 0 and 1. This is also called normalization. The normalization formula is shown in Eq. (1).
$$X' = \frac{X - X_{\min}}{X_{\max} - X_{\min}} \tag{1}$$
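The min–max rule of Eq. (1) can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: the function name `min_max_scale` and the sample values are hypothetical, and the handling of a constant column (where the denominator would be zero) is an assumption that the source does not discuss.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant column has a zero denominator, so map it to 0.0.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical attribute values, not taken from the dataset.
scores = [2, 6, 10, 4]
scaled = min_max_scale(scores)
print(scaled)  # [0.0, 0.5, 1.0, 0.25]
```

The lowest value maps to 0, the highest to 1, and every other value falls in between, as described in the text.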
Table 1 Dataset information
Variable name    Attribute name
... value of the correlation coefficient is 0.85. So, we did not eliminate any features from the dataset for classification.
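A correlation check of this kind can be sketched as follows. The example assumes the Pearson coefficient, which the surviving text does not name, and it uses hypothetical attribute names and values rather than the real dataset. It reports the most strongly correlated pair of features, which is the quantity a feature-elimination decision would be based on.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def strongest_pair(features):
    """Return (|r|, name_a, name_b) for the most strongly correlated pair."""
    return max(
        (abs(pearson(features[a], features[b])), a, b)
        for a, b in combinations(features, 2)
    )


# Hypothetical attributes, invented for illustration only.
features = {
    "attr_a": [1, 2, 3, 4, 5],
    "attr_b": [2, 4, 6, 8, 10],  # exact multiple of attr_a
    "attr_c": [5, 3, 4, 1, 2],
}
r, name_a, name_b = strongest_pair(features)
print(name_a, name_b, round(r, 2))
```

A pair whose absolute coefficient approaches 1 carries nearly redundant information, so one of the two attributes could be dropped. The cutoff used for that decision is a modelling choice and is not stated in the text above.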
... splitting the dataset. Each decision tree is built by applying the feature selection measure to each attribute. Each tree depends on an independent sample. For a classification problem, each tree casts a vote and the class with the most votes is chosen. In this work, we fixed the number of estimators to 200 with a random state of 5.
XGBoost Classifier
The objective function is the sum of a specific loss evaluated over all predictions and the sum of a regularization term over all classifiers. We chose the “binary:logistic” function as the objective function. Mathematically, the objective function is computed as in Eq. (2).
$$\mathrm{obj}(\theta) = \sum_{i}^{n} l(y_i - \hat{y}_i) + \sum_{k=1}^{n} \Omega(f_k) \tag{2}$$
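To make the two sums in Eq. (2) concrete, the sketch below evaluates such an objective by hand. It rests on assumptions that the text does not spell out: the loss is taken to be the logistic (log) loss on each label and predicted probability, in line with the “binary:logistic” choice, and the regularization term is taken to be Ω(f) = γT + ½λ‖w‖², the form used in the XGBoost formulation, with T the number of leaves and w the leaf weights. All names and numbers are hypothetical.

```python
from math import log


def logistic_loss(y, p):
    """Log loss for one sample: y is the 0/1 label, p the predicted probability."""
    return -(y * log(p) + (1 - y) * log(1 - p))


def tree_penalty(leaf_weights, gamma=0.0, lam=1.0):
    """Assumed Omega(f) = gamma * T + 0.5 * lam * sum(w^2), T = number of leaves."""
    return gamma * len(leaf_weights) + 0.5 * lam * sum(w * w for w in leaf_weights)


def objective(labels, probs, trees, gamma=0.0, lam=1.0):
    """Sum of per-sample losses plus the sum of per-tree penalties."""
    loss = sum(logistic_loss(y, p) for y, p in zip(labels, probs))
    penalty = sum(tree_penalty(t, gamma, lam) for t in trees)
    return loss + penalty


# Hypothetical labels, predicted probabilities, and leaf weights of two trees.
labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
trees = [[0.5, -0.5], [0.1, 0.2, -0.3]]
value = objective(labels, probs, trees, gamma=1.0, lam=1.0)
print(round(value, 4))
```

The first term rewards predictions that match the labels, while the second penalizes trees with many leaves or large leaf weights, which is how the objective trades accuracy against model complexity.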
... F1-score using the formulas in Eqs. (3), (4), (5), and (6), respectively.
$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{3}$$
$$\text{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$\text{Recall} = \frac{TP}{TP + FN} \tag{5}$$
$$\text{F1-Score} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \tag{6}$$
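Equations (3) to (6) translate directly into code. In the sketch below, TP, FP, FN, and TN are the counts of true positives, false positives, false negatives, and true negatives. The counts are hypothetical and chosen only for illustration, and returning 0.0 when a denominator is zero is an assumed convention, not something the text specifies.

```python
def _ratio(numerator, denominator):
    # Assumed convention: report 0.0 when the denominator is zero.
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1-score from confusion-matrix counts."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + fn + tn),   # Eq. (3)
        "precision": _ratio(tp, tp + fp),                 # Eq. (4)
        "recall": _ratio(tp, tp + fn),                    # Eq. (5)
        "f1": _ratio(tp, tp + 0.5 * (fp + fn)),           # Eq. (6)
    }


# Hypothetical counts for a 15-sample test set.
metrics = classification_metrics(tp=4, fp=1, fn=0, tn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, 14 of the 15 samples are classified correctly, precision is 4/5, recall is 4/4, and the F1-score is 4/4.5.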
Table 2 Performance measures of each classifier
Classifiers     Class                Acc (%)   Pre (%)   Rec (%)   F1-score (%)
Decision Tree   No cervical cancer   93.33     100       91        95
                Cervical cancer                80        100       89
Random forest   No cervical cancer   93.33     92        100       96
                Cervical cancer                100       75        86
XGBoost         No cervical cancer   93.33     93        100       97
                Cervical cancer                0         0         0
... classify, and it failed to assign it to the proper class. However, this should improve when more data are available in the test set. The other two models, decision tree and random forest, showed good performance for both classes. The accuracy of the decision tree, random forest, and XGBoost classifier ...
Table 3 Comparison of the proposed system with the state-of-the-art
Author              Model                 Accuracy (%)   Feature importance calculation
Sobar et al. [13]   Naïve Bayes           91.67          No
                    Logistic regression   87.5
Proposed system     Decision tree         93.33          Yes
                    Random forest         93.33
                    XGBoost               93.33
Conclusion
Innovative cervical cancer screening systems are a fundamental part of the prevention effort that will eventually reduce the worldwide disparity in cervical cancer rates. To succeed in eliminating cervical cancer, the concerned authorities need to reach women at the root level with programs that are adequate, cheap, simple to use, and sustainable. In this paper, we attempted to model cervical cancer risk by applying several machine learning techniques with enhanced performance. We have shown the performance of each classifier in terms of accuracy, precision, recall, and F1-score. All three classifiers (decision tree, random forest, and XGBoost) achieved 93.33% accuracy. We have also shown the feature importance scores for the features of the dataset. Further investigation with a bigger dataset will help to improve model performance.
Declarations
Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
References
1. Ferlay J, Shin H, Bray F, Forman D, Mathers C, Parkin D. Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008. Int J Cancer. 2010;127(12):2893–917.
2. Guidelines for cervical cancer screening programme. Chandigarh: Department of Cytology & Gynaecological Pathology, Postgraduate Institute of Medical Education, Research, screening.iarc.fr, 2020. Accessed 29 Oct 2020.
3. Ndikom C, Ofi B. Awareness, perception and factors affecting utilization of cervical cancer screening services among women in Ibadan, Nigeria: a qualitative study. Reprod Health. 2012;9:1–8.
4. Hussain S, Sullivan R. Cancer control in Bangladesh. Jpn J Clin Oncol. 2013;43(12):1159–69.
5. Paul BS. Studies on the epidemiology of cervical cancer in Southern Assam. Assam Univ J Sci Technol. 2011;7(1):36–42.
6. Deng X, Luo Y, Wang C. Analysis of risk factors for cervical cancer based on machine learning methods. In: Proc. of 5th IEEE international conference on cloud computing and intelligence systems (CCIS), Nanjing, China, 2018. p. 631–5.
7. Lu J, Song E, Ghoneim A, Alrashoud M. Machine learning for assisting cervical cancer diagnosis: an ensemble approach. Futur Gener Comput Syst. 2020;106:199–205.
8. Nithya B, Ilango V. Evaluation of machine learning based optimized feature selection approaches and classification methods for cervical cancer prediction. SN Appl Sci. 2019;1(6):1–16.
9. Parikh D, Menon V. Machine learning applied to cervical cancer data. Int J Math Sci Comput. 2019;5(1):53–64.
10. Tseng C, Lu C, Chang C, Chen G. Application of machine learning to predict the recurrence-proneness for cervical cancer. Neural Comput Appl. 2013;24(6):1311–6.
11. Suman S, Hooda N. Predicting risk of cervical cancer: a case study of machine learning. J Stat Manag Syst. 2019;22(4):689–96.
12. UCI machine learning repository: cervical cancer behavior risk data set. Archive.ics.uci.edu, 2020. Accessed 10 Nov 2020.
13. Machmud R, Wijaya A. Behavior determinant based cervical cancer early detection with machine learning algorithm. Adv Sci Lett. 2016;22(10):3120–3.
14. Patro S, Sahu K. Normalization: a preprocessing stage. IARJSET. 2015. p. 20–22.
15. Cox V. Translating statistics to make decisions. 2017.
16. Kumar S, Chong I. Correlation analysis to identify the effective data in machine learning: prediction of depressive disorder and emotion states. Int J Environ Res Public Health. 2018;15(12):2907.
17. Hamlich M, Bellatreche L, Mondal A, Ordonez C. Smart applications and data analysis. Cham: Springer; 2020. p. 165–77.
18. Abdoh SF, Abo Rizka M, Maghraby FA. Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access. 2018;6:59475–85.
19. Dimitrakopoulos GN, Vrahatis AG, Plagianakos V, Sgarbas K. Pathway analysis using XGBoost classification in Biomedical Data. In: Proc. of the 10th hellenic conference on artificial intelligence. Association for computing machinery, New York, NY, USA, Article 46, 2018. p. 1–6.
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.