
SN Computer Science (2021) 2:177
https://doi.org/10.1007/s42979-021-00551-6

ORIGINAL RESEARCH

Prediction of Cervical Cancer from Behavior Risk Using Machine Learning Techniques

Laboni Akter1 · Ferdib-Al-Islam2 · Md. Milon Islam2 · Mabrook S. Al-Rakhami3 · Md. Rezwanul Haque2

Received: 22 February 2021 / Accepted: 26 February 2021

© The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd., part of Springer Nature 2021
Abstract
Cervical cancer is the fourth most common cancer in females. It is one of the diseases compromising women's health all over the world, and it is difficult to notice any sign in its early stage. Moreover, the screening process for cervical cancer is sometimes hampered by social and behavioral factors. There is still a limited amount of research on cervical cancer identification based on behavior and machine learning in the areas of gynecology and computer science. In this research, we propose three machine learning models, Decision Tree, Random Forest, and XGBoost, to predict cervical cancer from behavioral variables, and we obtained significantly better outcomes than the existing methods, with 93.33% accuracy. Moreover, we report the top features of the dataset according to their feature importance scores to show their impact on the classification models.

Keywords Cervical cancer · Behavior risk · Machine learning · Correlation · Feature importance

This article is part of the topical collection "Advances in Computational Approaches for Artificial Intelligence, Image Processing, IoT and Cloud Applications" guest edited by Bhanu Prakash K N and M. Shivakumar.

* Corresponding author: Mabrook S. Al-Rakhami, [email protected]

Laboni Akter, [email protected]
Ferdib-Al-Islam, [email protected]
Md. Milon Islam, [email protected]
Md. Rezwanul Haque, [email protected]

1 Department of Biomedical Engineering, Khulna University of Engineering & Technology, Khulna 9203, Bangladesh
2 Department of Computer Science and Engineering, Khulna University of Engineering & Technology, Khulna 9203, Bangladesh
3 Research Chair of Pervasive and Mobile Computing, Information Systems Department, College of Computer and Information Sciences, King Saud University, Riyadh 11543, Saudi Arabia

Introduction

Carcinoma of the cervix is one of the most widely recognized cancers among women around the world. Cervical malignancy grows gradually in the body. In 2008, 275,000 deaths occurred because of cervical cancer; of these, 88% happened in developing nations. In 2018, an estimated 570,000 women were diagnosed with cervical cancer worldwide, and around 311,000 women died from the disease. In Asia, 159,800 deaths occurred because of cervical cancer [1]. The incidence of cervical cancer begins to rise in women aged 20–29 years, reaches its peak in the 55–64-year age group, and declines after the age of 65 years [2].

In India, one out of five women is diagnosed with cervical cancer, and India carries the largest burden of cervical cancer patients [3]. The "International Agency for Research on Cancer" has estimated cancer-related death rates in Bangladesh to be 7.5% in 2005 and 13% in 2030. The two leading malignancies in males are lung and oral cancer, and in females breast cancer and cervical cancer [4]. The recognized risk factors behind cervical cancer are the Human Papilloma Virus (HPV), marriage before the age of 18 years, young age at first intercourse, smoking, low economic status, multiple sexual partners, and multiple sexual partners of the spouse. These variables increase the risk of developing cervical cancer.

HPV is the leading causal factor of cancer of the cervix. Most studies demonstrate that the dominant share of the spread is among young women, who have an increased likelihood of developing cervical disease [5]. Cervical cancer is highly prevalent in society and is the second most common cancer among women. The most promising aspect of this cancer is that it is preventable and treatable in its early stages. Nevertheless, women lack information concerning screening for cervical cancer. Poor and low-income women do not undergo screening for cervical cancer (for example, Pap tests). They lack awareness of these health services, while some overlook the symptoms out of modesty. Hence, they are not screened adequately for cervical cancer. So, there is a need to build public awareness concerning the screening of cervical cancer to eliminate the threat.

In this paper, we have identified cervical cancer from the behavior risk dataset by employing machine learning classifiers. Moreover, we have computed the feature importance scores from the dataset. Our proposed methods have achieved better classification accuracy than the existing methods.

The remaining parts of this paper are arranged as follows: the "Related Work" section describes recent research on diagnosing and predicting cervical cancer using machine learning techniques. The implementation of this work is presented in the "Methodology" section along with its subsections. The results are analyzed in the "Result and Discussion" section. The "Conclusion" section presents the paper's conclusion with future works.

Related Work

Deng et al. [6] analyzed cervical disease data using XGBoost, SVM, and Random Forest. The dataset was acquired from the "UCI machine learning repository" and includes 32 risk factors and 4 target variables from the clinical history of 858 patients. They used Borderline-SMOTE to manage the imbalance of the dataset. Lu et al. [7] explored different methodologies for cervical cancer and proposed an efficient assistive scheme to forecast cervical disease. Decision Tree, Logistic Regression, SVM, K-NN, and Multilayer Perceptron algorithms were used, and the accuracies were 77.97%, 82.78%, 79.25%, 82.93%, 83.16%, and 77.67%, respectively. Nithya et al. [8] applied machine learning to explore the risk factors of cervical malignancy. There were 858 rows and 27 attributes in the dataset. They used five algorithms, SVM, C5.0, rpart, Random Forest, and K-NN, with tenfold cross-validation, obtaining accuracies of 97%, 96.9%, 96%, 88%, and 88%, respectively. Parikh et al. [9] developed a cervical disease detection system in which K-NN selected 25 features, DT selected 17 features, and RF selected 11 features. K-NN was the best classifier with the maximum precision, with an AUC of 0.82 when compared with the decision tree and random forest classifiers. Tseng et al. [10] used three machine learning techniques, SVM, C5.0, and Extreme Learning Machine (ELM), to discover significant risk factors for forecasting the recurrence-proneness of cervical disease. The accuracies of SVM, C5.0, and ELM were 68.00%, 96.00%, and 94.00%, respectively. Suman et al. [11] proposed a prediction model to forecast the risk of cervical cancer using Random Forest, Neural Network, SVM, AdaBoost, Bayes Net, and Decision Tree algorithms. The Bayes Net algorithm obtained an error rate, FP rate, TP rate, F1-score, AUC, and MCC of 3.61%, 0.32, 0.96, 0.96, 0.95, and 0.68, respectively.

Methodology

The proposed system's architecture is illustrated in Fig. 1. The system begins with cervical cancer data collection, followed by data preprocessing, data analysis, dataset splitting, and the remaining stages of classification and evaluation with machine learning. The methodology of our proposed system is divided into the following parts:

• Data collection and preprocessing
• Exploratory data analysis
• Machine learning for classification

Data Collection and Preprocessing

The dataset used in this work is the "Cervical Cancer Behavior Risk Data Set", obtained from the "UCI Machine Learning Repository" [12, 13]. This is a small dataset containing 72 instances or records. It includes 19 attributes, one of which is the class attribute "ca_cervix". The remaining 18 attributes are derived from 8 variables. The variables and corresponding attributes are listed in Table 1. There were no missing values to be handled, and all attributes including the class variable were in int64 format, so no encoding was needed to use them (a minimal loading sketch is given below).

In machine learning, feature scaling is the method of bringing all features to a similar scale. If we do not scale the features, the model will in general assign higher weights to features with larger values and lower weights to features with smaller values, regardless of the units of the values. So, feature scaling brings the significant factors onto a similar scale. In this work, we have used min–max feature scaling.
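As a brief illustration of this data collection step, the following sketch loads the dataset with pandas and separates the class attribute. It is a reconstruction under assumptions, not code from the paper: the local file name sobar-72.csv and the variable names are assumed.

```python
import pandas as pd

# Load the "Cervical Cancer Behavior Risk" data (72 rows, 19 int64 columns).
# The file name is an assumption; the CSV is available from the UCI repository [12].
df = pd.read_csv("sobar-72.csv")

print(df.shape)                 # expected: (72, 19)
print(df.isnull().sum().sum())  # expected: 0 -> no missing values to handle
print(df.dtypes.unique())       # expected: [dtype('int64')]

# Separate the 18 behavioral attributes from the class attribute "ca_cervix".
X = df.drop(columns=["ca_cervix"])
y = df["ca_cervix"]
```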


Fig. 1  Proposed architecture of the cervical cancer prediction system

Min–max scaling is a scaling procedure in which values are shifted and rescaled so that they end up in the range of 0 to 1 [14]. When the value of X is the lowest in the column, the numerator is 0, and consequently X′ is 0. Conversely, when the value of X is the highest in the column, the numerator is equal to the denominator, and thus the value of X′ is 1. If the value of X is between the lowest and the highest, the value of X′ lies somewhere between 0 and 1. This procedure is also called normalization. The formula for normalization is shown in Eq. (1):

X′ = (X − Xmin) / (Xmax − Xmin)    (1)

where Xmax and Xmin are the highest and the lowest values of the feature, respectively.
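Equation (1) can be applied column-wise in a few lines; the sketch below is illustrative only, with the helper name and data frame variables assumed rather than taken from the paper. The same result can be obtained with sklearn.preprocessing.MinMaxScaler.

```python
import pandas as pd

def min_max_scale(df: pd.DataFrame) -> pd.DataFrame:
    """Apply Eq. (1) to every column: X' = (X - Xmin) / (Xmax - Xmin)."""
    x_min = df.min()
    x_max = df.max()
    # Constant columns would divide by zero; guard against that.
    span = (x_max - x_min).replace(0, 1)
    return (df - x_min) / span

# Scale all 18 behavior attributes, leaving the class label untouched.
# X is assumed to be the feature DataFrame from the loading sketch above.
X_scaled = min_max_scale(X)
```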


Table 1  Dataset information

Variable name    Attribute name
Behavior         Behavior_personalHygine
                 Behavior_eating
Intention        Intention_commitment
                 Intention_aggregation
Attitude         Attitude_spontaneity
                 Attitude_consistency
Norm             Norm_fulfillment
                 Norm_significantPerson
Perception       Perception_severity
                 Perception_vulnerability
Motivation       Motivation_willingness
                 Motivation_strength
Socialsupport    Socialsupport_emotionality
                 Socialsupport_appreciation
                 Socialsupport_instrumental
Empowerment      Empowerment_knowledge
                 Empowerment_abilities
                 Empowerment_desires

Exploratory Data Analysis

Exploratory data analysis is one of the vital steps for obtaining the insights and statistical measures that are fundamental for predictive modeling. Once Exploratory Data Analysis (EDA) is finished and insights are drawn, its output can be used for developing a machine learning model [15]. EDA helps in forming and validating the hypotheses that have to be made. It is performed on a dataset as an initial investigation and may help to identify outliers, the distribution of the data, and the correlation between the features. This technique is a basic step that must be carried out before developing a machine learning model.

Correlation is expressed as a correlation coefficient. To see whether two factors are connected, and how strongly, the correlation coefficient between them can be computed; it is measured on a scale from −1 to +1. Correlation clearly affects feature significance: if features are highly correlated, keeping them all introduces a significant level of redundancy, since two correlated features imply that a change in one accompanies a change in the other. There is then no compelling reason to keep all of them, as they are likely proxies for each other, and using only a subset of them can still classify the data well [16]. The correlation plot of the features is illustrated in Fig. 2. Here, no feature has a strong correlation with the others; the maximum value of the correlation coefficient is 0.85. So, we did not eliminate any features from the dataset for classification.

Machine Learning for Classification

We have used the decision tree, random forest, and XGBoost classifiers for performing classification on the dataset. Before applying the classification algorithms, we split the dataset using the percentage split technique, where 80% of the data was placed in the training set and 20% in the testing set.

Decision Tree Classifier

The decision tree algorithm belongs to the family of supervised learning classification algorithms [17]. The workflow of the decision tree algorithm is illustrated in Fig. 3. A decision tree is a hierarchical tree structure where an inner node represents a feature, a branch represents a decision rule, and every leaf node represents the outcome. The topmost node in a decision tree is known as the root. The tree learns to split on the attribute values. This hierarchical structure supports decision making; it can be read as a flowchart that practically mimics human-level reasoning, which is why decision trees are intuitive and easy to interpret. The workflow of the decision tree algorithm is as follows:

• Select the best attribute using an Attribute Selection Measure (ASM) to split the records.
• Convert that feature to a node and split the dataset into smaller subsets.
• Continue tree construction by repeating this process recursively for every child until one of these conditions is satisfied:
  i. All the tuples belong to the same attribute value.
  ii. There are no more remaining attributes.
  iii. There are no more instances.

We have used the "Gini Index" as the attribute selection measure; a brief sketch of this setup is shown below.
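The following sketch illustrates the 80/20 percentage split and a Gini-based decision tree as described above. It is a minimal reconstruction: scaling before splitting is one plausible reading of the pipeline, and the split random_state is an assumption, since the paper does not report one.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# 80% training / 20% testing split, as described in the text.
# X_scaled and y come from the loading and scaling sketches above;
# random_state=42 is an assumption, not a value reported in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Decision tree with the Gini index as the attribute selection measure.
dt = DecisionTreeClassifier(criterion="gini")
dt.fit(X_train, y_train)
print("Decision tree test accuracy:", dt.score(X_test, y_test))
```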


Fig. 2  Correlation among the input variables
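A correlation plot like the one in Fig. 2 can be produced from the raw feature frame with pandas and seaborn. This is a reconstruction sketch, not the authors' plotting code, and it assumes the X data frame from the loading sketch above.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pearson correlation among the 18 behavioral attributes.
corr = X.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation among the input variables")
plt.tight_layout()
plt.show()

# The text notes that the maximum off-diagonal coefficient is about 0.85,
# so no feature was removed before classification.
```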

Random Forest Classifier

Random forest is a supervised classification algorithm [18]. It is also one of the most adaptable and simple algorithms to apply. The workflow of the random forest classifier is illustrated in Fig. 4. Random forests build decision trees on randomly chosen data samples, get a prediction from every tree, and select the best solution by means of voting. Random forest relies on the divide-and-conquer viewpoint of decision trees that are built by randomly splitting the dataset. Every decision tree is constructed by applying the feature selection measure on each attribute, and each tree relies on an independent sample. For a classification problem, each tree casts a vote and the class with the most votes is picked. In this work, we have fixed the number of estimators to 200 with the random state 5.
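A matching sketch of the random forest configuration reported above (200 estimators, random state 5) is given below. It is an assumed scikit-learn realization, since the paper does not show its code; the train/test variables come from the earlier split sketch.

```python
from sklearn.ensemble import RandomForestClassifier

# 200 trees with random_state=5, as stated in the text.
rf = RandomForestClassifier(n_estimators=200, random_state=5)
rf.fit(X_train, y_train)
print("Random forest test accuracy:", rf.score(X_test, y_test))
```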
XGBoost Classifier

XGBoost is well suited to classification problems on smaller datasets [19]. The whole procedure of the XGBoost classifier is represented in Fig. 5. Boosting is an ensemble technique in which new models are added to correct the mistakes made by existing models; models are added sequentially until no further improvements can be made. Extreme gradient boosting is a model in which new models are created to predict the residuals of prior models, and these are then added together to produce the final prediction. The objective function is a sum of a specific loss evaluated over all predictions and a sum of regularization terms for all classifiers. We have chosen the "binary:logistic" function as the objective function. Mathematically, the objective function is computed as in Eq. (2):

obj(θ) = Σ_{i}^{n} l(yi, ŷi) + Σ_{k=1}^{n} Ω(fk)    (2)

where l is the training loss between the true label yi and the prediction ŷi, and Ω(fk) is the regularization term for the k-th tree.
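A corresponding sketch with the xgboost package is shown below. The binary:logistic objective matches the text; all other hyperparameters are left at library defaults, which is an assumption since the paper does not report them.

```python
from xgboost import XGBClassifier

# XGBoost with the "binary:logistic" objective, as stated in the text;
# the remaining hyperparameters are library defaults (an assumption).
xgb = XGBClassifier(objective="binary:logistic")
xgb.fit(X_train, y_train)
print("XGBoost test accuracy:", xgb.score(X_test, y_test))
```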


Result and Discussion

As mentioned earlier, we have implemented the decision tree, random forest, and XGBoost algorithms for classifying whether a patient has cervical cancer or not. We measured the performance of the implemented system based on accuracy (Acc), precision (Pre), recall (Rec), and F1-score, using the formulas in Eqs. (3), (4), (5), and (6), respectively:

Accuracy = (TP + TN) / (TP + FP + FN + TN)    (3)

Precision = TP / (TP + FP)    (4)

Recall = TP / (TP + FN)    (5)

F1-Score = TP / (TP + (FP + FN)/2)    (6)
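These metrics can be reproduced with scikit-learn as sketched below; the snippet assumes the fitted models and the test split from the earlier sketches and is not taken from the paper.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

for name, model in [("Decision Tree", dt), ("Random Forest", rf), ("XGBoost", xgb)]:
    y_pred = model.predict(X_test)
    print(name)
    print("  accuracy:", round(accuracy_score(y_test, y_pred), 4))
    print("  confusion matrix:\n", confusion_matrix(y_test, y_pred))
    # Per-class precision, recall, and F1-score (Eqs. 4-6).
    print(classification_report(y_test, y_pred, digits=2))
```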

The confusion matrices of the decision tree, random forest, and XGBoost models are presented in Fig. 6. According to Fig. 6a–c, the number of correctly classified instances was 14 for all three classifiers, and 1 instance was misclassified. It can be seen from Fig. 6c that there was just one test sample in the "Cervical Cancer" class for XGBoost to classify, and it failed to assign it to the proper class; this should improve when there is more data in the test set. The other two models, decision tree and random forest, showed good performance for both classes.

Fig. 3  Decision tree workflow

Fig. 4  Random forest algorithm workflow

Fig. 5  XGBoost algorithm workflow


Fig. 6  Confusion matrix of the cervical cancer prediction system. a Decision tree, b Random forest, c XGBoost classifier


Table 2  Performance measures of each classifier

Classifier      Class                Acc (%)   Pre (%)   Rec (%)   F1-score (%)
Decision Tree   No cervical cancer   93.33     100       91        95
                Cervical cancer                80        100       89
Random forest   No cervical cancer   93.33     92        100       96
                Cervical cancer                100       75        86
XGBoost         No cervical cancer   93.33     93        100       97
                Cervical cancer                0         0         0

The accuracy of the decision tree, random forest, and XGBoost classifiers was 93.33%. The detailed results are presented in Table 2 and Fig. 7. The precision of the decision tree, random forest, and XGBoost models for the "No Cervical Cancer" class was 100%, 92%, and 93%, respectively, and for the "Cervical Cancer" class 80%, 100%, and 0%, respectively. The recall of the decision tree, random forest, and XGBoost models for the "No Cervical Cancer" class was 91%, 100%, and 100%, respectively, and for the "Cervical Cancer" class 100%, 75%, and 0%, respectively. The F1-score of the decision tree, random forest, and XGBoost models for the "No Cervical Cancer" class was 95%, 96%, and 97%, respectively, and for the "Cervical Cancer" class 89%, 86%, and 0%, respectively.

Feature importance gives a score that shows how useful or important each feature was in the development of the model. A trained XGBoost model naturally calculates feature importance for the predictive modeling problem. We computed the feature importance from the classifier model; it is represented in Fig. 8. From Fig. 8, the top three features were "empowerment_desires", "perception_vulnerability", and "intention_aggregation".

In Table 3, we have compared our proposed models with the existing methods on the same dataset, and it can be seen that we obtained significantly better results, with 93.33% accuracy.

Table 3  Comparison of the proposed system with the state-of-the-art

Author              Model                 Accuracy (%)   Feature importance calculation
Sobar et al. [13]   Naïve Bayes           91.67          No
                    Logistic regression   87.5
Proposed system     Decision tree         93.33          Yes
                    Random forest         93.33
                    XGBoost               93.33
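The feature importance scores plotted in Fig. 8 can be read directly from the trained XGBoost model, as sketched below (a reconstruction under the assumptions of the earlier sketches, not the authors' code).

```python
import pandas as pd

# Feature importance scores from the trained XGBoost model (see Fig. 8).
importances = pd.Series(xgb.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(3))
# The paper reports the top-3 features as empowerment_desires,
# perception_vulnerability, and intention_aggregation.
```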

Fig. 7  Performance measures of each machine learning technique. a No cervical cancer, b Cervical cancer
Conclusion

Innovative cervical cancer screening systems are a fundamental aspect of the prevention initiative that will eventually lessen the worldwide disparity in cervical cancer rates. For effective elimination of cervical cancer, the concerned administrations need to reach women at the root level with programs that are adequate, affordable, simple to use, and sustainable.


Fig. 8  Feature importance score

In this paper, we attempted to model the cervical cancer risk structure by applying several machine learning techniques with enhanced performance. We have shown the performance of each classifier on different metrics, namely accuracy, precision, recall, and F1-score. All three classifiers (decision tree, random forest, and XGBoost) achieved 93.33% accuracy. We have also shown the feature importance scores for the features of the dataset. Further investigation with a bigger dataset will help to improve the model performance.

Declarations

Conflict of interest  On behalf of all authors, the corresponding author states that there is no conflict of interest.

References
1. Ferlay J, Shin H, Bray F, Forman D, Mathers C, Parkin D. Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008. Int J Cancer. 2010;127(12):2893–917.
2. Guidelines for cervical cancer screening programme. Chandigarh: Department of Cytology & Gynaecological Pathology, Postgraduate Institute of Medical Education and Research, screening.iarc.fr, 2020. Accessed 29 Oct 2020.
3. Ndikom C, Ofi B. Awareness, perception and factors affecting utilization of cervical cancer screening services among women in Ibadan, Nigeria: a qualitative study. Reprod Health. 2012;9:1–8.
4. Hussain S, Sullivan R. Cancer control in Bangladesh. Jpn J Clin Oncol. 2013;43(12):1159–69.


5. Paul BS. Studies on the epidemiology of cervical cancer in Southern Assam. Assam Univ J Sci Technol. 2011;7(1):36–42.
6. Deng X, Luo Y, Wang C. Analysis of risk factors for cervical cancer based on machine learning methods. In: Proc. of 5th IEEE international conference on cloud computing and intelligence systems (CCIS), Nanjing, China, 2018. p. 631–5.
7. Lu J, Song E, Ghoneim A, Alrashoud M. Machine learning for assisting cervical cancer diagnosis: an ensemble approach. Futur Gener Comput Syst. 2020;106:199–205.
8. Nithya B, Ilango V. Evaluation of machine learning based optimized feature selection approaches and classification methods for cervical cancer prediction. SN Appl Sci. 2019;1(6):1–16.
9. Parikh D, Menon V. Machine learning applied to cervical cancer data. Int J Math Sci Comput. 2019;5(1):53–64.
10. Tseng C, Lu C, Chang C, Chen G. Application of machine learning to predict the recurrence-proneness for cervical cancer. Neural Comput Appl. 2013;24(6):1311–6.
11. Suman S, Hooda N. Predicting risk of cervical cancer: a case study of machine learning. J Stat Manag Syst. 2019;22(4):689–96.
12. UCI machine learning repository: cervical cancer behavior risk data set. Archive.ics.uci.edu, 2020. Accessed 10 Nov 2020.
13. Machmud R, Wijaya A. Behavior determinant based cervical cancer early detection with machine learning algorithm. Adv Sci Lett. 2016;22(10):3120–3.
14. Patro S, Sahu K. Normalization: a preprocessing stage. IARJSET. 2015. p. 20–2.
15. Cox V. Translating statistics to make decisions. 2017.
16. Kumar S, Chong I. Correlation analysis to identify the effective data in machine learning: prediction of depressive disorder and emotion states. Int J Environ Res Public Health. 2018;15(12):2907.
17. Hamlich M, Bellatreche L, Mondal A, Ordonez C. Smart applications and data analysis. Cham: Springer; 2020. p. 165–77.
18. Abdoh SF, Abo Rizka M, Maghraby FA. Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access. 2018;6:59475–85.
19. Dimitrakopoulos GN, Vrahatis AG, Plagianakos V, Sgarbas K. Pathway analysis using XGBoost classification in biomedical data. In: Proc. of the 10th Hellenic conference on artificial intelligence. Association for Computing Machinery, New York, NY, USA, Article 46, 2018. p. 1–6.

Publisher's Note  Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

