Evaluation of Machine Learning Based Optimized Feature Selection Approaches and Classification Methods For Cervical Cancer Prediction
Abstract
Cervical cancer is one type of gynaecological cancer, and the majority of cervical cancer cases are associated with human papillomavirus infection. There are numerous risk factors associated with cervical cancer. It is important to recognize the significance of the test variables of cervical cancer for categorizing patients based on the results. This work intended to attain a deeper understanding by applying machine learning techniques in R to analyze the risk factors of cervical cancer. Various types of feature selection techniques are explored in this work to determine the important attributes for cervical cancer prediction. Significant features are identified over various iterations of model training through several feature selection methods, and an optimized feature selection model has been formed. In addition, this work aimed to build a few classifier models using the C5.0, random forest, rpart, KNN and SVM algorithms. Maximum possibilities were explored for training and performance evaluation of all the models. The performance and prediction accuracy of these algorithms are conferred in this paper based on the outcomes attained. Overall, the C5.0 and random forest classifiers have performed reasonably well, with comprehensive accuracy for identifying women exhibiting clinical signs of cervical cancer.
Keywords Gynecological cancers · Cervical cancer · Machine learning · Feature selection · Classification · Prediction ·
Performance · Optimization
1 Introduction

Gynaecological cancers are those that develop in a woman's reproductive tract, and they are the most common type of cancers in women after breast cancer. Gynaecological cancers are very dangerous and lead to a shortened lifespan for women diagnosed with them. Cervical cancer is one type of gynaecological cancer; the other types are ovarian cancer, uterine cancer, vaginal cancer and vulvar cancer. There are different risk factors for each type of gynaecological cancer. Cervical cancer is the second most commonly identified cancer in women, representing 7.5% of all female cancer deaths all over the world [1]. Cervical cancer is a malignant tumor that occurs when the cervix tissue cells begin to grow and reproduce abnormally, without controlled cell division and cell death. If the tumor is malignant, its cells flow through the bloodstream and spread to other parts of the body, which consequently also become affected; in most cases this can be prevented through early detection [2].

Generally, a medical dataset comes with many attributes and missing values [3]. Identifying the relevant and important features for statistical model building is essential by way of optimization. It is apparent that Machine Learning (ML) methods are beneficial in prediction and optimization related explorations, and they have been extensively implemented in various types of cancer research. The study [4], which discussed various works relevant to cancer prediction/prognosis, evidenced the attainment of accurate results by means of ML techniques. R is one of the most popular and widely used software environments for statistics, data mining, and machine learning.
The wrapper method for the feature selection process is represented in Fig. 2. Some common examples of wrapper approaches are forward feature selection, backward feature elimination and recursive feature elimination.

Forward selection - It is an iterative method; initially there is no feature in the model, and in each iteration the feature which best improves the model is added. This is continued until the addition of a new variable no longer improves the model performance.

Backward elimination - In this method, we begin with all the features and eliminate the least significant feature at each iteration. This process is repeated until no improvement is observed by eliminating features.

Recursive feature elimination - It is an optimization algorithm that intends to attain the finest feature subset. It repeatedly builds models and determines the best or the worst feature at each repetition. It creates the subsequent model with the remaining features until all features are explored, and then ranks the features with respect to their order of elimination.

3.2.3 Embedded methods

Fig. 3 Feature selection—embedded method

The attribute selection using the embedded method is described in Fig. 3. This method combines the abilities of both the methods discussed earlier. It is executed by procedures which have their own built-in feature selection methods.

Various studies on cancer classification through wrapper-based feature selection [16] showed excellent performance, not only at identifying relevant genes, but also with respect to the computational cost. Accordingly, wrapper methods are used in this work for feature selection to see whether the accuracy of the model can be improved.

4.1 Data preparation

The dataset used in this work is the openly accessible cervical cancer dataset from the UCI Machine Learning Repository [17], which was gathered at Hospital Universitario de Caracas in Caracas, Venezuela. This dataset comprises medical histories of 858 patients with 36 attributes (32 input features and 4 target variables: Hinselmann, Schiller, Citology, Biopsy). The attribute information of the dataset is given in Table 1.

It is essential to feed the right data to the machine learning processes for the problem to be solved, since these algorithms learn from data. After selecting the data, it should be pre-processed and transformed.

4.1.1 Data cleaning: dealing with missing values

This cervical cancer dataset has a lot of missing values. Missing values are a common occurrence, and an efficient approach for handling such information is required. A missing value can imply a number of different things in the data. Records with missing values can be ignored, or the missing entries can be replaced with the variable mean for numerical attributes or with the most frequent value for categorical attributes. When we applied the approach of removing the records with missing values, the number of rows reduced from 858 to 737. Our aim is to reduce the number of features but not the number of records available in the dataset; hence the strategy of replacing missing values with the mean is used for the numerical attributes.

The columns STDs_cervical_condylomatosis, STDs_vaginal_condylomatosis, STDs_pelvic_inflam_disease, STDs_genital_herpes, STDs_molluscum_contagiosum, STDs_Hepatitis_B, STDs_HPV and STDs_AIDS were removed, since there were 4 or fewer patient results for these features. Similarly, the features STDs_Time_since_first_diagnosis and STDs_Time_since_last_diagnosis, which contained greater than 60% missing values (787 of 858), were also eliminated from the dataset.
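A minimal R sketch of this cleaning step is shown below. It assumes the raw CSV has been downloaded from the UCI repository as risk_factors_cervical_cancer.csv with missing values coded as "?"; the column names are illustrative of how read.csv renders the UCI headers and are not taken verbatim from the paper.

# Read the raw data; the UCI file codes missing values as "?"
cerv <- read.csv("risk_factors_cervical_cancer.csv", na.strings = "?",
                 stringsAsFactors = FALSE)

# Drop the nearly empty STD columns and the two time-since-diagnosis
# columns with more than 60% missing values
drop_cols <- c("STDs.cervical.condylomatosis", "STDs.vaginal.condylomatosis",
               "STDs.pelvic.inflammatory.disease", "STDs.genital.herpes",
               "STDs.molluscum.contagiosum", "STDs.Hepatitis.B",
               "STDs.HPV", "STDs.AIDS",
               "STDs..Time.since.first.diagnosis",
               "STDs..Time.since.last.diagnosis")
cerv <- cerv[, !(names(cerv) %in% drop_cols)]

# Replace remaining missing values in numeric columns with the column mean
for (col in names(cerv)) {
  if (is.numeric(cerv[[col]])) {
    cerv[[col]][is.na(cerv[[col]])] <- mean(cerv[[col]], na.rm = TRUE)
  }
}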
Table 1 Attributes in the cervical cancer dataset

S. no.  Attribute name                          Type
1       Age                                     int
2       Number of sexual partners               int
3       First sexual intercourse (age)          int
4       Num of pregnancies                      int
5       Smokes                                  bool
6       Smokes (years)                          bool
7       Smokes (packs/year)                     bool
8       Hormonal contraceptives                 bool
9       Hormonal contraceptives (years)         int
10      IUD                                     bool
11      IUD (years)                             int
12      STDs                                    bool
13      STDs (number)                           int
14      STDs: condylomatosis                    bool
15      STDs: cervical condylomatosis           bool
16      STDs: vaginal condylomatosis            bool
17      STDs: vulvo-perineal condylomatosis     bool
18      STDs: syphilis                          bool
19      STDs: pelvic inflammatory disease       bool
20      STDs: genital herpes                    bool
21      STDs: molluscum contagiosum             bool
22      STDs: AIDS                              bool
23      STDs: HIV                               bool
24      STDs: Hepatitis B                       bool
25      STDs: HPV                               bool
26      STDs: Number of diagnosis               int
27      STDs: time since first diagnosis        int
28      STDs: time since last diagnosis         int
29      Dx: cancer                              bool
30      Dx: CIN                                 bool
31      Dx: HPV                                 bool
32      Dx                                      bool
33      Hinselmann: target variable             bool
34      Schiller: target variable               bool
35      Citology: target variable               bool
36      Biopsy: target variable                 bool
The values of the four target variables Hinselmann, Schiller, Citology and Biopsy represent the results of cervical cancer examinations. The histogram representation of the four target variables is shown in Fig. 4.

The data in these columns can be combined to create a single target feature called 'Cancer' as the 27th feature. The advantage of combining all four target variables is to confirm the possibility of the diagnosis: higher values of this feature signify an increased likelihood of cervical cancer. If one diagnosis indicates that the patient has cancer but the other three diagnoses give different results, then the possibility of the patient having cancer is lower.
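A minimal sketch of this combination step, assuming the cleaned data frame cerv from the previous snippet and the four exam columns named Hinselmann, Schiller, Citology and Biopsy, could be:

# Sum the four exam results into a single ordinal target (0-4);
# higher values mean more exams pointed towards cervical cancer
cerv$Cancer <- cerv$Hinselmann + cerv$Schiller + cerv$Citology + cerv$Biopsy
table(cerv$Cancer)   # distribution of the combined target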
4.2 Applying feature selection to data

When the dimensionality of the data increases, the computational cost also increases exponentially. In the presence of several irrelevant features, learning models tend to overfit and become less efficient [18]. To overcome this problem, it is required to find a method to diminish the number of features under consideration. Feature subset selection works by eliminating the features that are redundant or not relevant. During the data cleaning process, ten columns which had missing values were removed, and the 27th feature 'Cancer' was added as the target by combining the other four target values. In the feature selection process, this target variable has been used to find the important and relevant features. So the dataset is now available with 858 rows and 27 attributes (22 predictor variables + 4 target variables + an additional combined target variable 'Cancer'). In this work, various types of feature selection methods are explored using the R tool to identify the most significant and optimal features.

4.2.1 Recursive feature elimination (RFE)

It is evident from our understanding that the dataset is unbalanced; hence k-fold cross-validation is required to attain better outcomes. RFE is a feature selection method that fits a model and eliminates the weakest features. Cross-validation is used with RFE to find the optimal number of features, and the finest-ranking set of features is selected. In the R tool, the Recursive Feature Elimination (RFE) procedure can be implemented using the caret package. Initially the control function to be used in the RFE algorithm should be defined. The random forest selection function, set through the rfFuncs option of the rfeControl function, is stated here.

control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

Then the RFE algorithm is implemented as follows.

rfe.train <- rfe(training_data[, 1:22], training_data[, 23], sizes = 1:10, rfeControl = control)

The original target variables in the dataset were removed, and the procedure was implemented with the predictor variables and the newly added target variable. After the implementation of the RFE algorithm, the result has been plotted and a variable importance chart has been obtained. The chart is depicted in Fig. 6. The rfe.train output has shown the following result.

The top 3 variables (out of 3):
Dx.HPV, Dx.Cancer, Dx

predictors(rfe.train) has revealed the following output.

predictors(rfe.train)
[1] "Dx.HPV" "Dx.Cancer" "Dx"

So the RFE algorithm has predicted three features, Dx.HPV, Dx.Cancer and Dx, as important.

4.2.2 Boruta algorithm

All-relevant feature selection is a moderately new sub-field in the province of feature selection [19]. Boruta is an all-relevant feature selection algorithm in R which functions as a wrapper algorithm around Random Forest. It makes a top-down search for relevant features by comparing the original attributes' importance with the importance attainable at random, assessed using their permuted copies, and gradually rejecting irrelevant features. Boruta captures all features which are statistically significant to the target variable under some conditions.

Working principle of the Boruta algorithm - The procedure of the Boruta algorithm is explained as a sequence of phases.

(a) Initially, it adds randomness to the given dataset by making shuffled copies of all features (termed shadow features).
(b) Then it trains a random forest classifier on the dataset and applies a feature ranking measure (Mean Decrease Accuracy) to estimate the relevance (higher mean value) of each feature.
(c) On each iteration, it checks whether a real feature has a higher importance than the best of its shadow features and continuously eliminates features which are deemed highly insignificant.
(d) The algorithm halts when all features get confirmed or rejected, or when it reaches a stated limit of random forest runs.
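As a rough sketch of how such a run can be set up in R: the object names boruta.train, final.boruta and training_data mirror the snippets in this section, while the seed and maxRuns value are illustrative choices rather than settings reported by the study. TentativeRoughFix() is the Boruta helper that resolves tentative attributes by comparing median importance against the best shadow attribute, as described below.

library(Boruta)
set.seed(111)                      # illustrative seed for reproducibility
# Run Boruta on the 22 predictors against the combined 'Cancer' target
boruta.train <- Boruta(Cancer ~ ., data = training_data,
                       doTrace = 2, maxRuns = 100)
print(boruta.train)
# Resolve the remaining 'Tentative' attributes before extracting the final set
final.boruta <- TentativeRoughFix(boruta.train)
getSelectedAttributes(final.boruta, withTentative = FALSE)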
In Boruta, maxRuns is the number of times the algorithm is allowed to run; the higher the maxRuns, the more selectively the variables can be picked. The default value is 100. Boruta checks for all features which are either strongly or weakly relevant to the target variable. With Boruta, a dataset containing missing values should not be used to identify significant variables. As the goal is to find the features (other than the four target variables) which are significant for deciding the outcome as Cancer or Not, the same set of data which was used in the RFE algorithm was used in Boruta as well. After training our dataset with the Boruta algorithm, it produced the following output.

print(boruta.train)

Boruta performed 99 iterations in 58.59354 s.
5 attributes confirmed important: Dx, Dx.Cancer, Dx.HPV, Smokes..packs.year., STDs.vulvo.perineal.condylomatosis;
7 attributes confirmed unimportant: First.sexual.intercourse, Hormonal.Contraceptives, IUD, IUD..years., Num.of.pregnancies and 2 more;
10 tentative attributes left: Age, Dx.CIN, Hormonal.Contraceptives..years., Number.of.sexual.partners, Smokes and 5 more;

Here, the top three features identified by the RFE algorithm were again selected as important. The variable importance chart of the Boruta algorithm is portrayed in Fig. 7. The plot is shown for all the attributes taken into consideration. Blue boxplots represent the minimal, average and maximum Z scores of a shadow attribute. Red, yellow and green boxplots indicate the Z scores of rejected, tentative and confirmed attributes respectively.

In the process of deciding if a feature is important or not, some features may be marked by Boruta as 'Tentative'. These tentative attributes are decided as confirmed or rejected by comparing the median Z score of the attributes with the median Z score of the best shadow attribute. After deciding on the tentative attributes, Boruta produced the following output for the cervical cancer data, and the chart is shown in Fig. 8.

getSelectedAttributes(final.boruta, withTentative = F)

[1] "Smokes..packs.year."
[2] "Hormonal.Contraceptives..years."
[3] "STDs"
[4] "STDs.vulvo.perineal.condylomatosis"
[5] "STDs.syphilis"
[6] "STDs..Number.of.diagnosis"
[7] "Dx.Cancer"
[8] "Dx.HPV"
[9] "Dx"

The Boruta algorithm has shown a much-improved result of variable importance compared with the earlier feature selection method (RFE), and its results are easy to understand through their clear interpretation.

4.2.3 Simulated annealing (SA)

Simulated annealing is a search algorithm that allows a suboptimal solution to be accepted with the expectation that a better solution will be obtained in the end. This algorithm is used with the aim of getting the optimal solution by producing a smaller number of feature subsets for evaluation [20]. It works by making minor random changes to
a preliminary solution and checking for improvement in the performance. The optimal variables obtained for the cervical cancer risk factors through simulated annealing (the safs function) are shown below.

print(sa_obj$optVariables)

[1] "Age"                        "Smokes..years."
[3] "Hormonal.Contraceptives"    "STDs..number."
[5] "STDs..Number.of.diagnosis"  "Dx.Cancer"
[7] "Dx.HPV"

This procedure has derived seven important features from the cervical cancer dataset.
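A rough sketch of how such a run can be set up with caret's safs() is given here; only the object name sa_obj and the use of the safs function are taken from the text above, while the control settings, seed and iteration budget are illustrative assumptions.

library(caret)
set.seed(123)                                    # illustrative seed
sa_ctrl <- safsControl(functions = rfSA,         # random forest based fitness
                       method = "cv", number = 10)
# Search over the 22 predictors for a subset that best predicts 'Cancer'
sa_obj <- safs(x = training_data[, 1:22],
               y = factor(training_data[, 23]),  # treat the target as a class label
               iters = 10,                       # illustrative iteration budget
               safsControl = sa_ctrl)
print(sa_obj$optVariables)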
4.2.4 Feature selection with machine learning algorithms

An alternative way to accomplish feature selection is to consider the variables used most by several Machine Learning (ML) algorithms as the most significant ones. ML algorithms first learn the association between the X's and Y; based on that learning, different machine learning algorithms could end up using different variables to various degrees. Therefore, variables that appear suitable in a tree-based algorithm like rpart can turn out to be less valued in a regression-based model. Hence, all variables need not be equally appropriate to all algorithms.

It is apparent that employing feature subset selection using the wrapper approach in ML algorithms can enhance classification accuracy [6]. Hence this work intends to apply feature subset selection with a few ML algorithms to validate and compare their performance. The steps to find variable importance from ML algorithms are shown below; a sketch of these steps for the rpart case follows the rpart output.

• The desired model should be trained through the train() function of the caret package
• Then the varImp() function is applied for finding the important features

A few ML algorithms, namely rpart, C5.0, svmRadial, knn, ctree and rf, were applied in this work to decide about the features which are significant to attain reliable accuracy with optimization. All these ML algorithms are intended to train the model, and the models built by these algorithms would be applied on test data. Hence, we decided to use the dataset with 26 predictor variables, which included those four target variables (Schiller, Citology, Biopsy, Hinselmann) as well. The models were trained with respect to the decisive target variable 'Cancer'.

rpart() - The R implementation of the CART algorithm is termed RPART (Recursive Partitioning And Regression Trees). The rpart algorithm works by splitting the dataset recursively until a predetermined termination condition is reached. At each step, the split is made on the independent variable which allows the greatest possible reduction in heterogeneity of the predicted variable. The rpart method has shown the following output for the cervical cancer dataset considered in our work.

rpart variable importance (output obtained in R)
only 20 most important variables shown (out of 26)

                            Overall
Schiller1                  100.0000
Citology1                   95.9648
Biopsy1                     61.0357
Hinselmann1                 53.8580
Dx.Cancer                    4.4652
Age                          1.9525
Dx                           1.6306
Dx.CIN                       1.5675
First.sexual.intercourse     0.4283
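A minimal sketch of the two steps above for the rpart case is shown here. The data frame name cer_data and the repeated cross-validation control mirror the snippets in Sect. 4.3; treating 'Cancer' as a factor and the seed are assumptions for illustration.

library(caret)
set.seed(7)                                       # illustrative seed
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 50)
# Step 1: train the model on the 26 predictors with the combined target
fit.rpart <- train(factor(Cancer) ~ ., data = cer_data,
                   method = "rpart", trControl = ctrl)
# Step 2: extract and plot the variable importance ranking
imp <- varImp(fit.rpart)
print(imp)
plot(imp, top = 20)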
Fig. 9 Significance of variables through rpart algorithm
Fig. 11 Significance of variables through rf algorithm
It is observed through Fig. 12, which shows the significance of variables through the ctree() method, that this method has selected most of these significant features in an efficient manner.

4.3 Classifier model construction and estimation of model accuracy

Feature selection through the ML algorithms has already trained the desired models for classification. Once the best feature selection subset is identified for a particular dataset, it can be used to improve the classifier accuracy [18]. Hence, to enhance the performance and accuracy of the various classifier models, we decided to apply a boosting method in determining the occurrence of cervical cancer [25]. When we are building a predictive model, there must be a way to evaluate the capability of the model on unseen data. This can be accomplished by estimating accuracy on data that was not used to train the model, such as test data, or by means of cross validation. The model should be trained on a large percentage of the dataset. Correspondingly, there is a need for a good ratio of testing data points, because a smaller number of data points can lead to a variance error while testing the model's effectiveness. It is essential that the training and testing process is iterated multiple times and that, correspondingly, the training and testing dataset distribution is changed, which helps to accurately validate the effectiveness of the model. All these requirements can be attained through k-fold cross validation.

4.3.1 K-fold cross validation

The k-fold cross validation method comprises splitting the dataset into k subsets, where each subset/fold is used once as a testing set. In the first iteration, the first fold is used for model testing and the rest are used for model training. Likewise, this process is repeated until each fold has been used for testing the model. The picture of k-fold cross validation with k = 10 is shown in Fig. 13. This method is useful in determining the accuracy of the model with reasonable combinations of data.
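The fold construction itself can be illustrated with caret, as in the sketch below; cer_data$Cancer as the target and the seed are assumptions, and the train() calls shown later perform this splitting internally.

library(caret)
set.seed(42)                                     # illustrative seed
# Split the row indices into k = 10 folds, stratified on the target
folds <- createFolds(cer_data$Cancer, k = 10)
# Example: hold out fold 1 for testing, train on the remaining nine folds
test_idx  <- folds[[1]]
train_set <- cer_data[-test_idx, ]
test_set  <- cer_data[test_idx, ]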
The procedure of splitting the data into k folds can be repeated a required number of times, which is known as repeated k-fold cross validation. The eventual model accuracy is calculated as the mean over the number of repeats. In this work, repeated cross validation techniques have been applied for the processes of data splitting, model training and testing in a recurrent manner for a large number of times (50) over the cervical cancer data. The trainControl() function in R can be used to specify the resampling type. The code to apply repeated cross validation fifty times is shown here.

control <- trainControl(## 10-fold CV
                        method = "repeatedcv",
                        number = 10,
                        ## repeated fifty times
                        repeats = 50)

The train() function in R is used to fit the predictive models based on various tuning parameters. The model training through the Random Forest technique is shown below.

fit.rf <- train(Cancer ~ ., data = cer_data, method = "rf", trControl = control)

The results attained by the rf method using 26 predictor (feature) values with 10-fold cross validation over 50 repeats are shown below.

Random Forest
600 samples
26 predictor
5 classes: '0', '1', '2', '3', '4'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 50 times)
Summary of sample sizes: 541, 540, 541, 541, 540, 540, …
Resampling results across tuning parameters:

mtry  Accuracy   Kappa
 2    0.9111862  0.4565576
14    0.9879466  0.9445185
26    0.9908683  0.9578242

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 26.

The accuracy of the rf model with repeated cross validation using 26 features is revealed through the plot shown in Fig. 14. The results attained by the rf method using 14 predictor (feature) values with 10-fold cross validation over 50 repeats are shown below.

Random Forest
600 samples
14 predictor
5 classes: '0', '1', '2', '3', '4'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 50 times)
Summary of sample sizes: 540, 539, 539, 540, 540, 540, …
Resampling results across tuning parameters:

mtry  Accuracy   Kappa
 2    0.9597476  0.8016736
 8    0.9901440  0.9548854
14    0.9927090  0.9669582

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 14.
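The accuracy-versus-mtry curves referred to as Fig. 14 and Fig. 15 can be reproduced directly from the fitted train object; a minimal sketch, with fit.rf naming the object trained above:

# Accuracy across the tuned mtry values, as visualised in Fig. 14/15
plot(fit.rf, metric = "Accuracy")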
Similarly, the accuracy plot of the rf model with repeated cross validation using 14 features is shown in Fig. 15. Correspondingly, additional classifier models have been created with maximum possibilities on this cervical cancer data through various methods like rpart, C5.0, SVM and KNN over repeated k-fold cross validation with 50 trials, to determine whether the results obtained are significant in other algorithms as well. These results are revealed in Fig. 16 by printing and plotting some of these model outputs.
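A rough sketch of how these additional models can be trained and compared under the same resampling scheme is given below; the method names follow caret conventions, and the list of models, seed and resamples() comparison are illustrative assumptions rather than output reported by the study.

library(caret)
set.seed(99)                                     # illustrative seed
methods <- c("rpart", "C5.0", "svmRadial", "knn", "rf")
# Fit each classifier with the same repeated-CV control defined earlier
fits <- lapply(methods, function(m)
  train(factor(Cancer) ~ ., data = cer_data, method = m, trControl = control))
names(fits) <- methods
# Collect the resampling distributions and summarise accuracy and kappa
results <- resamples(fits)
summary(results)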
The results attained are comparatively improved for most of the ML methods by training the models with these significant features (14 predictors) which were obtained through the feature selection processes. The accuracy attained for the SVM model has shown very good progression with the significant features, but the KNN model has not accomplished progressive results. The plot for these models is shown in Fig. 17.

This exploration proved that the results attained are more significant when repeated cross validation techniques are applied for training the models. We have selected ten significant features which are optimal with respect to the percentage (rank) values obtained and also based on their common occurrence across a greater number of feature selection methods; subsequently, the results are more precise and significant.

4.4 Performance and accuracy estimation of ML classifier models

Accuracy is an evaluation metric for classification models which is calculated as the fraction of predictions the model got right, and the formula is given below.

Accuracy = Number of correct predictions / Total number of predictions

In this work, accuracy is estimated for some of the prevalent classification algorithms like C5.0, rpart, rf, SVM and KNN in two ways, i.e. by considering the 26 input features in the dataset (without the feature selection process) and by considering the significant features (14 features) obtained through the feature selection methods discussed earlier. The results attained for these classifiers are revealed through cross tables, performance accuracy and AUC values.
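How such estimates can be produced is sketched below under several assumptions: a held-out test set test_set, a fitted caret model fit that supports class probabilities, and a binary view of the outcome (any positive exam vs. none) for the AUC; the use of gmodels for the cross table and pROC for the AUC is inferred from the mention of cross tables and AUC values, not stated by the paper.

library(gmodels)   # CrossTable()
library(pROC)      # roc(), auc()

# Predicted classes on the held-out data
pred <- predict(fit, newdata = test_set)

# Cross table of actual vs. predicted classes
CrossTable(x = test_set$Cancer, y = pred, prop.chisq = FALSE)

# Accuracy = number of correct predictions / total number of predictions
accuracy <- mean(pred == test_set$Cancer)

# AUC on a binary view of the outcome, scoring each patient by one minus
# the predicted probability of the all-negative class (column "0" assumed)
probs <- predict(fit, newdata = test_set, type = "prob")
roc_obj <- roc(response = as.integer(test_set$Cancer != 0),
               predictor = 1 - probs[["0"]])
auc(roc_obj)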
This work involved data cleaning, replacement of missing values and the application of a feature selection process to achieve higher efficiency in outcome prediction with an optimal feature subset. To evaluate the performance of the classifier models, this work employed ML methods on the cervical cancer data by considering all the records in the dataset, replacing the missing values in the rows with their mean and eliminating only the columns which had missing values. Hence, after the data cleaning process the dataset had 858 rows with 26 predictors. Then, by implementing a few imperative feature selection techniques and by training the models through ML algorithms, an optimal feature subset has been selected based on the importance of the variables. The following attributes have been identified as more significant, in addition to the four target features, for cervical cancer diagnosis prediction.

Hormonal.Contraceptives..years.    Dx.Cancer
First.sexual.intercourse           Dx
Number.of.sexual.partners          Dx.HPV
STDs..Number.of.diagnosis          Smokes..years.
Age                                Num.of.pregnancies

ML classifier models with the C5.0, RF, RPART, SVM and KNN methods have been built with the repeated k-fold cross validation technique, with all the 26 features as well as with an optimal feature subset of 14 predictors, for diagnosis prediction of cervical cancer. The results of the classifier models through the C5.0 and rf algorithms with the optimal significant features are significantly upgraded, to 99% to 100%. In both ways, this work revealed the C5.0 and rf methods as the more prominent algorithms for predicting significant risk factors in cervical cancer. The relative performance analysis of the conferred classification methods is shown in Table 2 with their accuracy and AUC values.

Table 2 Comparative analysis of ML algorithms based on accuracy and AUC values (columns: C5.0, RF, RPART, SVM, KNN; rows: features/attribute details and evaluation metrics)

The performance evaluation of the ML classification algorithms is exhibited through the bar plot shown in Fig. 25. Random forest and C5.0 have both performed equally well, with maximum accuracy and a reduced error rate.

We have selected significant predictors based on their importance and mutual existence over the feature selection methods and by training the models through repeated k-fold cross validation with ML methods. In the unbiased feature list, there are only three features which are common to most of these methods; if we employ these minimal features for the ML classification process, the results will not be precise. This shows that stability in feature selection is an important issue, and its importance has been determined through this work. Therefore, an optimized feature selection approach is essential to improve the performance accuracy of the prediction process; accordingly, with an optimal feature subset, an efficient performance has been gained through this work for cervical cancer diagnosis prediction.
The feature selection process over the Boruta algorithm, SA and the ctree() method has shown good proficiency in capturing the major features of an optimal feature subset for cervical cancer risk factor prediction. However, not all the information related to the dataset was provided, and some decisions, such as factorizing or not factorizing and the replacement of variables, were made based on assumptions. Through the examination of the C5.0, rpart, Random Forest, SVM and KNN algorithms, we have found that most of the algorithms were efficient in providing cervical cancer diagnosis with advanced accuracy. Overall, the C5.0 and Random Forest classifiers have performed reasonably well, besides being extremely accurate through reliable results with maximum accuracy for identifying women exhibiting clinical signs of cervical cancer. It is apparent through this work that an enhanced prediction accuracy for cervical cancer diagnosis can be attained by including an optimal feature subset through enhanced feature selection approaches and by building the classifier models with ML algorithms through repeated k-fold cross validation techniques. This work can be extended to predictions for other types of gynecological cancer. Altogether, the conferred classifiers have shown enhanced performance accuracy with the optimal features' dataset.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. World Health Organization (2019) Fact sheet: human papillomavirus (HPV) and cervical cancer. Retrieved 13-02-2019
2. Sarwar A et al (2015) Performance evaluation of machine learning techniques for screening of cervical cancer. INDIACom-2015; ISSN 0973-7529; ISBN 978-93-80544-14-4
3. Abdoh SF, Abo Rizka M, Maghraby FA (2018) Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access 6:59475–59485
4. Kourou K et al (2015) Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 13:8–17
5. Bischl B et al (2016) mlr: machine learning in R. J Mach Learn Res 17:1–5
6. Gowda A et al (2010) Feature subset selection problem using wrapper approach in supervised learning. Int J Comput Appl 1(7):13–17
7. Lavanya D et al (2011) Analysis of feature selection with classification: breast cancer datasets. Indian J Comput Sci Eng (IJCSE) 2(5):756–763
8. Sowjanya D et al (2014) Staging prediction in cervical cancer patients—a machine learning approach. Int J Innov Res Pract 2(2):14–23
9. Akyol K (2018) A study on test variable selection and balanced data for cervical cancer disease. Int J Inf Eng Electron Bus 10:1
10. Menon V, Parikh D (2018) Machine learning applied to cervical cancer data. Int J Sci Eng Res 9(7):46–50
11. Choudhary A et al (2018) Classification of cervical cancer dataset. In: Proceedings of the 2018 IISE annual conference, Orlando, pp 1456–1461
12. Jović A, Brkić K, Bogunović N (2015) A review of feature selection methods with applications. In: 2015 38th international convention on information and communication technology, electronics and microelectronics (MIPRO), Opatija, pp 1200–1205
13. Bagherzadeh-Khiabani F et al (2016) A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results. J Clin Epidemiol 71:76–85
14. Le Thi HA et al (2015) Feature selection in machine learning: an exact penalty approach using a difference of convex function algorithm. Mach Learn 101:163–186
15. Park HW et al (2017) A hybrid feature selection method to classification and its application in hypertension diagnosis. In: ITBAM 2017, LNCS 10443. Springer, pp 11–19
16. Ruiz R et al (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition 39(12):2383–2392
17. UCI Machine Learning Repository, Cervical cancer (Risk Factors) Data Set. Retrieved February 5, 2019, from https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
18. Zhao Z et al (2010) Advancing feature selection research—ASU feature selection repository. Citeseer
19. Rudnicki WR, Wrzesień M, Paja W (2015) All relevant feature selection methods and applications. In: Stańczyk U, Jain L (eds) Feature selection for data and pattern recognition. Studies in computational intelligence, vol 584. Springer, Berlin
20. Antony DA (2016) Literature review on feature selection methods for high-dimensional data. Int J Comput Appl 136:0975–8887
21. Pandya R, Pandya J (2015) C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. Int J Comput Appl 117(16):18–21
22. Nguyen C et al (2013) Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. J Biomed Sci Eng 6:551–560
23. Genuer R et al (2015) An R package for variable selection using random forests. The R J R Found Stat Comput 7(2):19–33
24. Jacobucci R (2018) Decision tree stability and its effect on interpretation. Retrieved from osf.io/m5p2v
25. Dinov ID (2018) Improving model performance. In: Data science and predictive analytics. Springer, Cham, pp 497–511
26. Seethal CR, Panicker JR, Vasudevan V (2016) Feature selection in clinical data processing for classification. In: International conference on information science (ICIS), pp 172–175

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.