Evaluation of Machine Learning Based Optimized Feature Selection Approaches and Classification Methods For Cervical Cancer Prediction
Abstract
Cervical cancer is one type of gynaecological cancer, and the majority of cervical cancer cases are associated with human papillomavirus infection. There are numerous risk factors associated with cervical cancer. It is important to recognize the significance of the test variables of cervical cancer for categorizing patients based on the results. This work intended to attain a deeper understanding by applying machine learning techniques in R to analyze the risk factors of cervical cancer. Various types of feature selection techniques are explored in this work to determine the important attributes for cervical cancer prediction. Significant features are identified over various iterations of model training through several feature selection methods, and an optimized feature selection model has been formed. In addition, this work aimed to build a few classifier models using the C5.0, random forest, rpart, KNN and SVM algorithms. Maximum possibilities were explored for training and performance evaluation of all the models. The performance and prediction accuracy of these algorithms are conferred in this paper based on the outcomes attained. Overall, the C5.0 and random forest classifiers have performed reasonably well, with comprehensive accuracy for identifying women exhibiting clinical signs of cervical cancer.
Keywords Gynecological cancers · Cervical cancer · Machine learning · Feature selection · Classification · Prediction ·
Performance · Optimization
1 Introduction

Gynaecological cancers are those that develop in a woman's reproductive tract, and they are the most common type of cancers in women after breast cancer. Gynaecological cancers are very dangerous and lead to a shortened lifespan for women diagnosed with them. Cervical cancer is one type of gynaecological cancer; the other types are ovarian cancer, uterine cancer, vaginal cancer and vulvar cancer. There are different risk factors for each type of gynaecological cancer. Cervical cancer is the second most commonly identified cancer in women, representing 7.5% of all female cancer deaths all over the world [1]. Cervical cancer is a malignant tumor that occurs when the cervix tissue cells begin to grow and reproduce abnormally, without controlled cell division and cell death. If the tumor is malignant, its cells flow through the bloodstream and spread to other parts of the body, which consequently also become affected; in most cases this can be prevented through early detection [2].

Generally, a medical dataset comes with many attributes and missing values [3]. Identifying the relevant and important features for statistical model building is essential by way of optimization. It is apparent that Machine Learning (ML) methods are beneficial in prediction and optimization related explorations, and they have been extensively implemented in various types of cancer research. The study [4], which discussed various works relevant to cancer prediction/prognosis, evidenced the attainment of accurate results by means of ML techniques. R is one of the most popular and widely used software environments for statistics, data mining, and machine learning.
The wrapper method for the feature selection process is represented in Fig. 2. Some common examples of wrapper approaches are forward feature selection, backward feature elimination and recursive feature elimination.

Forward selection - It is an iterative method; initially there is no feature in the model, and in each iteration the feature which best improves the model is added. This is continued until the addition of a new variable no longer improves the model performance.

Backward elimination - In this method, we begin with all the features and eliminate the least significant feature at each iteration. This process is repeated until no improvement is observed by eliminating features.

Recursive feature elimination - It is an optimization algorithm that intends to attain the finest feature subset. It repeatedly builds models and determines the best or the worst feature at each repetition. It creates the subsequent model with the remaining features until all features are explored, and then ranks the features with respect to their order of elimination.

3.2.3 Embedded methods

Fig. 3 Feature selection—embedded method

The attribute selection using the embedded method is described in Fig. 3. This method combines the abilities of both the methods discussed earlier. It is executed by procedures which have their own built-in feature selection methods.

Various studies on cancer classification through wrapper-based feature selection [16] showed excellent performance, not only at identifying relevant genes, but also with respect to the computational cost. Accordingly, wrapper methods are used in this work for feature selection to see whether the accuracy of the model can be improved.

4.1 Data preparation

The dataset used in this work is the openly accessible cervical cancer dataset from the UCI Machine Learning Repository [17], which was gathered at Hospital Universitario de Caracas in Caracas, Venezuela. This dataset comprises medical histories of 858 patients with 36 attributes (32 input features and 4 target variables: Hinselmann, Schiller, Citology, Biopsy). The attribute information of the dataset is given in Table 1.

It is essential to feed the right data to the machine learning processes for the problem to be solved, since these algorithms learn from data. After selecting the data, it should be pre-processed and transformed.

4.1.1 Data cleaning: dealing with missing values

This cervical cancer dataset has a lot of missing values. Missing values are a common occurrence, and an efficient approach for handling such information is required. A missing value can imply a number of different things in the data. Records with missing values can be ignored, or the missing entries can be replaced with the variable mean for numerical attributes or with the most frequent value for categorical attributes. When we applied the approach of removing the records with missing values, the number of rows reduced from 858 to 737. Our aim is to reduce the number of features but not the number of records available in the dataset; hence the strategy of replacing missing values with the mean is used for the numerical attributes.

The columns STDs_cervical_condylomatosis, STDs_vaginal_condylomatosis, STDs_pelvic_inflam_disease, STDs_genital_herpes, STDs_molluscum_contagiosum, STDs_Hepatitis_B, STDs_HPV and STDs_AIDS were removed, since there were 4 or fewer patient results for these features. Similarly, the features STDs_Time_since_first_diagnosis and STDs_Time_since_last_diagnosis, which contained greater than 60% missing values (787 of 858), were also eliminated from the dataset.
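A minimal R sketch of this cleaning step is shown below. It assumes the raw CSV has been downloaded from the UCI repository as risk_factors_cervical_cancer.csv with missing values coded as "?"; the column names are illustrative of how read.csv renders the UCI headers and are not taken verbatim from the paper.

# Read the raw data; the UCI file codes missing values as "?"
cerv <- read.csv("risk_factors_cervical_cancer.csv", na.strings = "?",
                 stringsAsFactors = FALSE)

# Drop the nearly empty STD columns and the two time-since-diagnosis
# columns with more than 60% missing values
drop_cols <- c("STDs.cervical.condylomatosis", "STDs.vaginal.condylomatosis",
               "STDs.pelvic.inflammatory.disease", "STDs.genital.herpes",
               "STDs.molluscum.contagiosum", "STDs.Hepatitis.B",
               "STDs.HPV", "STDs.AIDS",
               "STDs..Time.since.first.diagnosis",
               "STDs..Time.since.last.diagnosis")
cerv <- cerv[, !(names(cerv) %in% drop_cols)]

# Replace remaining missing values in numeric columns with the column mean
for (col in names(cerv)) {
  if (is.numeric(cerv[[col]])) {
    cerv[[col]][is.na(cerv[[col]])] <- mean(cerv[[col]], na.rm = TRUE)
  }
}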
Table 1 Attributes in the cervical cancer dataset

S. no.  Attribute name                          Type
1       Age                                     int
2       Number of sexual partners               int
3       First sexual intercourse (age)          int
4       Num of pregnancies                      int
5       Smokes                                  bool
6       Smokes (years)                          bool
7       Smokes (packs/year)                     bool
8       Hormonal contraceptives                 bool
9       Hormonal contraceptives (years)         int
10      IUD                                     bool
11      IUD (years)                             int
12      STDs                                    bool
13      STDs (number)                           int
14      STDs: condylomatosis                    bool
15      STDs: cervical condylomatosis           bool
16      STDs: vaginal condylomatosis            bool
17      STDs: vulvo-perineal condylomatosis     bool
18      STDs: syphilis                          bool
19      STDs: pelvic inflammatory disease       bool
20      STDs: genital herpes                    bool
21      STDs: molluscum contagiosum             bool
22      STDs: AIDS                              bool
23      STDs: HIV                               bool
24      STDs: Hepatitis B                       bool
25      STDs: HPV                               bool
26      STDs: Number of diagnosis               int
27      STDs: time since first diagnosis        int
28      STDs: time since last diagnosis         int
29      Dx: cancer                              bool
30      Dx: CIN                                 bool
31      Dx: HPV                                 bool
32      Dx                                      bool
33      Hinselmann: target variable             bool
34      Schiller: target variable               bool
35      Citology: target variable               bool
36      Biopsy: target variable                 bool
The values of the four target variables Hinselmann, Schiller, Citology and Biopsy represent the results of cervical cancer examinations. The histogram representation of the four target variables is shown in Fig. 4.

The data in these columns can be combined to create a single target feature called 'Cancer' as the 27th feature. The advantage of combining all four target variables is to confirm the possibility of the diagnosis: higher values of this feature signify an increased likelihood of cervical cancer. If one diagnosis indicates that the patient has cancer but the other three diagnoses give different results, then the possibility of the patient having cancer is lower.
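A minimal sketch of this combination step, assuming the cleaned data frame cerv from the previous snippet and the four exam columns named Hinselmann, Schiller, Citology and Biopsy, could be:

# Sum the four exam results into a single ordinal target (0-4);
# higher values mean more exams pointed towards cervical cancer
cerv$Cancer <- cerv$Hinselmann + cerv$Schiller + cerv$Citology + cerv$Biopsy
table(cerv$Cancer)   # distribution of the combined target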
4.2 Applying feature selection to data

When the dimensionality of the data increases, the computational cost also increases exponentially. In the presence of several irrelevant features, learning models tend to overfit and become less efficient [18]. To overcome this problem, it is required to find a method to diminish the number of features under consideration. Feature subset selection works by eliminating the features that are redundant or not relevant. During the data cleaning process, ten columns which had missing values were removed, and the 27th feature 'Cancer' was added as the target by combining the other four target values. In the feature selection process, this target variable has been used to find the important and relevant features. So the dataset is now available with 858 rows and 27 attributes (22 predictor variables + 4 target variables + an additional combined target variable 'Cancer'). In this work, various types of feature selection methods are explored using the R tool to identify the most significant and optimal features.

4.2.1 Recursive feature elimination (RFE)

It is evident from our understanding that the dataset is unbalanced; hence k-fold cross-validation is required to attain better outcomes. RFE is a feature selection method that fits a model and eliminates the weakest features. Cross-validation is used with RFE to find the optimal number of features, and the finest-ranking set of features is selected. In the R tool, the Recursive Feature Elimination (RFE) procedure can be implemented using the caret package. Initially the control function to be used in the RFE algorithm should be defined. The random forest selection function, set through the rfFuncs option of the rfeControl function, is stated here.

control <- rfeControl(functions = rfFuncs, method = "cv", number = 10)

Then the RFE algorithm is implemented as follows.

rfe.train <- rfe(training_data[, 1:22], training_data[, 23], sizes = 1:10, rfeControl = control)

The original target variables in the dataset were removed, and the procedure was implemented with the predictor variables and the newly added target variable. After the implementation of the RFE algorithm, the result has been plotted and a variable importance chart has been obtained. The chart is depicted in Fig. 6. The rfe.train output has shown the following result.

The top 3 variables (out of 3):
Dx.HPV, Dx.Cancer, Dx

predictors(rfe.train) has revealed the following output.

predictors(rfe.train)
[1] "Dx.HPV" "Dx.Cancer" "Dx"

So the RFE algorithm has predicted three features, Dx.HPV, Dx.Cancer and Dx, as important.

4.2.2 Boruta algorithm

All-relevant feature selection is a moderately new sub-field in the province of feature selection [19]. Boruta is an all-relevant feature selection algorithm in R which functions as a wrapper algorithm around Random Forest. It makes a top-down search for relevant features by comparing the original attributes' importance with the importance attainable at random, assessed using their permuted copies, and gradually rejecting irrelevant features. Boruta captures all features which are statistically significant to the target variable under some conditions.

Working principle of the Boruta algorithm - The procedure of the Boruta algorithm is explained as a sequence of phases.

(a) Initially, it adds randomness to the given dataset by making shuffled copies of all features (termed shadow features).
(b) Then it trains a random forest classifier on the dataset and applies a feature ranking measure (Mean Decrease Accuracy) to estimate the relevance (higher mean value) of each feature.
(c) On each iteration, it checks whether a real feature has a higher importance than the best of its shadow features and continuously eliminates features which are deemed highly insignificant.
(d) The algorithm halts when all features get confirmed or rejected, or when it reaches a stated limit of random forest runs.
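As a rough sketch of how such a run can be set up in R: the object names boruta.train, final.boruta and training_data mirror the snippets in this section, while the seed and maxRuns value are illustrative choices rather than settings reported by the study. TentativeRoughFix() is the Boruta helper that resolves tentative attributes by comparing median importance against the best shadow attribute, as described below.

library(Boruta)
set.seed(111)                      # illustrative seed for reproducibility
# Run Boruta on the 22 predictors against the combined 'Cancer' target
boruta.train <- Boruta(Cancer ~ ., data = training_data,
                       doTrace = 2, maxRuns = 100)
print(boruta.train)
# Resolve the remaining 'Tentative' attributes before extracting the final set
final.boruta <- TentativeRoughFix(boruta.train)
getSelectedAttributes(final.boruta, withTentative = FALSE)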
In Boruta, maxRuns is the number of times the algorithm is allowed to run; the higher the maxRuns, the more selectively the variables can be picked. The default value is 100. Boruta checks for all features which are either strongly or weakly relevant to the target variable. With Boruta, a dataset containing missing values should not be used to identify significant variables. As the goal is to find the features (other than the four target variables) which are significant for deciding the outcome as Cancer or Not, the same set of data which was used in the RFE algorithm was used in Boruta as well. After training our dataset with the Boruta algorithm, it produced the following output.

print(boruta.train)

Boruta performed 99 iterations in 58.59354 s.
5 attributes confirmed important: Dx, Dx.Cancer, Dx.HPV, Smokes..packs.year., STDs.vulvo.perineal.condylomatosis;
7 attributes confirmed unimportant: First.sexual.intercourse, Hormonal.Contraceptives, IUD, IUD..years., Num.of.pregnancies and 2 more;
10 tentative attributes left: Age, Dx.CIN, Hormonal.Contraceptives..years., Number.of.sexual.partners, Smokes and 5 more;

Here, the top three features identified by the RFE algorithm were again selected as important. The variable importance chart of the Boruta algorithm is portrayed in Fig. 7. The plot is shown for all the attributes taken into consideration. Blue boxplots represent the minimal, average and maximum Z scores of a shadow attribute. Red, yellow and green boxplots indicate the Z scores of rejected, tentative and confirmed attributes respectively.

In the process of deciding if a feature is important or not, some features may be marked by Boruta as 'Tentative'. These tentative attributes are decided as confirmed or rejected by comparing the median Z score of the attributes with the median Z score of the best shadow attribute. After deciding on the tentative attributes, Boruta produced the following output for the cervical cancer data, and the chart is shown in Fig. 8.

getSelectedAttributes(final.boruta, withTentative = F)

[1] "Smokes..packs.year."
[2] "Hormonal.Contraceptives..years."
[3] "STDs"
[4] "STDs.vulvo.perineal.condylomatosis"
[5] "STDs.syphilis"
[6] "STDs..Number.of.diagnosis"
[7] "Dx.Cancer"
[8] "Dx.HPV"
[9] "Dx"

The Boruta algorithm has shown a much-improved result of variable importance compared with the earlier feature selection method (RFE), and its results are easy to understand through their clear interpretation.

4.2.3 Simulated annealing (SA)

Simulated annealing is a search algorithm that allows a suboptimal solution to be accepted with the expectation that a better solution will be obtained in the end. This algorithm is used with the aim of getting the optimal solution by producing a smaller number of feature subsets for evaluation [20]. It works by making minor random changes to
a preliminary solution and checking for improvement in the performance. The optimal variables obtained for the cervical cancer risk factors through simulated annealing (the safs function) are shown below.

print(sa_obj$optVariables)

[1] "Age"                        "Smokes..years."
[3] "Hormonal.Contraceptives"    "STDs..number."
[5] "STDs..Number.of.diagnosis"  "Dx.Cancer"
[7] "Dx.HPV"

This procedure has derived seven important features from the cervical cancer dataset.
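A rough sketch of how such a run can be set up with caret's safs() is given here; only the object name sa_obj and the use of the safs function are taken from the text above, while the control settings, seed and iteration budget are illustrative assumptions.

library(caret)
set.seed(123)                                    # illustrative seed
sa_ctrl <- safsControl(functions = rfSA,         # random forest based fitness
                       method = "cv", number = 10)
# Search over the 22 predictors for a subset that best predicts 'Cancer'
sa_obj <- safs(x = training_data[, 1:22],
               y = factor(training_data[, 23]),  # treat the target as a class label
               iters = 10,                       # illustrative iteration budget
               safsControl = sa_ctrl)
print(sa_obj$optVariables)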
4.2.4 Feature selection with machine learning algorithms

An alternative way to accomplish feature selection is to consider the variables used most by several Machine Learning (ML) algorithms as the most significant ones. ML algorithms first learn the association between the X's and Y; based on that learning, different machine learning algorithms could end up using different variables to various degrees. Therefore, variables that appear suitable in a tree-based algorithm like rpart can turn out to be less valued in a regression-based model. Hence, all variables need not be equally appropriate to all algorithms.

It is apparent that employing feature subset selection using the wrapper approach in ML algorithms can enhance classification accuracy [6]. Hence this work intends to apply feature subset selection with a few ML algorithms to validate and compare their performance. The steps to find variable importance from ML algorithms are shown below; a sketch of these steps for the rpart case follows the rpart output.

• The desired model should be trained through the train() function of the caret package
• Then the varImp() function is applied for finding the important features

A few ML algorithms, namely rpart, C5.0, svmRadial, knn, ctree and rf, were applied in this work to decide about the features which are significant to attain reliable accuracy with optimization. All these ML algorithms are intended to train the model, and the models built by these algorithms would be applied on test data. Hence, we decided to use the dataset with 26 predictor variables, which included those four target variables (Schiller, Citology, Biopsy, Hinselmann) as well. The models were trained with respect to the decisive target variable 'Cancer'.

rpart() - The R implementation of the CART algorithm is termed RPART (Recursive Partitioning And Regression Trees). The rpart algorithm works by splitting the dataset recursively until a predetermined termination condition is reached. At each step, the split is made on the independent variable which allows the greatest possible reduction in heterogeneity of the predicted variable. The rpart method has shown the following output for the cervical cancer dataset considered in our work.

rpart variable importance (output obtained in R)
only 20 most important variables shown (out of 26)

                            Overall
Schiller1                  100.0000
Citology1                   95.9648
Biopsy1                     61.0357
Hinselmann1                 53.8580
Dx.Cancer                    4.4652
Age                          1.9525
Dx                           1.6306
Dx.CIN                       1.5675
First.sexual.intercourse     0.4283
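A minimal sketch of the two steps above for the rpart case is shown here. The data frame name cer_data and the repeated cross-validation control mirror the snippets in Sect. 4.3; treating 'Cancer' as a factor and the seed are assumptions for illustration.

library(caret)
set.seed(7)                                       # illustrative seed
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 50)
# Step 1: train the model on the 26 predictors with the combined target
fit.rpart <- train(factor(Cancer) ~ ., data = cer_data,
                   method = "rpart", trControl = ctrl)
# Step 2: extract and plot the variable importance ranking
imp <- varImp(fit.rpart)
print(imp)
plot(imp, top = 20)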
Fig. 9 Significance of variables through rpart algorithm
Fig. 11 Significance of variables through rf algorithm
It is observed through Fig. 12, which shows the significance of variables through the ctree() method, that this method has selected most of these significant features in an efficient manner.

4.3 Classifier model construction and estimation of model accuracy

Feature selection through the ML algorithms has already trained the desired models for classification. Once the best feature selection subset is identified for a particular dataset, it can be used to improve the classifier accuracy [18]. Hence, to enhance the performance and accuracy of the various classifier models, we decided to apply a boosting method in determining the occurrence of cervical cancer [25]. When we are building a predictive model, there must be a way to evaluate the capability of the model on unseen data. This can be accomplished by estimating accuracy on data that was not used to train the model, such as test data, or by means of cross validation. The model should be trained on a large percentage of the dataset. Correspondingly, there is a need for a good ratio of testing data points, because a smaller number of data points can lead to a variance error while testing the model's effectiveness. It is essential that the training and testing process is iterated multiple times and that, correspondingly, the training and testing dataset distribution is changed, which helps to accurately validate the effectiveness of the model. All these requirements can be attained through k-fold cross validation.

4.3.1 K-fold cross validation

The k-fold cross validation method comprises splitting the dataset into k subsets, where each subset/fold is used once as a testing set. In the first iteration, the first fold is used for model testing and the rest are used for model training. Likewise, this process is repeated until each fold has been used for testing the model. The picture of k-fold cross validation with k = 10 is shown in Fig. 13. This method is useful in determining the accuracy of the model with reasonable combinations of data.
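The fold construction itself can be illustrated with caret, as in the sketch below; cer_data$Cancer as the target and the seed are assumptions, and the train() calls shown later perform this splitting internally.

library(caret)
set.seed(42)                                     # illustrative seed
# Split the row indices into k = 10 folds, stratified on the target
folds <- createFolds(cer_data$Cancer, k = 10)
# Example: hold out fold 1 for testing, train on the remaining nine folds
test_idx  <- folds[[1]]
train_set <- cer_data[-test_idx, ]
test_set  <- cer_data[test_idx, ]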
The procedure of splitting the data into k folds can be repeated a required number of times, which is known as repeated k-fold cross validation. The eventual model accuracy is calculated as the mean over the number of repeats. In this work, repeated cross validation techniques have been applied for the processes of data splitting, model training and testing in a recurrent manner for a large number of times (50) over the cervical cancer data. The trainControl() function in R can be used to specify the resampling type. The code to apply repeated cross validation fifty times is shown here.

control <- trainControl(## 10-fold CV
                        method = "repeatedcv",
                        number = 10,
                        ## repeated fifty times
                        repeats = 50)

The train() function in R is used to fit the predictive models based on various tuning parameters. The model training through the Random Forest technique is shown below.

fit.rf <- train(Cancer ~ ., data = cer_data, method = "rf", trControl = control)

The results attained by the rf method using 26 predictor (feature) values with 10-fold cross validation over 50 repeats are shown below.

Random Forest
600 samples
26 predictor
5 classes: '0', '1', '2', '3', '4'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 50 times)
Summary of sample sizes: 541, 540, 541, 541, 540, 540, …
Resampling results across tuning parameters:

mtry  Accuracy   Kappa
 2    0.9111862  0.4565576
14    0.9879466  0.9445185
26    0.9908683  0.9578242

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 26.

The accuracy of the rf model with repeated cross validation using 26 features is revealed through the plot shown in Fig. 14. The results attained by the rf method using 14 predictor (feature) values with 10-fold cross validation over 50 repeats are shown below.

Random Forest
600 samples
14 predictor
5 classes: '0', '1', '2', '3', '4'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 50 times)
Summary of sample sizes: 540, 539, 539, 540, 540, 540, …
Resampling results across tuning parameters:

mtry  Accuracy   Kappa
 2    0.9597476  0.8016736
 8    0.9901440  0.9548854
14    0.9927090  0.9669582

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was mtry = 14.
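The accuracy-versus-mtry curves referred to as Fig. 14 and Fig. 15 can be reproduced directly from the fitted train object; a minimal sketch, with fit.rf naming the object trained above:

# Accuracy across the tuned mtry values, as visualised in Fig. 14/15
plot(fit.rf, metric = "Accuracy")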
Similarly, the accuracy plot of the rf model with repeated cross validation using 14 features is shown in Fig. 15. Correspondingly, additional classifier models have been created with maximum possibilities on this cervical cancer data through various methods like rpart, C5.0, SVM and KNN over repeated k-fold cross validation with 50 trials, to determine whether the results obtained are significant in other algorithms as well. These results are revealed in Fig. 16 by printing and plotting some of these model outputs.
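A rough sketch of how these additional models can be trained and compared under the same resampling scheme is given below; the method names follow caret conventions, and the list of models, seed and resamples() comparison are illustrative assumptions rather than output reported by the study.

library(caret)
set.seed(99)                                     # illustrative seed
methods <- c("rpart", "C5.0", "svmRadial", "knn", "rf")
# Fit each classifier with the same repeated-CV control defined earlier
fits <- lapply(methods, function(m)
  train(factor(Cancer) ~ ., data = cer_data, method = m, trControl = control))
names(fits) <- methods
# Collect the resampling distributions and summarise accuracy and kappa
results <- resamples(fits)
summary(results)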
The results attained are comparatively improved for most of the ML methods by training the models with these significant features (14 predictors) which were obtained through the feature selection processes. The accuracy attained for the SVM model has shown very good progression with the significant features, but the KNN model has not accomplished progressive results. The plot for these models is shown in Fig. 17.

This exploration proved that the results attained are more significant when repeated cross validation techniques are applied for training the models. We have selected ten significant features which are optimal with respect to the percentage (rank) values obtained and also based on their common occurrence across a greater number of feature selection methods; subsequently, the results are more precise and significant.

4.4 Performance and accuracy estimation of ML classifier models

Accuracy is an evaluation metric for classification models which is calculated as the fraction of predictions the model got right, and the formula is given below.

Accuracy = Number of correct predictions / Total number of predictions

In this work, accuracy is estimated for some of the prevalent classification algorithms like C5.0, rpart, rf, SVM and KNN in two ways, i.e. by considering the 26 input features in the dataset (without the feature selection process) and by considering the significant features (14 features) obtained through the feature selection methods discussed earlier. The results attained for these classifiers are revealed through cross tables, performance accuracy and AUC values.
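How such estimates can be produced is sketched below under several assumptions: a held-out test set test_set, a fitted caret model fit that supports class probabilities, and a binary view of the outcome (any positive exam vs. none) for the AUC; the use of gmodels for the cross table and pROC for the AUC is inferred from the mention of cross tables and AUC values, not stated by the paper.

library(gmodels)   # CrossTable()
library(pROC)      # roc(), auc()

# Predicted classes on the held-out data
pred <- predict(fit, newdata = test_set)

# Cross table of actual vs. predicted classes
CrossTable(x = test_set$Cancer, y = pred, prop.chisq = FALSE)

# Accuracy = number of correct predictions / total number of predictions
accuracy <- mean(pred == test_set$Cancer)

# AUC on a binary view of the outcome, scoring each patient by one minus
# the predicted probability of the all-negative class (column "0" assumed)
probs <- predict(fit, newdata = test_set, type = "prob")
roc_obj <- roc(response = as.integer(test_set$Cancer != 0),
               predictor = 1 - probs[["0"]])
auc(roc_obj)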
This work involved data cleaning, replacement of missing values and the application of a feature selection process to achieve higher efficiency in outcome prediction with an optimal feature subset. To evaluate the performance of the classifier models, this work employed ML methods on the cervical cancer data by considering all the records in the dataset, replacing the missing values in the rows with their mean and eliminating only the columns which had missing values. Hence, after the data cleaning process the dataset had 858 rows with 26 predictors. Then, by implementing a few imperative feature selection techniques and by training the models through ML algorithms, an optimal feature subset has been selected based on the importance of the variables. The following attributes have been identified as more significant, in addition to the four target features, for cervical cancer diagnosis prediction.

Hormonal.Contraceptives..years.    Dx.Cancer
First.sexual.intercourse           Dx
Number.of.sexual.partners          Dx.HPV
STDs..Number.of.diagnosis          Smokes..years.
Age                                Num.of.pregnancies

ML classifier models with the C5.0, RF, RPART, SVM and KNN methods have been built with the repeated k-fold cross validation technique, with all the 26 features as well as with an optimal feature subset of 14 predictors, for diagnosis prediction of cervical cancer. The results of the classifier models through the C5.0 and rf algorithms with the optimal significant features are significantly upgraded, to 99% to 100%. In both ways, this work revealed the C5.0 and rf methods as the more prominent algorithms for predicting significant risk factors in cervical cancer. The relative performance analysis of the conferred classification methods is shown in Table 2 with their accuracy and AUC values.

Table 2 Comparative analysis of ML algorithms based on accuracy and AUC values (columns: C5.0, RF, RPART, SVM, KNN; rows: features/attribute details and evaluation metrics)

The performance evaluation of the ML classification algorithms is exhibited through the bar plot shown in Fig. 25. Random forest and C5.0 have both performed equally well, with maximum accuracy and a reduced error rate.

We have selected significant predictors based on their importance and mutual existence over the feature selection methods and by training the models through repeated k-fold cross validation with ML methods. In the unbiased feature list, there are only three features which are common to most of these methods; if we employ these minimal features for the ML classification process, the results will not be precise. This shows that stability in feature selection is an important issue, and its importance has been determined through this work. Therefore, an optimized feature selection approach is essential to improve the performance accuracy of the prediction process; accordingly, with an optimal feature subset, an efficient performance has been gained through this work for cervical cancer diagnosis prediction.
The feature selection process over the Boruta algorithm, SA and the ctree() method has shown good proficiency in capturing the major features of an optimal feature subset for cervical cancer risk factor prediction. However, not all the information related to the dataset was provided, and some decisions, such as factorizing or not factorizing and the replacement of variables, were made based on assumptions. Through the examination of the C5.0, rpart, Random Forest, SVM and KNN algorithms, we have found that most of the algorithms were efficient in providing cervical cancer diagnosis with advanced accuracy. Overall, the C5.0 and Random Forest classifiers have performed reasonably well, besides being extremely accurate through reliable results with maximum accuracy for identifying women exhibiting clinical signs of cervical cancer. It is apparent through this work that an enhanced prediction accuracy for cervical cancer diagnosis can be attained by including an optimal feature subset through enhanced feature selection approaches and by building the classifier models with ML algorithms through repeated k-fold cross validation techniques. This work can be extended to predictions for other types of gynecological cancer. Altogether, the conferred classifiers have shown enhanced performance accuracy with the optimal features' dataset.

Compliance with ethical standards

Conflict of interest The authors declare that they have no conflict of interest.

References

1. World Health Organization (2019) Fact sheet: human papillomavirus (HPV) and cervical cancer. Retrieved 13-02-2019
2. Sarwar A et al (2015) Performance evaluation of machine learning techniques for screening of cervical cancer. INDIACom-2015; ISSN 0973-7529; ISBN 978-93-80544-14-4
3. Abdoh SF, Abo Rizka M, Maghraby FA (2018) Cervical cancer diagnosis using random forest classifier with SMOTE and feature reduction techniques. IEEE Access 6:59475–59485
4. Kourou K et al (2015) Machine learning applications in cancer prognosis and prediction. Comput Struct Biotechnol J 13:8–17
5. Bischl B et al (2016) mlr: machine learning in R. J Mach Learn Res 17:1–5
6. Gowda A et al (2010) Feature subset selection problem using wrapper approach in supervised learning. Int J Comput Appl 1(7):13–17
7. Lavanya D et al (2011) Analysis of feature selection with classification: breast cancer datasets. Indian J Comput Sci Eng (IJCSE) 2(5):756–763
8. Sowjanya D et al (2014) Staging prediction in cervical cancer patients—a machine learning approach. Int J Innov Res Pract 2(2):14–23
9. Akyol K (2018) A study on test variable selection and balanced data for cervical cancer disease. Int J Inf Eng Electron Bus 10:1
10. Menon V, Parikh D (2018) Machine learning applied to cervical cancer data. Int J Sci Eng Res 9(7):46–50
11. Choudhary A et al (2018) Classification of cervical cancer dataset. In: Proceedings of the 2018 IISE annual conference, Orlando, pp 1456–1461
12. Jović A, Brkić K, Bogunović N (2015) A review of feature selection methods with applications. In: 2015 38th international convention on information and communication technology, electronics and microelectronics (MIPRO), Opatija, pp 1200–1205
13. Bagherzadeh-Khiabani F et al (2016) A tutorial on variable selection for clinical prediction models: feature selection methods in data mining could improve the results. J Clin Epidemiol 71:76–85
14. Le Thi HA et al (2015) Feature selection in machine learning: an exact penalty approach using a difference of convex function algorithm. Mach Learn 101:163–186
15. Park HW et al (2017) A hybrid feature selection method to classification and its application in hypertension diagnosis. In: ITBAM 2017, LNCS 10443. Springer, pp 11–19
16. Ruiz R et al (2006) Incremental wrapper-based gene selection from microarray data for cancer classification. Pattern Recognition 39(12):2383–2392
17. UCI Machine Learning Repository, Cervical cancer (Risk Factors) Data Set. Retrieved February 5, 2019, from https://archive.ics.uci.edu/ml/datasets/Cervical+cancer+%28Risk+Factors%29
18. Zhao Z et al (2010) Advancing feature selection research—ASU feature selection repository. Citeseer
19. Rudnicki WR, Wrzesień M, Paja W (2015) All relevant feature selection methods and applications. In: Stańczyk U, Jain L (eds) Feature selection for data and pattern recognition. Studies in computational intelligence, vol 584. Springer, Berlin
20. Antony DA (2016) Literature review on feature selection methods for high-dimensional data. Int J Comput Appl 136:0975–8887
21. Pandya R, Pandya J (2015) C5.0 algorithm to improved decision tree with feature selection and reduced error pruning. Int J Comput Appl 117(16):18–21
22. Nguyen C et al (2013) Random forest classifier combined with feature selection for breast cancer diagnosis and prognostic. J Biomed Sci Eng 6:551–560
23. Genuer R et al (2015) An R package for variable selection using random forests. The R J R Found Stat Comput 7(2):19–33
24. Jacobucci R (2018) Decision tree stability and its effect on interpretation. Retrieved from osf.io/m5p2v
25. Dinov ID (2018) Improving model performance. In: Data science and predictive analytics. Springer, Cham, pp 497–511
26. Seethal CR, Panicker JR, Vasudevan V (2016) Feature selection in clinical data processing for classification. In: International conference on information science (ICIS), pp 172–175

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.