A Two-Stage Feature Selection Framework For Colon Cancer Classification Using SVM
Abstract—As the colon cancer gene expression dataset is of high dimension, many irrelevant, redundant and noisy features may be included, which can cause unprecedented challenges for data mining and machine learning algorithms. In this paper, we propose a new feature-selection-based method for colon cancer classification. First, we use the ReliefF filter technique to rank the features by their discriminatory ability. Second, since ReliefF handles neither feature redundancy nor feature interaction well, a further step is performed to select the best subset from the available 2K gene expression profiles. The proposed method efficiently reduces the dimensionality of the colon dataset and increases the classification accuracy. The results on the Colon Cancer Gene Expression Data Set confirm the effectiveness of the proposed method compared to advanced techniques.

Index Terms—Colon cancer, Feature selection, Machine learning, Data Mining, ReliefF, Dimensionality reduction.

I. INTRODUCTION

Colon cancer is a significant public health issue, and the global incidence of this cancer has increased rapidly as the population has grown. The 2020 World Health Organization (WHO) GLOBOCAN study reported 1.93 million new cases of colorectal cancer and 935,000 deaths from CRC [1].

Today, a number of cancers have become very common and are, unfortunately, the cause of death of many of those affected. In the world ranking of the most common cancers, colon cancer is the third most common cause of cancer death [2]. According to a recent study [3], around 25% of CRC cases involve a predisposition to the disease. Golub et al. [4] began by developing a generic approach to cancer classification based on monitoring gene expression with DNA microarrays, and microarray-based gene expression has since been used extensively in diagnosing and analyzing colon cancer. The serious condition of some patients is often due to a delay in detecting the disease, so accelerating the detection and prediction of these cancers has a positive and direct impact on patient survival and treatment costs. Faced with the large number of people affected by this disease, it is worthwhile to develop techniques that identify the most pertinent genes, those with a direct impact on the results obtained.

The microarray dataset contains thousands of genes while the sample count is generally small, which makes it difficult to determine the most relevant genes; although every gene carries tracking information, many genes are redundant. The two current approaches to deriving feature genes from gene expression data for cancer classification are feature transformation and feature selection [5]. Feature transformation produces a new set of characteristics from the original tumor characteristics and thereby reduces their number. Yet even when the transformed features have high discriminative power, they sometimes fail to retain the biological information of the original gene expression: because the transformation loses the link to the original data, cancer-related target genes cannot be identified. Unlike feature transformation, feature selection methods work by eliminating insignificant or duplicate characteristics while preserving the best classification accuracy; they do not create new features. Feature selection does not transform the original features: it reduces the dimensionality and builds a robust learning model from the selected data [6].

The goal of feature selection in machine learning is to obtain the best set of features with which useful learning models can be constructed. Feature selection techniques can be categorised as follows. Supervised techniques are applied to labeled data to determine the features most relevant to supervised models, such as classification and regression. Unsupervised techniques can be applied to unlabeled data. Essentially, three types of feature selection exist: filter methods, wrapper methods (forward, backward and stepwise selection) and embedded methods [7, 8]. Filter methods evaluate the intrinsic properties of the measured features through univariate statistics instead of cross-validation performance; they are faster and computationally cheaper than wrapper methods, which makes them attractive for high-dimensional data. Wrapper methods rely on greedy search algorithms that iteratively evaluate candidate subsets of features with a specific machine learning algorithm, testing a different subset at each iteration until an optimal subset is reached. Two main drawbacks of this method are a high computation time for data with many features and a tendency to overfit when there are not many data points.

978-1-6654-9558-5/22/$31.00 ©2022 IEEE
Authorized licensed use limited to: VNR Vignana Jyothi Inst of Eng & Tech. Downloaded on December 14,2023 at 08:44:46 UTC from IEEE Xplore. Restrictions apply.
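As a concrete illustration of the taxonomy above, the sketch below contrasts a univariate filter with a greedy wrapper in scikit-learn. The synthetic data, feature counts and estimator are illustrative assumptions, not the paper's setup.

```python
# Hypothetical contrast of a filter vs. a wrapper feature selector on
# synthetic data (scikit-learn); the colon dataset itself is not used here.
from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, SequentialFeatureSelector,
                                       f_classif)
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=80, n_features=50, n_informative=5,
                           random_state=0)

# Filter: rank features with a univariate statistic; no model is trained.
filt = SelectKBest(score_func=f_classif, k=10).fit(X, y)

# Wrapper: greedy forward search that refits a model for every candidate
# feature at every step -- costlier, but it sees feature interactions.
wrap = SequentialFeatureSelector(LogisticRegression(max_iter=1000),
                                 n_features_to_select=10, cv=3).fit(X, y)

print(filt.get_support().sum(), wrap.get_support().sum())  # 10 10
```

The filter ranks all 50 features in one pass, while the wrapper trains hundreds of models to grow its subset, which is exactly the cost trade-off discussed above.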
Embedded methods combine the benefits of wrapper and filter methods, capturing feature interactions while maintaining a reasonable computational cost. These methods are iterative: each iteration of the model training process is taken into account, and the features that contribute most to that iteration's training are extracted.

In this paper, a feature selection method based on data augmentation is proposed. ReliefF is used in a first step to filter the features that are particularly sensitive to feature interactions. Since this filter alone is not enough to obtain the best results, i.e. to extract the features that have a direct effect on the classification result, recursive feature elimination (RFE) is used in a second step: starting with all the features of the training set, features are successively removed until the desired number remains. Next, an SVM [9, 10] classifier is built to predict colon cancer. Experimental results demonstrate that the ReliefF-RFE-SVM algorithm based on feature selection performs better than the algorithm without feature selection. The document is organized as follows: Section 2 presents previous work in the field of colon cancer detection with machine learning tools applied to microarray data; Section 3 describes the architecture and methodology of the proposed system; Section 4 is devoted to the analysis of the experimental results and discussion; finally, the findings of this study are summarized in Section 5.

II. RELATED WORK

Many recent studies have applied machine learning techniques combined with feature selection methods to health data. Park & Kim formulated a sequential random k-nearest-neighbor feature selection method and examined its properties on 20 microarray gene expression datasets [11]. An intelligent technique based on feature selection using t-statistics has been proposed for predicting colon cancer; the authors achieved an accuracy of almost 85% using t-statistic feature selection and the Support Vector Machine (SVM) classifier [12]. A feature selection algorithm using a fuzzy decision tree (FDT) was introduced by S.A. Ludwig et al. [13] to analyze gene expression for colon cancer classification, achieving an accuracy of 80.28% with 20 selected features. The Modified Analytical Hierarchy Process (MAHP) with a Probabilistic Neural Network (PNN) was introduced in [14] as a new aggregate gene selection method for the classification of microarray data [15]. Others are based on a genetic algorithm (GA) for feature reduction, genetic programming (GP) for cancer classification from gene expression profiles, and information gain (IG) for the selection of features; for the classification of colon tumors, the suggested algorithm obtained a precision of 85.48%.

A genetic algorithm (GA) selection method was also proposed for selecting the best subset of characteristics in a breast cancer diagnostic system [16]. Random forest is a classifier based on an ensemble of classification and regression trees (CART). In comparison with other classifiers such as SVM, neural networks and decision trees, it reduces overfitting and is hence more precise. It is also used as a feature selection approach for ranking the importance of features.

III. METHODOLOGY

Binary colon classification data were used in the proposed method to detect colon cancer: from a patient's data, the method predicts whether a tumor is present or not. The proposed method consists of the following three main components: data augmentation, filtering and subset selection. Figure 1 represents this research's methodology.

The main steps of the current study are the following. First, data collection allows us to extract a simple database. In the second phase, the output data are preprocessed: we apply a technique to augment the diversity of the data and then normalize the dataset using a normalization method. In the third phase, we use two feature selection algorithms to train and test the data. In the fourth phase, we apply different machine learning algorithms. The last phase is a comparative study of the selected features against another model based on feature selection.

Fig. 1. Proposed Method.

A. Data Augmentation

Machine learning applications continue to grow rapidly, and data augmentation techniques can be a good tool against the challenges facing the world of artificial intelligence. Data augmentation improves the performance and results of machine learning models by creating new and different examples, which ensures more reliable results: the richer and more adequate the data available to a machine learning model, the more efficient and accurate the model. In our case, we went from 62 samples to 80 after using this method.
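The augmentation step that takes the corpus from 62 to 80 samples is not specified in detail above, so the following sketch shows one plausible implementation under stated assumptions: oversampling random rows with small Gaussian jitter. The stand-in expression matrix and the noise scale are illustrative, not the paper's actual procedure.

```python
# Illustrative (assumed) augmentation: grow a 62-sample expression matrix
# to 80 samples by duplicating randomly chosen rows with Gaussian jitter.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5000.0, scale=1500.0, size=(62, 2000))  # stand-in matrix
y = np.array([0] * 40 + [1] * 22)                          # 40 abnormal, 22 normal

def augment(X, y, target, noise=0.05):
    """Append perturbed copies of randomly chosen samples until the
    dataset holds `target` rows; labels are copied unchanged."""
    idx = rng.choice(len(X), size=target - len(X), replace=True)
    jitter = rng.normal(scale=noise * X.std(axis=0),
                        size=(len(idx), X.shape[1]))
    return np.vstack([X, X[idx] + jitter]), np.concatenate([y, y[idx]])

X_aug, y_aug = augment(X, y, target=80)
print(X_aug.shape, y_aug.shape)  # (80, 2000) (80,)
```

Scaling the jitter by each gene's standard deviation keeps the perturbation proportional to that gene's natural spread, so the synthetic samples stay plausible.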
For high-precision models, it is imperative that one of the steps in the data pipeline is cleaning the data. However, if the clean-up reduces the representativeness of the data, the model cannot provide good forecasts for the actual inputs. Data augmentation techniques improve the robustness of machine learning models by creating variations that the model may observe in the real world.

B. Filtering stage

Feature selection is a major factor in interpretation and prediction. It also facilitates the classification process by leaving out unneeded attributes. Feature selection makes it possible to identify the most important features of microarray or other large datasets; this reduces the classifier's workload and thereby enhances classification accuracy. Two techniques are examined in this paper: Recursive Feature Elimination (RFE) and ReliefF [17]. Both techniques take into account the impurity-based importance of a variable and the importance derived from the out-of-bag error [18].

• ReliefF: a feature selection method that, for every training example, looks for its closest neighbour in the same class (nearest hit) and in the opposite class (nearest miss). The score of each feature is the difference between the two. This score can then be used to rank the features and select the top-scoring ones [20].

• Recursive Feature Elimination (RFE): RFE is a recursive process that starts with the whole feature space; at each iteration, the worst-performing features are discarded, and the features of the last model are chosen as the best feature subset. In this paper, we have used Random Forest as the evaluation model to assess the quality of the features. The RFE step handles the redundancy problems that may arise when using just a filtering technique such as ReliefF. In addition, RFE captures the beneficial interactions between features [21].

C. Classification stage

• Support Vector Machine (SVM): one of the machine learning algorithms known for its great flexibility; its robust theory guarantees ease of use, even without much knowledge of data mining. The algorithm can solve classification problems as well as regression and anomaly detection.

• K-Nearest Neighbors (KNN): a versatile algorithm that can be used for classification or regression. It assumes that similar examples lie close to each other, so it aims to determine the minimum distance between neighbors. Its disadvantage is that it becomes much slower as the number of observations and independent variables increases.

• Random forest: one of the supervised algorithms widely used to resolve classification and regression issues, based on the predictions of decision trees. It predicts the majority vote or average output of the different trees, and increasing the number of trees increases the accuracy of the result. It gives particularly good results on classification issues.

IV. SIMULATION STUDY

A. Dataset

Information on colon cancer gene expression was obtained from [10] during the data acquisition phase. The dataset consists of 62 cases (samples) and 2,000 genes (attributes) of colon cancer patients. These include:
• 40 tumor biopsies (marked as abnormal)
• 22 normal biopsies.
Colon tumor sample data can be seen in Table I.

TABLE I
DATASETS

No   Att1       Att2       Att3       Att4       ...  Att2000   Class
1    8589.416   5468.2407  4263.4077  4064.9358  ...  28.70125  Abnormal
2    9164.254   6719.5293  4883.4487  3718.159   ...  16.77375  Normal
3    3825.705   6970.3613  5369.9688  4705.65    ...  15.15625  Abnormal
4    6246.4487  7823.534   5955.835   3975.5642  ...  16.085    Normal
...  ...        ...        ...        ...        ...  ...       ...
62   7472.01    3653.934   2728.2163  3494.4805  ...  39.63125  Normal

B. Evaluation strategy

In this study, we have evaluated the performance of the proposed method in conjunction with three classification algorithms (Random Forest, K-Nearest Neighbors and Support Vector Machine). Since the performance evaluation of a machine learning model is a crucial component of any data science project, we have used 3-fold cross-validation with a grid search strategy to estimate the generalization performance, in terms of accuracy, on unseen data.

C. Results and discussion

This section explains the experimental results obtained during the three phases, namely the evaluation of the classification phase with and without the feature selection methods, and the benchmarking phase. During the experimental tests, we examined each of the 2000 genes to classify the dataset as normal or abnormal. Table II presents the confusion matrix, and Table III the recall, precision and accuracy, for the baseline model without feature selection.

TABLE II
CONFUSION MATRIX FOR BASELINE MODEL

Class     Predicted Positive  Predicted Negative
Positive  12, TP              2, FN
Negative  1, FP               12, TN

Our classification models based on SVM, KNN and random forest, using data augmentation, can accurately detect 74 elements out of a total of 80.
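Putting the stages above together, the sketch below chains a simplified Relief score (one nearest hit/miss per sample, rather than full ReliefF's k neighbours), RFE with a random forest, and an SVM tuned by 3-fold grid search. The synthetic data and all parameter values are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of the two-stage pipeline: Relief-style filter -> RFE(random
# forest) -> SVM with 3-fold grid search. Simplified and illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def relief_scores(X, y):
    """Reward features on which each sample is close to its nearest
    same-class neighbour (hit) and far from its nearest other-class
    neighbour (miss)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)             # a sample is not its own hit
    scores = np.zeros(X.shape[1])
    for i in range(n):
        hit = np.argmin(np.where(y == y[i], dist[i], np.inf))
        miss = np.argmin(np.where(y != y[i], dist[i], np.inf))
        scores += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return scores / n

X, y = make_classification(n_samples=80, n_features=200, n_informative=8,
                           random_state=1)

# Stage 1 (filter): keep the 50 top-ranked features.
top50 = np.argsort(relief_scores(X, y))[::-1][:50]

# Stage 2 (redundancy / interactions): RFE with a random forest evaluator.
rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=1),
          n_features_to_select=10).fit(X[:, top50], y)

# Final model: SVM tuned with 3-fold cross-validated grid search.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
                    cv=3).fit(X[:, top50][:, rfe.support_], y)
print(rfe.support_.sum())  # 10
```

The filter cheaply discards most of the feature space so that the expensive recursive elimination only has to search the 50 surviving candidates.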
The overall accuracy of the SVM model is 93.17% using all genes; the KNN model reaches 84.23% and Random Forest 93.17%. We then applied the mean decrease in accuracy and the mean decrease in Gini of each gene as a feature selection procedure to suppress the majority of the unnecessary and redundant genes in the dataset. The objective is to identify a subset of discriminatory genes which enhances the performance of the learning models. The 50 major genes selected by the two feature selection methods have been used to construct a solid learning model. Table IV presents the final confusion matrix, and Table V the recall, precision and accuracy, for the model with feature selection.

TABLE III
PERFORMANCE MEASURE FOR BASELINE MODEL

class                 Recall  Precision  F1-score  Accuracy(%)
Abnormal              0.92    0.86       0.89
Normal                0.86    0.92       0.89      93
Weighted measure (%)  85.68   85.68      88.68

TABLE IV
CONFUSION MATRIX FOR FEATURE SELECTION MODEL

Class     Predicted Positive  Predicted Negative
Positive  14, TP              0, FN
Negative  0, FP               14, TN

TABLE V
PERFORMANCE MEASURE FOR FEATURE SELECTION MODEL

Measurements based on the 50 most selected genes are presented in Tables VI and VII respectively. The model based on the top 50 selected genes allows the correct detection of 79 out of 80 samples, reaching 99.07% accuracy with the SVM classifier. Table VIII compares the model with and without feature selection.

TABLE VI
PERFORMANCE ANALYSIS OF THE MODEL WITHOUT FEATURE SELECTION

Methods        Accuracy
SVM            93.17%
Random Forest  93.17%
KNN            84.23%

TABLE VII
PERFORMANCE ANALYSIS OF THE MODEL WITH FEATURE SELECTION

Methods        Accuracy
SVM            99.07%
Random Forest  91.23%
KNN            93.20%

TABLE VIII
COMPARATIVE ANALYSIS OF THE MODEL

                                 Accuracy
Methods                          SVM     KNN     Random Forest
Model without feature selection  93.17%  84.23%  93.17%
Model with feature selection     99.07%  93.20%  91.23%

Table VIII shows that, when using the model with feature selection, the SVM and KNN measures outperform their counterparts without feature selection. Table IX illustrates how our proposed methodology compares to existing approaches; it shows that the performance of our method on this gene expression data is better than that of all the other methods.

TABLE IX
PERFORMANCE COMPARISON AMONG DIFFERENT METHODS

Publication                   Methods              Nb att  Accuracy
Simone A. Ludwig et al. [13]  FDT                  20      80.23%
Nguyen T et al. [14]          MAHP+PNN             5       88.89%
Lingyun Gao et al. [15]       FCBFS+SVM            14      93.55%
Salem H et al. [16]           IG+GA+GP             60      85.48%
Proposed method [22]          MDA+MDG+RF           33      95.16%
Our proposed method           ReliefF + RFE + SVM  50      99.07%
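As a quick consistency check, the per-class recall/precision/F1 values in Table III can be reproduced directly from the Table II confusion matrix (TP=12, FN=2, FP=1, TN=12) with the standard definitions; the snippet below does so (the table's weighted and accuracy entries depend on averaging conventions not spelled out above).

```python
# Deriving Table III's per-class measures from Table II's counts.
tp, fn, fp, tn = 12, 2, 1, 12

def prf(tp, fp, fn):
    """Precision, recall and F1 for the class treated as 'positive'."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)

pos = prf(tp, fp, fn)          # one class: its false alarms are the FPs
neg = prf(tn, fn, fp)          # other class: the FP/FN roles swap
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(pos)                     # (0.92, 0.86, 0.89)
print(neg)                     # (0.86, 0.92, 0.89)
```

The two tuples match the Abnormal and Normal rows of Table III, confirming that the reported per-class figures follow from the published confusion matrix.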
V. CONCLUSION

In this study, we evaluated the use of machine learning tools for enhancing colon cancer classification accuracy. More precisely, we have proposed a two-stage feature selection method for colon cancer classification. The system first performs a filtering phase in order to eliminate the unimportant features and to speed up the training and classification process. Second, another selection phase is used to reduce the redundancy problem and to take into account the underlying interactions that may exist between the high number of features (2000 features). In addition, the selected features are used to construct an SVM classifier for colon cancer classification. We have presented the results with and without feature selection so as to scrutinize the efficiency of the feature selection technique. The experiments confirmed the validity of the proposed method, where the highest classification accuracy (about 99%) is achieved by applying our proposed system with just 50 genes rather than the whole set of 2000 genes. The proposed method works very well with binary data, so our next goal is to improve it so that it can be applied to 2D and 3D data.

REFERENCES

[1] World Health Organization (WHO) Cancer. https://www.who.int/health-topics/cancer.
[2] Siegel RL, Miller KD, Jemal A (2019) Cancer statistics. CA Cancer J Clin 69(1):7–34.
[3] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Bloomfield CD (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537.
[4] Xi M, Sun J, Liu L, Fan F, Wu X (2016) Cancer feature selection and classification using a binary quantum-behaved particle swarm optimization and support vector machine. Comput Math Methods Med 2016:1–9.
[5] Ghazavi SN, Liao TW (2008) Medical data mining by fuzzy modeling with selected features. Artif Intell Med 43(3):195–206.
[6] Blum AL, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97(1–2):245–271.
[7] Das S (2001) Filters, wrappers and a boosting-based hybrid for feature selection. In: Proceedings of the 18th international conference on machine learning. Morgan Kaufmann Publishers, San Francisco, Calif, USA.
[8] Nguyen HN, Vu TN, Ohn SY, Park YM, Han MY, Kim CW (2006) Feature elimination approach based on random forest for cancer diagnosis. In: Mexican international conference on artificial intelligence. Springer.
[9] Ram M, Najafi A, Shakeri M (2017) Classification and biomarker genes selection for cancer gene expression data using random forest. Iran J Pathol 12(4):339–347.
[10] Park CH, Kim SB (2015) Sequential random k-nearest neighbor feature selection for high-dimensional data. Expert Syst Appl 42(5):2336–2342.
[11] Alladi SM, Shinde SP, Ravi V, Murthy US (2008) Colon cancer prediction with genetic profiles using intelligent techniques. Bioinformation 3(3):130–133.
[12] Ludwig SA, Picek S, Jakobovic D (2018) Classification of cancer data: analyzing gene expression data using a fuzzy decision tree algorithm. Springer, Berlin, pp 327–347.
[13] Nguyen T, Khosravi A, Creighton D, Nahavandi S (2015) A novel aggregate gene selection method for microarray data classification. Pattern Recognit Lett 60–61:16–23.
[14] Salem H, Attiya G, El-Fishawy N (2017) Classification of human cancer diseases by gene expression profiles. Appl Soft Comput 50:124–134.
[15] Aalaei S, Shahraki H, Rowhanimanesh A, Eslami S (2016) Feature selection using genetic algorithm for breast cancer diagnosis: an experiment.
[16] Breiman L (2001) Random forests. Mach Learn 45(1):5–32.
[17] Han H, Guo X, Yu H (2016) Variable selection using mean decrease accuracy and mean decrease gini based on random forest. In: 7th IEEE international conference on software engineering and service science, Beijing, 219–224.
[18] Akhiat Y, Manzal Y, Chahhou M, Zinedine A (2021) A new noisy random forest based method for feature selection. Cybernetics and Information Technologies 21(2):10–28.
[19] Akhiat Y, Asnaoui Y, Chahhou M, Zinedine A (2021) A new graph feature selection approach. In: 2020 6th IEEE Congress on Information Science and Technology (CiSt), pp 156–161. IEEE.
[20] Akhiat Y, Chahhou M, Zinedine A (2019) Ensemble feature selection algorithm. International Journal of Intelligent Systems and Applications 11(1):24.