Optimized Gene Classification Using Support Vector Machine With Convolutional Neural Network For Cancer Detection From Gene Expression Microarray Data
ISSN No:-2456-2165
Abstract:- There are numerous approaches for handling microarray gene expression data, since new feature selection techniques are constantly being developed. Feature selection (FS) is used to pinpoint the essential feature subset and so create a new subset of pertinent features. The expectation is that a classification model built solely on this informative subset will have higher predictive accuracy than a model built on the whole collection of attributes.

We offer an analytical approach for cancer classification and develop a model that uses a Support Vector Machine as the classifier, followed by a Convolutional Neural Network from Deep Learning. The results obtained with the proposed model are very impressive and accurate.

Keywords:- Feature selection; Optimization; Classification; Support Vector Machine (SVM); Deep Learning; Machine Learning; Convolutional Neural Network (CNN).

I. INTRODUCTION

A microarray expression experiment records the expression levels of thousands of genes simultaneously; each gene is a segment of DNA that carries the information needed to make several kinds of proteins in our body. The main methods used in these experiments, as the authors explain, are either monitoring each gene multiple times under various conditions, or evaluating each gene in a single environment but in different types of tissue, particularly cancerous tissue [9]. One method for reducing classifier calculation errors is feature selection (FS), which removes noisy, redundant, and irrelevant attributes from the original data set and selects the important ones. According to the authors, feature selection techniques generally fall into three main categories: wrapper, filter, and embedded models [7].

Microarray datasets face two primary issues: an excessive number of genes and a comparatively small number of samples. The process of identifying relevant features from the data and representing the higher-dimensional dataset with a condensed search space is known as feature selection (FS). The proper FS resolution for microarray data, however, is very difficult to determine because the sample size is smaller than the total number of genes. There are a number of things to take into account when reducing the dataset's dimensionality [13]. Microarray gene expression studies frequently produce a large number of features for a limited number of patients, resulting in a high-dimensional dataset with a small number of samples. Gene expression data is extremely complicated and problematic; genes are connected with one another either directly or indirectly, making the classification process extremely challenging. Typically, this means that a precise and potent feature selection technique is required [10].

The field of Deep Learning (DL) is concerned with using deep networks for information processing. Convolutional Neural Networks (CNNs) are a type of artificial neural network that can extract local features from data. A CNN shares weights within each feature map, which simplifies the network model and reduces the overall number of weights. DL is designed to handle data using both supervised and unsupervised methods, with learning taking place on several layers of features and descriptions [17]. The processes of feature extraction and selection result in a distilled set of the essential characteristics that define the data and allow it to be categorized. This classification can be carried out with or without supervision. While supervised learning uses data with output class labels, unsupervised learning uses data without output class labels. In supervised approaches, an algorithm learns the relationship between patterns in the input attribute variables and the associated labels in the output [5].

The main objective of this work is to obtain better results, with the highest accuracy, for feature selection of gene expression microarray data. Much work has been done to date, but it remains necessary to re-examine the datasets in order to better predict and identify cancer from the affected gene expression microarray data. Here, we propose an embedded approach for Dimension Reduction and Classification using Support Vector Machine with Convolutional Neural Network (DRC-SVM-CNN). This approach has produced results with better accuracy and prediction.

This paper contains the following sections: Section 2 describes the Microarray Technology and Datasets. Section 3 evaluates the related works and published articles compiled by researchers. Section 4 explains the proposed methods and model used to address the challenge of feature selection. In
The outcome will consequently be the minimum feature size and the highest classification accuracy, in order to acquire the best feature subsets (standard deviation and average number of features).

D. Applied Methodologies
Here we explain the methodologies applied in the analysis.

Support Vector Machine
The foundation of Support Vector Machines (SVMs) is a mathematical formulation that looks for the best hyperplane to divide two data classes. An SVM's principal mathematical expression is as follows:

Given a dataset with input data points x_i and their corresponding labels y_i ∈ {−1, 1}, where i = 1, 2, …, N, an SVM seeks to find a hyperplane defined by:

w ⋅ x + b = 0 (1)

Where:
w is the weight vector that is orthogonal to the hyperplane.
x represents the input features of a data point.
b is the bias term that shifts the hyperplane away from the origin.

The aim is to find the values of w and b that maximize the margin, the distance between the hyperplane and the closest data points from each class. Those closest data points are the support vectors.

The SVM optimization can be formulated as maximizing the margin subject to the following constraint:

y_i (w ⋅ x_i + b) ≥ M, for i = 1, 2, …, N (2)

Where:
M is the margin (the distance between the two parallel hyperplanes).
w and b are the parameters to be optimized.

Convolutional Neural Network
Convolutional neural networks, or CNNs, are a kind of deep learning model mainly used for tasks involving images; however, they can also be used with other kinds of data. Figure 4 illustrates the typical architecture of a Convolutional Neural Network (CNN).
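As a minimal sketch of the SVM formulation in Equations (1) and (2), the Python snippet below evaluates the left-hand side y_i (w ⋅ x_i + b) for a small set of points and picks out the samples that attain the smallest value, which play the role of support vectors. The toy data, the hand-chosen hyperplane, and all function names are hypothetical and are used only for illustration; this is not the training procedure used in the paper.

```python
import math


def decision_value(w, b, x):
    """Return w . x + b; Equation (1) is the set of points where this is zero."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def functional_margins(w, b, points, labels):
    """Return y_i * (w . x_i + b) for every sample, the left-hand side of Equation (2)."""
    return [y * decision_value(w, b, x) for x, y in zip(points, labels)]


def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any sample to the hyperplane."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return min(functional_margins(w, b, points, labels)) / norm


# Hypothetical toy data: two features per sample, labels in {-1, +1}.
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

# Hypothetical hyperplane x1 + x2 - 2 = 0, chosen by hand rather than learned.
w, b = (1.0, 1.0), -2.0

margins = functional_margins(w, b, points, labels)
M = min(margins)  # largest M for which Equation (2) holds with this w and b
support_vectors = [x for x, m in zip(points, margins) if math.isclose(m, M)]

print("margins:", margins)
print("support vectors:", support_vectors)
print("geometric margin:", geometric_margin(w, b, points, labels))
```

Dividing by the norm of w converts the value in Equation (2) into an actual distance, so the result does not change if w and b are rescaled together.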
The following are the essential elements of a typical CNN's mathematical expression:

The mathematical representation of the convolution operation for a 2D input, such as an image, is as follows:
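As an illustration, and assuming the "valid" cross-correlation form that most deep learning libraries use for their convolution layers, S(i, j) = Σ_m Σ_n I(i + m, j + n) K(m, n), a 2D convolution can be sketched in plain Python as below. The function name, the 3×3 input, and the 2×2 kernel are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def conv2d_valid(image, kernel):
    """Valid-mode 2D cross-correlation.

    Computes S[i][j] = sum over m, n of image[i + m][j + n] * kernel[m][n],
    with no padding and a stride of 1 (both are assumptions of this sketch).
    """
    image_h, image_w = len(image), len(image[0])
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = image_h - kernel_h + 1
    out_w = image_w - kernel_w + 1

    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0.0
            # Slide the kernel over the window whose top-left corner is (i, j).
            for m in range(kernel_h):
                for n in range(kernel_w):
                    total += image[i + m][j + n] * kernel[m][n]
            row.append(total)
        output.append(row)
    return output


# Hypothetical 3x3 input and 2x2 kernel.
image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
kernel = [
    [1, 0],
    [0, -1],
]

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The same kernel weights are reused at every position of the input, which is why a convolutional layer needs far fewer weights than a fully connected layer over the same input.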
Fig. 4: Observing the Test and Validation loss over the Epochs

Figure 5 shows the accuracy over all epochs: the validation accuracy reaches 100% and the training accuracy is nearly 97%. In this graphical representation, the proposed model therefore fulfils expectations. From this we conclude that the proposed model performs well and is one of the better solutions for optimization and classification of gene expression microarray data.

VI. RESULTS ANALYSIS AND DISCUSSION

With the proposed model, the classification accuracy on the original feature sets is very impressive. TensorFlow, NumPy, and scikit-learn are used in our program. To obtain these results, each dataset was divided into training and testing parts in an 80:20 ratio. Sensitivity, Specificity, and Accuracy are used for performance evaluation.

A. Sensitivity
Sensitivity measures the proportion of actual positive cases that the model correctly identifies as positive. It is calculated as:

Sensitivity = (True Positives) / (True Positives + False Negatives)

Sensitivity is important when you want to minimize false negatives, as in medical diagnoses, where missing a true positive (e.g., a disease) could be critical.

B. Specificity
Specificity measures the proportion of actual negative cases that the model correctly identifies as negative. It is calculated as:

Specificity = (True Negatives) / (True Negatives + False Positives)

Specificity is important when you want to minimize false positives.

C. Accuracy
Accuracy is a measure of overall correctness. It calculates the proportion of all cases (both positive and negative) that the model classifies correctly. It is calculated as:

Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

Accuracy provides a broad view of a model's performance but may not be the best metric when class imbalances exist or when certain types of errors are more costly than others.

D. Results from Selected Datasets
The following results describe the experimental findings for each microarray dataset. After optimizing the gene expression microarray data by selecting dropouts and applying the Support Vector Machine with Convolutional Neural Network, our model is trained and analysed using three measures: Sensitivity, Specificity, and Accuracy. The collected results show a classification accuracy of around 99% or above on average. The performance of the proposed model is therefore well suited to predicting and identifying disease from gene expression microarray data. The performance assessments for the complete microarray datasets are given in Table 2.
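The three measures defined in subsections A to C can be computed directly from the confusion-matrix counts. The sketch below is one minimal way to do this in plain Python; the function names and the ten example labels are hypothetical and are not taken from the datasets or results reported in this paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, true negatives, false positives, false negatives."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def sensitivity(tp, fn):
    return safe_ratio(tp, tp + fn)


def specificity(tn, fp):
    return safe_ratio(tn, tn + fp)


def accuracy(tp, tn, fp, fn):
    return safe_ratio(tp + tn, tp + tn + fp + fn)


# Hypothetical labels: 1 = cancerous sample, 0 = normal sample.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("TP, TN, FP, FN:", tp, tn, fp, fn)
print("Sensitivity:", sensitivity(tp, fn))
print("Specificity:", specificity(tn, fp))
print("Accuracy:", accuracy(tp, tn, fp, fn))
```

Reporting all three values together is useful on small microarray test sets, where a handful of misclassified samples in the minority class can leave accuracy high while sensitivity or specificity drops noticeably.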