
ACKNOWLEDGEMENT

It gives me immense pleasure to recognize the efforts of a number of people in framing this
work. First and foremost, I would like to express my heartfelt gratitude to my dissertation
supervisor, Mr. Jagdish Bhatta, Assistant Professor at Central Department of Computer
Science and Information Technology (CDCSIT), for giving me an opportunity to work under
his supervision and for providing me constant guidance and support throughout this work.

I owe a special debt of gratitude to the Head of CDCSIT, Assistant Professor Mr. Nawaraj Paudel, for his unwavering support during this research. I would also like to express my sincere thanks to all the teachers in the CDCSIT department: Mr. Tej Bahadur Shahi, Prof. Dr. Subarna Shakya, Mr. Ram Krishna Dahal, Mr. Bikash Balami, Mr. Bal Krishna Subedi, Mrs. Lalita Sthapit, Mr. Dhiraj Kedar Pandey, Dr. Bhogendra Mishra, and Dr. Dibakar Raj Pant, as well as the CDCSIT staff members, for providing me with broad insights and inspiration over a two-year span. I also wish to thank my colleagues for their extraordinary contributions to this project.

I am always grateful to my parents, Mr. Badri Bahadur Karki and Mrs. Narbada Devi Karki
for their unconditional love, tremendous support and constant encouragement. I would also
like to express my gratitude to my brother, Dr. Sanjib Karki, and sister, Dr. Sunita Khanal for
their unwavering support and encouragement.

Lastly, I would like to express my sincere thanks to almighty God for his immense blessings.

ABSTRACT
Breast cancer is one of the few cancers for which effective detection is important for mitigating its effects on a patient's health. Supervised machine learning models that analyze data for classification and regression have been used in different clinical settings for diagnostic and prediction purposes. This thesis analyzed and compared the performance of supervised classification models over a breast cancer dataset. The algorithms used in this study are Support Vector Machine (SVM) and Random Forest (RF). The breast cancer dataset, obtained from a publicly available repository, was preprocessed by dividing the data into attributes and labels. The data were then split into training and test sets of 80% and 20%, respectively. The SVM and RF models were built and their performance was analyzed using confusion matrices on the provided dataset. Based on the performance analysis, the RF classifier obtained the highest accuracy of 97%, closely followed by the SVM classifier with 93% accuracy. Similarly, precision for RF was 96%, while SVM had 89% precision. Both RF and SVM showed the same sensitivity (recall) of 100% and prevalence of 58%. However, specificity for RF was 94%, which is 11 percentage points higher than that of SVM. Therefore, from the performance analysis of the SVM and RF algorithms over the breast cancer dataset, this study suggests that Random Forest (RF) can be used for the classification of cancer datasets with slightly higher accuracy and precision. In future, this work can be extended to predict the progression of diseases by classifying them with high accuracy.

Keywords: Support Vector Machines, Random Forest, Train-Test Split, Confusion Matrix

TABLE OF CONTENTS

ACKNOWLEDGEMENT ............................................................................................................... i
ABSTRACT .................................................................................................................................... ii
List of figures .................................................................................................................................. v
List of tables ................................................................................................................................... vi
List of abbreviations ..................................................................................................................... vii
CHAPTER 1 ................................................................................................................................... 1
INTRODUCTION ....................................................................................................................... 1
1.1 Introduction ........................................................................................................................... 1
1.2 Problem Statement ................................................................................................................ 2
1.3 Objectives .............................................................................................................................. 3
1.4 Report Organization .............................................................................................................. 3
CHAPTER 2 ................................................................................................................................... 4
BACKGROUND STUDY AND LITERATURE REVIEW ...................................................... 4
2.1 BACKGROUND STUDY .................................................................................................... 4
2.1.1 Machine Learning ........................................................................................................... 4
2.1.2 Supervised Learning ....................................................................................................... 4
2.1.3 Unsupervised Learning ................................................................................................... 5
2.1.4 Reinforcement Learning ................................................................................................. 6
2.1.5 Semi-Supervised Learning ............................................................................................. 6
2.1.6 Multi-task Learning ........................................................................................................ 7
2.1.7 Transduction Learning .................................................................................................... 7
2.2 Classification and Regression ............................................................................................... 8
2.3 Support Vector Machine ....................................................................................................... 8
2.3.1 Support vectors ............................................................................................................... 9
2.3.2 Decision boundaries and hyperplane .............................................................................. 9
2.3.3 Optimal hyperplane for linearly separable datasets ...................................................... 10
2.4 Random Forest .................................................................................................................... 11
2.4.1 Bagging (Bootstrap Aggregation): ............................................................................... 12
2.4.2 Feature Randomness ..................................................................................................... 13
2.5 LITERATURE REVIEW .................................................................................................... 13

CHAPTER 3 ................................................................................................................................. 17
METHODOLOGY .................................................................................................................... 17
3.1 Methodology ....................................................................................................................... 17
3.2 Implementation.................................................................................................................... 18
3.2.1 Tools used in experiment .............................................................................................. 18
3.2.2 Python Packages ........................................................................................................... 18
3.2.3 Data Collection and Preprocessing ............................................................................... 18
3.2.4 Training and Test Data ................................................................................................. 19
3.3 Building the Model.............................................................................................................. 19
3.3.1 SVM Model .................................................................................................................. 19
3.3.2 Random Forest Model .................................................................................................. 21
3.4 Performance Measure .......................................................................................................... 24
3.5 Confusion matrix ................................................................................................................. 24
3.6 Data Normalization ............................................................................................................. 26
CHAPTER 4 ................................................................................................................................. 28
RESULTS AND ANALYSIS ................................................................................................... 28
4.1 Data visualization ................................................................................................................ 28
4.2 Performance results for SVM algorithm ............................................................................. 29
4.3 Performance results for Random Forest Algorithm ............................................................ 32
4.4 Performance analysis of SVM and Random Forest ............................................................ 34
CHAPTER 5 ................................................................................................................................. 36
CONCLUSION AND FUTURE RECOMMENDATIONS ..................................................... 36
5.1 Conclusion........................................................................................................................... 36
5.2 Future Recommendation ..................................................................................................... 36
References ..................................................................................................................................... 37

List of figures

Figure 1. Supervised Learning ........................................................................................................ 4


Figure 2. Unsupervised Learning .................................................................................................... 5
Figure 3. Reinforcement Learning .................................................................................................. 6
Figure 4. Semi Supervised Learning ............................................................................................... 6
Figure 5. Multi-task Learning ......................................................................................... 7
Figure 6. Transduction Learning..................................................................................................... 7
Figure 7. Principle of SVM: (a) many hyperplanes for linearly separable data (b) finding the
optimal hyperplane with maximal margin ...................................................................................... 9
Figure 8. A hyperplane in 2-D and 3-D space. ............................................................................. 10
Figure 9. Support Vectors and optimal hyperplane for maximum marginal classification of linearly
separable datasets .......................................................................................................................... 10
Figure 10. Random Forest Structure ............................................................................................. 12
Figure 11. Node splitting in a random forest model is based on a random subset of features for
each tree ........................................................................................................................................ 13
Figure 12. Schematic representation of the methodology for performance analysis of SVM and RF
....................................................................................................................................................... 17
Figure 13. Working mechanism of Random Forest Classifier .................................................... 22
Figure 14. Pair plot for minimum five features of Breast Cancer Dataset ................................... 28
Figure 15. Heat map for features of Breast Cancer Dataset ......................................................... 29
Figure 16. Performance measure over Breast Cancer Dataset ...................................................... 34

List of tables
Table 1. Confusion Matrix ............................................................................................................ 24
Table 2. Confusion matrix over Breast cancer test data for SVM ............................................... 30
Table 3. Performance analysis for SVM ....................................................................................... 30
Table 4. Confusion matrix over Breast cancer test data for SVM after Data normalization ....... 31
Table 5. Performance analysis for SVM after data normalization................................................ 31
Table 6. Confusion matrix over Breast cancer test data for RF .................................................... 32
Table 7. Performance analysis for Random forest ........................................................................ 33
Table 8. Confusion matrix over Breast cancer test data for RF after Data normalization ........... 33
Table 9. Performance analysis for RF after data normalization ................................................... 34

List of abbreviations
Abbreviations Full Form
ANN Artificial Neural Network
CBIR Content Based Image Retrieval
COVID-19 Corona Virus Disease 2019
CPU Central Processing Unit
GLCOM Gray Level Co-Occurrence Matrix
ML Machine Learning
MSE Mean Square Error
RBF Radial Basis Function
RF Random Forest
RMSE Root Mean Square Error
SVM Support Vector Machine
SVR Support Vector Regression

CHAPTER 1
INTRODUCTION
1.1 Introduction
Artificial intelligence techniques such as machine learning have been applied to the diagnosis of diseases with high classification accuracy. In machine learning, classification is the task of assigning a class to instances of data identified by a set of attributes. Classification, or supervised learning, is the process of creating a classifier that is trained on a set of training data to construct a model of the distribution of class labels. The classifier is then applied to new data for which the feature values are known but the class is unknown. For supervised classification, several algorithms have been developed, including decision trees, Support Vector Machines, Artificial Neural Networks, perceptron-based and statistical learning techniques, logistic regression, K-Nearest Neighbor, and Random Forest. Support Vector Machine (SVM) and Random Forest (RF) are among the most powerful and accurate supervised machine learning algorithms.

Support Vector Machine (SVM), a supervised learning classifier, was first introduced by Vladimir Vapnik and his colleagues in the 1990s [1]. SVM is derived from statistical learning theory, or VC theory, which was developed by Vapnik and Chervonenkis [2]. Since its success in handwritten digit recognition, the SVM algorithm has become popular for various applications. An SVM locates a separating hyperplane in the feature space and classifies points in that space by defining a kernel function. In general, SVM approaches have been used in computer-aided diagnosis and prognosis of many diseases. For example, SVMs can build classifiers for diseases from previous data and use them to diagnose a new patient [3], [4], [5]. Early-stage detection of cancer or diagnosis of heart disease could significantly reduce cancer- and stroke-related death rates. Similarly, early diagnosis of COVID-19 would help minimize the spread of the virus.

Random Forest is a powerful supervised machine learning algorithm that can be applied to a wide range of problems, including regression and classification. The RF method was first introduced by Breiman in 2001 [6] as an ensemble approach. An RF model comprises many decision trees, called estimators, each of which makes its own prediction, and the random forest model combines the estimators' predictions to generate a more accurate result [6]. RF has been used for the classification of liver disorders [7]. It has also been used for the prediction of diabetes with feature selection methods [8]. In the same way, RF has been used for the classification of cancer over three different cancer datasets [9].

Breast cancer is the most common form of cancer among women. Early diagnosis of the cancer has always been a challenge in both developing and developed countries. To fill that gap, classification models have been used for diagnosis based on patients' historical medical records and symptoms. Therefore, determining the most effective classification approach has always been an important research question in clinical problems, since misclassification in medical datasets leads to poor prediction of a patient's health condition. However, the performance of each algorithm depends on various model parameters. In this context, this study implements the SVM and RF algorithms over a breast cancer dataset and analyzes them based on their performance indices. Performance analysis is the technique of observing or measuring the performance of a particular situation. Different performance evaluation metrics such as precision, recall, F1-score, specificity, and accuracy, derived from the confusion matrix, are used to compare the performance of SVM and RF.

1.2 Problem Statement


Given the problem of categorizing a set of data into various classes, Support Vector Machine (SVM) has always been a choice of researchers because of its good accuracy, direct geometric interpretation, and ability to avoid overfitting without requiring large amounts of data. In developing a classification model, SVM plays a significant role because its kernels map the dataset to a higher-dimensional space, where a better separation between the classes can be obtained. Random Forest also plays an important role in classification due to its robustness, its ability to handle heterogeneous data types, and its small number of hyperparameters. Therefore, it is necessary to analyze the performance of these algorithms for effective classification of diseases, which is the problem addressed in this study.

1.3 Objectives
The objectives of this study are as follows:

 To implement Support Vector Machine and Random Forest for classifying breast cancer
datasets
 To compare the performance of above-mentioned algorithms based on accuracy and
precision parameters, obtained from confusion matrices.

1.4 Report Organization


This thesis is outlined in five chapters. They are as follows:

Chapter 1 consists of introduction, problem statement and objectives.


Chapter 2 describes the background study for the research and the literature review relevant
to the study.
Chapter 3 describes the overview of the methodology and implementation of SVM and RF
algorithms.
Chapter 4 presents the results and analysis of the performance of SVM and RF over the Breast Cancer dataset.
Chapter 5 concludes the thesis by summarizing the findings and future recommendations.

CHAPTER 2
BACKGROUND STUDY AND LITERATURE REVIEW
2.1 BACKGROUND STUDY
2.1.1 Machine Learning
According to Marvin Minsky (1986), “Learning” is defined as “making useful improvement
in the functioning of our mind”. Machine learning (ML) is a part of Artificial Intelligence (AI)
seeking to make a computer capable of learning in the same way that humans do. Thus, machine learning is the science of getting computers to act without being explicitly programmed. ML deals with algorithms that train a system to learn from given data in order to predict an output [10]. ML is applied in many areas of science and technology such as
robotics and autonomous vehicle control [11]; speech and natural language processing [12];
successful web search; neuroscience research [13], [14]; understanding human genetics and
genomics [15]; cancer genomics [16]; image processing and computer vision [17]. Some of
the examples of ML are supervised learning, unsupervised learning, semi-supervised or
minimally supervised learning, and reinforcement learning. The most widely used ML
algorithm is the supervised learning algorithm [18].

2.1.2 Supervised Learning

Figure 1. Supervised Learning [19]

Supervised learning is a set of techniques that allows future predictions based on behaviors or
characteristics analyzed in historical data (Figure 1). Algorithms are trained using a training set, i.e., a dataset containing inputs and their corresponding outputs. Based on this training, a pattern is developed, and this pattern is used to classify new data items [20], [21]. Classification and regression are two categories of supervised learning. For
instance, supervised learning has been used for classification problems such as classification
of fruits, whether a person has diabetes or not, whether a person is male or female. It is also
used for regression problems such as prediction of house pricing based on size or forecasting
weather and much more.

2.1.3 Unsupervised Learning

Figure 2. Unsupervised Learning [19]


Unsupervised learning is the technique where interpretation is made based on input data only,
without known output [15]. Unsupervised learning finds the structure or pattern of the given data and groups the data based on structural properties such as size, dimensions, and probabilities (Figure 2). Unsupervised learning is useful when the task is to find connections in a given unlabeled (not preassigned) dataset [11]. Clustering and association (non-clustering) are two
types of unsupervised learning. Clustering segregates the data based on their similarities while
association determines the connections amongst the data.
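As a small illustration of clustering (an example added here for concreteness; it is not part of the thesis experiments), k-means groups unlabeled points purely by their similarity:

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment found for each point
print(kmeans.cluster_centers_)  # centroids discovered without any labels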

2.1.4 Reinforcement Learning

Figure 3. Reinforcement Learning [19]


Reinforcement learning (RL) is a learning approach in which an agent learns from interaction with its environment [22]. RL learns about a situation by judging the success or failure of the observed behaviors [23]. It amounts to getting an agent to act in the environment so as to maximize its rewards (Figure 3); teaching a dog new tricks using treats or rewards is a common analogy.

2.1.5 Semi-Supervised Learning

Figure 4. Semi Supervised Learning [24]


Semi-supervised learning is a form of machine learning that trains models using a combination of labeled data and large amounts of unlabeled data. This method combines the supervised and unsupervised learning paradigms: supervised learning uses labeled training data, while unsupervised learning uses unlabeled data [24]. A text document classifier is a common example of semi-supervised learning.

2.1.6 Multi-task Learning

Figure 5. Multi-task Learning [25]


Multi-task learning (MTL) is a branch of machine learning in which multiple tasks are
simultaneously learned by a shared model (Figure 5) [25]. Its advantages include improved data efficiency, reduced overfitting through shared representations, and faster learning by leveraging auxiliary information.

2.1.7 Transduction Learning

Figure 6. Transduction Learning [26]

Transduction or transductive learning is used in the field of statistical learning theory to refer
to predicting specific examples given specific examples from a domain (Figure 6). It is
contrasted with other types of learning, such as inductive learning and deductive learning.
Induction is the method of deriving a general function from given data. Deduction is the method of calculating the values of a given function at points of interest. Transduction is the method of using the given data directly to estimate the values of an unknown function at specific points of interest [27], [28].

2.2 Classification and Regression


Classification and regression are two categories of supervised machine learning system.
Classification is the task of assigning a class to instances of data represented by a collection of
attributes in Machine Learning. The creation of a classifier that is trained on a collection of
training data to create a model of the distribution of class labels is referred to as classification
or supervised learning. In other words, classification predicts discrete values. Many
artificial intelligence-based algorithms, such as decision trees, Support Vector Machines,
Artificial Neural Networks, perceptron-based techniques, and statistical learning techniques,
have been developed for supervised classification. Support Vector Machine (SVM) and Random Forest are among the most efficient and accurate machine learning methods for classification [21].

The aim of regression is to predict continuous values. Regression is carried out by modeling the relationship between input variables and output variables, i.e., the effect of a change in the input on the output. The prediction of brain activity is a good example of regression in Machine Learning.

2.3 Support Vector Machine


Vapnik et al. introduced the SVM principle, which is based on statistical learning theory [2]. SVM is a supervised machine learning technique that can be used for classification as well as regression, although it is mostly preferred for classification. When SVM is used for regression, the technique is called Support Vector Regression (SVR). Some of the advantages of SVM are as follows:
 SVM is robust to outliers
 It is effective in high dimensional cases.
 It is memory efficient as it uses a subset of training points in the decision function called
support vectors.
 Different kernel functions can be specified for the decision functions and it is possible to
specify custom kernels.

The basic principle of SVM is to map the training data into a multidimensional feature space and create an optimal hyperplane with the maximum margin (Figure 7). For example, to classify k-dimensional data, SVM considers (k-1)-dimensional hyperplanes (Figure 7a) and then locates the optimal hyperplane that maximizes the distance from the members of each class to it (Figure 7b).

Figure 7. Principle of SVM: (a) many hyperplanes for linearly separable data (b) finding the
optimal hyperplane with maximal margin [29]
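As noted in the advantages above, scikit-learn's SVC lets different kernel functions be specified for the decision function. The snippet below is an illustrative sketch (it is not code from this thesis) that compares three built-in kernels on the same dataset used later in this work:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=20)

# Each kernel implicitly maps the data into a different feature space
for kernel in ('linear', 'poly', 'rbf'):
    clf = SVC(kernel=kernel).fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))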

2.3.1 Support vectors


Support vectors are the data points on the margin of both classes (Figure 7). The support vectors help the classifier to constrain the width of the margin [29]. Since support vectors are the points closest to the hyperplane, they influence the hyperplane's position and orientation; the hyperplane's location would change if the support vectors were deleted.

2.3.2 Decision boundaries and hyperplane


The decision boundaries, shown by the dotted lines in figure 7, are the separation lines for the classes in n-dimensional space. The distance between the decision boundaries is called the margin. The hyperplane is the best decision boundary, i.e., the one that segregates the classes with the maximum margin between them. A maximum margin, or separation gap, between the two classes is important for the SVM model to efficiently estimate the class of new inputs [30]. If there are only two input features (a 2-D space), the hyperplane is just a line; with three input features, it becomes a two-dimensional plane (Figure 8). Beyond three features, the hyperplane cannot be visualized directly.

Figure 8. A hyperplane in 2-D and 3-D space [31].

2.3.3 Optimal hyperplane for linearly separable datasets

Figure 9. Support Vectors and optimal hyperplane for maximum marginal classification of
linearly separable datasets [31]

Assume a linearly separable training dataset containing k samples is represented by {xi, yi}, i = 1, ..., k, where xi ∈ R^N is an N-dimensional feature vector and yi ∈ {-1, +1} is the class label [29], [32]–[34]. Then the optimal hyperplane (Figure 9) is given by equation (1):

w · x + b = 0 .... eq (1)

where x is a point vector; w is the weight vector, i.e., a vector perpendicular to the separating hyperplane; b is the bias, with |b|/||w|| being the shortest distance of the hyperplane from the origin; and (w · x) is the scalar product of the vectors w and x. The two margin hyperplanes separating the positive from the negative training samples are

w · xi + b ≥ +1 for yi = +1 (lies on or above the positive margin plane)

w · xi + b ≤ -1 for yi = -1 (lies on or below the negative margin plane)

The distance from the separating hyperplane to either margin plane is 1/||w||, so the geometric margin between the two planes is 2/||w||; maximizing this margin is equivalent to minimizing ||w||²/2 [35]. This leads to the following constrained optimization problem:

Minimize (1/2) ||w||²

subject to the constraints: yi (w · xi + b) ≥ 1, i = 1, ..., k

The constraints in this formulation ensure that the maximum-margin classifier classifies each data point of a linearly separable dataset correctly.
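To make these definitions concrete, the following minimal sketch (illustrative only; it is not part of the original experiments) fits a linear SVC to a tiny linearly separable 2-D dataset and inspects the learned hyperplane and support vectors:

import numpy as np
from sklearn.svm import SVC

# Tiny linearly separable 2-D dataset: two clusters labeled -1 and +1
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.5]])
y = np.array([-1, -1, -1, +1, +1, +1])

clf = SVC(kernel='linear', C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)              # w and b of the hyperplane w·x + b = 0
print(clf.support_vectors_)                   # the points lying on the margin
print(clf.predict([[1.0, 2.0], [5.0, 5.0]]))  # new points classified by sign(w·x + b)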

2.4 Random Forest


Random forest, developed by Leo Breiman [6], is an extension of decision tree learning.
Random forest is a supervised learning algorithm which uses an ensemble learning method for classification and regression. It generates multiple decision trees, and the decision is made either by averaging the trees' outputs for regression or by voting for classification [36], [37]. The random forest classifier is a collection of prediction trees, where each tree depends on a random vector sampled independently and with the same distribution for every tree in the forest (Figure 10).

Figure 10. Random Forest Structure [38]
Some of the advantages of the random forest classifier include:

 For many datasets, it is a highly accurate classifier.


 It runs efficiently on large datasets and can handle thousands of input variables.
 It gives estimates of what variables are important in the classification.
 It has an effective method for estimating missing data and maintains accuracy even when
a large proportion of data are missing.

Random Forest uses the following two methods to ensure that the behavior of each individual tree is not too strongly correlated with the behavior of any of the other trees in the model [39].

2.4.1 Bagging (Bootstrap Aggregation): Decision trees are very sensitive to the data they are trained on; small changes to the training set can result in significantly different tree structures. Random forest takes advantage of this by allowing each individual tree to randomly sample from the dataset with replacement, resulting in different trees. This process is known as bagging.
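A minimal sketch of bootstrap sampling (shown here for illustration with NumPy; it is not code from this thesis):

import numpy as np

rng = np.random.default_rng(0)
n_samples = 10
data_indices = np.arange(n_samples)

# Each tree gets its own bootstrap sample: n_samples indices drawn WITH replacement,
# so some rows appear more than once while others are left out entirely.
for tree in range(3):
    bootstrap = rng.choice(data_indices, size=n_samples, replace=True)
    print(f"tree {tree}: {sorted(bootstrap)}")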
2.4.2 Feature Randomness: In a normal decision tree, every possible feature is considered when it is time to split a node, and the one that produces the greatest separation between the observations in the left node and those in the right node is picked. In contrast, each tree in a random forest can pick only from a random subset of features at each split. This forces even more variation amongst the trees in the model and ultimately results in lower correlation across trees and more diversification.

Figure 11. Node splitting in a random forest model is based on a random subset of features for
each tree
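In scikit-learn's implementation (used later in this thesis), both mechanisms are exposed as constructor parameters; the values below are illustrative and are not the settings used in the experiments:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=100,     # number of trees (estimators) in the forest
    bootstrap=True,       # bagging: each tree is trained on a bootstrap sample of the data
    max_features='sqrt',  # feature randomness: each split considers only a random subset of features
)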

2.5 LITERATURE REVIEW


By combining decision trees and relative support distance, Ryu M. and Lee K. [40] proposed
a novel data reduction approach for reducing training time. In each partition created by the
decision trees, they have used a new concept called relative support distance to select good
support vector candidates. For large-scale SVM problems, the chosen support vector
candidates improved training speed. In contrast to current methods, they demonstrated that
their approach significantly reduced training time while retaining good classification efficiency
in experiments.

Chidambaram S. and Shrinivasagan K.G. [41] evaluated the performance of SVM classification approaches in data mining, broken down into three phases. To
classify data objects, support vector classifiers were implemented in the first step using four
different kernel methods: linear function, polynomial function, RBF, and sigmoid functions.
In the second stage, classifier subset evaluation was used to select features, and SVM
classification was used to optimize feature vectors, resulting in the highest level of accuracy.
In the third stage, a new kernel approach (Poly-Gaussian kernel method) was introduced, which
produced the highest classification accuracy as compared to the other four kernel approaches.
According to the results of the experiments, SVM with the proposed kernel approach achieved
the highest accuracy compared to other kernel approaches.

Parbat D. and Chakraborty M. [42] developed a Support Vector Regression model with a
Radial Basis Function as the kernel and a 10% confidence interval for curve fitting to predict
values. The data was divided into training and test sets of 60% and 40%, respectively. The model performance parameters were measured as Mean Square Error, Root
Mean Square Error, regression score, and percentage accuracy. The model predicted deaths,
recovered cases, and the total number of reported cases with an accuracy of over 97 %, and
daily new cases with an accuracy of 87 %. The findings pointed to a Gaussian decrease in the
number of cases, which might take another 3 to 4 months to reach the bare minimum of no
new cases registered. The approach was more accurate and effective than linear or polynomial
regression.

Huang S. et al. [16] reviewed Support Vector Machine (SVM) Learning Applications in Cancer
Genomics where they looked at how SVMs have progressed in cancer genomic studies
recently. They wanted to know how powerful SVM learning is and what the future holds for
cancer genomic applications.

Kapadia M. R. and Paunwala C. N. [43] used color moments because of their small feature
vector, which reduced computational complexity. The Gray Level Co-Occurrence Matrix is a texture feature that is used to extract repeating patterns from an image. As a classifier, the Support Vector Machine (SVM) eliminated irrelevant images and thus increased retrieval
accuracy. The SVM classifiers, both linear and non-linear, were used to predict the query
images’ category and filter out irrelevant images. Three separate kernels were used in non-
linear SVM classifiers: Polynomial, Radial Basis Function (RBF), and Sigmoidal function.
Due to its exponential kernel, RBF is shown to work well and thus solve the problem in infinite
dimensions. The average precision rate was used to compare the results of Content Based
Image Retrieval (CBIR) for linear and nonlinear SVM classifiers, as well as various fusion
techniques in different color spaces.

Okutan A. and Yildiz O.T. [44] suggested using novel kernels for defect prediction based on
plagiarized source code, software clones, and textual similarity. To model the relationship
between source code similarity and defectness, a precomputed kernel matrix was generated
and compared their output on different data sets. Each value in a kernel matrix indicated the
degree of parallelism between the corresponding files of a software system. In terms of F-
measure, the experiments on ten real-world datasets showed that support vector machines
(SVM) with a precomputed kernel matrix outperformed SVM with a linear kernel. According
to the findings of this preliminary analysis, source code similarity can be used to predict defect
proneness.

The aim of the research proposed by Singh V. et al. [45] was to predict confirmed, deceased, and recovered Corona Virus Disease 2019 (COVID-19) cases. Using SVM, the expected COVID-19 cases were forecasted based on the scores of the attributes. An RBF kernel and the C parameter were used to find the optimal hyperplane, which aided in comparing
hyperplane parameters in order to examine support vectors more thoroughly. At the same time,
they used bar charts to conduct statistical analysis to separate classes of subjects. Using the
kernel function, SVM generated optimized output values to forecast the expected COVID-19
cases.

Afentoulis V.A. et al. [46] performed SVM classification with linear and RBF kernels. They
conducted a series of experiments on standard benchmark datasets to demonstrate the validity
and accuracy of their algorithms' classification. The data was split into training and test sets in
this series of experiments. The differences between the two configurations were as follows: the linear kernel used four values of the C parameter so that the output could be verified, whereas for the Radial Basis Function kernel the C parameter was held constant at 10^2 while the gamma parameter varied between 10^2 and 8*10^-2. As far as the two final graphical representations show, the results from the two kernels are almost identical; in particular, both kernels reach nearly the same accuracy of about 93 percent.

Savas C. and Dovis F. [47] compared the output of different SVM kernel methods and checked
and evaluated the linear, Gaussian, and polynomial kernel SVM algorithms for phase and
amplitude scintillation detection. The ROC curves, confusion matrix results, and performance metrics associated with the confusion matrix were used for the performance comparison.
The efficiency of the RBF kernel SVM method outperformed the linear kernel SVM method
in terms of overall accuracy when the kernel scale parameter of the Gaussian RBF kernel SVM
algorithm was optimized. Furthermore, although the third-order polynomial kernel SVM outperformed the linear kernel, it did so at the expense of increased time and space complexity.

Haque R. et al. [7] evaluated the performance of Random Forests and Artificial Neural Networks for liver disorder classification. Both methods were implemented in Python and were found to be effective in classifying the diagnostic data set into two groups based on the severity of the disease. In their experiments, they achieved accuracies of 80% and 85.29% with RFs and ANNs, respectively.

Krishna G. et al. [9] performed analysis and evaluation of different data mining algorithms used
for cancer classification in three different cancer datasets. When applied to all three data sets,
the findings show that none of the classifiers outperformed the others in terms of accuracy. As
the scale of the data set grew larger, most of the algorithms performed better.

Raghavendra S. and Santosh Kumar J. [8] evaluated the performance of random forest with feature selection methods for the prediction of diabetes. They evaluated the PIMA Indian Diabetes dataset from the UCI repository using the Random Forest algorithm together with feature selection methods such as forward selection and backward elimination based on an entropy evaluation method, using a percentage split as the test option. They achieved a classification accuracy of 84.1 percent.

CHAPTER 3
METHODOLOGY
3.1 Methodology
The performance analysis of the SVM and Random Forest proposed in this study was
implemented following the steps as shown in figure 12. Data collection, preprocessing, model
building and accuracy measurement were all part of the basic methodology, which is detailed
below.

Figure 12. Schematic representation of the methodology for performance analysis of SVM and
RF

3.2 Implementation
3.2.1 Tools used in experiment
This experiment was performed on a personal computer with an Intel® Core™ i7-5500U CPU @ 2.40 GHz, 4 GB of RAM, and a 64-bit operating system. The algorithms were implemented in the Python programming language, using Jupyter Notebook as the software environment.

3.2.2 Python Packages


For machine learning operations, scikit-learn was used. Supporting libraries such as the Python data analysis library (pandas), numerical Python (NumPy), and scientific Python (SciPy) were loaded. Matplotlib and seaborn (imported as sns) were used for making statistical graphics in Python. These packages helped to organize and visualize the data.

 The code for loading the packages was as below:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
3.2.3 Data Collection and Preprocessing
The Breast Cancer Data Set was obtained from the scikit-learn repository [48]. The dataset consists of 569 instances with 30 feature columns such as radius, perimeter, area, smoothness, texture, compactness, etc., together with a target column. Next, the data were converted into a pandas data frame, which is a table of rows and columns. Data preprocessing was performed by dividing the data into attributes and labels.

 The code for collecting the data from scikit learn was as follows:

from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()

 The code for converting the data into pandas data frame was as follows:

df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                         columns = np.append(cancer['feature_names'], ['target']))

 The code for visualizing the data (basically five features along with target feature) in
the form of pairplot was as follows where seaborn has been used as sns:

sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture',
             'mean perimeter', 'mean area', 'mean smoothness'])

 The code for visualizing the data (basically all features) in the form of heatmap was as
follows:
plt.figure(figsize=(20,12))
sns.heatmap(df_cancer.corr(), annot=True)
 The code for dividing the data into attributes (X) and labels (y) is as follows:
X = df_cancer.drop(['target'], axis = 1)
y = df_cancer['target']

3.2.4 Training and Test Data


After preprocessing, the breast cancer data was separated into two parts: 80 percent training data and 20 percent test data. The machine learning models used the training data to extract knowledge and data patterns; the extracted knowledge was then tested against the test data.

 The code for splitting the data into training and test sets was as follows:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 20)

In the breast cancer dataset, the training "X" input features had 455 rows and 30 columns, and the test "X" input features had 114 rows and 30 columns. The training "y" output had 455 rows, whereas the test "y" output had 114 rows.
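These sizes can be confirmed with a quick check (an illustrative snippet, not part of the original thesis code):

print(X_train.shape, X_test.shape)   # expected: (455, 30) (114, 30)
print(y_train.shape, y_test.shape)   # expected: (455,) (114,)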

3.3 Building the Model


3.3.1 SVM Model
After the train-test split, a Support Vector Machine model was built on the training data using a Support Vector Classifier. When a soft margin is used to determine the location of the decision threshold, the resulting classifier is called a soft margin classifier, also known as a Support Vector Classifier, and it is used to classify the observations.

A support vector classifier has been imported from scikit learn library.

 The code for importing the support vector classifier from scikit learn library was as
follows:
from sklearn.svm import SVC

The model was built using the Support Vector Classifier. The Support Vector Classifier, or soft margin classifier, was used because it allows some observations inside the margin, but not too many, making the margin wider at the same time. This makes the model more robust against outliers, and the trade-off is controlled by the parameter C. Here C was set to 0.1, which results in a wider margin with some observations inside it. After building the model, it was fitted to the training data.

 The code for creating the model and fitting it to the training data was as follows:
svc_model = SVC(C = .1, kernel = "linear", gamma = 1)
svc_model.fit(X_train, y_train)

Next, prediction has been done using test data.

 The code for prediction made on the test data was as follows:
y_predict = svc_model.predict(X_test)

After performing the prediction, the classification report and confusion matrix functions were imported from the scikit-learn library. A confusion matrix was constructed from the predicted data and the test data, along with the labels, whose values are 0 and 1 for the benign and malignant classes, respectively.

 The code for importing the classification report and confusion matrix from the scikit-learn library and constructing a confusion matrix from the predicted data and test data was as follows:
from sklearn.metrics import classification_report, confusion_matrix
cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_healthy'],
columns=['predicted_cancer', 'predicted_healthy'])
confusion

To obtain the training and test accuracy, a model was created using the Support Vector Classifier, fitted to the training data, and used to make predictions on the test data. Thereafter, the model's accuracy on the training and test data was checked using the model.score method.

 The code for creating the model, fitting it to the training data, making predictions on the test data, and printing the classification report was as follows:
print(classification_report(y_test, y_predict))
svc_model = SVC(C = .1, kernel = "linear", gamma=1)
svc_model.fit(X_train, y_train)
prediction = svc_model.predict(X_test)
 The code for checking the model's accuracy on the training data and test data was as follows:
print(svc_model.score(X_train, y_train))
print(svc_model.score(X_test, y_test))

3.3.2 Random Forest Model


The working of the Random Forest algorithm is as follows:

Step 1: First, random samples are selected from the given dataset.
Step 2: Next, the algorithm constructs a decision tree for every sample and obtains a prediction result from every decision tree.
Step 3: Voting is performed over the predicted results.
Step 4: Finally, the most voted prediction is selected as the final prediction result.

Figure 13. Working mechanism of Random Forest Classifier
To build the Random Forest model, the RF classifier was obtained from scikit-learn's ensemble module. Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

 The code for building the model for Random Forest classifier was as follows:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=20)

Thereafter, the model has been fitted to training data. The prediction has been made on the test
data by creating a variable named y_predict.

 The code for fitting the model to training data and prediction on the test data was as
follows:
model.fit(X_train, y_train)
y_predict = model.predict(X_test)

After performing the prediction, the classification report and confusion matrix functions from the scikit-learn library were used. A confusion matrix was constructed from the predicted data and the test data, along with the labels, whose values are 0 and 1 for the benign and malignant classes, respectively.

 The code for importing the classification report and confusion matrix from the scikit-learn library and constructing a confusion matrix from the predicted data and test data was as follows:
from sklearn.metrics import classification_report, confusion_matrix
cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_healthy'],
columns=['predicted_cancer', 'predicted_healthy'])
confusion

To obtain the training and test accuracy, the Random Forest model was fitted to the training data and predictions were made on the test data.

 The code for fitting the Random Forest model to the training data, making predictions on the test data, and printing the classification report was as follows:
print(classification_report(y_test, y_predict))
model.fit(X_train, y_train)
prediction = model.predict(X_test)
Thereafter, the model's accuracy was checked for the training and test data using the model.score method.

 The code for checking the model's accuracy on the training data and test data was as follows:
print(model.score(X_train, y_train))
print(model.score(X_test, y_test))

3.4 Performance Measure
The performance measures on the training and test data of the breast cancer dataset were analyzed. Performance was evaluated using sensitivity (recall rate), specificity, prevalence, precision, accuracy, and F-measure. These performance indices were calculated from the confusion matrix.

3.5 Confusion matrix


A confusion matrix is a table that shows how well a classification model (or "classifier")
performs on a collection of test data for which the true values are known. A confusion matrix
(or confusion table) shows a more detailed breakdown of correct and incorrect classifications
for each class [49][50]. The rows of the matrix (Table 1) represent ground truth labels, while
the columns correspond to the prediction. Each of these has two values, "Yes" and "No", which are the two possible predicted groups. When forecasting the presence of a disease, "yes" indicates that the patient has the disease and "no" indicates that they do not. If the doctor diagnoses a patient with cancer when the patient actually has it, the outcome is called a True Positive (TP), whereas diagnosing a patient with cancer when the patient does not have it is known as a False Positive (FP), and so on.

Table 1. Confusion Matrix

These outcomes from the confusion matrix are used to determine performance parameters such as accuracy score, precision, specificity, sensitivity (recall), prevalence, and F-score. The main formulae are defined below:
Accuracy measures how often the classifier makes the correct prediction. It is measured by the
ratio of the number of correct predictions to the total number of predictions.
Accuracy Score = (TP + TN) / (TP + TN + FP + FN)
Precision determines how often the predictions are correctly identified. Thus, the precision
rate is the proportion of the true positive to the predicted positives.
Precision = TP / (TP + FP)
Although the precision rate is a critical measure of performance for a diagnostic method, its value depends on the prevalence of the outcome. Prevalence determines how often the "yes" condition actually occurs in the sample; it is the proportion of actual positives to the total number of instances.
Prevalence = (TP + FN) / (TP + FP + FN + TN)
Sensitivity or recall rate and the specificity are the statistical measures of the performance in
classification test. Sensitivity measures the proportion of actual positives which are correctly
identified, while specificity measures the proportion of negatives which are correctly
identified.
Sensitivity or Recall rate = TP / (TP + FN)

Specificity = TN / (FP + TN)

F-Measure produces a single score that accounts for both precision and recall problems in a
single number. There are also a lot of situations where both precision and recall are equally
important. For example, for our model, if the doctor informs us that the patients who were
incorrectly classified as suffering from breast cancer are equally important since they could be
indicative of some other ailment, then we would aim for not only a high recall but a high
precision as well. In such cases, F1-score is used. F1-score is the Harmonic mean of the
Precision and Recall:
F1_Score = 2 * (P * R) / (P + R)
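The following minimal sketch (added for illustration; it is not part of the thesis code) computes all of these indices from the four confusion matrix counts:

def performance_indices(TP, FP, FN, TN):
    """Compute the performance indices defined above from confusion matrix counts."""
    total = TP + TN + FP + FN
    accuracy = (TP + TN) / total
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)                  # sensitivity
    specificity = TN / (FP + TN)
    prevalence = (TP + FN) / total
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, specificity, prevalence, f1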

3.6 Data Normalization


As Support Vector Machines are sensitive to the scale of the features, there is a need to normalize the
data before fitting the model. Feature scaling is a method used to normalize the range of
independent variables or features of data. In data processing, it is also known as data
normalization and is generally performed during data preprocessing step. Since the range of
values of raw data varies widely, in some Machine Learning Algorithms, objective functions
will not work properly without normalization. For instance, many classifiers calculate the
distance between two points by the Euclidean distance. If one of the features has a broad range
of values, the distance will be governed by this particular feature. Therefore, the range of all
features should be normalized so that each feature contributes approximately proportionately
to the final distance. In this dissertation min-max normalization has been used to normalize the
data whose formula is given below:

x' = (x - min(x)) / (max(x) - min(x)) .... eq (2)

For both the training data and the test data, the min() and max() methods were used together with the normalization formula given in equation (2) to scale the values of each feature into the range 0 to 1. Afterwards, the model was created and fitted to the scaled training data, predictions were made on the scaled test data, and a confusion matrix was constructed on the scaled datasets. Finally, the accuracy on the training and test data was computed.

 The following is the code for Data Normalization:


 For training data:
X_train_min = X_train.min()
X_train_max = X_train.max()
X_train_range = (X_train_max- X_train_min)
X_train_scaled = (X_train - X_train_min)/(X_train_range)
 For test data:
X_test_min = X_test.min()
X_test_range = (X_test - X_test_min).max()
X_test_scaled = (X_test - X_test_min)/X_test_range
 For creating a model:
svc_model = SVC()
svc_model.fit(X_train_scaled, y_train)
 Prediction with scaled datasets:
y_predict = svc_model.predict(X_test_scaled)
cm = confusion_matrix(y_test, y_predict)
 Confusion matrix on scaled datasets:
cm = np.array(confusion_matrix(y_test, y_predict, labels=[1,0]))
confusion = pd.DataFrame(cm, index=['is_cancer', 'is_healthy'],
columns=['predicted_cancer', 'predicted_healthy'])
confusion
 Accuracy score for training data and test data:
print(classification_report(y_test, y_predict))
print(svc_model.score(X_train_scaled, y_train))   # training accuracy on the scaled data
print(svc_model.score(X_test_scaled, y_test))     # test accuracy on the scaled data
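Equivalently, scikit-learn's MinMaxScaler performs the same min-max scaling; the sketch below is an alternative shown only for illustration (it is not the code used in this thesis), and it fits the scaler on the training data alone so that the test data is scaled with the training minimum and maximum:

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()                          # implements x' = (x - min) / (max - min)
X_train_scaled = scaler.fit_transform(X_train)   # min and max learned from the training data only
X_test_scaled = scaler.transform(X_test)         # the same training min/max applied to the test data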

CHAPTER 4
RESULTS AND ANALYSIS
4.1 Data visualization
The attributes in the Breast Cancer Dataset were visualized by two methods: a pair plot and a heatmap. The pair plot for five of the features ('mean radius', 'mean texture', 'mean perimeter', 'mean area', and 'mean smoothness') is shown in figure 14. The pair plot
shows the distributions and pairwise relationship between features. The pair plot function
creates a grid of Axes such that each variable in data will be shared in the y-axis across a single
row and in the x-axis across a single column. In figure 14, target 0.0 represented by blue color
indicates benign cells and target 1.0 represented by orange color indicates cancerous
(malignant) cells.

Figure 14. Pair plot for minimum five features of Breast Cancer Dataset

Furthermore, a heatmap (Figure 15) was constructed to observe the correlations among all thirty attributes of the Breast Cancer dataset. The heatmap presents the correlations in the form of a matrix, where each square denotes the correlation between two attributes and the strength of the correlation is indicated by the depth of color: a lighter color indicates a stronger correlation, while a darker color suggests a weaker correlation. The number in each square also indicates the strength of the correlation.

Figure 15. Heat map for features of Breast Cancer Dataset


The preprocessed data was split into training and test datasets of 80% (i.e., 455 instances) and 20% (i.e., 114 instances), respectively. The performance of the classification models, SVM and RF, was analyzed on the test dataset.

4.2 Performance results for SVM algorithm


The confusion matrix (Table 2) was created for the test dataset using the SVM algorithm.

Table 2. Confusion matrix over Breast cancer test data for SVM

                          Actual: Yes                   Actual: No
Predicted: Yes            True Positive (TP) = 66       False Positive (FP) = 8
Predicted: No             False Negative (FN) = 0       True Negative (TN) = 40

The true positive and true negative elements indicate instances that were classified correctly,
whereas the false positive and false negative elements indicate instances that were classified
incorrectly. In Table 2, out of 114 instances, 66 were True Positives, an outcome where the model
correctly predicted the positive class (i.e. has breast cancer), and 40 were True Negatives, an
outcome where the model correctly predicted the negative class (i.e. does not have breast cancer).
Similarly, 8 instances were False Positives, an outcome where the model incorrectly predicted the
positive class (i.e. has breast cancer), and 0 instances were False Negatives, an outcome where
the model incorrectly predicted the negative class (i.e. does not have breast cancer).

The performance of the SVM algorithm was analyzed by calculating different performance
parameters (Table 3) using the equations mentioned in section 3.5 of the methodology.

Table 3. Performance analysis for SVM

Performance index           Value
Accuracy                    0.93
Precision                   0.89
Recall (Sensitivity)        1.00
Specificity                 0.83
Prevalence                  0.58
F1-score                    0.94

From table 3, the precision was 0.89 (or 89%), which signifies the proportion of predicted
positives that are true positives, whereas the sensitivity or recall was 100%, which measures the
proportion of actual positives that are correctly identified. Likewise, the F1-score, a single
score that accounts for both precision and recall, was 94%, whereas the accuracy, measured as the
ratio of the number of correct predictions to the total number of predictions, was 93%. Also, the
specificity value was 83%, which measures the proportion of negatives that are correctly
identified, and the prevalence value was 58%, which denotes the proportion of actual positives
out of the total number of instances.
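
As a worked check, the Table 3 values can be recomputed directly from the Table 2 counts using
the standard definitions of these metrics (a small illustrative snippet, not part of the original
implementation):

# Table 2 counts: TP = 66, FP = 8, FN = 0, TN = 40.
TP, FP, FN, TN = 66, 8, 0, 40
total = TP + FP + FN + TN                 # 114 test instances

accuracy    = (TP + TN) / total           # 106 / 114 ≈ 0.93
precision   = TP / (TP + FP)              # 66 / 74   ≈ 0.89
recall      = TP / (TP + FN)              # 66 / 66   = 1.00
specificity = TN / (TN + FP)              # 40 / 48   ≈ 0.83
prevalence  = (TP + FN) / total           # 66 / 114  ≈ 0.58
f1_score    = 2 * precision * recall / (precision + recall)    # ≈ 0.94

print(accuracy, precision, recall, specificity, prevalence, f1_score)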

The performance of the SVM algorithm was reanalyzed after data normalization. The confusion
matrix and the calculated performance indices are presented in table 4 and table 5 below.

Table 4. Confusion matrix over Breast cancer test data for SVM after Data normalization

                          Actual: Yes                   Actual: No
Predicted: Yes            True Positive (TP) = 61       False Positive (FP) = 0
Predicted: No             False Negative (FN) = 5       True Negative (TN) = 48

Table 5. Performance analysis for SVM after data normalization

Performance index           Value
Accuracy                    0.96
Precision                   1.00
Recall (Sensitivity)        0.92
Specificity                 1.00
Prevalence                  0.58
F1-score                    0.96

Most of the performance indices, such as accuracy, precision and specificity, increased after
data normalization, while recall decreased slightly and prevalence remained the same. Overall,
this indicates that the SVM model performed better after normalizing the dataset.

4.3 Performance results for Random Forest Algorithm


The confusion matrix (table 6) for the test dataset was created using the RF algorithm.
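
The thesis lists only the SVM code; a minimal sketch of the corresponding Random Forest
experiment is given below, assuming scikit-learn's RandomForestClassifier with default
hyperparameters (the exact settings used are not stated) and the X_train/X_test/y_train/y_test
split created earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report

rf_model = RandomForestClassifier(random_state=42)   # random_state assumed
rf_model.fit(X_train, y_train)

# Predict on the test set and summarize the results.
y_predict_rf = rf_model.predict(X_test)
print(confusion_matrix(y_test, y_predict_rf, labels=[1, 0]))
print(classification_report(y_test, y_predict_rf))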

Table 6. Confusion matrix over Breast cancer test data for RF

                          Actual: Yes                   Actual: No
Predicted: Yes            True Positive (TP) = 66       False Positive (FP) = 3
Predicted: No             False Negative (FN) = 0       True Negative (TN) = 45

In Table 6, out of 114 instances, 66 were True Positives, an outcome where the model correctly
predicted the positive class (i.e. has breast cancer), and 45 were True Negatives, an outcome
where the model correctly predicted the negative class (i.e. does not have breast cancer).
Similarly, 3 instances were False Positives, an outcome where the model incorrectly predicted the
positive class (i.e. has breast cancer), and 0 instances were False Negatives, an outcome where
the model incorrectly predicted the negative class (i.e. does not have breast cancer).

The performance of the RF algorithm was analyzed by calculating different performance parameters
(Table 7) using the equations mentioned in section 3.5 of the methodology.

Table 7. Performance analysis for Random forest

Performance index           Value
Accuracy                    0.97
Precision                   0.96
Recall (Sensitivity)        1.00
Specificity                 0.93
Prevalence                  0.58
F1-score                    0.98

In table 7, the precision rate of 96% indicated the proportion of predicted positives that are
true positives, whereas the sensitivity or recall of 100% measured the proportion of actual
positives that are correctly identified. Likewise, the F1-score of 98% gave a single score that
accounts for both precision and recall, whereas the accuracy, measured as the ratio of the number
of correct predictions to the total number of predictions, was 97%. Also, the specificity value
was 93%, which measured the proportion of negatives that were correctly identified, and the
prevalence value was 58%, which signifies the proportion of actual positives out of the total
number of instances.

The performance of the RF algorithm was reanalyzed after data normalization. The confusion matrix
and the calculated performance indices are presented in table 8 and table 9 below. The accuracy
of RF decreased slightly after data normalization, yet it remained above 90%.

Table 8. Confusion matrix over Breast cancer test data for RF after Data normalization

                          Actual: Yes                   Actual: No
Predicted: Yes            True Positive (TP) = 58       False Positive (FP) = 0
Predicted: No             False Negative (FN) = 8       True Negative (TN) = 48

Table 9. Performance analysis for RF after data normalization

Performance index           Value
Accuracy                    0.93
Precision                   1.00
Recall (Sensitivity)        0.88
Specificity                 1.00
Prevalence                  0.58
F1-score                    0.94

4.4 Performance analysis of SVM and Random Forest
In the calculation of performance indices using the confusion matrix, SVM showed an accuracy of
93% while RF showed 97% (Figure 16). Similarly, RF showed roughly 7 and 10 percentage points
higher precision and specificity, respectively, than SVM.
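
A grouped bar chart such as the one in Figure 16 can be reproduced from the values in Tables 3
and 7; the following is a minimal matplotlib sketch (the exact chart style used in the thesis is
an assumption):

import numpy as np
import matplotlib.pyplot as plt

metrics    = ['Accuracy', 'Precision', 'Recall', 'Specificity', 'F1-score']
svm_scores = [0.93, 0.89, 1.00, 0.83, 0.94]   # from Table 3
rf_scores  = [0.97, 0.96, 1.00, 0.93, 0.98]   # from Table 7

# Side-by-side bars for the two classifiers.
x = np.arange(len(metrics))
plt.bar(x - 0.2, svm_scores, width=0.4, label='SVM')
plt.bar(x + 0.2, rf_scores, width=0.4, label='Random Forest')
plt.xticks(x, metrics)
plt.ylabel('Score')
plt.title('Performance measure over Breast Cancer Dataset')
plt.legend()
plt.show()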

Figure 16. Performance measure over Breast Cancer Dataset

In the SVM model, the training set accuracy was 95% and the test set accuracy was 94%, whereas in
the RF model the training set accuracy was 100% and the test set accuracy was 99%. The accuracy
on the training data and the test data remained the same for SVM before (table 3) and after
normalization (table 5). After data normalization in the RF algorithm (table 9), the accuracy on
the test data decreased although the accuracy on the training data remained the same. However,
precision improved in both SVM and Random Forest after normalization, whereas recall deteriorated
in both. The F-measure improved in SVM but deteriorated in Random Forest after normalization.
Overall, after normalization the accuracy of SVM increased from 93% to 96% whereas the accuracy
of Random Forest decreased from 97% to 93%.

Overall, in this study we observed that the Random Forest (RF) classifier showed higher accuracy
and precision than the SVM classifier. The higher precision of the RF classifier means that it
returns more relevant results than irrelevant ones. Thus, we can say that RF is a more efficient
algorithm than SVM for the classification of breast cancer.

CHAPTER 5
CONCLUSION AND FUTURE RECOMMENDATIONS
5.1 Conclusion
In this research, the Support Vector Machine and Random Forest algorithms were studied, analyzed
and implemented on the Breast Cancer Dataset. The performance indices precision, recall,
F-measure and accuracy were calculated and compared, as were the additional metrics specificity
and prevalence. It was found that the accuracy of the Support Vector Machine over the breast
cancer dataset, 93%, is less than that of Random Forest, 97%. In conclusion, this study
demonstrates that the Random Forest (RF) model, a rule-based classification model, was the best
model with the highest level of accuracy. Therefore, this model is recommended as a useful tool
for breast cancer prediction as well as for other medical decision making, particularly in
cancer.

5.2 Future Recommendation


With the help of the performance analysis of SVM and Random Forest, a robust breast cancer
detection system may be developed. In future, this work can be extended by using other
classifiers that may increase the accuracy of the system. In addition to the SVM and Random
Forest algorithms, other classification algorithms such as k-nearest neighbours (KNN) and
artificial neural networks (ANNs) can be implemented. A performance analysis of SVM kernel
functions can also be carried out on cancer datasets. Variants of SVM such as proximal SVM,
smooth SVM, Lagrangian SVM, the finite Newton method for Lagrangian SVM and Linear Programming
SVM can also be implemented for classifying cancer datasets.
