
Article 4

This document presents a comparative study of supervised and unsupervised machine learning techniques for lung cancer prediction. It compares artificial neural networks (ANN) and support vector machine (SVM) as supervised learning methods to the unsupervised methods of Apriori and K-means algorithms. The lung cancer dataset was obtained from Cancer Image Archive and preprocessed before applying each machine learning model. Performance metrics like accuracy, precision and recall were compared between the different methods. The study aims to evaluate which technique most accurately predicts lung cancer and its stage at an early diagnosis.


Vol 11, Issue 5, May 2020

ISSN NO: 0377-9254

A COMPARATIVE STUDY OF SUPERVISED AND UNSUPERVISED MACHINE LEARNING TECHNIQUES ON LUNG CANCER PREDICTION

M. Sheik Mansoor (1) and M. Mohamed Sathik (2)

(1) Research Scholar (Reg. No. 17221192161007), Sadakathullah Appa College, Affiliated to Manonmanium Sundaranar University, Tirunelveli, Tamilnadu, India
(2) Principal and Research Supervisor, Sadakathullah Appa College, Tirunelveli, Affiliated to Manonmanium Sundaranar University, Tirunelveli, Tamilnadu, India

Abstract: Lung cancer is one of the most dangerous types of cancer and has a high spread rate. Lung cancer metastases spread through the lymph nodes and bloodstream to other organs such as the bones, glands and brain. Due to air and industrial pollution, the number of people affected by lung cancer is increasing enormously. According to the prediction reports of the World Health Organization (WHO), the number of lung cancer deaths will reach 9.6 million in 2020, which is an alarming issue. Diagnosing lung cancer at an early stage could help physicians treat the patients. Though manual analysis of CT scans exists in the medical field, it is too hard for medical advisors to predict the exact stage of the disease using the CT scan images. Hence, the medical informatics research community has created several machine learning models to predict lung cancer and its type at an early stage. In this comparative research study, we downloaded the lung cancer dataset from the Cancer Imaging Archive and gave it as input to two of the most accepted supervised machine learning models, Artificial Neural Networks (ANN) and Support Vector Machine (SVM). Another, unlabeled dataset was given as input to the Apriori and K-means models from unsupervised learning to observe the changes. The final results and the performance metrics of the machine learning algorithms, such as accuracy, precision and recall, are compared with each other and tabulated.

Keywords: Machine Learning; Lung Cancer Prediction; Supervised Learning; Cancer Diagnosis.

1. INTRODUCTION
Lung cancer is a type of cancer which starts in the cells of the lungs and spreads to other parts of the human body [3]. Likewise, cancers of the breast, mouth and kidney can also spread to the lungs via the lymph nodes or bloodstream [2, 3]. The lungs are sponge-like organs in the human chest. Their main function is to take oxygen into the body and release carbon dioxide [1]. During breathing, air passes through a pipe-like structure called the trachea and travels through the bronchi to enter the lungs, and it leaves by the same path. The small air sacs at the ends of the bronchial branches, called alveoli, pass oxygen to the blood and remove carbon dioxide from it [4, 5].
At the initial stage of lung cancer, the DNA of the patient changes or is damaged, and the genes mutate. Mutated genes do not work properly because they no longer receive correct instructions from the DNA. This causes the cells in the lung to divide and grow out of control in and around the lungs, which results in lung cancer [6].
As stated by the Global Cancer Observatory (GCO), 5.4 in every one million people in India have lung cancer. The alarming issue in the rise of lung cancer is that it has a very low survival rate compared to other cancers. In India, 25% of cancer victims lose their lives every year. Due to late-stage diagnosis and fast spread, the death rate of lung cancer is much higher than that of prostate, colorectal, skin, kidney and breast cancers [7]. Accurately identifying lung cancer cells at the initial stage through manual analysis of CT scans is not possible. This makes it difficult for medical advisors to predict the exact stage of the cancer using the CT scan images.
To overcome these issues and to identify the cancer type at an early stage, machine learning techniques are used on the patient data. It helps the physicians

www.jespublication.com Page No:51



to acquire a clear understanding of the patient's condition. Moreover, it helps physicians identify the type and aggressiveness of the cancer cells [8, 9].
Machine learning techniques can be classified into two major types based on their application and working nature. For lung cancer prediction, several research contributions and prediction methods have been introduced. In this research work, we take two supervised learning methods, Artificial Neural Networks (ANN) and Support Vector Machine (SVM), and two unsupervised learning methods, Apriori and K-means, for this comparative study. The datasets were downloaded from the open-source Cancer Imaging Archive and given as the training set to these machine learning algorithms. The preprocessing, feature extraction and selection are kept the same for all four methods.
This paper is organized as follows. Section 2 describes the difference between supervised learning and unsupervised learning. Section 3 explains the preprocessing, feature extraction and selection. Section 4 evaluates the performance of ANN, SVM, Apriori and K-means, and Section 5 concludes and discusses the future work of the comparative study.

2. SUPERVISED LEARNING AND UNSUPERVISED LEARNING
In supervised learning, the machine learning system is trained with well-labeled information, which means that some data is already tagged with the correct answer, so the output of the learning process can be compared directly against it. A supervised learning algorithm learns from labeled training data and helps to predict outcomes for unseen data.
In an unsupervised machine learning technique, by contrast, the dataset does not have clear labels. Instead, the algorithm is set up so that it discovers the information on its own. Unsupervised machine learning techniques can perform more complex processing tasks than supervised learning algorithms, but their results are less predictable than those of other approaches such as deep learning and natural learning processes.

A. Supervised learning algorithm: Artificial Neural Networks (ANN)
An ANN maintains interconnected nodes, called neurons, which gather information by identifying relationships and new patterns in the data. It has three kinds of layers: an input neuron layer, a hidden neuron layer and an output neuron layer. Neurons in each layer receive the input data, perform operations and forward the data to the connected neurons in the next layer. Each neuron, and each edge that connects two neurons, has a particular weight. The weights change as the network learns. ANN uses both forward and backward propagation for learning.
The final result of the ANN is the class whose output-layer neuron has the maximum probability. Although several algorithms exist to predict early-stage lung cancer, ANN can produce accurate results.

Fig. 1. Structure of Neural Networks
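The forward pass described above can be illustrated with a minimal sketch. It is not the network used in this study: the layer sizes (4 inputs, 5 hidden neurons, 3 outputs), the ReLU and softmax activations, the random untrained weights and the example feature vector are all assumptions chosen for illustration, and backward propagation is not shown. Only the three output labels are taken from the cancer types used later in the paper.

```python
import math
import random

LABELS = ("T", "M", "N")  # cancer types used in Tables 1 and 2


def dense(inputs, weights, biases):
    """One layer: each neuron computes a weighted sum of its inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Hidden-layer activation (an assumption; the paper does not name one)."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn output-layer scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, hidden_w, hidden_b, out_w, out_b):
    """Forward propagation: input layer -> hidden layer -> output layer."""
    hidden = relu(dense(features, hidden_w, hidden_b))
    return softmax(dense(hidden, out_w, out_b))


def predict(probabilities):
    """Pick the label of the output neuron with the maximum probability."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return LABELS[best]


def random_matrix(rng, rows, cols):
    """Hypothetical untrained weights, drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
hidden_w, hidden_b = random_matrix(rng, 5, 4), [0.0] * 5
out_w, out_b = random_matrix(rng, 3, 5), [0.0] * 3

probs = forward([0.2, 0.7, 0.1, 0.5], hidden_w, hidden_b, out_w, out_b)
print(probs, predict(probs))
```

With trained weights in place of the random ones, `predict` would return the cancer type whose output neuron has the highest probability.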


B. Supervised learning algorithm: Support Vector Machine (SVM)
SVM is a supervised learning technique which performs classification and regression to identify the associations between the data in the given dataset. SVM is a discriminative classifier, which draws a hyperplane to differentiate the classes that are derived as outputs. The hyperplanes are the decision boundaries. The maximum accuracy can be attained only if the SVM draws the hyperplane so that all the objects are separated into their classes correctly.

Here, support vectors are the data points that lie closest to the hyperplane and influence its position and orientation. Using these support vectors, the margin of the classifier can be maximized, which gives a clearer separation between the classes. Deleting the support vectors will change the position of the hyperplane. These are the points that help us build an accurate SVM model.

Fig. 2. Hyperplane of SVM

C. Unsupervised learning algorithm: Apriori Algorithm
The Apriori algorithm is a classical frequent itemset generation algorithm and a milestone in the development of data mining. It is used for finding frequent itemsets in a dataset for Boolean association rules. The Apriori algorithm uses prior knowledge of frequent itemset properties. It follows an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.
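The level-wise search can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the study: the toy records, the function names and the `min_count` threshold are assumptions, support is counted by brute force, and candidates are only built by joining frequent itemsets of the previous level.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def frequent_itemsets(transactions, min_count):
    """Level-wise search: frequent k-itemsets generate (k+1)-item candidates."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})

    # Level 1: single items that occur often enough.
    level = [frozenset([i]) for i in items
             if support_count(frozenset([i]), transactions) >= min_count]

    result = {}
    k = 1
    while level:
        for itemset in level:
            result[itemset] = support_count(itemset, transactions)
        # Join step: unions of two frequent k-itemsets with exactly k+1 items.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        level = [c for c in candidates
                 if support_count(c, transactions) >= min_count]
        k += 1
    return result


# Hypothetical toy records; not taken from the paper's dataset.
records = [
    {"Cataract", "Elder", "Female"},
    {"Cataract", "Elder", "Male"},
    {"Cataract", "Elder", "Female"},
    {"Rhinitis", "Child", "Female"},
    {"Rhinitis", "Child", "Female"},
]

found = frequent_itemsets(records, min_count=2)
for itemset, count in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Counting every candidate against every transaction is what makes the plain level-wise search expensive on large datasets.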


To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which helps by reducing the search space. The Apriori property states that all non-empty subsets of a frequent itemset must also be frequent.

D. Unsupervised learning algorithm: K-Means
K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible. It assigns data points to a cluster such that the sum of the squared distances between the data points and the cluster's centroid (the arithmetic mean of all the data points that belong to that cluster) is at the minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.

3. HANDLING DATA

A. Pre-processing
Pre-processing is an important task used for transforming the raw data into useful and efficient data. It includes several steps such as data cleaning, transformation and reduction. Data cleaning is a process in which missing values are replaced or removed and noise in the data is removed through methods such as regression, clustering or binning. The transformation of data includes normalization of data, selection of attributes, conversion of continuous data to discrete data through discretization, and generation of hierarchies. Data reduction includes several actions such as aggregation of data as needed, attribute subset selection, replacement of the original data with a smaller representation through parametric or non-parametric numerosity reduction, and dimensionality reduction.
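Three of the pre-processing steps named above (replacing missing values, normalization and discretization) can be sketched as follows. The paper does not say which variants were used, so mean imputation, min-max scaling and equal-width binning are assumptions, as are the function names and the sample values.

```python
def fill_missing_with_mean(values):
    """Data cleaning: replace missing entries (None) with the mean of the rest."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Transformation: rescale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(scaled_values, n_bins):
    """Discretization: map values in [0, 1] to equal-width bin indices."""
    return [min(int(v * n_bins), n_bins - 1) for v in scaled_values]


# Hypothetical measurements with one missing value.
raw = [12.0, None, 30.0, 18.0, 48.0]

cleaned = fill_missing_with_mean(raw)
scaled = min_max_normalize(cleaned)
bins = discretize(scaled, n_bins=3)
print(cleaned, scaled, bins)
```

Each function takes one attribute (column) at a time, so the same steps would be repeated per attribute on a real dataset.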


B. Feature Selection
Feature selection is a dimension reduction method which is used to select the relevant features for constructing the model. It includes four important approaches: wrapper, filter, embedded and hybrid. The wrapper approach is computationally expensive. It selects the features through classification and uses a learning algorithm to calculate the accuracy of the classification. The filter approach selects the subset of features without using any learners. Databases with higher dimension can use this type of feature selection approach. The embedded approach selects the features during the training of the data, and it uses applied learning algorithms for deriving the specificity of the approach. The hybrid approach uses the filter and wrapper approaches in combination: the features are selected through the filter approach and are then tested with the wrapper approach. Thus it uses the advantages of both for feature selection.

C. Feature Extraction
Feature extraction is another dimensionality reduction method through which the raw data is transformed into a group of manageable data for further processing. It plays an important role in image processing, as multiple parameters are needed to process the images. It includes low level extraction, edge level extraction, curvature extraction, shape detection, motion detection and so on. Here the low level processing of images includes detection of edges, detection of corners, blob detection for detecting the regions in the images, ridge detection for extracting thin lines which are brighter than the nearby regions, and feature transform through differences across image scales. Curvature extraction intends to extract the direction of edges. It also identifies the change in intensity of images and the autocorrelation. Shape detection involves finding the threshold of the images, region extraction and template matching. It also includes the Hough transform, which extracts imperfect instances of objects within a class of shapes through a voting procedure. The motion detection model involves extracting the motion of images and the optical flow by examining areas of the images.

4. PERFORMANCE EVALUATION OF ANN, SVM, APRIORI AND K-MEANS

A. Performance comparison of ANN and SVM
The ANN and SVM machine learning experiments are carried out on TensorFlow, which is a free open-source software library developed by Google. The dataset used for the implementation is taken from the Cancer Imaging Archive. The chosen dataset consists of CT scan data of 1019 patients with different cancers. Initially, information about the patients affected by non-small cell lung cancer (NSCLC) is taken out from the given dataset. Around 419 patient records are extracted. The NSCLC data is then separated into a training dataset and a test dataset in the ratio 70:30. The training dataset is fed as input to both ANN and SVM, which are trained and evaluated side by side for the best prediction results.

Predicted \ True/Actual   Type ‘T’   Type ‘M’   Type ‘N’
Cancer Type ‘T’              96          8          4
Cancer Type ‘M’               5         89          5
Cancer Type ‘N’               4          5        104
Table 1. Prediction of cancer type using ANN

Predicted \ True/Actual   Type ‘T’   Type ‘M’   Type ‘N’
Cancer Type ‘T’             101          6          2
Cancer Type ‘M’               7         95          8
Cancer Type ‘N’               8          5         88
Table 2. Prediction of cancer type using linear SVM

Table 1 and Table 2 represent the predictions made through ANN and SVM respectively. The accuracy of the ANN model is 90.2%: out of 320 total predicted values, ANN has correctly predicted 290. The accuracy of the SVM algorithm is 88%, with 284 predictions made correctly.


The second important performance metric of an ML algorithm is precision. It is the fraction of retrieved information that is relevant; in lung cancer type prediction, it is the fraction of patients predicted to have a particular cancer type who actually belong to that type.
In predicting the lung cancer type, the precision value of type ‘x’ cancer is the fraction of correctly predicted type ‘x’ cases among all cases predicted as type ‘x’. Precision is calculated by the formula below:

Precision (Type ‘x’) = (No. of correctly predicted Type ‘x’) / (Total predicted Type ‘x’ cancer)

Precision values (in percentage)
                    ANN      SVM
Cancer Type ‘T’    88.8%    92.6%
Cancer Type ‘M’    90.8%    86.3%
Cancer Type ‘N’    92.6%    87.1%
Table 3. Precision values of ANN and SVM algorithm for given dataset.

The recall can be derived through the formula below:

Recall (Type ‘x’) = (No. of correctly predicted Type ‘x’ cancer) / (No. of actual Type ‘x’ cancer patients)

Table 4. Recall values of ANN and SVM algorithm for given dataset.

On comparing both algorithms, we can observe that ANN is more effective than SVM in many of the cases. On the precision values, ANN works better than SVM, but in a few cases, such as detecting cancer type T, the precision is better for SVM. On the recall values, ANN again performs better than SVM, but in a few cases, such as predicting cancer type M, SVM performs better than ANN.

B. Comparison of Apriori and K-Means
This part of the work uses a dataset to study the effect of combining the K-means algorithm with the Apriori algorithm in terms of computation time and the rules obtained. The dataset consists of 8243 disease diagnosis records. The medical data variables are disease diagnosis, age group, gender and status of care. Part of the data used can be seen in Table 5.

In the first approach, the Apriori algorithm is applied directly to the dataset with 4 input variables, namely disease diagnosis, age group, gender and status of care, in order to obtain the confidence values, rules and computation time of the Apriori algorithm. The test results obtained from the Apriori algorithm can be seen in Table 6.

The rule information obtained in Large Itemset 4 results in two rules: the diagnosis of another allergic rhinitis with the child age group, female gender and outpatient status, and the diagnosis of postoperative disease with the adult age group, male gender and outpatient status, each with a confidence value of 69%. From these results, it can be seen that the information obtained from the Apriori algorithm alone is still lacking.

As shown in Table 7, the combination of the K-Means algorithm and the Apriori algorithm produces more complete and detailed information compared to the results obtained by applying the Apriori algorithm only.
S. No.   Disease Diagnosis              Age Cluster   Gender   Status of Care
1        Observation of Febris          Baby          Male     Outpatient
2        Observation of Febris          Baby          Female   Outpatient
3        Observation of Febris          Baby          Male     Outpatient
4        Observation of Febris          Baby          Female   Outpatient
5        Paronychia                     Adult         Female   Outpatient
6        Hnp Lumbalis                   Adult         Male     Outpatient
:        :                              :             :        :
8243     Disputes with the counselor    Toddlers      Female   Outpatient
Table 5. Sample patient diagnosis data in 2016

Using Apriori (Full Data)
Disease Diagnose             Age Cluster   Gender   Status of Care   Confidence (%)
Cataracts not Specified      --            Male     Out              69
Cataracts not Specified      Elder         --       Out              76
Cataracts not Specified      Elder         Female   Out              60
Cataracts not Specified      Elder         --       Out              66
Another Allergic Rhinitis    --            Female   Out              69
Another Allergic Rhinitis    Child         Female   Out              69
Post Operation               Adult         Male     Out              69
Table 6. Data processing using Apriori algorithm
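The confidence column can be read with the standard association-rule definition: the confidence of a rule "A implies B" is the number of records matching both A and B divided by the number of records matching A. The sketch below uses a handful of made-up records with hypothetical field names; it does not reproduce the 8243-record dataset or the values in Table 6.

```python
def count_matching(records, conditions):
    """Number of records whose fields equal every (field, value) condition."""
    return sum(1 for r in records
               if all(r.get(field) == value for field, value in conditions.items()))


def confidence(records, antecedent, consequent):
    """confidence(A -> B) = count(A and B) / count(A)."""
    base = count_matching(records, antecedent)
    if base == 0:
        return 0.0
    both = {**antecedent, **consequent}
    return count_matching(records, both) / base


# Hypothetical toy records; field names and values are illustrative only.
toy_records = [
    {"diagnosis": "Allergic rhinitis", "age": "Child", "gender": "Female", "care": "Outpatient"},
    {"diagnosis": "Allergic rhinitis", "age": "Child", "gender": "Female", "care": "Outpatient"},
    {"diagnosis": "Allergic rhinitis", "age": "Child", "gender": "Female", "care": "Outpatient"},
    {"diagnosis": "Febris", "age": "Child", "gender": "Female", "care": "Outpatient"},
    {"diagnosis": "Post operation", "age": "Adult", "gender": "Male", "care": "Outpatient"},
]

rule_confidence = confidence(
    toy_records,
    antecedent={"age": "Child", "gender": "Female"},
    consequent={"diagnosis": "Allergic rhinitis"},
)
print(f"confidence = {rule_confidence:.0%}")
```

Which attributes form the antecedent and which the consequent in the paper's rules is not stated, so the split used here is an assumption.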

K-Means + Apriori
Cluster     Disease Diagnose             Age Cluster   Gender   Status of Care   Confidence (%)
Cluster 1   Cataracts not Specified      Elder         Male     Outpatient       66
Cluster 2   Cataracts not Specified      Elder         Female   Outpatient       66
Cluster 3   Another allergic rhinitis    Child         Female   Outpatient       92
Cluster 4   Post Operation               Adult         Male     Outpatient       93
Table 7. Data processing using K-Means and Apriori
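The clustering step behind a combined approach of this kind can be sketched with a plain K-means loop. This assumes that the records are clustered first and that rules are then mined within each cluster, which the paper does not spell out. The points, the parameter names and the random initialisation are illustrative, and categorical attributes such as those in Table 5 would first need a numeric encoding (for example one-hot), which is not shown.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Centroid: the arithmetic mean of the points, coordinate by coordinate."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, iterations=100, seed=0):
    """Alternate between assigning points and updating centroids until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k distinct data points
    assignment = None
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the result is stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return assignment, centroids


# Hypothetical two-dimensional points forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

Each resulting cluster could then be passed separately to a rule-mining step, which is one way to obtain per-cluster rules like those listed in Table 7.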

Meanwhile, the combination of the K-Means and Apriori algorithms is faster than the Apriori algorithm alone: the total time of the combined K-Means and Apriori algorithms is 17.41 minutes, while the total time of the Apriori algorithm is 21.93 minutes.

5. CONCLUSION
In this comparative research study, we downloaded the lung cancer dataset from the Cancer Imaging Archive and gave it as input to two of the most accepted machine learning models, Artificial Neural Networks (ANN) and Support Vector Machine (SVM). Another patient dataset was given as input to the unsupervised learning methods, Apriori and K-means, to observe the performance difference. The final results and the performance metrics of the machine learning algorithms, such as accuracy, precision and recall, are compared with each other and tabulated. Thus the supervised and unsupervised algorithms are compared.

REFERENCES
1. K. Kancherla and S. Mukkamala, "Feature Selection for Lung Cancer Detection Using SVM Based Recursive Feature Elimination Method", Lecture Notes in Computer Science, Vol. 7256, pp. 168-176, 2012.
2. S.K. Lakshmanaprabu, S.N. Mohanty, K. Shankar, N. Arunkumar, and G. Ramirez, Future Generation Computer Systems, Vol. 92, pp. 374-382, 2018.
3. A. Trivedi and P. Shukla, "Lung Cancer Diagnosis by Hybrid Support Vector Machine", Communications in Computer and Information Science, Vol. 628, pp. 177-187, 2016.
4. T. Nadira and Z. Rustama, "Classification of Cancer Data Using Support Vector Machines with Features Selection Method Based on Global Artificial Bee Colony", Vol. 2023(1), pp. 1-7, 2018.
5. Arora, P., Boyne, D., Slater, J. J., Gupta, A., Brenner, D. R., and Druzdzel, "Bayesian Networks for Risk Prediction Using Real-World Data: A Tool for Precision Medicine", Value in Health, Vol. 22, pp. 437-445, 2019.


6. M. B. Sesen, T. Kadir, R. B. Alcantara, J. Fox, and M. Brady, "Survival Prediction and Treatment Recommendation with Bayesian Techniques in Lung Cancer", AMIA Annual Symposium, pp. 838-847.
7. K. Jayasurya, G. Fung, S. Yu, C. Dehing-Oberije, D. De Ruysscher, A. Hope, W. De Neve, Y. Lievens, P. Lambin, and A. L. A. J. Dekker, "Comparison of Bayesian network and support vector machine models for two-year survival prediction in lung cancer patients treated with radiotherapy", Medical Physics, Vol. 37, pp. 1401-1407, 2010.
8. N. P. Dharshinni, H. Mawengkang, and M. K. M. Nasution, "Mapping of medicine data with k-means and apriori combinations based on patient diagnosis", International Conference on Computing and Applied Informatics, Vol. 2 (978), 2018.
9. E. Adetiba and O. Olugbara, "Lung Cancer Prediction Using Neural Network Ensemble with Histogram of Oriented Gradient Genomic Features", The Scientific World Journal, Vol. 2015, pp. 1-17, 2015.
