Mobile Application Development
SUBMITTED BY
ADARSH MISHRA
SAHIL POTALE
BHAVESH GHADE
UNDER THE GUIDANCE OF
MRS. MANJULA ATHANI
A
CAPSTONE PROJECT REPORT
ON
CERTIFICATE
This is to certify that Mr. MISHRA ADARSH KRISHNAMOHAN of PRAVIN PATIL
COLLEGE OF DIPLOMA ENGINEERING AND TECHNOLOGY, Enrollment No. 2005630126,
has completed the final-year project titled LUNG CANCER PREDICTION during
the academic year 2022 – 2023.
The project was completed in a group of three members under the guidance of the
faculty guide.
Project Members
1 Adarsh Mishra
2 Sahil Potale
3 Bhavesh Ghade
Acknowledgement
I express my sincere thanks to the Principal, Mrs. R.B. Patil, who gave me the
opportunity to pursue my Diploma in the Computer Engineering department. I also
express my thanks to the H.O.D., Mrs. Manjula Athani, and the other staff of the
Computer Engineering department. I would like to thank my guide, Mrs. Manjula
Athani, for her encouragement and guidance, which helped me in completing the
project. Finally, I would like to thank my colleagues and friends who helped me
in completing the project successfully.
PROJECT MEMBERS
1 Adarsh Mishra
2 Sahil Potale
3 Bhavesh Ghade
Abstract
Contents
1. Introduction
1.1 Current Scenario
1.2 Problem Faced in Current Scenario
1.3 Solution and Planning
1.4 Flow Diagram of Lung Cancer Prediction
2. Literature Review
3. Scope of Project
4. Methodology
4.1 Hardware and Software Requirements
5. Designing
5.1 Block Diagram
5.2 Activity Diagram
5.3 Data Flow Diagram
6. Results and Applications
6.1 Results
6.2 Database Design
6.3 Application
7. Conclusion and Future Scope
7.1 Conclusion
7.2 Future Scope
8. Appendix
8.1 Gantt Charts
9. References & Bibliography
9.1 References
9.2 Bibliography
1. INTRODUCTION
1. Introduction to project
While lung cancer is the second most common cancer in the U.S., it's not often
detected early. However, lung cancer screening offers hope for catching the cancer
early when it's easier to treat. Unlike some other cancers, lung cancer usually has
no noticeable symptoms until it's in an advanced stage.
1.3 Solution
As the saying goes, “prevention is better than cure”, and that is the main reason
for choosing this topic. To address these concerns, we are building this project
using various machine learning techniques.
Using machine learning and deep learning techniques, the software will predict
the stage of cancer from an X-ray image and suggest the early precautions one
should take to avoid an increase in cancer cells.
Modules:
Login and signup page:
The patient or user can sign in or sign up to his/her account to get detailed
information about the patient’s record.
The login page has two data fields, email and password. The patient fills in
the credentials, signs in to the account and sees the dashboard.
The registration form has name, email and password fields for registering a
new user, who can sign up from this form.
Dashboard:
The dashboard is visible after a successful login. It routes to the different modules.
Home Page:
A home page (or homepage) is the main page of a website or application. The
term may also refer to the start page shown when the application first opens.
After logging in to our application, the user can access the dashboard. The
first thing the user sees is the home page, which gives basic information about
our application, such as what it is about and what its basic agenda is.
Prediction:
After learning about the basic agenda of our application, the user or patient
comes to this module through the dashboard. This is the main module of our
application. Only here can the user detect the cancer stage from an X-ray and
find out whether it shows signs of cancer. Here we used various widgets such as
Buttons, the Image Picker widget and a Container.
We used two buttons, one for selecting a picture from the gallery and one for
taking a picture with the camera in real time. The Image Picker widget lets the
user select a picture from the gallery at the click of a button. All these
components are stored in a Container.
Doctors:
After the prediction, the user gets to know the stage of lung cancer he/she has.
If the stage is severe, the application suggests doctors and medical centers
where this kind of treatment is available, along with their public details. The
details of the doctors are stored in a Container and shown in a tabular format.
History:
After performing the prediction multiple times, the user or patient may want to
compare the current result with previous ones. The history module serves this
purpose: it shows all the predictions performed by a single account.
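The history module described above could be modelled roughly as follows. This is only a sketch in Python: the class names, the field names (`taken_at`, `stage`, `confidence`) and the stage labels are assumptions made for illustration and are not taken from the application itself.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PredictionRecord:
    """One prediction result; all field names are illustrative assumptions."""
    taken_at: datetime
    stage: str          # e.g. "Stage I"; the label set is hypothetical
    confidence: float   # model confidence in the range [0, 1]


@dataclass
class PredictionHistory:
    """Keeps every prediction made by a single account, oldest first."""
    account_id: str
    records: list = field(default_factory=list)

    def add(self, record: PredictionRecord) -> None:
        self.records.append(record)
        self.records.sort(key=lambda r: r.taken_at)

    def latest_two(self):
        """Return (previous, current) so the two results can be compared."""
        if len(self.records) < 2:
            return None
        return self.records[-2], self.records[-1]
```

Keeping the records sorted by time means the comparison of the current and previous result is just a look at the last two entries.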
1.4 Flow Diagram of Lung Cancer Prediction:
Fig 1.1 shows the system flow diagram of lung cancer prediction. First, the
user goes to the website or the application and logs in. For security reasons, a
4-digit OTP is generated and sent to the phone number used for login/signup.
After that, a browse option is available in the GUI of the application. The
user selects that option and chooses the document (in image format) from the
device memory. The image must be an X-ray of the lungs. The system then tries to
match the image against the images in the database and, based on that, reports
the accuracy of the match between the original and the reference image.
Based on the stage, it suggests the early precautions one should take.
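The 4-digit OTP step in the flow above can be sketched in Python as shown below. The function names are hypothetical, and how the code is delivered to the phone (for example through an SMS provider) is outside this sketch.

```python
import hmac
import secrets


def generate_otp(digits: int = 4) -> str:
    """Return a random numeric one-time password, zero-padded to `digits`."""
    return f"{secrets.randbelow(10 ** digits):0{digits}d}"


def verify_otp(expected: str, submitted: str) -> bool:
    """Compare in constant time so timing does not reveal partial matches."""
    return hmac.compare_digest(expected, submitted)
```

The `secrets` module is used instead of `random` because the OTP is a security token and should come from a cryptographically strong source.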
2. LITERATURE REVIEW
Cancer is a very dangerous and common disease that causes death worldwide.
Early diagnosis of cancer provides a greater possibility of cure. Cancer
generates abnormal growth of cells, which spreads to all parts of the body. In
this paper we discuss the early prediction of lung cancer with the help of data
mining techniques. The lungs are spongy organs that, when affected by cancer
cells, can lead to loss of life. The common causes of lung cancer are smoking
habits, working in a smoky environment or breathing industrial pollution, air
pollution and genetics. In this paper we have proposed a genetic algorithm based
dataset classification for prediction of multiple models. The use of a genetic
algorithm (GA) has shown better performance when compared with particle swarm
optimization and differential evolution.
Lung cancer is one of the leading causes of mortality in every country, affecting
both men and women. Lung cancer has a poor prognosis, resulting in a high death
rate. The computing sector is becoming fully automated, and the medical industry
is also automating itself with the aid of image recognition and data analytics.
This paper endeavors to inspect the accuracy of three classifiers, namely Support
Vector Machine (SVM), K-Nearest Neighbor (KNN) and Convolutional Neural Network
(CNN), that classify lung cancer at an early stage so that many lives can be
saved. The datasets used in this study are taken from the UCI datasets for
patients affected by lung cancer. The principal aim of this paper is to analyse
the accuracy of the classification algorithms using the WEKA tool. The
experimental results show that SVM gives the best result with 95.56%, then CNN
with 92.11% and KNN with 88.40%.
Machine learning based lung cancer prediction models have been proposed to
assist clinicians in managing incidental or screen detected indeterminate
pulmonary nodules. Such systems may be able to reduce variability in nodule
classification, improve decision making and ultimately reduce the number of
benign nodules that are needlessly followed or worked-up. In this article, we
provide an overview of the main lung cancer prediction approaches proposed to
date and highlight some of their relative strengths and weaknesses. We discuss
some of the challenges in the development and validation of such techniques and
outline the path to clinical adoption.
The idea of this project is to detect lung cancer at an early stage by
predicting it with the help of machine learning techniques.
3. SCOPE OF THE PROJECT
Lung cancer results in over 1.7 million deaths per year, making it the deadliest
of all cancers worldwide, more than breast, prostate and colorectal cancers
combined, and it is the sixth most common cause of death globally, according to
the World Health Organization. While lung cancer has one of the worst survival
rates among all cancers, interventions are much more successful when the cancer
is caught early. The user can get information about the early precautions one
should take and about nearby doctors. The application interface is responsive
and user friendly. The application can be used by an individual to find out
whether he/she may have cancer. It can also be used by doctors to double-check
their reports. ML and DL capabilities are to be included, giving the computer
the ability to understand, analyze, manipulate and potentially generate results.
CNN and image processing technology can be used for analyzing X-ray or CT scan
images.
4. Methodology
In this project we used Flutter and Python as the frontend and Firebase as the
backend.
4.1 Flutter:
Flutter Version: 3.7
The first version of Flutter was announced in 2015 at the Dart Developer
Summit. It was initially known by the codename Sky and ran on the Android OS.
On December 4, 2018, the first stable version of the framework, Flutter 1.0,
was released. As of October 24, 2019, the stable release was Flutter
v1.9.1+hotfix.6; this project uses the later Flutter 3.7.
In general, creating a mobile application is a very complex and challenging
task. There are many frameworks available which provide excellent features
for developing mobile applications. For developing mobile apps, Android
provides a native framework based on the Java and Kotlin languages, while iOS
provides a framework based on the Objective-C/Swift languages. Thus, we need
two different languages and frameworks to develop applications for both OSs.
To overcome this complexity, several frameworks have been introduced that
support both OSs along with desktop apps. These frameworks are known as
cross-platform development tools.
4.2 Python:
Python Version: 3.11.2
Python is a very popular general-purpose, interpreted, interactive,
object-oriented, high-level programming language. It is dynamically typed and
garbage-collected. Python is designed to be highly readable: it frequently uses
English keywords where other languages use punctuation, and it has fewer
syntactical constructions than other languages.
History of Python:
Python was developed by Guido van Rossum in the late eighties and early
nineties at the National Research Institute for Mathematics and Computer
Science in the Netherlands.
Python is copyrighted. Its source code is available under the Python Software
Foundation License, an open-source license that is compatible with the GNU
General Public License (GPL).
4.3 Firebase:
We built this project using various machine learning (ML) and deep learning
techniques such as CNN (Convolutional Neural Network), linear regression and
multiple regression. The system monitors the health status of the patient.
Using these techniques we can predict the stage of lung cancer of an individual
from an X-ray image. The system also tries to match the X-ray image against
the dataset of X-ray images and, on this basis, reports an accuracy.
4.4 CNN:
CNNs were first developed and used around the 1980s. The most that a CNN could
do at that time was recognize handwritten digits. It was mostly used in the
postal sector to read zip codes, pin codes, etc. The important thing to remember
about any deep learning model is that it requires a large amount of data to
train and also a lot of computing resources. This was a major drawback for CNNs
at that time; hence CNNs were limited to the postal sector and failed to enter
the wider world of machine learning.
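The core operation that gives a CNN its name can be illustrated with a few lines of plain Python. The sketch below is not the model used in this project: the tiny image and the hand-picked edge kernel are assumptions for illustration, whereas a real CNN learns its kernel values from training data and would normally be built with a library such as Keras.

```python
def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation of two nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            # Multiply the kernel with the image patch under it and sum up
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


def relu(feature_map):
    """Keep positive responses and set negative ones to zero."""
    return [[max(0, v) for v in row] for row in feature_map]


# Toy 3x4 "image" with a dark left half and a bright right half
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_kernel = [[-1, 1]]  # responds where intensity rises from left to right
features = relu(conv2d(image, edge_kernel))
```

Here `features` is non-zero only in the column where the dark region meets the bright one, which is the kind of local pattern the early layers of a CNN pick up.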
4.5 TensorFlow:
TensorFlow is basically a software library for numerical computation using data
flow graphs.
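To make the phrase "data flow graph" concrete, here is a toy graph evaluator in plain Python. It deliberately does not use the TensorFlow API; the `Node` class and the helper functions are hypothetical and only show the idea that a computation is first described as a graph of operations and then evaluated.

```python
import operator


class Node:
    """A graph node: either a constant or an operation on other nodes."""

    def __init__(self, op=None, inputs=(), value=None):
        self.op = op
        self.inputs = inputs
        self.value = value

    def evaluate(self):
        if self.op is None:  # leaf node holding a constant
            return self.value
        # Data flows from the input nodes into this node's operation
        args = [node.evaluate() for node in self.inputs]
        return self.op(*args)


def constant(value):
    return Node(value=value)


def add(a, b):
    return Node(operator.add, (a, b))


def multiply(a, b):
    return Node(operator.mul, (a, b))


# Build the graph for (2 + 3) * 4; nothing is computed at this point
graph = multiply(add(constant(2), constant(3)), constant(4))
result = graph.evaluate()  # the computation happens here
```

Separating the description of the computation from its execution is what lets a library decide later where and how each operation runs.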
4.6 Keras:
4.1 Hardware and Software Requirements:
Hardware:
❖ HDD or SSD.
❖ Intel 2.60 GHz Processor i5 (10th Gen)
❖ 4 GB RAM or above
Software:
5. Designing
Fig 5.1 shows the block diagram of lung cancer prediction. After login/signup, when the
user inputs the image, the first thing the system does is check whether the image has a
suitable resolution and size for prediction. Once this guideline is satisfied, the system
applies the prediction algorithm to the image. It then extracts a feature from the image;
this feature indicates whether the image is cancerous or not. After scanning, the system
tries to match the image against the images in the dataset. Based on the matching process,
the system predicts whether the patient is cancerous or not.
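The first step of the block diagram, checking that the image is suitable before prediction, might look like the following Python sketch. The thresholds `MIN_SIDE` and `MAX_BYTES`, the list of allowed extensions and the function name are all assumed values chosen for illustration; the report does not state the actual limits.

```python
import os

MIN_SIDE = 224                  # hypothetical minimum width/height in pixels
MAX_BYTES = 5 * 1024 * 1024     # hypothetical upload limit (5 MiB)
ALLOWED_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # assumed image formats


def check_image(filename: str, width: int, height: int, size_bytes: int):
    """Return a list of problems; an empty list means the image can be used."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append("unsupported file type")
    if width < MIN_SIDE or height < MIN_SIDE:
        problems.append("resolution too low")
    if size_bytes > MAX_BYTES:
        problems.append("file too large")
    return problems
```

Returning every problem at once, rather than stopping at the first one, lets the application show the user a complete message before they pick another image.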
5.2 Activity Diagram
Fig 5.2 shows the activity diagram of lung cancer prediction. First, the user logs in or
signs up to the application using valid credentials. The user then reaches the dashboard of
our application and can select any tab (Doctors, Prediction, History, Account) as needed.
The main detection takes place in the Prediction tab.
5.3 Data Flow Diagram
In the level 0 DFD, the user gives the input to the system and the system
shows the result on the basis of the image.
Fig. 5.3.2 Data Flow Diagram LEVEL 1 for Lung Cancer Prediction
Fig 5.3.2 shows the level 1 data flow diagram of lung cancer prediction.
First, the user goes to the website or the application and logs in. For security
reasons, a 4-digit OTP is generated and sent to the phone number used for
login/signup.
After that, a browse option is available in the GUI of the application. The
user selects that option and chooses the document (in image format) from the
device memory. The image must be an X-ray of the lungs. The system then tries to
match the image against the images in the database and, based on that, reports
the accuracy of the match between the original and the reference image.
Based on the stage, it suggests the early precautions one should take.
6. Results and Applications
6.1 RESULTS
Fig 6.1.1 shows the welcome page of our application. This is the first page of
our application.
6.1.2 Login/Signup page
Fig. 6.1.2.1 shows the Login page of our application. Here existing patients or
users enter their correct credentials to gain access to the website/application.
Fig. 6.1.2.2 Sign Up page for Lung Cancer Prediction
Fig. 6.1.2.2 shows the Sign Up page. Here new patients or users can create their
own accounts using valid credentials of their choice.
6.1.3 Dashboard
Fig. 6.1.3 Dashboard. After successful login, the following dashboard appears on the screen.
6.1.4 Prediction
Fig 6.1.4 shows the Prediction page. The data is collected from the user as an
image file. This is where the actual prediction takes place.
6.2 DATABASE DESIGN
Fig 6.2 shows the database of our application. It consists of the identifier,
the provider, the sign-in date and the User UID.
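The four columns listed above can be pictured as one record per user. The Python sketch below is only an illustration of that shape: the class name, the field names and the sample values are assumptions, and the real data lives in Firebase rather than in Python objects.

```python
from dataclasses import asdict, dataclass
from datetime import date


@dataclass(frozen=True)
class UserRecord:
    """One row of the user table; field names mirror the columns in Fig 6.2."""
    identifier: str   # e.g. the email address used to sign in (assumed)
    provider: str     # sign-in method, e.g. "password" (assumed label)
    signed_in: date   # date of the sign-in
    user_uid: str     # unique ID assigned to the account


# Sample values are made up for the example
record = UserRecord("patient@example.com", "password", date(2023, 3, 1), "uid-123")
row = asdict(record)  # plain dictionary, convenient for display in a table
```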
6.3 APPLICATIONS
• It can be used to learn about the early precautions one should take.
• It can be used in remote places like villages where doctors cannot reach
out easily.
• It is a user-friendly app for patients.
• It suggests nearby doctors and treatment.
7. Conclusion and future scope
7.1 Conclusion
As we know, prevention is better than cure, and this is the main goal of our
project. Lung cancer is one of the most common cancers people face, even at a
very young age. The vast majority (85%) of cases of lung cancer are due to
long-term tobacco smoking. About 10–15% of cases occur in people who have never
smoked; these cases are often attributed to genetic factors together with
exposures such as air pollution. However, cancer cells in the lungs are very
difficult to detect at the early stages and are often detected only at an
advanced stage. Since our motive is to detect the cancer at an early stage, we
proposed this project.
7.2 Future Scope
Currently the system supports only lung cancer prediction, but it can be
scaled, and support can be added for other problems, such as cancer of the
heart. Patients with conditions other than lung cancer could then also use the
system, which would reach a larger number of users. The user can get information
about the early precautions one should take and about nearby doctors. The
application interface is responsive and user friendly. The system would be
constantly upgraded with new features and regularly tested for errors and bugs,
thus providing more accuracy and a less error-prone environment. ML and DL
capabilities are to be included, giving the computer the ability to understand,
analyze, manipulate and potentially generate results. CNN and image processing
technology can be used for analyzing X-ray images.
8. Appendix
9. References & Bibliography
9.1 References:
[1] F. Leena Vinmala, Dr. A. Kumar Kombaiya, “Prediction of
Lung Cancer using Data Mining Techniques”, International
Journal of Engineering Research & Technology (IJERT)
9.2 Bibliography
• Anuradha A. Puntambekar, Yogesh S. Gunjal, Narendra S. Joshi,
Yogesh B. Patil, “Software Engineering”, Technical Publication
• https://www.lucidchart.com/pages/landing
• Lung Cancer Prediction Using Deep Learning Framework (researchgate.net)
• https://www.researchgate.net/publication/325977576_Smart_health_band_using_IoT
• https://ieeexplore.ieee.org/document/8628067
Review Article
Abstract: Machine learning based lung cancer prediction models have been proposed to assist clinicians in
managing incidental or screen detected indeterminate pulmonary nodules. Such systems may be able to reduce
variability in nodule classification, improve decision making and ultimately reduce the number of benign nodules
that are needlessly followed or worked-up. In this article, we provide an overview of the main lung cancer
prediction approaches proposed to date and highlight some of their relative strengths and weaknesses. We discuss
some of the challenges in the development and validation of such techniques and outline the path to clinical
adoption.
Keywords: Pulmonary nodules; lung neoplasms; lung; machine learning; decision making
Submitted Apr 07, 2018. Accepted for publication May 22, 2018. doi: 10.21037/tlcr.2018.05.15
View this article at: http://dx.doi.org/10.21037/tlcr.2018.05.15
Translational Lung Cancer Research, Vol 7, No 3 June 2018 305
at baseline and of 2,858 Lung-RADS 3 and above screens after baseline 2,543 were false positives. Therefore,
while the adoption of Lung-RADS can reduce the total number of benign nodules being worked-up within a
screening programme, at a cost of just under 10% loss in sensitivity, there remain a very large number of
benign nodules being investigated, and the nodule classification task remains a challenging one.
One approach to address this problem is to adopt computer aided diagnosis (CADx) technology as an aid to
radiologists and pulmonary medicine physicians. Given an input CT and possible additional relevant patient
meta-data, such techniques aim to provide a quantitative output related to the lung cancer risk.
One may consider the goal of such systems to be two-fold. First, to reduce the variability in assessing and
reporting the lung cancer risk between interpreting physicians. Indeed, computer assisted approaches have
been shown to improve consistency between physicians in a variety of clinical contexts, including nodule
detection (4) and mammography screening (5) and one might expect such decision support tools could provide
the same benefit in nodule classification. Second, CADx could improve classification performance by
supporting the less experienced or non-specialised clinicians in assessing the risk of a particular nodule
being malignant.
In this article, we review progress made towards the development and validation of lung cancer prediction
models and nodule classification CADx software. While we do not intend this to be a comprehensive review,
we do aim to provide an overview of the main approaches taken to date and outline some of the challenges
that remain to bring this technology to routine clinical use.
Risk models
There have been a number of lung cancer risk models developed and validated that one may consider to be a
form of CADx tool (6-9). Typically based on logistic regression, such tools aim to provide an overall risk
of the patient having cancer based on patient meta-data such as age, sex and smoking history and nodule
characteristics such as nodule size, morphology and growth, if a previous CT was available. Although such
tools currently require manual entry by the user, they do produce an objective lung cancer risk score which
may be used in the decision-making process.
However, despite their attraction and good performance, their adoption and performance as part of decision
making has not been studied. The British Thoracic Society (BTS) guidelines on the management of
incidentally detected pulmonary nodules (10), recommends the use of the Brock model (6). Anecdotally, many
physicians report using them for patient communication only and feel that such models do not add a great
deal to their clinical expertise. More specifically, questions remain as to the utility of such models when
the patient population is different to that of the training data. It is clear, that for such models to be
clinically useful, knowledge of the training data used is critical, and this also will determine the
clinical scenarios in which they may be used. There are clearly significant differences in the pre-test
probabilities of a nodule being malignant in different patient groups. For instance, patients with a
current or prior history of malignancy are at significantly different risk of nodule malignancy than
non-smokers with no significant prior history.
From a technical perspective, such models have a number of limitations. Foremost is the reliance on human
interpretation of input variables such as nodule size, morphology and even the reliance on the patient’s
own estimate of factors such as smoking history. For example, under the Brock model, a 1 mm increase in the
reported size of a 5 mm spiculated solid nodule in a 50-year-old female almost doubles its risk, from 0.98%
to 1.89%. However, inter-radiologist variability in reporting nodule size is typically greater than this
(11). Moreover, inter-reader variability in reporting morphology and nodule type is common even amongst
experienced thoracic radiologists (12,13).
Some recent work to address this has been proposed by Ciompi et al. (14) where an automated system for the
classification on nodules into solid, non-solid, part-solid, calcified, perifissural and spiculated types
was proposed. Overall classification accuracy is reported to be within the inter-radiologist variability
at 79.5% but this varies between 86% for solid and calcified nodules down to 43% for spiculated nodule
classification. Of course, since the ground-truth classifications were provided by radiologist opinion, the
performance at validation cannot be expected to improve on that. As the authors point out, the nodule types
are radiologist developed concepts that, while useful for clinical purposes, lack a precise definition. The
impact of the system’s output as an input to the Brock model was not reported and ultimately this approach
should be judged
306 Kadir and Gleeson. Lung cancer prediction using machine learning
LungX winning entry
Figure 1 provides an overview of the main steps in the algorithm used in the winning entry. The software
has four main steps at test time, i.e., when used to classify a nodule: (I) nodule segmentation, (II)
texture feature extraction, (III) risk score regression and (IV) risk score thresholding.
The nodule segmentation is required because the subsequent step of feature extraction is applied to a
region of interest (ROI) around the nodule. Each nodule was segmented in a semi-automated thresholding
approach using a commercial software package (Mirada RTx, Mirada Medical Ltd.). The user first defined a
spherical ROI around each nodule and then applied a fixed threshold to the ROI. Next, the user could adjust
the threshold to improve the segmentation and finally, manual editing tools could be used to edit the
segmentation to remove any voxels that did not correspond to the nodule of interest that the segmentation
had included. Typically, adjacent vessels would need to be excluded in this manner. In later work, we
replaced the semi-automated method with a more automated technique that did not require any user
interaction other than to identify the centre and diameter of the nodule (21).
We extracted 15 texture features from two regions, the first inside the nodule segmentation and the second
in a surrounding region defined automatically. Based on earlier work using our internal databases, we found
that better performance could be achieved if the region inside the nodule was treated separately to the
immediate surrounding parenchyma. The insight here is that the texture of the nodule carries separate
information to the region in the nearby parenchyma and the very different ranges of Hounsfield units in
each region would make it difficult for one set of texture features to capture the patterns. We believe
this was a significant contribution to the performance of the system.
The 15 features were selected from a palette of over 1,300 classical texture features including Haralick
(17), Gabor (22), along with simple measures such as mean, standard deviation and volume. We utilized a
fully automated feature selection strategy that aimed to select a small subset of features that optimised
classification performance over an in-house training dataset. Since it is computationally infeasible to
test all combinations of the full palette of features, we utilized a sequential “greedy” algorithm that,
starting with the optimal pair of features found by exhaustive search over all pairs of features, selected
features one-by-one so as to maximise the performance over the training dataset at each step.
Finally, an SVM regression algorithm with a cubic kernel was trained using the libSVM library. The output
of this step is a number between 0 and 1 that reflects the likelihood that a particular nodule is
malignant.
The training dataset we utilized for the competition was mostly derived from the Lung Image Database
Consortium and Image Database Resource Initiative (LIDC-IDRI dataset) (23). This publicly available dataset
comprises a wide variety of nodules and comes with multiple segmentations and likelihood of malignancy
score estimated by expert clinicians. Nodules were included in our training set if at least three sets of
clinician-drawn contours and corresponding likelihood-of-malignancy scores were included in the XML
metadata. The malignancy scores are integers from one to five inclusive and are recorded per clinician.
Only nodules whose malignancy scores were all below 3 (the benign set) or all above 3 (the malignant set)
were included, yielding a labeled subset of 222 nodules overall for the LIDC-IDRI training set.
Figure 2 shows the Receiver Operating Characteristic (ROC) curves for the system as trained and tested on
the LIDC-IDRI dataset using 20-way cross-validation. With such high AUCs, we were suspicious that the
dataset was too easy to classify and so we trained and validated on a second dataset, PLAN, to examine the
system’s performance further. The PLAN nodule database was built up from nodules collected from the Oxford
University Hospitals NHS trust. This set consists of 709 nodules, 377 malignant and 328 benign, diagnosed
either using histology or by 2-year stable follow-up. Using 20-way cross-validation, the average AUC was
0.854; the ROC curves are shown on the right of Figure 2. In the end, the system we submitted used both
LIDC-IDRI and PLAN for feature selection but the SVM was trained only on the LIDC-IDRI dataset.
Figure 2 ROC curves for the LungX trained and tested on the LIDC-IDRI dataset (A) and the Oxford Data (B).
LIDC-IDRI, Lung Image Database Consortium and Image Database Resource Initiative; ROC, receiver operating
characteristic curve; AUC, area under the curve.
Convolutional neural networks and deep learning
Convolutional Neural Networks (CNN) trained using deep learning techniques have come to dominate pattern
detection, recognition, segmentation and classification applications in both medical and non-medical
fields. Indeed, where sufficient training data is available, CNNs have largely superseded the previous
generation of Radiomic/texture analysis methods described above. In our own work, once we had collected and
curated sufficiently large training sets by the end of 2016, our CNN based techniques started to outperform
the previous state-of-the-art texture and SVM based method. While a detailed exposition of such techniques
is beyond the scope of this article, it is worth understanding the main differences to previous methods and
their advantages.
Feature learning vs. feature selection
Unlike Radiomic/texture analysis approaches, CNN techniques build features from scratch rather than
selecting from a palette of engineered or pre-selected set that rely on the contextual knowledge of the
algorithm developer.
Hierarchical features
The first few layers of a CNN typically comprise several layers of features allowing the network to learn
the relationships between features in a much more sophisticated way than can be achieved with a single
feature extraction stage. Consider this illustrative example: a texture feature, such as local entropy of
the joint histogram, can be used to detect spiculations extending into the parenchyma. But a CNN can learn
this and also learn that spiculations encompass the whole perimeter of the nodule and that this is a sign
of a malignant nodule.
Translational Lung Cancer Research, Vol 7, No 3 June 2018 309
[Figure 3 diagram: Dataset A (640 size-matched nodules) and Dataset B (640 NLST nodules) are each used to train an SVM, giving Model A and Model B respectively.]
Figure 3 Investigating the role of nodule size within a machine learning model of nodule malignancy. Model A was trained on size-matched
data and model B was trained on unmatched data. SVM, support vector machine.
hence are typically at a late stage and their nodules are consequently larger than benign nodules. A machine learning algorithm trained on such data would perform very poorly when applied to, for example, a screening application where the distribution of malignant nodule sizes is more similar to benign ones.

We explored this issue further by comparing the performance of a CADx system trained on size-matched and size unmatched data (27). Figure 3 illustrates the experiment. Two datasets were created from the US NLST. The first (A), comprising 640 solid nodules, was built to remove size as a discriminatory factor between benign and malignant; all malignant solid nodules between 4 and 20 mm diameter were selected, and for each, a benign solid nodule was selected that most closely matched it in diameter. Any malignant nodule for which an equivalently sized benign could not be found within 0.8 mm was rejected. Sizes were measured using automated volumetric segmentation. The second dataset (B), also comprising 640 subjects, included all malignant nodules in A but benign nodules were randomly selected following the empirical size distribution of the whole NLST dataset. Therefore, nodule size cannot be a discriminative factor in A but would be in B. Two nodule classifiers were built using texture features combined with an SVM classifier; this was utilized here because the small datasets prevented the use of a CNN model.

The average AUC for the classifier trained on dataset A was 0.70 whereas using size alone on the same dataset gave an AUC of 0.50 as would be expected. The AUC was 0.91 for the classifier trained on dataset B. This indicates that the classifier can learn morphological features that can discriminate between benign and malignant nodules and, moreover, that such features add approximately 0.2 AUC points to using size-alone.

Coincidentally, the performance on size matched data was very close to that we achieved on the LungX competition data (AUC: 0.70 and 0.68) which was subsequently revealed to have also used size-matched data in the test set (20).

Conclusions

We have provided an overview of the main approaches used for nodule classification and lung cancer prediction from CT imaging data. In our experience, given sufficient training data, the current state-of-the-art is achieved using CNNs trained with Deep Learning achieving a classification performance in the region of low 90s AUC points. When evaluating system performance, it is important to be aware of the limitations or otherwise of the training and validation data sets used, i.e., were the patients smokers or non-smokers, or were patients with a current or prior history of malignancy included.

Given an apparent acceptable level of performance, the next stage is to test such CADx systems in a clinical setting but before this can be done, we must first define the way in which the output of the CADx should be utilized in clinical decision making. Who should use such a system and how should it be integrated into their decisions? Should the algorithm produce an absolute risk of malignancy and how should this be expressed; should it be incorporated into clinical opinion and how much weight should clinicians or patients lend to it. Should the algorithms be incorporated into or designed to fit current guidelines such as Lung-RADS or the BTS guidelines? If nodules are followed over time, should the algorithm incorporate changes in nodule
volume or should this be assessed separately? Is success defined by a reduction in the numbers of false positive scans defined as those needing further follow up or intervention, or by detecting all lung cancers and earlier than determined by following current guidelines? Who should be compared to the algorithm when determining its value? Should the comparison be experts or general radiologists, as it may be difficult to be significantly better than an expert but may be of substantial help to a generalist, and most scans are not interpreted by experts? Relatively little work has been done to address such questions.

Acknowledgments

The authors would like to thank the numerous research scientists and clinical staff involved in the project for their contributions: Sarim Ather, Djamal Boukerrouri, Amalia Cifor, Monica Enescu, Mark Gooding, William Hickes, Samia Hussain, Aymeric Larrue, Jean Lee, Heiko Peschl, Lyndsey Pickup, Shameema Stalin, Ambika Talwar, Eugene Teoh, Julien Willaime and Phil Whybra.
Funding: Part of this work was funded by Innovate UK project TSB 101676.

Footnote

Conflicts of Interest: T Kadir is CTO, Director and shareholder of Optellum Ltd. F Gleeson is a shareholder and advisor to Optellum Ltd.

References

1. National Lung Screening Trial Research Team, Aberle DR, Adams AM, et al. Reduced Lung-Cancer Mortality with Low-Dose Computed Tomographic Screening. N Engl J Med 2011;365:395-409.
2. Lung CT Screening Reporting & Data System. Available online: https://www.acr.org/Clinical-Resources/Reporting-and-Data-Systems/Lung-Rads
3. Pinsky PF, Gierada DS, Black W, et al. Performance of Lung-RADS in the National Lung Screening Trial: A Retrospective Assessment. Ann Intern Med 2015;162:485-91.
4. Awai K, Murao K, Ozawa A, et al. Pulmonary Nodules at Chest CT: Effect of Computer-aided Diagnosis on Radiologists Detection Performance. Radiology 2004;230:347-52.
5. Freer TW, Ulissey MJ. Screening Mammography with Computer-aided Detection: Prospective Study of 12,860 Patients in a Community Breast Center. Radiology 2001;220:781-6.
6. McWilliams A, Tammemagi MC, Mayo JR, et al. Probability of Cancer in Pulmonary Nodules Detected on First Screening CT. N Engl J Med 2013;369:910-9.
7. Gould MK, Ananth L, Barnett PG, et al. A Clinical Model To Estimate the Pretest Probability of Lung Cancer in Patients With Solitary Pulmonary Nodules. Chest 2007;131:383-8.
8. Swensen SJ, Silverstein MD, Ilstrup DM, et al. The probability of malignancy in solitary pulmonary nodules. Application to small radiologically indeterminate nodules. Arch Intern Med 1997;157:849-55.
9. Deppen SA, Blume JD, Aldrich MC, et al. Predicting lung cancer prior to surgical resection in patients with lung nodules. J Thorac Oncol 2014;9:1477-84.
10. Callister ME, Baldwin DR, Akram AR, et al. British Thoracic Society guidelines for the investigation and management of pulmonary nodules. Thorax 2015;70 Suppl 2:ii1-54.
11. Revel MP, Bissery A, Bienvenu M, et al. Are two-dimensional CT measurements of small noncalcified pulmonary nodules reliable? Radiology 2004;231:453-8.
12. Bartlett EC, Walsh SL, Hardavella G, et al. Interobserver Variation in Characterisation of Incidentally-Detected Pulmonary Nodules: An International, Multicenter Study. Available online: http://4wcti.org/2017/SS5-3.cgi
13. Zinovev D, Feigenbaum J, Furst J, et al. Probabilistic lung nodule classification with belief decision trees. Conf Proc IEEE Eng Med Biol Soc 2011;2011:4493-8.
14. Ciompi F, Chung K, van Riel SJ, et al. Towards automatic pulmonary nodule management in lung cancer screening with deep learning. Sci Rep 2017;7:46479.
15. Aerts HJ, Velazquez ER, Leijenaar RT, et al. Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun 2014;5:4006.
16. Lambin P, Rios-Velazquez E, Leijenaar R, et al. Extracting more information from medical images using advanced feature analysis. Eur J Cancer 2012;48:441-6.
17. Haralick RM, Shanmugam K, Dinstein I. Textural Features for Image Classification. IEEE Trans Syst Man Cybern Syst 1973;3:610-21.
18. Wilson R, Devaraj A. Radiomics of pulmonary nodules and lung cancer. Transl Lung Cancer Res 2017;6:86-91.
19. Chalkidou A, O'Doherty MJ, Marsden PK. False Discovery Rates in PET and CT Studies with Texture Features: A Systematic Review. PLoS One 2015;10:e0124165.
20. Armato SG 3rd, Drukker K, Li F, et al. LUNGx Challenge for computerized lung nodule classification. J Med Imaging (Bellingham) 2016;3:044506.
21. Willaime JM, Pickup L, Boukerroui D, et al. Impact of segmentation techniques on the performance of a CT texture-based lung nodule classification system. Available online: https://posterng.netkey.at/esr/viewing/index.php?module=viewing_poster&task=&pi=135229
22. Lee TS. Image Representation Using 2D Gabor Wavelets. IEEE Trans Pattern Anal Mach Intell 1996;18:1-13.
23. Armato SG 3rd, McLennan G, Bidaut L, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys 2011;38:915-31.
24. Li Y, Chen KZ, Sui XZ, et al. Establishment of a mathematical prediction model to evaluate the probability of malignancy or benign in patients with solitary pulmonary nodules. Beijing Da Xue Xue Bao Yi Xue Ban 2011;43:450-4.
25. Hammack D. Forecasting Lung Cancer Diagnoses with Deep Learning. Available online: https://raw.githubusercontent.com/dhammack/DSB2017/master/dsb_2017_daniel_hammack.pdf
26. Setio AA, Traverso A, de Bel T, et al. Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: The LUNA16 challenge. Med Image Anal 2017;42:1-13.
27. Pickup L, Declerck J, Munden R, et al. MA 14.13 Nodule Size Isn't Everything: Imaging Features Other Than Size Contribute to AI Based Risk Stratification of Solid Nodules. J Thorac Oncol 2017;12:S1860-1.
Lung cancer Prediction and Classification based on Correlation Selection method Using Machine Learning Techniques

1st Dakhaz Mustafa Abdullah
Technical College of Informatics, Akre
Duhok Polytechnic University
Duhok, Iraq

2nd Adnan Mohsin Abdulazeez
Presidency of Duhok Polytechnic University
Duhok Polytechnic University
Duhok, Iraq

3rd Amira Bibo Sallow
College of Engineering
Nawroz University
Duhok, Iraq
https://doi.org/10.48161/qaj.v1n2a58
Abstract—Lung cancer is one of the leading causes of mortality in every country, affecting both men and women. Lung cancer has a poor prognosis, resulting in a high death rate. Computing is becoming increasingly automated, and the medical industry is likewise automating with the aid of image recognition and data analytics. This paper examines the accuracy of three classifiers, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural Network (CNN), in classifying lung cancer at an early stage so that many lives can be saved. The data used in this study are taken from the UCI datasets for patients affected by lung cancer. The principal aim of this paper is to analyze the accuracy of the classification algorithms using the WEKA tool. The experimental results show that SVM gives the best result with 95.56%, followed by CNN with 92.11% and KNN with 88.40%.

Keywords— Lung Cancer, Machine Learning, SVM, KNN, CNN.

I. INTRODUCTION
Cancer can arise in several organs, sometimes simultaneously, and different types of cancer occur in different organs of the body. The illness may even go unnoticed for long periods of time. According to WHO reports, cancer may be prevented if it is detected early enough. The patient's life span will be extended if he or she receives an early diagnosis [1][2][3][4]. Lung cancer has a poor prognosis that differs greatly depending on tumor staging at the time of diagnosis. In clinical practice, lung cancer is divided into two types: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC) [5][6]. It is a malignant tumor characterized by unregulated cell tissue formation. Lung cancer develops mostly as a result of long-term tobacco use [7]. According to research, an individual may be affected by nineteen distinct forms of cancer. Lung cancer has the largest death rate among all of these tumors. This disease is expected to kill over 1.7 million people per year [8]. Research in the area of machine learning (ML) has already grown a great deal, which helps to reduce human labor. ML combines statistics and computing in the area of artificial intelligence to create algorithms that become more efficient when exposed to relevant data [9][10]. Many systems lack adequate detection accuracy, and further development is needed to approach the highest possible accuracy of 100%. Pulmonary cancer identification and classification have been based on machine learning techniques and image processing techniques [11]. In addition, some characteristics of lung cancer patients, such as their smoking rate, may aid in early detection of the disease [12][13][14]. Researchers started to use machine learning for medical diagnosis after the advent of artificial intelligence. Machine learning has been used, for example, to investigate the classification of diseases in traditional Chinese medicine (TCM) clinical data, and to provide guidelines for diagnosing brain disturbances through network architecture, function learning, and classification prediction [15]. It will be a key step towards improved early detection [16].

This paper provides an effective method to predict lung cancer at an early stage with a high accuracy ratio. The dataset used is taken from the UCI machine learning repository. Three classifiers, Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and Convolutional Neural Network (CNN), are then applied, and their accuracy ratios are examined using the WEKA tool. The present study aims to develop machine learning models that detect lung cancer with better accuracy.

This paper is organized as follows. Section 2 introduces lung cancer. Section 3 describes the materials and methods used in this paper. Related work is described in Section 4. Section 5 presents the theory: an introduction to machine learning and its types, and the confusion matrix. Section 6 presents the performance evaluation and results. Finally, Section 7 concludes the paper.

II. LUNG CANCER
Carcinogenesis is the unchecked proliferation of one or more cell types. Good tissues do not support the growth of
This is an open access article distributed under the Creative Commons Attribution License
normal cells, and when they do, they separate quickly and become tumors. Primary lung cancer originates in the lungs, while secondary lung cancer starts elsewhere in the body and then spreads to the lungs. It is one of the most aggressive types of cancer and a life-threatening danger to the human body [17]. If this unchecked development can be identified correctly at an early point, it can help to avoid unnecessary surgery and improve the chance of recovery. Chronic Obstructive Pulmonary Disease (COPD) attacks the areas of the lungs and causes diseases such as measles, influenza, pneumonia, and other respiratory issues such as asthma. Small Cell Lung Cancer (SCLC), or oat cell cancer, and Non-Small Cell Lung Cancer (NSCLC) are the two main forms of lung cancer; they develop and spread in different ways and are treated accordingly. Within the non-small cell lung cancer category, there are three subtypes (adenocarcinomas, squamous cell carcinomas, large cell carcinomas). Fig. 1 shows the two types of lung cancer. Mixed small cell/large cell cancer is a disease that occurs where a patient shows symptoms of both types of cancer. Adenocarcinoma (NSCLC) is more common and progresses more slowly than small cell lung cancer. Small cell lung cancer is linked to smoking and progresses more rapidly, becoming a large tumor that can spread across the body [18][19].

Fig. 1. Lung Cancer Types

III. MACHINE LEARNING
Machine learning is a subfield of Artificial Intelligence [20]. Machine learning is also used for complex data classification and decision making [21][22]. Machine learning gives systems the ability to learn automatically and improve over time without being explicitly programmed. The implementation of algorithms aids the computer in learning and making the required decisions [11][23]. Machine learning strategies are broadly divided into three categories:

a. Supervised learning
Machine learning, in its most simple form, employs programmed algorithms that learn and refine their functions by processing input data and making predictions within a reasonable range. These algorithms aim to predict more precisely as they are fed fresh data [24][25]. Machine learning algorithms can be grouped in several ways, but two categories of problems, classification problems and regression problems, are well suited to supervised learning algorithms. In classification, the output variable usually takes on a limited number of discrete values [26][27].

b. Un-Supervised Learning
Unsupervised Learning is a form of learning that occurs without the presence of a supervisor [28]. The machine is given some sample inputs, but no target outputs are provided during learning. Since there is no target value, categorization is used so that the algorithm distinguishes between groups in the data correctly. It is the problem of finding hidden structure in unlabeled data [29][30]. Because no labeled examples are given to the learner, there is no error or reward signal with which to evaluate a candidate solution. Unlike supervised learning and reinforcement learning, unsupervised learning has no teacher, and produces results that are unrelated to prior experience. It is closely connected to density estimation in statistics [21][31].

c. Reinforcement Learning
Reinforcement learning is a style of machine learning in which an agent learns by interacting with its surroundings. A reinforcement learning agent learns from the consequences of its actions rather than from explicitly stated instructions, and chooses its actions based on previous behavior and on new choices. Since specific input/output pairs are not provided, this differs from traditional supervised learning. Instead, the focus is on performance, which entails striking a balance between exploration (of uncharted territory) and exploitation (of existing knowledge) [32][33].

IV. DEEP LEARNING
Deep learning is a type of machine learning technique that uses representation learning to identify important features for classification problems [6]. A key characteristic of deep learning is how it handles features: it learns them from data, and it builds complex features by combining the simpler features it has already learned. Deep learning is accomplished using multiple-layer artificial neural networks, such as the Deep Neural Network (DNN), Convolutional Neural Network (CNN), and the Recurrent Neural Network (RNN) [13][23].

V. RELATED WORKS
Roy et al. [34] use a combination of image processing, biomedical techniques and knowledge discovery in data to improve accuracy and assess precise significance for early detection of lung carcinoma. The lung images acquired from CT (Computed Tomography) scans are pre-processed, and the Region of Interest (ROI) is segmented. The Random Forest procedure is used to distinguish the distinct features. Using an SVM classifier, the SURF (Speeded Up Robust Features) algorithm was used to extract features like entropy, correlation, power, and variance from saliency-enhanced images. The image's classification determines whether it is benign or malignant (carcinomic). CT scan images were used as the dataset. The SVM classification and random forest algorithm were used to carry out the whole operation. The best outcome is achieved using SVM classification. This technique achieves 94.5 percent overall accuracy, 74.2 percent sensitivity, 66.3 percent recall, and 77.6 percent specificity.

For lung cancer diagnosis, Faisal et al. [12] recommend evaluating machine learning classifiers. Classifiers such as Multilayer Perceptron (MLP), Naive Bayes, Decision
Tree, Neural Network, Gradient Boosted Tree, and SVM are evaluated. The dataset was downloaded from the UCI repository and is used to analyze random forest and majority voting-based ensembles for predicting lung cancer. Gradient Boosted Tree was found to outperform all other individual and ensemble classifiers, achieving 90% precision according to performance assessments.

Baskar et al. [35] propose machine learning methods that use Delta Radiomics to extract the characteristics of the cancer nodules. Lung cancer nodule malignancy is predicted using the Support Vector Machine (SVM). The SVM can examine compact features in a lung cancer nodule image, and image classification is useful in distinguishing between the multiple nodules. As a result, SVM is recommended as the best tool for diagnosing and detecting lung cancer, with a 90.9 percent accuracy rate.

Boban et al. [36] apply ML algorithms, including the Multilayer Perceptron (MLP), KNN and SVM classifiers, to 400 lung disease images (CT scans). After segmentation and feature extraction, the accuracy of the classifiers is compared. A CT scan image passed to a classifier contains irrelevant content, so the Gray Level Co-occurrence Matrix (GLCM) is used to pick the most important features (i.e., for feature extraction). The classification accuracy is 98% for MLP, 70.45% for SVM, and 99.2% for KNN.

Using Deep Learning, Sreekumar et al. [37] proposed a method for detecting malignant pulmonary nodules from CT scans. A preprocessing pipeline was used to mask out the lung areas from the scans. A 3D CNN model based on the C3D network architecture was used to extract the features. To reduce false positives, the researchers used the Lung Image Database Consortium (LIDC-IDRI) as well as some material from the LUNA16 grand challenge. The end result is a model that predicts the coordinates of malignant pulmonary nodules and demarcates the associated areas in CT scans. For identifying malignant lung nodules and estimating their malignancy scores, the final model had a sensitivity of 86 percent.

Banerjee et al. [38] suggested a framework for tumor classification, with ANN, Random Forests, and SVM as machine learning algorithms. Artificial neural networks are more accurate for both region-based and texture-based features. When the precision is compared to the proposed model, it can be shown that accuracy has improved while recall has decreased. MATLAB R2017a was used for digital image analysis, and a Jupyter notebook was used for machine learning classification. The accuracies for region-based features were 79 percent for Random Forest, 86 percent for SVM, and 92 percent for ANN, while the accuracies for texture-based features were 70 percent for Random Forest, 80 percent for SVM, and 96 percent for ANN.

Reddy et al. [40] propose a model that is successful in detecting the stages of lung cancer using machine learning algorithms. The model combines K-NN, Decision Trees, and Neural Networks with the bagging ensemble approach to improve overall prediction accuracy. Compared with individual algorithms, the proposed model's estimated outcomes are more accurate. The versions with and without bagging are compared to draw conclusions. The bootstrap aggregating methodology improves the individual models' performance, with accuracy scores of 97% (Decision Tree), 94%, and 96% (K-NN) respectively (Neural Networks). The integrated model has a score of 0.98 for accuracy. The precision of the integrated model is increased by 3.33 percent.

Günaydin et al. [41] proposed machine learning methods for detecting lung cancer nodules that used Principal Component Analysis, K-NN, SVM, Naive Bayes, Decision Trees, and ANN to detect anomalies. The approaches were then compared with and without preprocessing. The experimental findings indicate that Artificial Neural Networks produce the best results with 82.43 percent accuracy after image processing, while Decision Tree produces the best results with 93.24 percent accuracy without image processing. The Standard Digital Image Database of the Japanese Society of Radiological Technology (JSRT), CT (computed tomography), was used as the dataset.

Early identification of lung nodules from low dose computed tomography (LDCT) images was suggested by Elnakib et al. [42]. Initially, the proposed system processes the raw data in order to enhance the contrast of the low-dose images. The compact deep learning features of various architectures, including AlexNet, VGG16 and VGG19 networks, are then explored. A genetic algorithm (GA) is trained to identify the most important early detection features, optimizing the derived collection of features. In order to reliably diagnose lung nodules, various forms of classifiers are then tested. The method is validated using 320 images from 50 separate subjects of the International Early Lung Cancer Action Project (I-ELCAP). With VGG19 and SVM classification, the suggested system achieves the highest detection accuracy of 96.25 percent, with 97.5 percent sensitivity and 95 percent specificity.

VI. MATERIAL AND METHODS

A. Dataset

The data used in this work is a lung cancer dataset that was first released in and later made available in the UCI machine learning repository under the name "Lung Cancer Data set". This dataset was used to show the capability of the optimum discriminant plane in ill-posed situations. This dataset contains data on the pathological forms of lung cancer. It contains 32 observations on three forms of lung cancer using 56 attributes [43].
1. Support Vector Machine
Support Vector Machine is a supervised learning algorithm that uses classification to analyze data and predict patterns. The SVM classifier divides the texture into two categories or classes: normal and abnormal images [44]. It is used to effectively map the nodule. SVM is a margin classifier that finds a hyperplane separating the two classes, which is why it is often referred to as a non-probabilistic binary classifier. The support vectors are the training data points nearest to the separating hyperplane, and the Support Vector Machine is the maximum-margin classifier: the gap between the nearest training points (here, the cancer nodules) and the hyperplane is as wide as possible [45][46].
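The margin idea described above can be illustrated with a small sketch. It does not train an SVM and is not the WEKA implementation used in the paper; it only evaluates a hyperplane that is assumed to be given. The toy points, the weights `w`, the offset `b`, and all function names are hypothetical. For this particular toy data the closest opposite-class points are 2 units apart, so the line x1 = 2 with margin 1 is the widest possible separator.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def distance_to_hyperplane(w, b, x):
    """Geometric distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(decision_value(w, b, x)) / norm

def margin_and_support_vectors(w, b, points, tol=1e-9):
    """Return the smallest distance (the margin) and the points that attain it."""
    distances = [distance_to_hyperplane(w, b, x) for x in points]
    margin = min(distances)
    support = [x for x, d in zip(points, distances) if d - margin <= tol]
    return margin, support

# Made-up 2-D samples: the first two belong to one class, the last two to the other.
points = [(0.0, 1.0), (1.0, 0.0), (3.0, 0.0), (4.0, 1.0)]
w, b = (1.0, 0.0), -2.0  # assumed hyperplane x1 - 2 = 0, not learned here

labels = [1 if decision_value(w, b, x) > 0 else -1 for x in points]
margin, support = margin_and_support_vectors(w, b, points)
print(labels, margin, support)
```

Only the two points nearest the hyperplane determine the margin; moving the other two further away would not change the result, which is why they are not support vectors.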
2. K-Nearest Neighbor Classifier
The KNN algorithm is a supervised classification method. It is a simple algorithm that looks for the nearest match: the test sample is compared against the samples in the database (the training set). The test sample's label is determined by the closest match among the k nearest neighbors. To calculate the distances between test samples and database samples, various distance measures such as Euclidean, cosine similarity, and city block are used [47].
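A minimal sketch of the nearest-neighbor vote described above, written in plain Python rather than with the WEKA classifier used in the paper. The training samples, labels, and function names are made up for illustration, and only the Euclidean and city block distances are shown.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City block (Manhattan) distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, query, k=3, distance=euclidean):
    """train is a list of (features, label) pairs.
    Returns the majority label among the k training samples nearest to query."""
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature samples, not taken from the UCI dataset.
train = [
    ((1.0, 1.0), "benign"), ((1.2, 0.8), "benign"), ((0.9, 1.1), "benign"),
    ((3.0, 3.2), "malignant"), ((3.1, 2.9), "malignant"), ((2.8, 3.0), "malignant"),
]

pred_a = knn_predict(train, (1.1, 1.0), k=3)
pred_b = knn_predict(train, (3.0, 3.0), k=3, distance=city_block)
print(pred_a, pred_b)
```

An odd k is a common choice for two-class problems because it avoids tied votes.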
3. Convolutional Neural Network
CNN as a supervised deep learning tool, CNN is an
Fig. 2. Block Diagram for Proposed method
excellent choice. This algorithm is suitable for multi-class
classification and binary classification (for example, To define the lung cancer dataset in this article, Weka
predicting whether or not a diagnostic picture contains a malignant tumor) [48][49]. CNNs are often used to solve a wide range of pattern and image recognition problems. This deep learning approach is effective and appropriate for visual data because of three key characteristics. First, local receptive fields are well matched to image data, which is spatially correlated at the local level but uncorrelated globally. Second, since the convolution is applied to the entire image, shared weights allow for significant parameter reduction without affecting image processing. Finally, a grid-structured image allows for pooling operations that reduce data complexity without sacrificing valuable information [50][51].

C. Proposed Model

The paper suggests a model to predict and classify the lung cancer classes. The proposed model consists of data preprocessing, feature selection, classification and evaluation; Figure (2) shows the block diagram of the proposed work.

classifiers were used. WEKA was developed by a team of researchers at the University of Waikato in New Zealand [52]. It is a Java-based open-source platform for data mining and machine learning tasks such as data preprocessing, classification, clustering, and association rule extraction, among others. WEKA is a popular choice among analysts because of its ease of use and open-source nature [53].

In this paper, the Correlation Attribute (CA) method was used for feature selection. CA is a feature subset selection algorithm [1]. It evaluates an attribute by calculating the correlation (Pearson's product-moment correlation) between it and the class [4]. The main objective of CA is to obtain a highly relevant subset of features that are uncorrelated with each other. In this way, the dimensionality of datasets can be drastically reduced and the performance of learning algorithms can be improved [2]. The Ranker search method was used with CA. Based on the correlation values, features are ranked, and those most suitable for the machine learning algorithm are retained [3].
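The correlation-based ranking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the WEKA implementation used in the paper: the attribute names, the toy values, the numerically coded class, and the helper names `pearson` and `rank_attributes` are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_attributes(columns, target, top_k):
    """Rank attributes by |correlation with the class| and keep the top_k."""
    scored = [(abs(pearson(values, target)), name)
              for name, values in columns.items()]
    scored.sort(reverse=True)  # highest correlation first
    return [name for _, name in scored[:top_k]]


# Hypothetical toy data: three attributes and a numerically coded class.
columns = {
    "smoking": [1, 2, 3, 4, 5, 6],
    "fatigue": [2, 1, 4, 3, 6, 5],
    "shoe_size": [7, 7, 8, 7, 8, 7],
}
target = [1, 2, 3, 4, 5, 6]

selected = rank_attributes(columns, target, top_k=2)
print(selected)
```

In this toy setting the weakly correlated attribute is dropped and the two attributes that track the class most closely are kept, which mirrors the rank-then-filter step attributed to the Ranker search method.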
A. Confusion Matrix
The confusion matrix is a visual assessment method used in machine learning. It is a square matrix in which the rows represent the real class of the instances and the columns represent their predicted class [54]. The matrix holds all the raw data about a classification model's predictions on a specified data set and is used to determine how accurate the model is. For a binary classification task, the confusion matrix is a 2 x 2 matrix that reports the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
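As a minimal sketch, the four counts of a binary confusion matrix can be tallied from actual and predicted labels and then turned into summary metrics. The labels below are made-up toy data, the function names are hypothetical, and the precision and F-measure formulas are the standard textbook definitions rather than formulas taken from this report.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for a binary problem."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # positive item correctly labeled positive
        elif a == positive:
            fn += 1  # positive item missed
        elif p == positive:
            fp += 1  # negative item wrongly labeled positive
        else:
            tn += 1  # negative item correctly labeled negative
    return tp, fn, fp, tn


def precision_recall_f(tp, fn, fp):
    """Standard precision, recall and F-measure from the counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical labels: 1 = malignant, 0 = benign.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, fn, fp, tn = confusion_counts(actual, predicted)
precision, recall, f_measure = precision_recall_f(tp, fn, fp)
print(tp, fn, fp, tn)
print(precision, recall, f_measure)
```

The counts are returned in the order TP, FN, FP, TN so that they read row by row when the rows hold the actual class and the columns hold the predicted class.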
The binary confusion matrix has the following layout (rows: real class, columns: predicted class):

TP FN
FP TN

Precision, recall, and F-measure, which are commonly used in the text mining and machine learning communities, were used to evaluate the algorithms. The four types of classified items are true positives (TP: items correctly labeled as belonging to the class), false positives (FP: items falsely labeled as belonging to the class), false negatives (FN: items incorrectly labeled as not belonging to the class), and true negatives (TN: items correctly labeled as not belonging to the class). Recall is determined from the number of true positives and false negatives using the following formula [55][56]:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

B. ROC curve

The area under the ROC curve (AUC) summarizes the relationship between a binary classifier's true positive rate and false positive rate across decision thresholds. Several authors have shown that AUC is superior to absolute accuracy for classifier assessment, which makes it one of the most common metrics for static imbalanced data. To measure AUC, however, one must sort a given dataset and iterate through each example [57]. This means that AUC cannot be computed directly on vast data streams, since doing so would require scanning the whole stream after each example. As a result, the use of AUC for data streams has been restricted to estimates on periodic holdout sets or whole streams, rendering it either inherently skewed or computationally infeasible for realistic implementations [58].

VIII. PERFORMANCE EVALUATION AND RESULTS

The confusion matrix was used to evaluate the accuracy of each classifier. The experimental results show that using five attributes with an SVM classifier produces the best prediction ratio, 95.56 percent, while the CNN accuracy ratio is 92.11 percent. KNN has the lowest prediction percentage, 88.40 percent, as shown in Tables 2, 3, and 4. Table (2) shows and analyzes the results of using the SVM algorithm, and Table (3) shows and analyzes the results of using the CNN algorithm.

TABLE 1. USING SVM CLASSIFIER

Class | TP Rate | FP Rate | Precision | Recall | F-Measure | ROC Area
1 | 0.986 | 0.109 | 0.951 | 0.986 | 0.968 | 0.946
2 | 0.882 | 0.005 | 0.938 | 0.882 | 0.909 | 0.984
3 | 0.667 | 0.000 | 1.000 | 0.667 | 0.800 | 0.968
4 | 0.857 | 0.005 | 0.947 | 0.857 | 0.900 | 0.944
5 | 1.000 | 0.000 | 1.000 | 1.000 | 1.000 | 1.000
Avg | 0.956 | 0.076 | 0.956 | 0.956 | 0.954 | 0.955

TABLE 2. USING K-NEAREST NEIGHBOR CLASSIFIER

Table (4) shows the comparison between the three classifiers in terms of the time taken to build the model and the accuracy of the classifier.

TABLE 4. COMPARISON OF RESULT BY TIME AND ACCURACY

Classifier | Time taken to build model | Accuracy
SVM | 1.77 sec. | 95.56
KNN | 0.01 sec. | 89.65
CNN | 3.79 sec. | 92.11

Figure (3) shows that the CNN algorithm takes the longest time to build its model, while the KNN algorithm has the shortest time.
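The ROC curve subsection above notes that measuring AUC requires sorting the dataset and iterating through each example. A minimal sketch of that sort-and-iterate computation follows, using the rank-sum formulation of AUC with average ranks for tied scores. The function name, the toy labels and the toy scores are assumptions made for illustration; this is not the routine that produced the ROC Area values reported in this section.

```python
def auc_by_ranking(labels, scores):
    """AUC from one sort plus one pass: rank-sum formulation, ties averaged."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative example")

    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Hypothetical classifier scores for four examples (1 = positive class).
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = auc_by_ranking(labels, scores)
print(auc)
```

Because the whole score list has to be sorted before the rank sum can be formed, adding one new example means repeating the work over all examples seen so far, which is the cost the text describes for data streams.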
used the JSRT dataset and several classifiers to obtain experimental results (ANN 82.43% and Decision Tree 93.24%). Finally, the researchers in [21] used the LDCT dataset and a smart genetic algorithm with SVM to obtain 96.25% accuracy.
TABLE 5. COMPARISON OF RELATED WORK
Elnakib et al. [42]: LDCT images, smart genetic algorithm, SVM, 96.25%
[9] B. Charbuty and A. Abdulazeez, “Classification Based on Decision Tree Algorithm for Machine Learning,” J. Appl. Sci. Technol. Trends, vol. 2, no. 01, pp. 20–28, 2021.
[10] H. A. Hussein and A. M. Abdulazeez, “COVID-19 PANDEMIC DATASETS BASED ON MACHINE LEARNING CLUSTERING ALGORITHMS: A REVIEW,” PalArch’s J. Archaeol. Egypt/Egyptology, vol. 18, no. 4, pp. 2672–2700, 2021.
[11] D. M. Abdullah and N. S. Ahmed, “A Review of most Recent Lung Cancer Detection Techniques using Machine Learning,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 159–173, 2021.
[12] M. I. Faisal, S. Bashir, Z. S. Khan, and F. H. Khan, “An evaluation of machine learning classifiers and ensembles for early stage prediction of lung cancer,” in 2018 3rd International Conference on Emerging Trends in Engineering, Sciences and Technology (ICEEST), 2018, pp. 1–4.
[13] D. Q. Zeebaree, H. Haron, and A. M. Abdulazeez, “Gene selection and classification of microarray data using convolutional neural network,” in 2018 International Conference on Advanced Science and Engineering (ICOASE), 2018, pp. 145–150.
[14] D. Q. Zeebaree, H. Haron, A. M. Abdulazeez, and D. A. Zebari, “Trainable model based on new uniform LBP feature to identify the risk of the breast cancer,” in 2019 International Conference on Advanced Science and Engineering (ICOASE), 2019, pp. 106–111.
[15] H. Tang, J. Zhao, and X. Yang, “Explore machine learning for analysis and prediction of lung cancer related risk factors,” in Proceedings of the 2018 2nd International Conference on Computer Science and Artificial Intelligence, 2018, pp. 41–45.
[16] P. R. Radhika, R. A. S. Nair, and G. Veena, “A Comparative Study of Lung Cancer Detection using Machine Learning Algorithms,” in 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2019, pp. 1–4.
[17] A. I. Rahmani and M. Katouli, “Diagnosing Lung Cancer Using Grasshopper Optimization Algorithm and k-Nearest Neighbor Classification,” J. homepage http//iieta.org/journals/rces, vol. 6, no. 4, pp. 69–75, 2019.
[18] Y. Nai et al., “Improving Lung Lesion Detection in Low Dose Positron Emission Tomography Images Using Machine Learning,” in 2018 IEEE Nuclear Science Symposium and Medical Imaging Conference Proceedings (NSS/MIC), 2018, pp. 1–3.
[19] S. Senthil and B. Ayshwarya, “Lung cancer prediction using feed forward back propagation neural networks with optimal features,” Int. J. Appl. Eng. Res., vol. 13, no. 1, pp. 318–325, 2018.
[20] M. R. Mahmood, A. M. Abdulazeez, and Z. ORMAN, “A NEW HAND GESTURE RECOGNITION SYSTEM USING ARTIFICIAL NEURAL NETWORK.”
[21] M. Somvanshi, P. Chavan, S. Tambade, and S. V. Shinde, “A review of machine learning techniques using decision tree and support vector machine,” Proc. - 2nd Int. Conf. Comput. Commun. Control Autom. ICCUBEA 2016, 2017, doi: 10.1109/ICCUBEA.2016.7860040.
[22] D. M. Abdulqader, A. M. Abdulazeez, and D. Q. Zeebaree, “Machine Learning Supervised Algorithms of Gene Selection: A Review,” Mach. Learn., vol. 62, no. 03, 2020.
[23] O. Ahmed and A. Brifcani, “Gene Expression Classification Based on Deep Learning,” in 2019 4th Scientific International Conference Najaf (SICN), 2019, pp. 145–149.
[24] N. O. M. Salim and A. M. Abdulazeez, “Human Diseases Detection Based On Machine Learning Algorithms: A Review,” Int. J. Sci. Bus., vol. 5, no. 2, pp. 102–113, 2021.
[25] N. M. Abdulkareem and A. M. Abdulazeez, “Machine Learning Classification Based on Radom Forest Algorithm: A Review,” Int. J. Sci. Bus., vol. 5, no. 2, pp. 128–142, 2021.
[26] R. Sathishkumar, K. Kalaiarasan, A. Prabhakaran, and M. Aravind, “Detection of Lung Cancer using SVM Classifier and KNN Algorithm,” in 2019 IEEE International Conference on System, Computation, Automation and Networking (ICSCAN), 2019, pp. 1–7.
[27] S. Uddin, A. Khan, M. E. Hossain, and M. A. Moni, “Comparing different supervised machine learning algorithms for disease prediction,” BMC Med. Inform. Decis. Mak., vol. 19, no. 1, pp. 1–16, 2019.
[28] N. Najat and A. M. Abdulazeez, “Gene clustering with partition around mediods algorithm based on weighted and normalized Mahalanobis distance,” in 2017 International Conference on Intelligent Informatics and Biomedical Sciences (ICIIBMS), 2017, pp. 140–145.
[29] S. Hussein, P. Kandel, C. W. Bolan, M. B. Wallace, and U. Bagci, “Lung and pancreatic tumor characterization in the deep learning era: novel supervised and unsupervised learning approaches,” IEEE Trans. Med. Imaging, vol. 38, no. 8, pp. 1777–1787, 2019.
[30] B. M. S. Hasan and A. M. Abdulazeez, “A Review of Principal Component Analysis Algorithm for Dimensionality Reduction,” J. Soft Comput. Data Min., vol. 2, no. 1, pp. 20–30, 2021.
[31] D. M. Sulaiman, A. M. Abdulazeez, H. Haron, and S. S. Sadiq, “Unsupervised Learning Approach-Based New Optimization K-Means Clustering for Finger Vein Image Localization,” in 2019 International Conference on Advanced Science and Engineering (ICOASE), 2019, pp. 82–87.
[32] H. U. Dike, Y. Zhou, K. K. Deveerasetty, and Q. Wu, “Unsupervised learning based on artificial neural network: A review,” in 2018 IEEE International Conference on Cyborg and Bionic Systems (CBS), 2018, pp. 322–327.
[33] H. R. Abdulqadir and A. M. Abdulazeez, “Reinforcement Learning and Modeling Techniques: A Review,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 174–189, 2021.
[34] K. Roy et al., “A Comparative study of Lung Cancer detection using supervised neural network,” in 2019 International Conference on Opto-Electronics and Applied Optics (Optronix), 2019, pp. 1–5.
[35] S. Baskar, P. M. Shakeel, K. P. Sridhar, and R. Kanimozhi, “Classification System for Lung Cancer Nodule Using Machine Learning Technique and CT Images,” in 2019 International Conference on Communication and Electronics Systems (ICCES), 2019, pp. 1957–1962.
[36] B. M. Boban and R. K. Megalingam, “Lung Diseases Classification based on Machine Learning Algorithms and Performance Evaluation,” in 2020 International Conference on Communication and Signal Processing (ICCSP), 2020, pp. 315–320.
[37] A. Sreekumar, K. R. Nair, S. Sudheer, H. G. Nayar, and J. J. Nair, “Malignant Lung Nodule Detection using Deep Learning,” in 2020 International Conference on Communication and Signal Processing (ICCSP), 2020, pp. 209–212.
[38] N. Banerjee and S. Das, “Prediction Lung Cancer–In Machine Learning Perspective,” in 2020 International Conference on Computer Science, Engineering and Applications (ICCSEA), 2020, pp. 1–5.
[39] N. Maleki, Y. Zeinali, and S. T. A. Niaki, “A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection,” Expert Syst. Appl., vol. 164, p. 113981, 2021.
[40] D. Reddy, E. N. H. Kumar, D. Reddy, and P. Monika, “Integrated Machine Learning Model for Prediction of Lung Cancer Stages from Textual data using Ensemble Method,” in 2019 1st International Conference on Advances in Information Technology (ICAIT), 2019, pp. 353–357.
[41] Ö. Günaydin, M. Günay, and Ö. Şengel, “Comparison of lung cancer detection algorithms,” in 2019 Scientific Meeting on Electrical-Electronics & Biomedical Engineering and Computer Science (EBBT), 2019, pp. 1–4.
[42] A. Elnakib, H. M. Amer, and F. E. Z. Abou-Chadi, “Early Lung Cancer Detection Using Deep Learning Optimization,” 2020.
[43] S. M. Salaken, A. Khosravi, A. Khatami, S. Nahavandi, and M. A. Hosen, “Lung cancer classification using deep learned features on low population dataset,” in 2017 IEEE 30th Canadian Conference on Electrical and Computer Engineering (CCECE), 2017, pp. 1–5.
[44] A. Asuntha and A. Srinivasan, “Deep learning for lung Cancer detection and classification,” Multimed. Tools Appl., vol. 79, no. 11, pp. 7731–7762, 2020.
[45] W. Rahane, H. Dalvi, Y. Magar, A. Kalane, and S. Jondhale, “Lung cancer detection using image processing and machine learning healthcare,” in 2018 International Conference on Current Trends towards Converging Technologies (ICCTCT), 2018, pp. 1–5.
[46] H. S. Yahia and A. M. Abdulazeez, “Medical Text Classification Based on Convolutional Neural Network: A Review,” Int. J. Sci. Bus., vol. 5, no. 3, pp. 27–41, 2021.
[47] S. Potghan, R. Rajamenakshi, and A. Bhise, “Multi-Layer Perceptron Based Lung Tumor Classification,” in 2018 Second International Conference on Electronics, Communication and Aerospace Technology (ICECA), 2018, pp. 499–502.
[48] S. S. Raoof, M. A. Jabbar, and S. A. Fathima, “Lung Cancer Prediction using Machine Learning: A Comprehensive Approach,” in 2020 2nd International Conference on Innovative Mechanisms for Industry Applications (ICIMIA), 2020, pp. 108–115.
[49] J. Saeed and A. M. Abdulazeez, “Facial Beauty Prediction and Analysis Based on Deep Convolutional Neural Network: A Review,” J. Soft Comput. Data Min., vol. 2, no. 1, pp. 1–12, 2021.
[50] Y. Lei, B. Yang, X. Jiang, F. Jia, N. Li, and A. K. Nandi, “Applications of machine learning to machine fault diagnosis: A review and roadmap,” Mech. Syst. Signal Process., vol. 138, p. 106587, 2020.
[51] N. Omar, A. M. Abdulazeez, A. Sengur, and S. G. S. Al-Ali, “Fused faster RCNNs for efficient detection of the license plates,” Indones. J. Electr. Eng. Comput. Sci., vol. 19, no. 2, pp. 974–982, 2020.
[52] Z. Zainudin, S. M. Shamsuddin, and S. Hasan, “Deep Learning for Image Processing in WEKA Environment,” Int. J. Adv. Soft Compu. Appl, vol. 11, no. 1, 2019.
[53] V. Mhetre and M. Nagar, “Classification based data mining algorithms to predict slow, average and fast learners in educational system using WEKA,” in 2017 International Conference on Computing Methodologies and Communication (ICCMC), 2017, pp. 475–479.
[54] K. B. Al Janabi and R. Kadhim, “Data reduction techniques: a comparative study for attribute selection methods,” International Journal of Advanced Computer Science and Technology, vol. 8, no. 1, pp. 1–13, 2018.
[55] Y. Sugianela and T. Ahmad, “Pearson Correlation Attribute Evaluation-based Feature Selection for Intrusion Detection System,” in 2020 International Conference on Smart Technology and Applications (ICoSTA), 2020, pp. 1–5.
[56] G. B. Demisse, T. Tadesse, and Y. Bayissa, “Data mining attribute selection approach for drought modeling: A case study for Greater Horn of Africa,” arXiv preprint arXiv:1708.05072, 2017.
[57] S. Kumar and I. Chong, “Correlation analysis to identify the effective data in machine learning: Prediction of depressive disorder and emotion states,” International Journal of Environmental Research and Public Health, vol. 15, no. 12, p. 2907, 2018.
[58] O. Caelen, “A Bayesian interpretation of the confusion matrix,” Ann. Math. Artif. Intell., vol. 81, no. 3, pp. 429–450, 2017.
[59] N. Milosevic, A. Dehghantanha, and K.-K. R. Choo, “Machine learning aided Android malware classification,” Comput. Electr. Eng., vol. 61, pp. 266–274, 2017.
[60] J. Xu, Y. Zhang, and D. Miao, “Three-way confusion matrix for classification: A measure driven view,” Inf. Sci. (Ny)., vol. 507, pp. 772–794, 2020.
[61] Z. Yang, T. Zhang, J. Lu, D. Zhang, and D. Kalui, “Optimizing area under the ROC curve via extreme learning machines,” Knowledge-Based Syst., vol. 130, pp. 74–89, 2017.
[62] D. Brzezinski and J. Stefanowski, “Prequential AUC: properties of the area under the ROC curve for data streams with concept drift,” Knowl. Inf. Syst., vol. 52, no. 2, pp. 531–562, 2017.
Materials Today: Proceedings
journal homepage: www.elsevier.com/locate/matpr

Article history: Available online 16 April 2021

Keywords: Cancer, Deep learning, ML, ANN, SVM, Decision trees

Cancer has been identified as a diverse condition comprising several subtypes. The timely screening and treatment of a cancer type is now a requirement in early cancer research because it supports the medical treatment of patients. Many research teams have studied the application of ML and Deep Learning methods in biomedicine and bioinformatics to classify people with cancer into high- or low-risk categories. These techniques have therefore been used to model the development and treatment of cancer. It is also important that ML tools are capable of detecting key features in complex datasets. Many of these methods are widely used for the development of predictive models for predicting a cure for cancer; they include artificial neural networks (ANNs), support vector machines (SVMs) and decision trees (DTs). While ML methods can help us understand cancer progression, an adequate level of validity is needed before these methods can be considered for everyday clinical practice.
In this study, the ML & DL approaches used in cancer progression modeling are reviewed. The predictions addressed are mostly linked to specific ML techniques, inputs, and supervised data samples.
© 2021 Elsevier Ltd. All rights reserved.
Selection and peer-review under responsibility of the scientific committee of the International Virtual Conference on Advanced Nanomaterials and Applications. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
F.J. Shaikh and D.S. Rao Materials Today: Proceedings 50 (2022) 40–47
Table 1
Features and challenges of existing lung cancer prediction models.

Author [citation] | Methodology | Features | Challenges
 | | It is very simple to implement. |
Tae-Woo Kim et al. [2] | Decision Tree (DT) | These are simple to interpret. It should be taken as the minimal decision standard of work-relatedness for lung cancer. | They suffer from overfitting.
Maciej Zięba et al. [3] | Boosted SVM | It is used in medical applications for predicting post-operative life expectancy in lung cancer patients. It is used to solve imbalanced data problems. | The running time of training algorithms does not scale well with the size of the training set. Many parameters need to be set accurately to attain the best results.
Worrawat Engchuan [4] | SVM | It is used to build n-hyperplanes and n-features for separating each class with maximal margin. |
H. Azzawi et al. [5] | Gene Expression Programming | It has a better solution for lung cancer prediction difficulties. It has high accuracy. |
Panayiotis Petousis et al. [6] | Dynamic Bayesian Networks | It has demonstrated high discrimination and predictive power. It is used to acquire the probability of a positive biopsy outcome for the given individual. |
Chip M. Lynch et al. [7] | DT | It is the best predictor, attaining high accuracy. It automatically prunes to a very short three-level depth. |
P. Petousis et al. [10] | Partially-Observable Markov Decision Process (POMDP) | It optimizes lung cancer prediction while improving test specificity. It reduces the false positive rates. |
recurrence in patients with breast cancer for five years. Seven predictive variables were combined, comprising clinical information such as patient age, tumor size and number of axillary metastases. Protein biomarkers, such as oestrogen and progesterone receptor levels, were also included. The focus of the research was to produce an automated, quantitative predictive approach more precise than the classical tumor-node-metastasis (TNM) staging. TNM staging is an expert-based system that relies mainly on the professional judgment of a pathologist or clinician. The researchers used an ANN model with information from 2441 breast cancer patients (seven data points each). The sample-to-feature ratio remained significantly higher than the recommended minimum of five (Somorjai et al., 2013). The whole dataset was divided into three sets: training (1/3), monitoring (1/3) and test (1/3). Furthermore, the authors collected 310 separate samples from another institution to carry out an external assessment. This allowed the researchers to test the generalization of their system beyond their own institution, a step not taken in the two studies discussed above. The analysis demonstrates not only the volume of data and the thoroughness of validation, but also the level of quality control in data processing. The information, for example, was entered and stored independently in a relational database and was independently checked by the referring doctors. The sample of 2441 patients and 17 000 data points was sufficiently large to represent typical breast cancer population demographics when subdivided into the data sets. Nevertheless, by examining the data distributions for patients in each set (training, monitoring, testing and external), the authors explicitly verified this assumption and demonstrated that the distributions are much the same. The authors built an extremely accurate and robust classifier through consistency and attention to detail. Since the study's aim was to produce a system that predicted recurrence of breast cancer better than the traditional TNM staging method, comparing the ANN model with the TNM staging predictions was important. This was done using a Receiver Operating Characteristic (ROC) curve to compare the performance. The ANN model (0.726), measured by the area under the ROC curve, exceeded the TNM system (0.677). This research is an excellent illustration of a machine learning study that is well designed and tested. A large enough data set was obtained, and the data was independently checked for quality assurance and precision for each sample. In addition, blinded validation sets were available for assessing the generality of the machine learning system, both from the original data set and from an external source. Finally, the precision of the model was compared directly with that of the traditional TNM prediction scheme. The only weakness of this analysis was that the researchers evaluated only one form of ANN algorithm. Given the type and the amount of data used, another machine learning technique could well have exceeded their ANN model.

3. Research gap

Lung cancer is the second largest human illness in terms of cancer deaths worldwide. The average 5-year survival rate for patients with lung cancer does not exceed 14 percent, which is significantly less than the rate for patients with cancer in other organs such as the breast, cervix, bladder, prostate or colon [18]. Thus, early prediction of lung cancer is very important for appropriate treatment and for decreasing deaths. Healthcare is one of the significant sources of big data. Accurate examination of healthcare information is much in demand for detecting lung cancer at an early stage. Multiple studies are being designed to recognize lung cancer with better quality using big data, as there is a need for a classification approach that improves detection accuracy with respect to time. In addition, machine learning techniques are modelled for enhancing the detection of a new variant. The ability of machine learning to solve composite tasks with dynamic environments and knowledge has contributed to its success in prediction research, especially for lung cancer, enabled by novel meta-heuristic algorithms.

Although the existing approaches to predicting lung cancer have many advantages, the existing methodologies still have a few defects, so a new method needs to be implemented. Adaboost [31] has attained high sensitivity and best performance, and it is very simple to implement, but it is very sensitive to noisy data. DT [32] is simple to interpret, should be taken as the minimal decision standard of work-relatedness for lung cancer, is the best predictor by attaining high accuracy, and automatically prunes to a very short three-level depth. However, the running time of training algorithms does not scale well with the size of the training set. SVM [34] is used to build n-hyperplanes and n-features for separating each class with maximal margin, and it improves classification power and robustness. Yet many parameters need to be set accurately to attain the best results. Gene Expression Programming [35] offers a better solution for lung cancer prediction difficulties and has high accuracy. However, there are some disadvantages: if the models are easy to manipulate, they lose functional complexity. Dynamic Bayesian Networks [36] have demonstrated high discrimination and predictive power and are used to acquire the probability of a positive biopsy outcome for a given individual. There are a few challenges, though: a longer search time may affect performance. POMDP [38] optimizes lung cancer prediction while improving test specificity, and it reduces the false positive rates, but its performance needs to be improved. Hence, a new model needs to be introduced to provide the best performance, and the above conflicts are useful for developing the new method.

4. Research objectives

The objectives of this research work are as follows.

1. To review various state-of-the-art lung cancer prediction models and develop a new feature extraction model.
2. To compare the symptoms of cancer for early notification.
3. To design and develop a deep learning model to predict lung cancer.
4. To validate the proposed model by comparing it with other conventional models.
5. To send automatic notifications on detecting the cancer.

5. Discussion

The latest research on predicting cancer using ML & DL techniques is discussed in this study. Following short details of the ML & DL field and of the data preprocessing techniques, the selection techniques and the classification algorithms
that were employed, we discussed three specific case studies based on popular ML tools, concerning the prediction of cancer susceptibility, cancer recurrence and cancer survival. Clearly, a large number of ML & DL approaches published over the past decade produce precise outputs regarding particular cancer predictions. However, to support clinical decisions it is crucial to identify potential problems, including experimental design, the collection of suitable data samples and the validation of classification results.

Moreover, despite claims that ML classification methods contribute to appropriate and efficient decision-making, very few have in fact entered clinical practice. Recent advances in omics technology have led us to better understand a wide range of diseases, but validation results need to be accurate before gene expression signatures can be used in hospitals. In general, only a few labeled samples are available. The small number of data samples is a very frequent drawback observed in the research surveyed in this article. A training dataset that is large enough is a basic requirement when using classification schemes to model a disease. A relatively large dataset makes it possible to divide it adequately into training and test sets and therefore to validate the estimators reasonably. A training sample that is small compared with the dimensionality of the data can result in misclassifications, and the estimators can become unstable and biased. It is clear that a richer set of patients used for survival prediction may improve the capacity of the predictive model. Besides data size, the quality of the dataset and the feature selection schemes are important for efficient ML and DL and hence for precise cancer prediction. Using feature selection methods to select the most informative feature subset for training could lead to robust models. Feature sets consisting of histology and pathology assessments are also characterized by reproducible values. Given the lack of static entities, it is essential that ML & DL techniques can be adapted to multiple feature sets over time. We also found that SVM and ANN classifiers are among the ML algorithms most frequently used for cancer prediction [35]. As discussed in our introductory section, ANNs have been widely used for nearly 30 years [40]. SVMs are a newer method for cancer prediction but have already been widely adopted because of their trustworthy predictive results. However, the selection of the best algorithm depends on a large number of parameters, including the types of data collected, sample size, time limits and the type of prediction results. New methods for overcoming the above-mentioned limitations should be explored for the future of cancer modeling. More accurate results and reasoned conclusions would be obtained through efficient quantitative analysis of the heterogeneous datasets used. Further research on the basis of more public databases, which gather valid cancer data for all diagnosed patients, is needed. Their use by scholars will allow their modeling studies to generate relevant outputs and integrated clinical decision-making.

6. Conclusion

This study explains and compares the findings of various machine learning and deep learning methods applied to cancer prognosis. Specifically, several trends have been identified relating to the kinds of machine learning techniques used, the kinds of training data incorporated, the kinds of endpoint predictions made, the sorts of cancers investigated, and the overall performance of cancer prediction or outcome methods. While ANNs remain prevalent, it is clear that a broader variety of alternative learning approaches is also used to predict at least three different cancer types. Furthermore, it is clear that machine learning methods typically increase the efficiency or predictive accuracy of most prognoses, in particular when compared with conventional statistical or expert systems. Although most studies are well designed and fairly well validated, more attention to the planning and implementation of experiments is desirable, in particular with regard to the quantity and quality of biological data. Improving the experimental design and the biological validation of several machine learning classification systems would undoubtedly increase the general quality, replicability and reproducibility of many systems. Overall, we believe that the use of machine learning & deep learning classifiers will probably become quite common in many clinical and hospital settings if the quality of studies continues to improve.
The integration of multifaceted heterogeneous data, combined with the application of different analytical and classification methods, can offer a promising tool for cancer inference and for predicting the disease.
In future, using the proposed framework, we would like to apply other state-of-the-art machine learning algorithms and extraction methods to allow more intensive comparative analysis.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] Chao Tan, Hui Chen, Chengyun Xia, Early prediction of lung cancer based on the combination of trace element analysis in urine and an Adaboost algorithm, J. Pharm. Biomed. Anal. 49 (3) (2009) 746–752.
[2] D.-H. Tae-Woo Kim, Chung-Yill Park, Decision tree of occupational lung cancer using classification and regression analysis, Safety Health Work 1 (2) (2010) 140–148.
[3] M. Zięba, J.M. Tomczak, Marek Lubicz, Jerzy Świątek, Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life expectancy in the lung cancer patients, Appl. Soft Comput. 14 (2014) 99–108.
[4] Worrawat Engchuan, Jonathan H. Chan, Pathway activity transformation for multi-class classification of lung cancer datasets, Neurocomputing 165 (2015) 81–89.
[5] H. Azzawi, J. Hou, Y. Xiang, R. Alanni, Lung cancer prediction from microarray data by gene expression programming, IET Syst. Biol. 10 (5) (2016) 168–178.
[6] P. Petousis, S.X. Han, Denise Aberle, Alex A.T. Bui, Prediction of lung cancer incidence on the low-dose computed tomography arm of the National Lung Screening Trial: a dynamic Bayesian network, Artif. Intell. Med. 72 (2016) 42–55.
[7] C.M. Lynch, J.D. Behnaz Abdollahi, A. Fuqua, R. de Carlo, James A. Bartholomai, Rayeanne N. Balgemann, Victor H. van Berkel, Hermann B. Frieboes, Prediction of lung cancer patient survival via supervised machine learning classification techniques, Int. J. Med. Inf. 108 (2017) 1–8.
[9] D.S. Rao, D.P. Tripathy, Optimization of machinery noise using Genetic Algorithm. Noise Conference 2017. Michigan, 2017; 527–537.
[10] P. Petousis, A. Winter, W. Speier, D.R. Aberle, W. Hsu, A.A.T. Bui, Using sequential decision making to improve lung cancer screening performance, IEEE Access 7 (2019) 119403–119419.
[12] V. Krishnaiah, G. Narsimha, C. Subhash, Diagnosis of lung cancer prediction system using data mining classification techniques, Int. J. Comp. Sci. Inf. Technol. 4 (1) (2013) 39–45.
[13] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. 24 (7) (2002) 971–987.
[15] L. Demidova, I. Klyueva, Y. Sokolova, N. Stepanov, N. Tyart, Intellectual approaches to improvement of the classification decisions quality on the base of the SVM classifier, Procedia Comput. Sci. 103 (2017) 222–230.
[16] N. Picco, R.A. Gatenby, A.R.A. Anderson, Stem cell plasticity and niche dynamics in cancer progression, IEEE Trans. Biomed. Eng. 64 (3) (2017) 528–537.
[18] Paweł Krawczyk, Tomasz Kucharczyk, Kamila Wojas-Krawczyk, Screening of Gene Mutations in Lung Cancer for Qualification to Molecularly Targeted Therapies, INTECH Open Access Publisher, 2012.
[19] A. Colquhoun, L. McHugh, E. Tulchinsky, M. Kriajevska, J. Mellon, Combination treatment with ionising radiation and Gefitinib (‘Iressa’, ZD1839), an epidermal growth factor receptor (EGFR) inhibitor, significantly inhibits bladder cancer cell growth in vitro and in vivo, J. Radiat. Res. 48 (5) (2007) 351–360.
[20] E. Adetiba, O.O. Olugbara, Lung cancer prediction using neural network ensemble with histogram of oriented gradient genomic features, Sci. World J. (2015).
[21] S.S. Alahmari, D. Cherezov, D.B. Goldgof, L.O. Hall, R.J. Gillies, M.B. Schabath, Delta radiomics improves pulmonary nodule malignancy prediction in lung cancer screening, IEEE Access 6 (2018) 77796–77806.
[22] S. Park, S.J. Lee, E. Weiss, Y. Motai, Intra- and inter-fractional variation prediction of lung tumors using fuzzy deep learning, IEEE J. Transl. Eng. Health Med. 4 (2016) 1–12.
[23] A. Raweh, M. Nassef, A. Badr, A hybridized feature selection and extraction approach for enhancing cancer prediction based on DNA methylation, IEEE Access 6 (2018) 15212–15223.
[24] J. Pati, Gene expression analysis for early lung cancer prediction using machine learning techniques: an eco-genomics approach, IEEE Access 7 (2019) 4232–4238.
[25] B. Zhang et al., Ensemble learners of multiple deep CNNs for pulmonary nodules classification using CT images, IEEE Access 7 (2019) 110358–110371.
[26] C. Arunkumar, S. Ramakrishnan, Prediction of cancer using customised fuzzy rough machine learning approaches, Healthcare Technol. Lett. 6 (1) (2019) 13–18.
[27] H. Guo, U. Kruger, G. Wang, M.K. Kalra, P. Yan, Knowledge-based analysis for mortality prediction from CT images, IEEE J. Biomed. Health. Inf. 24 (2) (2020) 457–464.
[28] J. Yang, N. Li, S. Fang, K. Yu, Y. Chen, Semantic features prediction for pulmonary nodule diagnosis based on online streaming feature selection, IEEE Access 7 (2019) 61121–61135.
[29] Raja Mohammad Taisir Masadeh, Basel A. Mahafzah, Ahmad Abdel-Aziz Sharieh, Sea lion optimization algorithm, Int. J. Adv. Comp. Sci. Appl. 10 (5) (2019) 388–395.
[31] A. Jemal, F. Bray, M.M. Center, J.J. Ferlay, E. Ward, D. Forman, CA A Cancer J. Clin. 61 (2) (2011) 69–90.
[32] D. Delen, G. Walker, A. Kadam, Predicting breast cancer survivability: a comparison of three data mining methods, Artif. Intell. Med. 34 (2) (2005) 113–127.
[34] D. Delen, N. Patil, Knowledge Extraction from Prostate Cancer Data, Proceedings of the 39th Annual Hawaii International Conference on, vol. 5, 2006.
[35] M. Hoogendoorn, L.M.G. Moons, M.E. Numans, R.-J. Sips, Utilizing data mining for predictive modeling of colorectal cancer using electronic medical records, in: International Conference on Brain Informatics and Health BIH 2014: Brain Informatics and Health, 2014, pp. 132–141.
[36] R. Al-Bahrani, A. Agrawal, A. Choudhary, Colon cancer survival prediction using ensemble data mining on SEER data, 2013 IEEE International Conference on Big Data, Silicon Valley, CA, pp. 9–16, 2013.
[38] C.M. Lynch, V.H.V. Berkel, H.B. Frieboes, Application of unsupervised analysis techniques to lung cancer patient data, PLoS One 12 (9), 2017.
[40] N. Arshadi, I. Jurisica, Data mining for case-based reasoning in high-dimensional biological domains, IEEE Trans. Knowl. Data Eng. 17 (8) (2005) 1127–1137.

Further Reading

[8] D.S. Rao, D.P. Tripathy, Optimization of machinery noise using Differential Evolution algorithm, Int. J. Min. Mineral Eng. 8 (4) (2017) 294–309.
[11] D.S. Rao, D.P. Tripathy, A Genetic Algorithm approach for optimization of machinery noise calculations, Noise Vibr. Worldwide 50 (4) (2019) 112–123.
[14] David Meyer, Friedrich Leisch, Kurt Hornik, The support vector machine under test, Neurocomputing 55 (s 1–2) (2003) 169–186.
[17] W. Kim, K.S. Kim, J.E. Lee, D.Y. Noh, S.W. Kim, Y.S. Jung, M.Y. Park, R.W. Park, Development of novel breast cancer recurrence prediction model using support vector machine, J. Breast Cancer 15 (2) (2012) 230–238.
[30] Z.W. Huang, A. Mcwilliams, H. Lui, D. Mclean, S. Lan, H.S. Zeng, Near-infrared Raman spectroscopy for optical diagnosis of lung cancer, Int. J. Cancer 107 (6) (2003) 1047–1052.
[33] D. Delen, Analysis of cancer data: a data mining approach, Expert Syst. 20 (1) (2009) 100–112.
[37] D. Fradkin, I. Muchnik, D. Schneider, Machine Learning Methods in the Analysis of Lung Cancer Survival Data, DIMACS Technical Report, 2005.
[39] D. Chen, K. Xing, D. Henson, L. Sheng, A.M. Schwartz, X. Cheng, Developing prognostic systems of cancer patients by ensemble clustering, J. Biomed. Biotechnol. (2009).
[41] G. Dimitoglou, J.A. Adams, C.M. Jim, Comparison of the C4.5 and a naive Bayes classifier for the prediction of lung cancer survivability, J. Comput. (2012).
[42] A. Agrawal, S. Misra, Ramanathan Narayanan, Lalith Polepeddi, Alok Choudhary, Lung cancer survival prediction using ensemble data mining on SEER data, Sci. Program. 20 (1) (2012) 29–42.
[43] S.M. Agrawal, R. Narayanan, L. Polepeddi, A. Choudhary, A lung cancer outcome calculator using ensemble data mining on SEER data, Proceedings of the Tenth International Workshop on Data Mining in Bioinformatics, 2011.
[44] D.L. Tong, A.C. Schierz, Hybrid genetic algorithm-neural network: feature extraction for unpreprocessed microarray data, Artif. Intell. Med. 53 (1) (2011) 47–56.
[45] M.R. Mohebian, H.R. Marateb, M. Mansourian, Miguel Angel Mañanas, Fariborz Mokarian, A hybrid computer-aided-diagnosis system for prediction of breast cancer recurrence (HPBCR) using optimized ensemble learning, Comput. Struct. Biotechnol. J. 15 (2017) 75–85.
[46] M. Zięba, J.M. Tomczak, M. Lubicz, J. Świątek, Boosted SVM for extracting rules from imbalanced data in application to prediction of the post-operative life