Lung Cancer Report

The document discusses the rising incidence of lung cancer and the importance of early detection for improving survival rates. It highlights the potential of machine learning techniques in diagnosing lung cancer at early stages, emphasizing the need for effective tools to analyze large datasets. The document also reviews various machine learning methods and their applications in healthcare, particularly in predicting lung cancer outcomes.

Uploaded by

Kowsalya Devi

Prediction of Early Stage Lung Cancer using Machine Learning

CHAPTER 1

INTRODUCTION

Lung cancer is one of the most rapidly increasing diseases causing human deaths worldwide through respiratory failure; its death rate now exceeds that of breast cancer. The disease is characterized by the uncontrolled growth of cells. If it is not diagnosed at an early stage and treated before the second stage, the death rate in humans rises, because the malignant tissue spreads rapidly to other parts of the body such as the brain, heart, bones, glands and liver.

Earlier research shows that there is no reliable tool for the early detection of lung cancer in humans. There are two major types of lung cancer: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). Over 85 percent of all lung cancers are non-small cell lung cancers, while about 13 percent are small cell lung cancers [2]. Staging of lung cancer is based on whether the cancer is local or has spread from the lungs to the lymph nodes or other organs. Because the lungs are large, tumors can grow in them for a long time before they are found. Even when symptoms such as coughing and fatigue occur, people often attribute them to other causes. For this reason, early-stage lung cancer (stages I and II) is difficult to detect, and most people with NSCLC are diagnosed only at stages III and IV.

As the volume of data grows in proportion to the increase in population, there is a greater need to extract knowledge from that data. Machine Learning contributes much towards this and finds application in many diverse fields, including the healthcare industry. Machine Learning is the process of sifting through historical data to gain insight into the patterns hidden in large datasets and to incorporate those patterns into everyday activity. In medical diagnosis, Machine Learning helps to extract the underlying pattern of a disease. Researchers suggest that applying Machine Learning techniques to the effective pre-diagnosis of disease can improve practitioner performance. Lung cancer, being a disease that depends heavily on historical data, can make use of machine learning for its early detection, and researchers have been investigating various Machine Learning techniques on lung cancer datasets for its early diagnosis.

Dept Of MCA DSCE 1

With the rapid growth in the volume of data nowadays, there is a need to extract meaningful knowledge from this information. Machine Learning contributes to many application domains, including information technology, stock trading, marketing, healthcare and banking. Population growth has been accompanied by a rise in disease, which has increased the need for machine learning models for diagnosis using medical datasets. Among biomedical conditions, cancer is the most widespread disease, killing over 7 million people every year, and lung cancer accounts for nearly 17% of these mortalities. Previous research shows that the survival rate of cancer patients is much higher when the disease is diagnosed at the initial stage. Because lung cancer depends strongly on historical data, researchers have applied machine learning techniques for its early diagnosis at stage 1. With early detection at stage 1, before the tumor has spread, the five-year survival rate increases to about 70%. Existing medical techniques such as X-ray, Computed Tomography (CT) scans, sputum cytology analysis and other imaging techniques not only require complex equipment and high cost, but have also proven effective only at stage 4, when the tumor has metastasized to other parts of the body. Our proposed work uses machine learning techniques to classify lung cancer patients and to categorize the stage to which a positive case belongs. The work is based on the early prediction of lung cancer, which helps doctors treat patients and so increase the survival rate.

The impact of Lung-RADS was analysed in a retrospective analysis of the NLST (3).
Lung-RADS was shown to reduce the overall screening false positive rate to 12.8% and 5.3% at
baseline and interval imaging respectively at the cost of a reduction of sensitivity from 93.5% in
the NLST to 84.9% using Lung-RADS at baseline and 93.8% in the NLST and 84.9% using
Lung-RADS after baseline. However, while Lung-RADS reduces the overall false positive rate,
the false positive rate of positive screens, i.e., Lung-RADS 3 and above, remains very high at
93% at baseline and 89% after baseline; of 3,591 Lung-RADS 3 and above screens, 3,343 were
false positives at baseline and of 2,858 Lung-RADS 3 and above screens after baseline 2,543
were false positives. Therefore, while the adoption of Lung-RADS can reduce the total number
of benign nodules being worked-up within a screening programme, at a cost of just under 10%
loss in sensitivity, there remain a very large number of benign nodules being investigated, and
the nodule classification task remains a challenging one.
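The false positive rates quoted above can be checked directly from the screen counts given in the same paragraph; a quick sketch:

```python
# Quick check of the Lung-RADS false positive rates quoted in the text,
# using the screen counts from the retrospective NLST analysis cited above.
baseline_screens, baseline_fp = 3591, 3343   # Lung-RADS 3+ screens at baseline
interval_screens, interval_fp = 2858, 2543   # Lung-RADS 3+ screens after baseline

fp_rate_baseline = baseline_fp / baseline_screens   # ≈ 0.931
fp_rate_interval = interval_fp / interval_screens   # ≈ 0.890

print(f"False positive rate at baseline: {fp_rate_baseline:.1%}")
print(f"False positive rate after baseline: {fp_rate_interval:.1%}")
```

Both values agree with the rounded figures (93% and 89%) stated in the text.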

One approach to address this problem is to adopt computer aided diagnosis (CADx)
technology as an aid to radiologists and pulmonary medicine physicians. Given an input CT and
possible additional relevant patient meta-data, such techniques aim to provide a quantitative
output related to the lung cancer risk.

One may consider the goal of such systems to be two-fold. First, to reduce the variability
in assessing and reporting the lung cancer risk between interpreting physicians. Indeed,
computer assisted approaches have been shown to improve consistency between physicians in a
variety of clinical contexts, including nodule detection and mammography screening, and one
might expect such decision support tools to provide the same benefit in nodule
classification. Second, CADx could improve classification performance by supporting the less
experienced or non-specialized clinicians in assessing the risk of a particular nodule being
malignant.

In this project, we review progress made towards the development and validation of
lung cancer prediction models and nodule classification CADx software. While we do not
intend this to be a comprehensive review, we do aim to provide an overview of the main
approaches taken to date and outline some of the challenges that remain to bring this technology
to routine clinical use.

1.1 Overview of Machine Learning


Machine learning is a subfield of artificial intelligence (AI). The goal of machine learning
generally is to understand the structure of data and fit that data into models that can be
understood and utilized by people.

Although machine learning is a field within computer science, it differs from traditional
computational approaches. In traditional computing, algorithms are sets of explicitly
programmed instructions used by computers to calculate or problem solve. Machine learning
algorithms instead allow for computers to train on data inputs and use statistical analysis in
order to output values that fall within a specific range. Because of this, machine learning
facilitates computers in building models from sample data in order to automate decision-making
processes based on data inputs.

Any technology user today has benefitted from machine learning. Facial recognition technology
allows social media platforms to help users tag and share photos of friends. Optical character
recognition (OCR) technology converts images of text into movable type. Recommendation
engines, powered by machine learning, suggest what movies or television shows to watch next
based on user preferences. Self-driving cars that rely on machine learning to navigate may soon
be available to consumers.

Machine learning is a continuously developing field. Because of this, there are some
considerations to keep in mind as you work with machine learning methodologies, or analyze
the impact of machine learning processes.

In this section, we will look at the common machine learning methods of supervised and
unsupervised learning, and at common algorithmic approaches in machine learning, including the
k-nearest neighbor algorithm, decision tree learning, and deep learning. We will explore which
programming languages are most used in machine learning, along with some of the
positive and negative attributes of each. Additionally, we will discuss biases that can be
perpetuated by machine learning algorithms, and consider what can be kept in mind to prevent
these biases when building algorithms.

Machine Learning Methods

In machine learning, tasks are generally classified into broad categories. These categories are
based on how learning is received or how feedback on the learning is given to the system
developed.

Two of the most widely adopted machine learning methods are supervised learning which
trains algorithms based on example input and output data that is labeled by humans,
and unsupervised learning which provides the algorithm with no labeled data in order to allow
it to find structure within its input data. Let’s explore these methods in more detail.

Supervised Learning

In supervised learning, the computer is provided with example inputs that are labeled with their
desired outputs. The purpose of this method is for the algorithm to be able to “learn” by
comparing its actual output with the “taught” outputs to find errors, and modify the model
accordingly. Supervised learning therefore uses patterns to predict label values on additional
unlabeled data.

For example, with supervised learning, an algorithm may be fed data with images of sharks
labeled as fish and images of oceans labeled as water. By being trained on this data, the
supervised learning algorithm should be able to later identify unlabeled shark images as fish and
unlabeled ocean images as water.

A common use case of supervised learning is to use historical data to predict statistically likely
future events. It may use historical stock market information to anticipate upcoming
fluctuations, or be employed to filter out spam emails. In supervised learning, tagged photos of
dogs can be used as input data to classify untagged photos of dogs.
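As a hedged illustration of this idea (using scikit-learn's bundled iris data rather than any dataset from this project), a supervised model is trained on labeled examples and then scored on held-out, unlabeled inputs:

```python
# Minimal supervised-learning sketch: train on labeled examples,
# then predict labels for unseen inputs (illustrative only).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)          # labeled example inputs/outputs
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)                # "learn" from the taught outputs
print("held-out accuracy:", model.score(X_test, y_test))
```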

Unsupervised Learning

In unsupervised learning, data is unlabeled, so the learning algorithm is left to find


commonalities among its input data. As unlabeled data are more abundant than labeled data,
machine learning methods that facilitate unsupervised learning are particularly valuable.

The goal of unsupervised learning may be as straightforward as discovering hidden patterns


within a dataset, but it may also have a goal of feature learning, which allows the computational
machine to automatically discover the representations that are needed to classify raw data.

Unsupervised learning is commonly used for transactional data. You may have a large dataset
of customers and their purchases, but as a human you will likely not be able to make sense of
what similar attributes can be drawn from customer profiles and their types of purchases. With
this data fed into an unsupervised learning algorithm, it may be determined that women of a
certain age range who buy unscented soaps are likely to be pregnant, and therefore a marketing
campaign related to pregnancy and baby products can be targeted to this audience in order to
increase their number of purchases.

Without being told a “correct” answer, unsupervised learning methods can look at complex data
that is more expansive and seemingly unrelated in order to organize it in potentially meaningful
ways. Unsupervised learning is often used for anomaly detection including for fraudulent credit
card purchases, and recommender systems that recommend what products to buy next. In
unsupervised learning, untagged photos of dogs can be used as input data for the algorithm to
find likenesses and classify dog photos together.
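A minimal unsupervised sketch of the same idea, on synthetic two-group "customer" data (the data and cluster count are illustrative assumptions): the algorithm recovers the hidden groups without ever seeing a label.

```python
# Unsupervised-learning sketch: no labels are given; k-means groups
# the inputs by similarity on its own (synthetic data, illustrative only).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# two unlabeled "customer" groups with different spending patterns
group_a = rng.normal(loc=[20, 5], scale=1.0, size=(50, 2))
group_b = rng.normal(loc=[60, 40], scale=1.0, size=(50, 2))
X = np.vstack([group_a, group_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # recovers the two hidden group centres
```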

Machine Learning commonly uses the following techniques:


• Artificial neural networks: An artificial neural network (ANN) is a computational model
based on the structure and functions of biological neural networks. Information flowing through
the network affects the structure of the ANN, based on that input and output.
• Decision trees: A decision tree is a graph that uses a branching method to illustrate every
possible outcome of a decision.
• Naïve Bayes’ algorithm: Naïve Bayes’ classifiers are a family of simple “probabilistic
classifiers” based on applying Bayes’ theorem with strong independence assumptions between
the features.
• Genetic algorithms: A genetic algorithm (GA) is an adaptive heuristic search algorithm based
on the evolutionary ideas of genetics and natural selection.
• Nearest neighbor method: The principle behind the nearest neighbor method is to find a
predefined number of training samples closest in distance to the new point and predict the label
from these.
• Rule induction: Rule induction is an area of machine learning in which formal rules
are extracted from a data set. The extracted rules represent patterns in the data.
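Three of the listed techniques (decision trees, naïve Bayes, nearest neighbor) can be compared with only a few lines of scikit-learn; the dataset and hyperparameters below are illustrative assumptions, not taken from this report:

```python
# Hedged sketch comparing three of the techniques listed above on a
# public toy dataset via 5-fold cross-validation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```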

To best apply these advanced techniques, they must be integrated with a data warehouse as well
as with flexible, interactive business analysis tools. In the first step, the information is collected
from review sites; the data sources can be product reviews, which are often available in multiple
formats such as CSV, ASCII/text and flat files.

1.2 Literature Survey

[1] Bashetha and G. U. Srikanth, “Effective cancer detection using soft computing technique”
Cancer research is rudimentary research which is done to identify causes and develop strategies
for prevention, diagnosis, treatment and cure. An optimized solution for the better treatment of
cancer and toxicity minimization on the cancer patient is performed by identifying the exact
type of tumor. A clear cancer classification analysis system is required to get a clear picture on
the insight of a problem. A systematic approach to analyze global gene expression is followed
for identifying exact problem area. Molecular diagnostics provide a promising option of
systematic human cancer classification. But these types of tests are not mostly applied because
characteristics molecular markers have yet to be identified for most solid tumors. Recently,
DNA micro-array based tumor gene expression profiles have been used for cancer diagnosis. In
the proposed system, gene expressions are taken from multiple sources and an ontological store
is created. Ant colony optimization technique is used to analyze the cluster of data with attribute
match association rules for detecting cancer using the acquired knowledge.

[2] X. Wang and O. Gotoh, “Microarray-based cancer prediction using soft computing
approach”
One of the difficulties in using gene expression profiles to predict cancer is how to effectively
select a few informative genes to construct accurate prediction models from thousands or ten
thousands of genes. We screen highly discriminative genes and gene pairs to create simple
prediction models involved in single genes or gene pairs on the basis of soft computing
approach and rough set theory. Accurate cancerous prediction is obtained when we apply the
simple prediction models for four cancerous gene expression datasets: CNS tumor, colon tumor,
lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or
general cancers are identified. In contrast with other models, our models are simple, effective
and robust. Meanwhile, our models are interpretable for they are based on decision rules. Our
results demonstrate that very simple models may perform well on cancerous molecular
prediction and important gene markers of cancer can be detected if the gene selection approach
is chosen reasonably.

[3] Devika R, Sai Vaishnavi Avilala, V. Subramaniyaswamy, “Comparative Study of Classifier
for Chronic Kidney Disease prediction using Naive Bayes, KNN and Random Forest”
Chronic kidney disease (CKD), also known as chronic renal disease, describes conditions that
affect the kidneys and reduce the body's ability to stay healthy. It can lead to complications such
as increased waste levels in the blood, anemia (low blood count), weak bones, and nerve
damage. Early detection and treatment can typically keep chronic kidney disease from getting
worse. Data mining is the term used for knowledge discovery from large databases. The task of
data mining is to extract regular patterns from historical data and draw conclusions about the
future; it follows from the convergence of several recent trends: the decreased cost of large data
storage devices and the tremendous ease of collecting data over networks; the development of
robust and efficient machine learning algorithms to process this data; and the decreased cost of
computational power, enabling the use of computationally intensive methods for data analysis.
Machine learning is important because it benefits many applications, such as analyzing life
science outcomes, detecting fraud and detecting fake users. Various data mining classification
approaches and machine learning algorithms have been applied for the prediction of chronic
diseases. This paper therefore examines the performance of the Naive Bayes, K-Nearest
Neighbors (KNN) and Random Forest classifiers on the basis of accuracy, precision and
execution time for CKD prediction. The outcome of the research is that the Random Forest
classifier performs better than Naive Bayes and KNN.


CHAPTER-2
SYSTEM REQUIREMENT SPECIFICATION

The system requirement specification is gathered by extracting the appropriate information
needed to implement the system. It elaborates the conditions which the system needs to attain.
Moreover, the SRS delivers complete knowledge of the system, making clear what this project
is going to achieve without placing constraints on how to achieve that goal. The SRS does not
expose implementation details to outside parties; it hides the design.

2.1 Hardware Requirements

 System Processor : Core i3 / i5

 Hard Disk : 500 GB

 RAM : 4 GB

 Any desktop/laptop system with the above configuration or higher.

2.2 Software Requirements

 Operating system : Windows 11

 Coding Language : Python

 Software : Anaconda

 IDE : Jupyter Notebook

2.3 Functional requirements


A functional requirement defines a function of the software system; the behavior
of the system is evaluated when it is presented with specific inputs or conditions, which may
include calculations, data manipulation and processing, and other specific functionality. The
functional requirements of the project are among the most important aspects of the
entire mechanism of its modules.

 The system is to be developed in Python with Jupyter Notebook.

 There should be a provision to upload the lung dataset. We must understand the various
factors associated with lung cancer, such as the following.
 The following are the generic lung cancer symptoms:
i. A cough that does not go away and gets worse over time
ii. Coughing up blood (haemoptysis) or bloody mucus
iii. Chest, shoulder, or back pain that does not go away and is often made worse by deep
breathing
iv. Hoarseness
v. Weight loss and loss of appetite
vi. Wheezing, or a new onset of wheezing
vii. Increase in volume of sputum
viii. Fatigue and weakness
ix. Shortness of breath
x. Repeated respiratory infections, such as bronchitis or pneumonia
xi. Swelling of the neck and face
xii. Clubbing of the fingers and toes (the nails appear to bulge out more than normal)
xiii. Paraneoplastic syndromes, caused by biologically active substances secreted by the
tumor, such as fever, hoarseness of voice, loss of appetite, puffiness of the face, and
nausea and vomiting
 The following are lung cancer risk factors:
a. Smoking (beedi, cigarette, hookah)
b. Second-hand smoke
c. Radon exposure
d. High doses of ionizing radiation
e. Occupational exposure to mustard gas, chloromethyl ether, inorganic arsenic,
chromium, nickel, radon and asbestos
f. Air pollution


 Gathering the data:

The choice of data entirely depends on the problem you are trying to solve.
Picking the right data must be your goal; luckily, almost every topic you can think of has several
datasets which are public and free.
Four useful free sources for dataset hunting are:

1. Kaggle, which is well organized. Its datasets are documented in detail, with information on
the features, data types and number of records, and you can use its kernels without having
to download the dataset.

2. Reddit, which is good for requesting the datasets you want.

3. Google Dataset Search, which is still in beta, but very useful.

4. The UCI Machine Learning Repository, which maintains 468 data sets as a service to the
machine learning community.

 Handling the data:

This is one of the hardest steps, and the one that will probably take the longest, unless you are
lucky enough to have a complete, perfect dataset, which is rarely the case. Handling missing
data in the wrong way can cause disasters.

Generally, there are many solutions such as:

 Null value replacement

 Mode/median/average value replacement

 Deleting the whole record

 Model-based imputation — regression, k-nearest neighbors, etc.

 Interpolation / extrapolation

 Forward filling / backward filling — hot deck

 Multiple imputation
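Several of the strategies listed above can be sketched in a few lines of pandas/scikit-learn; the column names below are hypothetical, not from the project's dataset:

```python
# Illustrative sketches of common missing-data strategies
# (hypothetical columns; not the project's real dataset).
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({"age": [63, 71, np.nan, 55],
                   "smoking_years": [30, np.nan, 12, 20]})

median_filled = df.fillna(df.median())          # median replacement
dropped = df.dropna()                           # deleting the whole record
ffilled = df.ffill()                            # forward filling
knn_filled = pd.DataFrame(                      # model-based (k-NN) imputation
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(median_filled)
```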

 The uploaded data has to be cleaned, followed by a visualization process.

 There should be functionality to split the data into training and test data.

 Using the training data, a validation model has to be built with the random forest
algorithm.
 Once the validation model is saved, it has to accept test data and produce a result
indicating whether the case is positive for lung cancer or not.
 The records given as input are used to predict whether the patient has lung cancer or not.
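The split/train/validate flow described above can be sketched as follows; a synthetic stand-in (make_classification) replaces the uploaded lung-cancer CSV, which is an assumption made here purely so the example is self-contained:

```python
# Sketch of the split/train/predict flow; synthetic 23-feature data
# stands in for the uploaded lung-cancer dataset (an assumption).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=23, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)       # split into train/test

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)                        # build the validation model
pred = clf.predict(X_test)                       # lung cancer / not, per record
print("test accuracy:", accuracy_score(y_test, pred))
```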

Finally, the result can be viewed in the Django framework. The Django distribution also bundles
a number of applications in its "contrib" package, including:

 an extensible authentication system


 the dynamic administrative interface
 tools for generating RSS and Atom syndication feeds
 a "Sites" framework that allows one Django installation to run multiple websites, each with
their own content and applications
 tools for generating Google Sitemaps
 built-in mitigation for cross-site request forgery, cross-site scripting, SQL
injection, password cracking and other typical web attacks, most of them turned on by
default[20][21]
 a framework for creating GIS applications

2.4 Non-functional requirements


Nonfunctional requirements describe how a system must behave and establish
constraints on its functionality. This type of requirement is also known as the system’s
quality attributes. Attributes such as performance, security, usability and compatibility are
not features of the system; they are required characteristics. They are emergent
properties that arise from the whole arrangement, and hence we cannot write a
particular line of code to implement them. Any attributes required by the customer are
described in the specification. We must include only those requirements that are
appropriate for our project.

Figure 2.1: Block Diagram for Non-Functional Requirements.

Some Non-Functional Requirements are as follows:

2.4.1 Reliability

The system must be reliable and robust in providing its functionalities. When a
user reports some enhancements, the resulting changes must be made visible by
the system, and changes made by the programmer must be visible to the project
leader as well as the test engineer.
2.4.2 Maintainability

The system monitoring and maintenance should be simple and focused in its
approach. There should not be too many jobs running on different machines,
such that it becomes hard to monitor whether the jobs are running without
errors.
2.4.3 Performance

The system will be used by numerous users simultaneously. Since the system
will be hosted on a single web server with a single database server behind the
scenes, performance becomes a significant concern. The system should not
succumb when many users use it at the same time, and it should allow quick
access to every one of its users. For instance, if two test engineers are
simultaneously trying to report the presence of a bug, there should not be any
inconsistency.

2.4.4 Portability

The system should be easily portable to another system. This is required when
the web server which is hosting the system gets stuck because of some issues,
which requires the system to be moved to another server.
2.4.5 Scalability

The system should be flexible enough for new functionalities to be added at a
later stage. There should be a common channel which can accommodate the
new functionalities.
2.4.6 Flexibility

Flexibility is the capacity of a system to adapt to changing situations and
circumstances, and to cope with changes to business policies and rules. A
flexible system is one that is easy to reconfigure or adapt in response to
different user and system requirements. The deliberate separation of concerns
between the manager and engine components aids flexibility, as only a small
part of the system is affected when policies or rules change.


2.5 Summary

Chapter 2 covers all the system requirements needed to develop the proposed system.
The hardware requirements for this project are explained in section 2.1, and the software
requirements in section 2.2. The functional requirements are explained in section 2.3, and the
non-functional requirements in section 2.4.


CHAPTER-3
METHODOLOGY
3.1 Motivation
Lung cancer is one of the most rapidly increasing diseases causing human deaths worldwide
through respiratory failure; its death rate now exceeds that of breast cancer. The disease is
characterized by the uncontrolled growth of cells. If it is not diagnosed at an early stage and
treated before the second stage, the death rate in humans rises. Machine Learning contributes to
many application domains, including information technology, stock trading, marketing,
healthcare and banking. Population growth has been accompanied by a rise in disease, which
has increased the need for machine learning models for diagnosis using medical datasets.

3.2 Objectives
 To develop an intelligent machine learning model which helps to predict lung
cancer.

 To make use of a total of twenty-three different parameters to predict the possibility
of lung cancer.

 To build a UI application on top of the model to help doctors and researchers use the
tool.

 To directly integrate the tool with the medical records of an individual in order to
automatically analyze them in the background and alert the doctor when the risk
increases.

 To improve the accuracy of existing crude methods of lung cancer prediction by
employing multiple classifiers.

 To explore the possibility of extending the proposed system to predict other types of
cancer as well.


3.3 Problem Statement


To design and implement an efficient and accurate Machine Learning Platform that can
help predict the possibilities of early stage lung cancer.

Input: Twenty-three features that are essential for lung cancer diagnosis.

Processing:

 Preprocessing methods: Standardization and label encoding.

 Classifier: Random forest and Support Vector Machine.

 Prediction: Probability of having lung cancer.

Output: Prediction of possibility and stage of lung cancer.
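The processing pipeline above can be sketched end to end; the synthetic data and the soft-voting combination of the two classifiers are assumptions made for illustration (the problem statement names the preprocessing steps and classifiers but not how they are combined):

```python
# Hedged sketch of the stated pipeline: label encoding + standardization,
# then Random Forest and SVM combined by soft voting to output a
# probability of lung cancer. Synthetic 23-feature data is an assumption.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

X, y_raw = make_classification(n_samples=300, n_features=23, random_state=0)
# label encoding: map textual class labels ("NO"/"YES") to 0/1
y = LabelEncoder().fit_transform(np.where(y_raw == 1, "YES", "NO"))

ensemble = make_pipeline(
    StandardScaler(),                                  # standardization
    VotingClassifier(
        estimators=[("rf", RandomForestClassifier(random_state=0)),
                    ("svm", SVC(probability=True, random_state=0))],
        voting="soft"))                                # averaged probabilities

ensemble.fit(X, y)
print("P(lung cancer):", ensemble.predict_proba(X[:1])[0][1])
```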

3.4 Advantages
 The ensemble model predicts more accurate results.
 Higher performance compared to the existing system.
 More accurate and reliable results.


CHAPTER 4
HIGH LEVEL DESIGN
4.1. System design

System architecture design identifies the overall hypermedia structure for the WebApp.
Architecture design is tied to the goals established for a WebApp, the content to be presented,
the users who will visit, and the navigation philosophy that has been established. Content
architecture focuses on the manner in which content objects are structured for presentation and
navigation. WebApp architecture addresses the manner in which the application is structured to
manage user interaction, handle internal processing tasks, effect navigation, and present content.
WebApp architecture is defined within the context of the development environment in which
the application is to be implemented.

4.2 Design consideration


Design for WebApps encompasses technical and non-technical activities. The look and
feel of content is developed as part of graphic design; the aesthetic layout of the user interface is
created as part of interface design; and the technical structure of the WebApp is modeled as part
of architectural and navigational design.

This argues that a Web engineer must design an interface so that it answers three primary
questions for the end user:

 Where am I? – The interface should (1) indicate which WebApp has been accessed and
(2) inform the user of her location in the content.

 What can I do now? – The interface should always help the user understand his current
options: what functions are available, what links are live, and what content is relevant.


 Where have I been; where am I going? – The interface must facilitate navigation. Hence
it must provide a “map” of where the user has been and what paths may be taken to
move elsewhere in the WebApp.
4.3 Design goals
The following are the design goals that are applicable to virtually every WebApp
regardless of application domain, size, or complexity.
 Simplicity
 Consistency
 Identity
 Visual appeal
 Compatibility.
The activities of the Design process:
 Interface design-describes the structure and organization of the user interface. Includes a
representation of the screen layout, a definition of the modes of interaction, and a
description of the navigation mechanisms. Interface control mechanisms-to implement
navigation options, the designer selects from one of several interaction mechanisms:
a. Navigation menus
b. Graphic icons
c. Graphic images
Interface design workflow-the workflow begins with the identification of user, task, and
environmental requirements.

 Aesthetic design-also called graphic design, describes the “look and feel” of the
WebApp. Includes color schemes, geometric layout, text size, font and placement, the
use of graphics, and related aesthetic decisions.

 Content design-defines the layout, structure, and outline for all content that is presented
as part of the WebApp. Establishes the relationships between content objects.

 Navigation design-represents the navigational flow between contents objects and for all
WebApp functions.


 Architecture design-identifies the overall hypermedia structure for the WebApp.
Architecture design is tied to the goals established for a WebApp, the content to be
presented, the users who will visit, and the navigation philosophy that has been
established.
 Content architecture focuses on the way content objects are structured for presentation
and navigation.
 WebApp architecture addresses the way the application is structured to manage user
interaction, handle internal processing tasks, effect navigation, and present content.
WebApp architecture is defined within the context of the development environment in
which the application is to be implemented.
 Component design-develops the detailed processing logic required to implement
functional components.

Fig 4.1 System Architecture


4.4. Flow chart diagram

It is important to complete all tasks and meet deadlines. Many project management tools are
available to help project managers manage their tasks and schedules, and one of them is the
flowchart.

A flowchart is one of the seven basic quality tools used in project management; it displays the
actions necessary to meet the goals of a particular task in the most practical sequence. Also
called process maps, this type of tool displays a series of steps with branching possibilities that
depict one or more inputs and transform them into outputs.

The advantage of flowcharts is that they show the activities involved in a project, including the
decision points, parallel paths, and branching loops, as well as the overall sequence of
processing, by mapping the operational details within the horizontal value chain. Moreover, this
tool is widely used in estimating and understanding the cost of quality for a particular process.
This is done by using the branching logic of the workflow and estimating the expected
monetary returns.


Fig 4.2 Flow Chart Diagram for data flow module.

4.5 Use case diagrams:

A use case is a set of scenarios describing an interaction between a source and a
destination. A use case diagram displays the relationships among actors and use cases. The two
main components of a use case diagram are use cases and actors. Fig 4.3 shows the use case diagram.


Fig 4.3 Use Case Diagram for User

4.6 Data flow diagram

A data flow diagram (DFD) is a graphic representation of the "flow" of data through an
information system. A data flow diagram can also be used for the visualization of data
processing (structured design). It is common practice for a designer to draw a context-level
DFD first, which shows the interaction between the system and outside entities. DFDs show
the flow of data from external entities into the system, how the data moves from one process
to another, as well as its logical storage. There are only four symbols:
1. Squares represent external entities, which are sources and destinations of information
entering and leaving the system.
2. Rounded rectangles represent processes (in other methodologies they may be called
'Activities', 'Actions', 'Procedures', 'Subsystems', etc.), which take data as input, process
it, and output it.
3. Arrows represent the data flows, which can be either electronic data or physical items.
Data cannot flow from data store to data store except via a process, and external entities
are not allowed to access data stores directly.
4. The flat three-sided rectangle represents data stores, which should both receive
information for storage and provide it for further processing.

4.6.1 Level 0 data flow diagram

Fig 4.4 Level 0 Data Flow Diagram


4.6.2 Level 1 data flow diagram

Fig 4.5 Level 1 Data Flow Diagram


CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 Modules
 Data Acquisition and Preprocessing
 Feature Selection and Data Preparation
 Model Construction and Model Training
 Model Validation and Result Analysis

5.2 Detailed Description of each module


5.2.1 Data Acquisition and Preprocessing
Machine learning needs two things to work: data (lots of it) and models. When acquiring
the data, be sure to have enough features (aspects of the data that can help with a prediction, like the
surface area of a house to predict its price) populated to train your learning model correctly. In
general, the more data you have the better, so make sure to come with enough rows.
The primary data collected from online sources remains in the raw form of
statements, digits and qualitative terms. The raw data contains errors, omissions and
inconsistencies, and requires correction after careful scrutiny of the completed questionnaires.
The following steps are involved in the processing of primary data. A huge volume of raw data
collected through a field survey needs to be grouped by similar details of individual responses.
Data preprocessing is a technique that is used to convert raw data into a clean data
set. In other words, whenever data is gathered from different sources it is collected in a raw
format which is not feasible for analysis.
Data preprocessing is necessary because of the presence of unformatted real-world data. Mostly,
real-world data is composed of -
Inaccurate data (missing data) - There are many reasons for missing data, such as data not
being collected continuously, mistakes in data entry, technical problems with biometrics, and much
more.
Noisy data (erroneous data and outliers) - The reasons for the existence of noisy data could
be a technological problem with the gadget that gathers the data, a human mistake during data
entry, and much more.


Inconsistent data - Inconsistencies arise for reasons such as the existence of duplication
within the data and human data entry mistakes in codes or names, i.e., violations of data
constraints, and much more.
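The cleaning steps above can be illustrated with a small pandas sketch on a made-up raw table; the column names are hypothetical, not the project's actual schema.

```python
# A made-up raw table with a missing value and a duplicated record
# (hypothetical columns, not the project's schema).
import pandas as pd

raw = pd.DataFrame({
    "Age":    [44.0, 27.0, None, 61.0, 61.0],
    "Smokes": ["yes", "no", "yes", "no", "no"],
    "Level":  ["High", "Low", "Medium", "Low", "Low"],
})

# Inconsistent data: drop the duplicated record.
clean = raw.drop_duplicates().reset_index(drop=True)
# Inaccurate data: fill the missing Age with the column median.
clean = clean.fillna({"Age": clean["Age"].median()})
print(len(clean), clean["Age"].isna().sum())  # 4 0
```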

5.2.2 Feature Selection and Data Preparation


Feature engineering is the process of using domain knowledge of the data to create features that
make machine learning algorithms work. If feature engineering is done correctly, it increases
the predictive power of machine learning algorithms by creating features from raw data that
help facilitate the machine learning process.
Feature engineering is the most important art in machine learning; it often makes the
difference between a good model and a bad model. Feature engineering is the process of
transforming raw data into features that better represent the underlying problem to the
predictive models, resulting in improved model accuracy on unseen data.
The process of organizing data into groups and classes on the basis of certain
characteristics is known as the classification of data. Classification helps in making
comparisons among the categories of observations. It can be done either according to numerical
characteristics or according to attributes. So here we need to visualize the prepared data to check
whether the training data contains the correct label, which is known as a target or target
attribute.
Next, we will slice a single data set into a training set and test set.
 Training set—a subset to train a model.
 Test set—a subset to test the trained model.

Make sure that your test set meets the following two conditions:
 Is large enough to yield statistically meaningful results.
 Is representative of the data set as a whole. In other words, don't pick a test set with
different characteristics than the training set.

Assuming that your test set meets the preceding two conditions, your goal is to create a model
that generalizes well to new data. Our test set serves as a proxy for new data.
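The slicing described above can be done with scikit-learn's `train_test_split`; here the 80/20 split is an illustrative choice, and `stratify=y` is one way to keep the test set representative of the whole.

```python
# Slicing one dataset into a training set and a test set (synthetic data;
# the 80/20 split is an illustrative choice).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, n_features=23, random_state=0)
# stratify=y keeps the class balance of the test set representative of the whole.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
print(X_train.shape, X_test.shape)  # (80, 23) (20, 23)
```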


Fig 5.1: Classification of training and test data

5.2.3 Model Construction and Model Training


The process of training an ML model involves providing an ML algorithm (that is, the
learning algorithm) with training data to learn from. The term ML model refers to the model
artifact that is created by the training process. The training data must contain the correct answer,
which is known as a target or target attribute. The learning algorithm finds patterns in the
training data that map the input data attributes to the target (the answer that you want to
predict), and it outputs an ML model that captures these patterns.
Gradient Boosted Decision Trees are another option for developing the predictive model.
Gradient Boosted Decision Trees can prove quite effective in handling both classification and
regression tasks. The approach is adaptable, easy to interpret, and attains precise results. In the
process of gradient boosting, a sequence of predictors is produced iteratively. The weighted
average of these predictors is calculated iteratively to generate the final predictor. At every step,
an additional classifier is invoked to boost the performance of the complete ensemble.
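The boosting process described above can be sketched with scikit-learn's gradient boosting classifier on synthetic data; the settings here are illustrative, not the project's tuned values.

```python
# Gradient boosting as described above: each stage adds a tree that corrects
# the current ensemble (synthetic data; illustrative settings).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
gbdt = GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                  max_depth=3, random_state=0).fit(X, y)
print(len(gbdt.estimators_))  # 50 boosting stages
print(f"training accuracy: {gbdt.score(X, y):.2f}")
```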


Logistic Regression:

Logistic regression is a statistical method used for binary classification problems, meaning it
predicts one of two possible outcomes (e.g., disease or no disease, fraud or no fraud). It is
widely used in medical diagnosis, finance, and many other domains.

Types of Logistic Regression

 Binary Logistic Regression: Used for two-class classification (e.g., cancer vs. no
cancer).
 Multinomial Logistic Regression: Used when there are more than two categories
without an order (e.g., predicting type of cancer: lung, breast, or skin).
 Ordinal Logistic Regression: Used when the categories have a natural order (e.g.,
cancer stages: Stage I, II, III, IV).

Assumptions of Logistic Regression

 Linearity in log-odds: The independent variables should have a linear relationship with
the logit function.
 Independence of observations: The data points should not be correlated (no
autocorrelation).
 No multicollinearity: Highly correlated independent variables can reduce model
accuracy (use Variance Inflation Factor to detect this).
 Sufficiently large sample size: A small dataset may lead to overfitting or underfitting.

Feature Engineering for Logistic Regression

 Interaction terms: Combining variables to capture relationships (e.g., Smoking × Age).


 Polynomial terms: Using squared or cubic terms to model non-linear effects.
 Standardization: Scaling features (e.g., age, BMI) to improve convergence and
interpretability.
 Encoding categorical variables: Converting categories into numerical values (e.g.,
one-hot encoding for gender).
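A minimal sketch of binary logistic regression with the standardization step mentioned above, assuming scikit-learn; the synthetic data stands in for the patient attributes.

```python
# Binary logistic regression on standardized features (synthetic data standing
# in for the patient attributes; the pipeline handles the scaling step).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
# predict_proba returns P(class 0) and P(class 1) for each sample.
print(model.predict_proba(X[:1]).shape)  # (1, 2)
```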

Decision Tree Algorithm:

A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It is a tree-like model where each internal node represents a decision based on
a feature, each branch represents an outcome, and each leaf node represents a class label (for
classification) or a numerical value (for regression).

How Decision Trees Work

1. Start with the Root Node


o The dataset is split based on the most significant feature.


2. Splitting Criteria
o The process continues by choosing the best feature at each node to maximize
information gain.
3. Stop Splitting (Leaf Nodes)
o The tree stops growing when it meets a stopping condition (e.g., maximum
depth, minimum samples per leaf).
4. Making Predictions
o For classification, the majority class in a leaf node is assigned as the predicted
class.
o For regression, the average of the values in the leaf node is used.

If you are new to the concept of a decision tree, here is a basic overview of the decision
tree.

Basic decision tree concept

The decision tree concept is closer to a rule-based system. Given the training dataset with targets
and features, the decision tree algorithm will come up with some set of rules. The same set of
rules can be used to perform the prediction on the test dataset.

Suppose you would like to predict whether your daughter will like a newly released animated
movie. To model the decision tree you will use a training dataset such as the animated
cartoon characters your daughter liked in past movies.

So once you pass the dataset, with the target being whether your daughter will like the movie or
not, to the decision tree classifier, the decision tree will start building the rules with the
characters your daughter likes as nodes and the targets (like or not) as the leaf nodes. By
considering the path from the root node to a leaf node, you can get the rules.

A simple rule could be: if some character x is playing the leading role, then your daughter will
like the movie. You can think of a few more rules based on this example.

Then, to predict whether your daughter will like the movie or not, you just need to check the
rules which were created by the decision tree.


In the decision tree algorithm, calculating these nodes and forming the rules is done using
information gain and Gini index calculations.

In the random forest algorithm, instead of using information gain or the Gini index for
calculating the root node, the process of finding the root node and splitting the feature nodes
happens randomly. We will look at this in detail in the coming section.

Next, you will learn why the random forest algorithm is preferred when there are other
classification algorithms to play with.
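The contrast can be sketched on toy data: a single depth-limited decision tree next to a random forest that averages many randomized trees (illustrative settings, not project code).

```python
# A single depth-limited decision tree vs. a random forest on the same toy
# data: the forest averages many randomized trees instead of one rule set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(tree.get_depth())     # at most 4: the stopping condition bounds the tree
print(forest.n_estimators)  # 100
```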

5.2.4 Model Validation and Result Analysis


In the testing phase the model is applied to a new set of data. The training and test data
are two different datasets. The goal in building a machine learning model is to have the
model perform well on the training set, as well as generalize well on new data in the test
set. Once the built model is tested, we will pass real-time data for the prediction. Once
prediction is done, we will analyze the output to find the crucial information.

Using the Django web framework to test the input:

In a traditional data-driven website, a web application waits for HTTP requests from the web
browser (or other client). When a request is received the application works out what is needed
based on the URL and possibly information in POST data or GET data. Depending on what is
required it may then read or write information from a database or perform other tasks required to
satisfy the request. The application will then return a response to the web browser, often
dynamically creating an HTML page for the browser to display by inserting the retrieved data
into placeholders in an HTML template.
Django web applications typically group the code that handles each of these steps into separate
files:


 URLs: While it is possible to process requests from every single URL via a single function, it is
much more maintainable to write a separate view function to handle each resource. A URL
mapper is used to redirect HTTP requests to the appropriate view based on the request URL. The
URL mapper can also match particular patterns of strings or digits that appear in an URL, and
pass these to a view function as data.
 View: A view is a request handler function, which receives HTTP requests and returns HTTP
responses. Views access the data needed to satisfy requests via models, and delegate the
formatting of the response to templates.
 Models: Models are Python objects that define the structure of an application's data, and
provide mechanisms to manage (add, modify, delete) and query records in the database.
 Templates: A template is a text file defining the structure or layout of a file (such as an HTML
page), with placeholders used to represent actual content. A view can dynamically create an
HTML page using an HTML template, populating it with data from a model. A template can be
used to define the structure of any type of file; it doesn't have to be HTML!
Note: Django refers to this organisation as the "Model View Template (MVT)" architecture.

5.2.5 Performance Metrics for Classification Problems

We have discussed classification and its algorithms in the previous chapters. Here, we are
going to discuss various performance metrics that can be used to evaluate predictions for
classification problems.

Confusion Matrix

It is the easiest way to measure the performance of a classification problem where the output
can be of two or more types of classes. A confusion matrix is nothing but a table with two
dimensions, viz. “Actual” and “Predicted”; furthermore, both dimensions have “True
Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)”, and “False Negatives (FN)” as
shown below –


The terms associated with the confusion matrix are explained as follows −

 True Positives (TP) − It is the case when both actual class & predicted class of data
point is 1.
 True Negatives (TN) − It is the case when both actual class & predicted class of data
point is 0.
 False Positives (FP) − It is the case when actual class of data point is 0 & predicted
class of data point is 1.
 False Negatives (FN) − It is the case when actual class of data point is 1 & predicted
class of data point is 0.

We can use the confusion_matrix function of sklearn.metrics to compute the confusion matrix of our
classification model.
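For example, on a small set of hand-made labels:

```python
# Computing the confusion matrix with sklearn.metrics on hand-made labels.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
# With labels ordered [0, 1], the matrix is [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3
```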

Classification Accuracy

It is the most common performance metric for classification algorithms. It may be defined as the
number of correct predictions made as a ratio of all predictions made. We can easily calculate
it from the confusion matrix with the help of the following formula −

Accuracy = (TP + TN) / (TP + FP + FN + TN)

We can use the accuracy_score function of sklearn.metrics to compute the accuracy of our
classification model.

Classification Report


This report consists of the scores of Precision, Recall, F1 and Support. They are explained as
follows −

Precision

Precision, used in document retrieval, may be defined as the fraction of results returned
by our ML model that are actually correct. We can easily calculate it from the confusion matrix
with the help of the following formula −

Precision = TP / (TP + FP)
Recall or Sensitivity

Recall may be defined as the fraction of actual positives returned by our ML model. We can
easily calculate it from the confusion matrix with the help of the following formula −

Recall = TP / (TP + FN)
Specificity

Specificity, in contrast to recall, may be defined as the fraction of actual negatives returned by
our ML model. We can easily calculate it from the confusion matrix with the help of the
following formula −

Specificity = TN / (TN + FP)
Support

Support may be defined as the number of samples of the true response that lies in each class of
target values.

F1 Score

This score gives us the harmonic mean of precision and recall. The best value of F1 is 1 and
the worst is 0. We can calculate the F1 score with the help of the following formula −

F1 = 2 * (precision * recall) / (precision + recall)

Precision and recall make an equal relative contribution to the F1 score.

We can use the classification_report function of sklearn.metrics to get the classification report of
our classification model.
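For example, on a small set of hand-made labels:

```python
# Accuracy and the full classification report from sklearn.metrics,
# on a small set of hand-made labels.
from sklearn.metrics import accuracy_score, classification_report

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(accuracy_score(y_true, y_pred))  # 0.75 (6 of 8 correct)
report = classification_report(y_true, y_pred)
print(report)  # precision, recall, f1-score and support per class
```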


Applications

The applications of the project are:

 Medical and research domain that diagnose and study lung cancer and the factors that
cause it.

 The system can be used at places like hospitals, research labs or even online.

CHAPTER 6


SYSTEM TESTING
Testing performs a very critical role for quality assurance and for ensuring the reliability
of the software. The success of testing for errors in programs depends critically on the test
cases.

6.1 Testing phase


The completion of the system is achieved only after it has been thoroughly tested. Though this
gives the feel that the project is complete, no project can be finished without going through this
stage. Hence in this stage it is decided whether the project can undergo real-time execution
without any breakdowns; the package can be rejected even at this stage.

A primary purpose of testing is to detect software failures so that defects may be
discovered and corrected. This is a non-trivial pursuit. Testing cannot establish that a product
functions properly under all conditions but can only establish that it does not function properly
under specific conditions. The scope of software testing often includes examination of code as
well as execution of that code in various environments and conditions as well as examining the
aspects of code: does it do what it is supposed to do and do what it needs to do. In the current
culture of software development, a testing organization may be separate from the development
team. There are various roles for testing team members. Information derived from software
testing may be used to correct the process by which software is developed.

6.2 System testing


Testing is a set of activities that can be planned in advance and conducted systematically. The
proposed system is tested in parallel with the software that consists of its own phases of its
analysis, implementation, testing and maintenance. Following are the tests conducted on the
system.

System testing of software or hardware is testing conducted on a complete, integrated
system to evaluate the system's compliance with its specified requirements. System testing falls
within the scope of black box testing, and as such, should require no knowledge of the inner
design of the code or logic.


6.3 Unit testing


During the implementation of the system, each module was tested separately to uncover
errors within its boundaries. The user interface was used as a guide in the process.

In computer programming, unit testing is a method by which individual units of source
code are tested to determine if they are fit for use. A unit is the smallest testable part of an
application. In procedural programming a unit may be an individual function or procedure. In
object-oriented programming a unit is usually a method. Unit tests are created by programmers
or occasionally by white box testers during the development process.

Ideally, each test case is independent from the others: substitutes like method stubs,
mock objects, fakes and test harnesses can be used to assist testing a module in isolation. Unit
tests are typically written and run by software developers to ensure that code meets its design
and behaves as intended. Its implementation can vary from being very manual (pencil and
paper) to being formalized as part of build automation.
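As a small illustration of the idea in Python's built-in unittest framework; the function under test, `encode_label`, is a hypothetical stand-in, not the project's code.

```python
# A tiny unit test in Python's built-in unittest style. The function under
# test, encode_label, is a hypothetical stand-in, not project code.
import unittest

def encode_label(level):
    """Map a risk level to the integer code used by the model."""
    return {"Low": 0, "Medium": 1, "High": 2}[level]

class TestEncodeLabel(unittest.TestCase):
    def test_known_levels(self):
        self.assertEqual(encode_label("Low"), 0)
        self.assertEqual(encode_label("High"), 2)

suite = unittest.TestLoader().loadTestsFromTestCase(TestEncodeLabel)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```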

6.4 Module testing

A module is composed of various programs related to that module. Module testing is done to
check the module functionality and interaction between units within a module. It checks the
functionality of each program with relation to other programs within the same module. It then
tests the overall functionality of each module.
This module introduces the technique of functional (black box) unit testing to verify the
correctness of classes. It shows how to design unit test cases based on a class specification
within a contract programming approach. The laboratory exercises then guide students through
creating and running tester classes in Java from a test case design, utilizing the JUnit test
framework. It also contains a worked example on how to unit test GUI and event-handling
classes.


Testing is an essential part of any software development process, but it is often poorly
understood. In this module, we take a simple approach and look at two key approaches to
testing. Unit testing is done at the class level, and is designed with the aid of method pre- and
post-conditions. System testing is done at the program level, and is designed based on
documented use cases. We look at how your approach to object-oriented design is influenced by
the need to design and execute tests, and close with some other issues in testing.

6.5 Integration testing


Integration testing is a systematic technique for constructing the program structure while
conducting tests to uncover errors associated with interfacing. The objective is to take unit-tested
modules and build a program structure that has been dictated by the design.
The purpose of integration testing is to verify functional, performance, and reliability
requirements placed on major design items. These "design items", i.e. assemblages (or groups
of units), are exercised through their interfaces using Black box testing, success and error cases
being simulated via appropriate parameter and data inputs. Simulated usage of shared data areas
and inter-process communication is tested and individual subsystems are exercised through their
input interface.
Test cases are constructed to test that all components within assemblages interact
correctly, for example across procedure calls or process activations, and this is done after
testing individual modules, i.e. unit testing. The overall idea is a "building block" approach, in
which verified assemblages are added to a verified base which is then used to support the
integration testing of further assemblages.

6.6 Acceptance testing


This software has been tested with realistic data given by the client and produced results
satisfying all the requirements they specified; the software has also been developed within the
specified time limitations. A demonstration has been given to the client and the end
user covering all the operational features.
Acceptance testing generally involves running a suite of tests on the completed system.
Each individual test, known as a case, exercises a particular operating condition of the user's


environment or feature of the system, and will result in a pass or fail, or Boolean, outcome.
There is generally no degree of success or failure. The test environment is usually designed to
be identical, or as close as possible, to the anticipated user's environment, including extremes of
such. These test cases must each be accompanied by test case input data or a formal description
of the operational activities (or both) to be performed—intended to thoroughly exercise the
specific case—and a formal description of the expected results.
6.6.1 Test plan
The test plan contains the following:
 Features to be tested.
 Approach to be tested.
 Test deliverables.

6.7 Features to be tested


All the functional features specified in the requirements document will be tested. Test coverage in
the test plan states which requirements will be verified during which stages of the product life.
Test coverage is derived from design specifications and other requirements, such as safety
standards or regulatory codes, where each requirement or specification of the design ideally will
have one or more corresponding means of verification.

Test coverage for different product life stages may overlap, but will not necessarily be
exactly the same for all stages. For example, some requirements may be verified during Design
Verification test, but not repeated during Acceptance test. Test coverage also feeds back into the
design process, since the product may have to be designed to allow test access.

Test methods in the test plan state how test coverage will be implemented. Test methods
may be determined by standards, regulatory agencies, or contractual agreement, or may have to
be created new. Test methods also specify test equipment to be used in the performance of the
tests and establish pass/fail criteria. Test methods used to verify hardware design requirements
can range from very simple steps, such as visual inspection, to elaborate test procedures that are
documented separately.

6.8 Approach for testing


For unit testing, structured testing based on branch coverage criteria will be used; the goal is to
achieve branch coverage of more than 95%. System testing will be functional in nature. A test
plan documents the strategy that will be used to verify and ensure that a product or system
meets its design specifications and other requirements. A test plan is usually prepared by, or
with significant input from, test engineers.

Depending on the product and the responsibility of the organization to which the test plan
applies, a test plan may include one or more of the following:

i. Design Verification or Compliance test - to be performed during the development or
approval stages of the product, typically on a small sample of units.
ii. Manufacturing or Production test - to be performed during preparation or assembly of
the product in an ongoing manner for purposes of performance verification and quality
control.
iii. Acceptance or Commissioning test - to be performed at the time of delivery or
installation of the product.
iv. Service and Repair test - to be performed as required over the service life of the
product.
v. Regression test - to be performed on an existing operational product, to verify that
existing functionality didn't get broken when other aspects of the environment are
changed (e.g., upgrading the platform on which an existing application runs).

6.9 Test deliverables

The following documents will be required:

a. Unit test report for each unit.
b. Test case specification for system testing.
c. Test reports for system testing.

Testing performs a very critical role for quality assurance and for ensuring the reliability
of the software. The success of testing for errors in programs depends critically on the test
cases.


Sl # Test Case: UTC-1

Name of Test: Dataset Load
Items being tested: Dataset
Sample Input: Folder containing the plain and damaged images
Expected output: All the images must be loaded into the Jupyter notebook
Actual output: As expected
Remarks: Pass

Table 6.1 Test case 1

Sl # Test Case: UTC-2

Name of Test: RF Model
Items being tested: RF Model
Sample Input: Training data (X_train, Y_train)
Expected output: When the model is fitted with the training data, epochs and batch size, training should run for the specified number of epochs and display the respective accuracy
Actual output: As expected
Remarks: Pass

Table 6.2 Test case 2

Sl # Test Case: UTC-3

Name of Test: Split Data
Items being tested: Split data
Sample Input: Read CSV content as a dataframe
Expected output: Data must be split into 80% and 20% partitions and exported as CSV
Actual output: As expected
Remarks: Pass

Table 6.3 Test case 3

Sl # Test Case: UTC-4

Name of Test: Random Forest Model test
Items being tested: RF Model test
Sample Input: Test CSV
Expected output: After training, the data must be tested with the predict function using X_test, which should return a predicted value
Actual output: Predicted value does not appear
Remarks: Failed

Table 6.4 Test case 4

Sl # Test Case: UTC-5

Name of Test: RF Model test
Items being tested: RF Model test
Sample Input: Test CSV
Expected output: After training, the data must be tested with the predict function using X_test, which should return a predicted value
Actual output: As expected
Remarks: Pass

Table 6.5 Test case 5
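The tabulated cases above can also be expressed as automatable checks. The sketch below covers UTC-1 (dataset load) and UTC-3 (80/20 split) against a small in-memory stand-in for LUNG_CANCER.csv; the column names and row counts are illustrative assumptions.

```python
# Automatable versions of UTC-1 and UTC-3 (in-memory stand-in for the CSV).
import io
import pandas as pd
from sklearn.model_selection import train_test_split

# UTC-1: the dataset loads without error and contains no missing values.
rows = ["Age,Level"] + [f"{30 + i},{i % 3 + 1}" for i in range(10)]
data = pd.read_csv(io.StringIO("\n".join(rows)))
assert not data.empty and data.isnull().sum().sum() == 0

# UTC-3: the data splits into 80% train and 20% test partitions.
train, test = train_test_split(data, test_size=0.2, random_state=0)
assert len(train) == 8 and len(test) == 2
```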

CHAPTER 7
CODE SNIPPETS


#Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#Data Preprocessing
data = pd.read_csv('LUNG_CANCER.csv')

data.head()
data.tail()
data.info()
data.describe()
data.isnull().any()
data.isnull().sum()

data['Level'].replace('High',3,inplace=True)
data['Level'].replace('Medium',2,inplace=True)
data['Level'].replace('Low',1,inplace=True)
data.head(10)

from sklearn.preprocessing import LabelEncoder


le = LabelEncoder()
data['Patient_Id'] = le.fit_transform(data['Patient_Id'])
data

#Data Visualization
plt.figure(figsize=(20,20))
sns.heatmap(data.corr().values, annot=True, center=0, xticklabels=data.columns,
            yticklabels=data.columns, cmap='YlGnBu')
plt.show()

sns.set(style="white", color_codes=True)
sns.jointplot(x='Age',y='Level',data=data)

plt.figure(figsize=(20,20), dpi=100)
sns.countplot(x='Age',hue='Level',data=data, palette='Set1')

#Split the data into dependent and independent variable


x = data.iloc[:,:24] #independent Variable
x

y = data.iloc[:,24:] #Dependent Variable


y

type(x)
type(y)
np.shape(x)
np.shape(y)
x.shape
y.shape

#Train Test Split


from sklearn.model_selection import train_test_split

# 80/20 split as specified in test case UTC-3
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

#Logistic Regression


from sklearn.linear_model import LogisticRegression


log = LogisticRegression()
log.fit(x_train,y_train)
logprediction = log.predict(x_test)
logprediction
y_test
#Accuracy Score
from sklearn.metrics import accuracy_score
logacc = accuracy_score(y_test, logprediction)
logacc

#Decision Tree Algorithm


from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier(criterion = 'entropy')
dtc.fit(x_train,y_train)
dtcprediction = dtc.predict(x_test)
dtcprediction
#Accuracy Score
from sklearn.metrics import accuracy_score
dtcacc = accuracy_score(dtcprediction,y_test)
dtcacc
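The Random Forest model exercised by test cases UTC-2, UTC-4 and UTC-5 is not shown in this listing. A sketch consistent with the surrounding pipeline might look as follows; synthetic data stands in for the train/test split built earlier, and the variable names are illustrative.

```python
# Sketch of the Random Forest step referenced by the Chapter 6 test cases;
# synthetic data replaces the project's x_train/x_test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, Y = make_classification(n_samples=300, n_features=24, n_informative=8,
                           n_classes=3, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(X, Y, test_size=0.2, random_state=42)

rfc = RandomForestClassifier(criterion='entropy', random_state=42)
rfc.fit(x_tr, y_tr)                  # UTC-2: training completes
rfcprediction = rfc.predict(x_te)    # UTC-5: predicted values are returned
rfcacc = accuracy_score(y_te, rfcprediction)
```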

#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
dtccm = confusion_matrix(dtcprediction,y_test)
dtccm
disp = ConfusionMatrixDisplay(confusion_matrix=dtccm,
display_labels=['High','Medium','Low'])


disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix for Lung Cancer using Decision Tree Classification")
plt.show()

#Test own data


x_test_df = pd.DataFrame([[1,155,40,60,31,1.349,50,60,11,150,10,50,31,1.3,50,60,1,55,40,60,31,1.3,50,60]])
predict = dtc.predict(x_test_df)
print(predict)

CHAPTER 8
SNAP SHOTS


FIGURE: 8.1

FIGURE: 8.2


FIGURE: 8.3


FIGURE: 8.4

FIGURE: 8.5

CHAPTER 9


CONCLUSION

We have provided an overview of the main approaches used for nodule classification and lung
cancer prediction from CT imaging data. In our experience, given sufficient training data, the
current state-of-the-art is achieved using CNNs trained with Deep Learning achieving a
classification performance in the region of low 90s AUC points. When evaluating system
performance, it is important to be aware of the limitations of the training and validation data
sets used, e.g., were the patients smokers or non-smokers, and were patients with a current or
prior history of malignancy included.

Given an apparent acceptable level of performance, the next stage is to test such CADx systems
in a clinical setting but before this can be done, we must first define the way in which the output
of the CADx should be utilized in clinical decision making. Who should use such a system and
how should it be integrated into their decisions? Should the algorithm produce an absolute risk
of malignancy, and how should this be expressed? Should it be incorporated into clinical opinion,
and how much weight should clinicians or patients lend to it? Should the algorithms be
incorporated into or designed to fit current guidelines such as Lung-RADS or the BTS
guidelines? If nodules are followed over time, should the algorithm incorporate changes in
nodule volume or should this be assessed separately? Is success defined by a reduction in the
numbers of false positive scans defined as those needing further follow up or intervention, or by
detecting all lung cancers and earlier than determined by following current guidelines? Who
should be compared to the algorithm when determining its value? Should the comparison be
experts or general radiologists, as it may be difficult to be significantly better than an expert but
may be of substantial help to a generalist, and most scans are not interpreted by experts?
Relatively little work has been done to address such questions.

CHAPTER 10


REFERENCES

• [1] “Comparative Study of Classifier for Chronic Kidney Disease Prediction using Naive
Bayes, KNN and Random Forest,” 3rd International Conference on Computing
Methodologies and Communication (ICCMC), 2019. DOI: 10.1109/ICCMC.2019.8819654.
• [2] L. Shoon et al., ‘‘Cancer recognition from DNA microarray gene expression data
using averaged one-dependence estimators,’’ Int. J. Cybern. Inform., vol. 3, no. 2, pp. 1–
10, 2014.
• [3] G. Russo, C. Zegar, and A. Giordano, ‘‘Advantages and limitations of microarray
technology in human cancer,’’ Oncogene, vol. 22, no. 42, pp. 6497–6507, 2013.
• [4] X. Wang and O. Gotoh, ‘‘Microarray-based cancer prediction using soft computing
approach,’’ Cancer Inform., vol. 7, Jan. 2009.
• [5] A. Bashetha and G. U. Srikanth, ‘‘Effective cancer detection using soft computing
technique,’’ IOSR J. Comput. Eng., vol. 17, no. 1, pp. 1–5, 2015.
• [6] F. Li, M. Huang, Y. Yang, and X. Zhu. Learning to identify review spam.
Proceedings of the 22nd International Joint Conference on Artificial Intelligence; IJCAI,
2011.
• [7] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh. Exploiting
burstiness in reviews for review spammer detection. In ICWSM, 2013, vol. 3, no. 2, pp.
1–10
• [8] A. J. Minnich, N. Chavoshi, A. Mueen, S. Luan, and M. Faloutsos. TrueView:
Harnessing the power of multiple review sites. In ACM WWW, 2015, vol. 3, no. 2, pp.
1–10.
• [9] B. Viswanath, M. Ahmad Bashir, M. Crovella, S. Guha, K. P. Gummadi, B.
Krishnamurthy, and A. Mislove. Towards detecting anomalous user behavior in online
social networks. In USENIX, 2014.
• [10] H. Li, Z. Chen, B. Liu, X. Wei, and J. Shao. Spotting fake reviews via collective
PU learning. In ICDM, 2014, vol. 3, no. 2, pp. 1–10
• [11] L. Akoglu, R. Chandy, and C. Faloutsos. Opinion fraud detection in online reviews
by network effects. In ICWSM, 2013, vol. 3, no. 2, pp. 1–10.


• [12] R. Shebuti and L. Akoglu. Collective opinion spam detection: bridging review
networks and metadata. In ACM KDD, 2015.
• [13] S. Feng, R. Banerjee and Y. Choi. Syntactic stylometry for deception detection.
Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics: Short Papers; ACL, 2012.
• [14] N. Jindal, B. Liu, and E.-P. Lim. Finding unusual review patterns using unexpected
rules. In ACM CIKM, 2012.
• [15] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw. Detecting product
review spammers using rating behaviors. In ACM CIKM, 2010.
• [16] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh.
Spotting opinion spammers using behavioral footprints. In ACM KDD, 2013.
• [17] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern
discovery. In ACM KDD, 2012.
• [18] G. Wang, S. Xie, B. Liu, and P. S. Yu. Review graph based online store review
spammer detection. IEEE ICDM, 2011.
• [19] Y. Sun and J. Han. Mining Heterogeneous Information Networks; Principles and
Methodologies, In ICCCE, 2012.

