Lung Cancer Report
CHAPTER 1
INTRODUCTION
Lung cancer is one of the most rapidly increasing causes of death worldwide owing to respiratory failure; its death rate has now exceeded that of breast cancer. The disease is characterized by the growth of uncontrolled cells. If it is not diagnosed at an early stage and treated before the second stage, the death rate in humans increases, because the malignant tissue spreads rapidly to other parts of the body such as the brain, heart, bones, glands and liver.
Early research shows that there is no established tool for the early detection of lung cancer in humans. There are two major types of lung cancer, non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). Over 85 percent of all lung cancers are non-small cell lung cancers, while about 13 percent are small cell lung cancers [2]. Staging of lung cancer is based on whether the cancer is local or has spread from the lungs to the lymph nodes or other organs. Because the lungs are large, tumors can grow in them for a long time before they are found. Even when symptoms such as coughing and fatigue occur, people often attribute them to other causes. For this reason, early-stage lung cancer (stages I and II) is difficult to detect, and most people with NSCLC are diagnosed only at stages III and IV.
With the rapid growth in the volume of data available today, there is a need to extract meaningful information from it. Machine learning is contributing to many application domains, including information technology, stock markets, marketing, healthcare and banking. Population growth has been accompanied by a rise in disease, which has increased the need for machine learning models that can support diagnosis using medical datasets. Among the diseases represented in biomedical datasets, cancer is the most widespread, killing over 7 million people every year, and lung cancer accounts for nearly 17% of these mortalities. Previous research shows that the survival rate of cancer patients is much higher when the disease is diagnosed at its initial stage, and early diagnosis of lung cancer depends strongly on historical data. This has motivated researchers to use machine learning techniques for the early diagnosis of lung cancer at stage 1. The five-year survival rate increases to about 70% when the disease is detected at stage 1, before the tumor has spread. Existing medical techniques such as X-ray, computed tomography (CT) scans, sputum cytology analysis and other imaging techniques not only require complex equipment and high cost, but have also proven to be effective mainly at stage 4, when the tumor has already metastasized to other parts of the body. Our proposed work uses machine learning techniques to classify lung cancer patients and to categorize the stage to which a positive case belongs. The work is aimed at the early diagnosis and prediction of lung cancer, helping doctors treat patients and thereby increase their survival rate.
The impact of Lung-RADS was analysed in a retrospective analysis of the NLST (3). Lung-RADS was shown to reduce the overall screening false positive rate to 12.8% at baseline and 5.3% at interval imaging, at the cost of a reduction in sensitivity from 93.5% in the NLST to 84.9% using Lung-RADS at baseline, and from 93.8% in the NLST to 84.9% using Lung-RADS after baseline. However, while Lung-RADS reduces the overall false positive rate, the false positive rate of positive screens, i.e., Lung-RADS 3 and above, remains very high.
One approach to address this problem is to adopt computer aided diagnosis (CADx)
technology as an aid to radiologists and pulmonary medicine physicians. Given an input CT and
possible additional relevant patient meta-data, such techniques aim to provide a quantitative
output related to the lung cancer risk.
One may consider the goal of such systems to be two-fold. First, to reduce the variability
in assessing and reporting the lung cancer risk between interpreting physicians. Indeed,
computer assisted approaches have been shown to improve consistency between physicians in a variety of clinical contexts, including nodule detection and mammography screening, and one might expect that such decision support tools could provide the same benefit in nodule classification. Second, CADx could improve classification performance by supporting the less
experienced or non-specialized clinicians in assessing the risk of a particular nodule being
malignant.
In this project, we review progress made towards the development and validation of
lung cancer prediction models and nodule classification CADx software. While we do not
intend this to be a comprehensive review, we do aim to provide an overview of the main
approaches taken to date and outline some of the challenges that remain to bring this technology
to routine clinical use.
Although machine learning is a field within computer science, it differs from traditional computational approaches. In traditional computing, algorithms are sets of explicitly programmed instructions used by computers to calculate or solve problems; machine learning algorithms instead allow computers to train on data inputs and use statistical analysis to produce their outputs.
Any technology user today has benefitted from machine learning. Facial recognition technology
allows social media platforms to help users tag and share photos of friends. Optical character
recognition (OCR) technology converts images of text into movable type. Recommendation
engines, powered by machine learning, suggest what movies or television shows to watch next
based on user preferences. Self-driving cars that rely on machine learning to navigate may soon
be available to consumers.
Machine learning is a continuously developing field. Because of this, there are some
considerations to keep in mind as you work with machine learning methodologies, or analyze
the impact of machine learning processes.
In this tutorial, we’ll look into the common machine learning methods of supervised and
unsupervised learning, and common algorithmic approaches in machine learning, including the
k-nearest neighbor algorithm, decision tree learning, and deep learning. We’ll explore which
programming languages are most used in machine learning, providing you with some of the
positive and negative attributes of each. Additionally, we’ll discuss biases that are perpetuated
by machine learning algorithms, and consider what can be kept in mind to prevent these biases
when building algorithms.
In machine learning, tasks are generally classified into broad categories. These categories are
based on how learning is received or how feedback on the learning is given to the system
developed.
Supervised Learning
In supervised learning, the computer is provided with example inputs that are labeled with their
desired outputs. The purpose of this method is for the algorithm to be able to “learn” by
comparing its actual output with the “taught” outputs to find errors, and modify the model
accordingly. Supervised learning therefore uses patterns to predict label values on additional
unlabeled data.
For example, with supervised learning, an algorithm may be fed data with images of sharks
labeled as fish and images of oceans labeled as water. By being trained on this data, the
supervised learning algorithm should be able to later identify unlabeled shark images as fish and
unlabeled ocean images as water.
A common use case of supervised learning is to use historical data to predict statistically likely
future events. It may use historical stock market information to anticipate upcoming
fluctuations, or be employed to filter out spam emails. In supervised learning, tagged photos of
dogs can be used as input data to classify untagged photos of dogs.
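As a rough illustration of this workflow, the sketch below (not part of the proposed system, and using scikit-learn's bundled breast cancer toy dataset purely as an example of labeled data) fits a k-nearest neighbour classifier on labeled examples and then scores it on held-out rows:

# Illustrative supervised-learning sketch on a labeled toy dataset
# (scikit-learn's breast cancer data, used here only as an example).
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)          # example inputs with desired outputs
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = KNeighborsClassifier(n_neighbors=5)          # k-nearest neighbour, mentioned above
model.fit(X_train, y_train)                          # "learn" from the labeled examples
print("Test accuracy:", model.score(X_test, y_test)) # predict labels for unseen rows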
Unsupervised Learning
Without being told a “correct” answer, unsupervised learning methods can look at complex data
that is more expansive and seemingly unrelated in order to organize it in potentially meaningful
ways. Unsupervised learning is often used for anomaly detection including for fraudulent credit
card purchases, and recommender systems that recommend what products to buy next. In
unsupervised learning, untagged photos of dogs can be used as input data for the algorithm to
find likenesses and classify dog photos together.
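A minimal unsupervised sketch, using synthetic two-dimensional points (assumed purely for illustration), shows k-means grouping unlabeled data without ever being given a "correct" answer:

# Illustrative unsupervised-learning sketch: k-means groups unlabeled points
# into clusters without being given a "correct" answer (synthetic toy data).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 1, (50, 2)),       # one unlabeled group of points
                    rng.normal(5, 1, (50, 2))])      # another unlabeled group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:5], kmeans.labels_[-5:])       # cluster assignments the model discovered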
[1] A. Bashetha and G. U. Srikanth, “Effective cancer detection using soft computing technique”
Cancer research is rudimentary research which is done to identify causes and develop strategies
for prevention, diagnosis, treatment and cure. An optimized solution for the better treatment of
cancer and toxicity minimization on the cancer patient is performed by identifying the exact
type of tumor. A clear cancer classification analysis system is required to get a clear picture on
the insight of a problem. A systematic approach to analyze global gene expression is followed
for identifying exact problem area. Molecular diagnostics provide a promising option of
systematic human cancer classification. But these types of tests are not mostly applied because
characteristics molecular markers have yet to be identified for most solid tumors. Recently,
DNA micro-array based tumor gene expression profiles have been used for cancer diagnosis. In
the proposed system, gene expressions are taken from multiple sources and an ontological store
is created. An ant colony optimization technique is used to analyze the clusters of data with an attribute-match association rule for detecting cancer using the acquired knowledge.
[2] X. Wang and O. Gotoh “Microarray-based cancer prediction using soft computing
approach”
One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes to construct accurate prediction models from thousands or tens of thousands of genes. We screen highly discriminative genes and gene pairs to create simple prediction models based on single genes or gene pairs, using a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancerous gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple and effective.
CHAPTER-2
SYSTEM REQUIREMENT SPECIFICATION
RAM: 4 GB
Software: Anaconda
IDE: Jupyter Notebook
The choice of data depends entirely on the problem you are trying to solve. Picking the right data must be your goal; luckily, almost every topic you can think of has several public, free datasets.
Two useful websites for dataset hunting are:
1. Kaggle, which is very well organized. Its datasets are described in detail, with information on the features, data types and number of records, and you can use its kernels without having to download the dataset.
2. The UCI Machine Learning Repository, which maintains 468 data sets as a service to the machine learning community.
Data preprocessing is one of the hardest steps and the one that will probably take the longest, unless you are lucky enough to have a complete, clean dataset, which is rarely the case. Handling missing data in the wrong way can cause disasters; one common remedy is multiple imputation, sketched below.
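The sketch below uses hypothetical columns to show the basic idea of filling in missing values before modelling. SimpleImputer performs single mean imputation; scikit-learn's IterativeImputer offers a chained-equations style approximation of multiple imputation.

# Sketch of handling missing values before modelling (hypothetical columns).
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({"Age": [44, np.nan, 62, 51],
                   "Smokes": [1, 0, np.nan, 1]})

imputer = SimpleImputer(strategy="mean")              # replace NaN with the column mean
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(filled)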
There should be functionality to split the data into training and test sets.
Using the training data, a random forest algorithm has to build the validation model.
Once the validation model is saved, it has to accept test data and produce a result indicating whether the record is lung cancer or not.
The records given as input are used to predict whether the patient has lung cancer or not.
Finally, the result can be viewed through the Django framework. The Django distribution also bundles a number of applications in its "contrib" package, including an administrative interface and an authentication system.
1.4.1 Reliability
The system must be reliable and robust in providing its functionalities. Changes must be made visible by the system when a user has reported some enhancements, and the changes made by the programmer must be visible to the project leader as well as the test engineer.
1.4.2 Maintainability
The system will be used by many employees simultaneously. Since the system will be hosted on a single web server with a single database server behind the scenes, performance becomes a major concern. The system should not give way when many users use it at the same time, and it should allow quick access to every one of its users. For instance, if two test engineers are simultaneously trying to report the presence of a bug, there should not be any inconsistency.
1.4.4 Portability
2.5 Summary
Chapter 2 covers all the system requirements that are required to develop the proposed system. The specific requirements for this project have been explained in section 2.1.
The hardware requirements for this project have been explained in section 2.2. The backend
software is clearly explained in section 2.3. The functional requirements are explained in the
section 2.4.
CHAPTER-3
METHODOLOGY
3.1 Motivation
Lung cancer is one of the most rapidly increasing causes of death worldwide owing to respiratory failure; its death rate has now exceeded that of breast cancer. The disease is characterized by the growth of uncontrolled cells, and if it is not diagnosed at an early stage and treated before the second stage, the death rate in humans increases. Machine learning is contributing to many application domains, including information technology, stock markets, marketing, healthcare and banking. Population growth has been accompanied by a rise in disease, which has increased the need for machine learning models that can support diagnosis using medical datasets.
3.2 Objectives
To develop an intelligent machine learning model which helps to predict lung cancer.
To directly integrate the tool with the medical records of an individual in order to automatically analyze them in the background and alert the doctor when the risk increases.
Input: Twenty-three features that are essential for lung cancer diagnosis.
Processing:
3.4 Advantages
The ensemble model predicts more accurate results.
Higher performance compared to the existing system.
More accurate and reliable results.
CHAPTER 4
HIGH LEVEL DESIGN
4.1 System design
System architecture design identifies the overall hypermedia structure for the WebApp. Architecture design is tied to the goals established for a WebApp, the content to be presented, the users who will visit, and the navigation philosophy that has been established. Content architecture focuses on the manner in which content objects are structured for presentation and navigation. WebApp architecture addresses the manner in which the application is structured to manage user interaction, handle internal processing tasks, effect navigation, and present content. WebApp architecture is defined within the context of the development environment in which the application is to be implemented.
These principles argue that a Web engineer must design an interface so that it answers three primary questions for the end user:
Where am I? – The interface should (1) provide an indication of which WebApp has been accessed and (2) inform the user of her location in the content.
What can I do now? – The interface should always help the user understand his current options: what functions are available, what links are live, what content is relevant.
Where have I been; where am I going? – The interface must facilitate navigation. Hence it must provide a “map” of where the user has been and what paths may be taken to move elsewhere in the WebApp.
4.3 Design goals
The following are the design goals that are applicable to virtually every WebApp
regardless of application domain, size, or complexity.
Simplicity
Consistency
Identity
Visual appeal
Compatibility.
The activities of the Design process:
Interface design – describes the structure and organization of the user interface. It includes a representation of screen layout, a definition of the modes of interaction, and a description of navigation mechanisms. Interface control mechanisms – to implement navigation options, the designer selects from one of several interaction mechanisms:
a. Navigation menus
b. Graphic icons
c. Graphic images
Interface design workflow – the workflow begins with the identification of user, task, and environmental requirements.
Aesthetic design – also called graphic design, describes the “look and feel” of the WebApp. It includes color schemes, geometric layout, text size, font and placement, the use of graphics, and related aesthetic decisions.
Content design – defines the layout, structure, and outline for all content that is presented as part of the WebApp, and establishes the relationships between content objects.
Navigation design – represents the navigational flow between content objects and for all WebApp functions.
Fig 4.1 System Architecture
It is important to complete all tasks and meet deadlines. There are many project management
tools that are available to help project managers manage their tasks and schedule and one of
them is the flowchart.
A flowchart is one of the seven basic quality tools used in project management and it displays
the actions that are necessary to meet the goals of a particular task in the most practical
sequence. Also called process maps, this type of tool displays a series of steps with branching possibilities that depict one or more inputs and transforms them into outputs.
The advantage of flowcharts is that they show the activities involved in a project, including the decision points, parallel paths and branching loops, as well as the overall sequence of processing, by mapping the operational details within the horizontal value chain. Moreover, this particular tool is very useful in estimating and understanding the cost of quality for a particular process. This is done by using the branching logic of the workflow and estimating the expected monetary returns.
A use case is a set of scenarios that describe an interaction between a source and a destination. A use case diagram displays the relationships among actors and use cases; its two main components are use cases and actors. The figure shows the use case diagram.
A data flow diagram (DFD) is a graphic representation of the "flow" of data through an information system. A data flow diagram can also be used for the visualization of data processing (structured design). It is common practice for a designer to draw a context-level DFD first, which shows the interaction between the system and outside entities. DFDs show the flow of data from external entities into the system, how the data moves from one process to another, as well as its logical storage. There are only four symbols:
1. Squares representing external entities, which are sources and destinations of information entering and leaving the system.
2. Rounded rectangles representing processes, which in other methodologies may be called 'Activities', 'Actions', 'Procedures', 'Subsystems' etc.; they take data as input, process it, and output it.
3. Arrows representing the data flows, which can be either electronic data or physical items. Data cannot flow from data store to data store except via a process, and external entities are not allowed to access data stores directly.
4. The flat three-sided rectangle representing data stores, which should both receive information for storage and provide it for further processing.
CHAPTER 5
SYSTEM IMPLEMENTATION
5.1 Modules
Data Acquisition and Preprocessing
Feature Selection and Data Preparation
Model Construction and Model Training
Model Validation and Result Analysis
Inconsistent data – The presence of inconsistencies is due to reasons such as duplication within the data, human data-entry errors, mistakes in codes or names, i.e., violations of data constraints, and much more.
Make sure that your test set meets the following two conditions:
It is large enough to yield statistically meaningful results.
It is representative of the data set as a whole; in other words, don't pick a test set with different characteristics than the training set.
Assuming that your test set meets the preceding two conditions, your goal is to create a model that generalizes well to new data. The test set serves as a proxy for new data. A minimal hold-out split is sketched below.
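The sketch below assumes the LUNG_CANCER.csv file and the 'Level' target column used in the Chapter 7 snippets; how the remaining columns are handled is an assumption made only for illustration.

# Hold-out split sketch (file name and 'Level' column taken from Chapter 7).
import pandas as pd
from sklearn.model_selection import train_test_split

data = pd.read_csv("LUNG_CANCER.csv")
X = data.drop(columns=["Level"])                 # assumed feature columns
y = data["Level"]                                # assumed target column

# stratify=y keeps class proportions similar in train and test splits,
# so the test set stays representative of the data set as a whole
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
print(len(X_train), "training rows,", len(X_test), "test rows")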
Logistic Regression:
Logistic regression is a statistical method used for binary classification problems, meaning it
predicts one of two possible outcomes (e.g., disease or no disease, fraud or no fraud). It is
widely used in medical diagnosis, finance, and many other domains.
Binary Logistic Regression: Used for two-class classification (e.g., cancer vs. no
cancer).
Multinomial Logistic Regression: Used when there are more than two categories
without an order (e.g., predicting type of cancer: lung, breast, or skin).
Ordinal Logistic Regression: Used when the categories have a natural order (e.g.,
cancer stages: Stage I, II, III, IV).
Linearity in log-odds: The independent variables should have a linear relationship with
the logit function.
Independence of observations: The data points should not be correlated (no
autocorrelation).
No multicollinearity: Highly correlated independent variables can reduce model
accuracy (use Variance Inflation Factor to detect this).
Sufficiently large sample size: A small dataset may lead to overfitting or underfitting.
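A minimal scikit-learn sketch of binary logistic regression is given below; the synthetic two-class data is assumed purely for illustration (the report's own dataset has three 'Level' classes).

# Minimal binary logistic-regression sketch on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("P(class 1) for 3 test rows:", clf.predict_proba(X_test[:3])[:, 1])
print("Test accuracy:", clf.score(X_test, y_test))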
A Decision Tree is a supervised machine learning algorithm used for both classification and
regression tasks. It is a tree-like model where each internal node represents a decision based on
a feature, each branch represents an outcome, and each leaf node represents a class label (for
classification) or a numerical value (for regression).
If you are new to the concept of a decision tree, here is a basic overview. The decision tree approach is close to a rule-based system: given a training dataset with targets and features, the decision tree algorithm comes up with a set of rules, and the same set of rules can be used to perform prediction on the test dataset.
Suppose you would like to predict whether your daughter will like a newly released animated movie. To model the decision tree you would use a training dataset such as the animated characters your daughter liked in past movies. Once you pass this dataset, with the target being whether your daughter will like the movie or not, to the decision tree classifier, the tree will build its rules with the characters your daughter likes as internal nodes and the targets (like or not) as leaf nodes. By considering the path from the root node to a leaf node, you can read off the rules. A simple rule could be: if character x is playing the leading role, then your daughter will like the movie; you can think of a few more rules based on this example. Then, to predict whether your daughter will like the newly released movie, you just need to check the rules created by the decision tree.
In the decision tree algorithm, calculating these nodes and forming the rules is done using information gain and Gini index calculations, as sketched below.
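The following sketch uses a scikit-learn toy dataset (an assumption for illustration only); the criterion argument selects the impurity measure, "gini" for the Gini index or "entropy" for information gain, and export_text prints the learned if/then rules.

# Decision-tree sketch on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=1)
tree.fit(X_train, y_train)
print(export_text(tree, max_depth=2))                 # the learned if/then rules
print("Test accuracy:", tree.score(X_test, y_test))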
In the random forest algorithm, instead of using information gain or the Gini index on the full feature set to choose the root node, the root node and the splitting feature nodes are chosen from randomly selected subsets of the features. We will look at this in detail in the coming section.
Next, you are going to learn why the random forest algorithm is preferred when we have other classification algorithms to play with. A minimal sketch is given below.
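The sketch below again uses a toy dataset (an assumption for illustration); each tree in the ensemble is grown on a bootstrap sample with a random subset of features considered at every split.

# Random-forest sketch on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=1)
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))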
In the testing phase the model is applied to a new set of data. The training and test data are two different datasets. The goal in building a machine learning model is to have the model perform well on the training set, as well as generalize well on new data in the test set. Once the built model is tested, we pass real-time data for prediction, and once prediction is done we analyze the output to find the crucial information.
In a traditional data-driven website, a web application waits for HTTP requests from the web
browser (or other client). When a request is received the application works out what is needed
based on the URL and possibly information in POST data or GET data. Depending on what is
required it may then read or write information from a database or perform other tasks required to
satisfy the request. The application will then return a response to the web browser, often
dynamically creating an HTML page for the browser to display by inserting the retrieved data
into placeholders in an HTML template.
Django web applications typically group the code that handles each of these steps into separate
files:
URLs: While it is possible to process requests from every single URL via a single function, it is
much more maintainable to write a separate view function to handle each resource. A URL
mapper is used to redirect HTTP requests to the appropriate view based on the request URL. The
URL mapper can also match particular patterns of strings or digits that appear in a URL and pass these to a view function as data.
View: A view is a request handler function, which receives HTTP requests and returns HTTP
responses. Views access the data needed to satisfy requests via models, and delegate the
formatting of the response to templates.
Models: Models are Python objects that define the structure of an application's data, and
provide mechanisms to manage (add, modify, delete) and query records in the database.
Templates: A template is a text file defining the structure or layout of a file (such as an HTML
page), with placeholders used to represent actual content. A view can dynamically create an
HTML page using an HTML template, populating it with data from a model. A template can be
used to define the structure of any type of file; it doesn't have to be HTML!
Note: Django refers to this organisation as the "Model View Template (MVT)" architecture.
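A brief illustrative sketch of this flow follows; the "predict" view, the "predict.html" template and the URL path are hypothetical names, not part of the proposed system.

# Illustrative Django sketch: a URL pattern routes the HTTP request to a view,
# and the view renders a template with context data.

# --- views.py ---------------------------------------------------------------
from django.shortcuts import render

def predict(request):
    # Request handler: gather data and delegate layout to a template.
    context = {"result": "Low risk"}            # placeholder value, not a real prediction
    return render(request, "predict.html", context)

# --- urls.py ----------------------------------------------------------------
from django.urls import path
# from .views import predict                   # in a real project the view is imported here

urlpatterns = [
    path("predict/", predict, name="predict"),
]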
We have discussed classification and its algorithms in the previous chapters. Here, we are
going to discuss various performance metrics that can be used to evaluate predictions for
classification problems.
Confusion Matrix
It is the easiest way to measure the performance of a classification problem where the output can be of two or more types of classes. A confusion matrix is nothing but a table with two dimensions, viz. “Actual” and “Predicted”; furthermore, both dimensions contain the counts of “True Positives (TP)”, “True Negatives (TN)”, “False Positives (FP)” and “False Negatives (FN)”, as shown below.
The terms associated with a confusion matrix are explained as follows −
True Positives (TP) − It is the case when both actual class & predicted class of data
point is 1.
True Negatives (TN) − It is the case when both actual class & predicted class of data
point is 0.
False Positives (FP) − It is the case when actual class of data point is 0 & predicted
class of data point is 1.
False Negatives (FN) − It is the case when actual class of data point is 1 & predicted
class of data point is 0.
We can use the confusion_matrix function of sklearn.metrics to compute the confusion matrix of our classification model.
Classification Accuracy
It is the most common performance metric for classification algorithms. It may be defined as the number of correct predictions made as a ratio of all predictions made. We can easily calculate it from the confusion matrix with the help of the following formula −
Accuracy = (TP + TN) / (TP + FP + FN + TN)
Classification Report
This report consists of the scores of Precisions, Recall, F1 and Support. They are explained as
follows −
Precision
Precision, commonly used in document retrieval, may be defined as the proportion of results returned by our ML model that are actually correct. We can easily calculate it from the confusion matrix with the help of the following formula −
Precision = TP / (TP + FP)
Recall or Sensitivity
Recall may be defined as the proportion of actual positives that are returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −
Recall = TP / (TP + FN)
Specificity
Specificity, in contrast to recall, may be defined as the proportion of actual negatives that are returned by our ML model. We can easily calculate it from the confusion matrix with the help of the following formula −
Specificity = TN / (TN + FP)
Support
Support may be defined as the number of samples of the true response that lies in each class of
target values.
F1 Score
This score gives us the harmonic mean of precision and recall; mathematically, the F1 score is the weighted average of precision and recall. The best value of F1 is 1 and the worst is 0. We can calculate the F1 score with the help of the following formula −
F1 = 2 * (precision * recall) / (precision + recall)
We can use the classification_report function of sklearn.metrics to get the classification report of our classification model, as sketched below.
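The sketch below assumes true labels y_test and model predictions y_pred are already available; the toy values shown here are illustrative only.

# Computing the metrics above with scikit-learn on toy labels.
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

y_test = [1, 0, 1, 1, 0, 1, 0, 0]             # toy ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]             # toy predicted labels

print(confusion_matrix(y_test, y_pred))        # rows = actual class, columns = predicted class
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))   # precision, recall, F1-score and support per class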
Applications
Medical and research domains that diagnose and study lung cancer and the factors that cause it.
The system can be used at places like hospitals, research labs or even online.
CHAPTER 6
SYSTEM TESTING
Testing performs a very critical role for quality assurance and for ensuring the reliability
of the software. The success of testing for errors in programs depends critically on the test
cases.
Ideally, each test case is independent from the others: substitutes like method stubs,
mock objects, fakes and test harnesses can be used to assist testing a module in isolation. Unit
tests are typically written and run by software developers to ensure that code meets its design
and behaves as intended. Its implementation can vary from being very manual (pencil and
paper) to being formalized as part of build automation.
A module is composed of various programs related to that module. Module testing is done to
check the module functionality and interaction between units within a module. It checks the
functionality of each program with relation to other programs within the same module. It then
tests the overall functionality of each module.
This module introduces the technique of functional (black box) unit testing to verify the
correctness of classes. It shows how to design unit test cases based on a class specification
within a contract programming approach. The laboratory exercises then guide students through creating and running tester classes in Java from a test case design, utilizing the JUnit test framework. It also contains a worked example on how to unit test GUI and event handling
classes.
Testing is an essential part of any software development process, but it is often poorly
understood. In this module, we take a simple approach and look at two key approaches to
testing. Unit testing is done at the class level, and is designed with the aid of method pre- and
post-conditions. System testing is done at the program level, and is designed based on
documented use cases. We look at how your approach to object-oriented design is influenced by
the need to design and execute tests, and close with some other issues in testing.
Each test case targets a particular environment or feature of the system and will result in a pass or fail, or Boolean, outcome. There is generally no degree of success or failure. The test environment is usually designed to be identical, or as close as possible, to the anticipated user's environment, including extremes of such. These test cases must each be accompanied by test case input data or a formal description of the operational activities (or both) to be performed, intended to thoroughly exercise the specific case, and a formal description of the expected results.
6.6.1 Test plan
The test plan contains the following:
Features to be tested.
Approach to testing.
Test deliverables.
Test coverage for different product life stages may overlap, but will not necessarily be
exactly the same for all stages. For example, some requirements may be verified during Design
Verification test, but not repeated during Acceptance test. Test coverage also feeds back into the
design process, since the product may have to be designed to allow test access.
Test methods in the test plan state how test coverage will be implemented. Test methods
may be determined by standards, regulatory agencies, or contractual agreement, or may have to
be created new. Test methods also specify test equipment to be used in the performance of the
tests and establish pass/fail criteria. Test methods used to verify hardware design requirements
can range from very simple steps, such as visual inspection, to elaborate test procedures that are
documented separately.
For unit testing, structured testing based on branch coverage criteria will be used; the goal is to achieve branch coverage of more than 95%. System testing will be functional in nature. A test plan documents the strategy that will be used to verify and ensure that a product or system meets its design specifications and other requirements. A test plan is usually prepared by, or with significant input from, test engineers.
Depending on the product and the responsibility of the organization to which the test plan
applies, a test plan may include one or more of the following:
Table 6.1 Test case 1 – Actual output: as expected. Remarks: Pass.
Table 6.2 Test case 2 – Remarks: Pass.
Table 6.3 Test case 3 – Actual output: as expected. Remarks: Pass.
Table 6.4 Test case 4 – Remarks: Failed.
Actual output: as expected. Remarks: Pass.
CHAPTER 7
CODE SNIPPETS
#Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Data Preprocessing
data = pd.read_csv('LUNG_CANCER.csv')
data.head()
data.tail()
data.info()
data.describe()
data.isnull().any()
data.isnull().sum()
data['Level'] = data['Level'].replace({'High': 3, 'Medium': 2, 'Low': 1})
data.head(10)
#Data Visualization
plt.figure(figsize=(20,20))
sns.heatmap(data.corr().values, annot=True, center=0, xticklabels=data.columns,
yticklabels=data.columns, cmap='YlGnBu')
plt.show()
sns.set(style="white", color_codes=True)
sns.jointplot(x='Age',y='Level',data=data)
plt.figure(figsize=(20,20), dpi=100)
sns.countplot(x='Age',hue='Level',data=data, palette='Set1')
# Features (x) and target (y); column names are assumed from the dataset,
# and only numeric columns are kept for modelling
x = data.drop(['Level'], axis=1).select_dtypes(include='number')
y = data['Level']
type(x)
type(y)
np.shape(x)
np.shape(y)
x.shape
y.shape
#Train/test split and model training (the original snippet omitted these steps,
#so the lines below are an assumed reconstruction needed to define dtcprediction)
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

#Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(x_train, y_train)
lrprediction = lr.predict(x_test)

#Decision Tree (used for the confusion matrix below)
dtc = DecisionTreeClassifier(random_state=42)
dtc.fit(x_train, y_train)
dtcprediction = dtc.predict(x_test)

#Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
#Convention is confusion_matrix(y_true, y_pred)
dtccm = confusion_matrix(y_test, dtcprediction)
dtccm
#Level was encoded as Low=1, Medium=2, High=3, so the classes appear in that order
disp = ConfusionMatrixDisplay(confusion_matrix=dtccm,
                              display_labels=['Low', 'Medium', 'High'])
disp.plot(cmap=plt.cm.Blues)
plt.title("Confusion Matrix for Lung Cancer using Decision Tree Classification")
plt.show()
CHAPTER 8
SNAP SHOTS
FIGURE: 8.1
FIGURE: 8.2
FIGURE: 8.3
FIGURE: 8.4
FIGURE: 8.5
CHAPTER 9
CONCLUSION
We have provided an overview of the main approaches used for nodule classification and lung
cancer prediction from CT imaging data. In our experience, given sufficient training data, the
current state-of-the-art is achieved using CNNs trained with Deep Learning achieving a
classification performance in the region of low 90s AUC points. When evaluating system
performance, it is important to be aware of the limitations or otherwise of the training and
validation data sets used, i.e., were the patients smokers or non-smokers, and were patients with a current or prior history of malignancy included.
Given an apparent acceptable level of performance, the next stage is to test such CADx systems
in a clinical setting but before this can be done, we must first define the way in which the output
of the CADx should be utilized in clinical decision making. Who should use such a system and
how should it be integrated into their decisions? Should the algorithm produce an absolute risk
of malignancy, and how should this be expressed? Should it be incorporated into clinical opinion, and how much weight should clinicians or patients lend to it? Should the algorithms be
incorporated into or designed to fit current guidelines such as Lung-RADS or the BTS
guidelines? If nodules are followed over time, should the algorithm incorporate changes in
nodule volume or should this be assessed separately? Is success defined by a reduction in the
numbers of false positive scans defined as those needing further follow up or intervention, or by
detecting all lung cancers and earlier than determined by following current guidelines? Who
should be compared to the algorithm when determining its value? Should the comparison be
experts or general radiologists, as it may be difficult to be significantly better than an expert but
may be of substantial help to a generalist, and most scans are not interpreted by experts?
Relatively little work has been done to address such questions.
CHAPTER 10
REFERENCES
• [1] “Comparative Study of Classifier for Chronic Kidney Disease prediction using Naive Bayes, KNN and Random Forest,” 3rd International Conference on Computing Methodologies and Communication (ICCMC), 2019. DOI: 10.1109/ICCMC.2019.8819654.
• [2] L. Shoon et al., ‘‘Cancer recognition from DNA microarray gene expression data
using averaged one-dependence estimators,’’ Int. J. Cybern. Inform., vol. 3, no. 2, pp. 1–
10, 2014.
• [3] G. Russo, C. Zegar, and A. Giordano, ‘‘Advantages and limitations of microarray
technology in human cancer,’’ Oncogene, vol. 22, no. 42, pp. 6497–6507, 2013.
• [4] X. Wang and O. Gotoh, ‘‘Microarray-based cancer prediction using soft computing
approach,’’ Cancer Inform., vol. 7, Jan. 2009.
• [5] A. Bashetha and G. U. Srikanth, ‘‘Effective cancer detection using soft computing
technique,’’ IOSR J. Comput. Eng., vol. 17, no. 1, pp. 1–5, 2015.
• [6] F. Li, M. Huang, Y. Yang, and X. Zhu. Learning to identify review spam.
Proceedings of the 22nd International Joint Conference on Artificial Intelligence; IJCAI,
2011.
• [7] G. Fei, A. Mukherjee, B. Liu, M. Hsu, M. Castellanos, and R. Ghosh. Exploiting
burstiness in reviews for review spammer detection. In ICWSM, 2013, vol. 3, no. 2, pp.
1–10
• [8] A. J. Minnich, N. Chavoshi, A. Mueen, S. Luan, and M. Faloutsos. TrueView: Harnessing the power of multiple review sites. In ACM WWW, 2015, vol. 3, no. 2, pp. 1–10.
• [9] B. Viswanath, M. Ahmad Bashir, M. Crovella, S. Guha, K. P. Gummadi, B. Krishnamurthy, and A. Mislove. Towards detecting anomalous user behavior in online social networks. In USENIX, 2014.
• [10] H. Li, Z. Chen, B. Liu, X. Wei, and J. Shao. Spotting fake reviews via collective
PU learning. In ICDM, 2014, vol. 3, no. 2, pp. 1–10
• [11] L. Akoglu, R. Chandy, and C. Faloutsos. Opinion fraud detection in online reviews by network effects. In ICWSM, 2013, vol. 3, no. 2, pp. 1–10.
• [12] R. Shebuti and L. Akoglu. Collective opinion spam detection: bridging review networks and metadata. In ACM KDD, 2015.
• [13] S. Feng, R. Banerjee and Y. Choi. Syntactic stylometry for deception detection.
Proceedings of the 50th Annual Meeting of the Association for Computational
Linguistics: Short Papers; ACL, 2012.
• [14] N. Jindal, B. Liu, and E.-P. Lim. Finding unusual review patterns using unexpected
rules. In ACM CIKM, 2012.
• [15] E.-P. Lim, V.-A. Nguyen, N. Jindal, B. Liu, and H. W. Lauw. Detecting product
review spammers using rating behaviors. In ACM CIKM, 2010.
• [16] A. Mukherjee, A. Kumar, B. Liu, J. Wang, M. Hsu, M. Castellanos, and R. Ghosh. Spotting opinion spammers using behavioral footprints. In ACM KDD, 2013.
• [17] S. Xie, G. Wang, S. Lin, and P. S. Yu. Review spam detection via temporal pattern discovery. In ACM KDD, 2012.
• [18] G. Wang, S. Xie, B. Liu, and P. S. Yu. Review graph based online store review
spammer detection. IEEE ICDM, 2011.
• [19] Y. Sun and J. Han. Mining Heterogeneous Information Networks: Principles and Methodologies. In ICCCE, 2012.