Lung Cancer Detection Report
SUBMITTED BY
CERTIFICATE
This is to certify that the project entitled “LUNG CANCER PREDICTION” is
a bonafide record of the work done by NAYANA M.S(Reg no: 2200805),
POOJA M(Reg no: 2200806), ADITHYA ANIL(Reg no: 2200817), RAKESH
R(Reg no: 2200828) in partial fulfillment of the requirements for the award
of Bachelor of Science Degree in Computer Science by the University of
Kerala.
EXTERNAL
EXAMINERS
1.
2.
ACKNOWLEDGMENT
We would like to express our genuine and heartfelt gratitude to everyone who has assisted us in
this quest. Without their active advice, help, collaboration and encouragement, we would not have
made headway in the project.
First and foremost, we thank God for keeping us safe and well so that we could successfully finish
the first part of our project.
We would like to convey our heartfelt gratitude and respect to our Principal Dr. JIJIMON K.
THOMAS and Director Dr. K. OOMMACHAN for providing all essential facilities.
We would like to express our gratitude to Ms. TINU C. PHILIP, Head of the Department,
Computer Science, for all of her assistance and support in carrying out this project.
We are really grateful to our project guide, Ms. JISHA ISAAC, for her important direction and
assistance in completing the project in its current state.
We also express our profound thanks to our parents and family members, who have always
provided spiritual as well as financial assistance.
Last but not least, we would like to express our appreciation to all of our friends who directly or
indirectly assisted us in completing this project report.
ABSTRACT
Lung cancer is one of the deadliest diseases in the world. CT scan images are not easy to interpret,
but a deep CNN combined with image segmentation offers a practical approach to detecting lung
cancer. A convolutional neural network (CNN) is a deep structured algorithm widely applied to
visualize and extract the hidden texture features of image datasets. This report aims to
automatically extract self-learned features using an end-to-end deep learning CNN and compares
the results with the performance of conventional state-of-the-art and traditional computer-aided
diagnosis systems.
In this report we classify the image data into two categories: malignant and benign. At the end of
the report, the total number of tests performed and the numbers of positive and negative patients
are summarized.
The first step is to build a simple 2D convolutional model, which is then optimized; the resulting
summary gives the total number of parameters, separated into trainable and non-trainable
parameters.
Next comes preprocessing of the input images and segmentation of regions such as the lungs and
the cancerous areas. We then build models to obtain an accurate prediction, since prediction in
medical science is highly sensitive. After testing for the best model, we obtain the final result.
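The split into trainable and non-trainable parameters mentioned above can be illustrated with a little arithmetic. This is a hedged sketch, not the report's actual model: the layer sizes (3x3 kernels, 32 and 64 filters, single-channel input) are assumptions for illustration, and in practice a framework such as Keras would report these totals via model.summary().

```python
# Illustrative parameter bookkeeping for a small 2D conv stack (assumed
# shapes, not the report's actual architecture).

def conv2d_params(kernel_h, kernel_w, in_ch, out_ch):
    # Each filter has kernel_h*kernel_w*in_ch weights plus one bias.
    return (kernel_h * kernel_w * in_ch + 1) * out_ch

def batchnorm_params(channels):
    # gamma and beta are trainable; the running mean and variance are not.
    trainable = 2 * channels
    non_trainable = 2 * channels
    return trainable, non_trainable

# Toy stack: Conv2D(32, 3x3) on a 1-channel input, BatchNorm, Conv2D(64, 3x3).
conv1 = conv2d_params(3, 3, 1, 32)          # (3*3*1 + 1) * 32 = 320
bn_train, bn_frozen = batchnorm_params(32)  # 64 trainable, 64 non-trainable
conv2 = conv2d_params(3, 3, 32, 64)         # (3*3*32 + 1) * 64 = 18496

trainable_total = conv1 + bn_train + conv2
non_trainable_total = bn_frozen
print(trainable_total, non_trainable_total)  # 18880 64
```

The non-trainable count comes entirely from batch normalization's running statistics, which are updated during training but not by gradient descent.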
TABLE OF CONTENTS
1. INTRODUCTION..................................................................................................................................1
2. LITERATURE REVIEW......................................................................................................................2
2.3 Lung Cancer Prediction Using Machine Learning............................................................................2
3. SYSTEM ANALYSIS............................................................................................................................3
3.1 IDENTIFICATION OF NEED...........................................................................................3
3.2 PROPOSED SYSTEM........................................................................................................................3
3.2.1 MODEL CLASSIFIERS...............................................................................................................4
4. SYSTEM SPECIFICATIONS...............................................................................................................5
4.1 SOFTWARE REQUIREMENTS.......................................................................................................5
4.2 HARDWARE REQUIREMENTS......................................................................................................6
5. SYSTEM DESIGN.................................................................................................................................7
6. CONCLUSION.....................................................................................................................................13
7. BIBLIOGRAPHY.....................................................................................................................14
LIST OF FIGURES
Figure 1: Deep CNN......................................................................................................................................3
Figure 2: The architecture of Convolutional Neural Networks (CNN).....................................................7
Figure 3: Architecture Diagram....................................................................................................................9
Figure 4: Data Flow Diagram......................................................................................................................10
1. INTRODUCTION
Lung cancer is one of the most dangerous diseases. However, early detection of lung cancer
significantly improves the survival rate. Malignant and benign pulmonary nodules are small
growths of cells inside the lung. Detecting malignant lung nodules at an early stage is crucial for
prognosis, but early-stage cancerous lung nodules look very similar to noncancerous nodules. The
main task is therefore to measure the probability of malignancy for early cancerous lung nodules.
Physicians use various diagnostic procedures in combination for the early diagnosis of malignant
lung nodules, such as clinical assessment, computed tomography (CT) scan analysis
(morphological assessment), positron emission tomography (PET) (metabolic assessment), and
needle biopsy analysis. In practice, however, healthcare practitioners mostly rely on invasive
methods such as biopsies or surgery to differentiate between benign and malignant lung nodules.
The most suitable method for investigating lung diseases is computed tomography (CT) imaging.
However, CT scan investigation has a high rate of false positive findings, along with the
carcinogenic effects of radiation. Low-dose CT uses considerably less radiation exposure than
standard-dose CT.
The results show no significant difference in detection sensitivity between low-dose and
standard-dose CT images. However, cancer-related deaths were significantly reduced in the
population screened with low-dose CT scans compared to chest radiographs, as shown in the
National Lung Screening Trial (NLST). The detection sensitivity for lung nodules improves with
finer anatomical detail (thinner slices) and better image registration techniques, but this greatly
increases the size of the datasets.
To handle this, our first step is to segment the CT scan images after preprocessing. We use the
BRISK algorithm from OpenCV for segmentation, which is used to build a mask over the
cancerous cells in the lung images. Our lung cancer prediction project can thus help predict cancer
for the patient.
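As a rough illustration of the mask-building idea, the sketch below uses a plain intensity threshold in NumPy as a hypothetical stand-in. The actual pipeline uses the BRISK detector (available in OpenCV as cv2.BRISK_create); the threshold value and toy slice here are purely assumptions.

```python
import numpy as np

# Hypothetical stand-in for the mask-building step: a simple intensity
# threshold that flags bright candidate regions of a CT slice as a binary
# mask. The 0.6 threshold is an assumption for illustration only.

def candidate_mask(slice_2d, threshold=0.6):
    """Return a binary mask marking bright candidate regions of a CT slice."""
    slice_2d = np.asarray(slice_2d, dtype=float)
    # Normalize to [0, 1] so the threshold is independent of intensity scale.
    lo, hi = slice_2d.min(), slice_2d.max()
    norm = (slice_2d - lo) / (hi - lo + 1e-9)
    return (norm > threshold).astype(np.uint8)

# Toy 4x4 "slice" with one bright blob in the corner.
toy = np.array([[0, 0, 0, 0],
                [0, 0, 0, 0],
                [0, 0, 9, 9],
                [0, 0, 9, 9]])
mask = candidate_mask(toy)
print(mask.sum())  # 4 pixels flagged
```

A real nodule mask would of course need morphological cleanup and keypoint evidence rather than a single global threshold; this only shows the mask representation itself.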
1|Page
2. LITERATURE REVIEW
2.1 Lung Cancer Prediction from Text Datasets Using Machine Learning (C. Anil Kumar, S.
Harish, Prabha Ravi, Murthy SVN, B. P. Pradeep Kumar, V. Mohanavel, Nouf M. Alyami, S.
Shanmuga Priya, Amare Kebede Asfaw)
A support vector machine (SVM)-based machine learning model was employed to optimise
the lung cancer dataset's detection procedure.
Lung cancer patients are categorised according to their symptoms using an SVM classifier,
and the Python programming language is also used to advance the model's implementation.
Several distinct metrics were used to assess the SVM model's efficiency.
The proposed model was contrasted with the SVM and SMOTE techniques currently in use.
Comparing the new method to the current ones yields a 98.8% accuracy rate.
2.2 Lung cancer Prediction and Classification based on Correlation Selection method Using
Machine Learning Techniques (Dakhaz Abdullah, Adnan Mohsin Abdulazeez, Amira Bibo
Sallow)
The accuracy ratios of three classifiers—the Support Vector Machine (SVM), K-Nearest
Neighbor (KNN), and Convolutional Neural Network (CNN)—are examined in this research.
This paper's main focus is on the execution analysis of WEKA Tool's classification
algorithms' accuracy. According to the experimental findings, SVM achieves the best
results (95.56%), followed by CNN (92.11%) and KNN (88.40%).
2.3 Lung Cancer Prediction Using Machine Learning (Vasupalli Saranya, Madilla Suresh
Kumar)
In this study, the accuracy score and confusion matrix have been calculated to identify the
correct and incorrect predicted features.
The KNN algorithm is used to predict lung cancer. The dataset is imported in this study's
attempt to predict the likelihood that users will develop lung cancer. Machine learning
algorithms are then applied to the dataset to determine whether users are at risk for the
disease.
3. SYSTEM ANALYSIS
3.1 IDENTIFICATION OF NEED
Lung cancer is one of the deadliest diseases in the world. CT scan images are not easy to interpret,
but a deep CNN combined with image segmentation offers a practical approach to detecting lung
cancer. A convolutional neural network (CNN) is a deep structured algorithm. In this report we
classify the image data into two categories: malignant and benign. At the end of the report, the
total number of tests performed and the numbers of positive and negative patients are reported.
The first step is to build a simple 2D convolutional model, which is then optimized; the resulting
summary gives the total number of parameters, separated into trainable and non-trainable
parameters. Next comes preprocessing of the input images and segmentation of regions such as
the lungs and the cancerous areas. We then build models to obtain an accurate prediction, since
prediction in medical science is highly sensitive. After testing for the best model, we obtain the
final result.
3.2 PROPOSED SYSTEM
Deep convolutional neural networks have recently been applied to image classification with large
image datasets. A deep CNN is able to learn basic filters automatically and combine them
hierarchically to describe latent concepts for pattern recognition. The 'deep' in 'deep CNN' refers
to the number of layers in the network: a regular CNN commonly has 5–10 or more
feature-learning layers, while modern architectures used in cutting-edge applications have
networks that are more than 50–100 layers deep.
4. SYSTEM SPECIFICATIONS
5. SYSTEM DESIGN
5.1 ALGORITHM USED
Convolutional Neural Network:
A convolutional neural network (CNN or convnet) is a subset of machine learning models. It is
one of the various types of artificial neural networks used for different applications and data types.
A CNN is a network architecture for deep learning algorithms, used specifically for image
recognition and other tasks that involve processing pixel data. CNNs can uncover key information
in both time-series and image data, which makes them highly valuable for image-related tasks
such as image recognition, object classification and pattern recognition.
Convolutional layer:
Convolutional layers are made up of a set of filters (also called kernels) that are applied to an input
image. The output of the convolutional layer is a feature map, which is a representation of the input
image with the filters applied. Convolutional layers can be stacked to create more complex models,
which can learn more intricate features from images.
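The filter-sliding computation described above can be sketched in NumPy. This is a minimal single-filter "valid" convolution for illustration, not the report's implementation; real layers apply many filters at once and add a bias term.

```python
import numpy as np

# Minimal sketch of what a convolutional layer computes: sliding a small
# filter over the image yields a feature map (plain "valid" cross-correlation
# with a single filter).

def conv2d_valid(image, kernel):
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

# A vertical-edge filter applied to a toy image with one vertical edge.
img = np.array([[0., 0., 1., 1.],
                [0., 0., 1., 1.],
                [0., 0., 1., 1.]])
edge_filter = np.array([[1., -1.],
                        [1., -1.]])
fmap = conv2d_valid(img, edge_filter)
print(fmap)  # nonzero only where the edge sits, i.e. the middle column
```

Stacking such layers, as the text notes, lets later filters respond to combinations of the simple patterns detected by earlier ones.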
Pooling layer:
Pooling layers are a type of layer used in deep learning alongside convolutional layers. They
reduce the spatial size of the input, making it easier to process and requiring less memory; pooling
also reduces the number of parameters and makes training faster. There are two main types of
pooling: max pooling and average pooling. Max pooling takes the maximum value from each
pooling window of the feature map, while average pooling takes the average value. Pooling layers
are typically placed after convolutional layers to reduce the size of the input before it is fed into a
fully connected layer.
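The two pooling modes can be sketched as follows. This minimal NumPy version assumes the input height and width divide evenly by the pool size; real frameworks also handle padding and strides.

```python
import numpy as np

# Sketch of 2x2 max pooling vs. average pooling on a feature map, assuming
# the input dimensions are divisible by the pool size.

def pool2d(fmap, size=2, mode="max"):
    h, w = fmap.shape
    # Reshape into (rows, size, cols, size) blocks, then reduce each block.
    blocks = fmap.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))

fmap = np.array([[1., 2., 5., 6.],
                 [3., 4., 7., 8.],
                 [0., 0., 1., 1.],
                 [0., 4., 1., 1.]])
print(pool2d(fmap, mode="max"))  # [[4. 8.] [4. 1.]]
print(pool2d(fmap, mode="avg"))  # [[2.5 6.5] [1.  1. ]]
```

Note how max pooling keeps only the strongest response in each window, while average pooling smooths the map; both shrink a 4x4 input to 2x2.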
METHODOLOGY
Steps:
1. Take the input data; our dataset consists of an image database.
2. Pre-process the data: resize and reshape the images.
3. Build the deep CNN model, which includes Conv2D, max pooling, batch normalization
and dropout layers.
4. Perform feature extraction by evaluating several models.
5. Optimize with different layers.
6. Obtain the final result.
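The pre-processing step above can be sketched as follows. This is a hedged illustration: strided slicing stands in for proper interpolation-based resizing, and the 8x8 input size is an assumption.

```python
import numpy as np

# Sketch of the pre-processing step: resize (here, naive 2x downsampling by
# strided slicing, a stand-in for real interpolation), normalize to [0, 1],
# and reshape to the (height, width, channels) layout a Conv2D layer expects.

def preprocess(slice_2d):
    small = slice_2d[::2, ::2]                  # crude 2x "resize"
    lo, hi = small.min(), small.max()
    norm = (small - lo) / (hi - lo + 1e-9)      # scale to [0, 1]
    return norm.reshape(*norm.shape, 1)         # add a channel axis

raw = np.arange(64, dtype=float).reshape(8, 8)  # fake 8x8 CT slice
x = preprocess(raw)
print(x.shape)  # (4, 4, 1)
```

Every image in the database would pass through the same function so that the model sees a consistent shape and intensity range.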
DATA VISUALIZATION
Visualization of the dataset is an important part of training; it gives a better understanding of the
dataset. However, CT scan images are hard to visualize on a normal PC or in a standard image
viewer. We therefore use the pydicom library to solve this problem: pydicom exposes the image
array and the metadata stored in CT images, such as the patient's name, patient ID, patient's birth
date, image position, image number, doctor's name, and so on. We have nearly 9,796 CT images,
and the image data was split into training, validation and testing datasets. The source of the data is
the LUNA16 dataset, a subset of the LIDC-IDRI dataset in which the heterogeneous scans are
filtered by different criteria.
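The three-way split mentioned above might look like this. The 70/15/15 proportions and the synthetic arrays are assumptions, since the exact ratios are not stated here; in the real pipeline the arrays would come from the pydicom-loaded image data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Sketch of splitting the image data into training, validation and test sets
# (assumed 70/15/15 split on stand-in data).
X = np.random.rand(100, 16)            # stand-in for flattened CT images
y = np.random.randint(0, 2, size=100)  # 0 = benign, 1 = malignant labels

X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.30, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42)
print(len(X_train), len(X_val), len(X_test))  # 70 15 15
```

Splitting twice with scikit-learn's train_test_split is a common way to get three sets; the validation set guides model selection while the test set stays untouched until the end.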
Feature extraction:
After segmentation is performed, the segmented lung nodule is used for feature extraction. A
feature is a significant piece of information extracted from an image that provides a more detailed
understanding of the image. Both geometric (shape) and intensity-based statistical features are
extracted; shape measurements are physical dimensional measures that characterize the
appearance of an object. The following features were considered:
1) Area: the sum of the areas of the pixels in the region.
2) Perimeter: the number of pixels on the boundary of the object.
3) Eccentricity: the ratio of the distance between the foci of the fitted ellipse to its major axis
length; the value is between 0 and 1.
4) Entropy: a statistical measure of randomness that characterizes the texture of the input image.
5) Contrast: measures the local variations in the gray-level co-occurrence matrix (GLCM), i.e. the
intensity contrast between a pixel and its neighbour over the whole image; contrast is 0 for a
constant image.
6) Correlation: measures the joint probability of occurrence of the specified pixel pairs.
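A few of the listed features can be sketched directly in NumPy. This is an illustrative approximation: in practice a library such as scikit-image (regionprops, graycomatrix) would supply these plus eccentricity, contrast and correlation.

```python
import numpy as np

# Sketch of three of the listed features computed from a binary nodule mask
# and a grey-level image; simplified definitions for illustration only.

def region_area(mask):
    # Area: number of foreground pixels.
    return int(mask.sum())

def region_perimeter(mask):
    # Perimeter: foreground pixels with at least one 4-neighbour background pixel.
    padded = np.pad(mask, 1)
    inner = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
             padded[1:-1, :-2] & padded[1:-1, 2:])
    return int(mask.sum() - (mask & inner).sum())

def image_entropy(gray, bins=8):
    # Entropy: Shannon entropy of the grey-level histogram (values in [0, 1]).
    hist, _ = np.histogram(gray, bins=bins, range=(0, 1))
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

mask = np.zeros((5, 5), dtype=int)
mask[1:4, 1:4] = 1               # 3x3 square "nodule"
print(region_area(mask))         # 9
print(region_perimeter(mask))    # 8 (only the centre pixel is interior)
```

Each segmented nodule would yield one such feature vector, which is what the classifiers in the next section consume.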
MODEL CLASSIFIERS
1. SVC (SUPPORT VECTOR CLASSIFIER)
You can use a support vector classifier when your data has exactly two classes. An SVM
classifies data by finding the best hyperplane that separates all data points of one class from
those of the other class. The best hyperplane for an SVM means the one with the largest
margin between the two classes. Margin means the maximal width of the slab parallel to the
hyperplane that has no interior data points. The support vectors are the data points that are
closest to the separating hyperplane; these points are on the boundary of the slab. The
following figure illustrates these definitions, with + indicating data points of type 1, and –
indicating data points of type –1.
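A minimal sketch of a two-class SVC on toy points, using scikit-learn; the data here is synthetic, chosen only to make the two clusters cleanly separable.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated clusters; a linear SVC finds the maximum-margin
# hyperplane between them, and support_vectors_ holds the boundary points.
X = np.array([[0., 0.], [0., 1.], [1., 0.],    # class 0 cluster
              [4., 4.], [4., 5.], [5., 4.]])   # class 1 cluster
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear")
clf.fit(X, y)
print(clf.predict([[0.5, 0.5], [4.5, 4.5]]))  # [0 1]
print(len(clf.support_vectors_))
```

Only the support vectors (the points closest to the separating hyperplane) determine the decision boundary; the remaining points could move without changing it.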
2. EXTREME GRADIENT BOOSTING
Extreme Gradient Boosting is a tree-based algorithm that sits under the supervised
branch of machine learning and can be used for both classification and regression
problems.
• Prediction target: the trees are built using residuals, not the actual class labels.
Hence, although we focus on classification problems, the base estimators in these
algorithms are regression trees rather than classification trees, because residuals
are continuous rather than discrete. At the same time, some of the formulas
involved are specific to classification, so exactly the same does not apply to
regression problems.
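A hedged sketch of the idea on synthetic data: scikit-learn's GradientBoostingClassifier is used here as a stand-in for the XGBoost library, since both fit regression trees to residuals as described above; the data and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Gradient boosting on a toy binary problem: each new (regression) tree is
# fitted to the residuals of the current ensemble's predictions.
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # synthetic rule to learn

clf = GradientBoostingClassifier(n_estimators=50, max_depth=2, random_state=0)
clf.fit(X, y)
acc = clf.score(X, y)
print(round(acc, 2))
```

Shallow trees (max_depth=2) are typical for boosting: each one is a weak learner, and the ensemble's strength comes from adding many of them sequentially.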
3. RANDOM FOREST CLASSIFIER
Random forest classifiers fall under the broad umbrella of ensemble-based learning
methods. They are simple to implement, fast in operation, and have proven to be extremely
successful in a variety of domains. The key principle underlying the random forest
approach comprises the construction of many “simple” decision trees in the training stage
and the majority vote (mode) across them in the classification stage. Among other benefits,
this voting strategy has the effect of correcting for the undesirable property of decision trees
to overfit training data. In the training stage, random forests apply the general technique
known as bagging to individual trees in the ensemble. Bagging repeatedly selects a random
sample with replacement from the training set and fits trees to these samples. Each tree is
grown without any pruning. The number of trees in the ensemble is a free parameter which
is readily learned automatically using the so-called out-of-bag error.
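The bagging and out-of-bag estimate described above can be sketched with scikit-learn; the data is synthetic, and oob_score=True enables the OOB accuracy estimate mentioned in the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Random forest with bagging: each tree sees a bootstrap sample, and
# oob_score=True scores each sample only on the trees that did not see it.
rng = np.random.RandomState(1)
X = rng.rand(300, 5)
y = (X[:, 0] > 0.5).astype(int)   # simple synthetic rule

forest = RandomForestClassifier(n_estimators=100, oob_score=True,
                                random_state=1)
forest.fit(X, y)
print(round(forest.oob_score_, 2))  # OOB accuracy; 1 minus this is the OOB error
```

Because the OOB estimate comes for free from the bootstrap draws, it can replace a separate validation split when tuning the number of trees.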
6. CONCLUSION
The greatest benefit of deep learning over other machine learning algorithms is its ability to
perform feature engineering on its own: it examines the data for associated features and combines
them to enable faster learning. It also takes advantage of spatial coherence in the input. Training
and testing of the images are carried out, with the images pre-processed and feature selection and
feature extraction performed. Once the training and testing stages are completed successfully, the
CNN algorithm classifies the input lung image as either normal or abnormal and displays the
output. Hence, a deep CNN is used to classify lung images for the detection of cancer.
7. BIBLIOGRAPHY
1. Liu, Y.; Zhang, H.; Guo, H.; Xiong, N.N. A FAST-BRISK Feature Detector with Depth
Information. Sensors 2018, 18, 3908. https://doi.org/10.3390/s18113908
2. Kornilov, A.S.; Safonov, I.V. An Overview of Watershed Algorithm Implementations in Open
Source Libraries. J. Imaging 2018, 4, 123. https://doi.org/10.3390/jimaging4100123
3. W.D. Travis, N. Rekhtman, G.J. Riley, K.R. Geisinger, H. Asamura, E. Brambilla, K. Garg, F.R.
Hirsch, M. Noguchi, and C.A. Powell, Pathologic diagnosis of advanced lung cancer based on
small biopsies and cytology: a paradigm shift, J. Thorac. Oncol., Vol. 5, 2010, pp. 411-414.
4. A. Wang, H.Y. Wang, Y. Liu, M.C. Zhao, H. Zhang, Z.Y. Lu, T.C. Fang, X. Chen, and G.T. Liu,
The prognostic value of PD-L1 expression for non-small cell lung cancer patients: a meta-analysis,
EJSO., Vol. 41, 2015, pp. 450-456.
5. R.L. Siegel, K.D. Miller, and A. Jemal, Cancer statistics, 2015, CA Cancer J. Clin., Vol. 65,
2015, pp. 5-29.
6. T. Nawa, T. Nakagawa, T. Mizoue, S. Kusano, T. Chonan, S. Fukai, and K. Endo, Long-term
prognosis of patients with lung cancer detected on low-dose chest computed tomography screening,
Lung Cancer., Vol. 75, 2012, pp. 197-202.
7. X.D. Teng, [World Health Organization classification of tumours, pathology and genetics of
tumours of the lung], Chin. J. Pathol., Vol. 34, 2005, pp. 544.
8. S. Kundu, R. Mitra, S. Misra, and S. Chatterjee, Squamous cell carcinoma lung with progressive
systemic sclerosis, J. Assoc. Phys. India., Vol. 60, 2012, pp. 52-54.
https://www.japi.org/t2b4e444/squamous-cell-carcinoma-lungwithprogressivesystemic-sclerosi