Documentation
ABSTRACT:
Lung cancer poses a significant threat to human life, characterized by the uncontrolled growth
of cells within lung tissue, with potential spread to nearby areas and other parts of the body.
Diagnosis traditionally relies on the examination of computerized tomography (CT) scan
images by physicians, but this manual process can be time-consuming and prone to oversight.
Leveraging artificial intelligence (AI) techniques, particularly deep learning algorithms, has
emerged as a promising approach to enhance the accuracy and efficiency of lung cancer
diagnosis. This study proposes an automatic lung cancer diagnosis system utilizing deep
learning algorithms trained on CT scan images. The motivation behind this approach is to
address the limitations of manual diagnosis, including the possibility of missing small tumours
and the scarcity of expert radiologists. Early detection of lung cancer is crucial for timely
intervention and improved patient outcomes. The methodology involves several key steps,
including image pre-processing to enhance image quality, data augmentation to increase the
diversity of the training dataset, and the utilization of convolutional neural networks (CNN) for
disease classification. The CNN architecture is optimized to accurately classify CT scan images
into three classes: normal lung tissue, benign abnormalities, and malignant tumours. Evaluation
of the proposed model includes assessing performance metrics such as precision, recall, F1
score, as well as training and test accuracy.
1. INTRODUCTION
Cancer is a serious public health concern that is spreading around the globe. It is a
condition in which the body develops malignant or tumorous growths as a result of unchecked
cell division in certain tissues. Globally, 19.3 million new cases of cancer and almost 10 million
cancer deaths were predicted by GLOBOCAN for 2020. Lung cancer is the most prevalent
cancer to be diagnosed and the world's greatest cause of mortality for both men and women.
Every year, around 1.8 million people die from lung cancer, out of roughly 2.2 million new
cases diagnosed worldwide. Lung cancer often presents with haemoptysis (coughing up blood),
weight loss, and fatigue, among other symptoms. In addition, a number of risk factors
contribute to the disease, such as alcohol use, smoking, poor air quality, and diet. According to
the histology of the cancer cells, there are two types of lung cancer: non-small-cell lung cancer
(NSCLC) and small-cell lung cancer (SCLC). NSCLC is the most frequent kind, accounting for
about 85–88% of lung cancer cases, while SCLC accounts for the remaining 12–15%. Over the
past 20 years, lung cancer has become much more common in developing nations, especially
in Sub-Saharan Africa, where the prevalence of HIV/AIDS is also quite high. Compared to
other malignancies like prostate cancer (99%), colorectal cancer (65%), and breast cancer
(90%), the overall 5-year survival rate for all types of lung cancer is less than 18%.
Nonetheless, lung cancer needs more focus from the biological, scientific, and medical domains
in order to develop novel strategies that support early diagnosis, inform medical judgments,
and assess solutions to enhance healthcare. Lung cancer may be detected using the vast amount
of computed tomography (CT) scan image data available. Lung cancer is invasive and
heterogeneous, making early detection and treatment crucial to improving the overall five-year
survival rate [12]. The prediction and classification of lung nodules has been thoroughly
studied over the past decade using a variety of medical imaging techniques, including chest
X-ray, positron emission tomography (PET), magnetic resonance imaging (MRI), computed
tomography (CT), low-dose CT (LDCT), and chest radiograph (CRG). It is crucial to pay more
attention to the medical field and, especially, to build new systems for early cancer diagnosis
in order to limit its mortality rate. Deep-learning-based models have recently demonstrated
great performance in various computer vision and artificial intelligence fields. The use of deep
learning in the medical field has
witnessed significant development, especially when applied to organ cancers. Deep-learning-
based models have demonstrated outstanding capabilities in the classification of medical
images. These images may be used by machine learning and deep learning algorithms to
improve cancer prediction and diagnosis as early as feasible and to identify the most effective
treatment approaches.
Motivation:
Recently, the diagnosis of lung cancer has been made mostly by professionals using
their experience, augmented by data from laboratory testing or clinical information. The test
results examined in these labs vary according to factors like smoking status, yellowing of
fingers, anxiety, peer pressure, chronic disease status, allergies, wheezing, alcohol intake,
coughing, dyspnoea, difficulty swallowing, and chest tightness. This project aims to precisely
predict the incidence of lung cancer using deep learning and optimization methods on a
given dataset.
WHO Response:
WHO recognizes the significant impact of lung cancer on global health and has
implemented several initiatives to address the disease comprehensively. The WHO's response
focuses on tobacco control, cancer prevention, early detection, and improving access to quality
treatment and care. WHO supports countries in implementing evidence-based tobacco control
policies, including increasing tobacco taxes, enforcing comprehensive bans on tobacco
advertising, promotion, and sponsorship, and implementing strong graphic health warnings on
tobacco products.
The Organization also promotes cancer prevention strategies by advocating for healthy
lifestyles, including regular physical activity, a healthy diet, and minimizing exposure to
environmental risk factors. Additionally, WHO supports early detection programs and
encourages countries to implement screening measures for high-risk populations to detect lung
cancer at earlier stages when treatment options are more effective. Lastly, WHO works towards
ensuring access to quality treatment and care for lung cancer patients by providing technical
guidance to member states, promoting equitable access to essential cancer medicines, and
fostering international collaboration to share best practices and improve cancer care outcomes.
SYMPTOMS:
Lung cancer can cause several symptoms that may indicate a problem in the lungs.
2. SYSTEM ANALYSIS
Adnan Mohsin Abdul Azeez and colleagues used several different methods, datasets,
and approaches to feature selection and feature extraction. In comparison with related work,
we obtain a good result in this work with the dataset and methods we used. The researchers in
one related study obtained 94% accuracy on a CT scan image dataset using SURF (Speeded
Up Robust Features) for feature selection. Using the same dataset, other researchers obtained
90.9% with the Delta Radiomics method for feature extraction, while better accuracy was
obtained by using the GLCM function for feature extraction together with MLP (98%), SVM
(70.45%), and KNN (99.2%) classifiers. Using the UCI dataset, researchers achieved a good
result of 90% with several classifiers. In future, the proposed methodology can be enhanced
by including fuzzy genetic optimization techniques with the deep learning approaches.
2.2 PROPOSED SYSTEM
To develop a framework for selecting the best classification approach among three
techniques for addressing the class imbalance problem, a methodology divided into three
stages is proposed. In the first stage, the IQ-OTH/NCCD dataset was selected, which consists
of 1097 chest CT scan images for lung cancer divided into 3 classes (benign, malignant, and
normal). Data pre-processing was then performed (image resizing, image filtering, image
normalization), after which the data was passed separately into three class-balancing
techniques (SMOTE, a class-weighted approach, and data augmentation). Each of these
techniques produces differently balanced data. In the second stage, features are extracted and
classified using the proposed convolutional neural network architecture, separately for the data
obtained from each balancing technique. In this way, three classification models were built.
The balancing methods are then evaluated in terms of their effect on the classification results,
from greatest to least effect. Fig. 19 shows the stages of the proposed methodology.
3. DEVELOPMENT ENVIRONMENT
Processor : NVIDIA
RAM : 4GB
Keyboard : Logitech
❖ Operating system
➢ Windows 11
❖ Python -Colab
❖ Packages
➢ Numpy
➢ Pandas
➢ Matplotlib
➢ Tensorflow
3.3. GOOGLE COLAB
• Free Access to GPUs: Colab offers free GPU access, which is particularly
useful for training machine learning models that require significant computational
power.
• No Setup Required: Colab runs in the cloud, eliminating the need for users to
set up and configure their own development environment. This makes it convenient for
quick coding and collaboration.
• Collaborative Editing: Multiple users can work on the same Colab notebook
simultaneously, making it a useful tool for collaborative projects.
• Support for Popular Libraries: Colab comes pre-installed with many popular
Python libraries for machine learning, data analysis, and visualization, such as
TensorFlow, PyTorch, Matplotlib, and more.
• Easy Sharing: Colab notebooks can be easily shared just like Google Docs or
Sheets. Users can provide a link to the notebook, and others can view or edit the code in
real time.
3.4. PYTHON
Python is a high-level, general-purpose programming language. Its design philosophy
emphasizes code readability with the use of significant indentation.
Guido van Rossum began working on Python in the late 1980s as a successor to
the ABC programming language and first released it in 1991 as Python 0.9.0. Python 2.0 was
released in 2000. Python 3.0, released in 2008, was a major revision not completely backward-
compatible with earlier versions. Python 2.7.18, released in 2020, was the last release of
Python 2.
Python consistently ranks as one of the most popular programming languages, and has
gained widespread use in the machine learning community.
Python uses dynamic typing and a combination of reference counting and a cycle-
detecting garbage collector for memory management. It uses dynamic name resolution (late
binding), which binds method and variable names during program execution.
Its design offers some support for functional programming in the Lisp tradition. It
has filter, map, and reduce functions; list comprehensions, dictionaries, sets,
and generator expressions.[71] The standard library has two modules (itertools and functools)
that implement functional tools borrowed from Haskell and Standard ML.
Figure 3: Python
❖ Fast prototyping

Professionally, Python is great for backend web development, data analysis, artificial
intelligence, and scientific computing. Developers also use Python to build productivity tools,
games, and desktop apps.
Python and AI:
AI researchers are fans of Python. Google TensorFlow, as well as other libraries (scikit-
learn, Keras), establish a foundation for AI development because of the usability and flexibility
they offer Python users. These libraries, and their availability, are critical because they enable
developers to focus on growth and building.
Numpy:
At the core of the NumPy package, is the ndarray object. This encapsulates n-dimensional
arrays of homogeneous data types, with many operations being performed in compiled code
for performance. There are several important differences between NumPy arrays and the
standard Python sequences:
❖ NumPy arrays have a fixed size at creation, unlike Python lists (which can grow
dynamically). Changing the size of an ndarray will create a new array and delete the
original.
❖ The elements in a NumPy array are all required to be of the same data type, and thus will
be the same size in memory. The exception: one can have arrays of (Python, including
NumPy) objects, thereby allowing for arrays of different sized elements.
❖ NumPy arrays facilitate advanced mathematical and other types of operations on large
numbers of data. Typically, such operations are executed more efficiently and with less
code than is possible using Python’s built-in sequences.
❖ A growing plethora of scientific and mathematical Python-based packages are using
NumPy arrays; though these typically support Python-sequence input, they convert such
input to NumPy arrays prior to processing, and they often output NumPy arrays. In other
words, in order to efficiently use much (perhaps even most) of today’s
scientific/mathematical Python-based software, just knowing how to use Python’s built-
in sequence types is insufficient - one also needs to know how to use NumPy arrays.
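As a brief illustration of these differences, here is a minimal sketch (the values are arbitrary) contrasting vectorized NumPy operations with plain Python lists:

import numpy as np

a = np.array([1, 2, 3, 4], dtype=np.float64)
b = np.array([10, 20, 30, 40], dtype=np.float64)

# element-wise arithmetic runs in compiled code -- no Python loop needed
print(a + b)       # [11. 22. 33. 44.]
print(a * 2)       # [ 2.  4.  6.  8.]
print(a.mean())    # 2.5

# a plain Python list behaves differently: + concatenates
print([1, 2] + [3, 4])   # [1, 2, 3, 4]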
Matplotlib:
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It
has a module named pyplot which makes plotting easy by providing features to control line
styles, font properties, axis formatting, etc. It supports a wide variety of graphs and plots,
namely histograms, bar charts, power spectra, error charts, etc. It is used along with
NumPy to provide an environment that is an effective open-source alternative for MATLAB.
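For example, a minimal pyplot script (the data here is arbitrary) that draws a labelled line plot looks like this:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label='sin(x)', linestyle='--')  # control line style
plt.title('A simple pyplot example')
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.legend()
plt.show()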
Tensorflow:
TensorFlow works on the basis of data flow graphs that have nodes and edges. As the
execution mechanism is in the form of graphs, it is much easier to execute TensorFlow code in
a distributed manner across a cluster of computers while using GPUs.
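As a minimal sketch of this graph-based execution, wrapping a function in tf.function traces it into a data flow graph that TensorFlow can optimize and place on available devices such as GPUs (the tensors here are arbitrary):

import tensorflow as tf

@tf.function  # traces the Python function into a data flow graph
def matmul_fn(a, b):
    return tf.matmul(a, b)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[5.0, 6.0], [7.0, 8.0]])
print(matmul_fn(a, b))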
Pandas:
Pandas is a Python library used for working with data sets. It has functions for analyzing,
cleaning, exploring, and manipulating data. The name "Pandas" refers to both "Panel Data"
and "Python Data Analysis"; the library was created by Wes McKinney in 2008. Pandas allows
us to analyze big data and draw conclusions based on statistical theories, and it can clean
messy data sets and make them readable and relevant. Relevant data is very important in data
science. As an open-source library built on top of Python specifically for data manipulation
and analysis, Pandas offers data structures and operations for powerful, flexible, and
easy-to-use analysis. It provides many functions and methods to expedite the data analysis
process; what makes Pandas so popular is its functionality, flexibility, and simple syntax.
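A minimal sketch of typical Pandas operations, using a small hypothetical table of patient records (all values invented for illustration):

import pandas as pd

df = pd.DataFrame({
    'age': [54, 61, 47, 69],
    'smoker': [True, True, False, True],
    'diagnosis': ['benign', 'malignant', 'normal', 'malignant'],
})

print(df.describe())                   # summary statistics for numeric columns
print(df[df['smoker']])                # filter rows by a condition
print(df.groupby('diagnosis').size())  # count records per class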
Keras:
Keras is an open-source Python library known to be beneficial for efficient and
fast experimentation with deep neural networks. This neural network Python library
was initially developed as part of the research project ONEIROS (Open-ended Neuro-
Electronic Intelligent Robot Operating System). It is now integrated into the core
library of TensorFlow while remaining usable as a standalone API. Keras is
extensively used for creating machine learning and deep learning models, and has been
used in applications at companies such as Yelp, Uber, Square, and Netflix. This
Python-written library speeds up the creation of neural networks.
Pros:
Cons:
• Before you can implement an operation, you will need to leverage a computational graph. This
makes this Python library slow.
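As a minimal, hedged sketch of how a Keras model for this report's task might be declared (the layer sizes are illustrative, not the tuned architecture used later in this report):

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(256, 256, 1)),            # grayscale CT slices
    layers.Conv2D(32, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(3, activation='softmax'),        # benign / malignant / normal
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.summary()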
4. FEASIBILITY STUDY
A feasibility study is an assessment of the practicality of a proposed plan or project.
A feasibility study analyzes the viability of a project to determine whether the project or
venture is likely to succeed. The study is also designed to identify potential issues and
problems that could arise while pursuing the project.
There are several benefits to feasibility studies, including helping project managers
discern the pros and cons of undertaking a project before investing a significant amount of
time and capital into it.
Feasibility studies can also provide a company's management team with crucial
information that could prevent them from entering into a risky business venture.
Such studies help companies determine how they will grow. They will know more
about how they will operate, what the potential obstacles are, who the competition is, and
what the market is.
Feasibility studies also help convince investors and bankers that investing in a
particular project or business is a wise choice.
There are four main elements that go into a feasibility study: technical feasibility,
financial feasibility, market feasibility (or market fit), and operational feasibility. You
may also see these referred to as the four types of feasibility studies, though most
feasibility studies actually include a review of all four elements.
Financial feasibility examines the project's costs and expected returns, using metrics such as
return on investment (ROI), payback period, and net present value (NPV) to determine if the
project is financially viable.
Market feasibility assesses the demand for the proposed product or service in the target
market. It involves conducting market research to understand customer needs, preferences, and
behaviors. Market feasibility analysis helps determine if there is a sufficient market for the
product or service, identify competitors, and evaluate pricing strategies.
Operational feasibility evaluates whether the proposed project aligns with the
organization's capabilities and operational processes. It examines factors such as staffing
requirements, workflow, and potential disruptions to existing operations. Operational
feasibility analysis helps identify potential challenges and risks related to implementing the
project and suggests ways to mitigate them.
Think of the financial feasibility study as the projected income statement for the
project. This part of the feasibility study clarifies the expected project income and outlines
what your organization needs to invest, in terms of time and money, in order to hit the
project objectives.
During the financial feasibility study, take into account whether or not the project
will impact your business's cash flow. Depending on the complexity of the initiative, your
internal PMO or external consultant may want to work with your financial team to run a
cost-benefit analysis of the project.
The market assessment, more than any other part of the feasibility study, is a chance
to evaluate whether or not there’s an opportunity in the market. During this study, it’s
critical to evaluate your competitor’s positions and analyze demographics to get a sense
of how the project will do.
Even if the financials are looking good and the market is ready, this initiative
may not be something your organization can support. To evaluate operational
feasibility, consider any staffing or equipment requirements this project needs. What
organizational resources, including time, money, and skills, are necessary in order for
this project to succeed?
Depending on the project, it may also be necessary to consider the legal impact
of the initiative. For example, if the project involves developing a new patent for your
product, you will need to involve your legal team and incorporate that requirement into
the project plan.
At this stage, your internal PMO team or external consultant has looked at all
four elements of your feasibility study: financials, market analysis, technical feasibility,
and operational feasibility. Before running their recommendations by you and your
stakeholders, they will review and analyze the data for any inconsistencies. This
includes ensuring the income statement is in line with your market analysis.
The goal is to give you a clear picture of the project's viability so you can make the best
decision for your project and for your team.
4.10. PROPOSAL
The final step of the feasibility study is an executive summary touching on the
main points and proposing a solution.
Depending on the complexity and scope of the project, your internal PMO or
external consultant may share the feasibility study with stakeholders or present it to
the group in order to field any questions live. Either way, with the study in hand, your
team now has the information you need to make an informed decision.
5.METHODOLOGY
As the hype around AI has accelerated, vendors have been scrambling to promote how
their products and services use it. Often, what they refer to as AI is simply a component of the
technology, such as machine learning. AI requires a foundation of specialized hardware and
software for writing and training machine learning algorithms. No single programming
language is synonymous with AI, but Python, R, Java, C++ and Julia have features popular
with AI developers.
• Learning. This aspect of AI programming focuses on acquiring data and creating rules for
how to turn it into actionable information. The rules, which are called algorithms, provide
computing devices with step-by-step instructions for how to complete a specific task.
• Creativity. This aspect of AI uses neural networks, rules-based systems, statistical
methods and other AI techniques to generate new images, new text, new music and new
ideas.
AI is important for its potential to change how we live, work and play. It has been
effectively used in business to automate tasks done by humans, including customer service
work, lead generation, fraud detection and quality control. In a number of areas, AI can perform
tasks much better than humans. Particularly when it comes to repetitive, detail-oriented tasks,
such as analysing large numbers of legal documents to ensure relevant fields are filled in
properly, AI tools often complete jobs quickly and with relatively few errors. Because of the
massive data sets it can process, AI can also give enterprises insights into their operations they
might not have been aware of. The rapidly expanding population of generative AI tools will be
important in fields ranging from education and marketing to product design.
Indeed, advances in AI techniques have not only helped fuel an explosion in efficiency, but
opened the door to entirely new business opportunities for some larger enterprises. Prior to
the current wave of AI, it would have been hard to imagine using computer software to connect
riders to taxis, but Uber has become a Fortune 500 company by doing just that.
AI has become central to many of today's largest and most successful companies, including
Alphabet, Apple, Microsoft and Meta, where AI technologies are used to improve operations
and outpace competitors. At Alphabet subsidiary Google, for example, AI is central to its
search engine, Way MO’s self-driving cars and Google Brain, which invented the transformer
neural network architecture that underpins the recent breakthroughs in natural language
processing.
Advantages of AI
• Good at detail-oriented jobs. AI has proven to be just as good, if not better than doctors
at diagnosing certain cancers, including breast cancer and melanoma.
• Reduced time for data-heavy tasks. AI is widely used in data-heavy industries, including
banking and securities, pharma, and insurance, to reduce the time it takes to analyse big data
sets. Financial services, for example, routinely use AI to process loan applications and
detect fraud.
• Saves labour and increases productivity. An example here is the use of warehouse
automation, which grew during the pandemic and is expected to increase with the
integration of AI and machine learning.
• Delivers consistent results. The best AI translation tools deliver high levels of
consistency, offering even small businesses the ability to reach customers in their native
language.
• AI-powered virtual agents are always available. AI programs do not need to sleep or
take breaks, providing 24/7 service.
Disadvantages of AI
• Expensive.
5.2. MACHINE LEARNING
Machine learning (ML) is a branch of artificial intelligence (AI) and computer science
that focuses on using data and algorithms to enable AI to imitate the way that humans learn,
gradually improving its accuracy.
There are four types of machine learning algorithms: supervised, unsupervised,
semi-supervised, and reinforcement learning.
Figure 6: Supervised Learning
Unsupervised learning is used for tasks such as clustering, customer segmentation, and image
and pattern recognition. It's also used to reduce the number of features in a model through the
process of dimensionality reduction. Principal component analysis (PCA) and singular value
decomposition (SVD) are two common approaches for this. Other algorithms used in
unsupervised learning include neural networks, k-means clustering, and probabilistic
clustering methods.
➢ Semi-supervised learning:
Semi-supervised learning trains models on a small amount of labelled data combined
with a large amount of unlabelled data, sitting between supervised and unsupervised learning.
➢ Reinforcement learning:
Reinforcement learning uses algorithms that learn from outcomes and decide
which action to take next. After each action, the algorithm receives feedback that helps it
determine whether the choice it made was correct, neutral, or incorrect. It is a good
technique to use for automated systems that have to make a lot of small decisions without
human guidance.
Typical ML algorithms divide problems into sub-problems and address them
individually without concern for the main problem. However, RL is more about achieving
the long-term goal without dividing the problem into sub-tasks, thereby maximizing the
rewards.
5.3. DEEP LEARNING
1. Deep Learning is a subfield of Machine Learning that involves the use of neural
networks to model and solve complex problems. Neural networks are modelled after the
structure and function of the human brain and consist of layers of interconnected nodes that
process and transform data.
2. The key characteristic of Deep Learning is the use of deep neural networks, which
have multiple layers of interconnected nodes. These networks can learn complex
representations of data by discovering hierarchical patterns and features in the data. Deep
Learning algorithms can automatically learn and improve from data without the need for
manual feature engineering.
3. Deep Learning has achieved significant success in various fields, including image
recognition, natural language processing, speech recognition, and recommendation systems.
Some of the popular Deep Learning architectures include Convolutional Neural Networks
(CNNs), Recurrent Neural Networks (RNNs), and Deep Belief Networks (DBNs).
4. Training deep neural networks typically requires a large amount of data and
computational resources. However, the availability of cloud computing and the development
of specialized hardware, such as Graphics Processing Units (GPUs), has made it easier to train
deep neural networks.
In summary, Deep Learning is a subfield of Machine Learning that involves the use of
deep neural networks to model and solve complex problems. Deep Learning has achieved
significant success in various fields, and its use is expected to continue to grow as more data
becomes available, and more powerful computing resources become available.
Computer vision:
In computer vision, Deep learning models can enable machines to identify and
understand visual data. Some of the main applications of deep learning in computer vision
include:
• Object detection and recognition: Deep learning models can be used to identify and
locate objects within images and videos, enabling applications such as self-driving cars,
surveillance, and robotics.
• Image classification: Deep learning models can be used to classify images into
categories such as animals, plants, and buildings. This is used in applications such as
medical imaging, quality control, and image retrieval.
• Image segmentation: Deep learning models can be used to segment images into
different regions, making it possible to identify specific features within images.
Natural language processing (NLP):
In NLP, deep learning models can enable machines to understand and generate human
language. Some of the main applications of deep learning in NLP include:
• Automatic text generation: Deep learning models can learn from a corpus of text, and
new text such as summaries and essays can be automatically generated using these trained
models.
• Language translation: Deep learning models can translate text from one language to
another, making it possible to communicate with people from different linguistic
backgrounds.
• Sentiment analysis: Deep learning models can analyze the sentiment of a piece of text,
making it possible to determine whether the text is positive, negative, or neutral. This
is used in applications such as customer service, social media monitoring, and political
analysis.
• Speech recognition: Deep learning models can recognize and transcribe spoken words,
making it possible to perform tasks such as speech-to-text conversion, voice search,
and voice-controlled devices.
Reinforcement learning:
• Game playing: Deep reinforcement learning models have been able to beat human
experts at games such as Go, Chess, and Atari.
• Robotics: Deep reinforcement learning models can be used to train robots to perform
complex tasks such as grasping objects, navigation, and manipulation.
• Control systems: Deep reinforcement learning models can be used to control complex
systems such as power grids, traffic management, and supply chain optimization.
Advantages of deep learning:
• Automated feature engineering: Deep Learning algorithms can automatically
discover and learn relevant features from data without the need for manual feature
engineering.
• Scalability: Deep Learning models can scale to handle large and complex datasets, and
can learn from massive amounts of data.
• Flexibility: Deep Learning models can be applied to a wide range of tasks and can
handle various types of data, such as images, text, and speech.
• Continual improvement: Deep Learning models can continually improve their
performance as more data becomes available.
CNNs are particularly useful for finding patterns in images to recognize objects, classes,
and categories. A CNN is a deep learning algorithm, inspired by the animal visual
cortex, for processing data in the form of grid patterns such as images. CNNs are designed to
automatically detect and segment specific objects and to learn spatial hierarchies of features,
from low-level to high-level patterns.
1.Input Layers: It’s the layer in which we give input to our model. The number of
neurons in this layer is equal to the total number of features in our data (number of pixels in
the case of an image).
2. Hidden Layer: The input from the input layer is then fed into the hidden layer.
There can be many hidden layers depending upon our model and data size. Each hidden layer
can have a different number of neurons, generally greater than the number of features.
The output from each layer is computed by matrix multiplication of the output of the previous
layer with the learnable weights of that layer, followed by the addition of learnable biases and
an activation function, which makes the network nonlinear.
3. Output Layer: The output from the hidden layer is then fed into a logistic function
like sigmoid or softmax which converts the output of each class into the probability score of
each class.
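The computation described above can be sketched in a few lines of NumPy (the weights are random and the dimensions invented purely for illustration):

import numpy as np

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

x = np.random.rand(4)                          # 4 input features
W1, b1 = np.random.randn(8, 4), np.zeros(8)    # hidden layer: 8 neurons
W2, b2 = np.random.randn(3, 8), np.zeros(3)    # output layer: 3 classes

h = relu(W1 @ x + b1)            # matrix multiply + bias + activation
probs = softmax(W2 @ h + b2)     # probability score per class
print(probs, probs.sum())        # the probabilities sum to 1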
5.5. CNN ARCHITECTURE
Convolutional Neural Network consists of multiple layers like the input layer,
Convolutional layer, Pooling layer, and fully connected layers.
CNN Layers :
A deep learning CNN consists of three layer types: a convolutional layer, a pooling layer,
and a fully connected (FC) layer. The convolutional layer is the first layer, while the FC layer
is the last. From the convolutional layer to the FC layer, the complexity of the CNN increases.
It is this increasing complexity that allows the CNN to successively identify larger portions
and more complex features of an image until it finally identifies the object in its entirety.
Figure 13: CNN key Components
Convolutional Layer :
The majority of computations happen in the convolutional layer, which is the core building
block of a CNN. A second convolutional layer can follow the initial convolutional layer. The
process of convolution involves a kernel or filter inside this layer moving across the receptive
fields of the image, checking if a feature is present in the image. Over multiple iterations, the
kernel sweeps over the entire image. After each iteration, a dot product is calculated between
the input pixels and the filter. The final output from the series of dot products is known as a
feature map or convolved feature. Ultimately, the image is converted into numerical values in
this layer, which allows the CNN to interpret the image and extract relevant patterns from it.
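The kernel-sweeping computation described above can be sketched directly in NumPy (the image and kernel here are toy examples, not part of the proposed system):

import numpy as np

def convolve2d(image, kernel):
    # slide the kernel across the image; each position yields one dot product
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out                     # the feature map (convolved feature)

image = np.random.rand(6, 6)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]])    # a simple vertical-edge detector
print(convolve2d(image, kernel).shape)   # (4, 4)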
ReLU Layer :
The rectified linear unit (ReLU) layer applies an element-wise activation that
passes positive values through unchanged and replaces negative values with zero. This
introduces non-linearity into the network: a node is only activated when its input rises above
zero, and above that threshold the output has a linear relationship with the input.
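In code, the ReLU activation is a one-line element-wise operation, e.g.:

import numpy as np

x = np.array([-3.0, -0.5, 0.0, 2.0, 5.0])
print(np.maximum(0, x))   # [0. 0. 0. 2. 5.] -- negative values become zero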
Pooling Layer :
Like the convolutional layer, the pooling layer also sweeps a kernel or filter across the input
image. But unlike the convolutional layer, the pooling layer reduces the number of parameters
in the input and also results in some information loss. On the positive side, this layer reduces
complexity and improves the efficiency of the CNN.
Flatten Layer :
The intuition behind the flattening layer is to convert the data into a one-dimensional
array for feeding into the next layer. The output of the convolutional layers is flattened into a
single long feature vector, which is connected to the final classification model, called the fully
connected layer. For example, a [5, 5, 5] pooled feature map is flattened into a single 1x125
vector. So, the flatten layer converts a multidimensional array into a one-dimensional vector.
Fully Connected Layer:
The FC layer is where image classification happens in the CNN based on the features
extracted in the previous layers. Here, fully connected means that all the inputs or nodes from
one layer are connected to every activation unit or node of the next layer. Not all the layers in
the CNN are fully connected, however, because that would result in an unnecessarily dense
network; it would increase losses, affect the output quality, and be computationally expensive.
Advantages of CNNs:
➢ Good at detecting patterns and features in images, videos, and audio signals.
➢ Robust to translation, rotation, and scaling.
➢ End-to-end training, with no need for manual feature extraction.
➢ Can handle large amounts of data and achieve high accuracy.
5.6. CLASSIFICATION
The method of arranging things into groups is called classification. When we
classify things, we put them into groups based on their characteristics.
Image classification using CNN involves the extraction of features from the image to
observe some patterns in the dataset. Using an ANN for the purpose of image classification
would end up being very costly in terms of computation since the trainable parameters become
extremely large. We use filters when using CNNs.
The CNN architecture is especially useful for image recognition and image
classification, as well as other computer vision tasks because they can process large amounts
of data and produce highly accurate predictions.
Libraries Required:
TFLearn – Deep learning library featuring a higher-level API for TensorFlow used to create
layers of our CNN
tqdm – Instantly makes loops show a smart progress meter.
opencv – To process the images, e.g., converting them to grayscale.
os – To access the file system to read the images from the train and test directories on our
machine.
TensorFlow – Used here for TensorBoard, to compare the loss and Adam curves of our result
data or obtained logs.
5.7. PRE-PROCESSING
Data preprocessing is a step in the data mining and data analysis process that takes raw
data and transforms it into a format that can be understood and analyzed by computers and
machine learning. Raw, real-world data in the form of text, images, video, etc., is messy. Not
only may it contain errors and inconsistencies, but it is often incomplete and doesn't have a
regular, uniform design. Machines like to process nice and tidy information: they read data as
1s and 0s. So calculating structured data, like whole numbers and percentages, is easy.
However, unstructured data, in the form of text and images, must first be cleaned and
formatted before analysis.
When using data sets to train machine learning models, you'll often hear the
phrase "garbage in, garbage out." This means that if you use bad or "dirty" data to train your
model, you'll end up with a bad, improperly trained model that won't actually be relevant to
your analysis.
Good, preprocessed data is even more important than the most powerful algorithms, to
the point that machine learning models trained with bad data could actually be harmful to the
analysis you’re trying to do – giving you “garbage” results.
Figure 18: Data Pre-processing Importance
Depending on your data gathering techniques and sources, you may end up with data
that’s out of range or includes an incorrect feature, like household income below zero or an
image from a set of “zoo animals” that is actually a tree. Your set could have missing values
or fields. Or text data, for example, will often have misspelled words and irrelevant symbols,
URLs, etc.
5.8. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) balances an imbalanced dataset
by generating synthetic minority-class samples. It works as follows:
➢ Selecting a Minority Class Sample: Randomly select a sample from the minority
class.
➢ Finding Nearest Neighbors: Find the k-nearest neighbors of the selected sample from
the minority class. The number of neighbors (k) is a parameter that can be tuned.
➢ Generating Synthetic Samples: For each selected minority class sample, generate
synthetic samples by randomly selecting one of its k-nearest neighbors and creating a
new sample along the line connecting the two points in feature space.
➢ Adding Synthetic Samples: Repeat the process until the desired balance between
classes is achieved. Typically, the number of synthetic samples generated is
proportional to the level of class imbalance.
However, it's important to note that while SMOTE can be effective in certain scenarios,
it may not always lead to improved performance, and its effectiveness depends on the specific
characteristics of the dataset and the machine learning algorithm being used. Additionally, care
should be taken to avoid overfitting when using SMOTE, especially if the synthetic samples
are not representative of the true underlying data distribution.
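These steps are implemented, for example, by the SMOTE class in the imbalanced-learn package; a minimal sketch on a synthetic dataset (parameters chosen only for illustration) looks like this:

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# toy imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print('Before:', Counter(y))

smote = SMOTE(k_neighbors=5, random_state=42)   # k = number of nearest neighbours
X_res, y_res = smote.fit_resample(X, y)
print('After: ', Counter(y_res))                # classes are now balanced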
6.SYSTEM MODEL
[System model diagram: Dataset → Training/Testing split → Data Augmentation → CNN
(Convolutional Neural Network) → Classification]
6.2. DATASET
Benign Cases:
Malignant Cases:
Normal Cases:
6.3. PRE-PROCESSING
Preprocessing plays a critical role in preparing CT scan images for lung cancer
prediction and classification using deep learning methods. The main steps are:
➢ Acquisition standardization: ensure that CT scan images are acquired using standardized
protocols to minimize variability in image quality and characteristics.
➢ Normalization: normalize pixel values to a common scale (e.g., [0, 1] or [-1, 1]) to make
the data consistent and facilitate model convergence during training.
➢ Resampling: CT scans may have non-uniform voxel sizes and orientations; resample the
images to a common voxel size and orientation to ensure uniformity across the dataset.
Interpolation techniques like trilinear or cubic interpolation can be used.
➢ Noise reduction: apply techniques such as Gaussian blurring or median filtering to reduce
noise and improve image quality, especially in low-dose CT scans.
➢ Intensity standardization: reduce variability in pixel intensity values across different
scans using techniques such as histogram matching or z-score normalization.
➢ Lung segmentation: segment the lung region from the CT scan images to focus the
analysis on relevant anatomical structures, using thresholding, region growing, or deep
learning-based methods.
➢ Nodule detection: detect and segment lung nodules or lesions from the segmented lung
region, using thresholding, morphological operations, or machine learning-based methods.
➢ Data augmentation: augment the dataset to increase its size and diversity; common
techniques include rotation, translation, scaling, flipping, and adding noise to the images.
➢ Feature scaling: ensure that the distribution of image features is consistent across the
dataset to prevent bias during model training, using techniques like min-max scaling or
z-score normalization.
➢ Dimensionality reduction: reduce the dimensionality of the data to improve computational
efficiency and reduce the risk of overfitting, using techniques such as principal component
analysis (PCA) or autoencoders.
➢ Quality control: check that preprocessed images are free from artifacts or errors that
could adversely affect model performance.
➢ Annotation and labeling: annotate and label the preprocessed images with ground truth
information (e.g., presence or absence of cancer, tumour type) for supervised learning tasks.
A minimal code sketch of several of these steps follows the list.
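The sketch below applies some of the above steps to a single 2D slice using OpenCV (the file name and parameter values are assumptions for illustration, not the project's exact settings):

import cv2
import numpy as np

img = cv2.imread('ct_slice.png', cv2.IMREAD_GRAYSCALE)  # hypothetical file

img = cv2.resize(img, (256, 256))        # resample to a common size
img = cv2.medianBlur(img, 5)             # noise reduction
img = cv2.equalizeHist(img)              # intensity standardization
img = img.astype(np.float32) / 255.0     # normalize pixel values to [0, 1]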
6.4. SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) addresses class imbalance by
selecting minority-class samples, finding their nearest neighbours in feature space, and
generating new examples along the line connecting them. Applied to lung cancer prediction,
the workflow is as follows.

Start by collecting a dataset that contains features related to lung cancer prediction, such
as patient demographics, medical history, genetic information, and imaging data. Ensure that
the dataset includes labelled examples indicating whether each patient has lung cancer or not.
Before applying SMOTE, preprocess the dataset by handling missing values, scaling numerical
features, encoding categorical variables, and splitting the data into training and testing sets.
Check the class distribution to confirm whether it is imbalanced: if one class (e.g., lung cancer
cases) is significantly underrepresented compared to the other (e.g., non-cancer cases), this
imbalance needs to be addressed.

Apply SMOTE to oversample the minority class (lung cancer cases) in the training set.
This creates synthetic samples to balance the class distribution. Make sure to apply SMOTE
only to the training data to prevent information leakage into the test set. Then design and train
a deep learning model for lung cancer prediction and classification; various architectures such
as convolutional neural networks (CNNs), recurrent neural networks (RNNs), or hybrid models
can be used depending on the nature of the data.

Evaluate the performance of the model using appropriate metrics such as accuracy,
precision, recall, F1-score, and area under the ROC curve (AUC-ROC), using the test set that
was not involved in training or SMOTE to obtain unbiased performance estimates. Fine-tune
the hyperparameters using techniques like grid search or random search to further optimize
performance. Validate the generalizability of the model on independent datasets if available,
and interpret the model's predictions to understand the factors contributing to lung cancer
prediction and classification. Iterate on the model by incorporating feedback from domain
experts, refining feature engineering techniques, experimenting with different deep learning
architectures, and exploring advanced techniques for addressing imbalanced datasets. By
incorporating SMOTE into a deep learning-based lung cancer prediction and classification
workflow, the model's ability to learn from imbalanced data improves, potentially enhancing
its performance in identifying patients at risk of lung cancer.
In deep learning-based lung cancer prediction and classification tasks using medical
imaging data, image data augmentation plays a crucial role in improving model performance
and generalization. Typical augmentations include:
➢ Rotation and flipping: rotate lung images by small angles (e.g., ±15°) and perform
horizontal or vertical flips; this helps the model learn invariant features despite variations
in orientation.
➢ Rescaling and cropping: resize images to different scales and randomly crop patches; this
simulates variations in image resolution and focuses the model's attention on relevant
regions of interest.
➢ Translation: randomly shift the images horizontally and vertically, introducing variability
in the position of lung structures within the images.
➢ Elastic transformations: apply elastic transformations that simulate realistic deformations
of lung structures, helping the model generalize better across anatomical variations.
➢ Gaussian noise: add random Gaussian noise to the images, improving the model's
robustness to noise present in medical imaging data.
➢ Contrast and brightness adjustment: randomly adjust the contrast and brightness of the
images, helping the model generalize to different lighting conditions during imaging.
➢ Gamma correction: apply gamma correction to simulate variations in image intensity,
helping the model cope with differences in imaging techniques and equipment.
➢ Blurring and sharpening: apply Gaussian blur or sharpening filters so the model learns
features at different levels of detail.
➢ Histogram equalization: improve the contrast of the images, enhancing the visibility of
lung structures and abnormalities.
➢ Normalization: normalize the pixel values of the images to a standard scale (e.g., [0, 1]
or [-1, 1]), ensuring consistent input data across samples and improving convergence
during training.
When implementing data augmentation for deep learning-based lung cancer prediction and
classification, it's essential to apply these techniques judiciously and validate their
effectiveness through cross-validation or separate validation datasets. Additionally, consider
the computational resources required for augmentation, especially when working with large
datasets or complex augmentation pipelines.
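Several of the augmentations above map directly onto parameters of Keras's ImageDataGenerator; a hedged sketch (the parameter values are illustrative, not tuned settings) looks like this:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_datagen = ImageDataGenerator(
    rotation_range=15,             # rotate by small angles (±15°)
    width_shift_range=0.1,         # random horizontal shifts
    height_shift_range=0.1,        # random vertical shifts
    zoom_range=0.1,                # rescaling / cropping effect
    horizontal_flip=True,          # horizontal flips
    brightness_range=(0.8, 1.2),   # brightness variation
)
# augmented_batches = train_datagen.flow(X_train, y_train, batch_size=8)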
Gather a large dataset of lung images, including both images of patients with lung
cancer and those without. Ensure that the images are properly labeled, indicating whether they
depict a healthy lung or one with cancerous growths.Preprocess the images to standardize their
size, resolution, and orientation. You may also need to perform augmentation techniques to
increase the diversity of your dataset.
Design a CNN architecture suitable for image classification tasks. This typically
involves stacking convolutional layers, pooling layers, and possibly some fully connected
layers. Consider using popular CNN architectures like VGG, ResNet, or Inception, possibly
with modifications to suit your specific needs.Since lung cancer prediction is a binary
classification problem (cancerous vs. non-cancerous), the output layer of your CNN should
have a single neuron with a sigmoid activation function.
Split your dataset into training, validation, and testing sets. The training set is used to
train the model, the validation set helps tune hyperparameters and prevent overfitting, and the
testing set evaluates the model's performance. Train your CNN using the training set,
optimizing a suitable loss function (such as binary cross-entropy) with an optimizer like Adam
or RMSprop.Monitor the model's performance on the validation set and adjust hyperparameters
(e.g., learning rate, dropout rate) as needed to improve performance.
Once training is complete, evaluate the model's performance using the testing set.
Calculate metrics such as accuracy, precision, recall, and F1-score to assess the model's
effectiveness at identifying lung cancer.Additionally, consider generating a confusion matrix
to visualize the model's predictions and identify any patterns of misclassification.If the model
performs satisfactorily, deploy it in a clinical setting where it can assist radiologists in
diagnosing lung cancer. Ensure that the deployment environment meets any regulatory or
ethical requirements for medical software. Continuously monitor the model's performance in
production and update it as necessary to maintain accuracy and reliability.
Benign Cases:
A benign lung tumour is an abnormal rate of cell division or cell death in lung tissue or
the airways that lead to the lungs. It isn't cancerous. Types include hamartomas, adenomas,
and papillomas. In most cases, benign lung tumours don't require treatment, but a healthcare
provider will recommend monitoring them for changes.
Malignant Cases:
A cancerous tumour of the lung can grow into nearby tissue and destroy it. The tumour
can also spread (metastasize) to other parts of the body. Cancerous tumours are also called
malignant tumours. There are 2 main types of lung cancer: non−small cell lung cancer and
small cell lung cancer.
Normal Cases:
When Lungs Are Healthy. Healthy lungs look and feel like sponges. They're pink,
squishy, and flexible enough to squeeze and expand with each breath. Their main job is to take
oxygen out of the air you breathe and pass it into your blood.
7. SYSTEM IMPLEMENTATION
Import Packages
import numpy as np
import pandas as pd
import cv2
import random
import os
import imageio
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from collections import Counter
from sklearn.model_selection import (RandomizedSearchCV, cross_val_score,
                                     RepeatedStratifiedKFold, train_test_split)
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
import tensorflow as tf
import keras
from keras.models import Sequential
from keras.layers import Conv2D, Activation, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator
# dataset root and class folder names (the path is an assumption -- adjust as needed)
directory = '/content/IQ-OTH-NCCD'
categories = ['Bengin cases', 'Malignant cases', 'Normal cases']

# count how many images of each pixel size exist per class
size_data = {}
for i in categories:
    path = os.path.join(directory, i)
    temp_dict = {}
    for file in os.listdir(path):
        img = cv2.imread(os.path.join(path, file), 0)
        if img.shape in temp_dict:
            temp_dict[img.shape] += 1
        else:
            temp_dict[img.shape] = 1
    size_data[i] = temp_dict
size_data
# display one sample image from each class
for i in categories:
    path = os.path.join(directory, i)
    print(i)
    for file in os.listdir(path):
        filepath = os.path.join(path, file)
        img = cv2.imread(filepath, 0)
        plt.imshow(img, cmap='gray')
        plt.show()
        break
# show a few samples per class alongside two preprocessing variants;
# img0/img1 are reconstructed here as resize + median filter (an assumption)
img_size = 256
for i in categories:
    cnt, samples = 0, 3
    fig, ax = plt.subplots(samples, 3, figsize=(12, 12))
    fig.suptitle(i)
    path = os.path.join(directory, i)
    for file in os.listdir(path):
        filepath = os.path.join(path, file)
        img = cv2.imread(filepath, 0)
        img0 = cv2.resize(img, (img_size, img_size))
        img1 = cv2.medianBlur(img0, 5)
        ax[cnt, 0].imshow(img)
        ax[cnt, 1].imshow(img0)
        ax[cnt, 2].imshow(img1)
        cnt += 1
        if cnt == samples:
            break
    plt.show()
Preparing Data

data = []
img_size = 256
for i in categories:
    path = os.path.join(directory, i)
    class_num = categories.index(i)
    for file in os.listdir(path):
        filepath = os.path.join(path, file)
        img = cv2.imread(filepath, 0)
        img = cv2.resize(img, (img_size, img_size))  # preprocess here
        data.append([img, class_num])

random.shuffle(data)

X, y = [], []
for feature, label in data:
    X.append(feature)
    y.append(label)

# normalize
X = np.array(X).reshape(-1, img_size, img_size, 1)
X = X / 255.0
y = np.array(y)

# split into training and validation sets (an 80/20 split is an assumption)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2,
                                                      stratify=y, random_state=42)

print(len(X_train), X_train.shape)
print(len(X_valid), X_valid.shape)
print(Counter(y_train), Counter(y_valid))
# flatten images to 1-D vectors so SMOTE can operate in feature space
print(len(X_train), X_train.shape)
X_train = X_train.reshape(X_train.shape[0], img_size*img_size*1)
print(len(X_train), X_train.shape)

smote = SMOTE()
X_train_sampled, y_train_sampled = smote.fit_resample(X_train, y_train)

# reshape back to image tensors for the CNN
X_train = X_train.reshape(X_train.shape[0], img_size, img_size, 1)
X_train_sampled = X_train_sampled.reshape(-1, img_size, img_size, 1)
print(len(X_train_sampled), X_train_sampled.shape)
model1 = Sequential()
model1.add(Conv2D(64, (3, 3), input_shape=(img_size, img_size, 1)))
model1.add(Activation('relu'))
model1.add(MaxPooling2D(pool_size=(2, 2)))
model1.add(Conv2D(64, (3, 3)))           # second conv block (reconstructed)
model1.add(Activation('relu'))
model1.add(MaxPooling2D(pool_size=(2, 2)))
model1.add(Flatten())
model1.add(Dense(16))
model1.add(Dense(3, activation='softmax'))
model1.summary()
model1.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
history = model1.fit(X_train_sampled, y_train_sampled, batch_size=8, epochs=10,
validation_data=(X_valid, y_valid))
Results

y_pred = model1.predict(X_valid)
y_pred_bool = np.argmax(y_pred, axis=1)   # predicted class indices
print(classification_report(y_valid, y_pred_bool))
print(confusion_matrix(y_true=y_valid, y_pred=y_pred_bool))
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
model2 = Sequential()
model2.add(Conv2D(64, (3, 3), input_shape=X_train.shape[1:]))
model2.add(Activation('relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(Conv2D(64, (3, 3)))           # second conv block (reconstructed)
model2.add(Activation('relu'))
model2.add(MaxPooling2D(pool_size=(2, 2)))
model2.add(Flatten())
model2.add(Dense(16))
model2.add(Dense(3, activation='softmax'))
model2.summary()
model2.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
# inverse-frequency class weights: n_samples / (n_classes * class_count)
new_weights = {
    0: X_train.shape[0] / (3 * Counter(y_train)[0]),
    1: X_train.shape[0] / (3 * Counter(y_train)[1]),
    2: X_train.shape[0] / (3 * Counter(y_train)[2]),
}
# new_weights[0] = 0.5
# new_weights[1] = 20
new_weights

# training call reconstructed: class_weight applies the weights above
history = model2.fit(X_train, y_train, batch_size=8, epochs=10,
                     validation_data=(X_valid, y_valid), class_weight=new_weights)
Results

y_pred = model2.predict(X_valid)
y_pred_bool = np.argmax(y_pred, axis=1)
print(classification_report(y_valid, y_pred_bool))
print(confusion_matrix(y_true=y_valid, y_pred=y_pred_bool))
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
Data Augmentation

# training generator with augmentations (parameter values reconstructed)
train_datagen = ImageDataGenerator(rotation_range=15,
                                   width_shift_range=0.1,
                                   height_shift_range=0.1,
                                   horizontal_flip=True)
train_generator = train_datagen.flow(X_train, y_train, batch_size=8)

val_datagen = ImageDataGenerator()
val_generator = val_datagen.flow(X_valid, y_valid, batch_size=8)

model3 = Sequential()
model3.add(Conv2D(64, (3, 3), input_shape=X_train.shape[1:]))
model3.add(Activation('relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(Conv2D(64, (3, 3)))           # second conv block (reconstructed)
model3.add(Activation('relu'))
model3.add(MaxPooling2D(pool_size=(2, 2)))
model3.add(Flatten())
model3.add(Dense(16))
model3.add(Dense(3, activation='softmax'))
model3.summary()
model3.compile(loss='sparse_categorical_crossentropy', optimizer='adam',
metrics=['accuracy'])
history = model3.fit(train_generator, epochs=10, validation_data=val_generator)

Results

y_pred = model3.predict(X_valid)
y_pred_bool = np.argmax(y_pred, axis=1)
print(classification_report(y_valid, y_pred_bool))
print(confusion_matrix(y_true=y_valid, y_pred=y_pred_bool))

CM = confusion_matrix(y_valid, y_pred_bool)
CM_percent = CM.astype('float') / CM.sum(axis=1, keepdims=True)  # row-normalized (reconstructed)
sns.heatmap(CM_percent, fmt='g', center=True, cbar=False, annot=True, cmap='Blues')
CM
plt.plot(history.history['accuracy'], label='Train')
plt.plot(history.history['val_accuracy'], label='Validation')
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend()
plt.show()
plt.plot(history.history['loss'], label='Train')
plt.plot(history.history['val_loss'], label='Validation')
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend()
plt.show()
8. RESULTS AND DESCRIPTION
The models are evaluated using the following metrics:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
Accuracy:
The accuracy metric is one of the simplest classification metrics to implement, and it
can be determined as the ratio of correct predictions to the total number of predictions.
To implement an accuracy metric, we can compare ground truth and predicted values
in a loop, or we can use the scikit-learn module for this.
from sklearn.metrics import accuracy_score

Here, accuracy_score is a function from the sklearn.metrics module. We pass the ground
truth and predicted values to it to calculate the accuracy:

print(f'Accuracy Score is {accuracy_score(y_test, y_hat)}')
Although it is simple to use and implement, it is suitable only for cases where an
approximately equal number of samples belong to each class.
It is good to use the accuracy metric when the target variable classes in the data are
approximately balanced. For example, if 60% of the images in a fruit dataset are of apples
and 40% are of mangoes, and the model is asked to predict whether an image is of an apple
or a mango, an accuracy figure (say, 97%) is a meaningful summary of performance.
It is recommended not to use the accuracy measure when the target variable majorly
belongs to one class. For example, suppose there is a disease-prediction model in which,
out of 100 people, only five people have the disease and 95 don't. In this case, if the model
predicts that every person is disease-free (a useless model), the accuracy measure will still
be 95%, which is misleading.
Confusion Matrix:
The Confusion Matrix is a visual assessment method for classification models. The predicted
class results are represented in the columns of a Confusion Matrix, whereas the real class
results are represented in the rows. This matrix includes all the raw data regarding a
classification model's decisions on a specified data collection, and is used to determine how
accurate a model is. It is a square matrix, with the rows representing the instances' real class
and the columns representing their predicted class. When dealing with a binary problem, the
confusion matrix is a 2 x 2 matrix that reports the number of true positives (TP), true
negatives (TN), false positives (FP), and false negatives (FN):

                    Predicted Positive    Predicted Negative
Actual Positive            TP                    FN
Actual Negative            FP                    TN
Precision, recall, and F-measure, which are commonly utilized in the text mining and
machine learning communities, were used to evaluate the algorithms. The four types of
classified items are: true positives (TP – items correctly labelled as belonging to the class),
false positives (FP – items incorrectly labelled as belonging to the class), false negatives
(FN – items incorrectly labelled as not belonging to the class), and true negatives (TN – items
correctly labelled as not belonging to the class).
Recall:
Recall is determined from the number of true positives and false negatives. It is
similar to the Precision metric; however, it aims to calculate the proportion of actual
positives that were identified correctly. It is calculated as the number of true positives
divided by the total number of actual positives, i.e., those correctly predicted as positive
plus those incorrectly predicted as negative:

    Recall = TP / (TP + FN)
Precision:
The precision metric is used to overcome the limitation of accuracy. Precision determines the proportion of positive predictions that were actually correct. It is the ratio of true positives to the total number of positive predictions (true positives and false positives). Precision (also known as the "positive predictive value") is measured from the number of true positive and false positive items as follows:

Precision = TP / (TP + FP)
From the above definitions of precision and recall, we can say that recall describes the performance of a classifier with respect to false negatives, whereas precision describes its performance with respect to false positives. So, if we want to minimize false negatives, recall should be as close to 100% as possible, and if we want to minimize false positives, precision should be as close to 100% as possible. In simple words, maximizing precision minimizes FP errors, and maximizing recall minimizes FN errors.
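Both metrics can be computed with scikit-learn; the sketch below again assumes the y_test and y_hat arrays from the accuracy example, and uses macro averaging since the problem here has three classes (benign, malignant, normal):

from sklearn.metrics import precision_score, recall_score

# average='macro' takes the unweighted mean of the per-class scores.
print(f'Precision: {precision_score(y_test, y_hat, average="macro")}')
print(f'Recall: {recall_score(y_test, y_hat, average="macro")}')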
The measure that combines precision and recall is known as the F-measure, given as:

F_beta = (1 + beta^2) × (Precision × Recall) / (beta^2 × Precision + Recall)

where beta denotes the relative weight of recall against precision. A value of beta = 1 (which is often used) means that recall and precision are of equal importance. A lower value implies that precision is more important, whereas a higher value indicates that recall is more important.
F-Score:
The formula for calculating the F1 score is given below:

F1 = 2 × (Precision × Recall) / (Precision + Recall)

As the F-score makes use of both precision and recall, it should be used when both of them are important for evaluation; the weighted F-beta variant is used when one of them is slightly more important to consider than the other, for example, when false negatives are comparatively more costly than false positives, or vice versa.
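A minimal sketch of both scores with scikit-learn, under the same assumed y_test and y_hat arrays:

from sklearn.metrics import f1_score, fbeta_score

# beta = 1 reduces F-beta to the F1 score (equal weight on precision and recall);
# beta = 2 weights recall more heavily, beta = 0.5 weights precision more heavily.
print(f1_score(y_test, y_hat, average='macro'))
print(fbeta_score(y_test, y_hat, beta=2, average='macro'))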
AUC-ROC Curve:
Firstly, let's understand the ROC (Receiver Operating Characteristic) curve. The ROC curve is a graph showing the performance of a classification model at different threshold levels. The curve is plotted between two parameters:

TPR (true positive rate), a synonym for recall: TPR = TP / (TP + FN)
FPR (false positive rate): FPR = FP / (FP + TN)

To calculate the value at any point on a ROC curve, we could evaluate a logistic regression model multiple times with different classification thresholds, but this would not be very efficient. Instead, an efficient aggregate measure is used, known as AUC.
AUC stands for Area Under the ROC Curve. As its name suggests, AUC calculates the two-dimensional area under the entire ROC curve.
AUC measures the performance across all thresholds and provides an aggregate measure. The value of AUC ranges from 0 to 1: a model whose predictions are 100% wrong has an AUC of 0.0, whereas a model whose predictions are 100% correct has an AUC of 1.0.
AUC should be used to measure how well the predictions are ranked rather than their absolute values. Moreover, it measures the quality of the model's predictions irrespective of the classification threshold.
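As a sketch for a binary setting (the y_true labels and y_score probabilities below are assumed placeholders; for the three-class problem here, a one-vs-rest scheme would be needed):

from sklearn.metrics import roc_curve, roc_auc_score

# y_score holds the model's predicted probability of the positive class.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
# AUC aggregates the ranking quality across all thresholds.
print(f'AUC: {roc_auc_score(y_true, y_score)}')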
Figure 25: Balanced data model accuracy
Figure 26: Balanced data model loss
ROC curve with Weighted Approach:
Figure 27: Weighted approach model accuracy
Figure 28: Weighted approach model loss
Figure 29: Augmentation model accuracy
Figure 30: Augmentation model loss
8.3. Classification report:
For lung cancer prediction using deep learning techniques, a classification report can provide valuable insights into the performance of the model. This classification report presents the performance evaluation of a lung cancer prediction model. The goal of this model is to accurately classify lung cancer CT scan images into three categories: benign cases, malignant cases, and normal cases.
The confusion matrix below has rows for the actual classes and columns for the predicted classes, in the order benign, malignant, normal:
[[ 29   0   1]
 [  2 137   2]
 [  2   0 102]]
The model achieved a precision of 0.88 for predicting benign cases, 1.00 for predicting malignant cases, and 0.97 for predicting normal cases. The recall (sensitivity) values for benign, malignant, and normal cases are 0.97, 0.97, and 0.98, respectively, which shows that the model correctly identifies about 97% of the actual cases in each class. The F1-score, which considers both precision and recall, is 0.92 for benign cases, 0.99 for malignant cases, and 0.98 for normal cases. The overall accuracy of the model is 97%, which suggests that the model performs well in classifying lung cancer CT scan images into benign, malignant, and normal cases.
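The report itself can be produced with scikit-learn, reusing the assumed y_test and y_hat arrays and a class ordering matching the confusion matrix above:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1 score, and support, plus overall accuracy.
target_names = ['benign', 'malignant', 'normal']
print(classification_report(y_test, y_hat, target_names=target_names))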
9. CONCLUSION
In the proposed approach, deep learning models were built for the efficient diagnosis of lung cancer. The method used to predict and classify lung cancer is a CNN. The experimental analyses were done on CT scan images and histopathological images. The performance of the proposed approach was assessed using recognition accuracy, F-score, precision, recall, support, etc. Because of the inherent advantages of the proposed methodology, it is an efficient method for lung cancer prediction and classification and will be beneficial to those who need it. The enhanced performance is seen in the CNN model, with an accuracy of 97%. The highest precision is 98% and the recall rate is 97%. The maximum support is obtained in the CNN. Similarly, the CNN shows a maximum F-score of 96%. In the future, this model may become a desktop application for oncologists, embedded with IoT.