Enhanced Lung Cancer Detection From CT Scans Leveraging Deep Learning For Precise Detection
Abstract—Lung cancer is a prevalent and lethal disease worldwide, which makes early detection critical to a patient's likelihood of survival. Computed tomography (CT) imaging is a widely employed modality for screening and diagnosing lung cancer because it provides comprehensive lung images. With the advancement of computer-assisted systems, deep learning methods have been extensively explored to facilitate the interpretation of CT scans for lung cancer diagnosis. This study therefore introduces a deep learning method devised specifically for the detection of lung cancer. The paper covers an introduction to the deep learning (DL) approach, the proposed DL technique tailored for lung cancer applications, and the innovative features of the examined methods. We trained our CNN model on the IQ-OTH/NCCD Lung Cancer Dataset, which consists of 1190 CT scan images of benign, malignant, and normal cases. First, we resized the images and applied Gaussian blur. Then we applied SMOTE, class weighting, and data augmentation to equalize the number of training images for each class and remove model bias. After that, we trained the model and tested it on 25% held-out testing data, obtaining 98% accuracy with SMOTE, 99% accuracy with class weighting, and 98% accuracy with data augmentation, which is higher than the accuracy reported for other systems. Notably, the study centers on a Convolutional Neural Network (CNN)-based architecture designed for lung cancer screening.

Index Terms—Lung Cancer, Deep Learning, CT Scan, SMOTE, Weighted Class, Data Augmentation.

I. INTRODUCTION

Lung cancer constitutes a significant global public health challenge, and its incidence has been increasing over the years. It is the main cause of cancer incidence and mortality worldwide, accounting for 2 million diagnoses and 1.8 million deaths [1].

Early diagnosis of lung cancer is important for effective treatment and better prognosis. Various diagnostic methods are available, including imaging techniques, biopsies, and biomarker tests, each with its advantages and limitations.

In this paper, we aim to review the current state of lung cancer diagnosis, including the different diagnostic methods, their accuracy, and their potential for early detection. We will also examine the challenges faced in the diagnosis of lung cancer and explore new approaches and technologies that may improve early detection and enhance patient outcomes.

We will also explore the potential role of artificial intelligence (AI) in lung cancer diagnosis. AI has the potential to improve the accuracy of imaging interpretation and assist in the identification of biomarkers for early detection.

CNNs are a type of neural network that is very good at image identification. We have used medical images such as CT scans as inputs for the CNN to improve the accuracy of lung cancer diagnosis. Our main focus in this research is to use an AI system to enable earlier detection and to determine the stage of cancer more accurately than is currently possible.

The dataset was acquired from Kaggle, a widely recognized platform known for its extensive repository of datasets spanning diverse domains and for hosting data science competitions. Procuring the dataset from Kaggle was a deliberate choice. The Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) lung cancer dataset was gathered at these specialized hospitals during a three-month period in the autumn of 2019.

For data preprocessing, we first collected the dataset and categorized it into three cases: normal, benign, and malignant. Then we applied Gaussian blur to enhance the features.

Because the CT scan data on lung cancer are imbalanced, we adopted a three-pronged approach. First, SMOTE "stretches" existing cancer data by creating synthetic samples, exposing the model to enough features. Second, class weights prioritize
cancer images, ensuring they influence learning despite their limited number. Finally, data augmentation manipulates existing scans with rotations and flips, forcing the model to learn lung cancer across diverse appearances. This multi-faceted approach effectively trains models on limited data, enhancing their ability to detect rare and critical cases.

The dataset was divided into training and testing sets. A sequential CNN model was trained on the dataset and achieved 98% accuracy with SMOTE pre-processing, 99% accuracy with class weighting, and 98% accuracy with data augmentation.

II. RELATED WORKS

CNNs (Convolutional Neural Networks) have been used successfully in lung cancer diagnosis. One common approach is to use CT (Computed Tomography) scans to train a CNN to detect and classify lung nodules or masses as either malignant (cancerous) or benign (non-cancerous).

A study published in 2020 reported that a CNN was able to classify lung nodules on CT scans with a sensitivity of 86% and a specificity of 83%.

Ahmed et al. [2] developed a 3D CNN model to determine whether a CT scan image is cancerous or non-cancerous. Their experimental results show that the method can achieve a detection accuracy of about 80%.

Li et al. [3] employed patch-based multi-resolution convolutional networks to extract features and four different fusion methods for classification. The approach was reported to be more robust than previously reported methods and achieved 90% testing accuracy.

Cifci et al. [4] used the SegChaNet method, which encodes CT slices of the input lung into feature maps using a trail of encoders. They obtained a sensitivity of 98.48%.

Shah et al. [5] used three different 2D CNN models and combined their outputs, giving three different implementations. They obtained a sensitivity (true positive rate) of 95%.

Rajasekar et al. [6] focus on CNN, CNN GD, Inception V3, ResNet-50, VGG-16, and VGG-19. Evaluation was based on CT scan and histopathological images. Among the models, CNN GD stood out with an accuracy of 97.86%, a precision of 96.39%, a sensitivity of 96.79%, a specificity of 97.40%, and an F-score of 97.96%.

Sori et al. [7] introduce DFD-Net, a lung cancer detection model. The two-path CNN architecture outperforms other versions, especially after retraining, achieving 87.8% accuracy with preprocessing and 77.3% without.

Shakeel et al. [8] introduce a lung cancer detection system using an improved deep neural network (IDNN) and ensemble classifiers. The IDNN achieves high segmentation accuracy (96%), specificity (98%), precision (97%), recall (98%), and F1-score (98%) compared to traditional methods. The HSOGR selects 14 features out of 50, minimizing computational complexity and outperforming other feature selection methods.

Naseer et al. [9] introduce a lung cancer detection method named LungNet-SVM, which combines a modified AlexNet architecture with the Support Vector Machine (SVM) algorithm. The model has seven convolutional layers, three pooling layers, and two fully connected layers for feature extraction. On the LUNA16 dataset it achieves 97.64% accuracy, 96.37% sensitivity, and 99.08% specificity.

Across the models surveyed here for lung cancer detection, the highest reported result was about 98.48%, obtained with the SegChaNet method [4].

III. PROPOSED FRAMEWORK

In this section, we elaborate on the methodology that processes user input and passes it through the framework for lung cancer detection from CT scan images using deep learning. Fig. 1 illustrates the system architecture, which comprises three fundamental components. The roles and functionalities of these components are described in the subsequent subsections. Notably, our system operates with dynamic processing capabilities.

Fig. 1. Diagram of Proposed Methodology

• Original dataset: This refers to the collection of images that was used to train the CNN model. The original dataset includes benign, malignant and normal images.
• Data preprocessing and segmentation: In this step, the images in the original dataset were collected and categorized. The images were resized to a standard size to ensure consistent viewing. The resized images were further processed with a Gaussian blur to smooth out details and focus on larger features. We then assigned numerical labels, separated features and labels, normalized pixel values, and prepared the data for further use, such as balancing minority classes in the dataset using SMOTE, class weighting, and so on.
• Data splitting: The preprocessed data was then split into two sets: a training set and a testing set. The training set is used to train the CNN model, while the testing set is
  used to evaluate the performance of the trained model. The split ratio is 75% for training and 25% for testing.
• Training the CNN model: The training set is used to train the CNN model. During training, the CNN model learns to identify the features that differ between benign, malignant and normal nodules.
• Testing the CNN model: The testing set was used to evaluate the performance of the trained CNN model. The model was given a set of images that it had never seen before, and it was asked to classify them as benign, malignant or normal. The accuracy of the model's predictions was then measured.
• Detect nodule: Once the model was trained, it was used to detect nodules in new images. The new images were preprocessed and then the CNN model was used to classify the regions of interest (ROIs) in the image as benign, malignant or normal.

IV. DATASETS COLLECTION

The Iraq-Oncology Teaching Hospital/National Center for Cancer Diseases (IQ-OTH/NCCD) lung cancer dataset was gathered at these specialized hospitals during a three-month period in the autumn of 2019. The dataset contains computed tomography (CT) scans acquired from patients diagnosed with varying stages of lung cancer, along with scans from individuals deemed healthy. Oncologists and radiologists in the IQ-OTH/NCCD centers annotated the slides. The overall dataset comprises 1190 images, distributed across the following categories: 120 images for benign cases, 561 images for malignant cases, and 416 images for normal cases. The remaining images are designated as test case images for further evaluation and analysis.

V. DATASET PREPARING AND PREPROCESSING

Preprocessing the CT Scan Dataset: Categorization and Smoothing. The first step involved categorizing the dataset into three distinct cases: normal, benign (non-cancerous growths), and malignant (cancerous tumors). Each category has specific characteristics that support accurate classification. To reduce noise and irrelevant details, Gaussian blur was applied, acting as a filter that smooths the images while preserving crucial tumor features. This allowed the model to focus on defining characteristics for better analysis.

Addressing Data Imbalance and Generalizability. The dataset was then split into training and testing sets. The model learned from the training set, adjusting its parameters based on identified features. The testing set served as a blind evaluation, revealing the model's performance on unseen data and ensuring it could generalize to real-world lung cancer cases beyond memorized training examples.

To address the imbalance of rare cancer cases compared to normal/benign ones, three data sampling techniques were employed.

SMOTE: This created synthetic cancer images, essentially "stretching" the available data to give the model more exposure to subtle lung cancer features.

Class Weights: Higher weights were assigned to cancer images, forcing the model to pay more attention to them during training and ensuring their limited number had a strong influence on learning.

Data Augmentation: Existing CT scans were manipulated with rotations, flips, and shifts, creating a diverse library. This forced the model to learn the essence of lung cancer across various appearances, making it more robust to real-world variations.

By combining these techniques, we ensured the model had sufficient and diverse data to effectively learn and identify lung cancer in CT scans, even with limited real-world data availability.

VI. DATA SAMPLING

We first divided the dataset into approximately 75% for training and 25% for testing, and then applied three different data sampling techniques.

SMOTE: We applied SMOTE to the training data to create synthetic versions of the rare cases, so that 'Benign cases (0)', 'Malignant cases (1)' and 'Normal cases (2)' have an equal number of training samples.
Before SMOTE: Counter({1: 420, 2: 312, 0: 90})
After SMOTE: Counter({2: 420, 1: 420, 0: 420})

Class Weights: We used a class-weighted approach. By assigning higher weights to the minority class, the model is encouraged to pay more attention to it during training. This ensures that even the limited number of minority cases has a strong influence on the model's learning process, giving 'Benign cases (0)', 'Malignant cases (1)' and 'Normal cases (2)' a balanced influence without changing the number of images.
The weights we applied to the dataset are:
{0: 3.0444444443, 1: 0.6523809523809524, 2: 0.8782051282051282}

Data Augmentation: We also used data augmentation. It takes the existing CT scan images and applies transformations such as rotations, flips, and slight shifts, essentially creating a diverse library. This forces the model to learn the essence of lung cancer across a wider range of appearances, making it more robust and adaptable to real-world variations in CT scans, and provides additional data for 'Benign cases (0)', 'Malignant cases (1)' and 'Normal cases (2)'.
We applied vertical and horizontal flips to the dataset.

VII. MODEL DEVELOPMENT

The CNN model employs a sequential structure, meaning layers are stacked sequentially to form a linear path for data flow.
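Returning briefly to the class weights listed in Section VI: the reported values are consistent with the common "balanced" heuristic, weight_c = N / (K × n_c), applied to the pre-SMOTE training counts (N samples, K classes, n_c samples in class c). The sketch below works under that assumption; the paper does not state how the weights were computed, and the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(counts):
    """Return weight_c = N / (K * n_c) for every class label c.

    Assumption: this "balanced" heuristic is what produced the weights
    quoted in Section VI; the paper itself does not say so.
    """
    total = sum(counts.values())   # N: total number of training samples
    n_classes = len(counts)        # K: number of classes
    return {label: total / (n_classes * n)
            for label, n in sorted(counts.items())}


# Training-set class counts before SMOTE, as reported in Section VI
# (0 = benign, 1 = malignant, 2 = normal).
train_counts = Counter({0: 90, 1: 420, 2: 312})
weights = balanced_class_weights(train_counts)

for label, weight in weights.items():
    print(label, round(weight, 4))
```

Under this heuristic each class contributes the same total weight (N / K) to the loss, which is why the rarest class, benign, receives the largest per-sample weight.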
Fig. 3. CNN model architecture

In the model shown in Fig. 3, layer one uses 64 filters with 3x3 kernels and ReLU activation to capture basic features.

$$\mathrm{ReLU}(z) = \max(0, z) \qquad (1)$$

Layer two employs 128 filters with 3x3 kernels and ReLU activation (Eq. 1) for extracting more complex features. Layer three increases complexity further with 256 filters and ReLU activation (Eq. 1). A flattening layer transforms the multidimensional feature maps into a 1D vector for feeding into fully connected layers. These layers introduce non-linearity for complex relationships using ReLU activation (Eq. 1). The final fully connected layer uses the extracted features to determine class probabilities with softmax activation,

$$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}} \qquad (2)$$

ensuring they sum to 1 and represent mutually exclusive classes. The network ends with two fully connected layers: one with 512 neurons for complex feature interactions and another with 3 neurons (corresponding to the number of classes) producing class probabilities via softmax activation (Eq. 2).

$$\mathrm{Recall} = \frac{TP}{TP + FN} \times 100 \qquad (5)$$

The calculation of Precision (Eq. 4), Recall (Eq. 5), and F-measure (Eq. 6) is integral to the analysis of model performance. The results are presented in tabular format for clarity.

$$F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \qquad (6)$$

Confusion matrices are prepared for lung cancer detection, and these data are used to assess our system. We calculate TP, FP, FN and TN from the confusion matrix, and from these values we calculate Accuracy, Precision, Recall and F-measure using Eq. (3) to Eq. (6).

A. Results after Applying SMOTE

For validation, a batch size of 8 and 10 epochs were chosen.

As Table I shows, the program achieved 98% accuracy after applying Eq. (3) to Eq. (6).
TABLE I
TABLE OF ACCURACY AFTER APPLYING SMOTE

                precision   recall   f1-score   support
accuracy                                 0.98       275
macro avg            0.96     0.97       0.97       275
weighted avg         0.98     0.98       0.98       275

B. Results after Applying Weighted Class

For validation, we chose a batch size of 8 and 10 epochs.

TABLE II
TABLE OF ACCURACY AFTER APPLYING CLASS WEIGHTED APPROACH

                precision   recall   f1-score   support
accuracy                                 0.99       275
macro avg            0.96     0.99       0.97       275
weighted avg         0.99     0.99       0.99       275

As Table II shows, the program achieved 99% accuracy after applying Eq. (3) to Eq. (6).

C. Results after Applying Data Augmentation

For validation, the Adam optimizer with a learning rate of 0.0001 was used and the number of epochs was set to 10.

TABLE III
TABLE OF ACCURACY AFTER APPLYING DATA AUGMENTATION

                precision   recall   f1-score   support
accuracy                                 0.98       275
macro avg            0.96     0.98       0.97       275
weighted avg         0.98     0.98       0.98       275

As Table III shows, the program achieved 98% accuracy after applying Eq. (3) to Eq. (6).

IX. CONCLUSION AND FUTURE WORK

In this research, we developed a promising new technique for lung cancer detection in CT scans using deep learning. Our primary focus was to create an accurate method for cancer identification, and while the system has not undergone commercial testing, the results suggest potential for real-world application.

This project served as a valuable learning experience, introducing us to various new concepts, functionalities, techniques, and potential issues. These insights will inform our future work.

Looking ahead, we plan to refine our methodology to boost performance and achieve even higher accuracy. We are excited to see how this research can contribute to advancements in lung cancer detection.

Here are some key details from the research worth highlighting:

* We employed diverse data sampling techniques to enhance the system's robustness and accuracy.
* We introduced an efficient detection procedure that surpasses the performance of traditional methods.
* We acknowledge limitations such as relying on a single dataset and utilizing a basic CNN model.
* We outlined future directions like creating more diverse datasets, exploring advanced CNN models, and developing a medically applicable model.

We believe this future work will refine our proposed system and improve its effectiveness in lung cancer detection. Overall, we are proud of the work done on this project and enthusiastic about its potential impact on improving lung cancer detection in the future.

REFERENCES

[1] Thandra KC, Barsouk A, Saginala K, Aluru JS, Barsouk A. Epidemiology of lung cancer. Contemp Oncol (Pozn). 2021;25(1):45-52. doi: 10.5114/wo.2021.103829. Epub 2021 Feb 23. PMID: 33911981; PMCID: PMC8063897.
[2] Tasnim Ahmed, Mst Shahnaj Parvin, Mohammad Reduanul Haque, Mohammad Shorif Uddin, et al. Lung cancer detection using CT image based on 3D convolutional neural network. Journal of Computer and Communications, 8(03):35, 2020.
[3] Xuechen Li, Linlin Shen, Xinpeng Xie, Shiyun Huang, Zhien Xie, Xian Hong, and Juan Yu. Multi-resolution convolutional networks for chest X-ray radiograph based lung nodule detection. Artificial Intelligence in Medicine, 103:101744, 2020.
[4] Mehmet Akif Cifci et al. SegChaNet: A novel model for lung cancer segmentation in CT scans. Applied Bionics and Biomechanics, 2022, 2022.
[5] Asghar Ali Shah, Hafiz Abid Mahmood Malik, AbdulHafeez Muhammad, Abdullah Alourani, and Zaeem Arif Butt. Deep learning ensemble 2D CNN approach towards the detection of lung cancer. Scientific Reports, 13(1):2987, 2023.
[6] Vani Rajasekar, MP Vaishnnave, S Premkumar, Velliangiri Sarveshwaran, and V Rangaraaj. Lung cancer disease prediction with CT scan and histopathological images feature analysis using deep learning techniques. Results in Engineering, 18:101111, 2023.
[7] Worku J Sori, Jiang Feng, Arero W Godana, Shaohui Liu, and Demissie J Gelmecha. DFD-Net: lung cancer detection from denoised CT scan image using deep learning. Frontiers of Computer Science, 15:1–13, 2021.
[8] P Mohamed Shakeel, MA Burhanuddin, and Mohammad Ishak Desa. Automatic lung cancer detection from CT image using improved deep neural network and ensemble classifier. Neural Computing and Applications, pages 1–14, 2022.
[9] Iftikhar Naseer, Tehreem Masood, Sheeraz Akram, Arfan Jaffar, Muhammad Rashid, and Muhammad Amjad Iqbal. Lung cancer detection using modified AlexNet architecture and support vector machine. Computers, Materials & Continua, 74(1), 2023.