Lung Cancer Detection and Nodule Type Classification Using Image Processing and Machine Learning
Sadok Elasmi
Cosim Research Labs,
Carthage University,
Higher School of Communication,
Tunisia
[email protected]
Abstract: Lung cancer is one of the most frequent types of cancer worldwide. It is a disease in which abnormal cells grow out of control in the lungs. These cells do not function like normal cells because of deoxyribonucleic acid (DNA) mutations caused by various genetic factors. However, early detection and treatment can reduce cancer-related mortality. In this paper we propose a novel system for cancer detection and classification based on image processing methods and machine learning algorithms. Our system is composed of two main processes: the first is the preprocessing process, which detects features, and the second is the discrimination process, which determines the type of lung nodule (benign or malignant).

Keywords: segmentation, watershed, machine learning, lung cancer detection, feature extraction

I. INTRODUCTION
In 2020, approximately 10 million people worldwide died of cancer, and lung cancer accounted for 2.21 million new cases [1]. Lung cancer results from changes in lung cell behaviour or growth that lead to the development of lung nodules, which fall into two types: non-cancerous (benign) and cancerous (malignant). There are two main categories of lung cancer, non-small cell and small cell lung cancer, representing more than 80% and less than 15% of cases respectively [2].

Effective and early detection of lung cancer is the only way to cure it. However, the diagnosis of lung cancer requires four main steps. The first step is a chest X-ray, which reveals abnormalities but cannot specify the nodule type (malignant or benign). The second is fiberscopy, in which a flexible tube with a light source (fiberscope) is inserted through the nostril into the trachea to check for abnormalities. The third step is a biopsy, which consists of taking samples of tissue that seems abnormal and analyzing them to classify the type of cells. Image processing and machine learning approaches can automatically detect cancer from Computed Tomography (CT) scan images of the lungs. This technology saves time by removing manual steps and improves outcomes at each stage of chemotherapy, radiotherapy, surgery, and immunotherapy. Considerable effort has been dedicated to creating more effective automated systems.

N. Kalaivani et al. [3] developed a deep neural network for detecting lung cancer from CT images. To classify the lung image as normal or malignant, a densely connected convolutional neural network (DenseNet) and an adaptive boosting algorithm were used. This method achieved an accuracy of 90.85%. R. Idrees, M. Abid et al. [4] proposed a segmentation method based on the watershed algorithm controlled by threshold and marker, with a binary classifier for the classification step. The resulting accuracy is 88.5%. An artificial neural network that detects the existence of lung cancer through the symptoms that appear in the body is developed in [5]. The model provides an accuracy of 96.67%. In [6] the authors used a ground-breaking interdisciplinary process that combines metabolomics and machine learning techniques to find biomarkers early in the diagnosis of lung cancer. To distinguish between people who have stage 1 lung cancer and healthy people, the machine learning method Naive Bayes
Authorized licensed use limited to: Angadi Institute of Technology & Management. Downloaded on December 03,2024 at 04:06:41 UTC from IEEE Xplore. Restrictions apply.
C. BAS AND WATERSHED COMBINED ALGORITHMS FOR NODULES DETECTION
The watershed algorithm extracts the layer between the foreground and background layers in the CT image, detects the lung regions by creating markers and labels inside them, then fills all detected regions and extracts the features of the pulmonary regions. Because not all of these regions are tumor nodules, which is an over-segmentation issue, we resolved this confusion by applying the BAS (Beetle Antennae Search) algorithm [11], which takes three features of the region (diameter, circularity and solidity) as parameters to distinguish the pulmonary nodules from the other detected regions.

The BAS technique for nodule discovery is divided into two steps: the beetle behavior for benign nodule detection is shown in Fig. 3 and the beetle behavior for malignant nodule detection is shown in Fig. 4.

Fig. 3. Beetle behavior for benign nodule detection

Fig. 4. Beetle behavior for malignant nodule detection

The beetle takes the parameter value of a region, compares it with the value of the next region, and takes a step in the direction of increasing or decreasing values to reach the goal (the nodule being searched for), where:
d(x): denotes the diameter of region x.
c(x): denotes the circularity of region x.
s(x): denotes the solidity of region x.

D. FEATURES EXTRACTION
It is important to extract the lung nodule attributes because they are needed to classify the nodule type and the cancer stage. These features are:
1) THE NODULE SIZE: The staging of non-small cell lung cancer depends on two thresholds on the size of the tumor: 2 cm and 5 cm are appropriate threshold diameters that define subgroups with significantly different prognoses [12]. Generally, a larger nodule is more suspicious, but there are many exceptions that can be seen from other features. The size-related features that we detected are: area, perimeter, diameter, and volume.
2) THE NODULE TEXTURE: The nodule texture is a very important factor in the classification of cancer grades. The texture-related characteristics that have been detected are: the texture, the average intensity, the minimum intensity, the maximum intensity, and the density per pixel, to ensure that we treat the right region.
3) THE NODULE SHAPE: We detect the nodule shape and localization. In this context we extract the sphericity, which indicates how close the nodule is to a sphere; the compacity, which indicates the amount of space occupied by the nodule compared to the rest of the lungs; and the centroid of each nodule.

E. CLASSIFICATION
This stage involves classifying the cancer type as malignant or benign. During this final step we tested four supervised classification algorithms.
1) SUPPORT VECTOR MACHINES: Support vector machines (SVM) are supervised learning algorithms that can be used when the data has precisely two classes. The two classes of data elements are separated by a hyperplane [13].
2) XGBOOST: XGBoost is a supervised learning algorithm that attempts to accurately predict a target variable by combining an ensemble of estimates from a set of simpler and weaker models. The XGBoost algorithm performs well in machine learning competitions because of its robust handling of a variety of data types, relationships, and distributions, and the variety of hyperparameters that can be fine-tuned [14].
3) RANDOM FOREST: Random forest is a supervised learning algorithm that can be used for both classification and regression [15]. It is also one of the most flexible and easy-to-use algorithms. Random forest builds decision trees on randomly selected data samples, gets a prediction from each tree, and selects the best solution by voting. It also provides a good indicator of feature importance.
4) DECISION TREE: The decision tree is a non-parametric supervised learning algorithm used for classification and regression problems. It is organized hierarchically and has a root node, branches, internal nodes, and leaf nodes [16].

III. RESULTS AND DISCUSSION
This section is composed of two subsections. In the first subsection, we present the results of pre-processing, segmentation and detection using the two datasets mentioned in subsection II-A. The second subsection details the results of the classification process.

A. PRE-PROCESSING, SEGMENTATION AND DETECTION RESULTS
Fig. 5 presents the result of the blue channel extraction, which is a gray image. In our case the blue channel
is equivalent to a gray level. Fig. 6 illustrates the output of the closing operation, where the image was dilated and then eroded.

TABLE I
Nodule size

TABLE II
Nodule texture

TABLE III
Nodule shape (columns: sphericity, compacity, centroid)
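
As a rough illustration of the size, texture and shape descriptors that Tables I to III refer to, the sketch below computes a few of them for one 2D region. It is a minimal sketch and not the implementation used in this work: the function name `region_features`, the list-of-rows input format, the edge-count perimeter, and the use of 2D circularity as a stand-in for sphericity are all assumptions introduced here for illustration.

```python
import math


def region_features(mask, image):
    """Size, shape and intensity descriptors of one region (illustrative only).

    mask  : list of rows of 0/1 flags marking the region (assumed input format)
    image : list of rows of gray levels with the same shape as mask
    """
    h, w = len(mask), len(mask[0])
    pixels = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    area = len(pixels)
    if area == 0:
        raise ValueError("empty region")

    # Perimeter: count pixel edges that face the background or the image border.
    perimeter = 0
    for r, c in pixels:
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                perimeter += 1

    values = [image[r][c] for r, c in pixels]
    return {
        "area": area,                                   # in pixels
        "perimeter": perimeter,                         # in pixel edges
        "diameter": 2.0 * math.sqrt(area / math.pi),    # equivalent-circle diameter
        "circularity": 4.0 * math.pi * area / perimeter ** 2,
        "centroid": (sum(r for r, _ in pixels) / area,
                     sum(c for _, c in pixels) / area),  # (row, column)
        "mean_intensity": sum(values) / area,
        "min_intensity": min(values),
        "max_intensity": max(values),
    }
```

All values are in pixel units; converting them to millimetres would require the pixel spacing of the CT scan, which is not given in the text. Volume and compacity are left out because they need the full 3D stack and the lung area respectively.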
Fig. 6. Image after opening application

Fig. 7 presents the image after applying Otsu thresholding, where the image is clustered into white and black classes (panel A), the foreground (panel B) and background (panel C) separation, and the image after the watershed filling (panel D).

C. CLASSIFICATION RESULTS
Each nodule is characterized by eight features and a single class variable (benign or malignant). We start by splitting the dataset [2] into 70% for the training step and 30% for the test step. Four classification algorithms were compared. The comparison is based on the evaluation metrics explained below.
The confusion matrix is an array of the four different combinations of predicted and real values. Each row displays the real class, while each column displays the total number of nodules predicted for each class.
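
The evaluation protocol of Section III-C (a 70/30 split, then a confusion matrix and the metrics derived from it) can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the code used in the paper: `split_70_30`, `confusion_counts` and `metrics` are hypothetical helper names, malignant is arbitrarily treated as the positive class, and recall, precision and F1-score follow their standard textbook definitions.

```python
import random


def split_70_30(samples, seed=0):
    """Shuffle a copy of the samples and return (train, test) in a 70/30 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def confusion_counts(y_true, y_pred, positive="malignant"):
    """Return (TP, FP, TN, FN), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall and F1-score from the four counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2.0 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

In use, each of the four classifiers would be fitted on the training part, its predictions on the test part passed to `confusion_counts`, and the resulting counts passed to `metrics`, so that all models are compared on the same held-out nodules.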
Fig. 9, Fig. 10, Fig. 11 and Fig. 12:

We also use accuracy, recall, precision and F1-score, which are defined respectively by equations (1)-(4).

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}$$

The results are given in Table IV. From these tables and figures we can clearly see that Random Forest gives better results than SVM, XGBoost and Decision Tree. This is explained by the fact that Random Forest is an ensemble and bagging algorithm, so it is adopted for the classification step.

Fig. 15. The Decision Tree ROC curve

Fig. 16. The XGBoost confusion matrix

TABLE IV
Classification results

Algorithm       Accuracy   F1-score   Recall   Precision
SVM             95.1%      97.41%     97.33%   97.5%
Random Forest   98.6%      99.26%     99.1%    99.43%
XGBoost         98.1%      98.99%     98.79%   99.2%
Decision Tree   95.6%      97.65%     98.76%   96.57%

IV. CONCLUSION
Preprocessing, nodule identification, nodule segmentation, feature extraction, and image classification are all stages of the lung cancer detection method proposed in this paper. Feature extraction takes place after the detection and segmentation of nodules: the features required for classification are recovered from the segmented nodule using feature extraction algorithms. A classifier is then employed to determine whether the nodule is benign or malignant based on the features extracted from it. In future work we will propose a method of cancer state prediction based on reinforcement learning.

References
[1] E. Sebastian, M. Leitzmann, M. Bahls, C. Meisinger, C. Amos, R. Hung, I. Consortium, L. Consortium, A. Teumer, and H. Baurecht, “Physical activity does not lower the risk of lung cancer,” Cancer Research, vol. 80, p. canres.1127.2020, Jul. 2020.
[2] J. Molina, P. Yang, S. Cassivi, S. Schild, and A. Adjei, “Non-small cell lung cancer: Epidemiology, risk factors, treatment, and survivorship,” Mayo Clinic Proceedings, vol. 83, pp. 584–594, Jun. 2008.
[3] N. Kalaivani, N. Manimaran, D. Sophia, and D. Devi, “Deep learning based lung cancer detection and classification,” IOP Conference Series: Materials Science and Engineering, vol. 994, p. 012026, Dec. 2020.
[4] R. Idrees, M. Abid, S. Raza, M. Kashif, M. Waqas, M. Ali, and L. Rehman, “Lung cancer detection using supervised machine learning techniques,” Oct. 2022.
[5] C. Shankara and H. S. A., “Artificial neural network for lung cancer detection using CT images,” International Journal of Health Sciences, pp. 2708–2724, Apr. 2022.
[6] Y. Xie, W.-Y. Meng, R.-Z. Li, Y. Wang, X. Qian, C. Chan, Z.-F. Yu, X.-X. Fan, H. Pan, C. Xie, Q. Wu, P.-Y. Yan, L. Liu, Y.-J. Tang, X.-J. Yao, M.-F. Wang, and L.-H. Leung, “Early lung cancer diagnostic biomarker discovery by machine learning methods,” Translational Oncology, vol. 14, p. 100907, Jan. 2021.
[7] D. Chen, X. Li, and S. Li, “A novel convolutional neural network model based on beetle antennae search optimization algorithm for computerized tomography diagnosis,” IEEE Transactions on Neural Networks and Learning Systems, vol. PP, pp. 1–12, Aug. 2021.
[8] R. Dhiaa and O. Awad, “A comparative analysis study of lung cancer detection and relapse prediction using XGBoost classifier,” IOP Conference Series: Materials Science and Engineering, vol. 1076, p. 012048, Feb. 2021.
[9] J. Ying, T. Yu, J. Liu, D. Huang, H. Yan, and Y. Zhuang, “Clinical comparison of the ‘windowing’ technique and the ‘open book’ technique in Schatzker type II tibial plateau fracture,” Orthopaedic Surgery, vol. 14, Sep. 2022.
[10] H. F. Al-Yasriy, “The IQ-OTH/NCCD lung cancer dataset,” Jan. 2021.
[11] X. Jiang and S. Li, “BAS: Beetle antennae search algorithm for optimization problems,” International Journal of Robotics and Control, vol. 1, Oct. 2017.
[12] C. Christian, S. Erica, and U. Morandi, “The prognostic impact of tumor size in resected stage I non-small cell lung cancer: Evidence for a two thresholds tumor diameters classification,” Lung Cancer (Amsterdam, Netherlands), vol. 54, pp. 185–191, Nov. 2006.
[13] “Sparse and robust SVM classifier for large scale classification,” Applied Intelligence, pp. 1–25, Mar. 2023.
[14] M. Babyomi, O. Olagbaju, and A. Kadiri, “Convolutional XGBoost (C-XGBoost) model for brain tumor detection,” Jan. 2023.
[15] J. Lötsch and B. Mayer, “A biomedical case study showing that tuning random forests can fundamentally change the interpretation of supervised data structure exploration aimed at knowledge discovery,” BioMedInformatics, vol. 2, Oct. 2022.
[16] P. Sarang, Decision Tree: A Supervised Learning Algorithm for Classification, Mar. 2023, pp. 75–96.
[17] D. Chicco and G. Jurman, “The Matthews correlation coefficient (MCC) should replace the ROC AUC as the standard metric for assessing binary classification,” BioData Mining, vol. 16, Feb. 2023.