International Conference on Communication and Signal Processing, July 28 - 30, 2020, India

Lung Diseases Classification based on Machine Learning Algorithms and Performance
Binila Mariyam Boban and Rajesh Kannan Megalingam

Binila Mariyam Boban is with the Department of Electronics and Communication Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, India.
Rajesh Kannan Megalingam is with the Department of Electronics and Communication Engineering, Amrita Vishwa Vidyapeetham, Amritapuri, India.

Abstract—Machine learning (ML) is a significant subset of Artificial Intelligence (AI) that plays a key role in medical diagnosis. The advantage of AI is that it can automatically learn, extract and translate features from data sets such as images, text or video without traditional hand-coded rules. This paper focuses on recognizing and classifying lung diseases with ML algorithms. The data set includes 400 lung CT scan images covering bronchitis, emphysema, pleural effusion, cancer, and normal lungs. The input image is analyzed, categorized and classified using ML algorithms, namely the MLP, KNN and SVM classifiers. After feature extraction, the output is segmented and the accuracy of the classifiers is compared. A CT scan image given directly to a classifier contains irrelevant information, so the Gray Level Co-occurrence Matrix (GLCM) is used to select the most relevant features. The MLP classifier achieves 98% accuracy, the SVM 70.45% and KNN 99.2%. These classifiers will help doctors prescribe the most effective treatment for a patient.

Index Terms—Machine learning (ML), Artificial Intelligence (AI), Gray-Level Co-occurrence Matrix (GLCM), Multilayer perceptron (MLP), K-nearest neighbors (KNN), Support vector machine (SVM)

Reducing the detection period of diseases and improving identification accuracy has become the most important issue in creating reliable and more efficient medical decision support systems (MDSS) that assist the complicated decision process of diagnosis. Medical diagnosis is a complex and fuzzy cognitive process, and soft computing methods such as the ML algorithms MLP, SVM and KNN have shown great promise in the design of MDSS for disease detection.
In medical diagnosis, computed tomography (CT) images are commonly used. Depending on their distinct gray scales, computed tomography images can be used to distinguish many body tissues. Like traditional X-rays, a CT-based medical diagnosis provides multiple images of the inside of the body. The cross-sectional CT scan images cover a variety of body planes, from which a 3D view can be generated. CT scans include high-resolution pictures of the lungs that can be viewed on a PC or printed on film. The lungs are responsible for oxygen supply as well as carbon dioxide exhalation. Many individuals have a smoking habit that leads to infection and biological disorders, which in turn cause pulmonary diseases.
This paper considers four disease types (bronchitis, emphysema, pleural effusion and cancer) as well as normal lung CT scans. Bronchitis is caused by inflammation of the airways between the nose area and the lung tissue, and it can lead to pneumonia. Emphysema is a form of COPD (chronic obstructive pulmonary disease) that damages the air sacs of the lung when germs affect the pleural space; it is triggered by cigarette smoking. Pleural effusion, otherwise referred to as water on the lungs, is due to the accumulation of excess fluid between the pleural layers. It impairs inhalation and exhalation and reduces lung tissue growth. Lung cancer is caused by uncontrolled cell division in the lung and affects breathing.
When the CT scan image itself is used as input, a large number of variables must be handled. A large number of variables increases the computing power and memory required, so feature extraction is used to reduce the information, i.e. the number of variables. The method used for this in this paper is the GLCM (Gray Level Co-occurrence Matrix). The GLCM is a mathematical method for texture analysis that captures the spatial relationship of pixels; it is also known as the gray-level spatial dependence matrix. The method uses properties of texture and color. The texture properties are contrast, correlation, energy and homogeneity. Mean, standard deviation, entropy and RMS are the color properties. The GLCM feature matrix includes these eight features. Classification was performed using MLP (Multilayer Perceptron), SVM (Support Vector Machine) and KNN (k-nearest neighbor) [...] data. After classification the performance of two classifiers are [...]
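To make the eight features concrete, the following is a minimal pure-Python sketch, not the implementation used in this work. It assumes a single horizontal offset of one pixel, a symmetric normalized GLCM and the common textbook definitions of the four texture properties. The function names `glcm`, `texture_features` and `intensity_features` are hypothetical, the toy image is made up, and entropy is taken here as the Shannon entropy of the gray-level histogram.

```python
import math

def glcm(image, levels, dx=1, dy=0):
    """Symmetric, normalized co-occurrence matrix for one pixel offset (assumed: 1 px to the right)."""
    counts = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = image[r][c], image[r2][c2]
                counts[i][j] += 1
                counts[j][i] += 1  # symmetric: count each pair in both directions
    total = sum(map(sum, counts))
    return [[v / total for v in row] for row in counts]

def texture_features(p):
    """Contrast, correlation, energy and homogeneity of a normalized GLCM."""
    n = len(p)
    cells = [(i, j, p[i][j]) for i in range(n) for j in range(n)]
    mu_i = sum(i * v for i, _, v in cells)
    mu_j = sum(j * v for _, j, v in cells)
    sd_i = math.sqrt(sum((i - mu_i) ** 2 * v for i, _, v in cells))
    sd_j = math.sqrt(sum((j - mu_j) ** 2 * v for _, j, v in cells))
    cov = sum((i - mu_i) * (j - mu_j) * v for i, j, v in cells)
    return {
        "contrast": sum((i - j) ** 2 * v for i, j, v in cells),
        # Undefined for a constant image, where both deviations are zero.
        "correlation": cov / (sd_i * sd_j) if sd_i and sd_j else float("nan"),
        "energy": sum(v * v for _, _, v in cells),
        "homogeneity": sum(v / (1 + abs(i - j)) for i, j, v in cells),
    }

def intensity_features(image):
    """Mean, standard deviation, histogram entropy and RMS of the pixel values."""
    pixels = [px for row in image for px in row]
    n = len(pixels)
    mean = sum(pixels) / n
    std = math.sqrt(sum((px - mean) ** 2 for px in pixels) / n)
    hist = {}
    for px in pixels:
        hist[px] = hist.get(px, 0) + 1
    entropy = -sum((k / n) * math.log2(k / n) for k in hist.values())
    rms = math.sqrt(sum(px * px for px in pixels) / n)
    return {"mean": mean, "std": std, "entropy": entropy, "rms": rms}

# Toy 4x4 image with four gray levels (illustrative values only).
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
features = {**texture_features(glcm(image, levels=4)), **intensity_features(image)}
```

The resulting dictionary holds one eight-element feature vector per image, which is far smaller than the image itself. That reduction in the number of variables is the motivation given above.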

The proposed method is carried out in four phases. In the first phase, pre-processing is done with median filters and morphological smoothing. In the second phase, the features are derived from the pre-processed image using the GLCM (Gray-Level Co-occurrence Matrix) method. In the third phase, detection and classification of lung diseases is accomplished using the MLP (Multilayer Perceptron), SVM (Support Vector Machine) and KNN (k-nearest neighbor) classifiers. The final phase is the performance evaluation of the classifiers. Software such as MATLAB or Python can be used to implement these algorithms.
The rest of the paper is organized as follows. Section II and Section III describe the literature review and the methodology of the work. The simulation results are discussed in Section IV. Finally, Section V concludes the paper.

II. LITERATURE REVIEW
This paper [1] discusses the potential of artificial neural networks (ANN) for medical diagnosis and for the prediction of osteoporosis from risk factors. The ANN is developed in tandem with Probabilistic Neural Networks (PNN) based on MLPs with back propagation. In this paper [2], the authors proposed a neural network based on MLP back propagation to predict heart disease. Various multi-layer perceptron training functions are compared and the best one is chosen for training. MLP with the TRAINBR training algorithm gives 96.3% accuracy in heart disease prediction. In this paper [3], the authors developed an artificial neural network with histogram of oriented gradient genomic features for predicting lung cancer. With these features, the ANN provides 95.90% accuracy and a mean square error of 0.0159. In this paper [4], the researchers suggested two forms of ANN to identify and diagnose Parkinson's disease. One is the MLP (Multilayer Perceptron) and the other is the RBF (Radial Basis Function) network. Based on the accuracy comparison, MLP is the best classifier with 93.22% accuracy, while the RBF classifier offers just 86.44% accuracy on the same set of data. This can assist neurologists in their medical diagnosis. In this paper [5], the researchers suggested a diagnostic method to assist doctors in the diagnosis of heart disease based on the patient's clinical conditions, after translating them into a numerical representation. Two classifiers were proposed: the Multi-Layer Perceptron Neural Network (MLP) and the Support Vector Machine (SVM). They considered the classification of two heart diseases and used the collected database to evaluate the performance of the classifiers.
In this paper [6], the authors proposed a Convolutional Neural Network (CNN) for the classification of malignant or benign tumors in the lung. With the CNN as a classifier, the accuracy reached 96%, which is better than the accuracy of the traditional neural classifier. In this paper [7], the authors focus on early lung cancer detection and propose a computational method, i.e. particle swarm optimization (PSO) with a neural network. In this paper [8], the authors concentrated on the detection of lung cancer at an early stage. For the identification, a non-parametric method, the genetic K-Nearest Neighbor (GKNN) algorithm, is suggested. In this method K (50-100) values are chosen for each iteration using genetic algorithms, and the tests show accuracy in the range of 90%. In this paper [9], the researchers introduced a K-nearest neighbors classification to identify and distinguish benign and malignant cancer images. In the classification of benign or malignant tumors, the overall classification accuracy obtained by the classifier is 97%. The learning time of this K-nearest neighbor algorithm is 3 seconds and the nearest neighbor distance is 0.20889. In this paper [10], the authors applied an SVM-based diagnosis of lung cancer. The CLAHE equalization technique improved the contrast of the CT scan image. After that, the method of walk segmentation was implemented. In this paper [11], the authors used median filters in pre-processing to minimize noise without affecting performance. After that, feature extraction was done, the extracted features were selected by the PSO (particle swarm optimization) algorithm, and lung disease classification was performed. In this paper [12], the authors proposed a KNN-based classifier together with a genetic algorithm for heart disease detection. Results were recorded for different values of k.
In this paper [13], features were derived with the GLCM method and the back propagation neural network algorithm was used for the classification of images. The classifier reaches 95% accuracy in the training stage and 81.25% accuracy in the testing stage. This paper [14] explores the use of a neural network to diagnose various patterns of rubella, German measles and chickenpox, based on the skin symptoms. The ANN examines the signs and provides better predictions and credibility than a human doctor. Thus, patients can be monitored entirely on the basis of the signs found for skin problems. In this paper [15], a novel approach is suggested to achieve better classification rates by integrating the predictive T-test and absolute ranking. Appropriate classification methods are also explored using linear SVM, proximal SVM and Newton SVM. A descriptive study of the various extraction techniques is also presented. In this paper [16], the authors describe an image processing technique, fractal image compression, together with its properties and a method to improve its performance.

III. METHODOLOGY
A. Multilayer perceptron (MLP)
A neuron, also known as an artificial neuron, is the basic building block of a neural network such as the MLP. It takes a certain number of weighted input signals and a bias and produces an output based on an activation function, as shown in Fig. 1. A network with 5 inputs has 5 weights that can be adjusted during training.
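As an illustration of the single neuron described above, here is a minimal sketch in Python. The sigmoid activation and the numeric values are assumptions chosen for the example, and `neuron_output` is a hypothetical helper name.

```python
import math

def sigmoid(z):
    """Logistic activation, one of several possible choices."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus the bias, passed through the activation function."""
    if len(inputs) != len(weights):
        raise ValueError("each input needs exactly one weight")
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Five inputs, hence five adjustable weights (illustrative values).
inputs = [0.2, 0.4, 0.1, 0.9, 0.5]
weights = [0.1, 0.3, 0.2, 0.4, 0.05]
y = neuron_output(inputs, weights, bias=0.1)
```

Training changes only `weights` and `bias`; the structure of the computation stays fixed.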


Fig. 1. A single neuron

The main purpose of training is to adjust the model's parameters, i.e. its weights and biases, to minimize the error.

output = Σ (weight × input) + bias (1)

Bias - An extra input to the neuron whose value is always one and which has its own weight. It ensures that the neuron can still activate when all the inputs are zero, and it allows the neuron to shift the decision boundary left or right.
Activation function - It makes the neural network non-linear. Activation functions include tanh, sigmoid, ReLU and SELU. The activation function decides the strength of the output signal and the threshold at which the neuron is activated. It is a mapping between the weighted input and the output.
Input layer - The first layer, which takes the input values. No operations are applied to the input signals, and no weights or biases are applied here.
Hidden layer - Each hidden layer receives data from the neurons of the previous layer (initially the input layer), processes it and passes it to the next layer. As the number of layers increases, finer details can be captured. One hidden layer is one set of neurons arranged vertically. It is called fully connected if every neuron in the hidden layer is connected to every neuron in the next layer.
Output layer - The last layer, which accepts data from the last hidden layer and produces output within the desired range.
Weights - They represent the strength of the connection between the nodes. They are initialized with random values between 0 and 0.5.
Forward propagation - The forward movement of data from the input layer, where no processing is done, to the next layer, where multiplication, addition and activation take place. This is repeated in the following layers until the output layer is reached. The output layer gives the predicted value.
Back propagation - After forward propagation we obtain a predicted value at the output. To find the error, the actual output value is compared with the predicted one, usually through a loss function; their difference is the error. To reduce the error, we calculate the derivative of the error with respect to every weight in the network. The gradient calculation starts from the weights of the last layer and moves backwards until it reaches the first layer. The gradient values, typically scaled by a learning rate, are then subtracted from the current weights and the results become the new weights. The input is then presented again to check whether the error has been reduced. This continues until the error reaches a minimum value.

B. K-nearest neighbors (KNN)
The K-nearest neighbors (KNN) algorithm uses feature similarity to predict the values of new data points, i.e. a new data point is assigned a value based on how closely it matches the points in the training set.

Fig. 2. Flow chart.

The following steps help us to understand its working:
Phase 1: Any algorithm needs a data set, so in the first step of KNN the training and test data are loaded.
Phase 2: Select K (here k=3), i.e. the number of nearest data points to consider. The distance between the test data and each training row is then calculated, with Euclidean distance as the distance metric, and the rows are sorted in ascending order of distance.
Phase 3: Take the top k rows from the sorted list. The most common class among them is the predicted class.

C. Support Vector Machine
Multi-class SVM aims to assign labels to instances using support vector machines, where the labels are drawn from a finite set of several elements. The approach used here is to reduce the single multi-class problem to several binary classification problems via a one-versus-all approach, which creates binary classifiers that each distinguish one label from the rest.
As shown in Fig. 2, the input image (i.e. an RGB image) is first converted into gray format and passed through a median filter to remove noise and for smoothing. The output image is then applied to the GLCM so that certain parameters (contrast, correlation, energy, homogeneity, mean, standard deviation, entropy, RMS) can be extracted. Segmentation is then done to identify the affected area. Finally, the images are passed to the classifier, where the classification takes place. After applying the classification techniques to the same data set, it is found that the KNN classifier has higher accuracy than the simple MLP and SVM classifiers.

When a CT scan image is given as input, it is first converted to a gray image, i.e. hue and saturation are removed, and then passed to a median filter, which removes noise and smooths the image without affecting its sharpness. Fig. 3 shows CT scan images given as input: (a) bronchitis, (b) emphysema, (c) pleural effusion, (d) normal, (e) lung cancer. The first column corresponds to the original image, the second to the gray image and the third to the filtered image.

Fig. 4. Feature matrix.

Fig. 5. Result.

Fig. 4 shows the features extracted using the GLCM function. Only eight features are taken, and these are given to the classifier to identify and correctly classify the disease.
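The pre-processing stage described here (gray conversion followed by median filtering) can be sketched in pure Python as below. This is an illustration under stated assumptions rather than the code used in this work: the luminance weights are the ITU-R BT.601 values, the window is 3 × 3, border pixels use whatever neighbors exist, and the helper names `rgb_to_gray` and `median_filter` are hypothetical.

```python
from statistics import median_low

def rgb_to_gray(rgb_image):
    """Luminance-weighted gray conversion (BT.601 weights, an assumed choice)."""
    return [[round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
            for row in rgb_image]

def median_filter(gray, size=3):
    """Replace each pixel by the median of its size x size neighborhood."""
    half = size // 2
    rows, cols = len(gray), len(gray[0])
    filtered = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [gray[rr][cc]
                      for rr in range(max(0, r - half), min(rows, r + half + 1))
                      for cc in range(max(0, c - half), min(cols, c + half + 1))]
            # median_low keeps integer pixel values for even-sized border windows
            out_row.append(median_low(window))
        filtered.append(out_row)
    return filtered

# An isolated bright pixel (impulse noise) on a flat background is removed.
noisy = [[100] * 5 for _ in range(5)]
noisy[2][2] = 255
cleaned = median_filter(noisy)

# A sharp vertical edge passes through unchanged.
edge = [[0, 0, 0, 255, 255, 255] for _ in range(5)]
edge_after = median_filter(edge)
```

The two toy images show the property the text relies on: the median filter suppresses isolated noise while leaving a step edge intact, which a mean filter would blur.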

Fig. 6. Segmentation of tumor.

A message box is used to display the result. If the image belongs to the bronchitis class, the message "predicted disease is bronchitis" is displayed on the screen, as in Fig. 5. The other four classes are reported in the same way; only one result is shown in this paper.
A. Segmentation
Fig. 6 shows the segmentation of the tumor region using a mask. When the predicted disease is cancer, segmentation takes place and the tumor is extracted. The histogram shows the number of pixels at each intensity value. The method used for segmentation is binarization along with masking.
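A minimal sketch of threshold-based segmentation with a mask is given below. The threshold value, the toy image and the helper names are assumptions made for illustration; the paper does not state how its threshold was chosen.

```python
def binarize(gray, threshold):
    """Binary mask: 1 where the pixel is brighter than the threshold, else 0."""
    return [[1 if px > threshold else 0 for px in row] for row in gray]

def apply_mask(gray, mask):
    """Keep the pixels selected by the mask and set all others to 0."""
    return [[px if keep else 0 for px, keep in zip(g_row, m_row)]
            for g_row, m_row in zip(gray, mask)]

def intensity_histogram(gray, levels=256):
    """Number of pixels at each intensity value."""
    counts = [0] * levels
    for row in gray:
        for px in row:
            counts[px] += 1
    return counts

# Toy gray image: a bright 2x2 region on a dark background (illustrative values).
gray = [[10, 12, 11, 10],
        [11, 200, 210, 12],
        [10, 205, 220, 11],
        [12, 11, 10, 10]]
mask = binarize(gray, threshold=128)  # 128 is an assumed threshold
region = apply_mask(gray, mask)
hist = intensity_histogram(gray)
```

In practice the histogram is a natural place to look for a threshold, since a bright region and a dark background appear as two separate groups of intensity values.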
B. Percentage of classification in MLP
The confusion matrix shown in Fig. 7 and the confusion chart are used to evaluate the performance of the classifiers. From the chart or matrix we can find the percentage of correct and incorrect classifications in each class, as shown in Fig. 8.
The accuracy obtained from the MLP is 98.7%. The percentage is displayed in the command window using this formula:

Accuracy = (tp + tn) / (tp + tn + fp + fn) (2)

tp - True positive (a sample of the class is correctly predicted as that class).
tn - True negative (a sample of another class is correctly predicted as not belonging to the class).
fp - False positive (a sample of another class is wrongly predicted as the class).
fn - False negative (a sample of the class is wrongly predicted as another class).

Fig. 3. CT scan image.

The output is obtained as probabilities, as shown in Fig. 9. In this figure the third row value is high, i.e. the CT scan image belongs to the cancer class.
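As a sketch of how the class probabilities, the confusion matrix and Eq. (2) fit together, the following Python example picks the most probable class, accumulates a confusion matrix and evaluates Eq. (2) for one class. The probability values, labels and function names are made up for illustration and are not results from this paper.

```python
def predict_class(probabilities):
    """Index of the largest probability in the output vector."""
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

def confusion_matrix(actual, predicted, n_classes):
    """matrix[a][p] counts samples of actual class a that were predicted as class p."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

def class_counts(matrix, k):
    """tp, tn, fp, fn for class k, treating all other classes as negative."""
    total = sum(map(sum, matrix))
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    tn = total - tp - fn - fp
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Eq. (2)."""
    return (tp + tn) / (tp + tn + fp + fn)

def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal of the confusion matrix."""
    return sum(matrix[i][i] for i in range(len(matrix))) / sum(map(sum, matrix))

# Hypothetical five-class output vector and a small three-class label set.
winner = predict_class([0.01, 0.02, 0.90, 0.05, 0.02])
matrix = confusion_matrix([0, 0, 1, 1, 2], [0, 1, 1, 1, 2], n_classes=3)
acc_class_0 = accuracy(*class_counts(matrix, 0))
```

For a multi-class problem Eq. (2) is applied per class, while the single percentage reported for a classifier corresponds to `overall_accuracy`.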

Fig. 10. Accuracy of KNN

C. K nearest neighbors (KNN)

KNN is less complex than MLP because it has no activation function, weights or bias. It is mainly used for supervised machine learning, and this paper is based on supervised ML. Fig. 10 shows the accuracy of KNN. The confusion matrix shows that KNN makes fewer incorrect classifications than MLP, so the accuracy increases to 99.6%.
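To illustrate why KNN needs no trained weights, here is a minimal sketch of the three phases from Section III-B with k = 3 and Euclidean distance. The two-element feature vectors and the labels are toy values (the paper uses eight features per image), and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter

def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training rows closest to the query."""
    # Distance from the query to every training row, sorted in ascending order.
    ranked = sorted(
        (math.dist(row, query), label)
        for row, label in zip(train_features, train_labels)
    )
    # Labels of the top k rows; the most common one is the prediction.
    nearest = [label for _, label in ranked[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Phase 1: load the training data (toy values).
train_features = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
train_labels = ["normal", "normal", "normal", "cancer", "cancer", "cancer"]

# Phases 2 and 3: rank by distance and vote.
prediction = knn_predict(train_features, train_labels, [1.5, 1.5])
```

All of the work happens at prediction time: the training set is stored as is, and each query is compared against every stored row.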

Fig. 7. MLP Confusion matrix

Fig. 8. MLP classification percentage

Fig. 11. Accuracy of SVM

D. Support Vector Machine (SVM)

Fig. 11 shows the accuracy of the SVM classifier on the same data set. The classification accuracy is only 70.45%, lower than that of both the MLP and KNN classifiers.
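Section III-C describes reducing the multi-class problem to several binary problems with a one-versus-all approach. The sketch below illustrates that reduction only. As a simplified stand-in for a full SVM solver, each binary classifier is a linear model trained with sub-gradient steps on the hinge loss, with the regularization term omitted; the data, learning rate and function names are assumptions for the example.

```python
def train_binary(samples, targets, lr=0.01, epochs=1000):
    """Linear classifier trained on the hinge loss; targets are +1 or -1."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in zip(samples, targets):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # sample lies inside the margin: take a step
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:  # every sample already has margin >= 1
            break
    return w, b

def train_one_vs_rest(samples, labels):
    """One binary classifier per class: that class (+1) against all others (-1)."""
    models = {}
    for cls in sorted(set(labels)):
        targets = [1 if label == cls else -1 for label in labels]
        models[cls] = train_binary(samples, targets)
    return models

def predict(models, x):
    """Pick the class whose binary classifier gives the highest score."""
    def score(cls):
        w, b = models[cls]
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(models, key=score)

# Three well-separated toy classes in two dimensions (illustrative values).
samples = [[0, 0], [1, 0], [0, 1], [5, 0], [6, 0], [5, 1], [0, 5], [0, 6], [1, 5]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
models = train_one_vs_rest(samples, labels)
predicted = [predict(models, x) for x in samples]
```

With five classes, as in this paper, the same reduction would train five binary classifiers. The toy data here are linearly separable, which real GLCM feature vectors need not be.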

In this project, a CT scan image of the lungs in JPG format is given as input to the program. After pre-processing, i.e. conversion to a gray image and noise removal, the image is passed to feature extraction using GLCM. This yields a matrix that contains only the needed features, which reduces the number of variables and thereby saves time and memory. The matrix is then given to the trained classifiers and their performances are compared. Segmentation is done using masking and thresholding. The comparison shows that KNN (k-nearest neighbor) is more accurate than the MLP (multilayer perceptron) and SVM (support vector machine) classifiers.

Fig. 9. Probability matrix. This is the MLP output obtained from the output layer. The five values correspond to the five classes, i.e. bronchitis, emphysema, pleural effusion, cancer and normal, respectively.

ACKNOWLEDGMENT

I am thankful for the wonderful opportunity offered to me by Amrita University and the Humanitarian Lab to turn my thoughts into a specific project. I am thankful to Dr. Rajesh Kannan Megalingam for the space and infrastructure required for completing this project. I appreciate everyone who helped me get this project done in good time.

REFERENCES

[1] Dimitrios H. Mantzaris, George C. Anastassopoulos, Dimitrios K. Lymberopoulos, "Medical disease prediction using artificial neural networks", 2008 8th IEEE International Conference on BioInformatics and BioEngineering, Oct. 2008, DOI: 10.1109/BIBE.2008.4696782.
[2] Durairaj M, Revathi V, "Prediction of heart disease using back propagation mlp algorithm", International Journal of Scientific Technology Research, volume 4, issue 08, August 2015.
[3] Emmanuel Adetiba, Oludayo O. Olugbara, "Lung cancer prediction using neural network ensemble with histogram of oriented gradient genomic features", The Scientific World Journal, Volume 2015, Article ID 786013.
[4] Farhad Soleimanian Gharehchopogh, Peyman Mohammadi, "A case study of parkinson's disease diagnosis using artificial neural networks", International Journal of Computer Applications (0975 – 8887), vol. 73, no. 19, July 2013.
[5] Tabreer T. Hasan, Manal H. Jasim, Ivan A. Hashim, "Heart disease diagnosis system based on multi-layer perceptron neural network and support vector machine", International Journal of Current Engineering and Technology, vol. 7, Oct. 2017.
[6] S. Sasikala, M. Bharathi, B. R. Sowmiya, "Lung cancer detection and classification using deep cnn", International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-2S, December 2018.
[7] Dr. S. Senthil, B. Ayshwarya, "Lung cancer prediction using feed forward back propagation neural networks with optimal features", International Journal of Applied Engineering Research, ISSN 0973-4562, vol. 13, no. 1, pp. 318-325, 2018.
[8] P. Bhuvaneswari, Dr. A. Brintha Therese, "Detection of cancer in lung with k-nn classification using genetic algorithm", 2nd International Conference on Nanomaterials and Technologies, 2014.
[9] P. Thamilselvan, Dr. J. G. R. Sathiaseelan, "An enhanced k nearest neighbor method to detecting and classifying mri lung cancer images for large amount data", International Journal of Applied Engineering Research, ISSN 0973-4562, vol. 11, no. 6, pp. 4223-4229, 2016.
[10] R Sathishkumar, Kalaiarasan K, Prabakaran A, Aravind M, "Detection of lung cancer using svm classifier and knn algorithm", International Journal of Scientific Research and Review, Volume 8, Issue 3, 2019.
[11] Tejinder Kaur, Neelakshi Gupta, "A new algorithm for classification of lung diseases", International Journal of Advances in Electronics and Computer Science, ISSN: 2393-2835, Volume-2, Issue-9, Sept. 2015.
[12] M. Akhil Jabbar, B. L. Deekshatulu, Priti Chandra, "Classification of heart disease using k-nearest neighbor and genetic algorithm", International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA), 2013.
[13] Kusworo Adi, Catur Edi Widodo, Aris Puji Widodo, Rahmat Gernowo, Adi Pamungkas, Rizky Ayomi Syifa, "Detection lung cancer using gray level co-occurrence matrix (glcm) and backpropagation neural network classification", Journal of Engineering Science and Technology Review, March 2018.
[14] Monisha M, Suresh A, Rashmi M R, "Artificial Intelligence Based Skin Classification Using GMM", Journal of Medical Systems, vol. 43, no. 1, p. 3, 2018.
[15] Arunkumar Chinnaswamy, Ramakrishnan S, "Two Step Feature Extraction Method for Microarray Cancer Data using Support Vector Machines", International Journal of Computer Applications, vol. 85, no. 8, pp. 34-42, 2014.
[16] Loganathan D, Amudha J, Mehata K. M, "Classification and feature vector techniques to improve fractal image coding", IEEE Region 10 Annual International Conference, Proceedings/TENCON, Volume 4, Bangalore, pp. 1503-1507, 2003.


