Detecting Pneumonia Using Vision Transformer and Comparing With Other Techniques
Abstract—Pneumonia is life-threatening. It is critical for infants, young children, elders, and people with health problems or weakened immune systems. Moreover, someone who has been infected with coronavirus can develop severe Pneumonia in both lungs. The best way to detect Pneumonia is via chest X-ray, and a radiologist is required to examine it. An automated pneumonia detection system would be helpful for early detection in far-off places. The proposed method makes it possible to train ViT models with enhanced performance. Nowadays, ViT is an alternative to CNN in the field of computer vision. In this research, three models were constructed: a convolutional neural network (CNN), VGG16, and a Vision Transformer (ViT). Statistical results are obtained after the comparison of all three models. Results indicate that ViT can identify Pneumonia with an accuracy of 96.45%, and that it can also be used to recognize other lung-related diseases. All the models were trained and tested on a dataset that contains standard chest X-rays and pneumonia chest X-rays.
X-ray images can be difficult to read and blurry, which may give misleading results.
Computer vision techniques are the most precise ways of examining chest X-ray images to detect Pneumonia. CNNs have dominated computer vision tasks so far. An image is based on the idea that one pixel is dependent on its neighboring pixels, and the next pixel is dependent on its immediately adjacent pixels (be it color, brightness, contrast, and so on).
Different researchers have developed many algorithms to recognize Pneumonia using different approaches, such as "CheXNet" [3], a CNN of 121 layers. Other approaches include single-shot detectors and squeeze-and-excitation deep CNNs [4]. Some researchers tried to combine and utilize pretrained CNN models like AlexNet, VGG-19, etc.
every patch of the image, adds position embeddings, and feeds the sequences to an encoder.
An Intel(R) Core(TM) i5-8300H CPU @ 2.30GHz is used.
Basic CNN, VGG-16, and Vision Transformer results are compared to find the best approach to detect Pneumonia.
A. VISION TRANSFORMER APPROACH
Transformers have become a standard tool in Natural Language Processing (NLP) tasks. In computer vision, the Vision Transformer (ViT) implements a pure transformer model without convolutional blocks [17]. CNNs have been used in image recognition for many years. However, CNNs have some drawbacks: a CNN is significantly slower due to operations such as max pooling, and a ConvNet requires a large dataset to process and train the neural network [18].
The transformer blocks are used with 8 heads in the multi-head attention layer. The Adam optimizer has been used. The parameters passed to the ViT model, with their values, are given below:
image_size – 250 – size of image in pixels.
patch_size – 50 – size of each patch in pixels.
channels – 3 – number of channels in image.
num_classes – 2 – number of classes to classify.
dim – 64 – last dimension of output tensor.
depths – 6 – total number of transformer blocks.
heads – 8 – total number of heads in the multi-head attention layer.
mlp_dim – 128 – dimension of the MLP layer.
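The parameter values listed above determine how many patches (tokens) each image yields and how long each flattened patch is. The following is a minimal sketch of that arithmetic in plain Python. The dictionary layout and the helper name `patch_stats` are illustrative assumptions and are not taken from the authors' implementation.

```python
# Hyperparameters copied from the parameter list above.
vit_config = {
    "image_size": 250,
    "patch_size": 50,
    "channels": 3,
    "num_classes": 2,
    "dim": 64,
    "depths": 6,
    "heads": 8,
    "mlp_dim": 128,
}


def patch_stats(cfg):
    """Derive patch-grid quantities implied by a ViT configuration.

    Hypothetical helper for illustration only.
    """
    if cfg["image_size"] % cfg["patch_size"] != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = cfg["image_size"] // cfg["patch_size"]
    return {
        # Number of patches along one edge of the square image.
        "patches_per_side": patches_per_side,
        # Total number of patches (tokens) per image.
        "num_patches": patches_per_side ** 2,
        # Length of one patch after flattening height x width x channels.
        "flattened_patch_length": cfg["channels"] * cfg["patch_size"] ** 2,
    }


stats = patch_stats(vit_config)
print(stats)
```

With these values the image is split into a 5×5 grid, giving 25 patches, which agrees with the 25 patches of 50×50 pixels described in the text. Each patch flattens to 50 × 50 × 3 = 7500 values before it is projected to the embedding dimension `dim`.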
The model is proposed based on the Vision Transformer (ViT) approach to classify Pneumonia using a dataset of chest X-rays. Recently, the Vision Transformer [17] has been preferred over CNNs for large-scale computer vision datasets. The transformer architecture with self-attention allows ViT to integrate information across the entire image.
Fig 2. Vision Transformer Architecture
The image is broken into equal-sized patches. These small patches are also known as tokens. The series of patches is reshaped by flattening each 2D patch into a vector. Then a position embedding is added to the patch embedding to preserve positional information. The transformer encoder [18] consists of multi-head self-attention layers. The embedded patches pass through layer normalization into multi-head attention, and then layer normalization is applied again before the multi-layer perceptron blocks.
All the X-ray images were resized to 250×250 pixels, and each image was then broken down into 25 patches of 50×50 pixels each. These patches were then flattened and vectorized to feed into the transformer encoder network, which adds positional encoding to the image vectors. A total of 6 transformer blocks are used.
B. CONVOLUTIONAL NEURAL NETWORKS APPROACH
The Convolutional Neural Networks approach consists of multiple hidden layers which extract information from an image. The ReLU (Rectified Linear Unit) activation layer has been used. ReLU outputs 0 for negative values and passes positive values through unchanged, which introduces non-linearity to the network. Various filters are used to identify different parts of the images, and pooling layers reduce the size of the resulting feature maps. Then flattening is used to create a linear vector. The flattened vector is fed as input to the fully connected layers, which classify the image [7, 21, 22, 23, 26].
C. VGG-16 APPROACH
The data are pre-processed by resizing all images to 224×224 pixels and then rescaling the pixel values by 1/255. Then, a horizontal flip is applied to a randomly selected half of the pictures, followed by random shear transformations and zooming. The softmax function is used as the activation function in the output layer to predict a multinomial probability distribution. The Sequential method is used to create a sequential model, meaning that all the layers of the model are arranged in sequence. Here, a VGG-16 model pre-trained on the "ImageNet" dataset is used [16]. Then all the layers of the pre-trained model are frozen during training. The Adam optimizer and learning rate decay are used to optimize the learning process.
V. RESULT AND DISCUSSION
The results of three different approaches are observed and evaluated. The best result is then identified by comparing them.
Experiment No. 1
A convolutional neural network is used in which four max-pooling layers, one softmax layer, and two rectified linear units (ReLU) were
applied for better computational time and to improve classification through non-linearity. In this model, an accuracy of 90.52% is achieved [23, 24, 25].
Experiment No. 2
The VGG-16 CNN architecture is used. The images have been rescaled by dividing the pixel values by 255. To maintain a uniform image size, the images are resized to shape (224, 224). ResNet50 model is used here as a base model for transfer learning. This model achieves an accuracy of 93.30%.
Experiment No. 3
The Vision Transformer (ViT) is used to extract features using attention layers, and the model is trained on a dataset with two classes, where each image is broken into 25 patches that are then sequenced as linear embeddings. The accuracy achieved with this technique is 96.45%.
After analyzing the results of all three models, it is found that ViT is better than the CNN models. The primary problem with CNNs is that they fail to encode spatial relationships: a CNN does not consider the positions of detected features relative to each other. In the Vision Transformer, self-attention is used: the image is divided into small trainable patches, importance is given to each part of the image, and the patches are fed into the Transformer alongside their positions. ViT implements a pure transformer model without the need for convolutional blocks. ViT is also more effective at complex tasks. Due to self-attention, the transformer architecture can compute in parallel to minimize computing time [19]. It can concurrently extract all the information needed from the input and its inter-relations, compared to CNNs. The results of all three approaches can be seen in Table 1, and the accuracies are compared in the graph shown in Fig 3.
Table 1. Results of all three approaches
Table 1 shows that ViT gave an acceptable accuracy of 0.9645 on training data and an accuracy of 0.8638 on unseen validation data, with a fairly small value of the cost function/loss. In the case of VGG-16, accuracy on training and validation data is acceptable at 0.933 and 0.8597, respectively, but the value of the cost function/loss is considerably high and highly unacceptable. In the CNN approach, the accuracy and validation accuracy are also acceptable. The value of its cost function/loss is higher than that of ViT and lower than that of VGG-16. The cross-entropy loss function has been used in all three approaches to calculate the loss.
A. ADVANTAGES OF VIT OVER CNN
ViT divides an image into fixed-size patches, whereas a CNN uses pixel arrays. In ViT, patches are embedded according to their respective positions, which leads to better results in feature extraction. ViT also surpasses CNN in computational efficiency and accuracy.
Fig 3. Graph showing comparison between accuracies of three approaches (x-axis: Proposed Approaches, namely ViT, CNN, and VGG-16; y-axis: Accuracy in percent %; series: Accuracy and Val_Accuracy)
B. LIMITATIONS
CNNs depend on the size of their filters and the number of convolutional layers used. Increasing the value of these hyper-parameters increases the complexity of the model, which can produce vanishing gradients or even models that are impossible to train. Residual connections and dilated convolutions have also been used to improve the receptive fields of these models, but the way convolutions operate over their inputs always presents limitations and trade-offs on the receptive field that can be captured.
Unlike CNN, ViT works on self-attention and does not contain a convolutional layer. The performance of ViTs saturates fast when scaled to be deeper. More specifically, it is empirically observed that the attention collapse issue causes such scaling difficulty: as the Transformer goes deeper, the attention maps gradually become similar and even much the same after specific layers [19].
C. FUTURE WORK
In the field of chest X-ray diagnosis, not much work has been done with the Vision Transformer. In the future, it can be beneficial for the detection of other diseases such as Pleural thickening, Covid-19, Edema, Effusion, Emphysema or Cystic Fibrosis, and even Cancer.
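The discussion of Table 1 reports accuracy and cross-entropy loss separately, and notes that a model can have acceptable accuracy yet a high loss. The sketch below illustrates how the two metrics are computed for a two-class problem and why they can diverge. The helper names and the toy probabilities are made up for illustration. They are not the authors' code or data.

```python
import math


def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy given per-class probabilities and integer labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        # Penalize by the negative log-probability assigned to the true class.
        total += -math.log(max(p[y], eps))
    return total / len(labels)


def accuracy(probs, labels):
    """Fraction of samples whose most probable class equals the label."""
    correct = 0
    for p, y in zip(probs, labels):
        predicted = max(range(len(p)), key=lambda k: p[k])
        if predicted == y:
            correct += 1
    return correct / len(labels)


# Toy labels: 0 = normal, 1 = pneumonia (illustrative values only).
labels = [0, 1, 1, 0]

# Both hypothetical models misclassify only the last sample.
mildly_wrong = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.4, 0.6]]
confidently_wrong = [[0.9, 0.1], [0.1, 0.9], [0.1, 0.9], [0.001, 0.999]]

acc_a, loss_a = accuracy(mildly_wrong, labels), cross_entropy(mildly_wrong, labels)
acc_b, loss_b = accuracy(confidently_wrong, labels), cross_entropy(confidently_wrong, labels)
print(acc_a, loss_a)
print(acc_b, loss_b)
```

Both toy models have the same accuracy, but the second has a much larger loss because cross-entropy penalizes confident mistakes heavily. This is one reason accuracy and loss should be read together when comparing models.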
VI. CONCLUSION
In this paper, a Vision Transformer model is proposed for the early detection of Pneumonia to reduce the time-consuming chest X-ray evaluation process in far-off places. It can be seen that this Vision Transformer approach gives a comparable accuracy of 96.45% on the chest X-ray data. Specialized radiology is the most crucial point for adequate diagnosis of any chest disease. The proposed model can help prevent unfortunate outcomes in such far-off places.
REFERENCES
[1] Theodoratou E, Zhang JSF, Kolcic I, Davis AM, Bhopal S, et al. (2011) Estimating Pneumonia Deaths of Post-Neonatal Children in Countries of Low or No Death Certification in 2008. PLoS ONE 6(9): e25095. doi: 10.1371/journal.pone.0025095
[2] WHO URL: https://www.who.int/news-room/fact-sheets/detail/pneumonia
[3] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, Matthew P. Lungren, Andrew Y. Ng, "CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning"; Cornell University arXiv:1711.05225, 14 Nov 2017.
[4] Tatiana Gabruseva, Dmytro Poplavskiy, Alexandr Kalinin, "Deep Learning for Automatic Pneumonia Detection"; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 350-351.
[5] R. Nijhawan, R. Verma, Ayushi, S. Bhushan, R. Dua and A. Mittal, "An Integrated Deep Learning Framework Approach for Nail Disease Identification," 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2017, pp. 197-202, doi: 10.1109/SITIS.2017.42.
[6] D. Chandra, S. S. Rawat and R. Nijhawan, "A Machine Learning Based Approach for Progeria Syndrome Detection," 2019 4th International Conference on Information Systems and Computer Networks (ISCON), 2019, pp. 74-78, doi: 10.1109/ISCON47742.2019.9036229.
[7] D. Varshni, K. Thakral, L. Agarwal, R. Nijhawan and A. Mittal, "Pneumonia Detection Using CNN based Feature Extraction," 2019 IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT), 2019, pp. 1-7, doi: 10.1109/ICECCT.2019.8869364.
[8] Chouhan, V., Singh, S.K., Khamparia, A., Gupta, D., Tiwari, P., Moreira, C., Damaševičius, R. and De Albuquerque, V.H.C., 2020. A novel transfer learning based approach for pneumonia detection in chest X-ray images. Applied Sciences, 10(2), p.559.
[9] H. Salehinejad, S. Valaee, T. Dowdell, E. Colak and J. Barfett, "Generalization of Deep Neural Networks for Chest Pathology Classification in X-Rays Using Generative Adversarial Networks," 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 990-994, doi: 10.1109/ICASSP.2018.8461430.
[10] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Ronald M. Summers; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 9049-9058.
[11] Toğaçar, M., et al. "A deep feature learning model for pneumonia detection applying a combination of mRMR feature selection and machine learning models." Irbm 41.4 (2020): 212-222.
[12] Gutta, Jignesh Chowdary, G. Suganya, M. Premalatha, and K. Karunamurthy. "Class dependency based learning using Bi-LSTM coupled with the transfer learning of VGG16 for the diagnosis of Tuberculosis from chest x-rays." medRxiv (2021).
[13] Guan, Qing et al. “Deep convolutional neural network VGG-16 model for differential diagnosing of papillary thyroid carcinomas in cytological images: a pilot study.” Journal of Cancer vol. 10,20 4876-4882. 27 Aug. 2019, doi:10.7150/jca.28769
[14] Zhao, Defang, Dandan Zhu, Jianwei Lu, Ye Luo, and Guokai Zhang. "Synthetic medical images using F&BGAN for improved lung nodules classification by multi-scale VGG16." Symmetry 10, no. 10 (2018): 519.
[15] Daniel Kermany; Kang Zhang; Michael Goldbaum, (2018) "Labeled Optical Coherence Tomography (OCT) and Chest X-Ray Images for Classification", Mendeley Data, v2 published 06-01-2018.
[16] Deng, J., Dong, W., Socher, R., Li, L.-J., Kai Li, & Li Fei-Fei. (2009). ImageNet: A large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition. doi:10.1109/cvpr.2009.5206848
[17] Alexey Dosovitskiy∗, Lucas Beyer∗, Alexander Kolesnikov∗, Dirk Weissenborn∗, Xiaohua Zhai∗, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby∗, ∗ equal technical contribution, † equal advising, Google Research.
[18] Vaswani, Ashish, et al. "Attention is all you need." arXiv preprint arXiv:1706.03762 (2017).
[19] Zhou, Daquan, et al. "Deepvit: Towards deeper vision transformer." arXiv preprint arXiv:2103.11886 (2021).
[20] Rezvantalab, Amirreza, Samir Mitha, and April Khademi. "Alzheimer's Disease Classification using Vision Transformers." (2021).
[21] R. Nijhawan, H. Sharma, H. Sahni and A. Batra, "A Deep Learning Hybrid CNN Framework Approach for Vegetation Cover Mapping Using Deep Features," 2017 13th International Conference on Signal-Image Technology & Internet-Based Systems (SITIS), 2017, pp. 192-196, doi: 10.1109/SITIS.2017.41.
[22] Nijhawan, R., Das, J., & Raman, B. (2018). A hybrid of deep learning and hand-crafted features-based approach for snow cover mapping. International Journal of Remote Sensing, 1–15. doi:10.1080/01431161.2018.1519277
[23] Nijhawan, R., Joshi, D., Narang, N., Mittal, A., & Mittal, A. (2018). A Futuristic Deep Learning Framework Approach for Land Use-Land Cover Classification Using Remote Sensing Imagery. Advances in Intelligent Systems and Computing, 87–96. doi:10.1007/978-981-13-0680-8_9
[24] S. Gupta, A. Panwar, S. Goel, A. Mittal, R. Nijhawan and A. K. Singh, "Classification of Lesions in Retinal Fundus Images for Diabetic Retinopathy Using Transfer Learning," 2019 International Conference on Information Technology (ICIT), 2019, pp. 342-347, doi: 10.1109/ICIT48102.2019.00067.
[25] Y. K. Arora, A. Tandon and R. Nijhawan, "Hybrid Computational Intelligence Technique: Eczema Detection," TENCON 2019 - 2019 IEEE Region 10 Conference (TENCON), 2019, pp. 2472-2474, doi: 10.1109/TENCON.2019.8929578.
[26] S. S. Rawat, K. S. Rawat, V. Rawat and R. Nijhawan, "Neural Networks based Hand-crafted genetic learning approach to simulate Space Mario Game," 2020 International Conference on Smart Electronics and Communication (ICOSEC), 2020, pp. 1-5, doi: 10.1109/ICOSEC49089.2020.9215233.
[27] Sathish, Prof. “Adaptive Shape based Interactive Approach to Segmentation for Nodule in Lung CT Scans.” Journal of Soft Computing Paradigm 2, no. 4: 216-225.