Applsci 13 05521 v2
Comparing Vision Transformers and Convolutional Neural
Networks for Image Classification: A Literature Review
José Maurício , Inês Domingues and Jorge Bernardino *
Polytechnic of Coimbra, Coimbra Institute of Engineering (ISEC), Rua Pedro Nunes, 3030-199 Coimbra, Portugal;
[email protected] (J.M.); [email protected] (I.D.)
* Correspondence: [email protected]
Citation: Maurício, J.; Domingues, I.; Bernardino, J. Comparing Vision Transformers and
Convolutional Neural Networks for Image Classification: A Literature Review.
Appl. Sci. 2023, 13, 5521. https://
Academic Editor: Yu-Dong Zhang
Received: 20 March 2023
Revised: 19 April 2023
Accepted: 26 April 2023
Published: 28 April 2023
Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an
open access article distributed under the terms and conditions of the Creative Commons
Attribution (CC BY) license (https://
1. Introduction
Nowadays, transformers have become the preferred models for performing Natural
Language Processing (NLP) tasks. They offer scalability and computational efficiency,
allowing models to be trained with more than a hundred billion parameters without
saturating model performance. Inspired by the success of the transformers applied to
NLP and assuming that the self-attention mechanism could also be beneficial for image
classification tasks, it was proposed to use the same architecture, with few modifications, to
perform image classification [1]. The authors' proposal was an architecture, called Vision
Transformers (ViT), which consists of breaking the image into 2D patches and providing
this linear sequence of patches as input to the model. Figure 1 presents the architecture
proposed by the authors.
In contrast to this deep learning architecture, there is another very popular tool for
processing large volumes of data called Convolutional Neural Networks (CNN). The CNN
is an architecture that consists of multiple layers and has demonstrated good performance
in various computer vision tasks such as object detection or image segmentation, as well as
NLP problems [2]. The typical CNN architecture starts with convolutional layers that pass
kernels or filters across the image, from left to right, extracting computationally
interpretable features. The first layer extracts low-level features (e.g., colours, gradient
orientation, edges, etc.), and subsequent layers extract high-level features. Next, the pooling
layers reduce the information extracted by the convolutional layers, preserving the most
important features. Finally, the fully-connected layers are fed with the flattened output
of the convolutional and pooling layers and perform the classification. Its architecture is
shown in Figure 2.
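To make the patching step of ViT concrete, the following minimal sketch splits a small image into non-overlapping square patches and flattens each one, producing the linear sequence of patches mentioned above. The function name `image_to_patches`, the toy 4 x 4 image and the patch size of 2 are illustrative assumptions; the learned linear projection and the position embeddings used in [1] are deliberately omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4 x 4 single-channel image with pixel values 0..15 (hypothetical data).
toy_image = [[4 * r + c for c in range(4)] for r in range(4)]
sequence = image_to_patches(toy_image, patch_size=2)
print(len(sequence), sequence[0])
```

In a full ViT each flattened patch would then be projected to the model dimension before entering the transformer encoder; the sketch stops at the sequence itself.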
Figure 1. Example of an architecture of the ViT, based on [1].
Figure 2. Example of an architecture of a CNN, based on [2].
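The convolution, pooling and flattening stages described in the Introduction can be sketched in a few lines. This is only an illustration under stated assumptions, not the architecture of [2]: the helper names `conv2d_valid` and `max_pool2d`, the hand-written edge kernel and the toy image are all hypothetical, and a real CNN learns its kernels during training.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1).

    As in most deep learning libraries, this is cross-correlation:
    the kernel is not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool2d(feature_map, size):
    """Keep the maximum of each non-overlapping size x size window."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Toy 5 x 5 image: dark left part (0), bright right part (1).
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1, 1], [-1, 1]]

features = conv2d_valid(image, edge_kernel)      # 4 x 4 feature map
pooled = max_pool2d(features, size=2)            # 2 x 2 summary
flattened = [v for row in pooled for v in row]   # input to a dense layer
print(features[0], pooled, flattened)
```

The feature map is non-zero only where the edge lies, and pooling keeps that response while halving the resolution, which is the "preserving the most important features" step in miniature.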
With the increasing interest in Vision Transformers as a novel architecture for image
recognition tasks, and the established success of CNNs in image classification, this work
aims to review the state of the art in comparing Vision Transformers (ViT) and Convolu-
tional Neural Networks (CNN) for image classification. Transformers offer advantages
such as the ability to model long-range dependencies, adapt to different input sizes, and the
potential for parallel processing, making them suitable for image tasks. However, Vision
Transformers also face challenges such as computational complexity, model size, scalability
Appl. Sci. 2023, 13, 5521 3 of 17
2. Research Methodology
The purpose of a literature review is to evaluate, analyse and summarize the existing
literature on a specific research topic, in order to facilitate the emergence of theoretical
frameworks [3]. In this literature review, the aim is to synthesize the knowledge base,
critically evaluate the methods used and analyse the results obtained in order to identify
the shortcomings and improve the two aforementioned deep learning architectures for
image classification. The methodology for conducting this literature review is based on the
guidelines presented in [3,4].
2.5. Results
After applying the inclusion and exclusion criteria to the papers obtained in each
of the electronic databases, seventeen (17) papers were selected for the literature review.
Table 3 lists all the papers selected for this work, the year of publication and the type
of publication.
Figure 3 shows the distribution of the selected papers by year of publication.
Figure 3. Distribution of the selected studies by years.
Figure 4 shows the distribution of the selected studies by application area. In the
figure, most of the papers are generic in their application area. In these papers without a
specific application area, the authors try to better understand the characteristics of the
two architectures. For example, between CNNs and ViTs, the authors have tried to
understand which of the architectures is more transferable, whether architectures based on
transformers are more robust than CNNs, and whether the ViT will be able to see the same
information as CNN with a different architecture. Within the health domain, some studies
have been developed in different areas that show that ViT can be better than CNNs. The
figure also shows that some work has been done, albeit to a lesser extent, in other
application areas. Agriculture stands out with two papers comparing ViTs with CNNs.
Figure 4. Distribution of the selected studies by application area.
3. Findings
An overview of the studies selected through the research methodology is shown in
Table 4. This information summarizes the authors' approach, the findings, and other
architectures that were used to build a comparative study. Therefore, to address the research
questions, this section will offer an overview of the data found in the collected papers.
In the study developed in [12], the authors aimed to compare the two architectures
(i.e., ViT and CNN) and to create a hybrid model that corresponded to the
combination of the two. The experiment was conducted using the ImageNet dataset and
perturbations were applied to the dataset images. It was concluded that ViT can perform
better and be more resilient on images with natural or adverse disturbances than CNN. It
was also found in this work that the combination of the two architectures results in a 10%
improvement in accuracy (Acc).
The work done in [14] aimed to compare Vision Transformers (ViT) with Convolutional
Neural Networks (CNN) for digital holography, where the goal was to reconstruct
amplitude and phase by extracting the distance of the object from the hologram. In this
work, DenseNet201, DenseNet169, EfficientNetB4, EfficientNetB7, ViT-B/16, ViT-B/32 and
ViT-L/16 architectures were compared with a total of 3400 images. They were divided into
four datasets: original images with or without filters, and negative images with or without
filters. The authors concluded that ViT, despite having an accuracy similar to that of CNN,
was more robust because, due to the self-attention mechanism, it can learn the entire
hologram rather than a specific area.
The authors in [7] studied the performance of ViT in comparison with other architec-
tures to detect pneumonia, through chest X-ray images. Therefore, a ViT model, a CNN
network developed by the authors and the VGG-16 network were used for the study which
focussed on a dataset with 5856 images. After performing the experiments, the authors
concluded that ViT was better than CNN with 96.45% accuracy, 86.38% validation accuracy,
10.8% loss and 18.25% validation loss. In this work, it was highlighted that ViT has a
self-attention mechanism that allows splitting the image into small patches that are
trainable, so that each part of the image can be assigned its own importance. However, the
attention mechanism, as opposed to the convolutional layers, makes ViT’s performance
saturate quickly when the goal is scalability.
In the study [21], the goal was to compare ViT with state-of-the-art CNN networks to
classify UAV images to monitor crops and weeds. The authors compared the influence
of the size of the training dataset on the performance of the architectures and found
that ViT performed better with fewer images than CNN networks in terms of F1-Score.
They concluded that ViT-B/16 was the best model to do crop and weed monitoring. In
comparison with CNN networks, ViT could better learn the patterns of images in small
datasets due to the self-attention mechanism.
In the scope of lung diseases, the authors in [11] investigated the performance of
ViT models to automatically classify emphysema subtypes through Computed Tomog-
raphy (CT) images in comparison with CNN networks. In this study, they performed a
comparative study between the two architectures using a dataset collected by the authors
(3192 patches) and a public dataset of 168 patches taken from 115 HRCT slides. In addition
to this, they also verified the importance of pre-trained models. They concluded that
ViT failed to generalize when trained with fewer images, with an accuracy of 91.27% on the
training set but only 70.59% on the test set.
In the work in [9], a comparison between state-of-the-art CNNs and ViT models
for breast ultrasound image classification was developed. The study was performed
with two different datasets: the first containing 780 images and the second containing
163 images. The following architectures were selected for the study: ResNet50, VGG-16,
Inception, NASNET, ViT-S/32, ViT-B/32, ViT-Ti/16, R + ViT-Ti/16 and R26 + ViT-S/16. ViT
models were found to perform better than CNN networks for image classification. The
authors also highlighted that ViT models could perform better when they were trained
with a small dataset because, via the attention mechanism, it was possible to collect more
information from different patches, instead of collecting information from the image as a whole.
Benz et al. [5] compared ViT models with the MLP-Mixer architecture and with
CNNs. The goal was to evaluate which architecture was more robust in image classification.
The study consisted of generating perturbations and adversarial examples in the images and
understanding which of the architectures was most robust. However, this study did not
aim to analyse the causes. The authors concluded that ViTs were more robust than
CNNs to adversarial attacks and that, from a features perspective, CNN networks were more
sensitive to high-frequency features. It was also described that the shift-invariance property
of convolutional layers may be at the origin of the lack of robustness of the network in the
classification of images that have been transformed.
The authors in [15] performed an analysis between ViT and CNN models aimed at
detecting deepfake images. The experiment consisted in using the ForgeryNet dataset with
2.9 million images and 220 thousand video clips, together with three different image manip-
ulation techniques, where they tried to train the models with real and manipulated images.
By training the ViT-B model and the EfficientNetV2 network the authors demonstrated that
the CNN network could generalize better and obtain higher training accuracy. However,
ViT could have better generalization, reducing the bias in the identification of anomalies
introduced by one or more different manipulation techniques.
Chao Xin et al. [17] aimed to compare their ViT model with CNN networks and with
another ViT model to perform image classification to detect skin cancer. The experiment
conducted by the authors used the public HAM10000 dataset of dermatoscopic skin cancer
images and a clinical dataset collected through dermoscopy. In this study, multi-scale
images and an overlapping sliding window were used to serialize the images. They
also used contrastive learning to increase the similarity of samples with the same label and
reduce the similarity of samples with different labels. Thus, the ViT model developed was
better for skin cancer classification using these techniques. The authors also demonstrated
the effectiveness of balancing the dataset on the model performance, but they did not present
the F1-Score values before the dataset was balanced to verify the improvement.
The authors in [19] aimed to study if ViT models could be an alternative to CNNs
in time-critical applications. That is, for edge computing instances and IoT networks,
applications using deep learning models consume multiple computational resources. The
experiment used pre-trained networks such as ResNet152, DenseNet201, InceptionV3,
and SL-ViT with three different datasets in the scope of images, audio, and video. They
concluded that the ViT model introduced less overhead and performed better than the
architectures used. It was also shown that increasing the kernel size of convolutional layers
and using dilated convolutions caused a reduction in the accuracy of a CNN network.
In a study carried out in [20], the authors tried to find in ViTs an alternative solution to
CNN networks for asphalt and concrete crack detection. The authors concluded that ViTs,
due to the self-attention mechanism, had better performance in crack detection images
with intense noise. CNN networks in the same images suffered from a high false
negative rate, as well as the presence of biases in image classification.
Haolan Wang in [16] aimed to analyse eight different Vision Transformers and compare
them with the performance of a CNN network with and without pre-trained
parameters to perform traffic sign recognition in autonomous driving systems. In this
study, three different datasets with images of real-world traffic signs were used. This
allowed the author to conclude that the pre-trained DenseNet161 network had a higher
accuracy than the ViT models for traffic sign recognition. However, it was found in
this work that ViT models performed better than the DenseNet161 network without the
pre-trained parameters. From this work, it was also possible to conclude that the ViT
models with a total number of parameters equal to or greater than the CNN networks, used
during the experiment, had a shorter training time.
The work done in [13] compared CNN networks with Vision Transformers models for
the classification of Diabetic Foot Ulcer images. For the study, the authors decided to use the
following architectures: Big Image Transfer (BiT), EfficientNet, ViT-base and Data-efficient
Image Transformers (DeIT) upon a dataset composed of 15,683 images. A further aim of
this study was to compare the performance of deep learning models using Stochastic Gradient
Descent (SGD) [22] with Sharpness-Aware Minimization (SAM) [23,24]. These two tools are
optimizers that seek to minimize the value of the loss function, improving the generalization
ability of the model. However, SAM minimizes both the value of the loss function and its
sharpness, looking for parameters whose neighbourhood has a low loss. Therefore,
this work concluded that the SAM optimizer produced an improvement in the values of
F1-Score, AUC, Recall and Precision in all the architectures used. However, the authors did
not present the training and test values that would allow the improvement in the
generalization of the models to be evaluated. The BiT-ResNetX50 model with the SAM optimizer
obtained the best performance for the classification of Diabetic Foot Ulcer images with
F1-Score = 57.71%, AUC = 87.68%, Recall = 61.88%, and Precision = 57.74%.
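The difference between the two optimizers can be illustrated on a toy problem. The sketch below follows the commonly described two-step SAM update (perturb the weights towards higher loss within a radius, then descend using the gradient found there); it is not the training code of [13], and the quadratic loss, the learning rate `lr`, the radius `rho` and the function names are assumptions chosen for illustration.

```python
import math


def loss(w):
    """Toy quadratic loss with one steep and one flat direction (an assumption)."""
    return 10.0 * w[0] ** 2 + 0.1 * w[1] ** 2


def grad(w):
    """Analytic gradient of the toy loss."""
    return [20.0 * w[0], 0.2 * w[1]]


def sgd_step(w, lr):
    """Plain gradient descent step: move against the gradient at w."""
    g = grad(w)
    return [wi - lr * gi for wi, gi in zip(w, g)]


def sam_step(w, lr, rho):
    """One SAM-style step.

    1. Move to the approximate worst-case point within radius rho.
    2. Use the gradient computed there to update the original weights.
    """
    g = grad(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # avoid division by zero
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    g_adv = grad(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


w_sgd = w_sam = [1.0, 1.0]
for _ in range(50):
    w_sgd = sgd_step(w_sgd, lr=0.01)
    w_sam = sam_step(w_sam, lr=0.01, rho=0.05)
print(loss(w_sgd), loss(w_sam))
```

With `rho` set to zero the SAM step reduces to the plain gradient step, which makes explicit that the extra cost of SAM is the second gradient evaluation at the perturbed weights.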
The authors in [18] performed a comparative study between ViT models and CNN
networks used in the state of the art with a model developed by them, where they combined
CNN and transformers to perform insect pest recognition to protect agriculture worldwide.
This study involved three public datasets: the IP102 dataset, the D0 dataset and Li’s dataset.
The algorithm created by the authors consisted of using the sequence of inputs formed by
the CNN feature maps to make the model more efficient, and a flexible attention-based
classification head was implemented to use the spatial information. Comparing the results
obtained, the proposed model obtained a better performance in insect pest recognition with
an accuracy of 74.897%. This work demonstrated that fine-tuning worked better on Vision
Transformers than CNN, but on the other hand, this caused the number of parameters,
the size, and the inference time of the model to increase significantly with respect to CNN
networks. Through their experiments, the authors also demonstrated the advantage of
using decoder layers in the proposed model to perform image classification. The greater
the number of decoder layers, the greater the accuracy value of the model. However, this
increase in the number of decoder layers increased the number of parameters, the size, and
the inference time of the model. In other words, the architecture consumes far more
computational resources to process the images, a cost that the gain in accuracy from a few
additional layers may not compensate for. In this study, the increase from one layer to
three decoder layers represented an increase of only 0.478% in the accuracy value.
Several authors [6,8,10] went deeper into the investigation and aimed to understand
how the learning process of Vision Transformers works, whether ViT could be more
transferable, and whether the transformer-based architecture is more robust than CNNs.
In this sense, the authors in [8] intended to analyse the internal representations of ViT
and CNN structures in image classification benchmarks and found differences between
them. One of the differences was that ViT has greater similarity between high and low
layers, while CNN architecture needs more low layers to compute similar representations in
smaller datasets. This is due to the self-attention layers implemented in ViT, which allow
it to aggregate information from other spatial locations, vastly different from the fixed
receptive field sizes in CNNs. They also observed that ViTs, in the lower self-attention
layers, can access information from local heads (small distances) and global heads (large
distances), whereas CNNs only have access to local information in the lower layers. On the
other hand, the authors
in [10] systematically analysed the transfer learning capacity in the two architectures. The
study was conducted by comparing the performance of the two architectures on single-
task and multi-task learning problems, using the ImageNet dataset. Through this study,
the authors concluded that the transformer-based architecture contained more transferable
representations compared to convolutional networks for fine-tuning, presenting better
performance and robustness in multi-task learning problems.
In another study carried out in [6], the goal was to verify whether ViTs were more robust
than CNNs, as the most recent studies have shown. The authors developed their work comparing
the robustness of the two architectures using two different types of perturbations:
adversarial samples, which consist in evaluating the robustness of deep learning
architectures in images with human-caused perturbations (i.e., data augmentation), and
out-of-distribution samples, which consist in evaluating the robustness of the architectures
in benchmarks of classification images. Through this experiment, it was demonstrated that by
replacing the
activation function ReLU by the activation function of transformer-based architecture (i.e.,
GELU) the CNN network was more robust than ViT in adversarial samples. In this study,
it was also demonstrated that CNN networks were more robust than ViT in patch-based
attacks. However, the authors concluded that the self-attention mechanism was the key to
the robustness of the transformer-based architecture in most of the experiments performed.
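For readers unfamiliar with the two activation functions mentioned above, the following sketch compares ReLU with the exact form of GELU, x times the standard normal cumulative distribution function. It only illustrates the functions themselves and makes no claim about the experimental setup of [6]; the sample inputs are arbitrary.

```python
import math


def relu(x):
    """Rectified linear unit: zero for negative inputs, identity otherwise."""
    return max(0.0, x)


def gelu(x):
    """Exact GELU: x multiplied by the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# Arbitrary sample points around zero, where the two functions differ most.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.4f}  gelu={gelu(x):.4f}")
```

Unlike ReLU, GELU is smooth and lets small negative inputs pass with a reduced weight instead of cutting them to zero.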
Table 4. Cont.
4. Discussion
The results can be summarized as follows. In [12], ViTs were found to perform
better and be more resilient to images with natural or adverse disturbances compared
to CNNs. Another study [14] concluded that ViTs are more robust in digital holography
because they can access the entire hologram rather than just a specific area, giving them an
advantage. ViTs have also been found to outperform CNNs in detecting pneumonia in chest
X-ray images [7] and in classifying UAV images for crop and weed monitoring with small
datasets [21]. However, it has been noted that ViT performance may saturate if scalability
is the goal [7]. In a study on classifying emphysema subtypes in CT images [11], ViTs were
found to struggle with generalization when trained on fewer images. Nevertheless, ViTs
were found to outperform CNNs in breast ultrasound image classification, especially with
small datasets [9]. Another study [5] found that ViTs are more robust to adversarial attacks
and that CNNs are more sensitive to high-frequency features. The authors in [15] found that
CNNs had higher training accuracy and better generalization, but ViTs showed potential to
reduce bias in anomaly detection. In [17], the authors claimed that the ViT model showed
better performance for skin cancer classification. ViTs have also been shown to introduce
less overhead and perform better for time-critical applications in edge computing and IoT
networks [19]. In [20], the authors investigated the use of ViTs for asphalt and concrete
crack detection and found that ViTs performed better due to the self-attention mechanism,
especially in images with intense noise and biases. Wang [16] found that a pre-trained CNN
network had higher accuracy, but the ViT models performed better than the non-pre-trained
CNN network and had a shorter training time. The authors in [13] used several models for
diabetic foot ulcer image classification and compared SGD and SAM optimizers, concluding
that the SAM optimizer improved several evaluation metrics. In [18], the authors showed
that fine-tuning performed better on ViT models than CNNs for insect pest recognition.
Therefore, based on the information gathered from the selected papers, we attempt to
answer the research questions posed in Section 1:
RQ1—Can the ViT architecture have a better performance than the CNN architecture,
regardless of the characteristics of the dataset?
The literature review shows that ViT in image processing can be more efficient in
smaller datasets due to the increase of relations created between images through the
self-attention mechanism. However, it is also shown that a ViT trained with little data will
have less generalization ability and worse performance compared to CNNs.
RQ2—What influences the CNNs that do not allow them to perform as well as the ViTs?
Shift-invariance is a limitation of CNNs that prevents the architecture from achieving a
satisfactory performance, because the introduction of noise in the input images makes the
architecture unable to get the maximum information from the central pixels. However, the
authors in [27] propose the addition of an anti-aliasing filter which combines blurring with
subsampling in the Convolutional, MaxPooling and AveragePooling layers, and demonstrate
through their experiment that the application of this filter yields a greater
generalization capacity and an increase in the accuracy of CNNs. Furthermore, increasing
the kernel size in convolutional layers and using dilated convolution have been shown to be
limitations that deteriorate the performance of CNNs against ViTs.
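The idea of combining blurring with subsampling can be shown on a one-dimensional signal. In the sketch below, ordinary max pooling is written as a dense maximum followed by subsampling, and the anti-aliased variant inserts a low-pass filter between the two steps. The periodic toy signal, the 3-tap kernel (0.25, 0.5, 0.25) and the function names are assumptions made for illustration; the source does not specify the filter used in [27].

```python
def max_dense(signal):
    """Window-2 max with stride 1 on a periodic signal (no subsampling yet)."""
    n = len(signal)
    return [max(signal[i], signal[(i + 1) % n]) for i in range(n)]


def blur(signal, kernel=(0.25, 0.5, 0.25)):
    """Low-pass filter a periodic signal with a centred 3-tap kernel."""
    n = len(signal)
    return [
        sum(k * signal[(i + offset) % n]
            for offset, k in zip((-1, 0, 1), kernel))
        for i in range(n)
    ]


def max_pool(signal):
    """Ordinary max pooling: dense max followed directly by subsampling."""
    return max_dense(signal)[::2]


def blur_pool(signal):
    """Anti-aliased variant: dense max, then blur, then subsample."""
    return blur(max_dense(signal))[::2]


def largest_change(a, b):
    """Largest element-wise difference between two pooled outputs."""
    return max(abs(x - y) for x, y in zip(a, b))


signal = [0, 0, 1, 1, 0, 0, 1, 1]
shifted = signal[1:] + signal[:1]  # same signal moved by one sample

print(max_pool(signal), max_pool(shifted))
print(blur_pool(signal), blur_pool(shifted))
print(largest_change(max_pool(signal), max_pool(shifted)),
      largest_change(blur_pool(signal), blur_pool(shifted)))
```

On this toy input a one-sample shift changes the plain max-pooled output by up to 1, whereas the blurred version changes by at most 0.25, which is the kind of stability to small shifts that the anti-aliasing filter is meant to provide.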
RQ3—How does the Multi-Head Attention mechanism, which is a key component of ViTs,
influence the performance of these models in image classification?
The Attention mechanism is described as the mapping of a query and a set of key-value
pairs to an output, the output being a weighted sum of the values, in which the weight
assigned to each value is calculated by a compatibility function of the query with the
corresponding key. Instead of performing a single attention function, the Multi-Head
Attention mechanism performs several attention functions on different projections of the
queries, keys and values [28]. This
mechanism improves the ViT architecture because it allows it to extract more information
from each pixel of the images that have been placed inside the embedding. In addition, this
mechanism can have better performance if the images have more secondary elements that
illustrate the central element. And since this mechanism performs several computations in
parallel, it reduces the computational cost [29].
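The description of attention above can be written out directly. The following sketch implements scaled dot-product attention, using the scaled dot product as the compatibility function, and a simplified multi-head variant. It is an illustration under stated assumptions rather than the implementation of [28]: each head here works on a contiguous slice of the embedding instead of a learned projection, the output projection is omitted, and the toy patch embeddings are arbitrary.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention.

    For each query, the compatibility with every key is a scaled dot
    product; softmax turns these scores into weights, and the output is
    the weighted sum of the value vectors.
    """
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


def multi_head_attention(tokens, num_heads):
    """Self-attention run independently on num_heads slices of each token.

    Simplification: each head sees a contiguous slice of the embedding
    instead of a learned projection of it.
    """
    head_dim = len(tokens[0]) // num_heads
    head_outputs = []
    for h in range(num_heads):
        part = [t[h * head_dim:(h + 1) * head_dim] for t in tokens]
        head_outputs.append(attention(part, part, part))
    # Concatenate the heads back into one vector per token.
    return [
        [x for head in head_outputs for x in head[i]]
        for i in range(len(tokens))
    ]


# Three toy patch embeddings of dimension 4 (values are arbitrary).
patches = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
out = multi_head_attention(patches, num_heads=2)
print(len(out), len(out[0]))
```

Because the heads do not depend on each other, the loop over `h` could run in parallel, which is the property the paragraph above refers to.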
Overall, ViTs have shown promising performance compared to CNNs in various
applications, but there are limitations and factors that can affect their performance, such as
dataset size, scalability, and pre-training accuracy.
5. Threats to Validity
This section discusses internal and external validity threats. It examines the validity of
the entire process performed in this study and how the results of this study can be
replicated in other future experiments.
In this literature review, different search strings were used in each of the selected data
sources, resulting in different results from each source. This approach may introduce a
bias into the validation of the study, as it makes it difficult to draw conclusions about the
diversity of studies obtained by replicating the same search. In addition, the maturity of
the work was identified as an internal threat to validity, as the ViT architecture is relatively
new and only a limited number of research projects have been conducted using it. In order
to draw more comprehensive conclusions about the robustness of ViT compared to CNN, it
is imperative that this architecture is further disseminated and deployed, thereby making
more research available for analysis.
In addition to these threats, this study did not use methods that would allow the results
obtained in the selected papers to be analysed quantitatively and qualitatively. This may
bias the validity of this review in demonstrating which of the deep learning architectures is
more efficient in image processing.
The findings obtained in this study could be replicated in other future research in
image classification. However, the results obtained may not be the same as those described
by the selected papers because it has been proven that for different problems and different
methodologies used, the results are different. In addition, the authors do not describe
in sufficient detail all the methodologies they used, nor the conditions under which the
experiment was performed.
6.1. Strengths
Both CNNs and ViTs have their own advantages, and some common ones. This
section will explore these in more detail, including considerations on the Datasets, Ro-
bustness, Performance optimization, Evaluation, Explainability and Interpretability, and
6.1.2. Robustness
CNNs are inherently translation-invariant, making them robust to small changes in
object position or orientation within an image. The main advantage of ViTs is their ability
6.1.4. Evaluation
CNNs have been widely used in image classification tasks for many years, resulting in
well-established benchmarks and evaluation metrics that allow meaningful comparison
and evaluation of model performance. The standardized evaluation protocols, such as cross-
validation or hold-out validation, which provide a consistent framework for evaluating
and comparing model performance across different datasets and tasks, are applicable for
both architectures.
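The two evaluation protocols named above differ only in how the sample indices are partitioned. The sketch below shows both, under simplifying assumptions: folds are contiguous, no shuffling or stratification is applied, and the function names `k_fold_indices` and `hold_out_indices` are hypothetical.

```python
def k_fold_indices(num_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Folds are contiguous and differ in size by at most one sample.
    Shuffling and stratification are deliberately left out.
    """
    fold_sizes = [
        num_samples // k + (1 if i < num_samples % k else 0)
        for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, num_samples))
        yield train, test
        start += size


def hold_out_indices(num_samples, test_fraction):
    """Single split: the last test_fraction of the samples is held out."""
    cut = num_samples - int(round(num_samples * test_fraction))
    return list(range(cut)), list(range(cut, num_samples))


folds = list(k_fold_indices(num_samples=10, k=3))
for train, test in folds:
    print(len(train), test)
print(hold_out_indices(10, 0.2))
```

In cross-validation every sample is used for testing exactly once, so the same index generator can drive the evaluation of a CNN and a ViT on identical splits.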
6.1.6. Architecture
CNNs have a wide range of established architecture variants, such as VGG, ResNet,
and Inception, with proven effectiveness in various image classification tasks. These
architectures are well-tested and widely used in the deep learning community. ViTs can be
easily modified to accommodate different input sizes, patch sizes, and depth, providing
flexibility in architecture design and optimization.
6.2. Limitations
Despite their many advantages and the breakthroughs made over the years, there are
still some drawbacks to the architectures studied. This section focuses on these.
6.2.2. Robustness
CNNs may struggle to capture long-range contextual information, as they focus
primarily on local feature extraction, which may limit the ability to understand global
context, leading to reduced robustness in tasks that require global context, such as scene
understanding, image captioning or fine-grained recognition.
Both architectures can be prone to overfitting, especially when the training data is
limited or noisy, which can lead to reduced robustness to input data outside the training
distribution. Adversarial attacks can also pose a challenge to the robustness of both
architectures. In particular, ViTs do not have an inherent spatial inductive bias like CNNs,
which are specifically designed to exploit the spatial locality of images. This can make them
more vulnerable to certain types of adversarial attacks that rely on spatial information,
such as spatially transformed adversarial examples.
6.2.4. Evaluation
As mentioned above, CNNs are primarily designed for local feature extraction and
may struggle to capture long-range contextual dependencies, which can limit the eval-
uation performance in tasks that require understanding of global context or long-term
dependencies. ViTs are relatively newer than CNNs, and as such, may lack well-established
benchmarks or evaluation metrics for specific tasks or datasets, which can make perfor-
mance evaluation difficult and less standardized.
difficult to interpret the interactions between different attention heads and to understand
the reasoning behind model predictions.
6.2.6. Architecture
CNNs typically have fixed, predefined model architectures, which may limit the
flexibility to adapt to specific task requirements or incorporate domain-specific knowledge,
potentially affecting performance optimization. For ViTs, the availability of established
architecture variants is still limited, which may require more experimentation and exploration
to find optimal architectures for specific tasks.
6.3.2. Robustness
As documented in [5], deep learning models are typically vulnerable to adversarial
attacks, where small perturbations to an input image can cause the model to misclassify it.
Future research should focus on developing architectures that are more robust to adversarial
attacks (for example by further augmenting the robustness of ViTs), as well as exploring
ways to detect and defend against these attacks.
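As an illustration of how a small perturbation can flip a prediction, the following minimal sketch applies one step of the fast gradient sign method (FGSM), a common baseline attack, to a toy logistic classifier. FGSM is used here only as an example and is not claimed to be the attack evaluated in [5]; the weights, input, and epsilon are hypothetical values chosen for the demonstration.

```python
import math


def predict(weights, bias, x):
    """Probability of class 1 under a linear (logistic) classifier."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def fgsm(weights, bias, x, label, epsilon):
    """One FGSM step: shift every feature by epsilon in the direction
    that increases the cross-entropy loss for the true label."""
    p = predict(weights, bias, x)
    # For logistic regression, d(loss)/d(x_i) = (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    return [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]


# Hypothetical toy model and input (not taken from the reviewed studies).
weights, bias = [2.0, -3.0, 1.0, 1.5], 0.0
x, label = [0.6, 0.4, 0.5, 0.5], 1

x_adv = fgsm(weights, bias, x, label, epsilon=0.2)
print(predict(weights, bias, x) > 0.5)      # True: clean input is classified as class 1
print(predict(weights, bias, x_adv) > 0.5)  # False: perturbed input is classified as class 0
```

Each feature moves by at most 0.2, yet the logit drops by epsilon times the sum of the absolute weights, which is enough to cross the decision boundary in this toy setting.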
Beyond that, most studies (such as the ones reviewed in this work) have focused on the
performance of deep learning architectures on image classification tasks, but there are many
other image processing tasks (such as object detection, segmentation, and captioning) that
could benefit from the use of these architectures. Future research should further explore
the effectiveness of these architectures on these tasks.
6.3.4. Evaluation
The adequacy of the metrics for the task and problem at hand is another suggested
line of future research. Most studies have used standard performance metrics (such as
accuracy and F1-score) to evaluate the performance of deep learning architectures. Future
research should consider using more diverse metrics that better capture the strengths and
weaknesses of different architectures.
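To illustrate why a single metric can hide weaknesses, the sketch below computes accuracy and per-class F1-score on a hypothetical, imbalanced set of labels. The data are invented for the example and do not come from the reviewed studies.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_per_class(y_true, y_pred):
    """F1-score of each class, treating that class as the positive one."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Hypothetical imbalanced test set: nine negatives, one positive,
# and a classifier that always predicts the majority class.
y_true = [0] * 9 + [1]
y_pred = [0] * 10

print(accuracy(y_true, y_pred))      # 0.9
print(f1_per_class(y_true, y_pred))  # class 1 scores 0.0
print(macro_f1(y_true, y_pred))      # about 0.47
```

In this example the accuracy looks high while the minority class is never detected, which is the kind of behaviour that task-appropriate metrics are meant to expose.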
6.3.6. Architecture
Future investigations should study the impact of the MLP-Mixer deep learning architecture
in image processing, in particular which characteristics allow it to outperform CNNs
while still falling short of the performance obtained by the ViT
architecture [5]. Future research should also focus on developing novel architectures that
can achieve high performance with fewer parameters or that are more efficient in terms of
computation and memory usage.
7. Conclusions
This work has reviewed recent studies in image processing to provide more information
about the performance of the two architectures and what distinguishes them. A
common feature across all papers is that transformer-based architectures, or the combination
of ViTs with CNNs, achieve better accuracy than CNNs alone. It has also been
shown that this new architecture, even with hyperparameter fine-tuning, can be lighter
than a CNN, consuming fewer computational resources and taking less training time, as
demonstrated in [16,19].
In summary, the ViT architecture is more robust than CNNs for images that
have noise or are augmented. It performs better than CNNs because the
self-attention mechanism makes the overall image information accessible from
the highest to the lowest layers [12]. On the other hand, CNNs can generalize better with
smaller datasets and get better accuracy than ViTs, but in contrast, ViTs have the advantage
of learning information better with fewer images. This is because the images are divided
into small patches, so there is a greater diversity of relationships between them.
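The patch division mentioned above can be sketched as follows. This is a minimal illustration on a nested-list image rather than the implementation of [1]; the 4x4 toy image and the 224x224 input size are assumptions made for the example, while the 16x16 patch size follows the title of [1].

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch tiles, each flattened in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    tiles = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


# Toy 4x4 image whose pixel value encodes its position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
patches = patchify(image, 2)
print(len(patches))  # 4
print(patches[0])    # [0, 1, 4, 5]

# Assumed 224x224 input with 16x16 patches.
num_patches = (224 // 16) ** 2
print(num_patches)       # 196
print(num_patches ** 2)  # 38416 patch-to-patch pairs per attention head
```

Under these assumptions a single image yields 196 patches and 38,416 pairwise patch relationships that self-attention can weigh, which gives a sense of the diversity of relationships referred to above.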
Author Contributions: Conceptualization, J.B.; Methodology, J.M. and J.B.; Software, J.M.; Validation,
J.M., I.D. and J.B.; Formal analysis, J.M., I.D. and J.B.; Investigation, J.M.; Resources, J.M.; Data
curation, J.M.; Writing—original draft preparation, J.M.; Writing—review and editing, J.M., I.D. and
J.B.; Supervision, J.B. and I.D.; Project administration, J.B. and I.D.; Funding acquisition, J.B. All
authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Data Availability Statement: Data sharing is not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.
1. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly,
S.; et al. An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [CrossRef]
2. Saha, S. A Comprehensive Guide to Convolutional Neural Networks—The ELI5 Way. Available online: https:// (accessed on
8 January 2023).
3. Snyder, H. Literature Review as a Research Methodology: An Overview and Guidelines. J. Bus. Res. 2019, 104, 333–339. [CrossRef]
4. Matloob, F.; Ghazal, T.M.; Taleb, N.; Aftab, S.; Ahmad, M.; Khan, M.A.; Abbas, S.; Soomro, T.R. Software Defect Prediction Using
Ensemble Learning: A Systematic Literature Review. IEEE Access 2021, 9, 98754–98771. [CrossRef]
5. Benz, P.; Ham, S.; Zhang, C.; Karjauv, A.; Kweon, I.S. Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer
to CNNs. arXiv 2021, arXiv:2110.02797. [CrossRef]
6. Bai, Y.; Mei, J.; Yuille, A.; Xie, C. Are Transformers More Robust Than CNNs? arXiv 2021, arXiv:2111.05464. [CrossRef]
7. Tyagi, K.; Pathak, G.; Nijhawan, R.; Mittal, A. Detecting Pneumonia Using Vision Transformer and Comparing with Other
Techniques. In Proceedings of the 2021 5th International Conference on Electronics, Communication and Aerospace Technology
(ICECA), IEEE, Coimbatore, India, 2 December 2021; pp. 12–16.
8. Raghu, M.; Unterthiner, T.; Kornblith, S.; Zhang, C.; Dosovitskiy, A. Do Vision Transformers See Like Convolutional Neural
Networks? arXiv 2021, arXiv:2108.08810. [CrossRef]
9. Gheflati, B.; Rivaz, H. Vision Transformer for Classification of Breast Ultrasound Images. arXiv 2021, arXiv:2110.14731. [CrossRef]
10. Zhou, H.-Y.; Lu, C.; Yang, S.; Yu, Y. ConvNets vs. Transformers: Whose Visual Representations Are More Transferable? In
Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), IEEE, Montreal, QC,
Canada, 17 October 2021; pp. 2230–2238.
11. Wu, Y.; Qi, S.; Sun, Y.; Xia, S.; Yao, Y.; Qian, W. A Vision Transformer for Emphysema Classification Using CT Images. Phys. Med.
Biol. 2021, 66, 245016. [CrossRef]
12. Filipiuk, M.; Singh, V. Comparing Vision Transformers and Convolutional Nets for Safety Critical Systems. AAAI Workshop Artif.
Intell. Saf. 2022, 3087, 1–5.
13. Galdran, A.; Carneiro, G.; Ballester, M.A.G. Convolutional Nets Versus Vision Transformers for Diabetic Foot Ulcer Classification.
arXiv 2022, arXiv:2111.06894. [CrossRef]
14. Cuenat, S.; Couturier, R. Convolutional Neural Network (CNN) vs Vision Transformer (ViT) for Digital Holography. In
Proceedings of the 2022 2nd International Conference on Computer, Control and Robotics (ICCCR), IEEE, Shanghai, China,
18 March 2022; pp. 235–240.
15. Coccomini, D.A.; Caldelli, R.; Falchi, F.; Gennaro, C.; Amato, G. Cross-Forgery Analysis of Vision Transformers and CNNs for
Deepfake Image Detection. In Proceedings of the 1st International Workshop on Multimedia AI against Disinformation, Newark,
NJ, USA, 27–30 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 52–58.
16. Wang, H. Traffic Sign Recognition with Vision Transformers. In Proceedings of the 6th International Conference on Information
System and Data Mining, Silicon Valley, CA, USA, 27–29 May 2022; Association for Computing Machinery: New York, NY, USA,
2022; pp. 55–61.
17. Xin, C.; Liu, Z.; Zhao, K.; Miao, L.; Ma, Y.; Zhu, X.; Zhou, Q.; Wang, S.; Li, L.; Yang, F.; et al. An Improved Transformer Network
for Skin Cancer Classification. Comput. Biol. Med. 2022, 149, 105939. [CrossRef]
18. Peng, Y.; Wang, Y. CNN and Transformer Framework for Insect Pest Classification. Ecol. Inform. 2022, 72, 101846. [CrossRef]
19. Bakhtiarnia, A.; Zhang, Q.; Iosifidis, A. Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead.
Neural Netw. 2022, 153, 461–473. [CrossRef]
20. Asadi Shamsabadi, E.; Xu, C.; Rao, A.S.; Nguyen, T.; Ngo, T.; Dias-da-Costa, D. Vision Transformer-Based Autonomous Crack
Detection on Asphalt and Concrete Surfaces. Autom. Constr. 2022, 140, 104316. [CrossRef]
21. Reedha, R.; Dericquebourg, E.; Canals, R.; Hafiane, A. Vision Transformers for Weeds and Crops Classification of High Resolution
UAV Images. Remote Sens. 2022, 14, 592. [CrossRef]
22. Bottou, L.; Bousquet, O. The Tradeoffs of Large Scale Learning. In Advances in Neural Information Processing Systems; Platt, J.,
Koller, D., Singer, Y., Roweis, S., Eds.; Curran Associates, Inc.: Vancouver, BC, Canada, 2007; Volume 20.
23. Foret, P.; Kleiner, A.; Mobahi, H.; Neyshabur, B. Sharpness-Aware Minimization for Efficiently Improving Generalization. arXiv
2020, arXiv:2010.01412. [CrossRef]
24. Korpelevich, G.M. The Extragradient Method for Finding Saddle Points and Other Problems. Ekon. Mat. Metod. 1976, 12, 747–756.
25. Al-Dhabyani, W.; Gomaa, M.; Khaled, H.; Fahmy, A. Dataset of Breast Ultrasound Images. Data Brief 2020, 28, 104863. [CrossRef]
26. Yap, M.H.; Pons, G.; Marti, J.; Ganau, S.; Sentis, M.; Zwiggelaar, R.; Davison, A.K.; Marti, R. Automated Breast Ultrasound Lesions
Detection Using Convolutional Neural Networks. IEEE J. Biomed. Health Inform. 2018, 22, 1218–1226. [CrossRef]
27. Zhang, R. Making Convolutional Networks Shift-Invariant Again. arXiv 2019, arXiv:1904.11486. [CrossRef]
28. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need.
Neural Inf. Process. Syst. 2017, 30, 3762. [CrossRef]
29. Zhou, D.; Kang, B.; Jin, X.; Yang, L.; Lian, X.; Jiang, Z.; Hou, Q.; Feng, J. DeepViT: Towards Deeper Vision Transformer. arXiv 2021,
arXiv:2103.11886. [CrossRef]
30. Amorim, J.P.; Domingues, I.; Abreu, P.H.; Santos, J.A.M. Interpreting Deep Learning Models for Ordinal Problems. In Proceedings
of the European Symposium on Artificial Neural Networks, Bruges, Belgium, 25–27 April 2018.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.