
A Technical Report on

Image Classification using Vision Transformer


A report submitted in partial fulfilment of the academic requirements for the award of
the degree of

BACHELOR OF TECHNOLOGY
In
ELECTRICAL AND ELECTRONICS ENGINEERING
Submitted By

G Surya Anirudh
21011A0219
Under the esteemed guidance of

Dr. K. Bhaskar
B.E. (AU), M.Tech (NIT Surathkal), Ph.D. (IITK)

Professor & Head of the Department

Sri B. Yadagiri
(Assistant Professor)

DEPARTMENT OF ELECTRICAL AND ELECTRONICS ENGINEERING,


JNTUH UNIVERSITY COLLEGE OF ENGINEERING, SCIENCE AND
TECHNOLOGY HYDERABAD
(AUTONOMOUS)
Accredited by NAAC A+
Kukatpally, Hyderabad, 500085
MAY - JUNE 2024

ACKNOWLEDGEMENT

I express my profound gratitude to my seminar guides, Dr. K. Bhaskar, Professor &
Head of the Department, and Mr. B. Yadagiri, Assistant Professor, Electrical & Electronics
Engineering, JNTUH-UCESTH, for their valuable guidance, encouragement, and motivation
throughout this seminar, from topic selection to the preparation and final presentation of this report.

I express my hearty thanks to Dr. K. Bhaskar, Head of the Department, Electrical
& Electronics Engineering, JNTUH-UCESTH, for continuous moral support in completing my
seminar on time. My sincere thanks to all the authors of the references and other literature
referred to during the preparation of this seminar report. I also thank all the faculty members
and non-teaching staff for the help and coordination extended in bringing out this seminar
report successfully and on time. Finally, I am very thankful to my parents, who guided me at
every step.

G Surya Anirudh

21011A0219

JNTUH UNIVERSITY COLLEGE OF ENGINEERING, SCIENCE AND
TECHNOLOGY, HYDERABAD
DEPARTMENT OF ELECTRICAL AND ELECTRONICS
ENGINEERING

CERTIFICATE

This is to certify that the seminar report entitled “IMAGE CLASSIFICATION
USING VISION TRANSFORMER” is being submitted by G SURYA ANIRUDH with
Admission No. 21011A0219 in partial fulfilment of the requirements for the award of the
degree of Bachelor of Technology in Electrical and Electronics Engineering,
JNTUH-UCESTH.

This is a record of bona fide work carried out by him under guidance and
supervision during the academic year 2024-2025. The discussions and results presented
in this seminar report have been verified and are found to be satisfactory. The work
embodied in this seminar report has not been submitted to any other university for the
award of any other degree or diploma.

ABSTRACT

Image classification, a cornerstone of computer vision, involves identifying and


categorizing objects, scenes, or activities in images. While Convolutional Neural Networks
(CNNs) have long been the dominant approach, they face several limitations. CNNs are
constrained by local receptive fields, making it difficult to capture long-range dependencies. Their spatial
inductive bias, imposed by convolution and pooling layers, may not suit all tasks, and their
hierarchical structures often lead to high computational costs. These constraints have spurred
the development of Vision Transformers, which provide a novel and effective alternative for
image classification.

Vision Transformers represent a paradigm shift in image classification by adapting the


Transformer architecture, originally designed for natural language processing, to visual data.
Unlike CNNs, which rely on localized operations, ViTs employ self-attention mechanisms to
capture global relationships within images. This innovative design eliminates the constraints of
local receptive fields and spatial inductive biases, enabling more comprehensive feature
extraction and analysis.

The core functionality of ViTs begins with dividing input images into patches, which
are then linearly embedded and enriched with positional encodings. These embeddings are
processed through a series of Transformer encoder layers, leveraging self-attention to model
long-range dependencies. The final representation is passed through a classification head,
typically an MLP, to predict output classes. This design ensures flexibility, scalability, and
improved interpretability by highlighting the regions most relevant to decision-making.

Vision Transformers offer significant advantages over traditional CNNs. They achieve
state-of-the-art performance on various benchmarks, outperforming CNNs in many cases.
Their modular architecture allows for easy adaptation to diverse tasks, while the self-attention
mechanism enhances interpretability by focusing on important image regions. These features,
combined with ongoing advancements in efficiency and multimodal integration, position ViTs
as a transformative technology in image classification.

TABLE OF CONTENTS

Details of Content                                                   Page No.

Abstract                                                             iv
Table of Contents                                                    v
List of Figures                                                      vi
Chapter 1: Introduction                                              7
  1.1 Introduction to Image Classification                           7
  1.2 Literature Review                                              10
  1.3 Challenges of Image Classification                             13
  1.4 Limitations of Convolutional Neural Networks (CNNs)            16
Chapter 2: Vision Transformer                                        20
  2.1 Emergence of Vision Transformers (ViTs)                        20
  2.2 Overview of Vision Transformers (ViTs)                         20
  2.3 Transformer Architecture and Self-Attention Mechanism          22
  2.4 Architecture of Vision Transformers (ViTs)                     23
  2.5 Training Vision Transformers (ViTs)                            27
Chapter 3: Training the Model                                        32
  3.1 Dataset for Training: CIFAR-100 Dataset Details                32
  3.2 Implementation of Vision Transformer for Image Classification  36
  3.3 Program                                                        39
  3.4 Output                                                         42
Chapter 4: Model Usage                                               44
  4.1 Advantages of Vision Transformer                               44
  4.2 Challenges and Limitations of Vision Transformer               47
  4.3 Future Development in Vision Transformer                       50
  4.4 Practical Applications of Vision Transformer                   53
Chapter 5: Conclusion                                                57
  5.1 Conclusion                                                     57
References                                                           60

LIST OF FIGURES

Details of Contents                                                  Page No.

Figure 2.1: Vision Transformer Architecture                          27
Figure 3.1: CIFAR Dataset                                            34
Figure 3.2: Program                                                  39-42
Figure 3.3: Epoch                                                    42
Figure 3.4: Training and Validation Loss                             43
Figure 3.5: Training and Validation Top-5 Accuracy                   43

CHAPTER 1: INTRODUCTION

1.1 Introduction to Image Classification

1.1.1 Definition and Importance:

Image classification is a fundamental task within computer vision that involves assigning a
specific label or category to an image based on its visual content. The process is achieved by
analysing the image’s pixel data and identifying patterns or features that help distinguish it
from other images. This task typically involves categorizing images into predefined classes,
which can range from identifying objects like animals, vehicles, or buildings to detecting
scenes such as urban landscapes or natural environments.

The importance of image classification in the modern world cannot be overstated. It is a crucial
part of many advanced technologies that have practical applications in various industries. In
healthcare, image classification is used to analyse medical images such as X-rays, MRIs, and
CT scans, helping doctors diagnose diseases like cancer, pneumonia, and other conditions more
accurately and efficiently. In the field of autonomous vehicles, image classification algorithms
help cars detect pedestrians, other vehicles, and road signs, facilitating safer navigation in
complex environments.

In the entertainment and retail industries, image classification is used for content tagging and
product categorization, enhancing user experiences by making search and recommendation
systems more effective. Security systems also rely heavily on image classification techniques,
such as facial recognition, to authenticate individuals and monitor access control. Moreover, as
the world generates vast amounts of visual data every day through social media, surveillance
cameras, and other sources, the need for automated systems capable of efficiently processing
and classifying images is increasingly critical.

The evolution of machine learning techniques, particularly deep learning, has played a
transformative role in the accuracy and scope of image classification, allowing systems to
identify complex patterns and features within images that would be difficult, if not impossible,
for humans to manually annotate. Thus, image classification is not just about identifying
objects in an image but also about creating smarter systems that can interact with the world
based on visual data, opening up endless possibilities for automation and AI.

1.1.2 Historical Context:

The journey of image classification began in the early days of computer vision in the 1960s,
when researchers sought ways to teach computers to understand images and interpret visual
data. The earliest approaches, such as edge detection and basic pattern recognition, were quite
primitive and lacked the ability to generalize across different image types. These methods were
limited by the computational power available at the time and the simplistic nature of the
algorithms, which often failed to handle real-world complexities such as lighting variations or
object occlusions.

In the 1980s and 1990s, the development of machine learning algorithms, such as decision
trees and support vector machines (SVMs), provided a more sophisticated approach to image
classification. These algorithms could learn from labelled training data and generalize their
knowledge to classify unseen images. However, despite these advances, early methods still
struggled with accuracy when applied to large and complex datasets, particularly in natural
images, due to limitations in the feature extraction processes.

The real breakthrough in image classification came with the advent of deep learning in the
2010s. Convolutional Neural Networks (CNNs), inspired by the structure of the human visual
cortex, were developed to automatically learn hierarchical features from raw pixel data without
requiring manual feature extraction. CNNs demonstrated significant improvements in accuracy,
especially when trained on large, labelled datasets. The release of ImageNet, a large-scale
dataset containing millions of labelled images across thousands of categories, played a crucial
role in advancing deep learning techniques by providing researchers with a comprehensive
benchmark for evaluating image classification models.

With the rise of GPUs (Graphics Processing Units) and cloud computing, which enabled the
efficient processing of large datasets, deep learning became the dominant approach for image
classification. In 2012, a CNN model known as AlexNet achieved a significant breakthrough
by winning the ImageNet Large Scale Visual Recognition Challenge with a substantial margin,
leading to widespread adoption of deep learning for image classification tasks. Over the years,
various CNN architectures such as VGGNet, ResNet, and Inception have further improved
accuracy and efficiency, allowing image classification systems to recognize objects in images
with human-level performance.

Today, with the growing availability of labelled data and computational resources, image
classification continues to evolve. The focus has shifted towards building models that are not

only accurate but also efficient and interpretable. Transfer learning, where pre-trained models
are fine-tuned for specific tasks, and unsupervised learning techniques, which allow models to
learn from unlabelled data, are some of the latest trends in the field.

1.1.3 Objective

Goals of Image Classification:

The ultimate goal of image classification is to enable machines to automatically assign


meaningful labels to images. However, the specific objectives can vary depending on the
context and application of the image classification task. Here are some of the primary goals:

1. Identifying Objects: One of the most fundamental goals of image classification is to


recognize objects present in an image. This includes a wide range of applications, from
detecting specific items in product images for e-commerce to identifying animals,
vehicles, or even complex objects like machinery in industrial settings. Accurate object
recognition is crucial in areas like robotics, where machines need to identify objects in
their environment to perform tasks such as picking, placing, or navigating.

2. Recognizing People: Image classification can also involve identifying specific people
or distinguishing between different individuals in an image. This is widely used in
security systems through facial recognition technology, which allows for authentication
and access control. Additionally, image classification can be used in social media
platforms for automatic tagging of individuals in photos, or in healthcare for monitoring
patient progress through visual data.

3. Classifying Scenes: Another important goal is to categorize images based on the overall
scene or environment they represent. This might include identifying whether an image
depicts an urban landscape, a forest, a beach, or an indoor setting. Scene classification
is particularly useful in applications like satellite imagery, where categorizing different
types of landscapes is necessary for environmental monitoring, urban planning, and
agriculture.

4. Detecting Activities: Image classification can also be employed to detect specific


activities or behaviours in images. For example, classifying a sequence of images from
a video feed to determine whether a person is running, walking, or sitting. This is
especially relevant in fields such as surveillance, sports analytics, and human-computer
interaction. Understanding activities in images can also have applications in healthcare,
where activity detection can help monitor patients’ physical movements and health
conditions.

5. Medical Image Classification: In medical imaging, the goal is often to classify images
based on abnormalities or conditions, such as identifying tumours in X-rays or
classifying stages of a disease from scans. Image classification models trained on
medical datasets have become invaluable tools in assisting doctors with diagnosing
conditions more quickly and accurately, reducing human error, and providing better
patient care.

6. Automating Search and Indexing: Image classification is also central to enhancing the
functionality of search engines. By categorizing large volumes of image data, search
algorithms can provide more relevant and accurate results. For example, when users
search for images based on specific criteria (e.g., "dog breed," "mountain landscape"),
classification systems can help index and retrieve images that fit the query, facilitating
better user experiences.

1.2 Literature Review

The field of image classification has evolved significantly over the years, driven by both
traditional machine learning techniques and the more recent advances in deep learning. This
literature review provides an overview of the key methodologies used in image classification,
from traditional methods like Support Vector Machines (SVM) and K-Nearest Neighbors (K-
NN), to deep learning techniques such as Convolutional Neural Networks (CNN), hybrid
methods, and more recent innovations like Vision Transformers (ViT).

1.2.1 Traditional Methods

Support Vector Machines (SVM):

Support Vector Machines (SVM) are one of the earliest and most well-known traditional
machine learning methods used for image classification. SVMs operate by finding a hyperplane
that best separates data into distinct classes. In the context of image classification, the input
features are typically derived from handcrafted techniques such as colour histograms, texture
analysis, or edge detection. SVMs excel in high-dimensional spaces, making them effective for
problems where the dataset consists of numerous features.

The strength of SVM lies in its ability to handle small-to-medium-sized datasets with high
accuracy, especially when the classes are linearly separable or nearly so. However, SVMs
struggle when dealing with large datasets or images with complex, non-linear features, which
led to the development of more advanced techniques like neural networks and deep learning
models.
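To make this concrete, the following is a minimal scikit-learn sketch of SVM-based image classification on handcrafted features. The colour-histogram extractor, the RBF kernel, and the random toy data are illustrative assumptions, not the method used later in this report.

# Minimal sketch: SVM classification of images using a handcrafted feature
# (a per-channel colour histogram) as input to an RBF-kernel SVM.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def colour_histogram(image, bins=8):
    # Turn an HxWx3 uint8 image into a fixed-length per-channel histogram.
    feats = [np.histogram(image[..., c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    return np.concatenate(feats).astype(np.float32)

# Toy labelled data standing in for a real dataset (assumed shapes).
rng = np.random.default_rng(0)
X_images = rng.integers(0, 256, size=(200, 32, 32, 3), dtype=np.uint8)
y = rng.integers(0, 5, size=200)

X = np.stack([colour_histogram(img) for img in X_images])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))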

K-Nearest Neighbors (K-NN):

K-Nearest Neighbors (K-NN) is another traditional approach that has been used for image
classification. K-NN works by assigning a label to an image based on the majority class of its
‘K’ nearest neighbors in the feature space. It is a simple and intuitive algorithm that does not
require a training phase, as it classifies new data based on distance metrics such as Euclidean
distance.

While K-NN has the advantage of being easy to implement and understand, it faces challenges
in terms of computational efficiency and scalability, particularly with high-dimensional image
data. The method also suffers from the "curse of dimensionality," where the performance
degrades as the number of features increases, making it less suitable for large-scale image
classification tasks compared to more advanced algorithms.
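The decision rule itself is short enough to spell out. Below is a minimal NumPy sketch of the K-NN classifier described above: Euclidean distances to every training example, followed by a majority vote over the K closest labels. The feature dimensions and toy data are assumed for illustration.

# Minimal sketch of the K-NN decision rule.
import numpy as np

def knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distance to each sample
    nearest = np.argsort(dists)[:k]                     # indices of the K closest samples
    return np.bincount(y_train[nearest]).argmax()       # majority class among the neighbours

# Toy data: 100 feature vectors of length 64 in 3 classes (assumed shapes).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 64))
y_train = rng.integers(0, 3, size=100)
print(knn_predict(X_train, y_train, X_train[0], k=5))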

Random Forest:

Random Forest is an ensemble learning method that constructs a collection of decision trees
and combines their outputs to improve classification accuracy. Each decision tree is built using
a subset of the training data and features, and the final classification is determined by majority
voting from all trees in the forest. This approach can be applied to image classification by first
extracting relevant features from images and then applying Random Forest to classify these
features.

One of the key advantages of Random Forest is its robustness to overfitting and its ability to
handle noisy data. However, like other traditional methods, it requires manual feature
extraction, which limits its ability to capture complex, hierarchical patterns present in images.
As a result, deep learning techniques began to take center stage in image classification tasks, as
they can learn features directly from raw image data without requiring manual intervention.

1.2.2 Deep Learning

Convolutional Neural Networks (CNN):

Convolutional Neural Networks (CNNs) have revolutionized the field of image classification
by enabling machines to learn hierarchical patterns and features directly from raw pixel data.
Unlike traditional machine learning methods that rely on handcrafted features, CNNs are
designed to automatically learn spatial hierarchies of features through a series of convolutional
layers. These layers detect increasingly complex patterns in the image, starting with simple
features such as edges and textures, and progressing to more abstract patterns like shapes and
objects.

The key advantage of CNNs is their ability to learn features at multiple levels of abstraction,
making them highly effective for image classification tasks. CNNs consist of three main types
of layers: convolutional layers, pooling layers, and fully connected layers. The convolutional
layers perform the feature extraction by applying filters to the input image, the pooling layers
reduce dimensionality while preserving important features, and the fully connected layers use
the extracted features to make the final classification.
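As an illustration of these three layer types, the following is a minimal Keras sketch of a small CNN classifier. The 32x32 input, the layer sizes, and the ten output classes are assumed values chosen only to show the structure, not the architecture used later in this report.

# Minimal Keras sketch of a CNN: convolution for feature extraction,
# pooling for downsampling, and fully connected layers for classification.
from tensorflow import keras
from tensorflow.keras import layers

num_classes = 10   # assumed number of output classes

model = keras.Sequential([
    keras.Input(shape=(32, 32, 3)),
    layers.Conv2D(32, kernel_size=3, activation="relu"),   # low-level features (edges, textures)
    layers.MaxPooling2D(pool_size=2),                       # spatial downsampling
    layers.Conv2D(64, kernel_size=3, activation="relu"),   # more abstract features
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),                   # fully connected layer
    layers.Dense(num_classes, activation="softmax"),        # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()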

CNNs have significantly improved classification accuracy in a variety of domains, from


recognizing everyday objects to detecting medical conditions in imaging scans. The
breakthrough moment for CNNs came in 2012, when the AlexNet model won the ImageNet
competition by a significant margin, demonstrating that deep learning could outperform
traditional methods in large-scale image classification tasks.

Moreover, CNNs have the advantage of being highly parallelizable, meaning they can leverage
modern hardware such as Graphics Processing Units (GPUs) to process large datasets
efficiently. This ability to scale with data and computation has made CNNs the dominant
approach in image classification.

1.2.3 Hybrid Methods

Ensemble of CNN:

To further improve the performance of CNNs, researchers have explored hybrid methods that
combine multiple CNN models to form an ensemble. The idea behind ensemble learning is to
combine the predictions of several models to reduce errors and increase the robustness of the
classification process. An ensemble of CNNs can take various forms, including averaging the
predictions of several models, majority voting, or using weighted voting schemes.
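As a sketch of the averaging variant, the snippet below soft-votes over the probability outputs of several already trained models. The models list, shared input shape, and shared class ordering are assumptions for illustration.

# Minimal sketch of ensembling by prediction averaging ("soft voting").
import numpy as np

def ensemble_predict(models, x_batch):
    # Each model returns per-class probabilities of shape (N, num_classes).
    probs = [m.predict(x_batch, verbose=0) for m in models]
    avg = np.mean(probs, axis=0)          # average the probability distributions
    return np.argmax(avg, axis=1)         # final predicted class per image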

Ensemble methods can be particularly effective when individual CNN models have different
strengths and weaknesses, as the combination of their outputs tends to yield more accurate and
stable predictions. By training multiple CNN models on different subsets of the data or using
different architectures, ensemble methods can capture a broader range of features and improve
the overall performance. This approach has been successfully applied in many image
classification tasks, including object detection and facial recognition.
While ensemble methods can improve accuracy, they also come with increased computational
costs, as training multiple models requires significantly more resources. Despite this, ensemble
learning has become a popular approach for image classification competitions and real-world
applications where high accuracy is critical.

1.3 Challenges in Image Classification

Image classification, while a significant breakthrough in computer vision, faces several


challenges that can affect the accuracy and reliability of models. These challenges stem from
the inherent complexity and variability of real-world data, which can hinder the model’s ability
to correctly classify images under different conditions. The primary challenges include image
resolution, occlusion, and background clutter, all of which present unique difficulties in
achieving high classification performance.

1.3.1 Image Resolution

Impact of High and Low-Resolution Images on Classification Accuracy:

Image resolution refers to the level of detail present in an image, which is determined by the
number of pixels in the image. Higher resolution images contain more pixel data, allowing
models to capture fine-grained details of objects and scenes. Lower resolution images, on the
other hand, are smaller in size and may lack the necessary detail for accurate classification.

High-Resolution Images: High-resolution images typically provide more detailed and


clearer visual information, which aids in the classification process. Fine details such as
textures, small objects, and intricate patterns are easier for machine learning models,
particularly deep learning models like CNNs, to learn. High-resolution images are particularly
advantageous when the task requires distinguishing between subtle differences between similar
classes (e.g., distinguishing between different breeds of dogs or identifying medical conditions
in X-rays).

However, the use of high-resolution images also comes with its own set of challenges. They
require more computational resources to process, store, and analyse. Training deep learning
models on high-resolution data demands significant amounts of memory and processing power,
which can lead to longer training times and higher costs. Additionally, even with high-
resolution images, a model may still fail if the object of interest is not well represented due to
factors like lighting, scale, or orientation.

Low-Resolution Images: Low-resolution images, while computationally less
demanding, pose significant challenges in terms of accuracy. The lack of fine detail in low-
resolution images often leads to misclassification, as the model may not be able to extract
meaningful features from the image. This is particularly problematic in tasks where minute
differences in object features are critical for classification.

For example, in medical imaging, low-resolution scans may obscure critical details such as
small tumours or early-stage diseases, leading to incorrect diagnoses. Similarly, in object
detection tasks, low-resolution images can cause models to miss small objects or incorrectly
merge objects that appear too close together due to blurring or pixelation.

To mitigate this, techniques like image super-resolution, which uses machine learning to
upscale low-resolution images, have been proposed. However, these methods often introduce
noise or artifacts that can further complicate the classification task.

1.3.2 Occlusion

Challenges Posed by Partially Visible Objects:

Occlusion occurs when part of an object in an image is blocked or hidden, either by another
object or due to the angle at which the image is captured. This is a common problem in real-
world applications where objects are not always fully visible, either due to physical
obstructions or the limitations of the camera’s field of view. Occlusion can significantly
degrade the performance of image classification models.

In object classification tasks, occlusion makes it difficult for models to identify objects because
the model is deprived of crucial visual information. For example, if a person is partially
blocked by a table or a car is partially obscured by a tree, a classification model may fail to
correctly identify the object or may misclassify it based on the visible parts.

Deep learning models, particularly CNNs, are somewhat robust to partial occlusion, as they are
designed to recognize patterns even when parts of the image are missing. However, the level of
robustness depends on the extent and nature of the occlusion. Light occlusion may not affect
classification accuracy much, but heavy occlusion, where large portions of the object are
obscured, can severely degrade performance. Moreover, some objects are more difficult to
recognize than others when occluded; for example, recognizing a partially visible car might be
easier than identifying a person wearing a partially obscured uniform.

To address the challenges of occlusion, techniques such as data augmentation, where training
images are artificially modified by adding occlusions or perturbations, are often employed.
These methods help the model learn to recognize objects even when parts of them are missing.
Another approach involves using more advanced architectures like attention mechanisms,
which can focus on the most relevant parts of an image and ignore occluded areas.
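One simple way to simulate occlusion during training is a "cutout"-style augmentation, sketched below in NumPy; the square mask size and the fill value are arbitrary choices for illustration.

# Minimal sketch of occlusion-style augmentation: mask out a random square
# region of the image so the model learns to classify from partial evidence.
import numpy as np

def random_cutout(image, size=8, fill=0, rng=None):
    rng = rng or np.random.default_rng()
    h, w = image.shape[:2]
    y = rng.integers(0, h - size + 1)      # top-left corner of the occluded square
    x = rng.integers(0, w - size + 1)
    out = image.copy()
    out[y:y + size, x:x + size, ...] = fill
    return out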

1.3.3 Background Clutter

Difficulty in Distinguishing Objects from Complex Backgrounds:

Background clutter refers to the presence of unnecessary, complex, or distracting elements in


an image that can make it difficult for a model to distinguish the object of interest from its
surroundings. In real-world images, the background is often not a simple, uniform colour but
rather a complex scene with various textures, objects, and patterns. This clutter can interfere
with the classification process by confusing the model, leading it to misinterpret the object or
fail to identify it altogether.

For example, in an image where a dog is surrounded by trees, grass, and other animals, the
model might mistakenly focus on the background elements rather than the dog itself, causing
incorrect classification. This issue is particularly pronounced in outdoor scenes where the
background can be highly dynamic and variable.

Deep learning models, particularly CNNs, can learn to focus on relevant features of an image,
but they can still struggle when background clutter is too prominent or when the object of
interest is small and hard to distinguish. Even with sophisticated models, background clutter
can confuse the feature extraction process, leading to false positives or misclassifications.

To address background clutter, various techniques can be used, including image segmentation,
where the image is divided into different regions to isolate the object of interest from the
background. Models like Mask R-CNN and U-Net are commonly used for such tasks, as they
allow the system to focus on the object while ignoring irrelevant background elements.
Another approach involves improving the quality of training data by removing or simplifying
background elements, which can help the model learn more effectively.

In some cases, background subtraction methods are used in video processing, where the
background is initially learned and subtracted from the scene to highlight moving objects.
These techniques have been particularly useful in surveillance and tracking applications.
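A minimal OpenCV sketch of this idea is shown below: a MOG2 background model is learned frame by frame, and applying it yields a foreground mask that highlights moving objects. The video filename and parameter values are assumptions for illustration.

# Minimal sketch of background subtraction on a video stream with OpenCV.
import cv2

cap = cv2.VideoCapture("video.mp4")      # assumed input file
subtractor = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)    # 255 where motion is detected, 0 for background
    # downstream: threshold fg_mask and find contours to localise moving objects
cap.release()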

1.4 Limitations of Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs) have become the gold standard for image
classification and various other computer vision tasks due to their ability to automatically learn
hierarchical features from raw image data. However, despite their remarkable success, CNNs
come with certain limitations that can hinder their performance in specific contexts or
applications. These limitations include challenges related to local receptive fields, spatial
inductive bias, and computational complexity. Understanding these limitations is crucial for
improving CNN-based models and advancing the field of deep learning in computer vision.

1.4.1 Local Receptive Fields

Struggles with Capturing Long-Range Dependencies:

A fundamental characteristic of CNNs is their use of local receptive fields, which are small,
localized regions of the input image that the convolutional filters focus on. In each
convolutional layer, a filter slides over the image to extract local features such as edges,
textures, and simple patterns. These local receptive fields are highly effective at learning low-
level features but can struggle with understanding long-range dependencies and relationships
between distant parts of the image.

The issue with local receptive fields is that they only capture information within a small
neighbourhood of the image. While this is advantageous for detecting simple patterns and local
features, it becomes a limitation when the model needs to understand the global context of the
image. For example, in an image of a person walking on a beach, local receptive fields might
capture the details of the person’s face or the sand, but they may not fully capture the
relationship between the person and the broader scene (e.g., the beach, the ocean, and the
horizon).

As CNNs deepen, the receptive field increases, allowing the network to learn more global
features, but this process is gradual and still confined to a local area for each individual filter.
This means that even very deep CNNs may struggle to capture long-range dependencies in the
image without additional mechanisms or architectures.

To address this, various strategies have been proposed, including dilated convolutions and
residual networks, which attempt to expand the receptive field without increasing the number
of parameters. Additionally, models like Vision Transformers (ViT) have emerged as an

alternative, utilizing self-attention mechanisms to capture long-range dependencies across the
entire image at once, without the restriction of local receptive fields.
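The dilated-convolution idea can be illustrated with a short Keras sketch: the two branches below use the same 3x3 kernel (and hence the same number of parameters), but the dilated branch spreads its taps over a 5x5 window, enlarging the receptive field. The input size and filter count are assumed values.

# Minimal Keras sketch contrasting a standard and a dilated 3x3 convolution.
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(64, 64, 3))
standard = layers.Conv2D(32, kernel_size=3, padding="same")(inputs)                  # 3x3 receptive field
dilated = layers.Conv2D(32, kernel_size=3, dilation_rate=2, padding="same")(inputs)  # 5x5 receptive field, same parameter count
model = keras.Model(inputs, [standard, dilated])
model.summary()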

1.4.2 Spatial Inductive Bias

Limitations Imposed by Convolution and Pooling Operations:

Another inherent limitation of CNNs lies in their spatial inductive bias, which is imposed by
the convolution and pooling operations. These operations assume that the spatial relationships
between features are critical for recognition, leading CNNs to heavily rely on local patterns and
hierarchical structures to make predictions. While this works well for many image
classification tasks, it can also be restrictive in certain scenarios.

Convolutional Operation:
The convolution operation in CNNs applies a fixed filter to every part of the input image,
effectively assuming that the spatial arrangement of features remains consistent across the
image. This is an inductive bias toward translation invariance—meaning that CNNs expect the
features they learn to be spatially invariant, regardless of their position in the image. While
translation invariance is beneficial for tasks like object recognition, it becomes a limitation
when trying to recognize more complex or context-sensitive patterns that require the model to
account for positional information. For example, recognizing a face or a human figure may
depend on the exact spatial location and orientation of the object, which can be challenging for
a model that only considers local patterns and assumes spatial invariance.

Pooling Operation:
Pooling layers, particularly max-pooling, reduce the spatial dimensions of the input, effectively
downsampling the feature maps. While pooling helps reduce computational cost and prevents
overfitting by focusing on the most salient features, it also discards important spatial
information. This loss of information can be detrimental when fine-grained details are essential
for classification. In certain tasks, such as semantic segmentation or instance segmentation,
precise localization of objects and their boundaries is critical. Pooling operations, which
emphasize abstraction and generalization, may therefore lead to a loss of valuable spatial detail
needed for accurate classification.

Furthermore, the pooling operation enforces a rigid grid structure, where the spatial

relationships of features are assumed to be regularly spaced and uniform. This is a form of
spatial bias that may not always be appropriate for tasks that require non-uniform spatial
relationships or when dealing with irregularly shaped objects.

To mitigate these issues, some models use alternative techniques like strided convolutions or
dilated convolutions to maintain higher spatial resolution while still reducing dimensionality.
Additionally, more advanced architectures, such as the Transformer-based Vision Transformers
(ViTs), do not rely on convolutions or pooling but instead focus on processing the entire image
through self-attention mechanisms, which have the potential to overcome these limitations.

1.4.3 Computational Complexity

High Computational Cost of Large-Scale CNNs:

One of the most significant limitations of CNNs is their high computational complexity,
especially when dealing with large-scale datasets or deep networks. CNNs, particularly very
deep architectures with many layers and parameters, require significant computational
resources for both training and inference.

Training Computational Cost:


Training a deep CNN involves performing millions of matrix operations and backpropagation
updates across large datasets. For each convolutional layer, the model must apply a large
number of filters to the input, which results in a high number of parameters. Additionally,
deeper CNNs typically involve a greater number of layers, each requiring its own set of
computations. This leads to a steep increase in the computational requirements as the
depth and complexity of the network grow. Training such models requires powerful hardware,
such as Graphics Processing Units (GPUs) or specialized accelerators like TPUs, to speed up
the matrix operations. Even with these hardware resources, training large CNNs can take days,
weeks, or even longer, depending on the dataset size and model architecture.

Memory Usage:
Deep CNNs also require substantial memory to store the weights, gradients, and intermediate
activations during training. For very deep networks, the memory footprint can be prohibitively
large, making it difficult to train models on standard hardware. Moreover, the sheer size of the
models may make deployment in resource-constrained environments, such as mobile devices
or embedded systems, challenging.

Inference Computational Cost:
Once trained, CNNs still face challenges in terms of inference, particularly for large-scale
models. While CNNs are highly parallelizable and can be optimized for GPUs, the inference
time may still be long when dealing with large input images or complex models. In
applications like real-time video classification or autonomous driving, these delays can be
problematic, as fast and responsive decision-making is critical.

To mitigate computational complexity, various techniques have been proposed, such as model
pruning, quantization, and knowledge distillation. These methods aim to reduce the size of the
model or the precision of the computations without sacrificing accuracy. Additionally,
lightweight CNN architectures, such as MobileNets and EfficientNets, have been designed to
operate with fewer parameters and lower computational requirements, making them suitable
for mobile and edge computing applications.

CHAPTER 2: VISION TRANSFORMER

2.1 Emergence of Vision Transformers (ViTs)

In recent years, Vision Transformers (ViTs) have emerged as a revolutionary alternative to


Convolutional Neural Networks (CNNs) in the field of computer vision. While CNNs have
been the dominant architecture for image classification and other visual tasks, ViTs offer
several advantages by leveraging the power of transformers, a model originally designed for
natural language processing (NLP). ViTs have redefined how image data is processed and
analysed, shifting the paradigm from localized feature extraction to global context modelling,
which offers new possibilities for capturing complex patterns and long-range dependencies
within images. This section provides an in-depth overview of Vision Transformers, their
architecture, how they handle spatial and patch embeddings, and the key advantages of their
self-attention mechanism over traditional CNNs.

2.2 Overview of Vision Transformers (ViTs)

2.2.1 Introduction to Vision Transformers:

Vision Transformers (ViTs) are based on the transformer architecture, which was first
introduced for sequence-to-sequence tasks in NLP. Transformers, and specifically the self-
attention mechanism they employ, were designed to model long-range dependencies within
sequences by attending to all parts of the input simultaneously. ViTs bring this powerful idea to
the domain of computer vision by treating an image as a sequence of patches rather than as a
grid of pixels. This approach contrasts with the CNN paradigm, where images are processed
through local filters that capture hierarchical patterns and spatial relationships at different
scales.

The key insight behind Vision Transformers is that by dividing an image into small patches,
each patch is treated as a token (similar to words in NLP tasks), and these tokens are processed
through transformer layers. The transformer then uses self-attention mechanisms to learn the
global dependencies between patches, allowing the model to capture both local and global
features efficiently. ViTs have demonstrated state-of-the-art performance in image
classification tasks, outperforming CNNs in several benchmarks, especially when trained on
large datasets.

The architecture of Vision Transformers is largely inspired by the transformer model used in
NLP, particularly the encoder portion of the original transformer model. This encoder consists
of multi-head self-attention layers and feed-forward networks, which allow for the learning of
complex relationships between image patches. The self-attention mechanism, which is the
cornerstone of transformers, enables ViTs to model long-range dependencies more effectively
than CNNs, which are inherently limited by their local receptive fields.

2.2.2 Spatial and Patch Embeddings

How ViTs Handle Image Data by Dividing It into Patches:

The first step in Vision Transformers is to divide the input image into smaller, fixed-size
patches. This is in stark contrast to CNNs, which operate on entire images or local regions with
convolutional filters. In ViTs, the image is split into non-overlapping patches (e.g., 16x16 or
32x32 pixels) that are flattened into one-dimensional vectors. These vectors are then passed as
input tokens into the transformer model.

The process of dividing an image into patches is the first key difference between ViTs and
CNNs. Each patch is treated as a discrete entity, akin to a token in natural language processing.
The patches can vary in size, but a common approach is to use small square patches, such as
16x16 or 32x32 pixels, depending on the resolution of the input image. These patches are then
flattened into one-dimensional vectors, which represent the pixel values in the flattened form.

Embedding the Patches:

After the image is divided into patches, each patch is projected into a high-dimensional space
using a linear embedding layer. This step is similar to how words in NLP are converted into
embeddings, allowing the model to process the patches in a more efficient way. Each patch’s
vector is passed through a linear transformation, creating an embedding that represents the
patch in a continuous feature space.

In addition to the patch embeddings, position embeddings are added to encode spatial
information. Since transformers do not inherently have a notion of the spatial arrangement of
data, position embeddings are crucial. These embeddings represent the position of each
patch within the original image and are added to the patch embeddings to maintain spatial
awareness. This allows the model to learn the spatial relationships between patches and
understand how they are arranged in the image.
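The patch and position embedding stage can be sketched in a few lines of Keras, shown below. The 224x224 input, 16x16 patches, and 768-dimensional projection are assumed example values rather than this report's exact configuration.

# Minimal sketch of patch extraction, linear projection, and learned
# position embeddings for a Vision Transformer input stage.
import tensorflow as tf
from tensorflow.keras import layers

image_size, patch_size, projection_dim = 224, 16, 768
num_patches = (image_size // patch_size) ** 2        # 14 x 14 = 196 patches

def embed_patches(images):
    # Split each image into non-overlapping patches and flatten each patch.
    patches = tf.image.extract_patches(
        images=images,
        sizes=[1, patch_size, patch_size, 1],
        strides=[1, patch_size, patch_size, 1],
        rates=[1, 1, 1, 1],
        padding="VALID",
    )
    patches = tf.reshape(patches, [tf.shape(images)[0], num_patches, -1])

    # Linear projection of each flattened patch plus a learned position embedding.
    projection = layers.Dense(projection_dim)
    position_embedding = layers.Embedding(input_dim=num_patches, output_dim=projection_dim)
    positions = tf.range(start=0, limit=num_patches, delta=1)
    return projection(patches) + position_embedding(positions)

tokens = embed_patches(tf.random.uniform([2, image_size, image_size, 3]))
print(tokens.shape)   # (2, 196, 768)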

Benefits of Patch Embeddings:

By converting an image into a sequence of patches, ViTs can treat image data in the same way
transformers handle textual data in NLP. This transformation enables the model to perform
self-attention on the entire image, which allows it to capture both local and global
dependencies more effectively than CNNs. The flexibility in the size of the patches also allows
ViTs to handle images of varying resolutions, giving them a unique advantage in terms of
scalability and adaptability.

2.3 Transformer Architecture and Self-Attention Mechanism

2.3.1 Self-Attention Mechanism:

The core innovation of the transformer architecture is the self-attention mechanism. Self-
attention enables the model to weigh the importance of each patch in relation to every other
patch, regardless of their spatial proximity. This is in contrast to CNNs, where filters only
capture local patterns based on neighbouring pixels. The self-attention mechanism works by
computing a set of attention scores, which dictate how much focus should be placed on each
part of the image when processing a given patch.

In ViTs, the self-attention mechanism is computed as follows:

1. Query, Key, and Value: Each patch is transformed into three vectors: the query (Q),
the key (K), and the value (V). These vectors are derived from the patch embedding
using learned linear transformations. The self-attention mechanism works by computing
the attention between each patch’s query and all other patches’ keys.

2. Attention Scores: The attention score between a query and a key is computed as the
dot product of the query and key vectors, followed by a SoftMax operation to ensure
the scores sum to one. This attention score indicates how much each patch should
attend to other patches when computing its output.

3. Weighted Sum: The final output for each patch is a weighted sum of the value vectors,
where the weights are determined by the attention scores. This allows the model to
incorporate information from distant patches, capturing long-range dependencies and
context.

4. Multi-Head Attention: To allow the model to focus on different aspects of the input

simultaneously, multi-head attention is used. This technique involves applying multiple
self-attention operations in parallel, with each head learning a different representation
of the input. The results from all attention heads are then concatenated and passed
through a feed-forward layer.
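The four steps above can be condensed into a short NumPy sketch of single-head self-attention, shown below. The 1/sqrt(d) scaling is the standard stabilising factor used in practice, and the patch count and embedding width are assumed example values; multi-head attention runs several such maps in parallel and concatenates their outputs.

# Minimal NumPy sketch of single-head self-attention over patch embeddings.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # step 1: queries, keys, values
    scores = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # step 2: attention scores
    return scores @ V                                  # step 3: weighted sum of values

N, d = 196, 64                                         # e.g. 196 patches of width 64 (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)             # (196, 64)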

2.3.2 Advantages Over CNNs:

The self-attention mechanism provides several advantages over the local receptive fields of
CNNs:

1. Global Context Modelling: While CNNs are limited by their local receptive fields, the
self-attention mechanism allows ViTs to capture global relationships between image
patches. This is particularly useful for tasks where understanding the broader context or
long-range dependencies is critical, such as in object recognition or scene
understanding.

2. Scalability and Flexibility: ViTs can handle images of various sizes more effectively
than CNNs. The ability to process images as sequences of patches gives the model
flexibility in terms of input size, while still being able to scale efficiently.

3. Parallelism: The transformer model’s attention mechanism is highly parallelizable,


which leads to faster training times on hardware accelerators like GPUs and TPUs.
CNNs, by contrast, build up global context only through a deep stack of layers applied in
sequence, which limits how much of that context can be captured in a single parallel step.

4. Long-Range Dependencies: ViTs excel at capturing long-range dependencies, which


CNNs struggle with due to their local nature. The self-attention mechanism allows ViTs
to learn complex relationships between distant image regions, enabling them to make
more accurate predictions in tasks where global context is important.

2.4 Architecture of Vision Transformers (ViTs)

The architecture of Vision Transformers (ViTs) represents a departure from traditional


Convolutional Neural Networks (CNNs) in its approach to processing images. Instead of
applying convolutions to the entire image, ViTs treat the image as a sequence of patches and

process them through the powerful transformer encoder architecture, originally designed for
sequential data in natural language processing (NLP). In this section, we will explore the core
components of the ViT architecture, focusing on the input preparation, the functioning of the
transformer encoder, and the final classification head that produces the output.

2.4.1 Input Preparation: Steps to Preprocess and Prepare Image Data for ViTs

Before an image can be fed into a Vision Transformer, it must first undergo a series of
preprocessing steps to transform it into a format suitable for the model. Unlike CNNs, which
process the raw pixel data directly, Vision Transformers divide the image into smaller patches,
treating each patch as a token. This approach allows the transformer model to treat images
similarly to sequences in natural language processing, where each patch (or "token") is
embedded independently and then processed jointly through self-attention.

1. Image Resizing: The first step in preprocessing is resizing the input image to a fixed size.
Since Vision Transformers process images as sequences of patches, it is important that all input
images have a consistent shape. A common approach is to resize the image to a predefined
resolution, such as 224x224 or 384x384 pixels. This resizing ensures that the patches extracted
from the image will be of a consistent size and that the transformer model can handle the input
efficiently.

2. Dividing the Image into Patches: Once the image is resized, the next step is to divide it
into smaller, non-overlapping patches. Typically, these patches are of a fixed size, such as
16x16 or 32x32 pixels. The number of patches depends on the input image size and the patch
size. For instance, a 224x224 image divided into 16x16 patches would result in 196 patches
(224 / 16 = 14 patches along each dimension, for a total of 14x14 patches). Each patch is
flattened into a one-dimensional vector, representing the pixel values of the patch in a linear
form.

3. Linear Embedding of Patches: After the image is divided into patches, each patch is
flattened and projected into a high-dimensional embedding space. This is done using a linear
projection (typically a learned weight matrix) that transforms each flattened patch into a vector
of a fixed size, typically 768 or 1024 dimensions, depending on the model configuration. The
resulting vectors are the patch embeddings, and they represent the visual features of each patch
in a continuous feature space.

4. Position Embeddings: Since transformers do not have a built-in notion of spatial


relationships or the ordering of tokens (unlike CNNs, which process pixels in a local, spatially
aware manner), position embeddings are added to the patch embeddings to encode the spatial
information of each patch within the image. These position embeddings are learned during
training and are added to the patch embeddings before the data is passed through the
transformer encoder. This allows the model to maintain awareness of where each patch is
located in the original image, enabling it to capture the spatial relationships between patches.

5. Sequence Construction: The patch embeddings, along with their corresponding position
embeddings, are concatenated to form a sequence of tokens. This sequence is then fed into the
transformer encoder. The sequence of tokens can be thought of as a sequence of "words"
(patches) in a sentence, where each token contains both the visual content of the patch and its
spatial location within the image. The sequence construction step ensures that the transformer
can process the image as a sequence of related visual tokens, enabling it to capture global
dependencies between distant patches.

2.4.2 Transformer Encoder: Functioning of the Transformer Encoder Layers

The heart of the Vision Transformer is the transformer encoder, which is responsible for
processing the sequence of patch embeddings and learning the relationships between them. The
transformer encoder consists of multiple identical layers, each containing a self-attention
mechanism and a feed-forward network. The encoder layers allow the model to attend to all
patches in the image simultaneously, learning global dependencies and context between distant
patches.

1. Self-Attention Mechanism: The self-attention mechanism is the key innovation in


transformer models, and it is responsible for learning relationships between all tokens (patches)
in the input sequence. In the context of ViTs, the self-attention mechanism computes the
attention scores between all pairs of patches in the image. For each patch, the model calculates
three vectors: the query (Q), the key (K), and the value (V). These vectors are derived from the
input sequence of patch embeddings using learned weight matrices.

The attention score between two patches is computed as the dot product of the query vector of
one patch with the key vector of another patch. The attention score is then normalized using a
SoftMax function, which ensures that the scores sum to one. The result is a weighted sum of
the value vectors, where the weights are determined by the attention scores. This process
allows each patch to attend to other patches in the image, effectively capturing long-range
dependencies and contextual information across the entire image.

The self-attention mechanism is applied in parallel across multiple attention heads, allowing
the model to focus on different aspects of the input sequence simultaneously. This multi-head
attention mechanism enables the transformer encoder to learn a richer representation of the
image by considering multiple relationships between patches.

2. Feed-Forward Neural Network: After the self-attention mechanism, the output is passed
through a feed-forward neural network (FFN), which consists of two fully connected layers
with a non-linear activation function, such as ReLU, in between. The feed-forward network is
applied independently to each patch embedding and helps to further transform the
representations learned by the self-attention mechanism. The FFN introduces additional
capacity for the model to learn complex, non-linear relationships between patches.

3. Layer Normalization and Residual Connections: Each sub-layer of the transformer


encoder (the self-attention mechanism and the feed-forward network) is followed by layer
normalization, which helps stabilize the training process by normalizing the activations across
each layer. Additionally, residual connections are used to allow gradients to flow more easily
through the network, helping to prevent issues such as vanishing gradients during training. The
residual connections add the input of each sub-layer to its output, ensuring that the model can
learn both the original and transformed features.

4. Stacking Multiple Encoder Layers: The transformer encoder typically consists of multiple
stacked layers, usually 12 or 24 in Vision Transformers. Each layer learns progressively more
complex relationships between the image patches, allowing the model to capture a hierarchy of
features. The deeper the model, the more abstract the learned features become, enabling the
ViT to make sense of both local and global patterns within the image.
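A single encoder layer as described above can be sketched in Keras as follows. The post-norm arrangement mirrors the description (each sub-layer followed by a residual connection and layer normalization); the head count, layer widths, and ReLU activation are assumed example values.

# Minimal Keras sketch of one transformer encoder block.
from tensorflow.keras import layers

def transformer_encoder_block(tokens, num_heads=12, projection_dim=768, mlp_dim=3072):
    # Sub-layer 1: multi-head self-attention + residual connection + layer norm.
    attn = layers.MultiHeadAttention(num_heads=num_heads,
                                     key_dim=projection_dim // num_heads)(tokens, tokens)
    x = layers.LayerNormalization(epsilon=1e-6)(layers.Add()([tokens, attn]))

    # Sub-layer 2: position-wise feed-forward network + residual connection + layer norm.
    ffn = layers.Dense(mlp_dim, activation="relu")(x)
    ffn = layers.Dense(projection_dim)(ffn)
    return layers.LayerNormalization(epsilon=1e-6)(layers.Add()([x, ffn]))

# A full Vision Transformer stacks this block repeatedly (typically 12 or 24 times).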

2.4.3 Classification Head: Generating the Final Output through a Multi-Layer Perceptron (MLP)

After the input image has passed through the transformer encoder, the next step is to generate
the final output, which in the case of image classification is typically the class label for the
image. This is done through a classification head, which consists of a multi-layer perceptron
(MLP) that processes the output from the transformer encoder.

1. [CLS] Token: In Vision Transformers, a special token known as the [CLS] (classification)

token is added to the input sequence before feeding it into the transformer encoder. This token
is initialized as a learnable vector and is designed to capture the aggregated information from
all the patches in the image. After the transformer encoder has processed the sequence of
patches, the output corresponding to the [CLS] token is used as a global representation of the
entire image.

The [CLS] token is crucial for image classification tasks, as it serves as a compact
representation of the image's content after it has passed through the self-attention layers. The
output of this token contains information about both local and global features of the image,
making it suitable for classification.

2. MLP Head: The output from the [CLS] token is then passed through an MLP head, which
consists of one or more fully connected layers. These layers are designed to map the high-
dimensional representation of the image (from the [CLS] token) to the desired output space,
typically a vector representing the class probabilities. The final layer of the MLP typically uses
a SoftMax activation function to produce a probability distribution over the possible classes,
where each class corresponds to a label in the classification task.

3. Output: The final output of the classification head is the predicted class label for the input
image. The SoftMax function provides a probability distribution over all possible classes, with
the class corresponding to the highest probability being the model's predicted label.
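The classification head itself reduces to a few lines, sketched below: the encoder output at the [CLS] position is taken as the global image representation and mapped to class probabilities by a small MLP. The hidden width and the 100-class output (matching CIFAR-100, used for training later in this report) are assumed example values.

# Minimal Keras sketch of the classification head on top of the encoder output.
from tensorflow.keras import layers

def classification_head(encoded_tokens, num_classes=100, hidden_dim=512):
    # encoded_tokens has shape (batch, 1 + num_patches, dim); the [CLS] token is
    # assumed to have been prepended at position 0 before the encoder.
    cls_representation = encoded_tokens[:, 0]
    x = layers.Dense(hidden_dim, activation="relu")(cls_representation)
    return layers.Dense(num_classes, activation="softmax")(x)   # SoftMax over class labels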

Figure 2.1: Vision Transformer Architecture

2.5 Training Vision Transformers (ViTs)

Training Vision Transformers (ViTs) represents a significant shift in the way image
classification tasks are approached. Unlike Convolutional Neural Networks (CNNs), which
rely on local convolutional filters and hierarchical feature extraction, ViTs treat images as
sequences of patches, utilizing the transformer architecture to capture both local and global
dependencies. However, training ViTs effectively requires careful consideration of several key
factors, including the availability of large and diverse datasets, the computational resources
required, and the time-intensive nature of training these models. This section explores these
critical elements in detail, focusing on the datasets required, the computational resources
necessary, and the challenges related to training time.

2.5.1 Datasets Required: Importance of Large and Diverse Datasets

The success of Vision Transformers, like any deep learning model, heavily depends on the
quality and quantity of the training data. Since ViTs process images as sequences of patches,
they require large, high-quality datasets that can expose the model to a wide range of variations
in terms of object types, scenes, lighting conditions, and more. Unlike CNNs, which are
designed to work well with smaller datasets due to their inductive biases and ability to learn
spatial hierarchies, ViTs excel when trained on large, diverse datasets.

1. Large Datasets: ViTs, due to their flexible architecture and large number of parameters,
require substantial amounts of data to generalize well. Datasets like ImageNet, which contains
over 1.2 million labelled images across 1,000 classes, are ideal for training ViTs. The vast scale
of these datasets allows the transformer model to learn global relationships between patches
across many different object categories, improving its ability to classify images accurately.

ViTs benefit from the large-scale data because the self-attention mechanism in transformers
requires enough information to understand complex, long-range dependencies within images. If
the dataset is too small, the model may overfit, as the transformer architecture has more
capacity than is necessary to model the relationships within limited data.

2. Diverse Datasets: In addition to size, the diversity of the dataset plays a crucial role in
training ViTs. A diverse dataset ensures that the model encounters a variety of different images
with different backgrounds, orientations, lighting conditions, and object scales. This diversity
helps prevent the model from overfitting to a specific type of data and allows it to generalize
well to unseen examples during testing. Moreover, diverse datasets expose the model to a wide
range of visual patterns, which is essential for training a model that can understand complex,
real-world scenes.
Some widely used datasets for training Vision Transformers include:

 ImageNet: A large-scale image dataset used for image classification, containing
millions of labelled images across thousands of categories.

 CIFAR-10 and CIFAR-100: These datasets each contain 60,000 images, spread across 10 and 100 classes respectively. They are often used for benchmarking small-scale models.

 MS COCO: A dataset that includes images with annotations for object detection,
segmentation, and captioning. It offers rich, diverse data suitable for training models to
understand more than just simple classification.

 Open Images: A large dataset with over 9 million labelled images across 600
categories, useful for training large-scale models.

3. Pretraining on Large Datasets: Given the high data requirements of Vision Transformers,
a common strategy is to pretrain ViTs on large datasets (such as ImageNet) before fine-tuning
on task-specific datasets. This pretraining allows the model to learn general visual features that
can be transferred to other tasks, such as object detection or medical image analysis, with
relatively smaller datasets for fine-tuning. Pretraining is critical for ViTs to achieve high
performance, as they lack the inductive bias inherent in CNNs.

2.5.2 Computational Resources: Hardware Requirements for ViT Training

Training Vision Transformers requires significant computational resources due to their large
number of parameters and the complexity of the self-attention mechanism. Unlike CNNs,
which can process images efficiently with localized convolutions, ViTs must calculate pairwise
attention scores between all patches in the input image, leading to higher memory usage and
more computation.

1. GPUs vs. TPUs: The hardware requirements for training Vision Transformers are
substantial, and the use of specialized hardware accelerators like Graphics Processing Units
(GPUs) or Tensor Processing Units (TPUs) is essential. GPUs, particularly those designed for
deep learning tasks, such as NVIDIA’s A100, V100, or Titan GPUs, provide the computational
power required for training large models on massive datasets. TPUs, which are custom
accelerators developed by Google, are optimized for tensor-based operations and have become
a popular choice for training large-scale deep learning models.

 GPUs: ViTs require powerful GPUs with large memory capacities to handle the vast
number of parameters in the model and the large input image sizes. The GPUs must
also be capable of performing many parallel operations to efficiently process the self-
attention mechanism and feed-forward networks in the transformer layers.

 TPUs: For extremely large ViT models, TPUs offer significant speedups over GPUs,
particularly when training large models like those used in natural language processing
(NLP) or vision tasks. TPUs are designed specifically to handle the types of matrix
operations that are prevalent in transformers, making them well-suited for Vision
Transformer training.

2. Memory Requirements: The memory requirements for training Vision Transformers can be
considerable. Since the self-attention mechanism requires the computation of pairwise
relationships between all patches in the input, the memory usage grows quadratically with the
number of patches. For example, a 224x224 image divided into 16x16 patches results in 196
patches, and the model must store attention scores for each pair of patches. As the image size
or the number of patches increases, the memory required for training grows rapidly.
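
The quadratic growth is easy to verify with a back-of-the-envelope calculation. The short sketch below follows the 224x224 / 16x16 example above and assumes 12 attention heads and 4 bytes per stored score purely for illustration.

    def attention_footprint(image_size, patch_size, num_heads=12, bytes_per_score=4):
        # number of patches grows quadratically with image size for a fixed patch size
        num_patches = (image_size // patch_size) ** 2
        # one attention score per head for every pair of patches
        num_scores = num_heads * num_patches ** 2
        return num_patches, num_scores * bytes_per_score

    for size in (224, 384, 512):
        patches, mem_bytes = attention_footprint(size, 16)
        print(f"{size}x{size}: {patches} patches, ~{mem_bytes / 1e6:.1f} MB of attention scores")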

To manage memory, techniques like gradient checkpointing (where intermediate activations
are recomputed during backpropagation instead of being stored) can be used to reduce memory
usage, allowing for the training of larger models on available hardware.

3. Distributed Training: Given the computational cost of training ViTs, distributed training
across multiple GPUs or TPUs is often necessary. Distributed training allows for parallel
processing of large datasets and the ability to scale the model size, which is critical for
achieving state-of-the-art performance on large-scale image classification tasks. Using
frameworks like TensorFlow, PyTorch, or JAX, researchers can split the training workload
across multiple devices, effectively reducing the time required to train the model.

2.5.3 Training Time: Time-Intensive Nature of Training ViTs

Training Vision Transformers is typically more time-consuming than training traditional
CNNs. Several factors contribute to this, including the large number of parameters, the
complexity of the self-attention mechanism, and the need for vast datasets. The training process
can take days or even weeks, depending on the model size, dataset size, and available
hardware.

1. Model Size and Complexity: Vision Transformers tend to have more parameters than
CNNs, primarily because of the large embedding dimensions, the multi-head attention
mechanism, and the depth of the transformer layers. For example, a ViT model with 12
transformer layers and a 768-dimensional embedding space (ViT-Base) already has roughly 86
million parameters, and larger variants run into the hundreds of millions. Training such a
model requires a large amount of computation to update the
weights of these parameters using gradient descent.

2. Computational Load: The self-attention mechanism, which requires computing pairwise
attention between all patches, increases the computational load. For each attention operation,
the model must compute the dot product of every pair of patches, which can be especially time-
intensive for large images with many patches. As a result, training ViTs on high-resolution
images can be significantly slower than training CNNs, which rely on local convolutions.

3. Time for Pretraining: Pretraining a ViT on a large dataset like ImageNet can take several
days to weeks, even with powerful GPUs or TPUs. This pretraining step is essential to give the
model the general visual knowledge it needs to perform well on downstream tasks. Fine-tuning
the model on specific tasks may take less time but still requires significant computational
resources.

4. Hyperparameter Tuning: Like other deep learning models, ViTs require hyperparameter
tuning to achieve optimal performance. This process can involve testing different
configurations of learning rates, batch sizes, and the number of transformer layers.
Hyperparameter optimization for ViTs can also be time-consuming, as training deep models
across different settings adds to the overall time.

5. Batch Size and Training Epochs: Larger batch sizes can lead to faster convergence by
making more efficient use of hardware, but they also require more memory. On the other hand,
smaller batch sizes may slow down training. Additionally, ViTs typically require more epochs
to converge compared to CNNs, as they need to learn global relationships in addition to local
features. Therefore, training Vision Transformers can be both time- and resource-intensive.

CHAPTER 3:TRAINING THE MODEL

3.1 Dataset for Training: CIFAR-100 Dataset Details

The CIFAR-100 dataset is one of the most widely used datasets for evaluating image
classification models. It serves as an ideal benchmark for training and testing machine learning
algorithms, including Vision Transformers (ViTs), due to its manageable size, variety, and
relevance to real-world tasks. The CIFAR-100 dataset, an extension of the CIFAR-10 dataset,
provides a more challenging test for models, as it contains 100 different classes, each
representing a category of objects or scenes. This section will describe the structure of the
CIFAR-100 dataset, its components, and its use in training Vision Transformers.

3.1.1 Overview of CIFAR-100

The CIFAR-100 dataset is part of the CIFAR (Canadian Institute for Advanced Research)
family of datasets, which is widely used in computer vision research. It was introduced in 2009
by Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, and has since become a standard
benchmark for image classification tasks. The CIFAR-100 dataset consists of 60,000 images,
which are evenly divided into 100 different classes, with each class containing 600 images. The
images are small, with a resolution of 32x32 pixels, and each image is labelled with one of the
100 class categories.

While the CIFAR-100 dataset shares similarities with CIFAR-10, which contains only 10
classes, CIFAR-100 is designed to be more challenging by offering a wider variety of object
categories. The dataset is specifically valuable for evaluating the ability of machine learning
models to generalize across a larger set of object classes, making it an excellent choice for
training Vision Transformers (ViTs) and other image classification models.

3.1.2 Structure of CIFAR-100

The CIFAR-100 dataset is organized into training and test sets, as follows:

 Training Set: Contains 50,000 images, with 500 images per class, across all 100
classes. These images are used to train the model and learn the features that can
distinguish between the different categories.

 Test Set: Contains 10,000 images, with 100 images per class, and is used to evaluate
the performance of the trained model. This set is used to assess the accuracy, precision,
and generalization ability of the model after training.

Each image in the CIFAR-100 dataset is 32x32 pixels in size and is represented in RGB colour.
The images are typically pre-processed (e.g., normalization, augmentation) before being used
in model training.

Class Labels and Super classes: The CIFAR-100 dataset includes two levels of class labels:

 Fine Labels: The specific class labels (e.g., 'apple,' 'dog,' 'tree')—these are the primary
labels used for image classification.

 Coarse Labels (Super classes): A set of 20 super classes that group the 100 fine-
grained classes into broader categories. For instance, 'apple' and 'orange' belong to the
'fruit and vegetables' superclass, while 'hamster' and 'rabbit' belong to the 'small mammals' superclass.

This hierarchical structure can be beneficial for transfer learning and fine-tuning models
trained on CIFAR-100, especially when grouping images based on more general features.

3.1.3 Classes in CIFAR-100

The 100 fine-grained classes in CIFAR-100 represent a wide range of real-world objects. These
classes cover various types of animals, vehicles, natural scenes, and household items, making it
a comprehensive dataset for evaluating models on diverse types of image data. Some example
classes from CIFAR-100 include:

 Animals: 'bear,' 'dolphin,' 'elephant,' 'rabbit,' 'shark'

 Vehicles: 'bicycle,' 'bus,' 'motorcycle,' 'pickup truck,' 'train'

 Natural Objects: 'cloud,' 'forest,' 'mountain,' 'plain,' 'sea'

 Household Items: 'clock,' 'keyboard,' 'lamp,' 'telephone,' 'television'

The diversity of these classes allows for the testing of a model’s ability to classify objects from
different domains with varying levels of complexity and visual characteristics.

Figure 3.1: CIFAR dataset

3.1.4 Preprocessing and Augmentation for ViTs

Before training a Vision Transformer (ViT) on the CIFAR-100 dataset, images are typically
subjected to several preprocessing and augmentation techniques to improve model performance
and generalization. These steps are important because ViTs, unlike CNNs, do not inherently
possess spatial inductive biases. Therefore, preprocessing and augmentation play a critical role
in helping ViTs learn meaningful features from the image data.

 Resizing and Normalization: Although the CIFAR-100 dataset consists of 32x32 pixel
images, Vision Transformers generally require fixed input sizes. Often, the images are
resized to match the input size expected by the ViT model (e.g., 224x224 or 384x384
pixels). Additionally, pixel values are normalized to the range [0, 1] or [-1, 1] for model
training.

 Image Augmentation: Since the CIFAR-100 dataset is relatively small in terms of
image resolution and may not represent all possible variations of real-world data,
augmentation techniques like rotation, flipping, scaling, cropping, and colour jittering
are used to artificially expand the dataset. These techniques help prevent overfitting and
improve the generalization ability of the model.

 Patch Splitting: For ViTs, the images need to be divided into smaller patches. Each
image in CIFAR-100 is split into a grid of non-overlapping patches (for example, 4x4-pixel
patches for a 32x32 image, giving an 8x8 grid of 64 patches). These patches are then flattened and used as input tokens for the
transformer model. This patch-based representation is essential for ViTs to handle the
image data efficiently and apply the self-attention mechanism.
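
As a rough illustration of the resizing, normalization, and augmentation steps listed above, a Keras preprocessing pipeline along the following lines could be applied to CIFAR-100 images before patch splitting; the 72x72 target size and the augmentation strengths are assumptions, not the report's exact settings.

    import keras
    from keras import layers

    preprocessing = keras.Sequential(
        [
            layers.Rescaling(1.0 / 255.0),    # normalize pixel values to [0, 1]
            layers.Resizing(72, 72),          # resize 32x32 CIFAR images to the ViT input size
            layers.RandomFlip("horizontal"),  # simple augmentations to reduce overfitting
            layers.RandomRotation(0.05),
            layers.RandomZoom(0.1),
        ],
        name="cifar100_preprocessing",
    )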

3.1.5 Use of CIFAR-100 for Training Vision Transformers

The CIFAR-100 dataset serves as an excellent training ground for Vision Transformers due to
its diversity, complexity, and relatively manageable size compared to larger datasets like
ImageNet. Vision Transformers, which excel at capturing global dependencies and long-range
relationships between patches, are able to leverage the full potential of the CIFAR-100 dataset.
By treating the image as a sequence of patches, ViTs can learn spatial relationships between
object parts and their contextual dependencies within the image, even if these dependencies are
not immediately local.

While training ViTs on CIFAR-100 can still be computationally intensive due to the large
number of patches and model parameters, the dataset provides a balanced challenge that allows
researchers to test and refine the performance of transformer-based models on image
classification tasks. ViTs can match or outperform traditional CNNs on CIFAR-100, particularly
when they are pretrained on larger datasets and then fine-tuned with sufficient computational resources.

Moreover, CIFAR-100 allows for the comparison of ViTs with other state-of-the-art image
classification models, such as CNNs and hybrid models (e.g., CNN-Transformer hybrids).
Given that CIFAR-100 consists of 100 distinct classes, it is a strong test of how well a model
can generalize to multiple types of objects, which is a common requirement in real-world
image recognition tasks.

3.1.6 Challenges and Opportunities with CIFAR-100

While CIFAR-100 offers an ideal starting point for training ViTs, there are some inherent
challenges that researchers may encounter:

 Limited Image Resolution: The small image size (32x32) can limit the performance of
Vision Transformers, as the patches become relatively small and less informative. This
requires techniques such as patch merging or resizing images to higher resolutions to
better capture fine-grained details.

 Overfitting Risk: The CIFAR-100 dataset, while diverse, still represents a relatively
small set of images in comparison to real-world datasets. Regularization techniques and
data augmentation are crucial to mitigate overfitting.

At the same time, these challenges present opportunities for further research and innovation.
Experimenting with more advanced techniques, such as self-supervised learning, transfer
learning from larger datasets, or hybrid models, can help overcome some of these limitations
and improve the performance of Vision Transformers on CIFAR-100.

3.2 Implementation of Vision Transformers for Image Classification

In this section, we will walk through the implementation process of a Vision Transformer
(ViT) for image classification, specifically using the CIFAR-100 dataset. This step-by-step
guide will outline the essential stages, including importing libraries, preparing the dataset,
defining hyperparameters, augmenting data, building the model, and training/evaluating the
performance. We will also utilize Keras with JAX, a high-performance backend, to facilitate
training.

3.2.1 Import Libraries

The first step in implementing a Vision Transformer for image classification is importing the
necessary libraries. In this case, we will use Keras with JAX for efficient computation, along
with other essential libraries for data processing and model building.

Here:

 TensorFlow provides the foundation for building and training the model.

 JAX will enable efficient and scalable computations.

 NumPy and matplotlib are used for data manipulation and visualization.

 ImageDataGenerator from Keras will be used for data augmentation.
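
A minimal import block consistent with the list above might look as follows; selecting the JAX backend through the KERAS_BACKEND environment variable assumes Keras 3 is installed, and ImageDataGenerator is the legacy Keras utility bundled with TensorFlow.

    import os
    os.environ["KERAS_BACKEND"] = "jax"   # must be set before Keras is imported

    import numpy as np
    import matplotlib.pyplot as plt
    import keras
    from keras import layers, ops
    from tensorflow.keras.preprocessing.image import ImageDataGenerator  # data augmentation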

3.2.2 Load Dataset

Next, we load the CIFAR-100 dataset. TensorFlow/Keras provides built-in support for CIFAR-
100, which allows us to quickly access the training and testing data, along with the
corresponding labels.

Output will indicate:

 x_train and x_test contain images, each with the shape (32, 32, 3), where 32x32 is the
resolution of the images and 3 refers to the RGB channels.

 y_train and y_test contain labels for each image, where the shape is (num_samples, 1)
representing the class index.
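
A short sketch of this step, using the built-in Keras loader:

    import keras

    (x_train, y_train), (x_test, y_test) = keras.datasets.cifar100.load_data()

    print(x_train.shape, y_train.shape)   # (50000, 32, 32, 3) (50000, 1)
    print(x_test.shape, y_test.shape)     # (10000, 32, 32, 3) (10000, 1)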

3.2.3 Set Hyperparameters

Before building the model, we need to define the key hyperparameters for training the Vision
Transformer.

 learning_rate controls the step size during optimization.

 batch_size refers to the number of samples used per training step.

 epochs defines the number of times the model will iterate over the dataset.

 num_patches, patch_size, and embedding_dim control the patching strategy, which is
critical for ViT architecture.

 num_heads and num_layers define the complexity of the transformer’s self-attention
mechanism.
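
The values below are illustrative choices for these hyperparameters (the program in Section 3.3 may use different settings); the patch-related values assume the 32x32 images are resized to 72x72 before patching.

    num_classes = 100
    input_shape = (32, 32, 3)

    learning_rate = 1e-3
    batch_size = 256
    epochs = 30

    image_size = 72                                  # images are resized to image_size x image_size
    patch_size = 6                                   # side length of each square patch, in pixels
    num_patches = (image_size // patch_size) ** 2    # 144 patches per image
    embedding_dim = 64                               # dimension of each patch embedding
    num_heads = 4                                    # attention heads per transformer layer
    num_layers = 8                                   # number of stacked transformer encoder blocks
    mlp_head_units = [512, 128]                      # hidden sizes of the final MLP head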

3.2.4 Data Augmentation

Data augmentation is crucial for enhancing the diversity of training data. By applying
transformations like flipping, rotation, and zooming, the model can generalize better and avoid
overfitting.

 rescale=1.0/255.0 normalizes the pixel values to the range [0, 1].

 rotation_range, width_shift_range, height_shift_range, and other parameters define the
augmentation techniques.
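
One possible configuration of the ImageDataGenerator described above is sketched below; the parameter values are illustrative, and x_train, y_train, and batch_size are assumed from the earlier sketches. This legacy utility pairs most naturally with the TensorFlow backend, while Keras preprocessing layers (as in Section 3.1.4) are the backend-agnostic alternative.

    from tensorflow.keras.preprocessing.image import ImageDataGenerator

    datagen = ImageDataGenerator(
        rescale=1.0 / 255.0,      # normalize pixel values to [0, 1]
        rotation_range=15,        # random rotations of up to 15 degrees
        width_shift_range=0.1,    # random horizontal shifts
        height_shift_range=0.1,   # random vertical shifts
        horizontal_flip=True,     # random left-right flips
        zoom_range=0.1,           # random zooming
    )
    train_generator = datagen.flow(x_train, y_train, batch_size=batch_size)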
3.2.5 Define MLP (Multi-Layer Perceptron) Function

The Multi-Layer Perceptron (MLP) is used in the Vision Transformer for classification. It
processes the final output from the Transformer layers.

The mlp_block consists of:

 A fully connected layer (Dense) with ReLU activation.

 A dropout layer to prevent overfitting.

 The final SoftMax layer to classify into 100 classes (for CIFAR-100).
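
One way to write this block in Keras is sketched below; the helper names and the default dropout rate are assumptions.

    from keras import layers

    def mlp_block(x, hidden_units, dropout_rate=0.1):
        # fully connected layers with ReLU activation and dropout
        for units in hidden_units:
            x = layers.Dense(units, activation="relu")(x)
            x = layers.Dropout(dropout_rate)(x)
        return x

    def classification_output(x, num_classes=100):
        # final SoftMax layer over the 100 CIFAR-100 classes
        return layers.Dense(num_classes, activation="softmax")(x)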

3.2.6 Create Patches Layer

For Vision Transformers, we divide the image into smaller patches. Each patch is treated as a
token to be processed by the transformer.

This function splits the images into patches, reshapes them into a suitable format, and prepares
them for further encoding by the transformer model.
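
A backend-agnostic sketch of such a patch-splitting layer is shown below, implemented with reshape and transpose operations from keras.ops rather than a dedicated patch-extraction op; it assumes the image height and width are divisible by the patch size.

    from keras import layers, ops

    class Patches(layers.Layer):
        def __init__(self, patch_size, **kwargs):
            super().__init__(**kwargs)
            self.patch_size = patch_size

        def call(self, images):
            # images: (batch, height, width, channels)
            p = self.patch_size
            h, w, c = images.shape[1], images.shape[2], images.shape[3]
            x = ops.reshape(images, (-1, h // p, p, w // p, p, c))
            x = ops.transpose(x, (0, 1, 3, 2, 4, 5))
            # flatten each patch into one token vector of length p * p * c
            return ops.reshape(x, (-1, (h // p) * (w // p), p * p * c))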

3.2.7 Patch Encoder Layer

The patch encoder adds positional embeddings to each patch to provide the model with spatial
information.

The positional encoding helps the model understand the order of patches, an essential feature
for image data where the spatial relationship between patches matters.
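
A matching patch-encoder layer could be written as follows: a linear projection of each flattened patch plus a learned positional embedding, with num_patches and embedding_dim taken from the hyperparameter sketch.

    from keras import layers, ops

    class PatchEncoder(layers.Layer):
        def __init__(self, num_patches, embedding_dim, **kwargs):
            super().__init__(**kwargs)
            self.num_patches = num_patches
            self.projection = layers.Dense(embedding_dim)        # linear patch embedding
            self.position_embedding = layers.Embedding(
                input_dim=num_patches, output_dim=embedding_dim  # learned positional embedding
            )

        def call(self, patches):
            positions = ops.arange(0, self.num_patches)
            return self.projection(patches) + self.position_embedding(positions)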

3.2.8 Build Vision Transformer

Now, we can stack multiple transformer layers (self-attention and MLP) to construct the Vision
Transformer.

In this function:

 We first create patches and encode them.

 The transformer layers consist of multi-head attention followed by a feed-forward
network.

 The final output is passed through an MLP to classify into one of the 100 CIFAR-100
categories.
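
Putting the pieces together, a compact model-building function might look like the sketch below. It assumes the Patches and PatchEncoder layers, the mlp_block helper, and the hyperparameters from the earlier sketches, and it pools the encoded tokens by flattening rather than using a dedicated [CLS] token, which is a common simplification for small ViTs.

    import keras
    from keras import layers

    def build_vit():
        inputs = keras.Input(shape=input_shape)                        # raw 32x32x3 CIFAR images
        resized = layers.Resizing(image_size, image_size)(inputs)      # resize to the ViT input size
        patches = Patches(patch_size)(resized)                         # split into patches
        encoded = PatchEncoder(num_patches, embedding_dim)(patches)    # project + add positions

        for _ in range(num_layers):
            # multi-head self-attention with a residual connection
            x1 = layers.LayerNormalization(epsilon=1e-6)(encoded)
            attention = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dim)(x1, x1)
            x2 = layers.Add()([attention, encoded])
            # feed-forward network with a residual connection
            x3 = layers.LayerNormalization(epsilon=1e-6)(x2)
            x3 = mlp_block(x3, hidden_units=[embedding_dim * 2, embedding_dim])
            encoded = layers.Add()([x3, x2])

        representation = layers.LayerNormalization(epsilon=1e-6)(encoded)
        representation = layers.Flatten()(representation)
        representation = layers.Dropout(0.3)(representation)
        features = mlp_block(representation, hidden_units=mlp_head_units)
        outputs = layers.Dense(num_classes, activation="softmax")(features)
        return keras.Model(inputs=inputs, outputs=outputs)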

3.2.9 Train Model

After building the Vision Transformer model, we compile it with an optimizer (Adam) and a
loss function (sparse categorical cross-entropy for multi-class classification).

The model is trained on the CIFAR-100 dataset using augmented data. After training, we save
the best weights using callbacks or manually during the training loop.
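
A sketch of the compilation and training step follows, reusing build_vit and the hyperparameters from the earlier sketches; the checkpoint filename is illustrative. For simplicity it trains on the normalized arrays directly; wiring in the augmentation generator from the previous step follows the same pattern.

    import keras

    model = build_vit()
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=[
            "accuracy",
            keras.metrics.SparseTopKCategoricalAccuracy(k=5, name="top-5-accuracy"),
        ],
    )

    checkpoint = keras.callbacks.ModelCheckpoint(
        "vit_cifar100.weights.h5",          # illustrative path for the best weights
        monitor="val_accuracy",
        save_best_only=True,
        save_weights_only=True,
    )

    history = model.fit(
        x_train / 255.0,
        y_train,
        batch_size=batch_size,
        epochs=epochs,
        validation_data=(x_test / 255.0, y_test),
        callbacks=[checkpoint],
    )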

3.2.10 Evaluate & Plot

Finally, after training the Vision Transformer, we evaluate its performance on the test set and
visualize the results.

The test accuracy is printed, and the training/validation accuracy curve is visualized, helping us
understand the model’s learning performance.
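
A closing sketch of the evaluation and plotting step, assuming the model, history, and test arrays from the previous sketches:

    import matplotlib.pyplot as plt

    test_loss, test_accuracy, test_top5 = model.evaluate(x_test / 255.0, y_test, verbose=0)
    print(f"Test accuracy: {test_accuracy:.4f}   Test top-5 accuracy: {test_top5:.4f}")

    plt.plot(history.history["accuracy"], label="train accuracy")
    plt.plot(history.history["val_accuracy"], label="validation accuracy")
    plt.xlabel("Epoch")
    plt.ylabel("Accuracy")
    plt.legend()
    plt.show()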

3.3 Program

Figure 3.2: Program

3.4 OUTPUT

Figure 3.3: Epochs

Figure 3.4: Train and Validation loss

Figure 3.5: Train and Validation Top-5 accuracy


CHAPTER 4:MODEL USAGE

4.1 Advantages of Vision Transformers

Vision Transformers (ViTs) have quickly gained popularity in the computer vision community,
especially due to their remarkable ability to outperform traditional Convolutional Neural
Networks (CNNs) on various image classification tasks. This section explores the key
advantages of ViTs, including their enhanced performance, flexible architecture, and improved
interpretability.

4.1.1 Enhanced Performance

One of the most significant advantages of Vision Transformers is their enhanced performance
on image classification benchmarks, especially when trained on large datasets. Several factors
contribute to this improvement:

a. Efficient Handling of Global Context: Traditional CNNs are limited by their local
receptive fields, which means they can only capture local patterns and relationships in an
image. While pooling layers help to capture higher-level features, CNNs still struggle to
understand long-range dependencies between pixels. In contrast, ViTs utilize the self-attention
mechanism to consider global context from the very beginning of the network, allowing them
to capture relationships between all patches of the image, regardless of their spatial proximity.
This ability to capture global dependencies makes ViTs particularly powerful for complex
image classification tasks.

b. State-of-the-Art Results on Large Datasets: ViTs have consistently outperformed CNNs on
large-scale image classification benchmarks such as ImageNet and CIFAR-100, particularly when
pretrained on very large datasets such as JFT-300M. This is because ViTs can effectively model complex patterns
and relationships in high-dimensional data. For example, ViTs have achieved state-of-the-art
accuracy on ImageNet when trained with a sufficient amount of data and computational
resources. The self-attention mechanism in ViTs enables them to learn more abstract and
complex feature representations compared to CNNs, leading to superior performance on
diverse tasks.

c. Improved Accuracy with More Data: Unlike CNNs, which require a significant amount of
hand-crafted design and careful architecture tuning to improve performance, ViTs generally
improve in accuracy as more data is made available. The large number of parameters in the
ViT architecture is well-suited for handling vast amounts of data, making it easier for the
model to generalize better and achieve higher accuracy. This is particularly evident when
training on large datasets like ImageNet or other high-resolution image datasets.

4.1.2 Flexible Architecture

The Vision Transformer’s architecture offers a high degree of flexibility, which is another
major advantage over traditional CNNs. The modular and scalable design of ViTs allows for
easy adaptation to a wide variety of tasks and use cases.

a. Modular Design: ViTs are built on a modular transformer architecture, which consists of
multiple layers of self-attention and feed-forward networks. This design allows for flexibility in
the number of layers, the size of the attention heads, and the dimension of the embeddings.
These hyperparameters can be adjusted depending on the specific task, dataset size, and
available computational resources. For example, for smaller datasets or less complex tasks, the
model can be simplified by reducing the number of transformer layers or attention heads, while
for more complex tasks, the model can be scaled up to improve performance.

b. Adaptability to Various Data Types: While ViTs were originally developed for image
data, their transformer-based architecture is not limited to just image classification. They can
be extended and adapted to other types of data, such as time-series data, medical imaging, and
even multimodal data that includes both text and images. This adaptability makes ViTs a
versatile tool in many areas of machine learning and computer vision.

c. Reduced Need for Domain-Specific Design: One of the challenges with CNNs is the need
for specialized architectural choices tailored to the specific characteristics of the data or task,
such as choosing the appropriate kernel size, stride, or pooling strategy. In contrast, the design
of ViTs is more uniform across different domains, as it relies on a general self-attention
mechanism and a standardized approach to processing image patches. This reduces the need for
domain-specific design and enables faster development and experimentation with fewer
architectural adjustments.

4.1.3 Improved Interpretability

Interpretability has been a significant concern with deep learning models, particularly CNNs,
which are often referred to as "black boxes" due to their complex and opaque decision-making
processes. Vision Transformers, with their self-attention mechanism, offer improved
interpretability in comparison.

a. Self-Attention Mechanism: The key feature of the Vision Transformer is the self-attention
mechanism, which allows the model to focus on different parts of the input image as it makes
predictions. This means that, during the forward pass, the model learns which image patches
are most relevant to the decision it is making, giving insights into how the model processes the
image. The self-attention mechanism effectively highlights the relationships between distant
patches and makes it possible to visualize which regions of the image the model attends to
most when making a classification decision.

b. Visualization of Attention Maps: One of the primary ways ViTs improve interpretability is
through the use of attention maps. By visualizing the attention weights between different
patches of the image, researchers can understand which areas of the image are most important
for the model’s decision. These attention maps provide insights into the model's reasoning
process and can help uncover biases, errors, or areas for improvement. For instance, if a ViT
model classifies a dog image correctly, attention maps may reveal that the model focused on
the dog's face or body, offering a clearer understanding of how the model arrives at its
decision.

c. Explainability in Decision Making: Unlike CNNs, where decisions are often made based
on local features extracted through convolutional filters, ViTs make decisions based on the
relationships between patches, which are influenced by the self-attention mechanism. This
makes it easier to interpret the model's decisions, as attention can be traced directly back to
image regions, leading to more transparent and understandable predictions. This interpretability
is crucial in fields such as medical imaging, where understanding the reasoning behind a
model’s decision is as important as the prediction itself.

4.2 Challenges and Limitations of Vision Transformers

While Vision Transformers (ViTs) have demonstrated remarkable success in image
classification and other computer vision tasks, there are several challenges and limitations that
need to be addressed. These challenges arise from the data requirements, computational
complexity, and potential biases that can be present during the training process. In this section,
we explore these issues in detail and discuss their impact on the performance and practicality of
Vision Transformers.

4.2.1 Data Requirements

One of the most significant challenges associated with Vision Transformers is their need for
large and diverse datasets to achieve optimal performance. This issue arises due to the
architecture and training requirements of ViTs, which can make them less suitable for smaller
datasets or low-resource environments.

a. Need for Large Datasets: ViTs are designed to model global relationships across all parts
of an image, which requires a large number of parameters. To effectively train a Vision
Transformer, it is essential to provide it with a vast amount of labelled data to prevent
overfitting and ensure the model generalizes well. The large number of parameters in the
transformer layers (such as the multi-head attention mechanism) allows the model to learn
intricate patterns and relationships, but without sufficient data, the model may struggle to
generalize and perform poorly.

b. Limited Performance on Small Datasets: While CNNs can achieve decent results even
with relatively smaller datasets due to their ability to focus on local patterns and their inductive
biases (such as translation invariance), ViTs rely heavily on large-scale datasets to perform
effectively. On small datasets, ViTs may fail to perform as well as CNNs because they are
prone to overfitting, as they have far more parameters than CNNs and fewer regularization
mechanisms in place. This challenge highlights the importance of having a sufficiently large
and diverse dataset when working with Vision Transformers.

c. Data Augmentation and Pretraining: To overcome this limitation, techniques such as data
augmentation (random rotations, flips, cropping, etc.) and pretraining on large datasets (such as
ImageNet) can help improve the performance of ViTs on smaller datasets. Pretraining allows
the model to learn general features from a larger dataset and then fine-tune on the specific task
or smaller dataset. However, this process still requires access to substantial computational
resources, making it difficult for smaller research teams or organizations to effectively deploy
ViTs.

4.2.2 Computational Complexity

The computational complexity of Vision Transformers is another significant challenge. ViTs
typically require more computational power than CNNs due to their architecture, which
involves processing and attending to multiple image patches and learning long-range
dependencies.

a. High Memory and Computational Requirements: The key factor contributing to the
computational cost of ViTs is their use of self-attention. In a Vision Transformer, every patch
of an image is attended to by every other patch, which leads to a computational complexity of
O(n^2) in the number of image patches, where n is the number of patches. This means that as the
image size (and hence the number of patches) increases, the computational cost grows
quadratically. For large images, this can quickly become prohibitively expensive in terms of
both memory and processing power, requiring access to high-performance hardware like GPUs
or TPUs.

b. Hardware Requirements: Training large Vision Transformer models on datasets such as
ImageNet or JFT-300M requires significant computational resources. High-performance GPUs
or TPUs are often necessary to handle the large number of parameters and to speed up training.
For smaller research teams or individuals without access to such hardware, training ViTs can
be a major bottleneck, limiting their ability to experiment with or deploy these models
effectively.

c. Training Time: ViTs also suffer from longer training times compared to CNNs. The self-
attention mechanism and the large number of parameters in the transformer layers mean that
training ViTs can be time-consuming, especially for large datasets. Even with optimized
hardware, training ViTs to convergence can take weeks or even months, depending on the size
of the model and dataset. This makes the iterative process of model selection and
hyperparameter tuning more time-consuming compared to CNNs, which tend to require less
training time and resources.

d. Efficiency Improvements: Several research efforts have focused on improving the
efficiency of ViTs. Techniques such as sparse attention, low-rank approximations, and the use
of hybrid models (combining CNNs and transformers) have been proposed to reduce the
computational cost of ViTs. However, these approaches are still in development and may not
yet be universally applicable.

4.2.3 Potential Biases

Like all machine learning models, Vision Transformers are susceptible to biases present in the
training data. Since these models are data-driven, the quality and diversity of the dataset used
to train them directly affect their performance and fairness.

a. Data Imbalance: One of the most common biases that can affect ViTs is data imbalance. If
the training dataset contains a disproportionate number of images from certain classes or
categories, the model may develop a bias towards those classes. This can lead to poor
performance on underrepresented classes, especially when the data is imbalanced across many
categories. In the case of ViTs, which tend to require large datasets, an imbalance in data can
be exacerbated by the sheer size of the dataset, leading to overfitting on dominant classes.

b. Cultural and Demographic Bias: Another source of bias is cultural or demographic bias in
the training data. If the dataset contains images that predominantly represent certain cultures,
races, or environments, the model may learn these biases and perform poorly when applied to
images from different contexts. For example, a Vision Transformer trained primarily on
Western facial recognition datasets may fail to recognize faces from non-Western cultures
accurately.

c. Model Fairness and Accountability: The biases learned by Vision Transformers can lead
to unfair or discriminatory outcomes, especially in sensitive applications like facial recognition,
medical imaging, or autonomous driving. Understanding and mitigating these biases is crucial
for ensuring the fairness and ethical use of ViTs. Techniques such as bias correction, diverse
and representative datasets, and fairness-aware model training are essential for addressing these
concerns.

d. Bias Mitigation Strategies: There are ongoing research efforts aimed at mitigating biases in
machine learning models, including ViTs. These strategies include using more diverse datasets,
implementing regularization techniques to reduce overfitting, and developing methods for
auditing models for fairness. However, fully eliminating biases remains a challenging task, and
continuous attention to this issue is necessary to build more equitable models.

4.3 Future Developments in Vision Transformers

As Vision Transformers (ViTs) continue to revolutionize the field of image classification, there
are ongoing efforts to address their challenges and push the boundaries of their capabilities.
The future developments of ViTs are focused on creating more efficient architectures,
leveraging unsupervised learning techniques to reduce the need for labelled data, and
integrating multiple modalities such as language models for richer, multi-modal understanding.
These advancements hold great potential for expanding the use cases of ViTs across various
domains and improving their practical applications.

4.3.1 Efficient Architectures

The efficiency of Vision Transformers is one of the primary concerns for their widespread
adoption, especially for applications that require real-time processing or have limited
computational resources. To address these challenges, researchers are focusing on developing
more efficient ViT architectures that retain high performance while reducing computational
costs.

a. Sparse Attention Mechanisms: One of the most promising developments in making ViTs
more efficient is the introduction of sparse attention mechanisms. Traditional self-attention in
ViTs requires pairwise attention between all patches, which leads to a quadratic complexity of
O(n^2), where n is the number of patches. Sparse attention methods, such as
Longformer and Informer, aim to reduce this computational burden by limiting the attention to
only a subset of relevant patches. These sparse attention techniques allow ViTs to scale more
effectively to larger images and datasets, enabling faster training and inference while retaining
most of the performance benefits of full attention.

b. Low-Rank Approximations: Another approach to improving the efficiency of ViTs is the
use of low-rank approximations. In this method, the attention mechanism is approximated by
lower-dimensional representations, which reduces the number of parameters and operations
required for processing each patch. Techniques like kernelized attention or low-rank
factorization aim to speed up the self-attention computation without significantly sacrificing
the quality of learned representations. This can make ViTs more computationally feasible for
deployment in real-time applications or on edge devices with limited hardware resources.

c. Hybrid Models: Researchers are also exploring hybrid models that combine the strengths of
ViTs with other architectures, such as Convolutional Neural Networks (CNNs). For example,
CNNs are well-suited to capture local spatial patterns, while ViTs excel at modelling global
dependencies. By combining these two architectures, hybrid models can achieve both high
performance and computational efficiency. Additionally, CNNs can be used in the early stages
of the network to extract low-level features, which are then passed to the ViT for capturing
higher-level global relationships. This hybrid approach allows for the benefits of both
architectures while mitigating some of their individual drawbacks.

d. Quantization and Pruning: Techniques such as quantization and pruning can also be used
to reduce the size and computational cost of ViTs. Quantization involves reducing the precision
of the model weights, which can significantly reduce memory usage and improve inference
speed without sacrificing too much accuracy. Pruning, on the other hand, involves removing
less important weights or attention heads, leading to a sparser network that is more efficient in
terms of both memory and computation.

4.3.2 Unsupervised Learning

Another key area of future development for Vision Transformers is the advancement of
unsupervised learning techniques to reduce the reliance on large amounts of labelled data.
While supervised learning has been the dominant paradigm for training ViTs, it requires vast
amounts of labelled data, which can be costly and time-consuming to collect. Unsupervised
learning, on the other hand, enables models to learn from unlabelled data, making them more
scalable and adaptable.

a. Self-Supervised Learning: Self-supervised learning (SSL) is a type of unsupervised
learning that has shown great promise in reducing the need for labelled data. In SSL, models
learn to predict parts of the input data from other parts without explicit labels. For example, in
the context of ViTs, a model might learn to predict the relative positions of image patches or to
reconstruct missing parts of an image. By leveraging unlabelled data, self-supervised learning
can help ViTs learn useful feature representations that can later be fine-tuned on smaller
labelled datasets for downstream tasks.

Recent advancements in SSL, such as contrastive learning, have enabled models to learn from
large amounts of unlabelled data and perform well on tasks such as image classification, object
detection, and segmentation. By combining ViTs with SSL techniques, it is possible to reduce
the amount of labelled data required while still achieving high performance, making ViTs more
accessible for applications where labelled data is scarce or expensive.
b. Semi-Supervised Learning: In addition to self-supervised learning, semi-supervised
learning is another promising direction for reducing the reliance on labelled data. In semi-
supervised learning, a model is trained using a small amount of labelled data alongside a large
amount of unlabelled data. This approach can be particularly useful in situations where
obtaining labelled data is expensive or time-consuming. Techniques such as pseudo-labelling,
where the model generates its own labels for unlabelled data, or consistency regularization,
where the model is encouraged to make consistent predictions across perturbations, can be
applied to ViTs to make better use of limited labelled data.

c. Zero-Shot Learning: Zero-shot learning is another technique that has gained traction in
recent years. This approach enables models to classify objects or recognize patterns in
categories they have never seen before during training. In ViTs, zero-shot learning can be
achieved by leveraging pre-trained models and transferring knowledge learned from a large
dataset to new, unseen tasks. This could significantly reduce the need for task-specific labelled
data, making ViTs more versatile and efficient in a variety of domains.

4.3.3 Multi-Modal Integration

One of the most exciting developments for Vision Transformers is their integration with
language models for multi-modal understanding. The combination of visual and textual data
has the potential to improve the model’s ability to understand and interpret complex
information, leading to more robust performance across a wider range of tasks.

a. ViTs with Language Models: Combining ViTs with language models, such as transformers
trained on text data, enables models to process both visual and textual information
simultaneously. This is particularly useful in tasks such as image captioning, visual question
answering (VQA), and multimodal retrieval, where the model must understand both the content
of an image and its associated textual description. By training on paired image-text datasets,
models can learn to map visual features to semantic concepts described in language, improving
the model’s ability to handle diverse, multimodal data.

b. Vision-Language Pretraining: Pretraining ViTs on large-scale image-text datasets, similar
to how models like CLIP (Contrastive Language-Image Pretraining) and DALL·E are trained,
enables them to perform well across multiple modalities. Vision-Language Pretraining (VLP)
allows the model to learn joint representations of both image and text, which can be applied to
a variety of multimodal tasks. This cross-modal understanding is particularly valuable for
applications such as autonomous driving, where the system must interpret both visual data
(from cameras) and textual or sensor data (from other sources like GPS or textual instructions).

c. Enhanced Cross-Modal Reasoning: With advancements in multi-modal integration, Vision
Transformers will be able to perform more advanced forms of cross-modal reasoning, such as
associating textual descriptions with corresponding visual elements. This ability to reason
across modalities opens up exciting possibilities for applications in AI-driven content creation,
enhanced human-computer interaction, and autonomous systems that need to make decisions
based on complex, multi-source data.

4.4 Practical Applications of Vision Transformers

Vision Transformers (ViTs) have emerged as powerful tools in various practical applications
due to their ability to learn rich, global feature representations from images. By capturing long-
range dependencies and modelling intricate relationships between different parts of an image,
ViTs have proven to be effective in fields such as medical imaging, autonomous driving, and
surveillance. In this section, we explore the key practical applications of Vision Transformers
in these domains, emphasizing their potential to revolutionize each field.

4.4.1 Medical Imaging

Medical imaging is a domain where Vision Transformers have the potential to make a
significant impact. The ability of ViTs to capture complex patterns and relationships across
different regions of an image makes them especially suited for tasks like disease diagnosis,
tumour detection, and organ segmentation. In the medical field, accurate and early detection of
diseases is critical for improving patient outcomes, and ViTs offer an innovative approach to
enhancing diagnostic tools.

a. Disease Diagnosis: In the diagnosis of various diseases, such as cancer, heart disease, and
neurological conditions, the accuracy of automated systems is essential for timely intervention.
Vision Transformers have shown promise in improving the accuracy of diagnostic models by
enabling more precise feature extraction from medical images. For example, in radiology, ViTs
can be used to analyse X-ray, CT, and MRI scans to detect abnormalities like tumours,
fractures, or lesions. Their ability to model long-range dependencies helps them identify subtle
patterns in medical images that might be missed by traditional methods.

For instance, in breast cancer detection, ViTs can be trained to analyse mammograms and
identify malignant tumours. By processing the image as a whole, ViTs can detect complex
patterns that may not be apparent in localized regions, allowing for more accurate identification
of early-stage cancers. In pathology, ViTs can also assist in analysing histopathological images
of tissue samples, aiding in the detection of cancerous cells and abnormal growths.

b. Tumour Detection: Tumour detection, particularly in imaging modalities such as MRI and
CT scans, is another area where Vision Transformers excel. Tumours often present with
complex and subtle characteristics that require a model capable of understanding the intricate
spatial relationships within the image. ViTs, with their ability to consider global information
from all regions of the image, are well-suited for this task. By learning the spatial hierarchies
and relationships between tumour features, ViTs can accurately identify and localize tumours,
providing clinicians with valuable insights to guide treatment decisions.

c. Organ Segmentation: Vision Transformers are also being applied in organ segmentation
tasks, such as segmenting the brain, liver, or heart in medical scans. Accurate organ
segmentation is crucial for treatment planning, especially in the case of radiotherapy, where
precise targeting of tumour regions within the organ is necessary. ViTs can segment organs
with high accuracy, even in cases where traditional methods struggle due to variations in organ
shape and size or the presence of lesions.

4.4.2 Autonomous Driving

In the rapidly advancing field of autonomous driving, Vision Transformers have shown great
promise in improving object detection, scene understanding, and decision-making processes.
Autonomous vehicles rely heavily on computer vision to perceive their environment, and
Vision Transformers can enhance this perception by providing more robust and accurate
models for interpreting the visual data from cameras and sensors.

a. Object Detection: Object detection is a critical task for autonomous vehicles, as it involves
identifying and locating objects such as pedestrians, other vehicles, traffic signs, and obstacles
in real-time. Vision Transformers are particularly effective in this domain due to their ability to
capture long-range dependencies between various objects in the scene. Unlike traditional
CNNs, which focus on local features, ViTs can learn more complex global representations,
which is essential for understanding relationships between objects and their context in a driving
scenario.
For example, ViTs can detect and classify pedestrians in complex urban environments where
occlusions and varying distances between objects pose challenges. By modelling the entire
scene and considering long-range interactions between objects, ViTs can enhance the vehicle’s
ability to navigate safely, even in cluttered environments with multiple moving objects.

b. Scene Understanding: ViTs are also well-suited for scene understanding tasks, such as road
segmentation, lane detection, and predicting the behaviour of surrounding objects. Scene
understanding is vital for autonomous vehicles to make informed decisions, such as lane
changes, turns, or emergency manoeuvres. Vision Transformers can segment different parts of
the scene, including the road, sidewalks, and other vehicles, with high accuracy, enabling the
vehicle to understand the layout of its environment.

Furthermore, ViTs can be used to predict the future movements of objects based on their
current positions and behaviours. For example, if a pedestrian is walking toward a crosswalk,
the ViT can analyse the situation and help the vehicle predict the pedestrian's trajectory,
allowing the vehicle to adjust its speed or trajectory accordingly.

c. End-to-End Learning: In autonomous driving systems, ViTs are increasingly being
incorporated into end-to-end learning frameworks. In this approach, the vehicle is trained to
learn a policy that maps raw sensor inputs (such as camera images) directly to control
commands (such as steering, acceleration, and braking). By leveraging the global feature
extraction capabilities of Vision Transformers, end-to-end learning systems can improve
decision-making and reduce the need for hand-crafted features or rule-based systems, making
the autonomous vehicle more adaptable to diverse and dynamic environments.

4.4.3 Surveillance and Security

Surveillance and security applications are another area where Vision Transformers are making
an impact. From facial recognition to anomaly detection and activity recognition, ViTs are
helping to enhance security systems by providing more accurate and robust models for
analysing visual data. In security applications, real-time processing and accuracy are critical for
identifying potential threats and responding to incidents quickly.

a. Facial Recognition: Facial recognition is one of the most common applications in security
and surveillance. Vision Transformers are particularly well-suited for this task due to their
ability to capture fine-grained details in facial features. Unlike traditional CNNs, which focus
on local feature extraction, ViTs can consider the entire image of a face, learning complex
relationships between facial landmarks and enhancing recognition accuracy. This is particularly
important in challenging conditions such as low lighting, occlusions, or when the subject is at a
slight angle.

ViTs can improve facial recognition systems by enabling them to generalize better across
different lighting conditions, poses, and expressions. This results in more reliable and secure
authentication systems, which are increasingly being used in access control, banking, and law
enforcement.

b. Anomaly Detection: Anomaly detection is a crucial task in surveillance systems, where
unusual behaviour or events need to be identified in real time. Vision Transformers can analyse
surveillance footage and detect anomalous activities such as suspicious behaviour,
unauthorized access, or potential threats. ViTs can capture long-range contextual information
and detect subtle variations in scenes, making them more effective at identifying unusual
events in complex environments.

For example, in a crowded public space, a Vision Transformer might detect a person acting
strangely or loitering in an area where they do not belong. By analysing the global context of
the scene, the ViT can accurately flag this behaviour for further review, enhancing the
effectiveness of security systems.

c. Activity Recognition: In addition to detecting anomalies, ViTs can be used for activity
recognition, which involves understanding and classifying actions or events within a video
stream. This is particularly useful for monitoring public spaces, transportation systems, or
workplaces. Vision Transformers, by leveraging their ability to model both spatial and
temporal dependencies, can improve the recognition of complex activities, such as identifying
whether a person is engaged in a fight, fleeing from a crime scene, or interacting with an object
in an unusual way.

By incorporating temporal information, ViTs can analyse the sequence of frames in a video to
understand the progression of activities, enabling more accurate classification of events and
enhancing the overall reliability of surveillance systems.

CHAPTER 5:CONCLUSION

5.1 Conclusion

This report has explored the emergence of Vision Transformers (ViTs) and their application to
the task of image classification. We have examined the architecture, advantages, and practical
applications of ViTs, while also addressing challenges and limitations associated with their use.
The findings suggest that ViTs have made significant strides in the realm of computer vision,
demonstrating enhanced performance over traditional Convolutional Neural Networks (CNNs)
and offering new possibilities for a variety of practical applications, including medical
imaging, autonomous driving, and surveillance. However, there are still several challenges and
areas for improvement in the field, particularly concerning data requirements and
computational complexity.

5.1.1 Summary of Findings

1. Advancements in Image Classification: Image classification has undergone
significant advancements with the introduction of deep learning models, particularly
Convolutional Neural Networks (CNNs). However, Vision Transformers (ViTs) have
introduced a paradigm shift by utilizing self-attention mechanisms and transformer-
based architectures, which offer several advantages over CNNs, particularly in handling
long-range dependencies and learning richer feature representations. This ability to
capture global information has led to improved classification accuracy on large and
complex datasets.

2. ViT Architecture and Training: Vision Transformers utilize a unique approach in
which images are divided into fixed-size patches, which are then processed by
transformer layers. These layers, equipped with self-attention mechanisms, enable the
model to attend to various parts of an image simultaneously, capturing global
relationships. Training these models, however, requires large datasets and substantial
computational resources, with GPUs or TPUs being necessary for effective training.
The long training times are a common challenge, especially when working with large,
high-resolution datasets.

3. Practical Applications: The report highlighted several practical applications where
ViTs are making a significant impact. In medical imaging, ViTs are being used to
improve the accuracy of disease detection, such as in cancer diagnosis and tumour
detection, thanks to their ability to understand complex patterns in medical scans. In
autonomous driving, ViTs have been applied to object detection and scene
understanding, improving the safety and efficiency of self-driving vehicles. In
surveillance and security, ViTs are enhancing facial recognition, anomaly detection,
and activity recognition, making security systems more reliable and responsive.

4. Advantages Over CNNs: One of the key advantages of ViTs over CNNs is their
ability to capture global contextual information, which enables more accurate and
robust feature extraction from images. This has been particularly beneficial in
applications that require fine-grained recognition and understanding of complex visual
data, such as medical imaging and autonomous driving. Moreover, the flexible and
modular architecture of ViTs allows for easy adaptation to different tasks and datasets,
making them a versatile tool in the computer vision toolbox.

5. Challenges and Limitations: Despite their success, Vision Transformers are not
without their challenges. They require large, diverse datasets to perform effectively, and
their computational complexity can be prohibitive for smaller research teams or
organizations with limited resources. Furthermore, the high training costs and long
training times are significant barriers to entry. Additionally, like all machine learning
models, ViTs are susceptible to biases in the training data, which can affect their
generalization and accuracy. These challenges must be addressed to unlock the full
potential of ViTs in real-world applications.
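To make the patch-based pipeline summarised in point 2 above concrete, the following minimal PyTorch sketch embeds fixed-size patches with a strided convolution, prepends a learned [CLS] token, adds position embeddings, runs a stack of standard transformer encoder layers, and classifies from the [CLS] output. The hyperparameters (patch size 16, embedding dimension 192, four layers) are illustrative choices, and PyTorch's built-in encoder layer stands in for the original ViT block, so this is a sketch of the idea rather than a reference implementation.

import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patch embedding + transformer encoder + [CLS] head."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=4, classes=1000):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Split the image into fixed-size patches and linearly embed each one
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                    # x: (batch, 3, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (batch, num_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # add positional information
        x = self.encoder(x)                                  # global self-attention over patches
        return self.head(x[:, 0])                            # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))
print(logits.shape)                                          # torch.Size([2, 1000])

Training such a model end-to-end on a large labelled dataset is exactly the computationally demanding step discussed above, which is why GPUs or TPUs are typically required.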

5.1.2 Future Work and Recommendations

1. Improved Efficiency and Scalability: One of the most promising areas for future
research is improving the efficiency and scalability of Vision Transformers. Current
ViT models require extensive computational resources and large datasets, which may
limit their accessibility for smaller projects or organizations. Research into more
efficient ViT architectures, such as hybrid models that combine the strengths of CNNs
and transformers, could help mitigate the high computational costs. Techniques like
pruning, quantization, and knowledge distillation may also be explored to reduce the model size and speed up inference times without compromising performance; a minimal knowledge-distillation loss sketch follows this list.

2. Unsupervised and Semi-Supervised Learning: The need for large labelled datasets is
a major bottleneck in training Vision Transformers. A promising direction for future
work is the exploration of unsupervised and semi-supervised learning techniques, which
can reduce the reliance on labelled data. By leveraging techniques such as self-
supervised learning, ViTs could be trained on unlabelled data, making them more
accessible and applicable to a wider range of real-world scenarios. Advances in this
area could lead to more efficient use of data, allowing ViTs to perform well even with
limited labelled datasets; a minimal masked-patch pre-training sketch follows this list.

3. Multi-Modal Integration: Another exciting avenue for future research is the integration of Vision Transformers with other modalities, such as natural language processing (NLP). Multi-modal models, which combine vision and language, are gaining traction in fields like robotics, autonomous systems, and human-computer interaction. For instance, Vision Transformers could be combined with language models such as GPT or BERT to enable more sophisticated scene understanding and decision-making. In autonomous driving, for example, multi-modal models could incorporate both visual data and other sensor information to improve perception and decision-making; a CLIP-style contrastive training sketch follows this list.

4. Addressing Bias and Fairness: As with all machine learning models, the potential for
bias in Vision Transformers is a critical concern, particularly when the models are
deployed in real-world applications such as facial recognition and medical imaging.
Future research should focus on identifying and mitigating biases in training data, as
well as developing methods to improve the fairness and transparency of ViT models.
Techniques like adversarial training and fairness constraints could help ensure that ViTs
make decisions that are equitable and unbiased, especially in sensitive areas like
healthcare and security.

5. Improving Generalization: While ViTs have demonstrated superior performance on large-scale datasets, their generalization to unseen or smaller datasets remains an area for improvement. Future work could explore methods to improve the robustness of ViTs to variations in data, such as domain adaptation techniques or data augmentation strategies; a short mixup augmentation sketch follows this list. Enhancing the ability of Vision Transformers to generalize across different tasks and environments will further broaden their applicability to real-world challenges.
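As a concrete illustration of the efficiency direction in point 1 above, a compact student model can be trained to match the softened predictions of a larger teacher, broadly in the spirit of knowledge distillation (the attention-based distillation of DeiT is a more elaborate variant). The sketch below shows only the loss computation; the temperature and weighting factor are illustrative assumptions.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Weighted sum of hard-label cross-entropy and softened teacher-student KL divergence."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1.0 - alpha) * soft

# Example with random logits for 8 samples and 100 classes
s, t = torch.randn(8, 100), torch.randn(8, 100)
y = torch.randint(0, 100, (8,))
print(distillation_loss(s, t, y).item())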
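For the self-supervised direction in point 2 above, one widely used idea is masked image modelling: a random subset of patches is hidden and the network is trained to reconstruct their pixels, so no labels are required. The sketch below is a loose, simplified variant of this idea (in the spirit of approaches such as SimMIM), with illustrative hyperparameters and without position embeddings; it is not a reproduction of any specific published method.

import torch
import torch.nn as nn

class MaskedPatchPretrainer(nn.Module):
    """Replace masked patches with a mask token, encode, then regress the hidden pixels."""
    def __init__(self, patch=16, dim=256, depth=4, heads=8, in_ch=3):
        super().__init__()
        self.patch = patch
        patch_dim = in_ch * patch * patch
        self.embed = nn.Linear(patch_dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.decoder = nn.Linear(dim, patch_dim)        # per-patch pixel regression
        # Note: position embeddings are omitted here for brevity

    def forward(self, images, mask_ratio=0.6):
        B, C, H, W = images.shape
        p = self.patch
        # Rearrange the image into (batch, num_patches, C*p*p) pixel patches
        patches = images.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        mask = torch.rand(B, patches.size(1), device=images.device) < mask_ratio
        tokens = self.embed(patches)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        recon = self.decoder(self.encoder(tokens))      # surrounding context fills in masked patches
        return ((recon - patches) ** 2)[mask].mean()    # loss only on the masked patches

loss = MaskedPatchPretrainer()(torch.randn(2, 3, 224, 224))
print(loss.item())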
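For the multi-modal direction in point 3 above, CLIP-style training aligns image and text embeddings with a symmetric contrastive loss over a batch of matched pairs. The sketch below assumes the vision and language encoders already exist and shows only the loss; the temperature value is an illustrative assumption.

import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature      # (batch, batch) similarity matrix
    targets = torch.arange(image_emb.size(0), device=image_emb.device)
    # Each image should match its own caption, and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: embeddings from hypothetical vision and text encoders, 8 pairs, 512-dimensional
print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())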
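For the generalisation direction in point 5 above, simple data-augmentation strategies such as mixup can be added to a ViT training loop in a few lines. The sketch below blends random pairs of images and their one-hot labels; the Beta distribution parameter is an illustrative choice.

import torch

def mixup(images, labels, num_classes, alpha=0.2):
    """Blend random pairs of images and their one-hot labels (mixup augmentation)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(images.size(0))
    one_hot = torch.nn.functional.one_hot(labels, num_classes).float()
    mixed_images = lam * images + (1.0 - lam) * images[perm]
    mixed_labels = lam * one_hot + (1.0 - lam) * one_hot[perm]
    return mixed_images, mixed_labels

x, y = torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))
mx, my = mixup(x, y, num_classes=10)
print(mx.shape, my.shape)    # torch.Size([8, 3, 224, 224]) torch.Size([8, 10])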