
Vision Transformers: Revolutionizing Computer Vision

The creation of a new type of neural network known as the vision transformer (ViT) has
revolutionized computer vision in recent years. Unlike typical convolutional neural networks
(CNNs), vision transformers process images using a patch-based method and a self-attention
mechanism, allowing them to model long-range dependencies between image patches and produce
state-of-the-art results on a variety of computer vision tasks.
In this article, we will examine the history of vision transformers, their design and
training process, their benefits and drawbacks, and their applications in various disciplines.
We will also discuss future research directions, implementation tips, and how vision
transformers stack up against other prominent computer vision techniques. Whether you are
new to computer vision or an experienced practitioner, this article will give you a thorough
grasp of vision transformers and their potential to reshape the field.
Neural Networks – How Do They Relate to ViT?
Neural networks are algorithms inspired by the structure and function of the human brain.
They are an effective tool for tackling complicated problems such as image recognition, speech
recognition, natural language processing, and many more.
A neuron is the fundamental building block of a neural network; it receives input from
other neurons or external sources, processes it, and produces an output. Neurons are organized
into layers, and one layer's output becomes the next layer's input.

Image Source: https://www.tibco.com/reference-center/what-is-a-neural-network


A neural network's architecture refers to how its neurons are organized and connected.
Numerous neural network architectures exist, such as feedforward networks, recurrent neural
networks (RNNs), convolutional neural networks (CNNs), and transformers.
Feedforward networks, often called multi-layer perceptrons (MLPs), are the most basic type of
neural network architecture. They comprise an input layer, one or more hidden layers, and an
output layer. Each layer's neurons are fully connected to the next layer's neurons, and each
neuron applies a non-linear activation function to its input.
RNNs are designed to process sequences of data, such as time series or natural language text.
They have recurrent connections that allow information to be passed from one time step to the
next, so they can learn dependencies in the data over time.
CNNs are built to handle spatial data such as images. They use convolutional layers to extract
features from the input, pooling layers to reduce the dimensionality of those features, and
fully connected layers to generate the final prediction.
Transformers, on the other hand, are a type of neural network architecture that processes
input data through self-attention. The network's ability to focus on different parts of the
input at different times allows it to capture both local and global relationships.
Vision transformers capture spatial relationships in images more effectively than other types
of neural networks, resulting in state-of-the-art performance on many computer vision tasks.
Self-Attention Mechanism - A Crucial Component of Vision Transformers
The self-attention mechanism is an essential component of vision transformers: it allows the
network to focus on different sections of the input data at different times, enabling it to
capture both local and global associations.

Image Source: https://vaclavkosar.com/ml/transformers-self-attention-mechanism-simplified


In a conventional feedforward neural network, each neuron in a given layer is connected to
all neurons in the next layer. In a self-attention mechanism, by contrast, each element of a
layer attends to every other element in that layer, including itself. The network then
computes a weighted sum over all the elements, with weights that depend on the similarity
between the current element and each of the others.
The self-attention mechanism can be expressed mathematically as follows:
Let X be the input sequence of length N and h_i be the hidden representation of the i-th
element in the sequence. The self-attention mechanism computes a new representation z_i for
each element of the sequence as follows:

z_i = Σ_{j=1}^{N} a_{i,j} · h_j

where a_{i,j} is the attention weight assigned to the j-th element by the i-th element. The
attention weight is obtained by normalizing the attention energies with a softmax:

a_{i,j} = exp(e_{i,j}) / Σ_{k=1}^{N} exp(e_{i,k})

where e_{i,j} is the attention energy between the i-th and j-th elements, which is computed as:

e_{i,j} = f(h_i, h_j)

where f is a learned function that computes the compatibility between the representations of
the i-th and j-th elements.
In the context of vision transformers, the input image is divided into a grid of patches, and
each patch is treated as an element in the input sequence. The self-attention mechanism is
used to build a new set of embeddings representing the image's local and global spatial
relationships.
By using self-attention instead of convolutions, vision transformers can capture long-range
dependencies and interactions between patches in an image more effectively, resulting in
state-of-the-art performance on many computer vision applications.
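To make these formulas concrete, here is a minimal PyTorch sketch of scaled dot-product
self-attention over a short sequence of embeddings. The names patch_embeddings, W_q, W_k, and
W_v are illustrative, not taken from any particular library.

import torch
import torch.nn.functional as F

# A toy sequence of N = 4 patch embeddings, each of dimension d = 8
N, d = 4, 8
patch_embeddings = torch.randn(N, d)

# Learned projections that map each embedding to a query, key, and value
W_q = torch.nn.Linear(d, d, bias=False)
W_k = torch.nn.Linear(d, d, bias=False)
W_v = torch.nn.Linear(d, d, bias=False)
Q, K, V = W_q(patch_embeddings), W_k(patch_embeddings), W_v(patch_embeddings)

# e_{i,j}: attention energies (compatibility between elements i and j)
energies = Q @ K.T / d ** 0.5            # shape (N, N)

# a_{i,j}: attention weights, normalized over j with a softmax
attention = F.softmax(energies, dim=-1)  # each row sums to 1

# z_i: weighted sum of the values, the new representation of element i
z = attention @ V                        # shape (N, d)
print(z.shape)                           # torch.Size([4, 8])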
Attention Mechanism in CV
Attention mechanisms have long been used in computer vision tasks where the model needs to
focus on different parts of an image at different times, particularly in image captioning and
object detection.
For example, in image captioning, the model must generate a natural language description of an
image. At each time step the model produces one word of the caption, and it must decide which
parts of the image to attend to in order to produce that word. This is accomplished with an
attention mechanism that computes a weighted sum of the visual features, with weights based on
the similarity of the current word to each part of the image.
Similarly, in object detection, the model must determine the presence and position of objects
in an image. A convolutional neural network (CNN) is often used to extract image features,
followed by a region proposal network (RPN) that generates candidate object regions. The
candidate regions are then refined using an attention mechanism that attends to the relevant
parts of the image.
Attention mechanisms are used in a similar way in vision transformers to capture the image's
local and global spatial relationships. Instead of using convolutions to extract image
features, the input image is partitioned into a grid of patches, with each patch treated as a
sequence element. The self-attention mechanism is then applied to the sequence of patch
embeddings to generate a new set of embeddings that represent the spatial relationships
between the patches.
Vision transformers can capture long-range dependencies and relationships between patches in
the image more effectively by using self-attention rather than convolutions, resulting in
state-of-the-art performance on various computer vision tasks such as image classification
and object detection. The attention mechanism in vision transformers enables the model to
focus on the crucial parts of the image while processing it, making it more efficient and
accurate when dealing with complicated visual input.
History of Vision Transformer
Vision transformers are a relatively new invention in computer vision, with their roots in the
success of transformer-based language models such as BERT and GPT in natural language
processing.
The transformer architecture was first presented in Vaswani et al.'s landmark 2017 paper
"Attention Is All You Need", which proposed a novel approach to sequence modelling based on
self-attention. Because of its capacity to capture long-range dependencies and improve the
performance of language models, the architecture quickly became a popular choice for natural
language processing tasks.
Dosovitskiy et al. presented the Vision Transformer (ViT) architecture in 2020, adapting the
transformer architecture for computer vision tasks. The ViT architecture divides input images
into smaller patches, which are then processed by a transformer encoder to produce a
fixed-size feature vector for the image. The ViT architecture achieved state-of-the-art
results on several popular image classification benchmarks, demonstrating the promise of
transformer-based models in computer vision.
Since then, several ViT variants have been developed, including DeiT (Data-efficient Image
Transformers), ViT-Large, and the Swin Transformer, each with different tweaks to the original
architecture to improve performance or reduce computational cost.
Overall, the emergence of vision transformers can be viewed as a continuation of the success
of transformer-based models in natural language processing, as well as a promising new
approach for advancing the field of computer vision.
Patch-Based Processing
Vision transformers use a patch-based approach to image processing: the input image is broken
into smaller, fixed-size patches, and each patch is treated as a single token. For example, a
224×224 image split into 16×16 patches yields a sequence of 196 tokens. This method has both
advantages and drawbacks.
Image Source: https://www.researchgate.net/figure/A-schematic-illustration-of-patch-based-processing-of-images-By-breaking-the-given-image_fig15_282609022
One advantage of patch-based processing is that vision transformers can accept inputs of
various sizes without extra resizing or cropping. This is especially beneficial for
applications such as object detection and segmentation, where the size and shape of the
objects in the image can vary significantly.
Another advantage of patch-based processing is that the self-attention mechanism can attend to
interactions between patches throughout the image, allowing the global image context to be
captured more fully. This is especially significant for tasks like scene understanding or
image captioning, where the context and interactions between objects in the image are critical
for producing accurate descriptions.
However, patch-based processing has several drawbacks. One significant disadvantage is that
spatial information is lost: each patch is handled as a separate token, and the relative
positions of the patches are not explicitly preserved (which is why ViTs add positional
embeddings). This can impair performance on tasks that rely substantially on spatial
relationships, such as fine-grained object recognition or geometric reasoning.
Another potential disadvantage is the computational and memory cost of processing many
patches. This can be mitigated to some extent with techniques such as overlapping patches or
hierarchical processing, but it remains a substantial challenge for large-scale applications.
Overall, patch-based processing is crucial for vision transformers, allowing them to attain
cutting-edge results on various computer vision benchmarks. However, it is important to weigh
the benefits and drawbacks of this strategy carefully for each application and to investigate
techniques that alleviate some of its limitations.
ViT Architecture
A vision transformer (ViT) architecture comprises a patch embedding step followed by numerous
encoder layers, each built from critical components such as multi-head self-attention and
feedforward networks.
Image Source: https://theaisummer.com/vision-transformer/
Patch Embeddings: The input image is separated into a grid of non-overlapping patches, and
each patch is flattened and mapped to a vector by a linear projection. The resulting patch
embeddings are then arranged into a sequence of vectors that is sent to the transformer
encoder.
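As a concrete illustration, here is a minimal sketch of extracting and projecting patches for
a single 224×224 RGB image. The names patch_size and embed_dim are illustrative; real
implementations usually achieve the same effect with a single strided nn.Conv2d whose kernel
size and stride equal the patch size.

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size, embed_dim = 16, 768

# Cut the image into non-overlapping 16x16 patches and flatten each patch
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)  # (1, 3, 196, 16, 16)
patches = patches.permute(0, 2, 1, 3, 4).flatten(2)                    # (1, 196, 3*16*16)

# Linear projection of each flattened patch to the embedding dimension
projection = nn.Linear(3 * patch_size * patch_size, embed_dim)
patch_embeddings = projection(patches)                                  # (1, 196, 768)
print(patch_embeddings.shape)                                           # torch.Size([1, 196, 768])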
Multi-head self-attention: The transformer encoder comprises several layers of multi-head
self-attention, allowing the model to capture local and global interactions between patches.
Each multi-head self-attention layer is combined with a normalization layer and a feedforward
network.
Multi-Head Attention: The model's self-attention mechanism enables it to attend to different
parts of the input sequence at different times, allowing it to capture local and global
correlations. Each patch embedding is projected into a set of queries, keys, and values, which
are used to calculate attention weights. The attention weights are then used to compute a
weighted sum of the values, which becomes the output of the self-attention layer.
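PyTorch provides a ready-made multi-head attention module; the short sketch below (with the
illustrative sizes of 196 patch tokens and a 768-dimensional embedding) applies it to a
sequence of patch embeddings, with the same sequence acting as queries, keys, and values.

import torch
import torch.nn as nn

embed_dim, num_heads = 768, 12
attention = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

# One image represented as a sequence of 196 patch embeddings
x = torch.randn(1, 196, embed_dim)

# Self-attention: queries, keys, and values all come from the same sequence
out, weights = attention(x, x, x)
print(out.shape, weights.shape)  # torch.Size([1, 196, 768]) torch.Size([1, 196, 196])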
Normalization Layer: After the attention mechanism is applied, the output is passed through a
normalization layer, which helps stabilize the learning process by keeping the distribution of
activations reasonably consistent across examples.
Feedforward Network: Finally, the output of the normalization layer is passed through a
feedforward network composed of two linear layers separated by a non-linear activation
function (GELU in the original ViT). The feedforward network helps capture complex
interactions between patches and enables the model to learn non-linear transformations of the
input data.
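Putting these pieces together, the sketch below shows one possible encoder block of the kind
just described, using the common pre-norm formulation. The class name ViTEncoderBlock and its
default sizes are illustrative rather than the exact ViT implementation.

import torch
import torch.nn as nn

class ViTEncoderBlock(nn.Module):
    """One encoder layer: LayerNorm -> multi-head self-attention -> LayerNorm -> MLP,
    with residual connections around both sub-layers."""
    def __init__(self, embed_dim=768, num_heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # residual around attention
        x = x + self.mlp(self.norm2(x))                    # residual around the MLP
        return x

block = ViTEncoderBlock()
tokens = torch.randn(1, 196, 768)  # a sequence of patch embeddings
print(block(tokens).shape)         # torch.Size([1, 196, 768])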
By stacking many such layers on top of the patch embeddings, the vision transformer learns a
hierarchical representation of the input image, allowing it to capture both low-level features
and high-level semantic information.
A ViT can be fine-tuned with only a few lines of code. The following is a minimal sketch that fine-tunes a pre-trained ViT on CIFAR-10 with the Hugging Face transformers library (training on CPU will be slow):
Step 1: Importing libraries
import torch
import torchvision
from torchvision import transforms
from transformers import ViTForImageClassification, ViTFeatureExtractor
Step 2: Importing the dataset
# Resize to 224x224 and normalize with mean/std 0.5, matching the preprocessing of the
# google/vit-base-patch16-224 checkpoint (documented by the feature extractor below)
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize([0.5, 0.5, 0.5], [0.5, 0.5, 0.5]),
])
data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                    transform=transform)
Step 3: Splitting the dataset
train_size = int(0.8 * len(data))
val_size = len(data) - train_size
train_data, val_data = torch.utils.data.random_split(data, [train_size, val_size])
Step 4: Creating a dataloader
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
Step 5: Defining the model
# Swap the 1000-class ImageNet head for a 10-class CIFAR-10 head
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224', num_labels=10, ignore_mismatched_sizes=True)
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
Step 6: Loss and optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.CrossEntropyLoss()
Step 7: Training loop
model.train()
for epoch in range(10):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(pixel_values=inputs)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
Step 8: Evaluating on the validation set
val_loader = torch.utils.data.DataLoader(val_data, batch_size=32)
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in val_loader:
        outputs = model(pixel_values=inputs)
        _, predicted = torch.max(outputs.logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy on validation set: %d %%' % (100 * correct / total))


Training
Data pre-processing: The input images are typically pre-processed to ensure they are
normalized and of a consistent size. For example, the images may be resized to a fixed
resolution, and the pixel values normalized to have zero mean and unit variance. Data
augmentation techniques such as random cropping, flipping, and colour jittering can also be
employed to increase the diversity of the training data and prevent overfitting, as sketched
below.
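A typical torchvision augmentation pipeline along these lines might look as follows; the crop
size, jitter strengths, and normalization statistics are illustrative and dataset-dependent.

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),          # random crop, resized to the model's input size
    transforms.RandomHorizontalFlip(),          # random left-right flip
    transforms.ColorJitter(0.4, 0.4, 0.4),      # jitter brightness, contrast, and saturation
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5, 0.5, 0.5],  # roughly zero mean, unit variance
                         std=[0.5, 0.5, 0.5]),
])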
Loss function: The loss function used to train a vision transformer depends on the task. For
example, cross-entropy loss is typically used for image classification. The cross-entropy loss
compares the model's predicted class probabilities to the true class labels and penalizes the
model for incorrect predictions.
Optimization method: A variant of stochastic gradient descent (SGD) is typically used to train
a vision transformer. SGD works by computing the gradient of the loss function with respect to
the model parameters and then updating the parameters in the direction that reduces the loss.
SGD variants such as Adam and RMSProp are often used in practice because they tend to converge
faster and more reliably than vanilla SGD.
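For example, a common recipe for vision transformers is AdamW with a cosine learning-rate
schedule; the sketch below uses illustrative hyperparameters and a stand-in model.

import torch
import torch.nn as nn

model = nn.Linear(10, 10)  # stand-in for a vision transformer defined elsewhere

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)  # 10 epochs

for epoch in range(10):
    # ... one pass over the training data goes here, calling optimizer.step() per batch ...
    scheduler.step()  # anneal the learning rate once per epoch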
During training, batches of pre-processed images are fed into the vision transformer, and the
model parameters are updated based on the gradients of the loss with respect to those
parameters. Training typically involves many passes (epochs) over the whole training dataset,
with the model's performance checked on a held-out validation set after each epoch. This
allows for early stopping if the model begins to overfit the training data.
In addition to standard training techniques, recent advancements in training vision
transformers include distillation, which trains a smaller student model to mimic the behaviour
of a larger teacher model, and contrastive learning, which trains the model to learn
representations that are invariant to data augmentation.
Pre-Trained Models:
Pre-trained vision transformer models such as ViT (Vision Transformer), DeiT (Data-efficient
Image Transformer), and the Swin Transformer are available to both the research community and
industry practitioners. These models are designed to extract rich visual information from
images and have been pre-trained on large-scale datasets such as ImageNet, COCO, and
JFT-300M.
The primary advantage of pre-trained vision transformer models is that they can be fine-tuned
for specific applications with limited labelled data. This kind of transfer learning is
particularly valuable in domains with little labelled data, such as medical imaging or
satellite imagery. By reusing the features the model learned during pre-training, fine-tuning
can boost accuracy and speed up training for the target task.
To fine-tune, the final classification layer of a pre-trained vision transformer is replaced
with a new layer adapted to the specific task, such as object detection or image segmentation.
The weights of the pre-trained layers are typically frozen during training, so that only the
weights of the new layer are updated to minimize the task-specific loss function, as sketched
below.
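A minimal sketch of this head-replacement and freezing pattern with the timm library (the
checkpoint name and the 10-class head are illustrative):

import timm

# Load a pre-trained ViT and replace its ImageNet head with a fresh 10-class head
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)

# Freeze every parameter except the new classification head
for name, param in model.named_parameters():
    param.requires_grad = name.startswith('head')

# Only the head's parameters will receive gradient updates during fine-tuning
print([n for n, p in model.named_parameters() if p.requires_grad])  # ['head.weight', 'head.bias']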
Furthermore, pre-trained models can be employed as feature extractors that produce
high-dimensional embeddings for downstream tasks such as image retrieval or clustering.
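With timm, for example, creating the model with num_classes=0 strips the classification head
so the network returns pooled embeddings directly; the random input below is a stand-in for
real, preprocessed images.

import timm
import torch

# num_classes=0 removes the classifier, turning the ViT into a feature extractor
backbone = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=0)
backbone.eval()

images = torch.randn(2, 3, 224, 224)  # stand-in batch of two images
with torch.no_grad():
    embeddings = backbone(images)     # one pooled feature vector per image
print(embeddings.shape)               # torch.Size([2, 768])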
Overall, the availability of pre-trained vision transformer models has dramatically lowered
the barrier to entry for computer vision research and application development. By fine-tuning
these models for specific tasks or employing them as feature extractors, practitioners can
achieve cutting-edge performance with less data and fewer computational resources.
Let's see one pre-trained model in action with Python code (a minimal sketch of fine-tuning DeiT on CIFAR-10 with the timm library):
Step 1: Install the packages
!pip install torch torchvision timm
Step 2: Importing libraries
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import timm
Step 3: Data transformations
# DeiT expects 224x224 inputs, so the CIFAR-10 images are augmented and then resized;
# the test set gets only the deterministic resize and normalization
transform_train = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
transform_test = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
Step 4: Loading the dataset
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform_train)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128,
                                          shuffle=True, num_workers=2)
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=100,
                                         shuffle=False, num_workers=2)
classes = ('plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck')
Step 5: Defining the model
# DeiT pre-trained on ImageNet, with a fresh 10-class head for CIFAR-10
model = timm.create_model('deit_base_patch16_224', pretrained=True, num_classes=10)
Step 6: Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
Step 7: Training the model
model.train()
for epoch in range(10):
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        inputs, labels = data
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item()
        if i % 100 == 99:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0
print('Finished Training')
Step 8: Evaluating on the test set
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))
Interpretability:
One advantage of vision transformers is that they are arguably more interpretable than typical
convolutional neural networks (CNNs). Interpretable models provide insight into how the model
arrives at its judgements or predictions. In computer vision, interpretability helps users
understand why the model made a particular classification or detection. This is especially
significant in applications such as medical imaging, where the model's accuracy and
dependability are critical.
Vision transformers owe much of this interpretability to the self-attention mechanism in their
architecture. Self-attention allows the model to focus on different regions of the image and
lets the user observe which parts are being used to make a prediction. In contrast, the
intermediate feature maps of traditional CNNs can be difficult to interpret.
Furthermore, vision transformers can produce saliency maps, which highlight the portions of
the input image that were most significant for a specific prediction. This can help users
better understand how the model makes decisions and identify potential flaws or biases in its
predictions.
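As one hedged illustration, Hugging Face ViT models can return their attention weights via
output_attentions=True; averaging the [CLS] token's attention over heads in the last layer
gives a rough 14×14 map of which patches influenced the prediction. The random input below is
a stand-in for a preprocessed image.

import torch
from transformers import ViTForImageClassification

model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
model.eval()

pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a normalized 224x224 image
with torch.no_grad():
    outputs = model(pixel_values, output_attentions=True)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, num_heads, 197, 197), where 197 = 196 patch tokens + 1 [CLS] token
last_layer = outputs.attentions[-1]

# Attention from the [CLS] token to the 196 patches, averaged over heads
cls_attention = last_layer[0].mean(dim=0)[0, 1:]  # shape (196,)
attention_map = cls_attention.reshape(14, 14)     # coarse 14x14 saliency-style map
print(attention_map.shape)                        # torch.Size([14, 14])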
Overall, vision transformer interpretability can be helpful in various applications where
understanding the model's decision-making process is vital. This includes medical imaging,
self-driving cars, and other safety-sensitive applications where model accuracy and reliability
are crucial.
Hybrid architectures
Hybrid designs that combine vision transformers with other neural network architectures, such
as CNNs, have grown in popularity in recent years. These hybrid systems aim to combine the
advantages of vision transformers and CNNs, yielding improved performance at lower
computational cost.
The Transformer in a Convolutional Neural Network (T-CNN) is one example of a hybrid
architecture for object detection tasks that combines a vision transformer with a CNN. In this
design, the CNN extracts low-level features, which are then passed to the vision transformer
for high-level feature extraction and object detection.
Another example is the Hybrid Vision Transformer (HVT), which combines a vision
transformer with a UNet-like architecture for semantic segmentation tasks. In the HVT
architecture, the vision transformer extracts high-level semantic features, while the UNet-like
architecture is used for low-level feature extraction and upsampling.
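The general pattern behind such hybrids, a convolutional stem whose feature maps are flattened
into tokens for a transformer encoder, can be sketched as follows. This is a generic
illustration under simple assumptions, not the actual T-CNN or HVT architectures.

import torch
import torch.nn as nn

class HybridCNNTransformer(nn.Module):
    """A CNN stem extracts low-level feature maps; a transformer encoder then
    models global relationships between the resulting spatial tokens."""
    def __init__(self, embed_dim=256, num_heads=8, num_layers=4, num_classes=10):
        super().__init__()
        self.stem = nn.Sequential(  # downsamples a 224x224 image to a 14x14 feature map
            nn.Conv2d(3, 64, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, embed_dim, kernel_size=3, stride=4, padding=1), nn.ReLU(),
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        feats = self.stem(x)                       # (B, embed_dim, 14, 14)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, 196, embed_dim)
        tokens = self.encoder(tokens)              # global self-attention over tokens
        return self.head(tokens.mean(dim=1))       # average-pool tokens, then classify

model = HybridCNNTransformer()
print(model(torch.randn(2, 3, 224, 224)).shape)    # torch.Size([2, 10])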
These hybrid designs can offer various benefits, including improved performance, lower
computation costs, and greater interpretability. By combining the strengths of vision
transformers and CNNs, hybrid architectures can provide cutting-edge performance on a wide
range of computer vision applications while also being more interpretable than traditional
CNN architectures.
Furthermore, hybrid architectures may make better use of resources such as memory and
computing by allowing for concurrent image processing. This is particularly important in
applications that require real-time performance, such as autonomous driving.
Overall, hybrid designs that combine vision transformers with other neural network
architectures have the potential to push the boundaries of computer vision and enable a wide
range of applications that were previously difficult to achieve with traditional CNNs or
vision transformers alone.
Comparison with Other Techniques
Like vision transformers, CNNs are neural networks used for computer vision tasks. The two
differ in that vision transformers process images as patches with a self-attention mechanism,
whereas CNNs extract features from images using convolutional filters.
Unlike RNNs, which are widely used for sequence data, vision transformers are better suited to
image data because they can model long-range dependencies between image patches.
Graph neural networks (GNNs), on the other hand, are used to process graph-structured data
such as social networks or molecules. While vision transformers do not operate directly on
graph data, they can be used for tasks like object detection, where detected objects can be
viewed as nodes in a graph.
Overall, each technique has its advantages and disadvantages and is best suited to particular
kinds of data and tasks. The right approach depends on the specific problem at hand and the
characteristics of the data.
Advantages
Global receptive field – self-attention lets every patch attend to every other patch from the very first layer.
Scalability – performance continues to improve as model and dataset sizes grow.
Interpretable features – attention maps reveal which patches drive a prediction.
Transfer learning – pre-trained vision transformers fine-tune well on new tasks with limited labelled data.
Applications
Although vision transformers have many applications, two of the most notable are:
Image classification - The most common use of vision transformers is image classification,
where the goal is to assign an image to one of several pre-defined categories. Vision
transformers have demonstrated performance competitive with or superior to standard CNN-based
models on various image classification benchmarks, including ImageNet, CIFAR-100, and the more
recently released ImageNet-21K.
Generative tasks - Vision transformers have also been employed in generative tasks, where the
goal is to generate new images similar to those in a training set. This is frequently done
with a transformer variant known as a "GPT-style" transformer, trained autoregressively on a
massive corpus of data and then adapted to image generation.
Limitations
Computational cost – self-attention scales quadratically with the number of patches.
Large memory requirements – large models and their attention maps demand substantial memory.
Difficulty in training – vision transformers typically need large datasets or heavy augmentation to train well.
Limited interpretability – attention maps give only an approximate picture of the model's reasoning.
Conclusion
Vision transformers are a relatively new and intriguing development in computer vision. They
process images using a transformer architecture with a self-attention mechanism, and they have
shown promising results in image classification, object detection, and image segmentation.
Vision transformers offer several significant advantages, including the ability to capture
long-range dependencies, flexibility in processing inputs of varied sizes, and the potential
for better generalization to new data. However, they also have drawbacks, such as high
computational cost, large memory requirements, and difficulty of training.
Despite these obstacles, I believe vision transformers will continue to play a significant
role in computer vision research and applications. We can expect even more impressive results
and new uses of this technology as further research reduces the computational and memory
requirements of vision transformers and improves their interpretability and ease of training.
Overall, the development of vision transformers is an exciting milestone in computer vision,
with significant potential to improve our understanding of, and our capacity to analyze,
visual data across many fields.
