Vision Transformers: Revolutionizing Computer Vision
The vision transformer (ViT), a relatively new type of neural network, has revolutionized computer vision in recent years. Unlike typical convolutional neural networks (CNNs), vision transformers process images with a patch-based approach and a self-attention mechanism, which lets them model long-range dependencies between image patches and achieve state-of-the-art results on a wide range of computer vision tasks.
In this article, we will examine the history of vision transformers, their architecture and training process, their benefits and drawbacks, and their applications across disciplines. We will also discuss future research directions, implementation tips, and how vision transformers stack up against other prominent computer vision techniques. Whether you are new to computer vision or an experienced practitioner, this blog will give you a thorough grasp of vision transformers and their potential to reshape the field.
Neural Networks – How Do They Relate to ViT?
Neural networks are algorithms inspired by the structure and function of the human brain. They are an effective tool for tackling complex problems such as image recognition, speech recognition, natural language processing, and many more.
The fundamental building block of a neural network is the neuron: it receives input from other neurons or external sources, processes it, and produces an output. Neurons are organized into layers, and the output of one layer becomes the input to the next.
In self-attention, each element of the input sequence attends to every other element, and the output for the i-th element is a weighted sum of all the elements:

y_i = \sum_j a_{i,j} x_j

where a_{i,j} is the attention weight assigned to the j-th element by the i-th element. The attention weights are obtained by applying a softmax to the attention energies:

a_{i,j} = \frac{\exp(e_{i,j})}{\sum_k \exp(e_{i,k})}

where e_{i,j} is the attention energy between the i-th and j-th elements, which is computed as:

e_{i,j} = f(x_i, x_j)

where f is a learned function that computes the compatibility between the representations of the i-th and j-th elements.
In the context of vision transformers, the input image is divided into a grid of patches, and
each patch is treated as an element in the input sequence. The self-attention mechanism is
used to build a new set of embeddings representing the image's local and global spatial
relationships.
By using self-attention instead of convolutions, vision transformers can capture long-range dependencies and interactions between patches more effectively, resulting in state-of-the-art performance on many computer vision applications.
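To make this concrete, here is a minimal sketch of the self-attention computation over a sequence of patch embeddings in PyTorch. In ViT, the compatibility function f is the scaled dot product of learned query and key projections; the shapes and weight matrices below are illustrative rather than taken from any particular implementation.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (num_patches, dim) sequence of patch embeddings
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    energy = q @ k.T / k.shape[-1] ** 0.5      # e_{i,j}: scaled dot-product compatibility
    attn = F.softmax(energy, dim=-1)           # a_{i,j}: softmax attention weights
    return attn @ v                            # weighted sum of the values

dim = 64
x = torch.randn(196, dim)                      # e.g. a 14x14 grid of patch embeddings
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)         # (196, dim) updated embeddings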
Attention Mechanism in CV
Attention mechanisms have long been used in computer vision tasks, particularly in image captioning and object detection, where the model needs to focus on different parts of the image at different times.
For example, in image captioning, the model must produce a natural language description of an image. At each time step it generates one word of the caption and must decide which parts of the image to attend to when generating that word. This is accomplished with an attention mechanism that computes a weighted sum of the visual features, with weights based on how relevant each part of the image is to the current word.
Similarly, an object detection model must determine the presence and location of objects in an image. A convolutional neural network (CNN) is often used to extract image features, followed by a region proposal network (RPN) that generates candidate object regions. The candidate regions are then refined using an attention mechanism that attends to the relevant parts of the image.
Attention mechanisms are used similarly in vision transformers to capture the image's local and global spatial relationships. Instead of using convolutions to extract image features, the
input image is partitioned into a grid of patches, with each patch regarded as a sequence
element. The self-attention mechanism is then applied to the sequence of patch embeddings
to generate a new set of embeddings that represent the spatial relationships between the
patches.
Vision transformers can capture long-range dependencies and relationships between patches
in the image more effectively by using self-attention rather than convolutions, resulting in
state-of-the-art performance on various computer vision tasks such as image classification
and object detection. The attention mechanism in vision transformers enables the model to
focus on the most informative parts of the image as it processes it, making it more efficient and accurate when dealing with complex visual input.
History of Vision Transformer
Vision transformers are a relatively new invention in computer vision, with their roots in the
success of transformer-based language models such as BERT and GPT in natural language
processing.
The transformer architecture was first presented in Vaswani et al.'s landmark 2017 paper "Attention Is All You Need", which proposed a novel approach to sequence modelling based on self-attention. Because of its capacity to capture long-range dependencies and improve the performance of language models, this architecture quickly became a popular choice for natural language processing tasks.
Dosovitskiy et al. presented the Vision Transformer (ViT) architecture in 2020, adapting the transformer architecture to computer vision tasks. The ViT architecture divides input images into smaller patches, which are then processed by a transformer encoder to produce a fixed-size feature vector for the image. The ViT architecture achieved state-of-the-art results on several popular image classification benchmarks, demonstrating the promise of transformer-based models in computer vision.
Since then, several variants of the ViT design have been developed, including DeiT (Data-efficient Image Transformers), ViT-Large, and the Swin Transformer, each introducing tweaks to the original architecture to improve performance or reduce computational cost.
Overall, the emergence of vision transformers can be viewed both as a continuation of the success of transformer-based models in natural language processing and as a promising new approach for advancing the field of computer vision.
Patch-Based Processing
Vision transformers use a patch-based approach to image processing, breaking the input image into smaller, fixed-size patches and treating each patch as a single token. This method has both advantages and drawbacks.
Image Source: https://www.researchgate.net/figure/A-schematic-illustration-of-patch-based-processing-of-images-By-breaking-the-given-image_fig15_282609022
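As a rough illustration of this idea, the sketch below splits one image into non-overlapping patches and flattens each patch into a token; the 224x224 input and 16x16 patch size are assumptions matching the common ViT-Base setup.

import torch

image = torch.randn(1, 3, 224, 224)   # (batch, channels, height, width)
patch_size = 16

# unfold carves the height and width into non-overlapping 16x16 patches
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# flatten each patch into a vector, giving a sequence of patch tokens
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, -1, 3 * patch_size * patch_size)
print(patches.shape)                  # torch.Size([1, 196, 768]): 196 tokens of dimension 768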
One advantage of patch-based processing is that vision transformers can accept inputs of various sizes without extra resizing or cropping. This is especially useful for applications such as object detection and segmentation, where the size and shape of the objects in the image can vary significantly.
Another advantage of patch-based processing is that the self-attention mechanism can attend to interactions between patches across the whole image, allowing the model to better capture the global image context. This is especially important for tasks such as scene understanding or image captioning, where the context and interactions between objects in the image are critical for producing accurate descriptions.
However, patch-based processing has several drawbacks. One significant disadvantage is that
spatial information is lost because each patch is handled as a separate token, and the relative
positions of the patches are not explicitly stored. This can impair performance in tasks that
rely substantially on spatial relationships, such as fine-grained object recognition or
geometric reasoning.
Another potential disadvantage is the computational and memory cost of processing many patches. This can be mitigated to some extent with techniques such as overlapping patches or hierarchical processing, but it remains a substantial challenge for large-scale applications.
Overall, patch-based processing is central to vision transformers, allowing them to achieve state-of-the-art results on various computer vision benchmarks. However, it is important to carefully weigh the benefits and drawbacks of this approach for individual applications and to investigate techniques that alleviate some of its limitations.
ViT Architecture
A vision transformer (ViT) architecture comprises several key components: a patch-embedding stage followed by a stack of encoder layers, each built from multi-head self-attention, normalization layers, and feedforward networks.
Image Source: https://theaisummer.com/vision-transformer/
Patch Embeddings: The input image is divided into a grid of non-overlapping patches, and each patch is flattened and mapped to a vector by a learned linear projection. The resulting patch embeddings are stacked into a sequence of tokens (together with a learnable classification token and position embeddings) that is fed to the transformer encoder.
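A minimal sketch of such a patch-embedding layer is shown below; the sizes are the usual ViT-Base defaults (224x224 images, 16x16 patches, 768-dimensional embeddings), and a strided convolution is used as a common, equivalent way to apply the same linear projection to every patch.

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    def __init__(self, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        # A convolution with kernel_size == stride == patch_size splits the image
        # into patches and linearly projects each one in a single operation.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (batch, 3, 224, 224)
        x = self.proj(x)                        # (batch, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (batch, 196, 768) patch tokens

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                             # torch.Size([2, 196, 768])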
Multi-head self-attention: The transformer encoder consists of several layers of multi-head self-attention, allowing the model to capture local and global interactions between patches. Each encoder layer combines a self-attention mechanism with a normalization layer and a feedforward network.
Multi-Head Attention: The self-attention mechanism enables the model to attend to different parts of the input sequence simultaneously, allowing it to capture both local and global correlations. Each patch embedding is projected into queries, keys, and values, which are used to compute attention weights. The attention weights are then used to form a weighted sum of the values, which becomes the output of the self-attention layer.
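The sketch below shows how the multi-head variant splits the embedding dimension across several heads so that each head computes attention over the sequence independently; the 12 heads and 768-dimensional embeddings are the usual ViT-Base values, used here only for illustration.

import torch

batch, tokens, dim, heads = 2, 196, 768, 12
head_dim = dim // heads                           # 64 dimensions per head

q, k, v = (torch.randn(batch, tokens, dim) for _ in range(3))   # projected patch embeddings
# reshape so attention is computed independently within each head
q, k, v = (t.view(batch, tokens, heads, head_dim).transpose(1, 2) for t in (q, k, v))
attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)  # (batch, heads, tokens, tokens)
out = (attn @ v).transpose(1, 2).reshape(batch, tokens, dim)             # concatenate the heads
print(out.shape)                                  # torch.Size([2, 196, 768])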
Normalization Layer: Each attention sub-layer is paired with a layer normalization (applied before the sub-layer in the original ViT), which helps stabilize training by keeping the distribution of activations relatively consistent across examples.
Feedforward Network: Finally, the normalized output is passed through a feedforward network composed of two linear layers separated by a non-linear activation function (GELU in the original ViT). The feedforward network helps capture complex interactions between patches and enables the model to learn non-linear transformations of the input data.
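Putting these pieces together, the sketch below assembles one encoder block in the pre-norm arrangement used by the original ViT; the dimensions are the usual ViT-Base defaults and are assumptions for illustration only.

import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=768, heads=12, mlp_dim=3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim))

    def forward(self, x):                                   # x: (batch, num_tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]   # multi-head self-attention + residual
        return x + self.mlp(self.norm2(x))                  # feedforward network + residual

x = torch.randn(2, 197, 768)                                # 196 patch tokens plus a class token
print(EncoderBlock()(x).shape)                              # torch.Size([2, 197, 768])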
By stacking many of these encoder layers on top of the patch embeddings, the vision transformer learns a hierarchical representation of the input image, capturing both low-level features and high-level semantic information.
A pretrained ViT can be fine-tuned on an image classification dataset with a few lines of PyTorch and the Hugging Face transformers library:
Step 1: Importing libraries
import torch
import torchvision
from torchvision import transforms
from transformers import ViTForImageClassification, ViTFeatureExtractor
Step 2: Loading the dataset
# The pretrained ViT expects 224x224 inputs normalized with the checkpoint's
# image mean and std, so the transform is built from the feature extractor's settings.
feature_extractor = ViTFeatureExtractor.from_pretrained('google/vit-base-patch16-224')
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=feature_extractor.image_mean, std=feature_extractor.image_std),
])
data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True,
                                    transform=transform)
Step 3: Splitting dataset
train_size = int(0.8 * len(data))
val_size = len(data) - train_size
train_data, val_data = torch.utils.data.random_split(data, [train_size, val_size])
Step 4: Creating a DataLoader to batch the training data
train_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)
Step 5: Defining the model
# Load the pretrained backbone and replace the 1000-class ImageNet head
# with a fresh 10-class head for CIFAR-10.
model = ViTForImageClassification.from_pretrained(
    'google/vit-base-patch16-224', num_labels=10, ignore_mismatched_sizes=True)
model.train()
Step 6: Loss and optimizer
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = torch.nn.CrossEntropyLoss()
Step 7: Training the model
for epoch in range(10):
    for inputs, labels in train_loader:
        optimizer.zero_grad()
        # inputs are already resized and normalized by the dataset transform
        outputs = model(pixel_values=inputs)
        loss = criterion(outputs.logits, labels)
        loss.backward()
        optimizer.step()
Step 8: Evaluating the model
val_loader = torch.utils.data.DataLoader(val_data, batch_size=32)
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in val_loader:
        outputs = model(pixel_values=inputs)
        _, predicted = torch.max(outputs.logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()
print(f'Validation accuracy: {correct / total:.4f}')
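As a quick usage example, the fine-tuned model can then classify a single image; the file path below is purely illustrative.

from PIL import Image

# Preprocess one image with the same feature extractor loaded in Step 2
image = Image.open('example.jpg').convert('RGB')   # illustrative path, not part of the dataset
pixel_values = feature_extractor(images=image, return_tensors='pt')['pixel_values']
with torch.no_grad():
    predicted_class = model(pixel_values=pixel_values).logits.argmax(-1).item()
print(data.classes[predicted_class])               # maps the index to a CIFAR-10 class name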