
LITERATURE REVIEW

Paper 1: Vision Transformers for Image Classification


Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Published: 2020

https://arxiv.org/abs/2010.11929

1. Introduction

 CNNs have been the dominant models for image classification, with architectures like
ResNet, VGG, and AlexNet. These models exploit local image features but struggle to
capture long-range dependencies. Vision Transformers (ViTs) aim to solve this problem
by applying transformers, previously successful in NLP, to image classification tasks.

2. Background

 Transformers, introduced by Vaswani et al. in 2017 for NLP tasks, handle long
sequences with attention mechanisms. They parallelize computations and capture
global dependencies. ViTs adapt this for images by treating them as sequences of
patches, unlike CNNs, which are limited by locality biases.

3. Research Objective

 The primary goal of ViTs is to demonstrate that transformers can replace convolutions
for large-scale image classification, capturing global image relationships more
efficiently.

4. Related Work

 Traditional CNN-based models like ResNet and VGG have dominated vision tasks. Prior
hybrid models, such as Image Transformers, integrated CNNs with attention
mechanisms. ViTs eliminate convolutions entirely, showing that transformers alone can
perform competitively in image classification.

5. Methodology

 ViTs split an image into fixed-size patches, which are embedded and passed through a
transformer encoder. The transformer applies self-attention to these patch tokens to capture
both local and global relationships. ViTs are pre-trained on large datasets such as
ImageNet-21k and JFT-300M.
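
To make the pipeline concrete, the following is a minimal sketch of this patch-embed-then-encode idea in PyTorch; all hyperparameters (patch size 16, embedding width, depth, head count) are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal ViT-style sketch: patch embedding, [CLS] token, positional embeddings,
# a standard transformer encoder, and a linear classification head.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to slicing 16x16 patches and projecting them linearly.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.patch_embed(x)                # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, N, dim) sequence of patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                    # global self-attention over all patches
        return self.head(x[:, 0])              # classify from the [CLS] token

logits = TinyViT()(torch.randn(2, 3, 224, 224))   # -> (2, 1000)
```
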
6. Key Contributions

o Pure transformer architecture for image classification without convolutions.

o Scalability: ViTs outperform CNNs on large datasets.

o Efficient Training: Transformers parallelize computations, reducing training overhead.

7. Results and Performance

 ViTs achieve competitive accuracy on benchmarks like ImageNet, especially when
trained on larger datasets. However, on smaller datasets, CNNs still excel due to their
inductive biases.

8. Limitations and Future Work

 ViTs require large datasets to perform well. Future research could improve their data
efficiency or explore hybrid architectures combining CNNs and transformers.

9. Conclusion

 ViTs represent a shift in image classification by using transformers. They demonstrate
the feasibility of discarding convolutions and provide a path toward scalable vision
models.
Paper 2: Swin Transformer: Hierarchical Vision Transformer using Shifted
Windows
Authors: Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang,
Stephen Lin, Baining Guo
Published: 2021
https://arxiv.org/abs/2103.14030

1. Introduction

• While transformers have shown promise in vision tasks, such as image classification
and object detection, their application to high-resolution images poses challenges.
Standard transformer architectures require significant computational resources when
processing large images. The Swin Transformer (Shifted Windows Transformer)
addresses this issue by introducing a hierarchical architecture that processes images
at multiple scales and employs a shifted window mechanism to balance computation
and accuracy.

2. Background

• The Vision Transformer (ViT) introduced the use of transformers in vision tasks,
treating image patches as tokens similar to words in NLP. However, ViT operates on a
fixed resolution and lacks efficiency for high-resolution images, requiring large
datasets for effective training. CNNs, on the other hand, have hierarchical structures
that help them process images more efficiently by capturing local features and
progressively integrating global information. Swin Transformers aim to combine the
strength of hierarchical processing from CNNs with the global modeling capabilities of
transformers.

3. Research Objective

• The primary goal of this paper is to propose a transformer-based architecture that
can efficiently handle high-resolution images while maintaining the scalability and
flexibility of transformers. By employing a shifted window technique, the model
captures local and global information effectively without the computational overhead
seen in traditional transformers when dealing with large image sizes.

4. Related Work

• CNNs like ResNet and EfficientNet have dominated vision tasks due to their
hierarchical structures. Transformers, like ViT and DETR, have introduced global self-
attention for vision tasks, but face challenges in efficiency when dealing with high-
resolution images. The Swin Transformer builds on the hierarchical nature of CNNs by
processing images at different scales, while also leveraging the global modeling
capability of transformers through a shifted window mechanism to improve efficiency
and performance.

5. Methodology

• The Swin Transformer divides images into non-overlapping windows and computes
self-attention within each window. The shifted window strategy ensures that the
model can capture cross-window dependencies and process images in a hierarchical
fashion, similar to a pyramid. This approach reduces computation by limiting self-
attention to smaller windows while maintaining the ability to model long-range
dependencies by shifting the window positions between layers.
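
The window partitioning and the cyclic shift can be illustrated with a short PyTorch sketch; this is a simplified view under assumed toy dimensions, and it omits the attention mask and relative position biases used in the actual Swin implementation.

```python
# Window partitioning and shifted windows, simplified.
import torch

def window_partition(x, ws):
    """Split a feature map (B, H, W, C) into non-overlapping ws x ws windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (num_windows*B, ws*ws, C)

x = torch.randn(2, 8, 8, 96)     # toy feature map: batch 2, 8x8 tokens, 96 channels
ws = 4

# Regular layer: attention is computed independently inside each 4x4 window.
windows = window_partition(x, ws)                                # (8, 16, 96)

# Shifted layer: cyclically shift the map by ws//2 before partitioning, so the new
# windows straddle the previous window boundaries and exchange information.
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)
print(windows.shape, shifted_windows.shape)
```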

6. Key Contributions

 Hierarchical Transformer Architecture: The Swin Transformer introduces a
hierarchical structure for transformers, similar to that of CNNs, allowing it to process
images at multiple resolutions.
 Shifted Window Mechanism: By shifting the windows between layers, Swin
Transformer enables efficient computation and the ability to model interactions
between distant patches.
 Scalability: The model is scalable for vision tasks ranging from image classification to
object detection, and it can handle high-resolution images more efficiently than
previous transformer-based architectures.
7. Results and Performance
• The Swin Transformer achieves state-of-the-art performance across a range of
vision tasks, including image classification (on ImageNet), object detection (on COCO),
and semantic segmentation (on ADE20K). Compared to other transformers like ViT, it
performs better on high-resolution images and demonstrates improved efficiency,
while also maintaining competitive accuracy on standard benchmarks.

8. Limitations and Future Work

• One limitation of the Swin Transformer is its reliance on hierarchical feature
extraction, which may still underperform compared to CNNs in certain specific cases
that benefit from highly localized feature extraction. Additionally, while the shifted
window mechanism improves efficiency, there is room for further optimization,
particularly in terms of computational cost for even larger datasets and resolutions.
Future work could involve exploring more efficient versions of the architecture or
hybridizing with CNNs for specialized tasks.

9. Conclusion
• The Swin Transformer represents a significant step forward in the use of
transformer architectures for computer vision, especially in handling high-resolution
images more efficiently. Its hierarchical design and shifted window technique make it
a scalable and versatile model for various vision tasks, pushing the boundaries of
transformer applications in computer vision.
Paper 3: An Image is Worth 16x16 Words: Transformers for Image Recognition
at Scale
Authors: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk
Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias
Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby
Published: 2020
https://arxiv.org/abs/2010.11929

1. Introduction:

• This paper presents the Vision Transformer (ViT), a model that applies transformers
to image classification by treating images as sequences of 16x16 pixel patches, similar
to how text is processed in NLP. This challenges traditional convolutional approaches
and demonstrates that transformers can excel in vision tasks when trained at scale.

2. Background:

• NLP models like BERT and GPT use transformers to process text, benefiting from
their ability to capture long-range dependencies. In vision, CNNs have traditionally
dominated but struggle to model global context effectively. ViTs propose a novel
solution by using transformers, which handle global dependencies through self-
attention mechanisms.

3. Research Objective:

• The paper aims to explore how transformers can be adapted for image classification
tasks by using patches instead of individual pixels or convolutional filters, allowing for
scalability and better capture of global relationships.

4. Related Work:

• Previous works have relied heavily on CNN architectures for image recognition.
Hybrid models like Image Transformers used both convolutions and transformers, but
ViTs represent the first pure transformer architecture applied to large-scale image
datasets.

5. Methodology:

• The image is divided into 16x16 pixel patches, which are embedded as input tokens
for the transformer. The standard transformer architecture is used to process these
tokens, with positional encodings added to retain spatial information.
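
A hedged sketch of this tokenization step is shown below, assuming PyTorch; the 16x16 patch size follows the paper's title, while the batch size and embedding width are arbitrary choices for illustration.

```python
# Turning an image into a sequence of 16x16 patch tokens with learned positional embeddings.
import torch
import torch.nn as nn

B, C, H, W, P, D = 2, 3, 224, 224, 16, 768
img = torch.randn(B, C, H, W)

# Extract non-overlapping 16x16 patches and flatten each to a vector of length 3*16*16.
patches = img.unfold(2, P, P).unfold(3, P, P)                           # (B, C, 14, 14, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * P * P)   # (B, 196, 768)

proj = nn.Linear(C * P * P, D)                           # linear patch projection
pos = nn.Parameter(torch.zeros(1, patches.size(1), D))   # learned positional embeddings

tokens = proj(patches) + pos    # (B, 196, D): transformer-ready token sequence
print(tokens.shape)
```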

6. Key Contributions:
 Patch-based Input Representation: Treating image patches as tokens enables the
transformer to process images at scale.
 Self-attention Mechanism: The self-attention mechanism allows for better modeling
of long-range dependencies in images.
 Scalability: ViTs can outperform CNNs on large datasets when properly scaled.

7. Results and Performance:

• ViTs achieve state-of-the-art results on ImageNet, performing especially well when
pre-trained on larger datasets such as JFT-300M. Their ability to capture global context
makes them highly competitive with CNN-based models.

8. Limitations and Future Work:

• The requirement for large datasets limits the applicability of ViTs in small-data
scenarios. Future research should focus on making ViTs more data-efficient or
exploring hybrid architectures that combine transformers with CNNs.

9. Conclusion:

• ViTs open a new direction for image classification, demonstrating the potential of
transformer-based models to handle large-scale vision tasks and outperform
traditional convolutional networks.
Paper 4: End-to-End Object Detection with Transformers
Authors: Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier,
Alexander Kirillov, Sergey Zagoruyko
Published: 2020
https://arxiv.org/abs/2005.12872

1. Introduction:

• Object detection has been traditionally approached using complex pipelines with
CNN backbones. This paper introduces DETR (Detection Transformer), a transformer-
based model for end-to-end object detection, which simplifies the detection pipeline
and replaces hand-designed components with attention-based mechanisms.

2. Background:

• Object detection models, such as Faster R-CNN and YOLO, have used CNN-based
architectures and complex post-processing for detecting objects in images. However,
these pipelines are not fully end-to-end. Transformers, known for their success in NLP,
have shown promise in vision tasks by leveraging self-attention to model global
dependencies in an image.

3. Research Objective:

• The objective of this research is to create a simpler and fully end-to-end object
detection pipeline by using transformers, eliminating the need for traditional methods
like region proposal networks and non-maximum suppression.

4. Related Work:

• Prior work includes CNN-based detectors such as Faster R-CNN, which rely on hand-
designed components like anchor boxes and non-maximum suppression. DETR
removes these hand-designed elements by applying transformers to object detection,
which models relationships between objects directly using self-attention.

5. Methodology:

• DETR applies a transformer architecture that takes feature maps from a CNN
backbone as input. The model uses self-attention to predict object locations and
classes directly, without requiring region proposals or complex post-processing.
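
The following is a rough, simplified sketch of such a pipeline, assuming PyTorch and a torchvision ResNet-50 backbone; it omits positional encodings and the bipartite (Hungarian) matching loss DETR is trained with, so it should be read as an illustration of the data flow rather than the authors' implementation.

```python
# DETR-style data flow: CNN features -> transformer with learned object queries -> box/class heads.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TinyDETR(nn.Module):
    def __init__(self, num_classes=91, num_queries=100, dim=256):
        super().__init__()
        backbone = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])  # CNN feature maps
        self.input_proj = nn.Conv2d(2048, dim, kernel_size=1)
        self.transformer = nn.Transformer(d_model=dim, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.query_embed = nn.Parameter(torch.randn(num_queries, dim))  # learned object queries
        self.class_head = nn.Linear(dim, num_classes + 1)   # +1 for the "no object" class
        self.box_head = nn.Linear(dim, 4)                   # normalized (cx, cy, w, h)

    def forward(self, images):                              # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))      # (B, dim, h, w)
        src = feats.flatten(2).transpose(1, 2)              # (B, h*w, dim) tokens
        tgt = self.query_embed.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.transformer(src, tgt)                     # (B, num_queries, dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()

cls_logits, boxes = TinyDETR()(torch.randn(1, 3, 256, 256))
print(cls_logits.shape, boxes.shape)   # (1, 100, 92) (1, 100, 4)
```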

6. Key Contributions:
 End-to-End Architecture: DETR removes the need for region proposal networks and
hand-designed components, creating a fully end-to-end detection pipeline.
 Self-attention for Object Detection: By applying self-attention to feature maps, the
model can capture relationships between objects globally.
 Simplified Pipeline: DETR simplifies the detection pipeline and provides competitive
results with traditional detectors.

7. Results and Performance:

• DETR achieves comparable performance to state-of-the-art detectors like Faster R-CNN
on benchmarks such as COCO. Its end-to-end design and self-attention mechanism
demonstrate the potential of transformers for object detection tasks.

8. Limitations and Future Work:

• DETR requires more training time than traditional CNN-based detectors. Future
work could involve reducing the training time or exploring hybrid models that
combine CNN and transformer architectures.

9. Conclusion:

• DETR demonstrates that transformers can be applied successfully to object
detection, creating an end-to-end pipeline without the need for hand-designed
components. The paper opens new avenues for simplifying object detection models
and improving their scalability.
Paper 5: Rethinking Spatial Dimensions of Vision Transformers
Authors: Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh
Published: 2021
https://arxiv.org/abs/2103.16302

1. Introduction:

• Vision Transformers (ViTs) are powerful, but their performance heavily depends on
the spatial resolution of image patches. This paper explores how the spatial
dimensions of image patches can be optimized to improve the effectiveness of vision
transformers in classification tasks. It proposes methods to adjust spatial dimensions
dynamically.

2. Background:

• Traditional vision transformers treat images as sequences of fixed-size patches,
ignoring the possibility of varying the spatial dimensions of patches. In contrast, CNNs
adapt spatial hierarchies naturally. By rethinking spatial resolutions in transformers,
this research aims to enhance ViT’s performance without increasing computational
cost.

3. Research Objective:

• The goal of this paper is to analyze the effect of spatial dimension modifications in
vision transformers and propose strategies for dynamically adjusting these
dimensions to improve accuracy in classification tasks.

4. Related Work:

• ViTs typically use fixed-size image patches for their input sequences, which can limit
their adaptability. Previous works have explored hierarchical vision transformers to
create better spatial representations, but few have tackled the dynamic adjustment of
spatial dimensions as proposed here.

5. Methodology:

• The proposed approach introduces a dynamic mechanism for adjusting the spatial
dimensions of image patches. A hierarchical strategy is employed, where the patch
size is reduced progressively, allowing the transformer to focus on more detailed
information in deeper layers.
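
One way to realize this kind of progressive spatial reduction is sketched below, assuming PyTorch; the pooling layer, stage depths, and dimensions are illustrative assumptions, not the paper's exact design.

```python
# Reducing the spatial dimension of the token grid between transformer stages.
import torch
import torch.nn as nn

class TokenPool(nn.Module):
    """Downsample a (B, H*W, C) token sequence spatially and widen its channels."""
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.pool = nn.Conv2d(dim_in, dim_out, kernel_size=3, stride=2, padding=1)

    def forward(self, tokens, hw):
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)   # back to a 2-D grid
        x = self.pool(x)                                  # halve the spatial resolution
        return x.flatten(2).transpose(1, 2), (H // 2, W // 2)

def stage(dim, depth=2, heads=4):
    layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
    return nn.TransformerEncoder(layer, depth)

tokens, hw = torch.randn(2, 28 * 28, 64), (28, 28)
tokens = stage(64)(tokens)                       # stage 1: fine tokens
tokens, hw = TokenPool(64, 128)(tokens, hw)      # spatial reduction between stages
tokens = stage(128)(tokens)                      # stage 2: coarser, wider tokens
print(tokens.shape, hw)                          # (2, 196, 128) (14, 14)
```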

6. Key Contributions:

 Dynamic Patch Resizing: Introduces a method for adjusting the spatial dimensions of
image patches dynamically, leading to better performance in vision tasks.
 Hierarchical Vision Transformer: Builds upon the concept of hierarchical
transformers by incorporating dynamic spatial resolutions to improve model
accuracy.

7. Results and Performance:

• The proposed method achieves higher accuracy on vision benchmarks like
ImageNet when compared to traditional ViT models, without increasing
computational complexity. The dynamic adjustment of patch sizes allows the model to
learn better spatial hierarchies.

8. Limitations and Future Work:

• Although the method improves performance, its efficiency is tied to specific tasks
like classification. Future work could explore applying the method to other vision
tasks such as object detection or segmentation.

9. Conclusion:

• The paper proposes a novel approach to improving vision transformers by
dynamically adjusting the spatial dimensions of image patches. This approach
enhances ViT’s capability to model spatial hierarchies, resulting in improved
classification accuracy.
Paper 6: Exploring Plain Vision Transformer Backbones for Object Detection
Authors: Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He
Published: 2022
https://arxiv.org/abs/2203.16527

1. Introduction:

• Vision Transformers (ViTs) have been successful in image classification, but their
application to object detection remains underexplored. This paper evaluates the
potential of plain ViT backbones for object detection, proposing modifications to
enhance their performance without additional complexity.

2. Background:

• Object detection models, such as Faster R-CNN and YOLO, typically rely on CNN-
based backbones. However, transformers, with their global context modeling
capabilities, can also excel in this domain. The study investigates whether plain ViT
models, without significant modifications, can perform well in object detection tasks.

3. Research Objective:

• The main goal is to evaluate the effectiveness of plain ViT backbones in object
detection tasks and suggest minor modifications that enable transformers to achieve
competitive results without additional complexity.

4. Related Work:

• Previous works on object detection have relied on hybrid architectures or CNN
backbones. DETR demonstrated the viability of using transformers in detection, but
required modifications. This paper builds upon DETR's approach by exploring plain
vision transformers for detection tasks.

5. Methodology:

• The paper utilizes standard ViT architectures for object detection, with slight
modifications to optimize their performance. A multi-scale feature pyramid is
introduced to improve object localization, similar to the FPN (Feature Pyramid
Network) used in traditional detection models.
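
A minimal sketch of building such a multi-scale pyramid from a plain ViT's single-scale output is shown below, assuming PyTorch; the specific upsampling/downsampling layers and the channel width are assumptions for illustration rather than the paper's exact design.

```python
# Building a simple multi-scale feature pyramid from one single-scale ViT feature map.
import torch
import torch.nn as nn

dim = 768
vit_feat = torch.randn(1, dim, 14, 14)   # last ViT feature map at stride 16 (224 / 16)

pyramid = nn.ModuleDict({
    "p2": nn.Sequential(nn.ConvTranspose2d(dim, dim, 2, stride=2),   # upsample x2 -> stride 8
                        nn.ConvTranspose2d(dim, dim, 2, stride=2)),  # upsample x2 -> stride 4
    "p3": nn.ConvTranspose2d(dim, dim, 2, stride=2),                 # stride 8
    "p4": nn.Identity(),                                             # stride 16
    "p5": nn.MaxPool2d(kernel_size=2, stride=2),                     # stride 32
})

feats = {name: layer(vit_feat) for name, layer in pyramid.items()}
for name, f in feats.items():
    print(name, tuple(f.shape))   # 56x56, 28x28, 14x14, 7x7 maps for a detection head
```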

6. Key Contributions:

 Plain Vision Transformers for Detection: Demonstrates that plain vision transformers
can be effectively used for object detection tasks.
 Feature Pyramid for Localization: Introduces a multi-scale feature pyramid to
improve object localization without significantly increasing model complexity.

7. Results and Performance:

• The modified ViT backbones achieve competitive results on detection benchmarks
like COCO, showing that plain transformers can perform as well as traditional
CNN-based detection backbones with minimal architectural changes.

8. Limitations and Future Work:

• The study focuses on plain vision transformers, limiting the scope to object
detection tasks. Future research could explore extending the approach to other tasks
like segmentation or applying more advanced transformer architectures.

9. Conclusion:

• The paper demonstrates that plain ViT backbones, with minor modifications, can
achieve competitive results in object detection. This opens the door to further
exploration of transformers in detection without relying on complex architectural
changes.
Paper 7: Pyramid Vision Transformer: A Versatile Backbone for Dense
Prediction Without Convolutions
Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang,
Tong Lu, Ping Luo, Ling Shao
Published: 2021
https://arxiv.org/abs/2102.12122

1. Introduction:

• Pyramid Vision Transformer (PVT) introduces a pyramid structure into the vision
transformer framework to create a versatile backbone for dense prediction tasks like
object detection and semantic segmentation. This model replaces convolutions with
transformers to improve performance in vision tasks without relying on CNNs.

2. Background:

• Traditional transformers used in vision tasks treat images as sequences of patches.
While this works well for classification, it struggles with dense prediction tasks due to
the lack of hierarchical spatial features, which CNNs excel at. PVT addresses this by
introducing a pyramid structure to better handle multi-scale features.

3. Research Objective:

• The research aims to create a versatile transformer backbone that works well for
dense prediction tasks such as object detection and segmentation by introducing a
multi-scale pyramid structure to vision transformers.

4. Related Work:

• Previous approaches like ViT and DETR applied transformers to vision tasks, but
struggled with dense prediction due to their lack of spatial hierarchy. PVT builds on
this by incorporating a pyramid structure that enables transformers to handle
multiple spatial scales more effectively, similar to how CNNs process images.

5. Methodology:

• PVT introduces a hierarchical structure that progressively downsamples feature
maps at different scales. This creates a pyramid of multi-scale features, making the
model suitable for tasks that require fine-grained spatial information. Self-attention
mechanisms are used at each stage to capture both local and global dependencies.
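
The stage-wise downsampling can be sketched roughly as follows in PyTorch; this simplified version uses full self-attention in each stage and omits PVT's spatial-reduction attention, so the dimensions and depths are illustrative assumptions.

```python
# Progressive downsampling: each stage embeds patches from the previous stage's map,
# shrinking resolution and widening channels so a pyramid of multi-scale features emerges.
import torch
import torch.nn as nn

class Stage(nn.Module):
    def __init__(self, in_ch, out_ch, stride, depth=2, heads=2):
        super().__init__()
        self.embed = nn.Conv2d(in_ch, out_ch, kernel_size=stride, stride=stride)
        layer = nn.TransformerEncoderLayer(out_ch, heads, 4 * out_ch, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, depth)

    def forward(self, x):                                   # x: (B, C, H, W)
        x = self.embed(x)                                   # downsample by `stride`
        B, C, H, W = x.shape
        t = self.blocks(x.flatten(2).transpose(1, 2))       # self-attention over tokens
        return t.transpose(1, 2).reshape(B, C, H, W)        # keep a spatial map per stage

stages = nn.ModuleList([Stage(3, 64, 4), Stage(64, 128, 2), Stage(128, 256, 2)])
x = torch.randn(1, 3, 224, 224)
for s in stages:
    x = s(x)
    print(tuple(x.shape))   # (1,64,56,56) -> (1,128,28,28) -> (1,256,14,14): feature pyramid
```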

6. Key Contributions:
 Hierarchical Transformer: Introduces a pyramid structure to create multi-scale
feature maps, making transformers suitable for dense prediction tasks.
 Transformer Without Convolutions: PVT eliminates convolutions, creating a fully
transformer-based backbone for vision tasks.

7. Results and Performance:

• PVT outperforms traditional CNN-based backbones like ResNet on dense prediction
tasks such as object detection and semantic segmentation. It achieves state-of-the-art
results on benchmarks like COCO and ADE20K, demonstrating the effectiveness of
its hierarchical structure.

8. Limitations and Future Work:

• While PVT performs well, the model's complexity could be a challenge in real-time
applications. Future research could explore more efficient hierarchical transformers or
hybrid models that combine the best of CNNs and transformers.

9. Conclusion:

• Pyramid Vision Transformer (PVT) demonstrates that transformers can be effectively
adapted for dense prediction tasks like object detection and segmentation by
introducing a hierarchical pyramid structure. This work opens new possibilities for
using transformers in a wider range of vision tasks.
Paper 8: Twins: Revisiting the Design of Spatial Attention in Vision
Transformers
Authors: Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin
Wei, Huaxia Xia, Chunhua Shen
Published: 2021
https://arxiv.org/abs/2104.13840

1. Introduction:

• Vision Transformers (ViTs) have shown remarkable performance in image
classification. However, they struggle to model local details efficiently due to their
reliance on global self-attention. The Twins architecture revisits spatial attention
design by introducing locally-grouped self-attention alongside global attention to
capture both local and global information effectively.

2. Background:

• While traditional ViTs treat an entire image as a sequence of patches and apply
global attention across the sequence, this approach lacks the ability to capture fine-
grained local details effectively. CNNs excel in capturing local features due to their
inductive biases. Twins aim to bridge this gap by combining local and global attention.

3. Research Objective:

• The objective is to enhance vision transformers' ability to capture local and global
spatial information simultaneously by introducing a locally-grouped self-attention
mechanism.

4. Related Work:

• Previous models like ViT and DETR apply global self-attention to the whole image
sequence, limiting their ability to capture local features. CNNs have demonstrated
superior performance in capturing local details, but lack the global context
transformers provide. Twins combines the strengths of both.

5. Methodology:

• Twins utilize a Locally-grouped Self-Attention (LSA) mechanism alongside Global
Subsampled Attention (GSA) to process image patches. LSA handles local interactions
by applying attention to non-overlapping groups of patches, while GSA captures
global relationships by subsampling the input.
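
A simplified PyTorch sketch of these two attention forms is given below; window size, channel width, and head count are assumed toy values, and the real Twins blocks interleave LSA and GSA with normalization and MLP layers that are omitted here.

```python
# Locally-grouped self-attention (LSA) inside non-overlapping windows, and
# global subsampled attention (GSA) where keys/values come from a pooled map.
import torch
import torch.nn as nn

B, H, W, C, ws, heads = 2, 16, 16, 64, 4, 4
x = torch.randn(B, H, W, C)

# --- LSA: attention restricted to ws x ws groups of patches -------------------
lsa = nn.MultiheadAttention(C, heads, batch_first=True)
groups = (x.view(B, H // ws, ws, W // ws, ws, C)
            .permute(0, 1, 3, 2, 4, 5)
            .reshape(-1, ws * ws, C))           # (B*num_groups, ws*ws, C)
local_out, _ = lsa(groups, groups, groups)      # local interactions only

# --- GSA: every token attends to a subsampled (pooled) summary of the map -----
gsa = nn.MultiheadAttention(C, heads, batch_first=True)
tokens = x.reshape(B, H * W, C)                                    # queries: all tokens
summary = nn.functional.avg_pool2d(x.permute(0, 3, 1, 2), ws)      # pool by the window size
summary = summary.flatten(2).transpose(1, 2)                       # (B, (H/ws)*(W/ws), C)
global_out, _ = gsa(tokens, summary, summary)   # keys/values are the subsampled tokens

print(local_out.shape, global_out.shape)
```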

6. Key Contributions:
 Local and Global Attention Combination: Introduces a hybrid attention mechanism,
combining local and global attention to capture both detailed and contextual
information effectively.
 Efficient Computation: By using subsampling in global attention and locally-grouped
attention, Twins reduce computational overhead compared to pure global self-
attention.

7. Results and Performance:

• Twins outperform both ViT and CNN backbones on tasks like image classification,
object detection, and semantic segmentation. They achieve state-of-the-art results on
benchmarks like ImageNet and COCO, demonstrating improved efficiency in modeling
local and global features.

8. Limitations and Future Work:

• While Twins improve efficiency, the introduction of local attention requires careful
design to balance computational cost and accuracy. Future work may focus on further
optimizing this balance or applying this model to other tasks such as video
understanding.

9. Conclusion:

• The Twins architecture improves upon traditional vision transformers by combining
local and global attention mechanisms. This approach enables the model to capture
both fine-grained local details and global context more efficiently, achieving superior
performance across a range of vision tasks.
Paper 9: EfficientNetV2: Smaller Models and Faster Training
Authors: Mingxing Tan, Quoc V. Le
Published: 2021
https://arxiv.org/abs/2104.00298

1. Introduction:

• EfficientNetV2 introduces a new family of neural networks designed to be both
faster and more efficient in terms of model size and training speed. This paper builds
upon the original EfficientNet models, which focused on scaling architectures for
optimal performance across various computing resources.

2. Background:

• EfficientNet was a major breakthrough in designing models that achieved
state-of-the-art accuracy while minimizing computational costs through compound scaling.
However, the training process remained slow for large datasets. EfficientNetV2 aims
to address this by optimizing the training process and reducing model size without
sacrificing accuracy.

3. Research Objective:

• The primary goal of this research is to develop smaller, faster models that can be
trained more quickly without compromising performance, particularly for image
classification tasks.

4. Related Work:

• EfficientNet and other scaling-based models like ResNet have been successful in
balancing accuracy and efficiency. However, as datasets grow larger, these models
require longer training times. EfficientNetV2 introduces optimizations to reduce
training time while maintaining competitive accuracy.

5. Methodology:

• The model employs a new Fused-MBConv layer, which replaces the depthwise and
expansion convolutions of the standard MBConv block with a single regular 3x3
convolution for faster computation in early stages. Additionally, progressive learning
techniques are used to gradually increase the input image size during training,
allowing the model to learn efficiently from smaller images before scaling up.
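
Below is a simplified sketch of a Fused-MBConv block and a toy progressive-resolution schedule, assuming PyTorch; squeeze-and-excitation, stochastic depth, and the exact EfficientNetV2 stage configuration are omitted, and all sizes are illustrative.

```python
# Fused-MBConv: expansion and spatial mixing done by one regular 3x3 conv, then a 1x1 projection.
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    def __init__(self, ch, expand=4):
        super().__init__()
        hidden = ch * expand
        self.block = nn.Sequential(
            nn.Conv2d(ch, hidden, 3, padding=1, bias=False),  # fused expand + 3x3 conv
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, ch, 1, bias=False),             # 1x1 projection back
            nn.BatchNorm2d(ch),
        )

    def forward(self, x):
        return x + self.block(x)    # residual connection (stride-1 case)

block = FusedMBConv(24)
print(block(torch.randn(1, 24, 56, 56)).shape)

# Progressive learning: start training on small images and grow the resolution over time
# (the paper also adjusts regularization along with image size).
for epoch in range(0, 100, 25):
    size = 128 + (224 - 128) * epoch // 100
    print(f"epoch {epoch}: train at {size}x{size}")
```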

6. Key Contributions:
 Fused-MBConv Layer: A new convolution layer that accelerates training and reduces
computational cost while maintaining accuracy.
 Progressive Learning: A novel training strategy where the input image size is
increased gradually, speeding up the training process without degrading
performance.

7. Results and Performance:

• EfficientNetV2 achieves better accuracy than EfficientNet on benchmarks like
ImageNet, while also reducing training time by up to 5x. It also outperforms other
models in terms of efficiency across various tasks, making it suitable for deployment
in real-world applications with limited computing resources.

8. Limitations and Future Work:

• Although EfficientNetV2 is efficient to train, its performance in low-data regimes
could be explored further. Future work could adapt these methods to settings with
less data, such as transfer learning or fine-tuning.

9. Conclusion:

• EfficientNetV2 introduces smaller, faster models that train efficiently without
sacrificing accuracy. The use of fused convolutions and progressive learning
techniques sets a new standard for scalable neural networks, making them suitable
for both academic research and practical applications.
Paper 10: Multiscale Vision Transformers (MViT)
Authors: Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan,
Jitendra Malik, Christoph Feichtenhofer
Published: 2021
https://arxiv.org/abs/2104.11227

1. Introduction:

• The MViT (Multiscale Vision Transformer) is designed for efficient video and image
processing. It introduces a hierarchical multiscale structure to transformers, allowing
the model to capture spatial and temporal features at various scales, improving
performance in both video understanding and image classification tasks.

2. Background:

• Standard vision transformers (ViTs) struggle with large images or videos due to their
computational complexity and lack of hierarchical structure. CNNs have traditionally
been better suited for video and multiscale processing. MViT combines the strengths
of transformers and CNNs by using multiscale features.

3. Research Objective:

• The paper aims to create a multiscale transformer that efficiently processes large-
scale data (e.g., videos) by capturing spatial and temporal relationships at different
scales.

4. Related Work:

• Previous works such as ViT and DETR have shown the effectiveness of transformers
in image classification and object detection, but their flat structure limits their ability
to process large-scale data like videos. MViT addresses this gap by introducing
multiscale processing similar to CNNs.

5. Methodology:

• MViT introduces a multiscale transformer architecture that gradually downsamples
the input and builds hierarchical feature representations. This enables the model to
process large-scale data, such as high-resolution images and long video sequences,
while maintaining efficiency in computation.
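
The stage-level idea can be sketched as follows in PyTorch; note that the actual MViT pools the query/key/value tensors inside its attention blocks, whereas this simplified version pools tokens between stages, and all dimensions are assumed toy values.

```python
# Between stages, space-time tokens are pooled to a coarser grid while the channel width grows.
import torch
import torch.nn as nn

def pool_tokens(tokens, thw, proj):
    """Pool a (B, T*H*W, C) video token sequence over space and widen its channels."""
    B, N, C = tokens.shape
    T, H, W = thw
    x = tokens.transpose(1, 2).reshape(B, C, T, H, W)
    x = nn.functional.max_pool3d(x, kernel_size=(1, 2, 2))   # halve the spatial grid
    T, H, W = x.shape[2:]
    return proj(x.flatten(2).transpose(1, 2)), (T, H, W)

dim1, dim2 = 96, 192
stage1 = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim1, 4, 4 * dim1, batch_first=True), num_layers=2)
stage2 = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim2, 4, 4 * dim2, batch_first=True), num_layers=2)
widen = nn.Linear(dim1, dim2)

tokens, thw = torch.randn(1, 4 * 14 * 14, dim1), (4, 14, 14)   # 4 frames of 14x14 patches
tokens = stage1(tokens)                     # fine space-time tokens
tokens, thw = pool_tokens(tokens, thw, widen)
tokens = stage2(tokens)                     # coarser, higher-dimensional tokens
print(tokens.shape, thw)                    # (1, 196, 192) (4, 7, 7)
```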
6. Key Contributions:

 Multiscale Processing: Introduces a hierarchical multiscale structure to vision
transformers, improving their ability to process videos and large images.
 Efficient Computation: MViT reduces computational complexity by progressively
downsampling inputs and applying self-attention at multiple scales.

7. Results and Performance:

• MViT outperforms traditional CNN-based video models and flat vision transformers
on tasks like video action recognition and image classification. It achieves state-of-
the-art results on benchmarks like Kinetics-400 and ImageNet.

8. Limitations and Future Work:

• While MViT achieves good results, the complexity of its hierarchical structure may
still pose challenges for real-time applications. Future research could focus on further
optimizing multiscale transformers for faster inference.

9. Conclusion:

• MViT combines multiscale processing with the power of transformers, achieving
significant improvements in both video and image tasks. Its ability to handle spatial
and temporal information at different scales makes it a strong candidate for various
vision-related applications.
