DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention

Tang, Xiaoya; Zhang, Bodong; Knudsen, Beatrice S.; Tasdizen, Tolga

arXiv:2407.13920 (cs)

[Submitted on 18 Jul 2024]

Abstract: We propose a novel hierarchical transformer model that integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs). To address the lack of inductive biases in ViTs and their dependence on extensive training datasets, our model employs a CNN backbone to generate hierarchical visual representations. These representations are then adapted for transformer input through an innovative patch tokenization. We also introduce a 'scale attention' mechanism that captures cross-scale dependencies, complementing patch attention to enhance spatial understanding and preserve global perception. Our approach significantly outperforms baseline models on small and medium-sized medical datasets, demonstrating its efficiency and generalizability. The components are designed as plug-and-play for different CNN architectures and can be adapted for multiple applications. The code is available at this https URL.
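To make the two attention directions named in the abstract concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each patch location carries one token per CNN scale, that all tokens have already been projected to a common dimension, that the query, key, and value projections are the identity, and that scale attention is applied before patch attention. None of these choices is stated in the abstract, and the names `self_attention`, `scale_attention`, `patch_attention`, and `grid` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K and V are the tokens themselves (identity projections),
    which keeps the sketch short; a real model would learn these projections.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def scale_attention(grid):
    """Attend across scales within each patch; grid[p][s] is a d-dim token."""
    return [self_attention(scales) for scales in grid]


def patch_attention(grid):
    """Attend across patches within each scale."""
    num_patches, num_scales = len(grid), len(grid[0])
    out = [[None] * num_scales for _ in range(num_patches)]
    for s in range(num_scales):
        column = self_attention([grid[p][s] for p in range(num_patches)])
        for p in range(num_patches):
            out[p][s] = column[p]
    return out


# Toy input: 2 patches, 3 scales, 2-dimensional tokens (made-up values).
grid = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.5], [1.0, 0.0], [0.0, 0.0]],
]
mixed = patch_attention(scale_attention(grid))
```

Both functions preserve the (patches, scales, dimension) layout, so under these assumptions they can be stacked in either order. Scale attention mixes information across the CNN hierarchy at a fixed location, while patch attention mixes information across locations at a fixed scale.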

Comments:	11 pages, 5 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2407.13920 [cs.CV]
	(or arXiv:2407.13920v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.13920

Submission history

From: Xiaoya Tang
[v1] Thu, 18 Jul 2024 22:15:35 UTC (867 KB)
