Zorro: the masked multimodal transformer

Recasens, Adrià; Lin, Jason; Carreira, João; Jaegle, Drew; Wang, Luyu; Alayrac, Jean-Baptiste; Luc, Pauline; Miech, Antoine; Smaira, Lucas; Hemsley, Ross; Zisserman, Andrew

arXiv:2301.09595 (cs)

[Submitted on 23 Jan 2023 (v1), last revised 22 Feb 2023 (this version, v2)]

Abstract: Attention-based models are appealing for multimodal processing because inputs from multiple modalities can be concatenated and fed to a single backbone network, thus requiring very little fusion engineering. The resulting representations are, however, fully entangled throughout the network, which may not always be desirable: during learning, contrastive audio-visual self-supervised learning requires independent audio and visual features to operate, otherwise learning collapses; at inference, it should be possible to evaluate audio-visual models on benchmarks that have just audio or just video. In this paper, we introduce Zorro, a technique that uses masks to control how inputs from each modality are routed inside Transformers, keeping some parts of the representation modality-pure. We apply this technique to three popular transformer-based architectures (ViT, Swin and HiP) and show that with contrastive pre-training Zorro achieves state-of-the-art results on most relevant benchmarks for multimodal tasks (AudioSet and VGGSound). Furthermore, the resulting models are able to perform unimodal inference on both video and audio benchmarks such as Kinetics-400 or ESC-50.
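
The masking idea in the abstract can be pictured as a boolean attention mask that decides which tokens each query may attend to. The sketch below is illustrative only. The three-way token split (audio, video, fusion), the rule that fusion tokens attend to everything while audio and video tokens attend only within their own modality, and the names `build_modality_mask` and `masked_softmax` are all assumptions made for this example, not details stated in the abstract.

```python
import math


def build_modality_mask(n_audio, n_video, n_fusion):
    """Return mask[q][k], True when query token q may attend to key token k.

    Token order is [audio | video | fusion]. Assumed routing (hypothetical):
      - audio queries see only audio keys (stay modality-pure)
      - video queries see only video keys (stay modality-pure)
      - fusion queries see every key
    """
    groups = ["audio"] * n_audio + ["video"] * n_video + ["fusion"] * n_fusion
    return [[q == "fusion" or q == k for k in groups] for q in groups]


def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, giving zero weight to masked keys."""
    # Subtract the largest allowed score for numerical stability.
    top = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - top) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    mask = build_modality_mask(n_audio=2, n_video=2, n_fusion=1)
    for row in mask:
        print("".join("x" if ok else "." for ok in row))
    # An audio query ignores video and fusion keys, however high their scores.
    print(masked_softmax([0.0, 0.0, 5.0, 5.0, 5.0], mask[0]))
```

Under these assumptions, the audio and video rows never receive information from the other modality, so their outputs could be used on their own for contrastive training or unimodal evaluation, while the fusion rows see both streams.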

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2301.09595 [cs.CV]
	(or arXiv:2301.09595v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2301.09595

Submission history

From: Adrià Recasens
[v1] Mon, 23 Jan 2023 17:51:39 UTC (1,322 KB)
[v2] Wed, 22 Feb 2023 18:58:10 UTC (1,325 KB)
