Multimodal Learning With Transformers: A Survey
(Survey Paper)
I. INTRODUCTION
THE initial inspiration of Artificial Intelligence (AI) is to imitate human perception, e.g., seeing, hearing, touching, smelling. In general, a modality is often associated with a
Fig. 1. Overview of Transformer [2].
and summary of representative methods to enable researchers to understand the global picture of the MML field across related disciplines and, more importantly, to capture a holistic, structured picture of current achievements as well as major challenges.

Taxonomy: For better readability and reachability from and across different disciplines, we adopt a two-tier structured taxonomy based on the application and challenge dimensions, respectively. This has several benefits: (1) Researchers with expertise in specific applications can find those applications appropriate to their own research domain before connecting to other related domains. (2) Similar model designs and architectures developed in different domains can be summarized from an abstract, formula-driven perspective, so that the mathematical ideas of various models formed in different applications can be correlated and contrasted on common ground, crossing domain-specific restrictions. Crucially, our taxonomy offers an interesting stereo-view of individual works, with insights into both application specificity and formulation generality. It is hoped that this can help to break down domain boundaries and foster more effective idea communication and exchange across modalities. By using the prompt modelling strategy [9], [10] as a basis for investigation, we also include the classical classification problem (e.g., image classification) – usually regarded as a single-modality learning application in conventional MML surveys [1], [11], [12] – as a special MML application. This has the potential to significantly enrich MML, as the classification problem is one of the most extensively studied AI topics in the literature [13].

Scope: This survey discusses multimodality-specific designs of the Transformer architecture covering, but not limited to, the following modalities: RGB image [5], depth image [14], multispectral image [15], video [7], audio/speech/music [14], [16], [17], table [18], scene graph/layout [19], [20], [21], [22], pose skeleton [23], SQL [24], [25], recipe [26], programming language [27], sign language [28], [29], [30], point cloud [31], symbolic knowledge (graph) [32], [33], multimodal knowledge graph [34], sketch drawing [35], [36], [37], [38], 3D object/scene [39], [40], [41], document [42], [43], programming code [44] and Abstract Syntax Tree (AST) – a kind of graph [45], optical flow [46], and medical knowledge (e.g., diagnosis code ontology [47]). Note that this survey will not discuss multimodal papers where the Transformer is used simply as a feature extractor without multimodal designs.

Related Surveys: We relate this paper to existing surveys along the two specific dimensions of MML and Transformers. There exist a few MML surveys [1], [11], [12]. In particular, [1] proposed a structured, acknowledged taxonomy of five challenges, which we also adopt as part of our structure. Unlike [1], [11], and [12], which review general machine learning models, we instead focus on Transformer architectures and their self-attention mechanisms. Several surveys dedicated to Transformers have recently been introduced, with a range of emphases including general Transformers [48], efficient designs [49], visualization [50], computer vision tasks [51], [52], [53], [54], medical imaging [55], video tasks [56], and vision-language pretraining [57]. While [51], [53], [54], [55] consider MML, their reviews are somewhat limited in scope, taxonomy, and coverage. To our knowledge, only a few surveys on video-language pretraining (VLP) [57], [58], [59] are relevant to MML. However, VLP is only a subdomain of MML. In this survey, we focus solely on the intersection of multimodal learning and Transformers.

Features: To our knowledge, this paper is the first comprehensive review of the state of Transformer based multimodal machine learning. The major features of this survey include: (1) We highlight that Transformers have the advantage that they can work in a modality-agnostic way, and thus are compatible with various modalities (and combinations of modalities). To support this view, we, for the first time, offer an understanding of the intrinsic traits of Transformers in a multimodal context from a geometrically topological perspective. We suggest that self-attention be treated as graph-style modelling, which models the input sequence (both uni-modal and multimodal) as a fully-connected graph. Specifically, self-attention models the embedding of an arbitrary token from an arbitrary modality as a graph node. (2) We discuss the key components of Transformers in a multimodal context as mathematically as possible. (3) Based on Transformers, cross-modal interactions (e.g., fusion, alignment) are essentially processed by self-attention and its variants. In this paper, we extract the mathematical essence and formulations of Transformer based MML practices from the perspective of self-attention designs.

Contributions: Having presented our review of the landscape of multimodal learning, the Transformer ecosystem, and the multimodal Big Data era in Section II, we summarize our main contributions as follows.
1) In Section III, we present a systematic review of Vanilla Transformer, Vision Transformer, and multimodal Transformers, from a geometrically topological perspective.
2) We contribute a taxonomy for Transformer based MML from two complementary perspectives, i.e., application based and challenge based. In Section IV, we provide a review of multimodal Transformer applications, via two important paradigms, i.e., for multimodal pretraining and for specific multimodal tasks. In Section V, we summarize the common challenges and designs shared by the various multimodal Transformer models and applications.
3) In Section VI, we discuss current bottlenecks, existing problems, and potential research directions for Transformer based MML.

II. BACKGROUND

A. Multimodal Learning (MML)

MML [1], [60], [61] has been an important research area in recent decades; an early multimodal application – audio-visual speech recognition – was studied in the 1980s [62]. MML is key to human societies. The world we humans live in is a multimodal environment, thus both our observations and behaviours are multimodal [63]. For instance, an AI navigation robot needs multimodal sensors to perceive the real-world environment [64], [65], [66], e.g., camera, LiDAR, radar, ultrasonic, GNSS, HD Map, odometer. Furthermore, human behaviours, emotions, events, actions, and humour are multimodal, thus various human-centred MML tasks are widely studied, including multimodal
emotion recognition [67], multimodal event representation [68], understanding multimodal humor [69], face-body-voice based video person-clustering [70], etc.

Thanks to the development of the internet and a wide variety of intelligent devices in recent years, increasing amounts of multimodal data are being transmitted over the internet, and thus an increasing number of multimodal application scenarios are emerging. In modern life, we can see various multimodal applications, including commercial services (e.g., e-commerce/commodity retrieval [71], vision-and-language navigation (VLN) [72], [73], [74], [75], [76]), communication (e.g., lip reading [77], sign language translation [28], [29]), human-computer interaction [78], healthcare AI [79], [80], surveillance AI [81], etc.

Moreover, in the era of Deep Learning, deep neural networks greatly promote the development of MML, and Transformers [2] are a highly competitive architecture family, bringing new challenges and opportunities to MML. In particular, the recent success of large language models and their multimodal derivatives [82], [83], [84], [85], [86] further demonstrates the potential of Transformers in multimodal foundation models.

B. Transformers: A Brief History and Milestones

Transformers are emerging as promising learners. Vanilla Transformer [2] benefits from a self-attention mechanism and is a breakthrough model for sequence-specific representation learning; it was originally proposed for NLP and achieved the state-of-the-art on various NLP tasks. Following the great success of Vanilla Transformer, many derivative models have been proposed, e.g., BERT [4], BART [87], GPT [88], Longformer [43], Transformer-XL [89], XLNet [90].

Transformers currently hold the dominant position in NLP domains, and this motivates researchers to apply Transformers to other modalities, such as the visual domain. In early attempts for the visual domain, the general pipeline was "CNN features + standard Transformer encoder", and researchers achieved BERT-style pretraining by preprocessing raw images, resizing them to a low resolution and reshaping them into a 1D sequence [91].

Vision Transformer (ViT) [5] is a seminal work that contributes an end-to-end solution by applying the Transformer encoder to images. Both ViT and its variants have been widely applied to various computer vision tasks, including low-level tasks [92], recognition [93], detection [94], segmentation [95], etc., and also work well for both supervised [93] and self-supervised [96], [97], [98] visual learning. Moreover, some recently released works provide further theoretical understanding of ViT, e.g., its internal representation robustness [99] and the continuous behaviour of its latent representation propagation [100], [101].

Motivated by the great success of the Transformer, VideoBERT [7] is a breakthrough work: it is the first to extend the Transformer to multimodal tasks and demonstrates the great potential of the Transformer in a multimodal context. Following VideoBERT, many Transformer based multimodal pretraining models (e.g., ViLBERT [102], LXMERT [103], VisualBERT [104], VL-BERT [105], UNITER [106], CBT [107], Unicoder-VL [108], B2T2 [109], VLP [110], 12-in-1 [111], Oscar [112], Pixel-BERT [113], ActBERT [114], ImageBERT [115], HERO [116], UniVL [117]) have become research topics of increasing interest in the field of machine learning.

In 2021, CLIP [9] was proposed. It is a new milestone that uses multimodal pretraining to convert classification into a retrieval task, which enables the pretrained models to tackle zero-shot recognition. Thus, CLIP is a successful practice of making full use of large-scale multimodal pretraining to enable zero-shot learning. Recently, the idea of CLIP has been studied further, e.g., CLIP-pretrained-model based zero-shot semantic segmentation [118], ALIGN [119], CLIP-TD [120], ALBEF [121], and CoCa [122].
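As a concrete illustration of how classification becomes retrieval, the following is a minimal sketch (our own, not the authors' or CLIP's code) of CLIP-style zero-shot prediction: class names are wrapped in a prompt template, encoded by a text encoder, and the class whose text embedding is most similar to the image embedding is returned. The encoder modules and tokenizer are hypothetical placeholder interfaces.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image_encoder, text_encoder, tokenizer, image, class_names):
    """Sketch of CLIP-style zero-shot recognition: classification as text retrieval.

    `image_encoder`, `text_encoder`, and `tokenizer` are assumed pretrained
    modules (hypothetical interfaces); they are not defined here.
    """
    # Wrap each class label in a natural-language prompt template.
    prompts = [f"A photo of a {name}." for name in class_names]
    with torch.no_grad():
        img_emb = F.normalize(image_encoder(image), dim=-1)               # (1, d)
        txt_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)   # (C, d)
    # Cosine similarity between the image and every class prompt; the best
    # matching prompt gives the predicted class, with no task-specific training.
    logits = img_emb @ txt_emb.t()                                        # (1, C)
    return class_names[int(logits.argmax(dim=-1))]
```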
C. Multimodal Big Data

In the past decade, with the rapid development of internet applications such as social media and online retail, massive multimodal datasets have been proposed, e.g., Conceptual Captions [123], COCO [124], VQA [125], Visual Genome [126], SBU Captions [127], Cooking312K [7], LAIT [115], e-SNLI-VE [128], ARCH [129], Adversarial VQA [130], OTT-QA [18], MULTIMODALQA (MMQA) [131], VALUE [132], Fashion IQ [133], LRS2-BBC [134], ActivityNet [135], VisDial [136]. Some emergent new trends among the recently released multimodal datasets are:
1) Data scales are larger. Various recently released datasets are million-scale, e.g., Product1M [137], Conceptual 12M [138], RUC-CAS-WenLan [139] (30 M), HowToVQA69M [140], HowTo100M [141], ALT200M [142], LAION-400M [143].
2) More modalities. In addition to the general modalities of vision, text, and audio, further diverse modalities are emerging, e.g., Pano-AVQA [144] – the first large-scale spatial and audio-visual question answering dataset on 360° videos, YouTube-360 (YT-360) [145] (360° videos), AIST++ [146] (a new multimodal dataset of 3D dance motion and music), and Artemis [147] (affective language for visual arts). In particular, MultiBench [148] provides a dataset including 10 modalities.
3) More scenarios. In addition to common caption and QA datasets, more applications and scenarios have been studied, e.g., CIRR [149] (real-life images), Product1M [137], Bed and Breakfast (BnB) [150] (vision-and-language navigation), M3A [151] (a financial dataset), and X-World [152] (autonomous driving).
4) Tasks are more difficult. Beyond the straightforward tasks, more abstract multimodal tasks are proposed, e.g., MultiMET [153] (a multimodal dataset for metaphor understanding) and Hateful Memes [154] (hate speech in multimodal memes).
5) Instructional videos have become increasingly popular, e.g., the cooking video dataset YouCookII [155]. Aligning a sequence of instructions to a video of someone carrying out a task is an example of a powerful pretraining pretext task [7], [156]. Pretext tasks are pre-designed problems that force the models to learn representations by solving them.
Similar to other deep neural network architectures, Transformers are also data hungry. Therefore, their high-capacity models and the multimodal Big Data basis co-create the prosperity
of Transformer based multimodal machine learning. For instance, Big Data bring the zero-shot learning capability to VLP Transformer models.

III. TRANSFORMERS

In this section, we use mathematical formulations to review the key techniques of Vanilla Transformer [2], Vision Transformer [5], and multimodal Transformers (in this survey, "multimodal Transformer" means "Transformer in a multimodal learning context"), including tokenized inputs, self-attention, multi-head attention, basic Transformer layers/blocks, etc. We highlight that Vanilla Transformers can be understood from a geometrically topological perspective [157]: due to the self-attention mechanism, given a tokenized input from any modality, Vanilla self-attention (Transformer) can model it as a fully-connected graph in topological geometry space [158]. Compared with other deep networks (for instance, CNNs are restricted to aligned grid spaces/matrices), Transformers intrinsically have a more general and flexible modelling space. This is a notable advantage of Transformers for multimodal tasks. Sections III-A, III-B, and III-C review the key designs of Vanilla Transformer, Vision Transformer, and multimodal Transformers, respectively.

A. Vanilla Transformer

Vanilla Transformer has an encoder-decoder structure and is the origin of the Transformer-based research field. It takes tokenized input (see Section III-A1). Both its encoder and decoder are stacked from Transformer layers/blocks, as demonstrated in Fig. 1. Each block has two sub-layers, i.e., a multi-head self-attention (MHSA) layer (see Section III-A2) and a position-wise fully-connected feed-forward network (FFN) (see Section III-A3). To help the back-propagation of the gradient, both MHSA and FFN use a Residual Connection [159] (given an input x, the residual connection of any mapping f(·) is defined as x ← f(x) + x), followed by a normalization layer. Thus, assuming that the input tensor is Z, the output of the MHSA and FFN sub-layers can be formulated as

Z ← N(sublayer(Z) + Z),   (1)

where sublayer(·) is the mapping implemented by the sub-layer itself and N(·) denotes normalization, e.g., BN(·) [160] or LN(·) [161].

Discussion: There is an important unsolved problem, namely post-normalization versus pre-normalization. The original Vanilla Transformer uses post-normalization for each MHSA and FFN sub-layer. However, considered from the mathematical perspective, pre-normalization makes more sense [162]. This is similar to a basic principle of matrix theory that normalization should be performed before projection, e.g., the Gram–Schmidt process (https://en.wikipedia.org/wiki/Gram%E2%80%93Schmidt_process). This problem should be studied further through both theoretical research and experimental validation.
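To make the difference concrete, the following minimal PyTorch-style sketch (our illustration, not code from any surveyed work) applies (1) around a generic sub-layer in either the post-normalization or the pre-normalization arrangement; the sub-layer itself (MHSA or FFN) is passed in as a module.

```python
import torch
import torch.nn as nn

class ResidualSubLayer(nn.Module):
    """Wraps a sub-layer (MHSA or FFN) with a residual connection and LayerNorm,
    following Eq. (1): Z <- N(sublayer(Z) + Z) for post-norm, or the
    pre-norm variant Z <- Z + sublayer(N(Z))."""

    def __init__(self, d_model: int, sublayer: nn.Module, pre_norm: bool = False):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.pre_norm = pre_norm

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        if self.pre_norm:
            # Pre-normalization: normalize before the projection/sub-layer.
            return z + self.sublayer(self.norm(z))
        # Post-normalization, as in the original Vanilla Transformer.
        return self.norm(self.sublayer(z) + z)
```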
1) Input Tokenization: Tokenization: Vanilla Transformer was originally proposed for machine translation as a sequence-to-sequence model, thus it is straightforward for it to take vocabulary sequences as input. As mentioned previously, the original self-attention can model an arbitrary input as a fully-connected graph, independently of modalities. Specifically, both Vanilla and variant Transformers take in tokenized sequences, where each token can be regarded as a node of the graph.

Special/Customized Tokens: In Transformers, various special/customized tokens can be semantically defined as placeholders in the token sequences, e.g., the mask token [MASK] [4]. Some common special tokens are summarized in the appendix, available in the online supplemental material. Special tokens can be used in both uni-modal and multimodal Transformers.

Position Embedding: Position embeddings are added to the token embeddings to retain positional information [4]. Vanilla Transformer uses sine and cosine functions to produce position embeddings. To date, various implementations of position embedding have been proposed; the concrete solutions are outside the focus of this survey.

Discussion: The main advantages of input tokenization include the following:
1) Tokenization is a more general approach from a geometrically topological perspective, achieved by minimizing the constraints caused by different modalities. In general, every modality has intrinsic constraints on modelling. For instance, sentences have sequential structures that are well suited to RNNs, and photos are restricted to aligned grid matrices for which CNNs work well. Tokenization helps Transformers inherently process different modalities universally via irregular, sparse structures. Thus even Vanilla Transformer can encode multimodal inputs flexibly by simple concatenation or weighted summation, without any multimodal tailor-made modifications.
2) Tokenization is a more flexible approach to organize the input information, via concatenation/stacking, weighted summation, etc. Vanilla Transformer injects temporal information into the token embedding by summing a position embedding. For instance, when using a Transformer to model free-hand sketch drawing [163], each input token can integrate various drawing stroke patterns, e.g., stroke coordinates, stroke ordering, pen state (start/end).
3) Tokenization is compatible with task-specific customized tokens, e.g., the [MASK] token [4] for Masked Language Modelling and the [CLASS] token [5] for classification.

Discussion: How to understand position embedding in Transformers is an open problem. It can be understood as a kind of implicit coordinate basis of the feature space, providing temporal or spatial information to the Transformer. For point cloud [164] and sketch drawing stroke [163] inputs, each token element is already a coordinate, meaning that position embedding is optional, not necessary. Furthermore, position embedding can be regarded as a kind of general additional information. In other words, from a mathematical point of view, any additional information can be added in the manner of position embedding, e.g., the pen state of a sketch drawing stroke [163], or cameras and viewpoints in surveillance [165]. There is a comprehensive
survey [166] discussing the position information in Transformers. For both sentence structures (sequential) and general graph structures (sparse, arbitrary, and irregular), position embeddings help Transformers to learn or encode the underlying structures. Considered from the mathematical perspective of self-attention, i.e., scaled dot-product attention, attention is invariant to the positions of words (in text) or nodes (in graphs) if position embedding information is missing. Thus, in most cases, position embedding is necessary for Transformers.

2) Self-Attention and Multi-Head Self-Attention: The core component of Vanilla Transformer is the Self-Attention (SA) operation [2], also termed "Scaled Dot-Product Attention". Assume that X = [x_1, x_2, ...] ∈ R^(N×d) is an input sequence of N elements/tokens; an optional preprocessing is positional encoding, by point-wise summation Z ← X ⊕ PositionEmbedding or concatenation Z ← concat(X, PositionEmbedding).

Self-Attention (SA): After preprocessing, the embedding Z goes through three projection matrices (W^Q ∈ R^(d×d_q), W^K ∈ R^(d×d_k), and W^V ∈ R^(d×d_v), with d_q = d_k) to generate three embeddings Q (Query), K (Key), and V (Value):

Q = ZW^Q, K = ZW^K, V = ZW^V.   (2)

The output of self-attention is defined as

Z = SA(Q, K, V) = softmax(QK^T / √d_q) V.   (3)

Given an input sequence, self-attention allows each element to attend to all the other elements, so that self-attention encodes the input as a fully-connected graph. Therefore, the encoder of Vanilla Transformer can be regarded as a fully-connected GNN encoder, and the Transformer family has the non-local ability of global perception, similar to the Non-Local Network [167].

Masked Self-Attention (MSA): In practice, a modification of self-attention is needed to help the decoder of the Transformer learn contextual dependence and to prevent positions from attending to subsequent positions:

Z = MSA(Q, K, V) = softmax(QK^T / √d_q ⊙ M) V,   (4)

where M is a masking matrix. For instance, GPT [88] uses a triangular mask so that each token can only look at the past tokens. Masking can be used in both the encoder [163], [168] and the decoder of the Transformer, and has flexible implementations, e.g., 0-1 hard masks [163] and soft masks [168].

In both uni-modal and multimodal practices, specific masks are designed based on domain knowledge and prior knowledge. Essentially, MSA is used to inject additional knowledge into Transformer models, e.g., [24], [163], [169], [170].
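As an illustration of (2)-(4), here is a minimal, self-contained sketch (our own, for exposition) of scaled dot-product self-attention with an optional mask; the tensor shapes and the -inf masking convention are implementation choices rather than prescriptions from the surveyed works.

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Scaled dot-product self-attention over a token sequence Z of shape (N, d)."""

    def __init__(self, d: int, d_k: int):
        super().__init__()
        self.w_q = nn.Linear(d, d_k, bias=False)  # W^Q
        self.w_k = nn.Linear(d, d_k, bias=False)  # W^K
        self.w_v = nn.Linear(d, d_k, bias=False)  # W^V
        self.d_k = d_k

    def forward(self, z: torch.Tensor, mask: torch.Tensor = None) -> torch.Tensor:
        q, k, v = self.w_q(z), self.w_k(z), self.w_v(z)          # Eq. (2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)   # QK^T / sqrt(d_q)
        if mask is not None:
            # Masked self-attention, Eq. (4): disallowed positions get -inf,
            # so their softmax weight becomes zero.
            scores = scores.masked_fill(mask == 0, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v                  # Eq. (3)

# Example: a GPT-style causal mask over N tokens (each token sees only the past).
N, d = 8, 32
causal_mask = torch.tril(torch.ones(N, N))
attn = SelfAttention(d, d_k=32)
out = attn(torch.randn(N, d), mask=causal_mask)  # (N, 32)
```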
Multi-Head Self-Attention (MHSA): In practice, multiple self-attention sub-layers can be stacked in parallel and their concatenated outputs fused by a projection matrix W, forming a structure named Multi-Head Self-Attention:

Z = MHSA(Q, K, V) = concat(Z_1, ..., Z_H) W,   (5)

where each head Z_h = SA(Q_h, K_h, V_h), h ∈ [1, H], and W is a linear projection matrix. The idea of MHSA is a kind of ensemble: MHSA helps the model to jointly attend to information from multiple representation sub-spaces.

3) Feed-Forward Network: The output of the multi-head attention sub-layer goes through a position-wise Feed-Forward Network (FFN) that consists of successive linear layers with non-linear activation. For instance, a two-layer FFN can be formulated as

FFN(Z) = σ(ZW_1 + b_1)W_2 + b_2,   (6)

where W_1, b_1, W_2, and b_2 denote the weights and biases of the two linear transformations, and σ(·) is a non-linear activation, e.g., ReLU(·) [171] or GELU(·) [172]. In some Transformer literature, the FFN is also termed a Multi-Layer Perceptron (MLP).
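A compact, self-contained sketch of (5) and (6) follows (our illustration; real implementations differ in details such as dropout and batching). All heads are computed with one fused projection, which is mathematically equivalent to running H separate SA heads and concatenating their outputs.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Eq. (5): H parallel scaled dot-product heads, concatenated and fused by W."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # all heads at once
        self.w_out = nn.Linear(d_model, d_model, bias=False)      # the fusing matrix W

    def forward(self, z: torch.Tensor) -> torch.Tensor:           # z: (N, d_model)
        n, d = z.shape
        q, k, v = self.w_qkv(z).chunk(3, dim=-1)
        # Reshape to (H, N, d_k) so each head attends independently.
        q, k, v = (t.view(n, self.h, self.d_k).transpose(0, 1) for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)
        heads = torch.softmax(scores, dim=-1) @ v                  # (H, N, d_k)
        return self.w_out(heads.transpose(0, 1).reshape(n, d))     # concat + project

class FeedForward(nn.Module):
    """Eq. (6): position-wise two-layer FFN with a GELU non-linearity."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z)
```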
B. Vision Transformer

Vision Transformer (ViT) [5] has an image-specific input pipeline in which the input image must be split into fixed-size (e.g., 16 × 16 or 32 × 32) patches. After going through a linear embedding layer and adding the position embeddings, all the patch-wise sequences are encoded by a standard Transformer encoder. Given an image X ∈ R^(H×W×C) (H height, W width, C channels), ViT reshapes X into a sequence of flattened 2D patches x_p ∈ R^(N×(P²·C)), where (P × P) is the patch resolution and N = HW/P². To perform classification, a standard approach is to prepend an extra learnable embedding, the "classification token" [CLASS], to the sequence of embedded patches:

Z ← concat([CLASS], XW),   (7)

where W denotes the projection.
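The input pipeline of (7) can be sketched as follows (a minimal illustration under the stated shapes; the full ViT additionally adds position embeddings and feeds Z to a standard Transformer encoder). The default patch size and embedding width are illustrative choices.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into P x P patches, linearly project them, and prepend [CLASS]."""

    def __init__(self, patch: int = 16, channels: int = 3, d_model: int = 768):
        super().__init__()
        self.patch = patch
        self.proj = nn.Linear(patch * patch * channels, d_model)   # the projection W
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [CLASS]

    def forward(self, x: torch.Tensor) -> torch.Tensor:            # x: (B, C, H, W)
        b, c, h, w = x.shape
        p = self.patch
        # (B, C, H, W) -> (B, N, P*P*C) with N = H*W / P^2 flattened 2D patches.
        patches = (x.unfold(2, p, p).unfold(3, p, p)                # (B, C, H/P, W/P, P, P)
                     .permute(0, 2, 3, 1, 4, 5)
                     .reshape(b, (h // p) * (w // p), c * p * p))
        tokens = self.proj(patches)                                 # (B, N, d_model)
        cls = self.cls_token.expand(b, -1, -1)                      # (B, 1, d_model)
        return torch.cat([cls, tokens], dim=1)                      # Eq. (7)

z = PatchEmbedding()(torch.randn(2, 3, 224, 224))  # (2, 1 + 196, 768)
```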
C. Multimodal Transformers

Recently, a large number of Transformers have been studied extensively for various multimodal tasks and shown to be compatible with various modalities in both discriminative and generative tasks.

In this section, we review the key techniques/designs of the existing multimodal Transformer models, from the perspectives of multimodal input (Section III-C1), self-attention variants (Section III-C2), and network architectures (Section III-C3).

1) Multimodal Input: The Transformer family is a general architecture that can be formulated as a type of general graph neural network. Specifically, self-attention can process each input as a fully-connected graph, by attending to the global (non-local) patterns. Therefore, this intrinsic trait helps Transformers work in a modality-agnostic pipeline that is compatible with various modalities, by treating the embedding of each token as a node of the graph.

Tokenization and Embedding Processing: Given an input from an arbitrary modality, users only need to perform two main steps before feeding the data into Transformers: (1) tokenize the input, and (2) select an embedding space to represent the tokens. In practice, both tokenizing the input and selecting an embedding for the tokens are vital for Transformers, but highly
flexible, with many alternatives. For instance, given an image, the solution for tokenizing and embedding is not unique. Users can choose or design tokenization at multiple granularity levels – coarse-grained versus fine-grained – e.g., use ROIs (obtained by an object detector) and CNN features as tokens and token embeddings [102], use patches and a linear projection as tokens and token embeddings [5], or use graph nodes (obtained by an object detector and graph generator) and GNN features as tokens and token embeddings [181]. Given a tokenization plan, the subsequent embedding approaches can be diverse. For example, for video input, a common tokenization is to treat the non-overlapping (down-sampled) windows over the video as tokens, and their embeddings can then be extracted by various 3D CNNs, e.g., VideoBERT [7], CBT [107], and UniVL [117] use S3D [186], while ActBERT uses ResNet-3D [187].

Table I summarizes some common practices of multimodal inputs for Transformers, including RGB, video, audio/speech/music, text, graph, etc.

TABLE I: TOKENIZATION AND TOKEN EMBEDDING COMPARISON FOR MULTI-MODAL INPUTS FOR TRANSFORMERS

Discussion: Considered from the perspective of geometric topology, each of the modalities listed in Table I can be regarded as a graph. An RGB image is essentially a neat grid graph in pixel space. Both video and audio are clip/segment based graphs over a complex space involving temporal and semantic patterns. Both 2D and 3D drawing sketches [78], [163] are a kind of sparse graph if we consider their key points along the drawing strokes. Similar to sketches, the human pose is also a kind of graph. A 3D point cloud is a graph in which each coordinate is a node. Other abstract modalities can also be interpreted as graphs, e.g., source code [44], the data flow of source code [44], tables [18], SQL database schemas [25], text question graphs [24], and electronic health records (EHRs) [184].

Token Embedding Fusion: In practice, Transformers allow each token position to contain multiple embeddings. This is essentially a kind of early fusion of embeddings, for both uni-modal and multimodal Transformer models. (This will be discussed further in subsequent sections.) The most common fusion is the token-wise summing of multiple embeddings, e.g., a specific token embedding ⊕ a position embedding. Similar to the flexible tokenization, token embedding fusion is also flexible and widely applied to both uni-modal and multimodal Transformer applications. In [81], token-wise weighted summing is used to perform early fusion of RGB and grey-scale images for multimodal surveillance AI. Token embedding fusion plays a particularly important role in multimodal Transformer applications, as various embeddings can be fused by token-wise operators. For example, in VisualBERT [104] and Unicoder-VL [108], segment embeddings are token-wise added to indicate which modality (vision or language) each token is from; VL-BERT [105] injects global visual context into the linguistic domain by "linguistic token embedding ⊕ full image visual feature embedding"; InterBERT [188] adds location information for ROIs by "ROI embedding ⊕ location embedding"; and in ImageBERT [115], five kinds of embeddings are fused: "image embedding ⊕ position embedding ⊕ linguistic embedding ⊕ segment embedding ⊕ sequence position embedding".
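A minimal sketch of token-wise embedding fusion of this kind follows (our illustration; the vocabulary size, the two-segment convention, and the weights are placeholders): several embeddings that share the token axis are simply summed, optionally with scalar weights.

```python
import torch
import torch.nn as nn

class FusedTokenEmbedding(nn.Module):
    """Token-wise early fusion: token embedding (+) position embedding (+) segment embedding."""

    def __init__(self, vocab_size: int = 30522, max_len: int = 512,
                 num_segments: int = 2, d_model: int = 768):
        super().__init__()
        self.token = nn.Embedding(vocab_size, d_model)
        self.position = nn.Embedding(max_len, d_model)
        self.segment = nn.Embedding(num_segments, d_model)  # e.g., 0 = vision, 1 = language

    def forward(self, token_ids, segment_ids, weights=(1.0, 1.0, 1.0)):
        # token_ids, segment_ids: (B, N) integer tensors sharing the token axis.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        a, b, c = weights  # token-wise weighted summation (weights are illustrative)
        return (a * self.token(token_ids)
                + b * self.position(positions)[None, :, :]
                + c * self.segment(segment_ids))
```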
2) Self-Attention Variants in Multimodal Context: In multimodal Transformers, cross-modal interactions (e.g., fusion, alignment) are essentially processed by self-attention and its variants. Thus, in this section, we review the main multimodal modelling practices of Transformers from the perspective of self-attention designs, including (1) early summation (token-wise, weighted), (2) early concatenation, (3) hierarchical
attention (multi-stream to one-stream), (4) hierarchical attention (one-stream to multi-stream), (5) cross-attention, and (6) cross-attention to concatenation. See Table II and Fig. 2.

TABLE II: SELF-ATTENTION VARIANTS FOR MULTI-MODAL INTERACTION/FUSION

Fig. 2. Transformer-based cross-modal interactions: (a) Early Summation, (b) Early Concatenation, (c) Hierarchical Attention (multi-stream to one-stream), (d) Hierarchical Attention (one-stream to multi-stream), (e) Cross-Attention, and (f) Cross-Attention to Concatenation. "Q": Query embedding; "K": Key embedding; "V": Value embedding; "TL": Transformer Layer. Best viewed in colour.

For brevity, we state and compare the mathematical formulations in the two-modality case. Please note that all the discussed self-attention variants are flexible enough to be extended to cases with more modalities. Specifically, the following formulations are modality-, tokenization-, and embedding-agnostic, as self-attention models the embedding of an arbitrary token from an arbitrary modality as a node of a graph.

Given inputs XA and XB from two arbitrary modalities, Z(A) and Z(B) denote their respective token embeddings. Let Z denote the token embedding (sequence) produced by the multimodal interactions, Tf(·) stand for the processing of Transformer layers/blocks, and C(·, ·) denote concatenation.

(1) Early Summation: In practice, early summation [46], [81] is a simple and effective multimodal interaction, in which the token embeddings from multiple modalities are weighted-summed at each token position and then processed by Transformer layers:

Z ← Tf(αZ(A) ⊕ βZ(B)) = MHSA(Q(AB), K(AB), V(AB)),   (8)

where ⊕ is element-wise summation, and α and β are weightings. Concretely, Q(AB) = (αZ(A) ⊕ βZ(B)) W^Q(AB), K(AB) = (αZ(A) ⊕ βZ(B)) W^K(AB), and V(AB) = (αZ(A) ⊕ βZ(B)) W^V(AB).

Its main advantage is that it does not increase computational complexity. However, its main disadvantage is the manually set weightings. As discussed in Sections III-A1 and III-C1, summing the position embedding is intrinsically a case of early summation.
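A sketch of (8), assuming a generic `transformer` module that maps a token sequence to a token sequence (e.g., a stack of MHSA/FFN blocks as in Section III-A); the weights α and β are the manually set scalars mentioned above.

```python
import torch
import torch.nn as nn

def early_summation(z_a: torch.Tensor, z_b: torch.Tensor,
                    transformer: nn.Module,
                    alpha: float = 0.5, beta: float = 0.5) -> torch.Tensor:
    """Eq. (8): weighted token-wise sum of two modality streams, then one Transformer.

    z_a, z_b: (B, N, d) token embeddings from modalities A and B, aligned per position.
    """
    fused = alpha * z_a + beta * z_b   # element-wise (token-wise) weighted sum
    return transformer(fused)          # self-attention over the fused single stream
```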
(2) Early Concatenation: Another straightforward solution is early concatenation [7], [44], [178], [180], in which the token embedding sequences from multiple modalities are concatenated and input into the Transformer layers as

Z ← Tf(C(Z(A), Z(B))).   (9)

Thus, all the multimodal token positions can be attended to as a whole sequence, such that the positions of each modality can be encoded well by conditioning on the context of the other modalities. VideoBERT [7] is one of the first multimodal Transformer works; in it, video and text are fused via early concatenation, which can encode the global multimodal context well [188]. However, the longer sequence after concatenation increases the computational complexity. Early concatenation is also termed "all-attention" or "Co-Transformer" [137].

(3) Hierarchical Attention (multi-stream to one-stream): Transformer layers can be combined hierarchically to attend to the cross-modal interactions. A common practice is that the multimodal inputs are encoded by independent Transformer streams and their outputs are concatenated and fused by another Transformer [146]:

Z ← Tf3(C(Tf1(Z(A)), Tf2(Z(B)))).   (10)

This kind of hierarchical attention is an implementation of late interaction/fusion, and can be treated as a special case of early concatenation.
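Sketches of (9) and (10) under the same assumption of generic Transformer modules (sequence in, sequence out); modality-type embeddings, padding, and pooling are omitted for brevity.

```python
import torch
import torch.nn as nn

def early_concatenation(z_a, z_b, transformer: nn.Module):
    """Eq. (9): concatenate the two token sequences along the token axis ("all-attention").

    The joint sequence lets every position attend to both modalities, at the cost of
    quadratic attention over the longer concatenated sequence.
    """
    return transformer(torch.cat([z_a, z_b], dim=1))     # (B, N_a + N_b, d)

def hierarchical_multi_to_one(z_a, z_b, tf1: nn.Module, tf2: nn.Module, tf3: nn.Module):
    """Eq. (10): independent per-modality streams, fused late by a third Transformer."""
    return tf3(torch.cat([tf1(z_a), tf2(z_b)], dim=1))
```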
(4) Hierarchical Attention (one-stream to multi-stream): InterBERT [188] is another good practice of hierarchical attention, in which the concatenated multimodal inputs are encoded by a shared single-stream Transformer that is followed by two separate Transformer streams. This flow can be formulated as

C(Z(A), Z(B)) ← Tf1(C(Z(A), Z(B))),
Z(A) ← Tf2(Z(A)),   (11)
Z(B) ← Tf3(Z(B)).

This method perceives the cross-modal interactions while preserving the independence of the uni-modal representations.
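A sketch of (11), again with generic Transformer modules; the split back into the two streams simply undoes the concatenation along the token axis.

```python
import torch
import torch.nn as nn

def hierarchical_one_to_multi(z_a, z_b, tf1: nn.Module, tf2: nn.Module, tf3: nn.Module):
    """Eq. (11): one shared single-stream Transformer, then two modality-specific streams."""
    joint = tf1(torch.cat([z_a, z_b], dim=1))                   # shared cross-modal encoding
    z_a, z_b = joint.split([z_a.size(1), z_b.size(1)], dim=1)   # undo the concatenation
    return tf2(z_a), tf3(z_b)                                   # independent uni-modal refinement
```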
(5) Cross-Attention: For two-stream Transformers, if the Q (Query) embeddings are exchanged/swapped in a cross-stream manner, the cross-modal interactions can also be perceived. This method is termed cross-attention or co-attention [190], and was first proposed in ViLBERT [102]:

Z(A) ← MHSA(Q_B, K_A, V_A),
Z(B) ← MHSA(Q_A, K_B, V_B).   (12)

Cross-attention attends to each modality conditioned on the other and does not increase computational complexity; however, considered per modality, this method fails to perform cross-modal attention globally and thus loses the whole context. As discussed in [188], two-stream cross-attention can learn cross-modal interaction, but there is no self-attention to the self-context inside each modality.
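A sketch of (12): each stream's Queries come from the other modality while Keys and Values stay within the modality (a single-layer, single-head illustration; ViLBERT-style models interleave such co-attention layers with ordinary Transformer layers). The cross-attention-to-concatenation variant described next simply concatenates the two outputs and feeds them to a further Transformer, as in (13).

```python
import math
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head co-attention: queries from one stream, keys/values from the other."""

    def __init__(self, d: int):
        super().__init__()
        self.w_q = nn.Linear(d, d, bias=False)
        self.w_k = nn.Linear(d, d, bias=False)
        self.w_v = nn.Linear(d, d, bias=False)
        self.d = d

    def attend(self, queries, keys, values):
        q, k, v = self.w_q(queries), self.w_k(keys), self.w_v(values)
        return torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d), dim=-1) @ v

    def forward(self, z_a, z_b):
        # Eq. (12): swap the Query streams between the two modalities.
        # Note: each output inherits the token length of its query stream.
        new_a = self.attend(z_b, z_a, z_a)   # Z(A) <- Attn(Q_B, K_A, V_A)
        new_b = self.attend(z_a, z_b, z_b)   # Z(B) <- Attn(Q_A, K_B, V_B)
        return new_a, new_b
```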
(6) Cross-Attention to Concatenation: The two streams of cross-attention [102] can be further concatenated and processed by another Transformer to model the global context. This kind of hierarchical cross-modal interaction is also widely studied [137], [189], and alleviates the drawback of cross-attention:

Z(A) ← MHSA(Q_B, K_A, V_A),
Z(B) ← MHSA(Q_A, K_B, V_B),   (13)
Z ← Tf(C(Z(A), Z(B))).

Discussion: All the aforementioned self-attention variants for multimodal interactions are modality-generic, and can be applied in flexible strategies and for multi-granular tasks. Specifically, these interactions can be flexibly combined and nested. For instance, multiple cross-attention streams can be used inside hierarchical attention (one-stream to multi-stream): in a two-stream decoupled model [191], Tf2 and Tf3 of (11) are implemented by the cross-attention defined in (12). Moreover, these variants can be extended to multiple (≥ 3) modalities. TriBERT [183] is a tri-modal cross-attention (co-attention) for vision, pose, and audio, where, given a Query embedding, its Key and Value embeddings are the concatenation from the other modalities. Cross-attention to concatenation is applied to three modalities (i.e., language, video, and audio) in [189].

3) Network Architectures: Essentially, the various multimodal Transformers work due to their internal multimodal attentions, which are the aforementioned self-attention variants. Meanwhile, as illustrated in Fig. 2, these attentions determine the external network structures of the multimodal Transformers in which they are embedded.

In general, considered from the angle of network structures, (1) early summation and early concatenation work in a single stream, (2) cross-attention works in multi-streams, and (3) hierarchical attention and cross-attention to concatenation work in hybrid streams. Thus, multimodal Transformers can be divided into single-stream (e.g., UNITER [106], VisualBERT [104], VL-BERT [105], Unified VLP [110]), multi-stream (e.g., ViLBERT [102], LXMERT [103], ActBERT [114]), and hybrid-stream (e.g., InterBERT [188]) models, etc.

From the perspective of the timing of the interaction, these multimodal attentions fall into three categories, i.e., early interaction: early summation, early concatenation, and hierarchical attention (one-stream to multi-stream); late interaction: hierarchical attention (multi-stream to one-stream); and throughout interaction: cross-attention and cross-attention to concatenation.

As demonstrated in Fig. 2 of [192], the multimodal Transformer models have another architecture taxonomy based on the computational size of the components.

IV. APPLICATION SCENARIOS

In this section we survey multimodal Transformers based on their application scenarios. We consider two important paradigms: (1) Transformers for multimodal pretraining (Section IV-A, including both task-agnostic (Section IV-A1) and task-specific (Section IV-A2) multimodal pretraining), and (2) Transformers for specific multimodal tasks (Section IV-B).

A. Transformers for Multimodal Pretraining

Inspired by the great success of Transformer based pretraining in the NLP community, Transformers are also widely studied for multimodal pretraining as various large-scale multimodal corpora are emerging. Recent work has demonstrated that, if pretrained on large-scale multimodal corpora, Transformer based models [7], [102], [103], [104], [105], [106], [110] clearly outperform other competitors on a wide range of multimodal down-stream tasks, and moreover achieve zero-shot generalization ability. These superiorities have made Transformer-based multimodal pretraining a hot topic, with two main directions, i.e., general pretraining for agnostic down-stream tasks (Section IV-A1) and goal-oriented pretraining for specific down-stream tasks (Section IV-A2).

We focus on these key points: (1) What trends are emerging? (2) Where/how do the cross-modal interactions take place during pretraining? (3) How can the pretraining pretext objectives be sorted out and understood, and how can they drive Transformers to learn the cross-modal interactions?

1) Task-Agnostic Multimodal Pretraining: Recently, Transformer-oriented pretraining has been widely studied involving diverse modality combinations, e.g., video-text [7], [107], [117], image-text [102], [103], [104], [193], [194], [195], and acoustic-text [180]. Among existing work, the following main trends are emerging:

(1) Vision-language pretraining (VLP) is a major research problem in this field. VLP includes both "image + language" and "video + language", and is also termed visual-linguistic pretraining. A great deal of excellent work has been proposed, e.g., VideoBERT [7], ViLBERT [102], LXMERT [103], VisualBERT [104], VL-BERT [105], UNITER [106], CBT [107],
Unicoder-VL [108], B2T2 [109], VLP [110], 12-in-1 [111], Oscar [112], Pixel-BERT [113], ActBERT [114], ImageBERT [115], HERO [116], UniVL [117], and SemVLP [196].

(2) Speech can be used as text. Thanks to recent advances in automatic speech recognition (ASR) techniques, in a multimodal context speech can be converted to text by off-the-shelf speech recognition tools. For instance, VideoBERT [7] and CBT [107] make full use of speech, rather than low-level sounds, as a source of cross-modal supervision, by extracting high-level semantic text.

(3) Over-dependence on well-aligned multimodal data. The majority of Transformer-based multimodal pretraining works in a self-supervised manner; however, it is overly dependent on well-aligned multimodal sample pairs/tuples. For instance, a large number of image-language pretraining Transformer models are pretrained on large-scale image-text pairs, e.g., VisualBERT [104], VL-BERT [105], ViLBERT [102], LXMERT [103], UNITER [106]. For another example, instructional videos (e.g., cooking; note that instructional videos can also be only weakly aligned [197], [198]) are widely used as pretraining corpora, e.g., HowToVQA69M [140], HowTo100M [141], since, in general, their visual clues/content and the spoken words are more likely to align with each other than in other videos. However, using cross-modal alignment as cross-modal supervision is costly for large-scale applications. Thus, how to use weakly-aligned or even unpaired/unaligned multimodal data as pretraining corpora is still understudied. Some recent attempts [137], [199] study the use of weakly-aligned cross-modal supervision to train Transformers to learn the cross-modal interactions.

(4) Most of the existing pretext tasks transfer well across modalities. For instance, Masked Language Modelling (MLM) in the text domain has been applied to audio and image, e.g., Masked Acoustic Modelling [180], [200] and Masked Image Region Prediction [190], while Sentence Ordering Modelling (SOM) [201] in the text domain and Frame Ordering Modelling (FOM) [116] in the video domain share the same idea. We further discuss the pretext tasks for multimodal Transformer pretraining below.

(5) Model structures fall mainly into three categories. Essentially, in multimodal pretraining scenarios, Transformer models work based on the self-attention variants discussed in Section III-C2. Thus, considered from the perspective of model structures, the existing Transformers for multimodal pretraining also fall mainly into three categories, i.e., single-stream, multi-stream, and hybrid-stream.

(6) Cross-modal interactions can be performed within various components/levels of the pretraining pipelines. For Transformer based multimodal pretraining, the key is to drive the Transformer (encoder, with or without a decoder) to learn the cross-modal interactions. In the existing Transformer-based multimodal pretraining practices, the cross-modal interactions are flexible and can be performed within various components/levels of the pretraining pipelines. In general, Transformer-based multimodal pretraining pipelines have three key components, from bottom to top: tokenization, Transformer representation, and objective supervision. Not only for multimodal pretraining but also for specific multimodal tasks, the cross-modal interactions can be performed within arbitrary component(s) of the three. As discussed in Section III-C2, because self-attention models the embedding of an arbitrary token from an arbitrary modality as a node of a graph, the existing pretraining pipelines can, in general, be transferred independently across modalities, unless modality-specific objectives are involved.

Discussion: Vision-language pretraining (VLP) follows two general pipelines: two-stage, which needs an object detector, e.g., Faster R-CNN [202] (e.g., LXMERT [103], ViLBERT [102], VL-BERT [105], UNITER [106]), and end-to-end (e.g., Pixel-BERT [113], SOHO [203], KD-VLP [204], SimVLM [199]). Two-stage pipelines have a main advantage – object-aware perceiving, by using supervised pre-trained visual detectors – however, they rest on the strong assumption that the visual representations can be fixed.

Discussion: How to find more corpora that intrinsically have well-aligned cross-modal supervision, such as instructional videos, is still an open problem. However, weakly-aligned cross-modal samples are common in real-life scenarios; for instance, enormous numbers of weakly aligned multimodal data samples are emerging in e-commerce [137], due to fine-grained categories, complex combinations, and fuzzy correspondence. Well labelled/aligned cross-modal datasets are very costly to collect and annotate; how to use weakly-aligned or even unaligned corpora crawled from the web is a promising question. Some recent successful practices [9], [199], [205] used weakly aligned image-text pairs to perform pretraining, and achieve both competitive performance and zero-shot learning capability for image classification, image-text retrieval, open-ended visual question answering, etc. Because these practices in weak supervision make full use of large-scale pretraining corpora, they yield greater promise of zero-shot generalization.

Pretext Tasks: In Transformer based multimodal pretraining, the pretraining tasks/objectives are also termed pretext tasks/objectives. To date, various pretext tasks have been studied, e.g., masked language modelling (MLM) [137], masked image region prediction/classification (also termed masked object classification (MOC)) [137], [190], masked region regression (MRR) [115], visual-linguistic matching (VLM) (e.g., image-text matching (ITM) [188], phrase-region alignment (PRA) [204], word-region alignment (WRA) [106], video-subtitle matching (VSM) [116]), masked frame modelling (MFM) [116], frame order modelling (FOM) [116], next sentence prediction (NSP) [4], [102], [190], masked sentence generation (MSG) [191], masked group modelling (MGM) [188], prefix language modelling (PrefixLM) [199], video-conditioned masked language modelling [117], text-conditioned masked frame modelling [117], visual translation language modelling (VTLM) [206], and image-conditioned masked language modelling (also termed image-attended masked language modelling) [207]. This down-stream task-agnostic pretext pretraining is optional, and the down-stream task objectives can instead be trained directly, which will be discussed in Section IV-A2. Table III lists common and representative pretext tasks for Transformer based multimodal pretraining.
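To make the flavour of such pretext objectives concrete, here is a schematic sketch (our own simplification, not any specific model's implementation) of a masked language modelling loss computed by a multimodal Transformer that also receives visual tokens, i.e., an image/video-conditioned MLM: a fraction of text tokens is replaced by [MASK], the model encodes the visual-plus-text input, and only the masked positions are scored. The `model` interface and masking ratio are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def masked_language_modelling_loss(model: nn.Module, text_ids: torch.Tensor,
                                   visual_tokens: torch.Tensor,
                                   mask_token_id: int, vocab_size: int,
                                   mask_prob: float = 0.15) -> torch.Tensor:
    """Cross-modally conditioned MLM pretext loss (schematic).

    `model` is assumed to take (text_ids, visual_tokens) and return per-text-token
    logits of shape (B, N_text, vocab_size); this interface is hypothetical.
    """
    # Randomly choose ~15% of text positions to mask.
    mask = torch.rand_like(text_ids, dtype=torch.float) < mask_prob
    corrupted = text_ids.masked_fill(mask, mask_token_id)
    logits = model(corrupted, visual_tokens)                    # (B, N_text, V)
    # Only the masked positions contribute to the loss (other labels are ignored).
    labels = text_ids.masked_fill(~mask, -100)
    return F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
                           ignore_index=-100)
```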
TABLE III: PRETEXT TASK COMPARISON OF MULTI-MODAL PRETRAINING TRANSFORMER MODELS (FOR AGNOSTIC DOWN-STREAM TASKS)

In practice, pretext tasks can be combined; some representative cases are summarized in Table III of [57] and Table II of [58]. The pretext tasks have multiple taxonomies:

(1) Supervision: The common multimodal pretraining Transformers use well-aligned, weakly-aligned, and even unaligned multimodal sample pairs/tuples, to work in supervised, weakly-supervised, and unsupervised manners, respectively. Meanwhile, if we consider the definitions of the pretext tasks/objectives in terms of supervision, the pretexts can be sorted into unsupervised/self-supervised (e.g., masked language modelling (MLM) [7], [137]) and supervised (e.g., image-text matching (ITM) [102], [103], [104], [106], [188], [209]), etc. Nowadays, self-supervised attempts are the majority.

(2) Modality: Considering the mathematical formulations, some pretexts are defined on a single modality, e.g., masked language modelling [7], masked acoustic modelling [200], and masked region regression (MRR) [115], while other pretexts are defined on multiple modalities, e.g., image-conditioned masked language modelling (IMLM) [208], image-text matching (ITM) [188], and video-subtitle matching (VSM) [116]. Thus, from this mathematical view, the pretext tasks can be divided into two categories, i.e., uni-modal and multimodal. However, this classification is not really accurate. It should be highlighted that in multimodal pretraining Transformer models, even if the pretext objective formulations only include uni-modal elements, the pretexts can still involve other modalities – essentially being conditioned on the clues from other modalities – via (a) prepositive token-level interactions and/or Transformer-level interactions, and (b) co-training with other pretexts that involve other modalities. For instance, VL-BERT [105] uses two dual pretext tasks, i.e., masked language modelling and masked RoI classification.

(3) Motivation: Considering their motivations, the pretext tasks include masking, describing, matching, ordering, etc.

Some recent surveys focus on VLP and compare the existing VLP Transformer models from the angles of domain (image-text or video-text), vision feature extraction, language feature extraction, architecture (single- or dual-stream), decoder (with or without), pretext tasks/objectives, pretraining datasets, and down-stream tasks, e.g., Table III of [57] and Table II of [58]. Different from these views, in this survey we propose comparisons from some new perspectives. Specifically: (1) The core of the Transformer ecosystem is self-attention, thus we compare the existing multimodal pretraining Transformer models from the angles of how and when the self-attention or its variants perform the cross-modal interactions. (2) Considered from a geometrically topological perspective, self-attention helps Transformers intrinsically work in a modality-agnostic pipeline that is compatible with various modalities by taking the embedding of each token as a node of a graph; thus we highlight that the existing VLP practices can be applied to other modalities, beyond the visual and linguistic domains. (3) We suggest treating Transformer-based multimodal pretraining pipelines as having three key components, from bottom to top, i.e., tokenization, Transformer representation, and objective supervision.

Discussion: In spite of the recent advances, multimodal pretraining Transformer methods still have some obvious bottlenecks. For instance, as discussed by [208] in the VLP field, while the BERT-style cross-modal pretraining models produce excellent results on various down-stream vision-language tasks, they fail to be applied to generative tasks directly. As discussed in [208], both VideoBERT [7] and CBT [107] have to train a separate video-to-text decoder for video captioning. This is a significant gap between the pretraining models designed for discriminative and generative tasks; the main reason is that discriminative-task oriented pretraining models do not involve the decoder of the Transformer. Therefore, how to design more unified pipelines that can work for both discriminative and generative down-stream tasks is also an open problem to be solved. For another instance, common multimodal pretraining models often underperform on fine-grained/instance-level tasks, as discussed by [137].
Discussion: As discussed in [208], masked language and region modelling as pre-training tasks have a main advantage: the Transformer encoder learned from these supervisions can encode both vision and language patterns based on bidirectional context, and it naturally fits semantic understanding tasks, e.g., VQA and image-text retrieval.

Discussion: How to boost the performance of multimodal pretraining Transformers is an open problem. Some practices demonstrate that multi-task training (by adding auxiliary losses) [111], [137] and adversarial training [210] improve multimodal pretraining Transformers and further boost performance. Meanwhile, overly compound pretraining objectives potentially increase the difficulty of balancing the different loss terms and thus complicate the training optimization [199]. Moreover, the difficulty of the pretexts is also worth discussing. In general, if the aim is to learn more explicit object concepts, more complex pretext losses are used [204]. However, for pretexts, whether more complexity is better remains an open question.

2) Task-Specific Multimodal Pretraining: In practice, for multimodal Transformers, the aforementioned down-stream task-agnostic pretraining is optional, not necessary, and down-stream task-specific pretraining is also widely studied [150], [190], [208], [211]. The main reasons include: (1) Limited by existing techniques, it is extremely difficult to design a set of highly universal network architectures, pretext tasks, and corpora that work for all the various down-stream applications. (2) There are non-negligible gaps among the various down-stream applications, e.g., in task logic and data form, making it difficult to transfer from pretraining to down-stream applications.

Therefore, a large number of down-stream tasks still need tailor-made pretraining to improve performance. Guhur et al. [150] propose in-domain pretraining for vision-and-language navigation, as general VLP focuses on learning vision-language correlations and is not designed for the sequential decision making required in embodied VLN. Murahari et al. [190] present a visual-dialogue oriented approach to leverage pretraining on general vision-language datasets. XGPT [208] is tailor-made for image captioning, to overcome the limitation that BERT-based cross-modal pre-trained models fail to be applied to generative tasks directly. ERNIE-ViLG [211] is designed for bidirectional image-text generation with Transformers.

Special modalities have their own unique domain knowledge that can be used to design specific pretraining pretexts. GraphCodeBERT [44] uses two structure-aware pretext tasks (i.e., predicting where a variable is identified from, and data flow edge prediction between variables) for programming source code. To learn from the spatial cues in 360° video, Morgado et al. [145] propose to perform contrastive audio-visual spatial alignment of 360° video and spatial audio. Med-BERT [184] is a contextualized embedding model pretrained on a structured electronic health record dataset of two million patients. Kaleido-BERT [212] is a VLP Transformer model tailor-made for the fashion domain.

B. Transformers for Specific Multimodal Tasks

Recent work has demonstrated that Transformer models can encode various multimodal inputs in both classical and novel discriminative applications, e.g., RGB & optical flow [46], RGB & depth [213], RGB & point cloud [214], RGB & LiDAR [215], [216], textual description & point cloud [31], acoustic & text [180], audio & visual observation for Audio-Visual Navigation [76], speech query & SQL database schema [25], text question/query & SQL database schema [24], audio & tags [217], multimodal representation for video [218], [219], text query & video [220], audio & video for audio-visual speech enhancement (AVSE) [179], audio & video for Audio-Visual Video Parsing [173], audio & video for audio-visual speech recognition [134], video & text for Referring Video Object Segmentation (RVOS) [221], source code & comment & data flow [44], and image & text for retrieval [222].

Meanwhile, Transformers also contribute to various multimodal generative tasks, including single-modality to single-modality (e.g., raw audio to 3D mesh sequence [39], RGB to 3D scene [40], single image to 3D human texture estimation [223], RGB to scene graph [19], [224], [225], [226], graph to graph [33], knowledge graph to text [227], video to scene graph [228], video to caption [229], [230], [231], [232], image to caption [233], [234], [235], [236], [237], text to speech [238], text to image [205], [239], text to shape [240], RGB to 3D human pose and mesh [41], music to dance [241]), multimodality to single-modality (e.g., image & text to scene graph [242], Video Dialogue (text & audio & visual to text) [243], mono audio & depth to binaural audio [14], music piece & seed 3D motion to long-range future 3D motions [146], X-ray image & question to answer [244], video & text & audio to text [245]), and multimodality to multimodality (e.g., [246]).

V. CHALLENGES AND DESIGNS

Complementing the application scenario taxonomy discussed in Section IV, we further survey prior work from the perspective of technical challenges. We discuss seven challenges of Transformer based multimodal learning: fusion, alignment, transferability, efficiency, robustness, universalness, and interpretability. This further extends the taxonomy introduced in [1] to tackle the higher diversity and wider scope of existing Transformer based MML works in recent years.

A. Fusion

In general, MML Transformers fuse information across multiple modalities primarily at three levels: input (i.e., early fusion), intermediate representation (i.e., middle fusion), and prediction (i.e., late fusion). Common early fusion based MML Transformer models [7], [104], [108] are also known as the one-stream architecture; they allow the adoption of the merits of BERT with minimal architectural modification. The main difference between these one-stream models is the usage of problem-specific modalities with variant masking techniques. With the attention operation, a noticeable fusion scheme has been introduced based on the notion of bottleneck tokens [175]. It applies to both early and middle fusion by simply choosing the to-be-fused layers. We note that simple prediction-based late fusion [247], [248] is less adopted in MML Transformers. This makes sense considering the motivations of learning stronger multimodal contextual
B. Alignment

Cross-modal alignment is the key to a number of real-world multimodal applications. Transformer based cross-modal alignment has been studied for various tasks, e.g., speaker localization in multi-speaker videos [250], speech translation [180], text-to-speech alignment [251], text-to-video retrieval [252], [253], [254], and visual grounding of natural language [255], [256], [257], [258], [259]. Recently, Transformer based alignment [9], [119], [260], [261], [262] has led to a surge of leveraging large quantities of web data (e.g., image-text pairs) for vision and language tasks.

A representative practice is to map two modalities into a common representation space with contrastive learning over paired samples. The models based on this idea are often enormous in size and expensive to optimize from millions or billions of training samples. Consequently, successive works mostly exploit pretrained models for tackling various down-stream tasks [120], [263], [264], [265], [266]. These alignment models have the ability of zero-shot transfer, particularly for image classification via prompt engineering [267]. This novel perspective is mind-blowing, given that image classification is conventionally regarded as a unimodal learning problem and zero-shot classification remains an unsolved challenge despite extensive research [268]. This has been studied for more challenging and fine-grained tasks (e.g., object detection [269], visual question answering [103], [106], [112], [263], and instance retrieval [222], [263]) by imposing region (semantic parts such as objects) level alignment. However, fine-grained alignment incurs extra computational costs from explicit region detection, and how to eliminate this cost whilst keeping the region-level learning capability becomes a challenge. Several ideas introduced recently include random sampling [113], learning a concept dictionary [203], uniform masking [270], patch projection [192], joint learning of a region detector [271], and representation aligning before mask prediction [263].
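To illustrate the contrastive alignment idea (mapping paired samples from two modalities into a shared space), here is a minimal, generic sketch of a symmetric InfoNCE-style objective over a batch of paired image and text embeddings. It follows the common recipe used by CLIP-style models only at a high level; the function name and projection setup are illustrative assumptions, not the implementation of any specific cited system.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    img_emb, txt_emb: (B, d) projections of the i-th image and its paired text.
    Matched pairs sit on the diagonal of the similarity matrix.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature      # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage with random projections standing in for encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```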
C. Transferability

Transferability is a major challenge for Transformer based multimodal learning, involving the question of how to transfer models across different datasets and applications.

Data augmentation and adversarial perturbation strategies help multimodal Transformers to improve their generalization ability. VILLA [210] is a two-stage strategy (task-agnostic adversarial pretraining, followed by task-specific adversarial finetuning) that improves VLP Transformers.

In practice, the distribution gap between training data and practical data is noticeable. For instance, supervised data samples (well-labelled, well-aligned) are costly in practical applications, thus how to transfer supervised multimodal Transformers pretrained on well-aligned cross-modal pairs/tuples to weakly aligned test beds is challenging [137]. CLIP [9] is an inspiring solution that transfers knowledge across modalities by learning a shared multimodal embedding space, enabling zero-shot transfer of the model to down-stream tasks. The main inspiration that CLIP offers the community is that pretrained multimodal (image and text) knowledge can be transferred to down-stream zero-shot prediction by using a prompt template “A photo of a {label}.” to bridge the distribution gap between training and test datasets.

Over-fitting is a major obstacle to transfer. Multimodal Transformers can over-fit to dataset biases during training, due to their large modelling capability. Some recent practices explore how to transfer an oracle model trained on a noiseless dataset to a real dataset. For instance, Kervadec et al. [272], [273] explore how transferable reasoning patterns are in VQA, and demonstrate that, for LXMERT [103] and BERT-like models, reasoning patterns can be partially transferred from an ideal dataset to a real dataset.

The cross-task gap is another major obstacle to transfer [208], [274], due to the different reasoning and input-output workflows; e.g., how to use multimodal datasets to finetune a language pretrained model is difficult [274]. In real applications, multimodal pretrained Transformers sometimes need to handle uni-modal data at inference time due to the issue of missing modalities. One solution is knowledge distillation, e.g., distilling from multimodal to uni-modal attention in Transformers [275], or distilling from multiple uni-modal Transformer teachers to a shared Transformer encoder [276]. There is also a huge gap across discriminative and generative multimodal tasks. As discussed in [208], BERT-like encoder-only multimodal Transformers (e.g., VideoBERT [7], CBT [107]) need to separately train decoders for generation tasks, which could create a pretrain-finetune discrepancy detrimental to generality. Recently, more and more attempts study this issue further, e.g., GilBERT [222] is a generative VLP model for a discriminative task, i.e., image-text retrieval.

The cross-lingual gap should also be considered for the transferability of Transformer based multimodal learning, e.g., universal cross-lingual generalization from English to non-English multimodal contexts [206], [277].
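The prompt-template transfer described above can be made concrete with a short, hedged sketch: class names are wrapped into the template “A photo of a {label}.”, both the templated texts and the query image are embedded into the shared space, and the class with the highest cosine similarity is predicted. The two encoder functions below are placeholders (assumed to come from a pretrained CLIP-style dual encoder), not a specific library API.

```python
import torch
import torch.nn.functional as F

def zero_shot_classify(image, class_names, encode_image, encode_text):
    """Zero-shot classification via prompt templates over a shared embedding space.

    encode_image / encode_text are assumed callables returning (1, d) / (N, d)
    embeddings from a pretrained dual-encoder model.
    """
    prompts = [f"A photo of a {label}." for label in class_names]
    txt = F.normalize(encode_text(prompts), dim=-1)   # (N, d)
    img = F.normalize(encode_image(image), dim=-1)    # (1, d)
    scores = (img @ txt.t()).squeeze(0)               # cosine similarity per class
    return class_names[int(scores.argmax())]

# Toy usage with random stand-ins for the pretrained encoders.
d = 512
fake_encode_text = lambda prompts: torch.randn(len(prompts), d)
fake_encode_image = lambda image: torch.randn(1, d)
print(zero_shot_classify(None, ["cat", "dog", "car"],
                         fake_encode_image, fake_encode_text))
```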
D. Efficiency

Multimodal Transformers suffer from two major efficiency issues: (1) due to the large model parameter capacity, they are data hungry and thus dependent on huge-scale training datasets; (2) they are limited by the time and memory complexities that grow quadratically with the input sequence length, caused by self-attention. In multimodal contexts, this calculation explosion becomes worse due to the jointly high-dimensional representations. These two bottlenecks are interdependent and should be considered together.

To improve the training and/or inference efficiency of multimodal Transformers, recent efforts have attempted various solutions that use fewer training data and/or parameters. The main ideas can be summarized as follows.

1) Knowledge distillation. Distill the knowledge from trained larger Transformers to smaller Transformers [93]. Miech et al. [278] conduct distillation from a slower model (early concatenation based Transformers, O((N_A + N_B)^2)) to a faster one (independent dual-branch Transformers, O(N_A^2)).

2) Simplifying and compressing the model. Remove components to simplify the pipeline. Taking VLP Transformer models as an example, the two-stage pipeline is costly as it needs an object detector. One simplification is processing the visual input in a convolution-free manner, e.g., E2E-VLP [271], ViLT [192]. DropToken [174] reduces the training complexity by randomly dropping a portion of the video and audio tokens from the input sequence during training; it can be treated as an implementation of dropout or adversarial training. Weight-sharing is also a common practice for simplifying multimodal Transformer models. Wen et al. [279] present a weight-sharing Transformer on top of the visual and textual encoders to align text and image. Lee et al. [280] propose a novel parameter sharing scheme based on low-rank approximation.

3) Asymmetrical network structures. Assign different model capacities and computational sizes properly to different modalities, to save parameters. See Fig. 2 in [192].

4) Improving the utilization of training samples. Liu et al. [281] train a simplified LXMERT by making full use of fewer samples at different granularities. Li et al. [282] use fewer data to train CLIP by fully mining the potential self-supervised signals of (a) self-supervision within each modality, (b) multi-view supervision across modalities, and (c) nearest-neighbour supervision from other similar pairs.

5) Compressing and pruning the model. Search for the optimal sub-structures/sub-networks of multimodal Transformers, e.g., playing Lottery Tickets with VLP Transformer models [283], or adaptively freezing some layers during training [284].

6) Optimizing the complexity of self-attention. Transformers cost time and memory that grow quadratically with the input sequence length [285]. One potential solution is optimizing the O(N^2) complexity, e.g., Child et al. [286] present sparse factorizations of the attention matrix that reduce the quadratic complexity to O(n√n), and Transformer-LS [287] is an efficient Transformer for both language and vision long sequences, with linear computational and memory complexity.

7) Optimizing the complexity of self-attention based multimodal interaction/fusion. Nagrani et al. [175] propose Fusion via Attention Bottlenecks (FSN, fusion bottleneck) to improve early concatenation based multimodal interaction. FSN passes messages through a small number of bottleneck latents, thus requiring the model to purify the most necessary information from each modality for cross-modal sharing. This strategy uses the fusion bottleneck as a bridge, and not only improves fusion performance but also reduces the computational cost (a minimal sketch follows after this list).

8) Other optimization strategies. Use optimal strategies to perform the common Transformer based multimodal interactions. Given the quadratic complexity of self-attention, using early concatenation based multimodal interaction to synchronously fuse the inputs from multiple modalities/views is costly. Yan et al. [288] present an efficient solution that sequentially fuses information between all pairs of two adjacent views in ascending order of sequence length, which is intrinsically a greedy strategy.
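The bottleneck-fusion idea in item 7 can be sketched generically: instead of letting every token of modality A attend to every token of modality B, each modality only exchanges information with a handful of shared bottleneck tokens, so the cross-modal traffic scales with the (small) number of bottlenecks rather than with the product of the sequence lengths. The sketch below is a simplified, single-layer illustration under that assumption; it is not the exact architecture of [175].

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """One simplified fusion step through a small set of shared bottleneck tokens."""

    def __init__(self, d_model=256, n_heads=4, n_bottlenecks=4):
        super().__init__()
        self.bottlenecks = nn.Parameter(torch.randn(1, n_bottlenecks, d_model))
        self.layer_a = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.layer_b = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, tokens_a, tokens_b):
        B = tokens_a.size(0)
        btl = self.bottlenecks.expand(B, -1, -1)
        n_btl = btl.size(1)

        # Modality A attends only within itself plus the shared bottlenecks.
        out_a = self.layer_a(torch.cat([tokens_a, btl], dim=1))
        tokens_a, btl_a = out_a[:, :-n_btl], out_a[:, -n_btl:]

        # Modality B does the same, seeing the bottlenecks already updated by A.
        out_b = self.layer_b(torch.cat([tokens_b, btl_a], dim=1))
        tokens_b, btl_b = out_b[:, :-n_btl], out_b[:, -n_btl:]

        # Cross-modal information flow is limited to n_bottlenecks tokens per step.
        return tokens_a, tokens_b, btl_b

layer = BottleneckFusionLayer()
a, b, z = layer(torch.randn(2, 32, 256), torch.randn(2, 100, 256))
print(a.shape, b.shape, z.shape)  # (2, 32, 256) (2, 100, 256) (2, 4, 256)
```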
E. Robustness

Multimodal Transformers pretrained on large-scale corpora achieve the state-of-the-art for various multimodal applications, while their robustness is still unclear and understudied. This involves at least two key challenges, i.e., how to theoretically analyse the robustness and how to improve it.

Although recent attempts [99], [182], [289], [290] study and evaluate how the Transformer components/sub-layers contribute to robustness, the main bottleneck is that the community lacks theoretical tools to analyse the Transformer family. Currently, the common practices for analysing robustness are mainly based on empirical evaluations [291], e.g., cross-dataset evaluations and perturbation-based evaluations. Thus, some multimodal datasets [130], [292] have been proposed for evaluating robustness.

Recent attempts mainly use two straightforward methods to improve the robustness of multimodal Transformer models: (1) augmentation and adversarial learning based strategies [293], [294], and (2) fine-grained loss functions [295]. For instance, VILLA [210] is a generic adversarial training framework that can be applied to various multimodal Transformers. Akula et al. [292] empirically demonstrate that ViLBERT fails to exploit linguistic structure, and they propose two methods to improve the robustness of ViLBERT, one based on contrastive learning and the other based on multi-task learning.
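As a generic illustration of the adversarial-learning strategy mentioned above, the sketch below perturbs input token embeddings with a small gradient-ascent step and adds the loss on the perturbed embeddings to the clean loss. This is a minimal, single-step sketch of embedding-space adversarial training in general, with an assumed `model` that maps embeddings and labels to a scalar loss; it is not the specific VILLA [210] procedure.

```python
import torch

def adversarial_training_step(model, embeddings, labels, epsilon=1e-3):
    """Clean loss plus a single-step adversarial loss on the input embeddings."""
    embeddings = embeddings.detach().requires_grad_(True)
    clean_loss = model(embeddings, labels)          # model returns a scalar loss
    grad, = torch.autograd.grad(clean_loss, embeddings, retain_graph=True)

    # Move the embeddings a small step in the direction that increases the loss.
    delta = epsilon * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)
    adv_loss = model(embeddings + delta, labels)

    return clean_loss + adv_loss  # optimize this sum w.r.t. the model parameters

# Toy usage: a stand-in "model" that scores random embeddings against labels.
emb = torch.randn(4, 16, 64)
labels = torch.randint(0, 2, (4,))
toy_model = lambda e, y: (e.mean(dim=(1, 2)) - y.float()).pow(2).mean()
print(adversarial_training_step(toy_model, emb, labels).item())
```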
F. Universalness

Due to the high diversity of tasks and modalities in multimodal learning, universalness is an important problem for multimodal Transformer models. A large number of recent attempts [117], [296], [297], [298] study how to use pipelines that are as unified as possible to handle various modalities and multimodal tasks. Ideally, unified multimodal Transformers would be compatible with various data (e.g., aligned and unaligned, uni-modal and multimodal) and tasks (e.g., supervised and unsupervised, uni-modal and multimodal, discriminative and generative), and meanwhile have few-shot or even zero-shot generalization ability. Thus, the current solutions towards the universalness goal for multimodal Transformers are only preliminary probes.

The current unifying-oriented attempts mainly include:

1) Unifying the pipelines for both uni-modal and multimodal inputs/tasks. As discussed in Section V-C, in practical scenarios, multimodal Transformers need to handle uni-modal data due to the issue of missing modalities. Distilling multimodal knowledge into small models that are adaptable to uni-modal data and tasks is a successful practice [275], [276].

2) Unifying the pipelines for both multimodal understanding and generation. In general, for multimodal Transformer pipelines, understanding and discriminative tasks require only Transformer encoders, while generation/generative
tasks require both Transformer encoders and decoders. Existing attempts use multi-task learning to combine the understanding and generation workflows, where the two kinds of workflows are jointly trained by multi-task loss functions. From the perspective of model structure, typical solutions include: (a) encoder + decoder, e.g., E2E-VLP [271]; (b) separate encoders + cross encoder + decoder, e.g., UniVL [117], CBT [107]; (c) a single unified/combined encoder-decoder, e.g., VLP [110]; and (d) a two-stream decoupled design [191].

3) Unifying and converting the tasks themselves, e.g., CLIP [9] converts zero-shot recognition to retrieval, thus reducing the cost of modifying the model.

However, the aforementioned practices suffer from some obvious challenges and bottlenecks, at least including:

1) Due to modality and task gaps, universal models should consider the trade-off between universalness and cost. Unifying the pipelines of different modalities and tasks generally causes a larger or more complicated model configuration, whereas for a specific modality or task, some components are redundant.

2) Multi-task loss functions increase the complexity of training. How to co-train multiple objectives properly and effectively is challenging, since different objectives generally need to be optimized with different strategies.

G. Interpretability

Why and how Transformers perform so well in multimodal learning has been investigated [106], [299], [300], [301], [302], [303], [304], [305], [306]. These attempts mainly use probing tasks and ablation studies. Cao et al. [299] design a set of probing tasks on UNITER [106] and LXMERT [103] to evaluate what patterns are learned in pretraining. Hendricks et al. [301] probe image–language Transformers with fine-grained image–sentence pairs, and find that verb understanding is harder than subject or object understanding. Chen et al. [106] examine the optimal combination of pretraining tasks via ablation study, to compare how different pretexts contribute to the Transformers. Despite these attempts, the interpretability of multimodal Transformers is still under-studied to date.
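Probing, as used in the studies above, can be illustrated generically: freeze a pretrained multimodal encoder, extract its representations, and train only a light linear classifier on a diagnostic task; the probe accuracy then indicates how much of that property the frozen representation already encodes. The sketch below is a minimal, model-agnostic example with an assumed `frozen_encoder` callable; it does not reproduce the specific probing suites of [299] or [301].

```python
import torch
import torch.nn as nn

def train_linear_probe(frozen_encoder, inputs, labels, n_classes, epochs=20):
    """Fit a linear classifier on frozen features for a diagnostic (probing) task."""
    with torch.no_grad():                       # the encoder is never updated
        feats = frozen_encoder(inputs)          # (N, d) pooled representations
    probe = nn.Linear(feats.size(1), n_classes)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.cross_entropy(probe(feats), labels)
        loss.backward()
        opt.step()
    acc = (probe(feats).argmax(dim=-1) == labels).float().mean().item()
    return probe, acc

# Toy usage: random features standing in for a frozen multimodal encoder.
frozen_encoder = lambda x: x.mean(dim=1)        # (N, T, d) -> (N, d)
probe, acc = train_linear_probe(frozen_encoder,
                                torch.randn(64, 10, 128),
                                torch.randint(0, 3, (64,)), n_classes=3)
print(f"probe accuracy on the diagnostic set: {acc:.2f}")
```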
VI. DISCUSSION AND OUTLOOK

Designing universal MML models that excel across all unimodal and multimodal down-stream tasks with different characteristics simultaneously [115], [299] is a non-trivial challenge. For instance, two-stream architectures [9], [263] are typically preferred over one-stream ones for cross-modal retrieval-like tasks in terms of efficiency, since the representation of each modality can be pre-computed beforehand and reused repeatedly. That being said, how to design task-agnostic MML architectures is still an open challenge, in addition to other design choices such as pretexts and objective loss functions. Furthermore, a clear gap remains between the state-of-the-art and this ultimate goal. In general, existing multimodal Transformer models [9], [199], [263] are superior only for specific MML tasks, as they are designed specifically for a subset of tasks [137], [142], [212], [249], [260], [261], [265], [266]. Encouragingly, several recent studies towards universal modality learning, in terms of modality-agnostic network design [3] and more task-generic architecture design [307], [308], [309], have been introduced, and it is hoped this will spark further investigation. To that end, instead of exhaustively exploring the vast model design space, seeking an in-depth understanding and interpretation of an MML model's behaviour might be insightful for superior algorithm design, even though the interactions and synergy across different modalities are intrinsically complex and even potentially inconsistent over tasks [249].

For more fine-grained MML, it is widely acknowledged that discovering the latent semantic alignments across modalities is critical. An intuitive strategy is to leverage semantic parts (e.g., objects) pre-extracted by an off-the-shelf detector for MML [103], [104], [105], [106], [112], [204], [310]. This, however, is not only complex and error-prone, but also computationally costly [207]. Several remedies introduced recently include random sampling [113], learning a concept dictionary [203], jointly learning a region detector [271], and representation aligning before mask prediction [263]. Given the scale of MML training data, exploring this direction incurs exhaustive computational costs, which likely only industrial research teams with rich resources can afford. Ideally, a favourable MML method would leave fine-grained semantic alignment across modalities to emerge on its own, which is worthy of careful investigation in the future.

As the learning scale expands exponentially, the training data inevitably become noisy and heterogeneous [9], [199], [263]. It has been recently shown that properly tackling the noise issue is useful [263], [309]. Another related facet is the training strategy, e.g., how many stages of training are superior to the common one-stage policy [115]. Further, the quadratic complexity of Transformers becomes more acute for multimodal data due to the longer input. Despite extensive research on efficient variants [49], dedicated efficiency studies for MML are still limited, even empirically, and call for more investigation.

Identifying the strengths of Transformers for multimodal machine learning is a big open problem. The following main points can be summarized from the literature: (1) Transformers can encode implicit knowledge [32]. (2) Multi-head attention brings multiple modelling sub-spaces that can further enhance the expressive ability of the model; ideally, multiple heads after training are good and different, which is essentially a good practice of ensemble learning. (3) Transformers intrinsically have a nature of global aggregation that perceives non-local patterns. (4) Thanks to the large model capacity, Transformer models handle challenging domain gaps and shifts (e.g., linguistic and visual) better via effective pretraining on large-scale corpora [294]. (5) Transformers can represent the inputs as graphs, which are intrinsically compatible with more modalities, e.g., tables and SQL. (6) For modelling series and sequence patterns (e.g., time-series), Transformers have better training and inference efficiency than RNN-based models, thanks to their parallel computation in training and/or inference; Transformers are also inherently permutation invariant for processing a sequence of points, e.g., well-suited for point cloud learning [164]. (7) Tokenization makes Transformers flexible in organizing multimodal inputs, as discussed in Section III-A1.
VII. CONCLUSION

This survey focuses on multimodal machine learning with Transformers. We reviewed the landscape by introducing the Transformer designs and training in multimodal contexts. We summarized the key challenges and solutions for this emerging and exciting field. Moreover, we discussed open problems and potential research directions. We hope that this survey gives a helpful and detailed overview for new researchers and practitioners, provides a convenient reference for relevant experts (e.g., multimodal machine learning researchers, Transformer network designers), and encourages future progress.

REFERENCES

[21] C.-F. Yang, W.-C. Fan, F.-E. Yang, and Y.-C. F. Wang, “LayoutTransformer: Scene layout generation with conceptual and spatial diversity,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3731–3740.
[22] R. Li, S. Zhang, and X. He, “SGTR: End-to-end scene graph generation with transformer,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022, pp. 19464–19474.
[23] P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12868–12878.
[24] R. Cai, J. Yuan, B. Xu, and Z. Hao, “SADGA: Structure-aware dual graph aggregation network for text-to-SQL,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 7664–7676.
[25] Y. Song, R. C.-W. Wong, X. Zhao, and D. Jiang, “Speech-to-SQL: Towards speech-driven SQL query generation from natural language question,” 2022, arXiv:2201.01209.
[26] A. Salvador, E. Gundogdu, L. Bazzani, and M. Donoser, “Revamping cross-modal recipe retrieval with hierarchical transformers and self-supervised learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 15470–15479.
[1] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, “Multimodal machine [27] Z. Zhao et al., “ProTo: Program-guided transformer for program-
learning: A survey and taxonomy,” IEEE Trans. Pattern Anal. Mach. guided tasks,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
Intell., vol. 41, no. 2, pp. 423–443, Feb. 2019. pp. 17021–17036.
[2] A. Vaswani et al., “Attention is all you need,” in Proc. Int. Conf. Neural [28] H. Zhou, W. Zhou, W. Qi, J. Pu, and H. Li, “Improving
Inf. Process. Syst., 2017, pp. 5998–6008. sign language translation with monolingual data by sign back-
[3] A. Jaegle, F. Gimeno, A. Brock, A. Zisserman, O. Vinyals, and J. Carreira, translation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
“Perceiver: General perception with iterative attention,” in Proc. Int. pp. 1316–1325.
Conf. Mach. Learn., 2021, pp. 4651–4664. [29] G. Varol, L. Momeni, S. Albanie, T. Afouras, and A. Zisserman, “Read
[4] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre- and attend: Temporal localisation in sign language videos,” in Proc. IEEE
training of deep bidirectional transformers for language understanding,” Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16852–16861.
2018, arXiv:1810.04805. [30] H. Bull, T. Afouras, G. Varol, S. Albanie, L. Momeni, and A. Zisserman,
[5] A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for “Aligning subtitles in sign language videos,” 2021, arXiv:2105.02877.
image recognition at scale,” 2020, arXiv:2010.11929. [31] L. Zhao, D. Cai, L. Sheng, and D. Xu, “3DVG-transformer: Relation
[6] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and modeling for visual grounding on point clouds,” in Proc. IEEE Int. Conf.
S. Zagoruyko, “End-to-end object detection with transformers,” in Proc. Comput. Vis., 2021, pp. 2908–2917.
Eur. Conf. Comput. Vis., 2020, pp. 213–229. [32] K. Marino, X. Chen, D. Parikh, A. Gupta, and M. Rohrbach, “KRISP: In-
[7] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid, “VideoBERT: tegrating implicit and symbolic knowledge for open-domain knowledge-
A joint model for video and language representation learning,” in Proc. based VQA,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
IEEE Int. Conf. Comput. Vis., 2019, pp. 7463–7472. pp. 14106–14116.
[8] J. Chen et al., “Speech-T: Transducer for text to speech and beyond,” [33] P. Ammanabrolu and M. O. Riedl, “Learning knowledge graph-based
in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 6621–6633. world models of textual environments,” 2021, arXiv:2106.09608.
[9] A. Radford et al., “Learning transferable visual models from natural [34] X. Zhu et al., “Multi-modal knowledge graph construction and applica-
language supervision,” 2021, arXiv:2103.00020. tion: A survey,” 2022, arXiv:2202.05786.
[10] M. Li et al., “CLIP-event: Connecting text and images with event struc- [35] P. Xu et al., “SketchMate: Deep hashing for million-scale human sketch
tures,” 2022, arXiv:2201.05078. retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018,
[11] C. Zhang, Z. Yang, X. He, and L. Deng, “Multimodal intelli- pp. 8090–8098.
gence: Representation learning, information fusion, and applications,” [36] P. Xu, Z. Song, Q. Yin, Y.-Z. Song, and L. Wang, “Deep self-supervised
IEEE J. Sel. Topics Signal Process., vol. 14, no. 3, pp. 478–493, representation learning for free-hand sketch,” IEEE Trans. Circuits Syst.
Mar. 2020. Video Technol., vol. 31, no. 4, pp. 1503–1513, Apr. 2021.
[12] A. Rahate, R. Walambe, S. Ramanna, and K. Kotecha, “Multimodal [37] P. Xu et al., “Fine-grained instance-level sketch-based video retrieval,”
co-learning: Challenges, applications with datasets, recent advances and IEEE Trans. Circuits Syst. Video Technol., vol. 31, no. 5, pp. 1995–2007,
future directions,” Inf. Fusion, vol. 81, pp. 203–239, 2022. May 2021.
[13] T. Hastie, R. Tibshirani, J. H. Friedman, and J. H. Friedman, The Elements [38] Y. Vinker et al., “CLIPasso: Semantically-aware object sketching,”
of Statistical Learning: Data Mining, Inference, and Prediction, vol. 2. 2022, arXiv:2202.05822.
Berlin, Germany: Springer, 2009. [39] Y. Fan, Z. Lin, J. Saito, W. Wang, and T. Komura, “FaceFormer: Speech-
[14] K. K. Parida, S. Srivastava, and G. Sharma, “Beyond mono to binaural: driven 3D facial animation with transformers,” 2021, arXiv:2112.05329.
Generating binaural audio from mono audio with depth and cross modal [40] D. Shin, Z. Ren, E. B. Sudderth, and C. C. Fowlkes, “3D scene reconstruc-
attention,” in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis., 2022, tion with multi-layer depth and epipolar transformers,” in Proc. IEEE Int.
pp. 2151–2160. Conf. Comput. Vis., 2019, pp. 2172–2182.
[15] F. Qingyun, H. Dapeng, and W. Zhaokui, “Cross-modality fusion trans- [41] K. Lin, L. Wang, and Z. Liu, “End-to-end human pose and mesh recon-
former for multispectral object detection,” 2021, arXiv:2111.00273. struction with transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern
[16] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: Recognit., 2021, pp. 1954–1963.
A framework for self-supervised learning of speech representations,” [42] Y. Xu et al., “LayoutLMv2: Multi-modal pre-training for visually-rich
in Proc. Int. Conf. Neural Inf. Process. Syst., 2020, Art. no. 1044. document understanding,” 2020, arXiv:2012.14740.
[17] A. Nagrani, C. Sun, D. Ross, R. Sukthankar, C. Schmid, and A. Zis- [43] I. Beltagy, M. E. Peters, and A. Cohan, “Longformer: The long-document
serman, “Speech2Action: Cross-modal supervision for action recog- transformer,” 2020, arXiv:2004.05150.
nition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, [44] D. Guo et al., “GraphCodeBERT: Pre-training code representations with
pp. 10314–10323. data flow,” 2020, arXiv:2009.08366.
[18] W. Chen, M.-W. Chang, E. Schlinger, W. Wang, and W. W. Cohen, “Open [45] D. Zügner, T. Kirschstein, M. Catasta, J. Leskovec, and S. Günnemann,
question answering over tables and text,” 2020, arXiv:2010.10439. “Language-agnostic representation learning of source code from struc-
[19] Y. Guo et al., “From general to specific: Informative scene graph gen- ture and context,” 2021, arXiv:2103.11318.
eration via balance adjustment,” in Proc. IEEE Int. Conf. Comput. Vis., [46] K. Gavrilyuk, R. Sanford, M. Javan, and C. G. Snoek, “Actor-
2021, pp. 16363–16372. transformers for group activity recognition,” in Proc. IEEE Conf. Comput.
[20] K. Gupta, J. Lazarow, A. Achille, L. Davis, V. Mahadevan, and Vis. Pattern Recognit., 2020, pp. 836–845.
A. Shrivastava, “LayoutTransformer: Layout generation and completion [47] J. Shang, T. Ma, C. Xiao, and J. Sun, “Pre-training of graph augmented
with self-attention,” 2020, arXiv:2006.14615. transformers for medication recommendation,” 2019, arXiv:1906.00346.
[48] T. Lin, Y. Wang, X. Liu, and X. Qiu, “A survey of transformers,” [78] P. Xu, T. M. Hospedales, Q. Yin, Y.-Z. Song, T. Xiang, and L. Wang,
2021, arXiv:2106.04554. “Deep learning for free-hand sketch: A survey,” IEEE Trans. Pattern
[49] Y. Tay, M. Dehghani, D. Bahri, and D. Metzler, “Efficient transformers: Anal. Mach. Intell., vol. 45, no. 1, pp. 285–312, Jan. 2023.
A survey,” 2020, arXiv:2009.06732. [79] Y. Li et al., “BEHRT: Transformer for electronic health records,” Sci.
[50] A. M. Braşoveanu and R. Andonie, “Visualizing transformers for NLP: Rep., vol. 10, 2020, Art. no. 7155.
A brief survey,” in Proc. Int. Conf. Inf. Visualisation, 2020, pp. 270–279. [80] Y. Li, H. Wang, and Y. Luo, “A comparison of pre-trained vision-and-
[51] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, language models for multimodal representation learning across medical
“Transformers in vision: A survey,” 2021, arXiv:2101.01169. images and reports,” in Proc. IEEE Int. Conf. Bioinf. Biomed., 2020,
[52] Y. Liu et al., “A survey of visual transformers,” 2021, arXiv:2111.06091. pp. 1999–2004.
[53] K. Han et al., “A survey on vision transformer,” 2020, arXiv:2012.12556. [81] P. Xu and X. Zhu, “DeepChange: A large long-term person re-
[54] Y. Xu et al., “Transformers in computational visual media: A survey,” identification benchmark with clothes change,” 2021, arXiv:2105.14685.
Comput. Vis. Media, vol. 8, pp. 33–62, 2022. [82] M. Tsimpoukelli, J. Menick, S. Cabi, S. Eslami, O. Vinyals, and F. Hill,
[55] F. Shamshad et al., “Transformers in medical imaging: A survey,” “Multimodal few-shot learning with frozen language models,” in Proc.
2022, arXiv:2201.09873. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 200–212.
[56] J. Selva, A. S. Johansen, S. Escalera, K. Nasrollahi, T. B. Moeslund, and [83] Y.-L. Sung, J. Cho, and M. Bansal, “VL-ADAPTER: Parameter-efficient
A. Clapés, “Video transformers: A survey,” 2022, arXiv:2201.05991. transfer learning for vision-and-language tasks,” in Proc. IEEE Conf.
[57] L. Ruan and Q. Jin, “Survey: Transformer based video-language pre- Comput. Vis. Pattern Recognit., 2022, pp. 5217–5227.
training,” 2021, arXiv:2109.09920. [84] J.-B. Alayrac et al., “Flamingo: A visual language model for few-shot
[58] F. Chen et al., “VLP: A survey on vision-language pre-training,” learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022, pp. 23716–
2022, arXiv:2202.09061. 23736.
[59] F. Li et al., “Vision-language intelligence: Tasks, representation learning, [85] W. Wang et al., “Image as a foreign language: BEiT pretraining for all
and large models,” 2022, arXiv:2203.01922. vision and vision-language tasks,” 2022, arXiv:2208.10442.
[60] L. Wu, S. L. Oviatt, and P. R. Cohen, “Multimodal integration-a statistical [86] X. Chen et al., “PaLi: A jointly-scaled multilingual language-image
view,” IEEE Trans. Multimedia, vol. 1, no. 4, pp. 334–341, Dec. 1999. model,” 2022, arXiv:2209.06794.
[61] W. Guo, J. Wang, and S. Wang, “Deep multimodal representa- [87] M. Lewis et al., “BART: Denoising sequence-to-sequence pre-training
tion learning: A survey,” IEEE Access, vol. 7, pp. 63373–63394, for natural language generation, translation, and comprehension,”
2019. 2019, arXiv:1910.13461.
[62] B. P. Yuhas, M. H. Goldstein, and T. J. Sejnowski, “Integration of acoustic [88] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving
and visual speech signals using neural networks,” IEEE Commun. Mag., language understanding by generative pre-training,” 2018. [Online].
vol. 27, no. 11, pp. 65–71, Nov. 1989. Available: https://s3-us-west-2.amazonaws.com/openai-assets/
[63] A. A. Lazarus et al., Multimodal Behavior Therapy. Berlin, Germany: research-covers/language-unsupervised/language_understanding_
Springer, 1976. paper.pdf
[64] D. Feng et al., “Deep multi-modal object detection and semantic segmen- [89] Z. Dai, Z. Yang, Y. Yang, J. Carbonell, Q. V. Le, and R. Salakhutdinov,
tation for autonomous driving: Datasets, methods, and challenges,” IEEE “Transformer-XL: Attentive language models beyond a fixed-length con-
Trans. Intell. Transp. Syst., vol. 22, no. 3, pp. 1341–1360, Mar. 2021. text,” 2019, arXiv:1901.02860.
[65] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion [90] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and
prediction with stacked transformers,” in Proc. IEEE Conf. Comput. Vis. Q. V. Le, “XLNet: Generalized autoregressive pretraining for language
Pattern Recognit., 2021, pp. 7573–7582. understanding,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2019,
[66] A. Moudgil, A. Majumdar, H. Agrawal, S. Lee, and D. Batra, “SOAT: A Art. no. 517.
scene-and object-aware transformer for vision-and-language navigation,” [91] M. Chen et al., “Generative pretraining from pixels,” in Proc. Int. Conf.
in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 7357–7367. Mach. Learn., 2020, pp. 1691–1703.
[67] F. Lv, X. Chen, Y. Huang, L. Duan, and G. Lin, “Progressive modality re- [92] H. Chen et al., “Pre-trained image processing transformer,” in Proc. IEEE
inforcement for human multimodal emotion recognition from unaligned Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12294–12305.
multimodal sequences,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog- [93] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and
nit., 2021, pp. 2554–2562. H. Jégou, “Training data-efficient image transformers & distillation
[68] R. Zellers et al., “MERLOT: Multimodal neural script knowledge mod- through attention,” in Proc. Int. Conf. Mach. Learn., 2021, pp. 10347–
els,” 2021, arXiv:2106.02636. 10357.
[69] M. K. Hasan et al., “Humor knowledge enriched transformer for under- [94] J. Beal, E. Kim, E. Tzeng, D. H. Park, A. Zhai, and D. Kislyuk, “Toward
standing multimodal humor,” in Proc. AAAI Conf. Artif. Intell., 2021, transformer-based object detection,” 2020, arXiv:2012.09958.
pp. 12972–12980. [95] Z. Liu et al., “Swin transformer: Hierarchical vision transformer using
[70] A. Brown, V. Kalogeiton, and A. Zisserman, “Face, body, voice: Video shifted windows,” 2021, arXiv:2103.14030.
person-clustering with multiple modalities,” 2021, arXiv:2105.09939. [96] X. Chen, S. Xie, and K. He, “An empirical study of training self-
[71] L. Yu et al., “CommerceMM: Large-scale commerce multimodal repre- supervised vision transformers,” 2021, arXiv:2104.02057.
sentation learning with omni retrieval,” 2022, arXiv:2202.07247. [97] M. Caron et al., “Emerging properties in self-supervised vision trans-
[72] K. Chen, J. K. Chen, J. Chuang, M. Vázquez, and S. Savarese, formers,” 2021, arXiv:2104.14294.
“Topological planning with transformers for vision-and-language nav- [98] H. Bao, L. Dong, and F. Wei, “BEiT: BERT pre-training of image
igation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, transformers,” 2021, arXiv:2106.08254.
pp. 11271–11281. [99] S. Paul and P.-Y. Chen, “Vision transformers are robust learners,”
[73] Y. Hong, Q. Wu, Y. Qi, C. Rodriguez-Opazo, and S. Gould, “VLN BERT: 2021, arXiv:2105.07581.
A recurrent vision-and-language bert for navigation,” in Proc. IEEE Conf. [100] M. Raghu, T. Unterthiner, S. Kornblith, C. Zhang, and A. Dosovit-
Comput. Vis. Pattern Recognit., 2021, pp. 1643–1653. skiy, “Do vision transformers see like convolutional neural networks?,”
[74] J. Zhang et al., “Curriculum learning for vision-and-language nav- in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, pp. 12116–12128.
igation,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021, [101] S. Cao, P. Xu, and D. A. Clifton, “How to understand masked autoen-
pp. 13328–13339. coders,” 2022, arXiv:2202.03670.
[75] Y. Qi, Z. Pan, Y. Hong, M.-H. Yang, A. van den Hengel, and Q. Wu, “The [102] J. Lu, D. Batra, D. Parikh, and S. Lee, “ViLBERT: Pretraining task-
road to know-where: An object-and-room informed sequential BERT for agnostic visiolinguistic representations for vision-and-language tasks,”
indoor vision-language navigation,” in Proc. IEEE Int. Conf. Comput. 2019, arXiv:1908.02265.
Vis., 2021, pp. 1635–1644. [103] H. Tan and M. Bansal, “LXMERT: Learning cross-modality encoder
[76] C. Chen, Z. Al-Halah, and K. Grauman, “Semantic audio-visual nav- representations from transformers,” 2019, arXiv:1908.07490.
igation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, [104] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang, “Visu-
pp. 15511–15520. alBERT: A simple and performant baseline for vision and language,”
[77] S. Ren, Y. Du, J. Lv, G. Han, and S. He, “Learning from the master: 2019, arXiv:1908.03557.
Distilling cross-modal advanced knowledge for lip reading,” in Proc. [105] W. Su et al., “VL-BERT: Pre-training of generic visual-linguistic repre-
IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 13320–13328. sentations,” 2019, arXiv:1908.08530.
[106] Y.-C. Chen et al., “UNITER: Universal image-text representation learn- [135] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles, “Dense-
ing,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 104–120. captioning events in videos,” in Proc. IEEE Int. Conf. Comput. Vis., 2017,
[107] C. Sun, F. Baradel, K. Murphy, and C. Schmid, “Learning pp. 706–715.
video representations using contrastive bidirectional transformer,” [136] A. Das et al., “Visual dialog,” in Proc. IEEE Conf. Comput. Vis. Pattern
2019, arXiv:1906.05743. Recognit., 2017, pp. 1080–1089.
[108] G. Li, N. Duan, Y. Fang, M. Gong, and D. Jiang, “Unicoder-VL: A [137] X. Zhan et al., “Product1M: Towards weakly supervised instance-level
universal encoder for vision and language by cross-modal pre-training,” product retrieval via cross-modal pretraining,” in Proc. IEEE Int. Conf.
in Proc. AAAI Conf. Artif. Intell., 2020, pp. 11336–11344. Comput. Vis., 2021, pp. 11762–11771.
[109] C. Alberti, J. Ling, M. Collins, and D. Reitter, “Fusion of detected objects [138] S. Changpinyo, P. Sharma, N. Ding, and R. Soricut, “Conceptual 12M:
in text for visual question answering,” 2019, arXiv:1908.05054. Pushing web-scale image-text pre-training to recognize long-tail visual
[110] L. Zhou, H. Palangi, L. Zhang, H. Hu, J. Corso, and J. Gao, “Unified concepts,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
vision-language pre-training for image captioning and VQA,” in Proc. pp. 3557–3567.
AAAI Conf. Artif. Intell., 2020, pp. 13041–13049. [139] Y. Huo et al., “WenLan: Bridging vision and language by large-scale
[111] J. Lu, V. Goswami, M. Rohrbach, D. Parikh, and S. Lee, “12-in-1: Multi- multi-modal pre-training,” 2021, arXiv:2103.06561.
task vision and language representation learning,” in Proc. IEEE Conf. [140] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Just ask: Learning
Comput. Vis. Pattern Recognit., 2020, pp. 10434–10443. to answer questions from millions of narrated videos,” in Proc. IEEE Int.
[112] X. Li et al., “Oscar: Object-semantics aligned pre-training for vision- Conf. Comput. Vis., 2021, pp. 1666–1677.
language tasks,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 121–137. [141] A. Miech, D. Zhukov, J.-B. Alayrac, M. Tapaswi, I. Laptev, and J. Sivic,
[113] Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu, “Pixel-BERT: “HowTo100M: Learning a text-video embedding by watching hundred
Aligning image pixels with text by deep multi-modal transformers,” million narrated video clips,” in Proc. IEEE Int. Conf. Comput. Vis., 2019,
2020, arXiv:2004.00849. pp. 2630–2640.
[114] L. Zhu and Y. Yang, “ActBERT: Learning global-local video-text repre- [142] X. Hu et al., “Scaling up vision-language pre-training for image caption-
sentations,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, ing,” 2021, arXiv:2111.12233.
pp. 8743–8752. [143] C. Schuhmann et al., “LAION-400M: Open dataset of CLIP-filtered 400
[115] D. Qi, L. Su, J. Song, E. Cui, T. Bharti, and A. Sacheti, “ImageBERT: million image-text pairs,” 2021, arXiv:2111.02114.
Cross-modal pre-training with large-scale weak-supervised image-text [144] H. Yun, Y. Yu, W. Yang, K. Lee, and G. Kim, “Pano-AVQA: Grounded
data,” 2020, arXiv:2001.07966. audio-visual question answering on 360◦ videos,” in Proc. IEEE Int. Conf.
[116] L. Li, Y.-C. Chen, Y. Cheng, Z. Gan, L. Yu, and J. Liu, “HERO: Hier- Comput. Vis., 2021, pp. 2011–2021.
archical encoder for video+ language omni-representation pre-training,” [145] P. Morgado, Y. Li, and N. Vasconcelos, “Learning representations from
2020, arXiv:2005.00200. audio-visual spatial alignment,” 2020, arXiv:2011.01819.
[117] H. Luo et al., “UniVL: A unified video and language pre-training model [146] R. Li, S. Yang, D. A. Ross, and A. Kanazawa, “AI choreographer: Music
for multimodal understanding and generation,” 2020, arXiv:2002.06353. conditioned 3D dance generation with AIST++,” in Proc. IEEE Int. Conf.
[118] M. Xu et al., “A simple baseline for zero-shot semantic segmentation Comput. Vis., 2021, pp. 13381–13392.
with pre-trained vision-language model,” 2021, arXiv:2112.14757. [147] P. Achlioptas, M. Ovsjanikov, K. Haydarov, M. Elhoseiny, and L. J.
[119] C. Jia et al., “Scaling up visual and vision-language representation Guibas, “ArtEmis: Affective language for visual art,” in Proc. IEEE Conf.
learning with noisy text supervision,” 2021, arXiv:2102.05918. Comput. Vis. Pattern Recognit., 2021, pp. 11564–11574.
[120] Z. Wang et al., “CLIP-TD: CLIP targeted distillation for vision-language [148] P. P. Liang et al., “MultiBench: Multiscale benchmarks for multimodal
tasks,” 2022, arXiv:2201.05729. representation learning,” 2021, arXiv:2107.07502.
[121] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi, [149] Z. Liu, C. Rodriguez-Opazo, D. Teney, and S. Gould, “Image retrieval on
“Align before fuse: Vision and language representation learning with real-life images with pre-trained vision-and-language models,” in Proc.
momentum distillation,” in Proc. Int. Conf. Neural Inf. Process. Syst., IEEE Int. Conf. Comput. Vis., 2021, pp. 2105–2114.
2021, pp. 9694–9705. [150] P.-L. Guhur, M. Tapaswi, S. Chen, I. Laptev, and C. Schmid, “AirBERT:
[122] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and In-domain pretraining for vision-and-language navigation,” in Proc.
Y. Wu, “CoCa: Contrastive captioners are image-text foundation models,” IEEE Int. Conf. Comput. Vis., 2021, pp. 1614–1623.
2022, arXiv:2205.01917. [151] R. Sawhney, M. Goyal, P. Goel, P. Mathur, and R. Shah, “Multimodal
[123] P. Sharma, N. Ding, S. Goodman, and R. Soricut, “Conceptual captions: A multi-speaker merger & acquisition financial modeling: A new task,
cleaned, hypernymed, image alt-text dataset for automatic image caption- dataset, and neural baselines,” in Proc. 59th Annu. Meeting Assoc.
ing,” in Proc. Conf. Assoc. Comput. Linguistics, 2018, pp. 2556–2565. Comput. Linguistics 11th Int. Joint Conf. Natural Lang. Process., 2021,
[124] T.-Y. Lin et al., “Microsoft COCO: Common objects in context,” in Proc. pp. 6751–6762.
Eur. Conf. Comput. Vis., 2014, pp. 740–755. [152] J. Zhang, M. Zheng, M. Boyd, and E. Ohn-Bar, “X-world: Accessibility,
[125] S. Antol et al., “VQA: Visual question answering,” in Proc. IEEE Int. vision, and autonomy meet,” in Proc. IEEE Int. Conf. Comput. Vis., 2021,
Conf. Comput. Vis., 2015, pp. 2425–2433. pp. 9742–9751.
[126] R. Krishna et al., “Visual genome: Connecting language and vision using [153] D. Zhang, M. Zhang, H. Zhang, L. Yang, and H. Lin, “MultiMET: A
crowdsourced dense image annotations,” Int. J. Comput. Vis., vol. 123, multimodal dataset for metaphor understanding,” in Proc. 59th Annu.
pp. 32–73, 2017. Meeting Assoc. Comput. Linguistics 11th Int. Joint Conf. Natural Lang.
[127] V. Ordonez, G. Kulkarni, and T. Berg, “Im2Text: Describing images using Process., 2021, pp. 3214–3225.
1 million captioned photographs,” in Proc. Int. Conf. Neural Inf. Process. [154] D. Kiela et al., “The hateful memes challenge: Detecting hate speech in
Syst., 2011, pp. 1143–1151. multimodal memes,” 2020, arXiv:2005.04790.
[128] M. Kayser et al., “e-ViL: A dataset and benchmark for natural language [155] L. Zhou, C. Xu, and J. J. Corso, “Towards automatic learning of proce-
explanations in vision-language tasks,” 2021, arXiv:2105.03761. dures from web instructional videos,” in Proc. AAAI Conf. Artif. Intell.,
[129] J. Gamper and N. Rajpoot, “Multiple instance captioning: Learning 2018, pp. 7590–7598.
representations from histopathology textbooks and articles,” in Proc. [156] J. Malmaud, J. Huang, V. Rathod, N. Johnston, A. Rabinovich, and
IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 16544–16554. K. Murphy, “What’s Cookin’? Interpreting cooking videos using text,
[130] L. Li, J. Lei, Z. Gan, and J. Liu, “Adversarial VQA: A new benchmark speech and vision,” 2015, arXiv:1503.01558.
for evaluating the robustness of VQA models,” 2021, arXiv:2106.00245. [157] M. M. Bronstein, J. Bruna, T. Cohen, and P. Veličković, “Geomet-
[131] A. Talmor et al., “MultiModalQA: Complex question answering over ric deep learning: Grids, groups, graphs, geodesics, and gauges,”
text, tables and images,” 2021, arXiv:2104.06039. 2021, arXiv:2104.13478.
[132] L. Li et al., “VALUE: A multi-task benchmark for video-and-language [158] V. P. Dwivedi and X. Bresson, “A generalization of transformer networks
understanding evaluation,” 2021, arXiv:2106.04632. to graphs,” 2020, arXiv:2012.09699.
[133] H. Wu et al., “Fashion IQ: A new dataset towards retrieving images by [159] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
natural language feedback,” in Proc. IEEE Conf. Comput. Vis. Pattern recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016,
Recognit., 2021, pp. 11302–11312. pp. 770–778.
[134] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Deep [160] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep net-
audio-visual speech recognition,” IEEE Trans. Pattern Anal. Mach. In- work training by reducing internal covariate shift,” in Proc. Int. Conf.
tell., vol. 44, no. 12, pp. 8717–8727, Dec. 2022. Mach. Learn., 2015, pp. 448–456.
[161] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” [189] Y.-H. H. Tsai, S. Bai, P. P. Liang, J. Z. Kolter, L.-P. Morency, and
2016, arXiv:1607.06450. R. Salakhutdinov, “Multimodal transformer for unaligned multimodal
[162] R. Xiong et al., “On layer normalization in the transformer architecture,” language sequences,” in Proc. Conf. Assoc. Comput. Linguistics, 2019,
in Proc. Int. Conf. Mach. Learn., 2020, Art. no. 975. pp. 6558–6569.
[163] P. Xu, C. K. Joshi, and X. Bresson, “Multigraph transformer for free- [190] V. Murahari, D. Batra, D. Parikh, and A. Das, “Large-scale pretraining
hand sketch recognition,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, for visual dialog: A simple state-of-the-art baseline,” in Proc. Eur. Conf.
no. 10, pp. 5150–5161, Oct. 2022. Comput. Vis., 2020, pp. 336–352.
[164] M.-H. Guo, J.-X. Cai, Z.-N. Liu, T.-J. Mu, R. R. Martin, and S.-M. Hu, [191] Y. Li, Y. Pan, T. Yao, J. Chen, and T. Mei, “Scheduled sampling in vision-
“PCT: Point cloud transformer,” Comput. Vis. Media, vol. 7, pp. 187–199, language pretraining with decoupled encoder-decoder network,” in Proc.
2021. AAAI Conf. Artif. Intell., 2021, pp. 8518–8526.
[165] S. He, H. Luo, P. Wang, F. Wang, H. Li, and W. Jiang, “TransReID: [192] W. Kim, B. Son, and I. Kim, “ViLT: Vision-and-language transformer
Transformer-based object re-identification,” in Proc. IEEE Int. Conf. without convolution or region supervision,” in Proc. Int. Conf. Mach.
Comput. Vis., 2021, pp. 14993–15002. Learn., 2021, pp. 5583–5594.
[166] P. Dufter, M. Schmitt, and H. Schütze, “Position information in trans- [193] J. Yang et al., “Vision-language pre-training with triple contrastive learn-
formers: An overview,” 2021, arXiv:2102.11090. ing,” 2022, arXiv:2202.10401.
[167] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural net- [194] L. H. Li et al., “Grounded language-image pre-training,” in Proc. IEEE
works,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, Conf. Comput. Vis. Pattern Recognit., 2022, pp. 10955–10965.
pp. 7794–7803. [195] H. Zhang et al., “GLIPv2: Unifying localization and vision-language
[168] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong, “End-to-end understanding,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2022,
dense video captioning with masked transformer,” in Proc. IEEE Conf. pp. 36067–36080.
Comput. Vis. Pattern Recognit., 2018, pp. 8739–8748. [196] C. Li et al., “SemVLP: Vision-language pre-training by aligning seman-
[169] B. Wang, R. Shin, X. Liu, O. Polozov, and M. Richardson, “RAT-SQL: tics at multiple levels,” 2021, arXiv:2103.07829.
Relation-aware schema encoding and linking for Text-to-SQL parsers,” [197] A. Miech, J.-B. Alayrac, L. Smaira, I. Laptev, J. Sivic, and A. Zisserman,
2019, arXiv:1911.04942. “End-to-end learning of visual representations from uncurated instruc-
[170] Z. Wang et al., “SGEITL: Scene graph enhanced image-text learning for tional videos,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020,
visual commonsense reasoning,” 2021, arXiv:2112.08587. pp. 9876–9886.
[171] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural net- [198] T. Han, W. Xie, and A. Zisserman, “Temporal alignment networks for
works,” in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 315–323. long-term video,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
[172] D. Hendrycks and K. Gimpel, “Gaussian error linear units (GELUs),” 2022, pp. 2896–2906.
2016, arXiv:1606.08415. [199] Z. Wang, J. Yu, A. W. Yu, Z. Dai, Y. Tsvetkov, and Y. Cao, “SimVLM:
[173] Y.-B. Lin, H.-Y. Tseng, H.-Y. Lee, Y.-Y. Lin, and M.-H. Yang, “Ex- Simple visual language model pretraining with weak supervision,”
ploring cross-video and cross-modality signals for weakly-supervised 2021, arXiv:2108.10904.
audio-visual video parsing,” in Proc. Int. Conf. Neural Inf. Process. Syst., [200] J. Chen, M. Ma, R. Zheng, and L. Huang, “MAM: Masked acoustic mod-
2021, pp. 11449–11461. eling for end-to-end speech-to-text translation,” 2020, arXiv:2010.11445.
[174] H. Akbari et al., “VATT: Transformers for multimodal self-supervised [201] M. Golestani, S. Z. Razavi, Z. Borhanifard, F. Tahmasebian, and
learning from raw video, audio and text,” 2021, arXiv:2104.11178. H. Faili, “Using BERT encoding and sentence-level language model for
[175] A. Nagrani, S. Yang, A. Arnab, A. Jansen, C. Schmid, and C. Sun, sentence ordering,” in Proc. 24th Int. Conf. Text Speech Dialogue, 2021,
“Attention bottlenecks for multimodal fusion,” in Proc. Int. Conf. Neural pp. 318–330.
Inf. Process. Syst., 2021, pp. 14200–14213. [202] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-
[176] P.-A. Duquenne, H. Gong, and H. Schwenk, “Multimodal and multi- time object detection with region proposal networks,” in Proc. Int. Conf.
lingual embeddings for large-scale speech mining,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2015, pp. 91–99.
Neural Inf. Process. Syst., 2021, pp. 15748–15761. [203] Z. Huang, Z. Zeng, Y. Huang, B. Liu, D. Fu, and J. Fu, “Seeing out
[177] D. Min, D. B. Lee, E. Yang, and S. J. Hwang, “Meta-StyleSpeech: Multi- of the box: End-to-end pre-training for vision-language representation
speaker adaptive text-to-speech generation,” 2021, arXiv:2106.03153. learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
[178] B. Shi, W.-N. Hsu, K. Lakhotia, and A. Mohamed, “Learning audio- pp. 12971–12980.
visual speech representation by masked multimodal cluster prediction,” [204] Y. Liu, C. Wu, S.-Y. Tseng, V. Lal, X. He, and N. Duan, “KD-VLP: Im-
2022, arXiv:2201.02184. proving end-to-end vision-and-language pretraining with object knowl-
[179] K. Ramesh, C. Xing, W. Wang, D. Wang, and X. Chen, “Vset: A edge distillation,” 2021, arXiv:2109.10504.
multimodal transformer for visual speech enhancement,” in Proc. IEEE [205] A. Ramesh et al., “Zero-shot text-to-image generation,” 2021,
Int. Conf. Acoust. Speech Signal Process., 2021, pp. 6658–6662. arXiv:2102.12092.
[180] R. Zheng, J. Chen, M. Ma, and L. Huang, “Fused acoustic and text [206] M. Zhou et al., “UC2 : Universal cross-lingual cross-modal vision-and-
encoding for multimodal bilingual pretraining and speech translation,” language pre-training,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog-
2021, arXiv:2102.05766. nit., 2021, pp. 4153–4163.
[181] X. Yang, S. Feng, Y. Zhang, and D. Wang, “Multimodal sentiment [207] W. Hao, C. Li, X. Li, L. Carin, and J. Gao, “Towards learning a generic
detection based on multi-channel graph neural networks,” in Proc. 59th agent for vision-and-language navigation via pre-training,” in Proc. IEEE
Annu. Meeting Assoc. Comput. Linguistics 11th Int. Joint Conf. Natural Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13134–13143.
Lang. Process., 2021, pp. 328–339. [208] Q. Xia et al., “XGPT: Cross-modal generative pre-training for image
[182] X. Mao et al., “Towards robust vision transformer,” 2021, captioning,” in Proc. 10th CCF Int. Conf. Natural Lang. Process. Chin.
arXiv:2105.07926. Comput., 2021, pp. 786–797.
[183] T. Rahman, M. Yang, and L. Sigal, “TriBERT: Human-centric audio- [209] L. Yao et al., “FILIP: Fine-grained interactive language-image pre-
visual representation learning,” in Proc. Int. Conf. Neural Inf. Process. training,” 2021, arXiv:2111.07783.
Syst., 2021, pp. 9774–9787. [210] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, and J. Liu, “Large-scale
[184] L. Rasmy, Y. Xiang, Z. Xie, C. Tao, and D. Zhi, “Med-BERT: Pretrained adversarial training for vision-and-language representation learning,”
contextualized embeddings on large-scale structured electronic health 2020, arXiv:2006.06195.
records for disease prediction,” NPJ Digit. Med., vol. 4, 2021, Art. no. 86. [211] H. Zhang et al., “ERNIE-ViLG: Unified generative pre-training for
[185] R. J. Chen et al., “Multimodal co-attention transformer for survival bidirectional vision-language generation,” 2021, arXiv:2112.15283.
prediction in gigapixel whole slide images,” in Proc. IEEE Int. Conf. [212] M. Zhuge et al., “Kaleido-BERT: Vision-language pre-training on fashion
Comput. Vis., 2021, pp. 3995–4005. domain,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021,
[186] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy, “Rethinking spatiotem- pp. 12642–12652.
poral feature learning for video understanding,” 2017, arXiv:1712.04851. [213] Y. Wang, X. Chen, L. Cao, W. Huang, F. Sun, and Y. Wang, “Multimodal
[187] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer token fusion for vision transformers,” in Proc. IEEE Conf. Comput. Vis.
look at spatiotemporal convolutions for action recognition,” in Proc. IEEE Pattern Recognit., 2022, pp. 12176–12185.
Conf. Comput. Vis. Pattern Recognit., 2018, pp. 6450–6459. [214] Y. Wang et al., “Bridged transformer for vision and point cloud 3D object
[188] J. Lin, A. Yang, Y. Zhang, J. Liu, J. Zhou, and H. Yang, “Inter- detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
BERT: Vision-and-language interaction for multi-modal pretraining,” pp. 12104–12113.
2020, arXiv:2003.13198.
[215] A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer [242] Y. Zhong, J. Shi, J. Yang, C. Xu, and Y. Li, “Learning to generate
for end-to-end autonomous driving,” in Proc. IEEE Conf. Comput. Vis. scene graph from natural language supervision,” in Proc. IEEE Int. Conf.
Pattern Recognit., 2021, pp. 7073–7083. Comput. Vis., 2021, pp. 1803–1814.
[216] X. Bai et al., “TransFusion: Robust LiDAR-camera fusion for 3D object [243] S. Geng et al., “Dynamic graph representation learning for video dialog
detection with transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern via multi-modal shuffled transformers,” in Proc. AAAI Conf. Artif. Intell.,
Recognit., 2022, pp. 1080–1089. 2021, pp. 1415–1423.
[217] X. Favory, K. Drossos, T. Virtanen, and X. Serra, “Learning contextual [244] T. Jaunet, C. Kervadec, R. Vuillemot, G. Antipov, M. Baccouche, and
tag embeddings for cross-modal alignment of audio and tags,” in Proc. C. Wolf, “VisQA: X-raying vision and language reasoning in transform-
Int. Conf. Acoust. Speech Signal Process., 2021, pp. 596–600. ers,” IEEE Trans. Vis. Comput. Graph., vol. 28, no. 1, pp. 976–986,
[218] V. Gabeur, C. Sun, K. Alahari, and C. Schmid, “Multi-modal transformer Jan. 2022.
for video retrieval,” in Proc. Eur. Conf. Comput. Vis., 2020, pp. 214–229. [245] X. Lin, G. Bertasius, J. Wang, S.-F. Chang, D. Parikh, and L. Torresani,
[219] N. Shvetsova et al., “Everything at once - multi-modal fusion transformer “VX2TEXT: End-to-end learning of video-based text generation from
for video retrieval,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., multimodal inputs,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2022, pp. 19988–19997. 2021, pp. 7001–7011.
[220] Z. Wang, Y. Wu, K. Narasimhan, and O. Russakovsky, “Multi-query [246] J. Lin et al., “M6: Multi-modality-to-multi-modality multitask mega-
video retrieval,” 2022, arXiv:2201.03639. transformer for unified pretraining,” in Proc. 27th ACM SIGKDD Conf.
[221] A. Botach, E. Zheltonozhskii, and C. Baskin, “End-to-end re- Knowl. Discov. Data Mining, 2021, pp. 3251–3261.
ferring video object segmentation with multimodal transformers,” [247] T. Chen and R. R. Rao, “Audio-visual integration in multimodal com-
2021, arXiv:2111.14821. munication,” Proc. IEEE, vol. 86, no. 5, pp. 837–852, May 1998.
[222] W. Hong, K. Ji, J. Liu, J. Wang, J. Chen, and W. Chu, “GilBERT: Gener- [248] A. Owens and A. A. Efros, “Audio-visual scene analysis with self-
ative vision-language pre-training for image-text retrieval,” in Proc. 44th supervised multisensory features,” in Proc. Eur. Conf. Comput. Vis., 2018,
Int. ACM SIGIR Conf. Res. Develop. Inf. Retrieval, 2021, pp. 1379–1388. pp. 639–658.
[223] X. Xu and C. C. Loy, “3D human texture estimation from a single [249] H. Xue et al., “Probing inter-modality: Visual parsing with self-attention
image with transformers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, for vision-and-language pre-training,” in Proc. Int. Conf. Neural Inf.
pp. 13829–13838. Process. Syst., 2021, pp. 4514–4528.
[224] X. Lin, C. Ding, J. Zeng, and D. Tao, “GPS-Net: Graph property sensing [250] T.-D. Truong et al., “The right to talk: An audio-visual transformer
network for scene graph generation,” in Proc. IEEE Conf. Comput. Vis. approach,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 1085–1094.
Pattern Recognit., 2020, pp. 3743–3752. [251] M. Chen et al., “MultiSpeech: Multi-speaker text to speech with trans-
[225] W. Wang, R. Wang, and X. Chen, “Topic scene graph generation by former,” 2020, arXiv:2006.04664.
attention distillation from caption,” in Proc. IEEE Int. Conf. Comput. [252] S. Ging, M. Zolfaghari, H. Pirsiavash, and T. Brox, “COOT: Cooper-
Vis., 2021, pp. 15880–15890. ative hierarchical transformer for video-text representation learning,”
[226] Y. Lu et al., “Context-aware scene graph generation with Seq2Seq trans- 2020, arXiv:2011.00597.
formers,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 15911–15921. [253] M. Patrick et al., “Support-set bottlenecks for video-text representation
[227] P. Ke et al., “JointGT: Graph-text joint representation learning for text learning,” 2020, arXiv:2010.02824.
generation from knowledge graphs,” 2021, arXiv:2106.10502. [254] V. Gabeur, A. Nagrani, C. Sun, K. Alahari, and C. Schmid, “Masking
[228] Y. Teng, L. Wang, Z. Li, and G. Wu, “Target adaptive context aggregation modalities for cross-modal video retrieval,” in Proc. Winter Conf. Appl.
for video scene graph generation,” in Proc. IEEE Int. Conf. Comput. Vis., Comput. Vis., 2022, pp. 2111–2120.
2021, pp. 13668–13677. [255] A. Sadhu, K. Chen, and R. Nevatia, “Video object grounding using
[229] M. Chen, Y. Li, Z. Zhang, and S. Huang, “TVT: Two-view transformer semantic roles in language description,” in Proc. IEEE Conf. Comput.
network for video captioning,” in Proc. 10th Asian Conf. Mach. Learn., Vis. Pattern Recognit., 2020, pp. 10414–10424.
2018, pp. 847–862. [256] Y. Zhang, M. Choi, K. Han, and Z. Liu, “Explainable seman-
[230] K. Lin et al., “SwinBERT: End-to-end transformers with sparse attention tic space by grounding language to vision with cross-modal con-
for video captioning,” 2021, arXiv:2111.13196. trastive learning,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
[231] C. Deng, S. Chen, D. Chen, Y. He, and Q. Wu, “Sketch, ground, and pp. 18513–18526.
refine: Top-down dense video captioning,” in Proc. IEEE Conf. Comput. [257] Y.-W. Chen, Y.-H. Tsai, and M.-H. Yang, “End-to-end multi-modal video
Vis. Pattern Recognit., 2021, pp. 234–243. temporal grounding,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
[232] T. Wang, R. Zhang, Z. Lu, F. Zheng, R. Cheng, and P. Luo, “End-to-end pp. 28442–28453.
dense video captioning with parallel decoding,” in Proc. IEEE Int. Conf. [258] S. Chen and B. Li, “Multi-modal dynamic graph transformer for visual
Comput. Vis., 2021, pp. 6827–6837. grounding,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2022,
[233] L. Huang, W. Wang, J. Chen, and X.-Y. Wei, “Attention on attention pp. 15513–15522.
for image captioning,” in Proc. IEEE Int. Conf. Comput. Vis., 2019, [259] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “TubeDETR:
pp. 4633–4642. Spatio-temporal video grounding with transformers,” in Proc. IEEE Conf.
[234] Y. Pan, T. Yao, Y. Li, and T. Mei, “X-linear attention networks for image Comput. Vis. Pattern Recognit., 2022, pp. 16421–16432.
captioning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, [260] H. Xu et al., “VideoCLIP: Contrastive pre-training for zero-shot video-
pp. 10968–10977. text understanding,” 2021, arXiv:2109.14084.
[235] X. Yang, H. Zhang, G. Qi, and J. Cai, “Causal attention for vision- [261] J. Lei et al., “Less is more: CLIPBERT for video-and-language learning
language tasks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., via sparse sampling,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
2021, pp. 9842–9852. 2021, pp. 7327–7337.
[236] Y. Luo et al., “Dual-level collaborative transformer for image captioning,” [262] J. Yang, Y. Bisk, and J. Gao, “TACo: Token-aware cascade contrastive
2021, arXiv:2101.06462. learning for video-text alignment,” in Proc. IEEE Int. Conf. Comput. Vis.,
[237] G. Xu, S. Niu, M. Tan, Y. Luo, Q. Du, and Q. Wu, “Towards accurate 2021, pp. 11542–11552.
text-based image captioning with content diversity exploration,” in Proc. [263] D. Li, J. Li, H. Li, J. C. Niebles, and S. C. Hoi, “Align
IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 12632–12641. and prompt: Video-and-language pre-training with entity prompts,”
[238] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural speech synthesis 2021, arXiv:2112.09583.
with transformer network,” in Proc. AAAI Conf. Artif. Intell., 2019, [264] H. Luo et al., “CLIP4Clip: An empirical study of CLIP for end to end
pp. 6706–6713. video clip retrieval,” 2021, arXiv:2104.08860.
[239] M. Ding et al., “CogView: Mastering text-to-image generation via trans- [265] H. Fang, P. Xiong, L. Xu, and Y. Chen, “CLIP2Video: Mastering video-
formers,” 2021, arXiv:2105.13290. text retrieval via image CLIP,” 2021, arXiv:2106.11097.
[240] A. Sanghi, H. Chu, J. G. Lambourne, Y. Wang, C.-Y. Cheng, and [266] M. Narasimhan, A. Rohrbach, and T. Darrell, “CLIP-It! Language-guided
M. Fumero, “CLIP-forge: Towards zero-shot text-to-shape generation,” video summarization,” in Proc. Int. Conf. Neural Inf. Process. Syst., 2021,
2021, arXiv:2110.02624. pp. 13988–14000.
[241] R. Huang, H. Hu, W. Wu, K. Sawada, M. Zhang, and D. Jiang, “Dance [267] P. Liu, W. Yuan, J. Fu, Z. Jiang, H. Hayashi, and G. Neubig, “Pre-train,
revolution: Long-term dance generation with music via curriculum learn- prompt, and predict: A systematic survey of prompting methods in natural
ing,” 2020, arXiv:2006.06119. language processing,” ACM Comput. Surv., vol. 55, 2023, Art. no. 195.
12132 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 45, NO. 10, OCTOBER 2023
[268] Y. Xian, C. H. Lampert, B. Schiele, and Z. Akata, “Zero-shot learning—A [296] S. Pramanik, P. Agrawal, and A. Hussain, “OmniNet: A unified architec-
comprehensive evaluation of the good, the bad and the ugly,” IEEE Trans. ture for multi-modal multi-task learning,” 2019, arXiv:1907.07804.
Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2251–2265, Sep. 2019. [297] P. Wang et al., “Unifying architectures, tasks, and modali-
[269] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, “Open-vocabulary ob- ties through a simple sequence-to-sequence learning framework,”
ject detection via vision and language knowledge distillation,” 2022, arXiv:2202.03052.
2021, arXiv:2104.13921. [298] R. Girdhar, M. Singh, N. Ravi, L. van der Maaten, A. Joulin, and
[270] J. Cho, J. Lu, D. Schwenk, H. Hajishirzi, and A. Kembhavi, “X- I. Misra, “Omnivore: A single model for many visual modalities,”
LXMERT: Paint, caption and answer questions with multi-modal trans- 2022, arXiv:2201.08377.
formers,” in Proc. Conf. Empir. Methods Natural Lang. Process., 2020, [299] J. Cao, Z. Gan, Y. Cheng, L. Yu, Y.-C. Chen, and J. Liu, “Behind the
pp. 8785–8805. scene: Revealing the secrets of pre-trained vision-and-language models,”
[271] H. Xu et al., “E2E-VLP: End-to-end vision-language pre-training en- in Proc. Eur. Conf. Comput. Vis., 2020, pp. 565–580.
hanced by visual learning,” 2021, arXiv:2106.01804. [300] L. A. Hendricks, J. Mellor, R. Schneider, J.-B. Alayrac, and
[272] C. Kervadec, C. Wolf, G. Antipov, M. Baccouche, and A. Nematzadeh, “Decoupling the role of data, attention, and losses in
M. Nadri, “Supervising the transfer of reasoning patterns in VQA,” multimodal transformers,” Trans. Assoc. Comput. Linguistics, vol. 9,
2021, arXiv:2106.05597. pp. 570–585, 2021.
[273] C. Kervadec, T. Jaunet, G. Antipov, M. Baccouche, R. Vuillemot, and [301] L. A. Hendricks and A. Nematzadeh, “Probing image-language trans-
C. Wolf, “How transferable are reasoning patterns in VQA?,” in Proc. formers for verb understanding,” 2021, arXiv:2106.09141.
IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 4205–4214. [302] S. Frank, E. Bugliarello, and D. Elliott, “Vision-and-language or vision-
[274] W. Rahman et al., “Integrating multimodal information in large pre- for-language? On cross-modal influence in multimodal transformers,”
trained transformers,” in Proc. Conf. Assoc. Comput. Linguistics, 2020, 2021, arXiv:2109.04448.
pp. 2359–2369. [303] H. Chefer, S. Gur, and L. Wolf, “Generic attention-model explain-
[275] D. Agarwal, T. Agrawal, L. M. Ferrari, and F. Bremond, “From multi- ability for interpreting bi-modal and encoder-decoder transformers,”
modal to unimodal attention in transformers using knowledge distilla- 2021, arXiv:2103.15679.
tion,” in Proc. 17th IEEE Int. Conf. Adv. Video Signal Based Surveill., [304] L. Parcalabescu, M. Cafagna, L. Muradjan, A. Frank, I. Calixto, and
2021, pp. 1–8. A. Gatt, “VALSE: A task-independent benchmark for vision and language
[276] Q. Li et al., “Towards a unified foundation model: Jointly pre-training models centered on linguistic phenomena,” 2021, arXiv:2112.07566.
transformers on unpaired images and text,” 2021, arXiv:2112.07074. [305] T. Zhao et al., “VL-CheckList: Evaluating pre-trained vision-language
[277] M. Ni et al., “M3P: Learning universal representations via multitask models with objects, attributes and relations,” 2022, arXiv:2207.00221.
multilingual multimodal pre-training,” in Proc. IEEE Conf. Comput. Vis. [306] E. Aflalo et al., “VL-InterpreT: An interactive visualization tool for
Pattern Recognit., 2021, pp. 3976–3985. interpreting vision-language transformers,” in Proc. IEEE Conf. Comput.
[278] A. Miech, J.-B. Alayrac, I. Laptev, J. Sivic, and A. Zisserman, “Thinking Vis. Pattern Recognit., 2022, pp. 21374–21383.
fast and slow: Efficient text-to-visual retrieval with transformers,” in Proc. [307] N. Mu, A. Kirillov, D. Wagner, and S. Xie, “SLIP: Self-supervision meets
IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 9821–9831. language-image pre-training,” 2021, arXiv:2112.12750.
[279] K. Wen, J. Xia, Y. Huang, L. Li, J. Xu, and J. Shao, “COOKIE: Contrastive [308] H. Xu et al., “VLM: Task-agnostic video-language model pre-training
cross-modal knowledge sharing pre-training for vision-language repre- for video understanding,” 2021, arXiv:2105.09996.
sentation,” in Proc. IEEE Int. Conf. Comput. Vis., 2021, pp. 2188–2197. [309] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image
[280] S. Lee, Y. Yu, G. Kim, T. Breuel, J. Kautz, and Y. Song, “Parameter pre-training for unified vision-language understanding and generation,”
efficient multimodal transformers for video representation learning,” 2022, arXiv:2201.12086.
2020, arXiv:2012.04124. [310] P. Zhang et al., “VinVL: Revisiting visual representations in vision-
[281] T. Liu, F. Feng, and X. Wang, “Multi-stage pre-training over simplified language models,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit.,
multimodal pre-training models,” 2021, arXiv:2107.14596. 2021, pp. 5575–5584.
[282] Y. Li et al., “Supervision exists everywhere: A data efficient contrastive
language-image pre-training paradigm,” 2021, arXiv:2110.05208.
[283] Z. Gan et al., “Playing lottery tickets with vision and language,”
Peng Xu is a lecturer with the Department of Electronic Engineering, Tsinghua University. Previously, he was a postdoctoral research assistant with the Department of Engineering Science, University of Oxford.
Xiatian Zhu is a senior lecturer with the Surrey Institute for People-Centred Artificial Intelligence and the Centre for Vision, Speech and Signal Processing (CVSSP), Faculty of Engineering and Physical Sciences, University of Surrey.
David A. Clifton is a professor of clinical machine learning and leads the Computational Health Informatics (CHI) Lab, Department of Engineering Science, University of Oxford.