
Question Aware Vision Transformer for Multimodal Reasoning

Roy Ganz* (Technion, Israel)   Yair Kittenplon† (AWS AI Labs)   Aviad Aberdam (AWS AI Labs)   Elad Ben Avraham (AWS AI Labs)
Oren Nuriel (AWS AI Labs)   Shai Mazor (AWS AI Labs)   Ron Litman† (AWS AI Labs)

arXiv:2402.05472v1 [cs.CV] 8 Feb 2024

Abstract

Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled from user queries, often in the form of image-related questions. Consequently, the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this, we introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning, which embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on the image aspects relevant to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures, leading to consistent improvement across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding.

Figure 1. Question-Aware Vision Encoding. Comparative illustrations for VQAv2 (upper) and TextVQA (lower) predictions of ViT+T5 and QA-ViT+T5 VL models. Employing GradCAM highlights the focus areas with respect to key terms in the posed questions. This vividly demonstrates the motivation behind QA-ViT: enhancing ViT with the question enables it to focus on the relevant image aspects, resulting in more accurate predictions.

1. Introduction

In recent years, VL architectures have emerged as a pivotal research area, leading to significant progress in the domain of multimodal reasoning [3, 15, 19, 20, 24, 30, 31, 34, 43, 54]. Such architectures fundamentally seek to bridge the gap between visual and textual data, enabling models to interpret, comprehend, and generate content based on both visual and textual information. This fusion of modalities has diverse applications and tasks, from image captioning (CAP) [10, 45] and visual question answering (VQA) [4, 46] to tasks in autonomous robotics and human-computer interactions. As the list of applications continues to grow, the role of VL architectures becomes increasingly crucial within the broader field of deep learning.

* Work done during an Amazon internship.
† Corresponding author.
At the heart of multimodal VL architectures lies the concept of vision-language modeling. These models typically consist of three essential steps. First, a unimodal vision architecture extracts meaningful information from images. Typically, the vision encoder is a frozen Vision Transformer (ViT), often based on CLIP [17, 41]. Second, a projection module bridges the gap between vision and language, transforming visual features into ones that can be comprehended and processed by a language model. This module is usually either a simple linear layer or MLP [33, 34, 54], or a cross-attention-based transformer architecture [6, 15, 31]. Lastly, the projected visual information and the textual instruction, commonly in the form of questions or prompts, are inserted into a Large Language Model (LLM) to complete the task.

Figure 2. Method overview. A high-level illustration of the QA-ViT (highlighted in orange) incorporated into a general VL architecture (depicted in blue). This is achieved by encoding the question Q into features F_Q, which are fused into the vision encoder, resulting in question-aware visual features F_VQ.

Despite the remarkable progress achieved in VL research, we have identified an intriguing yet often overlooked limitation within such architectures. The success of such a model hinges on its ability to not only comprehend the visual content but also to do so through the lens of the accompanying textual instruction, e.g., the provided question, often requiring focus on fine-grained details inside the entire image. Existing architectures, however, are suboptimal in this aspect, as they perform the vision encoding unaware of the posed question, resulting in visual features not optimally aligned with the user query. As the vision encoder outputs a fixed-size feature sequence F_V, it is limited in the level of information encoded in it. Due to the relatively high abstraction level, it is likely to disregard or overlook low-level details in the image. This oversight becomes particularly problematic in scenarios where nuanced image understanding is essential to accurately respond to queries. Thus, we claim that the vision encoder V should be cast from a single-input function into a conditional one, namely V(I|Q) instead of V(I), where I, Q are the image and question, respectively.

To mitigate this limitation and yield a textually conditioned vision encoding, we present QA-ViT, a Question Aware Vision Transformer for multimodal reasoning. The intuition of our method is clear: if the model understands the posed question and the inherent context, it can extract visual features that directly correspond to the relevant image aspects essential for answering it correctly. We illustrate this behavior in Fig. 1 by applying GradCAM [44] to both a vanilla CLIP-based ViT and QA-ViT, with respect to textual prompts that correspond with a distinct spatial location. While the baseline tends to favor high-abstraction-level features, even when prompted with region-specific descriptions, QA-ViT focuses significantly more on the relevant image parts. For instance, considering the bottom image and a question like "What is written on the top blue sign?", we can see that while the baseline vision encoder generates features that contain a wealth of information about the scene (e.g., the buildings, cars, and people), QA-ViT is able to pinpoint the specific region of interest, namely, the blue sign. Our approach achieves the above goal by directly integrating textual representations into any vision encoder while keeping most of it frozen, preserving its visual understanding capabilities (Fig. 2). In practice, we utilize the preexisting self-attention mechanism in the ViT to also attend to textual encodings, representing the user query.

To demonstrate QA-ViT effectiveness, we leverage the model-agnostic nature of our method and integrate it into top-performing systems, including BLIP2 [31], InstructBLIP [15], and LLaVA-1.5 [33]. In addition, we also integrate QA-ViT into a simple ViT+T5 architecture, without pretraining, to demonstrate its benefit when training an unaligned VL system from scratch. We train all these architectures on a combined dataset of visual question answering and image captioning, requiring visual and Optical Character Recognition (OCR) understanding, and evaluate them accordingly. Despite the architectural differences between the considered VL models in the vision encoder, projection module (QFormer vs. MLP), and LLM structure (encoder-decoder vs. decoder-only), extensive experiments show that QA-ViT consistently improves the performance over all the tested models and benchmarks, attesting to its versatility.

To summarize:
• We identify an overlooked suboptimality in the paradigm of vision-language modeling stemming from the lack of instruction-aware image encoding.
• We introduce QA-ViT, a model-agnostic method that enables existing vision encoders to be conditioned on textual prompts or questions.
• Thorough experiments on multiple architectures demonstrate our method's ability to enhance multimodal reasoning, improving the performance on various benchmarks.
2. Related Work

Vision-Language Models. Earlier-generation VL models pursue the paradigm of rigorous and extensive pretraining, using contrastive losses, followed by designated fine-tuning for specific tasks [28–30, 50–52]. While this approach constituted a critical milestone, it led to specialist models that only perform well on a specific downstream task [8, 20, 46]. By leveraging the capabilities of recent Large Language Models (LLMs) [14, 47–49], current top-performing VL models are generalist models, showcasing remarkable performance across various VL tasks. Interestingly, such models demonstrate strong zero-shot performance and generalization to unseen data and tasks [3, 6, 12, 15, 31, 33], sometimes even surpassing specialist models.

Architecturally, there are two main types of VL models, which mainly differ in the integration mechanism of the visual features into the LLM. The first type projects the visual features using a cross-attention-based transformer model (e.g., QFormer), which also reduces the visual sequence length [6, 15, 31]. The introduction of such a mechanism enables keeping both the LLM and the vision encoder frozen. The second line of research demonstrates that the projection module can be simplified to a linear projection (or an MLP) while also training the LLM [12, 33, 34, 54]. Despite such differences, all current top-performing VL models perform image encoding in a manner unaware of the given textual prompt.

Question-Aware Vision Encoding. A possible solution for the limitation above was proposed in the OCR-free, text-oriented multimodal understanding work pix2struct [27], which suggests directly rendering the question as a header at the top of the original image instead of passing it to the LLM. However, this approach relies highly on their OCR-oriented pretraining and is suboptimal in the general VL case. Another step towards instruction-aware visual features is InstructBLIP [15], which introduces the visual features into the QFormer alongside the instruction. Nevertheless, it operates solely on top of the outputs of the vision encoder and, thus, is incapable of compensating for overlooked image aspects. In this paper, we propose to integrate question information into any ViT-based image encoder in a flexible and modular manner.

3. Method

Our method proposes a versatile and lightweight model-agnostic approach, which can be integrated into any vision transformer model in any VL architecture, designed to transform trained image encoders into question-aware ones effectively. Formally, given the image and question I, Q, we argue that the vision encoding module V should be cast into a conditioned one:

    F_V = V(I)  →  F_VQ = V(I|Q).    (1)

In this section, we first describe our high-level design and then delve into the details of each building block.

3.1. Overall Architecture

As illustrated in Fig. 2, our method comprises two fundamental components. First, the question, denoted as Q, is fed into a "Question Encoding" module, which processes and projects the textual prompt, bridging the gap between the linguistic and visual feature domains. Subsequently, the textual encoded features, denoted as F_Q, are integrated inside a frozen vision model via a "Question Fusing" module, producing text-aware visual features F_VQ. Lastly, F_VQ is projected by the projection module, concatenated with the instruction embeddings, and fed into the LLM, which processes and produces the overall system's output. In general, QA-ViT modifies solely the vision encoder, maintaining the rest of the architecture intact.
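To make the data flow concrete, the sketch below is a minimal PyTorch-style illustration of the overall pipeline described above (question encoder E, QA-ViT-augmented vision encoder V(I|Q), projection module, LLM). The module names, signatures, and the HF-style LLM call are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class QAViTPipeline(nn.Module):
    """Minimal sketch of the VL architecture with QA-ViT (Fig. 2); components are placeholders."""

    def __init__(self, question_encoder, qa_vit, projector, llm):
        super().__init__()
        self.question_encoder = question_encoder  # E(Q): e.g., the LLM's own text encoder
        self.qa_vit = qa_vit                      # V(I|Q): frozen ViT with question fusing
        self.projector = projector                # MLP / QFormer aligning vision to LLM space
        self.llm = llm                            # (LoRA-tuned) language model

    def forward(self, image, question_tokens, instruction_embeds):
        # 1) Encode the question into textual features F_Q.
        f_q = self.question_encoder(question_tokens)           # (B, K, C_text)
        # 2) Question-aware vision encoding: F_VQ = V(I | Q).
        f_vq = self.qa_vit(image, f_q)                          # (B, M, C_vis)
        # 3) Project the visual features into the LLM representation space.
        vis_embeds = self.projector(f_vq)                       # (B, M, C_llm)
        # 4) Concatenate with the instruction embeddings and run the LLM.
        llm_inputs = torch.cat([vis_embeds, instruction_embeds], dim=1)
        return self.llm(inputs_embeds=llm_inputs)               # HF-style call; interface depends on the LLM
```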
3.2. Question Encoding

In order to introduce text prompts Q into a unimodal vision transformer, we propose a streamlined two-stage process.

Question Representation. First, we encode the natural language prompt (e.g., the question) into meaningful representations, denoted as F'_Q. Formally, we define this operation as E(Q) = F'_Q, where E represents the encoding function. This step introduces flexibility in choosing E, the source of these textual representations – the preexisting LLM's encoder or embeddings, or a designated language model. We mainly focus on the former, as it offers more parameter efficiency and can lead to more seamless integration, since the same LLM subsequently processes the visual features. We compare these approaches in Sec. 5.1.

Representation Projection. Second, we utilize MLPs to project the textual representations into the vision model's feature space. Due to the vision model's hierarchical structure, different layers have different abstraction levels [17, 42]. Hence, we adopt a per-layer MLP to obtain better alignment. We denote the projected textual representation for layer i as F_Q^i. Overall, the question encoding phase operates as follows:

    F_Q^i = MLP_i(E(Q)).    (2)

For simplicity, we omit the layer index from now on.
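A minimal sketch of this question-encoding stage (Eq. 2) is given below, assuming PyTorch and a generic text encoder E; the MLP shape, the freezing of E, and the number of fused layers are illustrative assumptions rather than values prescribed by the paper.

```python
import torch
import torch.nn as nn

class QuestionEncoding(nn.Module):
    """Sketch of Sec. 3.2: encode Q with E and project it with a per-layer MLP,
    yielding one textual representation F_Q^i per fused ViT layer (Eq. 2)."""

    def __init__(self, text_encoder, text_dim, vit_dim, num_fused_layers):
        super().__init__()
        self.text_encoder = text_encoder  # E: e.g., the LLM's encoder or embedding module
        # One small MLP per fused layer, to match that layer's abstraction level.
        self.per_layer_mlps = nn.ModuleList([
            nn.Sequential(nn.Linear(text_dim, vit_dim), nn.GELU(), nn.Linear(vit_dim, vit_dim))
            for _ in range(num_fused_layers)
        ])

    def forward(self, question_tokens):
        with torch.no_grad():                                 # E kept frozen in this sketch
            f_q_prime = self.text_encoder(question_tokens)    # (B, K, text_dim)
        # F_Q^i = MLP_i(E(Q)) for every fused layer i.
        return [mlp(f_q_prime) for mlp in self.per_layer_mlps]
```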
3.3. Question Fusing

Given the projected textual representations F_Q, we propose a parameter-efficient fusing mechanism to integrate them into frozen ViT architectures in a model-agnostic way. Keeping the vision encoder frozen enables text-conditioned encoding of the image while preserving the model's original capabilities intact. While such integration can be done in various ways, we propose a straightforward approach that harnesses the ViT's preexisting self-attention mechanism, illustrated in Fig. 3.

Figure 3. Textual representations fusing. Left: General scheme of the ViT encoder. Right: Zoom-in to our fusing mechanism in one of the top-L self-attention layers. The M visual features from the previous layer, F_V, are concatenated with K textual features F_Q and fed into the frozen self-attention mechanism to obtain M text-attended visual representations F'_VQ. Next, a parallel gated projection obtains the question-aware visual features F_VQ.

Fusing Mechanism. We extend the input sequence of the self-attention layer to contain the projected representations F_Q ∈ R^{K×C} by concatenating them with the visual representations F_V ∈ R^{M×C}, where C is the channel dimension. This yields a sequence of length K + M, containing vision and question information. Next, the frozen self-attention mechanism is applied to produce the attention scores and outputs while also attending to the textual information F_Q, enabling cross-modal attention. We select the attention output that corresponds with the input visual representations, resulting in F'_VQ ∈ R^{M×C}. More formally,

    F'_VQ = Attention(concat(F_V, F_Q))[0:M].    (3)

An additional projection followed by a learnable gating mechanism [2, 3, 20, 22] is introduced in parallel to the existing frozen projection head. This module compensates for the distribution shift caused by incorporating question information in the frozen self-attention layer. The goal of such gating is to enable the gradual blending of the residual projected information with the existing one, avoiding a significant feature modification and a degradation of the overall performance. Such gating is done by multiplying the additional projection layer's outputs with tanh(β), where β is a learnable parameter initialized to zero. This technique is designed to maintain the layer's outputs with minimal deviation at initialization, improving stability while enabling a residual learnable stream of information. Mathematically, our fusing mechanism functions as follows:

    F_VQ = P(F'_VQ) + P_g(F'_VQ) · tanh(β).    (4)

Integration Point. An important design choice in our fusing mechanism is the choice of the integration point of the textual representations into the vision transformer layers. Specifically, we perform late fusion, namely, applying the fusing in the top L self-attention layers of the N-layered ViT, where L < N. This choice is motivated by the nature of the ViT layers' hierarchy – lower layers primarily capture low-level visual details, while the higher layers mainly focus on high-level concepts [17, 42]. Therefore, the likelihood of disregarding fine-grained details is expected to emerge in the higher layers, making them an optimal target for our method. We validate this choice in Sec. 5.
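The following is a minimal PyTorch-style sketch of this fusing step under the stated design (concatenate F_Q to the token sequence, reuse the frozen self-attention, keep only the M visual outputs, and add a zero-initialized, tanh-gated parallel projection). The class layout, argument names, and the way the frozen weights are represented are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedQuestionFusing(nn.Module):
    """Sketch of Eqs. (3)-(4): question fusing inside one frozen ViT attention block.
    q_proj/k_proj/v_proj/out_proj stand in for the block's existing, frozen weights."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        # Existing (frozen) attention weights of the ViT layer.
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)          # P: existing frozen projection head
        for p in self.parameters():
            p.requires_grad_(False)
        # New trainable parts: parallel gated projection P_g and gate parameter beta.
        self.gated_proj = nn.Linear(dim, dim)
        self.beta = nn.Parameter(torch.zeros(1))     # tanh(0) = 0 -> no change at initialization

    def _heads(self, x):
        b, n, c = x.shape
        return x.view(b, n, self.num_heads, c // self.num_heads).transpose(1, 2)

    def forward(self, f_v: torch.Tensor, f_q: torch.Tensor) -> torch.Tensor:
        # f_v: (B, M, C) visual tokens; f_q: (B, K, C) projected question tokens.
        tokens = torch.cat([f_v, f_q], dim=1)        # sequence of length M + K
        q = self._heads(self.q_proj(tokens))
        k = self._heads(self.k_proj(tokens))
        v = self._heads(self.v_proj(tokens))
        attn = F.scaled_dot_product_attention(q, k, v)           # frozen cross-modal attention
        attn = attn.transpose(1, 2).reshape(tokens.shape)
        f_vq_prime = attn[:, : f_v.shape[1]]                     # keep the M visual outputs, Eq. (3)
        # Frozen projection plus zero-initialized gated residual projection, Eq. (4).
        return self.out_proj(f_vq_prime) + torch.tanh(self.beta) * self.gated_proj(f_vq_prime)
```

Under late fusion, such a module would replace only the attention of the top-L of the N ViT blocks, while the lower layers and the rest of each block (FFN, norms) stay untouched.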
4. Experiments

We conduct a comprehensive set of experiments to assess the capabilities of QA-ViT. Given the model-agnostic nature of our method, which enables seamless integration into any existing VL architecture, our experiments are designed to showcase its versatility in two distinct architectural settings. In the first setting, we experiment with a straightforward VL approach consisting of a vision encoder and an encoder-decoder-based LLM, denoted as ViT+T5. The second setting involves integrating our method into already-trained, top-performing vision-language models, specifically LLaVA-1.5 [33], BLIP2 [31], and InstructBLIP [15]. This allows us to assess the benefits of QA-ViT for already finetuned models. In both settings, we train and evaluate the models using a combined dataset of visual question answering and image captioning, requiring both visual and OCR understanding [1, 2, 32]. In the OCR case, we are interested in the OCR-free setting; we do not equip the models with OCR tokens.

4.1. Training Data

For training across all considered architectures, we adopt a multi-task approach using concatenated VL datasets that involve reasoning over both visual and OCR information. In particular, we consider general visual question-answering datasets [21, 25] alongside scene-text [8, 40, 46] and document-oriented ones [37–39]. For these datasets, we insert the question representations into the vision encoder when applying QA-ViT. In addition, we include captioning datasets (COCO Captions [11] and TextCaps [45]), which leads to additional improvements, as can be seen in Sec. 5.2. In the captioning data, we utilize a random template instruction, as in [15], e.g., "Please provide a short depiction of the picture", and insert it into the ViT. We provide the complete list of such templates in the supplementary materials, alongside further details on the training dataset composition. Overall, our dataset comprises approximately 3 million assets from multiple training datasets of different sizes. We adopt a sampling strategy proportional to each dataset's size during training to address the size disparity. This approach is designed to prevent overfitting smaller datasets and underfitting larger ones.
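As a concrete, hypothetical illustration of this size-proportional sampling, the snippet below mixes several datasets with probabilities proportional to their sizes; the dataset names are placeholders, not the paper's exact composition.

```python
import random

def make_proportional_sampler(datasets):
    """datasets: dict name -> list of samples. Each draw picks a dataset with
    probability proportional to its size, then a random example from it (Sec. 4.1)."""
    names = list(datasets)
    sizes = [len(datasets[n]) for n in names]
    total = sum(sizes)
    weights = [s / total for s in sizes]

    def sample():
        name = random.choices(names, weights=weights, k=1)[0]
        return name, random.choice(datasets[name])

    return sample

# Hypothetical usage with placeholder datasets:
# mixed = make_proportional_sampler({"vqa": vqa_data, "captions": cap_data, "docs": doc_data})
# dataset_name, example = mixed()
```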
4.2. QA-ViT Performance Gains

We evaluate QA-ViT on general (VQAv2 and COCO) and scene-text (VQA^T, VQA^ST, and TextCaps) benchmarks, in addition to a zero-shot setting (VizWiz [7]). Additionally, we calculate average scores by assigning equal weight to both visual question answering and image captioning tasks.

ViT+T5. First, we examine a simple yet effective approach – a frozen CLIP¹ [41] and Flan-T5 [14] of different sizes (base, large, and xl), with an MLP projection module. We train the system on the data described in Sec. 4.1, using both the standard CLIP-ViT and QA-ViT, with the same training hyperparameters. In particular, we adapt the LLM weights using LoRa [23], train the projection MLP, and, in the QA-ViT case, also the instruction-fusing counterparts. Both the baseline and the QA-ViT settings exhibit high parameter efficiency, keeping the vast majority of the weights frozen. We report the quantitative results of the ViT+T5 and compare them with QA-ViT in Table 1. As can be seen, QA-ViT leads to a substantial and consistent improvement compared to the baseline in all the benchmarks and across all model sizes. Moreover, our method not only improves performance on the seen benchmarks, but it also benefits it in a zero-shot setting on VizWiz [7].

¹ https://huggingface.co/openai/clip-vit-large-patch14-336

To better understand the gains achieved by QA-ViT, we provide qualitative results of the ViT+T5-large model in Fig. 4. As seen, QA-ViT leads to better performance, specifically on image-question pairs that require reasoning over nuanced low-level details inside the image. For example, the image-question pair on the right requires focusing on the board, which is relatively small and marginal in importance compared to the entire image. Similar behavior is observed throughout all such examples.

Figure 4. Paying attention to details in visual question answering. Representative examples that require answering questions regarding subtle or less conspicuous image details (zoomed-in) from the VQAv2 and TextVQA datasets. Each sample includes an image-question pair alongside predictions from ViT+T5 and QA-ViT+T5, where green indicates correct predictions and red indicates incorrect ones.

State-of-the-art Models. After validating the efficacy of QA-ViT in a pretraining-free setting, we turn to experiment with already-trained leading VL models. In this setting, we finetune the base model with and without QA-ViT using our training data introduced in Sec. 4.1. As in the ViT+T5 case, we employ a similar training setting by applying LoRa to the LLM and tuning the projection module and the QA-ViT components, if applicable. Specifically, we consider BLIP2 [31] and InstructBLIP [15], using different sizes, and LLaVA-1.5 [33], top-performing multimodal architectures, and report the results in Tab. 1. As can be seen, QA-ViT consistently improves the baselines in all the tested architectures and across all the seen benchmarks, while showing benefit also in the unseen one (except for InstructBLIP).

4.3. QA-ViT Results Analysis

We turn to conduct a more in-depth analysis of the results provided in Tab. 1 to better understand the contributions of QA-ViT. Our method improves the performance of different architectures, highlighting the three-way model agnosticism of QA-ViT in terms of the vision encoder, projection module, and LLM.

• Vision Encoder – Although BLIP2 and InstructBLIP utilize a different vision encoder than LLaVA-1.5 (a 39-layered EVA-CLIP [18] with a resolution of 224 × 224 vs. a 24-layered CLIP ViT-L at 336 × 336 resolution), integrating QA-ViT leads to improved performance.

• Projection Module – On the one hand, BLIP2 and InstructBLIP use a QFormer, a transformer-based architecture with learnable tokens that also reduces the sequence length of the visual features by processing them. On the other hand, LLaVA-1.5 and ViT+T5 utilize a simple MLP that operates separately on the visual features. Despite this crucial difference, our method is compatible with both, leading to consistent gains.

• LLM Architecture – We experiment with both encoder-decoder (Flan-T5 [14]) and decoder-only (Vicuna [13]) LLMs. In the encoder-decoder case, we encode the textual guidance using the preexisting encoder, and in the decoder-only case, we utilize the model's embedding module. We provide a comparison between these two alternatives in Sec. 5.1. Our experiments show that despite the significant LLM architecture differences, QA-ViT is compatible with both, showcasing its versatility.
Method | LLM | VQAv2 (vqa-score) | COCO (CIDEr) | VQA^T (vqa-score) | VQA^ST (ANLS) | TextCaps (CIDEr) | VizWiz, 0-shot (vqa-score) | Average: General | Average: Scene-Text
ViT+T5-base Flan-T5-base 66.5 110.0 40.2 47.6 86.3 23.7 88.3 65.1
+ QA-ViT 71.7 114.9 45.0 51.1 96.1 23.9 93.3 72.1
∆ +5.2 +4.9 +4.8 +3.5 +9.8 +0.2 +5.0 +7.0
ViT+T5-large Flan-T5-large 70.0 114.3 44.7 50.6 96.0 24.6 92.2 71.8
+ QA-ViT 72.0 118.7 48.7 54.4 106.2 26.0 95.4 78.9
∆ +2.0 +4.4 +4.0 +3.8 +10.2 +1.4 +3.2 +7.1
ViT+T5-xl Flan-T5-xl 72.7 115.5 48.0 52.7 103.5 27.0 94.1 77.0
+ QA-ViT 73.5 116.5 50.3 54.9 108.2 28.3 95.0 80.4
∆ +0.8 +1.0 +2.3 +2.2 +4.7 +1.3 +0.9 +3.4
BLIP2 [31] Flan-T5-xl 72.5 134.8 34.5 36.4 93.6 28.2 103.7 64.5
+ QA-ViT 74.6 136.6 36.6 38.1 97.4 28.4 105.6 67.4
∆ +2.1 +1.8 +2.1 +1.7 +3.8 +0.2 +1.9 +2.9
BLIP2 [31] Flan-T5-xxl 74.8 134.8 36.5 37.9 97.4 29.8 104.8 67.3
+ QA-ViT 75.6 135.9 37.5 39.9 98.7 30.4 105.8 68.7
∆ +0.8 +1.1 +1.0 +2.0 +1.3 +0.6 +1.0 +1.4
InstructBLIP [15] Flan-T5-xl 75.7 135.9 36.2 38.1 98.2 28.9 105.8 67.7
+ QA-ViT 76.0 136.9 37.4 39.4 99.9 28.8 106.5 69.2
∆ +0.3 +1.0 +1.2 +1.3 +1.7 -0.1 +0.7 +1.5
InstructBLIP [15] Flan-T5-xxl 76.1 136.1 37.4 38.7 99.0 31.1 106.1 68.5
+ QA-ViT 76.5 138.2 38.4 40.0 101.7 30.7 107.4 70.5
∆ +0.4 +2.1 +1.0 +1.3 +2.7 -0.4 +1.3 +2.0
LLaVA-1.5 [33] Vicuna-7B 79.7 133.5 57.4 61.6 126.4 33.9 106.6 93.0
+ QA-ViT 80.5 134.7 59.1 62.4 128.7 36.5 107.6 94.7
∆ +0.8 +1.2 +1.7 +0.8 +2.3 +2.6 +1.0 +1.7

Table 1. QA-ViT results. Quantitative comparison of QA-ViT integrated into ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5, using differ-
ent model sizes, with these baselines trained on the data described in Sec. 4.1. The evaluation covers general and scene-text VL benchmarks
and 0-shot capabilities. QA-ViT consistently outperforms the different baselines, demonstrating its effectiveness and versatility.

Next, we examine the effects of scale-up on our approach by comparing the results of different model sizes. In particular, we consider base, large, and xl for ViT+T5, and xl and xxl for BLIP2 and InstructBLIP. Our quantitative analysis demonstrates that our approach leads to consistent improvement across all model scales, making it compatible with different LLM sizes. Remarkably, for a given LLM size, applying QA-ViT is more beneficial than scaling up in terms of average general and scene-text performance. For example, InstructBLIP-xl + QA-ViT leads to 106.5 and 69.2 (general and scene-text averages), compared to InstructBLIP-xxl with 106.1 and 68.5 – an improvement of +0.4 and +0.7 over the scale-up. Based on these results, we conduct a more thorough analysis of our method's contribution in Sec. 4.5.

Lastly, we focus on InstructBLIP, as it utilizes an instruction-aware QFormer. In particular, this component processes the visual features with respect to the provided text, which conceptually resembles QA-ViT. Thus, one might presume that utilizing such a model makes the QA-ViT contribution redundant. However, it is fundamentally different, as our method is integrated inside the ViT and not on top of it. Hence, the QFormer cannot compensate for information disregarded in the output features of the ViT. On the contrary, QA-ViT, by being integrated into the ViT layers, can emphasize the relevant features and prevent their
potential disregardance, leading to performance gains.

Method VQAv2 VQA^T TextCaps VizWiz
mPLUG-DocOwl [53] - 52.6∗ 111.9∗ -
BLIP2 [31] 65.0 23.4 70.4 29.4
InstructBLIP [15] - 30.9 75.6∗ 30.9
InstructBLIP+OCR [15] - 46.6 126.0∗ 30.9
OpenFlamingo-9B [5] 50.3 24.2 - 17.7
IDEFICS-9B [26] 50.9 25.9 25.4 35.5
IDEFICS-80B [26] 60.0 30.9 56.8 36.0
Shikra [9] 77.4∗ - - -
Qwen-VL [6] 79.5∗ 63.8∗ - 35.2
LLaVA-1.5 [33] 79.7∗ 57.4∗ 126.4∗ 33.9
+ QA-ViT 80.5∗ 59.1∗ 128.7∗ 36.5
∆ +0.8 +1.7 +2.3 +2.6

Table 2. Comparison to generalist models. Results comparison of QA-ViT integrated into LLaVA-1.5 with top-performing generalist models on VQA and captioning. QA-ViT outperforms existing methods on VQAv2, TextCaps, and VizWiz. Models marked with +OCR receive a list of OCR tokens, and scores noted with ∗ signify that the dataset's training images are observed in training.

4.4. Comparison to State-of-the-art

Despite QA-ViT being a model-agnostic approach that can be integrated into any VL model, we compare LLaVA-1.5 + QA-ViT to other state-of-the-art generalist methods. In particular, we consider mPLUG-DocOwl [53], OpenFlamingo-9B [5], IDEFICS-9B and 80B [26], Shikra [9], and Qwen-VL [6], and report the results in Tab. 2. As can be seen, QA-ViT pushes the performance of the LLaVA-1.5 model on the unseen VizWiz beyond Qwen-VL and IDEFICS-80B, leading to the best performance across the considered models. In addition, QA-ViT leads to the top-performing generalist model on VQAv2.

4.5. Why and When QA-ViT is Effective?

In this section, we further study the impact of QA-ViT. We argue that our method plays a crucial role in addressing two common image-question fail-cases within VL architectures: first, questions regarding image aspects disregarded by the vision model, and second, questions related to elements encoded by the vision model but misinterpreted by the LLM. While scaling up the LLM might mitigate some of the latter type of fail-case, the former remains challenging to address; hence, we consider the first as a more interesting setting for our method. To examine our claim, we propose to compare the gains of QA-ViT across different LLM scales on two datasets, VQA^T and VQAv2, that differ in the composition of the fail-cases mentioned above. We categorize VQA^T as having more instances of the first fail-case and VQAv2 as having more of the second one, since OCR information is more likely to be disregarded due to its relative scarcity in the ViT's pretraining captions compared to non-OCR visual data. Indeed, as anticipated, the trends in Fig. 5 align with our expectation that the gains of QA-ViT in VQA^T would be more significant when scaling up compared to VQAv2. Although more substantial gains are generally observed in smaller models, our method leads to consistent improvements even on the largest models (i.e., BLIP2-xxl, InstructBLIP-xxl, and LLaVA-1.5), as evidenced in Tab. 1.

Figure 5. QA-ViT effectiveness analysis. Comparison of the trends in error rate reduction of QA-ViT in VQA^T and VQAv2 as the language model is scaled up. The relative performance improvements of our approach are more consistent across model scales in the former. These trends are attributed to each dataset's different question-type composition, where VQA^T exhibits more questions focusing on non-salient and overlooked elements.

5. Ablation Studies

In this section, we conduct extensive experiments to better understand the performance improvements and analyze the impact of our method. We first study the effect of different design choices (Sec. 5.1) and then analyze the contributions of different training data compositions (Sec. 5.2). Throughout this section, we focus on the ViT+T5-large architecture.

5.1. Design Choices

We analyze different design choices and explore different settings for the textual guidance encoding and representations fusing while applying QA-ViT.

Finetuning Strategy. Despite being parameter efficient, QA-ViT introduces more trainable parameters than the baseline. To validate that the improvements are credited to the method and not the additional capacity, we conduct experiments with two other finetuning techniques. First, analogous to deep prompt tuning, we train our model while inserting into QA-ViT a fixed textual prompt instead of the relevant question. By employing the same blocks as our method, this interpretation of prompt tuning (denoted as P.T.) isolates the contribution of question-conditioned image encoding. In addition, we also experiment with finetuning the entire baseline's vision encoder, which introduces
Inst. Fuse Freeze VQAv2 VQA^T
✗ ✗ ✓ 70.0 44.7
P.T. late ✓ 70.1 (+0.1%) 45.8 (+1.1%)
✗ ✗ ✗ 69.5 (-0.5%) 44.9 (+0.2%)
Enc. early ✓ 67.9 (-2.1%) 41.7 (-3.0%)
Enc. sparse ✓ 70.7 (+0.7%) 46.6 (+1.9%)
Enc. all ✓ 69.5 (-0.5%) 45.9 (+1.2%)
Emb. late ✓ 71.0 (+1.0%) 47.5 (+2.8%)
BERT late ✓ 71.8 (+1.8%) 48.3 (+3.6%)
CLIP late ✓ 71.8 (+1.8%) 48.0 (+3.3%)
Enc. late ✓ 72.0 (+2.0%) 48.7 (+4.0%)

Table 3. Design choices ablation. We mark the baseline and our top-performing configuration of QA-ViT in grey and yellow, respectively. Top: Results of different finetuning strategies. Middle: The effect of different integration points of QA-ViT. Bottom: Comparison of different instruction (Inst.) encodings.

Datasets Size VQAv2 VQA^T COCO TextCaps
VQA 2.3M 71.2 45.8 29.9 34.3
+ CAP 3.0M 71.5 47.4 117.5 106.1
+ DOC 3.1M 72.0 48.7 118.7 106.2

Table 4. Training data ablation. Contribution analysis of different training dataset compositions on visual question answering and captioning, demonstrating the importance of multi-task data.

a significant amount of trainable parameters. The results in the top part of Tab. 3 show that while QA-ViT leads to gains of +2.0% and +4.0% on VQAv2 and VQA^T, P.T. improves by only +0.1% and +1.1%, respectively. Comparing the QA-ViT results with P.T. enables decomposing our method's improvement into gains attributed to additional capacity and to question-aware visual features, implying that the latter is the most significant. In addition, fully finetuning CLIP, which introduces training instability, improves the baseline on VQA^T but reduces it on VQAv2. This supports the choice of current VL works to freeze the ViT during pretraining.

Integration Point. We explore different fusing locations – early (bottom layers), late (top layers), sparse (every 2 layers), and all (every layer). While early, sparse, and late add the same amount of trainable parameters, all doubles it. The results presented in the middle part of Tab. 3 demonstrate the significant advantage of late fusion. We attribute this to the hierarchical structure of the ViT's layers, in which early layers specialize in capturing low-level and localized visual details, while higher ones focus on extracting more abstract and high-level visual features. Since disregarding question-related image aspects is more likely to occur in the higher layers, QA-ViT is most effective with late fusion. Moreover, as the early layers extract low-level details, they should not be modified, and applying QA-ViT to them impairs the results.

Question Representation. As specified in Sec. 3, we use the preexisting LLM's encoder (Enc.) to obtain the question representation. Here, we study the effect of different such choices and present their results at the bottom of Tab. 3. First, utilizing solely the embeddings (Emb.) is less effective than the encoder. We attribute this to the improved contextual understanding of the latter, enabling better guidance of the visual features in QA-ViT. Next, we experiment with using a designated language model, considering both BERT [16] and the corresponding CLIP text encoder. While utilizing the system's language model is more parameter efficient and can lead to more seamless integration, a dedicated language model can better align with the vision model and offer a more modular and generic design. As can be seen, while both perform satisfactorily, the designated LLM is superior, while BERT outperforms CLIP.

5.2. The Impact of Training Data

Our training data, described in Sec. 4.1, consists of three main data types: i) natural-image visual question answering (VQA); ii) natural-image captioning (CAP); and iii) document understanding (DOC). We turn to evaluate the contribution of each of them and report the results in Tab. 4. As can be seen, adding the CAP datasets to the VQA ones (second row) not only improves the captioning performance but also boosts the performance on the VQA ones. We attribute this to the enlargement and diversification of the training data. Moreover, incorporating DOC data, despite the significant change of domain (natural images vs. documents), increases the performance. We hypothesize that this is because QA-ViT maintains the original visual capabilities; it prevents the performance drop due to multi-domain data while leading to better OCR understanding. This, in turn, improves the overall results, as observed in [20].

6. Discussion and Conclusions

In this work, we introduced QA-ViT, an approach to condition the vision encoder in any multimodal vision-language architecture. Our method leads to question-aware visual features, improving their alignment with the provided query. Through extensive experimentation across a diverse set of vision-language models, we have demonstrated the effectiveness and versatility of our method. It consistently enhances the performance of these models across a range of benchmark tasks, encompassing both general and scene-text domains, as well as the challenging zero-shot setting. The introduction of QA-ViT represents a notable advancement in the pursuit of question-aware vision within VL modeling, making models more context-aware and enabling them to excel in various tasks. We hope our method will inspire further research striving towards improved text-aware mechanisms and designated pretraining techniques.
References

[1] Aviad Aberdam, Roy Ganz, Shai Mazor, and Ron Litman. Multimodal semi-supervised learning for text recognition. arXiv preprint arXiv:2205.03873, 2022.
[2] Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, and Ron Litman. Clipter: Looking at the bigger picture in scene text recognition. arXiv preprint arXiv:2301.07464, 2023.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[5] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
[6] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[7] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23nd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
[8] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019.
[9] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[11] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[12] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
[13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[14] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
[15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[18] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[19] Roy Ganz and Michael Elad. Clipag: Towards generator-free text-to-image generation. arXiv preprint arXiv:2306.16805, 2023.
[20] Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, and Ron Litman. Towards models that can see and read. arXiv preprint arXiv:2301.07389, 2023.
[21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[24] Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. arXiv preprint arXiv:2308.09936, 2023.
[25] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
[26] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
[27] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[28] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[29] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[32] Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, and R Manmatha. Scatter: selective context attentional scene text recognizer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11962–11972, 2020.
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[35] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[36] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. 2018.
[37] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[38] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
[39] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[40] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[42] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34:12116–12128, 2021.
[43] Noam Rotstein, David Bensaid, Shaked Brody, Roy Ganz, and Ron Kimmel. Fusecap: Leveraging large language models to fuse visual data into enriched image captions. arXiv preprint arXiv:2305.17718, 2023.
[44] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[45] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
[46] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
[47] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[50] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
[51] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[52] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8751–8761, 2021.
[53] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023.
[54] Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023.
Question Aware Vision Transformer for Multimodal Reasoning
Supplementary Material
A. Implementation Details

Overall Training Protocol. For all of the considered architectures, we follow the same general training procedure, in which we apply LoRa [23] to the LLM and finetune the projection module. When applying QA-ViT, we also finetune the instruction representation projection MLPs. In particular, we employ LoRa (α = 32, r = 16, dropout = 0.05, and the queries and keys as the target modules) and utilize an AdamW [36] optimizer (β1, β2 = 0.9, 0.999 and ϵ = 1e−08) with a cosine annealing scheduler [35] that decays to ×0.01 from the base learning rate. In addition, we perform 1000 warm-up steps. We use 8 Nvidia A100 (40G) GPUs in all of our experiments with bfloat16. Next, we provide the specific implementation details regarding ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5.
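A rough sketch of this shared recipe is given below, using the Hugging Face peft library for LoRA and standard PyTorch optimization. The exact module names targeted for the attention queries/keys depend on the specific LLM, and the base learning rate and warm-up factor shown are placeholders, not values from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR
from peft import LoraConfig, get_peft_model

def build_training_setup(llm, trainable_modules, total_steps, base_lr=1e-4):
    """Sketch of the shared protocol: LoRA on the LLM (alpha=32, r=16, dropout=0.05,
    queries/keys as targets), AdamW, 1000 warm-up steps, cosine decay to 0.01x."""
    lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q", "k"])   # module names vary per LLM (assumption)
    llm = get_peft_model(llm, lora_cfg)

    params = [p for m in [llm, *trainable_modules] for p in m.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=base_lr, betas=(0.9, 0.999), eps=1e-8)

    warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=1000)
    cosine = CosineAnnealingLR(optimizer, T_max=total_steps - 1000, eta_min=0.01 * base_lr)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])
    return llm, optimizer, scheduler
```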
ViT+T5. ViT+T5 comprises a CLIP [41] ViT-L vision encoder that operates at a 336 × 336 resolution, coupled with a FLAN-T5 encoder-decoder model [14] using an MLP projection module. The projection component consists of two linear layers that map from the ViT's dimension D1 into the LLM's one, D2 (D1 → D2 → D2). We train three variants of ViT+T5, which differ in the LLM scale, where we consider base, large, and xl. We use the LLM's encoder as the question encoder and train the models on our multi-task dataset (Sec. 4.1) for 5, 2, and 2 epochs, using a batch size per GPU of 16, 8, and 6, with a learning rate of 1e−4, 5e−5, and 1e−5, respectively. QA-ViT introduces 38M, 45M, and 66M trainable parameters out of the overall 589M, 1,132M, and 3,220M. In addition, when applying QA-ViT to a pretraining-free setup, we observe that using a higher learning rate (×100) for the projection module stabilizes the training. We hypothesize that while the vision encoder and LLM are pretrained separately, the projection module is randomly initialized, and thus, its weights should be adjusted more than the former counterparts.

BLIP2 and InstructBLIP. We experiment with both the xl and xxl models, and similar to ViT+T5, we use the LLM's encoder for processing the question before feeding it into QA-ViT. We use a single learning rate group for all the trainable parameters in all models. For the xl models, we train for 2 epochs, with a batch size of 8 per GPU and a base learning rate of 2e−5. For the xxl ones, we reduce the batch size to 4 per GPU. In addition, we employ a weight decay of 0.05 for all models.

LLaVA-1.5. As LLaVA-1.5 is based on a decoder-only LLM, we use the model's embedding module to process the questions when applying QA-ViT. We train for one epoch with an effective batch size of 4 per GPU (using 2-step gradient accumulation) and a base learning rate of 5e−5.

B. Multi-Task Training Dataset and Evaluation

As stated in Sec. 4.1, we utilize a multi-task dataset that contains multiple benchmarks of different tasks. In Tab. 6, we provide a detailed list of the training datasets alongside the evaluation metric and split used for reporting results throughout the paper.

C. Image Captioning Templates

For the VQA-based datasets, we simply utilize the provided question to guide QA-ViT. However, in the captioning case, this is infeasible. Thus, we use the captioning templates used in InstructBLIP [15] and provide them in Tab. 5 for completeness. These templates are sampled uniformly during training and inference.

Template
<image> "A short image caption:"
<image> "A short image description:"
<image> "A photo of"
<image> "An image that shows"
<image> "Write a short description for the image."
<image> "Write a description for the photo."
<image> "Provide a description of what is presented in the photo."
<image> "Briefly describe the content of the image."
<image> "Can you briefly explain what you see in the image?"
<image> "Could you use a few words to describe what you perceive in the photo?"
<image> "Please provide a short depiction of the picture."
<image> "Using language, provide a short account of the image."
<image> "Use a few words to illustrate what is happening in the picture."

Table 5. Captioning instruction templates. The instruction templates used for the captioning datasets. For VQA, we simply use the provided question.
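As a small illustrative sketch (not the authors' code) of how the per-sample instruction fed to QA-ViT could be chosen, combining the VQA and captioning cases described above:

```python
import random

# Templates taken from Table 5 (truncated here for brevity).
CAPTION_TEMPLATES = [
    "A short image caption:",
    "A short image description:",
    "Please provide a short depiction of the picture.",
    # ... remaining templates from Table 5
]

def instruction_for_sample(task: str, question: str = "") -> str:
    """VQA samples use their question; captioning samples draw a template uniformly."""
    if task == "vqa":
        return question
    return random.choice(CAPTION_TEMPLATES)
```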

D. Additional OCR Results

D.1. In-Depth Scene-Text Analysis

As explained in Sec. 4.5, we view the scene-text benchmarks as an interesting testing bed for our approach. To understand the contribution of QA-ViT to scene-text understanding, we follow the analysis of Ganz et al. [20] and decompose the results of VQA^T into two non-overlapping subsets: i) VQA^T_See∩Read is the manually curated subset which contains questions that require reasoning over OCR
and visual information simultaneously. We view this subset as the most challenging one. ii) VQA^T_Read is composed of questions that can be answered solely by using the OCR information. The union of these subsets constitutes the entire VQA^T validation set. We provide the results on these subsets in the middle section of Tab. 7. As can be seen, QA-ViT improves the results on VQA^T_Read for all the models. This highlights the ability of our method to better harness some of the overlooked OCR information. In addition, it leads to consistent improvements on VQA^T_See∩Read, which requires cross-modal reasoning over the OCR and visual cues.

Task | Dataset | Description | Eval split | Metric
Image Caption | COCO | Captioning of natural images | karpathy-test | CIDEr(↑)
Scene-Text Caption | TextCaps | Text-oriented captioning of natural images | validation | CIDEr(↑)
General VQA | VQAv2 | VQA on natural images | test-dev | vqa-score(↑)
General VQA | Visual Genome | VQA on natural images | - | -
Scene-Text VQA | VQA^T | Text-oriented VQA on natural images | validation | vqa-score(↑)
Scene-Text VQA | VQA^ST | Text-oriented VQA on natural images | test | ANLS(↑)
Scene-Text VQA | VQA^OCR | Text-oriented VQA on book covers | - | -
Documents Understanding | DocVQA | VQA on scanned documents | test | ANLS(↑)
Documents Understanding | InfoVQA | VQA on infographic images | test | ANLS(↑)
Documents Understanding | ChartQA | VQA on chart images | - | -

Table 6. Training datasets and evaluation. The datasets used for training alongside their evaluation split and metric, if applicable.

Method | LLM | VQA^T | VQA^T_Read | VQA^T_See∩Read | DocVQA | InfoVQA | Documents Average
ViT+T5-xl Flan-T5-xl 48.0 49.3 35.6 42.3 26.4 34.4
+ QA-ViT 50.3 51.8 36.2 44.2 27.1 35.7
∆ +2.3 +2.5 +0.6 +1.9 +0.7 +1.3
BLIP2 Flan-T5-xl 34.5 36.1 18.7 16.1 21.1 18.6
+ QA-ViT 36.6 38.3 20.4 17.1 21.2 19.2
∆ +2.1 +2.2 +1.7 +1.0 +0.1 +0.6
InstructBLIP Flan-T5-xl 36.2 37.9 19.3 17.3 19.9 18.6
+ QA-ViT 37.4 39.0 22.5 18.2 20.5 19.3
∆ +1.2 +1.1 +3.2 +0.9 +0.6 +0.7
LLaVA-1.5 Vicuna-7B 57.4 59.0 42.5 44.1 32.1 38.1
+ QA-ViT 59.1 60.7 43.5 45.4 32.1 38.8
∆ +1.7 +1.7 +1.0 +1.3 0.0 +0.7

Table 7. Additional OCR results. Results on document understanding and a comprehensive VQA^T analysis.

D.2. Documents Understanding

In this section, we present the performance results of both QA-ViT and the various baseline models in the context of document understanding, evaluated on DocVQA and InfoVQA, as detailed in the right section of Tab. 7. DocVQA encompasses questions related to dense-text scanned documents, while InfoVQA is designed for reasoning over infographics. Operating in these domains is highly challenging, as it constitutes a substantial domain shift for the CLIP vision encoder (from natural images to documents and infographics). Moreover, as CLIP is inherently limited in dense-text scenarios, the application of QA-ViT, which specifically targets existing visual features, is not anticipated to yield a significant performance boost in such settings. Despite these challenges, our results, while far from state-of-the-art levels, consistently demonstrate improvements over baseline performance. This underscores the effectiveness of our method in directing visual attention towards OCR information within the given constraints.

E. Additional Qualitative Results and Analysis

In Fig. 6, we extend the visualizations conducted in the main paper to focus on the alignment of the text queries and visual features and provide additional demonstrations:

• We provide attention visualizations at three levels of granularity within the ViT: (i) before the question fusing, (ii) immediately after it, and (iii) at the final layer. Illustrated in Fig. 6, in (i), the network's attention spans across the entire visual content, while in (ii) and (iii), it focuses
on fine-grained details according to the provided text. Specifically, the interaction of the text and vision throughout QA-ViT leads to more focused attention maps, as can be seen in the rightmost two columns.

• To better demonstrate the fine-grained interaction of text and vision in QA-ViT, we show the attention maps of the same image with respect to different text prompts (top two rows). This highlights QA-ViT's ability to shift the focus of the visual features based on the provided text.

• The bottom row contains additional visual-textual attention visualizations, indicating QA-ViT's text-based focus.

In addition, we provide a qualitative comparison between QA-ViT and the baseline in Fig. 7.

Figure 6. Elaborated interpretations of QA-ViT. Additional visual and textual features interaction demonstrations, including visualizations at different granularity levels within the ViT (before and after question fusing, at early and late layers), for prompts such as "Top blue sign", "Stop", "Bear nose", "Wheels", "Traffic lights", and "Shoes".
Figure 7. Additional qualitative results. Comparison between the baseline and our method on the VQA^T validation set using ViT+T5 (base, large, xl), BLIP2 and InstructBLIP (xxl), and LLaVA-1.5. Success and fail cases are presented on the left and right, respectively.
