Question Aware Vision Transformer for Multimodal Reasoning
Abstract

Vision-Language (VL) models have gained significant research focus, enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder, a Large Language Model (LLM), and a projection module that aligns visual features with the LLM's representation space. Despite their success, a critical limitation persists: the vision encoding process remains decoupled from user queries, often in the form of image-related questions. Consequently, the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this, we introduce QA-ViT, a Question Aware Vision Transformer approach for multimodal reasoning, which embeds question awareness directly within the vision encoder. This integration results in dynamic visual features that focus on the image aspects relevant to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures, leading to consistent improvement across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding.

[Figure 1: example image-question pairs and predictions, e.g., "What color is the pink bear's nose?" ('Pink' / 'Black') and "What is written on the top blue sign?" ('Park Square' / 'Park Ave').]
Figure 4. Paying attention to details in visual question answering. Representative examples that require answering questions about subtle or less conspicuous image details (zoomed-in), from the VQAv2 and TextVQA datasets. Each sample includes an image-question pair alongside predictions from ViT+T5 and QA-ViT+T5, where green indicates correct predictions and red indicates incorrect ones.
which leads to additional improvements, as can be seen in Sec. 5.2). In the captioning data, we utilize a random template instruction, as in [15], e.g., "Please provide a short depiction of the picture", and insert it into the ViT. We provide the complete list of such templates in the supplementary materials, alongside further details on the training dataset composition. Overall, our dataset comprises approximately 3 million assets from multiple training datasets of different sizes. We adopt a sampling strategy proportional to each dataset's size during training to address the size disparity. This approach is designed to prevent overfitting smaller datasets and underfitting larger ones.

4.2. QA-ViT Performance Gains

We evaluate QA-ViT on general (VQAv2 and COCO) and scene-text (VQAT, VQAST, and TextCaps) benchmarks, in addition to a zero-shot setting (VizWiz [7]). Additionally, we calculate average scores by assigning equal weight to the visual question answering and image captioning tasks.

ViT+T5 First, we examine a simple yet effective approach – a frozen CLIP¹ [41] and Flan-T5 [14] of different sizes (base, large, and xl), with an MLP projection module. We train the system on the data described in Sec. 4.1, using both the standard CLIP-ViT and QA-ViT, with the same training hyperparameters. In particular, we adapt the LLM weights using LoRa [23], train the projection MLP, and, in the QA-ViT case, also the instruction-fusing counterparts. Both the baseline and the QA-ViT settings exhibit high parameter efficiency, keeping the vast majority of the weights frozen. We report the quantitative results of ViT+T5 and compare them with QA-ViT in Table 1. As can be seen, QA-ViT leads to a substantial and consistent improvement over the baseline on all benchmarks and across all model sizes. Moreover, our method not only improves performance on the seen benchmarks but also yields benefits in a zero-shot setting on VizWiz [7].

¹https://huggingface.co/openai/clip-vit-large-patch14-336

To better understand the gains achieved by QA-ViT, we provide qualitative results of the ViT+T5-large model in Fig. 4. As seen, QA-ViT leads to better performance, specifically on image-question pairs that require reasoning over nuanced low-level details inside the image. For example, the image-question pair on the right requires focusing on the board, which is relatively small and marginal in importance compared to the entire image. Similar behavior is observed throughout all such examples.

State-of-the-art Models After validating the efficacy of QA-ViT in a pretraining-free setting, we turn to experiment with already-trained leading VL models. In this setting, we finetune the base model with and without QA-ViT using our training data introduced in Sec. 4.1. As in the ViT+T5 case, we employ a similar training setting by applying LoRa to the LLM and tuning the projection module and, if applicable, the QA-ViT components. Specifically, we consider BLIP2 [31] and InstructBLIP [15], using different sizes, and LLaVA-1.5 [33], top-performing multimodal architectures, and report the results in Tab. 1. As can be seen, QA-ViT consistently improves the baselines in all the tested architectures and across all the seen benchmarks, while also showing benefit on the unseen one (except for InstructBLIP).

4.3. QA-ViT Results Analysis

We turn to conduct a more in-depth analysis of the results provided in Tab. 1 to better understand the contributions of QA-ViT. Our method improves the performance of different architectures, highlighting the three-way model agnosticism of QA-ViT in terms of the vision encoder, projection module, and LLM.

• Vision Encoder – Despite BLIP2 and InstructBLIP utilizing a different vision encoder than LLaVA-1.5 (a 39-layer EVA-CLIP [18] at 224 × 224 resolution vs. a 24-layer CLIP ViT-L at 336 × 336 resolution), integrating QA-ViT leads to improved performance.
Method LLM VQAv2 COCO VQAT VQAST TextCaps VizWiz (0-shot) Avg. General Avg. Scene-Text
Metrics: vqa-score (VQAv2, VQAT, VizWiz), CIDEr (COCO, TextCaps), ANLS (VQAST).
ViT+T5-base Flan-T5-base 66.5 110.0 40.2 47.6 86.3 23.7 88.3 65.1
+ QA-ViT 71.7 114.9 45.0 51.1 96.1 23.9 93.3 72.1
∆ +5.2 +4.9 +4.8 +3.5 +9.8 +0.2 +5.0 +7.0
ViT+T5-large Flan-T5-large 70.0 114.3 44.7 50.6 96.0 24.6 92.2 71.8
+ QA-ViT 72.0 118.7 48.7 54.4 106.2 26.0 95.4 78.9
∆ +2.0 +4.4 +4.0 +3.8 +10.2 +1.4 +3.2 +7.1
ViT+T5-xl Flan-T5-xl 72.7 115.5 48.0 52.7 103.5 27.0 94.1 77.0
+ QA-ViT 73.5 116.5 50.3 54.9 108.2 28.3 95.0 80.4
∆ +0.8 +1.0 +2.3 +2.2 +4.7 +1.3 +0.9 +3.4
BLIP2 [31] Flan-T5-xl 72.5 134.8 34.5 36.4 93.6 28.2 103.7 64.5
+ QA-ViT 74.6 136.6 36.6 38.1 97.4 28.4 105.6 67.4
∆ +2.1 +1.8 +2.1 +1.7 +3.8 +0.2 +1.9 +2.9
BLIP2 [31] Flan-T5-xxl 74.8 134.8 36.5 37.9 97.4 29.8 104.8 67.3
+ QA-ViT 75.6 135.9 37.5 39.9 98.7 30.4 105.8 68.7
∆ +0.8 +1.1 +1.0 +2.0 +1.3 +0.6 +1.0 +1.4
InstructBLIP [15] Flan-T5-xl 75.7 135.9 36.2 38.1 98.2 28.9 105.8 67.7
+ QA-ViT 76.0 136.9 37.4 39.4 99.9 28.8 106.5 69.2
∆ +0.3 +1.0 +1.2 +1.3 +1.7 -0.1 +0.7 +1.5
InstructBLIP [15] Flan-T5-xxl 76.1 136.1 37.4 38.7 99.0 31.1 106.1 68.5
+ QA-ViT 76.5 138.2 38.4 40.0 101.7 30.7 107.4 70.5
∆ +0.4 +2.1 +1.0 +1.3 +2.7 -0.4 +1.3 +2.0
LLaVA-1.5 [33] Vicuna-7B 79.7 133.5 57.4 61.6 126.4 33.9 106.6 93.0
+ QA-ViT 80.5 134.7 59.1 62.4 128.7 36.5 107.6 94.7
∆ +0.8 +1.2 +1.7 +0.8 +2.3 +2.6 +1.0 +1.7
Table 1. QA-ViT results. Quantitative comparison of QA-ViT integrated into ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5, using differ-
ent model sizes, with these baselines trained on the data described in Sec. 4.1. The evaluation covers general and scene-text VL benchmarks
and 0-shot capabilities. QA-ViT consistently outperforms the different baselines, demonstrating its effectiveness and versatility.
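The General and Scene-Text Average columns weight question answering and captioning equally, as described in Sec. 4.2. The helper below reproduces the reported numbers; the function name and rounding are ours, not code from the paper.

```python
def table1_averages(vqav2, coco, vqa_t, vqa_st, textcaps):
    # General: equal weight to natural-image VQA (VQAv2) and captioning (COCO).
    general = (vqav2 + coco) / 2
    # Scene-Text: equal weight to scene-text VQA (mean of VQA^T, VQA^ST) and captioning (TextCaps).
    scene_text = ((vqa_t + vqa_st) / 2 + textcaps) / 2
    return general, scene_text

# ViT+T5-base row: returns (88.25, 65.1), reported in Tab. 1 as 88.3 / 65.1.
print(table1_averages(66.5, 110.0, 40.2, 47.6, 86.3))
```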
• Projection Module – On the one hand, BLIP2 and InstructBLIP use a QFormer, a transformer-based architecture with learnable tokens that processes the visual features and also reduces their sequence length. On the other hand, LLaVA-1.5 and ViT+T5 utilize a simple MLP that operates separately on the visual features. Despite this crucial difference, our method is compatible with both, leading to consistent gains.

• LLM Architecture – We experiment with both encoder-decoder (Flan-T5 [14]) and decoder-only (Vicuna [13]) LLMs. In the encoder-decoder case, we encode the textual guidance using the preexisting encoder, and in the decoder-only case, we utilize the model's embedding module. We provide a comparison between these two alternatives in Sec. 5.1. Our experiments show that despite the significant LLM architecture differences, QA-ViT is compatible with both, showcasing its versatility.

Next, we examine the effects of scale-up on our approach by comparing the results of different model sizes. In particular, we consider base, large, and xl for ViT+T5, and xl and xxl for BLIP2 and InstructBLIP. Our quantitative analysis demonstrates that our approach leads to consistent improvement across all model scales, making it compatible with different LLM sizes. Remarkably, for a given LLM size, applying QA-ViT is more beneficial than scaling up in terms of average general and scene-text performance. For example, InstructBLIP-xl + QA-ViT leads to 106.5 and 69.2 (general and scene-text averages), compared to InstructBLIP-xxl with 106.1 and 68.5 – an improvement of +0.4 and +0.7 over the scale-up. Based on these results, we conduct a more thorough analysis of our method's contribution in Sec. 4.5.

Lastly, we focus on InstructBLIP, as it utilizes an instruction-aware QFormer. In particular, this component processes the visual features with respect to the provided text, which conceptually resembles QA-ViT. Thus, one might presume that utilizing such a model would make QA-ViT's contribution redundant. However, it is fundamentally different, as our method is integrated inside the ViT and not on top of it. Hence, the QFormer cannot compensate for information disregarded in the output features of the ViT. On the contrary, QA-ViT, being integrated into the ViT layers, can emphasize the relevant features and prevent their loss.
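To make the notion of in-ViT integration concrete, the sketch below shows one way question embeddings can be fused into the top layers of a frozen ViT through a gated cross-attention block whose gate is initialized to zero, so the original visual features are untouched at the start of training. This is an illustrative simplification rather than the authors' exact implementation, and all module and argument names (QuestionFusionBlock, fuse_last, etc.) are ours.

```python
import torch
import torch.nn as nn

class QuestionFusionBlock(nn.Module):
    """Gated cross-attention from visual tokens to question tokens (illustrative)."""

    def __init__(self, vis_dim: int, txt_dim: int, num_heads: int = 8):
        super().__init__()
        self.proj_txt = nn.Linear(txt_dim, vis_dim)   # map question features to the ViT width
        self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))      # zero-init gate: no change to the frozen features at step 0

    def forward(self, vis_tokens, question_tokens):
        ctx = self.proj_txt(question_tokens)
        attended, _ = self.attn(query=vis_tokens, key=ctx, value=ctx)
        return vis_tokens + torch.tanh(self.gate) * attended  # gated residual update


class QuestionAwareViT(nn.Module):
    """Wrap frozen ViT blocks and fuse question features into the top `fuse_last` layers."""

    def __init__(self, vit_blocks: nn.ModuleList, vis_dim: int, txt_dim: int, fuse_last: int = 6):
        super().__init__()
        self.blocks = vit_blocks
        for p in self.blocks.parameters():            # the vision encoder itself stays frozen
            p.requires_grad_(False)
        self.fuse_from = len(vit_blocks) - fuse_last
        self.fusers = nn.ModuleList(
            [QuestionFusionBlock(vis_dim, txt_dim) for _ in range(fuse_last)]
        )

    def forward(self, vis_tokens, question_tokens):
        for i, block in enumerate(self.blocks):
            vis_tokens = block(vis_tokens)
            if i >= self.fuse_from:                   # "late" fusion: only the top layers see the question
                vis_tokens = self.fusers[i - self.fuse_from](vis_tokens, question_tokens)
        return vis_tokens
```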
Method VQAv2 VQAT TextCaps VizWiz
mPLUG-DocOwl [53] - 52.6∗ 111.9∗ -
BLIP2 [31] 65.0 23.4 70.4 29.4
InstructBLIP [15] - 30.9 75.6∗ 30.9
InstructBLIP+OCR [15] - 46.6 126.0∗ 30.9
OpenFlamingo-9B [5] 50.3 24.2 - 17.7
IDEFICS-9B [26] 50.9 25.9 25.4 35.5
IDEFICS-80B [26] 60.0 30.9 56.8 36.0
Shikra [9] 77.4∗ - - -
Qwen-VL [6] 79.5∗ 63.8∗ - 35.2
LLaVA-1.5 [33] 79.7∗ 57.4∗ 126.4∗ 33.9
+ QA-ViT 80.5∗ 59.1∗ 128.7∗ 36.5
∆ +0.8 +1.7 +2.3 +2.6

Table 2. Comparison to generalist models. Results comparison of QA-ViT integrated into LLaVA-1.5 with top-performing generalist models on VQA and captioning. QA-ViT outperforms existing methods on VQAv2, TextCaps, and VizWiz. Models marked with +OCR receive a list of OCR tokens, and scores noted with ∗ signify that the dataset's training images are observed in training.

Figure 5. QA-ViT effectiveness analysis. Comparison of the trends in error rate reduction of QA-ViT on VQAT and VQAv2 as the language model is scaled up. The relative performance improvements of our approach are more consistent across model scales in the former. These trends are attributed to the composition of question types in each dataset, where VQAT exhibits more questions focusing on non-salient and overlooked elements.
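Figure 5 tracks error-rate reduction rather than raw score deltas. The exact formula is not restated in the text, so the sketch below uses the standard relative definition as an assumption.

```python
def error_rate_reduction(baseline_acc: float, qa_vit_acc: float) -> float:
    # Accuracies in percent; the error rate is 100 - accuracy.
    base_err, new_err = 100.0 - baseline_acc, 100.0 - qa_vit_acc
    return 100.0 * (base_err - new_err) / base_err

# ViT+T5-base on VQAv2 (Tab. 1): 66.5 -> 71.7, i.e. ~15.5% of the remaining errors are removed.
print(error_rate_reduction(66.5, 71.7))
```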
Inst. Fuse Freeze VQAv2 VQAT
✗ ✗ ✓ 70.0 44.7
P.T. late ✓ 70.1 (+0.1%) 45.8 (+1.1%)
✗ ✗ ✗ 69.5 (-0.5%) 44.9 (+0.2%)
Enc. early ✓ 67.9 (-2.1%) 41.7 (-3.0%)
Enc. sparse ✓ 70.7 (+0.7%) 46.6 (+1.9%)
Enc. all ✓ 69.5 (-0.5%) 45.9 (+1.2%)
Emb. late ✓ 71.0 (+1.0%) 47.5 (+2.8%)
BERT late ✓ 71.8 (+1.8%) 48.3 (+3.6%)
CLIP late ✓ 71.8 (+1.8%) 48.0 (+3.3%)
Enc. late ✓ 72.0 (+2.0%) 48.7 (+4.0%)

Table 3. Design choices ablation. We mark the baseline and our top-performing configuration of QA-ViT in grey and yellow, respectively. Top: Results of different finetuning strategies. Middle: The effect of different integration points of QA-ViT. Bottom: Comparison of different instruction (Inst.) encodings.

a significant amount of trainable parameters. The results in the top part of Tab. 3 show that while QA-ViT leads to +2.0% and +4.0% on VQAv2 and VQAT, P.T. improves by only +0.1% and +1.1%, respectively. Comparing QA-ViT results with P.T. enables decomposing our method's improvement into gains attributed to additional capacity and to question-aware visual features, implying that the latter is the most significant. In addition, fully finetuning CLIP, which introduces training instability, improves the baseline on VQAT but reduces it on VQAv2. This supports the choice of current VL works to freeze the ViT during pretraining.

Integration Point We explore different fusing locations – early (bottom layers), late (top layers), sparse (every 2 layers), and all (every layer). While early, sparse, and late add the same amount of trainable parameters, all doubles it. The results presented in the middle part of Tab. 3 demonstrate the significant advantage of late fusion. We attribute this to the hierarchical structure of the ViT's layers, in which early layers specialize in capturing low-level and localized visual details, while higher ones focus on extracting more abstract and high-level visual features. Thus, since disregarding question-related image aspects is more likely to occur in the higher layers, QA-ViT is most effective with late fusion. Moreover, as the early layers extract low-level details, they should not be modified, and applying QA-ViT to them impairs the results.

Question Representation As specified in Sec. 3, we use the preexisting LLM's encoder (Enc.) to obtain the question representation. Here, we study the effect of different such choices and present their results at the bottom of Tab. 3. First, utilizing solely the embeddings (Emb.) is less effective than the encoder. We attribute this to the improved contextual understanding of the latter, enabling better guidance of the visual features in QA-ViT. Next, we experiment with using a designated language model, considering both a BERT [16] and the corresponding CLIP text encoder. While utilizing the system's language model is more parameter efficient and can lead to more seamless integration, a dedicated language model can better align with the vision model and offer a more modular and generic design. As can be seen, while both perform satisfactorily, the designated LLM is superior, while BERT outperforms CLIP.

Datasets Size VQAv2 VQAT COCO TextCaps
VQA 2.3M 71.2 45.8 29.9 34.3
+ CAP 3.0M 71.5 47.4 117.5 106.1
+ DOC 3.1M 72.0 48.7 118.7 106.2

Table 4. Training data ablation. Contribution analysis of different training dataset compositions on visual question answering and captioning, demonstrating the importance of multi-task data.

5.2. The Impact of Training Data

Our training data, described in Sec. 4.1, consists of three main data types: i) natural image visual question answering (VQA); ii) natural image captioning (CAP); and iii) document understanding (DOC). We turn to evaluate the contribution of each of them and report the results in Tab. 4. As can be seen, adding the CAP datasets to the VQA ones (second row) not only improves the captioning performance but also boosts the performance on the VQA ones. We attribute this to the enlargement and diversification of the training data. Moreover, incorporating DOC data, despite the significant change of domain (natural images vs. documents), increases the performance. We hypothesize that this is because QA-ViT maintains the original visual capabilities; it prevents the performance drop due to multi-domain data while leading to better OCR understanding. This, in return, improves the overall results, as observed in [20].

6. Discussion and Conclusions

In this work, we introduced QA-ViT, an approach to condition the vision encoder in any multimodal vision-language architecture. Our method leads to question-aware visual features, improving their alignment with the provided query. Through extensive experimentation across a diverse set of vision-language models, we have demonstrated the effectiveness and versatility of our method. It consistently enhances the performance of these models across a range of benchmark tasks, encompassing both general and scene-text domains, as well as the challenging zero-shot setting. The introduction of QA-ViT represents a notable advancement in the pursuit of question-aware vision within VL modeling, making models more context-aware and enabling them to excel in various tasks. We hope our method will inspire further research striving towards improved text-aware mechanisms and designated pretraining techniques.
References

[1] Aviad Aberdam, Roy Ganz, Shai Mazor, and Ron Litman. Multimodal semi-supervised learning for text recognition. arXiv preprint arXiv:2205.03873, 2022.
[2] Aviad Aberdam, David Bensaïd, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, and Ron Litman. Clipter: Looking at the bigger picture in scene text recognition. arXiv preprint arXiv:2301.07464, 2023.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
[4] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015.
[5] Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
[6] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
[7] Jeffrey P Bigham, Chandrika Jayant, Hanjie Ji, Greg Little, Andrew Miller, Robert C Miller, Robin Miller, Aubrey Tatarowicz, Brandyn White, Samual White, et al. Vizwiz: nearly real-time answers to visual questions. In Proceedings of the 23rd annual ACM symposium on User interface software and technology, pages 333–342, 2010.
[8] Ali Furkan Biten, Ruben Tito, Andres Mafla, Lluis Gomez, Marçal Rusinol, Ernest Valveny, CV Jawahar, and Dimosthenis Karatzas. Scene text visual question answering. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4291–4301, 2019.
[9] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
[10] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[11] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325, 2015.
[12] Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, et al. Pali-3 vision language models: Smaller, faster, stronger. arXiv preprint arXiv:2310.09199, 2023.
[13] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023.
[14] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
[15] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[17] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
[18] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
[19] Roy Ganz and Michael Elad. Clipag: Towards generator-free text-to-image generation. arXiv preprint arXiv:2306.16805, 2023.
[20] Roy Ganz, Oren Nuriel, Aviad Aberdam, Yair Kittenplon, Shai Mazor, and Ron Litman. Towards models that can see and read. arXiv preprint arXiv:2301.07389, 2023.
[21] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[23] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[24] Wenbo Hu, Yifan Xu, Y Li, W Li, Z Chen, and Z Tu. Bliva: A simple multimodal llm for better handling of text-rich visual questions. arXiv preprint arXiv:2308.09936, 2023.
[25] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
[26] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
[27] Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.
[28] Chenliang Li, Haiyang Xu, Junfeng Tian, Wei Wang, Ming Yan, Bin Bi, Jiabo Ye, Hehong Chen, Guohai Xu, Zheng Cao, et al. mplug: Effective and efficient vision-language learning by cross-modal skip-connections. arXiv preprint arXiv:2205.12005, 2022.
[29] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
[30] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.
[31] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
[32] Ron Litman, Oron Anschel, Shahar Tsiper, Roee Litman, Shai Mazor, and R Manmatha. Scatter: selective context attentional scene text recognizer. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11962–11972, 2020.
[33] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[34] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
[35] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[36] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. 2018.
[37] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244, 2022.
[38] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021.
[39] Minesh Mathew, Viraj Bagal, Rubèn Tito, Dimosthenis Karatzas, Ernest Valveny, and CV Jawahar. Infographicvqa. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1697–1706, 2022.
[40] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE, 2019.
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
[42] Maithra Raghu, Thomas Unterthiner, Simon Kornblith, Chiyuan Zhang, and Alexey Dosovitskiy. Do vision transformers see like convolutional neural networks? Advances in Neural Information Processing Systems, 34:12116–12128, 2021.
[43] Noam Rotstein, David Bensaid, Shaked Brody, Roy Ganz, and Ron Kimmel. Fusecap: Leveraging large language models to fuse visual data into enriched image captions. arXiv preprint arXiv:2305.17718, 2023.
[44] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.
[45] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer, 2020.
[46] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
[47] Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Stanford alpaca: An instruction-following llama model, 2023.
[48] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[49] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[50] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100, 2022.
[51] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
[52] Zhengyuan Yang, Yijuan Lu, Jianfeng Wang, Xi Yin, Dinei Florencio, Lijuan Wang, Cha Zhang, Lei Zhang, and Jiebo Luo. Tap: Text-aware pre-training for text-vqa and text-caption. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8751–8761, 2021.
[53] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023.
[54] Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023.
Question Aware Vision Transformer for Multimodal Reasoning
Supplementary Material
A. Implementation Details

Overall Training Protocol For all of the considered architectures, we follow the same general training procedure, in which we apply LoRa [23] to the LLM and finetune the projection module. When applying QA-ViT, we also finetune the instruction representation projection MLPs. In particular, we employ LoRa (α=32, r=16, dropout=0.05, with the queries and keys as the target modules) and utilize an AdamW [36] optimizer (β1, β2 = 0.9, 0.999 and ϵ = 1e−08) with a cosine annealing scheduler [35] that decays to ×0.01 of the base learning rate. In addition, we perform 1000 warm-up steps. We use 8 Nvidia A100 (40G) GPUs in all of our experiments with bfloat16. Next, we provide the specific implementation details regarding ViT+T5, BLIP2, InstructBLIP, and LLaVA-1.5.
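A compact sketch of this protocol, assuming a HuggingFace peft LoRa adapter and a plain PyTorch loop: `model`, `base_lr`, and `num_training_steps` are assumed to exist, and `model.llm` is a hypothetical attribute holding the language model. The hyperparameter values are the ones listed above; everything else is illustrative.

```python
import math
import torch
from peft import LoraConfig, get_peft_model

# LoRa on the LLM's query/key projections (module names differ between T5- and LLaMA-style models).
lora_cfg = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, target_modules=["q", "k"])
model.llm = get_peft_model(model.llm, lora_cfg)

# Only LoRa adapters, the projection module, and (for QA-ViT) the instruction projections are trainable.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=base_lr, betas=(0.9, 0.999), eps=1e-8)

def lr_lambda(step: int) -> float:
    # 1000 warm-up steps, then cosine annealing down to 1% of the base learning rate.
    if step < 1000:
        return step / 1000
    progress = (step - 1000) / max(1, num_training_steps - 1000)
    return 0.01 + 0.99 * 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```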
ViT+T5 ViT+T5 is comprised of a CLIP [41] ViT-L vision encoder that operates at a 336 × 336 resolution, coupled with a Flan-T5 encoder-decoder model [14] using an MLP projection module. The projection component consists of two linear layers that map from the ViT's dimension D1 into the LLM's one, D2 (D1 → D2 → D2). We train three variants of ViT+T5, which differ in the LLM scale, where we consider base, large, and xl. We use the LLM's encoder as the question encoder and train the models on our multi-task dataset (Sec. 4.1) for 5, 2, and 2 epochs, using a batch size per GPU of 16, 8, and 6, with a learning rate of 1e−4, 5e−5, and 1e−5, respectively. QA-ViT introduces 38M, 45M, and 66M trainable parameters out of the overall 589M, 1,132M, and 3,220M. In addition, when applying QA-ViT in a pretraining-free setup, we observe that using a higher learning rate (×100) for the projection module stabilizes the training. We hypothesize that while the vision encoder and LLM are pretrained separately, the projection module is randomly initialized, and thus, its weights should be adjusted more than the former counterparts.

LLaVA-1.5 As LLaVA-1.5 is based on a decoder-only LLM, we use the model's embedding module to process the questions when applying QA-ViT. We train for one epoch with an effective batch size of 4 per GPU (using 2-step gradient accumulation) and a base learning rate of 5e−5.

Template
<image> "A short image caption:"
<image> "A short image description:"
<image> "A photo of"
<image> "An image that shows"
<image> "Write a short description for the image."
<image> "Write a description for the photo."
<image> "Provide a description of what is presented in the photo."
<image> "Briefly describe the content of the image."
<image> "Can you briefly explain what you see in the image?"
<image> "Could you use a few words to describe what you perceive in the photo?"
<image> "Please provide a short depiction of the picture."
<image> "Using language, provide a short account of the image."
<image> "Use a few words to illustrate what is happening in the picture."

Table 5. Captioning instruction templates. The instruction templates used for the captioning datasets. For VQA, we simply use the provided question.

B. Multi-Task Training Dataset and Evaluation

As stated in Sec. 4.1, we utilize a multi-task dataset that contains multiple benchmarks of different tasks. In Tab. 6, we provide a detailed list of the training datasets and the evaluation metric and split used for reporting results throughout the paper.

C. Image Captioning Templates

For the VQA-based datasets, we simply utilize the provided question to guide QA-ViT. However, in the captioning case, this is infeasible. Thus, we use the captioning templates used in InstructBLIP [15] and provide them in Tab. 5 for completeness. These templates are sampled uniformly during training and inference.
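As a small illustration of the sampling described above (the field names and the helper are hypothetical, and the template list is abridged):

```python
import random

# Templates from Tab. 5 (abridged); the <image> placeholder marks where visual tokens are inserted.
CAPTION_TEMPLATES = [
    "A short image caption:",
    "A photo of",
    "Briefly describe the content of the image.",
    "Please provide a short depiction of the picture.",
    # ... remaining entries of Tab. 5
]

def instruction_for(sample: dict) -> str:
    # VQA samples guide QA-ViT with the question itself; captioning samples draw a template uniformly.
    if sample["task"] == "vqa":
        return sample["question"]
    return random.choice(CAPTION_TEMPLATES)
```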
Task Dataset Description Eval split Metric
Image Caption COCO Captioning of natural images karpathy-test CIDEr(↑)
Scene-Text Caption TextCaps Text-oriented captioning of natural images validation CIDEr(↑)
General VQA VQAv2 VQA on natural images test-dev vqa-score(↑)
General VQA Visual Genome VQA on natural images - -
Scene-Text VQA VQAT Text-oriented VQA on natural images validation vqa-score(↑)
Scene-Text VQA VQAST Text-oriented VQA on natural images test ANLS(↑)
Scene-Text VQA VQAOCR Text-oriented VQA on book covers - -
Documents Understanding DocVQA VQA on scanned documents test ANLS(↑)
Documents Understanding InfoVQA VQA on infographic images test ANLS(↑)
Documents Understanding ChartQA VQA on chart images - -
Table 6. Training datasets and evaluation. The datasets used for training alongside their evaluation split and metric, if applicable.
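For reference, the ANLS metric used for VQAST, DocVQA, and InfoVQA above is the average normalized Levenshtein similarity with a 0.5 threshold. The sketch below follows the standard definition rather than any code from this work.

```python
def anls(predictions, answer_sets, tau=0.5):
    """Average Normalized Levenshtein Similarity with threshold tau (standard definition)."""
    def levenshtein(a, b):
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def similarity(pred, gt):
        pred, gt = pred.strip().lower(), gt.strip().lower()
        if not pred and not gt:
            return 1.0
        return 1.0 - levenshtein(pred, gt) / max(len(pred), len(gt))

    scores = []
    for pred, answers in zip(predictions, answer_sets):
        best = max(similarity(pred, gt) for gt in answers)
        scores.append(best if best >= tau else 0.0)  # scores below the threshold count as 0
    return sum(scores) / len(scores)
```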
Method LLM VQAT VQAT-Read VQAT-See∩Read DocVQA InfoVQA Avg. (Documents)
ViT+T5-xl Flan-T5-xl 48.0 49.3 35.6 42.3 26.4 34.4
+ QA-ViT 50.3 51.8 36.2 44.2 27.1 35.7
∆ +2.3 +2.5 +0.6 +1.9 +0.7 +1.3
BLIP2 Flan-T5-xl 34.5 36.1 18.7 16.1 21.1 18.6
+ QA-ViT 36.6 38.3 20.4 17.1 21.2 19.2
∆ +2.1 +2.2 +1.7 +1.0 +0.1 +0.6
InstructBLIP Flan-T5-xl 36.2 37.9 19.3 17.3 19.9 18.6
+ QA-ViT 37.4 39.0 22.5 18.2 20.5 19.3
∆ +1.2 +1.1 +3.2 +0.9 +0.6 +0.7
LLaVa-1.5 Vicuna-7B 57.4 59.0 42.5 44.1 32.1 38.1
+ QA-ViT 59.1 60.7 43.5 45.4 32.1 38.8
∆ +1.7 +1.7 +1.0 +1.3 0.0 +0.7
Table 7. Additional OCR Results. Results on documents understanding and comprehensive VQAT analysis.
and visual information simultaneously. We view this subset as the most challenging one. ii) VQAT-Read is composed of questions that can be answered solely by using the OCR information. The unification of these subsets results in the entire VQAT validation set. We provide the results on these subsets in the middle section of Tab. 7. As can be seen, QA-ViT improves the results on VQAT-Read in all the models. This highlights the ability of our method to better harness some of the overlooked OCR information. In addition, it leads to consistent improvements on VQAT-See∩Read, which requires cross-modal reasoning over the OCR and visual cues.

vision encoder (from natural images to documents and infographics). Moreover, as CLIP is inherently limited in dense-text scenarios, the application of QA-ViT, which specifically targets existing visual features, is not anticipated to yield a significant performance boost in such settings. Despite these challenges, our results, while far from state-of-the-art levels, consistently demonstrate improvements over baseline performance. This underscores the effectiveness of our method in directing visual attention towards OCR information within the given constraints.
[Figure 6: "Before" vs. "After Q. Fusing" visualizations at early and late layers, shown for the examples "Top blue sign", "Stop", and "Bear nose".]
Figure 7. Additional qualitative results. Comparison between the baseline and our method on the VQAT validation set using ViT+T5 (base, large, xl), BLIP2 and InstructBLIP (xxl), and LLaVA-1.5. Success and failure cases are presented on the left and right, respectively.