
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

Wenbo Hu*1, Yifan Xu*2, Yi Li1, Weiyue Li1, Zeyuan Chen1, Zhuowen Tu1
1UC San Diego   2Coinbase Global, Inc.
{w1hu, yil115, wel019, zec016, ztu}@ucsd.edu, [email protected]
arXiv:2308.09936v3 [cs.CV] 18 Dec 2023

Abstract

Vision Language Models (VLMs), which extend Large Language Models (LLMs) by incorporating visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. However, these models cannot accurately interpret images infused with text, a common occurrence in real-world scenarios. Standard procedures for extracting information from images often involve learning a fixed set of query embeddings. These embeddings are designed to encapsulate image contexts and are later used as soft prompt inputs in LLMs. Yet, this process is limited by the token count, potentially curtailing the recognition of scenes with text-rich context. To improve upon them, the present study introduces BLIVA: an augmented version of InstructBLIP with Visual Assistant. BLIVA incorporates the query embeddings from InstructBLIP and also directly projects encoded patch embeddings into the LLM, a technique inspired by LLaVA. This approach assists the model in capturing intricate details potentially missed during the query decoding process. Empirical evidence demonstrates that our model, BLIVA, significantly enhances performance on text-rich VQA benchmarks (up to 17.76% on the OCR-VQA benchmark) and on general (not particularly text-rich) VQA benchmarks (up to 7.9% on the Visual Spatial Reasoning benchmark), and achieves a 17.72% overall improvement on a comprehensive multimodal LLM benchmark (MME) compared to our baseline, InstructBLIP. BLIVA demonstrates significant capability in decoding real-world images, irrespective of text presence. To demonstrate the broad industry applications enabled by BLIVA, we evaluate the model using a new dataset comprising YouTube thumbnails paired with question-answer sets across 11 diverse categories. For researchers interested in further exploration, our code and models are freely accessible at https://github.com/mlpc-ucsd/BLIVA.

Introduction

Recently, Large Language Models (LLMs) have transformed the field of natural language understanding, exhibiting impressive capabilities in generalizing across a broad array of tasks, both in zero-shot and few-shot settings. This success is mainly contributed by instruction tuning (Wu et al. 2023), which improves generalization to unseen tasks by framing various tasks into instructions. Vision Language Models (VLMs) such as OpenAI's GPT-4 (OpenAI 2023), which incorporate an LLM with visual understanding capability, have demonstrated significant advancements in addressing open-ended visual question-answering (VQA) tasks. Several approaches have been proposed for employing LLMs on vision-related tasks, either by directly aligning with a visual encoder's patch features (Liu et al. 2023a) or by extracting image information through a fixed number of query embeddings (Li et al. 2023b; Zhu et al. 2023).

However, despite exhibiting considerable abilities for image-based human-agent interactions, these models struggle with interpreting text within images. Images with text are pervasive in our daily lives, and comprehending such content is essential for human visual perception. Previous works utilized an abstraction module with queried embeddings, limiting their capability to capture textual details within images (Li et al. 2023b; Awadalla et al. 2023; Ye et al. 2023).

In our work, we employ learned query embeddings together with an additional visual assistant branch that utilizes encoded patch embeddings. This approach addresses the constraint on the image information typically provided to language models, leading to improved text-image visual perception and understanding. Empirically, we report the results of our model on general (not particularly text-rich) VQA benchmarks following the evaluation datasets of (Dai et al. 2023) and the text-rich image evaluation protocol from (Liu et al. 2023b). Our model is initialized from a pre-trained InstructBLIP and an encoded patch projection layer trained from scratch. Following (Zhu et al. 2023; Liu et al. 2023a), we further demonstrate a two-stage training paradigm. We begin by pre-training the patch embedding projection layer. Subsequently, with the instruction tuning data, we fine-tune both the Q-Former and the patch embedding projection layer. During this phase, we maintain both the image encoder and the LLM in a frozen state. We adopt this approach based on two findings from our experiments: firstly, unfreezing the vision encoder results in catastrophic forgetting of prior knowledge; secondly, training the LLM concurrently did not bring improvement but brought significant training complexity.

In summary, our study consists of the following highlights:
• We present BLIVA, which leverages both learned query embeddings and encoded patch embeddings, providing an effective method for interpreting text within images.
• Our experimental results showcase that BLIVA provides improvements in the understanding of text within images while maintaining robust performance on general (not particularly text-rich) VQA benchmarks, and achieves the best performance on the MME benchmark among previous methods.
• To underscore the real-world applicability of BLIVA, we evaluate the model using a new dataset of YouTube thumbnails with associated question-answer pairs.

* These authors contributed equally.
[Figure 1 diagram: four architectures shown side by side, (a) Flamingo, (b) BLIP-2 / InstructBLIP, (c) LLaVA, and (d) BLIVA (Ours), each taking an image through a vision encoder and a question into an LLM. Legend: 🔥 = trained from scratch or finetuned; ❄ = pretrained and frozen.]

Figure 1: Comparison of various VLM approaches. Both the (a) Flamingo (Alayrac et al. 2022) and (b) BLIP-2 / InstructBLIP (Li et al. 2023b; Dai et al. 2023) architectures utilize a fixed, small set of query embeddings. These are used to compress visual information for transfer to the LLM. In contrast, (c) LLaVA aligns the encoded patch embeddings directly with the LLM. (d) BLIVA (Ours) builds upon these methods by merging learned query embeddings with additional encoded patch embeddings.

Related Work

Multimodal Large Language Model
Large Language Models (LLMs) have demonstrated impressive zero-shot abilities across various open-ended tasks. Recent research has explored the application of LLMs for multimodal generation to understand visual inputs. Some approaches leverage a pre-trained LLM to build unified models for multi-modality. For example, Flamingo (Alayrac et al. 2022) connects the vision encoder and LLM with a Perceiver Resampler, which exhibits impressive few-shot performance. Additionally, BLIP-2 (Li et al. 2023b) designs a Q-Former to align the visual features with OPT (Zhang et al. 2022) and FLAN-T5 (Wei et al. 2021). MiniGPT-4 (Zhu et al. 2023) employed the same Q-Former but changed the LLM to Vicuna (Zheng et al. 2023). Some approaches also finetuned the LLM for better alignment with visual features: LLaVA (Liu et al. 2023a) directly finetuned the LLM, and mPLUG-Owl (Ye et al. 2023) performs low-rank adaptation (LoRA) (Hu et al. 2022) to finetune a LLaMA model (Touvron et al. 2023). PandaGPT (Su et al. 2023) also employed LoRA to finetune a Vicuna model on top of ImageBind (Girdhar et al. 2023), which can take multimodal inputs besides visual. While sharing the same two-stage training paradigm, we focus on developing an end-to-end multimodal model for both text-rich VQA benchmarks and general VQA benchmarks.

Multimodal instruction tuning
Instruction tuning has been shown to improve the generalization performance of language models to unseen tasks. In the natural language processing (NLP) community, some approaches collect instruction-tuning data by converting existing NLP datasets into instruction format (Wang et al. 2022b; Wei et al. 2021; Sanh et al. 2022; Chung et al. 2022), while others use LLMs to generate instruction data (Taori et al. 2023; Zheng et al. 2023; Wang et al. 2023; Honovich et al. 2022). Recent research expanded instruction tuning to multimodal settings. In particular, for image-based instruction tuning, MiniGPT-4 (Zhu et al. 2023) employs human-curated instruction data during the finetuning stage. LLaVA (Liu et al. 2023a) generates 156K multimodal instruction-following examples by prompting GPT-4 (OpenAI 2023) with image captions and bounding box coordinates. mPLUG-Owl (Ye et al. 2023) also employs 400K mixed text-only and multimodal instruction data for finetuning. Instruction tuning has also enhanced previous vision-language foundation models: for example, MultiModal-GPT (Gong et al. 2023) designed various instruction templates that incorporate vision and language data for multi-modality instruction tuning of OpenFlamingo (Awadalla et al. 2023). (Xu, Shen, and Huang 2023) built a multimodal instruction tuning benchmark dataset that consists of 62 diverse multimodal tasks in a unified seq-to-seq format and finetuned OFA (Wang et al. 2022a). MIMIC-IT (Li et al. 2023a) built a bigger dataset comprising 2.8 million multimodal instruction-response pairs to train a stronger model, Otter (Li et al. 2023a). We also employed instruction tuning data following the same prompt as InstructBLIP (Dai et al. 2023) to demonstrate the effectiveness of utilizing additional encoded patch embeddings.
[Figure 2 diagram: the BLIVA architecture with an example output. A frozen vision encoder turns the input image into encoded patch embeddings, which feed both a Q-Former (self-attention, cross-attention, feed-forward, taking text embeddings of the user instruction and queries) that produces learned query embeddings, and a projection layer for the patch embeddings; both embedding types, together with the user instruction ("What is this image about?"), are passed to the pre-trained LLM. Example output: "The image depicts the famous Hollywood sign located on a hillside, surrounded by mountains. The sign is prominently displayed in the center of the image, with its letters spelling out 'HOLLYWOOD.' In addition to the Hollywood sign, there are several trees scattered throughout the scene, providing a natural backdrop for the iconic landmark."]

Figure 2: Model architecture of BLIVA. BLIVA uses a Q-Former to draw out instruction-aware visual features from the patch embeddings generated by a frozen image encoder. These learned query embeddings are then fed as soft prompt inputs into the frozen LLM. Additionally, the system repurposes the originally encoded patch embeddings through a fully connected projection layer, serving as a supplementary source of visual information for the frozen LLM.

Method

Architecture Overview
As illustrated in Figure 1, there are mainly two types of end-to-end multimodal LLMs: 1) models that utilize learned query embeddings for the LLM. For instance, MiniGPT-4 (Zhu et al. 2023) used the frozen Q-Former module from BLIP-2 (Li et al. 2023b) to extract image features by querying the CLIP vision encoder, and Flamingo (Alayrac et al. 2022) employed a Perceiver Resampler, which reduces image features to a fixed number of visual outputs for the LLM. 2) Models that directly employ image-encoded patch embeddings, such as LLaVA (Liu et al. 2023a), which connects its vision encoder to the LLM using an MLP. Nevertheless, these models exhibit certain constraints. Models that employ learned query embeddings for the LLM help in better understanding the vision encoder output but may miss crucial information from the encoded patch embeddings. On the other hand, models that directly use encoded image patch embeddings through a linear projection layer might have limited capability in capturing all the information required by the LLM.

To address this, we introduce BLIVA, a multimodal LLM designed to incorporate both learned query embeddings, which are more closely aligned with the LLM, and image-encoded patch embeddings, which carry richer image information. In particular, Figure 2 illustrates that our model incorporates a vision tower, which encodes visual representations from the input image into encoded patch embeddings. Subsequently, these patch embeddings are sent separately to the Q-Former, to extract refined learned query embeddings, and to the projection layer, allowing the LLM to grasp the rich visual knowledge. We concatenate the two types of embeddings and feed them directly to the LLM. These combined visual embeddings are appended immediately after the question text embeddings to serve as the final input to the LLM. During inference, we employ beam search to select the best-generated output. Conversely, for classification and multi-choice VQA benchmarks, we adopt the vocabulary ranking method outlined in InstructBLIP (Dai et al. 2023): given prior knowledge of a list of candidates, we calculate the log-likelihood of each and choose the one with the highest value as the final prediction. To support a version of our architecture eligible for commercial usage, we also selected FlanT5-XXL as our LLM. This is referred to as BLIVA (FlanT5XXL) in this paper.
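To make this fusion and inference procedure concrete, the following is a minimal PyTorch sketch of the embedding concatenation and of the vocabulary-ranking step. It is not the repository implementation: the module names, the separate projections, and the assumption of a Hugging Face-style causal LM interface are ours.

    import torch
    import torch.nn as nn

    class VisualAssistantFusion(nn.Module):
        """Sketch: project Q-Former query embeddings and encoded patch embeddings
        into the LLM embedding space, concatenate them, and append the result
        after the question text embeddings. Dimensions are illustrative."""

        def __init__(self, vision_width, qformer_width, llm_width):
            super().__init__()
            self.query_proj = nn.Linear(qformer_width, llm_width)
            self.patch_proj = nn.Linear(vision_width, llm_width)

        def forward(self, query_embeds, patch_embeds, question_embeds):
            # query_embeds:    (B, num_queries, qformer_width) from the Q-Former
            # patch_embeds:    (B, num_patches, vision_width)  from the frozen ViT
            # question_embeds: (B, seq_len, llm_width)          LLM embeddings of the question
            visual = torch.cat(
                [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
            )
            # Combined visual embeddings appended immediately after the question text.
            return torch.cat([question_embeds, visual], dim=1)

    @torch.no_grad()
    def rank_candidates(llm, inputs_embeds, candidate_token_ids):
        """Vocabulary-ranking sketch for classification / multi-choice VQA:
        score each candidate answer by its log-likelihood under a frozen
        causal LM and return the index of the highest-scoring one."""
        scores = []
        for cand in candidate_token_ids:  # each a (1, T_cand) LongTensor
            cand_embeds = llm.get_input_embeddings()(cand)
            full = torch.cat([inputs_embeds, cand_embeds], dim=1)
            logits = llm(inputs_embeds=full).logits
            # logits at position i predict token i+1, so the candidate tokens
            # are scored by the slice just before them
            logp = torch.log_softmax(logits[:, -cand.size(1) - 1:-1, :], dim=-1)
            scores.append(logp.gather(-1, cand.unsqueeze(-1)).sum())
        return int(torch.stack(scores).argmax())

For open-ended VQA, the fused sequence would instead be passed to beam-search generation rather than candidate ranking, as described above.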
Two-Stage Training Scheme
We adopt the typical two-stage training scheme: 1) In the pre-training stage, the goal is to align the LLM with visual information using image-text pairs from image captioning datasets that provide global descriptions of images. 2) After pre-training, the LLM becomes familiar with the visual embedding space and can generate descriptions of images. However, it still lacks the capability to discern the finer details of images and respond to human questions. In the second stage, we use instruction tuning data to enhance performance and further align the visual embeddings with the LLM and human values. Recent methods have predominantly adopted a two-stage training approach (Zhu et al. 2023; Liu et al. 2023a; Ye et al. 2023); the exception, PandaGPT (Su et al. 2023), utilizes a one-stage training method and has also demonstrated commendable results. In BLIVA, our visual assistant branch, specifically the encoded patch embeddings, diverges from the approach of BLIP-2 (Li et al. 2023b), which uses a 129M pre-training dataset. Instead, it leverages a more compact 0.5M pre-training caption dataset following (Liu et al. 2023a). This presents a more efficient strategy for aligning the visual encoder and the LLM in the first stage. We employ the language modeling loss as our training objective: the model learns to generate subsequent tokens based on the preceding context.
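As a concrete illustration of this freezing scheme, the sketch below shows one way to select trainable parameters per stage. The attribute names are illustrative assumptions and do not mirror the released code.

    def configure_trainable(model, stage):
        """Two-stage sketch: the vision encoder and the LLM stay frozen in both
        stages; stage 1 trains only the patch projection layer, stage 2 finetunes
        the Q-Former together with the projection layer."""
        for p in model.parameters():
            p.requires_grad = False
        trainable = [model.patch_proj] if stage == 1 else [model.qformer, model.patch_proj]
        for module in trainable:
            for p in module.parameters():
                p.requires_grad = True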
[Figure 3 diagram: an image of a baseball pitcher shown with Stage 1 pre-training data ("A baseball pitcher winds up to pitch the ball.") and Stage 2 instruction tuning examples ("Question: What is this photo taken looking through? Short answer: net"; "What position is this man playing? Pitcher"; "Q: What color is the players shirt? A: orange"; "What is the answer to the following question? Is this man a professional baseball player? yes").]

Figure 3: A typical multi-stage VLM training paradigm. The training process involves two key stages. For the Q-Former, the first stage is done by (Li et al. 2023b), where image and text caption pairs are pre-trained to accomplish a raw alignment between the visual and language modalities. As for the patch features, we followed (Liu et al. 2023a) and use the same pre-training dataset. In the second stage, the alignment is further refined using instruction tuning VQA data, which facilitates a more detailed understanding of visual input based on language instructions.

Thumbnails Dataset
To showcase the wide-ranging industry applications made feasible by BLIVA, we assess the model by introducing a new evaluation dataset, named YTTB-VQA, which consists of 400 YouTube thumbnail visual question-answer pairs designed to evaluate visual perception abilities on in-text images. It covers 11 different categories, illustrated in Appendix Figure 7. During data collection, we randomly selected YouTube videos with text-rich thumbnails from different categories. We recorded the unique video ID for each YouTube video and obtained the high-resolution thumbnail from the URL "http://img.youtube.com/vi/<YouTube-Video-ID>/maxresdefault.jpg". After retrieving all the YouTube thumbnails, we created the annotation file with the following fields: "video id", representing the unique identification of a specific YouTube video; "question", representing the human-made question based on the text and image in the thumbnail; "video classes", representing the 11 video categories; "answers", representing the ground-truth answer; and "video link", representing the URL link for each YouTube video. Our YouTube thumbnail dataset is available at https://huggingface.co/datasets/mlpc-lab/YTTB-VQA.

We also provide two sample scenarios from the YTTB-VQA dataset. Figure 4 illustrates BLIVA's capability to provide detailed captions and answer users' visual questions.
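For illustration, the snippet below shows how a thumbnail URL and one annotation record of this form could be assembled from a video ID. The dictionary keys follow the field description above and should be treated as an approximation of the released annotation file rather than its exact schema.

    def yttb_record(video_id, question, video_class, answer):
        """Build one YTTB-VQA style annotation entry (field names follow the
        description in the text; the exact released spellings may differ)."""
        return {
            "video id": video_id,
            "question": question,
            "video classes": video_class,   # one of the 11 categories
            "answers": answer,              # ground-truth answer
            "video link": f"https://www.youtube.com/watch?v={video_id}",
            # High-resolution thumbnail URL, as described above:
            "thumbnail": f"http://img.youtube.com/vi/{video_id}/maxresdefault.jpg",
        }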
Experiment
In this section, we conduct extensive experiments and analyses to show the efficacy of our model. We evaluate our model, the baseline, and other SOTA models on 10 OCR-related tasks and 8 general (not particularly text-rich) VQA benchmarks, including image captioning, image question answering, visual reasoning, visual conversational QA, image classification, and video question answering. We also evaluate on a comprehensive multimodal LLM benchmark (MME). We seek to answer the following:
• How does our proposed method compare to alternative single-image-embedding approaches on text-rich VQA benchmarks, general VQA benchmarks, and the MME benchmark?
• How do the individual components of our method influence its success?
• How does BLIVA enhance the recognition of YouTube thumbnails?

Datasets
To demonstrate the effectiveness of patch embeddings, we followed (Dai et al. 2023) and use the same training and evaluation data unless mentioned explicitly. Detailed dataset information can be found in the Appendix.

Implementation Details
We selected the ViT-G/14 from EVA-CLIP (Sun et al. 2023) as our visual encoder. In line with InstructBLIP, we employed Vicuna-7B, which has been instruction-tuned from LLaMA (Touvron et al. 2023), as our LLM. Additional details can be found in the Appendix.

Results & Discussions
We introduce our results in the context of each of our three questions and discuss our main findings.

1. How does our proposed method compare to alternative single-image-embedding approaches on text-rich VQA benchmarks, general VQA benchmarks, and the MME benchmark?

Zero-shot evaluation for text-rich VQA benchmarks We compared our model with state-of-the-art multimodal LLMs. This includes LLaVA, which showcases robust OCR capabilities using only patch embeddings. We also considered BLIP-2's previous best version, BLIP2-FlanT5XXL, the state-of-the-art vision-language model mPLUG-Owl (trained on a vast amount of both text and vision-text data), and our baseline, InstructBLIP. The results are illustrated in Table 1. Our model consistently shows significant improvement across all the text-rich VQA datasets compared to InstructBLIP. Note that since InstructBLIP utilized OCR-VQA as a training dataset, the comparison on this specific dataset is not zero-shot. We evaluated both InstructBLIP and our model using the OCR-VQA validation set. BLIVA achieved state-of-the-art results on 6 text-rich datasets, while mPLUG-Owl performed best on 4 datasets. Compared to mPLUG-Owl, which employs about 1104M image captioning examples in the pre-training stage, BLIVA only employs 558K image caption pairs, which could explain why BLIVA does not perform the best on information-based VQA tasks such as InfoVQA, ChartQA, and ESTVQA. BLIVA demonstrates the best performance on average compared to all previous methods, underscoring our design choice to employ learned query embeddings, further aided by encoded patch embeddings.
Figure 4: Two Sample Scenarios from the YTTB-VQA Dataset. This dataset demonstrates the dual application of BLIVA.
The first scenario highlights BLIVA’s capability to provide detailed captions that encompass all visual information within an
image. The second scenario showcases BLIVA’s utility in summarizing visual data into concise captions, followed by its ability
to field more detailed visual queries posed by users.

Model  STVQA ↑  OCRVQA ↑  TextVQA ↑  DocVQA ↑  InfoVQA ↑  ChartQA ↑  ESTVQA ↑  FUNSD ↑  SROIE ↑  POIE ↑  Average ↑
OpenFlamingo (Awadalla et al. 2023) 19.32 27.82 29.08 5.05 14.99 9.12 28.20 0.85 0.12 2.12 13.67
BLIP2-OPT6.7b (Li et al. 2023b) 13.36 10.58 21.18 0.82 8.82 7.44 27.02 0.00 0.00 0.02 8.92
BLIP2-FLanT5XXL (Li et al. 2023b) 21.38 30.28 30.62 4.00 10.17 7.20 42.46 1.19 0.20 2.52 15.00
MiniGPT4 (Zhu et al. 2023) 14.02 11.52 18.72 2.97 13.32 4.32 28.36 1.19 0.04 1.31 9.58
LLaVA (Liu et al. 2023a) 22.93 15.02 28.30 4.40 13.78 7.28 33.48 1.02 0.12 2.09 12.84
mPLUG-Owl (Ye et al. 2023) 26.32 35.00 37.44 6.17 16.46 9.52 49.68 1.02 0.64 3.26 18.56
InstructBLIP (FlanT5XXL ) (Dai et al. 2023) 26.22 55.04 36.86 4.94 10.14 8.16 43.84 1.36 0.50 1.91 18.90
InstructBLIP (Vicuna-7B) (Dai et al. 2023) 28.64 47.62 39.60 5.89 13.10 5.52 47.66 0.85 0.64 2.66 19.22
BLIVA (FlanT5XXL ) 28.24 61.34 39.36 5.22 10.82 9.28 45.66 1.53 0.50 2.39 20.43
BLIVA (Vicuna-7B) 29.08 65.38 42.18 6.24 13.50 8.16 48.14 1.02 0.88 2.91 21.75

Table 1: Zero-Shot OCR-Free Results on Text-Rich VQA benchmarks. This table presents the accuracy (%) results for
OCR-free methods, implying no OCR-tokens were used. Note that our work follows InstructBLIP which incorporated OCR-
VQA in its training dataset, thus inevitably making OCR-VQA evaluation not zero-shot.

Zero-shot evaluation for general (not particularly text-rich) VQA benchmarks Next, we compared BLIVA with models that employ single image features. Results are given in Table 2, and in Table 3 for LLMs available for commercial use. Our model consistently and significantly outperformed the original InstructBLIP model on VSR, IconQA, TextVQA, Visual Dialog, Hateful Memes, MSRVTT, and Flickr30K. For VizWiz, our model nearly matched InstructBLIP's performance. This naturally raises the question: why didn't additional visual assistance improve all the benchmarks? We speculate that the additional visual information didn't aid the VizWiz task; we continue to investigate this phenomenon in the ablation study section. Overall, our design not only achieved significant improvements in understanding text-rich images but also improves 7 out of 8 general VQA benchmarks.

MME Benchmark We further evaluated BLIVA on a comprehensive multimodal LLM benchmark (MME) (Fu et al. 2023). As illustrated in Table 4, BLIVA demonstrates the best average performance among current methods for both the perception and cognition tasks. For all text-rich tasks such as OCR, Poster, Numerical Calculation, Text Translation, and Code, BLIVA outperforms InstructBLIP. BLIVA achieved top-2 performance across all the tasks except Artwork and Landmark, which demand extensive informational knowledge. This is consistent with our findings from informational VQA, indicating that our lightweight pre-training stage and the missing LAION-115M web image caption dataset during the instruction tuning stage both likely contribute to a degradation in BLIVA's internet knowledge base.
Models VSR ↑ IconQA ↑ TextVQA ↑ Visdial ↑ Flickr30K ↑ HM ↑ VizWiz ↑ MSRVTT ↑
(val) (val-dev) (val-dev)
Flamingo-3B (Alayrac et al. 2022) - - 30.1 - 60.6 - - -
Flamingo-9B (Alayrac et al. 2022) - - 31.8 - 61.5 - - -
Flamingo-80B (Alayrac et al. 2022) - - 35.0 - 67.2 - - -
MiniGPT-4 (Zhu et al. 2023) 50.65 - 18.56 - - 29.0 34.78 -
LLaVA (Liu et al. 2023a) 56.3 - 37.98 - - 9.2 36.74 -
BLIP-2 (Vicuna-7B) (Dai et al. 2023) 50.0 39.7 40.1 44.9 74.9 50.2 49.34 4.17
InstructBLIP (Vicuna-7B) (Dai et al. 2023) 54.3 43.1 50.1 45.2 82.4 54.8 43.3 18.7
InstructBLIP Baseline (Vicuna-7B) 58.67 44.34 37.58 40.58 84.61 50.6 44.10 20.97
BLIVA (Vicuna-7B) 62.2 44.88 57.96 45.63 87.1 55.6 42.9 23.81

Table 2: Zero-shot results on general (not particularly text-rich) VQA benchmarks. Our baseline is obtained by directly
finetuning InstructBLIP (Dai et al. 2023). For the three datasets on the right, due to the unavailability of test-set answers,
we have evaluated them using validation dev. Here, Visdial and HM denote the Visual Dialog and Hateful Memes datasets,
respectively. Following previous works (Alayrac et al. 2022; Yang et al. 2021; Murahari et al. 2020), we report the CIDEr
score (Vedantam, Zitnick, and Parikh 2015) for Flickr30K, AUC score for Hateful Memes, and Mean Reciprocal Rank (MRR)
for Visual Dialog. For all remaining datasets, we report the top-1 accuracy (%). Notably, for Text-VQA, we have followed
InstructBLIP’s method of using OCR-tokens for comparison. While InstructBLIP also included GQA, iVQA, and MSVDQA,
we were unable to access these datasets due to either unresponsive authors or the datasets being removed from their websites.
For ScienceQA and Nocaps, we were unable to reproduce the results of InstructBLIP, hence their results are not reported here.
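For reference, two of the reported metrics are simple to compute once a ranked candidate list or a prediction is available; the sketch below shows Mean Reciprocal Rank (used for Visual Dialog) and top-1 accuracy. The input conventions (1-based ranks, string predictions) are our assumptions.

    def mean_reciprocal_rank(gt_ranks):
        """MRR over the 1-based ranks of the ground-truth answer in each
        question's ranked candidate list."""
        return sum(1.0 / r for r in gt_ranks) / len(gt_ranks)

    def top1_accuracy(predictions, answers):
        """Percentage of questions whose top prediction matches the answer."""
        correct = sum(p == a for p, a in zip(predictions, answers))
        return 100.0 * correct / len(answers)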

Models VSR ↑ IconQA ↑ TextVQA ↑ Visdial ↑ Flickr30K ↑ HM ↑ VizWiz ↑ MSRVTT ↑
(val) (val-dev) (val-dev)
BLIP-2 (FlanT5XXL ) (Li et al. 2023b) 68.2 45.4 44.1 46.9 73.7 52.0 29.4 17.4
InstructBLIP (FlanT5XXL ) (Dai et al. 2023) 65.6 51.2 46.6 48.5 83.5 53.6 41.35 20.79
BLIVA (FlanT5XXL ) 68.82 52.42 57.2 36.18 87.66 50.0 43.97 23.78

Table 3: Zero-shot results on general (not particularly text-rich) VQA benchmarks for models with open LLM eligible for
commercial use. Here, the commercial use applicable LLM we reported is FlanT5XXL . Same as Table 2, we report the same
evaluation datasets with the same evaluation metrics.

Model  Overall ↑  Perception ↑ (Exist. Count Pos. Color OCR Poster Cele. Scene Land. Art.)  Cognition ↑ (Comm. NumCal. Trans. Code)  Avg. ↑
LLaVA(Liu et al. 2023a) 712.5 50.0 50.0 50.0 50.0 50.0 50.0 48.8 50.0 50.0 49.0 57.1 50.0 57.5 50.0 50.9
MiniGPT-4(Zhu et al. 2023) 694.3 68.3 55.0 43.3 43.3 57.5 41.8 54.4 71.8 54.0 60.5 59.3 45.0 0.0 40.0 49.6
mPLUG-Owl(Ye et al. 2023) 1238.4 120.0 50.0 50.0 50.0 65.0 136.1 100.3 135.5 159.3 96.3 78.6 60.0 80.0 57.5 88.5
InstructBLIP(Dai et al. 2023) 1417.9 185.0 143.3 66.7 66.7 72.5 123.8 101.2 153.0 79.8 134.3 129.3 40.0 65.0 57.5 101.3
BLIP-2(Li et al. 2023b) 1508.8 160.0 135.0 73.3 73.3 110.0 141.8 105.6 145.3 138.0 136.5 110.0 40.0 65.0 75.0 107.8
BLIVA 1669.2 180.0 138.3 81.7 180.0 87.5 155.1 140.9 151.5 89.5 133.3 136.4 57.5 77.5 60.0 119.2

Table 4: Evaluation of MME-Benchmark. Here we report the results on all the sub tasks, including Existence(Exist.),
Count, Position(Pos.), Color, OCR, Poster, Celebrity(Cele.), Scene, Landmark(Land.), Artwork(Art.), Commonsense Reason-
ing(Comm.), Numerical Calculation(NumCal.), Text Translation(Trans.), Code Reasoning(Code) and the task-level average
(Avg.). We bold the highest overall and average score and highlight the Top-2 model of each sub task with underline.

InstructBLIP (Dai et al. 2023)  Baseline (Instruction Tuning Qformer)  Patch Embedding  Pre-Training  Finetuning LLM
ST-VQA  OCR-VQA  TextVQA  DocVQA  InfoVQA  ChartQA  ESTVQA  FUNSD  SROIE  POIE  Improvement
✓ 28.64 47.62 39.60 5.89 13.10 5.52 47.66 0.85 0.64 2.66 +0%
✓ ✓ 30.08 65.8 40.5 6.13 12.03 8.08 47.02 0.85 0.57 2.62 + 7.40%
✓ ✓ ✓ 28.86 65.04 40.7 6.65 14.28 8.24 47.72 1.19 1.66 2.83 + 31.72%
✓ ✓ ✓ ✓ 29.08 65.38 42.18 6.24 13.50 8.16 48.14 1.02 0.88 2.91 + 17.01%
✓ ✓ ✓ ✓ ✓ 29.94 66.48 41.9 6.47 12.51 7.52 46.76 1.02 0.51 2.85 + 9.65%

Table 5: Results of adding individual techniques of our framework on text-rich VQA benchmarks. We include four ablations that accumulate each technique: (i) baseline: instruction tuning InstructBLIP's Q-Former; (ii) instruction tuning patch embeddings; (iii) the pre-training stage of patch embeddings; (iv) finetuning the LLM with LoRA during the instruction tuning stage.

2. How do the individual components of our method influence its success?
To investigate the impact of image-encoded patch embeddings, the pre-training stage, and fine-tuning the LLM, we conducted ablation studies incorporating each element respectively. For simplicity, we only conduct ablations on the BLIVA (Vicuna-7B) model. Since our baseline is InstructBLIP, we report the results of using the baseline alone as directly finetuning the InstructBLIP model with our data and implementation.
InstructBLIP (Dai et al. 2023)  Baseline (Instruction Tuning Qformer)  Patch Embeddings  Pre-Training  Finetuning LLM
VSR  IconQA  TextVQA  Visdial  Flickr30K  HM (val)  VizWiz (val-dev)  MSRVTT (val-dev)  Improvement
✓ 54.3 43.1 50.1 45.2 82.4 54.8 43.3 18.7 + 0%
✓ ✓ 58.67 44.34 37.58 40.58 84.61 50.6 44.1 20.97 - 1.91%
✓ ✓ ✓ 58.85 44.91 58.8 41.67 87.4 49.1 42.83 23.70 + 5.43%
✓ ✓ ✓ ✓ 62.2 44.88 57.96 45.63 87.1 55.6 42.9 23.81 + 8.61%
✓ ✓ ✓ ✓ ✓ 51.39 41.34 57.82 42.32 82.7 46.2 44.91 22.67 + 1.15%

Table 6: Results of adding individual techniques of our framework in general (not particularly text-rich) VQA bench-
marks. We include four ablations that accumulate each technique same as in Table 5.
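The Improvement column in Tables 5 and 6 is not defined explicitly in the text; the reported numbers are consistent with averaging the per-dataset relative change over the baseline row (for example, this reproduces roughly +7.40% for the patch-embedding row of Table 5). The sketch below computes it under that assumption, which is our reading rather than a definition from the paper.

    def average_relative_improvement(scores, baseline_scores):
        """Mean per-dataset relative change (%) with respect to the baseline row,
        skipping datasets where the baseline score is zero."""
        deltas = [
            100.0 * (s - b) / b
            for s, b in zip(scores, baseline_scores)
            if b > 0
        ]
        return sum(deltas) / len(deltas)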

Ablation in text-rich VQA benchmarks For text-rich image tasks, Table 5 illustrates the results of adding each technique separately. Compared to the baseline, adding patch embeddings improved performance across all tasks with the exception of ST-VQA and OCR-VQA. This can stem from data contamination, as ST-VQA includes data already present in InstructBLIP's Q-Former training set but not included in the patch embedding's training set. Without the pre-training stage, the performance on ST-VQA, OCR-VQA, TextVQA, ESTVQA, and POIE decreased, while the rest benefited. Since the pre-training stage employs image caption pairs, we observed that it didn't benefit BLIVA's performance on text-rich VQA tasks as consistently as on the general VQA tasks. Considering the improvement across all tasks, pre-training is still adopted. BLIVA on average outperforms InstructBLIP by 31.72% without pre-training and by 17.01% with it, both outpacing the 7.40% improvement from instruction tuning the Q-Former alone. These studies indicate that our design of employing patch embeddings provides more detailed visual information. It also supports our hypothesis that an additional visual assistant improves visual knowledge in areas where the query embeddings either neglect information or have limited extraction capability.

Ablation in general (not particularly text-rich) VQA benchmarks As illustrated in Table 6, the presence of encoded patch embeddings improves performance significantly on all benchmarks except HM and VizWiz. The tasks where we observed a drop in performance, such as HM, which focuses on interpreting the feeling of hatefulness, and VizWiz, which predicts whether a visual question can be answered, can, we conjecture, be fulfilled by utilizing global-level query embedding information, such as sensing hatefulness in the image or judging whether the image's object is unrelated to the question being asked. When adding the first pre-training stage, the performance on the VSR, VisDial, HM, and MSRVTT tasks improves substantially while the others stay roughly the same. These ablation results confirm the necessity of two-stage training. During the instruction tuning stage, we also experimented with fine-tuning the LLM using LoRA in conjunction with the Q-Former and encoded patch embeddings. However, this approach didn't yield as much improvement as our best model and even reduced performance on many tasks. Nonetheless, we have included these results in the ablation study for completeness. We conjecture that the frozen LLM has a satisfactory understanding of visual information after our two-stage alignment: the visual embeddings are interpreted as a "foreign language" by the LLM, and thus finetuning the LLM together is not needed in this case.

3. How does BLIVA enhance the recognition of YouTube thumbnails?
YouTube Thumbnails Evaluation Table 7 illustrates the results on the YouTube thumbnail dataset, with BLIVA achieving the best performance. From an application perspective, BLIVA has the ability to extract extra visual information from images, beyond extracting information from YouTube captions alone as LLMs do. Our success in this use case can be further expanded to large-scale thumbnail images.

Models  Accuracy (%)
MiniGPT4 (Zhu et al. 2023)  47.75
LLaVA (Liu et al. 2023a)  41.75
InstructBLIP (Vicuna-7B) (Dai et al. 2023)  82.2
BLIVA (Vicuna-7B)  83.5

Table 7: Evaluation results on our collected YouTube thumbnails dataset. We report the top-1 accuracy (%).

Qualitative Analysis
We use real-life scene images, movie posters, webpages, and memes to demonstrate our model's performance in interacting with humans based on text-rich images. The examples are in the Appendix. BLIVA showcases exceptional OCR capabilities, paired with a robust localization ability that accurately identifies texts and objects within images.

Conclusion
In this paper, we illustrate the effectiveness of assisting learned query embeddings with encoded image patch embeddings as a visual assistant. This straightforward yet innovative design bolsters performance on both general VQA benchmarks and text-rich VQA benchmarks. Our model, BLIVA, demonstrates superior performance on both academic benchmarks and qualitative real-world examples. Moreover, human evaluation of the model's performance reveals that BLIVA struggles with deciphering numerical symbols in images. This could be attributed to the reduced pixel representation often used for these symbols and needs future work to develop valuable insights. Our work also demonstrates the effectiveness of mixing different types of visual embeddings. We encourage more future work to explore how to scale more visual embeddings to the LLM, which can be the key to the next stage of Large Vision-Language Models.
Acknowledgements
Zhuowen Tu is funded by NSF Award IIS-2127544.

References
Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems (NeurIPS), 35: 23716–23736.
Awadalla, A.; Gao, I.; Gardner, J.; Hessel, J.; Hanafy, Y.; Zhu, W.; Marathe, K.; Bitton, Y.; Gadre, S.; Jitsev, J.; et al. 2023. OpenFlamingo.
Biten, A. F.; Litman, R.; Xie, Y.; Appalaraju, S.; and Manmatha, R. 2022. LaTr: Layout-Aware Transformer for Scene-Text VQA. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16548–16558.
Chung, H. W.; Hou, L.; Longpre, S.; Zoph, B.; Tay, Y.; Fedus, W.; Li, Y.; Wang, X.; Dehghani, M.; Brahma, S.; Webson, A.; Gu, S. S.; Dai, Z.; Suzgun, M.; Chen, X.; Chowdhery, A.; Castro-Ros, A.; Pellat, M.; Robinson, K.; Valter, D.; Narang, S.; Mishra, G.; Yu, A.; Zhao, V.; Huang, Y.; Dai, A.; Yu, H.; Petrov, S.; Chi, E. H.; Dean, J.; Devlin, J.; Roberts, A.; Zhou, D.; Le, Q. V.; and Wei, J. 2022. Scaling Instruction-Finetuned Language Models. arXiv:2210.11416.
Dai, W.; Li, J.; Li, D.; Tiong, A. M. H.; Zhao, J.; Wang, W.; Li, B.; Fung, P.; and Hoi, S. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. arXiv:2305.06500.
Das, A.; Kottur, S.; Gupta, K.; Singh, A.; Yadav, D.; Moura, J. M.; Parikh, D.; and Batra, D. 2017. Visual Dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.
Fu, C.; Chen, P.; Shen, Y.; Qin, Y.; Zhang, M.; Lin, X.; Qiu, Z.; Lin, W.; Yang, J.; Zheng, X.; Li, K.; Sun, X.; and Ji, R. 2023. MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models. arXiv preprint arXiv:2306.13394.
Girdhar, R.; El-Nouby, A.; Liu, Z.; Singh, M.; Alwala, K. V.; Joulin, A.; and Misra, I. 2023. ImageBind: One Embedding Space To Bind Them All. In Computer Vision and Pattern Recognition Conference (CVPR).
Gong, T.; Lyu, C.; Zhang, S.; Wang, Y.; Zheng, M.; Zhao, Q.; Liu, K.; Zhang, W.; Luo, P.; and Chen, K. 2023. MultiModal-GPT: A Vision and Language Model for Dialogue with Humans. arXiv:2305.04790.
Goyal, Y.; Khot, T.; Summers-Stay, D.; Batra, D.; and Parikh, D. 2017. Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR).
Gurari, D.; Li, Q.; Stangl, A. J.; Guo, A.; Lin, C.; Grauman, K.; Luo, J.; and Bigham, J. P. 2018. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3608–3617.
Honovich, O.; Scialom, T.; Levy, O.; and Schick, T. 2022. Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor. arXiv:2212.09689.
Hu, E. J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; and Chen, W. 2022. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations (ICLR).
Huang, Z.; Chen, K.; He, J.; Bai, X.; Karatzas, D.; Lu, S.; and Jawahar, C. 2019. ICDAR2019 competition on scanned receipt OCR and information extraction. In 2019 International Conference on Document Analysis and Recognition (ICDAR), 1516–1520. IEEE.
Hudson, D. A.; and Manning, C. D. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. In Computer Vision and Pattern Recognition (CVPR).
Jaume, G.; Ekenel, H. K.; and Thiran, J.-P. 2019. FUNSD: A dataset for form understanding in noisy scanned documents. In 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW), volume 2, 1–6. IEEE.
Kiela, D.; Firooz, H.; Mohan, A.; Goswami, V.; Singh, A.; Ringshia, P.; and Testuggine, D. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems (NeurIPS), 33: 2611–2624.
Kuang, J.; Hua, W.; Liang, D.; Yang, M.; Jiang, D.; Ren, B.; and Bai, X. 2023. Visual information extraction in the wild: practical dataset and end-to-end solution. In International Conference on Document Analysis and Recognition, 36–53. Springer.
Li, B.; Zhang, Y.; Chen, L.; Wang, J.; Pu, F.; Yang, J.; Li, C.; and Liu, Z. 2023a. MIMIC-IT: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
Li, D.; Li, J.; Le, H.; Wang, G.; Savarese, S.; and Hoi, S. C. 2023b. LAVIS: A One-stop Library for Language-Vision Intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), 31–41. Toronto, Canada: Association for Computational Linguistics.
Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C. L.; and Dollár, P. 2015. Microsoft COCO: Common Objects in Context. arXiv:1405.0312.
Liu, F.; Emerson, G.; and Collier, N. 2023. Visual spatial reasoning. Transactions of the Association for Computational Linguistics (TACL), 11: 635–651.
Liu, H.; Li, C.; Wu, Q.; and Lee, Y. J. 2023a. Visual Instruction Tuning.
Liu, Y.; Li, Z.; Li, H.; Yu, W.; Liu, Y.; Yang, B.; Huang, M.; Peng, D.; Liu, M.; Chen, M.; Li, C.; Yin, X.; lin Liu, C.; Jin, L.; and Bai, X. 2023b. On the Hidden Mystery of OCR in Large Multimodal Models. arXiv:2305.07895.
Loshchilov, I.; and Hutter, F. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
Lu, P.; Qiu, L.; Chen, J.; Xia, T.; Zhao, Y.; Zhang, W.; Yu, Z.; Liang, X.; and Zhu, S.-C. 2022. IconQA: A New Benchmark for Abstract Diagram Understanding and Visual Language Reasoning. arXiv:2110.13214.
Marino, K.; Rastegari, M.; Farhadi, A.; and Mottaghi, R. 2019. OK-VQA: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 3195–3204.
Masry, A.; Do, X. L.; Tan, J. Q.; Joty, S.; and Hoque, E. 2022. ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning. In Findings of the Association for Computational Linguistics: ACL 2022, 2263–2279.
Mathew, M.; Bagal, V.; Tito, R.; Karatzas, D.; Valveny, E.; and Jawahar, C. 2022. InfographicVQA. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 1697–1706.
Mathew, M.; Karatzas, D.; and Jawahar, C. 2021. DocVQA: A Dataset for VQA on Document Images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2200–2209.
Mishra, A.; Shekhar, S.; Singh, A. K.; and Chakraborty, A. 2019. OCR-VQA: Visual Question Answering by Reading Text in Images. In ICDAR.
Murahari, V.; Batra, D.; Parikh, D.; and Das, A. 2020. Large-Scale Pretraining for Visual Dialog: A Simple State-of-the-Art Baseline. In Vedaldi, A.; Bischof, H.; Brox, T.; and Frahm, J.-M., eds., ECCV.
OpenAI. 2023. GPT-4 Technical Report. arXiv:2303.08774.
Ordonez, V.; Kulkarni, G.; and Berg, T. 2011. Im2Text: Describing Images Using 1 Million Captioned Photographs. In Shawe-Taylor, J.; Zemel, R.; Bartlett, P.; Pereira, F.; and Weinberger, K., eds., Advances in Neural Information Processing Systems, volume 24. Curran Associates, Inc.
Sanh, V.; Webson, A.; Raffel, C.; Bach, S. H.; Sutawika, L.; Alyafeai, Z.; Chaffin, A.; Stiegler, A.; Scao, T. L.; Raja, A.; Dey, M.; Bari, M. S.; Xu, C.; Thakker, U.; Sharma, S. S.; Szczechla, E.; Kim, T.; Chhablani, G.; Nayak, N.; Datta, D.; Chang, J.; Jiang, M. T.-J.; Wang, H.; Manica, M.; Shen, S.; Yong, Z. X.; Pandey, H.; Bawden, R.; Wang, T.; Neeraj, T.; Rozen, J.; Sharma, A.; Santilli, A.; Fevry, T.; Fries, J. A.; Teehan, R.; Bers, T.; Biderman, S.; Gao, L.; Wolf, T.; and Rush, A. M. 2022. Multitask Prompted Training Enables Zero-Shot Task Generalization. arXiv:2110.08207.
Schuhmann, C.; Vencu, R.; Beaumont, R.; Kaczmarczyk, R.; Mullis, C.; Katta, A.; Coombes, T.; Jitsev, J.; and Komatsuzaki, A. 2021. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv:2111.02114.
Schwenk, D.; Khandelwal, A.; Clark, C.; Marino, K.; and Mottaghi, R. 2022. A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge. arXiv.
Sharma, P.; Ding, N.; Goodman, S.; and Soricut, R. 2018. Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning. In Proceedings of ACL.
Sidorov, O.; Hu, R.; Rohrbach, M.; and Singh, A. 2020. TextCaps: a Dataset for Image Captioning with Reading Comprehension. arXiv:2003.12462.
Singh, A.; Natarajan, V.; Shah, M.; Jiang, Y.; Chen, X.; Batra, D.; Parikh, D.; and Rohrbach, M. 2019. Towards VQA models that can read. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8317–8326.
Su, Y.; Lan, T.; Li, H.; Xu, J.; Wang, Y.; and Cai, D. 2023. PandaGPT: One Model To Instruction-Follow Them All. arXiv preprint arXiv:2305.16355.
Sun, Q.; Fang, Y.; Wu, L.; Wang, X.; and Cao, Y. 2023. EVA-CLIP: Improved Training Techniques for CLIP at Scale. arXiv:2303.15389.
Taori, R.; Gulrajani, I.; Zhang, T.; Dubois, Y.; Li, X.; Guestrin, C.; Liang, P.; and Hashimoto, T. B. 2023. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; Rodriguez, A.; Joulin, A.; Grave, E.; and Lample, G. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
Vedantam, R.; Zitnick, C. L.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 4566–4575.
Wang, P.; Yang, A.; Men, R.; Lin, J.; Bai, S.; Li, Z.; Ma, J.; Zhou, C.; Zhou, J.; and Yang, H. 2022a. OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework. CoRR, abs/2202.03052.
Wang, X.; Liu, Y.; Shen, C.; Ng, C. C.; Luo, C.; Jin, L.; Chan, C. S.; Hengel, A. v. d.; and Wang, L. 2020. On the general value of evidence, and bilingual scene-text visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10126–10135.
Wang, Y.; Kordi, Y.; Mishra, S.; Liu, A.; Smith, N. A.; Khashabi, D.; and Hajishirzi, H. 2023. Self-Instruct: Aligning Language Models with Self-Generated Instructions. arXiv:2212.10560.
Wang, Y.; Mishra, S.; Alipoormolabashi, P.; Kordi, Y.; Mirzaei, A.; Arunkumar, A.; Ashok, A.; Dhanasekaran, A. S.; Naik, A.; Stap, D.; Pathak, E.; Karamanolakis, G.; Lai, H. G.; Purohit, I.; Mondal, I.; Anderson, J.; Kuznia, K.; Doshi, K.; Patel, M.; Pal, K. K.; Moradshahi, M.; Parmar, M.; Purohit, M.; Varshney, N.; Kaza, P. R.; Verma, P.; Puri, R. S.; Karia, R.; Sampat, S. K.; Doshi, S.; Mishra, S.; Reddy, S.; Patro, S.; Dixit, T.; Shen, X.; Baral, C.; Choi, Y.; Smith, N. A.; Hajishirzi, H.; and Khashabi, D. 2022b. Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ NLP Tasks. arXiv:2204.07705.
Wei, J.; Bosma, M.; Zhao, V. Y.; Guu, K.; Yu, A. W.; Lester, B.; Du, N.; Dai, A. M.; and Le, Q. V. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
Wu, Y.; Zhao, Y.; Li, Z.; Qin, B.; and Xiong, K. 2023. Improving Cross-Task Generalization with Step-by-Step Instructions. arXiv preprint arXiv:2305.04429.
Xu, D.; Zhao, Z.; Xiao, J.; Wu, F.; Zhang, H.; He, X.; and Zhuang, Y. 2017. Video Question Answering via Gradually Refined Attention over Appearance and Motion. In Proceedings of the 25th ACM International Conference on Multimedia, 1645–1653.
Xu, Z.; Shen, Y.; and Huang, L. 2023. MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning. arXiv:2212.10773.
Yang, A.; Miech, A.; Sivic, J.; Laptev, I.; and Schmid, C. 2021. Just ask: Learning to answer questions from millions of narrated videos. In International Conference on Computer Vision (ICCV), 1686–1697.
Ye, Q.; Xu, H.; Xu, G.; Ye, J.; Yan, M.; Zhou, Y.; Wang, J.; Hu, A.; Shi, P.; Shi, Y.; Jiang, C.; Li, C.; Xu, Y.; Chen, H.; Tian, J.; Qi, Q.; Zhang, J.; and Huang, F. 2023. mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality. arXiv:2304.14178.
Young, P.; Lai, A.; Hodosh, M.; and Hockenmaier, J. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Nlp.cs.illinois.edu.
Zhang, S.; Roller, S.; Goyal, N.; Artetxe, M.; Chen, M.; Chen, S.; Dewan, C.; Diab, M.; Li, X.; Lin, X. V.; Mihaylov, T.; Ott, M.; Shleifer, S.; Shuster, K.; Simig, D.; Koura, P. S.; Sridhar, A.; Wang, T.; and Zettlemoyer, L. 2022. OPT: Open Pre-trained Transformer Language Models. arXiv preprint arXiv:2205.01068.
Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E. P.; Zhang, H.; Gonzalez, J. E.; and Stoica, I. 2023. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv:2306.05685.
Zhu, D.; Chen, J.; Shen, X.; Li, X.; and Elhoseiny, M. 2023. MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. arXiv preprint arXiv:2304.10592.

Appendix

Official Version
Our official version in the AAAI proceedings will be announced and made accessible at https://github.com/mlpc-ucsd/BLIVA.

Time and Memory Complexity
We report the training time for one epoch; memory is measured on 8 NVIDIA A6000 GPUs with a batch size of 24. The inference time is based on an NVIDIA A100. BLIVA outperforms InstructBLIP without sacrificing much computation power.

Models  Training Time  Inference Time  Training Memory  Inference Memory
InstructBLIP  11.5h  0.89s  34G  18G
BLIVA (Ours)  14.8h  0.91s  40G  22G

Table 8: Comparison of Time and Memory Overheads.

Text Fonts in Images
We evaluated BLIVA with two widely used fonts, Times New Roman and Impact, in four colors (white, red, pale green, and blue), as detailed in Table 9. We observed a degradation of performance for the blue color. Since the background of the images is gray-scale, the color contrast between the image and the text color affects the model's performance: BLIVA performs significantly better with light-colored fonts, like white or pale green, than with darker colors like blue.

Font Type / Color  White  Red  Pale Green  Blue
Times New Roman  94.12%  94.12%  94.12%  58.82%
Impact  94.12%  88.24%  94.12%  70.59%

Table 9: Comparison of performance across various fonts and colors, evaluated based on text-capture accuracy, i.e., the number of words correctly detected over all the words.

Encoding Methods in the Visual Assistant Branch
We include different BLIVA variants focusing on encoding localized positional details in Table 10. We picked the four most representative datasets covering diverse tasks for this ablation study. BLIVA with a linear projection outperforms the other variants: a simple projector leads to better generalization.

Models  Flickr30K  VSR  TextVQA  IconQA
InstructBLIP  82.4  54.3  39.60  43.1
BLIVA (Ours)  87.1  62.2  42.18  44.88
+ Convolutional Position  85.87  60.92  41.28  41.96
+ ViT Block w/ Convolutional Position  86.59  63.90  40.72  41.80
+ ViT Block w/ Relative Position  84.57  60.31  41.24  41.96
+ MLP Layer  82.9  59.57  42.12  43.79

Table 10: BLIVA with different encoding methods.
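As a point of reference for the plain linear projection that performs best in Table 10, a minimal sketch of such a projector is shown below. The 1408 and 4096 hidden sizes are assumptions of convenience (typical ViT-G/14 and Vicuna-7B widths), not values stated in the paper.

    import torch.nn as nn

    class PatchProjection(nn.Module):
        """Sketch of the linear projector in BLIVA's visual assistant branch:
        it maps every encoded patch embedding into the LLM token-embedding space."""

        def __init__(self, vision_width=1408, llm_width=4096):
            super().__init__()
            self.proj = nn.Linear(vision_width, llm_width)

        def forward(self, patch_embeds):      # (B, num_patches, vision_width)
            return self.proj(patch_embeds)    # (B, num_patches, llm_width)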

Datasets
We followed (Dai et al. 2023) to use the same training and evaluation data unless mentioned explicitly. Due to the illegal contents involved in the LAION-115M dataset (Schuhmann et al. 2021), we could not download it securely through the university internet. Besides lacking this subset of image captioning samples, we keep all other training data the same. It includes MSCOCO (Lin et al. 2015) for image captioning, TextCaps (Sidorov et al. 2020), VQAv2 (Goyal et al. 2017), OKVQA (Marino et al. 2019), A-OKVQA (Schwenk et al. 2022), OCR-VQA (Mishra et al. 2019), and LLaVA-Instruct-150K (Liu et al. 2023a). For evaluation datasets, we also follow (Dai et al. 2023) but only keep Flickr30K (Young et al. 2014), VSR (Liu, Emerson, and Collier 2023), IconQA (Lu et al. 2022), TextVQA (Singh et al. 2019), Visual Dialog (Das et al. 2017), Hateful Memes (Kiela et al. 2020), VizWiz (Gurari et al. 2018), and MSRVTT QA (Xu et al. 2017). For VizWiz, since there is no ground-truth answer for the test split, we use the validation split. For Hateful Memes, the test split also misses answers, so we picked the same number of examples from the training set as our evaluation data. InstructBLIP originally also used GQA (Hudson and Manning 2019) and iVQA (Yang et al. 2021); we contacted the authors for access to these datasets but have received no reply yet. As for MSVD-QA (Xu et al. 2017), the authors completely removed this dataset from their competition website. For the OCR task datasets, we followed (Liu et al. 2023b) to select OCR-VQA (Mishra et al. 2019), TextVQA (Singh et al. 2019), ST-VQA (Biten et al. 2022), DocVQA (Mathew, Karatzas, and Jawahar 2021), InfoVQA (Mathew et al. 2022), ChartQA (Masry et al. 2022), ESTVQA (Wang et al. 2020), FUNSD (Jaume, Ekenel, and Thiran 2019), SROIE (Huang et al. 2019), and POIE (Kuang et al. 2023).

Data Pre-processing
We followed (Dai et al. 2023) to use the same data pre-processing methods, which are attached below. The base class BlipImageBaseProcessor is provided by the LAVIS library and supplies self.normalize built from the given mean and standard deviation.

    from torchvision import transforms
    from torchvision.transforms import InterpolationMode


    class BlipImageTrainProcessor(BlipImageBaseProcessor):
        def __init__(self, image_size=224, mean=None, std=None,
                     min_scale=0.5, max_scale=1.0):
            super().__init__(mean=mean, std=std)

            self.transform = transforms.Compose(
                [
                    # Randomly crop and resize to image_size x image_size (bicubic).
                    transforms.RandomResizedCrop(
                        image_size,
                        scale=(min_scale, max_scale),
                        interpolation=InterpolationMode.BICUBIC,
                    ),
                    transforms.RandomHorizontalFlip(),
                    transforms.ToTensor(),
                    self.normalize,
                ]
            )

        def __call__(self, item):
            return self.transform(item)

The first class, BlipImageTrainProcessor, is used to pre-process the images for training. Specifically, it randomly crops and resizes images to 224 x 224 with bicubic interpolation, possibly flips them horizontally, converts them to tensor format, and normalizes them using mean = (0.48145466, 0.4578275, 0.40821073) and standard deviation = (0.26862954, 0.26130258, 0.27577711).

    class BlipImageEvalProcessor(BlipImageBaseProcessor):
        def __init__(self, image_size=224, mean=None, std=None):
            super().__init__(mean=mean, std=std)

            self.transform = transforms.Compose(
                [
                    # Deterministic resize to image_size x image_size (bicubic).
                    transforms.Resize(
                        (image_size, image_size),
                        interpolation=InterpolationMode.BICUBIC,
                    ),
                    transforms.ToTensor(),
                    self.normalize,
                ]
            )

The second class, BlipImageEvalProcessor, is designed to preprocess images for evaluation. It resizes images to the specified size using bicubic interpolation, converts them to tensor format, and then normalizes them using the same mean and standard deviation values as the BlipImageTrainProcessor.

Implementation Details
We selected the ViT-G/14 from EVA-CLIP (Sun et al. 2023) as our visual encoder. The pre-trained weights are initialized and remain frozen during training. We removed the last layer of the ViT (Dosovitskiy et al. 2020) and opted to use the output features of the second-to-last layer, which yielded slightly better performance. We first pre-train our patch embedding projection layer using the LLaVA-filtered 558K image-text pairs from LAION (Schuhmann et al. 2021), CC-3M (Sharma et al. 2018), and SBU (Ordonez, Kulkarni, and Berg 2011), captioned by BLIP (Li et al. 2022). Using the pre-training stage leads to slightly better performance. During the vision-language instruction tuning stage, we initialize the Q-Former from InstructBLIP's weights and finetune the parameters of the Q-Former and projection layer together while keeping both the image encoder and the LLM frozen. We pre-trained the projection layer for 3 epochs with a batch size of 64. During the instruction finetuning stage, we employ a batch size of 24 with a maximum of 200K steps, which roughly iterates over two epochs of the training data. For both training stages, we used the AdamW (Loshchilov and Hutter 2017) optimizer with β1 = 0.9, β2 = 0.999, and a weight decay of 0.05. Additionally, we apply a linear warmup of the learning rate during the initial 1K steps, increasing from 10^-8 to 10^-5, followed by a cosine decay to a minimum learning rate of 0. The pre-training stage takes 6 hours, and the instruction finetuning stage finishes within two days on 8 NVIDIA A6000 Ada (48G) GPUs.
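The optimizer and schedule described above can be written down compactly. The sketch below uses PyTorch's LambdaLR for the linear warmup followed by cosine decay, with the stated hyperparameters (1K warmup steps, peak learning rate 1e-5, initial 1e-8, minimum 0); everything else is an assumption of convenience, not the exact training code.

    import math
    import torch

    def build_optimizer_and_scheduler(params, total_steps=200_000, warmup_steps=1_000,
                                      peak_lr=1e-5, init_lr=1e-8, min_lr=0.0):
        """AdamW (beta1=0.9, beta2=0.999, weight decay 0.05) with linear warmup
        from init_lr to peak_lr over warmup_steps, then cosine decay to min_lr."""
        optimizer = torch.optim.AdamW(params, lr=peak_lr, betas=(0.9, 0.999),
                                      weight_decay=0.05)

        def lr_lambda(step):
            if step < warmup_steps:
                # Linear warmup, expressed as a multiplier of peak_lr.
                return (init_lr + (peak_lr - init_lr) * step / warmup_steps) / peak_lr
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
            return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

        scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
        return optimizer, scheduler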

Qualitative Examples

Figure 5 and Figure 6 showcase additional examples of our model's performance on various types of images, including the recognition of road signs and shopping products. Besides recognizing textual information, our model can also identify spatial relationships. These examples demonstrate the model's practical applicability and relevance in everyday life, as it can accurately interpret and analyze visual information that people commonly encounter, such as navigating traffic or identifying products in a store.

Figure 6: Example of BLIVA’s performance on the web page


and meme images. BLIVA’s reply shows its understanding
of the visual information and the meaning behind both text
and image. It can localize both the text and objects in the
image clearly. BLIVA demonstrates great OCR capabilities
in reading the text in memes and understanding the tabs on
the web pages.

Figure 5: Example of BLIVA's performance on real-life-scene and movie poster images. BLIVA's reply is strictly based on visual information, with the ability to localize the objects in the image. BLIVA also demonstrates great OCR capabilities in reading road signs, food packaging, movie poster titles, and detailed texts.

YTTB-VQA Category Distribution

Figure 7: Category Distribution in YTTB-VQA: This chart represents the distribution across 11 distinct categories within the YTTB-VQA dataset. These categories encompass a broad spectrum, including technology, sports, entertainment, food, news, history, music, nature, cars, and education.
