BLIVA - A Simple Multimodal LLM For Better Handling of Text-Rich Visual Questions
Wenbo Hu*1, Yifan Xu*2, Yi Li1, Weiyue Li1, Zeyuan Chen1, Zhuowen Tu1
1 UC San Diego   2 Coinbase Global, Inc.
{w1hu, yil115, wel019, zec016, ztu}@ucsd.edu   [email protected]
Figure 1: Comparison of various VLM approaches. Both (a) Flamingo (Alayrac et al. 2022) and (b) BLIP-2 / InstructBLIP (Li et al. 2023b; Dai et al. 2023) architectures utilize a fixed, small set of query embeddings, which are used to compress visual information for transfer to the LLM. In contrast, (c) LLaVA aligns the encoded patch embeddings directly with the LLM. (d) BLIVA (Ours) builds upon these methods by merging learned query embeddings with additional encoded patch embeddings.
an effective method for interpreting text within images.
• Our experimental results showcase that BLIVA provides improvements in the understanding of text within images while maintaining robust performance on general (not particularly text-rich) VQA benchmarks, and achieves the best performance on the MME benchmark among previous methods.
• To underscore the real-world applicability of BLIVA, we evaluate the model using a new dataset of YouTube thumbnails with associated question-answer pairs.

Related Work

Multimodal Large Language Model
Large Language Models (LLMs) have demonstrated impressive zero-shot abilities across various open-ended tasks. Recent research has explored the application of LLMs for multimodal generation to understand visual inputs. Some approaches leverage a pre-trained LLM to build unified models for multi-modality. For example, Flamingo (Alayrac et al. 2022) connects the vision encoder and LLM by a Perceiver Resampler, which exhibits impressive few-shot performance. Additionally, BLIP-2 (Li et al. 2023b) designs a Q-Former to align the visual features with OPT (Zhang et al. 2022) and FLAN-T5 (Wei et al. 2021). MiniGPT-4 (Zhu et al. 2023) employed the same Q-Former but changed the LLM to Vicuna (Zheng et al. 2023). Some approaches also finetune the LLM for better alignment with visual features: LLaVA (Liu et al. 2023a) directly finetuned the LLM, and mPLUG-Owl (Ye et al. 2023) performs low-rank adaptation (LoRA) (Hu et al. 2022) to finetune a LLaMA model (Touvron et al. 2023). PandaGPT (Su et al. 2023) also employed LoRA to finetune a Vicuna model on top of ImageBind (Girdhar et al. 2023), which can take multimodal inputs besides visual. While sharing the same two-stage training paradigm, we focus on developing an end-to-end multimodal model for both text-rich VQA benchmarks and general VQA benchmarks.

Multimodal instruction tuning
Instruction tuning has been shown to improve the generalization performance of language models on unseen tasks. In the natural language processing (NLP) community, some approaches collect instruction-tuning data by converting existing NLP datasets into instruction format (Wang et al. 2022b; Wei et al. 2021; Sanh et al. 2022; Chung et al. 2022), while others use LLMs to generate instruction data (Taori et al. 2023; Zheng et al. 2023; Wang et al. 2023; Honovich et al. 2022). Recent research has expanded instruction tuning to multimodal settings. In particular, for image-based instruction tuning, MiniGPT-4 (Zhu et al. 2023) employs human-curated instruction data during the finetuning stage. LLaVA (Liu et al. 2023a) generates 156K multimodal instruction-following examples by prompting GPT-4 (OpenAI 2023) with image captions and bounding-box coordinates. mPLUG-Owl (Ye et al. 2023) also employs 400K mixed text-only and multimodal instruction data for finetuning. Instruction tuning has also enhanced previous vision-language foundation models' performance. For example, MultimodalGPT (Gong et al. 2023) designed various instruction templates that incorporate vision and language data for multi-modality instruction tuning of OpenFlamingo (Awadalla et al. 2023). (Xu, Shen, and Huang 2023) built a multimodal instruction tuning benchmark dataset that consists of 62 diverse multimodal tasks in a unified seq-to-seq format and finetuned OFA (Wang et al. 2022a).
Figure 2: Model architecture of BLIVA. BLIVA uses a Q-Former to draw out instruction-aware visual features from the patch embeddings generated by a frozen image encoder. These learned query embeddings are then fed as soft prompt inputs into the frozen large language model (LLM). Additionally, the system repurposes the originally encoded patch embeddings through a fully connected projection layer, serving as a supplementary source of visual information for the frozen LLM.
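To make the data flow in Figure 2 concrete, the following is a minimal PyTorch-style sketch of the fusion step. The module names (query_proj, patch_proj) and the dimensions (1408 for the vision encoder, 768 for the Q-Former, 4096 for a Vicuna-7B-sized LLM) are assumptions for illustration; this is a sketch of the idea, not the released implementation.

import torch
import torch.nn as nn

class BlivaStyleFusion(nn.Module):
    # Hypothetical sketch: module names and dimensions are assumptions.
    def __init__(self, vision_dim=1408, qformer_dim=768, llm_dim=4096):
        super().__init__()
        self.query_proj = nn.Linear(qformer_dim, llm_dim)  # projects Q-Former query outputs
        self.patch_proj = nn.Linear(vision_dim, llm_dim)   # projects raw encoded patch embeddings

    def forward(self, query_embeds, patch_embeds, text_embeds):
        # query_embeds: (B, num_queries, qformer_dim) from the Q-Former
        # patch_embeds: (B, num_patches, vision_dim) from the frozen vision encoder
        # text_embeds:  (B, seq_len, llm_dim) question token embeddings from the LLM
        visual = torch.cat(
            [self.query_proj(query_embeds), self.patch_proj(patch_embeds)], dim=1
        )
        # Combined visual embeddings are appended right after the question text embeddings.
        return torch.cat([text_embeds, visual], dim=1)

The key design choice is simply concatenating the two visual streams along the sequence dimension, so the frozen LLM sees both the compressed, instruction-aware queries and the full set of patch features.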
MIMIC-IT (Li et al. 2023a) built a bigger dataset comprising 2.8 million multimodal instruction-response pairs to train a stronger model, Otter (Li et al. 2023a). We also employed instruction tuning data following the same prompt as InstructBLIP (Dai et al. 2023) to demonstrate the effectiveness of utilizing additional encoded patch embeddings.

Method

Architecture Overview
As illustrated in Figure 1, there are mainly two types of end-to-end multimodal LLMs: 1) models that utilize learned query embeddings for the LLM. For instance, MiniGPT-4 (Zhu et al. 2023) used the frozen Q-Former module from BLIP-2 (Li et al. 2023b) to extract image features by querying the CLIP vision encoder, and Flamingo (Alayrac et al. 2022) employed a Perceiver Resampler, which reduces image features to a fixed number of visual outputs for the LLM. 2) Models that directly employ image-encoded patch embeddings, such as LLaVA (Liu et al. 2023a), which connects its vision encoder to the LLM using an MLP. Nevertheless, these models exhibit certain constraints. Models that employ learned query embeddings for the LLM help it better understand the vision encoder's output, but may miss crucial information from the encoded patch embeddings. On the other hand, models that directly use encoded image patch embeddings through a linear projection layer might have limited capability in capturing all the information required by the LLM. To address this, we introduce BLIVA, a multimodal LLM designed to incorporate both learned query embeddings, which are more closely aligned with the LLM, and image-encoded patch embeddings, which carry richer image information. In particular, Figure 2 illustrates that our model incorporates a vision tower, which encodes visual representations from the input image into encoded patch embeddings. These patch embeddings are then sent separately to the Q-Former, to extract refined learned query embeddings, and to the projection layer, allowing the LLM to grasp the rich visual knowledge. We concatenate the two types of embeddings and feed them directly to the LLM. These combined visual embeddings are appended immediately after the question text embeddings to serve as the final input to the LLM. During inference, we employ beam search to select the best generated output. Conversely, for classification and multi-choice VQA benchmarks, we adopt the vocabulary ranking method outlined in InstructBLIP (Dai et al. 2023): given prior knowledge of a list of candidates, we calculate the log-likelihood of each one and choose the one with the highest value as the final prediction. To support another version of our architecture for commercial usage, we also selected FlanT5-XXL as our LLM; this is named BLIVA (FlanT5XXL) in this paper.
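The vocabulary-ranking inference described above can be sketched as follows. This is an illustrative snippet rather than the official evaluation code: llm and tokenizer stand for any Hugging Face-style causal LM and its tokenizer, and for brevity the prompt is text-only (in BLIVA the visual embeddings described earlier are part of the conditioning context).

import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_candidates(llm, tokenizer, prompt, candidates):
    # Hypothetical sketch of log-likelihood ranking over a known answer list.
    scores = []
    for cand in candidates:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        full_ids = tokenizer(prompt + " " + cand, return_tensors="pt").input_ids
        logits = llm(full_ids).logits  # (1, seq_len, vocab)
        # Log-probability of each answer token given the preceding context.
        log_probs = F.log_softmax(logits[:, :-1, :], dim=-1)
        answer_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
        score = sum(
            log_probs[0, t, full_ids[0, t + 1]].item() for t in answer_positions
        )
        scores.append(score)
    # The candidate with the highest total log-likelihood is the prediction.
    return candidates[max(range(len(candidates)), key=lambda i: scores[i])]

Because the candidate list is known in advance for these benchmarks, this avoids free-form decoding and scores each option directly.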
Two-stage Training Scheme
We adopted the typical two-stage training scheme: 1) In the pre-training stage, the goal is to align the LLM with visual information using image-text pairs from image captioning datasets that provide global descriptions of images. 2) After pre-training, the LLM becomes familiar with the visual embedding space and can generate descriptions of images, but it still lacks the capability to discern the finer details of images and respond to human questions. In the second stage, we use instruction tuning data to enhance performance and further align the visual embeddings with the LLM and human values. Recent methods have predominantly adopted this two-stage training approach (Zhu et al. 2023; Liu et al. 2023a; Ye et al. 2023), although PandaGPT (Su et al. 2023), which utilizes a one-stage training method, has also demonstrated commendable results. In BLIVA, our visual assistant branch, specifically the encoded patch embeddings, diverges from the approach of BLIP-2 (Li et al. 2023b), which uses a 129M pre-training dataset. Instead, it leverages a more compact 0.5M pre-training caption dataset following (Liu et al. 2023a). This presents a more efficient strategy for aligning the visual encoder and LLM in the first stage. We employed the language modeling loss as our training objective: the model learns to generate subsequent tokens based on the preceding context.

Figure 3: A typical multi-stage VLM training paradigm. The training process involves two key stages. For the Q-Former, the first stage is done by (Li et al. 2023b), where image and text caption pairs are pre-trained to accomplish a raw alignment between the visual and language modalities. As for the patch feature, we followed (Liu et al. 2023a) to use the same pre-training dataset. In the second stage, the alignment is further refined using instruction tuning VQA data, which facilitates a more detailed understanding of visual input based on language instructions.

Thumbnails Dataset
To showcase the wide-ranging industry applications made feasible by BLIVA, we assess the model by introducing a new evaluation dataset, named YTTB-VQA, which consists of 400 YouTube thumbnail visual question-answer pairs to evaluate visual perception abilities on in-text images. It covers 11 different categories, which are illustrated in Appendix Figure 7. During data collection, we randomly selected YouTube videos with text-rich thumbnails from different categories. We recorded the unique video ID for each YouTube video and obtained the high-resolution thumbnail from the URL "http://img.youtube.com/vi/<YouTube-Video-ID>/maxresdefault.jpg". After retrieving all the YouTube thumbnails, we created the annotation file with the following fields: "video id", representing the unique identification of a specific YouTube video; "question", representing the human-made question based on the text and image in the thumbnail; "video classes", representing the 11 video categories; "answers", representing the ground truth answer; and "video link", representing the URL link for each YouTube video. Our YouTube thumbnail dataset is available at https://huggingface.co/datasets/mlpc-lab/YTTB-VQA.
We also provide two sample scenarios from the YTTB-VQA dataset. Figure 4 illustrates BLIVA's capability to provide detailed captions and answer users' visual questions.

Experiment
In this section, we conduct extensive experiments and analyses to show the efficacy of our model. We evaluate our model, baseline, and other SOTA models on 10 OCR-related tasks and 8 general (not particularly text-rich) VQA benchmarks, including image captioning, image question answering, visual reasoning, visual conversational QA, image classification, and video question answering. We also evaluate on a comprehensive multimodal LLM benchmark (MME). We seek to answer the following:
• How does our proposed method compare to alternative single image embedding approaches on text-rich VQA, general VQA benchmarks, and the MME benchmark?
• How do the individual components of our method influence its success?
• How does BLIVA enhance the recognition of YouTube thumbnails?

Datasets
To demonstrate the effectiveness of patch embeddings, we followed (Dai et al. 2023) to use the same training and evaluation data unless mentioned explicitly. Detailed dataset information can be found in the Appendix.

Implementation Details
We selected the ViT-G/14 from EVA-CLIP (Sun et al. 2023) as our visual encoder. In line with InstructBLIP, we employed Vicuna-7B, which has been instruction-tuned from LLaMA (Touvron et al. 2023), as our LLM. Additional details can be found in the Appendix.

Results & Discussions
We introduce our results in the context of each of our three questions and discuss our main findings.
1. How does our proposed method compare to alternative single image embedding approaches on text-rich VQA, general VQA benchmarks, and the MME benchmark?

Zero-shot evaluation for text-rich VQA benchmarks
We compared our model with state-of-the-art multimodal LLMs. This includes LLaVA, which showcases robust OCR capabilities using only patch embeddings. We also considered BLIP-2's previous best version, BLIP2-FlanT5XXL; the state-of-the-art vision-language model mPLUG-Owl (trained on a vast amount of both text and vision-text data); and our baseline, InstructBLIP. The results are illustrated in Table 1. Our model consistently shows significant improvement across all the text-rich VQA datasets compared to InstructBLIP. Note that since InstructBLIP utilized OCR-VQA as a training dataset, the comparison for this specific dataset isn't zero-shot. We evaluated both InstructBLIP and our model using the OCR-VQA validation set. BLIVA achieved state-of-the-art results on 6 of the text-rich datasets, while mPLUG-Owl performed the best on 4 datasets. Compared to mPLUG-Owl, which employs about 1104M image captioning examples in its pre-training stage, BLIVA only employs 558K image caption pairs, which could explain why BLIVA is not performing the best on information-based VQA tasks such as InfoVQA, ChartQA and ESTVQA. BLIVA demonstrates the best performance on average compared to all previous methods, underscoring our design choice to employ learned query embeddings, further aided by encoded patch embeddings.
Figure 4: Two Sample Scenarios from the YTTB-VQA Dataset. This dataset demonstrates the dual application of BLIVA.
The first scenario highlights BLIVA’s capability to provide detailed captions that encompass all visual information within an
image. The second scenario showcases BLIVA’s utility in summarizing visual data into concise captions, followed by its ability
to field more detailed visual queries posed by users.
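To make the annotation schema from the Thumbnails Dataset section concrete, a single YTTB-VQA record could look like the sketch below. The field names and the thumbnail URL pattern come from the description above; every value, and the use of a Python dict, is an invented placeholder rather than an actual dataset entry.

video_id = "AbCdEfGhIjK"  # placeholder YouTube video ID, not a real dataset entry
record = {
    "video id": video_id,                                              # unique identifier of the video
    "question": "What product is being reviewed in this thumbnail?",   # hypothetical question
    "video classes": "technology",                                     # one of the 11 categories
    "answers": "a wireless headphone",                                  # hypothetical ground-truth answer
    "video link": "https://www.youtube.com/watch?v=" + video_id,
}
# High-resolution thumbnail fetched from the URL pattern given in the paper.
thumbnail_url = "http://img.youtube.com/vi/" + video_id + "/maxresdefault.jpg"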
Models  STVQA ↑  OCRVQA ↑  TextVQA ↑  DocVQA ↑  InfoVQA ↑  ChartQA ↑  ESTVQA ↑  FUNSD ↑  SROIE ↑  POIE ↑  Average ↑
OpenFlamingo (Awadalla et al. 2023) 19.32 27.82 29.08 5.05 14.99 9.12 28.20 0.85 0.12 2.12 13.67
BLIP2-OPT6.7b (Li et al. 2023b) 13.36 10.58 21.18 0.82 8.82 7.44 27.02 0.00 0.00 0.02 8.92
BLIP2-FLanT5XXL (Li et al. 2023b) 21.38 30.28 30.62 4.00 10.17 7.20 42.46 1.19 0.20 2.52 15.00
MiniGPT4 (Zhu et al. 2023) 14.02 11.52 18.72 2.97 13.32 4.32 28.36 1.19 0.04 1.31 9.58
LLaVA (Liu et al. 2023a) 22.93 15.02 28.30 4.40 13.78 7.28 33.48 1.02 0.12 2.09 12.84
mPLUG-Owl (Ye et al. 2023) 26.32 35.00 37.44 6.17 16.46 9.52 49.68 1.02 0.64 3.26 18.56
InstructBLIP (FlanT5XXL ) (Dai et al. 2023) 26.22 55.04 36.86 4.94 10.14 8.16 43.84 1.36 0.50 1.91 18.90
InstructBLIP (Vicuna-7B) (Dai et al. 2023) 28.64 47.62 39.60 5.89 13.10 5.52 47.66 0.85 0.64 2.66 19.22
BLIVA (FlanT5XXL ) 28.24 61.34 39.36 5.22 10.82 9.28 45.66 1.53 0.50 2.39 20.43
BLIVA (Vicuna-7B) 29.08 65.38 42.18 6.24 13.50 8.16 48.14 1.02 0.88 2.91 21.75
Table 1: Zero-Shot OCR-Free Results on Text-Rich VQA benchmarks. This table presents the accuracy (%) results for
OCR-free methods, implying no OCR-tokens were used. Note that our work follows InstructBLIP which incorporated OCR-
VQA in its training dataset, thus inevitably making OCR-VQA evaluation not zero-shot.
Zero-shot evaluation for general (not particularly text-rich) VQA benchmarks
Next, we compared BLIVA with models that employ single image features. Results are given in Table 2, and in Table 3 for LLMs available for commercial use. Our model consistently and significantly outperformed the original InstructBLIP model on VSR, IconQA, TextVQA, Visual Dialog, Hateful Memes, MSRVTT, and Flickr30K. For VizWiz, our model nearly matched InstructBLIP's performance. This naturally raises the question: why didn't the additional visual assistance improve all the benchmarks? We speculate that the additional visual information didn't aid the VizWiz task. We continue to investigate this phenomenon in the ablation study section that follows. Overall, our design not only achieved significant improvements in understanding text-rich images but also improves 7 out of 8 general VQA benchmarks.

MME Benchmark
We further evaluated BLIVA on a comprehensive multimodal LLM benchmark, MME (Fu et al. 2023). As illustrated in Table 4, BLIVA demonstrates the best performance among current methods on average for both the perception and cognition tasks. For all text-rich tasks, such as OCR, Poster, Numerical Calculation, Text Translation, and Code Reasoning, BLIVA outperforms InstructBLIP. BLIVA achieved top-2 performance across all the tasks except Artwork and Landmark, which demand extensive informational knowledge. This is consistent with our findings on informational VQA, indicating that our light-weight pre-training stage and the missing LAION-115M web image caption dataset during the instruction tuning stage both likely contribute to a degradation in BLIVA's internet knowledge.
Models VSR ↑ IconQA ↑ TextVQA ↑ Visdial ↑ Flickr30K ↑ HM ↑ VizWiz ↑ MSRVTT ↑
(val) (val-dev) (val-dev)
Flamingo-3B (Alayrac et al. 2022) - - 30.1 - 60.6 - - -
Flamingo-9B (Alayrac et al. 2022) - - 31.8 - 61.5 - - -
Flamingo-80B (Alayrac et al. 2022) - - 35.0 - 67.2 - - -
MiniGPT-4 (Zhu et al. 2023) 50.65 - 18.56 - - 29.0 34.78 -
LLaVA (Liu et al. 2023a) 56.3 - 37.98 - - 9.2 36.74 -
BLIP-2 (Vicuna-7B) (Dai et al. 2023) 50.0 39.7 40.1 44.9 74.9 50.2 49.34 4.17
InstructBLIP (Vicuna-7B) (Dai et al. 2023) 54.3 43.1 50.1 45.2 82.4 54.8 43.3 18.7
InstructBLIP Baseline (Vicuna-7B) 58.67 44.34 37.58 40.58 84.61 50.6 44.10 20.97
BLIVA (Vicuna-7B) 62.2 44.88 57.96 45.63 87.1 55.6 42.9 23.81
Table 2: Zero-shot results on general (not particularly text-rich) VQA benchmarks. Our baseline is obtained by directly
finetuning InstructBLIP (Dai et al. 2023). For the three datasets on the right, due to the unavailability of test-set answers,
we have evaluated them using validation dev. Here, Visdial and HM denote the Visual Dialog and Hateful Memes datasets,
respectively. Following previous works (Alayrac et al. 2022; Yang et al. 2021; Murahari et al. 2020), we report the CIDEr
score (Vedantam, Zitnick, and Parikh 2015) for Flickr30K, AUC score for Hateful Memes, and Mean Reciprocal Rank (MRR)
for Visual Dialog. For all remaining datasets, we report the top-1 accuracy (%). Notably, for Text-VQA, we have followed
InstructBLIP’s method of using OCR-tokens for comparison. While InstructBLIP also included GQA, iVQA, and MSVDQA,
we were unable to access these datasets due to either unresponsive authors or the datasets being removed from their websites.
For ScienceQA and Nocaps, we were unable to reproduce the results of InstructBLIP, hence their results are not reported here.
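Of the metrics listed above, Mean Reciprocal Rank is the least self-explanatory, so the short sketch below shows how it is computed in general. It is a generic illustration of the metric, not the Visual Dialog benchmark's official evaluation script.

def mean_reciprocal_rank(ranked_candidate_lists, ground_truths):
    # Each entry of ranked_candidate_lists is a list of candidate answers sorted
    # from most to least likely; ground_truths holds the correct answer per example.
    reciprocal_ranks = []
    for candidates, truth in zip(ranked_candidate_lists, ground_truths):
        rank = candidates.index(truth) + 1  # 1-based rank of the correct answer
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Example: the correct answer ranked 1st and 4th in two turns -> MRR = (1 + 0.25) / 2
print(mean_reciprocal_rank([["a", "b"], ["x", "y", "z", "w"]], ["a", "w"]))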
Table 3: Zero-shot results on general (not particularly text-rich) VQA benchmarks for models with an open LLM eligible for commercial use. Here, the commercially usable LLM we report is FlanT5XXL. As in Table 2, we report the same evaluation datasets with the same evaluation metrics.
Table 4: Evaluation on the MME benchmark. Here we report the results on all the subtasks, including Existence (Exist.), Count, Position (Pos.), Color, OCR, Poster, Celebrity (Cele.), Scene, Landmark (Land.), Artwork (Art.), Commonsense Reasoning (Comm.), Numerical Calculation (NumCal.), Text Translation (Trans.), Code Reasoning (Code), and the task-level average (Avg.). We bold the highest overall and average scores and underline the top-2 models for each subtask.
InstructBLIP (Dai et al. 2023)  Baseline (Instruction Tuning Q-Former)  Patch Embedding  Pre-Training  Finetuning LLM  STVQA  OCRVQA  TextVQA  DocVQA  InfoVQA  ChartQA  ESTVQA  FUNSD  SROIE  POIE  Improvement
✓ 28.64 47.62 39.60 5.89 13.10 5.52 47.66 0.85 0.64 2.66 +0%
✓ ✓ 30.08 65.8 40.5 6.13 12.03 8.08 47.02 0.85 0.57 2.62 + 7.40%
✓ ✓ ✓ 28.86 65.04 40.7 6.65 14.28 8.24 47.72 1.19 1.66 2.83 + 31.72%
✓ ✓ ✓ ✓ 29.08 65.38 42.18 6.24 13.50 8.16 48.14 1.02 0.88 2.91 + 17.01%
✓ ✓ ✓ ✓ ✓ 29.94 66.48 41.9 6.47 12.51 7.52 46.76 1.02 0.51 2.85 + 9.65%
Table 5: Results of adding the individual techniques of our framework on text-rich VQA benchmarks. We include four ablations that accumulate each technique: (i) baseline: instruction tuning InstructBLIP's Q-Former; (ii) instruction tuning patch embeddings; (iii) the pre-training stage of patch embeddings; (iv) finetuning the LLM with LoRA during the instruction tuning stage.
Table 6: Results of adding individual techniques of our framework in general (not particularly text-rich) VQA bench-
marks. We include four ablations that accumulate each technique same as in Table 5.
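Ablation (iv) finetunes the LLM with LoRA during instruction tuning. The snippet below is a hedged sketch of how such low-rank adapters are commonly attached with the peft library; the paper does not state the rank, scaling, dropout, or target modules, so those values (and the checkpoint name) are assumptions for illustration.

from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the (otherwise frozen) language model; the checkpoint name is illustrative.
llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-7b-v1.5")

lora_config = LoraConfig(
    r=8,                                  # assumed low-rank dimension
    lora_alpha=16,                        # assumed scaling factor
    lora_dropout=0.05,                    # assumed dropout
    target_modules=["q_proj", "v_proj"],  # assumed attention projections to adapt
    bias="none",
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_config)
llm.print_trainable_parameters()  # only the LoRA adapters are trainable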
Datasets
We followed (Dai et al. 2023) to use the same training and evaluation data unless mentioned explicitly. Due to the illegal contents involved in the LAION-115M dataset (Schuhmann et al. 2021), we cannot download it securely through the university internet. Besides lacking this subset of image captioning samples, we keep all other training data the same. It includes MSCOCO (Lin et al. 2015) for image captioning, TextCaps (Sidorov et al. 2020), VQAv2 (Goyal et al. 2017), OKVQA (Marino et al. 2019), A-OKVQA (Schwenk et al. 2022), OCR-VQA (Mishra et al. 2019) and LLaVA-Instruct-150K (Liu et al. 2023a). For evaluation datasets, we also follow (Dai et al. 2023) but only keep the Flickr30K (Young et al. 2014), VSR (Liu, Emerson, and Collier 2023), IconQA (Lu et al. 2022), TextVQA (Singh et al. 2019), Visual Dialog (Das et al. 2017), Hateful Memes (Kiela et al. 2020), VizWiz (Gurari et al. 2018), and MSRVTT QA (Xu et al. 2017) datasets. Here, for VizWiz, since there is no ground truth answer for the test split, we choose to use the validation split. For Hateful Memes, the test split also misses answers, so we picked the same number of examples from the training set as our evaluation data. InstructBLIP originally also had GQA (Hudson and Manning 2019) and iVQA (Yang et al. 2021); we contacted the authors for access to these datasets but have received no reply yet. As for MSVDQA (Xu et al. 2017), the authors completely removed this dataset from their competition website. For OCR task datasets, we followed (Liu et al. 2023b) to select OCR-VQA (Mishra et al. 2019), Text-VQA (Singh et al. 2019), ST-VQA (Biten et al. 2022), DOC-VQA (Mathew, Karatzas, and Jawahar 2021), InfoVQA (Mathew et al. 2022), ChartQA (Masry et al. 2022), ESTVQA (Wang et al. 2020), FUNSD (Jaume, Ekenel, and Thiran 2019), SROIE (Huang et al. 2019), and POIE (Kuang et al. 2023).

Data Pre-processing
We followed (Dai et al. 2023) to use the same data pre-processing methods, which are attached below.

from torchvision import transforms
from torchvision.transforms import InterpolationMode


class BlipImageTrainProcessor(BlipImageBaseProcessor):
    # BlipImageBaseProcessor (defined elsewhere in the codebase) builds
    # self.normalize from the given mean and standard deviation.
    def __init__(
        self, image_size=224, mean=None, std=None, min_scale=0.5, max_scale=1.0
    ):
        super().__init__(mean=mean, std=std)

        self.transform = transforms.Compose(
            [
                transforms.RandomResizedCrop(
                    image_size,
                    scale=(min_scale, max_scale),
                    interpolation=InterpolationMode.BICUBIC,
                ),
                transforms.RandomHorizontalFlip(),
                transforms.ToTensor(),
                self.normalize,
            ]
        )

    def __call__(self, item):
        return self.transform(item)

The first class, BlipImageTrainProcessor, is used to pre-process the images for training. Specifically, it randomly crops and resizes images to 224 * 224 with bicubic interpolation, possibly flips them horizontally, converts them to tensor format, and normalizes them using mean = (0.48145466, 0.4578275, 0.40821073) and standard deviation = (0.26862954, 0.26130258, 0.27577711).

class BlipImageEvalProcessor(BlipImageBaseProcessor):
    def __init__(self, image_size=224, mean=None, std=None):
        super().__init__(mean=mean, std=std)

        self.transform = transforms.Compose(
            [
                transforms.Resize(
                    (image_size, image_size),
                    interpolation=InterpolationMode.BICUBIC,
                ),
                transforms.ToTensor(),
                self.normalize,
            ]
        )

The second class, BlipImageEvalProcessor, is designed to preprocess images for evaluation. It resizes images to a specified size using bicubic interpolation, converts them to tensor format, and then normalizes them using the same mean and standard deviation values as the BlipImageTrainProcessor.

Implementation Details
We selected the ViT-G/14 from EVA-CLIP (Sun et al. 2023) as our visual encoder. The pre-trained weights are initialized and remain frozen during training. We removed the last layer from the ViT (Dosovitskiy et al. 2020) and opted to use the output features of the second-to-last layer, which yielded slightly better performance. We first pre-train our patch embedding projection layer using the LLaVA-filtered 558K image-text pairs from LAION (Schuhmann et al. 2021), CC-3M (Sharma et al. 2018), and SBU (Ordonez, Kulkarni, and Berg 2011), captioned by BLIP (Li et al. 2022). Using the pre-training stage leads to slightly better performance. During the vision-language instruction tuning stage, we initialize the Q-Former from InstructBLIP's weights and finetune the parameters of the Q-Former and projection layer together while keeping both the image encoder and LLM frozen. We pre-trained the projection layer for 3 epochs with a batch size of 64. During the instruction finetuning stage, we employ a batch size of 24 with a maximum of 200K steps, which roughly iterates over two epochs of the training data. For both training stages, we used the AdamW (Loshchilov and Hutter 2017) optimizer, with β1 = 0.9, β2 = 0.999, and a weight decay of 0.05. Additionally, we apply a linear warmup of the learning rate during the initial 1K steps, increasing from 10^-8 to 10^-5, followed by a cosine decay with a minimum learning rate of 0. The pre-training stage takes 6 hours and the instruction finetuning stage finished within two days on 8 Nvidia A6000 Ada (48G) GPUs.
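To make the optimization recipe above concrete, the snippet below configures AdamW and a linear-warmup-plus-cosine-decay schedule with the stated values (β1 = 0.9, β2 = 0.999, weight decay 0.05, 1K warmup steps from 10^-8 to 10^-5, cosine decay to 0 over 200K steps). The scheduler code itself is our illustrative sketch, not the authors' training script, and the small linear layer only stands in for the real trainable modules.

import math
import torch

# Stand-in for the trainable modules (Q-Former and projection layer); the real
# model is much larger, but the optimizer and schedule are configured the same way.
model = torch.nn.Linear(10, 10)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-5,              # peak learning rate reached after warmup
    betas=(0.9, 0.999),
    weight_decay=0.05,
)

warmup_steps, total_steps = 1_000, 200_000
init_lr, peak_lr = 1e-8, 1e-5

def lr_lambda(step):
    # Linear warmup from 1e-8 to 1e-5 over the first 1K steps,
    # then cosine decay down to a minimum learning rate of 0.
    if step < warmup_steps:
        return (init_lr + (peak_lr - init_lr) * step / warmup_steps) / peak_lr
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)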
Qualitative Examples