BLIP-2: Bootstrapping Language-Image Pre-Training With Frozen Image Encoders and Large Language Models
Figure 1. Overview of BLIP-2's framework: vision-and-language representation learning and vision-to-language generative learning.
encoder and the frozen LLM, where it feeds the most useful visual feature for the LLM to output the desired text. In the first pre-training stage, we perform vision-language representation learning, which forces the Q-Former to learn the visual representation most relevant to the text. In the second pre-training stage, we perform vision-to-language generative learning by connecting the output of the Q-Former to a frozen LLM, and train the Q-Former such that its output visual representation can be interpreted by the LLM.

We name our VLP framework BLIP-2: Bootstrapping Language-Image Pre-training with frozen unimodal models. The key advantages of BLIP-2 include:

• BLIP-2 effectively leverages both frozen pre-trained image models and language models. We bridge the modality gap using a Q-Former pre-trained in two stages: a representation learning stage and a generative learning stage. BLIP-2 achieves state-of-the-art performance on various vision-language tasks including visual question answering, image captioning, and image-text retrieval.

• Powered by LLMs (e.g. OPT (Zhang et al., 2022), FlanT5 (Chung et al., 2022)), BLIP-2 can be prompted to perform zero-shot image-to-text generation that follows natural language instructions, which enables emerging capabilities such as visual knowledge reasoning, visual conversation, etc. (see Figure 4 for examples).

• Due to the use of frozen unimodal models and a lightweight Q-Former, BLIP-2 is more compute-efficient than existing state-of-the-art methods. For example, BLIP-2 outperforms Flamingo (Alayrac et al., 2022) by 8.7% on zero-shot VQAv2, while using 54× fewer trainable parameters. Furthermore, our results show that BLIP-2 is a generic method that can harvest more advanced unimodal models for better VLP performance.

2. Related Work

2.1. End-to-end Vision-Language Pre-training

Vision-language pre-training aims to learn multimodal foundation models with improved performance on various vision-and-language tasks. Depending on the downstream task, different model architectures have been proposed, including the dual-encoder architecture (Radford et al., 2021; Jia et al., 2021), the fusion-encoder architecture (Tan & Bansal, 2019; Li et al., 2021), the encoder-decoder architecture (Cho et al., 2021; Wang et al., 2021b; Chen et al., 2022b), and, more recently, the unified transformer architecture (Li et al., 2022; Wang et al., 2022b). Various pre-training objectives have also been proposed over the years, and have progressively converged to a few time-tested ones: image-text contrastive learning (Radford et al., 2021; Yao et al., 2022; Li et al., 2021; 2022), image-text matching (Li et al., 2021; 2022; Wang et al., 2021a), and (masked) language modeling (Li et al., 2021; 2022; Yu et al., 2022; Wang et al., 2022b).

Most VLP methods perform end-to-end pre-training using large-scale image-text pair datasets. As the model size keeps increasing, the pre-training can incur an extremely high computation cost. Moreover, it is inflexible for end-to-end pre-trained models to leverage readily-available unimodal pre-trained models, such as LLMs (Brown et al., 2020; Zhang et al., 2022; Chung et al., 2022).

2.2. Modular Vision-Language Pre-training

More similar to us are methods that leverage off-the-shelf pre-trained models and keep them frozen during VLP. Some methods freeze the image encoder, including the early work which adopts a frozen object detector to extract visual features (Chen et al., 2020; Li et al., 2020; Zhang et al., 2021), and the recent LiT (Zhai et al., 2022) which uses a frozen pre-trained image encoder for CLIP (Radford et al., 2021) pre-training. Some methods freeze the language model to use the knowledge from LLMs for vision-to-language generation tasks (Tsimpoukelli et al., 2021; Alayrac et al., 2022; Chen et al., 2022a; Tiong et al., 2022; Guo et al., 2022). The key challenge in using a frozen LLM is to align visual features to the text space. To achieve this, Frozen (Tsimpoukelli et al., 2021) finetunes an image encoder whose outputs are directly used as soft prompts for the LLM. Flamingo (Alayrac et al., 2022) inserts new cross-attention layers into the LLM to inject visual features, and pre-trains the new layers on billions of image-text pairs. Both methods adopt the language modeling loss, where the language model generates text conditioned on the image.

Different from existing methods, BLIP-2 can effectively and efficiently leverage both frozen image encoders and frozen LLMs for various vision-language tasks, achieving stronger performance at a lower computation cost.

3. Method

We propose BLIP-2, a new vision-language pre-training method that bootstraps from frozen pre-trained unimodal models. In order to bridge the modality gap, we propose a Querying Transformer (Q-Former) pre-trained in two stages: (1) a vision-language representation learning stage with a frozen image encoder, and (2) a vision-to-language generative learning stage with a frozen LLM. This section first introduces the model architecture of Q-Former, and then delineates the two-stage pre-training procedures.

3.1. Model Architecture

We propose Q-Former as the trainable module to bridge the gap between a frozen image encoder and a frozen LLM. It extracts a fixed number of output features from the image encoder, independent of input image resolution.
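To make this interface concrete before the details that follow, here is a minimal PyTorch sketch (our own illustration under assumed shapes, not the released implementation; the actual Q-Former is BERT-based, as described next): a fixed set of learnable queries cross-attends to a variable number of frozen image features and always returns an output of the same size.

```python
# Toy sketch of the "fixed number of output features" interface of a query-based bottleneck.
import torch
import torch.nn as nn

class QueryBottleneck(nn.Module):
    def __init__(self, num_queries=32, dim=768, image_dim=1024):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)   # learnable query embeddings
        self.img_proj = nn.Linear(image_dim, dim)                           # map frozen ViT features to query dim
        self.self_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=12, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_feats):                          # image_feats: (B, N_patches, image_dim)
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)      # (B, 32, 768)
        q = q + self.self_attn(q, q, q, need_weights=False)[0]                 # queries interact with each other
        kv = self.img_proj(image_feats)
        q = q + self.cross_attn(self.norm(q), kv, kv, need_weights=False)[0]   # queries extract visual information
        return q                                             # always (B, 32, 768), regardless of resolution

frozen_vit_feats = torch.randn(2, 257, 1024)                 # e.g. 257 x 1024 frozen ViT-L/14 features
print(QueryBottleneck()(frozen_vit_feats).shape)             # torch.Size([2, 32, 768])
```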
Figure 2. (Left) Model architecture of Q-Former and BLIP-2's first-stage vision-language representation learning objectives. We jointly optimize three objectives which force the queries (a set of learnable embeddings) to extract the visual representation most relevant to the text. (Right) The self-attention masking strategy for each objective to control query-text interaction.
As shown in Figure 2, Q-Former consists of two transformer submodules that share the same self-attention layers: (1) an image transformer that interacts with the frozen image encoder for visual feature extraction, and (2) a text transformer that can function as both a text encoder and a text decoder. We create a set number of learnable query embeddings as input to the image transformer. The queries interact with each other through self-attention layers, and interact with frozen image features through cross-attention layers (inserted every other transformer block). The queries can additionally interact with the text through the same self-attention layers. Depending on the pre-training task, we apply different self-attention masks to control query-text interaction. We initialize Q-Former with the pre-trained weights of BERTbase (Devlin et al., 2019), whereas the cross-attention layers are randomly initialized. In total, Q-Former contains 188M parameters. Note that the queries are considered as model parameters.

In our experiments, we use 32 queries, where each query has a dimension of 768 (the same as the hidden dimension of the Q-Former). We use Z to denote the output query representation. The size of Z (32 × 768) is much smaller than the size of the frozen image features (e.g. 257 × 1024 for ViT-L/14). This bottleneck architecture works together with our pre-training objectives to force the queries to extract the visual information that is most relevant to the text.

3.2. Bootstrap Vision-Language Representation Learning from a Frozen Image Encoder

In the representation learning stage, we connect Q-Former to a frozen image encoder and perform pre-training using image-text pairs. We aim to train the Q-Former such that the queries can learn to extract the visual representation that is most informative of the text. Inspired by BLIP (Li et al., 2022), we jointly optimize three pre-training objectives that share the same input format and model parameters. Each objective employs a different attention masking strategy between queries and text to control their interaction (see Figure 2).

Image-Text Contrastive Learning (ITC) learns to align the image representation and the text representation such that their mutual information is maximized. It achieves this by contrasting the image-text similarity of a positive pair against those of negative pairs. We align the output query representation Z from the image transformer with the text representation t from the text transformer, where t is the output embedding of the [CLS] token. Since Z contains multiple output embeddings (one from each query), we first compute the pairwise similarity between each query output and t, and then select the highest one as the image-text similarity. To avoid information leakage, we employ a unimodal self-attention mask, where the queries and text are not allowed to see each other. Due to the use of a frozen image encoder, we can fit more samples per GPU compared to end-to-end methods. Therefore, we use in-batch negatives instead of the momentum queue in BLIP.

Image-grounded Text Generation (ITG) loss trains the Q-Former to generate texts, given input images as the condition. Since the architecture of Q-Former does not allow direct interactions between the frozen image encoder and the text tokens, the information required for generating the text must first be extracted by the queries, and then passed to the text tokens via the self-attention layers. Therefore, the queries are forced to extract visual features that capture all the information about the text. We employ a multimodal causal self-attention mask to control query-text interaction, similar to the one used in UniLM (Dong et al., 2019). The queries can attend to each other but not to the text tokens. Each text token can attend to all queries and to its previous text tokens. We also replace the [CLS] token with a new [DEC] token as the first text token to signal the decoding task.

Image-Text Matching (ITM) aims to learn fine-grained alignment between image and text representation. It is a binary classification task where the model is asked to predict whether an image-text pair is positive (matched) or negative (unmatched). We use a bi-directional self-attention mask where all queries and texts can attend to each other. The output query embeddings Z thus capture multimodal information. We feed each output query embedding into a two-class linear classifier to obtain a logit, and average the logits across all queries as the output matching score. We adopt the hard negative mining strategy from Li et al. (2021; 2022) to create informative negative pairs.
Figure 3. BLIP-2’s second-stage vision-to-language generative pre-training, which bootstraps from frozen large language models (LLMs).
(Top) Bootstrapping a decoder-based LLM (e.g. OPT). (Bottom) Bootstrapping an encoder-decoder-based LLM (e.g. FlanT5). The
fully-connected layer adapts from the output dimension of the Q-Former to the input dimension of the chosen LLM.
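A minimal sketch of the connection shown in Figure 3, assuming a generic frozen decoder-based LLM interface (the layer sizes and the label convention are illustrative assumptions, not taken from the released code):

```python
# Sketch of the fully-connected adapter and soft visual prompts of Figure 3.
import torch
import torch.nn as nn

class SoftVisualPrompt(nn.Module):
    """FC layer adapting the Q-Former output width to the LLM embedding width."""
    def __init__(self, qformer_dim=768, llm_dim=2560):         # 2560 is illustrative (e.g. a mid-size OPT)
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, Z, text_embeds):
        # Z: (B, 32, qformer_dim) from the Q-Former; text_embeds: (B, T, llm_dim)
        visual_prompts = self.proj(Z)                           # (B, 32, llm_dim)
        return torch.cat([visual_prompts, text_embeds], dim=1)  # prepend the soft visual prompts

# Hypothetical usage with a frozen decoder-based LLM and the language-modeling loss;
# the 32 prompt positions are excluded from the loss (-100 is the usual ignore index):
# inputs_embeds = SoftVisualPrompt()(Z, llm.get_input_embeddings()(text_ids))
# labels = torch.cat([torch.full((B, 32), -100), text_ids], dim=1)
# loss = llm(inputs_embeds=inputs_embeds, labels=labels).loss   # gradients reach only Q-Former and the FC layer
```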
3.3. Bootstrap Vision-to-Language Generative Learning from a Frozen LLM

In the generative pre-training stage, we connect Q-Former (with the frozen image encoder attached) to a frozen LLM to harvest the LLM's generative language capability. As shown in Figure 3, we use a fully-connected (FC) layer to linearly project the output query embeddings Z into the same dimension as the text embeddings of the LLM. The projected query embeddings are then prepended to the input text embeddings. They function as soft visual prompts that condition the LLM on the visual representation extracted by the Q-Former. Since the Q-Former has been pre-trained to extract language-informative visual representation, it effectively functions as an information bottleneck that feeds the most useful information to the LLM while removing irrelevant visual information. This reduces the burden on the LLM to learn vision-language alignment, thus mitigating the catastrophic forgetting problem.

We experiment with two types of LLMs: decoder-based LLMs and encoder-decoder-based LLMs. For decoder-based LLMs, we pre-train with the language modeling loss, where the frozen LLM is tasked to generate the text conditioned on the visual representation from Q-Former. For encoder-decoder-based LLMs, we pre-train with the prefix language modeling loss, where we split a text into two parts. The prefix text is concatenated with the visual representation as input to the LLM's encoder. The suffix text is used as the generation target for the LLM's decoder.

3.4. Model Pre-training

Pre-training data. We use the same pre-training dataset as BLIP with 129M images in total, including COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), CC3M (Sharma et al., 2018), CC12M (Changpinyo et al., 2021), SBU (Ordonez et al., 2011), and 115M images from the LAION400M dataset (Schuhmann et al., 2021). We adopt the CapFilt method (Li et al., 2022) to create synthetic captions for the web images. Specifically, we generate 10 captions using the BLIPlarge captioning model, and rank the synthetic captions along with the original web caption based on the image-text similarity produced by a CLIP ViT-L/14 model. We keep the top two captions per image as training data and randomly sample one at each pre-training step.

Pre-trained image encoder and LLM. For the frozen image encoder, we explore two state-of-the-art pre-trained vision transformer models: (1) ViT-L/14 from CLIP (Radford et al., 2021) and (2) ViT-g/14 from EVA-CLIP (Fang et al., 2022). We remove the last layer of the ViT and use the second-to-last layer's output features, which leads to slightly better performance. For the frozen language model, we explore the unsupervised-trained OPT model family (Zhang et al., 2022) for decoder-based LLMs, and the instruction-trained FlanT5 model family (Chung et al., 2022) for encoder-decoder-based LLMs.

Pre-training settings. We pre-train for 250k steps in the first stage and 80k steps in the second stage. We use a batch size of 2320/1680 for ViT-L/ViT-g in the first stage and a batch size of 1920/1520 for OPT/FlanT5 in the second stage. During pre-training, we convert the frozen ViTs' and LLMs' parameters into FP16, except for FlanT5, where we use BFloat16. We found no performance degradation compared to using 32-bit models. Due to the use of frozen models, our pre-training is more computationally friendly than existing large-scale VLP methods. For example, using a single 16-A100(40G) machine, our largest model with ViT-g and FlanT5-XXL requires less than 6 days for the first stage and less than 3 days for the second stage.

The same set of pre-training hyper-parameters is used for all models. We use the AdamW (Loshchilov & Hutter, 2017) optimizer with β1 = 0.9, β2 = 0.98, and a weight decay of 0.05. We use a cosine learning rate decay with a peak learning rate of 1e-4 and a linear warmup of 2k steps. The minimum learning rate at the second stage is 5e-5. We use images of size 224×224, augmented with random resized cropping and horizontal flipping.
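The optimization recipe above can be sketched with standard PyTorch components as follows (a hypothetical helper, not the original training script; the 5e-5 minimum applies to the second stage as stated above):

```python
# Sketch of AdamW plus a linearly warmed-up cosine schedule matching the stated hyper-parameters.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def make_optimizer(params, total_steps, warmup_steps=2_000, peak_lr=1e-4, min_lr=5e-5):
    optimizer = AdamW(params, lr=peak_lr, betas=(0.9, 0.98), weight_decay=0.05)

    def lr_factor(step):                                     # multiplier applied to peak_lr
        if step < warmup_steps:                              # linear warmup over 2k steps
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
        return (min_lr + (peak_lr - min_lr) * cosine) / peak_lr

    return optimizer, LambdaLR(optimizer, lr_factor)

# e.g. the second pre-training stage runs for 80k steps
params = [torch.nn.Parameter(torch.zeros(1))]                # stand-in for the trainable Q-Former parameters
optimizer, scheduler = make_optimizer(params, total_steps=80_000)
```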
Figure 4. Selected examples of instructed zero-shot image-to-text generation using a BLIP-2 model with ViT-g and FlanT5-XXL, showing a wide range of capabilities including visual conversation, visual knowledge reasoning, visual commonsense reasoning, storytelling, personalized image-to-text generation, etc.
Table 1. Overview of BLIP-2 results on various zero-shot vision-language tasks. Compared with previous state-of-the-art models, BLIP-2 achieves the highest zero-shot performance while requiring the fewest trainable parameters during vision-language pre-training.
Table 3. Comparison with state-of-the-art image captioning methods on NoCaps and COCO Caption. All methods optimize the cross-
entropy loss during finetuning. C: CIDEr, S: SPICE, B@4: BLEU@4.
(Figure: zero-shot VQAv2 accuracy of BLIP-2 ViT-g OPT6.7B and BLIP-2 ViT-g FlanT5-XL. Table: #Trainable Params and VQAv2 val / test-dev accuracy for open-ended generation models.)
Flickr30K Zero-shot (1K test set) / COCO Fine-tuned (5K test set)
Model  #Trainable Params  Image → Text (R@1 R@5 R@10)  Text → Image (R@1 R@5 R@10)  Image → Text (R@1 R@5 R@10)  Text → Image (R@1 R@5 R@10)
Dual-encoder models
CLIP (Radford et al., 2021) 428M 88.0 98.7 99.4 68.7 90.6 95.2 - - - - - -
ALIGN (Jia et al., 2021) 820M 88.6 98.7 99.7 75.7 93.8 96.8 77.0 93.5 96.9 59.9 83.3 89.8
FILIP (Yao et al., 2022) 417M 89.8 99.2 99.8 75.0 93.4 96.3 78.9 94.4 97.4 61.2 84.3 90.6
Florence (Yuan et al., 2021) 893M 90.9 99.1 - 76.7 93.6 - 81.8 95.2 - 63.2 85.7 -
BEIT-3(Wang et al., 2022b) 1.9B 94.9 99.9 100.0 81.5 95.6 97.8 84.8 96.5 98.3 67.2 87.7 92.8
Fusion-encoder models
UNITER (Chen et al., 2020) 303M 83.6 95.7 97.7 68.7 89.2 93.9 65.7 88.6 93.8 52.9 79.9 88.0
OSCAR (Li et al., 2020) 345M - - - - - - 70.0 91.1 95.5 54.0 80.8 88.5
VinVL (Zhang et al., 2021) 345M - - - - - - 75.4 92.9 96.2 58.8 83.5 90.3
Dual encoder + Fusion encoder reranking
ALBEF (Li et al., 2021) 233M 94.1 99.5 99.7 82.8 96.3 98.1 77.6 94.3 97.2 60.7 84.3 90.5
BLIP (Li et al., 2022) 446M 96.7 100.0 100.0 86.7 97.3 98.7 82.4 95.4 97.9 65.1 86.3 91.8
BLIP-2 ViT-L 474M 96.9 100.0 100.0 88.6 97.6 98.9 83.5 96.0 98.0 66.3 86.5 91.8
BLIP-2 ViT-g 1.2B 97.6 100.0 100.0 89.7 98.1 98.9 85.4 97.0 98.5 68.3 87.7 92.6
Table 5. Comparison with state-of-the-art image-text retrieval methods, finetuned on COCO and zero-shot transferred to Flickr30K.
References

Agrawal, H., Anderson, P., Desai, K., Wang, Y., Chen, X., Jain, R., Johnson, M., Batra, D., Parikh, D., and Lee, S. nocaps: novel object captioning at scale. In ICCV, pp. 8947–8956, 2019.

Alayrac, J., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., and Simonyan, K. Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198, 2022.

Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), NeurIPS, 2020.

Changpinyo, S., Sharma, P., Ding, N., and Soricut, R. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, 2021.

Chen, J., Guo, H., Yi, K., Li, B., and Elhoseiny, M. Visualgpt: Data-efficient adaptation of pretrained language models for image captioning. In CVPR, pp. 18009–18019, 2022a.

Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A. J., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., Kolesnikov, A., Puigcerver, J., Ding, N., Rong, K., Akbari, H., Mishra, G., Xue, L., Thapliyal, A., Bradbury, J., Kuo, W., Seyedhosseini, M., Jia, C., Ayan, B. K., Riquelme, C., Steiner, A., Angelova, A., Zhai, X., Houlsby, N., and Soricut, R. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.

Chen, Y., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. UNITER: universal image-text representation learning. In ECCV, volume 12375, pp. 104–120, 2020.

Cho, J., Lei, J., Tan, H., and Bansal, M. Unifying vision-and-language tasks via text generation. arXiv preprint arXiv:2102.02779, 2021.

Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V. Y., Huang, Y., Dai, A. M., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.

Dai, W., Hou, L., Shang, L., Jiang, X., Liu, Q., and Fung, P. Enabling multimodal generation on CLIP via vision-language knowledge distillation. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), ACL Findings, pp. 2383–2395, 2022.

Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Burstein, J., Doran, C., and Solorio, T. (eds.), NAACL, pp. 4171–4186, 2019.

Dong, L., Yang, N., Wang, W., Wei, F., Liu, X., Wang, Y., Gao, J., Zhou, M., and Hon, H. Unified language model pre-training for natural language understanding and generation. In Wallach, H. M., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E. B., and Garnett, R. (eds.), NeurIPS, pp. 13042–13054, 2019.

Fang, Y., Wang, W., Xie, B., Sun, Q., Wu, L., Wang, X., Huang, T., Wang, X., and Cao, Y. Eva: Exploring the limits of masked visual representation learning at scale. arXiv preprint arXiv:2211.07636, 2022.

Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., and Parikh, D. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, pp. 6325–6334, 2017.

Guo, J., Li, J., Li, D., Tiong, A. M. H., Li, B., Tao, D., and Hoi, S. C. H. From images to textual prompts: Zero-shot VQA with frozen large language models. In CVPR, 2022.

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., Casas, D. d. L., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., Driessche, G. v. d., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.

Hudson, D. A. and Manning, C. D. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, pp. 6700–6709, 2019.

Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q. V., Sung, Y., Li, Z., and Duerig, T. Scaling up visual and vision-language representation learning with noisy text supervision. arXiv preprint arXiv:2102.05918, 2021.
Jin, W., Cheng, Y., Shen, Y., Chen, W., and Ren, X. A good prompt is worth millions of parameters: Low-resource prompt-based learning for vision-language models. In Muresan, S., Nakov, P., and Villavicencio, A. (eds.), ACL, pp. 2763–2775, 2022.

Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L., Shamma, D. A., Bernstein, M. S., and Fei-Fei, L. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 123(1):32–73, 2017.

Li, J., Selvaraju, R. R., Gotmare, A. D., Joty, S., Xiong, C., and Hoi, S. Align before fuse: Vision and language representation learning with momentum distillation. In NeurIPS, 2021.

Li, J., Li, D., Xiong, C., and Hoi, S. C. H. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pp. 12888–12900, 2022.

Li, X., Yin, X., Li, C., Zhang, P., Hu, X., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., Choi, Y., and Gao, J. Oscar: Object-semantics aligned pre-training for vision-language tasks. In ECCV, pp. 121–137, 2020.

Lin, T., Maire, M., Belongie, S. J., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: common objects in context. In Fleet, D. J., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), ECCV, volume 8693, pp. 740–755, 2014.

Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.

Ordonez, V., Kulkarni, G., and Berg, T. L. Im2text: Describing images using 1 million captioned photographs. In Shawe-Taylor, J., Zemel, R. S., Bartlett, P. L., Pereira, F. C. N., and Weinberger, K. Q. (eds.), NIPS, pp. 1143–1151, 2011.

Plummer, B. A., Wang, L., Cervantes, C. M., Caicedo, J. C., Hockenmaier, J., and Lazebnik, S. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pp. 2641–2649, 2015.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020, 2021.

Schuhmann, C., Vencu, R., Beaumont, R., Kaczmarczyk, R., Mullis, C., Katta, A., Coombes, T., Jitsev, J., and Komatsuzaki, A. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021.

Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Gurevych, I. and Miyao, Y. (eds.), ACL, pp. 2556–2565, 2018.

Tan, H. and Bansal, M. LXMERT: learning cross-modality encoder representations from transformers. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), EMNLP, pp. 5099–5110, 2019.

Tiong, A. M. H., Li, J., Li, B., Savarese, S., and Hoi, S. C. H. Plug-and-play VQA: zero-shot VQA by conjoining large pretrained models with zero training. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), EMNLP Findings, 2022.

Tsimpoukelli, M., Menick, J., Cabi, S., Eslami, S. M. A., Vinyals, O., and Hill, F. Multimodal few-shot learning with frozen language models. In Ranzato, M., Beygelzimer, A., Dauphin, Y. N., Liang, P., and Vaughan, J. W. (eds.), NeurIPS, pp. 200–212, 2021.

Wang, P., Yang, A., Men, R., Lin, J., Bai, S., Li, Z., Ma, J., Zhou, C., Zhou, J., and Yang, H. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), ICML, pp. 23318–23340, 2022a.

Wang, W., Bao, H., Dong, L., and Wei, F. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. arXiv preprint arXiv:2111.02358, 2021a.

Wang, W., Bao, H., Dong, L., Bjorck, J., Peng, Z., Liu, Q., Aggarwal, K., Mohammed, O. K., Singhal, S., Som, S., and Wei, F. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022b.

Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., and Cao, Y. Simvlm: Simple visual language model pretraining with weak supervision. arXiv preprint arXiv:2108.10904, 2021b.

Yao, L., Huang, R., Hou, L., Lu, G., Niu, M., Xu, H., Liang, X., Li, Z., Jiang, X., and Xu, C. FILIP: fine-grained interactive language-image pre-training. In ICLR, 2022.

Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., and Wu, Y. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.
Yuan, L., Chen, D., Chen, Y., Codella, N., Dai, X., Gao,
J., Hu, H., Huang, X., Li, B., Li, C., Liu, C., Liu, M.,
Liu, Z., Lu, Y., Shi, Y., Wang, L., Wang, J., Xiao, B.,
Xiao, Z., Yang, J., Zeng, M., Zhou, L., and Zhang, P.
Florence: A new foundation model for computer vision.
arXiv preprint arXiv:2111.11432, 2021.
Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D.,
Kolesnikov, A., and Beyer, L. Lit: Zero-shot transfer with
locked-image text tuning. In CVPR, pp. 18102–18112,
2022.
Zhang, P., Li, X., Hu, X., Yang, J., Zhang, L., Wang, L.,
Choi, Y., and Gao, J. Vinvl: Making visual representa-
tions matter in vision-language models. arXiv preprint
arXiv:2101.00529, 2021.
Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M.,
Chen, S., Dewan, C., Diab, M. T., Li, X., Lin, X. V.,
Mihaylov, T., Ott, M., Shleifer, S., Shuster, K., Simig,
D., Koura, P. S., Sridhar, A., Wang, T., and Zettlemoyer,
L. OPT: open pre-trained transformer language models.
arXiv preprint arXiv:2205.01068, 2022.
Figure 6. Incorrect output examples for instructed zero-shot image-to-text generation using a BLIP-2 model with ViT-g and FlanT5-XXL.
(Figure: model architecture for VQA: Image Encoder → Q-Former → Fully Connected → LLM, with the question as input, e.g. "What is the cat wearing?", and the answer as output, e.g. "sunglasses".)