
Compress & Align: Curating Image-Text Data with Human Knowledge

Lei Zhang1,2* Fangxun Shu2* Sucheng Ren3 Bingchen Zhao4 Hao Jiang2† Cihang Xie5†
1 Zhejiang University  2 Alibaba Group  3 Johns Hopkins University  4 University of Edinburgh  5 University of California, Santa Cruz
* Equal contribution.  † Corresponding author.
arXiv:2312.06726v2 [cs.CV] 13 Dec 2023

Abstract

The massive growth of image-text data through web crawling inherently presents the challenge of variability in data quality. This paper introduces a novel algorithm, rooted in human knowledge, to compress this vast corpus of web-crawled image-text datasets to a compact and high-quality form. Our method unfolds in three major steps. First, we collect an image-text dataset, wherein each image is associated with multiple captions sourced from diverse origins. Then, to systematically capture human preferences regarding the best caption paired with each image, we establish a comprehensive set of both subjective and objective criteria for critically guiding the alignment assessment from labelers. Lastly, we train a reward model on the annotated dataset to internalize the nuanced human understanding of image-text alignment. The resulting reward model thus can act as a human-like referee to filter misaligned/low-quality image-text pairs.

Extensive experiments demonstrate that we are able to secure (or even improve) model performance by compressing the image-text datasets up to ∼90%. An impressive example is that, by aggressively reducing the total training sample from 130M to 15.5M (i.e., ∼9× smaller), our BLIP-B/16 models still consistently show superior performance compared with the full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by ∼2.5% in Recall@1, and on image captioning (Nocaps, COCO) by ∼10.0% in CIDEr and ∼2.7% in SPICE.

Figure 1. Our method outperforms the full-size training dataset on various downstream tasks with BLIP-B/16 (radar plot over zero-shot I2T/T2I retrieval on Flickr30k and MSCOCO and image captioning on MSCOCO and NoCaps; legend: Full-size, CLIP-Score (90%), BLIP-Score (90%), Ours (90%)). This training set consists of CC3M, CC12M, and a subset of LAION-400M. We reduce the training sample size from 130M to 15.5M (i.e., ∼9× smaller).

1. Introduction

The rapid progress in vision-language foundation models [2, 5, 11, 26, 29, 35] has been largely driven by the growing availability of image-text pairs, experiencing a massive escalation from datasets comprising a few million samples, such as COCO [20], CC3M [35] and CC12M [5], to those encompassing billions, exemplified by YFCC-100M [37], LAION-400M [32], and LAION-5B [33]. Typically, to facilitate the large-scale collection of such data, web crawling is applied with simple filtering mechanisms in place. While this collection pipeline ensures a rich diversity of data, it inadvertently introduces significant variability in data quality, presenting new challenges for learning at scale.
A popular strategy for mitigating this challenge is data filtering. Recent literature illustrates various methodologies where models are designed to assess the alignment between images and their corresponding textual descriptions, selectively retaining only those pairs that meet predefined standards [38, 40]. However, a fundamental concern associated with this line of work is that the filtering models are typically built upon the original, uncurated datasets, which are inherently noisy, as illustrated in the upper part of Fig. 2. This noise can possibly propagate biases and misalignments into filtered datasets, potentially impinging upon the top-line performance for subsequent models trained on such data. Moreover, the intrinsic cognitive discrepancies between human and machine perception, as studied in recent works [17, 24, 31], suggest that machine-based assessments alone may not sufficiently encapsulate the quality standards set by human judgment.

Intriguingly, recent advancements in language models [3, 15, 22, 25, 44] have demonstrated that Reinforcement Learning from Human Feedback (RLHF) [6, 36] can effectively incorporate human preferences as a reward signal, markedly aligning the model with human intentions. A key component here is the reward model, which is trained to approximate the often complex and nuanced human evaluations of desired behavior. By integrating human feedback directly into the training loop, models can develop a more profound understanding of complex domains of human knowledge. Inspired by these advancements, this paper seeks a human-centric approach to the curation of image-text pairings, aiming to improve data quality through a filtration process intrinsically attuned to the subtleties of human cognition and evaluative criteria, as illustrated in the bottom part of Fig. 2.

Figure 2. The upper part illustrates the pipeline of filtering methods based on feature similarity (a vision-language model filters raw data into compressed data, but misaligned pairs such as "Why Jerry Lorenzo Left Nike For Adidas." and "$59 for a Two-Hour Horseback-Riding Birthday Party." slip through). The bottom part illustrates the pipeline of our method integrated with human knowledge (a reward model filters raw data into compressed data with aligned pairs such as "Man giving young boy shoulder ride outdoors at park." and "Conrad New York superior room river view.").

An overview of our data filtering pipeline, which is grounded in human knowledge, is illustrated in Fig. 3. Specifically, our first step is to collect a dataset with 10,000 images, each paired with a variety of captions. Then, human preferences are collected by ranking the alignment between different captions and their corresponding images, applying a set of criteria that captures both objective aspects, such as accuracy and completeness, and subjective aspects, such as vividness and contextual relevance. In the final stage, we train a reward model on the annotated dataset, aiming to predict the human preference for captions. The resulting reward model is expected to function as a human-like referee that can transfer human knowledge to identify and filter misaligned and low-quality image-text pairs, enhancing the overall dataset quality.

Extensive experiments demonstrate that we can successfully compress large-scale and noisy image-text datasets into compact and well-aligned forms. For example, as shown in Fig. 1, by compressing the training dataset, consisting of CC3M [35], CC12M [5], and a portion of the LAION-400M dataset [32], from 130M to 15.5M, BLIP-B/16 achieves a highly competitive 82.4% zero-shot text recall@1 (Flickr30K [28]) and 61.7 zero-shot CIDEr (Nocaps [1] val set). This performance stands in stark contrast to the counterpart trained on the original 130M dataset, which attains 78.6% on text recall@1 and 41.6 on CIDEr. Additionally, our method exhibits good generalization to different vision-language models — with the CLIP-B/16 model, we not only reduce the dataset by 50% but also impressively achieve 2.6% and 2.0% improvements on text and image recall@1 (Flickr30K).

2. Related Work

Learning from human knowledge. The integration of human knowledge is becoming progressively instrumental in aligning model behavior with human intent [3, 15, 17, 22, 39, 41, 43, 44]. The key part is to train a reward model [21] that directly captures human preferences regarding the outputs generated by the model. Recent works propose to utilize reinforcement learning [34] to finetune language models [25] and diffusion models [17, 41, 43] with the signal of the reward model. However, few studies focus on utilizing human knowledge to directly improve dataset quality. This work pioneers the exploration of integrating human knowledge with image-text data to relieve the visual and textual misalignment in large-scale image-text datasets.

Vision-language data. Data is of central importance to the success of vision-language pre-training at scale [2, 11, 29]. Early efforts in dataset collection often employ simple but vague cleaning strategies [5, 35, 37]. Recent work has turned to visual and textual alignment to more intricately filter misaligned image-text pairs; e.g., LAION-400M [32] utilized similarity scores of a pre-trained CLIP model [29] to filter low-quality image-text pairs. Further advancements include CiT [40], which trains models with dynamic training data on the fly, and TL;DR [38], which learns a trainable codebook separately via image feature extraction and language modeling. However, these data-filtering models are typically built upon noisy datasets, and are therefore potentially less effective at filtering data. Differently, in this work, we directly introduce human preference into data filtering for evaluating dataset quality.
Figure 3. A diagram illustrating the three steps of our method (Step 1: construct a dataset human-in-the-loop; Step 2: train a reward model that learns human preference; Step 3: use the reward model as a human-like referee to evaluate alignment). We first curate an image-text dataset to collect human knowledge on alignment in Step 1. Then we train a reward model to predict human preference in Step 2. The reward model functions as a human-like referee to filter misaligned image-text pairs in Step 3. The style of this figure is inspired by Figure 2 in InstructGPT [25].

3. Method

To improve image-text datasets with human knowledge, we follow the steps shown in Fig. 3. We first generate a set of diverse captions for a given set of images, subsequently soliciting human assessments to determine the degree of alignment between each image and its associated captions. This phase involves a comprehensive scoring process by labelers to evaluate the quality of image-caption alignment, as detailed in Sec. 3.1. Next, we train a reward model to emulate human evaluative feedback, taking an image and a corresponding caption as inputs, an approach elaborated upon in Sec. 3.2. The resulting reward model is expected to function as a human-like referee to improve image-text datasets (see Sec. 3.3).

3.1. Human Knowledge Collection

We construct a systematic pipeline to build an image-text dataset enriched with human feedback, as delineated in Step 1 of Fig. 3. This pipeline contains two integral components: the collection of image-caption data and the subsequent annotation of this data with human feedback provided by expert labelers. The resulting image-caption dataset is expected to be both well-aligned and diverse: at the level of an individual image-text pair, the caption contains rich and useful visual information accurately reflecting the image's content; at the level of the dataset as a whole, the distribution of captions is diverse, preventing biases from a single source.

Image-Caption Collection. We utilize MSCOCO [20] as the basis to curate our new dataset, termed COCO-HF. To enrich the diversity of captions, we explore the following strategies (a sketch of how the resulting caption pool is assembled follows the list):
• MSCOCO Sampling. Multiple text descriptions for one image are already available in the original MSCOCO dataset, i.e., each image is associated with five distinct human-written text descriptions.
• Model Rewriting. We also employ various generative models, such as BLIP [18], BLIP-2 [19], and InstructBLIP [7], to generate captions using images as inputs and, optionally, prompts.
• Human Rewriting. Additionally, we engage human annotators and task them with rewriting captions based on the content depicted in the corresponding images.
With these three steps, we collect a dataset with 1,000 candidate images, each accompanied by 8 to 10 captions.
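As a rough, hedged illustration of how the three caption sources above can be combined per image, one could assemble the pool as below; the captioning-model interface and argument names are hypothetical placeholders, not the authors' collection code.

```python
def build_caption_pool(image, coco_captions, caption_models, human_rewrites):
    """Assemble the 8-10 candidate captions per image described in Sec. 3.1.

    coco_captions:   the five reference captions from MSCOCO (sampling).
    caption_models:  hypothetical callables standing in for BLIP / BLIP-2 /
                     InstructBLIP, each returning one generated caption.
    human_rewrites:  captions rewritten by human annotators.
    """
    pool = list(coco_captions)                             # MSCOCO sampling
    pool += [model(image) for model in caption_models]     # model rewriting
    pool += list(human_rewrites)                           # human rewriting
    return pool
```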
Human-Preference Annotations. We recruit well-trained human labelers to evaluate the alignment between the generated captions and the image content. To ensure annotation consistency, we institute four precise guidelines that outline the criteria for alignment, with illustrative examples provided in Fig. 4:
• Accuracy. The caption should first accurately reflect the content of the corresponding image. Inaccurate captions include descriptions that conflict with the content of the image, fabrications not grounded in the corresponding image, etc.
• Completeness. The caption should cover the main visual objects in the image as completely as possible. This guarantees that the caption provides an abstract view of the whole image.
• Vividness. The caption should describe the details of the mentioned visual objects. The details here refer to the number, appearance, action, and status of the visual objects. Vivid captions convey the individual status of, and the relationships between, different visual objects, reflecting a comprehensive understanding of the image.
• Context. The caption should mention the context of the image, which is easily ignored. The context includes the background in which the image occurs and the atmosphere that the image conveys or implies. It reflects complete observation and human understanding of the image.
These criteria collectively contribute to a multi-faceted annotation framework that captures both the objective denotations and the subjective connotations of an image-text pair.

Figure 4. We provide an image-text pair as an example ("A orange striped tabby cat with a big fluffy tail laying on top of a red car's wheel", with "tabby cat" and "wheel" marked as objects, "orange striped" and "big fluffy tail" as details, "laying" as action, and "red car" as context) and employ our proposed criteria to evaluate the caption. The caption here is a high-quality exemplar since it satisfies all the requirements.

3.2. Reward Model Training

Inspired by InstructGPT [25], we propose to train a reward model to learn human preference knowledge on the COCO-HF dataset. This reward model is designed to capture the subtleties of human preferences, thereby serving as an automatic, yet human-like, arbiter for aligning image-text correspondences (see Fig. 3, Step 2).

We transform the preference annotations into rankings and formulate the training of the reward model as a pairwise ranking problem. For each given image I in the COCO-HF dataset, we have k ∈ [8, 10] textual captions ranked by human labelers, denoted as x1, x2, ..., xk. If xi is better than xj, we organize (I, xi, xj) as a comparison pair. This produces at most C(k, 2) comparison pairs for each image. Then, we follow the Bradley-Terry model [4, 25] of preferences to define the pairwise loss function as:

loss(θ) = −E(I, xi, xj)∼DH [log(σ(fθ(I, xi) − fθ(I, xj)))],   (1)

where fθ(I, x) is the scalar value of the reward model f, parameterized by θ, for image I and caption x, σ is the sigmoid function, and DH denotes the COCO-HF dataset.
tasks, and image captioning tasks. We also present abla-
Implementation. Following [41], our reward model fθ tion studies, including data efficiency and vision-language
contains BLIP [18] as the backbone and score mapping model architecture.
layer based on MLPs. The backbone produces a multi-
modal embedding of image and text features, and the score 4.1. Evaluation on Human Preferences
mapping layer maps the multi-modal embedding to a scalar Alignment with Human. We first investigate the quality of
as the reward score. When training the reward model, we our reward model and other image-text pair selection meth-
freeze the parameters of the backbone BLIP and take the ods by evaluating its prediction in terms of human prefer-
MLP part as trainable. The hyperparameter setup and train- ence. We randomly choose 200 images from the MSCOCO
ing details of reward model training are provided in the sup- dataset [20] and generate 8 to 10 captions for each image
plementary material. and human labelers rank the following Sec. 3.1 to curate
3.3. Dataset Compression a validation dataset. In addition to the proposed method,
we consider the other two methods for comparison: CLIP
In order to compress large-scale datasets into compact and Score, and BLIP Score. Our reward model is combined
well-aligned ones, we utilize the reward model as a human- BLIP [18] model as the backbone with MLP to generate
like referee to filter misaligned image-text pairs (Step 3 reward score. For each baseline, we consider different
in Fig. 3). Vision Transformer backbone architectures including ViT-
N
Let D = {(I n , xn )}n=1 be the original image-text B/16 and ViT-L/16. We use these methods to select the best
dataset. We utilize the reward model fθ to evaluate the caption among the candidate captions for each image and
image-text alignment of each image-text pair (I n , xn ) from evaluate the accuracy in accordance with human preference.
the dataset D. The reward score ri is formulated as rn = As shown in Fig. 5, our reward model achieves signif-
fθ (I n , xn ). Then, we obtain the set of reward scores RD icant advantages over others. For example, while CLIP
of the whole dataset D, which can be expressed as RD =
 model achieves merely 40% accuracy, our method achieves
r1 , r2 , ..., rN . In order to compress the original dataset over 70%. It suggests that existing vision-language models
D to a compact and well-aligned dataset D̂ with k% origi- trained on noisy datasets are poor at aligning with human
nal amount, we consist D̂ of image-text pairs with top k% knowledge. The experiment results demonstrate that our re-

4
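The compression step can be sketched as a simple top-k% selection by reward score. The snippet below is an illustrative outline under the assumption that all pairs fit in memory and that `reward_fn` wraps the trained reward model; it is not the actual large-scale pipeline.

```python
import numpy as np

def compress_dataset(pairs, reward_fn, keep_ratio=0.5):
    """Keep the image-text pairs whose reward scores fall in the top `keep_ratio` fraction."""
    scores = np.array([reward_fn(image, caption) for image, caption in pairs])  # R_D
    n_keep = int(len(pairs) * keep_ratio)            # e.g. 0.5 -> keep the top 50%
    top_idx = np.argsort(-scores)[:n_keep]           # indices of the highest rewards
    return [pairs[i] for i in top_idx]
```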
4. Experiments

In this section, we provide extensive experiments to demonstrate the effectiveness of the proposed method in compressing vision-language data for better quality. The main results for the reward model are measured by human preference accuracy. The evaluation of data compression is presented next, organized into image-text retrieval, image classification, and image captioning tasks. We also present ablation studies, including data efficiency and vision-language model architecture.

4.1. Evaluation on Human Preferences

Alignment with Human. We first investigate the quality of our reward model and other image-text pair selection methods by evaluating their predictions in terms of human preference. We randomly choose 200 images from the MSCOCO dataset [20], generate 8 to 10 captions for each image, and have human labelers rank them following Sec. 3.1 to curate a validation dataset. In addition to the proposed method, we consider two other methods for comparison: CLIP Score and BLIP Score. Our reward model combines a BLIP [18] backbone with an MLP to generate the reward score. For each baseline, we consider different Vision Transformer backbone architectures, including ViT-B/16 and ViT-L/16. We use these methods to select the best caption among the candidate captions for each image and evaluate the accuracy in accordance with human preference.

As shown in Fig. 5, our reward model achieves significant advantages over the others. For example, while the CLIP model achieves merely 40% accuracy, our method achieves over 70%. It suggests that existing vision-language models trained on noisy datasets are poor at aligning with human knowledge. The experiment results demonstrate that our reward method effectively encodes human knowledge of visual and textual alignment. We adopt the reward model with a large backbone to filter misaligned image-text pairs in the following experiments.

Figure 5. Accuracy of CLIP Score [29], BLIP Score [18], and our reward model with different ViT architectures (BLIP-B, BLIP-L, CLIP-B, CLIP-L, Ours-B, Ours-L) on predicting the preference of human labelers.
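The human preference accuracy reported above can be computed as in the following sketch; the evaluation-set structure and the `score_fn` interface are assumptions made for illustration, not the paper's released evaluation script.

```python
import numpy as np

def preference_accuracy(score_fn, eval_set):
    """Fraction of images where the method's top-scored caption matches the labelers' top pick.

    eval_set: iterable of (image, captions, human_best_index) triples (assumed format).
    score_fn: callable mapping (image, caption) to a scalar alignment score.
    """
    correct = 0
    for image, captions, human_best_index in eval_set:
        scores = [score_fn(image, caption) for caption in captions]
        correct += int(np.argmax(scores) == human_best_index)
    return 100.0 * correct / len(eval_set)
```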
Visualization. To demonstrate the behavior of the different filtering methods, we employ CLIP Score, BLIP Score, and our reward model with the large backbone to filter the Conceptual Captions 3M (CC3M) [35] dataset. We retain the image-text pairs whose scores rank in the top 50% of the full-size dataset.

As shown in Fig. 6, we visualize image-text pairs that are selected by our reward model but not by CLIP Score or BLIP Score. It demonstrates that our reward model shows evident advantages in fine-grained understanding of image and text content, including position, appearance, and background, which is close to human perception.

Figure 6. We employ different methods to select the top 50% of the data from the CC3M dataset. The shown image-text pairs (e.g., "hand grating peeled potatoes over an aqua-colored bowl next to a food processor, knife background", "a man securing an awning with a rope on a residential narrowboat moored in winter on the bank", "a bare tree with a scary face and pumpkins on the ground beneath it") are selected by our method but excluded by CLIP Score and BLIP Score.

4.2. Evaluation on Data Selection

4.2.1 Training Setup

Datasets. We conduct experiments on three image-text datasets at different scales: Conceptual Captions 3M (CC3M) [35], Conceptual Captions 12M (CC12M) [5], and LAION-400M [32]. The majority of our ablation studies are performed on the CC12M dataset. We compress the original datasets to 50% by default. Additionally, we explore different compression ratios in ablation studies.

Baselines. In addition to our method, we consider three other commonly-used baselines: random selection, CLIP Score [29], and BLIP Score [18]. The latter two methods target poorly aligned image-text pairs by using pre-trained CLIP and BLIP models to rank the cosine similarity score between image and text embeddings. The CLIP and BLIP models use a large ViT vision backbone.

Training Parameters. For our experiments on CC3M and CC12M, we utilize BLIP [18] with the ViT-B/16 [9] architecture. For LAION-400M, we use the ViT-L architecture and follow the exact training setup outlined in [18]. For the CC3M dataset, we train the models with a batch size of 600 and the AdamW optimizer [14]. We set the batch size to 2,880 for the CC12M and LAION-400M datasets. The number of training epochs is 40 for the CC3M and CC12M datasets and 5 for the LAION-400M dataset, respectively. The learning rate is set to 3 × 10−3, and the weight decay is set to 0.1. The supplementary material contains a detailed breakdown of the training parameters.

Evaluation Setup. We consider three downstream evaluation tasks for the trained BLIP models: image classification, image-text retrieval, and image captioning. It is worth noting that we evaluate all of these tasks in a zero-shot manner. To demonstrate the improvements in the alignment of the vision-language models, we do not fine-tune the trained BLIP models, avoiding bias from other datasets introduced during fine-tuning.

4.2.2 Downstream Tasks

Zero-Shot Image-Text Retrieval. This task measures fine-grained world-region alignment between images and texts. We evaluate the trained BLIP models on two standard image-text retrieval benchmarks: MSCOCO [20] and Flickr30K [27]. Following previous work [13, 42], we use the Karpathy split [12] for the two benchmarks.

Tab. 1 shows the main results. Compared with the full-size CC12M dataset, our method improves text recall@1 from 81.0% to 81.6% with only 50% of the data. This shows that we not only reduce data redundancy but also improve data quality. Furthermore, we observe consistent improvements in performance compared with other filtering methods when scaling up the dataset. The performance gap between our method and BLIP Score on image recall@1 is −0.4% on the Flickr30K dataset and −0.1% on the MSCOCO dataset; however, the gap enlarges to +1.8% and +0.9% when scaling from 3M to 12M. Compared with CLIP Score on the CC12M dataset, we achieve 3.9% and 3.8% gains in text recall@1 on the Flickr30K and MSCOCO datasets. It suggests that filtering with human knowledge effectively conveys the capability of fine-grained understanding to the vision-language model.
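For reference, recall@K over an image-text similarity matrix can be computed roughly as follows. This simplified sketch assumes a one-to-one query-gallery correspondence (MSCOCO and Flickr30K actually pair each image with five captions), so it is not the exact BLIP evaluation code.

```python
import numpy as np

def recall_at_k(sim, k):
    """sim[i, j]: similarity between query i and gallery item j.

    Assumes the ground-truth match of query i is gallery item i (a simplification).
    Returns recall@k as a percentage.
    """
    ranks = np.argsort(-sim, axis=1)                 # best matches first
    ground_truth = np.arange(sim.shape[0])[:, None]
    hits = (ranks[:, :k] == ground_truth).any(axis=1)
    return 100.0 * hits.mean()
```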
Text → Image Image →Text
Dataset Method #Samples Flickr30K MSCOCO Flickr30K MSCOCO
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
- 2.82M 66.0 89.6 94.5 39.9 64.6 74.7 51.9 76.8 83.9 28.8 53.8 64.9
Random 1.41M 57.7 84.2 91.0 30.7 56.7 68.7 44.4 71.3 79.5 24.7 48.4 59.6
CC3M CLIP Score 1.41M 51.3 80.0 87.9 30.2 56.4 68.7 41.5 68.5 78.1 25.6 49.2 60.2
BLIP Score 1.41M 61.4 87.8 93.2 36.8 62.8 74.2 49.7 75.2 82.9 28.7 53.7 64.8
Ours 1.41M 64.2 87.5 92.5 37.4 63.9 74.6 49.3 74.8 82.8 28.6 53.2 64.1
- 10.4M 81.0 95.6 97.4 53.5 77.2 85.7 65.0 87.4 92.1 39.2 65.0 74.8
Random 5.20M 74.0 92.8 96.9 48.3 73.1 82.1 59.5 83.8 89.7 34.6 60.3 71.3
CC12M CLIP Score 5.20M 77.7 95.2 97.4 49.7 75.4 83.8 62.1 83.9 89.6 36.0 61.9 72.5
BLIP Score 5.20M 79.6 94.7 97.2 53.1 76.8 85.0 62.2 84.9 90.3 37.6 63.4 73.5
Ours 5.20M 81.6 95.7 98.0 53.5 78.2 86.2 64.0 86.4 91.7 38.5 64.0 74.3
Table 1. Zero-shot image-text retrieval results on Flickr30K [27] and MSCOCO [20] datasets.

MSCOCO NoCaps
Dataset Method #Samples Valid Test
BLEU@4 CIDEr METEOR SPICE CIDEr SPICE CIDEr SPICE
- 2.82M 7.3 39.2 12.5 9.3 35.0 7.0 34.1 6.9
Random 1.41M 7.5 37.8 12.7 9.2 34.6 6.9 33.5 7.0
CC3M CLIP Score 1.41M 7.4 38.4 12.7 9.3 34.2 7.1 33.7 6.7
BLIP Score 1.41M 8.5 42.4 13.5 9.9 38.1 7.3 37.1 7.3
Ours 1.41M 12.9 48.7 16.4 11.9 46.1 8.6 44.5 8.6
- 10.4M 11.4 42.7 14.2 10.3 38.6 7.7 36.1 7.6
Random 5.20M 11.1 43.1 14.2 10.2 38.7 7.6 35.8 7.5
CC12M CLIP Score 5.20M 13.8 49.6 16.5 11.8 46.1 8.7 43.6 8.7
BLIP Score 5.20M 16.9 59.1 18.1 13.0 52.9 9.0 50.2 9.1
Ours 5.20M 18.0 59.1 19.5 14.0 57.2 10.4 54.8 10.4
Table 2. Zero-shot image captioning results on MSCOCO [20] and NoCaps [1] datasets. Note that the models are not finetuned with CIDEr
optimization on MSCOCO dataset.

Zero-shot Image Classification. We evaluate the trained BLIP models on ImageNet under different distributions, including ImageNet-1K [8], ImageNet-A [10], and ImageNet-O. We follow the exact setup in [29] and apply the same set of prompts for the label classnames to all the baselines.

Dataset  Method      #Samples  IN-1K  IN-A  IN-O
CC3M     -           2.82M     29.0   10.0  41.4
CC3M     Random      1.41M     27.5   9.5   37.8
CC3M     CLIP Score  1.41M     26.7   8.3   40.3
CC3M     BLIP Score  1.41M     29.6   9.2   41.7
CC3M     Ours        1.41M     29.7   11.0  42.0
CC12M    -           10.4M     48.8   18.1  62.1
CC12M    Random      5.20M     43.2   19.2  61.9
CC12M    CLIP Score  5.20M     51.3   21.0  66.2
CC12M    BLIP Score  5.20M     48.7   19.6  65.3
CC12M    Ours        5.20M     48.5   21.8  64.4
Table 3. Zero-shot image classification results on ImageNet-1K (IN-1K) [8], ImageNet-A (IN-A) [10], and ImageNet-O (IN-O).

Tab. 3 shows the main experiment results. With only 50% of the CC3M dataset, we achieve improvements over the full-size dataset, for example improving from 29.0% to 29.7% on ImageNet-1K, from 10.0% to 11.0% on ImageNet-A, and from 41.4% to 42.0% on ImageNet-O. However, the other filtering methods hurt performance on most datasets; for instance, CLIP Score hurts by 2.3% on ImageNet-1K, 1.7% on ImageNet-A, and 1.1% on ImageNet-O. The performance of our method becomes slightly worse when scaling up to the CC12M dataset. This indicates that the classification task relies heavily on visual diversity. Moreover, the prompts here are constructed by simply splicing classnames into a template, which cannot fully demonstrate the capability of textual and visual alignment.
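As a hedged illustration of the template-splicing mentioned above (the actual evaluation follows the prompt set of [29], which is not reproduced here), zero-shot classification reduces to nearest-prompt matching:

```python
import numpy as np

def build_prompts(classnames, template="a photo of a {}."):
    """Splice each classname into a fixed template; the real prompt set follows [29]."""
    return [template.format(name) for name in classnames]

def classify(image_embedding, prompt_embeddings):
    """Return the index of the class whose prompt embedding is most similar to the image.

    Both inputs are assumed to be L2-normalized, so the dot product is a cosine similarity.
    """
    similarities = prompt_embeddings @ image_embedding
    return int(np.argmax(similarities))
```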
consider two popular benchmarks: MSCOCO [20] and No-
Zero-shot Image Captioning. Additionally, we evaluate the trained models on the image captioning task, i.e., generating a natural-language caption for a given image. We consider two popular benchmarks: MSCOCO [20] and Nocaps [1]. Note that we do not evaluate models finetuned on the MSCOCO Karpathy split [12]: since we investigate the alignment of models, finetuning tends to introduce biases.

Tab. 2 shows the main results. Compared with the full-size dataset, our method demonstrates a significant improvement with only 50% of the data, for instance from 38.6 to 57.2 CIDEr and from 7.7 to 10.4 SPICE on the NoCaps validation set. We observe consistent performance improvements when reducing the amount of data, whichever filtering method is applied. It suggests that misalignment seriously hurts captioning performance. Our method significantly outperforms the other methods: we observe a +4.3% gap on CIDEr compared with BLIP Score and a +11.1% gap compared with CLIP Score, further suggesting that our efforts to examine the vividness of textual captions empower models with a powerful generation capability.
Text → Image Image →Text
Dataset Method #Samples Flickr30K MSCOCO Flickr30K MSCOCO
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
- 15M+115M 78.6 95.6 98.0 54.7 78.5 86.3 66.9 88.1 92.9 41.5 66.9 76.3
CLIP Score 15M+8M 80.4 94.7 97.0 54.4 77.3 84.7 67.2 88.4 93.2 41.0 66.4 75.8
CC+LAION
BLIP Score 15M+8M 80.0 95.0 97.6 53.7 77.8 85.9 67.0 88.2 92.6 40.7 66.0 75.6
Ours 15M+8M 82.4 96.4 98.8 56.7 79.9 87.2 69.6 89.3 93.6 42.6 68.0 76.9
Table 4. Performance of downstream tasks on BLIP of ViT-B/16 on image-text retrieval tasks. The CC dataset consists of the CC3M [35]
and the CC12M [5] dataset. All the downstream tasks are evaluated in a zero-shot manner. Following BLIP [18], we filter the images
whose short edge is smaller than 256 pixels from the original LAION-400M.

MSCOCO NoCaps
Dataset Method #Samples Valid Test
BLEU@4 CIDEr METEOR SPICE CIDEr SPICE CIDEr SPICE
- 15M+115M 9.1 46.7 14.0 10.3 41.6 7.6 40.0 7.5
CLIP Score 7.5M+8M 11.5 50.2 15.0 11.1 45.5 8.2 43.4 8.1
CC+LAION
BLIP Score 7.5M+8M 13.2 55.4 16.3 12.0 51.5 8.5 49.0 8.5
Ours 7.5M+8M 19.8 68.3 19.9 14.6 61.7 10.3 59.9 10.3
Table 5. Performance of downstream tasks on BLIP of ViT-B/16 on image captioning tasks. The CC dataset consists of the CC3M [35]
and the CC12M [5] dataset. All the downstream tasks are evaluated in a zero-shot manner.

4.3. Transfer to LAION-400M Dataset

Experiment Setting. We study data compression on the larger and more challenging LAION-400M dataset [32]. We mostly follow the experiment settings in previous work [18]; the details can be found in the supplementary material. The only difference in training is that we modify the pre-training dataset. In the BLIP paper, the pre-training dataset is a mixture of the CC3M [35], CC12M [5], MSCOCO [20], VG [16], SBU [23], and LAION-400M [32] datasets. However, the VG and SBU datasets potentially leak information into the downstream tasks, as they contain samples from the Flickr30K [28] and MSCOCO datasets, respectively. For a fair evaluation, the pre-training dataset in our experiment only contains CC3M, CC12M, and LAION-400M.

Downstream Tasks. Similarly, we consider two classical tasks in a zero-shot manner: image-text retrieval and image captioning.

Tab. 4 demonstrates the main results for image-text retrieval. Compared with the full-size datasets, we reduce the total number of training samples by ∼85% and demonstrate significant improvements. For instance, we improve text recall@1 from 54.7% to 56.7% and image recall@1 from 41.5% to 42.6% on the MSCOCO dataset. It suggests that the large number of misaligned samples in the LAION-400M dataset seriously hurts performance. We observe that the performance of BLIP Score and CLIP Score is similar, for instance, 80.0% and 80.4% in text recall@1 on the Flickr30K dataset. However, it is noteworthy that our method achieves a ∼2.4% improvement over BLIP Score and CLIP Score at the same data volume. Integration with human knowledge brings the model an efficient and strong capability of fine-grained understanding.

Tab. 5 presents the main results for image captioning. Combining the compression of the LAION-400M dataset with the CC3M and CC12M datasets, we go from 130M to 15.5M samples. Compared with the full-size dataset, we reduce the total training samples by ∼90% and achieve prominent performance improvements, for instance, on the MSCOCO dataset, from 9.1 to 19.8 BLEU@4, 46.7 to 68.3 CIDEr, 14.0 to 19.9 METEOR, and 10.3 to 14.6 SPICE. Compared with other filtering methods at the same volume, we outperform BLIP Score by +10.9% on CIDEr and +1.8% on SPICE on the Nocaps test set. The gap extends to +16.5% and +2.2% compared with CLIP Score. It indicates that less but higher-quality image-text data is much more effective than massive but noisy data.

4.4. Ablation Studies

Dataset Efficiency. We study data compression under various compression ratios. The main results are demonstrated in Fig. 7. Our method consistently achieves significant improvements over the other filtering baselines, demonstrating its good generalizability. It is noteworthy that the performance of our method degrades less than that of other methods as the training dataset is compressed.
7
Figure 7. We set the compression ratio for the CC12M dataset [5] to 50%, 60%, 70%, and 80%, respectively. BLIP models with the ViT-B/16 architecture are trained on the compressed datasets and evaluated on zero-shot image-text retrieval on the Flickr30K dataset [28] and zero-shot image captioning on the Nocaps dataset [1]. Panels: (a) Text → Image Retrieval (Recall@1), (b) Image → Text Retrieval (Recall@1), (c) Image Captioning (CIDEr), (d) Image Captioning (SPICE); curves compare Random, CLIP Score, BLIP Score, and Ours.

Dataset Method #Samples ImNet-1K ImNet-A ImNet-V2 Flickr30K
Accuracy Accuracy Accuracy IR@1 IR@5 IR@10 TR@1 TR@5 TR@10
- 2.82M 16.0 3.6 13.2 27.4 54.7 65.2 19.0 40.4 50.8
Random 1.41M 13.7 3.5 11.2 24.2 50.2 62.1 16.8 37.6 45.9
CC3M CLIP Score 1.41M 14.1 3.0 12.1 23.4 44.8 57.0 16.6 36.2 47.0
BLIP Score 1.41M 15.6 2.9 12.9 28.9 54.9 66.8 19.6 41.6 52.4
Ours 1.41M 15.0 4.1 13.0 31.0 57.0 67.9 21.0 44.0 55.2
Table 6. Zero-shot image classification results on ImageNet of different distributions and image-text retrieval results on Flickr30K [27]
datasets on CLIP model of ViT-B/16. Here we adopt ImageNet-1K [8], ImageNet-A [10], and ImageNet-V2 [30].

Compressing the dataset from 50% to 80%, our method observes a decrease in performance of 3%, whereas the decrease for BLIP Score is 4.4% (Fig. 7c). Moreover, our method gains improvements in both performance and efficiency. When the retained volume decreases from 50% to 20%, we observe an increase from 10.4 to 11.3 in Fig. 7d. This suggests that filtering misaligned samples directly brings gains in the capability to align image and text features in both a fine-grained and a comprehensive manner.

Model Architecture. We generalize our method to another vision-language model, CLIP [29] with ViT-B/16 [9]. Tab. 6 demonstrates the experiment results. Our method with 50% of the dataset outperforms the full-size dataset on the retrieval task. For instance, we observe an increase in performance from 27.4% to 31.0% in IR@1 and from 19.0% to 21.0% in TR@1. For the classification task, our method demonstrates good performance on ImageNet-A and ImageNet-V2, with a +0.5% increase and only a −0.2% gap, showing generalization to different distributions. All results suggest the scalability of our method to different model architectures.

5. Conclusion

In this paper, we delve deep into image-text data. We present COCO-HF, an image-text dataset with human knowledge for alignment, and reveal that human knowledge enhances both data efficiency and visual-textual alignment. Scaling up to different datasets, we find that our method performs well on large-scale and noisy datasets. Scaling to various downstream tasks, we discover that leveraging less but higher-quality data leads to a greater abundance of fine-grained knowledge and is especially suitable for captioning tasks. These findings underscore the potential of human knowledge for efficiency and alignment, and they can significantly enable faster training with better results. We hope that our work sheds light on the data-centric and human-centric directions in the vision-language community and encourages more and deeper exploration.

Acknowledgment

This work is partially supported by the TPU Research Cloud (TRC) program and the Google Cloud Research Credits program.

Supplementary Material

A. Training Details

Reward Model. We use the ViT-L/16 BLIP model [18] to extract and fuse image and text embeddings, and train an MLP that takes the fused embedding as input and outputs a scalar reward score. Specifically, we use a five-layer MLP with 768, 1024, 128, 64, and 16 hidden dimensions, with Dropout between the first four layers at ratios of 0.2, 0.2, and 0.1. We freeze the pre-trained BLIP backbone and only train the MLP layers. We update the reward model using the AdamW [14] optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e−8. The learning rate is set to 1e−5. The model is trained on 2 NVIDIA 80GB A100 GPUs with a per-GPU batch size of 32, resulting in a total batch size of 64, for a total of 20,000 updates.
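One plausible reading of the score head described above is sketched below in PyTorch; the 768-d fused-embedding input and the final scalar projection are assumptions on our part, since the paragraph lists only the hidden widths and dropout ratios.

```python
import torch.nn as nn

class RewardHead(nn.Module):
    """MLP score head trained on top of the frozen BLIP multimodal embedding (sketch)."""

    def __init__(self, embed_dim=768):   # 768-d input is an assumption
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 1024), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(1024, 128), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(128, 64), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(64, 16), nn.ReLU(),
            nn.Linear(16, 1),             # scalar reward score (assumed final projection)
        )

    def forward(self, fused_embedding):
        return self.mlp(fused_embedding).squeeze(-1)
```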
BLIP. We train the ViT-B/16 BLIP model with a batch size of 2880 and the ViT-L/16 BLIP model with a batch size of 2400, both for 20 epochs. We update the models with the AdamW [14] optimizer with β1 = 0.9, β2 = 0.999, ϵ = 1e−8, and weight decay 0.05. The learning rate is warmed up to 3e−4 for ViT-B/16 and 2e−4 for ViT-L/16 and decayed linearly at a rate of 0.85.

B. Additional Experiments

Ratio  Random  CLIP Score  BLIP Score  Ours
50%    74.0    77.7        79.6        81.6
60%    72.9    76.4        77.3        78.3
70%    70.4    73.3        76.2        79.0
80%    65.0    70.2        70.2        72.8
(a) Text Recall@1 on the Flickr30K dataset.

Ratio  Random  CLIP Score  BLIP Score  Ours
50%    59.5    62.1        62.2        64.0
60%    58.3    59.2        59.7        62.2
70%    55.0    57.7        58.3        60.9
80%    50.8    54.1        54.6        58.0
(b) Image Recall@1 on the Flickr30K dataset.

Ratio  Random  CLIP Score  BLIP Score  Ours
50%    38.7    46.1        52.9        57.2
60%    35.8    43.3        54.0        54.3
70%    34.6    43.0        54.0        54.3
80%    35.7    39.8        49.6        54.2
(c) CIDEr on the Nocaps dataset.

Ratio  Random  CLIP Score  BLIP Score  Ours
50%    7.6     8.7         9.0         10.4
60%    7.4     8.4         9.6         10.6
70%    7.2     8.6         9.4         11.0
80%    7.2     8.6         9.5         11.3
(d) SPICE on the Nocaps dataset.

Table 7. Zero-shot image-retrieval on the Flickr30K dataset [27] and image-captioning on the Nocaps dataset [1] under different compression ratios, using the BLIP ViT-B/16 model [18] and the CC12M dataset [5].

B.1. Ablation Study on Vision Backbone

We conduct experiments changing the vision backbone of the BLIP model to ViT-L/16. The main results are demonstrated in Tab. 10 and Tab. 11. When reducing the dataset to 50%, our method outperforms the full-size dataset by over 2% on text recall@1, nearly 1% on image recall@1, over 15% on CIDEr, and over 3% on SPICE. Compared with other filtering methods using the same amount of data, we demonstrate significant advantages in both image-text retrieval and image captioning. For instance, we improve text recall@1 from 79.3 to 83.5 on the Flickr30K dataset and from 52.3 to 56.8 on the MSCOCO dataset compared with CLIP Score. It suggests that our method is flexible with respect to various vision backbones.

B.2. Ablation Study on Data Efficiency

We provide detailed experiment results for the different compression ratios studied in Section 4.4 of the main paper. The main results are demonstrated in Tab. 7. Our method outperforms the other filtering methods consistently under different compression ratios. For instance, our method improves by an average of +1.9% on text recall@1, +2.6% on image recall@1, +2.4% on CIDEr, and +1.5% on SPICE over BLIP Score. It is noteworthy that CIDEr shows only a minor decline as the data volume decreases, and SPICE even improves by +1% while the training samples are reduced by 30%. These results show that our method generalizes successfully to datasets of different scales.

B.3. Comparison with BLIP Model

We follow the setting in [18] to demonstrate the effectiveness of our method. The pre-training dataset consists of the CC3M [35], CC12M [5], COCO [20], SBU [23], VG [16], and LAION-400M [32] datasets.

The main results are shown in Tab. 8 and Tab. 9. The performance of our method is slightly worse than BLIP-CapFilt on the image-text retrieval tasks. Firstly, this can be attributed to the increase in data, since VG and SBU contain part of the samples from the Flickr30K and MSCOCO datasets, which is not entirely fair. However, our method outperforms its full-size counterpart even without the VG and SBU datasets (Tab. 4, main paper), which still demonstrates its superiority to some extent. Secondly, BLIP-CapFilt rewrites the textual captions of misaligned image-text pairs, while we only filter the misaligned pairs. The rewriting enhances data quality and provides more useful information for fine-grained understanding.

As shown in Tab. 9, although BLIP-CapFilt utilizes caption rewriting and more than 7× the data, our method shows significant advantages on image captioning tasks. We demonstrate improvements of 9.2% in CIDEr, 2.8% in BLEU@4, 1.4% in METEOR, and 1.9% in SPICE on the MSCOCO dataset compared to BLIP-CapFilt. Similarly, we outperform BLIP-CapFilt by an average of 0.7% in CIDEr and 0.5% in SPICE on the Nocaps dataset. The caption rewriting model of BLIP-CapFilt is trained on the MSCOCO dataset.
9
Retrieval (COCO) Retrieval (Flickr30K) Caption (COCO) Caption (Nocaps)
Dataset Method #Samples
TR@1 IR@1 TR@1 IR@1 CIDEr SPICE CIDEr SPICE
COCO+VG+CC BLIP-CapFilt 129M 72.2 57.7 89.0 79.6 95.2 17.7 74.4 10.7
+SBU+LAION Ours 17.6M 68.9 52.9 89.4 72.2 104.4 19.0 74.8 11.1

Table 8. Zero-shot evaluation of BLIP ViT-B/16 models trained with [18] and our method on downstream tasks. BLIP-CapFilt means
bootstrapping with caption rewrite and filtering with ViT-L/16 models proposed in [18]. Note that the models are not fine-tuned before
evaluation.

MSCOCO NoCaps
Dataset Method #Samples Valid Test
BLEU@4 CIDEr METEOR SPICE CIDEr SPICE CIDEr SPICE
COCO+VG+CC BLIP-CapFilt 129M 28.2 95.2 23.2 17.1 74.4 10.7 72.2 10.5
+SBU+LAION Ours 17.6M 31.0 104.4 24.6 19.0 74.8 11.1 73.3 11.1

Table 9. Zero-shot image captioning results on MSCOCO [20] and NoCaps [1] datasets. Note that the models are not finetuned with CIDEr
optimization on MSCOCO dataset.

Text → Image Image →Text


Dataset Method #Samples Flickr30K MSCOCO Flickr30K MSCOCO
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
- 10.4M 81.1 95.6 98.0 54.6 79.7 87.2 68.4 89.1 93.0 40.9 67.1 76.8
CLIP Score 5.20M 79.3 95.9 97.2 52.3 77.7 85.4 64.7 85.5 90.6 38.7 64.7 74.1
CC12M
BLIP Score 5.20M 81.5 95.4 97.7 54.8 80.4 87.4 68.1 89.0 93.0 40.8 68.1 77.5
Ours 5.20M 83.5 96.4 98.6 56.8 80.6 87.8 66.9 88.9 93.4 41.7 67.7 77.0

Table 10. Zero-shot image-text retrieval results on Flickr30K [27] and MSCOCO [20] datasets.

MSCOCO NoCaps
Dataset Method #Samples Valid Test
BLEU@4 CIDEr METEOR SPICE CIDEr SPICE CIDEr SPICE
- 10.4M 13.1 52.2 15.8 11.6 43.8 8.1 41.8 8.0
CLIP Score 5.20M 15.3 55.4 17.5 12.8 49.4 8.9 46.1 8.8
CC12M
BLIP Score 5.20M 19.0 66.6 19.4 14.6 56.2 9.4 54.8 9.5
Ours 5.20M 20.3 68.2 20.7 15.4 63.8 11.0 60.5 11.0

Table 11. Zero-shot image captioning results on MSCOCO [20] and NoCaps [1] datasets. Note that the models are not finetuned with
CIDEr optimization on MSCOCO dataset.

The captions in the MSCOCO dataset are plain and simple. Admittedly, they perhaps contain more visual objects and simpler descriptions, which boosts retrieval tasks; however, they lack the vivid details that are crucial to captioning tasks and are meticulously formulated in our method.

C. Visualization of COCO-HF Dataset

We visualize image-text pairs from the COCO-HF dataset in Fig. 8. Furthermore, we annotate the textual captions following the criteria for human-preference annotations proposed in Section 3.1 of the main paper. The visualization demonstrates that the textual captions in the dataset differ in completeness, vividness, and description of context.
Example caption groups from Fig. 8 (one group per image):
• "A man in a yellow vest standing next to a white truck with a hose attached to a fire hydrant." / "a man in a reflective vest standing next to a fire hydrant and a truck" / "a man standing next to a fire hydrant"
• "There is a man sitting on the couch next to a woman but he has three neck ties on." / "a man wearing a tie sitting next to a woman wearing glasses sitting on a couch" / "A man on a couch who is wearing several ties." / "a man and woman sitting on a couch."
• "a baseball player swings his bat at a pitch while the catcher and umpire behind him play a game of baseball on the field." / "a baseball batter in red and white on a baseball field"
• "a small dog standing in a green field of grass next to a blue and white soccer ball" / "A pug dog with its paw on a soccer ball on field." / "a pug is playing with a soccer ball in the grass" / "A small dog is playing with a soccer ball."

Figure 8. Visualization of image-text pairs in the COCO-HF dataset. We annotate each caption based on the criteria mentioned in Section 3.1 of the main paper, using different colors to mark visual objects, vivid details, and context.
References

[1] Harsh Agrawal, Peter Anderson, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, and Stefan Lee. nocaps: novel object captioning at scale. In ICCV, pages 8947–8956, 2019.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[3] Dzmitry Bahdanau, Philemon Brakel, Kelvin Xu, Anirudh Goyal, Ryan Lowe, Joelle Pineau, Aaron C. Courville, and Yoshua Bengio. An actor-critic algorithm for sequence prediction. In ICLR, 2017.
[4] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
[5] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12M: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In CVPR, pages 3558–3568, 2021.
[6] Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In NIPS, pages 4299–4307, 2017.
[7] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. CoRR, abs/2305.06500, 2023.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. CoRR, abs/2010.11929, 2020.
[10] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, pages 15262–15271, 2021.
[11] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916, 2021.
[12] Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):664–676, 2017.
[13] Wonjae Kim, Bokyung Son, and Ildoo Kim. ViLT: Vision-and-language transformer without convolution or region supervision. In ICML, pages 5583–5594, 2021.
[14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[15] Julia Kreutzer, Shahram Khadivi, Evgeny Matusov, and Stefan Riezler. Can neural machine translation be improved with user feedback? In NAACL-HLT, pages 92–105, 2018.
[16] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. Int. J. Comput. Vis., 123(1):32–73, 2017.
[17] Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. CoRR, abs/2302.12192, 2023.
[18] Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML, pages 12888–12900, 2022.
[19] Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, pages 19730–19742, 2023.
[20] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
[21] James MacGlashan, Mark K. Ho, Robert Tyler Loftin, Bei Peng, Guan Wang, David L. Roberts, Matthew E. Taylor, and Michael L. Littman. Interactive learning from policy-dependent human feedback. In ICML, pages 2285–2294, 2017.
[22] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. WebGPT: Browser-assisted question-answering with human feedback. CoRR, abs/2112.09332, 2021.
[23] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2Text: Describing images using 1 million captioned photographs. In NIPS, pages 1143–1151, 2011.
[24] Mayu Otani, Riku Togashi, Yu Sawai, Ryosuke Ishigami, Yuta Nakashima, Esa Rahtu, Janne Heikkilä, and Shin'ichi Satoh. Toward verifiable and reproducible human evaluation for text-to-image generation. In CVPR, pages 14277–14286, 2023.
[25] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In NeurIPS, 2022.
[26] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, Mingxing Tan, and Quoc V. Le. Combined scaling for zero-shot transfer learning. Neurocomputing, 555:126658, 2023.
[27] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015.
[28] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In ICCV, pages 2641–2649, 2015.
[29] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763, 2021.
[30] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do ImageNet classifiers generalize to ImageNet? In ICML, pages 5389–5400, 2019.
[31] Nataniel Ruiz, Yuanzhen Li, Varun Jampani, Yael Pritch, Michael Rubinstein, and Kfir Aberman. DreamBooth: Fine tuning text-to-image diffusion models for subject-driven generation. In CVPR, pages 22500–22510, 2023.
[32] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. CoRR, abs/2111.02114, 2021.
[33] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[34] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017.
[35] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, pages 2556–2565, 2018.
[36] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize with human feedback. In NIPS, 2020.
[37] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: The new data in multimedia research. Commun. ACM, 59(2):64–73, 2016.
[38] Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, and Mike Zheng Shou. Too large; data reduction for vision-language pre-training. CoRR, abs/2305.20087, 2023.
[39] Xiaoshi Wu, Keqiang Sun, Feng Zhu, Rui Zhao, and Hongsheng Li. Better aligning text-to-image models with human preference. CoRR, abs/2303.14420, 2023.
[40] Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. CiT: Curation in training for effective vision-language data. CoRR, abs/2301.02241, 2023.
[41] Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. CoRR, abs/2304.05977, 2023.
[42] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. VinVL: Revisiting visual representations in vision-language models. In CVPR, pages 5579–5588, 2021.
[43] Shu Zhang, Xinyi Yang, Yihao Feng, Can Qin, Chia-Chih Chen, Ning Yu, Zeyuan Chen, Huan Wang, Silvio Savarese, Stefano Ermon, Caiming Xiong, and Ran Xu. HIVE: Harnessing human feedback for instructional visual editing. CoRR, abs/2303.09618, 2023.
[44] Wangchunshu Zhou and Ke Xu. Learning to compare for better training and evaluation of open domain natural language generation models. In AAAI, pages 9717–9724, 2020.
