Incorporating Visual Information Into Natural Language Processing
Winter 1-24-2025
Recommended Citation
Aladago, Maxwell Mbabilla, "Incorporating Visual Information into Natural Language Processing" (2025).
Dartmouth College Ph.D Dissertations. 335.
https://digitalcommons.dartmouth.edu/dissertations/335
INCORPORATING VISUAL INFORMATION INTO NATURAL
LANGUAGE PROCESSING
A Thesis
Submitted to the Faculty
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
in
Computer Science
by
Maxwell Mbabilla Aladago
DARTMOUTH COLLEGE
Hanover, New Hampshire
January 2025
Examining Committee:
Lorenzo Torresani
Temiloluwa O. Prioleau
Anthony J Piergiovanni
Natural language describes entities in the world, some real and some abstract. It is
also common practice to complement human learning of natural language with visual
cues. This is evident in the heavily graphical nature of children’s literature which
underscores the importance of visual cues in language acquisition. Similarly, the no-
tion of “visual learners” is well recognized, reflecting the understanding that visual
signals such as illustrations, gestures, and depictions effectively supplement language.
In machine learning, two primary paradigms have emerged for training systems in-
volving natural language. The first paradigm encompasses setups where pre-training
and downstream tasks are exclusively in natural language. The second paradigm
comprises models that require joint reasoning over both language and visual inputs
during pre-training and downstream tasks. Given the widely acknowledged role of
visual input in human language comprehension, it is pertinent to inquire whether
visual information can similarly augment the comprehension of language-only tasks
in machine learning. Despite the remarkable advancements in the capabilities of ma-
chine learning models across all domains in recent years, the concept of supplementing
Natural Language Processing with visual signals remains insufficiently explored. This
is in part due to the absence of clear and effective strategies for integrating visual
information into language models, given the limited availability of extensive, high
quality image-language paired datasets. In this thesis, we address this challenge and
propose two frameworks for incorporating visual information into natural language
pre-training leveraging multimodal models as intermediaries between visual informa-
tion and language models. Empirical evaluations conducted on language pre-training
datasets of varying sizes demonstrate the efficacy of the proposed frameworks across
diverse downstream language tasks. In addition, we introduce methods for training
effective multimodal models through architectural innovations and novel multimodal
data augmentation techniques. The representations generated by our multimodal
models improve performance in zero-shot image categorization, visual question an-
swering, visual entailment, and cross-modal retrieval tasks in downstream evaluations.
Finally, this thesis presents a novel method for constructing effective neural networks
by selection from randomly initialized parameters in contrast to the conventional
practice of parameter updates via gradient descent.
Acknowledgements
Contents
Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
1 Introduction 1
1.1 Modeling Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Vision-Guided Natural Language Pre-training . . . . . . . . . . . . . 4
1.3 Main Contributions and Outline of the Thesis . . . . . . . . . . . . . 7
2 Related Works 10
2.1 Language Models Pre-training and Grounding . . . . . . . . . . . . . 10
2.2 Multimodal Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Zero-shot Semantic Relatedness . . . . . . . . . . . . . . . . . 32
3.4.2 General Language Understanding Evaluation (GLUE) Results 33
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.5 Pre-training with Guidance from Multimodal Text Encoder . . 39
3.4.6 Is the Visual Information Necessary? . . . . . . . . . . . . . . 42
3.5 Implications & Limitations . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6 Slot Machines 84
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.2 Learning in Slot Machines . . . . . . . . . . . . . . . . . . 89
6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 93
6.4.1 Slot Machines versus Traditionally-Trained Networks . . . . . 93
6.4.2 Fine-tuning Slot Machines . . . . . . . . . . . . . . . . . . . . 95
6.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5.1 Greedy Selection Versus Probabilistic Sampling . . . . . . . . 97
6.5.2 Weights Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5.3 Sparse Slot Machines . . . . . . . . . . . . . . . . . . . . . 100
6.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
7 Conclusion 103
References 107
List of Tables
4.8 Modality Involved in Composition: Applying semantic compo-
sitions on both modalities is the most consistently effective method
across different downstream datasets and tasks. . . . . . . . . . . . . 63
4.9 SLIP + CLIP-C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.10 Sugar-Crepe Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
List of Figures
6.3 Slot Machines versus Traditional Training . . . . . . . . . . . . . . 94
6.4 Test Accuracy versus Flops for Slot Machines . . . . . . . . . . . . 94
6.5 Finetuning Slot Machines Selected Weights . . . . . . . . . . . . . 95
6.6 Different Slot Machine Finetuning Checkpoints . . . . . . . . . . . 96
6.7 Slot Machines Selection Method . . . . . . . . . . . . . . . . . . . 97
6.8 Weight exploration in Slot Machines. . . . . . . . . . . . . . . . . 98
6.9 Weights Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
Chapter 1
Introduction
Unless our words, concepts, ideas are hooked onto an image, they
will go in one ear, sail through the brain, and go out the other ear
Figure 1.1: Different Model Paradigms: (a) A natural language processing model
versus (b) a computer vision model versus (c) a multimodal model. In
BERT [39], the inputs and downstream tasks are in textual form. In the
Vision Transformer [41], the inputs are images with integer labels as out-
puts. To perform the desired task, PaLI [25] must integrate inputs from
both the visual and textual modalities. Credit: The images are from the
respective papers.
In the visual modality, input data comprises either images or videos, whereas in the language
modality, the data is textual in nature. Within this framework, certain modeling
patterns can be discerned, as depicted by representative examples in Figure 1.1 and
described in the subsequent sections.
Section 1.1
Modeling Paradigms from an Input Perspective
(a) Unimodal Models: Typically, unimodal models process inputs from a singular
modality and execute tasks that necessitate only this specific modality. This
classification encompasses two prominent fields within machine learning:
(i) Computer Vision (CV) Models: These models process inputs in the form
of images or videos and are designed for tasks ranging from image recog-
nition and object detection to semantic segmentation. This category en-
compasses early models such as AlexNet [84] and ResNet [55], as well as
more contemporary models like Vision Transformers [41].
(ii) Natural Language Processing (NLP) Models: In the domain of NLP, models
are provided with natural language inputs, which are typically represented
as tokens. Once trained, these models are capable of performing a range
of tasks, including sentiment classification, semantic similarity matching,
summarization, and entailment. Examples of models within this domain
include BERT [39] and the GPT series [137, 138, 15] of models, among
others.
(b) Multimodal Models: This category contains models trained on inputs from
multiple modalities, thereby enabling them to perform tasks that necessitate
integrated reasoning across these modalities. Notable examples include tradi-
tional multimodal models such as VilBERT [110], PaLI [25], SimVLM [164],
and Flamingo [3]. These models are designed to address a variety of downstream
multimodal tasks, such as visual question answering (VQA), visual entailment,
and image captioning.
While unimodal and multimodal models have prominently driven recent advance-
ments in machine learning, we posit that an alternative modeling framework warrants
increased scholarly attention. This framework involves models that incorporate in-
puts from both visual and linguistic domains during the pre-training phase, yet utilize
only a single modality when addressing downstream tasks.
In this respect, significant advancements have been achieved in vision-centric
downstream tasks, ranging from zero-shot image categorization to open-vocabulary
object detection [85], through the application of joint embedding models like CLIP [136]
and ALIGN [70]. Furthermore, text-conditioned generative visual models, including
Imagen 3 [7], DALL-E [10], and Stable Diffusion [143], have been developed, demon-
strating remarkable capabilities in generating visual outputs from natural language
inputs. Motivated by these advancements in vision-centric systems, this thesis ex-
plores the potential for achieving similar improvements in language-centric tasks by
incorporating visual cues during the pre-training phase of language models. This mod-
eling approach, depicted in Figure 1.2, is referred to in this thesis as vision-guided
language learning.
Section 1.2
Vision-Guided Natural Language Pre-training
in an identical manner following the pre-training stage. The motivation for building
language models with visual priming has foundations in three key areas:
(a) Cognitive Realism: Writing in The Modern Language Review, Troscianko [157]
defines cognitive realism as “the capacity of a text to tap in directly to some
aspect of a reader’s cognitive faculties... A text that is cognitively realistic cor-
responds to how we really remember, or see, or feel, and may therefore induce
a particularly effortless imaginative response on the reader’s part.” This con-
cept of “cognitive text” effectively encapsulates the role of language in shaping
our understanding of the world, a dimension that remains underrepresented in
current NLP models. In human cognition, sentences are not merely isolated se-
quences of tokens, as often represented in language models; rather, they embody
feelings, visions, and emotions. The significance of visual aids such as graphics,
Section 1.3
Main Contributions and Outline of the Thesis
This thesis introduces a framework for incorporating visual information during the
pre-training of language models and evaluates its effectiveness across multiple lan-
guage tasks. Additionally, we propose two methods for training effective multimodal
models through creative data augmentation and better multimodal representation fu-
sion. Finally, we present an innovative approach for “training” neural networks based
on weight selection rather than traditional gradient updates.
Chapter 2 examines the related works in the areas of language model pre-training,
grounded language models, and multimodal methodologies. This discussion underpins
the approaches detailed in Chapters 3, 4 and 5. The related works pertinent to Slot
Machines are addressed in Chapter 6.
In Chapter 3, we present an architecture for vision-guided language learning that
leverages the knowledge embedded in multimodal models for language grounding.
First, we facilitate the transfer of visual knowledge from a multimodal image encoder
to a language model through guided pre-training on paired image-text data. Within
the context of this thesis, a multimodal image encoder is defined as the component
responsible for processing images in a vision-language model such as CLIP [136].
Due to the limited availability of high-quality paired image-text data, we propose a
variation of our approach that employs the multimodal text encoder from a vision-
language model. This second approach enables us to circumvent the necessity for raw
image data by using the text encoder’s association with visual information as a sur-
rogate. Text corpora can then be used for guided pre-training thereby mitigating the
challenges associated with the limited availability of high-quality image-text datasets.
Empirical evaluations on various language benchmarks demonstrate that our mod-
els surpass those trained with language-only pre-training. The improvements are par-
ticularly pronounced in tasks assessing semantic relatedness, where the objective is
to evaluate the proximity between pairs of sentences. Our explorations are scoped to
language models trained with a de-noising objective, specifically the masked language
model objective introduced in BERT [39].
Unlike Chapter 3, which concentrates on leveraging vision-language models to
enhance natural language understanding, Chapter 4 introduces a novel approach
for constructing more robust vision-language models, particularly in contexts where
paired image-text data is scarce. Our framework, termed CLIP-C (for CLIP Com-
positions), generates a third composite image-text pair from two randomly selected
image-text pairs based on a predetermined probability. We label the examples cre-
ated through this operation composite examples. The vision-language model is sub-
sequently trained on a mini-batch comprising both simple and composite examples,
with the ratio of composite examples determined by the composition probability. Ex-
perimental results across multiple vision-language datasets of varying scales demon-
strate that CLIP-C consistently yields performance improvements across tasks such
as zero-shot image categorization, image-text retrieval, and linear evaluations when
compared to baseline methods.
Similar to Chapter 4, Chapter 5 focuses on enhancing multimodal representation
learning via architectural innovations, introducing a novel formulation of multimodal
tokens. This approach embeds image and text representations into the same feature
vectors, thus facilitating stronger interactions between the two modalities. This is
accomplished by concatenating image tokens and text tokens along the channel di-
mension. Cross-attention is employed to retrieve the complementary tokens which are
then merged together on the feature dimension. Referred to as Compound Tokens,
this method demonstrates superior performance compared to standard approaches
across several multimodal tasks, including visual question answering and visual en-
tailment.
Finally, in Chapter 6, we present Slot Machines, a novel method of training
neural networks based on weight selection as opposed to the conventional approach of
weight optimization in a continuous space. Slot Machines identify effective combi-
nations of random weights by selecting from a predetermined set of random values for
each connection, achieving performance comparable to that of traditionally trained
networks with identical capacity. This demonstrates the potential existence of effective
random networks whose weights remain unchanged. In contrast to our investigations
in Chapters 3, 4, and 5, which explore multimodal models and the enhancement of
language models through visual information, Slot Machines are primarily applied
to computer vision tasks.
Chapter 2
Related Works
We organize our discussion of related works into three primary categories: (1) Lan-
guage models pre-training and grounding which is pertinent to our research in Chap-
ter 3 on vision-guided language model pre-training, leading to the development of
VGLMs and MT-GLMs; (2) Multimodal methods encompassing CLIP-C (Chapter 4)
and Compound Tokens (Chapter 5); and (3) Neural network initialization
and pruning, which relates to Slot Machines (Chapter 6). In this chapter, we
focus on the first two categories, while the third category is reviewed in Chapter 6,
as Slot Machines primarily pertains to the domain of computer vision model ini-
tialization.
Section 2.1
Language Models Pre-training and Grounding
Grounded Language Models. The idea that natural language must be grounded
in physical reality has motivated many works in machine learning over the years [89,
176, 76, 154]. Several studies utilize visual information to enhance particular language
tasks including cross-modal learning [122], machine translation [43, 69, 181] or seman-
tic parsing [80, 150]. Our contribution to grounded language models, as detailed in
Chapter 3, is akin to the vokenization method [154] which emphasizes pre-training
with visual information to enhance general language understanding. The vokeniza-
tion approach comprises two stages. Initially, a token-to-image retrieval model is built
by mapping individual words to images using a hinge loss function. This first phase
generates visual tokens or “vokens” for use in the subsequent training phase. In their
experiments, Tan and Bansal [154] utilize an image captioning dataset for vokenization
where each word in a caption is linked to the corresponding image. The second phase
involves language model pre-training which incorporates a masked “voken” prediction
alongside the masked token prediction used in BERT. Given a sequence of token IDs
and their associated voken IDs retrieved from a predefined set of images, an identi-
cal masking protocol is applied to both, with each contributing to their respective
masking prediction loss function.
Our vision-guided language models (VGLMs) do not employ vokenization nor
a voken classification objective. Instead, we directly utilize continuous embeddings
obtained from a pre-trained image encoder to implement a guidance loss via con-
trastive learning. Moreover, we introduce multimodal text encoders as effective alter-
native modules to image encoders, thereby mitigating the dependency on image-text
datasets.
A second closely related study builds upon the SkipThought model [79] by fus-
ing visual semantics into a grounded space through the incorporation of cluster and
perceptual loss functions [13]. The cluster loss evaluates whether two sentences are
visually equivalent without considering the content of the associated images. Two
sentences are designated as visually equivalent if they are paired with the same im-
age. The perceptual loss, as the second objective, factors in the structure of the
visual space by ensuring that the similarity between two sentences correlates with the
similarity between their corresponding images.
Our vision-guidance loss in VGLMs is distinct from both the cluster and per-
ceptual objectives introduced by Bordes et al. [13]. Our approach involves directly
contrasting the textual representations with their corresponding visual embeddings or
inferred visual representations (for the case where we use a multimodal text encoder)
for improved guided learning. Furthermore, we use a modern language model based
on BERT, in contrast to the SkipThought model utilized by Bordes et al. [13].
Finally, in recent years, a class of investigations has emerged that evaluate existing
multimodal models on language-only tasks, yielding mixed results [4, 67]. Alper et
al. [4] observed that multimodal extensions of pre-trained language models enhance
the performance of “visual-language tasks” such as color association prediction and
shape association prediction. Conversely, Iki and Aizawa [67] found that these exten-
sions can degrade performance on natural language tasks. Similarly, Cao et al. [18]
noted that while multimodal text encoders demonstrate strong performance on se-
mantic linguistic probing tasks, they consistently underperform BERT.
Our VGLMs are language models augmented with visual information which we
train from scratch without targeting any multimodal tasks. Additionally, we do not
use the probing methodologies presented in the prior works. Instead, we evaluate our
models on downstream datasets in both zero-shot and fine-tuning scenarios.
Section 2.2
Multimodal Methods
The use of language as a potent supervisory signal for learning visual representations
has a rich history in machine learning [135, 47, 73, 96, 136]. Notably, early influential
vision-language endeavors like DeViSE [47] capitalized on unannotated textual data
to first learn semantic relationships and subsequently map images into that es-
tablished semantic space through the use of class labels. Recent works have relied
principally on weakly supervised datasets from the internet to build joint vision and
language embedding models.
Joint vision and language embedding models. Recently, widely adopted models
such as CLIP [136], ALIGN [70], and others [121, 104, 63, 68] produce joint embed-
ding representations by training contrastively on extensive image-text paired datasets
using the InfoNCE loss [158]. Predominantly, these models utilize free-form text as a
supervisory signal for training generalized vision models facilitating a variety of capa-
bilities, including zero-shot image categorization, open-vocabulary object detection,
and semantic segmentation. However, both CLIP [136] and ALIGN [70] used massive
amounts of data, consuming 400 million and 1 billion image-text pairs, respectively.
The collection of such extensive datasets is resource-intensive and impractical for re-
searchers in resource-constrained environments. Furthermore, the reliance on large
image-text datasets impedes the adoption of these methods in application areas where
such datasets are not readily available, such as in medical imaging.
In response, subsequent approaches such as DeCLIP [104] and SLIP [121] have aimed
to enhance the data efficiency of the CLIP model. DeCLIP [104] uses multiple train-
ing objectives, including self-supervision on each modality [26, 39], nearest-neighbor
supervision, and multi-view supervision [19]. SLIP [121] supplements the CLIP loss
with image self-supervision by employing the methodology outlined in SimCLR [24].
Another contribution toward reducing the training data requirements of CLIP is pre-
sented by Wu et al. [167], who exploit optimal transport distillation to implement soft
image-text matches. These existing methods aimed at enhancing the data efficiency
of CLIP exhibit several limitations. SLIP and DeCLIP need multiple passes through
the image encoder [121, 104] for every parameter update thus increasing computa-
tional demands. Other methods, such as that proposed by Fan et al. [44], need access
to proprietary conversational systems such as ChatGPT to re-write image captions.
Our work on CLIP-C discussed in Chapter 4 aligns with these studies in devel-
oping joint-embedding vision and language models targeted towards increasing data
Chapter 3
Vision-Guided Language Learning
This chapter discusses our methodology for vision-guided language learning, wherein
visual information is incorporated into the pre-training phase of a language model.
The models obtained from our framework are called VGLMs for Vision-Guided Lan-
guage Models. We employ paired image-text datasets to train VGLMs. However,
considering the short supply of high-quality image-text datasets, we introduce a sec-
ond method for guided learning that relies chiefly on multimodal text encoders. We
call the second variation of models MT-GLMs, for Multimodal-Text-Guided Lan-
guage Models. Empirical evaluations demonstrate that both VGLMs and MT-GLMs
surpass the performance of baseline unguided language models across various language
tasks in both zero-shot and fine-tuning scenarios.
Section 3.1
Overview
substituted with their synonyms, (b) correct identification of keyword(s), and (c)
semantic understanding. We expound on these observations in Section 3.4.3.
On GLUE [160], the benefits of vision-guided learning are very modest when pre-
training on MS-COCO. The main guided model beats the baseline language model by
0.43%. The second class of models we introduced, MT-GLMs, that use multimodal
text encoders (see Figure 3.5) beat the baseline models by 1.24%, 0.93%, and 0.24%
when pre-training on MS-COCO (captions), Wiki-103 [119], and English Wikipedia
respectively. These results suggest that the benefits of guidance on general language
understanding diminish with increasing text corpus size. All the empirical results for
our experiments on this work are discussed in Section 3.4.
Section 3.2
Technical Approach
The technical details of our approach including the background material for the model
are discussed in this section. At a high level, our pre-training method uses two objec-
tive functions: a Masked Language Modeling (MaskLM) loss and a noise contrastive
estimation (InfoNCE) loss, as illustrated in Figure 3.2. Prior to delving into these
two objectives, we first present an explanation of the architectural components of our
network.
3.2.1. Architecture
As depicted in Figure 3.2, the trainable module in our architecture is a language
model F that takes as input a piece of text and produces a sequence of token representations
for each input sequence $w_i$. The “class token” is designated as the token responsible
for capturing the overall semantic essence of the input sequence. Consequently, this
token is the representation most commonly utilized in downstream tasks [39, 154],
although alternative sequence pooling methods, such as mean pooling, may also be
employed.
The focus of this work is on improving the language capabilities of F using visual
information. We invoke a pre-trained image encoder G that outputs an embedding
$z_i \in \mathbb{R}^d$ for each input image $x_i$. Owing to resource constraints, our experimentation
primarily focuses on the use of images to demonstrate the efficacy of our method.
However, it is feasible to incorporate videos or other forms of visual media within our
method. In our experiments, we utilize the image encoder from a multimodal model
for G to leverage the benefits of prior exposure to textual information. Additionally,
multimodal image encoders typically encounter a broader array of images compared
to supervised image models. In the ablations in Table 3.4, we present results for an
image encoder trained on ImageNet-22k for comparison with MetaCLIP. As explained
before, G is not updated during pre-training as the model of interest is the language
model F. Keeping G frozen also enables the pre-computation of image representations
prior to the commencement of pre-training.
We implement a linear transformation $E : \mathbb{R}^c \rightarrow \mathbb{R}^d$ on the class token $t_i^*$ to produce a
distillation token $z_{t_i^*} \in \mathbb{R}^d$ that matches the feature dimension of the image output $z_i$.
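To make the data flow concrete, the sketch below shows these components in PyTorch-style code. The module classes and checkpoint names are illustrative assumptions rather than the exact implementation: in our experiments F is trained from scratch and G is a MetaCLIP image tower, for which a generic CLIP vision model stands in here.

```python
import torch
import torch.nn as nn
from transformers import BertConfig, BertForMaskedLM, CLIPVisionModelWithProjection

class VGLM(nn.Module):
    """Sketch: a trainable BERT-style language model F, a frozen multimodal
    image encoder G, and a linear map E projecting the class token into the
    image-embedding space."""

    def __init__(self, text_dim=768, image_dim=512):
        super().__init__()
        # Trainable language model F, randomly initialized ("Base" size).
        self.F = BertForMaskedLM(BertConfig())
        # Frozen multimodal image encoder G (a CLIP-style vision tower is
        # used here as a stand-in for the MetaCLIP encoder).
        self.G = CLIPVisionModelWithProjection.from_pretrained(
            "openai/clip-vit-base-patch32")
        for p in self.G.parameters():
            p.requires_grad = False
        # Linear transformation E : R^c -> R^d.
        self.E = nn.Linear(text_dim, image_dim)

    def forward(self, input_ids, attention_mask, labels, pixel_values):
        # Masked-LM forward pass; `labels` holds the masked-token targets.
        out = self.F(input_ids=input_ids, attention_mask=attention_mask,
                     labels=labels, output_hidden_states=True)
        # Class-token representation t_i^* and its projection z_{t_i^*}.
        z_text = self.E(out.hidden_states[-1][:, 0])
        # Frozen image embedding z_i (can also be pre-computed offline).
        with torch.no_grad():
            z_image = self.G(pixel_values=pixel_values).image_embeds
        return out.loss, z_text, z_image
```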
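Concretely, the masked language modeling objective takes the standard masked cross-entropy form. The following is a sketch of Equation 3.1 in the notation defined below, with $\mathcal{M}_i$ (our shorthand) denoting the set of masked positions in sentence $i$:

$$\mathcal{L}_{MaskLM} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j \in \mathcal{M}_i} H\left(w_i^j,\, t_i^j\right)$$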
where $H$ is the cross-entropy loss, $t_i^j$ is the softmax output for token $j$ in sentence
$i$, and $w_i^j$ is the ground-truth label of $t_i^j$. We use the masking protocol employed in
BERT [39]. Overall, 15% of the tokens in each sentence are randomly selected for
masking. For each selected token $t$, the following procedure ensues: (1) $t$ is
replaced with the placeholder token [MASK] with a probability of 80%. (2) 10% of the
time, $t$ is replaced by a random token $t'$ from the vocabulary. (3) $t$ is left unmasked
10% of the time to match the downstream fine-tuning setup where masking is absent.
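A compact sketch of this masking procedure (illustrative; padding and special-token handling are omitted):

```python
import torch

def mask_tokens(input_ids, mask_token_id, vocab_size, mlm_prob=0.15):
    """BERT-style masking: select 15% of positions; of those, 80% become
    [MASK], 10% become a random vocabulary token, and 10% stay unchanged."""
    input_ids = input_ids.clone()
    labels = input_ids.clone()
    selected = torch.rand(input_ids.shape) < mlm_prob
    labels[~selected] = -100  # positions ignored by the cross-entropy loss

    # 80% of the selected positions -> [MASK].
    to_mask = selected & (torch.rand(input_ids.shape) < 0.8)
    input_ids[to_mask] = mask_token_id

    # Half of the remaining selected positions (10% overall) -> random token.
    to_random = selected & ~to_mask & (torch.rand(input_ids.shape) < 0.5)
    random_tokens = torch.randint(vocab_size, input_ids.shape)
    input_ids[to_random] = random_tokens[to_random]

    # The final 10% keep their original token.
    return input_ids, labels
```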
Vision-Guidance Loss. This loss function is derived from Noise Contrastive Es-
timation (InfoNCE) [158] which is commonly employed in contexts that involve the
concept of matching representation pairs such as matching image-text pairs [136] or
distinct crops of the same image [24]. In this framework, the matching pairs are
designated as positive samples, while the non-matching pairs within the batch are
designated as negative examples. InfoNCE functions to bring the representations
of matching pairs closer together while distancing the non-matching pairs. Given a
batch of $N$ matching pairs of representations $\{(p^{(1)}, q^{(1)}), (p^{(2)}, q^{(2)}), \dots, (p^{(N)}, q^{(N)})\}$,
the contrastive loss is computed over these pairs.
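In its standard form, the per-pair term contrasts each positive pair against the other pairings in the batch. A sketch of this term in the notation above, with $\mathrm{sim}(\cdot,\cdot)$ a similarity function (typically cosine similarity) and $\tau$ a temperature (the exact variant used in this work may differ slightly):

$$\mathcal{L}_D\left(p^{(i)}, q^{(i)}\right) = -\log \frac{\exp\left(\mathrm{sim}\left(p^{(i)}, q^{(i)}\right)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}\left(p^{(i)}, q^{(j)}\right)/\tau\right)}$$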
We employ the unidirectional loss $\mathcal{L}_D\left(p^{(i)}, q^{(i)}\right)$, since the image encoder is locked
during language model pre-training. Consequently, the overall loss for all $N$ examples
in our configuration is expressed by
$$\mathcal{L}_D = \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_D\left(p^{(i)}, q^{(i)}\right) \tag{3.3}$$
In our model, the trainable embedding $p^{(i)}$ corresponds to the class token represen-
tation $z_{t_i^*}$ of the language model F. The target vector $q^{(i)}$ represents the corresponding
image representation $z_i$ from the pre-trained image encoder. It is worth reiterating
that the image encoder remains static during the language model’s pre-training
phase, as the primary objective is to improve the performance of F on language tasks.
Vision-guided language learning combines the language modeling loss, as delineated
in Equation 3.1, with the guidance loss described in Equation 3.3.
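In sketch form, with $\beta$ denoting the guidance weighting factor examined in Section 3.4.4, the combined objective is

$$\mathcal{L} = \mathcal{L}_{MaskLM} + \beta\, \mathcal{L}_D$$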
Section 3.3
Experimental Setup
We build our framework using the masked language modeling setup popularized in
BERT [39] due to its simplicity and broad applicability across numerous language
tasks. While it is acknowledged that generative models such as GPT-2 [138] and GPT-
3 [15] have achieved more widespread adoption in recent years within the machine
learning community, BERT continues to serve as an effective network for language
modeling, particularly in the low-resource settings upon which our experiments are
based.
All comparative experiments are performed by training the language model from
randomly initialized parameters to ensure control over pre-training data, model size,
and other crucial hyperparameters that could significantly impact performance. Due
to resource constraints, our experiments are primarily confined to small data and
model regimes. While we anticipate that our observations will extend to larger models
and datasets, we are unable to empirically demonstrate this scalability given the
constraints of our current study.
Models. The main network used for the language model F is the “Base” variant
of the BERT [39] model, which is built on the Transformer architecture [159].
This model comprises 12 attention blocks, each integrating self-attention operations
with a multi-layer perceptron. The model has a hidden dimension of 768, distributed
across 12 attention heads, each with a dimension of 64. (See [159] for details of
the self-attention operation). The tokenization module employed is the WordPiece
tokenizer, developed for BERT, which supports a vocabulary size of 30,522 tokens.
Additionally, results for model configurations smaller than the “Base” architecture are
provided for comparative analysis.
For the image encoder G, we use the image tower of a pre-trained MetaCLIP [63].
This model was trained using the vision-language contrastive learning objective on
approximately 400 million image-text pairs, curated to closely resemble the private
dataset used in CLIP [136]. For our experiments, we adopt the “Base” 1 configuration
of MetaCLIP with a patch size of 32. The image features generated by G have a
dimension of 512; hence, a linear module E (refer to Section 3.2) is employed to
map the language model’s class token to this dimensionality. Results for other pre-
trained image encoders, including a model trained on ImageNet-22K, are presented in
Section 3.4.4. It is important to note that the image encoder is kept frozen throughout
the language model pre-training.
select one of these captions to serve as the text pair for each image.
It is highly likely that the distribution of captions within the MS-
COCO dataset does not align with the general distribution of free-form text. This
assumption has been corroborated by previous studies, including the study of Tan
and Bansal [154]. Consequently, the results derived from using MS-COCO, or any
paired image-text dataset, may not transfer optimally to standard text corpora.
To address this challenge, we propose a modification of our VGLM architecture,
wherein the image encoder is replaced with a multimodal text encoder, as depicted in
Figure 3.5. This adjustment eliminates the requirement for paired image-text data.
As a result, this architecture enables us to pre-train on text-only datasets, such as
English Wikipedia2 and its subset Wiki-103 [119]. The English Wikipedia dataset
consists of approximately 120 million sentences, while Wiki-103 encompasses around
4.2 million sentences.
Hyper-parameters. Our models are trained with a batch size of 1,024 examples,
each having a sequence length of 32 tokens. The number of training iterations is set
to 20 thousand for MS-COCO, 100 thousand for Wiki-103, and 200
thousand for English Wikipedia. We employ a learning rate of $4 \times 10^{-4}$ with
a cosine annealing schedule and a warm-up phase of 5,000 steps. The weight decay
parameter is set to 0.01, and gradient clipping is applied with a maximum gradient
norm of 1.0. The AdamW [109] optimizer is utilized, with beta parameters (0.9, 0.999)
and epsilon $1 \times 10^{-8}$. Mixed-precision training is conducted across all models using
the PyTorch framework [129] along with the HuggingFace [166] and Accelerate [52]
packages.
2
https://github.com/attardi/wikiextractor
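The optimization setup described above can be expressed roughly as follows. This is a sketch in which `model` is assumed to return the scalar training loss and `dataloader` to yield tokenized batches:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.01,
                              betas=(0.9, 0.999), eps=1e-8)
# e.g., 200k total steps for English Wikipedia, with a 5k-step warm-up.
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=5_000, num_training_steps=200_000)
scaler = torch.cuda.amp.GradScaler()  # mixed-precision training

for batch in dataloader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        loss = model(**batch)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)  # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    scheduler.step()
```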
We use the Semantic Textual Similarity (STS) dataset curated from news ar-
ticles, online forums, etc., for evaluating our models. STS [2] has several cuts:
STS12, STS13, STS14, STS15 and STS16. We test on all these versions includ-
ing the STS-Benchmark (STS-B) [20] which is a selection of samples in the five
prior STS datasets. Each pair of sentences $(s_1, s_2)$ in STS is associated with
a rating $r \in [0, 5]$ obtained from human annotators through Amazon Mechanical
Turk3. A rating of $r = 5$ corresponds to two sentences that are completely equiva-
lent in meaning, e.g., “the child is bathing in the sink” and “the child is washing
himself in the water basin”. On the other extreme, r = 0 means the two sen-
tences are about different topics and uncorrelated. For example, “Jim played
chess and won several championships in his youth” and “It is really dark and
cold outside tonight”.
(iii) The Multi-Genre Natural Language Inference (MNLI) dataset is used to test natural language inference.
(iv) The Quora Question Pairs (QQP) is a question similarity task comprising
pairs of questions sourced from Quora4 . Each pair of questions is associated
with a human-labeled similarity score $r \in \{1, 2, 3, 4, 5\}$, and the task of the
model is to predict the score given a question pair.
Unlike the approach adopted for semantic relatedness tasks, where zero-shot
evaluations are conducted, we fine-tune on each GLUE task using the respec-
tive standard training dataset prior to evaluation on a separate test set. As
illustrated in Figure 3.2, the fine-tuning process is applied exclusively to the
language model F. Typically, fine-tuning is executed with a batch size of 64
and a sequence length of 128. The learning rate for this procedure is configured
to $5 \times 10^{-5}$. We use the AdamW [109] optimizer and fine-tune for 10 epochs.
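A minimal sketch of this fine-tuning setup, in which a linear task head is placed on the class-token representation of F; the head and pooling choice shown here are illustrative assumptions:

```python
import torch.nn as nn

class GlueClassifier(nn.Module):
    """The pre-trained language model F plus a linear classifier over its
    class-token representation."""

    def __init__(self, language_model, hidden_dim=768, num_labels=2):
        super().__init__()
        self.F = language_model
        self.head = nn.Linear(hidden_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.F(input_ids=input_ids, attention_mask=attention_mask,
                        output_hidden_states=True).hidden_states[-1]
        return self.head(hidden[:, 0])  # logits from the class token
```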
Section 3.4
Main Results and Analysis
This section summarizes the results of our comparative evaluations between vision-
guided language training and the baseline model on semantic relatedness and general
language understanding. We name the baseline model MaskLM in our experiments
since it is trained with only the masked language modeling objective. The datasets,
models, and hyper-parameters used in these experiments are detailed in Section 3.3.
4
https://www.quora.com/
Table 3.1: Zero-shot Semantic Relatedness Results: The results here show
clearly that our guided model is far superior to the unguided model on
semantic relatedness tasks across all datasets and model sizes. B, M and S
respectively stand for “Base”, “Medium” and “Small” configurations of the
language model. We employ the same image encoder in all experiments
here.
Figure 3.3: Semantic Relatedness Correlation: These plots show that the base-
line model (left) assigns a high similarity score to a majority of question
pairs compared to our method (right) whose predictions correlate better
with the human-annotated scores.
information helpful are illustrated in Figure 3.1 and explained in greater detail in
Section 3.4.3.
3.4.3. Discussion
As presented in Table 3.1, vision-guided models are markedly more effective relative
to the text-only model on semantic relatedness. In this section, we conduct an in-
depth analysis of the raw predictions produced by both models to understand the
results at a more granular level. First, we plot the relatedness score from each model
against the human-annotated scores as depicted in Figure 3.3 using examples from
the Quora Question Pairs (QQP) dataset. The results show that the unguided model
(Figure 3.3, left) tends to perceive all question pairs as highly similar even in cases
where the human ratings are notably low (e.g., a score of 1). In contrast, the vision-
guided model (Figure 3.3, right) competently distinguishes between question pairs
with differing meanings and those that are semantically equivalent.
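The scoring behind such comparisons can be sketched as follows, assuming relatedness is measured as the cosine similarity between the two sentence embeddings and agreement with human ratings as a rank correlation, which is standard practice for these benchmarks; `encode` is an assumed helper returning the class-token embedding:

```python
import numpy as np
from scipy.stats import spearmanr

def zero_shot_relatedness(encode, sentence_pairs, human_scores):
    """Cosine similarity between sentence embeddings, scored against human
    ratings with Spearman correlation."""
    sims = []
    for s1, s2 in sentence_pairs:
        e1, e2 = encode(s1), encode(s2)  # class-token embeddings
        sims.append(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))
    return spearmanr(sims, human_scores).correlation
```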
(a) Synonym replacement is an inherent and potent feature of natural language that
enables individuals to employ different words without modifying their
semantic intent. This capability enriches, personalizes, and diversifies communi-
cation. However, our pre-trained Masked Language Model (MaskLM) exhibits
limitations in recognizing synonymous words6 . For instance, it incorrectly pre-
dicts that the questions “what is the function of a resistor in a circuit?” and
“what is the purpose of a resistor in a circuit?” convey different meanings. The
addition of visual cues during pre-training ameliorates this issue, allowing the
model to correctly interpret these questions as equivalent.
(c) Semantic understanding involves interpreting the meaning and intent underly-
ing textual data. Without a semantic framework, comprehension and effective
6
We do not claim language models in general lack this ability. Super-massive and highly ad-
vanced language models trained with human supervision such as ChatGPT and Gemini can identify
synonyms and antonyms.
Table 3.3: Ablation Results on the Guidance Loss: The results in this table show
that other objectives such as soft-distillation from [58] and mean squared
error are suboptimal compared to the unidirectional contrastive loss used
in our work.
Guidance Loss STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG
MSE-Loss 50.8 35.17 43.41 58.98 44.56 55.66 59.27 49.70
Soft-Distillation 48.99 35.24 43.08 58.46 42.85 52.89 60.60 48.87
InfoNCE 47.99 54.54 54.33 65.74 54.92 64.78 60.35 57.52
this is suboptimal because creating “image embeddings” from text destroys the special
characteristics of textual data. Also, since the concept of “classes” or “categories” is
absent in our pre-training framework, soft-distillation which relies heavily on class
probabilities is not an ideal fit.
Vision-guided loss weighting factor β. The weighting factor dictates the extent
of the model’s focus on learning the correspondence with visual representations. High
values of β signify an increased emphasis on minimizing the visual correspondence loss
relative to the masked language modeling loss. A setting of β = 0 corresponds to the
absence of visual guidance. Figure 3.4 presents results indicating that β > 1 does not
significantly enhance performance compared to β = 1. Nevertheless, consistent with
the observations in Section 3.4.1, including an image correspondence loss contributes
to improvements in semantic textual similarity.
Table 3.4: Impact of Pre-trained Image Encoder Ablation: The results here
suggest that the specifics of the image encoder are not differentiating fac-
tors.
Image Encoder G STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG
IN-22k-ViT-B/32 44.43 53.97 54.53 64.84 56.15 63.51 62.03 57.07
MC-2.5b-ViT-B/32 45.21 52.57 54.18 65.61 57.70 64.77 61.41 57.35
MC-400m-ViT-B/32 47.99 54.54 54.33 65.74 54.92 64.78 60.35 57.52
Figure 3.5: Architecture of Text Encoder Guidance: We replace the image en-
coder in Figure 3.2 with a multimodal text encoder. This alteration re-
moves the need for image-text datasets which are difficult to gather in
large quantities.
this approach results in a relatively slower training process compared to the unguided
models.
of the image encoder yields a higher average GLUE performance when MS-COCO is
the pre-training dataset (73.63% versus 72.86%).
Pre-trained Model STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG
None 36.40 44.13 37.29 56.86 49.80 33.99 54.89 44.77
BERT-Base 27.64 41.63 27.69 45.48 48.04 22.90 49.99 37.62
MetaCLIP-Base (Ours) 61.25 58.04 57.14 74.46 70.53 69.08 62.73 64.75
Section 3.5
Implications & Limitations
(1) Pre-training language models with visual information significantly enhances se-
mantic understanding. Our analysis identifies areas, e.g., synonym compre-
hension and keyword identification, through which visual information typically
contributes positively to performance (see Section 3.4.3). Furthermore, we hy-
pothesize that some of the observed improvements in semantic relatedness tasks
can be attributed, in part, to an enhanced ability to resolve ambiguous sen-
tences. In instances where a question or sentence possesses multiple interpre-
tations, the inclusion of visual information as supplementary context can serve
to disambiguate meaning.
These implications emphasize the critical need for ongoing investigation into lan-
guage grounding strategies. Grounded language models parallel the linguistic acquisi-
tion processes observed in human cognition and may serve as a conduit for enhanced
cross-lingual learning and “true” linguistic understanding in AI systems. Nevertheless,
this research is subject to certain limitations, which are highlighted below.
(1) While our findings indicate that the incorporation of visual information during
Section 3.6
Summary
Chapter 4
CLIP-C: Semantic Compositions For Data-Efficient Vision-Language Models
Section 4.1
Overview
model [136], which set a benchmark for vision and language joint embedding mod-
els. CLIP adopts contrastive learning capitalizing on matched image-caption pairs
as positive examples and non-matching pairs within the batch as negative examples.
This method has yielded impressive results across numerous tasks. Nonetheless, a
downside of the CLIP model is its need for large-scale datasets of image-text pairs,
which poses accessibility challenges for researchers in low-resource settings. Follow-
up research efforts to enhance the data efficiency of the CLIP model have proposed
supplementary objectives [121, 104] or the generation of additional captions using
large language models [44, 87, 182]. However, these approaches often require sup-
plementary computational procedures, such as multiple forward passes or the use of
additional encoders, which are resource-intensive. Our work in this chapter proposes
a technique to develop effective vision-language models in scenarios where image-text
data is scarce.
The method we propose is based on creating semantically composite examples
to pre-train vision-language models. This approach, termed CLIP-C (for CLIP
Compositions), involves merging captions and blending images from two distinct
image-caption instances within the dataset to create a new composite example. This
straightforward procedure, similar to the CutMix [175] data augmentation method used
in vision categorization tasks, introduces no additional computational overhead and
does not increase the number of model parameters.
We dynamically broaden the semantic challenges encountered by the model by
adding compositions of two distinct image-caption pairs in each iteration. CLIP-C
implements the composition by conjoining the captions and merging the central crops
of both images. The two image-caption pairs constituting the composite instance are
sampled randomly in each iteration based on a predefined probability, empowering
the model to uncover novel combinations of examples throughout training.
Figure 4.1: CLIP-C: We use the center half crops spanning the width (as in this
illustration) or the height of the image. The captions are concatenated
with the delimiter “and”. We vary the positions of the captions on ei-
ther side of the conjunction, i.e., the output caption can be either (a)
{caption1 and caption2} or (b) {caption2 and caption1}. We emphasize
that only a fraction of the batch in each iteration constitutes composite
samples. The colored boxes and texts shown here are for illustrative pur-
poses.
Section 4.2
Technical Approach
This section covers background on the baseline method as well as the core compo-
nents of the CLIP-C framework, depicted in Figure 4.1.
4.2.1. Background
$$\mathcal{L}_{I2T} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\frac{1}{\tau}\,\mathrm{sim}\left(z_I^{(i)}, z_T^{(i)}\right)\right)}{\sum_{j=1}^{B} \exp\left(\frac{1}{\tau}\,\mathrm{sim}\left(z_I^{(i)}, z_T^{(j)}\right)\right)} \tag{4.1}$$

$$\mathcal{L}_{T2I} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp\left(\frac{1}{\tau}\,\mathrm{sim}\left(z_I^{(i)}, z_T^{(i)}\right)\right)}{\sum_{k=1}^{B} \exp\left(\frac{1}{\tau}\,\mathrm{sim}\left(z_I^{(k)}, z_T^{(i)}\right)\right)} \tag{4.2}$$
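Here $z_I^{(i)}$ and $z_T^{(i)}$ denote the projected image and text embeddings of pair $i$, $B$ the batch size, $\mathrm{sim}(\cdot,\cdot)$ a similarity function, and $\tau$ a temperature (notation assumed from context). In code, this symmetric objective is commonly implemented as a pair of cross-entropies over the batch similarity matrix; a sketch, assuming L2-normalized features and an overall loss equal to the average of the two directions (cf. Eq. 4.3):

```python
import torch
import torch.nn.functional as F

def clip_loss(z_image, z_text, tau=0.01):
    """Symmetric InfoNCE over a batch of matched image/text embeddings.
    z_image, z_text: (B, d) tensors, assumed L2-normalized."""
    logits = z_image @ z_text.t() / tau              # (B, B) similarities
    targets = torch.arange(z_image.size(0), device=z_image.device)
    loss_i2t = F.cross_entropy(logits, targets)      # Eq. 4.1
    loss_t2i = F.cross_entropy(logits.t(), targets)  # Eq. 4.2
    return 0.5 * (loss_i2t + loss_t2i)
```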
4.2.2. CLIP-C
In each training step, CLIP-C samples a batch of examples of size $B$, $\left\{\left(\hat{x}_I^{(i)}, \hat{x}_T^{(i)}\right)\right\}_{i=1}^{B}$.
Any given paired instance $\left(\hat{x}_I^{(i)}, \hat{x}_T^{(i)}\right)$ is either the original example $\left(x_I^{(i)}, x_T^{(i)}\right)$ or a
composition of that example and another example $\left(x_I^{(i')}, x_T^{(i')}\right)$, $i \neq i'$, drawn from
the dataset. Note that index $i'$ is taken with respect to the dataset size and not
the batch size $B$, i.e., sample $i'$ may not be present in the current mini-batch. The
proportion of composed samples in any mini-batch is controlled by a sampling rate
hyper-parameter $\rho$. The impact of this parameter is discussed in Section 4.5.3.
In the case whereby $\left(\hat{x}_I^{(i)}, \hat{x}_T^{(i)}\right)$ is a composite sample, the new caption $\hat{x}_T^{(i)}$ is a
concatenation of the two original captions involved: $\hat{x}_T^{(i)} = \left[x_T^{(i)}, x_T^{(i')}\right]$, where $[\cdot, \cdot]$ is a
string concatenation function with the word “and” as a conjunction. The positions of
the captions on either side of this conjunction change, with $x_T^{(i)}$ appearing first fifty
percent of the time.
The new image is composed of the center half crops spanning either the height or
the width of each image. For example, if the images have resolution $(S \times S)$, either
$(\frac{S}{2} \times S)$ or $(S \times \frac{S}{2})$ center crops are taken from both images and concatenated together
as illustrated in Figure 4.1. We experiment with other forms of image augmentation
methods such as MixUp [179] and CutMix [175] in Table 4.7.
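A sketch of the composition operation for a single pair of examples follows; it is illustrative, and the exact resizing and crop bookkeeping in our implementation may differ.

```python
import random
from PIL import Image

def compose(example_a, example_b, size=224):
    """Create a composite image-caption pair from two examples: center half
    crops are concatenated, and the captions are joined with 'and'."""
    img_a, cap_a = example_a
    img_b, cap_b = example_b
    img_a = img_a.resize((size, size))
    img_b = img_b.resize((size, size))

    new_img = Image.new("RGB", (size, size))
    if random.random() < 0.5:
        # (S/2 x S) center crops spanning the height, placed side by side.
        left = img_a.crop((size // 4, 0, 3 * size // 4, size))
        right = img_b.crop((size // 4, 0, 3 * size // 4, size))
        new_img.paste(left, (0, 0))
        new_img.paste(right, (size // 2, 0))
    else:
        # (S x S/2) center crops spanning the width, stacked vertically.
        top = img_a.crop((0, size // 4, size, 3 * size // 4))
        bottom = img_b.crop((0, size // 4, size, 3 * size // 4))
        new_img.paste(top, (0, 0))
        new_img.paste(bottom, (0, size // 2))

    # Caption order is randomized on either side of the conjunction.
    caps = [cap_a, cap_b] if random.random() < 0.5 else [cap_b, cap_a]
    return new_img, " and ".join(caps)
```

When the mini-batch is assembled, an example is replaced by such a composite with probability $\rho$.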
After assembling the mini-batch as described above, CLIP-C proceeds to ex-
tract the image and text features as in CLIP: $\hat{z}_I^{(i)} = g_I\left(f_I\left(\hat{x}_I^{(i)}\right)\right)$ and
$\hat{z}_T^{(i)} = g_T\left(f_T\left(\hat{x}_T^{(i)}\right)\right)$. With $\hat{z}_I^{(i)}$ and $\hat{z}_T^{(i)}$ computed, Eq. 4.1, Eq. 4.2, and Eq. 4.3 are used
to compute the InfoNCE loss.
The sampling strategy CLIP-C employs exposes the model to a much higher diver-
sity of images and their corresponding captions compared to the vanilla pre-training
pipeline. As a result, we observe much more significant improvements in downstream
transfer when the pre-training dataset is small. It is reasonably expected that rela-
tively larger datasets such as RedCaps [38] are already sufficiently diverse and, there-
fore, may not benefit from our method. Nonetheless, CLIP-C still does better than
CLIP on these large datasets.
Section 4.3
Experimental Setup
All our experiments use the CLIP framework due to its demonstrated effectiveness,
simplicity, and widespread usage. We emphasize that we do not use pretrained CLIP
checkpoints from prior works as our method is a pre-training mechanism. Thus, we
retrain CLIP on all pre-training datasets and compare it to our approach. Finally,
due to resource constraints, we conduct our experiments in the low data and small
model regimes.
Models. We use Vision Transformer [41] models of various sizes as in [121]. The
vision encoder is set to ViT-S/16 [156] in all our ablation experiments unless explicitly
specified otherwise. We use ViT-B/16 [41] as the image encoder to demonstrate
the efficacy of our method at scale. The text encoder in all our experiments is
set to the 38M parameter text Transformer model from [136]. Following previous
methods, Byte-Pair encoding is used for tokenization with a context length of 77 and
a vocabulary size of 49k. Finally, we fixed the temperature parameter at 0.01, the
maximum value used in CLIP [136].
Hyper-parameters. We train all models using PyTorch [129] with a global batch
size of 2,048 split across 8 GPUs on a single machine. AdamW [109] is the optimizer
during pre-training. All models are pretrained for 40 epochs using a cosine decay
learning rate schedule with a base learning rate of 0.003, a warm-up period of 5 epochs, and a
final learning rate of $1 \times 10^{-5}$. The weight decay parameter is always set to 0.1. Random
cropping is the only augmentation applied to the images during pre-training. The
image resolution is always set to 224 × 224.
Section 4.4
Results and Analysis
This section outlines our key comparisons between CLIP and CLIP-C (our method)
on zero-shot image classification, cross-modal retrieval, and linear probing. However,
we explain first why our method works.
We perform zero-shot evaluation on several classification benchmarks using class
names and prompts provided by [136, 121]. We test our model on eleven downstream
datasets including ImageNet [146], CIFAR [83], Caltech-101 [45], Oxford Pets [128],
Country211 [136], DTD [31], Sun397 [168], STL-10 [32], RESISC-45 [28], and EuroSAT [56].
Table 4.1: Zero-shot Image Classification: CLIP-C is our method. CLIP is the
model from [136] trained in our setting. CC3M CLIP-C models use ρ = 0.3
while CC12M and RedCaps models use ρ = 0.15. Bold numbers are the
best in each dataset and architecture comparison.
Caltech-101
PT Dataset
Country211
CIFAR-100
RESISC45
CIFAR-10
ImageNet
EuroSAT
Food-101
Method
STL-10
Sun397
DTD
Pets
Vision Encoder: ViT-S/16
CLIP 11.6 56.1 22.7 46.9 12.9 10.5 0.6 20.5 77.0 24.5 23.7 18.5
CC3M
CLIP-C 15.1 66.4 26.9 51.9 14.5 14.8 0.7 27.2 84.6 25.4 30.7 20.5
Vision Encoder: ViT-B/16
CLIP 13.8 54.8 20.4 49.8 14.9 12.2 0.7 21.9 76.0 22.7 19.6 19.6
CC3M
CLIP-C 15.7 58.0 28.5 50.1 11.4 14.2 0.7 27.8 86.8 26.1 21.3 21.2
CLIP 46.9 78.0 43.0 76.2 57.2 19.3 4.8 41.2 89.7 33.8 27.8 37.9
CC12M
CLIP-C 48.1 76.8 44.8 73.5 60.8 21.9 5.0 41.1 90.3 36.2 36.1 38.5
CLIP 78.8 72.8 38.7 72.1 76.0 16.2 6.1 27.5 92.9 36.5 30.9 40.7
RedCaps
CLIP-C 79.0 73.7 42.2 72.1 77.1 18.1 6.6 29.4 94.2 41.1 34.8 41.6
Following previous works [121, 44], we use “mean per class accuracy”
as the metric for Oxford Pets and Caltech-101. Accuracy is the metric for all other
datasets. Additionally, we analyze the zero-shot retrieval performance of our method
versus CLIP.
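Zero-shot classification with these models follows the usual CLIP recipe of embedding a natural-language prompt per class and matching it against the image embedding. A sketch, where the encoders, tokenizer, and prompt template are placeholders:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenizer, image, class_names):
    """Pick the class whose prompt embedding is most similar to the image."""
    prompts = [f"a photo of a {name}." for name in class_names]
    text_feats = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)
    image_feat = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)
    scores = image_feat @ text_feats.t()  # cosine similarities
    return class_names[scores.argmax().item()]
```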
Table 4.2: Zero-shot Cross-modal Retrieval. ρ is set to 0.3 for CC3M and 0.15
for CC12M and RedCaps. Similarly to zero-shot classification, our se-
mantic composition model is nontrivially better than CLIP on zero-shot
retrieval.
Flickr30k MS-COCO
Image → Text Text → Image Image → Text Text → Image
PT Dataset Method
R@1 R@5 R@1 R@5 R@1 R@5 R@1 R@5
Vision Encoder: ViT-S/16
CLIP 35.2 62.3 25.4 49.12 17.3 39.0 13.1 31.2
CC3M
CLIP-C 40.7 70.9 30.6 57.9 21.4 45.6 16.2 36.5
Vision Encoder: ViT-B/16
CLIP 36.1 65.1 26.3 52.4 18.6 41.1 13.9 32.8
CC3M
CLIP-C 39.6 69.4 31.2 58.3 22.9 46.7 17.0 37.9
CLIP 61.5 87.2 46.1 74.9 36.2 64.2 25.3 49.7
CC12M
CLIP-C 66.0 87.8 49.5 75.6 38.4 65.6 26.4 51.5
CLIP 26.8 51.9 20.5 42.5 24.3 44.8 16.7 35.7
RedCaps
CLIP-C 32.3 57.2 23.6 44.9 27.1 49.2 18.2 38.4
Figure 4.2: (a) Training and validation losses during pre-training: Counter-
intuitively, the model learns to match the composite examples faster com-
pared to the plain instances. (b) CLIP-C vs. CLIP: pre-training CLIP
longer than CLIP-C does not close the performance gap; the advantage of
CLIP-C grows as training duration increases.
dataset, our method outperforms CLIP by over 5% absolute top-1 retrieval accuracy
in both image-to-text and text-to-image retrieval. The enhancement on MS-COCO
is 4% on image-to-text and 3% on text-to-image retrievals.
4.4.3. Discussion
Why is CLIP-C an Effective Method?
As evidenced by the higher downstream zero-shot and retrieval results, it is clear
that CLIP-C is an effective method. To provide further intuitions as to why CLIP-C
works, we present two arguments for the effectiveness of our method based on data
diversity and improved model optimization as evidenced in the training and validation
losses. We expand on these points below starting with data diversity.
(a) CLIP-C exposes the model to more diverse data. It could be argued that our method sees many more examples due to the compositions we employ, and that this alone explains the observed improvements. The empirical results in Table 4.3 and in Figure 4.2b show that this is not the case: even after training CLIP for longer (Figure 4.2b) or with a larger batch size (Table 4.3), CLIP does not close the gap to CLIP-C.
(b) CLIP-C eases contrastive learning for all examples. Contrary to the expectation that compound examples would be more challenging for the model (since they are multiple examples condensed into single instances), we observed precisely the opposite: as shown in Figure 4.2a, the training and validation losses on composite examples are lower than the losses on plain examples. Our hypothesis for this empirical observation is that the model easily recognizes compound image-caption pairs because they tend to be structurally different due to stitching artifacts and other distortions. This also transfers to improved matching of the plain examples in CLIP-C compared to CLIP, as conveyed by the validation losses in Figure 4.2a (right). We believe this elevated learning of plain examples is a contributing factor to the superior capabilities of our method.
Table 4.3: CLIP-C beats CLIP using just half the batch size of the CLIP model.
Section 4.5
Ablation Study
We ablate the various components of our framework including (1) the sampling prob-
ability ρ, (2) semantic versus stylistic compositions, (3) the impact of stochasticity
in drawing the second example, and (4) the function used for the image composition.
These ablation experiments underscore the importance of using semantically diverse
examples in compositions. They also reveal that while incorporating a proportion
of CLIP-C examples in the mini-batch contributes positively to performance, exclu-
sively using such compositions during training detracts from downstream transfer
capabilities. Finally, they unearth the necessity of generating compound examples
dynamically during training rather than relying on a static set of pre-generated in-
stances. Collectively, these insights affirm the effectiveness of the design principles
underpinning CLIP-C.
All ablation experiments are conducted using CC3M [149] with ViT-S/16 as the
image encoder to minimize cost. Additionally, we present only the zero-shot results
of CIFAR-10, CIFAR-100, and ImageNet for the ablation experiments.
To study the impact of random seeding, we train three models each for CLIP and CLIP-C on CC3M with three different random seeds. The results in Table 4.4 indicate that the zero-shot performances are consistent across different random initializations.
Table 4.4: Both CLIP and CLIP-C are consistent across three different initializations.
These ablations show that our method prefers the use of distinct examples in the composition. We also note that merely increasing the diversity of examples is helpful, as the stylistic-augmentation method yields a 0.5% zero-shot accuracy gain over CLIP on ImageNet.
The sampling probability ρ controls the percentage of the mini-batch made up of compound instances. When ρ = 0, our method is identical to CLIP as no composition is performed. At the other extreme, when ρ = 1, all the examples in each mini-batch are instances of our composition method. As shown in Figure 4.3, using a small non-zero sampling rate is more effective than CLIP. However, performance deteriorates when more than fifty percent of the mini-batch consists of these compound image-text pairs. These results indicate that maintaining a reasonable percentage of the original examples is necessary, likely because the streamlined, non-contradictory learning signal is significantly reduced when a majority of the batch are compositions. Also, since downstream evaluations do not involve such compositions, some exposure to examples with uniform semantic content during pre-training is important for effective transfer.
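A minimal sketch of how ρ could govern batch assembly during training is shown below; compose_pair is a hypothetical stand-in for the composition operation, and the exact batching logic of CLIP-C may differ.

import random

def build_batch(samples, rho, compose_pair):
    # With probability rho, replace an example by a composition of it and another randomly drawn example.
    batch = []
    for img, cap in samples:
        if random.random() < rho:
            img2, cap2 = random.choice(samples)   # second example drawn dynamically at every iteration
            batch.append(compose_pair((img, cap), (img2, cap2)))
        else:
            batch.append((img, cap))              # plain example kept as-is
    return batch

Drawing the second example afresh at every iteration mirrors the observation above that dynamically generated compositions work better than a static, pre-generated set.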
Table 4.7: Our strategy outperforms CutMix [175] and MixUP [179].
CutMix, on the other hand, takes a random crop from one of the images and pastes it at the same spatial location on the other image. The crop's dimensions are scaled by the value α = √(1 − ω), ω ∼ β(1, 1); that is, H_cut = α · H and W_cut = α · W, where H and W are the height and width of the image, respectively. Unlike MixUP, our method preserves the integrity of each crop, and it does not paste parts of one image onto the other as in CutMix. Additionally, using the center-half crop of each image guarantees that substantial portions of both images are represented in the output image. We believe these characteristics of our method are important, as demonstrated by its superior zero-shot results over MixUP and CutMix in Table 4.7.
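The sketch below contrasts the CutMix crop sizing above with a CLIP-C-style composition built from center-half crops. The side-by-side placement of the two crops and the rule for merging the captions are assumptions made purely for illustration; they are not necessarily the exact implementation used in this chapter.

import numpy as np

def cutmix_crop_size(h, w):
    # CutMix: alpha = sqrt(1 - omega) with omega ~ Beta(1, 1), so H_cut = alpha * H and W_cut = alpha * W.
    omega = np.random.beta(1.0, 1.0)
    alpha = np.sqrt(1.0 - omega)
    return int(alpha * h), int(alpha * w)

def center_half_crop(img):
    # One possible reading of "center-half crop": keep the central half of the width, full height.
    h, w, _ = img.shape
    return img[:, w // 4 : w // 4 + w // 2]

def compose_pair(example_a, example_b):
    # CLIP-C-style composition: stitch the two center crops (assumed side by side) and merge the captions.
    (img_a, cap_a), (img_b, cap_b) = example_a, example_b
    image = np.concatenate([center_half_crop(img_a), center_half_crop(img_b)], axis=1)  # assumes same-sized inputs
    caption = f"{cap_a} and {cap_b}"  # assumed caption-composition rule
    return image, caption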
Table 4.10: SugarCrepe: Our model outperforms the baseline on compositional reasoning, once again demonstrating its extensive capabilities.
Composing both the images and the captions is the most effective, probably due to the symmetry of transforming both modalities. The second most effective is the captions-only approach. Option (2) is the least effective method, likely because the images are already naturally augmented (random cropping) in the baseline method whereas the captions are fixed. These observations suggest that our method is more helpful in learning representations of the texts relative to those of the images. They also help to elucidate why we obtain much bigger improvements over the baseline in zero-shot settings compared to linear evaluations.
(a) CLIP-C improves SLIP [121]. In Table 4.9, we show that CLIP-C’s frame-
work is compatible with other vision-language methods such as SLIP [121].
Using our composition method in SLIP improves zero-shot image recognition
performance from 19.4% to 20.8% on ImageNet showing the complementarity
of our method and SLIP.
(b) Results on SugarCrepe [61]. We show in Table 4.10 that CLIP-C has more
compositional knowledge than CLIP, especially on object replacement and object addition. This signals the potential of methods such as CLIP-C to close
the gap on relational and compositional understanding tasks in vision-language
models.
Section 4.6
Summary
In this chapter, we have shown that fast and straightforward semantic compositions of distinct image-caption pairs substantially enhance the efficacy of vision-language models. Our proposed approach, termed CLIP-C, demonstrates marked improvements in zero-shot downstream tasks over the
baseline CLIP model. Our ablation studies offered critical insights, highlighting that
the observed enhancements in performance arise not merely from increased data aug-
mentation but from the strategic deployment of semantically distinct examples in
compositions. Finally, we provided experiments demonstrating the applicability of
our semantic composition framework to other competitive models such as SLIP.
We anticipate that these findings will inspire further exploration into innovative
and efficient applications of small-scale datasets for vision-language pre-training, par-
ticularly in domains where curating extensive amounts of paired data is challenging,
such as in medical and satellite imagery. Furthermore, research exemplified by our
work in this chapter is crucial for training language models with guidance derived
from visual cues, as demonstrated in Chapter 3. In the next chapter (Chapter 5),
we will introduce an alternative approach for training effective multimodal models,
focusing on the composition of multimodal features rather than data compositions.
Channel Fusion for Vision and Language Learning
5
This chapter, like Chapter 4, presents a method for training more effective multimodal models. However, whereas the innovations in Chapter 4 were rooted in advancements at the input data level, the approach outlined in this chapter re-envisions the fusion of tokens across different modalities within the multimodal model.
Section 5.1
Overview
Figure 5.1: Multimodal fusion methods: illustrations of fusion methods from the perspective of one visual token F_i and one text token T_j. Our proposed Compound Tokens fusion method, illustrated in (c), uses only one cross-attention layer for each modality, compared to co-attention which uses both cross-attention and self-attention in all blocks. Q, K, and V denote the input query, keys, and values, respectively. X represents the cross-attention layer's output. Finally, the subscripts V, L, and CT stand for visual features, text features, and Compound Tokens, respectively.
Most existing vision-language models adopt one of two fusion approaches, termed merged attention and co-attention, as depicted in Figures 5.1a and 5.1b, respectively. Merged attention involves concatenating the two unimodal representations along the sequence dimension before applying self-attention globally over the concatenated output [112, 57, 42]. In contrast, co-attention does not concatenate representations from the two modalities; it uses cross-attention to facilitate the exchange of information between modalities [16, 99, 42].
However, these two approaches possess inherent limitations: merged attention
lacks cross-attention which is important for the robust alignment of tokens. Con-
versely, co-attention does not have a global receptive field across all vision and text
tokens since both modalities never appear in a single sequence. Consequently, merged attention may encounter difficulties in effectively aligning tokens from different modalities, while co-attention sacrifices the advantages of a comprehensive global view over all tokens. Dou et al. [42] reported performance enhancements with co-attention compared to merged attention. Nonetheless, another drawback of co-attention is that it is parameter-inefficient relative to merged attention since it requires distinct parameter sets for vision and language features.
Our objective is to efficiently and simply address the limitations of the exist-
ing fusion strategies. The method described in this chapter accomplishes this goal
by integrating merged attention and co-attention within a streamlined pipeline that
yields more robust multimodal representations than either approach across several
multimodal tasks. We introduce channel fusion to integrate visual and language rep-
resentations for various question answering tasks, such as visual question answering
and visual entailment. By focusing the fusion process on the feature dimension, our
model more effectively aligns tokens from the two modalities compared to the stan-
dard methods. We term the resulting fused representations “Compound Tokens”
(as illustrated in Figure 5.1c) because each feature vector contains elements from both
the image and language embeddings.
In the channel fusion process, tokens from one modality are queried using tokens
from the other modality, and the resulting output is concatenated with the input
query tokens along the feature dimension. We implement a bi-directional channel fu-
sion that generates vision-to-text and text-to-vision Compound Tokens, which are
subsequently concatenated along the token dimension for further modeling. Channel
fusion effectively aligns multimodal tokens through cross-attention while maintaining
the benefits of global self-attention across all tokens. Unlike merged attention, we
concatenate vision and text tokens along the channel dimension. In contrast to co-
attention, which uses cross-attention functions in every block, our approach employs
only two cross-attention functions initially to facilitate channel concatenation.
Additionally, channel concatenation does not increase the token length thus avoid-
ing additional computational or memory burdens in the multimodal encoder (and
decoder). To further enhance efficiency, each modality is initially embedded into half
of the original feature dimension prior to compounding. This approach ensures that
the output following channel concatenation retains the same feature dimension as the
input vision and text tokens. Empirical evidence from our experiments indicates that
alternative methods of combining the input queries and cross-attention outputs such
as weighting or element-wise multiplication are less effective compared to channel
concatenation.
We evaluate Compound Tokens through extensive experiments in the challeng-
ing open-vocabulary setting using exact matching. In this context, the generated
responses must correspond precisely with the ground truth answers to be deemed
correct, presenting a significantly greater challenge than predicting from a limited
predefined set of responses, as is common with encoder-only models. This evaluation
approach is informed by prior research [29, 164, 133] that highlighted its flexibility
and applicability in practical scenarios.
The empirical evaluations show that our method outperforms both merged atten-
tion and co-attention on GQA [66], SNLI-VE [169], and VQA [51] with and without
vision-language pretraining. Compound Tokens obtained 82.87% on SNLI-VE
beating METER [42] by 2.26%. Additionally, they recorded 82.43% on GQA signifi-
cantly outperforming CFR [124] by 8.83%. Our model’s score of 70.62% on VQA is
competitive among existing models.
69
5.2 Technical Approach Channel Fusion for Vision and Language Learning
Section 5.2
Technical Approach
This section provides a background of the baseline fusion methods and the architec-
ture for our Compound Tokens method as depicted in Figure 5.2 and Figure 5.1c.
5.2.1. Background
We provide an overview of the key functions pertinent to understanding our method,
omitting layer normalization and multi-layer perceptrons within attention blocks for
the sake of simplicity. Similarly, we refrain from addressing residual connections
between layers in this high-level overview.
Attention: Given a set of query vectors Q ∈ R^{N×d}, a set of key vectors K ∈ R^{M×d}, and a set of value vectors V ∈ R^{M×d}, attention computes, for each query q_i,

a_{i,j} = (q_i^⊤ k_j) / √d,        α_{i,j} = exp(a_{i,j}) / Σ_ℓ exp(a_{i,ℓ}),        z_{i,ℓ} = Σ_j α_{i,j} V_{j,ℓ}.        (5.1)
An attention mechanism is called self-attention when the queries, keys, and values are simple linear projections of the same underlying features. It is known as cross-attention when the keys and queries are projections of different features.
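The following is a direct transcription of Eq. (5.1) for a single attention head, included only to make the notation concrete:

import torch

def attention(Q, K, V):
    # Eq. (5.1): a = Q K^T / sqrt(d), alpha = row-wise softmax of a, z = alpha V.
    d = Q.size(-1)
    a = Q @ K.transpose(-2, -1) / d ** 0.5
    alpha = a.softmax(dim=-1)
    return alpha @ V

# Self-attention: Q, K, V are projections of the same features; cross-attention: queries come
# from one modality while keys and values come from the other.
Q, K, V = torch.randn(4, 64), torch.randn(6, 64), torch.randn(6, 64)
print(attention(Q, K, V).shape)  # torch.Size([4, 64]): one output vector per query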
Multimodal Fusion: Token concatenation followed by self-attention is one of
the most adopted approaches for cross-modal learning in recent vision-language archi-
tectures [100, 133, 164, 25]. Formally, given a sequence of N image tokens, I ∈ RN ×d ,
and M text tokens, T ∈ RM ×d , most methods concatenate I and T into a single
representation O ∈ R(N +M )×d which is then fed into a multimodal transformer for
further modeling. The target outputs are produced using either a linear layer or a
decoder. Besides concatenation, other methods such as [153, 99, 98] use multimodal
transformers composed of both self-attention and cross-attention in every block.
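At the level of tensor shapes, merged attention amounts to the following sketch (the multimodal transformer applied afterwards is elided, and the token counts are arbitrary example values):

import torch

N, M, d = 196, 32, 512
I = torch.randn(N, d)          # image tokens
T = torch.randn(M, d)          # text tokens
O = torch.cat([I, T], dim=0)   # merged attention: concatenate along the sequence dimension
assert O.shape == (N + M, d)   # self-attention is then applied jointly over all N + M tokens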
5.2.2. Compound Tokens
Channel fusion first embeds the image and text tokens into half of the original feature dimension, producing Ĩ ∈ R^{N×d/2} and T̃ ∈ R^{M×d/2}, and then compounds them as

Î = A(Ĩ, T̃, T̃) ∈ R^{N×d/2},        T̂ = A(T̃, Ĩ, Ĩ) ∈ R^{M×d/2},        (5.2)

I_cmpd = C-Concat(Ĩ, Î) ∈ R^{N×d},        T_cmpd = C-Concat(T̃, T̂) ∈ R^{M×d},        (5.3)
where A(q, k, v) is the cross-attention function with q, k, and v as queries, keys, and values, respectively, and C-Concat(u, v) concatenates tensors u and v along the feature dimension. We combine the vision-to-text Compound Tokens I_cmpd and the text-to-vision Compound Tokens T_cmpd into a single set of output Compound Tokens along the token dimension, as in merged attention architectures.
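A sketch of the bi-directional channel fusion in Eqs. (5.2)-(5.3) is given below, using PyTorch's multi-head attention as the cross-attention A. The half-dimension projections and the two concatenations follow the description above, but implementation details such as the number of attention heads are assumptions.

import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    # Compound Tokens: cross-attend between modalities, then concatenate along the channel dimension.
    def __init__(self, d, n_heads=8):
        super().__init__()
        self.proj_img = nn.Linear(d, d // 2)   # embed each modality into d/2 before compounding
        self.proj_txt = nn.Linear(d, d // 2)
        self.img_to_txt = nn.MultiheadAttention(d // 2, n_heads, batch_first=True)
        self.txt_to_img = nn.MultiheadAttention(d // 2, n_heads, batch_first=True)

    def forward(self, I, T):                   # I: (B, N, d) image tokens, T: (B, M, d) text tokens
        I_half, T_half = self.proj_img(I), self.proj_txt(T)
        I_hat, _ = self.img_to_txt(I_half, T_half, T_half)   # Eq. (5.2): image queries attend to text
        T_hat, _ = self.txt_to_img(T_half, I_half, I_half)   #            text queries attend to image
        I_cmpd = torch.cat([I_half, I_hat], dim=-1)          # Eq. (5.3): channel concatenation, (B, N, d)
        T_cmpd = torch.cat([T_half, T_hat], dim=-1)          #                                   (B, M, d)
        return torch.cat([I_cmpd, T_cmpd], dim=1)            # combined along the token dimension

fusion = ChannelFusion(d=512)
out = fusion(torch.randn(2, 196, 512), torch.randn(2, 32, 512))
print(out.shape)  # torch.Size([2, 228, 512]): token length is N + M, feature dimension stays d

In this sketch the fused tokens keep the original feature dimension d, so the subsequent multimodal encoder sees no increase in sequence length or width.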
Section 5.3
Experimental Setup
Model. We use ResNet-50 [55] as our image encoder and T5-base [139] as our text encoder. The outputs of the image and text encoders are fed to our novel fusion method described in Section 5.2.2. A T5-base decoder consumes the output of the
fusion module and generates free-form text for all question-answering tasks. The
image encoder is pretrained on ImageNet [146] while the text encoder and decoder
use pretrained T5 weights.
Pre-training Datasets and Tasks. We use CC3M2 [149] and COCO Captions [106]
for pre-training. The pre-training setup uses a mixture of these datasets across four
objectives:
(1) Image-Text Matching where the model predicts whether the given image-text pair is a match or not.
(2) Captioning where the model is tasked with generating a description of the image.
(3) Caption Completion which is similar to (2), but the model is given a masked caption and the goal is to predict the missing words.
(4) Masked Language Modeling where the model predicts masked-out tokens in the input text.
Hyper-parameters. Unless otherwise stated, we pre-train our models for 300,000 steps using a batch size of 512 and perform an additional 100,000 iterations of fine-tuning at a batch size of 128 on the downstream tasks. During pre-training, the image resolution is set to 224 × 224. This resolution is increased to 384 × 384 during fine-tuning or when training from scratch without vision-language pretraining (VLP). The input text length is set to 32 tokens. The output text length is 32 during pretraining and reduced to 8 tokens during finetuning. Please refer to the Supplementary Material for details on all hyper-parameter settings including learning rates, weight-decay, etc.
2 The version of the dataset we used has about 2 million samples.
Section 5.4
Results and Analysis
(a) SNLI-VE [169] is a dataset of approximately 500,000 image-text pairs used for visual entailment (VE). Given an image and a proposed statement, the task is to determine whether the statement entails, contradicts, or is neutral with respect to the image.
(b) Visual Question Answering (VQA2.0) [51] is a widely used benchmark for
many question-answering models and contains 400,000 image-text pairs span-
ning 3,130 output categories. Each image-question pair is associated with 10
answers.
We emphasize that in every scenario, our models generate answers in the open-
vocabulary setting covering 32,000 words irrespective of the number of categories in
the task. A model prediction is counted as correct if and only if it matches exactly
with the ground-truth answer. We use the VQA metric3 for VQA2.0 and simple
accuracy for GQA and SNLI-VE as evaluation metrics. Generally, we use SNLI-VE
3 https://visualqa.org/evaluation.html
74
5.4 Results and Analysis Channel Fusion for Vision and Language Learning
and GQA for ablations as performance on those datasets in our setup is more stable
than results on VQA.
Table 5.1: Channel Fusion versus Other Mixing Methods: Channel concatena-
tion obtains the highest accuracy on SNLI-VE and GQA.
We compare merged attention and channel fusion (our method) in Figure 5.3 without
vision-language pre-training to establish a baseline result. We then incorporate vision-
4 This is important in order not to increase the computation cost exorbitantly.
language pre-training and reassess the performance of each method in Table 5.2. All
downstream tasks for each fusion method use the same pretrained model.
For the baseline comparisons, the fusion modules do not use a multimodal encoder: merged attention feeds a concatenation of the multimodal tokens to the decoder, whilst Compound Tokens sends the tokens to the decoder immediately after channel concatenation.
The results in Figure 5.3 and Table 5.2 show clearly that our fusion method is
superior to merged attention with and without vision-language pretraining at a rela-
tively small amount of additional computational cost. This performance boost sug-
gests that using cross-attention for multimodal token alignment is beneficial. When
vision-language pre-training is employed, Compound Tokens outperforms merged
attention by substantial margins: by +4.18% on VQA and 2.20% on GQA. The im-
provement on SNLI-VE is a modest 0.24% over the baseline. Our method enjoys
similar improvement margins when no vision-language pre-training is invoked. We
include a more efficient version of channel fusion (Compound Tokens (TAQ)) where
Table 5.2: Compound Tokens versus other Fusion models with Vision-
Language Pre-training: We repeat the experiments in Figure 5.3, but
include vision-language pre-training on a mixture of CC3M and COCO
captions.
only the text tokens are used as queries. This version of our method also outperforms
merged attention across all tasks when training from scratch while using fewer flops.
Table 5.3: Compound Tokens versus other Fusion models without Vision-
Language Pre-training: We extend the models to include a multi-
modal encoder with 12 self-attention layers in merged attention to match
the typical setting in previous works [42]. Compound Tokens outperform
merged attention and Co-Attention with fewer parameters than both meth-
ods and fewer flops than merged attention. Co-Tokenization is from [134].
“Params” show the number of parameters in the entire model (not just the
fusion module); “RES” is the image resolution and L is the total number of
transformer blocks in the multimodal encoder: Compound Tokens uses
two cross-attention blocks before the multimodal encoder.
In this setting, we also compare Compound Tokens against Co-Attention (illustrated in Figure 5.1b) and Co-Tokenization [134], which was originally implemented for video question-answering tasks. Co-Tokenization iteratively fuses visual features with text features using a TokenLearner [147]. The Co-Attention fusion module uses 6 blocks each for the vision and the text branches, as in METER [42], where each block has self-attention, cross-attention, and feedforward layers. Co-Tokenization uses 64
image tokens and 4 transformer blocks for each tokenization round. We use three
tokenization rounds resulting in 12 self-attention blocks. The multimodal encoder for
Compound Tokens has 10 blocks to compensate for the two cross-attention blocks
that it uses.
The results of these experiments are shown in Table 5.3. The models are trained
for 300 thousand iterations at a batch size of 128 on each downstream task without
any vision-language pre-training. Compound Tokens outperform merged attention
and co-attention in this setting suggesting channel fusion remains competitive even
when a multimodal encoder is used. However, it slightly underperforms the more
expensive Co-Tokenization module.
Finally, we compare Compound Tokens with various models such as METER [42],
ALBEF [99], and CFR [124]. The models in Table 5.5 generally have approximately
Table 5.5: Compound Tokens versus other Multimodal Models: Our fu-
sion method is competitive on SNLI-VE and GQA with all models except
SimVLM [164] which used a private dataset of 1.5B samples. The best
values among the models besides SimVLM are in bold. The second best
values are underlined. * Gflops are based on our calculations.
the same number of parameters, but may differ on the pre-training datasets, pre-
training objectives, and backbone encoders. For example, while we use Conceptual
Captions [149] and COCO [106] as our pre-training datasets, METER used Concep-
tual Captions, COCO, Visual Genome [82] and SBU Captions [126]. ALBEF used
all the datasets in METER in addition to Conceptual Captions 12M [21].
We pre-train the Compound Tokens model for 500 thousand steps with a batch
size of 512 using an image resolution of 224 × 224 and further finetune for 200 thou-
sand iterations on each of the downstream tasks at resolution 384 × 384 at a batch
size of 128. Except for SimVLM [164] which has about 1.5 billion parameters and
uses significantly larger pre-training data (a private dataset of 1.8 billion samples), our model outperforms all other methods on SNLI-VE and GQA by large margins. We are confident that further pretraining and increasing the image resolution would improve our already competitive result on the VQA dataset.
Section 5.5
Ablation Study
This section covers key ablations for our multimodal method including the input
image resolution and the architecture of the image encoder used.
Table 5.6: Impact of Image Resolution: Increasing the resolution increases per-
formance for both merged attention and Compound Tokens.
Table 5.7: Image Encoder Ablation: Both the ViT and ResNet-50 are pre-trained
on ImageNet before transferring to the target task.
Multimodal models typically use either a CNN [91] or a Vision Transformer (ViT) [159] for image feature extraction. We used ResNet-50 for our main experiments and ablate the impact of that choice in this section. The results of using a ViT as the image encoder are shown in Table 5.7. All models in that experiment use 224 × 224 as the image resolution, and a patch size of 16 × 16 was used for the ViT. The ViT models perform slightly worse than the comparable ResNet models, but channel fusion remains superior to merged attention.
Section 5.6
Supplemental Information
Aside from random cropping and AutoAugment [35], which we apply during pre-training of our main model, we do not use any data augmentation beyond resizing and normalization in our experiments. All our pretraining experiments use a batch size of 512 and an image resolution of 224 × 224. The batch size is divided equally among the four pretraining objectives: image captioning, caption completion, image-text matching, and masked language modeling. We also sample the same number of examples from CC3M and COCO in every iteration. The batch size is set to 128 and the resolution to 384 × 384 whenever training from scratch or fine-tuning. The datasets we used in our model are described in Section 5.3. The rest of the hyper-parameters are listed in Table 5.8.
Section 5.7
Summary
Slot Machines
6
Different from the preceding chapters which concentrated on vision-guided language
models and multimodal networks, this chapter focuses on architectures and training
methodologies of deep neural networks. In contrast to conventional weight optimiza-
tion in a continuous space, we demonstrate the existence of effective random networks
whose weights are never updated. We discuss the related works within the chapter
separately from Chapter 2 in Section 6.6.
Section 6.1
Overview
In our procedure, each connection is given a fixed set of randomly generated weight values along with learnable quality scores; the scores are updated in the backward pass via stochastic gradient descent, but the weights themselves are never changed. By evaluating different combinations of fixed randomly generated values, this extremely simple procedure finds weight configurations that yield high accuracy.
We demonstrate the efficacy of our algorithm through experiments on MNIST
and CIFAR-10. On MNIST, our randomly weighted Lenet-300-100 [91] obtains a
97.0% test set accuracy when using K = 2 options per connection and 98.2% with
K = 8. On CIFAR-10 [83], our six-layer convolutional network outperforms the
regular network when selecting from K = 8 fixed random values at each connection.
Fine-tuning the models obtained by our procedure generally boosts performance
over networks with optimized weights, albeit at an additional compute cost (see Fig-
ure 6.5). Also, compared to traditional networks, our networks are less memory effi-
cient due to the inclusion of scores. That said, our work casts light on some intriguing
phenomena about neural networks for further probing.
• Second, this paper highlights the enormous expressive capacity of neural net-
works. Maennel et al. [113] show that contemporary neural networks are so
powerful that they can memorize randomly generated labels. This work builds
on that revelation and demonstrates that current networks can model challeng-
ing non-linear mappings extremely well even by simple selection from random
weights.
• This work also connects to recent observations [114, 46] suggesting strong per-
formance can be obtained by utilizing gradient descent to uncover effective
subnetworks.
Section 6.2
Technical Approach
Our goal is to construct non-sparse neural networks that achieve high accuracy by
selecting a value from a fixed set of completely random weights for each connection.
We start by providing an intuition for our method in Section 6.2.1, before formally
defining our algorithm in Section 6.2.2.
6.2.1. Intuition
An untrained, randomly initialized network is unlikely to perform better than random
chance. Interestingly, the impressive advances of [140] and [184] demonstrate that
networks with random weights can in fact do well if pruned properly. In this work,
instead of pruning, we explore weight selection from fixed random values as a way of constructing effective networks. With K random values per connection, a network with n connections defines K^n candidate configurations, and our method finds randomly weighted networks that achieve very high accuracy even with
small values of K. For instance, a six-layer convolutional network with 2 random
values per connection obtains 85.1% test accuracy on CIFAR-10.
But how do we select a good network from these K^n different networks? Brute-force evaluation of all possible configurations is clearly not feasible due to the massive number of different hypotheses. Instead, we present an algorithm, shown in Figure 6.1, that iteratively searches for the best combination of connection values for the entire
quality score for each weight option. These scores are used to select the weight value
of each connection during the forward pass. The scores are then updated in the
backward pass based on the loss value in order to improve training performance over
iterations.
h(x)_i^(ℓ) = g( Σ_{j=1}^{n_{ℓ−1}} h(x)_j^(ℓ−1) W_{ij}^(ℓ) )        (6.1)
where W_{ij}^(ℓ) is the weight of the connection between neuron i in layer ℓ and neuron j in layer ℓ − 1, x represents the input to the network, and g is a non-linear activation function. Traditionally, W_{ij}^(ℓ) starts off as a random value drawn from an appropriate distribution before being optimized with respect to a dataset and a loss function using gradient descent. In contrast, our method does not ever update the weights. Instead, it associates a set of K possible weight options for each connection (for simplicity, we use the same number of weight options K for all connections in a network), and then it optimizes the selection of weights to use from these predefined sets for all connections.
Forward Pass. Let {W_{ij1}, . . . , W_{ijK}} be the set of the K possible weight values for connection (i, j) (for brevity, we omit the superscript denoting the layer from now on), and let s_{ijk} be the "quality score" of value W_{ijk}, denoting the preference for this value over the other possible K − 1 values. We define a selection function ρ which takes as input the scores {s_{ij1}, . . . , s_{ijK}} and returns an index between 1 and K. In the forward pass, we set the weight of (i, j) to W_{ijk∗} where k∗ = ρ(s_{ij1}, . . . , s_{ijK}).
In our work, we set ρ to be either the arg max function (returning the index corresponding to the largest score) or sampling from a multinomial distribution defined by {s_{ij1}, . . . , s_{ijK}}. We refer to the former as Greedy Selection (GS) and the latter as Probabilistic Sampling (PS). In PS, the index is drawn as

k∗ ∼ Mult(s_{ij1}, . . . , s_{ijK}),        (6.2)

where Mult is the multinomial distribution. The empirical comparison between these
two selection strategies is given in Section 6.5.1.
We note that, although K values per connection are considered during training
(as opposed to the infinite number of possible values in traditional training), only one
value per connection is used at test time. The final network is obtained by selecting
for each connection the value corresponding to the highest score (for both GS and PS)
upon completion of training. Thus, the effective capacity of the network at inference
time is the same as that of a traditionally-trained network.
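The two selection rules can be sketched as follows, assuming the K scores of one connection are stored in a one-dimensional tensor; the exact parameterization of the multinomial in PS is an assumption here.

import torch

def select_index(scores, greedy=True):
    # rho: Greedy Selection (GS) returns the arg max of the scores; Probabilistic Sampling (PS)
    # draws the index from a multinomial distribution defined by the scores.
    if greedy:
        return scores.argmax().item()
    probs = scores.softmax(dim=0)               # assumed way of turning scores into probabilities
    return torch.multinomial(probs, num_samples=1).item()

scores = torch.tensor([0.1, 0.7, 0.2, 0.4])     # K = 4 quality scores for one connection
print(select_index(scores), select_index(scores, greedy=False))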
Backward Pass. In the backward pass, all the scores are updated with straight-
through gradient estimation since ρ has a zero gradient almost everywhere. The
straight-through gradient estimator [9] treats ρ essentially as the identity function in
the backward pass by setting the gradient of the loss with respect to s_{ijk} as

∇s_{ijk} ← (∂L / ∂a(x)_i^(ℓ)) · h(x)_j^(ℓ−1) W_{ijk}^(ℓ)        (6.3)

for k ∈ {1, . . . , K}, where L is the objective function and a(x)_i^(ℓ) is the pre-activation of neuron i in layer ℓ. Given α as the learning rate, and ignoring momentum, we update the scores via stochastic gradient descent as

s̃_{ijk} = s_{ijk} − α ∇s_{ijk},        (6.4)
where s̃_{ijk} is the score after the update. Our experiments demonstrate that this simple algorithm learns to select effective configurations of random weights, resulting in networks that are competitive with traditionally trained models of the same capacity.
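Putting the pieces together, the sketch below implements a slot-machine linear layer with fixed random weight options, trainable scores, greedy selection in the forward pass, and a straight-through backward pass that reproduces the score gradient of Eq. (6.3). It is an illustrative re-implementation under the initialization described in Section 6.3, not the code used for the experiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotMachineLinear(nn.Module):
    # Linear layer whose weight for each connection is selected from K fixed random options.
    def __init__(self, in_features, out_features, K=8, lam=0.1):
        super().__init__()
        sigma = (2.0 / (in_features + out_features)) ** 0.5            # std of the Glorot Normal distribution
        options = torch.empty(out_features, in_features, K).uniform_(-sigma, sigma)
        self.register_buffer("options", options)                       # weights are fixed and never updated
        self.scores = nn.Parameter(torch.rand(out_features, in_features, K) * lam * sigma)  # U(0, lambda*sigma)

    def forward(self, x):
        one_hot = F.one_hot(self.scores.argmax(dim=-1), num_classes=self.options.size(-1)).float()
        # Straight-through estimator: the forward pass uses the hard (greedy) selection, while the
        # backward pass treats the selector as the scores, so d(loss)/d(score_k) = d(loss)/dW * W_k.
        selector = one_hot + self.scores - self.scores.detach()
        W = (self.options * selector).sum(dim=-1)                       # effective weight matrix
        return x @ W.t()

layer = SlotMachineLinear(784, 300, K=8)
loss = layer(torch.randn(32, 784)).pow(2).mean()
loss.backward()                                                         # only the scores receive gradients
print(layer.scores.grad.shape)                                          # torch.Size([300, 784, 8])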
Table 6.1: Architecture specifications of the networks used in our experiments. The
Lenet network is trained on MNIST. The CONV-X models are the same
VGG-like architectures used in [46, 184, 140]. All convolutions use 3 × 3
filters and pool denotes max pooling.
Section 6.3
Experimental Setup
Models. The weights of all our networks are sampled uniformly at random from a
Glorot Uniform distribution [50], U(−σx , σx ) where σx is the standard deviation of
the Glorot Normal distribution. We ignore K, the number of options per connection,
when computing the standard deviation since it does not affect the network capacity
in the forward pass. Like for the weights, we initialize the scores independently
from a uniform distribution U(0, λσx ) where λ is a small constant. We use λ = 0.1
for all fully-connected layers and set λ to 1 when initializing convolutional layers.
We use 15% and 10% of the training sets of MNIST and CIFAR-10, respectively, for validation.
Hyper-parameters. All models use a batch size of 128 and stochastic gradient descent with warm restarts [108] at epochs 25 and 75, a momentum of 0.9, and an ℓ2 penalty of 0.0001. Probabilistic Sampling models do not use weight decay. When
training GS slot machines, we set the learning rate to 0.2 for K ≤ 8 and 0.1
otherwise. We set the learning rate to 0.01 when directly optimizing the weights
(training from scratch and finetuning) except when training VGG-19 where we set
the learning rate to 0.1. We find that a high learning rate is required when sampling
the network probabilistically, a behaviour also observed in [184]. Accordingly, we use
a learning rate of 25 for all PS models. We did not train VGG-19 using PS. We
use data augmentation and dropout (with a rate of p = 0.5) when experimenting on
CIFAR-10 [83]. We use batch normalization in VGG-19 but the affine parameters are
never updated throughout training.
Figure 6.2: Random networks versus Slot Machines: selecting from only K = 2 weight options per connection already dramatically improves accuracy compared to an untrained network that performs at random chance (10%) on both (a) MNIST and (b) CIFAR-10. The first bar in each plot shows the performance of an untrained randomly initialized network, and the second bar shows the result of selecting random weights with GS using K = 2 options per connection.
Section 6.4
Results and Discussion
[Figure: test accuracy as a function of the number of weight options K per connection, comparing Slot Machines (GS) against networks with learned weights.]
Figure 6.4: Test Accuracy versus Flops. Slot Machines achieve comparable
performance to models traditionally optimized for the same training com-
pute budget.
Figure 6.5: Finetuning Selected Weights. Finetuning Slot Machines improves test
set performance on CIFAR-10. For CONV-4 and CONV-6 this results
in better accuracy compared to the same networks learned from scratch
at comparable training cost (shown on the x axis). The six-layer slot
machine uses K = 8 options per edge whereas the CONV-4 slot machine
uses K = 16.
[Figure 6.6: Finetuning on CIFAR-10. Test accuracy of finetuned CONV-4 and CONV-6 models as a function of the Slot Machine checkpoint (epoch) from which finetuning starts.]
For VGG-19, however, the finetuned model does not match the performance of the model trained from scratch (92.6%). To show that the weight selection in Slot Machines impacts the performance of the finetuned models, we start finetuning from different checkpoints and compare the results. If the selection is beneficial, then finetuning from later checkpoints will show improved performance over finetuning from earlier checkpoints. As shown in Figure 6.6, this is indeed the case: finetuning from later checkpoints results in higher performance on the test set.
Section 6.5
Ablation Study
We conduct multiple ablation experiments on our models, including the weight selection strategy, weight sharing, and sparsity in Slot Machines.
Figure 6.7: Selection Method. Training slot machines via greedy selection yields
better accuracy than optimizing them with probabilistic sampling for all
values of K considered. The reason is that PS is a lot more exploratory
and tends to produce slower convergence.
Figure 6.8: Weight exploration in Slot Machines (Lenet on MNIST). The vertical axis shows (on a log scale) the percentage of weights changed after every five epochs as training progresses. Compared to PS, GS is much less exploratory and converges rapidly to a preferred configuration of weights. On the other hand, due to its probabilistic sampling, PS keeps changing the weight selections even in late stages of training.
Inspired by quantized networks [64, 141, 65, 162], we examine Slot Machines under
two new settings. The first constrains the connections in a layer to share the same
set of K random weights. The second setting is even more restricting as it requires
all connections in the network to share the same set of K random weights. Under
the first setting, at each layer, the weights are drawn from the uniform distribution U(−σℓ, σℓ), where σℓ is the standard deviation of the Glorot Normal distribution for
Figure 6.9: Weights Sharing in Slot Machines: GS models using the same set
of K random weights for all connections in a layer or in the entire network
perform quite well. However, they do not match the performance of Slot
Machines that use different sets of weights for different connections.
layer ℓ. When using a single set of weights for the entire network, we sample the
weights independently from U(−σ̂, σ̂). σ̂ is the mean of the standard deviations of
the layer-wise Glorot Normal distributions.
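The two sharing schemes can be sketched as follows, with σℓ and σ̂ defined as above; the per-connection scores are unchanged, and the (fan_in, fan_out) pairs below are simply the Lenet-300-100 layers used as an example.

import torch

def shared_options(layer_fans, K, global_share=False):
    # Draw K shared random weight values per layer, or a single global set for the whole network.
    sigmas = [(2.0 / (fan_in + fan_out)) ** 0.5 for fan_in, fan_out in layer_fans]  # layer-wise Glorot Normal stds
    if global_share:
        sigma_hat = sum(sigmas) / len(sigmas)                 # mean of the layer-wise standard deviations
        shared = torch.empty(K).uniform_(-sigma_hat, sigma_hat)
        return [shared] * len(layer_fans)                     # every layer reuses the same K values
    return [torch.empty(K).uniform_(-s, s) for s in sigmas]   # one set of K values per layer

print([opts.tolist() for opts in shared_options([(784, 300), (300, 100), (100, 10)], K=4)])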
Each of the weights is still associated with a score. The slot machine with shared
weights is then trained as before. Weight sharing substantially reduces the memory
requirements of these networks compared to traditional neural networks. For example,
a Lenet model with unshared weights needs ∼ 1MB of storage whereas the same model
using shared weights in each layer needs ∼ 0.02MB of storage.
As shown in Figure 6.9, these models are effective given a large enough K. However, the accuracy exhibits a large variance from run to run, as evidenced by the
large error bars in the plot. This is understandable, as the slot machine with shared
weights is restricted to search in a much smaller space of parameter combinations
reducing the probability of finding a “winning” combination.
Section 6.6
Related Works
Supermasks and the Strong Lottery Ticket Conjecture. The lottery ticket hypothesis was articulated in [46] and states that a randomly initialized neural network contains sparse subnetworks which, when trained in isolation from scratch, can achieve accuracy similar to that of the trained dense network. Inspired by this result, Zhou et al. [184] present a method for identifying subnetworks of randomly initialized neural networks that achieve better-than-chance performance without training. These subnetworks (named "supermasks") are found by assigning a probability value to each connection. These probabilities are used to sample the connections to use and are updated via stochastic gradient descent. On ImageNet [146], Ramanujan et al. [140] find supermasks within a randomly weighted ResNet-50 that match the performance of a trained ResNet-34.
These empirical results, as well as theoretical ones [114, 131], suggest that pruning a randomly initialized network is just as good as optimizing the weights, provided a good pruning mechanism is used. Our work corroborates this intriguing phenomenon
but differs from these prior methods in significant ways. We eliminate pruning com-
pletely and instead introduce multiple weight values per connection. Thus, rather
than selecting connections to define a subnetwork, our method selects weights for all
connections in a network of fixed architecture. Thus, every neural connection in our
network is active in every forward pass.
Pruning at Initialization. The lottery ticket hypothesis also inspired several recent works aimed at pruning (i.e., predicting "winning" tickets) at initialization [93, 94, 155, 161]. Our work differs in motivation from these methods and
those that train only a subset of the weights [59, 144]. Our aim is to find neural
networks with random weights that match the performance of traditionally-trained
networks with the same number of parameters.
Weight Agnostic Neural Networks. Gaier and Ha [48] build neural network
architectures with high performance in a setting where all the weights have the
same shared random value. The optimization is instead performed over the archi-
tecture [151]. They show empirically that the network performance is indifferent to
the shared value but defaults to random chance when all the weights assume different
random values. Although we do not perform weight training, the weights in this work
have different random values. Further, we build our models using fixed architectures.
Low-bit Networks. Although slot machines are inspired by quantized networks, the fixed random weight options in our method differ from the quantization levels used by low-bit networks. Furthermore, the weights in low-bit networks are usually optimized directly, whereas only the associated scores are optimized in slot machines.
Random Decision Trees. Our approach is inspired by the popular use of random
subsets of features in the construction of decision trees [14]. Instead of considering
all possible choices of features and all possible splitting tests at each node, random
decision trees are built by restricting the selection to small random subsets of feature
values and splitting hypotheses. We adapt this strategy to the training of neural
networks by restricting the optimization of each connection over a random subset of
weight values.
Section 6.7
Summary
In this chapter, we discussed work showing that neural networks with random weights
perform competitively, provided that each connection is given multiple weight options
and that a good selection strategy is used. By selecting a weight among a fixed set
of random values for each individual connection, our method uncovers combinations
of random weights that match the performance of traditionally-trained networks of
the same capacity. We referred to our networks as “slot machines” where each
reel (connection) contains a fixed set of symbols (random values). Our backpropa-
gation algorithm “spins” the reels to seek “winning” combinations, i.e., selections of
random weight values that minimize the given loss. Quite surprisingly, we find that
allocating just a few random values to each connection (e.g., 8 values per connection)
yields highly competitive combinations despite being dramatically more constrained
compared to traditionally learned weights. Moreover, finetuning these combinations
often improves performance over the trained baselines.
Conclusion
7
In human beings, the visual world is essential in natural language acquisition. Il-
lustrations, plots, and gestures fundamentally augment our understanding of natural
language. Yet, so far in machine learning, efforts to ground natural language process-
ing in the visual domain beyond multimodal methods remain sparse. This is partly
because of the absence of vast amounts of information rich vision-language datasets of
high quality. In this thesis, we proposed a framework for leveraging joint-embedding
models pre-trained on weakly-labeled data to tackle this challenge and infuse visual
cues into natural language models. Additionally, we proposed methods for building
data-efficient vision-and-language models and for effectively integrating representations in multimodal learning. Finally, we presented a novel neural network architecture that uses weight selection, rather than gradient-based weight updates, for optimization.
In Chapter 3, we presented VGLMs and MT-VGLMs, which proved effective for
guided language modeling. The first set of models employed an image encoder from
a pre-trained joint-embedding model and required image-paired data for training.
However, such datasets are expensive to collect in large quantities. Additionally, the
distributions of text in image captions datasets do not match the distributions of free-
form text corpora. Thus, building language models on image captions corpora may
not be ideal for general language understanding. To overcome these challenges asso-
ciated with using explicit visual information as the guidance source, we advanced multimodal text-guided language models, MT-GLMs, as an alternative. MT-GLMs
allowed us to pre-train language models on free-form text corpora of various sizes
resulting in consistent improvements on multiple benchmarks. Our investigations
showed that using a multimodal text encoder is important as a regular language
model like BERT [39] did not yield any improvements over the baseline unguided
model.
After demonstrating the importance of vision-language models for integrating information about the visual world into language model pre-training, in Chapter 4 we introduced a method for training data-efficient vision-language contrastive models called CLIP-C. CLIP-C adapted the CutMix [175] augmentation strategy to the do-
main of vision and language models, showing that semantic compositions of multiple
image-caption pairs significantly enhance the effectiveness of language-supervised vi-
sual representation learning models, especially when the training set is small. The
composition strategy we proposed is fast and straightforward to implement requir-
ing no additional parameters or floating point operations relative to the baseline
CLIP [136] model. Comprehensive analysis showed that the augmented samples our
method created regularized the contrastive learning loss and transferred competi-
tively to downstream tasks in a zero-shot setting. We verified experimentally that
the observed performance improvements were not due to elevated levels of data aug-
mentations. The strategic use of semantically distinct examples in compositions and
dynamic sampling of examples emerged as essential components of our framework.
We are hopeful our research in Chapter 4 will encourage more exploration on novel
and efficient uses of small-scale datasets for vision-language pre-training.
In Chapter 5, we introduced Compound Tokens, a multimodal fusion method
for vision-and-language representation learning. Compound Tokens are generated
by concatenating image and text features along the channel dimension in contrast
to the sequence dimension employed in existing systems [42]. We leveraged cross-
modal cross-attention to ensure that the tokens concatenated together in Compound
Tokens are complementary. This novel fusion method outperformed competitive
multimodal models such as ALBEF [99] and METER [42] across multiple multimodal
tasks. Numerous empirical evaluations demonstrated our fusion method is better
than the prior two approaches: merged attention and co-attention. We consistently
outperformed these standard methods with and without pre-training on image-text
pairs, across different image resolutions and image encoders. The work in this chapter
aligns with our method in Chapter 4 in that they both proposed mechanisms for pre-
training more effective multimodal models. These efforts in turn feed into our work
in Chapter 3 where multimodal models are exploited for language grounding in the
visual world.
Finally, in Chapter 6, we proposed a novel neural network architecture, called Slot Machines, that is trained via selection from randomly initialized parameters, in contrast to the continuous parameter updates used in traditional models. In Slot Machines, each neural connection is restricted to take a value from a finite set of size K, where the elements of the set are drawn from a random distribution such as the Glorot Uniform [50]. In comparison, because of parameter updates through gradient descent, neural connections in traditional neural networks can assume any value in the continuous domain. Thus, relative to canonical networks, Slot Machines are severely constrained. Yet, we showed that our novel networks are competitive with traditional models even with a minimal number of values, e.g., K = 2, given a good selection strategy. We proposed two weight selection methods: (1) a greedy selection criterion, where the weight with the highest associated score is chosen, and (2) a probabilistic selection strategy where we sample weights from a distribution of their
scores. Greedy selection emerged as the more effective selection procedure, consistently producing strong weight configurations. We
also demonstrated that selected configurations are good initialization checkpoints
for finetuning, leading to accuracy gains over training the network from scratch at
equivalent computational cost.
Bibliography
[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,
Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,
Shyamal Anadkat, et al., Gpt-4 technical report, 2023.
[2] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo,
*SEM 2013 shared task: Semantic textual similarity, Second Joint Conference
on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the
Main Conference and the Shared Task: Semantic Textual Similarity (Atlanta,
Georgia, USA) (Mona Diab, Tim Baldwin, and Marco Baroni, eds.), Association
for Computational Linguistics, June 2013, pp. 32–43.
[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr,
Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds,
Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina
Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew
Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo
Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan, Flamingo:
a visual language model for few-shot learning, 2022.
[4] Morris Alper, Michael Fiman, and Hadar Averbuch-Elor, Is bert blind? ex-
ploring the effect of vision-and-language pretraining on visual language under-
standing, Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2023.
[5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang, Bottom-up and top-down attention for image
captioning and visual question answering, CVPR, 2018.
[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Ba-
tra, C. Lawrence Zitnick, and Devi Parikh, Vqa: Visual question answering,
Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2015.
[7] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bun-
ner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-
Rosen, et al., Imagen 3, arXiv preprint arXiv:2408.07009 (2024).
[8] Emily M. Bender and Alexander Koller, Climbing towards NLU: On meaning,
form, and understanding in the age of data, Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics (Online) (Dan Ju-
rafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, eds.), Association for
Computational Linguistics, July 2020, pp. 5185–5198.
[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, Estimating or propa-
gating gradients through stochastic neurons for conditional computation, 2013.
[10] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra,
[11] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio,
Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nis-
nevich, Nicolas Pinto, and Joseph Turian, Experience grounds language, Pro-
ceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP) (Online) (Bonnie Webber, Trevor Cohn, Yulan He, and
Yang Liu, eds.), Association for Computational Linguistics, November 2020,
pp. 8718–8735.
[12] Paul Bloom, How children learn the meanings of words, MIT press, 2002.
[13] Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, and Patrick
Gallinari, Incorporating visual semantics into sentence representations within a
grounded space, Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP) (Hong Kong, China) (Ken-
taro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, eds.), Association for
Computational Linguistics, November 2019, pp. 696–707.
[14] Leo Breiman, Jerome Friendman, Charles J. Stone, and R. A Olstein, Classi-
fication and regression trees., Wadsworth & Brooks/Cole Advanced Books &
Software., Monterey, CA, 1984.
[15] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,
Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei, Language models are few-shot
learners, Advances in Neural Information Processing Systems, vol. 33, 2020.
[16] Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott,
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework
of Vision-and-Language BERTs, Transactions of the Association for Computa-
tional Linguistics 9 (2021), 978–994.
[17] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han, Once-for-
all: Train one network and specialize it for efficient deployment, 2020.
[18] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu,
Behind the scene: Revealing the secrets of pre-trained vision-and-language mod-
els, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
August 23–28, 2020, Proceedings, Part VI 16, Springer, 2020, pp. 565–580.
[19] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski,
and Armand Joulin, Unsupervised learning of visual features by contrast-
ing cluster assignments, Advances in Neural Information Processing Systems
(H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, eds.), vol. 33,
2020, pp. 9912–9924.
[20] Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia,
SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual
focused evaluation, Proceedings of the 11th International Workshop on Seman-
tic Evaluation (SemEval-2017) (Vancouver, Canada) (Steven Bethard, Marine
Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David
Jurgens, eds.), Association for Computational Linguistics, August 2017, pp. 1–14.
[21] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut, Conceptual
12M: Pushing web-scale image-text pre-training to recognize long-tail visual con-
cepts, CVPR, 2021.
[23] Cheng Chen, Yudong Zhu, Zhenshan Tan, Qingrong Cheng, Xin Jiang, Qun
Liu, and Xiaodong Gu, Utc: A unified transformer with inter-task contrastive
learning for visual dialog, Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2022.
[24] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, A
simple framework for contrastive learning of visual representations, Proceedings
of the 37th International Conference on Machine Learning (Hal Daumé III and
Aarti Singh, eds.), Proceedings of Machine Learning Research, vol. 119, PMLR,
13–18 Jul 2020, pp. 1597–1607.
[26] Xinlei Chen and Kaiming He, Exploring simple siamese representation learn-
ing, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2021, pp. 15745–15753.
[27] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe
Gan, Yu Cheng, and Jingjing Liu, Uniter: Universal image-text representation
learning, ECCV, 2020.
[28] Gong Cheng, Junwei Han, and Xiaoqiang Lu, Remote sensing image scene
classification: Benchmark and state of the art, Proceedings of the IEEE 105
(2017), no. 10, 1865–1883.
[29] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal, Unifying vision-and-language
tasks via text generation, Proceedings of the 38th International Conference on
Machine Learning (Marina Meila and Tong Zhang, eds.), Proceedings of Ma-
chine Learning Research, vol. 139, PMLR, 18–24 Jul 2021, pp. 1931–1942.
[30] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gau-
rav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar
Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Brad-
bury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne
Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov,
Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M.
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz,
Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason
Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and
Noah Fiedel, Palm: Scaling language modeling with pathways, 2022.
[32] Adam Coates, Andrew Ng, and Honglak Lee, An Analysis of Single Layer Net-
works in Unsupervised Feature Learning, AISTATS, 2011.
[34] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine
Bordes, Supervised learning of universal sentence representations from natural
language inference data, Proceedings of the 2017 Conference on Empirical Meth-
ods in Natural Language Processing (Copenhagen, Denmark) (Martha Palmer,
Rebecca Hwa, and Sebastian Riedel, eds.), Association for Computational Lin-
guistics, September 2017, pp. 670–680.
[35] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and
Quoc V. Le, Autoaugment: Learning augmentation policies from data, CVPR,
2019.
[36] Andrew M Dai and Quoc V Le, Semi-supervised sequence learning, Advances
in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015.
[37] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
José M.F. Moura, Devi Parikh, and Dhruv Batra, Visual Dialog, Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.
[38] Karan Desai, Gaurav Kaul, Zubin Trivadi Aysola, and Justin Johnson, Redcaps:
Web-curated image-text data created by the people, for the people, Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Bench-
marks Track (Round 1), 2021.
[39] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT:
Pre-training of deep bidirectional transformers for language understanding, Pro-
ceedings of the 2019 Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), Association for Computational Linguistics, June
2019, pp. 4171–4186.
[40] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia,
Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al., A survey on in-context learning,
2022.
[41] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, An image is worth
16x16 words: Transformers for image recognition at scale, International Con-
ference on Learning Representations, 2021.
[42] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan
Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng
Liu, and Michael Zeng, An empirical study of training end-to-end vision-and-
language transformers, Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2022.
[43] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia, Multi30K:
Multilingual English-German image descriptions, Proceedings of the 5th Work-
shop on Vision and Language (Berlin, Germany) (Anya Belz, Erkut Erdem,
Krystian Mikolajczyk, and Katerina Pastra, eds.), Association for Computa-
tional Linguistics, August 2016, pp. 70–74.
[44] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian, Im-
proving CLIP training with language rewrites, Thirty-seventh Conference on
Neural Information Processing Systems, 2023.
[45] Li Fei-Fei, Rob Fergus, and Pietro Perona, Learning generative visual models
from few training examples: An incremental bayesian approach tested on 101
object categories, CVPR Workshop (2004).
[46] Jonathan Frankle and Michael Carbin, The lottery ticket hypothesis: Finding
sparse, trainable neural networks, International Conference on Learning Repre-
sentations, 2019.
[47] Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean,
Marc'Aurelio Ranzato, and Tomas Mikolov, Devise: A deep visual-semantic embed-
ding model, Advances in Neural Information Processing Systems (C.J. Burges,
L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds.), vol. 26,
2013.
[48] Adam Gaier and David Ha, Weight agnostic neural networks, Advances in Neu-
ral Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer,
F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc.,
2019, pp. 5364–5378.
[50] Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep
feedforward neural networks, Proceedings of the Thirteenth International Con-
ference on Artificial Intelligence and Statistics (Chia Laguna Resort, Sardinia,
Italy) (Yee Whye Teh and Mike Titterington, eds.), Proceedings of Machine
Learning Research, vol. 9, PMLR, 13–15 May 2010, pp. 249–256.
[51] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh,
Making the V in VQA matter: Elevating the role of image understanding in Vi-
sual Question Answering, Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2017.
[52] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary
Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan, Accelerate:
Training and inference at scale made simple, efficient and adaptable,
https://github.com/huggingface/accelerate, 2022.
[53] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification, 2015 IEEE International
Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.
[54] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, Mask r-
cnn, 2017 IEEE International Conference on Computer Vision (ICCV), 2017,
pp. 2980–2988.
[55] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep residual learn-
ing for image recognition, 2016 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), 2016, pp. 770–778.
[56] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth, Eu-
rosat: A novel dataset and deep learning benchmark for land use and land cover
classification, 2017.
[57] Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac,
and Aida Nematzadeh, Decoupling the role of data, attention, and losses in
multimodal transformers, Transactions of the Association for Computational
Linguistics 9 (2021), 570–585.
[58] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, Distilling the knowledge
in a neural network, ArXiv abs/1503.02531 (2015).
[59] Elad Hoffer, Itay Hubara, and Daniel Soudry, Fix your classifier: the marginal
value of training the last weight layer, International Conference on Learning
Representations, 2018.
[60] Tao Hong, Xiangyang Guo, and Jinwen Ma, Itmix: Image-text mix augmenta-
tion for transferring clip to image classification, 2022 16th IEEE International
Conference on Signal Processing (ICSP), vol. 1, 2022, pp. 129–133.
[61] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay
Krishna, Sugarcrepe: Fixing hackable benchmarks for vision-language composi-
tionality, NeurIPS, 2023.
[62] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li,
Shean Wang, Lu Wang, and Weizhu Chen, LoRA: Low-rank adaptation of large
language models, International Conference on Learning Representations, 2022.
[63] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu
Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feicht-
enhofer, Demystifying CLIP data, 2023.
[64] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio, Binarized neural networks, Advances in Neural Information Processing
Systems (D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, eds.),
vol. 29, Curran Associates, Inc., 2016.
[65] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio, Quantized neural networks: Training neural networks with low precision
weights and activations, The Journal of Machine Learning Research 18 (2017),
no. 1, 6869–6898.
[66] Drew A Hudson and Christopher D Manning, Gqa: A new dataset for real-
world visual reasoning and compositional question answering, Conference on
Computer Vision and Pattern Recognition (CVPR) (2019).
[67] Taichi Iki and Akiko Aizawa, Effect of visual extensions on natural language
understanding in vision-and-language models, Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Processing (Online and Punta
Cana, Dominican Republic) (Marie-Francine Moens, Xuanjing Huang, Lucia
Specia, and Scott Wen-tau Yih, eds.), Association for Computational Linguis-
tics, November 2021, pp. 2189–2196.
[68] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas
Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong,
John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt, Openclip,
July 2021.
[69] Julia Ive, Pranava Madhyastha, and Lucia Specia, Distilling translations with
visual awareness, Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics (Florence, Italy) (Anna Korhonen, David Traum,
and Lluís Màrquez, eds.), Association for Computational Linguistics, July 2019,
pp. 6525–6538.
[70] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham,
Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig, Scaling up visual and
vision-language representation learning with noisy text supervision, Proceedings
of the 38th International Conference on Machine Learning, ICML 2021 (Marina
Meila and Tong Zhang, eds.), Proceedings of Machine Learning Research, vol.
139, 2021, pp. 4904–4916.
[71] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and Xin-
lei Chen, In defense of grid features for visual question answering, 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020, pp. 10264–10273.
[72] Kenan Jiang, Xuehai He, Ruize Xu, and Xin Eric Wang, Comclip: Training-free
compositional image and text matching, 2023.
[73] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache,
Learning visual features from large weakly supervised data, Computer Vision –
ECCV 2016 (Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, eds.),
Springer International Publishing, 2016, pp. 67–84.
[74] John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael
Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Au-
gustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A
Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav
Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman,
Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas
Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W.
Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis, Highly ac-
curate protein structure prediction with alphafold, Nature 596 (2021), 583 –
589.
[75] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Syn-
naeve, and Nicolas Carion, Mdetr–modulated detection for end-to-end multi-
modal understanding, arXiv preprint arXiv:2104.12763 (2021).
[76] Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel, Learning
visually grounded sentence representations, Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans,
Louisiana) (Marilyn Walker, Heng Ji, and Amanda Stent, eds.), Association for
Computational Linguistics, June 2018, pp. 408–418.
[77] Wonjae Kim, Bokyung Son, and Ildoo Kim, Vilt: Vision-and-language trans-
former without convolution or region supervision, International Conference on
Machine Learning, 2021, pp. 5583–5594.
[78] Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimiza-
tion, ICLR, 2015.
[79] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Ur-
tasun, Antonio Torralba, and Sanja Fidler, Skip-thought vectors, Advances
in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015.
[80] Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush, and Yoav Artzi, What
is learned in visually grounded neural syntax acquisition, Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics (Online) (Dan
Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, eds.), Association
for Computational Linguistics, July 2020, pp. 2615–2635.
[81] Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus
Rohrbach, Visual coreference resolution in visual dialog using neural module
networks, The European Conference on Computer Vision (ECCV), 2018.
[82] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
Michael Bernstein, and Li Fei-Fei, Visual genome: Connecting language and
vision using crowdsourced dense image annotations, 2016.
[83] Alex Krizhevsky, Learning multiple layers of features from tiny images, Tech.
report, University of Toronto, 2009.
[84] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, Imagenet classification
with deep convolutional neural networks, Advances in Neural Information Pro-
cessing Systems (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
eds.), vol. 25, Curran Associates, Inc., 2012, pp. 1097–1105.
[85] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova,
Open-vocabulary object detection upon frozen vision and language models, The
Eleventh International Conference on Learning Representations, 2023.
[86] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei
Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and
Anelia Angelova, Mammut: A simple architecture for joint learning for multi-
modal tasks, Transactions on Machine Learning Research, 2023.
[87] Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev,
Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng
Cao, From scarcity to efficiency: Improving clip training via visual-enriched
captions, 2023.
[88] Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam
Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas
Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and
Phil Blunsom, Mind the gap: Assessing temporal generalization in neural lan-
guage models, Advances in Neural Information Processing Systems (M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, eds.),
vol. 34, 2021, pp. 29348–29363.
[89] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni, Combining language
and vision with a multimodal skip-gram model, Proceedings of the 2015 Con-
ference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Denver, Colorado) (Rada Mihal-
cea, Joyce Chai, and Anoop Sarkar, eds.), Association for Computational Lin-
guistics, May–June 2015, pp. 153–163.
[90] Quoc Le and Tomas Mikolov, Distributed representations of sentences and doc-
uments, Proceedings of the 31st International Conference on Machine Learning
(Bejing, China) (Eric P. Xing and Tony Jebara, eds.), vol. 32, Proceedings of
Machine Learning Research, no. 2, PMLR, 22–24 Jun 2014, pp. 1188–1196.
[91] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner, Gradient-based
learning applied to document recognition, Proceedings of the IEEE 86 (1998),
no. 11, 2278–2324.
[92] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He, Stacked
cross attention for image-text matching, ECCV, 2018.
[93] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr,
A signal propagation perspective for pruning neural networks at initialization,
International Conference on Learning Representations, 2020.
[94] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr, SNIP: Single-shot
Network Pruning based on connection sensitivity, International Conference on
Learning Representations, 2019.
[95] Brian Lester, Rami Al-Rfou, and Noah Constant, The power of scale for
parameter-efficient prompt tuning, Proceedings of the 2021 Conference on Em-
pirical Methods in Natural Language Processing (Online and Punta Cana, Do-
minican Republic) (Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and
Scott Wen-tau Yih, eds.), Association for Computational Linguistics, November
2021, pp. 3045–3059.
[96] A. Li, A. Jabri, A. Joulin, and L. van der Maaten, Learning visual n-grams from
web data, 2017 IEEE International Conference on Computer Vision (ICCV),
IEEE Computer Society, 2017, pp. 4193–4202.
[97] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, BLIP-2: bootstrapping
language-image pre-training with frozen image encoders and large language mod-
els, ICML, 2023.
[98] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, Blip: Bootstrapping
language-image pre-training for unified vision-language understanding and gen-
eration, ICML, 2022.
[99] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming
Xiong, and Steven Chu Hong Hoi, Align before fuse: Vision and language repre-
sentation learning with momentum distillation, Advances in Neural Information
Processing Systems (M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and
J. Wortman Vaughan, eds.), vol. 34, Curran Associates, Inc., 2021, pp. 9694–
9705.
[100] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang,
Visualbert: A simple and performant baseline for vision and language, Arxiv,
2019.
[101] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu,
and Haifeng Wang, UNIMO: Towards unified-modal understanding and gen-
eration via cross-modal contrastive learning, Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language Processing (Volume 1: Long
Papers), Association for Computational Linguistics, 2021, pp. 2592–2607.
[102] Xiang Lisa Li and Percy Liang, Prefix-tuning: Optimizing continuous prompts
for generation, Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers) (Online) (Chengqing Zong,
Fei Xia, Wenjie Li, and Roberto Navigli, eds.), Association for Computational
Linguistics, August 2021, pp. 4582–4597.
[103] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang,
Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao,
Oscar: Object-semantics aligned pre-training for vision-language tasks, ECCV
2020 (2020).
[104] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao,
Fengwei Yu, and Junjie Yan, Supervision exists everywhere: A data efficient
contrastive language-image pre-training paradigm, International Conference on
Learning Representations, 2022.
[105] Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein,
and Joey Gonzalez, Train big, then compress: Rethinking model size for efficient
training and inference of transformers, International Conference on Machine
Learning, 2020, pp. 5958–5968.
[106] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C. Lawrence Zitnick, Microsoft coco: Common
objects in context, Computer Vision – ECCV 2014 (David Fleet, Tomas Pajdla,
Bernt Schiele, and Tinne Tuytelaars, eds.), 2014, pp. 740–755.
[107] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, RoBERTa:
A robustly optimized BERT pretraining approach, 2020.
[108] Ilya Loshchilov and Frank Hutter, SGDR: Stochastic gradient descent with warm
restarts, International Conference on Learning Representations, 2017.
[110] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee, Vilbert: Pretraining task-
agnostic visiolinguistic representations for vision-and-language tasks, Advances
in Neural Information Processing Systems, vol. 32, Curran Associates, Inc., 2019.
[111] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan
Lee, 12-in-1: Multi-task vision and language representation learning, The
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
June 2020.
[112] Luowei Zhou, Hamid Palangi, Lei Zhang, Houdong Hu, Jason J. Corso, and
Jianfeng Gao, Unified vision-language pre-training for image captioning and
vqa, arXiv preprint arXiv:1909.11059 (2019).
[114] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir, Proving
the lottery ticket hypothesis: Pruning is all you need, Proceedings of the 37th
International Conference on Machine Learning (Hal Daumé III and Aarti Singh,
eds.), Proceedings of Machine Learning Research, vol. 119, PMLR, 13–18 Jul
2020, pp. 6682–6691.
[115] Ilaria Manco, Emmanouil Benetos, Elio Quinton, and György Fazekas, Con-
trastive audio-language learning for music, 2022.
[116] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille,
and Kevin Murphy, Generation and comprehension of unambiguous object de-
scriptions, CVPR, 2016.
[117] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella
Bernardi, and Roberto Zamparelli, A SICK cure for the evaluation of compo-
sitional distributional semantic models, Proceedings of the Ninth International
Conference on Language Resources and Evaluation (LREC'14), 2014.
[118] Cynthia Matuszek, Grounded language learning: where robotics and nlp meet,
Proceedings of the 27th International Joint Conference on Artificial Intelligence,
IJCAI'18, AAAI Press, 2018, pp. 5687–5691.
[119] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher, Pointer
sentinel mixture models, International Conference on Learning Representations,
2017.
[120] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean, Efficient
estimation of word representations in vector space, International Conference on
Learning Representations, 2013.
[121] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie, Slip: Self-
supervision meets language-image pre-training, arXiv:2112.12750 (2021).
[123] Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van
Gool, and Federico Tombari, Silc: Improving vision language pretraining with
self-distillation, 2023.
[124] Binh X. Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D. Tran, and
Anh Nguyen, Coarse-to-fine reasoning for visual question answering, Multi-
modal Learning and Applications (MULA) Workshop, CVPR, 2022.
[126] Vicente Ordonez, Girish Kulkarni, and Tamara Berg, Im2text: Describing im-
ages using 1 million captioned photographs, Advances in Neural Information
Processing Systems (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and
K.Q. Weinberger, eds.), vol. 24, Curran Associates, Inc., 2011.
[127] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray,
John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe,
Training language models to follow instructions with human feedback, Advances
in Neural Information Processing Systems (S. Koyejo, S. Mohamed, A. Agar-
wal, D. Belgrave, K. Cho, and A. Oh, eds.), vol. 35, Curran Associates, Inc.,
2022, pp. 27730–27744.
[129] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gre-
gory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Al-
ban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison,
Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and
Soumith Chintala, Pytorch: An imperative style, high-performance deep learn-
ing library, Advances in Neural Information Processing Systems (H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.),
vol. 32, Curran Associates, Inc., 2019, pp. 8026–8037.
[130] Jeffrey Pennington, Richard Socher, and Christopher Manning, GloVe: Global
vectors for word representation, Proceedings of the 2014 Conference on Em-
pirical Methods in Natural Language Processing (EMNLP) (Doha, Qatar)
(Alessandro Moschitti, Bo Pang, and Walter Daelemans, eds.), Association for
Computational Linguistics, October 2014, pp. 1532–1543.
[131] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dim-
itris Papailiopoulos, Optimal lottery tickets via subsetsum: Logarithmic over-
parameterization is sufficient, Advances in Neural Information Processing Sys-
tems, vol. 33, 2020.
[132] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
Clark, Kenton Lee, and Luke Zettlemoyer, Deep contextualized word represen-
tations, Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long Papers) (New Orleans, Louisiana) (Marilyn Walker, Heng
Ji, and Amanda Stent, eds.), Association for Computational Linguistics, June
2018, pp. 2227–2237.
[133] AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and
Anelia Angelova, Answer-me: Multi-task open-vocabulary visual question an-
swering, European Conference on Computer Vision (ECCV), 2022.
[134] AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, and Anelia
Angelova, Video question answering with iterative video-text co-tokenization,
ECCV, 2022.
[135] Ariadna Quattoni, Michael Collins, and Trevor Darrell, Learning visual rep-
resentations using images with captions, 2007 IEEE Conference on Computer
Vision and Pattern Recognition, 2007, pp. 1–8.
[136] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
Gretchen Krueger, and Ilya Sutskever, Learning transferable visual models from
natural language supervision, 2021.
[138] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever, Language models are unsupervised multitask learners, 2019.
[139] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, Exploring the limits
of transfer learning with a unified text-to-text transformer, Journal of Machine
Learning Research 21 (2020), no. 140, 1–67.
[140] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and
Mohammad Rastegari, What’s hidden in a randomly weighted neural network?,
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020.
[141] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi,
Xnor-net: Imagenet classification using binary convolutional neural networks,
Computer Vision – ECCV 2016, 2016, pp. 525–542.
[142] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, Faster r-cnn: Towards
real-time object detection with region proposal networks, Advances in Neural
Information Processing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama,
and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015, pp. 91–99.
[143] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and
Björn Ommer, High-resolution image synthesis with latent diffusion models,
2021.
[144] Amir Rosenfeld and John K. Tsotsos, Intriguing properties of randomly weighted
networks: Generalizing while learning next to nothing, 2019 16th Conference on
Computer and Robot Vision (CRV), 2019, pp. 9–16.
[145] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiao-
qing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al.,
Code llama: Open foundation models for code, 2023.
[146] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
Alexander C. Berg, and Li Fei-Fei, ImageNet: A Large-Scale Hierarchical Image
Database, CVPR, 2009.
[147] Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia
Angelova, Tokenlearner: Adaptive space-time tokenization for videos, Ad-
vances in Neural Information Processing Systems (M. Ranzato, A. Beygelz-
imer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, eds.), vol. 34, Curran
Associates, Inc., 2021, pp. 12786–12797.
[148] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L
Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan,
Tim Salimans, et al., Photorealistic text-to-image diffusion models with deep
language understanding, Advances in neural information processing systems 35
(2022), 36479–36494.
[149] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut, Conceptual
captions: A cleaned, hypernymed, image alt-text dataset for automatic image
captioning, Proceedings of ACL, 2018.
[150] Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu, Visually grounded
neural syntax acquisition, Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics (Florence, Italy) (Anna Korhonen, David
Traum, and Lluís Màrquez, eds.), Association for Computational Linguistics,
July 2019, pp. 1842–1861.
[152] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi, A corpus of natural lan-
guage for visual reasoning, Proceedings of the 55th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 2: Short Papers), Association
for Computational Linguistics, 2017, pp. 217–223.
[153] Hao Tan and Mohit Bansal, Lxmert: Learning cross-modality encoder repre-
sentations from transformers, Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing, 2019.
[155] Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli, Prun-
ing neural networks without any data by iteratively conserving synaptic flow,
Advances in Neural Information Processing Systems, vol. 33, 2020.
[156] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
Sablayrolles, and Hervé Jégou, Training data-efficient image transformers &
distillation through attention, Proceedings of the 38th International Conference
on Machine Learning (Marina Meila and Tong Zhang, eds.), Proceedings of
Machine Learning Research, vol. 139, PMLR, 18–24 Jul 2021, pp. 10347–10357.
[158] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, Representation learning with
contrastive predictive coding, 2019.
[159] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you
need, Advances in Neural Information Processing Systems (I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
eds.), vol. 30, Curran Associates, Inc., 2017, pp. 5998–6008.
[160] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
Samuel R. Bowman, GLUE: A multi-task benchmark and analysis platform for
natural language understanding, International Conference on Learning Repre-
sentations, 2019.
[161] Chaoqi Wang, Guodong Zhang, and Roger Grosse, Picking winning tickets be-
fore training by preserving gradient flow, International Conference on Learning
Representations, 2020.
[162] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng, Two-step quan-
tization for low-bit neural networks, 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4376–4384.
[163] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu,
Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and
Furu Wei, Image as a foreign language: Beit pretraining for all vision and
vision-language tasks, 2022.
[164] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan
Cao, Simvlm: Simple visual language model pretraining with weak supervision,
International Conference on Learning Representations (ICLR), 2022.
[165] Jason Wei and Kai Zou, EDA: Easy data augmentation techniques for boosting
performance on text classification tasks, Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th Interna-
tional Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Association for Computational Linguistics, November 2019, pp. 6382–6388.
[167] Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Joseph E. Gonzalez,
and Peter Vajda, Data efficient language-supervised zero-shot recognition with
optimal transport distillation, International Conference on Learning Represen-
tations, 2022.
[169] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav, Visual entail-
ment: A novel task for fine-grained image understanding, arXiv preprint
arXiv:1901.06706 (2019).
[170] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan,
Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer, VideoCLIP:
Contrastive pre-training for zero-shot video-text understanding, Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, November 2021, pp. 6787–6800.
[171] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, How transferable
are features in deep neural networks?, Advances in Neural Information Process-
ing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q.
Weinberger, eds.), vol. 27, Curran Associates, Inc., 2014, pp. 3320–3328.
[172] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier, From image
descriptions to visual denotations: New similarity metrics for semantic infer-
ence over event descriptions, Transactions of the Association for Computational
Linguistics 2 (2014), 67–78.
[173] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini,
and Yonghui Wu, Coca: Contrastive captioners are image-text foundation mod-
els, Transactions on Machine Learning Research, 2022.
[174] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and
James Zou, When and why vision-language models behave like bags-of-words,
and what to do about it?, The Eleventh International Conference on Learning
Representations, 2023.
[175] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe,
and Youngjoon Yoo, Cutmix: Regularization strategy to train strong classifiers
with localizable features, ICCV, 2019.
[176] Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari,
Learning Multi-Modal Word Representation Grounded in Visual Context, As-
sociation for the Advancement of Artificial Intelligence (AAAI) (New Orleans,
United States), February 2018.
[177] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi, From recognition to
cognition: Visual commonsense reasoning, The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.
[178] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers,
Alexander Kolesnikov, and Lucas Beyer, Lit: Zero-shot transfer with locked-
image text tuning, Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), June 2022, pp. 18123–18133.
[179] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz,
mixup: Beyond empirical risk minimization, International Conference on Learn-
ing Representations, 2018.
[180] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan
Wang, Yejin Choi, and Jianfeng Gao, Vinvl: Revisiting visual representations
in vision-language models, Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2021, pp. 5579–5588.
[181] Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita,
Zuchao Li, and Hai Zhao, Neural machine translation with universal visual
representation, International Conference on Learning Representations, 2020.
[182] Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar, Learning video
representations from large language models, arXiv:2212.04501, 2022.
[183] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Li-
unian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al., Regionclip:
Region-based language-image pretraining, Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, 2022, pp. 16793–16803.
[184] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski, Deconstructing
lottery tickets: Zeros, signs, and the supermask, Advances in Neural Information
Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019, pp. 3597–
3607.
[185] Barret Zoph and Quoc V. Le, Neural architecture search with reinforcement
learning, International Conference on Learning Representations, 2017.