
Dartmouth College

Dartmouth Digital Commons

Dartmouth College Ph.D Dissertations Theses and Dissertations

Winter 1-24-2025

Incorporating Visual Information into Natural Language Processing
Maxwell Mbabilla Aladago
[email protected]

Follow this and additional works at: https://digitalcommons.dartmouth.edu/dissertations

Part of the Artificial Intelligence and Robotics Commons

Recommended Citation
Aladago, Maxwell Mbabilla, "Incorporating Visual Information into Natural Language Processing" (2025).
Dartmouth College Ph.D Dissertations. 335.
https://digitalcommons.dartmouth.edu/dissertations/335

This Thesis (Ph.D.) is brought to you for free and open access by the Theses and Dissertations at Dartmouth Digital
Commons. It has been accepted for inclusion in Dartmouth College Ph.D Dissertations by an authorized
administrator of Dartmouth Digital Commons. For more information, please contact
[email protected].
INCORPORATING VISUAL INFORMATION INTO NATURAL
LANGUAGE PROCESSING
A Thesis
Submitted to the Faculty
in partial fulfillment of the requirements for the
degree of
Doctor of Philosophy
in
Computer Science
by
Maxwell Mbabilla Aladago
DARTMOUTH COLLEGE
Hanover, New Hampshire
January 2025

Examining Committee:

Soroush Vosoughi, Chair

Lorenzo Torresani

Temiloluwa O. Prioleau

Anthony J Piergiovanni

F. Jon Kull, Ph.D.


Dean of the Guarini School of Graduate and Advanced Studies
Abstract

Natural language describes entities in the world, some real and some abstract. It is
also common practice to complement human learning of natural language with visual
cues. This is evident in the heavily graphical nature of children’s literature which
underscores the importance of visual cues in language acquisition. Similarly, the no-
tion of “visual learners” is well recognized, reflecting the understanding that visual
signals such as illustrations, gestures, and depictions effectively supplement language.
In machine learning, two primary paradigms have emerged for training systems in-
volving natural language. The first paradigm encompasses setups where pre-training
and downstream tasks are exclusively in natural language. The second paradigm
comprises models that require joint reasoning over both language and visual inputs
during pre-training and downstream tasks. Given the widely acknowledged role of
visual input in human language comprehension, it is pertinent to inquire whether
visual information can similarly augment the comprehension of language-only tasks
in machine learning. Despite the remarkable advancements in the capabilities of ma-
chine learning models across all domains in recent years, the concept of supplementing
Natural Language Processing with visual signals remains insufficiently explored. This
is in part due to the absence of clear and effective strategies for integrating visual

information into language models, given the limited availability of extensive, high
quality image-language paired datasets. In this thesis, we address this challenge and
propose two frameworks for incorporating visual information into natural language
pre-training leveraging multimodal models as intermediaries between visual informa-
tion and language models. Empirical evaluations conducted on language pre-training
datasets of varying sizes demonstrate the efficacy of the proposed frameworks across
diverse downstream language tasks. In addition, we introduce methods for training
effective multimodal models through architectural innovations and novel multimodal
data augmentation techniques. The representations generated by our multimodal
models improve performance in zero-shot image categorization, visual question an-
swering, visual entailment, and cross-modal retrieval tasks in downstream evaluations.
Finally, this thesis presents a novel method for constructing effective neural networks
by selection from randomly initialized parameters in contrast to the conventional
practice of parameter updates via gradient descent.

Acknowledgements

I would like to express my heartfelt appreciation and profound gratitude to my advi-


sors, Soroush Vosoughi and Lorenzo Torresani, for their unending support, guidance,
and patience during my Ph.D. research. Since our first meeting in Ghana, Lorenzo has selflessly shared his knowledge and time with me, including staying as an Adjunct
Professor for over two years so that I could finish my studies. I am equally grateful to
Soroush who accepted me in the middle of my Ph.D. and worked endlessly to ensure
I completed this thesis.
I would also like to extend my heartfelt appreciation to Temiloluwa Prioleau and
Anthony J Piergiovanni for serving on my thesis committee, and for their wise counsel
across various facets of my life. Temi has been an invaluable anchor and family ever
since I arrived at Dartmouth while AJ has been a wonderful friend and mentor. I
will forever remain indebted to all of you —Lorenzo, Soroush, Temi, AJ— for the
numerous ways you have contributed to my story.
To my friend, Joseph DiPalma, thank you for the many ways you have helped me
especially during the pandemic year. Finally, to my fiancée, Jessica Tolbert, to my
parents who, despite never going to school, made sure I had an education, and to the
rest of my family, I dedicate this thesis to you. This is your thesis, too!

Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv

1 Introduction 1
1.1 Modeling Paradigms . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Vision-Guided Natural Language Pre-training . . . . . . . . . . . . . 4
1.3 Main Contributions and Outline of the Thesis . . . . . . . . . . . . . 7

2 Related Works 10
2.1 Language Models Pre-training and Grounding . . . . . . . . . . . . . 10
2.2 Multimodal Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

3 Vision-Guided Language Learning 18


3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
3.2 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Objective Functions . . . . . . . . . . . . . . . . . . . . . . . . 23
3.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Main Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4.1 Zero-shot Semantic Relatedness . . . . . . . . . . . . . . . . . 32
3.4.2 General Language Understanding Evaluation (GLUE) Results 33
3.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.4 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.4.5 Pre-training with Guidance from Multimodal Text Encoder . . 39
3.4.6 Is the Visual Information Necessary? . . . . . . . . . . . . . . 42
3.5 Implications & Limitations . . . . . . . . . . . . . . . . . . . . . . . . 43
3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

4 CLIP-C: Semantic Compositions For Data-Efficient Vision-Language


Models 47
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.2.2 CLIP-C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
4.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Zero-shot Image Classification . . . . . . . . . . . . . . . . . . 55
4.4.2 Zero-shot Cross-Modal Retrieval . . . . . . . . . . . . . . . . . 56
4.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4.5.1 Why Semantic Compositions? . . . . . . . . . . . . . . . . . . 60
4.5.2 Impact of Stochasticity During Sampling . . . . . . . . . . . . 61
4.5.3 Sampling Probability Rho (ρ) . . . . . . . . . . . . . . . . . . 61
4.5.4 Image Composition Function . . . . . . . . . . . . . . . . . . . 62
4.5.5 Impact of Modality Used in Composition . . . . . . . . . . . . 63
4.5.6 Additional Results . . . . . . . . . . . . . . . . . . . . . . . . 64

4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5 Channel Fusion for Vision and Language Learning 66


5.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.2 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.2.2 Compound Tokens . . . . . . . . . . . . . . . . . . . . . . . 71
5.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.4 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
5.4.1 Why Channel Concatenation? . . . . . . . . . . . . . . . . . . 75
5.4.2 Compound Tokens versus Merged Attention . . . . . . . . 75
5.4.3 Multimodal Transformer Encoder . . . . . . . . . . . . . . . . 77
5.4.4 An Encoder Model for VQA . . . . . . . . . . . . . . . . . . . 78
5.4.5 Comparison with Other Multimodal Models . . . . . . . . . . 78
5.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.1 Image Resolution . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.5.2 Type of Image Encoder . . . . . . . . . . . . . . . . . . . . . . 80
5.6 Supplemental Information . . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82

6 Slot Machines 84
6.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.2 Technical Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.1 Intuition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.2.2 Learning in Slot Machines . . . . . . . . . . . . . . . . . . 89
6.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . 93

6.4.1 Slot Machines versus Traditionally-Trained Networks . . . . . 93
6.4.2 Fine-tuning Slot Machines . . . . . . . . . . . . . . . . . . . . 95
6.5 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
6.5.1 Greedy Selection Versus Probabilistic Sampling . . . . . . . . 97
6.5.2 Weights Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.5.3 Sparse Slot Machines . . . . . . . . . . . . . . . . . . . . . 100
6.6 Related Works . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

7 Conclusion 103

References 107

List of Tables

3.1 Zero-shot Semantic Relatedness Results . . . . . . . . . . . . . . . . . 32


3.2 General Language Understanding Results . . . . . . . . . . . . . . . . 33
3.3 Ablation Results on the Guidance Loss . . . . . . . . . . . . . . . . . 36
3.4 Impact of Pre-trained Image Encoder Ablation . . . . . . . . . . . . . 38
3.5 Semantic Relatedness Results from Multimodal Text Encoder Guidance 41
3.6 Transfer Learning Results on GLUE . . . . . . . . . . . . . . . . . . . 42
3.7 Is Visual Information Actually Helpful? . . . . . . . . . . . . . . . . . 43

4.1 Zero-shot Image Classification . . . . . . . . . . . . . . . . . . . . . . 55


4.2 Zero-shot Cross-modal Retrieval . . . . . . . . . . . . . . . . . . . . . 56
4.3 CLIP-C Mini-batch Ablation . . . . . . . . . . . . . . . . . . . . . . . 59
4.4 Impact of Random Seeds Ablation . . . . . . . . . . . . . . . . . . . . 60
4.5 Importance of Semantic Compositions . . . . . . . . . . . . . . . . . 60
4.6 Impact of Stochasticity on CLIP-C . . . . . . . . . . . . . . . . . . . 61
4.7 CLIP-C versus MixUp and CutMix . . . . . . . . . . . . . . . . . . . 63

4.8 Modality Involved in Composition: Applying semantic compo-
sitions on both modalities is the most consistently effective method
across different downstream datasets and tasks. . . . . . . . . . . . . 63
4.9 SLIP + CLIP-C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
4.10 Sugar-Crepe Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.1 Channel Fusion versus Other Mixing Methods . . . . . . . . . . . . . 75


5.2 Compound Tokens versus other Fusion Methods with VLP . . . . 77
5.3 Compound Tokens versus other Fusion Methods without VLP . . . 77
5.4 Encoder Model for VQA . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5 Compound Tokens versus other Multimodal Models . . . . . . . . 79
5.6 Impact of Image Resolution . . . . . . . . . . . . . . . . . . . . . . . 81
5.7 Image Encoder Ablation . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.8 Hyper-parameters for Compound Tokens experiments . . . . . . . 82

6.1 Shallow Convolutional Networks Specifications . . . . . . . . . . . . . 91

List of Figures

1.1 Different Model Paradigms . . . . . . . . . . . . . . . . . . . . . . . . 2


1.2 Vision-Guided Language Pre-training . . . . . . . . . . . . . . . . . . 5

3.1 Vision-Guided Improvements Patterns . . . . . . . . . . . . . . . . . 21


3.2 Architecture of Vision-Guided Language Learning . . . . . . . . . . . 22
3.3 Semantic Relatedness Correlation . . . . . . . . . . . . . . . . . . . . 34
3.4 Semantic Relatedness versus β . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Architecture of Text Encoder Guidance . . . . . . . . . . . . . . . . . 39

4.1 CLIP-C Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . 50


4.2 ImageNet Zero-shot results and Losses . . . . . . . . . . . . . . . . . 57
4.3 CLIP-C Sampling Probability . . . . . . . . . . . . . . . . . . . . . . 62

5.1 Multimodal Fusion Methods . . . . . . . . . . . . . . . . . . . . . . . 67


5.2 Compound Tokens Model Architecture . . . . . . . . . . . . . . . 70
5.3 Compound Tokens versus Merged Attention without VLP . . . . . 76

6.1 Slot Machines Architectural Setup . . . . . . . . . . . . . . . . . . 87


6.2 Results between random networks and Slot Machines . . . . . . . 93

6.3 Slot Machines versus Traditional Training . . . . . . . . . . . . . . 94
6.4 Test Accuracy versus Flops for Slot Machines . . . . . . . . . . . . 94
6.5 Finetuning Slot Machines Selected Weights . . . . . . . . . . . . . 95
6.6 Different Slot Machine Finetuning Checkpoints . . . . . . . . . . . 96
6.7 Slot Machines Selection Method . . . . . . . . . . . . . . . . . . . 97
6.8 Weight exploration in Slot Machines. . . . . . . . . . . . . . . . . 98
6.9 Weights Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

Chapter 1
Introduction
Unless our words, concepts, ideas are hooked onto an image, they

will go in one ear, sail through the brain, and go out the other ear

Dr. Lynell Burmark

In recent years, there has been a remarkable acceleration in the capabilities of


various Artificial Intelligence (AI) models, observed both within the AI community
and by the general public. In the domain of natural language processing and conver-
sational systems, this advancement is exemplified by systems such as ChatGPT [22],
developed by OpenAI, and Gemini [49], developed by Google. Concurrently, image
and video generation have achieved photorealistic quality through the use of various
text-conditioned visual diffusion models [143, 148, 7]. Moreover, there has been signif-
icant progress in multimodal tasks —those requiring integrated reasoning across mul-
tiple modalities, particularly visual and textual. These tasks are increasingly being
addressed effectively through the application of various foundational models [17, 25].
The fundamental machine learning paradigms driving this remarkable era of ad-
vancements can be conceptualized through an examination of both the inputs pro-
vided to the models and the tasks anticipated of the trained models. For this thesis,
we will emphasize two primary input modalities: visual and textual. In the visual



Figure 1.1: Different Model Paradigms: (a) A natural language processing model
versus (b) a computer vision model versus (c) a multimodal model. In
BERT [39], the inputs and downstream tasks are in textual form. In the
Vision Transformer [41], the inputs are images with integer labels as out-
puts. To perform the desired task, PaLI [25] must integrate inputs from
both the visual and textual modalities. Credit: The images are from the
respective papers.

modality, input data comprises either images or videos, whereas in the language
modality, the data is textual in nature. Within this framework, certain modeling
patterns can be discerned, as depicted by representative examples in Figure 1.1 and
described in the subsequent sections.


Section 1.1
Modeling Paradigms from an Input Perspective

(a) Unimodal Models: Typically, unimodal models process inputs from a singular
modality and execute tasks that necessitate only this specific modality. This
classification encompasses two prominent fields within machine learning:

(i) Computer Vision (CV) Models: These models process inputs in the form
of images or videos and are designed for tasks ranging from image recognition and object detection to semantic segmentation. This category encompasses early models such as AlexNet [84] and ResNet [55], as well as
more contemporary models like Vision Transformers [41].

(ii) Natural Language Processing (NLP) Models: In the domain of NLP, models
are provided with natural language inputs, which are typically represented
as tokens. Once trained, these models are capable of performing a range
of tasks, including sentiment classification, semantic similarity matching,
summarization, and entailment. Examples of models within this domain
include BERT [39] and the GPT series [137, 138, 15] of models, among
others.

(b) Multimodal Models: This category contains models trained on inputs from
multiple modalities, thereby enabling them to perform tasks that necessitate
integrated reasoning across these modalities. Notable examples include tradi-
tional multimodal models such as VilBERT [110], PaLI [25], SimVLM [164],
and Flamingo [3]. These models are designed to address a variety of downstream
multimodal tasks, such as visual question answering (VQA), visual entailment,
and image captioning.


While unimodal and multimodal models have eminently driven recent advance-
ments in machine learning, we posit that an alternative modeling framework warrants
increased scholarly attention. This framework involves models that incorporate in-
puts from both visual and linguistic domains during the pre-training phase, yet utilize
only a single modality when addressing downstream tasks.
In this respect, significant advancements have been achieved in vision-centric
downstream tasks, ranging from zero-shot image categorization to open-vocabulary
object detection [85], through the application of joint embedding models like CLIP [136]
and ALIGN [70]. Furthermore, text-conditioned generative visual models, including
Imagen 3 [7], DALL-E [10], and Stable Diffusion [143], have been developed, demon-
strating remarkable capabilities in generating visual outputs from natural language
inputs. Motivated by these advancements in vision-centric systems, this thesis ex-
plores the potential for achieving similar improvements in language-centric tasks by
incorporating visual cues during the pre-training phase of language models. This mod-
eling approach, depicted in Figure 1.2, is referred to in this thesis as vision-guided
language learning.

Section 1.2
Vision-Guided Natural Language Pre-training

Vision-guided natural language pre-training represents a learning framework wherein


language models are provided with visual information alongside textual inputs during
the pre-training phase. Notably, this visual information is typically not required or
present during downstream tasks. The fundamental distinction between traditional
natural language models and vision-conditioned pre-trained language models lies in
the inclusion of visual data during the pre-training process. Importantly, both lan-
guage models and vision-guided language models are deployed to downstream tasks


Figure 1.2: Vision-Guided Language Pre-training: In vision-guided language pre-training (dashed green arrow), the pre-training data is composed of both visual and textual data. However, the target tasks have no corresponding visual data.

in an identical manner following the pre-training stage. The motivation for building
language models with visual priming has foundations in three key areas:

(a) Cognitive Realism: In The Modern Language Review, Troscianko [157]
defines cognitive realism as “the capacity of a text to tap in directly to some
aspect of a reader’s cognitive faculties... A text that is cognitively realistic cor-
responds to how we really remember, or see, or feel, and may therefore induce
a particularly effortless imaginative response on the reader’s part.” This con-
cept of “cognitive text” effectively encapsulates the role of language in shaping
our understanding of the world, a dimension that remains underrepresented in
current NLP models. In human cognition, sentences are not merely isolated se-
quences of tokens, as often represented in language models; rather, they embody
feelings, visions, and emotions. The significance of visual aids such as graphics,


charts, and illustrations in enhancing comprehension highlights the advantages


of visual perception in communication. Vision-guided language modeling has the potential to enable language models to comprehend the world beyond free-form text. This thesis explores the potential of enabling language models to “see”
by incorporating relevant visual information during the pre-training phase.

(b) Language Grounding: Grounding is an essential component of language, fo-


cusing on the relationship between meaning and the physical world [118, 11].
Despite strong advancements in various natural language processing tasks, lan-
guage models have primarily relied on text as the sole contextual resource,
lacking grounding in the visual domain. Bender and Koller [8] contend that
language models trained exclusively on textual form are inherently incapable of
acquiring true semantic meaning. Consequently, efforts in grounded language
learning, such as vokenization [154] and the work by Bordes et al. [13], are
commendable. This study contributes to this body of research by integrating
“visual pointing” [12] into language models.

(c) Application to Cross-lingual Learning: Self-supervised natural language


processing models require massive amounts of textual data to achieve efficacy.
However, the majority of the world’s languages are considered low-resource, pos-
ing challenges in compiling large datasets for training models in these languages.
Nonetheless, all languages encode the same physical world, suggesting that the
visual domain could function as a “lingua franca” across languages, provided
visual cues are effectively integrated into the (pre-)training of language models.
Incorporating visual information into language modeling holds promising im-
plications for various fields, including visually-aided cross-lingual learning [122]
and machine translation [43].


Section 1.3
Main Contributions and Outline of the Thesis

This thesis introduces a framework for incorporating visual information during the
pre-training of language models and evaluates its effectiveness across multiple lan-
guage tasks. Additionally, we propose two methods for training effective multimodal
models through creative data augmentation and better multimodal representation fu-
sion. Finally, we present an innovative approach for “training” neural networks based
on weight selection rather than traditional gradient updates.
Chapter 2 examines the related works in the areas of language model pre-training,
grounded language models, and multimodal methodologies. This discussion underpins
the approaches detailed in Chapters 3, 4 and 5. The related works pertinent to Slot
Machines are addressed in Chapter 6.
In Chapter 3, we present an architecture for vision-guided language learning that
leverages the knowledge embedded in multimodal models for language grounding.
First, we facilitate the transfer of visual knowledge from a multimodal image encoder
to a language model through guided pre-training on paired image-text data. Within
the context of this thesis, a multimodal image encoder is defined as the component
responsible for processing images in a vision-language model such as CLIP [136].
Due to the limited availability of high-quality paired image-text data, we propose a
variation of our approach that employs the multimodal text encoder from a vision-
language model. This second approach enables us to circumvent the necessity for raw
image data by using the text encoder’s association with visual information as a sur-
rogate. Text corpora can then be used for guided pre-training thereby mitigating the
challenges associated with the limited availability of high-quality image-text datasets.
Empirical evaluations on various language benchmarks demonstrate that our mod-


els surpass those trained with language-only pre-training. The improvements are par-
ticularly pronounced in tasks assessing semantic relatedness, where the objective is
to evaluate the proximity between pairs of sentences. Our explorations are scoped to
language models trained with a de-noising objective specifically the masked language
model objective introduced in BERT [39].
Unlike Chapter 3, which concentrates on leveraging vision-language models to
enhance natural language understanding, Chapter 4 introduces a novel approach
for constructing more robust vision-language models, particularly in contexts where
paired image-text data is scarce. Our framework, termed CLIP-C (for CLIP Com-
positions), generates a third composite image-text pair from two randomly selected
image-text pairs based on a predetermined probability. We label the examples cre-
ated through this operation composite examples. The vision-language model is sub-
sequently trained on a mini-batch comprising both simple and composite examples,
with the ratio of composite examples determined by the composition probability. Ex-
perimental results across multiple vision-language datasets of varying scales demon-
strate that CLIP-C consistently yields performance improvements across tasks such
as zero-shot image categorization, image-text retrieval, and linear evaluations when
compared to baseline methods.
Similar to Chapter 4, Chapter 5 focuses on enhancing multimodal representation
learning via architectural innovations, introducing a novel formulation of multimodal
tokens. This approach embeds image and text representations into the same feature
vectors, thus facilitating stronger interactions between the two modalities. This is
accomplished by concatenating image tokens and text tokens along the channel di-
mension. Cross-attention is employed to retrieve the complementary tokens which are
then merged together on the feature dimension. Referred to as Compound Tokens,
this method demonstrates superior performance compared to standard approaches


across several multimodal tasks, including vision question answering and visual en-
tailment.
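The channel-fusion idea can be sketched in PyTorch as follows; the token shapes, the use of torch.nn.MultiheadAttention for the cross-attention step, and the class name are assumptions made for illustration rather than the exact architecture of Chapter 5.

    import torch
    import torch.nn as nn

    class CompoundTokenFusion(nn.Module):
        """Illustrative channel fusion: each output token carries image and text features."""

        def __init__(self, dim=768, heads=12):
            super().__init__()
            self.txt_queries_img = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.img_queries_txt = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, img_tokens, txt_tokens):
            # img_tokens: [B, Nv, dim], txt_tokens: [B, Nt, dim]
            # Cross-attention retrieves the complementary modality for each token.
            img_ctx, _ = self.txt_queries_img(txt_tokens, img_tokens, img_tokens)  # [B, Nt, dim]
            txt_ctx, _ = self.img_queries_txt(img_tokens, txt_tokens, txt_tokens)  # [B, Nv, dim]

            # Channel concatenation: merge along the feature (channel) dimension.
            text_compound = torch.cat([txt_tokens, img_ctx], dim=-1)    # [B, Nt, 2*dim]
            image_compound = torch.cat([img_tokens, txt_ctx], dim=-1)   # [B, Nv, 2*dim]

            # The compound tokens would then be concatenated along the sequence axis
            # and processed with global self-attention (omitted here).
            return torch.cat([image_compound, text_compound], dim=1)    # [B, Nv+Nt, 2*dim]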
Finally, in Chapter 6, we present Slot Machines, a novel method of training
neural networks based on weight selection as opposed to the conventional approach of
weight optimization in a continuous space. Slot Machines identify effective combi-
nations of random weights by selecting from a predetermined set of random values for
each connection, achieving performance comparable to that of traditionally trained
networks with identical capacity. This demonstrates the potential existence of efficient
random networks whose weights remain unchanged. Adjacent to our investigations
in Chapters 3, 4, and 5 which explore multimodal models and the enhancement of
language models through visual information, Slot Machines are primarily applied
to computer vision tasks.
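A minimal sketch of the weight-selection idea is given below; the number of candidates per connection (K = 8), the greedy argmax selection, and the layer name are assumptions for illustration, and the straight-through gradient trick that lets the quality scores be trained in practice is omitted.

    import torch
    import torch.nn as nn

    class SlotLinear(nn.Module):
        """Illustrative Slot Machines-style layer: weights are selected, never updated."""

        def __init__(self, in_features, out_features, k=8):
            super().__init__()
            # Fixed bank of random candidate weights for every connection (never trained).
            self.candidates = nn.Parameter(
                0.1 * torch.randn(out_features, in_features, k), requires_grad=False)
            # Quality scores are the only trainable parameters.
            self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features, k))

        def forward(self, x):
            # Greedy selection: use the highest-scoring candidate for each connection.
            idx = self.scores.argmax(dim=-1, keepdim=True)        # [out, in, 1]
            weight = self.candidates.gather(-1, idx).squeeze(-1)  # [out, in]
            # In practice a straight-through estimator lets gradients reach the scores;
            # that detail is omitted in this forward-only sketch.
            return x @ weight.t()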

Chapter 2
Related Works
We organize our discussion of related works into three primary categories: (1) Lan-
guage models pre-training and grounding, which is pertinent to our research in Chapter 3 on vision-guided language model pre-training, leading to the development of
VGLMs and MT-GLMs; (2) Multimodal methods encompassing CLIP-C (Chapter 4)
and Compound Tokens (Chapter 5); and (3) Neural networks initialization
and pruning which relates to Slot Machines (Chapter 6). In this chapter, we
focus on the first two categories, while the third category is reviewed in Chapter 6,
as Slot Machines primarily pertains to the domain of computer vision model ini-
tialization.

Section 2.1
Language Models Pre-training and Grounding

Language Models. Early developments in language model pre-training, notably Word2Vec [120] and GloVe [130], focused on learning distributed vector representations of words. Word2Vec [120] in particular demonstrated semantic and syntactic
regularities contained in these representations through the use of vector arithmetic.
These initial approaches were subsequently superseded by sentence-level pre-training
approaches [34, 90, 36] which offered universal sentence representations.


Following the invention of the Transformer [159] architecture, contextual language


learning via token-level embeddings using deep neural networks has gained significant
popularity and adoption [39, 137, 138, 132, 107, 33, 139], resulting in unparalleled
advancements across numerous tasks, including sentiment analysis, natural language
inference, cross-lingual learning, and machine translation. BERT [39] introduced an
innovative pre-training approach by combining next sentence prediction with masked
token prediction, thereby providing the model with a bidirectional perspective of
the input sequence. Alongside BERT [39], extensive auto-regressive language modeling [137, 138] on large-scale language corpora has produced an unprecedented expansion of
language capabilities. Further scaling of both model and data, coupled with technical
innovations, has led to the development of even more advanced language models,
such as PaLM [30] and GPT-4 [1], leading to models with commercial-grade reasoning
and conversational capabilities.
VGLMs and MT-GLMs in Chapter 3 build upon the token-level language pre-
training methodologies described above, specifically the BERT [39] framework. However, unlike these traditional approaches, which focus exclusively on pre-training with language data, our models integrate visual information to provide grounding
supervision at the sequence level.

Grounded Language Models. The idea that natural language must be grounded
in physical reality has motivated many works in machine learning over the years [89,
176, 76, 154]. Several studies utilize visual information to enhance particular language
tasks including cross-modal learning [122], machine translation [43, 69, 181] or seman-
tic parsing [80, 150]. Our contribution to grounded language models, as detailed in
Chapter 3, is akin to the vokenization method [154] which emphasizes pre-training
with visual information to enhance general language understanding. The vokeniza-
tion approach comprises two stages. Initially, a token-to-image retrieval model is built


by mapping individual words to images using a hinge loss function. This first phase
generates visual tokens or “vokens” for use in the subsequent training phase. In their
experiments, Tan and Bansal [154] utilize an image captioning dataset for vokenization
where each word in a caption is linked to the corresponding image. The second phase
involves language model pre-training which incorporates a masked “voken” prediction
alongside the masked token prediction used in BERT. Given a sequence of token IDs
and their associated voken IDs retrieved from a predefined set of images, an identi-
cal masking protocol is applied to both, with each contributing to their respective
masking prediction loss function.
Our vision-guided language models (VGLMs) do not employ vokenization nor
a voken classification objective. Instead, we directly utilize continuous embeddings
obtained from a pre-trained image encoder to implement a guidance loss via con-
trastive learning. Moreover, we introduce multimodal text encoders as effective alter-
native modules to image encoders, thereby mitigating the dependency on image-text
datasets.
A second closely related study builds upon the SkipThought model [79] by fus-
ing visual semantics into a grounded space through the incorporation of cluster and
perceptual loss functions [13]. The cluster loss evaluates whether two sentences are
visually equivalent without considering the content of the associated images. Two
sentences are designated as visually equivalent if they are paired with the same im-
age. The perceptual loss, as the second objective, factors in the structure of the
visual space by ensuring that the similarity between two sentences correlates with the
similarity between their corresponding images.
Our vision-guidance loss in VGLMs is distinct from both the cluster and per-
ceptual objectives introduced by Bordes et al. [13]. Our approach involves directly
contrasting the textual representations with their corresponding visual embeddings or


inferred visual representations (for the case where we use a multimodal text encoder)
for improved guided learning. Furthermore, we use a modern language model based
on BERT in contrast to the SkipThought model utilized in Bordes et al. [13].
Finally, in recent years, a class of investigations has emerged that evaluate existing
multimodal models on language-only tasks, yielding mixed results [4, 67]. Alper et al. [4] observed that multimodal extensions of pre-trained language models enhance the performance of “visual-language tasks” such as color association prediction and shape association prediction. Conversely, Iki and Aizawa [67] found that these extensions can degrade performance on natural language tasks. Similarly, Cao et al. [18]
noted that while multimodal text encoders demonstrate strong performance on se-
mantic linguistic probing tasks, they consistently underperform BERT.
Our VGLMs are language models augmented with visual information which we
train from scratch without targeting any multimodal tasks. Additionally, we do not
use the probing methodologies presented in the prior works. Instead, we evaluate our
models on downstream datasets in both zero-shot and fine-tuning scenarios.

Section 2.2
Multimodal Methods

The use of language as a potent supervisory signal for learning visual representations
has a rich history in machine learning [135, 47, 73, 96, 136]. Notably, early influential
vision-language endeavors like DeViSE [47] capitalized on unannotated textual data
to first learn semantic relationships and subsequently map images into that es-
tablished semantic space through the use of class labels. Recent works have relied
principally on weakly supervised datasets from the internet to build joint vision and
language embedding models.


Joint vision and language embedding models. Recently, widely adopted models
such as CLIP [136], ALIGN [70], and others [121, 104, 63, 68] produce joint embed-
ding representations by training on extensive image-text paired datasets contrastively
using the InfoNCE loss [158]. Predominantly, these models utilize free-form text as a
supervisory signal for training generalized vision models facilitating a variety of capa-
bilities, including zero-shot image categorization, open-vocabulary object detection,
and semantic segmentation. However, both CLIP [136] and ALIGN [70] used massive
amounts of data, consuming 400 million and 1 billion image-text pairs, respectively.
The collection of such extensive datasets is resource-intensive and impractical for re-
searchers in resource-constrained environments. Furthermore, the reliance on large
image-text datasets impedes the adoption of these methods in application areas where
such datasets are not readily available, such as in medical imaging.
In response, subsequent approaches such as DeCLIP [104] and SLIP [121] have aimed
to enhance the data efficiency of the CLIP model. DeCLIP [104] uses multiple train-
ing objectives, including self-supervision on each modality [26, 39], nearest-neighbor
supervision, and multi-view supervision [19]. SLIP [121] supplements the CLIP loss
with image self-supervision by employing the methodology outlined in SimCLR [24].
Another contribution toward reducing the training data requirements of CLIP is pre-
sented by Wu et al. [167], who exploit optimal transport distillation to implement soft image-text matches. These existing methods aimed at enhancing the data efficiency of CLIP exhibit several limitations. SLIP and DeCLIP need multiple passes through the image encoder [121, 104] for every parameter update, thus increasing computational demands. Other methods such as that proposed by Fan et al. [44] need access to proprietary conversational systems such as ChatGPT to re-write image captions.
Our work on CLIP-C discussed in Chapter 4 aligns with these studies in devel-
oping joint-embedding vision and language models targeted towards increasing data


efficiency. However, our approach specifically emphasizes the application of semantic compositions to input samples during pre-training. These semantic
compositions are straightforward to implement, demonstrate effectiveness, and offer
data efficiency comparable to the CLIP model. Furthermore, our method does not
include multiple forward passes like SLIP and DeCLIP nor does it rely on an external
language model API as implemented by Fan et al . [44].
Methods most comparable to CLIP-C (Chapter 4) include data augmentation
techniques such as CutMix [175] and MixUp [179] which have proven highly effective in training categorization models within the field of computer vision. MixUp operates by interpolating between two images via a weighted summation, whereas CutMix
involves cutting and pasting a random crop from one image onto another. The one-
hot encoded label of the resulting composed sample is also a weighted element-wise
addition of the original one-hot labels. CLIP-C extends the advantages of these es-
tablished augmentation techniques, traditionally used for image understanding, into
the vision-language joint-embedding space. Apart from adding language, our ap-
proach diverges from CutMix by concatenating image crops rather than overlaying
one crop onto another. Furthermore, our models are trained using a contrastive loss,
as opposed to employing cross-entropy with soft labels, as utilized in these previous
methodologies.
The successes of image-text contrastive models have motivated the use of language
supervision in several multimodal tasks [86, 173], video understanding [170], open-
vocabulary object detection [85] as well as audio-language learning for music [115].
Since CLIP-C centers on learning joint image-text embeddings, we do cover these
works in this review. Similarly, several recent works [72, 60, 183, 178, 97, 174] that
leverage open-source CLIP checkpoints for a range of downstream evaluations are also
excluded from this discussion as our technique is primarily a pre-training mechanism.


Multimodal representation learning. Our research on Compound Tokens,


presented in Chapter 5, bears a stronger relationship with canonical multimodal
models such as VilBERT [110], BEiT-3 [163], SimVLM [164], Flamingo [3], and
PaLI [25]. These multimodal methodologies typically pre-train on extensive and
diverse multimodal datasets culminating in models that demonstrate impressive ca-
pabilities across various vision-and-language tasks. These tasks include visual di-
alogue [37, 81, 23], visual reasoning [152, 177], entailment [169, 27], visual ques-
tion answering [6, 51, 71, 164], caption generation [5, 21], and cross-modality re-
trieval [116, 75].
Beyond data scaling, architectural advancements in vision-and-language models
have been pivotal in uncovering the remarkable capabilities of these systems. Earlier
models [153, 110, 100, 103, 180] frequently relied on costly object detectors such as
Faster-RCNN [142] for image processing. In contrast, more recent approaches [25, 173]
use ResNets [55] or Vision Transformers [41] for visual feature encoding. This shift
has diminished the reliance on training with meticulously annotated human datasets
like Visual Genome [82] enabling multimodal models to leverage extensive amounts
of weakly-supervised image-text datasets obtained from the internet [149, 21].
Pretraining vision-and-language models using diverse cross-modal objectives has
emerged as a significant avenue of investigation in recent studies. Approaches such as
contrastive learning [101, 173], image captioning [5], image-text matching [92, 110],
prefix language modeling [164], and word-patch alignment [77] represent some of the
various loss functions that have been proposed. Additionally, some research efforts
have focused on integrating multiple losses during pretraining [98, 42], while still
other methods have sought to consolidate several question-answering tasks within a
multi-task framework [125, 111, 133].
Compound Tokens improves multimodal fusion to facilitate effective vision-


and-language representation learning. The majority of multimodal models implement


concatenation, referred to as merged attention [42], for multimodal fusion [112, 133, 164], with slight variations concerning the stage in the model at which the concatenation process is executed. Co-attention represents another
widely used method for fusing multimodal features [153, 99, 42]. In co-attention,
visual and textual features are independently modeled within separate transformer
encoders, and a cross-attention mechanism is used to share information across the
two modalities.
Distinct from merged attention and co-attention, Compound Tokens, as pre-
sented in Chapter 5, introduces channel concatenation for multimodal fusion. This
approach aims to enhance the alignment between image and text tokens while em-
ploying global self-attention over the fused representations to improve multimodal
processing. As the name implies, each embedding in Compound Tokens contains
both image and text features.

Chapter 3
Vision-Guided Language Learning
This chapter discusses our methodology for vision-guided language learning, wherein
visual information is incorporated into the pre-training phase of a language model.
The models obtained from our framework are called VGLMs for Vision-Guided Lan-
guage Models. We employ paired image-text datasets to train VGLMs. However,
considering the short supply of high-quality image-text datasets, we introduce a sec-
ond method for guided learning that relies chiefly on multimodal text encoders. We
call the second variation of models MT-GLMs, for Multimodal-Text Guided Language Models. Empirical evaluations demonstrate that both VGLMs and MT-GLMs
surpass the performance of baseline unguided language models across various language
tasks in both zero-shot and fine-tuning scenarios.

Section 3.1
Overview

Natural Language Processing has emerged as an invaluable platform for developing


general-purpose artificial agents, leading to capabilities that seemed impossible just
a few years ago. From code generation [145] to protein structure prediction [74] to
reasoning (and much more), these impressive abilities have come mainly from scaling
the models and datasets to extraordinary levels [49, 22] and from many important


technical innovations such as in-context learning [40], reinforcement learning with


human feedback [127], parameter efficient tuning systems [62, 102, 95], etc.
However, apart from the development of multimodal models, natural language
learning in artificial intelligence has predominantly been conducted without incor-
porating visual data that reflects the physical world. This approach contrasts with
human learning where the physical environment plays an integral role in language
acquisition. While there have been attempts to create grounded language mod-
els [154, 4, 13, 76], such efforts are still underexplored in our judgment. The mo-
tivation for our work is to expand on these efforts and explore mechanisms through
which visual content can be added to natural language processing systems.
The objective of this study is to investigate the impact of complementary visual
information on natural language representation learning. Complementary visual in-
formation is defined as visual content, such as an image, that theoretically augments
the informational content of a corresponding text. For instance, an image paired with
a correctly captioned piece of text is regarded as complementary. Additionally, this
thesis examines various natural language processing tasks to identify those that may
benefit from the inclusion of visual cues during the pre-training phase.
This study extends the remarkable accomplishments of de-noising language mod-
els, such as BERT [39], by incorporating an auxiliary guidance objective function
into the masked language modeling objective. BERT is traditionally trained solely
on text corpora. Specifically, during each training iteration, BERT receives a sequence
of tokens (sentence) and randomly masks a portion of these tokens. The model is
then trained to predict the masked tokens. This context-aware de-noising process
results in highly effective text embeddings that demonstrate strong transferability to
downstream language tasks.
Our extension is achieved through training on an image-text dataset while main-


taining a primary focus on language modeling. In addition to the masked language modeling loss, we add a contrastive loss to BERT where the “positive views” of the
texts are their corresponding images. The image embeddings are derived from a pre-
trained multimodal model, e.g., MetaCLIP [63]. For a given image-text pair, the
method involves the following steps: (1) First, we compute the image embeddings
using the pre-trained image encoder. As the image encoder is not tuned, these em-
beddings can be pre-computed prior to training, thus reducing training cost. (2)
Subsequently, we calculate the class token embeddings for the corresponding text.
These representations are trained to be close to their matching image embeddings
and far from non-matching image embeddings using InfoNCE [158]. (3) Finally, we
apply masking to the text and compute the masked language modeling loss as in BERT.
The language modeling and guidance losses are combined to train the model.
Our work shares a similar motivation with Vokenization [154]. However, unlike
that method, we do not create discrete image IDs (“vokens”). In-
stead, we retain our image representations in the continuous space, preserving the
rich structure inherent in images. While our method does not lead to an increase
in the number of trainable parameters within the model, it does necessitate a higher
number of floating-point operations (FLOPs) during the pre-training phase compared
to standard language models. That said, both the baseline model and our approach
utilize equivalent computational resources when applied to downstream tasks.
We use MS-COCO [106] for pre-training VGLMs and compare the results with the
baseline model on semantic relatedness and general language understanding. Com-
pared to the baseline model without guidance from images, our VGLM improves
performance on STS [2, 20] and SICK-R [117] by more than 10 points. We iden-
tify three general patterns through which VGLM outshines the baseline model in
Figure 3.1. These include (a) synonym replacements where one or more words are



Figure 3.1: Patterns of Improvements from Visual Guidance: On textual


relatedness, we identify three ways adding visual information during pre-
training improves the language model, as labeled in (a), (b), and (c).

substituted with their synonyms, (b) correct identification of keyword(s), and (c)
semantic understanding. We expound on these observations in Section 3.4.3.
On GLUE [160], the benefits of vision-guided learning are very modest when pre-
training on MS-COCO. The main guided model beats the baseline language model by
0.43%. The second class of models we introduced, MT-GLMs, that use multimodal
text encoders (see Figure 3.5) beat the baseline models by 1.24%, 0.93%, and 0.24%
when pre-training on MS-COCO (captions), Wiki-103 [119], and English Wikipedia
respectively. These results suggest that the benefits of guidance on general language
understanding diminish with increasing text corpus size. All the empirical results for
our experiments on this work are discussed in Section 3.4.


Figure 3.2: Architecture of Vision-Guided Language Learning: The model


uses a combination of (1) mask language modeling and (2) vision-guided
losses during pre-training. As indicated in the figure, the image encoder
is not tuned during vision-guided pre-training. Also, only the language
model is transferred to downstream tasks after pre-training.

Section 3.2
Technical Approach

The technical details of our approach including the background material for the model
are discussed in this section. At a high-level, our pre-training method uses two objec-
tive functions: a Masked Language Modeling (MaskLM) loss and a noise contrastive
estimation (InfoNCE) loss, as illustrated in Figure 3.2. Prior to delving into these
two objectives, we first present an explanation of the architectural components of our
network.

3.2.1. Architecture
As depicted in Figure 3.2, the trainable module in our architecture is a language
model F that takes as input a piece of text and produces a sequence of token representations $T_i = \{t_i^1, t_i^2, \cdots, t_i^M\}$, $t_i^j \in \mathbb{R}^c$, in addition to a class embedding $t_i^* \in \mathbb{R}^c$ for each input sequence $w_i$. The “class token” is designated as the token responsible
for capturing the overall semantic essence of the input sequence. Consequently, this
token is the representation most commonly utilized in downstream tasks [39, 154],
although alternative sequence pooling methods, such as mean pooling, may also be
employed.
The focus of this work is on improving the language capabilities of F using visual
information. We invoke a pre-trained image encoder G that outputs an embedding
$z_i \in \mathbb{R}^d$ for each input image $x_i$. Owing to resource constraints, our experimentation
primarily focuses on the use of images to demonstrate the efficacy of our method.
However, it is feasible to incorporate videos or other forms of visual media within our
method. In our experiments, we utilize the image encoder from a multimodal model
for G to leverage the benefits of prior exposure to textual information. Additionally,
multimodal image encoders typically encounter a broader array of images compared
to supervised image models. In the ablations in Table 3.4, we present results for an
image encoder trained on ImageNet-22k for comparison with MetaCLIP. As explained
before, G is not updated during pre-training as the model of interest is the language
model F. Keeping G frozen also enables the pre-computation of image representations
prior to the commencement of pre-training.
We implement a linear transformation $E : \mathbb{R}^c \rightarrow \mathbb{R}^d$ on $t_i^*$ to produce a distillation token $z_{t_i^*} \in \mathbb{R}^d$ that matches the feature dimension of the image output $z_i$.
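A minimal sketch of these two architectural pieces is shown below; the encoder interface, the hidden sizes (c = 768, d = 512), and the function names are assumptions for illustration rather than the exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    c, d = 768, 512        # assumed text hidden size and image embedding size
    E = nn.Linear(c, d)    # trainable projection producing the distillation token

    @torch.no_grad()
    def precompute_image_embeddings(image_encoder, images):
        # The image encoder G stays frozen, so its embeddings z_i can be computed
        # once before pre-training begins and cached.
        image_encoder.eval()
        z = image_encoder(images)                     # [N, d]
        return F.normalize(z, dim=-1)

    def distillation_tokens(class_tokens):
        # class_tokens: [N, c] class embeddings t_i^* produced by the language model F.
        return F.normalize(E(class_tokens), dim=-1)   # [N, d]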

3.2.2. Objective Functions


Our approach uses two losses to model the language while transferring knowledge
of the visual world to the language model. We explain these loss functions in this
subsection.


Masked Language Modeling. This is a technique that facilitates bidirectional rep-


resentation training of language models by randomly masking some percentage of the
input tokens. The model is then tasked to predict the masked tokens. This setup
allows the model to enjoy a bidirectional view of the input (from left-to-right and
right-to-left) without the possibility of information leakage. Each final hidden vector
tji corresponding to a mask token is passed through an output layer whose dimension
aligns with the vocabulary size. The resulting output is subsequently subjected to a
softmax function over the vocabulary, and a cross-entropy loss is applied using the
index of the corresponding unmasked token as the target label

$$\mathcal{L}_{\mathrm{MaskLM}} = H\left(t_i^j, w_i^j\right) \qquad (3.1)$$

where $H$ is the cross-entropy loss, $t_i^j$ is the softmax output of token $j$ in sentence $i$, and $w_i^j$ is the ground-truth label of $t_i^j$. We use the masking protocol employed in
BERT [39]. Overall, 15% of the tokens in each sentence are randomly selected for
masking: For each selected token t, the following masking procedure ensues: (1) t is
replaced with a placeholder token [MASK] with a probability of 80%. (2) 10% of the time, t is replaced by a random token t′ from the vocabulary. (3) t is left unmasked
10% of the time to match the downstream fine-tuning setup where masking is absent.
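This masking protocol can be summarized with the short sketch below; the tensor-based selection, the helper and argument names, and the use of -100 to flag positions that are excluded from the loss are implementation assumptions, not part of the original specification.

    import torch

    def mask_tokens(input_ids, mask_token_id, vocab_size, special_mask=None,
                    select_prob=0.15):
        """BERT-style 80/10/10 masking sketch. Modifies input_ids in place."""
        labels = input_ids.clone()

        # Select 15% of the tokens for prediction, optionally skipping special tokens.
        selected = torch.rand_like(input_ids, dtype=torch.float) < select_prob
        if special_mask is not None:
            selected &= ~special_mask
        labels[~selected] = -100   # ignored by the cross-entropy loss

        # 80% of the selected tokens are replaced with the [MASK] placeholder.
        to_mask = selected & (torch.rand_like(labels, dtype=torch.float) < 0.8)
        input_ids[to_mask] = mask_token_id

        # Half of the remaining selected tokens (10% overall) become random tokens.
        to_random = selected & ~to_mask & (torch.rand_like(labels, dtype=torch.float) < 0.5)
        input_ids[to_random] = torch.randint(vocab_size, labels.shape)[to_random]

        # The final 10% are left unchanged.
        return input_ids, labels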

Vision-Guidance Loss. This loss function is derived from Noise Contrastive Es-
timation (InfoNCE) [158] which is commonly employed in contexts that involve the
concept of matching representation pairs such as matching image-text pairs [136] or
distinct crops of the same image [24]. In this framework, the matching pairs are
designated as positive samples, while the non-matching pairs within the batch are
designated as negative examples. InfoNCE functions to bring the representations


of matching pairs closer together while distancing the non-matching pairs. Given a
batch of N matching pairs of representations {(p^{(1)}, q^{(1)}), (p^{(2)}, q^{(2)}), ..., (p^{(N)}, q^{(N)})},
the contrastive loss is computed as

L_D(p^{(i)}, q^{(i)}) = -\log \frac{\exp(\mathrm{sim}(p^{(i)}, q^{(i)})/\tau)}{\sum_{j=1}^{N} \exp(\mathrm{sim}(p^{(i)}, q^{(j)})/\tau)}    (3.2)

sim(·, ·) represents the cosine similarity function and τ is a learnable temperature
parameter that regulates the distributions. Typically, a bidirectional loss is calculated
by averaging L_D(p^{(i)}, q^{(i)}) and L_D(q^{(i)}, p^{(i)}), as implemented in [136]. In our approach,
we employ the unidirectional loss L_D(p^{(i)}, q^{(i)}), since the image encoder is locked
during language model pre-training. Consequently, the overall loss for all N examples
in our configuration is expressed by

L_D = \frac{1}{N} \sum_{i=1}^{N} L_D(p^{(i)}, q^{(i)}).    (3.3)

In our model, the trainable embedding p^{(i)} corresponds to the class token representation
z_{t_i^*} of the language model F. The target vector q^{(i)} represents the corresponding
image representation z_i from the pre-trained image encoder. It is noteworthy to reit-
erate that the image encoder remains static during the language model’s pre-training
phase, as the primary objective is to improve the performance of F on language tasks.
Vision-guided language learning combines the language modeling loss, as delineated
in Equation 3.1, and the guidance loss described in Equation 3.3:

L = L_{MaskLM} + β L_D    (3.4)


where β is a hyper-parameter controlling the strength of the visual influence on the


language model. In our experiments, β is always set to 1 as the default. We show
in Figure 3.4 that while the parameter β is influential, any value of β > 1 yields
only marginal improvements over β = 1. The language modeling loss, L_{MaskLM}, is
left unweighted to align with the standard language model training scenario without
visual guidance. In addition to the contrastive loss discussed in this section, we also
explore alternative loss functions, including the mean squared error loss, as detailed
in Section 3.4.4.
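For concreteness, a minimal PyTorch sketch of the unidirectional guidance loss (Equations 3.2 and 3.3) and the combined objective (Equation 3.4) is given below. The tensor names text_cls (the projected class tokens) and image_feats (the frozen image embeddings) are assumptions for illustration, and the temperature is written as a fixed constant even though it is a learnable parameter in our setup.

import torch
import torch.nn.functional as F

def guidance_loss(text_cls, image_feats, temperature=0.07):
    # Cosine similarity is the dot product of L2-normalized features.
    p = F.normalize(text_cls, dim=-1)      # p^(i): projected class tokens
    q = F.normalize(image_feats, dim=-1)   # q^(i): frozen image embeddings
    logits = p @ q.t() / temperature       # N x N similarity matrix
    targets = torch.arange(p.size(0), device=p.device)
    # Unidirectional InfoNCE: each caption is matched to its own image only.
    return F.cross_entropy(logits, targets)

def vglm_loss(mlm_logits, mlm_labels, text_cls, image_feats, beta=1.0):
    # Masked language modeling loss plus the weighted guidance loss (Eq. 3.4).
    mlm = F.cross_entropy(mlm_logits.reshape(-1, mlm_logits.size(-1)),
                          mlm_labels.reshape(-1), ignore_index=-100)
    return mlm + beta * guidance_loss(text_cls, image_feats)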

Section 3.3
Experimental Setup

We build our framework using the mask language modeling setup popularized in
BERT [39] due to its simplicity and broad applicability across numerous language
tasks. While it is acknowledged that generative models such as GPT-2 [138] and GPT-
3 [15] have achieved more widespread adoption in recent years within the machine
learning community, BERT continues to serve as an effective network for language
modeling, particularly in the low-resource settings upon which our experiments are
based.
All comparative experiments are performed by training the language model from
randomly initialized parameters to ensure control over pre-training data, model size,
and other crucial hyperparameters that could significantly impact performance. Due
to resource constraints, our experiments are primarily confined to small data and
model regimes. While we anticipate that our observations will extend to larger models
and datasets, we are unable to empirically demonstrate this scalability given the
constraints of our current study.


Models. The main network used for the language model F is the “Base” variant
of the BERT [39] model which is constructed on the Transformer architecture [159].
This model comprises 12 attention blocks, each integrating self-attention operations
with a multi-layer perceptron. The model features a dimension of 768, distributed
across 12 attention heads, each with a dimension of 64. (See [159] for details of
the self-attention operation). The tokenization module employed is the WordPiece
tokenizer, developed for BERT, which supports a vocabulary size of 30,522 tokens.
Additionally, results for model configurations smaller than the “Base” architecture are
provided for comparative analysis.
For the image encoder G, we use the image tower of a pre-trained MetaCLIP [63].
This model was trained using the vision-language contrastive learning objective on
approximately 400 million image-text pairs, curated to closely resemble the private
dataset used in CLIP [136]. For our experiments, we adopt the “Base” 1 configuration
of MetaCLIP with a patch size of 32. The image features generated by G have a
dimension of 512; hence, a linear module E (refer to Section 3.2) is employed to
map the language model’s class token to this dimensionality. Results for other pre-
trained image encoders, including a model trained on ImageNet-22K, are presented in
Section 3.4.4. It is important to note that the image encoder is kept frozen throughout
the language model pre-training.

Pre-training Datasets. The Vision-Guided Language Model (VGLM) framework,


illustrated in Figure 3.2, requires a high-quality paired image-caption dataset. Due
to the challenges in collecting large-scale datasets of this nature, we employ the MS-
COCO dataset [106] to demonstrate the potential of our model. The MS-COCO
dataset is small, comprising approximately 57,000 images, each accompanied
by five human-provided captions. For the purpose of our experiments, we randomly
select one of these captions to serve as the text pair for each image.

1 https://huggingface.co/facebook/metaclip-b16-400m
It is highly likely that the distribution of captions within the MS-
COCO dataset does not align with the general distribution of free-form text. This
assumption has been corroborated by previous studies, including the study of Tan
and Bansal [154]. Consequently, the results derived from using MS-COCO, or any
paired image-text dataset, may not be optimally extendable to standard text corpora.
To address this challenge, we propose a modification of our VGLM architecture,
wherein the image encoder is replaced with a multimodal text encoder, as depicted in
Figure 3.5. This adjustment eliminates the requirement for paired image-text data.
As a result, this architecture enables us to pre-train on text-only datasets, such as
English Wikipedia2 and its subset Wiki-103 [119]. The English Wikipedia dataset
consists of approximately 120 million sentences, while Wiki-103 encompasses around
4.2 million sentences.

Hyper-parameters. Our models are trained with a batch size of 1,024 examples,
each having a sequence length of 32 tokens. The number of training iterations is set
to 20 thousand for MS-COCO, 100 thousand for Wiki-103, and 200 thousand for
English Wikipedia. We employ a learning rate of 4 × 10^-4 with
a cosine annealing schedule and a warm-up phase of 5,000 steps. The weight decay
parameter is set to 0.01, and gradient clipping is applied with a maximum gradient
norm of 1.0. The AdamW [109] optimizer is utilized, with beta parameters (0.9, 0.999)
and epsilon 1 × 10^-8. Mixed-precision training is conducted across all models using
the PyTorch framework [129] along with the HuggingFace [166] and Accelerate [52]
packages.
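For reference, the optimization setup above corresponds roughly to the following PyTorch sketch. The tiny linear module stands in for the language model, and the LambdaLR schedule is a simple linear-warm-up-plus-cosine approximation of the schedule we describe; both are illustrative assumptions.

import math
import torch

model = torch.nn.Linear(768, 768)       # stand-in for the language model F
total_steps, warmup_steps = 200_000, 5_000

optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4,
                              betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)

def lr_lambda(step):
    # Linear warm-up for the first 5,000 steps, cosine decay afterwards.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)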
2 https://github.com/attardi/wikiextractor


Evaluation Datasets. The effectiveness of vision-guided language modeling frame-


work is evaluated across two categories of language tasks: (1) semantic relatedness
and (2) general language understanding. The subsequent sections provide detailed
explanations of these tasks and an overview of the evaluation datasets employed for
each.

(a) Semantic Relatedness: Semantic relatedness measures the closeness or prox-


imity of two text segments in terms of their meaning or semantic content. For
two sentences, s1 and s2 along with a human-provided label r indicating the
semantic relationship between them, we compute the Pearson correlation be-
tween r and the model’s prediction of the relatedness score between the two
sentences. This task is critical for assessing a model’s capacity to comprehend
the “meaning” of language, beyond mere syntactic structure.

We use the Semantic Textual Similarity (STS) dataset curated from news ar-
ticles, online forums, etc., for evaluating our models. STS [2] has several cuts:
STS12, STS13, STS14, STS15 and STS16. We test on all these versions includ-
ing the STS-Benchmark (STS-B) [20] which is a selection of samples in the five
prior STS datasets. Each pair of sentences (s1 , s2 ) in STS is associated with
a rating r ∈ [0, 5] obtained from a human being through Amazon Mechanical
Turk3 . Rating r = 5 corresponds to two sentences that are completely equiva-
lent in meaning, e.g., “the child is bathing in the sink” and “the child is washing
himself in the water basin”. On the other extreme, r = 0 means the two sen-
tences are about different topics and uncorrelated. For example, “Jim played
chess and won several championships in his youth” and “It is really dark and
cold outside tonight”.

In addition to STS12-16 and STS-B, we use the Sentences Involving Compositional
Knowledge Relatedness (SICK-R) dataset [117] for Semantic Relatedness.

3 https://www.mturk.com/


Like STS, SICK-R has pairs of sentences that contain diverse lexical, syntactic,
and semantic phenomena, and a similarity rating r ∈ [1, 5] per pair.

We conduct zero-shot evaluations on STS and SICK-R. Thus, we evaluate the
models directly on these datasets after pre-training without any task-specific
fine-tuning. Pearson correlation between human ratings and the model’s predictions
is used to assess performance (a minimal sketch of this evaluation procedure is
given after this list).

(b) General Language Understanding: We utilize the General Language Un-


derstanding Evaluation (GLUE) benchmark [160] to evaluate the performance
of our network on various general language tasks beyond semantic textual sim-
ilarity. This benchmark encompasses tasks ranging from sentiment analysis to
textual entailment and natural language inference. Previous studies have noted
that GLUE datasets with a limited number of examples can lead to unstable
results, complicating model comparisons [154]. As a result, we concentrate on
four tasks identified in [154, 105] that provide a sufficient number of examples
to enable stable evaluations.

(i) The Stanford Sentiment Treebank (SST-2) is a sentiment prediction task.


Each entry in this dataset is a movie review and a human-annotated sentiment
{positive or negative} of that review. Given a review, the model predicts
whether the sentiment is positive or negative.

(ii) Question-Answering Natural Language Inference (QNLI), drawn from the


Stanford Question-Answering dataset, consists of question-sentence pairs
where the task of the model is to predict whether the sentence in each pair
contains the answer to the associated question.

(iii) The Multi-Genre Natural Language Inference (MNLI) is used to test natu-


ral language entailment. This dataset is a collection of premise-hypothesis


pairs of sentences. The task of the model is to predict whether the premise
is neutral to the hypothesis, contradicts the hypothesis, or entails the hy-
pothesis.

(iv) The Quora Question Pairs (QQP) is a question similarity task comprising
pairs of questions sourced from Quora4. Each pair of questions is associated
with a human-labeled similarity score r ∈ {1, 2, 3, 4, 5}, and the task of the
model is to predict the score given a question pair.

Unlike the approach adopted for semantic relatedness tasks, where zero-shot
evaluations are conducted, we fine-tune on each GLUE task using the respec-
tive standard training dataset prior to evaluation on a separate test set. As
illustrated in Figure 3.2, the fine-tuning process is applied exclusively to the
language model F. Typically, fine-tuning is executed with a batch size of 64
and a sequence length of 128. The learning rate for this procedure is configured
to 5 × 10−5 . We use the AdamW [109] optimizer and fine-tune for 10 epochs.
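As referenced in the semantic relatedness description above, the zero-shot evaluation reduces to scoring each sentence pair with the pre-trained model and correlating those scores with the human ratings. A minimal sketch is shown below; the encode helper, which is assumed to return one embedding per sentence (e.g., the class-token representation), is an illustrative interface rather than our exact evaluation code.

import numpy as np
from scipy.stats import pearsonr

def evaluate_relatedness(encode, sentence_pairs, human_ratings):
    # Embed both sides of every pair.
    s1 = encode([a for a, _ in sentence_pairs])   # (num_pairs, dim)
    s2 = encode([b for _, b in sentence_pairs])   # (num_pairs, dim)

    # Cosine similarity of each pair is the model's relatedness prediction.
    s1 = s1 / np.linalg.norm(s1, axis=1, keepdims=True)
    s2 = s2 / np.linalg.norm(s2, axis=1, keepdims=True)
    predictions = (s1 * s2).sum(axis=1)

    # Pearson correlation with the human ratings is the reported metric.
    corr, _ = pearsonr(predictions, np.asarray(human_ratings))
    return corr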

Section 3.4
Main Results and Analysis

This section summarizes the results of our comparative evaluations between vision-
guided language training and the baseline model on semantic relatedness and general
language understanding. We name the baseline model M askLM in our experiments
since it is trained with only the mask language modeling objective. The datasets,
models, and hyper-parameters used in these experiments are detailed in Section 3.3.
4 https://www.quora.com/


Table 3.1: Zero-shot Semantic Relatedness Results: The results here show
clearly that our guided model is far superior to the unguided model on
semantic relatedness tasks across all datasets and model sizes. B, M and S
respectively stand for “Base”, “Medium” and “Small” configurations of the
language model. We employ the same image encoder in all experiments
here.

Model STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG


MaskLM-B 34.94 32.10 35.17 42.36 36.23 36.50 47.39 37.82
VGLM-B (Ours) 47.99 54.54 54.33 65.74 54.92 64.78 60.35 57.52
MaskLM-M 26.73 32.29 31.53 36.75 32.80 34.18 49.72 34.86
VGLM-M (Ours) 49.46 56.78 54.94 66.58 56.47 65.11 62.53 58.84
MaskLM-S 45.47 39.88 41.00 51.57 39.25 44.47 56.60 45.46
VGLM-S (Ours) 48.57 57.49 57.74 69.06 58.75 66.76 62.79 60.17

3.4.1. Zero-shot Semantic Relatedness


We conduct an extensive set of experiments to evaluate the efficacy of our proposed
approach in zero-shot semantic similarity tasks on the STS and SICK-R datasets
described in Section 3.3. As illustrated in Table 3.1, the introduction of visual in-
formation during the pre-training phase of the language model significantly enhances
performance across all downstream datasets. Specifically, for the “Base” model con-
figuration, the Vision-Guided Language Model (VGLM) exhibits an improvement
of 12.98 points in alignment with the ground truth on the SICK-R compared to
the Masked Language Model (MaskLM). On STS-B, our model achieves an average
increase in correlation exceeding 19 points over the text-only model. Similar im-
provements are observed in the “medium” and “small” configurations of the language
transformer5 . This indicates that the observed benefits are not only generalizable
to various downstream datasets but are also applicable across different model sizes.
As expected, due to the limited size of the MS-COCO dataset used for pre-training,
smaller models tend to outperform larger models in this context.
5 See architecture specifications at https://huggingface.co/google/bert_uncased_L-12_H-768_A-12


Table 3.2: General Language Understanding (GLUE): On average, incorporating image
representations through vision-guided learning marginally increases general
language understanding.

Model SST-2 QNLI QQP MNLI AVG


MaskLM-B 82.34 63.21 77.29 66.71 72.39
VGLM-B (Ours) 82.11 63.58 79.48 66.10 72.82
MaskLM-M 81.88 58.38 77.97 66.20 71.11
VGLM-M (Ours) 80.96 60.36 78.06 64.04 70.86
MaskLM-S 84.17 60.14 77.47 65.85 71.91
VGLM-S (Ours) 80.96 60.10 78.03 64.04 70.78

3.4.2. General Language Understanding Evaluation (GLUE) Results


We demonstrate the general natural language processing capabilities of vision-guided
language models using the GLUE benchmark across four specific tasks as outlined
in [154, 105], with detailed descriptions provided in Section 3.3.
The performance gains from vision-guided learning in general language under-
standing appear relatively modest when compared to the benefits observed in se-
mantic relatedness tasks. As indicated in Table 3.2, the vision-guided model, on
average, exceeds the performance of the unguided model by 0.43% when pre-training
on MS-COCO. In tasks such as question-answer inference and question pair similar-
ity, the guided model demonstrates slightly superior performance to the unguided
model. However, it shows inferior performance relative to the plain-language model
in sentiment analysis and language entailment tasks. These results suggest that tasks
characterized by subjective interpretation such as sentiment analysis and, to a lesser
extent, language entailment are less amenable to visual information inclusion during
pre-training. In contrast, relatively more ‘factual’ tasks like text inference and se-
mantic relatedness seem to derive greater benefit from the visual information. The
relatedness patterns where we find visual information helpful are illustrated in
Figure 3.1 and explained in greater detail in Section 3.4.3.

Figure 3.3: Semantic Relatedness Correlation: These plots show that the baseline
model (left, MaskLM, correlation coefficient 0.44) assigns a high similarity
score to a majority of question pairs compared to our method (right, VGLM,
correlation coefficient 0.65), whose predictions correlate better with the
human-annotated scores. Both panels plot the predicted score against the
ground-truth score.

3.4.3. Discussion
As presented in Table 3.1, vision-guided models are markedly more effective relative
to the text-only model on semantic relatedness. In this section, we conduct an in-
depth analysis of the raw predictions produced by both models to understand the
results at a more granular level. First, we plot the relatedness score from each model
against the human-annotated scores as depicted in Figure 3.3 using examples from
the Quora Question Pairs (QQP) dataset. The results show that the unguided model
(Figure 3.3, left) tends to perceive all question pairs as highly similar even in cases
where the human ratings are notably low (e.g., a score of 1). In contrast, the vision-
guided model (Figure 3.3, right) competently distinguishes between question pairs
with differing meanings and those that are semantically equivalent.


In addition to evaluating raw scores, we perform a qualitative analysis by examin-


ing individual question pairs alongside the corresponding predictions from the models.
This analysis identifies three specific scenarios in which visual guidance proves bene-
ficial.

(a) Synonym replacement is an inherent and potent feature of natural language that
enables individuals to employ different words without modifying their
semantic intent. This capability enriches, personalizes, and diversifies communi-
cation. However, our pre-trained Masked Language Model (MaskLM) exhibits
limitations in recognizing synonymous words6 . For instance, it incorrectly pre-
dicts that the questions “what is the function of a resistor in a circuit?” and
“what is the purpose of a resistor in a circuit?” convey different meanings. The
addition of visual cues during pre-training ameliorates this issue, allowing the
model to correctly interpret these questions as equivalent.

(b) Keyword identification is crucial for comprehension and appropriate response


generation in any communication. We observe that training without guidance
seems to lead to challenges in effectively identifying keywords. For example, the
questions “what are the chemical properties of ammonia?” and “what are the
chemical properties of carbon?” demand fundamentally different answers due
to their distinct focal points, despite both inquiring about chemical properties.
Our Masked Language Model erroneously classifies these questions as identical.
In contrast, the Vision-Guided Language Model correctly differentiates between
them as expected.

(c) Semantic understanding involves interpreting the meaning and intent underlying
textual data. Without a semantic framework, comprehension and effective
communication would be unattainable. We have identified instances in which
the unguided model misclassifies question pairs, likely due to its inability to
discern their intended meanings. For instance, the model erroneously assesses
the questions “what are the reasons behind recurring dreams?” and “what are
some of your recurring dreams?” as identical, despite their distinct semantic
intents; the former seeks causal explanations, while the latter requests personal
experiences. In this case, the vision-guided model produces an accurate classification.

6 We do not claim language models in general lack this ability. Super-massive and highly advanced language models trained with human supervision such as ChatGPT and Gemini can identify synonyms and antonyms.

Table 3.3: Ablation Results on the Guidance Loss: The results in this table show
that other objectives such as soft-distillation from [58] and mean squared
error are suboptimal compared to the unidirectional contrastive loss used
in our work.

Guidance Loss      STS12  STS13  STS14  STS15  STS16  STS-B  SICK-R  AVG
MSE-Loss           50.8   35.17  43.41  58.98  44.56  55.66  59.27   49.70
Soft-Distillation  48.99  35.24  43.08  58.46  42.85  52.89  60.60   48.87
InfoNCE            47.99  54.54  54.33  65.74  54.92  64.78  60.35   57.52

3.4.4. Ablation Study


We conduct experiments to understand the impact of different components in our
method including (1) the guidance loss weighting parameter β, (2) the formulation
of the vision-guidance loss, and (3) the pre-trained image encoder used for visual
guidance. These ablations are conducted using the MS-COCO dataset.

Vision-guidance loss. In addition to the contrastive loss explained in Section 3.2,


we experiment with the mean-squared error (MSE) and a soft-distillation [58] loss in
this ablation study. As shown in Table 3.3, both MSE and soft-distillation demon-
strate inferior performances compared to the InfoNCE loss. The use of MSE loss
drives the language embeddings to look like the target image embeddings. However,

this is suboptimal because creating “image embeddings” from text destroys the special
characteristics of textual data. Also, since the concept of “classes” or “categories” is
absent in our pre-training framework, soft-distillation, which relies heavily on class
probabilities, is not an ideal fit.

Figure 3.4: Semantic Relatedness versus β (20K pre-training steps on MS-COCO; the
y-axis shows the mean zero-shot STS Spearman score): β = 0 corresponds
to the model with a Mask LM loss only, without any vision guidance. As
clearly depicted here, including the vision-guided loss (even for β = 0.5) is
highly impactful for semantic relatedness. However, β > 1 does not have a
substantial impact on the relatedness score. We use β = 1 as the default
setting.

Vision-guided loss weighting factor β. The weighting factor dictates the extent
of the model’s focus on learning the correspondence with visual representations. High
values of β signify an increased emphasis on minimizing the visual correspondence loss
relative to the masked language modeling loss. A setting of β = 0 corresponds to the
absence of visual guidance. Figure 3.4 presents results indicating that β > 1 does not
significantly enhance performance compared to β = 1. Nevertheless, consistent with
the observations in Section 3.4.1, including an image correspondence loss contributes
to improvements in semantic textual similarity.


Table 3.4: Impact of Pre-trained Image Encoder Ablation: The results here suggest
that the specifics of the image encoder are not differentiating factors.

Image Encoder G STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG
IN-22k-ViT-B/32 44.43 53.97 54.53 64.84 56.15 63.51 62.03 57.07
MC-2.5b-ViT-B/32 45.21 52.57 54.18 65.61 57.70 64.77 61.41 57.35
MC-400m-ViT-B/32 47.99 54.54 54.33 65.74 54.92 64.78 60.35 57.52

Impact of Pre-trained Image Encoder. The information capacity of the image


encoder G is crucial as the language model derives some knowledge from it, directly
influencing the learning process. In our experiments, G is implemented as the “Base”
version of MetaCLIP [63], which is based on the ViT-B/32 architecture [41]. This
particular version of MetaCLIP has been trained on a specially prepared dataset
consisting of approximately 400 million image-text pairs. In our ablation studies in
Table 3.4, this model is denoted as MC-400m-ViT-B/32. We compare the perfor-
mance of this model with two other image encoders: (1) IN-22k-ViT-B/32, an image
encoder trained on the 22,000-class ImageNet-22k dataset. Including this model
allows one to examine the effects of a supervised image model as compared to an
encoder derived from a vision-language model. (2) MC-2.5b-ViT-B/32, a MetaCLIP
model trained on 2.5 billion image-text pairs, which is included to assess the impact
of using an image encoder trained on a more expansive dataset.
Overall, our analysis does not reveal any significant differences arising from using
different pre-trained image encoders. As indicated in the semantic relatedness results
presented in Table 3.4, our default model slightly outperforms the other image models,
although performance varies across different downstream datasets. We hypothesize
that the relatively limited size of the MS-COCO dataset used during pre-training
might be insufficient to tease out notable distinctions attributable to a specific image
encoder.


Figure 3.5: Architecture of Text Encoder Guidance: We replace the image encoder in
Figure 3.2 with a multimodal text encoder. This alteration removes the
need for image-text datasets which are difficult to gather in large quantities.

3.4.5. Pre-training with Guidance from Multimodal Text Encoder


A limitation of the vision-guided learning framework described in Section 3.2 is its
dependence on paired image-caption datasets, which are often challenging to curate in
large quantities at high quality. As a result, many vision-language
models reliant on such data utilize web-crawled alt-text examples. However, the
texts in these datasets are typically brief, grammatically imperfect, and generally of
suboptimal quality. Furthermore, the format of texts found in web-crawled datasets
deviates from that of conventional natural language, presenting additional challenges
for model training and application.
In our experiments, we used the MS-COCO dataset, which provides human-
annotated image captions of superior quality compared to the alt-texts found in
datasets like CC3M and CC12M. However, the MS-COCO dataset is relatively small,
particularly given the demands of contemporary language models. Furthermore, re-


cent trends in machine learning indicate that reliance on human-annotated instances


is not scalable, primarily due to the associated costs.
In this section, we investigate a method to bypass the requirement for concrete
images in the training of guided language models and instead exploit image asso-
ciations. We generate text representations from a joint vision-language embedding
model as proxies for image embeddings. This alteration allows us to train models on
large text-only corpora, such as the English Wikipedia.
As illustrated in Figure 3.5, the key change we make compared to the model in
Figure 3.2 is replacing the image encoder with a text encoder from a vision-language
model. This modification is based on two key insights: (1) Through extensive expo-
sure to images during joint-embedding training, the text encoder from the vision-
language model gains access to valuable implicit patterns related to images that can
be effectively leveraged. (2) Using text representations enriched with implicit visual
information may be more advantageous for language model training than employing
raw visual content, as it avoids the challenges associated with processing different
modalities.
We label this model as the Multimodal Text Guided Language Model (MT-
GLM). We evaluate MT-GLM on the MS-COCO, Wiki-103, and English Wikipedia
datasets. Informed by insights from preliminary experiments, the models trained on
MS-COCO undergo pre-training for 20,000 steps, whereas those on Wiki-103 and
English Wikipedia are trained for 100,000 and 200,000 steps, respectively.
We use the text encoder from the “Base” version of MetaCLIP, which has been
trained on 400 million examples. However, due to the substantial size of the datasets
utilized in these experiments, it is not feasible to pre-compute embeddings for entire
datasets, as we did with MS-COCO. Instead, we dynamically retrieve the multimodal
text embeddings corresponding to each mini-batch in every iteration. Accordingly,

this approach results in a relatively slower training process compared to the unguided
models.

Table 3.5: Semantic Relatedness Results from Multimodal Text Encoder Guidance:
Similar to the results in Table 3.1, including the guidance loss is beneficial
for semantic relatedness.

Model          STS12  STS13  STS14  STS15  STS16  STS-B  SICK-R  AVG
(MS-COCO, 20 thousand steps)
MaskLM         34.94  32.10  35.17  42.36  36.23  36.50  47.39   37.82
MT-GLM (Ours)  61.03  50.85  52.52  63.04  55.50  60.98  67.76   58.81
(Wiki-103, 100 thousand steps)
MaskLM         43.92  54.70  46.65  55.76  62.19  42.59  52.93   51.25
MT-GLM (Ours)  62.23  63.82  58.58  73.43  71.69  64.55  63.51   65.40
(English Wikipedia, 200 thousand steps)
MaskLM         52.06  67.50  57.56  69.53  68.02  60.73  65.74   63.02
MT-GLM (Ours)  62.96  65.50  63.31  77.19  71.62  70.44  64.44   68.22
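The dynamic, per-batch computation of guidance targets from the frozen multimodal text encoder can be sketched as follows. The tokenizer and encoder callables are assumed interfaces (e.g., the text tower of MetaCLIP loaded through a HuggingFace-style tokenizer); the sketch is illustrative rather than our exact implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def batch_guidance_targets(mm_text_encoder, tokenizer, captions, device):
    # Tokenize the raw captions of the current mini-batch.
    inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    # The frozen multimodal text encoder returns one embedding per caption;
    # these replace the pre-computed image embeddings used with MS-COCO.
    targets = mm_text_encoder(**inputs)
    return F.normalize(targets, dim=-1)

Because these targets are recomputed in every iteration, each step pays the cost of an extra frozen forward pass, which accounts for the slower training noted above.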

Zero-shot Semantic Relatedness Results. As demonstrated in Table 3.5, our


proposed model consistently outperforms the baseline language model across all pre-
training datasets. This finding is consistent with the results presented in Section 3.4,
although the performance of both models is higher here due to training on larger
datasets. These results are encouraging and underscore the effectiveness of
multimodal text encoders in enhancing language model training.

General Language Understanding (GLUE) Results. As shown in Table 3.6,


our model achieves competitive results comparable to the baseline model across all
dataset sizes and tasks on GLUE. Increasing the pre-training dataset size from MS-
COCO captions to English Wikipedia leads to performance improvements across all
models and tasks. That notwithstanding, our method consistently outperforms the
MaskLM pre-trained model. Interestingly, using the multimodal text encoder instead

of the image encoder yields a higher average GLUE performance when MS-COCO is
the pre-training dataset (73.63% versus 72.86%).

Table 3.6: Transfer Learning Results on GLUE: Compared to the baseline model, our
method improves average GLUE results across all pre-training datasets.
Increasing the pre-training dataset size tends to reduce the performance gap
between our model and the baseline.

Model          SST-2  QNLI   QQP    MNLI   AVG
(MS-COCO, 20 thousand steps)
MaskLM         82.34  63.21  77.29  66.71  72.39
MT-GLM (Ours)  82.80  65.81  79.05  66.85  73.63
(Wiki-103, 100 thousand steps)
MaskLM         91.51  85.36  84.60  77.44  84.73
MT-GLM (Ours)  91.51  85.94  86.95  78.24  85.66
(English Wikipedia, 200 thousand steps)
MaskLM         91.51  88.51  87.18  79.97  86.79
MT-GLM (Ours)  91.97  87.63  87.50  81.07  87.04

3.4.6. Is the Visual Information Necessary?


The experiments conducted thus far have demonstrated that pre-training the language
model to correlate with pre-trained embeddings via contrastive learning enhances
performance. Explicit visual information derived from images and implicit visually
associated representations obtained from multimodal text encoders have been shown
to be highly effective within the proposed guidance framework. However, a crucial
question remains: Is it necessary to employ a pre-trained image encoder or a text
encoder from a joint-embedding model? It is conceivable that the newly introduced
loss function is the essential component or that the guidance representations only
help to reduce overfitting and regularize the model. Should this be the case, any
pre-trained text model, such as BERT, could be as effective as a multimodal text
encoder within our method.


Table 3.7: Is Visual Information Actually Helpful?: We show that using an ordinary
pre-trained language model such as BERT [39] for the guidance representations
is actually detrimental to performance relative to the unguided model. These
models were pre-trained on the Wiki-103 dataset for 100 thousand steps using
a batch size of 1024 and sequence length 32.

Pre-trained Model STS12 STS13 STS14 STS15 STS16 STS-B SICK-R AVG
None 36.40 44.13 37.29 56.86 49.80 33.99 54.89 44.77
BERT-Base 27.64 41.63 27.69 45.48 48.04 22.90 49.99 37.62
MetaCLIP-Base (Ours) 61.25 58.04 57.14 74.46 70.53 69.08 62.73 64.75

This section details experiments designed to determine whether any pre-trained


language model can substitute for the multimodally trained text encoder. Specifi-
cally, we replace the MetaCLIP text encoder with a pre-trained BERT model and
present the resulting data in Table 3.7. Compared to the baseline model, the guid-
ance representations derived from BERT result in a performance decline of an average
of 7 points in semantic relatedness tasks. However, representations from MetaCLIP
enhance performance by 19.98 points. These results categorically underscore the
significance of explicit visual information (as demonstrated in Section 3.4.3) or im-
plicit visual information in the form of visually associated text representations, as
illustrated in Section 3.4.5.

Section 3.5
Implications & Limitations

Our study on vision-guided language models indicates that incorporating perceptual


cues into the pre-training of language models, specifically through the application of
an alignment loss, can enhance performance across a range of natural language pro-
cessing tasks, particularly in terms of semantic understanding. Moreover, the research
demonstrates that substituting the image encoder with the text tower from a vision-
language model serves as an effective alternative for achieving language grounding.


These findings suggest two principal implications:

(1) Pre-training language models with visual information significantly enhances se-
mantic understanding. Our analysis identifies areas, e.g., synonym compre-
hension and keyword identification, through which visual information typically
contributes positively to performance (see Section 3.4.3). Furthermore, we hy-
pothesize that some of the observed improvements in semantic relatedness tasks
can be attributed, in part, to an enhanced ability to resolve ambiguous sen-
tences. In instances where a question or sentence possesses multiple interpre-
tations, the inclusion of visual information as supplementary context can serve
to disambiguate meaning.

(2) There is no necessity for explicit images in experiments involving language


grounding. Language models can therefore be trained at scale using grounding
information derived from text datasets, leveraging the weakly learned associ-
ations intrinsic to vision-language models. This approach addresses the chal-
lenges associated with the curation of extensive image-text datasets for language
grounding experiments. Additionally, it eliminates the modality gap between
image content and visual embeddings during language grounding and also re-
solves the distributional discrepancies between image captions and free-form
text.

These implications emphasize the critical need for ongoing investigation into lan-
guage grounding strategies. Grounded language models parallel the linguistic acquisi-
tion processes observed in human cognition and may serve as a conduit for enhanced
cross-lingual learning and “true” linguistic understanding in AI systems. Nevertheless,
this research is subject to certain limitations, which are highlighted below.

(1) While our findings indicate that the incorporation of visual information during


the pre-training phase enhances performance on subsequent natural language


processing tasks, we were unable to quantitatively establish a direct relationship
between the quantity of image information integrated into the model and the
resultant improvement in a specific language task. Such results necessitate
experiments grounded in scaling laws, which we were not able to conduct.

(2) We demonstrated the efficacy of vision-guided language modeling exclusively on


the BERT model [39]. Further illustrating the impact of the vision-guided loss on
generative language models, such as GPT-2 [138], would underscore the
generalizability of this approach. Moreover, it would facilitate broader adop-
tion, given that recent advancements in language modeling research have largely
converged on the text token prediction objective.

Section 3.6
Summary

In this chapter, we introduced Vision-Guided Language Models (VGLM), an ap-


proach to guided language modeling utilizing image features. Extensive experiments
demonstrated that incorporating a guidance loss derived from visual features enhances
the performance of language models across various datasets. However, a limitation
of the VGLM approach is its reliance on image-text datasets for pre-training, which
are typically costly to curate in large quantities. To address this challenge, we de-
veloped the Multimodal Text Guided Language Models (MT-GLM) by substituting
the image encoder with a text encoder trained within a multimodal framework. This
approach enabled the pre-training of the language model on text corpora of varying
sizes, resulting in consistent improvements in outcomes. Ablation experiments fur-
ther validated the critical role of employing a multimodal text encoder compared to
a conventional language model in the guided language model setup.


Having established the pivotal role of multimodal models in vision-guided train-


ing within this chapter, the next chapter, Chapter 4, will introduce a methodology
for making the training of vision-language models more data efficient through data
compositions. This innovative approach proves particularly effective in domains char-
acterized by a scarcity of image-text datasets.

4
CLIP-C: Semantic Compositions For Data-Efficient Vision-Language Models

In Chapter 3, we presented a framework for leveraging vision-language models to im-


prove guided language training. Building upon that work, this chapter introduces
a method for effective vision-language models in contexts where paired image-text
data is limited. Specifically, we propose the use of semantic compositions for vision-
language model pre-training following CLIP [136], resulting in improved data effi-
ciency across models of varying scales.

Section 4.1
Overview

Recent advancements in vision-language pre-training have significantly enhanced per-


formance across a range of tasks, encompassing zero-shot image classification [136, 70,
104], video understanding [170, 182], and diverse multimodal applications [86, 173,
123]. These achievements echo the transformative impact initiated by large-
scale pre-training endeavors in the fields of Computer Vision [84, 55] and subsequently
in Natural Language Processing [39, 137, 138, 15]. A prominent recent example in
the vision-language interplay is the Contrastive Language-Image Pre-training (CLIP)


model [136], which set a benchmark for vision and language joint embedding mod-
els. CLIP adopts contrastive learning capitalizing on matched image-caption pairs
as positive examples and non-matching pairs within the batch as negative examples.
This method has yielded impressive results across numerous tasks. Nonetheless, a
downside of the CLIP model is its need for large-scale datasets of image-text pairs,
which poses accessibility challenges for researchers in low-resource settings. Follow-
up research efforts to enhance the data efficiency of the CLIP model have proposed
supplementary objectives [121, 104] or the generation of additional captions using
large language models [44, 87, 182]. However, these approaches often require sup-
plementary computational procedures, such as multiple forward passes or the use of
additional encoders, which are resource-intensive. Our work in this chapter proposes
a technique to develop effective vision-language models in scenarios where image-text
data is scarce.
The method we propose is based on creating semantically composite examples
to pre-train vision-language models. This approach, termed CLIP-C (for CLIP
Compositions), involves merging captions and blending images from two distinct
image-caption instances within the dataset to create a new composite example. This
straightforward procedure introduces no additional computational overhead or model
parameters, similar to the CutMix [175] data augmentation method in the domain of
vision categorization tasks.
We dynamically broaden the semantic challenges encountered by the model by
adding compositions of two distinct image-caption pairs in each iteration. CLIP-C
implements the composition by conjoining the captions and merging the central crops
of both images. The two image-caption pairs constituting the composite instance are
sampled randomly in each iteration based on a predefined probability, empowering
the model to uncover novel combinations of examples throughout training.


Distinct from stylistic variation approaches that manipulate individual examples,


CLIP-C utilizes “semantic composition” to introduce contextually varied training in-
stances. Interestingly, our findings suggest that the advantages of CLIP-C stem from
factors beyond the diversity of these new semantic associations. Counterintuitively,
we observed that the model more readily learns to associate the compounded image-
caption pairs than the unmodified ones. Consequently, this creates a positive spillover
effect, whereby the model develops more robust representations of the plain, unmod-
ified examples in CLIP-C relative to the traditional CLIP approach.
In contrast to the approach by Fan et al . [44], our method enhances data efficiency
without any dependence on external systems, such as large language models. This
allows for the efficient and flexible generation of novel captions and images during the
training process. Furthermore, our approach does not increase the batch size or the
number of iterations relative to the baseline, thereby maintaining operational parity
with CLIP. Indeed, we demonstrate that extending the training duration of CLIP or
increasing its batch size is insufficient to bridge the performance disparity between
CLIP and CLIP-C.
We believe that CLIP-C will be particularly advantageous in contexts where image-
text datasets are either scarce or not readily accessible. However, conducting com-
prehensive evaluations in such domains is inherently challenging due to the absence
of established benchmarks of images with captions. Consequently, we employ eval-
uations on medium-sized, web-derived datasets to illustrate the potential efficacy of
our approach.
In downstream applications, CLIP-C exhibits a competitive advantage, exceed-
ing CLIP by over 5% in zero-shot cross-modal retrieval accuracy on the Flickr30k
dataset [172] and achieving substantial improvements on MS-COCO [106] in both
image-to-text and text-to-image retrieval tasks. Furthermore, our model exhibits
significant enhancements in zero-shot image classification performance, obtaining a 2%
accuracy gain on ImageNet [146] without using any additional model parameters,
memory, or external language processing systems. Even when evaluated on relatively
large datasets such as CC12M [21] and RedCaps [38], CLIP-C continues to outperform
the baseline CLIP in cross-modal retrieval and zero-shot classification tasks, albeit
with decreased margins.

Figure 4.1: CLIP-C: We use the center half crops spanning the width (as in this
illustration) or the height of the image. The captions are concatenated
with the delimiter “and”. We vary the positions of the captions on either
side of the conjunction, i.e., the output caption can be either (a)
{caption1 and caption2} or (b) {caption2 and caption1}. We emphasize
that only a fraction of the batch in each iteration constitutes composite
samples. The colored boxes and texts shown here are for illustrative purposes.
In Section 4.5.6, we present experimental results that demonstrate the applica-
bility of CLIP-C to other frameworks, specifically improving SLIP [121] by 1.4% in
ImageNet zero-shot classification accuracy. Similarly, our method exhibits superior
compositional understanding compared to CLIP, as evidenced by evaluations on the
SugarCrepe dataset [61], particularly in the contexts of object replacement and ad-
ditions. These results underscore the generalizability and utility of our approach.
Finally, we analyze the factors contributing to the effectiveness of CLIP-C in exten-
sive ablation studies in Section 4.5.


Section 4.2
Technical Approach

This section covers a background of the baseline method as well as the core compo-
nents of CLIP-C’s framework, depicted in Figure 4.1.

4.2.1. Background

Contrastive Language-Image Pre-training (CLIP) from Radford et al . [136] has emerged


as a highly successful approach for training vision-language models. CLIP is a dual
encoder model with separate encoders fI and fT for extracting visual and textual
features respectively. It also has two dedicated projection functions gI and gT that
map the outputs of the encoders to a shared embedding space. Given a batch of B
n oB
(i) (i)
images and text pairs xI , xT in each training step, CLIP computes the embed-
   i=1   
(i) (i) (i) (i) (i)
dings zI = gI fI xI and zT = gT fT xT where zI ∈ Rd represents the
(i) (i)
normalized features of image xI . zT ∈ Rd denotes the normalized features of the cor-
(i)
responding caption xT . The loss is evaluated using InfoNCE [158] whereby matching
(i) (i)
image-text pairs {xI , xT } constitute the positive samples and non-matching pairs
(i) (j)
{xI , xT } ∀j ̸= i form the negative examples. A bidirectional loss is computed as

  
(i) (i)
1
B
X exp 1
τ
sim zI , zT
LI2 T = − log P    (4.1)
B i=1
B
exp 1
sim
(i)
zI , zT
(j)
j=1 τ

 

(i) (i)
1
B
X exp 1
τ
sim
zI , zT
LT2 I = − log P    (4.2)
B (k) (i)
k=1 exp τ sim zI , zT
B 1
i=1

where temperature τ is typically a learnable parameter used to scale the logits. τ is


fixed in all of our ablation experiments as it has a noticeable impact on the model [88]

51
4.2 Technical Approach CLIP-C: Semantic Compositions For Data-Efficient Vision-Language Models

which makes comparisons across different experiments difficult. sim(·, ·) is a similarity


function measuring the distance between the features. In CLIP [136] and in our
experiments, sim(·, ·) is set as the dot product function. The total loss is an average
of the two losses in Eq. 4.1 and Eq. 4.2:

L = (LI2 T + LT2 I )/2 . (4.3)
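A minimal PyTorch sketch of the symmetric objective in Equations 4.1-4.3 is shown below; the tensor names and the fixed temperature value are illustrative assumptions.

import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, temperature=0.01):
    # L2-normalize so the dot product acts as the similarity function sim(., .).
    z_i = F.normalize(image_feats, dim=-1)
    z_t = F.normalize(text_feats, dim=-1)
    logits = z_i @ z_t.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(z_i.size(0), device=z_i.device)
    loss_i2t = F.cross_entropy(logits, targets)      # Eq. 4.1
    loss_t2i = F.cross_entropy(logits.t(), targets)  # Eq. 4.2
    return 0.5 * (loss_i2t + loss_t2i)               # Eq. 4.3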

4.2.2. CLIP-C
In each training step, CLIP-C samples a batch of examples of size B, {(x̂_I^{(i)}, x̂_T^{(i)})}_{i=1}^{B}.
Any given paired instance (x̂_I^{(i)}, x̂_T^{(i)}) is either the original example (x_I^{(i)}, x_T^{(i)}) or a
composition of that example and another example (x_I^{(i′)}, x_T^{(i′)}), i ≠ i′, drawn from
the dataset. Note that index i′ is taken with respect to the dataset size and not
the batch size B, i.e., sample i′ may not be present in the current mini-batch. The
proportion of composed samples in any mini-batch is controlled by a sampling rate
hyper-parameter ρ. The impact of this parameter is discussed in Section 4.5.3.

In the case whereby (x̂_I^{(i)}, x̂_T^{(i)}) is a composite sample, the new caption x̂_T^{(i)} is a
concatenation of the two original captions involved: x̂_T^{(i)} = [x_T^{(i)}, x_T^{(i′)}], where [·, ·] is a
string concatenation function with the word “and” as a conjunction. The positions of
the captions on either side of this conjunction change, with x_T^{(i)} appearing first fifty
percent of the time.

The new image is composed of the center half crops spanning either the height or
the width of each image. For example, if the images have resolution (S × S), either
(S/2 × S) or (S × S/2) center crops are taken from both images and concatenated together,
as illustrated in Figure 4.1. We experiment with other forms of image augmentation
methods such as MixUp [179] and CutMix [175] in Table 4.7.

After assembling the mini-batch as described above, CLIP-C proceeds to extract
the image and text features as in CLIP: ẑ_I^{(i)} = g_I(f_I(x̂_I^{(i)})) and
ẑ_T^{(i)} = g_T(f_T(x̂_T^{(i)})). With ẑ_I^{(i)} and ẑ_T^{(i)} computed, Eq. 4.1, Eq. 4.2, and Eq. 4.3 are used
to compute the InfoNCE loss.
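A minimal sketch of the composition step is given below. The (C, H, W) tensor layout and the helper name are assumptions for illustration; during training, each example in the mini-batch is replaced by such a composite with probability ρ, with the partner example drawn at random from the full dataset.

import random
import torch

def compose_pair(img_a, cap_a, img_b, cap_b):
    # Images are (C, H, W) tensors of the same resolution.
    _, H, W = img_a.shape
    if random.random() < 0.5:
        # Center half-crops spanning the full height (half the width each),
        # concatenated side by side along the width.
        new_img = torch.cat([img_a[:, :, W // 4: 3 * W // 4],
                             img_b[:, :, W // 4: 3 * W // 4]], dim=2)
    else:
        # Center half-crops spanning the full width, stacked along the height.
        new_img = torch.cat([img_a[:, H // 4: 3 * H // 4, :],
                             img_b[:, H // 4: 3 * H // 4, :]], dim=1)

    # Concatenate the captions with "and", flipping the order half the time.
    if random.random() < 0.5:
        new_cap = f"{cap_a} and {cap_b}"
    else:
        new_cap = f"{cap_b} and {cap_a}"
    return new_img, new_cap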
The sampling strategy CLIP-C employs exposes the model to a much higher diver-
sity of images and their corresponding captions compared to the vanilla pre-training
pipeline. As a result, we observe much more significant improvements in downstream
transfer when the pre-training dataset is small. It is reasonably expected that rela-
tively larger datasets such as RedCaps [38] are already sufficiently diverse and, there-
fore, may not benefit from our method. Nonetheless, CLIP-C still does better than
CLIP on these large datasets.

Section 4.3
Experimental Setup

All our experiments use the CLIP framework due to its demonstrated effectiveness,
simplicity, and widespread usage. We emphasize that we do not use pretrained CLIP
checkpoints from prior works as our method is a pre-training mechanism. Thus, we
retrain CLIP on all pre-training datasets and compare it to our approach. Finally,
due to resource constraints, we conduct our experiments in the low data and small
model regimes.

Pre-training Datasets. We use three widely adopted web-crawled datasets of vary-


ing sizes and distributions for our experiments: Conceptual Captions [149], Concep-
tual 12M [21], and RedCaps [38]. These three datasets altogether enable us to assess
the effectiveness of our method across pre-training datasets of different sizes and
qualities.


Models. We use Vision Transformer [41] models of various sizes as in [121]. The
vision encoder is set to ViT-S/16 [156] in all our ablation experiments unless explicitly
specified otherwise. We use ViT-B/16 [41] as the image encoder to demonstrate
the efficacy of our method at scale. The text encoder in all our experiments is
set to the 38M parameter text Transformer model from [136]. Following previous
methods, Byte-Pair encoding is used for tokenization with a context length of 77 and
a vocabulary size of 49k. Finally, we fixed the temperature parameter at 0.01, the
maximum value used in CLIP [136].

Hyper-parameters. We train all models using PyTorch [129] with a global batch
size of 2,048 split across 8 GPUs in a single machine. AdamW [109] is the optimizer
during pre-training. All models are pretrained for 40 epochs using a cosine decay
learning rate schedule with a base rate of 0.003, a warm-up period of 5 epochs, and a
final learning rate of 1e-5. The weight decay parameter is always set to 0.1. Random
cropping is the only augmentation applied to the images during pre-training. The
image resolution is always set to 224 × 224.

Section 4.4
Results and Analysis

This section outlines our key comparisons between CLIP and CLIP-C (our method)
on zero-shot image classification, cross-modal retrieval, and linear probing. First,
however, we explain why our method works.
We perform zero-shot evaluation on several classification benchmarks using class
names and prompts provided by [136, 121]. We test our model on eleven downstream
datasets including ImageNet [146], CIFAR [83], Caltech-101 [45], Oxford Pets [128],
Country211 [136], DTD [31], Sun397 [168], STL-10 [32], RESISC-45 [28], and Eu-

roSAT [56]. Following previous works [121, 44], we use “mean per class accuracy”
as the metric for Oxford Pets and Caltech-101. Accuracy is the metric for all other
datasets. Additionally, we analyze the zero-shot retrieval performance of our method
versus CLIP.

Table 4.1: Zero-shot Image Classification: CLIP-C is our method. CLIP is the model
from [136] trained in our setting. CC3M CLIP-C models use ρ = 0.3 while
CC12M and RedCaps models use ρ = 0.15. Bold numbers are the best in
each dataset and architecture comparison. Each row reports results on
twelve downstream classification benchmarks (the eleven datasets listed
above plus Food-101).

PT Dataset  Method
Vision Encoder: ViT-S/16
CC3M        CLIP     11.6  56.1  22.7  46.9  12.9  10.5  0.6  20.5  77.0  24.5  23.7  18.5
CC3M        CLIP-C   15.1  66.4  26.9  51.9  14.5  14.8  0.7  27.2  84.6  25.4  30.7  20.5
Vision Encoder: ViT-B/16
CC3M        CLIP     13.8  54.8  20.4  49.8  14.9  12.2  0.7  21.9  76.0  22.7  19.6  19.6
CC3M        CLIP-C   15.7  58.0  28.5  50.1  11.4  14.2  0.7  27.8  86.8  26.1  21.3  21.2
CC12M       CLIP     46.9  78.0  43.0  76.2  57.2  19.3  4.8  41.2  89.7  33.8  27.8  37.9
CC12M       CLIP-C   48.1  76.8  44.8  73.5  60.8  21.9  5.0  41.1  90.3  36.2  36.1  38.5
RedCaps     CLIP     78.8  72.8  38.7  72.1  76.0  16.2  6.1  27.5  92.9  36.5  30.9  40.7
RedCaps     CLIP-C   79.0  73.7  42.2  72.1  77.1  18.1  6.6  29.4  94.2  41.1  34.8  41.6

4.4.1. Zero-shot Image Classification


We conduct a thorough study of the transfer learning capabilities of our model in
zero-shot image classification on many downstream benchmarks. Across different
pre-training datasets and model sizes, our method substantially improves over CLIP
in zero-shot classification on multiple downstream benchmarks including ImageNet
as shown in Table 4.1. For ViT-S/16, CLIP-C achieves a 2% top-1 improvement over
the baseline CLIP model on ImageNet while outperforming CLIP on 12 out of 12
downstream datasets. These enhancements are maintained when the vision encoder


Table 4.2: Zero-shot Cross-modal Retrieval. ρ is set to 0.3 for CC3M and 0.15 for
CC12M and RedCaps. Similarly to zero-shot classification, our semantic
composition model is nontrivially better than CLIP on zero-shot retrieval.
Columns report R@1 and R@5 for image-to-text and text-to-image retrieval
on Flickr30k and MS-COCO.

PT Dataset  Method   Flickr30k I→T    Flickr30k T→I    MS-COCO I→T    MS-COCO T→I
                     R@1    R@5       R@1    R@5       R@1    R@5     R@1    R@5
Vision Encoder: ViT-S/16
CC3M        CLIP     35.2   62.3      25.4   49.12     17.3   39.0    13.1   31.2
CC3M        CLIP-C   40.7   70.9      30.6   57.9      21.4   45.6    16.2   36.5
Vision Encoder: ViT-B/16
CC3M        CLIP     36.1   65.1      26.3   52.4      18.6   41.1    13.9   32.8
CC3M        CLIP-C   39.6   69.4      31.2   58.3      22.9   46.7    17.0   37.9
CC12M       CLIP     61.5   87.2      46.1   74.9      36.2   64.2    25.3   49.7
CC12M       CLIP-C   66.0   87.8      49.5   75.6      38.4   65.6    26.4   51.5
RedCaps     CLIP     26.8   51.9      20.5   42.5      24.3   44.8    16.7   35.7
RedCaps     CLIP-C   32.3   57.2      23.6   44.9      27.1   49.2    18.2   38.4

is scaled from ViT-S/16 to ViT-B/16, indicating CLIP-C’s continued effectiveness
over CLIP. We re-emphasize that our approach and CLIP use the same number of
parameters, memory, and computational resources during pre-training. Thus, without
incurring any additional training cost, CLIP-C improves upon the zero-shot capabilities of CLIP,
highlighting the utility of the composite examples.
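For reference, the zero-shot classification protocol itself amounts to embedding one prompt per class name and assigning each image to the most similar prompt. The sketch below is a minimal illustration with assumed encoder and tokenizer interfaces and a single prompt template, whereas the full protocol from [136, 121] averages the embeddings of several templates per class.

import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_encoder, text_encoder, tokenize, images, class_names):
    # Build one text prompt per class and embed it with the text tower.
    prompts = [f"a photo of a {name}." for name in class_names]
    text_feats = F.normalize(text_encoder(tokenize(prompts)), dim=-1)   # (K, d)
    # Embed the images and score them against every class prompt.
    image_feats = F.normalize(image_encoder(images), dim=-1)            # (N, d)
    logits = image_feats @ text_feats.t()                               # (N, K)
    return logits.argmax(dim=-1)                                        # predicted class per image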

4.4.2. Zero-shot Cross-Modal Retrieval


In addition to the zero-shot transfer results, as detailed in Subsection 4.4.1, we also
compare the performance of CLIP-C and CLIP on zero-shot cross-modal retrieval in
Table 4.2. We use MS-COCO [106] and Flickr30k [172] as the downstream bench-
marks for this evaluation. Similar to the zero-shot image classification task, CLIP-C
yields significant improvements over CLIP on both MS-COCO and Flickr30k across
different pre-training datasets and model sizes. When using CC3M as the pre-training

56
4.4 Results and Analysis CLIP-C: Semantic Compositions For Data-Efficient Vision-Language Models

[Figure 4.2: ViT-S/16 pretrained on CC3M. The panels plot training loss and validation
loss (50K examples) versus epoch for CLIP and CLIP-C on plain and composite examples,
and ImageNet zero-shot results versus epoch.]
(a) Pre-training and validation losses (b) ImageNet Zero-shot Results

Figure 4.2: (a) Training and validation losses during pre-training: Counter-
            intuitively, the model learns to match the composite examples faster com-
            pared to the plain instances. (b) CLIP-C vs. CLIP: pre-training CLIP for
            longer than CLIP-C does not close the performance gap; CLIP-C's advantage
            grows as the training duration increases.

dataset, our method outperforms CLIP by over 5% absolute top-1 (R@1) retrieval accuracy
on Flickr30k in both image-to-text and text-to-image retrieval. The improvement on
MS-COCO is about 4% on image-to-text and 3% on text-to-image retrieval.

4.4.3. Discussion
Why is CLIP-C an Effective Method?
As evidenced by the higher downstream zero-shot and retrieval results, it is clear
that CLIP-C is an effective method. To provide further intuitions as to why CLIP-C
works, we present two arguments for the effectiveness of our method based on data
diversity and improved model optimization as evidenced in the training and validation
losses. We expand on these points below starting with data diversity.

(a) CLIP-C exposes the model to more diverse data. It could be argued
that our method sees a lot more examples due to the compositions we employ,
and that this may be the reason for the observed performance improvements. The
empirical results in Table 4.3 and in Figure 4.2b show that this is not the
case. Even after training CLIP longer (Figure 4.2b) or with larger batch size


(Table 4.3), it still underperforms CLIP-C. A CLIP-C model trained with a
batch size of 1,024 outperforms a CLIP model trained with a batch size of
2,048 examples by 1.6% on ImageNet zero-shot accuracy. Similarly, training
CLIP for (1+ρ) times the number of epochs for the CLIP-C model does not close
the performance gap between the two models. (Compare CLIP-52 and CLIP-
C-40 epochs in Figure 4.2b). Indeed, the results in Figure 4.2b highlight the
strong regularization effect of the novel examples in CLIP-C as its superiority
over CLIP becomes stronger as pre-training length increases. These results
suggest that by creating new examples from different captions and images in
each iteration, CLIP-C expands the data distribution the model sees, leading to
better downstream results.

(b) CLIP-C eases contrastive learning for all examples. Contrary to the
expectation that compound examples would be more challenging for the model
(since they are multiple examples condensed into single instances), we observed
precisely the opposite: as shown in Figure 4.2a, the training and validation
losses on composite examples are lower than the losses on plain examples. Our
hypothesis for this empirical observation is that the model is able to easily
recognize compound image-caption pairs as they tend to be structurally different
due to stitching artifacts and other distortions. This also transfers to improved
matching of the plain examples in CLIP-C compared to CLIP as conveyed by
the validation losses in Figure 4.2a (right). We believe this elevated learning
of plain examples is a contributing factor to the superior capabilities of our
method.


Table 4.3: CLIP-C beats CLIP using just half the batch size of the CLIP model.

Method Batch CIFAR-10 CIFAR-100 ImageNet


CLIP 2,048 56.1 22.7 18.5
CLIP-C (Ours) 1,024 67.7 31.1 20.1

Section 4.5
Ablation Study

We ablate the various components of our framework including (1) the sampling prob-
ability ρ, (2) semantic versus stylistic compositions, (3) the impact of stochasticity
in drawing the second example, and (4) the function used for the image composition.
These ablation experiments underscore the importance of using semantically diverse
examples in compositions. They also reveal that while incorporating a proportion
of CLIP-C examples in the mini-batch contributes positively to performance, exclu-
sively using such compositions during training detracts from downstream transfer
capabilities. Finally, they unearth the necessity of generating compound examples
dynamically during training rather than relying on a static set of pre-generated in-
stances. Collectively, these insights affirm the effectiveness of the design principles
underpinning CLIP-C.
All ablation experiments are conducted using CC3M [149] with ViT-S/16 as the
image encoder to minimize cost. Additionally, we present only the zero-shot results
of CIFAR-10, CIFAR-100, and ImageNet for the ablation experiments.
To study the impact of random seeding, we train three models each for CLIP
and CLIP-C on CC3M with three different random seeds and show the results in
Table 4.4 indicating that the zero-shot performances are consistent across different
random initializations.


Table 4.4: Both CLIP and CLIP-C are consistent across three different initializations.

Method CIFAR-10 CIFAR-100 ImageNet


CLIP 57.8 ± 2.36 24.6 ± 1.62 18.6 ± 0.24
CLIP-C 64.7 ± 2.04 27.6 ± 0.53 20.4 ± 0.35

Table 4.5: Semantic Compositions: Using semantically distinct examples is better
than stylistic augmentations.

Method CIFAR-10 CIFAR-100 ImageNet


CLIP 56.1 22.7 18.5
Stylistic 53.7 25.0 19.0
Semantic (Ours) 66.4 26.9 20.5

4.5.1. Why Semantic Compositions?


We call CLIP-C compositions semantic because the new instances are not just stylis-
tically different from the constituent original examples; they are also semantically
different. Thus, it is fair to question whether or not this semantic differentiation is
important in producing the observed favorable results over CLIP. After all, purely
stylistic augmentations that use content from the same examples also increase data
diversity and could yield the same outcomes as our semantic compositions. We in-
vestigate this prospect in this section. We train a model using two augmentations of
the same example instead of two distinct examples. On the image side, two random
crops of the image are taken simulating two instances. For the text, we employ “Easy
Data Augmentation (EDA)” [165] to generate a caption for the second crop while the
first crop uses the original caption. These two stylistically generated examples are
then combined using CLIP-C.
In Table 4.5, it is evident that such stylistic augmentations are sub-optimal com-
pared to the semantic generations we employ in CLIP-C. On ImageNet, the CLIP-
C model achieves a 1.5% higher absolute top-1 accuracy than the stylistic augmentations
model. This suggests that the content of the new instances is important as the model


Table 4.6: Impact of Stochasticity: Dynamic stochastic assignments are more effective
than fixed pairings.

Method CIFAR-10 CIFAR-100 ImageNet


CLIP 56.1 22.7 18.5
Fixed 60.4 28.0 19.7
Dynamic (Ours) 66.4 26.9 20.5

prefers the use of distinct examples in the composition. We also note that just in-
creasing the diversity of examples is helpful as the stylistic augmentations method
yields a 0.5% zero-shot accuracy gain over CLIP on ImageNet.

4.5.2. Impact of Stochasticity During Sampling


Whenever the CLIP-C composition is activated, the second example is chosen
randomly from the dataset. This allows every image-caption pair to be paired
with any other image-caption pair in the dataset. Moreover, the pairings differ from
one epoch to another, thus uncovering novel combinations of examples throughout
pre-training.
We examine the impact of this dynamic nature of CLIP-C versus using fixed pairs
of examples. To do this, for every example x, we allocate only one other example
x′ that is fixed throughout training. Then, whenever x is involved in a CLIP-C
composition, x′ is used. The results in Table 4.6 suggest that dynamic compositions
lead to better downstream results than fixed compositions. This makes intuitive
sense because in the dynamic case, if a particular composition is unhelpful, there is a
possibility of changing it in subsequent epochs. This possibility does not exist when
the combinations are fixed.

4.5.3. Sampling Probability Rho (ρ)


The probability at which we create a composite sample as opposed to the original
image-caption pair is an important parameter in our method which determines the


[Figure 4.3: ImageNet zero-shot top-1 accuracy of ViT-S/16 pretrained on CC3M as a
function of the sampling probability, with the CLIP baseline shown for reference.]
Figure 4.3: Sampling probability ρ. Our method is very effective when between
10% and 50% of the mini-batch are CLIP-C compositions but performs
poorly when the entire batch is composite instances.

percentage of the mini-batch that are compound instances. When ρ = 0, our method
is identical to CLIP as no composition is performed. On the other extreme, when
ρ = 1, all the examples in each mini-batch are instances of our composition method.
As shown in Figure 4.3, using a small non-zero sampling rate is more effective than
CLIP. However, the performance deteriorates when more than fifty percent of the
mini-batch are these compound image-text pairs. These results indicate that maintaining
a reasonable proportion of the original examples is necessary, likely because the
streamlined, non-contradictory learning signal is significantly reduced when the
majority of the batch consists of compositions. Also, since downstream evaluations do not involve
such compositions, some exposure to examples with uniform semantic content during
pre-training is important for effective transfer.
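To make the role of the sampling probability concrete, the sketch below shows how a single training example might be constructed under this scheme. It is a minimal illustration rather than the exact training code: the helper names, the horizontal center-half crop geometry, and the simple caption join are assumptions.

```python
import random
import torch

def compose_images(img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    # Concatenate the center-half crops of two (C, H, W) images side by side,
    # preserving the spatial size of a single image.
    _, _, w = img_a.shape
    left = img_a[:, :, w // 4: w // 4 + w // 2]
    right = img_b[:, :, w // 4: w // 4 + w // 2]
    return torch.cat([left, right], dim=2)

def compose_captions(cap_a: str, cap_b: str) -> str:
    # Join the two captions into one compound caption (illustrative join).
    return f"{cap_a} and {cap_b}"

def build_example(dataset, idx: int, rho: float = 0.3):
    # With probability rho, pair example `idx` with a randomly drawn partner;
    # otherwise return the plain image-caption pair unchanged.
    img, cap = dataset[idx]
    if random.random() < rho:
        j = random.randrange(len(dataset))  # partner is re-drawn on every visit
        img_b, cap_b = dataset[j]
        return compose_images(img, img_b), compose_captions(cap, cap_b)
    return img, cap
```

Setting rho = 0 recovers plain CLIP training, while rho = 1 makes every example in the mini-batch a composition, which is the regime where performance degrades in Figure 4.3.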

4.5.4. Image Composition Function


In this section, we compare our image mixing method with established systems such
as CutMix [175] and MixUP [179]. When activated, MixUP executes a weighted
pixel-wise summation of the two images, $\hat{x}_I^{(i)} = \omega \cdot x_I^{(i)} + (1 - \omega) \cdot x_I^{(j)}$, with the
weighting factor $\omega$ sampled from the beta distribution $\omega \sim \beta(1, 1)$. CutMix on the


Table 4.7: Our strategy outperforms CutMix [175] and MixUP [179].

Function CIFAR-10 CIFAR-100 ImageNet


MixUP 50.8 22.3 20.2
CutMix 54.2 26.9 20.4
CLIP-C (Ours) 66.4 26.9 20.5

Table 4.8: Modality Involved in Composition: Applying semantic compositions on both
modalities is the most consistently effective method across different downstream
datasets and tasks.

Modality        CIFAR-10   CIFAR-100   ImageNet

Text only       55.2       27.0        21.3
Images only     55.9       23.1        19.4
Text & Images   66.4       26.9        20.5
CLIP            56.1       22.7        18.5

other hand takes a random crop from one of the images and pastes it at the same
spatial location on the other image. The crop’s dimensions are scaled by the value

$\alpha = 1 - \omega$, $\omega \sim \beta(1, 1)$. That is, $H_{cut} = \alpha \cdot H$ and $W_{cut} = \alpha \cdot W$, where $H$ and $W$ are
the height and width of the image respectively. Unlike MixUP, our method preserves
the integrity of each crop and does not paste parts of one image on the other as
in CutMix. Additionally, using the center-half crop of each image guarantees that
substantial portions of both images are represented in the output image. We believe
these characteristics of our method are important as demonstrated by its superior
zero-shot results over MixUP and CutMix in Table 4.7.
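For reference, the two baseline mixing functions can be written compactly as in the sketch below. It follows the descriptions above, but details such as the placement of the CutMix crop are assumptions, since only the crop size is specified in the text.

```python
import torch
from torch.distributions import Beta

def mixup(img_i: torch.Tensor, img_j: torch.Tensor) -> torch.Tensor:
    # Weighted pixel-wise sum of two images, with the weight drawn from Beta(1, 1).
    omega = Beta(1.0, 1.0).sample()
    return omega * img_i + (1.0 - omega) * img_j

def cutmix(img_i: torch.Tensor, img_j: torch.Tensor) -> torch.Tensor:
    # Paste a crop of img_j onto img_i at the same spatial location,
    # with the crop dimensions scaled by alpha = 1 - omega.
    _, h, w = img_i.shape
    alpha = float(1.0 - Beta(1.0, 1.0).sample())
    h_cut, w_cut = int(alpha * h), int(alpha * w)
    top = torch.randint(0, h - h_cut + 1, (1,)).item()
    left = torch.randint(0, w - w_cut + 1, (1,)).item()
    out = img_i.clone()
    out[:, top:top + h_cut, left:left + w_cut] = img_j[:, top:top + h_cut, left:left + w_cut]
    return out
```

Neither function preserves both images intact: MixUP blends every pixel and CutMix overwrites part of one image, which is the contrast with the crop-and-concatenate strategy discussed above.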

4.5.5. Impact of Modality Used in Composition


Since our inputs are of different modalities, visual and textual, it is important to
examine whether compositions in each of these modalities produce similar effects. To
that end, in Table 4.8, we conduct analysis where our method is applied on (1) only
the captions, (2) only the images, and (3) both the captions and images. Of these
three variations, executing the compositions on both the captions and images is the


Table 4.9: CLIP-C is complementary with other methods such as SLIP.

Function CIFAR-10 CIFAR-100 ImageNet


SLIP 50.1 20.9 19.4
SLIP + CLIP-C 60.5 26.7 20.8

Table 4.10: SugarCrepe: Our model outperforms the baseline on compositional rea-
soning, once again demonstrating its extensive capabilities.

REPLACE SWAP ADD


Method
Att Obj Rel Att Obj Att Obj
CLIP 65 76 58 63 58.0 61 63
CLIP-C 69 80 59 62 59 61 66

most effective, probably due to the symmetry of transforming both modalities. The
second most effective is the captions only approach. Option (2) is the least effective
method likely because the images are naturally augmented (random cropping) in the
baseline method whereas the captions are fixed. These observations suggest that our
method is more helpful in learning representations of the texts relative to those of the
images. They also help to elucidate why we obtain much bigger improvements over
the baseline in zero-shot settings compared to linear evaluations.

4.5.6. Additional Results


Here, we provide more results that demonstrate other qualities of our method.

(a) CLIP-C improves SLIP [121]. In Table 4.9, we show that CLIP-C’s frame-
work is compatible with other vision-language methods such as SLIP [121].
Using our composition method in SLIP improves zero-shot image recognition
performance from 19.4% to 20.8% on ImageNet showing the complementarity
of our method and SLIP.

(b) Results on SugarCrepe [61]. We show in Table 4.10 that CLIP-C has more
compositional knowledge than CLIP, especially on object replacement and ob-


ject addition. This signals the potential of methods such as CLIP-C to close
the gap on relational and compositional understanding tasks in vision-language
models.

Section 4.6
Summary

In this chapter, we have shown that fast and straightforward semantic compositions
of distinct image-caption pairs substantially enhance the efficacy of vision-language
models. Our proposed approach, termed CLIP-C, demonstrates marked improvements
in zero-shot downstream tasks over the
baseline CLIP model. Our ablation studies offered critical insights, highlighting that
the observed enhancements in performance arise not merely from increased data aug-
mentation but from the strategic deployment of semantically distinct examples in
compositions. Finally, we provided experiments demonstrating the applicability of
our semantic composition framework to other competitive models such as SLIP.
We anticipate that these findings will inspire further exploration into innovative
and efficient applications of small-scale datasets for vision-language pre-training, par-
ticularly in domains where curating extensive amounts of paired data is challenging,
such as in medical and satellite imagery. Furthermore, research exemplified by our
work in this chapter is crucial for training language models with guidance derived
from visual cues, as demonstrated in Chapter 3. In the next chapter (Chapter 5),
we will introduce an alternative approach for training effective multimodal models,
focusing on the composition of multimodal features rather than data compositions.

Chapter 5
Channel Fusion for Vision and Language Learning
This chapter, like Chapter 4, presents a method for training more effective mul-
timodal models. However, whereas the innovations in Chapter 4 were rooted in ad-
vancements at the input data level, the approach outlined in this chapter re-envisions
the fusion of tokens across different modalities within the multimodal model1 .

Section 5.1
Overview

Multimodal learning is a fundamental imperative as we drive towards more general


purpose artificial systems. Tasks such as visual question answering (VQA) [51, 66],
where models generate responses to textual queries about visual inputs, require a
comprehensive understanding of both modalities. For instance, to accurately answer
a query like “What type of drink is to the right of the soda bottle?” the model must
possess the ability to distinguish a soda bottle from other bottles, differentiate left
from right, interpret the sentence, and identify the drink in question. Therefore, the
effective integration of representations from the different modalities is essential for
achieving high performance.
The two predominant strategies for the fusion of multimodal representations are
1
This work was done during an internship at Google


(a) Merged Attention (b) Co-Attention (c) Compound Tokens

Figure 5.1: Multimodal fusion methods: Illustrations of fusion methods from the
            perspective of one visual token Fi and one text token Tj. Our proposed
Compound Tokens fusion method, illustrated in (c), uses only one
cross-attention layer for each modality compared to co-attention which
uses both cross-attention and self-attention in all blocks. Q, K, and V
denote the input query, keys, and values respectively. X represents the
cross-attention layer’s output. Finally, the subscripts V , L, and CT stand
for visual features, text features or Compound Tokens, respectively.

termed merged attention and co-attention, as depicted in Figures 5.1a and 5.1b,
respectively. Merged attention involves concatenating the two unimodal representa-
tions along the sequence dimension before applying self-attention globally over the
concatenated output [112, 57, 42]. In contrast, co-attention does not concatenate rep-
resentations from the two modalities; it uses cross-attention to facilitate the exchange
of information between modalities [16, 99, 42].
However, these two approaches possess inherent limitations: merged attention
lacks cross-attention which is important for the robust alignment of tokens. Con-
versely, co-attention does not have a global receptive field across all vision and text


tokens, since the two modalities never appear in a single sequence. Consequently, merged at-
tention may encounter difficulties in effectively aligning tokens from different modal-
ities, while co-attention sacrifices the advantages of a comprehensive global view over
all tokens. Dou et al. [42] reported performance enhancements with co-attention com-
pared to merged attention. Nonetheless, another drawback of co-attention is that it is
parameter inefficient relative to merged attention since it requires distinct parameter
sets for vision and language features.
Our objective is to efficiently and simply address the limitations of the exist-
ing fusion strategies. The method described in this chapter accomplishes this goal
by integrating merged attention and co-attention within a streamlined pipeline that
yields more robust multimodal representations than either approach across several
multimodal tasks. We introduce channel fusion to integrate visual and language rep-
resentations for various question answering tasks, such as visual question answering
and visual entailment. By focusing the fusion process on the feature dimension, our
model more effectively aligns tokens from the two modalities compared to the stan-
dard methods. We term the resulting fused representations “Compound Tokens”
(as illustrated in Figure 5.1c) because each feature vector contains elements from both
the image and language embeddings.
In the channel fusion process, tokens from one modality are queried using tokens
from the other modality, and the resulting output is concatenated with the input
query tokens along the feature dimension. We implement a bi-directional channel fu-
sion that generates vision-to-text and text-to-vision Compound Tokens, which are
subsequently concatenated along the token dimension for further modeling. Channel
fusion effectively aligns multimodal tokens through cross-attention while maintaining
the benefits of global self-attention across all tokens. Unlike merged attention, we
concatenate vision and text tokens along the channel dimension. In contrast to co-


attention, which uses cross-attention functions in every block, our approach employs
only two cross-attention functions initially to facilitate channel concatenation.
Additionally, channel concatenation does not increase the token length thus avoid-
ing additional computational or memory burdens in the multimodal encoder (and
decoder). To further enhance efficiency, each modality is initially embedded into half
of the original feature dimension prior to compounding. This approach ensures that
the output following channel concatenation retains the same feature dimension as the
input vision and text tokens. Empirical evidence from our experiments indicates that
alternative methods of combining the input queries and cross-attention outputs such
as weighting or element-wise multiplication are less effective compared to channel
concatenation.
We evaluate Compound Tokens through extensive experiments in the challeng-
ing open-vocabulary setting using exact matching. In this context, the generated
responses must correspond precisely with the ground truth answers to be deemed
correct, presenting a significantly greater challenge than predicting from a limited
predefined set of responses, as is common with encoder-only models. This evaluation
approach is informed by prior research [29, 164, 133] that highlighted its flexibility
and applicability in practical scenarios.
The empirical evaluations show that our method outperforms both merged atten-
tion and co-attention on GQA [66], SNLI-VE [169], and VQA [51] with and without
vision-language pretraining. Compound Tokens obtained 82.87% on SNLI-VE
beating METER [42] by 2.26%. Additionally, they recorded 82.43% on GQA signifi-
cantly outperforming CFR [124] by 8.83%. Our model’s score of 70.62% on VQA is
competitive among existing models.


Figure 5.2: Model Architecture: Compound Tokens Channel Fusion is illustrated
            in Figure 5.1c. ResNet-50 [55] and T5-base [139] are used for the
            image and text encoders respectively.

Section 5.2
Technical Approach

This section provides a background of the baseline fusion methods and the architec-
ture for our Compound Tokens method as depicted in Figure 5.2 and Figure 5.1c.

5.2.1. Background
We provide an overview of the key functions pertinent to understanding our method,
omitting layer normalization and multi-layer perceptrons within attention blocks for
the sake of simplicity. Similarly, we refrain from addressing residual connections
between layers in this high-level overview.
Attention: Given a set of query vectors Q ∈ RN ×d and a set of key vectors


K ∈ RM ×d , an attention layer gathers information from context vectors V ∈ RM ×c


proportional to the normalized scores between the elements of Q and K. Specifically,
for softmax dot-product attention [159], the scalar output zi,ℓ , of an attention layer
for query vector qi ∈ Q and key vector kj ∈ K, is the weighted sum of the elements
of V,

$$a_{i,j} = \frac{q_i^T k_j}{\sqrt{d}}, \qquad \alpha_{i,j} = \frac{\exp(a_{i,j})}{\sum_{\ell} \exp(a_{i,\ell})}, \qquad z_{i,\ell} = \sum_j \alpha_{i,j} V_{j,\ell}. \qquad (5.1)$$

An attention mechanism is called self-attention when the query, context, and key
vectors are simple linear projections of the same underlying feature. It is known as
cross-attention when the keys and queries are projections of different features.
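As a concrete illustration, Equation 5.1 can be implemented in a few lines. This sketch assumes unbatched, row-major tensors and omits the linear projections that produce the queries, keys, and values.

```python
import torch

def attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    # Q: (N, d) queries, K: (M, d) keys, V: (M, c) context vectors.
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5              # a_{i,j} in Eq. (5.1)
    weights = torch.softmax(scores, dim=-1)  # alpha_{i,j}
    return weights @ V                       # z_{i,l}: weighted sums of the rows of V
```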
Multimodal Fusion: Token concatenation followed by self-attention is one of
the most adopted approaches for cross-modal learning in recent vision-language archi-
tectures [100, 133, 164, 25]. Formally, given a sequence of N image tokens, I ∈ RN ×d ,
and M text tokens, T ∈ RM ×d , most methods concatenate I and T into a single
representation O ∈ R(N +M )×d which is then fed into a multimodal transformer for
further modeling. The target outputs are produced using either a linear layer or a
decoder. Besides concatenation, other methods such as [153, 99, 98] use multimodal
transformers composed of both self-attention and cross-attention in every block.

5.2.2. Compound Tokens: Our Proposed Channel Fusion Method


Our method, illustrated in Figures 5.1c & 5.2, draws from both co-attention and
merged-attention. Compound Tokens fusion first projects the visual and lan-
guage tokens into half of the embedding space so that the total number of features
is maintained after channel concatenation: $\tilde{I} \in \mathbb{R}^{N \times \frac{d}{2}}$ and $\tilde{T} \in \mathbb{R}^{M \times \frac{d}{2}}$ for the image
and text tokens respectively. Next, we employ only two cross-attention layers (unlike
co-attention [42] that uses cross-attention and self-attention in every block) to create


visual and language Compound Tokens:

$$\hat{I} = A(\tilde{I}, \tilde{T}, \tilde{T}) \in \mathbb{R}^{N \times \frac{d}{2}}, \qquad \hat{T} = A(\tilde{T}, \tilde{I}, \tilde{I}) \in \mathbb{R}^{M \times \frac{d}{2}} \qquad (5.2)$$

$$I_{cmpd} = \text{C-Concat}(\tilde{I}, \hat{I}) \in \mathbb{R}^{N \times d}, \qquad T_{cmpd} = \text{C-Concat}(\tilde{T}, \hat{T}) \in \mathbb{R}^{M \times d}, \qquad (5.3)$$

where A(q, k, v) is the cross-attention function with q, k, and v as queries, keys, and
values respectively. C-Concat(u, υ) concatenates tensors u and υ along the feature
dimension. We combine vision-to-text Compound Tokens Icmpd , and text-to-vision
Compound Tokens Tcmpd , into a set of output Compound Tokens as in merged
attention architectures

$$O_{cmpd} = \text{Concat}(I_{cmpd}, T_{cmpd}) \in \mathbb{R}^{(N+M) \times d}. \qquad (5.4)$$

Following previous methods, we feed Ocmpd into a self-attention multimodal en-


coder before generating the target outputs with an auto-regressive decoder. We also
show results in Figure 5.3 and Table 5.2 where we do not use a multimodal encoder:
Ocmpd is passed directly into the decoder to produce the outputs.
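A minimal sketch of the channel fusion in Equations 5.2 through 5.4 is given below. It uses PyTorch's standard multi-head attention as a stand-in for the cross-attention function A, omits layer normalization, MLPs, and residual connections as in the overview above, and the module and argument names are illustrative rather than the exact implementation.

```python
import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    # Each modality is projected to d/2 features, cross-attends to the other
    # modality, and is concatenated with the cross-attention output along the
    # channel dimension, keeping the total token count at N + M.
    def __init__(self, d: int, num_heads: int = 8):
        super().__init__()
        assert (d // 2) % num_heads == 0
        self.proj_img = nn.Linear(d, d // 2)
        self.proj_txt = nn.Linear(d, d // 2)
        self.xattn_img = nn.MultiheadAttention(d // 2, num_heads, batch_first=True)
        self.xattn_txt = nn.MultiheadAttention(d // 2, num_heads, batch_first=True)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, d), txt_tokens: (B, M, d)
        i = self.proj_img(img_tokens)                  # (B, N, d/2)
        t = self.proj_txt(txt_tokens)                  # (B, M, d/2)
        i_hat, _ = self.xattn_img(i, t, t)             # image queries, text keys/values (Eq. 5.2)
        t_hat, _ = self.xattn_txt(t, i, i)             # text queries, image keys/values (Eq. 5.2)
        img_cmpd = torch.cat([i, i_hat], dim=-1)       # channel concatenation (Eq. 5.3)
        txt_cmpd = torch.cat([t, t_hat], dim=-1)       # channel concatenation (Eq. 5.3)
        return torch.cat([img_cmpd, txt_cmpd], dim=1)  # token concatenation (Eq. 5.4)
```

The output has the same shape as the input to merged attention, so it can be passed to a standard self-attention multimodal encoder or directly to the decoder.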

Section 5.3
Experimental Setup

The proposed fusion method is evaluated extensively on several multimodal tasks.


This section covers the experimental setup including the models and datasets em-
ployed in the empirical evaluations.

Model. We use ResNet-50 [55] as our image encoder and T5-base [139] as our text
encoder. The outputs of the image and text encoders are fed to our novel fusion
method described in Section 5.2.2. A T5-base decoder consumes the output of the


fusion module and generates free-form text for all question-answering tasks. The
image encoder is pretrained on ImageNet [146] while the text encoder and decoder
use pretrained T5 weights.

Pre-training Datasets and Tasks. We use CC3M2 [149] and COCO Captions [106]
for pre-training. The pre-training setup uses a mixture of these datasets across four
objectives:

(1) Image-Text Matching where the model predicts whether the given image-text
pair is a match or not a match.

(2) Captioning where the model is tasked with generating a description of the image.

(3) Caption Completion is similar to (2) but the model is given a masked caption
and the goal is to predict the missing words.

(4) Masked Language Modeling as described in BERT [39].

Hyper-parameters. Unless otherwise stated, we pre-train our models for 300,000
steps using a batch size of 512 and perform an additional 100,000 iterations of fine-
tuning at a batch size of 128 on the downstream tasks. During pre-training, the image
resolution is set to 224 × 224. This resolution is increased to 384 × 384 during fine-
tuning or when training from scratch without vision-language pretraining (VLP). The
input text length is set to 32 tokens. The output text length is 32 during pretraining
and reduced to 8 tokens during finetuning. Please refer to the Supplementary Mate-
rial for details on all hyper-parameter settings including learning rates, weight-decay,
etc.
2
The version of the dataset we used has about 2 million samples


Section 5.4
Results and Analysis

We test the transfer capabilities of our multimodal model on several downstream


benchmarks. Each downstream result is achieved by fine-tuning the pre-trained model
on the target dataset. The datasets used for downstream transfer experiments include
the following:

(a) SNLI-VE [169] is a dataset of approximately 500,000 image-text pairs used for
visual entailment (VE). Given an image and a proposed statement, the task for
this dataset requires determining whether the statement is neutral, entails, or
contradicts the image.

(b) Visual Question Answering (VQA2.0) [51] is a widely used benchmark for
many question-answering models and contains 400,000 image-text pairs span-
ning 3,130 output categories. Each image-question pair is associated with 10
answers.

(c) GQA [66] is a vision question-answering dataset of complex compositional ques-


tions comprising scene-object relations. This dataset was created from the Vi-
sual Genome [82] dataset. GQA has approximately 22 million question-answer
pairs and 113 thousand images.

We emphasize that in every scenario, our models generate answers in the open-
vocabulary setting covering 32,000 words irrespective of the number of categories in
the task. A model prediction is counted as correct if and only if it matches exactly
with the ground-truth answer. We use the VQA metric3 for VQA2.0 and simple
accuracy for GQA and SNLI-VE as evaluation metrics. Generally, we use SNLI-VE
3
https://visualqa.org/evaluation.html


and GQA for ablations as performance on those datasets in our setup is more stable
than results on VQA.

5.4.1. Why Channel Concatenation?


First, to determine the best way of composing Compound Tokens, we examine
several options with a prime objective being the preservation of the sequence length4 .
We sampled four combination methods and compared them on SNLI-VE and GQA
as the performances on these datasets in our setup are more stable compared to
VQA. Given input queries q and cross-attention layer’s outputs X, we explored the
following: (1) Channel Concatenation where we concatenate q and X along the feature
dimension as described in Section 5.2.2. (2) Weighting uses the operation Y = αq +
βX where α and β are randomly initialized learnable scalars, and Y is the output.
(3) In Element-wise Product, Y = q ⊙ X. (4) Finally, we tested a simple summation
of the tensors, Y = q + X. All these methods use approximately the same number
of flops and parameters. The results in Table 5.1 confirm channel concatenation is
better than the other methods.

Table 5.1: Channel Fusion versus Other Mixing Methods: Channel concatena-
tion obtains the highest accuracy on SNLI-VE and GQA.

Method GFlops SNLI-VE GQA


Channel Concatenation 20.71 80.85 80.79
Weighting 20.71 80.63 80.61
Summation 20.71 80.75 80.35
Element-wise Product 20.71 80.81 78.31
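The alternative mixing operators compared in Table 5.1 differ only in how the input queries and the cross-attention outputs are combined; a compact sketch is shown below, with the function and argument names being illustrative.

```python
import torch

def combine(q: torch.Tensor, x: torch.Tensor, how: str,
            alpha: torch.Tensor = None, beta: torch.Tensor = None) -> torch.Tensor:
    # q: input queries, x: cross-attention outputs, both of shape (B, N, d/2).
    if how == "channel_concat":
        return torch.cat([q, x], dim=-1)  # feature dimension doubles to d
    if how == "weighting":
        return alpha * q + beta * x       # alpha, beta: learnable scalars
    if how == "product":
        return q * x                      # element-wise product
    return q + x                          # summation
```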

5.4.2. Compound Tokens versus Merged Attention

We compare merged attention and channel fusion (our method) in Figure 5.3 without
vision-language pre-training to establish a baseline result. We then incorporate vision-
4
This is important in order not to increase the computation cost exorbitantly


Figure 5.3: Compound Tokens versus Merged Attention without Vision-Language
            Pre-training: With a relatively minimal amount of additional flops,
            Compound Tokens demonstrate improvements over merged attention across
            all tasks. Compound Tokens (TAQ) is a more efficient version of channel
            fusion where only one cross-attention layer is used: only Text tokens As Query.

language pre-training and reassess the performance of each method in Table 5.2. All
downstream tasks for each fusion method use the same pretrained model.
For the baseline comparisons, the fusion modules do not use a multimodal encoder:
Merged attention feeds a concatenation of the multimodal tokens to the decoder, whilst
Compound Tokens sends the tokens to the decoder immediately after channel
concatenation.
The results in Figure 5.3 and Table 5.2 show clearly that our fusion method is
superior to merged attention with and without vision-language pretraining at a rela-
tively small amount of additional computational cost. This performance boost sug-
gests that using cross-attention for multimodal token alignment is beneficial. When
vision-language pre-training is employed, Compound Tokens outperforms merged
attention by substantial margins: by +4.18% on VQA and 2.20% on GQA. The im-
provement on SNLI-VE is a modest 0.24% over the baseline. Our method enjoys
similar improvement margins when no vision-language pre-training is invoked. We
include a more efficient version of channel fusion (Compound Tokens (TAQ)) where


Table 5.2: Compound Tokens versus other Fusion models with Vision-
Language Pre-training: We repeat the experiments in Figure 5.3, but
include vision-language pre-training on a mixture of CC3M and COCO
captions.

Fusion Method GFlops VQA SNLI-VE GQA


Merged Attention 19.31 53.33 81.25 78.25
Compound Tokens (Ours) 19.87 57.51 81.49 80.45
Compound Tokens (TAQ, Ours) 17.34 53.23 81.21 77.74

only the text tokens are used as queries. This version of our method also outperforms
merged attention across all tasks when training from scratch while using fewer flops.

Table 5.3: Compound Tokens versus other Fusion models without Vision-
Language Pre-training: We extend the models to include a multi-
modal encoder with 12 self-attention layers in merged attention to match
the typical setting in previous works [42]. Compound Tokens outperform
merged attention and Co-Attention with fewer parameters than both meth-
ods and fewer flops than merged attention. Co-Tokenization is from [134].
“Params” show the number of parameters in the entire model (not just the
fusion module); “RES” is the image resolution and L is the total number of
transformer blocks in the multimodal encoder: Compound Tokens uses
two cross-attention blocks before the multimodal encoder.

Fusion Method L Params (×106 ) RES GFlops SNLI-VE GQA


Merged Attention 12 332.94 384 × 384 34.89 79.81 78.07
Co-Attention 12 361.26 384 × 384 29.61 80.20 77.75
Compound Tokens (Ours) 10 325.82 384 × 384 32.90 80.52 78.21
Co-Tokenization 12 392.14 384 × 384 57.78 80.79 81.07
Compound Tokens (Ours) 10 325.82 384 × 384 32.90 80.52 78.21

5.4.3. Multimodal Transformer Encoder


After establishing the superiority of Compound Tokens over merged attention
when no multimodal encoder is used, we extend the model to include a multimodal
encoder with 12 self-attention blocks to match the setting in previous works [99,
42]. We also compare with two other fusion methods Co-Attention (illustrated in


Figure 5.1b), and Co-Tokenization [134] which was originally implemented for video
question-answering tasks. Co-Tokenization iteratively fuses visual features with text
features using a TokenLearner [147]. The Co-Attention fusion module uses 6 blocks
each for the vision and the text branches as in METER [42] where each block has
a self-attention, cross-attention, and feedforward layers. Co-Tokenization uses 64
image tokens and 4 transformer blocks for each tokenization round. We use three
tokenization rounds resulting in 12 self-attention blocks. The multimodal encoder for
Compound Tokens has 10 blocks to compensate for the two cross-attention blocks
that it uses.
The results of these experiments are shown in Table 5.3. The models are trained
for 300 thousand iterations at a batch size of 128 on each downstream task without
any vision-language pre-training. Compound Tokens outperform merged attention
and co-attention in this setting suggesting channel fusion remains competitive even
when a multimodal encoder is used. However, it slightly underperforms the more
expensive Co-Tokenization module.

5.4.4. An Encoder Model for VQA


The performance of our models on VQA in the encoder-decoder setup is unstable
due to the decoder not being able to sufficiently learn the VQA vocabulary. As a
result, we switch to an encoder-only model for VQA during fine-tuning. The decoder
in the pre-trained model is replaced with a linear layer of size 3,130. The results in
Table 5.4 show that the encoder-only model significantly outperforms the encoder-
decoder model.

5.4.5. Comparison with Other Multimodal Models

Finally, we compare Compound Tokens with various models such as METER [42],
ALBEF [99], and CFR [124]. The models in Table 5.5 generally have approximately


Table 5.4: Encoder versus Encoder-Decoder: The encoder model outperforms the
           encoder-decoder model by a large margin. We finetune both models for
           100,000 steps starting from the same pre-trained encoder-decoder model.

Fusion Method Architecture GFlops VQA Accuracy


Compound Tokens Encoder-Decoder Model 35.50 58.14
Compound Tokens Encoder Model 31.77 70.39

Table 5.5: Compound Tokens versus other Multimodal Models: Our fu-
sion method is competitive on SNLI-VE and GQA with all models except
SimVLM [164] which used a private dataset of 1.5B samples. The best
values among the models besides SimVLM are in bold. The second best
values are underlined. * Gflops are based on our calculations.

Approach Params GFlops∗ VQA SNLI-VE GQA


SimVLMHuge [164] 1.5B 890 80.34 86.32 -
VisualBERT [100] 66.70 75.69 -
UNITER [27] 73.82 79.39 -
LXMERT [153] 69.90 - 60.00
ALBEF [99] 418M 122 75.84 80.91 -
METER [42] 336M 130 77.68 80.61 -
BLIP [98] 475M 122 77.54 - -
12-in-1 [111] 71.30 - 60.50
VinVL [180] 75.95 - 65.05
VL-T5 [29] 70.30 - 60.80
CFR [124] 69.80 - 73.60
Compound Tokens (Ours) 340M 36 70.62 82.87 82.43

the same number of parameters, but may differ on the pre-training datasets, pre-
training objectives, and backbone encoders. For example, while we use Conceptual
Captions [149] and COCO [106] as our pre-training datasets, METER used Concep-
tual Captions, COCO, Visual Genome [82] and SBU Captions [126]. ALBEF used
all the datasets in METER in addition to Conceptual Captions 12M [21].
We pre-train the Compound Tokens model for 500 thousand steps with a batch
size of 512 using an image resolution of 224 × 224 and further finetune for 200 thou-
sand iterations on each of the downstream tasks at resolution 384 × 384 at a batch
size of 128. Except for SimVLM [164] which has about 1.5 billion parameters and


uses significantly larger pre-training data (a 1.8 billion private dataset), our model
outperforms all other methods on SNLI-VE and GQA by large margins. We are
confident that further pretraining and increasing image resolution will improve our
already competitive result on the VQA dataset.

Section 5.5
Ablation Study

This section covers key ablations for our multimodal method including the input
image resolution and the architecture of the image encoder used.

5.5.1. Image Resolution


Increasing image resolution generally leads to better performance for various question-
answering tasks. As a consequence, most prior works use a larger resolution during
finetuning compared to the pre-training resolution [164]. We follow the setting in
METER [42] by pre-training and finetuning at resolutions 224 × 224 and 384 × 384
respectively. In this section, we investigate whether Compound Tokens also out-
performs merged attention ablating for input resolution.
The results of this ablation are shown in Table 5.6 for models without a multi-
modal encoder. The models in Table 5.6 do not pre-train on paired image-text data.
As in prior works, increasing image resolution improves performance across all fusion
methods and datasets. In all cases, Compound Tokens outperform merged attention,
underlining that our proposed method is more effective than this standard fusion
approach.

5.5.2. Type of Image Encoder


The image encoder is an important component in vision-language models. Although
earlier models used object detectors such as Faster-RCNN, more recent models use


Table 5.6: Impact of Image Resolution: Increasing the resolution increases per-
formance for both merged attention and Compound Tokens.

Fusion Method RES GFlops SNLI-VE GQA


Merged Attention 224 × 224 9.94 78.70 75.62
Compound Tokens 224 × 224 10.22 79.59 76.62
Merged Attention 384 × 384 19.31 79.15 76.66
Compound Tokens 384 × 384 19.87 80.44 79.02

Table 5.7: Image Encoder Ablation: Both the ViT and ResNet-50 are pre-trained
on ImageNet before transferring to the target task.

Image Encoder Fusion Method SNLI-VE GQA


Merged Attention 77.44 74.02
ViTB
Compound Tokens 78.59 74.74
Merged Attention 78.70 75.62
ResNet-50
Compound Tokens 79.59 76.62

a CNN [91] or a Vision Transformer (ViT) [159] for image feature extraction. We
used ResNet-50 for our main experiments and ablate the impact of that choice in this
section. The results of using a ViT as the image encoder are shown in Table 5.7.
All models in that experiment use 224 × 224 as the image resolution. A patch size
of 16 × 16 was used for the ViT. The ViT models perform slightly worse than the
comparable ResNet, but channel fusion remains superior to merged attention.

Section 5.6
Supplemental Information

We provide full details of our hyper-parameter settings in this section. We use


Adam [78] to optimize all our models. The learning rate starts from zero and warms
up linearly to the base rate after 8 thousand iterations. Cosine annealing [108] with
a cycle rate of 100 thousand steps is then used to decay the rate to zero by the end
of training. We use gradient clipping with a maximum norm of 1.0 in all our experi-


Table 5.8: Hyper-parameters: We enumerate the hyper-parameters for our ablation
           experiments and main model. L is the number of blocks in a multimodal
           encoder. “Main Model” is the model we used in Table 5.5 for comparison
           with existing works.

Experiment Phase L Iterations LR Dropout Weight Decay


Pretraining 0 / 12 300k 1.1e−4 1e−3 0.1
0 5e − 5 0
Finetuning 100k 1e−4
Ablations 12 3.1e−3 1e−3
0 7.5e−5
Scratch 300k 1e−2 1e−3
12 3e−5
Pretraining 500k 1.1e−4 0.1
Main Model 12 1e−3
Finetuning 200k 3e−5
1e−4

ments. Beyond resizing and normalization, we do not use any data augmentation in our
ablation experiments; for our main model, we additionally apply random cropping and
AutoAugment [35] during pre-training. All our pretraining experiments use a batch size of 512 and image
resolution 224 × 224. The batch size is divided equally among the four pretraining
objectives: image captioning, caption completion, image-text matching, and masked
language modeling. We also sample the same number of examples from CC3M and
COCO in every iteration. The batch size and resolutions are set to 128 or 384 × 384
whenever training from scratch or fine-tuning respectively. The datasets we used in
our model are described in Section 5.3. The rest of the hyper-parameters are listed
in Table 5.8.
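The learning-rate schedule described above can be sketched as follows; the restart behavior within the cosine cycle is an assumption based on the stated 100 thousand step cycle rate.

```python
import math

def learning_rate(step: int, base_lr: float,
                  warmup_steps: int = 8_000, cycle_steps: int = 100_000) -> float:
    # Linear warmup from zero to base_lr, then cosine annealing toward zero.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = ((step - warmup_steps) % cycle_steps) / cycle_steps  # position within the cycle
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```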

Section 5.7
Summary

In this chapter, we presented an innovative multimodal fusion method for vision-


and-language representation learning. Unlike existing systems that perform multi-
modal concatenation along the sequence dimension, our method employs concate-
nation along the feature dimension. This approach allows our multimodal tokens,


termed Compound Tokens, to benefit from both cross-attention token alignment


and a global receptive field across both modalities, without significantly increasing
memory or computational costs. Through extensive comparative experiments, we
demonstrated that our method surpasses merged attention and co-attention across
three widely recognized question-answering tasks. Our method consistently outper-
formed these standard techniques, both with and without pre-training on image-text
pairs, and across various image resolutions and encoding backbones. Furthermore, our
approach achieved superior performance compared to well-established models such as
ALBEF [99] and METER [42], exceeding them by nearly 2% on the SNLI-VE task.

Chapter 6
Slot Machines
Different from the preceding chapters which concentrated on vision-guided language
models and multimodal networks, this chapter focuses on architectures and training
methodologies of deep neural networks. In contrast to conventional weight optimiza-
tion in a continuous space, we demonstrate the existence of effective random networks
whose weights are never updated. We discuss the related works within the chapter
separately from Chapter 2 in Section 6.6.

Section 6.1
Overview

Typically, “learning” in modern deep neural networks involves optimizing a network


from scratch [84], finetuning a pre-trained model [171] or jointly optimizing the ar-
chitecture and weights [185]. Innovations to this setup have fueled the remarkable
successes deep learning has produced in a variety of application areas, including image
recognition [55], object detection [142, 54], machine translation [159] and language
modeling [15]. Against this predominant background, we pose the following question:
can a network instantiated with only random weights achieve competitive results com-
pared to the same model using optimized weights? This is an important endeavor as a
solution could, in theory, provide a pathway to bypass the very expensive process of


gradient updates through backpropagation in traditional training methods. A favor-


able solution to the question may also provide new insights for deep neural network
initialization.
For any specified task, an untrained, randomly initialized network is unlikely to
produce performance above chance. However, given sufficient random weight options
for each connection, there may exist a selection of these random weights that have
comparable generalization performance to that of a traditionally-trained network with
the same architecture. We demonstrate that these selections exist. Importantly, we
introduce a method that can find these high-performing randomly weighted config-
urations consistently and efficiently. Furthermore, we show empirically that a small
number of random weight options (e.g., 2 − 8 values per connection) are sufficient to
obtain accuracy comparable to that of the conventionally-trained network. Instead of
updating the weights, the algorithm simply selects a weight value for each connection
from a fixed set of random weights.
We use the analogy of “slot machines” to describe how our method operates. Each
reel in a Slot Machine has a fixed set of symbols. The reels are spun jointly in
an attempt to find winning combinations. In our context, each connection has a
fixed set of random weight values. Our algorithm “spins the reels” in order to find a
winning combination of symbols, i.e., selects a weight value for each connection so
as to produce an instantiation of the network that yields strong performance. While
in physical Slot Machines the spinning of the reels is governed by a fully random
process, in our Slot Machines the selection of the weights is guided by a method
that optimizes the given loss at each spinning iteration.
More formally, we allocate K fixed random weight values to each connection. Our
algorithm assigns a quality score to each of these K possible values. In the forward
pass a weight value is selected for each connection based on the scores. The scores


are then updated in the backward pass via stochastic gradient descent. However, the
weights are never changed. By evaluating different combinations of fixed randomly
generated values, this extremely simple procedure finds weight configurations that
yield high accuracy.
We demonstrate the efficacy of our algorithm through experiments on MNIST
and CIFAR-10. On MNIST, our randomly weighted Lenet-300-100 [91] obtains a
97.0% test set accuracy when using K = 2 options per connection and 98.2% with
K = 8. On CIFAR-10 [83], our six-layer convolutional network outperforms the
regular network when selecting from K = 8 fixed random values at each connection.
Fine-tuning the models obtained by our procedure generally boosts performance
over networks with optimized weights, albeit at an additional compute cost (see Fig-
ure 6.5). Also, compared to traditional networks, our networks are less memory effi-
cient due to the inclusion of scores. That said, our work casts light on some intriguing
phenomena about neural networks for further probing.

• First, our results suggest a performance comparability between selection from


multiple random weights and traditional training by continuous weight opti-
mization. This underscores the effectiveness of strong initializations.

• Second, this paper highlights the enormous expressive capacity of neural net-
works. Maennel et al . [113] show that contemporary neural networks are so
powerful that they can memorize randomly generated labels. This work builds
on that revelation and demonstrates that current networks can model challeng-
ing non-linear mappings extremely well even by simple selection from random
weights.

• This work also connects to recent observations [114, 46] suggesting strong per-
formance can be obtained by utilizing gradient descent to uncover effective


Figure 6.1: Slot Machines Architecture: Our method assigns a set of K (K = 3
            in this illustration) random weight options to each connection. During
the forward pass, one of the K values is selected for each connection,
based on a quality score computed for each weight value. On the backward
pass, the quality scores of all weights are updated using a straight-through
gradient estimator [9], enabling the network to sample better weights in
future passes. Unlike the scores, the weights are never changed.

subnetworks.

Section 6.2
Technical Approach

Our goal is to construct non-sparse neural networks that achieve high accuracy by
selecting a value from a fixed set of completely random weights for each connection.
We start by providing an intuition for our method in Section 6.2.1, before formally
defining our algorithm in Section 6.2.2 .

6.2.1. Intuition
An untrained, randomly initialized network is unlikely to perform better than random
chance. Interestingly, the impressive advances of [140] and [184] demonstrate that
networks with random weights can in fact do well if pruned properly. In this work,
instead of pruning we explore weight selection from fixed random values as a way


to obtain effective networks. To provide an intuition for our method, consider an


untrained network N with one weight value for each connection, as typically done.
If the weights of N are drawn randomly from an appropriate distribution D (e.g.,
Glorot Normal [50] or He Normal [53]), there is an extremely small but non-zero
probability that N obtains good accuracy (say, greater than a threshold τ ) on the
given task. Let q denote this probability. Also consider another untrained network
$N_K$ that has the same architectural configuration as N but with K > 1 weight
choices per connection. If n is the number of connections in N, then $N_K$ contains
within it $K^n$ different network instantiations that are architecturally identical to N
but that differ in weight configuration. If the weights of $N_K$ are sampled from D,
then the probability that none of the $K^n$ networks obtains good accuracy is essentially
$(1 - q)^{K^n}$. Note that this probability decays quickly as either K or n increases. Our
method finds randomly weighted networks that achieve very high accuracy even with
small values of K. For instance, a six-layer convolutional network with 2 random
values per connection obtains 85.1% test accuracy on CIFAR-10.
But how do we select a good network from these $K^n$ different networks? Brute-
force evaluation of all possible configurations is clearly not feasible due to the massive
number of different hypotheses. Instead, we present an algorithm, shown in Figure 6.1,
that iteratively searches the best combination of connection values for the entire
network by optimizing the given loss. To do this, the method learns a real-valued
quality score for each weight option. These scores are used to select the weight value
of each connection during the forward pass. The scores are then updated in the
backward pass based on the loss value in order to improve training performance over
iterations.


6.2.2. Learning in Slot Machines


Here we introduce our algorithm for the case of fully-connected networks but the
description extends seamlessly to convolutional networks. A fully-connected neural
network is an acyclic graph consisting of a stack of L layers [1, · · · , L] where the ℓth
layer has $n_\ell$ neurons. The activation $h(x)_i^{(\ell)}$ of neuron i in layer ℓ is given by

$$h(x)_i^{(\ell)} = g\!\left(\sum_{j=1}^{n_{\ell-1}} h(x)_j^{(\ell-1)} W_{ij}^{(\ell)}\right) \qquad (6.1)$$

where $W_{ij}^{(\ell)}$ is the weight of the connection between neuron i in layer ℓ and neuron
j in layer ℓ − 1, x represents the input to the network, and g is a non-linear activation
function. Traditionally, $W_{ij}^{(\ell)}$ starts off as a random value drawn from an appropriate
distribution before being optimized with respect to a dataset and a loss function
using gradient descent. In contrast, our method does not ever update the weights.
Instead, it associates a set of K possible weight options for each connection1 , and
then it optimizes the selection of weights to use from these predefined sets for all
connections.
Forward Pass. Let {Wij1 , . . . , WijK }2 be the set of the K possible weight values
for connection (i, j) and let sijk be the “quality score” of value Wi,j,k , denoting the
preference for this value over the other possible K − 1 values. We define a selec-
tion function ρ which takes as input the scores {sij1 , . . . , sijK } and returns an index
between 1 and K. In the forward pass, we set the weight of (i, j) to Wijk∗ where
k ∗ = ρ(sij1 , . . . , sijK ).
In our work, we set ρ to be either the arg max function (returning the index
corresponding to the largest score) or the sampling from a Multinomial distribution
defined by {sij1 , . . . , sijK }. We refer to the former as Greedy Selection (GS) and
1
For simplicity, we use the same number of weight options K for all connections in a network.
2
For brevity, from now on we omit the superscript denoting the layer.


name the latter Probabilistic Sampling (PS). Probabilistic Sampling is implemented


as
$$\rho \sim \text{Mult}\!\left(\frac{e^{s_{ij1}}}{\sum_{k=1}^{K} e^{s_{ijk}}}, \cdots, \frac{e^{s_{ijK}}}{\sum_{k=1}^{K} e^{s_{ijk}}}\right) \qquad (6.2)$$

where Mult is the multinomial distribution. The empirical comparison between these
two selection strategies is given in Section 6.5.1.
We note that, although K values per connection are considered during training
(as opposed to the infinite number of possible values in traditional training), only one
value per connection is used at test time. The final network is obtained by selecting
for each connection the value corresponding to the highest score (for both GS and PS)
upon completion of training. Thus, the effective capacity of the network at inference
time is the same as that of a traditionally-trained network.
Backward Pass. In the backward pass, all the scores are updated with straight-
through gradient estimation since ρ has a zero gradient almost everywhere. The
straight-through gradient estimator [9] treats ρ essentially as the identity function in
the backward pass by setting the gradient of the loss with respect to sijk as

$$\nabla_{s_{ijk}} \leftarrow \frac{\partial \mathcal{L}}{\partial a(x)_i^{(\ell)}} \, h(x)_j^{(\ell-1)} W_{ijk}^{(\ell)} \qquad (6.3)$$

for $k \in \{1, \cdots, K\}$ where $\mathcal{L}$ is the objective function and $a(x)_i^{(\ell)}$ is the pre-activation of
neuron i in layer ℓ. Given α as the learning rate, and ignoring momentum, we update
the scores via stochastic gradient descent as

$$\tilde{s}_{ijk} = s_{ijk} - \alpha \nabla_{s_{ijk}} \qquad (6.4)$$

where s̃ijk is the score after the update. Our experiments demonstrate that this
simple algorithm learns to select effective configurations of random weights resulting


Table 6.1: Architecture specifications of the networks used in our experiments. The Lenet network is trained on MNIST. The CONV-X models are the same VGG-like architectures used in [46, 184, 140]. All convolutions use 3 × 3 filters and pool denotes max pooling.

Network                  | Lenet        | CONV-2       | CONV-4                       | CONV-6                                       | VGG-19
Convolutional Layers     | (none)       | 64, 64, pool | 64, 64, pool; 128, 128, pool | 64, 64, pool; 128, 128, pool; 256, 256, pool | 2x64, pool; 2x128, pool; 2x256, pool; 4x512, pool; 4x512, avg-pool
Fully-connected Layers   | 300, 100, 10 | 256, 256, 10 | 256, 256, 10                 | 256, 256, 10                                 | 10
Epochs: Slot Machines    | 200          | 200          | 200                          | 200                                          | 220
Epochs: Learned Weights  | 200          | 200          | 330                          | 330                                          | 320
Dataset                  | MNIST        | CIFAR-10     | CIFAR-10                     | CIFAR-10                                     | CIFAR-10


Section 6.3
Experimental Setup

Slot Machines are evaluated on various image categorization benchmarks. In this section, we present the experimental setup including the models and hyper-parameters employed in the empirical evaluations.

Models. The weights of all our networks are sampled uniformly at random from a
Glorot Uniform distribution [50], U(−σx , σx ) where σx is the standard deviation of
the Glorot Normal distribution. We ignore K, the number of options per connection,
when computing the standard deviation since it does not affect the network capacity
in the forward pass. Like for the weights, we initialize the scores independently
from a uniform distribution U(0, λσx ) where λ is a small constant. We use λ = 0.1
for all fully-connected layers and set λ to 1 when initializing convolutional layers.
We use 15% and 10% of the training sets of MNIST and CIFAR-10, respectively,


for validation. We report performance on the separate test set. On MNIST, we experiment with the Lenet-300-100 [91] architecture following the protocol in [46]. We
also use the VGG-like architectures employed in Zhou et al . [184] and Ramanujan et
al . [140]. We denote these networks as CONV-2, CONV-4, and CONV-6. These
architectures are provided in Table 6.1 for completeness. All our plots show the
averages of four different independent trials. Error bars whenever shown are the
minimum and maximum over the trials. Accuracies are measured on the test set over
four different trials using early stopping on the validation accuracy.

Hyper-parameters. All models use a batch size of 128 and stochastic gradient
descent with warm restarts [108] at epochs 25 and 75, a momentum of 0.9, and an ℓ2 penalty of 0.0001. Probabilistic Sampling models do not use weight decay. When
training GS slot machines, we set the learning rate to 0.2 for K ≤ 8 and 0.1
otherwise. We set the learning rate to 0.01 when directly optimizing the weights
(training from scratch and finetuning) except when training VGG-19 where we set
the learning rate to 0.1. We find that a high learning rate is required when sampling
the network probabilistically, a behaviour also observed in [184]. Accordingly, we use
a learning rate of 25 for all PS models. We did not train VGG-19 using PS. We
use data augmentation and dropout (with a rate of p = 0.5) when experimenting on
CIFAR-10 [83]. We use batch normalization in VGG-19 but the affine parameters are
never updated throughout training.
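For quick reference, the recipe above can be summarized in a small configuration sketch (the key names are ours; the values are the ones stated in this section):

```python
def gs_learning_rate(K: int) -> float:
    """GS slot machines use a learning rate of 0.2 for K <= 8 and 0.1 otherwise."""
    return 0.2 if K <= 8 else 0.1

TRAINING_CONFIG = {
    "batch_size": 128,
    "optimizer": "SGD with warm restarts [108], restarting at epochs 25 and 75",
    "momentum": 0.9,
    "weight_decay": 1e-4,             # not applied to Probabilistic Sampling models
    "lr_learned_weights": 0.01,       # 0.1 when training VGG-19 directly
    "lr_probabilistic_sampling": 25.0,
    "cifar10_dropout": 0.5,           # used together with data augmentation on CIFAR-10
    "vgg19_batchnorm_affine": False,  # affine parameters are never updated
}
```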


[Figure 6.2: four bar charts (Lenet, CONV-2, CONV-4, CONV-6); y-axis: Test Accuracy (%); bars: Randomly Initialized Network vs. Selected Weights (K = 2).]

Figure 6.2: Results between random networks and Slot Machines: Selecting from only K = 2 weight options per connection already dramatically improves accuracy compared to an untrained network that performs at random chance (10%) on both (a) MNIST and (b) CIFAR-10. The first bar in each plot shows the performance of an untrained randomly initialized network and the second bar shows the results of selecting random weights with GS using K = 2 options per connection.

Section 6.4
Results and Discussion

6.4.1. Slot Machines versus Traditionally-Trained Networks


We compare the networks using random weights selected from our approach with
two different baselines: (1) randomly initialized networks with one weight option
per connection, and (2) traditionally-trained networks whose continuous weights are
iteratively updated. For this first set of experiments we use GS to optimize Slot
Machines, since it tends to provide better performance than PS (the two methods
will be compared in subsection 6.5.1).
As shown in Figure 6.2, untrained networks with only one random weight per
edge perform at chance. However, methodically selecting the parameters from
just two random values for each connection greatly enhances performance across dif-
ferent datasets and networks. Even better, as shown in Figure 6.3, as the number of
random weight options per connection increases, the performance of these networks
approaches that of traditionally-trained networks with the same number of param-
eters and cost (see Figure 6.4), despite containing only random values. Malach et
al . [114] proved that any “ReLU network of depth ℓ can be approximated by finding a


[Figure 6.3: Test Accuracy versus K for Lenet on MNIST and CONV-2, CONV-4, CONV-6 on CIFAR-10; curves: Learned Weights vs. Slot Machines (GS).]

Figure 6.3: Comparison with traditional training on CIFAR-10 and MNIST. Performance of slot machines improves as K increases (here we consider K ∈ {2, 4, 8, 16, 32, 64}) although the performance degrades after K ≥ 8. For CONV-6 (the deepest model considered here), our approach using GS achieves accuracy superior to that obtained with trained weights, while for CONV-4 it produces performance only slightly inferior to that of the optimized network. Furthermore, as illustrated by the error bars in these plots, the accuracy variances of slot machines are much smaller than those of networks traditionally trained by optimizing weights.

[Figure 6.4: Test Accuracy versus Tera Flops for CONV-4 and CONV-6 on CIFAR-10; curves: Learned Weights vs. Slot Machine (K = 8).]

Figure 6.4: Test Accuracy versus Flops. Slot Machines achieve comparable performance to models traditionally optimized for the same training compute budget.


[Figure 6.5: Test Accuracy versus Tera Flops for CONV-4 and CONV-6 on CIFAR-10; curves: Learned Weights, Slot Machines (GS), and Finetuned Selected Weights.]

Figure 6.5: Finetuning Selected Weights. Finetuning Slot Machines improves test set performance on CIFAR-10. For CONV-4 and CONV-6 this results in better accuracy compared to the same networks learned from scratch at comparable training cost (shown on the x axis). The six-layer slot machine uses K = 8 options per edge whereas the CONV-4 slot machine uses K = 16.

weighted-subnetwork of a random network of depth 2ℓ and sufficient width.” Without pruning, our selection method finds, within the superset of fixed random weights, a six-layer configuration that outperforms a six-layer traditionally-trained network.

6.4.2. Fine-tuning Slot Machines


Our approach can also be viewed as a strategy to provide a better initialization for
traditional training. To assess the value of such a scheme, we finetune the networks
obtained after training slot machines for 100 epochs to match the cost of learned
weights. Figure 6.5 summarizes the results in terms of training time (including both
selection and finetuning) vs test accuracy. It can be noted that for the CONV-4 and
CONV-6 architectures, finetuned slot machines achieve higher accuracy compared to
the same models learned from scratch, at no additional training cost. For VGG-19,
finetuning improves accuracy (92.1% instead of 91.7%) but the resulting model still


[Figure 6.6: Test Accuracy versus Slot Machine Checkpoint for CONV-4-Finetuned and CONV-6-Finetuned on CIFAR-10.]

Figure 6.6: Finetuning from different Slot Machine checkpoints. Slot Machine checkpoint shows the number of training epochs used for weight selection before switching to finetuning (performed for 100 epochs).

does not match the performance of the model trained from scratch (92.6%).
To show that the weight selection in Slot Machines impacts the performance
of the finetuned models, we start finetuning from different checkpoints and compare
the results. If the selection is beneficial, then finetuning from later checkpoints will
show improved performance over fine-tuning from earlier checkpoints. As shown in
Figure 6.6, this is indeed the case as finetuning from later checkpoints results in higher
performance on the test set.
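One way to realize this finetuning stage is to export the selected weights into an ordinary trainable layer; the sketch below assumes the illustrative SlotMachineLinear layout from earlier (attribute names `weight_options` and `scores`), not the thesis code:

```python
import torch
import torch.nn as nn

def to_finetunable_linear(slot_layer: nn.Module) -> nn.Linear:
    """Copy the currently selected random weights into a standard nn.Linear
    so that subsequent training updates them directly (finetuning)."""
    with torch.no_grad():
        k_star = slot_layer.scores.argmax(dim=-1, keepdim=True)              # (out, in, 1)
        selected = slot_layer.weight_options.gather(-1, k_star).squeeze(-1)  # (out, in)
    out_features, in_features = selected.shape
    layer = nn.Linear(in_features, out_features, bias=False)
    layer.weight.data.copy_(selected)   # the selected configuration becomes the initialization
    return layer
```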

Section 6.5
Ablation Study

We conduct multiple ablation experiments on our models, including the weight selection strategy, weight sharing, and sparsity in Slot Machines.


[Figure 6.7: Test Accuracy versus K for Lenet on MNIST and CONV-2, CONV-4, CONV-6 on CIFAR-10; curves: Slot Machines (GS) vs. Slot Machines (PS).]

Figure 6.7: Selection Method. Training slot machines via greedy selection yields better accuracy than optimizing them with probabilistic sampling for all values of K considered. The reason is that PS is a lot more exploratory and tends to produce slower convergence.

6.5.1. Greedy Selection Versus Probabilistic Sampling


As detailed in Section 6.2.2, we consider two different methods for sampling our
networks in the forward pass: a greedy selection where the weight corresponding
to the highest score is used and a stochastic selection which draws from a proper
distribution over the weights. We compare the behavior of our networks under these
two different protocols.
As seen in Figure 6.7, GS performs better than PS. To fully comprehend the
performance differences between these two strategies, it is instructive to look at Fig-
ure 6.8, which reports the percentage of weights changed every 5 epochs by the two
strategies. PS keeps changing a large percentage of weights even in late stages of the
optimization, due to its probabilistic sampling. Despite the network changing con-
siderably, PS still manages to obtain good accuracy (see Figure 6.7) indicating that
there are potentially many good random networks within a slot machine. However,
as hypothesized in [140], the high variability due to stochastic sampling means that
the same network is likely never or rarely observed more than once in any training
run. This makes learning extremely challenging and consequently adversely impacts
performance. Conversely, GS is less exploratory and converges fairly quickly to a


[Figure 6.8: % Change of Selected Network (log scale) versus Epoch for Lenet on MNIST; curves: GS Network vs. PS Network.]

Figure 6.8: Weight exploration in Slot Machines. The vertical axis shows (on a log scale) the percentage of weights changed after every five epochs as training progresses. Compared to PS, GS is much less exploratory and converges rapidly to a preferred configuration of weights. On the other hand, due to its probabilistic sampling, PS keeps changing the weight selections even in late stages of training.

stable set of weights.


From Figure 6.7 we can also notice that the accuracy of GS improves or remains
stable as the value of K is increased. This is not always the case for PS when K ≥ 8.
We claim this behavior is expected since GS is more restricted in terms of the choices
it can take. Thus, GS benefits more from large values of K compared to PS.

6.5.2. Weights Sharing

Inspired by quantized networks [64, 141, 65, 162], we examine Slot Machines under
two new settings. The first constrains the connections in a layer to share the same
set of K random weights. The second setting is even more restricting as it requires
all connections in the network to share the same set of K random weights. Under
the first setting, at each layer, the weights are drawn from the uniform distribution
U(−σℓ, σℓ), where σℓ is the standard deviation of the Glorot Normal distribution for


[Figure 6.9: Test Accuracy versus K for Lenet on MNIST and CONV-4, CONV-6 on CIFAR-10; series: Unshared Weights, Layerwise-shared Weights, Globally-shared Weights.]

Figure 6.9: Weights Sharing in Slot Machines: GS models using the same set of K random weights for all connections in a layer or in the entire network perform quite well. However, they do not match the performance of Slot Machines that use different sets of weights for different connections.

layer ℓ. When using a single set of weights for the entire network, we sample the
weights independently from U(−σ̂, σ̂). σ̂ is the mean of the standard deviations of
the layer-wise Glorot Normal distributions.
Each of the weights is still associated with a score. The slot machine with shared
weights is then trained as before. Weight sharing substantially reduces the memory
requirements of these networks compared to traditional neural networks. For example,
a Lenet model with unshared weights needs ∼ 1MB of storage whereas the same model
using shared weights in each layer needs ∼ 0.02MB of storage.
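A minimal sketch of the layer-wise sharing variant follows (the sampling and broadcasting scheme mirrors the description above; the names and the score-initialization constant are illustrative assumptions):

```python
import torch
import torch.nn as nn

def layerwise_shared_options(out_features: int, in_features: int, K: int):
    """One set of K random values, drawn from U(-sigma_l, sigma_l), is shared by
    every connection in the layer; each connection still has its own K scores."""
    sigma_l = (2.0 / (in_features + out_features)) ** 0.5    # Glorot Normal std for this layer
    shared = torch.empty(K).uniform_(-sigma_l, sigma_l)      # only K floats stored for the layer
    # broadcast the shared options to every connection without materializing copies
    weight_options = shared.view(1, 1, K).expand(out_features, in_features, K)
    scores = nn.Parameter(torch.empty(out_features, in_features, K).uniform_(0.0, 0.1 * sigma_l))
    return weight_options, scores
```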
As shown in Figure 6.9, these models are effective given a large enough K. Fur-
thermore, the accuracy exhibits a large variance from run to run, as evidenced by the
large error bars in the plot. This is understandable, as the slot machine with shared
weights is restricted to search in a much smaller space of parameter combinations
reducing the probability of finding a “winning” combination.


6.5.3. Sparse Slot Machines


We conducted experiments where one of the K weight options is fixed at 0 to induce
sparse networks. In the empirical results, the resulting sparsity is low when K is
large: for CONV-6 on CIFAR-10, the sparsity is 49% when K = 2, and 1.1% when
K = 64. If K is small, the selected sparse network has a lower performance compared
to the corresponding full-capacity Slot Machine where all the K weight options
are initialized randomly (76% versus 83% test accuracy for CONV-6 on CIFAR-10
with K = 2). This is because when K is small and some weights are still set to 0,
the model becomes too restrictive and thus ineffective.
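The sparse variant only changes how the weight options are initialized; a minimal sketch (function name and layout are illustrative):

```python
import torch

def sparse_weight_options(out_features: int, in_features: int, K: int, sigma: float) -> torch.Tensor:
    """Pin one of the K options per connection to exactly 0, so selecting it
    effectively prunes that connection from the forward pass."""
    options = torch.empty(out_features, in_features, K).uniform_(-sigma, sigma)
    options[..., 0] = 0.0    # the first option is the "zero" (pruned) choice
    return options
```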

Section 6.6
Related Works

Supermasks and the Strong Lottery Ticket Conjecture. The lottery ticket hy-
pothesis was articulated in [46] and states that a randomly initialized neural network
contains sparse subnetworks which when trained in isolation from scratch can achieve
accuracy similar to that of the trained dense network. Inspired by this result, Zhou et
al . [184] presents a method for identifying subnetworks of randomly initialized neural
networks that achieve better-than-chance performance without training. These sub-
networks (named “supermasks”) are found by assigning a probability value to each
connection. These probabilities are used to sample the connections to use and are
updated via stochastic gradient descent. On ImageNet [146], Ramanujan et al . [140]
finds supermasks within a randomly weighted ResNet-50 that match the performance
of a trained ResNet-34.
These empirical results as well as theoretical ones [114, 131] suggest that pruning a
randomly initialized network is just as good as optimizing the weights, provided a
good pruning mechanism is used. Our work corroborates this intriguing phenomenon


but differs from these prior methods in significant ways. We eliminate pruning com-
pletely and instead introduce multiple weight values per connection. Thus, rather
than selecting connections to define a subnetwork, our method selects weights for all
connections in a network of fixed architecture. Thus, every neural connection in our
network is active in every forward pass.

Pruning at Initialization. The lottery ticket hypothesis also inspired several re-
cent works aimed at pruning (i.e., predicting “winning” tickets) at initializa-
tion [93, 94, 155, 161]. Our work is different in motivation from these methods and
those that train only a subset of the weights [59, 144]. Our aim is to find neural
networks with random weights that match the performance of traditionally-trained
networks with the same number of parameters.

Weight Agnostic Neural Networks. Gaier and Ha [48] build neural network
architectures with high performance in a setting where all the weights have the
same shared random value. The optimization is instead performed over the archi-
tecture [151]. They show empirically that the network performance is indifferent to
the shared value but defaults to random chance when all the weights assume different
random values. Although we do not perform weight training, the weights in this work
have different random values. Further, we build our models using fixed architectures.

Low-bit Networks and Quantization Methods. As in binary networks [64, 141] and network quantization [65, 162], the parameters in Slot Machines are drawn
from a finite set. However, whereas the primary objective in quantized networks is
compression and computational speedup, the motivation behind Slot Machines is
to recover good performance from randomly initialized networks. Accordingly, Slot
Machines use real-valued weights as opposed to the binary (or integer) weights


used by low-bit networks. Furthermore, the weights in low-bit networks are usually
optimized directly whereas only associated scores are optimized in slot machines.

Random Decision Trees. Our approach is inspired by the popular use of random
subsets of features in the construction of decision trees [14]. Instead of considering
all possible choices of features and all possible splitting tests at each node, random
decision trees are built by restricting the selection to small random subsets of feature
values and splitting hypotheses. We adapt this strategy to the training of neural
networks by restricting the optimization of each connection over a random subset of
weight values.

Section 6.7
Summary

In this chapter, we discussed work showing that neural networks with random weights
perform competitively, provided that each connection is given multiple weight options
and that a good selection strategy is used. By selecting a weight among a fixed set
of random values for each individual connection, our method uncovers combinations
of random weights that match the performance of traditionally-trained networks of
the same capacity. We referred to our networks as “slot machines” where each
reel (connection) contains a fixed set of symbols (random values). Our backpropa-
gation algorithm “spins” the reels to seek “winning” combinations, i.e., selections of
random weight values that minimize the given loss. Quite surprisingly, we find that
allocating just a few random values to each connection (e.g., 8 values per connection)
yields highly competitive combinations despite being dramatically more constrained
compared to traditionally learned weights. Moreover, finetuning these combinations
often improves performance over the trained baselines.

Chapter 7
Conclusion
In human beings, the visual world is essential in natural language acquisition. Il-
lustrations, plots, and gestures fundamentally augment our understanding of natural
language. Yet, so far in machine learning, efforts to ground natural language process-
ing in the visual domain beyond multimodal methods remain sparse. This is partly
because of the absence of vast amounts of information rich vision-language datasets of
high quality. In this thesis, we proposed a framework for leveraging joint-embedding
models pre-trained on weakly-labeled data to tackle this challenge and infuse visual
cues into natural language models. Additionally, we proposed methods for building
data-efficient vision-and-language models and for effectively integrating representations in multimodal learning. Finally, we presented a novel neural network
architecture that uses weight selection instead of gradient updates for optimization.
In Chapter 3, we presented VGLMs and MT-VGLMs, which proved effective for
guided language modeling. The first set of models employed an image encoder from
a pre-trained joint-embedding model and required image-paired data for training.
However, such datasets are expensive to collect in large quantities. Additionally, the
distributions of text in image captions datasets do not match the distributions of free-
form text corpora. Thus, building language models on image captions corpora may
not be ideal for general language understanding. To overcome these challenges asso-


ciated with using explicit visual information as the guidance source, we advanced
multimodal text guided language models, MT-GLMs, as an alternative. MT-GLMs
allowed us to pre-train language models on free-form text corpora of various sizes
resulting in consistent improvements on multiple benchmarks. Our investigations
showed that using a multimodal text encoder is important as a regular language
model like BERT [39] did not yield any improvements over the baseline unguided
model.
After demonstrating the importance of vision-language models for integrating in-
formation about the visual world into language model pre-training, in Chapter 4 we
introduced a method for training data efficient vision-language contrastive models
called CLIP-C. CLIP-C adapted the CutMix [175] augmentation strategy to the do-
main of vision and language models, showing that semantic compositions of multiple
image-caption pairs significantly enhance the effectiveness of language-supervised vi-
sual representation learning models, especially when the training set is small. The
composition strategy we proposed is fast and straightforward to implement requir-
ing no additional parameters or floating point operations relative to the baseline
CLIP [136] model. Comprehensive analysis showed that the augmented samples our
method created regularized the contrastive learning loss and transferred competi-
tively to downstream tasks in a zero-shot setting. We verified experimentally that
the observed performance improvements were not due to elevated levels of data aug-
mentations. The strategic use of semantically distinct examples in compositions and
dynamic sampling of examples emerged as essential components of our framework.
We are hopeful our research in Chapter 4 will encourage more exploration on novel
and efficient uses of small-scale datasets for vision-language pre-training.
In Chapter 5, we introduced Compound Tokens, a multimodal fusion method
for vision-and-language representation learning. Compound Tokens are generated


by concatenating image and text features along the channel dimension in contrast
to the sequence dimension employed in existing systems [42]. We leveraged cross-
modal cross-attention to ensure that the tokens concatenated together in Compound
Tokens are complementary. This novel fusion method outperformed competitive
multimodal models such as ALBEF [99] and METER [42] across multiple multimodal
tasks. Numerous empirical evaluations demonstrated our fusion method is better
than the prior two approaches: merged attention and co-attention. We consistently
outperformed these standard methods with and without pre-training on image-text
pairs, across different image resolutions and image encoders. The work in this chapter
aligns with our method in Chapter 4 in that they both proposed mechanisms for pre-
training more effective multimodal models. These efforts in turn feed into our work
in Chapter 3 where multimodal models are exploited for language grounding in the
visual world.
Finally, in Chapter 6, we proposed a novel neural network architecture called Slot Machines that is trained via selection of randomly initialized parameters, in contrast to the continuous parameter updates used in traditional models. In Slot Machines, each neural connection is restricted to take a value from a finite set of size K where the elements of the set are drawn from a random distribution such as Glorot Uniform [50]. In comparison, because of parameter updates through gradient descent, neural connections in traditional neural networks can assume any value in the continuous domain. Thus, relative to the canonical networks, Slot Machines are severely constrained. Yet, we showed our novel networks are competitive with traditional models even with a minimal number of values, e.g., K = 2, given a good selection strategy. We proposed two weight selection methods: (1) a greedy selection criterion where the weight corresponding to the highest associated score is chosen, and (2) a probabilistic selection strategy where we sample weights from a distribution of their


scores. Greedy selection emerged as the more effective selection procedure, consistently producing effective and strong weight configurations. We
also demonstrated that selected configurations are good initialization checkpoints
for finetuning, leading to accuracy gains over training the network from scratch at
equivalent computational cost.

Bibliography

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya,
Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman,
Shyamal Anadkat, et al., Gpt-4 technical report, 2023.

[2] Eneko Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, and Weiwei Guo,
*SEM 2013 shared task: Semantic textual similarity, Second Joint Conference
on Lexical and Computational Semantics (*SEM), Volume 1: Proceedings of the
Main Conference and the Shared Task: Semantic Textual Similarity (Atlanta,
Georgia, USA) (Mona Diab, Tim Baldwin, and Marco Baroni, eds.), Association
for Computational Linguistics, June 2013, pp. 32–43.

[3] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr,
Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds,
Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina
Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew
Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo
Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan, Flamingo:
a visual language model for few-shot learning, 2022.


[4] Morris Alper, Michael Fiman, and Hadar Averbuch-Elor, Is bert blind? ex-
ploring the effect of vision-and-language pretraining on visual language under-
standing, Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2023.

[5] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson,
Stephen Gould, and Lei Zhang, Bottom-up and top-down attention for image
captioning and visual question answering, CVPR, 2018.

[6] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Ba-
tra, C. Lawrence Zitnick, and Devi Parikh, Vqa: Visual question answering,
Proceedings of the IEEE International Conference on Computer Vision (ICCV),
2015.

[7] Jason Baldridge, Jakob Bauer, Mukul Bhutani, Nicole Brichtova, Andrew Bun-
ner, Kelvin Chan, Yichang Chen, Sander Dieleman, Yuqing Du, Zach Eaton-
Rosen, et al., Imagen 3, arXiv preprint arXiv:2408.07009 (2024).

[8] Emily M. Bender and Alexander Koller, Climbing towards NLU: On meaning,
form, and understanding in the age of data, Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics (Online) (Dan Ju-
rafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, eds.), Association for
Computational Linguistics, July 2020, pp. 5185–5198.

[9] Yoshua Bengio, Nicholas Léonard, and Aaron Courville, Estimating or propa-
gating gradients through stochastic neurons for conditional computation, 2013.

[10] James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, Wesam Manassra, Prafulla Dhariwal, Casey Chu, Yunxin Jiao, and Aditya Ramesh, Improving image generation with better captions.

[11] Yonatan Bisk, Ari Holtzman, Jesse Thomason, Jacob Andreas, Yoshua Bengio,
Joyce Chai, Mirella Lapata, Angeliki Lazaridou, Jonathan May, Aleksandr Nis-
nevich, Nicolas Pinto, and Joseph Turian, Experience grounds language, Pro-
ceedings of the 2020 Conference on Empirical Methods in Natural Language
Processing (EMNLP) (Online) (Bonnie Webber, Trevor Cohn, Yulan He, and
Yang Liu, eds.), Association for Computational Linguistics, November 2020,
pp. 8718–8735.

[12] Paul Bloom, How children learn the meanings of words, MIT press, 2002.

[13] Patrick Bordes, Eloi Zablocki, Laure Soulier, Benjamin Piwowarski, and Patrick
Gallinari, Incorporating visual semantics into sentence representations within a
grounded space, Proceedings of the 2019 Conference on Empirical Methods in
Natural Language Processing and the 9th International Joint Conference on
Natural Language Processing (EMNLP-IJCNLP) (Hong Kong, China) (Ken-
taro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan, eds.), Association for
Computational Linguistics, November 2019, pp. 696–707.

[14] Leo Breiman, Jerome Friendman, Charles J. Stone, and R. A Olstein, Classi-
fication and regression trees., Wadsworth & Brooks/Cole Advanced Books &
Software., Monterey, CA, 1984.

[15] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Ka-
plan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry,
Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger,
Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu,
Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin,


Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish,
Alec Radford, Ilya Sutskever, and Dario Amodei, Language models are few-shot
learners, Advances in Neural Information Processing Systems, vol. 33, 2020.

[16] Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki, and Desmond Elliott,
Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework
of Vision-and-Language BERTs, Transactions of the Association for Computa-
tional Linguistics 9 (2021), 978–994.

[17] Han Cai, Chuang Gan, Tianzhe Wang, Zhekai Zhang, and Song Han, Once-for-
all: Train one network and specialize it for efficient deployment, 2020.

[18] Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, and Jingjing Liu,
Behind the scene: Revealing the secrets of pre-trained vision-and-language mod-
els, Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
August 23–28, 2020, Proceedings, Part VI 16, Springer, 2020, pp. 565–580.

[19] Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski,
and Armand Joulin, Unsupervised learning of visual features by contrast-
ing cluster assignments, Advances in Neural Information Processing Systems
(H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, eds.), vol. 33,
2020, pp. 9912–9924.

[20] Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia,
SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual
focused evaluation, Proceedings of the 11th International Workshop on Seman-
tic Evaluation (SemEval-2017) (Vancouver, Canada) (Steven Bethard, Marine
Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David
Jurgens, eds.), Association for Computational Linguistics, August 2017, pp. 1–
14.


[21] Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut, Conceptual
12M: Pushing web-scale image-text pre-training to recognize long-tail visual con-
cepts, CVPR, 2021.

[22] ChatGPT, https://openai.com/blog/chatgpt.

[23] Cheng Chen, Yudong Zhu, Zhenshan Tan, Qingrong Cheng, Xin Jiang, Qun
Liu, and Xiaodong Gu, Utc: A unified transformer with inter-task contrastive
learning for visual dialog, Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), 2022.

[24] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton, A
simple framework for contrastive learning of visual representations, Proceedings
of the 37th International Conference on Machine Learning (Hal Daumé III and
Aarti Singh, eds.), Proceedings of Machine Learning Research, vol. 119, PMLR,
13–18 Jul 2020, pp. 1597–1607.

[25] Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski,


Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer,
Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Ak-
bari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, We-
icheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carlos
Riquelme, Andreas Steiner, Anelia Angelova, Xiaohua Zhai, Neil Houlsby, and
Radu Soricut, Pali: A jointly-scaled multilingual language-image model, 2022.

[26] Xinlei Chen and Kaiming He, Exploring simple siamese representation learn-
ing, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), 2021, pp. 15745–15753.


[27] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe
Gan, Yu Cheng, and Jingjing Liu, Uniter: Universal image-text representation
learning, ECCV, 2020.

[28] Gong Cheng, Junwei Han, and Xiaoqiang Lu, Remote sensing image scene
classification: Benchmark and state of the art, Proceedings of the IEEE 105
(2017), no. 10, 1865–1883.

[29] Jaemin Cho, Jie Lei, Hao Tan, and Mohit Bansal, Unifying vision-and-language
tasks via text generation, Proceedings of the 38th International Conference on
Machine Learning (Marina Meila and Tong Zhang, eds.), Proceedings of Ma-
chine Learning Research, vol. 139, PMLR, 18–24 Jul 2021, pp. 1931–1942.

[30] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gau-
rav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton,
Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua
Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam M. Shazeer, Vinodkumar
Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Brad-
bury, Jacob Austin, Michael Isard, Guy Gur-Ari, Pengcheng Yin, Toju Duke,
Anselm Levskaya, Sanjay Ghemawat, Sunipa Dev, Henryk Michalewski, Xavier
García, Vedant Misra, Kevin Robinson, Liam Fedus, Denny Zhou, Daphne
Ippolito, David Luan, Hyeontaek Lim, Barret Zoph, Alexander Spiridonov,
Ryan Sepassi, David Dohan, Shivani Agrawal, Mark Omernick, Andrew M.
Dai, Thanumalayan Sankaranarayana Pillai, Marie Pellat, Aitor Lewkowycz,
Erica Moreira, Rewon Child, Oleksandr Polozov, Katherine Lee, Zongwei Zhou,
Xuezhi Wang, Brennan Saeta, Mark Díaz, Orhan Firat, Michele Catasta, Jason
Wei, Kathleen S. Meier-Hellstern, Douglas Eck, Jeff Dean, Slav Petrov, and
Noah Fiedel, Palm: Scaling language modeling with pathways, 2022.


[31] M. Cimpoi, S. Maji, I. Kokkinos, S. Mohamed, and A. Vedaldi, Describing


textures in the wild, Proceedings of the IEEE Conf. on Computer Vision and
Pattern Recognition (CVPR), 2014.

[32] Adam Coates, Andrew Ng, and Honglak Lee, An Analysis of Single Layer Net-
works in Unsupervised Feature Learning, AISTATS, 2011.

[33] Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary,


Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettle-
moyer, and Veselin Stoyanov, Unsupervised cross-lingual representation learning
at scale, Proceedings of the 58th Annual Meeting of the Association for Compu-
tational Linguistics (Online) (Dan Jurafsky, Joyce Chai, Natalie Schluter, and
Joel Tetreault, eds.), Association for Computational Linguistics, July 2020,
pp. 8440–8451.

[34] Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine
Bordes, Supervised learning of universal sentence representations from natural
language inference data, Proceedings of the 2017 Conference on Empirical Meth-
ods in Natural Language Processing (Copenhagen, Denmark) (Martha Palmer,
Rebecca Hwa, and Sebastian Riedel, eds.), Association for Computational Lin-
guistics, September 2017, pp. 670–680.

[35] Ekin Dogus Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and
Quoc V. Le, Autoaugment: Learning augmentation policies from data, CVPR,
2019.

[36] Andrew M Dai and Quoc V Le, Semi-supervised sequence learning, Advances
in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015.


[37] Abhishek Das, Satwik Kottur, Khushi Gupta, Avi Singh, Deshraj Yadav,
José M.F. Moura, Devi Parikh, and Dhruv Batra, Visual Dialog, Proceedings of
the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
2017.

[38] Karan Desai, Gaurav Kaul, Zubin Trivadi Aysola, and Justin Johnson, Redcaps:
Web-curated image-text data created by the people, for the people, Thirty-fifth
Conference on Neural Information Processing Systems Datasets and Bench-
marks Track (Round 1), 2021.

[39] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, BERT:
Pre-training of deep bidirectional transformers for language understanding, Pro-
ceedings of the 2019 Conference of the North American Chapter of the Asso-
ciation for Computational Linguistics: Human Language Technologies, Volume
1 (Long and Short Papers), Association for Computational Linguistics, June
2019, pp. 4171–4186.

[40] Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Jingyuan Ma, Rui Li, Heming Xia,
Jingjing Xu, Zhiyong Wu, Tianyu Liu, et al., A survey on in-context learning,
2022.

[41] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi-
aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg
Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, An image is worth
16x16 words: Transformers for image recognition at scale, International Con-
ference on Learning Representations, 2021.

[42] Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan
Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng


Liu, and Michael Zeng, An empirical study of training end-to-end vision-and-


language transformers, Conference on Computer Vision and Pattern Recogni-
tion (CVPR), 2022.

[43] Desmond Elliott, Stella Frank, Khalil Sima’an, and Lucia Specia, Multi30K:
Multilingual English-German image descriptions, Proceedings of the 5th Work-
shop on Vision and Language (Berlin, Germany) (Anya Belz, Erkut Erdem,
Krystian Mikolajczyk, and Katerina Pastra, eds.), Association for Computa-
tional Linguistics, August 2016, pp. 70–74.

[44] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian, Im-
proving CLIP training with language rewrites, Thirty-seventh Conference on
Neural Information Processing Systems, 2023.

[45] Li Fei-Fei, Rob Fergus, and Pietro Perona, Learning generative visual models
from few training examples: An incremental bayesian approach tested on 101
object categories, CVPR Workshop (2004).

[46] Jonathan Frankle and Michael Carbin, The lottery ticket hypothesis: Finding
sparse, trainable neural networks, International Conference on Learning Repre-
sentations, 2019.

[47] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc' Au-
relio Ranzato, and Tomas Mikolov, Devise: A deep visual-semantic embed-
ding model, Advances in Neural Information Processing Systems (C.J. Burges,
L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, eds.), vol. 26,
2013.

[48] Adam Gaier and David Ha, Weight agnostic neural networks, Advances in Neu-
ral Information Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer,


F. d'Alché-Buc, E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc.,
2019, pp. 5364–5378.

[49] Gemini, https://gemini.google.com/app.

[50] Xavier Glorot and Yoshua Bengio, Understanding the difficulty of training deep
feedforward neural networks, Proceedings of the Thirteenth International Con-
ference on Artificial Intelligence and Statistics (Chia Laguna Resort, Sardinia,
Italy) (Yee Whye Teh and Mike Titterington, eds.), Proceedings of Machine
Learning Research, vol. 9, PMLR, 13–15 May 2010, pp. 249–256.

[51] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh,
Making the V in VQA matter: Elevating the role of image understanding in Vi-
sual Question Answering, Conference on Computer Vision and Pattern Recog-
nition (CVPR), 2017.

[52] Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary
Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan, Accelerate:
Training and inference at scale made simple, efficient and adaptable, https://github.com/huggingface/accelerate, 2022.

[53] K. He, X. Zhang, S. Ren, and J. Sun, Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification, 2015 IEEE International
Conference on Computer Vision (ICCV), 2015, pp. 1026–1034.

[54] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, Mask r-
cnn, 2017 IEEE International Conference on Computer Vision (ICCV), 2017,
pp. 2980–2988.


[55] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, Deep residual learn-
ing for image recognition, 2016 IEEE Conference on Computer Vision and Pat-
tern Recognition (CVPR), 2016, pp. 770–778.

[56] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth, Eu-
rosat: A novel dataset and deep learning benchmark for land use and land cover
classification, 2017.

[57] Lisa Anne Hendricks, John Mellor, Rosalia Schneider, Jean-Baptiste Alayrac,
and Aida Nematzadeh, Decoupling the role of data, attention, and losses in
multimodal transformers, Transactions of the Association for Computational
Linguistics 9 (2021), 570–585.

[58] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean, Distilling the knowledge
in a neural network, ArXiv abs/1503.02531 (2015).

[59] Elad Hoffer, Itay Hubara, and Daniel Soudry, Fix your classifier: the marginal
value of training the last weight layer, International Conference on Learning
Representations, 2018.

[60] Tao Hong, Xiangyang Guo, and Jinwen Ma, Itmix: Image-text mix augmenta-
tion for transferring clip to image classification, 2022 16th IEEE International
Conference on Signal Processing (ICSP), vol. 1, 2022, pp. 129–133.

[61] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay
Krishna, Sugarcrepe: Fixing hackable benchmarks for vision-language composi-
tionality, NeurIPS, 2023.

[62] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li,
Shean Wang, Lu Wang, and Weizhu Chen, LoRA: Low-rank adaptation of large
language models, International Conference on Learning Representations, 2022.


[63] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu
Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feicht-
enhofer, Demystifying clip data, 2023.

[64] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua
Bengio, Binarized neural networks, Advances in Neural Information Processing
Systems (D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, eds.),
vol. 29, Curran Associates, Inc., 2016.

[65] , Quantized neural networks: Training neural networks with low precision
weights and activations, The Journal of Machine Learning Research 18 (2017),
no. 1, 6869–6898.

[66] Drew A Hudson and Christopher D Manning, Gqa: A new dataset for real-
world visual reasoning and compositional question answering, Conference on
Computer Vision and Pattern Recognition (CVPR) (2019).

[67] Taichi Iki and Akiko Aizawa, Effect of visual extensions on natural language
understanding in vision-and-language models, Proceedings of the 2021 Confer-
ence on Empirical Methods in Natural Language Processing (Online and Punta
Cana, Dominican Republic) (Marie-Francine Moens, Xuanjing Huang, Lucia
Specia, and Scott Wen-tau Yih, eds.), Association for Computational Linguis-
tics, November 2021, pp. 2189–2196.

[68] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas
Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong,
John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt, Openclip,
July 2021.


[69] Julia Ive, Pranava Madhyastha, and Lucia Specia, Distilling translations with
visual awareness, Proceedings of the 57th Annual Meeting of the Association
for Computational Linguistics (Florence, Italy) (Anna Korhonen, David Traum,
and Lluís Màrquez, eds.), Association for Computational Linguistics, July 2019,
pp. 6525–6538.

[70] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham,
Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig, Scaling up visual and
vision-language representation learning with noisy text supervision, Proceedings
of the 38th International Conference on Machine Learning, ICML 2021 (Marina
Meila and Tong Zhang, eds.), Proceedings of Machine Learning Research, vol.
139, 2021, pp. 4904–4916.

[71] Huaizu Jiang, Ishan Misra, Marcus Rohrbach, Erik Learned-Miller, and Xin-
lei Chen, In defense of grid features for visual question answering, 2020
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
2020, pp. 10264–10273.

[72] Kenan Jiang, Xuehai He, Ruize Xu, and Xin Eric Wang, Comclip: Training-free
compositional image and text matching, 2023.

[73] Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache,
Learning visual features from large weakly supervised data, Computer Vision –
ECCV 2016 (Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, eds.),
Springer International Publishing, 2016, pp. 67–84.

[74] John M. Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael
Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Au-
gustin Žídek, Anna Potapenko, Alex Bridgland, Clemens Meyer, Simon A A
Kohl, Andy Ballard, Andrew Cowie, Bernardino Romera-Paredes, Stanislav


Nikolov, Rishub Jain, Jonas Adler, Trevor Back, Stig Petersen, David Reiman,
Ellen Clancy, Michal Zielinski, Martin Steinegger, Michalina Pacholska, Tamas
Berghammer, Sebastian Bodenstein, David Silver, Oriol Vinyals, Andrew W.
Senior, Koray Kavukcuoglu, Pushmeet Kohli, and Demis Hassabis, Highly ac-
curate protein structure prediction with alphafold, Nature 596 (2021), 583 –
589.

[75] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Syn-
naeve, and Nicolas Carion, Mdetr–modulated detection for end-to-end multi-
modal understanding, arXiv preprint arXiv:2104.12763 (2021).

[76] Douwe Kiela, Alexis Conneau, Allan Jabri, and Maximilian Nickel, Learning
visually grounded sentence representations, Proceedings of the 2018 Conference
of the North American Chapter of the Association for Computational Linguis-
tics: Human Language Technologies, Volume 1 (Long Papers) (New Orleans,
Louisiana) (Marilyn Walker, Heng Ji, and Amanda Stent, eds.), Association for
Computational Linguistics, June 2018, pp. 408–418.

[77] Wonjae Kim, Bokyung Son, and Ildoo Kim, Vilt: Vision-and-language trans-
former without convolution or region supervision, International Conference on
Machine Learning, 2021, pp. 5583–5594.

[78] Diederik P. Kingma and Jimmy Ba, Adam: A method for stochastic optimiza-
tion, ICLR, 2015.

[79] Ryan Kiros, Yukun Zhu, Russ R Salakhutdinov, Richard Zemel, Raquel Ur-
tasun, Antonio Torralba, and Sanja Fidler, Skip-thought vectors, Advances
in Neural Information Processing Systems (C. Cortes, N. Lawrence, D. Lee,
M. Sugiyama, and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015.


[80] Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush, and Yoav Artzi, What
is learned in visually grounded neural syntax acquisition, Proceedings of the 58th
Annual Meeting of the Association for Computational Linguistics (Online) (Dan
Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault, eds.), Association
for Computational Linguistics, July 2020, pp. 2615–2635.

[81] Satwik Kottur, José M. F. Moura, Devi Parikh, Dhruv Batra, and Marcus
Rohrbach, Visual coreference resolution in visual dialog using neural module
networks, The European Conference on Computer Vision (ECCV), 2018.

[82] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua
Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma,
Michael Bernstein, and Li Fei-Fei, Visual genome: Connecting language and
vision using crowdsourced dense image annotations, 2016.

[83] Alex Krizhevsky, Learning multiple layers of features from tiny images, Tech.
report, University of Toronto, 2009.

[84] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, Imagenet classification
with deep convolutional neural networks, Advances in Neural Information Pro-
cessing Systems (F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger,
eds.), vol. 25, Curran Associates, Inc., 2012, pp. 1097–1105.

[85] Weicheng Kuo, Yin Cui, Xiuye Gu, AJ Piergiovanni, and Anelia Angelova,
Open-vocabulary object detection upon frozen vision and language models, The
Eleventh International Conference on Learning Representations, 2023.

[86] Weicheng Kuo, AJ Piergiovanni, Dahun Kim, Xiyang Luo, Ben Caine, Wei
Li, Abhijit Ogale, Luowei Zhou, Andrew Dai, Zhifeng Chen, Claire Cui, and


Anelia Angelova, Mammut: A simple vision-encoder text-decoder architecture


for multimodal tasks, Transactions on Machine Learning Research (2023).

[87] Zhengfeng Lai, Haotian Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev,
Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, Yinfei Yang, and Meng
Cao, From scarcity to efficiency: Improving clip training via visual-enriched
captions, 2023.

[88] Angeliki Lazaridou, Adhi Kuncoro, Elena Gribovskaya, Devang Agrawal, Adam
Liska, Tayfun Terzi, Mai Gimenez, Cyprien de Masson d'Autume, Tomas
Kocisky, Sebastian Ruder, Dani Yogatama, Kris Cao, Susannah Young, and
Phil Blunsom, Mind the gap: Assessing temporal generalization in neural lan-
guage models, Advances in Neural Information Processing Systems (M. Ran-
zato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, eds.),
vol. 34, 2021, pp. 29348–29363.

[89] Angeliki Lazaridou, Nghia The Pham, and Marco Baroni, Combining language
and vision with a multimodal skip-gram model, Proceedings of the 2015 Con-
ference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies (Denver, Colorado) (Rada Mihal-
cea, Joyce Chai, and Anoop Sarkar, eds.), Association for Computational Lin-
guistics, May–June 2015, pp. 153–163.

[90] Quoc Le and Tomas Mikolov, Distributed representations of sentences and doc-
uments, Proceedings of the 31st International Conference on Machine Learning
(Bejing, China) (Eric P. Xing and Tony Jebara, eds.), vol. 32, Proceedings of
Machine Learning Research, no. 2, PMLR, 22–24 Jun 2014, pp. 1188–1196.


[91] Yann Lecun, Le’on Bottou, Yoshua Bengio, and Patrick Haffner, Gradient-based
learning applied to document recognition, Proceedings of the IEEE 86 (1998),
no. 11, 2278–2324.

[92] Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He, Stacked
cross attention for image-text matching, ECCV, 2018.

[93] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr,
A signal propagation perspective for pruning neural networks at initialization,
International Conference on Learning Representations, 2020.

[94] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr, SNIP: Single-shot
Network Pruning based on connection sensitivity, International Conference on
Learning Representations, 2019.

[95] Brian Lester, Rami Al-Rfou, and Noah Constant, The power of scale for
parameter-efficient prompt tuning, Proceedings of the 2021 Conference on Em-
pirical Methods in Natural Language Processing (Online and Punta Cana, Do-
minican Republic) (Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and
Scott Wen-tau Yih, eds.), Association for Computational Linguistics, November
2021, pp. 3045–3059.

[96] A. Li, A. Jabri, A. Joulin, and L. van der Maaten, Learning visual n-grams from
web data, 2017 IEEE International Conference on Computer Vision (ICCV),
IEEE Computer Society, 2017, pp. 4193–4202.

[97] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi, BLIP-2: bootstrapping
language-image pre-training with frozen image encoders and large language mod-
els, ICML, 2023.


[98] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi, Blip: Bootstrapping
language-image pre-training for unified vision-language understanding and gen-
eration, ICML, 2022.

[99] Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming
Xiong, and Steven Chu Hong Hoi, Align before fuse: Vision and language repre-
sentation learning with momentum distillation, Advances in Neural Information
Processing Systems (M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and
J. Wortman Vaughan, eds.), vol. 34, Curran Associates, Inc., 2021, pp. 9694–
9705.

[100] Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang,
Visualbert: A simple and performant baseline for vision and language, Arxiv,
2019.

[101] Wei Li, Can Gao, Guocheng Niu, Xinyan Xiao, Hao Liu, Jiachen Liu, Hua Wu,
and Haifeng Wang, UNIMO: Towards unified-modal understanding and gen-
eration via cross-modal contrastive learning, Proceedings of the 59th Annual
Meeting of the Association for Computational Linguistics and the 11th Inter-
national Joint Conference on Natural Language Processing (Volume 1: Long
Papers), Association for Computational Linguistics, 2021, pp. 2592–2607.

[102] Xiang Lisa Li and Percy Liang, Prefix-tuning: Optimizing continuous prompts
for generation, Proceedings of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Nat-
ural Language Processing (Volume 1: Long Papers) (Online) (Chengqing Zong,
Fei Xia, Wenjie Li, and Roberto Navigli, eds.), Association for Computational
Linguistics, August 2021, pp. 4582–4597.


[103] Xiujun Li, Xi Yin, Chunyuan Li, Xiaowei Hu, Pengchuan Zhang, Lei Zhang,
Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, Yejin Choi, and Jianfeng Gao,
Oscar: Object-semantics aligned pre-training for vision-language tasks, ECCV
2020 (2020).

[104] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao,
Fengwei Yu, and Junjie Yan, Supervision exists everywhere: A data efficient
contrastive language-image pre-training paradigm, International Conference on
Learning Representations, 2022.

[105] Zhuohan Li, Eric Wallace, Sheng Shen, Kevin Lin, Kurt Keutzer, Dan Klein,
and Joey Gonzalez, Train big, then compress: Rethinking model size for efficient
training and inference of transformers, International Conference on Machine
Learning, 2020, pp. 5958–5968.

[106] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva
Ramanan, Piotr Dollár, and C. Lawrence Zitnick, Microsoft coco: Common
objects in context, Computer Vision – ECCV 2014 (David Fleet, Tomas Pajdla,
Bernt Schiele, and Tinne Tuytelaars, eds.), 2014, pp. 740–755.

[107] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen,
Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov, Ro{bert}a:
A robustly optimized {bert} pretraining approach, 2020.

[108] Ilya Loshchilov and Frank Hutter, SGDR: Stochastic gradient descent with warm
restarts, International Conference on Learning Representations, 2017.

[109] , Decoupled weight decay regularization, ICML, 2019.

[110] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee, Vilbert: Pretraining task-
agnostic visiolinguistic representations for vision-and-language tasks, Advances


in Neural Information Processing Systems, vol. 32, Curran Associates, Inc.,


2019.

[111] Jiasen Lu, Vedanuj Goswami, Marcus Rohrbach, Devi Parikh, and Stefan
Lee, 12-in-1: Multi-task vision and language representation learning, The
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
June 2020.

[112] Zhou Luowei, Palangi Hamid, Zhang Lei, Hu Houdong, Corso Jason J., and
Jianfeng Gao, Unified vision-language pre-training for image captioning and
vqa, arXiv preprint arXiv:1909.11059 (2019).

[113] Hartmut Maennel, Ibrahim Alabdulmohsin, Ilya Tolstikhin, Robert J. N. Bal-


dock, Olivier Bousquet, Sylvain Gelly, and Daniel Keysers, What do neural
networks learn when trained with random labels?, 2020.

[114] Eran Malach, Gilad Yehudai, Shai Shalev-Schwartz, and Ohad Shamir, Proving
the lottery ticket hypothesis: Pruning is all you need, Proceedings of the 37th
International Conference on Machine Learning (Hal Daumé III and Aarti Singh,
eds.), Proceedings of Machine Learning Research, vol. 119, PMLR, 13–18 Jul
2020, pp. 6682–6691.

[115] Ilaria Manco, Emmanouil Benetos, Elio Quinton, and György Fazekas, Con-
trastive audio-language learning for music, 2022.

[116] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille,
and Kevin Murphy, Generation and comprehension of unambiguous object de-
scriptions, CVPR, 2016.

[117] Marco Marelli, Stefano Menini, Marco Baroni, Luisa Bentivogli, Raffaella
Bernardi, and Roberto Zamparelli, A SICK cure for the evaluation of compo-
sitional distributional semantic models, Proceedings of the Ninth International
Conference on Language Resources and Evaluation (LREC'14) (Reykjavik, Ice-
land) (Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Hrafn Loftsson,
Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and Stelios
Piperidis, eds.), European Language Resources Association (ELRA), May 2014,
pp. 216–223.

[118] Cynthia Matuszek, Grounded language learning: where robotics and nlp meet,
Proceedings of the 27th International Joint Conference on Artificial Intelligence,
IJCAI’18, AAAI Press, 2018, p. 5687–5691.

[119] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher, Pointer
sentinel mixture models, International Conference on Learning Representations,
2017.

[120] Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean, Efficient
estimation of word representations in vector space, International Conference on
Learning Representations, 2013.

[121] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie, Slip: Self-
supervision meets language-image pre-training, arXiv:2112.12750 (2021).

[122] Masayasu Muraoka, Bishwaranjan Bhattacharjee, Michele Merler, Graeme
Blackwood, Yulong Li, and Yang Zhao, Cross-lingual transfer of large language
model by visually-derived supervision toward low-resource languages, Proceed-
ings of the 31st ACM International Conference on Multimedia (New York, NY,
USA), MM '23, Association for Computing Machinery, 2023, pp. 3637–3646.

[123] Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van
Gool, and Federico Tombari, Silc: Improving vision language pretraining with
self-distillation, 2023.

[124] Binh X. Nguyen, Tuong Do, Huy Tran, Erman Tjiputra, Quang D. Tran, and
Anh Nguyen, Coarse-to-fine reasoning for visual question answering, Multi-
modal Learning and Applications (MULA) Workshop, CVPR, 2022.

[125] Duy-Kien Nguyen and Takayuki Okatani, Multi-task learning of hierarchical
vision-language representation, CVPR, 2019.

[126] Vicente Ordonez, Girish Kulkarni, and Tamara Berg, Im2text: Describing im-
ages using 1 million captioned photographs, Advances in Neural Information
Processing Systems (J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and
K.Q. Weinberger, eds.), vol. 24, Curran Associates, Inc., 2011.

[127] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright,
Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray,
John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens,
Amanda Askell, Peter Welinder, Paul F Christiano, Jan Leike, and Ryan Lowe,
Training language models to follow instructions with human feedback, Advances
in Neural Information Processing Systems (S. Koyejo, S. Mohamed, A. Agar-
wal, D. Belgrave, K. Cho, and A. Oh, eds.), vol. 35, Curran Associates, Inc.,
2022, pp. 27730–27744.

[128] O. M. Parkhi, A. Vedaldi, A. Zisserman, and C. V. Jawahar, Cats and dogs,
IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[129] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gre-
gory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Al-
ban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison,
Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and
Soumith Chintala, Pytorch: An imperative style, high-performance deep learn-
ing library, Advances in Neural Information Processing Systems (H. Wallach,
H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, eds.),
vol. 32, Curran Associates, Inc., 2019, pp. 8026–8037.

[130] Jeffrey Pennington, Richard Socher, and Christopher Manning, GloVe: Global
vectors for word representation, Proceedings of the 2014 Conference on Em-
pirical Methods in Natural Language Processing (EMNLP) (Doha, Qatar)
(Alessandro Moschitti, Bo Pang, and Walter Daelemans, eds.), Association for
Computational Linguistics, October 2014, pp. 1532–1543.

[131] Ankit Pensia, Shashank Rajput, Alliot Nagle, Harit Vishwakarma, and Dim-
itris Papailiopoulos, Optimal lottery tickets via subsetsum: Logarithmic over-
parameterization is sufficient, Advances in Neural Information Processing Sys-
tems, vol. 33, 2020.

[132] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher
Clark, Kenton Lee, and Luke Zettlemoyer, Deep contextualized word represen-
tations, Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics: Human Language Technolo-
gies, Volume 1 (Long Papers) (New Orleans, Louisiana) (Marilyn Walker, Heng
Ji, and Amanda Stent, eds.), Association for Computational Linguistics, June
2018, pp. 2227–2237.

[133] AJ Piergiovanni, Wei Li, Weicheng Kuo, Mohammad Saffar, Fred Bertsch, and
Anelia Angelova, Answer-me: Multi-task open-vocabulary visual question an-
swering, European Conference on Computer Vision (ECCV), 2022.

[134] AJ Piergiovanni, Kairo Morton, Weicheng Kuo, Michael S. Ryoo, and Anelia
Angelova, Video question answering with iterative video-text co-tokenization,
ECCV, 2022.

[135] Ariadna Quattoni, Michael Collins, and Trevor Darrell, Learning visual rep-
resentations using images with captions, 2007 IEEE Conference on Computer
Vision and Pattern Recognition, 2007, pp. 1–8.

[136] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh,
Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark,
Gretchen Krueger, and Ilya Sutskever, Learning transferable visual models from
natural language supervision, 2021.

[137] Alec Radford and Karthik Narasimhan, Improving language understanding by
generative pre-training, 2018.

[138] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya
Sutskever, Language models are unsupervised multitask learners, 2019.

[139] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang,
Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu, Exploring the limits
of transfer learning with a unified text-to-text transformer, Journal of Machine
Learning Research 21 (2020), no. 140, 1–67.

[140] Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, and
Mohammad Rastegari, What’s hidden in a randomly weighted neural network?,
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition (CVPR), June 2020.

[141] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi,
Xnor-net: Imagenet classification using binary convolutional neural networks,
Computer Vision – ECCV 2016, 2016, pp. 525–542.

[142] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, Faster r-cnn: Towards
real-time object detection with region proposal networks, Advances in Neural
Information Processing Systems (C. Cortes, N. Lawrence, D. Lee, M. Sugiyama,
and R. Garnett, eds.), vol. 28, Curran Associates, Inc., 2015, pp. 91–99.

[143] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and
Björn Ommer, High-resolution image synthesis with latent diffusion models,
2021.

[144] Amir Rosenfeld and John K. Tsotsos, Intriguing properties of randomly weighted
networks: Generalizing while learning next to nothing, 2019 16th Conference on
Computer and Robot Vision (CRV), 2019, pp. 9–16.

[145] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiao-
qing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al.,
Code llama: Open foundation models for code, 2023.

[146] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh,
Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,
Alexander C. Berg, and Li Fei-Fei, ImageNet: A Large-Scale Hierarchical Image
Database, CVPR, 2009.

[147] Michael Ryoo, AJ Piergiovanni, Anurag Arnab, Mostafa Dehghani, and Anelia
Angelova, Tokenlearner: Adaptive space-time tokenization for videos, Advances
in Neural Information Processing Systems (M. Ranzato, A. Beygelzimer,
Y. Dauphin, P.S. Liang, and J. Wortman Vaughan, eds.), vol. 34, Curran
Associates, Inc., 2021, pp. 12786–12797.

[148] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L
Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan,
Tim Salimans, et al., Photorealistic text-to-image diffusion models with deep
language understanding, Advances in neural information processing systems 35
(2022), 36479–36494.

[149] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut, Conceptual
captions: A cleaned, hypernymed, image alt-text dataset for automatic image
captioning, Proceedings of ACL, 2018.

[150] Haoyue Shi, Jiayuan Mao, Kevin Gimpel, and Karen Livescu, Visually grounded
neural syntax acquisition, Proceedings of the 57th Annual Meeting of the Asso-
ciation for Computational Linguistics (Florence, Italy) (Anna Korhonen, David
Traum, and Lluís Màrquez, eds.), Association for Computational Linguistics,
July 2019, pp. 1842–1861.

[151] K. O. Stanley and R. Miikkulainen, Evolving neural networks through augment-
ing topologies, Evolutionary Computation 10 (2002), no. 2, 99–127.

[152] Alane Suhr, Mike Lewis, James Yeh, and Yoav Artzi, A corpus of natural lan-
guage for visual reasoning, Proceedings of the 55th Annual Meeting of the As-
sociation for Computational Linguistics (Volume 2: Short Papers), Association
for Computational Linguistics, 2017, pp. 217–223.

[153] Hao Tan and Mohit Bansal, Lxmert: Learning cross-modality encoder repre-
sentations from transformers, Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing, 2019.

[154] Hao Tan and Mohit Bansal, Vokenization: Improving language understanding
with contextualized, visual-grounded supervision, Proceedings of the 2020 Con-
ference on Empirical Methods in Natural Language Processing (EMNLP) (On-
line) (Bonnie Webber, Trevor Cohn, Yulan He, and Yang Liu, eds.), Association
for Computational Linguistics, November 2020, pp. 2066–2080.

[155] Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, and Surya Ganguli, Prun-
ing neural networks without any data by iteratively conserving synaptic flow,
Advances in Neural Information Processing Systems, vol. 33, 2020.

[156] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre
Sablayrolles, and Herve Jegou, Training data-efficient image transformers &
distillation through attention, Proceedings of the 38th International Conference
on Machine Learning (Marina Meila and Tong Zhang, eds.), Proceedings of
Machine Learning Research, vol. 139, PMLR, 18–24 Jul 2021, pp. 10347–10357.

[157] Emily T. Troscianko, The cognitive realism of memory in Flaubert's Madame
Bovary, The Modern Language Review 107 (2012), no. 3, 772–795.

[158] Aaron van den Oord, Yazhe Li, and Oriol Vinyals, Representation learning with
contrastive predictive coding, 2019.

[159] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,
Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, Attention is all you
need, Advances in Neural Information Processing Systems (I. Guyon, U. V.
Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
eds.), vol. 30, Curran Associates, Inc., 2017, pp. 5998–6008.

[160] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and
Samuel R. Bowman, GLUE: A multi-task benchmark and analysis platform for
natural language understanding, International Conference on Learning Repre-
sentations, 2019.

[161] Chaoqi Wang, Guodong Zhang, and Roger Grosse, Picking winning tickets be-
fore training by preserving gradient flow, International Conference on Learning
Representations, 2020.

[162] P. Wang, Q. Hu, Y. Zhang, C. Zhang, Y. Liu, and J. Cheng, Two-step quan-
tization for low-bit neural networks, 2018 IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2018, pp. 4376–4384.

[163] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu,
Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, and
Furu Wei, Image as a foreign language: Beit pretraining for all vision and
vision-language tasks, 2022.

[164] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan
Cao, Simvlm: Simple visual language model pretraining with weak supervision,
International Conference on Learning Representations (ICLR), 2022.

[165] Jason Wei and Kai Zou, EDA: Easy data augmentation techniques for boosting
performance on text classification tasks, Proceedings of the 2019 Conference
on Empirical Methods in Natural Language Processing and the 9th Interna-
tional Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Association for Computational Linguistics, November 2019, pp. 6382–6388.

[166] Thomas Wolf et al., HuggingFace's transformers: State-of-the-art natural lan-
guage processing, 2019.

[167] Bichen Wu, Ruizhe Cheng, Peizhao Zhang, Tianren Gao, Joseph E. Gonzalez,
and Peter Vajda, Data efficient language-supervised zero-shot recognition with
optimal transport distillation, International Conference on Learning Represen-
tations, 2022.

[168] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, Sun database:
Large-scale scene recognition from abbey to zoo, 2010 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, June 2010, pp. 3485–
3492.

[169] Ning Xie, Farley Lai, Derek Doran, and Asim Kadav, Visual entail-
ment: A novel task for fine-grained image understanding, arXiv preprint
arXiv:1901.06706 (2019).

[170] Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan,
Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer, VideoCLIP:
Contrastive pre-training for zero-shot video-text understanding, Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing,
Association for Computational Linguistics, November 2021, pp. 6787–6800.

[171] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson, How transferable
are features in deep neural networks?, Advances in Neural Information Process-
ing Systems (Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K. Q.
Weinberger, eds.), vol. 27, Curran Associates, Inc., 2014, pp. 3320–3328.

[172] Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier, From image
descriptions to visual denotations: New similarity metrics for semantic infer-
ence over event descriptions, Transactions of the Association for Computational
Linguistics 2 (2014), 67–78.

[173] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini,
and Yonghui Wu, Coca: Contrastive captioners are image-text foundation mod-
els, Transactions on Machine Learning Research, 2022.

[174] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and
James Zou, When and why vision-language models behave like bags-of-words,
and what to do about it?, The Eleventh International Conference on Learning
Representations, 2023.

[175] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe,
and Youngjoon Yoo, Cutmix: Regularization strategy to train strong classifiers
with localizable features, ICCV, 2019.

[176] Éloi Zablocki, Benjamin Piwowarski, Laure Soulier, and Patrick Gallinari,
Learning Multi-Modal Word Representation Grounded in Visual Context, As-
sociation for the Advancement of Artificial Intelligence (AAAI) (New Orleans,
United States), February 2018.

[177] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi, From recognition to
cognition: Visual commonsense reasoning, The IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), June 2019.

[178] Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers,
Alexander Kolesnikov, and Lucas Beyer, Lit: Zero-shot transfer with locked-
image text tuning, Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), June 2022, pp. 18123–18133.

[179] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz,
mixup: Beyond empirical risk minimization, International Conference on Learn-
ing Representations, 2018.

[180] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan
Wang, Yejin Choi, and Jianfeng Gao, Vinvl: Revisiting visual representations
in vision-language models, Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR), June 2021, pp. 5579–5588.

[181] Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita,
Zuchao Li, and Hai Zhao, Neural machine translation with universal visual
representation, International Conference on Learning Representations, 2020.

[182] Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar, Learning video
representations from large language models, arXiv:2212.04501, 2022.

[183] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Li-
unian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, et al., Regionclip:
Region-based language-image pretraining, Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition, 2022, pp. 16793–16803.

[184] Hattie Zhou, Janice Lan, Rosanne Liu, and Jason Yosinski, Deconstructing
lottery tickets: Zeros, signs, and the supermask, Advances in Neural Information
Processing Systems (H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, and R. Garnett, eds.), vol. 32, Curran Associates, Inc., 2019, pp. 3597–
3607.

[185] Barret Zoph and Quoc V. Le, Neural architecture search with reinforcement
learning, International Conference on Learning Representations, 2017.
