
Eyes Wide Shut?

Exploring the Visual Shortcomings of Multimodal LLMs

Shengbang Tong1   Zhuang Liu   Yuexiang Zhai2   Yi Ma2   Yann LeCun1   Saining Xie1
1 New York University   2 UC Berkeley

arXiv:2401.06209v1 [cs.CV] 11 Jan 2024

[Figure 1 shows ten visual question answering examples in which GPT-4V answers incorrectly and offers hallucinated explanations. The questions include: "Is the dog facing left or right from the camera's perspective?", "Do you see any window in this image?", "In this image, how many eyes can you see on the animal?", "How many wheels can you see in the image?", "Is the door of the truck cab open?", "Does the keyboard have a backlight?", "From the camera's perspective, is the piano's back panel on the left or right side?", "Is the edge of the heart in the picture white or dark-colored?", "Is the school bus parked facing the camera or away from the camera?", and "Are the butterfly's feet visible?". The model's incorrect answers and explanations are omitted here.]

Figure 1. Instances are systematically identified where the visual question answering (VQA) capabilities of GPT-4V [41] fall short
(Date accessed: Nov 04, 2023). Our research highlights scenarios in which advanced systems like GPT-4V struggle with seemingly
simple questions due to inaccurate visual grounding. Text in red signifies an incorrect response, while text in green represents hallucinated
explanations for the incorrect answer. All the images referenced are sourced from ImageNet-1K and LAION-Aesthetic datasets.

Abstract

Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However, the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent multimodal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors, we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" – images that CLIP perceives as similar despite their clear visual differences. With these pairs, we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems, including GPT-4V, struggle with straightforward questions across nine basic visual patterns, often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues, we propose a Mixture of Features (MoF) approach, demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. Together, our research suggests visual representation learning remains an open challenge, and accurate visual grounding is crucial for future successful multimodal systems.

1. Introduction

Multimodal Large Language Models (MLLMs) [8, 13, 31, 40] have been rapidly developing in recent times. MLLMs integrate images into large language models (LLMs) and leverage the powerful abilities of LLMs [41, 59, 69], showcasing remarkable proficiency in tasks such as image understanding, visual question answering, and instruction following. In particular, the recently released GPT-4V(ision) [40] has pushed performance to an unprecedented level [41, 63]. Beneath the advancements of these models, we find there exists a notable weakness: they still exhibit visual shortcomings, some of which are surprisingly elementary and evident (see Figure 1). We ask: Where do these problems originate? Is it a deficiency in visual modality, language understanding, or their alignment? In this work, we suggest that these shortcomings observed in MLLMs might stem from a problem related to the visual representations.

At their core, most MLLMs [8, 31, 71] are built on pretrained vision [43, 54] and language [59, 68, 69] models. These models are connected using various types of adapters [2, 26, 31] to integrate the different modalities. A natural hypothesis is that any limitation of the pretrained vision models can cascade into the downstream MLLMs that adopt them. Studies have explored a similar issue for language. For example, Tong et al. [57] and Yuksekgonul et al. [65] demonstrate that failure patterns in the pretrained text encoder [43, 44] lead to downstream failures in text-guided generative models [22, 46].

On the vision side, most open-source MLLMs [2, 26, 31] adopt the pretrained Contrastive Language-Image Pre-Training (CLIP) model [43] as the visual encoder. We begin by identifying failure examples that CLIP struggles to encode properly (Section 2). Inspired by Tong et al. [57], we exploit erroneous agreements in the embedding space: if two visually different images are encoded similarly by CLIP, then at least one of the images is likely ambiguously encoded. We call such a pair of images a CLIP-blind pair. To measure the visual similarity between images, we use a vision-only self-supervised encoder such as DINOv2 [42]. In this context, CLIP-blind pairs are images with similar CLIP embeddings but different DINOv2 embeddings.

We discover that these CLIP-blind pairs indeed lead to errors in downstream MLLMs. With these pairs, we introduce the MultiModal Visual Patterns (MMVP) benchmark. This benchmark is specifically designed to inquire about differences in CLIP-blind pairs and to evaluate the visual abilities of state-of-the-art MLLMs with straightforward questions. We evaluate a variety of open-source [8, 30, 31, 71] and closed-source models [13, 41] including GPT-4V [40], and conduct a user study to measure human performance. The results show that MLLMs struggle with straightforward visual questions. Most of these models perform below the level of random guessing, with GPT-4V being the exception. Yet even GPT-4V exhibits a considerable disparity in performance – exceeding 50% – compared to human performance.

Having identified a large number of individual failure instances in MLLMs, we continue to study the systematic visual patterns in MMVP with which CLIP models struggle (Section 3). We summarize nine prevalent patterns of the CLIP-blind pairs in MMVP, such as "orientation", "counting", and "viewpoint", which pose significant challenges for the CLIP vision encoder. Notice that there has been significant and ongoing progress in scaling up both training data and model size for CLIP [10, 43, 54, 62, 66]. We categorize examples from MMVP into visual patterns to systematically assess whether scaling alone can mitigate these challenges. Our findings suggest that 7 out of the 9 identified visual patterns cannot be resolved by any large-scale CLIP-based model, indicating that model/data scaling alone is not sufficient. Moreover, we identify a strong correlation between the visual patterns that challenge CLIP models and the performance of MLLMs: if CLIP struggles with a particular visual pattern, such as "orientation", MLLMs will likely also fall short. This shows that the CLIP vision encoders could become a bottleneck in such systems.

Finally, we take a step towards improving the visual grounding of MLLMs. Since the visual shortcomings of MLLMs stem from their reliance on the CLIP model, we investigate the impact of integrating vision-centric representations into MLLMs (Section 4). Specifically, we explore ways to incorporate a vision-only self-supervised model, such as DINOv2 [42], to enhance the visual grounding capabilities of MLLMs. We refer to these techniques as Mixture-of-Features (MoF). First, we linearly mix CLIP and DINOv2 features in different ratios, which we refer to as Additive-MoF (A-MoF). This process reveals that DINOv2 features are more effective for visual grounding, though they come at the cost of diminished instruction-following ability. To address this, we introduce Interleaved-MoF (I-MoF), which spatially mixes visual tokens from both the CLIP and DINOv2 models. We find that this practice significantly enhances visual grounding while maintaining the instruction-following capabilities.

2. The Multimodal Visual Patterns (MMVP) Benchmark

Currently, the majority of open-source MLLMs [8, 31, 71] use off-the-shelf CLIP vision encoders to process images. In this section, we begin by identifying CLIP-blind pairs in the CLIP model (Section 2.1). Subsequently, we construct the Multimodal Visual Patterns-MLLM (MMVP-MLLM) benchmark using these CLIP-blind pairs (Section 2.2). We evaluate SOTA MLLMs including GPT-4V on the benchmark (Section 2.3) and find that all the tested models struggle with simple questions about visual details. A visualization of this process is provided in Figure 2.
[Figure 2 depicts the three-step pipeline. Step 1 – Finding CLIP-blind pairs: discover image pairs that are proximate in CLIP feature space but distant in DINOv2 feature space (in the illustrated example, SimCLIP = 0.95 while SimDINO = 0.58). Step 2 – Spotting the difference between two images: for a CLIP-blind pair, a human annotator spots the visual differences and formulates questions and options for both images, e.g., "Where is the yellow animal's head lying in this image? (a) Floor (b) Carpet". Step 3 – Benchmarking multimodal LLMs: each MLLM is asked the question alongside each image of the CLIP-blind pair, and receives a score only when both predictions for the pair are correct.]
Figure 2. Constructing MMVP benchmark via CLIP-blind pairs. Left: We start with finding CLIP-blind pairs that have similar CLIP
embedding but different DINOv2 embedding. Center: We manually inspect the differences between pair-wise images and formulate
questions based on the differences in the images. Right: We ask MLLMs the question alongside the CLIP-blind pair. The model receives
a score only when both questions for the CLIP-blind pair are answered correctly.

2.1. Finding CLIP-blind Pairs

It is challenging to directly find instances (images) that the CLIP vision encoder struggles to encode "properly". To circumvent this issue, we extend the idea proposed in Tong et al. [57] to automatically find blind pairs in vision models. The underlying principle is simple: if two images, despite having stark visual differences, are encoded similarly by the CLIP vision encoder, then one of them is likely encoded ambiguously (see Figure 2, left, for an example). To measure the visual difference between two images, we examine the images' representations within a reference model: a vision-only self-supervised model trained without any language guidance, e.g., DINOv2 [42]. These models are shown to capture more visual details and information [42, 53].

We take the corpus datasets ImageNet [47] and LAION-Aesthetics [48] to collect these CLIP-blind pairs. For each pair, we compute its CLIP embeddings using the CLIP-ViT-L-14 [9, 43] model and its DINOv2 embeddings using the DINOv2-ViT-L-14 [9, 42] model. We return pairs whose cosine similarity exceeds 0.95 for the CLIP embeddings and is less than 0.6 for the DINOv2 embeddings.
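The pair-mining rule above reduces to a simple similarity filter. Below is a minimal, illustrative sketch (not the authors' released code); it assumes precomputed, L2-normalized CLIP-ViT-L-14 and DINOv2-ViT-L-14 image embeddings and reuses the 0.95 / 0.6 thresholds quoted above.

```python
import numpy as np

def find_clip_blind_pairs(clip_emb, dino_emb, clip_thresh=0.95, dino_thresh=0.6):
    """Return index pairs that are near-duplicates for CLIP but distinct for DINOv2.

    clip_emb, dino_emb: (N, D) arrays of L2-normalized image embeddings
    (hypothetical precomputed inputs from the two encoders).
    """
    clip_sim = clip_emb @ clip_emb.T   # cosine similarity = dot product of unit vectors
    dino_sim = dino_emb @ dino_emb.T
    pairs = []
    n = clip_emb.shape[0]
    for i in range(n):
        for j in range(i + 1, n):
            # CLIP thinks the two images are (nearly) the same ...
            if clip_sim[i, j] > clip_thresh and dino_sim[i, j] < dino_thresh:
                # ... but the vision-only model sees a clear difference.
                pairs.append((i, j))
    return pairs
```

In practice, an approximate nearest-neighbor index over the CLIP embeddings would replace the quadratic loop when mining ImageNet- or LAION-scale corpora.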
2.2. Designing Benchmark from CLIP-blind Pairs

We introduce the Multimodal Visual Patterns (MMVP) benchmark, a Visual Question Answering (VQA) benchmark. Utilizing the collected CLIP-blind pairs, we carefully design 150 pairs with 300 questions. For each CLIP-blind pair of images, we manually pinpoint the visual details that the CLIP vision encoder overlooks (see the middle of Figure 2) and craft questions that probe these visual details, for example "Is the dog facing left or right?" (see the right of Figure 2 and more examples in Figure 3). The primary goal is to determine whether MLLMs would fail when posed with these seemingly basic questions and overlook critical visual details. Hence, the questions are intentionally straightforward and unambiguous.

2.3. Benchmark Results

We assess the questions on SOTA open-source models (LLaVA-1.5 [31], InstructBLIP [8], Mini-GPT4 [71]) and closed-source models (GPT-4V [40], Gemini [14], Bard [13]). We leave the details of how we access the models to Appendix B.1. In our evaluation, each question is queried independently, eliminating any biases from chat histories. We also evaluate human performance through a user study where users are presented with 300 questions in a randomized sequence. For any given pair of images, we consider the pair to be correctly answered only if both questions associated with it are answered accurately.

Human study confirms questions are straightforward. As shown in Figure 4, human participants accurately answer an average of 95.7% of the questions. This high accuracy rate underscores the ease of the questions. More details can be found in Appendix B.4.

Current MLLMs struggle with visual details. As shown in Figure 4, there is a significant performance gap between humans and MLLMs, despite the latter often demonstrating impressive results [6, 27]. All models except GPT-4V and Gemini scored below the random-guess level (25%). Even the most advanced GPT-4V and Gemini face challenges in addressing basic visual grounding questions. Figures 1 and 3 provide examples of errors made by the models. The outcomes suggest that, irrespective of model size or training data, MLLMs struggle with visual details.
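The pair-level scoring rule described above (a CLIP-blind pair counts only when both of its questions are answered correctly) can be written in a few lines. The data layout below, with questions 2k and 2k+1 forming the k-th pair, is a hypothetical convention for illustration, not the authors' released evaluation code.

```python
def mmvp_pair_accuracy(predictions, answers):
    """predictions, answers: lists of option letters ('a'/'b'), one per question.

    Questions are assumed to be ordered so that questions 2k and 2k+1
    belong to the k-th CLIP-blind pair (a hypothetical layout).
    """
    assert len(predictions) == len(answers) and len(answers) % 2 == 0
    correct_pairs = 0
    num_pairs = len(answers) // 2
    for k in range(num_pairs):
        q1, q2 = 2 * k, 2 * k + 1
        # A pair scores only when both of its questions are answered correctly.
        if predictions[q1] == answers[q1] and predictions[q2] == answers[q2]:
            correct_pairs += 1
    return correct_pairs / num_pairs
```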
[Figure 3 shows fifteen MMVP question pairs, each with two answer options, together with the answers given by GPT-4V, Gemini Pro, LLaVA-1.5, and InstructBLIP. The questions include: "Is the dog facing left or right from the camera's perspective?", "Is the needle pointing up or down?", "Is the cup placed on a surface or being held by hand?", "Is the lock locked or unlocked?", "Is the snail in the picture facing the camera or away from the camera?", "Are the ears of the dog erect or drooping?", "In this image, how many eyes can you see on the animal?", "Are there cookies stacked on top of other cookies?", "Is this a hammerhead shark?", "Is there a hand using the mouse in this image?", "Are there any clouds?", "Do you see any window in this image?", "Are the butterfly's feet visible?", "Is the following statement correct: there are different colors of grapes in this image?", and "Is the following statement correct: there is no letter D on this image?". The per-model answer grid is omitted here.]

Figure 3. Examples of Questions in the MMVP benchmark. Incorrect answers are shaded in red. A model is considered correct only if
it answers both questions in a pair correctly. Both leading closed-source models (GPT-4V, Gemini) and open-source models (LLaVA-1.5,
InstructBLIP) fail these simple visual questions. (See Appendix B.2 for all the questions in MMVP benchmark.)

We have also conducted ablation studies, such as swapping options and changing notations in the question formulation (see Appendix B.3 for more details), to further confirm that this poor performance stems from visual incapability, not hallucination in the language models.

[Figure 4 is a bar chart of MMVP accuracy (%): Human 95.7, Gemini 40.7, GPT-4V 38.7, Random Guess 25.0, LLaVA-1.5 24.7, Bard 19.0, Bing Chat 17.3, InstructBLIP 16.7, mini-GPT4 12.7, LLaVA 6.0.]

Figure 4. Benchmark results of current SOTA MLLM models and humans. We evaluate benchmark questions for current SOTA MLLM models and human performance through user studies.

3. Systematic Failures in CLIP

In the previous section, we identify CLIP-blind pairs and use them to find failures in MLLMs. Here, we delve deeper into these pairs to investigate (i) the systematic visual patterns that emerge from CLIP-blind pairs (Section 3.1), (ii) whether these visual patterns still pose challenges for CLIP-based models after massive scaling up (Section 3.2), and (iii) the correlation between failure patterns in CLIP models and those in MLLMs (Section 3.3).
[Figure 5 shows example image-text pairs from MMVP-VLM for the nine visual patterns: Orientation and Direction (e.g., a rabbit facing right vs. facing left), Presence of Specific Features (e.g., tulips vs. no tulips), State and Condition (e.g., a butterfly with wings open vs. closed), Quantity and Count (e.g., 1 drink vs. 2 drinks), Positional and Relational Context (e.g., glasses on the right vs. the left of the slipper), Structural Characteristics (e.g., some fruits cut in half vs. uncut fruits), Texts (e.g., "11:54" vs. "11:59"), Viewpoint and Perspective (e.g., flowers seen from above vs. from the side), and Color and Appearance (e.g., light blue vs. dark blue sky). Marks indicate whether the model chooses the correct or the wrong image given the text.]

Figure 5. Examples from MMVP-VLM. MMVP-VLM consists of image pairs across nine visual patterns. The examples in the figure are
from EVA01 ViT-g-14 model [54], one of the largest CLIP models that also fails to choose the right image given the text description.

3.1. Visual Patterns in CLIP-blind Pairs

Having identified the CLIP-blind pairs, we summarize systematic visual patterns that the CLIP vision encoders might consistently misinterpret. It is too abstract to directly capture systematic visual patterns in the CLIP-blind pairs. Therefore, we turn to the questions and options from the MMVP benchmark. With these questions, we transform abstract visual patterns in images into clearer, language-based descriptors that are easier to categorize.

In this work, we use GPT-4 [41] to categorize general patterns by prompting it with the following:

    User: I am analyzing an image embedding model. Can you go through the questions and options, trying to figure out some general patterns that the embedding model struggles with? Please focus on the visual features and generalize patterns that are important to vision models.
    [MMVP Questions and Options]

We identify nine visual patterns:
• Orientation and Direction
• Presence of Specific Features
• State and Condition
• Quantity and Count
• Positional and Relational Context
• Color and Appearance
• Structural and Physical Characteristics
• Text
• Viewpoint and Perspective

These visual patterns suggest that CLIP vision encoders overly focus on high-level semantic understanding, overlooking intricate details of the visual world. Full descriptions of the visual patterns can be found in Appendix D.

3.2. The MMVP-VLM Benchmark

CLIP-based models have developed rapidly since their introduction in the first paper [43]. We want to test whether these visual patterns still impose challenges on the more recent CLIP models [10, 54, 62, 66], which significantly scale up in terms of training data and model size. To do so, we introduce a new benchmark, MMVP-VLM, to systematically study whether CLIP models handle these visual patterns well.

We distill a subset of questions from the MMVP benchmark into simpler language descriptions and categorize them into visual patterns. To maintain a balanced number of questions for each visual pattern, we add a few questions, if needed, to ensure that each visual pattern is represented by 15 text-image pairs. Examples of pairs are shown in Figure 5. A pair is deemed correctly answered if the model can accurately match both image-text combinations.

We evaluate MMVP-VLM on a variety of CLIP models [10, 43, 54, 62, 66]. These models vary in aspects like size, training data, and methodology. As evidenced in Table 1, increasing network size and training data only aids in identifying two visual patterns – "color and appearance" and "state and condition". The rest of the visual patterns continue to challenge all CLIP-based models. We also find that ImageNet-1k zero-shot accuracy is not a definitive indicator of a model's performance regarding visual patterns. This underscores the necessity for additional evaluation metrics, such as MMVP-VLM, to accurately assess a model's capabilities in areas beyond image classification.
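For MMVP-VLM, a pair is scored correct only when the CLIP model matches both texts to their images. The sketch below is one reasonable reading of that rule, operating on precomputed, L2-normalized embeddings from the CLIP model under evaluation; it is an illustration, not the authors' evaluation script.

```python
import numpy as np

def mmvp_vlm_pair_correct(img_emb, txt_emb):
    """img_emb, txt_emb: (2, D) L2-normalized embeddings for one MMVP-VLM pair,
    where image i is described by text i (embeddings assumed precomputed
    with the CLIP model under evaluation).
    """
    sim = img_emb @ txt_emb.T          # (2, 2) image-text cosine similarities
    # The pair counts only if each text is closer to its own image than to the
    # other image, i.e. both image-text assignments must be correct.
    return bool(sim[0, 0] > sim[1, 0] and sim[1, 1] > sim[0, 1])

def mmvp_vlm_score(pairs):
    """pairs: iterable of (img_emb, txt_emb) tuples; returns accuracy in percent."""
    results = [mmvp_vlm_pair_correct(i, t) for i, t in pairs]
    return 100.0 * np.mean(results)
```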
Model | Image Size | Params (M) | IN-1k ZeroShot | Orientation | Presence | State | Count | Position | Color | Structure | Text | Viewpoint | MMVP Average
OpenAI ViT-L-14 [43] | 224² | 427.6 | 75.5 | 13.3 | 13.3 | 20.0 | 20.0 | 13.3 | 53.3 | 20.0 | 6.7 | 13.3 | 19.3
OpenAI ViT-L-14 [43] | 336² | 427.9 | 76.6 | 0.0 | 20.0 | 40.0 | 20.0 | 6.7 | 20.0 | 33.3 | 6.7 | 33.3 | 20.0
SigLIP ViT-SO-14 [66] | 224² | 877.4 | 82.0 | 26.7 | 20.0 | 53.3 | 40.0 | 20.0 | 66.7 | 40.0 | 20.0 | 53.3 | 37.8
SigLIP ViT-SO-14 [66] | 384² | 878.0 | 83.1 | 20.0 | 26.7 | 60.0 | 33.3 | 13.3 | 66.7 | 33.3 | 26.7 | 53.3 | 37.0
DFN ViT-H-14 [10] | 224² | 986.1 | 83.4 | 20.0 | 26.7 | 73.3 | 26.7 | 26.7 | 66.7 | 46.7 | 13.3 | 53.3 | 39.3
DFN ViT-H-14 [10] | 378² | 986.7 | 84.4 | 13.3 | 20.0 | 53.3 | 33.3 | 26.7 | 66.7 | 40.0 | 20.0 | 40.0 | 34.8
MetaCLIP ViT-L-14 [62] | 224² | 427.6 | 79.2 | 13.3 | 6.7 | 66.7 | 6.7 | 33.3 | 46.7 | 20.0 | 6.7 | 13.3 | 23.7
MetaCLIP ViT-H-14 [62] | 224² | 986.1 | 80.6 | 6.7 | 13.3 | 60.0 | 13.3 | 6.7 | 53.3 | 26.7 | 13.3 | 33.3 | 25.2
EVA01 ViT-g-14 [54] | 224² | 1136.4 | 78.5 | 6.7 | 26.7 | 40.0 | 6.7 | 13.3 | 66.7 | 13.3 | 13.3 | 20.0 | 23.0
EVA02 ViT-bigE-14+ [54] | 224² | 5044.9 | 82.0 | 13.3 | 20.0 | 66.7 | 26.7 | 26.7 | 66.7 | 26.7 | 20.0 | 33.3 | 33.3

Table 1. Performance of various CLIP-based models on different visual patterns in the MMVP-VLM benchmark. Models scaled up in resolution show minimal improvement, whereas a slight advantage is observed when scaling up the network. For each visual pattern, ImageNet-1k zero-shot accuracy, and the MMVP average, the best performance is highlighted in light gray. For most of the visual patterns, all CLIP-based methods struggle, as evident from the scores. The nine visual-pattern columns are: Orientation and Direction, Presence of Specific Features, State and Condition, Quantity and Count, Positional and Relational Context, Color and Appearance, Structural and Physical Characteristics, Text, and Viewpoint and Perspective.

3.3. How CLIP's Errors Affect MLLMs

After analyzing the visual patterns that CLIP models struggle with, we pose the following question: Is there a correlation between the underperformance of CLIP and MLLMs' visual incapability? To explore this, we categorize questions from MMVP into the visual patterns summarized above and calculate each MLLM's performance on these patterns.

In Figure 6, we plot CLIP's performance and the MLLMs' performance for each visual pattern. When the CLIP vision encoder underperforms on a certain visual pattern, the MLLMs tend to exhibit similar shortcomings. Open-source models such as LLaVA 1.5 [30] and InstructBLIP [8] that explicitly use the CLIP vision encoder display a strong correlation in performance.

Further, we calculate the Pearson Correlation Coefficient between the CLIP model's and each MLLM's performance on the visual patterns. Results show that LLaVA 1.5 and InstructBLIP both have a coefficient greater than 0.7. This high score indicates a strong correlation: weaknesses in visual pattern recognition in the CLIP model are transferred to MLLMs. More details on the Pearson Correlation Coefficient can be found in Appendix C.

[Figure 6 is a grouped bar chart of accuracy (%) per visual pattern for CLIP, InstructBLIP, LLaVA 1.5, Gemini, and GPT-4.]

Figure 6. CLIP and MLLM's performance on visual patterns. If CLIP performs poorly on a visual pattern such as "orientation", MLLMs also underperform on that visual pattern.
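The per-pattern correlation analysis above amounts to a Pearson coefficient over nine accuracy values per model. A minimal sketch follows; the CLIP vector reuses the OpenAI ViT-L-14 row of Table 1, while the MLLM vector is a placeholder rather than the paper's measured numbers.

```python
import numpy as np

def pearson_corr(x, y):
    """Pearson correlation coefficient between two 1-D accuracy vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / (np.sqrt((xc ** 2).sum()) * np.sqrt((yc ** 2).sum())))

# Per-visual-pattern accuracies (orientation, ..., viewpoint).
clip_acc  = [13.3, 13.3, 20.0, 20.0, 13.3, 53.3, 20.0, 6.7, 13.3]   # OpenAI ViT-L-14 row of Table 1
mllm_acc  = [10.0, 20.0, 25.0, 15.0, 10.0, 45.0, 20.0, 5.0, 15.0]   # placeholder MLLM accuracies
print(pearson_corr(clip_acc, mllm_acc))
```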
4. Mixture-of-Features (MoF) for MLLM

Based on our exploration in the earlier sections, a natural question arises: if open-source MLLMs' visual shortcomings come from the CLIP vision encoder, how do we build a more competent visual encoder? In this section, we take initial steps to answer the question by studying Mixture-of-Features (MoF). We start with additive MoF, which mixes CLIP features and vision-only SSL model features. Results show that each encoder presents unique advantages and limitations when employed as the pretrained vision model in an MLLM (Section 4.2). We subsequently propose interleaved MoF, which integrates the features from both CLIP and SSL models into the MLLM to enhance visual grounding without compromising the model's ability to follow instructions (Section 4.3).

4.1. Experiment Setting

We adopt LLaVA [30, 31] as the framework to study visual encoders in MLLMs. LLaVA uses a pretrained CLIP encoder and trains an adapter to align visual tokens with language tokens in the LLM (see the left side of Figure 7). We use DINOv2 [42] as the vision-only SSL model in our work because it is currently the most scalable vision-only model. Our exploration includes the use of two visual encoders: CLIP-ViT-L-14 [43] and DINOv2-ViT-L-14 [42]. To ensure consistent and fair comparisons, we train and finetune our model with the same experiment settings as LLaVA. We include additional experimental details in Appendix A.

4.2. Additive MoF

We add a pretrained DINOv2 encoder into the MLLM and mix it with the pretrained CLIP encoder. We use a coefficient α to control the proportion of CLIP features and 1 − α to control the amount of DINOv2 features, and linearly add them together (see the middle part of Figure 7 for a visualization).
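A minimal sketch of the additive mixing described above is given below, assuming the two encoders produce spatially aligned token grids of the same hidden dimension; the module and argument names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AdditiveMoF(nn.Module):
    """Linearly blend CLIP and DINOv2 visual features before the adapter.

    alpha is the CLIP proportion; (1 - alpha) is the DINOv2 proportion,
    mirroring the coefficient described in Section 4.2.
    """
    def __init__(self, clip_encoder, dino_encoder, adapter, alpha=0.75):
        super().__init__()
        self.clip_encoder = clip_encoder
        self.dino_encoder = dino_encoder
        self.adapter = adapter            # e.g. an MLP projecting to the LLM width
        self.alpha = alpha

    def forward(self, image):
        clip_feat = self.clip_encoder(image)   # (B, N, D) patch tokens (assumed)
        dino_feat = self.dino_encoder(image)   # (B, N, D), same grid size (assumed)
        mixed = self.alpha * clip_feat + (1.0 - self.alpha) * dino_feat
        return self.adapter(mixed)             # visual tokens fed to the LLM
```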
[Figure 7 illustrates the three architectures. In the standard MLLM, the image passes through a CLIP encoder and an adapter, and the resulting visual tokens are fed to the language model together with the text (e.g., "How many eyes can you see in this image? (a) 1 (b) 2"). In Additive-MoF, CLIP and DINOv2 features are linearly mixed before the adapter. In Interleaved-MoF, CLIP and DINOv2 tokens pass through separate adapters and are spatially interleaved before being fed to the language model.]

Figure 7. Different Mixture-of-Feature (MoF) Strategies in MLLM. Left: Standard MLLM that uses CLIP as the off-the-shelf pretrained vision encoder; Middle: Additive-MoF (A-MoF) MLLM: linearly mixing CLIP and DINOv2 features before the adapter; Right: Interleaved-MoF (I-MoF) MLLM: spatially interleaving CLIP visual tokens and DINOv2 visual tokens after the adapter.

We evaluate the model's visual grounding ability with the MMVP benchmark proposed in Section 2 and the model's instruction-following capability with the LLaVA benchmark introduced in Liu et al. [31]. Initially, we conduct five experiments where we linearly transition from using 100% CLIP features to 100% DINOv2 features. In these tests, the DINOv2 feature proportions are set at {0.00, 0.25, 0.50, 0.75, 1.00}. To further verify the observed trends, we introduce two additional experiments with DINOv2 proportions of {0.625, 0.875}. Our findings, presented in Table 2, reveal two insights:
1. As the proportion of DINOv2 features increases, the MLLM exhibits a decline in its instruction-following capability. Notably, there is a sharp decrease when the DINOv2 proportion reaches 87.5%.
2. A higher proportion of DINOv2 features enhances the model's visual grounding capability, but this advantage diminishes when the DINOv2 proportion surpasses 0.75, at which point instruction-following is notably impaired.
Hence, if we were to add DINOv2 features or completely replace CLIP with DINOv2, it would result in a trade-off between visual grounding and instruction-following. A higher proportion of DINOv2 features improves the model's visual perception at the expense of its ability to follow linguistic instructions, while CLIP features enhance language comprehension but reduce visual grounding.

method | SSL ratio | MMVP | LLaVA
LLaVA | 0.0 | 5.5 | 81.8
+ A-MoF | 0.25 | 7.9 (+2.4) | 79.4 (-2.4)
+ A-MoF | 0.5 | 12.0 (+6.5) | 78.6 (-3.2)
+ A-MoF | 0.625 | 15.0 (+9.5) | 76.4 (-5.4)
+ A-MoF | 0.75 | 18.7 (+13.2) | 75.8 (-6.0)
+ A-MoF | 0.875 | 16.5 (+11.0) | 69.3 (-12.5)
+ A-MoF | 1.0 | 13.4 (+7.9) | 68.5 (-13.3)

Table 2. Empirical Results of Additive MoF. We use DINOv2 as the image SSL model in our work. As more DINOv2 features are added, visual grounding improves while instruction-following ability declines.

4.3. Interleaved MoF

We propose interleaved MoF to leverage the advantages of both CLIP and DINOv2 embeddings to enhance image representation. An image is concurrently passed into the CLIP and DINOv2 encoders, and the resulting embeddings are individually processed by adapters. We take the processed features from CLIP and DINOv2 and interleave them while maintaining their original spatial order. We then feed the interleaved features to the LLM (see the right part of Figure 7).

method | res | #tokens | MMVP | LLaVA | POPE
LLaVA | 224² | 256 | 5.5 | 81.8 | 50.0
LLaVA | 336² | 576 | 6.0 | 81.4 | 50.1
LLaVA + I-MoF | 224² | 512 | 16.7 (+10.7) | 82.8 | 51.0
LLaVA1.5 | 336² | 576 | 24.7 | 84.7 | 85.9
LLaVA1.5 + I-MoF | 224² | 512 | 28.0 (+3.3) | 82.7 | 86.3

Table 3. Empirical Results of Interleaved MoF. Interleaved MoF improves visual grounding while maintaining the same level of instruction-following ability.

We summarize the results in Table 3. Under the LLaVA setting, interleaved MoF significantly enhances visual grounding, with a 10.7% increase observed on MMVP, without compromising the model's ability to follow instructions. This experiment is replicated with the LLaVA-1.5 setting and under various image resolution settings, yielding similar enhancements in performance. We also evaluate on POPE [27], which is designed to test hallucination in visual grounding; Interleaved-MoF again shows consistent improvement over the original LLaVA models. Merely increasing the image resolution, and consequently the number of tokens, does not boost visual grounding capabilities. Instead, it is the interleaving of MoF that leads to improved performance in visual grounding tasks. We also evaluated Interleaved MoF on additional benchmarks such as MMBench [32] and GQA [21], finding that Interleaved MoF achieves similar performance on these benchmarks. Please refer to Appendix E for more results on these benchmarks.
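The token interleaving described in Section 4.3 can be sketched as follows: each encoder's tokens pass through their own adapter and the two sequences are then woven together in their original spatial order, doubling the visual token count (e.g., 256 to 512 tokens at 224² resolution, consistent with Table 3). Names and shapes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class InterleavedMoF(nn.Module):
    """Interleave CLIP and DINOv2 visual tokens after separate adapters."""
    def __init__(self, clip_encoder, dino_encoder, clip_adapter, dino_adapter):
        super().__init__()
        self.clip_encoder, self.dino_encoder = clip_encoder, dino_encoder
        self.clip_adapter, self.dino_adapter = clip_adapter, dino_adapter

    def forward(self, image):
        clip_tok = self.clip_adapter(self.clip_encoder(image))  # (B, N, D_llm)
        dino_tok = self.dino_adapter(self.dino_encoder(image))  # (B, N, D_llm), same N assumed
        B, N, D = clip_tok.shape
        # Alternate tokens as clip_0, dino_0, clip_1, dino_1, ... preserving
        # spatial order, which doubles the number of visual tokens.
        interleaved = torch.stack((clip_tok, dino_tok), dim=2).reshape(B, 2 * N, D)
        return interleaved   # fed to the LLM alongside the text tokens
```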
5. Related Works

Multimodal LLMs. We study the limitations of Multimodal LLMs [8, 13, 30, 31, 40] and explore possible ways to improve these models. Multimodal LLMs build on pretrained Large Language Models [3, 41, 58, 59, 69] and the CLIP vision encoder [43, 54]. These systems then use an adapter, such as MLPs [30, 31], Q-Former [8, 26], or gated attention [2, 25], to integrate the pretrained CLIP vision encoder into LLMs. More recently, InstructBLIP [8] and LLaVA-1.5 [30] highlight the importance of high-quality training data. Yet, there is a scarcity of research focusing on the impact of visual encoders, which is an important gap our work aims to address through a systematic study.

Evaluating Multimodal LLMs. MMVP assesses MLLMs using a set of simple yet critical Visual Question Answering (VQA) questions constructed from CLIP-blind pairs. Previous benchmarks such as TextVQA [52], VQAv2 [15], and GQA [21] have centered on traditional VQA queries. Recently, works like MM-Vet [64], POPE [27], and MM-Bench [32] have been designed specifically to evaluate multimodal LLMs on aspects including hallucination, reasoning, and robustness. These benchmarks and evaluations have shown that Multimodal LLMs can suffer from hallucination [28, 29], catastrophic forgetting [67], and lack of robustness [11]. In taking a step back to the fundamentals, our work uncovers that even the most advanced multimodal LLMs, such as GPT-4V [40], Gemini [14], Bard [13], and LLaVA-1.5 [30], are not immune to stumbling over elementary visual questions. We also identify part of the problem as being the incapable visual encoder.

Visual Encoders. MMVP-VLM provides a detailed analysis of the visual capabilities of various CLIP variants [43, 54, 62, 66]. These models mostly follow the method proposed in Radford et al. [43], which uses a contrastive loss to train on large volumes of image-text pairs. They differ in training data [62], training recipes [54], and objective functions [66]. Nonetheless, our studies show that all of these CLIP variants struggle with simple visual patterns such as "orientation", "count", "presence of specific features", etc. Another line of research focuses on vision-only self-supervised learning (SSL). This category includes contrastive SSL [5, 7, 16, 17] and mask-based SSL [4, 18, 70]. SLIP [39] explores the synergy between CLIP and contrastive SSL, but focuses primarily on standard classification tasks. In fact, a common practice to evaluate the quality of these vision models is through linear probing or fine-tuning on ImageNet [45, 47]. Although current evaluation methods provide a basic level of assessment of representation quality, our findings indicate a growing detachment from the needs of recent use cases. As demonstrated in the MoF experiments in Section 4, the CLIP vision model and the vision-only SSL models learn complementary features. However, the linear probing accuracy on ImageNet alone provides a limited understanding of feature utility in MLLMs. This observation suggests the need for more diverse evaluations [61] in visual representation learning, to better align with current and emerging applications.

Ambiguities in Embedding Models. Our work exploits CLIP-blind pairs within the CLIP vision embedding space to generate examples of failures in CLIP models and subsequently MLLMs. This concept has ties to previous research focused on documenting failure modes in text embedding models [12, 36, 55]. More recently, Thrush et al. [56], Yuksekgonul et al. [65], and Hsieh et al. [19] study the binding problems CLIP faces in processing text queries, noting that CLIP models treat text input as a bag of words. Tong et al. [57] examine the implications for downstream text-guided generative models. Tschannen et al. [60] suggest image captioners as promising alternatives to CLIP for improving attribute binding. Our work focuses on the visual patterns.

6. Discussion

Circling back to the very first question we asked: is vision good enough for language? Perhaps not yet, as our study shows that vision models might become a bottleneck in multimodal systems. MLLMs fail on simple questions because their pre-trained CLIP vision encoders overlook crucial visual details in images and systematically fail to sort important visual patterns. Yet, CLIP-type models remain the most scalable and widely used vision models today. Contrary to the popular belief that data and model scaling is a panacea, our research demonstrates that scaling alone does not rectify the inherent deficiencies in CLIP models.

Our study reveals that popular visual representation learning models – vision-and-language models and vision-only self-supervised learning models – excel in different aspects. The distinction in their capabilities goes beyond conventional benchmarks such as linear probing or zero-shot accuracy on ImageNet. Although a carefully designed Mixture-of-Features approach can alleviate the visual limitations and utilize the strengths of these two learning paradigms, it is necessary to develop new evaluation metrics to facilitate the development of new visual representation learning algorithms. We hope our work can motivate further innovation in vision models.

Acknowledgements. We thank Penghao Wu, Muzi Tao, Erik Jones, Michael Psenka, Daniel Yeh, and Druv Pai for helpful discussions and feedback. We also thank Google Cloud and the TRC program for their support.
References

[1] ShareGPT, 2023.
[2] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022.
[3] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. PaLM 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
[4] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In CVPR, 2023.
[5] Adrien Bardes, Jean Ponce, and Yann LeCun. VICReg: Variance-invariance-covariance regularization for self-supervised learning. 2022.
[6] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[7] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[8] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Alex Fang, Albin Madappally Jose, Amit Jain, Ludwig Schmidt, Alexander Toshev, and Vaishaal Shankar. Data filtering networks. arXiv preprint arXiv:2309.17425, 2023.
[11] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Zhenyu Qiu, Wei Lin, Jinrui Yang, Xiawu Zheng, et al. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
[12] Hila Gonen and Yoav Goldberg. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. In NAACL, 2019.
[13] Google. Bard, 2023.
[14] Google. Gemini, 2023.
[15] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
[16] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent: A new approach to self-supervised learning. In NeurIPS, 2020.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[18] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In CVPR, 2022.
[19] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In NeurIPS, 2023.
[20] Jennifer Hu and Roger Levy. Prompt-based methods may underestimate large language models' linguistic generalizations. In EMNLP, 2023.
[21] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In CVPR, 2019.
[22] Heewoo Jun and Alex Nichol. Shap-E: Generating conditional 3D implicit functions. arXiv preprint arXiv:2305.02463, 2023.
[23] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In EMNLP, 2014.
[24] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. IJCV, 2017.
[25] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M Rush, Douwe Kiela, et al. Obelisc: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
[26] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML, 2023.
[27] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023.
[28] Fuxiao Liu, Tianrui Guan, Zongxia Li, Lichang Chen, Yaser Yacoob, Dinesh Manocha, and Tianyi Zhou. HallusionBench: You see what you think? Or you think what you see? An image-context reasoning benchmark challenging for GPT-4V(ision), LLaVA-1.5, and other multi-modality models. arXiv preprint arXiv:2310.14566, 2023.
[29] Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023.
[30] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
[31] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. 2023.
[32] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023.
[33] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In ICLR, 2017.
[34] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan L Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
[35] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. OK-VQA: A visual question answering benchmark requiring external knowledge. In CVPR, 2019.
[36] Chandler May, Alex Wang, Shikha Bordia, Samuel R Bowman, and Rachel Rudinger. On measuring social biases in sentence encoders. In NAACL, 2019.
[37] Microsoft. New Bing, 2023.
[38] Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. OCR-VQA: Visual question answering by reading text in images. In ICDAR, 2019.
[39] Norman Mu, Alexander Kirillov, David Wagner, and Saining Xie. SLIP: Self-supervision meets language-image pre-training. In ECCV, 2022.
[40] OpenAI. GPT-4V(ision) System Card, 2023.
[41] OpenAI. GPT-4 technical report, 2023.
[42] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[43] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[44] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 2020.
[45] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. ImageNet-21K pretraining for the masses. In NeurIPS, 2021.
[46] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
[47] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 2015.
[48] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. LAION-5B: An open large-scale dataset for training next generation image-text models. In NeurIPS, 2022.
[49] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-OKVQA: A benchmark for visual question answering using world knowledge. In ECCV, 2022.
[50] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In ACL, 2018.
[51] Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. TextCaps: A dataset for image captioning with reading comprehension. In ECCV, 2020.
[52] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards VQA models that can read. In CVPR, 2019.
[53] Mannat Singh, Quentin Duval, Kalyan Vasudev Alwala, Haoqi Fan, Vaibhav Aggarwal, Aaron Adcock, Armand Joulin, Piotr Dollár, Christoph Feichtenhofer, Ross Girshick, et al. The effectiveness of MAE pre-pretraining for billion-scale pretraining. In ICCV, 2023.
[54] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: Improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
[55] Tony Sun, Andrew Gaut, Shirlyn Tang, Yuxin Huang, Mai ElSherief, Jieyu Zhao, Diba Mirza, Elizabeth Belding, Kai-Wei Chang, and William Yang Wang. Mitigating gender bias in natural language processing: Literature review. In ACL, 2019.
[56] Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. In CVPR, 2022.
[57] Shengbang Tong, Erik Jones, and Jacob Steinhardt. Mass-producing failures of multimodal systems with language models. In NeurIPS, 2023.
[58] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
[59] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. LLaMA 2: Open foundation and fine-tuned chat models. 2023.
[60] Michael Tschannen, Manoj Kumar, Andreas Steiner, Xiaohua Zhai, Neil Houlsby, and Lucas Beyer. Image captioners are scalable vision learners too. NeurIPS, 2023.
[61] Kirill Vishniakov, Zhiqiang Shen, and Zhuang Liu. ConvNet vs Transformer, supervised vs CLIP: Beyond ImageNet accuracy, 2024.
[62] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying CLIP data. arXiv preprint arXiv:2309.16671, 2023.
[63] Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. The Dawn of LMMs: Preliminary explorations with GPT-4V(ision). arXiv preprint arXiv:2309.17421, 2023.
[64] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
[65] Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In ICLR, 2022.
[66] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In ICCV, 2023.
[67] Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models. arXiv preprint arXiv:2309.10313, 2023.
[68] Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu, Hongsheng Li, Peng Gao, and Yu Qiao. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023.
[69] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023.
[70] Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In ICLR, 2021.
[71] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
A. Experiment Details

Hyperparameters. In this work, we adopt the same set of hyperparameters as LLaVA [31] and LLaVA-1.5 [30]. We use Vicuna-13b-v1.3 [69] in the LLaVA experiments and Vicuna-13b-v1.5 [69] in the LLaVA-1.5 experiments. We show the training hyperparameters for the LLaVA and LLaVA-1.5 experiments in Table 4. All experiments are conducted using a maximum of 8 Nvidia A100 GPUs.

Hyperparameter | LLaVA Stage 1 | LLaVA Stage 2 | LLaVA-1.5 Stage 1 | LLaVA-1.5 Stage 2
batch size | 128 | 128 | 256 | 128
lr | 1e-3 | 2e-5 | 2e-3 | 2e-5
lr schedule | cosine decay | cosine decay | cosine decay | cosine decay
lr warmup ratio | 0.03 | 0.03 | 0.03 | 0.03
weight decay | 0 | 0 | 0 | 0
epoch | 1 | 3 | 1 | 1
optimizer | AdamW [33] (all stages)
DeepSpeed stage | 2 | 3 | 2 | 3

Table 4. Hyperparameters for MoF training on LLaVA and LLaVA-1.5.

Pretrain Datasets. We use the same training data as the original LLaVA and LLaVA-1.5 recipes. For the LLaVA experiments, stage 1 uses CC595k [50] and stage 2 uses the LLaVA 158k [31] instruction data; for the LLaVA-1.5 experiments, stage 1 uses CC595k [50] and stage 2 uses the DataMix 665k [1, 15, 21, 23, 24, 31, 34, 35, 38, 49, 51] proposed in Liu et al. [30].

B. MMVP Benchmark

We provide more details on the MMVP benchmark.

B.1. Details of evaluating SOTA models

We access GPT-4V through ChatGPT in October and November 2023. We also evaluate Gemini-Pro through the Vertex AI API in December 2023. We use the official checkpoints for InstructBLIP [8]. We access mini-GPT4 [71], LLaVA, and LLaVA-1.5 [31] through their playgrounds. (To circumvent response hallucination in mini-GPT4, we prefix our questions with "Please only choose an option to answer the question below without explanation: ".) We test Bard [13] using the official website in September and October 2023. Moreover, we test new-Bing [37] through the new-Bing chat creative mode and GPT-4V [40] in September 2023.

B.2. Questions in MMVP Benchmark

We present more examples from MMVP at the end, in Figures 10, 11, and 12. We also share the entire benchmark in the supplementary material.

B.3. Ablation Studies

To further verify that MLLMs make mistakes in MMVP due to their incapable visual grounding instead of hallucination in the language model [20], we conduct additional ablation experiments on the format and notations of the VQA questions and options in MMVP. We choose GPT-4V for these experiments, as it is currently the best model.

Swapping options. The first experiment swaps the two options in the MMVP benchmark. For example, we change the question from "Are the butterfly's wings closer to being open or closed? (a) Open (b) Closed" to "Are the butterfly's wings closer to being open or closed? (a) Closed (b) Open". Empirically, we find that GPT-4V obtains a 40.3% accuracy with the options swapped, as opposed to the original 38.7%. We observe that a few questions are answered differently, while the majority remain the same. This further suggests that the visual incapabilities are in the vision encoder rather than in the alignment or the LLMs.

Changing notations in the options. We conducted an ablation study to assess the impact of altering notations. For example, we changed "(a) Closed (b) Open" to "(1) Closed (2) Open". The results are comparable to the original findings, achieving a performance of 37.3%, closely matching the original 38.7%. The study further suggests that the core challenge in MLLMs is their inherent visual incapability, rather than hallucinations in the language model.
mance of 37.3%, closely matching the original 38.7%. The study
proposed in Liu et al. [30].
further suggests that the core challenge in MLLMs is their inher-
ent visual incapability, rather than hallucinations in the language
B. MMVP Benchmark model.
We provide more details on the MMVP benchmark. B.4. Human Study Details
B.1. Details of evaluating SOTA models In this study, we ask four participants to volunteer in our study. An
We access GPT-4V through ChatGPT in October and November example user interface for labeling is shown in Figure 8. We col-
2023. We also evaluate Gemini-Pro through Vertex AI API in De- lect their responses and calculate the average score as the human-
cember 2023. We use the official checkpoints for InstructBLIP [8]. level performance.
We access mini-GPT4 [71],1 LLaVA and LLaVA-1.5 [31] through
their playgrounds. We test Bard [13] using the official website C. CLIP-MLLM Failure Correlation
in September and October 2023. Moreover, we test new-Bing [37]
Correlation between CLIP and MLLM models. We
through new-Bing chat creative mode and GPT-4V [40] in Septem-
compute the Pearson Correlation between the CLIP model and
ber 2023.
MLLMs and show results in Table 5. Notably, both open-source
B.2. Questions in MMVP Benchmark models – LLaVA and InstructBLIP – exhibit remarkably high
Pearson Correlation, exceeding 0.7. This finding indicates a strong
We present more examples in MMVP at the end in Fig- correlation between the errors made by the CLIP model and those
ures 10, 11, 12. We also share the entire benchmark in the sup- made by MLLMs. Bard also displays a very high correlation. This
plementary material. suggests that some of the most advanced closed-source models are
B.3. Ablation Studies also affected by the visual limitations in the CLIP models.

To further verify that MLLMs make mistakes in MMVP due to Correlation between ImageNet-1k and MMVP perfor-
their incapable visual grounding instead of hallucination in the lan- mance. We plot the ImageNet-1k Zero-shot accuracy against
1 To circumvent response hallucination in mini-GPT4 we prefix our MMVP-VLM average performance in Figure 9. For models with
questions with “Please only choose an option to answer the question below ImageNet-1k Zero-shot accuracy below 80, a higher Zero-shot ac-
without explanation: ” curacy tends to indicate improved MMVP performance. However,

12
[Figure 8 shows the questionnaire interface used in the user study: a progress indicator (e.g., 6/300), the current question ("Can you see the key "Z" in the image?") with Yes/No options, checkboxes for "This question is not good" and "Answers are too ambiguous", and Back/Next buttons.]

Figure 8. Example of user study interface. The questions in the user study are randomly shuffled to avoid any potential bias. Users
choose answers for the VQA questions as well as potential concerns for the VQA question.

[Figure 9 is a bubble plot of ImageNet-1k zero-shot performance (x-axis, roughly 74-86) against average MMVP-VLM performance (y-axis, roughly 20-40) for OpenAI ViT-L-14, MetaCLIP ViT-L-14 and ViT-H-14, EVA01 ViT-g-14, EVA02 ViT-bigE-14+, DFN ViT-H-14, and SigLIP ViT-SO-14.]

Figure 9. Correlation between ImageNet-1k Zero-shot and MMVP-VLM average. The area of each bubble corresponds to the model's number of parameters. A higher ImageNet-1k zero-shot performance does not necessarily imply superior performance in MMVP-VLM.

D. Visual Patterns for CLIP

Here, we provide the full descriptions of the visual patterns that pose challenges to all CLIP-based models.
• Orientation and Direction: Questions about the direction something is facing or moving, such as the direction the dog or duck is facing, or the orientation of the school bus.
• Presence of Specific Features: Questions that focus on the existence or non-existence of certain elements or features in the image.
• State and Condition: Questions that pertain to the state or condition of an object, such as whether a flag is blowing in the wind or if the ground is wet.
• Quantity and Count: Questions about the number of objects or features present in the image.
• Positional and Relational Context: This aspect refers to the model's ability to understand the position and relationship of objects or elements within an image in relation to each other and their surroundings.
• Color and Appearance: Questions regarding the color of certain objects or elements.
• Structural and Physical Characteristics: This category involves the model's ability to identify and analyze the physical attributes and structural features of objects in an image.
• Text: Questions related to text or symbols present in the image.
• Viewpoint and Perspective: Questions concerning the perspective from which the photo was taken.
method | res | #tokens | MMVP | LLVB | LLVW | MMB | VQAT | POPE | VQAV2 | MM-V
LLaVA1.5 | 336² | 576 | 24.7 | 84.7 | 70.7 | 67.7 | 61.3 | 85.9 | 80.0 | 35.4
LLaVA1.5 + I-MoF | 224² | 512 | 28.0 | 82.7 | 73.3 | 61.6 | 55.3 | 86.3 | 77.3 | 33.5
LLaVA1.5 + I-MoF | 336² | 1152 | 31.3 | 81.8 | 73.3 | 65.4 | 58.7 | 86.7 | 79.3 | 34.6

Table 6. Comparison with LLaVA-1.5 on 6 more benchmarks. Interleaved-MoF LLaVA-1.5 obtains performance on par with the original method while showing improvements on benchmarks evaluating visual grounding. Benchmark names are abbreviated due to space limits. LLVB: LLaVA Benchmark [31]; LLVW: LLaVA-In-the-Wild [30]; MMB: MMBench [32]; VQAT: TextVQA [52]; POPE: POPE [27]; VQAV2: VQA-v2 [15]; MM-V: MM-Vet [64].

E. More Benchmark Results

We conduct additional experiments on Interleaved-MoF that further scale up the resolution to 336 and evaluate on more benchmarks. The results summarized in Table 6 reveal that Interleaved-MoF achieves comparable performance on most benchmarks while demonstrating improvements on benchmarks focused on visual grounding. We also observe that MMVP is more sensitive to the model's visual capabilities, underscoring the significance of our benchmark in assessing visual proficiency.
[Figure 10 shows nine additional MMVP question pairs, each with (a)/(b) options, together with the answers given by GPT-4V, Gemini, LLaVA-1.5, and InstructBLIP: "Can you see the key "Z" in the image?", "Is there shadow on the flower?", "Is the front of the school bus protruding?", "Do the vegetables have spikes?", "Is the butterfly's abdomen visible in the image?", "Can you see stems of bananas in the image?", "Are there any words displayed on the vehicle's lightbar?", "Do you see this flower from the top or the side?", and "Is the door of the truck open?". The per-model answer grid is omitted here.]

Figure 10. More examples of questions in the MMVP benchmark (Part I).

[Figure 11 shows nine additional MMVP question pairs, each with (a)/(b) options, together with the answers given by GPT-4V, Gemini, LLaVA-1.5, and InstructBLIP: "Does the keyboard have a backlight?", "How many eyes of the cat can you see in the picture?", "Does this corn have white kernels?", "What does the center button say? (a) OK/SELECT (b) OK", "Where is the yellow animal's head lying in this image? (a) Floor (b) Carpet", "Are some fruits cut open or are all the fruits uncut?", "Is the ladybug positioned upright or upside down?", "In this picture, is the snake's head visible or not visible?", and "How many wheels can you see in the image?". The per-model answer grid is omitted here.]

Figure 11. More examples of questions in the MMVP benchmark (Part II).

[Figure 12 shows nine additional MMVP question pairs, each with (a)/(b) options, together with the answers given by GPT-4V, Gemini, LLaVA-1.5, and InstructBLIP: "What are the words in the image: (a) "Happy Easter" (b) "Happy Easter!"", "Is there an orange with leaves next to the cup?", "Are there black stripes on the roof of the car?", "Is the rabbit in the image facing left or right?", "Are all easter eggs placed in a container (e.g. nest, basket)?", "Is the sky in the background dark blue or light blue?", "Are there any fruits and vegetables in the heart-shaped part of the picture?", "In the image, is it a salmon fillet or a salmon steak?", and "How many trees are the treehouse built on?". The per-model answer grid is omitted here.]

Figure 12. More examples of questions in the MMVP benchmark (Part III).

