What Makes for Good Image Captions?

Delong Chen
HKUST
Samuel Cahyawijaya
HKUST
Etsuko Ishii
HKUST
Ho Shu Chan
HKUST
Yejin Bang
HKUST
Pascale Fung
HKUST
Abstract

This paper establishes a formal information-theoretic framework for image captioning, conceptualizing captions as compressed linguistic representations that selectively encode semantic units in images. Our framework posits that good image captions should balance three key aspects: they should be informationally sufficient, minimally redundant, and readily comprehensible by humans. By formulating these aspects as quantitative measures with adjustable weights, our framework provides a flexible foundation for analyzing and optimizing image captioning systems across diverse task requirements. To demonstrate its applicability, we introduce the Pyramid of Captions (PoCa) method, which generates enriched captions by integrating local and global visual information. We present both a theoretical proof that PoCa improves caption quality under certain assumptions and empirical validation of its effectiveness across various image captioning models and datasets.

1 Introduction

Image captioning, the process of translating visual content into natural language descriptions, serves pivotal roles in real-world applications ranging from assisting visually impaired individuals [1, 2, 3, 4] to facilitating content-based image retrieval [5, 6, 7, 8, 9, 10]. Over the last decade, the field of image captioning has witnessed substantial progress, primarily driven by advancements in deep neural networks and the availability of large-scale high-quality image-text datasets.

Despite empirical advancements, several fundamental questions remain unanswered: What makes for good image captions? Which properties should they possess, and how can we measure them? Some existing models can generate captions closely resembling single-sentence human annotations [11], but these may not be adequate for use cases where more comprehensive coverage of fine-grained visual information is required. Conversely, recent Large Vision Language Models (LVLMs) [12, 13] have demonstrated the ability to generate multi-paragraph detailed image descriptions [14, 15, 12, 16]. Yet, longer captions can sometimes be less accurate, hallucinate content, or put excessive emphasis on irrelevant details while omitting important ones.

Recognizing the absence of a universal standard for ideal captions, this work aims to establish well-defined principles for image captioning that address varying task requirements. We introduce an information-theoretic framework based on semantic communication [17, 18] and the information bottleneck principle [19, 20, 21]. By leveraging this perspective, we formulate an objective function for image captioning that strikes a balance among three key criteria:

  • Information Sufficiency: Ensuring comprehensive coverage of meaningful content, measured by the mutual information between the caption and task-relevant visual semantics.

  • Minimal Redundancy: Optimizing the conciseness of the caption, quantified by the entropy of the generated caption.

  • Human Comprehensibility: Facilitating ease of understanding for human readers, assessed through the distributional distance between generated captions and natural language.

Our framework conceptualizes images and captions as observations of latent variables in a semantic space. This allows us to formulate image captioning as a communication process where semantic information is transmitted from the image to the caption, and measure the error at the semantic level. We then present formal quantitative measurements of the above three criteria and define the ultimate objective of image captioning as a weighted combination of them. Varying the weighting coefficients of these terms suits different preferences over image captions (e.g., comprehensiveness, succinctness, readability), providing a rigorous foundation for analysis and evaluations.

Our framework provides a rigorous foundation for analyzing and advancing image captioning techniques. To demonstrate its practical applicability, we present the Pyramid of Captions (PoCa) method as an example application. PoCa employs a hierarchical approach to generate semantically rich captions by leveraging both local and global visual information. Utilizing our theoretical framework, we provide formal proof that each local-global aggregation operation in PoCa improves caption quality under certain assumptions. Empirical evaluations across various image captioning models and datasets corroborate our theoretical findings, showing that PoCa consistently yields more informative and semantically aligned captions while maintaining brevity and interpretability.

2 Proposed Framework

In this section, we provide a theoretical framework for image captioning as depicted in Figure 1. First, we formulate the task of image captioning by applying the concept of semantic units [17, 18]. We suppose that an image is an observation of a latent variable in a semantic space characterized by semantic units. An image captioning model will generate a caption for the image, and the caption can be mapped back to the latent semantic space and compared with the source semantic latent variable.

Based on this framework, we then introduce our proposed objectives inspired by the information bottleneck principle [19, 20, 21] for feature representation learning [21]. In our framework, we consider that the overall image captioning objective is composed of an “information sufficiency” term, a “minimal redundancy” term, and a “human comprehensibility” term.

Figure 1: Overview of our formulation. Some latent variable $X$ in a latent semantic space $\mathcal{S}$ generates an image $\tilde{X}$ in data space $\mathcal{D}$. The image $\tilde{X}$ is then captioned by $f_{\theta}$, producing a caption $\tilde{Y}$, which can be mapped back to the original latent space as $Y$. The semantic-level error $Z = Y - X$ measures the difference between the source semantics $X$ and the received semantics $Y$.

2.1 Formulation

We assume that meaning arises from a set of independent and discrete units called semantic units in a semantic space, and images and captions are observations of some latent variables in this semantic space [17, 18]. Following [17, 18], we define a semantic unit and the semantic space as:

Definition 1 (Semantic Units and Semantic Space)

A semantic unit represents an atomic piece of information. The set of all possible semantic units is denoted by $\Omega = \{w_i\}_{i=1}^{n}$. A semantic space is an $n$-dimensional space $\mathcal{S} \subseteq [0,1]^{n}$, where the value of the $i$-th dimension represents the probability of the presence of the corresponding semantic unit, $p(w_i)$.

The set $\Omega$ encompasses a wide range of semantic units, similar to how a vocabulary contains diverse words. Each point in $\mathcal{S}$ corresponds to a specific combination of semantic units and hence to a different meaning. Adopting the semantic space and its probabilistic interpretation in [17, 22, 23], we can apply classical information theory [24] at the semantic level and make the information in images and captions comparable.

Let random variables $X \in \mathcal{S}$ and $Y \in \mathcal{S}$ represent the semantic information in an image and a caption, where $X_i = p(w_i)$ and $Y_i = p(w_i)$ represent the likelihood of a semantic unit $w_i$ being observed in an image $\tilde{X} \in \mathcal{D}_{\mathrm{image}}$ or a caption $\tilde{Y} \in \mathcal{D}_{\mathrm{caption}}$, respectively. These latent variables $X$ and $Y$ encode all the information within real images $\tilde{X}$ and textual captions $\tilde{Y}$, including both low-level and high-level semantics. Then, image captioning can be framed as follows:

Definition 2 (Image Captioning)

An image captioning model $f$ parameterized by $\theta$ operates in the observed data spaces, $f_{\theta}: \mathcal{D}_{\mathrm{image}} \rightarrow \mathcal{D}_{\mathrm{caption}}$, and translates an image into a caption, i.e., $\tilde{Y} = f_{\theta}(\tilde{X})$. The image $\tilde{X}$ is generated from some source latent variable, $\tilde{X} = g(X)$, while $\tilde{Y}$ can be converted back to the latent semantic space, $Y = h(\tilde{Y})$. Let $Z = Y - X \in [-1,1]^{n}$ denote the error between the source and recovered semantics caused by the parameters $\theta$.

Here, the sign of $Z$ indicates which kind of error $\tilde{Y}$ has: a negative $Z$ indicates that the caption misses some content of the image (undercoverage), and a positive $Z$ indicates that the caption includes something that is not in the image (hallucination).
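As a minimal sketch of Definition 2, the semantic-level error can be computed per semantic unit and split into its two failure modes. The unit names and probability values below are hypothetical toy choices, since the framework does not enumerate $\Omega$.

```python
# Hypothetical semantic units; the framework does not enumerate Omega.
OMEGA = ["dog", "frisbee", "grass", "cat"]

def semantic_error(x, y):
    """Per-unit error Z = Y - X, with every entry in [-1, 1]."""
    return [yi - xi for xi, yi in zip(x, y)]

def split_error(z):
    """Separate Z into undercoverage (negative part) and hallucination (positive part)."""
    undercoverage = [max(-zi, 0.0) for zi in z]
    hallucination = [max(zi, 0.0) for zi in z]
    return undercoverage, hallucination

x = [1.0, 1.0, 0.5, 0.0]    # source semantics X of the image (toy values)
y = [1.0, 0.0, 0.5, 0.75]   # semantics Y recovered from the caption (toy values)

z = semantic_error(x, y)
undercoverage, hallucination = split_error(z)

for unit, u, h in zip(OMEGA, undercoverage, hallucination):
    print(f"{unit}: undercoverage={u:.2f}, hallucination={h:.2f}")
```

In this toy case the caption omits the frisbee (negative entry of $Z$) and mentions a cat that is not in the image (positive entry of $Z$).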

2.2 Objectives

The image captioning process can be compared to a communication system [24] where the information source $X$ is converted to a signal $\tilde{X}$ by a lossless source encoder $g$, transmitted through a noisy channel $f_{\theta}$, and the received signal $\tilde{Y}$ is losslessly decoded by $h$, giving the final received information $Y$. From this communication system perspective, one might say that the optimal $\theta^{*}$ could just be the one that minimizes the error $\|Z\|$. However, this requirement is unrealistic as it would result in extremely long captions that losslessly encode both high-level semantic information and all the low-level irrelevant information.

Therefore, we apply the information bottleneck principle [19, 20, 21] to evaluate this system. The information bottleneck is a generalization of rate-distortion theory for lossy data compression. It requires a representation (which in our case is $\tilde{Y}$) to have maximal mutual information with some information $T$ that is required to fulfill the task requirements (i.e., high information sufficiency), while having minimal mutual information with the input $X$ (minimal redundancy). The desired minimal sufficient representation can be given as $\tilde{Y}^{*} = \arg\max I(\tilde{Y}; T) - \beta I(X; \tilde{Y})$, where $\beta$ is a Lagrange multiplier. If the captioning model is deterministic (given $X$, it always produces the same $\tilde{Y}$), then we have $H(\tilde{Y} \mid X) = 0$. Since $I(X; \tilde{Y}) = H(\tilde{Y}) - H(\tilde{Y} \mid X)$, the minimal sufficient representation can be written as:

$$\tilde{Y}^{*} = \arg\max_{\tilde{Y}} \; I(\tilde{Y}; T) - \beta H(\tilde{Y}). \tag{1}$$

The second term will penalize the captioning model when it generates an over-length caption, and the value of $\beta$ controls the penalty strength. Combined with the first term, the objective requires the model to preserve as much useful information as possible for the task, while keeping the captions as succinct as possible.
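The step from $I(X; \tilde{Y})$ to $H(\tilde{Y})$ for a deterministic captioner can be checked numerically on a toy discrete example. The sketch below is an illustration only: the four "images", their probabilities, and the captions are hypothetical, and the helper names are not from the paper.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy (in bits) of a distribution given as {outcome: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(p * math.log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

# Toy source distribution over four images (hypothetical).
p_x = {"img_a": 0.5, "img_b": 0.25, "img_c": 0.125, "img_d": 0.125}
# Deterministic captioner: every image maps to exactly one caption.
caption_of = {"img_a": "a dog", "img_b": "a cat", "img_c": "a dog", "img_d": "a bird"}

joint, p_y = defaultdict(float), defaultdict(float)
for x, p in p_x.items():
    joint[(x, caption_of[x])] += p
    p_y[caption_of[x]] += p

mi = mutual_information(joint)
h_y = entropy(p_y)
print(f"I(X; Y~) = {mi:.4f} bits, H(Y~) = {h_y:.4f} bits, H(X) = {entropy(p_x):.4f} bits")
```

Because two images share the caption "a dog", the caption entropy is lower than the source entropy, which is the compression that the $\beta H(\tilde{Y})$ term rewards.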

Next, we give a formal definition of the information sufficiency objective of image captioning with importance of semantic units [17].

Definition 3 (Information Sufficiency)

For a given $X$, let a latent variable $T \in \mathcal{S}$ represent the task-relevant information in $X$, and let an importance variable $A \in [0,1]^{n}$ denote the importance scores of different semantic units. The variable $A$ is derived from $X$ by an underlying mapping and thus depends on $X$. The variable $T$ is produced by a point-wise product between $A$ and $X$, i.e., $T = A \odot X$. For generated image captions $\tilde{Y} = f_{\theta}(\tilde{X})$, the information sufficiency objective is:

$$J_{\mathrm{suf}}(\theta) = I(\tilde{Y}; A \odot X). \tag{2}$$

In the importance variable, $A_i = 1$ means that the semantic unit $w_i$ is very important in the image, while $A_i = 0$ means that $w_i$ is irrelevant. It behaves similarly to the attention mechanism [25], which also produces a heatmap between zero and one according to the given input. Note that $A$ is not binary. The continuous nature of $A$ gives $J_{\mathrm{suf}}(\theta)$ a useful property: when the “budget” is limited (as there are also other objectives to optimize), semantic units with higher importance scores will be retained while less important ones will be discarded.
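A minimal sketch of the importance weighting in Definition 3 follows. The vectors are toy values, and the top-$k$ selection is only a hypothetical stand-in for the limited "budget" described above; the framework itself does not prescribe a selection rule.

```python
def task_relevant(a, x):
    """T = A ⊙ X: element-wise product of importance scores and image semantics."""
    return [ai * xi for ai, xi in zip(a, x)]

def keep_top_k(t, k):
    """Toy budget: return the indices of the k largest entries of T."""
    order = sorted(range(len(t)), key=lambda i: t[i], reverse=True)
    return sorted(order[:k])

x = [1.0, 1.0, 0.5, 1.0]     # image semantics X (toy values)
a = [1.0, 0.25, 0.5, 0.75]   # importance scores A in [0, 1] (toy values)

t = task_relevant(a, x)
kept = keep_top_k(t, k=2)
print("T =", t, "units kept under a budget of 2:", kept)
```

Because $A$ is continuous, shrinking `k` drops units in order of increasing importance rather than all at once.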

Definition 4 (Minimal Redundancy)

The minimal redundancy objective encourages the image captioning model $f_{\theta}$ to eliminate irrelevant information. It is given by the negative entropy of the generated captions:

$$J_{\mathrm{min}}(\theta) = -H(\tilde{Y}). \tag{3}$$

Combining $J_{\mathrm{suf}}(\theta)$ and $J_{\mathrm{min}}(\theta)$ ensures that the captions are minimal sufficient representations of images. However, there is no guarantee that the generated captions can be understood by humans. Therefore, we need to measure the human comprehensibility of the captions using a third objective, which is the distributional similarity between $\tilde{Y}$ and natural language.

Definition 5 (Human Comprehensibility)

Let $P_{\tilde{Y}}$ denote the probability distribution of model-generated captions over $\mathcal{D}_{\mathrm{caption}}$, and let $P_{\mathrm{lang}}$ denote the distribution of human-interpretable natural language. Given a certain statistical divergence measurement $D$, the human comprehensibility objective is:

$$J_{\mathrm{int}}(\theta) = -D(P_{\tilde{Y}} \,\|\, P_{\mathrm{lang}}). \tag{4}$$
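Definition 5 leaves the divergence $D$ open. As one hedged illustration, the sketch below instantiates $D$ as the KL divergence between smoothed unigram distributions; this choice, the smoothing constant, and the toy texts are all assumptions made for illustration, not the measure used in the paper.

```python
import math
from collections import Counter

def unigram_dist(texts, vocab, eps=1e-6):
    """Additively smoothed unigram distribution over a fixed vocabulary."""
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts[w] for w in vocab) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p if p[w] > 0)

def j_int(captions, reference_texts):
    """Comprehensibility score: negative divergence, so higher is better."""
    vocab = sorted({w for t in captions + reference_texts for w in t.lower().split()})
    p = unigram_dist(captions, vocab)
    q = unigram_dist(reference_texts, vocab)
    return -kl_divergence(p, q)

reference = ["a dog runs on the grass", "the cat sits on a mat"]  # stand-in for P_lang
fluent = ["a dog runs on the grass"]
degenerate = ["dog dog grass grass grass grass"]

score_fluent = j_int(fluent, reference)
score_degenerate = j_int(degenerate, reference)
print(f"fluent: {score_fluent:.3f}, degenerate: {score_degenerate:.3f}")
```

A unigram model ignores word order, so in practice a stronger language model would be needed to estimate $P_{\mathrm{lang}}$; the sketch only shows the sign convention of Eq. 4.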

The overall objective of image captioning is a weighted combination of information sufficiency, minimal redundancy, and human comprehensibility:

$$J(\theta) = J_{\mathrm{suf}}(\theta) + \beta J_{\mathrm{min}}(\theta) + \gamma J_{\mathrm{int}}(\theta) = I(\tilde{Y}; A \odot X) - \beta H(\tilde{Y}) - \gamma D(P_{\tilde{Y}} \,\|\, P_{\mathrm{lang}}), \tag{5}$$

where $\beta > 0$ and $\gamma > 0$ are weighting coefficients. Here, one of the factors that the coefficient $\beta$ in Eq. 5 controls is the length of generated captions. If we prefer more detailed, comprehensive captions, we choose a smaller $\beta$ and a larger $\gamma$.
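The effect of $\beta$ can be illustrated with a small sketch that scores two candidate captions. The objective is written in expanded form, $I - \beta H - \gamma D$, which follows Eq. 1 and the sign conventions of Definitions 4 and 5. All numbers are hypothetical estimates chosen for illustration.

```python
def overall_objective(info, entropy, divergence, beta, gamma):
    """Expanded objective I - beta * H - gamma * D; higher is better."""
    return info - beta * entropy - gamma * divergence

# (name, estimated I, estimated H, estimated D): toy numbers, not measurements.
candidates = [
    ("short", 2.0, 3.0, 0.5),
    ("detailed", 5.0, 12.0, 0.5),
]

def best_caption(beta, gamma):
    """Pick the candidate that maximizes the overall objective."""
    best = max(candidates,
               key=lambda c: overall_objective(c[1], c[2], c[3], beta, gamma))
    return best[0]

for beta in (0.5, 0.125):
    print(f"beta={beta}: preferred caption is '{best_caption(beta, gamma=1.0)}'")
```

With the larger $\beta$ the entropy penalty outweighs the extra information in the detailed caption; with the smaller $\beta$ the detailed caption wins.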

3 Example Application of the Framework

3.1 Method: The Pyramid of Captions

In this section, we introduce the Pyramid of Captions (PoCa) method, which showcases the applicability of our framework to image captioning research. The key intuition behind the PoCa method is that we can obtain a more accurate and detailed caption by ensembling multiple captions. We propose to split an image into multiple local patches, generate a local caption for each patch, and fuse the captions to obtain a higher-quality caption for the global image.

Formally, let $\sigma_{\mathrm{split}}$ be a function operating in $\mathcal{D}_{\mathrm{image}}$ that represents the splitting function, which splits an image into a set of local patches:

$$\sigma_{\mathrm{split}}(\tilde{X}) = \left\{\tilde{X}^{[j]}\right\}_{j=1}^{m}. \tag{6}$$

We apply an image captioning model $f_{\theta}$ to the local patches and obtain a set of local captions, $\{\tilde{Y}^{[j]} = f_{\theta}(\tilde{X}^{[j]})\}_{j=1}^{m}$, and also generate a caption for the global image, $\tilde{Y} = f_{\theta}(\tilde{X})$. The local and global information will be fused by a merging function $\sigma_{\mathrm{merge}}$ operating in $\mathcal{D}_{\mathrm{caption}}$:

$$\tilde{Y}_{\mathrm{merged}} = \sigma_{\mathrm{merge}}\left(\tilde{Y}; \{\tilde{Y}^{[j]}\}_{j=1}^{m}\right). \tag{7}$$

We adopt text-only LLMs as the merging function $\sigma_{\mathrm{merge}}$. Table 2 provides an example of merging using an LLM, where we instruct it to generate a merged caption that incorporates both local and global information.

As the PoCa method is hierarchical, we can extend it to more layers: we can recursively split a patch into sub-patches and merge the captions of the sub-patches to represent the patch.
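The split, caption, and merge steps, including the recursive extension, can be sketched as follows. Everything here is a toy stand-in: the "image" is a grid of labels, `toy_captioner` replaces $f_{\theta}$ with a rule that names only the two most frequent labels, `toy_merge` replaces the LLM-based $\sigma_{\mathrm{merge}}$ with a set union, and the four-quadrant split is one hypothetical choice of $\sigma_{\mathrm{split}}$.

```python
def split_image(image):
    """Toy sigma_split: cut a 2D grid (list of rows) into four quadrants."""
    mh, mw = len(image) // 2, len(image[0]) // 2
    return [
        [row[:mw] for row in image[:mh]],
        [row[mw:] for row in image[:mh]],
        [row[:mw] for row in image[mh:]],
        [row[mw:] for row in image[mh:]],
    ]

def toy_captioner(image, capacity=2):
    """Stand-in for f_theta: mentions at most `capacity` of the most frequent labels."""
    counts = {}
    for row in image:
        for label in row:
            counts[label] = counts.get(label, 0) + 1
    top = sorted(counts, key=lambda k: (-counts[k], k))[:capacity]
    return set(top)

def toy_merge(global_caption, local_captions):
    """Stand-in for sigma_merge (an LLM in the paper): union of mentioned labels."""
    merged = set(global_caption)
    for caption in local_captions:
        merged |= caption
    return merged

def poca(image, depth=1):
    """Caption the image globally, then merge with (recursively built) patch captions."""
    global_caption = toy_captioner(image)
    if depth == 0 or len(image) < 2 or len(image[0]) < 2:
        return global_caption
    local_captions = [poca(patch, depth - 1) for patch in split_image(image)]
    return toy_merge(global_caption, local_captions)

image = [
    ["sky", "sky", "sky", "bird"],
    ["sky", "sky", "sky", "sky"],
    ["grass", "grass", "dog", "grass"],
    ["grass", "grass", "grass", "ball"],
]

print("global only:", sorted(toy_captioner(image)))
print("one level:  ", sorted(poca(image, depth=1)))
print("two levels: ", sorted(poca(image, depth=2)))
```

In this toy setting, small objects that the global caption misses are recovered as the pyramid gets deeper. A set union never removes content, whereas an LLM merger can also drop or correct statements, so the sketch only conveys the control flow.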

3.2 PoCa Gives Better Captions (Provably)

In this section, we provide an analysis of under what condition a single local-global merging operation in PoCa can be guaranteed to improve the caption quality.

First, we assume that there is some function $\varphi$ that quantifies the error $Z$ from $X$ in a deterministic manner, and that this function is concave. The deterministic assumption on $\varphi$ simplifies the analysis by ignoring errors caused by factors other than the input semantics $X$, such as the randomness in sampling-based autoregressive generation. In other words, we assume that $f_{\theta}$ always generates the same caption and makes the same error for the same input. The concavity of $\varphi$ implies that it generates a larger volume of error when $p(w_i)$ is far from zero and one. This means that the captioning model is more likely to make mistakes when there is high uncertainty about the presence or absence of a semantic unit in the image.

Assumption 1 (Uncertainty-aware content-dependent error)

The error $Z$ produced by the image captioning model $f_{\theta}$ is dependent on the input semantics $X$. Therefore, it can be expressed as a deterministic function of $X$:

$$\|Z_i\| = \varphi(X_i), \tag{8}$$

where $\varphi$ is a concave function and $1 \leq i \leq n$.
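Assumption 1 does not fix a particular $\varphi$. As one hypothetical example, a scaled binary entropy is concave on $[0,1]$, vanishes when the presence of a unit is certain, and peaks at $p(w_i) = 0.5$. The scale factor below is an arbitrary illustrative choice.

```python
import math

def phi(p, scale=0.5):
    """Illustrative concave error profile (an assumption, not the paper's phi)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    h = -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)
    return scale * h

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(f"p(w_i) = {p:.1f} -> |Z_i| = {phi(p):.3f}")
```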

Next, we introduce our assumptions on the image splitting function $\sigma_{\mathrm{split}}$ and the caption merging function $\sigma_{\mathrm{merge}}$. These assumptions simplify the relationship between local and global semantics by assuming linear combinations. In practice, the relationship may be more complex and depend on factors such as the spatial arrangement of the local patches and the presence of objects spanning multiple patches.

Assumption 2 (Local-global relationship of image semantics)

The function $\sigma_{\mathrm{split}}$ splits an image into local patches. The latent semantic variables corresponding to local patches satisfy the following relationship with global semantics:

$$X = \sum_{j=1}^{m} \alpha_j X^{[j]}, \tag{9}$$

where the weights $\alpha_j$ satisfy $\sum_{j=1}^{m} \alpha_j = 1$.

Assumption 3 (Local-global aggregation of caption semantics)

The function $\sigma_{\mathrm{merge}}$ merges the local and global captions. The latent semantic variable corresponding to the merged caption is a weighted combination of the global and local semantics:

$$Y_{\mathrm{merged}} = \eta Y + (1 - \eta) \sum_{j=1}^{m} \alpha_j Y^{[j]}, \tag{10}$$

where $\eta \in (0,1)$ is a weighting coefficient.

We now present a theorem regarding $Z_{\mathrm{merged}} = Y_{\mathrm{merged}} - X$ (the proof can be found in the Appendix):

Theorem 1 (PoCa method reduces semantic error)

Under Assumptions 1-3, the PoCa method is guaranteed to have an error $Z_{\mathrm{merged}}$ that is no larger than $Z$, i.e.,

$$\|Z_{\mathrm{merged}}\| \leq \|Z\|. \tag{11}$$

Since $A$ is non-negative, Theorem 1 implies non-decreasing information sufficiency; if merging does not increase redundancy or decrease interpretability, then the overall caption quality improves. This also aligns with the findings in [26] that smaller-scale models combined can be as effective as a larger-scale model. Additionally, it is worth noting that our assumptions require linear combinations of semantics, which may not hold in practice for images with more complex semantic structures.
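One way to see why the bound is plausible (the proof in the Appendix may proceed differently): Assumptions 2 and 3 give $Z_{\mathrm{merged}} = \eta Z + (1-\eta)\sum_j \alpha_j Z^{[j]}$, the triangle inequality bounds its magnitude by $\eta\,\varphi(X_i) + (1-\eta)\sum_j \alpha_j \varphi(X_i^{[j]})$, and Jensen's inequality for a concave $\varphi$ with non-negative weights bounds the sum by $\varphi(X_i)$. The sketch below checks this numerically for a single semantic unit. The choice of $\varphi$, the non-negativity of $\alpha_j$, and the random error signs are assumptions of the sketch.

```python
import random

def phi(p):
    """Illustrative concave error magnitude (an assumption, not from the paper)."""
    return 0.5 * p * (1.0 - p)

def check_once(rng, m=4, eta=0.5):
    """Return (|Z_merged|, |Z|) for one semantic unit under Assumptions 1-3."""
    # Convex weights: alpha_j >= 0 and sum to 1 (non-negativity is assumed here).
    raw = [rng.random() + 1e-9 for _ in range(m)]
    alpha = [r / sum(raw) for r in raw]
    # Local semantics X^[j] and global semantics X via Eq. 9.
    x_local = [rng.random() for _ in range(m)]
    x_global = sum(a * x for a, x in zip(alpha, x_local))
    # Assumption 1 fixes only the error magnitude; signs are drawn at random.
    z_global = rng.choice([-1.0, 1.0]) * phi(x_global)
    z_local = [rng.choice([-1.0, 1.0]) * phi(x) for x in x_local]
    y_global = x_global + z_global
    y_local = [x + z for x, z in zip(x_local, z_local)]
    # Eq. 10: semantics of the merged caption.
    y_merged = eta * y_global + (1.0 - eta) * sum(a * y for a, y in zip(alpha, y_local))
    return abs(y_merged - x_global), abs(z_global)

rng = random.Random(0)
results = [check_once(rng) for _ in range(1000)]
violations = sum(1 for merged, base in results if merged > base + 1e-12)
print("violations of |Z_merged| <= |Z|:", violations)
```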

3.3 PoCa Gives Better Captions (Empirically)

We conduct quantitative evaluations to study whether the PoCa method can improve the caption quality. We adopt the VQA-v2 [27] dataset, which is built upon the MS-COCO [11] dataset and contains multiple questions per image. The questions serve as a proxy for the importance score $A$ in Definition 3, and the accuracy of the answers generated by a text-only LLM (LLaMA2-Chat-13B, prompt shown in Appendix Table 5) becomes an estimate of the information sufficiency term. See Appendix for more implementation details.
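The evaluation protocol can be sketched as follows: a caption replaces the image, a text-only reader answers each question from the caption, and the accuracy estimates information sufficiency. In this sketch, `answer_from_caption` is a hypothetical keyword-matching stand-in for the LLM, the questions are toy yes/no examples rather than VQA-v2 data, and exact-match accuracy is used as a simplification.

```python
def answer_from_caption(caption, question):
    """Hypothetical stand-in for the text-only LLM: checks whether a keyword is mentioned."""
    return "yes" if question["keyword"] in caption.lower().split() else "no"

def caption_vqa_accuracy(caption, questions):
    """Fraction of questions answered correctly using only the caption."""
    correct = sum(answer_from_caption(caption, q) == q["answer"] for q in questions)
    return correct / len(questions)

# Toy questions about one image (hypothetical, not taken from VQA-v2).
questions = [
    {"text": "Is there a dog?", "keyword": "dog", "answer": "yes"},
    {"text": "Is there a frisbee?", "keyword": "frisbee", "answer": "yes"},
    {"text": "Is there a cat?", "keyword": "cat", "answer": "no"},
    {"text": "Is there grass?", "keyword": "grass", "answer": "yes"},
]

short_caption = "a dog on grass"
rich_caption = "a dog catching a frisbee on grass"

acc_short = caption_vqa_accuracy(short_caption, questions)
acc_rich = caption_vqa_accuracy(rich_caption, questions)
print(f"short caption: {acc_short:.2f}, richer caption: {acc_rich:.2f}")
```

The questions act as the importance weighting: a caption is rewarded only for the semantic units that some question asks about.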

Figure 2: Evaluation of Sufficiency. VQA accuracy using captions generated by different image captioning models and the proposed PoCa method. The PoCa method consistently improves the VQA performance across all models and caption types.

Figure 2 provides the evaluation results on 5,000 questions in the VQA-v2 validation split. As can be seen, our proposed PoCa method (green) consistently yields performance gains across all three examined LVLMs. The scale of improvement ranges from 0.48% (MobileVLM-v2 detailed captions) to 2.10% (LLaVA-1.5 short captions). Interestingly, we find that detailed captions do not necessarily correlate with better information coverage, as the detailed captions generated by MobileVLM-v2 underperform the single-sentence captions generated by InternVL. The comparison also shows that human annotations may not be optimal for certain scenarios, since several groups of LVLM-generated captions can yield higher VQA accuracy compared to that of human annotators.

It is crucial to evaluate whether the performance gain brought by PoCa is achieved by significantly sacrificing other objectives. In Table 1, we present the length statistics. We calculate the average number of words in default captions and PoCa captions and note their differences in the “$\pm\Delta$” column. The results show that PoCa does not exhibit a significant trend of either increasing or decreasing the length of captions. Among the six comparisons, PoCa compresses the length in four cases and extends the length in two cases. This empirically demonstrates that using LLMs as $\sigma_{\mathrm{merge}}$ in the PoCa model does not significantly violate the minimal redundancy objective $H(\tilde{Y}_{\mathrm{merged}})$.

Table 1: Evaluation of Redundancy. Caption length statistics (average number of words) for default and PoCa captions.

| LVLM | VQAv2 [27] Default | VQAv2 [27] PoCa | VQAv2 [27] $\pm\Delta$ | Img2P [28] Default | Img2P [28] PoCa | Img2P [28] $\pm\Delta$ |
|---|---|---|---|---|---|---|
| MobileVLM-v2-1.7B | 54.1 | 78.2 | +24.1 | 61.6 | 47.0 | -14.6 |
| LLaVA-1.5-7B | 82.7 | 74.7 | -8.0 | 93.2 | 133.4 | +40.2 |
| InternVL | 158.3 | 93.4 | -65.0 | 177.4 | 176.2 | -1.2 |

4 Conclusion

Our work presents a novel information-theoretic framework that provides well-defined principles for image captioning covering information sufficiency, minimal redundancy, and human comprehensibility. By leveraging the theoretical framework, we propose Pyramid of Captions (PoCa), a novel image captioning approach that employs a hierarchical method to generate content-rich captions by exploiting the complementary nature of local and global visual cues. Through theoretical proofs and empirical evaluations, we demonstrate that PoCa consistently enhances the quality of image captions, making them more informative, semantically accurate, and contextually coherent while maintaining brevity and interpretability.

References

  • Gurari et al. [2020] Danna Gurari, Yinan Zhao, Meng Zhang, and Nilavra Bhattacharya. Captioning images taken by people who are blind. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XVII 16, pages 417–434. Springer, 2020.
  • Rane et al. [2021] Chinmayi Rane, Amol Lashkare, Aarti Karande, and YS Rao. Image captioning based smart navigation system for visually impaired. In 2021 International Conference on Communication information and Computing Technology (ICCICT), pages 1–5. IEEE, 2021.
  • Guo et al. [2022] Yu Guo, Yue Chen, Yuanyan Xie, Xiaojuan Ban, and Mohammad S Obaidat. An offline assistance tool for visually impaired people based on image captioning. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 969–976. IEEE, 2022.
  • Ganesan et al. [2022] Jothi Ganesan, Ahmad Taher Azar, Shrooq Alsenan, Nashwa Ahmad Kamal, Basit Qureshi, and Aboul Ella Hassanien. Deep learning reader for visually impaired. Electronics, 11(20):3335, 2022.
  • Gudivada and Raghavan [1995] Venkat N Gudivada and Vijay V Raghavan. Content based image retrieval systems. Computer, 28(9):18–22, 1995.
  • Srihari [1995] Rohini K Srihari. Automatic indexing and content-based retrieval of captioned images. Computer, 28(9):49–56, 1995.
  • Datta et al. [2005] Ritendra Datta, Jia Li, and James Z Wang. Content-based image retrieval: approaches and trends of the new age. In Proceedings of the 7th ACM SIGMM international workshop on Multimedia information retrieval, pages 253–262, 2005.
  • da Silva Torres and Falcao [2006] Ricardo da Silva Torres and Alexandre X Falcao. Content-based image retrieval: theory and applications. RITA, 13(2):161–185, 2006.
  • Jain et al. [2015] Sahil Jain, Kiranmai Pulaparthi, and Chetan Fulara. Content based image retrieval. Int. J. Adv. Eng. Glob. Technol, 3:1251–1258, 2015.
  • Li et al. [2021] Xiaoqing Li, Jiansheng Yang, and Jinwen Ma. Recent developments of content-based image retrieval (cbir). Neurocomputing, 452:675–689, 2021.
  • Chen et al. [2015] Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO captions: Data collection and evaluation server. CoRR, abs/1504.00325, 2015. URL http://arxiv.org/abs/1504.00325.
  • Liu et al. [2023a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a. URL https://openreview.net/forum?id=w0H2xGHlkw.
  • Dai et al. [2024] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36, 2024.
  • Urbanek et al. [2023] Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, and Adriana Romero-Soriano. A picture is worth more than 77 text tokens: Evaluating clip-style models on dense captions. arXiv preprint arXiv:2312.08578, 2023.
  • Cho et al. [2023] Jaemin Cho, Yushi Hu, Jason Michael Baldridge, Roopal Garg, Peter Anderson, Ranjay Krishna, Mohit Bansal, Jordi Pont-Tuset, and Su Wang. Davidsonian scene graph: Improving reliability in fine-grained evaluation for text-image generation. In The Twelfth International Conference on Learning Representations, 2023.
  • Wang et al. [2024] Weiyun Wang, Min Shi, Qingyun Li, Wenhai Wang, Zhenhang Huang, Linjie Xing, Zhe Chen, Hao Li, Xizhou Zhu, Zhiguo Cao, Yushi Chen, Tong Lu, Jifeng Dai, and Yu Qiao. The all-seeing project: Towards panoptic visual recognition and understanding of the open world. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=c2R7ajodcI.
  • Peyrard [2019] Maxime Peyrard. A simple theoretical model of importance for summarization. In Anna Korhonen, David R. Traum, and Lluís Màrquez, editors, Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 1059–1073. Association for Computational Linguistics, 2019. doi: 10.18653/V1/P19-1101. URL https://doi.org/10.18653/v1/p19-1101.
  • Zhong [2017] Yixin Zhong. A theory of semantic information. Proceedings, 1(3), 2017. ISSN 2504-3900. doi: 10.3390/IS4SI-2017-04000. URL https://www.mdpi.com/2504-3900/1/3/129.
  • Tishby et al. [2000] Naftali Tishby, Fernando C. N. Pereira, and William Bialek. The information bottleneck method. CoRR, physics/0004057, 2000. URL http://arxiv.org/abs/physics/0004057.
  • Shwartz-Ziv and Tishby [2017] Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. CoRR, abs/1703.00810, 2017. URL http://arxiv.org/abs/1703.00810.
  • Tsai et al. [2021] Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency. Self-supervised learning from a multi-view perspective. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview.net/forum?id=-bdp_8Itjwp.
  • Shao et al. [2022] Yulin Shao, Qi Cao, and Deniz Gündüz. A theory of semantic communication. CoRR, abs/2212.01485, 2022. doi: 10.48550/ARXIV.2212.01485. URL https://doi.org/10.48550/arXiv.2212.01485.
  • Niu and Zhang [2024] Kai Niu and Ping Zhang. A mathematical theory of semantic communication. CoRR, abs/2401.13387, 2024. doi: 10.48550/ARXIV.2401.13387. URL https://doi.org/10.48550/arXiv.2401.13387.
  • Shannon [1948] Claude E. Shannon. A mathematical theory of communication. Bell Syst. Tech. J., 27(3):379–423, 1948. doi: 10.1002/J.1538-7305.1948.TB01338.X. URL https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
  • Bahdanau et al. [2015] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1409.0473.
  • Shi et al. [2024] Baifeng Shi, Ziyang Wu, Maolin Mao, Xin Wang, and Trevor Darrell. When do we not need larger vision models? arXiv preprint arXiv:2403.13043, 2024.
  • Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 6325–6334. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.670. URL https://doi.org/10.1109/CVPR.2017.670.
  • Krause et al. [2017] Jonathan Krause, Justin Johnson, Ranjay Krishna, and Li Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 317–325, 2017.
  • Yu et al. [2022a] Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jae Sung Park, Ximing Lu, Prithviraj Ammanabrolu, Rowan Zellers, Ronan Le Bras, Gunhee Kim, and Yejin Choi. Multimodal knowledge alignment with reinforcement learning. CoRR, abs/2205.12630, 2022a. doi: 10.48550/ARXIV.2205.12630. URL https://doi.org/10.48550/arXiv.2205.12630.
  • Feng et al. [2019] Yang Feng, Lin Ma, Wei Liu, and Jiebo Luo. Unsupervised image captioning. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 4125–4134. Computer Vision Foundation / IEEE, 2019. doi: 10.1109/CVPR.2019.00425. URL http://openaccess.thecvf.com/content_CVPR_2019/html/Feng_Unsupervised_Image_Captioning_CVPR_2019_paper.html.
  • Liu et al. [2023b] Hao Liu, Wilson Yan, and Pieter Abbeel. Language quantized autoencoders: Towards unsupervised text-image alignment. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023b. URL http://papers.nips.cc/paper_files/paper/2023/hash/0df1738319f8c6e15b58cb16ea3cfa57-Abstract-Conference.html.
  • Yu et al. [2023] Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin P. Murphy, Alexander G. Hauptmann, and Lu Jiang. SPAE: semantic pyramid autoencoder for multimodal generation with frozen llms. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/a526cc8f6ffb74bedb6ff313e3fdb450-Abstract-Conference.html.
  • Chen et al. [2023a] Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2020.
  • Chen et al. [2020] Yen-Chun Chen, Linjie Li, Licheng Yu, Ahmed El Kholy, Faisal Ahmed, Zhe Gan, Yu Cheng, and Jingjing Liu. Uniter: Universal image-text representation learning. In European conference on computer vision, pages 104–120. Springer, 2020.
  • Zhang et al. [2021] Pengchuan Zhang, Xiujun Li, Xiaowei Hu, Jianwei Yang, Lei Zhang, Lijuan Wang, Yejin Choi, and Jianfeng Gao. Vinvl: Revisiting visual representations in vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5579–5588, 2021.
  • Wang et al. [2022a] Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. SimVLM: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=GUrhfTuf_3.
  • Wang et al. [2022b] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022b.
  • Yu et al. [2022b] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. Transactions on Machine Learning Research, 2022b. ISSN 2835-8856. URL https://openreview.net/forum?id=Ee277P3AYC.
  • Wang et al. [2022c] Jianfeng Wang, Zhengyuan Yang, Xiaowei Hu, Linjie Li, Kevin Lin, Zhe Gan, Zicheng Liu, Ce Liu, and Lijuan Wang. Git: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research, 2022c.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
  • Li et al. [2023] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023.
  • Tsimpoukelli et al. [2021] Maria Tsimpoukelli, Jacob Menick, Serkan Cabi, S. M. Ali Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021. URL https://openreview.net/forum?id=WtmMyno9Tq2.
  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Sun et al. [2023] Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, and Xinlong Wang. Emu: Generative pretraining in multimodality. In The Twelfth International Conference on Learning Representations, 2023.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. 2023.
  • Zhang et al. [2024] Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Enhanced visual instruction tuning for text-rich image understanding, 2024. URL https://openreview.net/forum?id=tj4a1JY03u.
  • Jiang et al. [2023] Yunfan Jiang, Agrim Gupta, Zichen Zhang, Guanzhi Wang, Yongqiang Dou, Yanjun Chen, Li Fei-Fei, Anima Anandkumar, Yuke Zhu, and Linxi Fan. Vima: robot manipulation with multimodal prompts. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Wake et al. [2023] Naoki Wake, Atsushi Kanehira, Kazuhiro Sasabuchi, Jun Takamatsu, and Katsushi Ikeuchi. Gpt-4v (ision) for robotics: Multimodal task planning from human demonstration. arXiv preprint arXiv:2311.12015, 2023.
  • Zhao et al. [2024] Haozhe Zhao, Zefan Cai, Shuzheng Si, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, and Baobao Chang. MMICL: Empowering vision-language model with multi-modal in-context learning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5KojubHBr8.
  • Zhang et al. [2023] Yuanhan Zhang, Kaiyang Zhou, and Ziwei Liu. What makes good examples for visual in-context learning? In A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 17773–17794. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/398ae57ed4fda79d0781c65c926d667b-Paper-Conference.pdf.
  • Yang et al. [2022] Zhengyuan Yang, Zhe Gan, Jianfeng Wang, Xiaowei Hu, Yumao Lu, Zicheng Liu, and Lijuan Wang. An empirical study of gpt-3 for few-shot knowledge-based vqa. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3081–3089, 2022.
  • Hu et al. [2022] Yushi Hu, Hang Hua, Zhengyuan Yang, Weijia Shi, Noah A Smith, and Jiebo Luo. Promptcap: Prompt-guided task-aware image captioning. arXiv preprint arXiv:2211.09699, 2022.
  • Lin et al. [2022] Yuanze Lin, Yujia Xie, Dongdong Chen, Yichong Xu, Chenguang Zhu, and Lu Yuan. Revive: Regional visual representation matters in knowledge-based visual question answering. Advances in Neural Information Processing Systems, 35:10560–10571, 2022.
  • Gui et al. [2022] Liangke Gui, Borui Wang, Qiuyuan Huang, Alexander G Hauptmann, Yonatan Bisk, and Jianfeng Gao. Kat: A knowledge augmented transformer for vision-and-language. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 956–968, 2022.
  • Shao et al. [2023] Zhenwei Shao, Zhou Yu, Meng Wang, and Jun Yu. Prompting large language models with answer heuristics for knowledge-based visual question answering. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14974–14983, 2023. doi: 10.1109/CVPR52729.2023.01438.
  • Gupta and Kembhavi [2023] Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14953–14962, 2023.
  • Liu et al. [2023c] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. CoRR, abs/2310.03744, 2023c. doi: 10.48550/ARXIV.2310.03744. URL https://doi.org/10.48550/arXiv.2310.03744.
  • Chu et al. [2024] Xiangxiang Chu, Limeng Qiao, Xinyu Zhang, Shuang Xu, Fei Wei, Yang Yang, Xiaofei Sun, Yiming Hu, Xinyang Lin, Bo Zhang, and Chunhua Shen. Mobilevlm v2: Faster and stronger baseline for vision language model, 2024.
  • Chen et al. [2023b] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
  • Hessel et al. [2021] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih, editors, Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, pages 7514–7528. Association for Computational Linguistics, 2021. doi: 10.18653/V1/2021.EMNLP-MAIN.595. URL https://doi.org/10.18653/v1/2021.emnlp-main.595.
  • Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Jade Goldstein, Alon Lavie, Chin-Yew Lin, and Clare R. Voss, editors, Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization@ACL 2005, Ann Arbor, Michigan, USA, June 29, 2005, pages 65–72. Association for Computational Linguistics, 2005. URL https://aclanthology.org/W05-0909/.
  • Liang et al. [2017] Xiaodan Liang, Zhiting Hu, Hao Zhang, Chuang Gan, and Eric P. Xing. Recurrent topic-transition GAN for visual paragraph generation. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 3382–3391. IEEE Computer Society, 2017. doi: 10.1109/ICCV.2017.364. URL https://doi.org/10.1109/ICCV.2017.364.
  • Yang et al. [2020] Xu Yang, Chongyang Gao, Hanwang Zhang, and Jianfei Cai. Hierarchical scene graph encoder-decoder for image paragraph captioning. In Chang Wen Chen, Rita Cucchiara, Xian-Sheng Hua, Guo-Jun Qi, Elisa Ricci, Zhengyou Zhang, and Roger Zimmermann, editors, MM ’20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020, pages 4181–4189. ACM, 2020. doi: 10.1145/3394171.3413859. URL https://doi.org/10.1145/3394171.3413859.
  • Melas-Kyriazi et al. [2018] Luke Melas-Kyriazi, Alexander M. Rush, and George Han. Training for diversity in image paragraph captioning. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 757–761. Association for Computational Linguistics, 2018. doi: 10.18653/V1/D18-1084. URL https://doi.org/10.18653/v1/d18-1084.
  • Wang et al. [2019] Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, and Tao Mei. Convolutional auto-encoding of sentence topics for image paragraph generation. In Sarit Kraus, editor, Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 940–946. ijcai.org, 2019. doi: 10.24963/IJCAI.2019/132. URL https://doi.org/10.24963/ijcai.2019/132.
  • Chung and Yu [2023] Jiwan Chung and Youngjae Yu. VLIS: unimodal language models guide multimodal language generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pages 700–721. Association for Computational Linguistics, 2023. doi: 10.18653/V1/2023.EMNLP-MAIN.46. URL https://doi.org/10.18653/v1/2023.emnlp-main.46.
  • Fan et al. [2023] Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving CLIP training with language rewrites. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine, editors, Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/6fa4d985e7c434002fb6289ab9b2d654-Abstract-Conference.html.
  • Chen et al. [2024] Delong Chen, Jianfeng Liu, Wenliang Dai, and Baoyuan Wang. Visual instruction tuning with polite flamingo. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan, editors, Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, pages 17745–17753. AAAI Press, 2024. doi: 10.1609/AAAI.V38I16.29727. URL https://doi.org/10.1609/aaai.v38i16.29727.

Appendix

Appendix A Related Work

A.1 Image Captioning

Image captioning lies at the intersection of computer vision and natural language processing, requiring both accurate visual recognition and coherent language generation abilities. In Section 2 we formally defined the objectives for this task, but it is worth noting that current efforts do not explicitly optimize that objective. The primary challenge is the difficulty of back-propagation through the discrete textual space $\mathcal{D}_{\mathrm{caption}}$, and efforts addressing this challenge involve adopting reinforcement learning [29, 30] or aligning continuous latent spaces with language spaces [31, 32]. However, these approaches suffer from training instability and less satisfactory language coherence.

Most current methods rely on a surrogate methodology, where they use human annotators $f_{\mathrm{human}}$ to write captions and train models to imitate those captions. The underlying assumption is that human-written captions optimize the objective $J(\mathrm{human})$, which is achieved by providing instructions to crowd-sourced caption annotators. For example, the instructions for MS COCO Caption annotation [11] include “Describe all the important parts of the scene” and “Do not describe unimportant details”, which are respectively connected to the information sufficiency term and minimal redundancy term in our objective.

An important trend in the image captioning field is the increasing focus on the comprehensiveness of image captions. As mentioned earlier, this represents a decreased length penalty (smaller weight $\beta$ for the minimal redundancy term) and more emphasis on the information sufficiency term. In recent years, there has been an increasing number of high-quality detailed captioning datasets for this target, such as human-annotated image paragraph captioning [28], Densely Captioned Images (DCI) [14], and pseudo-labeled datasets, including LLaVA-Detailed-Captions [12], ShareGPT4V [33], AS-1B [16], etc. However, detailed caption annotation is much more expensive than previous single-sentence annotation, while automated caption labeling exhibits a high risk of hallucination.

A.2 Vision-Language Learning in the Era of Large Language Models

Various methods have been explored for enabling vision-language learning in LLMs. One line of work focuses on vision-language alignment during pretraining [34, 35, 36, 37, 38, 39, 40, 41, 42, 43], allowing the model to jointly learn a shared vision-language latent space. Another line of work improves vision-language training efficiency by aligning the vision representation to the language space of LLMs, training only the visual encoder module or a vision-language projection matrix [44, 45, 12, 46, 47, 48]. Both lines of work achieve vision-language alignment, which enables joint vision-and-language prompting methods such as robot manipulation prompting [49, 50] and multimodal in-context learning [51, 52].

Unlike these two directions, a third line of work exploits the reasoning and planning abilities of LLMs to enable zero-shot multimodal vision-language inference: information from the vision modality is extracted into a textual description, and inference is performed by frozen LLMs [53, 54, 55, 56]. Recent works in this direction showcase remarkable VQA ability through answer heuristics generation [57] and enable object tagging and image editing through visual programming [58]. Inspired by this line of work, our work introduces a zero-shot hierarchical image captioning approach that relies on the reasoning ability of LLMs to aggregate information from local and global captions.
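The aggregation of local and global captions described above can be sketched schematically as follows. This is only an illustrative outline under stated assumptions, not the implementation used in the experiments: the 2x2 grid split, the function names `grid_boxes` and `hierarchical_caption`, and the callables `caption_fn` (standing in for an LVLM captioner) and `merge_fn` (standing in for an LLM that merges captions) are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels

def grid_boxes(width: int, height: int, rows: int = 2, cols: int = 2) -> List[Box]:
    """Split an image of the given size into a rows x cols grid of crop boxes."""
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            top, bottom = r * height // rows, (r + 1) * height // rows
            boxes.append((left, top, right, bottom))
    return boxes

def hierarchical_caption(
    width: int,
    height: int,
    caption_fn: Callable[[Optional[Box]], str],
    merge_fn: Callable[[str, List[str]], str],
) -> str:
    """Caption the whole image and each local crop, then merge the results."""
    global_caption = caption_fn(None)  # None denotes the whole image
    local_captions = [caption_fn(box) for box in grid_boxes(width, height)]
    return merge_fn(global_caption, local_captions)

# Stub models (hypothetical) so the sketch runs without any real LVLM or LLM.
def fake_captioner(box: Optional[Box]) -> str:
    return "a street scene" if box is None else f"detail in region {box}"

def fake_merger(global_caption: str, local_captions: List[str]) -> str:
    return global_caption + " | " + "; ".join(local_captions)

print(hierarchical_caption(4, 4, fake_captioner, fake_merger))
```

Passing the captioner and merger as callables keeps the sketch zero-shot in spirit: no component is trained, and either model can be swapped without changing the control flow.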

Appendix B Proof for Theorem 1

Theorem 1 (PoCa method reduces semantic error)

Under Assumptions 1-3, the PoCa method is guaranteed to have an error $Z_{\mathrm{merged}} = Y_{\mathrm{merged}} - X$ whose norm is no larger than that of $Z$, i.e.,

$$\|Z_{\mathrm{merged}}\| \leq \|Z\|. \tag{12}$$
Proof 1

First, we express the error of the $i$-th semantic unit in the merged caption, $Z_{\mathrm{merged},i}$, as the difference between the merged caption semantics $Y_{\mathrm{merged},i}$ and the source semantics $X_i$. Using Assumptions 3 and 2, $Z_{\mathrm{merged},i}$ can be expressed as:

\begin{align}
Z_{\mathrm{merged},i} &= Y_{\mathrm{merged},i} - X_i \tag{13} \\
&= \eta Y_i + (1-\eta)\sum_j^m \alpha_j Y^{[j]}_i - X_i \tag{14} \\
&= \eta (X_i + Z_i) + (1-\eta)\sum_j^m \alpha_j (X^{[j]}_i + Z^{[j]}_i) - X_i \tag{15} \\
&= \eta Z_i + (1-\eta)\sum_j^m \alpha_j Z^{[j]}_i. \tag{16}
\end{align}

Here, (15) follows from (14) by decomposing the global and local caption semantics into their corresponding source semantics and errors, i.e., $Y_i = X_i + Z_i$ and $Y^{[j]}_i = X^{[j]}_i + Z^{[j]}_i$. Step (16) then follows once the source-semantics terms cancel, i.e., when $\sum_j^m \alpha_j X^{[j]}_i = X_i$.
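The algebra in (13)-(16) can be checked numerically. The sketch below is a sanity check rather than part of the proof; it assumes, for illustration only, that every local unit shares the source semantics of the global unit ($X^{[j]}_i = X_i$) and that the weights $\alpha_j$ sum to one, which is one simple way to make the source terms cancel. The helper name `combine` and all numeric values are hypothetical.

```python
import random

def combine(eta, alphas, global_vec, local_vecs):
    """eta * global + (1 - eta) * sum_j alpha_j * local_j, element-wise."""
    dim = len(global_vec)
    return [
        eta * global_vec[k]
        + (1 - eta) * sum(a * v[k] for a, v in zip(alphas, local_vecs))
        for k in range(dim)
    ]

random.seed(0)
dim, m, eta = 3, 4, 0.3
alphas = [0.1, 0.2, 0.3, 0.4]                     # assumed to sum to 1
X = [random.gauss(0, 1) for _ in range(dim)]      # source semantics X_i
Z = [random.gauss(0, 1) for _ in range(dim)]      # global caption error Z_i
Z_loc = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]

# Assumption for this check: each local unit shares the source semantics X_i.
Y = [x + z for x, z in zip(X, Z)]                 # Y_i = X_i + Z_i
Y_loc = [[x + z for x, z in zip(X, zj)] for zj in Z_loc]

Y_merged = combine(eta, alphas, Y, Y_loc)         # right-hand side of Eq. (14), before subtracting X_i
Z_merged = [y - x for y, x in zip(Y_merged, X)]   # Eq. (13)
Z_expected = combine(eta, alphas, Z, Z_loc)       # Eq. (16)

max_gap = max(abs(a - b) for a, b in zip(Z_merged, Z_expected))
print(max_gap < 1e-12)
```

If the local source semantics differed from $X_i$, the two sides would no longer agree, which makes explicit where the cancellation condition enters.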

Next, we define $\Delta_{\mathrm{PoC},i}$ as the gap between the norm of the global caption error and the norm of the merged caption error for the $i$-th semantic unit. Using the triangle inequality and Assumption 1, we derive a lower bound for $\Delta_{\mathrm{PoC},i}$:

\begin{align}
\Delta_{\mathrm{PoC},i} &= \|Z_i\| - \|Z_{\mathrm{merged},i}\| \tag{17} \\
&= \|Z_i\| - \Big\|\eta Z_i + (1-\eta)\sum_j^m \alpha_j Z^{[j]}_i\Big\| \tag{18} \\
&\geq \|Z_i\| - \Big(\eta \|Z_i\| + (1-\eta)\sum_j^m \alpha_j \|Z^{[j]}_i\|\Big) \tag{19} \\
&= (1-\eta)\Big(\|Z_i\| - \sum_j^m \alpha_j \|Z^{[j]}_i\|\Big) \tag{20} \\
&= (1-\eta)\Big(\varphi(X_i) - \sum_j^m \alpha_j \varphi(X^{[j]}_i)\Big). \tag{21}
\end{align}

To obtain (19) from (18), we apply the triangle inequality, which states that for any two vectors $a$ and $b$, $\|a+b\| \leq \|a\| + \|b\|$. Finally, we apply Assumption 1, which states that $\|Z_i\| = \varphi(X_i)$ and $\|Z^{[j]}_i\| = \varphi(X^{[j]}_i)$, to obtain (21).
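The chain from (17) to (20) can be sanity-checked numerically. The following is a minimal sketch, not part of the original experiments: it draws random error vectors and checks that the gap is never smaller than the lower bound in (20). The Gaussian errors, the dimension, and the nonnegative weights $\alpha_j$ normalized to sum to one are assumptions made for illustration only.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def merged_error(z_global, z_locals, alphas, eta):
    """Z_merged = eta * Z + (1 - eta) * sum_j alpha_j * Z^[j], as in (18)."""
    dim = len(z_global)
    local_mix = [sum(a * z[k] for a, z in zip(alphas, z_locals)) for k in range(dim)]
    return [eta * z_global[k] + (1 - eta) * local_mix[k] for k in range(dim)]


def gap_and_bound(z_global, z_locals, alphas, eta):
    """Return (Delta_PoC from (17), lower bound from (20))."""
    delta = norm(z_global) - norm(merged_error(z_global, z_locals, alphas, eta))
    weighted = sum(a * norm(z) for a, z in zip(alphas, z_locals))
    bound = (1 - eta) * (norm(z_global) - weighted)
    return delta, bound


random.seed(0)
results = []
for _ in range(1000):
    m, dim = 4, 8  # illustrative sizes (assumption)
    z_global = [random.gauss(0, 1) for _ in range(dim)]
    z_locals = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m)]
    raw = [random.random() + 1e-9 for _ in range(m)]
    alphas = [r / sum(raw) for r in raw]  # assumed: nonnegative, sum to one
    eta = random.uniform(0.01, 0.99)
    results.append(gap_and_bound(z_global, z_locals, alphas, eta))

# The triangle inequality guarantees delta >= bound, up to rounding error.
all_hold = all(delta >= bound - 1e-9 for delta, bound in results)
```

Note that the bound itself can be negative for arbitrary errors; it only becomes non-negative once the concavity argument that follows is applied.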

By Assumption 1, $\varphi$ is a concave function. Applying Jensen’s inequality and Assumption 2, we have:

$$\varphi(X_i) = \varphi\Big(\sum_{j}^{m} \alpha_j X^{[j]}_i\Big) \geq \sum_{j}^{m} \alpha_j \varphi(X^{[j]}_i). \tag{22}$$

This inequality implies that the error of the global caption is always greater than or equal to the weighted average of the errors of the local captions. Intuitively, this means that the PoCa method, which combines information from both global and local captions, is expected to have a lower error than using only the global caption.
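As a small illustration of the Jensen step, the sketch below uses a hypothetical concave error model $\varphi(x) = \sqrt{x}$ on nonnegative scalars. Both the choice of $\varphi$ and the scalar treatment of the semantics are assumptions made here for illustration; the proof only requires concavity.

```python
import math
import random


def phi(x):
    """Illustrative concave error model (an assumption): phi(x) = sqrt(x), x >= 0."""
    return math.sqrt(x)


def jensen_gap(x_locals, alphas):
    """phi(sum_j alpha_j x_j) - sum_j alpha_j phi(x_j); non-negative for concave phi."""
    x_global = sum(a * x for a, x in zip(alphas, x_locals))
    return phi(x_global) - sum(a * phi(x) for a, x in zip(alphas, x_locals))


random.seed(0)
gaps = []
for _ in range(1000):
    x_locals = [random.uniform(0, 10) for _ in range(4)]
    raw = [random.random() + 1e-9 for _ in range(4)]
    alphas = [r / sum(raw) for r in raw]  # convex weights: nonnegative, sum to one
    gaps.append(jensen_gap(x_locals, alphas))

# Smallest observed gap; Jensen's inequality says it cannot be negative
# (up to floating-point rounding).
min_gap = min(gaps)
```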

Combining this result with the lower bound for $\Delta_{\mathrm{PoC},i}$ derived earlier, we can conclude that $\Delta_{\mathrm{PoC},i}$ is non-negative for all $i$:

$$\Delta_{\mathrm{PoC},i} \geq (1-\eta)\Big(\varphi(X_i) - \sum_{j}^{m} \alpha_j \varphi(X^{[j]}_i)\Big) \geq 0. \tag{23}$$

The first inequality follows directly from the lower bound for $\Delta_{\mathrm{PoC},i}$ derived earlier. The second inequality follows from the Jensen’s inequality result, which states that $\varphi(X_i) \geq \sum_{j}^{m} \alpha_j \varphi(X^{[j]}_i)$. Since $1-\eta > 0$ (as $\eta \in (0,1)$ by Assumption 3), the product of $(1-\eta)$ and the non-negative term $\big(\varphi(X_i) - \sum_{j}^{m} \alpha_j \varphi(X^{[j]}_i)\big)$ is also non-negative, which proves that $\Delta_{\mathrm{PoC},i} \geq 0$ for all $i$. Therefore, we have:

$$\|Z_{\mathrm{merged}}\| \leq \|Z\|. \tag{24}$$

This completes the proof, demonstrating that under the given assumptions, the PoCa method is guaranteed to reduce the semantic error compared to using only the global caption.

Appendix C Implementation Details

C.1 Image Captioning Models

We employ three groups of Large Vision Language Models (LVLMs) as the image captioning models: LLaVA-1.5 [59], MobileVLM v2 [60], and InternVL [61]:

  • The LLaVA-1.5 series is a popular family of LVLMs with two variants, LLaVA-1.5-7B and LLaVA-1.5-13B, which adopt Vicuna-7B and Vicuna-13B as their Large Language Models (LLMs), respectively.

  • MobileVLM v2 is a family of efficient LVLMs with smaller scales, and we utilize its MobileVLM-v2-1.7B and MobileVLM-v2-3B models.

  • InternVL is one of the top-performing publicly available LVLMs. We use its InternVL-Chat-Chinese-V1-2-Plus model, which is based on the Yi-34B LLM and has a total of 40.1B parameters.

All inference is performed in FP16 precision on a single NVIDIA A800 GPU. We employ two types of prompts for short-form single-sentence image captioning and long-form detailed image captioning: "Provide a one-sentence caption for the provided image" and "Describe this image in detail". All generation parameters are set to the default values provided by the source repository.

C.2 Caption Pyramids

For the caption merging function $\sigma_{\mathrm{merge}}$, we adopt and compare a variety of Large Language Models (LLMs) as its implementation, including the Gemma family (2B and 7B versions), the LLaMA2 family (7B Chat and 13B Chat), Qwen-1.5-7B Chat, Mistral 7B, and the mixture-of-experts model Mixtral 8x7B. The Mixtral 8x7B model has capabilities similar to ChatGPT-3.5 and is one of the top-performing open-source LLMs. All inference is performed in FP16 precision, except for the large Mixtral 8x7B model, for which we use 8-bit quantization to fit it on a single NVIDIA A800 GPU. Mixtral 8x7B is used as the default LLM for caption merging.

Prompt for Merging Caption Pyramid

System Message:

Input:

  • You will receive a global caption describing an image.
  • Additionally, you will have access to local captions generated for specific patches within the image.
  • Both global and local captions may contain noise or errors.

Task Objective:

  • Your goal is to create a merged global caption that combines relevant information from both sources.
  • The merged caption should be no longer than the original ones.
  • You only give the merged caption as output, without any additional information.
  • Do NOT give any explaination or notes on how you generate this caption.

Guidelines:

  • Combine Information: Extract key details from both global and local captions.
  • Filter Noise: Remove non-sense content, inaccuracies, and irrelevant information.
  • Prioritize Visual Details: Highlight essential visual elements instead of feeling or atmosphere
  • Be Concise: Use as few words as possible while maintaining coherence and clarity.
  • Ensure Coherence: Arrange the merged information logically.

Remember, your output should be a high-quality caption that is concise, informative, and coherent!

User:

### Global Caption: {global caption}
### Top-left: {top-left caption}
### Bottom-left: {bottom-left caption}
### Top-right: {top-right caption}
### Bottom-left: {bottom-left caption}

Assistant Generation Prefix: Here’s the merged caption:

Table 2: An example implementation of the merging function $\sigma_{\mathrm{merge}}$ based on prompting text-only LLMs.

We employ the prompt shown in Table 2 for caption merging, where the "Assistant Generation Prefix" is injected after the instructions to control the model output format. All generation parameters are set to the default values provided by the source repository. For the splitting function, we adopt the most straightforward implementation: splitting the input image into four equal-sized patches.
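The splitting and prompt assembly described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names `split_boxes` and `build_user_message` are hypothetical, the handling of odd image sizes is a guess, and the fourth patch is labeled "Bottom-right" here. The boxes use the (left, upper, right, lower) convention, which is the form accepted by, for example, Pillow's `Image.crop`.

```python
def split_boxes(width, height):
    """Return crop boxes (left, upper, right, lower) for four equal-sized patches.

    With odd sizes, integer division makes the right/bottom patches one pixel
    larger; this tie-breaking rule is an assumption of the sketch.
    """
    mid_x, mid_y = width // 2, height // 2
    return {
        "Top-left": (0, 0, mid_x, mid_y),
        "Bottom-left": (0, mid_y, mid_x, height),
        "Top-right": (mid_x, 0, width, mid_y),
        "Bottom-right": (mid_x, mid_y, width, height),
    }


def build_user_message(global_caption, local_captions):
    """Assemble a "User" field in the style of Table 2 (hypothetical helper)."""
    lines = [f"### Global Caption: {global_caption}"]
    for position, caption in local_captions.items():
        lines.append(f"### {position}: {caption}")
    return "\n".join(lines)


# Usage with placeholder captions; a real pipeline would caption each crop
# with the same LVLM that produced the global caption.
boxes = split_boxes(640, 480)
message = build_user_message(
    "A street corner.",
    {position: f"caption for {position}" for position in boxes},
)
```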

C.3 Human annotation baseline

In Fig. 2, the human annotation for short captions represents the accuracy of a single-sentence caption drawn from the MS-COCO annotation, while the human annotation for detailed captions refers to five MS-COCO caption annotations concatenated with the prefix "The following are several captions of this image written by different people: " added to the front.
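A minimal sketch of how the detailed human baseline could be assembled is given below. The prefix is quoted from the text above; the function name and the single-space separator between captions are assumptions, since the joining character is not specified.

```python
PREFIX = "The following are several captions of this image written by different people: "


def human_detailed_caption(captions, separator=" "):
    """Concatenate human captions behind the fixed prefix.

    The separator is an assumption; the text only states that the
    captions are concatenated.
    """
    return PREFIX + separator.join(caption.strip() for caption in captions)


# Placeholder captions, not taken from MS-COCO.
example = human_detailed_caption(["A dog.", "A cat."])
```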

Appendix D Additional Experiments and Further Analysis

D.1 Image Paragraph Captioning

The image paragraph captioning dataset contains human-annotated single-paragraph descriptions for Visual Genome images. We use its testing split, which consists of 2,492 samples. We employ both the reference-free metric CLIPScore [62] and the reference-based metric METEOR [63] to evaluate the quality of the captions.

CLIPScore measures the similarity between the image and text features extracted by the CLIP model (we use the standard OpenAI pretrained ViT-Base-32). The underlying assumption is that CLIP encoders are capable of extracting semantic information and can represent the importance score $A$; thus, a higher CLIPScore correlates with higher information sufficiency.
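For reference, CLIPScore is commonly defined as a rescaled, clipped cosine similarity, $w \cdot \max(\cos(c, v), 0)$ with $w = 2.5$ [62]. The sketch below computes it from precomputed embeddings; the extra factor of 100 is an assumption made here to match the magnitude of the scores reported in Table 3, and the embeddings themselves would come from the CLIP encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5, scale=100.0):
    """w * max(cos, 0), multiplied by `scale` (the scale is an assumption)."""
    return scale * w * max(cosine(image_emb, text_emb), 0.0)


# Toy 2-d vectors standing in for CLIP image/text features.
score = clip_score([3.0, 4.0], [4.0, 3.0])
```

Negative similarities are clipped to zero, so the score is always non-negative.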

The reference-based metric METEOR is widely adopted for evaluations in image captioning and natural language generation (e.g., machine translation). It measures the word-level similarity between model-generated captions and human-generated captions. The underlying assumption is that human annotations optimize the information sufficiency objective, so if a model behaves similarly to human annotations, it achieves high information sufficiency.

The results are shown in Table 3, where we also list the performance of previous fully-supervised models and few-shot models. Once again, our PoCa method provides information sufficiency improvement according to both the reference-free metric CLIPScore and the reference-based metric METEOR across all three families of LVLMs. These results further demonstrate the effectiveness of the PoCa method in enhancing the quality and informative content of the generated captions.

Table 3: Evaluation results on the image paragraph captioning dataset using CLIPScore and METEOR.
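| Category | Image Captioning Model | Setting | CLIPScore | METEOR |
|---|---|---|---|---|
| Fully Supervised Models | Regions-Hierarchical [28] | | - | 15.95 |
| | RTT-GAN [64] | | - | 17.12 |
| | HSGED [65] | | - | 18.33 |
| | SCST [66] | | - | 17.86 |
| | CAE-LSTM [67] | | - | 18.82 |
| Few-shot & Zero-shot Models | BLIP-2 | 3-shot | - | 10.8 |
| | OPT-IML | 3-shot | - | 9.5 |
| | Naïve Ensemble | 3-shot | - | 9.8 |
| | BLIP-2 | VLIS [68] | - | 14.6 |
| | MobileVLM-v2-1.7B | Default | 80.05 | 13.95 |
| | | PoCa | 81.80 | 16.39 |
| | MobileVLM-v2-3B | Default | 79.02 | 8.99 |
| | | PoCa | 81.34 | 13.28 |
| | LLaVA-1.5-7B | Default | 81.68 | 28.11 |
| | | PoCa | 81.80 | 28.79 |
| | LLaVA-1.5-13B | Default | 82.16 | 28.44 |
| | | PoCa | 82.47 | 28.97 |
| | InternVL | Default | 84.65 | 29.32 |
| | | PoCa | 85.52 | 29.84 |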

D.2 Caption Merging Strategies

In this section, we compare the effectiveness of different implementations of the merging function $\sigma_{\mathrm{merge}}$. First, we compare various LLMs introduced in Section C.2. As shown in Table 4, compared to the global caption baseline, every LLM yields performance improvement, except for the smallest Gemma-2B-IT model. We also provide an ablation on prompting, where we replace the default prompt shown in Table 2 with a naive prompt of "merge these captions". This ablation results in a slight decrease in accuracy and a significant increase in caption length, which further violates the minimal redundancy objective.

Additionally, we compare two parameter-free merging strategies based on simply concatenating either the local captions only (corresponding to $\eta = 0$ in Assumption 3) or the local and global captions together, with positional encoding as in the "User" field in Table 2. The results show that local captions alone cannot provide sufficient information, while adding the global caption brings significant improvement. However, these two concatenation-based methods generate excessively long captions, demonstrating the necessity of LLM-based information fusion and length compression.
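The two concatenation baselines can be sketched as below. This is an illustration under assumptions: the helper names are hypothetical, the patch order and the "Bottom-right" label are guesses, and the newline separator is chosen for readability rather than taken from the paper.

```python
POSITIONS = ("Top-left", "Bottom-left", "Top-right", "Bottom-right")  # assumed order


def concat_local(local_captions):
    """Local-only concatenation: each patch caption is tagged with its position."""
    return "\n".join(f"### {pos}: {local_captions[pos]}" for pos in POSITIONS)


def concat_global_local(global_caption, local_captions):
    """Global-local concatenation: the global caption followed by the tagged local ones."""
    return f"### Global Caption: {global_caption}\n" + concat_local(local_captions)


# Placeholder captions for demonstration.
local_captions = {pos: f"patch caption {i}" for i, pos in enumerate(POSITIONS)}
local_only = concat_local(local_captions)
global_local = concat_global_local("global caption", local_captions)
```

Neither variant compresses anything, which is consistent with the long caption lengths reported for these baselines.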

Table 4: Comparison of different caption merging strategies on the VQA-v2 validation set.
| Merging Function | Params | Accuracy | Length |
|---|---|---|---|
| Global Caption Baseline | 0 | 57.68 | 50.75 |
| Gemma-2B-IT | 2B | 57.44 | 107.14 |
| Gemma-7B-IT | 7B | 58.74 | 178.79 |
| Mistral 7B Instruct-v0.2 | 7B | 58.92 | 136.12 |
| LLaMA2-7B Chat | 7B | 58.64 | 199.34 |
| LLaMA2-7B Chat (Naive Prompt) | 7B | 58.60 | 239.02 |
| Qwen-1.5-7B-Chat | 7B | 58.64 | 130.93 |
| LLaMA2 13B Chat | 13B | 59.06 | 154.67 |
| Mixtral 8x7B Instruct-v0.1 | 46.7B | 59.78 | 219.22 |
| Local Captions Concatenation | 0 | 55.66 | 265.63 |
| Global Local Concatenation | 0 | 59.12 | 337.38 |

D.3 Analysis of VQA-based Caption Evaluation

Prompt for LLM-based VQA Evaluation

System Message: You will be given a caption of an image, and your task is to try to answer the question based on the caption. If the relevant information is not present in the caption, try your best to guess the answer. You shouldn’t provide any rationale or explaination in your response, just give the answer only. The answer can be a number, a single word or a short phrase, plese make your response as short, simple and clear as possible.

User:

Image Caption: {image caption}
Question: {question}

Assistant Generation Prefix: The most possible answer is:

Table 5: Prompt for LLM-based VQA Evaluation.

In the VQA-based evaluation, we adopt text-only LLMs for VQA inference with image captioning input to assess caption quality. This section provides a detailed analysis of this approach. Using the instruction given in Table 5, we prompt different LLMs to generate answers and evaluate the accuracy based on exact matching and Natural Language Inference (NLI) based evaluation. The NLI evaluation classifies a pair of statements, "The answer to this question is {ground truth}" and "The answer to this question is {generated answer}", into entailment, neutral, and contradiction, where entailment outputs are regarded as successful. Compared to exact matching, NLI evaluation measures the correctness of answers at a semantic level and can tolerate low-level differences.
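The two accuracy measures can be sketched as follows. This is a hedged illustration, not the evaluation code used in the paper: the answer normalization, the premise/hypothesis ordering, and the `toy_nli` stand-in (a simple substring check used only so the example runs without a model) are all assumptions. In practice `nli_classifier` would wrap a real NLI model returning one of "entailment", "neutral", or "contradiction".

```python
TEMPLATE = "The answer to this question is "


def normalize(answer):
    """Light normalization before exact matching (an assumption of this sketch)."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def exact_match(generated, ground_truth):
    return normalize(generated) == normalize(ground_truth)


def nli_pair(generated, ground_truth):
    """Build the statement pair; which one is the premise is an assumption."""
    return TEMPLATE + ground_truth, TEMPLATE + generated


def accuracies(examples, nli_classifier):
    """examples: list of (generated, ground_truth). Returns (match %, NLI %)."""
    n = len(examples)
    match = sum(exact_match(g, t) for g, t in examples)
    nli = sum(nli_classifier(*nli_pair(g, t)) == "entailment" for g, t in examples)
    return 100.0 * match / n, 100.0 * nli / n


def toy_nli(premise, hypothesis):
    """Hypothetical stand-in for an NLI model: substring containment only."""
    truth = premise[len(TEMPLATE):].lower()
    guess = hypothesis[len(TEMPLATE):].lower()
    return "entailment" if truth in guess else "neutral"


# Placeholder (generated answer, ground truth) pairs.
examples = [("two", "two"), ("There are two dogs.", "two"), ("red", "blue")]
match_acc, nli_acc = accuracies(examples, toy_nli)
```

The second example illustrates the gap discussed in the text: a verbose but correct answer fails exact matching while still counting as correct under the semantic check.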

As shown in Table 6, we find that different LLMs behave very differently in terms of answer length, and many LLMs fail to keep the answer succinct as instructed. Since the ground truth answers are mostly one word or a short phrase, this results in significantly reduced exact matching accuracy, while the actual semantic similarity is much higher, as measured by the NLI accuracy. We also observe an increasing trend in NLI accuracy as the LLM scale grows, with the exception of the largest model, Mixtral 8x7B Instruct-v0.1, which produces lower NLI accuracy. We find that this outlier is caused by the over-conservative nature of the Mixtral 8x7B Instruct-v0.1 model, which frequently refuses to answer questions with responses such as “cannot determine” and “not sure”. Finally, we add an ablation by instructing the LLM to guess the answer without caption input using the prompt: "You will be given a question regarding an image, and your task is to try to infer the most possible answer". The resulting performance, denoted as “LLaMA2 7B Chat (No Caption)”, is much lower in both exact matching and NLI accuracy.

Table 6: Comparison of different LLMs for VQA-based caption evaluation.
| LLM | Answer Length | Match Accuracy | NLI Accuracy |
|---|---|---|---|
| Gemma-2B-IT | 33.50 | 5.20 | 55.44 |
| Gemma-7B-IT | 38.00 | 0.00 | 54.44 |
| Mistral 7B Instruct-v0.2 | 28.90 | 2.30 | 63.30 |
| LLaMA2 7B Chat | 6.10 | 57.44 | 67.14 |
| LLaMA2 7B Chat (No Caption) | 4.30 | 41.34 | 44.76 |
| Qwen-1.5-7B-Chat | 7.90 | 56.72 | 69.06 |
| LLaMA2 13B Chat | 5.30 | 60.24 | 69.14 |
| Mixtral 8x7B Instruct-v0.1 | 24.40 | 8.38 | 64.86 |
| Ground Truth Answer | 4.70 | - | - |

Limitations

While the PoCa method has demonstrated effectiveness in improving image caption quality, there are several limitations that are worth discussing.

Assumptions on Image Semantics. Assumption 2 can sometimes be strong and unrealistic, especially for the naive patch splitting function. The linear combination assumption may not hold well for images with more complex structures. This issue could be particularly problematic when objects or important semantic elements span multiple local patches. In future work, employing more advanced splitting functions, such as those based on object detection or semantic segmentation, could help alleviate this limitation and better capture the semantic structure of the image.

Assumptions on Caption Semantics. Similarly, the assumption about the local-global aggregation of caption semantics (Assumption 3) may not always be well satisfied by the LLM used for caption merging, particularly when the LLM is not sufficiently powerful. Weaker LLMs may struggle to effectively combine the local and global caption semantics in the desired manner. Further investigation into the impact of LLM choice on the fulfillment of this assumption would be valuable.

Depth of the Caption Pyramid. In the experiments, this work has demonstrated the benefits of a single level of local-global splitting and merging. However, the potential of deeper caption pyramids has not been fully explored. As the pyramid grows deeper, there could be a distribution shift for the input image patches, leading to more errors in the generated captions. Investigating the performance of merging functions for noisier captions is an important direction for future research.

VQA Evaluation. While the VQA-based evaluation provides a useful measure of caption quality in terms of information sufficiency, it has limitations. The questions used for evaluation may not comprehensively cover all of the important semantic units, resulting in a sub-optimal estimation of the importance score $A$. In addition, due to resource constraints, we use a 5,000-question subset of the full VQAv2 dataset. To test its reliability, we run default caption generation with 5 models, together with human-annotated captions, resulting in a total of $(5+1) \times 2 = 12$ data points combining short and long captions. The Pearson correlation coefficient between the 5k-subset accuracy and the full-dataset accuracy is 0.8519. Although this is already quite high, the subset still introduces some degree of noise into model performance evaluation.
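The reliability check above amounts to a Pearson correlation over paired accuracies. A minimal sketch is shown below; the accuracy lists are synthetic placeholders used only to exercise the function, not the 12 data points from the experiment.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Synthetic placeholder accuracies (NOT results from the paper).
subset_acc = [50.0, 55.0, 60.0, 65.0]
full_acc = [51.0, 54.0, 61.0, 64.0]
r = pearson(subset_acc, full_acc)
```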

Computational Efficiency. Our implementation of PoCa requires additional inference passes to generate the local captions, as well as prompting an LLM to fuse the local and global captions. These multiple inference steps and the use of large models increase computational costs. This overhead may be a concern, especially in resource-constrained environments or when processing a large number of images. One potential solution is to finetune an image captioning model on the captions generated by PoCa. By doing so, the knowledge captured by PoCa can be distilled into the finetuned model, allowing for a single inference pass during deployment while still benefiting from the enhanced caption quality achieved by PoCa. Similar knowledge distillation approaches have been adopted in prior work, such as [33, 69, 70].

Broader Impact

The proposed PoCa method has the potential to positively impact various applications, such as improving accessibility for visually impaired individuals and enhancing information retrieval. However, it is crucial to consider and address potential ethical implications, including biases inherited from LLMs, privacy concerns related to revealing sensitive information, and the environmental impact of training and deploying large-scale models. Responsible deployment, clear communication of capabilities and limitations, and ongoing efforts to mitigate biases and ensure fairness are essential for developing image captioning systems that are not only informative and accurate but also socially responsible and inclusive.

Appendix E PoCa Examples

We show some examples of PoCa caption merging, where the images are sourced from the test split of the image paragraph captioning dataset. Captions are generated by the LLaVA-1.5-7B model and merged by the Mixtral-8x7B LLM. The examples show that local captions capture details that the global caption hallucinates or omits, and that PoCa aggregates such information (marked in green) and recompiles it into a human-interpretable caption.

[Uncaptioned image]
Global Caption: The image features a street corner with a street sign hanging above the
street. The sign is written in both English and Chinese, indicating that the location is in an Asian
neighborhood. The street sign is positioned above a restaurant, which is likely the "Asian Pearl
Restaurant." There are several people in the scene, with some standing near the restaurant and
others walking around the area. A car is parked on the street, and a traffic light is visible in the
background. The overall atmosphere of the scene suggests a bustling urban environment with a mix of
cultures.
[Uncaptioned image]
Top-left: The image features a large, ornate building with a green roof. The building has
a prominent green roof, which stands out against the sky. The sky is a pale yellow, creating a
contrasting backdrop for the building. The building’s architecture and the green roof make it an
eye-catching structure in the scene.
[Uncaptioned image]
Bottom-left: The image features a brightly colored building with a yellow and red facade,
likely a Chinese restaurant. The building is adorned with a large sign that reads "Bright Pearl."
The sign is positioned above the entrance, making it easily noticeable. In addition to the main
building, there are two smaller buildings visible in the scene, one on the left side and the other
on the right side. The overall atmosphere of the scene is vibrant and inviting.
[Uncaptioned image]
Top-right: The image features a street sign with a blue background and white lettering,
indicating the name of the street. The sign is attached to a green pole, which is located near a
traffic light. The traffic light is positioned on the right side of the pole, and it is currently
displaying a red light. The scene appears to be set in a city, with the street sign and traffic
light being essential elements of urban infrastructure.
[Uncaptioned image]
Bottom-right: The image features a street sign in a foreign language, likely Chinese,
warning people about the consequences of dumping. The sign is placed on a pole, and it is positioned
near a building. The sign is written in both English and Chinese, making it accessible to a wider
audience. In the background, there is a car parked on the street, and a person can be seen standing
nearby. The scene appears to be set in an urban environment, with the street sign serving as a
reminder for people to properly dispose of their waste.
[Uncaptioned image]
PoCa Merged Caption: At a bustling Asian neighborhood street corner, an ornate building
with a green roof and a yellow-red Chinese restaurant named "Bright Pearl" stand out. A street sign
in English and Chinese indicates the location, while a nearby sign warns against dumping. A car,
traffic light, and people add to the urban atmosphere. The likely "Asian Pearl Restaurant" is
positioned near the street sign.
Human Annotation: A picture of a Asian town. There are many signs in an Asian language. Te
sky above the town is gray. There is a large green pole to the right with signs mounted to it. There
are multiple story buildings along the streets. There is a restaurant with the word pearl in it.
Therre are blue and white signs on the pole.
[Uncaptioned image]
Global Caption: The image features a group of four people dressed in old-fashioned
clothing, standing together in a train station. They are posing for a picture in front of a train,
which is located on the left side of the scene. The people are dressed in various styles, with some
wearing ties and hats. There are two handbags visible in the scene, one near the center and the
other towards the right side. The overall atmosphere of the image suggests a vintage or historical
setting.
[Uncaptioned image]
Top-left: The image features a man standing in front of a train, which is prominently
displayed in the scene. The train is positioned towards the right side of the image, and the man is
standing closer to the left side. The man appears to be posing for a picture in front of the train.
In the background, there is a building with a clock on its side, adding to the overall atmosphere of
the scene. Additionally, there are two other people in the image, one located near the left edge and
the other towards the right side.
[Uncaptioned image]
Bottom-left: The image features a train station with a train on the tracks. There are
several people in the scene, some of them standing near the train, while others are scattered
throughout the station. A man is standing close to the train, possibly waiting for it to depart. In
addition to the people, there are a few handbags visible in the scene, with one near the center of
the image and another closer to the right side. The presence of these handbags suggests that the
people in the station might be travelers or commuters.
[Uncaptioned image]
Top-right: The image features a group of three people dressed in top hats and
Victorian-style clothing, standing together in a large room. They appear to be posing for a picture,
possibly in a train station or a similar setting. The people are positioned in the center of the
room, with one person on the left, another in the middle, and the third on the right side. The room
has a few chairs placed around the area, with one chair located near the left side of the room,
another in the middle, and a third chair on the right side. The overall atmosphere of the scene
suggests a historical or themed event.
[Uncaptioned image]
Bottom-right: The image features a man and a woman standing next to each other, both
dressed in old-fashioned clothing. The woman is wearing a green dress, while the man is dressed in a
suit with a tie. They appear to be posing for a picture, possibly in a historical setting. In the
scene, there are also two handbags visible, one near the woman and the other closer to the man. The
overall atmosphere of the image suggests a vintage or historical theme.
[Uncaptioned image]
PoCa Merged Caption: A group of four people in old-fashioned clothing, including two in
top hats, pose for a picture in a train station with a vintage atmosphere. A train is visible on the
left, and two handbags are spotted near the center and right. Two other people are scattered
throughout the scene, with one man standing close to the train.
Human Annotation: Four people wearing old-fashioned clothing stand posing in front of the
camera. They are in a large building with a bright ceiling, and there is a black train behind them.
There are two men on either side of two women. The man on the far left has a black suit, black
shoes, and a black top hat. The woman to his right is wearing a white, pink, and green dress, with a
white hat, and a broach pinned to her chest. She is holding a white satchel in her white-gloved
hands. The woman next to her is also wearing white gloves and holding a white satchel. This woman
has a green dress on, with lace on the trim. The man next to her is wearing a checkered vest
underneath a dark long coat and dark pants. He has a black top hat.
[Uncaptioned image]
Global Caption: The image features a person holding a pair of scissors with a leafy green
plant in their hand. The scissors are positioned in the middle of the plant, ready to cut it. The
person is likely preparing to trim the plant or remove a leaf. The scene takes place outdoors, with
the person standing in a garden or an outdoor area.
[Uncaptioned image]
Top-left: The image features a close-up view of a patch of grass on a dirt ground. The
grass appears to be green and healthy, with a few small rocks scattered around the area. The dirt
ground is visible in the foreground, while the grassy patch extends into the background. The scene
captures the essence of a natural environment, with the grass and rocks creating a sense of depth
and texture.
[Uncaptioned image]
Bottom-left: The image features a close-up of a green plant with a few leaves. The plant
is situated in a dirt field, and its leaves are visible in the foreground. The plant appears to be
growing in a dirt-covered area, possibly a garden or a natural setting. The focus of the image is on
the plant and its leaves, creating a sense of depth and detail.
[Uncaptioned image]
Top-right: The image features a close-up of a pair of scissors with a leafy green plant in
the foreground. The scissors are positioned in the middle of the scene, with the leafy plant
surrounding them. The plant appears to be a freshly cut herb, possibly mint, and is placed on a dirt
surface. The scissors are likely being used to trim the plant or prepare it for use in a recipe.
[Uncaptioned image]
Bottom-right: The image features a pair of green scissors with a green handle, placed on a
pile of fresh herbs. The scissors are open, ready to be used for cutting the herbs. The herbs are
scattered around the scissors, with some located closer to the scissors and others further away. The
scene suggests that the person using the scissors is preparing to cut the herbs for cooking or other
purposes.
PoCa Merged Caption: A person holds scissors with a leafy green plant, likely preparing to
trim it in an outdoor setting. The scissors, situated in the middle of the plant, are positioned on
a pile of fresh herbs. The plant, possibly a type of mint, appears healthy and green, surrounded by
small rocks and dirt.
Human Annotation: There are a pair of scissors sitting on top of a plant. The handle on
the scissors is colored green. The other part of the scissor is metal. The leaves of the plant or a
nice healthy green color.
Global Caption: The image depicts a group of people gathered around a dining table in a
restaurant. They are enjoying a meal together, with several pizzas placed on the table. There are
multiple cups and bottles scattered around the table, indicating that they are drinking beverages as
well. The people are seated on various chairs placed around the table. Some of them are engaged in
conversation, while others are focused on their meal. The atmosphere appears to be relaxed and
social, as the group shares a meal and spends time together.
Top-left: The image depicts a group of people gathered in a room, possibly a restaurant or
a bar. There are at least nine people visible in the scene, with some sitting at a dining table and
others standing around the room. The room features a spiral staircase, which adds a unique
architectural element to the space. A few chairs are placed around the dining table, and a cup can
be seen on the table. There is also a bowl and a spoon placed nearby. The atmosphere appears to be
lively and social, with people enjoying their time together.
Bottom-left: The image features a group of people sitting around a dining table, enjoying
a meal together. There are at least five people visible in the scene, with some sitting closer to
the table and others further away. The table is set with various food items, including a pizza, a
bowl, and a sandwich. There are also multiple cups and a cell phone placed on the table. A bottle
can be seen on the table as well. The dining table occupies a significant portion of the image, with
the people sitting around it, creating a warm and inviting atmosphere.
Top-right: The image features a man with a long beard sitting at a dining table in a
restaurant. He is looking at the camera, possibly posing for a picture. The man is wearing a blue
shirt and appears to be enjoying his time at the restaurant. The dining area is furnished with
multiple chairs and dining tables. There are two chairs visible in the scene, one near the man and
another further away. Two dining tables can be seen, one in the foreground and another in the
background. In the background, there is a TV mounted on the wall, likely providing entertainment for
the restaurant’s guests.
Bottom-right: The image features a person sitting at a dining table with a plate of pizza
in front of them. The table is set with a glass of beer, a cup, and a pitcher of water. There are
also two additional cups on the table. The person is holding a fork, ready to enjoy their meal. The
dining table is surrounded by chairs, with one chair on the left side and another on the right side.
The scene appears to be a casual dining experience, with the person enjoying their pizza and beer.
PoCa Merged Caption: A group of people, including a man with a beard, enjoy a meal in a
restaurant with various pizzas, sandwiches, and beverages on a table. Nine people are gathered in a
room with a spiral staircase and chairs around the dining table. The atmosphere is lively and
social. (169 characters)
Human Annotation: three men are talking. they all have bears on their face. there is a
blue shirt on the man. the shirt has flowers on it.