CrossCheckGPT: Universal Hallucination
Ranking for Multimodal Foundation Models

Guangzhi Sun1∗  Potsawee Manakul1,2,3∗  Adian Liusie1  Kunat Pipatanakul2,3
Chao Zhang4  Phil Woodland1  Mark Gales1
1University of Cambridge  2SCB 10X  3SCBX  4Tsinghua University
[email protected], [email protected], [email protected]
Abstract

Multimodal foundation models are prone to hallucination, generating outputs that either contradict the input or are not grounded by factual information. Given the diversity in architectures, training data and instruction tuning techniques, there can be large variations in systems’ susceptibility to hallucinations. To assess system hallucination robustness, hallucination ranking approaches have been developed for specific tasks such as image captioning, question answering, summarization, or biography generation. However, these approaches typically compare model outputs to gold-standard references or labels, limiting hallucination benchmarking for new domains. This work proposes "CrossCheckGPT", a reference-free universal hallucination ranking for multimodal foundation models. The core idea of CrossCheckGPT is that the same hallucinated content is unlikely to be generated by different independent systems, hence cross-system consistency can provide meaningful and accurate hallucination assessment scores. CrossCheckGPT can be applied to any model or task, provided that the information consistency between outputs can be measured through an appropriate distance metric. Focusing on multimodal large language models that generate text, we explore two information consistency measures: CrossCheck-explicit and CrossCheck-implicit. We showcase the applicability of our method for hallucination ranking across various modalities, namely the text, image, and audio-visual domains. Further, we propose the first audio-visual hallucination benchmark, "AVHalluBench", and illustrate the effectiveness of CrossCheckGPT, achieving correlations of 98% and 89% with human judgements on MHaluBench and AVHalluBench, respectively.

∗ Equal contribution

1 Introduction

In the domain of generative foundation models, ‘hallucination’ describes the scenario when generated outputs, while seemingly credible, are either inconsistent with the provided context or contradict established factual knowledge [24, 48, 44]. This issue impacts many generative applications and can lead to the spread of misinformation in a range of settings [52, 33]. Given the differences in architectures, data, and alignment techniques for foundation models, there is a need to be able to quantify a system’s susceptibility to hallucination, such that practitioners can be aware of systems’ hallucination risk and select systems with high factual consistency.

Figure 1: SelfCheckGPT (Left) and CrossCheckGPT (Right) for hallucination rankings. The approach can rank a set of MLLMs on any task without reference, enabling hallucination benchmarks for various generative tasks.

Current hallucination benchmarks have been developed to rank systems for individual tasks including question answering [22, 14, 18, 10, 43], summarization [29, 27], biography generation [28], instruction following [30], image captioning [36], and visual question answering [19, 49]. Many of these benchmarks measure the hallucination level through a proxy measure, such as the ability of the model to correctly answer questions designed to trigger hallucinations. However, these benchmarks have been designed for particular tasks and assume access to gold-standard labels, limiting their applicability to generalized domains. On the other hand, hallucination detection approaches such as SelfCheckGPT [28] and UniHD [4] directly examine generated responses against self-evidence, and therefore do not require gold-standard answers. These methods, though, simply aim to identify when a model hallucinates, and scores are not directly comparable across different models.

In this paper, we propose CrossCheckGPT, a universal hallucination ranking approach to benchmark multimodal foundation models. The core idea of CrossCheckGPT is that the same hallucinated content is unlikely to be generated by different independent systems, while factual content is likely to be consistent across models. An illustration of the approach and its contrast to SelfCheckGPT is depicted in Fig. 1. Instead of checking for self-consistency, as done in SelfCheckGPT, CrossCheckGPT checks the cross-consistency by comparing against evidence generated from a set of independent models. This produces more accurate and directly comparable hallucination scores, as well as yielding more robust rankings. CrossCheckGPT can be applied to any foundation model and task as long as a suitable information consistency measure is used. This paper demonstrates the effectiveness of CrossCheckGPT as a universal evaluation framework for any Multimodal Large Language Model (MLLM) that generates text outputs, applicable irrespective of the input modality. We investigate two information consistency measures: CrossCheck-explicit, which generates multiple text samples from each evidence system, and CrossCheck-implicit, which prompts the evidence model to determine whether it agrees with the assessed outputs.

CrossCheckGPT is validated on WikiBio [28] and MHaluBench [4] as text-to-text and image-to-text description tasks, and our experiments show that CrossCheckGPT achieves a notable 98% Spearman’s Rank Correlation (SRC) on MHaluBench against human ranking compared to -10% SRC using SelfCheckGPT and 33% using UniHD. In addition, a comprehensive audio-visual hallucination benchmark dataset (AVHalluBench) is proposed, covering a diverse range of styles, domains and elements such as visual text, speech and music. The AVHalluBench is used to rank recent audio and video LLMs such as Gemini 1.5 Pro, conducting the first study on audio-visual hallucination benchmarking. The key contributions of this paper are summarized as follows:

  • We propose CrossCheckGPT, a reference-free hallucination ranking approach that can be applied universally across text-generation tasks for systems of different modalities.

  • We conduct comprehensive experiments over a range of tasks and modalities, demonstrating the effectiveness of CrossCheckGPT as a hallucination benchmarking approach for ranking text, image or audio-visual systems. Experimental results illustrate that CrossCheckGPT consistently outperforms alternate approaches, such as SelfCheckGPT [28] and UniHD [4].

  • We analyze hallucination within video understanding and curate AVHalluBench, which to the best of our knowledge, is the first publicly released audio-visual hallucination benchmark.

2 Related Work

LLM Hallucination Benchmarking: Hallucination benchmarks typically rely on proxy tasks to probe the likelihood of LLMs making factual errors. For example, question-answering (QA) based benchmarks, such as TriviaQA [14], TruthfulQA [22], HaluEval-QA [18], MemoTrap [30] and FEWL [50], design questions specifically to probe truthfulness and factual accuracy, and rank systems by their accuracy. Other methods, such as FaithDial [10], XSum [34] and CNN-DM [38], measure hallucination in dialogue responses or summarization. However, these benchmarks require references (e.g., ground-truth answers or gold-standard references) to compare to model-generated outputs. On the other hand, SelfCheckGPT [28] can be used to rank systems on hallucination levels by measuring systems’ self-consistency scores on equivalent tasks. However, SelfCheckGPT was designed as a hallucination detection method and may not be calibrated across systems.

Multimodal LLM Hallucination Benchmarking: Multimodal hallucination has been mainly explored in the image-to-text domain for visual LLMs. One stream of methods, including CHAIR [36], LURE [56] and MHaluBench [4], directly evaluates the generated text descriptions of images using gold-standard annotations or external toolkits. Another stream of methods, such as POPE [19] and HallusionBench [13], curates a set of questions with short answers that try to capture various aspects of hallucination. Meanwhile, AMBER [49] combines both generation and question answering in a single benchmark. Unlike these methods, CrossCheckGPT does not rely on gold-standard references or dedicated question sets, and can be universally applied to any input modality.

3 CrossCheckGPT

CrossCheckGPT assigns a score to an MLLM (denoted as the target model) by assessing how much the responses of the MLLM are supported by evidence generated from a set of MLLMs (denoted as evidence models). The CrossCheckGPT scores can then be used to rank the MLLMs. As illustrated in Fig. 2, we explore two information consistency measures, CrossCheck-explicit and CrossCheck-implicit, which measure the hallucination of generated responses either through the explicit generation of evidence passages or implicit prompting, respectively. CrossCheckGPT is reference-free and can be generally applied to MLLMs of any input modality and output response type.

Figure 2: Illustration of the CrossCheckGPT approach with two evidence models as an example. Two information consistency measures are shown: ① CrossCheck-explicit, where $N$ passages are stochastically generated by sampling from each evidence model, and ② CrossCheck-implicit, where evidence models are directly used to determine whether there are any factual errors in each sentence (without sampling). The LLM judge uses the sentence and the analysis from the evidence model to produce the Yes/No binary decision.

3.1 Information Consistency Measures

CrossCheck-explicit stochastically generates a set of evidence passages from each evidence model and computes the average distance between each evidence passage and the target response. Let $R=[r_1,\ldots,r_i,\ldots,r_I]$ denote the response of the target model $\hat{M}$, where $r_i$ is the $i$-th sentence of the response, to a given query $Q$, which can be of any modality. We first re-formulate the SelfCheckGPT score for sentence $r_i$ of the target model in Eqn. (1) below,

$$\mathcal{S}_{\text{selfcheck}}(\hat{M}) = \frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{S}^{\text{selfcheck}}_{r_i,Q}(\hat{M}) \qquad\text{where}\quad \mathcal{S}^{\text{selfcheck}}_{r_i,Q}(\hat{M}) = \frac{1}{\hat{N}}\sum_{n=1}^{\hat{N}} x^{(n)}_{r_i,Q}(\hat{M}) \tag{1}$$

where $\mathcal{Q}$ is the set of queries in a test set, $\hat{N}$ is the number of passages stochastically generated by the model $\hat{M}$, and $x^{(n)}_{r_i,Q}(\hat{M})$ denotes the hallucination score indicating whether sentence $r_i$ is supported by evidence passage $n$ from $\hat{M}$. The hallucination score, estimated by prompting an LLM judge with the sentence and each evidence passage, takes a value in $\{0,1\}$, where $0$ denotes supported and $1$ denotes hallucinatory.

CrossCheck-explicit, in contrast to SelfCheckGPT, uses the evidence from $|\mathcal{M}|$ evidence models and measures the distance of the response against those from all other systems. The overall CrossCheck-explicit score $\mathcal{C}_{\text{explicit}}(\hat{M})$ for a specific target model $\hat{M}$ can be computed using Eqn. (2),

$$\mathcal{C}_{\text{explicit}}(\hat{M}) = \frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{C}^{\text{explicit}}_{r_i,Q}(\hat{M}) \qquad\text{where}\quad \mathcal{C}^{\text{explicit}}_{r_i,Q}(\hat{M}) = \frac{\sum_{j=1}^{|\mathcal{M}|}\eta_j\sum_{n=1}^{N_j} x^{(n)}_{r_i,Q}(M_j)}{\sum_{j=1}^{|\mathcal{M}|}\eta_j N_j} \tag{2}$$

where $\mathcal{M}$ denotes the set of evidence models used for CrossCheck-explicit. Note that self-consistency can be taken into account by including the target model $\hat{M}$ in the set of evidence models, i.e. $\hat{M}\in\mathcal{M}$. Each evidence model $M_j$ stochastically generates $N_j$ passages to check the response against, and since systems may have different levels of reliability, a factor $\eta_j$ can be assigned to the passages generated from model $M_j$.
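The aggregation in Eqn. (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the nested-list layout of the judge decisions, and the example values are all assumptions. It also assumes that $I$ is the number of sentences in the response to each query.

```python
from typing import Sequence


def explicit_sentence_score(
    judgments: Sequence[Sequence[int]],
    weights: Sequence[float],
) -> float:
    """Sentence-level CrossCheck-explicit score, following Eqn. (2).

    judgments[j][n] is the binary judge decision x^(n)(M_j) for passage n
    of evidence model j (0 = supported, 1 = hallucinatory).
    weights[j] is the factor eta_j of evidence model j.
    """
    if len(judgments) != len(weights):
        raise ValueError("one weight per evidence model is required")
    numerator = sum(w * sum(x) for w, x in zip(weights, judgments))
    denominator = sum(w * len(x) for w, x in zip(weights, judgments))
    return numerator / denominator


def system_score(sentence_scores_per_query: Sequence[Sequence[float]]) -> float:
    """Average sentence scores within each query, then across queries."""
    per_query = [sum(s) / len(s) for s in sentence_scores_per_query]
    return sum(per_query) / len(per_query)


# Hypothetical judgments: two evidence models, three passages each.
judgments = [[0, 0, 1], [1, 1, 1]]
print(explicit_sentence_score(judgments, weights=[0.5, 0.5]))
print(explicit_sentence_score(judgments, weights=[0.8, 0.2]))
```

With a single evidence model equal to the target model, the weight cancels and the sentence score reduces to the SelfCheckGPT form in Eqn. (1). Shifting weight towards the first (more supportive) evidence model lowers the score in the second call.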

CrossCheck-implicit is an alternative consistency measure, where instead of explicitly generating passages for the same query, the evidence models are prompted to spot any factual errors in each sentence. The overall CrossCheck-implicit score is computed using Eqn. (3),

$$\mathcal{C}_{\text{implicit}}(\hat{M}) = \frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{C}^{\text{implicit}}_{r_i,Q}(\hat{M}) \qquad\text{where}\quad \mathcal{C}^{\text{implicit}}_{r_i,Q}(\hat{M}) = \sum_{j=1}^{|\mathcal{M}|}\eta_j\, y_{r_i,Q}(M_j) \tag{3}$$

where $y_{r_i,Q}(M_j)$ denotes the hallucination score of sentence $r_i$ computed using CrossCheck-implicit. In contrast to CrossCheck-explicit (which computes $x_{r_i,Q}(M_j)$), $y_{r_i,Q}(M_j)$ is computed by first prompting the evidence model $M_j$ to analyze whether $r_i$ contains any factual errors given the input $Q$. The LLM judge then takes the input $r_i$ and analysis from model $M_j$ and predicts $y_{r_i,Q}(M_j)$, whether the response is hallucinatory. If factual errors are found in $r_i$, $y_{r_i,Q}(M_j)=1$, and otherwise $y_{r_i,Q}(M_j)=0$. We note that concurrent work, PoLL [47], applies a group of models as judges to evaluate texts and can be viewed as similar to CrossCheck-implicit. This work focuses on multimodal inputs and hallucination benchmarking.
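As a minimal sketch of the sentence-level term in Eqn. (3), the weighted sum over evidence models can be written as follows. The function name and example values are hypothetical. Since Eqn. (3) has no normalizing denominator, the sketch assumes the weights already sum to one, as they do when produced by a softmax.

```python
from typing import Sequence


def implicit_sentence_score(decisions: Sequence[int], weights: Sequence[float]) -> float:
    """Sentence-level CrossCheck-implicit score: sum_j eta_j * y(M_j).

    decisions[j] is the judge's binary decision based on the analysis of
    evidence model j (1 = factual error found, 0 = no error found).
    weights[j] is eta_j; the weights are assumed to sum to 1.
    """
    if len(decisions) != len(weights):
        raise ValueError("one weight per evidence model is required")
    return sum(w * y for w, y in zip(weights, decisions))


# Hypothetical example: four evidence models with uniform weights,
# two of which flag the sentence as containing a factual error.
print(implicit_sentence_score([1, 0, 0, 1], [0.25, 0.25, 0.25, 0.25]))
```

Unlike the explicit measure, each evidence model contributes a single decision per sentence, so no sampling of passages is involved.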

3.2 Confidence-based Weighting for Evidence Models

While all evidence models are advanced MLLMs, the quality of their evidence may vary depending on their propensity to hallucinate. Therefore, a weighting mechanism is proposed where the scores are weighted by model uncertainty reflected by SelfCheckGPT scores, as shown below:

$$\eta_j = \frac{e^{-\mathcal{S}_{\text{selfcheck}}(M_j)/T}}{\sum_{k=1}^{|\mathcal{M}|} e^{-\mathcal{S}_{\text{selfcheck}}(M_k)/T}}, \tag{4}$$

where $T$ is the calibration temperature that determines the sharpness of the weight distribution, which is set to a constant for each benchmark. A higher SelfCheckGPT score indicates that the model tends to generate inconsistent information and is more uncertain, so its evidence receives a lower weight. In addition, this weighting mechanism ensures that outlier systems will not be undermined by the evidence from weaker models. (Footnote 1: Note that a weight distribution can also be associated with each specific query by using the average SelfCheckGPT score of each evidence model.)
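Eqn. (4) is a softmax over negated SelfCheckGPT scores. The sketch below illustrates how the temperature shapes the weights. The function name, the example scores and the temperature values are assumptions chosen for illustration, and subtracting the maximum logit is a standard numerical-stability step that leaves the result unchanged.

```python
import math
from typing import List, Sequence


def confidence_weights(selfcheck_scores: Sequence[float], temperature: float) -> List[float]:
    """Evidence-model weights eta_j from Eqn. (4).

    A lower SelfCheckGPT score (more self-consistent model) yields a
    higher weight; a smaller temperature sharpens the distribution.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    logits = [-s / temperature for s in selfcheck_scores]
    shift = max(logits)  # stabilizes exp() without changing the softmax
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical SelfCheckGPT scores for three evidence models.
scores = [0.2, 0.4, 0.6]
print(confidence_weights(scores, temperature=0.1))   # sharply favors the first model
print(confidence_weights(scores, temperature=10.0))  # close to uniform
```

As the temperature grows, the weights approach uniform and the weighted score approaches the unweighted one.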

4 CrossCheckGPT for Hallucination with Multimodal Inputs

CrossCheckGPT is designed to be general and applicable to models of any input modality, provided that the outputs are of a consistent form (i.e. text) and a suitable information consistency measure is used. This general design of CrossCheckGPT enables it to also be applied to rank multi-modal systems (i.e. systems which use two or more input modalities).

Figure 3: CrossCheckGPT score computation for AVHalluBench with audio, visual and audio-visual inputs.

As shown in Fig. 3, we use CrossCheckGPT to evaluate models of three different categories: the audio domain and visual domain where the inputs are either audio or visual (image or silent video), and we further conduct the first study on evaluating hallucination levels within the audio-visual domain where the inputs are videos with their paired audio. Due to the lack of diversity in current publicly available capable systems taking audio-visual inputs, to evaluate CrossCheckGPT in the audio-visual domain, we prompt multi-modal models to instead split the outputs into visual descriptions and auditory descriptions, evaluating CrossCheckGPT within either of the domains. We use visual descriptions to check the visual-only inputs and audio descriptions to check the audio-only inputs. For hallucination benchmarking in multimodal audio-visual settings, information may require both modalities, e.g. someone demonstrating and explaining a skateboard trick. In this scenario, we use $\mathcal{C}=\min\left(\mathcal{C}^{\text{audio}},\mathcal{C}^{\text{visual}}\right)$ as the CrossCheckGPT score, where $\mathcal{C}^{\text{audio}}$ uses the audio descriptions and $\mathcal{C}^{\text{visual}}$ uses the visual descriptions. (Footnote 2: For simplicity, $\hat{M}$, $r_i$, and $Q$ are dropped here, and the scores can be either implicit or explicit.) (Footnote 3: Initial findings showed CrossCheck-implicit gives different ranges of scores for audio and visual modalities, at about 0.2 and 0.5 on average, respectively. Thus, only CrossCheck-explicit is adopted for audio-visual inputs.)
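The combination rule for audio-visual inputs is simple enough to state directly in code. The sketch below is illustrative only: the model names and scores are placeholders, and ranking from lowest to highest score assumes, as in Section 3.1, that lower scores indicate fewer hallucinations.

```python
from typing import Dict, List, Tuple


def audio_visual_score(audio_score: float, visual_score: float) -> float:
    """Combine per-modality scores as C = min(C_audio, C_visual)."""
    return min(audio_score, visual_score)


def rank_models(scores: Dict[str, Tuple[float, float]]) -> List[str]:
    """Order models from lowest to highest combined score."""
    combined = {name: audio_visual_score(a, v) for name, (a, v) in scores.items()}
    return sorted(combined, key=combined.get)


# Hypothetical (audio, visual) scores; the model names are placeholders.
example = {
    "model_a": (0.30, 0.45),
    "model_b": (0.25, 0.60),
    "model_c": (0.50, 0.40),
}
print(rank_models(example))
```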

AVHalluBench: To benchmark hallucinations in audio-visual LLMs, we curate AVHalluBench, a dataset containing 175 videos selected from six video understanding datasets covering various styles and elements, with statistics shown in Table 15 in the Appendix. To verify the effectiveness of CrossCheckGPT (and future benchmarking methods), AVHalluBench includes a carefully written set of hallucination-free descriptions for audio and visual contents. After watching each video with audio, the annotators were instructed to write one description focusing on the audio content and one description focusing on the visual content of the video, separately. (Footnote 4: To maximize coverage, initial descriptions were generated using Gemini 1.5 Pro and GPT-4v, prompted to describe all the elements present in the sequence of frames. Note that although these descriptions are not hallucination-free, they have a high level of coverage and subjective details. The annotators were provided with these descriptions in addition to the videos while being instructed to write only objective details of the videos.) To analyze the inter-annotator agreement, we split each description into atomic facts [31] and verify each fact against the descriptions written by the other annotators, categorized as either: Supporting, such that the fact is supported by the other annotator; Contradicting, such that the fact contradicts the information provided by the other annotator; or Neutral, such that the facts neither support nor contradict one another. Both decomposition and verification processes are performed automatically using GPT-4. Of the 39 videos annotated by multiple annotators, there were 471 audio-related facts and 913 visual-related facts, and the agreement between annotators (as counted by Supporting/Neutral/Contradicting) was 64.6%/24.6%/10.8% and 62.0%/29.0%/9.0%, respectively.

5 Experiments

We conduct experiments to validate CrossCheckGPT on MLLMs with three input modalities: text (§5.1), image (§5.2), and audio-visual (§5.3). During inference, a temperature of 1.0, a beam size of 1 and a top-p of 0.9 are used for all models. SelfCheckGPT [28] is applied as a hallucination ranking baseline for all modalities since it is reference-free and not task-specific.

5.1 Text-to-text Experiments

Experimental Setup: The main text-to-text experiments are performed using the subset of WikiBio data used in [28], which contains 238 biographical passages from Wikipedia. We select 10 open-source LLMs (listed in Appendix Table 7) as target models, 8 of which are used as evidence models. Four models are Llama-2-7B based [45] (e.g. Vicuna-v1.5-7B [6]) and four models are Mistral-7B based [15]. Each evidence model generates 20 stochastic passages. For the LLM judge in CrossCheck-explicit (used to determine whether sentences support one another), Mistral-7B [15] is used as it achieves the best results among all considered open-source LLMs (shown in Appendix Table 10).

To evaluate the general benchmarking ability of ranking methods, 10 benchmark metrics from the hallucinations leaderboard (Footnote 5: https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard), shown in Table 8, are selected to provide the overall hallucination ranking of the systems. These metrics are either based on human annotation or gold-standard references, where the overall rankings are obtained by averaging the rankings from each metric.

We report the system-level correlation between the hallucination ranking methods and the overall ranking measured by Spearman’s Rank Correlation coefficient (SRC), denoted as System($\rho$). In addition, as WikiBio contains reference texts, the references can be used as evidence texts, which can be considered an idealized fact-checking method. This method is referred to as RefCheck, and CrossCheckGPT and SelfCheckGPT scores are also compared against RefCheck at the document level using Pearson’s Correlation Coefficient (PCC), denoted as Document($r$). Furthermore, to investigate the effectiveness of CrossCheckGPT when the target LLM is much more powerful than the evidence models, we include GPT-4 in addition to the 10 target LLMs.
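For readers who want to reproduce this style of evaluation, the two correlation measures can be sketched in dependency-free Python as below. This is an illustrative sketch rather than the evaluation code used in the paper: the function names and example values are assumptions, and Spearman's coefficient is computed as the Pearson correlation of average ranks, which handles ties.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values in ascending order from 1; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical system-level scores from a ranking method and a reference ranking.
method_scores = [0.10, 0.20, 0.30, 0.40]
reference_scores = [10.0, 20.0, 30.0, 25.0]
print(spearman(method_scores, reference_scores))
```

In this example only the last two systems swap places between the two orderings, so the rank correlation is high but below 1.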

Hallucination Ranking Results: Existing hallucination metrics such as HaluEval-QA accuracy do not correlate well with the overall ranking at the system level. Some metrics have negative correlations while the highest (TruthfulQA MC2) is 57.14% (shown in Table 1, with further pairwise correlations provided in Appendix Table 13). This is likely because each existing metric is typically designed to measure only one aspect related to hallucinations, e.g., probing through question-answering.

Metrics                        | System(ρ) (%) | Document(r) (%), w/o GPT-4 | Document(r) (%), with GPT-4
TruthfulQA MC2 [22]            | 57.14         | -                          | -
SelfCheckGPT [28]              | 66.46         | 74.06                      | 76.08
CrossCheck-implicit            | 56.71         | 18.33                      | 17.29
CrossCheck-explicit            | 77.44         | 82.28                      | 77.23
CrossCheck-implicit weighted   | 56.81         | 20.21                      | 19.16
CrossCheck-explicit weighted   | 82.32         | 81.78                      | 82.18
Table 1: General hallucination evaluation where the task for SelfCheckGPT/CrossCheckGPT is open-ended biography generation on WikiBio. System-level correlation, System(ρ), is measured against the overall ranking of the leaderboard, and document-level correlation, Document(r), is measured against RefCheck. “With GPT-4” refers to including GPT-4 as a target model. Additional metrics are presented in Table 11 in the Appendix.
Figure 4: Scatter plot of document-level scores for SelfCheckGPT and CrossCheck-explicit against RefCheck for text-to-text experiments.
Subset     | Values
Succ. Rate | 90%
P-value    | $4\times10^{-6}$
Table 2: Success rate of CrossCheck outperforming SelfCheck for independent subsets of WikiBio documents. The P-value is measured by the one-tailed sign test with $H_0$ = CrossCheck not better than SelfCheck.

CrossCheck-explicit correlates with the overall ranking better than all other methods, with CrossCheck-explicit weighted by model uncertainty achieving the highest correlation, highlighting its effectiveness as a general hallucination ranking method. In addition, the document-level correlation plots are shown in Fig. 4, and the sign test on independent subsets in Table 2 shows that CrossCheckGPT is significantly better than SelfCheckGPT ($p = 4 \times 10^{-6}$) for ranking at the system level.
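The one-tailed sign test mentioned above can be sketched as a binomial tail probability under the null hypothesis that each subset is equally likely to favour either method. The number of subsets is not stated in this section, so the count of 27 wins out of 30 subsets below is a hypothetical example chosen only because it corresponds to a 90% success rate; the function name is also illustrative.

```python
from math import comb


def one_tailed_sign_test(successes, trials):
    """P(X >= successes) for X ~ Binomial(trials, 0.5), i.e. the null of no preference."""
    tail = sum(comb(trials, k) for k in range(successes, trials + 1))
    return tail / 2 ** trials


# Hypothetical counts: 27 wins in 30 independent subsets (a 90% success rate).
p_value = one_tailed_sign_test(27, 30)  # about 4.2e-6
```

Under these assumed counts the p-value is of the same order as the one reported in Table 2, but this does not confirm the actual number of subsets used.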

5.2 Image-to-text Experiments

We validate CrossCheckGPT for the hallucination ranking of visual LLMs on image-to-text tasks. The experiments are performed on MHaluBench [4], an image-captioning hallucination dataset. Nine visual LLMs are selected as target models, all of which are used to generate evidence passages (see Appendix Table 7 for the list of models). Each evidence model generates ten image descriptions per image. The overall ranking is obtained by averaging the rankings from CHAIR [36] and POPE (MSCOCO subset) [19]. (Footnote 6: CHAIR and POPE are two popular, representative metrics for free-form text generation and binary classification hallucination benchmarks, respectively [49].) In addition to SelfCheckGPT, UniHD [4] is used as a stronger baseline.
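Averaging the rankings from two benchmarks can be sketched as follows. This is an illustration under stated assumptions: the model names and scores are hypothetical, CHAIR is treated as lower-is-better and POPE as higher-is-better, and ties in the mean rank are broken alphabetically, which is a choice made for this sketch only.

```python
def ranks_from_scores(scores, higher_is_better):
    """Map each model name to its rank (1 = best) under one benchmark."""
    ordered = sorted(scores, key=scores.get, reverse=higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}


def overall_ranking(chair_scores, pope_scores):
    """Order models by the mean of their CHAIR rank and POPE rank."""
    chair_rank = ranks_from_scores(chair_scores, higher_is_better=False)
    pope_rank = ranks_from_scores(pope_scores, higher_is_better=True)
    mean_rank = {m: (chair_rank[m] + pope_rank[m]) / 2 for m in chair_scores}
    # Assumption: ties in the mean rank are broken by model name.
    return sorted(mean_rank, key=lambda m: (mean_rank[m], m))


# Hypothetical benchmark scores for three placeholder models.
chair = {"model_a": 10.0, "model_b": 20.0, "model_c": 15.0}
pope = {"model_a": 0.85, "model_b": 0.80, "model_c": 0.90}
ranking = overall_ranking(chair, pope)
```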

For evaluation, we take a subset of 30 image descriptions generated by each target model (a total of 270 passages with 3237 facts) and annotate each description with a binary label of either hallucinatory or factual. The Cohen’s $\kappa$ between the two annotators is 0.632, indicating substantial agreement. The models are ranked by the average percentage of factual errors produced by each target model, and hallucination ranking performance is measured at the system level using SRC, denoted as System($\rho$), and at the image level using PCC, denoted as Image($r$).
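For reference, Cohen's $\kappa$ for two annotators can be computed as in the sketch below. The label lists are hypothetical toy data (1 = hallucinatory, 0 = factual), not the annotations collected for MHaluBench, and the function name is illustrative.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(labels_a)
    categories = set(labels_a) | set(labels_b)
    # Observed agreement: fraction of items labelled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the annotators labelled independently.
    p_expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical binary labels from two annotators.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
annotator_2 = [1, 1, 0, 0, 1, 0, 0, 1, 0, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

The function is undefined when the expected agreement equals 1 (both annotators always use the same single label), so that case would need separate handling in practice.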

| Metrics | System($\rho$) (%): Overall | System($\rho$) (%): CHAIR | System($\rho$) (%): Human | Image($r$) (%): Human |
|---|---|---|---|---|
| UniHD [4] | 42.02 | 36.98 | 33.33 | 36.70 |
| SelfCheckGPT [28] | 43.70 | 23.10 | -10.00 | 20.93 |
| CrossCheck-implicit | 50.42 | 64.71 | 98.33 | 48.72 |
| CrossCheck-explicit | 42.86 | 43.70 | 75.00 | 35.16 |
| CrossCheck-implicit weighted | 50.42 | 64.71 | 98.33 | 52.83 |
| CrossCheck-explicit weighted | 47.06 | 46.22 | 73.33 | 36.98 |

Table 3: System-level correlation measured by System($\rho$) and image-level correlation measured by Image($r$) for various hallucination evaluation methods on the MHaluBench dataset. System-level correlation is measured against the overall ranking, rankings from CHAIR scores and human annotation.

Hallucination Ranking Results: As before, Table 3 presents the system-level and image-level correlations against the overall rankings and the rankings derived from human annotations. Both variants of CrossCheckGPT correlate better with the human-derived system rankings than SelfCheckGPT and UniHD, with CrossCheck-implicit weighted performing best out of all methods, achieving a 98.33% correlation with the rankings from human annotations. Equivalent statistical significance analysis and scatter plots are shown in Table 14 and Fig. 7 in Appendix F, respectively.

5.3 Video-to-text Experiments

Next, we apply CrossCheckGPT to AVHalluBench to investigate hallucination ranking in audio-visual LLMs. We consider 7 models that can handle video inputs and 6 models that can handle audio inputs. Three models, FAVOR [40], Video-LLaMA [54], and Gemini 1.5 Pro [42], are in the intersection of the two sets and can handle audio-visual inputs. When ranking hallucinations for visual description, we consider audio-visual LLMs with visual-only inputs and audio-visual inputs as separate systems, and hence there are $7 + 3 = 10$ target models for ranking. We apply a similar ranking scheme for audio descriptions, where there are $6 + 3 = 9$ target models. All the target models are also used as evidence models in CrossCheck-explicit, and each model generates ten evidence passages. (Footnote 7: Gemini 1.5 Pro is not used for CrossCheck-implicit due to limits on the number of requests.) When using audio-visual LLMs as evidence models, audio-visual inputs are given to obtain the visual or audio descriptions as evidence. As only 5 target models can handle speech inputs, we further produce a dedicated ranking for these models only, with prompts explicitly asking for a speech description.

| Metrics | Visual Description (%): System($\rho$) | Visual Description (%): Video($r$) | Audio Description (%): System($\rho$) | Audio Description (%): Video($r$) (w. speech) |
|---|---|---|---|---|
| SelfCheckGPT | 86.67 | 65.77 | 60.00 | 51.13 (44.55) |
| CrossCheck-implicit weighted | 54.29 | 30.73 | 40.00 | 2.15 (16.20) |
| CrossCheck-explicit weighted | 89.09 | 78.58 | 71.67 | 68.10 (47.60) |

Table 4: System-level and video-level correlations of SelfCheckGPT and CrossCheckGPT against RefCheck using manual descriptions in AVHalluBench. Weighted version of CrossCheckGPT is used with $C = 0.1$. Ranking correlations for systems that handle speech are in brackets.

Hallucination Ranking Results: First, system-level and video-level correlations are shown in Table 4, measured by System($\rho$) and Video($r$). CrossCheck-explicit correlates with RefCheck best, with an 89.09% System($\rho$) for the visual description. Similar to the text-to-text results, we observe that CrossCheck-explicit performs better than CrossCheck-implicit. For both text-to-text and video-to-text experiments, this is likely due to the high diversity in the evidence passages as indicated by high raw SelfCheckGPT scores, which we discuss further in Section 5.4.

Impact of Audio-Visual Inputs: As supporting information from another modality is expected to reduce hallucination, this section investigates whether audio-visual inputs reduce the raw hallucination scores compared to the scores when a single modality is used. Table 5 presents the average raw hallucination scores (rather than correlations), for three MLLMs that can take audio-visual inputs.

| Model | Input modality | Visual Description (%): $\mathcal{S}_{\text{selfcheck}}$ ↓ | Visual Description (%): $\mathcal{C}_{\text{explicit}}$ ↓ | Audio Description (%): $\mathcal{S}_{\text{selfcheck}}$ ↓ | Audio Description (%): $\mathcal{C}_{\text{explicit}}$ ↓ |
|---|---|---|---|---|---|
| FAVOR [40] | Visual | 60.67 | 53.85 | - | - |
| FAVOR [40] | Audio | - | - | 49.62 | 66.69 |
| FAVOR [40] | Audio-Visual | 56.42 | 49.60 | 33.25 | 35.20 |
| Video-LLaMA [54] | Visual | 41.14 | 52.02 | - | - |
| Video-LLaMA [54] | Audio | - | - | 56.42 | 68.05 |
| Video-LLaMA [54] | Audio-Visual | 47.73 | 49.13 | 70.23 | 41.25 |
| Gemini 1.5 Pro [42] | Visual | 19.87 | 31.74 | - | - |
| Gemini 1.5 Pro [42] | Audio | - | - | 25.82 | 34.66 |
| Gemini 1.5 Pro [42] | Audio-Visual | 12.77 | 23.27 | 48.51 | 28.79 |

Table 5: SelfCheckGPT scores ($\mathcal{S}_{\text{selfcheck}}$) and weighted CrossCheck-explicit scores ($\mathcal{C}_{\text{explicit}}$) on AVHalluBench for audio-visual LLMs. Calibration temperature $T = 0.1$ is used here.

When considering the CrossCheckGPT scores, we observe that, as expected, audio-visual inputs reduce hallucination rates as measured by the raw CrossCheckGPT scores. While Gemini 1.5 Pro achieved the best scores, it can be more susceptible to hallucination when silent videos are used as inputs, as it often fabricates its audio descriptions. Moreover, except for Gemini 1.5 Pro, when audio-visual inputs are used the reduction in hallucination scores is larger for audio description tasks than for visual description tasks. This is likely because, for audio description tasks, visual information often helps identify the source of a sound, which can significantly reduce the uncertainty about that sound. For visual description tasks, while particular audio cues (especially from speech) can provide useful information, misleading or unrelated sounds may cause additional hallucinations. For example, in Fig. 10, where there is a self-playing piano, audio inputs can mislead a model to believe that the piano is played by an individual. Further examples are presented in Appendix H, with the raw hallucination scores for audio-only and visual-only inputs shown in Tables 16 and 17 in the Appendix.
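The reductions discussed above can be checked with a short worked calculation over the weighted CrossCheck-explicit scores in Table 5. The dictionary keys and the function name are illustrative; the assignment of the single-modality rows to the visual and audio description columns follows the table layout as read here.

```python
# Weighted CrossCheck-explicit scores (%) taken from Table 5.
c_explicit = {
    "FAVOR": {"visual_only": 53.85, "visual_av": 49.60,
              "audio_only": 66.69, "audio_av": 35.20},
    "Video-LLaMA": {"visual_only": 52.02, "visual_av": 49.13,
                    "audio_only": 68.05, "audio_av": 41.25},
    "Gemini 1.5 Pro": {"visual_only": 31.74, "visual_av": 23.27,
                       "audio_only": 34.66, "audio_av": 28.79},
}


def score_reductions(scores):
    """Drop in score (percentage points) when moving to audio-visual inputs."""
    return {
        model: {
            "visual": round(s["visual_only"] - s["visual_av"], 2),
            "audio": round(s["audio_only"] - s["audio_av"], 2),
        }
        for model, s in scores.items()
    }


reductions = score_reductions(c_explicit)
# Models whose reduction is larger for audio description than for visual description.
larger_for_audio = [m for m, r in reductions.items() if r["audio"] > r["visual"]]
```

With these values the reduction is larger for audio description for FAVOR and Video-LLaMA but not for Gemini 1.5 Pro, matching the exception noted in the text.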

5.4 CrossCheck-explicit vs. CrossCheck-implicit

While CrossCheck-implicit is more sample-efficient than CrossCheck-explicit and only requires generating the error analysis once, the performance of CrossCheck-implicit can be highly dependent on the task. For the text-to-text and video-to-text experiments, CrossCheck-implicit performs worse than CrossCheck-explicit, as opposed to the findings in the image-to-text experiments. We hypothesize that for challenging and open-ended tasks, CrossCheck-explicit is preferred as it can better cover the output space by disentangling the evidence generation and verification tasks, yielding more calibrated uncertainty measures. However, in other circumstances, CrossCheck-implicit may help the model focus on specific aspects of the input and yield more accurate rankings. For challenging and open-ended tasks with diverse outputs, the raw SelfCheckGPT scores are expected to be high and therefore can be used as a proxy to determine which consistency measure to select. For example, the average SelfCheckGPT score across models is 40.63% for text-to-text, which is much higher than 17.16% for image-to-text. We recommend using CrossCheck-explicit when the SelfCheckGPT scores are high, and CrossCheck-implicit when they are sufficiently low, which is demonstrated to be a reasonable rule, illustrated by the results in Appendix Table 18.
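The selection rule recommended above can be sketched as a simple threshold on the average SelfCheckGPT score. This section does not give a numeric cut-off, so the default threshold of 30% below is purely a hypothetical value placed between the two reported averages; the function name is also an assumption.

```python
def choose_consistency_measure(mean_selfcheck_score, threshold=30.0):
    """Pick a CrossCheckGPT variant from the mean SelfCheckGPT score (in %).

    The threshold is a hypothetical value for illustration, not one taken
    from the paper.
    """
    if mean_selfcheck_score >= threshold:
        return "CrossCheck-explicit"
    return "CrossCheck-implicit"


# Average SelfCheckGPT scores reported in the text.
text_to_text_choice = choose_consistency_measure(40.63)
image_to_text_choice = choose_consistency_measure(17.16)
```

Under this assumed threshold the rule selects CrossCheck-explicit for text-to-text and CrossCheck-implicit for image-to-text, which agrees with the better-performing variant in each of those experiments.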

5.5 Ablation Studies

Self-Bias: LLMs are known to have self-preferential bias [2, 55] and may prefer outputs from similar models. Therefore, evidence models that share a base model with the target model may yield biased CrossCheckGPT scores. The results in Table 6 show that self-bias is an issue: when only Llama-2-based evidence models are used, the outputs from Vicuna get a lower hallucination score than those from Mistral, whereas when only Mistral-based evidence models are used, Mistral has the lower hallucination score, resulting in contradictory conclusions. This bias can be mitigated by using a wide range of evidence models, as is done for the CrossCheckGPT scores reported here, which gives a more reliable evaluation with strong correlations.

| Evidence Models | System($\rho$) | Document($r$) | Vicuna $\mathcal{C}_{\text{explicit}}$ | Mistral $\mathcal{C}_{\text{explicit}}$ |
|---|---|---|---|---|
| Llama-2-based models only | 55.49% | 81.10% | 42.94% | 45.68% |
| Mistral-based models only | 81.71% | 81.06% | 44.98% | 41.81% |
| All models | 82.32% | 82.28% | 44.82% | 44.93% |

Table 6: The mitigation of self-bias in CrossCheckGPT scores and its influence measured by document-level correlations and CrossCheck-explicit scores of Vicuna and Mistral on WikiBio. There are 4 Llama-2-based models and 4 Mistral-based models in the set of evidence models.
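The contradictory conclusions in Table 6 can be made explicit with a small calculation over the CrossCheck-explicit scores of the two target models. The variable and function names are illustrative; the numbers are copied from Table 6.

```python
# CrossCheck-explicit scores (%) of two target models under each evidence set (Table 6).
scores_by_evidence_set = {
    "Llama-2-based models only": {"Vicuna": 42.94, "Mistral": 45.68},
    "Mistral-based models only": {"Vicuna": 44.98, "Mistral": 41.81},
    "All models": {"Vicuna": 44.82, "Mistral": 44.93},
}


def lower_scoring_target(target_scores):
    """Return the target model with the lower (better) hallucination score."""
    return min(target_scores, key=target_scores.get)


preferred = {
    evidence_set: lower_scoring_target(target_scores)
    for evidence_set, target_scores in scores_by_evidence_set.items()
}
gap = {
    evidence_set: round(abs(s["Vicuna"] - s["Mistral"]), 2)
    for evidence_set, s in scores_by_evidence_set.items()
}
```

Each single-family evidence set favours a different target model by roughly three percentage points, while the gap shrinks to about 0.1 points when all evidence models are combined.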
Figure 5: Variation of SelfCheckGPT scores (Left) and the weighted CrossCheck-explicit scores (Right) against the varying temperature during description generation.

Robustness to Manipulation: To investigate whether a ranking method can be easily manipulated, we examine the influence of the generation temperature (which can be selected for any model). The results in Fig. 5 show that by increasing the temperature of the target model from 0.5 to 1.5, SelfCheckGPT scores increase by as much as 35%, drastically influencing the rankings. In contrast, CrossCheckGPT provides more stable rankings for all generation temperatures. Results are demonstrated for MHaluBench, but similar trends are observed for WikiBio as well.

6 Conclusions

This paper proposes CrossCheckGPT, a universal hallucination ranking method for multimodal large language models. We evaluated two variants of CrossCheckGPT on text-to-text, image-to-text and video-to-text tasks, demonstrating that it consistently outperforms all baseline methods, achieving 98% and 89% system-level correlation against humans on MHaluBench and AVHalluBench respectively. We also introduce AVHalluBench, the first resource to study audio-visual hallucination issues in video understanding.

Acknowledgments

This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge.

References

  • Almazrouei et al. [2023] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models. arXiv:2311.16867, 2023.
  • Brown [1986] J. D. Brown. Evaluations of self and others: Self-enhancement biases in social judgments. Social cognition, 4(4):353–376, 1986.
  • Chen et al. [2023] S. Chen, X. He, L. Guo, X. Zhu, W. Wang, J. Tang, and J. Liu. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv:2304.08345, 2023.
  • Chen et al. [2024a] X. Chen, C. Wang, Y. Xue, N. Zhang, X. Yang, Q. Li, Y. Shen, L. Liang, J. Gu, and H. Chen. Unified hallucination detection for multimodal large language models. arXiv:2402.03190, 2024a.
  • Chen et al. [2024b] Z. Chen, H. Liu, W. Yu, G. Sun, H. Liu, J. Wu, C. Zhang, Y. Wang, and Y. Wang. M3av: A multimodal, multigenre, and multipurpose audio-visual academic lecture dataset. arXiv:2403.14168, 2024b.
  • Chiang et al. [2023] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality, 2023.
  • Chu et al. [2023] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.
  • Dai et al. [2023] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vvoWPYqZJA.
  • Dinan et al. [2019] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
  • Dziri et al. [2022] N. Dziri, E. Kamalloo, S. Milton, O. Zaiane, M. Yu, E. Ponti, and S. Reddy. Faithdial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022.
  • Feng et al. [2023] S. Feng, V. Balachandran, Y. Bai, and Y. Tsvetkov. FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 933–952, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.59. URL https://aclanthology.org/2023.emnlp-main.59.
  • Gong et al. [2024] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass. Listen, think, and understand. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=nBZBPXdJlC.
  • Guan et al. [2024] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
  • Han et al. [2019] M. Han, M. Kang, H. Jung, and S. J. Hwang. Episodic memory reader: Learning what to remember for question answering from streaming data. In A. Korhonen, D. Traum, and L. Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4407–4417, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1434. URL https://aclanthology.org/P19-1434.
  • Jiang et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b. arXiv:2310.06825, 2023.
  • Jin et al. [2024] P. Jin, R. Takanobu, C. Zhang, X. Cao, and L. Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.
  • Li et al. [2022] G. Li, Y. Wei, Y. Tian, C. Xu, J.-R. Wen, and D. Hu. Learning to answer questions in dynamic audio-visual scenarios. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
  • Li et al. [2023a] J. Li, X. Cheng, X. Zhao, J.-Y. Nie, and J.-R. Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore, Dec. 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.397. URL https://aclanthology.org/2023.emnlp-main.397.
  • Li et al. [2023b] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, Dec. 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20. URL https://aclanthology.org/2023.emnlp-main.20.
  • Li et al. [2023c] Y. Li, C. Wang, and J. Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv:2311.17043, 2023c.
  • Lin et al. [2023] B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv:2311.10122, 2023.
  • Lin et al. [2022] S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
  • Liu et al. [2023] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw.
  • Liu et al. [2024] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng. A survey on hallucination in large vision-language models. arXiv:2402.00253, 2024.
  • Luo et al. [2023] R. Luo, Z. Zhao, M. Yang, J. Dong, M. Qiu, P. Lu, T. Wang, and Z. Wei. Valley: Video assistant with large language model enhanced ability. arXiv:2306.07207, 2023.
  • [26] D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte. Stable beluga models. URL https://huggingface.co/stabilityai/StableBeluga2.
  • Manakul et al. [2023a] P. Manakul, A. Liusie, and M. Gales. MQAG: Multiple-choice question answering and generation for assessing information consistency in summarization. In J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, and A. A. Krisnadhi, editors, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 39–53, Nusa Dua, Bali, Nov. 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.4. URL https://aclanthology.org/2023.ijcnlp-main.4.
  • Manakul et al. [2023b] P. Manakul, A. Liusie, and M. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, Dec. 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.557. URL https://aclanthology.org/2023.emnlp-main.557.
  • Maynez et al. [2020] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173.
  • McKenzie et al. [2023] I. R. McKenzie, A. Lyzhov, M. Pieler, A. Parrish, A. Mueller, A. Prabhu, E. McLean, A. Kirtland, A. Ross, A. Liu, A. Gritsevskiy, D. Wurgaft, D. Kauffman, G. Recchia, J. Liu, J. Cavanagh, M. Weiss, S. Huang, T. F. Droid, T. Tseng, T. Korbak, X. Shen, Y. Zhang, Z. Zhou, N. Kim, S. R. Bowman, and E. Perez. Inverse scaling: When bigger isn’t better. TMLR, 2023.
  • Min et al. [2023] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.741. URL https://aclanthology.org/2023.emnlp-main.741.
  • Mukherjee et al. [2023] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv:2306.02707, 2023.
  • Nahar et al. [2024] M. Nahar, H. Seo, E.-J. Lee, A. Xiong, and D. Lee. Fakes of varying shades: How warning affects human perception and engagement regarding llm hallucinations. arXiv:2404.03745, 2024.
  • Narayan et al. [2018] S. Narayan, S. B. Cohen, and M. Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206.
  • OpenAI [2023] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
  • Rohrbach et al. [2018] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1437. URL https://aclanthology.org/D18-1437.
  • Sanabria et al. [2018] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze. How2: A large-scale dataset for multimodal language understanding. In Proc. ViGIL, 2018.
  • See et al. [2017] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL https://aclanthology.org/P17-1099.
  • Shen et al. [2023] X. Shen, D. Li, J. Zhou, Z. Qin, B. He, X. Han, A. Li, Y. Dai, L. Kong, M. Wang, Y. Qiao, and Y. Zhong. Favdbench: Fine-grained audible video description. In Proc. CVPR, 2023.
  • Sun et al. [2023] G. Sun, W. Yu, C. Tang, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. Fine-grained audio-visual joint representations for multimodal large language models. arXiv:2310.05863, 2023.
  • Tang et al. [2024] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=14rn7HpKVk.
  • Team [2024] G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
  • Thorne et al. [2018] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074.
  • Tonmoy et al. [2024] S. M. T. I. Tonmoy, S. M. M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and A. Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv:2401.01313, 2024.
  • Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
  • Tunstall et al. [2023] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of lm alignment. arXiv:2310.16944, 2023.
  • Verga et al. [2024] P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.
  • Wang et al. [2023a] C. Wang, X. Liu, Y. Yue, X. Tang, T. Zhang, C. Jiayang, Y. Yao, W. Gao, X. Hu, Z. Qi, Y. Wang, L. Yang, J. Wang, X. Xie, Z. Zhang, and Y. Zhang. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv:2310.07521, 2023a.
  • Wang et al. [2023b] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, M. Yan, J. Zhang, and J. Sang. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023b.
  • Wei et al. [2024] J. Wei, Y. Yao, J.-F. Ton, H. Guo, A. Estornell, and Y. Liu. Measuring and reducing llm hallucination without gold-standard answers via expertise-weighting. arXiv:2402.10412, 2024.
  • Xiao et al. [2021] J. Xiao, X. Shang, A. Yao, and T.-S. Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proc. CVPR, 2021.
  • Yang et al. [2023] X. Yang, L. Pan, X. Zhao, H. Chen, L. Petzold, W. Y. Wang, and W. Cheng. A survey on detection of llms-generated content. arXiv:2310.15654, 2023.
  • Ye et al. [2023] Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, F. Huang, and J. Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv:2311.04257, 2023.
  • Zhang et al. [2023] H. Zhang, X. Li, and L. Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Y. Feng and E. Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.49. URL https://aclanthology.org/2023.emnlp-demo.49.
  • Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
  • Zhou et al. [2024] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oZDJKTlOUe.
  • Zhu et al. [2023] B. Zhu, E. Frick, T. Wu, H. Zhu, and J. Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023.

Appendix A Experimental Setup Details

We list the models involved in this paper in Table 7, and text-to-text metrics in Table 8.

Target LLMs Modality Evidence Models Evidence Models License
(explicit) (explicit)
Llama-2-7B [45] Text llama2
Llama-2-7B-Chat [45] Text llama2
Mistral-7B-Instruct-v0.1 [15] Text Apache-2.0
Mistral-7B-Instruct-v0.2 [15] Text Apache-2.0
Vicuna-v1.5-7B[6] Text llama2
Falcon-7B[1] Text Apache-2.0
Starling-7B-alpha[57] Text Apache-2.0
StableBeluga-7B[26] Text llama2
Zephyr-7b-beta[46] Text MIT
Mistral-7B-OpenOrca[32] Text Apache-2.0
GPT-4 [35] Text N/A
LLaVA-v1.5 [23] Vision llama2
InstructBLIP (vicuna-7B) [8] Vision BSD 3-Clause
mPLUG-Owl2 [53] Vision MIT
Valley [25] Vision Apache-2.0
Video-LLaVA [21] Vision Apache-2.0
Chat-Univi [16] Vision Apache-2.0
LLaMA-VID [20] Vision Apache-2.0
LTU [12] Audio Apache-2.0
Qwen-Audio-Chat [7] Audio Tongyi Qianwen
SALMONN [41] Audio Apache-2.0
Video-LLaMA [54] Audio-visual BSD 3-Clause
FAVOR [40] Audio-visual Apache-2.0
Gemini 1.5 Pro [42] Audio-visual N/A
Table 7: Models and reference benchmarks for validating CrossCheckGPT.
Reference Benchmarks (Metrics) Description
TriviaQA [14] (Acc) A realistic text-based question-answering dataset containing documents collected from Wikipedia and the web.
TruthfulQA MC1 [22] (Acc) A benchmark to measure whether a language model is truthful in generating answers to questions, spanning 38 categories.
TruthfulQA MC2 [22] (Acc)
XSum [34] (FactKB [11]) The factual accuracy of summarization models by verifying the presence of knowledge base facts in generated summaries.
CNN-DM [38] (BERTP) The CNN-DailyMail dataset is a collection of news articles and accompanying summaries measured by BERTScore-Precision.
MemoTrap [30] (Acc) Assessing whether LLMs fall into memorization traps which occur when LLMs memorize specific examples in training.
FaithDial [10] (Acc) A benchmark for hallucination-free dialogues by editing hallucinated responses in Wizard of Wikipedia (WoW) [9]
HaluEval-QA [18] (Acc) A large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. It contains the QA, summarization and dialogue tasks.
HaluEval-summarization [18] (Acc)
HaluEval-Dialogue [18] (Acc)
Table 8: Dataset, models and reference benchmarks for validating CrossCheckGPT. Acc stands for accuracy.

Appendix B Exact Prompts

We provide the exact prompts we used in our experiments in Table 9 for various tasks.

Task Prompt
Text-to-text generation Generate a passage about <name>.
Image-to-text description Describe the image in one paragraph.
Visual description for video Describe the video in one paragraph.
Audio description for video Describe the audio in one paragraph.
Prompt for speech content What does the man/woman say in the video?
LLM Judgment for CrossCheck-explicit Context: <evidence_passage>\n\nSentence: <sentence> \n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer:
CrossCheck-implicit factual errors You are given the following sentence about <name/image/video> that might be inaccurate:\n<sentence>\n List possible inaccurate information in this sentence.
LLM Judgment for CrossCheck-implicit You are given the following sentence about <name/image/video>:\n<sentence>\nThe following is an analysis of possible inaccuracies in this sentence:\n<list_of_possible_errors>\nBased on the analysis, determine if the sentence contains any inaccurate information. Answer Yes or No.\n\nAnswer:
Table 9: Exact prompt used for different tasks.
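
As a rough illustration of how the CrossCheck-explicit judgment prompt in Table 9 could be instantiated, the sketch below fills the template with an evidence passage and a sentence, and maps the judge's reply to a numeric score. The helper names `build_explicit_prompt` and `parse_judgment` are hypothetical, the exact whitespace of the template is an assumption, and the mapping (Yes to 0.0, No to 1.0, anything else to 0.5) is an assumption borrowed from the prompt-based SelfCheckGPT convention rather than something specified in this appendix.

```python
import re

# Template mirroring the CrossCheck-explicit judgment prompt in Table 9.
# Whitespace details are an assumption of this sketch.
EXPLICIT_TEMPLATE = (
    "Context: {evidence_passage}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No.\n\n"
    "Answer:"
)


def build_explicit_prompt(evidence_passage: str, sentence: str) -> str:
    """Fill the judgment template with one evidence passage and one sentence."""
    return EXPLICIT_TEMPLATE.format(
        evidence_passage=evidence_passage, sentence=sentence
    )


def parse_judgment(answer: str) -> float:
    """Map the judge's reply to a score (assumed mapping, higher = less supported).

    Only the first alphabetic word of the reply is inspected.
    """
    words = re.findall(r"[a-z]+", answer.lower())
    first = words[0] if words else ""
    if first == "yes":
        return 0.0  # sentence is supported by the evidence
    if first == "no":
        return 1.0  # sentence is not supported by the evidence
    return 0.5  # unparseable reply, treated as uncertain
```

Averaging the parsed scores of a sentence over several evidence passages would then give a sentence-level score in the range 0 to 1, again under the assumed mapping.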

Appendix C CrossCheckGPT as a Hallucination Detection Method

CrossCheckGPT can also be used as a hallucination detection method. As shown in Table 10, with each of the three evidence models it outperforms the best output-probability-based method reported in SelfCheckGPT [28].

Evidence Model Non-Factual Non-Factual* Factual Document ($r$)
Llama 30B Max($\mathcal{H}$) [28] 80.92 37.32 37.90 35.57
Llama-2-7B-Chat 85.84 57.22 54.41 56.25
Vicuna-v1.5-7B 83.13 53.38 51.13 54.64
Mistral-7B-Instruct-v0.2 87.21 59.60 56.72 63.04
Table 10: AUC-PR and document-level correlation against human annotation for detecting hallucinations in GPT-3 using individual evidence models on non-factual and factual statements in WikiBio [28].
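
For readers unfamiliar with the AUC-PR figures in Table 10, the following minimal sketch computes average precision, one common step-wise estimate of the area under the precision-recall curve, from sentence-level hallucination scores and binary human labels. The function name and any inputs passed to it are hypothetical, tied scores are not treated specially, and the estimator may differ from the one used to produce Table 10.

```python
def average_precision(scores, labels):
    """Step-wise estimate of the area under the precision-recall curve.

    scores: higher means the detector considers the sentence more likely
            to be hallucinated.
    labels: 1 if annotators marked the sentence as the positive class,
            0 otherwise.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    num_positive = sum(labels)
    if num_positive == 0:
        raise ValueError("at least one positive label is required")

    # Rank sentences from most to least suspicious.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision at the rank where this positive is retrieved.
            precision_sum += hits / rank
    return precision_sum / num_positive
```

A detector that ranks every positive sentence above every negative one obtains a value of 1.0 under this estimate.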

Appendix D Text-to-text Additional Results

We provide the version of Table 1 with all ten benchmark metrics in Table 11. Moreover, we investigate the task-specific hallucination ranking ability, where the inputs to SelfCheckGPT and CrossCheckGPT come from a specific task (rather than open-ended text generation). We conduct task-specific experiments using the inputs from TruthfulQA MC1 and HaluEval QA, which contain multiple-choice and yes-no questions respectively. The results in Table 12 show high system-level correlations and moderate document-level correlations, indicating that CrossCheckGPT can operate as a task-specific metric without requiring any ground truth.

Metrics System ($\rho$) Document ($r$) w/o GPT-4 Document ($r$) with GPT-4
TriviaQA [14] 23.33 - -
TruthfulQA MC1 [22] 52.94 - -
TruthfulQA MC2 [22] 57.14 - -
XSum [34] -70.00 - -
CNNDM [38] 38.33 - -
MemoTrap [30] 10.88 - -
FaithDial [10] -8.33 - -
HaluEval-QA [18] -18.33 - -
HaluEval-Summarization [18] 48.33 - -
HaluEval-Dialogue [18] 46.03 - -
SelfCheckGPT [28] 66.46 74.06 76.08
CrossCheck-explicit 77.44 82.28 77.23
CrossCheck-implicit 56.71 18.33 17.29
CrossCheck-explicit weighted 82.32 81.78 82.18
CrossCheck-implicit weighted 56.81 20.21 19.16
Table 11: Full version of Table 1 including all other metrics. General hallucination evaluation where the task for SelfCheckGPT/CrossCheckGPT is open-ended text generation on WikiBio. System-level correlation, System ($\rho$), is measured against the overall ranking in the leaderboard, and document-level correlation, Document ($r$), is measured against RefCheck. "With GPT-4" refers to including GPT-4 as the target LLM.
Metrics System ($\rho$) TruthfulQA MC1 System ($\rho$) HaluEval QA Document ($r$) TruthfulQA MC1 Document ($r$) HaluEval QA
SelfCheckGPT 76.19 30.95 30.87 6.76
CrossCheckGPT 76.19 88.10 33.68 22.00
Table 12: Task-specific hallucination evaluation where the task of SelfCheckGPT/CrossCheckGPT is, in this example, either TruthfulQA MC1 or HaluEval QA. Note that rankings are performed on 8 target models that are instruction-tuned as these tasks are QA-based and require some instruction-following ability.
Figure 6: The variation of System ($\rho$) and Document ($r$) against calibration temperature $T$ in Eqn. (4) for weighted CrossCheck-explicit. Constant weighting refers to applying the same weight for all documents, while per-passage weighting refers to the use of passage-specific weighting derived from SelfCheckGPT scores of each passage.

We first show the variation of the system-level and document-level correlations against the calibration temperature for weighted CrossCheck-explicit in Fig. 6, using WikiBio data. The figure also compares per-query weights with a single set of weights shared across the entire task. $C=0.1$ is chosen as it achieves the best system-level correlation. Moreover, the same weighting is used across the whole task at $C=0.1$, as the large variance among the weights of different queries introduces more noise into the scoring and hence hinders the correlation.

Appendix E System-level Correlations between Individual Text-based Hallucination Benchmarks

We provide the system-level correlations between individual text-based hallucination benchmarks in Table 13, showing that the benchmarks capture different aspects of hallucination and do not correlate well with each other.

TriviaQA TruthfulQA Xsum CNN-DM MemoTrap FaithDial HaluQA HaluSumm HaluDial
TriviaQA [14] 1.00 0.20 -0.72 0.15 0.07 0.13 0.27 0.40 0.50
TruthfulQA [22] 0.20 1.00 -0.10 0.38 0.27 0.05 -0.50 0.37 0.63
Xsum [34] -0.72 -0.10 1.00 -0.03 -0.40 0.12 -0.57 -0.63 -0.68
CNN-DM [38] 0.15 0.38 -0.03 1.00 0.28 -0.05 -0.05 0.33 0.37
MemoTrap [30] 0.07 0.27 -0.40 0.28 1.00 -0.05 -0.08 0.48 0.17
FaithDial [10] 0.13 0.05 0.12 -0.05 -0.05 1.00 -0.03 -0.22 -0.13
HaluQA [18] 0.27 -0.50 -0.57 -0.05 -0.08 -0.03 1.00 0.30 0.20
HaluSumm [18] 0.40 0.37 -0.63 0.33 0.48 -0.22 0.30 1.00 0.67
HaluDial [18] 0.50 0.63 -0.68 0.37 0.17 -0.13 0.20 0.67 1.00
Table 13: System-level correlation ($\rho$) between each pair of the 9 selected benchmark metrics.
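
The sketch below shows how one entry of a system-level correlation matrix like Table 13 could be computed, assuming that $\rho$ denotes Spearman's rank correlation (the usual meaning of the symbol; this appendix does not restate it). The function names are hypothetical, and any per-system scores passed in are the caller's own rather than values from the leaderboard.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(scores_a, scores_b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both benchmarks must score the same systems")
    return pearson(average_ranks(scores_a), average_ranks(scores_b))
```

Calling `spearman` on the per-system scores of every pair of benchmarks, with the systems listed in the same order each time, would fill in a symmetric matrix with ones on the diagonal.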

Appendix F Scatter Plots and Statistical Significance for Image-to-text

Figure 7: Scatter plot of SelfCheckGPT, CrossCheck-explicit and CrossCheck-implicit scores against human annotation for image-to-text tasks.

The scatter plot, similar to text-to-text ones in Fig. 4, is shown in Fig. 7.

Methods Success rate (p-value)
CrossCheck-explicit 65.5% (<0.00001)
CrossCheck-implicit 84.5% (<0.00001)
CrossCheck-explicit weighted 67.0% (<0.00001)
CrossCheck-implicit weighted 88.0% (<0.00001)
Table 14: Success rate and statistical significance of CrossCheckGPT approaches measured via sign-test on independent subsets of images.

Additionally, we report the statistical significance of CrossCheckGPT being better than SelfCheckGPT on MHaluBench by performing the sign test at the image level.
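
A one-sided sign test of this kind can be sketched as follows. The sketch assumes that each independent subset counts as a win when CrossCheckGPT agrees with the human annotation better than SelfCheckGPT does and as a loss otherwise, with ties discarded; the function name and the counts in the usage note are hypothetical.

```python
from math import comb


def sign_test_p_value(wins: int, losses: int) -> float:
    """One-sided sign test p-value.

    Under the null hypothesis that wins and losses are equally likely,
    the number of wins follows Binomial(n, 0.5) with n = wins + losses.
    The p-value is the probability of observing at least `wins` wins.
    """
    if wins < 0 or losses < 0:
        raise ValueError("counts must be non-negative")
    n = wins + losses
    # Sum the upper tail of the binomial distribution with exact integers.
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n
```

For example, `sign_test_p_value(131, 69)` would correspond to a 65.5% success rate over a hypothetical 200 subsets.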

Appendix G Statistics of AVHalluBench

We provide detailed statistics about AVHalluBench in Table 15, including the number of videos and the average length of each subset, as well as the audio and visual elements involved.

Source Dataset Num. of Videos Avg. Length (sec.) w/ Speech w/ Music w/ Visual Text
NeXT-QA [51] 32 (18%) 22.0 19 7 1
M3AV [5] 27 (16%) 11.3 27 0 27
How2 [37] 27 (16%) 9.5 27 4 2
MUSIC-AVQA [17] 23 (13%) 29.0 0 23 0
VALOR32k [3] 26 (15%) 8.7 11 7 8
FAVDBench [39] 38 (22%) 8.0 8 15 13
Overall 175 14.2 92 (52%) 56 (32%) 51 (29%)
Table 15: Statistics of the AVHalluBench dataset with the percentage shown in brackets.

Appendix H Additional SelfCheckGPT and CrossCheckGPT Scores on AVHalluBench

We provide the detailed SelfCheckGPT and CrossCheckGPT scores on AVHalluBench for all MLLMs that handle video or audio inputs in this paper in Table 16 for video descriptions and Table 17 for audio descriptions.

Models SelfCheckGPT CrossCheck-explicit CrossCheck-implicit
Valley [25] 52.43 55.98 48.22
Video-LLaVA [21] 30.59 33.52 40.57
Chat-Univi [16] 29.40 32.68 41.75
LLaMA-VID [20] 38.61 39.14 40.48
Video-LLaMA [54] 41.14 52.02 48.80
FAVOR [40] 60.67 53.85 50.49
Gemini 1.5 Pro 19.87 31.74 -
Table 16: SelfCheckGPT and CrossCheckGPT scores for 6 visual-LLMs that take video as inputs on AVHalluBench. Note that FAVOR, Video-LLaMA and Gemini 1.5 Pro are only given visual inputs. Gemini 1.5 Pro was not used for CrossCheck-implicit.
Models SelfCheckGPT (audio) SelfCheckGPT (w. speech) CrossCheck-explicit (audio) CrossCheck-explicit (w. speech) CrossCheck-implicit (audio) CrossCheck-implicit (w. speech)
LTU [12] 21.95 - 37.44 - 18.06 -
Qwen-Audio-Chat [7] 36.57 37.08 43.66 43.41 20.21 52.20
SALMONN [41] 34.99 34.80 42.21 40.15 18.32 48.17
FAVOR [40] 49.62 41.51 66.69 55.41 23.26 61.01
Video-LLaMA [54] 56.42 - 68.05 - 17.10 -
Gemini 1.5 Pro 25.82 27.38 34.66 36.52 - -
Table 17: SelfCheckGPT and CrossCheckGPT scores for 6 audio-LLMs on AVHalluBench. Note that FAVOR and Video-LLaMA are only given audio inputs. Gemini 1.5 Pro was not used for CrossCheck-implicit.

Appendix I CrossCheck-explicit vs. CrossCheck-implicit

We present the average SelfCheckGPT scores on each task together with the system-level correlations in Table 18 to support our recommendations on CrossCheck-explicit and CrossCheck-implicit.

Tasks Ave. $\mathcal{S}_{\text{selfcheck}}$ System ($\rho$): CrossCheck-explicit System ($\rho$): CrossCheck-implicit
Text-to-text 40.63 77.44 56.71
Image-to-text 17.16 42.86 50.42
Audio description 39.91 71.67 40.00
Visual description 42.14 89.09 54.29
Table 18: SelfCheckGPT scores and system-level correlations using CrossCheck-explicit and CrossCheck-implicit on four tasks. The system-level correlation for audio and visual descriptions is measured against RefCheck, and that for text-to-text and image-to-text tasks is measured against the overall ranking.

Appendix J Case Studies for Hallucination with Audio-Visual Inputs

In addition to the piano example in Fig. 10 that is mentioned in the main text, we show two further examples in Fig. 8 and Fig. 9, where audio-visual inputs influence the hallucination compared to using audio or visual inputs alone.

Figure 8: Example of audio-visual hallucination problem from Gemini 1.5 Pro. In this example, even when no audio is provided, the model still describes what the man is talking about, and having audio inputs greatly benefits the description by reducing the hallucination in describing the man’s speech.
Figure 9: Example of audio-visual hallucination problem from FAVOR. In this example, the audio is the man explaining what he is doing in the game. The speech description reduces the hallucination of “pressing the button” and “opening a door” in the visual description, although new, seemingly random hallucinations emerge.
Figure 10: Example of audio-visual hallucination problem. In this example, the audio is the piano itself playing, which introduces additional hallucination to the visual description which describes it as “played by a woman”.

Appendix K Limitations

Our investigation is limited in the following aspects. First, hallucination is an expansive area and, as in other studies, this paper only covers a reasonable subset of all possible domains. However, we plan to release a live hallucination leaderboard on which we will benchmark further MLLMs over more benchmark metrics. Second, while the confidence-based weighting mechanism improves the performance of CrossCheckGPT, it does not take into account the similarities between different evidence models. Correlation between models, due to similar training data or to starting from the same checkpoints, may result in evidence models making similar mistakes. This suggests a future research direction: taking model correlation into account in the weighting mechanism. Lastly, our study is limited by the number of currently available audio-visual LLMs for evidence generation.

Appendix L Broader Impact

Hallucinations in multimodal foundation models have become increasingly critical and challenging. Providing a general reference-free hallucination benchmarking approach is therefore necessary and timely, giving practitioners metrics for model trustworthiness. CrossCheckGPT has the following positive broader impacts:

  • CrossCheckGPT establishes a universal ranking system which helps identify more factual and faithful models to be selected in particular applications, reducing the dissemination of misinformation and increasing societal confidence in AI applications.

  • CrossCheckGPT provides a reliable ranking that would aid regulatory bodies in enforcing compliance standards for multimodal foundation models, particularly in critical areas such as healthcare, finance, and public safety.

  • As a reference-free and versatile benchmarking method, CrossCheckGPT can drive developers to innovate and improve their multimodal foundation models.

However, our method by no means provides perfect hallucination scores and may inherit potential bias from the chosen evidence models. Therefore, practitioners should be independently educated and avoid overreliance on the rankings, as doing so may lead to complacency in critical thinking and reduced vigilance. From the model aspect, the approach in this paper does not give rise to any additional potential biases beyond the ones directly inherited from the pre-trained LLM checkpoints.

Appendix M Computing Resource

Our experiments are performed on a single Nvidia A100 GPU for inference. Obtaining the CrossCheckGPT score takes 20 hours of inference per target model on average. Running all models takes 200 hours for the text-to-text leaderboard, 190 hours for the image-to-text leaderboard and 240 hours for AVHalluBench. The full research used 2000 GPU hours in total. No training process is involved in the research.

Appendix N Assets and License Explanation

The following licenses are applied to the datasets used in our paper:

The following licenses are applied to the code and Python packages we use for our experiments: