CrossCheckGPT: Universal Hallucination
Ranking for Multimodal Foundation Models

Guangzhi Sun¹∗ Potsawee Manakul¹,²,³∗ Adian Liusie¹ Kunat Pipatanakul²,³
Chao Zhang⁴ Phil Woodland¹ Mark Gales¹
¹University of Cambridge ²SCB 10X ³SCBX ⁴Tsinghua University

Abstract

Multimodal foundation models are prone to hallucination, generating outputs that either contradict the input or are not grounded by factual information. Given the diversity in architectures, training data and instruction tuning techniques, there can be large variations in systems’ susceptibility to hallucinations. To assess system hallucination robustness, hallucination ranking approaches have been developed for specific tasks such as image captioning, question answering, summarization, or biography generation. However, these approaches typically compare model outputs to gold-standard references or labels, limiting hallucination benchmarking for new domains. This work proposes "CrossCheckGPT", a reference-free universal hallucination ranking for multimodal foundation models. The core idea of CrossCheckGPT is that the same hallucinated content is unlikely to be generated by different independent systems, hence cross-system consistency can provide meaningful and accurate hallucination assessment scores. CrossCheckGPT can be applied to any model or task, provided that the information consistency between outputs can be measured through an appropriate distance metric. Focusing on multimodal large language models that generate text, we explore two information consistency measures: CrossCheck-explicit and CrossCheck-implicit. We showcase the applicability of our method for hallucination ranking across various modalities, namely the text, image, and audio-visual domains. Further, we propose the first audio-visual hallucination benchmark, "AVHalluBench", and illustrate the effectiveness of CrossCheckGPT, achieving correlations of 98% and 89% with human judgements on MHaluBench and AVHalluBench, respectively.

∗Equal contribution

1 Introduction

In the domain of generative foundation models, ‘hallucination’ describes the scenario when generated outputs, while seemingly credible, are either inconsistent with the provided context or contradict established factual knowledge [24, 48, 44]. This issue impacts many generative applications and can lead to the spread of misinformation in a range of settings [52, 33]. Given the differences in architectures, data, and alignment techniques for foundation models, there is a need to be able to quantify a system’s susceptibility to hallucination, such that practitioners can be aware of systems’ hallucination risk and select systems with high factual consistency.

Figure 1: SelfCheckGPT (Left) and CrossCheckGPT (Right) for hallucination rankings. The approach can rank a set of MLLMs on any task without reference, enabling hallucination benchmarks for various generative tasks.

Current hallucination benchmarks have been developed to rank systems for individual tasks including question answering [22, 14, 18, 10, 43], summarization [29, 27], biography generation [28], instruction following [30], image captioning [36], and visual question answering [19, 49]. Many of these benchmarks measure the hallucination level through a proxy measure, such as the ability of the model to correctly answer questions designed to trigger hallucinations. However, these benchmarks have been designed for particular tasks and assume access to gold-standard labels, limiting their applicability to generalized domains. On the other hand, hallucination detection approaches such as SelfCheckGPT [28] and UniHD [4] directly examine generated responses against self-evidence, and therefore do not require gold-standard answers. These methods, though, simply aim to identify when a model hallucinates, and scores are not directly comparable across different models.

In this paper, we propose CrossCheckGPT, a universal hallucination ranking approach to benchmark multimodal foundation models. The core idea of CrossCheckGPT is that the same hallucinated content is unlikely to be generated by different independent systems, while factual content is likely to be consistent across models. An illustration of the approach and its contrast to SelfCheckGPT is depicted in Fig. 1. Instead of checking for self-consistency, as done in SelfCheckGPT, CrossCheckGPT checks the cross-consistency by comparing against evidence generated from a set of independent models. This produces more accurate and directly comparable hallucination scores, as well as yielding more robust rankings. CrossCheckGPT can be applied to any foundation model and task as long as a suitable information consistency measure is used. This paper demonstrates the effectiveness of CrossCheckGPT as a universal evaluation framework for any Multimodal Large Language Model (MLLM) that generates text outputs, applicable irrespective of the input modality. We investigate two information consistency measures: CrossCheck-explicit, which generates multiple text samples from each evidence system, and CrossCheck-implicit, which prompts the evidence model to determine whether it agrees with the assessed outputs.

CrossCheckGPT is validated on WikiBio [28] and MHaluBench [4] as text-to-text and image-to-text description tasks, and our experiments show that CrossCheckGPT achieves a notable 98% Spearman’s Rank Correlation (SRC) on MHaluBench against human ranking compared to -10% SRC using SelfCheckGPT and 33% using UniHD. In addition, a comprehensive audio-visual hallucination benchmark dataset (AVHalluBench) is proposed, covering a diverse range of styles, domains and elements such as visual text, speech and music. The AVHalluBench is used to rank recent audio and video LLMs such as Gemini 1.5 Pro, conducting the first study on audio-visual hallucination benchmarking. The key contributions of this paper are summarized as follows:

• We propose CrossCheckGPT, a reference-free hallucination ranking approach that can be applied universally across text-generation tasks for systems of different modalities.
• We conduct comprehensive experiments over a range of tasks and modalities, demonstrating the effectiveness of CrossCheckGPT as a hallucination benchmarking approach for ranking text, image or audio-visual systems. Experimental results illustrate that CrossCheckGPT consistently outperforms alternate approaches, such as SelfCheckGPT [28] and UniHD [4].
• We analyze hallucination within video understanding and curate AVHalluBench, which, to the best of our knowledge, is the first publicly released audio-visual hallucination benchmark.

2 Related Work

LLM Hallucination Benchmarking: Hallucination benchmarks typically rely on proxy tasks to probe the likelihood of an LLM making factual errors. For example, question-answering (QA) based benchmarks, such as TriviaQA [14], TruthfulQA [22], HaluEval-QA [18], MemoTrap [30] and FEWL [50], design questions specifically to probe truthfulness and factual accuracy, and rank systems by their accuracy. Other methods, such as FaithDial [10], XSum [34] and CNN-DM [38], measure hallucination in dialogue responses or summarization. However, these benchmarks require references (e.g., ground-truth answers or gold-standard references) to compare to model-generated outputs. On the other hand, SelfCheckGPT [28] can be used to rank systems on hallucination levels by measuring systems’ self-consistency scores on equivalent tasks. However, SelfCheckGPT was designed as a hallucination detection method and may not be calibrated across systems.

Multimodal LLM Hallucination Benchmarking: Multimodal hallucination has been mainly explored in the image-to-text domain for visual LLMs. One stream of methods, including CHAIR [36], LURE [56] and MHaluBench [4], directly evaluates the generated text descriptions of images using gold-standard annotations or external toolkits. Another stream of methods, such as POPE [19] and HallusionBench [13], curates a set of questions with short answers that try to capture various aspects of hallucination. Meanwhile, AMBER [49] combines both generation and question answering in one single benchmark. Unlike these methods, CrossCheckGPT does not rely on gold-standard references or dedicated question sets, and can be universally applied to any input modality.

3 CrossCheckGPT

CrossCheckGPT assigns a score to an MLLM (denoted as the target model) by assessing how much the responses of the MLLM are supported by evidence generated from a set of MLLMs (denoted as evidence models). The CrossCheckGPT scores can then be used to rank the MLLMs. As illustrated in Fig. 2, we explore two information consistency measures, CrossCheck-explicit and CrossCheck-implicit, which measure the hallucination of generated responses either through the explicit generation of evidence passages or implicit prompting, respectively. CrossCheckGPT is reference-free and can be generally applied to MLLMs of any input modality and output response type.

3.1 Information Consistency Measures

CrossCheck-explicit stochastically generates a set of evidence passages from each evidence model and computes the average distance between each evidence passage and the target response. Let $R=[r_{1},\ldots,r_{i},\ldots,r_{I}]$ denote the response of the target model $\hat{M}$ , where $r_{i}$ is the $i$ -th sentence of the response, to a given query $Q$ , which can be of any modality. We first re-formulate the SelfCheckGPT score for sentence $r_{i}$ of the target model in Eqn. (1) below,

$$\mathcal{S}_{\text{selfcheck}}(\hat{M})=\frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{S}^{\text{selfcheck}}_{r_{i},Q}(\hat{M})\qquad\text{where}\quad\mathcal{S}^{\text{selfcheck}}_{r_{i},Q}(\hat{M})=\frac{1}{\hat{N}}\sum_{n=1}^{\hat{N}}x^{(n)}_{r_{i},Q}(\hat{M})\tag{1}$$

where $\mathcal{Q}$ is the set of queries in a test set, $\hat{N}$ is the number of stochastically generated passages by the model $\hat{{M}}$ , and $x^{(n)}_{r_{i},Q}(\hat{{M}})$ denotes the hallucination score of whether sentence $r_{i}$ is supported by evidence $n$ from $\hat{{M}}$ . The hallucination score, estimated by prompting an LLM judge with the sentence and each evidence, takes a value in $\{0,1\}$ , where $0$ denotes supported and $1$ denotes hallucinatory.
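As a minimal sketch of Eqn. (1), the snippet below averages binary judge outputs over evidence passages, sentences and queries. The nested-list layout and the function names are assumptions made for illustration; the LLM judge itself is not modelled, and its $\{0,1\}$ outputs are simply taken as given.

```python
from statistics import mean


def sentence_score(judgements):
    """Mean of the binary judge outputs x^(n) for one sentence (1 = hallucinatory)."""
    return sum(judgements) / len(judgements)


def selfcheck_score(judgements_by_query):
    """Average sentence scores within each response, then average over queries.

    judgements_by_query: list over queries Q; each item is a list over the
    sentences r_i of the response; each sentence is a list of N-hat binary
    judgements (hypothetical data layout).
    """
    per_query = [
        mean(sentence_score(sentence) for sentence in sentences)
        for sentences in judgements_by_query
    ]
    return mean(per_query)


# Toy data: two queries; the first response has two sentences, the second has one.
toy = [
    [[0, 0, 1, 0], [1, 1, 0, 1]],  # sentence scores 0.25 and 0.75
    [[0, 0, 0, 0]],                # sentence score 0.0
]
score = selfcheck_score(toy)
```

Here the per-query means are 0.5 and 0.0, so the system-level score is 0.25.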

CrossCheck-explicit, in contrast to SelfCheckGPT, uses the evidence from $|\mathcal{M}|$ evidence models and measures the distance of the response against those from all other systems. The overall CrossCheck-explicit score $\mathcal{C}_{\text{explicit}}(\hat{M})$ for a specific target model $\hat{M}$ can be computed using Eqn. (2),

$$\mathcal{C}_{\text{explicit}}(\hat{M})=\frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{C}^{\text{explicit}}_{r_{i},Q}(\hat{M})\qquad\text{where}\quad\mathcal{C}^{\text{explicit}}_{r_{i},Q}(\hat{M})=\frac{\sum_{j=1}^{|\mathcal{M}|}\eta_{j}\sum_{n=1}^{N_{j}}x^{(n)}_{r_{i},Q}({M}_{j})}{\sum_{j=1}^{|\mathcal{M}|}\eta_{j}N_{j}}\tag{2}$$

where $\mathcal{M}$ denotes the set of evidence models used for CrossCheck-explicit. Note that self-consistency can be taken into account by including the target model $\hat{M}$ into the evidence models, $\hat{M}\!\in\!\mathcal{M}$ . Each evidence model ${M}_{j}$ stochastically generates $N_{j}$ passages to check the response against, and since systems may have different levels of reliability, a factor $\eta_{j}$ can be assigned to the passages generated from model ${M}_{j}$ .
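The sentence-level term of Eqn. (2) can be sketched as a weighted fraction of unsupported evidence passages. The function name and the list-of-lists input are illustrative assumptions; the judge outputs $x^{(n)}$ are again taken as given.

```python
def crosscheck_explicit_sentence(judgements_per_model, weights):
    """Sentence-level CrossCheck-explicit score, following Eqn. (2).

    judgements_per_model: list over evidence models M_j; each item holds the
        N_j binary judgements x^(n) for one sentence (1 = hallucinatory).
    weights: list of eta_j, one per evidence model.
    """
    numerator = sum(eta * sum(xs) for eta, xs in zip(weights, judgements_per_model))
    denominator = sum(eta * len(xs) for eta, xs in zip(weights, judgements_per_model))
    return numerator / denominator


# Two evidence models with four passages each.
judgements = [[0, 0, 1, 0], [1, 1, 1, 0]]
uniform = crosscheck_explicit_sentence(judgements, [0.5, 0.5])
trust_first = crosscheck_explicit_sentence(judgements, [0.8, 0.2])
```

With uniform weights the score is 0.5; shifting weight towards the first model, whose passages mostly support the sentence, lowers it to 0.35.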

CrossCheck-implicit is an alternative consistency measure, where instead of explicitly generating passages for the same query, the evidence models are prompted to spot any factual errors in each sentence. The overall CrossCheck-implicit score is computed using Eqn. (3),

$$\mathcal{C}_{\text{implicit}}(\hat{M})=\frac{1}{|\mathcal{Q}|}\frac{1}{I}\sum_{Q\in\mathcal{Q}}\sum_{i=1}^{I}\mathcal{C}^{\text{implicit}}_{r_{i},Q}(\hat{M})\qquad\text{where}\quad\mathcal{C}^{\text{implicit}}_{r_{i},Q}(\hat{M})=\sum_{j=1}^{|\mathcal{M}|}\eta_{j}\,y_{r_{i},Q}({M}_{j})\tag{3}$$

where $y_{r_{i},Q}({M}_{j})$ denotes the hallucination score of sentence $r_{i}$ computed using CrossCheck-implicit. In contrast to CrossCheck-explicit (which computes $x_{r_{i},Q}({M}_{j})$ ), $y_{r_{i},Q}({M}_{j})$ is computed by first prompting the evidence model $M_{j}$ to analyze whether $r_{i}$ contains any factual errors given the input $Q$ . The LLM judge then takes the input $r_{i}$ and analysis from model $M_{j}$ and predicts $y_{r_{i},Q}({M}_{j})$ , whether the response is hallucinatory. If factual errors are found in $r_{i}$ , $y_{r_{i},Q}({M}_{j})=1$ , and otherwise $y_{r_{i},Q}({M}_{j})=0$ . We note that concurrent work, PoLL [47], applies a group of models as judges to evaluate texts and can be viewed as similar to CrossCheck-implicit. This work focuses on multimodal inputs and hallucination benchmarking.
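The two-stage procedure behind Eqn. (3) can be sketched as follows. The evidence models and the LLM judge are represented by plain Python callables, which are stand-ins introduced for illustration and not real model APIs; the stub behaviour below is invented purely to make the example runnable.

```python
def crosscheck_implicit_sentence(sentence, query, evidence_models, judge, weights):
    """Sentence-level CrossCheck-implicit score, following Eqn. (3).

    evidence_models: callables (sentence, query) -> analysis text (hypothetical).
    judge: callable (sentence, analysis) -> 1 if hallucinatory else 0 (hypothetical).
    weights: list of eta_j, assumed to sum to 1.
    """
    score = 0.0
    for eta, analyse in zip(weights, evidence_models):
        analysis = analyse(sentence, query)  # step 1: evidence model looks for factual errors
        y = judge(sentence, analysis)        # step 2: judge maps the analysis to {0, 1}
        score += eta * y
    return score


# Stub models: the first finds no error, the second flags one.
stub_models = [
    lambda sentence, query: "no errors found",
    lambda sentence, query: "error: the stated year is wrong",
]
stub_judge = lambda sentence, analysis: int(analysis.startswith("error"))

implicit = crosscheck_implicit_sentence(
    "She was born in 1950.", "Write a biography.", stub_models, stub_judge, [0.7, 0.3]
)
```

Only the second stub flags an error, so the score equals its weight, 0.3.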

3.2 Confidence-based Weighting for Evidence Models

While all evidence models are advanced MLLMs, the quality of their evidence may vary depending on their propensity to hallucinate. Therefore, a weighting mechanism is proposed where the scores are weighted by model uncertainty reflected by SelfCheckGPT scores, as shown below:

$$\eta_{j}=\frac{e^{-\mathcal{S}_{\text{selfcheck}}({M}_{j})/T}}{\sum_{k=1}^{|\mathcal{M}|}e^{-\mathcal{S}_{\text{selfcheck}}({M}_{k})/T}},\tag{4}$$

where $T$ is the calibration temperature that determines the sharpness of the weight distribution, which is set to a constant for each benchmark. A higher SelfCheckGPT score indicates that the model tends to generate inconsistent information and is more uncertain. In addition, this weighting mechanism ensures that outlier systems will not be undermined by the evidence from weaker models.¹

¹ Note that a weight distribution can also be associated with each specific query by using the average SelfCheckGPT score of each evidence model.
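Eqn. (4) is a softmax over negated SelfCheckGPT scores, which can be sketched as below. The function name is illustrative, and the scores are assumed to be on a 0 to 1 scale; subtracting the maximum logit is a standard numerical-stability step that does not change the result.

```python
import math


def evidence_weights(selfcheck_scores, temperature):
    """Softmax weights eta_j from per-model SelfCheckGPT scores, following Eqn. (4)."""
    logits = [-s / temperature for s in selfcheck_scores]
    shift = max(logits)  # stabilises exp() without changing the normalised weights
    exps = [math.exp(logit - shift) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Three evidence models; a lower SelfCheckGPT score yields a larger weight.
weights_sharp = evidence_weights([0.2, 0.4, 0.6], temperature=0.1)
weights_flat = evidence_weights([0.2, 0.4, 0.6], temperature=10.0)
```

A small temperature concentrates the weight on the most self-consistent model, while a large temperature approaches uniform weighting.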

4 CrossCheckGPT for Hallucination with Multimodal Inputs

CrossCheckGPT is designed to be general and applicable to models of any input modality, provided that the outputs are of a consistent form (i.e. text) and a suitable information consistency measure is used. This general design of CrossCheckGPT enables it to also be applied to rank multi-modal systems (i.e. systems which use two or more input modalities).

As shown in Fig. 3, we use CrossCheckGPT to evaluate models of three different categories: the audio domain and visual domain where the inputs are either audio or visual (image or silent video), and we further conduct the first study on evaluating hallucination levels within the audio-visual domain where the inputs are videos with their paired audio. Due to the lack of diversity in current publicly available capable systems taking audio-visual inputs, to evaluate CrossCheckGPT in the audio-visual domain, we prompt multi-modal models to instead split the outputs into visual descriptions and auditory descriptions, evaluating CrossCheckGPT within either of the domains. We use visual descriptions to check the visual-only inputs and audio descriptions to check the audio-only inputs. For hallucination benchmarking in multimodal audio-visual settings, information may require both modalities, e.g. someone demonstrating and explaining a skateboard trick. In this scenario, we use $\mathcal{C}=\min\left(\mathcal{C}^{\text{audio}},\mathcal{C}^{\text{visual}}\right)$ as the CrossCheckGPT score, where $\mathcal{C}^{\text{audio}}$ uses the audio descriptions and $\mathcal{C}^{\text{visual}}$ uses the visual descriptions.² ³

² For simplicity, $\hat{M}$, $r_{i}$, and $Q$ are dropped here, and the scores can be either implicit or explicit.

³ Initial findings showed CrossCheck-implicit gives different ranges of scores for audio and visual modalities, at about 0.2 and 0.5 on average, respectively. Thus, only CrossCheck-explicit is adopted for audio-visual inputs.
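The $\min$ combination can be illustrated with a tiny sketch: a statement counts as supported if either modality's evidence supports it. The function name and the example scores are assumptions chosen for illustration.

```python
def audio_visual_score(score_audio, score_visual):
    """Combine per-modality CrossCheckGPT scores by taking the lower (better-supported) one."""
    return min(score_audio, score_visual)


# A sentence about spoken content: well supported by audio evidence (low score),
# poorly supported by visual evidence (high score).
spoken_sentence = audio_visual_score(score_audio=0.1, score_visual=0.8)
```

Taking the minimum avoids penalising a sentence simply because its supporting information lives in the other modality.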

AVHalluBench: To benchmark hallucinations in audio-visual LLMs, we curate AVHalluBench, a dataset containing 175 videos selected from six video understanding datasets covering various styles and elements, with statistics shown in Table 15 in the Appendix. To verify the effectiveness of CrossCheckGPT (and future benchmarking methods), AVHalluBench includes a carefully written set of hallucination-free descriptions for audio and visual contents. After watching each video with audio, the annotators were instructed to write one description focusing on the audio content and one description focusing on the visual content of the video, separately.⁴ To analyze the inter-annotator agreement, we split each description into atomic facts [31] and verify each fact against the descriptions written by the other annotators, categorized as either: Supporting, such that the fact is supported by the other annotator; Contradicting, such that the fact contradicts the information provided by the other annotator; or Neutral, such that the facts neither support nor contradict one another. Both decomposition and verification processes are performed automatically using GPT-4. Of the 39 videos annotated by multiple annotators, there were 471 audio-related facts and 913 visual-related facts, and the agreement between annotators (as counted by Supporting/Neutral/Contradicting) was 64.6%/24.6%/10.8% and 62.0%/29.0%/9.0%, respectively.

⁴ To maximize coverage, initial descriptions were generated using Gemini 1.5 Pro and GPT-4v, prompted to describe all the elements present in the sequence of frames. Note that although these descriptions are not hallucination-free, they have a high level of coverage and subjective details. The annotators were provided with these descriptions in addition to the videos while being instructed to write only objective details of the videos.

5 Experiments

We conduct experiments to validate CrossCheckGPT on MLLMs with three input modalities, including text (§5.1), image (§5.2), and audio-visual (§5.3). During inference, a temperature of 1.0, a beam size of 1 and a top-p of 0.9 are used for all models. SelfCheckGPT [28] is applied as a hallucination ranking baseline for all modalities since it is reference-free and not task-specific.

5.1 Text-to-text Experiments

Experimental Setup: The main text-to-text experiments are performed using the subset of WikiBio data used in [28], which contains 238 biographical passages from Wikipedia. We select 10 open-source LLMs (listed in Appendix Table 7) as target models, 8 of which are used as evidence models. Four models are Llama-2-7B based [45] (e.g. Vicuna-v1.5-7B [6]) and four models are Mistral-7B based [15]. Each evidence model generates 20 stochastic passages. For the LLM judge in CrossCheck-explicit (used to determine whether sentences support one another), Mistral-7B [15] is used as it achieves the best results among all considered open-source LLMs (shown in Appendix Table 10).

To evaluate the general benchmarking ability of ranking methods, 10 benchmark metrics from the hallucinations leaderboard⁵ (shown in Table 8) are selected to provide the overall hallucination ranking of the systems. These metrics are either based on human annotation or gold-standard references, where the overall rankings are obtained by averaging the rankings from each metric.

⁵ https://huggingface.co/spaces/hallucinations-leaderboard/leaderboard

We report the system-level correlation between the hallucination ranking methods and the overall ranking measured by Spearman’s Rank Correlation coefficient (SRC), denoted as System( $\rho$ ). In addition, as WikiBio contains reference texts, the references can be used as evidence texts, which can be considered an idealized fact-checking method. This method is referred to as RefCheck, and CrossCheckGPT and SelfCheckGPT scores are also compared against RefCheck at the document level using Pearson’s Correlation Coefficient (PCC), denoted as Document( $r$ ). Furthermore, to investigate the effectiveness of CrossCheckGPT when the target LLM is much more powerful than the evidence models, we include GPT-4 in addition to the 10 target LLMs.
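For readers unfamiliar with the two correlation measures, the following dependency-free sketch computes PCC directly and SRC as the PCC of rank-transformed scores, with ties sharing an average rank. It is an illustration only; the source does not say which implementation was used, and the toy scores are made up.

```python
def ranks(values):
    """Ranks starting at 1 for the smallest value; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            result[order[k]] = shared
        start = end + 1
    return result


def pearson(xs, ys):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Toy system-level scores for four hypothetical systems from two ranking methods.
method_scores = [0.1, 0.2, 0.3, 0.4]
reference_scores = [0.3, 0.1, 0.2, 0.4]
rho = spearman(method_scores, reference_scores)
```

For the toy scores above, the rank differences are (-2, 1, 1, 0), giving $\rho = 1 - 6\cdot 6/(4\cdot 15) = 0.4$.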

Hallucination Ranking Results: Existing hallucination metrics such as HaluEval-QA accuracy do not correlate well with the overall ranking at the system level. Some metrics have negative correlations while the highest (TruthfulQA MC2) is 57.14% (shown in Table 1, with further pairwise correlations provided in Appendix Table 13). This is likely because each existing metric is typically designed to measure only one aspect related to hallucinations, e.g., probing through question-answering.

Metrics	System( $\rho$ ) (%)	Document( $r$ ) (%), w/o GPT-4	Document( $r$ ) (%), with GPT-4
TruthfulQA MC2 [22]	57.14	-	-
SelfCheckGPT [28]	66.46	74.06	76.08
CrossCheck-implicit	56.71	18.33	17.29
CrossCheck-explicit	77.44	82.28	77.23
CrossCheck-implicit weighted	56.81	20.21	19.16
CrossCheck-explicit weighted	82.32	81.78	82.18

Table 1: General hallucination evaluation where the task for SelfCheckGPT/CrossCheckGPT is open-ended biography generation on WikiBio. System-level correlation, System( $\rho$ ), is measured against the overall ranking of the leaderboard, and document-level correlation, Document( $r$ ), is measured against RefCheck. “With GPT-4” refers to including GPT-4 as a target model. Additional metrics are presented in Table 11 in the Appendix.

Figure 4: Scatter plot of document-level scores for SelfCheckGPT and CrossCheck-explicit against RefCheck for text-to-text experiments.

Subset	Values
Succ. Rate	90%
P-value	4 $\times 10^{-6}$

Table 2: Success rate of CrossCheck outperforming SelfCheck for independent subsets of WikiBio documents. The P-value is measured by the one-tailed sign test with $H_{0}$: CrossCheck is not better than SelfCheck.

CrossCheck-explicit correlates with the overall ranking better than all other methods, with CrossCheck-explicit weighted by model uncertainty achieving the highest correlation, highlighting its effective general hallucination ranking ability. In addition, the document-level correlation plots are shown in Fig. 4, and the sign test on independent subsets in Table 2 shows the statistical significance ( $p=4\times 10^{-6}$ ) of CrossCheckGPT being better than SelfCheckGPT for ranking at the system-level.
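A one-tailed sign test of this kind can be sketched with the binomial tail probability under $H_{0}$ (each subset is a fair coin flip). The number of subsets is not stated in this section, so the counts below (27 wins out of 30) are an illustrative assumption chosen to match a 90% success rate, not figures taken from the paper.

```python
from math import comb


def one_tailed_sign_test(wins, trials):
    """P(at least `wins` successes in `trials` fair coin flips)."""
    return sum(comb(trials, k) for k in range(wins, trials + 1)) / 2 ** trials


# Hypothetical counts: 27 of 30 subsets (90%) favour CrossCheck.
p_value = one_tailed_sign_test(wins=27, trials=30)
```

Under these assumed counts the p-value is $4526/2^{30}\approx 4.2\times 10^{-6}$, the same order of magnitude as the value reported in Table 2.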

5.2 Image-to-text Experiments

We validate CrossCheckGPT for the hallucination ranking of visual LLMs on image-to-text tasks. The experiments are performed on MHaluBench [4], an image-captioning hallucination dataset. Nine visual LLMs are selected as target models, all of which are used to generate evidence passages (see Appendix Table 7 for the list of models). Each evidence model generates ten image descriptions per image. The overall ranking is obtained by averaging the rankings from CHAIR [36] and POPE (MSCOCO subset) [19].⁶ In addition to SelfCheckGPT, UniHD [4] is used as a stronger baseline.

⁶ CHAIR and POPE are the two popular representative metrics for free-form text generation and binary classification hallucination benchmarks respectively [49].

For evaluation, we take a subset of 30 image descriptions generated by each target model (a total of 270 passages with 3237 facts) and annotate each description with a binary label of either hallucinatory or factual. The Cohen’s $\kappa$ between the two annotators is 0.632, indicating substantial agreement. The models are ranked by the average percentage of factual errors produced by each target model, and hallucination ranking performance is measured at the system-level using SRC, denoted System( $\rho$ ) and at the image-level using PCC, denoted as Image( $r$ ).
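Cohen's $\kappa$ corrects the raw agreement rate between two annotators for the agreement expected by chance. A minimal sketch for categorical labels follows; the label sequences are made-up toy data, not the paper's annotations.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Undefined (division by zero) when chance agreement equals 1, e.g. when both
    annotators use a single identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)


# Toy binary labels (1 = hallucinatory, 0 = factual) from two hypothetical annotators.
annotator_a = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_b = [1, 1, 0, 0, 1, 0, 0, 1]
kappa = cohen_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on 6 of 8 items (0.75) against a chance agreement of 0.5, giving $\kappa = 0.5$.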

Metrics	System( $\rho$ ) (%), Overall	System( $\rho$ ) (%), CHAIR	System( $\rho$ ) (%), Human	Image( $r$ ) (%), Human
UniHD [4]	42.02	36.98	33.33	36.70
SelfCheckGPT [28]	43.70	23.10	-10.00	20.93
CrossCheck-implicit	50.42	64.71	98.33	48.72
CrossCheck-explicit	42.86	43.70	75.00	35.16
CrossCheck-implicit weighted	50.42	64.71	98.33	52.83
CrossCheck-explicit weighted	47.06	46.22	73.33	36.98

Table 3: System-level correlation measured by System( $\rho$ ) and image-level correlation measured by Image( $r$ ) for various hallucination evaluation methods on the MHaluBench dataset. System-level correlation is measured against the overall ranking, rankings from CHAIR scores and human annotation.

Hallucination Ranking Results: As before, Table 3 presents the system-level and image-level correlations against overall rankings and rankings derived from human annotations. Both variants of CrossCheckGPT outperform SelfCheckGPT and UniHD, with CrossCheck-implicit weighted performing best out of all methods, achieving a 98.33% correlation with the rankings from human annotations. Equivalent statistical significance analysis and scatter plots are shown in Table 14 and Fig. 7 in Appendix F, respectively.

5.3 Video-to-text Experiments

Next, we apply CrossCheckGPT to AVHalluBench to investigate hallucination ranking in audio-visual LLMs. We consider 7 models that can handle video inputs and 6 models that can handle audio inputs. Three models, FAVOR [40], Video-LLaMA [54], and Gemini 1.5 Pro [42], are in the intersection of the two sets, and can handle audio-visual inputs. When ranking hallucinations for visual description, we consider audio-visual LLMs with visual-only inputs and audio-visual inputs as separate systems, and hence, there are $7\!+\!3\!=\!10$ target models for ranking. We conduct a similar ranking scheme for audio descriptions, where there are $6\!+\!3\!=\!9$ target models. All the target models are also used as evidence models in CrossCheck-explicit,⁷ and each model generates ten evidence passages. When using audio-visual LLMs as evidence models, audio-visual inputs are given to obtain the visual or audio descriptions as evidence. As only 5 target models can handle speech inputs, we further make a dedicated ranking only for these models with prompts explicitly asking for speech description.

⁷ Gemini 1.5 Pro is not used for CrossCheck-implicit due to the number of request limitations.

Metrics	Visual Description: System( $\rho$ ) (%)	Visual Description: Video( $r$ ) (%)	Audio Description: System( $\rho$ ) (%)	Audio Description: Video( $r$ ) (%) (w. speech)
SelfCheckGPT	86.67	65.77	60.00	51.13 (44.55)
CrossCheck-implicit weighted	54.29	30.73	40.00	2.15 (16.20)
CrossCheck-explicit weighted	89.09	78.58	71.67	68.10 (47.60)

Table 4: System-level and video-level correlations of SelfCheckGPT and CrossCheckGPT against RefCheck using manual descriptions in AVHalluBench. The weighted version of CrossCheckGPT is used with $C=0.1$. Ranking correlations for systems that handle speech are in brackets.

Hallucination Ranking Results: First, system-level and video-level correlations are shown in Table 4, measured by System( $\rho$ ) and Video( $r$ ). CrossCheck-explicit correlates with RefCheck best, with an 89.09% System( $\rho$ ) for the visual description. Similar to the text-to-text results, we observe that CrossCheck-explicit performs better than CrossCheck-implicit. For both text-to-text and video-to-text experiments, this is likely due to the high diversity in the evidence passages as indicated by high raw SelfCheckGPT scores, which we discuss further in Section 5.4.

Impact of Audio-Visual Inputs: As supporting information from another modality is expected to reduce hallucination, this section investigates whether audio-visual inputs reduce the raw hallucination scores compared to the scores when a single modality is used. Table 5 presents the average raw hallucination scores (rather than correlations), for three MLLMs that can take audio-visual inputs.

Model	Input modality	Visual Description (%): $\mathcal{S}_{\text{selfcheck}}$ $\downarrow$	Visual Description (%): $\mathcal{C}_{\text{explicit}}$ $\downarrow$	Audio Description (%): $\mathcal{S}_{\text{selfcheck}}$ $\downarrow$	Audio Description (%): $\mathcal{C}_{\text{explicit}}$ $\downarrow$
FAVOR [40]	Visual	60.67	53.85	—	—
	Audio	—	—	49.62	66.69
	Audio-Visual	56.42	49.60	33.25	35.20
Video-LLaMA [54]	Visual	41.14	52.02	—	—
	Audio	—	—	56.42	68.05
	Audio-Visual	47.73	49.13	70.23	41.25
Gemini 1.5 Pro [42]	Visual	19.87	31.74	—	—
	Audio	—	—	25.82	34.66
	Audio-Visual	12.77	23.27	48.51	28.79

Table 5: SelfCheckGPT scores ( $\mathcal{S}_{\text{selfcheck}}$ ) and weighted CrossCheck-explicit scores ( $\mathcal{C}_{\text{explicit}}$ ) on AVHalluBench for audio-visual LLMs. Calibration temperature $T=0.1$ is used here.

As expected, the raw CrossCheckGPT scores show that audio-visual inputs reduce hallucination rates. While Gemini 1.5 Pro achieved the best scores, it can be more susceptible to hallucination when silent videos are used as inputs, as it often fabricates its audio descriptions. Moreover, except for Gemini 1.5 Pro, when audio-visual inputs are used the reduction in hallucination scores is larger for audio description tasks than for visual description tasks. This likely occurs because, for audio description tasks, visual information often provides useful information on the source of the sound, which can significantly reduce the uncertainty about the sound. For visual description tasks, while particular audio cues (especially from speech) can provide useful information, misleading or unrelated sounds may cause additional hallucinations. For example, in Fig. 10, where there is a self-playing piano, audio inputs can mislead a model to believe that the piano is played by an individual. Further examples are presented in Appendix H, with the raw hallucination scores for audio-only and visual-only inputs shown in Tables 16 and 17 in the Appendix.

5.4 CrossCheck-explicit vs. CrossCheck-implicit

While CrossCheck-implicit is more sample-efficient than CrossCheck-explicit and only requires generating the error analysis once, the performance of CrossCheck-implicit can be highly dependent on the task. For the text-to-text and video-to-text experiments, CrossCheck-implicit performs worse than CrossCheck-explicit, as opposed to the findings in the image-to-text experiments. We hypothesize that for challenging and open-ended tasks, CrossCheck-explicit is preferred as it can better cover the output space by disentangling the evidence generation and verification tasks, yielding more calibrated uncertainty measures. However, in other circumstances, CrossCheck-implicit may help the model focus on specific aspects of the input and yield more accurate rankings. For challenging and open-ended tasks with diverse outputs, the raw SelfCheckGPT scores are expected to be high and therefore can be used as a proxy to determine which consistency measure to select. For example, the average SelfCheckGPT score across models is 40.63% for text-to-text, which is much higher than 17.16% for image-to-text. We recommend using CrossCheck-explicit when the SelfCheckGPT scores are high, and CrossCheck-implicit when they are sufficiently low, which is demonstrated to be a reasonable rule, illustrated by the results in Appendix Table 18.
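The selection rule can be written as a simple threshold on the average SelfCheckGPT score. The source gives no numeric cut-off, so the threshold is left as a required argument, and the value 0.30 in the example is an arbitrary assumption lying between the two reported averages (40.63% and 17.16%).

```python
def choose_consistency_measure(mean_selfcheck_score, threshold):
    """Pick CrossCheck-explicit for diverse outputs (high score), otherwise CrossCheck-implicit.

    Scores and threshold are on a 0-1 scale; the threshold is a user-chosen
    assumption, not a value given in the source.
    """
    return "explicit" if mean_selfcheck_score >= threshold else "implicit"


# Reported average SelfCheckGPT scores, with an assumed threshold of 0.30.
text_choice = choose_consistency_measure(0.4063, threshold=0.30)
image_choice = choose_consistency_measure(0.1716, threshold=0.30)
```

With this assumed threshold, the rule selects CrossCheck-explicit for text-to-text and CrossCheck-implicit for image-to-text, matching the better-performing measure in each experiment.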

5.5 Ablation Studies

Self-Bias: LLMs are known to have self-preferential bias [2, 55] and may prefer outputs from similar models. Therefore, evidence models that share a base model with the target may provide inflated CrossCheckGPT scores. The results in Table 6 show that self-bias is an issue: for example, when only using Llama-2-based evidence models, the outputs from Vicuna get a lower hallucination score, whereas when only using Mistral-based evidence models, Mistral has the lowest hallucination score, resulting in contradictory conclusions. This bias can be mitigated by using a wide range of evidence models, as is done for the CrossCheckGPT scores reported here, yielding more reliable evaluation with strong correlations.

Evidence Models	System( $\rho$ )	Document( $r$ )	Vicuna $\mathcal{C}_{\text{explicit}}$	Mistral $\mathcal{C}_{\text{explicit}}$
Llama-2-based models only	55.49%	81.10%	42.94%	45.68%
Mistral-based models only	81.71%	81.06%	44.98%	41.81%
All models	82.32%	82.28%	44.82%	44.93%

Table 6: The mitigation of self-bias in CrossCheckGPT scores and its influence, measured by system-level and document-level correlations and by the CrossCheck-explicit scores of Vicuna and Mistral on WikiBio. There are 4 Llama-2-based models and 4 Mistral-based models in the set of evidence models.
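The reversal described above can be read directly off Table 6. The following minimal sketch copies the two CrossCheck-explicit columns from the table and reports which target model each evidence pool favours; the helper name `preferred_model` is ours and purely illustrative.

```python
# CrossCheck-explicit scores (%) copied from Table 6; lower means fewer hallucinations.
table6 = {
    "Llama-2-based models only": {"Vicuna": 42.94, "Mistral": 45.68},
    "Mistral-based models only": {"Vicuna": 44.98, "Mistral": 41.81},
    "All models": {"Vicuna": 44.82, "Mistral": 44.93},
}


def preferred_model(scores):
    """Return the target model with the lowest hallucination score (illustrative helper)."""
    return min(scores, key=scores.get)


winners = {pool: preferred_model(scores) for pool, scores in table6.items()}
gaps = {pool: abs(s["Vicuna"] - s["Mistral"]) for pool, s in table6.items()}

for pool in table6:
    print(f"{pool}: favours {winners[pool]} (gap {gaps[pool]:.2f} points)")
```

Each single-family pool favours the target model from its own family by roughly three points, while the gap nearly vanishes when all evidence models are pooled.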

Robustness to Manipulation: To investigate whether a ranking method can be easily manipulated, we examine the influence of the generation temperature (which can be selected for any model). The results in Fig. 5 show that increasing the temperature of the target model from 0.5 to 1.5 raises SelfCheckGPT scores by as much as 35%, drastically influencing the rankings. In contrast, CrossCheckGPT provides more stable rankings for all generation temperatures. Results are shown for MHaluBench; similar trends are observed for WikiBio.
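One plausible reading of the temperature effect is that a higher temperature flattens the sampling distribution, so a model's own samples agree with each other less often. The sketch below illustrates only this standard property of temperature-scaled softmax; the logits are hypothetical toy values and nothing here reproduces the experiment in Fig. 5.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher entropy means more diverse samples."""
    return -sum(p * math.log(p) for p in probs if p > 0)


toy_logits = [4.0, 2.0, 1.0, 0.5]  # hypothetical next-token logits
entropies = {
    t: entropy(softmax_with_temperature(toy_logits, t)) for t in (0.5, 1.0, 1.5)
}
for t, h in entropies.items():
    print(f"T={t}: entropy={h:.3f} nats")
```

Entropy grows with the temperature, which is in line with self-consistency scores being sensitive to a setting that the model owner controls.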

6 Conclusions

This paper proposes CrossCheckGPT, a universal hallucination ranking method for multimodal large language models. We evaluated two variants of CrossCheckGPT on text-to-text, image-to-text and video-to-text tasks, demonstrating that it consistently outperforms all baseline methods, achieving 98% and 89% system-level correlation against humans on MHaluBench and AVHalluBench respectively. We also introduce AVHalluBench, the first resource to study audio-visual hallucination issues in video understanding.

Acknowledgments

This work is supported by Cambridge University Press & Assessment (CUP&A), a department of The Chancellor, Masters, and Scholars of the University of Cambridge.

References

Almazrouei et al. [2023] E. Almazrouei, H. Alobeidli, A. Alshamsi, A. Cappelli, R. Cojocaru, M. Debbah, Étienne Goffinet, D. Hesslow, J. Launay, Q. Malartic, D. Mazzotta, B. Noune, B. Pannier, and G. Penedo. The falcon series of open language models. arXiv:2311.16867, 2023.
Brown [1986] J. D. Brown. Evaluations of self and others: Self-enhancement biases in social judgments. Social cognition, 4(4):353–376, 1986.
Chen et al. [2023] S. Chen, X. He, L. Guo, X. Zhu, W. Wang, J. Tang, and J. Liu. Valor: Vision-audio-language omni-perception pretraining model and dataset. arXiv:2304.08345, 2023.
Chen et al. [2024a] X. Chen, C. Wang, Y. Xue, N. Zhang, X. Yang, Q. Li, Y. Shen, L. Liang, J. Gu, and H. Chen. Unified hallucination detection for multimodal large language models. arXiv:2402.03190, 2024a.
Chen et al. [2024b] Z. Chen, H. Liu, W. Yu, G. Sun, H. Liu, J. Wu, C. Zhang, Y. Wang, and Y. Wang. M³av: A multimodal, multigenre, and multipurpose audio-visual academic lecture dataset. arXiv:2403.14168, 2024b.
Chiang et al. [2023] W.-L. Chiang, Z. Li, Z. Lin, Y. Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y. Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90% chatgpt quality, 2023.
Chu et al. [2023] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.
Dai et al. [2023] W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, B. Li, P. Fung, and S. Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=vvoWPYqZJA.
Dinan et al. [2019] E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston. Wizard of wikipedia: Knowledge-powered conversational agents. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1l73iRqKm.
Dziri et al. [2022] N. Dziri, E. Kamalloo, S. Milton, O. Zaiane, M. Yu, E. Ponti, and S. Reddy. Faithdial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10:1473–1490, 2022.
Feng et al. [2023] S. Feng, V. Balachandran, Y. Bai, and Y. Tsvetkov. FactKB: Generalizable factuality evaluation using language models enhanced with factual knowledge. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 933–952, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.59. URL https://aclanthology.org/2023.emnlp-main.59.
Gong et al. [2024] Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. R. Glass. Listen, think, and understand. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=nBZBPXdJlC.
Guan et al. [2024] T. Guan, F. Liu, X. Wu, R. Xian, Z. Li, X. Liu, X. Wang, L. Chen, F. Huang, Y. Yacoob, D. Manocha, and T. Zhou. Hallusionbench: An advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models. In CVPR, 2024.
Han et al. [2019] M. Han, M. Kang, H. Jung, and S. J. Hwang. Episodic memory reader: Learning what to remember for question answering from streaming data. In A. Korhonen, D. Traum, and L. Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4407–4417, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1434. URL https://aclanthology.org/P19-1434.
Jiang et al. [2023] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed. Mistral 7b. arXiv:2310.06825, 2023.
Jin et al. [2024] P. Jin, R. Takanobu, C. Zhang, X. Cao, and L. Yuan. Chat-univi: Unified visual representation empowers large language models with image and video understanding. In CVPR, 2024.
Li et al. [2022] G. Li, Y. Wei, Y. Tian, C. Xu, J.-R. Wen, and D. Hu. Learning to answer questions in dynamic audio-visual scenarios. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
Li et al. [2023a] J. Li, X. Cheng, X. Zhao, J.-Y. Nie, and J.-R. Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, Singapore, Dec. 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.397. URL https://aclanthology.org/2023.emnlp-main.397.
Li et al. [2023b] Y. Li, Y. Du, K. Zhou, J. Wang, X. Zhao, and J.-R. Wen. Evaluating object hallucination in large vision-language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, Singapore, Dec. 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.20. URL https://aclanthology.org/2023.emnlp-main.20.
Li et al. [2023c] Y. Li, C. Wang, and J. Jia. Llama-vid: An image is worth 2 tokens in large language models. arXiv:2311.17043, 2023c.
Lin et al. [2023] B. Lin, B. Zhu, Y. Ye, M. Ning, P. Jin, and L. Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv:2311.10122, 2023.
Lin et al. [2022] S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human falsehoods. In S. Muresan, P. Nakov, and A. Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
Liu et al. [2023] H. Liu, C. Li, Q. Wu, and Y. J. Lee. Visual instruction tuning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=w0H2xGHlkw.
Liu et al. [2024] H. Liu, W. Xue, Y. Chen, D. Chen, X. Zhao, K. Wang, L. Hou, R. Li, and W. Peng. A survey on hallucination in large vision-language models. arXiv:2402.00253, 2024.
Luo et al. [2023] R. Luo, Z. Zhao, M. Yang, J. Dong, M. Qiu, P. Lu, T. Wang, and Z. Wei. Valley: Video assistant with large language model enhanced ability. arXiv:2306.07207, 2023.
[26] D. Mahan, R. Carlow, L. Castricato, N. Cooper, and C. Laforte. Stable beluga models. URL https://huggingface.co/stabilityai/StableBeluga2.
Manakul et al. [2023a] P. Manakul, A. Liusie, and M. Gales. MQAG: Multiple-choice question answering and generation for assessing information consistency in summarization. In J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, and A. A. Krisnadhi, editors, Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 39–53, Nusa Dua, Bali, Nov. 2023a. Association for Computational Linguistics. doi: 10.18653/v1/2023.ijcnlp-main.4. URL https://aclanthology.org/2023.ijcnlp-main.4.
Manakul et al. [2023b] P. Manakul, A. Liusie, and M. Gales. SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9004–9017, Singapore, Dec. 2023b. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.557. URL https://aclanthology.org/2023.emnlp-main.557.
Maynez et al. [2020] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald. On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.173. URL https://aclanthology.org/2020.acl-main.173.
McKenzie et al. [2023] I. R. McKenzie, A. Lyzhov, M. Pieler, A. Parrish, A. Mueller, A. Prabhu, E. McLean, A. Kirtland, A. Ross, A. Liu, A. Gritsevskiy, D. Wurgaft, D. Kauffman, G. Recchia, J. Liu, J. Cavanagh, M. Weiss, S. Huang, T. F. Droid, T. Tseng, T. Korbak, X. Shen, Y. Zhang, Z. Zhou, N. Kim, S. R. Bowman, and E. Perez. Inverse scaling: When bigger isn’t better. TMLR, 2023.
Min et al. [2023] S. Min, K. Krishna, X. Lyu, M. Lewis, W.-t. Yih, P. Koh, M. Iyyer, L. Zettlemoyer, and H. Hajishirzi. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In H. Bouamor, J. Pino, and K. Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.741. URL https://aclanthology.org/2023.emnlp-main.741.
Mukherjee et al. [2023] S. Mukherjee, A. Mitra, G. Jawahar, S. Agarwal, H. Palangi, and A. Awadallah. Orca: Progressive learning from complex explanation traces of gpt-4. arXiv:2306.02707, 2023.
Nahar et al. [2024] M. Nahar, H. Seo, E.-J. Lee, A. Xiong, and D. Lee. Fakes of varying shades: How warning affects human perception and engagement regarding llm hallucinations. arXiv:2404.03745, 2024.
Narayan et al. [2018] S. Narayan, S. B. Cohen, and M. Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1797–1807, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1206. URL https://aclanthology.org/D18-1206.
OpenAI [2023] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
Rohrbach et al. [2018] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In E. Riloff, D. Chiang, J. Hockenmaier, and J. Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045, Brussels, Belgium, Oct.-Nov. 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1437. URL https://aclanthology.org/D18-1437.
Sanabria et al. [2018] R. Sanabria, O. Caglayan, S. Palaskar, D. Elliott, L. Barrault, L. Specia, and F. Metze. How2: A large-scale dataset for multimodal language understanding. In Proc. ViGIL, 2018.
See et al. [2017] A. See, P. J. Liu, and C. D. Manning. Get to the point: Summarization with pointer-generator networks. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1073–1083, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1099. URL https://aclanthology.org/P17-1099.
Shen et al. [2023] X. Shen, D. Li, J. Zhou, Z. Qin, B. He, X. Han, A. Li, Y. Dai, L. Kong, M. Wang, Y. Qiao, and Y. Zhong. Favdbench: Fine-grained audible video description. In Proc. CVPR, 2023.
Sun et al. [2023] G. Sun, W. Yu, C. Tang, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. Fine-grained audio-visual joint representations for multimodal large language models. arXiv:2310.05863, 2023.
Tang et al. [2024] C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=14rn7HpKVk.
Team [2024] G. Team. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.
Thorne et al. [2018] J. Thorne, A. Vlachos, C. Christodoulopoulos, and A. Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074.
Tonmoy et al. [2024] S. M. T. I. Tonmoy, S. M. M. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and A. Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv:2401.01313, 2024.
Touvron et al. [2023] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom. Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288, 2023.
Tunstall et al. [2023] L. Tunstall, E. Beeching, N. Lambert, N. Rajani, K. Rasul, Y. Belkada, S. Huang, L. von Werra, C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero, A. M. Rush, and T. Wolf. Zephyr: Direct distillation of lm alignment. arXiv:2310.16944, 2023.
Verga et al. [2024] P. Verga, S. Hofstatter, S. Althammer, Y. Su, A. Piktus, A. Arkhangorodsky, M. Xu, N. White, and P. Lewis. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv preprint arXiv:2404.18796, 2024.
Wang et al. [2023a] C. Wang, X. Liu, Y. Yue, X. Tang, T. Zhang, C. Jiayang, Y. Yao, W. Gao, X. Hu, Z. Qi, Y. Wang, L. Yang, J. Wang, X. Xie, Z. Zhang, and Y. Zhang. Survey on factuality in large language models: Knowledge, retrieval and domain-specificity. arXiv:2310.07521, 2023a.
Wang et al. [2023b] J. Wang, Y. Wang, G. Xu, J. Zhang, Y. Gu, H. Jia, M. Yan, J. Zhang, and J. Sang. An llm-free multi-dimensional benchmark for mllms hallucination evaluation. arXiv preprint arXiv:2311.07397, 2023b.
Wei et al. [2024] J. Wei, Y. Yao, J.-F. Ton, H. Guo, A. Estornell, and Y. Liu. Measuring and reducing llm hallucination without gold-standard answers via expertise-weighting. arXiv:2402.10412, 2024.
Xiao et al. [2021] J. Xiao, X. Shang, A. Yao, and T.-S. Chua. NExT-QA: Next phase of question-answering to explaining temporal actions. In Proc. CVPR, 2021.
Yang et al. [2023] X. Yang, L. Pan, X. Zhao, H. Chen, L. Petzold, W. Y. Wang, and W. Cheng. A survey on detection of llms-generated content. arXiv:2310.15654, 2023.
Ye et al. [2023] Q. Ye, H. Xu, J. Ye, M. Yan, A. Hu, H. Liu, Q. Qian, J. Zhang, F. Huang, and J. Zhou. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration. arXiv:2311.04257, 2023.
Zhang et al. [2023] H. Zhang, X. Li, and L. Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. In Y. Feng and E. Lefever, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-demo.49. URL https://aclanthology.org/2023.emnlp-demo.49.
Zheng et al. [2023] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 46595–46623. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/91f18a1287b398d378ef22505bf41832-Paper-Datasets_and_Benchmarks.pdf.
Zhou et al. [2024] Y. Zhou, C. Cui, J. Yoon, L. Zhang, Z. Deng, C. Finn, M. Bansal, and H. Yao. Analyzing and mitigating object hallucination in large vision-language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=oZDJKTlOUe.
Zhu et al. [2023] B. Zhu, E. Frick, T. Wu, H. Zhu, and J. Jiao. Starling-7b: Improving llm helpfulness & harmlessness with rlaif, November 2023.

Appendix A Experimental Setup Details

We list the models involved in this paper in Table 7, and text-to-text metrics in Table 8.

Target LLMs	Modality	Evidence Model (explicit)	Evidence Model (implicit)	License
Llama-2-7B [45]	Text	✓	✓	llama2
Llama-2-7B-Chat [45]	Text	✓	✓	llama2
Mistral-7B-Instruct-v0.1 [15]	Text	✗	✗	Apache-2.0
Mistral-7B-Instruct-v0.2 [15]	Text	✓	✓	Apache-2.0
Vicuna-v1.5-7B[6]	Text	✓	✓	llama2
Falcon-7B[1]	Text	✗	✗	Apache-2.0
Starling-7B-alpha[57]	Text	✓	✓	Apache-2.0
StableBeluga-7B[26]	Text	✓	✓	llama2
Zephyr-7b-beta[46]	Text	✓	✓	MIT
Mistral-7B-OpenOrca[32]	Text	✓	✓	Apache-2.0
GPT-4 [35]	Text	✗	✗	N/A
LLaVA-v1.5 [23]	Vision	✓	✓	llama2
InstructBLIP (vicuna-7B) [8]	Vision	✓	✗	BSD 3-Clause
mPLUG-Owl2 [53]	Vision	✓	✓	MIT
Valley [25]	Vision	✓	✓	Apache-2.0
Video-LLaVA [21]	Vision	✓	✓	Apache-2.0
Chat-Univi [16]	Vision	✓	✓	Apache-2.0
LLaMA-VID [20]	Vision	✓	✗	Apache-2.0
LTU [12]	Audio	✓	✓	Apache-2.0
Qwen-Audio-Chat [7]	Audio	✓	✓	Tongyi Qianwen
SALMONN [41]	Audio	✓	✓	Apache-2.0
Video-LLaMA [54]	Audio-visual	✓	✓	BSD 3-Clause
FAVOR [40]	Audio-visual	✓	✓	Apache-2.0
Gemini 1.5 Pro [42]	Audio-visual	✓	✗	N/A

Table 7: Models used for validating CrossCheckGPT, with their modality, their use as evidence models for CrossCheck-explicit and CrossCheck-implicit, and their license.

Reference Benchmarks (Metrics)	Description
TriviaQA [14] (Acc)	A realistic text-based question-answering dataset containing documents collected from Wikipedia and the web.
TruthfulQA MC1 and MC2 [22] (Acc)	A benchmark to measure whether a language model is truthful in generating answers to questions, spanning 38 categories.
XSum [34] (FactKB [11])	Measures the factual accuracy of summarization models by verifying the presence of knowledge-base facts in generated summaries.
CNN-DM [38] (BERTP)	The CNN-DailyMail dataset is a collection of news articles and accompanying summaries measured by BERTScore-Precision.
MemoTrap [30] (Acc)	Assessing whether LLMs fall into memorization traps which occur when LLMs memorize specific examples in training.
FaithDial [10] (Acc)	A benchmark for hallucination-free dialogues by editing hallucinated responses in Wizard of Wikipedia (WoW) [9].
HaluEval-QA, HaluEval-Summarization and HaluEval-Dialogue [18] (Acc)	A large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. It contains the QA, summarization and dialogue tasks.

Table 8: Reference benchmarks and metrics for validating CrossCheckGPT on text-to-text tasks. Acc stands for accuracy and BERTP for BERTScore-Precision.

Appendix B Exact Prompts

We provide the exact prompts we used in our experiments in Table 9 for various tasks.

Task	Prompt
Text-to-text generation	Generate a passage about <name>.
Image-to-text description	Describe the image in one paragraph.
Visual description for video	Describe the video in one paragraph.
Audio description for video	Describe the audio in one paragraph.
Prompt for speech content	What does the man/woman say in the video?
LLM Judgment for CrossCheck-explicit	Context: <evidence_passage>\n\nSentence: <sentence>\n\nIs the sentence supported by the context above? Answer Yes or No.\n\nAnswer:
CrossCheck-implicit factual errors	You are given the following sentence about <name/image/video> that might be inaccurate:\n<sentence>\n List possible inaccurate information in this sentence.
LLM Judgment for CrossCheck-implicit	You are given the following sentence about <name/image/video>:\n<sentence>\nThe following is an analysis of possible inaccuracies in this sentence:\n<list_of_possible_errors>\nBased on the analysis, determine if the sentence contains any inaccurate information. Answer Yes or No.\n\nAnswer:

Table 9: Exact prompts used for different tasks.
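To make the use of the CrossCheck-explicit judgment prompt concrete, the following is a minimal sketch of how one sentence could be scored against several evidence passages. Only the prompt text comes from Table 9. The answer-to-score mapping (Yes → 0, No → 1, anything else → 0.5), the plain averaging, and the names `sentence_score` and `toy_judge` are our assumptions for illustration, and the judge is a string-matching stand-in rather than an LLM.

```python
EXPLICIT_JUDGE_PROMPT = (
    "Context: {evidence_passage}\n\n"
    "Sentence: {sentence}\n\n"
    "Is the sentence supported by the context above? Answer Yes or No.\n\n"
    "Answer:"
)

# Assumed mapping from the judge's answer to a per-sentence hallucination score.
ANSWER_TO_SCORE = {"yes": 0.0, "no": 1.0}


def sentence_score(sentence, evidence_passages, judge):
    """Average the judge's verdicts for one sentence over all evidence passages."""
    scores = []
    for passage in evidence_passages:
        prompt = EXPLICIT_JUDGE_PROMPT.format(
            evidence_passage=passage, sentence=sentence
        )
        answer = judge(prompt).strip().lower()
        first_word = answer.split()[0].strip(".,!") if answer else ""
        # Unparseable answers fall back to 0.5 (an assumption, not from the paper).
        scores.append(ANSWER_TO_SCORE.get(first_word, 0.5))
    return sum(scores) / len(scores)


def toy_judge(prompt):
    """Hypothetical stand-in for an LLM judge: 'Yes' if the sentence appears verbatim."""
    context = prompt.split("Context: ", 1)[1].split("\n\nSentence: ", 1)[0]
    sentence = prompt.split("\n\nSentence: ", 1)[1].split("\n\nIs the sentence", 1)[0]
    return "Yes" if sentence in context else "No"


evidence = [
    "Ada Lovelace was a mathematician. She worked with Charles Babbage.",
    "Ada Lovelace was a mathematician. She was born in London.",
]
supported = sentence_score("Ada Lovelace was a mathematician.", evidence, toy_judge)
unsupported = sentence_score("She was born in Paris.", evidence, toy_judge)
partial = sentence_score("She was born in London.", evidence, toy_judge)
print(supported, partial, unsupported)
```

In practice `judge` would wrap a call to an instruction-tuned LLM, and the evidence passages would come from different evidence models rather than being written by hand.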

Appendix C CrossCheckGPT as a Hallucination Detection Method

CrossCheckGPT can also be used as a hallucination detection method. As shown in Table 10, it performs better than the best output-probability-based method reported in SelfCheckGPT [28].

Evidence Model	Non-Factual	Non-Factual*	Factual	Document ( $r$ )
Llama 30B Max( $\mathcal{H}$ ) [28]	80.92	37.32	37.90	35.57
Llama-2-7B-Chat	85.84	57.22	54.41	56.25
Vicuna-v1.5-7B	83.13	53.38	51.13	54.64
Mistral-7B-Instruct-v0.2	87.21	59.60	56.72	63.04

Table 10: AUC-PR and document-level correlation against human annotation for detecting hallucinations in GPT-3 using individual evidence models on non-factual and factual statements in WikiBio [28].

Appendix D Text-to-text Additional Results

We provide the version of Table 1 with all ten benchmark metrics in Table 11. Moreover, we investigate the task-specific hallucination ranking ability, where the inputs to SelfCheckGPT and CrossCheckGPT are from a specific task (rather than text generation). We conduct task-specific experiments using the inputs from TruthfulQA MC1 and HaluEval QA, which contain multiple-choice and yes-no questions respectively. The results in Table 12 show high system-level correlations and moderate document-level correlations, indicating that CrossCheckGPT can operate as a task-specific metric without requiring any ground truth.

Metrics	System( $\rho$ )	Document ( $r$ ) w/o GPT4	Document ( $r$ ) with GPT4
TriviaQA [14]	23.33	-	-
TruthfulQA MC1 [22]	52.94	-	-
TruthfulQA MC2 [22]	57.14	-	-
XSum [34]	-70.00	-	-
CNNDM [38]	38.33	-	-
MemoTrap [30]	10.88	-	-
FaithDial [10]	-8.33	-	-
HaluEval-QA [18]	-18.33	-	-
HaluEval-Summarization [18]	48.33	-	-
HaluEval-Dialogue [18]	46.03	-	-
SelfCheckGPT [28]	66.46	74.06	76.08
CrossCheck-explicit	77.44	82.28	77.23
CrossCheck-implicit	56.71	18.33	17.29
CrossCheck-explicit weighted	82.32	81.78	82.18
CrossCheck-implicit weighted	56.81	20.21	19.16

Table 11: Full version of Table 1 including all other metrics. General hallucination evaluation where the task for SelfCheckGPT/CrossCheckGPT is open-ended text generation on WikiBio. System-level correlation, System( $\rho$ ), is measured against the overall ranking in the leaderboard, and document-level correlation, Document( $r$ ), is measured against RefCheck. With GPT-4 refers to including GPT-4 as the target LLM.

Metrics	System( $\rho$ ) TruthfulQA MC1	System( $\rho$ ) HaluEval QA	Document ( $r$ ) TruthfulQA MC1	Document ( $r$ ) HaluEval QA
SelfCheckGPT	76.19	30.95	30.87	6.76
CrossCheckGPT	76.19	88.10	33.68	22.00

Table 12: Task-specific hallucination evaluation where the task of SelfCheckGPT/CrossCheckGPT is, in this example, either TruthfulQA MC1 or HaluEval QA. Note that rankings are performed on 8 target models that are instruction-tuned as these tasks are QA-based and require some instruction-following ability.

Fig. 6 shows how the system-level and document-level correlations of CrossCheck-explicit weighted vary with the calibration temperature on WikiBio data, together with a comparison between per-query weights and a single set of weights shared across the entire task. $C=0.1$ is chosen because it achieves the best system-level correlation. The same weights are used across the whole task at $C=0.1$, because the large variance among the weights of different queries introduces more noise into the scoring and hence hinders the correlation.

Appendix E System-level Correlations between Individual Text-based Hallucination Benchmarks

Table 13 reports the system-level correlations between individual text-based hallucination benchmarks, showing that they capture different aspects and do not correlate well with each other.

	TriviaQA	TruthfulQA	Xsum	CNN-DM	MemoTrap	FaithDial	HaluQA	HaluSumm	HaluDial
TriviaQA [14]	1.00	0.20	-0.72	0.15	0.07	0.13	0.27	0.40	0.50
TruthfulQA [22]	0.20	1.00	-0.10	0.38	0.27	0.05	-0.50	0.37	0.63
Xsum [34]	-0.72	-0.10	1.00	-0.03	-0.40	0.12	-0.57	-0.63	-0.68
CNN-DM [38]	0.15	0.38	-0.03	1.00	0.28	-0.05	-0.05	0.33	0.37
MemoTrap [30]	0.07	0.27	-0.40	0.28	1.00	-0.05	-0.08	0.48	0.17
FaithDial [10]	0.13	0.05	0.12	-0.05	-0.05	1.00	-0.03	-0.22	-0.13
HaluQA [18]	0.27	-0.50	-0.57	-0.05	-0.08	-0.03	1.00	0.30	0.20
HaluSumm [18]	0.40	0.37	-0.63	0.33	0.48	-0.22	0.30	1.00	0.67
HaluDial [18]	0.50	0.63	-0.68	0.37	0.17	-0.13	0.20	0.67	1.00

Table 13: System-level correlation ( $\rho$ ) between each pair of the 9 selected benchmark metrics.
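For readers who want to reproduce correlations of this kind, the sketch below computes a rank correlation from two lists of per-system scores. We assume here that $\rho$ denotes Spearman's rank correlation and $r$ Pearson's correlation, as the notation suggests. The input values are hypothetical and are not taken from any table in this paper.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores of five systems under two benchmarks.
benchmark_a = [1.0, 2.0, 3.0, 4.0, 5.0]
benchmark_b = [2.0, 1.0, 4.0, 3.0, 5.0]
rho = spearman(benchmark_a, benchmark_b)
print(f"Spearman rho = {rho:.2f}")
```

When one metric is "higher is better" (e.g., accuracy) and the other is "lower is better" (e.g., a hallucination score), one of them should be negated before the comparison so that the sign of the correlation is interpretable.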

Appendix F Scatter Plots and Statistical Significance for Image-to-text

The scatter plot, similar to text-to-text ones in Fig. 4, is shown in Fig. 7.

Methods	Success rate (p-value)
CrossCheck-explicit	65.5% ( $<$ 0.00001)
CrossCheck-implicit	84.5% ( $<$ 0.00001)
CrossCheck-explicit weighted	67.0% ( $<$ 0.00001)
CrossCheck-implicit weighted	88.0% ( $<$ 0.00001)

Table 14: Success rate and statistical significance of CrossCheckGPT approaches measured via sign-test on independent subsets of images.

Additionally, Table 14 reports the statistical significance of CrossCheckGPT being better than SelfCheckGPT on MHaluBench, obtained by performing the sign test at the image level.
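As a reminder of how a sign test yields a p-value from a success rate, the sketch below computes the exact one-sided binomial tail under the null hypothesis that either method wins a comparison with probability 0.5. The number of comparisons (200) is a hypothetical value chosen for illustration, since it is not stated here, and the one-sided form is likewise our assumption.

```python
from math import comb


def sign_test_p_value(wins, trials):
    """One-sided exact sign test: P(X >= wins) for X ~ Binomial(trials, 0.5)."""
    tail = sum(comb(trials, k) for k in range(wins, trials + 1))
    return tail / 2 ** trials


# Hypothetical setting: 200 comparisons, of which 169 (84.5%) favour CrossCheckGPT.
trials = 200
wins = 169
p_value = sign_test_p_value(wins, trials)
print(f"success rate = {wins / trials:.1%}, one-sided p-value = {p_value:.3g}")
```

Ties, if any, are conventionally dropped before counting wins and trials.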

Appendix G Statistics of AVHalluBench

We provide detailed statistics about AVHalluBench in Table 15, including the number of videos and average length of each subset, as well as the various audio and visual elements involved.

Source Dataset	Num. of Videos	Avg. Length (sec.)	w/ Speech	w/ Music	w/ Visual Text
NeXT-QA [51]	32 (18%)	22.0	19	7	1
M3AV [5]	27 (16%)	11.3	27	0	27
How2 [37]	27 (16%)	9.5	27	4	2
MUSIC-AVQA [17]	23 (13%)	29.0	0	23	0
VALOR32k [3]	26 (15%)	8.7	11	7	8
FAVDBench [39]	38 (22%)	8.0	8	15	13
Overall	175	14.2	92 (52%)	56 (32%)	51 (29%)

Table 15: Statistics of the AVHalluBench dataset with the percentage shown in brackets.

Appendix H Additional SelfCheckGPT and CrossCheckGPT Scores on AVHalluBench

We provide the detailed SelfCheckGPT and CrossCheckGPT scores on AVHalluBench for all MLLMs that handle video or audio inputs in this paper in Table 16 for video descriptions and Table 17 for audio descriptions.

Models	SelfCheckGPT	CrossCheck-explicit	CrossCheck-implicit
Valley [25]	52.43	55.98	48.22
Video-LLaVA [21]	30.59	33.52	40.57
Chat-Univi [16]	29.40	32.68	41.75
LLaMA-VID [20]	38.61	39.14	40.48
Video-LLaMA [54]	41.14	52.02	48.80
FAVOR [40]	60.67	53.85	50.49
Gemini 1.5 Pro	19.87	31.74	-

Table 16: SelfCheckGPT and CrossCheckGPT scores for 6 visual-LLMs that take video as inputs on AVHalluBench. Note that FAVOR, Video-LLaMA and Gemini 1.5 Pro are only given visual inputs. Gemini 1.5 Pro was not used for CrossCheck-implicit.

Models	SelfCheckGPT (audio)	SelfCheckGPT (w. speech)	CrossCheck-explicit (audio)	CrossCheck-explicit (w. speech)	CrossCheck-implicit (audio)	CrossCheck-implicit (w. speech)
LTU [12]	21.95	-	37.44	-	18.06	-
Qwen-Audio-Chat [7]	36.57	37.08	43.66	43.41	20.21	52.20
SALMONN [41]	34.99	34.80	42.21	40.15	18.32	48.17
FAVOR [40]	49.62	41.51	66.69	55.41	23.26	61.01
Video-LLaMA [54]	56.42	-	68.05	-	17.10	-
Gemini 1.5 Pro	25.82	27.38	34.66	36.52	-	-

Table 17: SelfCheckGPT and CrossCheckGPT scores for 6 audio-LLMs on AVHalluBench. Note that FAVOR and Video-LLaMA are only given audio inputs. Gemini 1.5 Pro was not used for CrossCheck-implicit.

Appendix I CrossCheck-explicit vs. CrossCheck-implicit

We present the average SelfCheckGPT scores on each task together with the system-level correlations in Table 18 to support our recommendations on CrossCheck-explicit and CrossCheck-implicit.

Tasks	Avg. $\mathcal{S}_{\text{selfcheck}}$	System( $\rho$ ) CrossCheck-explicit	System( $\rho$ ) CrossCheck-implicit
Text-to-text	40.63	77.44	56.71
Image-to-text	17.16	42.86	50.42
Audio description	39.91	71.67	40.00
Visual description	42.14	89.09	54.29

Table 18: SelfCheckGPT scores and system-level correlations using CrossCheck-explicit and CrossCheck-implicit on four tasks. The system-level correlation for audio and visual descriptions is measured against RefCheck, and that for the text-to-text and image-to-text tasks is measured against the overall ranking.
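The recommendation can be written as a simple threshold rule and checked against the four rows of Table 18. In the sketch below the table values are copied from Table 18, while the threshold of 30.0 is a hypothetical cut-off of our own: no numeric threshold is specified in this paper, and any value between 17.16 and 39.91 gives the same decisions on these four tasks.

```python
# (task, average SelfCheckGPT score, rho for explicit, rho for implicit), from Table 18.
TABLE_18 = [
    ("Text-to-text", 40.63, 77.44, 56.71),
    ("Image-to-text", 17.16, 42.86, 50.42),
    ("Audio description", 39.91, 71.67, 40.00),
    ("Visual description", 42.14, 89.09, 54.29),
]

THRESHOLD = 30.0  # hypothetical cut-off; not a value given in the paper


def choose_measure(avg_selfcheck, threshold=THRESHOLD):
    """Pick CrossCheck-explicit for high SelfCheckGPT scores, implicit otherwise."""
    return "explicit" if avg_selfcheck >= threshold else "implicit"


rule_matches_best = []
for task, avg_selfcheck, rho_explicit, rho_implicit in TABLE_18:
    chosen = choose_measure(avg_selfcheck)
    best = "explicit" if rho_explicit > rho_implicit else "implicit"
    rule_matches_best.append(chosen == best)
    print(f"{task}: rule picks {chosen}, higher correlation with {best}")
```

On these four tasks the rule selects the measure with the higher system-level correlation in every case, although four data points are too few to fix a threshold with any precision.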

Appendix J Case Studies for Hallucination with Audio-Visual Inputs

In addition to the piano example shown in Fig. 10 that has been mentioned in the main text, we show here two additional examples in Fig. 9 and Fig. 8 where audio-visual inputs influence the hallucination compared to using audio or visual inputs alone.

Appendix K Limitations

Our investigation is limited in the following aspects. First, hallucination is an expansive area and, as in other studies, this paper only covers a reasonable subset of all possible domains. However, we plan to release a live hallucination leaderboard on which we will benchmark the performance of further MLLMs over more benchmark metrics. Secondly, while the confidence-based weighting mechanism improves the performance of CrossCheckGPT, it does not take into account the similarities between different evidence models. Correlation between models, due to having similar training data or starting from the same checkpoints, may result in evidence models making similar mistakes. This poses a future research direction: taking model correlation into account in the weighting mechanism. Lastly, the study is limited by the number of currently available audio-visual LLMs for evidence generation.

Appendix L Broader Impact

Hallucinations in multimodal foundation models have become increasingly critical and challenging. Providing a general reference-free hallucination benchmarking approach is therefore necessary and timely, giving practitioners metrics for model trustworthiness. CrossCheckGPT has the following positive broader impacts:

• CrossCheckGPT establishes a universal ranking system which helps identify more factual and faithful models to be selected in particular applications, reducing the dissemination of misinformation and increasing societal confidence in AI applications.

• CrossCheckGPT provides a reliable ranking that would aid regulatory bodies in enforcing compliance standards for multimodal foundation models, particularly in critical areas such as healthcare, finance, and public safety.

• As a reference-free and versatile benchmarking method, CrossCheckGPT can drive developers to innovate and improve their multimodal foundation models.

However, our method by no means provides perfect hallucination scores and may inherit potential bias from the chosen evidence models. Therefore, practitioners should be independently educated and avoid overreliance on the rankings, as doing so may lead to complacency in critical thinking and reduced vigilance. From the model aspect, the approach in this paper does not give rise to any additional potential biases beyond the ones directly inherited from the pre-trained LLM checkpoints.

Appendix M Computing Resource

All experiments run inference on a single NVIDIA A100 GPU. Obtaining the CrossCheckGPT score for one target model takes 20 hours of inference on average. Running all models takes 200 hours for the text-to-text leaderboard, 190 hours for the image-to-text leaderboard, and 240 hours for AVHalluBench. The full research used 2000 GPU hours in total. No training is involved.

Appendix N Assets and License Explanation

The following licenses apply to the models used in the paper (see Table 7); a link is provided for each:

• Llama2: https://huggingface.co/meta-llama/Llama-2-7b-chat-hf/blob/main/LICENSE.txt

• Apache-2.0: https://www.apache.org/licenses/LICENSE-2.0

• MIT License: https://choosealicense.com/licenses/mit/

• BSD 3-Clause License: https://github.com/salesforce/LAVIS/blob/main/LICENSE.txt

• Tongyi Qianwen: https://github.com/QwenLM/Qwen-Audio/blob/main/LICENSE

The following licenses apply to the datasets used in our paper:

• CC-BY-SA-3.0: Used by WikiBio hallucination data [28]. License link: https://spdx.org/licenses/CC-BY-SA-3.0.

• MIT License: Used by MHaluBench (https://huggingface.co/datasets/openkg/MHaluBench). License link: see above.

The following licenses apply to the code and Python packages used in our experiments:

• Apache-2.0: Applies to Huggingface Transformers (https://github.com/huggingface/transformers/blob/main/LICENSE) and UniHD (https://github.com/OpenKG-ORG/EasyDetect/blob/main/LICENSE).

• MIT License: Applies to SelfCheckGPT (https://github.com/potsawee/selfcheckgpt/blob/main/LICENSE) and spaCy (https://github.com/explosion/spaCy/blob/master/LICENSE).
