Just rephrase it! Uncertainty estimation in closed-source language models via multiple rephrased queries

Adam Yang
University of Bristol
Bristol, United Kingdom
[email protected]
Chen Chen
Nanyang University
Singapore
[email protected]
Konstantinos Pitas
INRIA Grenoble Rhône-Alpes
Grenoble, France
[email protected]
Abstract

State-of-the-art large language models are sometimes distributed as open-source software but are also increasingly provided as a closed-source service. These closed-source large language models typically see the widest usage by the public; however, they often do not provide an estimate of their uncertainty when responding to queries. As even the best models are prone to “hallucinating” false information with high confidence, the lack of a reliable estimate of uncertainty limits the applicability of these models in critical settings. We explore estimating the uncertainty of closed-source LLMs via multiple rephrasings of an original base query. Specifically, we ask the model multiple rephrased questions and use the similarity of the answers as an estimate of uncertainty. We diverge from previous work in i) providing rules for rephrasing that are simple to memorize and use in practice, and ii) proposing a theoretical framework for why multiple rephrased queries obtain calibrated uncertainty estimates. Our method demonstrates significant improvements in the calibration of uncertainty estimates compared to the baseline and provides intuition as to how query strategies should be designed for optimal test calibration.

1 Introduction

Since the introduction of ChatGPT (Brown et al., 2020), closed-source Large Language Models have seen incredibly rapid adoption by the general public, resulting in great productivity gains (Eloundou et al., 2023). At the same time, closed-source LLMs are prone to generating highly convincing but false information, a problem known as “hallucinating” (Huang et al., 2023; Ji et al., 2023). They are furthermore known to state this false information with high confidence (Kadavath et al., 2022). This combination presents a conundrum to users. Specifically, while on average LLM-generated text is useful, the unreliability of any individual LLM-generated text and the lack of an effective mechanism to separate reliable and unreliable generations necessitate that LLM answers be inspected and vetted by the user, especially for critical applications. This significantly slows the LLM usage pipeline. Furthermore, the typical user has limited access to the model (specifically, they can only query the LLM with textual prompts), and thus standard approaches for uncertainty estimation in deep neural networks (Guo et al., 2017; Arbel et al., 2023) cannot be applied, as they typically require access to the network's logits.

It is folk wisdom that one approach for estimating LLM uncertainty, even with such limited access to the model, is to query it multiple times (Wang et al., 2022; Xiong et al., 2023). This approach is based on the premise that LLM-generated text is frequently stochastic by design, as the next generated token is chosen through nucleus sampling (Holtzman et al., 2019) or top-k decoding (Fan et al., 2018; Radford et al., 2019). Wang et al. (2022) and Xiong et al. (2023) proposed to use the consistency of multiple answers as an estimate of uncertainty. Xiong et al. (2023) furthermore proposed to add “noise” to the base query at each repetition, through misleading hints. While adding noise at each query repetition has been shown to improve over using the internal stochasticity of the LLM, we believe that there is considerable room for improvement. Specifically:

  • The current SOTA hint-based approach to submitting multiple noisy queries (Xiong et al., 2023) is cumbersome for end users, as it requires memorization of the hint patterns. This in turn might significantly limit adoption.

  • A theoretical understanding of why multiple queries work in the top-1 decoding setting is currently lacking. Specifically, a clear understanding of which “noising” methods work and why would help the community design better noising rules.

  • Furthermore, a more detailed understanding of when and why adding noise to queries helps in the top-k decoding setting (which by itself results in multiple answers) would help avoid “noising” queries when this is unnecessary.

Figure 1: Multiple rephrased queries for uncertainty estimation. Top row: Querying a closed-source LLM only once with a base query may yield an incorrect top-1 prediction. In the absence of additional information, the naive baseline is to assign $100\%$ confidence to this singular prediction. Bottom row: Querying the model multiple times with rephrased versions of the base query produces the $\{\mathrm{Athens}\}$ class twice and the $\{\mathrm{Paris}\}$ class once. This is roughly equivalent to $66.6\%$ confidence. This observation should serve as an alert to a potential error, even when the true label is unknown.

In this work, we examine, refine, and theoretically analyze the use of multiple queries for uncertainty estimation. Given a base query, we restrict ourselves to submitting rephrased versions of the base query to an LLM, checking the consistency of the answers, and using the result as an estimate of uncertainty. Concretely, our contributions are the following:

  • We test four simple strategies for creating multiple rephrased queries, and find that in the top-1 decoding setting, two of them, substituting words with their synonyms and making the base query more verbose, result in significant calibration gains over the naive baseline of trusting every LLM answer. These two strategies have the advantage of requiring only basic language and arithmetic skills by the end user, and practically no memorization apart from the rephrasing rule.

  • We propose a theoretical model for multiple rephrased queries on a simplified top-1 decoding setting. Given multiple rephrased queries, our analysis shows that it is possible to recover the probability of the answer under the inaccessible categorical distribution of the LLM.

  • We propose a theoretical model for multiple rephrased queries on a simplified top-k decoding setting (Holtzman et al., 2019). Our analysis implies that generating multiple answers using the same base query and top-k decoding can also recover a tempered version of the probability of the answer under the inaccessible categorical distribution of the LLM. While generating multiple answers in this way (without rephrasing) might be sufficient for good calibration, we find that rephrasing results in additional tempering of the resulting uncertainty estimate, which is known to improve calibration.

  • In practice, however, when comparing top-k and top-1 decoding with and without rephrasing in terms of Brier score, we find that top-1 decoding with rephrasing results in the best trade-off between accuracy and calibration.

2 Rephrasing drastically improves calibration for top-1 decoding

| Method | Question |
|---|---|
| original | What part of the digestive system first causes chemical changes to food? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine. |
| reword | Which region of the gastrointestinal tract initiates the initial chemical modifications to food intake? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine. |
| rephrase | In what region of the digestive system does the food undergo its initial chemical transformations? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine. |
| paraphrase | At what point in the digestive process do initial chemical transformations of food occur and which section of the system carries out this function? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine. |
| expansion | Considering the intricate process by which our bodies break down and absorb nutrients from food, which specific organ or region within the digestive system initiates the essential biochemical transformations through enzyme secretion and the beginning of the digestion process? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine. |

Table 1: Rephrasing examples generated by Mistral-7B, with rephrasing methods listed on the left and corresponding rephrases on the right.

Let $f:\mathcal{X}\rightarrow\mathcal{Y}$ be an LLM which takes $\boldsymbol{x}$, an input query in the form of a multiple choice question, and outputs $y$, an answer. We first consider top-1 decoding such that the answers of the LLM are deterministic. We consider randomized transformations of the base query $\mathcal{T}(\boldsymbol{x})\sim\tau$ in the form of rephrasings of the query, and the most probable answer under the transformations $A=\mathrm{argmax}_{i}\,\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=i\right)$. In a multiple choice question setting (which can be seen as a multi-class classification problem), we will use $A$ as the predicted class and

$$p_{A}(\boldsymbol{x})=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right),$$

as our confidence about this prediction (here the predicted class coincides with a predicted token denoting this class). We consider four types of rephrasings, with an increasing level of modification to the original query:

  • Reword: Focuses on replacing words with their synonyms without significantly altering the sentence structure or adding new content.

  • Rephrase: Modifies the original question with changes in structure and possibly synonyms to achieve a similar but distinct question.

  • Paraphrase: Reconstructs the original query, often significantly, to retain its meaning while altering its presentation.

  • Expansion: Elaborates on the original query, making it more detailed or specific, often by adding context or additional considerations.

We provide our one-shot prompt template for each rephrasing method in Table 9 in Appendix A, example generations from Mistral-7B in Table 1, and generations from Llama-7B/13B in Appendix B. In general, we perform the rephrasings with a separate instance of the same model that responds to the queries. Unless stated otherwise, we estimate $p_{A}(\boldsymbol{x})$ using Monte Carlo sampling with 10 draws from $\mathcal{T}(\boldsymbol{x})\sim\tau$.
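The Monte Carlo estimate described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the experimental code: `rephrase` and `answer` are hypothetical callables standing in for the rephrasing model and the answering LLM (with the answer already reduced to a class label), and `rephrase_confidence` is a name introduced here for illustration.

```python
from collections import Counter
from typing import Callable, List, Tuple


def rephrase_confidence(
    base_query: str,
    rephrase: Callable[[str], str],  # hypothetical: returns one random rephrasing T(x)
    answer: Callable[[str], str],    # hypothetical: returns the class label f(T(x))
    n_draws: int = 10,
) -> Tuple[str, float]:
    """Monte Carlo estimate of (A, p_A(x)) from n_draws rephrased queries."""
    # Each draw samples one rephrasing and records the model's answer to it.
    answers: List[str] = [answer(rephrase(base_query)) for _ in range(n_draws)]
    # A is the most frequent answer; p_A(x) is its empirical frequency.
    predicted, count = Counter(answers).most_common(1)[0]
    return predicted, count / n_draws


# Toy stand-ins mirroring Figure 1: three draws answered Athens, Paris, Athens.
canned = iter(["Athens", "Paris", "Athens"])
label, conf = rephrase_confidence(
    "base query", rephrase=lambda q: q, answer=lambda q: next(canned), n_draws=3
)
print(label, round(conf, 3))
```

With real models, `rephrase` would wrap the one-shot rephrasing prompt and `answer` would wrap the query plus answer extraction; both are left abstract here because their exact form depends on the API being used.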

Table 2: Evaluation results on ARC-Challenge with various rephrasing methods applied to three LLMs. In the majority of cases, the rephrasing approach outperforms the naive baseline by $10-40\%$ in AUROC, $10-30\%$ in ECE and $0-0.4$ in Brier.

| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-1 | 0.742 | 0.258 | 0.065 | 0.517 | 0.5 | - |
| | hint | 0.593 | 0.201 | 0.108 | 0.614 | 0.695 | - |
| | reword | 0.619 | 0.12 | 0.103 | 0.512 | 0.846 | 1.0 |
| | rephrase | 0.555 | 0.125 | 0.103 | 0.571 | 0.817 | 1.5 |
| | paraphrase | 0.525 | 0.102 | 0.115 | 0.592 | 0.827 | 1.5 |
| | expansion | 0.602 | 0.133 | 0.099 | 0.509 | 0.847 | 1.0 |
| Llama-2-7B | top-1 | 0.483 | 0.517 | - | 1.034 | 0.5 | - |
| | hint | 0.258 | 0.071 | 0.144 | 0.839 | 0.562 | - |
| | reword | 0.352 | 0.193 | 0.176 | 0.853 | 0.626 | 1.5 |
| | rephrase | 0.381 | 0.263 | 0.173 | 0.871 | 0.656 | 1.5 |
| | paraphrase | 0.39 | 0.287 | 0.162 | 0.883 | 0.67 | 1.0 |
| | expansion | 0.373 | 0.112 | 0.153 | 0.778 | 0.687 | 1.5 |
| Llama-2-13B | top-1 | 0.508 | 0.492 | - | 0.983 | 0.5 | - |
| | hint | 0.331 | 0.147 | 0.134 | 0.813 | 0.57 | - |
| | reword | 0.445 | 0.084 | 0.119 | 0.714 | 0.721 | 1.5 |
| | rephrase | 0.441 | 0.128 | 0.134 | 0.727 | 0.713 | 1.5 |
| | paraphrase | 0.453 | 0.092 | 0.129 | 0.717 | 0.697 | 1.5 |
| | expansion | 0.441 | 0.154 | 0.142 | 0.715 | 0.784 | 1.2 |
Table 3: Evaluation results on ARC-Easy with various rephrasing methods applied to three LLMs. In the majority of cases, the rephrasing approach outperforms the naive baseline by $10-40\%$ in AUROC, $10-30\%$ in ECE, and $0-0.4$ in Brier.

| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-1 | 0.866 | 0.134 | 0.034 | 0.269 | 0.5 | - |
| | hint | 0.773 | 0.17 | 0.076 | 0.386 | 0.795 | - |
| | reword | 0.753 | 0.045 | 0.062 | 0.297 | 0.931 | 1.0 |
| | rephrase | 0.678 | 0.035 | 0.076 | 0.357 | 0.953 | 1.5 |
| | paraphrase | 0.663 | 0.036 | 0.08 | 0.381 | 0.943 | 1.5 |
| | expansion | 0.742 | 0.034 | 0.067 | 0.31 | 0.936 | 1.0 |
| Llama-2-7B | top-1 | 0.672 | 0.328 | 0.082 | 0.656 | 0.5 | - |
| | hint | 0.231 | 0.041 | 0.149 | 0.827 | 0.663 | - |
| | reword | 0.43 | 0.084 | 0.119 | 0.672 | 0.818 | 1.5 |
| | rephrase | 0.535 | 0.131 | 0.117 | 0.603 | 0.830 | 1.5 |
| | paraphrase | 0.526 | 0.184 | 0.125 | 0.626 | 0.831 | 1.0 |
| | expansion | 0.405 | 0.045 | 0.119 | 0.692 | 0.818 | 1.5 |
| Llama-2-13B | top-1 | 0.617 | 0.383 | 0.096 | 0.767 | 0.5 | - |
| | hint | 0.346 | 0.089 | 0.128 | 0.77 | 0.673 | - |
| | reword | 0.546 | 0.07 | 0.11 | 0.58 | 0.814 | 1.5 |
| | rephrase | 0.526 | 0.07 | 0.112 | 0.579 | 0.842 | 1.5 |
| | paraphrase | 0.518 | 0.104 | 0.119 | 0.604 | 0.815 | 1.5 |
| | expansion | 0.524 | 0.078 | 0.12 | 0.552 | 0.893 | 1.2 |
Table 4: Evaluation results on OpenBookQA with various rephrasing methods applied to three LLMs. In the majority of cases, the rephrasing approach outperforms the naive baseline by $10-40\%$ in AUROC, $10-30\%$ in ECE, and $0-0.4$ in Brier.

| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-1 | 0.655 | 0.345 | 0.086 | 0.69 | 0.5 | - |
| | hint | 0.56 | 0.265 | 0.119 | 0.71 | 0.606 | - |
| | reword | 0.552 | 0.105 | 0.102 | 0.592 | 0.796 | 1.0 |
| | rephrase | 0.482 | 0.107 | 0.122 | 0.641 | 0.809 | 1.5 |
| | paraphrase | 0.49 | 0.076 | 0.116 | 0.622 | 0.826 | 1.5 |
| | expansion | 0.518 | 0.087 | 0.117 | 0.596 | 0.837 | 1.0 |
| Llama-2-7B | top-1 | 0.478 | 0.522 | 0.131 | 1.045 | 0.5 | - |
| | hint | 0.275 | 0.08 | 0.142 | 0.832 | 0.556 | - |
| | reword | 0.388 | 0.137 | 0.143 | 0.786 | 0.689 | 1.5 |
| | rephrase | 0.39 | 0.196 | 0.156 | 0.806 | 0.721 | 1.5 |
| | paraphrase | 0.398 | 0.227 | 0.159 | 0.834 | 0.712 | 1.0 |
| | expansion | 0.362 | 0.083 | 0.138 | 0.775 | 0.678 | 1.5 |
| Llama-2-13B | top-1 | 0.418 | 0.582 | - | 1.165 | 0.5 | - |
| | hint | 0.295 | 0.069 | 0.138 | 0.809 | 0.613 | - |
| | reword | 0.428 | 0.117 | 0.142 | 0.75 | 0.676 | 1.5 |
| | rephrase | 0.428 | 0.095 | 0.14 | 0.729 | 0.73 | 1.5 |
| | paraphrase | 0.41 | 0.116 | 0.141 | 0.759 | 0.682 | 1.5 |
| | expansion | 0.41 | 0.143 | 0.147 | 0.772 | 0.702 | 1.2 |

We used three different models: the Llama-2 7B model, the Llama-2 13B model (Touvron et al., 2023), and the Mistral 7B model (Jiang et al., 2023). We tested our framework on three multiple-choice tasks of different difficulty, namely ARC-Challenge, ARC-Easy (Clark et al., 2018), and OpenBookQA (Mihaylov et al., 2018). Following Kojima et al. (2022), we extract the answer from LLM-generated texts by looking at the first appearance of A/B/C/D. To test for calibration we used standard calibration metrics, including the ECE and TACE (Naeini et al., 2015), the Brier score (Murphy, 1973), and AUROC (Murphy, 2012). We note that for a fair comparison when the accuracy drops significantly, we must consult the Brier score, which is a proper scoring rule. This is because the ECE, TACE, and AUROC are not proper scoring rules and can in general trade off accuracy for calibration. For a baseline, we assumed $100\%$ confidence for each deterministic prediction. We also tested the “hint”-based approach of Xiong et al. (2023), which we describe in detail in Appendix A.
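To make the baseline numbers easier to read, the following sketch computes the Brier score and the ECE for the naive baseline. It assumes the multi-class form of the Brier score (squared error summed over classes) and equal-width confidence bins for the ECE; the function names and the bin count are illustrative choices, not taken from the experimental code. Under these assumptions, a predictor that assigns $100\%$ confidence to every answer has ECE $= 1-\mathrm{Acc}$ and Brier $= 2(1-\mathrm{Acc})$, which is consistent, up to rounding, with the top-1 rows of the tables (for example Acc 0.483, ECE 0.517, Brier 1.034 for Llama-2-7B on ARC-Challenge).

```python
from typing import List, Sequence


def brier_score(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Multi-class Brier score: mean over examples of sum_c (p_c - y_c)^2."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += sum((pc - (1.0 if c == y else 0.0)) ** 2 for c, pc in enumerate(p))
    return total / len(labels)


def ece(confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Expected calibration error with equal-width confidence bins (assumed binning)."""
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # A confidence of exactly 1.0 is placed in the last bin.
        bins[min(int(c * n_bins), n_bins - 1)].append(i)
    n = len(confidences)
    err = 0.0
    for idx in bins:
        if not idx:
            continue
        acc = sum(correct[i] for i in idx) / len(idx)
        conf = sum(confidences[i] for i in idx) / len(idx)
        # Weight each bin's |accuracy - confidence| gap by its share of examples.
        err += len(idx) / n * abs(acc - conf)
    return err


# Naive baseline on synthetic data: 4 classes, 1000 questions, 483 answered correctly,
# every prediction made with 100% confidence (a one-hot probability vector).
n, n_correct = 1000, 483
labels = [0] * n
preds = [0] * n_correct + [1] * (n - n_correct)
probs = [[1.0 if c == p else 0.0 for c in range(4)] for p in preds]
hits = [p == y for p, y in zip(preds, labels)]
print(brier_score(probs, labels), ece([1.0] * n, hits))
```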

We present the results in Tables 2, 3 and 4. In the majority of cases rephrasing outperforms the naive baseline by $10-40\%$ in AUROC, $10-30\%$ in ECE, and $0-0.4$ in Brier. Our approach also typically outperforms the “hint”-based approach of Xiong et al. (2023) by $10-20\%$ in AUROC, $5-10\%$ in ECE, and $0.1$ in Brier. In particular, the “hint”-based approach is more inflexible than our approach and typically hurts accuracy significantly, by $10-20\%$ compared to $5-10\%$ for our approach. For our method, these accuracy drops are more prevalent in the smaller 7B models, while the larger 13B model often shows a much smaller drop.

Crucially, the different rephrasing methods exhibit different calibration gains. On average, in terms of all calibration metrics, the best methods are the “expansion” and “reword” methods, which make the queries more verbose and substitute words with synonyms, respectively. In terms of AUROC, “expansion” outperforms the alternatives by $1-5\%$. In terms of the Brier score, it outperforms them by $\approx 0.05$. To instantiate our rephrasings we used a prompt with a one-shot example and a temperature parameter, which results in greater or smaller varieties of rephrasings. We include this temperature parameter in the tables. Generally, we choose the temperature that balances accuracy and calibration. In Figure 2 we plot the behaviour as the number of MC draws increases.

In Appendix D, we also compare with Chain-of-Thought (Wei et al., 2022) for uncertainty estimation. We find that our results are competitive with CoT. At the same time, our method is significantly easier and more natural to implement for humans interacting via text with an LLM.

Figure 2: Panels (a) Accuracy, (b) ECE, (c) TACE, (d) Brier, (e) AUROC. The behavior of the Accuracy, ECE, TACE, Brier, and AUROC for all datasets, architectures, and expansion methods, as we increase the number of samples. We plot the average value as well as confidence intervals $\pm 2\sigma$. We see that the ECE and the AUROC improve with more samples while the accuracy drops slightly. This might be because the meaning of some queries is completely destroyed by our rephrasings. The Brier score captures this tradeoff by having a minimum at approximately 5 samples. The TACE remains relatively stable with respect to the number of samples.
Table 5: Comparisons between our rephrasing methods and white-box logit uncertainty estimation. We see that our rephrasing methods achieve similar calibration to what would be achieved if we had access to last layer logits. This is evident both in the AUROC and TACE as well as the Brier score, which also accounts for accuracy.

| Dataset | Model | Method | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ |
|---|---|---|---|---|---|---|---|
| ARC-C | Mistral-7B | logits | 0.742 | 0.252 | 0.075 | 0.503 | 0.741 |
| | | expansion | 0.602 | 0.133 | 0.099 | 0.509 | 0.847 |
| | Llama-2-7B | logits | 0.483 | 0.362 | 0.168 | 0.853 | 0.621 |
| | | expansion | 0.373 | 0.112 | 0.153 | 0.778 | 0.687 |
| | Llama-2-13B | logits | 0.508 | 0.132 | 0.141 | 0.704 | 0.669 |
| | | reword | 0.445 | 0.084 | 0.119 | 0.714 | 0.721 |
| ARC-E | Mistral-7B | logits | 0.866 | 0.128 | 0.037 | 0.264 | 0.818 |
| | | reword | 0.753 | 0.045 | 0.062 | 0.297 | 0.931 |
| | Llama-2-7B | logits | 0.672 | 0.190 | 0.098 | 0.493 | 0.779 |
| | | rephrase | 0.535 | 0.131 | 0.117 | 0.603 | 0.830 |
| | Llama-2-13B | logits | 0.617 | 0.060 | 0.094 | 0.498 | 0.763 |
| | | expansion | 0.524 | 0.078 | 0.12 | 0.552 | 0.893 |
| OBQA | Mistral-7B | logits | 0.655 | 0.298 | 0.085 | 0.602 | 0.705 |
| | | reword | 0.552 | 0.105 | 0.102 | 0.592 | 0.796 |
| | Llama-2-7B | logits | 0.478 | 0.277 | 0.147 | 0.758 | 0.642 |
| | | expansion | 0.362 | 0.083 | 0.138 | 0.775 | 0.678 |
| | Llama-2-13B | logits | 0.418 | 0.168 | 0.135 | 0.723 | 0.650 |
| | | rephrase | 0.428 | 0.095 | 0.14 | 0.729 | 0.73 |

3 Rephrasing works as well as having access to the last layer logits

We now derive a proposition that elucidates why $p_{A}(\boldsymbol{x})$ results in calibrated estimates of uncertainty.

Proposition 3.1.

Let $f:\mathcal{X}\rightarrow\mathcal{Y}$ be an LLM, $\boldsymbol{x}$ is a base query and $\mathcal{T}(\boldsymbol{x})\sim\tau$ is some randomized transformation of the base query. Let

$$p_{A}(\boldsymbol{x})=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right), \tag{1}$$

be the probability of sampling the most probable answer $A\in\mathcal{Y}$ under transformations $\mathcal{T}(\boldsymbol{x})\sim\tau$. Let $\boldsymbol{z}_{mean}+\epsilon_{rephrase}$ be the latent representation of $\boldsymbol{x}$ under $\mathcal{T}(\boldsymbol{x})$ at the final LLM layer, where $\boldsymbol{z}_{mean}$ is the mean representation and $\epsilon_{rephrase}$ is some additive noise. Let $\boldsymbol{\mathrm{w}}$ be the separating hyperplane between the most probable answer $A$ and the second most probable answer $B$. Assuming that $\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}\sim\rho$ follows a logistic distribution with $\mu=0$ and $s=1$ then

$$p_{A}(\boldsymbol{x})=p(A|\boldsymbol{z}_{mean},f) \tag{2}$$

where $p(A|\boldsymbol{z}_{mean},f)$ is the probability of $A$ given $\boldsymbol{z}_{mean}$ under the categorical distribution of the final layer.

We prove the above for the binary case of two classes $A$ and $B$ in Appendix C, but expect that it should be sufficiently informative in multi-class settings when $A,B$ are much more probable than other classes. A crucial assumption for recovering well-calibrated predictions is that $\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}\sim\rho$ follows a logistic distribution with $\mu=0$ and $s=1$. We test this assumption by computing the cumulative distribution of $\rho$ for our different experimental setups. In Figure 3(c) we compute and plot the empirical cumulative distribution using a Kolmogorov-Smirnov test (Smirnov, 1948) and $S=100$ MC samples of $\rho$ for Mistral-7B, ARC-Challenge, and the “expansion” rephrasing method. We see that the cumulative distribution is indeed approximately logistic, validating our assumption (the confidence bands cover different queries $\boldsymbol{x}$). In Table 5 we use the logits of the answers as an oracle white-box uncertainty estimate. Specifically, we apply the softmax function and use the probability of the most probable class as our estimate of uncertainty. We compare the results of this method with the best rephrasing method (in terms of Brier) from Tables 2, 3 and 4. We observe that our uncertainty estimates are similar to what we would get if we had access to the last layer logits.
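The intuition behind Proposition 3.1 in the binary case can be checked numerically. The sketch below is an illustration under simplifying assumptions, not the proof from Appendix C: it treats the scalar `margin` as the logit difference between $A$ and $B$ at $\boldsymbol{z}_{mean}$, adds zero-mean logistic noise with scale $s=1$, and counts how often $A$ still wins. Since the logistic CDF is the sigmoid, the vote frequency should approach the two-class softmax probability $\sigma(\mathrm{margin})$. All function names are introduced here for illustration.

```python
import math
import random


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def sample_logistic(rng: random.Random, scale: float = 1.0) -> float:
    """Draw from a zero-mean logistic distribution via inverse-CDF sampling."""
    u = rng.random()
    while u == 0.0:  # random() lies in [0, 1); skip 0 to avoid log(0)
        u = rng.random()
    return scale * math.log(u / (1.0 - u))


def vote_frequency(margin: float, n_draws: int, seed: int = 0, scale: float = 1.0) -> float:
    """Fraction of noisy draws in which class A still wins (margin + noise > 0)."""
    rng = random.Random(seed)
    wins = sum(margin + sample_logistic(rng, scale) > 0 for _ in range(n_draws))
    return wins / n_draws


# Compare the simulated vote frequency with the two-class softmax probability.
for m in (0.0, 0.5, 1.0, 2.0):
    print(m, round(vote_frequency(m, 200_000), 3), round(sigmoid(m), 3))
```

In this toy model the only role of the rephrasings is to supply the noise; if the noise scale differs from 1, the vote frequency no longer matches the softmax probability exactly, which is the case the next section considers.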

4 For top-k decoding, rephrasing tempers predictive uncertainty

Figure 3: Panels (a) $p_{A}(\boldsymbol{x})$ for top-k without rephrasing, (b) $p_{A}(\boldsymbol{x})$ for top-k with rephrasing, (c) Logistic (blue), and empirical cdf (red). We plot the distribution of $p_{A}(\boldsymbol{x})$ for the case of top-k decoding with and without rephrasing, for all datasets, models, and rephrasing methods. We see that rephrasing primarily acts to temper the probability of the most probable class $A$, thus making the model less confident and possibly better calibrated. We also plot the logistic (blue), and empirical cdf (red) for $\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}\sim\rho$ for Mistral-7B, ARC-Challenge, and the “expansion” rephrasing method for top-1 decoding. $\rho$ is often close to a logistic distribution.
Table 6: Evaluation results on ARC-Challenge with various rephrasing methods applied to three LLMs using top-k decoding. In the majority of cases rephrasing + top-k outperforms simple top-k in terms of calibration.

| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-k | 0.746 | 0.272 | 0.091 | 0.511 | 0.6 | - |
| | temp-sampling | 0.742 | 0.272 | 0.089 | 0.513 | 0.605 | - |
| | reword | 0.547 | 0.05 | 0.093 | 0.543 | 0.864 | 1.5 |
| | rephrase | 0.64 | 0.106 | 0.086 | 0.485 | 0.82 | 1.0 |
| | paraphrase | 0.631 | 0.11 | 0.098 | 0.495 | 0.83 | 1.0 |
| | expansion | 0.517 | 0.061 | 0.114 | 0.573 | 0.859 | 1.5 |
| Llama-2-7B | top-k | 0.436 | 0.201 | 0.139 | 0.761 | 0.602 | - |
| | temp-sampling | 0.441 | 0.211 | 0.132 | 0.757 | 0.621 | - |
| | reword | 0.335 | 0.187 | 0.166 | 0.858 | 0.62 | 1.5 |
| | rephrase | 0.356 | 0.314 | 0.17 | 0.944 | 0.627 | 1.0 |
| | paraphrase | 0.309 | 0.185 | 0.162 | 0.851 | 0.69 | 1.5 |
| | expansion | 0.322 | 0.144 | 0.155 | 0.828 | 0.622 | 1.5 |
| Llama-2-13B | top-k | 0.462 | 0.125 | 0.115 | 0.679 | 0.753 | - |
| | temp-sampling | 0.47 | 0.122 | 0.115 | 0.662 | 0.766 | - |
| | reword | 0.352 | 0.087 | 0.136 | 0.771 | 0.687 | 1.5 |
| | rephrase | 0.398 | 0.068 | 0.136 | 0.725 | 0.743 | 1.0 |
| | paraphrase | 0.364 | 0.109 | 0.137 | 0.738 | 0.719 | 1.2 |
| | expansion | 0.373 | 0.124 | 0.143 | 0.76 | 0.669 | 1.5 |

In practice, the assumptions of the above proposition are too restrictive. In particular, decoding in LLMs is performed with top-k decoding or nucleus sampling instead of top-1 decoding. Furthermore, while for an oracle choice of the rephrasing intensity the modeling assumption that $\boldsymbol{\mathrm{w}}^{\top}\epsilon_{\eta}\sim\rho$ follows a logistic distribution with $\mu=0$ and $s=1$ might be correct, in general the variance of the noise in latent space is unknown. It is thus illustrative to consider an extension of our toy model. The following proposition explores these extensions.

Proposition 4.1.

Let $g:\mathbb{R}^{d_{\eta}}\rightarrow\mathcal{Y}$ be the final encoder layer of an LLM, $\boldsymbol{x}$ is a base query and $\mathcal{T}(\boldsymbol{x})\sim\tau$ is some randomized transformation of the base query. Let

$$p_{A}(\boldsymbol{x})=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right), \tag{3}$$

be the probability of sampling the most probable answer $f(\boldsymbol{x})=A\in\mathcal{Y}$ under transformations $\mathcal{T}(\boldsymbol{x})\sim\tau$. Let $\boldsymbol{z}_{mean}+\epsilon_{topk}+\epsilon_{rephrase}$ be the latent representation of $\boldsymbol{x}$ under $\mathcal{T}(\boldsymbol{x})$ at the final LLM layer, where $\boldsymbol{z}_{mean}$ is the mean representation and $\epsilon_{topk}$ is additive noise resulting from the top-k decoding and $\epsilon_{rephrase}$ is additive noise resulting from the rephrasings $\mathcal{T}(\boldsymbol{x})$. Assuming that $\boldsymbol{\mathrm{w}}^{\top}(\epsilon_{topk}+\epsilon_{rephrase})\sim\rho$ approximately follows a logistic distribution with $\mu=0$ and $s=\sqrt{s^{2}_{topk}+s^{2}_{rephrase}}$ then

$$p_{A}(\boldsymbol{x})\approx 0.5+\frac{1}{\sqrt{s^{2}_{topk}+s^{2}_{rephrase}}}\left(p(A|\boldsymbol{z}_{mean},f)-0.5\right) \tag{4}$$

where $p(A|\boldsymbol{z}_{mean},f)$ is the probability of $A$ given $\boldsymbol{z}_{mean}$ under the categorical distribution of $g$.

Table 7: Evaluation results on ARC-Easy with various rephrasing methods applied to three LLMs using top-k decoding. In the majority of cases rephrasing + top-k outperforms simple top-k in terms of calibration.
| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-k | 0.868 | 0.133 | 0.042 | 0.255 | 0.695 | - |
| | temp-sampling | 0.859 | 0.131 | 0.046 | 0.266 | 0.677 | - |
| | reword | 0.694 | 0.054 | 0.076 | 0.344 | 0.941 | 1.5 |
| | rephrase | 0.789 | 0.047 | 0.049 | 0.274 | 0.911 | 1.0 |
| | paraphrase | 0.753 | 0.036 | 0.056 | 0.3 | 0.922 | 1.0 |
| | expansion | 0.63 | 0.042 | 0.086 | 0.403 | 0.942 | 1.5 |
| Llama-2-7B | top-k | 0.612 | 0.25 | 0.115 | 0.612 | 0.73 | - |
| | temp-sampling | 0.619 | 0.261 | 0.114 | 0.617 | 0.717 | - |
| | reword | 0.401 | 0.074 | 0.121 | 0.681 | 0.825 | 1.5 |
| | rephrase | 0.564 | 0.145 | 0.108 | 0.584 | 0.819 | 1.0 |
| | paraphrase | 0.425 | 0.08 | 0.117 | 0.665 | 0.835 | 1.5 |
| | expansion | 0.335 | 0.054 | 0.138 | 0.742 | 0.791 | 1.5 |
| Llama-2-13B | top-k | 0.557 | 0.06 | 0.098 | 0.528 | 0.865 | - |
| | temp-sampling | 0.544 | 0.087 | 0.107 | 0.532 | 0.866 | - |
| | reword | 0.412 | 0.106 | 0.129 | 0.72 | 0.741 | 1.5 |
| | rephrase | 0.458 | 0.05 | 0.12 | 0.643 | 0.817 | 1.0 |
| | paraphrase | 0.427 | 0.066 | 0.126 | 0.652 | 0.845 | 1.2 |
| | expansion | 0.366 | 0.087 | 0.13 | 0.74 | 0.75 | 1.5 |
Table 8: Evaluation results on OpenBookQA with various rephrasing methods applied to three LLMs using top-k decoding. In the majority of cases rephrasing + top-k outperforms simple top-k in terms of calibration.
| Model | Rephrasing | Acc ↑ | ECE ↓ | TACE ↓ | Brier ↓ | AUROC ↑ | temp |
|---|---|---|---|---|---|---|---|
| Mistral-7B | top-k | 0.638 | 0.289 | 0.101 | 0.636 | 0.636 | - |
| | temp-sampling | 0.668 | 0.289 | 0.098 | 0.607 | 0.624 | - |
| | reword | 0.528 | 0.103 | 0.105 | 0.606 | 0.794 | 1.5 |
| | rephrase | 0.582 | 0.109 | 0.093 | 0.542 | 0.821 | 1.0 |
| | paraphrase | 0.552 | 0.078 | 0.101 | 0.57 | 0.817 | 1.0 |
| | expansion | 0.445 | 0.061 | 0.128 | 0.653 | 0.818 | 1.5 |
| Llama-2-7B | top-k | 0.412 | 0.208 | 0.129 | 0.776 | 0.617 | - |
| | temp-sampling | 0.442 | 0.235 | 0.13 | 0.772 | 0.599 | - |
| | reword | 0.34 | 0.14 | 0.153 | 0.807 | 0.696 | 1.5 |
| | rephrase | 0.408 | 0.239 | 0.154 | 0.815 | 0.704 | 1.0 |
| | paraphrase | 0.355 | 0.127 | 0.145 | 0.783 | 0.721 | 1.5 |
| | expansion | 0.308 | 0.098 | 0.151 | 0.807 | 0.711 | 1.5 |
| Llama-2-13B | top-k | 0.43 | 0.114 | 0.13 | 0.708 | 0.72 | - |
| | temp-sampling | 0.43 | 0.099 | 0.121 | 0.702 | 0.733 | - |
| | reword | 0.345 | 0.111 | 0.144 | 0.794 | 0.618 | 1.5 |
| | rephrase | 0.345 | 0.062 | 0.141 | 0.767 | 0.706 | 1.0 |
| | paraphrase | 0.37 | 0.092 | 0.141 | 0.763 | 0.67 | 1.2 |
| | expansion | 0.36 | 0.138 | 0.138 | 0.799 | 0.574 | 1.5 |

The approximation relies on linearizing the involved functions; nevertheless, it illustrates the effect of both $s^{2}_{topk}$ and $s^{2}_{rephrase}$. In particular, we see that both $s^{2}_{topk}$ and $s^{2}_{rephrase}$ act to temper the probability $p(A|\boldsymbol{z}_{mean},f)$ under the categorical distribution of $g$. This highlights why using rephrasings with an appropriate temperature might improve the calibration in downstream tasks. In previous works, tempering of the categorical distribution has been found to significantly improve the calibration of deep neural networks (Guo et al., 2017).
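As a rough numerical illustration of Eq. (4), the following Python sketch compares the linearized expression with the exact probability obtained when the projected noise is exactly logistic with scale $s$ (an idealization of the "approximately logistic" assumption). The function names and the example values of the margin and of the noise scales are hypothetical and are not taken from the experiments.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic CDF with mu = 0 and s = 1."""
    return 1.0 / (1.0 + math.exp(-a))


def combined_scale(s_topk: float, s_rephrase: float) -> float:
    """Scale s = sqrt(s_topk^2 + s_rephrase^2), as in Proposition 4.1."""
    return math.sqrt(s_topk ** 2 + s_rephrase ** 2)


def p_a_exact(margin: float, s: float) -> float:
    """P(margin + noise > 0) when noise ~ Logistic(0, s).

    `margin` stands for w^T z_mean + b (a hypothetical value here).
    """
    return sigmoid(margin / s)


def p_a_linearized(p_mean: float, s: float) -> float:
    """Right-hand side of Eq. (4): 0.5 + (p_mean - 0.5) / s."""
    return 0.5 + (p_mean - 0.5) / s


margin = 0.4                      # illustrative value only
p_mean = sigmoid(margin)          # plays the role of p(A | z_mean, f)
s = combined_scale(1.0, 1.0)      # illustrative noise scales
print(f"p(A|z_mean,f)  = {p_mean:.4f}")
print(f"exact p_A      = {p_a_exact(margin, s):.4f}")
print(f"linearized p_A = {p_a_linearized(p_mean, s):.4f}")
```

For $s>1$ both values move toward 0.5, which is the tempering effect described above. The linearization is only reasonable for small margins, that is, when the probability is not far from 0.5.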

In Tables 6, 7 and 8 and Figure 3, we present the results for the top-k experiments with and without rephrasing, with $k=40$. We also present the relaxed temperature sampling variant (Wei et al., 2022). We see that the stochasticity of top-40 compared to top-1 decoding (Tables 2, 3 and 4) results in an improvement in calibration but a drop in accuracy. The Brier score often improves at the cost of accuracy. Further stochasticity in answers caused by rephrasings has a similar effect. These observations are consistent with the fact that top-k and nucleus sampling (Holtzman et al., 2019) make text more human-like but not necessarily more “accurate”. However, if the main goal is calibration, the tables and Figure 3 show that, in accordance with Proposition 4.1, rephrasing acts primarily to temper the probability of the top class. This often improves calibration significantly in terms of ECE and AUROC, especially for smaller models. We plot the results of all methods averaged over all models for each dataset in Figure 4. Tables 2, 3, 4, 6, 7 and 8 and Figure 4 indicate that users should assess whether rephrasing is appropriate after analyzing their individual model, task and evaluation metric. However, in general, a hyperparameter-optimized choice of rephrasing + top-1 decoding outperforms or matches all other method combinations in all metrics.
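For readers less familiar with the calibration metric discussed above, the following Python sketch computes a standard binned expected calibration error (ECE) from top-class confidences and correctness indicators. The number of bins (10, equal width), the half-open bin boundaries and the function name are assumptions made for illustration; the binning used to produce the tables is not specified in this section.

```python
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(confidences)
    bins: List[List[int]] = [[] for _ in range(n_bins)]
    for i, c in enumerate(confidences):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes to the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append(i)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(confidences[i] for i in members) / len(members)
        accuracy = sum(1 for i in members if correct[i]) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical example: four predictions at 90% confidence, three of them correct.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

In the rephrasing setting, the confidence of a prediction would be the fraction of rephrased queries that return the majority answer.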

5 Related works

The field of estimating the uncertainty of closed-source LLMs is nascent but fast-growing. Kadavath et al. (2022) propose that, in addition to a query, the user can prompt the LLM to output a numerical confidence value, known as “verbalized confidence”. Crucially, there is no easy statistical justification as to why verbalized confidence should result in calibrated predictions. Uncertainty estimates using verbalized confidence tend to be overly optimistic and concentrate in the 80%-100% confidence range (Xiong et al., 2023). Recently, Pacchiardi et al. (2023) proposed that after submitting a base query the user should ask additional and unrelated binary questions and check the accuracy of the answers. They empirically correlate this to well-calibrated uncertainty but only for the setting where the LLM purposefully lies. Our work is also related to Carlini et al. (2024), who manage to “steal” the last layer of closed-source LLMs using only random queries.

Wang et al. (2022) proposed to leverage multiple chains of thought to derive varied responses. Their findings suggest that a majority vote across these answers not only enhances accuracy but also yields well-calibrated uncertainty estimates. Kuhn et al. (2023) introduced a novel uncertainty quantification metric by sampling a multitude of responses and employing a BERT model to categorize these answers. They then calculated the entropy of the resulting empirical distribution as the uncertainty estimate. This approach has the significant disadvantage of being computationally expensive and requiring access to a secondary LLM.
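The sampling-based estimators discussed above share a simple core: collect several answers, take the majority answer, and summarize the spread of the empirical answer distribution. A minimal Python sketch follows, assuming the answers are already normalized to comparable labels (for example, multiple-choice letters). The semantic grouping step of Kuhn et al. (2023) is not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def vote_and_entropy(answers: List[str]) -> Tuple[str, float, float]:
    """Return (majority answer, its empirical frequency, entropy in nats)."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(answers)
    total = len(answers)
    top_answer, top_count = counts.most_common(1)[0]
    # Entropy of the empirical distribution over distinct answers.
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return top_answer, top_count / total, entropy


# Hypothetical answers returned for four samples of the same question.
answer, confidence, entropy = vote_and_entropy(["B", "B", "A", "B"])
print(answer, confidence, round(entropy, 3))
```

The empirical frequency of the majority answer can be read as a confidence, while the entropy is zero when all sampled answers agree and grows as they disagree.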

Another line of work focuses on rephrasing queries to improve accuracy instead of estimating uncertainty. Specifically, Deng et al. (2023) demonstrated that expanding questions with supplementary details through a zero-shot prompt significantly improves model performance. Zheng et al. (2023) adopted a similar approach by asking the LLM to derive high-level concepts and first principles before reasoning and answering the question, which boosted the performance.

Conversely, another segment of research has delved into the uncertainty estimates derived from the logits associated with multiple-choice questions. This approach entails extracting the logits corresponding to the first token of each option (A, B, C, D) following the question prompt and applying a softmax normalization to obtain the predicted probabilities for the options. Achiam et al. (2023) discovered that while pre-trained models exhibit commendable calibration, the application of Reinforcement Learning from Human Feedback (RLHF) adversely affects calibration. Other studies have endeavored to bolster the calibration of fine-tuned LLMs by employing strategies such as ensembles (Wang et al., 2023) or adopting Bayesian methods (Yang et al., 2023). However, such an approach applies neither to closed-source LLMs, where logits are not available, nor to free-form QA tasks.

A comprehensive body of literature exists on the topic of estimating uncertainty in deep neural network models when access to the softmax categorical distribution is available (Guo et al., 2017; Lakshminarayanan et al., 2017; Blundell et al., 2015; Maddox et al., 2019; Wenzel et al., 2020; Arbel et al., 2023). The most straightforward method involves utilizing the categorical distribution itself as an uncertainty estimate (Guo et al., 2017). Noteworthy enhancements can be achieved by applying tempering to the logits just before the application of the softmax function (Guo et al., 2017).
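To make the tempering of Guo et al. (2017) concrete, the Python sketch below divides a vector of logits by a temperature $T$ before applying the softmax. The logit values are made up for illustration (they could stand for the first-token logits of options A to D), the function name is hypothetical, and in practice $T$ would be tuned on held-out data.

```python
import math
from typing import List


def tempered_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(z - shift) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


option_logits = [2.0, 1.0, 0.5, 0.0]  # illustrative values only
for t in (1.0, 2.0):
    probs = tempered_softmax(option_logits, temperature=t)
    print(t, [round(p, 3) for p in probs])
```

A temperature above 1 flattens the distribution and lowers the top-class probability, whereas a temperature below 1 sharpens it.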

Figure 4: (a) ARC-Challenge, (b) ARC-Easy, (c) OpenBookQA. We plot the AUROC averaged over all models for each dataset and for each uncertainty estimation method. We observe that top-k improves over the naive top-1 decoding. Furthermore, the best rephrasing method (denoted as rephrase*) improves the AUROC significantly in all cases.

6 Discussion

We conducted a thorough analysis of rephrased queries as a method for obtaining calibrated predictions from closed-source LLMs. Notably, we found that two simple methods, making the query more verbose and substituting words with their synonyms, provide a straightforward means of identifying false positives. The appeal of our approach lies in its practicality, as it requires only basic language and arithmetic skills from the end user to obtain meaningful uncertainty estimates. Exciting future directions include learning optimal rephrasing rules in a data-driven manner, to be used in conjunction with a rephrasing LLM. While we tested on the multiple-choice question setting for ease of evaluation, we expect our results to also hold for open-ended text generation.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Arbel et al. (2023) Julyan Arbel, Konstantinos Pitas, Mariia Vladimirova, and Vincent Fortuin. A primer on Bayesian neural networks: review and debates. arXiv preprint arXiv:2309.16314, 2023.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, 2015.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Carlini et al. (2024) Nicholas Carlini, Daniel Paleka, Krishnamurthy Dj Dvijotham, Thomas Steinke, Jonathan Hayase, A Feder Cooper, Katherine Lee, Matthew Jagielski, Milad Nasr, Arthur Conmy, et al. Stealing part of a production language model. arXiv preprint arXiv:2403.06634, 2024.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457v1, 2018.
  • Deng et al. (2023) Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205, 2023.
  • Eloundou et al. (2023) Tyna Eloundou, Sam Manning, Pamela Mishkin, and Daniel Rock. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130, 2023.
  • Fan et al. (2018) Angela Fan, Mike Lewis, and Yann Dauphin. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  889–898, 2018.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pp.  1321–1330. PMLR, 2017.
  • Holtzman et al. (2019) Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2019.
  • Huang et al. (2023) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  • Jiang et al. (2023) Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  • Kadavath et al. (2022) Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, Zac Hatfield-Dodds, Nova DasSarma, Eli Tran-Johnson, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221, 2022.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Kuhn et al. (2023) Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664, 2023.
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  • Maddox et al. (2019) Wesley J Maddox, Pavel Izmailov, Timur Garipov, Dmitry P Vetrov, and Andrew Gordon Wilson. A simple baseline for Bayesian uncertainty in deep learning. Advances in Neural Information Processing Systems, 32:13153–13164, 2019.
  • Mihaylov et al. (2018) Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  • Murphy (1973) Allan H Murphy. A new vector partition of the probability score. Journal of Applied Meteorology and Climatology, 12(4):595–600, 1973.
  • Murphy (2012) Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
  • Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
  • Pacchiardi et al. (2023) Lorenzo Pacchiardi, Alex James Chan, Sören Mindermann, Ilan Moscovitz, Alexa Yue Pan, Yarin Gal, Owain Evans, and Jan M Brauner. How to catch an AI liar: Lie detection in black-box LLMs by asking unrelated questions. In The Twelfth International Conference on Learning Representations, 2023.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Smirnov (1948) Nickolay Smirnov. Table for estimating the goodness of fit of empirical distributions. The annals of mathematical statistics, 19(2):279–281, 1948.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Wang et al. (2023) Xi Wang, Laurence Aitchison, and Maja Rudolph. LoRA ensembles for large language model fine-tuning. arXiv preprint arXiv:2310.00035, 2023.
  • Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • Wenzel et al. (2020) Florian Wenzel, Kevin Roth, Bastiaan S Veeling, Jakub Swiatkowski, Linh Tran, Stephan Mandt, Jasper Snoek, Tim Salimans, Rodolphe Jenatton, and Sebastian Nowozin. How good is the Bayes posterior in deep neural networks really? International Conference on Machine Learning, 2020.
  • Xiong et al. (2023) Miao Xiong, Zhiyuan Hu, Xinyang Lu, Yifei Li, Jie Fu, Junxian He, and Bryan Hooi. Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. arXiv preprint arXiv:2306.13063, 2023.
  • Yang et al. (2023) Adam X Yang, Maxime Robeyns, Xi Wang, and Laurence Aitchison. Bayesian low-rank adaptation for large language models. arXiv preprint arXiv:2308.13111, 2023.
  • Zheng et al. (2023) Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi, Quoc V Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117, 2023.

Appendix A Prompt template

We present our prompt template for initiating rephrases with a one-shot example in Table 9. Note that we only present and rephrase the questions without revealing the answer choices, so that the rephrasings are not biased by the choices.

Method Prompt
reword [INST]Reword the following question:
George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?
Respond with the reworded question only: [\INST]
George seeks to heat his hands swiftly by rubbing them. Which skin area will generate the maximum heat?
[INST]Reword the following question: {question}
Respond with the reworded question only: [\INST]
rephrase [INST]Rephrase the following question:
George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?
Respond with the rephrased question only: [\INST]
What type of skin texture on George’s hands would generate the most heat through rapid rubbing to warm them effectively?
[INST]Rephrase the following question: {question}
Respond with the rephrased question only: [\INST]
paraphrase [INST]Semantically paraphrase the following question:
George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?
Respond with the semantically paraphrased question only: [\INST]
How can George induce the highest thermal output by briskly rubbing his hands, and which part of the skin would be most effective?
[INST]Semantically paraphrase the following question: {question}
Respond with the semantically paraphrased question only: [\INST]
expansion [INST] Expand the following question with additional context:
George wants to warm his hands quickly by rubbing them. Which skin surface will produce the most heat?
Respond with the expanded question only: [\INST]
In the context of seeking immediate relief from the biting cold and understanding the mechanisms behind heat generation through friction, what type of skin texture on George’s hands would most effectively generate heat by rapid rubbing?
[INST]Expand the following question with additional context: {question}
Respond with the expanded question only: [\INST]
Table 9: Prompt templates for one-shot rephrasing, with rephrasing methods listed on the left and corresponding prompt on the right. The user instructions are colored in blue and surrounded by the instruction token, whereas model response demonstrations are colored in orange.
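The one-shot templates of Table 9 can be assembled programmatically. The Python sketch below is one possible way to do so, shown for the reword and rephrase methods only. The placement of line breaks, the helper name `build_rephrase_prompt` and the dictionary layout are assumptions, since the table layout does not fix them; the instruction strings and demonstrations are copied from Table 9.

```python
OPEN_TAG = "[INST]"
# Closing tag copied verbatim from Table 9. Llama-2 and Mistral chat formats
# conventionally close an instruction with "[/INST]"; adjust if needed.
CLOSE_TAG = r"[\INST]"

DEMO_QUESTION = (
    "George wants to warm his hands quickly by rubbing them. "
    "Which skin surface will produce the most heat?"
)

# method -> (instruction, word used in the "Respond with ..." line, demonstration)
TEMPLATES = {
    "reword": (
        "Reword the following question:",
        "reworded",
        "George seeks to heat his hands swiftly by rubbing them. "
        "Which skin area will generate the maximum heat?",
    ),
    "rephrase": (
        "Rephrase the following question:",
        "rephrased",
        "What type of skin texture on George’s hands would generate the most "
        "heat through rapid rubbing to warm them effectively?",
    ),
}


def build_rephrase_prompt(method: str, question: str) -> str:
    """Assemble a one-shot rephrasing prompt (line breaks are an assumption)."""
    instruction, adjective, demonstration = TEMPLATES[method]
    respond = f"Respond with the {adjective} question only: {CLOSE_TAG}"
    lines = [
        f"{OPEN_TAG}{instruction}",           # one-shot demonstration: instruction
        DEMO_QUESTION,                        # one-shot demonstration: question
        respond,
        demonstration,                        # demonstrated model response
        f"{OPEN_TAG}{instruction} {question}",  # actual query to rephrase
        respond,
    ]
    return "\n".join(lines)


print(build_rephrase_prompt("reword", "Which gas do plants absorb from the air?"))
```

The paraphrase and expansion templates in Table 9 follow the same pattern and can be added to the dictionary in the same way.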

We followed the instructions in Xiong et al. (2023) to generate “hint”-based rephrasings. Specifically, to generate a rephrased query given a base query, we appended one of the following three weak claims (as they found that weak claims outperform other types of hints) together with a random class from the available ones.

Table 10: Different Prompts to Induce Consistency Confidence
Consistency Confidence Inducing Prompts
Hint1 I think the answer should be
Hint2 I read online the answer is
Hint3 I vaguely remember the answer is
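A hint-based rephrasing as described above can be sketched in a few lines of Python. The weak claims are those of Table 10; the exact string formatting (a single space before the claim and a trailing period) and the function name `add_hint` are assumptions made for illustration.

```python
import random
from typing import List, Optional

# Weak claims copied from Table 10.
WEAK_CLAIMS = [
    "I think the answer should be",
    "I read online the answer is",
    "I vaguely remember the answer is",
]


def add_hint(
    base_query: str,
    classes: List[str],
    rng: Optional[random.Random] = None,
) -> str:
    """Append a random weak claim and a random class label to the base query."""
    rng = rng or random.Random()
    claim = rng.choice(WEAK_CLAIMS)
    label = rng.choice(classes)
    return f"{base_query} {claim} {label}."


# Hypothetical usage with a seeded generator for reproducibility.
rng = random.Random(0)
for _ in range(3):
    print(add_hint("Which skin surface will produce the most heat?", ["A", "B", "C", "D"], rng))
```

Calling the function several times on the same base query yields the multiple hinted variants whose answers are then compared.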

Appendix B Rephrase generations

In the main text, we present rephrases generated by Mistral-7B in Table 1. Here, we present additional generated rephrasings by Llama2-7B and Llama2-13B in Table 11 and Table 12.

Method Question
original What part of the digestive system first causes chemical changes to food? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
reword What section of the digestive system initiates chemical alterations to sustenance? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
rephrase Which portion of the digestive system initially catalyzes the biochemical transformation of ingested sustenance? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
paraphrase Which digestive organ releases enzymes that initiate chemical breakdown within ingested sustenences, leading to nutrient extraction and energy release? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
expansion In the context of the digestive process and the breakdown of nutrients, which portion of the digestive system initiates the chemical transformations that result in the nutrient absorption and energy production? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
Table 11: Rephrasing examples generated by Llama2-7B, with rephrasing methods listed on the left and corresponding rephrases on the right.
Method Question
original What part of the digestive system first causes chemical changes to food? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
reword Which section of the gastrointestinal tract initiates the chemical transformation of sustenance? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
rephrase In which section of the digestive system does the initial chemical breakdown of food occur? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
paraphrase In the digestive process, where do crucial transformations initially occur to break down nutrients? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
expansion Taking into account that human digestive system’s several organs coordinate to breakdown, absorb, and expel waste, which part of the gastrointestinal system would have the most significant logic-based influence on the breakdown of food into usable components, prior to the nutrient absorption? A. Teeth in the mouth. B. Saliva in the mouth. C. Enzymes in the stomach. D. Enzymes in the small intestine.
Table 12: Rephrasing examples generated by Llama2-13B, with rephrasing methods listed on the left and corresponding rephrases on the right.

Appendix C Additional Proofs

Proposition C.1.

Let $f:\mathcal{X}\rightarrow\mathcal{Y}$ be an LLM, let $\boldsymbol{x}$ be a base query, and let $\mathcal{T}(\boldsymbol{x})\sim\tau$ be some randomized transformation of the base query. Let

\begin{equation}
p_{A}(\boldsymbol{x})=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right), \tag{5}
\end{equation}

be the probability of sampling the most probable answer $A\in\mathcal{Y}$ under transformations $\mathcal{T}(\boldsymbol{x})\sim\tau$. Let $\boldsymbol{z}_{mean}+\epsilon_{rephrase}$ be the latent representation of $\boldsymbol{x}$ under $\mathcal{T}(\boldsymbol{x})$ at the final LLM layer, where $\boldsymbol{z}_{mean}$ is the mean representation and $\epsilon_{rephrase}$ is some additive noise. Let $\boldsymbol{\mathrm{w}}$ be the separating hyperplane between the most probable answer $A$ and the second most probable answer $B$. Assuming that $\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}\sim\rho$ follows a logistic distribution with $\mu=0$ and $s=1$, then

\begin{equation}
p_{A}(\boldsymbol{x})=p(A|\boldsymbol{z}_{mean},f) \tag{6}
\end{equation}

where $p(A|\boldsymbol{z}_{mean},f)$ is the probability of $A$ given $\boldsymbol{z}_{mean}$ under the categorical distribution of the final layer.
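The proposition can be sanity-checked by simulation: if the projected noise is drawn from a logistic distribution with $\mu=0$ and $s=1$, the fraction of draws for which the perturbed margin is positive should approach the two-class softmax probability. The Python sketch below does this; the margin value (standing for $\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b$), the sample size, the seed and the function names are hypothetical choices for illustration.

```python
import math
import random


def sample_logistic(rng: random.Random, mu: float = 0.0, s: float = 1.0) -> float:
    """Draw from Logistic(mu, s) by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); random() returns values in [0, 1)
        u = rng.random()
    return mu + s * math.log(u / (1.0 - u))


def monte_carlo_p_a(margin: float, n_samples: int = 20000, seed: int = 0) -> float:
    """Estimate P(margin + noise > 0) with noise ~ Logistic(0, 1)."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if margin + sample_logistic(rng) > 0)
    return hits / n_samples


def softmax_p_a(margin: float) -> float:
    """Two-class softmax probability of A for the given margin."""
    return 1.0 / (1.0 + math.exp(-margin))


margin = 1.0  # illustrative value only
print(f"Monte Carlo estimate: {monte_carlo_p_a(margin):.3f}")
print(f"Softmax probability:  {softmax_p_a(margin):.3f}")
```

The two numbers should agree up to Monte Carlo error, which shrinks as the number of samples grows.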

Proof.

We first analyze the categorical distribution resulting from applying the softmax on the final layer logits. In the binary classification case, given a top-1 class prediction $A$, the softmax probability of this class is

\begin{equation}
\begin{split}
p(A|\boldsymbol{x},f)&=\frac{e^{\boldsymbol{\mathrm{w}}_{A}^{\top}\boldsymbol{z}+b_{A}}}{e^{\boldsymbol{\mathrm{w}}_{A}^{\top}\boldsymbol{z}+b_{A}}+e^{\boldsymbol{\mathrm{w}}_{B}^{\top}\boldsymbol{z}+b_{B}}}\\
&=\frac{1}{1+e^{-\left((\boldsymbol{\mathrm{w}}_{A}-\boldsymbol{\mathrm{w}}_{B})^{\top}\boldsymbol{z}+(b_{A}-b_{B})\right)}}=\frac{1}{1+e^{-(\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}+b)}},
\end{split}
\tag{7}
\end{equation}

where $\boldsymbol{\mathrm{w}}=\boldsymbol{\mathrm{w}}_{A}-\boldsymbol{\mathrm{w}}_{B}$ and $b=b_{A}-b_{B}$.

The above simply corresponds to the folk knowledge that a softmax layer with two classes is equivalent to a single separating hyperplane that assigns classes based on the rule $\mathrm{sign}\left(\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}+b\right)$, specifically

\begin{equation*}
g(\boldsymbol{z})=\begin{cases}A&\text{if }\left(\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}+b\right)>0,\\ B&\text{otherwise.}\end{cases}
\end{equation*}

After establishing that the softmax layer is equivalent to this single separating hyperplane, let us relate $p_{A}(\boldsymbol{x})$ to $\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}+b$. Writing $Z=\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}$ and denoting its cumulative distribution function by $F$, we have

\begin{equation}
\begin{split}
p_{A}(\boldsymbol{x})&=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right)\\
&=\mathbb{P}\left(\boldsymbol{\mathrm{w}}^{\top}(\boldsymbol{z}_{mean}+\epsilon_{rephrase})+b>0\right)\\
&=\mathbb{P}\left(\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}+b>0\right)\\
&=\mathbb{P}\left(Z>-\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}-b\right)\\
&=1-\mathbb{P}\left(Z<-\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}-b\right)\\
&=1-F\left(-\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}-b\right)
\end{split}
\tag{8}
\end{equation}

Then $F(-\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}-b)=1-p_{A}\iff\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b=-F^{-1}(1-p_{A})$. Substituting this result into Eq. (7), and assuming that $F$ is the cumulative distribution function of the logistic distribution with $\mu=0$ and $s=1$, we get

\begin{align}
p(A|\boldsymbol{z}_{mean},f) &= \frac{1}{1+e^{F^{-1}(1-p_{A})}} \tag{9}\\
&= \frac{1}{1+e^{-F^{-1}(p_{A}(\boldsymbol{x}))}} \tag{10}\\
&= p_{A}(\boldsymbol{x}) \tag{11}
\end{align}

In the second line we used the fact that the inverse cumulative distribution function $F^{-1}$ of the logistic distribution is symmetric around $0.5$, so that $F^{-1}(1-p_{A})=-F^{-1}(p_{A})$. In the third line we used the fact that $\frac{1}{1+e^{-x}}$ is the cumulative distribution function of the logistic distribution with $\mu=0$ and $s=1$. Thus $p(A|\boldsymbol{z}_{mean},f)=F(F^{-1}(p_{A}(\boldsymbol{x})))\iff p(A|\boldsymbol{z}_{mean},f)=p_{A}(\boldsymbol{x})$.
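The identity $p(A|\boldsymbol{z}_{mean},f)=p_{A}(\boldsymbol{x})$ can be checked numerically. The following is a minimal sketch, not part of the original experiments: it assumes a scalar margin standing in for $\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b$, draws $Z$ from a logistic distribution with $\mu=0$ and $s=1$, and compares the sampled frequency of the event $\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b+Z>0$ with the sigmoid of the margin. The function names and the chosen margins are illustrative assumptions.

```python
import numpy as np


def sigmoid(x):
    """CDF of the logistic distribution with mu=0 and s=1."""
    return 1.0 / (1.0 + np.exp(-x))


def simulate_p_A(margin, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(margin + Z > 0) with Z ~ Logistic(0, 1).

    `margin` plays the role of w^T z_mean + b (a hypothetical scalar here).
    """
    rng = np.random.default_rng(seed)
    z = rng.logistic(loc=0.0, scale=1.0, size=n_samples)
    return float(np.mean(margin + z > 0))


# Illustrative margins; any real values could be used.
margins = [0.0, 0.5, 1.0, 2.0]
estimates = {m: simulate_p_A(m) for m in margins}

for m in margins:
    print(f"margin={m:.1f}  sampled p_A={estimates[m]:.3f}  sigmoid(margin)={sigmoid(m):.3f}")
```

Under these assumptions the two columns should agree up to Monte Carlo error, which is what Eqs. (9) to (11) state.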

A technical point remains. In the derivation above we assumed by definition that $g(\boldsymbol{z}_{mean})=A$ (that is, that $\boldsymbol{z}_{mean}$ results in the most probable class), but we still need to show that $A=\mathrm{argmax}_{i}\,\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=i\right)\iff g(\boldsymbol{z}_{mean})=A$. This equivalence means that for a closed-source LLM we can identify the (unknown) top-1 class $A$ through Monte Carlo sampling, as $A=\mathrm{argmax}_{i}\,\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=i\right)$.

\begin{equation}
\begin{split}
A=\mathrm{argmax}_{i}\,\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=i\right) &\iff \mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right)>\frac{1}{2}\\
&\iff \mathbb{P}\left(\boldsymbol{\mathrm{w}}^{\top}(\boldsymbol{z}_{mean}+\epsilon_{rephrase})+b\geq 0\right)>\frac{1}{2}\\
&\iff \mathbb{P}\left(\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+\boldsymbol{\mathrm{w}}^{\top}\epsilon_{rephrase}+b\geq 0\right)>\frac{1}{2}\\
&\iff \mathbb{P}\left(Z\geq-\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}-b\right)>\frac{1}{2}\\
&\iff \mathbb{P}\left(Z\leq\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b\right)>\frac{1}{2}\\
&\iff \boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b>0\\
&\iff g(\boldsymbol{z}_{mean})=A
\end{split}
\tag{12}
\end{equation}

where we use the assumption that $Z$ follows a logistic distribution with $\mu=0$ and $s=1$. ∎
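In practice, the Monte Carlo identification of the top-1 class amounts to a majority vote over the answers returned for the rephrased queries, with the vote share serving as the estimate of $p_{A}(\boldsymbol{x})$. The sketch below illustrates this under stated assumptions: the list of answers is hypothetical, the helper name `top1_and_confidence` is introduced only for illustration, and the multiple-choice letters go beyond the two-class hyperplane setting used in the proof.

```python
from collections import Counter


def top1_and_confidence(answers):
    """Return the most frequent answer and its empirical frequency.

    `answers` holds one model answer per rephrased query. The most frequent
    answer estimates the top-1 class A, and its frequency estimates p_A(x).
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return top_answer, top_count / len(answers)


# Hypothetical answers to ten rephrasings of one multiple-choice question.
sampled_answers = ["B", "B", "A", "B", "C", "B", "B", "A", "B", "B"]
top_answer, p_hat = top1_and_confidence(sampled_answers)
print(f"estimated top-1 class: {top_answer}, estimated p_A: {p_hat:.2f}")
```

With more rephrasings the vote share concentrates around $p_{A}(\boldsymbol{x})$, at the cost of additional queries to the model.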

Proposition C.2.

Let $g:\mathbb{R}^{d_{\eta}}\rightarrow\mathcal{Y}$ be the final encoder layer of an LLM, let $\boldsymbol{x}$ be a base query and let $\mathcal{T}(\boldsymbol{x})\sim\tau$ be some randomized transformation of the base query. Let

\begin{equation}
p_{A}(\boldsymbol{x})=\mathbb{P}\left(f(\mathcal{T}(\boldsymbol{x}))=A\right),
\tag{13}
\end{equation}

be the probability of sampling the most probable answer $f(\boldsymbol{x})=A\in\mathcal{Y}$ under transformations $\mathcal{T}(\boldsymbol{x})\sim\tau$. Let $\boldsymbol{z}_{mean}+\epsilon_{topk}+\epsilon_{rephrase}$ be the latent representation of $\boldsymbol{x}$ under $\mathcal{T}(\boldsymbol{x})$ at the final LLM layer, where $\boldsymbol{z}_{mean}$ is the mean representation, $\epsilon_{topk}$ is additive noise resulting from the top-k decoding, and $\epsilon_{rephrase}$ is additive noise resulting from the rephrasings $\mathcal{T}(\boldsymbol{x})$.
Assuming that $\boldsymbol{\mathrm{w}}^{\top}(\epsilon_{topk}+\epsilon_{rephrase})\sim\rho$ approximately follows a logistic distribution with $\mu=0$ and $s=\sqrt{s^{2}_{topk}+s^{2}_{rephrase}}$, then

\begin{equation}
p_{A}(\boldsymbol{x})\approx 0.5+\frac{1}{\sqrt{s^{2}_{topk}+s^{2}_{rephrase}}}\left(p(A|\boldsymbol{z}_{mean},f)-0.5\right)
\tag{14}
\end{equation}

where $p(A|\boldsymbol{z}_{mean},f)$ is the probability of $A$ given $\boldsymbol{z}_{mean}$ under the categorical distribution of $g$.

Proof.

We first claim that the sum of two logistic random variables with parameters $(\mu_{1},s_{1})$ and $(\mu_{2},s_{2})$ is approximately logistic with parameters $(\mu_{1}+\mu_{2},\sqrt{s^{2}_{1}+s^{2}_{2}})$, by noting that logistic distributions are approximately Gaussian. Then, considering that $p(A|\boldsymbol{z}_{mean},f)=\frac{1}{1+e^{F^{-1}(1-p_{A}(\boldsymbol{x}))}}$, we can write

\begin{equation}
\begin{split}
p(A|\boldsymbol{z}_{mean},f) &= \frac{1}{1+e^{F^{-1}(1-p_{A}(\boldsymbol{x}))}}=\frac{1}{1+e^{-F^{-1}(p_{A}(\boldsymbol{x}))}}\\
&\approx 0.5+\frac{1}{4}F^{-1}(p_{A}(\boldsymbol{x}))\approx 0.5+\frac{1}{4}\,4\sqrt{s^{2}_{topk}+s^{2}_{rephrase}}\,(p_{A}(\boldsymbol{x})-0.5)
\end{split}
\tag{15}
\end{equation}

In the first line we used the fact that $F^{-1}$ for the logistic distribution is symmetric, thus $F^{-1}(1-p_{A}(\boldsymbol{x}))=-F^{-1}(p_{A}(\boldsymbol{x}))$. In the second line we first apply a first-order Taylor expansion around $0$ to $\frac{1}{1+e^{-x}}$ and then a first-order Taylor expansion around $0.5$ to $F^{-1}$. Rearranging for $p_{A}(\boldsymbol{x})$ gives Eq. (14). ∎
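The quality of the linearization in Proposition C.2 can be inspected numerically. For a logistic distribution with scale $s$, the quantile function is $F^{-1}(p)=s\ln\frac{p}{1-p}$, whose slope at $p=0.5$ is $4s$, while $\frac{1}{1+e^{-x}}$ has slope $\frac{1}{4}$ at $x=0$; these are the two first-order expansions used above. The sketch below is an illustration under assumptions, not part of the original experiments: the scalar margin stands in for $\boldsymbol{\mathrm{w}}^{\top}\boldsymbol{z}_{mean}+b$, and the values of `s_topk` and `s_rephrase` are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def exact_p_A(margin, s):
    """P(margin + Z > 0) for Z ~ Logistic(0, s): the logistic CDF at margin."""
    return sigmoid(margin / s)


def linearized_p_A(margin, s):
    """Right-hand side of Eq. (14), with p(A | z_mean, f) = sigmoid(margin)."""
    return 0.5 + (sigmoid(margin) - 0.5) / s


# Hypothetical noise scales for top-k decoding and rephrasing.
s_topk, s_rephrase = 1.0, 1.5
s = np.sqrt(s_topk**2 + s_rephrase**2)

errors = {}
for margin in [0.2, 0.5, 3.0]:
    errors[margin] = abs(exact_p_A(margin, s) - linearized_p_A(margin, s))
    print(f"margin={margin:.1f}  exact={exact_p_A(margin, s):.4f}  "
          f"linearized={linearized_p_A(margin, s):.4f}  abs. error={errors[margin]:.4f}")
```

As expected from a first-order expansion, the approximation is tight when $p(A|\boldsymbol{z}_{mean},f)$ is close to $0.5$ and degrades for large margins, so Eq. (14) is best read as a local statement about how added noise shrinks the sampled confidence toward $0.5$.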

Appendix D Additional comparisons with CoT

We compare with Chain-of-Thought (CoT) (Wei et al., 2022) for uncertainty estimation and report the results in Table 13. We find that our method obtains results competitive with CoT. At the same time, our method is significantly easier and more natural to implement for humans interacting with an LLM via text. In CoT one first needs to obtain a sequence of reasoning steps, which are then used as additional context when asking the LLM to answer the base question again. By contrast, we propose a simple one-step process of rephrasing the base question.

Table 13: Comparisons between our best rephrasing method and CoT. Our rephrasing method obtains comparable results to CoT in terms of Brier score and other calibration metrics.
\begin{tabular}{lllccccc}
\hline
Dataset & Model & Method & Acc $\uparrow$ & ECE $\downarrow$ & TACE $\downarrow$ & Brier $\downarrow$ & AUROC $\uparrow$ \\
\hline
ARC-C & Mistral-7B & CoT & 0.725 & 0.173 & 0.071 & 0.439 & 0.719 \\
 & & expansion & 0.602 & 0.133 & 0.099 & 0.509 & 0.847 \\
\cdashline{2-8}
 & Llama-2-7B & CoT & 0.407 & 0.205 & 0.151 & 0.783 & 0.696 \\
 & & expansion & 0.373 & 0.112 & 0.153 & 0.778 & 0.687 \\
\cdashline{2-8}
 & Llama-2-13B & CoT & 0.369 & 0.137 & 0.148 & 0.782 & 0.729 \\
 & & reword & 0.445 & 0.084 & 0.119 & 0.714 & 0.721 \\
\hline
ARC-E & Mistral-7B & CoT & 0.857 & 0.07 & 0.037 & 0.211 & 0.829 \\
 & & reword & 0.753 & 0.045 & 0.062 & 0.297 & 0.931 \\
\cdashline{2-8}
 & Llama-2-7B & CoT & 0.482 & 0.104 & 0.116 & 0.624 & 0.842 \\
 & & rephrase & 0.535 & 0.131 & 0.117 & 0.603 & 0.830 \\
\cdashline{2-8}
 & Llama-2-13B & CoT & 0.463 & 0.097 & 0.124 & 0.61 & 0.884 \\
 & & expansion & 0.524 & 0.078 & 0.12 & 0.552 & 0.893 \\
\hline
OBQA & Mistral-7B & CoT & 0.662 & 0.153 & 0.083 & 0.501 & 0.762 \\
 & & reword & 0.552 & 0.105 & 0.102 & 0.592 & 0.796 \\
\cdashline{2-8}
 & Llama-2-7B & CoT & 0.39 & 0.185 & 0.145 & 0.805 & 0.713 \\
 & & expansion & 0.362 & 0.083 & 0.138 & 0.775 & 0.678 \\
\cdashline{2-8}
 & Llama-2-13B & CoT & 0.37 & 0.166 & 0.153 & 0.801 & 0.683 \\
 & & rephrase & 0.428 & 0.095 & 0.14 & 0.729 & 0.73 \\
\hline
\end{tabular}