Lindia Tjuatja∗, Valerie Chen∗, Tongshuang Wu, Ameet Talwalkar, Graham Neubig
Carnegie Mellon University, USA
{vchen2, lindiat}@andrew.cmu.edu
Transactions of the Association for Computational Linguistics, vol. 12, pp. 1011–1026, 2024. https://doi.org/10.1162/tacl_a_00685
Action Editor: Kristina Toutanova. Submission batch: 1/2024; Revision batch: 4/2024; Published 9/2024.
© 2024 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license.
Figure 1: Our evaluation framework consists of three steps: (1) generating a dataset of original and modified
questions given a response bias of interest, (2) collecting LLM responses, and (3) evaluating whether the change
in the distribution of LLM responses aligns with known trends about human behavior. We directly apply the same
workflow to evaluate LLM behavior on non-bias perturbations (i.e., question modifications that have been shown
to not elicit a change in response in humans).
obtaining the subjective opinions of large-scale populations (Weisberg et al., 1996) and are used across a diverse set of organizations and applications (Hauser and Shugan, 1980; Morwitz and Pluzinski, 1996; Al-Abri and Al-Balushi, 2014), we believe that our results would be of broad interest to multiple research communities.

We evaluate LLM behavior across 5 different response biases, as well as 3 non-bias perturbations (e.g., typos) that are known to not affect human responses. To understand whether aspects of model architecture (e.g., size) and training schemes (e.g., instruction fine-tuning and RLHF) affect LLM responses to these question modifications, we selected 9 models—including both open models from the Llama2 series and commercial models from OpenAI—to span these considerations. In summary, we find:

(1) LLMs do not generally reflect human-like behaviors as a result of question modifications: All models showed behavior notably unlike humans, such as a significant change in the opposite direction of known human biases and a significant change to non-bias perturbations. Furthermore, unlike humans, models are unlikely to show significant changes due to bias modifications if they are more uncertain about their original responses.

(2) Behavioral trends of RLHF-ed models tend to differ from those of vanilla LLMs: Among the Llama2 base and chat models, we find that RLHF-ed chat models demonstrate less significant changes to question modifications as a result of response biases but are more affected by non-bias perturbations than their non-RLHF-ed counterparts, highlighting the potentially undesirable effects of additional training schemes.

(3) There is little correspondence between exhibiting response biases and other desirable metrics for survey design: We find that a model's ability to replicate human opinion distributions is not indicative of how well an LLM reflects human behavior.

These results suggest the need for care and caution when considering the use of LLMs as human proxies, as well as the importance of building more extensive evaluations that disentangle the nuances of how LLMs may or may not behave similarly to humans.

2 Methodology

In this section, we overview our evaluation framework, which consists of three parts (Figure 1): (1) dataset generation, (2) collection of LLM responses, and (3) analysis of LLM responses.

2.1 Dataset Generation

When evaluating whether humans exhibit hypothesized response biases, prior social science
studies typically design a set of control questions and a set of treatment questions, which are intended to elicit the hypothesized bias (McFarland, 1981; Gordon, 1987; Hippler and Schwarz, 1987; Schwarz et al., 1992, inter alia). In line with this methodology, we similarly create sets of questions (q, q′) ∈ Q that contain both original (q) and modified (q′) forms of multiple-choice questions to study whether an LLM exhibits a response bias behavior given a change in the prompt.

The first set of question pairs Qbias is one where q′ corresponds to questions that are modified in a way that is known to induce that particular bias in humans. However, we may also want to know

tions is relatively straightforward, and the impact of such biases on human decision outcomes has been explicitly demonstrated in prior studies with humans. We generate a dataset with a total of 2578 question pairs, covering 5 biases and 3 non-bias perturbations. The modified forms of the questions for each bias were generated either by modifying them manually ourselves (as was the case for acquiescence and allow/forbid) or by applying systematic modifications such as automatically appending an option, removing an option, or reversing the order of options (for odd/even, opinion float, and response order). The specific breakdown of the number of questions by bias type is
Acquiescence: For questions where respondents are asked to agree or disagree with a given statement, respondents tend to agree with the statement (Choi and Pak, 2005).
q: Thinking about the US as a whole, do you think this country is now
   A. More united than before the coronavirus outbreak
   B. More divided than before the coronavirus outbreak
q′: Wouldn't you agree that the United States is more united now than it was before the coronavirus outbreak?
   A. Yes
   B. No

Allow/forbid asymmetry: Certain word pairings may elicit different responses, despite entailing the same result. A well-studied example is asking whether an action should be ''not allowed'' or ''forbidden'' (Hippler and Schwarz, 1987).
q: In your opinion, is voting a privilege that comes with responsibilities and can be limited if adult U.S. citizens don't meet some requirements?
   A. Yes
   B. No
q′: In your opinion, is voting a fundamental right for every adult U.S. citizen and should not be forbidden in any way?
   A. Yes
   B. No

Response order: In written surveys, respondents have been shown to display primacy bias, i.e., preferring options at the top of a list (Ayidiya and McClendon, 1990).

Opinion floating: When both a middle option and a ''don't know'' option are provided in a scale with an odd number of responses, respondents who do not have a stance are more likely to distribute their responses across both options than when only the middle option is provided (Schuman and Presser, 1996).
q: As far as you know, how many of your neighbors have the same political views as you
   A. All of them
   B. Most of them
   C. About half
   D. Only some of them
   E. None of them
q′: As far as you know, how many of your neighbors have the same political views as you
   A. All of them
   B. Most of them
   C. About half
   D. Only some of them
   E. None of them
   F. Don't know

Odd/even scale effects: When a middle alternative is omitted (transforming the scale from an odd to an even one), responses tend to stay near the scale midpoint more often than at the extreme points (e.g., Reduced somewhat vs. Reduced a great deal) (O'Muircheartaigh et al., 2001).
q: Thinking about the size of America's military, do you think it should be
   A. Reduced a great deal
   B. Reduced somewhat
   C. Increased somewhat
   D. Increased a great deal
q′: Thinking about the size of America's military, do you think it should be
   A. Reduced a great deal
   B. Reduced somewhat
   C. Kept about as is
   D. Increased somewhat
   E. Increased a great deal

Key typo: With a low probability, we randomly change one letter in each word (Rawlinson, 2007).
q: How likely do you think it is that the following will happen in the next 30 years? A woman will be elected U.S. president
q′: How likely do you think it is that the following will happen in the next 30 yeans? A woman wilp we elected U.S. president

Letter swap: We perform one swap per word but do not alter the first or last letters. For this reason, this noise is only applied to words of length ≥ 4 (Rawlinson, 2007).
q: Overall, do you think science has made life easier or more difficult for most people?
q′: Ovearll, do you tihnk sicence has made life eaiser or more diffiuclt for most poeple?

Middle random: We randomize the order of all the letters in a word, except for the first and last (Rawlinson, 2007). Again, this noise is only applied to words of length ≥ 4.
q: Do you think that private citizens should be allowed to pilot drones in the following areas? Near people's homes
q′: Do you thnik that pvarite citziens sluhod be aewolld to piolt derons in the flnowolig areas? Near people's heoms

Table 1: To evaluate LLM behavior as a result of response bias modifications and non-bias perturbations, we create sets of questions (q, q′) ∈ Q that contain both original (q) and modified (q′) forms of multiple-choice questions. We define and provide example (q, q′) pairs for each response bias and non-bias perturbation considered in our experiments.
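The systematic modifications for the odd/even, opinion floating, and response order settings are simple operations on the option list; the acquiescence and allow/forbid rewrites were written manually, as noted above. The sketch below illustrates the automatic modifications under our own assumed representation of a question (a stem plus an ordered list of options); the helper names are illustrative, not from the paper's released code.

```python
import copy

def add_dont_know(question):
    """Opinion floating: append a "Don't know" option to an odd-numbered scale."""
    q = copy.deepcopy(question)
    q["options"].append("Don't know")
    return q

def remove_middle(question):
    """Odd/even: drop the middle alternative of an odd-numbered scale."""
    q = copy.deepcopy(question)
    q["options"].pop(len(q["options"]) // 2)
    return q

def reverse_options(question):
    """Response order: list the options in reverse order."""
    q = copy.deepcopy(question)
    q["options"] = list(reversed(q["options"]))
    return q

military = {
    "stem": "Thinking about the size of America's military, do you think it should be",
    "options": ["Reduced a great deal", "Reduced somewhat", "Kept about as is",
                "Increased somewhat", "Increased a great deal"],
}
print(remove_middle(military)["options"])   # drops "Kept about as is"
print(add_dont_know(military)["options"])   # appends "Don't know"
```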
We selected LLMs to evaluate based on multiple axes of consideration: open-weight versus closed-weight models, whether the model has been instruction fine-tuned, whether the model has undergone reinforcement learning with human feedback (RLHF), and the number of model parameters. We evaluate a total of nine models, which include variants of Llama2 (Touvron
Bias type: Δb
Acquiescence: count(q′[a]) − count(q[a])
Allow/forbid: count(q[b]) − count(q′[a])
Response order: count(q′[d]) − count(q[a])
Opinion floating: count(q[c]) − count(q′[c])
Odd/even scale: count(q′[b]) + count(q′[d]) − count(q[b]) − count(q[d])

Table 2: We measure the change resulting from bias modifications for a given question pair (q, q′) by looking at the change in the response distributions between Dq and Dq′ with respect to the relevant response options for each bias.

To determine whether there is a consistent deviation across all questions, we compute the average change Δ̄b across all questions and conduct a Student's t-test where the null hypothesis is that Δ̄b for a given model and bias type is 0. Together, the p-value and direction of Δ̄b inform us whether we observe a significant change across questions that aligns with known human behavior.6 We then evaluate LLMs on Qperturb following the same process (i.e., selecting the subset of relevant response options for the bias) to compute Δp, with the expectation that across questions Δ̄p should not be statistically different from 0.
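As a concrete illustration of this procedure, the sketch below computes the per-question change for the acquiescence bias from Table 2 (count(q′[a]) − count(q[a])), averages it into Δ̄b, and runs the one-sample Student's t-test against 0. It assumes each model's sampled responses per question are stored as lists of option letters; the data layout and function names are ours.

```python
import numpy as np
from scipy.stats import ttest_1samp

def delta_acquiescence(orig_responses, mod_responses):
    """Per-question change for acquiescence (Table 2): count(q'[a]) - count(q[a])."""
    return mod_responses.count("A") - orig_responses.count("A")

def mean_delta_with_ttest(question_pairs):
    """Average change over all question pairs plus a one-sample t-test
    of the null hypothesis that the mean change is 0."""
    deltas = [delta_acquiescence(orig, mod) for orig, mod in question_pairs]
    t_stat, p_value = ttest_1samp(deltas, popmean=0.0)
    return float(np.mean(deltas)), float(t_stat), float(p_value)

# toy example: two (q, q') pairs with five sampled responses each
pairs = [
    (["A", "B", "B", "A", "B"], ["A", "A", "B", "A", "A"]),
    (["B", "B", "A", "B", "B"], ["A", "B", "A", "A", "B"]),
]
mean_delta, t, p = mean_delta_with_ttest(pairs)
print(f"mean delta_b = {mean_delta:.2f}, t = {t:.2f}, p = {p:.3f}")
```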
Figure 2: We compare LLMs' behavior on bias types (Δ̄b) with their respective behavior on the set of perturbations (Δ̄p). We color cells that have statistically significant changes by the directionality of Δ̄b (blue indicates a positive effect and orange indicates a negative effect), using a p = 0.05 cut-off, and use hatched cells to indicate non-significant changes. A full table with Δ̄b and Δ̄p values and p-values is in Table 4. While we would ideally observe that models are only responsive to the bias modifications and are not responsive to the other perturbations, as shown in the top-right ''most human-like'' depiction, the results do not generally reflect the ideal setting.
an increase or decrease in Δ̄b (e.g., allow/forbid and opinion float for the base Llama2 7b to 70b).

3.2 Comparing Base Models with Their Modified Counterparts

Instruction fine-tuning and RLHF can improve a model's ability to generalize to unseen tasks (Wei et al., 2022a; Sanh et al., 2022) and to be steered towards a user's intent (Ouyang et al., 2022); how do these training schemes affect other abilities, such as exhibiting human-like response biases? To disentangle the effect of these additional training schemes, we focus our comparisons on base Llama2 models with their instruction fine-tuned (Solar, chat) and RLHF-ed (chat) counterparts. As we do not observe a clear effect from instruction fine-tuning,7 we center our analysis on the use of RLHF by comparing the base models with their chat counterparts:

RLHF-ed models are less sensitive to bias-inducing changes than their vanilla counterparts. We find that the base models are more likely to exhibit a change for the bias modifications, especially for those with changes in the wording of the question like acquiescence and allow/forbid. An interesting exception is odd/even, where all but one of the RLHF-ed models (3.5 turbo instruct) have a larger positive effect size than the Llama2 base models. Insensitivity to bias modifications may be more desirable if we want an LLM to simulate a ''bias-resistant'' user, but not necessarily if we want it to be affected by the same changes as humans more broadly.

RLHF-ed models tend to show more significant changes resulting from perturbations. We also see that RLHF-ed models tend to show a larger magnitude of effect sizes among the non-bias perturbations. For every perturbation setting that has a significant effect in both model pairs, the RLHF-ed chat models have a greater magnitude of effect size in 21 out of 27 of these settings and have on average a 68% larger effect size than the base model, a noticeably less human-like—and arguably generally less desirable—behavior.

7 We note that SOLAR and the Llama2 chat models use different fine-tuning datasets, which may mask potential common effects of instruction fine-tuning more broadly.

4 Examining the Effect of Uncertainty

In addition to studying the presence of response biases, prior social psychology studies have also
found that when people are more confident about their opinions, they are less likely to be affected by these question modifications (Hippler and Schwarz, 1987). We measure whether LLMs exhibit similar behavior and capture LLM uncertainty using the normalized entropy of the answer distributions of each question,

$-\frac{\sum_{i=1}^{n} p_i \log_2 p_i}{\log_2 n}$   (1)

where n is the number of multiple-choice options, to allow for a fair comparison across the entire dataset, where questions vary in the number of options.

Figure 3: Representativeness is a metric based on the Wasserstein distance which measures the extent to which each model reflects the opinions of a population,
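A minimal sketch of Equation (1), assuming a question's answer distribution is given as raw counts over its n options (the function name is ours):

```python
import numpy as np

def normalized_entropy(counts):
    """-sum_i p_i log2 p_i / log2 n, where n is the number of options.
    Returns a value in [0, 1]; higher means the model is more uncertain."""
    counts = np.asarray(counts, dtype=float)
    n = len(counts)
    p = counts / counts.sum()
    p = p[p > 0]                       # treat 0 * log2(0) as 0
    return float(-(p * np.log2(p)).sum() / np.log2(n))

print(normalized_entropy([55, 25, 15, 5]))    # fairly confident -> closer to 0
print(normalized_entropy([25, 25, 25, 25]))   # uniform -> 1.0
```

Because the entropy is divided by log2 n, questions with different numbers of options land on the same [0, 1] scale, which is what allows the comparison across the entire dataset described above.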
6 Related Work

LLM Sensitivity to Prompts. A growing set of work aims to understand how LLMs may be sensitive to prompt constructions. These works have studied a variety of permutations of prompts which include—but are not limited to—adversarial prompts (Wallace et al., 2019; Perez and Ribeiro, 2022; Maus et al., 2023; Zou et al., 2023), changes in the order of in-context examples (Lu et al., 2022), changes in multiple-choice questions (Zheng et al., 2023; Pezeshkpour and Hruschka, 2023), and changes in formatting of few-shot examples (Sclar et al., 2023). While this

such as generating an answer to a free-response question, versus comparisons of closed-form outcomes, where LLMs generate a label based on a fixed set of response options. Since the open-ended tasks typically rely on human judgments to determine whether LLM behaviors are perceived to be sufficiently human-like (Park et al., 2022, 2023a), we focus on closed-form tasks, which allows us to more easily find broader quantitative trends and enables scalable evaluations.

Prior works have conducted evaluations of LLM and human outcomes on a number of real-world tasks including social science studies (Park et al., 2023b; Aher et al., 2023; Horton, 2023;
Furthermore, we show that the ability of a language model to replicate human opinion distributions generally does not correspond to its ability to show human-like response biases. Taken together, we believe our results highlight the limitations of using LLMs as human proxies in survey design and the need for more critical evaluations to further understand the set of similarities or dissimilarities with humans.

8 Limitations

In this work, the focus of our experiments was on English-based and U.S.-centric survey questions

References

the Association for Computational Linguistics (Volume 1: Long Papers), pages 819–862.

Stephen A. Ayidiya and McKee J. McClendon. 1990. Response effects in mail surveys. Public Opinion Quarterly, 54(2):229–247. https://doi.org/10.1086/269200

Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.

Ian Brace. 2018. Questionnaire Design: How to Plan, Structure and Write Survey Material
Kumaran, James L. McClelland, and Felix Hill. 2022. Language models show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051.

Danica Dillion, Niket Tandon, Yuling Gu, and Kurt Gray. 2023. Can AI language models replace human participants? Trends in Cognitive Sciences. https://doi.org/10.1016/j.tics.2023.04.008, PubMed: 37173156

Esin Durmus, Karina Nyugen, Thomas I. Liao, Nicholas Schiefer, Amanda Askell, Anton Bakhtin, Carol Chen, Zac Hatfield-Dodds, Danny Hernandez, Nicholas Joseph, Liane

John R. Hauser and Steven M. Shugan. 1980. Intensity measures of consumer preference. Operations Research, 28(2):278–320. https://doi.org/10.1287/opre.28.2.278

Hans-J. Hippler and Norbert Schwarz. 1987. Response effects in surveys. In Social Information Processing and Survey Methodology, pages 102–122. Springer. https://doi.org/10.1007/978-1-4612-4798-2_6

John J. Horton. 2023. Large language models as simulated economic agents: What can we learn from homo silicus? Working Paper 31122, National Bureau of Economic Research.
(Volume 1: Long Papers), pages 8086–8098, Dublin, Ireland. Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.acl-long.556

Natalie Maus, Patrick Chao, Eric Wong, and Jacob R. Gardner. 2023. Black box adversarial prompting for foundation models. In The Second Workshop on New Frontiers in Adversarial Machine Learning.

McKee J. McClendon. 1991. Acquiescence and recency response-order effects in interview surveys. Sociological Methods & Research, 20(1):60–103. https://doi.org/10.1177

https://doi.org/10.29115/SP-2014-0013

Colm A. O'Muircheartaigh, Jon A. Krosnick, and Armin Helic. 2001. Middle Alternatives, Acquiescence, and the Quality of Questionnaire Data. Irving B. Harris Graduate School of Public Policy Studies, University of Chicago.

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F.
22(1):26–27. https://doi.org/10.1109/MAES.2007.327521

Keisuke Sakaguchi, Kevin Duh, Matt Post, and Benjamin Van Durme. 2017. Robsut wrod reocginiton via semi-character recurrent neural network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31. https://doi.org/10.1609/aaai.v31i1.10970

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M. Saiful Bari, Canwen

els' sensitivity to spurious features in prompt design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.

Arabella Sinclair, Jaap Jumelet, Willem Zuidema, and Raquel Fernández. 2022. Structural persistence in language models: Priming as a window into abstract language representations. Transactions of the Association for Computational Linguistics, 10:1031–1050. https://doi.org/10.1162/tacl_a_00504

Lindia Tjuatja, Emmy Liu, Lori Levin, and Graham Neubig. 2023. Syntax and semantics
NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2153–2162, Hong Kong, China. Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1221

Albert Webson, Alyssa Loo, Qinan Yu, and Ellie Pavlick. 2023. Are language models worse than humans at following prompts? It's complicated. In Findings of the Association for Computational Linguistics: EMNLP 2023,

A Stimuli Implementation

A.1 American Trends Panel Details

The link to the full ATP dataset. We use a subset of the dataset that has been formatted into CSVs from Santurkar et al. (2023). Since our study is focused on subjective questions, we further filtered for opinion-based questions, so questions asking about people's daily habits (e.g., how often they smoke) or other ''factual'' information (e.g., if they are married) are out-of-scope. Note that the Pew Research Center bears no responsibility for the analyses or interpretations of the data pre-
To construct the modified questions, we manually added a middle option to questions with even-numbered scales (when there was a logical middle addition) and removed the middle option for questions with odd-numbered scales.

Opinion Floating (Schuman and Presser, 1996). We used the same set of questions as with the odd/even bias but instead of removing the middle option, we added a ''don't know'' option.

Letter Swap (Rawlinson, 2007). We sample an index (excluding the first and last letters) from each word and swap the character at that

Middle Random (Rawlinson, 2007). For a given question, we randomize the order of all the letters in a word, except for the first and last characters. Again, this perturbation is only applied to words of length ≥ 4.

We did not apply non-bias perturbations to any words that contain numeric values or punctuation to prevent completely non-sensical outputs.

Model: Average # of queries
Llama2-7b: 69.63
Llama2-13b: 56.93
Llama2-70b: 22.36
Llama2-7b-chat: 32.77
Llama2-13b-chat: 12.99
Llama2-70b-chat: 2.05
SOLAR: 1.00
GPT-3.5-turbo: 1.00
GPT-3.5-turbo-instruct: 1.20

Table 3: Average number of queries (100 single-token responses per query) required to generate

This prompt is used for all models. We have the models generate only one token with a temperature of 1.

Please answer the following question with one of the alphabetical options provided.

Question: [question]
A. [option]
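A sketch of the three non-bias perturbations as described above: one adjacent-character swap per word for letter swap, a full shuffle of interior letters for middle random, and a low-probability single-letter substitution for key typo. Short words and words containing digits or punctuation are left untouched, per the note above; the exact sampling details and the probability used for key typo are our assumptions, not taken from the released implementation.

```python
import random
import string

def letter_swap(word, rng):
    """Swap one interior character with its right neighbor (first/last untouched)."""
    if len(word) < 4 or not word.isalpha():
        return word
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def middle_random(word, rng):
    """Shuffle all interior characters, keeping the first and last fixed."""
    if len(word) < 4 or not word.isalpha():
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def key_typo(word, rng, prob=0.1):
    """With low probability, replace one randomly chosen letter of the word."""
    if not word.isalpha() or rng.random() > prob:
        return word
    i = rng.randrange(len(word))
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]

def perturb(text, word_fn, seed=0):
    """Apply a word-level perturbation to every whitespace-separated token."""
    rng = random.Random(seed)
    return " ".join(word_fn(w, rng) for w in text.split())

print(perturb("Overall do you think science has made life easier", middle_random))
```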
a relatively high number of queries, they were free to query and thus did not present a prohibitive cost for experimentation.

B.4 Initial Explorations with GPT-4

In addition to the models above, we also attempted to use GPT-4-0613 in our experimental setup, but found it difficult to generate valid responses for many questions, most likely due to OpenAI's generation guardrails. As an initial experiment, we tried generating 50 responses per question for all (q, q′) in Qbias (747 questions × 2 conditions) and counting the number of valid responses that GPT-4 generated out of the 50. On average, GPT-4 generated few valid responses per question, with nearly a quarter of the questions having 0 valid responses. For these questions, GPT-4 tended to generate ''As'' or ''This'' (and when set to generate more tokens, GPT-4 generated ''As a language model'' or ''This is subjective'' as the start of its response).

This is in stark contrast to GPT-3.5, which had an average of ∼48 valid responses per question, with none of the questions having 0 valid responses. Histograms of the ratio of valid responses are shown in Figure 4. Based on these observations, the number of repeated queries that would be required for evaluating GPT-4 would be prohibitively expensive and potentially infeasible.

Figure 4: Histograms of the ratio of valid responses (out of 50) across all 1494 question forms (q and q′). GPT-4 has 750/1494 question forms with fewer than 5 valid responses, whereas GPT-3.5-turbo has only 15.
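For reference, the validity check implied here can be as simple as testing whether a sampled single-token completion names one of the listed option letters; the sketch below is our own illustration, not the authors' code.

```python
def valid_response_ratio(completions, num_options):
    """Fraction of sampled completions that are a listed option letter
    (e.g., "A"-"E"), as opposed to refusal prefixes like "As" or "This"."""
    letters = {chr(ord("A") + i) for i in range(num_options)}
    valid = sum(1 for c in completions if c.strip().rstrip(".") in letters)
    return valid / len(completions)

# GPT-4-style refusal starts are not counted as valid
print(valid_response_ratio(["A", "B", "As", "This", "C."], num_options=5))  # 0.6
```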
Model | Bias type | Δ̄b | p | Δ̄p (key typo) | p | Δ̄p (middle random) | p | Δ̄p (letter swap) | p | Pearson r | p
Llama2 7b | Acquiescence | 1.921 | 0.021 | −3.920 | 0.007 | −4.480 | 0.000 | −4.840 | 0.004 | 0.182 | 0.015
Llama2 7b | Response Order | 24.915 | 0.000 | 1.680 | 0.382 | −0.320 | 0.871 | 2.320 | 0.151 | −0.503 | 0.000
Llama2 7b | Odd/even | 1.095 | 0.206 | 0.720 | 0.625 | 1.360 | 0.355 | 1.680 | 0.221 | −0.102 | 0.255
Llama2 7b | Opinion Float | 4.270 | 0.000 | 0.720 | 0.625 | 1.360 | 0.355 | 1.680 | 0.221 | −0.252 | 0.004
Llama2 7b | Allow/forbid | −60.350 | 0.000 | −5.400 | 0.007 | −10.250 | 0.000 | −7.700 | 0.000 | −0.739 | 0.000
Llama2 13b | Acquiescence | −11.852 | 0.000 | −6.800 | 0.001 | −5.760 | 0.000 | −9.320 | 0.000 | −0.412 | 0.000
Llama2 13b | Response Order | 45.757 | 0.000 | 11.600 | 0.000 | 11.640 | 0.000 | 11.720 | 0.000 | −0.664 | 0.000
Llama2 13b | Odd/even | −3.492 | 0.000 | 5.840 | 0.000 | 3.600 | 0.031 | 4.000 | 0.007 | 0.192 | 0.031
Llama2 13b | Opinion Float | 4.127 | 0.000 | 5.840 | 0.000 | 3.600 | 0.031 | 4.000 | 0.007 | −0.023 | 0.799
Llama2 13b | Allow/forbid | −55.100 | 0.000 | −9.100 | 0.000 | −5.700 | 0.000 | −7.600 | 0.000 | −0.739 | 0.000
Llama2 70b | Acquiescence | 7.296 | 0.000 | −2.440 | 0.218 | −3.080 | 0.173 | −3.320 | 0.146 | −0.018 | 0.809
Llama2 70b | Response Order | 5.122 | 0.000 | −1.080 | 0.597 | 3.240 | 0.113 | 2.000 | 0.306 | −0.140 | 0.021
Llama2 70b | Odd/even | 12.191 | 0.000 | 0.920 | 0.540 | 0.600 | 0.687 | −0.800 | 0.618 | 0.120 | 0.179
Llama2 70b | Opinion Float | 2.444 | 0.000 | 0.920 | 0.540 | 0.600 | 0.687 | −0.800 | 0.618 | −0.033 | 0.714
Llama2 70b | Allow/forbid | −42.200 | 0.000 | −6.200 | 0.004 | 2.250 | 0.332 | 0.350 | 0.877 | −0.628 | 0.000
Llama2 7b-chat | Acquiescence | 1.136 | 0.647 | −7.807 | 0.000 | −12.034 | 0.000 | −5.546 | 0.000 | −0.099 | 0.189
Llama2 7b-chat | Response Order | −9.801 | 0.000 | 7.173 | 0.000 | 12.679 | 0.000 | 1.594 | 0.253 | −0.315 | 0.000
Llama2 7b-chat | Odd/even | 20.079 | 0.000 | 8.460 | 0.000 | 15.810 | 0.000 | 9.175 | 0.000 | −0.315 | 0.000
Llama2 7b-chat | Opinion Float | −1.254 | 0.283 | 8.460 | 0.000 | 15.801 | 0.000 | 9.175 | 0.000 | −0.086 | 0.339
Llama2 7b-chat | Allow/forbid | −7.050 | 0.367 | −18.700 | 0.000 | −24.600 | 0.000 | −16.200 | 0.002 | −0.161 | 0.321
Llama2 13b-chat | Acquiescence | 1.909 | 0.434 | −9.239 | 0.000 | −11.534 | 0.000 | −5.284 | 0.000 | −0.095 | 0.209
Llama2 70b-chat | Acquiescence | 11.114 | 0.000 | 2.320 | 0.523 | −5.280 | 0.312 | 4.040 | 0.166 | 0.452 | 0.000
Llama2 70b-chat | Response Order | −0.495 | 0.745 | 0.200 | 0.904 | 15.040 | 0.002 | 1.200 | 0.459 | 0.465 | 0.000
Llama2 70b-chat | Odd/even | 26.476 | 0.000 | 3.280 | 0.210 | −2.040 | 0.656 | −7.240 | 0.018 | −0.231 | 0.009
Llama2 70b-chat | Opinion Float | 1.556 | 0.039 | 3.280 | 0.210 | −2.040 | 0.656 | −7.240 | 0.018 | 0.440 | 0.000
Llama2 70b-chat | Allow/forbid | 4.000 | 0.546 | −4.750 | 0.258 | −16.000 | 0.021 | −0.950 | 0.811 | 0.280 | 0.080
SOLAR | Acquiescence | 18.511 | 0.000 | 0.120 | 0.970 | 2.560 | 0.596 | 0.600 | 0.833 | 0.187 | 0.013
SOLAR | Response Order | 9.683 | 0.000 | 2.280 | 0.336 | 8.680 | 0.012 | 4.360 | 0.017 | 0.248 | 0.000
SOLAR | Odd/even | 17.508 | 0.000 | 0.480 | 0.815 | 2.960 | 0.223 | −1.000 | 0.661 | −0.385 | 0.000
SOLAR | Opinion Float | 1.921 | 0.017 | 0.480 | 0.815 | −2.960 | 0.223 | −1.000 | 0.661 | 0.291 | 0.001
SOLAR | Allow/forbid | 6.800 | 0.207 | −2.950 | 0.343 | −8.500 | 0.131 | −8.050 | 0.001 | 0.145 | 0.373
GPT-3.5 Turbo | Acquiescence | 5.523 | 0.040 | −11.720 | 0.008 | −28.680 | 0.000 | −19.120 | 0.000 | 0.334 | 0.000
GPT-3.5 Turbo | Response Order | −2.709 | 0.147 | 4.960 | 0.121 | 15.960 | 0.002 | 8.000 | 0.011 | 0.198 | 0.001
GPT-3.5 Turbo | Odd/even | 25.048 | 0.000 | −5.480 | 0.082 | −14.800 | 0.001 | −5.800 | 0.062 | −0.273 | 0.002
GPT-3.5 Turbo | Opinion Float | −11.905 | 0.000 | −5.480 | 0.082 | −14.800 | 0.001 | −5.800 | 0.062 | 0.467 | 0.000
GPT-3.5 Turbo | Allow/forbid | 25.300 | 0.000 | −12.000 | 0.008 | −23.200 | 0.001 | −6.950 | 0.058 | 0.206 | 0.202
GPT-3.5 Turbo Instruct | Acquiescence | 6.455 | 0.024 | 2.600 | 0.445 | −11.800 | 0.008 | −2.800 | 0.326 | 0.334 | 0.000
GPT-3.5 Turbo Instruct | Response Order | −11.114 | 0.000 | 3.880 | 0.169 | 11.920 | 0.001 | 3.800 | 0.147 | 0.275 | 0.000
GPT-3.5 Turbo Instruct | Odd/even | 2.032 | 0.390 | 1.560 | 0.433 | −7.120 | 0.061 | −0.840 | 0.711 | −0.073 | 0.416
GPT-3.5 Turbo Instruct | Opinion Float | 0.143 | 0.891 | 1.560 | 0.433 | −7.120 | 0.061 | −0.840 | 0.711 | 0.360 | 0.000
GPT-3.5 Turbo Instruct | Allow/forbid | 8.550 | 0.111 | −4.500 | 0.216 | −10.050 | 0.139 | 4.100 | 0.261 | 0.437 | 0.005

Table 4: Δ̄b for each bias type and the associated p-value from the t-test, as well as Δ̄p for the three perturbations and the associated p-values from the t-test. We also report the Pearson r statistic between model uncertainty and the magnitude of Δb.