Abstract
Large language models (LLMs) such as ChatGPT have garnered global attention recently, with a promise to
disrupt and revolutionize business operations. As managers rely more on artificial intelligence (AI) technology,
there is an urgent need to understand whether there are systematic biases in AI decision-making since they
are trained with human data and feedback, and both may be highly biased. This paper tests a broad range
of behavioral biases commonly found in humans that are especially relevant to operations management. We
found that although ChatGPT can be much less biased and more accurate than humans in problems with an explicit mathematical or probabilistic nature, it also exhibits many of the biases humans possess, especially when the problems are complicated, ambiguous, and implicit. It may suffer from conjunction bias and probability
weighting. Its preference can be influenced by framing, the salience of anticipated regret, and the choice of
reference. ChatGPT also struggles to process ambiguous information and evaluates risks differently from
humans. It may also produce responses similar to heuristics employed by humans, and is prone to confirmation
bias. To make these issues worse, ChatGPT is highly overconfident. Our research characterizes ChatGPT’s
behaviors in decision-making and showcases the need for researchers and businesses to consider potential AI
behavioral biases when developing and employing AI for business operations.
1. Introduction
Large language models (LLMs) are massive machine learning algorithms that process and generate text, such
as PaLM, LaMDA, RoBERTa, and Generative Pre-trained Transformers (GPTs).1 ChatGPT has recently
impressed the world with articulate conversations, broad general knowledge, and remarkable problem-solving
abilities in operations management tasks (Terwiesch, 2023). Many predict AI will soon replace humans in
some jobs, while even more jobs will require a human to work with AI. It thus seems inevitable that AI will
be more involved in operational decision-making. For example, currently, a manager may need to determine
an ordering quantity based on a combination of knowledge, experience, and some algorithm/model. In the
near future, she may also consider the suggestion provided by an AI like ChatGPT.
Using AI to assist in business decision-making is attractive for various reasons. For example, on the surface,
∗ Corresponding author: [email protected]; he did most of the work, other authors contributed equally.
1 The recent versions of GPT include GPT-3, GPT-3.5, and ChatGPT by OpenAI. Released in May 2020, GPT-3 had 175
billion parameters. GPT-3.5 (InstructGPT) was released in Jan 2022. It was based on GPT-3 but with additional fine-tuning
with human feedback. ChatGPT was released in Nov 2022. It was based on GPT-3.5 but with even more human guidance.
The field of behavioral operations management has established that various cognitive and behavioral biases
in human workers/managers often result in suboptimal business decisions. At the same time, consumers
also exhibit biases in their decision-making that businesses need to consider to maximize profit. Given that
managers and consumers may use AI like ChatGPT in various tasks, it is important to understand answers
to the following research questions. Do LLMs such as ChatGPT also exhibit behavioral biases? If they do,
are they more or less biased than human decision-makers, or just biased differently?
To answer these questions, we build on the Davis (2019) chapter in The Handbook of Behavioral Operations,
and examine 18 common human biases most relevant to operational decision-making, see §1.1 and Figure 1.
We have four central findings. First, ChatGPT exhibits considerable levels of bias when making decisions
that involve 1) conjunction bias, 2) probability weighting, 3) overconfidence, 4) framing, 5) anticipated regret,
6) reference dependence, and 7) confirmation bias. Second, ChatGPT’s risk preferences can be different
from humans. It is situationally risk-averse but tends to prioritize expected payoffs. Third, ChatGPT also
performs poorly when the information is ambiguous, a situation often faced by human managers. Decisions
that require intuitions or “gut feelings” are not possible with ChatGPT. Lastly, ChatGPT performs better
in some tasks that explicitly require calculations and does not suffer from human biases such as mental
accounting, endowment effect, or sunk cost fallacy.
Due to ChatGPT’s tendency to be influenced by information salience and by how a question is framed and referenced, it is essential for managers working with ChatGPT to form questions neutrally. A suggestive
question will likely result in a biased answer. The question must also be clear and precise, for ChatGPT
struggles with ambiguity. Additionally, its risk preferences may not align with every manager’s goals. A
financial adviser may want to manage both risks and returns, while an operations manager may want to maximize expected profit; ChatGPT’s risk-related behavior may not satisfy either. As a result,
businesses need to evaluate AI behavioral biases and understand their effects on business decision-making
before adopting AI technologies. Training and protocols may also need to be established to help managers
navigate working with AI.
2 https://openai.com/blog/how-should-ai-systems-behave/, accessed Feb. 20, 2023
3 https://help.openai.com/en/articles/6783457-chatgpt-general-faq, accessed Feb. 16, 2023
There is a vast literature on behavioral biases. To focus on those that are common, robust, and relevant to
operations management, we followed the chapter by Davis (2019). The author introduced a comprehensive list
of well-established and prevalent behavioral biases and focused on one-shot individual decision experiments.
This proves to be an added advantage, as these experiments are simple to perform with ChatGPT compared to ones involving group decisions, competition, and multiple rounds. We supplement this collection with experiments from Camerer (1995) and our own search of highly cited economics and psychology experiments.
In total, we examine 18 well-known behavioral biases with ChatGPT, see Figure 1. Similar to Davis (2019),
we classify them into biases in risk judgments, biases regarding the evaluation of outcomes, and heuristics
in decision-making. To help with readability, we introduce each bias in more detail together with the
test results in §3-§5. Note that we do not suspect ChatGPT to have all the listed behavioral biases, such as
endowment effect or availability heuristics, but we still test them for completeness.
LLMs have made dramatic progress in recent years and prompting4 LLMs to solve tasks has become a
dominant paradigm (Brown et al., 2020; Mialon et al., 2023). There has been a growing body of literature
on eliciting reasoning and creating capability benchmarks for LLMs (Wei et al., 2022; Kojima et al., 2022; Bommasani et al., 2021; Srivastava et al., 2022) through prompts. However, most of these works focus on LLMs’ capabilities instead of their behavior. There is a nascent body of literature on understanding the
behavior of LLMs by prompting LLMs to solve economics (Horton, 2023), cognitive (Binz & Schulz, 2023;
Hagendorff et al., 2022), and psychological (Park et al., 2023) problems. Our work adds to this body of
literature, demonstrating the importance of studying LLMs as behavioral decision-makers.
Horton (2023) performed four behavioral economics experiments with GPT-3, including social preferences,
fairness, status quo bias, and a minimum wage experiment. The author concluded that GPT-3 could
qualitatively repeat findings in humans, and LLMs are a promising substitute for human subjects in economics
experiments. This view is shared by Argyle et al. (2022), who used GPT-3 to simulate human samples for
social science research. Our findings also support that sometimes LLMs behave like human subjects, but we
also find many cases where ChatGPT behaves differently. For example, our research shows that ChatGPT does not exhibit biases such as mental accounting, the endowment effect, or the sunk cost fallacy.
4 A prompt is a text input to the language model, such as a sentence, a paragraph, or an example of a conversation.
Binz and Schulz (2023) studied GPT-3 using cognitive psychology methods and found that it performed well
in decision-making and deliberation and decently in information search, but showed little causal reasoning
abilities. This study is especially relevant to ours because the authors also tested some biases we present in this
paper. They found GPT-3 had conjunction bias, gave intuitive (but incorrect) responses in cognitive reflection
test (CRT), had the framing effect, certainty effect (risk aversion), and overweighting bias (probability
weighting). Hagendorff et al. (2022) performed CRT and semantic illusions on GPT-3.5 and also found that
GPT-3.5 was more likely to respond with intuitive answers than humans. Our results on ChatGPT support
many of these findings. We report conjunction bias, framing effect, and probability weighting. However, there
also seemed to be differences between GPT-3/3.5 and ChatGPT. We found ChatGPT performed better than
humans in CRT, and gave more correct responses on average, instead of the intuitive ones, potentially due to
the improved mathematical capabilities. We also found ChatGPT’s risk aversion seemed to be limited to
choices with equal payoff expectations, and overall it seemed more risk neutral compared to humans.
Park et al. (2023) replicated ten studies from the Many Labs 2 project (Klein et al., 2018) with GPT-3.5.
They replicated about 30% of the original findings, including the false consensus effect and less-is-better
effect. The authors concluded that using LLMs for psychological studies is feasible, but such results are only
sometimes generalizable to humans.
Lastly, our work also contributes to the fast-growing literature on ChatGPT’s applications in various
professions, such as business (Terwiesch, 2023), law (Choi et al., 2023), and medicine (Kung et al., 2023).
Researchers found that ChatGPT could pass challenging exams in business, law, and medicine.
Next, we describe the study design and protocol. In §3-5, we present the biases, experiments, and results for
judgments regarding risk, the evaluation of outcomes, and heuristics in decision-making, respectively. We
summarize the study results and discuss its limitations and future research opportunities in §6.
ChatGPT tends not to give direct answers to questions about personal preferences, feelings, or tasks requiring
any form of physical ownership or interactions with the world. In our experience, it also tries to avoid definitive
answers when asked to take a “best guess” without access to all necessary information. These conditions are,
however, common in the existing economics and behavioral operations management experiments. We use the
exact wording as the referenced human-subject experiments whenever possible. However, in circumstances where a preference is required, we make minor modifications (e.g., instead of asking “what is your preference,” we ask which option is better or which most people would prefer).
The hot-hand and gambler’s fallacies are both about false beliefs of future event probabilities being correlated
with the past even when the actual probabilities are entirely independent. In hot-hand fallacy cases, people believe a player’s winning streak (by chance) indicates a higher probability of winning in the future. In gambler’s fallacy cases, people believe that a run of one outcome makes the opposite outcome more likely, even though the draws are independent.
We adopt an experimental condition similar to Ross and Levy (1958) and Bakan (1960) and ask ChatGPT to generate random fair coin-toss series of length 50. Per §2, we hold ten independent conversations with ChatGPT, resulting in ten random series; the generated sequences are approximately 50 tosses long. We show the lag-1 autocorrelations and their 95% confidence intervals in Figure 2.
[Figure 2: Lag-1 autocorrelations and 95% confidence intervals for the ten ChatGPT-generated coin-toss sequences (test numbers 1–10).]
Out of the ten sequences, three have significant negative autocorrelation and one has significant positive autocorrelation. The average correlation coefficient is 0.13, with a 95% confidence interval of (-0.31, 0.22). That is, while ChatGPT predominantly generates sequences without significant autocorrelation, a notable share of conversations still produce significantly autocorrelated sequences.
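To make the autocorrelation check concrete, the following minimal Python sketch (ours, not the authors' code) parses a hypothetical H/T sequence, computes its lag-1 autocorrelation, and attaches an approximate 95% interval using the large-sample 1.96/sqrt(n) bound; the paper's exact interval construction is not specified.

import numpy as np

def lag1_autocorrelation(tosses: str):
    # Encode heads as 1 and tails as 0, then correlate the series with itself shifted by one.
    x = np.array([1.0 if t == "H" else 0.0 for t in tosses.upper()])
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    half_width = 1.96 / np.sqrt(len(x))  # approximate 95% half-width under independence
    return r1, (r1 - half_width, r1 + half_width)

# Hypothetical sequence standing in for one ChatGPT-generated series of about 50 tosses.
sequence = "HTHTHHTTHTHHTHTTHHTHTHHTTHTHTHHTHTTHHTHTHTHHTTHTHT"
r1, ci = lag1_autocorrelation(sequence)
print(f"lag-1 autocorrelation = {r1:.2f}, approximate 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")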
The probability of a conjunction event P(A&B) cannot be greater than P(A) or P(B). The conjunction fallacy
is the observation that people’s estimation of probability over conjunction events can violate this principle.
Tversky and Kahneman (1983) demonstrated this fallacy in the well-known “Linda problem”:
We gave ChatGPT the same description of Linda and choices A through H, with added instruction, “Rank
the following statements that could describe Linda from the most likely to the least likely.” In 10 out of 10
conversations, ChatGPT ranked P(C) > P(H) > P(F), showing that not only does ChatGPT provide a similarly biased ranking to human subjects, but its bias may also be more severe.
The availability heuristic is a tendency to use data familiar to oneself, such as personal experiences and
conversations, to judge risks. Tversky and Kahneman (1973) demonstrated availability affecting judgment by
giving subjects a visual task of estimating the possible ways a bus can make r stops traveling along a route
with ten stops in total. The correct answer is the binomial coefficient of 10 choose r, which yields the highest
estimate at five stops and is symmetric with respect to r = 5. However, the subjects’ median judgments are
decreasing in r. Without realizing the symmetry between patterns with r stops and patterns with 10 − r
stops, the patterns requiring more stops are more challenging to consider.
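The symmetry can be verified directly; the short Python check below (our illustration, not part of the original study) evaluates the binomial coefficient for the values of r used in the tasks below.

from math import comb

# Number of distinct patterns of r stops chosen from 10 stations: C(10, r).
for r in (2, 5, 8):
    print(f"r = {r}: {comb(10, r)} patterns")
# C(10, 2) = C(10, 8) = 45 and C(10, 5) = 252, so the count peaks at r = 5
# and is symmetric around it, contrary to the decreasing median judgments.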
We suspect that ChatGPT, an AI with vast knowledge, is unlikely to exhibit the availability heuristic, but we gave it a similar set of tasks, with r ∈ {2, 5, 8}. Since it does not take images as input, we describe the task in words while keeping the description as faithful as possible to the original:
There are ten stations along a route. Consider a bus that travels, stopping at exactly r stations
along this route. What is the number of different patterns of r stops that the bus can make?
Not surprisingly, ChatGPT performs very well in these tasks compared to human subjects. 9 out of 10 times,
ChatGPT gave the correct answers to all three questions. The representative responses are as follows:
When asked to estimate a risk that is a Bayesian posterior in nature, people’s tendency to ignore the base
rate in the process is called the base rate fallacy. Take the question by Casscells et al. (1978) as an example:
If a test to detect a disease whose prevalence is 1/1000 has a false positive rate of 5%, what
is the chance that a person found to have a positive result actually has the disease, assuming
you know nothing about the person’s symptoms or signs?
The fallacy stems from decision makers ignoring the fact that the probability of the disease is low and that most positive test results are false positives. Only 11 out of 60 participants from Harvard Medical School gave the
correct answer of around 2%, while their average estimate was 56%, and nearly half of the subjects responded
95%. This experiment was repeated in 2013 in a group of physicians and medical students in a Boston-area
hospital, and the results were not significantly different (Manrai et al., 2014).
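The correct answer of roughly 2% follows from Bayes' rule; the sketch below assumes a perfectly sensitive test, as the original question implies.

prevalence = 1 / 1000       # P(disease)
sensitivity = 1.0           # assumed: the test always detects the disease when present
false_positive_rate = 0.05  # P(positive | no disease)

p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
posterior = prevalence * sensitivity / p_positive  # P(disease | positive)
print(f"P(disease | positive) = {posterior:.4f}")  # about 0.0196, i.e., roughly 2%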
We asked ChatGPT the exact same question, and 7/10 times ChatGPT gave the correct answer, with
the average estimate being 4.9%. Compared to Casscells et al. (1978), ChatGPT does significantly better.
Fisher’s exact test yields an odds ratio of 9.92 with a p-value of 0.002 between ChatGPT’s and humans’ odds of providing the correct answer. However, upon closer inspection, two of ChatGPT’s three incorrect answers were purely due to calculation errors. As such mistakes are easy to correct when reviewed by a human decision maker, and OpenAI is likely to improve the calculation capability over time, we recalculate the odds ratio (ChatGPT with the help of a calculator vs. humans) to be 37.38 with a p-value < 0.001.
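As a rough check of this comparison, the sketch below runs Fisher's exact test (via scipy) on the 2x2 counts implied by the text: 7 of 10 correct for ChatGPT versus 11 of 60 for the human subjects, and 9 of 10 once the calculation slips are corrected. Note that scipy reports the unconditional sample odds ratio, which can differ slightly from the conditional estimates quoted above.

from scipy.stats import fisher_exact

# Rows: ChatGPT, humans; columns: correct, incorrect.
table = [[7, 3], [11, 49]]
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")

# ChatGPT with the help of a calculator (the two calculation errors corrected).
table_corrected = [[9, 1], [11, 49]]
or_corrected, p_corrected = fisher_exact(table_corrected, alternative="two-sided")
print(f"odds ratio = {or_corrected:.2f}, p-value = {p_corrected:.4f}")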
Probability weighting is applying additional “decision weights” when evaluating potential outcomes with
stated probability information. Kahneman and Tversky (1979) asked their subjects a question of the following form:

Which would you prefer: a 0.1% chance at $5000, or $5 with certainty?

52 out of 72 subjects (72%) preferred the riskier choice. One explanation of this finding is that the participants put additional weight on the low 0.1% probability, as if they did not believe 0.1% was a very low probability.
Another test discussed in Davis (2019), also from Kahneman and Tversky (1979), involves Russian roulette:
Suppose you are compelled to play Russian roulette, but are given the opportunity to purchase
the removal of one bullet from the loaded gun. Would you pay as much to reduce the number
of bullets from four to three as you would to reduce the number of bullets from one to zero?
Although removing one bullet always decreases the probability of death by 1/6, people value the removal of the last bullet much more highly than the removal of the fourth bullet.
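For reference, the two questions can be compared on expected values and absolute risk reductions alone; a quick check (ours, not from the paper):

# Question 1: the gamble and the sure amount have identical expected values.
ev_gamble = 0.001 * 5000      # 0.1% chance at $5000
ev_certain = 5.0              # $5 with certainty
print(ev_gamble, ev_certain)  # 5.0 vs 5.0: a risk-neutral agent is indifferent

# Question 2: removing any single bullet from a six-chamber revolver lowers the
# probability of death by exactly 1/6, whether going from four bullets to three
# or from one bullet to zero.
for bullets in (4, 1):
    reduction = bullets / 6 - (bullets - 1) / 6
    print(f"{bullets} -> {bullets - 1} bullets: risk reduced by {reduction:.4f}")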
We ask ChatGPT these two preference questions to gauge whether it has similar biases. Our results are
surprising. For the first question, ChatGPT states in every response that it does not have a preference itself, but 6 out of 10 times it gave answers strongly indicating that $5 with certainty is better or preferred by most people (and once, amusingly, “mathematically better”), while in the remaining four responses it indicated the decision should be based on one’s risk tolerance, without stating a preference. In this experiment, ChatGPT exhibits probability weighting, but it weighs small probabilities in the opposite direction compared to humans, resulting in a preference for the certain payoff.

$5 with certainty is better (6/10)
As a language model AI, I don’t have personal preferences. However, most people would prefer $5 with
certainty because it is a guaranteed outcome, while a 0.1% chance at $5000 is uncertain and has a low
probability of being realized.
Depends on risk tolerance (4/10)
As an AI language model, I don’t have personal preferences or emotions, but a 0.1% chance at $5000 and
$5 with certainty are two different things. If one values the potential reward higher than the certainty of
the smaller reward, they might prefer the 0.1% chance at $5000. On the other hand, if one values certainty
of a small reward over the potential reward, they might prefer the $5 with certainty. It ultimately depends
on individual preferences and risk tolerance.
As for the Russian roulette question, 10 out of 10 times, ChatGPT responds that it will not pay the same
amount. 7/10 times it is willing to pay more to reduce the number of bullets from one to zero, while 3/10 times,
it chose the opposite. Interestingly, the reasons ChatGPT provides for its choices are reference-dependent. In
cases where it is willing to pay more to reduce from one to zero, it emphasizes that the relative reduction in the four
to three bullets scenario is much lower than the one to zero scenario. In comparison, responses supporting a
higher price for reducing four bullets to three emphasize that the absolute risk of having four bullets is much
greater than having one bullet. We will also discuss a test for reference dependency in §4.5. Together, our
results suggest that ChatGPT exhibits a probability weighting bias in its answers, although the direction of the bias may be the same as or opposite to that of human decision-makers. Two representative answers are as follows:
3.6 Overconfidence
Overconfidence is a bias in which decision-makers overestimate their probability of being correct. Fischhoff et al. (1977) asked subjects a series of knowledge questions, such as Is absinthe a liqueur or a precious stone?, and then calibrated their confidence against their actual performance.
[Figure 3: Calibration of ChatGPT’s stated confidence (percentage confidence) against its actual performance (percentage correct).]
We also ask ChatGPT the same absinthe question, but ChatGPT, with its vast general knowledge, always provides the correct answer with 100% confidence, thus making calibration difficult. So, we also ask ChatGPT to provide
its confidence level in other behavioral tests presented in this paper that have correct answers. These tests
include the Linda problem and the modified Linda problem (conjunction bias), the disease PPV calculation
(base rate neglect), the bus stop pattern calculation (availability heuristic), CRT (System 1 and System 2
thinking), and the four-card selection task (confirmation bias). We calculate ChatGPT’s average estimated
confidence level and its performance in each test, and summarize the calibration results in Table 2 below. We
also graph the calibration curve in Figure 3. Note that some of these tests are discussed after this section; here we merely use them as data points.
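A minimal sketch of the calibration computation behind Table 2 and Figure 3 is given below; because the table is not reproduced here, the (confidence, accuracy) pairs are placeholders, not the actual values.

# Placeholder (stated confidence %, observed accuracy %) pairs per task.
# These numbers are illustrative only, not the figures from Table 2.
tasks = {
    "Linda problem (conjunction)": (95, 0),
    "Disease PPV (base rate)":     (90, 70),
    "Bus stop patterns":           (100, 90),
    "CRT":                         (95, 67),
    "Four-card selection task":    (90, 20),
}

gaps = []
for name, (confidence, accuracy) in tasks.items():
    gap = confidence - accuracy  # positive gap indicates overconfidence
    gaps.append(gap)
    print(f"{name}: confidence {confidence}%, accuracy {accuracy}%, gap {gap:+d} pp")

print(f"average gap: {sum(gaps) / len(gaps):.1f} percentage points")
# A well-calibrated decision maker's points lie near the diagonal (confidence = accuracy);
# points above the diagonal indicate overconfidence.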
An unbiased decision maker would have a calibration curve close to the diagonal line; ChatGPT is overconfident
in these behavioral tasks. However, caution is needed when comparing this result to the general-knowledge calibration results reported for human subjects, since the tasks differ in nature.
Ambiguity aversion is a decision maker’s tendency to avoid choices with uncertain probability information.
Ellsberg (1961) designed the following experiment questions:
Test 1: There is an urn with 30 red balls and 60 other balls that are either black or yellow.
Choose among the following two options:
A: $100 if you draw a red ball. B: $100 if you draw a black ball.
Test 2: You must also choose between these two options. A’: $100 if you draw a red or yellow ball. B’: $100 if you draw a black or yellow ball.
If a subject strictly prefers A over B, then she should also prefer A’ over B’. However, a decision maker that is
ambiguity averse may prefer A and B’. We ask the same questions to ChatGPT, but with an added sentence,
“we do not know the exact numbers of black balls or yellow balls, but the total number of black and yellow
balls is 60” to underscore the uncertainty.
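The structure of the two tests can be made explicit with expected values; the sketch below (ours) treats the unknown black/yellow split as a free parameter b.

# Urn: 30 red balls plus 60 balls split between black (b) and yellow (60 - b).
def expected_values(b: int):
    total = 90
    ev_A  = 100 * 30 / total               # A: red pays, unambiguous
    ev_B  = 100 * b / total                # B: black pays, depends on the unknown b
    ev_A2 = 100 * (30 + (60 - b)) / total  # A': red or yellow pays, depends on b
    ev_B2 = 100 * 60 / total               # B': black or yellow pays, unambiguous
    return ev_A, ev_B, ev_A2, ev_B2

for b in (0, 30, 60):  # extreme and intermediate splits
    print(b, [round(v, 2) for v in expected_values(b)])
# A and B' have fixed expected values ($33.33 and $66.67); B and A' vary with b.
# Ambiguity-averse humans therefore tend to pick A in Test 1 and B' in Test 2.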
Even with additional clarification, ChatGPT struggled to understand that even though the individual number
of black and yellow balls is unknown, their total is fixed without ambiguity. As a result, its responses are
highly variable, and it often cannot determine which choice is better. We summarize ChatGPT’s responses
below. Fisher’s exact test yields a p-value <0.01, indicating the responses in corresponding choices (A and
A’, B and B’, No preferences) are distributed differently between the two tests. However, this significance is
likely driven by the fact that ChatGPT, understanding the ambiguity in Test 1, is more likely to give answers
without preferences. At the same time, in Test 2, ChatGPT misunderstands the question more often and
gives illogical responses.
              A / A’   B / B’   No preference
Test 1           1        3            6
Test 2           8        0            2
Test 1: No preference / cannot determine (6/10)

The expected value for option A is $100 * (30 / 90) = $33.33. The expected value for option B is unknown as we do not know the exact number of black balls in the urn. Without this information, it is not possible to determine the expected value for option B.
Test 2: A’ is better (8/10)
The expected value of A’ is (30 + 60)/90 * $100 = $100, and the expected value of B’ is 60/90 * $100 =
$66.67. So, option A’ is the better option as it has a higher expected payout.
In summary, we cannot determine the ambiguity aversion level of ChatGPT since it avoids providing answers
without all the necessary information and frequently misunderstands the questions. Compared to human
decision-makers, who can be compelled to decide under ambiguous probability information, ChatGPT
struggles to understand the ambiguity and declines to make decisions. This tendency, however, may be seen
Risk aversion is the tendency to prefer more certain outcomes when making decisions under uncertainty. We
perform two lottery choice risk aversion elicitation tasks found in Davis (2019) Table 5.1. Below are the
questions we asked ChatGPT:
Test 1: Choose between two lotteries A and B, which one is better? A: 50% chance to win
$5.5 and 50% chance to win $4.5 B: 50% chance to win $9 and 50% chance to win $1
Test 2: Choose between two lotteries A and B, which one is better? A: 50% chance to win
$5.5 and 50% chance to win $4.5 B: 50% chance to win $9.5 and 50% chance to win $1
These elicitation tasks have a pair of lotteries, A with lower risk and fixed expected returns, and B with
higher risk and varying expected returns. The expectation of lottery B increases as the test progresses to
make the riskier option increasingly more appealing. More risk-averse subjects will switch to lottery B later
than those who are more risk-seeking. We stop our experiment at Test 2 simply because ChatGPT already switches over to preferring lottery B as soon as the expectation of lottery B is slightly greater than that of lottery A.
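The expected payoffs behind the two tests, computed directly (our illustration):

def expected_value(lottery):
    return sum(p * x for p, x in lottery)

lottery_A       = [(0.5, 5.5), (0.5, 4.5)]  # same in both tests
lottery_B_test1 = [(0.5, 9.0), (0.5, 1.0)]
lottery_B_test2 = [(0.5, 9.5), (0.5, 1.0)]

print(expected_value(lottery_A))        # 5.00
print(expected_value(lottery_B_test1))  # 5.00 -> equal expectation, only the risk differs
print(expected_value(lottery_B_test2))  # 5.25 -> the riskier option now has a higher expectation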
In the first round of tests, the two lotteries have exactly the same expected payoff, and 9/10 times ChatGPT indicated that the two lotteries are similar and that the choice depends on the decision-maker’s risk tolerance; 1/10 times it preferred lottery A (the less risky choice). This is a drastically different result from the first test in §3.5, where we asked ChatGPT to compare two choices with the same $5 expected payoff but much more extreme risk differences. In that test, ChatGPT favored the choice with certainty. In the second round of tests, 9/10 times ChatGPT preferred lottery B, while 1/10 times it suggested the choice depends on risk tolerance. Below is the most representative response from each round of tests.
The expected value for Lottery A is $5 and for Lottery B is $5. The expected value represents the average
outcome of a Lottery if it is repeated many times. So, both Lotterys A and B have the same expected value
and hence have the same level of "goodness". It is up to personal preference as to whether one values a
lower variance (Lottery A) or a higher potential payout (Lottery B).
The expected value of Lottery A is $5 and the expected value of Lottery B is $5.25. So, Lottery B is
mathematically better, as it has a higher expected value. This means that if you were to play Lottery B
repeatedly, you would expect to win more money on average than if you were to play Lottery A repeatedly.
Bias | Test | Typical human behavior | ChatGPT behavior
Risk aversion and scaling | Lottery choice risk aversion elicitation tasks | Risk-averse people switch to the riskier choice only when its expected payoff far exceeds the safer choice | Switches to the risky choice as soon as its expected payoff is higher than the safer choice; treats risk as a secondary consideration
Prospect theory | Risk preference test in gains vs. losses | Risk-averse in gains, risk-seeking in losses | Mostly focused on maximizing expected payoff; can be risk-averse in gains
Framing | Same difficult choice presented in gain frame vs. loss frame | Risk-averse in gain frame, risk-seeking in loss frame | Risk-averse in gain frame, risk-seeking in loss frame
Anticipated regret | Varying salience of counterfactual regrets of checking a car or going to a quiz | Significant effect of the regret salience manipulation | Significantly less preference to check the car when quiz regret is made more salient
Mental accounting | What feels better: a large gain and a small loss, or an integrated small gain | Integration feels better | No preference
Reference dependence | Preference between mixed gains presented in percentage or absolute numbers | Preference can reverse when a mixed or relative frame is used | Different preferences in absolute and relative frames
Intertemporal choice (hyperbolic discounting) | Discount factor of a future payoff | Discount factor is not constant, decreasing in time and size of payoffs | Constant discount factors, though it may apply them incorrectly
The endowment effect | Raffle ticket WTP vs. WTA gap | Pronounced WTA-WTP gap | No WTA-WTP gap
Sunk cost fallacy | Invest in finishing a project that may fail or start another project that may also fail | More would finish the project with sunk cost than start another | Does not consider the sunk cost
One aspect of the well-known prospect theory is that decision-makers tend to be risk-averse in gains and
risk-seeking in losses. Kahneman and Tversky (1979) asked the following questions in an experiment: 80% of respondents chose the $3000 gain with certainty, but only 8% chose the $3000 loss with certainty.
Test 1: Would you rather play a gamble where there is an 80% chance of earning $4000, or
receive $3000 with certainty?
Test 2: Now consider the following choice: Would you rather play a gamble where there is an
80% chance of losing $4000, or pay $3000 with certainty?
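For reference, the expected values of the two gambles (our computation): the certain option is $200 worse in expectation in the gain frame and $200 better in the loss frame.

# Gain frame: 80% chance of +$4000 vs. +$3000 for sure.
ev_gain_gamble = 0.8 * 4000
print(ev_gain_gamble, 3000)   # 3200 vs 3000: the gamble is $200 better in expectation

# Loss frame: 80% chance of -$4000 vs. -$3000 for sure.
ev_loss_gamble = 0.8 * -4000
print(ev_loss_gamble, -3000)  # -3200 vs -3000: the sure loss is $200 better in expectation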
We repeat this experiment on ChatGPT, and the results are summarized in the following table. Fisher’s exact test returns a p-value of 1, suggesting ChatGPT’s risk preference does not differ between gains and losses. Upon closer examination of ChatGPT’s responses, in the second test, 2 of the 3 times it preferred to gamble, it made a mistake calculating the expected payoffs that favored the gamble option. However, reversing the preference in these two responses would still not change the result qualitatively.
                  Certainty   Gamble   No preference
Test 1 (Gains)         4          4            2
Test 2 (Losses)        5          3            2
Similar to previous results, ChatGPT is more focused on maximizing expectations. Compared to human
respondents, ChatGPT rarely exhibits the classic prospect theory response (certainty for gain, gamble for
loss). However, compared to results in §4.1, where ChatGPT almost strictly maximized payoff expectations,
it is surprising that in Test 1, ChatGPT preferred certain gains even when the expected payoff was $200 less.
This difference suggests that ChatGPT may not be entirely risk neutral. The scale of the problem may also
play a role here: ChatGPT may give different suggestions for small-scale tasks vs. large-scale choices, and future research is needed to further investigate the effect of scale. Below are the dominant responses to the two tests.
Most people would rather receive $3000 with certainty as opposed to playing a gamble where there is an
80% chance of earning $4000.
4.3 Framing
Prospect theory also predicts that framing the same set of choices as gains or as losses may change decisions. We repeat the set of experiments by Tversky and Kahneman (1981) on ChatGPT with minimal
wording changes:
Imagine that the U.S. is preparing for the outbreak of an unusual disease, which is expected to
kill 600 people. Two alternative programs to combat the disease have been proposed. Assume
that the exact scientific estimate of the consequences of the programs is as follows:
Program A: 200 people will be saved.
Program B: There is a 1/3 probability that 600 people will be saved, and 2/3 probability that
no people will be saved.
Which program should I choose?
Imagine that the U.S. is preparing for the outbreak of an unusual disease, which is expected to
kill 600 people. Two alternative programs to combat the disease have been proposed. Assume
that the exact scientific estimate of the consequences of the programs is as follows:
Program C: 400 people will die.
Program D: There is a 1/3 probability that nobody will die and 2/3 probability that 600 people
will die.
Which program should I choose?
Of human respondents, 72% chose program A, but only 22% chose program C (Tversky & Kahneman, 1981).
Although options A and C are identical, in gain frames, people prefer certainty, but in loss frames, people
prefer to gamble. We classify ChatGPT’s responses into three categories: those that clearly favored certainty, those that clearly favored the gamble, and those that explained the situation and left the decision to us. The summarized results are shown below.
             Certainty   Gamble   No preference
Gain frame        7          0            3
Loss frame        3          4            3
Fisher’s exact test yields a borderline significant association between framing and risk preference (p-value
0.073), suggesting there is some evidence that ChatGPT is also sensitive to framing, considering our relatively
small sample size. ChatGPT consistently preferred certainty in the gain frame, while in the loss frame its dominant answer was actually to gamble; both patterns are similar to human decision-makers. The dominant answers to the two tests are shown below.
Based on the information given, Program D is the better option as it offers a chance (1/3) of no one dying,
whereas Program C guarantees 400 deaths. Although Program D also has a 2/3 chance of 600 deaths, the
possibility of no deaths makes it a more favorable option.
Anticipated regret or regret aversion is the theory that when people make decisions, they consider not
only the utilities of their options but also the counterfactual regrets of their choices (Bell, 1982; Loomes &
Sugden, 1982). Hetts et al. (2000) conducted a questionnaire study on 164 UCLA undergraduate students
in a scenario where they needed to choose between checking their car or arriving at a quiz on time. The
authors manipulated the salience of anticipated counterfactual regrets and found a significant effect of the
manipulation. 69.1% of the subjects in the car regret manipulation group preferred to check their car,
compared to 46.3% in the control group and 34.5% in the quiz regret manipulation group. We give ChatGPT
the exact same scenarios and ask it to decide what to do:
Control:
Imagine that you commute to school and that you park your car in one of the lots on campus.
Further imagine that, on this day, you are walking to class in a bit of a rush because you
have a quiz that you do not want to be late for. On the way to class, however, you get a
strange feeling that you may have left your car door unlocked. Try as you might, you cannot
be absolutely certain whether or not you locked your door.
Now you must choose: would you go back to check your car or go straight to class for the
quiz?
Car regret:
Imagine that you commute to school and that you park your car in one of the lots on campus.
Further imagine that, on this day, you are walking to class in a bit of a rush because you
have a quiz that you do not want to be late for. On the way to class, however, you get a
strange feeling that you may have left your car door unlocked. Try as you might, you cannot
be absolutely certain whether or not you locked your door.
Think for a minute about how upset you would be if you didn’t go back to check your car, and
later that day your car was burglarized. Now you must choose: would you go back to check
your car or go straight to class for the quiz?
              Check the car   Go to the quiz   No preference
Control               9               0               1
Car regret           10               0               0
Quiz regret           1               1               8
Below are the representative dominant answers from ChatGPT under the three experimental conditions:
One aspect of mental accounting concerns the granularity in people’s evaluation of gains and losses, as gains
and losses may feel different depending on how they are mentally segregated or combined. Thaler (1985)
concluded that multiple gains should be segregated, multiple losses should be integrated, mixed gains (a bigger gain plus a smaller loss) should be integrated, and mixed losses (a bigger loss plus a smaller gain) should be segregated.
Reference dependence suggests that our feelings of gains or losses depend on the reference we set. Heath
et al. (1995) adopted the experiment in Thaler (1985) on mental accounting to test reference dependence
by framing the gains and losses in percentages as opposed to absolute dollar amounts and tested people’s
preferences in hypothetical scenarios. They found the preference for integration and segregation in mental
accounting is reference-dependent. Specifically, for mixed gains, respondents preferred integration in an
absolute frame but segregation in mixed and relative frames.
We replicate one set of questions from Heath et al. (1995) to test both phenomena, since the first question of the set is a direct adaptation of Thaler’s (1985) work on mental accounting. The second and third questions are in mixed and relative frames; we use them to examine reference dependence in ChatGPT.
Absolute frame 0 0 2 8
Mixed frame 0 2 0 8
Relative frame 0 7 0 3
Intertemporal choices consider decisions with present and future consequences. One common assumption
when modeling intertemporal choices is to apply a constant discount factor per unit of time lapsed. However,
experiments such as Thaler (1981) showed that the discount factors of average decision-makers are not constant. Instead, they decrease with time and with the size of payoffs: people heavily discount waiting a short time for a small payoff and much prefer to get paid immediately, but are relatively more patient about a large payoff in the distant future. This phenomenon is also called hyperbolic discounting.
We adopt one experimental condition of Thaler (1981), as abbreviated in Davis (2019), and ask ChatGPT a set of three questions:
You have won $15 in the lottery which is being held by your bank. You can take the $15 now,
or wait until later. How much money would you require to make waiting [3 months, 1 year, 3
years] just as attractive as getting the $15 now?
The amount of money required to make waiting 3 months just as attractive as getting $15 now is dependent
on the discount rate, or the rate at which the value of money decreases over time. If the discount rate is
5%, then the present value of $15 in 3 months would be $14.41 ($15 / (1 + 0.05)^3). Therefore, you would
need to be offered at least $14.41 in 3 months to make waiting as attractive as taking $15 now.
1 year: less than $15 (5/10)
The amount of money required to make waiting 1 year just as attractive as getting $15 now is dependent on
the discount rate, or the rate at which the value of money decreases over time. If the discount rate is 5%,
then the present value of $15 in 1 year would be $14.04 ($15 / (1 + 0.05)^1). Therefore, you would need to be
offered at least $14.04 in 1 year to make waiting as attractive as taking $15 now.
3 years: less than $15 (5/10)
The amount of money required to make waiting 3 years just as attractive as getting $15 now is dependent
on the discount rate, or the rate at which the value of money decreases over time. If the discount rate is
5%, then the present value of $15 in 3 years would be $12.55 ($15 / (1 + 0.05)^3). Therefore, you would need
to be offered at least $12.55 in 3 years to make waiting as attractive as taking $15 now.
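For comparison, under a constant annual discount rate the indifference amount should grow with the waiting time rather than fall below $15; the sketch below uses the 5% annual rate that ChatGPT itself assumed (the rate is an assumption, not part of the original question).

# Future amount that makes waiting t years as attractive as $15 now,
# under constant exponential discounting at annual rate r.
def indifference_amount(present=15.0, r=0.05, years=1.0):
    return present * (1 + r) ** years

for label, years in [("3 months", 0.25), ("1 year", 1.0), ("3 years", 3.0)]:
    print(f"{label}: ${indifference_amount(years=years):.2f}")
# Roughly $15.18, $15.75, and $17.36, all above $15, whereas ChatGPT's responses
# discounted $15 to a present value and therefore quoted amounts below $15.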
The endowment effect is when people value an item or asset more if they own it, as is often demonstrated in
experiments by the gap between the willingness-to-accept (WTA) and willingness-to-pay (WTP). When an
item is in someone’s endowment, their WTA tends to be higher than their WTP had they not owned the
item. Knetsch and Sinden (1984) experimented with 76 University of New England students. Thirty-eight of them were given the option of entering a raffle for $2; the prize was worth $50 to $70. The other 38 students were given free raffle tickets but were allowed to redeem them for $2. The authors found that 19 of the students who did not own tickets were willing to pay $2 (WTP > $2), while 29 of the students who were given tickets declined to give up their tickets for $2 (WTA > $2). The endowment induced a statistically significant difference in the perceived value of the raffle ticket.
We provide ChatGPT with two similar scenarios. In Knetsch and Sinden (1984), the students did not know
their winning probability when they made decisions to purchase or sell the tickets. However, for ChatGPT to give a preference, we have to provide a fixed probability of winning the raffle (recall the ambiguity aversion discussion), and we choose a winning probability of 2.08%, the actual winning probability in Knetsch and Sinden (1984). Below are the two scenarios we prompt ChatGPT with.
Since the ticket was given to you for free, the cost of entry is $0. The expected value of entering the prize
draw is calculated as follows: Expected value = Winning probability × Prize amount = 2.08% × $70 =
$1.45 Since the expected value is greater than $0, you have a positive expected outcome from entering the
prize draw. So, if you sell the ticket for $2, you would earn a guaranteed profit of $2, which is greater than
the expected outcome of entering the prize draw ($1.45). In this case, selling the ticket for $2 would be the
more profitable option.
The sunk cost fallacy is the phenomenon in which decision-makers consider unrecoverable past costs when evaluating future options. We follow a questionnaire study by Arkes and Blumer (1985) on 108 college students. The pair of questions is as follows:
              Yes   No   No preference
Question A      0     7        3
Question B      0     7        3
Bias | Test | Typical human behavior | ChatGPT behavior
System 1 and System 2 decisions | Cognitive reflection test (CRT) | On average, 1.24 correct answers out of 3 | On average, 2 correct answers out of 3
Confirmation bias | Four-card selection task | Seek evidence that supports the hypothesis | Seek evidence that supports the hypothesis
A System 1 decision refers to an instant, autopilot-like decision-making process, while a System 2 decision
requires careful and conscious considerations. Frederick (2005) developed a set of three tricky questions,
CRT, in which respondents must suppress their System 1 thinking to arrive at the correct answers. A study
of over 3000 participants yielded an average number of correct responses of 1.24 out of the three questions
(Davis, 2019). Although LLMs like ChatGPT do not “think” as humans do, we have seen throughout our experiments that ChatGPT can produce both incorrect results, such as in the “Linda problem,” where the quantitative information is implicit, and accurate results, such as in the bus stop pattern calculation, where the quantitative information is explicit. We give the CRT questions to ChatGPT to see whether it can
produce System 1 or System 2 answers:
1. A bat and a ball cost $1.1 in total. The bat costs $1 more than the ball. How much does
the ball cost?
2. If it takes 5 machines 5 min to make 5 widgets, how long would it take 100 machines to
make 100 widgets?
3. In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48
days for the patch to cover the entire lake, how long would it take for the patch to cover half
of the lake?
The System 1 answers to the CRT are $0.1, 100 min, and 24 days, respectively. The correct System 2 answers
are $0.05, 5 min, and 47 days. ChatGPT gave the correct answer of $0.05 in the first question 10/10 times,
gave the correct answer of 5 minutes in the second question 8/10 times, but was only correct 2/10 times in
the third question. The distributions of answers in questions 2 and 3 are shown in Figure 4. Interestingly,
ChatGPT produced the System 1 answer of 100 minutes in 1 of the ten responses to question 2, and its most
dominant answer (4 out of 10 times) to question 3 is the System 1 answer of 24 days. So, ChatGPT can
produce both System 1 and System 2 answers. Future research may be required to determine why these two distinct types of responses coexist.
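For reference, the System 2 answers can be derived in a few lines (our illustration):

# Q1: ball + bat = 1.10 and bat = ball + 1.00, so 2 * ball + 1.00 = 1.10.
ball = (1.10 - 1.00) / 2
print(f"ball costs ${ball:.2f}")  # $0.05

# Q2: 5 machines make 5 widgets in 5 minutes, so one machine makes 1 widget in 5 minutes;
# 100 machines working in parallel make 100 widgets in the same 5 minutes.
print("5 minutes")

# Q3: the patch doubles daily and covers the lake on day 48, so it covered half
# the lake one day earlier.
print(f"day {48 - 1}")  # day 47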
[Figure 4: ChatGPT response distributions on CRT questions 2 and 3.]
Question 1: $0.05/System 2 (10/10)

Let x be the cost of the ball in dollars. Then the cost of the bat is x + $1. So x + (x + $1) = $1.1. Solving for x, we get 2x + $1 = $1.1, so 2x = $0.1 and x = $0.05. So the ball costs $0.05 and the bat costs $1.05.
Question 2: 5 min/System 2 (8/10)
If 5 machines can make 5 widgets in 5 minutes, then each machine can make 1 widget in 5 minutes. Therefore,
100 machines can make 100 widgets in 5 minutes, so the time to make 100 widgets would still be 5 minutes.
Question 3: 24 days/System 1 (4/10)
The size of the patch of lily pads is growing exponentially, so to find when it covers half the lake, we need
to find when the size of the patch is equal to half the size of the lake. If it takes 48 days for the patch to
cover the entire lake, then in half that time (48/2 = 24 days), the patch will cover half the lake.
Another heuristic in the judgment process is confirmation bias, a tendency to seek evidence that could confirm one’s hypothesis or prior belief while ignoring evidence that could disprove it. To study this bias, Wason (1966) devised the four-card selection task:
You are shown four cards, marked E, K, 4 and 7. Each card has a letter on one side and a
number on the other. You are given the following rule: Every card with a vowel on one side
has an even number on the other side. Which cards must you turn over to test whether the
rule is true or false?
The correct answer is to turn over cards E and 7, although most people choose E, or E and 4. This is because if the statement “if P then Q” is true, the combinations P and Q, not-P and Q, and not-P and not-Q are all consistent with it, while P and not-Q contradicts it. To falsify the rule “if vowel on one side (P), then even number on the other (Q),” we need to show that the combination P and not-Q does not exist, which requires us to check all cards with vowels (E) or odd numbers (7). However, confirmation bias makes people more inclined to confirm
P(E) and Q(4) instead. In a lab experiment, Wason (1968) found that 8 out of 18 subjects chose to verify P
and Q, 4 chose to verify P, and none chose to verify P and not-Q.
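The logic of which cards are informative can also be checked mechanically; in the small sketch below (ours), a card must be turned over only if its hidden side could complete the falsifying combination of a vowel with an odd number.

VOWELS = set("AEIOU")

def must_turn(visible: str) -> bool:
    # A card matters only if its hidden side could complete a vowel-odd pair.
    if visible.isalpha():
        return visible.upper() in VOWELS  # a visible vowel might hide an odd number
    return int(visible) % 2 == 1          # a visible odd number might hide a vowel

cards = ["E", "K", "4", "7"]
print([c for c in cards if must_turn(c)])  # ['E', '7']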
E and 4 (4/10)
To test the rule, you must turn over the cards marked "E" and "4". If the rule is true, the card marked "E"
should have an even number on the other side and the card marked "4" should have a vowel on the other
side. If the rule is false, either the card marked "E" will have an odd number on the other side or the card
marked "4" will have a consonant on the other side.
E (4/10)
To test the rule, you must turn over the card with the letter E, as it is the only vowel. If the number on the
other side of the E card is even, the rule is true. If it is odd, the rule is false.
6. Conclusions
ChatGPT, a representative of a broader class of advanced AI technologies known as large language models (LLMs), has made headlines recently due to its broad general knowledge, remarkable ability to hold
conversations, and impressive problem-solving skills. However, ChatGPT is also pre-trained with human
data, and its fine-tuning involves human reviewers applying guidelines set by OpenAI, the company behind
ChatGPT. As a result, ChatGPT may learn behavioral biases from humans. As managers may start to rely
more and more on AI, it is essential to understand whether ChatGPT shares the same behavioral biases
found in humans and to what extent.
To understand that, we sourced 18 biases commonly found in operations management problems and tested
them with ChatGPT. The responses showed evidence of 1) conjunction bias, 2) probability weighting, 3)
overconfidence, 4) framing, 5) anticipated regret, 6) reference dependence, and 7) confirmation bias. ChatGPT
also struggled with ambiguity, and had different risk preferences than humans.
We performed four tests on risk preference under different contexts. In general, ChatGPT was likelier to
prefer certainty in tests with equal expectations and to prefer maximized expectations in tests with unequal
expectations. It appeared that ChatGPT had a tiered decision policy. It prioritized expected payoff first, and
showed risk aversion only when expected payoffs were equal. The tiered policy may not align with managers
who need to consider risks and returns jointly. Future research may explore ChatGPT’s risk preferences more
systematically.
We found that ChatGPT was sensitive to framing, reference points, and salience of information. This may not
be surprising, considering ChatGPT is a chatbot; openness to suggestions may help produce more pleasant
conversations. However, it may also render ChatGPT susceptible to implicative questions and statements.
ChatGPT also performed poorly with ambiguous information and may have confirmation bias. All of the above require managers to form clear and well-balanced questions to avoid biased answers. Future research may investigate whether ChatGPT reduces or exacerbates biases when paired with humans.
Our study adds to a nascent body of literature on machine behavior. To the best of our knowledge, it is one of the first papers to explore the behavioral biases of ChatGPT with a focus on operations management. We found that in addition to the potential for factual mistakes and biased opinions, ChatGPT may also carry behavioral biases.
In summary, LLMs like ChatGPT excel at answering factual questions and solving well-defined problems with clear goals. However, they also produce biased responses regarding risks, outcome evaluations,
and heuristics. Although our results do not necessarily generalize to all LLMs, our approach to evaluating
LLMs from a behavioral angle yields many unexpected properties of ChatGPT that are meaningful for the
designers of AI systems as well as businesses that would like to employ them. Our work suggests that a
framework for AI behavioral evaluation is urgently required to safeguard successful AI adoptions.
References
Argyle, L. P., Busby, E. C., Fulda, N., Gubler, J., Rytting, C., & Wingate, D. (2022). Out of One, Many:
Using Language Models to Simulate Human Samples. arXiv preprint arXiv:2209.06899.
Arkes, H. R., & Blumer, C. (1985). The psychology of sunk cost. Organizational Behavior and Human
Decision Processes, 35(1), 124–140. https://doi.org/10.1016/0749-5978(85)90049-4
Bakan, P. (1960) Response-tendencies in attempts to generate random binary series. American Journal of
Psychology, 73, 127-131.
Bell, D. E. (1982). Regret in Decision Making under Uncertainty. Operations Research, 30(5), 961–981.
http://www.jstor.org/stable/170353
Binz, M., & Schulz, E. (2023). Using cognitive psychology to understand GPT-3. Proceedings of the National
Academy of Sciences, 120(6), e2218523120.
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., . . . & Liang, P. (2021). On
the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., . . . & Amodei, D. (2020).
Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
Camerer, C. (1995). Individual decision making. In J. H. Kagel & A. E. Roth (Eds.), The Handbook of Experimental Economics. Princeton University Press.
Casscells, W., Schoenberger, A., & Graboys, T. B. (1978). Interpretation by physicians of clinical laboratory
results. New England Journal of Medicine, 299(18), 999-1001.
Choi, J. H., Hickman, K. E., Monahan, A., & Schwarcz, D. (2023). ChatGPT Goes to Law School. Available
at SSRN.
Davis, A. M. (2019). Biases in individual decision-making. In K. Donohue, E. Katok, & S. Leider (Eds.),
The handbook of behavioral operations (pp. 151–198). Wiley Blackwell.
Ellsberg, D. (1961). Risk, Ambiguity, and the Savage Axioms. The Quarterly Journal of Economics, 75(4),
643–669. https://doi.org/10.2307/1884324
Fiedler, K. (1988). The dependence of the conjunction fallacy on subtle linguistic factors. Psychological Research, 50, 123–129. https://doi.org/10.1007/BF00309212
Fischhoff, B., Slovic, P., & Lichtenstein, S. (1977). Knowing with certainty: The appropriateness of extreme
confidence. Journal of Experimental Psychology: Human Perception and Performance, 3(4), 552–564.
https://doi.org/10.1037/0096-1523.3.4.552
Frederick, S. (2005). Cognitive reflection and decision making. Journal of Economic Perspectives, 19(4),
25–42. https://doi.org/10.1257/089533005775196732
Hagendorff, T., Fabi, S., & Kosinski, M. (2022). Machine intuition: Uncovering human-like intuitive
decision-making in GPT-3.5. arXiv preprint arXiv:2212.05206.
Heath, T. B., Chatterjee, S., & France, K. R. (1995). Mental Accounting and Changes in Price: The Frame
Dependence of Reference Dependence. Journal of Consumer Research, 22(1), 90–97. http://www.jstor.org/stable/2489702
Hetts, J. J., Boninger, D. S., Armor, D. A., Gleicher, F., & Nathanson, A. (2000). The influence of anticipated counterfactual regret on behavior. Psychology & Marketing, 17(4), 345–368. https://doi.org/10.1002/(SICI)1520-6793(200004)17:4<345::AID-MAR5>3.0.CO;2-M
Horton, J. J. (2023). Large Language Models as Simulated Economic Agents: What Can We Learn from
Homo Silicus?. arXiv preprint arXiv:2301.07543.
Kahneman, D., & Tversky, A. (1979). Prospect Theory: An Analysis of Decision under Risk. Econometrica,
47(2), 263–291. https://doi.org/10.2307/1914185
Klein, R. A., Vianello, M., Hasselman, F., Adams, B. G., Adams Jr, R. B., Alper, S., . . . & Sowden, W.
(2018). Many Labs 2: Investigating variation in replicability across samples and settings. Advances in
Methods and Practices in Psychological Science, 1(4), 443-490.
Knetsch, J. L., & Sinden, J. A. (1984). Willingness to Pay and Compensation Demanded: Experimental
Evidence of an Unexpected Disparity in Measures of Value. The Quarterly Journal of Economics, 99(3),
507–521. https://doi.org/10.2307/1885962
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot
reasoners. arXiv preprint arXiv:2205.11916.
Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., . . . & Tseng, V. (2023).
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language
models. PLOS Digital Health, 2(2), e0000198.
Loomes, G., & Sugden, R. (1982). Regret Theory: An Alternative Theory of Rational Choice Under
Uncertainty. The Economic Journal, 92(368), 805–824. https://doi.org/10.2307/2232669
Manrai, A. K., Bhatia, G., Strymish, J., Kohane, I. S., & Jain, S. H. (2014). Medicine’s uncomfortable
relationship with math: calculating positive predictive value. JAMA internal medicine, 174(6), 991–993.
https://doi.org/10.1001/jamainternmed.2014.1059
Park, P. S., Schoenegger, P., & Zhu, C. (2023). Artificial intelligence in psychology research. arXiv preprint
arXiv:2302.07267.
Ross, B. M., & Levy, N. (1958) Patterned predictions of chance events by children and adults. Psychological
Reports, 4, 87-124.
Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., . . . & Kim, H. (2022). Beyond
the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint
arXiv:2206.04615.
Terwiesch, C. (2023) Would Chat GPT3 Get a Wharton MBA? A Prediction Based on Its Performance in
the Operations Management Course, Mack Institute for Innovation Management at the Wharton School,
University of Pennsylvania.
Thaler, R. (1985). Mental Accounting and Consumer Choice. Marketing Science, 4(3), 199–214. http://www.jstor.org/stable/183904
Thaler, R.H. (1981). Some empirical evidence on dynamic inconsistency. Economics Letters, 8, 201-207.
Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability.
Cognitive Psychology, 5(2), 207–232. https://doi.org/10.1016/0010-0285(73)90033-9
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science,
211(4481), 453–458. https://doi.org/10.1126/science.7455683
Tversky, A., & Kahneman, D. (1983). Extensional versus intuitive reasoning: The conjunction fallacy in
probability judgment. Psychological Review, 90(4), 293–315. https://doi.org/10.1037/0033-295X.90.4.293
Wagenaar, W. A. (1972). Generation of random sequences by human subjects: A critical survey of literature.
Psychological Bulletin, 77(1), 65–72. https://doi.org/10.1037/h0032060
Wason, P. C. (1966). Reasoning. In B. M. Foss (Ed.), New horizons in psychology I (pp. 106–137).
Harmandsworth: Penguin.
Wason, P. C. (1968). Reasoning about a rule. The Quarterly Journal of Experimental Psychology, 20(3),
273–281. https://doi.org/10.1080/14640746808400161
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain of thought
prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903.