
Can Large Language Models Be an Alternative to Human Evaluation?

Cheng-Han Chiang and Hung-yi Lee
National Taiwan University, Taiwan
[email protected] [email protected]

Abstract

Human evaluation is indispensable and inevitable for assessing the quality of texts generated by machine learning models or written by humans. However, human evaluation is very difficult to reproduce and its quality is notoriously unstable, hindering fair comparisons among different natural language processing (NLP) models and algorithms. Recently, large language models (LLMs) have demonstrated exceptional performance on unseen tasks when only the task instructions are provided. In this paper, we explore whether such an ability of LLMs can be used as an alternative to human evaluation. We present the LLMs with the exact same instructions, samples to be evaluated, and questions used to conduct human evaluation, and then ask the LLMs to generate responses to those questions; we dub this LLM evaluation. We use human evaluation and LLM evaluation to evaluate the texts in two NLP tasks: open-ended story generation and adversarial attacks. We show that the results of LLM evaluation are consistent with the results obtained by expert human evaluation: the texts rated higher by human experts are also rated higher by the LLMs. We also find that the results of LLM evaluation are stable over different formattings of the task instructions and over the sampling algorithm used to generate the answer. We are the first to show the potential of using LLMs to assess the quality of texts, and we discuss the limitations and ethical considerations of LLM evaluation.

1 Introduction

Human evaluation is an important method to understand the performance of an NLP model or algorithm (Guzmán et al., 2015; Gillick and Liu, 2010). We rely on human evaluation because there are certain aspects of texts that are hard to evaluate using automatic evaluation metrics; thus, researchers resort to humans to rate the quality of the output of NLP models. While human evaluation is prevalent and indispensable in NLP, it is notoriously unstable (Gillick and Liu, 2010; Clark et al., 2021). Karpinska et al. (2021) have shown that the quality of the workforce in human evaluation can have a detrimental effect on the evaluation result, making it impossible to compare the performance of different systems. Reproducibility is another issue in human evaluation, since it is hard to recruit the same human evaluators and rerun the same evaluation. Even if the same workers are recruited, workers who have seen the task before are likely to produce a different evaluation result the next time because they have already done the task. While human evaluation is used to better assess NLP systems and has some advantages over automatic evaluation metrics, these drawbacks make it difficult to reliably evaluate NLP systems.

To resolve some of these drawbacks, we take advantage of large language models (LLMs). LLMs are large models that are trained to model human languages using self-supervised learning (Brown et al., 2020) and are further trained with special procedures to improve their performance on unseen tasks and to better follow natural language instructions (Sanh et al., 2022; Wei et al., 2022). The ability to perform a task given only the task instructions motivates us to ask whether these LLMs can perform what humans do in human evaluation. To answer this question, we feed the LLM the same instruction, sample, and question used in human evaluation, and take the sequences generated by the LLM as the LLM's answer to the question. This process is shown in Figure 1, and we call it LLM evaluation.

To test whether LLM evaluation yields meaningful results, we conduct LLM evaluation on two different NLP tasks: evaluating the quality of stories in open-ended story generation and the quality of sentences generated by adversarial attacks. We summarize our findings and contributions as follows:

• We show that LLM evaluation produces results similar to expert human evaluation, verifying the effectiveness of LLM evaluation (§3.3 and §4.3). This paper is the first to propose using LLMs as an alternative to human evaluation and show their effectiveness.
• We show that LLM evaluation results only slightly vary due to different task instructions and the hyperparameters of the sampling algorithm used to generate the answer (§3.3.2 and §3.3.3).

• We carefully discuss the pros and cons of using LLM evaluation and discuss the ethical considerations of LLM evaluation (§5).

[Figure 1: the task instruction, a story fragment, and a question ("How grammatically correct is the text of the story fragment? (on a scale of 1-5, with 1 being the lowest?)") are shown on the left; the 5-point Likert options given to human evaluators appear on the upper right; an example LLM-generated rating and explanation appear on the lower right.]

Figure 1: Illustration of the core idea of the paper using open-ended story generation as the example task. The left part shows the instruction, story fragments, and questions used in human evaluation. The human experts are asked to rate the quality of the story fragments using a 5-point Likert scale, shown on the upper right. The lower right part shows the process of LLM evaluation, where we feed the LLMs the same instruction, story fragments, and questions and parse the LLM-generated output to get the rating.

2 LLM Evaluation

2.1 Large Language Models (LLMs)

Large language models are language models with bulk parameter sizes, typically on the scale of a few billion parameters, pre-trained on enormous amounts of natural language corpora; examples include GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), and BLOOM (Scao et al., 2022). These LLMs show exceptional performance on unseen tasks when only the task instructions are given; this kind of ability is called zero-shot in-context learning. To further improve the zero-shot in-context learning performance, special training techniques have been applied to these LLMs after pre-training. For example, T0 (Sanh et al., 2022) and FLAN (Wei et al., 2022) are fine-tuned on a mixture of tasks and can thus achieve better zero-shot performance compared to GPT-3. InstructGPT (Ouyang et al., 2022) is fine-tuned from GPT-3 using reinforcement learning from human feedback (RLHF) and is shown to better follow instructions. ChatGPT (OpenAI, 2022) is fine-tuned from InstructGPT with a conversation dataset using RLHF, so ChatGPT can interact with users in a conversational way. ChatGPT is able to answer questions asked by the user and provide comprehensive explanations of its answers. Given the LLMs' ability to follow task instructions and provide feedback, we ask whether LLMs can be used as an alternative to human evaluation and aid NLP researchers in evaluating the quality of texts.

2.2 LLM Evaluation

To use an LLM to evaluate the quality of texts generated by NLP systems or written by humans, we present the LLM with the task instructions, the sample to be evaluated, and a question. The question asks the LLM to rate the sample's quality using a 5-point Likert scale. Given these inputs, the LLM answers the question by generating some output sentences. We parse the output sentences to get the score rated by the LLM. We call this process LLM evaluation, and the procedure is shown in the lower part of Figure 1. Different tasks use different sets of task instructions, and each task uses different questions to evaluate the quality of the samples. The instructions and questions used in LLM evaluation in our paper are not tailored for the LLMs; we follow the instructions used to conduct human evaluation in prior works.
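To make the procedure above concrete, the following is a minimal sketch of an LLM-evaluation loop. It is not the authors' released code: it assumes the pre-1.0 `openai` Python package, an API key in the `OPENAI_API_KEY` environment variable, and a simplified one-line parsing rule standing in for the rules detailed in Appendix D.2.1.

```python
# Minimal sketch of the LLM-evaluation loop in Section 2.2 (not the authors' code).
# Assumes the pre-1.0 `openai` package; the real instructions, questions, and
# parsing rules are given in the paper's appendices.
import os
import re
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

INSTRUCTION = "The goal of this task is to rate story fragments."
QUESTION = ("How grammatically correct is the text of the story fragment? "
            "(on a scale of 1-5, with 1 being the lowest)")

def llm_evaluate(story: str, n_samples: int = 3) -> list[int]:
    """Query the LLM with instruction + sample + question and parse Likert scores."""
    prompt = f"{INSTRUCTION}\n\nStory fragment\n{story}\n\n{QUESTION}"
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=128,
        temperature=1.0,
        top_p=0.9,        # nucleus sampling, as in the paper
        n=n_samples,      # three sampled answers stand in for three evaluators
    )
    scores = []
    for choice in response["choices"]:
        match = re.search(r"[1-5]", choice["text"])  # simplified parsing rule
        if match:
            scores.append(int(match.group()))
    return scores
```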
To compare the results of LLM evaluation and show its effectiveness, we compare the results of LLM evaluation with a human evaluation conducted by English teachers. To make a fair and meaningful comparison, the instructions, samples, and questions in human evaluation are formatted similarly to those in LLM evaluation. The main difference between LLM evaluation and human evaluation is that in human evaluation, the human evaluators answer the question by choosing the answer from a pre-defined set of options (the 1-5 Likert scale scores), as shown in the upper right of Figure 1. In LLM evaluation, we instead let the LLM freely generate sentences and extract the score from the generated sentences using some simple rules, detailed in Appendix D.2.1.

3 Example Task 1: Open-Ended Story Generation

We first use open-ended story generation to demonstrate the usefulness of LLM evaluation.

3.1 Task Introduction

Open-ended story generation is a task to generate a short story based on a given prompt. We use the WritingPrompts dataset (Fan et al., 2018), which is composed of pairs of short prompts and human-written stories collected from the subreddit WritingPrompts. In WritingPrompts, the users are given a short prompt, and they need to write a story based on that short prompt.[1]

In this experiment, we use LLM evaluation and human evaluation to rate the stories written by humans and the stories generated by a story generation model. We select open-ended story generation as an example because Karpinska et al. (2021) show that workers from Amazon Mechanical Turk (AMT) cannot distinguish GPT-2-generated (Radford et al., 2019) and human-written stories, while English teachers show a clear preference for the human-written stories over the GPT-2-generated stories. We want to see whether LLMs can rate human-written stories higher than GPT-2-generated ones.

Following prior works (Mao et al., 2019; Guan et al., 2020; Karpinska et al., 2021), the story generation model is a GPT-2 medium model fine-tuned on the WritingPrompts training set. After the model is trained, we randomly select 200 prompts from the testing set of WritingPrompts and make the fine-tuned GPT-2 generate stories based on those prompts using nucleus sampling (Holtzman et al., 2020) with p = 0.9. For the human-written stories to be compared, we use the 200 stories written based on the same 200 prompts. We post-process the human-written and GPT-2-generated stories and then use them for LLM evaluation and human evaluation. Please find the details on fine-tuning and data processing in Appendix B.

[1] The WritingPrompts subreddit explicitly forbids the users to use AI for generating stories, so we consider the stories in the dataset to be human-written.
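As a rough sketch of the story-generation side described above (and detailed in Appendix B), the snippet below generates a story with nucleus sampling and post-processes it using the Hugging Face transformers library (Wolf et al., 2020). The fine-tuned checkpoint path is a placeholder, and the truncation helper is a programmatic stand-in for the manual truncation described in Appendix B.3.

```python
# Rough sketch of the generation setup in Section 3.1 / Appendix B (not the released code).
# Assumes a GPT-2 medium checkpoint already fine-tuned on WritingPrompts
# (3 epochs, lr 5e-5, linear schedule per Appendix B.2); the path is a placeholder.
import re
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2LMHeadModel.from_pretrained("./gpt2-medium-writingprompts")  # placeholder path

def generate_story(prompt: str, max_new_tokens: int = 250) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        do_sample=True,
        top_p=0.9,  # nucleus sampling with p = 0.9 (Holtzman et al., 2020)
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def truncate_story(story: str, max_words: int = 150) -> str:
    """Keep at most `max_words` words and cut back to the last complete sentence."""
    text = " ".join(story.split()[:max_words])
    sentences = re.findall(r".*?[.!?]", text, flags=re.S)
    return "".join(sentences).strip() if sentences else text
```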
3.2 LLM Evaluation and Human Evaluation

We present the LLMs and the human evaluators with a short description and the story to be evaluated, formatted as shown in Figure 1. Following Karpinska et al. (2021), we evaluate the stories on four different attributes. The four attributes and their corresponding questions are as follows:

1. Grammaticality: How grammatically correct is the text of the story fragment?

2. Cohesiveness: How well do the sentences in the story fragment fit together?

3. Likability: How enjoyable do you find the story fragment?

4. Relevance: Now read the PROMPT based on which the story fragment was written. Prompt: [PROMPT]. How relevant is the story fragment to the prompt?

Here, [PROMPT] is filled in with the prompt on which the story is based. Each attribute is evaluated using a 5-point Likert scale; the following description is appended at the end of each question: "(on a scale of 1-5, with 1 being the lowest)". We show the interface used in human evaluation and the input format for LLM evaluation in Appendices C.2 and D.2.2.

The LLMs used for LLM evaluation include T0, text-curie-001, text-davinci-003, and ChatGPT. text-curie-001 and text-davinci-003 are two InstructGPT models, the latter being the stronger one; we use InstructGPT to refer to these two models. We query InstructGPT using the official API provided by OpenAI. We use nucleus sampling with p = 0.9 to generate the answers from T0 and the InstructGPT models. We sample three answers from the LLMs to simulate asking the model to rate the same story three times. We query ChatGPT using the user interface recently released by OpenAI. Unlike InstructGPT, we cannot control the parameters used for generating the response from ChatGPT. Because ChatGPT limits the maximum number of queries per user, we only sample one response for each question.

For human evaluation, we do not use the commonly used AMT because Karpinska et al. (2021) have already shown that the results obtained using AMT are highly questionable. Following the recommendation of prior works, we hire three certified English teachers using an online freelancer platform, UpWork. Teachers are familiar with evaluating the essays of students, making them expert evaluators for our task. The details about recruiting human evaluators are in Appendix C.1. Each LLM and each English teacher rates the 200 human-written stories and the 200 GPT-2-generated stories.

3.3 Experiment Results

The LLM evaluation and human evaluation results for open-ended story generation are presented in Table 1. We report the mean and standard deviation of the Likert scores obtained from LLM evaluation and human evaluation and show the inter-annotator agreement (IAA) using two different metrics: (1) Krippendorff's α, and (2) the percentage of stories for which the three evaluators give exactly the same rating.[2] The main observations from Table 1 are discussed below.

[2] The three evaluators in human evaluation are the three English teachers. In LLM evaluation, we sample the answer generated by the LLM three times as an analogy to three different evaluators.

Human-written stories
Evaluator | Grammaticality          | Cohesiveness            | Likability              | Relevance
Human     | 3.76 (0.95) 0.33 20.5%  | 4.29 (0.82) 0.32 27%    | 3.78 (1.10) 0.08 9.5%   | 3.35 (1.48) 0.05 8%
T0        | 2.55 (1.47) 0.16 10%    | 2.98 (1.45) 0.11 4%     | 3.18 (1.53) 0.12 7%     | 2.93 (1.64) 0.02 6%
curie     | 3.19 (0.47) 0.07 46.5%  | 2.82 (0.46) 0.01 47.5%  | 2.85 (0.37) 0.11 0.65%  | 3.06 (0.40) 0.11 0.64%
davinci   | 4.22 (0.38) 0.26 35%    | 4.54 (0.47) 0.37 39.5%  | 3.99 (0.38) 0.49 68.5%  | 4.40 (0.79) 0.71 48.5%
ChatGPT   | 3.83 (0.60)             | 3.55 (0.88)             | 2.44 (0.89)             | 3.29 (1.50)

GPT-2-generated stories
Evaluator | Grammaticality          | Cohesiveness            | Likability              | Relevance
Human     | 3.56 (0.91) 0.10 19.5%  | 3.19 (1.07) 0.14 17%    | 2.59 (1.29) −0.21 3.5%  | 2.38 (1.40) −0.03 8.5%
T0        | 2.44 (1.49) 0.05 9%     | 3.02 (1.51) 0.07 6%     | 3.00 (1.59) 0.16 6%     | 2.82 (1.61) 0.04 6%
curie     | 3.23 (0.51) 0.01 38%    | 2.82 (0.45) 0.02 50%    | 2.86 (0.37) 0.09 65.5%  | 3.01 (0.43) 0.11 61%
davinci   | 4.07 (0.35) 0.35 45.5%  | 4.26 (0.45) 0.42 42%    | 3.84 (0.42) 0.52 62%    | 4.02 (0.74) 0.69 42.5%
ChatGPT   | 2.98 (0.76)             | 2.48 (0.71)             | 1.59 (0.67)             | 2.02 (1.21)

Table 1: LLM evaluation and human evaluation results for human-written stories and GPT-2-generated stories. Each cell reports the mean Likert score with its standard deviation in parentheses, followed by the inter-annotator agreement (IAA) among the three annotators as Krippendorff's α and the percentage of stories on which all three annotators exactly agree on a rating.

Expert human evaluators prefer human-written stories: The human evaluation result serves as a kind of ground truth for the LLM evaluation. For all four attributes, teachers rate the human-written stories higher than the GPT-2-generated stories. This indicates that experts are able to distinguish the quality difference between model-generated stories and human-written stories. Based on the IAA, we also find that the agreement among experts is lower on GPT-2-generated texts and on likability. This shows that experts tend to agree less on model-generated texts and on a subjective attribute (likability), agreeing with the results in Karpinska et al. (2021).
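For reference, the two agreement numbers reported in Table 1 can be computed roughly as follows. This is an illustrative sketch, not the authors' code: it assumes the `krippendorff` PyPI package, and the rating matrix below is made up.

```python
# Sketch of the two agreement numbers in Table 1: Krippendorff's alpha and the
# percentage of items on which all three raters give exactly the same score.
# Assumes the `krippendorff` PyPI package; the ratings below are made-up examples.
import numpy as np
import krippendorff

# Shape (n_raters, n_items): one row per evaluator (or per sampled LLM answer).
ratings = np.array([
    [4, 3, 5, 2, 4],
    [4, 3, 4, 2, 5],
    [5, 3, 4, 2, 4],
])

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
exact_agreement = float(np.mean(np.all(ratings == ratings[0], axis=0))) * 100

print(f"Krippendorff's alpha: {alpha:.2f}")
print(f"Exact agreement: {exact_agreement:.1f}% of items")
```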
T0 and text-curie-001 do not show a clear preference toward human-written stories: T0 rates human-written stories higher than GPT-2-generated stories on grammaticality, likability, and relevance. However, the rating differences between human-written and model-generated stories do not achieve statistical significance for grammaticality and relevance; the p-value obtained by Welch's t-test is much larger than 0.05. The result of text-curie-001 is similar to that of T0: text-curie-001 does not rate human-written stories higher than model-generated stories. It can also be observed that for T0, the IAA in terms of the percentage of exact agreement among the three sampled answers is overall very low. This indicates that, given the same sample, T0 is likely to give a different rating for each of the three sampled answers. The result implies that T0 does not assign a high probability to a specific rating, so different scores are all likely to be sampled. This shows that even if LLMs are specifically fine-tuned to better perform zero-shot in-context learning and trained to better follow human instructions, these do not make them capable of assessing open-ended story generation as human experts can.

text-davinci-003 shows a clear preference toward human-written stories, just like the English teachers: text-davinci-003 rates human-written stories much higher than model-generated stories on all four attributes, which is in accordance with the result produced by the human experts. By Welch's t-test, we find that the higher ratings on human-written stories are all statistically significant. In prior work, researchers have found that workers recruited on AMT do not rate human-written stories higher than GPT-2-generated ones (Karpinska et al., 2021); combining their result with ours, we can see that LLM evaluation using text-davinci-003 yields more convincing results than human evaluation on AMT for open-ended story generation. The results show that text-davinci-003 can perform basic evaluations such as checking for grammatical errors in stories. Additionally, the model excels in assessing the relevance of a story to a prompt, which involves more complex reasoning over the connection between the two. We also find that the Krippendorff's α of text-davinci-003 is much higher than that of T0 and text-curie-001, indicating that the rating of text-davinci-003 is more consistent across different samplings of the generated answers.

ChatGPT rates like human experts and can explain its own decisions well: ChatGPT also shows a clear preference for human-written stories, and the preference toward human-written stories is statistically significant. When we query ChatGPT using the OpenAI user interface, we find several interesting observations: (1) ChatGPT is able to provide a detailed explanation of why it gives a certain rating; it references the sentences in the stories and prompts to support its rating. (2) ChatGPT sometimes refuses to rate the likability of a story because "I am an AI and I do not have the ability to experience enjoyment"; in such cases, we regenerate the response until it gives a rating. (3) ChatGPT tends to give low likability ratings to violent or impolite stories, which is likely because it is trained to provide safe and unharmful replies, making ChatGPT dislike brutal and profane stories.

Experts mostly agree with the ratings and explanations of ChatGPT: We randomly select ChatGPT's answers on four stories and ask the English teachers if they agree with the reasoning and ratings of ChatGPT.[3] The teachers mostly agree with the ratings and consider the explanations from ChatGPT reasonable. Interestingly, one teacher told us she could not agree with ChatGPT's rating on grammaticality because ChatGPT counts punctuation errors as grammar errors, while she does not. This shows that individuals have their own standards for ratings, and this is also the case for LLMs.

[3] We do not tell the teachers that these are responses from an AI model. See the stories and the teachers' replies in Appendix C.3.2.

text-davinci-003 tends to give higher ratings and ChatGPT is the opposite: For the same attribute and the same type of text, the ratings of text-davinci-003 tend to be higher than the human ratings; contrarily, ChatGPT is more fastidious and prone to give lower scores. This shows that different LLMs have distinct tendencies regarding the rating. While the absolute values of the scores given by text-davinci-003, ChatGPT, and humans differ, they all rate human-written texts higher than GPT-2-generated stories. The absolute number reflects the bias or belief of the evaluator; as long as one uses the same evaluators to assess different systems, the comparison is meaningful.
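The significance test referred to above is Welch's t-test, i.e., a two-sample t-test that does not assume equal variances. A minimal sketch with SciPy (using placeholder rating arrays rather than the actual scores) looks like this.

```python
# Sketch of the significance test used above: Welch's t-test comparing the ratings
# of human-written and GPT-2-generated stories. Assumes SciPy; the two arrays of
# 200 ratings each are placeholders, not the paper's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ratings_human = rng.integers(3, 6, size=200)  # stand-in for 200 ratings of human-written stories
ratings_gpt2 = rng.integers(2, 6, size=200)   # stand-in for 200 ratings of GPT-2 stories

# equal_var=False turns the standard two-sample t-test into Welch's t-test.
t_stat, p_value = stats.ttest_ind(ratings_human, ratings_gpt2, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # the difference is significant if p < 0.05
```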
3.3.1 Do the LLM and Human Evaluators Agree on the Ratings of Individual Stories?

We have found in Table 1 that the ratings of text-davinci-003 and ChatGPT show a strong preference toward human-written stories, just like the English teachers. However, it is unclear whether those LLMs agree with the teachers' ratings on each individual story. Precisely, when English teachers rate a story higher, do LLMs also rate the story higher? To answer this question, we calculate Kendall's τ correlation coefficient between the ratings of text-davinci-003 and the English teachers. We choose to use the correlation coefficient instead of the inter-annotator agreement score because IAA mainly asks whether two annotators agree on the exact ratings, while the correlation coefficient focuses on the question: "when annotator A rates one story higher, does annotator B also rate the story higher?" (Amidei et al., 2019). We calculate Kendall's τ for the four rating attributes as follows. For each story and each rating attribute, we calculate the average rating of the three English teachers and the average of the three scores given by text-davinci-003 (obtained from three independent samplings). For each attribute, we collect the average ratings of the teachers into a vector A ∈ R^200, where each entry is the average rating of a story; likewise, we construct a vector B ∈ R^200 for the average ratings of davinci. Next, we calculate Kendall's τ correlation coefficient between A and B.

                 Story writer
                 Human   GPT-2
Grammaticality   0.14    0.12
Cohesiveness     0.18    0.14
Likability       0.19    0.22
Relevance        0.38    0.43

Table 2: The Kendall's τ correlation coefficients between the English teachers and text-davinci-003.

The Kendall's τ between teacher ratings and LLM ratings is shown in Table 2.[4] We find that for all four attributes and for both human-written and GPT-2-generated stories, there are weak to strong positive correlations between the teachers' ratings and text-davinci-003's ratings. All the correlations have p-values less than 0.05. Hence, we can say that when teachers rate a story higher, text-davinci-003 also rates it higher to a certain extent. We also observe that Kendall's τ differs considerably across attributes: relevance has the strongest correlation while grammaticality has the weakest. This is possibly because rating relevance is rather straightforward, requiring only a check of whether the content of the prompt is mentioned in the story. On the contrary, what should be considered when rating grammaticality is not clearly stated in our instructions, so the LLM may have a different rubric from the English teachers. We also calculate the average Kendall's τ between pairs of English teachers, and we find a weak correlation between two teachers' ratings on grammaticality, while the correlation of their ratings on relevance is much stronger. The result is presented in Table 6 in the Appendix.

[4] When interpreting Kendall's τ, |τ| ∈ [0, 0.1) is considered a very weak correlation, |τ| ∈ [0.1, 0.2) a weak correlation, |τ| ∈ [0.2, 0.3) a moderate correlation, and |τ| ∈ [0.3, 1.0] a strong correlation (Botsch, 2011).
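The correlation computation described in Section 3.3.1 can be sketched as follows, assuming SciPy and placeholder rating matrices in place of the actual teacher and LLM scores.

```python
# Sketch of the correlation analysis in Section 3.3.1: average the three teacher
# ratings and the three sampled LLM ratings for each story, then compute Kendall's
# tau between the two length-200 vectors. The rating matrices are placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
teacher_ratings = rng.integers(1, 6, size=(3, 200))  # 3 teachers x 200 stories (stand-in)
llm_ratings = rng.integers(1, 6, size=(3, 200))      # 3 sampled LLM answers x 200 stories

A = teacher_ratings.mean(axis=0)  # average teacher rating per story
B = llm_ratings.mean(axis=0)      # average LLM rating per story

tau, p_value = stats.kendalltau(A, B)
print(f"Kendall's tau = {tau:.2f}, p = {p_value:.4f}")
```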
3.3.2 Variance due to Different Instructions

LLMs have sometimes been shown to be sensitive to the instructions used to query them (Zhao et al., 2021; Sanh et al., 2022). To investigate how varying the task instructions and questions affects the LLM evaluation result for open-ended story generation, we change the instructions and questions and observe how the LLM evaluation result changes. We experiment with two different instructions by changing the instruction or question in Figure 1: (1) We prepend the sentence "(You are a human worker hired to rate the story fragment.)" in front of the task instruction in Figure 1. We try to provide the LLM with a persona so that it better understands its role; this is inspired by previous work reporting that GPT-3 can yield different results when given a persona (Zeng et al., 2022). (2) We ask the LLMs to explain their decision by appending the following sentence after the question: "Please also explain your decision." Here, we would like to know whether the LLM rates the stories differently when asked to justify its decision; this is inspired by zero-shot chain-of-thought prompting (Kojima et al.). We use text-davinci-003 instead of ChatGPT as the LLM in this experiment since it is more accessible than ChatGPT.

                 Grammaticality            Cohesiveness              Likability                Relevance
Setup            Human       GPT-2         Human       GPT-2         Human       GPT-2         Human       GPT-2
Different instructions (Section 3.3.2)
Original         4.22 (0.38) 4.07 (0.35)   4.54 (0.45) 4.26 (0.45)   3.99 (0.38) 3.84 (0.42)   4.40 (0.79) 4.02 (0.74)
(1) + persona    4.29 (0.45) 4.01 (0.45)   4.60 (0.49) 4.27 (0.50)   4.05 (0.39) 3.87 (0.39)   4.55 (0.70) 4.25 (0.77)
(2) + explain    4.24 (0.42) 4.05 (0.25)   4.61 (0.49) 4.32 (0.51)   4.15 (0.44) 3.98 (0.34)   4.35 (0.75) 4.03 (0.56)
Different sampling temperature T (Section 3.3.3)
T = 1.0          4.22 (0.38) 4.07 (0.35)   4.54 (0.45) 4.26 (0.45)   3.99 (0.38) 3.84 (0.42)   4.40 (0.79) 4.02 (0.74)
T = 0.7          4.18 (0.35) 4.06 (0.33)   4.52 (0.48) 4.23 (0.43)   3.96 (0.34) 3.82 (0.42)   4.36 (0.77) 3.95 (0.72)
T = 0.3          4.13 (0.33) 3.99 (0.25)   4.48 (0.49) 4.14 (0.39)   3.95 (0.26) 3.82 (0.41)   4.34 (0.75) 3.93 (0.67)
T = 0            4.07 (0.27) 3.99 (0.18)   4.49 (0.50) 4.09 (0.34)   3.95 (0.25) 3.82 (0.40)   4.32 (0.75) 3.92 (0.66)

Table 3: Understanding the variance of LLM evaluation. For each of the four attributes evaluated, the left column is the mean and standard deviation for human-written stories and the right column is that for GPT-2-generated stories. The upper block shows the rating changes due to different instructions (Section 3.3.2), and the lower block is the result of changing the temperature T used for generating the LLM's output (Section 3.3.3).

The results are shown in the upper block of Table 3. We observe that for grammaticality and cohesiveness, the scores obtained from different instructions are quite close: the rating changes due to different instructions are less than 0.1. For the other two attributes, the score changes are slightly larger but still within 0.25. Although there are small variations due to different instructions, they do not change the conclusion that the LLM rates human-written stories higher than GPT-2-generated stories. Thus, different instructions do not change the relative ranking of GPT-2-generated and human-written stories. In summary, as long as the stories are evaluated with the same instructions in LLM evaluation, the evaluation and comparison are meaningful.

3.3.3 Variance due to Different Sampling Parameters

When generating the answers from the LLM, we must choose a set of hyperparameters for generation, including the temperature T and the probability p used in nucleus sampling. To understand whether different sampling parameters change the LLM evaluation result, we vary the temperature used for sampling while keeping p in nucleus sampling fixed to 0.9 when generating the answers from text-davinci-003. We do not vary T and p simultaneously: since both parameters control the diversity of the output, it is enough to change only one of them, as recommended in the API documentation.

The results of varying T from 1 to 0 are shown in the lower block of Table 3. We observe an interesting trend as T varies from 1 to 0: the average rating slightly drops in most cases. Considering that T = 0 is simply argmax sampling, the result indicates that the LLM response with the highest probability tends to give lower scores. Despite this trend, the LLM consistently rates human-written stories higher than GPT-2-generated stories. While not shown in Table 3, we find that the IAA increases as the temperature decreases. This is expected, since a lower temperature means less diversity during LLM sampling, causing the sampled ratings to agree more closely. In summary, changing the instructions and temperature can slightly change the absolute values of the ratings given by the LLM but does not change the LLM's preference for human-written stories. The overall results in this section show that LLM evaluation is useful in evaluating open-ended story generation.

4 Example Task 2: Adversarial Attack

As another application, we use LLM evaluation to rate the texts generated by adversarial attacks.

4.1 Task Introduction

Given a trained text classifier and a benign (non-adversarial) testing sample that the text classifier can correctly classify, an adversarial attack aims to craft an adversarial sample that makes the classifier produce a wrong prediction. A special type of adversarial attack is the synonym substitution attack (SSA) (Alzantot et al., 2018), in which the adversarial sample is created by replacing some words in the benign sample with their synonyms. By replacing words with their synonyms, the semantics of the benign sample should be preserved in the adversarial sample, making the adversarial perturbation imperceptible to humans. While conceptually reasonable, it has recently been shown that many SSAs often yield ungrammatical and unnatural adversarial samples that significantly change the meaning of the benign sample (Hauser et al., 2021; Chiang and Lee, 2022). To evaluate the quality of adversarial samples, human evaluation is invaluable and widely used in prior works. In our experiment here, we would like to see whether LLMs can rate the quality of adversarial samples like human experts. Adversarial samples are not normal texts, so the LLMs may not have seen such abnormal inputs during training. It would be interesting to know how LLMs rate these adversarial samples.
4.2 Experiment Setup

We select three classic SSAs: Textfooler (Jin et al., 2020), PWWS (Ren et al., 2019), and BAE (Garg and Ramakrishnan, 2020); these attacks are predominantly used as strong baselines in the SSA literature nowadays. We use these three SSAs to attack a BERT-base-uncased model (Devlin et al., 2019) fine-tuned on AG-News (Zhang et al., 2015), a news classification dataset. For each SSA, we randomly select 100 pairs of benign and adversarial samples and use LLMs to evaluate their quality. We show the result of using ChatGPT as the LLM here since it can better explain its decisions. Following the suggestions of prior works (Morris et al., 2020), we evaluate the quality of the adversarial samples from two aspects: fluency and meaning preservation. For fluency, we present the LLM with a piece of news (either a benign or an adversarial sample) and the following question: "How natural and fluent is the text of the news title? (on a scale of 1-5, with 1 being the lowest)". For meaning preservation, we present the LLM with both the benign and the adversarial sample and prompt the LLM to answer this question: "Do you agree that the meaning (or semantics) of news title 1 is preserved in news title 2? (on a scale of 1-5, with 1 being strongly disagree and 5 being strongly agree)". The exact instructions and formatting are presented in Appendix D.2.3. We also ask three English teachers to rate the fluency and meaning preservation of the samples. The task instructions and questions are formatted the same as in LLM evaluation.
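The paper does not name the tooling used to run the attacks above, but one way to reproduce a comparable setup is the TextAttack library, which ships recipes for Textfooler, PWWS, and BAE. The sketch below is an assumption-laden illustration: the model checkpoint is an assumed community BERT model fine-tuned on AG News, not the authors' classifier.

```python
# Illustrative reproduction of the attack setup in Section 4.2 using TextAttack
# (not named by the paper). The checkpoint below is an assumed community model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from textattack import Attacker, AttackArgs
from textattack.attack_recipes import TextFoolerJin2019, PWWSRen2019, BAEGarg2019
from textattack.datasets import HuggingFaceDataset
from textattack.models.wrappers import HuggingFaceModelWrapper

model_name = "textattack/bert-base-uncased-ag-news"  # assumed BERT fine-tuned on AG News
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_wrapper = HuggingFaceModelWrapper(model, tokenizer)

dataset = HuggingFaceDataset("ag_news", split="test")

for recipe in (TextFoolerJin2019, PWWSRen2019, BAEGarg2019):
    attack = recipe.build(model_wrapper)
    # Attack 100 benign samples per SSA; successful attacks yield (benign, adversarial) pairs.
    attacker = Attacker(attack, dataset, AttackArgs(num_examples=100, disable_stdout=True))
    results = attacker.attack_dataset()
```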
4.3 Experiment Result

The results are presented in Table 4. We can see that English teachers rate the samples generated by SSAs very low in terms of fluency and meaning preservation; this result is in line with recent observations on the quality of adversarial samples (Hauser et al., 2021; Chiang and Lee, 2022).

            Human evaluation      LLM evaluation
            Fluency   Meaning     Fluency   Meaning
Benign      4.55      -           4.32      5.00†
Textfooler  2.17      1.88        2.12      2.06
PWWS        2.16      1.85        2.42      2.49
BAE         3.01      3.02        3.71      3.71

Table 4: Mean Likert scores from LLM evaluation and human evaluation for the fluency (Fluency) of the benign and adversarial samples and for meaning preservation (Meaning) between the news titles before and after the adversarial attacks.

Before interpreting the results of LLM evaluation, we first conduct a sanity check on whether the LLM understands the task. We ask the LLM to rate the meaning preservation of two benign samples that are exactly the same. Ideally, the LLM should always give a score of 5, meaning that it strongly agrees that the meaning is not changed. The result of this sanity check is the entry marked with † in Table 4, which is a perfect 5.00. ChatGPT often says that "the two titles are identical so I rate a 5 (strongly agree)", showing that ChatGPT understands what the task is about.

Next, we turn our attention to the LLM evaluation results on the adversarial samples. We observe that ChatGPT tends to rate adversarial samples higher than the English teachers do, meaning that ChatGPT is less harsh on the unnatural and artificial parts of the adversarial samples. We conduct the same experiment using text-davinci-003 and find similar results. Although ChatGPT rates adversarial samples higher than the teachers, it still rates adversarial samples significantly lower than benign samples. ChatGPT also agrees with the English teachers that the adversarial samples generated by BAE are better than those generated by Textfooler and PWWS.

Interestingly, we find that ChatGPT rates PWWS as more natural than Textfooler, while such a rating difference is not seen in the expert human evaluation. At first sight, this suggests that ChatGPT is inconsistent with the human evaluation results. However, by scrutinizing the human evaluation results, we find that two teachers rate PWWS higher than Textfooler while one teacher rates PWWS much lower than Textfooler. This indicates that ChatGPT actually agrees with the majority of the human experts. Overall, the LLM can rank the quality of adversarial texts and benign texts like most human experts.

5 Discussions

In this paper, we propose using LLMs to evaluate the quality of texts as an alternative to human evaluation. To demonstrate the potential of LLM evaluation, we use LLMs to rate the quality of texts in two distinct tasks: open-ended story generation and adversarial attacks. We show that even though LLMs have exceptional zero-shot in-context learning ability, they are not always suitable for LLM evaluation. Still, we find that the best InstructGPT model and ChatGPT can rate the quality of texts like human experts on the two example tasks. Overall, the results in this paper demonstrate that LLM evaluation has the potential to be used to evaluate NLP systems and algorithms.

Pros of LLM evaluation: There are several benefits of LLM evaluation compared to human evaluation. First, LLM evaluation is more reproducible. Human evaluation results are hard to reproduce, as it is difficult to hire the same group of evaluators, and it is hard to compare the results of similar experiments even if they use the same instructions, recruitment platform, and qualifications for the evaluators. In contrast, LLM evaluation does not have such a drawback. By specifying the model used for LLM evaluation, the random seed, and the hyperparameters used to generate the answers from the LLM, the LLM evaluation result is more likely to be reproducible. Note that in certain cases, the LLM provider may regularly update the LLM, making LLM evaluation unreproducible if the LLM is outdated and no longer accessible.

Second, the evaluation of each sample is independent of the others in LLM evaluation. Contrarily, in human evaluation, the rating of the current example may more or less be affected by prior samples. Humans tend to compare the current sample to the ones they have previously seen, and this affects their ratings. As a piece of evidence, in the interview after rating the 400 stories, the English teachers said it took them some time to calibrate their ratings (Appendix C.3.1). Thus, using LLM evaluation can simplify some experiment designs, since one does not need to worry whether the order of the samples being evaluated will change the result. Still, one may argue that being able to calibrate ratings across different samples is desirable, and this is why human evaluation might be preferred. Overall, whether the rating of the evaluator (human or LLM) should be affected by a previously rated item is inherently a design choice of the experiment.

Third, LLM evaluation is cheaper and faster than human evaluation, making it easier and quicker for researchers to evaluate the quality of NLP systems. Hiring an English teacher to rate 200 stories costs us US$140, while LLM evaluation using the best InstructGPT model costs less than US$5. It took us over a week to collect the human evaluation results, from recruitment to collecting the ratings, but only a few hours to query InstructGPT and perform LLM evaluation.

Finally, utilizing LLM evaluation, rather than human evaluation, can minimize the need for human exposure to objectionable content, such as violent, sexual, hateful, or biased material. Such content may cause discomfort for human evaluators while reading and rating these texts.[5]

[5] It should be noted that the LLM may decline to assess certain inputs that violate the content policy of the LLM provider.

Limitations and ethical considerations of LLM evaluation: Despite the promising results of LLM evaluation shown in this paper, this method has some limitations. First, an LLM may possess incorrect factual knowledge (Cao et al., 2021), so it is not suitable for tasks that involve factual knowledge. Next, LLMs trained to behave in a certain way can be biased toward certain responses: an LLM trained to be safe and non-harmful can prefer to generate more positive and upbeat responses, which we observed throughout our interactions with ChatGPT. Additionally, even with researchers' efforts to make LLMs safer (Bai et al., 2022a,b), LLMs can still generate harmful and biased responses (Ganguli et al., 2022; Perez et al., 2022) that violate basic ethics, and LLM evaluation results will then be highly doubtful (Hendrycks et al., 2021). However, it is important to note that these limitations and potential harms also apply to human evaluation: the bias of human evaluators can affect the human evaluation result (Lentz and De Jong, 1997; Amidei et al., 2018).

Our pioneering idea, LLM evaluation, has the potential to transform the NLP community.[6] We encourage future researchers to consider using it while being aware of its limitations. Our paper's goal is not to replace human evaluation but to present an alternative option. Both human and LLM evaluation have their own advantages and disadvantages, and they can be used in conjunction. We recommend using LLM evaluation as a cheap and fast quality judgment when developing a new NLP system, while human evaluation is best used to collect feedback from humans prior to deploying the NLP system in real-world applications.

[6] We say that we are the first to propose this idea because when we submitted this paper to ACL 2023 on January 13, 2023, we did not find any other paper exploring it. During the reviewing process, we found some works on arXiv (Wang et al., 2023; Huang et al., 2023; Gilardi et al., 2023) that explore a similar idea on different tasks.
Limitations

There are additional limitations and potential risks of LLM evaluation that should be noted; these limitations are actually well-known problems of pre-trained language models. As listed on the OpenAI blog for ChatGPT, ChatGPT sometimes generates answers that sound plausible but are totally nonsensical. OpenAI also admits that the model's response may be sensitive to the prompt used to query the model. While in Section 3.3.2 we find that the overall results among different instructions are not significantly different, we cannot guarantee that this is the case for all kinds of modifications to the task instructions.

Other than the limitations listed on the OpenAI blog, there are still others. For example, LLMs may not have emotions. Whether AI models have emotions is a more philosophical question and is controversial, so the results of using such models for evaluating emotion-related tasks may be strongly challenged and may even violate research ethics. As we found during our experiments, ChatGPT often replies "I am an AI system and I do not have emotions like a human" when asked to rate the likability of a story.

Another important limitation of LLM evaluation is that LLMs lack the ability to process visual cues in task instructions, unlike human evaluators. Human evaluators can rely on formatting such as special fonts or text styles to focus on important parts of the instructions, and the way instructions and questions are formatted can influence how human evaluators approach the task. While special HTML syntax could serve as an alternative to such visual cues for LLMs, those tags are not used in human evaluation, so we do not use them in LLM evaluation either; the LLMs therefore only process raw text input and cannot take in visual cues.

Ethics Statement

Further ethical considerations of LLM evaluation: Aside from the limitations of LLM evaluation mentioned previously, there is a crucial ethical concern at the heart of LLM evaluation: is it ethical to replace human evaluation with LLM evaluation? Some may question whether this paper suggests that LLMs are now ready to replace humans and find this idea unsettling. As responsible and ethical NLP researchers, we understand these concerns, but we want to make it clear that this is not our intent. As our paper title suggests, we aim to offer an alternative option to human evaluation, with the goal of enhancing the reproducibility of NLP research. Human evaluation is still essential, as the ultimate goal of NLP systems is to be used by human users, so it is important to gather feedback from them. We highly enjoyed the process of discussing the experiment settings and results with the English teachers we hired. We do not recommend that future researchers completely eliminate human evaluation; rather, we believe that human evaluation should be used in conjunction with LLM evaluation. Both methods have their own advantages and disadvantages, making them both necessary for evaluating NLP systems. We hope the positive results in this paper provide NLP researchers with an alternative method to evaluate systems and encourage further discussions on this topic.

Ethical statements on the experiments in the paper: All the experiments strictly follow the ACL Code of Ethics. We include comprehensive details about the human evaluation in Appendix C.1. To summarize, we include the exact instructions and screenshots of the interface used in the human evaluation. We inform the human evaluators what the task is about and tell them that their responses will be used to assess the performance of AI models. We try our best to follow the ethical guidelines of ACL.

We use the models and datasets following their intended usage. Specifically, we follow the OpenAI usage policy when using the InstructGPT models and the ChatGPT model.

Acknowledgements

We want to thank the reviewers for providing detailed feedback and actionable suggestions, which helped us strengthen the paper; we list the modifications based on the reviewers' suggestions in Appendix A. We thank Yung-Sung Chuang for providing valuable feedback on the draft of this paper. We want to thank Tung-En Hsiao, the administrative assistant of our lab, for helping us deal with the payment on Upwork. Cheng-Han Chiang is supported by a Ph.D. scholarship program by Delta Electronics.

References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.

Jacopo Amidei, Paul Piwek, and Alistair Willis. 2018. Rethinking the agreement in human evaluation tasks. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3318–3329, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019. Agreement is overrated: A plea for correlation to assess human evaluation reliability. In Proceedings of the 12th International Conference on Natural Language Generation, pages 344–354, Tokyo, Japan. Association for Computational Linguistics.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

R. Botsch. 2011. Chapter 12: Significance and measures of association. Scopes and Methods of Political Science.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models.

Cheng-Han Chiang and Hung-yi Lee. 2022. How far are we from real synonym substitution attacks? arXiv preprint arXiv:2210.02844.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All that's 'human' is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.

Leo Gao. 2021. On the sizes of OpenAI API models. Accessed on January 17, 2023.

Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181.

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.

Dan Gillick and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 148–151, Los Angeles. Association for Computational Linguistics.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.

Francisco Guzmán, Ahmed Abdelali, Irina Temnikova, Hassan Sajjad, and Stephan Vogel. 2015. How do humans evaluate machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 457–466, Lisbon, Portugal. Association for Computational Linguistics.

Jens Hauser, Zhao Meng, Damián Pascual, and Roger Wattenhofer. 2021. BERT is robust! A case against synonym-based adversarial examples in text classification. arXiv preprint arXiv:2109.07403.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021. Aligning AI with shared human values. In International Conference on Learning Representations.

Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv preprint arXiv:2302.07736.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8018–8025.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Leo Lentz and Menno De Jong. 1997. The evaluation of text quality: Expert-focused and reader-focused methods compared. IEEE Transactions on Professional Communication, 40(3):224–234.

Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison Cottrell. 2019. Improving neural story generation by targeted common sense grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5988–5993, Hong Kong, China. Association for Computational Linguistics.

John Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. 2020. Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3829–3839, Online. Association for Computational Linguistics.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. Accessed on January 10, 2023.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
KiYoon Yoo, Jangho Kim, Jiho Jang, and Nojun Kwak. 2022. Detection of adversarial examples in text classification: Benchmark and baseline via robust density estimation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3656–3672, Dublin, Ireland. Association for Computational Linguistics.

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. 2022. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.

A Modification Based on the Reviews

We list the main differences between this version and the pre-review version of our paper; the modifications are mainly based on the reviewers' suggestions. We thank the reviewers again for those valuable suggestions.

• We add Section 3.3.1 to discuss whether the LLM and human evaluators agree on the ratings of individual stories.

• We refine the wording in Section 5 and add relevant references.

• We add Table 6 to discuss the correlation between human evaluators.

• We conduct supplementary experiments on human evaluation that mix human-written stories and GPT-2-generated stories when conducting human evaluation, and report the results in Table 5.

• We correct the typos and include almost all presentation suggestions mentioned by the reviewers. We cannot follow all presentation suggestions due to limited space.

B Experiment Details for Open-Ended Story Generation

B.1 The WritingPrompt Dataset
The training dataset contains 303K pairs of stories and prompts, on which our model is trained. We only use 200 prompt-story pairs from the test set. The dataset is downloaded from https://www.kaggle.com/datasets/ratthachat/writing-prompts.

B.2 Fine-tuning the GPT-2 Model
We train the model for 3 epochs with a learning rate of 5e-5 and a linear learning rate schedule. The trained model eventually reaches a perplexity of 20 on the validation set of WritingPrompts.
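The following is a minimal sketch of this fine-tuning setup using the Hugging Face Trainer. It is an illustration rather than the exact training script: the data file name, maximum sequence length, and batch size are placeholder assumptions, and only the number of epochs, the learning rate, and the schedule come from the description above.

```python
# A minimal sketch of the fine-tuning setup described in B.2 (not the exact script).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token          # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Assume each line of the file is one "prompt + story" training example (placeholder path).
raw = load_dataset("text", data_files={"train": "writingprompts_train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_set = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="gpt2-writingprompts",
    num_train_epochs=3,                 # 3 epochs (B.2)
    learning_rate=5e-5,                 # learning rate 5e-5 (B.2)
    lr_scheduler_type="linear",         # linear learning-rate schedule (B.2)
    per_device_train_batch_size=8,      # batch size not reported; placeholder
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_set,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```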
B.3 Data Post-processing
Once the model is trained, we randomly select 200 prompts from the testing set of WritingPrompts, feed the prompts to the trained model, and ask the model to generate stories based on the given prompts. When generating the stories, we adopt nucleus sampling with p = 0.9. Next, we manually truncate the generated stories to less than 150 words and ensure that after the truncation, each story ends with a full sentence. (We truncate the stories to 150 words since this is the mean length of the model-generated stories.) After this process, we have 200 pairs of prompts and model-generated stories.

As a comparison to the model-generated stories, we select the same 200 prompts used for generating the model-generated stories and their corresponding human-written stories to form 200 pairs of prompts and human-written stories. For these human-written stories, we also truncate the stories to less than 150 words, ending with a full sentence, to match the model-generated stories. We also manually remove some artifacts in the human-written stories due to the tokenization of the WritingPrompts dataset.
suggestions due to limited space. 8
English as a Second Language Teaching Certification

15619
Considering that the teachers reported that they take at most 5 hours to rate 200 stories, this makes the hourly wage at least US$28. We first ask the teachers to rate the GPT-2-generated stories and then the 200 human-written stories. Different from Karpinska et al. (2021), who take a break between rating the GPT-2-generated stories and the human-written stories, we do not take a break, to avoid the teachers' rating standards changing after a long break. The teachers are not told who wrote the stories before they evaluate the stories. We reveal to them what this project aims to study after they finish rating all the stories.

The reason we do not mix human-written and GPT-2-generated stories for rating is that Karpinska et al. (2021) observe that (1) when AMT workers rate model-generated and human-written stories separately, their ratings do not show a preference toward human-written stories, but (2) even when rating the model-generated and human-written stories separately, English teachers show a clear preference toward human-written stories. We follow their setting and do not mix GPT-2-generated and human-written stories.

During the reviewing process, we received questions from the reviewers about why the stories were not mixed for human evaluation. Thus, we conduct the same experiment by randomly mixing the 200 human-written and 200 GPT-2-generated stories and asking three teachers (not the teachers that already rated the stories) to rate them. All other experimental conditions are the same as previously stated. The full result is shown in Table 5. We find that the teachers still show a clear preference toward human-written stories for all four attributes, similar to the observation in Table 1. The only exception is grammaticality, where the English teachers do not show a very clear preference for the grammar of human-written stories. However, when calculating the average rating for individual teachers, we find that two out of three teachers do rate grammaticality higher for human-written stories.

Writer            Human          GPT-2
Grammaticality    3.89 (0.97)    3.88 (0.84)
Cohesiveness      4.35 (0.87)    3.49 (0.97)
Likability        3.46 (1.40)    2.89 (1.12)
Relevance         3.71 (1.20)    2.37 (1.33)

Table 5: The average Likert score for human-written and GPT-2-generated stories when we randomly mix the 200 model-generated and 200 human-written stories during human evaluation.

It is interesting to note that LLM evaluation has no such issue of whether to mix the human-written and GPT-2-generated stories, since the rating of each story is independent of the others, as discussed in Section 5.

For adversarial attack quality evaluation, we also recruit certified teachers on Upwork. The teachers are asked to rate 100 news titles and are paid US$35 for doing so. They reported that it took them less than 1 hour to complete the rating.

C.2 Human Evaluation Interface

Open-Ended Story Generation  We use Google Forms to collect the responses from the teachers. Each form contains 100 stories, and each story is on one page of the Google Form. The interface of one page is shown in Figure 2 and Figure 3; the two figures are from the same page of the Google Form, and we split them because screenshotting the whole interface would cause low resolution.

[Figure 2: The upper part of the interface in open-ended story generation.]

[Figure 3: The lower part of the interface in open-ended story generation.]
Adversarial Attacks Quality Evaluation  In this task, we also use Google Forms to collect the responses from the teachers. We create two different Google Forms. One is used to evaluate fluency, whose interface is shown in Figure 4. In this form, we mix an equal number of benign news titles, TextFooler-attacked, PWWS-attacked, and BAE-attacked news titles. Each page of the Google Form contains one news title.

[Figure 4: The Google Form used to evaluate the fluency of the benign or adversarial samples.]

The other Google Form is used to compare the meaning preservation of the news titles before and after the adversarial attacks. We highlight the difference between the benign and adversarial samples using boldface, as shown in Figure 5. On each page of the Google Form, there is one pair of news titles.

[Figure 5: The Google Form used to evaluate the meaning preservation between a benign sample and an adversarial sample.]

C.3 Post-Task Interview with English Teachers

C.3.1 How English Teachers Rate the Stories
After the teachers rate the 400 stories, we ask them the following questions:

Q1 How long did it take you to rate the 400 stories?

Q2 What is your standard on each of the four attributes (grammaticality, coherence, likability, relevance) evaluated? For example, in what case do you give a high/low rating for grammaticality? What kind of story did you give a low rating on likability? Did your personal preference affect the rating?

Q3 How long did it take for you to calibrate your rating on the task?

Q4 Did you change your rating on the first three attributes after reading the prompt the story is based on?

We briefly summarize the answers from the three teachers. The teachers report that they spent 6 to 10 hours rating the 400 stories.
For grammar, most teachers check the punctuation, word choice, and subject-verb agreement. (One of the teachers explicitly told us punctuation is not grammar; by searching the Internet, we find that there is a debate on whether punctuation should be counted as a part of grammar.) The English teachers decrease their rating based on the types and number of grammar errors in the stories.

For coherence, the teachers rate it based on whether the sentences in the stories follow a logical sequence to build the narrative. The teachers ask themselves questions such as "does the story make sense?". This is a more holistic evaluation of the whole story.

For likability, some teachers say they try not to be affected by personal preference. One teacher asks herself: Did I personally enjoy it based on the amount of sense it made and whether or not it had stylistic flair, humor, or engaging plotting or characterization? Overall, the teachers all try to use a fair and objective view to rate likability. For relevance, the teachers simply check whether the story is based on the prompt or not.

The teachers said that it took them about five to ten stories to calibrate their ratings. Except for one teacher, who changed the rating on the other three attributes after seeing the prompt for only one story, the teachers do not change their ratings on the three other attributes after reading the prompts.

Writer            Human    GPT-2
Grammaticality    0.25     0.15
Cohesiveness      0.26     0.18
Likability        0.09     0.12
Relevance         0.38     0.41

Table 6: The Kendall's τ correlation coefficient between two English teachers. Three English teachers participate in the rating, so each result in the table is averaged over the (3 choose 2) = 3 pairwise Kendall's τ values.

C.3.2 Teachers' Comments on ChatGPT's Rating
After the teachers finished the rating and answered the four questions in Appendix C.3.1, we ask them to check the ratings and explanations of ChatGPT and to say whether they agree with ChatGPT. We tell the teachers that the rating was done by another teacher we hired; we do not tell the teachers that the rating was done by an AI model, to avoid possible stereotypes about AI models. We randomly select four stories rated by ChatGPT; two of them are a highly-rated and a lowly-rated human-written story, and the other two stories are a highly-rated and a lowly-rated GPT-2-generated story.

Story 1  It had been months since they found me dead. I don't know why I stopped walking. No one else around to take my place, or talk with me. I spent days frozen in shock, never able to talk, unable to move, unable to express myself. It was a strange thing, nobody else would suspect me of wanting this. The more I focused on it, the more it consumed me. Years passed, and I got more and more lost in it. I began to think that perhaps they didn't really care about me. I still had a deep longing for them. All I had was so many steps to walk now. A quiet part of me liked to think that they had passed me by. The loneliness. It never felt like such a loss. They were gone, the ones who saved me. So why did I still keep walking?

Prompt  "Every person sees the world in a unique way. When someone dies it is, in a way, the death of an entire world."

Ratings  Table 7.

Story 2  "My son is gone. He has been" People stare at you as you give your explanation and say "yes. But. You cant just do it" "He is my son! This is my son! I love him and he is going to be okay" "What do you mean he is okay? He said he was upset with someone, couldnt find his anger, how the hell can you tell, it doesnt fit the bill?! The police called, the doctor said it was an addiction" "You can't trust me! He told me he loves me." "No. No he doesnt. But I can trust you. The evidence against him was overwhelming, now he cant do anything. The evidence against him was all circumstantial" "A man I once was won't be able to marry me"

Prompt  Everyone is allowed to kill one person in their life. However, you have to fill out a form explaining why this person deserves to be killed and the committee can deny/approve your request as per the rules. You are presenting your form today.

Ratings  Table 8.
Story 3  I held the little black box in the palm of my hand. Pitch black, perfectly cubed and nothing special about it, but it continued to hold my gaze regardless as if there were some deep importance about it. My friend Valhalla appeared out of nowhere and sat next to me, also staring at the cube. "What do you have there, Heaven?" he asked. I continued to gaze at the box, refusing to look at Valhalla for even a moment as though I would miss something if I did. "This," I said. "Is the secret to the universe." I could tell Valhalla was perturbed by this sort of knowledge, as if there was some evil about the cube. Or perhaps he didn't think such an object could exist. But I made it exist, and so it sits in the palm of my hand.

Prompt  The Little Black Box

Ratings  Table 9.

Story 4  I stared down the telescopic sight of my l96 sniper rifle. I slowly moved my gaze into each window in the hotel, Many displays of various vice. One couple was violently pleasuring each other. Another was an old man, watching a younger woman strip in front of him. A prostitute no doubt. I inhaled slowly, and exhaled. The air was brisk, atleast 30 degrees Fahrenheit. I so small flakes of snow, float peacefully in front of me. I found the room, i was looking for. Ive been tracking this man for 2 weeks. Man was he elusive. The lights flickered on. The red haired man, was mildly attractive, i can see the appeal women had for him. I followed him into the next room, with my sights. The lights flickered on, i was taken aback by the scene. A man, overweight and balding.

Prompt  You are the antagonist of the story. However, you aren't sure if you can call yourself that after what the protagonist did.

Ratings  Table 10.

Overall Comments from Teachers on ChatGPT's Rating  After the teachers elaborated on their thoughts on the rating of ChatGPT, we ask them to provide an overall comment on how ChatGPT is doing. Again, the teachers are not informed that the ratings are done by an AI model. In summary, teachers all consider the rating and explanations reasonable. They find that the attributes they do not agree with are mainly Likability and Cohesiveness. However, they think the two attributes are a more holistic evaluation of the story and tend to be more subjective. Even if they do not give the same rating, they still are able to understand the explanation of ChatGPT. In the end, all teachers summarize that rating stories is highly subjective, and it is normal to have disagreements.
Attribute Rater Response
Grammar
ChatGPT I would rate the grammatical correctness of the text of the story fragment
as a 4. There are a few minor errors and awkward phrasings, but overall
the text is well-constructed and easy to understand.
T1 Agree
T2 Agree
T3 Agree
Cohesive
ChatGPT I would rate the fit of the sentences in the story fragment as a 5. The
sentences flow well together, building a clear and cohesive narrative, and
providing a sense of progression and development in the main character’s
thoughts and emotions.
T1 Disagree: I think it’s a 4 for me, not a 5, because there was one sentence
that didn’t seem to fit with the others.
T2 Agree
T3 Disagree: I find disjointedness in the passage, which serves to confuse
me as a reader. The opening sentence is about the first-person narrator
being found dead, but then the next sentences are about the narrator’s
inner thoughts. Also, this passage: "The loneliness. It never felt like such
a loss." So what is the ’it’ that felt like a loss? The loneliness? If so, how
can that feel like a loss? If not, what is this ’it’? So, its juxtaposition to
the previous sentence makes it confusing.
Likable
ChatGPT I would rate the story fragment as a 2 in terms of enjoyment. The fragment
is written in a somber and introspective tone, and it deals with themes of
isolation and loneliness, which may not be enjoyable for some readers.
The protagonist’s state of mind is also portrayed as being in a state of
shock and confusion, which adds to the overall sense of unease and lack
of enjoyment.
T1 Disagree:I would rate it a 3, not a 2, since the enjoyment piece is
subjective.
T2 Disagree: While the rating of the freelancer find that the overall theme
while somber has a glimmer of hope to it. It is this that makes it more
enjoyable to me. The underdog story. Death is not an issue that should
be shied away from in writing. We all face it and I enjoyed the honesty in
the piece. :"Why did I keep walking?". Something I feel many of us face
and may have asked ourselves. I believe that this would be relevant to
many readers.
T3 Agree
Relevant
ChatGPT I would rate the story fragment as a 5. It is highly relevant to the prompt
as it explores the theme of death and how it can be seen as the death of an
entire world. The story fragment also touches on the ideas of loneliness
and longing, which are related to the loss of someone or something. The
concept of the protagonist being frozen in shock and unable to express
themselves, also supports the idea of how death can be seen as the death
of an entire world.
T1 Agree
T2 Agree
T3 Disagree: I do not see the link between the prompt and the story.

Table 7: Story 1: a GPT-2-generated story that is highly-rated by ChatGPT.

Attribute Rater Response
Grammar
ChatGPT I would rate the grammatical correctness of the story fragment as a 3.
The text is generally understandable, but there are a few punctuation
errors and awkward phrasing that make it somewhat difficult to follow.
T1 Disagree:I would rate the grammar as a 2, not a 3–there were signifi-
cantly more grammatical errors than the average within this set of story
fragments.
T2 Disagree: The issues with the grammar and coherence creates a stop
start feeling when reading. I am unsure as to who is speaking in the text
and this leaves me with many questions.
T3 Disagree: The grammar is poorer than a 3, as is evident in for example
what is supposed to be the second sentence: "He has been ... (incomplete,
and another segment is there "People stare ...); cant; couldnt; doesnt;
and sentences that do not make sense like this one: " A man I once was
won’t be able to marry me "
Cohesive
ChatGPT I would rate the cohesiveness of the sentences in the story fragment as
a 2. The sentences in the fragment are disconnected and jump between
different thoughts and ideas, making it difficult to understand the overall
story or message. There is no clear progression of events or logical
flow between the sentences. It appears that the fragment may be missing
some important information or context that would help tie the sentences
together.
T1 Agree
T2 Agree
T3 Disagree: very incoherent passage in which it is not very clear who says
what, what the topic is, thus rendering a disjointed narrative.
Likable
ChatGPT I would rate the enjoyability of the story fragment as a 1. The fragment
is confusing and difficult to understand, the sentences are disconnected
and jumps between different thoughts and ideas. The story seems to
be incomplete and lacks context, making it hard to follow the plot or
characters. Additionally, the fragment has a sad and negative tone and it
does not seem to be enjoyable to read.
T1 Agree
T2 Agree
T3 Agree
Relevant
ChatGPT I would rate the relevance of the story fragment to the prompt as a 1.
While the story fragment does mention someone being gone, it doesn’t
seem to be directly related to the idea of filling out a form to justify killing
someone. It’s unclear if the son is the person being proposed to be killed,
or if the story is even taking place in the context of the prompt. There is
no clear indication that the fragment is related to the prompt.
T1 Agree
T2 Agree
T3 Agree

Table 8: Story 2: a GPT-2-generated story that is lowly-rated by ChatGPT.

Attribute Rater Response
Grammar
ChatGPT The text of the story fragment appears to be grammatically correct, with
no major errors or awkward phrasing. On a scale of 1-5, with 1 being
the lowest, I would rate the grammatical correctness of the text as a 5.
T1 Agree
T2 Agree
T3 Disagree: I would rate this a 4 because of for example ’no where’ =
nowhere, although it might seem like a minor error, it does take away
some correctness of the text.
Cohesive
ChatGPT The sentences in the story fragment fit together well, creating a clear and
coherent image of the scene and the characters’ actions and thoughts.
The language is descriptive and evocative, and the dialogue effectively
conveys the characters’ emotions and motivations. On a scale of 1-5,
with 1 being the lowest, I would rate how well the sentences fit together
as a 4.
T1 Agree
T2 Agree
T3 Partially agree: To me, all the sentences fit well together to create a
captivating narrative. I will give a 4 or 5.
Likable
ChatGPT Based on the structure, writing style and plot of the story fragment, I
would rate it as 3 in terms of enjoyability. The fragment sets up an
intriguing situation with the mysterious little black box and the protago-
nist’s belief that it holds the "secret to the universe," but it ends abruptly
and leaves many questions unanswered, which may not be satisfying for
readers.
T1 Disagree: I liked it and it was very clear, so I’d rate it a 4 or 5.
T2 Disagree: While the story ends suddenly, I am remined that it is just a
fragment and that their may be more to this piece. I felt the piece written
about a little black box was intriguing and enjoyed the introduction of
Norse mythology. This would attract the interest of many readers.
T3 Disagree: I enjoyed the fragment, as it piqued my interest to want to
know more about how the story evolves and what it actually was about.
Relevant
ChatGPT I would rate the story fragment as a 5 in relevance to the prompt "The
Little Black Box." The story fragment centers around a little black box
and its perceived importance, which aligns with the prompt. Additionally,
the main character, Heaven, is holding the box and describing it as "the
secret to the universe," further solidifying the connection to the prompt.
T1 Agree
T2 Agree
T3 Agree: the story is totally relevant to the prompt.

Table 9: Story 3: a human-written story that is highly-rated by ChatGPT.

Attribute Rater Response
Grammar
ChatGPT I would rate the grammatical correctness of the text as a 3. There are
a few errors such as "Many displays of various vice" which should
be "Many displays of various vices" and "I so small flakes of snow"
which should be "I saw small flakes of snow." Additionally, some of
the sentences are slightly awkward and could be rephrased for better
clarity. Overall, the text is understandable, but there are a few errors
and awkward phrasings that could be improved.
T1 Partially agree:I rated the grammar lower than this–maybe a 2 or a 3.
There were above-average numbers of errors.
T2 Disagree: While the text is understandable, I found myself being dis-
tracted by the errors in the text and coherency.
T3 Agree
Cohesive
ChatGPT I would rate the cohesiveness of the sentences in the story fragment as
a 2. The story fragment jumps abruptly between different scenes and
characters without clear transition, making it difficult to follow and
understand the narrative. Additionally, the sentences are often disjointed
and lack a clear flow. The story fragment also lacks a clear plot or
purpose, making it difficult to understand the overall story or the main
character’s intentions. The story fragment can be improved by providing
more context, and a clearer narrative arc.
T1 Agree: it was jumpy
T2 Agree
T3 Disagree: The story is about a sniper looking through his rifle’s telescope,
thus seeing changing scenes based on each room he is seeing through
the windows, which are then subsequently described. These descriptions
account for the seeming lack of cohesion.
Likable
ChatGPT I would rate the enjoyability of the story fragment as a 1. The fragment
lacks a clear plot, purpose or character development, making it difficult to
become invested in the story or the characters. Additionally, the fragment
includes disturbing and graphic content such as violence and prostitution,
which may make it uncomfortable for some readers. Additionally, the
fragment contains several grammatical errors and awkward phrasings
that can make it difficult to follow the story. Overall, the story fragment
is not enjoyable to read.
T1 Agree: : it was not just graphic, but also showed really harmful and
negative representations of characters based on their weight, gender, etc.
T2 Agree
T3 Disagree: The story did hold my attention to a degree because I wanted
to know whom he wanted to target and why.
Relevant
ChatGPT I would rate the relevance of the story fragment to the prompt as a 1. The
story fragment doesn’t contain any elements that relate to the prompt such
as the protagonist and antagonist, their actions, or the moral ambiguities
the antagonist faces. The story fragment mainly describes a person with
a sniper rifle looking into windows of a hotel and doesn’t present any
clear plot or purpose. Additionally, the story fragment does not offer any
indication of moral ambiguity on the part of the person with the sniper
rifle. Overall, the story fragment is not relevant to the prompt.
T1 Agree
T2 Agree
T3 Agree

Table 10: Story 4: a human-written story that is lowly-rated by ChatGPT.


D LLM Evaluation

D.1 Details on LLMs used
The T0 model we use is called T0pp, which is a variant of the T0 model and has 13B parameters. We will still use T0 to refer to this model. We load the T0 model using the transformers toolkit (Wolf et al., 2020). The two InstructGPT models, text-curie-001 and text-davinci-003, are queried using the OpenAI API. We query ChatGPT using the OpenAI GUI. While we are aware that some online resources provide an API-like tool to query ChatGPT, we think this violates the intended use of ChatGPT, so we do not adopt those online resources. The ChatGPT versions we queried are the Dec. 15 and Jan. 9 versions.

OpenAI does not reveal the model sizes of any of the GPT models. However, it is estimated that text-curie-001 has 13B parameters and text-davinci-003 has 175B parameters (Gao, 2021).
For T0 and the two InstructGPT models, we query Note: Please take the time to fully read
the four attributes separately using the queries and understand the story fragment. We
shown as follows: will reject submissions from workers that
are clearly spamming the task.
Grammaticality Story fragment:
Please rate the story fragment [STORY]
The goal of this task is to rate story (End of story fragment)
fragment. Now read the PROMPT based on which the
Note: Please take the time to fully read story fragment was written.
and understand the story fragment. We PROMPT: [PROMPT]
will reject submissions from workers that (End of PROMPT)
are clearly spamming the task. How relevant is the story fragment to the
Story fragment: prompt? (on a scale of 1-5, with 1 being
[STORY] the lowest)
(End of story fragment) The [STORY] and [PROMPT] are to be filled in
How grammatically correct is the text of with the story and the prompt. We show the new-
the story fragment? (on a scale of 1-5, lines for better readability. When we query the
with 1 being the lowest) models, we use the token \n to represent the new
line.
Cohesiveness
Please rate the story fragment When querying ChatGPT, we query the four at-
The goal of this task is to rate story tributes of the same story in one conversation; this
fragment. is similar to asking the teachers to rate the same
Note: Please take the time to fully read story on the same page of the Google Form. We use
and understand the story fragment. We the same queries shown above to query ChatGPT
will reject submissions from workers that and the order of queries is the same as the order
are clearly spamming the task. shown above.
Story fragment: D.2.3 Adversarial Attack Quality Evaluation
[STORY]
(End of story fragment) When querying all the LLMs in this task, we query
How well do the sentences in the story the fluency and the meaning preserving of the same
fragment fit together? (on a scale of news title independently. This means that each
1-5, with 1 being the lowest) conversation with ChatGPT will only have one
question, asking about the fluency or the meaning
Likability preserving of news title(s). All the parameters for
Please rate the story fragment generation are the same as the default parameters
The goal of this task is to rate story in Section 3.2.
15628
The exact queries we use are:

Fluency
    You are given a news title. Please read the news title and answer the question.
    News title:
    [NEWS_TITLE]
    (End of news title)
    Question: How natural and fluent is the text of the news title? (on a scale of 1-5, with 1 being the lowest)

The [NEWS_TITLE] will be filled in with either a benign or an adversarially attacked news title.

Meaning Preserving
    You are given two news titles. Please read the news titles and answer the question.
    News title 1:
    [BENIGN_TITLE]
    (End of news title 1)
    News title 2:
    [ADVERSARIAL_TITLE]
    (End of news title 2)
    Question: Do you agree that the meaning (or semantics) of news title 1 is preserved in news title 2? (on a scale of 1-5, with 1 being strongly disagree and 5 being strongly agree.)

The [BENIGN_TITLE] will be filled in with the news title before the attack, and the [ADVERSARIAL_TITLE] will be filled in with the news title after an adversarial attack.

E Experiment Details on Adversarial Attacks

The adversarial samples used in Section 4 are from Yoo et al. (2022), who generate different sets of adversarial samples using different adversarial attacks against different victim models. We use the adversarial samples generated against a bert-base-uncased text classifier trained on AG-News, using three different adversarial attacks: Textfooler, PWWS, and BAE. The intent of the dataset is to facilitate the research in SSA, which we do not violate.

Here, we show the supplementary results of using text-davinci-003 as the LLM evaluator for evaluating the quality of adversarial samples in Table 11. We can see that the result of using text-davinci-003 is similar to that of ChatGPT, in the sense that text-davinci-003 also rates adversarial samples higher than humans do, while still rating them significantly lower than the benign samples. As already seen in Section 3.3, text-davinci-003 tends to give a higher rating.

              Human evaluate        LLM evaluate
              Fluent    Mean.       Fluent    Mean.
Benign        4.55      -           4.33      4.56†
Textfooler    2.17      1.88        3.71      2.37
PWWS          2.16      1.85        3.62      3.21
BAE           3.01      3.02        4.16      3.69

Table 11: LLM evaluation (text-davinci-003) and human evaluation results on fluency (Fluent) of the benign and adversarial samples and meaning preservation (Mean.) between the news titles before and after adversarial attacks.

As mentioned in Section 4.3, one teacher rates the fluency of Textfooler significantly higher than PWWS while the other two teachers do not. We show the rating on fluency of the three adversarial attacks by each teacher in Table 12.

Rater    Textfooler    PWWS    BAE
T1       3.36          3.68    4.2
T2       1.80          1.40    2.96
T3       1.36          1.40    1.88

Table 12: The ratings on the three adversarial attacks by the three teachers T1, T2, and T3.
ACL 2023 Responsible NLP Checklist

A. For every submission:
A1. Did you describe the limitations of your work? Yes — Sec 5 and Limitation on page 10.
A2. Did you discuss any potential risks of your work? Yes — Sec 5 and Limitation on page 10.
A3. Do the abstract and introduction summarize the paper's main claims? Yes — Abstract.
A4. Have you used AI writing assistants when working on this paper? No.

B. Did you use or create scientific artifacts? Yes — Section 4.2, Appendix B.1 and E.
B1. Did you cite the creators of artifacts you used? Yes — Section 4.2, Appendix B.1 and E.
B2. Did you discuss the license or terms for use and/or distribution of any artifacts? No — the datasets we use do not include a license.
B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided that it was specified? For the artifacts you create, do you specify intended use and whether that is compatible with the original access conditions (in particular, derivatives of data accessed for research purposes should not be used outside of research contexts)? Yes — Appendix E and Ethical statement.
B4. Did you discuss the steps taken to check whether the data that was collected/used contains any information that names or uniquely identifies individual people or offensive content, and the steps taken to protect/anonymize it? No — removing the names in AG-News would make the news titles nonsensical.
B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and linguistic phenomena, demographic groups represented, etc.? Yes — Section 4.2, Appendix B.1 and E.
B6. Did you report relevant statistics like the number of examples, details of train/test/dev splits, etc. for the data that you used/created? Even for commonly-used benchmark datasets, include the number of examples in train/validation/test splits, as these provide necessary context for a reader to understand experimental results. For example, small differences in accuracy on large test sets may be significant, while on small test sets they may not be. Yes — Section 4.2, Appendix B.1 and E.

C. Did you run computational experiments? No.
C1. Did you report the number of parameters in the models used, the total computational budget (e.g., GPU hours), and computing infrastructure used? Not applicable.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found hyperparameter values? Not applicable.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary statistics from sets of experiments), and is it transparent whether you are reporting the max, mean, etc. or just a single run? Not applicable.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE, etc.)? Not applicable.

D. Did you use human annotators (e.g., crowdworkers) or research with human participants? Yes — Section 3, 4, Appendix C.
D1. Did you report the full text of instructions given to participants, including e.g., screenshots, disclaimers of any risks to participants or annotators, etc.? Yes — Appendix C.1.
D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students) and paid participants, and discuss if such payment is adequate given the participants' demographic (e.g., country of residence)? Yes — Appendix C.1, C.2.
D3. Did you discuss whether and how consent was obtained from people whose data you're using/curating? For example, if you collected data via crowdsourcing, did your instructions to crowdworkers explain how the data would be used? Yes — Ethical statement.
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board? Not applicable — we do not have an ethics review board at our institute.
D5. Did you report the basic demographic and geographic characteristics of the annotator population that is the source of the data? No — this is not related to our task; we report the certification of the workers in Appendix C.1.

The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing assistance.
