Can Large Language Models Be An Alternative To Human Evaluation
[Figure 1]
Figure 1: Illustration of the core idea of the paper using open-ended story generation as the example task. The left
part shows the instruction, story fragments, and questions used in human evaluation. The human experts are asked
to rate the quality of the story fragments using a 5-point Likert scale, shown on the upper right. The lower right part
shows the process of LLM evaluation, where we feed the LLMs the same instruction, story fragments, and questions
and parse the LLM-generated output to get the rating.
ifying the effectiveness of LLM evaluation (§3.3 and §4.3). This paper is the first to propose using LLMs as an alternative to human evaluation and show their effectiveness.

• We show that LLM evaluation results only vary slightly with different task instructions and with the hyperparameters of the sampling algorithm used to generate the answer. (§3.3.2 and §3.3.3)

• We carefully discuss the pros and cons of using LLM evaluation and discuss the ethical considerations of LLM evaluation. (§5)

2 LLM Evaluation

2.1 Large Language Models (LLMs)

Large language models are language models with very large parameter counts, typically on the scale of a few billion, that are pre-trained on enormous amounts of natural language corpora; examples include GPT-3 (Brown et al., 2020), T5 (Raffel et al., 2020), and BLOOM (Scao et al., 2022). These LLMs show exceptional performance on unseen tasks when only the task instructions are given; this kind of ability is called zero-shot in-context learning. To further improve zero-shot in-context learning performance, special training techniques have been applied to those LLMs after pre-training. For example, T0 (Sanh et al., 2022) and FLAN (Wei et al., 2022) are fine-tuned on a mixture of tasks and can thus achieve better zero-shot performance than GPT-3. InstructGPT (Ouyang et al., 2022) is fine-tuned from GPT-3 using reinforcement learning from human feedback (RLHF) and is shown to follow instructions better. ChatGPT (OpenAI, 2022) is fine-tuned from InstructGPT with a conversation dataset using RLHF, so ChatGPT can interact with users in a conversational way; it is able to answer questions asked by the user and provide comprehensive explanations for its answers. Given the LLMs' ability to follow task instructions and provide feedback, we ask whether LLMs can be used as an alternative to human evaluation and aid NLP researchers in evaluating the quality of texts.

2.2 LLM Evaluation

To use an LLM to evaluate the quality of texts generated by NLP systems or written by humans, we present the LLM with the task instructions, the sample to be evaluated, and a question. The question asks the LLM to rate the sample's quality using a 5-point Likert scale. Given these inputs, the LLM answers the question by generating some output sentences. We parse the output sentences to get the score rated by the LLM. We call this process LLM evaluation; the procedure is shown in the lower part of Figure 1. Different tasks use different sets of task instructions, and each task uses different questions to evaluate the quality of the samples. The instructions and questions used for LLM evaluation in our paper are not tailored for the LLMs; we follow the instructions used to conduct human evaluation in prior works.

To show the effectiveness of LLM evaluation, we compare its results with human evaluation conducted by English teachers. To make a fair and meaningful comparison, the instructions, samples, and questions in human evaluation are formatted similarly to those in LLM evaluation. The main difference between LLM evaluation and human evaluation is that in human evaluation, the human evaluators answer the question by choosing from a pre-defined set of options (the 1-5 Likert scale scores), as shown in the upper right of Figure 1. In LLM evaluation, we instead let the LLM freely generate sentences and extract the score from the generated sentences using some simple rules, detailed in Appendix D.2.1.
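To make the procedure concrete, the following is a minimal sketch of the LLM-evaluation loop described above, with the instruction and question taken from Figure 1. The regex-based parsing is our own illustrative assumption; the exact parsing rules used in the paper are given in its Appendix D.2.1.

```python
# Sketch of the LLM-evaluation step: build the prompt from the task instruction,
# the story fragment, and one attribute question, then pull the first 1-5 rating
# out of the model's free-form answer with a simple rule.
import re
from typing import Optional

INSTRUCTION = (
    "The goal of this task is to rate story fragments.\n"
    "NOTE: Please take the time to fully read and understand the story fragment."
)
QUESTION = (
    "How grammatically correct is the text of the story fragment? "
    "(on a scale of 1-5, with 1 being the lowest)"
)

def build_prompt(story_fragment: str, question: str = QUESTION) -> str:
    """Compose the task instruction, the story fragment, and one attribute question."""
    return f"{INSTRUCTION}\n\nStory fragment:\n{story_fragment}\n\n{question}"

def extract_rating(answer: str) -> Optional[int]:
    """Return the first integer between 1 and 5 mentioned in the generated answer."""
    match = re.search(r"\b([1-5])\b", answer)
    return int(match.group(1)) if match else None

# Example: parsing the LLM output shown in Figure 1.
print(extract_rating("I would rate the grammar in the text of the story fragment as a 4."))  # -> 4
```

Because the model answers in free-form text, the extraction step is deliberately forgiving: it simply takes the first digit between 1 and 5 that appears in the answer.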
3 Example Task 1: Open-Ended Story Generation

We first use open-ended story generation to demonstrate the usefulness of LLM evaluation.

3.1 Task Introduction

Open-ended story generation is the task of generating a short story based on a given prompt. We use the WritingPrompts dataset (Fan et al., 2018), which is composed of pairs of short prompts and human-written stories collected from the subreddit WritingPrompts. In the subreddit, users are given a short prompt, and they need to write a story based on that prompt.[1]

[1] The WritingPrompts subreddit explicitly forbids the users to use AI for generating stories, so we consider the stories in the dataset to be human-written.

In this experiment, we use LLM evaluation and human evaluation to rate stories written by humans and stories generated by a story generation model. We select open-ended story generation as an example because Karpinska et al. (2021) show that workers from Amazon Mechanical Turk (AMT) cannot distinguish GPT-2-generated (Radford et al., 2019) stories from human-written stories, while English teachers show a clear preference for human-written stories over GPT-2-generated stories. We want to see whether LLMs can rate human-written stories higher than GPT-2-generated ones.

Following prior works (Mao et al., 2019; Guan et al., 2020; Karpinska et al., 2021), the story generation model is a GPT-2 medium model fine-tuned on the WritingPrompts training set. After the model is trained, we randomly select 200 prompts from the test set of WritingPrompts and have the fine-tuned GPT-2 generate stories based on those prompts using nucleus sampling (Holtzman et al., 2020) with p = 0.9. As the human-written stories to be compared, we use the 200 stories written based on the same 200 prompts. We post-process the human-written and GPT-2-generated stories and then use them for LLM evaluation and human evaluation. Please find the details on fine-tuning and data post-processing in Appendix B.
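As an illustration of the generation setup just described, here is a minimal sketch of sampling a story from a fine-tuned GPT-2 medium checkpoint with nucleus sampling (p = 0.9). The checkpoint path and the generation length are assumptions, not values from the paper.

```python
# Sketch: sample one story continuation with nucleus sampling (top_p = 0.9).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

CKPT = "./gpt2-medium-writingprompts"  # hypothetical local fine-tuned checkpoint
tokenizer = GPT2TokenizerFast.from_pretrained(CKPT)
model = GPT2LMHeadModel.from_pretrained(CKPT).eval()

def generate_story(prompt: str, max_new_tokens: int = 300) -> str:
    """Generate a story for a WritingPrompts prompt using nucleus sampling."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            do_sample=True,          # sample instead of greedy decoding
            top_p=0.9,               # nucleus sampling threshold used in the paper
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    # Return only the newly generated continuation.
    return tokenizer.decode(output_ids[0, input_ids.shape[1]:], skip_special_tokens=True)
```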
3.2 LLM Evaluation and Human Evaluation

We present the LLMs and the human evaluators with a short description and the story to be evaluated, formatted as shown in Figure 1. Following Karpinska et al. (2021), we evaluate the stories on four different attributes. The four attributes and their corresponding questions are as follows:

1. Grammaticality: How grammatically correct is the text of the story fragment?

2. Cohesiveness: How well do the sentences in the story fragment fit together?

3. Likability: How enjoyable do you find the story fragment?

4. Relevance: Now read the PROMPT based on which the story fragment was written. Prompt: [PROMPT]. How relevant is the story fragment to the prompt?

Here, [PROMPT] is filled in with the prompt on which the story is based. Each attribute is evaluated using a 5-point Likert scale; the following description is appended at the end of each question: "(on a scale of 1-5, with 1 being the lowest)". We show the interface used in human evaluation and the input format for LLM evaluation in Appendix C.2 and D.2.2.
Evaluator   Grammaticality              Cohesiveness                Likability                  Relevance
            Mean (SD)    IAA α (%)      Mean (SD)    IAA α (%)      Mean (SD)    IAA α (%)      Mean (SD)    IAA α (%)

Human-written stories
Human       3.76 (0.95)  0.33 (20.5)    4.29 (0.82)  0.32 (27)      3.78 (1.10)  0.08 (9.5)     3.35 (1.48)  0.05 (8)
T0          2.55 (1.47)  0.16 (10)      2.98 (1.45)  0.11 (4)       3.18 (1.53)  0.12 (7)       2.93 (1.64)  0.02 (6)
curie       3.19 (0.47)  0.07 (46.5)    2.82 (0.46)  0.01 (47.5)    2.85 (0.37)  0.11 (0.65)    3.06 (0.40)  0.11 (0.64)
davinci     4.22 (0.38)  0.26 (35)      4.54 (0.47)  0.37 (39.5)    3.99 (0.38)  0.49 (68.5)    4.40 (0.79)  0.71 (48.5)
ChatGPT     3.83 (0.60)  -              3.55 (0.88)  -              2.44 (0.89)  -              3.29 (1.50)  -

GPT-2-generated stories
Human       3.56 (0.91)  0.10 (19.5)    3.19 (1.07)  0.14 (17)      2.59 (1.29)  -0.21 (3.5)    2.38 (1.40)  -0.03 (8.5)
T0          2.44 (1.49)  0.05 (9)       3.02 (1.51)  0.07 (6)       3.00 (1.59)  0.16 (6)       2.82 (1.61)  0.04 (6)
curie       3.23 (0.51)  0.01 (38)      2.82 (0.45)  0.02 (50)      2.86 (0.37)  0.09 (65.5)    3.01 (0.43)  0.11 (61)
davinci     4.07 (0.35)  0.35 (45.5)    4.26 (0.45)  0.42 (42)      3.84 (0.42)  0.52 (62)      4.02 (0.74)  0.69 (42.5)
ChatGPT     2.98 (0.76)  -              2.48 (0.71)  -              1.59 (0.67)  -              2.02 (1.21)  -

Table 1: LLM evaluation and human evaluation results of human-written stories and GPT-2-generated stories. For each evaluated attribute, we report the mean Likert score with the standard deviation in parentheses. We also report the inter-annotator agreement (IAA) among the three annotators using Krippendorff's α; the percentage in parentheses in the IAA columns denotes the percentage of the stories where all three annotators exactly agree on a rating.
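The two IAA metrics reported in Table 1 can be computed roughly as follows. This is a sketch with placeholder ratings; the `krippendorff` package is just one possible implementation, and the paper does not state which was used.

```python
# Sketch: Krippendorff's alpha and the percentage of exact agreement
# for three evaluators over 200 stories.
import numpy as np
import krippendorff  # pip install krippendorff

ratings = np.random.randint(1, 6, size=(3, 200))  # placeholder data: 3 raters x 200 stories

# level_of_measurement="ordinal" is an assumption for 1-5 Likert ratings.
alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")

# Percentage of stories on which all three raters give exactly the same score.
exact_agreement = np.mean([len(set(col)) == 1 for col in ratings.T]) * 100

print(f"Krippendorff's alpha: {alpha:.2f}, exact agreement: {exact_agreement:.1f}%")
```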
The LLMs used for LLM evaluation include T0, text-curie-001, text-davinci-003, and ChatGPT. text-curie-001 and text-davinci-003 are two InstructGPT models, and the latter is the stronger model; we will use InstructGPT to refer to these two models. We query the InstructGPT models using the official API provided by OpenAI. We use nucleus sampling with p = 0.9 to generate the answers from T0 and the InstructGPT models. We sample three answers from the LLMs to simulate asking the model to rate the same story three times. We query ChatGPT using the user interface recently released by OpenAI. Unlike InstructGPT, we cannot control the parameters used for generating the response from ChatGPT. Because ChatGPT limits the maximum number of queries per user, we only sample one response for each question.
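A sketch of how such queries can be issued through the legacy OpenAI completions endpoint (which served text-davinci-003) is shown below. Requesting n=3 completions in one call is our shorthand for sampling three answers, and the max_tokens value is an assumption; an API key is assumed to be configured in the environment.

```python
# Sketch: query an InstructGPT model with nucleus sampling (top_p = 0.9)
# and obtain three independently sampled answers for the same prompt.
import openai  # openai<1.0 style API; assumes OPENAI_API_KEY is set

def rate_story_three_times(prompt: str, model: str = "text-davinci-003") -> list[str]:
    """Return three sampled answers, analogous to rating the same story three times."""
    response = openai.Completion.create(
        model=model,
        prompt=prompt,
        max_tokens=128,   # assumption; long enough for a rating plus a short justification
        top_p=0.9,        # nucleus sampling, as in the paper
        n=3,              # three samples per story
    )
    return [choice["text"].strip() for choice in response["choices"]]
```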
For human evaluation, we do not use the commonly used AMT because Karpinska et al. (2021) have already shown that the results obtained using AMT are highly questionable. Following the recommendation of prior works, we hire three certified English teachers using an online freelancer platform, Upwork. Teachers are familiar with evaluating the essays of students, making them the expert evaluators for our task. The details about recruiting human evaluators are in Appendix C.1. Each LLM and each English teacher rates the 200 human-written stories and the 200 GPT-2-generated stories.

3.3 Experiment Results

The LLM evaluation and human evaluation results of open-ended story generation are presented in Table 1. We report the mean and standard deviation of the Likert scores obtained from LLM evaluation and human evaluation, and show the inter-annotator agreement (IAA) using two different metrics: (1) Krippendorff's α, and (2) the percentage of the stories where the three evaluators give exactly the same rating.[2] The main observations from Table 1 are discussed as follows.

[2] The three evaluators in human evaluation are the three English teachers. In LLM evaluation, we sample the answer generated by the LLM three times as an analogy to three different evaluators.

Expert human evaluators prefer human-written stories: The human evaluation result serves as a kind of ground truth for the LLM evaluation. For all four attributes, teachers rate the human-written stories higher than the GPT-2-generated stories. This indicates that experts are able to distinguish the quality difference between model-generated stories and human-written stories. Based on the IAA, we also find that the agreement among experts is lower on GPT-2-generated texts and on likability. This shows that experts tend to agree less on model-generated texts and on a subjective attribute (likability), agreeing with the results in Karpinska et al. (2021).

T0 and text-curie-001 do not show a clear preference toward human-written stories: For T0, we can see that T0 rates human-written stories higher than GPT-2-generated stories on grammaticality, likability, and relevance. However, the rating differences between the human-written and model-generated stories do not achieve statistical significance for grammaticality and relevance; the p-value obtained by Welch's t-test is much larger than 0.05. The result of text-curie-001 is similar to T0: text-curie-001 does not rate human-written stories higher than model-generated stories. It can also be observed that for T0, the IAA in terms of the percentage of exact agreement among the three sampled answers is overall very low. This indicates that, given the same sample, T0 is likely to give a different rating across the three sampled answers. The result implies that T0 does not assign a high probability to a specific rating, so different scores are all likely to be sampled. This shows that even if LLMs are specifically fine-tuned to better perform zero-shot in-context learning and trained to better follow human instructions, these do not make them capable of assessing open-ended story generation as human experts can.
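The significance test referred to here (and in the next paragraph) is Welch's t-test; a minimal sketch with placeholder per-story ratings follows. Whether the test was run on raw or averaged ratings is not spelled out in this section, so the sketch is only illustrative.

```python
# Sketch: Welch's t-test comparing ratings of human-written vs. GPT-2-generated stories.
import numpy as np
from scipy import stats

human_scores = np.random.uniform(1, 5, size=200)  # placeholder per-story ratings
gpt2_scores = np.random.uniform(1, 5, size=200)   # placeholder per-story ratings

t_stat, p_value = stats.ttest_ind(human_scores, gpt2_scores, equal_var=False)  # Welch's t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")  # a difference is called significant when p < 0.05
```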
text-davinci-003 shows a clear preference toward human-written stories, just like English teachers: text-davinci-003 rates human-written stories much higher than model-generated stories on all four attributes, which is in accordance with the result produced by the human experts. By Welch's t-test, we find that the higher ratings on human-written stories are all statistically significant. In prior work, researchers have found that workers recruited on AMT do not rate human-written stories higher than GPT-2-generated ones (Karpinska et al., 2021); combining their result with ours, we can see that LLM evaluation using text-davinci-003 yields more convincing results than human evaluation on AMT for open-ended story generation. The results show that text-davinci-003 can perform basic evaluations such as checking for grammatical errors in stories. Additionally, the model excels in assessing the relevance of a story to a prompt, which involves more complex reasoning about the connection between the two. We also find that the Krippendorff's α of text-davinci-003 is much higher than that of T0 and text-curie-001, indicating that the rating by text-davinci-003 is more consistent among different samplings of the generated answers.

ChatGPT rates like human experts and can explain its own decisions well: ChatGPT also shows a clear preference for human-written stories, and the preference toward human-written stories is statistically significant. When we query ChatGPT using the OpenAI user interface, we find several interesting observations: (1) ChatGPT is able to provide a detailed explanation of why it gives a certain rating. It will reference the sentences in the stories and prompts to support its rating. (2) ChatGPT sometimes refuses to rate the likability of a story because "I am an AI and I do not have the ability to experience enjoyment". In such cases, we regenerate the response until it gives a rating. (3) We find that ChatGPT tends to give low likability ratings to violent or impolite stories, which is likely because it is trained to provide safe and harmless replies, making ChatGPT dislike brutal and profane stories.

Experts mostly agree with the ratings and explanations of ChatGPT: We randomly select the answers on four stories by ChatGPT and ask the English teachers if they agree with the reasoning and rating of ChatGPT.[3] The teachers mostly agree with the rating and consider the explanation from ChatGPT reasonable. Interestingly, one teacher told us she cannot agree with ChatGPT's rating on grammaticality because ChatGPT considers punctuation errors to be grammar errors, while she does not. This shows that individuals have their own standards for ratings, and this is also the case for LLMs.

[3] We do not tell the teachers these are responses from an AI model. See the stories and teachers' replies in Appendix C.3.2.

text-davinci-003 tends to give higher ratings and ChatGPT is the opposite: The rating of the same attribute on the same type of text tends to be higher for text-davinci-003 compared with the human rating; contrarily, ChatGPT is more fastidious and prone to give lower scores. This shows that different LLMs have distinct tendencies regarding the rating. While the absolute values of the scores rated by text-davinci-003, ChatGPT, and humans differ, they all rate human-written texts higher than GPT-2-generated stories. The absolute number reflects the bias or belief of the evaluator; as long as one uses the same evaluators to assess different systems, the comparison is meaningful.
Story writer      Human    GPT-2
Grammaticality    0.14     0.12
Cohesiveness      0.18     0.14
Likability        0.19     0.22
Relevance         0.38     0.43

Table 2: The Kendall's τ correlation coefficients between English teachers and text-davinci-003 for human-written and GPT-2-generated stories.

3.3.1 Do the LLM and Human Evaluators Agree on the Ratings of Individual Stories?

We have found in Table 1 that the ratings of text-davinci-003 and ChatGPT show a strong preference toward human-written stories, just like the English teachers. However, it is unclear whether those LLMs agree with the teachers' ratings on each individual story. Precisely, when English teachers rate a story higher, do LLMs also rate the story higher? To answer this question, we calculate the Kendall's τ correlation coefficient between the ratings of text-davinci-003 and the English teachers. We choose to use the correlation coefficient instead of the inter-annotator agreement score because IAA mainly asks whether two annotators agree on the exact ratings, while the correlation coefficient focuses on the question: "when annotator A rates one story higher, does annotator B also rate the story higher?" (Amidei et al., 2019). We calculate Kendall's τ for the four rating attributes as follows: for each story and each rating attribute, we calculate the average rating of the three English teachers and the average of the three scores given by text-davinci-003 (obtained from three independent samplings). For each attribute, we collect the average ratings of the teachers into a vector A ∈ R^200, where each entry is the average rating of one story; likewise, we construct a vector B ∈ R^200 for the average ratings of text-davinci-003. Next, we calculate the Kendall's τ correlation coefficient between A and B.
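A sketch of this computation with placeholder ratings is given below; using scipy.stats.kendalltau is our assumption about the implementation.

```python
# Sketch: Kendall's tau between averaged teacher ratings and averaged LLM ratings.
import numpy as np
from scipy.stats import kendalltau

teacher_ratings = np.random.randint(1, 6, size=(3, 200))  # placeholder: 3 teachers x 200 stories
llm_ratings = np.random.randint(1, 6, size=(3, 200))      # placeholder: 3 samples  x 200 stories

A = teacher_ratings.mean(axis=0)  # average teacher rating per story, A in R^200
B = llm_ratings.mean(axis=0)      # average LLM rating per story,     B in R^200

tau, p_value = kendalltau(A, B)
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```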
The Kendall's τ between teacher ratings and LLM ratings is shown in Table 2.[4] We find that for all four attributes and for both human-written and GPT-2-generated stories, we observe weak to strong positive correlations between the teachers' ratings and text-davinci-003's ratings. All the correlations have p-values less than 0.05. Hence, we can say that when teachers rate a story higher, text-davinci-003 also rates it higher, to a certain extent. We also observe that Kendall's τ differs considerably across attributes: relevance has the strongest correlation while grammaticality has the weakest. This is possibly because rating relevance is rather straightforward, requiring only a check of whether the content in the prompt is mentioned in the story. On the contrary, what should be considered when rating grammaticality is not clearly stated in our instructions, so the LLM may have a different rubric compared with the English teachers. We also calculate the average Kendall's τ between pairs of English teachers, and we find a weak correlation between two teachers' ratings on grammaticality, while the correlation of the ratings on relevance is much stronger. The result is presented in Table 6 in the Appendix.

[4] When interpreting Kendall's τ, |τ| ∈ [0, 0.1) is considered a very weak correlation, |τ| ∈ [0.1, 0.2) a weak correlation, |τ| ∈ [0.2, 0.3) a moderate correlation, and |τ| ∈ [0.3, 1.0] a strong correlation (Botsch, 2011).

3.3.2 Variance due to Different Instructions

LLMs have sometimes been shown to be sensitive to the instructions used to query them (Zhao et al., 2021; Sanh et al., 2022). To investigate how varying the task instructions and questions affects the LLM evaluation result for open-ended story generation, we change the instructions and questions and observe how the LLM evaluation result changes. We experiment with two different instructions by changing the instruction or question in Figure 1: (1) We prepend the sentence "(You are a human worker hired to rate the story fragment.)" in front of the task instruction in Figure 1. We try to provide the LLM with a persona so it can better understand its role; this is inspired by previous work reporting that GPT-3 can yield different results when given a persona (Zeng et al., 2022). (2) We ask the LLMs to explain their decision by appending the following sentence after the question: "Please also explain your decision." Here, we would like to know whether the LLM rates the stories differently when asked to justify its decision; this is inspired by zero-shot chain-of-thought prompting (Kojima et al.). We use text-davinci-003 instead of ChatGPT as the LLM in this experiment since it is more accessible than ChatGPT.
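The two instruction variants can be expressed as simple wrappers around the base prompt from Section 2.2; the helper names below are ours, not the authors'.

```python
# Sketch of the two prompt variants used in Section 3.3.2.
PERSONA = "(You are a human worker hired to rate the story fragment.)"
EXPLAIN = "Please also explain your decision."

def with_persona(prompt: str) -> str:
    """Variant (1): prepend a persona sentence in front of the task instruction."""
    return f"{PERSONA}\n{prompt}"

def with_explanation(prompt: str) -> str:
    """Variant (2): append a request to justify the rating after the question."""
    return f"{prompt}\n{EXPLAIN}"
```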
The results are shown in the upper block of Table 3. We observe that for grammaticality and cohesiveness, the scores obtained from different instructions are quite close: the rating changes due to different instructions are less than 0.1. For the other two attributes, the score changes are slightly larger but still within 0.25. Although there are small variations due to different instructions, these variations do not change the conclusion that "the LLM rates human-written stories higher than GPT-2-generated stories". Thus, different instructions do not change the relative ranking of GPT-2-generated and human-written stories. In summary, as long as the stories are evaluated using the same instructions in LLM evaluation, such evaluation and comparison are meaningful.
Setup            Grammaticality            Cohesiveness              Likability                Relevance
                 Human        GPT-2        Human        GPT-2        Human        GPT-2        Human        GPT-2

Different instructions (Section 3.3.2)
Original         4.22 (0.38)  4.07 (0.35)  4.54 (0.45)  4.26 (0.45)  3.99 (0.38)  3.84 (0.42)  4.40 (0.79)  4.02 (0.74)
(1) + persona    4.29 (0.45)  4.01 (0.45)  4.60 (0.49)  4.27 (0.50)  4.05 (0.39)  3.87 (0.39)  4.55 (0.70)  4.25 (0.77)
(2) + explain    4.24 (0.42)  4.05 (0.25)  4.61 (0.49)  4.32 (0.51)  4.15 (0.44)  3.98 (0.34)  4.35 (0.75)  4.03 (0.56)

Different sampling temperature T (Section 3.3.3)
T = 1.0          4.22 (0.38)  4.07 (0.35)  4.54 (0.45)  4.26 (0.45)  3.99 (0.38)  3.84 (0.42)  4.40 (0.79)  4.02 (0.74)
T = 0.7          4.18 (0.35)  4.06 (0.33)  4.52 (0.48)  4.23 (0.43)  3.96 (0.34)  3.82 (0.42)  4.36 (0.77)  3.95 (0.72)
T = 0.3          4.13 (0.33)  3.99 (0.25)  4.48 (0.49)  4.14 (0.39)  3.95 (0.26)  3.82 (0.41)  4.34 (0.75)  3.93 (0.67)
T = 0            4.07 (0.27)  3.99 (0.18)  4.49 (0.50)  4.09 (0.34)  3.95 (0.25)  3.82 (0.40)  4.32 (0.75)  3.92 (0.66)

Table 3: Understanding the variance of LLM evaluation. For each of the four attributes evaluated, the left column gives the mean (with standard deviation in parentheses) for human-written stories and the right column gives that for GPT-2-generated stories. The upper block shows the rating change due to different instructions (Section 3.3.2), and the lower block is the result of changing the temperature T used for generating the LLM's output (Section 3.3.3).
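For the lower block of Table 3, a sketch of the temperature sweep reusing the legacy completion call from Section 3.2 is shown below; how the temperature interacts with the original top-p setting is not specified here, so the call only varies the temperature.

```python
# Sketch: rate the same story at several sampling temperatures (T = 0 is near-greedy).
import openai  # openai<1.0 style API; assumes OPENAI_API_KEY is set

def rate_at_temperature(prompt: str, temperature: float) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=128,            # assumption
        temperature=temperature,   # swept over {1.0, 0.7, 0.3, 0} in Table 3
        n=1,
    )
    return response["choices"][0]["text"].strip()

# Example sweep: for t in (1.0, 0.7, 0.3, 0.0): rate_at_temperature(prompt, t)
```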
Limitations

There are additional limitations and potential risks of LLM evaluation that should be noted; these limitations are actually well-known problems of pre-trained language models. As listed on the OpenAI blog for ChatGPT, ChatGPT sometimes generates answers that sound right and plausible but are totally nonsensical. OpenAI also admits that the model's responses may be sensitive to the prompt used to query the model. While in Section 3.3.2 we find that the overall results among different instructions are not significantly different, we cannot guarantee that this is the case for all kinds of modifications to the task instructions.

Other than the limitations listed on the OpenAI blog, there are still other limitations. For example, LLMs may not have emotions. Whether AI models have emotions is a more philosophical question and is controversial, so the results of using such models for evaluating emotion-related tasks may be strongly challenged and may even violate research ethics. As we find during our experiments, ChatGPT often replies "I am an AI system and I do not have emotions like a human" when asked to rate the likability of a story.

Another important limitation of LLM evaluation is that LLMs lack the ability to process visual cues in task instructions, unlike human evaluators. Human evaluators can use formatting such as special fonts or text styles to focus on important parts of the instructions. Additionally, the way instructions and questions are formatted can influence how human evaluators approach the task. LLMs, however, can only process raw text input and are unable to take in such visual cues. While special HTML syntax could serve as an alternative to visual cues, such tags are not used in human evaluation, so we do not use HTML tags in LLM evaluation either.

Ethics Statement

Further ethical considerations of LLM evaluation: Aside from the limitations of LLM evaluation mentioned previously, there is a crucial ethical concern at the heart of LLM evaluation: is it ethical to replace human evaluation with LLM evaluation? Some may question whether this paper is suggesting that LLMs are now ready to replace humans and find this idea unsettling. As responsible and ethical NLP researchers, we understand these concerns, but we want to make it clear that this is not our intent. As our paper title suggests, we aim to offer an alternative option to human evaluation with the goal of enhancing the reproducibility of NLP research. Human evaluation is still essential, as the ultimate goal of NLP systems is to be used by human users, so it is important to gather feedback from them. We highly enjoyed the process of discussing the experiment settings and results with the English teachers we hired. We do not recommend that future researchers completely eliminate human evaluation; rather, we believe that human evaluation should be used in conjunction with LLM evaluation. Both methods have their own advantages and disadvantages, making them both necessary for evaluating NLP systems. We hope the positive results in this paper provide NLP researchers with an alternative method to evaluate systems and encourage further discussions on this topic.

Ethical statements on the experiments in the paper: All the experiments strictly follow the ACL Code of Ethics. We include comprehensive details about human evaluation in Appendix C.1. To summarize, we include the exact instructions and screenshots of the interface used in human evaluation. We inform the human evaluators what the task is about and tell them that their responses will be used to assess the performance of AI models. We try our best to follow the ethical guidelines of ACL.

We use the models and datasets following their intended usage. Specifically, we follow the OpenAI usage policy when using the InstructGPT models and the ChatGPT model.

Acknowledgements

We want to thank the reviewers for providing detailed feedback and actionable suggestions, which helped us strengthen our paper. We list the modifications based on the reviewers' suggestions in Appendix A. We thank Yung-Sung Chuang for providing valuable feedback on the draft of this paper. We want to thank Tung-En Hsiao, the administrative assistant of our lab, for helping us handle the payments on Upwork. Cheng-Han Chiang is supported by a Ph.D. scholarship program by Delta Electronics.

References
Moustafa Alzantot, Yash Sharma, Ahmed Elgohary, Bo-Jhang Ho, Mani Srivastava, and Kai-Wei Chang. 2018. Generating natural language adversarial examples. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2890–2896, Brussels, Belgium. Association for Computational Linguistics.

Jacopo Amidei, Paul Piwek, and Alistair Willis. 2018. Rethinking the agreement in human evaluation tasks. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3318–3329, Santa Fe, New Mexico, USA. Association for Computational Linguistics.

Jacopo Amidei, Paul Piwek, and Alistair Willis. 2019. Agreement is overrated: A plea for correlation to assess human evaluation reliability. In Proceedings of the 12th International Conference on Natural Language Generation, pages 344–354, Tokyo, Japan. Association for Computational Linguistics.

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. 2022a. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022b. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073.

R. Botsch. 2011. Chapter 12: Significance and measures of association. Scopes and Methods of Political Science.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models.

Cheng-Han Chiang and Hung-yi Lee. 2022. How far are we from real synonym substitution attacks? arXiv preprint arXiv:2210.02844.

Elizabeth Clark, Tal August, Sofia Serrano, Nikita Haduong, Suchin Gururangan, and Noah A. Smith. 2021. All that's 'human' is not gold: Evaluating human evaluation of generated text. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7282–7296, Online. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Angela Fan, Mike Lewis, and Yann Dauphin. 2018. Hierarchical neural story generation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 889–898, Melbourne, Australia. Association for Computational Linguistics.

Deep Ganguli, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann, Ethan Perez, Nicholas Schiefer, Kamal Ndousse, et al. 2022. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858.

Leo Gao. 2021. On the sizes of OpenAI API models. Accessed on January 17, 2023.

Siddhant Garg and Goutham Ramakrishnan. 2020. BAE: BERT-based adversarial examples for text classification. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6174–6181.

Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. 2023. ChatGPT outperforms crowd-workers for text-annotation tasks. arXiv preprint arXiv:2303.15056.

Dan Gillick and Yang Liu. 2010. Non-expert evaluation of summarization systems is risky. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pages 148–151, Los Angeles. Association for Computational Linguistics.

Jian Guan, Fei Huang, Zhihao Zhao, Xiaoyan Zhu, and Minlie Huang. 2020. A knowledge-enhanced pretraining model for commonsense story generation. Transactions of the Association for Computational Linguistics, 8:93–108.

Francisco Guzmán, Ahmed Abdelali, Irina Temnikova, Hassan Sajjad, and Stephan Vogel. 2015. How do humans evaluate machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 457–466, Lisbon, Portugal. Association for Computational Linguistics.

Jens Hauser, Zhao Meng, Damián Pascual, and Roger Wattenhofer. 2021. BERT is robust! A case against synonym-based adversarial examples in text classification. arXiv preprint arXiv:2109.07403.

Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021. Aligning AI with shared human values. In International Conference on Learning Representations.
Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2020. The curious case of neural text degeneration. In International Conference on Learning Representations.

Fan Huang, Haewoon Kwak, and Jisun An. 2023. Is ChatGPT better than human annotators? Potential and limitations of ChatGPT in explaining implicit hate speech. arXiv preprint arXiv:2302.07736.

Di Jin, Zhijing Jin, Joey Tianyi Zhou, and Peter Szolovits. 2020. Is BERT really robust? A strong baseline for natural language attack on text classification and entailment. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8018–8025.

Marzena Karpinska, Nader Akoury, and Mohit Iyyer. 2021. The perils of using Mechanical Turk to evaluate open-ended text generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 1265–1285, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Advances in Neural Information Processing Systems.

Leo Lentz and Menno De Jong. 1997. The evaluation of text quality: Expert-focused and reader-focused methods compared. IEEE Transactions on Professional Communication, 40(3):224–234.

Huanru Henry Mao, Bodhisattwa Prasad Majumder, Julian McAuley, and Garrison Cottrell. 2019. Improving neural story generation by targeted common sense grounding. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5988–5993, Hong Kong, China. Association for Computational Linguistics.

John Morris, Eli Lifland, Jack Lanchantin, Yangfeng Ji, and Yanjun Qi. 2020. Reevaluating adversarial examples in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3829–3839, Online. Association for Computational Linguistics.

OpenAI. 2022. ChatGPT: Optimizing language models for dialogue. Accessed on January 10, 2023.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Ethan Perez, Saffron Huang, Francis Song, Trevor Cai, Roman Ring, John Aslanides, Amelia Glaese, Nat McAleese, and Geoffrey Irving. 2022. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.

Shuhuai Ren, Yihe Deng, Kun He, and Wanxiang Che. 2019. Generating natural language adversarial examples through probability weighted word saliency. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1085–1097, Florence, Italy. Association for Computational Linguistics.

Victor Sanh, Albert Webson, Colin Raffel, Stephen Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Manan Dey, M Saiful Bari, Canwen Xu, Urmish Thakker, Shanya Sharma Sharma, Eliza Szczechla, Taewoon Kim, Gunjan Chhablani, Nihal Nayak, Debajyoti Datta, Jonathan Chang, Mike Tian-Jian Jiang, Han Wang, Matteo Manica, Sheng Shen, Zheng Xin Yong, Harshit Pandey, Rachel Bawden, Thomas Wang, Trishala Neeraj, Jos Rozen, Abheesht Sharma, Andrea Santilli, Thibault Fevry, Jason Alan Fries, Ryan Teehan, Teven Le Scao, Stella Biderman, Leo Gao, Thomas Wolf, and Alexander M Rush. 2022. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.

Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.

Jiaan Wang, Yunlong Liang, Fandong Meng, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. 2023. Is ChatGPT a good NLG evaluator? A preliminary study. arXiv preprint arXiv:2303.04048.

Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Association for Computational Linguistics.
KiYoon Yoo, Jangho Kim, Jiho Jang, and Nojun Kwak. 2022. Detection of adversarial examples in text classification: Benchmark and baseline via robust density estimation. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3656–3672, Dublin, Ireland. Association for Computational Linguistics.

Andy Zeng, Adrian Wong, Stefan Welker, Krzysztof Choromanski, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, et al. 2022. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR.

A Modification Based on the Reviews

We list the main differences between this version and the pre-review version of our paper; the modifications are mainly based on the reviewers' suggestions. We thank the reviewers again for those valuable suggestions.

• We add Section 3.3.1 to discuss whether the LLM and human evaluators agree on the ratings of individual stories.

• We refine the wordings in Section 5 and add relevant references.

• We add Table 6 to discuss the correlation between human evaluators.

• We conduct supplementary human evaluation experiments that mix human-written and GPT-2-generated stories during human evaluation and report the results in Table 5.

• We correct the typos and include almost all presentation suggestions mentioned by the reviewers. We cannot follow all presentation suggestions due to limited space.

B Experiment Details for Open-Ended Story Generation

B.1 The WritingPrompts Dataset

The training dataset contains 303K pairs of stories and prompts, on which our model is trained. We only use 200 prompt-story pairs from the test set. The dataset is downloaded from https://www.kaggle.com/datasets/ratthachat/writing-prompts.

B.2 Fine-tuning the GPT-2 Model

We train the model for 3 epochs with a learning rate of 5e−5 and a linear learning rate schedule. The trained model eventually reaches a perplexity of 20 on the validation set of WritingPrompts.
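A sketch of this fine-tuning setup with the Hugging Face Trainer is shown below. Only the number of epochs, the learning rate, and the linear schedule come from the paper; the data formatting, batch size, and other settings are assumptions.

```python
# Sketch: fine-tune GPT-2 medium on WritingPrompts-style text (3 epochs, lr 5e-5, linear schedule).
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-medium")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2-medium")

# Placeholder corpus; in the paper this is the WritingPrompts training set of
# prompt-story pairs, whose exact formatting the appendix does not spell out.
texts = ["[PROMPT] The little black box. [STORY] Heaven held the box and wondered..."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)

args = TrainingArguments(
    output_dir="./gpt2-medium-writingprompts",
    num_train_epochs=3,              # from Appendix B.2
    learning_rate=5e-5,              # from Appendix B.2
    lr_scheduler_type="linear",      # linear learning-rate schedule
    per_device_train_batch_size=4,   # assumption, not stated in the paper
)

Trainer(
    model=model,
    args=args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```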
B.3 Data Post-processing

Once the model is trained, we randomly select 200 prompts from the test set of WritingPrompts, feed the prompts to the trained model, and ask the model to generate stories based on the given prompts. When generating the stories, we adopt nucleus sampling with p = 0.9. Next, we manually truncate the generated stories to less than 150 words and ensure that after the truncation, each story ends with a full sentence.[7] After this process, we have 200 pairs of prompts and model-generated stories.

[7] We truncate the stories to 150 words since this is the mean length of the model-generated stories.

As a comparison to the model-generated stories, we select the same 200 prompts used for generating the model-generated stories and their corresponding human-written stories to form 200 pairs of prompts and human-written stories. For these human-written stories, we also truncate the stories to less than 150 words and end them with a full sentence to match the model-generated stories. We also manually remove some artifacts in the human-written stories caused by the tokenization of the WritingPrompts dataset.
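The truncation rule described above can be sketched as follows; the paper states the truncation was done manually, so the NLTK-based sentence splitting here is only an illustrative approximation.

```python
# Sketch: keep whole sentences until adding the next one would exceed 150 words.
import nltk
nltk.download("punkt", quiet=True)
from nltk.tokenize import sent_tokenize

def truncate_story(story: str, max_words: int = 150) -> str:
    """Truncate a story to at most max_words words while ending on a full sentence."""
    kept, n_words = [], 0
    for sentence in sent_tokenize(story):
        words = sentence.split()
        if n_words + len(words) > max_words and kept:
            break  # stop before exceeding the limit, so the story ends on a full sentence
        kept.append(sentence)
        n_words += len(words)
    return " ".join(kept)
```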
C Human Evaluation

C.1 Recruiting English Teachers

The English teachers hold ESL certificates[8]; given that they are experienced with correcting essays written by students, they are perfect fits for this task.

[8] English as a Second Language Teaching Certification.

Each teacher is asked to rate 200 GPT-2-generated stories and 200 human-written stories, and they are paid US$140 for rating 200 stories.
Considering that the teachers reported that they take at most 5 hours to rate 200 stories, this makes the hourly wage at least US$28. We first ask the teachers to rate the GPT-2-generated stories and then the 200 human-written stories. Different from Karpinska et al. (2021), who take a break between rating the GPT-2-generated stories and the human-written stories, we do not take a break, to avoid the teachers' rating standards changing after a long break. The teachers are not told who wrote the stories before they evaluate them. We reveal what this project aims to study after they finish rating all the stories.

The reason we do not mix human-written and GPT-2-generated stories for rating is that Karpinska et al. (2021) observe that (1) when AMT workers rate model-generated and human-written stories separately, their ratings do not show a preference toward human-written stories, but (2) even when rating the model-generated and human-written stories separately, English teachers show a clear preference toward human-written stories. We follow their settings and do not mix GPT-2-generated and human-written stories.

During the reviewing process, we received questions from the reviewers about why we did not mix the stories for human evaluation. Thus, we conduct the same experiment by randomly mixing 200 human-written and 200 GPT-2-generated stories and asking three teachers (not the teachers who already rated the stories) to rate them. All other experiment conditions are the same as previously stated. The full result is shown in Table 5. We find that the teachers still show a clear preference toward human-written stories for all four attributes, similar to the observation in Table 1. The only exception is grammaticality, where English teachers do not show a very clear preference for the grammar of human-written stories. However, when calculating the average rating for individual teachers, we find that two out of three teachers do rate grammaticality higher for human-written stories.

Writer           Human        GPT-2
Grammaticality   3.89 (0.97)  3.88 (0.84)
Cohesiveness     4.35 (0.87)  3.49 (0.97)
Likability       3.46 (1.40)  2.89 (1.12)
Relevance        3.71 (1.20)  2.37 (1.33)

Table 5: The average Likert score (with standard deviation in parentheses) for human-written and GPT-2-generated stories when we randomly mix the 200 model-generated and 200 human-written stories during human evaluation.

It is interesting to note that for LLM evaluation, there is no such issue about whether or not to mix the human-written and GPT-2-generated stories, as the rating of each story is independent of the others, as discussed in Section 5.

For adversarial attack quality evaluation, we also recruit certified teachers on Upwork. The teachers are asked to rate 100 news titles and are paid US$35 for doing so. They reported that it took them less than 1 hour to complete the rating.

C.2 Human Evaluation Interface

Open-Ended Story Generation: We use Google Forms to collect the responses from the teachers. Each form contains 100 stories, and each story is on one page of the Google Form. The interface on one page is shown in Figure 2 and Figure 3; the two figures are from the same page of the Google Form, and we split them because screenshotting the whole interface would cause low resolution.

Figure 2: The upper part of the interface in open-ended story generation.

Adversarial Attacks Quality Evaluation: In this task, we also use Google Forms to collect the responses from the teachers. We create two different Google Forms, one is used to evaluate the fluency,
Figure 4: The Google Form used to evaluate the fluency
of the benign or adversarial samples.
Attribute Rater Response
Grammar ChatGPT I would rate the grammatical correctness of the story fragment as a 3.
The text is generally understandable, but there are a few punctuation
errors and awkward phrasing that make it somewhat difficult to follow.
T1 Disagree: I would rate the grammar as a 2, not a 3–there were signifi-
cantly more grammatical errors than the average within this set of story
fragments.
T2 Disagree: The issues with the grammar and coherence creates a stop
start feeling when reading. I am unsure as to who is speaking in the text
and this leaves me with many questions.
T3 Disagree: The grammar is poorer than a 3, as is evident in for example
what is supposed to be the second sentence: "He has been ... (incomplete,
and another segment is there "People stare ...); cant; couldnt; doesnt;
and sentences that do not make sense like this one: " A man I once was
won’t be able to marry me "
Cohesive ChatGPT I would rate the cohesiveness of the sentences in the story fragment as
a 2. The sentences in the fragment are disconnected and jump between
different thoughts and ideas, making it difficult to understand the overall
story or message. There is no clear progression of events or logical
flow between the sentences. It appears that the fragment may be missing
some important information or context that would help tie the sentences
together.
T1 Agree
T2 Agree
T3 Disagree: very incoherent passage in which it is not very clear who says
what, what the topic is, thus rendering a disjointed narrative.
Likable ChatGPT I would rate the enjoyability of the story fragment as a 1. The fragment
is confusing and difficult to understand, the sentences are disconnected
and jumps between different thoughts and ideas. The story seems to
be incomplete and lacks context, making it hard to follow the plot or
characters. Additionally, the fragment has a sad and negative tone and it
does not seem to be enjoyable to read.
T1 Agree
T2 Agree
T3 Agree
Relevant ChatGPT I would rate the relevance of the story fragment to the prompt as a 1.
While the story fragment does mention someone being gone, it doesn't
seem to be directly related to the idea of filling out a form to justify killing
someone. It’s unclear if the son is the person being proposed to be killed,
or if the story is even taking place in the context of the prompt. There is
no clear indication that the fragment is related to the prompt.
T1 Agree
T2 Agree
T3 Agree
Attribute Rater Response
Grammar ChatGPT The text of the story fragment appears to be grammatically correct, with
no major errors or awkward phrasing. On a scale of 1-5, with 1 being
the lowest, I would rate the grammatical correctness of the text as a 5.
T1 Agree
T2 Agree
T3 Disagree: I would rate this a 4 because of for example ’no where’ =
nowhere, although it might seem like a minor error, it does take away
some correctness of the text.
Cohesive ChatGPT The sentences in the story fragment fit together well, creating a clear and
coherent image of the scene and the characters' actions and thoughts.
The language is descriptive and evocative, and the dialogue effectively
conveys the characters’ emotions and motivations. On a scale of 1-5,
with 1 being the lowest, I would rate how well the sentences fit together
as a 4.
T1 Agree
T2 Agree
T3 Partially agree: To me, all the sentences fit well together to create a
captivating narrative. I will give a 4 or 5.
Likable ChatGPT Based on the structure, writing style and plot of the story fragment, I
would rate it as 3 in terms of enjoyability. The fragment sets up an
intriguing situation with the mysterious little black box and the protago-
nist’s belief that it holds the "secret to the universe," but it ends abruptly
and leaves many questions unanswered, which may not be satisfying for
readers.
T1 Disagree: I liked it and it was very clear, so I’d rate it a 4 or 5.
T2 Disagree: While the story ends suddenly, I am remined that it is just a
fragment and that their may be more to this piece. I felt the piece written
about a little black box was intriguing and enjoyed the introduction of
Norse mythology. This would attract the interest of many readers.
T3 Disagree: I enjoyed the fragment, as it piqued my interest to want to
know more about how the story evolves and what it actually was about.
Relevant ChatGPT I would rate the story fragment as a 5 in relevance to the prompt "The
Little Black Box." The story fragment centers around a little black box
and its perceived importance, which aligns with the prompt. Additionally,
the main character, Heaven, is holding the box and describing it as "the
secret to the universe," further solidifying the connection to the prompt.
T1 Agree
T2 Agree
T3 Agree: the story is totally relevant to the prompt.
Attribute Rater Response
Grammar ChatGPT I would rate the grammatical correctness of the text as a 3. There are
a few errors such as "Many displays of various vice" which should
be "Many displays of various vices" and "I so small flakes of snow"
which should be "I saw small flakes of snow." Additionally, some of
the sentences are slightly awkward and could be rephrased for better
clarity. Overall, the text is understandable, but there are a few errors
and awkward phrasings that could be improved.
T1 Partially agree: I rated the grammar lower than this–maybe a 2 or a 3.
There were above-average numbers of errors.
T2 Disagree: While the text is understandable, I found myself being dis-
tracted by the errors in the text and coherency.
T3 Agree
Cohesive ChatGPT I would rate the cohesiveness of the sentences in the story fragment as
a 2. The story fragment jumps abruptly between different scenes and
characters without clear transition, making it difficult to follow and
understand the narrative. Additionally, the sentences are often disjointed
and lack a clear flow. The story fragment also lacks a clear plot or
purpose, making it difficult to understand the overall story or the main
character’s intentions. The story fragment can be improved by providing
more context, and a clearer narrative arc.
T1 Agree: it was jumpy
T2 Agree
T3 Disagree: The story is about a sniper looking through his rifle’s telescope,
thus seeing changing scenes based on each room he is seeing through
the windows, which are then subsequently described. These descriptions
account for the seeming lack of cohesion.
Likable ChatGPT I would rate the enjoyability of the story fragment as a 1. The fragment
lacks a clear plot, purpose or character development, making it difficult to
become invested in the story or the characters. Additionally, the fragment
includes disturbing and graphic content such as violence and prostitution,
which may make it uncomfortable for some readers. Additionally, the
fragment contains several grammatical errors and awkward phrasings
that can make it difficult to follow the story. Overall, the story fragment
is not enjoyable to read.
T1 Agree: it was not just graphic, but also showed really harmful and
negative representations of characters based on their weight, gender, etc.
T2 Agree
T3 Disagree: The story did hold my attention to a degree because I wanted
to know whom he wanted to target and why.
Relevant ChatGPT I would rate the relevance of the story fragment to the prompt as a 1. The
story fragment doesn't contain any elements that relate to the prompt such
as the protagonist and antagonist, their actions, or the moral ambiguities
the antagonist faces. The story fragment mainly describes a person with
a sniper rifle looking into windows of a hotel and doesn’t present any
clear plot or purpose. Additionally, the story fragment does not offer any
indication of moral ambiguity on the part of the person with the sniper
rifle. Overall, the story fragment is not relevant to the prompt.
T1 Agree
T2 Agree
T3 Agree
✗ A4. Have you used AI writing assistants when working on this paper?
Left blank.
B ✓ Did you use or create scientific artifacts?
Section 4.2, Appendix B.1 and E
✓ B1. Did you cite the creators of artifacts you used?
Section 4.2, Appendix B.1 and E
✗ B2. Did you discuss the license or terms for use and / or distribution of any artifacts?
The datasets we use do not include a license
✓ B3. Did you discuss if your use of existing artifact(s) was consistent with their intended use, provided
that it was specified? For the artifacts you create, do you specify intended use and whether that is
compatible with the original access conditions (in particular, derivatives of data accessed for research
purposes should not be used outside of research contexts)?
Appendix E and Ethical statement
✗ B4. Did you discuss the steps taken to check whether the data that was collected / used contains any
information that names or uniquely identifies individual people or offensive content, and the steps
taken to protect / anonymize it?
Removing the names in AG-News would make the news titles nonsensical.
✓ B5. Did you provide documentation of the artifacts, e.g., coverage of domains, languages, and
linguistic phenomena, demographic groups represented, etc.?
Section 4.2, Appendix B.1 and E
✓ B6. Did you report relevant statistics like the number of examples, details of train / test / dev splits,
etc. for the data that you used / created? Even for commonly-used benchmark datasets, include the
number of examples in train / validation / test splits, as these provide necessary context for a reader
to understand experimental results. For example, small differences in accuracy on large test sets may
be significant, while on small test sets they may not be.
Section 4.2, Appendix B.1 and E
C ✗ Did you run computational experiments?
Left blank.
C1. Did you report the number of parameters in the models used, the total computational budget
(e.g., GPU hours), and computing infrastructure used?
Not applicable. Left blank.
The Responsible NLP Checklist used at ACL 2023 is adopted from NAACL 2022, with the addition of a question on AI writing
assistance.
C2. Did you discuss the experimental setup, including hyperparameter search and best-found
hyperparameter values?
Not applicable. Left blank.
C3. Did you report descriptive statistics about your results (e.g., error bars around results, summary
statistics from sets of experiments), and is it transparent whether you are reporting the max, mean,
etc. or just a single run?
Not applicable. Left blank.
C4. If you used existing packages (e.g., for preprocessing, for normalization, or for evaluation), did
you report the implementation, model, and parameter settings used (e.g., NLTK, Spacy, ROUGE,
etc.)?
Not applicable. Left blank.
D ✓ Did you use human annotators (e.g., crowdworkers) or research with human participants?
Section 3, 4, Appendix C
✓ D1. Did you report the full text of instructions given to participants, including e.g., screenshots,
disclaimers of any risks to participants or annotators, etc.?
Appendix C.1
✓ D2. Did you report information about how you recruited (e.g., crowdsourcing platform, students)
and paid participants, and discuss if such payment is adequate given the participants’ demographic
(e.g., country of residence)?
Appendix C.1, C.2
✓ D3. Did you discuss whether and how consent was obtained from people whose data you're
using/curating? For example, if you collected data via crowdsourcing, did your instructions to
crowdworkers explain how the data would be used?
Ethical statement
D4. Was the data collection protocol approved (or determined exempt) by an ethics review board?
Not applicable. We do not have an ethics review board in our institute.
✗ D5. Did you report the basic demographic and geographic characteristics of the annotator population
that is the source of the data?
This is not related to our task. We report the certification of the workers in Appendix C.1.