Exploring AI-Generated English Relative Clauses in Comparison to Human Production


Hongoak Yun1, Eunkyung Yi2* and Sanghoun Song3*


1 Jeju National University, 2 Ewha Womans University, 3 Korea University

Abstract
Human behavioral studies have consistently indicated a
preference for subject-extracted relative clauses (SRCs) over
object-extracted relative clauses (ORCs) in sentence production
and comprehension. Some studies have further shown that this
preference can be influenced by the semantic properties of head
nouns, particularly animacy. In this study, we use AI language
models, specifically GPT-2 and ChatGPT 3.5, to simulate human
sentence generation. Our primary goal is to evaluate the extent
to which these language models replicate human behaviors in
sentence production and identify any divergences. We tasked the
models with completing sentence fragments structured as ‘the,’
followed by a head noun and 'that’ (The reporter that …). We
varied the semantic property of head nouns such that they are all
animate (the secretary that … ) in Study 1 and are either animate
or inanimate (the musician/book that … ) in Study 2. Our findings
reveal that in Study 1, both GPT models exhibited a robust SRC
bias, replicating human-like behavior in relative clause production.
However, in Study 2, we observed divergent behavior between the
models when head nouns were inanimate, while consistency was
maintained when head nouns were animate. Specifically, ChatGPT
3.5 generated more ORCs than SRCs in the presence of inanimate
head nouns. These results, particularly those from ChatGPT 3.5,

closely mirror human relative clause production patterns. Our


study highlights the potential of generative language
models as efficient and versatile corpus simulators. Furthermore,
our findings contribute to the evolving field of AI linguistics,
shedding light on the capacity of AI generative systems to emulate
human-like linguistic patterns in sentence production.

Key words: large language models, asymmetry of relative clauses,


semantic sensitivity, AI-generated corpus, ChatGPT

1. Introduction

Achieving machine intelligence akin to that of human performers and


thinkers has been a challenging topic for many decades. Following the
groundbreaking release of ChatGPT, including GPT 4 (OpenAI, 2023),
language model simulations have witnessed remarkable advancements in
almost all domains associated with humans. While some remain skeptical,
arguing that human behaviors, including cognition, are not that simple for
machines to learn (Chomsky, Roberts, & Watumull, 2023), others embrace
the remarkable advancements of neural language models, viewing ChatGPT
as a powerful tool with practical applications (King & ChatGPT, 2023;
Liebrenz, Schleifer, Buadze, Bhugra, & Smith, 2023; Thorp, 2023).
The simulations of human linguistic knowledge are no exception. Even
prior to the emergence of ChatGPT, various versions of neural language
models have been implemented across diverse linguistic tasks, such as
speech recognition (Arisoy, Sainath, Kingsbury, & Ramabhadran, 2012;
Mikolov, Karafiát, Burget, Černocký, & Khudanpur, 2010), machine
translation (Schwenk, Harmel, Brechet, Zolles, Berkefeld, Müller,
Bildl, Baehrens, Hüber, Kulik, Klöcker, Schulte, & Fakler, 2012), text
summarization (Filippova, Alfonseca, Colmenares, Kaiser, & Vinyals,
2015; Rush, Chopra, & Weston, 2015), subject-verb agreement (Bernardy
& Lappin, 2017; Enguehard, Goldberg, & Linzen, 2017; Gulordava,

Grave, Linzen, & Baroni, 2018; Linzen, Dupoux, & Goldberg, 2016),
grammaticality judgment with garden path constructions (Frank & Hoeks,
2019; Futrell & Levy, 2019; van Schijndel & Linzen, 2018), negative
polarity item licensing (Futrell, Wilcox, Morita, & Levy, 2018; Shin, Yi,
& Song, 2023), filler-gap grammaticality judgments in center embedding
and syntactic islands involving long-distance dependencies (Wilcox, Levy,
Morita, & Futrell, 2018; 2019), and discourse expectations (Yi, Cho, &
Song, 2022). Overall, pre-ChatGPT neural language models have revealed
considerable syntactic sensitivity to grammaticality (Warstadt & Bowman,
2020), but these language models have struggled with making commonsense
inferences and role-based event prediction (Ettinger, 2020). Since the
launch of ChatGPT, language models have seen dramatic improvements
in semantic, pragmatic, and syntactic knowledge. However, few studies
have attempted to simulate the cognitive aspects that underlie how humans
use their linguistic knowledge during the process of comprehension and
production (Cai, Haslett, Duan, Wang, & Pickering, 2023). One potential
approach to address this gap is by simulating psycholinguistic findings.
The processing of relative clauses is a topic of significant psycholinguistic
interest. When English-speaking humans read sentences like (1a-b), they
typically read clauses like (1a), where the subject is extracted to the head
position of the relative clause (i.e., SRC), faster or with less difficulty
than clauses like (1b), where the object is extracted (i.e., ORC). The
phenomenon, known as the SRC advantage in comprehension, is a well-
established psycholinguistic observation widely replicated across different
languages. In addition to comprehension studies, corpus studies have
consistently found that subject-extracted relatives are produced more
frequently than their corresponding object-extracted counterparts in large-
scale corpora (Levy, 2008; Reali & Christiansen, 2007; Roland, Dick,
& Elman, 2007). Computational linguistic theories have proposed probabilistic
models, such as the surprisal model, which mathematically demonstrate

that processing frequently produced (more probable) structures like SRCs,


compared to ORCs, requires relatively less cognitive cost from language
users (Hale, 2001; Levy, 2008). Furthermore, examining the interactive link
between production and comprehension contributes to our understanding of
the systematic linguistic and cognitive demands placed on ordinary humans
during language processing (MacDonald, 2013).

(1) a. The reporter_i that [t]_i attacked the senator …

b. The reporter_i that the senator attacked [t]_i …
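For the contrast in (1), the surprisal account cited above (Hale, 2001; Levy, 2008) can be stated explicitly: the processing cost of a word is proportional to its negative log probability given the preceding context. The formula below is the standard definition of surprisal, restated here for convenience rather than quoted from either paper:

\mathrm{surprisal}(w_i) = -\log P(w_i \mid w_1, \ldots, w_{i-1})

Because a verb is a more probable continuation than a noun phrase after a fragment like The reporter that, the first word inside the relative clause in (1a) carries lower surprisal, and hence lower predicted processing cost, than the corresponding word in (1b).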

Suppose another scenario in which large language models, like


ChatGPT, simulate the behaviors of human language users processing
relative clauses (e.g., 1a-b). Given that state-of-the-art pretrained language
models are trained on vast amounts of human-generated data, their output
would naturally align with the frequency distribution observed in human
language. However, few simulations have examined the extent to which
large language models are sensitive to the asymmetry of relative clauses in
language processing. In this study, our goal is to simulate human producers
in the generation of English relative clauses, using GPT-2 (Radford et al.
2018, 2019) and ChatGPT 3.5 (OpenAI, 2022). Through this, we aim to
expand our discussion on how AI generative systems approximate human
linguistic patterns in sentence production.

2. Previous studies

2.1 Machine simulations of psycholinguistic studies


Cai et al. (2023) conducted a series of model simulations on 12
preregistered psycholinguistic experiments. Their goal was to evaluate
the extent to which ChatGPT resembles human language comprehension
and production. However, due to the restrictions of ChatGPT, which

only produces written responses to written prompts with response time


depending on website traffic, Cai et al.’s (2023) simulations were limited to
measuring the variations of ChatGPT outputs in response to manipulated
text prompts. Nonetheless, their simulations have covered a wide range
of psycholinguistic dimensions from sounds of words to structures of
sentences and discourses.
Out of the 12 simulations, we focus on two sub-studies related to human
sentence processing. First, humans do have a strong tendency to repeat
linguistic elements like words or phrases. A substantial body of structural
priming studies has revealed that humans tend to repeat a syntactic structure
that they have recently encountered (Bock, 1986; Pickering & Branigan,
1998, among many others). In Cai et al.’s (2023) simulation, ChatGPT
successfully replicated structural priming in sentence generation, mirroring
human behavior. For example, ChatGPT was instructed to complete a pair
of sentence fragments sequentially: first, prime fragments (e.g., The racing
driver gave the helpful mechanic …), followed by target fragments (e.g.,
The patient showed ….). Like human producers, ChatGPT used the same
structures in the prime-target sequence. This means that if the language
model completes the prime with a double object (DO) construction (e.g.,
The racing driver gave the helpful mechanic a wrench), it was likely to
continue the subsequent target fragment with a double object construction
(e.g., The patient showed the nurse his hand) rather than opting for a
prepositional object (PO) construction (e.g., The patient showed his hand
to the nurse). Crucially, as found in human studies (Pickering & Branigan,
1998), verb repetition also played a significant role in determining the
strength of the priming effect. ChatGPT’s structural priming effect was
enhanced when the prime and the target shared the same verb (e.g., showed)
compared to when they used different verbs (e.g., showed versus gave).
These findings suggest that ChatGPT exhibited human-like behaviors by
showing structural priming and the lexical boost.

Second, ChatGPT seems to acknowledge the phenomenon that typical


language use often contains elements of noise. For example, Gibson,
Bergen, and Piantadosi (2013) demonstrated that humans were likely to
accept implausible DO sentences like The mother gave the candle the
daughter with the belief that to in front of the daughter might have been
dropped by accident. However, such a generous tolerance was not observed
in implausible PO sentences like The mother gave the daughter to the
candle because inserting to in front of the candle does not look like an
accidental noise. Cai et al. (2023) observed that ChatGPT showed a tendency
similar to humans: the model interpreted plausible and some implausible DO
sentences literally, but gave nonliteral interpretations to other implausible
PO sentences. These outputs
suggest that at least superficially, ChatGPT accommodates the presence
of noise when processing implausible sentences. To sum up, the two
simulations by Cai et al. (2023) revealed that ChatGPT displays human-
like behaviors, including a strong tendency to repeat recently encountered
structures and an ability to discern likely accidental noise in sentence
comprehension.
Similar to Cai et al.’s approach, we investigated whether the language
models could reveal other key aspects consistently observed in human
sentence processing: frequency-based expectation generation. Recent
studies, mostly in the last two decades, have demonstrated that human
processors develop expectations about what is likely to occur in upcoming
positions and begin processing it even before encountering its actual
occurrence (Altmann & Kamide, 1999; Yun & Yi, 2019, among others). While
many sources of information are engaged in the computation of likelihoods
for upcoming candidates, “frequency” stands out as a powerful factor
influencing the adjustment of likelihood distributions among candidates
(Hale, 2001; Levy, 2008). As for the generation of relative clauses, once
humans are certain that a relative clause will unfold downstream in the

remaining sentence (typically upon recognizing relative pronouns like


that), they draw upon their long-term memory (e.g., frequency) to anticipate
the most likely type of relative clause to occur next. Given that humans
often encounter subject-extracted relatives more frequently than object-
extracted relatives in many cases (Roland et al., 2007), they would expect
to encounter verbs rather than nouns in the position right after relative
pronouns. In some other instances (e.g., pronominal relative clauses, inanimate
head nouns), they might expect to encounter object relatives rather than
subject relatives (Gennari & MacDonald, 2008; Reali & Christiansen, 2007;
Roland et al., 2007), resulting in stronger expectations for nouns than verbs
for the upcoming position after relative pronouns. What choices would the
large language models make? What would be the results of pretraining that
the models have gone through?
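Before turning to corpus-style generation, one concrete way to pose this question is to inspect a model's next-word distribution directly. The sketch below is only an illustration of that idea, not the procedure used in this paper: the model checkpoint ("gpt2"), the prompt, and the probe-word lists are assumptions chosen for demonstration. It compares how much probability GPT-2 assigns to verb-like versus noun-like continuations after a fragment such as The reporter that.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The reporter that"          # head noun + relative pronoun fragment
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next-token position
probs = torch.softmax(logits, dim=-1)

# Hypothetical probe words: a verb continuation points toward an SRC,
# a determiner/pronoun continuation points toward an ORC.
verb_like = [" attacked", " visited", " interviewed"]
noun_like = [" the", " you", " senators"]

def probability_mass(words):
    # Sum the probability of the first subword token of each probe word.
    return sum(probs[tokenizer.encode(w)[0]].item() for w in words)

print("verb-like mass:", probability_mass(verb_like))
print("noun-like mass:", probability_mass(noun_like))

A higher verb-like mass is what the frequency-based account would predict, since a verb continuation opens a subject-extracted relative clause.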

2.2 Human behaviors with English relative clauses


The SRC advantage over ORCs is a non-trivial finding, well-supported
by numerous replications using diverse methods such as sentence
generation, self-paced reading, eye-movement tracking, and ERPs (Ford,
1983; Grodner & Gibson, 2005; Gordon, Hendrick, & Johnson, 2001; Kim
& O’Grady, 2016; King & Just, 1991; Reali & Christiansen, 2007; Roland,
Mauner, O'Meara, & Yun, 2012; Traxler, Morris, & Seely, 2002; among
others). Various theories propose alternative accounts of the locus of this
processing difference. For example, Kim and O’Grady (2016) proposed
a role for syntactic hierarchy, arguing that SRCs are likely to be produced
more easily than ORCs because subjects are placed in a structurally higher
position than direct objects. However, our focus is on the approach based on
the frequency distribution of structural patterns.
According to distribution-based approaches, such as word-order
frequency theories (Bever, 1970; MacDonald & Christiansen, 2002) and
the Tuning Hypothesis (Mitchell, Cuetos, Corley, & Brysbaert, 1995), a key

constraining factor for the subject-object asymmetry depends on language


users’ experience; that is, structures with which language users have more
direct experience are easier to produce or comprehend than those with less
direct experience. For instance, in Subject-Verb-Object (SVO) languages
like English, subject relative clauses like (1a), which adhere to the canonical
word order (i.e., SVO), should be easier and faster for English speakers
to produce and comprehend than object relative clauses like (1b) in the
noncanonical order (i.e., OSV order). Roland et al. (2007) reported a set
of fine-grained frequency counts associated with relative clauses across
five large corpora. Table 1 presents selected corpus counts, illustrating the
frequencies of different types of relative clauses per million NPs in each
corpus. Overall, subject relatives occurred much more frequently than
object relatives across these corpora. Among object relatives, reduced forms
of object relatives (e.g., The words underlined in red have errors) are more
common, and object relatives occurred slightly more in spoken corpora (i.e.,
BNC spoken, Switchboard) than in written-text corpora (i.e., BNC, Brown,
and Wall Street Journal).

Table 1. Frequencies of each type of relative clause per 1 million noun
phrases in each corpus (taken from Table 7 in Roland et al., 2007)

                            British National   British National   Brown    Switchboard   Wall Street Journal
                            Corpus             Corpus Spoken                             Treebank 2
Subject relative            14,182              9,851             15,024    9,548        18,229
Object relative              2,943              3,863              1,976    5,616         1,802
Object relative (reduced)    5,455             14,423              4,746    5,314         3,385
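As a quick arithmetic check on the claim that subject relatives outnumber (unreduced) object relatives in every corpus, the counts in Table 1 can be converted into subject-to-object ratios. The snippet below simply re-uses the numbers from the table and is illustrative only.

# Counts per million NPs from Table 1: (subject relative, object relative,
# reduced object relative), copied directly from the table above.
corpora = {
    "British National Corpus":        (14182, 2943, 5455),
    "British National Corpus Spoken": (9851, 3863, 14423),
    "Brown":                          (15024, 1976, 4746),
    "Switchboard":                    (9548, 5616, 5314),
    "Wall Street Journal Treebank 2": (18229, 1802, 3385),
}

for name, (subj, obj, obj_reduced) in corpora.items():
    # Ratio of subject relatives to (unreduced) object relatives
    print(f"{name}: SRC/ORC ratio = {subj / obj:.1f}")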

Intriguingly, the skewed distribution favoring subject relatives over object


relatives is not consistently observed. It is often disrupted by a couple of
non-structural factors. One possible factor is the use of pronouns within
relative clauses. Reali and Christiansen (2007) have replicated the subject-
biased distribution (i.e., 68% versus 32%) when embedded NPs within
relatives were full NPs, as illustrated in (2a-b). However, when pronouns
except it are used as in (2c-d), the subject-object frequency asymmetry
was reversed (i.e., 35% versus 65%). The
pronominal effect was noted in comprehension studies, suggesting that the
difficulty associated with pronominal ORCs, as opposed to pronominal
SRCs, was either neutralized or reduced (Gordon, Hendrick, & Johnson,
2001; Heider, Dery, & Roland, 2014; Roland, Mauner, & Hirose, 2021;
Roland et al., 2012).

(2) a. The lady that visited the doctor


b. The lady that the doctor visited
c. The lady that visited you …..
d. The lady that you visited …..

Another potential nonstructural factor is related to the animacy of the


head nouns modified by relative clauses. Studies have shown that the
frequency discrepancy between subject relatives and object relatives
diminishes when head nouns are inanimate, compared to when they are
animate (Fox & Thompson, 1990; Roland et al., 2007). Roland et al. (2007)
found that in the Brown corpus, subject relatives constituted 75% of
all relative clauses when head nouns were animate, but this occurrence
dropped to 45% when head nouns were inanimate. A more dramatic
reversal was found in the counts from the Switchboard corpus, which
documented spontaneous spoken utterances. In this corpus, the frequency
of subject relatives increased to 91% of all relative clauses when head

nouns were animate but decreased to 31% when they were inanimate. The
effect of semantic sensitivity to the SRC-ORC frequency distribution,
based on corpus counts, has also been observed in laboratory-based
studies of human production. Gennari and MacDonald (2008) manipulated
the animacy of head nouns, using nouns that were animate as in (3a) and
inanimate as in (3b). In a completion task, English native speakers were
asked to continue sentence fragments like (3a-b). The results revealed that
when prompted with animate head nouns, there was a strong bias for SRCs,
with an 85% preference for SRCs over 15% for ORCs. Conversely, when
prompted with inanimate head nouns, the strong bias for SRCs weakened,
resulting in a 65% preference for SRCs versus 35% for ORCs.

(3) a. The musician that …..


b. The accident that …..

In summary, human producers have consistently generated subject-


extracted relatives more frequently than object-extracted relatives, with
exceptions occurring when pronominal relatives are used or when head
nouns are inanimate. Considering the link between production and
comprehension, this asymmetric frequency distribution of relative clauses
is important for explaining why subject relatives are comprehended more
easily than object relatives.

2.3 Research questions and hypotheses


Our goal is to examine the extent to which language models simulate
human behavior in the production of relative clauses. More specifically, we
have three research questions as follows:

Research question 1: Do large language models exhibit a bias in favor of


generating subject relatives over object relatives, resulting in an asymmetric

distribution of both types?


Research question 2: Do large language models demonstrate sensitivity to
the semantic features of head nouns when generating relative clauses?
Research question 3: Do GPT models exhibit variations in their outputs
across different versions when simulating human language production?

To investigate these research questions, we conducted two studies using


GPT-2 and ChatGPT 3.5. In Study 1, we extracted experimental trials from
existing psycholinguistic studies in which head nouns were all animate.
In Study 2, we included another set of experimental trials involving both
animate and inanimate head nouns. We hypothesized that in Study 1, large
language models, due to their pretraining on diverse human-written texts,
would display a strong bias towards producing subject relatives. This was
based on the assumption that, given the algorithmic nature of GPT models,
they would be more likely to generate verbs than nouns in the position
immediately following a relative pronoun. Conversely, in Study 2,
we hypothesized that if the language models were sensitive to the animacy
factor associated with head nouns, the strong bias towards subject relatives
would weaken. However, we did not have a specific hypothesis about
whether language models would generate different outputs depending on
the versions used.

3. Study 1

The purpose of Study 1 was to build an AI-generated corpus of English


relative clauses. We hypothesized that if GPT models successfully
replicate human-generated corpora, subject-extracted relatives would occur
significantly more frequently than object-extracted relatives.

3.1 Method

3.1.1 Language models


We used two versions of neural language models: GPT-2 and ChatGPT 3.5.
These models are known for their reliability and suitability in generating
coherent sentences.

GPT-2
GPT-2, featuring 1.5 billion parameters, underwent pretraining on
an extensive dataset comprising web pages, books, and various written
materials. During pretraining, the model mastered the language by
predicting the next word in a given context sequence (Shrivastava, Pupale,
& Singh, 2021). Following this phase, GPT-2 underwent fine-tuning for
a wide range of downstream tasks, such as text classification, sentiment
analysis, and question-answering (Schneider, de Souza, Gumiel, Moro, &
Paraiso, 2021).

ChatGPT 3.5
ChatGPT, powered by GPT 3.5, employs a stack of 13 Transformer
blocks, each featuring 12 attention heads and 768 hidden units. This model
was pretrained on a vast corpus of text data, including books, articles,
and websites, through a language modeling task (Abdullah, Madain, &
Jararweh, 2022). This pretraining enables ChatGPT to learn connections
between words and phrases in natural language, resulting in coherent
responses during conversations. Notably, compared to GPT-2, ChatGPT
demonstrates significantly improved generation capabilities. The texts it
generates are contextually accurate, grammatically precise, and logically
coherent. Many have attested to the fluency of ChatGPT, finding it valuable
across diverse applications like content writing, summarization, machine
translation, and text rewriting (Sallam, 2023).

3.1.2 Materials
A total of 76 stimuli were sampled from Roland et al. (2012), Gordon
et al. (2004), and Reali and Christiansen (2007), all of which are
psycholinguistic studies that documented faster reading times for subject
relatives compared to object relatives. To generate relative clauses by neural
language models (i.e., GPT-2 and ChatGPT 3.5), we prepared incomplete
sentence prompts, as indicated in Example (4), which have been extracted
from the 76 original stimuli. All of the head NPs used in the present study
were animate. Appendix A displays the stimuli that we used.

(4) The secretary that …..

3.1.3 Procedure
The incomplete sentence fragments were entered into neural language
models, GPT-2 and ChatGPT 3.5. First, we used the large version of GPT-2
and asked the model to complete the incomplete sentence fragments across
six different temperature settings: 0.1, 0.3, 0.5, 0.7, 0.9 and 1.0. At each
temperature level, we asked the model to generate ten sentence samples
for each stimulus. In total, we ended up with 4,500 sentence samples
from six different temperature levels. Second, we used the free version of
ChatGPT-3.5, accessible via the OpenAI website (https://chat.openai.com/).
To obtain sentence samples, we formulated our request within the prompt
box, as follows: “Would you please generate 10 English sentences starting
with The secretary that … ?”. To mitigate potential implicit priming effects
arising from previously produced samples that the model might have in
its working memory space during generation (Cai et al., 2023), we had the
ChatGPT 3.5 model generate ten sentence samples for each stimulus and
repeated the process 15 times in distinct trials. This resulted in a total of
11,250 sentence samples.
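For readers who want to approximate this pipeline, the sketch below shows one way to collect GPT-2 completions across the temperature settings described above, using the Hugging Face transformers library. It is a minimal illustration under stated assumptions (model size, sampling options, and the two example prompts), not the authors' actual script; the ChatGPT portion is shown only as a commented-out API approximation, since the authors used the web interface.

from transformers import pipeline, set_seed

# Sampling-based completion with GPT-2 (large); prompts, temperatures, and
# sample counts mirror the description above but are illustrative only.
generator = pipeline("text-generation", model="gpt2-large")
set_seed(42)

prompts = ["The secretary that", "The reporter that"]   # fragments as in Appendix A
temperatures = [0.1, 0.3, 0.5, 0.7, 0.9, 1.0]

samples = []
for prompt in prompts:
    for temp in temperatures:
        outputs = generator(
            prompt,
            do_sample=True,
            temperature=temp,
            max_new_tokens=20,         # enough to reveal the start of the relative clause
            num_return_sequences=10,   # ten samples per stimulus per temperature
        )
        for out in outputs:
            samples.append((prompt, temp, out["generated_text"]))

print(len(samples), "completions collected")

# The authors queried ChatGPT 3.5 through the chat.openai.com web interface;
# an API-based approximation (an assumption, using the OpenAI Python client)
# might look like:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[{"role": "user",
#                "content": "Would you please generate 10 English sentences "
#                           "starting with The secretary that ...?"}],
# )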

3.1.4 Coding and Analyses


To evaluate AI-generated outputs, we initially identified ungrammatical
sentences for which we could not determine the type of relative clauses.
This resulted in the removal of 3.2% of the GPT-2 output, with only two
samples from the ChatGPT 3.5 output being excluded. Subsequent to this
preliminary filtering, we hand-coded the remaining sentences, classifying
each generated relative clause as either subject-extracted relatives or object-
extracted relatives. Two linguistic experts independently coded the data and
their coding results revealed an agreement rate of 98%. In cases where they
disagreed, they discussed until they reached an agreement.
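For completeness, the percent-agreement figure reported above can be computed directly from the two coders' paired labels. The labels below are hypothetical and serve only to illustrate the calculation.

# Hypothetical labels from the two coders for the same set of generated
# sentences; "SRC" = subject-extracted, "ORC" = object-extracted.
coder1 = ["SRC", "SRC", "ORC", "SRC", "ORC", "SRC", "SRC", "ORC"]
coder2 = ["SRC", "SRC", "ORC", "SRC", "SRC", "SRC", "SRC", "ORC"]

matches = sum(a == b for a, b in zip(coder1, coder2))
agreement = matches / len(coder1)
print(f"percent agreement = {agreement:.0%}")   # the paper reports 98%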
Based on the completed coding data, we computed two variables for
each stimulus. First, we measured the extent to which the given incomplete
sentence fragment continued with subject relatives and object relatives,
respectively. This was achieved by computing the relative frequencies of
subject relatives and object relatives using the function (5) below. Second,
we estimated the extent to which the given incomplete sentence fragment
exhibited a bias in favor of subject-extracted relatives over object-extracted
relatives. This estimate, termed SRC bias, was computed for each stimulus
by subtracting the relative frequency of object relatives from that of subject
relatives. A relatively higher occurrence of SRCs indicates a stronger SRC
bias. We used these two measurements as our dependent variables. For
statistical tests of the relative clause frequency distribution, we conducted a
series of paired t-tests, using SPSS version 20.

(5) Relative frequency (SRC) = number of SRC occurrences / total occurrences of RCs¹

¹ To compute the relative frequency of ORCs, we replaced SRC occurrences with ORC occurrences.
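As a worked illustration of the two measures, the snippet below computes the relative frequencies in (5) and the SRC bias from hypothetical per-stimulus counts and then runs a paired t-test over stimuli. The counts are invented for demonstration, and scipy is used here in place of the SPSS procedure reported above.

from scipy.stats import ttest_rel

# Hypothetical hand-coded counts per stimulus: (SRC count, ORC count).
counts = {
    "The secretary that": (52, 7),
    "The tenant that":    (48, 10),
    "The realtor that":   (55, 4),
}

src_rel, orc_rel, src_bias = [], [], []
for src, orc in counts.values():
    total = src + orc                       # total classified relative clauses
    src_rel.append(src / total)             # relative frequency of SRCs, as in (5)
    orc_rel.append(orc / total)             # relative frequency of ORCs
    src_bias.append(src / total - orc / total)  # SRC bias = SRC minus ORC relative frequency

# Paired t-test over stimuli, comparing SRC and ORC relative frequencies
t_stat, p_value = ttest_rel(src_rel, orc_rel)
print(f"mean SRC bias = {sum(src_bias) / len(src_bias):.2f}")
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")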

3.3 Results
Figure 1a-d illustrate the means and standard errors of the relative
frequencies of both subject-extracted relative clauses and object-extracted
relative clauses generated by both GPT-2 and ChatGPT 3.5 using the stimuli
taken from the three previous studies. A simple visual examination clearly
reveals that subject-extracted relative clauses greatly outnumber object-
extracted relative clauses across all studies, regardless of the GPT versions.

[Figure 1 appears here, with panels (a) Roland et al. (2012), (b) Gordon et al. (2004), (c) Reali & Christiansen (2007), and (d) SRC bias.]

Figure 1. Frequency distributions of RC types generated by both GPT-2 and
ChatGPT 3.5 using head nouns extracted from the three previous studies. The
extent of SRC biases for each study is depicted in (d). Error bars represent
95% confidence intervals.

Table 2 presents the results of paired t-tests, confirming that the neural
language models generated subject-extracted relative clauses significantly
more frequently than object-extracted relative clauses. The significant
differences reported in Table 2 indicate that neural large language models
succeeded in simulating human producers, in line with Roland et al.'s (2007)
corpus counts. As shown in Figure 1d, the SRC bias was consistently
present in all studies, irrespective of the GPT versions. These results
supported our hypothesis, raised in Research question 1, suggesting that
the AI-generated corpus approximates human-generated corpora. Additionally,
our results indicated that the SRC bias was consistently much stronger
when using ChatGPT 3.5 than when using GPT-2.

Table 2. Paired t-tests results comparing relative clause frequencies across studies

                              GPT-2                                 ChatGPT 3.5
                              means (SD)   Statistics               means (SD)   Statistics
Roland et al. (2012)   ORCs   .31 (.24)    t(23) = -3.63, p = .001  .10 (.21)    t(23) = -9.54, p = .000
                       SRCs   .66 (.25)                             .90 (.21)
Gordon et al. (2004)   ORCs   .32 (.23)    t(23) = -3.13, p = .005  .02 (.02)    t(23) = -98.46, p = .000
                       SRCs   .63 (.27)                             .98 (.02)
Reali & Christiansen   ORCs   .28 (.21)    t(26) = -5.20, p = .000  .08 (.06)    t(26) = -36.17, p = .000
(2007)                 SRCs   .71 (.22)                             .92 (.06)

4. Study 2

The purpose of Study 2 was to examine whether neural language models


exhibit sensitivity to semantic features when generating English relative
clauses. We hypothesized that if GPT models effectively simulate human
language generation, the SRC bias would be modulated by the semantic
property of the modified head nouns. Furthermore, we sought to examine
potential differences in semantic sensitivity between different versions of

GPT models.

4.1 Method

4.1.1 Language Models


We used the same large language models used in Study 1.

4.1.2 Materials
A total of 56 stimuli were taken from Gennari and MacDonald (2008)
who conducted a sentence completion task with English native speakers.
Among these stimuli, 28 stimuli had animate head nouns, as in (6a),
whereas the remaining 28 stimuli had inanimate head nouns, as in (6b).
Recall that Gennari and MacDonald (2008) observed a strong bias
for subject relatives when head nouns were animate (i.e., 85% vs. 15%), but
this bias weakened when head NPs were inanimate (i.e., 65% vs. 35%).
As we did in Study 1, we prepared incomplete sentence prompts as in (6a-
b), by using an NP plus that, for neural large language models to generate
relative clauses. The stimuli used in this study are presented in Appendix A.

(6) a. The musician that …..


b. The book that …..

4.1.3 Procedure, Coding, and Analyses


We followed the same procedures as we did in Study 1. The methods for
coding and analyses were also identical to those used in Study 1. We did not
receive sentence generations from one item, The grenade that ~, because
ChatGPT 3.5 denied to process any unethical content.

4.2 Results
Figure 2a-b illustrate the means and standard errors of the relative

frequencies of both subject-extracted relative clauses and object-extracted


relative clauses that both GPT-2 and ChatGPT 3.5 generated using the
stimuli from Gennari and MacDonald (2008). When the head nouns were
animate (Figure 2a), subject-extracted relative clauses were much more
frequent than object-extracted relative clauses, regardless of the GPT model
versions. Crucially, however, when the head nouns were inanimate (Figure
2b), the relative frequency differences interacted with the language model
versions (see paired t-test results in Table 3). Significant differences between
subject relatives and object relatives completely disappeared in the outputs
generated by ChatGPT 3.5, with object relatives numerically more frequent
than subject relatives.

[Figure 2 appears here, with panels (a) animate NPs and (b) inanimate NPs.]

Figure 2. Frequency distributions of the types of relative clauses for
animate head NPs (a) and inanimate head NPs (b), generated by both
GPT-2 and ChatGPT 3.5 based on the stimuli extracted from Gennari and
MacDonald (2008). Error bars represent 95% confidence intervals.

[Figure 3 appears here, with panels (a) animate head NPs and (b) inanimate head NPs.]

Figure 3. The extent of SRC biases when head NPs are animate (a) and
inanimate (b). Error bars represent 95% confidence intervals.

Recall that a higher SRC bias indicates a stronger preference for SRCs
over ORCs. Figure 3 illustrates the extent to which the SRC bias is
modulated by the animacy of head nouns and the GPT version. It shows
that the SRC bias dropped below zero only for ChatGPT 3.5 when
inanimate head nouns were used. Statistical tests revealed no significant
difference in SRC bias between the GPT versions when head nouns were
animate (t(27) = .32, p = .754) but a significant difference when
head nouns were inanimate (t(26) = 5.01, p = .000). These results support
our hypothesis, as presented in Research question 2, suggesting that
AI-generated corpora approximate human-generated corpora in terms of
semantics. However, the successful simulation was observed only with
ChatGPT 3.5. Additionally, our results show that the bias shift toward object
relatives with inanimate head nouns was much more dramatic
for ChatGPT 3.5 than for human speakers.

Table 3. Results of paired t-tests between RC relative frequencies across the NP types

                             GPT-2                                  ChatGPT 3.5
                             means (SD)   Statistics                means (SD)   Statistics
Animate head NPs      ORCs   .19 (.20)    t(27) = -8.10, p = .000   .20 (.30)    t(27) = -5.31, p = .000
                      SRCs   .79 (.20)                              .80 (.30)
Inanimate head NPs    ORCs   .21 (.21)    t(27) = -7.15, p = .000   .56 (.35)    t(26) = .88, p = .386†
                      SRCs   .78 (.22)                              .44 (.35)

Note. † The degrees of freedom (df) for this comparison are one less than for the others: we ended up
with 26 pairs of comparisons because ChatGPT 3.5 refused to process one stimulus (The grenade that ~)
on ethical grounds.

5. General discussion

We aimed to simulate human producers in the production of English


relative clauses by using neural large language models, GPT-2 and ChatGPT
3.5. Human corpus studies have demonstrated that humans have a strong
tendency to favor subject relatives over object relatives, but the strong
bias for subject relatives is modulated by the semantic features of head
nouns. Across two studies, we tested whether AI-generated relative clauses
would successfully simulate human behaviors. We discuss our results by
answering each of the research questions raised above in turn.

Research question 1: Do large language models exhibit a bias in favor


of generating subject relatives over object relatives, resulting in the
asymmetric distribution of both types?

In Study 1, we found that when provided with prompts consisting of


an NP plus the relative pronoun that, large language models generated
subject relatives more frequently than object relatives, regardless of the
GPT versions. This means that given the sentence fragments, the language
models preferred verbs over nouns in the subsequent position (i.e., the first
word within relative clauses). These model outputs closely replicate the
frequency distributions of relative clauses observed in human-generated
corpora.
We have considered several potential accounts for this asymmetric bias
in language models. On one hand, the models might rely heavily on the
mechanism of exemplar-based learning. That is, language models might
have a kind of exemplar knowledge that specific combinations of head
nouns with that are more likely to continue with verbs instead of nouns.
For example, a particular trial starting with The lady that ~ might be
associated with verbs more than with any other parts of speech in the

models’ deep-learning representations. This particular association could


lead to a preference for subject relatives. On the other hand, the models
might develop abstract syntactic knowledge that is universally applicable to
specific tokens during pretraining, and they extend that abstract knowledge
into new (not-yet trained) trials during sentence production. That is, when
provided with a head noun together with a relative pronoun (e.g., the N
that), the models estimate that verbs, rather than nouns, are more likely to
follow in the subsequent position, leading to subject relatives. For example,
when the models encounter an input starting with The lady that ~ , the
models tend to produce verbs more than nouns at the next position because
the input satisfies the syntactic requirement for the N that.
However, we raise the question of whether the models’ preference for subject
relatives relies solely on abstract syntactic knowledge. Note that in terms
of semantics, all of the head nouns used in Study 1 were animate. If the
language models have indeed learned the significant role of semantic
featural knowledge, particularly the feature [+animate], during pretraining,
this semantic knowledge might also influence their production of relative
clauses. For example, when the models encounter an input like The lady that
~ , they may prioritize verbs over nouns at the next position because this
input satisfies the semantic requirement, the animate noun that. However,
due to the manipulation limitations of Study 1, we cannot conclusively
determine whether the language models’ performance exclusively reflects
their syntactic knowledge or if their semantic knowledge also plays a role.

Research question 2: Do large language models demonstrate sensitivity


to the semantic features of head nouns when generating relative clauses?

In Study 2, using the same set of materials employed by Gennari and


MacDonald (2008), we replicated the results observed in Study 1 when head
nouns were animate. Regardless of the GPT versions, subject relatives were

generated more frequently than object relatives. However, when head nouns
were inanimate, ChatGPT 3.5 generated numerically more object relatives
than subject relatives, while GPT-2 did not exhibit this shift in preference.
These results suggest that when provided with sentence fragments
beginning with ‘the + inanimate noun + that,’ ChatGPT’s computations
estimate that nouns, rather than verbs, are more probable for the following
position (i.e., the first word within relative clauses). Conversely, in cases
where the sentence fragments began with ‘the + animate noun + that,’ both
language models showed a preference for verbs over nouns in the next
position. Notably, these ChatGPT outputs closely replicate the frequency
distributions of relative clauses observed in human-generated corpora.
We have considered several accounts for the bias observed in
the language models. As discussed previously, the models may depend on
exemplar-based learning, where specific trials, such as The book that ~, are
associated with a higher likelihood of continuing with nouns, while others,
like The lady that ~, favor verbs. It is also possible that ChatGPT 3.5 has
developed abstract semantic knowledge (e.g., [+/- animate]) during pretraining
and applies this knowledge during sentence production. For example,
inanimate NPs like the book or the wine can be associated with a patient
role occurring at an object position (Kako, 2006), leading to a preference
for object-extracted relatives. On the contrary, animate NPs like the lady
may be linked to an agent role occurring at a subject position (Kako, 2006),
resulting in a preference for subject-extracted relatives. Importantly, the
performance of ChatGPT in Study 2 cannot be accounted for only by the
syntactic approach.
As a final note, close examination of ChatGPT’s outputs revealed that
not all inanimate head nouns generated a bias for object relatives. For
example, inanimate head nouns like the accident that or the incident that
continued more frequently with subject relatives than object relatives.
The presence of the [-animate] feature alone may not be sufficient to

uniquely contribute to the increased occurrence of object relatives. Some


may argue that additional semantic features, such as [+concrete], might be
necessary to assign a patient role to inanimate head nouns. For instance,
the book (that) and the wine (that), which are [-animate] and [+concrete],
exhibit a bias toward object relatives, whereas the accident (that) and
the incident (that), which are [-animate] and [-concrete], do not exhibit the
same bias. We think our discussion leaves a door open for future studies.
Further research with finer-grained semantic manipulations is needed for a
comprehensive understanding of this phenomenon.

Research question 3: Do GPT models exhibit variations in their outputs


across different versions when simulating human language production?

We observed both quantitative and qualitative differences between the


two GPT models. In Study 1, both GPT-2 and ChatGPT 3.5 showed a strong
preference for subject relatives, but the SRC bias was notably larger in
ChatGPT 3.5 than in GPT-2. In Study 2, when head nouns were animate,
there were no differences across the models. However, when dealing with
inanimate head nouns, only ChatGPT 3.5 generated sentence outputs that
seemed to reflect semantic sensitivity. Consequently, the shift in the SRC
bias was exclusively observed in ChatGPT 3.5, not in GPT-2. Overall,
ChatGPT 3.5, being a more advanced large language model, demonstrated a
better simulation of human performance in sentence production than GPT-
2. Moreover, the distributional patterns of AI-generated relative clauses
could prove effective in predicting how human readers process relative
clauses, aligning well with the claim of the production-comprehension link
(MacDonald, 2013).

Our present study has a couple of limitations (we thank the anonymous
reviewers for raising these points). First, given that ChatGPT is
fundamentally trained on human language, it should come as no surprise
that AIs and humans exhibit similar behaviors to some
extent. Presumably, it is more interesting to observe instances where the two
language processing systems diverge from one another, even if only subtly.
As a next step, we may focus on exploring the effect of fine-
grained semantic factors, such as agentivity, on the linguistic faculty of both
AIs and humans in future studies. Second, our study presented a simple
contrast between subject and object relative clauses, similar to many other
related studies. However, other types of relative clauses exist such as those
involving the extraction of indirect objects or obliques. Kim and O’Grady
(2016) demonstrated the effect of syntactic hierarchy in generating relative
clauses, suggesting a preference for relative clauses when head nouns are
extracted from structurally higher positions. For future studies, we will
broaden our perspectives by extending our investigations to include indirect
object and oblique relative clauses, comparing them to subject and direct
object relative clauses.

6. Conclusion

We aimed to examine the extent to which large language models can


simulate human language production, particularly in the context of
generating English relative clauses for both animate and inanimate head
nouns. We observed that ChatGPT 3.5 generates outputs that closely resemble
human-generated sentences. This seems to be due to the model’s ability
to incorporate semantic featural information associated with head nouns.
However, we are aware of the warnings that machine algorithms, often
considered ‘black boxes,’ may not necessarily follow the same logic as
human minds (Howard, Chouikhi, Adeel, Dial, Howard, & Hussain, 2020).
Further studies are necessary to understand the nature of the semantic
knowledge encoded in these models’ black boxes.

Frequency-based psycholinguistic theories have accounted for the


processing asymmetry of English relative clauses. This asymmetry is
often attributed to the statistical bias favoring subject-extracted relative
clauses over object-extracted relative ones, as evidenced by frequency
counts derived from human-generated corpora. However, existing corpus
studies face inherent limitations, such as the potential scarcity of identical
stimuli occurrences in actual corpora. In contrast, AI-generated corpora
offer a notable advantage: they can provide frequency data flexibly and
reliably for any desired set of stimuli. Our study suggests that, when using identical testing stimuli, AI
language models can construct more accurate approximations of frequency
distributions.

Acknowledgement We would like to extend our heartfelt appreciation to the anonymous


peer reviewers who generously dedicated their time and expertise to review and provide
feedback on this research paper. We would also like to acknowledge Minjoo Jung of Jeju
National University and the NLP lab members of Korea University for their invaluable
contributions to the data collection process.

Funding This research was supported by the 2023 scientific promotion program funded
by Jeju National University.

Declarations

Ethics Approval This study is exempt from ethics approval as it did not involve human
participants in obtaining results.

Consent to Participate and Consent for Publication Not applicable

Conflict of Interest The authors declare that they have no competing interests.

References

Abdullah, M., Madain, A., & Jararweh, Y. (2022). ChatGPT: fundamentals,


applications and social impacts. Proceedings of the 2022 Ninth
International Conference on Social Networks Analysis, Management and
Security (SNAMS), 1–8.
Altmann, G. T. M., & Kamide, Y. (1999). Incremental interpretation at verbs:
Restricting the domain of subsequent reference. Cognition, 73(3), 247–264.
https://doi.org/10.1016/S0010-0277(99)00059-1
Arisoy, E., Sainath, T., Kingsbury, B, & Ramabhadran, B. (2012). Deep Neural
Network Language Models. Proceedings of the NAACL-HLT 2012
Workshop: Will We Ever Really Replace the N-gram Model? On the Future
of Language Modeling for HLT, 20–28.
Bernardy, J-P., & Lappin, S. (2017). Using Deep Neural Networks to Learn
Syntactic Agreement. Linguistic Issues in Language Technology, 15 (2),
1-15.
Bever, T. G. (1970). The Cognitive Basis for Linguistic Structures. In J. R. Hayes
(Ed.), Cognition and the Development of Language, 279-362. New York:
John Wiley.
Bock, J. K. (1986). Syntactic persistence in language production. Cognitive
Psychology, 18(3), 355–387. https://doi.org/10.1016/0010-0285(86)90004-6
Cai, Z. G., Haslett, D. A., Duan, X., Wang, S., & Pickering, M. J. (2023). Does
ChatGPT resemble humans in language use? ArXiv. /abs/2303.08014.
Chomsky, N., Roberts, I., & Watumull, J. (2023). The false promise of ChatGPT.
The New York Times (8th of March, 2023).
Devlin, J., Chang, M-W, Lee, K., & Toutanova, K.. (2019). BERT: Pre-training of
deep bidirectional transformers for language understanding. Proceedings of
the 2019 conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies, Vol. 1, 4171–
4186.
Enguehard, É., Goldberg., Y., & Linzen, T. (2017). Exploring the Syntactic
Abilities of RNNs with Multi-task Learning. Proceedings of the 21st
Conference on Computational Natural Language Learning (CoNLL 2017),
3–14.

Ettinger, A., Elgohary, A., & Resnik, P. (2016). Probing for semantic evidence of
composition by means of simple classification tasks. Proceedings of the 1st
Workshop on Evaluating Vector-space Representations for NLP, 134–139.
Filippova, K., Alfonseca, E., Colmenares, C. A., Kaiser, L., & Vinyals, O. (2015).
Sentence compression by deletion with LSTMs. Proceedings of the 2015
Conference on Empirical Methods in Natural Language Processing, 360–
368.
Ford, M. (1983). A method for obtaining measures of local parsing complexity
throughout sentences. Journal of Verbal Learning and Verbal Behavior, 22,
203–218.
Fox, B. A., & Thompson, S. A. (1990). A Discourse Explanation of the Grammar
of Relative Clauses in English Conversation. Language, 66, 297-316.
Frank, S. L., & Hoeks, J. (2019). The Interaction Between Structure and Meaning
in Sentence Comprehension: Recurrent Neural Networks and Reading
Times. Proceedings of the 2019 Cognitive Science Society, 337-343.
Futrell, R., Wilcox, E., Morita, T., & Levy, R. (2018). RNNs as psycholinguistic
subjects: Syntactic state and grammatical dependency. arXiv:1809.01329
Futrell, R., & Levy, R. (2019). Do RNNs learn human-like abstract word
order preferences? Proceedings of the 2019 Society for Computation in
Linguistics (SCiL), 50–59.
Gennari, S. P., & MacDonald, M. C. (2008). Semantic indeterminacy in object
relative clauses. Journal of Memory and Language, 58 (2), 161-187.
Gibson, E., Bergen, L., & Piantadosi, S. T. (2013). Rational integration of noisy
evidence and prior semantic expectations in sentence interpretation.
Proceedings of the National Academy of Sciences, 110(20), 8051-8056.
Gordon, P., Hendrick, R., & Johnson, M. (2001). Memory interference during
language processing. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 27(6), 1411–1423.
Grodner, D., & Gibson, E. (2005). Consequences of the Serial Nature of
Linguistic Input for Sentenial Complexity. Cognitive Science, 29 (2), 261-
290.
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., & Baroni, M. (2018).
Colorless green recurrent networks dream hierarchically. Proceedings of
the 2018 conference of the north American chapter of the Association for
Computational Linguistics: Human Language Technologies, 1, 1195–1205.

Hale, J. (2001). A probabilistic Earley parser as a psycholinguistic model.


Proceedings of the second meeting of the North American Chapter of
the Association for Computational Linguistics on Language Technologies
(NAACL-HLT), 1–8.
Heider, P. M., Dery, J. E., & Roland, D. (2014). The processing of it object
relative clauses: Evidence against a fine-grained frequency account. Journal
of Memory and Language, 75, 58-76.
Howard, N., Chouikhi, N., Adeel, A., Dial, K., Howard, A., & Hussain, A. (2020).
BrainOS: A Novel Artificial Brain-Alike Automatic Machine Learning
Framework. Frontiers in Computational Neuroscience, 14, Article 16.
Kako, E. (2006). The semantics of syntactic frames. Language and Cognitive
Processes, 21(5), 562-575.
Kim, C. E., & O'Grady, W. (2016). Asymmetries in children's production of
relative clauses: data from English and Korean. Journal of Child Language,
43(5), 1038–1071.
King, M. R., & ChatGPT. (2023). A Conversation on Artificial Intelligence,
Chatbots, and Plagiarism in Higher Education. Cellular and Molecular
Bioengineering, 16, 1–2.
King, J., & Just, M. A.. (1991). Individual differences in syntactic processing:
The role of working memory. Journal of Memory and Language, 30(5),
580–602.
Levy, R. (2008). Expectation-based syntactic comprehension. Cognition, 106,
1126–1177.
Liebrenz, M., Schleifer, R., Buadze, A., Bhugra, D., & Smith, A. (2023).
Generating scholarly content with ChatGPT: ethical challenges for medical
publishing. The Lancet Digital Health, 5(3), e105–e106.
Linzen, T., Dupoux, E., & Goldberg, Y. (2016). Assessing the ability of LSTMs
to learn syntax-sensitive dependencies. Transactions of the Association
Computational Linguists, 4, 521–535.
MacDonald, M. C. (2013). How language production shapes language form and
comprehension. Frontiers in Psychology, 4, Article 226.
MacDonald, M. C., & Christiansen, M. H. (2002). Reassessing working memory:
Comment on Just and Carpenter (1992) and Waters and Caplan (1996).
Psychological Review, 109(1), 35–54.

Mikolov, T., Karafiát, M., Burget, L, Černocký, J. H., & Khudanpur, S. (2010).
Recurrent neural network based language model. Interspeech, 1045-1048.
Mitchell, D. C., Cuetos, F., Corley, M. M. B., & Brysbaert, M. (1995). Exposure-based models
of human parsing: Evidence for the use of coarse-grained (nonlexical)
statistical records. Journal of Psycholinguistic Research, 24, 469–488.
OpenAI. (2023). ChatGPT (Mar 14 version) [Large language model]. https://chat.
openai.com/chat
Pickering, M. J., & Branigan, H. P. (1998). The representation of verbs: Evidence
from syntactic priming in language production. Journal of Memory and
Language, 39(4), 633–651.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019).
Language models are unsupervised multitask learners. OpenAI tech report.
Reali, F., & Christiansen, M. H. (2007). Processing of relative clauses is made
easier by frequency of occurrence. Journal of Memory and Language, 57
(1), 1-23.
Roland, D., Mauner, G., & Hirose, Y. (2021). The processing of pronominal
relative clauses: Evidence from eye movements. Journal of Memory and
Language, 119, 104244.
Roland, D., Dick, F., & Elman, J. L. (2007). Frequency of basic English
grammatical structures: A corpus analysis. Journal of Memory and Language,
57(3), 348-379.
Roland, D., Mauner, G., O’Meara, C., & Yun, H. (2012). Discourse expectations
and relative clause processing. Journal of Memory and Language, 66 (3),
479-508.
Rush, A., M., Chopra, S., & Weston, J. (2015). A Neural Attention Model for
Abstractive Sentence Summarization. Proceedings of the 2015 Conference
on Empirical Methods in Natural Language Processing, 379–389.
Sallam, M. (2023). ChatGPT utility in health care education, research, and
practice: systematic review on the promising perspectives and valid
concerns. Healthcare, 11 (6), 887, 1-20.
Schneider, E.T.R, de Souza, J. V. A., Gumiel, Y. B., Moro, C., & Paraiso, E.
C. (2021). A GPT-2 language model for biomedical texts in Portuguese,
Proceedings of the 2021 IEEE 34th International Symposium on Computer-
Based Medical Systems (CBMS), 474–479.

Schwenk, J., Harmel, N., Brechet, A., Zolles, G., Berkefeld, H., Müller, C. S.,
Bildl, W., Baehrens, D., Hüber, B., Kulik, A., Klöcker, N., Schulte, U., &
Fakler, B. (2012). High-resolution proteomics unravel architecture and
molecular diversity of native AMPA receptor complexes. Neuron, 74(4),
621–633.
Shin, U., Yi, E., & Song, S. (2023). Investigating a neural language model’s
replicability of psycholinguistic experiments: A case study of NPI licensing.
Frontiers in Psychology, 14, 937656.
Shrivastava, A., Pupale, R., & Singh, P. (2021). Enhancing aggression detection
using a GPT-2 based data balancing technique. Proceedings of the 2021 5th
International Conference on Intelligent Computing and Control Systems
(ICICCS), 1345–1350.
Traxler, M. J., Morris, R., K., & Seely, R. E. (2002). Processing subject and object
relative clauses: Evidence from eye movements. Journal of Memory and
Language, 47(1), 69–90.
Thorp, H. H. (2023). ChatGPT is fun, but not an author. Science, 379(6630), 313.
Van Schijndel, M., & Linzen, T. (2018). Modeling garden path effects without
explicit hierarchical syntax. Proceedings of the 40th Annual Conference of
the Cognitive Science Society, 2600–2605.
Wang, F. Y., Miao, Q., Li, X., Wang, X., & Lin, Y. (2023). What does chatGPT
say: the DAO from algorithmic intelligence to linguistic intelligence, IEEE/
CAA Journal of Automatica Sinica, 10 (3), 575–579.
Warstadt, A., & Bowman, S. R. (2020). Can neural networks acquire a structural
bias from raw linguistic data? arXiv:2007.06761.
Wilcox, E., Levy, R., Morita, T., & Futrell, R. (2018). What do RNN language
models learn about filler–gap dependencies? Proceedings of the 2018
EMNLP workshop BlackboxNLP: Analyzing and interpreting neural
networks for NLP, 211–221.
Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z, & Duan, N. (2023). Visual
ChatGPT: Talking, Drawing and Editing with Visual Foundation Models.
arXiv:2303.04671.
Yi, E., Cho, H., & Song, S. (2022). An experimental investigation of discourse
expectations in neural language models. Korean Journal of English
Language and Linguistics, 22, 1101-1115.

Yun, H., & Yi, E. (2019). The role of frequency in the processing of giving and
receiving events. Language Research, 55(2), 253-279.

Appendix A. List of experimental stimuli used in Study 1 & Study 2


Item 1-24 from Roland et al. (2012); Item 25-48 from Gordon et al. (2004);
Item 49-76 from Reali & Christiansen (2007); Item 77-132 from Gennari &
MacDonald (2008)
1. The secretary that 34. The poet that 67. The detective that 100. The spy that
2. The tenant that 35. The chef that 68. The consultant that 101. The journalist that
3. The realtor that 36. The aunt that 69. The students that 102. The minister that
4. The customer that 37. The violinist that 70. The guy that 103. The woman that
5. The professor that 38. The teacher that 71. The manager that 104. The dieter that
6. The producer that 39. The editor that 72. The landlord that 105. The accident that
7. The director that 40. The tailor that 73. The woman that 106. The prize that
8. The photographer that 41. The admiral that 74. The girl that 107. The grenade that
9. The agent that 42. The coach that 75. The executive that 108. The book that
10. The dancer that 43. The lawyer that 76. The agency that 109. The movie that
11. The lawyer that 44. The plumber that 77. The musician that 110. The school that
12. The doctor that 45. The salesman that 78. The contestant that 111. The play that
13. The architect that 46. The clown that 79. The soldier that 112. The incident that
14. The event planner that 47. The clerk that 80. The scientist that 113. The wrench that
15. The flight attendant that 48. The gardener that 81. The director that 114. The loan that
16. The musician that 49. The lady that 82. The student that 115. The trial that
17. The detective that 50. The teacher that 83. The teacher that 116. The notes that
18. The consultant that 51. The neighbor that 84. The employee that 117. The story that
19. The football player that 52. The clerk that 85. The plumber that 118. The game that
20. The florist that 53. The landlord that 86. The banker that 119. The product that
21. The policeman that 54. The woman that 87. The lawyer that 120. The fire that
22. The artist that 55. The businessman that 88. The psychologist that 121. The lure that
23. The waitress that 56. The girl that 89. The child that 122. The tractor that
24. The teacher that 57. The professor that 90. The golfer that 123. The plants that
25. The banker that 58. The boy that 91. The salesman that 124. The plane that
26. The dancer that 59. The dancer that 92. The fireman that 125. The wine that
27. The architect that 60. The director that 93. The fish that 126. The play that
28. The waiter that 61. The guy that 94. The farmer that 127. The instrument that
29. The detective that 62. The salesman that 95. The gardener that 128. The message that
30. The judge that 63. The lady that 96. The pilot that 129. The article that
31. The robber that 64. The professor that 97. The executive that 130. The meal that
32. The governor that 65. The person that 98. The actor that 131. The jewelry that
33. The actor that 66. The teacher that 99. The student that 132. The dessert that
