Why large language models are poor theories of human linguistic cognition. A reply to Piantadosi (2023)
Roni Katzir, Tel Aviv University
March 17, 2023
Linguistic biases and representations. When the training corpus is big enough —
and the training corpora of modern LLMs are huge (often running into the hundreds
of billions of tokens in current models) — the learning method of LLMs allows them
to accumulate a large amount of knowledge. Still, even after exposure to training
data that are orders of magnitude bigger than the linguistic data that humans hear or
read in a whole lifetime, LLMs fail to acquire aspects of syntax that any healthy
12-year-old child has mastered. In the following exchange, for example, the (1)
sentences are acceptable English sentences while the (2) sentences are not, but
ChatGPT thinks that the opposite is true:
[ChatGPT exchange not reproduced here.]
1 Obviously there is nothing wrong with considering unusual theories. While scientific work typically focuses on just a handful of theories at a time — those that seem the most promising at the moment — it can be a productive exercise to occasionally examine alternative hypotheses that don’t seem as promising. But as engineering products developed without cognitive plausibility as a goal, LLMs are very unpromising a priori as models of human cognition, as mentioned above, and with on the order of 10¹¹ parameters in current models, they are not very likely to be particularly insightful, either. (At this level of complexity one might simply use a living child as a model.) For the purposes of the present discussion, though, I will set aside such considerations and focus on standard empirical tests of adequacy, showing that LLMs fail on all of them.
In Lan, Chemla, & Katzir (2022) we checked how four different LLMs handle
examples such as the one in the exchange above. All four failed: none learned the
constraint (from Ross, 1967) that says that if you leave a gap in one conjunct (“...
Mary met __ yesterday”) you need to leave a gap in the other conjunct (“will talk to
__ tomorrow” and not “talk to you tomorrow”) as well. Presumably there are too few
examples of this kind in the data to support the learning of this constraint by a
learner that is not suitably biased (see Pearl & Sprouse 2013 for discussion of the
frequency of relevant data in corpora). Children do arrive at the relevant constraint,
which shows that they are suitably biased and that LLMs are poor theories of human
linguistic cognition.2 A similar point could be made with any number of textbook
linguistic phenomena.
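The kind of comparison involved can be sketched in a few lines. The following is a simplified illustration, not the procedure or the test items of Lan, Chemla, & Katzir (2022): it asks an off-the-shelf autoregressive model (GPT-2, via the Hugging Face transformers library) which member of a minimal pair of this sort it assigns the higher probability to. The sentences are illustrative, and a real evaluation would control for length and lexical content.

    # Sketch: compare the total log-probability a pretrained autoregressive LM
    # assigns to the grammatical and ungrammatical member of a minimal pair.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def total_logprob(sentence: str) -> float:
        """Sum of log-probabilities of the sentence's tokens (first token excluded)."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
        return -loss.item() * (ids.shape[1] - 1)

    # Gaps in both conjuncts (grammatical) vs. a gap in only one conjunct (ungrammatical).
    grammatical = "This is the man who Mary met yesterday and will talk to tomorrow."
    ungrammatical = "This is the man who Mary met yesterday and will talk to you tomorrow."

    # A learner with the relevant bias should robustly prefer the first sentence;
    # Lan, Chemla, & Katzir (2022) report that the models they tested do not.
    print(total_logprob(grammatical) > total_logprob(ungrammatical))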
2 This is an instance of the argument from the poverty of the stimulus, a central form of argument in linguistics ever since Chomsky (1971, 1975). Piantadosi claims that LLMs eliminate the argument from the poverty of the stimulus (p. 19), but as shown by the failure of LLMs to acquire constraints such as the one just mentioned, the argument still stands.
Typology. Innate biases express themselves not just in the knowledge attained by
children but also in typological tendencies. For example, many languages have the
same constraint as English does with respect to gaps in conjunction, and there are
no known languages that exhibit the opposite constraint (allowing for a gap in a
single conjunct but not in both). Many other universals have been studied by
linguists. For example, no known language allows for an adjunct question (“how” or
“why”) to ask about something that has taken place inside a relative clause. So, for
example, “How did you know the man who broke the window?” cannot be answered
with “with a hammer,” and “Why did you know the man who broke the window?”
cannot be answered with “because he was angry”. The same holds for all known
languages in which this has been tested. A universal of a different kind is that
phonological processes are always regular: they can be captured with a finite-state
device that has no access to working memory. So, for example, while there are
many unbounded phonological patterns, none are palindrome-like (see Heinz &
Idsardi 2013). In syntax, meanwhile, roughly the opposite holds: dependencies are
overwhelmingly hierarchical, while linear processes are rare or nonexistent. Another
universal, in yet another domain, is that nominal quantifiers (“every”, “some”, “most”,
etc.) are always conservative: when they take two arguments, they only care about
individuals that satisfy their first argument (Barwise & Cooper 1981, Keenan & Stavi
1986). For example, no language has a quantifier “gleeb” such that “Gleeb boys
smoke” is true exactly when there are more boys than smokers (a meaning that
would care about smokers that are not boys). Note that there is nothing inherently
strange about such a meaning. The verb “outnumbers” expresses exactly this
notion. It’s just that quantifiers seem to never have such meanings.
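Conservativity is easy to state formally: a quantifier Q is conservative iff Q(A, B) holds exactly when Q(A, A ∩ B) does. The following toy script (my own illustration; “gleeb” is the hypothetical determiner just discussed) checks the property by brute force over a small universe.

    # Check conservativity by brute force: Q is conservative iff
    # Q(A, B) == Q(A, A & B) for all subsets A, B of the universe.
    from itertools import chain, combinations

    def subsets(universe):
        """All subsets of the universe, as Python sets."""
        xs = list(universe)
        return [set(c) for c in chain.from_iterable(
            combinations(xs, r) for r in range(len(xs) + 1))]

    # Standard determiner meanings (A = first argument, B = second argument).
    every = lambda A, B: A <= B                    # "every A is B"
    some  = lambda A, B: len(A & B) > 0            # "some A is B"
    most  = lambda A, B: len(A & B) > len(A - B)   # "most A are B"
    # Hypothetical non-conservative "gleeb": "gleeb A are B" iff |A| > |B|.
    gleeb = lambda A, B: len(A) > len(B)

    def conservative(Q, universe=range(4)):
        return all(Q(A, B) == Q(A, A & B)
                   for A in subsets(universe) for B in subsets(universe))

    for name, Q in [("every", every), ("some", some), ("most", most), ("gleeb", gleeb)]:
        print(name, conservative(Q))   # every/some/most: True; gleeb: False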
Universals of this kind are among the central reasons that linguists argue for the innate components they propose, components that look nothing like LLMs. Piantadosi dismisses
universals, referring to Evans & Levinson (2009). But Evans & Levinson do not
address the universals just mentioned or those like them, and neither do the many
other papers that Piantadosi lists (p. 25). LLMs are not biased in a way that would
lead to these universals, and in the absence of other explanations for why these
universals should arise from an unbiased learner, LLMs remain a deeply implausible
model for human linguistic cognition.
Competence and performance. Human speakers distinguish between their knowledge of their language (their linguistic competence) and the way this knowledge is put to use in real time (their performance; Chomsky 1965, Miller & Chomsky 1963). LLMs make no such distinction, which is generally unproblematic given
their use as an engineering tool. Piantadosi, however, seems to find the lack of a
distinction between competence and performance unproblematic even in their use
as scientific models of human cognition. This is surprising. When humans encounter
a sentence with center embedding they often make mistakes, but this is not because
they don’t know how center embedding works. Rather, it is because of their struggle
to harness real-time processing resources to handle the sentence under
consideration. This struggle, which varies with available resources and conditions
(noise, sleep, etc.) is itself evidence for the distinction between competence and
performance. And given more time, human speakers often succeed where at first
they failed. This is part of what a theory of linguistic cognition needs to account for,
but in the absence of a notion of performance, LLMs are singularly unsuitable to model this aspect of linguistic cognition:3
[ChatGPT exchange not reproduced here.]
3 Gulordava et al. (2018) show that LLMs can learn non-local agreement and that they make agreement errors that are in some ways similar to those of human speakers. The competence-performance distinction helps illustrate that these errors are not, in fact, human-like: when humans make an agreement error and are given a chance to reread their sentence, they will often correct it; for an LLM, the error is part of its competence and will not be corrected. This is not a problem for Gulordava et al., who avoid making claims about LLMs as linguistic theories, but it is a problem for Piantadosi’s view.
Likely vs. grammatical. LLMs also lack a distinction between likelihood and
grammaticality. The two notions often overlap, but they are conceptually distinct:
some things are unlikely but correct, and others are likely but incorrect. Human
speakers can tease these notions apart. LLMs cannot: any attempt they might make
to judge goodness is based on likelihood.4 This means that they will generally prefer
an ungrammatical but somewhat likely continuation over a grammatical but unlikely
one:
[ChatGPT exchange not reproduced here.]
4 Piantadosi suggests that probabilities are a good thing in a linguistic model because of compression (pp. 16–17). But he conflates two possible roles for probabilities: as part of linguistic knowledge and as part of the learning model. The generative approach, starting with the remarks from Chomsky 1957 that Piantadosi quotes, has rejected a role for probabilities within the grammar. (This rejection is based on empirical considerations rather than on any hostility to probabilities: if probabilities could be part of the grammar, one would expect to occasionally find grammatical processes that are sensitive to probabilities; such processes have not yet been found, at least in syntax and semantics, so it makes methodological sense to prevent them from being stated within the grammar in the first place.) But non-probabilistic grammars can still be learned in a probabilistic, compression-based framework. In fact, some recent generative work argues that this is exactly the correct approach to learning (Katzir 2014, Rasin & Katzir 2020).
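The two-part criterion that this compression-based view relies on has a simple shape, sketched schematically below (my gloss, not code from Katzir 2014 or Rasin & Katzir 2020): the learner prefers the grammar G that minimizes |G| + |D:G|, the bits needed to encode the grammar plus the bits needed to encode the data with its help. The grammar itself can be fully categorical; the candidate names and bit counts in the usage line are made up for illustration.

    def mdl_score(grammar_bits: float, data_bits_given_grammar: float) -> float:
        """Two-part description length, in bits: |G| + |D:G|."""
        return grammar_bits + data_bits_given_grammar

    def select_grammar(candidates):
        """candidates: iterable of (grammar, |G| in bits, |D:G| in bits) triples."""
        return min(candidates, key=lambda c: mdl_score(c[1], c[2]))

    # A compact rule system beats rote memorization once it compresses the data enough.
    print(select_grammar([("memorize the corpus", 10, 100_000),
                          ("compact rule system", 300, 4_000)]))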
As might be expected, ChatGPT chose the ungrammatical but frequent word “are”
as the continuation instead of the grammatical (but perhaps unlikely) continuation
“destroys”. The distinction between likely and grammatical, which all humans have,
is entirely foreign to ChatGPT and its fellow LLMs.
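The mechanics are easy to see in the raw numbers. The sketch below (illustrative prefix and model, not the exchange above) asks an autoregressive LM how probable each candidate continuation is after a singular head noun followed by a plural attractor; the resulting ranking is the only notion of goodness such a model has.

    # Sketch: rank candidate continuations of a prefix by model probability.
    # A likelihood-based ranker can prefer a frequent but ungrammatical continuation
    # ("are") over a grammatical but unlikely one ("destroys").
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def continuation_logprob(prefix: str, continuation: str) -> float:
        """Log-probability of the continuation's tokens given the prefix."""
        full = tokenizer(prefix + continuation, return_tensors="pt").input_ids
        n_prefix = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
        with torch.no_grad():
            logprobs = torch.log_softmax(model(full).logits[0], dim=-1)
        # The token at position i is predicted by the logits at position i - 1.
        return sum(logprobs[i - 1, full[0, i]].item()
                   for i in range(n_prefix, full.shape[1]))

    prefix = "The key to the cabinets"   # singular head noun, plural attractor
    for word in [" are", " is", " destroys"]:
        print(repr(word), continuation_logprob(prefix, word))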
Generalization. Taking a step back from the linguistic behavior of LLMs, let us look
at how such models generalize more broadly. Piantadosi likens LLMs to automated
scientists, or linguists (p. 19). But consider the following exchange with ChatGPT:
The input sequence quite clearly corresponds to the pattern aⁿbⁿcⁿ. That, at least, is
what humans would say. ChatGPT suggests one string corresponding to this pattern
(#2 in the first response) and another string (#3) that matches a slightly broader
generalization, one allowing for further letters. But string #1 is entirely unrelated, and
the following response is even stranger. Clearly ChatGPT has no idea what is going
on in the input sequence. And it is not alone in this. In Lan, Geyer, Chemla, & Katzir
(2022) we looked at several current neural networks and found that they performed
poorly on a range of patterns similar to the one in the exchange above. One central
factor behind this failure is the learning method: LLMs are trained using
backpropagation, which pushes the network in very un-scientist-like directions and
prevents it from generalizing and reasoning about inputs in anything even remotely
similar to how humans generalize. When instead of backpropagation we trained
networks using Minimum Description Length (MDL) — a learning criterion that does
correspond to rational scientific reasoning — the networks were able to find perfect
solutions to patterns that remained outside the reach of networks trained with
backpropagation. Could there eventually be future LLMs (MDL-based or otherwise)
that would strike us as scientist-like? Perhaps. But if such models do arrive they will
be very different from current LLMs. In the meantime, any suggestion that LLMs are
automated scientists should be treated with suspicion.
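For reference, the generalization that humans extract from such a sequence can be stated in a few lines (my illustration; n ≥ 1 assumed):

    import re

    def is_anbncn(s: str) -> bool:
        """True iff s has the form a^n b^n c^n for some n >= 1 (e.g. 'aabbcc')."""
        m = re.fullmatch(r"(a+)(b+)(c+)", s)
        return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))

    print([is_anbncn(s) for s in ["abc", "aabbcc", "aabbc", "abcabc"]])
    # [True, True, False, False]

Recognizing the pattern requires keeping count across the string; patterns of this sort are among those that, as noted above, the MDL-trained networks solved perfectly while networks trained with backpropagation did not.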
References
Baroni, M. (2022). On the proper role of linguistically oriented deep net analysis in
linguistic theorising. In Bernardy, J.-P. and Lappin, S., editors, Algebraic Structures in
Natural Language, pages 1–16. CRC Press.
Bender, E. M., Gebru, T., McMillan-Major, A., and Shmitchell, S. (2021). On the
Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜5 In
Proceedings of the 2021 ACM Conference on Fairness, Accountability, and
Transparency, pages 610–623.
5 Note, however, that unlike LLMs, actual parrots are, in fact, intelligent.
Chomsky, N. (1965). Aspects of the Theory of Syntax. MIT Press, Cambridge, MA.
Gulordava, K., Bojanowski, P., Grave, E., Linzen, T., and Baroni, M. (2018).
Colorless green recurrent networks dream hierarchically. In Proceedings of NAACL
2018, pages 1195–1205.
Heinz, J. and Idsardi, W. (2013). What complexity differences reveal about domains
in language. Topics in Cognitive Science, 5(1):111–131.
Lakretz, Y., Hupkes, D., Vergallito, A., Marelli, M., Baroni, M., and Dehaene, S.
(2021). Mechanisms for handling nested dependencies in neural-network language
models and humans. Cognition, 213:104699. Special Issue in Honour of Jacques
Mehler, Cognition’s founding editor.
Lan, N., Chemla, E., and Katzir, R. (2022). Large language models and the
argument from the poverty of the stimulus. Under review, available at
https://ling.auf.net/lingbuzz/006829.
Lan, N., Geyer, M., Chemla, E., and Katzir, R. (2022). Minimum description length
recurrent neural networks. Transactions of the Association for Computational
Linguistics, 10:785–799.
Miller, G. and Chomsky, N. (1963). Finitary models of language users. In Luce, R. D.,
Bush, R. R., and Galanter, E., editors, Handbook of Mathematical Psychology,
volume 2, pages 419–491, New York, NY. Wiley.
Pearl, L. and Sprouse, J. (2013). Syntactic islands and learning biases: Combining
experimental syntax and computational modeling to investigate the language
acquisition problem. Language Acquisition, 20(1):23–68.
Warstadt, A., Singh, A., and Bowman, S. R. (2019). Neural Network Acceptability
Judgments. Transactions of the Association for Computational Linguistics,
7:625–641.
Wilcox, E., Futrell, R., and Levy, R. (2022). Using computational models to test
syntactic learnability. Linguistic Inquiry. https://doi.org/10.1162/ling_a_00491