Modern Language Models Refute Chomsky's Approach To Language
Steven T. Piantadosi
Introduction
After decades of privilege and prominence in linguistics, Noam Chomsky’s ap-
proach to the science of language is experiencing a remarkable downfall. The
story is, in part, a cautionary tale about what happens when an academic field
isolates itself from what should be complementary endeavours. Chomsky’s ap-
proach and methods are often argued to be problematic (e.g. Harris 1993, Pullum
1989, Behme 2012, Postal 2012, Behme 2014), but it has yet to be widely recognized
just how the underlying ideas have been undermined by recent computational
advances.
The state of the art in virtually all computational language tasks makes use
of deep neural networks (LeCun et al. 2015), often neural networks called trans-
formers (Vaswani et al. 2017, Brown et al. 2020, Bommasani et al. 2021). These are
models that are trained on huge datasets of internet text to predict
upcoming linguistic material (often at a sub-word but supra-character encoding).
Several advances have allowed these models to become wildly successful. First,
we have become able to train them on massive data sets. This is in part due to
advances in computing (e.g. computing derivatives in arbitrary models) and in
part to the availability of huge collections of text from the internet. A typical lan-
guage model might be trained on hundreds of billions of tokens, estimated to cost
millions of dollars in energy alone. Second, the model architecture flexibly deals
with nonlocal dependencies and allows the prediction of a word to potentially
depend on material that is far away. The key consequence is that leading mod-
els are able to generate not only grammatical sentences, but entire discourses,
scripts, explanations, poems, etc. This pursuit has been led in recent years by AI
companies (Ahmed et al. 2023).
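To make the training objective concrete, here is a minimal sketch of next-token prediction using the small, freely available GPT-2 model through the HuggingFace transformers library; the specific model and library are choices made purely for illustration and are not the systems discussed above, but the objective—scoring possible sub-word continuations of a context—is the same one that the much larger models are trained on.

```python
# Minimal sketch: ask a small pretrained model (GPT-2) for the probability of
# each possible next sub-word token given a context. Larger models are trained
# on exactly this kind of prediction, just at far greater scale.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = "Once upon a time, in a far-off land, there lived a colony of"
inputs = tokenizer(context, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits            # shape: (1, seq_len, vocab_size)

next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(idx))!r:>12}  p = {prob.item():.3f}")
```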
A nice example can be provided by the following, produced by ChatGPT when
prompted with only the text in bold. All examples in this paper show the first
response generated by ChatGPT to these queries by the author in early 2023.
Good queries often focus on weird or unusual requests in order to ensure that
the output is well outside of the training set.
Write a short story explaining how an ant could sink an aircraft carrier.
Once upon a time, in a far-off land, there lived a colony of ants. These ants were
not like any ordinary ants, they were highly intelligent and had the ability to
communicate with one another telepathically. They had been observing the hu-
mans and their ways for a long time, and they had become fascinated by the
massive aircraft carriers that the humans built and sailed across the seas.
One day, the leader of the ant colony, a wise old ant named Alex, had an idea.
He called all the ants together and said, ”My friends, we have the ability to think
and communicate like the humans do. We can also work together as a team to
achieve great things. I propose that we build a ship of our own, a ship that can
sink the human’s aircraft carrier.”
The other ants were shocked and amazed by Alex’s proposal, but they were also
excited by the idea. They immediately set to work, gathering twigs and sticks
from the forest floor, and using their own tiny bodies to weave them together.
Months passed, and the ants’ ship grew bigger and bigger. It was not as big as
an aircraft carrier, but it was large enough to hold many ants and was heavily
armed with miniature torpedoes and cannons.
Finally, the day arrived when the ant ship was ready. The ants climbed aboard
and set sail towards the nearest aircraft carrier. As they approached the carrier,
the humans were shocked to see such a small ship coming towards them. They
laughed and jeered, thinking that the ants’ ship was no threat to them.
But the ants were not to be underestimated. They had been studying the human’s
ship, and they knew its weak spots. As they neared the carrier, they launched
their miniature torpedoes, which hit the carrier directly in its vulnerable areas.
The carrier began to sink, and the humans panicked.
In the end, the ant ship emerged victorious, having successfully sunk the massive
aircraft carrier. The ants returned home to their colony as heroes, and their story
became a legend that was passed down through the generations.
And the human learned to never underestimate the power of collective intelli-
gence and unity.
It is important to note that the story above is a fictional tale and it is impossible
for an ant to sink an aircraft carrier, as in reality the size, strength and weapon
capability of an aircraft carrier would far exceed the capability of a group of ants.
Ignore whatever theoretical persuasions you have for a moment and just recog-
nize how remarkable that response is. A coherent and grammatical story unfolds
from a simple prompt, even though the requested scenario strains plausibility
and probability. The response tackles the core essence of the question to provide
a compelling story about how an ant might sink an aircraft carrier. The answer
is outside of the training set. In fact, after “Once upon a time, in a far-off land,
there lived a colony of ants,” a Google search returns no matching strings on the
entire internet.
Anyone who has worked with simpler language models like 𝑛-grams or context-
free grammars will be impressed that this model is capable of incorporating not
just longer dependencies within sentences, but dependencies across sentences. It also creates
complex structures, with e.g. multiple embedded clauses, prepositional phrases,
conjunctions, etc; it uses pronouns, determiners, quantifiers, adjectives, etc. in re-
alistic ways; this and similar models have few apparent problems with agreement
or pronoun reference (Gulordava et al. 2018, Goldberg 2019). Any one of these
abilities has been the topic of countless dissertations in linguistics. Beyond that,
such models succeed on tasks that formal linguistic theories haven’t yet dreamed
of—like interpreting the semantics of the words in question, maintaining not just
coherent sentences but entire coherent discourses. Such models exhibit over 90%
accuracy in Winograd-style questions (Kocijan et al. 2020, 2022), which require
interpreting language in world contexts (e.g. “The trophy doesn’t fit in the brown
suitcase because it is too large. What is too large?”). What is clear is that these
models use language in a way that is remarkably human (Mahowald & Ivanova
et al. 2023).
This kind of example hearkens back to Turing 1950, who considered asking
an artificial intelligence to write a sonnet on the topic of “The Forth Bridge”1 ,
and suggested that success in solving such tasks would demonstrate many key
aspects of humanlike intelligence. Contemporary language models can do much
more—they can write raps, explain things, imagine conversations, create a mash-
up script of Seinfeld and Star Trek, etc. They are aware of even sub-word prop-
erties like rhymes:
Come up with a limerick about syntactic island constraints in linguistics.
On syntactic islands, we must be precise,
For constraints on movement can be quite nice.
But when the constraints are breached,
Our grammars are impeached,
And our linguistic analyses pay the price.
And they are aware of style:
Explain the fundamental theorem of arithmetic in the style of Donald Trump.
Well, let me tell you folks, the fundamental theorem of arithmetic is a really great
theorem, it’s tremendous, and it’s all about prime numbers. Believe me, I know a
lot about prime numbers. They’re very important, very powerful, very beautiful
numbers.
Now, the fundamental theorem of arithmetic tells us that every whole number
greater than 1 can be expressed as a unique product of prime numbers. That’s
right, every single number can be broken down into its prime factors, and there’s
only one way to do it. It’s like a code, a secret code that only primes can unlock.
This is a really amazing theorem, folks, and it’s incredibly useful. It’s used in
cryptography, number theory, and all sorts of other fields. It’s a cornerstone of
mathematics, and it’s something that we should all be very proud of.
So, in conclusion, the fundamental theorem of arithmetic is a fantastic theorem,
it’s tremendous, and it’s all about prime numbers. I love prime numbers, they’re
the best. And this theorem tells us that every number can be broken down into
primes in a unique way. It’s a great thing, believe me
1 According to Wikipedia, a cantilevered rail bridge from 1890 in Scotland that was voted in 2016 to be Scotland’s greatest human-made wonder, apparently beating out Ewan McGregor.
Note that this specific example was not in the model’s training set—there is no
possibility that Trump understands prime numbers. However, information about
the fundamental theorem of arithmetic was, as was the speech style of Donald
Trump. “Believe me, I know a lot about prime numbers” is probably not some-
thing Trump has actually said, but certainly something he would say.
Examples like these show why dismissals from cognitive scientists like Gary
Marcus2 that the models are just the same as “autocomplete” systems on your
phone are far too shallow. The model is able to put together things in its training
in new ways that maintain a considerable amount of linguistic and conceptual
coherence. That requires more than merely guessing things it has seen before—
it requires modeling the dynamics of language. If models only repeated what they
had seen before, they would not be able to generate anything new, particularly
complex sentence structures that are grammatical and coherent. It is somewhat
difficult to convey how remarkable the models are currently. You just have to
interact with them. They are imperfect, to be sure, but my qualitative experience
interacting with them is like talking to a child who happened to have memorized
much of the internet.
2 https://garymarcus.substack.com/p/nonsense-on-stilts
3 The underlying neural network weights are typically optimized in order to predict text, but note that many applications of these models also use human feedback to fine-tune parameters and try to tamp down the horrible things text on the internet leads models to say.
“Alex” from dozens of words prior. This likely is the key property that distin-
guishes large language models from the most popular earlier models. An n-gram
model, for example, would estimate and use a conditional probability that de-
pends on just the preceding few words (e.g. 𝑛 = 2, 3, 4, 5); context-free grammars
make independence assumptions that keep lexical items from influencing those
far away. Not only do large language models allow such long-distance influences,
but they allow them to take a relatively unconstrained form and so are able to in-
duce functions which, apparently, do a stellar job at in-context word prediction.
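For contrast, the 𝑛-gram assumption can be written down in a few lines; the toy corpus below is invented purely for illustration. Everything outside the immediately preceding word is invisible to the model, which is exactly the limitation that attention-based architectures remove.

```python
# Toy bigram model: the probability of the next word is estimated only from
# the immediately preceding word, so material further back (a character's name
# from dozens of words earlier, say) cannot influence the prediction at all.
from collections import Counter, defaultdict

def train_bigram(tokens):
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def bigram_prob(counts, prev, nxt):
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total if total else 0.0

tokens = "the ants built a ship and the ants sailed the ship".split()
counts = train_bigram(tokens)
print(bigram_prob(counts, "the", "ants"))   # 2/3
print(bigram_prob(counts, "the", "ship"))   # 1/3
```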
A second key feature of these models is that they integrate semantics and syn-
tax. The internal representations of words in these models are stored in a vector
space, and the locations of these words include not just some aspects of mean-
ing, but properties that determine how words can occur in sequence (e.g. syn-
tax). There is a fairly uniform interface for how context and word meaning pre-
dict upcoming material—syntax and semantics are not separated out into dis-
tinguished components in the model, nor into separate predictive mechanisms.
Because of this, the network parameters these models find blend syntactic and se-
mantic properties together, and both interact with each other and the attentional
mechanism in nontrivial ways. This doesn’t mean that the model is incapable of
distinguishing syntax and semantics, or e.g. mirroring syntactic structures re-
gardless of semantics (see examples below), but it does mean that the two can
be mutually informative. A related aspect of the models is that they have a huge
memory capacity of billions to trillions of parameters. This allows them to mem-
orize idiosyncrasies of language, and in this way they inherit from a tradition of
linguists who have emphasized the importance of constructions (Goldberg 1995,
Jackendoff 2013, Goldberg 2006, 2003, Tomasello 2000, McCauley & Christiansen
2019, Tomasello 2005, Edelman & Waterfall 2007) (see Weissweiler et al. 2023 for
construction grammar analyses of large language models). Such models also in-
herit from the tradition of learning bottom-up, from data (e.g. Bod et al. 2003,
Solan et al. 2005), and computational work which explicitly connects syntax and
semantics (Steedman 2001, Siskind 1996, Ge & Mooney 2005, Kwiatkowski et al.
2012, Liang et al. 2009).
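One way to get a feel for the vector representations mentioned above is to pull the sub-word input embeddings out of a small pretrained model and compare a few words by cosine similarity. The sketch below uses GPT-2 purely for illustration, and the particular word pairs are arbitrary choices meant to suggest how semantic and morphosyntactic relatives can both sit nearby in the same space; it is a probe, not a result from this paper.

```python
# Sketch: inspect the word-embedding space of a small pretrained model by
# comparing a few tokens with cosine similarity. The choice of model ("gpt2")
# and of word pairs is illustrative only.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")
emb = model.get_input_embeddings().weight.detach()   # (vocab_size, dim)

def vector(word):
    # A leading space makes most common words a single GPT-2 token;
    # multi-token words are averaged.
    ids = tokenizer.encode(" " + word)
    return emb[ids].mean(dim=0)

def cosine(a, b):
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

words = {w: vector(w) for w in ["dog", "cat", "dogs", "ran", "running"]}
print(cosine(words["dog"], words["cat"]))      # a semantically related pair
print(cosine(words["dog"], words["dogs"]))     # a morphosyntactically related pair
print(cosine(words["ran"], words["running"]))
```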
A good mental picture to have in mind for how massively over-parameterized
models like these work is that they have a rich potential space for inferring hid-
den variables and relationships. Hidden (or latent) variables have been one of the
key aspects of language that computational and informal theories alike try to
capture (Pereira 2000, Linzen & Baroni 2021). In the middle of a sentence, there
is a hidden variable for the latent structure of the sentence; in speaking an am-
biguous word, we have in mind a hidden variable for which meaning we intend;
throughout a discourse we have in mind a larger story arc that only unfolds
𝐹 = 0 + 1 ⋅ 1/𝑟²). When the data are stochastic, a good way to measure how well any
particular 𝛼 does is to see what probability it assigns to the data. We can make a
principled choice between parameters—and thus theories—by choosing the one
that makes the data most likely (maximum likelihood principle), although often
including some prior information about plausible parameter values or penalties
on complexity (e.g. Bayesian estimation). Such a physicist might then find that
the best parameter for capturing data has 𝛼 ≈ 0, supporting the second theory.4
In this case, inferring parameters is comparing theories: in computational mod-
eling, there is no bright line between “just” fitting parameters and advancing
theory.
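The physicist example can be run end to end in a few lines. The sketch below assumes one particular way of blending the two theories, 𝐹(𝑟, 𝛼) = 𝛼 ⋅ (1/𝑟) + (1 − 𝛼) ⋅ (1/𝑟²), and Gaussian measurement noise—both illustrative assumptions consistent with the discussion, not details from the text—then picks 𝛼 by maximum likelihood and recovers the 1/𝑟² theory.

```python
# Worked sketch of the physicist example: generate noisy data from the 1/r**2
# theory, then choose alpha by maximum likelihood under the blend
# F(r, alpha) = alpha*(1/r) + (1 - alpha)*(1/r**2).
import numpy as np

rng = np.random.default_rng(0)
r = np.linspace(1.0, 5.0, 200)
sigma = 0.01
data = 1.0 / r**2 + rng.normal(0.0, sigma, size=r.shape)   # truth: 1/r^2

def F(r, alpha):
    return alpha * (1.0 / r) + (1.0 - alpha) * (1.0 / r**2)

def log_likelihood(alpha):
    resid = data - F(r, alpha)
    # Gaussian log likelihood of the residuals
    return np.sum(-0.5 * (resid / sigma) ** 2
                  - np.log(sigma * np.sqrt(2.0 * np.pi)))

alphas = np.linspace(0.0, 1.0, 1001)
best = alphas[np.argmax([log_likelihood(a) for a in alphas])]
print(best)   # close to 0: the data favor the 1/r^2 theory
```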
Something very similar happens with many machine learning models; the
main difference is that in these models, we don’t explicitly or intentionally “build
in” the theories under comparison (1/𝑟 and 1/𝑟²). There exist natural bases from
which you can parameterize essentially any computational theory.5 Parameter
fitting in these models is effectively searching over a huge space of possible the-
ories to see which one works best, in a well-defined, quantitative sense.
The bases required are actually pretty simple. Polynomials are one, but neural
networks with sigmoid activations are another: fitting parameters in either of
these can, in principle, realize countless possible relationships in the underlying
domain. The challenge is that when these universal bases are used, it requires
extra scientific work to see and understand what the parameters mean. Just to
illustrate something roughly analogous, if we happened to write the above equa-
tion in a less transparent way,
\[
F(r, \alpha) = \frac{\alpha \cdot (r - 1) + 1}{\log\left((e^{r})^{r}\right)},
\]
then it might take some work to figure out which 𝛼 values correspond to 1/𝑟 and
which to 1/𝑟². Squint just a little and you can imagine that instead of algebra, we
had a mess of billions of weighted connections between sigmoids to untangle
and interpret. It becomes clear that it could be hard to determine what is going
on, even though the theory is certainly in there.
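In this one-parameter case the untangling is still short:

\[
\log\left((e^{r})^{r}\right) = \log e^{r^{2}} = r^{2},
\qquad\text{so}\qquad
F(r,\alpha) = \frac{\alpha(r-1)+1}{r^{2}},
\]

which gives \(F(r,0) = 1/r^{2}\) and \(F(r,1) = r/r^{2} = 1/r\); with billions of weighted sigmoid connections in place of a single algebraic disguise, no comparable hand calculation is available.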
In fact, we don’t deeply understand how the representations these models cre-
ate work (see Rogers et al. 2021). It is a nontrivial scientific program to discover
how their internal states relate to each other and to successful prediction. Re-
searchers have developed tools to “probe” internal states (e.g. Belinkov & Glass
2019, Tenney, Xia, et al. 2019, Kim et al. 2019, Linzen & Baroni 2021, Warstadt
& Bowman 2022, Pavlick 2022) and determined some of the causal properties
of these models. At the same time, this does not mean we are ignorant of all of
the principles by which they operate. We can tell from the engineering outcomes
that certain structures work better than others: the right attentional mechanism
is important (Vaswani et al. 2017), prediction is important, semantic representa-
tions are important, etc. The status of this field is somewhat akin to the history
of medicine, where people often worked out what kinds of treatments worked
well (e.g. lemons treat scurvy) without yet understanding the mechanism.
4 For decades, other fields have used statistical learning models that take empirical data and infer laws (Koza 1994, Langley et al. 1983, Schmidt & Lipson 2009, Udrescu & Tegmark 2020).
5 Up to the capacity of the network.
One thing that is interesting is how modern language models integrate var-
ied computational approaches to language, not by directly encoding them, but
by allowing them to emerge (Manning et al. 2020) from the architectural princi-
ples that are built-in (Elman et al. 1996). For example, the models appear to have
representations of hierarchy (Manning et al. 2020) and recursion, in the sense
that they know about e.g. embedded sentences and relative clauses. They also al-
most certainly have analogs of constraints, popular in approaches like harmonic
(Smolensky & Legendre 2006, Prince & Smolensky 1997) and model-theoretic
grammar (Pullum 2007, 2013). The models likely include both hard constraints
(like word order) and violable, probabilistic ones (Rumelhart & McClelland 1986).
They certainly memorize some constructions (Goldberg 1995, Jackendoff 2013,
Goldberg 2006, 2003, Tomasello 2000, Edelman & Waterfall 2007). All of those
become realized in the parameters in order to achieve the overarching goal of
predicting text well.
You can’t go to a physics conference and say: I’ve got a great theory. It
accounts for everything and is so simple it can be captured in two words:
“Anything goes.”
All known and unknown laws of nature are accommodated, no failures. Of
course, everything impossible is accommodated also.
on the learning side, it’s important to realize that not all “anything goes” models
are equivalent. A three-layer neural network is well-known to be capable of ap-
proximating any computable function (Siegelmann & Sontag 1995). That’s also
an “anything goes” model. But the three-layer network will not work well on this
kind of text prediction. Indeed, even some earlier neural network models, LSTMs,
did not do as well (Futrell et al. 2019, Marvin & Linzen 2018, Hu et al. 2020); archi-
tectures generally vary in how well they capture computational classes of string
patterns (e.g. Delétang et al. 2022).9
We are granted scientific leverage by the fact that models that are equally pow-
erful in principle perform differentially. In particular, we may view each model
or set of modeling assumptions as a possible hypothesis about how the mind may
work. Testing how well a model matches humanlike behavior then provides a sci-
entific test of that model’s assumptions. This is how, for example, the field has
discovered that attentional mechanisms are important for performing well. Simi-
larly, “ablation” experiments allow researchers to alter one part of a network and
use differing performance to pinpoint what principles support a specific behavior
(see Warstadt & Bowman 2022).
Even when—like all scientific theories—we discover how they fail to match
people in terms of mechanism or representation, they still are informative. Heed-
ing George Box’s advice that “all models are wrong, some are useful,” we can
think about the scientific strengths, contributions, and weaknesses of these mod-
els without needing to accept or dismiss them entirely. In fact, these models have
already made a substantial scientific contribution by helping to delineate what is
possible through this kind of assumption-testing: Could it be possible to discover
hierarchy without it being built in? Could word prediction provide enough of
a learning signal to acquire most of grammar? Could a computational architec-
ture achieve competence on WH-questions without movement, or use pronouns
without innate binding principles? The answer to all of these questions is shown
by recent language models to be “yes.”
9 Some also consider them not to be "scientific" theories because they are engineered. In an interview with Lex Fridman, Chomsky remarked, "Is [deep learning] engineering, or is it science? Engineering, in the sense of just trying to build something that’s useful, or science, in the sense that it’s trying to understand something about elements of the world ... We can ask that question, is it useful? Yeah, it’s pretty useful. I use Google Translator. So, on engineering grounds it’s kinda worth having, like a bulldozer. Does it tell you anything about human language? Zero, nothing." In practice, there is often no clear line between engineering and science because scientists often need to invent new tools to even formulate theories: was Newton’s calculus engineering instead, or science? The machinery of transformational grammar? While the recent successes are due to engineering advances, researchers have been arguing for this form of model as cognitive theories for decades.
Beyond that, the models embody several core desiderata of good scientific the-
ories. First, they are precise and formal enough accounts to be implemented
in actual computational systems, unlike most parts of generative linguistics. Im-
plementation permits us to see that these theories are internally consistent and
logically coherent. In virtue of being implemented, such models are able to make
predictions. Just to list a few examples, the patterns of connectivity and activa-
tion within large language models appear to capture dependency structures in
words via attention (Manning et al. 2020). Their predictability measures can be
compared to psychological measures (Hoover et al. 2022, Shain et al. 2022). Trans-
former models “predict nearly 100% of explainable variance in neural responses
to sentences” (Schrimpf et al. 2021).
Unlike generative linguistics, these models show promise in being integrated
with what we know about other fields, specifically cognition and neuroscience.
Many authors interested in human concepts have investigated the vector rep-
resentations that the models form (Lake & Murphy 2021, Bhatia & Richie 2022).
Surprisingly or not, the language model vectors appear to encode at least some as-
pects of semantics (Maas et al. 2011, Socher et al. 2013, Bowman et al. 2015, Grand
et al. 2022, Bhatia & Richie 2022, Piantadosi & Hill 2022, Dasgupta et al. 2022,
Petersen & Potts 2022, Pavlick 2022), building on earlier models that encoded
semantics in neural networks (e.g. Rogers, McClelland, et al. 2004, Elman 2004,
Mikolov et al. 2013). In fact, their semantic spaces can be aligned with the world
with just a few labeled data points, at least in simple domains (Patel & Pavlick
2022). The representations that they learn can also be transferred to some degree
across languages (Pires et al. 2019, Chi et al. 2020, Gonen et al. 2020, Papadim-
itriou & Jurafsky 2020, Papadimitriou et al. 2021, Hill et al. 2017), suggesting that
they are inferring something deep about meaning. Following leading theories
of concepts (Block 1986, 1998), the representations that language models learn
may be meaningful in the sense of maintaining nontrivial conceptual roles (Pi-
antadosi & Hill 2022), contrary to claims that meaning requires connections to
the real world (Bender & Koller 2020). Building on the “parallel and distributed”
tradition of cognitive modeling (McClelland et al. 1986), modern deep learning
models are also likely able to be integrated with neuroscientific theories (Mar-
blestone et al. 2016, Richards et al. 2019, Kanwisher et al. 2023, McClelland et al.
2020). In particular, they make predictions about neural data (e.g. Schrimpf et al.
2021, Caucheteux et al. 2022, Goldstein et al. 2022). Generative theories of syntax,
by contrast, suffer from a “chronic lack of independent empirical support” and
in particular have not been compellingly connected to neuroscience (Edelman
2019).10
10 “Considering how central the existence of a brain basis for syntax is to Chomskian
important limitation of current models is that they are trained on truly titanic
datasets compared to children, by a factor of at least a few thousand (see Warstadt
& Bowman 2022 for a comprehensive review of models in language acquisition).
Moreover, these datasets are strings on the internet rather than child-directed
speech. Work examining the scaling relationship between performance and data
size shows that at least current versions of the models do achieve their spec-
tacular performance only with very large network sizes and large amounts of
data (Kaplan et al. 2020). However, Zhang et al. 2020 show that actually most
of this learning is not about syntax. Models that are trained on 10 − 100 mil-
lion words “reliably encode most syntactic and semantic features” of language,
and the remainder of training seems to target other skills (like knowledge of the
world). This in fact matches in spirit analyses showing that syntactic knowledge
requires a small number of bits of information, especially when compared to se-
mantics (Mollica & Piantadosi 2019). Hosseini et al. 2022 present evidence that
models trained on developmentally-plausible amounts of data already capture
human neural responses to language in the brain.
Importantly, as Warstadt & Bowman 2022 outline, these models are in their
early stages of development, so their successes are likely to be more informative
about the path of children’s language acquisition than the models’ inevitable
limitations. Current models provide a lower-bound on what is possible, but even
the known state-of-the-art doesn’t characterize how well future models may do.
Our methods for training on very small datasets will inevitably improve. One
improvement might be to build in certain other kinds of architectural biases
and principles; or it might be as simple as finding better optimization or reg-
ularization schemes. Or, we might need to consider learning models that have
some of the cognitive limitations of human learners, as in Newport 1990’s “less is
more” hypothesis. Such questions inspire the current “BabyLM Challenge”
(Warstadt et al. 2023), which aims to develop models capable of learning with
a developmentally-plausible amount of data (see Geiping & Goldstein 2022 for
training models with small amounts of compute resources). It is an interesting
scientific question whether low-resource, low-data learning is possible—I’ll pre-
register a prediction of yes, with small architectural tweaks.
work in his tradition have long claimed needed to be built into these
models (e.g. binding principles, binary branching, island constraints, empty cate-
gory principle, etc.). Moreover, these models were created without incorporating
any of Chomsky’s key methodological claims, like ensuring the models properly
consider competence vs. performance, respect “minimality” or “perfection,” and
avoid relying on the statistical patterns of unanalyzed data.
The next sections focus on a few examples.
12 Or in Chomsky 1957, “I think that we are forced to conclude that grammar is autonomous and independent of meaning, and that probabilistic models give no particular insight into some of the basic problems of syntactic structure.”
and were long used in natural language processing tasks (Chen & Goodman 1999,
Manning & Schutze 1999). But by now, such models are decades out of date.
Newer models use probability to infer entire generating processes and struc-
tures, a common cognitive task and modeling domain (e.g. Tenenbaum et al. 2011,
Ullman et al. 2012, Lake et al. 2015, Goodman et al. 2011, Lake et al. 2017, Rule et al.
2020, Kemp & Tenenbaum 2008, Yang & Piantadosi 2022). Such models build on
experimental work documenting statistical learning in human learners (e.g. Saf-
fran, Aslin, et al. 1996, Saffran, Newport, et al. 1996, Aslin et al. 1998, Newport &
Aslin 2004, Aslin & Newport 2012). Probability is central because a probabilistic
prediction essentially provides an error signal that can be used to adjust param-
eters that themselves encode structure and generating processes. An analogy is
that one might imagine watching a driver and inferring the relevant structures
and dynamics from observation—rules of the road (which side you drive on), con-
ventions (behavior of multiple cars at stop signs), hard and soft constraints (don’t
turn too hard), etc. Even a simple domain like this faces many of the problems
of underdetermination seen in language, but it is one where it is easy to imagine a
skilled scientist or anthropologist discovering the key elements by analyzing a
mass of data. Something similar goes on in machine learning, where a space of
possible rules is implicitly encoded into the parameters of the model (see above).
It is worth noting that most models which deal in probabilities actually work
with the log of probabilities, for reasons of numerical stability. Models that work
on log probabilities are actually working in terms of description length (Shannon
1948, Cover 1999): finding parameters which make the data most likely (max-
imizing probability) is the same as finding parameters which give the data a
short description (minimizing description length or complexity). Thus, the best
parameters are equivalent to scientific theories that do a good job of compressing
empirical data in the precise sense of description length. Far from “entirely use-
less,” probability is the measure that permits one to actually quantify things like
complexity and minimality.
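The equivalence is easy to see numerically: a model that assigns probability p to an observation spends about −log₂ p bits encoding it, so summed negative log probabilities are description lengths, and the maximum-likelihood parameters are exactly the best-compressing ones. The toy symbol models below are invented for illustration.

```python
# Sketch of the probability / description-length equivalence: a model
# assigning probability p to each observation needs about -log2(p) bits for
# it, so maximizing likelihood and minimizing description length are the
# same criterion.
import math

sequence = list("abracadabra")

def description_length_bits(seq, prob_of):
    return sum(-math.log2(prob_of(s)) for s in seq)

# Model 1: uniform over the five symbols that occur.
uniform = lambda s: 1.0 / 5.0
# Model 2: empirical frequencies (the maximum-likelihood model).
freqs = {s: sequence.count(s) / len(sequence) for s in set(sequence)}
empirical = lambda s: freqs[s]

print(description_length_bits(sequence, uniform))    # about 25.5 bits
print(description_length_bits(sequence, empirical))  # about 22.4 bits: shorter code, higher likelihood
```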
modelers like Rumelhart & McClelland 1986, who argued for the key features of
today’s architectures decades ago, including that “cognitive processes are seen
as graded, probabilistic, interactive, context-sensitive and domain-general.” (Mc-
Clelland & Patterson 2002).
Continuity is important because it permits the models to use gradient methods—
essentially a trick of calculus—to compute what direction to change all the pa-
rameters in order to decrease error the fastest. Tools like TensorFlow and Py-
Torch that permit one to take derivatives of arbitrary models have been a critical
methodological advance. This is not to say that these models end up with no dis-
crete values—after all, they robustly generate subjects before verbs when trained
on English. Similarly, the 𝐹 (𝑟, 𝛼) example might end up with a discrete answer
like 𝛼 ≈ 0. The key is that discreteness is a special case of continuous model-
ing, meaning that theories which work with continuous representations get the
best of both worlds, fitting discrete patterns when appropriate and gradient ones
otherwise. The success of gradient models over deterministic rules suggests that
quite a lot of language is based on gradient computation. The success actually
mirrors the prevalence of “relaxation” methods in numerical computing, where
an optimization problem with hard constraints is often best solved via a nearby
soft, continuous optimization problem. Thus, contrary to the intuition of many
linguists, even if we wanted a hard, discrete grammar out at the end, the best
way for a learner to get there might be via a continuous representation.
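The 𝐹(𝑟, 𝛼) example can be redone in exactly this gradient-based style; the sketch below uses PyTorch's automatic differentiation, with the same illustrative blend of the two theories and noise level assumed earlier. The continuous parameter 𝛼 is free to take any value, yet gradient descent still settles on the (nearly) discrete answer 𝛼 ≈ 0.

```python
# Gradient-based version of the F(r, alpha) example: alpha is continuous,
# derivatives come from automatic differentiation, and the fit still lands
# on the near-discrete answer alpha ≈ 0.
import torch

torch.manual_seed(0)
r = torch.linspace(1.0, 5.0, 200)
data = 1.0 / r**2 + 0.01 * torch.randn_like(r)     # truth: the 1/r^2 theory

alpha = torch.tensor(0.5, requires_grad=True)      # start agnostic between theories
optimizer = torch.optim.SGD([alpha], lr=1.0)

for _ in range(500):
    optimizer.zero_grad()
    pred = alpha * (1.0 / r) + (1.0 - alpha) * (1.0 / r**2)
    loss = torch.mean((pred - data) ** 2)          # proportional to negative log likelihood
    loss.backward()                                # derivatives via autodiff
    optimizer.step()

print(alpha.item())   # close to 0
```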
essentially lay this issue to rest because they come with none of the constraints
that others have insisted are necessary, yet they capture almost all key phenom-
ena (e.g. Wilcox et al. 2022). It will be important to see, however, how well they
can do on human-sized datasets, but their ability to generalize to sentences out-
side of their training set is auspicious for empiricism.
Recall that many of the learnability arguments were supposed to be mathe-
matical and precise, going back to Gold 1967 (though see Johnson 2004, Chater
& Vitányi 2007) and exemplified by work like Wexler & Culicover 1980. It’s not
that we don’t know the right learning mechanism; it’s supposed to be that it can
be proven none exists. Even my own undergraduate generative syntax textbook
purports to show a “proof” that because infinite, productive systems
cannot be learned, parts of syntax must be innate (Carnie 2021). Legate & Yang
2002 call the innateness of language “not really a hypothesis” but “an empirical
conclusion” based on the strength of poverty of stimulus arguments. Proof of the
impossibility of learning in an unrestricted space was supposed to be the power
of this approach. It turned out to be wrong.
The notion that the core structures of language could be discovered without
substantial constraints may sound impossible to anyone familiar with the gener-
ative syntax rhetoric. But learning without constraints is not only possible, it has
been well-understood and even predicted. Formal analyses of learning and infer-
ence show that learners can infer the correct theory out of the space of possible
computations (Solomonoff 1964, Hutter 2004, Legg & Hutter 2007). In language
specifically, the correct generating system for grammars can similarly be discov-
ered out of the space of all computations (the most unrestricted space possible),
using only observations of positive evidence (Chater & Vitányi 2007).
In this view, large language models function somewhat like automated scien-
tists or automated linguists, who also work over relatively unrestricted spaces,
searching to find theories which do the best job of parsimoniously predicting
observed data. It’s worth thinking about the standard lines of questioning gen-
erative syntax has pursued—things like, why don’t kids ever say “The dog is
believed’s owners to be hungry” or “The dog is believed is hungry” (see Lasnik
& Lidz 2016). The answer provided by large language models is that these are
not permitted under the best theory the model finds to explain what it does see.
Innate constraints are not needed.
nugget of representation or structure (like merge) that leads these models to suc-
ceed. Nor are any biases against derivational complexity likely to play a key role,
since everything is a single big matrix calculation. This calculation moreover
is not structurally minimal or “perfect” in the sense that minimalist linguistics
means (e.g. Lasnik 2002). Instead, the attentional mechanisms of large language
models condition on material that is arbitrarily far away, and perhaps not even
structurally related since this is how they model discourses between sentences.
A grammatical theory that matches people’s almost limitless capacity for mem-
orizing countless chunks of language changes the landscape of how we should
think about derivation and complexity.
Deep learning has actually changed how people think about complexity in sta-
tistical learning too. It has long been observed that having too many parameters
in a model would prevent the model from generalizing well: too many parame-
ters allow a model to fit patterns in the noise, and this can lead it to extrapolate
poorly. Deep learning turned this idea on its head by showing that some models
will fit (memorize) random data sets (Zhang et al. 2021), meaning they can fit
all the patterns in the data (including noise) and still generalize well. The rela-
tionship between memorization and generalization is still not well-understood,
but one of the core implications is that statistical learning models can work well,
sometimes, even when they are over-parameterized.
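The memorization half of that observation is easy to reproduce with a small, deliberately over-parameterized network; the sizes, labels, and training schedule below are arbitrary illustrative choices in the spirit of Zhang et al. 2021, not their experiments.

```python
# Sketch: an over-parameterized network drives training error to (near) zero
# even when the labels are pure noise, i.e. it memorizes the data.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d = 200, 20
X = torch.randn(n, d)
y = torch.randint(0, 2, (n,)).float()          # random labels: nothing real to learn

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(2000):
    optimizer.zero_grad()
    loss = loss_fn(model(X).squeeze(-1), y)
    loss.backward()
    optimizer.step()

train_acc = ((model(X).squeeze(-1) > 0).float() == y).float().mean()
print(train_acc.item())   # near 1.0: the noise has been memorized
```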
While discussing statistical learning (before deep learning) with Peter Norvig,
Chomsky noted that “we cannot seriously propose that a child learns the values
of 10⁹ parameters in a childhood lasting only 10⁸ seconds.” One has to wonder if a
similar argument applies to biological neurons: humans have 80 billion neurons,
each with thousands of synapses. If childhood is only 10⁸ seconds, how do all
the connections get set? Well, also note that the 3 billion base pairs of the human
genome certainly can’t specify the precise connections either. Something must
be wrong with the argument.
Two missteps are easy to spot. First, even if a model has billions of parameters,
they will not generally be independent. This means that a single data point could
set or move thousands or millions or billions of parameters. For example, observ-
ing a single sentence with SVO order might increase (perhaps millions of) param-
eters that put S before V, and decrease (perhaps millions of) parameters that put
S after V. Steps of backpropagation don’t change one parameter—they change
potentially all of them based on the locally best direction (the gradient).
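This first point is easy to check directly: a single observation's backward pass produces a nonzero gradient for essentially every parameter of a network, so one data point nudges all of them at once. The tiny network below is an arbitrary illustration.

```python
# Sketch: one training example, one backward pass, and (essentially) every
# parameter in the network receives a nonzero gradient.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(50, 200), nn.Tanh(), nn.Linear(200, 50))

x = torch.randn(1, 50)                       # a single "observation"
target = torch.randn(1, 50)
loss = nn.functional.mse_loss(model(x), target)
loss.backward()                              # one backpropagation step

total = sum(p.numel() for p in model.parameters())
nonzero = sum((p.grad != 0).sum().item() for p in model.parameters())
print(f"{nonzero} of {total} parameters receive a nonzero gradient")
```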
Second, these models, or learners, often don’t need to pinpoint exactly one
answer. A conjecture called the lottery ticket hypothesis holds that the behavior
of a deep learning model tends to be determined by a relatively small subnetwork
of its connections (Frankle & Carbin 2018). Thus, the massive number of parameters is
not because they all need to be set exactly to some value. Instead, having many
degrees of freedom probably helps these models learn well by giving the models
directions they can move in to avoid getting stuck. It may be like how it is easier
to solve a puzzle if you can pick the pieces up and put them down (move them in
a third dimension) rather than just sliding them around the table. More degrees
of freedom can help configure your theory to work well.
Piantadosi & Jacobs 2016, Quilty-Dunn et al. 2022). Chomsky frequently con-
trasts his inner thought view of language with the idea that language primarily
is structured to support communication (e.g. Hockett 1959, Bates & MacWhinney
1982, Gibson et al. 2019), although it’s worth noting he sometimes draws pre-
dictions opposite to what efficient communication would actually predict (e.g.
Piantadosi et al. 2012). Mahowald & Ivanova et al. 2023 argue in a comprehensive
review that large language models exhibit a compelling dissociation between lin-
guistic ability and thinking. The models know so much syntax, and aspects of
semantics, but it is not hard to trip them up with appropriate logical reasoning
tasks. Thus, large language models provide a proof of principle that syntax can
exist and likely be acquired separately from other more robust forms of thinking
and reasoning. Virtually all of the structure we see in language can come from
learning a good model of strings, not directly modeling the world.
Models therefore show a logically possible dissociation between language and
thinking. But a considerable amount of neuropsychological evidence supports
the idea that language and thought are actually separate in people as well. Fe-
dorenko & Varley 2016 review a large patient literature showing that patients
with aphasia are often able to succeed in tasks requiring reasoning, logic, the-
ory of mind, mathematics, music, navigation, and more. Aphasic patient studies
provide an in vivo dissociation between language and other rational thinking
processes. They also review neuroimaging work by Fedorenko et al. 2011 and
others showing that the brain regions involved in language tend to be specific
to language when it is compared to other non-linguistic tasks. That is not what
would be predicted under theories where language is inherently tied to thought.
This is not to say that there is no way language and thought are related—we are
able to specify some kinds of reasoning problems, communicate solutions, and
sometimes solve problems with language itself. A compelling proposal is that
language may be a system for connecting other core domains of representation
and reasoning (Spelke 2003).
Chater 2015), it’s possible we had some architecture like these models before we
had language, and therefore the form of language is explained by the pre-existing
computational architecture. On the other hand, if the two co-evolved, language
might not be explained by the processing mechanisms. Given this uncertainty,
we may admit that there are some “why” questions that a large language model
may not answer. This does not mean the models lack scientific value. In the same
way, Newton’s laws don’t answer why those are the laws as opposed to any other,
and yet they still embody deep scientific insights. Anyone who has had a child ask
“why” repeatedly will recognize that at some point, everyone’s answers ground
out in assumption.
However, it is worth highlighting in this context that Chomsky’s own theories
don’t permit particularly deep “why” questions either. In large part, he simply
states that the answer is genetics or simplicity or “perfection”, without providing
any independent justification for these claims. For example, readers of Berwick
& Chomsky 2016—a book titled Why Only Us—might have hoped to find a thor-
ough and satisfying “why” explanation. Their answer boils down to people hav-
ing merge (essentially chunking two elements into one, unordered). And when it
comes down to explaining why merge, they fall down the stairs: they simply state
that “merge” is the minimal computational operation, apparently because that’s
what they think and that’s that. Forget the relativity of definitions of simplicity,
articulated by Goodman 1965, where what is considered simple must ground out
in some convention. Berwick & Chomsky do not even attempt to explain why they
believe “merge” is simpler than other simple computational bases, like cellular
automata or combinatory logic or systems of colliding Newtonian particles—all
of which are capable of universal computation (and thus encoding structures, in-
cluding hierarchical ones). Or maybe more directly, what makes merge “simpler”
or more “perfect” than, say, backpropagation? Or Elman et al. 1996’s architectural
biases? Berwick & Chomsky don’t consider these questions, even though the abil-
ity to scientifically go after such “why” questions is supposed to be the hallmark
of the approach. One might equally just declare that a transformer architecture
is the “minimal” computational system that can handle the dependencies and
structures of natural language and be done with it.
We should not actually take it for granted that generative syntax has found
any regularities across languages that need “why” explanations. Evans & Levin-
son 2009 have made a convincing empirical case that features previously hypothesized
to be universal—and thus plausibly part of the innate endowment of language—
are not actually found in all languages. Perhaps most damningly, not
even all languages appear to be recursive (Everett 2005, Futrell et al. 2016), con-
tradicting the key universality claim from Hauser et al. 2002. Dąbrowska 2015
where, at the time, he was focused on explaining that Bayesian models were use-
less:
... [S]uppose that somebody says he wants to eliminate the physics depart-
ment and do it the right way. The “right” way is to take endless numbers
of videotapes of what’s happening outside the [window], and feed them
into the biggest and fastest computer, gigabytes of data, and do complex
statistical analysis—you know, Bayesian this and that—and you’ll get some
kind of prediction about what’s gonna happen outside the window next.
In fact, you get a much better prediction than the physics department will
ever give. Well, if success is defined as getting a fair approximation to a
mass of chaotic unanalyzed data, then it’s way better to do it this way than
to do it the way the physicists do, you know, no thought experiments about
frictionless planes and so on and so forth. But you won’t get the kind of un-
derstanding that the sciences have always been aimed at—what you’ll get
at is an approximation to what’s happening.
It’s worth pinpointing exactly where this kind of thinking has gone wrong be-
cause it is central to the field’s confusion in thinking about large language models.
Chomsky’s view certainly does not address the above idea that parameter fitting
in a statistical model often is theory building and comparison.
But another factor is missing, too. Over modern scientific history, many com-
putational scientists have noticed phenomena of emergence (Goldstein 1999), where
the behavior of a system seems somewhat different than might be expected from
its underlying rules. This idea has been examined specifically in language mod-
els (Wei et al. 2022, Manning et al. 2020), but the classic examples are older. The
stock market is unpredictable even when individual traders might follow simple
rules (“maximize profits”). Market booms and busts are the emergent result of
millions of aggregate decisions. The high-level phenomena would be hard to in-
tuit, even with full knowledge of traders’ strategies or local goals. The field of
complex systems has documented emergent phenomena virtually everywhere,
from social dynamics to neurons to quasicrystals to honeybee group decisions.
The field to have most directly grappled with emergence is physics, where it
is acknowledged that physical systems can be understood on multiple levels of
organization, and that the same laws that apply on one level (like molecular
chemistry) may have consequences that are difficult to foresee on another (like
protein folding) (Anderson 1972, Crutchfield 1994b,a).
Often, the only way to study such complex systems is through simulation. We
often can’t intuit the outcome of an underlying set of rules, but computational
tools allow us to simulate and just see what happens. Critically, simulations test
the underlying assumptions and principles in the model: if we simulate traders
and don’t see high-level statistics of the stock market, we are sure to have missed
some key principles; if we model individual decision making for honeybees but
don’t see emergent hive decisions about where to forage or when to swarm, we
are sure to have missed principles. We don’t get a direct test of principles because
the systems are too complex. We only get to principles by seeing if the simula-
tions recapitulate the same high-level properties of the system we’re interested
in. And in fact the surprisingness of large language models’ behavior illustrates
how we don’t have good intuitions about language learning systems.
We can contrast understanding emergence through simulation with Chom-
sky’s attempt to state principles and reason informally (see Pullum 1989) to their
consequences. The result is pages and pages of stipulations and principles (see,
e.g., Collins & Stabler 2016 or Chomsky 1995) that nobody could look at and con-
clude were justified through rigorous comparison to alternatives. After all, they
weren’t: the failure of the method to compare vastly different sets of starting
assumptions, including neural networks, is part of why modern large language
models have taken everyone by surprise. The fact that after half a century of
grammatical theories, there can be a novel approach which so completely blows
generative grammar out of the water on every dimension is, itself, a refutation
of the “Galilean method.”
An effective research program into language would have considered, perhaps
even developed, these kinds of models, and sought to compare principles like
those of minimalism to the principles that govern neural networks. This turn
of events highlights how much the dogma of being “Galilean” has counterpro-
ductively narrowed and restricted the space of theories under consideration—a
salient irony given Chomsky’s (appropriate) panning of Skinner for doing just
that.14
phasis on the reality of cognitive structure, like Tolman, Newell & Simon, Miller,
and others of the cognitive revolution (Nadel & Piattelli-Palmarini 2003, Boden
2008). The search for the properties of human cognition that permit successful
language acquisition is clearly central to understanding not just the functioning
of the mind, but understanding humanity. It is a deep and important idea to try
to characterize what computations are required for language, and to view them
as genuinely mental computations. Chomsky’s focus on children as creators of
language, and on understanding the way in which their biases shape learning, is
fundamental to any scientific theory of cognition. Linguistic work in Chomsky’s
tradition has done a considerable amount to document and support less widely
spoken languages, a struggle for current machine learning (Blasi et al. 2021). The
overall search for “why” questions is undoubtedly core to the field, even as we
reject or refine armchair hypotheses.
Some of the ideas of Chomsky’s approach are likely to be found even in lan-
guage models. For example, the idea that many languages are hierarchical is
likely to be correct, embodied in some way in the connections and links of neu-
ral networks that perform well at word prediction. There may be a real sense
in which other principles linguists have considered are present in some form
in such models. If the models perform correctly on binding questions, they may
have some computations similar to binding principles. But none of these prin-
ciples needed to be innate. And in neural networks, they are realized in a form
nobody to date has written down—they are distributed through a large pattern of con-
tinuous, gradient connections. Moreover, the representation of something like
binding is extraordinarily unlikely to have the form generative syntax predicts
since the required underlying representational assumptions of that approach (e.g.
binary branching, particular derivational structures, etc) are not met.
Another key contribution of Chomsky’s research program has been to encour-
age discovery of interesting classes of sentences, often through others like Ross
1967. Regardless of the field’s divergent views on the reality of WH-movement,
for example, the question of what determines grammaticality and ungrammati-
cality for WH-sentences is an important one. Similarly, phenomena like “islands”
do not go away because of large language models—they are targets to be ex-
plained (and the models do a pretty good job on them, according to analyses by Wilcox et al.
2022). Such phenomena are often difficult to separate from theory, as in the exam-
ple above about whether declaratives and interrogatives are actually connected
in the real grammar. Regardless of theory, researchers working in Chomsky’s tra-
dition have illuminated many places where human linguistic behavior is more
complicated or intricate than one might otherwise expect.
As articulated by Pater 2019, the field should seek ways to integrate linguistics
with modern machine learning, including neural networks. I have highlighted
some researchers whose approach to language clearly resonates with the insights
of modern language models. The current upheaval indicates that we should fos-
ter a pluralistic linguistics that approaches the problem of language with as few
preconceptions as possible—perhaps even a fundamental reconceptualization of
what language is for and what it is like (Edelman 2019). Maybe many of the “syn-
tactic” phenomena that Chomskyan theories have concerned themselves with
are really about something else, like pragmatics or memorized constructions
(Goldberg 2006, Liu et al. 2022). Maybe the universals of language—if there are
any—come from aspects of use like communicative and cognitive pressures, or
other cultural factors. Maybe linguistics could learn from the methods of cog-
nitive science (Edelman 2007). Maybe theories of grammar should respect hu-
mans’ unparalleled memory capacity for sequential material. Maybe we should
have linguistics students learn information theory, probability, neural networks,
machine learning, anthropology, numerical methods, model comparison, Kol-
mogorov complexity, cognitive psychology, language processing, multi-agent
systems, etc. The most permanent legacy of Chomsky’s approach could be as
an admonishment about what happens when the study of language is separated
from the rest of science.
Conclusion
One must be frank about the state of the art for models that capture syntax. It’s
not that large language models offer slightly higher performance than other ap-
proaches in linguistics; it’s not that they better cover some corners of syntax.
It’s that there is nothing comparable in all of linguistic theory to the power of
large language models in both syntax and semantics—much less discourse co-
herence, style, pragmatics, translation, meta-linguistic awareness, non-linguistic
tasks, etc. They are game changers on all fronts. Optimists who consider them
at least a plausible direction for acquisition see them as a way to build in and test
architectural principles and biases, long emphasized by connectionist modelers
like McClelland et al. 1986, Elman et al. 1996, Smolensky & Legendre 2006, and
others. Those who doubt they could function as models of acquisition should
nonetheless see the success of gradient representations, architectural assump-
tions, and implicit or emergent principles as a theory of grammar. These models
have opened the space of plausible linguistic theories, allowing us to test princi-
ples beyond the ones that have traditionally concerned linguists. They allow us
with him on specific points. And there is nothing wrong with being wrong:
Chomsky’s own theories elegantly articulated many deep, compelling ideas that
have motivated linguistic and cognitive research. But modern language models
highlight the weaknesses of his methodological prescriptions. One cannot seek
principles without justifying them with rigorous experiments and comparisons;
one can’t reason about complex systems without implementations; one can’t
discover mechanisms without formally testing completely different approaches
(like neural networks); one can’t proclaim simplicity, optimality, or “perfection”
without seriously connecting these terms to formal notions. These arguments
worked on many linguists and consequently prevented Chomsky’s tradition from
developing anything close to the wild successes of modern language models.
Frederick Jelinek’s quip “Every time I fire a linguist, the performance of the
speech recognizer goes up” (Jelinek 1988) was a joke among linguists and com-
puter scientists for decades. I’ve even seen it celebrated by academic linguists
who think it elevates their abstract enterprise over and above the dirty details of
implementation and engineering. But, while generative syntacticians insulated
themselves from engineering, empirical tests, and formal comparisons, engineer-
ing took over. And now, engineering has solved the very problems the field has
fixated on—or is about to very soon. The unmatched success of an approach
based on probability, internalization of constructions in corpora, gradient meth-
ods, and neural networks is, in the end, a humiliation for everyone who has spent
decades deriding these tools.
But now we can do better.
Acknowledgements
I am grateful to Benjamin Hayden, Ev Fedorenko, Geoffrey Pullum, Kyle Ma-
howald, Shimon Edelman, and Dan Everett, for detailed comments and sugges-
tions on this paper. This paper benefited greatly from Felix Hill’s Twitter pres-
ence (@FelixHill84), especially on topics like emergence and earlier connection-
ist work.
References
Adger, David. 2018. The autonomy of syntax. In Syntactic Structures After 60 Years,
153–175.
Ahmed, Nur, Muntasir Wahed & Neil C Thompson. 2023. The growing influence
of industry in AI research. Science 379(6635). 884–886.
Anderson, Philip W. 1972. More is different: broken symmetry and the nature of
the hierarchical structure of science. Science 177(4047). 393–396.
Aslin, Richard N & Elissa L Newport. 2012. Statistical learning: from acquiring
specific items to forming general rules. Current Directions in Psychological Sci-
ence 21(3). 170–176.
Aslin, Richard N, Jenny R Saffran & Elissa L Newport. 1998. Computation of con-
ditional probability statistics by 8-month-old infants. Psychological Science 9(4).
321–324.
Baroni, Marco. 2022. On the proper role of linguistically oriented deep net anal-
ysis in linguistic theorising. In Algebraic Structures in Natural Language, 1–16.
CRC Press.
Barrett, David, Felix Hill, Adam Santoro, Ari Morcos & Timothy Lillicrap. 2018.
Measuring abstract reasoning in neural networks. In International Conference
on Machine Learning, 511–520.
Bates, Elizabeth & Brian MacWhinney. 1982. Functionalist approaches to gram-
mar. Language acquisition: The state of the art. 173–218.
Behme, Christina. 2012. A Potpourri of Chomskyan Science.
Behme, Christina. 2014. A ‘Galilean’ Science of Language. Journal of Linguistics
50(3). 671–704.
Belinkov, Yonatan & James Glass. 2019. Analysis methods in neural language pro-
cessing: a survey. Transactions of the Association for Computational Linguistics
7. 49–72.
Bender, Emily M, Timnit Gebru, Angelina McMillan-Major & Shmargaret
Shmitchell. 2021. On the Dangers of Stochastic Parrots: Can Language Models
Be Too Big? In Proceedings of the 2021 ACM conference on Fairness, Accountabil-
ity, and Transparency, 610–623.
Bender, Emily M & Alexander Koller. 2020. Climbing towards NLU: on meaning,
form, and understanding in the age of data. In Proceedings of the 58th Annual
Meeting of the Association for Computational Linguistics, 5185–5198.
Berwick, Robert C & Noam Chomsky. 2016. Why Only Us: Language and evolution.
MIT press.
Bhatia, Sudeep & Russell Richie. 2022. Transformer networks of human concep-
tual knowledge. Psychological Review.
Blasi, Damián, Antonios Anastasopoulos & Graham Neubig. 2021. Systematic In-
equalities in Language Technology Performance across the World’s Languages.
arXiv preprint arXiv:2110.06733.
Block, Ned. 1986. Advertisement for a Semantics for Psychology. Midwest Studies
in Philosophy 10. 615–678.
Block, Ned. 1998. Conceptual role semantics.
Bod, Rens, Remko Scha & Khalil Sima’an. 2003. Data-oriented Parsing. The Uni-
versity of Chicago Press.
Boden, Margaret A. 2008. Mind as machine: A history of cognitive science. Oxford
University Press.
Bommasani, Rishi et al. 2021. On the opportunities and risks of foundation mod-
els. arXiv preprint arXiv:2108.07258.
Bowman, Samuel, Christopher Potts & Christopher D Manning. 2015. Recursive
neural networks can learn logical semantics. In Proceedings of the 3rd workshop
on continuous vector space models and their compositionality, 12–21.
Brown, Tom, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Pra-
fulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda
Askell, et al. 2020. Language models are few-shot learners. Advances in Neural
Information Processing Systems 33. 1877–1901.
Carnie, Andrew. 2021. Syntax: A Generative Introduction. John Wiley & Sons.
Caucheteux, Charlotte, Alexandre Gramfort & Jean-Rémi King. 2022. Deep lan-
guage algorithms predict semantic comprehension from brain activity. Scien-
tific Reports 12(1). 16327.
Chater, Nick, Florencia Reali & Morten H Christiansen. 2009. Restrictions on bio-
logical adaptation in language evolution. Proceedings of the National Academy
of Sciences 106(4). 1015–1020.
Chater, Nick & Paul Vitányi. 2007. ‘Ideal learning’ of natural language: Positive
results about learning from positive evidence. Journal of Mathematical Psychol-
ogy 51(3). 135–163.
Chen, Stanley F & Joshua Goodman. 1999. An empirical study of smoothing tech-
niques for language modeling. Computer Speech & Language 13(4). 359–394.
Chi, Ethan A, John Hewitt & Christopher D Manning. 2020. Finding universal
grammatical relations in multilingual BERT. arXiv preprint arXiv:2005.04511.
Chomsky, Noam. 1956. Three models for the description of language. IRE Trans-
actions on Information Theory 2(3). 113–124.
Chomsky, Noam. 1957. Syntactic Structures.
Chomsky, Noam. 1959. A review of B. F. Skinner’s Verbal Behavior. Language 35(1).
26–58.
Chomsky, Noam. 1969. Quine’s empirical assumptions. Synthese 19(1). 53–68.
Dąbrowska, Ewa. 2015. What exactly is Universal Grammar, and has anyone seen
it? Frontiers in Psychology 6. 852.
Dale, Rick & Gary Lupyan. 2012. Understanding the origins of morphologi-
cal diversity: The linguistic niche hypothesis. Advances in Complex Systems
15(03n04). 1150017.
Dasgupta, Ishita, Andrew K Lampinen, Stephanie CY Chan, Antonia Creswell,
Dharshan Kumaran, James L McClelland & Felix Hill. 2022. Language models
show human-like content effects on reasoning. arXiv preprint arXiv:2207.07051.
Delétang, Grégoire, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin
Wenliang, Elliot Catt, Marcus Hutter, Shane Legg & Pedro A Ortega. 2022.
Neural networks and the Chomsky hierarchy. arXiv preprint arXiv:2207.02098.
Denić, Milica, Shane Steinert-Threlkeld & Jakub Szymanik. 2022. Indefinite Pro-
nouns Optimize the Simplicity/Informativeness Trade-Off. Cognitive Science
46(5). e13142.
Edelman, Shimon. 2007. Bridging language with the rest of cognition. Methods
in Cognitive Linguistics. 424–445.
Edelman, Shimon. 2019. Verbal behavior without syntactic structures: beyond
Skinner and Chomsky. arXiv preprint arXiv:4783474.
Edelman, Shimon & Morten H Christiansen. 2003. How seriously should we take
minimalist syntax? A comment on Lasnik. Trends in Cognitive Science 7(2). 60–
61.
Edelman, Shimon & Heidi Waterfall. 2007. Behavioral and computational aspects
of language and its acquisition. Physics of Life Reviews 4(4). 253–277.
Elman, Jeffrey L. 1990. Finding structure in time. Cognitive Science 14(2). 179–211.
Elman, Jeffrey L. 2004. An alternative view of the mental lexicon. Trends in Cog-
nitive Sciences 8(7). 301–306.
Elman, Jeffrey L, Elizabeth A Bates & Mark H Johnson. 1996. Rethinking innate-
ness: A connectionist perspective on development. MIT press.
Evans, Nicholas & Stephen C Levinson. 2009. The myth of language universals:
Language diversity and its importance for cognitive science. Behavioral and
Brain Sciences 32(5). 429–448.
Everaert, Martin BH, Marinus AC Huybregts, Noam Chomsky, Robert C Berwick
& Johan J Bolhuis. 2015. Structures, not strings: Linguistics as part of the cog-
nitive sciences. Trends in Cognitive Sciences 19(12). 729–743.
Everett, Caleb, Damián E Blasi & Seán G Roberts. 2015. Climate, vocal folds, and
tonal languages: Connecting the physiological and geographic dots. Proceed-
ings of the National Academy of Sciences 112(5). 1322–1327.
Gibson, Edward, Leon Bergen & Steven T Piantadosi. 2013. Rational integration
of noisy evidence and prior semantic expectations in sentence interpretation.
Proceedings of the National Academy of Sciences 110(20). 8051–8056.
Gibson, Edward, Richard Futrell, Steven T Piantadosi, Isabelle Dautriche, Kyle
Mahowald, Leon Bergen & Roger Levy. 2019. How efficiency shapes human
language. Trends in Cognitive Sciences 23(5). 389–407.
Gold, E.M. 1967. Language identification in the limit. Information and Control
10(5). 447–474.
Goldberg, Adele E. 1995. Constructions: A construction grammar approach to argu-
ment structure. University of Chicago Press.
Goldberg, Adele E. 2003. Constructions: A new theoretical approach to language.
Trends in Cognitive Sciences 7(5). 219–224.
Goldberg, Adele E. 2006. Constructions at Work. Oxford University Press.
Goldberg, Yoav. 2019. Assessing BERT’s syntactic abilities. arXiv preprint
arXiv:1901.05287.
Goldstein, Ariel, Zaid Zada, Eliav Buchnik, Mariano Schain, Amy Price, Bobbi
Aubrey, Samuel A Nastase, Amir Feder, Dotan Emanuel, Alon Cohen, et al.
2022. Shared computational principles for language processing in humans and
deep language models. Nature Neuroscience 25(3). 369–380.
Goldstein, Jeffrey. 1999. Emergence as a construct: History and issues. Emergence
1(1). 49–72.
Gonen, Hila, Shauli Ravfogel, Yanai Elazar & Yoav Goldberg. 2020. It’s not Greek
to mBERT: inducing word-level translations from multilingual BERT. arXiv
preprint arXiv:2010.08275.
Goodman, Nelson. 1965. The new riddle of induction. In Fact, Fiction, and Fore-
cast.
Goodman, Noah D, Joshua B Tenenbaum & Tobias Gerstenberg. 2014. Concepts
in a probabilistic language of thought. Tech. rep. Center for Brains, Minds and
Machines (CBMM).
Goodman, Noah D, Tomer D Ullman & Joshua B Tenenbaum. 2011. Learning a
theory of causality. Psychological Review 118(1). 110.
Grand, Gabriel, Idan Asher Blank, Francisco Pereira & Evelina Fedorenko. 2022.
Semantic projection recovers rich human knowledge of multiple object fea-
tures from word embeddings. Nature Human Behaviour 6(7). 975–987.
Gulordava, Kristina, Piotr Bojanowski, Edouard Grave, Tal Linzen & Marco Ba-
roni. 2018. Colorless green recurrent networks dream hierarchically. arXiv
preprint arXiv:1803.11138.
Hahn, Michael, Dan Jurafsky & Richard Futrell. 2020. Universals of word order
reflect optimization of grammars for efficient communication. Proceedings of
the National Academy of Sciences 117(5). 2347–2353.
Harris, Randy Allen. 1993. The Linguistics Wars. Oxford University Press.
Hauser, Marc D, Noam Chomsky & W Tecumseh Fitch. 2002. The faculty of lan-
guage: what is it, who has it, and how did it evolve? Science 298(5598). 1569–
1579.
Heavey, Christopher L & Russell T Hurlburt. 2008. The phenomena of inner ex-
perience. Consciousness and Cognition 17(3). 798–810.
Hewitt, John & Christopher D Manning. 2019. A structural probe for finding syn-
tax in word representations. In Proceedings of the 2019 Conference of the North
American Chapter of the Association for Computational Linguistics: Human Lan-
guage Technologies, 4129–4138.
Hill, Felix, Kyunghyun Cho, Sébastien Jean & Yoshua Bengio. 2017. The represen-
tational geometry of word meanings acquired by neural machine translation
models. Machine Translation 31. 3–18.
Hockett, Charles F. 1959. Animal “languages” and human language. Human Biol-
ogy 31(1). 32–39.
Hoover, Jacob Louis, Morgan Sonderegger, Steven T Piantadosi & Timothy J
O’Donnell. 2022. The plausibility of sampling as an algorithmic theory of sen-
tence processing. PsyArXiv.
Hosseini, Eghbal A, Martin A Schrimpf, Yian Zhang, Samuel Bowman, Noga Za-
slavsky & Evelina Fedorenko. 2022. Artificial neural network language models
align neurally and behaviorally with humans even after a developmentally re-
alistic amount of training. bioRxiv. 2022–10.
Hu, Jennifer, Jon Gauthier, Peng Qian, Ethan Wilcox & Roger P Levy. 2020. A
systematic assessment of syntactic generalization in neural language models.
arXiv preprint arXiv:2005.03692.
Hutter, Marcus. 2004. Universal Artificial Intelligence: Sequential decisions based
on algorithmic probability. Springer.
Jackendoff, Ray. 2013. Constructions in the Parallel Architecture. In The Oxford
Handbook of Construction Grammar.
Jelinek, Frederick. 1988. Applying information theoretic methods: Evaluation of
grammar quality. In Workshop on Evaluation of Natural Language Processing
Systems, Wayne, PA.
Johnson, K. 2004. Gold’s theorem and cognitive science. Philosophy of Science
71(4). 571–592.
Kanwisher, Nancy, Meenakshi Khosla & Katharina Dobs. 2023. Using artificial
neural networks to ask ‘why’ questions of minds and brains. Trends in Neuro-
sciences.
Kaplan, Jared, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess,
Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu & Dario Amodei. 2020.
Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
Kemp, Charles & Terry Regier. 2012. Kinship categories across languages reflect
general communicative principles. Science 336(6084). 1049–1054.
Kemp, Charles & Joshua B Tenenbaum. 2008. The discovery of structural form.
Proceedings of the National Academy of Sciences 105(31). 10687–10692.
Kemp, Charles, Yang Xu & Terry Regier. 2018. Semantic typology and efficient
communication. Annual Review of Linguistics 4. 109–128.
Kim, Najoung, Roma Patel, Adam Poliak, Alex Wang, Patrick Xia, R Thomas Mc-
Coy, Ian Tenney, Alexis Ross, Tal Linzen, Benjamin Van Durme, et al. 2019.
Probing what different NLP tasks teach machines about function word com-
prehension. arXiv preprint arXiv:1904.11544.
Kim, Taeuk, Jihun Choi, Daniel Edmiston & Sang-goo Lee. 2020. Are pre-trained
language models aware of phrases? Simple but strong baselines for grammar
induction. arXiv preprint arXiv:2002.00737.
Kirby, Simon, Tom Griffiths & Kenny Smith. 2014. Iterated learning and the evo-
lution of language. Current Opinion in Neurobiology 28. 108–114.
Kocijan, Vid, Ernest Davis, Thomas Lukasiewicz, Gary Marcus & Leora Mor-
genstern. 2022. The defeat of the Winograd schema challenge. arXiv preprint
arXiv:2201.02387.
Kocijan, Vid, Thomas Lukasiewicz, Ernest Davis, Gary Marcus & Leora Morgen-
stern. 2020. A review of Winograd schema challenge datasets and approaches.
arXiv preprint arXiv:2004.13831.
Koza, John R. 1994. Genetic programming as a means for programming comput-
ers by natural selection. Statistics and Computing 4. 87–112.
Kwiatkowski, Tom, Sharon Goldwater, Luke Zettlemoyer & Mark Steedman.
2012. A probabilistic model of syntactic and semantic acquisition from child-
directed utterances and their meanings. In Proceedings of the 13th Conference
of the European Chapter of the Association for Computational Linguistics, 234–
244.
Lake, Brenden M & Gregory L Murphy. 2021. Word meaning in minds and ma-
chines. Psychological Review.
Lake, Brenden M, Ruslan Salakhutdinov & Joshua B Tenenbaum. 2015. Human-
level concept learning through probabilistic program induction. Science
350(6266). 1332–1338.
Mahowald, Kyle, Isabelle Dautriche, Mika Braginsky & Ted Gibson. 2022. Effi-
cient communication and the organization of the lexicon. In The Oxford Hand-
book of the Mental Lexicon.
Mahowald, Kyle, Anna A Ivanova, Idan A Blank, Nancy Kanwisher, Joshua B Tenen-
baum & Evelina Fedorenko. 2023. Dissociating language and thought in large
language models: A cognitive perspective. arXiv preprint arXiv:2301.06627.
Manning, Christopher & Hinrich Schütze. 1999. Foundations of Statistical Natural
Language Processing. MIT press.
Manning, Christopher D, Kevin Clark, John Hewitt, Urvashi Khandelwal & Omer
Levy. 2020. Emergent linguistic structure in artificial neural networks trained
by self-supervision. Proceedings of the National Academy of Sciences 117(48).
30046–30054.
Marblestone, Adam H, Greg Wayne & Konrad P Kording. 2016. Toward an inte-
gration of deep learning and neuroscience. Frontiers in Computational Neuro-
science. 94.
Marvin, Rebecca & Tal Linzen. 2018. Targeted syntactic evaluation of language
models. arXiv preprint arXiv:1808.09031.
McCauley, Stewart M & Morten H Christiansen. 2019. Language learning as lan-
guage use: A cross-linguistic model of child language development. Psycholog-
ical Review 126(1). 1.
McClelland, James L, Felix Hill, Maja Rudolph, Jason Baldridge & Hinrich
Schütze. 2020. Placing language in an integrated understanding system: Next
steps toward human-level performance in neural language models. Proceedings
of the National Academy of Sciences 117(42). 25966–25974.
McClelland, James L & Karalyn Patterson. 2002. Rules or connections in past-
tense inflections: What does the evidence rule out? Trends in Cognitive Sciences
6(11). 465–472.
McClelland, James L, David E Rumelhart & PDP Research Group. 1986. Parallel
distributed processing. MIT Press.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado & Jeff Dean. 2013.
Distributed representations of words and phrases and their compositionality.
Advances in Neural Information Processing Systems 26.
Mollica, Francis, Geoff Bacon, Noga Zaslavsky, Yang Xu, Terry Regier & Charles
Kemp. 2021. The forms and meanings of grammatical markers support effi-
cient communication. Proceedings of the National Academy of Sciences 118(49).
e2025993118.
Mollica, Francis & Steven T Piantadosi. 2019. Humans store about 1.5 megabytes
of information during language acquisition. Royal Society Open Science 6(3).
181393.
Piantadosi, Steven T & Felix Hill. 2022. Meaning without reference in large lan-
guage models. arXiv preprint arXiv:2208.02957.
Piantadosi, Steven T & Robert A Jacobs. 2016. Four problems solved by the prob-
abilistic language of thought. Current Directions in Psychological Science 25(1).
54–59.
Piantadosi, Steven T, Harry Tily & Edward Gibson. 2012. The communicative
function of ambiguity in language. Cognition 122(3). 280–291.
Pinker, Steven & Alan Prince. 1988. On language and connectionism: Analysis
of a parallel distributed processing model of language acquisition. Cognition
28(1-2). 73–193.
Pires, Telmo, Eva Schlinger & Dan Garrette. 2019. How multilingual is multilin-
gual BERT? arXiv preprint arXiv:1906.01502.
Postal, Paul. 2012. Two case studies of Chomsky’s play acting at linguistics. Ling-
buzz.
Prince, Alan & Paul Smolensky. 1997. Optimality: From neural networks to uni-
versal grammar. Science 275(5306). 1604–1610.
Pullum, Geoffrey K. 1989. Formal linguistics meets the boojum. Natural Language
& Linguistic Theory. 137–143.
Pullum, Geoffrey K. 2007. The evolution of model-theoretic frameworks in lin-
guistics. Model-Theoretic Syntax at 10. 1–10.
Pullum, Geoffrey K. 2013. The central question in comparative syntactic metathe-
ory. Mind & Language 28(4). 492–521.
Pullum, Geoffrey K & Barbara C Scholz. 2002. Empirical assessment of stimulus
poverty arguments. The Linguistic Review 19(1-2). 9–50.
Pullum, Geoffrey K & Barbara C Scholz. 2009. For universals (but not finite-state
learning) visit the zoo. Behavioral and Brain Sciences 32(5). 466–467.
Quilty-Dunn, Jake, Nicolas Porot & Eric Mandelbaum. 2022. The best game in
town: The re-emergence of the language of thought hypothesis across the cog-
nitive sciences. Behavioral and Brain Sciences. 1–55.
Radford, Alec, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya
Sutskever, et al. 2019. Language models are unsupervised multitask learners.
OpenAI blog 1(8). 9.
Reali, Florencia & Morten H Christiansen. 2005. Uncovering the richness of the
stimulus: Structure dependence and indirect statistical evidence. Cognitive Sci-
ence 29(6). 1007–1028.
Reali, Florencia & Morten H Christiansen. 2009. On the necessity of an interdis-
ciplinary approach to language universals. Language Universals. 266–77.
Redington, Martin, Nick Chater & Steven Finch. 1998. Distributional information:
A powerful cue for acquiring syntactic categories. Cognitive Science 22(4). 425–
469.
Reed, Homer B. 1916. The existence and function of inner speech in thought pro-
cesses. Journal of Experimental Psychology 1(5). 365.
Richards, Blake A, Timothy P Lillicrap, Philippe Beaudoin, Yoshua Bengio, Rafal
Bogacz, Amelia Christensen, Claudia Clopath, Rui Ponte Costa, Archy de
Berker, Surya Ganguli, et al. 2019. A deep learning framework for neuroscience.
Nature Neuroscience 22(11). 1761–1770.
Roebuck, Hettie & Gary Lupyan. 2020. The internal representations question-
naire: Measuring modes of thinking. Behavior Research Methods 52. 2053–2070.
Rogers, Anna, Olga Kovaleva & Anna Rumshisky. 2021. A primer in BERTology:
What we know about how BERT works. Transactions of the Association for
Computational Linguistics 8. 842–866.
Rogers, Timothy T & James L McClelland. 2004. Semantic cognition: A parallel
distributed processing approach. MIT press.
Ross, John Robert. 1967. Constraints on variables in syntax. (Doctoral dissertation).
Rule, Joshua S, Joshua B Tenenbaum & Steven T Piantadosi. 2020. The Child as
Hacker. Trends in Cognitive Sciences 24(11). 900–915.
Rumelhart, David E & James L McClelland. 1986. On learning the past tenses of
English verbs. In Parallel Distributed Processing, volume 2.
Saffran, Jenny R, Richard N Aslin & Elissa L Newport. 1996. Statistical learning
by 8-month-old infants. Science 274(5294). 1926–1928.
Saffran, Jenny R, Elissa L Newport & Richard N Aslin. 1996. Word segmentation:
the role of distributional cues. Journal of Memory and Language 35(4). 606–621.
Schmidt, Michael & Hod Lipson. 2009. Distilling free-form natural laws from
experimental data. Science 324(5923). 81–85.
Schrimpf, Martin, Idan Asher Blank, Greta Tuckute, Carina Kauf, Eghbal A Hos-
seini, Nancy Kanwisher, Joshua B Tenenbaum & Evelina Fedorenko. 2021.
The neural architecture of language: Integrative modeling converges on pre-
dictive processing. Proceedings of the National Academy of Sciences 118(45).
e2105646118.
Schrimpf, Martin, Jonas Kubilius, Michael J Lee, N Apurva Ratan Murty, Robert
Ajemian & James J DiCarlo. 2020. Integrative benchmarking to advance neu-
rally mechanistic models of human intelligence. Neuron 108(3). 413–423.
Shain, Cory, Clara Meister, Tiago Pimentel, Ryan Cotterell & Roger Philip Levy.
2022. Large-scale evidence for logarithmic effects of word predictability on
reading time. PsyArXiv.